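A Look at the Improved OneHotEncoder (v1.1 and later)

With the recent scikit-learn version update, OneHotEncoder was revised,

and it looks like it addresses many of the pain points that usually come up when handling categorical features.

Previously you had to write fairly convoluted code to get this behavior, but now that is no longer necessary, which is nice.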

https://scikit-learn.org/stable/modules/preprocessing.html#one-hot-encoder-infrequent-categories


https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html


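categories
- auto
- list
  - Useful when you already have a predefined vocabulary: the encoder works only within the categories you pass in.
  - Even fitting a one-hot encoder can be a problem when the data is large, so this should help a lot in that situation.
drop
- first
  - Drops the first category of each feature, which is the same as dummy encoding.
- if_binary
  - Drops one category only for features that have exactly two categories.
sparse
- bool
  - Renamed to sparse_output in scikit-learn 1.2.
handle_unknown
- error
- ignore
  - When a category that was not seen during fit appears at transform time, its one-hot columns are all set to 0.
- infrequent_if_exist
  - An unseen category is mapped to the infrequent category if one exists; otherwise it is encoded as all zeros, as with ignore.
min_frequency
- int
  - Categories that appear fewer than min_frequency times are treated as infrequent.
- float
  - Categories that appear fewer than min_frequency * n_samples times are treated as infrequent.
max_categories
- Caps the number of output columns per feature; the remaining, less frequent categories are grouped together as infrequent.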

Code

For the code, you can refer to the scikit-learn website.

Ignoring unseen categories and dummy encoding

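from sklearn.preprocessing import OneHotEncoder

# Categories not seen during fit are encoded as all zeros at transform time
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)

enc.categories_
# [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]

# 4 was never seen during fit, so its group columns are all 0
enc.transform([['Female', 1], ['Male', 4]]).toarray()
# array([[1., 0., 1., 0., 0.],
#        [0., 1., 0., 0., 0.]])

# An all-zero block is decoded as None
enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
# array([['Male', 1],
#        [None, 2]], dtype=object)

enc.get_feature_names_out(['gender', 'group'])

# drop='first' removes the first category of each feature (dummy encoding)
drop_enc = OneHotEncoder(drop='first').fit(X)
drop_enc.categories_

drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
# array([[0., 0., 0.],
#        [1., 1., 0.]])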

When the categories are known in advance

from sklearn import preprocessing

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']
enc = preprocessing.OneHotEncoder(categories=[genders, locations, browsers])
# Note that some categories of the 2nd and 3rd features
# do not appear in X
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)
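
As a quick check of what passing categories explicitly buys you, the sketch below transforms a row whose location and browser never appear in the training data. It repeats the setup so that it runs on its own; the variable name encoded is just for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

genders = ['female', 'male']
locations = ['from Africa', 'from Asia', 'from Europe', 'from US']
browsers = ['uses Chrome', 'uses Firefox', 'uses IE', 'uses Safari']

enc = OneHotEncoder(categories=[genders, locations, browsers])
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)

# 'from Asia' and 'uses Chrome' are not in X, but they are in the supplied
# lists, so they still get their own columns (2 + 4 + 4 = 10 in total)
encoded = enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()

print(encoded.shape)
print(encoded)
```

Because the column layout comes from the supplied lists rather than from the data, the output width stays fixed no matter which categories happen to be present at fit time.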

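Handling categories based on frequency

from sklearn import preprocessing

# Neither min_frequency nor max_categories is set, so no infrequent category
# exists and unseen values are encoded as all zeros, as with 'ignore'
enc = preprocessing.OneHotEncoder(handle_unknown='infrequent_if_exist')
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)

enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
# array([[1., 0., 0., 0., 0., 0.]])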

It seems worth thinking ahead about the cases where this kind of handling will be needed, and using these options when they come up.
