numpy.unique, numpy.searchsorted

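Converting categories to integers

pandas offers cat.codes for this; the same result can be obtained in plain NumPy.

The resulting codes are integers in the range 0 to (number of categories - 1).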
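To make the comparison concrete, the sketch below encodes a toy array both with np.unique(..., return_inverse=True) and with pandas' cat.codes. The colors array is invented for the example. The two results agree here because pandas sorts the categories it infers, just as np.unique sorts its output.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this illustration.
colors = np.array(['red', 'green', 'blue', 'green', 'red'])

# NumPy: sorted unique values plus the index of each element in that sorted array.
uniques, codes_np = np.unique(colors, return_inverse=True)

# pandas: convert to a categorical Series and read the integer codes.
codes_pd = pd.Series(colors).astype('category').cat.codes.to_numpy()

print(uniques)    # ['blue' 'green' 'red']
print(codes_np)   # [2 1 0 1 2]
print(codes_pd)   # [2 1 0 1 2]
```

Both sets of codes run from 0 to len(uniques) - 1.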

import numpy as np
from itertools import combinations

# All 91 two-letter combinations of the letters a..n: 'ab', 'ac', ...
possible_categories = [a + b for a, b in combinations('abcdefghijklmn', 2)]
categories = np.random.choice(possible_categories, size=10000)
print(categories)

['al' 'kl' 'jk' ... 'jm' 'bm' 'hj']

unique_categories, new_categories = np.unique(categories, return_inverse=True)
print(unique_categories)
print(new_categories)

['ab' 'ac' 'ad' 'ae' 'af' 'ag' 'ah' 'ai' 'aj' 'ak' 'al' 'am' 'an' 'bc'
 'bd' 'be' 'bf' 'bg' 'bh' 'bi' 'bj' 'bk' 'bl' 'bm' 'bn' 'cd' 'ce' 'cf'
 'cg' 'ch' 'ci' 'cj' 'ck' 'cl' 'cm' 'cn' 'de' 'df' 'dg' 'dh' 'di' 'dj'
 'dk' 'dl' 'dm' 'dn' 'ef' 'eg' 'eh' 'ei' 'ej' 'ek' 'el' 'em' 'en' 'fg'
 'fh' 'fi' 'fj' 'fk' 'fl' 'fm' 'fn' 'gh' 'gi' 'gj' 'gk' 'gl' 'gm' 'gn'
 'hi' 'hj' 'hk' 'hl' 'hm' 'hn' 'ij' 'ik' 'il' 'im' 'in' 'jk' 'jl' 'jm'
 'jn' 'kl' 'km' 'kn' 'lm' 'ln' 'mn']
[10 85 81 ... 83 23 71]
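A useful property of return_inverse=True is that indexing the unique values with the codes reconstructs the original array. A minimal check on a toy array (the data is invented for illustration):

```python
import numpy as np

labels = np.array(['kl', 'ab', 'jk', 'ab', 'kl'])
uniques, codes = np.unique(labels, return_inverse=True)

# Indexing the sorted unique values with the codes gives back the input.
restored = uniques[codes]
print(codes)                              # [2 0 1 0 2]
print(np.array_equal(restored, labels))   # True
```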

Mapping new categories

categories2 = np.random.choice(possible_categories, size=10000)
new_categories2 = np.searchsorted(unique_categories, categories2)
print(categories2)
print(unique_categories[new_categories2])

# ['gn' 'hn' 'be' ..., 'il' 'bk' 'bl']
# ['gn' 'hn' 'be' ..., 'il' 'bk' 'bl']
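One caveat worth keeping in mind: np.searchsorted returns an insertion position, not a membership test, so the round trip only works when every value already exists in the lookup. The sketch below uses a made-up three-element lookup to show what happens with unseen values.

```python
import numpy as np

lookup = np.array(['ab', 'ac', 'ad'])          # sorted, as np.unique would return
values = np.array(['ac', 'ab', 'zz', 'aa'])    # 'zz' and 'aa' are not in the lookup

positions = np.searchsorted(lookup, values)
print(positions)   # [1 0 3 0]

# 'zz' lands at index 3, which is out of range for `lookup`;
# 'aa' lands at index 0 and would silently be read back as 'ab'.
in_range = positions < len(lookup)
print(lookup[positions[in_range]])   # ['ac' 'ab' 'ab']
```

This is why a separate membership check is needed before trusting the codes.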

Checking values present in the lookup

A Python set would also work, but this approach makes the check faster.

# np.isin is the current name for this check; np.in1d is deprecated in recent NumPy releases
np.isin(['ab', 'ac', 'something new'], unique_categories)
## array([ True,  True, False])
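For comparison, here is the same membership check written with a Python set next to the vectorized version. It uses np.isin, the newer spelling of np.in1d, on invented data. Relative speed depends on array sizes and dtypes, so it is worth timing both on real data. The invert=True argument of np.isin is handy for listing the values that are missing from the lookup.

```python
import numpy as np

lookup = np.array(['ab', 'ac', 'ad'])
incoming = np.array(['ab', 'ac', 'something new'])

# Pure-Python version: one hash lookup per element.
lookup_set = set(lookup.tolist())
mask_set = np.array([value in lookup_set for value in incoming.tolist()])

# Vectorized version: a single NumPy call over the whole array.
mask_np = np.isin(incoming, lookup)

print(mask_set)   # [ True  True False]
print(mask_np)    # [ True  True False]

# invert=True flips the mask, which selects the unseen values directly.
unseen = incoming[np.isin(incoming, lookup, invert=True)]
print(unseen)     # ['something new']
```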

Handling missing categories

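Categories that show up in the test data but were not present in the training data are very tricky to handle.

Either assign them at random to one of the existing categories,
or arrange for any new value to be mapped to a dedicated "unknown" code.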
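The first option, assigning unseen values to a random existing category, could look roughly like the sketch below. The helper name map_with_random_fallback and its seed parameter are hypothetical, introduced only for this illustration.

```python
import numpy as np

def map_with_random_fallback(lookup, values, seed=0):
    """Hypothetical helper: index into `lookup` for known values, a random valid index for unseen ones."""
    values = np.asarray(values)
    rng = np.random.default_rng(seed)
    codes = np.searchsorted(lookup, values)
    unseen = ~np.isin(values, lookup)
    # Overwrite the (meaningless) insertion positions of unseen values.
    codes[unseen] = rng.integers(0, len(lookup), size=unseen.sum())
    return codes

lookup = np.array(['ab', 'ac', 'ad'])   # sorted, as returned by np.unique
codes = map_with_random_fallback(lookup, ['ac', 'zz', 'ab'])
print(codes[[0, 2]])                    # [1 0]
print(0 <= codes[1] < len(lookup))      # True
```

Random assignment keeps every code valid, but it hides the fact that a value was unseen. Reserving a dedicated code keeps that information available downstream.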

class CategoryMapper:
    def fit(self, categories):
        # Sorted unique categories seen during fitting
        self.lookup = np.unique(categories)
        return self

    def transform(self, categories):
        """Converts categories to numbers, 0 is reserved for new values (not present in fitted data)"""
        # searchsorted gives the position in the lookup (shifted by 1);
        # multiplying by the membership mask zeroes out unseen values
        return (np.searchsorted(self.lookup, categories) + 1) * np.isin(categories, self.lookup)


print(categories)
CategoryMapper().fit(categories).transform([unique_categories[0], 'abd'])

['al' 'kl' 'jk' ... 'jm' 'bm' 'hj']
## the first value is present, so it maps to 1; the second is not, so it maps to 0
array([1, 0])
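Because code 0 is reserved, the codes can also be mapped back to labels by prepending a placeholder to the lookup. The sketch below repeats the mapper so that it runs on its own. The inverse_transform method and the "<unknown>" placeholder are hypothetical additions for illustration, not part of the original class.

```python
import numpy as np

class CategoryMapper:
    def fit(self, categories):
        self.lookup = np.unique(categories)
        return self

    def transform(self, categories):
        categories = np.asarray(categories)
        return (np.searchsorted(self.lookup, categories) + 1) * np.isin(categories, self.lookup)

    def inverse_transform(self, codes, unknown="<unknown>"):
        # Hypothetical addition: slot 0 holds the placeholder, slots 1..n the fitted categories.
        labels = np.concatenate(([unknown], self.lookup))
        return labels[np.asarray(codes)]

mapper = CategoryMapper().fit(['red', 'green', 'blue', 'green'])
codes = mapper.transform(['green', 'purple', 'blue'])
print(codes)                            # [2 0 1]
print(mapper.inverse_transform(codes))  # ['green' '<unknown>' 'blue']
```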

Reference:

http://arogozhnikov.github.io/2015/09/29/NumpyTipsAndTricks1.html

Data manipulation with numpy: tips and tricks, part 1

