numpy.unique, numpy.searchsorted

2019. 5. 26. 22:12 | Analysis Python / Numpy Tip


Converting categories to integers!

pandas provides cat.codes for this.

Range of the codes: (0, number of categories - 1)
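As a small illustration of cat.codes and the code range mentioned above, here is a minimal sketch. The `colors` series is made-up example data, not from the original post, and the sketch assumes pandas is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not from the original post)
colors = pd.Series(["red", "green", "blue", "green", "red"], dtype="category")

# Categories inferred from strings are sorted; codes index into them
categories = list(colors.cat.categories)
codes = colors.cat.codes.to_numpy()
print(categories)  # ['blue', 'green', 'red']
print(codes)       # [2 1 0 1 2]

# np.unique with return_inverse=True gives the same integer codes
_, inverse = np.unique(colors.to_numpy(), return_inverse=True)
print(inverse)     # [2 1 0 1 2]
```

The codes run from 0 to the number of categories minus 1, which is the same range that np.unique produces below.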

import numpy as np
from itertools import combinations

# all 2-letter combinations of 'a'..'n' (91 categories)
possible_categories = [a + b for a, b in combinations('abcdefghijklmn', 2)]
categories = np.random.choice(possible_categories, size=10000)
print(categories)

['al' 'kl' 'jk' ... 'jm' 'bm' 'hj']

unique_categories, new_categories = np.unique(categories, return_inverse=True)
print(unique_categories)
print(new_categories)

['ab' 'ac' 'ad' 'ae' 'af' 'ag' 'ah' 'ai' 'aj' 'ak' 'al' 'am' 'an' 'bc'
 'bd' 'be' 'bf' 'bg' 'bh' 'bi' 'bj' 'bk' 'bl' 'bm' 'bn' 'cd' 'ce' 'cf'
 'cg' 'ch' 'ci' 'cj' 'ck' 'cl' 'cm' 'cn' 'de' 'df' 'dg' 'dh' 'di' 'dj'
 'dk' 'dl' 'dm' 'dn' 'ef' 'eg' 'eh' 'ei' 'ej' 'ek' 'el' 'em' 'en' 'fg'
 'fh' 'fi' 'fj' 'fk' 'fl' 'fm' 'fn' 'gh' 'gi' 'gj' 'gk' 'gl' 'gm' 'gn'
 'hi' 'hj' 'hk' 'hl' 'hm' 'hn' 'ij' 'ik' 'il' 'im' 'in' 'jk' 'jl' 'jm'
 'jn' 'kl' 'km' 'kn' 'lm' 'ln' 'mn']
[10 85 81 ... 83 23 71]
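Because the output above comes from random data, a tiny deterministic sketch may make the two return values easier to read. The `labels` array is made-up example data.

```python
import numpy as np

# Hypothetical example data (not from the original post)
labels = np.array(["cat", "dog", "bird", "dog", "cat"])

# unique_labels is sorted; codes[i] is the position of labels[i] in unique_labels
unique_labels, codes = np.unique(labels, return_inverse=True)
print(unique_labels)  # ['bird' 'cat' 'dog']
print(codes)          # [1 2 0 2 1]

# Indexing the unique values with the codes restores the original array
restored = unique_labels[codes]
print((restored == labels).all())  # True
```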

Mapping new categories

categories2 = np.random.choice(possible_categories, size=10000)
new_categories2 = np.searchsorted(unique_categories, categories2)
print(categories2)
print(unique_categories[new_categories2])

# ['gn' 'hn' 'be' ..., 'il' 'bk' 'bl']
# ['gn' 'hn' 'be' ..., 'il' 'bk' 'bl']
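The round trip above works because every value in categories2 already exists in unique_categories. The sketch below, using a made-up `lookup` and `queries`, shows what np.searchsorted returns for values that are not in the lookup: it gives an insertion position, not an error, so unseen values need a separate check.

```python
import numpy as np

# Hypothetical sorted lookup, in the form np.unique would return
lookup = np.array(["ab", "ac", "bc"])
queries = np.array(["bc", "ab", "ad", "zz"])

# For unseen values searchsorted returns where they WOULD be inserted
positions = np.searchsorted(lookup, queries)
print(positions)  # [2 0 2 3]  ('ad' collides with 'bc'; 'zz' is out of range)

# A position is only a valid code if the lookup really holds that value
in_range = positions < len(lookup)
known = np.zeros(len(queries), dtype=bool)
known[in_range] = lookup[positions[in_range]] == queries[in_range]
print(known)  # [ True  True False False]
```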


Checking values present in the lookup

A set would also work, but this approach makes the check faster!

 

np.isin(['ab', 'ac', 'something new'], unique_categories)  # np.in1d in older NumPy (deprecated since NumPy 2.0)
## array([ True,  True, False])
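For comparison, here is a minimal sketch of the same membership check done with np.isin and with a plain Python set. The `lookup` and `queries` arrays are made-up example data, and the sketch makes no claim about timing; it only shows that both approaches agree.

```python
import numpy as np

# Hypothetical example data (not from the original post)
lookup = np.array(["ab", "ac", "bc"])
queries = np.array(["ab", "ac", "something new"])

# Vectorized membership test: one boolean per query
mask = np.isin(queries, lookup)
print(mask)  # [ True  True False]

# Equivalent check with a Python set, element by element
lookup_set = set(lookup.tolist())
mask_from_set = np.array([q in lookup_set for q in queries.tolist()])
print(mask_from_set)  # [ True  True False]
```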

Handling missing categories

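Categories that appear in the test data but not in the train data are quite tricky to handle.

  • Either assign one of the existing categories at random,
  • or map any new value that comes in to an Unknown category.
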
class CategoryMapper:
    def fit(self, categories):
        # sorted unique categories seen during fitting
        self.lookup = np.unique(categories)
        return self

    def transform(self, categories):
        """Converts categories to numbers, 0 is reserved for new values (not present in fitted data)"""
        # np.isin replaces np.in1d, which is deprecated since NumPy 2.0
        return (np.searchsorted(self.lookup, categories) + 1) * np.isin(categories, self.lookup)


print(categories)
CategoryMapper().fit(categories).transform([unique_categories[0], 'abd'])

['al' 'kl' 'jk' ... 'jm' 'bm' 'hj']
## the first value exists in the lookup, so it maps to 1; the second does not, so it maps to 0
array([1, 0])
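CategoryMapper implements the Unknown option. The other option from the list, assigning a random existing category to unseen values, could be sketched as follows. The function name `transform_with_random_fallback`, its arguments, and the example data are hypothetical and not part of the original post.

```python
import numpy as np

def transform_with_random_fallback(lookup, categories, rng):
    """Map known categories to their codes; give unseen ones a random existing code."""
    categories = np.asarray(categories)
    codes = np.searchsorted(lookup, categories)
    unseen = ~np.isin(categories, lookup)
    # Overwrite the (meaningless) insertion positions of unseen values
    codes[unseen] = rng.integers(0, len(lookup), size=unseen.sum())
    return codes

# Hypothetical example data
lookup = np.unique(["ab", "ac", "bc", "ab"])  # ['ab' 'ac' 'bc']
rng = np.random.default_rng(0)
codes = transform_with_random_fallback(lookup, ["ac", "zz", "ab", "aa"], rng)
print(codes)  # known values give 1 and 0; the two unseen values get random codes in 0..2
```

Unlike the Unknown approach, this keeps the codes in the range 0 to number of categories - 1, at the cost of mixing noise into the existing categories.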

 

 

Reference:

http://arogozhnikov.github.io/2015/09/29/NumpyTipsAndTricks1.html

 

Data manipulation with numpy: tips and tricks, part 1
