[Python] A look at the Pandas idxmin, idxmax, and pd.cut functions

데이터분석뉴비 2019. 10. 29. 19:20

Creating the dataset

from sklearn.datasets import load_iris
import numpy as np
import pandas as pd

Iris = load_iris()
concat = np.concatenate( (Iris.data , np.array(Iris.target).reshape(-1,1)) , axis = 1)
data = pd.DataFrame(concat , columns = Iris.feature_names + ["Species"])

idxmin(), idxmax()

Sometimes you want to find the rows that hold the smallest or the largest value of a column.

The usual code for that looks like this.

## min
data[data['sepal length (cm)'] == data['sepal length (cm)'].min()]

## max
data[data['sepal length (cm)'] == data['sepal length (cm)'].max()]

While searching around I came across the following, so I am sharing it. It is much more concise!

## min
data.loc[data['sepal length (cm)'].idxmin(),:]

## max
data.loc[data['sepal length (cm)'].idxmax(),:]
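The two approaches above are not fully interchangeable when the extreme value is tied. The following minimal sketch uses a small made-up DataFrame (the `name` and `score` columns are hypothetical, not part of the iris data) to show that the boolean mask keeps every tied row, while idxmin() returns only the first matching index label.

```python
import pandas as pd

# Hypothetical toy data: the minimum score (1) appears twice.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [3, 1, 1, 5]})

# Boolean mask: a DataFrame containing every row equal to the minimum.
mask_rows = df[df["score"] == df["score"].min()]

# idxmin(): the index label of the first minimum only; .loc returns a Series.
first_row = df.loc[df["score"].idxmin()]

# Wrapping the label in a list keeps the result as a one-row DataFrame.
first_row_df = df.loc[[df["score"].idxmin()]]

print(mask_rows)
print(first_row)
print(first_row_df)
```

If all tied rows matter, the boolean mask is probably the safer choice; if any single row is enough, idxmin() and idxmax() are shorter.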

pd.cut

Use it when you want to bin continuous values into categories.

Whether the edges of each interval are included is controlled by the following arguments:

include_lowest / right

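To include only the right edge of each interval (note that the value 0 then falls outside the first interval (0, 24] and becomes NaN)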

pd.cut(pd.Series(range(101)),
       [0, 24, 49, 74, 100] , 
       include_lowest= False ,
       right=True
      )
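As a quick check of the right-closed case, here is a small sketch on a handful of hand-picked values (the `values` Series is an assumption chosen to sit on the bin edges). It compares include_lowest=False with include_lowest=True.

```python
import pandas as pd

# Hand-picked values sitting on or next to the bin edges.
values = pd.Series([0, 1, 24, 25, 100])
bins = [0, 24, 49, 74, 100]

# Right-closed intervals: (0, 24], (24, 49], (49, 74], (74, 100]
right_only = pd.cut(values, bins, right=True, include_lowest=False)

# Same bins, but the first interval is widened so that 0 is included.
with_lowest = pd.cut(values, bins, right=True, include_lowest=True)

# A code of -1 marks a value that fell outside every interval (NaN).
print(right_only.cat.codes.tolist())
print(with_lowest.cat.codes.tolist())
```

With include_lowest=False only the value 0 is left unassigned; with include_lowest=True every value gets a bin.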

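To include only the left edge of each interval (note that the value 100 then falls outside the last interval [74, 100) and becomes NaN)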

pd.cut(pd.Series(range(101)),
       [0, 24, 49, 74, 100] , 
       right= False , 
       include_lowest= True)
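The left-closed case can be checked the same way. In this sketch the `values` Series and the extended edge 101 are assumptions for illustration: with right=False the last interval is [74, 100), so 100 is left out unless the last edge is pushed past the maximum.

```python
import pandas as pd

# Hand-picked values sitting on or next to the bin edges.
values = pd.Series([0, 23, 24, 99, 100])
bins = [0, 24, 49, 74, 100]

# Left-closed intervals: [0, 24), [24, 49), [49, 74), [74, 100)
left_only = pd.cut(values, bins, right=False, include_lowest=True)

# One possible workaround: move the last edge past the maximum value.
extended = pd.cut(values, [0, 24, 49, 74, 101], right=False)

# A code of -1 marks a value that fell outside every interval (NaN).
print(left_only.cat.codes.tolist())
print(extended.cat.codes.tolist())
```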

To simply split the values into 5 equal-width bins

pd.cut(data['sepal length (cm)'], 5)
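When an integer is passed as the bins, it can be useful to see which edges pandas actually computed. A minimal sketch, assuming a small made-up Series `s` instead of the iris column, using the retbins=True option:

```python
import pandas as pd

# Made-up values spread evenly between 0 and 10.
s = pd.Series([0.0, 2.5, 5.0, 7.5, 10.0])

# retbins=True also returns the computed bin edges as an array.
binned, edges = pd.cut(s, 5, retbins=True)

# 5 bins means 6 edges; the lowest edge is nudged slightly below the
# minimum so that the minimum itself lands in the first bin.
print(edges)
print(binned.value_counts(sort=False))
```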

To assign custom labels to the bins

pd.cut(data['sepal length (cm)'],
       bins = 5, 
       include_lowest= True , 
       labels = ["Group {}".format(i) for i in np.arange(1,6)])
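A small sketch of the labelled version, assuming a made-up Series `s` with values in the range of the sepal length column. The number of labels has to match the number of bins (5 here); otherwise pd.cut raises a ValueError.

```python
import pandas as pd

# Made-up values; the range 4.3 to 7.9 is split into 5 equal-width bins.
s = pd.Series([4.3, 5.0, 5.8, 6.4, 7.9])
labels = ["Group {}".format(i) for i in range(1, 6)]

grouped = pd.cut(s, bins=5, include_lowest=True, labels=labels)

# The result is an ordered categorical whose categories are the labels.
print(grouped.tolist())
print(grouped.cat.categories.tolist())
```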

To split without labels (labels=False returns the integer bin number instead of an interval)

pd.cut(data['sepal length (cm)'], 5 , labels=False)
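With labels=False the result holds plain bin numbers starting at 0, which is convenient as a grouping key. A minimal sketch, again on a made-up Series `s` rather than the iris column:

```python
import pandas as pd

# Made-up values; the range 4.3 to 7.9 is split into 5 equal-width bins.
s = pd.Series([4.3, 5.0, 5.8, 6.4, 7.9])

# Integer bin numbers (0 to 4) instead of Interval objects.
codes = pd.cut(s, 5, labels=False)
print(codes.tolist())

# The codes can be used directly as a groupby key.
counts = s.groupby(codes).count()
print(counts.to_dict())
```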