[sklearn] Running Regression Variable Selection in Parallel with Ray

2020. 8. 20. 00:49 | Analysis Python/Ray

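I noticed that ray-project includes a package called tune-sklearn.

This package provides replacements for scikit-learn's hyperparameter search classes that use Ray to evaluate scikit-learn models in parallel.

I want to get better at using Ray, so I gave it a try.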

 

In this post I apply the variable selection methods that scikit-learn provides to a regression model and share how I searched for the best set of variables with a GridSearchCV-style grid search.

 

Installing the packages

pip install tune-sklearn ray[tune]

 

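import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import (
    VarianceThreshold,
    SelectFromModel,
    SelectKBest,
    GenericUnivariateSelect,
    SelectPercentile,
    f_regression,
    mutual_info_regression,
)

# NOTE: load_boston was removed in scikit-learn 1.2; this code targets a 2020 release.
from sklearn.datasets import load_boston
data = load_boston(return_X_y=False)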

To make training go more smoothly, only the target was preprocessed, using a log(y + 1) transform.

# Target distribution before (left) and after (right) the log(y + 1) transform.
fig, ax = plt.subplots(1, 2)
axes = ax.flatten()
axes[0].hist(data.target)
axes[1].hist(np.log(data.target + 1))
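
Because the model is trained on log(y + 1), its predictions come out on the log scale as well. A minimal sketch of the round trip, using made-up target values rather than the Boston data: `np.log1p` computes the same transform and `np.expm1` undoes it.

```python
import numpy as np

# Hypothetical target values, for illustration only.
y_raw = np.array([5.0, 21.2, 50.0])

y_log = np.log1p(y_raw)    # same transform as np.log(y_raw + 1)
y_back = np.expm1(y_log)   # inverse transform, back to the original scale

print(np.allclose(y_log, np.log(y_raw + 1)))
print(np.allclose(y_back, y_raw))
```

Applying `np.expm1` to a model's predictions would put them back in the original units of the target.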

 

X, y = data.data, np.log(data.target + 1)
import seaborn as sns
from sklearn.svm import LinearSVR
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor  # needed for the model grid further down
from tune_sklearn import TuneSearchCV
from tune_sklearn import TuneGridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.decomposition import PCA, NMF
from sklearn.pipeline import Pipeline
## https://github.com/ray-project/tune-sklearn/tree/master/examples

def score_plot(scores, labels=None, threshold=None):
    """Bar-plot one score per feature, with an optional horizontal threshold line."""
    positions = list(range(len(scores)))
    plt.bar(positions, scores)
    if labels is not None:
        plt.xticks(positions, labels=labels, rotation=90)
    if threshold is not None:
        plt.axhline(y=threshold, color='r', linestyle='-')
    plt.show()


def comparision_of_real_n_pred(y, y_pred):
    """Scatter the true target against predictions; red points mark the y = y_pred diagonal."""
    plt.scatter(y_pred, y)
    plt.scatter(y, y, color="red")
    plt.xlabel("y_pred")
    plt.ylabel("y")
    plt.show()

First, let's try the selectors that scikit-learn provides one at a time.

SelectKBest , mutual_info_regression

feature_selection_2 =  SelectKBest(mutual_info_regression, k=10)
feature_selection_2.fit(X,y)
select_col = list(data.feature_names[feature_selection_2.get_support()])
print(select_col)
# ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT']
score_plot(feature_selection_2.scores_)

SelectKBest , f_regression

feature_selection_2 =  SelectKBest(f_regression, k=10)
feature_selection_2.fit(X,y)
select_col = list(data.feature_names[feature_selection_2.get_support()])
print(select_col)
## ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
score_plot(feature_selection_2.scores_)
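
The two score functions measure different things: `f_regression` scores linear association, so a feature that affects the target only through its square tends to get a low F-score, whereas `mutual_info_regression` can still detect it. The sketch below illustrates this on synthetic data; the data and variable names are made up for the example and are not from the Boston set.

```python
import numpy as np
from functools import partial
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

rng = np.random.default_rng(0)
n = 500
x_linear = rng.normal(size=n)        # linearly related to the target
x_squared = rng.uniform(-1, 1, n)    # related to the target only through its square
x_noise = rng.normal(size=(n, 2))    # unrelated to the target
X_demo = np.column_stack([x_linear, x_squared, x_noise])
y_demo = x_linear + 4 * x_squared**2 + rng.normal(scale=0.1, size=n)

f_sel = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)
# mutual_info_regression is randomized, so fix its seed for reproducibility.
mi_sel = SelectKBest(partial(mutual_info_regression, random_state=0), k=2).fit(X_demo, y_demo)

print("F-scores:          ", np.round(f_sel.scores_, 2))
print("Mutual information:", np.round(mi_sel.scores_, 2))
print("Kept by MI:        ", mi_sel.get_support())
```

This is one reason the two selectors in this post return slightly different column lists.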

SelectFromModel , linear_model.BayesianRidge using pipeline

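# NOTE: written for a 2020 scikit-learn release; newer releases removed `normalize` (1.2)
# and renamed BayesianRidge's `n_iter` to `max_iter` (1.3).
model = MLPRegressor(hidden_layer_sizes=(40, 20, 10),
                     max_iter=10000, early_stopping=True,
                     learning_rate_init=0.0001)
clf = Pipeline([
    # Keep only the features whose BayesianRidge coefficients pass the selection threshold.
    ('feature_selection', SelectFromModel(linear_model.BayesianRidge(n_iter=1000,
                                                                     alpha_1=1e-3,
                                                                     alpha_2=1e-3,
                                                                     normalize=True))),
    ('regressor', model)])
_ = clf.fit(X, y)
# Predictions are made on the training data, so the plot shows in-sample fit only.
y_pred = clf.predict(X)
comparision_of_real_n_pred(y, y_pred)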

As shown above, plain scikit-learn lets you select variables either with a fitted model or with a statistic such as mutual information.
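
It may help to see what `SelectFromModel` does with the fitted model. In this sketch on synthetic data (made up for illustration), the selector keeps the features whose absolute coefficient is at least the threshold, which for a model such as `BayesianRidge` defaults to the mean of those absolute coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import BayesianRidge

# Synthetic data: 8 features, only 3 of which drive the target.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3,
                                 noise=1.0, random_state=0)

selector = SelectFromModel(BayesianRidge()).fit(X_demo, y_demo)

importances = np.abs(selector.estimator_.coef_)  # importance used for linear models
mask = selector.get_support()                    # boolean mask of the kept features

print("threshold:", selector.threshold_)
print("kept features:", np.flatnonzero(mask))
print("matches manual rule:", np.array_equal(mask, importances >= selector.threshold_))
```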

Next, I use tune-sklearn to evaluate many such combinations in parallel.

For the model being fit, I used scikit-learn's MLPRegressor.

I decided to try 94 combinations in total.

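model = MLPRegressor(hidden_layer_sizes=(40, 20, 10),
                     max_iter=10000, early_stopping=True,
                     learning_rate_init=0.0001)
# 'reduce_dim' is a placeholder step; each grid below swaps in a concrete selector or reducer.
pipe = Pipeline([
    ('reduce_dim', "passthrough"),
    ('regressor', model)])
## SelectKBest(method_, k=k)
N_FEATURES_OPTIONS = list(range(5, 13, 2))  # [5, 7, 9, 11]
activation_OPTIONS = ["relu", "tanh"]
score_func = [mutual_info_regression, f_regression]
# NOTE: SelectPercentile reads `percentile` on a 0-100 scale, so these values ask for
# fewer than 1% of the features; list(range(50, 100, 10)) would mean 50%-90%.
Percentile_FEATURES_OPTIONS = list(np.arange(0.5, 1, 0.1))
max_features_OPTIONS = ["sqrt", "log2"]
max_depth_OPTIONS = [5, 7, 9]
criterion_OPTIONS = ["mae", "mse"]  # named "absolute_error" / "squared_error" since scikit-learn 1.0
import itertools
# One RandomForestRegressor per (max_features, max_depth, criterion) combination: 2 * 3 * 2 = 12.
c = list(itertools.product(max_features_OPTIONS, max_depth_OPTIONS, criterion_OPTIONS))
models = [RandomForestRegressor(max_depth=depth, max_features=feature, criterion=criterion)
          for feature, depth, criterion in c]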
param_grid = [
    {
        "reduce_dim": [PCA(iterated_power=7), NMF()],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "regressor__activation": activation_OPTIONS
    },
    {
        "reduce_dim": [SelectKBest()],
        "reduce_dim__score_func" : score_func,
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "regressor__activation": activation_OPTIONS
    },
    {
        "reduce_dim": [SelectPercentile()],
        "reduce_dim__score_func" : score_func,
        "reduce_dim__percentile": Percentile_FEATURES_OPTIONS,
        "regressor__activation": activation_OPTIONS
    },
    {
        "reduce_dim": [SelectFromModel(linear_model.BayesianRidge(n_iter=1000,
                                                                  alpha_2 = 1e-3,
                                                                  normalize=True))],
#         "reduce_dim___estimator__alpha_1": alpha_1_OPTIONS,
        "regressor__activation": activation_OPTIONS
    },
    {
        "reduce_dim": [SelectFromModel(estimator=RandomForestRegressor(),)],
        "reduce_dim__estimator": models,
        "regressor__activation": activation_OPTIONS
    }
]
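
In scikit-learn's GridSearchCV, a list of dictionaries like this is searched as the union of the separate grids, so the number of candidates is the sum, over the dictionaries, of the product of their list lengths; TuneGridSearchCV is assumed here to follow the same convention. A quick way to check a total is scikit-learn's `ParameterGrid`. The toy grid below uses placeholder strings instead of real estimators and is only an illustration.

```python
from sklearn.model_selection import ParameterGrid

# Toy grid with placeholder values (hypothetical, not the grid from this post).
toy_grid = [
    {"reduce_dim": ["pca", "nmf"],
     "reduce_dim__n_components": [5, 7, 9, 11],
     "regressor__activation": ["relu", "tanh"]},   # 2 * 4 * 2 = 16 candidates
    {"reduce_dim": ["bayesian_ridge_selector"],
     "regressor__activation": ["relu", "tanh"]},   # 1 * 2 = 2 candidates
]

candidates = list(ParameterGrid(toy_grid))
print(len(candidates))   # 16 + 2 = 18
print(candidates[0])
```

Calling `len(ParameterGrid(param_grid))` on the grid defined above gives the number of pipelines that will be fitted per cross-validation fold.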


grid = TuneGridSearchCV(pipe, 
                        param_grid=param_grid,
                        cv=3, use_gpu=False
                       )
_ = grid.fit(X, y)
grid.best_params_
{'reduce_dim': SelectFromModel(estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                                 criterion='mse', max_depth=9,
                                                 max_features='sqrt',
                                                 max_leaf_nodes=None,
                                                 max_samples=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=100, n_jobs=None,
                                                 oob_score=False,
                                                 random_state=None, verbose=0,
                                                 warm_start=False),
                 max_features=None, norm_order=1, prefit=False, threshold=None),
 'reduce_dim__estimator': RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                       max_depth=9, max_features='sqrt', max_leaf_nodes=None,
                       max_samples=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=100, n_jobs=None, oob_score=False,
                       random_state=None, verbose=0, warm_start=False),
 'regressor__activation': 'tanh'}
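# Step [-2] of the best pipeline is the fitted 'reduce_dim' selector.
data.feature_names[grid.best_estimator_[-2].get_support()]

## array(['CRIM', 'NOX', 'RM', 'PTRATIO', 'LSTAT'], dtype='<U7')
# Feature importances of the selector's random forest, with its selection threshold in red.
score_plot(grid.best_estimator_[-2].estimator_.feature_importances_,
           labels=data.feature_names,
           threshold=grid.best_estimator_[-2].threshold_)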

y_pred = grid.best_estimator_.predict(X)
comparision_of_real_n_pred(y,y_pred)
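
The core trick in the search above is treating a whole pipeline step as a hyperparameter. Here is a small sketch of the same pattern that runs without Ray. It swaps in scikit-learn's own `GridSearchCV` for `TuneGridSearchCV`, a `Ridge` regressor for the MLP, and synthetic data for the Boston set; all three are substitutions made for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=4,
                                 noise=5.0, random_state=0)

# 'reduce_dim' starts as a no-op; each grid entry replaces it with a real step.
pipe = Pipeline([("reduce_dim", "passthrough"), ("regressor", Ridge())])

param_grid = [
    {"reduce_dim": [PCA()], "reduce_dim__n_components": [2, 4, 6]},
    {"reduce_dim": [SelectKBest(f_regression)], "reduce_dim__k": [2, 4, 6]},
]

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X_demo, y_demo)

print(len(search.cv_results_["params"]))   # number of candidates tried
print(search.best_params_)
# The winning step can be read back by name from the refitted pipeline.
print(type(search.best_estimator_.named_steps["reduce_dim"]).__name__)
```

Reading the step back through `named_steps["reduce_dim"]` is equivalent to the positional `[-2]` indexing used above, and stays correct if more steps are added to the pipeline later.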

In this way, I built 94 candidate models and used tune-sklearn together with scikit-learn to run variable selection followed by modeling.

 

It was nice to be able to try many variable selection methods so quickly, and there seem to be plenty of other packages to explore, which makes experimenting fun.

 

 

https://github.com/ray-project/tune-sklearn

 
