[ Python ] How to Find a Suitable Neural Network Architecture and Hyperparameters

2019. 9. 8. 21:08 · ML (Machine Learning)/Optimization

There are several options for finding hyperparameters.

1. Hand Tuning or Manual Search

Trying configurations one at a time to find the right architecture is very tedious.

Still, some experience and a careful analysis of the early results can help.

2. Grid Search

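You define a range of candidate values for each parameter you want to optimize and run every combination.

Because this method tries everything, the run time grows quickly as the number of parameters increases.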

3. Random Search

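Random Search picks combinations at random from all the possible ones, so when it draws from the same grid, the combinations it tries are a subset of what Grid Search would try.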
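The relationship between the two can be sketched in a few lines. This is only an illustration using the standard library: the grid values are the `search_params` used for the grid search in this post, while the helper names and the budget of 10 trials are my own assumptions.

```python
import itertools
import random

# Same grid as the search_params used for the grid search in this post.
search_params = {
    "batch_size": [20, 30, 40],
    "time_steps": [30, 60, 90],
    "lr": [0.01, 0.001, 0.0001],
    "epochs": [30, 50, 70],
}

def grid_combinations(params):
    """Every combination as a dict: what Grid Search would try."""
    names = list(params)
    return [dict(zip(names, values))
            for values in itertools.product(*(params[n] for n in names))]

def random_combinations(params, n_trials, seed=0):
    """A random subset of the grid without repeats: what Random Search would try."""
    rng = random.Random(seed)
    return rng.sample(grid_combinations(params), n_trials)

grid = grid_combinations(search_params)
subset = random_combinations(search_params, n_trials=10)
print(len(grid), len(subset))
```

With a fixed budget, Random Search simply stops after `n_trials` evaluations instead of walking through the whole grid.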

4. Bayesian Optimization/Other probabilistic optimizations

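Personally, this is the approach I prefer when I do serious optimization.

Mathematically, it seems to choose the next parameters to try in a reasonable way.

The deeper details are quite hard, though.

Roughly speaking, Bayesian optimization uses a Gaussian Process to build an estimate of the objective function, and uses that estimate to look for the set of hyperparameters that minimizes the loss.

Because it assumes the objective function is very expensive to evaluate, a limit is set on the number of evaluations.

At first, a few parameter values are drawn at random from the allowed ranges and the function is evaluated at those points.

Next, a Gaussian Process is fitted to the observed values and used to carry out the optimization.

An acquisition function then decides which point to sample next.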
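The loop described above can be sketched with the standard library alone. This is a toy illustration, not what any particular package does: it assumes a single parameter on [0, 1], an RBF kernel with a fixed length scale of 0.2, expected improvement as the acquisition function, and a hypothetical `objective` that stands in for "train a model and return the validation loss".

```python
import math
import random

def objective(x):
    # Hypothetical stand-in for "train a model and return its validation loss".
    return (x - 0.3) ** 2

def rbf(a, b, length=0.2):
    # RBF kernel with a fixed, assumed length scale.
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    # Lower-triangular L with L @ L.T == A, for a small positive definite A.
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    # Forward substitution: solve L y = b.
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper(L, y):
    # Back substitution: solve L.T x = y.
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit(xs, ys, jitter=1e-6):
    # Fit a zero-mean GP to the centred observations.
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    mean_y = sum(ys) / len(ys)
    alpha = solve_upper(L, solve_lower(L, [y - mean_y for y in ys]))
    return L, alpha, mean_y

def gp_predict(model, xs, x_new):
    # Posterior mean and standard deviation at x_new.
    L, alpha, mean_y = model
    k_star = [rbf(x_new, a) for a in xs]
    mu = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(1.0 - sum(t * t for t in v), 1e-12)
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best):
    # Expected improvement for minimization.
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (best - mu) * cdf + sigma * pdf

def bayes_opt(n_init=3, n_iter=12, seed=0):
    rng = random.Random(seed)
    candidates = [i / 100 for i in range(101)]
    xs = rng.sample(candidates, n_init)       # 1) a few random evaluations
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):                   # evaluation budget
        model = gp_fit(xs, ys)                # 2) fit the surrogate
        best = min(ys)
        remaining = [c for c in candidates if c not in xs]
        x_next = max(remaining, key=lambda c: expected_improvement(
            *gp_predict(model, xs, c), best))  # 3) acquisition picks the next point
        xs.append(x_next)
        ys.append(objective(x_next))
    i = ys.index(min(ys))
    return xs[i], ys[i], xs, ys

best_x, best_y, xs, ys = bayes_opt()
print(best_x, best_y)
```

In practice you would use a library rather than hand-written linear algebra; the point here is only the order of the steps: random start, surrogate fit, acquisition, next evaluation.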

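One package for this is Hyperopt. Its main algorithm is TPE (Tree-structured Parzen Estimator) rather than a Gaussian Process (https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf)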

Grid Search Implementation

If you use a scikit-learn model, there is GridSearchCV.

Reportedly it can be used with Keras as well.

import itertools as it
import logging

logging.basicConfig(level=logging.INFO)

search_params = {
    "batch_size": [20, 30, 40],
    "time_steps": [30, 60, 90],
    "lr": [0.01, 0.001, 0.0001],
    "epochs": [30, 50, 70]
}

def eval_model(mat, combination, comb_no):
    """
    Implement your logic to build a model, train it and then calculate validation loss.
    Save this validation loss using CSVLogger of Keras or in a text file. Later you can
    query to get the best combination.
    """
    pass

def get_all_combinations(params):
    # each tuple holds one value per parameter, in the key order of params
    all_names = params.keys()
    combinations = it.product(*(params[name] for name in all_names))
    return list(combinations)

def run_search(mat, params):
    param_combs = get_all_combinations(params) # list of tuples
    logging.info("Total combinations to try = {}".format(len(param_combs)))
    for i, combination in enumerate(param_combs):
        logging.info("Trying combo no. {} {}".format(i, combination))
        eval_model(mat, combination, i)

# x_input: the input data, assumed to be loaded earlier
run_search(x_input, search_params)

Other Smarter Search Implementation

There are several open-source tools that minimize an objective function using smarter search algorithms.

>> HyperOpt and Talos

hyperopt

In the code below:

search_space : the parameters you want to vary

fmin : the function that actually runs the minimization of the objective

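import numpy as np
from hyperopt import Trials, STATUS_OK, tpe, fmin, hp
from keras import optimizers
from keras.layers import LSTM, Dense, Flatten
from keras.models import Sequential
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# mat, build_timeseries, trim_dataset and your_csv_logger are assumed to be defined
# as in the original script linked at the end of this post.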

def data(batch_size, time_steps):
    """
    function that returns data to be fed into objective function and model is trained on it subsequently.
    """
    global mat

    BATCH_SIZE = batch_size
    TIME_STEPS = time_steps
    x_train, x_test = train_test_split(mat, train_size=0.8, test_size=0.2, shuffle=False)
    # scale the train and test dataset
    min_max_scaler = MinMaxScaler()
    x_train = min_max_scaler.fit_transform(x_train)
    x_test = min_max_scaler.transform(x_test)

    x_train_ts, y_train_ts = build_timeseries(x_train, 3, TIME_STEPS)
    x_test_ts, y_test_ts = build_timeseries(x_test, 3, TIME_STEPS)
    x_train_ts = trim_dataset(x_train_ts, BATCH_SIZE)
    y_train_ts = trim_dataset(y_train_ts, BATCH_SIZE)
    x_test_ts = trim_dataset(x_test_ts, BATCH_SIZE)
    y_test_ts = trim_dataset(y_test_ts, BATCH_SIZE)
    return x_train_ts, y_train_ts, x_test_ts, y_test_ts

search_space = {
    'batch_size': hp.choice('bs', [30,40,50,60,70]),
    'time_steps': hp.choice('ts', [30,50,60,80,90]),
    'lstm1_nodes': hp.choice('units_lstm1', [70,80,100,130]),
    'lstm1_dropouts': hp.uniform('dos_lstm1',0,1),
    'lstm_layers': hp.choice('num_layers_lstm',[
        {
            'layers':'one', 
        },
        {
            'layers':'two',
            'lstm2_nodes': hp.choice('units_lstm2', [20,30,40,50]),
            'lstm2_dropouts': hp.uniform('dos_lstm2',0,1)  
        }
        ]),
    'dense_layers': hp.choice('num_layers_dense',[
        {
            'layers':'one'
        },
        {
            'layers':'two',
            'dense2_nodes': hp.choice('units_dense', [10,20,30,40])
        }
        ]),
    "lr": hp.uniform('lr',0,1),
    "epochs": hp.choice('epochs', [30, 40, 50, 60, 70]),
    "optimizer": hp.choice('optmz',["sgd", "rms"])
}

def create_model_hypopt(params):
    """
    This method is called for each combination of parameter set to train the model and validate it against validation data
    to see all the results, from which best can be selected.
    """
    print("Trying params:",params)
    batch_size = params["batch_size"]
    time_steps = params["time_steps"]
    # For most cases preparation of data can be done once and used 'n' number of times in this method to train the model
    # but in this case we want to find optimal value for batch_size and time_steps too. So our data preparation has to be done
    # based on that. Hence calling it from here.
    x_train_ts, y_train_ts, x_test_ts, y_test_ts = data(batch_size, time_steps)
    lstm_model = Sequential()
    # (batch_size, timesteps, data_dim)
    lstm_model.add(LSTM(params["lstm1_nodes"], batch_input_shape=(batch_size, time_steps, x_train_ts.shape[2]), dropout=params["lstm1_dropouts"],
                        recurrent_dropout=params["lstm1_dropouts"], stateful=True, return_sequences=True,
                        kernel_initializer='random_uniform'))  
    if params["lstm_layers"]["layers"] == "two":
        lstm_model.add(LSTM(params["lstm_layers"]["lstm2_nodes"], dropout=params["lstm_layers"]["lstm2_dropouts"]))
    else:
        lstm_model.add(Flatten())

    if params["dense_layers"]["layers"] == 'two':
        lstm_model.add(Dense(params["dense_layers"]["dense2_nodes"], activation='relu'))
    
    lstm_model.add(Dense(1, activation='sigmoid'))

    lr = params["lr"]
    epochs = params["epochs"]
    if params["optimizer"] == 'rms':
        optimizer = optimizers.RMSprop(lr=lr)
    else:
        optimizer = optimizers.SGD(lr=lr, decay=1e-6, momentum=0.9, nesterov=True)

    lstm_model.compile(loss='mean_squared_error', optimizer=optimizer)  # binary_crossentropy
    history = lstm_model.fit(x_train_ts, y_train_ts, epochs=epochs, verbose=2, batch_size=batch_size,
                             validation_data=(x_test_ts, y_test_ts),
                             callbacks=[your_csv_logger])
    val_error = np.amin(history.history['val_loss']) 
    print('Best validation error of epoch:', val_error)
    return {'loss': val_error, 'status': STATUS_OK, 'model': lstm_model} # if accuracy use '-' sign
    # return history, lstm_model

# The Trials object lets you return and store extra information from the objective function, which
# can be analysed later. Check "trials.trials", which returns the full list of result dictionaries.
trials = Trials()
best = fmin(create_model_hypopt,
    space=search_space,
    algo=tpe.suggest, # use rand.suggest (from hyperopt import rand) to select param values randomly
    max_evals=200, # max number of evaluations you want to do on objective function
    trials=trials)

hp.choice : picks one of the values in the given list.

hp.uniform : samples uniformly between its 2nd and 3rd arguments.
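One thing worth noticing in the search space above is that `hp.uniform('lr', 0, 1)` spreads its draws evenly on a linear scale, so small learning rates such as 0.001 are rarely tried. Hyperopt also offers `hp.loguniform(label, low, high)`, which returns `exp(uniform(low, high))`. The sketch below uses only the standard library to show the difference; the range 1e-4 to 1e-1 and the helper names are my own assumptions, not part of the original code.

```python
import math
import random

def sample_uniform(rng, low, high):
    """Linear-scale draw, like hp.uniform(label, low, high)."""
    return rng.uniform(low, high)

def sample_loguniform(rng, low, high):
    """Log-scale draw: exp(uniform(log(low), log(high))), equal mass per decade."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
n = 10_000
uniform_draws = [sample_uniform(rng, 0.0, 1.0) for _ in range(n)]
log_draws = [sample_loguniform(rng, 1e-4, 1e-1) for _ in range(n)]

# Share of draws that land below 0.01 under each scheme.
share_uniform_small = sum(x < 0.01 for x in uniform_draws) / n
share_log_small = sum(x < 0.01 for x in log_draws) / n
print(share_uniform_small, share_log_small)
```

On the linear scale only about 1% of the draws fall below 0.01, while on the log scale roughly two thirds do, which is usually closer to where useful learning rates live.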

Talos

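It is much like the previous tool: you have to write a new function that builds the model, trains it, and evaluates it.

Instead of a dictionary, that function returns the Keras history object and the model.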

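import os
import pickle
import talos as ta

# LogMetrics, csv_logger and OUTPUT_PATH are assumed to be defined elsewhere in the script.

def data(search_params):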
    """
    The function that prepares the data for LSTM training specific to this problem as per values in search_params.
    """
    global mat

    BATCH_SIZE = search_params["batch_size"]
    TIME_STEPS = search_params["time_steps"]
    x_train, x_test = train_test_split(mat, train_size=0.8, test_size=0.2, shuffle=False)

    # scale the train and test dataset
    min_max_scaler = MinMaxScaler()
    x_train = min_max_scaler.fit_transform(x_train)
    x_test = min_max_scaler.transform(x_test)

    x_train_ts, y_train_ts = build_timeseries(x_train, 3, TIME_STEPS)
    x_test_ts, y_test_ts = build_timeseries(x_test, 3, TIME_STEPS)
    x_train_ts = trim_dataset(x_train_ts, BATCH_SIZE)
    y_train_ts = trim_dataset(y_train_ts, BATCH_SIZE)
    x_test_ts = trim_dataset(x_test_ts, BATCH_SIZE)
    y_test_ts = trim_dataset(y_test_ts, BATCH_SIZE)
    print("Test size(trimmed) {}, {}".format(x_test_ts.shape, y_test_ts.shape))
    return x_train_ts, y_train_ts, x_test_ts, y_test_ts
  
def create_model_talos(x_train_ts, y_train_ts, x_test_ts, y_test_ts, params):
    """
    function that builds model, trains, evaluates on validation data and returns Keras history object and model for
    talos scanning. Here I am creating data inside function because data preparation varies as per the selected value of 
    batch_size and time_steps during searching. So we ignore data that's received here as argument from scan method of Talos.
    """
    x_train_ts, y_train_ts, x_test_ts, y_test_ts = data(params)
    BATCH_SIZE = params["batch_size"]
    TIME_STEPS = params["time_steps"]
    lstm_model = Sequential()
    # (batch_size, timesteps, data_dim)
    lstm_model.add(LSTM(params["lstm1_nodes"], batch_input_shape=(BATCH_SIZE, TIME_STEPS, x_train_ts.shape[2]), dropout=0.2,
                        recurrent_dropout=0.2, stateful=True, return_sequences=True,
                        kernel_initializer='random_uniform'))
    if params["lstm_layers"] == 2:
        lstm_model.add(LSTM(params["lstm2_nodes"], dropout=0.2))
    else:
        lstm_model.add(Flatten())

    if params["dense_layers"] == 2:
        lstm_model.add(Dense(params["dense2_nodes"], activation='relu'))

    lstm_model.add(Dense(1, activation='sigmoid'))
    if params["optimizer"] == 'rms':
        optimizer = optimizers.RMSprop(lr=params["lr"])
    else:
        optimizer = optimizers.SGD(lr=params["lr"], decay=1e-6, momentum=0.9, nesterov=True)
    lstm_model.compile(loss='mean_squared_error', optimizer=optimizer)  # binary_crossentropy
    history = lstm_model.fit(x_train_ts, y_train_ts, epochs=params["epochs"], verbose=2, batch_size=BATCH_SIZE,
                             validation_data=(x_test_ts, y_test_ts),
                             callbacks=[LogMetrics(search_params, params, -1), csv_logger])
    return history, lstm_model
  
print("Starting Talos scanning...")
t = ta.Scan(x=mat, # data parameter is ignored in this example as here data varies based on batch_size & time_steps
            y=mat[:,0], # dummy data just to avoid errors. input and output calculated in create_model_talos
            model=create_model_talos,
            params=search_params,
            dataset_name='stock_ge',
            experiment_no='1',
            reduction_interval=10)

pickle.dump(t, open(os.path.join(OUTPUT_PATH,"talos_res"),"wb"))

Hyperas (Hyperopt + Keras)

The main advantage of this package is that you do not need to learn hyperopt's syntax or functions.

All you have to do is define the search space.

from hyperas import optim
from hyperas.distributions import choice, uniform
from hyperopt import Trials, STATUS_OK, tpe

def create_model(x_train_ts, y_train_ts, x_test_ts, y_test_ts):
    # build_data, BATCH_SIZE, TIME_STEPS, LogMetrics, csv_logger, params and comb_no
    # are not defined in this snippet; they are assumed to come from the rest of the script.
    x_train_ts, y_train_ts, x_test_ts, y_test_ts = build_data()
    lstm_model = Sequential()
    # (batch_size, timesteps, data_dim)
    lstm_model.add(LSTM({{choice([50, 100, 150])}}, batch_input_shape=(BATCH_SIZE, TIME_STEPS, x_train_ts.shape[2]), dropout=0.2,
                        recurrent_dropout=0.2, stateful=True, return_sequences=True,
                        kernel_initializer='random_uniform'))
    if {{choice(['one_lstm','two_lstm'])}} == 'two_lstm':
        lstm_model.add(LSTM({{choice([30, 60, 80])}}, dropout={{choice([0.1,0.2,0.3])}}))
    else:
        lstm_model.add(Flatten())  # as in the Hyperopt and Talos versions above
    if {{choice(['one_dense','two_dense'])}} == 'two_dense':
        lstm_model.add(Dense({{choice([10, 20])}}, activation='relu'))
    lstm_model.add(Dense(1, activation='sigmoid'))
    if {{choice(['sgd','rms'])}} == 'rms':
        optimizer = optimizers.RMSprop(lr={{uniform(0.001, 0.1)}})
    else:
        optimizer = optimizers.SGD(lr={{uniform(0.001, 0.1)}}, decay=1e-6, momentum=0.9, nesterov=True)
    # metrics=['accuracy'] is needed for 'val_acc' to appear in the history
    lstm_model.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['accuracy'])  # binary_crossentropy

    history = lstm_model.fit(x_train_ts, y_train_ts, epochs={{choice([20, 40, 60, 70])}}, verbose=2, batch_size=BATCH_SIZE,
                             validation_data=(x_test_ts, y_test_ts),
                             callbacks=[LogMetrics(search_params, params, comb_no), csv_logger])

    val_acc = np.amax(history.history['val_acc'])  # the key is 'val_accuracy' in newer Keras versions
    print('Best validation acc of epoch:', val_acc)
    return {'loss': -val_acc, 'status': STATUS_OK, 'model': lstm_model}  # negated because Hyperopt minimizes 'loss'


best_run, best_model = optim.minimize(model=create_model,
                                      data=data_dummy,
                                      algo=tpe.suggest,
                                      max_evals=2000,
                                      trials=Trials())

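However, Hyperas did not work in this case. The author was calling the 'data' function from inside the 'model' function, and the double-curly-brace syntax caused some problem. Since other tools were available, the author says he did not dig into the issue.

Even so, the author thinks it is worth trying and finds it quite easy to use.

If you ask which of the tools above is the best, the answer depends on trade-offs: how much time you can spend on fine-tuning, and how much validation loss you can accept.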

search_params = {
    "batch_size": [20, 30, 40],
    "time_steps": [30, 60, 90],
    "lr": [0.01, 0.001, 0.0001],
    "epochs": [30, 50, 70]
}

With grid search, the author says he ran these 81 combinations over 24 hours.
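To get a feel for these numbers, here is a small back-of-the-envelope calculation. It assumes every combination takes about the same time, which is a simplification, and the extra `lstm_layers` parameter is a hypothetical addition of mine to show how quickly the cost grows.

```python
from math import prod

search_params = {
    "batch_size": [20, 30, 40],
    "time_steps": [30, 60, 90],
    "lr": [0.01, 0.001, 0.0001],
    "epochs": [30, 50, 70],
}

def grid_size(params):
    """Number of combinations a full grid search has to evaluate."""
    return prod(len(values) for values in params.values())

n_combos = grid_size(search_params)
minutes_per_combo = 24 * 60 / n_combos  # 24 hours in total, as reported above

# Hypothetical extension: also search over one or two LSTM layers.
extended = dict(search_params, lstm_layers=[1, 2])
extended_hours = grid_size(extended) * minutes_per_combo / 60

print(n_combos, round(minutes_per_combo, 1), round(extended_hours, 1))
```

That is roughly 18 minutes per combination, and adding a single two-valued parameter doubles the total run time.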

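The first mistake, he says, was not logging anything.

And despite trying 81 combinations, the results were poor.

The second mistake was not optimizing things like the number of layers.

In any case, he decided to start from the best grid search result and continue tuning by hand.

The third mistake was underestimating the power of neural networks.

There was no need for two layers; a single layer was enough.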

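So how can you tell that a model is overfitting?

The usual method is to compare the train loss and the validation loss.

First

If the gap between train and validation loss is large, the model is overfitting: it fits the training data well but does not generalize to new data.

Second

This means the model is only predicting random values for new data, which is why the validation losses across epochs show almost no relationship.

Another way is to look at the epoch logs.

If the train loss keeps decreasing while the validation loss fluctuates or stays the same, that is probably a sign of heading toward overfitting.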
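These two checks can be turned into a rough helper that works on the per-epoch losses Keras stores in `history.history` under `'loss'` and `'val_loss'`. This is only a heuristic sketch: the function name, the thresholds (`gap_ratio`, `window`, `tol`) and the example numbers are all assumptions of mine, not values from the original article.

```python
def overfitting_signs(train_loss, val_loss, gap_ratio=1.5, window=5, tol=1e-3):
    """Return two heuristic overfitting flags from per-epoch loss lists."""
    if len(train_loss) != len(val_loss):
        raise ValueError("loss lists must have the same length")
    if len(train_loss) < window:
        raise ValueError("need at least `window` epochs")
    # Check 1: the final validation loss is much larger than the final train loss.
    large_gap = val_loss[-1] > gap_ratio * train_loss[-1]
    # Check 2: over the last `window` epochs, train loss fell but validation loss did not.
    train_drop = train_loss[-window] - train_loss[-1]
    val_drop = val_loss[-window] - val_loss[-1]
    train_improving_val_stalled = train_drop > tol and val_drop <= tol
    return {"large_gap": large_gap,
            "train_improving_val_stalled": train_improving_val_stalled}

# Made-up loss curves in the same shape as history.history.
overfit = {
    "loss":     [0.90, 0.60, 0.40, 0.30, 0.22, 0.16, 0.11, 0.08],
    "val_loss": [0.95, 0.70, 0.55, 0.50, 0.49, 0.50, 0.52, 0.51],
}
healthy = {
    "loss":     [0.90, 0.60, 0.45, 0.35, 0.30, 0.26, 0.23, 0.21],
    "val_loss": [0.95, 0.65, 0.50, 0.40, 0.34, 0.30, 0.27, 0.25],
}

overfit_flags = overfitting_signs(overfit["loss"], overfit["val_loss"])
healthy_flags = overfitting_signs(healthy["loss"], healthy["val_loss"])
print(overfit_flags, healthy_flags)
```

Both flags fire for the first set of curves and neither for the second. The thresholds would need adjusting to the scale of your own loss.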

So in the end the author removed the second layer and raised dropout from 0.2 to 0.5, which gave the following result!

auto-sklearn

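I came across this package recently.

In my view, rather than searching for a model with grid search or Bayesian optimization in plain sklearn, I would rather use auto-sklearn, which builds a model that also takes ensembles into account.

Usage is really simple, and the Medium post below gives an intuitive explanation.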

https://automl.github.io/auto-sklearn/master/

auto-sklearn — AutoSklearn 0.6.0 documentation

https://medium.com/dlift/%EC%A0%81%EB%8B%B9%ED%95%9C-%EC%A0%95%ED%99%95%EB%8F%84-%EA%B0%80-%EB%B3%B4%EC%9E%A5%EB%90%98%EB%8A%94-%EB%AA%A8%EB%8D%B8%EC%9D%84-%EC%9E%90%EB%8F%99%EC%9C%BC%EB%A1%9C-%EB%A7%8C%EB%93%A4-%EC%88%98%EB%8A%94-%EC%97%86%EC%9D%84%EA%B9%8C-f0da4a6a9607

Can we 'automatically' build a model with reasonable 'accuracy' guaranteed?

My own conclusion

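I once tried to optimize a Neural Network by letting Bayesian Optimization control everything, including the number of layers and dropout, but the results were not good.

The real problem was that I was short on time, so I kept the number of epochs small while searching over many parameters, and the performance ended up worse than an ordinary tree-based model.

As the author above said, it seems better to trust the power of the Neural Network and start with a simple model.

If I did it again, I would first look at the overall structure of the data, keep the model as small as possible, run Bayesian optimization quickly, and then analyze those results to guide a manual search.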

https://towardsdatascience.com/finding-the-right-architecture-for-neural-network-b0439efa4587

How to Find the Right Architecture for Neural Network and Fine Tune Hyperparameters

https://github.com/DarkKnight1991/Stock-Price-Prediction/blob/master/stock_pred_hyperopt.py

DarkKnight1991/Stock-Price-Prediction
