Python) Automated feature generation with featuretools

2022. 1. 22. 14:02 Analysis Python/Implementation and Resources

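    This post is written against featuretools version 1.4.0.

    Most example code available does not reflect version 1.4.0 and was hard to run as-is, so this post walks through a simple feature generation exercise and summarizes what worked.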

    Data: Home Credit Default Risk (https://www.kaggle.com/c/home-credit-default-risk/data)

    Package installation

    pip install featuretools==1.4.0

    Featuretools

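    Featuretools is an open-source library for automated feature engineering.

    It is designed to speed up the feature generation process so that more time can go to other aspects of building a machine learning model. In other words, it gets data into a "machine learning ready" state.

    There are three main concepts to know: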

    • Entities
    • Deep Feature Synthesis (DFS)
    • Feature primitive

    Entities

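    An entity can be thought of as a representation of a pandas DataFrame. A collection of entities is called an EntitySet. In featuretools 1.x the API calls these dataframes, as in the add_dataframe calls used later in this post.

    Relationship

    A relationship is the same abstract concept as in an RDBMS: a key column in a parent table is referenced by a column in a child table.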
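
    As a rough illustration of the parent/child idea, the sketch below builds two small pandas DataFrames and checks that every child row points at an existing parent key. The table and column names (apps, prev_apps, SK_ID_CURR, SK_ID_PREV) mirror the ones used later in this post; the values are made up, and nothing here calls featuretools itself.

```python
import pandas as pd

# Parent table: one row per current application (toy values).
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

# Child table: several previous applications may share one SK_ID_CURR.
prev_apps = pd.DataFrame({
    "SK_ID_PREV": [10, 11, 12, 13],
    "SK_ID_CURR": [1, 1, 2, 3],
})

# The parent key must be unique, like a primary key in an RDBMS.
parent_key_is_unique = apps["SK_ID_CURR"].is_unique

# Every child foreign key should exist in the parent table.
orphans = prev_apps[~prev_apps["SK_ID_CURR"].isin(apps["SK_ID_CURR"])]

print(parent_key_is_unique, len(orphans))
```

    Running a check like this before defining a relationship is one way to catch key problems early, since the sampled child table may contain keys that the sampled parent table does not.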

    Deep Feature Synthesis

    DFS (Deep Feature Synthesis) is the actual feature engineering method and the backbone of Featuretools.

    It can create new features from a single dataframe or from multiple dataframes.

    Feature primitive

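    DFS generates features by applying feature primitives to the entity relationships in an EntitySet.

    These primitives are operations commonly used when building features by hand. For example, the "mean" primitive computes the mean of a variable at an aggregation level.

    The two main kinds are Aggregation and Transformation.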

    Aggregation

    Groups the child rows that belong to each parent row and computes statistics such as minimum, maximum, mean, and standard deviation.
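
    A minimal pandas sketch of what an aggregation primitive amounts to, assuming a toy prev_apps table keyed by SK_ID_CURR. The values are invented; the renaming step only imitates the naming pattern DFS uses for aggregation features, such as MEAN(prev_apps.AMT_CREDIT).

```python
import pandas as pd

prev_apps = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT": [100.0, 300.0, 50.0],
})

# Group child rows by the parent key and compute one statistic per parent.
agg = prev_apps.groupby("SK_ID_CURR")["AMT_CREDIT"].agg(["min", "max", "mean", "std"])

# Rename in the style of DFS aggregation features (illustrative only).
agg.columns = [f"{name.upper()}(prev_apps.AMT_CREDIT)" for name in agg.columns]
print(agg)
```

    Note that a parent with a single child row gets a missing value for the standard deviation, which is one reason generated feature matrices tend to contain many nulls.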

    Transformation

    An operation performed on one or more columns of a single table. For example, computing the difference between the values of two columns.
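
    For comparison, a transformation stays inside one table and produces one value per row. The sketch below does by hand roughly what the subtract_numeric primitive computes; the two column names come from previous_application.csv, and the values are made up.

```python
import pandas as pd

prev_apps = pd.DataFrame({
    "AMT_APPLICATION": [120.0, 80.0],
    "AMT_CREDIT": [100.0, 90.0],
})

# A transform works row by row within one table; no grouping is involved.
prev_apps["AMT_APPLICATION - AMT_CREDIT"] = (
    prev_apps["AMT_APPLICATION"] - prev_apps["AMT_CREDIT"]
)
print(prev_apps)
```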

    Implementation

    read data & merge data

    import featuretools as ft 
    import gc
    import numpy as np
    import pandas as pd
    import warnings
    warnings.simplefilter('ignore')
    from os.path import join as pjoin
    
    
    data_dir = "./home-credit-risk"
    
    filepaths = {
        'data_desc': pjoin(data_dir, 'HomeCredit_columns_description.csv'),
        'app_train': pjoin(data_dir, 'application_train.csv'),
        'app_test': pjoin(data_dir, 'application_test.csv'),
        'bureau': pjoin(data_dir, 'bureau.csv'),
        'bureau_bl': pjoin(data_dir, 'bureau_balance.csv'),
        'credit_bl': pjoin(data_dir, 'credit_card_balance.csv'),
        'install_pays': pjoin(data_dir, 'installments_payments.csv'),
        'pc_balance': pjoin(data_dir, 'POS_CASH_balance.csv'),
        'app_prev': pjoin(data_dir, 'previous_application.csv'),
    
    }
    
    nrows = 10000
    
    # load main datasets
    df_train = pd.read_csv(
        filepaths['app_train'], 
        low_memory=False, engine='c',
        nrows=nrows,
    )
    df_test = pd.read_csv(
        filepaths['app_test'], 
        low_memory=False, 
        engine='c',
    )
    df_joint = pd.concat([df_train, df_test])
    
    del df_train, df_test
    gc.collect()
    
    df_app_prev = pd.read_csv(
        filepaths['app_prev'], 
        engine='c', 
        low_memory=False,
        # first X*3 rows are taken for faster calculations, substitute this by whole dataset
        nrows=nrows*3,
    )

    convert data types

    int_cols = df_joint.select_dtypes(include=[np.int64]).columns
    float_cols = df_joint.select_dtypes(include=[np.float64]).columns 
    
    df_joint[int_cols] = df_joint[int_cols].astype(np.int32)
    df_joint[float_cols] = df_joint[float_cols].astype(np.float32)
    
    # df_joint.set_index('SK_ID_CURR', inplace=True, drop=True)
    target_col = 'TARGET'
    
    
    int_cols = df_app_prev.select_dtypes(include=[np.int64]).columns
    float_cols = df_app_prev.select_dtypes(include=[np.float64]).columns 
    
    df_app_prev[int_cols] = df_app_prev[int_cols].astype(np.int32)
    df_app_prev[float_cols] = df_app_prev[float_cols].astype(np.float32)
    
    today = pd.to_datetime('2018-06-11')
    df_app_prev['DAYS_DECISION'] = today + pd.to_timedelta(df_app_prev['DAYS_DECISION'], unit='d')
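
    The last two lines above turn DAYS_DECISION, which is stored as a day offset relative to the current application, into real timestamps so it can later serve as a time index. The sketch below shows the same conversion on three toy offsets. The reference date is the one used in the code above and should be read as an arbitrary anchor, not a date taken from the data.

```python
import pandas as pd

today = pd.to_datetime("2018-06-11")

# Toy offsets: days relative to the reference date (negative means in the past).
days_decision = pd.Series([-1, -30, -365], dtype="int32")

# Timestamp + timedelta Series gives a datetime64 Series.
decision_time = today + pd.to_timedelta(days_decision, unit="d")
print(decision_time)
```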

    define entity set

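    The parts of older example code that raise errors under 1.4.0 have been fixed here.

    Two entities are defined, along with the relationship between them.

    A new table is also created so that the groupby primitives have a column to group on.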

    from woodwork.logical_types import Categorical
    
    es = ft.EntitySet('application_data')
    # add entities (application table itself)
    es.add_dataframe(
        dataframe_name='apps', # define entity id
        dataframe=df_joint.drop('TARGET', axis=1), # select underlying data
        index='SK_ID_CURR', # define unique index column
        # specify some datatypes manually (if needed)
        logical_types={
            f: Categorical 
            for f in df_joint.columns if f.startswith('FLAG_')
        }
    )
    # add entities (previous applications table)
    es = es.add_dataframe(
        dataframe_name = 'prev_apps', 
        dataframe = df_app_prev,
        index = 'SK_ID_PREV',
        time_index = 'DAYS_DECISION',
        logical_types={
            f: Categorical 
            for f in df_app_prev.columns if f.startswith('NFLAG_')
        }
    )
    # add relationships
    r_app_cur_to_app_prev = ft.Relationship(
        entityset=es,
        parent_dataframe_name="apps", parent_column_name="SK_ID_CURR",
        child_dataframe_name="prev_apps" , child_column_name="SK_ID_CURR"
    )
    # Add the relationship to the entity set
    es = es.add_relationship(relationship=r_app_cur_to_app_prev)
    # Create new table for groupby
    es.normalize_dataframe(new_dataframe_name="apps_new",
                        base_dataframe_name="apps",
                        index="NAME_CONTRACT_TYPE")
    es
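
    As I understand it, normalize_dataframe pulls the distinct values of one column out into a new parent table and links the base table to it. The pandas sketch below approximates that effect on a toy apps table; it is an analogy under that assumption, not the library's implementation.

```python
import pandas as pd

apps = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
})

# New parent table: one row per distinct contract type.
apps_new = apps[["NAME_CONTRACT_TYPE"]].drop_duplicates().reset_index(drop=True)

# Each row of apps points at exactly one row of apps_new;
# validate raises if the key in apps_new is not unique.
linked = apps.merge(
    apps_new, on="NAME_CONTRACT_TYPE", how="left", validate="many_to_one"
)
print(apps_new)
```

    With such a parent table in place, NAME_CONTRACT_TYPE acts as a foreign key in apps, which gives the groupby transform primitives used below something to group by.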

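    Types of primitives

    The built-in feature primitives that can be applied to this data are listed below.

    Since each of these can be applied per variable, the number of features that can be generated is enormous.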

      name type dask_compatible koalas_compatible description valid_inputs return_type
    0 sum aggregation True True Calculates the total addition, ignoring NaN. <ColumnSchema (Semantic Tags = ['numeric'])>  
    1 avg_time_between aggregation False False Computes the average number of seconds between consecutive events. <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>  
    2 num_true aggregation True False Counts the number of True values. <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>  
    3 max aggregation True True Calculates the highest value, ignoring NaN values. <ColumnSchema (Semantic Tags = ['numeric'])>  
    4 mode aggregation False False Determines the most commonly repeated value. <ColumnSchema (Semantic Tags = ['category'])>  
    5 all aggregation True False Calculates if all values are 'True' in a list. <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>  
    6 count aggregation True True Determines the total number of values, excluding NaN. <ColumnSchema (Semantic Tags = ['index'])>  
    7 last aggregation False False Determines the last value in a list.    
    8 std aggregation True True Computes the dispersion relative to the mean value, ignoring NaN. <ColumnSchema (Semantic Tags = ['numeric'])>  
    9 median aggregation False False Determines the middlemost number in a list of values. <ColumnSchema (Semantic Tags = ['numeric'])>  
    10 n_most_common aggregation False False Determines the n most common elements. <ColumnSchema (Semantic Tags = ['category'])>  
    11 num_unique aggregation True True Determines the number of distinct values, ignoring NaN values. <ColumnSchema (Semantic Tags = ['category'])>  
    12 entropy aggregation False False Calculates the entropy for a categorical column <ColumnSchema (Semantic Tags = ['category'])>  
    13 min aggregation True True Calculates the smallest value, ignoring NaN values. <ColumnSchema (Semantic Tags = ['numeric'])>  
    14 time_since_last aggregation False False Calculates the time elapsed since the last datetime (default in seconds). <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>  
    15 trend aggregation False False Calculates the trend of a column over time. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>  
    16 mean aggregation True True Computes the average for a list of values. <ColumnSchema (Semantic Tags = ['numeric'])>  
    17 any aggregation True False Determines if any value is 'True' in a list. <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>  
    18 time_since_first aggregation False False Calculates the time elapsed since the first datetime (in seconds). <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>  
    19 percent_true aggregation True False Determines the percent of True values. <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>  
    20 first aggregation False False Determines the first value in a list.    
    21 skew aggregation False False Computes the extent to which a distribution differs from a normal distribution. <ColumnSchema (Semantic Tags = ['numeric'])>  
    22 is_null transform True True Determines if a value is null.    
    23 is_in_geobox transform False False Determines if coordinates are inside a box defined by two <ColumnSchema (Logical Type = LatLong)>  
    24 less_than transform True True Determines if values in one list are less than another list. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>  
    25 less_than_scalar transform True True Determines if values are less than a given scalar. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>  
    26 numeric_lag transform False False Shifts an array of values by a specified number of periods. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Semantic Tags = ['time_index'])>  
    27 multiply_numeric_scalar transform True True Multiply each element in the list by a scalar. <ColumnSchema (Semantic Tags = ['numeric'])>  
    28 percentile transform False False Determines the percentile rank for each value in a list. <ColumnSchema (Semantic Tags = ['numeric'])>  
    29 rolling_max transform False False Determines the maximum of entries over a given window. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>  
    30 greater_than_equal_to transform True True Determines if values in one list are greater than or equal to another list. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>  
    31 add_numeric transform True True Element-wise addition of two lists. <ColumnSchema (Semantic Tags = ['numeric'])>  
    32 is_free_email_domain transform False False Determines if an email address is from a free email domain. <ColumnSchema (Logical Type = EmailAddress)>  
    33 negate transform True True Negates a numeric value. <ColumnSchema (Semantic Tags = ['numeric'])>  
    34 hour transform True True Determines the hour value of a datetime. <ColumnSchema (Logical Type = Datetime)>  
    35 equal_scalar transform True True Determines if values in a list are equal to a given scalar.    
    36 week transform True True Determines the week of the year from a datetime. <ColumnSchema (Logical Type = Datetime)>  
    37 url_to_protocol transform False False Determines the protocol (http or https) of a url. <ColumnSchema (Logical Type = URL)>  
    38 isin transform True True Determines whether a value is present in a provided list.    
    39 time_since_previous transform False False Compute the time since the previous entry in a list. <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>  
    40 modulo_numeric_scalar transform True True Return the modulo of each element in the list by a scalar. <ColumnSchema (Semantic Tags = ['numeric'])>  
    41 or transform True True Element-wise logical OR of two lists. <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>  
    42 divide_numeric transform True True Element-wise division of two lists. <ColumnSchema (Semantic Tags = ['numeric'])>  
    43 cum_sum transform False False Calculates the cumulative sum. <ColumnSchema (Semantic Tags = ['numeric'])>  
    44 multiply_numeric transform True True Element-wise multiplication of two lists. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>  
    45 num_characters transform True True Calculates the number of characters in a string. <ColumnSchema (Logical Type = NaturalLanguage)>  
    46 cityblock_distance transform False False Calculates the distance between points in a city road grid. <ColumnSchema (Logical Type = LatLong)>  
    47 divide_by_feature transform True True Divide a scalar by each value in the list. <ColumnSchema (Semantic Tags = ['numeric'])>  
    48 divide_numeric_scalar transform True True Divide each element in the list by a scalar. <ColumnSchema (Semantic Tags = ['numeric'])>  
    49 subtract_numeric transform True False Element-wise subtraction of two lists. <ColumnSchema (Semantic Tags = ['numeric'])>  
    50 subtract_numeric_scalar transform True True Subtract a scalar from each element in the list. <ColumnSchema (Semantic Tags = ['numeric'])>  
    51 is_weekend transform True True Determines if a date falls on a weekend. <ColumnSchema (Logical Type = Datetime)>  
    52 equal transform True True Determines if values in one list are equal to another list.    
    53 rolling_std transform False False Calculates the standard deviation of entries over a given window. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>  
    54 haversine transform False False Calculates the approximate haversine distance between two LatLong columns. <ColumnSchema (Logical Type = LatLong)>  
    55 time_since transform True False Calculates time from a value to a specified cutoff datetime. <ColumnSchema (Logical Type = Datetime)>  
    56 cum_min transform False False Calculates the cumulative minimum. <ColumnSchema (Semantic Tags = ['numeric'])>  
    57 absolute transform True True Computes the absolute value of a number. <ColumnSchema (Semantic Tags = ['numeric'])>  
    58 and transform True True Element-wise logical AND of two lists. <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>  
    59 less_than_equal_to transform True True Determines if values in one list are less than or equal to another list. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>  
    60 multiply_boolean transform True False Element-wise multiplication of two lists of boolean values. <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>  
    61 latitude transform False False Returns the first tuple value in a list of LatLong tuples. <ColumnSchema (Logical Type = LatLong)>  
    62 rolling_min transform False False Determines the minimum of entries over a given window. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>  
    63 greater_than_equal_to_scalar transform True True Determines if values are greater than or equal to a given scalar. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>  
    64 email_address_to_domain transform False False Determines the domain of an email <ColumnSchema (Logical Type = EmailAddress)>  
    65 not transform True True Negates a boolean value. <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>  
    66 cum_count transform False False Calculates the cumulative count. <ColumnSchema (Semantic Tags = ['category'])>, <ColumnSchema (Semantic Tags = ['foreign_key'])>  
    67 month transform True True Determines the month value of a datetime. <ColumnSchema (Logical Type = Datetime)>  
    68 greater_than_scalar transform True True Determines if values are greater than a given scalar. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>  
    69 not_equal transform True False Determines if values in one list are not equal to another list.    
    70 not_equal_scalar transform True True Determines if values in a list are not equal to a given scalar.    
    71 url_to_tld transform False False Determines the top level domain of a url. <ColumnSchema (Logical Type = URL)>  
    72 diff transform False False Compute the difference between the value in a list and the <ColumnSchema (Semantic Tags = ['numeric'])>  
    73 greater_than transform True False Determines if values in one list are greater than another list. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>  
    74 minute transform True True Determines the minutes value of a datetime. <ColumnSchema (Logical Type = Datetime)>  
    75 modulo_by_feature transform True True Return the modulo of a scalar by each element in the list. <ColumnSchema (Semantic Tags = ['numeric'])>  
    76 url_to_domain transform False False Determines the domain of a url. <ColumnSchema (Logical Type = URL)>  
    77 num_words transform True True Determines the number of words in a string by counting the spaces. <ColumnSchema (Logical Type = NaturalLanguage)>  
    78 second transform True True Determines the seconds value of a datetime. <ColumnSchema (Logical Type = Datetime)>  
    79 modulo_numeric transform True True Element-wise modulo of two lists. <ColumnSchema (Semantic Tags = ['numeric'])>  
    80 scalar_subtract_numeric_feature transform True True Subtract each value in the list from a given scalar. <ColumnSchema (Semantic Tags = ['numeric'])>  
    81 weekday transform True True Determines the day of the week from a datetime. <ColumnSchema (Logical Type = Datetime)>  
    82 geomidpoint transform False False Determines the geographic center of two coordinates. <ColumnSchema (Logical Type = LatLong)>  
    83 add_numeric_scalar transform True True Add a scalar to each value in the list. <ColumnSchema (Semantic Tags = ['numeric'])>  
    84 rolling_count transform False False Determines a rolling count of events over a given window. <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>  
    85 age transform True False Calculates the age in years as a floating point number given a <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['date_of_birth'])>  
    86 cum_max transform False False Calculates the cumulative maximum. <ColumnSchema (Semantic Tags = ['numeric'])>  
    87 day transform True True Determines the day of the month from a datetime. <ColumnSchema (Logical Type = Datetime)>  
    88 year transform True True Determines the year value of a datetime. <ColumnSchema (Logical Type = Datetime)>  
    89 less_than_equal_to_scalar transform True True Determines if values are less than or equal to a given scalar. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>  
    90 rolling_mean transform False False Calculates the mean of entries over a given window. <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>  
    91 longitude transform False False Returns the second tuple value in a list of LatLong tuples. <ColumnSchema (Logical Type = LatLong)>  
    92 cum_mean transform False False Calculates the cumulative mean. <ColumnSchema (Semantic Tags = ['numeric'])>  

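    DFS) Feature List

    With features_only=True, DFS returns only the feature definitions, so you can inspect what would be generated without computing anything.

    A total of 328 features are generated here.

    groupby_trans_primitives accepts transform primitives only.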

    feature_defs = ft.dfs(
        entityset=es, 
        target_dataframe_name="apps", 
        features_only=True,
        agg_primitives=[
            "avg_time_between",
            "time_since_last", 
            "num_unique", 
            "mean", 
            "sum", 
        ],
        trans_primitives=[
            "time_since_previous",
            #"add",
        ],
        groupby_trans_primitives=["cum_mean","cum_min"],
        max_depth=1,
        training_window=ft.Timedelta(60, "d"), # use only last X days in computations
        max_features=1000,
        chunk_size=10000,
        verbose=True,
    )
    print(feature_defs)
    # Built 328 features
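
    The restriction on groupby_trans_primitives makes sense once you see that a groupby transform returns one value per input row, whereas an aggregation collapses each group to a single value. Here is a small pandas sketch of what cum_mean and cum_min compute within groups, using made-up data and hypothetical output column names.

```python
import pandas as pd

df = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash", "Cash", "Revolving", "Cash"],
    "AMT_CREDIT": [100.0, 300.0, 50.0, 200.0],
})

grouped = df.groupby("NAME_CONTRACT_TYPE")["AMT_CREDIT"]

# Running mean within each group, in row order: one output per input row.
df["cum_mean_by_type"] = grouped.cumsum() / (grouped.cumcount() + 1)

# Running minimum within each group.
df["cum_min_by_type"] = grouped.cummin()
print(df)
```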

    DFS) Feature Generation

    fm, feature_defs = ft.dfs(
        entityset=es, 
        target_dataframe_name="apps", 
        features_only=False,
        agg_primitives=[
            "avg_time_between",
            "time_since_last", 
            "num_unique", 
            "mean", 
            "sum", 
        ],
        trans_primitives=[
            "time_since_previous",
            #"add",
        ],
        groupby_trans_primitives=["cum_mean","cum_min"],
        max_depth=1,
        training_window=ft.Timedelta(60, "d"), # use only last X days in computations
        max_features=1000,
        chunk_size=10000,
        verbose=True,
    )
    fm.shape # (58744, 328)
    fm = fm.drop_duplicates()
    print(fm.shape)
    fm[50:100]
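
    The training_window argument limits each calculation to rows whose time index falls within a fixed span before the cutoff time. The sketch below imitates that filter in pandas. The cutoff is an assumption made for this illustration (set equal to the reference date used earlier); check how your featuretools version picks the cutoff when none is passed, because a window that ends long after the data does would leave nothing to aggregate.

```python
import pandas as pd

cutoff = pd.Timestamp("2018-06-11")  # assumed cutoff time for this sketch
window = pd.Timedelta(days=60)

prev_apps = pd.DataFrame({
    "SK_ID_PREV": [10, 11, 12],
    "DAYS_DECISION": pd.to_datetime(["2018-06-01", "2018-03-01", "2017-01-15"]),
})

# Keep only rows whose time index lies inside the window before the cutoff.
in_window = prev_apps[
    (prev_apps["DAYS_DECISION"] > cutoff - window)
    & (prev_apps["DAYS_DECISION"] <= cutoff)
]
print(in_window)
```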

    check data type

    Far more float (statistical) variables are generated than exist in the original data.

    # count generated features per dtype
    fm_dtype = pd.DataFrame(fm.dtypes).reset_index(drop=False)
    fm_dtype.columns = ["feature", "dtype"]
    fm_dtype["dtype"] = fm_dtype["dtype"].astype(str)
    fm_dtype.groupby("dtype")["feature"].nunique()

    # same count for the original application table
    ori_df_dtype = pd.DataFrame(df_joint.dtypes).reset_index(drop=False)
    ori_df_dtype.columns = ["feature", "dtype"]
    ori_df_dtype["dtype"] = ori_df_dtype["dtype"].astype(str)
    ori_df_dtype.groupby("dtype")["feature"].nunique()

    This post looked at Featuretools, which generates many variables at once from a range of statistics, and personally I like it. That said, a step that finds and drops the meaningless variables after generation looks essential.
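
    As a starting point for that pruning step, here is a minimal pandas sketch that drops mostly-null and single-valued columns from a toy feature matrix. The 0.5 threshold and the column names are arbitrary choices for illustration. Featuretools also ships a featuretools.selection module with helpers along these lines (for example remove_highly_null_features and remove_single_value_features); check the documentation for your version before relying on them.

```python
import pandas as pd

fm_toy = pd.DataFrame({
    "useful": [1.0, 2.0, 3.0, 4.0],
    "constant": [7.0, 7.0, 7.0, 7.0],
    "mostly_null": [None, None, None, 1.0],
})

# Drop columns where more than half of the values are missing.
null_share = fm_toy.isna().mean()
kept = fm_toy.loc[:, null_share <= 0.5]

# Drop columns that hold a single distinct value and so carry no signal.
kept = kept.loc[:, kept.nunique(dropna=True) > 1]
print(list(kept.columns))
```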

    Reference

    * https://www.kaggle.com/frednavruzov/auto-feature-generation-featuretools-example

    * https://github.com/alteryx/open_source_demos/blob/main/predict-next-purchase/Tutorial.ipynb

    * https://analyticsindiamag.com/introduction-to-featuretools-a-python-framework-for-automated-feature-engineering/

    * https://docs.featuretools.com/en/v0.16.0/ecosystem.html

    * https://www.kaggle.com/willkoehrsen/featuretools-for-good

    * https://www.kaggle.com/willkoehrsen/tuning-automated-feature-engineering-exploratory

    * https://medium.com/analytics-vidhya/feature-engineering-using-featuretools-with-code-10f8c83e5f68
