Python) Automated Feature Generation with featuretools

2022. 1. 22. 14:02 · Analysis Python/Implementation and Resources

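This post is written against featuretools version 1.4.0.

The example code I found had not been updated for 1.4.0 and was hard to run as-is, so I worked through a simple feature-generation example and wrote up the results here.

Data: home-credit-risk data (https://www.kaggle.com/c/home-credit-default-risk/data)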

Package installation

pip install featuretools==1.4.0

Featuretools

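Featuretools is an open-source library for automated feature engineering.

It is designed to speed up the feature-generation process so that more time can go to other aspects of building a machine learning model. In other words, it gets the data into a "machine-learning-ready" state.

There are three main concepts to understand: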

Entities
Deep Feature Synthesis (DFS)
Feature primitive

Entities

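An Entity can be thought of as a representation of a Pandas DataFrame. A collection of entities is called an EntitySet. In the 1.x API used in this post, DataFrames are added to an EntitySet directly with add_dataframe.

Relationship

A relationship is the same abstract concept used in an RDBMS: a key column in a parent table is referenced by a column in a child table.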
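
As a rough illustration of the parent/child idea, the sketch below uses two tiny invented tables (not the Home Credit files). The column names mirror the ones used later in this post, but the values are made up. Each child row points at one parent row through a shared key, much like a foreign key in an RDBMS.

```python
import pandas as pd

# Toy parent table: one row per current application (values are invented).
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

# Toy child table: several previous applications may point at one parent row.
prev_apps = pd.DataFrame({
    "SK_ID_PREV": [10, 11, 12],
    "SK_ID_CURR": [1, 1, 2],
})

# Foreign-key check: every child key must exist in the parent table.
fk_is_valid = bool(prev_apps["SK_ID_CURR"].isin(apps["SK_ID_CURR"]).all())

# One-to-many: count child rows per parent key (parents without children get 0).
children_per_parent = (
    prev_apps.groupby("SK_ID_CURR").size()
    .reindex(apps["SK_ID_CURR"].tolist(), fill_value=0)
)
print(fk_is_valid)
print(children_per_parent.to_dict())
```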

Deep Feature Synthesis

Deep Feature Synthesis (DFS) is the feature engineering method at the core of Featuretools.

It can generate new features from a single DataFrame as well as from multiple related DataFrames.

Feature primitive

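DFS generates features by applying feature primitives across the entity relationships in an EntitySet.

These primitives are the operations commonly used when building features by hand. For example, the "mean" primitive computes the mean of a variable at an aggregation level.

There are two main kinds of primitive: Aggregation and Transformation.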

Aggregation

Groups the child rows that belong to each parent row and computes statistics such as min, max, mean, and standard deviation.
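
In plain pandas terms, an aggregation primitive behaves roughly like a groupby on the parent key. The sketch below uses invented values and is only an approximation of what Featuretools computes internally.

```python
import pandas as pd

# Toy child table (invented values): previous applications per current application.
prev_apps = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT": [100.0, 300.0, 50.0],
})

# One row per parent key, one column per statistic.
agg = prev_apps.groupby("SK_ID_CURR")["AMT_CREDIT"].agg(["min", "max", "mean", "std"])
print(agg)
```

A parent with a single child row gets NaN for the standard deviation, which is one reason aggregated features often contain missing values.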

Transformation

An operation applied to one or more columns of a single table, for example computing the difference between two column values.
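
A transformation keeps the number of rows unchanged and adds a new column. The pandas sketch below, with invented values, shows the column-difference case mentioned above; it is an approximation of the idea, not Featuretools' own implementation.

```python
import pandas as pd

# Toy table with invented values.
apps = pd.DataFrame({
    "AMT_CREDIT": [500.0, 800.0],
    "AMT_ANNUITY": [50.0, 100.0],
})

# Row-wise difference of two columns: one output value per input row.
apps["AMT_CREDIT - AMT_ANNUITY"] = apps["AMT_CREDIT"] - apps["AMT_ANNUITY"]
print(apps)
```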

Implementation

read data & merge data

import featuretools as ft 
import gc
import numpy as np
import pandas as pd
import warnings
warnings.simplefilter('ignore')
from os.path import join as pjoin


data_dir = "./home-credit-risk"

filepaths = {
    'data_desc': pjoin(data_dir, 'HomeCredit_columns_description.csv'),
    'app_train': pjoin(data_dir, 'application_train.csv'),
    'app_test': pjoin(data_dir, 'application_test.csv'),
    'bureau': pjoin(data_dir, 'bureau.csv'),
    'bureau_bl': pjoin(data_dir, 'bureau_balance.csv'),
    'credit_bl': pjoin(data_dir, 'credit_card_balance.csv'),
    'install_pays': pjoin(data_dir, 'installments_payments.csv'),
    'pc_balance': pjoin(data_dir, 'POS_CASH_balance.csv'),
    'app_prev': pjoin(data_dir, 'previous_application.csv'),

}

nrows = 10000

# load main datasets
df_train = pd.read_csv(
    filepaths['app_train'], 
    low_memory=False, engine='c',
    nrows=nrows,
)
df_test = pd.read_csv(
    filepaths['app_test'], 
    low_memory=False, 
    engine='c',
)
df_joint = pd.concat([df_train, df_test])

del df_train, df_test
gc.collect()

df_app_prev = pd.read_csv(
    filepaths['app_prev'], 
    engine='c', 
    low_memory=False,
    # first X*3 rows are taken for faster calculations, substitute this by whole dataset
    nrows=nrows*3,
)

transform type

int_cols = df_joint.select_dtypes(include=[np.int64]).columns
float_cols = df_joint.select_dtypes(include=[np.float64]).columns 

df_joint[int_cols] = df_joint[int_cols].astype(np.int32)
df_joint[float_cols] = df_joint[float_cols].astype(np.float32)

# df_joint.set_index('SK_ID_CURR', inplace=True, drop=True)
target_col = 'TARGET'


int_cols = df_app_prev.select_dtypes(include=[np.int64]).columns
float_cols = df_app_prev.select_dtypes(include=[np.float64]).columns 

df_app_prev[int_cols] = df_app_prev[int_cols].astype(np.int32)
df_app_prev[float_cols] = df_app_prev[float_cols].astype(np.float32)

today = pd.to_datetime('2018-06-11')
df_app_prev['DAYS_DECISION'] = today + pd.to_timedelta(df_app_prev['DAYS_DECISION'], unit='d')
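
The last step above turns DAYS_DECISION from a relative day offset into a timestamp so that it can serve as a time index. A small sketch with invented offsets, using the same reference date as above:

```python
import pandas as pd

reference_date = pd.to_datetime("2018-06-11")  # same reference date as above
days_decision = pd.Series([-10, -1, 0])        # invented relative offsets in days

# Adding a timedelta Series to a single Timestamp yields a datetime Series.
decision_time = reference_date + pd.to_timedelta(days_decision, unit="d")
print(decision_time.dt.strftime("%Y-%m-%d").tolist())
```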

define entity set

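The parts of the existing examples that raise errors have been fixed.

Two entities are defined, along with the relationship between them.

A new table is created so that it can be used for groupby.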

# import the Woodwork logical type used to override inferred column types
from woodwork.logical_types import Categorical

es = ft.EntitySet('application_data')
# add entities (application table itself)
es.add_dataframe(
    dataframe_name='apps', # define entity id
    dataframe=df_joint.drop('TARGET', axis=1), # select underlying data
    index='SK_ID_CURR', # define unique index column
    # specify some datatypes manually (if needed)
    logical_types={
        f: Categorical 
        for f in df_joint.columns if f.startswith('FLAG_')
    }
)
# add entities (previous applications table)
es = es.add_dataframe(
    dataframe_name = 'prev_apps', 
    dataframe = df_app_prev,
    index = 'SK_ID_PREV',
    time_index = 'DAYS_DECISION',
    logical_types={
        f: Categorical 
        for f in df_app_prev.columns if f.startswith('NFLAG_')
    }
)
# add relationships
r_app_cur_to_app_prev = ft.Relationship(
    entityset=es,
    parent_dataframe_name="apps", parent_column_name="SK_ID_CURR",
    child_dataframe_name="prev_apps" , child_column_name="SK_ID_CURR"
)
# Add the relationship to the entity set
es = es.add_relationship(relationship=r_app_cur_to_app_prev)
# Create new table for groupby
es.normalize_dataframe(new_dataframe_name="apps_new",
                    base_dataframe_name="apps",
                    index="NAME_CONTRACT_TYPE")
es
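
Conceptually, the normalize_dataframe call above pulls the distinct values of NAME_CONTRACT_TYPE out of apps into a new parent table, and apps becomes its child. The pandas sketch below approximates only the table-splitting part, with invented rows; the real call also registers the new relationship inside the EntitySet.

```python
import pandas as pd

# Toy version of the apps table (invented rows).
apps = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
})

# New parent table: one row per distinct contract type.
apps_new = (
    apps[["NAME_CONTRACT_TYPE"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(apps_new)
```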

Types of primitives

The built-in feature primitives that can be used with this data are listed below.

Because each of these can be applied per column, the number of generated features can become enormous.

	name	type	dask_compatible	koalas_compatible	description	valid_inputs
0	sum	aggregation	True	True	Calculates the total addition, ignoring `NaN`.	<ColumnSchema (Semantic Tags = ['numeric'])>
1	avg_time_between	aggregation	False	False	Computes the average number of seconds between consecutive events.	<ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>
2	num_true	aggregation	True	False	Counts the number of `True` values.	<ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>
3	max	aggregation	True	True	Calculates the highest value, ignoring `NaN` values.	<ColumnSchema (Semantic Tags = ['numeric'])>
4	mode	aggregation	False	False	Determines the most commonly repeated value.	<ColumnSchema (Semantic Tags = ['category'])>
5	all	aggregation	True	False	Calculates if all values are 'True' in a list.	<ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>
6	count	aggregation	True	True	Determines the total number of values, excluding `NaN`.	<ColumnSchema (Semantic Tags = ['index'])>
7	last	aggregation	False	False	Determines the last value in a list.
8	std	aggregation	True	True	Computes the dispersion relative to the mean value, ignoring `NaN`.	<ColumnSchema (Semantic Tags = ['numeric'])>
9	median	aggregation	False	False	Determines the middlemost number in a list of values.	<ColumnSchema (Semantic Tags = ['numeric'])>
10	n_most_common	aggregation	False	False	Determines the `n` most common elements.	<ColumnSchema (Semantic Tags = ['category'])>
11	num_unique	aggregation	True	True	Determines the number of distinct values, ignoring `NaN` values.	<ColumnSchema (Semantic Tags = ['category'])>
12	entropy	aggregation	False	False	Calculates the entropy for a categorical column	<ColumnSchema (Semantic Tags = ['category'])>
13	min	aggregation	True	True	Calculates the smallest value, ignoring `NaN` values.	<ColumnSchema (Semantic Tags = ['numeric'])>
14	time_since_last	aggregation	False	False	Calculates the time elapsed since the last datetime (default in seconds).	<ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>
15	trend	aggregation	False	False	Calculates the trend of a column over time.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>
16	mean	aggregation	True	True	Computes the average for a list of values.	<ColumnSchema (Semantic Tags = ['numeric'])>
17	any	aggregation	True	False	Determines if any value is 'True' in a list.	<ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>
18	time_since_first	aggregation	False	False	Calculates the time elapsed since the first datetime (in seconds).	<ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>
19	percent_true	aggregation	True	False	Determines the percent of `True` values.	<ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>
20	first	aggregation	False	False	Determines the first value in a list.
21	skew	aggregation	False	False	Computes the extent to which a distribution differs from a normal distribution.	<ColumnSchema (Semantic Tags = ['numeric'])>
22	is_null	transform	True	True	Determines if a value is null.
23	is_in_geobox	transform	False	False	Determines if coordinates are inside a box defined by two	<ColumnSchema (Logical Type = LatLong)>
24	less_than	transform	True	True	Determines if values in one list are less than another list.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>
25	less_than_scalar	transform	True	True	Determines if values are less than a given scalar.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>
26	numeric_lag	transform	False	False	Shifts an array of values by a specified number of periods.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Semantic Tags = ['time_index'])>
27	multiply_numeric_scalar	transform	True	True	Multiply each element in the list by a scalar.	<ColumnSchema (Semantic Tags = ['numeric'])>
28	percentile	transform	False	False	Determines the percentile rank for each value in a list.	<ColumnSchema (Semantic Tags = ['numeric'])>
29	rolling_max	transform	False	False	Determines the maximum of entries over a given window.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>
30	greater_than_equal_to	transform	True	True	Determines if values in one list are greater than or equal to another list.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>
31	add_numeric	transform	True	True	Element-wise addition of two lists.	<ColumnSchema (Semantic Tags = ['numeric'])>
32	is_free_email_domain	transform	False	False	Determines if an email address is from a free email domain.	<ColumnSchema (Logical Type = EmailAddress)>
33	negate	transform	True	True	Negates a numeric value.	<ColumnSchema (Semantic Tags = ['numeric'])>
34	hour	transform	True	True	Determines the hour value of a datetime.	<ColumnSchema (Logical Type = Datetime)>
35	equal_scalar	transform	True	True	Determines if values in a list are equal to a given scalar.
36	week	transform	True	True	Determines the week of the year from a datetime.	<ColumnSchema (Logical Type = Datetime)>
37	url_to_protocol	transform	False	False	Determines the protocol (http or https) of a url.	<ColumnSchema (Logical Type = URL)>
38	isin	transform	True	True	Determines whether a value is present in a provided list.
39	time_since_previous	transform	False	False	Compute the time since the previous entry in a list.	<ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>
40	modulo_numeric_scalar	transform	True	True	Return the modulo of each element in the list by a scalar.	<ColumnSchema (Semantic Tags = ['numeric'])>
41	or	transform	True	True	Element-wise logical OR of two lists.	<ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>
42	divide_numeric	transform	True	True	Element-wise division of two lists.	<ColumnSchema (Semantic Tags = ['numeric'])>
43	cum_sum	transform	False	False	Calculates the cumulative sum.	<ColumnSchema (Semantic Tags = ['numeric'])>
44	multiply_numeric	transform	True	True	Element-wise multiplication of two lists.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>
45	num_characters	transform	True	True	Calculates the number of characters in a string.	<ColumnSchema (Logical Type = NaturalLanguage)>
46	cityblock_distance	transform	False	False	Calculates the distance between points in a city road grid.	<ColumnSchema (Logical Type = LatLong)>
47	divide_by_feature	transform	True	True	Divide a scalar by each value in the list.	<ColumnSchema (Semantic Tags = ['numeric'])>
48	divide_numeric_scalar	transform	True	True	Divide each element in the list by a scalar.	<ColumnSchema (Semantic Tags = ['numeric'])>
49	subtract_numeric	transform	True	False	Element-wise subtraction of two lists.	<ColumnSchema (Semantic Tags = ['numeric'])>
50	subtract_numeric_scalar	transform	True	True	Subtract a scalar from each element in the list.	<ColumnSchema (Semantic Tags = ['numeric'])>
51	is_weekend	transform	True	True	Determines if a date falls on a weekend.	<ColumnSchema (Logical Type = Datetime)>
52	equal	transform	True	True	Determines if values in one list are equal to another list.
53	rolling_std	transform	False	False	Calculates the standard deviation of entries over a given window.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>
54	haversine	transform	False	False	Calculates the approximate haversine distance between two LatLong columns.	<ColumnSchema (Logical Type = LatLong)>
55	time_since	transform	True	False	Calculates time from a value to a specified cutoff datetime.	<ColumnSchema (Logical Type = Datetime)>
56	cum_min	transform	False	False	Calculates the cumulative minimum.	<ColumnSchema (Semantic Tags = ['numeric'])>
57	absolute	transform	True	True	Computes the absolute value of a number.	<ColumnSchema (Semantic Tags = ['numeric'])>
58	and	transform	True	True	Element-wise logical AND of two lists.	<ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>
59	less_than_equal_to	transform	True	True	Determines if values in one list are less than or equal to another list.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>
60	multiply_boolean	transform	True	False	Element-wise multiplication of two lists of boolean values.	<ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>
61	latitude	transform	False	False	Returns the first tuple value in a list of LatLong tuples.	<ColumnSchema (Logical Type = LatLong)>
62	rolling_min	transform	False	False	Determines the minimum of entries over a given window.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>
63	greater_than_equal_to_scalar	transform	True	True	Determines if values are greater than or equal to a given scalar.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>
64	email_address_to_domain	transform	False	False	Determines the domain of an email	<ColumnSchema (Logical Type = EmailAddress)>
65	not	transform	True	True	Negates a boolean value.	<ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)>
66	cum_count	transform	False	False	Calculates the cumulative count.	<ColumnSchema (Semantic Tags = ['category'])>, <ColumnSchema (Semantic Tags = ['foreign_key'])>
67	month	transform	True	True	Determines the month value of a datetime.	<ColumnSchema (Logical Type = Datetime)>
68	greater_than_scalar	transform	True	True	Determines if values are greater than a given scalar.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>
69	not_equal	transform	True	False	Determines if values in one list are not equal to another list.
70	not_equal_scalar	transform	True	True	Determines if values in a list are not equal to a given scalar.
71	url_to_tld	transform	False	False	Determines the top level domain of a url.	<ColumnSchema (Logical Type = URL)>
72	diff	transform	False	False	Compute the difference between the value in a list and the	<ColumnSchema (Semantic Tags = ['numeric'])>
73	greater_than	transform	True	False	Determines if values in one list are greater than another list.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>
74	minute	transform	True	True	Determines the minutes value of a datetime.	<ColumnSchema (Logical Type = Datetime)>
75	modulo_by_feature	transform	True	True	Return the modulo of a scalar by each element in the list.	<ColumnSchema (Semantic Tags = ['numeric'])>
76	url_to_domain	transform	False	False	Determines the domain of a url.	<ColumnSchema (Logical Type = URL)>
77	num_words	transform	True	True	Determines the number of words in a string by counting the spaces.	<ColumnSchema (Logical Type = NaturalLanguage)>
78	second	transform	True	True	Determines the seconds value of a datetime.	<ColumnSchema (Logical Type = Datetime)>
79	modulo_numeric	transform	True	True	Element-wise modulo of two lists.	<ColumnSchema (Semantic Tags = ['numeric'])>
80	scalar_subtract_numeric_feature	transform	True	True	Subtract each value in the list from a given scalar.	<ColumnSchema (Semantic Tags = ['numeric'])>
81	weekday	transform	True	True	Determines the day of the week from a datetime.	<ColumnSchema (Logical Type = Datetime)>
82	geomidpoint	transform	False	False	Determines the geographic center of two coordinates.	<ColumnSchema (Logical Type = LatLong)>
83	add_numeric_scalar	transform	True	True	Add a scalar to each value in the list.	<ColumnSchema (Semantic Tags = ['numeric'])>
84	rolling_count	transform	False	False	Determines a rolling count of events over a given window.	<ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>
85	age	transform	True	False	Calculates the age in years as a floating point number given a	<ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['date_of_birth'])>
86	cum_max	transform	False	False	Calculates the cumulative maximum.	<ColumnSchema (Semantic Tags = ['numeric'])>
87	day	transform	True	True	Determines the day of the month from a datetime.	<ColumnSchema (Logical Type = Datetime)>
88	year	transform	True	True	Determines the year value of a datetime.	<ColumnSchema (Logical Type = Datetime)>
89	less_than_equal_to_scalar	transform	True	True	Determines if values are less than or equal to a given scalar.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)>
90	rolling_mean	transform	False	False	Calculates the mean of entries over a given window.	<ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])>
91	longitude	transform	False	False	Returns the second tuple value in a list of LatLong tuples.	<ColumnSchema (Logical Type = LatLong)>
92	cum_mean	transform	False	False	Calculates the cumulative mean.	<ColumnSchema (Semantic Tags = ['numeric'])>

DFS) Feature List

You can inspect the definitions of the features that will be generated.

Here, a total of 328 features are generated.

groupby_trans_primitives accepts transform primitives only.

feature_defs = ft.dfs(
    entityset=es, 
    target_dataframe_name="apps", 
    features_only=True,
    agg_primitives=[
        "avg_time_between",
        "time_since_last", 
        "num_unique", 
        "mean", 
        "sum", 
    ],
    trans_primitives=[
        "time_since_previous",
        #"add",
    ],
    groupby_trans_primitives=["cum_mean","cum_min"],
    max_depth=1,
    training_window=ft.Timedelta(60, "d"), # use only the last 60 days before the cutoff time (the current time if cutoff_time is not passed)
    max_features=1000,
    chunk_size=10000,
    verbose=True,
)
print(feature_defs)
# Built 328 features
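
The groupby_trans_primitives used above ("cum_mean", "cum_min") apply a transform within each group rather than over the whole column. A rough pandas equivalent on invented data, grouping by contract type, might look like this; it ignores details such as missing-value handling in the real primitives.

```python
import pandas as pd

# Toy table (invented values); "a" and "b" stand in for contract types.
apps = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["a", "a", "b", "a"],
    "AMT_CREDIT": [1.0, 3.0, 10.0, 5.0],
})

grouped = apps.groupby("NAME_CONTRACT_TYPE")["AMT_CREDIT"]

# Running mean within each group: cumulative sum divided by the running row count.
cum_mean = grouped.cumsum() / (grouped.cumcount() + 1)
# Running minimum within each group.
cum_min = grouped.cummin()

apps = apps.assign(cum_mean_by_type=cum_mean, cum_min_by_type=cum_min)
print(apps)
```

The output has the same number of rows as the input, which is why only transform primitives make sense here.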

DFS) Feature Generation

fm, feature_defs = ft.dfs(
    entityset=es, 
    target_dataframe_name="apps", 
    features_only=False,
    agg_primitives=[
        "avg_time_between",
        "time_since_last", 
        "num_unique", 
        "mean", 
        "sum", 
    ],
    trans_primitives=[
        "time_since_previous",
        #"add",
    ],
    groupby_trans_primitives=["cum_mean","cum_min"],
    max_depth=1,
    training_window=ft.Timedelta(60, "d"), # use only the last 60 days before the cutoff time (the current time if cutoff_time is not passed)
    max_features=1000,
    chunk_size=10000,
    verbose=True,
)
fm.shape # (58744, 328)
fm = fm.drop_duplicates()
print(fm.shape)
fm[50:100]
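
The effect of training_window=ft.Timedelta(60, "d") can be pictured as filtering the child rows by their time index before aggregating. The pandas sketch below uses invented rows and a hypothetical cutoff of 2018-06-11; the exact boundary handling (exclusive start, inclusive end) is an assumption made for illustration and may not match Featuretools precisely.

```python
import pandas as pd

# Toy child table with a time index (invented values).
prev_apps = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "DAYS_DECISION": pd.to_datetime(["2018-01-01", "2018-06-01", "2018-05-20"]),
    "AMT_CREDIT": [100.0, 300.0, 50.0],
})

cutoff = pd.Timestamp("2018-06-11")  # hypothetical cutoff time
window = pd.Timedelta(days=60)

# Keep only rows that fall inside the window ending at the cutoff.
in_window = (prev_apps["DAYS_DECISION"] > cutoff - window) & (
    prev_apps["DAYS_DECISION"] <= cutoff
)
recent = prev_apps[in_window]

# Aggregate the remaining rows per parent key.
mean_recent = recent.groupby("SK_ID_CURR")["AMT_CREDIT"].mean()
print(mean_recent.to_dict())
```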

check data type

Far more float (statistical) features are generated than existed in the original data.

# count features per dtype in the generated feature matrix
fm_dtype = pd.DataFrame(fm.dtypes).reset_index(drop=False)
fm_dtype.columns = ["feature", "dtype"]
fm_dtype["dtype"] = fm_dtype["dtype"].astype(str)
fm_dtype.groupby("dtype")["feature"].nunique()

# same count for the original data, for comparison
ori_df_dtype = pd.DataFrame(df_joint.dtypes).reset_index(drop=False)
ori_df_dtype.columns = ["feature", "dtype"]
ori_df_dtype["dtype"] = ori_df_dtype["dtype"].astype(str)
ori_df_dtype.groupby("dtype")["feature"].nunique()

Featuretools generates many features at once from a wide range of statistics, and personally I like it. That said, screening out the meaningless features after generation looks like a mandatory step.
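
As one possible starting point for that screening step, the sketch below drops columns that are almost entirely null or that hold a single value. It uses plain pandas on an invented feature matrix, and the 0.95 threshold is an arbitrary assumption. Featuretools also ships a featuretools.selection module intended for this kind of cleanup, which is not shown here.

```python
import numpy as np
import pandas as pd

# Toy feature matrix (invented values) standing in for the DFS output.
fm = pd.DataFrame({
    "all_null": [np.nan, np.nan, np.nan, np.nan],
    "constant": [1.0, 1.0, 1.0, 1.0],
    "useful": [0.1, 0.7, 0.3, 0.9],
})

# Columns whose share of missing values is at or above the threshold.
null_fraction = fm.isnull().mean()
mostly_null = null_fraction[null_fraction >= 0.95].index

# Columns with at most one distinct non-null value carry no signal.
n_unique = fm.nunique(dropna=True)
constant = n_unique[n_unique <= 1].index

fm_reduced = fm.drop(columns=mostly_null.union(constant))
print(fm_reduced.columns.tolist())
```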

Reference

* https://www.kaggle.com/frednavruzov/auto-feature-generation-featuretools-example

* https://github.com/alteryx/open_source_demos/blob/main/predict-next-purchase/Tutorial.ipynb

* https://analyticsindiamag.com/introduction-to-featuretools-a-python-framework-for-automated-feature-engineering/

* https://docs.featuretools.com/en/v0.16.0/ecosystem.html

* https://www.kaggle.com/willkoehrsen/featuretools-for-good

* https://www.kaggle.com/willkoehrsen/tuning-automated-feature-engineering-exploratory

* https://medium.com/analytics-vidhya/feature-engineering-using-featuretools-with-code-10f8c83e5f68
