2022. 1. 22. 14:02 · Analysis Python / Implementation and Resources
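This post is written against featuretools version 1.4.0.
Most of the example code available has not been updated for 1.4.0 and was hard to run as-is, so this post walks through a simple feature-generation exercise and summarizes what worked.
Data: home-credit-risk (https://www.kaggle.com/c/home-credit-default-risk/data)
Package installation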
pip install featuretools==1.4.0
Featuretools
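Featuretools is an open-source library for automated feature engineering.
It is designed to speed up the feature-generation process so that more time can be spent on other aspects of building a machine learning model. In other words, it gets the data into a "machine-learning-ready" state.
There are three main concepts to understand: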
- Entities
- Deep Feature Synthesis (DFS)
- Feature primitive
Entities
An Entity can be thought of as a representation of a Pandas DataFrame. A collection of Entities is called an EntitySet.
Relationship
A relationship is the same abstract concept as the one used in an RDBMS.
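To make the RDBMS analogy concrete, here is a minimal sketch in plain pandas (not the Featuretools API) of a one-to-many parent/child relationship. The toy tables and values are made up for illustration; only the column names `SK_ID_CURR` and `SK_ID_PREV` come from the dataset used later in this post.

```python
import pandas as pd

# Toy parent table: one row per current application (hypothetical data).
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

# Toy child table: several previous applications can point at one parent row.
prev_apps = pd.DataFrame({
    "SK_ID_PREV": [10, 11, 12, 13],
    "SK_ID_CURR": [1, 1, 2, 3],
})

# A one-to-many relationship needs a unique key on the parent side ...
parent_key_is_unique = apps["SK_ID_CURR"].is_unique
# ... and every child key must reference an existing parent row.
child_keys_are_valid = prev_apps["SK_ID_CURR"].isin(apps["SK_ID_CURR"]).all()

# Joining child to parent on the shared key follows the relationship.
joined = prev_apps.merge(apps, on="SK_ID_CURR", how="left")
print(parent_key_is_unique, child_keys_are_valid, len(joined))
```

The two checks correspond to the primary-key and foreign-key constraints of a relational database.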
Deep Feature Synthesis
DFS (Deep Feature Synthesis) is the actual feature engineering method and the backbone of Featuretools.
It can generate new features from a single dataframe as well as from multiple dataframes.
Feature primitive
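DFS generates features by applying feature primitives to the entity relationships in an EntitySet.
These primitives are the operations commonly used when creating features by hand. For example, the "mean" primitive computes the mean of a variable at the aggregation level.
The two main kinds of primitive are Aggregation and Transformation.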
Aggregation
Groups the child rows that belong to each parent-table row and computes statistics such as the minimum, maximum, mean, and standard deviation.
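As a rough illustration of what an aggregation primitive computes, the following sketch reproduces the idea with a plain pandas `groupby`. The data is made up, and the column renaming only loosely mimics the style of feature names Featuretools produces; it is an assumption for readability, not output copied from the library.

```python
import pandas as pd

# Toy child table: previous applications keyed by the parent id (hypothetical data).
prev_apps = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT": [100.0, 300.0, 50.0],
})

# One row per parent id, one column per statistic.
agg = prev_apps.groupby("SK_ID_CURR")["AMT_CREDIT"].agg(["min", "max", "mean", "std"])

# Rename the columns to something like "MEAN(prev_apps.AMT_CREDIT)".
agg.columns = [f"{name.upper()}(prev_apps.AMT_CREDIT)" for name in agg.columns]
print(agg)
```

A parent with a single child row gets NaN for the standard deviation, which is one reason aggregated features often contain missing values.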
Transformation
An operation performed on one or more columns of a single table. For example, computing the difference between the values of two columns.
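A transformation keeps the number of rows unchanged and only adds columns. A minimal pandas sketch of the column-difference example, using made-up values and an illustrative feature name:

```python
import pandas as pd

# Toy single table (hypothetical values).
apps = pd.DataFrame({
    "AMT_CREDIT": [500.0, 200.0],
    "AMT_ANNUITY": [50.0, 20.0],
})

# Row-wise difference of two columns; the row count does not change.
apps["AMT_CREDIT - AMT_ANNUITY"] = apps["AMT_CREDIT"] - apps["AMT_ANNUITY"]
print(apps)
```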
Implementation
read data & merge data
import featuretools as ft
import gc
import numpy as np
import pandas as pd
import warnings
warnings.simplefilter('ignore')
from os.path import join as pjoin
data_dir = "./home-credit-risk"
filepaths = {
'data_desc': pjoin(data_dir, 'HomeCredit_columns_description.csv'),
'app_train': pjoin(data_dir, 'application_train.csv'),
'app_test': pjoin(data_dir, 'application_test.csv'),
'bureau': pjoin(data_dir, 'bureau.csv'),
'bureau_bl': pjoin(data_dir, 'bureau_balance.csv'),
'credit_bl': pjoin(data_dir, 'credit_card_balance.csv'),
'install_pays': pjoin(data_dir, 'installments_payments.csv'),
'pc_balance': pjoin(data_dir, 'POS_CASH_balance.csv'),
'app_prev': pjoin(data_dir, 'previous_application.csv'),
}
nrows = 10000
# load main datasets
df_train = pd.read_csv(
filepaths['app_train'],
low_memory=False, engine='c',
nrows=nrows,
)
df_test = pd.read_csv(
filepaths['app_test'],
low_memory=False,
engine='c',
)
df_joint = pd.concat([df_train, df_test])
del df_train, df_test
gc.collect()
df_app_prev = pd.read_csv(
filepaths['app_prev'],
engine='c',
low_memory=False,
# first X*3 rows are taken for faster calculations, substitute this by whole dataset
nrows=nrows*3,
)
transform type
int_cols = df_joint.select_dtypes(include=[np.int64]).columns
float_cols = df_joint.select_dtypes(include=[np.float64]).columns
df_joint[int_cols] = df_joint[int_cols].astype(np.int32)
df_joint[float_cols] = df_joint[float_cols].astype(np.float32)
# df_joint.set_index('SK_ID_CURR', inplace=True, drop=True)
target_col = 'TARGET'
int_cols = df_app_prev.select_dtypes(include=[np.int64]).columns
float_cols = df_app_prev.select_dtypes(include=[np.float64]).columns
df_app_prev[int_cols] = df_app_prev[int_cols].astype(np.int32)
df_app_prev[float_cols] = df_app_prev[float_cols].astype(np.float32)
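The same downcasting steps are applied to both tables above, so they could be wrapped in a small helper. This is only a sketch: `downcast_numeric` is a hypothetical name, and it returns a converted copy instead of modifying the dataframe in place.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with int64 columns cast to int32 and float64 to float32."""
    int_cols = df.select_dtypes(include=[np.int64]).columns
    float_cols = df.select_dtypes(include=[np.float64]).columns
    dtype_map = {col: np.int32 for col in int_cols}
    dtype_map.update({col: np.float32 for col in float_cols})
    # Columns not listed in dtype_map keep their original dtype.
    return df.astype(dtype_map)


demo = pd.DataFrame({
    "a": np.array([1, 2], dtype=np.int64),
    "b": np.array([0.5, 1.5], dtype=np.float64),
    "c": ["x", "y"],
})
small = downcast_numeric(demo)
print(small.dtypes)
```

Note that int32 cannot hold values beyond roughly ±2.1 billion, so this is only safe when the id and count columns stay within that range.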
today = pd.to_datetime('2018-06-11')
df_app_prev['DAYS_DECISION'] = today + pd.to_timedelta(df_app_prev['DAYS_DECISION'], unit='d')
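`DAYS_DECISION` is stored as a day offset relative to the application, so the step above turns it into an absolute date by adding it to a reference date. A small sketch with made-up offsets shows the effect; the reference date is the same arbitrary `2018-06-11` used above.

```python
import pandas as pd

reference_date = pd.to_datetime("2018-06-11")

# Negative values mean "this many days before the reference date".
days_decision = pd.Series([-1, -30, -365])

decision_dates = reference_date + pd.to_timedelta(days_decision, unit="D")
print(decision_dates)
```

Having a real datetime column is what allows it to be used as a `time_index` later on.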
define entity set
The parts of existing example code that raise errors have been fixed.
A relationship is defined between the two entities.
A new table is created to be used for groupby.
# Categorical logical type, used below to tag the flag columns
from woodwork.logical_types import Categorical
es = ft.EntitySet('application_data')
# add entities (application table itself)
es.add_dataframe(
dataframe_name='apps', # define entity id
dataframe=df_joint.drop('TARGET', axis=1), # select underlying data
index='SK_ID_CURR', # define unique index column
# specify some datatypes manually (if needed)
logical_types={
f: Categorical
for f in df_joint.columns if f.startswith('FLAG_')
}
)
# add entities (previous applications table)
es = es.add_dataframe(
dataframe_name = 'prev_apps',
dataframe = df_app_prev,
index = 'SK_ID_PREV',
time_index = 'DAYS_DECISION',
logical_types={
f: Categorical
for f in df_app_prev.columns if f.startswith('NFLAG_')
}
)
# add relationships
r_app_cur_to_app_prev = ft.Relationship(
entityset=es,
parent_dataframe_name="apps", parent_column_name="SK_ID_CURR",
child_dataframe_name="prev_apps" , child_column_name="SK_ID_CURR"
)
# Add the relationship to the entity set
es = es.add_relationship(relationship=r_app_cur_to_app_prev)
# Create new table for groupby
es.normalize_dataframe(new_dataframe_name="apps_new",
base_dataframe_name="apps",
index="NAME_CONTRACT_TYPE")
es
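Conceptually, normalizing on `NAME_CONTRACT_TYPE` pulls the distinct values of that column out into their own table, which then acts as a parent of `apps`. The sketch below shows that idea in plain pandas with made-up rows; it is an approximation of the concept, not what `normalize_dataframe` does internally.

```python
import pandas as pd

# Toy version of the apps table (hypothetical rows).
apps = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
})

# New parent table: one row per distinct contract type.
apps_new = (
    apps[["NAME_CONTRACT_TYPE"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(apps_new)
```

Every row of `apps` now points at exactly one row of `apps_new`, which is what makes grouping by contract type possible.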
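Types of primitives
The default feature primitives that can be used with this data are listed below.
Since these can be applied to each variable, the total number of features that can be generated is enormous.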
index | name | type | dask_compatible | koalas_compatible | description | valid_inputs | return_type |
---|---|---|---|---|---|---|---|
0 | sum | aggregation | True | True | Calculates the total addition, ignoring NaN. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
1 | avg_time_between | aggregation | False | False | Computes the average number of seconds between consecutive events. | <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])> | |
2 | num_true | aggregation | True | False | Counts the number of True values. | <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)> | |
3 | max | aggregation | True | True | Calculates the highest value, ignoring NaN values. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
4 | mode | aggregation | False | False | Determines the most commonly repeated value. | <ColumnSchema (Semantic Tags = ['category'])> | |
5 | all | aggregation | True | False | Calculates if all values are 'True' in a list. | <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)> | |
6 | count | aggregation | True | True | Determines the total number of values, excluding NaN. | <ColumnSchema (Semantic Tags = ['index'])> | |
7 | last | aggregation | False | False | Determines the last value in a list. | ||
8 | std | aggregation | True | True | Computes the dispersion relative to the mean value, ignoring NaN. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
9 | median | aggregation | False | False | Determines the middlemost number in a list of values. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
10 | n_most_common | aggregation | False | False | Determines the n most common elements. | <ColumnSchema (Semantic Tags = ['category'])> | |
11 | num_unique | aggregation | True | True | Determines the number of distinct values, ignoring NaN values. | <ColumnSchema (Semantic Tags = ['category'])> | |
12 | entropy | aggregation | False | False | Calculates the entropy for a categorical column | <ColumnSchema (Semantic Tags = ['category'])> | |
13 | min | aggregation | True | True | Calculates the smallest value, ignoring NaN values. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
14 | time_since_last | aggregation | False | False | Calculates the time elapsed since the last datetime (default in seconds). | <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])> | |
15 | trend | aggregation | False | False | Calculates the trend of a column over time. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])> | |
16 | mean | aggregation | True | True | Computes the average for a list of values. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
17 | any | aggregation | True | False | Determines if any value is 'True' in a list. | <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)> | |
18 | time_since_first | aggregation | False | False | Calculates the time elapsed since the first datetime (in seconds). | <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])> | |
19 | percent_true | aggregation | True | False | Determines the percent of True values. | <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)> | |
20 | first | aggregation | False | False | Determines the first value in a list. | ||
21 | skew | aggregation | False | False | Computes the extent to which a distribution differs from a normal distribution. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
22 | is_null | transform | True | True | Determines if a value is null. | ||
23 | is_in_geobox | transform | False | False | Determines if coordinates are inside a box defined by two | <ColumnSchema (Logical Type = LatLong)> | |
24 | less_than | transform | True | True | Determines if values in one list are less than another list. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)> | |
25 | less_than_scalar | transform | True | True | Determines if values are less than a given scalar. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)> | |
26 | numeric_lag | transform | False | False | Shifts an array of values by a specified number of periods. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Semantic Tags = ['time_index'])> | |
27 | multiply_numeric_scalar | transform | True | True | Multiply each element in the list by a scalar. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
28 | percentile | transform | False | False | Determines the percentile rank for each value in a list. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
29 | rolling_max | transform | False | False | Determines the maximum of entries over a given window. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])> | |
30 | greater_than_equal_to | transform | True | True | Determines if values in one list are greater than or equal to another list. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)> | |
31 | add_numeric | transform | True | True | Element-wise addition of two lists. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
32 | is_free_email_domain | transform | False | False | Determines if an email address is from a free email domain. | <ColumnSchema (Logical Type = EmailAddress)> | |
33 | negate | transform | True | True | Negates a numeric value. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
34 | hour | transform | True | True | Determines the hour value of a datetime. | <ColumnSchema (Logical Type = Datetime)> | |
35 | equal_scalar | transform | True | True | Determines if values in a list are equal to a given scalar. | ||
36 | week | transform | True | True | Determines the week of the year from a datetime. | <ColumnSchema (Logical Type = Datetime)> | |
37 | url_to_protocol | transform | False | False | Determines the protocol (http or https) of a url. | <ColumnSchema (Logical Type = URL)> | |
38 | isin | transform | True | True | Determines whether a value is present in a provided list. | ||
39 | time_since_previous | transform | False | False | Compute the time since the previous entry in a list. | <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])> | |
40 | modulo_numeric_scalar | transform | True | True | Return the modulo of each element in the list by a scalar. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
41 | or | transform | True | True | Element-wise logical OR of two lists. | <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)> | |
42 | divide_numeric | transform | True | True | Element-wise division of two lists. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
43 | cum_sum | transform | False | False | Calculates the cumulative sum. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
44 | multiply_numeric | transform | True | True | Element-wise multiplication of two lists. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)> | |
45 | num_characters | transform | True | True | Calculates the number of characters in a string. | <ColumnSchema (Logical Type = NaturalLanguage)> | |
46 | cityblock_distance | transform | False | False | Calculates the distance between points in a city road grid. | <ColumnSchema (Logical Type = LatLong)> | |
47 | divide_by_feature | transform | True | True | Divide a scalar by each value in the list. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
48 | divide_numeric_scalar | transform | True | True | Divide each element in the list by a scalar. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
49 | subtract_numeric | transform | True | False | Element-wise subtraction of two lists. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
50 | subtract_numeric_scalar | transform | True | True | Subtract a scalar from each element in the list. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
51 | is_weekend | transform | True | True | Determines if a date falls on a weekend. | <ColumnSchema (Logical Type = Datetime)> | |
52 | equal | transform | True | True | Determines if values in one list are equal to another list. | ||
53 | rolling_std | transform | False | False | Calculates the standard deviation of entries over a given window. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])> | |
54 | haversine | transform | False | False | Calculates the approximate haversine distance between two LatLong columns. | <ColumnSchema (Logical Type = LatLong)> | |
55 | time_since | transform | True | False | Calculates time from a value to a specified cutoff datetime. | <ColumnSchema (Logical Type = Datetime)> | |
56 | cum_min | transform | False | False | Calculates the cumulative minimum. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
57 | absolute | transform | True | True | Computes the absolute value of a number. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
58 | and | transform | True | True | Element-wise logical AND of two lists. | <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)> | |
59 | less_than_equal_to | transform | True | True | Determines if values in one list are less than or equal to another list. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)> | |
60 | multiply_boolean | transform | True | False | Element-wise multiplication of two lists of boolean values. | <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)> | |
61 | latitude | transform | False | False | Returns the first tuple value in a list of LatLong tuples. | <ColumnSchema (Logical Type = LatLong)> | |
62 | rolling_min | transform | False | False | Determines the minimum of entries over a given window. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])> | |
63 | greater_than_equal_to_scalar | transform | True | True | Determines if values are greater than or equal to a given scalar. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)> | |
64 | email_address_to_domain | transform | False | False | Determines the domain of an email | <ColumnSchema (Logical Type = EmailAddress)> | |
65 | not | transform | True | True | Negates a boolean value. | <ColumnSchema (Logical Type = BooleanNullable)>, <ColumnSchema (Logical Type = Boolean)> | |
66 | cum_count | transform | False | False | Calculates the cumulative count. | <ColumnSchema (Semantic Tags = ['category'])>, <ColumnSchema (Semantic Tags = ['foreign_key'])> | |
67 | month | transform | True | True | Determines the month value of a datetime. | <ColumnSchema (Logical Type = Datetime)> | |
68 | greater_than_scalar | transform | True | True | Determines if values are greater than a given scalar. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)> | |
69 | not_equal | transform | True | False | Determines if values in one list are not equal to another list. | ||
70 | not_equal_scalar | transform | True | True | Determines if values in a list are not equal to a given scalar. | ||
71 | url_to_tld | transform | False | False | Determines the top level domain of a url. | <ColumnSchema (Logical Type = URL)> | |
72 | diff | transform | False | False | Compute the difference between the value in a list and the | <ColumnSchema (Semantic Tags = ['numeric'])> | |
73 | greater_than | transform | True | False | Determines if values in one list are greater than another list. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)> | |
74 | minute | transform | True | True | Determines the minutes value of a datetime. | <ColumnSchema (Logical Type = Datetime)> | |
75 | modulo_by_feature | transform | True | True | Return the modulo of a scalar by each element in the list. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
76 | url_to_domain | transform | False | False | Determines the domain of a url. | <ColumnSchema (Logical Type = URL)> | |
77 | num_words | transform | True | True | Determines the number of words in a string by counting the spaces. | <ColumnSchema (Logical Type = NaturalLanguage)> | |
78 | second | transform | True | True | Determines the seconds value of a datetime. | <ColumnSchema (Logical Type = Datetime)> | |
79 | modulo_numeric | transform | True | True | Element-wise modulo of two lists. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
80 | scalar_subtract_numeric_feature | transform | True | True | Subtract each value in the list from a given scalar. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
81 | weekday | transform | True | True | Determines the day of the week from a datetime. | <ColumnSchema (Logical Type = Datetime)> | |
82 | geomidpoint | transform | False | False | Determines the geographic center of two coordinates. | <ColumnSchema (Logical Type = LatLong)> | |
83 | add_numeric_scalar | transform | True | True | Add a scalar to each value in the list. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
84 | rolling_count | transform | False | False | Determines a rolling count of events over a given window. | <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])> | |
85 | age | transform | True | False | Calculates the age in years as a floating point number given a | <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['date_of_birth'])> | |
86 | cum_max | transform | False | False | Calculates the cumulative maximum. | <ColumnSchema (Semantic Tags = ['numeric'])> | |
87 | day | transform | True | True | Determines the day of the month from a datetime. | <ColumnSchema (Logical Type = Datetime)> | |
88 | year | transform | True | True | Determines the year value of a datetime. | <ColumnSchema (Logical Type = Datetime)> | |
89 | less_than_equal_to_scalar | transform | True | True | Determines if values are less than or equal to a given scalar. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime)>, <ColumnSchema (Logical Type = Ordinal)> | |
90 | rolling_mean | transform | False | False | Calculates the mean of entries over a given window. | <ColumnSchema (Semantic Tags = ['numeric'])>, <ColumnSchema (Logical Type = Datetime) (Semantic Tags = ['time_index'])> | |
91 | longitude | transform | False | False | Returns the second tuple value in a list of LatLong tuples. | <ColumnSchema (Logical Type = LatLong)> | |
92 | cum_mean | transform | False | False | Calculates the cumulative mean. | <ColumnSchema (Semantic Tags = ['numeric'])> |
DFS) Feature List
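With features_only=True you can inspect the definitions of the features that would be generated.
In total, 328 features are generated here.
groupby_trans_primitives accepts only transform primitives.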
feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="apps",
features_only=True,
agg_primitives=[
"avg_time_between",
"time_since_last",
"num_unique",
"mean",
"sum",
],
trans_primitives=[
"time_since_previous",
#"add",
],
groupby_trans_primitives=["cum_mean","cum_min"],
max_depth=1,
training_window=ft.Timedelta(60, "d"), # use only last X days in computations
max_features=1000,
chunk_size=10000,
verbose=True,
)
print(feature_defs)
# Built 328 features
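The `cum_mean` and `cum_min` primitives passed to `groupby_trans_primitives` above are computed within each group rather than over the whole table. The following pandas sketch, with made-up data and illustrative column names, shows roughly what a grouped cumulative mean and minimum look like:

```python
import pandas as pd

# Toy table with a grouping column (hypothetical values).
apps = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash", "Cash", "Revolving", "Cash"],
    "AMT_CREDIT": [100.0, 300.0, 50.0, 200.0],
})

grouped = apps.groupby("NAME_CONTRACT_TYPE")["AMT_CREDIT"]

# Running mean within each group: running sum divided by running row count.
apps["cum_mean_by_type"] = grouped.cumsum() / (grouped.cumcount() + 1)
# Running minimum within each group.
apps["cum_min_by_type"] = grouped.cummin()
print(apps)
```

Because the result has one value per input row, only transform-style operations fit here, which matches the restriction noted above.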
DFS) Generation Feature
fm, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name="apps",
features_only=False,
agg_primitives=[
"avg_time_between",
"time_since_last",
"num_unique",
"mean",
"sum",
],
trans_primitives=[
"time_since_previous",
#"add",
],
groupby_trans_primitives=["cum_mean","cum_min"],
max_depth=1,
training_window=ft.Timedelta(60, "d"), # use only last X days in computations
max_features=1000,
chunk_size=10000,
verbose=True,
)
fm.shape # (58744, 328)
fm = fm.drop_duplicates()
print(fm.shape)
fm[50:100]
check data type
Far more float (statistical) variables are generated than exist in the original data.
fm_dtype = pd.DataFrame(fm.dtypes).reset_index(drop=False)
fm_dtype.columns = ["feature", "dtype"]
fm_dtype["dtype"] = fm_dtype["dtype"].astype(str)
fm_dtype.groupby("dtype")["feature"].nunique()  # number of features per dtype
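ori_df_dtype = pd.DataFrame(df_joint.dtypes).reset_index(drop=False)
ori_df_dtype.columns = ["feature", "dtype"]
ori_df_dtype["dtype"] = ori_df_dtype["dtype"].astype(str)  # same string conversion as for fm
ori_df_dtype.groupby("dtype")["feature"].nunique()  # number of original columns per dtype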
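This post looked at Featuretools, which generates features in bulk from a variety of statistics, and I personally like it. However, once the features have been generated, a step to find and remove the meaningless ones appears to be essential.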
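As one possible starting point for that pruning step, the sketch below drops columns that are almost entirely null or that hold a single value. It uses plain pandas; `drop_uninformative` and the `max_null_ratio` threshold are hypothetical choices for illustration, not part of the workflow above.

```python
import numpy as np
import pandas as pd


def drop_uninformative(df: pd.DataFrame, max_null_ratio: float = 0.95) -> pd.DataFrame:
    """Drop columns that are mostly null or contain at most one distinct value."""
    null_ratio = df.isna().mean()
    too_null = null_ratio[null_ratio > max_null_ratio].index
    n_unique = df.nunique(dropna=True)
    constant = n_unique[n_unique <= 1].index
    return df.drop(columns=too_null.union(constant))


# Toy feature matrix (hypothetical values).
fm_demo = pd.DataFrame({
    "useful": [1.0, 2.0, 3.0, 4.0],
    "constant": [7, 7, 7, 7],
    "all_null": [np.nan] * 4,
})
pruned = drop_uninformative(fm_demo)
print(pruned.columns.tolist())
```

Filters like this only remove trivially useless columns; checking correlation or model-based importance would still be needed to find features that carry no predictive signal.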
Reference
* https://www.kaggle.com/frednavruzov/auto-feature-generation-featuretools-example
* https://github.com/alteryx/open_source_demos/blob/main/predict-next-purchase/Tutorial.ipynb
* https://docs.featuretools.com/en/v0.16.0/ecosystem.html
* https://www.kaggle.com/willkoehrsen/featuretools-for-good
* https://www.kaggle.com/willkoehrsen/tuning-automated-feature-engineering-exploratory
* https://medium.com/analytics-vidhya/feature-engineering-using-featuretools-with-code-10f8c83e5f68