Customizing the Pandas Profiling Package

2021. 3. 8. 22:43 · Analysis Python/Pandas Tip


You may need to install the latest development version from GitHub (as of 2021-03-08).

!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

 

!pip install pydot
!pip install pygraphviz
from pandas_profiling.model.typeset import ProfilingTypeSet

typeset = ProfilingTypeSet()
typeset.plot_graph(dpi=100)

 

Load the data.

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df['Survived'] = df['Survived'].astype(bool)
df.head()
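
The cast to `bool` above is presumably what lets the column be handled as a Boolean variable later on. As a small sketch with made-up toy values (not the Titanic data), this is what `astype(bool)` does to a 0/1 column in pandas, and why the mean of the result is the share of `True` values:

```python
import pandas as pd

# Made-up toy values standing in for a 0/1 column such as Survived.
survived_raw = pd.Series([0, 1, 1, 0, 1], name="Survived")

# 0 becomes False and 1 becomes True.
survived = survived_raw.astype(bool)
dtype_after = str(survived.dtype)

# True counts as 1 and False as 0, so the mean is the fraction of True values.
share_true = survived.mean()

print(dtype_after, share_true)
```
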

 

import pandas_profiling as pp
report = pp.ProfileReport(df, lazy=False, dark_mode=True)
report

 

You can also inspect the computed information for a single variable.

report.description_set['variables']['Survived']

 

You can also use the summarizer directly to compute the statistics (important).

from pandas_profiling.model.typeset import Boolean
report.summarizer.summarize(df['Survived'], Boolean)

 

 

To create a customized summary, you need to define, in order, the functions that should be called for each data type in the typeset.

from pandas_profiling.model.summarizer import BaseSummarizer
from pandas_profiling.model.typeset import Unsupported

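def default_summary(series: pd.Series, summary: dict = None):
    # Use None rather than a mutable {} default, which would be shared across calls.
    summary = {} if summary is None else summary
    summary['length'] = len(series)  # number of rows, including missing values
    return series, summary

def new_boolean_summary(series: pd.Series, summary: dict = None):
    summary = {} if summary is None else summary
    # The mean of a boolean series is the fraction of True values.
    summary['probability_true'] = series.mean()
    return series, summary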

mapping = {
    Boolean: [default_summary, new_boolean_summary],
    Unsupported: [default_summary]
}
new_typeset = Boolean + Unsupported # In visions, adding two (or more) types together produces a typeset.
summarizer = BaseSummarizer(mapping, new_typeset)
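
A side note on the `summary` parameter of functions like these: Python evaluates default argument values once, at definition time, so a `{}` default is shared by every call that omits the argument. The sketch below uses hypothetical function names and plain lists instead of a pandas Series to show how that differs from a `None` default. It does not involve pandas-profiling itself.

```python
def risky_summary(values, summary={}):
    # The default dict is created once, when the function is defined.
    summary["length"] = len(values)
    return values, summary


def safe_summary(values, summary=None):
    # A fresh dict is created on every call that omits `summary`.
    summary = {} if summary is None else summary
    summary["length"] = len(values)
    return values, summary


_, first = risky_summary([1, 2, 3])
_, second = risky_summary([1])
shared = first is second            # both calls returned the same dict object

_, third = safe_summary([1, 2, 3])
_, fourth = safe_summary([1])
independent = third is not fourth   # each call got its own dict

print(shared, first, third, fourth)
```

When the caller always passes a summary dict explicitly, the default is never used and the problem does not arise, but the `None` form is the safer habit.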

 

 

summarizer.summarize(df.Survived, Boolean)
{'length': 891, 'probability_true': 0.3838383838383838, 'type': Boolean}
summarizer.summarize(df.Survived, Unsupported)
{'length': 891, 'type': Unsupported}
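
The outputs above suggest that the functions listed for a type run one after another, each adding its keys to the same summary dict. The following is a minimal sketch of that chaining idea only, not the library's actual implementation: `run_chain` and the two summary functions are hypothetical names, and a plain list of booleans stands in for the pandas Series so the sketch has no dependencies.

```python
from functools import reduce


def length_summary(values, summary):
    summary["length"] = len(values)
    return values, summary


def probability_true_summary(values, summary):
    # True counts as 1, so sum / len is the fraction of True values.
    summary["probability_true"] = sum(values) / len(values)
    return values, summary


def run_chain(functions, values):
    """Apply each summary function in order, threading (values, summary) through."""
    _, summary = reduce(lambda state, fn: fn(*state), functions, (values, {}))
    return summary


toy = [True, False, False, True]
result = run_chain([length_summary, probability_true_summary], toy)
print(result)  # {'length': 4, 'probability_true': 0.5}
```

Because every function has the same `(values, summary) -> (values, summary)` shape, adding a statistic means appending one more function to the list, which mirrors how the mapping above is extended.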

Customizing ProfileReport Summaries


from pandas_profiling.model.typeset import ProfilingTypeSet

typeset = ProfilingTypeSet()
from pandas_profiling.model.summarizer import PandasProfilingSummarizer

custom_summarizer = PandasProfilingSummarizer(typeset)
custom_summarizer.mapping[Boolean].append(new_boolean_summary)

report = pp.ProfileReport(df, lazy=False, summarizer=custom_summarizer)
report.description_set['variables']['Survived']
{'n_distinct': 2,
 'p_distinct': 0.002244668911335578,
 'is_unique': False,
 'n_unique': 0,
 'p_unique': 0.0,
 'type': Boolean,
 'hashable': True,
 'value_counts_without_nan': False    549
 True     342
 Name: Survived, dtype: int64,
 'n_missing': 0,
 'n': 891,
 'p_missing': 0.0,
 'count': 891,
 'memory_size': 1019,
 'probability_true': 0.3838383838383838}
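
As a quick sanity check, `probability_true` and `p_distinct` in this output can be reproduced by hand from the value counts shown (549 `False`, 342 `True`). This is plain arithmetic and does not call pandas-profiling:

```python
# Value counts copied from the report output above.
counts = {False: 549, True: 342}

n = sum(counts.values())              # total number of rows
probability_true = counts[True] / n   # share of True values
p_distinct = len(counts) / n          # distinct values divided by row count

print(n, probability_true, p_distinct)
```
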

 

 

 

Reference:

Customizing Pandas-Profiling Summaries
towardsdatascience.com/customizing-pandas-profiling-summaries-b16714d0dac9

 
