Customizing the Pandas Profiling Package
2021. 3. 8. 22:43 · Analysis Python/Pandas Tip
You may need to install the latest development version from GitHub for the code below to work (as of 2021-03-08).
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
!pip install pydot
!pip install pygraphviz
from pandas_profiling.model.typeset import ProfilingTypeSet
typeset = ProfilingTypeSet()
typeset.plot_graph(dpi=100)
Load the data!
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df['Survived'] = df['Survived'].astype(bool)
df.head()
import pandas_profiling as pp
report = pp.ProfileReport(df, lazy=False, dark_mode=True)
report
You can also look up the summary computed for an individual variable.
report.description_set['variables']['Survived']
You can also call the summarizer directly to produce the statistics for a single column (important).
from pandas_profiling.model.typeset import Boolean
report.summarizer.summarize(df['Survived'], Boolean)
To generate a customized summary, you need to define, in order, the functions to be called for each data type in the typeset.
from pandas_profiling.model.summarizer import BaseSummarizer
from pandas_profiling.model.typeset import Unsupported
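def default_summary(series: pd.Series, summary: dict = None):
    # Use None instead of a shared mutable {} default.
    summary = {} if summary is None else summary
    summary['length'] = len(series)
    return series, summary

def new_boolean_summary(series: pd.Series, summary: dict = None):
    summary = {} if summary is None else summary
    # The mean of a boolean series is the share of True values.
    summary['probability_true'] = series.mean()
    return series, summary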
mapping = {
    Boolean: [default_summary, new_boolean_summary],
    Unsupported: [default_summary],
}
new_typeset = Boolean + Unsupported # In visions, adding two (or more) types together produces a typeset.
summarizer = BaseSummarizer(mapping, new_typeset)
summarizer.summarize(df.Survived, Boolean)
{'length': 891, 'probability_true': 0.3838383838383838, 'type': Boolean}
summarizer.summarize(df.Survived, Unsupported)
{'length': 891, 'type': Unsupported}
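The chaining idea behind the mapping above can be sketched in plain Python, without pandas_profiling installed. This is only an illustration under stated assumptions, not the library's actual implementation: `summarize` here is a hypothetical helper, the types are represented by plain strings, and a list of booleans stands in for the `Survived` column (342 True and 549 False, matching the counts shown in this post).

```python
def default_summary(values, summary):
    # Same (values, summary) -> (values, summary) shape as the functions above.
    summary["length"] = len(values)
    return values, summary


def boolean_summary(values, summary):
    # True counts as 1, so sum / length is the share of True values.
    summary["probability_true"] = sum(values) / len(values)
    return values, summary


def summarize(values, type_name, mapping):
    """Hypothetical helper: run the functions registered for type_name in order."""
    summary = {"type": type_name}
    for func in mapping[type_name]:
        values, summary = func(values, summary)
    return summary


# Assumption: strings stand in for the Boolean / Unsupported type objects.
mapping = {
    "Boolean": [default_summary, boolean_summary],
    "Unsupported": [default_summary],
}

# Stand-in for df['Survived']: 342 survivors out of 891 passengers.
survived = [True] * 342 + [False] * 549

boolean_result = summarize(survived, "Boolean", mapping)
unsupported_result = summarize(survived, "Unsupported", mapping)
print(boolean_result)
print(unsupported_result)
```

Because each function receives the summary produced by the previous one, the order of the list matters: a later function can read or overwrite keys written by an earlier one.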
Customizing ProfileReport Summaries
from pandas_profiling.model.typeset import ProfilingTypeSet
typeset = ProfilingTypeSet()
from pandas_profiling.model.summarizer import PandasProfilingSummarizer
custom_summarizer = PandasProfilingSummarizer(typeset)
custom_summarizer.mapping[Boolean].append(new_boolean_summary)
report = pp.ProfileReport(df, lazy=False, summarizer=custom_summarizer)
report.description_set['variables']['Survived']
{'n_distinct': 2,
'p_distinct': 0.002244668911335578,
'is_unique': False,
'n_unique': 0,
'p_unique': 0.0,
'type': Boolean,
'hashable': True,
'value_counts_without_nan': False 549
True 342
Name: Survived, dtype: int64,
'n_missing': 0,
'n': 891,
'p_missing': 0.0,
'count': 891,
'memory_size': 1019,
'probability_true': 0.3838383838383838}
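As a sanity check, the custom `probability_true` value can be recomputed by hand from the `value_counts_without_nan` entry in the same output (549 False, 342 True). The short sketch below uses only those printed counts and the standard library; the variable names are chosen here for illustration.

```python
from fractions import Fraction

# Counts taken from the value_counts_without_nan entry in the output above.
n_false, n_true = 549, 342
n = n_false + n_true  # total number of rows

# Share of True values, which is what series.mean() returns for a boolean column.
probability_true = n_true / n
exact_share = Fraction(n_true, n)  # reduces to 38/99

# Two distinct values (False, True) out of n rows.
p_distinct = 2 / n

print(n, exact_share, probability_true, p_distinct)
```

The recomputed values agree with `n`, `probability_true`, and `p_distinct` in the report, which confirms that the appended function ran as part of the Boolean summary.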
Reference:
towardsdatascience.com/customizing-pandas-profiling-summaries-b16714d0dac9