Customizing the Pandas Profiling Package

2021. 3. 8. 22:43 · Analysis Python/Pandas Tip

You may need to install the latest version from the master branch (as of 2021-03-08).

!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

!pip install pydot
!pip install pygraphviz
from pandas_profiling.model.typeset import ProfilingTypeSet

# Plot the graph of data types in the default profiling typeset
typeset = ProfilingTypeSet()
typeset.plot_graph(dpi=100)

Read the data!

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
df['Survived'] = df['Survived'].astype(bool)
df.head()

import pandas_profiling as pp
report = pp.ProfileReport(df, lazy=False, dark_mode=True)
report

You can also inspect the summary information computed for an individual variable.

report.description_set['variables']['Survived']

You can also use the Summarizer directly to produce the statistics!! (important)

from pandas_profiling.model.typeset import Boolean
report.summarizer.summarize(df['Survived'], Boolean)

To build a customized summary, you need to define, in order, the functions to call for each data type in the typeset.

from pandas_profiling.model.summarizer import BaseSummarizer
from pandas_profiling.model.typeset import Unsupported

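def default_summary(series: pd.Series, summary: dict = None):
    # Avoid a shared mutable default: create a fresh dict when none is passed
    summary = {} if summary is None else summary
    summary['length'] = len(series)
    return series, summary

def new_boolean_summary(series: pd.Series, summary: dict = None):
    summary = {} if summary is None else summary
    # For a boolean series, the mean equals the fraction of True values
    summary['probability_true'] = series.mean()
    return series, summary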

mapping = {
    Boolean: [default_summary, new_boolean_summary],
    Unsupported: [default_summary]
}
new_typeset = Boolean + Unsupported # In visions, adding two (or more) types together produces a typeset.
summarizer = BaseSummarizer(mapping, new_typeset)

summarizer.summarize(df.Survived, Boolean)

{'length': 891, 'probability_true': 0.3838383838383838, 'type': Boolean}

summarizer.summarize(df.Survived, Unsupported)

{'length': 891, 'type': Unsupported}
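
The chaining behaviour described above can be pictured without the library at all. The following is only a conceptual sketch in plain Python, not the pandas-profiling implementation: the names `HANDLERS`, `length_summary`, `probability_true_summary`, and `summarize` are hypothetical stand-ins, and plain lists replace `pd.Series`. It assumes each handler receives the values plus the summary built so far, and returns both.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A handler takes (values, summary) and returns (values, summary).
Handler = Callable[[Sequence, dict], Tuple[Sequence, dict]]


def length_summary(values: Sequence, summary: dict) -> Tuple[Sequence, dict]:
    """Record the number of values (mirrors default_summary above)."""
    summary["length"] = len(values)
    return values, summary


def probability_true_summary(values: Sequence, summary: dict) -> Tuple[Sequence, dict]:
    """Record the fraction of truthy values (mirrors new_boolean_summary)."""
    summary["probability_true"] = sum(1 for v in values if v) / len(values)
    return values, summary


# Hypothetical mapping from a type name to its ordered list of handlers.
HANDLERS: Dict[str, List[Handler]] = {
    "Boolean": [length_summary, probability_true_summary],
    "Unsupported": [length_summary],
}


def summarize(values: Sequence, type_name: str) -> dict:
    """Run the handlers registered for type_name, in order, on one summary dict."""
    summary: dict = {}
    for handler in HANDLERS[type_name]:
        values, summary = handler(values, summary)
    summary["type"] = type_name
    return summary


sample = [True, False, False, True]
boolean_result = summarize(sample, "Boolean")
unsupported_result = summarize(sample, "Unsupported")
```

The same data yields different keys depending on which type it is summarized as, which matches the two outputs shown above: only the Boolean handlers add `probability_true`.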

Customizing ProfileReport Summaries


from pandas_profiling.model.typeset import ProfilingTypeSet

typeset = ProfilingTypeSet()
from pandas_profiling.model.summarizer import PandasProfilingSummarizer

custom_summarizer = PandasProfilingSummarizer(typeset)
custom_summarizer.mapping[Boolean].append(new_boolean_summary)

report = pp.ProfileReport(df, lazy=False, summarizer=custom_summarizer)

report.description_set['variables']['Survived']

{'n_distinct': 2,
 'p_distinct': 0.002244668911335578,
 'is_unique': False,
 'n_unique': 0,
 'p_unique': 0.0,
 'type': Boolean,
 'hashable': True,
 'value_counts_without_nan': False    549
 True     342
 Name: Survived, dtype: int64,
 'n_missing': 0,
 'n': 891,
 'p_missing': 0.0,
 'count': 891,
 'memory_size': 1019,
 'probability_true': 0.3838383838383838}
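
In the output above, `probability_true` appears last because the handler was appended to the end of the existing list. One consequence, sketched here in plain Python under the assumption that handlers run strictly in list order, is that an appended handler can reuse keys written by earlier ones. The names `counts_summary`, `probability_true_from_counts`, and `mapping` below are hypothetical and are not part of pandas-profiling; the counts (342 True, 549 False) are taken from the report output above.

```python
from collections import Counter

def counts_summary(values, summary):
    """Stand-in for a default handler: records value counts and n."""
    summary["value_counts"] = Counter(values)
    summary["n"] = len(values)
    return values, summary


def probability_true_from_counts(values, summary):
    """Appended handler: relies on keys written by the earlier handler."""
    summary["probability_true"] = summary["value_counts"][True] / summary["n"]
    return values, summary


# Hypothetical default mapping, extended with append as in the example above.
mapping = {"Boolean": [counts_summary]}
mapping["Boolean"].append(probability_true_from_counts)


def summarize(values, type_name):
    """Apply each handler for type_name in order to a shared summary dict."""
    summary = {}
    for handler in mapping[type_name]:
        values, summary = handler(values, summary)
    return summary


# Rebuild the Survived column from the counts shown in the report.
survived = [True] * 342 + [False] * 549
result = summarize(survived, "Boolean")
```

If the appended handler were placed before `counts_summary`, it would fail with a `KeyError`, which is why the order of the functions in the mapping matters.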

Reference:

towardsdatascience.com/customizing-pandas-profiling-summaries-b16714d0dac9

Customizing Pandas-Profiling Summaries



All I Need Is Data.
