Why using a mean for missing data is a bad idea. Alternative imputation algorithms.

2019. 6. 30. 23:02 · Category: Posting candidates

https://towardsdatascience.com/why-using-a-mean-for-missing-data-is-a-bad-idea-alternative-imputation-algorithms-837c731c1008

 


The part that struck me most was this:

 

Mean reduces a variance of the data

the variance was reduced (that big change is because the dataset is very small) after using the Mean Imputation. Going deeper into mathematics, a smaller variance leads to the narrower confidence interval in the probability distribution

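Mean imputation shrinks the variance, and a smaller variance makes the confidence interval narrower.

In other words, it can bias the model, because the narrower interval overstates how certain the estimates are!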
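To see the effect concretely, here is a minimal sketch using only the Python standard library. The data is made up for illustration (200 draws from a normal distribution with 30% of values dropped at random), and `mean_impute` is a hypothetical helper, not something taken from the linked article. Every filled-in value sits exactly on the observed mean, so it adds nothing to the sum of squared deviations while the sample size grows, and the sample variance shrinks by the factor (n_obs - 1) / (n - 1).

```python
import math
import random
import statistics


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


random.seed(0)

# Hypothetical data: 200 draws from N(50, 10), about 30% removed at random.
full = [random.gauss(50, 10) for _ in range(200)]
with_gaps = [None if random.random() < 0.3 else v for v in full]

observed = [v for v in with_gaps if v is not None]
imputed = mean_impute(with_gaps)

# Sample variance before and after mean imputation.
var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

# Standard error of the mean, which sets the confidence-interval width.
se_observed = math.sqrt(var_observed / len(observed))
se_imputed = math.sqrt(var_imputed / len(imputed))

print(f"variance, observed only: {var_observed:.2f}")
print(f"variance, mean-imputed:  {var_imputed:.2f}")
print(f"std. error, observed only: {se_observed:.3f}")
print(f"std. error, mean-imputed:  {se_imputed:.3f}")
```

The imputed column reports both a smaller variance and a smaller standard error, even though no new information was added. That is the sense in which the narrower confidence interval is misleading.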

 

A good explanation of MAR, MCAR, and MNAR:

https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4

 

How to Handle Missing Data

“The idea of imputation is both seductive and dangerous” (R.J.A Little & D.B. Rubin)

towardsdatascience.com

 
