Why using a mean for missing data is a bad idea. Alternative imputation algorithms.

2019. 6. 30. 23:02 | Posting candidates




Why using a mean for missing data is a bad idea. Alternative imputation algorithms.

We all know the pain when the dataset we want to use for Machine Learning contains missing data. The quick and easy workaround is to…


The part that struck me most was this:


Mean reduces a variance of the data

the variance was reduced (that big change is because the dataset is very small) after using the Mean Imputation. Going deeper into mathematics, a smaller variance leads to the narrower confidence interval in the probability distribution

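Mean imputation shrinks the variance, and a smaller variance makes the confidence interval narrower.

In other words, it can leave the model biased, because the imputed data look more certain than they really are!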
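To see the variance shrink for myself, here is a minimal sketch on a tiny made-up array (the numbers and the helper names `mean_impute` and `ci_half_width` are my own illustration, not taken from the linked article). It fills the gaps with the mean of the observed values, then compares the sample variance and the half-width of a normal-approximation 95% confidence interval for the mean, before and after.

```python
import numpy as np


def mean_impute(values):
    """Return a copy with every NaN replaced by the mean of the observed values."""
    filled = values.copy()
    filled[np.isnan(filled)] = np.nanmean(values)
    return filled


def ci_half_width(values, z=1.96):
    """Half-width of a normal-approximation 95% CI for the mean (no NaNs allowed)."""
    return z * values.std(ddof=1) / np.sqrt(len(values))


# Hypothetical toy data: two of the six values are missing.
data = np.array([2.0, 4.0, np.nan, 8.0, np.nan, 10.0])

observed = data[~np.isnan(data)]   # only the four values we really measured
imputed = mean_impute(data)        # six values, two of them equal to the mean

var_observed = observed.var(ddof=1)
var_imputed = imputed.var(ddof=1)
ci_observed = ci_half_width(observed)
ci_imputed = ci_half_width(imputed)

print(f"variance: observed={var_observed:.3f}, after mean imputation={var_imputed:.3f}")
print(f"CI half-width: observed={ci_observed:.3f}, after mean imputation={ci_imputed:.3f}")
```

The imputed values sit exactly on the mean, so they add nothing to the sum of squared deviations while still increasing the sample size. Both effects push the variance and the interval width down, even though no new information was collected.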


A good explanation of MAR, MCAR, and MNAR:



How to Handle Missing Data

“The idea of imputation is both seductive and dangerous” (R.J.A Little & D.B. Rubin)
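As a note to myself, the three mechanisms can be simulated roughly as follows. This is only a sketch under my own assumptions (synthetic normal data, a logistic missingness probability, and the variable names are all hypothetical, not from the linked article): under MCAR the chance of a value being missing is unrelated to the data, under MAR it depends on another observed variable `x`, and under MNAR it depends on the missing value `y` itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic data: x is always observed, y is the variable that gets holes.
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)


def sigmoid(t):
    """Logistic function, used here to turn a score into a missingness probability."""
    return 1.0 / (1.0 + np.exp(-t))


# MCAR: every y has the same 50% chance of being missing.
mcar_missing = rng.random(n) < 0.5
# MAR: y is more likely to be missing when the observed x is large.
mar_missing = rng.random(n) < sigmoid(2.0 * x)
# MNAR: y is more likely to be missing when y itself is large.
mnar_missing = rng.random(n) < sigmoid(2.0 * y)


def observed_mean(values, missing):
    """Mean of the values that are still observed."""
    return values[~missing].mean()


true_mean = y.mean()
for name, missing in [("MCAR", mcar_missing), ("MAR", mar_missing), ("MNAR", mnar_missing)]:
    print(f"{name}: true mean={true_mean:.3f}, observed mean={observed_mean(y, missing):.3f}")
```

In this setup the observed mean stays close to the true mean only under MCAR. Under MAR and MNAR the large values of `y` go missing more often, so the mean of what is left is pulled downward, and plugging that mean into the gaps would carry the bias along.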



