Safe handling instructions for missing data

Dillon Niederhut

doi:10.25080/Majora-4af1f417-008

Enthought, Inc.

Abstract

In machine learning tasks, it is common to handle missing data by removing observations with missing values or by replacing each missing value with the mean of its feature. To show why this is problematic, we apply listwise deletion and mean imputation to artificially created datasets with missing values, and we compare the resulting models against models fit with full information. Unless quite strong independence assumptions are met, we observe large biases in the resulting coefficients and an increase in the models' prediction error. We include a set of recommendations for handling missing data safely and a case study showing how to put those recommendations into practice.
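The comparison described in the abstract can be illustrated with a minimal sketch. This is not the paper's experiment. The data-generating process (one feature, true slope 2, unit noise), the missingness rule (the feature is dropped whenever the outcome exceeds 1), the seed, and the helper names `fit` and `mse` are all assumptions chosen for illustration.

```python
import numpy as np

# Assumed toy setup: y = 2x + noise, so the true slope is 2 and the intercept is 0.
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Assumed missingness rule: x is unobserved whenever the outcome is large,
# so missingness is NOT independent of the data.
missing = y > 1.0
x_obs = np.where(missing, np.nan, x)


def fit(x_values, y_values):
    """Ordinary least squares line; returns (slope, intercept)."""
    slope, intercept = np.polyfit(x_values, y_values, deg=1)
    return slope, intercept


def mse(params, x_values, y_values):
    """Mean squared prediction error of a fitted line on the given data."""
    slope, intercept = params
    return float(np.mean((y_values - (slope * x_values + intercept)) ** 2))


# Reference model with full information.
full = fit(x, y)

# Listwise deletion: drop every observation with a missing feature value.
keep = ~missing
listwise = fit(x_obs[keep], y[keep])

# Mean imputation: replace each missing value with the observed feature mean.
x_imputed = np.where(missing, np.nanmean(x_obs), x_obs)
mean_imputed = fit(x_imputed, y)

# Compare coefficients and prediction error on the complete data.
for name, params in [("full", full), ("listwise", listwise), ("mean", mean_imputed)]:
    print(f"{name:>8}: slope={params[0]:.3f} intercept={params[1]:.3f} "
          f"mse={mse(params, x, y):.3f}")
```

In this toy setup both strategies should pull the slope well below its true value and raise the error measured on the complete data. With a single feature, mean imputation yields the same slope as listwise deletion and changes only the intercept; with several correlated features the two estimates would generally differ.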

Keywords

data science, missing data, imputation
