

Author: 医科研 | Published: 2020-03-18 16:24




From [Michiel Van Herwegen], merely curious


Keeping missing values as a valid value

If you impute a categorical variable (whatever the technique) when the data is not MAR (missing at random), you are at best losing information. At worst, you are corrupting the variable.


For that reason alone, I often find it valuable to treat missing as just another value of the categorical variable.

My favorite example here is one I heard in a class, years ago. The department had done a churn project for a large retailer, based on loyalty card data. The best predictor of churn? Whether the customer had filled in his/her email address when registering for the card.
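A minimal sketch of the idea, assuming missing values arrive as `None`. The `MISSING` label, the helper names and the toy data are all hypothetical and only for illustration:

```python
from collections import defaultdict

MISSING = "MISSING"  # assumed label; any value the variable does not already use works

def with_missing_level(values, missing_label=MISSING):
    """Replace None with an explicit level so 'missing' is treated like any other category."""
    return [missing_label if v is None else v for v in values]

def rate_by_level(levels, outcomes):
    """Return {level: share of True outcomes} for each level of the variable."""
    counts = defaultdict(lambda: [0, 0])  # level -> [positives, total]
    for level, outcome in zip(levels, outcomes):
        counts[level][0] += int(outcome)
        counts[level][1] += 1
    return {level: pos / total for level, (pos, total) in counts.items()}

# Hypothetical data: preferred contact channel and whether the customer churned.
channel = ["email", None, "phone", None, "email", None]
churned = [False, True, False, True, True, False]

levels = with_missing_level(channel)
churn_rates = rate_by_level(levels, churned)
print(churn_rates)
```

In this made-up data the `MISSING` level has the highest churn rate, which is the kind of signal that imputing the value would have hidden.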

Imputing missing values

Sometimes, imputing the value can be the best option though:

  • When there is no reason to expect missing to have a meaning of its own.

  • When missing values are really rare, it may not be worth having an extra value for the categorical variable.

  • When you are using a technique which has trouble with high-dimensional data, since an extra categorical value means an extra variable when dummy-encoding.

The usual way to impute the missing values in that case is to use the mode. Its main advantage is that it is really simple and really fast, but of course it is not that nuanced.
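Mode imputation can be sketched in a few lines, assuming missing values are encoded as `None`. The function name `impute_mode` and the sample column are hypothetical:

```python
from collections import Counter

def impute_mode(values):
    """Fill every None with the most frequent observed value of the variable."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot compute a mode: every value is missing")
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

colors = ["red", None, "blue", "red", None, "green"]
filled = impute_mode(colors)
print(filled)
```

Note that every missing cell gets the same value, so the most common level becomes even more dominant after imputation.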

Sometimes, a model is used to predict the missing values. I'm less a fan of that, for two reasons:

  1. Time constraints. :) It is the scarcest resource you have, so the variable better be really important and the missing values common before spending a lot of time on just imputing it with a model of its own.



  2. It introduces - to some extent - artificial relations between your predictors. So you better be careful to remember that when interpreting the model afterwards. Also, it is likely to introduce correlation between predictors, which may lead to higher uncertainty in your coefficients.


Dropping the records

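Thirdly, you can remove the records with missing values. But that means you lose data. If there are many variables (each of which can have missing values), you may end up losing a lot of data even though the total number of 'cells' with missing data is rather limited. This approach is acceptable when only a small amount of data is missing, but if the variables under study have many missing values, deleting records this way loses a lot of data and may introduce bias.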
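The gap between missing cells and lost records is easy to see on a toy table. The sketch below uses hypothetical records stored as dictionaries, with `None` marking a missing cell:

```python
# Hypothetical records: five customers, four categorical variables each.
rows = [
    {"region": "north", "plan": "basic", "channel": None,    "device": "ios"},
    {"region": "south", "plan": None,    "channel": "email", "device": "android"},
    {"region": "east",  "plan": "pro",   "channel": "phone", "device": "ios"},
    {"region": None,    "plan": "basic", "channel": "email", "device": "web"},
    {"region": "west",  "plan": "pro",   "channel": "email", "device": "android"},
]

def drop_incomplete(records):
    """Keep only the records in which no variable is missing."""
    return [r for r in records if all(v is not None for v in r.values())]

complete = drop_incomplete(rows)

missing_cells = sum(v is None for r in rows for v in r.values())
total_cells = sum(len(r) for r in rows)
cell_share_missing = missing_cells / total_cells   # share of cells that are missing
row_share_lost = 1 - len(complete) / len(rows)     # share of records dropped

print(f"missing cells: {cell_share_missing:.0%}, records lost: {row_share_lost:.0%}")
```

Here only 3 of 20 cells are missing, yet dropping incomplete records removes 3 of the 5 rows.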

On the other hand, if you are really interested in an explanatory model, you may want to get rid of them (again, assuming MAR) in order not to dilute/confound effects.


Lastly, there is yet another 'trick': you impute, but you also add an extra variable which keeps track of whether you imputed the variable for that record or not. If the fact that a value is missing has a meaning, its impact can then be taken into account through that variable.

Bonus advantage: if someone else uses the data, the fact that you imputed data is not lost on them. :)
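A small sketch of the impute-plus-indicator trick, here combined with mode imputation as one possible choice. The function name `impute_with_flag` and the sample data are hypothetical, and missing values are assumed to be `None`:

```python
from collections import Counter

def impute_with_flag(values):
    """Mode-impute a categorical variable and return an 'was imputed' indicator alongside it."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    mode = Counter(observed).most_common(1)[0][0]
    imputed = [mode if v is None else v for v in values]
    was_missing = [v is None for v in values]  # extra variable to feed to the model
    return imputed, was_missing

plan = ["basic", None, "pro", "basic", None]
plan_imputed, plan_was_missing = impute_with_flag(plan)
print(plan_imputed)
print(plan_was_missing)
```

Both columns would then go into the model, so any effect of missingness itself is carried by the indicator rather than lost in the imputed value.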


[Shehroz Khan], ML Researcher, Postdoc @ U of Toronto


  1. Delete the records with missing values

  2. Leave them as is, or

  3. Perform imputation


If you are using decision trees, you can keep missing attribute values as '?' or any arbitrary value not already present in the attribute. The decision tree will treat this as a separate attribute value and will still give you predictions. You can use the same idea for bagging, boosting and random forests. I think this is the strategy you mentioned above, but you cannot simply declare a missing value as "0" unless you define what "0" means, or else it can mess up your computations. However, the problem with this approach is that if missingness is high in your data, the predictive ability of your model deteriorates drastically. This is more of a lazy, ad hoc approach, so better approaches are needed. Read below.

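Imputation means replacing the missing attribute value with something else. A common choice is to replace it with the mode of all the values of the attribute. However, a central question is "Is mode a good representation for missing attribute value?". You can use another classification model to predict the missing attribute value, but it is unclear how such a predictive model would learn without complete data. KNN may work because in principle it doesn't do any training, but it is O(N^2), so it is very slow on large datasets, and the value of K needs to be tuned. More work has been done to handle this question in a principled manner. Read below.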
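As a rough sketch of the KNN idea for categorical data, the code below fills a missing value by majority vote among the k most similar complete rows. The choice of Hamming distance (number of mismatching attributes), the function names and the toy rows are assumptions for illustration, and the other columns are assumed to be fully observed:

```python
from collections import Counter

def hamming(a, b, skip):
    """Count positions (other than `skip`) where two rows differ."""
    return sum(1 for i, (x, y) in enumerate(zip(a, b)) if i != skip and x != y)

def knn_impute(rows, target, k=3):
    """Fill None in column `target` by majority vote among the k nearest complete rows."""
    donors = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        filled = list(row)
        if row[target] is None:
            # Every incomplete row is compared against every donor row.
            nearest = sorted(donors, key=lambda d: hamming(row, d, target))[:k]
            filled[target] = Counter(d[target] for d in nearest).most_common(1)[0][0]
        result.append(filled)
    return result

# Hypothetical rows: (region, device, plan); plan is the variable with gaps.
rows = [
    ("north", "ios", "pro"),
    ("north", "ios", "pro"),
    ("south", "android", "basic"),
    ("south", "android", "basic"),
    ("north", "ios", None),
    ("south", "web", None),
]

completed = knn_impute(rows, target=2, k=3)
print(completed[4], completed[5])
```

There is no training step, but comparing each incomplete row with all donor rows is where the roughly quadratic cost comes from, and the result depends on the choice of k.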


There are systematic ways to handle missing data by performing Multiple Imputation (http://www.stefvanbuuren.nl/mi/MI.html). Rubin [1], Joseph L. Schafer [2] and others have done a great deal of work on handling missing data.

  • Chapters 7 and 8 of Schafer's book deal specifically with categorical data imputation. Some software packages impute categorical data; see Multiple imputation software (http://www.stefvanbuuren.nl/mi/Software.html).

  • Search for "categorical" and you will find CRAN - Package cat (http://cran.r-project.org/web/packages/cat/index.html), which is an R package based on Schafer's work. Another R package uses Non-Parametric Bayesian Multiple Imputation for Categorical Data [3].

  • In multiple imputation, each missing value is imputed several times. Once you have imputed the data multiple times, you can do different things with the results, such as creating ensembles or averaging across the imputations (for categorical data, averaging is not the obvious thing to do).
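The impute-several-times-then-pool workflow can be illustrated with a deliberately simplified sketch. Real multiple-imputation software draws from a fitted model; the random draw from the observed values used here is a stand-in assumption, and all function names are hypothetical:

```python
import random

def impute_once(values, rng):
    """Fill each None with a random draw from the observed values (a stand-in for a real model)."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

def multiple_impute(values, m=5, seed=0):
    """Create m completed copies of the variable, each with its own random imputations."""
    rng = random.Random(seed)
    return [impute_once(values, rng) for _ in range(m)]

def pooled_share(datasets, level):
    """Average the per-dataset share of `level` across the completed datasets."""
    shares = [d.count(level) / len(d) for d in datasets]
    return sum(shares) / len(shares)

plan = ["basic", None, "pro", "basic", None, "pro", "basic", None]
datasets = multiple_impute(plan, m=5)

for level in ("basic", "pro"):
    print(level, round(pooled_share(datasets, level), 3))
```

What gets averaged here is an estimate computed on each completed dataset (the share of a level), not the category labels themselves, which sidesteps the problem that categories cannot be averaged directly.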

Data scientist [Giuliano Janson]

  • Since nowadays there are plenty of models that deal with missing values very well, like Gradient Boosting or Random Forest, I usually just set a special value for missing categorical levels (like 'MISSING', or -999) and let the model figure out whether "missingness" has predictive power.

  • The algorithms I mentioned do not even need one hot encoding, but if you wanted you could encode the 'MISSING' level the same way you encode any other level.

  • Everything else, like using the median, using another model to predict the missing value, or dropping the row, in my experience either does not work as well or is far more complex and produces minimal improvement.
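Encoding the 'MISSING' level like any other level can be sketched as below. The `one_hot` helper is hypothetical and hand-rolled for illustration, and missing values are assumed to arrive as `None`:

```python
def one_hot(values, missing_label="MISSING"):
    """One-hot encode a categorical variable, treating missing as an ordinary level."""
    labelled = [missing_label if v is None else v for v in values]
    levels = sorted(set(labelled))  # fixed column order for the indicator columns
    encoded = [[1 if v == level else 0 for level in levels] for v in labelled]
    return levels, encoded

channel = ["email", None, "phone", "email"]
levels, encoded = one_hot(channel)

print(levels)
for row in encoded:
    print(row)
```

The missing level simply becomes one more indicator column, which is also why it adds one more dimension for techniques that struggle with high-dimensional data.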







