Quora Discussion | How Should Missing Values in Categorical Variables Be Handled?

Author: 医科研 | Published: 2020-03-18 16:24

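I have asked many people, including statistics experts, how missing values in categorical variables should be handled. Some say to impute the data; others say to keep the records and fill in a global constant. Even after collecting many answers I was still unsure, so below is the further material I gathered, shared here for everyone. If you have run into the same problem, you are welcome to join our group and discuss it with us.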

Below are some of the answers from Quora:

Most upvoted answer

From [Michiel Van Herwegen], merely curious

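The first question to ask yourself: why are these values missing? In practice, data is rarely MAR (missing at random), so the fact that a value is missing carries a meaning of its own.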

Keeping missing values as a valid value

If you're going to impute (no matter the technique) a categorical variable where the data is not MAR, you are in the best case missing out on information. In the worst case, you're really clobbering the variable.

For that reason alone, I often find it valuable to consider missing as just another value for the categorical variable.

My favorite example here is one I heard in a class, years ago. The department had done a churn project for a large retailer, based on loyalty card data. The best predictor of churn? Whether the customer had filled in his/her email address when registering for the card.
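As a minimal sketch of treating missingness as its own level, the pandas snippet below fills gaps with a 'MISSING' label and compares an outcome rate per level. The data, the column names (`email_provider`, `churned`) and the label are invented for illustration and are not from the churn project mentioned above.

```python
import pandas as pd

# Toy loyalty-card data, invented for illustration only.
customers = pd.DataFrame({
    "email_provider": ["gmail", None, "yahoo", None, "gmail", None],
    "churned":        [0,       1,    0,       1,    0,       0],
})

# Treat "missing" as just another level of the categorical variable.
customers["email_provider"] = customers["email_provider"].fillna("MISSING")

# Compare churn rates per level, including the new MISSING level.
churn_by_level = customers.groupby("email_provider")["churned"].mean()
print(churn_by_level)
```

If the MISSING level shows a clearly different rate from the other levels, that is a hint that missingness itself carries information.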

Imputing missing values

Sometimes, imputing the value can be the best option though:

When there is no reason to expect missing to have a meaning of its own.
When missing values are really rare, it may not be worth it to have an extra value for the categorical variable.
When you are using a technique which has trouble with high-dimensional data, since an extra categorical value means an extra variable when dummy-encoding.

The common way to impute the missing values then is to use the mode. Main advantage: really simple, really fast. But of course not that nuanced.
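Mode imputation can be sketched in a few lines of pandas. The column below is toy data made up for illustration.

```python
import pandas as pd

# Toy column, invented for illustration only.
color = pd.Series(["red", "blue", None, "red", None, "green"])

# mode() ignores missing values by default and returns a Series
# (there can be ties), so take the first entry.
most_common = color.mode().iloc[0]

# Replace every missing entry with the most frequent level.
color_imputed = color.fillna(most_common)
print(color_imputed.tolist())
```

Note that when several levels tie for most frequent, taking the first entry is an arbitrary choice.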

Sometimes, a model is used to predict the missing values. I'm less a fan of that, for two reasons:

Time constraints. :) It is the scarcest resource you have, so the variable better be really important and the missing values common before spending a lot of time on just imputing it with a model of its own.

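The author says he is not a fan of using a model to predict missing values. His first reason is time constraints: he regards time as the scarcest resource.

Building a dedicated model just to impute missing data takes too much time.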
It introduces - to some extent - artificial relations between your predictors. So you better be careful to remember that when interpreting the model afterwards. Also, it is likely to introduce correlation between predictors, which may lead to higher uncertainty in your coefficients.

It also introduces artificial correlation among the predictors (X) that enter the model, which makes the model harder to interpret.

Dropping the records

Thirdly, you can remove the records with missing values. But that means you lose data. If there are many variables (which can have missing values), you may end up losing a lot of data while the total number of 'cells' with missing data is actually rather limited. (This approach is acceptable when only a little data is missing, but if the variables under study have too many missing values, deleting records this way loses a lot of data and may introduce bias.)

On the other hand, if you are really interested in an explanatory model, you may want to get rid of them (again, assuming MAR) in order not to dilute/confound effects.
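The gap between missing cells and lost rows is easy to see on a small example. The sketch below uses an invented 6 x 4 table in which each missing cell sits in a different row.

```python
import pandas as pd

# Toy data, invented for illustration: 6 rows x 4 columns, 4 missing cells.
df = pd.DataFrame({
    "gender":  ["F", None, "M", "F", "M", "F"],
    "region":  ["N", "S", None, "E", "W", "N"],
    "segment": ["a", "b", "c", None, "a", "b"],
    "channel": ["web", "web", "app", "app", None, "web"],
})

missing_cell_share = df.isna().sum().sum() / df.size  # share of missing cells
complete_rows = df.dropna()                           # drop any row with a gap
lost_row_share = 1 - len(complete_rows) / len(df)     # share of rows dropped

print(f"missing cells: {missing_cell_share:.0%}, rows lost: {lost_row_share:.0%}")
```

Here roughly one cell in six is missing, yet two thirds of the rows disappear after dropping.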

Hybrid approach

Lastly, there is actually yet another 'trick'. You impute, but you also add an extra variable which keeps track of whether you imputed the variable for the record or not. If the fact that it is missing has a meaning, its impact can then be taken into account by that variable.

Bonus advantage: if someone else uses the data, the fact that you imputed data is not lost on them. :)
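A minimal sketch of this impute-plus-indicator trick in pandas follows. The column name `plan` and the indicator name `plan_was_missing` are hypothetical, and the mode is used as the imputed value only for simplicity.

```python
import pandas as pd

# Toy column, invented for illustration only.
df = pd.DataFrame({"plan": ["basic", None, "premium", "basic", None]})

# Record which rows were missing BEFORE imputing, otherwise the
# information is gone once the gaps are filled.
df["plan_was_missing"] = df["plan"].isna().astype(int)

# Then impute; here with the mode.
df["plan"] = df["plan"].fillna(df["plan"].mode().iloc[0])
print(df)
```

The order of the two steps matters: the indicator has to be computed on the original column.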

Other views

[Shehroz Khan], ML Researcher, Postdoc @U of Toronto

He sees three ways of handling it:

Delete the records with missing values
Leave them as is, or
Perform imputation

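You can delete the rows or records with missing values to avoid dealing with them. However, if a large share of the data is missing, this will hurt the predictive ability of the model, so this practice should be avoided.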

If you are using decision trees, then you can keep missing attribute values as '?' or any arbitrary value not present in your attributes. Decision trees will take this as a separate attribute value and will give you predictions. You can use the same idea for Bagging, Boosting and Random Forest. I think this is the strategy you mentioned above, but you cannot declare a variable as "0" unless you define what "0" means, or else it can mess up your computations. However, the problem with this approach is that if missingness is high in your data, then the predictive ability of your model deteriorates drastically. This is more of a lazy and ad hoc approach. Therefore, better approaches are needed. Read below.

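Imputation means replacing the missing attribute value with something else. A common choice is to replace it with the mode of all values of that attribute. However, a central question is "Is mode a good representation for missing attribute value?". You can use another classification model to predict the missing attribute values, but it is not clear how such predictive models would learn without complete data. KNN may work because in principle it doesn't do any training, but it is O(N^2), so super slow on large datasets, and the value of K needs to be optimized. There has been more work done to handle this question in a principled manner. Read below.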
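To make the KNN idea concrete, here is a small plain-Python sketch. It assumes only the target column has gaps, uses Hamming distance on the remaining columns, and takes a majority vote among the k nearest complete rows. The function names and the toy rows are hypothetical, and this is one simple variant rather than a reference implementation.

```python
from collections import Counter

def hamming(a, b, skip):
    """Number of positions (other than `skip`) where rows a and b differ."""
    return sum(1 for i, (x, y) in enumerate(zip(a, b)) if i != skip and x != y)

def knn_impute(rows, target, k=3):
    """Fill None in column `target` with the majority value among the
    k nearest complete rows (Hamming distance on the other columns)."""
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(list(r))
            continue
        # Each incomplete row is compared against every donor row,
        # which is where the quadratic cost on large data comes from.
        nearest = sorted(donors, key=lambda d: hamming(r, d, target))[:k]
        vote = Counter(d[target] for d in nearest).most_common(1)[0][0]
        filled.append([vote if i == target else v for i, v in enumerate(r)])
    return filled

# Toy rows (city, device, plan), invented for illustration.
rows = [
    ("paris", "ios", "premium"),
    ("paris", "ios", "premium"),
    ("lyon", "android", "basic"),
    ("lyon", "android", "basic"),
    ("paris", "ios", None),
]
result = knn_impute(rows, target=2, k=3)
print(result[-1])
```

The choice of k changes the vote, which is why the answer notes that K needs to be tuned.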

Multiple imputation

There are systematic ways to handle missing data by performing Multiple Imputation (http://www.stefvanbuuren.nl/mi/MI.html). Rubin [1], Joseph L. Schafer [2] and others have done a great deal of work on handling missing data.

Chapters 7 and 8 of Schafer's book deal specifically with categorical data imputation. There is some software that imputes categorical data; see Multiple imputation software (http://www.stefvanbuuren.nl/mi/Software.html).
Search for "categorical" and you will find CRAN - Package cat (http://cran.r-project.org/web/packages/cat/index.html), which is an R package based on Schafer's work. Another R package uses Non-Parametric Bayesian Multiple Imputation for Categorical Data [3].
In multiple imputation, each missing value is imputed multiple times. Therefore, once you impute data multiple times, you can perform different things such as creating ensembles or averaging different imputations (for categorical data averaging is not the obvious thing to do).
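The overall shape of multiple imputation (several completed datasets, one estimate per dataset, then pooling) can be sketched as below. This is a deliberately simplified stand-in: it draws each missing value from the observed marginal distribution of the column, whereas the methods of Rubin and Schafer draw from a model conditioned on the other variables and also pool the variances. The data, the seed and m = 5 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy column, invented for illustration only.
plan = pd.Series(["basic", "premium", None, "basic", None, "basic", None, "premium"])

missing = plan.isna()
levels = plan.dropna().value_counts(normalize=True)  # observed level frequencies
rng = np.random.default_rng(0)

m = 5            # number of imputed datasets
completed = []   # the m completed copies of the column
estimates = []   # one estimate per completed copy
for _ in range(m):
    # Draw a value for every gap from the observed distribution.
    draws = rng.choice(levels.index.to_numpy(), size=int(missing.sum()),
                       p=levels.to_numpy())
    filled = plan.copy()
    filled.loc[missing] = draws
    completed.append(filled)
    estimates.append((filled == "premium").mean())  # share of 'premium'

# Pool the point estimates by averaging them across the m datasets.
pooled = float(np.mean(estimates))
print(pooled)
```

Averaging here is applied to a numeric estimate (a proportion), not to the category labels themselves, which sidesteps the problem that labels cannot be averaged.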

Data scientist [Giuliano Janson]

Since nowadays there are plenty of models that deal with missing values very well, like Gradient Boosting or Random Forest, I usually just set a special value for missing categorical levels (like 'MISSING', or -999) and let the model figure out if "missingness" has predictive power.

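This author says he usually sets missing values to a special value such as 'MISSING' and then lets the model determine whether missingness has predictive power.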
The algorithms I mentioned do not even need one hot encoding, but if you wanted you could encode the 'MISSING' level the same way you encode any other level.
Everything else, like median, using another model to predict the missing value, dropping the row,... based on my experience, do not work as well or are way more complex and produce minimal improvement.

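In the author's experience, using the median, using another model to predict the missing values, and dropping observations all work less well, or are far more complex while bringing only minimal improvement.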
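Encoding the 'MISSING' level the same way as any other level can be sketched with `pandas.get_dummies`. The column name `browser` and the data are invented for illustration.

```python
import pandas as pd

# Toy column, invented for illustration only.
df = pd.DataFrame({"browser": ["chrome", None, "firefox", "chrome", None]})

# Give missing values their own level, then one-hot encode as usual.
df["browser"] = df["browser"].fillna("MISSING")
encoded = pd.get_dummies(df["browser"], prefix="browser")
print(encoded.columns.tolist())
```

The resulting `browser_MISSING` column is an ordinary dummy variable, so any downstream model can weigh it like the other levels.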

References

The discussion above is for study purposes only. Source:

https://www.quora.com/How-do-I-handle-missing-categorical-variable-in-an-easy-way
