Boruta Feature Selection

Author: X_cda6 | Published 2019-01-07 21:38

    In a machine learning task, before actually training a model we usually extract as many features as possible from the raw data to use as model input.

    But more features are not always better. Recall the ring gag in God of Gamblers: over his last 500 games, Ko Chun made a point of touching his ring whenever he bluffed, so his opponent came to believe he had spotted a tell: touch the ring, and a bluff is coming. In the final showdown the opponent fell for exactly this. Touching the ring is a bad feature; feeding it into the model only makes the model worse.

    The Boruta algorithm introduced in this post is one such feature selection method: it selects features by their feature importance. boruta_py is a Python implementation of Boruta that behaves much like an sklearn extension and is very convenient to use.

    The main steps of Boruta are as follows (steps 1-3 are sketched in code right after the list):

    1. Create shadow features: for each real feature in R, randomly shuffle its values to obtain a shadow feature matrix S, then append it after the real features to form a new feature matrix N = [R, S].

    2. Train a model on the new feature matrix N. Any model that exposes feature_importances_ will do, e.g. RandomForest, lightgbm, or xgboost. This yields feature importances for both the real and the shadow features.

    3. Take the maximum feature importance among the shadow features, S_max; each real feature whose importance exceeds S_max records one hit.

    4. Use the hits accumulated in step 3 to mark features as important or unimportant. The original paper runs the significance test with a Bonferroni correction; boruta_py considers Bonferroni too conservative and by default adds an FDR correction, with the two_step parameter switching between the two testing schemes.

    5. Remove the unimportant features and repeat steps 1-4 until every feature has been marked.
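
    To make steps 1-3 concrete, here is a minimal sketch of one Boruta round. The function one_boruta_round and the choice of RandomForestClassifier are illustrative, not boruta_py's internals:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def one_boruta_round(X, y, rng):
        # Step 1: shuffle each real column independently to build shadow features.
        X_shadow = X.copy()
        for j in range(X_shadow.shape[1]):
            X_shadow[:, j] = rng.permutation(X_shadow[:, j])
        N = np.hstack([X, X_shadow])  # N = [R, S]

        # Step 2: any estimator exposing feature_importances_ works here.
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(N, y)
        imp = model.feature_importances_
        real_imp, shadow_imp = imp[:X.shape[1]], imp[X.shape[1]:]

        # Step 3: a real feature scores a hit if it beats the best shadow.
        return real_imp > shadow_imp.max()

    # hits = one_boruta_round(X, y, np.random.default_rng(0))

    Summing these boolean hit vectors over many rounds gives the cumulative hit counts that the significance test below consumes.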

    After training, boruta_py also exposes ranking_, a per-feature importance rank that is itself a useful signal in feature selection.
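
    As a usage sketch (boruta_py is distributed as the boruta package; the synthetic data and estimator settings here are just illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from boruta import BorutaPy

    # Illustrative synthetic data: 5 informative features among 20.
    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=5, random_state=0)

    rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
    selector = BorutaPy(rf, n_estimators='auto', two_step=True,
                        alpha=0.05, random_state=0)
    selector.fit(X, y)  # BorutaPy expects numpy arrays, not DataFrames

    print(selector.support_)   # True for accepted features
    print(selector.ranking_)   # 1 = accepted, 2 = tentative, >2 = rejected
    X_selected = selector.transform(X)  # keep only the accepted features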

    Let's look at the difference between the two tests in code. boruta_py implements the feature test as follows.

    # Excerpted from the BorutaPy class; relies on module-level
    # "import numpy as np" and "import scipy as sp".
    def _do_tests(self, dec_reg, hit_reg, _iter):
        active_features = np.where(dec_reg >= 0)[0]
        hits = hit_reg[active_features]
        # get uncorrected p values based on hit_reg
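        # sf(hits - 1, _iter, .5) = P(X >= hits): chance of at least this many
        # hits in _iter fair coin flips; cdf(hits, _iter, .5) = P(X <= hits)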
        to_accept_ps = sp.stats.binom.sf(hits - 1, _iter, .5).flatten()
        to_reject_ps = sp.stats.binom.cdf(hits, _iter, .5).flatten()
    
        if self.two_step:
            # two step multicor process
            # first we correct for testing several features in each round using FDR
            to_accept = self._fdrcorrection(to_accept_ps, alpha=self.alpha)[0]
            to_reject = self._fdrcorrection(to_reject_ps, alpha=self.alpha)[0]
    
            # second we correct for testing the same feature over and over again
            # using bonferroni
            to_accept2 = to_accept_ps <= self.alpha / float(_iter)
            to_reject2 = to_reject_ps <= self.alpha / float(_iter)
    
            # combine the two multi corrections, and get indexes
            to_accept *= to_accept2
            to_reject *= to_reject2
        else:
            # as in the original Boruta, we simply do Bonferroni correction
            # with the total n_feat in each iteration
            to_accept = to_accept_ps <= self.alpha / float(len(dec_reg))
            to_reject = to_reject_ps <= self.alpha / float(len(dec_reg))
    
        # find features which are 0 and have been rejected or accepted
        to_accept = np.where((dec_reg[active_features] == 0) * to_accept)[0]
        to_reject = np.where((dec_reg[active_features] == 0) * to_reject)[0]
    
        # updating dec_reg
        dec_reg[active_features[to_accept]] = 1
        dec_reg[active_features[to_reject]] = -1
        return dec_reg
    
    def _fdrcorrection(self, pvals, alpha=0.05):
        """
        Benjamini/Hochberg p-value correction for false discovery rate, from
        statsmodels package. Included here for decoupling dependency on statsmodels.
    
        Parameters
        ----------
        pvals : array_like
            set of p-values of the individual tests.
        alpha : float
            error rate
    
        Returns
        -------
        rejected : array, bool
            True if a hypothesis is rejected, False if not
        pvalue-corrected : array
            pvalues adjusted for multiple hypothesis testing to limit FDR
        """
        pvals = np.asarray(pvals)
        pvals_sortind = np.argsort(pvals)
        pvals_sorted = np.take(pvals, pvals_sortind)
        nobs = len(pvals_sorted)
        ecdffactor = np.arange(1, nobs + 1) / float(nobs)
    
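        # Benjamini-Hochberg step-up: compare sorted p-values to (i/n) * alpha,
        # then also reject everything ranked below the largest rejected p-value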
        reject = pvals_sorted <= ecdffactor * alpha
        if reject.any():
            rejectmax = max(np.nonzero(reject)[0])
            reject[:rejectmax] = True
    
        pvals_corrected_raw = pvals_sorted / ecdffactor
        pvals_corrected = np.minimum.accumulate(pvals_corrected_raw[::-1])[::-1]
        pvals_corrected[pvals_corrected > 1] = 1
        # reorder p-values and rejection mask to original order of pvals
        pvals_corrected_ = np.empty_like(pvals_corrected)
        pvals_corrected_[pvals_sortind] = pvals_corrected
        reject_ = np.empty_like(reject)
        reject_[pvals_sortind] = reject
        return reject_, pvals_corrected_
    

    Here dec_reg marks each feature as accepted/tentative/rejected with 1/0/-1, hit_reg stores each feature's accumulated hit count over the iterations, _iter is the number of iterations run so far, and the self.alpha parameter is the significance level, defaulting to 0.05.
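
    For example, if a still-active feature has 19 hits after 25 iterations (numbers made up for illustration), the two uncorrected p-values come out as:

    from scipy import stats

    hits, _iter = 19, 25
    # P(X >= 19) under Binomial(25, 0.5): evidence the feature beats the shadows
    p_accept = stats.binom.sf(hits - 1, _iter, .5)   # ~0.0073
    # P(X <= 19): evidence the feature loses to the shadows (large here, so no rejection)
    p_reject = stats.binom.cdf(hits, _iter, .5)      # ~0.998

    With 20 features in dec_reg, the single-step Bonferroni threshold is 0.05 / 20 = 0.0025, so 0.0073 is not yet small enough to accept the feature, although it would pass an uncorrected 0.05 level.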

    Bonferroni correction: when n independent hypotheses are tested simultaneously on the same dataset, the significance level applied to each individual hypothesis should be 1/n of the level used for testing a single hypothesis, which is why the comparison divides self.alpha by float(len(dec_reg)).
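
    FDR (Benjamini-Hochberg) correction instead compares the sorted p-values against a sliding threshold (i/n) * alpha rather than the single Bonferroni threshold alpha/n, so it accepts more features. A toy comparison, with p-values made up for illustration:

    import numpy as np

    pvals = np.array([0.001, 0.004, 0.019, 0.030, 0.047])  # made up, sorted
    alpha, n = 0.05, len(pvals)

    bonferroni = pvals <= alpha / n                 # single threshold 0.01: 2 pass
    bh = pvals <= np.arange(1, n + 1) / n * alpha   # thresholds 0.01..0.05: all 5 pass

    _fdrcorrection additionally applies the step-up rule, rejecting everything ranked below the largest passing p-value, which changes nothing in this toy case.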

    Note that when boruta_py runs feature selection, it checks the feature matrix for invalid values such as NaN and Inf; if you produce the feature importances with a model that handles NaN natively, such as lightgbm, you can remove this check.
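
    A quick pre-check for the kind of values that validation rejects (a sketch; contains_invalid is a hypothetical helper, and np.isfinite is False for both NaN and Inf):

    import numpy as np

    def contains_invalid(X):
        # True if X holds NaN or +/-Inf; boruta_py's input validation rejects both
        return not np.isfinite(X).all()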

    References

    Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), 1-13.
