Some recent data mining odds and ends

Author: cf1244c50db8 | Published 2018-04-17 22:49

Laplace smoothing:

The simplest example:
Over the last five matches between the Chinese and South Korean men's football teams, China won 0 out of 5. When predicting the probability that China wins the sixth match, we obviously cannot just report 0/5. So add 1 to both the numerator and the denominator, giving 1/6.
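
A minimal sketch of this add-one idea (the more general Laplace estimate adds a pseudo-count for every possible outcome, e.g. (s + 1) / (n + K) for K outcomes):

def add_one_smoothed(successes, trials):
    # Add 1 to both the numerator and the denominator, as in the example above.
    return (successes + 1) / (trials + 1)

print(add_one_smoothed(0, 5))  # 1/6 ≈ 0.167 instead of the hopeless 0/5 = 0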

Bayesian networks:

(Figure: a Bayesian network)

p(a,b,c) = p(c|a,b)p(b|a)p(a)
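
A small sketch, with made-up conditional tables for three binary variables, showing how this factorization yields the joint probability:

# Hypothetical conditional tables illustrating p(a,b,c) = p(c|a,b) p(b|a) p(a).
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.8, False: 0.2}, False: {True: 0.4, False: 0.6}}
p_c_given_ab = {
    (True, True): {True: 0.9, False: 0.1},
    (True, False): {True: 0.5, False: 0.5},
    (False, True): {True: 0.7, False: 0.3},
    (False, False): {True: 0.1, False: 0.9},
}

def joint(a, b, c):
    return p_c_given_ab[(a, b)][c] * p_b_given_a[a][b] * p_a[a]

# The joint distribution sums to 1 over all eight assignments.
total = sum(joint(a, b, c) for a in (True, False) for b in (True, False) for c in (True, False))
print(joint(True, True, True), total)  # 0.216 1.0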

Markov chains:

Stretch the Bayesian network into a single chain and assume that the probability of the current node depends only on the node immediately before it.
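
A minimal sketch with an invented two-state transition table; the next state is sampled from the current state alone:

import random

# Hypothetical transition probabilities for a two-state chain.
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    # The Markov property: only the current state matters.
    states, probs = zip(*transition[current].items())
    return random.choices(states, weights=probs, k=1)[0]

state, path = "sunny", ["sunny"]
for _ in range(5):
    state = next_state(state)
    path.append(state)
print(path)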

Time series:

Put simply, a time series is the sequence of values observed at successive points in time, and time series analysis uses the historical data to predict future values. One point worth emphasizing: time series analysis is not regression on time; it mainly studies the series' own internal dynamics (time series with exogenous variables are not considered here).
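
To make "modeled from its own past values, not regressed on time" concrete, a tiny AR(1)-style one-step forecast with invented coefficients:

# x_t = c + phi * x_{t-1}: the prediction uses the previous value, not the time index.
history = [10.0, 10.8, 11.5, 11.9, 12.6]
phi, c = 0.9, 1.5                      # assumed AR(1) coefficients, for illustration only
print(c + phi * history[-1])           # one-step-ahead forecast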

Decision trees

SLIQ:

introduction:
SLIQ stands for Supervised Learning In Quest, where Quest is the Data Mining project at the IBM Almaden Research Center.
SLIQ is a decision tree classifier that can handle both numeric and categorical attributes.
advantages:
SLIQ uses the novel techniques of pre-sorting, breadth first growth, and MDL-based pruning.
pre-sorting:
SLIQ uses a pre-sorting technique in the tree-growth phase to reduce the cost of evaluating numeric attributes.
MDL-based pruning:
SLIQ also uses a new tree-pruning algorithm based on the Minimum Description Length principle [11]. This algorithm is inexpensive, and results in compact and accurate trees.

Best-first decision trees:

difference:
The only difference is that standard decision tree learning expands nodes in depth-first order, while best-first decision tree learning expands the "best" node first.
The "best" node is the one whose split gives the maximal reduction of impurity.

Difference between best-first and depth-first:

In this example, considering the fully-expanded best-first decision tree, the benefit of expanding node N2 is greater than the benefit of expanding N3.

two splitting criteria to measure impurity:
Gini gain and information gain.
Both were introduced to measure the impurity of a node; the gain of a split is the reduction of impurity it achieves.
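
A small sketch of the two impurity measures behind these criteria; the gain of a split is the parent's impurity minus the weighted impurity of the children:

from collections import Counter
from math import log2

def gini(labels):
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(parent, children, impurity):
    # Reduction of impurity: impurity(parent) minus the weighted impurity of the children.
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

parent = ["+", "+", "+", "-", "-", "-"]
children = [["+", "+", "+"], ["-", "-", "-"]]              # a perfect split
print(gain(parent, children, gini), gain(parent, children, entropy))  # 0.5 1.0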

splitting rules:
Choose the split with the maximal reduction of impurity.
For continuous/numeric attributes, the attribute values are pre-sorted and the best split point is searched for, as sketched below.
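
Building on the gini and gain helpers sketched above, the pre-sort-and-scan search: sort once, then try the midpoints between consecutive distinct values and keep the split with the largest impurity reduction:

def best_numeric_split(values, labels):
    pairs = sorted(zip(values, labels))                    # pre-sorting step
    best_threshold, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                       # no split between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        g = gain([y for _, y in pairs], [left, right], gini)
        if g > best_gain:
            best_threshold, best_gain = threshold, g
    return best_threshold, best_gain

print(best_numeric_split([2.0, 3.5, 1.0, 7.0, 8.5], ["-", "-", "-", "+", "+"]))  # (5.25, 0.48)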

The method of dealing with missing values:
Instances are sent down different branches with different weights.
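
A tiny sketch of that fractional-weight idea: an instance whose split attribute is missing goes down every branch, weighted by the fraction of observed training instances that went that way (the fractions here are assumed):

branch_fractions = {"left": 0.7, "right": 0.3}   # observed fractions at this node (assumed)

def distribute_missing(instance_weight, fractions):
    # Send the instance down all branches with proportionally reduced weights.
    return {branch: instance_weight * frac for branch, frac in fractions.items()}

print(distribute_missing(1.0, branch_fractions))  # {'left': 0.7, 'right': 0.3}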

pruning methods: pre-pruning and post-pruning:
As mentioned before, pre-pruning stops splitting when further splitting cannot improve predictive performance.
Post-pruning, by contrast, grows the tree fully and then prunes off branches which do not improve accuracy.

Gradient boosting (GB)

AnyBoost

Let C be the cost function; C is a function of F.
F is a weak learner (in AnyBoost, the current voted combination of weak learners), and ~F is the set of weak learners.
F' denotes the derivative (gradient) of C at F. We want to find an f in ~F
that maximizes <-F', f>, where <,> denotes an inner product; a large inner product means the new learner is closely aligned with the descent direction.
When the inner product drops below zero, we stop iterating.
The inner product, the cost function, and the step size are chosen according to the specific setting.
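
A rough sketch of that loop; weak_learner_fit and cost_gradient are placeholders, not a library API: the former should return a learner trained to correlate with the negative gradient, the latter the pointwise gradient of C at the current F:

import numpy as np

def anyboost(X, y, weak_learner_fit, cost_gradient, n_rounds=50, step_size=0.1):
    F = np.zeros(len(y))                    # current ensemble outputs on the training set
    ensemble = []
    for _ in range(n_rounds):
        neg_grad = -cost_gradient(F, y)     # direction that decreases the cost C
        f = weak_learner_fit(X, neg_grad)   # weak learner approximating that direction
        f_out = f(X)
        if np.dot(neg_grad, f_out) <= 0:    # inner product no longer positive: stop
            break
        ensemble.append((step_size, f))
        F = F + step_size * f_out
    return ensemble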

A gradient descent view of voting methods

Defining the inner product:

(Equation image: the inner product, where G and F are weak learners)

The formula for the negative gradient:

(Equation image: the gradient)

(Table: existing voting methods viewed as AnyBoost on margin cost functions)
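
The two equation images are not reproduced; if they follow the standard AnyBoost formulation on a training sample (x_1, y_1), ..., (x_m, y_m) with labels ±1 and margin cost C(F) = (1/m) Σ_i c(y_i F(x_i)), they would read roughly:

<F, G> = (1/m) Σ_{i=1..m} F(x_i) G(x_i)

-∇C(F)(x_i) = -(1/m) y_i c'(y_i F(x_i))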

No Free Lunch theorem (NFL)

Without reference to the actual problem, no algorithm performs better than random guessing on average.
It is hopeless to dream of a learning algorithm that is consistently better than other learning algorithms.

Ensemble methods

Boosting

examples:


(Figures: the example data, the boosting steps, and the base algorithm)
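
Since those images are not reproduced here, a minimal AdaBoost-style round on invented data: fit a decision stump under the current sample weights, weight it by its weighted error, and up-weight the misclassified samples for the next round:

import numpy as np

# Invented 1-D data and labels standing in for the missing example images.
x = np.arange(10, dtype=float)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
w = np.full(len(y), 1.0 / len(y))                  # initial uniform sample weights

def stump_predict(x, threshold, sign):
    # Base learner: a decision stump on the single feature.
    return sign * np.where(x <= threshold, 1, -1)

# One boosting round: pick the stump with the smallest weighted error ...
best = min(((t, s) for t in x for s in (1, -1)),
           key=lambda ts: np.sum(w * (stump_predict(x, *ts) != y)))
pred = stump_predict(x, *best)
err = np.sum(w * (pred != y))
alpha = 0.5 * np.log((1 - err) / err)              # weight of this base learner
# ... then re-weight the samples so the misclassified ones count more next round.
w = w * np.exp(-alpha * y * pred)
w /= w.sum()
print(best, round(err, 3), round(alpha, 3))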

Bagging

Boosting is sequential ensembling.
Bagging is parallel ensembling: the base learners are generated in parallel, exploiting their independence.
Bagging: Bootstrap AGGregating.
It uses bootstrap sampling of the training data (sampling with replacement).
The most common combination strategies: voting for classification and averaging for regression.
Bagging has a large variance-reduction effect and is very effective for unstable learners (a common stable learner: the k-nearest-neighbor classifier).
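
A minimal sketch of bagging with bootstrap sampling and majority voting; the base learner here is a trivial threshold rule and the data are invented:

import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D classification data.
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

def fit_stump(x, y):
    # Toy base learner: threshold at the midpoint of the two class means.
    return (x[y == 0].mean() + x[y == 1].mean()) / 2.0

def predict_stump(threshold, x):
    return (x > threshold).astype(int)

# Each base learner sees a bootstrap sample (sampling with replacement);
# the final prediction is a majority vote over the ensemble.
thresholds = []
for _ in range(25):
    idx = rng.integers(0, x.size, size=x.size)
    thresholds.append(fit_stump(x[idx], y[idx]))

votes = np.mean([predict_stump(t, x) for t in thresholds], axis=0)
bagged_pred = (votes > 0.5).astype(int)
print((bagged_pred == y).mean())                   # training accuracy of the bagged ensemble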
