Some recent data mining odds and ends

Author: cf1244c50db8 | Published 2018-04-17 22:49

Laplace smoothing:

The simplest example:
Over the last five matches between the Chinese and South Korean men's football teams, China won 0 out of 5. When predicting the probability that China wins the sixth match, we obviously cannot just report 0/5. So add 1 to both the numerator and the denominator, giving 1/6.
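
A minimal sketch of this add-one idea (the more general Laplace estimate adds a pseudo-count for every possible outcome, e.g. (s + 1) / (n + K) for K outcomes):

def add_one_smoothed(successes, trials):
    # Add 1 to both the numerator and the denominator, as in the example above.
    return (successes + 1) / (trials + 1)

print(add_one_smoothed(0, 5))  # 1/6 ≈ 0.167 instead of the hopeless 0/5 = 0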

Bayesian networks:

(Figure: a Bayesian network)

p(a,b,c) = p(c|a,b)p(b|a)p(a)
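
A small sketch, with made-up conditional tables for three binary variables, showing how this factorization yields the joint probability:

# Hypothetical conditional tables illustrating p(a,b,c) = p(c|a,b) p(b|a) p(a).
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.8, False: 0.2}, False: {True: 0.4, False: 0.6}}
p_c_given_ab = {
    (True, True): {True: 0.9, False: 0.1},
    (True, False): {True: 0.5, False: 0.5},
    (False, True): {True: 0.7, False: 0.3},
    (False, False): {True: 0.1, False: 0.9},
}

def joint(a, b, c):
    return p_c_given_ab[(a, b)][c] * p_b_given_a[a][b] * p_a[a]

# The joint distribution sums to 1 over all eight assignments.
total = sum(joint(a, b, c) for a in (True, False) for b in (True, False) for c in (True, False))
print(joint(True, True, True), total)  # 0.216 1.0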

Markov chains:

Stretch the Bayesian network into a single chain and assume that the probability of the current node depends only on the node immediately before it.
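
A minimal sketch with an invented two-state transition table; the next state is sampled from the current state alone:

import random

# Hypothetical transition probabilities for a two-state chain.
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    # The Markov property: only the current state matters.
    states, probs = zip(*transition[current].items())
    return random.choices(states, weights=probs, k=1)[0]

state, path = "sunny", ["sunny"]
for _ in range(5):
    state = next_state(state)
    path.append(state)
print(path)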

Time series:

Put simply, a time series is the sequence of values observed at successive points in time, and time series analysis uses the historical data to predict future values. One point worth emphasizing: time series analysis is not regression on time; it mainly studies the series' own internal dynamics (time series with exogenous variables are not considered here).
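
To make "modeled from its own past values, not regressed on time" concrete, a tiny AR(1)-style one-step forecast with invented coefficients:

# x_t = c + phi * x_{t-1}: the prediction uses the previous value, not the time index.
history = [10.0, 10.8, 11.5, 11.9, 12.6]
phi, c = 0.9, 1.5                      # assumed AR(1) coefficients, for illustration only
print(c + phi * history[-1])           # one-step-ahead forecast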

Decision trees

SLIQ:

introduction:
SLIQ stands for Supervised Learning In Quest, where Quest is the Data Mining project at the IBM Almaden Research Center.
SLIQ is a decision tree classifier that can handle both numeric and categorical attributes.
advantages:
SLIQ uses the novel techniques of pre-sorting, breadth first growth, and MDL-based pruning.
pre-sorting:
SLIQ uses a pre-sorting technique in the tree-growth phase to reduce the cost of evaluating numeric attributes.
MDL-based pruning:
SLIQ also uses a new tree-pruning algorithm based on the Minimum Description Length principle [11]. This algorithm is inexpensive, and results in compact and accurate trees.

Best-first decision trees:

difference:
The only difference is that standard decision tree learning expands nodes in depth-first order, while best-first decision tree learning expands the "best" node first.
The "best" node is the one whose split gives the maximal reduction of impurity.

Difference between best-first and depth-first:

In this example, considering the fully-expanded best-first decision tree, the benefit of expanding node N2 is greater than the benefit of expanding N3.

two splitting criteria to measure impurity:
Gini gain and information gain.
Both were introduced to measure the impurity of a node; the gain of a split is the reduction of impurity it achieves.
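
A small sketch of the two impurity measures behind these criteria; the gain of a split is the parent's impurity minus the weighted impurity of the children:

from collections import Counter
from math import log2

def gini(labels):
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(parent, children, impurity):
    # Reduction of impurity: impurity(parent) minus the weighted impurity of the children.
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

parent = ["+", "+", "+", "-", "-", "-"]
children = [["+", "+", "+"], ["-", "-", "-"]]              # a perfect split
print(gain(parent, children, gini), gain(parent, children, entropy))  # 0.5 1.0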

splitting rules:
Choose the split with the maximal reduction of impurity.
For continuous/numeric attributes, the attribute values are pre-sorted and the best split point is searched for, as sketched below.
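
Building on the gini and gain helpers sketched above, the pre-sort-and-scan search: sort once, then try the midpoints between consecutive distinct values and keep the split with the largest impurity reduction:

def best_numeric_split(values, labels):
    pairs = sorted(zip(values, labels))                    # pre-sorting step
    best_threshold, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                       # no split between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        g = gain([y for _, y in pairs], [left, right], gini)
        if g > best_gain:
            best_threshold, best_gain = threshold, g
    return best_threshold, best_gain

print(best_numeric_split([2.0, 3.5, 1.0, 7.0, 8.5], ["-", "-", "-", "+", "+"]))  # (5.25, 0.48)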

The method of dealing with missing values:
Instances are sent down different branches with different weights.
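
A tiny sketch of that fractional-weight idea: an instance whose split attribute is missing goes down every branch, weighted by the fraction of observed training instances that went that way (the fractions here are assumed):

branch_fractions = {"left": 0.7, "right": 0.3}   # observed fractions at this node (assumed)

def distribute_missing(instance_weight, fractions):
    # Send the instance down all branches with proportionally reduced weights.
    return {branch: instance_weight * frac for branch, frac in fractions.items()}

print(distribute_missing(1.0, branch_fractions))  # {'left': 0.7, 'right': 0.3}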

pruning methods: pre-pruning and post-pruning:
As mentioned before, pre-pruning stops splitting when further splitting cannot improve predictive performance.
Post-pruning, by contrast, grows the tree fully and then prunes off branches which do not improve accuracy.

Gradient boosting (GB)

AnyBoost

Let C be the cost function; C is a function of F.
F is a weak learner (in AnyBoost, the current voted combination of weak learners), and ~F is the set of weak learners.
F' denotes the derivative (gradient) of C at F. We want to find an f in ~F
that maximizes <-F', f>, where <,> denotes an inner product; a large inner product means the new learner is closely aligned with the descent direction.
When the inner product drops below zero, we stop iterating.
The inner product, the cost function, and the step size are chosen according to the specific setting.
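
A rough sketch of that loop; weak_learner_fit and cost_gradient are placeholders, not a library API: the former should return a learner trained to correlate with the negative gradient, the latter the pointwise gradient of C at the current F:

import numpy as np

def anyboost(X, y, weak_learner_fit, cost_gradient, n_rounds=50, step_size=0.1):
    F = np.zeros(len(y))                    # current ensemble outputs on the training set
    ensemble = []
    for _ in range(n_rounds):
        neg_grad = -cost_gradient(F, y)     # direction that decreases the cost C
        f = weak_learner_fit(X, neg_grad)   # weak learner approximating that direction
        f_out = f(X)
        if np.dot(neg_grad, f_out) <= 0:    # inner product no longer positive: stop
            break
        ensemble.append((step_size, f))
        F = F + step_size * f_out
    return ensemble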

A gradient descent view of voting methods

Defining the inner product:

(Equation image: the inner product, where G and F are weak learners)

The formula for the negative gradient:

(Equation image: the gradient)

(Table: existing voting methods viewed as AnyBoost on margin cost functions)
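
The two equation images are not reproduced; if they follow the standard AnyBoost formulation on a training sample (x_1, y_1), ..., (x_m, y_m) with labels ±1 and margin cost C(F) = (1/m) Σ_i c(y_i F(x_i)), they would read roughly:

<F, G> = (1/m) Σ_{i=1..m} F(x_i) G(x_i)

-∇C(F)(x_i) = -(1/m) y_i c'(y_i F(x_i))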

No Free Lunch theorem (NFL)

Without reference to the actual problem, no algorithm performs better than random guessing on average.
It is hopeless to dream of a learning algorithm that is consistently better than other learning algorithms.

Ensemble methods

Boosting

examples:


(Figures: the example data, the boosting steps, and the base algorithm)
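
Since those images are not reproduced here, a minimal AdaBoost-style round on invented data: fit a decision stump under the current sample weights, weight it by its weighted error, and up-weight the misclassified samples for the next round:

import numpy as np

# Invented 1-D data and labels standing in for the missing example images.
x = np.arange(10, dtype=float)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])
w = np.full(len(y), 1.0 / len(y))                  # initial uniform sample weights

def stump_predict(x, threshold, sign):
    # Base learner: a decision stump on the single feature.
    return sign * np.where(x <= threshold, 1, -1)

# One boosting round: pick the stump with the smallest weighted error ...
best = min(((t, s) for t in x for s in (1, -1)),
           key=lambda ts: np.sum(w * (stump_predict(x, *ts) != y)))
pred = stump_predict(x, *best)
err = np.sum(w * (pred != y))
alpha = 0.5 * np.log((1 - err) / err)              # weight of this base learner
# ... then re-weight the samples so the misclassified ones count more next round.
w = w * np.exp(-alpha * y * pred)
w /= w.sum()
print(best, round(err, 3), round(alpha, 3))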

Bagging

Boosting is sequential ensembling.
Bagging is parallel ensembling: the base learners are generated in parallel, exploiting their independence.
Bagging: Bootstrap AGGregating.
It uses bootstrap sampling of the training data (sampling with replacement).
The most common combination strategies: voting for classification and averaging for regression.
Bagging has a large variance-reduction effect and is very effective for unstable learners (a common stable learner: the k-nearest-neighbor classifier).
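
A minimal sketch of bagging with bootstrap sampling and majority voting; the base learner here is a trivial threshold rule and the data are invented:

import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D classification data.
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
y = np.concatenate([np.zeros(50, dtype=int), np.ones(50, dtype=int)])

def fit_stump(x, y):
    # Toy base learner: threshold at the midpoint of the two class means.
    return (x[y == 0].mean() + x[y == 1].mean()) / 2.0

def predict_stump(threshold, x):
    return (x > threshold).astype(int)

# Each base learner sees a bootstrap sample (sampling with replacement);
# the final prediction is a majority vote over the ensemble.
thresholds = []
for _ in range(25):
    idx = rng.integers(0, x.size, size=x.size)
    thresholds.append(fit_stump(x[idx], y[idx]))

votes = np.mean([predict_stump(t, x) for t in thresholds], axis=0)
bagged_pred = (votes > 0.5).astype(int)
print((bagged_pred == y).mean())                   # training accuracy of the bagged ensemble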
