部分转自统计学习方法
Except for reading, the first two parts are supposed to be finished in 2 weeks.
1.Coding
We were discussing K-Means algorithm carefully and thoroughly. And please do remember
that you need to be able to implement that all by yourselves in Python or in C++.
Besides, if your target are those big names, K-Means++ is also compulsory.
So you are highly recommended to implement K-Means++ in Python or in C++. And you are
welcome to accomplish this based on the K-Means code we provided in our course.
k-means ++
转载自csdn https://blog.csdn.net/i000zheng/article/details/79913204
2.Mathematical Problems
The last topic of our last week is Decision Tree [You can find it in our slides] where we have 3
algorithms (ID3, C4.5 and CART) in which the fundamental one is the ID3 algorithm which was
implemented in 1979 initially by Quinlan.
In our slides, we provide a specific example of implementing a decision tree by using ID3
algorithm step by step. The only problem of that is we intentionaly missed some computing
details of how to build up the tree.
Therefore your task is to supplyment details of calculation of each node to the tree and
provide the evidence why you split a node in such way.
You can follow what we did in slides. And to check whether you set up the correctly or not, you
can also find the final tree in our slides.
P.S.
A. If you still think this is too hard to complete, please don't worry. We can cover this topic in
Week 15.
B. If you still have time, you can learn C4.5 and CART by yourself as well from our slides and
please try to answer questions:
I. What is Gain Ratio?
II. Why we are prone to use Gain Ratio?
III. How to split a node by using Gain Ratio?
由于信息增益比对取值数目较少的特征具有一定偏好,孤儿先从候选划分属性中找出信息增益高于平均水平的属性,再从中选择增益率最高的。
IV. What Gini Index?
V. How to split a node by using Gini Index?
选择基尼指数最小的特征为决策树的结点。
VI. Why people are likely to use C4.5 or CART rather than ID3?
ID3算法中信息增益准则对取值数目较多的特征具有偏好。
C4.5算法中信息增益比准则对取值数目较少的特征具有偏好。
Answer
ID3、C4.5、CART三种决策树算法,分别使用信息增益(information gain)、信息增益比(information gain ratio)、基尼指数(gain index)来进行决策树结点的选择。
首先介绍熵的概念,熵(entropy)是表示随机变量不确定性的度量。设X是一个取有限个值的离散随机变量,其概率分布为:
则随机变量X的熵定义为:
信息增益:
特征A对训练数据集D的信息增益g(D,A),定义为集合D的经验熵H(D)与特征A给定条件下D的经验条件熵H(D|A)之差,即:
信息增益比:
基尼指数:
CART 决策树[Breiman et al., 1984] 使用"基尼指数" (Gini index)来选择划分属性.数据集D 的纯度可用基尼值来度量:
直观来说, Gini(D) 反映了从数据集D 中随机抽取两个样本,其类别标记不一致的概率。因此, Gini(D) 越小,则数据集D 的纯度越高。特征α 的基尼指数定义为
1.Gain Ratio 信息增益比
3.Further reading
[If you have limited time, just ignore this part. It's not critical for most companies.
If you are fascinated in machine learning or if your targets are those superstar companies,
then please go on.]
AdaBoost
There is another classical machine tool named AdaBoost that we were not talking about. The
most famous application for that in CV field is face detection implemented by Viola and Jones
alongside Haar features (Actually this algorithm is so successful that it has been included in
OpenCV which means you can just call this function to do simple face detection in just no more
than 3 lines).
So you can read the materials we provide in our slides and try to understand these two
concepts:
A. What is AdaBoost algorithm and B. What is Haar Feature.
Then try to answer the following questions one by one:
For AdaBoost:
a. What is a weak classifier?
b. What is a strong classifier?
c. How to combine those weak classifiers?
d. How to update a weak classifier?
e. How to update the strong classifier?
f. Can you complete the mathematical derivation by hand?
For Haar feature:
g. What is a Haar feature?
h. Can you find out any upgrade Haar features besides the original one?
i. Can you implement a Haar feature in Python or C++?
j. Can you implement the algorithm in a accelerated way? Like integral image?
k. How to combine Haar feature with AdaBoost?
If you can answer them all, then trust me, you've got a full score.
网友评论