Feature Selection
Ensemble Learning
- bagging method and boosting method
Bagging
- sampling with replacement
- decrease variance by introducing randomness into your model framework
- random forest = bagging + decision tree
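A minimal sketch of bagging for regression, assuming scikit-learn is available; the toy data, the DecisionTreeRegressor base learner, and the n_trees value are illustrative choices, not something fixed by these notes:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# toy data: y = x + noise (illustrative only)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] + rng.normal(scale=0.5, size=200)

n_trees = 50
trees = []
for _ in range(n_trees):
    # draw n rows WITH replacement (a bootstrap sample)
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor()  # unpruned "weak" learner
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# bagged prediction: average over all trees, which lowers variance
x_new = np.array([[1.0]])
print(np.mean([t.predict(x_new)[0] for t in trees]))
```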
Random Forest
- how a random forest is built:
- the training data has n samples with m features; each round, draw n observations with replacement (rows are sampled with replacement)
- from these n observations, take k features (k < m) and fit the best decision tree on them (columns are sampled without replacement)
- repeat the above steps several times and combine all the decision trees into a random forest
- properties:
- decrease variance by introducing randomness into the model framework
- the distribution of each bootstrap sample of n observations is the same as that of the original training data
- we don't need to do pruning for each "weak" decision tree
- less overfitting
- parallel implementation
- Feature Importance value in Random Forest
(advanced topic: out-of-bag evaluation)
how do we define a feature's importance? Replace that feature's column with random values (or shuffle it), run the model on the modified data, and compare the loss with the original loss. The result is neither positive nor negative in direction; it just shows how much that particular feature influences the model (see the sketch below).
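A rough sketch of this randomize-one-column idea (close to what is usually called permutation feature importance), assuming scikit-learn; make_classification and the log-loss metric are just stand-ins for real data and a real loss function:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# toy data: only a few of the features are actually informative
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

base_loss = log_loss(y, model.predict_proba(X))
for j in range(X.shape[1]):
    X_noised = X.copy()
    # shuffle feature j so its column carries no information
    X_noised[:, j] = rng.permutation(X_noised[:, j])
    loss = log_loss(y, model.predict_proba(X_noised))
    # the increase in loss measures how much feature j influences the model
    print(f"feature {j}: importance = {loss - base_loss:.4f}")
```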
Support Vector Machine
- SVM: maximize the minimum margin (the distance from the decision boundary to the closest training points)
- if the data are not linearly separable, add slack terms for noisy points and map the data into a higher-dimensional space by applying a kernel function (the kernel trick); see the sketch below
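A small illustration of why the kernel matters, assuming scikit-learn's SVC; the concentric-circle dataset and the parameter values are illustrative only:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# concentric circles: not separable by any line in the original 2-D space
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)  # kernel trick

print("linear kernel accuracy:", linear_svm.score(X, y))  # roughly chance level
print("rbf kernel accuracy:   ", rbf_svm.score(X, y))      # close to 1.0
```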
Why Feature Selection?
- reduce overfitting
- better understanding your model
- improve model stability (i.e. improve generalization)
It depends on what you want to do. If you are running a study and want to understand each feature's contribution, you need to remove some of the data so that highly correlated features do not distort the model; if you only want to make predictions, you care mainly about accuracy and rarely need to remove features. Poor model stability means that a tiny change in one feature causes the coefficients to change dramatically, which signals high variance; the cause is usually that the model is too complex or that there are too many correlated features. The most direct remedy is regularization (the sketch below illustrates the instability).
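A quick sketch of that stability problem, assuming scikit-learn: two nearly identical (highly correlated) features make plain least-squares coefficients swing wildly between runs, while Ridge regularization keeps them stable; the synthetic data and the alpha value are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

def fit_coefs(model, seed):
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.01, size=200)  # almost an exact copy of x1
    y = 3 * x1 + rng.normal(scale=0.5, size=200)
    X = np.column_stack([x1, x2])
    return model.fit(X, y).coef_

for seed in range(3):
    print("OLS  ", fit_coefs(LinearRegression(), seed))  # coefficients swing wildly
for seed in range(3):
    print("Ridge", fit_coefs(Ridge(alpha=1.0), seed))    # roughly equal, stable across runs
```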
Pearson Correlation
measures the linear dependency between two features
- $\rho_{X,Y} = \dfrac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$, where $\operatorname{cov}$ means covariance and $\sigma$ means standard deviation
- covariance: $\operatorname{cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right]$, where $\mu_X = E[X]$ and $\mu_Y = E[Y]$
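A small check of the definition, assuming NumPy; the synthetic x and y are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.8 * x + rng.normal(scale=0.6, size=1000)  # linearly related, with noise

# rho = cov(X, Y) / (sigma_X * sigma_Y)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())

print(rho)                      # Pearson correlation from the definition
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in value, should match
```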
Regularization Models
L1 regularization tends to produce sparse solutions (many coefficients become exactly zero)
L2 regularization tends to spread the weights out more equally (see the sketch below)
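A sketch contrasting the two penalties, assuming scikit-learn's Lasso (L1) and Ridge (L2); the data and the alpha values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# 10 features, but only the first two actually drive y
X = rng.normal(size=(300, 10))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty

print("L1 (Lasso):", np.round(lasso.coef_, 3))  # most coefficients exactly zero
print("L2 (Ridge):", np.round(ridge.coef_, 3))  # small but typically nonzero everywhere
```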