1. Dataset introduction
We use one of scikit-learn's official datasets: the California housing dataset.
Code:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn import tree
import pydotplus
from IPython.display import Image
housing = fetch_california_housing()
print("#######################################")
print(housing.DESCR)
print("#######################################")
print(housing.data.shape)
print("#######################################")
print(housing.data[0:10])
Output:
#######################################
.. _california_housing_dataset:
California Housing dataset
--------------------------
**Data Set Characteristics:**
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive attributes and the target
:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
- Longitude block group longitude
:Missing Attribute Values: None
This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).
An household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surpinsingly large values for block groups with few households
and many empty houses, such as vacation resorts.
It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.
.. topic:: References
- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297
#######################################
(20640, 8)
#######################################
[[ 8.32520000e+00 4.10000000e+01 6.98412698e+00 1.02380952e+00
3.22000000e+02 2.55555556e+00 3.78800000e+01 -1.22230000e+02]
[ 8.30140000e+00 2.10000000e+01 6.23813708e+00 9.71880492e-01
2.40100000e+03 2.10984183e+00 3.78600000e+01 -1.22220000e+02]
[ 7.25740000e+00 5.20000000e+01 8.28813559e+00 1.07344633e+00
4.96000000e+02 2.80225989e+00 3.78500000e+01 -1.22240000e+02]
[ 5.64310000e+00 5.20000000e+01 5.81735160e+00 1.07305936e+00
5.58000000e+02 2.54794521e+00 3.78500000e+01 -1.22250000e+02]
[ 3.84620000e+00 5.20000000e+01 6.28185328e+00 1.08108108e+00
5.65000000e+02 2.18146718e+00 3.78500000e+01 -1.22250000e+02]
[ 4.03680000e+00 5.20000000e+01 4.76165803e+00 1.10362694e+00
4.13000000e+02 2.13989637e+00 3.78500000e+01 -1.22250000e+02]
[ 3.65910000e+00 5.20000000e+01 4.93190661e+00 9.51361868e-01
1.09400000e+03 2.12840467e+00 3.78400000e+01 -1.22250000e+02]
[ 3.12000000e+00 5.20000000e+01 4.79752705e+00 1.06182380e+00
1.15700000e+03 1.78825348e+00 3.78400000e+01 -1.22250000e+02]
[ 2.08040000e+00 4.20000000e+01 4.29411765e+00 1.11764706e+00
1.20600000e+03 2.02689076e+00 3.78400000e+01 -1.22260000e+02]
[ 3.69120000e+00 5.20000000e+01 4.97058824e+00 9.90196078e-01
1.55100000e+03 2.17226891e+00 3.78400000e+01 -1.22250000e+02]]
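The printed array carries no column labels. As a small sketch, the first row of the output above can be paired with the attribute names from the dataset description, which are listed in the same order as `housing.feature_names`. The values are copied from the printed record rather than loaded from the dataset, so nothing is downloaded here; the variable names are made up for the example.

```python
# Attribute names in the order given by the dataset description above.
feature_names = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]

# First row of housing.data, copied from the printed output above.
first_row = [8.3252, 41.0, 6.98412698, 1.02380952,
             322.0, 2.55555556, 37.88, -122.23]

# Pair each column index with its name and value.
labeled = list(zip(range(len(feature_names)), feature_names, first_row))
for index, name, value in labeled:
    print(f"{index}  {name:<10} {value}")
```

Column indices 6 and 7 are Latitude and Longitude, which matters when selecting columns by position.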
2. Building a decision tree with sklearn
Code:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn import tree
import pydotplus
from IPython.display import Image
# Load the dataset
housing = fetch_california_housing()
#print(housing.DESCR)
#print(housing.data.shape)
# Limit the tree to a maximum depth of 2
dtr = tree.DecisionTreeRegressor(max_depth = 2)
# fit takes X and y; here X is only columns 6 and 7 (Latitude and Longitude)
dtr.fit(housing.data[:, [6, 7]], housing.target)
# Export the fitted tree in Graphviz DOT format so it can be drawn
dot_data = \
    tree.export_graphviz(
        dtr, # the fitted decision tree model
        out_file = None,
        feature_names = housing.feature_names[6:8], # names of the two columns used as X
        filled = True,
        impurity = False,
        rounded = True
    )
graph = pydotplus.graph_from_dot_data(dot_data)
graph.get_nodes()[7].set_fillcolor("#FFF2DD")
Image(graph.create_png())
graph.write_png("dtr_white_background.png")
Output (the rendered tree is saved as dtr_white_background.png):
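To make concrete what a regression tree does at each node, here is a minimal pure-Python sketch of choosing one split threshold on a single feature by minimizing the summed squared error of the two sides. It is a simplified illustration, not scikit-learn's implementation; the helper names `sse` and `best_split` and the toy latitude and price values are invented for the example.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, error) for the split x <= threshold with the lowest error.

    Candidate thresholds are midpoints between consecutive distinct sorted x values.
    """
    pairs = sorted(zip(xs, ys))
    best_threshold, best_error = None, float("inf")
    for i in range(1, len(pairs)):
        left_x, right_x = pairs[i - 1][0], pairs[i][0]
        if left_x == right_x:
            continue  # no threshold can separate equal values
        threshold = (left_x + right_x) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        error = sse(left) + sse(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


# Hypothetical toy data: latitude vs. house value (in $100,000).
latitudes = [33.0, 34.0, 34.5, 37.5, 38.0, 39.0]
values = [2.0, 2.2, 2.1, 4.0, 4.2, 3.9]

threshold, error = best_split(latitudes, values)
print(threshold, round(error, 4))
```

A tree with max_depth = 2 repeats this kind of search once more inside each of the two resulting groups, considering every available feature.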
3. Hyperparameter tuning
3.1 Tree model parameters
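criterion: gini or entropy for classification trees. Regression trees such as DecisionTreeRegressor use an error measure such as mean squared error instead.
splitter: best or random. best searches all candidate split points for the best one; random picks the best among randomly drawn split points, which can help when the data set is large.
max_features: None (all features), log2, sqrt, or a number N. With fewer than 50 features, using all of them is usually fine.
max_depth: can be left unset when there are few samples or few features. With many samples and many features, try limiting it.
min_samples_split: a node with fewer than min_samples_split samples is not split any further. With a small sample size this can be left alone; with a very large sample size, increasing it is recommended.
min_samples_leaf: the minimum number of samples a leaf must contain. If a leaf would hold fewer samples than this value, it is pruned together with its sibling. With a small sample size this can be left alone; for larger sets, for example 100,000 samples, try 5.
min_weight_fraction_leaf: the minimum share of the total sample weight that a leaf must hold; a leaf below it is pruned together with its sibling. The default is 0, meaning weights are not considered. Sample weights usually come into play when many samples have missing values or when the class distribution of a classification tree is heavily skewed, and that is when this value needs attention.
max_leaf_nodes: limiting the number of leaves helps prevent overfitting. The default is None, meaning no limit. With a limit, the algorithm builds the best tree it can within that number of leaves. With few features this can be ignored; with very many features, set a limit and choose the value by cross-validation.
class_weight: the weight of each class (classification only), mainly to keep the tree from leaning toward classes that dominate the training set. Weights can be given per class by hand; with "balanced" the algorithm computes them itself, and classes with fewer samples get higher weights.
min_impurity_split: limits tree growth. If the impurity of a node (Gini impurity, entropy, mean squared error, or mean absolute error) is below this threshold, the node produces no children and becomes a leaf. Newer scikit-learn releases replace this parameter with min_impurity_decrease.
n_estimators: the number of trees to build. This is a parameter of ensemble models such as RandomForestRegressor, not of a single tree.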
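A rough sketch of how a few of these limits change the shape of a tree. It fits `DecisionTreeRegressor` on synthetic stand-in data generated with NumPy (not the housing set, so nothing is downloaded) and reports the depth, the number of leaves, and the size of the smallest leaf. The data, the `settings` dictionary, and the specific values 3, 20 and 8 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for this sketch): 500 samples, 2 features.
rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=500)

# One constraint at a time, compared against an unconstrained tree.
settings = {
    "unconstrained": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
    "max_leaf_nodes=8": {"max_leaf_nodes": 8},
}

summary = {}
for label, params in settings.items():
    model = DecisionTreeRegressor(random_state=0, **params).fit(X, y)
    fitted = model.tree_
    is_leaf = fitted.children_left == -1  # leaves have no left child
    summary[label] = {
        "depth": model.get_depth(),
        "leaves": model.get_n_leaves(),
        "smallest_leaf": int(fitted.n_node_samples[is_leaf].min()),
    }
    print(label, summary[label])
```

The unconstrained tree keeps splitting until its leaves are pure, which is why any of these limits is a simple way to rein in overfitting.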
Code:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# Load the dataset
housing = fetch_california_housing()
#print(housing.DESCR)
#print(housing.data.shape)
# Split into training and test sets
data_train, data_test, target_train, target_test = \
    train_test_split(housing.data, housing.target, test_size=0.1, random_state=42)
# Use GridSearchCV to cross-validate and find the best parameters
tree_param_grid = {'min_samples_split': [3, 6, 9], 'n_estimators': [10, 50, 100]}
grid = GridSearchCV(RandomForestRegressor(), param_grid=tree_param_grid, cv=5)
grid.fit(data_train, target_train)
print(grid.best_params_, grid.best_score_)
print("###########################################")
print(grid.cv_results_)
print("******************************************")
# Train the model with the parameters found in the previous step
rfr = RandomForestRegressor(min_samples_split=3, n_estimators=100, random_state=42)
rfr.fit(data_train, target_train)
print(rfr.score(data_test, target_test))
print("###########################################")
# Print the importance of each feature, highest first
result_1 = pd.Series(rfr.feature_importances_, index=housing.feature_names).sort_values(ascending=False)
print(result_1)
Output:
{'min_samples_split': 3, 'n_estimators': 100} 0.8068921554242904
###########################################
{'mean_fit_time': array([0.8004458 , 4.03023052, 7.95025473, 0.74344239, 3.7716157 ,
7.44682593, 0.71324077, 3.56500397, 7.17021012]), 'std_fit_time': array([0.00475824, 0.07281077, 0.04690861, 0.00344107, 0.0298989 ,
0.07583201, 0.00172054, 0.01466244, 0.06738456]), 'mean_score_time': array([0.00940046, 0.04360251, 0.08440475, 0.00740047, 0.03520207,
0.0692039 , 0.00680041, 0.03140182, 0.06140351]), 'std_score_time': array([0.00079997, 0.00119998, 0.00079999, 0.00048992, 0.00146967,
0.0014698 , 0.00040007, 0.00101995, 0.00048998]), 'param_min_samples_split': masked_array(data=[3, 3, 3, 6, 6, 6, 9, 9, 9],
mask=[False, False, False, False, False, False, False, False,
False],
fill_value='?',
dtype=object), 'param_n_estimators': masked_array(data=[10, 50, 100, 10, 50, 100, 10, 50, 100],
mask=[False, False, False, False, False, False, False, False,
False],
fill_value='?',
dtype=object), 'params': [{'min_samples_split': 3, 'n_estimators': 10}, {'min_samples_split': 3, 'n_estimators': 50}, {'min_samples_split': 3, 'n_estimators': 100}, {'min_samples_split': 6, 'n_estimators': 10}, {'min_samples_split': 6, 'n_estimators': 50}, {'min_samples_split': 6, 'n_estimators': 100}, {'min_samples_split': 9, 'n_estimators': 10}, {'min_samples_split': 9, 'n_estimators': 50}, {'min_samples_split': 9, 'n_estimators': 100}], 'split0_test_score': array([0.79160023, 0.81033663, 0.81080664, 0.79252505, 0.80765124,
0.8117226 , 0.79482236, 0.80654121, 0.81009268]), 'split1_test_score': array([0.78890171, 0.79928278, 0.80141166, 0.78598424, 0.79600627,
0.79906143, 0.77847628, 0.80139533, 0.80106729]), 'split2_test_score': array([0.78774617, 0.80061737, 0.80477793, 0.7863239 , 0.80015027,
0.80422593, 0.78908726, 0.7983616 , 0.80104587]), 'split3_test_score': array([0.78962335, 0.80730398, 0.8103788 , 0.791012 , 0.80609438,
0.81097377, 0.79964151, 0.80805967, 0.81052746]), 'split4_test_score': array([0.786275 , 0.80545325, 0.8070847 , 0.7935861 , 0.80582379,
0.80788122, 0.78913747, 0.80539517, 0.80880244]), 'mean_test_score': array([0.78882944, 0.80459911, 0.80689216, 0.7898864 , 0.80314543,
0.80677326, 0.79023322, 0.80395074, 0.80630735]), 'std_test_score': array([0.00178956, 0.00412521, 0.00352203, 0.00315705, 0.00438431,
0.0046761 , 0.00707542, 0.00356223, 0.00432444]), 'rank_test_score': array([9, 4, 1, 8, 6, 2, 7, 5, 3]), 'split0_train_score': array([0.95768184, 0.96868382, 0.96984789, 0.94577487, 0.95695691,
0.95793227, 0.93473427, 0.94331768, 0.94445909]), 'split1_train_score': array([0.95861403, 0.96806912, 0.96997044, 0.94650039, 0.95641239,
0.95734931, 0.93327575, 0.9443791 , 0.94508244]), 'split2_train_score': array([0.9597377 , 0.96897156, 0.97018228, 0.94723203, 0.9570711 ,
0.95760447, 0.93364428, 0.94375518, 0.94538129]), 'split3_train_score': array([0.9594541 , 0.96937818, 0.96999027, 0.94439822, 0.95580838,
0.95731595, 0.93525466, 0.94336581, 0.94493027]), 'split4_train_score': array([0.95786358, 0.96852439, 0.96974745, 0.94598214, 0.95641961,
0.9580851 , 0.93313765, 0.94409264, 0.94537141]), 'mean_train_score': array([0.95867025, 0.96872541, 0.96994767, 0.94597753, 0.95653368,
0.95765742, 0.93400932, 0.94378208, 0.9450449 ]), 'std_train_score': array([0.00082276, 0.00043807, 0.00014658, 0.00093621, 0.00045204,
0.0003075 , 0.00083757, 0.0004105 , 0.00033985])}
******************************************
0.8090829049653158
###########################################
MedInc 0.524257
AveOccup 0.137947
Latitude 0.090622
Longitude 0.089414
HouseAge 0.053970
AveRooms 0.044443
Population 0.030263
AveBedrms 0.029084
dtype: float64
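As a rough sketch of how to read the record above: the grid is the Cartesian product of the two value lists, which gives 9 candidates, each fitted 5 times (cv=5). The snippet below enumerates the candidates with `itertools.product` in the same order as the printed `params` list, pairs them with the `mean_test_score` and `std_test_score` values copied from the output, and picks the best. It uses only the standard library and illustrates the bookkeeping; it is not how `GridSearchCV` is implemented internally, and the variable names are made up for the example.

```python
from itertools import product

param_grid = {"min_samples_split": [3, 6, 9], "n_estimators": [10, 50, 100]}

# Enumerate every combination; the last key varies fastest,
# matching the order of 'params' in the printed cv_results_.
names = sorted(param_grid)
candidates = [dict(zip(names, combo))
              for combo in product(*(param_grid[n] for n in names))]

# Values copied from the printed record above.
mean_test_score = [0.78882944, 0.80459911, 0.80689216, 0.7898864, 0.80314543,
                   0.80677326, 0.79023322, 0.80395074, 0.80630735]
std_test_score = [0.00178956, 0.00412521, 0.00352203, 0.00315705, 0.00438431,
                  0.0046761, 0.00707542, 0.00356223, 0.00432444]

cv_folds = 5
total_fits = len(candidates) * cv_folds  # cross-validation fits only

# Rank candidate indices from best to worst mean score.
ranking = sorted(range(len(candidates)),
                 key=lambda i: mean_test_score[i], reverse=True)
best, runner_up = ranking[0], ranking[1]

print(total_fits, candidates[best], mean_test_score[best])
gap = mean_test_score[best] - mean_test_score[runner_up]
print(f"gap to runner-up: {gap:.5f}, std of best: {std_test_score[best]:.5f}")
```

The gap between the top two candidates (about 0.0001) is far smaller than the fold-to-fold standard deviation of the best one (about 0.0035), so these scores alone do not strongly separate min_samples_split=3 from 6 at 100 trees. Most of the gain in the table comes from raising n_estimators from 10 to 50 or 100.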