2017-12-30


Author: 陆文斌 | Published: 2017-12-30 07:55

    project checklist

    frame the problem

    select a performance measure

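    RMSE: root mean squared error
    MAE: mean absolute error
    The higher the norm index, the more weight large errors get and the more small errors are ignored, so RMSE is more sensitive to outliers than MAE. When the data is close to normally distributed, so that outliers are rare, RMSE performs better.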
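To make the difference between the two measures concrete, here is a minimal sketch using NumPy. The helper names `rmse` and `mae` and the sample numbers are made up for illustration; they are not from the original notes.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))


# Illustrative values only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_close = [4.0, 6.0, 8.0, 10.0]    # every prediction is off by 1
y_outlier = [3.0, 5.0, 7.0, 13.0]  # three exact predictions, one off by 4

print(rmse(y_true, y_close), mae(y_true, y_close))      # 1.0 1.0
print(rmse(y_true, y_outlier), mae(y_true, y_outlier))  # 2.0 1.0
```

Both prediction sets have the same MAE, but the single large error doubles the RMSE, which is the outlier sensitivity described above.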

    Download and load the data

    Take a quick look at the data structure

    data.head()

    data.info()
    data['attribute'].value_counts()
    data.describe()
    You can also plot histograms to see how each numeric attribute is distributed
    data.hist(bins=50, figsize=(20, 15))

    create a test set

    random sampling
    from sklearn.model_selection import train_test_split
    train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
    stratified sampling: split so that each stratum of a category attribute is represented proportionally
    from sklearn.model_selection import StratifiedShuffleSplit
    spliter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, test_index in spliter.split(data, data['category']):
        strat_train_set = data.loc[train_index]
        strat_test_set = data.loc[test_index]
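The effect of stratification is easiest to see by comparing category proportions. The sketch below runs both kinds of split on a synthetic DataFrame; the columns `value` and `category` and the class probabilities are assumptions made up for the example. It uses `.iloc` because `split()` yields positional indices, which only coincide with `.loc` labels when the DataFrame has a default integer index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic, imbalanced data (illustrative only).
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "value": rng.normal(size=1000),
    "category": rng.choice(["a", "b", "c"], size=1000, p=[0.7, 0.2, 0.1]),
})

# Stratified split on the category column.
spliter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in spliter.split(data, data["category"]):
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]

# Purely random split for comparison.
_, rand_test_set = train_test_split(data, test_size=0.2, random_state=42)

# Share of each category in the full data and in each test set.
comparison = pd.DataFrame({
    "overall": data["category"].value_counts(normalize=True),
    "stratified_test": strat_test_set["category"].value_counts(normalize=True),
    "random_test": rand_test_set["category"].value_counts(normalize=True),
})
print(comparison)
```

The `stratified_test` column should track `overall` closely, while `random_test` is free to drift, especially for the rare category.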

    exploring the data: discover and visualize the data to gain insights

    visualizing geographical data
    import matplotlib.pyplot as plt
    housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4, s=housing['population'] / 100, label='population', c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True)
    plt.legend()

    looking for correlations
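A common starting point for this step is the Pearson correlation matrix from pandas' `DataFrame.corr`. The sketch below is only an illustration on synthetic data: the column names echo the plot call above, but the values and the relationship between them are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are invented).
rng = np.random.default_rng(0)
n = 500
median_income = rng.uniform(1.0, 10.0, size=n)
housing = pd.DataFrame({
    "median_income": median_income,
    # House value built to depend on income, plus noise.
    "median_house_value": 40000.0 * median_income + rng.normal(0.0, 20000.0, size=n),
    # Latitude drawn independently of everything else.
    "latitude": rng.uniform(32.0, 42.0, size=n),
})

# Pairwise Pearson correlations between the numeric columns.
corr_matrix = housing.select_dtypes("number").corr()

# Rank attributes by how strongly they correlate with the target.
ranked = corr_matrix["median_house_value"].sort_values(ascending=False)
print(ranked)
```

Note that the correlation coefficient only captures linear relationships, so a value near zero does not rule out a nonlinear dependence.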


          Article link: https://www.haomeiwen.com/subject/pnkggxtx.html