Kaggle | Exercise 3: Underfitting and Overfitting

Author: 十二支箭 | Published 2020-04-10 19:14

    A standardized machine learning workflow from the official Kaggle site.

    Recap

    You've built your first model, and now it's time to optimize the size of the tree to make better predictions. Run this cell to set up your coding environment where the previous step left off.

    # Code you have previously used to load data
    import pandas as pd
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    
    
    # Path of the file to read
    iowa_file_path = '../input/home-data-for-ml-course/train.csv'
    
    home_data = pd.read_csv(iowa_file_path)
    # Create target object and call it y
    y = home_data.SalePrice
    # Create X
    features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
    X = home_data[features]
    
    # Split into validation and training data
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
    
    # Specify Model
    iowa_model = DecisionTreeRegressor(random_state=1)
    # Fit Model
    iowa_model.fit(train_X, train_y)
    
    # Make validation predictions and calculate mean absolute error
    val_predictions = iowa_model.predict(val_X)
    val_mae = mean_absolute_error(val_y, val_predictions)
    print("Validation MAE: {:,.0f}".format(val_mae))
    
    # Set up code checking
    from learntools.core import binder
    binder.bind(globals())
    from learntools.machine_learning.ex5 import *
    print("\nSetup complete")
    print("Validation MAE: {:.2f}".format(val_mae))
    
    Validation MAE: 29,653
    
    Setup complete
    Validation MAE: 29652.93
    

    Exercises

    You could write the function get_mae yourself. For now, we'll supply it. This is the same function you read about in the previous lesson. Just run the cell below.

    def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
        model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
        model.fit(train_X, train_y)
        preds_val = model.predict(val_X)
        mae = mean_absolute_error(val_y, preds_val)
        return mae
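
    For example, calling get_mae with a single candidate size returns the validation MAE for a tree capped at that many leaf nodes:

    print(get_mae(50, train_X, val_X, train_y, val_y))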
    

    Step 1: Compare Different Tree Sizes

    Write a loop that tries the following values for max_leaf_nodes from a set of possible values.

    Call the get_mae function on each value of max_leaf_nodes. Store the output in some way that allows you to select the value of max_leaf_nodes that gives the most accurate model on your data.
    In other words: write a loop that finds and stores the value of max_leaf_nodes that gives the smallest MAE. A first attempt:

    candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
    # Write loop to find the ideal tree size from candidate_max_leaf_nodes
    
    for i in candidate_max_leaf_nodes:
        mae_ = get_mae(i, train_X, val_X, train_y, val_y)
        print("when max_leaf_nodes is %d, the MAE is %.0f" % (i, mae_))
    

    This works, but it isn't ideal: we want to both print and store the results. For (value, score) pairs like these, a dictionary is the natural container, and a dict comprehension is more concise than an explicit loop body (and often a little faster).
    Note that to pull the key with the smallest value out of a dictionary, use min(scores, key=scores.get).
    min(iterable, key=...) iterates over the dictionary's keys, passes each key to the key function, and compares the values that function returns, finally returning the key whose value is smallest. Here that is the tree size with the lowest MAE.
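
    As a minimal illustration of that pattern on a toy dictionary (not part of the exercise):

    # min(d, key=d.get) maps each key to its value via d.get and
    # returns the key whose value is smallest.
    d = {"a": 3, "b": 1, "c": 2}
    print(min(d, key=d.get))  # prints: b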

    candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
    # Write loop to find the ideal tree size from candidate_max_leaf_nodes
    scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y)
              for leaf_size in candidate_max_leaf_nodes}
    
    # Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
    best_tree_size = min(scores, key=scores.get)
    
    # Check your answer
    step_1.check()
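
    Optionally, print the winning size and its MAE to confirm the choice:

    print("best max_leaf_nodes: %d (MAE %.0f)" % (best_tree_size, scores[best_tree_size]))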
    

    Step 2: Fit Model Using All Data

    You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size. That is, you don't need to hold out the validation data now that you've made all your modeling decisions.

    # Fill in argument to make optimal size and uncomment
    final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
    
    # fit the final model and uncomment the next two lines
    final_model.fit(X, y)
    
    # Check your answer
    step_2.check()
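
    As an optional sanity check (not part of the exercise), compare the refit model's in-sample predictions with the actual prices for a few houses; they should be close but not exact, since the tree is capped at best_tree_size leaves:

    print(final_model.predict(X.head()))
    print(y.head().tolist())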
    

    You've tuned this model and improved your results. But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. In the next step you will learn to use Random Forests to improve your models even more.
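
    As a preview of that next step, here is a minimal sketch (not part of this exercise) of swapping in scikit-learn's RandomForestRegressor on the same train/validation split; even with no tuning, it typically beats a single tuned tree:

    from sklearn.ensemble import RandomForestRegressor

    # Averaging many trees reduces variance, so the forest is far less
    # sensitive to tree-size choices than a single decision tree.
    rf_model = RandomForestRegressor(random_state=1)
    rf_model.fit(train_X, train_y)
    rf_val_mae = mean_absolute_error(val_y, rf_model.predict(val_X))
    print("Random Forest validation MAE: {:,.0f}".format(rf_val_mae))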

    Keep Going

    You are ready for Random Forests.

    To be continued
