美文网首页
Kaggle|Exercise5:Machine Learnin

Kaggle|Exercise5:Machine Learnin

作者: 十二支箭 | 来源:发表于2020-04-11 19:30 被阅读0次

    来自kaggle官网的机器学习标准化流程。

    Introduction

    Machine learning competitions are a great way to improve your data science skills and measure your progress.

    In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) to improve and see how you stack up to others taking this micro-course.

    The steps in this notebook are:

    1. Build a Random Forest model with all of your data (X and y)
    2. Read in the "test" data, which doesn't include values for the target. Predict home values in the test data with your Random Forest model.
    3. Submit those predictions to the competition and see your score.
    4. Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.
    # Code you have previously used to load data
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    
    # Set up code checking
    import os
    if not os.path.exists("../input/train.csv"):
        os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
        os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
    from learntools.core import binder
    binder.bind(globals())
    from learntools.machine_learning.ex7 import *
    
    # Path of the file to read. We changed the directory structure to simplify submitting to a competition
    iowa_file_path = '../input/train.csv'
    
    home_data = pd.read_csv(iowa_file_path)
    # Create target object and call it y
    y = home_data.SalePrice
    # Create X
    features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
    X = home_data[features]
    
    # Split into validation and training data
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
    
    # Specify Model
    iowa_model = DecisionTreeRegressor(random_state=1)
    # Fit Model
    iowa_model.fit(train_X, train_y)
    
    # Make validation predictions and calculate mean absolute error
    val_predictions = iowa_model.predict(val_X)
    val_mae = mean_absolute_error(val_predictions, val_y)
    print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))
    
    # Using best value for max_leaf_nodes
    iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
    iowa_model.fit(train_X, train_y)
    val_predictions = iowa_model.predict(val_X)
    val_mae = mean_absolute_error(val_predictions, val_y)
    print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(val_mae))
    
    # Define the model. Set random_state to 1
    rf_model = RandomForestRegressor(random_state=1)
    rf_model.fit(train_X, train_y)
    rf_val_predictions = rf_model.predict(val_X)
    rf_val_mae = mean_absolute_error(rf_val_predictions, val_y)
    
    print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))
    
    Validation MAE when not specifying max_leaf_nodes: 29,653
    Validation MAE for best value of max_leaf_nodes: 27,283
    Validation MAE for Random Forest Model: 21,857
    

    Creating a Model For the Competition

    Build a Random Forest model and train it on all of X and y.

    # To improve accuracy, create a new Random Forest model which you will train on all training data
    rf_model_on_full_data = RandomForestRegressor(random_state=1)
    
    # fit rf_model_on_full_data on all data from the training data
    rf_model_on_full_data.fit(X,y)
    

    Make Predictions

    Read the file of "test" data. And apply your model to make predictions

    # path to file you will use for predictions
    test_data_path = '../input/test.csv'
    
    # read test data file using pandas
    test_data = pd.read_csv(test_data_path)
    
    # create test_X which comes from test_data but includes only the columns you used for prediction.
    # The list of columns is stored in a variable called features
    test_X = test_data[features]
    
    # make predictions which we will submit. 
    test_preds = rf_model_on_full_data.predict(test_X)
    
    # The lines below shows how to save predictions in format used for competition scoring
    # Just uncomment them.
    
    output = pd.DataFrame({'Id': test_data.Id,
                           'SalePrice': test_preds})
    output.to_csv('submission.csv', index=False)
    

    Test Your Work

    To test your results, you'll need to join the competition (if you haven't already). So open a new window by clicking on this link. Then click on the Join Competition button.

    Next, follow the instructions below:

    1. Begin by clicking on the blue Save Version button in the top right corner of this window. This will generate a pop-up window.
    2. Ensure that the Save and Run All option is selected, and then click on the blue Save button.
    3. This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the Save Version button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (...) to the right of the most recent version, and select Open in Viewer. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
    4. Click on the Output tab on the right of the screen. Then, click on the Submit to Competition button to submit your results to the leaderboard.

    You have now successfully submitted to the competition!

    1. If you want to keep working to improve your performance, select the blue Edit button in the top right of the screen. Then you can change your model and repeat the process. There's a lot of room to improve your model, and you will climb up the leaderboard as you work.

    Continuing Your Progress

    There are many ways to improve your model, and experimenting is a great way to learn at this point.

    The best way to improve your model is to add features. Look at the list of columns and think about what might affect home prices. Some features will cause errors because of issues like missing values or non-numeric data types.

    The Intermediate Machine Learning micro-course will teach you how to handle these types of features. You will also learn to use xgboost, a technique giving even better accuracy than Random Forest.

    Other Micro-Courses

    The Pandas micro-course will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects.

    You are also ready for the Deep Learning micro-course, where you will build models with better-than-human level performance at computer vision tasks.

    相关文章

      网友评论

          本文标题:Kaggle|Exercise5:Machine Learnin

          本文链接:https://www.haomeiwen.com/subject/spfjmhtx.html