Titanic Project Learning Exercise

Author: 雷_哥 | Published: 2018-08-09 08:01

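    I have been studying ML and DL for a while now. Going forward I plan to use Jianshu to record what I learn and practise on Kaggle, so that the knowledge accumulates. I will also use it to keep notes on AI-related articles.

    Titanic project page: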

    https://www.kaggle.com/c/titanic/kernels?sortBy=voteCount&group=everyone&pageSize=20&language=Python&competitionId=3136

    From that list I studied two kernels:

    1.https://www.kaggle.com/startupsci/titanic-data-science-solutions

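    The first kernel is quite detailed. It lays out an approach to the problem and then verifies that approach step by step while building the features, inspecting each feature visually as it is constructed. It finishes by training several machine learning models, predicting with each, and comparing the results.

    That makes it a good fit for readers who are not yet familiar with machine learning in practice.

    Here is what I took away from it: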

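    # Summary:

    # 1. Split the columns into numeric and string (categorical) types, then look at how each is distributed and how it relates to Survived; describe() gives the overall distribution

    # 2. Also look at combinations of columns against Survived

    # The distributions suggest hypotheses; in this example, that Survived is fairly strongly related to sex, passenger class, age, and so on

    # 3. Drop features: those with many missing values and those with little relation to Survived, such as Name and PassengerId

    # 4. Fill in missing values: the median is recommended for numeric features, the most frequent category for categorical ones

    # 5. Add new features. This takes experience; this example adds FamilySize, IsAlone, and others

    # 6. Map categorical features to numeric values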
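Steps 3 to 6 above can be sketched on a tiny hand-made frame. This is only an illustration under stated assumptions: the rows are invented, and the column names simply mirror the Titanic CSV. It is not the code from either kernel.

```python
import pandas as pd

# Toy stand-in for the Titanic training data (all values are made up)
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Name": ["A, Mr. X", "B, Mrs. Y", "C, Miss. Z", "D, Mr. W", "E, Master. V"],
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [22.0, None, 26.0, 35.0, None],
    "SibSp": [1, 1, 0, 0, 3],
    "Parch": [0, 0, 0, 0, 1],
    "Embarked": ["S", "C", None, "S", "S"],
    "Survived": [0, 1, 1, 0, 0],
})

# Step 3: drop columns that carry little signal for Survived
df = df.drop(["PassengerId", "Name"], axis=1)

# Step 4: median for numeric gaps, most frequent value for categorical gaps
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Step 5: derive new features from existing ones
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Step 6: map categorical values to integers
df["Sex"] = df["Sex"].map({"male": 0, "female": 1}).astype(int)
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)

print(df)
```

On the real data the same calls apply to train_df and test_df alike; the fill values are usually computed from the training set.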

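    # Tips

    # Summary statistics for numeric columns: describe() reports count, mean, std, min, the 25%, 50% and 75% quantiles, and max

    # train_df.describe()

    # Summary statistics for string columns: count, unique, top, freq

    # train_df.describe(include=['O'])

    # sns.FacetGrid

    # Use histograms to analyse numeric features and point plots (sns.pointplot) to analyse categorical ones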
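The two describe() calls from the tips can be tried on a small invented frame. This is a rough sketch, not output from the real data: the values are made up, and the groupby line is my own addition as a tabular counterpart to the point plot.

```python
import pandas as pd

# Toy frame with made-up values, standing in for train_df.
# dtype=object is set explicitly so that include=['O'] selects the column.
train_df = pd.DataFrame({
    "Pclass": [1, 3, 3, 1, 2, 3],
    "Sex": pd.Series(["female", "male", "female", "male", "female", "male"],
                     dtype=object),
    "Fare": [71.3, 7.9, 8.1, 53.1, 13.0, 8.5],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# Numeric columns: count, mean, std, min, quartiles, max
numeric_summary = train_df.describe()

# String columns: count, unique, top, freq
object_summary = train_df.describe(include=["O"])

# Survival rate per category: the numbers a point plot would draw
rate_by_sex = train_df.groupby("Sex")["Survived"].mean()

print(numeric_summary)
print(object_summary)
print(rate_by_sex)
```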

    2.https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python

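    The second kernel mostly lets the code speak. Its feature engineering is similar to the first kernel's, so it is much easier to follow once you are familiar with the first one, and it doubles as a review of the learning methods it uses.

    Overall, building features well requires being comfortable with visual analysis.

    The code circulating online has some errors; the feature engineering code after my corrections follows: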

    # Load in our libraries

    import pandas as pd

    import numpy as np

    import random as rnd

    import re

    # visualization

    import seaborn as sns

    import matplotlib.pyplot as plt

    %matplotlib inline

    # machine learning

    from sklearn.linear_model import LogisticRegression

    from sklearn.svm import SVC, LinearSVC

    from sklearn.ensemble import RandomForestClassifier

    from sklearn.neighbors import KNeighborsClassifier

    from sklearn.naive_bayes import GaussianNB

    from sklearn.linear_model import Perceptron

    from sklearn.linear_model import SGDClassifier

    from sklearn.tree import DecisionTreeClassifier

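    from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier

    # sklearn.cross_validation was removed in scikit-learn 0.20;

    # newer releases provide KFold in sklearn.model_selection (with a different constructor)

    try:

        from sklearn.cross_validation import KFold

    except ImportError:

        from sklearn.model_selection import KFold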

    train_df = pd.read_csv('./data/train.csv')

    test_df = pd.read_csv('./data/test.csv')

    combine = [train_df, test_df]

    train_df['Name_length'] = train_df['Name'].apply(len)

    test_df['Name_length'] = test_df['Name'].apply(len)

    # Feature that tells whether a passenger had a cabin on the Titanic

    train_df['Has_Cabin'] = train_df["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

    test_df['Has_Cabin'] = test_df["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

    # train_df.info()

    # test_df.info()

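    # Drop unused columns (for inspection only: combine still holds the original frames,

    # and the loop below drops these columns from them again)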

    train_df = train_df.drop(['Cabin', 'PassengerId', 'Ticket'], axis=1)

    test_df = test_df.drop(['Cabin', 'PassengerId'], axis=1)

    # print train_df.describe()

    # print train_df.describe()

    # print train_df.info()

    # print train_df

    def get_title(name):

        # Raw string so that \. is passed to the regex engine unchanged

        title_search = re.search(r' ([A-Za-z]+)\.', name)

        # If the title exists, extract and return it.

        if title_search:

            return title_search.group(1)

        return ""

    full_df = []

    for dataset in combine :

        # Handle object (string) columns

        dataset = dataset.drop(['Cabin', 'PassengerId', 'Ticket'], axis=1)

        #Title

        dataset['Title'] = dataset['Name'].apply(get_title)

        dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

        dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')

        dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')

        dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

        # Mapping titles

        title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

        dataset['Title'] = dataset['Title'].map(title_mapping)

        dataset['Title'] = dataset['Title'].fillna(0)

        dataset = dataset.drop('Name',axis=1)

    #    print dataset['Title'].groupby('Title')

        dataset['Sex'] = dataset['Sex'].map({'male':0, 'female':1}).astype(int)

        dataset['Embarked'] = dataset['Embarked'].fillna('S').map({'S':0, 'C':1, 'Q':2}).astype(int)

        # Handle numeric columns

        dataset['Pclass'] = dataset['Pclass'].fillna(dataset['Pclass'].mean())

        dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

        # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc

        dataset['IsAlone'] = 0

        dataset.loc[dataset['FamilySize']==1,'IsAlone'] = 1

        dataset = dataset.drop(['SibSp'], axis=1)

        # Bin Fare using the 25%, 50% and 75% quantiles reported by describe(): 7.9104, 14.4542, 31

        #sns.distplot(train_df['Fare'], bins=10, rug=True);

        dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].dropna().mean())

        dataset.loc[dataset['Fare']<7.9104, 'Fare'] = 0

        dataset.loc[(dataset['Fare']>=7.9104) & (dataset['Fare']<14.4542), 'Fare'] = 1

        dataset.loc[(dataset['Fare']>=14.4542) & (dataset['Fare']<31), 'Fare'] = 2

        dataset.loc[dataset['Fare']>=31, 'Fare'] = 3

        dataset['Fare'] = dataset['Fare'].astype(int)

        # Age: fill missing values, then bin

        dataset['Age'] = dataset['Age'].fillna(dataset['Age'].dropna().median())

    #    sns.distplot(dataset['Age'], bins=10, rug=True);

        # Mapping Age

        dataset.loc[ dataset['Age'] <= 16, 'Age']       = 0

        dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1

        dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2

        dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3

        dataset.loc[dataset['Age'] > 64, 'Age'] = 4

        dataset['Age'] = dataset['Age'].astype(int)

        full_df.append(dataset)

    #    print dataset.head(3)

    #    print 'end ******'

    train,test = full_df

    # Some useful parameters which will come in handy later on

    ntrain = train.shape[0]

    ntest = test.shape[0]

    SEED = 0 # for reproducibility

    NFOLDS = 5 # set folds for out-of-fold prediction

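    # The model_selection KFold takes n_splits and yields index pairs from split();

    # materialize them so get_oof can iterate over kf directly.

    # No shuffling is done, so random_state is not needed.

    from sklearn.model_selection import KFold

    kf = list(KFold(n_splits=NFOLDS).split(np.arange(ntrain)))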

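    # Random Forest parameters

    rf_params = {

        'n_jobs': -1,

        'n_estimators': 500,

        # warm_start=True would make each fold's fit() reuse the trees from the previous fold

        # instead of refitting, so it is turned off for out-of-fold training

        'warm_start': False,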

        #'max_features': 0.2,

        'max_depth': 6,

        'min_samples_leaf': 2,

        'max_features' : 'sqrt',

        'verbose': 0

    }

    # Extra Trees Parameters

    et_params = {

        'n_jobs': -1,

        'n_estimators':500,

        #'max_features': 0.5,

        'max_depth': 8,

        'min_samples_leaf': 2,

        'verbose': 0

    }

    # AdaBoost parameters

    ada_params = {

        'n_estimators': 500,

        'learning_rate' : 0.75

    }

    # Gradient Boosting parameters

    gb_params = {

        'n_estimators': 500,

        #'max_features': 0.2,

        'max_depth': 5,

        'min_samples_leaf': 2,

        'verbose': 0

    }

    # Support Vector Classifier parameters

    svc_params = {

        'kernel' : 'linear',

        'C' : 0.025

    }

    # Class to extend the Sklearn classifier

    class SklearnHelper(object):

        def __init__(self, clf, seed=0, params=None):

            # Copy so the caller's dict is not mutated; tolerate params=None

            params = dict(params or {})

            params['random_state'] = seed

            self.clf = clf(**params)

        def train(self, x_train, y_train):

            self.clf.fit(x_train, y_train)

        def predict(self, x):

            return self.clf.predict(x)

        def fit(self,x,y):

            return self.clf.fit(x,y)

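        def feature_importances(self,x,y):

            # Return the importances as well as printing them, so callers can keep the result

            importances = self.clf.fit(x,y).feature_importances_

            print(importances)

            return importances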

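    # Out-of-fold (OOF) predictions: each base model predicts the training rows of the fold it was not trained on,

    # and predicts the full test set once per fold; the test predictions are averaged over the folds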

    def get_oof(clf, x_train, y_train, x_test):

        oof_train = np.zeros((ntrain,))

        oof_test = np.zeros((ntest,))

        oof_test_skf = np.empty((NFOLDS, ntest))

        for i, (train_index, test_index) in enumerate(kf):

            x_tr = x_train[train_index]

            y_tr = y_train[train_index]

            x_te = x_train[test_index]

            clf.train(x_tr, y_tr)

            oof_train[test_index] = clf.predict(x_te)

            oof_test_skf[i, :] = clf.predict(x_test)

        oof_test[:] = oof_test_skf.mean(axis=0)

        return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

    print('*' * 50)

    print(rf_params)

    # Create 5 objects that represent our 5 models

    rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)

    et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)

    ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)

    gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)

    svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

    # Create Numpy arrays of train, test and target ( Survived) dataframes to feed into our models

    y_train = train['Survived'].values

    train = train.drop(['Survived'], axis=1)

    x_train = train.values # Creates an array of the train data

    x_test = test.values # Creates an array of the test data

    # Create our OOF train and test predictions. These base results will be used as new features

    et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees

    rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest

    ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost

    gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost

    svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier

    print("Training is complete")

    rf_feature = rf.feature_importances(x_train,y_train)

    et_feature = et.feature_importances(x_train, y_train)

    ada_feature = ada.feature_importances(x_train, y_train)

    gb_feature = gb.feature_importances(x_train,y_train)
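For readers on a recent scikit-learn, the out-of-fold step can be sketched with the `sklearn.model_selection.KFold` API. This is a minimal illustration under stated assumptions, not the kernel's code: `get_oof_sketch` is a hypothetical name, the data is random, and a plain RandomForestClassifier stands in for the SklearnHelper wrapper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold


def get_oof_sketch(model, x_train, y_train, x_test, n_splits=5):
    """Return out-of-fold train predictions and fold-averaged test predictions."""
    oof_train = np.zeros(x_train.shape[0])
    oof_test_folds = np.empty((n_splits, x_test.shape[0]))
    kf = KFold(n_splits=n_splits)
    for i, (train_idx, valid_idx) in enumerate(kf.split(x_train)):
        # fit() starts from scratch on each call (warm_start is off by default)
        model.fit(x_train[train_idx], y_train[train_idx])
        # Each training row is predicted by a model that never saw it
        oof_train[valid_idx] = model.predict(x_train[valid_idx])
        # The full test set is predicted once per fold
        oof_test_folds[i, :] = model.predict(x_test)
    oof_test = oof_test_folds.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)


# Synthetic data, purely illustrative
rng = np.random.RandomState(0)
x_train = rng.rand(100, 4)
y_train = (x_train[:, 0] > 0.5).astype(int)
x_test = rng.rand(20, 4)

rf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)
rf_oof_train, rf_oof_test = get_oof_sketch(rf, x_train, y_train, x_test)
print(rf_oof_train.shape, rf_oof_test.shape)
```

The two returned columns are what a stacking setup would feed to a second-level model as new features, one column per base model.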
