Titanic: Kaggle competition

Author: 皮皮大 | Published 2019-12-19 23:48

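    The main takeaway from this post is how to do feature engineering. It also covers:

    • Filling missing values with the median
    • Converting string columns to numeric values
    • The modeling workflow

    Import the libraries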

    import numpy as np 
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline
    import seaborn as sns
    sns.set() # setting seaborn default for plots
    
    train = pd.read_csv("/Users/peter/data-visualization/train.csv")
    test = pd.read_csv("/Users/peter/data-visualization/test.csv")
    

    Inspect the data and its missing values

    train.head(3)
    
    train.info()  # Age has many missing values (only 714 of 891 are non-null)
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Name           891 non-null object
    Sex            891 non-null object
    Age            714 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Ticket         891 non-null object
    Fare           891 non-null float64
    Cabin          204 non-null object
    Embarked       889 non-null object
    dtypes: float64(2), int64(5), object(5)
    memory usage: 83.7+ KB
    
    test.head()
    
    print(train.shape)
    print(test.shape)  # test has one column fewer: the Survived target to predict
    
    (891, 12)
    (418, 11)
    
    test.info()   # Age has missing values here too
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 418 entries, 0 to 417
    Data columns (total 11 columns):
    PassengerId    418 non-null int64
    Pclass         418 non-null int64
    Name           418 non-null object
    Sex            418 non-null object
    Age            332 non-null float64
    SibSp          418 non-null int64
    Parch          418 non-null int64
    Ticket         418 non-null object
    Fare           417 non-null float64
    Cabin          91 non-null object
    Embarked       418 non-null object
    dtypes: float64(2), int64(4), object(5)
    memory usage: 36.0+ KB
    
    train.isnull().sum()  # number of missing values per column
    
    PassengerId      0
    Survived         0
    Pclass           0
    Name             0
    Sex              0
    Age            177
    SibSp            0
    Parch            0
    Ticket           0
    Fare             0
    Cabin          687
    Embarked         2
    dtype: int64
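Raw counts are easier to judge as fractions: in the output above, 177 of 891 Age values (about 20%) and 687 of 891 Cabin values (about 77%) are missing. A minimal sketch of computing that ratio directly, using a small made-up frame in place of train (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; values are made up for illustration.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives booleans; their mean is the fraction of missing values per column.
missing_ratio = df.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

Sorting in descending order puts the columns that need the most attention first.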
    

    Bar Chart for Categorical Features

    • Pclass
    • Sex
    • SibSp ( # of siblings and spouse)
    • Parch ( # of parents and children)
    • Embarked
    • Cabin
    def bar_chart(feature):
        # Count the values of `feature` separately for survivors and non-survivors
        survived = train[train['Survived'] == 1][feature].value_counts()
        dead = train[train['Survived'] == 0][feature].value_counts()
        df = pd.DataFrame([survived, dead])
        df.index = ['Survived', 'Dead']
        df.plot(kind='bar', stacked=True, figsize=(10,5))
    
    bar_chart("Sex")
    
    bar_chart('Pclass')
    
    bar_chart('SibSp')
    
    bar_chart('Parch')
    
    # Select the survivors, take their Pclass column, then count the passengers in each class (1, 2, 3)
    train[train['Survived']==1]['Pclass'].value_counts()
    
    1    136
    3    119
    2     87
    Name: Pclass, dtype: int64
    
    train[train['Survived']==1]['Pclass']
    
    1      1
    2      3
    3      1
    8      3
    9      2
          ..
    875    3
    879    1
    880    2
    887    1
    889    1
    Name: Pclass, Length: 342, dtype: int64
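The stacked bars and the counts above are absolute numbers, which are hard to compare when the classes differ in size. One way to normalize, sketched here on a made-up toy frame rather than the real train data, is to take the mean of Survived per group, which is the survival rate:

```python
import pandas as pd

# Toy data; the real train frame has 891 rows.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Survived is 0/1, so its mean within each class is the survival rate.
rate = toy.groupby("Pclass")["Survived"].mean()

# crosstab gives the same counts that bar_chart plots, as a table.
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

print(rate)
print(counts)
```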
    

    Feature engineering

    Feature engineering is the process of using domain knowledge of the data
    to create features (feature vectors) that make machine learning algorithms work.

    In this post, feature engineering mostly means converting the string columns of the raw data into numeric types.

    Name

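    train_test_data = [train, test]  # keep both frames in one list so every transform is applied to train and test alike
    for dataset in train_test_data:
        # str.extract returns the first regex match: the letters right before a period, i.e. the title
        dataset["Title"] = dataset["Name"].str.extract(r'([A-Za-z]+)\.', expand=False)
    
    train["Title"].value_counts()  # count each title in the training set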
    
    Mr          517
    Miss        182
    Mrs         125
    Master       40
    Dr            7
    Rev           6
    Major         2
    Col           2
    Mlle          2
    Lady          1
    Capt          1
    Sir           1
    Countess      1
    Mme           1
    Don           1
    Jonkheer      1
    Ms            1
    Name: Title, dtype: int64
    
    test["Title"].value_counts()  # count each title in the test set
    
    Mr        240
    Miss       78
    Mrs        72
    Master     21
    Rev         2
    Col         2
    Dr          1
    Ms          1
    Dona        1
    Name: Title, dtype: int64
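To see what the regular expression picks up, here is a small sketch that runs the same pattern on four sample names written in the dataset's "Surname, Title. Given names" layout. The Series name `names` is introduced only for this illustration:

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# ([A-Za-z]+)\. captures the first run of letters that is directly followed by a period.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, the result is a Series rather than a one-column DataFrame, which is why it can be assigned straight to a new column.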
    

    Title map

    • Mr : 0
    • Miss : 1
    • Mrs: 2
    • Others: 3
    title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3, "Dr": 3, 
                     "Rev": 3, "Col": 3, "Major": 3, "Mlle": 3,"Countess": 3,
                     "Ms": 3, "Lady": 3, "Jonkheer": 3, "Don": 3, "Dona" : 3,
                     "Mme": 3,"Capt": 3,"Sir": 3 }
    
    dataset['Title']  # the title extracted from each name (dataset still refers to test, the last item of the loop)
    
    0          Mr
    1         Mrs
    2          Mr
    3          Mr
    4         Mrs
            ...  
    413        Mr
    414      Dona
    415        Mr
    416        Mr
    417    Master
    Name: Title, Length: 418, dtype: object
    
    for dataset in train_test_data:
        # Replace each title string with its code from title_mapping
        dataset['Title'] = dataset['Title'].map(title_mapping)  
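One thing to keep in mind with this step: `Series.map` with a dict returns NaN for any value that is not a key. The mapping in this post lists every title by hand, so a title that shows up only in new data would silently become NaN. A minimal sketch of a fallback, using a shortened, hypothetical mapping and assuming unknown titles should go to the "Others" code 3:

```python
import pandas as pd

# Shortened mapping for illustration; "Others" share code 3 as in the post.
title_mapping = {"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3}

titles = pd.Series(["Mr", "Mrs", "Master", "Dona", "Sir"])

# Titles missing from the dict come back as NaN from map();
# fillna(3) sends them to the "Others" bucket instead.
codes = titles.map(title_mapping).fillna(3).astype(int)
print(codes.tolist())
```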
    
    dataset.head(3)  # a new Title column now appears at the end
    
    bar_chart("Title")
    
    # Drop the Name column, which is no longer needed once Title has been extracted
    train.drop('Name', axis=1, inplace=True)
    test.drop('Name', axis=1, inplace=True)
    
    train.head()
    

    Sex

    • male:0
    • female:1
    sex_mapping = {"male": 0, "female": 1}
    for dataset in train_test_data:
        dataset['Sex'] = dataset['Sex'].map(sex_mapping)  # look up each Sex value in sex_mapping and write the code back to the column
    
    bar_chart('Sex')
    

    Age

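    Age has many missing values; fill them with the median.

    Fill with the median using fillna
    # Fill each missing Age with the median age of the passengers who share the same Title
    # Select the column (Age) after groupby; transform then returns one value per row, aligned with the original index
    train['Age'] = train['Age'].fillna(train.groupby('Title')['Age'].transform("median"))
    
    # Group test by its own Title so the medians line up with test's row index
    test['Age'] = test['Age'].fillna(test.groupby('Title')['Age'].transform("median"))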
    
    train.groupby("Title")["Age"].transform("median")
    
    0      30.0
    1      35.0
    2      21.0
    3      35.0
    4      30.0
           ... 
    886     9.0
    887    21.0
    888    21.0
    889    30.0
    890    30.0
    Name: Age, Length: 891, dtype: float64
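The output above has one median per row, not one per group, and that is what makes it usable as the argument to fillna. A small sketch on made-up data (the frame `toy` and its values are assumptions for illustration) shows the two steps separately:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Title": [0, 0, 0, 1, 1, 1],
    "Age":   [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# transform("median") broadcasts each group's median back to every row of that group.
# NaN values are skipped when the median is computed.
group_median = toy.groupby("Title")["Age"].transform("median")

# fillna with a Series fills each missing row from the row with the same index label.
toy["Age"] = toy["Age"].fillna(group_median)

print(group_median.tolist())
print(toy["Age"].tolist())
```

Because fillna aligns on the index, the Series passed in has to come from the same frame that is being filled.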
    
    facet = sns.FacetGrid(train, hue="Survived",aspect=4)
    facet.map(sns.kdeplot,'Age',shade= True)
    facet.set(xlim=(0, train['Age'].max()))
    facet.add_legend()
     
    plt.show() 
    
    facet = sns.FacetGrid(train, hue="Survived", aspect=4)
    facet.map(sns.kdeplot, "Age", shade=True)
    facet.set(xlim=(0, train['Age'].max()))
    facet.add_legend()
    plt.xlim(0, 20)   # zoom in on one age range; repeat with plt.xlim(20, 30), (30, 40), (40, 60)
    
    (0, 20)
    
    train.info()   # Age has now been filled
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
    PassengerId    891 non-null int64
    Survived       891 non-null int64
    Pclass         891 non-null int64
    Sex            891 non-null int64
    Age            891 non-null float64
    SibSp          891 non-null int64
    Parch          891 non-null int64
    Ticket         891 non-null object
    Fare           891 non-null float64
    Cabin          204 non-null object
    Embarked       889 non-null object
    Title          891 non-null int64
    dtypes: float64(2), int64(7), object(3)
    memory usage: 83.7+ KB
    
    Turning a numeric column into a categorical variable
    # Convert Age into a categorical variable:
    # each age band is replaced by an integer code
    for dataset in train_test_data:
        dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
        dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 26), 'Age'] = 1
        dataset.loc[(dataset['Age'] > 26) & (dataset['Age'] <= 36), 'Age'] = 2
        dataset.loc[(dataset['Age'] > 36) & (dataset['Age'] <= 62), 'Age'] = 3
        dataset.loc[ dataset['Age'] > 62, 'Age'] = 4
    
    train.head()
    
    bar_chart("Age")
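The age bands can also be written as a single `pd.cut` call. This is an alternative sketch, not the code used in this post; it assumes the same band edges (16, 26, 36, 62) with the upper edge included in each band, and the sample ages are made up:

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 16.0, 16.5, 26.0, 30.0, 36.0, 62.0, 80.0])

# Intervals are right-closed by default: (-inf, 16], (16, 26], (26, 36], (36, 62], (62, inf)
bins = [-np.inf, 16, 26, 36, 62, np.inf]
age_band = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)

print(age_band.tolist())
```

Keeping the edges in one list makes them easier to change than five separate `.loc` assignments, and the same call would work for the Fare bands with its own list of edges.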
    

    Embarked

    Plot one column broken down by the values of another
    train[train['Pclass']==1]['Embarked']  # Embarked values of the passengers with Pclass == 1
    
    1      C
    3      S
    6      S
    11     S
    23     S
          ..
    871    S
    872    S
    879    C
    887    S
    889    C
    Name: Embarked, Length: 216, dtype: object
    
    # For each Pclass value, take the Embarked column and count how often each port occurs
    Pclass1 = train[train['Pclass']==1]['Embarked'].value_counts()
    Pclass2 = train[train['Pclass']==2]['Embarked'].value_counts()
    Pclass3 = train[train['Pclass']==3]['Embarked'].value_counts()
    
    df = pd.DataFrame([Pclass1, Pclass2, Pclass3])  # stack the three count Series into one DataFrame
    df.index = ['1st class','2nd class', '3rd class']  # label the rows
    df.plot(kind='bar',stacked=True, figsize=(10,5))   # draw a stacked bar chart
    
    df
    
    # fill out missing embark with S embark
    for dataset in train_test_data:
        dataset['Embarked'] = dataset['Embarked'].fillna('S')  # fill missing values with 'S'
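The fill value 'S' is hard-coded above. A hedged alternative is to compute the most frequent value and fill with that, so the choice follows the data. The sketch below uses a made-up Series, not the real Embarked column:

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores NaN by default and returns a Series; [0] takes the first (most frequent) value.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

print(most_common)
print(filled.tolist())
```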
        
    
    Converting the string values to numbers
    embarked_mapping = {"S": 0, "C": 1, "Q": 2}
    for dataset in train_test_data:
        dataset['Embarked'] = dataset['Embarked'].map(embarked_mapping)  # look up each port letter in the dict
    

    Fare

    Fill missing values with the median
    # fill missing Fare with median fare for each Pclass
    
    train["Fare"] = train["Fare"].fillna(train.groupby("Pclass")["Fare"].transform("median"))   # group by Pclass and fill each missing Fare with the median of its class
    test["Fare"] = test["Fare"].fillna(test.groupby("Pclass")["Fare"].transform("median"))
    
    Plot the distribution
    facet = sns.FacetGrid(train, hue="Survived",aspect=4)
    facet.map(sns.kdeplot,'Fare',shade= True)
    facet.set(xlim=(0, train['Fare'].max()))
    facet.add_legend()
     
    plt.show()  
    
    Bin Fare into bands
    for dataset in train_test_data:
        dataset.loc[ dataset['Fare'] <= 17, 'Fare'] = 0
        dataset.loc[(dataset['Fare'] > 17) & (dataset['Fare'] <= 30), 'Fare'] = 1
        dataset.loc[(dataset['Fare'] > 30) & (dataset['Fare'] <= 100), 'Fare'] = 2
        dataset.loc[ dataset['Fare'] > 100, 'Fare'] = 3
    

    Cabin

    train.Cabin.value_counts()
    
    B96 B98        4
    C23 C25 C27    4
    G6             4
    E101           3
    F33            3
                  ..
    C62 C64        1
    D28            1
    D46            1
    B41            1
    E17            1
    Name: Cabin, Length: 147, dtype: int64
    
    for dataset in train_test_data:
        # Keep only the first character of Cabin, the deck letter
        dataset['Cabin'] = dataset['Cabin'].str[:1]   # Cabin is now a single letter
    
    train[train['Pclass']==1]['Cabin'].value_counts()
    
    C    59
    B    47
    D    29
    E    25
    A    15
    T     1
    Name: Cabin, dtype: int64
    
    # Count how often each deck letter occurs within each Pclass
    Pclass1 = train[train['Pclass']==1]['Cabin'].value_counts()
    Pclass2 = train[train['Pclass']==2]['Cabin'].value_counts()
    Pclass3 = train[train['Pclass']==3]['Cabin'].value_counts()
    
    # Build the DataFrame, label the rows, and plot
    df = pd.DataFrame([Pclass1, Pclass2, Pclass3])
    df.index = ['1st class','2nd class', '3rd class']
    df.plot(kind='bar',stacked=True, figsize=(10,5))
    
    cabin_mapping = {"A": 0, "B": 0.4, 
                     "C": 0.8, "D": 1.2, 
                     "E": 1.6, "F": 2, 
                     "G": 2.4, "T": 2.8}  # each deck letter gets its own number
    for dataset in train_test_data:
        dataset['Cabin'] = dataset['Cabin'].map(cabin_mapping)  # replace the deck letter with its number
    
    # fill missing Cabin with the median Cabin code for each Pclass
    train["Cabin"] = train["Cabin"].fillna(train.groupby("Pclass")["Cabin"].transform("median"))
    test["Cabin"] = test["Cabin"].fillna(test.groupby("Pclass")["Cabin"].transform("median"))
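The Cabin steps only work together because missing values pass through the first two untouched: slicing a NaN with `.str[:1]` gives NaN, and mapping NaN gives NaN, so they are still there to be filled at the end. A compact sketch of the whole chain on made-up rows (the frame `toy` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Cabin":  ["C85", "B42", np.nan, "G6", np.nan, "F33"],
})
cabin_mapping = {"A": 0, "B": 0.4, "C": 0.8, "D": 1.2,
                 "E": 1.6, "F": 2, "G": 2.4, "T": 2.8}

deck = toy["Cabin"].str[:1]        # first letter; NaN stays NaN
codes = deck.map(cabin_mapping)    # letter -> number; NaN stays NaN

# Fill what is still missing with the median code of the same Pclass.
class_median = codes.groupby(toy["Pclass"]).transform("median")
toy["Cabin"] = codes.fillna(class_median)

print(toy)
```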
    

    FamilySize

    Add a FamilySize column
    train["FamilySize"] = train["SibSp"] + train["Parch"] + 1
    test["FamilySize"] = test["SibSp"] + test["Parch"] + 1
    
    facet = sns.FacetGrid(train, hue="Survived",aspect=4)
    facet.map(sns.kdeplot,'FamilySize',shade= True)
    facet.set(xlim=(0, train['FamilySize'].max()))
    facet.add_legend()
    plt.xlim(0)
    
    (0, 11.0)
    
    train.head()  # FamilySize now appears as the last column
    
    family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2, 7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}
    for dataset in train_test_data:
        dataset['FamilySize'] = dataset['FamilySize'].map(family_mapping)   
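The values in family_mapping follow a simple rule: each extra family member adds 0.4, so the code is (FamilySize - 1) * 0.4. The sketch below checks that on a few sizes; the formula is my reading of the dict, not something the post states:

```python
import numpy as np
import pandas as pd

family_mapping = {1: 0, 2: 0.4, 3: 0.8, 4: 1.2, 5: 1.6, 6: 2,
                  7: 2.4, 8: 2.8, 9: 3.2, 10: 3.6, 11: 4}

sizes = pd.Series([1, 2, 4, 7, 11])

by_dict = sizes.map(family_mapping)
by_formula = (sizes - 1) * 0.4

# allclose tolerates floating-point rounding such as 3 * 0.4 vs 1.2.
print(np.allclose(by_dict, by_formula))
```

The dict covers sizes 1 to 11 only; a larger family would map to NaN, whereas the formula would still return a number.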
    
    Drop unneeded columns
    # use drop to remove the columns that are no longer needed
    features_drop = ['Ticket', 'SibSp', 'Parch']
    train = train.drop(features_drop, axis=1)
    test = test.drop(features_drop, axis=1)
    train = train.drop(['PassengerId'], axis=1)
    
    train_data = train.drop('Survived', axis=1)  # features only: train without the Survived target
    target = train['Survived']  
    
    train_data.shape, target.shape
    
    ((891, 8), (891,))
    
    train_data.head(10)
    

    Modeling

    Import the models

    # Importing Classifier Modules
    from sklearn.neighbors import KNeighborsClassifier  # k-nearest neighbors
    from sklearn.tree import DecisionTreeClassifier   # decision tree
    from sklearn.ensemble import RandomForestClassifier   # random forest
    from sklearn.naive_bayes import GaussianNB   # Gaussian naive Bayes
    from sklearn.svm import SVC  # support vector machine
    
    train.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 9 columns):
    Survived      891 non-null int64
    Pclass        891 non-null int64
    Sex           891 non-null int64
    Age           891 non-null float64
    Fare          891 non-null float64
    Cabin         891 non-null float64
    Embarked      891 non-null int64
    Title         891 non-null int64
    FamilySize    891 non-null float64
    dtypes: float64(4), int64(5)
    memory usage: 62.8 KB
    

    Cross-validation

    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
    
    # KNN 
    clf = KNeighborsClassifier(n_neighbors = 13)
    scoring = 'accuracy'
    score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
    print(score)
    
    [0.82222222 0.76404494 0.80898876 0.83146067 0.87640449 0.82022472
     0.85393258 0.79775281 0.84269663 0.84269663]
    
    round(np.mean(score)*100, 2)  # KNN score
    
    82.6
    
    # Decision tree
    clf = DecisionTreeClassifier()
    scoring = 'accuracy'
    score = cross_val_score(clf, train_data, 
                            target, cv=k_fold, 
                            n_jobs=1, scoring=scoring)
    print(score)
    
    [0.76666667 0.82022472 0.76404494 0.7752809  0.88764045 0.76404494
     0.83146067 0.82022472 0.74157303 0.79775281]
    
    round(np.mean(score)*100, 2)
    
    79.69
    
    # Random forest
    clf = RandomForestClassifier(n_estimators=13)  # 13 trees in the forest
    scoring = 'accuracy'
    score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
    print(score)
    
    [0.8        0.82022472 0.79775281 0.76404494 0.86516854 0.82022472
     0.80898876 0.80898876 0.75280899 0.80898876]
    
    round(np.mean(score)*100, 2)  # Random Forest Score
    
    80.47
    
    # Naive Bayes
    clf = GaussianNB()
    scoring = 'accuracy'
    score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
    print(score)
    
    [0.85555556 0.73033708 0.75280899 0.75280899 0.70786517 0.80898876
     0.76404494 0.80898876 0.86516854 0.83146067]
    
    round(np.mean(score)*100, 2)
    
    78.78
    
    # SVM
    clf = SVC()
    scoring = 'accuracy'
    score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, scoring=scoring)
    print(score)
    
    [0.83333333 0.80898876 0.83146067 0.82022472 0.84269663 0.82022472
     0.84269663 0.85393258 0.83146067 0.86516854]
    
    round(np.mean(score)*100,2)
    
    83.5
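The five blocks above repeat the same three steps for each model. A minimal sketch of the same comparison as a loop, run on synthetic data from `make_classification` so that it is self-contained (the data, the dict name `models`, and the `random_state` on the tree are my additions, not part of the post):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for train_data / target, with 8 features like the real table.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=13),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

results = {}
for name, clf in models.items():
    # One accuracy per fold; the mean is the number compared in the post.
    scores = cross_val_score(clf, X, y, cv=k_fold, scoring="accuracy")
    results[name] = round(scores.mean() * 100, 2)

# Print the models from best to worst mean accuracy.
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc}")
```

Note that DecisionTreeClassifier and RandomForestClassifier are created without a random_state in the post, so their scores can differ slightly from run to run.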
    

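    Testing

    Of the five models above, the support vector machine has the highest mean cross-validation accuracy (83.5%), so it is used for the final prediction.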

    clf = SVC()
    clf.fit(train_data, target)
    
    test_data = test.drop("PassengerId", axis=1).copy()
    prediction = clf.predict(test_data)
    
    submission = pd.DataFrame({
            "PassengerId": test["PassengerId"],
            "Survived": prediction
        })
    
    submission.to_csv('submission.csv', index=False)  # write the final predictions to a CSV file
    
    submission = pd.read_csv('submission.csv')
    submission.head()  # show the first 5 rows of the file
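Before calling predict, it can be worth checking that the test features match the training features, since a missing column, a different column order, or a leftover NaN will either raise an error or give misleading predictions. A small sketch of such a check on made-up frames (the frames and their values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the real train_data and test_data.
train_data = pd.DataFrame({"Pclass": [1, 3], "Sex": [0, 1], "Age": [2.0, 1.0]})
test_data = pd.DataFrame({"Sex": [1], "Age": [3.0], "Pclass": [2]})

# Every training column must exist in the test frame.
missing = set(train_data.columns) - set(test_data.columns)
assert not missing, f"test is missing columns: {missing}"

# Put the test columns in the same order as the training columns.
test_data = test_data[train_data.columns]

# SVC cannot handle NaN, so make sure none are left.
assert not test_data.isnull().any().any(), "test still contains missing values"

print(list(test_data.columns))
```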
    
