2018-03-03

Author: 杨小彤 | Published: 2018-03-03 21:19


Exploring a Dataset: Titanic Data

I. Loading the data

import pandas as pd
import numpy as np

df = pd.read_csv('titanic-data.csv')
df

# Inspect column names, dtypes and non-null counts
df.info()

# Summary statistics for the numeric columns
df.describe()

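Overview of the data:
1. There are 891 records in total.
2. The overall survival rate is 38.4%, and the average passenger age is about 30.
3. The Age, Cabin and Embarked columns contain missing values.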
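The overview figures can be read off `df.info()` and `df.describe()`, but they can also be computed directly. The following is a minimal sketch on a hypothetical six-row frame named `toy` (a stand-in, not the real `titanic-data.csv`), showing how the null counts, the survival rate and the mean age could be obtained.

```python
import numpy as np
import pandas as pd

# Hypothetical six-row stand-in for titanic-data.csv (not the real data)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 4.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan, "G6"],
    "Embarked": ["S", "C", "S", np.nan, "Q", "S"],
})

missing_per_column = toy.isnull().sum()  # number of nulls in each column
survival_rate = toy["Survived"].mean()   # the mean of a 0/1 column is the rate
mean_age = toy["Age"].mean()             # NaN values are skipped by default

print(missing_per_column)
print(f"survival rate: {survival_rate:.1%}, mean age: {mean_age:.2f}")
```

On the real data the same three expressions would be applied to `df` instead of `toy`.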

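II. Questions

Survived is the dependent variable; Pclass, Sex, Age, SibSp, Parch, Fare, Cabin and Embarked are the independent variables. The goal is to analyze how strongly each independent variable affects survival. Three hypotheses:

1. Sex affects survival: women survived at a significantly higher rate than men.

2. Social class affects survival: passengers of higher social class survived at a significantly higher rate than passengers of lower class.

3. Age affects survival: the elderly and children survived at a significantly higher rate than middle-aged adults.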
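The hypotheses speak of "significantly" higher rates, which the charts alone cannot confirm. One way to check such a claim is a Pearson chi-square test on a 2x2 table of group against outcome. The sketch below uses only the standard library; the function name `chi_square_2x2` is hypothetical and the counts are illustrative placeholders, so in practice the table would come from something like `pd.crosstab(df.Sex, df.Survived)`.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Illustrative counts only: rows = (female, male), columns = (survived, died)
counts = [[233, 81], [109, 468]]
chi2 = chi_square_2x2(counts)

# Critical value of the chi-square distribution, 1 degree of freedom, 5% level
CRITICAL_5_PERCENT_DF1 = 3.841
print(f"chi2 = {chi2:.1f}, significant at 5%: {chi2 > CRITICAL_5_PERCENT_DF1}")
```

A statistic above the critical value would support calling the difference significant; the same function could be reused for any two-group comparison.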

III. Data wrangling

Handling missing values: Embarked (2 nulls)

# Locate the null values
df.Embarked[df.Embarked.isnull()]

61     NaN
829    NaN
Name: Embarked, dtype: object

# Count the passengers for each Embarked value
df.groupby('Embarked').Survived.count()

Embarked
C    168
Q     77
S    644
Name: Survived, dtype: int64

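Embarked takes only three values, and the counts show that the large majority of passengers have 'S'. With only two nulls, both can be filled with 'S'.

# Fill with the mode
df["Embarked"] = df["Embarked"].fillna("S")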
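Hard-coding 'S' works here because the counts were inspected by hand. A slightly more general sketch computes the mode instead of typing it in; the `embarked` series below is a hypothetical stand-in for the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Embarked column
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores NaN and returns a Series (ties are possible); take the first entry
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

print(most_common)
print(filled.value_counts())
```

On the real frame this would read `df["Embarked"].fillna(df["Embarked"].mode()[0])`.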

Handling missing values: Age (177 nulls)

# Fill in the missing ages with a RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

def set_missing_ages(df):
    # Take the existing numeric columns as input for the random forest regressor
    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    # Split the passengers into known-age and unknown-age groups
    # (.to_numpy() replaces .as_matrix(), which was removed in pandas 1.0)
    known_age = age_df[age_df.Age.notnull()].to_numpy()
    unknown_age = age_df[age_df.Age.isnull()].to_numpy()
    y = known_age[:, 0]   # y is the target: age
    X = known_age[:, 1:]  # X holds the feature values
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the unknown ages with the fitted model
    predictedAges = rfr.predict(unknown_age[:, 1:])
    # Fill the originally missing values with the predictions
    df.loc[df.Age.isnull(), 'Age'] = predictedAges
    return df, rfr

set_missing_ages(df)
df
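For comparison, a much simpler imputation is sometimes used as a baseline: fill each missing age with the median age of the passenger's class. This is a different technique from the random forest above, not a reimplementation of it, and the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with two missing ages
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median age per class, aligned row by row with the original frame
class_median = toy.groupby("Pclass")["Age"].transform("median")
toy["Age_filled"] = toy["Age"].fillna(class_median)

print(toy)
```

Comparing downstream results under both imputations would give a rough sense of how sensitive the conclusions are to the way Age was filled.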

Type conversion: converting Sex/Cabin/Embarked to integers

# Column dtypes
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

# Encode sex as an integer
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)
df['Sex'].value_counts()  # counts

1    577
0    314
Name: Sex, dtype: int64

There are 577 male passengers and 314 female passengers.

# Passengers with a cabin value get 1; missing values get 0, taken to mean no assigned cabin
df.loc[df.Cabin.notnull(), 'Cabin'] = 1
df.loc[df.Cabin.isnull(), 'Cabin'] = 0
df['Cabin'].value_counts()  # counts

# Encode the port of embarkation
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
df['Embarked'].value_counts()  # counts

204 passengers have a cabin and 687 do not. 646 passengers embarked at port S, 168 at port C, and 77 at port Q.
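Mapping S/C/Q to 0/1/2 is fine for the group-by charts in this post, but the integers suggest an ordering that ports do not have. If the column were later fed to a model, one-hot encoding is a common alternative. A minimal sketch on a hypothetical four-value series:

```python
import pandas as pd

# Hypothetical stand-in for the Embarked column before integer encoding
embarked = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# One 0/1 indicator column per port, so no artificial order is implied
dummies = pd.get_dummies(embarked, prefix="Embarked", dtype=int)

print(dummies)
```

Each row has exactly one indicator set to 1, so the port can still be recovered from the encoded columns.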

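Feature construction: building two new features from SibSp and Parch (total family size, familysize, and whether the passenger is traveling alone, isalone)

df.loc[:, 'SibSp']  # number of siblings/spouses aboard
df.loc[:, 'Parch']  # number of parents/children aboard
# Add a column for family size (the passenger included)
df['familysize'] = df.loc[:, 'SibSp'] + df.loc[:, 'Parch'] + 1
# Add a column for traveling alone; the initial value 0 means not alone
df['isalone'] = 0
# Where familysize is 1, set isalone to 1 (traveling alone)
df.loc[df['familysize'] == 1, 'isalone'] = 1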
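As a small worked example of the two derived features, the sketch below builds them on a hypothetical four-row frame and derives `isalone` in one step from a boolean comparison rather than by initializing the column and overwriting part of it.

```python
import pandas as pd

# Hypothetical stand-in: four passengers with different family situations
toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# Family size counts the passenger too, hence the + 1
toy["familysize"] = toy["SibSp"] + toy["Parch"] + 1
# True/False converted to 1/0: 1 means traveling alone
toy["isalone"] = (toy["familysize"] == 1).astype(int)

print(toy)
```

Only the second passenger, with no siblings, spouse, parents or children aboard, ends up with `isalone` equal to 1.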

IV. Exploration

A. Univariate analysis

1. Pclass

# Number of passengers in each class
df.groupby('Pclass')['PassengerId'].count()

Pclass
1    216
2    184
3    491

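import matplotlib.pyplot as plt
import seaborn as sns

df.groupby('Pclass')['PassengerId'].count().plot(kind="pie", autopct="%.0f%%")
plt.title('Pclass VS Count')
plt.show()

There are 216 first-class passengers (24%), 184 second-class passengers (21%) and 491 third-class passengers (55%). Every group has more than 30 samples, so each is large enough to be statistically meaningful.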
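The percentages on the pie chart and the "more than 30 samples" rule of thumb used throughout this post can both be checked numerically. A minimal sketch using the class counts printed above (the variable names are hypothetical):

```python
import pandas as pd

# Class counts as printed by the groupby above
counts = pd.Series({1: 216, 2: 184, 3: 491}, name="passengers")

share = (counts / counts.sum() * 100).round(1)  # percentage of all passengers
large_enough = counts > 30                      # the post's rule-of-thumb threshold

print(pd.DataFrame({"count": counts, "share_%": share, "n>30": large_enough}))
```

The same two lines apply to any of the grouped counts in the following sections.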

2. Sex

df.groupby('Sex')['PassengerId'].count()  # number of female and male passengers
df.groupby('Sex')['PassengerId'].count().plot(kind="pie", autopct="%.0f%%")
plt.title('Sex VS Count')
plt.show()

Sex
0    314
1    577

# Passengers can also be split into men, women and children by adding a new column
# that combines the age and sex attributes
def male_female_child(passenger):
    age, sex = passenger
    if age < 16:
        return 2  # children are coded as 2
    else:
        return sex

# Add the column
df["Person"] = df[["Age", "Sex"]].apply(male_female_child, axis=1)
df["Person"].value_counts()  # number of men, women and children
df.groupby('Person')['PassengerId'].count().plot(kind="pie", autopct="%.0f%%")
plt.title('Person VS Count')
plt.show()

1.0    532
0.0    264
2.0     95

There are 532 adult men (about 60%), 264 adult women (about 30%) and 95 children (about 11%). Every group has more than 30 samples, so each is statistically meaningful.
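The row-wise `apply` above is easy to read but slow on large frames. A vectorized sketch of the same rule (under 16 means child, otherwise keep the sex code) using `np.where`, on a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in; Sex is already encoded as 0 = female, 1 = male
toy = pd.DataFrame({"Age": [4.0, 30.0, 15.9, 16.0], "Sex": [1, 1, 0, 0]})

# Where age < 16 use code 2 (child), otherwise fall back to the sex code
toy["Person"] = np.where(toy["Age"] < 16, 2, toy["Sex"])

print(toy)
```

Note the boundary: a passenger aged exactly 16 is treated as an adult, matching the `age < 16` test in the function above. The result here stays an integer column, whereas the `apply` version produced floats.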

3. isalone

df.groupby('isalone')['PassengerId'].count()
df.groupby('isalone')['PassengerId'].count().plot(kind="pie", autopct="%.0f%%")
plt.title('isalone VS Count')
plt.show()

537 passengers are traveling alone (60%) and 354 are traveling with family (40%). Both groups have more than 30 samples, so both are statistically meaningful.

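4. Age

bins = [0, 12, 18, 65, 100]  # four age groups: children, teenagers, adults, elderly
df['Age_group'] = pd.cut(df['Age'], bins)  # add the 'Age_group' column
df.groupby('Age_group')['PassengerId'].count()  # number of passengers per age group
df.groupby('Age_group')['PassengerId'].count().plot(kind="pie", autopct="%.0f%%")
plt.title('Age_group VS count')
plt.show()

69 passengers are aged 0-12, 70 are aged 12-18 and 567 are aged 18-65; each of these three groups has more than 30 samples and is statistically meaningful. Only 8 passengers are aged 65-100; that sample is too small, would carry a large error, and is not statistically meaningful.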
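By default `pd.cut` builds right-closed intervals, so an age of exactly 12 lands in the first group and exactly 18 in the second. The sketch below makes that visible and attaches readable labels; the label names and the sample ages are hypothetical.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([0.5, 12, 12.5, 18, 40, 65, 80])
bins = [0, 12, 18, 65, 100]
labels = ["child", "teen", "adult", "senior"]  # hypothetical label names

# Intervals are (0, 12], (12, 18], (18, 65], (65, 100]
age_group = pd.cut(ages, bins, labels=labels)

print(pd.DataFrame({"age": ages, "group": age_group}))
```

Named labels also make the pie and bar charts easier to read than the raw interval notation.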

5. Fare

bins = [0, 10, 50, 100, 300, 520]  # five fare bands
df['Fare_group'] = pd.cut(df['Fare'], bins)  # add the 'Fare_group' column
df.groupby('Fare_group')['PassengerId'].count()
df.groupby('Fare_group')['PassengerId'].count().plot(kind="pie", autopct="%.0f%%")
plt.title('Fare_group VS count')
plt.show()

321 passengers paid 0-10 dollars, 395 paid 10-50 dollars, 107 paid 50-100 dollars and 50 paid 100-300 dollars; each of these four bands has more than 30 samples and is statistically meaningful. Only 3 passengers paid 300-520 dollars; that sample is too small to be statistically meaningful.
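The five band counts add up to 876 rather than 891. One possible explanation, stated here as an assumption and not verified against the data, is that fares of exactly 0 fall outside the first interval, because `pd.cut` excludes the left edge by default. A minimal sketch of that behavior on hypothetical fares:

```python
import pandas as pd

# Hypothetical fares, including one that is exactly 0
fares = pd.Series([0.0, 7.25, 10.0, 512.33])
bins = [0, 10, 50, 100, 300, 520]

default_groups = pd.cut(fares, bins)                         # first bin is (0, 10]
inclusive_groups = pd.cut(fares, bins, include_lowest=True)  # first bin includes 0

print(default_groups.isnull().sum(), "unassigned with the default")
print(inclusive_groups.isnull().sum(), "unassigned with include_lowest=True")
```

Passing `include_lowest=True` keeps zero-fare passengers in the lowest band instead of silently dropping them from the grouped counts.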

6. Cabin

df.groupby('Cabin')['PassengerId'].count()  # number of passengers with and without a cabin
df.groupby('Cabin')['PassengerId'].count().plot(kind="pie", autopct="%.0f%%")
plt.title('Cabin VS count')
plt.show()

687 passengers have no recorded cabin and 204 have one; both groups are statistically meaningful.

7. Embarked

df.groupby('Embarked')['PassengerId'].count()  # number of passengers who embarked at S, C and Q
df.groupby('Embarked')['PassengerId'].count().plot(kind="pie", autopct="%.0f%%")
plt.title('Embarked VS count')
plt.show()

646 passengers embarked at port S, 168 at port C and 77 at port Q; all three groups are statistically meaningful.

B. Descriptive analysis

1. Does sex affect the survival rate?

x = df[['Sex', 'Survived']].groupby(['Sex']).mean()  # survival rate by sex
plt.bar([0, 1], [x.loc[0, 'Survived'], x.loc[1, 'Survived']], 0.5, color='g', alpha=0.7)
plt.xticks([0, 1], ['female', 'male'])
plt.xlabel('Sex')
plt.ylabel('survived_rate')
plt.title('Sex VS Survived')
plt.show()

As the chart shows, women survived at a higher rate, roughly 50 percentage points higher than men.

x1 = df[['Person', 'Survived']].groupby(['Person']).mean()  # survival rate for women, men and children
plt.bar([0, 1, 2], [x1.loc[0, 'Survived'], x1.loc[1, 'Survived'], x1.loc[2, 'Survived']], 0.5, color='g', alpha=0.7)
plt.xticks([0, 1, 2], ['female', 'male', 'kids'])
plt.xlabel('Person')
plt.ylabel('survived_rate')
plt.title('Person VS Survived')
plt.show()

Women and children survived at a higher rate than adult men.

2. Does traveling with family affect the survival rate?

d = df[['isalone', 'Survived']].groupby(['isalone']).mean()  # survival rate with and without family
plt.bar([0, 1], [d.loc[0, 'Survived'], d.loc[1, 'Survived']], 0.5, color='r', alpha=0.5)
plt.xticks([0, 1], ['notalone', 'alone'])
plt.xlabel('isalone')
plt.ylabel('survived_rate')
plt.title('isalone VS Survived')
plt.show()

As the chart shows, passengers traveling with family survived at a higher rate.

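%matplotlib inline
# sns.factorplot was renamed to sns.catplot in seaborn 0.9
sns.catplot(x="isalone", data=df, hue="Person", kind="count")  # men, women and children by isalone
plt.xlabel('isalone')
plt.ylabel('count')
plt.title('isalone_Person VS count')

Adult men make up most of the passengers traveling alone, and their low survival rate pulls down the survival rate of that group, so survival may not actually depend on whether a passenger had family aboard.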
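The reasoning here is about confounding: the two groups differ in their mix of men and women, so the raw gap may come from that mix. One way to probe this is to compare survival rates within each Person group. The sketch below uses a small, deliberately constructed hypothetical dataset in which the overall gap disappears once the groups are compared separately.

```python
import pandas as pd

# Hypothetical counts: (isalone, Person, survivors, total); Person 0 = woman, 1 = man
groups = [
    (1, 1, 2, 6),  # alone, men
    (1, 0, 2, 2),  # alone, women
    (0, 1, 1, 3),  # with family, men
    (0, 0, 4, 4),  # with family, women
]
rows = []
for isalone, person, survivors, total in groups:
    for k in range(total):
        rows.append({"isalone": isalone, "Person": person, "Survived": int(k < survivors)})
toy = pd.DataFrame(rows)

# Raw comparison: passengers with family look better off
overall = toy.groupby("isalone")["Survived"].mean()
# Stratified comparison: within each Person group the rates are identical
within = toy.groupby(["Person", "isalone"])["Survived"].mean().unstack("isalone")

print(overall)
print(within)
```

On the real data the same two group-bys on `df` would show how much of the isalone gap survives within each Person group.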

3. Does social class affect the survival rate?

p = df[['Pclass', 'Survived']].groupby(['Pclass']).mean()  # survival rate by class
plt.bar([0, 1, 2], [p.loc[1, 'Survived'], p.loc[2, 'Survived'], p.loc[3, 'Survived']], 0.5, color='b', alpha=0.7)
plt.xticks([0, 1, 2], [1, 2, 3])
plt.xlabel('Pclass')
plt.ylabel('survived_rate')
plt.title('Pclass VS Survived')
plt.show()

sns.catplot(x="Pclass", data=df, hue="Person", kind="count")  # men, women and children in each class
plt.xlabel('Pclass')
plt.ylabel('count')
plt.title('Pclass_Person VS count')

Although a higher class goes with a higher survival rate, adult men make up the majority of third-class passengers, so the higher survival rate of the upper classes reflects sex as well as class.

4. Does age affect the survival rate?

by_age = df.groupby('Age_group')['Survived'].mean()
by_age.plot(kind="bar")
plt.xlabel('Age_group')
plt.ylabel('survived_rate')
plt.title('Age_group VS Survived')

As the chart shows, children and teenagers have higher survival rates.

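sns.catplot(x="Age_group", data=df, hue="Pclass", kind="count")  # class distribution in each age group
plt.xlabel('Age_group')
plt.ylabel('count')
plt.title('Age_group_Pclass VS count')

The share of first-class passengers is higher among adults than among children and teenagers, yet adults survived at a lower rate than children and teenagers, which suggests that class affected survival less than age did.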

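sns.catplot(x="Age_group", data=df, hue="Sex", kind="count")  # sex distribution in each age group
plt.xlabel('Age_group')
plt.ylabel('count')
plt.title('Age_group_Sex VS count')

Among adults, men outnumber women by more than 50%, while the other age groups have roughly balanced sex ratios. So the lower survival rate of adults compared with children and teenagers may reflect sex as well as age.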

5. Does fare affect the survival rate?

plt.figure(figsize=(10, 5))
df['Fare'].hist(bins=70)  # split the fares into 70 bins
plt.xlabel('Fare')
plt.ylabel('count')
plt.title('Fare VS count')
df.boxplot(column='Fare', by='Pclass', showfliers=False)
plt.xlabel('Pclass')
plt.ylabel('Fare')
plt.show()

fare_not_survived = df["Fare"][df["Survived"] == 0]
fare_survived = df["Fare"][df["Survived"] == 1]
average_fare = pd.DataFrame([fare_not_survived.mean(), fare_survived.mean()])
std_fare = pd.DataFrame([fare_not_survived.std(), fare_survived.std()])
average_fare.plot(yerr=std_fare, kind='bar', legend=False)
plt.xlabel('Survived')
plt.ylabel('Fare')
plt.title('Fare VS Survived')
plt.show()

Fare is correlated with survival to some degree: survivors paid a higher average fare than non-survivors.
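Standard-deviation error bars can overlap heavily even when two group means differ clearly, because they ignore the sample sizes. One way to quantify the difference in means is Welch's t statistic. The sketch below uses only the standard library; the function name `welch_t` and the fare values are hypothetical.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference between two sample means."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Standard error of the difference, without assuming equal variances
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical fares, not taken from the dataset
fare_survived = [71.3, 53.1, 26.0, 80.0, 13.0, 30.5]
fare_not_survived = [7.25, 8.05, 21.1, 7.9, 13.0, 8.5]

t = welch_t(fare_survived, fare_not_survived)
print(f"Welch t = {t:.2f}")
```

A large positive value indicates that the first group's mean is well above the second's relative to the sampling noise; on the real data the two `df["Fare"]` selections above would be passed in as lists.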

6. Does having a cabin affect the survival rate?

c = df[['Cabin', 'Survived']].groupby(['Cabin']).mean()
plt.bar([0, 1], [c.loc[0, 'Survived'], c.loc[1, 'Survived']], 0.5, color='c', alpha=0.9)
plt.xticks([0, 1], ['isnull', 'notnull'])
plt.xlabel('Cabin')
plt.ylabel('survived_rate')
plt.title('Cabin VS Survived')
plt.show()

As the chart shows, passengers with a cabin number survived at a higher rate; passengers with a missing value may not have had a cabin.

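sns.catplot(x="Cabin", data=df, hue="Person", kind="count")  # adult men, women and children with and without a cabin
plt.xlabel('Cabin')
plt.ylabel('count')
plt.title('Cabin_Person VS count')

Men make up 75% of the passengers without a cabin but only about 50% of the passengers with one, and the survival rate without a cabin is 37 percentage points lower than with one, so the survival gap between the two groups is influenced by age and sex.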

7. Does the port of embarkation affect the survival rate?

e = df[['Embarked', 'Survived']].groupby(['Embarked']).mean()
plt.bar([0, 1, 2], [e.loc[0, 'Survived'], e.loc[1, 'Survived'], e.loc[2, 'Survived']], 0.5, color='g', alpha=0.4)
plt.xticks([0, 1, 2], ['S', 'C', 'Q'])
plt.xlabel('Embarked')
plt.ylabel('survived_rate')
plt.title('Embarked VS Survived')
plt.show()

As the chart shows, passengers who embarked at port S have the lowest survival rate and those who embarked at port C have the highest.

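sns.catplot(x="Embarked", data=df, hue="Person", kind="count")  # adult men, women and children by port
plt.xlabel('Embarked')
plt.ylabel('count')
plt.title('Embarked_Person VS count')

The share of men among passengers who embarked at port S is very high, so the survival rate for port S may really be driven by sex and age.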

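V. Conclusions

1. The data used in this report does not cover all passengers: it has 891 samples, and those samples contain a number of missing values, so the sample may be biased. Although the sample cannot fully represent the whole population, it is drawn from that population and is fairly large, so the analysis is still persuasive.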
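To give a rough sense of how much a sample of this size pins down a rate, the sketch below computes a normal-approximation 95% confidence interval for the overall survival rate from the two figures reported earlier (891 samples, 38.4%). It assumes the sample behaves like a simple random sample of all passengers, which is an assumption and not something the post establishes.

```python
import math

n = 891        # sample size reported in the post
p_hat = 0.384  # overall survival rate reported in the post
z = 1.96       # normal quantile for a 95% interval

# Standard error of a proportion under simple random sampling (assumed)
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - z * standard_error, p_hat + z * standard_error

print(f"95% CI for the survival rate: {low:.3f} to {high:.3f}")
```

The interval only reflects sampling noise; it says nothing about bias from missing values or from how the sample was selected.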

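2. The data processing introduces some bias and uncertainty.
(1) Age: there are 177 missing values, filled in with a RandomForestRegressor; the predicted values certainly differ from the true ages of those passengers.
(2) Cabin: there are 687 missing values. I split Cabin into two classes, passengers with a cabin value and passengers without one, which assumes that a missing value means the passenger had no assigned cabin. That assumption does not necessarily hold: the missing values may include many passengers who did have a cabin but whose record was lost, so some bias is possible.
(3) Embarked: there are 2 missing values, which I filled with the mode. Embarked is very likely related to Pclass and Fare, since tickets generally cost more for a higher class and a longer journey, but with only two missing values out of 891 samples this introduces little bias.

3. Whether a passenger could swim and the passenger's physical fitness may also have affected survival, but the dataset contains no such fields.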
