This notebook describes and implements a basic approach to solving the Titanic Survival Prediction problem. The prediction is made using a Random Forest Classifier.
1. Exploring training and test sets
First, load required packages.
In [1]:
importreimportnumpyasnpimportpandasaspdimportmatplotlib.pyplotaspltimportwarningsfromsklearn.ensembleimportRandomForestClassifierwarnings.filterwarnings("ignore")plt.style.use('ggplot')
Read training and test sets. Both datasets will be used in exploring and predicting.
In [2]:
train=pd.read_csv("../input/train.csv")test=pd.read_csv("../input/test.csv")
In [3]:
train.sample(frac=1).head(3)
Out[3]:
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
72372402Hodges, Mr. Henry Pricemale50.00025064313.0000NaNS
252613Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...female38.01534707731.3875NaNS
74574601Crosby, Capt. Edward Giffordmale70.011WE/P 573571.0000B22S
In [4]:
test.sample(frac=1).head(3)
Out[4]:
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
24711392Drew, Mr. James Vivianmale42.0112822032.500NaNS
29111833Daly, Miss. Margaret Marcella Maggie""female30.0003826506.950NaNQ
58973Svensson, Mr. Johan Cervinmale14.00075389.225NaNS
Looks like there are missing (NaN) values among both datasets.
In [5]:
train.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
In [6]:
test.info()
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
Cabincolumn stores quite a lot of different qualitative values and has a relatively large amount of missing data.
In [7]:
missing_val_df=pd.DataFrame(index=["Total","Unique Cabin","Missing Cabin"])forname,dfinzip(("Training data","Test data"),(train,test)):total=df.shape[0]unique_cabin=len(df["Cabin"].unique())missing_cabin=df["Cabin"].isnull().sum()missing_val_df[name]=[total,unique_cabin,missing_cabin]missing_val_df
Out[7]:
Training dataTest data
Total891418
Unique Cabin14877
Missing Cabin687327
We shall removeCabincolumns from our dataframes.
Also, we can excludePassengerIdfrom the training set, since IDs are unnecessary for classification.
In [8]:
train.drop("PassengerId",axis=1,inplace=True)fordfintrain,test:df.drop("Cabin",axis=1,inplace=True)
Fill in missing rows inEmbarkedcolumn withS(Southampton Port), since it's the most frequent.
In [9]:
non_empty_embarked=train["Embarked"].dropna()unique_values,value_counts=non_empty_embarked.unique(),non_empty_embarked.value_counts()X=range(len(unique_values))colors=["brown","grey","purple"]plt.bar(left=X,height=value_counts,color=colors,tick_label=unique_values)plt.xlabel("Port of Embarkation")plt.ylabel("Amount of embarked")plt.title("Bar plot of embarked in Southampton, Queenstown, Cherbourg")
Out[9]:
Consider the distributions of passenger ages and fares (excluding NaN values).
In [10]:
survived=train[train["Survived"]==1]["Age"].dropna()perished=train[train["Survived"]==0]["Age"].dropna()fig,(ax1,ax2)=plt.subplots(nrows=2,ncols=1)fig.set_size_inches(12,6)fig.subplots_adjust(hspace=0.5)ax1.hist(survived,facecolor='green',alpha=0.75)ax1.set(title="Survived",xlabel="Age",ylabel="Amount")ax2.hist(perished,facecolor='brown',alpha=0.75)ax2.set(title="Dead",xlabel="Age",ylabel="Amount")
Out[10]:
[,
,
]
In [11]:
survived=train[train["Survived"]==1]["Fare"].dropna()perished=train[train["Survived"]==0]["Fare"].dropna()fig,(ax1,ax2)=plt.subplots(nrows=2,ncols=1)fig.set_size_inches(12,8)fig.subplots_adjust(hspace=0.5)ax1.hist(survived,facecolor='darkgreen',alpha=0.75)ax1.set(title="Survived",xlabel="Age",ylabel="Amount")ax2.hist(perished,facecolor='darkred',alpha=0.75)ax2.set(title="Dead",xlabel="Age",ylabel="Amount")
Out[11]:
[,
,
]
We can clean upAgeandFarecolumns filling in all of the missing values withmedianof all values in the training set.
In [12]:
fordfintrain,test:df["Embarked"].fillna("S",inplace=True)forfeaturein"Age","Fare":df[feature].fillna(train[feature].mean(),inplace=True)
Converting non-numeric columns
All of the non-numeric features exceptEmbarkedaren't particularly informative.
We shall convertEmbarkedandSexcolumns to numeric because we can't feed non-numeric columns into a Machine Learning algorithm.
In [13]:
fordfintrain,test:forkey,valueinzip(("S","C","Q"),(0,1,2)):df.loc[df["Embarked"]==key,"Embarked"]=valueforkey,valueinzip(("female","male"),(0,1)):df.loc[df["Sex"]==key,"Sex"]=value
Map every unique ticket to numeric ID value.
In [14]:
fordfintrain,test:ticket_mapping=dict()tickets=list()timer=0for_,sampleindf.iterrows():ifsample["Ticket"]notinticket_mapping:timer+=1ticket_mapping[sample["Ticket"]]=timertickets.append(timer)df["Ticket"]=tickets
SibSpSibSp+ParchParch+11gives the total number of people in a family.
In [15]:
fordfintrain,test:df["FamilySize"]=df["SibSp"]+df["Parch"]+1
Extract the passengers' titles (Mr., Mrs., Rev., etc.) from their names.
In [16]:
fordfintrain,test:titles=list()forrowindf["Name"]:surname,title,name=re.split(r"[,.]",row,maxsplit=2)titles.append(title.strip())df["Title"]=titles
In [17]:
title=train["Title"]unique_values,value_counts=title.unique(),title.value_counts()X=range(len(unique_values))fig,ax=plt.subplots()fig.set_size_inches(18,10)ax.bar(left=X,height=value_counts,width=0.5,tick_label=unique_values)ax.set_xlabel("Title")ax.set_ylabel("Count")ax.set_title("Passenger titles")ax.grid(color='g',linestyle='--',linewidth=0.5)
Looks like some titles are very rare. Let's map them into related titles.
In [18]:
fordfintrain,test:forkey,valueinzip(("Mr","Mrs","Miss","Master","Dr","Rev"),list(range(6))):df.loc[df["Title"]==key,"Title"]=valuedf.loc[df["Title"]=="Ms","Title"]=1fortitlein"Major","Col","Capt":df.loc[df["Title"]==title,"Title"]=6fortitlein"Mlle","Mme":df.loc[df["Title"]==title,"Title"]=7fortitlein"Don","Sir":df.loc[df["Title"]==title,"Title"]=8fortitlein"Lady","the Countess","Jonkheer":df.loc[df["Title"]==title,"Title"]=9test["Title"][414]=0
Finally, we get
In [19]:
train.sample(frac=1).head(10)
Out[19]:
SurvivedPclassNameSexAgeSibSpParchTicketFareEmbarkedFamilySizeTitle
28503Stankovic, Mr. Ivan133.000000002558.6625110
77412Hocking, Mrs. Elizabeth (Eliza Needs)054.0000001360923.0000051
51211McGough, Mr. James Robert136.0000000042926.2875010
46803Scanlan, Mr. James129.699118003987.7250210
12903Ekstrom, Mr. Johan145.000000001216.9750010
85813Baclini, Mrs. Solomon (Latifa Qurban)024.0000000365819.2583141
17503Klasen, Mr. Klas Albin118.000000111607.8542030
82813McCormack, Mr. Thomas Joseph129.699118006427.7500210
60503Lindell, Mr. Edvard Bengtsson136.0000001049815.5500020
75803Theobald, Mr. Thomas Leonard134.000000005988.0500010
Choose the most informative predictors and randomly split the training data.
In [20]:
fromsklearn.model_selectionimporttrain_test_splitpredictors=["Pclass","Sex","Age","SibSp","Parch","Ticket","Fare","Embarked","FamilySize","Title"]X_train,X_test,y_train,y_test=train_test_split(train[predictors],train["Survived"])
Build a Random Forest model from the training set and evaluate the mean accuracy on the given test set.
In [21]:
forest=RandomForestClassifier(n_estimators=100,criterion='gini',max_depth=5,min_samples_split=10,min_samples_leaf=5,random_state=0)forest.fit(X_train,y_train)print("Random Forest score:{0:.2}".format(forest.score(X_test,y_test)))
Random Forest score: 0.81
Examine the feature importances.
In [22]:
plt.bar(range(len(predictors)),forest.feature_importances_)plt.xticks(range(len(predictors)),predictors,rotation='vertical')
Out[22]:
([,
,
,
,
,
,
,
,
,
],
Pick the best features and make a submission.
In [23]:
predictors=["Title","Sex","Fare","Pclass","Age","Ticket"]clf=RandomForestClassifier(n_estimators=100,criterion='gini',max_depth=5,min_samples_split=10,min_samples_leaf=5,random_state=0)clf.fit(train[predictors],train["Survived"])prediction=clf.predict(test[predictors])submission=pd.DataFrame({"PassengerId":test["PassengerId"],"Survived":prediction})submission.to_csv("submission.csv",index=False)
网友评论