美文网首页
无标题文章

无标题文章

作者: Brakeman | 来源:发表于2017-09-07 23:24 被阅读0次

    Introduction

    This notebook describes and implements a basic approach to solving the Titanic Survival Prediction problem. The prediction is made using a Random Forest Classifier.

    1. Exploring training and test sets

    First, load required packages.

    In [1]:

    importreimportnumpyasnpimportpandasaspdimportmatplotlib.pyplotaspltimportwarningsfromsklearn.ensembleimportRandomForestClassifierwarnings.filterwarnings("ignore")plt.style.use('ggplot')

    Read training and test sets. Both datasets will be used in exploring and predicting.

    In [2]:

    train=pd.read_csv("../input/train.csv")test=pd.read_csv("../input/test.csv")

    In [3]:

    train.sample(frac=1).head(3)

    Out[3]:

    PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked

    72372402Hodges, Mr. Henry Pricemale50.00025064313.0000NaNS

    252613Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...female38.01534707731.3875NaNS

    74574601Crosby, Capt. Edward Giffordmale70.011WE/P 573571.0000B22S

    In [4]:

    test.sample(frac=1).head(3)

    Out[4]:

    PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked

    24711392Drew, Mr. James Vivianmale42.0112822032.500NaNS

    29111833Daly, Miss. Margaret Marcella Maggie""female30.0003826506.950NaNQ

    58973Svensson, Mr. Johan Cervinmale14.00075389.225NaNS

    2. Exploring missing data

    Looks like there are missing (NaN) values among both datasets.

    In [5]:

    train.info()

    RangeIndex: 891 entries, 0 to 890

    Data columns (total 12 columns):

    PassengerId    891 non-null int64

    Survived      891 non-null int64

    Pclass        891 non-null int64

    Name          891 non-null object

    Sex            891 non-null object

    Age            714 non-null float64

    SibSp          891 non-null int64

    Parch          891 non-null int64

    Ticket        891 non-null object

    Fare          891 non-null float64

    Cabin          204 non-null object

    Embarked      889 non-null object

    dtypes: float64(2), int64(5), object(5)

    memory usage: 83.6+ KB

    In [6]:

    test.info()

    RangeIndex: 418 entries, 0 to 417

    Data columns (total 11 columns):

    PassengerId    418 non-null int64

    Pclass        418 non-null int64

    Name          418 non-null object

    Sex            418 non-null object

    Age            332 non-null float64

    SibSp          418 non-null int64

    Parch          418 non-null int64

    Ticket        418 non-null object

    Fare          417 non-null float64

    Cabin          91 non-null object

    Embarked      418 non-null object

    dtypes: float64(2), int64(4), object(5)

    memory usage: 36.0+ KB

    Non-numeric data

    Cabincolumn stores quite a lot of different qualitative values and has a relatively large amount of missing data.

    In [7]:

    missing_val_df=pd.DataFrame(index=["Total","Unique Cabin","Missing Cabin"])forname,dfinzip(("Training data","Test data"),(train,test)):total=df.shape[0]unique_cabin=len(df["Cabin"].unique())missing_cabin=df["Cabin"].isnull().sum()missing_val_df[name]=[total,unique_cabin,missing_cabin]missing_val_df

    Out[7]:

    Training dataTest data

    Total891418

    Unique Cabin14877

    Missing Cabin687327

    We shall removeCabincolumns from our dataframes.

    Also, we can excludePassengerIdfrom the training set, since IDs are unnecessary for classification.

    In [8]:

    train.drop("PassengerId",axis=1,inplace=True)fordfintrain,test:df.drop("Cabin",axis=1,inplace=True)

    Fill in missing rows inEmbarkedcolumn withS(Southampton Port), since it's the most frequent.

    In [9]:

    non_empty_embarked=train["Embarked"].dropna()unique_values,value_counts=non_empty_embarked.unique(),non_empty_embarked.value_counts()X=range(len(unique_values))colors=["brown","grey","purple"]plt.bar(left=X,height=value_counts,color=colors,tick_label=unique_values)plt.xlabel("Port of Embarkation")plt.ylabel("Amount of embarked")plt.title("Bar plot of embarked in Southampton, Queenstown, Cherbourg")

    Out[9]:

    Quantitative data

    Consider the distributions of passenger ages and fares (excluding NaN values).

    In [10]:

    survived=train[train["Survived"]==1]["Age"].dropna()perished=train[train["Survived"]==0]["Age"].dropna()fig,(ax1,ax2)=plt.subplots(nrows=2,ncols=1)fig.set_size_inches(12,6)fig.subplots_adjust(hspace=0.5)ax1.hist(survived,facecolor='green',alpha=0.75)ax1.set(title="Survived",xlabel="Age",ylabel="Amount")ax2.hist(perished,facecolor='brown',alpha=0.75)ax2.set(title="Dead",xlabel="Age",ylabel="Amount")

    Out[10]:

    [,

    ,

    ]

    In [11]:

    survived=train[train["Survived"]==1]["Fare"].dropna()perished=train[train["Survived"]==0]["Fare"].dropna()fig,(ax1,ax2)=plt.subplots(nrows=2,ncols=1)fig.set_size_inches(12,8)fig.subplots_adjust(hspace=0.5)ax1.hist(survived,facecolor='darkgreen',alpha=0.75)ax1.set(title="Survived",xlabel="Age",ylabel="Amount")ax2.hist(perished,facecolor='darkred',alpha=0.75)ax2.set(title="Dead",xlabel="Age",ylabel="Amount")

    Out[11]:

    [,

    ,

    ]

    We can clean upAgeandFarecolumns filling in all of the missing values withmedianof all values in the training set.

    In [12]:

    fordfintrain,test:df["Embarked"].fillna("S",inplace=True)forfeaturein"Age","Fare":df[feature].fillna(train[feature].mean(),inplace=True)

    3. Feature engineering

    Converting non-numeric columns

    All of the non-numeric features exceptEmbarkedaren't particularly informative.

    We shall convertEmbarkedandSexcolumns to numeric because we can't feed non-numeric columns into a Machine Learning algorithm.

    In [13]:

    fordfintrain,test:forkey,valueinzip(("S","C","Q"),(0,1,2)):df.loc[df["Embarked"]==key,"Embarked"]=valueforkey,valueinzip(("female","male"),(0,1)):df.loc[df["Sex"]==key,"Sex"]=value

    Map every unique ticket to numeric ID value.

    In [14]:

    fordfintrain,test:ticket_mapping=dict()tickets=list()timer=0for_,sampleindf.iterrows():ifsample["Ticket"]notinticket_mapping:timer+=1ticket_mapping[sample["Ticket"]]=timertickets.append(timer)df["Ticket"]=tickets

    Generating new features

    SibSpSibSp+ParchParch+11gives the total number of people in a family.

    In [15]:

    fordfintrain,test:df["FamilySize"]=df["SibSp"]+df["Parch"]+1

    Extract the passengers' titles (Mr., Mrs., Rev., etc.) from their names.

    In [16]:

    fordfintrain,test:titles=list()forrowindf["Name"]:surname,title,name=re.split(r"[,.]",row,maxsplit=2)titles.append(title.strip())df["Title"]=titles

    In [17]:

    title=train["Title"]unique_values,value_counts=title.unique(),title.value_counts()X=range(len(unique_values))fig,ax=plt.subplots()fig.set_size_inches(18,10)ax.bar(left=X,height=value_counts,width=0.5,tick_label=unique_values)ax.set_xlabel("Title")ax.set_ylabel("Count")ax.set_title("Passenger titles")ax.grid(color='g',linestyle='--',linewidth=0.5)

    Looks like some titles are very rare. Let's map them into related titles.

    In [18]:

    fordfintrain,test:forkey,valueinzip(("Mr","Mrs","Miss","Master","Dr","Rev"),list(range(6))):df.loc[df["Title"]==key,"Title"]=valuedf.loc[df["Title"]=="Ms","Title"]=1fortitlein"Major","Col","Capt":df.loc[df["Title"]==title,"Title"]=6fortitlein"Mlle","Mme":df.loc[df["Title"]==title,"Title"]=7fortitlein"Don","Sir":df.loc[df["Title"]==title,"Title"]=8fortitlein"Lady","the Countess","Jonkheer":df.loc[df["Title"]==title,"Title"]=9test["Title"][414]=0

    Finally, we get

    In [19]:

    train.sample(frac=1).head(10)

    Out[19]:

    SurvivedPclassNameSexAgeSibSpParchTicketFareEmbarkedFamilySizeTitle

    28503Stankovic, Mr. Ivan133.000000002558.6625110

    77412Hocking, Mrs. Elizabeth (Eliza Needs)054.0000001360923.0000051

    51211McGough, Mr. James Robert136.0000000042926.2875010

    46803Scanlan, Mr. James129.699118003987.7250210

    12903Ekstrom, Mr. Johan145.000000001216.9750010

    85813Baclini, Mrs. Solomon (Latifa Qurban)024.0000000365819.2583141

    17503Klasen, Mr. Klas Albin118.000000111607.8542030

    82813McCormack, Mr. Thomas Joseph129.699118006427.7500210

    60503Lindell, Mr. Edvard Bengtsson136.0000001049815.5500020

    75803Theobald, Mr. Thomas Leonard134.000000005988.0500010

    4. Prediction

    Choose the most informative predictors and randomly split the training data.

    In [20]:

    fromsklearn.model_selectionimporttrain_test_splitpredictors=["Pclass","Sex","Age","SibSp","Parch","Ticket","Fare","Embarked","FamilySize","Title"]X_train,X_test,y_train,y_test=train_test_split(train[predictors],train["Survived"])

    Build a Random Forest model from the training set and evaluate the mean accuracy on the given test set.

    In [21]:

    forest=RandomForestClassifier(n_estimators=100,criterion='gini',max_depth=5,min_samples_split=10,min_samples_leaf=5,random_state=0)forest.fit(X_train,y_train)print("Random Forest score:{0:.2}".format(forest.score(X_test,y_test)))

    Random Forest score: 0.81

    Examine the feature importances.

    In [22]:

    plt.bar(range(len(predictors)),forest.feature_importances_)plt.xticks(range(len(predictors)),predictors,rotation='vertical')

    Out[22]:

    ([,

    ,

    ,

    ,

    ,

    ,

    ,

    ,

    ,

    ],

    )

    Pick the best features and make a submission.

    In [23]:

    predictors=["Title","Sex","Fare","Pclass","Age","Ticket"]clf=RandomForestClassifier(n_estimators=100,criterion='gini',max_depth=5,min_samples_split=10,min_samples_leaf=5,random_state=0)clf.fit(train[predictors],train["Survived"])prediction=clf.predict(test[predictors])submission=pd.DataFrame({"PassengerId":test["PassengerId"],"Survived":prediction})submission.to_csv("submission.csv",index=False)

    相关文章

      网友评论

          本文标题:无标题文章

          本文链接:https://www.haomeiwen.com/subject/hwbwjxtx.html