Titanic Data Processing - Kaggle

Author: 程序猪小羊 | Published: 2018-02-24 05:18 | Views: 47

https://www.kaggle.com/c/titanic/

Here is a step-by-step solution

Survival is a binary event: each passenger either survived or did not.

Analyze by describing data

which one is the best in your life and how you want to achieve that

Feature selection
Choose: by intuition
Drop: by intuition (the feature does not contribute to survival), or because of a high ratio of duplicate values (22%)

A histogram chart is useful for analyzing continuous numerical variables by grouping them into specific bands.

The y-axis is the count of samples in each band; the x-axis is the variable itself.
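
The banding idea can be sketched without drawing anything, by counting how many samples fall into each band. This is only an illustration: the ages below are made up, and the 20-year band width is an assumption rather than something the notes specify.

```python
import pandas as pd

# Hypothetical ages standing in for the Age column of train.csv.
ages = pd.Series([2, 4, 19, 22, 24, 28, 30, 35, 54, 80], name="Age")

# Split the continuous variable into 20-year bands (right edge inclusive).
bands = pd.cut(ages, bins=[0, 20, 40, 60, 80])

# Count the samples per band; these counts are the bar heights of a histogram.
band_counts = bands.value_counts().sort_index()
print(band_counts)
```

With matplotlib installed, `ages.plot.hist(bins=4)` should draw a comparable chart directly, although its bin edges are derived from the data range rather than from the fixed edges above.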

pandas package

import pandas as pd

train_df = pd.read_csv('../input/train.csv')
# read the training file into a DataFrame

train_df.head()
train_df.tail()
# show the first and the last few rows of the data

train_df.info()
# summarize the data columns: names, non-null counts, and dtypes of the features

train_df.describe()
# Review the survival rate using `percentiles=[.61, .62]`, knowing the problem description mentions a 38% survival rate.
# Review the Parch distribution using `percentiles=[.75, .8]`
# Review the SibSp distribution using `percentiles=[.68, .69]`
# Review Age and Fare using `percentiles=[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`
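
Why `[.61, .62]` works for a 0/1 column can be seen on a small stand-in. The sketch below uses a hypothetical `Survived` series with 62 zeros and 38 ones, not the real training data.

```python
import pandas as pd

# Hypothetical Survived column: 62 passengers lost (0), 38 survived (1).
survived = pd.Series([0] * 62 + [1] * 38, name="Survived")

# Ask describe() for the two percentiles that straddle the 0 -> 1 switch.
summary = survived.describe(percentiles=[.61, .62])
print(summary)

# The 61st percentile is still 0 while the 62nd is already above 0,
# so about 62% of the rows are 0 and about 38% are 1.
```

The `mean` row of the same summary gives the survival rate directly, so the percentile trick is mainly a cross-check.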

train_df.describe(include=['O'])
# describe the categorical (object) columns: count, unique, top, freq

train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# compare the mean survival rate across Pclass values, highest first
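
The same pivot can be tried on a tiny frame to see what it returns. The eight rows below are invented for illustration; only the column names `Pclass` and `Survived` come from the notes.

```python
import pandas as pd

# Hypothetical sample: two 1st-class, two 2nd-class, four 3rd-class passengers.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column per group is the survival rate of that group.
rates = (
    df[["Pclass", "Survived"]]
    .groupby("Pclass", as_index=False)
    .mean()
    .sort_values(by="Survived", ascending=False)
)
print(rates)
```

Because `Survived` is 0/1, the group mean reads directly as a rate, which makes this a quick way to check whether a categorical or ordinal feature is related to survival.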

Visualization

  • confirming assumptions
  • using visualizations for analyzing the data.

Correlating:
numerical features
numerical and ordinal feature
categorical features

Observations.

  • Infants (Age <=4) had high survival rate.
  • Oldest passengers (Age = 80) survived.
  • Large number of 15-25 year olds did not survive.
  • Most passengers are in 15-35 age range.

——————————————————
After a series of visualizations, we have some observations, from which we derive some assumptions.
Wrangle data

  • Correcting by dropping features
  • Creating new feature extracting from existing
  • Completing a numerical continuous feature
  • Create new feature combining existing features

Completing a numerical continuous feature
Estimating and completing features with missing or null values.
Methods: generate random numbers (this introduces random noise); or use other correlated features (correlation).
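
A minimal sketch of the second method, filling missing values from correlated features. It assumes `Pclass` and `Sex` are the correlated features and uses the group median as the estimate; the notes do not fix either choice, and the rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the NaN entries are the missing ages to complete.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age of each (Pclass, Sex) group, aligned row by row with df.
group_median = df.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Fill only the missing ages; known ages are left unchanged.
df["Age"] = df["Age"].fillna(group_median)
print(df)
```

Unlike drawing random numbers, this gives the same result on every run, so it avoids adding random noise to the model.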

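Processing the data:

  • Fill in missing data (generate random numbers, which introduces random noise, or use the mean)
  • Convert categorical data to numeric values
  • Process continuous data, e.g. by setting bands (age → age range)
  • Drop irrelevant data
  • Combine data that may be related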
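
The steps above can be sketched end to end on a few invented rows. The column names follow the Titanic training file, but the age band edges, the choice of `Ticket` as the column to drop, and the `FamilySize` combination are assumptions made for illustration, not choices stated in these notes.

```python
import pandas as pd

# Hypothetical rows shaped like the Titanic training data.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 38.0, 70.0],
    "SibSp": [1, 0, 1, 0],
    "Parch": [0, 0, 2, 0],
    "Ticket": ["A/5 21171", "PC 17599", "113803", "113803"],
})

# 1. Fill in missing data (here with the mean of the known ages).
df["Age"] = df["Age"].fillna(df["Age"].mean())

# 2. Convert categorical data to numeric values.
df["Sex"] = df["Sex"].map({"female": 1, "male": 0})

# 3. Band the continuous Age column; labels=False yields band indices 0..4.
df["AgeBand"] = pd.cut(df["Age"], bins=[0, 16, 32, 48, 64, 100], labels=False)

# 4. Drop a column assumed to be irrelevant.
df = df.drop(columns=["Ticket"])

# 5. Combine related columns into one feature (the passenger counts too).
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
print(df)
```

The order is deliberate in one place: the missing ages are filled before banding, because `pd.cut` would otherwise leave those rows without a band.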
