https://www.kaggle.com/c/titanic/
Here is a step-by-step solution
Survival is a binary event: each passenger either survived (1) or did not (0).
Analyze by describing data
Feature selection
Choose: by intuition
Drop: by intuition (the feature does not contribute to survival), or because of a high ratio of duplicates (22%)
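The notes do not say how the 22% duplicate ratio was measured. As a rough sketch, one plausible definition is the share of rows whose value repeats an earlier one. The `Ticket` column name, the toy values, and the `duplicate_ratio` helper below are all assumptions made for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not the real Titanic file).
toy_df = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "C3", "C3", "D4", "E5", "F6", "G7", "H8"],
})

def duplicate_ratio(series: pd.Series) -> float:
    """Share of non-null rows that repeat a value seen elsewhere in the column."""
    return 1 - series.nunique() / series.count()

ratio = duplicate_ratio(toy_df["Ticket"])
print(f"duplicate ratio: {ratio:.0%}")
```

A column with a high ratio and no obvious link to the target is a candidate for dropping.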
A histogram chart: useful for analyzing continuous numerical variables grouped into specific bands
x-axis = the value bands; y-axis = the count of samples
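To make the idea of bands concrete, here is a minimal sketch that computes the numbers a histogram would draw, without plotting. The ages and the 20-year band edges are made up for illustration.

```python
import pandas as pd

# Toy ages (hypothetical values, not the real data).
ages = pd.Series([2, 4, 18, 21, 22, 25, 30, 34, 47, 80], name="Age")

# Split the continuous Age variable into four equal-width bands of 20 years.
bands = pd.cut(ages, bins=[0, 20, 40, 60, 80])

# Each band becomes one bar: the band is its position, the count is its height.
counts = bands.value_counts().sort_index()
print(counts)
```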
pandas package
import pandas as pd
train_df = pd.read_csv('../input/train.csv')
# read the training file into a DataFrame
train_df.head()
train_df.tail()
# preview the first and last few rows of data
train_df.info()
# summarize the columns: data type and non-null count for each feature
train_df.describe()
# Review the survival rate using `percentiles=[.61, .62]`, since the problem description mentions a 38% survival rate.
# Review the Parch distribution using `percentiles=[.75, .8]`
# Review the SibSp distribution using `percentiles=[.68, .69]`
# Review Age and Fare using `percentiles=[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`
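The percentile trick works because `Survived` only takes the values 0 and 1: the percentile at which the value flips from 0 to 1 shows the share of passengers who did not survive. A small sketch on toy data (6 non-survivors out of 10, not the real file):

```python
import pandas as pd

# Toy column: 6 of 10 passengers died, 4 survived (hypothetical values).
toy_df = pd.DataFrame({"Survived": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]})

# Ask for percentiles on either side of the expected 60% boundary.
summary = toy_df.describe(percentiles=[.5, .7])
print(summary)
```

Here the 50th percentile is still 0 and the 70th is already 1, so the boundary lies between them, consistent with the 40% mean.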
train_df.describe(include=['O'])
# summarize the categorical (object) columns: count, unique, top, freq
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
# mean survival rate for each Pclass, sorted from highest to lowest
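The same pivot can be repeated for other features. A minimal sketch, assuming a hypothetical helper `survival_rate_by` and toy data (the `Sex` column and all values here are illustrative):

```python
import pandas as pd

# Toy data (hypothetical values, not the real Titanic file).
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

def survival_rate_by(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Mean of Survived per value of `feature`, highest rate first."""
    return (
        df[[feature, "Survived"]]
        .groupby(feature, as_index=False)
        .mean()
        .sort_values(by="Survived", ascending=False)
    )

by_class = survival_rate_by(toy_df, "Pclass")
by_sex = survival_rate_by(toy_df, "Sex")
print(by_class)
print(by_sex)
```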
Visualization
- confirming assumptions
- using visualizations for analyzing the data.
Correlating:
- numerical features
- numerical and ordinal features
- categorical features
Observations.
- Infants (Age <= 4) had a high survival rate.
- The oldest passengers (Age = 80) survived.
- A large number of 15-25 year olds did not survive.
- Most passengers are in 15-35 age range.
——————————————————
After a series of visualizations, we have some observations, and from them we derive some assumptions.
Wrangle data
- Correcting by dropping features
- Creating new feature extracting from existing
- Completing a numerical continuous feature
- Create new feature combining existing features
Completing a numerical continuous feature
Estimating and completing features with missing or null values.
Method: generate random numbers (this introduces random noise); or use other correlated features (correlation)
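As a sketch of the second method, missing ages can be filled with the median age of passengers who share the same values of correlated features. The choice of `Pclass` and `Sex` as the correlated features, and the toy values, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with two missing ages (hypothetical values).
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex":    ["female", "female", "female", "male", "male", "male"],
    "Age":    [38.0, 42.0, np.nan, 20.0, 24.0, np.nan],
})

# Median Age within each (Pclass, Sex) group, aligned to the original rows.
group_median = toy_df.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Fill only the missing ages; observed ages are left unchanged.
toy_df["Age"] = toy_df["Age"].fillna(group_median)
print(toy_df)
```

Unlike random draws, this gives the same result on every run, so it adds no random noise.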
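Processing the data:
- Fill in missing data (generate random numbers, which introduces random noise, or use the mean)
- Convert categorical data to numeric values
- Handle continuous data, e.g. by setting bands (age to age range)
- Drop irrelevant data
- Combine data that may be related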
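The remaining steps in the list above can be sketched on a toy frame. The band edges, the 0/1 encoding for `Sex`, and the combined `FamilySize` column are illustrative assumptions, not values taken from the notes.

```python
import pandas as pd

# Toy data (hypothetical values, not the real Titanic file).
toy_df = pd.DataFrame({
    "Sex":    ["male", "female", "female", "male"],
    "Age":    [4.0, 22.0, 38.0, 70.0],
    "SibSp":  [1, 0, 1, 0],
    "Parch":  [2, 0, 0, 0],
    "Ticket": ["T1", "T2", "T3", "T4"],
})

# Convert the categorical feature to numeric values.
toy_df["Sex"] = toy_df["Sex"].map({"female": 1, "male": 0})

# Replace continuous Age with an ordinal band index (0 = youngest band).
toy_df["AgeBand"] = pd.cut(
    toy_df["Age"], bins=[0, 16, 32, 48, 64, 80], labels=False
)

# Combine possibly related features into one (the passenger counts too).
toy_df["FamilySize"] = toy_df["SibSp"] + toy_df["Parch"] + 1

# Drop the columns that are irrelevant or now redundant.
toy_df = toy_df.drop(columns=["Ticket", "Age", "SibSp", "Parch"])
print(toy_df)
```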