Titanic Data Processing - Kaggle


Author: 程序猪小羊 | Published: 2018-02-24 05:18

    https://www.kaggle.com/c/titanic/

    Here is a step-by-step solution

    binary event

    Analyze by describing data

    which one is the best in your life and how you want to achieve that

    feature selection
    Choose: intuition
    Drop: intuition (the feature does not contribute to survival), high ratio of duplicates (22%)
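
A duplicate ratio like the one mentioned above can be measured in several ways; the notes do not say how the 22% was computed. The sketch below uses one plausible definition (1 minus unique values over non-null values) on a made-up toy table. The `Ticket` and `PassengerId` column names and the helper `duplicate_ratio` are assumptions for illustration.

```python
import pandas as pd

# Toy data for illustration only; these are not real Titanic values.
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Ticket": ["A1", "A1", "B2", "C3", "C3"],
})


def duplicate_ratio(series: pd.Series) -> float:
    """Hypothetical helper: share of non-null rows that repeat an existing value."""
    non_null = series.dropna()
    return 1.0 - non_null.nunique() / len(non_null)


# 3 unique tickets over 5 rows, so 2 of 5 rows are repeats.
ticket_dup_ratio = duplicate_ratio(toy_df["Ticket"])
# Every PassengerId is unique, so nothing repeats.
id_dup_ratio = duplicate_ratio(toy_df["PassengerId"])

print(f"Ticket: {ticket_dup_ratio:.2f}, PassengerId: {id_dup_ratio:.2f}")
```

A column with a high ratio and no obvious link to the target is a candidate for dropping; a ratio of zero (as for an ID column) means the column is just a row label.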

    A histogram chart is useful for analyzing continuous numerical variables, grouped into specific bands

    y-axis = the count of samples (the x-axis shows the value bands)
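
What a histogram draws can be reproduced without plotting: split the values into bands and count the samples in each band, which gives the bar heights. A minimal sketch with `pd.cut`, using made-up ages and arbitrarily chosen band edges (both are assumptions):

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([2, 4, 18, 22, 25, 31, 35, 47, 80], name="Age")

# Band edges are an arbitrary choice; intervals are right-inclusive, e.g. (0, 20].
bins = [0, 20, 40, 60, 80]
bands = pd.cut(ages, bins=bins)

# Number of samples per band, in band order: this is the height of each bar.
band_counts = bands.value_counts().sort_index()
print(band_counts)
```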

    pandas package

    import pandas as pd

    # Read the training file
    train_df = pd.read_csv('../input/train.csv')

    # Show the first and last few rows of the data
    train_df.head()
    train_df.tail()

    # Print column names, dtypes, and non-null counts for each feature
    train_df.info()

    # Summary statistics for the numerical columns. Pass custom percentiles to probe a distribution:
    # Survived: `percentiles=[.61, .62]` (the problem description mentions a 38% survival rate)
    # Parch: `percentiles=[.75, .8]`
    # SibSp: `percentiles=[.68, .69]`
    # Age and Fare: `percentiles=[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`
    train_df.describe()

    # Summary of the categorical (object) columns: count, unique, top, freq
    train_df.describe(include=['O'])

    # Mean survival rate for each Pclass, sorted from highest to lowest
    train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
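
Why percentiles such as `.61` and `.62` are informative for a 0/1 column: the percentile at which the value flips from 0 to 1 marks the share of zeros. The sketch below shows this on a synthetic column (549 zeros and 342 ones, about 38% ones) rather than the real file, and calls `Series.quantile` directly instead of `describe`.

```python
import pandas as pd

# Synthetic 0/1 column for illustration: 549 zeros and 342 ones (about 38% ones).
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# The mean of a 0/1 column is the share of ones.
survival_rate = survived.mean()

# The value flips from 0 to 1 between the 61st and 62nd percentiles,
# so roughly 61-62% of the rows are zeros.
cut_points = survived.quantile([0.61, 0.62])

print(round(survival_rate, 4))
print(cut_points.tolist())
```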
    

    Visualization

    • confirming assumptions
    • using visualizations for analyzing the data.

    Correlating:
    numerical features
    numerical and ordinal feature
    categorical features

    Observations.

    • Infants (Age <= 4) had a high survival rate.
    • The oldest passengers (Age = 80) survived.
    • A large number of 15-25 year olds did not survive.
    • Most passengers are in the 15-35 age range.

    ——————————————————
    After a series of visualizations, we have some observations, from which we derive some assumptions.
    Wrangle data

    • Correcting by dropping features
    • Creating a new feature by extracting from an existing one
    • Completing a numerical continuous feature
    • Creating a new feature by combining existing features
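
Two of the wrangling steps above, extracting a feature and combining features, can be sketched as follows. The data is made up; the `Name` column, the derived `Title` and `FamilySize` names, and the regular expression are assumptions for illustration, not taken from the notes.

```python
import pandas as pd

# Made-up rows in a "Surname, Title. Given name" format; illustrative only.
toy_df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Miss. Jane", "Poe, Mrs. Ann"],
    "SibSp": [1, 0, 2],
    "Parch": [0, 0, 1],
})

# Extract: take the word that directly precedes a period as the title.
toy_df["Title"] = toy_df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Combine: siblings/spouses + parents/children + the passenger themselves.
toy_df["FamilySize"] = toy_df["SibSp"] + toy_df["Parch"] + 1

print(toy_df[["Title", "FamilySize"]])
```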

    Completing a numerical continuous feature
    Estimating and completing features with missing or null values.
    Methods: random numbers (this introduces random noise); using other correlated features
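
Both completion methods can be sketched on a toy table. The choice of `Pclass` and `Sex` as the correlated features, the use of the group median, and the uniform range of one standard deviation around the mean are assumptions for illustration; the notes do not fix these details.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 20.0, 24.0, np.nan],
})

# Method 1: use correlated features. Fill each gap with the median Age
# of the rows that share the same Pclass and Sex.
group_median = toy_df.groupby(["Pclass", "Sex"])["Age"].transform("median")
toy_df["AgeFilled"] = toy_df["Age"].fillna(group_median)

# Method 2: random numbers between mean - std and mean + std.
# Simple, but every run with a different seed gives different values (random noise).
rng = np.random.default_rng(0)
mean, std = toy_df["Age"].mean(), toy_df["Age"].std()
missing = toy_df["Age"].isna()
toy_df["AgeRandom"] = toy_df["Age"]
toy_df.loc[missing, "AgeRandom"] = rng.uniform(mean - std, mean + std, size=int(missing.sum()))

print(toy_df)
```

The group-based fill is deterministic and respects differences between groups, which is why it is usually preferred over the random fill when a correlated feature is available.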

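    Processing the data:

    • Fill in missing data (random number generation, though this introduces random noise; or the mean)
    • Convert categorical data to numeric values
    • Handle continuous data, e.g. by setting bands (age → age range)
    • Drop irrelevant data
    • Combine possibly related data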
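
The encoding, banding, and dropping steps from the list above might look like this on a toy table. The 0/1 mapping for `Sex`, the band edges, and the `Cabin` column chosen for dropping are assumptions for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [4.0, 22.0, 38.0, 80.0],
    "Cabin": ["C85", None, None, "E46"],
})

# Convert a categorical column to numbers with an explicit mapping.
toy_df["Sex"] = toy_df["Sex"].map({"female": 1, "male": 0}).astype(int)

# Replace a continuous column with ordinal band codes (0 = youngest band).
# Intervals are right-inclusive: (0, 16], (16, 32], (32, 48], (48, 64], (64, 80].
toy_df["AgeBand"] = pd.cut(toy_df["Age"], bins=[0, 16, 32, 48, 64, 80], labels=False)

# Drop a column judged irrelevant (here, one that is mostly missing).
toy_df = toy_df.drop(columns=["Cabin"])

print(toy_df)
```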
