#2.1.5 Working With Missing Data

Author: 禮記 | Published 2017-09-28 19:11

    1.NaN (null value) and None (missing value)

    Missing data can take a few different forms:

    • In Python, the None keyword and type indicates no value.
    • The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.

    In general terms, both NaN and None can be called null values.
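The difference between the two null values can be seen in a minimal sketch. The variable names below are made up for illustration and are not part of the original lesson; the only library call used is pandas.isnull().

```python
import pandas

nothing = None                 # Python's built-in "no value" object
not_a_number = float("nan")    # a floating-point NaN, the kind pandas stores for missing numbers

# None is a singleton, so an identity check is the usual test.
none_is_none = nothing is None

# NaN never compares equal to anything, including itself.
nan_equals_itself = not_a_number == not_a_number

# pandas.isnull() reports both of them as null.
both_null = bool(pandas.isnull(nothing)) and bool(pandas.isnull(not_a_number))

print(none_is_none, nan_equals_itself, both_null)
```

Because NaN is not equal to itself, a comparison such as `value == float("nan")` cannot be used to find missing values, which is why a dedicated function is needed.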

    2.Detecting missing values: pandas.isnull(XXX)

    To see which values are NaN, we can use the pandas.isnull() function. It takes a pandas Series and returns a Series of True and False values, the same way NumPy did when we compared arrays.

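    input
    import pandas

    # titanic_survival is assumed to be a DataFrame that was loaded earlier
    age = titanic_survival["age"]
    print(age.loc[10:20])
    age_is_null = pandas.isnull(age)    # True where the value is NaN or None, False otherwise
    age_null_true = age[age_is_null]    # keep only the entries whose age is missing
    age_null_count = len(age_null_true)
    print(age_null_count)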
    
    output
    10    47.0 
    11    18.0 
    12    24.0 
    13    26.0 
    14    80.0 
    15     NaN 
    16    24.0 
    17    50.0 
    18    32.0 
    19    36.0 
    20    37.0 
    Name: age, dtype: float64 
    264
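The same steps can be tried without the Titanic file. The sketch below uses a small made-up Series (the values are illustrative, not taken from the dataset) and also shows that summing the boolean mask gives the same count as filtering and calling len().

```python
import pandas

# Toy data standing in for titanic_survival["age"]; the values are invented.
ages = pandas.Series([47.0, 18.0, None, 26.0, float("nan")], name="age")

age_is_null = pandas.isnull(ages)      # boolean mask, True where the age is missing
null_ages = ages[age_is_null]          # keep only the missing entries
null_count = len(null_ages)

# True counts as 1 when summed, so the mask can be counted directly.
null_count_from_sum = int(age_is_null.sum())

print(age_is_null.tolist())
print(null_count, null_count_from_sum)
```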
    

    3.Doing arithmetic when null values are present

    
    # Compute the mean age when the column contains null values
    age_is_null = pandas.isnull(titanic_survival["age"])
    good_ages = titanic_survival['age'][age_is_null == False]
    correct_mean_age1 = sum(good_ages) / len(good_ages)

    # Series.mean() skips missing values by default
    correct_mean_age2 = titanic_survival["age"].mean()
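Why the filtering step matters can be shown on a toy Series (invented values, not the Titanic data). In this sketch, Python's built-in sum() lets a single NaN turn the whole result into NaN, while the filtered calculation and Series.mean() agree.

```python
import math
import pandas

ages = pandas.Series([20.0, 30.0, float("nan"), 40.0])

# Naive approach: one NaN in the column makes the whole sum NaN.
naive_mean = sum(ages) / len(ages)

# Filter out the nulls first, as in the code above.
good_ages = ages[pandas.isnull(ages) == False]
filtered_mean = sum(good_ages) / len(good_ages)

# Series.mean() ignores missing values by default.
series_mean = ages.mean()

print(math.isnan(naive_mean), float(filtered_mean), float(series_mean))
```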
    
    

    4.Using a dictionary to compute the mean fare for each passenger class

    input
    passenger_classes = [1, 2, 3]  # the Titanic had three passenger classes: 1, 2 and 3
    fares_by_class = {}   # start with an empty dictionary

    for this_class in passenger_classes:
        pclass_rows = titanic_survival[titanic_survival['pclass'] == this_class]  # all rows for this class
        mean_fares = pclass_rows['fare'].mean()   # mean fare for this class
        fares_by_class[this_class] = mean_fares  # store the result under the class number
    print(fares_by_class)
    
    output
    {1: 87.508991640866881, 2: 21.179196389891697, 3: 13.302888700564973}
    

    5.Using DataFrame.pivot_table()

    Pivot tables provide an easy way to subset by one column and then apply a calculation like a sum or a mean.

    The problem from section 4 can also be solved with DataFrame.pivot_table():

    • The first parameter of the method, index, tells the method which column to group by.

    • The second parameter, values, is the column that we want to apply the calculation to.

    • aggfunc specifies the calculation we want to perform. The default for the aggfunc parameter is actually the mean.
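As a rough check that the pivot table and the dictionary loop compute the same thing, here is a sketch on a made-up five-row DataFrame (the fares are invented). It relies on the default aggfunc, the mean, and assumes a recent pandas version in which pivot_table() returns a DataFrame with one column per entry in values.

```python
import pandas

# Toy stand-in for titanic_survival; the numbers are invented.
toy = pandas.DataFrame({
    "pclass": [1, 1, 2, 2, 3],
    "fare": [80.0, 100.0, 20.0, 30.0, 10.0],
})

# Dictionary loop, as in section 4.
fares_by_class = {}
for this_class in [1, 2, 3]:
    pclass_rows = toy[toy["pclass"] == this_class]
    fares_by_class[this_class] = float(pclass_rows["fare"].mean())

# Pivot table with the default aggfunc (mean).
pivot = toy.pivot_table(index="pclass", values="fare")

print(fares_by_class)
print(pivot)
```

Both approaches group the rows by pclass and average the fare column; the pivot table simply does it in one call.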

    input1
    import numpy

    passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=numpy.mean)
    print(passenger_class_fares)

    output1
    pclass
    1.0    87.508992
    2.0    21.179196
    3.0    13.302889
    Name: fare, dtype: float64
    
    
    input2
    passenger_age = titanic_survival.pivot_table(index="pclass", values="age", aggfunc=numpy.mean)
    print(passenger_age)
    
    output2
    pclass 
    1.0    39.159918 
    2.0    29.506705 
    3.0    24.816367 
    Name: age, dtype: float64
    
    input3
    import numpy
    port_stats = titanic_survival.pivot_table(index='embarked', values=["fare", "survived"], aggfunc=numpy.sum)
    print(port_stats)
    
    
    output3
                    fare  survived
    embarked                      
    C         16830.7922     150.0
    Q          1526.3085      44.0
    S         25033.3862     304.0
    

    6.Dropping missing values: DataFrame.dropna()

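    By default, the method DataFrame.dropna() drops any rows that contain missing values. With axis=1 it drops columns instead, and subset limits the check to the listed columns.

    drop_na_rows = titanic_survival.dropna(axis=0)  # drop every row that contains a missing value
    drop_na_columns = titanic_survival.dropna(axis=1) # drop every column that contains a missing value
    new_titanic_survival = titanic_survival.dropna(axis=0, subset=["age", "sex"])  # drop rows with a missing value in 'age' or 'sex'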
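The effect of the three calls is easier to see on a tiny table. The DataFrame below is made up for illustration (four invented passengers); only the dropna() arguments mirror the code above.

```python
import pandas

# Toy data with missing values scattered over several columns.
toy = pandas.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [22.0, None, 35.0, 41.0],
    "sex": ["female", "male", None, "male"],
    "cabin": [None, "C85", None, "E12"],
})

drop_na_rows = toy.dropna(axis=0)                           # only fully populated rows survive
drop_na_columns = toy.dropna(axis=1)                        # only fully populated columns survive
age_sex_known = toy.dropna(axis=0, subset=["age", "sex"])   # missing cabin values are tolerated

print(list(drop_na_rows["name"]))
print(list(drop_na_columns.columns))
print(list(age_sex_known["name"]))
```

Passing subset keeps far more rows than a plain dropna(axis=0) when a sparsely filled column such as cabin is not needed for the analysis.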
    

    7.DataFrame.loc[4] vs. DataFrame.iloc[4]

    input
    
    # We have already sorted new_titanic_survival by age
    first_five_rows_1 = new_titanic_survival.iloc[5]   # row at integer position 5 (the sixth row in the current order)
    first_five_rows_2 = new_titanic_survival.loc[5]   # row whose index label is 5
    row_index_25_survived = new_titanic_survival.loc[25, 'survived']  # value in the 'survived' column for index label 25
    print(first_five_rows_1)
    print('------------------------------------------')
    print(first_five_rows_2)
    
    output
    pclass                          3
    survived                        0
    name         Connors, Mr. Patrick
    sex                          male
    age                          70.5
    sibsp                           0
    parch                           0
    ticket                     370369
    fare                         7.75
    cabin                         NaN
    embarked                        Q
    boat                          NaN
    body                          171
    home.dest                     NaN
    Name: 727, dtype: object
    ------------------------------------------
    pclass                         1
    survived                       1
    name         Anderson, Mr. Harry
    sex                         male
    age                           48
    sibsp                          0
    parch                          0
    ticket                     19952
    fare                       26.55
    cabin                        E12
    embarked                       S
    boat                           3
    body                         NaN
    home.dest           New York, NY
    Name: 5, dtype: object
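The two rows above differ because sorting moves the rows but keeps their original index labels. A minimal sketch on invented data (four made-up people, sorted by age in descending order) shows the same effect:

```python
import pandas

toy = pandas.DataFrame(
    {"name": ["Ann", "Bob", "Cid", "Dee"], "age": [48, 70, 35, 22]},
    index=[0, 1, 2, 3],
)

# After sorting, the index labels appear in the order 1, 0, 2, 3.
by_age = toy.sort_values("age", ascending=False)

by_position = by_age.iloc[0]       # first row in the current order
by_label = by_age.loc[0]           # row whose index label is 0
single_value = by_age.loc[2, "age"]  # one cell: index label 2, column 'age'

print(by_position["name"], by_label["name"], int(single_value))
```

iloc counts positions in the current row order, while loc looks up the label stored in the index, so the two only coincide when the index is 0, 1, 2, ... in order.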
    

    8.Resetting the index: DataFrame.reset_index(drop=True)

    input
    
    titanic_reindexed = new_titanic_survival.reset_index(drop=True)
    print(titanic_reindexed.iloc[0:5,0:3])
    
    output
       pclass  survived                                               name
    0     1.0       1.0               Barkworth, Mr. Algernon Henry Wilson
    1     1.0       1.0  Cavendish, Mrs. Tyrell William (Julia Florence...
    2     3.0       0.0                                Svensson, Mr. Johan
    3     1.0       0.0                          Goldschmidt, Mr. George B
    4     1.0       0.0                            Artagaveytia, Mr. Ramon
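What drop=True changes can be sketched on a small invented DataFrame. Without it, reset_index() keeps the old labels in a new column named "index"; with it, the old labels are discarded.

```python
import pandas

toy = pandas.DataFrame({"name": ["Ann", "Bob", "Cid", "Dee"], "age": [48, 70, 35, 22]})
by_age = toy.sort_values("age", ascending=False)   # index labels are now 1, 0, 2, 3

reindexed = by_age.reset_index(drop=True)   # fresh 0..n-1 index, old labels thrown away
with_old = by_age.reset_index()             # fresh index, old labels kept in an 'index' column

print(list(reindexed.index), list(reindexed["name"]))
print(list(with_old.columns))
```

After the reset, loc[0] and iloc[0] return the same row again.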
    

    9.Apply Functions Over a DataFrame

    DataFrame.apply() iterates through each column in a DataFrame and applies a function to each one. The function we write takes one parameter, and apply() passes each column to that parameter as a pandas Series.
    In other words, a DataFrame can call apply() to run a function on every column (or, with axis=1, on every row).

    input
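    def not_null_count(column):
        # Despite its name, this function returns the number of null values in the column
        column_null = pandas.isnull(column)  # boolean mask: True where the value is missing
        null = column[column_null]           # keep only the missing entries
        return len(null)
    column_null_count = titanic_survival.apply(not_null_count)
    print(column_null_count)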
    
    output
    pclass          1
    survived        1
    name            1
    sex             1
    age           264
    sibsp           1
    parch           1
    ticket          1
    fare            2
    cabin        1015
    embarked        3
    boat          824
    body         1189
    home.dest     565
    dtype: int64
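A runnable miniature of the column-wise apply() is sketched below. The DataFrame and the helper name null_count are made up for illustration; the comparison with DataFrame.isnull().sum() is an extra cross-check, not something the original lesson uses.

```python
import pandas

toy = pandas.DataFrame({
    "age": [22.0, None, 35.0],
    "cabin": [None, None, "C85"],
    "fare": [7.25, 71.28, 8.05],
})

def null_count(column):
    # apply() hands over one column at a time as a pandas Series.
    column_null = pandas.isnull(column)
    return len(column[column_null])

counts = toy.apply(null_count)        # one result per column
builtin_counts = toy.isnull().sum()   # the same counts via built-in methods

print(counts.to_dict())
```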
    

    10.Applying a Function to a Row

    input
    
    def age_label(row):
        age = row['age']
        if pandas.isnull(age):
            return 'unknown'
        elif age < 18:
            return 'minor'
        else:
            return 'adult'
    # axis=1 makes apply() run the function over rows instead of columns
    age_labels = titanic_survival.apply(age_label, axis=1)
    print(age_labels[0:5])
    
    output
    0    adult
    1    minor
    2    minor
    3    adult
    4    adult
    dtype: object
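To make the role of axis concrete, the sketch below runs a row-wise and a column-wise apply() on the same invented three-row DataFrame. The helper names describe_row and column_length are hypothetical and exist only for this example.

```python
import pandas

toy = pandas.DataFrame({"name": ["Ann", "Bob", "Cid"], "age": [29.0, 2.0, None]})

def describe_row(row):
    # With axis=1, each call receives one row; columns are accessed by name.
    if pandas.isnull(row["age"]):
        return row["name"] + ": unknown"
    return row["name"] + ": " + ("minor" if row["age"] < 18 else "adult")

def column_length(column):
    # With the default axis=0, each call receives one whole column.
    return len(column)

per_row = toy.apply(describe_row, axis=1)   # one result per row, indexed like the rows
per_column = toy.apply(column_length)       # one result per column, indexed by column name

print(per_row.tolist())
print(per_column.to_dict())
```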
    

    11.Calculating Survival Percentage by Age Group

    Now that we have age labels for everyone, let's make a pivot table to find the probability of survival for each age group.
    We have added an "age_labels" column to the dataframe containing the age_labels variable from the previous step.

    input
    age_group_survival = titanic_survival.pivot_table(index="age_labels", values="survived")
    print(age_group_survival)
    
    output
    age_labels
    adult      0.387892
    minor      0.525974
    unknown    0.277567
    Name: survived, dtype: float64
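The mean of a 0/1 survived column is the fraction of survivors, which is why the default aggfunc gives a survival rate here. A sketch on invented data, assuming the labels are already stored in an age_labels column:

```python
import pandas

# Six made-up passengers; survived is 1 for survivors and 0 otherwise.
toy = pandas.DataFrame({
    "age_labels": ["adult", "adult", "minor", "minor", "unknown", "unknown"],
    "survived": [1, 0, 1, 1, 0, 0],
})

# Default aggfunc is the mean, so each group gets survivors / group size.
age_group_survival = toy.pivot_table(index="age_labels", values="survived")

print(age_group_survival["survived"].to_dict())
```

With the real data, a column like this can be attached with an assignment such as titanic_survival["age_labels"] = age_labels before building the pivot table.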
    
