Pandas - 10.2 Transform and Filter

Author: 陈天睡懒觉 | Published: 2022-07-30 22:23

    transform

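    Unlike aggregation, which reduces each group to a single value, a transformation returns one value per input row, so the number of rows does not change. Standardization is a typical example: each value is standardized with the mean and standard deviation of its own group rather than of the whole dataset.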

    import pandas as pd
    df = pd.read_csv('data/gapminder.tsv', sep='\t')
    
    def my_zscore(x):
        return ((x - x.mean())/x.std())
    
    transform_z = df.groupby('year').lifeExp.transform(my_zscore)
    print(transform_z.shape) # (1704,)
    print(df.shape) # (1704, 6)
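
    The same contrast between aggregation and transformation can be sketched on a small in-memory table, with no data file needed. The toy DataFrame and its column names are assumptions made for illustration only; they are not part of the gapminder data.

```python
import pandas as pd

# Hypothetical toy data: two groups with three values each
toy = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

def my_zscore(x):
    # standardize with the group's own mean and sample standard deviation
    return (x - x.mean()) / x.std()

# agg reduces each group to one value: the result has one row per group
group_means = toy.groupby('group')['value'].agg('mean')

# transform returns one value per input row, aligned with the original index
group_z = toy.groupby('group')['value'].transform(my_zscore)

print(group_means.shape)  # (2,)
print(group_z.shape)      # (6,)
print(group_z.tolist())   # [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

    Both groups produce the same z-scores even though their raw values differ by a factor of ten, because each group is standardized against its own mean and standard deviation.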
    

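    Compare grouped and ungrouped standardization. The two grouped results are close but not identical, because the std method in pandas uses the sample standard deviation (ddof=1) by default, while scipy.stats.zscore uses the population standard deviation (ddof=0). The ungrouped result differs much more, since it uses the mean and standard deviation of all years combined.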

    from scipy.stats import zscore
    
    sp_z_grouped = df.groupby('year').lifeExp.transform(zscore)
    sp_z_nogroup = zscore(df.lifeExp)
    
    print(transform_z.head())
    '''
    0   -1.656854
    1   -1.731249
    2   -1.786543
    3   -1.848157
    4   -1.894173
    Name: lifeExp, dtype: float64
    '''
    
    print(sp_z_grouped.head())
    '''
    0   -1.662719
    1   -1.737377
    2   -1.792867
    3   -1.854699
    4   -1.900878
    Name: lifeExp, dtype: float64
    '''
    
    print(sp_z_nogroup[:5])
    # [-2.37533395 -2.25677417 -2.1278375  -1.97117751 -1.81103275]
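
    The small gap between the two grouped results can be reproduced without scipy. The sketch below uses a made-up Series (an assumption for illustration) and compares z-scores computed with ddof=1, the pandas default, against ddof=0, the default of scipy.stats.zscore.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; any numeric sample would do
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

# Series.std defaults to ddof=1 (sample standard deviation)
z_sample = (x - x.mean()) / x.std()
# ddof=0 gives the population standard deviation
z_population = (x - x.mean()) / x.std(ddof=0)

# The two results differ only by the constant factor sqrt(n / (n - 1))
ratio = np.sqrt(n / (n - 1))
print(np.allclose(z_population, z_sample * ratio))  # True
```

    Assuming each year in the gapminder data has 142 rows (1704 rows over 12 years), the factor is sqrt(142/141), about 1.0035, which matches the ratio between the two grouped outputs above (-1.662719 / -1.656854).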
    

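    Filling missing values is another example: replace each missing value with the mean of its own group rather than the mean of the whole dataset. For instance, if men and women tend to spend differently, filling a missing bill with the mean for the same sex is more reasonable.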

    import seaborn as sns
    import numpy as np
    
    np.random.seed(42)
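    # draw 10 sample rows
    tips_10 = sns.load_dataset('tips').sample(10)
    # set 'total_bill' to missing for 4 randomly chosen rows
    # (np.nan instead of np.NaN, which was removed in NumPy 2.0)
    tips_10.loc[np.random.permutation(tips_10.index)[:4], 'total_bill'] = np.nan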
    print(tips_10)
    '''
         total_bill   tip     sex smoker   day    time  size
    24        19.82  3.18    Male     No   Sat  Dinner     2
    6          8.77  2.00    Male     No   Sun  Dinner     2
    153         NaN  2.00    Male     No   Sun  Dinner     4
    211         NaN  5.16    Male    Yes   Sat  Dinner     4
    198         NaN  2.00  Female    Yes  Thur   Lunch     2
    176         NaN  2.00    Male    Yes   Sun  Dinner     2
    192       28.44  2.56    Male    Yes  Thur   Lunch     2
    124       12.48  2.52  Female     No  Thur   Lunch     2
    9         14.78  3.23    Male     No   Sun  Dinner     2
    101       15.38  3.00  Female    Yes   Fri  Dinner     2
    '''
    
    # count non-missing values by sex: Male is missing 3 total_bill values, Female 1
    count_sex = tips_10.groupby('sex').count()
    print(count_sex)
    '''
            total_bill  tip  smoker  day  time  size
    sex                                             
    Male             4    7       7    7     7     7
    Female           2    3       3    3     3     3
    '''
    
    # fill the missing values of a vector with the vector's mean
    def fill_na_mean(x):
        avg = x.mean()
        return (x.fillna(avg))
    
    total_bill_group_mean = tips_10.groupby('sex').total_bill.transform(fill_na_mean)
    tips_10['fill_total_bill'] = total_bill_group_mean
    print(tips_10)
    '''
         total_bill   tip     sex smoker   day    time  size  fill_total_bill
    24        19.82  3.18    Male     No   Sat  Dinner     2          19.8200
    6          8.77  2.00    Male     No   Sun  Dinner     2           8.7700
    153         NaN  2.00    Male     No   Sun  Dinner     4          17.9525
    211         NaN  5.16    Male    Yes   Sat  Dinner     4          17.9525
    198         NaN  2.00  Female    Yes  Thur   Lunch     2          13.9300
    176         NaN  2.00    Male    Yes   Sun  Dinner     2          17.9525
    192       28.44  2.56    Male    Yes  Thur   Lunch     2          28.4400
    124       12.48  2.52  Female     No  Thur   Lunch     2          12.4800
    9         14.78  3.23    Male     No   Sun  Dinner     2          14.7800
    101       15.38  3.00  Female    Yes   Fri  Dinner     2          15.3800
    '''
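
    A variation on the same idea, sketched on hypothetical toy data so that it runs without downloading the tips dataset: passing the string 'mean' to transform broadcasts each group's mean back to every row, and fillna then replaces only the missing entries. The DataFrame below is an assumption for illustration, not the sample shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the tips sample
bills = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
    'total_bill': [10.0, np.nan, 30.0, 8.0, 12.0, np.nan],
})

# transform('mean') gives every row the mean of its group (NaN values are skipped)
group_mean = bills.groupby('sex')['total_bill'].transform('mean')

# fill only the missing entries; observed values are left unchanged
bills['fill_total_bill'] = bills['total_bill'].fillna(group_mean)

print(bills['fill_total_bill'].tolist())
# [10.0, 20.0, 30.0, 8.0, 12.0, 10.0]
```

    The missing Male bill becomes 20.0 and the missing Female bill becomes 10.0, whereas the overall mean would have put 15.0 in both places.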
    

    filter

    import pandas as pd
    import seaborn as sns
    
    tips = sns.load_dataset('tips')
    print(tips.shape) # (244, 7)
    
    print(tips['size'].value_counts())
    '''
    2    156
    3     38
    4     37
    5      5
    6      4
    1      4
    Name: size, dtype: int64
    '''
    

    The output shows that party sizes of 1, 5 and 6 are rare. To drop them, keep only the groups that contain at least 30 rows.

    tips_filtered = tips.groupby('size').filter(lambda x: x['size'].count() >= 30)
    print(tips_filtered.shape) # (231, 7)
    print(tips_filtered['size'].value_counts())
    '''
    (231, 7)
    2    156
    3     38
    4     37
    Name: size, dtype: int64
    '''
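
    As a self-contained sketch of how filter decides which groups to keep, the example below uses a hypothetical toy table and an assumed threshold of 3 rows. It also shows one possible alternative that builds a boolean mask with transform('size'); this is an illustration, not the approach used in the article.

```python
import pandas as pd

# Hypothetical toy data: party size 2 appears four times, sizes 1 and 6 once each
parties = pd.DataFrame({
    'size': [2, 2, 2, 2, 1, 6],
    'tip': [2.0, 3.0, 2.5, 3.5, 1.0, 5.0],
})
min_rows = 3  # assumed threshold for this sketch

# filter calls the function once per group and keeps groups where it returns True
kept = parties.groupby('size').filter(lambda g: len(g) >= min_rows)

# Alternative: transform('size') gives each row its group's row count
mask = parties.groupby('size')['tip'].transform('size') >= min_rows
kept_by_mask = parties[mask]

print(kept['size'].tolist())      # [2, 2, 2, 2]
print(kept.equals(kept_by_mask))  # True
```

    Both versions keep the original row order and index; filter reads more directly, while the mask form can be combined with other row-level conditions.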
    
