pandas Notes (Data Transformation)

Author: 生信start_site | Published 2020-06-26 10:34

    The previous note covered how to handle missing values (data cleaning). This note continues with the next part of pandas data preparation: data transformation.

    Removing Duplicates

    # First start ipython (you can also follow along in a plain python shell)
    $ ipython
    Python 3.7.6 (default, Jan  8 2020, 19:59:22)
    Type 'copyright', 'credits' or 'license' for more information
    IPython 7.12.0 -- An enhanced Interactive Python. Type '?' for help.
    
    In [1]: import pandas as pd
    
    In [2]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
       ...: 'k2': [1, 1, 2, 3, 3, 4, 4]}) # build a DataFrame with duplicate rows
    
    In [3]: data # the rows at index 5 and index 6 are identical
    Out[3]:
        k1  k2
    0  one   1
    1  two   1
    2  one   2
    3  two   3
    4  one   3
    5  two   4
    6  two   4
    
    In [4]: data.duplicated() # check which rows are duplicates; index 6 returns True
    Out[4]:
    0    False
    1    False
    2    False
    3    False
    4    False
    5    False
    6     True
    dtype: bool
    
    In [5]: data.drop_duplicates() # drop the rows where .duplicated() is True, keeping only the rows where it is False
    Out[5]:
        k1  k2
    0  one   1
    1  two   1
    2  one   2
    3  two   3
    4  one   3
    5  two   4
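
    Since duplicated returns a boolean Series, summing it counts the duplicate rows; a tiny aside of my own, not part of the original session:

    # True counts as 1, so the sum is the number of duplicated rows
    data.duplicated().sum()   # expected: 1 for the data above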
    

    The calls above de-duplicate on entire rows; you can also de-duplicate based on the values in a subset of columns:

    In [6]: data['v1'] = range(7) # add a column v1 to the DataFrame
    
    In [7]: data
    Out[7]:
        k1  k2  v1
    0  one   1   0
    1  two   1   1
    2  one   2   2
    3  two   3   3
    4  one   3   4
    5  two   4   5
    6  two   4   6
    
    In [8]: data.drop_duplicates(['k1']) # de-duplicate based only on the k1 column; only the rows at index 0 and 1 remain, since every later k1 value is just 'one' or 'two' again and therefore a duplicate
    Out[8]:
        k1  k2  v1
    0  one   1   0
    1  two   1   1
    

    duplicated and drop_duplicates both keep the first occurrence of each duplicate by default; pass keep='last' to keep the last occurrence instead:

    In [9]: data.drop_duplicates(['k1', 'k2'], keep='last')
    Out[9]:
        k1  k2  v1
    0  one   1   0
    1  two   1   1
    2  one   2   2
    3  two   3   3
    4  one   3   4
    6  two   4 # the rows at index 5 and 6 are duplicates; only the row at index 6 is kept
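
    The same keep='last' argument also works with duplicated itself; a small sketch of my own for comparison:

    # mark every duplicate as True except its last occurrence
    data.duplicated(['k1', 'k2'], keep='last')
    # expected: index 5 is True, index 6 is False, all other rows are False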
    

    Transforming Data Using a Function or Mapping

    For many datasets, you may want to perform a transformation based on the values in an array, Series, or a DataFrame column. As an example:

    In [11]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
        ...: 'Pastrami', 'corned beef', 'Bacon',
        ...: 'pastrami', 'honey ham', 'nova lox'],
        ...: 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
    
    In [12]: data # a DataFrame with a food column and an ounces column
    Out[12]:
              food  ounces
    0        bacon     4.0
    1  pulled pork     3.0
    2        bacon    12.0
    3     Pastrami     6.0
    4  corned beef     7.5
    5        Bacon     8.0
    6     pastrami     3.0
    7    honey ham     5.0
    8     nova lox     6.0
    

    Suppose you want to add a column indicating the animal each food came from. First write down a mapping from each meat to its animal:

    In [13]: meat_to_animal = {
                 'bacon': 'pig',
                 'pulled pork': 'pig',
                 'pastrami': 'cow',
                 'corned beef': 'cow',
                 'honey ham': 'pig',
                 'nova lox': 'salmon'
             }
    

    The map method on a Series accepts a function or dict-like object containing a mapping. Here there is a small problem: some of the meat names in data are capitalized ('Pastrami', 'Bacon') while the keys of meat_to_animal are all lowercase, so we first convert each value to lowercase using the str.lower Series method:

    In [14]: lowercased = data['food'].str.lower()
    
    In [15]: lowercased
    Out[15]:
    0          bacon
    1    pulled pork
    2          bacon
    3       pastrami
    4    corned beef
    5          bacon
    6       pastrami
    7      honey ham
    8       nova lox
    Name: food, dtype: object
    

    Now map the lowercased food names through meat_to_animal and assign the result to data as a new animal column:

    In [16]: data['animal'] = lowercased.map(meat_to_animal)
    
    In [17]: data
    Out[17]:
              food  ounces  animal
    0        bacon     4.0     pig
    1  pulled pork     3.0     pig
    2        bacon    12.0     pig
    3     Pastrami     6.0     cow
    4  corned beef     7.5     cow
    5        Bacon     8.0     pig
    6     pastrami     3.0     cow
    7    honey ham     5.0     pig
    8     nova lox     6.0  salmon
    

    Above we assigned the mapped result back into data. You can also do the lowercasing and the lookup in a single step by passing a function to map, which simply returns the mapped Series:

    In [18]: data['food'].map(lambda x: meat_to_animal[x.lower()])
    Out[18]:
    0       pig
    1       pig
    2       pig
    3       cow
    4       cow
    5       pig
    6       cow
    7       pig
    8    salmon
    Name: food, dtype: object
    

    Replacing Values

    The previous note used fillna to fill in missing values; that is a special case of more general value replacement, which is done with the replace method:

    In [19]: data = pd.Series([1., -999., 2., -999., -1000., 3.])
    
    In [20]: data
    Out[20]:
    0       1.0
    1    -999.0
    2       2.0
    3    -999.0
    4   -1000.0
    5       3.0
    dtype: float64
    
    In [21]: import numpy as np
    
    In [22]: data.replace(-999, np.nan) # replace -999 with a missing value
    Out[22]:
    0       1.0
    1       NaN
    2       2.0
    3       NaN
    4   -1000.0
    5       3.0
    dtype: float64
    
    In [23]: data.replace([-999, -1000], np.nan) # replace both -999 and -1000 with missing values
    Out[23]:
    0    1.0
    1    NaN
    2    2.0
    3    NaN
    4    NaN
    5    3.0
    dtype: float64
    
    In [24]: data.replace([-999, -1000], [np.nan, 0]) # replace -999 with a missing value and -1000 with 0
    Out[24]:
    0    1.0
    1    NaN
    2    2.0
    3    NaN
    4    0.0
    5    3.0
    dtype: float64
    
    In [25]: data.replace({-999: np.nan, -1000: 0}) # the same replacement expressed as a dict
    Out[25]:
    0    1.0
    1    NaN
    2    2.0
    3    NaN
    4    0.0
    5    3.0
    dtype: float64
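
    replace also works on a DataFrame, and a nested dict restricts the replacement to a single column. A small sketch of my own (the DataFrame frame below is made up for illustration):

    frame = pd.DataFrame({'a': [1., -999., 3.], 'b': [-999., 5., 6.]})
    frame.replace({'a': {-999: np.nan}})   # only column 'a' is affected; 'b' keeps its -999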
    

    Renaming Axis Indexes

    Besides mapping the values in a DataFrame, you can also map its row and column labels:

    # rename the row index
    In [26]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
    
    In [27]: data
    Out[27]:
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    New York    8    9     10    11
    
    In [28]: transform = lambda x: x[:4].upper()
    
    In [29]: data.index.map(transform)
    Out[29]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')
    
    In [30]: data.index = data.index.map(transform)
    
    In [31]: data
    Out[31]:
          one  two  three  four
    OHIO    0    1      2     3
    COLO    4    5      6     7
    NEW     8    9     10    11
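    # an aside of my own: data.columns is also an Index with the same map method,
    # so data.columns = data.columns.map(str.upper) would uppercase the column names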
    
    # rename labels without modifying the original:
    # title-case the row index and uppercase the column names
    In [32]: data.rename(index=str.title, columns=str.upper)
    Out[32]:
          ONE  TWO  THREE  FOUR
    Ohio    0    1      2     3
    Colo    4    5      6     7
    New     8    9     10    11
    

    rename can also be combined with dicts to rename a subset of the row and column labels at the same time:

    In [33]: data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
    Out[33]:
             one  two  peekaboo  four
    INDIANA    0    1         2     3
    COLO       4    5         6     7
    NEW        8    9        10    11
    

    By default rename returns a new DataFrame and leaves the original untouched; pass inplace=True to modify the original DataFrame in place:

    In [34]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
    
    In [35]: data
    Out[35]:
             one  two  three  four
    INDIANA    0    1      2     3
    COLO       4    5      6     7
    NEW        8    9     10    11
    

    Discretization and Binning

    Continuous data is often discretized or otherwise separated into "bins" for analysis. Suppose you have data on a group of people and want to group their ages into buckets:

    In [36]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
    

    Let's divide the ages into bins of 18 to 25, 26 to 35, 36 to 60, and 61 and older, using the pandas cut function:

    In [37]: bins = [18, 25, 35, 60, 100]
    
    In [38]: cats = pd.cut(ages, bins)
    
    In [39]: cats # each age has been assigned to its bin
    Out[39]:
    [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
    Length: 12
    Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
    

    The object cut returns is a Categorical. Its categories attribute holds the bin intervals, and codes gives, for each element, the index of the bin it falls into:

    # inspect the bins
    In [40]: cats.categories
    Out[40]:
    IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
                  closed='right',
                  dtype='interval[int64]')
    # the bin index (code) for each element
    In [41]: cats.codes
    Out[41]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
    # count how many elements fall into each bin
    In [42]: pd.value_counts(cats)
    Out[42]:
    (18, 25]     5
    (35, 60]     3
    (25, 35]     3
    (60, 100]    1
    dtype: int64
    

    You can also give each bin a name by passing labels:

    In [43]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
    
    In [44]: pd.cut(ages, bins, labels=group_names)
    Out[44]:
    [Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
    Length: 12
    Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
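
    By default cut produces intervals that are closed on the right, e.g. (18, 25]. As a small aside of my own, passing right=False closes them on the left instead:

    # intervals become [18, 25), [25, 35), [35, 60), [60, 100)
    pd.cut(ages, bins, right=False)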
    

    Above we passed explicit bin edges. If you instead pass an integer number of bins, cut computes equal-width bins from the minimum and maximum of the data. Here with some floating-point data:

    In [45]: data = np.random.rand(20)
    
    In [46]: data # 20 random numbers drawn uniformly from 0 to 1
    Out[46]:
    array([0.81110965, 0.47609535, 0.71813052, 0.71125473, 0.50874785,
           0.54215936, 0.51799094, 0.92532814, 0.75840793, 0.59097154,
           0.28318023, 0.580895  , 0.57077378, 0.86404974, 0.10115019,
           0.29316477, 0.57158071, 0.14108119, 0.36732534, 0.18085161])
    
    In [47]: pd.cut(data, 4, precision=2) # split the range from the min to the max of the data into 4 equal-width bins
    # precision=2 limits the displayed bin edges to 2 decimal places
    Out[47]:
    [(0.72, 0.93], (0.31, 0.51], (0.51, 0.72], (0.51, 0.72], (0.31, 0.51], ..., (0.1, 0.31], (0.51, 0.72], (0.1, 0.31], (0.31, 0.51], (0.1, 0.31]]
    Length: 20
    Categories (4, interval[float64]): [(0.1, 0.31] < (0.31, 0.51] < (0.51, 0.72] < (0.72, 0.93]]
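
    A closely related function, qcut, bins the data by sample quantiles rather than equal-width intervals, so each bin holds roughly the same number of observations; a short sketch of my own:

    # cut into quartiles: four bins with about 20 / 4 = 5 values each
    quartiles = pd.qcut(data, 4, precision=2)
    pd.value_counts(quartiles)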
    

    Detecting and Filtering Outliers

    In [48]: data = pd.DataFrame(np.random.randn(1000, 4)) # a 1000-row, 4-column DataFrame of normally distributed random numbers
    
    In [49]: data 
    Out[49]:
                0         1         2         3
    0    0.650892  2.070260  0.538299  0.679726
    1    1.074305  0.495664 -0.970136 -0.344586
    2   -0.859329  1.453186  1.612602 -0.321719
    3    0.558712  0.562590  0.646543 -0.483733
    4    0.935728  1.516592  0.852229  1.926105
    ..        ...       ...       ...       ...
    995 -1.304159 -0.085793  0.482636  1.523466
    996  0.285450 -1.898452  1.422107 -1.227373
    997  2.083328 -1.410110  0.061404 -0.860416
    998 -0.438052 -1.744427 -1.677008 -0.481408
    999  0.999009 -0.240101 -1.299566 -0.172057
    
    [1000 rows x 4 columns]
    
    In [50]: data.describe() # summary statistics for each column
    Out[50]:
                     0            1            2            3
    count  1000.000000  1000.000000  1000.000000  1000.000000
    mean      0.004135    -0.041167    -0.030157    -0.015668
    std       0.987452     0.996992     1.004676     0.982699
    min      -3.438090    -3.179065    -3.055933    -3.336112
    25%      -0.683605    -0.714445    -0.679892    -0.677707
    50%      -0.033715    -0.066457    -0.122572    -0.039982
    75%       0.646951     0.578665     0.643963     0.632722
    max       3.625363     3.499694     3.896634     3.082743
    

    Suppose you want to find the values in column 2 whose absolute value exceeds 3:

    In [51]: col = data[2]
    
    In [52]: col[np.abs(col) > 3]
    Out[52]:
    106    3.248979
    365    3.896634
    467    3.659385
    475   -3.055933
    Name: 2, dtype: float64
    

    To select every row containing a value whose absolute value exceeds 3, use the any method on a boolean DataFrame:

    In [53]: data[(np.abs(data) > 3).any(1)]
    Out[53]:
                0         1         2         3
    106  0.402003  0.675224  3.248979  0.594704
    135  3.625363  0.757227  0.267661  2.870315
    143 -0.040237  3.025956 -0.950451  2.369624
    250  0.411540  3.499694 -0.268605  0.811673
    365 -1.609991 -0.691870  3.896634 -0.761072
    439  0.050316  0.433346 -0.938715 -3.277411
    467 -0.722803  2.093205  3.659385  0.373497
    475  0.326796  0.196501 -3.055933 -0.601524
    492 -3.438090  0.132217 -0.305294 -0.574376
    616 -0.623880  0.588810 -0.742245 -3.239445
    646 -0.009522 -3.179065 -1.350068  0.309251
    778  1.959143 -0.304691 -0.760753  3.082743
    871 -0.686462  3.470719  1.305890 -1.060105
    919  3.452396  1.216505 -0.495451  0.237577
    966 -1.427159 -0.546189  1.346565 -3.336112
    

    To cap values outside the interval -3 to 3, set any element whose absolute value exceeds 3 to 3 or -3 according to its sign:

    In [54]: data[np.abs(data) > 3] = np.sign(data) * 3
    
    In [55]: data.describe()
    Out[55]:
                     0            1            2            3
    count  1000.000000  1000.000000  1000.000000  1000.000000
    mean      0.003495    -0.041985    -0.031906    -0.014898
    std       0.982432     0.993157     0.998391     0.979717
    min      -3.000000    -3.000000    -3.000000    -3.000000
    25%      -0.683605    -0.714445    -0.679892    -0.677707
    50%      -0.033715    -0.066457    -0.122572    -0.039982
    75%       0.646951     0.578665     0.643963     0.632722
    max       3.000000     3.000000     3.000000     3.000000
    

    In the code above, np.sign(data) produces 1 or -1 depending on whether each value is positive or negative:

    In [57]: np.sign(data)
    Out[57]:
           0    1    2    3
    0    1.0  1.0  1.0  1.0
    1    1.0  1.0 -1.0 -1.0
    2   -1.0  1.0  1.0 -1.0
    3    1.0  1.0  1.0 -1.0
    4    1.0  1.0  1.0  1.0
    ..   ...  ...  ...  ...
    995 -1.0 -1.0  1.0  1.0
    996  1.0 -1.0  1.0 -1.0
    997  1.0 -1.0  1.0 -1.0
    998 -1.0 -1.0 -1.0 -1.0
    999  1.0 -1.0 -1.0 -1.0
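
    As an aside of my own, the same capping could be written more directly with DataFrame.clip, which bounds values at the given limits:

    # equivalent to the np.sign trick above for values outside [-3, 3]
    data.clip(lower=-3, upper=3)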
    

    Permutation and Random Sampling

    Using np.random.permutation you can randomly reorder (permute) the rows of a Series or DataFrame: generate a random ordering of the row positions and pass it to take:

    In [58]: df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
    
    In [59]: df
    Out[59]:
        0   1   2   3
    0   0   1   2   3
    1   4   5   6   7
    2   8   9  10  11
    3  12  13  14  15
    4  16  17  18  19
    
    In [60]: sampler = np.random.permutation(5)
    
    In [61]: sampler
    Out[61]: array([2, 4, 0, 3, 1])
    
    In [62]: df.take(sampler)
    Out[62]:
        0   1   2   3
    2   8   9  10  11
    4  16  17  18  19
    0   0   1   2   3
    3  12  13  14  15
    1   4   5   6   7
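
    If you want a random subset rather than a full reordering, DataFrame.sample does that; a small sketch of my own:

    df.sample(n=3)                 # 3 rows chosen without replacement
    df.sample(n=10, replace=True)  # sampling with replacement allows repeated rows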
    

    Computing Indicator/Dummy Variables

    Another type of transformation for statistical modeling or machine learning is converting a categorical variable into a "dummy" or "indicator" matrix. If a column in a DataFrame has k distinct values, you get a matrix or DataFrame with k columns of 1s and 0s. pandas has a get_dummies function for doing this:

    In [67]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
        ...: 'data1': range(6)})
    
    In [68]: df
    Out[68]:
      key  data1
    0   b      0
    1   b      1
    2   a      2
    3   c      3
    4   a      4
    5   b      5
    
    In [69]: pd.get_dummies(df['key'])
    Out[69]:
       a  b  c
    0  0  1  0
    1  0  1  0
    2  1  0  0
    3  0  0  1
    4  1  0  0
    5  0  1  0
    

    You can read the result like this: get_dummies takes the chosen column, uses its distinct values as the new column names, and for each row puts a 1 in the column matching that row's value and 0 everywhere else. In the key column, 'b' appears in the first two rows, so in the DataFrame returned by get_dummies the b column has 1 in its first two rows while the a and c columns have 0, and so on.

    You can add a prefix to the column names of the resulting DataFrame with the prefix argument:

    In [70]: dummies = pd.get_dummies(df['key'], prefix='key')
    
    In [71]: dummies
    Out[71]:
       key_a  key_b  key_c
    0      0      1      0
    1      0      1      0
    2      1      0      0
    3      0      0      1
    4      1      0      0
    5      0      1      0
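
    A common follow-up step (my own sketch, continuing with the same df) is to join the prefixed dummy columns back onto the rest of the data:

    df_with_dummy = df[['data1']].join(dummies)
    df_with_dummy   # data1 plus the key_a / key_b / key_c indicator columns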
    
