Notes: Pandas Data Structures in Python - DataFrame

Author: 南有妖尾 | Published: 2019-11-17 11:26

    Summary of key DataFrame topics

    [Figure: summary table of DataFrame topics (dataframe相关知识点总结表.png)]

    (1) Basic concepts of DataFrame

    import numpy as np
    import pandas as pd

    data = {'name': ['Jack', 'Tom', 'Mary'],
            'age': [18, 19, 20],
            'gender': ['m', 'm', 'w']}  # a dict
    frame = pd.DataFrame(data)
    print(frame, type(frame))
    # pd.DataFrame() builds a DataFrame
    # A DataFrame carries an index (row labels) and columns (column labels)
    
    print('--------')
    print('Row labels of frame and their type:\n', frame.index, type(frame.index))
    # .index returns the row labels
    print('Column labels of frame and their type:\n', frame.columns, type(frame.columns))
    # .columns returns the column labels
    print('Values of frame and their type:\n', frame.values, type(frame.values))
    # .values returns the values as an ndarray
    

    Output:

       age gender  name
    0   18      m  Jack
    1   19      m   Tom
    2   20      w  Mary <class 'pandas.core.frame.DataFrame'>
    --------
    Row labels of frame and their type:
     RangeIndex(start=0, stop=3, step=1) <class 'pandas.indexes.range.RangeIndex'>
    Column labels of frame and their type:
     Index(['age', 'gender', 'name'], dtype='object') <class 'pandas.indexes.base.Index'>
    Values of frame and their type:
     [[18 'm' 'Jack']
     [19 'm' 'Tom']
     [20 'w' 'Mary']] <class 'numpy.ndarray'>
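
    The quotes around 'm' and 'Jack' in the printed array come from its object dtype. The sketch below, which reuses the same sample data, shows that each column keeps its own dtype while the combined 2-D array falls back to object. The variable names values and ages are illustrative only.

```python
import pandas as pd

data = {'name': ['Jack', 'Tom', 'Mary'],
        'age': [18, 19, 20],
        'gender': ['m', 'm', 'w']}
frame = pd.DataFrame(data)

# Each column keeps its own dtype inside the DataFrame
print(frame.dtypes)

# shape is (number of rows, number of columns)
print(frame.shape)

# Mixing numbers and strings forces the combined 2-D array to dtype object
values = frame.to_numpy()
print(values.dtype)

# A purely numeric selection keeps a numeric dtype
ages = frame[['age']].to_numpy()
print(ages.dtype)
```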
    

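    (2) Ways to create a DataFrame

    A DataFrame is created with pd.DataFrame(), and there are five ways to call it. [Methods 1, 2, and 3 are used most often.]

    --->>> Method 1: from a dict of arrays/lists. The dict keys become the columns, the index defaults to numeric labels, and all dict values must have the same length.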

    # Build a DataFrame from a dict of arrays/lists: dict keys become columns, the index defaults to numeric labels
    # All dict values must have the same length!
    data1 = {'a': list(range(3)),
             'b': list(range(4, 7)),
             'c': list(range(7, 10))}
    data2 = {'one': np.random.rand(3),
             'two': np.random.rand(3)}
    print(data1)
    print(data2)
    # data1 is a dict of lists, data2 is a dict of ndarrays
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    print(df1)
    print(df2)
    print('----------------')
    # Specify the column labels
    df1 = pd.DataFrame(data1, columns=list('cdab'))
    print(df1)
    # columns may list more names than the data has; a column missing from the data (such as 'd') is filled with NaN
    df1 = pd.DataFrame(data1, columns=list('ac'))
    print(df1)
    # columns may also list fewer names than the data has
    print('----------------')
    # Specify the row labels
    df2 = pd.DataFrame(data2, index=list('abe'))
    print(df2)
    # The index must have exactly as many labels as the data has rows; more or fewer raises an error
    

    Output:

    {'a': [0, 1, 2], 'c': [7, 8, 9], 'b': [4, 5, 6]}
    {'two': array([ 0.89216661,  0.31251123,  0.53794988]), 'one': array([ 0.75671784,  0.06891223,  0.60083476])}
       a  b  c
    0  0  4  7
    1  1  5  8
    2  2  6  9
            one       two
    0  0.756718  0.892167
    1  0.068912  0.312511
    2  0.600835  0.537950
    ----------------
       c    d  a  b
    0  7  NaN  0  4
    1  8  NaN  1  5
    2  9  NaN  2  6
       a  c
    0  0  7
    1  1  8
    2  2  9
    ----------------
            one       two
    a  0.756718  0.892167
    b  0.068912  0.312511
    e  0.600835  0.537950
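
    The length rules for Method 1 can be checked with a short sketch. The dicts bad and good are made-up data, and catching the exception is only there to show which error type comes back.

```python
import pandas as pd

# Hypothetical data: 'a' has 3 values but 'b' has only 2
bad = {'a': [0, 1, 2], 'b': [4, 5]}

try:
    pd.DataFrame(bad)
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__, error)

# Two index labels for three rows fails in the same way
good = {'a': [0, 1, 2], 'b': [4, 5, 6]}
try:
    pd.DataFrame(good, index=list('xy'))
    index_error = None
except ValueError as exc:
    index_error = exc
print(type(index_error).__name__, index_error)
```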
    

    --->>> Method 2: from a dict of Series. The dict keys become the columns and the index comes from the Series labels; a Series without explicit labels uses the default numeric labels.

    df1=pd.DataFrame({'one':pd.Series(np.random.rand(2)),
                     'two':pd.Series(np.random.rand(3))})
    print(df1)
    df2=pd.DataFrame({'one':pd.Series(np.random.rand(2),index=list('ab')),
                     'two':pd.Series(np.random.rand(3),index=list('abc'))})
    print(df2)
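    # df1 does not set index labels, so the Series' default numeric labels are used
    # df2 sets explicit index labels
    # In a dict of Series the Series may differ in length; missing positions become NaN in the DataFrame
    

    Output: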

            one       two
    0  0.764246  0.276496
    1  0.877065  0.385122
    2       NaN  0.083968
            one       two
    a  0.043106  0.609206
    b  0.175361  0.457400
    c       NaN  0.082632
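
    The NaN values come from label alignment rather than from position. A minimal sketch with two made-up Series whose labels only partly overlap:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

df = pd.DataFrame({'one': s1, 'two': s2})
print(df)

# The index covers every label that appears in either Series
print(list(df.index))

# Count the cells that had no matching label in each column
print(df.isna().sum())
```

    Only label 'b' exists in both Series, so it is the only row with a value in both columns.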
    

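    --->>> Method 3: directly from a 2-D array. The result has the same shape as the array. If index and columns are not given, both default to numeric labels; when given, their lengths must match the array's dimensions.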

    # Create a 2-D ndarray
    ar = np.reshape(np.random.rand(10), (2, 5))
    print(ar)
    # Build a DataFrame from the 2-D array with pd.DataFrame()
    # Without explicit index and columns, both default to numeric labels
    df1 = pd.DataFrame(ar)
    print(df1)
    # Specify index and columns
    # Their lengths must match the shape of the data; too many or too few labels raises an error
    df2 = pd.DataFrame(ar, index=list('ab'), columns=list('rkefi'))
    print(df2)
    

    Output:

    [[ 0.31690795  0.4554707   0.46110939  0.49108045  0.93137807]
     [ 0.08933855  0.13430606  0.45611142  0.40476408  0.72615298]]
              0         1         2         3         4
    0  0.316908  0.455471  0.461109  0.491080  0.931378
    1  0.089339  0.134306  0.456111  0.404764  0.726153
              r         k         e         f         i
    a  0.316908  0.455471  0.461109  0.491080  0.931378
    b  0.089339  0.134306  0.456111  0.404764  0.726153
    

    --->>> Method 4: from a list of dicts. The dict keys become the columns; if index is not given, it defaults to numeric labels.

    # Create a list of dicts
    lst = [{'one': 1, 'two': 2},
           {'one': 9, 'two': 12, 'three': 13}]
    print(lst)
    # Build a DataFrame from the list of dicts
    df1 = pd.DataFrame(lst)
    print(df1)
    print('----------')
    # Specify the index; its length must match the number of rows
    df2 = pd.DataFrame(lst, index=list('ab'))
    print(df2)
    # Specify columns; the columns parameter can add or drop columns, and a new column is filled with NaN
    df3 = pd.DataFrame(lst, columns=['one', 'two', 'four'])
    print(df3)
    

    Output:

    [{'two': 2, 'one': 1}, {'two': 12, 'one': 9, 'three': 13}]
       one  three  two
    0    1    NaN    2
    1    9   13.0   12
    ----------
       one  three  two
    a    1    NaN    2
    b    9   13.0   12
       one  two  four
    0    1    2   NaN
    1    9   12   NaN
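
    The value 13 prints as 13.0 because NaN is a float, so a column that contains NaN is stored as float64. A small sketch using the same list of dicts (the name records is illustrative):

```python
import pandas as pd

records = [{'one': 1, 'two': 2},
           {'one': 9, 'two': 12, 'three': 13}]
df = pd.DataFrame(records)

# 'three' is missing from the first record, so that cell is NaN
print(df['three'].isna().tolist())

# Columns without NaN stay integer; 'three' is upcast to float64
print(df.dtypes)
```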
    

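    --->>> Method 5: from a dict of dicts. The outer keys become the columns and the inner keys become the index.
    The columns parameter can add or drop columns; a new column is filled with NaN.
    Unlike the previous methods, index here does not relabel the rows: it selects rows by the existing inner keys, so a label that is not in the data yields NaN (very important!)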

    data = {'Jack':{'math':90,'english':89,'art':78},
           'Marry':{'math':82,'english':95,'art':92},
           'Tom':{'math':78,'english':67}}
    df1 = pd.DataFrame(data)
    print(df1)
    print('----------')
    df2 = pd.DataFrame(data, columns = ['Jack','Tom','Bob'])
    df3 = pd.DataFrame(data, index = ['a','b','c'])
    print(df2)
    print(df3)
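    # df2: the columns parameter can add or drop columns; a new column is filled with NaN
    # df3: none of the labels 'a', 'b', 'c' match an inner key, so every value is NaN
    

    Output: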

             Jack  Marry   Tom
    art        78     92   NaN
    english    89     95  67.0
    math       90     82  78.0
    ----------
             Jack   Tom  Bob
    art        78   NaN  NaN
    english    89  67.0  NaN
    math       90  78.0  NaN
       Jack  Marry  Tom
    a   NaN    NaN  NaN
    b   NaN    NaN  NaN
    c   NaN    NaN  NaN
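
    A short sketch of the two cases for Method 5, reusing the same scores: passing inner keys that do exist selects and orders rows, while relabelling has to happen after construction. The names subset and relabeled are illustrative.

```python
import pandas as pd

data = {'Jack': {'math': 90, 'english': 89, 'art': 78},
        'Marry': {'math': 82, 'english': 95, 'art': 92},
        'Tom': {'math': 78, 'english': 67}}

# Existing inner keys are kept, in the requested order
subset = pd.DataFrame(data, index=['math', 'art'])
print(subset)

# To relabel the rows, build the frame first and then assign a new index
relabeled = pd.DataFrame(data)
relabeled.index = ['a', 'b', 'c']
print(relabeled)
```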
    

    (3) Indexing and slicing a DataFrame

    --->>> df[columns] selects columns by default; put column labels inside []

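    # Create a DataFrame
    df = pd.DataFrame(np.random.rand(12).reshape(3, 4) * 100,
                      index=list('abc'), columns=['one', 'two', 'three', 'four'])
    print(df)
    
    # df[columns] selects columns by default; put column labels inside []
    # For this reason columns usually get explicit names instead of the default numeric ones, to avoid confusion with the index
    data1 = df['one']
    print(data1, type(data1))
    print('---------')
    # Selecting a single column returns a Series
    data2 = df[['two', 'four']]
    print(data2, type(data2))
    print('---------')
    # Selecting several columns returns a DataFrame
    
    # [Note!]
    # A slice inside df[] selects rows; only slices work, a single position such as df[0] does not
    data3 = df[0:1]
    print(data3, type(data3))
    data4 = df[:2]
    print(data4)
    # The result is a DataFrame even when only one row is selected
    # df[] selects rows only through a slice; a single row label such as df['a'] is looked up as a column name and raises KeyError
    # Key point: select columns with df[col], writing column names inside []; rows are selected with other methods.
    

    Output: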

             one        two      three       four
    a  90.914477  19.150946  13.451741   7.575419
    b  24.002117  84.665548  21.014130  59.794000
    c  67.605976  23.585457  55.870082  37.634451
    a    90.914477
    b    24.002117
    c    67.605976
    Name: one, dtype: float64 <class 'pandas.core.series.Series'>
    ---------
             two       four
    a  19.150946   7.575419
    b  84.665548  59.794000
    c  23.585457  37.634451 <class 'pandas.core.frame.DataFrame'>
    ---------
             one        two      three      four
    a  90.914477  19.150946  13.451741  7.575419 <class 'pandas.core.frame.DataFrame'>
             one        two      three       four
    a  90.914477  19.150946  13.451741   7.575419
    b  24.002117  84.665548  21.014130  59.794000
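
    The row-versus-column behaviour of df[] can be condensed into a sketch. It uses made-up integer data so the result is reproducible; the labels match the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=list('abc'),
                  columns=['one', 'two', 'three', 'four'])

# A positional slice selects rows (end excluded)
first_two = df[:2]
print(first_two)

# A label slice also selects rows (end included)
a_to_b = df['a':'b']
print(a_to_b)

# A single label is always treated as a column name
try:
    df['a']
    raised = False
except KeyError:
    raised = True
print(raised)
```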
    

    --->>> df.loc[] - select rows by index label

    # df.loc[] - select rows by index label
    df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                       index=['one', 'two', 'three', 'four'],
                       columns=['a', 'b', 'c', 'd'])
    df2 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                       columns=['a', 'b', 'c', 'd'])
    print(df1)
    print(df2)
    print('-----')
    data1 = df1.loc['one']
    data2 = df2.loc[2]
    print(data1, type(data1))
    print(data2, type(data2))
    # Selecting a single row returns a Series whose index is the original columns
    print('-----')
    data3 = df1.loc[['one', 'three']]
    data4 = df2.reindex([3, 0, 4])
    print(data3, type(data3))
    print(data4, type(data4))
    # Selecting several rows returns a DataFrame; the rows follow the order of the list
    # Since pandas 1.0, .loc raises KeyError if any label in the list is missing (label 4 here),
    # so .reindex() is used to get a NaN row for the missing label
    print('-----')
    
    # Slicing
    data5 = df1.loc['one':'three']
    print(data5)
    # A df.loc[] slice includes its end label
    data6 = df2.loc[0:2]
    print(data6)
    # Key point: df.loc[label] selects rows by index label; it works with custom labels and with the default numeric index
    

    Output:

                   a          b          c          d
    one    25.422364  20.795395  11.336709  90.864404
    two    86.156487  11.198670  54.739204  45.817514
    three  83.311279  37.328900  84.881397  26.402103
    four    8.671409  61.604665  71.907952  66.940001
               a          b          c          d
    0  48.645209  99.572800  27.856945  92.607217
    1  79.674493  73.053547  18.308831  88.747135
    2  25.372909  62.037611  85.714760  49.503700
    3  71.654423  24.761800  11.173785  46.230001
    -----
    a    25.422364
    b    20.795395
    c    11.336709
    d    90.864404
    Name: one, dtype: float64 <class 'pandas.core.series.Series'>
    a    25.372909
    b    62.037611
    c    85.714760
    d    49.503700
    Name: 2, dtype: float64 <class 'pandas.core.series.Series'>
    -----
                   a          b          c          d
    one    25.422364  20.795395  11.336709  90.864404
    three  83.311279  37.328900  84.881397  26.402103 <class 'pandas.core.frame.DataFrame'>
               a        b          c          d
    3  71.654423  24.7618  11.173785  46.230001
    0  48.645209  99.5728  27.856945  92.607217
    4        NaN      NaN        NaN        NaN <class 'pandas.core.frame.DataFrame'>
    -----
                   a          b          c          d
    one    25.422364  20.795395  11.336709  90.864404
    two    86.156487  11.198670  54.739204  45.817514
    three  83.311279  37.328900  84.881397  26.402103
               a          b          c          d
    0  48.645209  99.572800  27.856945  92.607217
    1  79.674493  73.053547  18.308831  88.747135
    2  25.372909  62.037611  85.714760  49.503700
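
    In pandas 1.0 and later, .loc raises KeyError when any label in a list is missing, whereas .reindex() returns a NaN row for it. A short sketch, assuming a recent pandas and made-up integer data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('abcd'))

# Label 4 does not exist, so .loc rejects the whole list
try:
    df.loc[[3, 0, 4]]
    raised = False
except KeyError:
    raised = True
print(raised)

# .reindex() returns the existing rows plus a NaN row for label 4
picked = df.reindex([3, 0, 4])
print(picked)
```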
    

    --->>> df.iloc[] - select rows by integer position (from 0 to length-1 along the axis), like list indexing; positions follow the row order of the DataFrame, counting from 0

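    # df.iloc[] - select rows by integer position (from 0 to length-1 along the axis)
    # Like list indexing: positions follow the row order of the DataFrame, counting from 0
    df = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                      index=['one', 'two', 'three', 'four'],
                      columns=['a', 'b', 'c', 'd'])
    print(df)
    print('------')
    
    data1 = df.iloc[0]
    print(data1, type(data1))
    data2 = df.iloc[-1]
    print(data2, type(data2))
    print('------')
    # Single-position indexing, like indexing a list
    
    data3 = df.iloc[[0, 2]]
    print(data3, type(data3))
    # data4 = df.iloc[[2, 1, 4]]
    # print(data4)  # raises IndexError because there is no row at position 4
    # Positions beyond the number of rows cannot be indexed
    # Multi-position indexing; the order may vary
    
    # A df.iloc[] slice excludes its end position
    data4 = df.iloc[:2]
    print(data4)
    # data4 holds the rows at positions 0 and 1
    

    Output: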

                   a          b          c          d
    one    74.980847  68.412788  55.553170  70.115846
    two    33.516484  79.838096  54.487476  66.310558
    three  86.773031  73.986408  23.077972  56.278295
    four   79.205418  19.199874  10.778306  74.669667
    ------
    a    74.980847
    b    68.412788
    c    55.553170
    d    70.115846
    Name: one, dtype: float64 <class 'pandas.core.series.Series'>
    a    79.205418
    b    19.199874
    c    10.778306
    d    74.669667
    Name: four, dtype: float64 <class 'pandas.core.series.Series'>
    ------
                   a          b          c          d
    one    74.980847  68.412788  55.553170  70.115846
    three  86.773031  73.986408  23.077972  56.278295 <class 'pandas.core.frame.DataFrame'>
                 a          b          c          d
    one  74.980847  68.412788  55.553170  70.115846
    two  33.516484  79.838096  54.487476  66.310558
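
    The out-of-range case that is commented out above can be checked safely by catching the exception. The sketch also shows that a slice past the end is clipped instead of failing; the data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=list('abcd'))

# Position 4 is past the last row (valid positions are 0 to 3)
try:
    df.iloc[[2, 1, 4]]
    raised = False
except IndexError:
    raised = True
print(raised)

# A slice that runs past the end is clipped, as with a Python list
tail = df.iloc[2:10]
print(tail)
```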
    

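    --->>> Combined indexing: selecting rows and columns together. Select the columns first and then the rows, which amounts to choosing the fields first and then the amount of data.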

    df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                       index = ['one','two','three','four'],
                       columns = ['a','b','c','d'])
    print(df)
    print('------')
    print(df['a'].loc['two'])
    print('------')
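    # Pick a single value out of the DataFrame
    # Column a, row two
    
    print(df[['b', 'd']].loc[['one', 'three']])
    print('------')
    # Pick non-contiguous rows and columns
    # Columns b and d, rows one and three
    
    print(df[['a', 'd']].loc['two':'four'])
    print('------')
    # df[] takes several columns as a comma-separated list
    # df.loc[] takes a range of rows as a slice written with a colon
    
    print(df[df['a'] < 50].iloc[:2])   # first two rows among those that satisfy the condition
    

    Output: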

                   a          b          c          d
    one    23.625057  14.995706  63.663427  67.586236
    two     6.870833  59.007275  66.547176  50.959152
    three  48.086678  50.274425  59.094988  42.759351
    four   58.284649  82.886278  16.476423  27.154450
    ------
    6.87083343585
    ------
                   b          d
    one    14.995706  67.586236
    three  50.274425  42.759351
    ------
                   a          d
    two     6.870833  50.959152
    three  48.086678  42.759351
    four   58.284649  27.154450
    ------
                 a          b          c          d
    one  23.625057  14.995706  63.663427  67.586236
    two   6.870833  59.007275  66.547176  50.959152
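
    The column-then-row chain can also be written as a single .loc call that takes a row selector and a column selector. The sketch below, on made-up integer data, checks that both forms return the same result; it is an alternative spelling, not the form used in these notes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=list('abcd'))

# Chained form: columns first, then rows
chained = df[['b', 'd']].loc[['one', 'three']]

# Single-step form: rows and columns in one .loc call
direct = df.loc[['one', 'three'], ['b', 'd']]
print(direct)
print(chained.equals(direct))

# A single cell: row 'two', column 'a'
cell = df.loc['two', 'a']
print(cell)
```

    The single-step form is also the one to use when assigning values, since assigning through a chain may write to a temporary copy.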
    

    --->>> Boolean indexing works the same way as for Series.

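    # Boolean indexing
    # Works the same way as for Series
    df = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                      index=['one', 'two', 'three', 'four'],
                      columns=['a', 'b', 'c', 'd'])
    print(df)
    print('------')
    
    b1 = df < 20
    print(b1, type(b1))
    print(df[b1])  # can also be written as df[df < 20]
    print('------')
    # Without selecting anything first, every value is tested
    # The result keeps the full shape: True keeps the original value, False becomes NaN
    
    b2 = df['a'] > 50
    print(b2, type(b2))
    print(df[b2])  # can also be written as df[df['a'] > 50]
    print('------')
    # Test on a single column
    # The result keeps the rows where the test is True, including their other columns
    
    b3 = df[['a', 'b']] > 50
    print(b3, type(b3))
    print(df[b3])  # can also be written as df[df[['a','b']] > 50]
    print('------')
    # Test on several columns
    # The result keeps the full shape: True keeps the original value; False and untested columns become NaN
    
    b4 = df.loc[['one', 'three']] < 50
    print(b4, type(b4))
    print(df[b4])  # can also be written as df[df.loc[['one','three']] < 50]
    print('------')
    # Test on several rows
    # The result keeps the full shape: True keeps the original value; False and untested rows become NaN
    

    Output: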

                   a          b          c          d
    one    19.185849  20.303217  21.800384  45.189534
    two    50.105112  28.478878  93.669529  90.029489
    three  35.496053  19.248457  74.811841  20.711431
    four   24.604478  57.731456  49.682717  82.132866
    ------
               a      b      c      d
    one     True  False  False  False
    two    False  False  False  False
    three  False   True  False  False
    four   False  False  False  False <class 'pandas.core.frame.DataFrame'>
                   a          b   c   d
    one    19.185849        NaN NaN NaN
    two          NaN        NaN NaN NaN
    three        NaN  19.248457 NaN NaN
    four         NaN        NaN NaN NaN
    ------
    one      False
    two       True
    three    False
    four     False
    Name: a, dtype: bool <class 'pandas.core.series.Series'>
                 a          b          c          d
    two  50.105112  28.478878  93.669529  90.029489
    ------
               a      b
    one    False  False
    two     True  False
    three  False  False
    four   False   True <class 'pandas.core.frame.DataFrame'>
                   a          b   c   d
    one          NaN        NaN NaN NaN
    two    50.105112        NaN NaN NaN
    three        NaN        NaN NaN NaN
    four         NaN  57.731456 NaN NaN
    ------
              a     b      c     d
    one    True  True   True  True
    three  True  True  False  True <class 'pandas.core.frame.DataFrame'>
                   a          b          c          d
    one    19.185849  20.303217  21.800384  45.189534
    two          NaN        NaN        NaN        NaN
    three  35.496053  19.248457        NaN  20.711431
    four         NaN        NaN        NaN        NaN
    ------
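
    Single-column tests are often combined. A minimal sketch with made-up values, using & for "and", | for "or", and ~ for "not":

```python
import pandas as pd

df = pd.DataFrame({'a': [19, 50, 35, 24],
                   'b': [20, 28, 19, 57]},
                  index=['one', 'two', 'three', 'four'])

# Each condition needs its own parentheses
both = df[(df['a'] < 40) & (df['b'] < 25)]
print(both)

either = df[(df['a'] >= 50) | (df['b'] > 50)]
print(either)

# ~ negates a boolean Series
not_small = df[~(df['a'] < 30)]
print(not_small)
```

    The Python keywords and, or, and not do not work here because they try to reduce a whole Series to one truth value.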
    

    (4) Basic DataFrame techniques

    --->>> .head() and .tail() show the first and last rows; .T transposes the data

    # Viewing and transposing data
    df = pd.DataFrame(np.random.rand(16).reshape(8, 2) * 100,
                      columns=['a', 'b'])
    print(df.head(2))
    print(df.tail())
    # .head() shows the first rows
    # .tail() shows the last rows
    # Both show 5 rows by default
    print(df.T)
    # .T transposes: the shape here goes from (8, 2) to (2, 8), and the index and columns swap
    

    Output:

               a          b
    0   7.949040  32.642699
    1  92.208337  10.828741
               a          b
    3  69.807085  57.420046
    4  79.708661   0.590644
    5  65.844373  47.775450
    6   4.988854  64.613866
    7   8.913846  87.750012
               0          1          2          3          4          5  \
    a   7.949040  92.208337  82.494079  69.807085  79.708661  65.844373   
    b  32.642699  10.828741  90.075549  57.420046   0.590644  47.775450   
    
               6          7  
    a   4.988854   8.913846  
    b  64.613866  87.750012  
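
    The row counts and the shape change can be confirmed without reading the printed tables. A small sketch on an 8x2 frame of made-up integers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(8, 2), columns=['a', 'b'])

print(df.head().shape)   # default number of rows
print(df.tail(3).shape)  # explicit number of rows

# Transposing swaps the two axes and their labels
t = df.T
print(df.shape, t.shape)
print(list(t.index), list(t.columns))
```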
    

    --->>> Adding and modifying data in a DataFrame

    df = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                       columns = ['a','b','c','d'])
    print(df)
    
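    # [Adding data]
    df['g'] = 100
    df.loc[5] = 200
    print(df)
    print('----------')
    # df['g'] = 100 adds column g with every value set to 100
    # df.loc[5] = 200 adds a row with index label 5 and every value set to 200
    
    # [Modifying data]
    df['c'] = 120
    df.loc[0:1] = 250
    print(df)
    # df['c'] = 120 sets every value in column c to 120
    # df.loc[0:1] = 250 sets every value in the rows labelled 0 through 1 to 250
    

    Output: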

               a          b          c          d
    0  81.933629  58.281026  97.975374  92.070574
    1  61.370852  56.591523  40.667676  32.055902
    2  60.103233   5.406372  66.602369  84.946211
    3  84.100587  13.442995  16.667553  13.657240
                a           b           c           d    g
    0   81.933629   58.281026   97.975374   92.070574  100
    1   61.370852   56.591523   40.667676   32.055902  100
    2   60.103233    5.406372   66.602369   84.946211  100
    3   84.100587   13.442995   16.667553   13.657240  100
    5  200.000000  200.000000  200.000000  200.000000  200
    ----------
                a           b    c           d    g
    0  250.000000  250.000000  250  250.000000  250
    1  250.000000  250.000000  250  250.000000  250
    2   60.103233    5.406372  120   84.946211  100
    3   84.100587   13.442995  120   13.657240  100
    5  200.000000  200.000000  120  200.000000  200
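
    The same assignment syntax also works on part of a row or column and with per-column values. A sketch on made-up data, assuming the default numeric index:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})

# Change only column 'b' in the rows labelled 0 and 1 (loc slices include the end)
df.loc[0:1, 'b'] = 0

# Add a column computed from an existing one
df['c'] = df['a'] * 2

# Add a row under a new label by giving one value per column
df.loc[9] = [5, 50, 10]
print(df)
```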
    

    --->>> Deleting data from a DataFrame: the del statement deletes a column, drop() deletes rows or columns

    df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                       columns = ['a','b','c','d'],index=list('njkr'))
    df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                       columns = ['a','b','c','d'],index=[1,2,3,4])
    print(df1)
    print(df2)
    print('-----')
    del df1['a']
    print(df1)
    print('-----')
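    # The del statement deletes a column
    
    print(df1.drop(['n']))
    print(df2.drop(1))
    print(df1)
    print(df2)
    print('-----')
    # drop(label): pass row index labels to delete rows
    # drop() defaults to inplace=False, so it returns new data and leaves the original unchanged
    
    # drop() deletes rows by default
    # To delete columns with drop(), add axis=1
    print(df1.drop(['d'], axis=1))
    print(df1)
    

    Output: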

               a          b          c          d
    n  99.781228  25.472789  77.134248  88.759165
    j  30.835066  97.793713   5.044348  63.591559
    k  91.380903  54.532245  64.167292  15.215125
    r  11.363368  22.048608  11.202747  94.795838
               a          b          c          d
    1  81.331983  72.663489  87.489846  39.120100
    2  17.028264  47.835036  95.802606   0.556850
    3  97.790851  68.518948  42.980309  46.952173
    4  40.470713   2.216181  15.092647  95.034669
    -----
               b          c          d
    n  25.472789  77.134248  88.759165
    j  97.793713   5.044348  63.591559
    k  54.532245  64.167292  15.215125
    r  22.048608  11.202747  94.795838
    -----
               b          c          d
    j  97.793713   5.044348  63.591559
    k  54.532245  64.167292  15.215125
    r  22.048608  11.202747  94.795838
               a          b          c          d
    2  17.028264  47.835036  95.802606   0.556850
    3  97.790851  68.518948  42.980309  46.952173
    4  40.470713   2.216181  15.092647  95.034669
               b          c          d
    n  25.472789  77.134248  88.759165
    j  97.793713   5.044348  63.591559
    k  54.532245  64.167292  15.215125
    r  22.048608  11.202747  94.795838
               a          b          c          d
    1  81.331983  72.663489  87.489846  39.120100
    2  17.028264  47.835036  95.802606   0.556850
    3  97.790851  68.518948  42.980309  46.952173
    4  40.470713   2.216181  15.092647  95.034669
    -----
               b          c
    n  25.472789  77.134248
    j  97.793713   5.044348
    k  54.532245  64.167292
    r  22.048608  11.202747
               b          c          d
    n  25.472789  77.134248  88.759165
    j  97.793713   5.044348  63.591559
    k  54.532245  64.167292  15.215125
    r  22.048608  11.202747  94.795838
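
    drop() also accepts the keyword arguments index= and columns=, which state the axis by name. A short sketch with made-up data that also shows reassignment as the way to keep a change:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}, index=['n', 'j'])

# columns= is an alternative to axis=1
no_b = df.drop(columns=['b'])

# index= names the rows to delete
no_n = df.drop(index=['n'])

# drop() returned new objects; df itself is unchanged
print(df.shape, no_b.shape, no_n.shape)

# Reassign the result to keep the change
df = df.drop(columns=['c'])
print(list(df.columns))
```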
    

    --->>> Aligned arithmetic on DataFrames

    df1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
    df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
    print(df1 + df2)
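    # Arithmetic between DataFrame objects aligns automatically on columns and on the index (row labels)
    # Positions that exist in only one of the two frames become NaN
    

    Output: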

              A         B         C   D
    0  0.375689  0.154600  0.588799 NaN
    1  0.314006  0.322111  1.521050 NaN
    2 -0.617338  1.621051  1.019352 NaN
    3  0.079740  1.035736 -0.069641 NaN
    4  0.362620  1.040316 -3.178383 NaN
    5  1.162456  1.628615  1.699686 NaN
    6 -1.306092 -0.410553 -3.453950 NaN
    7       NaN       NaN       NaN NaN
    8       NaN       NaN       NaN NaN
    9       NaN       NaN       NaN NaN
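
    When the NaN values are unwanted, the add() method with fill_value is one option: a value missing on one side is replaced before adding. A sketch on two small made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})
df2 = pd.DataFrame({'A': [10.0, 20.0]})

# Plain + gives NaN wherever only one side has a value
plain = df1 + df2
print(plain)

# add() with fill_value treats a value missing on one side as 0
filled = df1.add(df2, fill_value=0)
print(filled)
```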
    

    --->>> Sorting a DataFrame:
    [Sort 1 - sort by value, .sort_values()]

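    # Sort 1 - sort by value with .sort_values
    # Also works on Series
    df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                       columns=['a', 'b', 'c', 'd'])
    print(df1)
    print(df1.sort_values(['a'], ascending=True))   # ascending
    print(df1.sort_values(['a'], ascending=False))  # descending
    print('------')
    # ascending parameter: sort direction, ascending by default
    # Single-column sort
    
    
    df2 = pd.DataFrame({'a': [1, 1, 1, 1, 2, 2, 2, 2],
                        'b': list(range(8)),
                        'c': list(range(8, 0, -1))})
    print(df2)
    print(df2.sort_values(['a', 'c']))
    # Multi-column sort: the columns are applied in the order listed
    # This requires the data itself to be sortable
    # For df2, rows are sorted by a ascending (1, then 2) and, within each value of a, by c ascending (5, 6, 7, 8 for a=1; 1, 2, 3, 4 for a=2)
    

    Output: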

               a          b          c          d
    0  21.476998  94.124307  74.651624  86.206047
    1  64.418365  51.001043  67.371010  57.975594
    2  98.097496  42.493171  18.868006  96.989554
    3  50.945730   5.336877  45.237293  48.395052
               a          b          c          d
    0  21.476998  94.124307  74.651624  86.206047
    3  50.945730   5.336877  45.237293  48.395052
    1  64.418365  51.001043  67.371010  57.975594
    2  98.097496  42.493171  18.868006  96.989554
               a          b          c          d
    2  98.097496  42.493171  18.868006  96.989554
    1  64.418365  51.001043  67.371010  57.975594
    3  50.945730   5.336877  45.237293  48.395052
    0  21.476998  94.124307  74.651624  86.206047
    ------
       a  b  c
    0  1  0  8
    1  1  1  7
    2  1  2  6
    3  1  3  5
    4  2  4  4
    5  2  5  3
    6  2  6  2
    7  2  7  1
       a  b  c
    3  1  3  5
    2  1  2  6
    1  1  1  7
    0  1  0  8
    7  2  7  1
    6  2  6  2
    5  2  5  3
    4  2  4  4
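
    In a multi-column sort, ascending can also be a list with one flag per column. A sketch on made-up data that sorts a ascending and c descending within each value of a:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'c': [7, 8, 3, 4]})

# One flag per column: 'a' ascending, 'c' descending within each 'a' group
result = df.sort_values(['a', 'c'], ascending=[True, False])
print(result)

# sort_values returns a new object; the original row order is unchanged
print(df['c'].tolist())
```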
    

    [Sort 2 - sort by index, .sort_index()]

    # Sort 2 - sort by index with .sort_index
    df1 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                      index = [5,4,3,2],
                       columns = ['a','b','c','d'])
    df2 = pd.DataFrame(np.random.rand(16).reshape(4,4)*100,
                      index = ['h','s','x','g'],
                       columns = ['a','b','c','d'])
    print(df1)
    data1=df1.sort_index()
    print(data1)
    print(df1)
    print('--------')
    print(df2)
    data2=df2.sort_index()
    print(data2)
    print(df2)
    # Sort by index
    # Defaults: ascending=True sorts in ascending order, inplace=False returns new sorted data
    

    Output:

               a          b          c          d
    5  46.148715   3.841598  28.772054  30.769325
    4  62.050299  72.191961  96.291796  86.469963
    3  34.731134  22.328996  96.434859  96.952085
    2  14.084726  64.354711  45.780757  82.532313
               a          b          c          d
    2  14.084726  64.354711  45.780757  82.532313
    3  34.731134  22.328996  96.434859  96.952085
    4  62.050299  72.191961  96.291796  86.469963
    5  46.148715   3.841598  28.772054  30.769325
               a          b          c          d
    5  46.148715   3.841598  28.772054  30.769325
    4  62.050299  72.191961  96.291796  86.469963
    3  34.731134  22.328996  96.434859  96.952085
    2  14.084726  64.354711  45.780757  82.532313
    --------
               a          b          c          d
    h  93.225186  74.185120  83.354408  98.871708
    s  69.450625  14.009624  98.526514  32.786719
    x  95.268889  97.052805  70.099924  22.170984
    g  96.089474  79.845508  66.001147  46.291457
               a          b          c          d
    g  96.089474  79.845508  66.001147  46.291457
    h  93.225186  74.185120  83.354408  98.871708
    s  69.450625  14.009624  98.526514  32.786719
    x  95.268889  97.052805  70.099924  22.170984
               a          b          c          d
    h  93.225186  74.185120  83.354408  98.871708
    s  69.450625  14.009624  98.526514  32.786719
    x  95.268889  97.052805  70.099924  22.170984
    g  96.089474  79.845508  66.001147  46.291457
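
    sort_index() takes the same ascending flag and can also sort the column labels with axis=1. A short sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'b': [1, 2, 3], 'a': [4, 5, 6]}, index=['h', 'x', 'g'])

# Row labels in descending order
desc = df.sort_index(ascending=False)
print(list(desc.index))

# axis=1 sorts the column labels instead of the row labels
by_col = df.sort_index(axis=1)
print(list(by_col.columns))
```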
    


    Link to this article: https://www.haomeiwen.com/subject/zflpictx.html