Fundamentals of pandas, a Scientific Computing Library

Author: 程序媛啊 | Published 2020-10-12 13:18

    pandas

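    pandas has two main data structures: Series and DataFrame.
    
    • Series: a one-dimensional, array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (the index). A simple Series can also be created from the data alone, in which case a default integer index is generated. Note: index values in a Series may repeat.
    • DataFrame: a tabular data structure holding an ordered set of columns, where each column can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be thought of as a dictionary of Series that share one row index.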
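
    The two points above can be sketched as follows. This is a minimal illustration with made-up values; the names s and df are assumptions, not part of the original article.

```python
import pandas as pd

# A Series whose index contains a repeated label:
# selecting "a" returns both matching entries.
s = pd.Series([1, 2, 3], index=["a", "b", "a"])
print(s["a"])

# A DataFrame behaves like a dict of Series sharing one row index.
df = pd.DataFrame({
    "score": pd.Series([90.5, 78.0], index=["Jim", "LiLei"]),
    "passed": pd.Series([True, True], index=["Jim", "LiLei"]),
})
print(df["score"])   # each column is itself a Series
print(df.dtypes)     # every column keeps its own dtype
```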

    Series

    Creating a Series from a one-dimensional array

    code:

    from pandas import Series,DataFrame
    import pandas as pd
    import numpy as np
    a1 = np.array(["Python","C++","Java","PHP"])
    ser1 = Series(a1)
    print(ser1) # the output includes the default integer index
    print(ser1.dtype)
    print(ser1.index)
    print(ser1.values)
    

    out:

    0    Python
    1       C++
    2      Java
    3       PHP
    dtype: object
    object
    RangeIndex(start=0, stop=4, step=1)
    ['Python' 'C++' 'Java' 'PHP']
    

    code:

    ser1.index = ["one","two","three","four"]
    print(ser1)
    

    out:

    one      Python
    two         C++
    three      Java
    four        PHP
    dtype: object
    

    code:

    ser2 = Series(data = [78,90,65,92],dtype = np.float64,index = ["Jim","HanMei","LiLei","Havorld"])
    print(ser2)
    

    out:

    Jim        78.0
    HanMei     90.0
    LiLei      65.0
    Havorld    92.0
    dtype: float64
    

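    Creating a Series from a dictionary

    code:

    dict1= {"Jim":84,"HanMei":68,"Havorld":96}
    ser2 = Series(dict1)
    print(ser2) # the dict keys become the Series index, the dict values become the Series values
    

    out:

    Jim        84
    HanMei     68
    Havorld    96
    dtype: int64

    Note: current pandas keeps the dict's insertion order; older releases sorted the keys alphabetically.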
    

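    Accessing Series values

    ser3 = Series(data = [78,90,65,92],dtype = np.float64,index = ["Jim","HanMei","LiLei","Havorld"])
    print(ser3)
    
    Output:
    Jim        78.0
    HanMei     90.0
    LiLei      65.0
    Havorld    92.0
    dtype: float64
    
    print(ser3.iloc[1])      # by position; plain ser3[1] is deprecated when the index holds labels
    print(ser3["Havorld"])   # by label
    print(ser3.iloc[-2])     # negative positions count from the end
    
    Output:
    90.0
    92.0
    65.0
    
    print(ser3[1:])          # positional slice: everything from position 1 onward
    
    Output:
    HanMei     90.0
    LiLei      65.0
    Havorld    92.0
    dtype: float64
    
    
    print(ser3["HanMei":"LiLei"])  # label slice: both endpoints are included
    
    Output:
    HanMei    90.0
    LiLei     65.0
    dtype: float64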

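    Series operations

    • NumPy array operations are all available on a Series, and the mapping between index labels and values is preserved while they run.
    • When working with a Series, you can largely treat it as a NumPy ndarray: the vast majority of ndarray operations also apply to a Series.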
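
    As a small sketch of these two points (the variable names curved, roots, and high are illustrative assumptions), arithmetic, NumPy ufuncs, and boolean filtering all return a Series that keeps its labels:

```python
import numpy as np
import pandas as pd

scores = pd.Series([78.0, 90.0, 65.0, 92.0],
                   index=["Jim", "HanMei", "LiLei", "Havorld"])

curved = scores + 5          # scalar arithmetic is applied element-wise
roots = np.sqrt(scores)      # a NumPy ufunc returns a Series with the same index
high = scores[scores > 80]   # boolean filtering keeps the matching labels

print(curved)
print(roots)
print(high)
```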

    Detecting missing values in a Series

    ser4 = Series({"Jim":84,"HanMei":68,"Havorld":96})
    print(ser4)
    
    Output:
    Jim        84
    HanMei     68
    Havorld    96
    dtype: int64
    
    new_index=["Jim","Lucy","HanMei","Havorld"]  # a list: a set is unordered and cannot define an index
    ser4 = Series(ser4,index=new_index)  # labels missing from the original ("Lucy") get NaN
    print(ser4)
    
    Output:
    Jim        84.0
    Lucy        NaN
    HanMei     68.0
    Havorld    96.0
    dtype: float64
    
    ser5 = pd.isnull(ser4) # True where the value is missing
    print(ser5)
    
    Output:
    Jim        False
    Lucy        True
    HanMei     False
    Havorld    False
    dtype: bool
    
    ser6 = pd.notnull(ser4) # True where the value is present
    print(ser6)
    
    Output:
    Jim         True
    Lucy       False
    HanMei      True
    Havorld     True
    dtype: bool

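    Arithmetic between Series

    When an operation combines several Series objects, values are matched by index label: entries that share a label are combined, and any label that appears in only one of the Series gets NaN in the result.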
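
    A minimal sketch of this alignment, using invented scores (math_scores and english_scores are illustrative names). Series.add with fill_value is one way to avoid the NaN when a label is missing on one side:

```python
import pandas as pd

math_scores = pd.Series({"Jim": 84, "HanMei": 68, "Havorld": 96})
english_scores = pd.Series({"HanMei": 70, "Havorld": 80, "Lucy": 75})

# Labels are aligned first; "Jim" and "Lucy" exist on one side only, so they become NaN.
total = math_scores + english_scores
print(total)

# fill_value=0 treats a label missing on one side as 0 instead of NaN.
total_filled = math_scores.add(english_scores, fill_value=0)
print(total_filled)
```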

    The name attribute of a Series and of its index

    ser7 = Series({"Jim":84,"HanMei":68,"Havorld":96})
    ser7.index.name = "Report card"
    ser7.name = "Chinese score"
    print(ser7)
    
    Output:
    Report card
    Jim        84
    HanMei     68
    Havorld    96
    Name: Chinese score, dtype: int64

    DataFrame

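    Creating a DataFrame from a two-dimensional array

    arr = np.array([
        ["China","USA","English"],
        [16,12,100]
    ])  # mixing strings and numbers makes NumPy store every element as a string
    df1 = DataFrame(arr)
    print(df1)
    
    Output:
           0    1        2           <- column index: columns
    0  China  USA  English           <- data: values
    1     16   12      100           <- data: values
    
    The leftmost column (0, 1) is the row index: index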

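    Creating a DataFrame with explicit column and row labels

    df2 = DataFrame(arr,columns = ["one","two","three"],index = ["一","二"])  # the row labels "一" and "二" mean "one" and "two"
    print(df2)
    
    Output:
         one  two    three
    一  China  USA  English
    二     16   12      100
    
    print(df2.columns)
    print(df2.index)
    print(df2.values)
    
    Output: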
    Index(['one', 'two', 'three'], dtype='object')
    Index(['一', '二'], dtype='object')
    [['China' 'USA' 'English']
     ['16' '12' '100']]
    

    Creating a DataFrame from a dictionary

    dict2= {"day":[1,24,12,25],"month":[5,7,3,12],"year":[1990,2001,1997,2018]}
    df3 = DataFrame(dict2)
    print(df3)
    
    Output:
       day  month  year
    0    1      5  1990
    1   24      7  2001
    2   12      3  1997
    3   25     12  2018
    
    # replace the default index
    df3.index = ["one","two","three","four"]
    print(df3)
    
    Output:
           day  month  year
    one      1      5  1990
    two     24      7  2001
    three   12      3  1997
    four    25     12  2018

    Accessing DataFrame data

    dict2= {"day":[1,24,12,25],"month":[5,7,3,12],"year":[1990,2001,1997,2018]}
    df3 = DataFrame(dict2)
    df3.index = ["one","two","three","four"]
    print(df3)
    
    Output:
           day  month  year
    one      1      5  1990
    two     24      7  2001
    three   12      3  1997
    four    25     12  2018
    
    print(df3["year"])  # select a column by its column label
    
    print(df3.loc["two"]) # select a row by its index label (.ix was removed in pandas 1.0; use .loc or .iloc)
    
    Output:
    one      1990
    two      2001
    three    1997
    four     2018
    Name: year, dtype: int64
    
    day        24
    month       7
    year     2001
    Name: two, dtype: int64
    
    df3["century"] = 21 # add a column
    df3.loc["five"] = np.nan # add a row; the NaN values turn every column into float64
    print(df3)
    
    Output:
            day  month    year  century
    one     1.0    5.0  1990.0     21.0
    two    24.0    7.0  2001.0     21.0
    three  12.0    3.0  1997.0     21.0
    four   25.0   12.0  2018.0     21.0
    five    NaN    NaN     NaN      NaN

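    Basic pandas functionality

    • Reading data files and text data
    • Indexing, selection, and filtering
    • Arithmetic operations and data alignment
    • Function application and mapping
    • Reindexing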

    Reading local data with pandas

    read1 = pd.read_csv("E:/Users/Havorld/Desktop/data.csv")
    print(read1)
    
    Output:
        name  age  source
    0  gerry   18    98.5
    1    tom   21    78.2
    2   lili   24    98.5
    3   john   20    89.2
    
    # Read a text file whose fields are separated by ";" and that has no header row
    read2 = pd.read_csv("data.txt",sep=";",header = None)
    print(read2)
    
    Output:
           0   1     2     3     4
    0  gerry  18  98.5  89.5  88.5
    1    tom  21  98.5  85.5  80.0
    2   lili  20  85.6  86.2   NaN
    3   john  18  70.0  85.0  60.0
    4    joe  19  80.0  85.0  82.0
    
    • Common read_csv parameters: (figure not available)
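
    Since the parameter table is not available, here is a hedged sketch of a few read_csv parameters that are often used together. The in-memory text, the "?" missing-value marker, and the column names are invented for illustration; they are not taken from the original data files.

```python
import io
import pandas as pd

# In-memory stand-in for a file on disk; the contents are made up.
raw = io.StringIO(
    "gerry;18;98.5;?\n"
    "tom;21;78.2;80.0\n"
    "lili;24;98.5;86.2\n"
)

df = pd.read_csv(
    raw,
    sep=";",                                     # field delimiter
    header=None,                                 # the data has no header row
    names=["name", "age", "score1", "score2"],   # column names to assign
    usecols=["name", "age", "score2"],           # keep only these columns
    index_col="name",                            # use this column as the row index
    na_values=["?"],                             # extra strings to read as NaN
    nrows=2,                                     # read only the first two data rows
)
print(df)
```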

    Filtering and selecting data with pandas

    read2.columns = ["name","age","Chinese","Math","English"] # assign column names; use a list, because a set has no defined order
    print(read2)
    
        name  age  Chinese  Math  English
    0  gerry   18     98.5  89.5     88.5
    1    tom   21     98.5  85.5     80.0
    2   lili   20     85.6  86.2      NaN
    3   john   18     70.0  85.0     60.0
    4    joe   19     80.0  85.0     82.0
    
    read3 = read2[read2.columns[2:]] # keep only the score columns (from the third column on)
    print(read3)
    
       Chinese  Math  English
    0     98.5  89.5     88.5
    1     98.5  85.5     80.0
    2     85.6  86.2      NaN
    3     70.0  85.0     60.0
    4     80.0  85.0     82.0
    
    read4 = read3.dropna() # drop the rows that contain NaN
    print(read4)
    
       Chinese  Math  English
    0     98.5  89.5     88.5
    1     98.5  85.5     80.0
    3     70.0  85.0     60.0
    4     80.0  85.0     82.0
    

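    Selecting data: loc and iloc (replacing ix)

    import numpy as np  
    import pandas as pd 
    # generate sample data
    df = pd.DataFrame(np.arange(0,60,2).reshape(10,3),columns=list('abc'))  
    print(df)
    
    # loc selects by row index labels and column names
    
    
    # value at row label 0, column b
    print(df.loc[0, 'b'])  
    # rows labelled 0 through 3 (the end label is included), columns a and b
    print(df.loc[0:3, ['a', 'b']])  
    # rows labelled 1 and 5, columns b and c
    print(df.loc[[1, 5], ['b', 'c']])  
    
    
    # iloc selects by integer row and column positions (the slice end is excluded)
    
    print(df.iloc[0,1])  
    print(df.iloc[0:4, [0,1]])  
    print(df.iloc[[1, 5], 1:3])
    
    
    # ix accepted both labels and positions, but it was deprecated in pandas 0.20 and removed in 1.0.
    # Each former ix call is written below with loc (labels) or iloc (positions).
    
    print(df.loc[0,"b"])              # was df.ix[0,"b"]
    print(df.iloc[0,1])               # was df.ix[0,1]
    print(df.loc[0:3,["a","b"]])      # was df.ix[0:3,["a","b"]]
    print(df.iloc[0:4,[0,1]])         # was df.ix[0:3,[0,1]]
    print(df.loc[[1,5],["b","c"]])    # was df.ix[[1,5],["b","c"]]
    print(df.iloc[[1,5],[1,2]])       # was df.ix[[1,5],[1,2]]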
    

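    Handling missing values (NaN) in pandas

    • dropna: filters (drops) axis labels depending on whether their values contain missing data; a threshold controls how much missing data is tolerated
    • fillna: fills missing data with a specified value or by propagating neighbouring values, for example ffill or bfill
    • isnull: returns an object of booleans indicating which values are missing (NA)
    • notnull: the negation of isnull
        df5=DataFrame([
                ['Tom',np.nan,456.67,'M'],['Merry',34,456.67,np.nan],
                ['Gerry',np.nan,np.nan,np.nan],['John',23,np.nan,'M'],
                ['Joe',18,2300,'F']],columns=['name','age','salary','Gender']
            )
        print(df5)
        
            name   age   salary Gender
        0    Tom   NaN   456.67      M
        1  Merry  34.0   456.67    NaN
        2  Gerry   NaN      NaN    NaN
        3   John  23.0      NaN      M
        4    Joe  18.0  2300.00      F
        
        df5.dropna()   # drop every row that contains a NaN
        df5.dropna(axis=1)   # drop every column that contains a NaN (axis=0, the default, works on rows)
        df5.dropna(how='all')   # drop only the rows whose values are all NaN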
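
    The threshold mentioned for dropna corresponds to its thresh parameter. The sketch below reuses the sample rows above under the illustrative name df, and also shows subset, which restricts the check to chosen columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["Tom", np.nan, 456.67, "M"],
     ["Merry", 34, 456.67, np.nan],
     ["Gerry", np.nan, np.nan, np.nan],
     ["John", 23, np.nan, "M"],
     ["Joe", 18, 2300, "F"]],
    columns=["name", "age", "salary", "Gender"],
)

# Keep only the rows that have at least 3 non-missing values.
at_least_3 = df.dropna(thresh=3)
print(at_least_3)

# Only consider the "age" column when deciding which rows to drop.
has_age = df.dropna(subset=["age"])
print(has_age)
```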
    



    df6 = DataFrame(np.random.randn(7,3))  # random values: the numbers differ on every run
    print(df6)

              0         1         2
    0  0.280872 -1.890914 -0.237311
    1  0.721152 -0.300591  0.285356
    2 -1.748477  0.991288 -0.349774
    3 -1.678800 -0.608380 -0.002143
    4 -1.273338  0.946480 -1.179870
    5 -0.533472  0.669000  0.667644
    6  1.339726  0.119211 -1.016756
    
    df6.loc[:4,2] = np.nan # set rows 0 to 4 of column 2 to NaN (a loc slice includes its end label)
    print(df6)
    
              0         1         2
    0  0.280872 -1.890914       NaN
    1  0.721152 -0.300591       NaN
    2 -1.748477  0.991288       NaN
    3 -1.678800 -0.608380       NaN
    4 -1.273338  0.946480       NaN
    5 -0.533472  0.669000  0.667644
    6  1.339726  0.119211 -1.016756
    
    df7 = df6.fillna(0)  # replace every NaN with 0
    print(df7)

              0         1         2
    0  0.280872 -1.890914  0.000000
    1  0.721152 -0.300591  0.000000
    2 -1.748477  0.991288  0.000000
    3 -1.678800 -0.608380  0.000000
    4 -1.273338  0.946480  0.000000
    5 -0.533472  0.669000  0.667644
    6  1.339726  0.119211 -1.016756
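
    Filling with a constant is only one option. As a sketch on a small made-up Series (the name s is illustrative), forward filling, backward filling, filling with a computed value, and linear interpolation look like this:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

print(s.ffill())            # carry the last valid value forward
print(s.bfill())            # pull the next valid value backward
print(s.fillna(s.mean()))   # fill with a computed value (the mean of the valid entries)
print(s.interpolate())      # linear interpolation between valid neighbours
```

    Which strategy is appropriate depends on the data; forward filling suits ordered records such as time series, while a constant or mean is a cruder default.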

    Common statistical methods in pandas

    df8 = read3  # the three score columns (Chinese, Math, English) selected earlier
    df8 = df8.dropna()
    print(df8)
    
       Chinese  Math  English
    0     98.5  89.5     88.5
    1     98.5  85.5     80.0
    3     70.0  85.0     60.0
    4     80.0  85.0     82.0
    # compute summary statistics for a Series or for each column of a DataFrame
    print(df8.describe())
    
             Chinese       Math    English
    count   4.000000   4.000000   4.000000
    mean   86.750000  86.250000  77.625000
    std    14.168627   2.179449  12.297527
    min    70.000000  85.000000  60.000000
    25%    77.500000  85.000000  75.000000
    50%    89.250000  85.250000  81.000000
    75%    98.500000  86.500000  83.625000
    max    98.500000  89.500000  88.500000
    
    print(df8.count())          # number of non-missing values in each column
    print(df8.count(axis = 1))  # number of non-missing values in each row
    
    Chinese    4
    Math       4
    English    4
    dtype: int64
    0    3
    1    3
    3    3
    4    3
    dtype: int64
    

    Correlation and covariance
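
    The original gives no example for this heading. A minimal sketch, assuming the score table from the statistics section (rebuilt here under the illustrative name scores), uses the corr and cov methods of DataFrame and Series:

```python
import pandas as pd

scores = pd.DataFrame({
    "Chinese": [98.5, 98.5, 70.0, 80.0],
    "Math": [89.5, 85.5, 85.0, 85.0],
    "English": [88.5, 80.0, 60.0, 82.0],
})

print(scores.corr())                               # pairwise Pearson correlation matrix
print(scores.cov())                                # pairwise covariance matrix
print(scores["Chinese"].corr(scores["English"]))   # correlation of two columns
print(scores["Chinese"].cov(scores["English"]))    # covariance of two columns
```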

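    Unique values, value counts, and membership

    • unique: removes duplicates, returning the distinct values
    • value_counts: counts how many times each value occurs in a Series
    • isin: tests, element by element, whether each value of a Series or DataFrame is contained in a given collection of values
        s = Series(["a","b","b","d","c"])
        
        print(s.value_counts())
        print(s.isin(["a","b"]))
        print(s.unique())
        
        Output (the order of tied counts can vary between pandas versions, and pandas 2.x also prints "Name: count"):
        b    2
        d    1
        c    1
        a    1
        dtype: int64
        
        0     True
        1     True
        2     True
        3    False
        4    False
        dtype: bool
        
        ['a' 'b' 'd' 'c']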
    

    Hierarchical indexing

    data = Series([768,325,914,666],index=[
        ["2015","2015","2015","2016"],
        ["apple","banana","orange","apple"]
    ])
    print(data)
    
    2015  apple     768
          banana    325
          orange    914
    2016  apple     666
    dtype: int64
    

    code:

    df9 = DataFrame({
        "year":[2001,2001,2002,2002,2003],
        "fruit":["apple","banana","apple","banana","apple"],
        "production":[121,122,123,124,125],
        "profits":[22.1,22.2,22.3,22.4,22.5]
    })
    print(df9)
    
       year   fruit  production  profits
    0  2001   apple         121     22.1
    1  2001  banana         122     22.2
    2  2002   apple         123     22.3
    3  2002  banana         124     22.4
    4  2003   apple         125     22.5
    
    df9 = df9.set_index(["year","fruit"])  # use year and fruit together as a two-level index (handy for looking up a fruit in a given year)
    print(df9)
    
                 production  profits
    year fruit                      
    2001 apple          121     22.1
         banana         122     22.2
    2002 apple          123     22.3
         banana         124     22.4
    2003 apple          125     22.5
    
    print(df9.loc[(2002,"apple")]) # apples in 2002 (.ix was removed in pandas 1.0)
    print(df9.loc[2002]) # all fruit in 2002
    
    production    123.0
    profits        22.3
    Name: (2002, apple), dtype: float64
            production  profits
    fruit                      
    apple          123     22.3
    banana         124     22.4
    
    df9 = df9.groupby(level="year").sum() # add up production and profits per year (sum(level=...) was removed in pandas 2.0)
    print(df9)
    
          production  profits
    year                     
    2001         243     44.3
    2002         247     44.7
    2003         125     22.5
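
    A short sketch of selecting from a hierarchically indexed Series and reshaping it, using the same values as the data Series at the start of this section. The name wide is an illustrative assumption.

```python
import pandas as pd

data = pd.Series(
    [768, 325, 914, 666],
    index=[["2015", "2015", "2015", "2016"],
           ["apple", "banana", "orange", "apple"]],
)

print(data["2015"])               # select by the outer level
print(data.loc[:, "apple"])       # select by the inner level across all years
print(data.loc[("2016", "apple")])  # select a single value with a full key

wide = data.unstack()             # the inner level becomes columns; gaps become NaN
print(wide)
```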
    

    The join function
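
    The original leaves this heading without an example. As a hedged sketch, DataFrame.join combines two frames on their index labels; the frames and labels below are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"B": ["B0", "B2", "B3"]}, index=["K0", "K2", "K3"])

left_join = left.join(right)                 # default how="left": keep every row of left
inner_join = left.join(right, how="inner")   # only the labels present in both frames
outer_join = left.join(right, how="outer")   # the union of the labels

print(left_join)
print(inner_join)
print(outer_join)
```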

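    The merge function

    Use help(pd.merge) to view the function's documentation. The signature below is the one shown by the pandas version used when this was written; newer releases add further options.

    def merge(left, right, how='inner', on=None, left_on=None, right_on=None,
              left_index=False, right_index=False, sort=False,
              suffixes=('_x', '_y'), copy=True, indicator=False)
    

    Parameter how: one of {'left', 'right', 'outer', 'inner'}; the default is 'inner'.

    left: merge using the keys of the left DataFrame; right: merge using the keys of the right DataFrame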

    import pandas as pd
    from pandas import DataFrame
    
    left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                         'B': ['B0', 'B1', 'B2', 'B3'],
                         'key': ['K0', 'K1', 'K0', 'K1']})
    
    right = pd.DataFrame({'C': ['C0', 'C1', "C2"],
                          'D': ['D0', 'D1', "D2"],
                          'K': ['K0', 'K1', "K0"]},
                         index=['zero', 'one', "two"])
    
    print(left)
    print(right)
    
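    result = pd.merge(left, right, how='left', left_on='key', right_on="K",
                      sort=False)
    print(result)
    
    Output:
        A   B key
    0  A0  B0  K0
    1  A1  B1  K1
    2  A2  B2  K0
    3  A3  B3  K1
           C   D   K
    zero  C0  D0  K0
    one   C1  D1  K1
    two   C2  D2  K0
    
    how='left':
        A   B key   C   D   K
    0  A0  B0  K0  C0  D0  K0
    1  A0  B0  K0  C2  D2  K0
    2  A1  B1  K1  C1  D1  K1
    3  A2  B2  K0  C0  D0  K0
    4  A2  B2  K0  C2  D2  K0
    5  A3  B3  K1  C1  D1  K1
    
    With the keys of the left frame driving the merge, the rows follow the order of left:
    1. left's first K0 matches the 2 K0 rows in right
    2. left's first K1 matches the 1 K1 row in right
    3. left's second K0 matches the 2 K0 rows in right
    4. left's second K1 matches the 1 K1 row in right
    With how='right' the same logic applies, driven by the rows of right.
    
    how='right' (rows follow the order of right; older pandas releases could list the K0 rows grouped together):
        A   B key   C   D   K
    0  A0  B0  K0  C0  D0  K0
    1  A2  B2  K0  C0  D0  K0
    2  A1  B1  K1  C1  D1  K1
    3  A3  B3  K1  C1  D1  K1
    4  A0  B0  K0  C2  D2  K0
    5  A2  B2  K0  C2  D2  K0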
    

    The left_on and right_on parameters

    left_on: the key column(s) of the left frame to merge on
    right_on: the key column(s) of the right frame to merge on
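
    A sketch of left_on and right_on with key columns that are named differently on each side, comparing an inner and an outer merge. The employees and departments frames are invented for illustration; indicator=True adds a _merge column recording where each row came from.

```python
import pandas as pd

employees = pd.DataFrame({"emp": ["Ann", "Bob", "Cid"],
                          "dept_id": [1, 2, 4]})
departments = pd.DataFrame({"id": [1, 2, 3],
                            "dept": ["Sales", "IT", "HR"]})

# Inner merge (the default): only keys found on both sides, here 1 and 2.
inner = pd.merge(employees, departments, left_on="dept_id", right_on="id")

# Outer merge: every key from either side; unmatched cells become NaN.
outer = pd.merge(employees, departments, how="outer",
                 left_on="dept_id", right_on="id", indicator=True)

print(inner)
print(outer)
```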

    The agg function
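
    The original gives no example here. As a hedged sketch, agg applies one or more aggregation functions, either to a whole column or per group after groupby. The data mirrors the fruit table used earlier; the name summary is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "banana", "apple"],
    "production": [121, 122, 123, 124, 125],
    "profits": [22.1, 22.2, 22.3, 22.4, 22.5],
})

# Several aggregations of one column at once.
stats = df["production"].agg(["min", "max", "mean"])
print(stats)

# A different aggregation for each column, computed per group.
summary = df.groupby("fruit").agg({"production": "sum", "profits": "mean"})
print(summary)
```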

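    The apply function

    mDataFram["score"]= mSeries.apply(####)

    mDataFram["score"] = mDataFram.apply(####)

    Here #### stands for the function to apply. apply passes each value of the Series (or each row or column of the DataFrame) to that function and returns the collected results.

    The function can also return several values:

    mDataFram[["score","count"]] = mDataFram.apply(####)

    mDataFram["score"], mDataFram["count"] = zip(*mDataFram.apply(####))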
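
    A runnable sketch of the four patterns above. The column names and the helper total_and_best are hypothetical stand-ins for the #### placeholder; for the multi-column assignment this sketch assumes row-wise application with axis=1 and result_type="expand".

```python
import pandas as pd

df = pd.DataFrame({"chinese": [98.5, 70.0, 80.0],
                   "math": [89.5, 85.0, 85.0]})

# Series.apply: the function receives one value at a time.
df["grade"] = df["chinese"].apply(lambda v: "A" if v >= 90 else "B")

# DataFrame.apply with axis=1: the function receives one row (a Series).
df["total"] = df.apply(lambda row: row["chinese"] + row["math"], axis=1)

def total_and_best(row):
    """Hypothetical helper returning two values per row."""
    return row["chinese"] + row["math"], max(row["chinese"], row["math"])

# result_type="expand" spreads the returned tuple over several columns.
df[["score", "best"]] = df.apply(total_and_best, axis=1, result_type="expand")

# Equivalent unpacking with zip.
df["score2"], df["best2"] = zip(*df.apply(total_and_best, axis=1))
print(df)
```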

    The groupby function

    import numpy as np
    from pandas import DataFrame
    
    df = DataFrame(
        {'key1': ['a', 'a', 'b', 'b', 'a'],
         'key2': ['one', 'two', 'one', 'two', 'one'],
         'data1': np.random.randn(5),
         'data2': np.random.randn(5)})
    print(df)
    print("--------")
    
    grouped1 = df['data1'].groupby(df['key1'])
    print(grouped1.mean())
    print("--------")
    
    grouped2 = df['data1'].groupby(df['key2'])
    print(grouped2.mean())
    
    Output (data1 and data2 are random, so the numbers differ on every run):
      key1 key2     data1     data2
    0    a  one -2.589168 -0.733088
    1    a  two  0.807556 -0.396627
    2    b  one -0.425544 -0.007338
    3    b  two -1.867421 -1.037650
    4    a  one  0.851296  0.548271
    --------
    key1
    a   -0.310106
    b   -1.146482
    Name: data1, dtype: float64
    --------
    key2
    one   -0.721139
    two   -0.529933
    Name: data1, dtype: float64
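
    Grouping also works with several keys at once. The sketch below uses fixed values instead of random ones so the results are reproducible; the name means is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],       # fixed values, not random
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Group by two keys: the result has a two-level (hierarchical) index.
means = df.groupby(["key1", "key2"])["data1"].mean()
print(means)
print(means.unstack())   # pivot the inner level into columns

# Group sizes, and several statistics for each numeric column.
print(df.groupby("key1").size())
print(df.groupby("key1")[["data1", "data2"]].agg(["mean", "sum"]))
```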
    


    Article link: https://www.haomeiwen.com/subject/ycsjpktx.html