数据分析工具pandas快速入门教程1-开胃菜

作者: python测试开发 | 来源:发表于2018-08-16 21:53 被阅读4次

    简介

    Pandas是用于数据分析的开源Python库,也是目前数据分析最重要的开源库。它能够处理类似电子表格的数据,用于快速数据加载,操作,对齐,合并等。为Python提供这些增强功能,Pandas的数据类型为:Series和DataFrame。DataFrame为整个电子表格或矩形数据,而Series是DataFrame的列。DataFrame也可以被认为是字典或Series的集合。

    加载数据

    load.py

    
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # load.py
    
    import pandas as pd
    
    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t') 
    
    print("\n\n查看前五行")
    print(df.head())
    
    print("\n\n查看类型")
    print(type(df))
    
    print("\n\n查看大小")
    print(df.shape)
    
    print("\n\n查看列名")
    print(df.columns)
    
    print("\n\n查看dtypes(基于列)")
    print(df.dtypes)
    
    print("\n\n查看统计信息")
    print(df.info())
    
    

    执行结果

    
    $ ./load.py 
    
    
    查看前五行
           country continent  year  lifeExp       pop   gdpPercap
    0  Afghanistan      Asia  1952   28.801   8425333  779.445314
    1  Afghanistan      Asia  1957   30.332   9240934  820.853030
    2  Afghanistan      Asia  1962   31.997  10267083  853.100710
    3  Afghanistan      Asia  1967   34.020  11537966  836.197138
    4  Afghanistan      Asia  1972   36.088  13079460  739.981106
    
    
    查看类型
    <class 'pandas.core.frame.DataFrame'>
    
    
    查看大小
    (1704, 6)
    
    
    查看列名
    Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
    
    
    查看dtypes(基于列)
    country       object
    continent     object
    year           int64
    lifeExp      float64
    pop            int64
    gdpPercap    float64
    dtype: object
    
    
    查看统计信息
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1704 entries, 0 to 1703
    Data columns (total 6 columns):
    country      1704 non-null object
    continent    1704 non-null object
    year         1704 non-null int64
    lifeExp      1704 non-null float64
    pop          1704 non-null int64
    gdpPercap    1704 non-null float64
    dtypes: float64(2), int64(2), object(2)
    memory usage: 80.0+ KB
    None
    
    Pandas类型 Python类型
    object string
    int64 int
    float64 float
    datetime64 datetime

    行列与单元格

    col.py

    
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # col.py
    
    import pandas as pd
    
    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t') 
    
    # 列操作
    country_df = df['country'] # 列名选取单列
    
    print("\n\n列首5行")
    print(country_df.head())
    
    print("\n\n列尾5行")
    print(country_df.tail())
    
    country_df_dot = df.country # 点号的方式选取列
    print("\n\n点号的方式选取列")
    print(country_df_dot.head())
    
    subset = df[['country', 'continent', 'year']] # 选取多列
    print("\n\n选取多列")
    print(subset.head())
    

    执行结果

    
    $ ./col.py 
    
    
    列首5行
    0    Afghanistan
    1    Afghanistan
    2    Afghanistan
    3    Afghanistan
    4    Afghanistan
    Name: country, dtype: object
    
    
    列尾5行
    1699    Zimbabwe
    1700    Zimbabwe
    1701    Zimbabwe
    1702    Zimbabwe
    1703    Zimbabwe
    Name: country, dtype: object
    
    
    点号的方式选取列
    0    Afghanistan
    1    Afghanistan
    2    Afghanistan
    3    Afghanistan
    4    Afghanistan
    Name: country, dtype: object
    
    
    选取多列
           country continent  year
    0  Afghanistan      Asia  1952
    1  Afghanistan      Asia  1957
    2  Afghanistan      Asia  1962
    3  Afghanistan      Asia  1967
    4  Afghanistan      Asia  1972
    

    row.py

    
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # row.py
    
    import pandas as pd
    
    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t') 
    
    # 行操作,注意df.loc[-1]是非法的
    print("\n\n第一行")
    print(df.loc[0])
    
    print("\n\n行数")
    number_of_rows = df.shape[0]
    print(number_of_rows)
    
    last_row_index = number_of_rows - 1
    print("\n\n最后一行")
    print(df.loc[last_row_index])
    
    print("\n\ntail的方法输出最后一行")
    print(df.tail(n=1))
    
    subset_loc = df.loc[0]
    subset_head = df.head(n=1)
    print("\n\nloc的类型为序列Series")
    print(type(subset_loc))
    
    print("\n\nhead的类型为数据帧DataFrame")
    print(type(subset_head))
    
    print("\n\nloc选取三列,类型为数据帧DataFrame")
    print(df.loc[[0, 99, 999]])
    print(type(df.loc[[0, 99, 999]]))
    
    print("\n\niloc选取第一行")
    print(df.iloc[0])
    
    print("\n\niloc选取三行")
    print(df.iloc[[0, 99, 999]])
    
    

    执行结果

    
    $ ./row.py 
    
    
    第一行
    country      Afghanistan
    continent           Asia
    year                1952
    lifeExp           28.801
    pop              8425333
    gdpPercap        779.445
    Name: 0, dtype: object
    
    
    行数
    1704
    
    
    最后一行
    country      Zimbabwe
    continent      Africa
    year             2007
    lifeExp        43.487
    pop          12311143
    gdpPercap     469.709
    Name: 1703, dtype: object
    
    
    tail的方法输出最后一行
           country continent  year  lifeExp       pop   gdpPercap
    1703  Zimbabwe    Africa  2007   43.487  12311143  469.709298
    
    
    loc的类型为序列Series
    <class 'pandas.core.series.Series'>
    
    
    head的类型为数据帧DataFrame
    <class 'pandas.core.frame.DataFrame'>
    
    
    loc选取三列,类型为数据帧DataFrame
             country continent  year  lifeExp       pop    gdpPercap
    0    Afghanistan      Asia  1952   28.801   8425333   779.445314
    99    Bangladesh      Asia  1967   43.453  62821884   721.186086
    999     Mongolia      Asia  1967   51.253   1149500  1226.041130
    <class 'pandas.core.frame.DataFrame'>
    
    
    iloc选取第一行
    country      Afghanistan
    continent           Asia
    year                1952
    lifeExp           28.801
    pop              8425333
    gdpPercap        779.445
    Name: 0, dtype: object
    
    
    iloc选取三行
             country continent  year  lifeExp       pop    gdpPercap
    0    Afghanistan      Asia  1952   28.801   8425333   779.445314
    99    Bangladesh      Asia  1967   43.453  62821884   721.186086
    999     Mongolia      Asia  1967   51.253   1149500  1226.041130
    

    mix.py

    
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # Author:    xurongzhong#126.com wechat:pythontesting qq:37391319
    # qq群:144081101 591302926  567351477
    # CreateDate: 2018-06-07
    # mix.py
    
    import pandas as pd
    
    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t') 
    
    # 混合选取
    print("\n\nloc选取坐标")
    print(df.loc[42, 'country'])
    
    print("\n\niloc选取坐标")
    print(df.iloc[42, 0])
    
    print("\n\nloc选取子集")
    print(df.loc[[0, 99, 999], ['country', 'lifeExp', 'gdpPercap']])
    

    执行结果

    #!python
    
    $ ./mix.py 
    
    
    loc选取坐标
    Angola
    
    
    iloc选取坐标
    Angola
    
    
    loc选取子集
             country  lifeExp    gdpPercap
    0    Afghanistan   28.801   779.445314
    99    Bangladesh   43.453   721.186086
    999     Mongolia   51.253  1226.041130
    

    分组和聚合

    group.py

    
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # group.py
    
    import pandas as pd
    
    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t') 
    
    print("\n\n年人均产值")
    print(df.groupby('year')['lifeExp'].mean())
    
    print("\n\n基于年分组")
    grouped_year_df = df.groupby('year')
    print(type(grouped_year_df))
    print(grouped_year_df)
    
    print("\n\nlifeExp")
    grouped_year_df_lifeExp = grouped_year_df['lifeExp']
    print(type(grouped_year_df_lifeExp))
    print(grouped_year_df_lifeExp)
    
    print("\n\n年平均产值")
    mean_lifeExp_by_year = grouped_year_df_lifeExp.mean()
    print(mean_lifeExp_by_year)
    
    print("\n\n基于年和洲分组")
    print(df.groupby(['year', 'continent'])[['lifeExp',
    'gdpPercap']].mean())
    
    print("\n\n统计每个洲的国家数")
    print(df.groupby('continent')['country'].nunique())
    
    

    执行结果

    #!python
    
    $ ./group.py 
    
    
    年人均产值
    year
    1952    49.057620
    1957    51.507401
    1962    53.609249
    1967    55.678290
    1972    57.647386
    1977    59.570157
    1982    61.533197
    1987    63.212613
    1992    64.160338
    1997    65.014676
    2002    65.694923
    2007    67.007423
    Name: lifeExp, dtype: float64
    
    
    基于年分组
    <class 'pandas.core.groupby.groupby.DataFrameGroupBy'>
    <pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f0e2b0c89e8>
    
    
    lifeExp
    <class 'pandas.core.groupby.groupby.SeriesGroupBy'>
    <pandas.core.groupby.groupby.SeriesGroupBy object at 0x7f0e151e2f28>
    
    
    年平均产值
    year
    1952    49.057620
    1957    51.507401
    1962    53.609249
    1967    55.678290
    1972    57.647386
    1977    59.570157
    1982    61.533197
    1987    63.212613
    1992    64.160338
    1997    65.014676
    2002    65.694923
    2007    67.007423
    Name: lifeExp, dtype: float64
    
    
    基于年和洲分组
                      lifeExp     gdpPercap
    year continent                         
    1952 Africa     39.135500   1252.572466
         Americas   53.279840   4079.062552
         Asia       46.314394   5195.484004
         Europe     64.408500   5661.057435
         Oceania    69.255000  10298.085650
    1957 Africa     41.266346   1385.236062
         Americas   55.960280   4616.043733
         Asia       49.318544   5787.732940
         Europe     66.703067   6963.012816
         Oceania    70.295000  11598.522455
    1962 Africa     43.319442   1598.078825
         Americas   58.398760   4901.541870
         Asia       51.563223   5729.369625
         Europe     68.539233   8365.486814
         Oceania    71.085000  12696.452430
    1967 Africa     45.334538   2050.363801
         Americas   60.410920   5668.253496
         Asia       54.663640   5971.173374
         Europe     69.737600  10143.823757
         Oceania    71.310000  14495.021790
    1972 Africa     47.450942   2339.615674
         Americas   62.394920   6491.334139
         Asia       57.319269   8187.468699
         Europe     70.775033  12479.575246
         Oceania    71.910000  16417.333380
    1977 Africa     49.580423   2585.938508
         Americas   64.391560   7352.007126
         Asia       59.610556   7791.314020
         Europe     71.937767  14283.979110
         Oceania    72.855000  17283.957605
    1982 Africa     51.592865   2481.592960
         Americas   66.228840   7506.737088
         Asia       62.617939   7434.135157
         Europe     72.806400  15617.896551
         Oceania    74.290000  18554.709840
    1987 Africa     53.344788   2282.668991
         Americas   68.090720   7793.400261
         Asia       64.851182   7608.226508
         Europe     73.642167  17214.310727
         Oceania    75.320000  20448.040160
    1992 Africa     53.629577   2281.810333
         Americas   69.568360   8044.934406
         Asia       66.537212   8639.690248
         Europe     74.440100  17061.568084
         Oceania    76.945000  20894.045885
    1997 Africa     53.598269   2378.759555
         Americas   71.150480   8889.300863
         Asia       68.020515   9834.093295
         Europe     75.505167  19076.781802
         Oceania    78.190000  24024.175170
    2002 Africa     53.325231   2599.385159
         Americas   72.422040   9287.677107
         Asia       69.233879  10174.090397
         Europe     76.700600  21711.732422
         Oceania    79.740000  26938.778040
    2007 Africa     54.806038   3089.032605
         Americas   73.608120  11003.031625
         Asia       70.728485  12473.026870
         Europe     77.648600  25054.481636
         Oceania    80.719500  29810.188275
    
    
    统计每个洲的国家数
    continent
    Africa      52
    Americas    25
    Asia        33
    Europe      30
    Oceania      2
    Name: country, dtype: int64
    

    基本绘图

    
    import pandas as pd
    
    df = pd.read_csv(r"../data/gapminder.tsv", sep='\t') 
    
    global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
    print(global_yearly_life_expectancy)
    
    global_yearly_life_expectancy.plot()
    
    
    image.png

    相关文章

      网友评论

        本文标题:数据分析工具pandas快速入门教程1-开胃菜

        本文链接:https://www.haomeiwen.com/subject/zphabftx.html