Pandas CSV - read_csv / to_csv()

Author: shellblock | Published: 2021-05-14 10:52

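    CSV (Comma-Separated Values; sometimes called character-separated values, because the delimiter does not have to be a comma) files store tabular data (numbers and text) as plain text.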
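Because the delimiter does not have to be a comma, the same table can be read from text that uses another separator. The following is a minimal sketch with made-up sample data (the names and ages are illustrative, not from the article's files); it passes the delimiter to read_csv through the sep parameter.

```python
import io
import pandas as pd

# Hypothetical sample: the same table written with two different delimiters.
comma_text = "name,age\nAnn,30\nBob,25\n"
semicolon_text = "name;age\nAnn;30\nBob;25\n"

# sep defaults to ',' so only the semicolon version needs it spelled out.
df_comma = pd.read_csv(io.StringIO(comma_text))
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

# Both calls produce the same DataFrame.
print(df_comma.equals(df_semicolon))
```

io.StringIO stands in for a file on disk here so that the sketch runs without any external data.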

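    CSV is a general-purpose, relatively simple file format that is widely used in consumer, business, and scientific applications.

    This article uses meal_order_info.csv as its example.

    Syntax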

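    Basic syntax (the signature below matches pandas 1.x; some of these parameters, such as squeeze, prefix, error_bad_lines, and warn_bad_lines, were removed in pandas 2.0):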

    pd.read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]],
    sep=',', delimiter=None, header='infer', names=None, index_col=None,
    usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True,
    dtype=None, engine=None, converters=None, true_values=None,
    false_values=None, skipinitialspace=False, skiprows=None,
    skipfooter=0, nrows=None, na_values=None, keep_default_na=True,
    na_filter=True, verbose=False, skip_blank_lines=True,
    parse_dates=False, infer_datetime_format=False,
    keep_date_col=False, date_parser=None, dayfirst=False,
    cache_dates=True, iterator=False, chunksize=None,
    compression='infer', thousands=None, decimal: str = '.',
    lineterminator=None, quotechar='"', quoting=0,
    doublequote=True, escapechar=None, comment=None,
    encoding=None, dialect=None, error_bad_lines=True,
    warn_bad_lines=True, delim_whitespace=False,
    low_memory=True, memory_map=False, float_precision=None)
    

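    Parameters

    Commonly used parameters of pandas.read_csv:

    Parameter            Description
    filepath_or_buffer   str; the path to the file. No default.
    sep                  str; the field delimiter. Defaults to ",".
    header               int or sequence of int; the row number(s) to use as column names. An int uses that row (counting from 0) as the header; a sequence of ints combines those rows into multi-level column names. Defaults to "infer": the first row is used as the header unless names is given.
    names                array-like; the column names to use. Defaults to None.
    index_col            int, sequence, or False; the column(s) to use as the row index. False tells pandas not to use the first column as the index. Defaults to None.
    dtype                dict; maps column names to data types. Defaults to None.
    engine               "c" or "python"; the parser engine to use. The C engine is faster.
    nrows                int; the number of rows to read. Defaults to None (read all rows).
    encoding             str; the file encoding, such as "utf-8" or "gbk". Defaults to None.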
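To make the parameters above concrete, here is a minimal sketch that combines several of them in one call. The tab-separated sample text is hypothetical (its values are borrowed from the first columns of the meal order table for illustration), and io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical data: tab-separated, with no header row in the text.
raw = "417\t1442\t4\n301\t1095\t3\n413\t1147\t6\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                 # fields are separated by tabs
    header=None,              # the text has no header row
    names=["info_id", "emp_id", "number_consumers"],  # supply column names
    index_col=0,              # use the first column (info_id) as the row index
    dtype={"emp_id": str},    # keep emp_id as text instead of a number
    nrows=2,                  # read only the first 2 data rows
)
print(df)
```

Only two rows are loaded because of nrows=2, and emp_id stays a string because of the dtype mapping.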

    Example

    import pandas as pd
    df = pd.read_csv('.../data/meal_order_info.csv', encoding='gbk')
    print(df.head())
    

    Output:

        info_id  emp_id  number_consumers  mode  dining_table_id  \
    0      417    1442                 4   NaN             1501   
    1      301    1095                 3   NaN             1430   
    2      413    1147                 6   NaN             1488   
    3      415    1166                 4   NaN             1502   
    4      392    1094                10   NaN             1499   
    
       dining_table_name  expenditure  dishes_count  accounts_payable  \
    0               1022          165             5               165   
    1               1031          321             6               321   
    2               1009          854            15               854   
    3               1023          466            10               466   
    4               1020          704            24               704   
    
          use_start_time  ...           lock_time cashier_id  pc_id  order_number  \
    0  2016/8/1 11:05:36  ...   2016/8/1 11:11:46        NaN    NaN           NaN   
    1  2016/8/1 11:15:57  ...   2016/8/1 11:31:55        NaN    NaN           NaN   
    2  2016/8/1 12:42:52  ...   2016/8/1 12:54:37        NaN    NaN           NaN   
    3  2016/8/1 12:51:38  ...   2016/8/1 13:08:20        NaN    NaN           NaN   
    4  2016/8/1 12:58:44  ...   2016/8/1 13:07:16        NaN    NaN           NaN   
    
       org_id  print_doc_bill_num  lock_table_info  order_status        phone  \
    0     330                 NaN              NaN             1  18688880641   
    1     328                 NaN              NaN             1  18688880174   
    2     330                 NaN              NaN             1  18688880276   
    3     330                 NaN              NaN             1  18688880231   
    4     330                 NaN              NaN             1  18688880173   
    
       name  
    0   苗宇怡  
    1    赵颖  
    2   徐毅凡  
    3   张大鹏  
    4   孙熙凯  
    
    [5 rows x 21 columns]
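The encoding='gbk' argument in the example above matters because the file contains Chinese text saved in the GBK encoding. The sketch below illustrates this with a few bytes encoded as GBK in memory (a stand-in for such a file, using two names from the output above); with a mismatched encoding, the read would typically fail or produce garbled text.

```python
import io
import pandas as pd

# Sample bytes encoded as GBK, standing in for a file saved in that encoding.
raw_bytes = "info_id,name\n417,苗宇怡\n301,赵颖\n".encode("gbk")

# Telling read_csv the encoding lets it decode the Chinese names correctly.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="gbk")
print(df)
```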
    

    Similarly, the to_csv() method saves a DataFrame as a CSV file.

    Example

    import pandas as pd

    # Three fields: name, site, age
    names = ["Google", "Runoob", "Taobao", "Wiki"]
    sites = ["www.google.com", "www.runoob.com", "www.taobao.com", "www.wikipedia.org"]
    ages = [90, 40, 80, 98]

    # Dictionary of columns (called data so it does not shadow the built-in dict)
    data = {'name': names, 'site': sites, 'age': ages}

    df = pd.DataFrame(data)

    # Save the DataFrame
    df.to_csv('site.csv')
    

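    After it runs successfully, open site.csv. Its contents are:

    ,name,site,age
    0,Google,www.google.com,90
    1,Runoob,www.runoob.com,40
    2,Taobao,www.taobao.com,80
    3,Wiki,www.wikipedia.org,98

    The unnamed first column is the DataFrame index, which to_csv() writes by default.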
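By default to_csv() also writes the DataFrame's row index as an extra first column. If that column is not wanted, index=False leaves it out. The following minimal sketch uses a shortened version of the data above and calls to_csv() without a path, in which case it returns the CSV text as a string instead of writing a file.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "name": ["Google", "Runoob"],
    "site": ["www.google.com", "www.runoob.com"],
    "age": [90, 40],
})

# With no path argument, to_csv() returns the CSV text as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)
print(without_index)

# Reading the index-free text back gives a DataFrame equal to the original.
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.equals(df))
```

Writing without the index is a common choice when the index is just the default 0, 1, 2, ... numbering, since read_csv will recreate it on load.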

    Data processing

    head()
    head(n) returns the first n rows; if n is omitted, it returns 5 rows by default.

    Example - read the first 5 rows

    import pandas as pd
    df = pd.read_csv('nba.csv')
    print(df.head())
    

    Output:

                Name            Team  Number Position   Age Height  Weight            College     Salary
    0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0
    1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0
    2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN
    3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0
    4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0
    

    Example - read the first 10 rows

    import pandas as pd
    
    df = pd.read_csv('nba.csv')
    
    print(df.head(10))
    

    Output:

                Name            Team  Number Position   Age Height  Weight            College      Salary
    0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas   7730337.0
    1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette   6796117.0
    2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University         NaN
    3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State   1148640.0
    4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN   5000000.0
    5   Amir Johnson  Boston Celtics    90.0       PF  29.0    6-9   240.0                NaN  12000000.0
    6  Jordan Mickey  Boston Celtics    55.0       PF  21.0    6-8   235.0                LSU   1170960.0
    7   Kelly Olynyk  Boston Celtics    41.0        C  25.0    7-0   238.0            Gonzaga   2165160.0
    8   Terry Rozier  Boston Celtics    12.0       PG  22.0    6-2   190.0         Louisville   1824360.0
    9   Marcus Smart  Boston Celtics    36.0       PG  22.0    6-4   220.0     Oklahoma State   3431040.0
    

    tail()

    tail(n) returns the last n rows; if n is omitted, it returns 5 rows by default. A row whose fields are all empty shows NaN in every column.

    Example - read the last 5 rows

    import pandas as pd
    
    df = pd.read_csv('nba.csv')
    
    print(df.tail())
    

    Output:

                 Name       Team  Number Position   Age Height  Weight College     Salary
    453  Shelvin Mack  Utah Jazz     8.0       PG  26.0    6-3   203.0  Butler  2433333.0
    454     Raul Neto  Utah Jazz    25.0       PG  24.0    6-1   179.0     NaN   900000.0
    455  Tibor Pleiss  Utah Jazz    21.0        C  26.0    7-3   256.0     NaN  2900000.0
    456   Jeff Withey  Utah Jazz    24.0        C  26.0    7-0   231.0  Kansas   947276.0
    457           NaN        NaN     NaN      NaN   NaN    NaN     NaN     NaN        NaN
    

    Example - read the last 10 rows

    import pandas as pd
    
    df = pd.read_csv('nba.csv')
    
    print(df.tail(10))
    

    Output:

                   Name       Team  Number Position   Age Height  Weight   College      Salary
    448  Gordon Hayward  Utah Jazz    20.0       SF  26.0    6-8   226.0    Butler  15409570.0
    449     Rodney Hood  Utah Jazz     5.0       SG  23.0    6-8   206.0      Duke   1348440.0
    450      Joe Ingles  Utah Jazz     2.0       SF  28.0    6-8   226.0       NaN   2050000.0
    451   Chris Johnson  Utah Jazz    23.0       SF  26.0    6-6   206.0    Dayton    981348.0
    452      Trey Lyles  Utah Jazz    41.0       PF  20.0   6-10   234.0  Kentucky   2239800.0
    453    Shelvin Mack  Utah Jazz     8.0       PG  26.0    6-3   203.0    Butler   2433333.0
    454       Raul Neto  Utah Jazz    25.0       PG  24.0    6-1   179.0       NaN    900000.0
    455    Tibor Pleiss  Utah Jazz    21.0        C  26.0    7-3   256.0       NaN   2900000.0
    456     Jeff Withey  Utah Jazz    24.0        C  26.0    7-0   231.0    Kansas    947276.0
    457             NaN        NaN     NaN      NaN   NaN    NaN     NaN       NaN         NaN
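Row 457 in the output above is all NaN because the corresponding line in the file holds only delimiters. A completely blank line behaves differently: read_csv skips it by default (skip_blank_lines=True). The sketch below shows both cases on a small hypothetical sample, then drops the all-NaN row with dropna(how="all"), which is one possible way to clean it up and is not something the original example does.

```python
import io
import pandas as pd

# Hypothetical sample: one truly blank line, then a line holding only commas.
raw = "Name,Team,Age\nJeff Withey,Utah Jazz,26\n\n,,\n"

# The blank line is skipped; the ',,' line becomes a row of NaN values.
df = pd.read_csv(io.StringIO(raw))
print(df)

# Drop rows in which every field is missing.
cleaned = df.dropna(how="all")
print(len(df), len(cleaned))
```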
    

    info()

    info() prints basic information about the table:

    Example

    import pandas as pd
    
    df = pd.read_csv('nba.csv')
    
    df.info()  # info() prints the summary itself and returns None
    

    Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 458 entries, 0 to 457          # row count: 458 rows, the first numbered 0
    Data columns (total 9 columns):            # column count: 9 columns
     #   Column    Non-Null Count  Dtype       # data type of each column
    ---  ------    --------------  -----  
     0   Name      457 non-null    object 
     1   Team      457 non-null    object 
     2   Number    457 non-null    float64
     3   Position  457 non-null    object 
     4   Age       457 non-null    float64
     5   Height    457 non-null    object 
     6   Weight    457 non-null    float64
     7   College   373 non-null    object         # non-null: number of values that are not missing
     8   Salary    446 non-null    float64
    dtypes: float64(4), object(5)                 # summary of data types
    

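    Non-null means a value is present. From the information above, the table has 458 rows in total, and College has the most missing values: only 373 of its 458 entries are non-null.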
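The missing-value counts can also be computed directly rather than read off the info() summary. This minimal sketch rebuilds the first six rows of the Name, College, and Salary columns shown earlier (so that it runs without nba.csv) and counts the missing values per column with isna().sum().

```python
import pandas as pd

# First six rows of three columns from the head(10) output shown earlier.
df = pd.DataFrame({
    "Name": ["Avery Bradley", "Jae Crowder", "John Holland",
             "R.J. Hunter", "Jonas Jerebko", "Amir Johnson"],
    "College": ["Texas", "Marquette", "Boston University",
                "Georgia State", None, None],
    "Salary": [7730337.0, 6796117.0, None, 1148640.0, 5000000.0, 12000000.0],
})

# Number of missing values in each column.
missing = df.isna().sum()
print(missing)

# Name of the column with the most missing values.
print(missing.idxmax())
```

On the full file the same two calls would be applied to the DataFrame returned by pd.read_csv('nba.csv').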
