美文网首页
pandas数据读入

pandas数据读入

作者: 千千罐 | 来源:发表于2020-10-20 10:20 被阅读0次

    Pandas中关于读入数据的相关函数列表如下:

    pd.read_clipboard() pd.read_gbq()       pd.read_parquet()   pd.read_sql_query()
    pd.read_csv()       pd.read_hdf()       pd.read_pickle()    pd.read_sql_table()
    pd.read_excel()     pd.read_html()      pd.read_sas()       pd.read_stata()
    pd.read_feather()   pd.read_json()      pd.read_spss()      pd.read_table()
    pd.read_fwf()       pd.read_msgpack()   pd.read_sql()   
    

    以下函数都有对应的导出函数:

    pd.read_clipboard() pd.read_feather()   pd.read_html()      pd.read_pickle()
    pd.read_csv()       pd.read_gbq()       pd.read_json()      pd.read_sql()
    pd.read_excel()     pd.read_hdf()       pd.read_msgpack()   
    

    本文先主要介绍2个参数,陆续更新 。。。。。。

    • pandas.read_csv() :主要读取文本文件.csv,.txt,.xls
    • pandas.read_excel() :主要读取超文本文件.xlsx
    1. pandas.read_csv() 读入文本文件或者URL等

    1)函数详解:

    pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, 
                    usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None,
                    converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None,
                    nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, 
                    skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False,       
                    date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', 
                    thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, 
                    comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, 
                    warn_bad_lines=True, skipfooter=0, doublequote=True, delim_whitespace=False, low_memory=True, 
                    memory_map=False, float_precision=None)
    

    常用参数介绍:

    • sep :分隔符(str)默认是","。若读入的是txt,应设置为sep='\t"
    • delimiter :定界符(str),默认是"None"。备选分隔符(如果指定该参数,则sep参数失效)
    • header :指定行数用来作为列名( int or list of ints), 默认是 ‘infer’。 如果同时header和skiprows指定,header对应skiprows相对的结束的位置,header使用新的行号(忽略注释/空行)
    • names:用于结果的列名列表(array-like),默认是 "None"。如果数据文件中没有列标题行,就需要执行header=None。默认列表中不能出现重复,除非设定参数mangle_dupe_cols=True
    • index_col:用作行索引的列编号或者列名(int or sequence or False), 默认是 “None”。如果给定一个序列则有多个行索引。如果文件不规则,行尾有分隔符,则可以设定index_col=False 来是的pandas不适用第一列作为行索引。
    • encoding:字符编码。encoding="latin-1" 来读取法语字符,encoding="utf-8" 或 encoding="gbk"读取中文
    • skiprows: 跳过某些行不读入。skiprows=[1,2] 表示不读取第 1 行和第 2 行。会忽略comment,若第一行满足comment,则skiprows=[1],则结果只会不读第一行,行编号不会由comment而改变,即skiprows使用原始行号(包括注释/空行)
    • nrows : 设置读取行数。默认是"None”,若想要读取前10行,应设置为nrow=10
    • comment:str(1个字符), 默认是“None",匹配到str后的所有字符会被解析为"NaN"

    2)参数应用示例:

    读入文件(example.csv)内容如下:

    Chr,Start,End,Ref,Alt,Ref_Reads,Alt_Reads,Var
    #chr6,117643659,117643659,C,G,0,3199,100.00%
    
    chr12,25368462,25368462,C,T,1,2258,99.90%
    chr17,7579801,7579801,G,C,51,1727,99.30%
    chr9,21975017,21975017,C,T,1,3593,99.90%
    

    默认读入时,会忽略空行

    >>> import pandas as pd
    >>> #pd.__version__  #查看Pandas版本
    >>> df = pd.read_csv("example.csv")
    >>> df
         Chr      Start        End Ref Alt  Ref_Reads  Alt_Reads      Var
    0  #chr6  117643659  117643659   C   G          0       3199  100.00%
    1  chr12   25368462   25368462   C   T          1       2258   99.90%
    2  chr17    7579801    7579801   G   C         51       1727   99.30%
    3   chr9   21975017   21975017   C   T          1       3593   99.90% 
    

    设置comment="#",忽略匹配到#之后的所有内容,这里"#"可以换成任意其他字符,字符数必须时1。在python中类似于匹配到该字符。在pandas中会把匹配到该字符之后的所有字符转化为NaN进行写入

    >>> df = pd.read_csv("example.csv",comment="#")
    >>> df        #这里#开头的那行没有读入
         Chr     Start       End Ref Alt  Ref_Reads  Alt_Reads     Var
    0  chr12  25368462  25368462   C   T          1       2258  99.90%
    1  chr17   7579801   7579801   G   C         51       1727  99.30%
    2   chr9  21975017  21975017   C   T          1       3593  99.90%
    
    ''' example-1.csv"文件内容如下:'''
    
    Chr,Start,End,Ref,Alt,Ref_Reads,Alt_Reads,Var
    #chr6,117643659,117643659,C,G,0,3199,100.00%
    
    chr12,#25368462,25368462,C,T,1,2258,99.90%
    chr17,7579#801,7579801,G,C,51,1727,99.30%
    chr9,21975017,21975017,C,T,1,3593,99.90%
    
    >>> df1 = pd.read_csv("example-1.csv",comment="#")
    >>> df1      #发现#之后的内容全部转为NaN了
         Chr       Start         End  Ref  Alt  Ref_Reads  Alt_Reads     Var
    0  chr12         NaN         NaN  NaN  NaN        NaN        NaN     NaN
    1  chr17      7579.0         NaN  NaN  NaN        NaN        NaN     NaN
    2   chr9  21975017.0  21975017.0    C    T        1.0     3593.0  99.90%
    

    header 默认第一行为列名(即header=0),若同时使用comment时,这里会忽略注释行,重新进行索引排序后的第一行

    >>> df = pd.read_csv("example.csv",comment="#")
    >>> df
         Chr     Start       End Ref Alt  Ref_Reads  Alt_Reads     Var
    0  chr12  25368462  25368462   C   T          1       2258  99.90%
    1  chr17   7579801   7579801   G   C         51       1727  99.30%
    2   chr9  21975017  21975017   C   T          1       3593  99.90%
    
    >>> df = pd.read_csv("example-2.csv",comment="#",header=0)
    >>> df
         Chr     Start       End Ref Alt  Ref_Reads  Alt_Reads     Var
    0  chr12  25368462  25368462   C   T          1       2258  99.90%
    1  chr17   7579801   7579801   G   C         51       1727  99.30%
    2   chr9  21975017  21975017   C   T          1       3593  99.90%
    

    当header=n,且n > 0时,默认不会读取n(索引)前面的行,n行会作为列名

    >>> df = pd.read_csv("example-2.csv",comment="#",header=1)
    >>> df
       chr12  25368462  25368462.1  C  T   1  2258  99.90%
    0  chr17   7579801     7579801  G  C  51  1727  99.30%
    1   chr9  21975017    21975017  C  T   1  3593  99.90%
    

    当 header=None 即不指定列名,read_csv为其自动加上列索引(从0开始)

    >>> df = pd.read_csv("example-2.csv",comment="#",header=None)
    >>> df
           0         1         2    3    4          5          6       7
    0    Chr     Start       End  Ref  Alt  Ref_Reads  Alt_Reads     Var
    1  chr12  25368462  25368462    C    T          1       2258  99.90%
    2  chr17   7579801   7579801    G    C         51       1727  99.30%
    3   chr9  21975017  21975017    C    T          1       3593  99.90%
    

    当 header=None时,可以通过names=[],指定新的列名,新的数据框列数由names长度决定

    • names长度小于原始数据列数时,则会进行截取
    • names长度大于原始数据列数时, 则会通过NaN进行填补
    >>> df = pd.read_csv("example-2.csv",comment="#",header=None,names=['A','B','C','D','E','F','G','H'])
    >>> df
           A         B         C    D    E          F          G       H
    0    Chr     Start       End  Ref  Alt  Ref_Reads  Alt_Reads     Var
    1  chr12  25368462  25368462    C    T          1       2258  99.90%
    2  chr17   7579801   7579801    G    C         51       1727  99.30%
    3   chr9  21975017  21975017    C    T          1       3593  99.90%
    >>> df = pd.read_csv("example-2.csv",comment="#",header=None,names=['A','B','C','D','E','F','G'])
    >>> df
                  A         B    C    D          E          F       G
    Chr       Start       End  Ref  Alt  Ref_Reads  Alt_Reads     Var
    chr12  25368462  25368462    C    T          1       2258  99.90%
    chr17   7579801   7579801    G    C         51       1727  99.30%
    chr9   21975017  21975017    C    T          1       3593  99.90%
    
    >>> df = pd.read_csv("example-2.csv",comment="#",header=None,names=['A','B','C','D','E','F','G','H','I'])
    >>> df
           A         B         C    D    E          F          G       H   I
    0    Chr     Start       End  Ref  Alt  Ref_Reads  Alt_Reads     Var NaN
    1  chr12  25368462  25368462    C    T          1       2258  99.90% NaN
    2  chr17   7579801   7579801    G    C         51       1727  99.30% NaN
    3   chr9  21975017  21975017    C    T          1       3593  99.90% NaN
    

    index_col设置行名,若同时设置header,如下,’Chr‘会被作为行名与列名产生了冲突,所以 header参数和index_col参数不能混用,header指定列索引,index_col指定行索引,一个DataFrame对象只有一种索引

    >>> df = pd.read_csv("example-2.csv",comment="#",index_col=0)
    >>> df
              Start       End Ref Alt  Ref_Reads  Alt_Reads     Var
    Chr                                                            
    chr12  25368462  25368462   C   T          1       2258  99.90%
    chr17   7579801   7579801   G   C         51       1727  99.30%
    chr9   21975017  21975017   C   T          1       3593  99.90%
    
    2. pandas.read_excel() 读取 xlsx格式文件, 需要安装xlrd包

    1) 函数详解:

    pandas.read_excel(io,sheet_name = 0,header = 0,names = None,index_col = None,usecols = None,        
                      squeeze = False,dtype = None, ...)
    

    主要参数介绍:

    • io :输入文件
    • sheet_name:默认为0,可以是字符串列表或整数列表(None,string,int)。可以是sheet的索引,也可以是sheet名,为None时获取所有工作表。
    • header:用法同pandas.read_csv()
    • names:用法同pandas.read_csv()
    • index_col:用法同pandas.read_csv()

    2) 参数举例说明:
    输入文件(example.xlsx)包含三个sheet,分别是A,B,C

    当 不设置 sheet_name 时 和 sheet_name = 0 以及 sheet_name = 'A' 结果一致

    >>> import pandas as pd
    >>> df = pd.read_excel('example.xlsx')    #不设置sheet_name
    >>> df
         Chr      Start        End Ref Alt Gene.refGene      VAF
    0  chr10   43600607   43600607   C   A          RET  100.00%
    1   chr7  140624440  140624440   C   T         BRAF   53.50%
    
    >>> df = pd.read_excel('example.xlsx',sheet_name=0)  #sheet_name = 0 
    >>> df
         Chr      Start        End Ref Alt Gene.refGene      VAF
    0  chr10   43600607   43600607   C   A          RET  100.00%
    1   chr7  140624440  140624440   C   T         BRAF   53.50%
    
    >>> df = pd.read_excel('example.xlsx',sheet_name='A')  #sheet_name = 'A','A'是第一个sheet
    >>> df
         Chr      Start        End Ref Alt Gene.refGene      VAF
    0  chr10   43600607   43600607   C   A          RET  100.00%
    1   chr7  140624440  140624440   C   T         BRAF   53.50%
    
    

    当sheet_name = None 时,会读取全部sheet的内容

    >>> df = pd.read_excel('example.xlsx',sheet_name=None)
    >>> df
    OrderedDict([('A',      Chr      Start        End Ref Alt Gene.refGene      VAF
    0  chr10   43600607   43600607   C   A          RET  100.00%
    1   chr7  140624440  140624440   C   T         BRAF   53.50%), ('B',      Chr      Start        End Ref Alt Gene.refGene     VAF
    0   chr2  111907691  111907691   T   C      BCL2L11  99.37%
    1  chr12   25368462   25368462   C   T         KRAS  99.74%
    2   chr5   38950776   38950776   G   A       RICTOR  99.52%), ('C',      Chr      Start        End Ref Alt Gene.refGene     VAF
    0   chr7  116435768  116435768   C   T          MET  99.49%
    1   chr7  140449150  140449150   T   C         BRAF  52.17%
    2   chr2   30143499   30143499   G   C          ALK  99.58%
    3   chr1   11288758   11288758   G   A         MTOR  99.68%
    4  chr12  133208979  133208979   T   C         POLE  46.77%)])
    
    

    sheet_name 也可以是列表,包含想要读取的sheet的索引或者名称,也可以混合使用,比如

    sheet_name = [0,1],显示的索引是0,1

    >>> df = pd.read_excel('example.xlsx',sheet_name=[0,1])
    >>> df
    OrderedDict([(0,      Chr      Start        End Ref Alt Gene.refGene      VAF
    0  chr10   43600607   43600607   C   A          RET  100.00%
    1   chr7  140624440  140624440   C   T         BRAF   53.50%), (1,      Chr      Start        End Ref Alt Gene.refGene     VAF
    0   chr2  111907691  111907691   T   C      BCL2L11  99.37%
    1  chr12   25368462   25368462   C   T         KRAS  99.74%
    2   chr5   38950776   38950776   G   A       RICTOR  99.52%)])
    

    sheet_name = ['A','C'],显示的索引是'A','C'

    >>> df = pd.read_excel('example.xlsx',sheet_name=['A','C'])
    >>> df
    OrderedDict([('A',      Chr      Start        End Ref Alt Gene.refGene      VAF
    0  chr10   43600607   43600607   C   A          RET  100.00%
    1   chr7  140624440  140624440   C   T         BRAF   53.50%), ('C',      Chr      Start        End Ref Alt Gene.refGene     VAF
    0   chr7  116435768  116435768   C   T          MET  99.49%
    1   chr7  140449150  140449150   T   C         BRAF  52.17%
    2   chr2   30143499   30143499   G   C          ALK  99.58%
    3   chr1   11288758   11288758   G   A         MTOR  99.68%
    4  chr12  133208979  133208979   T   C         POLE  46.77%)])
    

    sheet_name = [1,'C'],显示的索引是1和'C'

    >>> df = pd.read_excel('example.xlsx',sheet_name=[1,'C'])
    >>> df
    OrderedDict([(1,      Chr      Start        End Ref Alt Gene.refGene     VAF
    0   chr2  111907691  111907691   T   C      BCL2L11  99.37%
    1  chr12   25368462   25368462   C   T         KRAS  99.74%
    2   chr5   38950776   38950776   G   A       RICTOR  99.52%), ('C',      Chr      Start        End Ref Alt Gene.refGene     VAF
    0   chr7  116435768  116435768   C   T          MET  99.49%
    1   chr7  140449150  140449150   T   C         BRAF  52.17%
    2   chr2   30143499   30143499   G   C          ALK  99.58%
    3   chr1   11288758   11288758   G   A         MTOR  99.68%
    4  chr12  133208979  133208979   T   C         POLE  46.77%)])
    

    参考:
    https://www.cnblogs.com/datablog/p/6127000.html
    https://www.jianshu.com/p/d1eed925509b

    相关文章

      网友评论

          本文标题:pandas数据读入

          本文链接:https://www.haomeiwen.com/subject/sgqkmktx.html