Notes | Pandas Basics for Data Analysis: Loading and Storing Data Files (Part 1)

Author: loannes | Published 2019-04-30 13:34

    Reading and Writing Data in Text Format

    Print the file contents with cat:

    In [20]: cat ex1.csv
    a,b,c,d,message
    1,2,3,4,hello
    5,6,7,8,world
    9,10,11,12,foo
    

    Since the file is in CSV format, read_csv can read it and return a DataFrame:

    In [23]: df = pd.read_csv('ex1.csv')
    
    In [24]: df
    Out[24]:
       a   b   c   d message
    0  1   2   3   4   hello
    1  5   6   7   8   world
    2  9  10  11  12     foo
    

    To read a file that has no header row:

    In [26]: cat ex2.csv
    1,2,3,4,hello
    5,6,7,8,world
    9,10,11,12,foo
    
    In [27]: pd.read_csv('ex2.csv', header=None)
    Out[27]:
       0   1   2   3      4
    0  1   2   3   4  hello
    1  5   6   7   8  world
    2  9  10  11  12    foo
    

    Or you can specify the column names yourself:

    In [28]: pd.read_csv('ex2.csv', names=['a','b','c','d','message'])
    Out[28]:
       a   b   c   d message
    0  1   2   3   4   hello
    1  5   6   7   8   world
    2  9  10  11  12     foo
    
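    If you also want one of those columns, say message, to serve as the row index, a minimal sketch (reusing the same ex2.csv and column names as above) can combine names with index_col:

    import pandas as pd

    names = ['a', 'b', 'c', 'd', 'message']
    # Use the message column as the DataFrame's index instead of the default integer index.
    df = pd.read_csv('ex2.csv', names=names, index_col='message')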

    You can also turn several columns into a hierarchical index:

    In [29]: cat csv_mindex.csv
    key1,key2,value1,value2
    one,a,1,2
    one,b,3,4
    one,c,5,6
    one,d,7,8
    two,e,9,10
    
    In [30]: parsed = pd.read_csv('csv_mindex.csv', index_col=['key1','key2'])
    
    In [31]: parsed
    Out[31]:
               value1  value2
    key1 key2
    one  a          1       2
         b          3       4
         c          5       6
         d          7       8
    two  e          9      10
    

    Some tables have messy, irregular delimiters, like this one:

    In [32]: cat ex3.csv
                A              B                C
    Aaa -0.264438   -1.026059           -0.619500
    bbb 0.927272    0.302904            -0.032399
    Ccc -0.265273   -0.386314           -0.217601
    

    You can pass a regular expression as the separator when reading to handle such irregular delimiters:

    In [33]: result = pd.read_csv('ex3.csv', sep='\s+')
    
    In [34]: result
    Out[34]:
                A         B         C
    Aaa -0.264438 -1.026059 -0.619500
    bbb  0.927272  0.302904 -0.032399
    Ccc -0.265273 -0.386314 -0.217601
    

    How missing values are found and marked as NaN when reading a file:

    In [59]: cat ex5.csv
    Something,a,b,c,d,message
    0,one,1,2,3,4,NA
    1,two,5,6,-1.#IND,8,world
    2,three,7,8,9,10,NULL
    
    In [57]: result = pd.read_csv('ex5.csv')
    
    In [58]: pd.isnull(result)
    Out[58]:
       Something      a      b      c      d  message
    0      False  False  False  False  False     True
    1      False  False  False   True  False    False
    2      False  False  False  False  False     True
    

    By default, pandas treats NA, -1.#IND, and NULL as missing values.
    We can also define our own:

    In [63]: cat ex5.csv
    Something,a,b,c,d,message
    0,one,1,2,3,4,NA
    1,two,5,6,-1.#IND,8,world
    2,three,7,8,9,10,NULL
    4,four,11,12,13,14,空
    In [64]: result = pd.read_csv('ex5.csv')
    
    In [65]: result
    Out[65]:
      Something   a   b     c   d message
    0       one   1   2   3.0   4     NaN
    1       two   5   6   NaN   8   world
    2     three   7   8   9.0  10     NaN
    4      four  11  12  13.0  14       空  # a row with the string '空' was added to the file here
    

    na_values accepts a list of strings that should be treated as missing values:

    In [61]: result = pd.read_csv('ex5.csv', na_values=['空'])
    
    In [62]: result
    Out[62]:
      Something   a   b     c   d message
    0       one   1   2   3.0   4     NaN
    1       two   5   6   NaN   8   world
    2     three   7   8   9.0  10     NaN
    4      four  11  12  13.0  14     NaN
    
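    na_values can also take a dict, so that different sentinel strings count as missing in different columns. A minimal sketch, assuming the same ex5.csv as above (the sentinel choices here are only for illustration):

    import pandas as pd

    # Treat 'world' and '空' as missing only in the message column,
    # and 'two' as missing only in the Something column.
    sentinels = {'message': ['world', '空'], 'Something': ['two']}
    result = pd.read_csv('ex5.csv', na_values=sentinels)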

    Reading Text Files in Pieces

    When working with a large file, if you only want to read a small portion rather than the whole thing, use nrows to read a set number of rows:

    In [69]: pd.read_csv('ex6.csv',nrows=5)
    Out[69]:
            One       two     three      four key
    0  0.467758 -0.758257 -1.824459 -1.824726   L
    1 -0.359277 -1.995273  1.252799 -0.853329   B
    2 -0.509925  1.259425  0.559245  0.992874   G
    3 -0.509925  1.259425  0.559245  0.992874   R
    4 -0.509925  1.259425  0.559245  0.992874   Q
    

    To read the file in chunks of rows instead, set chunksize:

    In [70]: chunker = pd.read_csv('ex6.csv',chunksize=5)
    
    In [71]: chunker
    Out[71]: <pandas.io.parsers.TextFileReader at 0x117ef4eb8>
    

    Because read_csv now returns a TextFileReader object, you have to iterate over it to get at the data:

    In [73]: for c in chunker:
        ...:     print(c)
        ...:
            One       two     three      four key
    0  0.467758 -0.758257 -1.824459 -1.824726   L
    1 -0.359277 -1.995273  1.252799 -0.853329   B
    2 -0.509925  1.259425  0.559245  0.992874   G
    3 -0.509925  1.259425  0.559245  0.992874   R
    4 -0.509925  1.259425  0.559245  0.992874   Q
            One       two     three      four key
    5 -0.509925  1.259425  0.559245  0.992874   D
    6 -0.509925  1.259425  0.559245  0.992874   A
    7 -0.509925  1.259425  0.559245  0.992874   B
    8 -0.509925  1.259425  0.559245  0.992874   E
    9 -0.509925  1.259425  0.559245  0.992874   C
             One       two     three      four key
    10 -0.509925  1.259425  0.559245  0.992874   F
    11 -0.509925  1.259425  0.559245  0.992874   M
    12 -0.509925  1.259425  0.559245  0.992874   V
    
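    Chunked reading is mainly useful for streaming aggregation, where each piece is processed and discarded instead of keeping the whole file in memory. A minimal sketch, assuming the same ex6.csv and its key column, that tallies key values across all chunks:

    import pandas as pd

    chunker = pd.read_csv('ex6.csv', chunksize=1000)

    # Accumulate value counts chunk by chunk; only one chunk is in memory at a time.
    tot = pd.Series(dtype=float)
    for piece in chunker:
        tot = tot.add(piece['key'].value_counts(), fill_value=0)

    tot = tot.sort_values(ascending=False)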

    Writing Data to Text Format

    Data can likewise be written out as delimited text and stored in a file.
    DataFrame's to_csv method does this (here data is the DataFrame loaded from ex5.csv above):

    In [76]: data.to_csv('ex5out.csv')
    
    In [77]: cat ex5out.csv
    ,Something,a,b,c,d,message
    0,one,1,2,3.0,4,
    1,two,5,6,,8,world
    2,three,7,8,9.0,10,
    4,four,11,12,13.0,14,空
    

    Of course, other delimiters can be used as well:

    In [79]: data.to_csv('ex5out_02.csv',sep='|')
    
    In [80]: cat ex5out_02.csv
    |Something|a|b|c|d|message
    0|one|1|2|3.0|4|
    1|two|5|6||8|world
    2|three|7|8|9.0|10|
    4|four|11|12|13.0|14|空
    
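    to_csv also controls how missing values are written and which columns appear. A hedged sketch of a few common options, reusing the data frame from above:

    import sys

    # Write to stdout instead of a file, represent missing values as 'NULL',
    # drop the row index, and keep only a subset of columns.
    data.to_csv(sys.stdout, sep='|', na_rep='NULL', index=False, columns=['a', 'b', 'c'])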

    Handling Delimited Formats Manually

    Occasionally you will run into text that read_csv cannot parse, such as this:

    data = pd.read_csv('ex7.csv')
    ---------------------------------------------------------------------------
    ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
    
    In [83]: cat ex7.csv
    "a","b","c"
    "1","2","3"
    "1","2","3","4"
    

    The malformed rows make automatic parsing fail, so the file has to be handled by hand. Python's built-in csv.reader takes care of it:

    In [84]: import csv

    In [85]: f = open('ex7.csv')

    In [86]: reader = csv.reader(f)
    
    In [87]: for line in reader:
        ...:     print(line)
        ...:
    ['a', 'b', 'c']
    ['1', '2', '3']
    ['1', '2', '3', '4']
    

    With the rows loaded through the reader, the ragged data can now be cleaned up:

    In [89]: lines = list(csv.reader(open('ex7.csv')))
        ...: header, values = lines[0],lines[1:]
        ...: data_dict = {h: v for h, v in zip(header, zip(*values))}
        ...:
    
    In [90]: data_dict
    Out[90]: {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
    

    Zipping the header with zip(*values) puts the data into the dict data_dict in the correct shape and silently drops the extra trailing field from the malformed row.
    The dict can then be converted into a DataFrame:

    In [92]: frame = DataFrame(data_dict)
    
    In [93]: frame
    Out[93]:
       a  b  c
    0  1  2  3
    1  1  2  3
    
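    The csv module also handles the writing side. A minimal sketch that writes the header plus only the well-formed rows back out with csv.writer, reusing header and values from the cell above (the output filename ex7_fixed.csv is made up for illustration):

    import csv

    with open('ex7_fixed.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in values:
            if len(row) == len(header):   # skip rows with a mismatched number of fields
                writer.writerow(row)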

    JSON Data

    In [101]: import json

    In [102]: obj="""
         ...: {"name":"Wes",
         ...: "places_lived":["United States", "Spain","Germany"],
         ...: "pet":null,
         ...: "siblings":[{"name":"Scott","age":25,"pet":"Zuko"},
         ...:             {"name":"Katie","age":33,"pet":"Cisco"}]
         ...:      }
         ...:      """
    
    In [103]: result = json.loads(obj)
    
    In [104]: result
    Out[104]:
    {'name': 'Wes',
     'places_lived': ['United States', 'Spain', 'Germany'],
     'pet': None,
     'siblings': [{'name': 'Scott', 'age': 25, 'pet': 'Zuko'},
      {'name': 'Katie', 'age': 33, 'pet': 'Cisco'}]}
    

    Converting the JSON object to a DataFrame:

    In [105]: siblings = DataFrame(result['siblings'], columns=['name','age'])
    
    In [106]: siblings
    Out[106]:
        name  age
    0  Scott   25
    1  Katie   33
    

    The book notes that, at the time it was written, pandas had not yet fully developed efficient JSON import and export; this needs further checking.
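    For what it is worth, current pandas releases do ship pandas.read_json and DataFrame.to_json. A minimal sketch, using an in-memory string rather than a file:

    from io import StringIO
    import pandas as pd

    # A list of JSON records maps naturally onto the rows of a DataFrame.
    raw = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'
    frame = pd.read_json(StringIO(raw))

    # to_json goes the other way; orient='records' reproduces the list-of-dicts layout.
    json_str = frame.to_json(orient='records')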

    XML and HTML: Web Scraping

    pandas's read_html reads HTML and automatically uses lxml and Beautiful Soup to parse the data into DataFrames, so those libraries must be installed first:

    pip3 install lxml
    pip3 install beautifulsoup4 html5lib
    

    read_html tries to parse the tabular data inside <table> tags:

    In [17]: tables = pd.read_html('examples/fdic_failed_bank_list.html')
    
    In [18]: failures = tables[0]
    
    In [19]: failures.head()
    Out[19]:
                          Bank Name  ...       Updated Date
    0                   Allied Bank  ...  November 17, 2016
    1  The Woodbury Banking Company  ...  November 17, 2016
    2        First CornerStone Bank  ...  September 6, 2016
    3            Trust Company Bank  ...  September 6, 2016
    4    North Milwaukee State Bank  ...      June 16, 2016
    

    Next the data can be cleaned and analyzed; for example, let's count how many banks failed in each year:

    In [20]: close_stamps = pd.to_datetime(failures['Closing Date']) # convert every value in failures['Closing Date'] to a datetime, giving a datetime Series
    
    # Each record holds a bank's closing date, so there are as many 'Closing Date' values as failed banks.
    # The line below tallies the number of failures per year:
    In [21]: close_stamps.dt.year.value_counts()
    Out[21]:
    2010    157
    2009    140
    2011     92
    2012     51
    2008     25
    2013     24
    2014     18
    2002     11
    2015      8
    2016      5
    2004      4
    2001      4
    2007      3
    2003      3
    2000      2
    Name: Closing Date, dtype: int64
    

    Parsing XML with lxml.objectify

    Parsing XML is a very common data analysis task, and you will run into XML files like this often.
    Here we use objectify from the lxml library:

    In [24]: path = 'Performance_MNR.xml'
    
    In [25]: from lxml import objectify
    
    In [26]: parsed = objectify.parse(open(path))
    
    In [27]: root = parsed.getroot()
    
    In [28]: skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ','DESIRED_CHANGE','DECIMAL_PLACES']
    
    In [29]: data = []
        ...: for elt in root.INDICATOR:
        ...:     el_data = {}
        ...:     for child in elt.getchildren():
        ...:         if child.tag in skip_fields:
        ...:             continue
        ...:         el_data[child.tag] = child.pyval
        ...:     data.append(el_data)
        ...:
        ...:
    
    In [30]: perf = DataFrame(data)
    
    In [31]: perf
    Out[31]:
                  AGENCY_NAME            CATEGORY  ... YTD_ACTUAL YTD_TARGET
    0    Metro-North Railroad  Service Indicators  ...       96.9         95
    1    Metro-North Railroad  Service Indicators  ...         96         95
    2    Metro-North Railroad  Service Indicators  ...       96.3         95
    3    Metro-North Railroad  Service Indicators  ...       96.8         95
    4    Metro-North Railroad  Service Indicators  ...       96.6         95
    5    Metro-North Railroad  Service Indicators  ...       96.2         95
    6    Metro-North Railroad  Service Indicators  ...       96.2         95
    7    Metro-North Railroad  Service Indicators  ...       96.2         95
    8    Metro-North Railroad  Service Indicators  ...       95.9         95
    9    Metro-North Railroad  Service Indicators  ...         96         95
    10   Metro-North Railroad  Service Indicators  ...       96.1         95
    11   Metro-North Railroad  Service Indicators  ...         96         95
    12   Metro-North Railroad  Service Indicators  ...       92.6       96.2
    13   Metro-North Railroad  Service Indicators  ...       94.6       96.2
    14   Metro-North Railroad  Service Indicators  ...       95.4       96.2
    15   Metro-North Railroad  Service Indicators  ...       95.9       96.2
    16   Metro-North Railroad  Service Indicators  ...       96.2       96.2
    17   Metro-North Railroad  Service Indicators  ...       96.4       96.2
    18   Metro-North Railroad  Service Indicators  ...       96.5       96.2
    19   Metro-North Railroad  Service Indicators  ...       96.4       96.2
    20   Metro-North Railroad  Service Indicators  ...       96.3       96.2
    21   Metro-North Railroad  Service Indicators  ...       96.2       96.2
    22   Metro-North Railroad  Service Indicators  ...       96.1       96.2
    23   Metro-North Railroad  Service Indicators  ...         96       96.2
    24   Metro-North Railroad  Service Indicators  ...         98       96.3
    25   Metro-North Railroad  Service Indicators  ...       95.6       96.3
    26   Metro-North Railroad  Service Indicators  ...       96.1       96.3
    27   Metro-North Railroad  Service Indicators  ...       96.6       96.3
    28   Metro-North Railroad  Service Indicators  ...       96.8       96.3
    29   Metro-North Railroad  Service Indicators  ...       96.9       96.3
    ..                    ...                 ...  ...        ...        ...
    618  Metro-North Railroad  Service Indicators  ...      95.14
    619  Metro-North Railroad  Service Indicators  ...      95.38
    620  Metro-North Railroad  Service Indicators  ...       95.7
    621  Metro-North Railroad  Service Indicators  ...         96
    622  Metro-North Railroad  Service Indicators  ...      96.21
    623  Metro-North Railroad  Service Indicators  ...       96.5
    624  Metro-North Railroad  Service Indicators  ...      97.95         97
    625  Metro-North Railroad  Service Indicators  ...      98.92         97
    626  Metro-North Railroad  Service Indicators  ...      99.29         97
    627  Metro-North Railroad  Service Indicators  ...      99.47         97
    628  Metro-North Railroad  Service Indicators  ...      99.58         97
    629  Metro-North Railroad  Service Indicators  ...      98.19         97
    630  Metro-North Railroad  Service Indicators  ...      98.46         97
    631  Metro-North Railroad  Service Indicators  ...      98.69         97
    632  Metro-North Railroad  Service Indicators  ...       98.3         97
    633  Metro-North Railroad  Service Indicators  ...      97.55         97
    634  Metro-North Railroad  Service Indicators  ...      97.47         97
    635  Metro-North Railroad  Service Indicators  ...      96.84         97
    636  Metro-North Railroad  Service Indicators  ...        100         97
    637  Metro-North Railroad  Service Indicators  ...        100         97
    638  Metro-North Railroad  Service Indicators  ...      98.86         97
    639  Metro-North Railroad  Service Indicators  ...      98.76         97
    640  Metro-North Railroad  Service Indicators  ...      90.91         97
    641  Metro-North Railroad  Service Indicators  ...                    97
    642  Metro-North Railroad  Service Indicators  ...                    97
    643  Metro-North Railroad  Service Indicators  ...                    97
    644  Metro-North Railroad  Service Indicators  ...                    97
    645  Metro-North Railroad  Service Indicators  ...                    97
    646  Metro-North Railroad  Service Indicators  ...                    97
    647  Metro-North Railroad  Service Indicators  ...                    97
    
    [648 rows x 12 columns]
    
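    Some of the parsed fields mix numbers with empty strings (note the blank YTD cells above). A hedged sketch of coercing them into proper numeric columns, with anything unparseable becoming NaN:

    import pandas as pd

    # errors='coerce' turns empty strings and other non-numeric values into NaN.
    for col in ['YTD_ACTUAL', 'YTD_TARGET']:
        perf[col] = pd.to_numeric(perf[col], errors='coerce')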

    We can also parse a snippet of HTML markup from a string via StringIO.
    Here is a simple example:

    In [33]: from io import StringIO
    
    In [36]: tag = '<a href="http://www.googlge.com">Google</a>'
    
    In [37]: root = objectify.parse(StringIO(tag)).getroot()
    
    In [38]: root.get('href')
    Out[38]: 'http://www.googlge.com'
    
    In [39]: root.text
    Out[39]: 'Google'
    
