Python 3.6 Data Analysis: Data Loading, Storage, and File Formats

Author: LeeMin_Z | Published: 2018-07-26 23:31

    1. Data loading and storage

    1.1. np.save, np.load

    In [78]: a = np.arange(10)
    
    In [79]: a
    Out[79]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    
    In [80]: np.save('some_array',a)
    
    In [83]: np.load('some_array.npy')
    Out[83]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
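
    The round trip above can also be written as a standalone script. This is a minimal sketch: the temporary directory and the file name arrays.npz are assumptions made for illustration, and np.savez does not appear in the session above; it is included only to show how several arrays can share one file.

```python
import os
import tempfile

import numpy as np

a = np.arange(10)

with tempfile.TemporaryDirectory() as tmp:
    # np.save appends ".npy" when the file name has no extension
    path = os.path.join(tmp, 'some_array')
    np.save(path, a)
    loaded = np.load(path + '.npy')

    # np.savez (not used in the session above) stores several named arrays
    archive_path = os.path.join(tmp, 'arrays.npz')
    np.savez(archive_path, first=a, second=a * 2)
    with np.load(archive_path) as archive:
        second = archive['second']

print(loaded)
print(second)
```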
    

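    1.2. For everyday work, pd.read_<Tab> (tab completion lists the available readers) and data.to_<format> cover almost everything; recent pandas versions can read nearly any format.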
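
    As a small illustration of the read_/to_ pairing, the sketch below writes a DataFrame to CSV text and reads it back. The sample frame and the in-memory buffer (used instead of a file on disk) are assumptions made for the example.

```python
import io

import pandas as pd

frame = pd.DataFrame({'a': [1, 5, 9], 'message': ['hello', 'world', 'foo']})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Rewind the buffer and read the same text back
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```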

    2. CSV and txt formats

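    1. To read a .csv file, read_csv needs no separator argument (comma is its default); read_table needs the separator specified, because its default is a tab.
    2. To view a file from the CLI, Linux users reach for cat, but Windows uses type, and the path separator is a backslash instead of the forward slash used on Linux.
    3. CSV is convenient: read it directly, then choose parameters such as header and index_col.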

    a) Example 1: a CSV file can be read with read_csv or read_table

    
    # windows system 
    # ex1, csv and text values
    
    In [3]: !type ch06\ex1.csv
    a,b,c,d,message
    1,2,3,4,hello
    5,6,7,8,world
    9,10,11,12,foo
    
    In [10]: df = pd.read_csv('ch06/ex1.csv')
    
    In [11]: df
    Out[11]:
       a   b   c   d message
    0  1   2   3   4   hello
    1  5   6   7   8   world
    2  9  10  11  12     foo
    
    In [12]: df1 = pd.read_table('ch06/ex1.csv')
    
    In [13]: df1
    Out[13]:
      a,b,c,d,message
    0   1,2,3,4,hello
    1   5,6,7,8,world
    2  9,10,11,12,foo
    
    In [14]: df1 = pd.read_table('ch06/ex1.csv',sep=',')
    
    In [15]: df1
    Out[15]:
       a   b   c   d message
    0  1   2   3   4   hello
    1  5   6   7   8   world
    2  9  10  11  12     foo
    

    b) Example 2: setting the header and index_col parameters when reading a CSV

    # ex2 csv and header,index_col
    
    In [48]: pd.read_csv('ch06/ex2.csv',header=None)
    Out[48]:
       0   1   2   3      4
    0  1   2   3   4  hello
    1  5   6   7   8  world
    2  9  10  11  12    foo
    
    In [49]: pd.read_csv('ch06/ex2.csv',names=['a','b','c','d','message'])
    Out[49]:
       a   b   c   d message
    0  1   2   3   4   hello
    1  5   6   7   8   world
    2  9  10  11  12     foo
    
    In [53]: pd.read_csv('ch06/ex2.csv',names=['a','b','c','d','message'],index_col='message')
    Out[53]:
             a   b   c   d
    message
    hello    1   2   3   4
    world    5   6   7   8
    foo      9  10  11  12
    
    # csv_mindex.csv
    
    In [57]: !type ch06\csv_mindex.csv
    key1,key2,value1,value2
    one,a,1,2
    one,b,3,4
    one,c,5,6
    one,d,7,8
    two,a,9,10
    two,b,11,12
    two,c,13,14
    two,d,15,16
    
    In [60]: parsed = pd.read_csv('ch06/csv_mindex.csv',index_col=['key1','key2'])
    
    In [61]: parsed
    Out[61]:
               value1  value2
    key1 key2
    one  a          1       2
         b          3       4
         c          5       6
         d          7       8
    two  a          9      10
         b         11      12
         c         13      14
         d         15      16
    

    c) Example 3: use the regular expression \s+ when fields are separated by a variable number of spaces

    In [62]: list(open('ch06/ex3.txt'))
    Out[62]:
    ['            A         B         C\n',
     'aaa -0.264438 -1.026059 -0.619500\n',
     'bbb  0.927272  0.302904 -0.032399\n',
     'ccc -0.264273 -0.386314 -0.217601\n',
     'ddd -0.871858 -0.348382  1.100491\n']
    
    In [63]: result = pd.read_table('ch06/ex3.txt',sep=r'\s+')
    
    In [64]: result
    Out[64]:
                A         B         C
    aaa -0.264438 -1.026059 -0.619500
    bbb  0.927272  0.302904 -0.032399
    ccc -0.264273 -0.386314 -0.217601
    ddd -0.871858 -0.348382  1.100491
    

    d) Example 4: skip malformed rows and handle missing values

    In [65]: !type ch06\ex4.csv
    # hey!
    a,b,c,d,message
    # just wanted to make things more difficult for you
    # who reads CSV files with computers, anyway?
    1,2,3,4,hello
    5,6,7,8,world
    9,10,11,12,foo
    
    In [66]: pd.read_csv('ch06/ex4.csv',skiprows=[0,2,3])
    Out[66]:
       a   b   c   d message
    0  1   2   3   4   hello
    1  5   6   7   8   world
    2  9  10  11  12     foo
    
    In [67]: !type ch06\ex5.csv
    something,a,b,c,d,message
    one,1,2,3,4,NA
    two,5,6,,8,world
    three,9,10,11,12,foo
    
    In [68]: pd.read_csv('ch06/ex5.csv',na_values='Null')
    Out[68]:
      something  a   b     c   d message
    0       one  1   2   3.0   4     NaN
    1       two  5   6   NaN   8   world
    2     three  9  10  11.0  12     foo
    
    In [69]: setNAvaluse = {'message':['foo','NA'],'something':['two']}
    
    In [70]: pd.read_csv('ch06/ex5.csv',na_values=setNAvaluse)
    Out[70]:
      something  a   b     c   d message
    0       one  1   2   3.0   4     NaN
    1       NaN  5   6   NaN   8   world
    2     three  9  10  11.0  12     NaN
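
    The missing-value handling above can be reproduced without the ch06 files. The sketch below inlines the contents of ex5.csv as a string (an assumption made so the example is self-contained) and compares three settings. keep_default_na is a read_csv parameter that does not appear in the session above.

```python
import io

import pandas as pd

csv_text = """something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
"""

# Default behaviour: "NA" and empty fields are already treated as missing
default = pd.read_csv(io.StringIO(csv_text))

# Per-column sentinels, like the dict passed to na_values above
per_column = pd.read_csv(
    io.StringIO(csv_text),
    na_values={'message': ['foo', 'NA'], 'something': ['two']},
)

# keep_default_na=False switches off the built-in sentinels,
# so only the values listed in na_values count as missing
strict = pd.read_csv(
    io.StringIO(csv_text), na_values=['foo'], keep_default_na=False
)

print(default.isnull().sum().sum())
print(per_column.isnull().sum().sum())
print(strict['message'].tolist())
```

    Because na_values adds to the default sentinels instead of replacing them, a value such as 'Null' that never occurs in the file leaves the result unchanged.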
    

    JSON format

    Use the json package and simply load the data. See the free online py4e (Python for Everybody) textbook.
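
    A minimal sketch of loading JSON with the json package and handing the result to pandas. The sample records are made up for illustration and do not come from the book or from py4e.

```python
import json

import pandas as pd

# Hypothetical sample data, invented for this example
text = '[{"city": "A", "temp": 21.5}, {"city": "B", "temp": 18.0}]'

records = json.loads(text)      # JSON text -> list of dicts
frame = pd.DataFrame(records)   # list of dicts -> DataFrame
back = json.dumps(records)      # Python objects -> JSON text

print(frame)
```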

    XML tree

    Python 3.6 ships with ElementTree (xml.etree.ElementTree); read the data out and process it as usual. Same reference as above.
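
    A minimal sketch of reading values out of an XML tree with the standard library. The document structure and the tag names (stations, station, name, riders) are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, invented for this example
xml_text = """<stations>
  <station id="1"><name>North</name><riders>120</riders></station>
  <station id="2"><name>South</name><riders>95</riders></station>
</stations>"""

root = ET.fromstring(xml_text)

# Collect one dict per <station> element
rows = []
for station in root.findall('station'):
    rows.append({
        'id': int(station.get('id')),
        'name': station.find('name').text,
        'riders': int(station.find('riders').text),
    })

print(rows)
```

    A list of dicts like rows can be passed straight to pd.DataFrame for the usual processing.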

    Binary

    See the official documentation:

    7.1. struct — Interpret bytes as packed binary data
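
    A minimal sketch of what the struct module does: pack Python values into bytes and unpack them again. The record layout (one int, one float, a 3-byte tag) is an assumption chosen for illustration.

```python
import struct

# '<' = little-endian without padding
# 'i' = 4-byte int, 'f' = 4-byte float, '3s' = 3 raw bytes
record_format = '<if3s'

packed = struct.pack(record_format, 7, 2.5, b'foo')
number, value, tag = struct.unpack(record_format, packed)

print(struct.calcsize(record_format))
print(number, value, tag)
```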

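    HDF5 files

    HDF5 (Hierarchical Data Format) is a binary format for storing large amounts of array data. Despite the similar name, it is unrelated to Hadoop's HDFS. I will come back to this part once I start working with big data.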

    In [39]: store = pd.HDFStore('mydata.h5')
    
    In [41]: frame
    Out[41]:
       a   b   c   d message
    0  1   2   3   4   hello
    1  5   6   7   8   world
    2  9  10  11  12     foo
    
    In [42]: store['obj1'] = frame
    
    In [43]: store
    Out[43]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: mydata.h5
    /obj1            frame        (shape->[3,5])
    
    In [44]: store['obj1_col'] = frame['a']
    
    In [45]: store
    Out[45]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: mydata.h5
    /obj1                frame        (shape->[3,5])
    /obj1_col            series       (shape->[3])
    

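    Excel

    There is no need to call a separate Excel library by hand as the book does; pandas now reads the file directly with pd.read_excel('ch06/test.xls'). An Excel engine such as xlrd still has to be installed for pandas to use.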

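    Working with HTML and Web APIs

    To fetch data from web pages I have so far only used urllib and socket.
    See the py4e site: Networked programs

    The requests library appears to be a higher-level option; to do.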

    Databases

    Simple SQL can be run with the built-in sqlite3 module.
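
    A minimal sketch of using sqlite3 together with pandas. The in-memory database, the table name test, and the sample rows are assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# An in-memory database avoids creating a file on disk
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a TEXT, b REAL, c INTEGER)')

rows = [('Atlanta', 1.25, 6), ('Tallahassee', 2.6, 3), ('Sacramento', 1.7, 5)]
con.executemany('INSERT INTO test VALUES (?, ?, ?)', rows)
con.commit()

# pd.read_sql runs the query and returns the result as a DataFrame
frame = pd.read_sql('SELECT * FROM test WHERE c > 4', con)
con.close()

print(frame)
```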

    MongoDB

    This is a NoSQL database. I have not installed it yet and will do so later, together with Hadoop.


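    2018.7.2x: Big-data file formats will be covered once I get started with them. I have been won over to the requests library for working with web pages.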
