DataFrame

Author: 庵下桃花仙 | Published: 2019-01-31 14:22

    A DataFrame represents a rectangular table of data and has both a row index and a column index.

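    Construction (the sessions below assume import numpy as np and import pandas as pd). One common way is to pass a dict of equal-length lists: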

    
    In [43]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        ...:         'year' : [2000, 2001, 2002, 2001, 2001, 2003],
        ...:         'pop'  : [1.5, 1.7,  3.6, 2.4, 2.9, 3.2]}
    
    In [44]: frame = pd.DataFrame(data)
    
    In [45]: frame
    Out[45]:
        state  year  pop
    0    Ohio  2000  1.5
    1    Ohio  2001  1.7
    2    Ohio  2002  3.6
    3  Nevada  2001  2.4
    4  Nevada  2001  2.9
    5  Nevada  2003  3.2
    

    For a large DataFrame, the head method selects only the first five rows.

    In [46]: frame.head()
    Out[46]:
        state  year  pop
    0    Ohio  2000  1.5
    1    Ohio  2001  1.7
    2    Ohio  2002  3.6
    3  Nevada  2001  2.4
    4  Nevada  2001  2.9
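
    As a small sketch of the same idea, head also accepts a row count, and tail does the same from the end of the frame. The variable names first_three and last_two are only illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2001, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

# head(n) returns the first n rows; with no argument it returns five.
first_three = frame.head(3)

# tail(n) returns the last n rows.
last_two = frame.tail(2)

print(first_three)
print(last_two)
```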
    

    Passing columns arranges the DataFrame's columns in the given order.

    In [47]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
    Out[47]:
       year   state  pop
    0  2000    Ohio  1.5
    1  2001    Ohio  1.7
    2  2002    Ohio  3.6
    3  2001  Nevada  2.4
    4  2001  Nevada  2.9
    5  2003  Nevada  3.2
    

    If you pass a column that is not in the dict, it appears in the result filled with missing values.

    In [49]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
        ...:                                  index=['one', 'two', 'three', 'four', 'five', 'six'])
    
    In [50]: frame2
    Out[50]:
           year   state  pop debt
    one    2000    Ohio  1.5  NaN
    two    2001    Ohio  1.7  NaN
    three  2002    Ohio  3.6  NaN
    four   2001  Nevada  2.4  NaN
    five   2001  Nevada  2.9  NaN
    six    2003  Nevada  3.2  NaN
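
    A short script version of this behavior, assuming the same data dict as above, checks that the extra column exists and holds only missing values:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2001, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

# 'debt' is not a key of data, so the column is created and filled with NaN.
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])

print(list(frame2.columns))
print(frame2['debt'].isna().all())
```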
    

    A column can be retrieved as a Series either by dict-like notation or by attribute.

    In [51]: frame2['state']
    Out[51]:
    one        Ohio
    two        Ohio
    three      Ohio
    four     Nevada
    five     Nevada
    six      Nevada
    Name: state, dtype: object
    
    In [52]: frame2.year
    Out[52]:
    one      2000
    two      2001
    three    2002
    four     2001
    five     2001
    six      2003
    Name: year, dtype: int64
    

    Rows can also be selected by label with the special loc attribute (or by integer position with iloc).

    In [53]: frame2.loc['three']
    Out[53]:
    year     2002
    state    Ohio
    pop       3.6
    debt      NaN
    Name: three, dtype: object
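
    The difference between label-based and position-based row selection can be sketched as follows. The frame here is a reduced version of frame2, and the names by_label and by_position are illustrative only.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {'year': [2000, 2001, 2002], 'state': ['Ohio', 'Ohio', 'Ohio'],
     'pop': [1.5, 1.7, 3.6]},
    index=['one', 'two', 'three'])

# loc selects by index label.
by_label = frame2.loc['three']

# iloc selects by integer position (0-based), so position 2 is the same row.
by_position = frame2.iloc[2]

print(by_label)
print(by_label.equals(by_position))
```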
    

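    Columns can be modified by assignment, for example with a scalar or an array of values. When assigning a list or array, its length must match the length of the DataFrame.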

    In [54]: frame2['debt'] = 16.5
    
    In [55]: frame2
    Out[55]:
           year   state  pop  debt
    one    2000    Ohio  1.5  16.5
    two    2001    Ohio  1.7  16.5
    three  2002    Ohio  3.6  16.5
    four   2001  Nevada  2.4  16.5
    five   2001  Nevada  2.9  16.5
    six    2003  Nevada  3.2  16.5
    
    In [56]: frame2['debt'] = np.arange(6.)
    
    In [57]: frame2
    Out[57]:
           year   state  pop  debt
    one    2000    Ohio  1.5   0.0
    two    2001    Ohio  1.7   1.0
    three  2002    Ohio  3.6   2.0
    four   2001  Nevada  2.4   3.0
    five   2001  Nevada  2.9   4.0
    six    2003  Nevada  3.2   5.0
    

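    When a Series is assigned to a column, its labels are aligned to the DataFrame's index, and missing values are inserted wherever a label is absent.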

    In [58]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
    
    In [59]: frame2['debt'] = val
    
    In [60]: frame2
    Out[60]:
           year   state  pop  debt
    one    2000    Ohio  1.5   NaN
    two    2001    Ohio  1.7  -1.2
    three  2002    Ohio  3.6   NaN
    four   2001  Nevada  2.4  -1.5
    five   2001  Nevada  2.9  -1.7
    six    2003  Nevada  3.2   NaN
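
    A minimal sketch of this alignment, using a smaller frame than the one above: labels present in the Series but not in the frame are dropped, and frame labels missing from the Series get NaN.

```python
import pandas as pd

frame2 = pd.DataFrame({'year': [2000, 2001, 2002]},
                      index=['one', 'two', 'three'])

# 'two' matches a row label; 'four' does not exist in the frame.
val = pd.Series([-1.2, -1.5], index=['two', 'four'])
frame2['debt'] = val

print(frame2)
```

    The value labeled 'four' is silently discarded here, which is worth keeping in mind when the Series and the frame come from different sources.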
    

    Assigning to a column that does not exist creates a new column; the del keyword deletes a column.

    In [61]: frame2['eastern'] = frame2.state == 'Ohio'
    
    In [62]: frame2
    Out[62]:
           year   state  pop  debt  eastern
    one    2000    Ohio  1.5   NaN     True
    two    2001    Ohio  1.7  -1.2     True
    three  2002    Ohio  3.6   NaN     True
    four   2001  Nevada  2.4  -1.5    False
    five   2001  Nevada  2.9  -1.7    False
    six    2003  Nevada  3.2   NaN    False
    
    In [63]: del frame2['eastern']
    
    In [64]: frame2.columns
    Out[64]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
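
    For comparison, here is a hedged sketch of del next to the drop method, which is not used in the session above. drop returns a new DataFrame and leaves the original untouched, while del removes the column in place. The names trimmed and still_there are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
frame['eastern'] = frame['state'] == 'Ohio'

# drop returns a new frame without the column; frame itself is unchanged.
trimmed = frame.drop(columns='eastern')
still_there = 'eastern' in frame.columns

# del removes the column from frame in place.
del frame['eastern']

print(list(trimmed.columns))
print(list(frame.columns))
```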
    

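    The column returned by indexing a DataFrame is a view on the underlying data rather than a copy, so in-place modifications to that Series can be reflected in the DataFrame. To work on an independent copy, call the Series's copy method explicitly.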
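
    A minimal sketch of taking an explicit copy: changes made to the copy never reach the original frame. The frame and the name pop_copy are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# Take an explicit copy of the column so later edits cannot reach the frame.
pop_copy = frame['pop'].copy()
pop_copy.iloc[0] = 99.0

print(pop_copy.tolist())       # [99.0, 2.4]
print(frame['pop'].tolist())   # [1.5, 2.4]
```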

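    Another common form of data is a nested dict of dicts: the outer keys become the columns and the inner keys become the row index.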

    In [65]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
        ...:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
    
    In [66]: frame3 = pd.DataFrame(pop)
    
    In [67]: frame3
    Out[67]:
          Nevada  Ohio
    2000     NaN   1.5
    2001     2.4   1.7
    2002     2.9   3.6
    

    Transposing the DataFrame swaps its rows and columns.

    In [68]: frame3.T
    Out[68]:
            2000  2001  2002
    Nevada   NaN   2.4   2.9
    Ohio     1.5   1.7   3.6
    

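    If an index is specified explicitly, the keys of the inner dicts are not combined (and, in older pandas versions, sorted) to form the index; the given index is used instead.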

    In [69]: pd.DataFrame(pop, index=[2001, 2002, 2003])
    Out[69]:
          Nevada  Ohio
    2001     2.4   1.7
    2002     2.9   3.6
    2003     NaN   NaN
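
    The two cases can be compared in a short script. This sketch only checks which labels end up in the index, since the order of the combined inner keys may differ between pandas versions; the names default and explicit are illustrative.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Without index=, the row index is the union of all inner dict keys.
default = pd.DataFrame(pop)

# With index=, exactly these labels are used; 2000 is left out
# and 2003, which appears in no inner dict, is all NaN.
explicit = pd.DataFrame(pop, index=[2001, 2002, 2003])

print(default)
print(explicit)
```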
    

    A dict of Series can also be used to construct a DataFrame.

    In [70]: pdata = {'Ohio': frame3['Ohio'][: -1],
        ...:          'Nevada': frame3['Nevada'][: 2]}
    
    In [71]: pd.DataFrame(pdata)
    Out[71]:
          Ohio  Nevada
    2000   1.5     NaN
    2001   1.7     2.4
    

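    The index and the columns each have a name attribute, which is displayed when set. The values attribute returns the data as a two-dimensional ndarray.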

    In [72]: frame3.index.name = 'year'
    
    In [73]: frame3.columns.name = 'state'
    
    In [74]: frame3
    Out[74]:
    state  Nevada  Ohio
    year
    2000      NaN   1.5
    2001      2.4   1.7
    2002      2.9   3.6
    
    In [75]: frame3.values
    Out[75]:
    array([[nan, 1.5],
           [2.4, 1.7],
           [2.9, 3.6]])
    

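    If the columns have different dtypes, the dtype of the values array is chosen to accommodate all of the columns (here, object).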

    In [77]: frame2.values
    Out[77]:
    array([[2000, 'Ohio', 1.5, nan],
           [2001, 'Ohio', 1.7, -1.2],
           [2002, 'Ohio', 3.6, nan],
           [2001, 'Nevada', 2.4, -1.5],
           [2001, 'Nevada', 2.9, -1.7],
           [2003, 'Nevada', 3.2, nan]], dtype=object)
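
    The dtype selection can be sketched with two small frames. This example uses the to_numpy method, which is an assumption on top of the session above (it is available in pandas 0.24 and later) and is expected to behave like values here.

```python
import pandas as pd

# All columns are floats, so the array keeps a float dtype.
numeric = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

# Integers and strings have no common numeric dtype, so the array is object.
mixed = pd.DataFrame({'year': [2000, 2001], 'state': ['Ohio', 'Nevada']})

print(numeric.to_numpy().dtype)
print(mixed.to_numpy().dtype)
```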
    

    Index Objects

    Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index object.

    In [78]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
    
    In [79]: index = obj.index
    
    In [80]: index
    Out[80]: Index(['a', 'b', 'c'], dtype='object')
    
    In [81]: index[1:]
    Out[81]: Index(['b', 'c'], dtype='object')
    
    In [82]: index[1] = 'd'
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-82-a452e55ce13b> in <module>
    ----> 1 index[1] = 'd'
    
    c:\users\a\appdata\local\programs\python\python36\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
       3881
       3882     def __setitem__(self, key, value):
    -> 3883         raise TypeError("Index does not support mutable operations")
       3884
       3885     def __getitem__(self, key):
    
    TypeError: Index does not support mutable operations
    
    In [83]: labels = pd.Index(np.arange(3))
    
    In [84]: labels
    Out[84]: Int64Index([0, 1, 2], dtype='int64')
    
    In [85]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)
    
    In [86]: obj2
    Out[86]:
    0    1.5
    1   -2.5
    2    0.0
    dtype: float64
    
    In [87]: obj2.index is labels
    Out[87]: True
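
    The TypeError raised by index[1] = 'd' above can be caught, and a changed set of labels can be built as a new Index instead. This is a minimal sketch; the names raised and new_index are illustrative.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# In-place assignment is rejected because an Index is immutable.
raised = False
try:
    index[1] = 'd'
except TypeError as exc:
    raised = True
    print(exc)

# To change a label, build a new Index from a modified list of labels.
labels = index.tolist()
labels[1] = 'd'
new_index = pd.Index(labels)

print(list(index))
print(list(new_index))
```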
    

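    Index objects are immutable, as the TypeError above shows. An Index also behaves like a fixed-size set, so membership tests work on it: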

    In [88]: frame3
    Out[88]:
    state  Nevada  Ohio
    year
    2000      NaN   1.5
    2001      2.4   1.7
    2002      2.9   3.6
    
    In [89]: frame3.columns
    Out[89]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
    
    In [90]: 'Ohio' in frame3.columns
    Out[90]: True
    
    In [91]: 2003 in frame3.columns
    Out[91]: False
    
    In [92]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
    
    In [93]: dup_labels
    Out[93]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
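
    Unlike a Python set, an Index may hold duplicate labels, as dup_labels above shows. The sketch below checks for duplicates with is_unique and shows that selecting a duplicated label from a Series returns every matching entry. The Series s is illustrative.

```python
import pandas as pd

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

# is_unique reports whether every label occurs exactly once.
print(dup_labels.is_unique)

# Selecting a duplicated label returns all rows carrying that label.
s = pd.Series([1, 2, 3, 4], index=dup_labels)
print(s['foo'])
```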
    


        Article link: https://www.haomeiwen.com/subject/eymksqtx.html