pandas Study Notes (3): Hierarchical Indexing

Author: 不做废物 | Published: 2018-08-03 18:59


    jupyter notebook: pandas Study Notes (3): Hierarchical Indexing





    import numpy as np
    import pandas as pd

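    What a multi-level index is for: representing higher-dimensional data with a lower-dimensional Series or DataFrame.
    First, pretend we do not know that pandas provides multi-level indexes, and build a Series dataset by hand.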

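    # `index` is a set (unordered): labels and values pair up arbitrarily, and newer
    # pandas may reject it; use a list to keep the intended pairing.
    index = {('California', 2000), ('California', 2010),
             ('New York', 2000), ('New York', 2010),
             ('Texas', 2000), ('Texas', 2010)}
    populations = [33871648, 37253956, 18976457,
                   19378102, 20851820, 25145561]
    pop = pd.Series(populations, index=index)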
    Texas       2000    33871648
    New York    2000    37253956
                2010    18976457
    California  2010    19378102
    Texas       2010    20851820
    California  2000    25145561
    dtype: int64


    {('California', 2000),
     ('California', 2010),
     ('New York', 2000),
     ('New York', 2010),
     ('Texas', 2000),
     ('Texas', 2010)}

    Also, why are the two California rows in pop above not next to each other? It is painful to look at!

    • The pandas multi-level index
      Now we generate a multi-level index from a Cartesian product
    index = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])  
    MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
               labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])


    pop = pop.reindex(index)
    California  2000    25145561
                2010    19378102
    New York    2000    37253956
                2010    18976457
    Texas       2000    33871648
                2010    20851820
    dtype: int64

    Much better. The leftmost index is level 0, the years such as 2000 are level 1, and so on.


    pop[:,2010]  # [a, b]: a selects place names such as California, b selects years such as 2000
    California    19378102
    New York      18976457
    Texas         20851820
    dtype: int64
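Selecting on the second level can also be written with `xs` (cross-section). The sketch below rebuilds the Series with the values from the reindexed output above, so it is self-contained; it should return the same three rows as the slice form.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'], [2000, 2010]])
# Values copied from the reindexed output above.
pop = pd.Series([25145561, 19378102, 37253956, 18976457, 33871648, 20851820],
                index=index)

by_slice = pop.loc[:, 2010]    # empty slice for level 0, 2010 for level 1
by_xs = pop.xs(2010, level=1)  # cross-section taken on level 1
print(by_xs)
```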


    1. Creating a multi-level index explicitly

    pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])  # create from plain arrays
    MultiIndex(levels=[['a', 'b'], [1, 2]],
               labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
    pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])  # create from tuples
    MultiIndex(levels=[['a', 'b'], [1, 2]],
               labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
    pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])  # create from a Cartesian product, as seen earlier
    MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
               labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

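    At a lower level, you can build the index directly by passing levels and labels (in pandas 0.24 and later the `labels` argument is named `codes`)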

    pd.MultiIndex(levels = [['a','b'],[1,2]],
                 labels = [[0,0,1,1],[0,1,0,1]])
    MultiIndex(levels=[['a', 'b'], [1, 2]],
               labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

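    levels holds two lists: the labels of the level-0 index and of the level-1 index.
    labels also holds two lists of equal length (the number of elements in the dataset). For each element they give the position of its label within the level-0 and level-1 lists respectively. It helps to think of this in terms of the Cartesian product.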
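To make the relationship concrete, here is a small sketch that decodes the integer positions back into label tuples by hand. It assumes pandas 0.24 or later, where the constructor argument is called `codes` rather than `labels`; the variable `decoded` is illustrative.

```python
import pandas as pd

levels = [['a', 'b'], [1, 2]]
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]  # named `labels` in older pandas

# Assumes pandas >= 0.24 for the `codes` keyword.
mi = pd.MultiIndex(levels=levels, codes=codes)

# Decode by hand: each row of positions picks one label from each level.
decoded = [tuple(level[pos] for level, pos in zip(levels, row))
           for row in zip(*codes)]
print(decoded)
print(list(mi))
```

Both printed lists should contain the same four tuples, which is what the Cartesian product of the two levels gives.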


    pop.index.names = ['states','years']
    states      years
    California  2000     25145561
                2010     19378102
    New York    2000     37253956
                2010     18976457
    Texas       2000     33871648
                2010     20851820
    dtype: int64
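Once the levels have names, those names can be used in place of level numbers. A short sketch, with the Series rebuilt from the values shown above so that it runs on its own:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])  # names can also be given at construction time
pop = pd.Series([25145561, 19378102, 37253956, 18976457, 33871648, 20851820],
                index=index)

states = pop.index.get_level_values('states')  # one label per row
pop_2000 = pop.xs(2000, level='years')         # refer to a level by name
print(pop_2000)
```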

    2. Multi-level column index

    Below we mock up a DataFrame of medical data

    index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                       names=['year', 'visit'])
    columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                         names=['subject', 'type'])
    data = np.round(np.random.randn(4,6), 1)
    data[:,::2] *= 10
    data += 37
    array([[30. , 38.4, 43. , 35.8, 34. , 36. ],
           [35. , 36. , 31. , 36.6, 24. , 36.7],
           [31. , 36.6, 47. , 37.2, 37. , 39. ],
           [39. , 35.7, 40. , 36.4, 37. , 36.9]])
    health_data = pd.DataFrame(data, index = index, columns = columns)

    subject      Bob       Guido         Sue
    type          HR  Temp    HR  Temp    HR  Temp
    year visit
    2013 1      30.0  38.4  43.0  35.8  34.0  36.0
         2      35.0  36.0  31.0  36.6  24.0  36.7
    2014 1      31.0  36.6  47.0  37.2  37.0  39.0
         2      39.0  35.7  40.0  36.4  37.0  36.9

    • Passing a single key to a DataFrame can only look up the level-0 column index
    health_data['Guido']  # health_data['HR'] raises an error

    type          HR  Temp
    year visit
    2013 1      43.0  35.8
         2      31.0  36.6
    2014 1      47.0  37.2
         2      40.0  36.4
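If you do want every 'HR' column, one option is `xs` along the column axis. This is a sketch that rebuilds `health_data` from the array printed above instead of from random numbers, so the result is reproducible.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Fixed values copied from the array shown above.
data = np.array([[30., 38.4, 43., 35.8, 34., 36.],
                 [35., 36., 31., 36.6, 24., 36.7],
                 [31., 36.6, 47., 37.2, 37., 39.],
                 [39., 35.7, 40., 36.4, 37., 36.9]])
health_data = pd.DataFrame(data, index=index, columns=columns)

try:
    health_data['HR']  # 'HR' lives on level 1, not level 0
except KeyError:
    print("'HR' is not a level-0 column label")

# Cross-section on the columns: keep only type == 'HR'.
hr = health_data.xs('HR', axis=1, level='type')
print(hr)
```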


    1. Multi-level indexing on a Series

    Using the pop dataset as the example

    states      years
    California  2000     25145561
                2010     19378102
    New York    2000     37253956
                2010     18976457
    Texas       2000     33871648
                2010     20851820
    dtype: int64
    pop['California',2000]  # mind the order of the index levels
    pop['California']  # with a single key and no comma, only level 0 can be selected; pop[2010] raises an error
    2000    25145561
    2010    19378102
    dtype: int64
    pop.loc['California':'New York']  # slicing works too, but the level-0 index must be sorted (A-Z)
    # use pop = pop.sort_index() to sort the index
    states      years
    California  2000     25145561
                2010     19378102
    New York    2000     37253956
                2010     18976457
    dtype: int64
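A small sketch of why sorting matters, using a throwaway Series `s` (not part of the original notebook) whose level-0 labels are deliberately out of order. On the pandas versions I know, slicing the unsorted index raises a KeyError subclass; the code catches it and then retries after `sort_index()`.

```python
import pandas as pd

# Level-0 labels are intentionally out of order: a, c, b.
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
s = pd.Series(range(6), index=index)

try:
    s.loc['a':'b']
except KeyError as err:  # UnsortedIndexError is a KeyError subclass
    print(type(err).__name__, err)

s_sorted = s.sort_index()
sliced = s_sorted.loc['a':'b']
print(sliced)
```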
    • If the index is sorted, a lower level can be selected by passing an empty slice for level 0
    pop[:, 2010]
    California    19378102
    New York      18976457
    Texas         20851820
    dtype: int64

    Boolean masks and fancy indexing work as well; I will not go into them here.
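For reference, a brief sketch of both, with the Series rebuilt from the values shown above. The 22,000,000 threshold is an arbitrary number chosen for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
pop = pd.Series([25145561, 19378102, 37253956, 18976457, 33871648, 20851820],
                index=index)

# Boolean mask: keep the rows whose value exceeds an arbitrary threshold.
large = pop[pop > 22000000]

# Fancy indexing: pick several level-0 labels at once.
two_states = pop.loc[['California', 'Texas']]
print(large)
print(two_states)
```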

    2. Multi-level indexing on a DataFrame

    Using the health_data dataset as the example


    subject      Bob       Guido         Sue
    type          HR  Temp    HR  Temp    HR  Temp
    year visit
    2013 1      30.0  38.4  43.0  35.8  34.0  36.0
         2      35.0  36.0  31.0  36.6  24.0  36.7
    2014 1      31.0  36.6  47.0  37.2  37.0  39.0
         2      39.0  35.7  40.0  36.4  37.0  36.9

    • Basic indexing on a DataFrame is column indexing; without loc or iloc, only columns can be indexed
    health_data['Guido', 'HR']
    year  visit
    2013  1        43.0
          2        31.0
    2014  1        47.0
          2        40.0
    Name: (Guido, HR), dtype: float64
    • With the DataFrame indexers, both rows and columns can be indexed
    health_data.iloc[0:2, 0:2]

    subject      Bob
    type          HR  Temp
    year visit
    2013 1      30.0  38.4
         2      35.0  36.0

    health_data.loc[:,(('Bob','Guido'), 'HR')]  # this example is worth studying closely

    subject      Bob Guido
    type          HR    HR
    year visit
    2013 1      30.0  43.0
         2      35.0  31.0
    2014 1      31.0  47.0
         2      39.0  40.0


    • health_data.loc[ , ]: rows go to the left of the comma, columns to the right
    • health_data.loc[:, ((level-0 column labels), (level-1 column labels))]: to select across several levels, the key must be written as nested tuples


    health_data.loc[:,(:, 'HR')]  
      File "<ipython-input-25-ff9aeaa8e80b>", line 1
        health_data.loc[:,(:, 'HR')]
    SyntaxError: invalid syntax
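The SyntaxError appears because a bare `:` is only valid directly inside square brackets, not inside a tuple. Two ways around it that I know of are `slice(None)` and `pd.IndexSlice`. The sketch below rebuilds `health_data` from the array printed earlier so that it runs on its own.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
data = np.array([[30., 38.4, 43., 35.8, 34., 36.],
                 [35., 36., 31., 36.6, 24., 36.7],
                 [31., 36.6, 47., 37.2, 37., 39.],
                 [39., 35.7, 40., 36.4, 37., 36.9]])
health_data = pd.DataFrame(data, index=index, columns=columns)

# Option 1: spell out the empty slice as slice(None) inside the tuple.
hr_a = health_data.loc[:, (slice(None), 'HR')]

# Option 2: pd.IndexSlice lets you keep the familiar colon syntax.
idx = pd.IndexSlice
hr_b = health_data.loc[:, idx[:, 'HR']]

# The same trick works on the rows: first visit of each year, HR only.
first_visit_hr = health_data.loc[idx[:, 1], idx[:, 'HR']]
print(first_visit_hr)
```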

    3. Setting and resetting the index


    • Resetting the index
    Help on method reset_index in module pandas.core.series:
    reset_index(level=None, drop=False, name=None, inplace=False) method of pandas.core.series.Series instance
        Generate a new DataFrame or Series with the index reset.
        This is useful when the index needs to be treated as a column, or
        when the index is meaningless and needs to be reset to the default
        before another operation.
        level : int, str, tuple, or list, default optional
            For a Series with a MultiIndex, only remove the specified levels
            from the index. Removes all levels by default.
        drop : bool, default False
            Just reset the index, without inserting it as a column in
            the new DataFrame.
        name : object, optional
            The name to use for the column containing the original Series
            values. Uses ``self.name`` by default. This argument is ignored
            when `drop` is True.
        inplace : bool, default False
            Modify the Series in place (do not create a new object).
        Series or DataFrame
            When `drop` is False (the default), a DataFrame is returned.
            The newly created columns will come first in the DataFrame,
            followed by the original Series values.
            When `drop` is True, a `Series` is returned.
            In either case, if ``inplace=True``, no value is returned.
        See Also
        DataFrame.reset_index: Analogous function for DataFrame.
        >>> s = pd.Series([1, 2, 3, 4], name='foo',
        ...               index=pd.Index(['a', 'b', 'c', 'd'], name='idx'))
        Generate a DataFrame with default index.
        >>> s.reset_index()
          idx  foo
        0   a    1
        1   b    2
        2   c    3
        3   d    4
        To specify the name of the new column use `name`.
        >>> s.reset_index(name='values')
          idx  values
        0   a       1
        1   b       2
        2   c       3
        3   d       4
        To generate a new Series with the default set `drop` to True.
        >>> s.reset_index(drop=True)
        0    1
        1    2
        2    3
        3    4
        Name: foo, dtype: int64
        To update the Series in place, without generating a new one
        set `inplace` to True. Note that it also requires ``drop=True``.
        >>> s.reset_index(inplace=True, drop=True)
        >>> s
        0    1
        1    2
        2    3
        3    4
        Name: foo, dtype: int64
        The `level` parameter is interesting for Series with a multi-level index.
        >>> arrays = [np.array(['bar', 'bar', 'baz', 'baz']),
        ...           np.array(['one', 'two', 'one', 'two'])]
        >>> s2 = pd.Series(
        ...     range(4), name='foo',
        ...     index=pd.MultiIndex.from_arrays(arrays,
        ...                                     names=['a', 'b']))
        To remove a specific level from the Index, use `level`.
        >>> s2.reset_index(level='a')
               a  foo
        one  bar    0
        two  bar    1
        one  baz    2
        two  baz    3
        If `level` is not set, all levels are removed from the Index.
        >>> s2.reset_index()
             a    b  foo
        0  bar  one    0
        1  bar  two    1
        2  baz  one    2
        3  baz  two    3
    pop_flat = pop.reset_index()  # without the name argument, the values column gets an automatic name (0)

           states  years         0
    0  California   2000  25145561
    1  California   2010  19378102
    2    New York   2000  37253956
    3    New York   2010  18976457
    4       Texas   2000  33871648
    5       Texas   2010  20851820

    pop_flat2 = pop.reset_index(name = 'population')  # name sets the name of the values column

           states  years  population
    0  California   2000    25145561
    1  California   2010    19378102
    2    New York   2000    37253956
    3    New York   2010    18976457
    4       Texas   2000    33871648
    5       Texas   2010    20851820
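The `level` argument described in the help text can be combined with `name` to move only one level into the columns. A sketch with the Series rebuilt from the values shown above; `partial` is an illustrative name.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
pop = pd.Series([25145561, 19378102, 37253956, 18976457, 33871648, 20851820],
                index=index)

# Move only 'years' out of the index; 'states' stays as the row index.
partial = pop.reset_index(level='years', name='population')
print(partial)
```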

    • Setting the index
      Taking pop_flat2 as the example, this turns the plain DataFrame above into a DataFrame with a multi-level index
    pop_flat2.set_index(['states', 'years'])  # returns a DataFrame

                      population
    states     years
    California 2000     25145561
               2010     19378102
    New York   2000     37253956
               2010     18976457
    Texas      2000     33871648
               2010     20851820

    pop_flat2.set_index('years')  # returns a DataFrame

               states  population
    years
    2000   California    25145561
    2010   California    19378102
    2000     New York    37253956
    2010     New York    18976457
    2000        Texas    33871648
    2010        Texas    20851820
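`reset_index` and `set_index` are roughly inverses of each other. The sketch below checks the round trip on the population data, rebuilt here from the values shown above; `restored` is an illustrative name.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
pop = pd.Series([25145561, 19378102, 37253956, 18976457, 33871648, 20851820],
                index=index)

pop_flat2 = pop.reset_index(name='population')  # index levels become columns
# Put both columns back into the index and pull out the single value column.
restored = pop_flat2.set_index(['states', 'years'])['population']
print(restored)
```

The values and index labels come back unchanged; the only difference I would expect is that the restored Series carries the name 'population'.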

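    Index stack and unstack

    In my view, stack and unstack make it easy to convert between dimensions: they reshape a dataset between long and wide form to suit different needs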

    states      years
    California  2000     25145561
                2010     19378102
    New York    2000     37253956
                2010     18976457
    Texas       2000     33871648
                2010     20851820
    dtype: int64

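    Use unstack to turn one index level into column names; level = 1 moves years into the columns (level = 0 would move states instead)

    pop.unstack(level = 1)

    years           2000      2010
    states
    California  25145561  19378102
    New York    37253956  18976457
    Texas       33871648  20851820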
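A short sketch of both directions, with the Series rebuilt from the values shown above. The variable names are illustrative. On pandas 2.1 and later, `stack()` may print a FutureWarning about its new implementation; the result for this data should be the same either way.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
pop = pd.Series([25145561, 19378102, 37253956, 18976457, 33871648, 20851820],
                index=index)

wide_years = pop.unstack(level=1)   # years become the columns
wide_states = pop.unstack(level=0)  # states become the columns
restored = wide_years.stack()       # back to a long Series
print(wide_states)
```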




