12 What the Pandas index is for

Data stored in an ordinary column can already be used for queries, so what do you gain by using an index?

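What the index is used for, in summary:

More convenient data queries;
Better query performance;
Automatic data alignment;
Support for more, and more powerful, data structures;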

import pandas as pd

df = pd.read_csv("./datas/ml-latest-small/ratings.csv")

df.head()


	userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931

df.count()



    userId       100836
    movieId      100836
    rating       100836
    timestamp    100836
    dtype: int64

1. Querying data with the index


# drop=False keeps userId as a regular column in addition to making it the index
df.set_index("userId", inplace=True, drop=False)

df.head()


	userId	movieId	rating	timestamp
userId
1	1	1	4.0	964982703
1	1	3	4.0	964981247
1	1	6	4.0	964982224
1	1	47	5.0	964983815
1	1	50	5.0	964982931
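To see what `drop=False` changes, here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and only stands in for `ratings.csv`). It compares the default `set_index` behaviour with the `drop=False` variant used above.

```python
import pandas as pd

# Hypothetical toy data standing in for ratings.csv
toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
})

dropped = toy.set_index("userId")            # default drop=True: userId leaves the columns
kept = toy.set_index("userId", drop=False)   # userId is both the index and a column

print(list(dropped.columns))  # ['movieId', 'rating']
print(list(kept.columns))     # ['userId', 'movieId', 'rating']
print(kept.index.name)        # userId
```

Keeping the column is what allows the later comparison between an index lookup and a column condition on the same frame.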

df.index




    Int64Index([  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
                ...
                610, 610, 610, 610, 610, 610, 610, 610, 610, 610],
               dtype='int64', name='userId', length=100836)


# Query by index label
df.loc[500].head(5)


	userId	movieId	rating	timestamp
userId
500	500	1	4.0	1005527755
500	500	11	1.0	1005528017
500	500	39	1.0	1005527926
500	500	101	1.0	1005527980
500	500	104	4.0	1005528065

# Query with a condition on the column
df.loc[df["userId"] == 500].head()


	userId	movieId	rating	timestamp
userId
500	500	1	4.0	1005527755
500	500	11	1.0	1005528017
500	500	39	1.0	1005527926
500	500	101	1.0	1005527980
500	500	104	4.0	1005528065
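The two queries above return the same rows. A small sketch on made-up data (the `toy` frame is an assumption, not the ratings file) checks this with `DataFrame.equals`:

```python
import pandas as pd

# Hypothetical toy data; label 500 appears twice, label 1 only once
toy = pd.DataFrame({
    "userId": [1, 500, 2, 500],
    "movieId": [1, 11, 3, 39],
    "rating": [4.0, 1.0, 3.0, 1.0],
})
toy = toy.set_index("userId", drop=False)

by_index = toy.loc[500]                    # lookup by index label
by_mask = toy.loc[toy["userId"] == 500]    # lookup by column condition

print(by_index.equals(by_mask))  # True
```

One difference to keep in mind: when a label matches exactly one row, `toy.loc[1]` returns a Series rather than a DataFrame, while the boolean condition always returns a DataFrame. Passing a list, as in `toy.loc[[1]]`, keeps the result a DataFrame.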

2. Using the index improves query performance

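If the index is unique, Pandas uses a hash table to locate the label, so a lookup costs O(1);

If the index is not unique but is sorted, Pandas uses binary search, so a lookup costs O(logN);

If the index is neither unique nor sorted (completely random order), every lookup has to scan the whole table, so it costs O(N);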
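One way to observe the three cases is `Index.get_loc`, which returns a different kind of result depending on the properties of the index. The sketch below uses three small made-up indexes; the variable names are illustrative only.

```python
import pandas as pd

unique_idx = pd.Index([10, 20, 30, 40])      # unique labels
sorted_idx = pd.Index([10, 10, 20, 20, 30])  # duplicates, but sorted
random_idx = pd.Index([20, 10, 30, 10, 20])  # duplicates, unsorted

pos = unique_idx.get_loc(20)    # a single integer position
span = sorted_idx.get_loc(20)   # a slice covering the contiguous matches
mask = random_idx.get_loc(20)   # a boolean mask over every element

print(pos)
print(span)
print(mask)
```

A single position and a slice can be found without touching every element, whereas the boolean mask has one entry per row, which is consistent with the full-scan case described above.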

Experiment 1: querying an index in completely random order

# Shuffle the rows into random order
from sklearn.utils import shuffle
df_shuffle = shuffle(df)

df_shuffle.head()


	userId	movieId	rating	timestamp
userId
160	160	2340	1.0	985383314
129	129	1136	3.5	1167375403
167	167	44191	4.5	1154718915
536	536	276	3.0	832839990
67	67	5952	2.0	1501274082

# Is the index monotonically increasing?
df_shuffle.index.is_monotonic_increasing


    False


df_shuffle.index.is_unique

    False


# Time the lookup of the rows with userId == 500
%timeit df_shuffle.loc[500]

    376 µs ± 52.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Experiment 2: querying after sorting the index

df_sorted = df_shuffle.sort_index()

df_sorted.head()


	userId	movieId	rating	timestamp
userId
1	1	2985	4.0	964983034
1	1	2617	2.0	964982588
1	1	3639	4.0	964982271
1	1	6	4.0	964982224
1	1	733	4.0	964982400

# Is the index monotonically increasing?
df_sorted.index.is_monotonic_increasing

    True

df_sorted.index.is_unique

    False

%timeit df_sorted.loc[500]

    203 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
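The `%timeit` magic only works inside IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard `timeit` module. The data here is synthetic (random user ids in the range 1 to 610, an assumption chosen to resemble the ratings file), and a unique-index frame is added for the case the two experiments do not cover. Absolute timings depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000
user_ids = rng.integers(1, 611, size=n)  # synthetic stand-in for userId

shuffled = pd.DataFrame({"rating": rng.random(n)}, index=user_ids)    # duplicates, unsorted
ordered = shuffled.sort_index()                                       # duplicates, sorted
unique = pd.DataFrame({"rating": rng.random(n)}, index=np.arange(n))  # unique labels

repeats = 200
for name, frame in [("shuffled", shuffled), ("sorted", ordered), ("unique", unique)]:
    seconds = timeit.timeit(lambda: frame.loc[500], number=repeats)
    print(f"{name}: {seconds / repeats * 1e6:.1f} µs per lookup")
```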

3. The index aligns data automatically

This applies to both Series and DataFrame.

s1 = pd.Series([1,2,3], index=list("abc"))

s1



    a    1
    b    2
    c    3
    dtype: int64


s2 = pd.Series([2,3,4], index=list("bcd"))

s2



    b    2
    c    3
    d    4
    dtype: int64


s1+s2



    a    NaN
    b    4.0
    c    6.0
    d    NaN
    dtype: float64
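Labels that exist on only one side produce `NaN`, as the result above shows. The sketch below illustrates two related points: `Series.add` with `fill_value` treats a missing label as a given value instead, and DataFrames align on both the row index and the column labels. The frames `df1` and `df2` are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([2, 3, 4], index=list("bcd"))

# Treat a label missing from one side as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)

# Hypothetical frames: they share only row "b" and column "x"
df1 = pd.DataFrame({"x": [1, 2]}, index=list("ab"))
df2 = pd.DataFrame({"x": [10, 20], "y": [5, 6]}, index=list("bc"))

# DataFrames align on the row index and on the column labels
aligned = df1 + df2
print(aligned)
```

In `aligned`, only the cell at row `b`, column `x` has a value (12.0). Every other cell is `NaN` because one of the two frames has no entry for that row or column.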