美文网首页
pandas dataframe boolean indexin

pandas dataframe boolean indexin

作者: 筝韵徽 | 来源:发表于2019-01-09 15:53 被阅读101次
    import pandas as pd
    import numpy as np
    from tabulate import tabulate
    

    使用boolean indexing 筛选数据

    df= pd.read_csv('data/stackoverflow_qa.csv')
    df.head()
    
    |    |      id | creationdate        |   score |   viewcount | title                                                                               |   answercount |   commentcount |   favoritecount | quest_name     |   quest_rep | ans_name        |   ans_rep |
    |---:|--------:|:--------------------|--------:|------------:|:------------------------------------------------------------------------------------|--------------:|---------------:|----------------:|:---------------|------------:|:----------------|----------:|
    |  0 | 5486226 | 2011-03-30 12:26:50 |       4 |        2113 | Rolling median in python                                                            |             3 |              4 |               1 | yueerhu        |         125 | Mike Pennington |     26995 |
    |  1 | 5515021 | 2011-04-01 14:50:44 |       8 |        7015 | Compute a compounded return series in Python                                        |             3 |              6 |               7 | Jason Strimpel |        3301 | Mike Pennington |     26995 |
    |  2 | 5558607 | 2011-04-05 21:13:50 |       2 |        7392 | Sort a pandas DataMatrix in ascending order                                         |             2 |              0 |               1 | Jason Strimpel |        3301 | Wes McKinney    |     43310 |
    |  3 | 6467832 | 2011-06-24 12:31:45 |       9 |       13056 | How to get the correlation between two timeseries using Pandas                      |             1 |              0 |               7 | user814005     |         117 | Wes McKinney    |     43310 |
    |  4 | 7577546 | 2011-09-28 01:58:38 |       9 |        2488 | Using pandas, how do I subsample a large DataFrame by group in an efficient manner? |             1 |              0 |               5 | Uri Laserson   |         958 | HYRY            |     54137 |
    
    • 单纯的 boolean
    • 条件列表
    conditions= [True,False,False,True,False]
    df.head(5).loc[conditions]
    
    |    |      id | creationdate        |   score |   viewcount | title                                                          |   answercount |   commentcount |   favoritecount | quest_name   |   quest_rep | ans_name        |   ans_rep |
    |---:|--------:|:--------------------|--------:|------------:|:---------------------------------------------------------------|--------------:|---------------:|----------------:|:-------------|------------:|:----------------|----------:|
    |  0 | 5486226 | 2011-03-30 12:26:50 |       4 |        2113 | Rolling median in python                                       |             3 |              4 |               1 | yueerhu      |         125 | Mike Pennington |     26995 |
    |  3 | 6467832 | 2011-06-24 12:31:45 |       9 |       13056 | How to get the correlation between two timeseries using Pandas |             1 |              0 |               7 | user814005   |         117 | Wes McKinney    |     43310 |
    

    说明:第0条,第3条为true,其他为false

    df.head(5)[conditions]
    
    |    |      id | creationdate        |   score |   viewcount | title                                                          |   answercount |   commentcount |   favoritecount | quest_name   |   quest_rep | ans_name        |   ans_rep |
    |---:|--------:|:--------------------|--------:|------------:|:---------------------------------------------------------------|--------------:|---------------:|----------------:|:-------------|------------:|:----------------|----------:|
    |  0 | 5486226 | 2011-03-30 12:26:50 |       4 |        2113 | Rolling median in python                                       |             3 |              4 |               1 | yueerhu      |         125 | Mike Pennington |     26995 |
    |  3 | 6467832 | 2011-06-24 12:31:45 |       9 |       13056 | How to get the correlation between two timeseries using Pandas |             1 |              0 |               7 | user814005   |         117 | Wes McKinney    |     43310 |
    

    上例说明:这个很容易让人混淆,这是要选择列还是行,个人习惯尽量不用

    • 条件Series
    s=pd.Series([True,False,False,True,False])
    s
    
    0     True
    1    False
    2    False
    3     True
    4    False
    dtype: bool
    
    df.head(5).loc[s]
    
    |    |      id | creationdate        |   score |   viewcount | title                                                          |   answercount |   commentcount |   favoritecount | quest_name   |   quest_rep | ans_name        |   ans_rep |
    |---:|--------:|:--------------------|--------:|------------:|:---------------------------------------------------------------|--------------:|---------------:|----------------:|:-------------|------------:|:----------------|----------:|
    |  0 | 5486226 | 2011-03-30 12:26:50 |       4 |        2113 | Rolling median in python                                       |             3 |              4 |               1 | yueerhu      |         125 | Mike Pennington |     26995 |
    |  3 | 6467832 | 2011-06-24 12:31:45 |       9 |       13056 | How to get the correlation between two timeseries using Pandas |             1 |              0 |               7 | user814005   |         117 | Wes McKinney    |     43310 |
    

    注意,series的index 要与被筛选的dataframe的index一致
    下面看例子

    s.index
    
    RangeIndex(start=0, stop=5, step=1)
    
    df.head(5).index
    
    RangeIndex(start=0, stop=5, step=1)
    

    上边index是一致的,来看不一致的

    ss=pd.Series([True,False,False,True,False],index=[6,7,8,9,10])
    ss
    
    6      True
    7     False
    8     False
    9      True
    10    False
    dtype: bool
    
    df.head(5).loc[ss]
    
    ---------------------------------------------------------------------------
    
    IndexingError                             Traceback (most recent call last)
    
    <ipython-input-11-a1fa0f97c715> in <module>()
    ----> 1 df.head(5).loc[ss]
    
    
    E:\software\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
       1476 
       1477             maybe_callable = com._apply_if_callable(key, self.obj)
    -> 1478             return self._getitem_axis(maybe_callable, axis=axis)
       1479 
       1480     def _is_scalar_access(self, key):
    
    
    E:\software\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
       1866             return self._get_slice_axis(key, axis=axis)
       1867         elif com.is_bool_indexer(key):
    -> 1868             return self._getbool_axis(key, axis=axis)
       1869         elif is_list_like_indexer(key):
       1870 
    
    
    E:\software\anaconda3\lib\site-packages\pandas\core\indexing.py in _getbool_axis(self, key, axis)
       1491             axis = self.axis or 0
       1492         labels = self.obj._get_axis(axis)
    -> 1493         key = check_bool_indexer(labels, key)
       1494         inds, = key.nonzero()
       1495         try:
    
    
    E:\software\anaconda3\lib\site-packages\pandas\core\indexing.py in check_bool_indexer(ax, key)
       2354         mask = isna(result._values)
       2355         if mask.any():
    -> 2356             raise IndexingError('Unalignable boolean Series provided as '
       2357                                 'indexer (index of the boolean Series and of '
       2358                                 'the indexed object do not match')
    
    
    IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match
    

    报错,不匹配.所以尽量避免手写Series的index 可以 s=pd.Series([True,False,False,True,False],index=df.head(5).index) 代替

    • numpy arrays 条件
    a=np.array([True,False,False,True,False])
    df.head(5).loc[a]
    
    |    |      id | creationdate        |   score |   viewcount | title                                                          |   answercount |   commentcount |   favoritecount | quest_name   |   quest_rep | ans_name        |   ans_rep |
    |---:|--------:|:--------------------|--------:|------------:|:---------------------------------------------------------------|--------------:|---------------:|----------------:|:-------------|------------:|:----------------|----------:|
    |  0 | 5486226 | 2011-03-30 12:26:50 |       4 |        2113 | Rolling median in python                                       |             3 |              4 |               1 | yueerhu      |         125 | Mike Pennington |     26995 |
    |  3 | 6467832 | 2011-06-24 12:31:45 |       9 |       13056 | How to get the correlation between two timeseries using Pandas |             1 |              0 |               7 | user814005   |         117 | Wes McKinney    |     43310 |
    
    • 使用不等式 > < >= <= == !=
    condit=df['score'] > 10
    
    condit.head()
    
    0    False
    1    False
    2    False
    3    False
    4    False
    Name: score, dtype: bool
    
    type(condit)
    
    pandas.core.series.Series
    
    df1=df.loc[condit]
    df1.head()
    
    |    |      id | creationdate        |   score |   viewcount | title                                                                  |   answercount |   commentcount |   favoritecount | quest_name     |   quest_rep | ans_name      |   ans_rep |
    |---:|--------:|:--------------------|--------:|------------:|:-----------------------------------------------------------------------|--------------:|---------------:|----------------:|:---------------|------------:|:--------------|----------:|
    |  6 | 7776679 | 2011-10-15 08:21:17 |      25 |       28159 | append two data frame with pandas                                      |             2 |              7 |               4 | Jean-Pat       |         882 | Wes McKinney  |     43310 |
    |  8 | 7837722 | 2011-10-20 14:46:14 |     201 |      223746 | What is the most efficient way to loop through dataframes with pandas? |             8 |              3 |             115 | Muppet         |        1563 | Nick Crawford |      2779 |
    | 14 | 8916302 | 2012-01-18 19:41:27 |      29 |       20614 | selecting across multiple columns with python pandas?                  |             3 |              0 |              14 | user248237dfsf |       19244 | Wes McKinney  |     43310 |
    | 17 | 8991709 | 2012-01-24 17:59:53 |     136 |       16783 | Why are pandas merges in python faster than data.table merges in R?    |             3 |             16 |              60 | Zach           |       12484 | Matt Dowle    |     41275 |
    | 24 | 9555635 | 2012-03-04 14:25:36 |      19 |        6604 | Open source Enthought Python alternative                               |             8 |              5 |               6 | tshauck        |        5957 | ogrisel       |     24990 |
    
    df.shape
    
    (56398, 12)
    
    df1.shape
    
    (1364, 12)
    

    本质上还是使用了相同index的Series进行条件筛选的
    使用链式方式 如下

    df.loc[df['score']>10].head()
    
    |    |      id | creationdate        |   score |   viewcount | title                                                                  |   answercount |   commentcount |   favoritecount | quest_name     |   quest_rep | ans_name      |   ans_rep |
    |---:|--------:|:--------------------|--------:|------------:|:-----------------------------------------------------------------------|--------------:|---------------:|----------------:|:---------------|------------:|:--------------|----------:|
    |  6 | 7776679 | 2011-10-15 08:21:17 |      25 |       28159 | append two data frame with pandas                                      |             2 |              7 |               4 | Jean-Pat       |         882 | Wes McKinney  |     43310 |
    |  8 | 7837722 | 2011-10-20 14:46:14 |     201 |      223746 | What is the most efficient way to loop through dataframes with pandas? |             8 |              3 |             115 | Muppet         |        1563 | Nick Crawford |      2779 |
    | 14 | 8916302 | 2012-01-18 19:41:27 |      29 |       20614 | selecting across multiple columns with python pandas?                  |             3 |              0 |              14 | user248237dfsf |       19244 | Wes McKinney  |     43310 |
    | 17 | 8991709 | 2012-01-24 17:59:53 |     136 |       16783 | Why are pandas merges in python faster than data.table merges in R?    |             3 |             16 |              60 | Zach           |       12484 | Matt Dowle    |     41275 |
    | 24 | 9555635 | 2012-03-04 14:25:36 |      19 |        6604 | Open source Enthought Python alternative                               |             8 |              5 |               6 | tshauck        |        5957 | ogrisel       |     24990 |
    
    df.loc[df['ans_name']=='Scott Boston'].head()
    
    |       |       id | creationdate        |   score |   viewcount | title                                                |   answercount |   commentcount |   favoritecount | quest_name   |   quest_rep | ans_name     |   ans_rep |
    |------:|---------:|:--------------------|--------:|------------:|:-----------------------------------------------------|--------------:|---------------:|----------------:|:-------------|------------:|:-------------|----------:|
    | 38161 | 43491342 | 2017-04-19 09:14:28 |       4 |         167 | Merging pandas dataframes based on nearest value(s)  |             1 |              0 |             nan | AkiRoss      |        3991 | Scott Boston |     23611 |
    | 38178 | 43190850 | 2017-04-03 17:31:33 |       1 |         284 | Python Seaborn Plot ValueError                       |             2 |              3 |             nan | Ryan         |         545 | Scott Boston |     23611 |
    | 38237 | 43176052 | 2017-04-03 03:21:12 |       2 |          39 | Convert an indexed pandas matrix to a flat dataframe |             2 |              0 |             nan | alvas        |       31923 | Scott Boston |     23611 |
    | 38246 | 43209525 | 2017-04-04 14:03:17 |       5 |         131 | Pandas: Optimal way to MultiIndex columns            |             2 |              0 |               0 | sparc_spread |        5470 | Scott Boston |     23611 |
    | 38275 | 43211893 | 2017-04-04 15:45:17 |       0 |          38 | How to calculate a index series for a event window   |             1 |              3 |             nan | zsljulius    |        1102 | Scott Boston |     23611 |
    
    df.loc[(df['ans_name']=='Scott Boston')&(df['score']>2)].head()
    
    |       |       id | creationdate        |   score |   viewcount | title                                                               |   answercount |   commentcount |   favoritecount | quest_name       |   quest_rep | ans_name     |   ans_rep |
    |------:|---------:|:--------------------|--------:|------------:|:--------------------------------------------------------------------|--------------:|---------------:|----------------:|:-----------------|------------:|:-------------|----------:|
    | 38161 | 43491342 | 2017-04-19 09:14:28 |       4 |         167 | Merging pandas dataframes based on nearest value(s)                 |             1 |              0 |             nan | AkiRoss          |        3991 | Scott Boston |     23611 |
    | 38246 | 43209525 | 2017-04-04 14:03:17 |       5 |         131 | Pandas: Optimal way to MultiIndex columns                           |             2 |              0 |               0 | sparc_spread     |        5470 | Scott Boston |     23611 |
    | 38621 | 43240690 | 2017-04-05 20:07:29 |       3 |         210 | pandas data manipulation in python                                  |             2 |              0 |               1 | BigDataScientist |         360 | Scott Boston |     23611 |
    | 38640 | 42870703 | 2017-03-18 05:06:22 |       5 |         125 | Simultaneous operation of groupby and resample on pandas dataframe? |             1 |              0 |               1 | S. Naribole      |          43 | Scott Boston |     23611 |
    | 39223 | 42938535 | 2017-03-21 21:35:36 |       3 |         424 | python pandas resample count and sum                                |             2 |              0 |             nan | jeangelj         |         779 | Scott Boston |     23611 |
    
    df.loc[(df['ans_name']=='Scott Boston')|(df['score']<10)].head()
    
    |    |      id | creationdate        |   score |   viewcount | title                                                                               |   answercount |   commentcount |   favoritecount | quest_name     |   quest_rep | ans_name        |   ans_rep |
    |---:|--------:|:--------------------|--------:|------------:|:------------------------------------------------------------------------------------|--------------:|---------------:|----------------:|:---------------|------------:|:----------------|----------:|
    |  0 | 5486226 | 2011-03-30 12:26:50 |       4 |        2113 | Rolling median in python                                                            |             3 |              4 |               1 | yueerhu        |         125 | Mike Pennington |     26995 |
    |  1 | 5515021 | 2011-04-01 14:50:44 |       8 |        7015 | Compute a compounded return series in Python                                        |             3 |              6 |               7 | Jason Strimpel |        3301 | Mike Pennington |     26995 |
    |  2 | 5558607 | 2011-04-05 21:13:50 |       2 |        7392 | Sort a pandas DataMatrix in ascending order                                         |             2 |              0 |               1 | Jason Strimpel |        3301 | Wes McKinney    |     43310 |
    |  3 | 6467832 | 2011-06-24 12:31:45 |       9 |       13056 | How to get the correlation between two timeseries using Pandas                      |             1 |              0 |               7 | user814005     |         117 | Wes McKinney    |     43310 |
    |  4 | 7577546 | 2011-09-28 01:58:38 |       9 |        2488 | Using pandas, how do I subsample a large DataFrame by group in an efficient manner? |             1 |              0 |               5 | Uri Laserson   |         958 | HYRY            |     54137 |
    

    df.loc[~((df['ans_name']=='Scott Boston')&(df['score']>2))].head()
    
    |    |      id | creationdate        |   score |   viewcount | title                                                                               |   answercount |   commentcount |   favoritecount | quest_name     |   quest_rep | ans_name        |   ans_rep |
    |---:|--------:|:--------------------|--------:|------------:|:------------------------------------------------------------------------------------|--------------:|---------------:|----------------:|:---------------|------------:|:----------------|----------:|
    |  0 | 5486226 | 2011-03-30 12:26:50 |       4 |        2113 | Rolling median in python                                                            |             3 |              4 |               1 | yueerhu        |         125 | Mike Pennington |     26995 |
    |  1 | 5515021 | 2011-04-01 14:50:44 |       8 |        7015 | Compute a compounded return series in Python                                        |             3 |              6 |               7 | Jason Strimpel |        3301 | Mike Pennington |     26995 |
    |  2 | 5558607 | 2011-04-05 21:13:50 |       2 |        7392 | Sort a pandas DataMatrix in ascending order                                         |             2 |              0 |               1 | Jason Strimpel |        3301 | Wes McKinney    |     43310 |
    |  3 | 6467832 | 2011-06-24 12:31:45 |       9 |       13056 | How to get the correlation between two timeseries using Pandas                      |             1 |              0 |               7 | user814005     |         117 | Wes McKinney    |     43310 |
    |  4 | 7577546 | 2011-09-28 01:58:38 |       9 |        2488 | Using pandas, how do I subsample a large DataFrame by group in an efficient manner? |             1 |              0 |               5 | Uri Laserson   |         958 | HYRY            |     54137 |
    

    说明:&-与,|-或, ~ 非 都是位操作,所以每个条件都要带上(),类似(df['ans_name']=='Scott Boston')

    
    

    相关文章

      网友评论

          本文标题:pandas dataframe boolean indexin

          本文链接:https://www.haomeiwen.com/subject/fwkgrqtx.html