pandas dataframe Boolean条件筛选数据2

作者: 筝韵徽 | 来源:发表于2019-01-10 17:10 被阅读61次

pandas dataframe boolean条件数据筛选3
pandas dataframe Boolean条件筛选数据2
pandas DataFrame 数据筛选(2)
2021-12-31 Python-23
pandas中loc-iloc-ix的使用
pandas dataframe boolean indexin
Pandas
Python学习笔记（5）：Pandas数据结构介绍
2020-02-12
Pandas库介绍--DataFrame数据筛选

Boolean条件筛选数据2

import pandas as pd
import numpy as np
from tabulate import tabulate

df=pd.read_csv('data/stackoverflow_qa.csv')
print(tabulate(df.head(),headers=df.columns,tablefmt='simple'))

         id  creationdate           score    viewcount  title                                                                                  answercount    commentcount    favoritecount  quest_name        quest_rep  ans_name           ans_rep
--  -------  -------------------  -------  -----------  -----------------------------------------------------------------------------------  -------------  --------------  ---------------  --------------  -----------  ---------------  ---------
 0  5486226  2011-03-30 12:26:50        4         2113  Rolling median in python                                                                         3               4                1  yueerhu                 125  Mike Pennington      26995
 1  5515021  2011-04-01 14:50:44        8         7015  Compute a compounded return series in Python                                                     3               6                7  Jason Strimpel         3301  Mike Pennington      26995
 2  5558607  2011-04-05 21:13:50        2         7392  Sort a pandas DataMatrix in ascending order                                                      2               0                1  Jason Strimpel         3301  Wes McKinney         43310
 3  6467832  2011-06-24 12:31:45        9        13056  How to get the correlation between two timeseries using Pandas                                   1               0                7  user814005              117  Wes McKinney         43310
 4  7577546  2011-09-28 01:58:38        9         2488  Using pandas, how do I subsample a large DataFrame by group in an efficient manner?              1               0                5  Uri Laserson            958  HYRY                 54137

复杂的条件筛选
- 条件1:score>=8 并且 ans_name 是以Mike开头
- 条件2:answercount >=2 并且ans_name 是HYRY
- 找出符合这两个条件的数据
- 使用| 将条件1 和条件2 取并集如下

cond1=(df['score']>=8) &(df['ans_name'].str.startswith('Mike'))
cond2=(df['answercount']>=2) &(df['ans_name']=='HYRY')
cond3 =cond1 | cond2
aa=df.loc[cond3].head(10)
print(tabulate(aa,headers=df.columns,tablefmt='simple'))

            id  creationdate           score    viewcount  title                                                                                            answercount    commentcount    favoritecount  quest_name        quest_rep  ans_name           ans_rep
----  --------  -------------------  -------  -----------  ---------------------------------------------------------------------------------------------  -------------  --------------  ---------------  --------------  -----------  ---------------  ---------
   1   5515021  2011-04-01 14:50:44        8         7015  Compute a compounded return series in Python                                                               3               6                7  Jason Strimpel         3301  Mike Pennington      26995
 216  11619144  2012-07-23 19:26:32        0          613  Pandas xaxis auto-format issue                                                                             2               0              nan  joelhoro                379  HYRY                 54137
 722  13978682  2012-12-20 18:57:18        2         1382  Change Categorical Variable levels to What I provide/Combine levels two categorical variables              3               0              nan  Tom Bennett             574  HYRY                 54137
 932  15126679  2013-02-28 03:00:33        4         9115  Plot key count per unique value count in pandas                                                            2               0                1  monkut                22867  HYRY                 54137
 937  15581829  2013-03-23 00:16:32        6         9456  how to perform an inner or outer join of DataFrames with Pandas on non-simplistic criterion                2               0                6  zzzeek                41696  HYRY                 54137
 962  17778394  2013-07-22 00:31:30       16        11317  List Highest Correlation Pairs from a Large Correlation Matrix in Pandas?                                  4               0                3  Kyle Brandt            8282  HYRY                 54137
1343  18070520  2013-08-06 01:42:47        1          536  pandas memory usage when reindexing                                                                        2               2              nan  Andrea Zonca           3904  HYRY                 54137
1373  15856213  2013-04-06 21:28:12        5         1896  What is the equivalent to R's match() for python Pandas/numpy?                                             3               0                1  Solomon                 546  HYRY                 54137
1549  17401197  2013-07-01 09:29:11        2         1255  How to speed up Pandas multilevel dataframe shift by group?                                                3               0              nan  bigbug                 7865  HYRY                 54137
1589  15374261  2013-03-12 23:28:23        2         2087  pandas dataframe groupby a number of rows                                                                  2               0                1  seumas                  377  HYRY                 54137

isin的使用
- 找出 ans_name 是Mike Pennington，Wes McKinney， HYRY 的数据如下

aa=df.loc[df['ans_name'].isin(['Mike Pennington','Wes McKinney','HYRY'])].head()
print(tabulate(aa,headers=df.columns,tablefmt='simple'))

         id  creationdate           score    viewcount  title                                                                                  answercount    commentcount    favoritecount  quest_name        quest_rep  ans_name           ans_rep
--  -------  -------------------  -------  -----------  -----------------------------------------------------------------------------------  -------------  --------------  ---------------  --------------  -----------  ---------------  ---------
 0  5486226  2011-03-30 12:26:50        4         2113  Rolling median in python                                                                         3               4                1  yueerhu                 125  Mike Pennington      26995
 1  5515021  2011-04-01 14:50:44        8         7015  Compute a compounded return series in Python                                                     3               6                7  Jason Strimpel         3301  Mike Pennington      26995
 2  5558607  2011-04-05 21:13:50        2         7392  Sort a pandas DataMatrix in ascending order                                                      2               0                1  Jason Strimpel         3301  Wes McKinney         43310
 3  6467832  2011-06-24 12:31:45        9        13056  How to get the correlation between two timeseries using Pandas                                   1               0                7  user814005              117  Wes McKinney         43310
 4  7577546  2011-09-28 01:58:38        9         2488  Using pandas, how do I subsample a large DataFrame by group in an efficient manner?              1               0                5  Uri Laserson            958  HYRY                 54137

- 条件1： ans_name 是Wes McKinney， HYRY 的数据
- 条件2： score=9
- 找出符合条件1并且符合条件2的数据，如下

aa=df.loc[df['ans_name'].isin(['Wes McKinney','HYRY']) & (df['score']==9)].head()
print(tabulate(aa,headers=df.columns,tablefmt='simple'))

           id  creationdate           score    viewcount  title                                                                                  answercount    commentcount    favoritecount  quest_name        quest_rep  ans_name        ans_rep
---  --------  -------------------  -------  -----------  -----------------------------------------------------------------------------------  -------------  --------------  ---------------  --------------  -----------  ------------  ---------
  3   6467832  2011-06-24 12:31:45        9        13056  How to get the correlation between two timeseries using Pandas                                   1               0                7  user814005              117  Wes McKinney      43310
  4   7577546  2011-09-28 01:58:38        9         2488  Using pandas, how do I subsample a large DataFrame by group in an efficient manner?              1               0                5  Uri Laserson            958  HYRY              54137
140  11037895  2012-06-14 16:58:42        9         1144  How can I generalize my pandas data grouping to more than 3 dimensions?                          1               0                4  Tim Whitcomb           7432  Wes McKinney      43310
457  12950024  2012-10-18 08:14:42        9         1070  Add a column with a groupby on a hierarchical dataframe                                          1               2                6  Rutger Kassies        21364  Wes McKinney      43310
584  13419822  2012-11-16 15:43:15        9        12248  pandas dataframe, copy by value                                                                  1               0                4  Andrew                  248  Wes McKinney      43310

isnull 的使用
- 找出ans_name 为null的数据如下

aa=df.loc[df.ans_name.isnull()].head()
print(tabulate(aa,headers=df.columns,tablefmt='simple'))

         id  creationdate           score    viewcount  title                                                                    answercount    commentcount    favoritecount  quest_name             quest_rep    ans_name    ans_rep
--  -------  -------------------  -------  -----------  ---------------------------------------------------------------------  -------------  --------------  ---------------  -------------------  -----------  ----------  ---------
 5  7766400  2011-10-14 10:33:54        2          977  Pandas + Django + mod_wsgi + virtualenv                                            1               2                1  Evan Davey                   100         nan        nan
11  8451327  2011-12-09 20:27:24        1         2435  Python map() function output into Pandas DataFrame                                 1               0                1  briant57                       6         nan        nan
16  8966871  2012-01-23 03:21:00        0          287  Running Python/Numpy/Pandas on older secure computer                               0               3              nan  Casey                         16         nan        nan
29  9641916  2012-03-09 22:36:52        7         5519  Python Pandas: can't find numpy.core.multiarray when importing pandas              4               2              nan  Dylan Cutler                  41         nan        nan
30  9647656  2012-03-10 15:35:26        3          986  Pandas dataframe in mixed mode can't serialize to hdf5?                            1               0              nan  David van Coevorden           45         nan        nan

说明：isnull和isna 方法等同，isna是isnull的别名

boolean条件选择在Series中的应用

ss=df['commentcount']
ss.head()

0    4
1    6
2    0
3    0
4    0
Name: commentcount, dtype: int64

找出commentcount>10的数据如下

cond=ss>10
cond.head()

0    False
1    False
2    False
3    False
4    False
Name: commentcount, dtype: bool

ss[cond].head()

17     16
76     14
566    11
763    12
781    19
Name: commentcount, dtype: int64

ss.loc[cond].head()

17     16
76     14
566    11
763    12
781    19
Name: commentcount, dtype: int64

ss.loc[ss>10].head()

17     16
76     14
566    11
763    12
781    19
Name: commentcount, dtype: int64

找出 commentcount>10 并且 <15的数据如下

ss.loc[(ss>10)&(ss<15)].head()

76     14
566    11
763    12
787    12
837    13
Name: commentcount, dtype: int64

between 的使用

ss.loc[ss.between(10,15)].head()

76     14
566    11
763    12
764    10
787    12
Name: commentcount, dtype: int64

条件筛选（选择行列）

找出 viewcount>15000 的这'creationdate', 'viewcount', 'ans_name'列数据如下

aa=df.loc[df.viewcount>15000,['creationdate', 'viewcount', 'ans_name']].head()
print(tabulate(aa,headers=df.columns,tablefmt='simple'))

  id  creationdate           score  viewcount
----  -------------------  -------  -------------
   6  2011-10-15 08:21:17    28159  Wes McKinney
   7  2011-10-18 20:16:12    18917  Wes McKinney
   8  2011-10-20 14:46:14   223746  Nick Crawford
  14  2012-01-18 19:41:27    20614  Wes McKinney
  17  2012-01-24 17:59:53    16783  Matt Dowle

找出favoritecount 在20-50之间，并且返回隔行数据如下

aa=df.loc[df.viewcount.between(20,50),::2].head()
print(tabulate(aa,headers=df.columns,tablefmt='simple'))

  id    creationdate    score  viewcount                                                                    title  answercount      commentcount
----  --------------  -------  -------------------------------------------------------------------------  -------  ---------------  ----------------
 866        14444916        1  Pandas obtain share from DataFrame                                               0  Manuel Zompetta  Zelazny7
 981        18280947        1  How to do multiple column conversions on a Pandas Row in a DF in one pass        1  dartdog          Jeff
1049        17233635        1  Joining 2 dataframes on a specific column with IDs                               0  user7289         waitingkuo
1086        15006145       -1  Is it possible to have a data frame in which cells contain lists?                2  Roman            nan
1332        16303918        0  How to perform key-restricted broadcast-operate-update in Pandas?                0  kjo              Wouter Overmeire

条件选择列

df.shape

(56398, 12)

列表条件：共12列，隔列选择

co=[True,False]*6
co

[True, False, True, False, True, False, True, False, True, False, True, False]

aa=df.loc[:,co].head()
print(tabulate(aa,headers=df.columns,tablefmt='simple'))

  id    creationdate    score  viewcount                                                                              title  answercount     commentcount
----  --------------  -------  -----------------------------------------------------------------------------------  -------  --------------  ---------------
   0         5486226        4  Rolling median in python                                                                   4  yueerhu         Mike Pennington
   1         5515021        8  Compute a compounded return series in Python                                               6  Jason Strimpel  Mike Pennington
   2         5558607        2  Sort a pandas DataMatrix in ascending order                                                0  Jason Strimpel  Wes McKinney
   3         6467832        9  How to get the correlation between two timeseries using Pandas                             0  user814005      Wes McKinney
   4         7577546        9  Using pandas, how do I subsample a large DataFrame by group in an efficient manner?        0  Uri Laserson    HYRY

字典条件

col={'score':True,'commentcount':True}

aa=df.loc[:,col].head()
print(tabulate(aa,headers=df.columns,tablefmt='simple'))

  id    creationdate    score
----  --------------  -------
   0               4        4
   1               8        6
   2               2        0
   3               9        0
   4               9        0

选择出viewcount>10000，隔列显示

aa=df.loc[df.viewcount>10000,co].head()
print(tabulate(aa,headers=df.columns,tablefmt='simple'))

  id    creationdate    score  viewcount                                                                     title  answercount     commentcount
----  --------------  -------  --------------------------------------------------------------------------  -------  --------------  --------------
   3         6467832        9  How to get the correlation between two timeseries using Pandas                    0  user814005      Wes McKinney
   6         7776679       25  append two data frame with pandas                                                 7  Jean-Pat        Wes McKinney
   7         7813132       10  Convert array of string (category) to array of int from a pandas dataframe        0  Jean-Pat        Wes McKinney
   8         7837722      201  What is the most efficient way to loop through dataframes with pandas?            3  Muppet          Nick Crawford
  14         8916302       29  selecting across multiple columns with python pandas?                             0  user248237dfsf  Wes McKinney

选择都是int64类型的所有列的数据

aa=df.loc[:,df.dtypes == 'int64'].head()
print(tabulate(aa,headers=df.columns,tablefmt='simple'))

  id    creationdate    score    viewcount    title    answercount
----  --------------  -------  -----------  -------  -------------
   0     5.48623e+06        4         2113        3              4
   1     5.51502e+06        8         7015        3              6
   2     5.55861e+06        2         7392        2              0
   3     6.46783e+06        9        13056        1              0
   4     7.57755e+06        9         2488        1              0

df.dtypes

id                 int64
creationdate      object
score              int64
viewcount          int64
title             object
answercount        int64
commentcount       int64
favoritecount    float64
quest_name        object
quest_rep        float64
ans_name          object
ans_rep          float64
dtype: object

pandas dataframe boolean条件数据筛选3
pandas dataframe boolean条件数据筛选3 列与列之间的比较形成的筛选条件筛选出 answer...
pandas dataframe Boolean条件筛选数据2
Boolean条件筛选数据2 复杂的条件筛选条件1:score>=8 并且 ans_name 是以Mike开头条件...
pandas DataFrame 数据筛选(2)
导入numpy库，创建DataFrame 数据来做演示。 import numpy as np data = pd...
2021-12-31 Python-23
pandas pandas数据结构 pandas 有 2 个常用的数据结构：Series 和 Dataframe一...
pandas中loc-iloc-ix的使用
Pandas中loc,iloc,ix的使用使用 iloc 从DataFrame中筛选数据 iloc 是基于“位置...
pandas dataframe boolean indexin
使用boolean indexing 筛选数据单纯的 boolean 条件列表说明：第0条，第3条为true，...
Pandas
Pandas 目录一、Pandas基础二、Pandas三大数据结构1.Series2.DataFrame3.Ind...
Python学习笔记（5）：Pandas数据结构介绍
目录一、Pandas的数据结构介绍1、Series2、DataFrame（1）Dataframe创建（2）读取D...
2020-02-12
Pandas笔记之创建 Pandas DataFrame DataFrame为Pandas的第二种主要数据结构，是...
Pandas库介绍--DataFrame数据筛选
1.数据准备 2.筛选D列大于0的行扩展，& 用作且，如下例所示 3.只需要A和B列数据，而D和C列数据都是用于...