Python 数据处理（二十五）—— 索引和选择数据（续）

作者: 名本无名 | 来源:发表于2021-03-02 20:20 被阅读0次

Python 数据处理（二十五）—— 索引和选择数据（续）
Python 数据处理（二十三）—— 索引和选择数据
Python 数据处理（二十四）—— 索引和选择
Python学习如何监视Python程序的内存使用情况
Python数据处理&可视化学习指南
Python计算容积率
Python 数据处理（二十六）—— 索引和选择之 query
pandas 中链式索引选择数据1
Python数据处理从零开始----第四章（可视化）①②堆积柱状
Python数据处理从零开始----第四章（可视化）①③多变量绘

13 使用 isin 索引

isin() 方法，顾名思义，就是判断 pandas 对象的每个元素是否存在传入的对象（Series、DataFram、dict 以及可迭代对象）中，返回一个布尔值 DataFrame

In [165]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

In [166]: s
Out[166]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [167]: s.isin([2, 4, 6])
Out[167]: 
4    False
3    False
2     True
1    False
0     True
dtype: bool

In [168]: s[s.isin([2, 4, 6])]
Out[168]: 
2    2
0    4
dtype: int64

同样，该方法也适用于 Index 对象，当你不知道所要寻找的标签中哪些是真实存在的时候，这一方法是很有用的

In [169]: s[s.index.isin([2, 4, 6])]
Out[169]: 
4    0
2    2
dtype: int64

# 注意与下面代码的区别
In [170]: s.reindex([2, 4, 6])
Out[170]: 
2    2.0
4    0.0
6    NaN
dtype: float64

对于 MultiIndex，还可以指定索引的级别

In [171]: s_mi = pd.Series(np.arange(6),
   .....:                  index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
   .....: 

In [172]: s_mi
Out[172]: 
0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int64

In [173]: s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
Out[173]: 
0  c    2
1  a    3
dtype: int64

In [174]: s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]
Out[174]: 
0  a    0
   c    2
1  a    3
   c    5
dtype: int64

DataFrame 也有 isin() 方法，如果传递的是数组、列表或序列，会返回一个与原 DataFrame 大小相同的布尔 DataFrame

In [175]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
   .....:                    'ids2': ['a', 'n', 'c', 'n']})
   .....: 

In [176]: values = ['a', 'b', 1, 3]

In [177]: df.isin(values)
Out[177]: 
    vals    ids   ids2
0   True   True   True
1  False   True  False
2   True  False  False
3  False  False  False

通常，您希望将特定的值与特定的列相匹配。只需将 values 设置为一个字典，其中键是列名，值是要检查的值的列表。

In [178]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}

In [179]: df.isin(values)
Out[179]: 
    vals    ids   ids2
0   True   True  False
1  False   True  False
2   True  False  False
3  False  False  False

将 DataFrame 的 isin 与 any() 和 all() 方法结合起来，可以快速选择满足给定条件的数据子集。

In [180]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

In [181]: row_mask = df.isin(values).all(1)

In [182]: df[row_mask]
Out[182]: 
   vals ids ids2
0     1   a    a

14 `where()` 和 `mask()` 方法

用布尔向量从 Series 中选择值，一般会返回数据的一个子集。为了保证选择的数据与原始数据的形状相同，可以使用 Series 和 DataFrame 中的 where 方法

返回选定的行

In [183]: s[s > 0]
Out[183]: 
3    1
2    2
1    3
0    4
dtype: int64

返回一个与原始数据形状相同的 Series

In [184]: s.where(s > 0)
Out[184]: 
4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64

下面的代码与 df.where(df < 0) 功能一样

In [185]: df[df < 0]
Out[185]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838

此外，where 接受一个可选的 other 参数，用于指定条件为 False 的值的替换值

In [186]: df.where(df < 0, -df)
Out[186]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838

您可能希望根据一些布尔条件设置值

In [187]: s2 = s.copy()

In [188]: s2[s2 < 0] = 0

In [189]: s2
Out[189]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [190]: df2 = df.copy()

In [191]: df2[df2 < 0] = 0

In [192]: df2
Out[192]: 
                   A         B         C         D
2000-01-01  0.000000  0.000000  0.485855  0.245166
2000-01-02  0.000000  0.390389  0.000000  1.655824
2000-01-03  0.000000  0.299674  0.000000  0.281059
2000-01-04  0.846958  0.000000  0.600705  0.000000
2000-01-05  0.669692  0.000000  0.000000  0.342416
2000-01-06  0.868584  0.000000  2.297780  0.000000
2000-01-07  0.000000  0.000000  0.168904  0.000000
2000-01-08  0.801196  1.392071  0.000000  0.000000

默认情况下，where 返回数据的拷贝后的修改数据。这里有一个可选参数 inplace，可以在不创建副本的情况下修改原始数据。

In [193]: df_orig = df.copy()

In [194]: df_orig.where(df > 0, -df, inplace=True)

In [195]: df_orig
Out[195]: 
                   A         B         C         D
2000-01-01  2.104139  1.309525  0.485855  0.245166
2000-01-02  0.352480  0.390389  1.192319  1.655824
2000-01-03  0.864883  0.299674  0.227870  0.281059
2000-01-04  0.846958  1.222082  0.600705  1.233203
2000-01-05  0.669692  0.605656  1.169184  0.342416
2000-01-06  0.868584  0.948458  2.297780  0.684718
2000-01-07  2.670153  0.114722  0.168904  0.048048
2000-01-08  0.801196  1.392071  0.048788  0.808838

注意：

pandas 的 df1.where(m, df2) 与 numpy 的 np.where(m, df1, df2) 相等

In [196]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
Out[196]: 
               A     B     C     D
2000-01-01  True  True  True  True
2000-01-02  True  True  True  True
2000-01-03  True  True  True  True
2000-01-04  True  True  True  True
2000-01-05  True  True  True  True
2000-01-06  True  True  True  True
2000-01-07  True  True  True  True
2000-01-08  True  True  True  True

对齐

此外，where 会与布尔输入对齐，与 .loc 的部分选择和部分设置类似

In [197]: df2 = df.copy()

In [198]: df2[df2[1:4] > 0] = 3

In [199]: df2
Out[199]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525  0.485855  0.245166
2000-01-02 -0.352480  3.000000 -1.192319  3.000000
2000-01-03 -0.864883  3.000000 -0.227870  3.000000
2000-01-04  3.000000 -1.222082  3.000000 -1.233203
2000-01-05  0.669692 -0.605656 -1.169184  0.342416
2000-01-06  0.868584 -0.948458  2.297780 -0.684718
2000-01-07 -2.670153 -0.114722  0.168904 -0.048048
2000-01-08  0.801196  1.392071 -0.048788 -0.808838

where 还可以接受 axis 和 level 参数来对齐输入

In [200]: df2 = df.copy()

In [201]: df2.where(df2 > 0, df2['A'], axis='index')
Out[201]: 
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196

这等效于下面的代码，但是速度更快

In [202]: df2 = df.copy()

In [203]: df.apply(lambda x, y: x.where(x > 0, y), y=df['A'])
Out[203]: 
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196

where 的条件参数和 other 参数接受一个可调用函数，函数必须传入一个参数，并返回有效的输出作为条件和 other 参数。

In [204]: df3 = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [4, 5, 6],
   .....:                     'C': [7, 8, 9]})
   .....: 

In [205]: df3.where(lambda x: x > 4, lambda x: x + 10)
Out[205]: 
    A   B  C
0  11  14  7
1  12   5  8
2  13   6  9

mask

mask() 是 where 的逆布尔运算

In [206]: s.mask(s >= 0)
Out[206]: 
4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64

In [207]: df.mask(df >= 0)
Out[207]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838

15 使用 numpy 函数进行有条件的放大设置

where() 的替代方法是使用 numpy.where()。通过与设置新列相结合，可以对数据进行放大，其中的值是根据条件确定的。

In [208]: df = pd.DataFrame({'col1': list('ABBC'), 'col2': list('ZZXY')})

In [209]: df['color'] = np.where(df['col2'] == 'Z', 'green', 'red')

In [210]: df
Out[210]: 
  col1 col2  color
0    A    Z  green
1    B    Z  green
2    B    X    red
3    C    Y    red

考虑到在下面的数据，你有两个选择。你想在第二列有 'Z' 的时候，将新的列颜色设置为 'green'。你可以执行以下操作

In [208]: df = pd.DataFrame({'col1': list('ABBC'), 'col2': list('ZZXY')})

In [209]: df['color'] = np.where(df['col2'] == 'Z', 'green', 'red')

In [210]: df
Out[210]: 
  col1 col2  color
0    A    Z  green
1    B    Z  green
2    B    X    red
3    C    Y    red

如果有多个条件，可以使用 numpy.select()。

例如，如果说对应三个条件有三种颜色可以选择，第四种颜色作为备用，可以做如下处理

In [211]: conditions = [
   .....:     (df['col2'] == 'Z') & (df['col1'] == 'A'),
   .....:     (df['col2'] == 'Z') & (df['col1'] == 'B'),
   .....:     (df['col1'] == 'B')
   .....: ]
   .....: 

In [212]: choices = ['yellow', 'blue', 'purple']

In [213]: df['color'] = np.select(conditions, choices, default='black')

In [214]: df
Out[214]: 
  col1 col2   color
0    A    Z  yellow
1    B    Z    blue
2    B    X  purple
3    C    Y   black

16 重复数据

如果你想识别和删除 DataFrame 中的重复行，可以使用下面两个方法：

duplicate: 返回一个布尔向量，其长度与数据行数一样，指示对应的行是否重复
drop_duplicates: 删除重复的行

默认情况下，第一个观察到的行被认为是唯一的（即保留重复的第一个），但是每个方法都有一个 keep 参数来指定要保存的目标

keep='first'(default): 保留重复数据的第一个
keep='last': 保留重复数据的最后一个
keep=False: 删除所有存在重复的行

In [281]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
   .....:                     'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....: 

In [282]: df2
Out[282]: 
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [283]: df2.duplicated('a')
Out[283]: 
0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [284]: df2.duplicated('a', keep='last')
Out[284]: 
0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool

In [285]: df2.duplicated('a', keep=False)
Out[285]: 
0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool

In [286]: df2.drop_duplicates('a')
Out[286]: 
       a  b         c
0    one  x -1.067137
2    two  x -0.211056
5  three  x -1.964475
6   four  x  1.298329

In [287]: df2.drop_duplicates('a', keep='last')
Out[287]: 
       a  b         c
1    one  y  0.309500
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [288]: df2.drop_duplicates('a', keep=False)
Out[288]: 
       a  b         c
5  three  x -1.964475
6   four  x  1.298329

此外，还可以传递列名列表来标识重复行

In [289]: df2.duplicated(['a', 'b'])
Out[289]: 
0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool

In [290]: df2.drop_duplicates(['a', 'b'])
Out[290]: 
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
5  three  x -1.964475
6   four  x  1.298329

要按索引值删除重复的内容，请使用 Index.deproicated 然后执行切片。也包含同样的 keep 参数

In [291]: df3 = pd.DataFrame({'a': np.arange(6),
   .....:                     'b': np.random.randn(6)},
   .....:                    index=['a', 'a', 'b', 'c', 'b', 'a'])
   .....: 

In [292]: df3
Out[292]: 
   a         b
a  0  1.440455
a  1  2.456086
b  2  1.038402
c  3 -0.894409
b  4  0.683536
a  5  3.082764

In [293]: df3.index.duplicated()
Out[293]: array([False,  True, False, False,  True,  True])

In [294]: df3[~df3.index.duplicated()]
Out[294]: 
   a         b
a  0  1.440455
b  2  1.038402
c  3 -0.894409

In [295]: df3[~df3.index.duplicated(keep='last')]
Out[295]: 
   a         b
c  3 -0.894409
b  4  0.683536
a  5  3.082764

In [296]: df3[~df3.index.duplicated(keep=False)]
Out[296]: 
   a         b
c  3 -0.894409

17 类似字典个 get 方法

每个 Series 或 DataFrame 都有一个可以返回默认值的 get 方法

In [297]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [298]: s.get('a')  # equivalent to s['a']
Out[298]: 1

In [299]: s.get('x', default=-1)
Out[299]: -1

18 使用索引或列标签查找值

有时，您希望提取一组给定行标签和列标签序列的值，这可以通过使用 DataFrame.melt 和 DataFrame.loc 来实现

In [300]: df = pd.DataFrame({'col': ["A", "A", "B", "B"],
   .....:                    'A': [80, 23, np.nan, 22],
   .....:                    'B': [80, 55, 76, 67]})
   .....: 

In [301]: df
Out[301]: 
  col     A   B
0   A  80.0  80
1   A  23.0  55
2   B   NaN  76
3   B  22.0  67

In [302]: melt = df.melt('col')

In [303]: melt = melt.loc[melt['col'] == melt['variable'], 'value']

In [304]: melt.reset_index(drop=True)
Out[304]: 
0    80.0
1    23.0
2    76.0
3    67.0
Name: value, dtype: float64

Python 数据处理（二十五）—— 索引和选择数据（续）
13 使用 isin 索引 isin() 方法，顾名思义，就是判断 pandas 对象的每个元素是否存在传入的对象...
Python 数据处理（二十三）—— 索引和选择数据
前言 pandas 对象的轴标签信息有很多用途，如利用标识符来标识数据能够显式的和自动对齐数据获取和设置数据...
Python 数据处理（二十四）—— 索引和选择
8 组合位置和标签索引如果你想获取 'A' 列的第 0 和第 2 个元素，你可以这样做: 这也可以用 .iloc...
Python学习如何监视Python程序的内存使用情况
前言我们使用Python和它的数据处理库套件(如panda和scikiti -learn)进行大量数据处理时候，...
Python数据处理&可视化学习指南
Python数据处理指南 Python数据可视化指南
Python计算容积率
现在ArcGIS计算好面积，导出dbf表格，在Python进行数据处理和可视化。数据处理读取数据数据清洗数...
Python 数据处理（二十六）—— 索引和选择之 query
19 query() 方法 DataFrame 对象有一个 query() 方法，它允许使用表达式进行选择。例...
pandas 中链式索引选择数据1
pandas 中链式索引选择数据1 链式索引选择数据,示例1 +链式索引选择数据,示例2 链式索引选择数据,示例...
Python数据处理从零开始----第四章（可视化）①②堆积柱状
目录 Python数据处理从零开始----第四章（可视化）①Matplotlib包 Python数据处理从零开始-...
Python数据处理从零开始----第四章（可视化）①③多变量绘
目录 Python数据处理从零开始----第四章（可视化）①Matplotlib包 Python数据处理从零开始-...