Python Data Processing (6): pandas Basics

Author: 名本无名 | Published: 2021-02-07 22:43


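Preface

This post covers many of the basic functions shared by pandas data structures.

First, let's create some example objects (the sessions below assume numpy and pandas have been imported as np and pd):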

In [1]: index = pd.date_range("1/1/2000", periods=8)

In [2]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [3]: df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

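Introduction

1. head and tail

As the names suggest, the head() and tail() methods give quick access to the first and last rows of a Series or DataFrame. They show 5 rows by default; pass a number to show a different count.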

In [4]: long_series = pd.Series(np.random.randn(1000))

In [5]: long_series.head()
Out[5]: 
0    0.825854
1   -1.891351
2   -1.420520
3   -0.014777
4    0.380429
dtype: float64

In [6]: long_series.tail(3)
Out[6]: 
997   -2.097890
998    1.353178
999   -1.000323
dtype: float64
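
The random values above make the output hard to check by eye. The following is a minimal sketch using deterministic data (the variable names are illustrative, not from the original post). It also shows that a negative count drops rows from the other end.

```python
import numpy as np
import pandas as pd

# Deterministic data: the values equal their positions
long_series = pd.Series(np.arange(1000))

first_five = long_series.head()          # default n=5
last_three = long_series.tail(3)         # explicit row count
all_but_last_two = long_series.head(-2)  # negative n: everything except the last 2 rows

print(first_five.tolist())    # [0, 1, 2, 3, 4]
print(last_three.tolist())    # [997, 998, 999]
print(len(all_but_last_two))  # 998
```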

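2. Attributes and underlying data

pandas objects have a number of attributes that give convenient access to their metadata:

shape: the axis dimensions of the object, consistent with ndarray
Axis labels:
- Series: index only
- DataFrame: index and columns

Note: the axis-label attributes (index and columns) can be assigned to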

In [7]: df[:2]
Out[7]: 
                   A         B         C
2000-01-01 -0.173215  0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929

In [8]: df.columns = [x.lower() for x in df.columns]

In [9]: df
Out[9]: 
                   a         b         c
2000-01-01 -0.173215  0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929
2000-01-03  1.071804  0.721555 -0.706771
2000-01-04 -1.039575  0.271860 -0.424972
2000-01-05  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427
2000-01-07  0.524988  0.404705  0.577046
2000-01-08 -1.715002 -1.039268 -0.370647
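
As a small self-contained sketch of these attributes (the data and variable names here are made up for illustration), shape is read-only metadata, while index and columns can be replaced by assigning a new list of labels of the same length:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.zeros((8, 3)),
    index=pd.date_range("1/1/2000", periods=8),
    columns=["A", "B", "C"],
)
print(frame.shape)  # (8, 3)

# Assigning to .columns relabels the columns in place
frame.columns = [name.lower() for name in frame.columns]
print(list(frame.columns))  # ['a', 'b', 'c']

# A Series only has an index, which can be reassigned the same way
series = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
series.index = ["x", "y", "z"]
print(list(series.index))  # ['x', 'y', 'z']
```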

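pandas objects (Series, Index, DataFrame) can be thought of as containers for arrays, which hold the actual data and do the actual computation.

For many types, the underlying array is a numpy.ndarray. However, pandas and third-party libraries may extend NumPy's type system to add support for custom arrays.

Use the array attribute to get the actual data inside an Index or Series: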

In [10]: s.array
Out[10]: 
<PandasArray>
[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
 -1.1356323710171934,  1.2121120250208506]
Length: 5, dtype: float64

In [11]: s.index.array
Out[11]: 
<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

A Series can also be converted directly to a NumPy array:

In [12]: s.to_numpy()
Out[12]: array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121])

In [13]: np.asarray(s)
Out[13]: array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121])
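
To make the difference between the two accessors concrete, here is a brief sketch with fixed values (names are illustrative). It only checks the types of the returned objects, since the exact class name printed for .array varies between pandas versions.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionArray

s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

wrapped = s.array    # a pandas ExtensionArray wrapping the data
plain = s.to_numpy() # a plain numpy.ndarray

print(isinstance(wrapped, ExtensionArray))  # True
print(isinstance(plain, np.ndarray))        # True
print(plain.tolist())                       # [1.5, 2.5, 3.5]
```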

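The dtype parameter of to_numpy() gives some control over the type of the resulting numpy.ndarray. Take datetimes with time zones as an example: NumPy has no dtype for timezone-aware datetimes, so there are two possible representations.

Timestamp: an object-dtype numpy.ndarray of Timestamp objects, each carrying the correct tz.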

In [14]: ser = pd.Series(pd.date_range("2000", periods=2, tz="CET"))

In [15]: ser.to_numpy(dtype=object)
Out[15]: 
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
       Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
      dtype=object)

datetime64[ns]: also a numpy.ndarray, with the values converted to UTC and the time zone discarded

In [16]: ser.to_numpy(dtype="datetime64[ns]")
Out[16]: 
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')

When all columns of a DataFrame share one dtype, DataFrame.to_numpy() returns the underlying data:

In [17]: df.to_numpy()
Out[17]: 
array([[-0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949],
       [ 1.0718,  0.7216, -0.7068],
       [-1.0396,  0.2719, -0.425 ],
       [ 0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784],
       [ 0.525 ,  0.4047,  0.577 ],
       [-1.715 , -1.0393, -0.3706]])

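However, when the columns do not all share one dtype, getting the underlying data is more involved. If the DataFrame contains strings, the result has dtype object; if it contains integers and floats, the result has a floating-point dtype.

In the past, pandas recommended Series.values or DataFrame.values for extracting data from a Series or DataFrame. The recommendation now is to use .array or to_numpy(), because .values has the following drawbacks:

- When a Series contains an extension type, it is unclear whether .values returns a NumPy array or an ExtensionArray. In contrast, .array always returns an ExtensionArray and never copies data, while .to_numpy() always returns a NumPy array, possibly at the cost of copying and coercing values.
- When a DataFrame has mixed dtypes, .values may copy the data and coerce it to a common dtype, which is a relatively expensive operation. Because to_numpy() is a method, it makes clearer that the returned NumPy array may not be a view on the DataFrame's data.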
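
The dtype coercion for mixed columns can be checked with a short sketch. The two small DataFrames below are invented for illustration; only the resulting dtypes matter.

```python
import pandas as pd

# Integers mixed with floats: the common dtype is floating point
numeric = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
# Integers mixed with strings: the only common dtype is object
mixed = pd.DataFrame({"i": [1, 2], "s": ["x", "y"]})

numeric_array = numeric.to_numpy()
mixed_array = mixed.to_numpy()

print(numeric_array.dtype)  # float64
print(mixed_array.dtype)    # object
```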

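3. Accelerated operations

pandas uses the numexpr and bottleneck libraries to accelerate certain types of binary numerical and boolean operations.

These libraries are especially useful on large data sets, where they can give large speedups. numexpr uses smart chunking, caching, and multiple cores; bottleneck is a set of specialized cython routines that are especially fast on arrays containing NaNs.

Sample timings for DataFrames with 100 columns x 100,000 rows:

Operation	0.11.0 (ms)	Prior version (ms)	Ratio to prior
df1 > df2	13.32	125.35	0.1063
df1 * df2	21.71	36.63	0.5928
df1 + df2	22.04	36.50	0.6039

Both libraries are used by default when they are installed, and can be disabled with the following options: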

pd.set_option('compute.use_bottleneck', False)
pd.set_option('compute.use_numexpr', False)
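
If the setting should only change for part of a program, one option is pd.option_context, which restores the previous value on exit. This is a sketch of that pattern, not something the original post covers:

```python
import pandas as pd

# Remember the current setting so we can compare afterwards
before = pd.get_option("compute.use_numexpr")

# Inside the with-block numexpr acceleration is switched off;
# the previous value is restored automatically when the block ends
with pd.option_context("compute.use_numexpr", False):
    inside = pd.get_option("compute.use_numexpr")

after = pd.get_option("compute.use_numexpr")
print(inside)           # False
print(after == before)  # True
```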

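4. Flexible binary operations

For binary operations between pandas data structures, two points deserve attention:

- Broadcasting between higher-dimensional (e.g. DataFrame) and lower-dimensional (e.g. Series) objects
- Missing data in computations

4.1 Matching / broadcasting behavior

DataFrame provides the methods add(), sub(), mul(), div() and the related radd(), rsub(), ... for binary operations. With these methods, the axis parameter chooses whether a Series argument is matched on the index or on the columns: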

In [18]: df = pd.DataFrame(
   ....:     {
   ....:         "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
   ....:         "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
   ....:         "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
   ....:     }
   ....: )
   ....: 

In [19]: df
Out[19]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [20]: row = df.iloc[1]

In [21]: column = df["two"]

In [22]: df.sub(row, axis="columns")
Out[22]: 
        one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782

In [23]: df.sub(row, axis=1)
Out[23]: 
        one       two     three
a  1.051928 -0.139606       NaN
b  0.000000  0.000000  0.000000
c  0.352192 -0.433754  1.277825
d       NaN -1.632779 -0.562782

In [24]: df.sub(column, axis="index")
Out[24]: 
        one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516

In [25]: df.sub(column, axis=0)
Out[25]: 
        one  two     three
a -0.377535  0.0       NaN
b -1.569069  0.0 -1.962513
c -0.783123  0.0 -0.250933
d       NaN  0.0 -0.892516
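
Because the data above is random, here is the same idea as a sketch with small fixed numbers (invented for illustration), so the effect of the axis parameter can be verified by hand:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)
row = df.iloc[0]    # Series labeled by column names: one=1, two=10
column = df["two"]  # Series labeled by row labels: a=10, b=20, c=30

# Match the Series on the column labels: row is subtracted from every row
by_columns = df.sub(row, axis="columns")
# Match the Series on the row labels: column is subtracted from every column
by_index = df.sub(column, axis="index")

print(by_columns["two"].tolist())  # [0.0, 10.0, 20.0]
print(by_index["one"].tolist())    # [-9.0, -18.0, -27.0]
```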

A Series can also be aligned with one level of a MultiIndexed DataFrame:

In [26]: dfmi = df.copy()

In [27]: dfmi.index = pd.MultiIndex.from_tuples(
   ....:     [(1, "a"), (1, "b"), (1, "c"), (2, "a")], names=["first", "second"]
   ....: )
   ....: 

In [28]: dfmi.sub(column, axis=0, level="second")
Out[28]: 
                   one       two     three
first second                              
1     a      -0.377535  0.000000       NaN
      b      -1.569069  0.000000 -1.962513
      c      -0.783123  0.000000 -0.250933
2     a            NaN -1.493173 -2.385688

Series and Index support the built-in divmod() function, which performs floor division and modulo at the same time and returns a two-element tuple:

In [29]: s = pd.Series(np.arange(10))

In [30]: s
Out[30]: 
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

In [31]: div, rem = divmod(s, 3)

In [32]: div
Out[32]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    2
9    3
dtype: int64

In [33]: rem
Out[33]: 
0    0
1    1
2    2
3    0
4    1
5    2
6    0
7    1
8    2
9    0
dtype: int64

In [34]: idx = pd.Index(np.arange(10))

In [35]: idx
Out[35]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [36]: div, rem = divmod(idx, 3)

In [37]: div
Out[37]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

In [38]: rem
Out[38]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')

divmod() also works elementwise:

In [39]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

In [40]: div
Out[40]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    1
8    1
9    1
dtype: int64

In [41]: rem
Out[41]: 
0    0
1    1
2    2
3    0
4    0
5    1
6    1
7    2
8    2
9    3
dtype: int64

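4.2 Missing data and filling

The arithmetic methods of Series and DataFrame accept a fill_value parameter, a value to substitute when at most one of the values at a location is missing.

For example, when adding two DataFrames, a location that is NaN in both still sums to NaN; where only one DataFrame is missing the value, the NaN is replaced by fill_value before the addition.

fillna can also be used to replace every NaN with a chosen value.

We copy df, set the NaN in the first row, third column to 1, and save the result as df2:

>>> df2 = df.copy()

>>> df2.loc['a', 'three'] = 1.0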

In [42]: df
Out[42]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [43]: df2
Out[43]: 
        one       two     three
a  1.394981  1.772517  1.000000
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [44]: df + df2
Out[44]: 
        one       two     three
a  2.789963  3.545034       NaN
b  0.686107  3.824246 -0.100780
c  1.390491  2.956737  2.454870
d       NaN  0.558688 -1.226343

In [45]: df.add(df2, fill_value=0)
Out[45]: 
        one       two     three
a  2.789963  3.545034  1.000000
b  0.686107  3.824246 -0.100780
c  1.390491  2.956737  2.454870
d       NaN  0.558688 -1.226343
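
The three possible cases (no value missing, one missing, both missing) can be seen side by side in a minimal sketch with invented data:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

plain = left + right                    # any NaN operand gives NaN
filled = left.add(right, fill_value=0)  # a NaN on one side only is treated as 0

print(plain["x"].tolist())   # [11.0, nan, nan]
print(filled["x"].tolist())  # [11.0, 20.0, nan]
```

The last row stays NaN in both results because the value is missing on both sides.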

4.3 Comparison operations

Binary comparison operations also have method equivalents: eq, ne, lt, gt, le, and ge

Method	Meaning	Operator
eq	equal	==
ne	not equal	!=
lt	less than	<
le	less than or equal	<=
gt	greater than	>
ge	greater than or equal	>=

In [46]: df.gt(df2)
Out[46]: 
     one    two  three
a  False  False  False
b  False  False  False
c  False  False  False
d  False  False  False

In [47]: df2.ne(df)
Out[47]: 
     one    two  three
a  False  False   True
b  False  False  False
c  False  False  False
d   True  False  False

These methods and operators produce a pandas object of the same type as the left-hand operand, with boolean values; the result can be used for indexing.
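
A brief sketch of using such a boolean result for indexing (the data is made up for illustration): the mask produced by gt() selects the rows where the condition holds.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, -2.0, 3.0], "two": [-1.0, 5.0, 0.5]})

# gt(0) is the method form of df["one"] > 0 and returns a boolean Series
mask = df["one"].gt(0)

# Indexing with the mask keeps only the rows where it is True
positive_rows = df[mask]

print(mask.tolist())                 # [True, False, True]
print(positive_rows.index.tolist())  # [0, 2]
```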

4.4 Boolean reductions

The reductions empty, any(), all(), and bool() summarize boolean data:

In [48]: (df > 0).all()
Out[48]: 
one      False
two       True
three    False
dtype: bool

In [49]: (df > 0).any()
Out[49]: 
one      True
two      True
three    True
dtype: bool

The result above can be reduced further to a single boolean value:

In [50]: (df > 0).any().any()
Out[50]: True

Use the empty property to test whether a pandas object is empty:

In [51]: df.empty
Out[51]: False

In [52]: pd.DataFrame(columns=list("ABC")).empty
Out[52]: True

Use bool() to evaluate a single-element pandas object in a boolean context (note that bool() is deprecated as of pandas 2.1):

In [53]: pd.Series([True]).bool()
Out[53]: True

In [54]: pd.Series([False]).bool()
Out[54]: False

In [55]: pd.DataFrame([[True]]).bool()
Out[55]: True

In [56]: pd.DataFrame([[False]]).bool()
Out[56]: False

Note: do not write tests like the following:

>>> if df:
...     pass

>>> df and df2

Both raise a ValueError, because the truth value of a DataFrame is ambiguous.
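
The sketch below triggers the error on purpose and then shows two explicit alternatives. Which alternative is appropriate depends on what the check is meant to express; the names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# Using a DataFrame directly as a condition raises ValueError
try:
    if df:
        pass
    message = ""
except ValueError as exc:
    message = str(exc)

# Say explicitly what is being asked instead
has_rows = not df.empty                     # "does it contain any data?"
any_positive = bool((df > 0).any().any())   # "is any element positive?"

print("ambiguous" in message)  # True
print(has_rows, any_positive)  # True True
```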

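4.5 Comparing whether objects are equivalent

There are often several ways to compute the same result, for example df + df and df * 2. It is tempting to test their equality with (df + df == df * 2).all(), but that does not come out all True: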

In [57]: df + df == df * 2
Out[57]: 
     one   two  three
a   True  True  False
b   True  True   True
c   True  True   True
d  False  True   True

In [58]: (df + df == df * 2).all()
Out[58]: 
one      False
two       True
three    False
dtype: bool

Note: df + df == df * 2 contains some False values, because NaN does not compare equal to NaN

In [59]: np.nan == np.nan
Out[59]: False

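For this reason, N-dimensional pandas objects such as Series and DataFrame have an equals() method, which treats NaNs in corresponding locations as equal: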

In [60]: (df + df).equals(df * 2)
Out[60]: True

Note: the Series or DataFrame indexes must be in the same order for equals() to return True

In [61]: df1 = pd.DataFrame({"col": ["foo", 0, np.nan]})

In [62]: df2 = pd.DataFrame({"col": [np.nan, 0, "foo"]}, index=[2, 1, 0])

In [63]: df1.equals(df2)
Out[63]: False

In [64]: df1.equals(df2.sort_index())
Out[64]: True
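
equals() returns a single boolean. When an elementwise result that treats matching NaNs as equal is needed instead, one possible approach (a sketch, not part of the original post) is to combine == with isna():

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, np.nan]})
b = a.copy()

naive = a == b                               # NaN == NaN is False
nan_aware = (a == b) | (a.isna() & b.isna()) # also True where both are NaN

print(naive["x"].tolist())      # [True, False]
print(nan_aware["x"].tolist())  # [True, True]
print(a.equals(b))              # True
```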

4.6 Comparing array-like objects

Comparing a pandas data structure with a scalar value is done element by element:

In [65]: pd.Series(["foo", "bar", "baz"]) == "foo"
Out[65]: 
0     True
1    False
2    False
dtype: bool

In [66]: pd.Index(["foo", "bar", "baz"]) == "foo"
Out[66]: array([ True, False, False])

pandas also handles elementwise comparisons between different array-like objects of the same length:

In [67]: pd.Series(["foo", "bar", "baz"]) == pd.Index(["foo", "bar", "qux"])
Out[67]: 
0     True
1     True
2    False
dtype: bool

In [68]: pd.Series(["foo", "bar", "baz"]) == np.array(["foo", "bar", "qux"])
Out[68]: 
0     True
1     True
2    False
dtype: bool

Comparing Index or Series objects of different lengths raises a ValueError:

In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
ValueError: Series lengths must match to compare

In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
ValueError: Series lengths must match to compare

Note: this differs from NumPy, where such comparisons can broadcast:

In [69]: np.array([1, 2, 3]) == np.array([2])
Out[69]: array([False,  True, False])

If broadcasting is not possible, NumPy returns False (in the NumPy version used for this session):

In [70]: np.array([1, 2, 3]) == np.array([1, 2])
Out[70]: False
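
The sketch below reproduces the length-mismatch error with the == operator, and then shows one possible alternative: the eq() method, which aligns the two Series on their index labels first. Labels missing from one side compare as False.

```python
import pandas as pd

a = pd.Series(["foo", "bar", "baz"])
b = pd.Series(["foo", "bar"])

# The == operator refuses to compare Series whose indexes differ
try:
    a == b
    raised = False
except ValueError:
    raised = True

# eq() aligns on the index; label 2 exists only in a, so it compares False
aligned = a.eq(b)

print(raised)            # True
print(aligned.tolist())  # [True, True, False]
```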

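4.7 Combining overlapping data sets

Sometimes two similar data sets need to be combined, where values in one are preferred over the other.

One example is two data series representing a particular economic indicator, where one is considered to be of "higher quality". The lower-quality series, however, might extend further back in history or have more complete data coverage.

We would therefore like to combine two DataFrame objects so that missing values in one DataFrame are conditionally filled with like-labeled values from the other.

The function that implements this operation is combine_first(), illustrated below: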

In [71]: df1 = pd.DataFrame(
   ....:     {"A": [1.0, np.nan, 3.0, 5.0, np.nan], "B": [np.nan, 2.0, 3.0, np.nan, 6.0]}
   ....: )
   ....: 

In [72]: df2 = pd.DataFrame(
   ....:     {
   ....:         "A": [5.0, 2.0, 4.0, np.nan, 3.0, 7.0],
   ....:         "B": [np.nan, np.nan, 3.0, 4.0, 6.0, 8.0],
   ....:     }
   ....: )
   ....: 

In [73]: df1
Out[73]: 
     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  3.0
3  5.0  NaN
4  NaN  6.0

In [74]: df2
Out[74]: 
     A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0

In [75]: df1.combine_first(df2)
Out[75]: 
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0
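
The economic-indicator scenario can be sketched with two Series indexed by year. The numbers and names below are invented for illustration. The preferred series wins wherever it has a value, and the longer series fills the gaps and the earlier year.

```python
import numpy as np
import pandas as pd

# Preferred source: shorter history, with some gaps
high_quality = pd.Series([np.nan, 2.0, np.nan], index=[2001, 2002, 2003])
# Fallback source: longer history, complete coverage
long_history = pd.Series([0.9, 1.0, 2.5, 3.0], index=[2000, 2001, 2002, 2003])

merged = high_quality.combine_first(long_history)

print(merged.index.tolist())  # [2000, 2001, 2002, 2003]
print(merged.tolist())        # [0.9, 1.0, 2.0, 3.0]
```

Note that 2002 keeps the value 2.0 from the preferred series even though the fallback has 2.5 for that year.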

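4.8 General DataFrame combine

The combine_first() method above calls the more general DataFrame.combine().

This method takes another DataFrame and a combiner function, aligns the two DataFrames, and then passes the combiner function pairs of Series (that is, columns with the same name).

The following code reproduces combine_first():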

In [76]: def combiner(x, y):
   ....:     return np.where(pd.isna(x), y, x)
   ....: 

In [77]: df1.combine(df2, combiner)
Out[77]: 
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0
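
combine() is not limited to filling gaps: any function that takes two Series and returns a Series or array can be passed. As a small sketch with invented data, np.minimum keeps the smaller value at each position:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"A": [5, 0], "B": [2, 4]})
right = pd.DataFrame({"A": [1, 1], "B": [3, 3]})

# The combiner is called once per column with the two aligned Series
smallest = left.combine(right, np.minimum)

print(smallest["A"].tolist())  # [1, 0]
print(smallest["B"].tolist())  # [2, 3]
```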
