Python Data Processing (7): pandas Descriptive Statistics

Author: 名本无名 | Published: 2021-02-08 20:00

5. Descriptive statistics

Series and DataFrame provide a large number of methods for computing descriptive statistics and related operations.

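Aggregations such as sum(), mean(), and quantile() reduce the data to a lower-dimensional result, whereas functions such as cumsum() and cumprod() return an object of the same size as the original.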

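In general, these methods take an axis argument, given either as an integer (0 or 1) or as an axis name ("index" or "columns"), that specifies the dimension to operate along: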

  • Series: no axis argument is needed
  • DataFrame: "index" (axis=0, the default) or "columns" (axis=1)

For example:

In [78]: df
Out[78]: 
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [79]: df.mean(0)
Out[79]: 
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [80]: df.mean(1)
Out[80]: 
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64
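
As a small sketch of the axis argument, the following uses a made-up frame (the values and the names scores, by_column, and by_row are illustrative and not taken from the session above) to show that the integer and the named form of an axis select the same dimension.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
scores = pd.DataFrame(
    {"one": [1.0, 2.0, np.nan], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# axis=0 or axis="index": reduce down the rows, one result per column.
by_column = scores.mean(axis="index")

# axis=1 or axis="columns": reduce across the columns, one result per row.
by_row = scores.mean(axis="columns")

print(by_column)
print(by_row)
```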

All of these methods take a skipna argument that specifies whether missing values are excluded (True by default):

In [81]: df.sum(0, skipna=False)
Out[81]: 
one           NaN
two      5.442353
three         NaN
dtype: float64

In [82]: df.sum(axis=1, skipna=True)
Out[82]: 
a    3.167498
b    2.204786
c    3.401050
d   -0.333828
dtype: float64
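
The effect of skipna is easiest to see on a single short Series. The sketch below uses made-up values; it also shows min_count, an argument of sum() that the text above does not cover, which controls what an all-missing input returns.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])  # hypothetical data

total_skipping = values.sum()             # skipna=True is the default
total_strict = values.sum(skipna=False)   # a single NaN makes the result NaN

# With every value missing, the default sum is 0.0; min_count=1 asks for
# at least one valid value and yields NaN otherwise.
all_missing = pd.Series([np.nan, np.nan])
empty_total = all_missing.sum()
empty_total_strict = all_missing.sum(min_count=1)

print(total_skipping, total_strict, empty_total, empty_total_strict)
```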

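Combined with broadcasting and arithmetic operations, these methods express many statistical procedures very concisely. Standardization (rescaling the data to zero mean and a standard deviation of 1) is one example: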

In [83]: ts_stand = (df - df.mean()) / df.std()

In [84]: ts_stand.std()
Out[84]: 
one      1.0
two      1.0
three    1.0
dtype: float64

In [85]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [86]: xs_stand.std(1)
Out[86]: 
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
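
The same two standardizations can be checked on reproducible data. This is a sketch under assumptions: the seed, the frame shape, and the names data, col_stand, and row_stand are chosen for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
data = pd.DataFrame(rng.normal(size=(6, 3)), columns=["one", "two", "three"])

# Column-wise: the Series data.mean() aligns on the column labels.
col_stand = (data - data.mean()) / data.std()

# Row-wise: the per-row statistics must be aligned on the index (axis=0).
row_stand = data.sub(data.mean(axis=1), axis=0).div(data.std(axis=1), axis=0)

print(col_stand.mean().round(12))
print(row_stand.std(axis=1))
```

Both results should have a mean of (numerically) zero and a standard deviation of one along the standardized axis.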

Note that cumsum() and cumprod() preserve the positions of NaN values:

In [87]: df.cumsum()
Out[87]: 
        one       two     three
a  1.394981  1.772517       NaN
b  1.738035  3.684640 -0.050390
c  2.433281  5.163008  1.177045
d       NaN  5.442353  0.563873
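
A minimal sketch of this behavior on made-up values: with the default skipna=True the NaN stays in place and the running total continues after it, while skipna=False lets the NaN propagate to every later position.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])  # hypothetical data

running = s.cumsum()                      # 1.0, NaN, 3.0
running_strict = s.cumsum(skipna=False)   # 1.0, NaN, NaN

print(running)
print(running_strict)
```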

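If the data has a MultiIndex, the level argument specifies the level to aggregate over. (The level argument of these reductions was deprecated in pandas 1.3 and removed in 2.0, so the calls below require an older release.)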

>>> df = pd.DataFrame(np.random.randint(80, 120, size=(2, 4)), index=['girl', 'boy'],
...                   columns=[['English', 'English', 'Chinese', 'Chinese'],
...                            ['like', 'dislike', 'like', 'dislike']])

>>> df
     English         Chinese        
        like dislike    like dislike
girl     108     104     115     102
boy       94     109     105      92

>>> df.mean(level=0, axis=1)
      English  Chinese
girl    106.0    108.5
boy     101.5     98.5

>>> df.columns.names = ['language', 'like']

>>> df.mean(level='language', axis=1)
language  English  Chinese
girl        106.0    108.5
boy         101.5     98.5
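
On pandas versions where reductions no longer accept level, one possible equivalent is to group on the column level. The sketch below rebuilds the frame from the values printed above (rather than from random numbers) and uses a transpose so that the grouping runs over index levels; the name mean_by_language is illustrative.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["English", "Chinese"], ["like", "dislike"]], names=["language", "like"]
)
df = pd.DataFrame(
    [[108, 104, 115, 102], [94, 109, 105, 92]],
    index=["girl", "boy"],
    columns=columns,
)

# Transpose so the column levels become index levels, group on one level,
# take the mean, and transpose back.
mean_by_language = df.T.groupby(level="language").mean().T
print(mean_by_language)
```

The resulting columns come out in sorted order, so they may be arranged differently from the output shown above.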

The commonly used summary functions are listed below:

[Image: table of common summary functions]

Note: some NumPy functions, such as mean, std, and sum, skip missing values by default when they are given a Series:

In [88]: np.mean(df["one"])
Out[88]: 0.8110935116651192

In [89]: np.mean(df["one"].to_numpy())
Out[89]: nan
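
The difference comes from what is being reduced: a Series or a plain ndarray. A sketch on made-up values, which also shows np.nanmean as NumPy's own NaN-aware alternative (not mentioned in the text above):

```python
import numpy as np
import pandas as pd

col = pd.Series([1.0, np.nan, 3.0])  # hypothetical data

via_series = np.mean(col)                   # Series input: NaN is skipped
via_array = np.mean(col.to_numpy())         # plain ndarray: NaN propagates
via_nanmean = np.nanmean(col.to_numpy())    # NaN-aware NumPy function

print(via_series, via_array, via_nanmean)
```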

Series.nunique() returns the number of unique non-NaN values in a Series:

In [90]: series = pd.Series(np.random.randn(500))

In [91]: series[20:500] = np.nan

In [92]: series[10:20] = 5

In [93]: series.nunique()
Out[93]: 11
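
To count NaN as a value of its own, nunique() accepts dropna=False. A short sketch on made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 5, 7, np.nan, np.nan])  # hypothetical data

distinct = s.nunique()                        # NaN ignored: 5 and 7
distinct_with_nan = s.nunique(dropna=False)   # NaN counted once as well

print(distinct, distinct_with_nan)
```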

5.1 Summarizing data: describe

The describe() function computes a variety of summary statistics for a Series or for each column of a DataFrame (NaN values are, of course, excluded):

In [94]: series = pd.Series(np.random.randn(1000))

In [95]: series[::2] = np.nan

In [96]: series.describe()
Out[96]: 
count    500.000000
mean      -0.021292
std        1.015906
min       -2.683763
25%       -0.699070
50%       -0.069718
75%        0.714483
max        3.160915
dtype: float64

In [97]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [98]: frame.iloc[::2] = np.nan

In [99]: frame.describe()
Out[99]: 
                a           b           c           d           e
count  500.000000  500.000000  500.000000  500.000000  500.000000
mean     0.033387    0.030045   -0.043719   -0.051686    0.005979
std      1.017152    0.978743    1.025270    1.015988    1.006695
min     -3.000951   -2.637901   -3.303099   -3.159200   -3.188821
25%     -0.647623   -0.576449   -0.712369   -0.691338   -0.691115
50%      0.047578   -0.021499   -0.023888   -0.032652   -0.025363
75%      0.729907    0.775880    0.618896    0.670047    0.649748
max      2.740139    2.752332    3.004229    2.728702    3.240991

You can choose specific percentiles to include in the output:

In [100]: series.describe(percentiles=[0.05, 0.25, 0.75, 0.95])
Out[100]: 
count    500.000000
mean      -0.021292
std        1.015906
min       -2.683763
5%        -1.645423
25%       -0.699070
50%       -0.069718
75%        0.714483
95%        1.711409
max        3.160915
dtype: float64

By default, the median is always included.
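
A reproducible sketch of percentile selection, with an arbitrary seed. The median (0.5) is listed explicitly here so that the example does not rely on the default described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed, chosen arbitrarily
series = pd.Series(rng.normal(size=1000))
series[::2] = np.nan            # blank out every other value

summary = series.describe(percentiles=[0.05, 0.5, 0.95])
print(summary)
```

The count row reports only the 500 non-missing values, and each requested percentile appears as its own labeled row.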

For a non-numeric Series, describe() gives a simple summary: the count, the number of unique values, the most frequent value, and how often it occurs.

In [101]: s = pd.Series(["a", "a", "b", "b", "a", "a", np.nan, "c", "d", "a"])

In [102]: s.describe()
Out[102]: 
count     9
unique    4
top       a
freq      5
dtype: object

Note: on a mixed-type DataFrame, describe() restricts the summary to the numeric columns; if there are none, it summarizes only the categorical columns.

In [103]: frame = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

In [104]: frame.describe()
Out[104]: 
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

The include and exclude arguments take lists of dtypes to include in or exclude from the summary, and the special value include="all" covers every column:

In [105]: frame.describe(include=["object"])
Out[105]: 
         a
count    4
unique   2
top     No
freq     2

In [106]: frame.describe(include=["number"])
Out[106]: 
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

In [107]: frame.describe(include="all")
Out[107]: 
          a         b
count     4  4.000000
unique    2       NaN
top      No       NaN
freq      2       NaN
mean    NaN  1.500000
std     NaN  1.290994
min     NaN  0.000000
25%     NaN  0.750000
50%     NaN  1.500000
75%     NaN  2.250000
max     NaN  3.000000
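
The session above only uses include. As a complementary sketch on the same small frame, exclude selects columns by what they are not, which avoids naming the dtype of the text column:

```python
import pandas as pd

frame = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

non_numeric = frame.describe(exclude=["number"])  # summary of column "a"
numeric = frame.describe(include=["number"])      # summary of column "b"

print(non_numeric)
print(numeric)
```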

5.2 Index of the minimum and maximum values

The idxmin() and idxmax() methods on Series and DataFrame return the index labels of the minimum and maximum values:

In [108]: s1 = pd.Series(np.random.randn(5))

In [109]: s1
Out[109]: 
0    1.118076
1   -0.352051
2   -1.242883
3   -1.277155
4   -0.641184
dtype: float64

In [110]: s1.idxmin(), s1.idxmax()
Out[110]: (3, 0)

In [111]: df1 = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])

In [112]: df1
Out[112]: 
          A         B         C
0 -0.327863 -0.946180 -0.137570
1 -0.186235 -0.257213 -0.486567
2 -0.507027 -0.871259 -0.111110
3  2.000339 -2.430505  0.089759
4 -0.321434 -0.033695  0.096271

In [113]: df1.idxmin(axis=0)
Out[113]: 
A    2
B    3
C    1
dtype: int64

In [114]: df1.idxmax(axis=1)
Out[114]: 
0    C
1    A
2    C
3    A
4    C
dtype: object

When several entries tie for the minimum or maximum, idxmin() and idxmax() return the first matching index:

In [115]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=["A"], index=list("edcba"))

In [116]: df3
Out[116]: 
     A
e  2.0
d  1.0
c  1.0
b  3.0
a  NaN

In [117]: df3["A"].idxmin()
Out[117]: 'd'
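
If every tied label is needed rather than just the first, one option is a boolean mask against the minimum. A sketch reusing the values from df3 above (the names first_min and all_min are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 1, 1, 3, np.nan], index=list("edcba"))

first_min = s.idxmin()                     # first label holding the minimum
all_min = s.index[s == s.min()].tolist()   # every label holding the minimum

print(first_min, all_min)
```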

5.3 Value counts and mode

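value_counts(), available both as a Series method and as a top-level function, counts how often each value occurs in a Series or a one-dimensional array: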

In [118]: data = np.random.randint(0, 7, size=50)

In [119]: data
Out[119]: 
array([6, 6, 2, 3, 5, 3, 2, 5, 4, 5, 4, 3, 4, 5, 0, 2, 0, 4, 2, 0, 3, 2,
       2, 5, 6, 5, 3, 4, 6, 4, 3, 5, 6, 4, 3, 6, 2, 6, 6, 2, 3, 4, 2, 1,
       6, 2, 6, 1, 5, 4])

In [120]: s = pd.Series(data)

In [121]: s.value_counts()
Out[121]: 
2    10
6    10
4     9
3     8
5     8
0     3
1     2
dtype: int64

In [122]: pd.value_counts(data)
Out[122]: 
2    10
6    10
4     9
3     8
5     8
0     3
1     2
dtype: int64
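
Two optional arguments of Series.value_counts() that the text above does not cover are often useful: normalize returns relative frequencies, and dropna controls whether missing values get their own entry. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series(["x", "y", "x", np.nan, "x"])  # hypothetical data

counts = s.value_counts()                    # NaN is dropped by default
shares = s.value_counts(normalize=True)      # fractions of the non-missing values
with_missing = s.value_counts(dropna=False)  # NaN gets its own entry

print(counts)
print(shares)
print(with_missing)
```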

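On a DataFrame, the value_counts() method counts combinations of values across multiple columns. By default all columns are used, but a subset can be selected with the subset argument: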

In [123]: data = {"a": [1, 2, 3, 4], "b": ["x", "x", "y", "y"]}

In [124]: frame = pd.DataFrame(data)

In [125]: frame.value_counts()
Out[125]: 
a  b
1  x    1
2  x    1
3  y    1
4  y    1
dtype: int64
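
The session above uses every column, so each combination occurs once. A sketch of the subset argument on a made-up frame with a third column c that is left out of the counting:

```python
import pandas as pd

# Hypothetical data for illustration only.
frame = pd.DataFrame(
    {
        "a": [1, 1, 2, 2],
        "b": ["x", "x", "y", "z"],
        "c": [0.1, 0.2, 0.3, 0.4],
    }
)

# Count combinations of "a" and "b" only; "c" is ignored.
pair_counts = frame.value_counts(subset=["a", "b"])
print(pair_counts)
```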

Similarly, you can get the mode (the most frequently occurring value or values) of a Series or DataFrame:

In [126]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [127]: s5.mode()
Out[127]: 
0    3
1    7
dtype: int64

In [128]: df5 = pd.DataFrame(
   .....:     {
   .....:         "A": np.random.randint(0, 7, size=50),
   .....:         "B": np.random.randint(-10, 15, size=50),
   .....:     }
   .....: )
   .....: 

In [129]: df5.mode()
Out[129]: 
     A   B
0  1.0  -9
1  NaN  10
2  NaN  13
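
The NaN entries in the DataFrame output arise because each column can have a different number of modes, and shorter columns are padded. A deterministic sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 4, 5]})  # hypothetical data

# "A" has one mode (1); in "B" every value ties, so all three are modes.
modes = df.mode()
print(modes)
```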

5.4 Discretization and quantiling

Continuous values can be discretized with cut() (bins based on values) and qcut() (bins based on sample quantiles):

In [130]: arr = np.random.randn(20)

In [131]: factor = pd.cut(arr, 4)

In [132]: factor
Out[132]: 
[(-0.251, 0.464], (-0.968, -0.251], (0.464, 1.179], (-0.251, 0.464], (-0.968, -0.251], ..., (-0.251, 0.464], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251]]
Length: 20
Categories (4, interval[float64]): [(-0.968, -0.251] < (-0.251, 0.464] < (0.464, 1.179] <
                                    (1.179, 1.893]]

In [133]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [134]: factor
Out[134]: 
[(0, 1], (-1, 0], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (-1, 0], (-1, 0]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
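
Two further options of cut() are labels, which names the bins, and right, which controls which edge of each bin is closed. The sketch below uses made-up ages and bin edges to show how values that fall exactly on an edge are assigned.

```python
import numpy as np
import pandas as pd

ages = np.array([1, 5, 10, 18, 40])   # hypothetical data
bins = [0, 5, 18, 100]
names = ["child", "teen", "adult"]

# Default right=True: bins are (0, 5], (5, 18], (18, 100].
right_closed = pd.cut(ages, bins, labels=names)

# right=False: bins are [0, 5), [5, 18), [18, 100).
left_closed = pd.cut(ages, bins, labels=names, right=False)

print(list(right_closed))
print(list(left_closed))
```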

qcut() computes sample quantiles. For example, we can split some normally distributed data into four quartile bins of (nearly) equal size:

In [135]: arr = np.random.randn(30)

In [136]: factor = pd.qcut(arr, [0, 0.25, 0.5, 0.75, 1])

In [137]: factor
Out[137]: 
[(0.569, 1.184], (-2.278, -0.301], (-2.278, -0.301], (0.569, 1.184], (0.569, 1.184], ..., (-0.301, 0.569], (1.184, 2.346], (1.184, 2.346], (-0.301, 0.569], (-2.278, -0.301]]
Length: 30
Categories (4, interval[float64]): [(-2.278, -0.301] < (-0.301, 0.569] < (0.569, 1.184] <
                                    (1.184, 2.346]]

In [138]: pd.value_counts(factor)
Out[138]: 
(-2.278, -0.301]    8
(1.184, 2.346]      8
(-0.301, 0.569]     7
(0.569, 1.184]      7
dtype: int64
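
qcut() also accepts an integer number of quantiles in place of an explicit list. A deterministic sketch on evenly spaced made-up data, where the four bins come out exactly equal; labels=False returns integer bin codes instead of intervals.

```python
import numpy as np
import pandas as pd

arr = np.arange(20.0)  # hypothetical data: 0.0, 1.0, ..., 19.0

quartiles = pd.qcut(arr, 4)                 # same as [0, 0.25, 0.5, 0.75, 1]
sizes = pd.Series(quartiles).value_counts()

codes = pd.qcut(arr, 4, labels=False)       # integer bin codes 0 to 3

print(sizes)
print(codes)
```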

We can also pass infinite values to define the bins:

In [139]: arr = np.random.randn(20)

In [140]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [141]: factor
Out[141]: 
[(-inf, 0.0], (0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (0.0, inf], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
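
Because bins are closed on the right by default, a value of exactly 0 lands in (-inf, 0]. A sketch with made-up values and illustrative labels:

```python
import numpy as np
import pandas as pd

arr = np.array([-2.5, 0.0, 0.1, 3.0])  # hypothetical data

sign = pd.cut(arr, [-np.inf, 0, np.inf], labels=["non-positive", "positive"])
print(list(sign))
```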
