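Pandas_Select_Data_Sample: Random Sampling

Use the sample() method to randomly select rows or columns from a Series or DataFrame. By default the method samples rows. It accepts either a specific number of rows/columns to return (n) or a fraction of the rows (frac).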

import pandas as pd
import numpy as np

# Load the iris dataset (assumes iris.csv is in the working directory)
iris = pd.read_csv('iris.csv')
iris.head()

out:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

By default, sample() returns a single row.

iris.sample()

out:
     sepal_length  sepal_width  petal_length  petal_width    species
121           5.6          2.8           4.9          2.0  virginica

Parameter n

Set n to return more than one row.

iris.sample(n=3)

out:
    sepal_length  sepal_width  petal_length  petal_width     species
34           4.9          3.1           1.5          0.2      setosa
89           5.5          2.5           4.0          1.3  versicolor
36           5.5          3.5           1.3          0.2      setosa
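
sample() can also take a fraction of the rows instead of a count. The following is a minimal sketch of that usage through the frac parameter, run on a small made-up DataFrame (df and its columns are illustrative stand-ins, not part of the iris data):

```python
import pandas as pd

# Hypothetical stand-in data; any DataFrame behaves the same way.
df = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})

# frac=0.3 asks for 30% of the rows (3 of the 10 rows here).
subset = df.sample(frac=0.3)
print(len(subset))

# frac=1 returns every row in random order, which amounts to a shuffle.
shuffled = df.sample(frac=1)
print(sorted(shuffled.index) == list(df.index))

# n and frac are alternatives; passing both raises a ValueError.
try:
    df.sample(n=2, frac=0.5)
    both_rejected = False
except ValueError:
    both_rejected = True
print(both_rejected)
```

Using frac is convenient when the sample size should scale with the data, for example when holding out a fixed share of rows.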

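Parameter replace

By default, sample() returns each row at most once (replace=False). Pass replace=True to sample with replacement, so that the same row can appear more than once.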

iris.sample(n=20, replace=False).index.value_counts()

out:
42     1
29     1
131    1
68     1
101    1
130    1
69     1
104    1
105    1
138    1
77     1
47     1
87     1
49     1
115    1
132    1
6      1
57     1
72     1
66     1
dtype: int64

iris.sample(n=20, replace=True).index.value_counts()

out:
116    2
31     1
140    1
98     1
88     1
132    1
70     1
71     1
8      1
139    1
68     1
148    1
80     1
58     1
51     1
117    1
54     1
24     1
64     1
dtype: int64
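
One practical consequence of replace is that a sample larger than the data is only possible with replacement. A small sketch of this, using an illustrative five-element Series (the names s, oversample_ok and boot are made up for this example):

```python
import pandas as pd

s = pd.Series(range(5))

# Without replacement, asking for more rows than exist fails.
try:
    s.sample(n=8)
    oversample_ok = True
except ValueError as err:
    oversample_ok = False
    print("ValueError:", err)

# With replacement the same request works; some rows must repeat.
boot = s.sample(n=8, replace=True)
print(boot.index.value_counts())
```

Drawing 8 items from 5 with replacement necessarily repeats at least one index, which is the behavior bootstrap-style resampling relies on.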

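Parameter weights

By default every row has the same probability of being selected. To give rows different probabilities, pass sampling weights to sample() through the weights parameter. The weights can be a list, a NumPy array, or a Series, but their length must match the length of the object being sampled. Missing values are treated as a weight of zero, and inf values are not allowed. If the weights do not sum to 1, they are renormalized by dividing each weight by the sum of all weights.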

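s = pd.Series(list(range(5)))
# These weights already sum to 1
s_weights = [.1, .2, .3, .2, .2]
# These sum to 1.3, so they are renormalized; items 2 and 3 have zero weight and are never drawn
s_weights2 = [.5, .6, 0, 0, .2]
s.sample(n=4, weights=s_weights)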

out:
2    2
4    4
1    1
3    3
dtype: int64

s.sample(n=2, weights=s_weights2)

out:
0    0
1    1
dtype: int64
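
As a rough check of the renormalization rule, the sketch below divides the unnormalized weights by their sum and compares the result with the frequencies observed over many draws with replacement. The names raw, probs and observed are illustrative, and the seed is fixed only to make the run repeatable:

```python
import numpy as np
import pandas as pd

s = pd.Series(range(5))

# Same unnormalized weights as s_weights2; they sum to 1.3.
raw = np.array([.5, .6, 0, 0, .2])

# Expected selection probabilities after dividing by the sum.
probs = raw / raw.sum()
print(probs.round(3))

# Draw many items with replacement and measure how often each appears.
draws = s.sample(n=10_000, replace=True, weights=raw, random_state=0)
observed = (
    draws.value_counts(normalize=True)
    .reindex(s.index, fill_value=0.0)
    .sort_index()
)
print(observed.round(3))
```

The observed frequencies should land close to probs, and the zero-weight items 2 and 3 should not appear at all.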

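When sampling from a DataFrame, you can use one of its columns as the sampling weights by passing the column name as a string (provided you are sampling rows, not columns).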

iris.sample(n=10, weights='petal_width')

out:
     sepal_length  sepal_width  petal_length  petal_width     species
31            5.4          3.4           1.5          0.4      setosa
118           7.7          2.6           6.9          2.3   virginica
72            6.3          2.5           4.9          1.5  versicolor
136           6.3          3.4           5.6          2.4   virginica
29            4.7          3.2           1.6          0.2      setosa
101           5.8          2.7           5.1          1.9   virginica
126           6.2          2.8           4.8          1.8   virginica
144           6.7          3.3           5.7          2.5   virginica
23            5.1          3.3           1.7          0.5      setosa
99            5.7          2.8           4.1          1.3  versicolor
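
The effect of a weight column is easier to see on a tiny example. The sketch below uses a made-up DataFrame (the column names item and w are hypothetical) in which two rows carry zero weight, so only the other two can ever be selected:

```python
import pandas as pd

# Hypothetical data: rows "a" and "d" have zero weight.
df = pd.DataFrame({"item": list("abcd"), "w": [0.0, 1.0, 3.0, 0.0]})

# Pass the column name as a string; the column values act as row weights.
picked = df.sample(n=200, replace=True, weights="w", random_state=1)

# Only "b" and "c" can show up, with "c" roughly three times as often.
print(picked["item"].value_counts())
```

The same reasoning applies to the iris example: rows with a larger petal_width are more likely to be drawn.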

Parameter axis

sample() can also sample columns instead of rows when you set the axis parameter.

iris.sample(n=2, axis=1).head()

out:
   sepal_length species
0           5.1  setosa
1           4.9  setosa
2           4.7  setosa
3           4.6  setosa
4           5.0  setosa
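
A short sketch of column sampling on a small made-up DataFrame (df and its column names are illustrative). It shows that axis=1 keeps all rows and picks a random subset of columns, and that the string form axis="columns" is accepted as well:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
    "d": [10, 11, 12],
})

# Pick 2 of the 4 columns at random; every row is kept.
cols = df.sample(n=2, axis=1)
print(cols)

# axis="columns" is an equivalent spelling of axis=1.
same_axis = df.sample(n=2, axis="columns")
print(same_axis.shape)
```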

Parameter random_state

Use the random_state parameter to seed the random number generator that sample() uses. It accepts an integer (used as the seed) or a NumPy RandomState object.

iris.sample(n=2, random_state=3)

out:
    sepal_length  sepal_width  petal_length  petal_width species
47           4.6          3.2           1.4          0.2  setosa
3            4.6          3.1           1.5          0.2  setosa

iris.sample(n=2, random_state=5)

out:
     sepal_length  sepal_width  petal_length  petal_width     species
82            5.8          2.7           3.9          1.2  versicolor
134           6.1          2.6           5.6          1.4   virginica
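
The point of fixing random_state is reproducibility. The sketch below, on an illustrative 100-row DataFrame (the names df, a, b and rs are made up), checks that the same integer seed yields the same sample and that a freshly created RandomState with that seed behaves the same way:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# The same integer seed gives the same rows on every call.
a = df.sample(n=5, random_state=3)
b = df.sample(n=5, random_state=3)
print(a.equals(b))

# A fresh RandomState built from the same seed selects the same rows.
rs = df.sample(n=5, random_state=np.random.RandomState(3))
print(a.equals(rs))
```

Note that a RandomState object carries internal state, so reusing one instance across several calls advances it and generally yields a different sample each time.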