Splitting a Dataset by Hand
1. Random numbers
2. Hash table
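Before reaching for a library, both hand-rolled approaches above can be sketched in a few lines. The function names below are illustrative, not from any library: a seeded random permutation gives a reproducible split, while hashing a stable record id keeps each record's train/test membership fixed even after the dataset is re-shuffled or extended.

```python
import hashlib

import numpy as np


def split_by_random(data, test_ratio, seed=42):
    """Shuffle indices with a fixed seed, then cut off a test slice."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))
    n_test = int(len(data) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [data[i] for i in train_idx], [data[i] for i in test_idx]


def split_by_hash(data, ids, test_ratio):
    """Route each record by hashing its stable id: membership survives
    re-runs and newly appended rows, unlike a fresh random shuffle."""
    def in_test(identifier):
        # Last byte of the MD5 digest is uniform over 0..255.
        digest = hashlib.md5(str(identifier).encode()).digest()
        return digest[-1] < 256 * test_ratio

    train, test = [], []
    for row, identifier in zip(data, ids):
        (test if in_test(identifier) else train).append(row)
    return train, test
```

The random split gives exact set sizes; the hash split only approximates `test_ratio`, trading precision for stability across runs.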
Tools
1. sklearn.model_selection.train_test_split
Signature: train_test_split(*arrays, **options)
Docstring:
Split arrays or matrices into random train and test subsets.
Quick utility that wraps input validation and
next(ShuffleSplit().split(X, y)) and application to input data
into a single call for splitting (and optionally subsampling) data in a
one-liner.
Read more in the :ref:`User Guide <cross_validation>`.
Parameters
arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse
matrices or pandas dataframes.
test_size : float, int or None, optional (default=0.25)
If float, it is the proportion of the dataset to include in the test
split; if int, the absolute number of test samples; if None, the value
is set to the complement of the train size. Defaults to 0.25.
train_size : float, int, or None (default=None)
Same as test_size, but for the train split.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
if RandomState instance, random_state is the random number generator;
if None, the random number generator is the RandomState instance used
by np.random.
shuffle : boolean, optional (default=True)
Whether to shuffle the data before splitting. If shuffle=False,
stratify must be None.
stratify : array-like or None (default=None)
If not None, data is split in a stratified fashion, using this as
the class labels.
Returns
splitting : list, length=2 * len(arrays)
List containing train-test split of inputs.
.. versionadded:: 0.16
If the input is sparse, the output will be a
scipy.sparse.csr_matrix. Else, output type is the same as the
input type.
Examples
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
[0, 1],
[6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
[8, 9]])
>>> y_test
[1, 4]
>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
File: d:\anaconda3\lib\site-packages\sklearn\model_selection\_split.py
Type: function
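The stratify parameter described above is not exercised by the docstring's examples, so here is a small sketch of it: passing the label vector as stratify=y makes both splits keep the original class ratio, which matters for imbalanced data. All names below are local to the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# An imbalanced label vector: 80 samples of class 0, 20 of class 1.
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Both splits preserve the 4:1 class ratio of y.
print("train class counts:", np.bincount(y_train))
print("test class counts:", np.bincount(y_test))
```

Without stratify=y, a small test split can easily end up with too few (or zero) samples of the minority class.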
2. sklearn.model_selection.StratifiedShuffleSplit
Init signature:
StratifiedShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=None)
Docstring:
Stratified ShuffleSplit cross-validator
Provides train/test indices to split data in train/test sets.
This cross-validation object is a merge of StratifiedKFold and
ShuffleSplit, which returns stratified randomized folds. The folds
are made by preserving the percentage of samples for each class.
Note: like the ShuffleSplit strategy, stratified random splits
do not guarantee that all folds will be different, although this is
still very likely for sizeable datasets.
Read more in the :ref:`User Guide <cross_validation>`.
Parameters
n_splits : int, default 10
Number of re-shuffling & splitting iterations.
test_size : float, int, None, optional
If float, should be between 0.0 and 1.0 and represent the proportion
of the dataset to include in the test split. If int, represents the
absolute number of test samples. If None, the value is set to the
complement of the train size. By default, the value is set to 0.1.
The default will change in version 0.21. It will remain 0.1 only
if train_size is unspecified, otherwise it will complement
the specified train_size.
train_size : float, int, or None, default is None
If float, should be between 0.0 and 1.0 and represent the
proportion of the dataset to include in the train split. If
int, represents the absolute number of train samples. If None,
the value is automatically set to the complement of the test size.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
if RandomState instance, random_state is the random number generator;
if None, the random number generator is the RandomState instance used
by np.random.
Examples
>>> import numpy as np
>>> from sklearn.model_selection import StratifiedShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 0, 1, 1, 1])
>>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
>>> sss.get_n_splits(X, y)
5
>>> print(sss) # doctest: +ELLIPSIS
StratifiedShuffleSplit(n_splits=5, random_state=0, ...)
>>> for train_index, test_index in sss.split(X, y):
... print("TRAIN:", train_index, "TEST:", test_index)
... X_train, X_test = X[train_index], X[test_index]
... y_train, y_test = y[train_index], y[test_index]
TRAIN: [5 2 3] TEST: [4 1 0]
TRAIN: [5 1 4] TEST: [0 2 3]
TRAIN: [5 0 2] TEST: [4 3 1]
TRAIN: [4 1 0] TEST: [2 3 5]
TRAIN: [0 5 1] TEST: [3 4 2]
File: d:\anaconda3\lib\site-packages\sklearn\model_selection\_split.py
Type: ABCMeta
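In practice the most common use of StratifiedShuffleSplit is to produce a single stratified split rather than several folds. A minimal sketch, assuming the cross-validator behaves as documented above: with n_splits=1, split() yields exactly one (train_index, test_index) pair, which next() pulls out of the generator.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Imbalanced toy data: 8 samples of class 0, 2 of class 1.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

# n_splits=1 -> split() yields a single (train, test) index pair.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Because the split is stratified, the single class-1 pair is divided evenly: one class-1 sample lands in each half, matching the 4:1 ratio of y.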