Methods for Splitting a Dataset, Part 1

Author: 微斯人_吾谁与归 | Published 2019-05-22 17:18

Splitting a dataset by hand

1. Random numbers

2. Hash tables (both approaches are sketched below)

Using library tools

1. sklearn.model_selection.train_test_split

Signature: train_test_split(*arrays, **options)

Docstring:

Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and
next(ShuffleSplit().split(X, y)) and application to input data
into a single call for splitting (and optionally subsampling) data in a
one-liner.

Read more in the User Guide under cross_validation.

Parameters

arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse
matrices or pandas dataframes.


test_size : float, int or None, optional (default=0.25)
If float, it is the proportion of the dataset to include in the test split; if int, it is the absolute number of test samples; if None, it is set to the complement of train_size (0.25 when train_size is also None).


train_size : float, int or None, optional (default=None)
Same as test_size, but for the training split; if None, it is set to the complement of test_size.


random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random.


shuffle : boolean, optional (default=True)
Whether to shuffle the data before splitting. If False, stratify must be None.


stratify : array-like or None (default=None)
If not None, data is split in a stratified fashion, using this as
the class labels (an example follows the docstring).


Returns

splitting : list, length=2 * len(arrays)
List containing train-test split of inputs.
Added in version 0.16:
If the input is sparse, the output will be a
scipy.sparse.csr_matrix. Else, output type is the same as the
input type.

Examples

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]

File: d:\anaconda3\lib\site-packages\sklearn\model_selection\_split.py
Type: function


2. sklearn.model_selection.StratifiedShuffleSplit

Init signature:

StratifiedShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=None)

Docstring:

Stratified ShuffleSplit cross-validator

Provides train/test indices to split data in train/test sets.

This cross-validation object is a merge of StratifiedKFold and
ShuffleSplit, which returns stratified randomized folds. The folds
are made by preserving the percentage of samples for each class.

Note: like the ShuffleSplit strategy, stratified random splits
do not guarantee that all folds will be different, although this is
still very likely for sizeable datasets.

Read more in the User Guide under cross_validation.

Parameters

n_splits : int, default 10
Number of re-shuffling & splitting iterations.

test_size : float, int, None, optional
If float, should be between 0.0 and 1.0 and represent the proportion
of the dataset to include in the test split. If int, represents the
absolute number of test samples. If None, the value is set to the
complement of the train size. By default, the value is set to 0.1.
The default will change in version 0.21. It will remain 0.1 only
if train_size is unspecified, otherwise it will complement
the specified train_size.

train_size : float, int, or None, default is None
If float, should be between 0.0 and 1.0 and represent the
proportion of the dataset to include in the train split. If
int, represents the absolute number of train samples. If None,
the value is automatically set to the complement of the test size.

random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random.

Examples

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 0, 1, 1, 1])
>>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
>>> sss.get_n_splits(X, y)
5
>>> print(sss)       # doctest: +ELLIPSIS
StratifiedShuffleSplit(n_splits=5, random_state=0, ...)
>>> for train_index, test_index in sss.split(X, y):
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
TRAIN: [5 2 3] TEST: [4 1 0]
TRAIN: [5 1 4] TEST: [0 2 3]
TRAIN: [5 0 2] TEST: [4 3 1]
TRAIN: [4 1 0] TEST: [2 3 5]
TRAIN: [0 5 1] TEST: [3 4 2]
File:           d:\anaconda3\lib\site-packages\sklearn\model_selection\_split.py
Type:           ABCMeta
