How to use the PyTorch DataLoader

Author: 升不上三段的大鱼 | Published 2020-08-23 22:55

    PyTorch ships with a Dataset class and a DataLoader that returns data in batches; a usage example can be found here.

    In this article we look at how the DataLoader is implemented.

    A few parameters in the DataLoader's initializer are worth noting:

    if sampler is None:  # give default samplers
        if self._dataset_kind == _DatasetKind.Iterable:
            # See NOTE [ Custom Samplers and IterableDataset ]
            sampler = _InfiniteConstantSampler()
        else:  # map-style
            if shuffle:
                sampler = RandomSampler(dataset)
            else:
                sampler = SequentialSampler(dataset)

    if batch_size is not None and batch_sampler is None:
        # auto_collation without custom batch_sampler
        batch_sampler = BatchSampler(sampler, batch_size, drop_last)
    

    Here, depending on whether shuffle is true, either RandomSampler(dataset) or SequentialSampler(dataset) is used. batch_sampler is then obtained from BatchSampler; the code that assembles a batch is shown below. It is essentially a generator: it keeps pulling indices from sampler until batch_size of them have been collected:

    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch
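
    To make the batching behavior concrete, here is a small torch-free sketch that mirrors the generator above (the function name batch_indices is ours, chosen for illustration; BatchSampler itself iterates over a Sampler object rather than a plain range):

    ```python
    # Minimal re-implementation of the batching loop above (illustrative only).
    def batch_indices(sampler, batch_size, drop_last):
        batch = []
        for idx in sampler:
            batch.append(idx)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not drop_last:
            yield batch

    # 10 indices, batch_size=3: the last, incomplete batch is kept or dropped.
    print(list(batch_indices(range(10), 3, drop_last=False)))
    # → [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    print(list(batch_indices(range(10), 3, drop_last=True)))
    # → [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    ```

    This is why a dataset of 10 samples yields 4 batches with drop_last=False but only 3 with drop_last=True.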
    

    So how are the samplers themselves constructed?
    Let's start with SequentialSampler: it is very simple — it wraps the dataset, and its iter function returns the numbers from 0 up to the dataset size, in order.

    class SequentialSampler(Sampler):
        r"""Samples elements sequentially, always in the same order.
        Arguments:
            data_source (Dataset): dataset to sample from
        """
    
        def __init__(self, data_source):
            self.data_source = data_source
    
        def __iter__(self):
            return iter(range(len(self.data_source)))
    
        def __len__(self):
            return len(self.data_source)
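
    Because SequentialSampler only needs len() of its data source, its behavior can be checked with a plain list standing in for a real Dataset (the class is copied here verbatim so the snippet runs on its own, without torch):

    ```python
    class SequentialSampler:
        """Yields indices 0..len(data_source)-1 in order (as in the source above)."""

        def __init__(self, data_source):
            self.data_source = data_source

        def __iter__(self):
            return iter(range(len(self.data_source)))

        def __len__(self):
            return len(self.data_source)

    dataset = ["a", "b", "c", "d"]           # stand-in for a map-style Dataset
    print(list(SequentialSampler(dataset)))  # → [0, 1, 2, 3]
    ```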
    

    RandomSampler is a bit more involved, but the principle is the same as SequentialSampler's: its iter function returns random indices within the range of the dataset size instead of sequentially ordered ones.

    class RandomSampler(Sampler):
        r"""Samples elements randomly. If without replacement, then sample from a shuffled dataset.
        If with replacement, then user can specify :attr:`num_samples` to draw.
        Arguments:
            data_source (Dataset): dataset to sample from
            replacement (bool): samples are drawn with replacement if ``True``, default=``False``
            num_samples (int): number of samples to draw, default=`len(dataset)`. This argument
                is supposed to be specified only when `replacement` is ``True``.
        """
    
        def __init__(self, data_source, replacement=False, num_samples=None):
            self.data_source = data_source
            self.replacement = replacement
            self._num_samples = num_samples
    
            if not isinstance(self.replacement, bool):
                raise ValueError("replacement should be a boolean value, but got "
                                 "replacement={}".format(self.replacement))
    
            if self._num_samples is not None and not replacement:
                raise ValueError("With replacement=False, num_samples should not be specified, "
                                 "since a random permute will be performed.")
    
            if not isinstance(self.num_samples, int) or self.num_samples <= 0:
                raise ValueError("num_samples should be a positive integer "
                                 "value, but got num_samples={}".format(self.num_samples))
    
        @property
        def num_samples(self):
            # dataset size might change at runtime
            if self._num_samples is None:
                return len(self.data_source)
            return self._num_samples
    
        def __iter__(self):
            n = len(self.data_source)
            if self.replacement:
                return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
            return iter(torch.randperm(n).tolist())
    
        def __len__(self):
            return self.num_samples
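
    The core of __iter__ is torch.randperm(n): a random permutation of 0..n-1, so that without replacement every index appears exactly once per epoch. The same idea can be sketched without torch, with random.shuffle standing in for randperm (the function name is ours):

    ```python
    import random

    def permutation_indices(n, seed=None):
        # Mimics torch.randperm(n).tolist(): a shuffled copy of [0, 1, ..., n-1].
        rng = random.Random(seed)
        idx = list(range(n))
        rng.shuffle(idx)
        return idx

    idx = permutation_indices(8, seed=0)
    print(idx)  # order varies with the seed
    # Still a permutation: each index appears exactly once.
    assert sorted(idx) == list(range(8))
    ```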
    

    Next comes the setup of collate_fn:

    if collate_fn is None:
        if self._auto_collation:
            collate_fn = _utils.collate.default_collate
        else:
            collate_fn = _utils.collate.default_convert
    

    The job of collate_fn is to merge each data field of the samples into a tensor whose leading dimension is the batch size. What the DataLoader hands out is a batch-sized tensor: for example, with a batch size of 4 and images of size (3, 64, 64), the tensor produced by the DataLoader has size (4, 3, 64, 64) — collate_fn is what stacks those input images together into one tensor.
    If you want the DataLoader to output something different, you can define your own collate_fn; an example of that can also be found here.
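
    The shape of a custom collate_fn can be sketched without torch. The function below (the name simple_collate is ours) takes a batch of (features, label) samples and transposes it into per-field lists — essentially what default_collate does before it additionally applies torch.stack to each field:

    ```python
    def simple_collate(batch):
        # batch: a list of (features, label) samples,
        # e.g. [([1, 2], 0), ([3, 4], 1)].
        # Transpose the list of samples into one list per field;
        # default_collate would then stack each field into a tensor.
        features = [sample[0] for sample in batch]
        labels = [sample[1] for sample in batch]
        return features, labels

    samples = [([1, 2], 0), ([3, 4], 1), ([5, 6], 0)]
    feats, labs = simple_collate(samples)
    print(feats)  # → [[1, 2], [3, 4], [5, 6]]
    print(labs)   # → [0, 1, 0]
    ```

    A function with this signature can be passed as DataLoader(dataset, batch_size=..., collate_fn=simple_collate), which is useful when the default stacking does not fit — for example, to pad variable-length sequences before batching.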

    References:
    https://github.com/pytorch/pytorch/tree/e870a9a87042805cd52973e36534357f428a0748/torch/utils/data
    https://pytorch.org/docs/stable/data.html

    Original article: https://www.haomeiwen.com/subject/qxywjktx.html