To cope with large datasets and limited memory, we usually train in mini-batches, which also improves training efficiency.
1. Data preparation
import torch
import torch.utils.data as Data
torch.manual_seed(1) # reproducible
BATCH_SIZE = 5
# BATCH_SIZE = 8
x = torch.linspace(1, 10, 10) # this is x data (torch tensor)
y = torch.linspace(10, 1, 10) # this is y data (torch tensor)
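A quick look at what the two tensors contain (a minimal check, only for illustration):
print(x)  # x holds 10 evenly spaced values from 1 to 10: 1., 2., ..., 10.
print(y)  # y holds the same values in reverse order: 10., 9., ..., 1.
# The i-th sample will therefore pair x[i] with y[i] (here y[i] = 11 - x[i]).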
2. Data loading and batching
torch_dataset = Data.TensorDataset(x, y)
loader = Data.DataLoader(
    dataset=torch_dataset,    # torch TensorDataset format
    batch_size=BATCH_SIZE,    # mini batch size
    shuffle=True,             # random shuffle for training
    num_workers=2,            # subprocesses for loading data
)
Notes:
- Data.TensorDataset(): a dataset that wraps the data tensor and the target tensor; each sample is recovered by indexing both tensors along their first dimension (see the short sketch after this list).
- batch_size: mini-batch size, i.e. how many samples are returned at each step.
- shuffle: whether to reshuffle the data at every epoch; usually desirable for training.
- num_workers=2: number of subprocesses used to load the data (0 means loading in the main process).
- Data.DataLoader(): the data loader; it combines a dataset and a sampler and provides a single- or multi-process iterator over the dataset.
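A minimal sketch of the indexing behavior described above (the variable names follow the snippet; the print calls are only illustrative):
print(len(torch_dataset))   # 10 samples in total
print(torch_dataset[0])     # (tensor(1.), tensor(10.)), i.e. x[0] paired with y[0]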
3. Batch training
def show_batch():
    for epoch in range(3):   # train the entire dataset 3 times
        for step, (batch_x, batch_y) in enumerate(loader):   # for each training step
            # train your data...
            print('Epoch: ', epoch, '| Step: ', step, '| batch x: ',
                  batch_x.numpy(), '| batch y: ', batch_y.numpy())

if __name__ == '__main__':
    show_batch()
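With BATCH_SIZE = 5, each epoch yields two steps of 5 samples each. With the commented-out BATCH_SIZE = 8, the second step of each epoch contains only the remaining 2 samples, because DataLoader returns the final, smaller batch by default. A minimal sketch of how to discard such an incomplete batch instead (drop_last is a standard DataLoader argument; the other settings follow the snippet above):
loader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2,
    drop_last=True,   # drop the final incomplete batch (2 samples) instead of yielding it
)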
Notes:
- epoch: one complete pass over the whole dataset; here the dataset is trained 3 times.
- batch: the subset of samples loaded and processed at each step.
- Additional note: Python's built-in zip() function packs zip(x, y) into tuples, so the data can also be extracted in pairs, as sketched below.
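A minimal sketch of that zip() usage, assuming the same x and y tensors as above (plain pairwise iteration, without shuffling or batching):
for x_i, y_i in zip(x, y):          # pairs x[0] with y[0], x[1] with y[1], ...
    print(x_i.item(), y_i.item())   # e.g. 1.0 10.0, then 2.0 9.0, ...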