This tutorial is translated from the official PyTorch tutorial.

This tutorial introduces the fundamental concepts of PyTorch through self-contained examples.

At its core, PyTorch provides two main features:

- An n-dimensional Tensor, similar to numpy but able to run on GPUs
- Automatic differentiation for building and training neural networks

We will use a fully-connected ReLU network as our running example. The network has a single hidden layer and is trained with gradient descent to fit random data by minimizing the Euclidean distance between the network output and the true output.
Tensors
Warm-up: numpy

Before introducing PyTorch, we will first implement the network using numpy.

Numpy provides an n-dimensional array object and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it knows nothing about computation graphs, deep learning, or gradients. However, we can easily use numpy operations to manually implement the forward and backward passes through a two-layer network and fit it to random data:
import numpy as np
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
PyTorch: Tensors
Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy is not enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: it is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Like numpy arrays, PyTorch Tensors know nothing about deep learning, computational graphs, or gradients; they are a generic tool for scientific computing.

Unlike numpy, however, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you simply need to cast it to a new datatype.
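As a minimal sketch of that cast (using the same legacy dtype style as the code below; calling .cuda() on a Tensor works as well):

import torch

x = torch.randn(64, 1000)                # lives on the CPU (torch.FloatTensor)
if torch.cuda.is_available():
    x = x.type(torch.cuda.FloatTensor)   # same data, now stored on the GPU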
Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we need to manually implement the forward and backward passes through the network:
import torch
dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# create random input and output data
x = torch.randn(N, D_in).type(dtype)
y = torch.randn(N, D_out).type(dtype)
# Randomly initialize weights
w1 = torch.randn(D_in, H).type(dtype)
w2 = torch.randn(H, D_out).type(dtype)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
Autograd

PyTorch: Variables and autograd

In the examples above we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but it can quickly get very hairy for large, complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it's pretty simple to use in practice. We wrap our Tensors in Variable objects; a Variable represents a node in the computational graph. If x is a Variable, then x.data is a Tensor, and x.grad is another Variable holding the gradient of x with respect to some scalar value.
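As a quick illustration, here is a minimal sketch (the tensors and names here are made up for this example) of wrapping a Tensor in a Variable, backpropagating from a scalar, and reading off .data and .grad:

import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)
y = (3 * x).sum()     # y is a Variable holding a single scalar value
y.backward()          # backpropagate through the recorded graph

print(x.data)         # the underlying Tensor of ones
print(x.grad)         # a Variable holding dy/dx, which is 3 everywhere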
PyTorch Variables have the same API as PyTorch Tensors: (almost) any operation that you can perform on a Tensor also works on Variables; the difference is that using Variables defines a computational graph, allowing you to automatically compute gradients.

Here we use Variables and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass:
import torch
from torch.autograd import Variable
dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold input and outpus, and wrap them in Variables.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Variables during the backward pass
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)
# Create random Tensors for weights, and wrap them in Variables.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Variables during the backward pass.
w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Variables; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Variables.
    # Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape
    # (1,); loss.data[0] is a scalar value holding the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.data[0])

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Variables with requires_grad=True.
    # After this call w1.grad and w2.grad will be Variables holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Update weights using gradient descent; w1.data and w2.data are Tensors,
    # w1.grad and w2.grad are Variables and w1.grad.data and w2.grad.data are
    # Tensors.
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data

    # Manually zero the gradients after updating weights
    w1.grad.data.zero_()
    w2.grad.data.zero_()
PyTorch: Defining new autograd functions

Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.

In PyTorch we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing the forward and backward functions. We can then use our new autograd operator by constructing an instance and calling it like a function, passing Variables containing input data.

In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network:
import torch
from torch.autograd import Variable
class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """
    def forward(self, input):
        """
        In the forward pass we receive a Tensor containing the input and return a
        Tensor containing the output. You can cache arbitrary Tensors for use in the
        backward pass using the save_for_backward method.
        """
        self.save_for_backward(input)
        return input.clamp(min=0)

    def backward(self, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = self.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input
dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold input and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)
# Create random Tensors for weights, and wrap them in Variables.
w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)
learning_rate = 1e-6
for t in range(500):
    # Construct an instance of our MyReLU class to use in our network
    relu = MyReLU()

    # Forward pass: compute predicted y using operations on Variables; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.data[0])

    # Use autograd to compute the backward pass
    loss.backward()

    # Update weights using gradient descent
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data

    # Manually zero the gradients after updating weights
    w1.grad.data.zero_()
    w2.grad.data.zero_()
TensorFlow: Static Graphs

PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs are static, while PyTorch uses dynamic computational graphs.

In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph. In PyTorch, each forward pass defines a new computational graph.

Static graphs are nice because you can optimize the graph up front; for example, a framework might decide to fuse some graph operations for efficiency, or come up with a strategy for distributing the graph across many GPUs or many machines. If you are reusing the same graph over and over, this potentially costly up-front optimization can be amortized as the same graph is rerun again and again.

One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform different computation for each data point; for example, a recurrent network might be unrolled for different numbers of time steps for each data point, and this unrolling can be implemented as a loop.

With a static graph, the loop construct needs to be part of the graph; for this reason TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build the graph on-the-fly for each example, we can use normal imperative flow control to perform computation that differs for each input.
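As a tiny sketch of what that imperative control flow looks like with autograd (the tensors here are made up for illustration; a full model built on this idea appears in the last example of this tutorial):

import random
import torch
from torch.autograd import Variable

x = Variable(torch.randn(3, 4), requires_grad=True)
h = x
# An ordinary Python loop: the number of iterations is decided at run time,
# and autograd simply records whatever operations actually execute.
for _ in range(random.randint(1, 3)):
    h = (2 * h).clamp(min=0)
loss = h.sum()
loss.backward()       # x.grad reflects the graph built on this particular pass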
To contrast with the PyTorch autograd example above, here we use TensorFlow to fit a simple two-layer net:
import tensorflow as tf
import numpy as np
# First we set up the computational graph:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))
# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))
# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)
# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)
# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])
# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)
# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)
nn module

PyTorch: nn

Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters that will be optimized during learning.

In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In PyTorch, the nn package serves this same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Variables and computes output Variables, but may also hold internal state such as Variables containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.

In this example we use the nn package to implement our two-layer network:
import torch
from torch.autograd import Variable
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)
# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Variables for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)
# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(size_average=False)
learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Variable of input data to the Module and it produces
    # a Variable of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Variables containing the predicted and true
    # values of y, and the loss function returns a Variable containing the loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.data[0])

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Variables with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent; each parameter is a Variable, so
    # we can access its data and gradients like we did before.
    for param in model.parameters():
        param.data -= learning_rate * param.grad.data
PyTorch: optim
Up to this point we have updated the weights of our models by manually mutating the .data member of the Variables holding the learnable parameters. This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.

The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.
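For instance, switching between algorithms is mostly a matter of constructing a different optimizer over the same parameters. A minimal sketch (the stand-in model here is hypothetical; any Module's parameters would do):

import torch

model = torch.nn.Linear(1000, 10)   # hypothetical stand-in for any model

# Any of these could drive the same training loop; only the constructor differs.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)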
In this example we will use the nn package to define our model as before, but we will optimize the model using the Adam algorithm provided by the optim package:
import torch
from torch.autograd import Variable
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)
# Use the nn package to define our model and loss function
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)
loss_fn = torch.nn.MSELoss(size_average=False)
# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Variables it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss
    loss = loss_fn(y_pred, y)
    print(t, loss.data[0])

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the Variables it will update (which are the learnable weights
    # of the model).
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its parameters
    optimizer.step()
PyTorch: Custom nn Modules

Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing nn.Module and defining a forward function which receives input Variables and produces output Variables using other Modules or other autograd operations on Variables.

In this example we implement our two-layer network as a custom Module subclass:
import torch
from torch.autograd import Variable
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Variable of input data and we must return
        a Variable of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Variables.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs, and wrap them in Variables
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)
# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)
# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.data[0])

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
PyTorch: Control Flow + Weight Sharing

As an example of dynamic graphs and weight sharing, we implement a rather unusual model: a fully-connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.

For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.
import random
import torch
from torch.autograd import Variable
class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs, and wrap them in Variables
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)
# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)
# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.data[0])

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Examples

You can browse the code for the examples above at the following links:
Tensors
autograd
PyTorch: Variables and autograd
PyTorch: Defining new autograd functions
nn Module
PyTorch: nn