CS231n (winter 2016): Assignment 2

Author: Deepool | Published 2016-06-28 10:17

    Preface:

    This series takes the Python programming assignments of Stanford's cs231n course as its main thread, developing an understanding of the course's core content along with some of the mathematical derivations. Reading on a PC is recommended. The course materials and code:
    Videos and slides
    Notes
    assignment2 starter code


    Part 1: Deep fully-connected neural networks (Python programming tasks)

    In Assignment 1 we built a simple 2-layer fully-connected network, but that code was not modular: all of the computation (the loss, the gradients, and so on) lived inside a single function, leaving no flexibility to change the network architecture. Here we adopt a more modular style instead, where each component is independent of the others and they can call one another at run time, which makes the network architecture very flexible. Like this:

    def layer_forward(x, w):
        """ Receive inputs x and weights w """
        # Do some computations (an affine transform, as an illustrative example) ...
        z = x.dot(w)                # ... some intermediate value
        # Do some more computations ...
        out = np.maximum(0, z)      # the output
        cache = (x, w, z, out)      # Values we need to compute gradients

        return out, cache
    
    The backward pass will receive upstream derivatives and the cache object, 
    and will return gradients with respect to the inputs and weights, like this:
    
    def layer_backward(dout, cache):
        """
        Receive derivative of loss with respect to outputs and cache,
        and compute derivative with respect to inputs.
        """
        # Unpack cache values
        x, w, z, out = cache

        # Use values in cache to compute derivatives (for the example above)
        dz = dout * (z > 0)    # backprop through the ReLU
        dx = dz.dot(w.T)       # Derivative of loss with respect to x
        dw = x.T.dot(dz)       # Derivative of loss with respect to w

        return dx, dw
    

    In addition, we will fold all of the parameter-update rules covered earlier into these modules, so that we can compare how different update rules perform; we will also add Batch Normalization and Dropout to the modules to optimize deep networks more effectively.

    Since the programming workload in this part is fairly heavy, we split it up and finish it step by step:

    1. 2-layer fully-connected network

    In this part we need to complete the following (and also read and understand solver.py):
    --> the TwoLayerNet class in fc_net.py
    --> the first four functions in layers.py
    --> optim.py

    The code is as follows:
    ---> fc_net.py

    __coauthor__ = 'Deeplayer'
    # 6.22.2016 #
    
    import numpy as np
    from layer_utils import *
    
    class TwoLayerNet(object):   
        """    
        A two-layer fully-connected neural network with ReLU nonlinearity and    
        softmax loss that uses a modular layer design. We assume an input dimension    
        of D, a hidden dimension of H, and perform classification over C classes.    
    
        The architecture should be affine - relu - affine - softmax.    
    
        Note that this class does not implement gradient descent; instead, it    
        will interact with a separate Solver object that is responsible for running    
        optimization.    
    
        The learnable parameters of the model are stored in the dictionary    
        self.params that maps parameter names to numpy arrays.   
        """
        def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,           
                                  weight_scale=1e-3, reg=0.0):    
            """    
            Initialize a new network.   
            Inputs:    
            - input_dim: An integer giving the size of the input    
            - hidden_dim: An integer giving the size of the hidden layer    
            - num_classes: An integer giving the number of classes to classify    
            - weight_scale: Scalar giving the standard deviation for random 
                            initialization of the weights.    
            - reg: Scalar giving L2 regularization strength.    
            """    
            self.params = {}    
            self.reg = reg   
            self.params['W1'] = weight_scale * np.random.randn(input_dim, hidden_dim)     
            self.params['b1'] = np.zeros((1, hidden_dim))    
            self.params['W2'] = weight_scale * np.random.randn(hidden_dim, num_classes)  
            self.params['b2'] = np.zeros((1, num_classes))
    
        def loss(self, X, y=None):    
            """   
            Compute loss and gradient for a minibatch of data.    
            Inputs:    
            - X: Array of input data of shape (N, d_1, ..., d_k)    
            - y: Array of labels, of shape (N,). y[i] gives the label for X[i].  
            Returns:   
            If y is None, then run a test-time forward pass of the model and return:    
            - scores: Array of shape (N, C) giving classification scores, where              
                      scores[i, c] is the classification score for X[i] and class c. 
            If y is not None, then run a training-time forward and backward pass and    
            return a tuple of:    
            - loss: Scalar value giving the loss   
            - grads: Dictionary with the same keys as self.params, mapping parameter             
                     names to gradients of the loss with respect to those parameters.    
            """
            scores = None
            N = X.shape[0]
            # Unpack variables from the params dictionary
            W1, b1 = self.params['W1'], self.params['b1']
            W2, b2 = self.params['W2'], self.params['b2']
            h1, cache1 = affine_relu_forward(X, W1, b1)
            out, cache2 = affine_forward(h1, W2, b2)
            scores = out              # (N,C)
            # If y is None then we are in test mode so just return scores
            if y is None:   
                return scores
    
            loss, grads = 0, {}
            data_loss, dscores = softmax_loss(scores, y)
            reg_loss = 0.5 * self.reg * np.sum(W1*W1) + 0.5 * self.reg * np.sum(W2*W2)
            loss = data_loss + reg_loss
    
            # Backward pass: compute gradients
            dh1, dW2, db2 = affine_backward(dscores, cache2)
            dX, dW1, db1 = affine_relu_backward(dh1, cache1)
            # Add the regularization gradient contribution
            dW2 += self.reg * W2
            dW1 += self.reg * W1
            grads['W1'] = dW1
            grads['b1'] = db1
            grads['W2'] = dW2
            grads['b2'] = db2

            return loss, grads
    

    ---> layers.py

    __coauthor__ = 'Deeplayer'
    # 6.22.2016 
    
    import numpy as np
    
    def affine_forward(x, w, b):   
        """    
        Computes the forward pass for an affine (fully-connected) layer. 
        The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N   
        examples, where each example x[i] has shape (d_1, ..., d_k). We will    
        reshape each input into a vector of dimension D = d_1 * ... * d_k, and    
        then transform it to an output vector of dimension M.    
        Inputs:    
        - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)    
        - w: A numpy array of weights, of shape (D, M)    
        - b: A numpy array of biases, of shape (M,)   
        Returns a tuple of:    
        - out: output, of shape (N, M)    
        - cache: (x, w, b)   
        """
        out = None
        # Reshape x into rows
        N = x.shape[0]
        x_row = x.reshape(N, -1)         # (N,D)
        out = np.dot(x_row, w) + b       # (N,M)
        cache = (x, w, b)
        
        return out, cache
    
    def affine_backward(dout, cache):   
        """    
        Computes the backward pass for an affine layer.    
        Inputs:    
        - dout: Upstream derivative, of shape (N, M)    
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ... d_k)
      - w: Weights, of shape (D, M)
      - b: Biases, of shape (M,)
    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (1, M)
        """    
        x, w, b = cache    
        dx, dw, db = None, None, None   
        dx = np.dot(dout, w.T)                       # (N,D)    
        dx = np.reshape(dx, x.shape)                 # (N,d1,...,d_k)   
        x_row = x.reshape(x.shape[0], -1)            # (N,D)    
        dw = np.dot(x_row.T, dout)                   # (D,M)    
        db = np.sum(dout, axis=0, keepdims=True)     # (1,M)    
    
        return dx, dw, db
    
    def relu_forward(x):   
        """    
        Computes the forward pass for a layer of rectified linear units (ReLUs).    
        Input:    
        - x: Inputs, of any shape    
        Returns a tuple of:    
        - out: Output, of the same shape as x    
        - cache: x    
        """   
        out = None    
        out = ReLU(x)    
        cache = x    
    
        return out, cache
    
    def relu_backward(dout, cache):   
        """  
        Computes the backward pass for a layer of rectified linear units (ReLUs).   
        Input:    
        - dout: Upstream derivatives, of any shape    
        - cache: Input x, of same shape as dout    
        Returns:    
        - dx: Gradient with respect to x    
        """    
        dx, x = None, cache
        # Pass the gradient through only where the input was positive; using a
        # fresh array avoids modifying dout in place.
        dx = dout * (x > 0)
    
        return dx
    
    def svm_loss(x, y):   
        """    
        Computes the loss and gradient for multiclass SVM classification.    
        Inputs:    
        - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class         
             for the ith input.    
        - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and         
             0 <= y[i] < C   
        Returns a tuple of:    
        - loss: Scalar giving the loss   
        - dx: Gradient of the loss with respect to x    
        """    
        N = x.shape[0]   
        correct_class_scores = x[np.arange(N), y]    
        margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)    
        margins[np.arange(N), y] = 0   
        loss = np.sum(margins) / N   
        num_pos = np.sum(margins > 0, axis=1)    
        dx = np.zeros_like(x)   
        dx[margins > 0] = 1    
        dx[np.arange(N), y] -= num_pos    
        dx /= N    
    
        return loss, dx
    
    def softmax_loss(x, y):    
        """    
        Computes the loss and gradient for softmax classification.
        Inputs:    
        - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class         
        for the ith input.    
        - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and         
             0 <= y[i] < C   
        Returns a tuple of:    
        - loss: Scalar giving the loss    
        - dx: Gradient of the loss with respect to x   
        """    
        probs = np.exp(x - np.max(x, axis=1, keepdims=True))    
        probs /= np.sum(probs, axis=1, keepdims=True)    
        N = x.shape[0]   
        # probs[np.arange(N), y] picks out probs[i, y[i]] for every row i
        loss = -np.sum(np.log(probs[np.arange(N), y])) / N    
        dx = probs.copy()    
        dx[np.arange(N), y] -= 1    
        dx /= N    
    
        return loss, dx
    
    def ReLU(x):    
        """ReLU non-linearity."""    
        return np.maximum(0, x)
    
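    Before moving on, here is a minimal numeric gradient check for the layers above (a sketch; it assumes the functions are saved in a layers.py that can be imported):

    import numpy as np
    from layers import affine_forward, affine_backward

    def num_grad(f, x, h=1e-5):
        # Centered-difference numeric gradient of the scalar function f at x.
        grad = np.zeros_like(x)
        it = np.nditer(x, flags=['multi_index'])
        while not it.finished:
            ix = it.multi_index
            old = x[ix]
            x[ix] = old + h
            fp = f()
            x[ix] = old - h
            fm = f()
            x[ix] = old
            grad[ix] = (fp - fm) / (2 * h)
            it.iternext()
        return grad

    np.random.seed(0)
    x = np.random.randn(4, 5)
    w = np.random.randn(5, 3)
    b = np.random.randn(3)
    dout = np.random.randn(4, 3)

    out, cache = affine_forward(x, w, b)
    dx, dw, db = affine_backward(dout, cache)

    # The analytic dw should match the numeric gradient of sum(out * dout).
    dw_num = num_grad(lambda: np.sum(affine_forward(x, w, b)[0] * dout), w)
    print 'max dw error:', np.max(np.abs(dw - dw_num))   # should be ~1e-10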

    ---> optim.py

    __coauthor__ = 'Deeplayer'
    # 6.22.2016 
    
    import numpy as np
    
    def sgd(w, dw, config=None):    
        """    
        Performs vanilla stochastic gradient descent.    
        config format:    
        - learning_rate: Scalar learning rate.    
        """    
        if config is None: config = {}
        config.setdefault('learning_rate', 1e-2)
        w -= config['learning_rate'] * dw

        return w, config
    
    def sgd_momentum(w, dw, config=None):    
        """    
        Performs stochastic gradient descent with momentum.    
        config format:    
        - learning_rate: Scalar learning rate.    
        - momentum: Scalar between 0 and 1 giving the momentum value.
          Setting momentum = 0 reduces to sgd.
        - velocity: A numpy array of the same shape as w and dw used to store a
          moving average of the gradients.
        """   
        if config is None: config = {}    
        config.setdefault('learning_rate', 1e-2)   
        config.setdefault('momentum', 0.9)    
        v = config.get('velocity', np.zeros_like(w))    
        next_w = None    
        v = config['momentum'] * v - config['learning_rate'] * dw    
        next_w = w + v    
        config['velocity'] = v    
    
        return next_w, config
    
    def rmsprop(x, dx, config=None):    
        """    
        Uses the RMSProp update rule, which uses a moving average of squared gradient    
        values to set adaptive per-parameter learning rates.    
        config format:    
        - learning_rate: Scalar learning rate.    
        - decay_rate: Scalar between 0 and 1 giving the decay rate for the
          squared gradient cache.
        - epsilon: Small scalar used for smoothing to avoid dividing by zero.
        - cache: Moving average of second moments of gradients.
        """    
        if config is None: config = {}    
        config.setdefault('learning_rate', 1e-2)  
        config.setdefault('decay_rate', 0.99)    
        config.setdefault('epsilon', 1e-8)    
        config.setdefault('cache', np.zeros_like(x))    
        next_x = None    
        cache = config['cache']    
        decay_rate = config['decay_rate']    
        learning_rate = config['learning_rate']    
        epsilon = config['epsilon']    
        cache = decay_rate * cache + (1 - decay_rate) * (dx**2)    
        x += - learning_rate * dx / (np.sqrt(cache) + epsilon)  
        config['cache'] = cache    
        next_x = x    
    
        return next_x, config
    
    def adam(x, dx, config=None):    
        """    
        Uses the Adam update rule, which incorporates moving averages of both the  
        gradient and its square and a bias correction term.    
        config format:    
        - learning_rate: Scalar learning rate.    
        - beta1: Decay rate for moving average of first moment of gradient.    
        - beta2: Decay rate for moving average of second moment of gradient.   
        - epsilon: Small scalar used for smoothing to avoid dividing by zero.    
        - m: Moving average of gradient.    
        - v: Moving average of squared gradient.    
        - t: Iteration number.   
        """    
        if config is None: config = {}    
        config.setdefault('learning_rate', 1e-3)    
        config.setdefault('beta1', 0.9)    
        config.setdefault('beta2', 0.999)    
        config.setdefault('epsilon', 1e-8)    
        config.setdefault('m', np.zeros_like(x))    
        config.setdefault('v', np.zeros_like(x))    
        config.setdefault('t', 0)   
        next_x = None    
        m = config['m']    
        v = config['v']    
        beta1 = config['beta1']    
        beta2 = config['beta2']    
        learning_rate = config['learning_rate']    
        epsilon = config['epsilon']   
        t = config['t']    
        t += 1    
        m = beta1 * m + (1 - beta1) * dx    
        v = beta2 * v + (1 - beta2) * (dx**2)    
        m_bias = m / (1 - beta1**t)    
        v_bias = v / (1 - beta2**t)    
        x += - learning_rate * m_bias / (np.sqrt(v_bias) + epsilon)    
        next_x = x    
        config['m'] = m    
        config['v'] = v    
        config['t'] = t    
    
        return next_x, config
    
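    To see how these update rules are meant to be consumed (Solver does essentially this), here is a minimal usage sketch on a toy quadratic loss; the returned config dict carries the state (velocity, caches, t) from step to step. It assumes the code above is importable as optim.py:

    import numpy as np
    from optim import sgd_momentum

    w = np.random.randn(3, 3)
    config = None
    for step in xrange(5):
        dw = 2 * w                               # gradient of the toy loss ||w||^2
        w, config = sgd_momentum(w, dw, config)  # returns updated w and its state
        print step, np.linalg.norm(w)            # the norm should shrink each step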

    Once the code is written, we can use FullyConnectedNets.ipynb to check it for mistakes. After that, we can train on CIFAR-10 and compare against the 2-layer network from Assignment 1; the results should be roughly the same.

    Here are the code and result plots for the CIFAR-10 run:
    ---> two_layer_fc_net_start.py

    __coauthor__ = 'Deeplayer'
    # 6.22.2016
    
    import matplotlib.pyplot as plt
    from fc_net import *
    from data_utils import get_CIFAR10_data
    from solver import Solver
    
    data = get_CIFAR10_data()
    model = TwoLayerNet(reg=0.9)
    solver = Solver(model, data,                
                    lr_decay=0.95,                
                    print_every=100, num_epochs=40, batch_size=400, 
                    update_rule='sgd_momentum',                
                    optim_config={'learning_rate': 5e-4, 'momentum': 0.5})
    
    solver.train()                 
    
    plt.subplot(2, 1, 1)
    plt.title('Training loss')
    plt.plot(solver.loss_history, 'o')
    plt.xlabel('Iteration')
    
    plt.subplot(2, 1, 2)
    plt.title('Accuracy')
    plt.plot(solver.train_acc_history, '-o', label='train')
    plt.plot(solver.val_acc_history, '-o', label='val')
    plt.plot([0.5] * len(solver.val_acc_history), 'k--')
    plt.xlabel('Epoch')
    plt.legend(loc='lower right')
    plt.gcf().set_size_inches(15, 12)
    plt.show()
    
    
    best_model = model
    y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)
    y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)
    print 'Validation set accuracy: ', (y_val_pred == data['y_val']).mean()
    print 'Test set accuracy: ', (y_test_pred == data['y_test']).mean()
    # Validation set accuracy:  about 52.9%
    # Test set accuracy:  about 54.7%
    
    
    # Visualize the weights of the best network
    from vis_utils import visualize_grid
    
    def show_net_weights(net):
        W1 = net.params['W1']
        W1 = W1.reshape(3, 32, 32, -1).transpose(3, 1, 2, 0)
        plt.imshow(visualize_grid(W1, padding=3).astype('uint8'))
        plt.gca().axis('off')
        plt.show()

    show_net_weights(best_model)
    
    Figure_1.png
    2. Multilayer fully-connected network + Batch Normalization

    In this part we need to complete the following:
    --> the FullyConnectedNet class in fc_net.py
    --> the batchnorm_forward and batchnorm_backward functions in layers.py

    The code is as follows:
    ---> fc_net.py

    __coauthor__ = 'Deeplayer'
    # 6.22.2016
    
    import numpy as np
    from layer_utils import *
    
    class FullyConnectedNet(object):    
        """    
        A fully-connected neural network with an arbitrary number of hidden layers,    
        ReLU nonlinearities, and a softmax loss function. This will also implement    
        dropout and batch normalization as options. For a network with L layers,    
        the architecture will be    
        {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax    
        where batch normalization and dropout are optional, and the {...} block is    
        repeated L - 1 times.   
        Similar to the TwoLayerNet above, learnable parameters are stored in the    
        self.params dictionary and will be learned using the Solver class. 
        """
        def __init__(self, hidden_dims, input_dim=3*32*32,
                     num_classes=10,
                     dropout=0, use_batchnorm=False, reg=0.0,
                     weight_scale=1e-2, dtype=np.float32, seed=None):
    
            self.use_batchnorm = use_batchnorm
            self.use_dropout = dropout > 0
            self.reg = reg
            self.num_layers = 1 + len(hidden_dims)
            self.dtype = dtype
            self.params = {}
    
            layers_dims = [input_dim] + hidden_dims + [num_classes]
            for i in xrange(self.num_layers):    
                self.params['W' + str(i+1)] = weight_scale * np.random.randn(layers_dims[i], layers_dims[i+1])    
                self.params['b' + str(i+1)] = np.zeros((1, layers_dims[i+1]))    
                if self.use_batchnorm and i < len(hidden_dims): 
                    self.params['gamma' + str(i+1)] = np.ones((1, layers_dims[i+1]))        
                    self.params['beta' + str(i+1)] = np.zeros((1, layers_dims[i+1]))
            # When using dropout we need to pass a dropout_param dictionary to each
            # dropout layer so that the layer knows the dropout probability and the mode
            # (train / test). You can pass the same dropout_param to each dropout layer.
            self.dropout_param = {}
            if self.use_dropout:    
                self.dropout_param = {'mode': 'train', 'p': dropout}    
                if seed is not None:        
                    self.dropout_param['seed'] = seed
            # With batch normalization we need to keep track of running means and
            # variances, so we need to pass a special bn_param object to each batch
            # normalization layer. You should pass self.bn_params[0] to the forward pass
            # of the first batch normalization layer, self.bn_params[1] to the forward
            # pass of the second batch normalization layer, etc.
            self.bn_params = []
            if self.use_batchnorm:    
                self.bn_params = [{'mode': 'train'} for i in xrange(self.num_layers - 1)]
    
            # Cast all parameters to the correct datatype
            for k, v in self.params.iteritems():    
                self.params[k] = v.astype(dtype)
    
        def loss(self, X, y=None):    
            """    
            Compute loss and gradient for the fully-connected net.    
            Input / output: Same as TwoLayerNet above.    
            """    
            X = X.astype(self.dtype)    
            mode = 'test' if y is None else 'train'    
            # Set train/test mode for batchnorm params and dropout param since they    
            # behave differently during training and testing.    
            if self.dropout_param is not None: 
                self.dropout_param['mode'] = mode    
            if self.use_batchnorm:
                for bn_param in self.bn_params:
                    bn_param['mode'] = mode
            scores = None    
            h, cache1, cache2, cache3, bn, out = {}, {}, {}, {}, {}, {}    
            out[0] = X
    
            # Forward pass: compute loss
            for i in xrange(self.num_layers-1):    
                # Unpack variables from the params dictionary    
                W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
                if self.use_batchnorm:        
                    gamma, beta = self.params['gamma' + str(i+1)], self.params['beta' + str(i+1)]        
                    h[i], cache1[i] = affine_forward(out[i], W, b)        
                    bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])        
                    out[i+1], cache3[i] = relu_forward(bn[i])    
                else:        
                    out[i+1], cache3[i] = affine_relu_forward(out[i], W, b)
    
            W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
            scores, cache = affine_forward(out[self.num_layers-1], W, b)
    
            # If test mode return early
            if mode == 'test':   
                return scores
    
            loss, reg_loss, grads = 0.0, 0.0, {}
            data_loss, dscores = softmax_loss(scores, y)
            for i in xrange(self.num_layers):    
                reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i+1)]*self.params['W' + str(i+1)])
            loss = data_loss + reg_loss
    
            # Backward pass: compute gradients
            dout, dbn, dh = {}, {}, {}
            t = self.num_layers-1
            dout[t], grads['W'+str(t+1)], grads['b'+str(t+1)] = affine_backward(dscores, cache)
            for i in xrange(t):    
                if self.use_batchnorm:        
                    dbn[t-1-i] = relu_backward(dout[t-i], cache3[t-1-i]) 
                    dh[t-1-i], grads['gamma'+str(t-i)], grads['beta'+str(t-i)] = batchnorm_backward(dbn[t-1-i], cache2[t-1-i])       
                    dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_backward(dh[t-1-i], cache1[t-1-i])    
                else:        
                    dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_relu_backward(dout[t-i], cache3[t-1-i])
    
            # Add the regularization gradient contribution
            for i in xrange(self.num_layers):    
                grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]
    
            return loss, grads
    

    Before giving the code for batchnorm_forward and batchnorm_backward, here are the Batch Normalization algorithm and its backward-pass derivatives:

    (Figures: Batch Normalization, algorithm 1; backpropagating the gradient of the loss ℓ)
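    For reference, these are the standard mini-batch formulas the figures show (from Ioffe & Szegedy's Batch Normalization paper), which the batchnorm_forward and batchnorm_backward code below implements. For a mini-batch x_1, ..., x_m:

    $$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu)^2, \qquad \hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}}, \qquad y_i = \gamma\hat{x}_i + \beta$$

    and for the backward pass:

    $$\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i}\,\gamma, \qquad \frac{\partial \ell}{\partial \sigma^2} = \sum_{i=1}^{m}\frac{\partial \ell}{\partial \hat{x}_i}\,(x_i-\mu)\cdot\left(-\tfrac{1}{2}\right)(\sigma^2+\epsilon)^{-3/2}$$

    $$\frac{\partial \ell}{\partial \mu} = -\frac{1}{\sqrt{\sigma^2+\epsilon}}\sum_{i=1}^{m}\frac{\partial \ell}{\partial \hat{x}_i} + \frac{\partial \ell}{\partial \sigma^2}\cdot\frac{1}{m}\sum_{i=1}^{m}-2(x_i-\mu)$$

    $$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i}\,\frac{1}{\sqrt{\sigma^2+\epsilon}} + \frac{\partial \ell}{\partial \sigma^2}\,\frac{2(x_i-\mu)}{m} + \frac{\partial \ell}{\partial \mu}\,\frac{1}{m}, \qquad \frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m}\frac{\partial \ell}{\partial y_i}\,\hat{x}_i, \qquad \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m}\frac{\partial \ell}{\partial y_i}$$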

    ---> layers.py

    __coauthor__ = 'Deeplayer'
    # 6.22.2016 
    
    import numpy as np
    
    def batchnorm_forward(x, gamma, beta, bn_param):
        mode = bn_param['mode']
        eps = bn_param.get('eps', 1e-5)
        momentum = bn_param.get('momentum', 0.9)
        N, D = x.shape
        running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
        running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))
    
        out, cache = None, None
        if mode == 'train':    
            sample_mean = np.mean(x, axis=0, keepdims=True)       # [1,D]    
            sample_var = np.var(x, axis=0, keepdims=True)         # [1,D] 
            x_normalized = (x - sample_mean) / np.sqrt(sample_var + eps)    # [N,D]    
            out = gamma * x_normalized + beta    
            cache = (x_normalized, gamma, beta, sample_mean, sample_var, x, eps)    
            running_mean = momentum * running_mean + (1 - momentum) * sample_mean    
            running_var = momentum * running_var + (1 - momentum) * sample_var
        elif mode == 'test':    
            x_normalized = (x - running_mean) / np.sqrt(running_var + eps)    
            out = gamma * x_normalized + beta
        else:    
            raise ValueError('Invalid forward batchnorm mode "%s"' % mode)
    
        # Store the updated running means back into bn_param
        bn_param['running_mean'] = running_mean
        bn_param['running_var'] = running_var
    
        return out, cache
    
    def batchnorm_backward(dout, cache):
        dx, dgamma, dbeta = None, None, None
        x_normalized, gamma, beta, sample_mean, sample_var, x, eps = cache
        N, D = x.shape
        dx_normalized = dout * gamma       # [N,D]
        x_mu = x - sample_mean             # [N,D]
        sample_std_inv = 1.0 / np.sqrt(sample_var + eps)    # [1,D]
        dsample_var = -0.5 * np.sum(dx_normalized * x_mu, axis=0, keepdims=True) * sample_std_inv**3
        dsample_mean = -1.0 * np.sum(dx_normalized * sample_std_inv, axis=0, keepdims=True) - \
                       2.0 * dsample_var * np.mean(x_mu, axis=0, keepdims=True)
        dx1 = dx_normalized * sample_std_inv
        dx2 = 2.0/N * dsample_var * x_mu
        dx = dx1 + dx2 + 1.0/N * dsample_mean
        dgamma = np.sum(dout * x_normalized, axis=0, keepdims=True)
        dbeta = np.sum(dout, axis=0, keepdims=True)
    
        return dx, dgamma, dbeta
    

    Once the code is done, we can check it with BatchNormalization.ipynb. Below I give the performance of a 6-layer network on CIFAR-10 with Batch Normalization. As you might expect, the 6-layer network should not do much better than the 2-layer one (because of issue 1 mentioned at the end of my Assignment 1 post).

    Before that, let's see how well Batch Normalization mitigates vanishing gradients, and how it behaves across different weight_scales. We test 6-layer networks with sigmoid and with ReLU activations:

    ---> batchnorm_and_weight_scales.py

    __coauthor__ = 'Deeplayer'
    # 6.22.2016 #
    
    from fc_net import *
    from solver import *
    import matplotlib.pyplot as plt
    from data_utils import get_CIFAR10_data
    
    # Load the (preprocessed) CIFAR10 data.
    data = get_CIFAR10_data()
    
    hidden_dims = [100, 100, 100, 100, 100]
    num_train = 5000
    small_data = {  
           'X_train': data['X_train'][:num_train],  
           'y_train': data['y_train'][:num_train],  
           'X_val': data['X_val'],  
           'y_val': data['y_val'],
    }
    bn_solvers = {}
    solvers = {}
    weight_scales = np.logspace(-4, 0, num=20)
    for i, weight_scale in enumerate(weight_scales):    
        print 'Running weight scale %d / %d' % (i + 1, len(weight_scales)) 
        bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=True)    
        model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=False)    
    
        bn_solver = Solver(bn_model, small_data,        
                           num_epochs=10, batch_size=100,           
                           update_rule='adam',                  
                           optim_config={'learning_rate': 1e-3, },                  
                           verbose=False, print_every=1000)    
        bn_solver.train()    
        bn_solvers[weight_scale] = bn_solver    
    
        solver = Solver(model, small_data,                  
                        num_epochs=10, batch_size=100,      
                        update_rule='adam',                 
                        optim_config={'learning_rate': 1e-3, },  
                        verbose=False, print_every=1000)    
        solver.train()    
        solvers[weight_scale] = solver
    
    # Plot results of weight scale experiment
    best_train_accs, bn_best_train_accs = [], []
    best_val_accs, bn_best_val_accs = [], []
    final_train_loss, bn_final_train_loss = [], []
    
    for ws in weight_scales: 
        best_train_accs.append(max(solvers[ws].train_acc_history))
        bn_best_train_accs.append(max(bn_solvers[ws].train_acc_history))  
    
        best_val_accs.append(max(solvers[ws].val_acc_history))  
        bn_best_val_accs.append(max(bn_solvers[ws].val_acc_history))  
    
        final_train_loss.append(np.mean(solvers[ws].loss_history[-100:]))  
        bn_final_train_loss.append(np.mean(bn_solvers[ws].loss_history[-100:]))
    
    plt.subplot(3, 1, 1)
    plt.title('Best val accuracy vs weight initialization scale')
    plt.xlabel('Weight initialization scale')
    plt.ylabel('Best val accuracy')
    plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')
    plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')
    plt.legend(ncol=2, loc='lower right')
    
    plt.subplot(3, 1, 2)
    plt.title('Best train accuracy vs weight initialization scale')
    plt.xlabel('Weight initialization scale')
    plt.ylabel('Best training accuracy')
    plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')
    plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')
    plt.legend(loc='upper left')
    
    plt.subplot(3, 1, 3)
    plt.title('Final training loss vs weight initialization scale')
    plt.xlabel('Weight initialization scale')
    plt.ylabel('Final training loss')
    plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')
    plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')
    plt.legend(loc='upper left')
    
    plt.gcf().set_size_inches(10, 15)
    plt.show()
    
    (Figures: activation function sigmoid; activation function ReLU)

    From the figures above we can see:

    1) Batch Normalization fixes the sigmoid saturation problem (vanishing gradients) that troubled the field for years, bravo! If the results above feel too indirect, here are the weight-gradient magnitudes for each layer:


    (Left: without Batch Normalization --- Right: with Batch Normalization)

    2) Even without vanishing gradients, sigmoid still does worse than ReLU.
    3) With a well-chosen weight_scale, Batch Normalization does not improve the accuracy of a ReLU network by much.

    Now, the recognition results of the 6-layer network on CIFAR-10 (ReLU activations):
    · Validation set accuracy: 0.554
    · Test set accuracy: 0.54


    3. Dropout

    In this part we need to complete the following:
    --> modify fc_net.py to add dropout
    --> the dropout_forward and dropout_backward functions in layers.py

    Dropout is one of the most widely used regularization techniques for training (deep) neural networks in practice, and it suppresses overfitting well. During training, each neuron is kept active with probability p. Below is a dropout diagram for a 3-layer network:

    CS231n Convolutional Neural Networks for Visual Recognition.png

    The code is as follows:

    For fc_net.py we only need to modify the loss function:

    __coauthor__ = 'Deeplayer'
    # 6.22.2016 #
    
        def loss(self, X, y=None):    
            """    
            Compute loss and gradient for the fully-connected net.    
            Input / output: Same as TwoLayerNet above.    
            """    
            X = X.astype(self.dtype)    
            mode = 'test' if y is None else 'train'    
            # Set train/test mode for batchnorm params and dropout param since they    
            # behave differently during training and testing.    
            if self.dropout_param is not None: 
                self.dropout_param['mode'] = mode    
            if self.use_batchnorm:
                for bn_param in self.bn_params:
                    bn_param['mode'] = mode
            scores = None    
            h, cache1, cache2, cache3, cache4, bn, out = {}, {}, {}, {}, {}, {}, {}    
            out[0] = X
    
            # Forward pass: compute loss
            for i in xrange(self.num_layers-1):    
                # Unpack variables from the params dictionary    
                W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
                if self.use_batchnorm:        
                    gamma, beta = self.params['gamma' + str(i+1)], self.params['beta' + str(i+1)]        
                    h[i], cache1[i] = affine_forward(out[i], W, b)        
                    bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])        
                    out[i+1], cache3[i] = relu_forward(bn[i])
                    if self.use_dropout:    
                        out[i+1], cache4[i] = dropout_forward(out[i+1], self.dropout_param) 
                else:        
                    out[i+1], cache3[i] = affine_relu_forward(out[i], W, b)
                    if self.use_dropout:    
                        out[i+1], cache4[i] = dropout_forward(out[i+1], self.dropout_param)
            W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
            scores, cache = affine_forward(out[self.num_layers-1], W, b)
    
            # If test mode return early
            if mode == 'test':   
                return scores
    
            loss, reg_loss, grads = 0.0, 0.0, {}
            data_loss, dscores = softmax_loss(scores, y)
            for i in xrange(self.num_layers):    
                reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i+1)]*self.params['W' + str(i+1)])
            loss = data_loss + reg_loss
    
            # Backward pass: compute gradients
            dout, dbn, dh, ddrop = {}, {}, {}, {}
            t = self.num_layers-1
            dout[t], grads['W'+str(t+1)], grads['b'+str(t+1)] = affine_backward(dscores, cache)
            for i in xrange(t):    
                if self.use_batchnorm:
                    if self.use_dropout:    
                        ddrop[t-1-i] = dropout_backward(dout[t-i], cache4[t-1-i])    
                        dout[t-i] = ddrop[t-1-i]     
                    dbn[t-1-i] = relu_backward(dout[t-i], cache3[t-1-i]) 
                    dh[t-1-i], grads['gamma'+str(t-i)], grads['beta'+str(t-i)] = batchnorm_backward(dbn[t-1-i], cache2[t-1-i])       
                    dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_backward(dh[t-1-i], cache1[t-1-i])    
                else:
                    if self.use_dropout:    
                        ddrop[t-1-i] = dropout_backward(dout[t-i], cache4[t-1-i])    
                        dout[t-i] = ddrop[t-1-i]
                    dout[t-1-i], grads['W'+str(t-i)], grads['b'+str(t-i)] = affine_relu_backward(dout[t-i], cache3[t-1-i])
    
            # Add the regularization gradient contribution
            for i in xrange(self.num_layers):    
                grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]
    
            return loss, grads
    

    ---> the dropout_forward and dropout_backward functions in layers.py

    __coauthor__ = 'Deeplayer'
    # 6.22.2016 #
    
    def dropout_forward(x, dropout_param):
        p, mode = dropout_param['p'], dropout_param['mode']
        if 'seed' in dropout_param:  
            np.random.seed(dropout_param['seed'])
    
        mask = None
        out = None
        if mode == 'train':    
            mask = (np.random.rand(*x.shape) < p) / p    
            out = x * mask
        elif mode == 'test':    
            out = x
    
        cache = (dropout_param, mask)
        out = out.astype(x.dtype, copy=False)
    
        return out, cache
    
    
    def dropout_backward(dout, cache):
        dropout_param, mask = cache
        mode = dropout_param['mode']
        dx = None
    
        if mode == 'train':    
            dx = dout * mask
        elif mode == 'test':    
            dx = dout
    
        return dx
    
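    A quick way to see why the mask is divided by p (the "inverted dropout" trick): the expected value of the masked activations then equals the original activations, so nothing needs rescaling at test time. A minimal check (a sketch):

    import numpy as np

    np.random.seed(0)
    p = 0.5                                      # probability of keeping a unit
    x = np.ones((500, 500))
    mask = (np.random.rand(*x.shape) < p) / p    # the same mask as dropout_forward
    print np.mean(x * mask)                      # close to 1.0, the test-time value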

    Once done, we can check the code with Dropout.ipynb, whose final section also compares training with and without dropout:

    Dropout vs Overfitting.png

    Part 2: Convolutional Neural Networks (CNNs)

    Now we come to the core of this course: convolutional neural networks. For visual recognition tasks, CNNs are without question the stars. What makes CNNs better than the fully-connected networks discussed above? I would list the following points:

    1) Weight sharing and local (receptive-field) connectivity make CNNs more like biological neural networks: neurons in the visual cortex receive input locally (they respond only to stimuli in specific receptive fields).
    2) For larger images (96x96, 224x224, 384x384, 512x512, and so on), a fully-connected network must train an enormous number of parameters (weights and biases), which is not only slow but also overfits badly. Weight sharing and local connectivity cut the number of trainable parameters drastically.
    3) CNNs are powerful feature extractors (from edges to parts to whole objects), whereas fully-connected networks have essentially no feature-extraction ability.

    Let's now discuss the structure of CNNs in detail. First, a figure to give a feel for the overall architecture:

    CS231n Convolutional Neural Networks for Visual Recognition.png
    1. Convolutional layer

    The convolutional layer, which can also be called the feature-extraction layer, is the most important part of a CNN. The parameters it trains are a set of filters (I prefer the term convolution kernels), all of the same size and usually square. Suppose we have n filters, each of size k x k (k is usually 3 or 5) over c input channels (c = 1 for grayscale images, c = 3 for color); this layer then trains n x k x k x c weights plus n biases. Weight sharing means one filter extracts exactly one feature: as the filter convolves (slides) across the image, it extracts the same feature at every location of that image, so n filters extract n different features. Here is an animation of the convolution; it shows 6 kernel slices, but they really form 2 filters (each filter has three channels), so two features are extracted:

    CS231n Convolutional Neural Networks for Visual Recognition.gif

    In the animation, a ring of zeros has been added around the image, and the filter moves with a stride of 2. Padding with zeros like this is called zero-padding. Write p for the number of zero-padding rings and s for the stride; then the side length of the output convolved feature (also called the activation map) is L = (input_dim - k + 2p)/s + 1, and the output volume has dimensions L x L x n. Zero-padding exists so that the filter's sliding comes out exactly even, i.e. so that the formula above divides evenly. p, s and n are three hyperparameters we must set in advance. The smaller the stride s, the richer the extracted information, at somewhat higher computational cost; the larger the stride, the cheaper the computation, but the less information is extracted. The usual choice is s = 1.
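    As a quick sanity check of this formula, here is a minimal sketch (the numbers are illustrative):

    def conv_output_size(input_dim, k, p, s):
        # Output side length L = (input_dim - k + 2p)/s + 1; the assert
        # enforces that the filter tiles the padded input exactly.
        assert (input_dim - k + 2 * p) % s == 0, 'filter does not tile the input'
        return (input_dim - k + 2 * p) / s + 1

    print conv_output_size(32, 5, 2, 1)   # 32: padding 2 preserves a 32x32 input
    print conv_output_size(7, 3, 1, 2)    # 4: matches the animation above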

    ---> PS: why does convolution work?
    Natural images have a stationarity property: the statistics of one part of an image are the same as those of any other part. A feature learned on one patch can therefore be applied at every other position, i.e. we can use the same learned features at all locations of the image. (Adapted from UFLDL)


    2. Pooling layer

    After a convolutional layer comes a pooling layer, but note that the convolutional output first passes through an activation function (such as ReLU) before entering the pooling layer. The pooling layer further reduces the dimensionality of the convolutional output, which cuts both the parameter count and the computation. Concretely, the input is split into non-overlapping subregions, and each subregion is reduced to its maximum, its mean, or its L2 norm. We take max pooling (the maximum) as the example, since it tends to work best and is the usual choice. A diagram:


    CS231n Convolutional Neural Networks for Visual Recognition.png

    Usually, the pooling window is 2x2; a minimal numeric example is sketched below.
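    For example, 2x2 max pooling with stride 2 on a single 4x4 channel (a sketch):

    import numpy as np

    x = np.arange(16).reshape(4, 4)               # one 4x4 channel
    # Split into non-overlapping 2x2 blocks and keep the max of each block.
    out = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
    print out                                     # [[ 5  7] [13 15]]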

    Some argue that pooling layers are unnecessary, e.g. Striving for Simplicity: The All Convolutional Net. Others have found that removing pooling matters for generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs). In future architectures, pooling layers may well shrink in number or disappear.


    3. Fully-connected layer

    Many current CNN models use fully-connected layers for the last few layers (usually 1 to 3) to learn richer combinations of features. Note that the last fully-connected layer is the output layer; every fully-connected layer except that last one is followed by an activation function.


    4. CNNs architectures

    The common structure of a CNN can be written as:

    INPUT --> [[CONV --> RELU]*N --> POOL?]*M --> [FC --> RELU]*K --> FC(OUTPUT)
    

    Here "?" means the pooling layer is optional; N (usually 0 <= N <= 3), K (usually 0 <= K < 3) and M (M >= 0) give the numbers of repetitions.

    Note that we prefer a stack of small convolutional filters over a single layer of large filters.
    Take three 3x3 conv layers versus one 7x7 conv layer as an example. As the figure below shows, the two choices produce activation maps of the same size, but the three 3x3 layers are clearly better:
    1) three stacked non-linearities compose features with more expressive power than a single linear layer;
    2) the stack of small filters has fewer parameters, 3 x (3x3) < 7x7 per channel pair (a worked count follows this list);
    3) likewise, backpropagation must keep many intermediate values around for the gradient computation, and the small-filter stack needs to keep fewer of them.
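    Concretely, counting the weights for C input and C output channels per layer (biases omitted), the comparison in point 2 works out as:

    $$3 \times (3 \times 3 \times C \times C) = 27C^2 \;<\; 7 \times 7 \times C \times C = 49C^2$$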

    3_3x3 VS 1_7x7.png

    Here is a diagram of the simplest CNN architecture (input + 1 conv + 1 pool + 2 fc):


    A simple CNNs architecture.png

    Here are a few common types of CNN architecture:

    · INPUT --> FC/OUT      (this is just a linear classifier)
    · INPUT --> CONV --> RELU --> FC/OUT
    · INPUT --> [CONV --> RELU --> POOL]*2 --> FC --> RELU --> FC/OUT
    · INPUT --> [CONV --> RELU --> CONV --> RELU --> POOL]*3 --> [FC --> RELU]*2 --> FC/OUT
    

    ---> PS:
    1. For the input (image) layer, we generally resize images to a square whose side length is a power of 2. For example, CIFAR-10 is 32x32x3, STL-10 is 64x64x3, and ImageNet is 224x224x3 or 512x512x3.

    2. In practice we have to estimate memory use first and then pick sensible values. For example, with 224x224x3 input images and 64 filters of size 3x3 with zero-padding of 1, each image needs about 72MB of memory (covering the image plus its associated parameters, gradients and activations). On a GPU that may not fit (GPU memory is much smaller than CPU memory), so the parameters need adjusting, e.g. 7x7 filters with stride 2 (ZF Net), or 11x11 filters with stride 4 (AlexNet).

    3. The biggest bottleneck in building a practical deep CNN is GPU memory. Many GPUs have only 3/4/6GB, and the largest single cards have 12GB (NVIDIA), so when designing a network we should think hard about where the memory mainly goes:

    • the bulk of the activations and intermediate gradients;
    • the parameters, their gradients during backpropagation, and the caches kept when using momentum, Adagrad, or RMSProp; when estimating the memory used by parameters, multiply by at least 3;
    • the data batches themselves, plus miscellaneous bookkeeping and provenance information.

    Some well-known convolutional networks:
    · LeNet, the first successful application of CNNs, proposed by Yann LeCun in the LeNet paper.
    · AlexNet, which won the 2012 ILSVRC by a wide margin over second place and set off the wave of deep learning.
    · ZF Net, the 2013 ILSVRC winner, which tuned AlexNet's structural hyperparameters and enlarged the middle convolutional layers.
    · GoogLeNet, the 2014 ILSVRC winner, which drastically reduced the parameter count (from 60M to 4M).
    · VGGNet, from the 2014 ILSVRC, which showed that CNN depth is critical to the final performance.
    · ResNet, the 2015 ILSVRC winner and, as of May 10, 2016, the state of the art. Kaiming He et al. have since proposed the improved Identity Mappings in Deep Residual Networks.

    From Kaiming He's ICML16 tutorial

    Part 3: Python programming tasks (3-layer CNNs)

    In this part we need to complete the following:
    1) the following functions in layers.py:
    ---> conv_forward_naive
    ---> conv_backward_naive
    ---> max_pool_forward_naive
    ---> max_pool_backward_naive

    Before giving the convolutional-layer code, let's work out exactly what the forward and backward passes compute. For concreteness, let x[0, :, :, :] be the first image in some batch, with three RGB channels of size 7x7 each; with padding 1 and stride 2, the padded x[0, :, :, :] has size 3x9x9. Assume further that there are 3 filters, each of size 3x3, with w holding all the filter weights (e.g. the first channel of the first filter is w[0, 0, :, :]) and the bias b holding one scalar per filter; the activation maps, written out, then have size 3x4x4 (e.g. the first map is out[0, :, :]).

    With these assumptions, the forward and backward computations work out as in the figures below (the backward-pass image is high resolution; open it in a new tab and zoom in, or download it to view):


    Forward.png Backward.jpg

    The code is as follows:

    __coauthor__ = 'Deeplayer'
    # 6.25.2016 #
    
    def conv_forward_naive(x, w, b, conv_param):
        stride, pad = conv_param['stride'], conv_param['pad']
        N, C, H, W = x.shape
        F, C, HH, WW = w.shape
        x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
        H_new = 1 + (H + 2 * pad - HH) / stride
        W_new = 1 + (W + 2 * pad - WW) / stride
        s = stride
        out = np.zeros((N, F, H_new, W_new))
    
        for i in xrange(N):       # ith image    
            for f in xrange(F):   # fth filter        
                for j in xrange(H_new):            
                    for k in xrange(W_new):                
                        out[i, f, j, k] = np.sum(x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] * w[f]) + b[f]
    
        cache = (x, w, b, conv_param)
    
        return out, cache
    
    
    def conv_backward_naive(dout, cache):
        x, w, b, conv_param = cache
        pad = conv_param['pad']
        stride = conv_param['stride']
        F, C, HH, WW = w.shape
        N, C, H, W = x.shape
        H_new = 1 + (H + 2 * pad - HH) / stride
        W_new = 1 + (W + 2 * pad - WW) / stride
    
        dx = np.zeros_like(x)
        dw = np.zeros_like(w)
        db = np.zeros_like(b)
    
        s = stride
        x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
        dx_padded = np.pad(dx, ((0, 0), (0, 0), (pad, pad), (pad, pad)), 'constant')
    
        for i in xrange(N):       # ith image    
            for f in xrange(F):   # fth filter        
                for j in xrange(H_new):            
                    for k in xrange(W_new):                
                        window = x_padded[i, :, j*s:HH+j*s, k*s:WW+k*s]
                        db[f] += dout[i, f, j, k]                
                        dw[f] += window * dout[i, f, j, k]                
                        dx_padded[i, :, j*s:HH+j*s, k*s:WW+k*s] += w[f] * dout[i, f, j, k]
    
        # Unpad
        dx = dx_padded[:, :, pad:pad+H, pad:pad+W]
    
        return dx, dw, db
    
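    A quick shape check for the naive convolution (a sketch that reuses the worked example above: 7x7 RGB inputs, pad 1, stride 2, three 3x3 filters; it assumes conv_forward_naive is in scope):

    import numpy as np

    x = np.random.randn(2, 3, 7, 7)     # 2 RGB images of size 7x7
    w = np.random.randn(3, 3, 3, 3)     # 3 filters of size 3x3
    b = np.random.randn(3)
    out, _ = conv_forward_naive(x, w, b, {'stride': 2, 'pad': 1})
    print out.shape                     # (2, 3, 4, 4), i.e. 3 maps of size 4x4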

    Once done, check the code with ConvolutionalNetworks.ipynb.

    Here is the code for the (max) pooling layer:

    __coauthor__ = 'Deeplayer'
    # 6.25.2016 #
    
    def max_pool_forward_naive(x, pool_param):
        HH, WW = pool_param['pool_height'], pool_param['pool_width']
        s = pool_param['stride']
        N, C, H, W = x.shape
        H_new = 1 + (H - HH) / s
        W_new = 1 + (W - WW) / s
        out = np.zeros((N, C, H_new, W_new))
        for i in xrange(N):    
            for j in xrange(C):        
                for k in xrange(H_new):            
                    for l in xrange(W_new):                
                        window = x[i, j, k*s:HH+k*s, l*s:WW+l*s] 
                        out[i, j, k, l] = np.max(window)
    
        cache = (x, pool_param)
    
        return out, cache
    
    
    def max_pool_backward_naive(dout, cache):
        x, pool_param = cache
        HH, WW = pool_param['pool_height'], pool_param['pool_width']
        s = pool_param['stride']
        N, C, H, W = x.shape
        H_new = 1 + (H - HH) / s
        W_new = 1 + (W - WW) / s
        dx = np.zeros_like(x)
        for i in xrange(N):    
            for j in xrange(C):        
                for k in xrange(H_new):            
                    for l in xrange(W_new):                
                        window = x[i, j, k*s:HH+k*s, l*s:WW+l*s]
                        m = np.max(window)
                        # Route the gradient to the max position(s); the pooling
                        # windows here do not overlap, so plain '=' is safe.
                        dx[i, j, k*s:HH+k*s, l*s:WW+l*s] = (window == m) * dout[i, j, k, l]
    
        return dx
    

    Again, check the code with ConvolutionalNetworks.ipynb.

    The implementations above use nested for loops, which makes them slow. To speed things up, Assignment 2 provides fast_layers.py, which relies on Cython to build a C extension. Here is a speed comparison between the naive and fast versions; as the figure below shows, the speedup is enormous:


    Naive vs Fast.png

    2) cnn.py, with the code as follows:

    __coauthor__ = 'Deeplayer'
    # 6.25.2016 #
    
    import numpy as np
    from layer_utils import *
    
    class ThreeLayerConvNet(object):    
        """    
        A three-layer convolutional network with the following architecture:       
           conv - relu - 2x2 max pool - affine - relu - affine - softmax
        """
    
        def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,             
                     hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,
                     dtype=np.float32):
            self.params = {}
            self.reg = reg
            self.dtype = dtype
    
            # Initialize weights and biases
            C, H, W = input_dim
            self.params['W1'] = weight_scale * np.random.randn(num_filters, C, filter_size, filter_size)
            self.params['b1'] = np.zeros((1, num_filters))
            # The conv layer uses 'same' padding at stride 1 (see loss below), so it
            # preserves H x W; the 2x2 pool then halves each, hence num_filters*H*W/4.
            self.params['W2'] = weight_scale * np.random.randn(num_filters*H*W/4, hidden_dim)
            self.params['b2'] = np.zeros((1, hidden_dim))
            self.params['W3'] = weight_scale * np.random.randn(hidden_dim, num_classes)
            self.params['b3'] = np.zeros((1, num_classes))
    
            for k, v in self.params.iteritems():    
                self.params[k] = v.astype(dtype)
    
    
        def loss(self, X, y=None):
            W1, b1 = self.params['W1'], self.params['b1']
            W2, b2 = self.params['W2'], self.params['b2']
            W3, b3 = self.params['W3'], self.params['b3']
    
            # pass conv_param to the forward pass for the convolutional layer
            filter_size = W1.shape[2]
            conv_param = {'stride': 1, 'pad': (filter_size - 1) / 2}   # 'same' padding
    
            # pass pool_param to the forward pass for the max-pooling layer
            pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}
    
            # compute the forward pass
            a1, cache1 = conv_relu_pool_forward(X, W1, b1, conv_param, pool_param)
            a2, cache2 = affine_relu_forward(a1, W2, b2)
            scores, cache3 = affine_forward(a2, W3, b3)
    
            if y is None:    
                return scores
    
            # compute the backward pass
            data_loss, dscores = softmax_loss(scores, y)
            da2, dW3, db3 = affine_backward(dscores, cache3)
            da1, dW2, db2 = affine_relu_backward(da2, cache2)
            dX, dW1, db1 = conv_relu_pool_backward(da1, cache1)
    
            # Add regularization
            dW1 += self.reg * W1
            dW2 += self.reg * W2
            dW3 += self.reg * W3
            reg_loss = 0.5 * self.reg * sum(np.sum(W * W) for W in [W1, W2, W3])
    
            loss = data_loss + reg_loss
            grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2, 'W3': dW3, 'b3': db3}
    
            return loss, grads
    

    Once done, check the code with ConvolutionalNetworks.ipynb.

    3) the spatial_batchnorm_forward and spatial_batchnorm_backward functions in layers.py. Before the code, here is a figure showing how Batch Normalization in a CNN computes the mean and standard deviation over a convolutional layer:

    ConvNet Batch Normalization.png

    The code is as follows:

    __coauthor__ = 'Deeplayer'
    # 6.25.2016 #
    
    def spatial_batchnorm_forward(x, gamma, beta, bn_param):
        N, C, H, W = x.shape
        x_new = x.transpose(0, 2, 3, 1).reshape(N*H*W, C)
        out, cache = batchnorm_forward(x_new, gamma, beta, bn_param)
        out = out.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    
        return out, cache
    
    
    def spatial_batchnorm_backward(dout, cache):
        N, C, H, W = dout.shape
        dout_new = dout.transpose(0, 2, 3, 1).reshape(N*H*W, C)
        dx, dgamma, dbeta = batchnorm_backward(dout_new, cache)
        dx = dx.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    
        return dx, dgamma, dbeta
    
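    The reshape trick above treats every spatial position of every image as one sample, so each of the C channels gets a single mean and variance. A shape-only illustration (a sketch):

    import numpy as np

    x = np.random.randn(8, 16, 10, 10)                # (N, C, H, W)
    x_new = x.transpose(0, 2, 3, 1).reshape(-1, 16)   # one row per (n, h, w)
    print x_new.shape                                 # (800, 16): N*H*W samples, C features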

    Once done, check the code with ConvolutionalNetworks.ipynb.

    Taking the ThreeLayerConvNet completed above as the example, let's compare convergence with and without Batch Normalization. As the figure below shows, Batch Normalization clearly speeds up convergence and thus training (far fewer epochs are needed):

    with BN --vs-- without BN.png

    ---> PS:
    1. Data augmentation
    When the dataset is small, augmentation is quite effective and can raise the recognition rate noticeably. The usual methods:
    1) Horizontal flips

    Horizontal flips.png
    2) Random crops/scales
    Random crops/scales.png
    3) Color jitter
    Randomly jitter contrast.png
    4) Get creative
    e.g. translation, rotation, stretching, shearing, optical distortion, and so on.

    Below I give a CNN model and test it on CIFAR-10 (with simple horizontal flips to augment the data); training set: 49000x2, validation set: 1000, test set: 10000. Its layer structure:

               [[conv - relu]x3 - pool]x3 - affine - relu - affine - softmax
    

    The training results:
    · Validation set accuracy: 0.904
    · Test set accuracy: 0.892

    (Figures: training loss & accuracy; CONV layer 1 filters)

    Part 4: Visualizing convolutional neural networks

    Visualization lifts the veil on CNNs and helps us understand what they have actually learned. Some concrete visualization techniques:

    1. Visualizing weights and activations

    Taking AlexNet as the example, here are visualizations of some of the weights and activations of each layer:

    (Figures: CONV layer 1 filters and activations; CONV layer 2 filters and activations; CONV layers 3-5 activations; fully-connected layers 1 & 2; output layer)
    2. Retrieving images that maximally activate a neuron

    We can feed a large set of images through the network and keep track of which ones maximally activate a given neuron; visualizing those images shows what the neuron is looking for in its receptive field in order to classify images correctly. Below is AlexNet's fifth pooling layer (the bald head is collateral damage O__O "...):

    AlexNet: pooling layer 5
    3. Visualizing images with t-SNE on CNN feature vectors

    A CNN can be described as transforming the input image layer by layer into a representation that a linear classifier can separate; this final representation is the set of CNN codes (for example, AlexNet's 4096-dimensional vector just before the classifier), i.e. the feature vectors.

    t-SNE is among the best methods for reducing high-dimensional data to a visualizable form, and its output looks great. Feeding the CNN codes into t-SNE yields a two-dimensional vector for each image (one per feature vector), which can be plotted as below (the closer two images are, the more similar they look to the CNN):

    t-SNE visualization of CNN codes
    4. Occluding parts of the image

    To check whether a CNN classifies by the correct object in the image (rather than by luck), we can occlude parts of the image and test the CNN. The figure below shows that CNNs do rely on the correct object for classification:

    Occluding parts of the image

    Part 5: Transfer Learning

    In practice we rarely train a CNN from scratch, because we rarely have enough data. The usual approach is to take a CNN already trained on a large dataset (such as ImageNet) as an initialization, or as a fixed feature extractor, for the new dataset. A figure to illustrate:

    CS231n Convolutional Neural Networks for Visual Recognition.png

    When the new dataset is not similar to the pretraining data (e.g. medical images), the strategy in the figure needs small adjustments: if the new dataset is small, train a few more of the later layers beyond the linear classifier; if it is large, fine-tune all the layers.

    ---> CS231n: Assignment 1
    ---> CS231n: Assignment 3

      Reader comments

      • berrystrudel: Could you explain this line in softmax_loss?
        loss = -np.sum(np.log(probs[np.arange(N), y])) / N
        What does [np.arange(N), y] mean?
        My understanding is that the loss should use probs[c]
        where c = (i, y[i]), for i in np.arange(N)
        berrystrudel: Written this way, it seems to me that each sample's loss becomes the sum of several entries in a row.
      • 冒绿光的盒子: Hi, I'd like to ask a question:
        dx = np.reshape(dx, x.shape) # (N,d1,...,d_k)
        x_row = x.reshape(x.shape[0], -1)
        In the network's backward pass, what is the dx = np.reshape line for?
        And why does x_row need another reshape? Isn't it redundant, since x.shape[0] is the number of rows and -1 adjusts the rest automatically? Thanks.
      • 横渡: Hi, do you have walkthroughs for the PyTorch and TensorFlow notebooks in assignment2?
      • borelset: There seems to be a problem with how FullyConnectedNet in fc_net.py computes the loss and gradients.
        The number of layers in FullyConnectedNet is no longer fixed, so why is the regularization term still multiplied by 0.5? Isn't that coefficient meant as a correction for the number of layers? Otherwise, the more layers there are, the larger the regularization term grows, and the smaller the share of data_loss in the total loss becomes.
        There is also the corresponding gradient computation:
        grads['W'+str(i+1)] += self.reg * self.params['W' + str(i+1)]
        This line should then carry an extra factor of 2/num_layers,
        because when computing the loss, the L2 norms of the individual W's would be averaged before being added to data_loss, so each np.sum(W*W) would be divided by num_layers; the gradient of W^2 is 2W, and accounting for the division by num_layers in the loss, each W then picks up a factor of 2/num_layers.
        The two-layer network gets away without this term because 2/2 happens to equal 1.
      • JiangJack: Hi, I'd like to ask how your dropout implementation keeps the activation scale consistent between train and test time.
      • 剑鱼闯天涯: Help with the ThreeLayerConvNet model:
        I tried training on CIFAR-10 with your code, but the loss barely converges. I wrote it exactly as in the post, using the Solver from assignment2, and I can't find where the problem is. Could anyone help analyze it, or share complete working code for this example so I can debug it myself?
      • 韧_dc6f: A question: in your modular DNN code, why does X have shape (N, d1, d2, ..., dk) rather than (N, D)? The X_train I get from get_cifar10_data in your net_start script has shape (N, D), so I'm confused. Thanks.
      • YumiTo: Help with compiling fast_layers (Python 3.6).
        First I hit the following error:

        error:Cannot find vcvarsall.bat

        Using the second method from https://github.com/cython/cython/wiki/CythonExtensionsOnWindows
        (compiling with Microsoft Visual C++ Compiler for Python), I then hit:

        error:syntax error in option 'MANIFEST:EMBED,ID=2'

        I've spent most of a day on this Cython build, please help~~~~
        Or if some kind soul could just send me the compiled files, I'd be grateful.
        fb5fc4b578ef: Your write-up is really good. The translated course notes on Zhihu often put things in complicated ways; this walkthrough is much easier to follow.
        YumiTo: I reinstalled my system and the build succeeded with anaconda2 + Microsoft Visual C++ Compiler for Python 2.7.
        Python 2 is the way to go here; fighting with Python 3 was exhausting.
        YumiTo: I almost forgot to thank the author: the derivations for the backprop formulas are very detailed.
      • 21cfcc29fd9b: Thanks for the detailed walkthrough. Is there a problem with the convolution forward-pass figure? It doesn't match the formula beside it: only w0 should act on the 0th feature map, but the picture uses three filters.
      • 32ee24c21b36: Hi, why is the dx-error of my spatial batch normalization larger than yours? I didn't use your approach; I reshape to (n * (h*w)), which makes the feature dimension look larger, so maybe that's why the error is bigger?
        Also, in the "train the net" part of the three-layer conv net, the initial loss is huge and the results afterwards are terrible (with the parameters unchanged). Why is that? Others seem to get decent results without changing the parameters; I don't get it.
      • DoorDIE_bbfa: Thanks a lot! But the starter-code link you provided no longer works and the code can't be found; could you upload a download package? Thanks.
        Deepool: Updated.
        Deepool: @扶朕起来嗨SY Updated.
        扶朕起来嗨SY: Same request here; please re-upload the code, many thanks!
      • 32ee24c21b36: Very well written, big thumbs up. I could write the forward passes with all the for loops, but being weak with matrix operations I couldn't write the backward passes; your diagrams helped a lot, thanks :smiley:
      • Earnest_62f7: In relu_backward(), when differentiating the ReLU, shouldn't the entries where dout > 0 become 1? Why is it dout?
        Earnest_62f7: @Deeplayer Got it, thank you very much.
        Deepool: Backpropagation uses the chain rule: dx = 1 * dout (if x > 0), dx = 0 * dout = 0 (if x <= 0). The dout here is not the derivative of out (out = ReLU(x)) with respect to x; it is the derivative of the loss with respect to out.
      • PaKing: Where can I find the ConvolutionalNetworks.ipynb code??
      • f9b70b0aa5bb: Your assignment write-up helped me a lot; let me share my code too:

        def conv_forward_victor(x, w, b, conv_param):
            N, C, H, W = x.shape
            F, _, HH, WW = w.shape
            stride, pad = conv_param['stride'], conv_param['pad']
            # Check dimensions
            h_flash = H + 2 * pad
            w_flash = W + 2 * pad
            assert (h_flash - HH) % stride == 0, 'width does not work'
            assert (w_flash - WW) % stride == 0, 'height does not work'
            out_h = 1 + (h_flash - HH) / stride
            out_w = 1 + (w_flash - WW) / stride
            x_pad = np.zeros([C, N, h_flash, w_flash])
            x_pad[:, :, pad:-pad, pad:-pad] = x.transpose(1, 0, 2, 3)
            x_new = np.zeros([C, HH, WW, N, out_h, out_w])   # the rearranged x; preallocate the memory
            for i in xrange(HH):
                for j in xrange(WW):
                    # e.g. for i=0, j=0, gather the top-left element of every window into one matrix
                    x_new[:, i, j] = x_pad[:, :, i::stride, j::stride][:, :, :out_h, :out_w]
            x_r = x_new.reshape(C * HH * WW, -1)
            out_new = np.dot(w.reshape(F, -1), x_r) + b[:, np.newaxis]
            out = out_new.reshape(F, N, out_h, out_w).transpose(1, 0, 2, 3)
            cache = (x, w, b, conv_param, x_r)
            return out, cache

        A tiny bit faster than conv_forward_fast!
      • yxyswag: Hi, in the derivation of the conv layer's backward pass, with out = w*x, why is dw = dout*x? From my calculus it should be dout = dw*x; why is it written the other way? Hope you can explain, thanks!
        Deepool: @yxyswag These are partial derivatives. From out = x*w + b alone you would get dw = x, but by the backpropagation chain rule dw = dout*x, because this out is not the network's final output.
      • 312d14f8b871: A tip for anyone whose Cython build fails: compile it under a pure-English path, then copy the result over.
      • 1a7a5b9a3a78: Very well written; I never understood the conv-layer derivatives until the diagrams made them much easier to follow.
        Deepool: Thanks ;)
      • c93787129954: Nice write-up; whenever I hit a problem I come back to see how you solved it.
        My conv_forward_naive uses matrix operations and may be a bit more efficient; sharing it here:

        #############################################################################
        # TODO: Implement the convolutional forward pass.                           #
        # Hint: you can use the function np.pad for padding.                        #
        #############################################################################
        stride = conv_param.get('stride')
        pad = conv_param.get('pad')

        # padding first
        if pad > 0:
            # only pad heights and widths
            x = np.pad(x, ((0,0), (0,0), (pad,pad), (pad,pad)), 'constant', constant_values=0)

        # use the H and W after padding
        N, C, H, W = x.shape
        F, C, HH, WW = w.shape

        D = C*HH*WW   # dimension of one filter

        # reshape to [F, D]
        flat_w = w.reshape(F, D)   # [F, D]
        # output sizes
        out_H = 1 + (H - HH) / stride
        out_W = 1 + (W - WW) / stride

        out = np.zeros([out_H, out_W, N, F])   # will transpose to [N, F, out_H, out_W]

        # slide over the output grid, width direction first
        for index_H in xrange(out_H):
            for index_W in xrange(out_W):
                offset_H = index_H * stride   # top-left pixel of the input window
                offset_W = index_W * stride
                # the input region across all channels of all images
                conv_input = x[:, :, offset_H:offset_H + HH, offset_W:offset_W + WW].\
                    reshape(N, D)   # reshape to [N, D]
                # [N, D] dot [D, F]
                value = conv_input.dot(flat_w.T) + b   # [N, F]
                out[index_H][index_W] = value

        # transpose back to [N, F, out_H, out_W]
        out = out.transpose(2, 3, 0, 1)
        #############################################################################
        #                             END OF YOUR CODE                              #
        #############################################################################
        f9b70b0aa5bb: I found np.pad() to be a pitfall that ate a lot of my time; writing a replacement yourself is easier.
        Deepool: Thanks :)
      • 3158e2d7732f: Hi, could you explain how to import im2col_cython so that fast_layers can be used? I installed Cython and generated the im2col_cython.c file with setup.py, but fast_layers still fails to import im2col_cython. Thanks.
        3158e2d7732f: @GOODSTUDY1 I've solved it too, thanks~
        GOODSTUDY1: @syk2118 I solved it; if you haven't yet, we can compare notes.
        GOODSTUDY1: @syk2118 Did you solve it? Same question: how do you get this working?
      • 卑鄙的我_: Very clearly explained.
        卑鄙的我_: @GOODSTUDY1
        Compile the Cython extension: Convolutional Neural Networks require a very
        efficient implementation. We have implemented the functionality using
        Cython; you will need to compile the Cython extension
        before you can run the code. From the cs231n directory, run the following
        command:

        python setup.py build_ext --inplace


        卑鄙的我_: @GOODSTUDY1 Find the folder that contains setup.py and run "python setup.py build_ext --inplace" from cmd; the details are in the README in the root directory.
        GOODSTUDY1: @苟利国家 Do you know how to compile fast_layers? It errors out for me on Windows.
