CS231n (winter 2016) : Assignmen

Author: Deepool | Published: 2016-06-20 00:11 | Views: 32758

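Preface:

This post follows the Python programming assignments of Stanford's cs231n course and uses them to work through the main ideas of the course, along with some of the mathematical derivations. The course materials and code are listed below:
Videos and slides
Notes
assignment1 starter code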

Part 1: Linear classifier

Score function: maps the raw data (that is, the input data after preprocessing) to a score for each class; the higher the score, the more likely the class.
Loss (cost) function: quantifies the agreement between the predicted scores and the ground truth labels.
Optimization: minimizes the loss function with respect to the parameters (weights) of the score function.
Datasets (images, in this course):
· Training set (training dataset): used to train the model, i.e. to fit its parameters.
· Validation set (validation dataset): used to tune the model's hyperparameters so that the model performs as well as possible.
· Test set (test dataset): used to measure how well the model generalizes.

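Suppose we have an image training set X, a matrix of size [N,D], where N is the number of samples and D is the dimension of each sample. x_i is the i-th row of X, i.e. the i-th sample.
y is a vector of size [1,N]; y_i is the true class of the i-th sample, y_i = 1,2,3, ...,C.

Suppose we have the linear mapping: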

---------------------------------------------> f(x_i, W, b) = x_iW + b <----------------------------------------

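Here W is the weight matrix of size [D,C], and the j-th column of W is the linear mapping of x_i onto class j (1 ≤ j ≤ C); b is the bias vector of size [1,C]. For a single sample x_i, f has size [1,C]; applied to the whole matrix X, it has size [N,C]. (ps: the formula is arranged this way to match the code; all formulas below are written with the code in mind.)
The value of f(x_i, W, b) is the score of x_i on each class. Our final goal is to learn W and b such that, across the whole training set, the true class of each sample receives the highest score.
To make this more intuitive, here is an example (it differs slightly from the formula above, but that does not get in the way):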

CS231n Convolutional Neural Networks for Visual Recognition.png

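The scores in the figure say that the image is most likely a dog, which shows that W and b have not been trained well.
For the geometric interpretation and the template interpretation of linear classifiers, see the cs231n notes; they are not repeated here.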

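To simplify the computation, we can merge b into W by appending b as the last row of W, so that W has size [D+1,C]. Each x_i then needs an extra constant dimension equal to 1, so x_i has size [1,D+1] (do not forget this when coding); the mapping above becomes f(x_i, W) = x_iW.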
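The bias trick can be checked numerically. The following is a minimal sketch on toy data; the sizes and the names `W_ext` and `X_ext` are illustrative assumptions and do not come from the assignment code.

```python
import numpy as np

rng = np.random.RandomState(0)
N, D, C = 4, 5, 3                          # toy sizes, chosen only for illustration
X = rng.randn(N, D)
W = rng.randn(D, C)
b = rng.randn(1, C)

scores_separate = X.dot(W) + b             # f = XW + b, shape (N, C)

W_ext = np.vstack([W, b])                  # b becomes the last row of W: (D+1, C)
X_ext = np.hstack([X, np.ones((N, 1))])    # append a constant 1 to every sample: (N, D+1)
scores_merged = X_ext.dot(W_ext)           # f = XW with the bias folded in

bias_trick_ok = np.allclose(scores_separate, scores_merged)
```

Both forms give the same scores, so the rest of the code only has to handle a single matrix W.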

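Data preprocessing (Part 3 explains why preprocessing is needed)
In machine learning, normalizing the input features (here, pixel values in [0,255]) is very common and often necessary, especially for datasets whose dimensions vary widely in scale. For images, however, mean subtraction is usually enough, because every pixel dimension already lies in [0,255]: compute the mean image of the training set, then subtract it from every image in the training, validation, and test sets (normalization and whitening are usually unnecessary). In numpy this is: X -= np.mean(X, axis=0).

1. Multiclass SVM loss

SVM loss: for each image sample, the score of the correct class should be higher than the score of every incorrect class by at least Δ. (In practice Δ is usually set to 1. Δ is not treated as a parameter because changing it is equivalent to rescaling W, so training W is enough.) A figure helps here: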

CS231n Convolutional Neural Networks for Visual Recognition.png

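The linear mapping above takes the pixel values of the i-th sample as input and outputs its scores on the C classes, forming a score vector of size [1,C]. We write s_j = f(x_i, W)_j for the score of the i-th sample on class j. The multiclass SVM loss is then: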

------------------------------------------> L_i = ∑_{j≠y_i} max(0, s_j−s_{y_i}+Δ) <-----------------------------------

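The expression shows that L_i = 0 when s_{y_i} >= s_j + Δ for every j ≠ y_i, meaning the sample is classified correctly with enough margin; otherwise L_i > 0, meaning at least one wrong class violates the margin.
We can rewrite L_i as:

-------------------------------------> L_i = ∑_{j≠y_i} max(0, (x_iW)_j − (x_iW)_{y_i} + Δ) <-----------------------------

The function max(0, -) above is called the hinge loss. Sometimes max(0, -)² is used instead, which is called the squared hinge loss SVM (or L2-SVM); it penalizes violations more heavily. Cross-validation can decide between the two forms (in most cases the former is used). (ps: there is a blog post introducing the hinge loss here.)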
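A small worked example of L_i for one sample, as a hedged sketch: the function name `svm_loss_single` and the score values are made up for illustration. With scores (3.2, 5.1, -1.7), correct class 0 and Δ = 1, only the second class violates the margin, so L_i is about 5.1 − 3.2 + 1 = 2.9.

```python
import numpy as np

def svm_loss_single(scores, y, delta=1.0, squared=False):
    """Hinge loss L_i of one sample; `scores` is the [1, C] score vector, `y` the true class."""
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0                      # the sum runs over j != y_i only
    if squared:
        margins = margins ** 2            # squared hinge loss (L2-SVM)
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])       # illustrative scores for C = 3 classes
loss_hinge = svm_loss_single(scores, y=0)
loss_squared = svm_loss_single(scores, y=0, squared=True)
```

The squared variant turns the same violation into roughly 2.9² = 8.41, which is the sense in which it punishes errors more heavily.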

2. Regularization

The loss function above has a flaw: W is not unique. If some W gives a loss of 0, then so does λW (λ>1). To obtain a unique W for classification, we add a regularization penalty R(W), usually the squared L2 norm:

-------------------------------------------> R(W) = ∑_k∑_s (W_{k,s})² <-----------------------------------------

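With the penalty added, the full loss function is:

------------------------------------------> L = (1/N)∑_iL_i + λR(W) <----------------------------------------

where λ can be chosen by cross-validation.

The most important effect of penalizing the parameters is to prevent overfitting and improve the generalization of the model. The bias b does not control how strongly the input features influence the scores, so it does not need to be penalized (b is merged into W here, so assignment1 actually penalizes b as well, but the effect is small).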

The partial derivatives of L with respect to W are needed later to solve for W, so we give them here (the derivation is simple; Δ is set to 1 directly):

---------------------------> ∇_{W_{y_i}} L_i = − x_i^T (∑_{j≠y_i} 1(x_iW_j − x_iW_{y_i} + 1 > 0)) + 2λW_{y_i} <----------------------
----------------------------> ∇_{W_j} L_i = x_i^T 1(x_iW_j − x_iW_{y_i} + 1 > 0) + 2λW_j , (j≠y_i) <-----------------------

Here 1(·) is the indicator function: 1(expression is true) = 1 and 1(expression is false) = 0.
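The gradient formulas can be sanity-checked against centered finite differences. This is a sketch under stated assumptions: the helper names and the toy data are invented here, and the penalty is taken as λ∑W² (gradient 2λW) as in the formulas above, whereas the Part 2 listing uses the 0.5 * reg convention. Small random weights are used so that every margin stays away from the kink of the hinge, where finite differences are unreliable.

```python
import numpy as np

def svm_loss_and_grad(W, X, y, reg):
    """Multiclass SVM loss (delta = 1) with penalty reg * sum(W^2), and its gradient."""
    num_train = X.shape[0]
    rows = np.arange(num_train)
    scores = X.dot(W)                                    # (N, C)
    margins = scores - scores[rows, y].reshape(-1, 1) + 1.0
    margins[rows, y] = 0.0                               # the sum skips j == y_i
    margins = np.maximum(0.0, margins)
    loss = margins.sum() / num_train + reg * np.sum(W * W)

    indicator = (margins > 0).astype(float)              # 1(x_i W_j - x_i W_{y_i} + 1 > 0)
    indicator[rows, y] = -indicator.sum(axis=1)          # correct class: minus the violation count
    dW = X.T.dot(indicator) / num_train + 2.0 * reg * W
    return loss, dW

def numerical_gradient(f, W, h=1e-5):
    """Centered finite differences of the scalar function f at W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old                                     # restore the entry
        grad[idx] = (f_plus - f_minus) / (2.0 * h)
    return grad

rng = np.random.RandomState(0)
X = rng.randn(10, 5)
y = rng.randint(0, 4, size=10)
W = 0.01 * rng.randn(5, 4)     # small weights keep all margins close to 1, away from the kink
reg = 0.1

loss, dW = svm_loss_and_grad(W, X, y, reg)
dW_num = numerical_gradient(lambda w: svm_loss_and_grad(w, X, y, reg)[0], W)
max_abs_error = np.max(np.abs(dW - dW_num))
```

A tiny `max_abs_error` indicates that the analytic gradient matches the loss it is supposed to differentiate.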

3. Softmax classifier

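Softmax is the generalization of binary logistic regression to multiclass problems.
The function f stays the same, and the hinge loss is replaced by the cross-entropy loss, whose expression is (log is the natural logarithm, log(e) = 1):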

--------------------------------------> L_i = -log(exp(f_{y_i})/∑_j exp(f_j)) <--------------------------------------

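Here f_j(z) = exp(z_j)/∑_k exp(z_k) is called the softmax function. Its output is a probability distribution of the input sample x_i over the C classes, and the expression above is a cross-entropy between distributions (not a relative entropy, although it may look like one; L_i is rewritten below to show its true form. Cross-entropy can be viewed as the sum of entropy and relative entropy).
Recall the cross-entropy formula from information theory: H(p,q) = -∑_x p(x) log q(x), where p is the true distribution and q is the fitted distribution. Now rewrite L_i:

----------------------------------> L_i = -∑_k p_{i,k} log(exp(f_k)/∑_j exp(f_j)) <-----------------------------------

where p_i = [0,0, ...,0,1,0, ...,0,0], p_{i,k} = p_i[k], p_i has size [1,C], and only p_i[y_i] = 1 while all other entries are 0. With this one-hot p_i the sum reduces to the single term k = y_i, which is exactly the L_i given first.

When computing the softmax function in code, numerical stability can be a problem: exp(f_{y_i}) and ∑_j exp(f_j) can become very large, and dividing large numbers is numerically unstable. To avoid this, we can use the following trick: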

CS231n Convolutional Neural Networks for Visual Recognition.png

Here the constant C (unrelated to the number of classes) is usually chosen so that logC = -max_j f_j, i.e. -logC is the maximum of each row of f.
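As a hedged illustration of the shift by the row maximum, the sketch below compares a softmax computed after subtracting max_j f_j with what a naive implementation would face. The function name `softmax_stable` and the score values are illustrative assumptions.

```python
import numpy as np

def softmax_stable(f):
    """Softmax over the last axis, shifted by the row maximum (log C = -max_j f_j)."""
    f = np.asarray(f, dtype=float)
    shifted = f - np.max(f, axis=-1, keepdims=True)   # largest entry becomes 0
    e = np.exp(shifted)                               # every exponent is <= 0, so no overflow
    return e / np.sum(e, axis=-1, keepdims=True)

f = np.array([123.0, 456.0, 789.0])                   # exp(789) overflows in float64
probs = softmax_stable(f)

# Shifting all scores by a constant leaves the probabilities unchanged.
shift_invariant = np.allclose(softmax_stable([1.0, 2.0, 3.0]),
                              softmax_stable([101.0, 102.0, 103.0]))
```

Evaluating `np.exp(f) / np.sum(np.exp(f))` directly on the same scores produces `inf / inf`, i.e. `nan`, which is the failure the shift avoids.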

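Now, combined with the penalty term, the total loss function is:

---------------------> L = -(1/N)∑_i∑_k 1(k=y_i) log(exp(f_k)/∑_j exp(f_j)) + λR(W) <-----------------------

The partial derivative of L with respect to W is needed later to solve for W; we state the result first and then derive it:

--------------> ∇_{W_k} L = -(1/N)∑_i x_i^T (p_{i,k} − P_k) + 2λW_k, where P_k = exp(f_k)/∑_j exp(f_j) <----------

The derivation is as follows: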

Derivative of softmax loss function.png
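The result of the derivation can be checked numerically. The sketch below assumes the penalty λ∑W² (gradient 2λW) used in the formulas of this part; the helper names and the random toy data are illustrative only.

```python
import numpy as np

def softmax_loss_and_grad(W, X, y, reg):
    """Cross-entropy loss with penalty reg * sum(W^2), and its gradient w.r.t. W."""
    num_train = X.shape[0]
    rows = np.arange(num_train)
    f = X.dot(W)
    f = f - f.max(axis=1, keepdims=True)          # numeric stability shift
    P = np.exp(f)
    P /= P.sum(axis=1, keepdims=True)             # P[i, k] = softmax probability of class k
    loss = -np.log(P[rows, y]).sum() / num_train + reg * np.sum(W * W)

    p = np.zeros_like(P)
    p[rows, y] = 1.0                              # one-hot true distribution p_i
    dW = -X.T.dot(p - P) / num_train + 2.0 * reg * W
    return loss, dW

def numerical_gradient(f, W, h=1e-5):
    """Centered finite differences of the scalar function f at W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old
        grad[idx] = (f_plus - f_minus) / (2.0 * h)
    return grad

rng = np.random.RandomState(1)
X = rng.randn(8, 6)
y = rng.randint(0, 3, size=8)
W = rng.randn(6, 3)
reg = 0.05

loss, dW = softmax_loss_and_grad(W, X, y, reg)
dW_num = numerical_gradient(lambda w: softmax_loss_and_grad(w, X, y, reg)[0], W)
max_abs_error = np.max(np.abs(dW - dW_num))
```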

The figure below gives a direct feel for how the SVM and Softmax losses are computed differently:

CS231n Convolutional Neural Networks for Visual Recognition.png

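4. Optimization

Optimization is the process of training the parameters (weights and biases) on the training set to minimize the loss function. The validation set is then used to tune the hyperparameters (learning rate, regularization strength λ, and so on) to obtain the best model, and the test set is used to measure how well the model generalizes.

We usually train the parameters with gradient descent combined with backpropagation. For the parameter update we use the vanilla update here (Part 3, on neural networks, covers other update strategies), i.e. x += - learning_rate * dx, where x is the parameter being updated.
There are many variants of gradient descent; we usually use mini-batch gradient descent. See the course notes for details.
ps: In the programming assignment the hints say to use stochastic gradient descent (SGD), but the code actually uses mini-batches. So when you hear that someone optimizes with SGD, do not be surprised: in practice they are using mini-batches.

Backpropagation is really just the chain rule. It is not expanded here; see the course notes. In fact its result has already been given: the partial derivatives above. It is covered in more detail in the neural network part.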

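Part 2: Python programming assignment (linear classifier)

· The IDE I use is Pycharm.
· For the linear classifier part of Assignment1 we need to complete linear_svm.py, softmax.py, and linear_classifier.py. Afterwards, you can use the code in svm.ipynb and softmax.ipynb to debug your model, find the best model, and then measure its accuracy on the test set.
· Assignment1 uses the CIFAR-10 image dataset, which you can also download from here.

The code of linear_svm.py: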

__coauthor__ = 'Deeplayer'
# 5.19.2016

import numpy as np
def svm_loss_naive(W, X, y, reg):
    """
    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means 
         that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    dW = np.zeros(W.shape)   # initialize the gradient as zero
    # compute the loss and the gradient
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in xrange(num_train):    
        scores = X[i].dot(W)    
        correct_class_score = scores[y[i]]
        for j in xrange(num_classes):
            if j == y[i]:    
                continue
            margin = scores[j] - correct_class_score + 1   # note delta = 1
            if margin > 0:
                loss += margin
                dW[:, y[i]] += -X[i, :]     # compute the correct_class gradients
                dW[:, j] += X[i, :]         # compute the wrong_class gradients
    # Right now the loss is a sum over all training examples, but we want it
    # to be an average instead so we divide by num_train.
    loss /= num_train
    dW /= num_train
    # Add regularization to the loss.
    loss += 0.5 * reg * np.sum(W * W)
    dW += reg * W
    return loss, dW

def svm_loss_vectorized(W, X, y, reg):
    """
    Structured SVM loss function, vectorized implementation.Inputs and outputs 
    are the same as svm_loss_naive.
    """
    loss = 0.0
    dW = np.zeros(W.shape)   # initialize the gradient as zero
    scores = X.dot(W)        # N by C
    num_train = X.shape[0]
    num_classes = W.shape[1]
    scores_correct = scores[np.arange(num_train), y]   # 1 by N
    scores_correct = np.reshape(scores_correct, (num_train, 1))  # N by 1
    margins = scores - scores_correct + 1.0     # N by C
    margins[np.arange(num_train), y] = 0.0
    margins[margins <= 0] = 0.0
    loss += np.sum(margins) / num_train
    loss += 0.5 * reg * np.sum(W * W)
    # compute the gradient
    margins[margins > 0] = 1.0
    row_sum = np.sum(margins, axis=1)                  # 1 by N
    margins[np.arange(num_train), y] = -row_sum        
    dW += np.dot(X.T, margins)/num_train + reg * W     # D by C
  
    return loss, dW

The code of softmax.py:

__coauthor__ = 'Deeplayer'
# 5.19.2016

import numpy as np

def softmax_loss_naive(W, X, y, reg):    

    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)    # D by C
    dW_each = np.zeros_like(W)
    num_train, dim = X.shape
    num_class = W.shape[1]
    f = X.dot(W)    # N by C
    # Considering the Numeric Stability
    f_max = np.reshape(np.max(f, axis=1), (num_train, 1))   # N by 1
    prob = np.exp(f - f_max) / np.sum(np.exp(f - f_max), axis=1, keepdims=True) # N by C
    y_trueClass = np.zeros_like(prob)
    y_trueClass[np.arange(num_train), y] = 1.0
    for i in xrange(num_train):
        for j in xrange(num_class):    
            loss += -(y_trueClass[i, j] * np.log(prob[i, j]))    
            dW_each[:, j] = -(y_trueClass[i, j] - prob[i, j]) * X[i, :]
        dW += dW_each
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    dW /= num_train
    dW += reg * W

    return loss, dW

def softmax_loss_vectorized(W, X, y, reg):    
    """    
    Softmax loss function, vectorized version.    

    Inputs and outputs are the same as softmax_loss_naive.    
    """    
    # Initialize the loss and gradient to zero.    
    loss = 0.0    
    dW = np.zeros_like(W)    # D by C    
    num_train, dim = X.shape

    f = X.dot(W)    # N by C
    # Considering the Numeric Stability
    f_max = np.reshape(np.max(f, axis=1), (num_train, 1))   # N by 1
    prob = np.exp(f - f_max) / np.sum(np.exp(f - f_max), axis=1, keepdims=True)
    y_trueClass = np.zeros_like(prob)
    y_trueClass[range(num_train), y] = 1.0    # N by C
    loss += -np.sum(y_trueClass * np.log(prob)) / num_train + 0.5 * reg * np.sum(W * W)
    dW += -np.dot(X.T, y_trueClass - prob) / num_train + reg * W

    return loss, dW

The code of linear_classifier.py:

__coauthor__ = 'Deeplayer'
# 5.19.2016

from linear_svm import *
from softmax import *

class LinearClassifier(object):    

    def __init__(self):        
        self.W = None    

    def train(self, X, y, learning_rate=1e-3, reg=1e-5, num_iters=100, 
                          batch_size=200, verbose=True):
        """        
        Train this linear classifier using stochastic gradient descent.   
  
        Inputs:       
        - X: A numpy array of shape (N, D) containing training data; there are N          
             training samples each of dimension D.        
        - y: A numpy array of shape (N,) containing training labels; y[i] = c          
             means that X[i] has label 0 <= c < C for C classes.        
        - learning_rate: (float) learning rate for optimization.        
        - reg: (float) regularization strength.        
        - num_iters: (integer) number of steps to take when optimizing
        - batch_size: (integer) number of training examples to use at each step.        
        - verbose: (boolean) If true, print progress during optimization.

        Outputs:         
        A list containing the value of the loss function at each training iteration.
        """
        num_train, dim = X.shape
        # assume y takes values 0...K-1 where K is number of classes
        num_classes = np.max(y) + 1  
        if self.W is None:
            # lazily initialize W
            self.W = 0.001 * np.random.randn(dim, num_classes)   # D by C

        # Run stochastic gradient descent(Mini-Batch) to optimize W
        loss_history = []
        for it in xrange(num_iters): 
            X_batch = None
            y_batch = None
            # The assignment hints that sampling with replacement is faster;
            # this version samples without replacement (replace=False).
            sample_index = np.random.choice(num_train, batch_size, replace=False)
            X_batch = X[sample_index, :]   # batch_size by D
            y_batch = y[sample_index]      # 1 by batch_size
            # evaluate loss and gradient
            loss, grad = self.loss(X_batch, y_batch, reg)
            loss_history.append(loss)

            # perform parameter update
            self.W += -learning_rate * grad
            if verbose and it % 100 == 0:
                print 'Iteration %d / %d: loss %f' % (it, num_iters, loss)

        return loss_history

    def predict(self, X):    
        """    
        Use the trained weights of this linear classifier to predict labels for   
        data points.    

        Inputs:    
        - X: D x N array of training data. Each column is a D-dimensional point.    

        Returns:    
        - y_pred: Predicted labels for the data in X. y_pred is a 1-dimensional     
                  array of length N, and each element is an integer giving the 
                  predicted class.  
        """
        y_pred = np.zeros(X.shape[1])    # 1 by N
        y_pred = np.argmax(np.dot(self.W.T, X), axis=0)

        return y_pred

    def loss(self, X_batch, y_batch, reg):   
        """    
        Compute the loss function and its derivative.    
        Subclasses will override this.    

        Inputs:    
        - X_batch: A numpy array of shape (N, D) containing a minibatch of N 
                   data points; each point has dimension D.    
        - y_batch: A numpy array of shape (N,) containing labels for the minibatch.
        - reg: (float) regularization strength.   

        Returns: A tuple containing:    
        - loss as a single float    
        - gradient with respect to self.W; an array of the same shape as W   
        """    
        pass

class LinearSVM(LinearClassifier):   
    """ 
    A subclass that uses the Multiclass SVM loss function 
    """    
    def loss(self, X_batch, y_batch, reg):        
        return svm_loss_vectorized(self.W, X_batch, y_batch, reg)

class Softmax(LinearClassifier):   
    """ 
    A subclass that uses the Softmax + Cross-entropy loss function 
    """    
    def loss(self, X_batch, y_batch, reg):        
        return softmax_loss_vectorized(self.W, X_batch, y_batch, reg)

Below is the code that tunes the hyperparameters to find the best model, together with some results and figures:

1) LinearClassifier_svm_start.py

__coauthor__ = 'Deeplayer'
# 5.20.2016

import numpy as np
import matplotlib.pyplot as plt
import math
from linear_classifier import *
from data_utils import load_CIFAR10

# Load the raw CIFAR-10 data.
cifar10_dir = 'E:/PycharmProjects/ML/CS231n/cifar-10-batches-py' # u should change this
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
# As a sanity check, we print out the size of the training and test data.
print 'Training data shape: ', X_train.shape     # (50000,32,32,3)
print 'Training labels shape: ', y_train.shape   # (50000L,)
print 'Test data shape: ', X_test.shape          # (10000,32,32,3)
print 'Test labels shape: ', y_test.shape        # (10000L,)
print

# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 
                'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):    
    idxs = np.flatnonzero(y_train == y)    
    idxs = np.random.choice(idxs, samples_per_class, replace=False) 
    for i, idx in enumerate(idxs):        
        plt_idx = i * num_classes + y + 1 
        plt.subplot(samples_per_class, num_classes, plt_idx)   
        plt.imshow(X_train[idx].astype('uint8'))        
        plt.axis('off')       
        if i == 0:            
            plt.title(cls)
plt.show()

# Split the data into train, val, and test sets.
num_training = 49000
num_validation = 1000
num_test = 1000
mask = range(num_training, num_training + num_validation)
X_val = X_train[mask]                  # (1000,32,32,3)
y_val = y_train[mask]                  # (1,1000)
mask = range(num_training)
X_train = X_train[mask]                # (49000,32,32,3)
y_train = y_train[mask]                # (1,49000)
mask = range(num_test)
X_test = X_test[mask]                  # (1000,32,32,3)
y_test = y_test[mask]                  # (1,1000)

# Preprocessing1: reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))    # (49000,3072)
X_val = np.reshape(X_val, (X_val.shape[0], -1))          # (1000,3072)
X_test = np.reshape(X_test, (X_test.shape[0], -1))       # (1000,3072)

# Preprocessing2: subtract the mean image
mean_image = np.mean(X_train, axis=0)       # (1,3072)
X_train -= mean_image
X_val -= mean_image
X_test -= mean_image

# Visualize the mean image
plt.figure(figsize=(4, 4))
plt.imshow(mean_image.reshape((32, 32, 3)).astype('uint8'))
plt.show()

# Bias trick, extending the data
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])    # (49000,3073)
X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])          # (1000,3073)
X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])       # (1000,3073)

# Use the validation set to tune hyperparameters (regularization strength 
# and learning rate).
learning_rates = [1e-7, 5e-5]
regularization_strengths = [5e4, 1e5]
results = {}
best_val = -1    # The highest validation accuracy that we have seen so far.
best_svm = None   # The LinearSVM object that achieved the highest validation rate.
iters = 1500
for lr in learning_rates:
    for rs in regularization_strengths:    
        svm = LinearSVM()    
        svm.train(X_train, y_train, learning_rate=lr, reg=rs, num_iters=iters)    
        Tr_pred = svm.predict(X_train.T)    
        acc_train = np.mean(y_train == Tr_pred)    
        Val_pred = svm.predict(X_val.T)    
        acc_val = np.mean(y_val == Val_pred)    
        results[(lr, rs)] = (acc_train, acc_val)    
        if best_val < acc_val:
            best_val = acc_val
            best_svm = svm

# print results
for lr, reg in sorted(results):    
    train_accuracy, val_accuracy = results[(lr, reg)]    
    print ('lr %e reg %e train accuracy: %f val accuracy: %f' %
           (lr, reg, train_accuracy, val_accuracy))
print 'Best validation accuracy achieved during validation: %f' % best_val # around 38.2%

# Visualize the learned weights for each class
w = best_svm.W[:-1, :]   # strip out the bias
w = w.reshape(32, 32, 3, 10)
w_min, w_max = np.min(w), np.max(w)
classes = ['plane', 'car', 'bird', 'cat', 'deer', 
                    'dog', 'frog', 'horse', 'ship', 'truck']
for i in xrange(10):    
    plt.subplot(2, 5, i + 1)    
    # Rescale the weights to be between 0 and 255    
    wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)    
    plt.imshow(wimg.astype('uint8'))    
    plt.axis('off')    
    plt.title(classes[i])
plt.show()

# Evaluate the best svm on test set
Ts_pred = best_svm.predict(X_test.T)
test_accuracy = np.mean(y_test == Ts_pred)     # around 37.1%
print 'LinearSVM on raw pixels of CIFAR-10 final test set accuracy: %f' % test_accuracy

Below are visualizations of some of the original images, the mean image, and the learned weights:

figure_1.png

figure_2.png

figure_3.png

2) LinearClassifier_softmax_start.py

__coauthor__ = 'Deeplayer'
# 5.20.2016

import numpy as np
from data_utils import load_CIFAR10
from linear_classifier import *

def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):  
    """ 
    Load the CIFAR-10 dataset from disk and perform preprocessing to prepare 
    it for the linear classifier. These are the same steps as we used for the SVM, 
    but condensed to a single function.  
    """  
    # Load the raw CIFAR-10 data 
    cifar10_dir = 'E:/PycharmProjects/ML/CS231n/cifar-10-batches-py'   # make a change
    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)  
    # subsample the data  
    mask = range(num_training, num_training + num_validation)
    X_val = X_train[mask]  
    y_val = y_train[mask]  
    mask = range(num_training)  
    X_train = X_train[mask]  
    y_train = y_train[mask]  
    mask = range(num_test)  
    X_test = X_test[mask]  
    y_test = y_test[mask]  
    # Preprocessing: reshape the image data into rows  
    X_train = np.reshape(X_train, (X_train.shape[0], -1))  
    X_val = np.reshape(X_val, (X_val.shape[0], -1)) 
    X_test = np.reshape(X_test, (X_test.shape[0], -1))  
    # subtract the mean image  
    mean_image = np.mean(X_train, axis=0)  
    X_train -= mean_image  
    X_val -= mean_image  
    X_test -= mean_image  
    # add bias dimension and transform into columns  
    X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])  
    X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])  
    X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])  

    return X_train, y_train, X_val, y_val, X_test, y_test

# Invoke the above function to get our data.
X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()

# Use the validation set to tune hyperparameters (regularization strength 
# and learning rate).
results = {}
best_val = -1
best_softmax = None
learning_rates = [1e-7, 5e-7]
regularization_strengths = [5e4, 1e4]
iters = 1500
for lr in learning_rates:    
    for rs in regularization_strengths:        
        softmax = Softmax()       
        softmax.train(X_train, y_train, learning_rate=lr, reg=rs, num_iters=iters)        
        Tr_pred = softmax.predict(X_train.T)       
        acc_train = np.mean(y_train == Tr_pred)       
        Val_pred = softmax.predict(X_val.T)        
        acc_val = np.mean(y_val == Val_pred)       
        results[(lr, rs)] = (acc_train, acc_val)       
        if best_val < acc_val:           
            best_val = acc_val            
            best_softmax = softmax

# Print out results.
for lr, reg in sorted(results):    
    train_accuracy, val_accuracy = results[(lr, reg)]    
    print ('lr %e reg %e train accuracy: %f val accuracy: %f' %
           (lr, reg, train_accuracy, val_accuracy))
        # around 38.9%                     
print 'best validation accuracy achieved during cross-validation: %f' % best_val

# Evaluate the best softmax on test set.
Ts_pred = best_softmax.predict(X_test.T)
test_accuracy = np.mean(y_test == Ts_pred)       # around 37.4%
print 'Softmax on raw pixels of CIFAR-10 final test set accuracy: %f' % test_accuracy

Finally, taking the SVM as an example, we compare the speed of the vectorized and the non-vectorized implementation:
--> naive_vs_vectorized.py

__coauthor__ = 'Deeplayer'
# 5.20.2016

import time
import numpy as np
from linear_svm import *
from data_utils import load_CIFAR10

def get_CIFAR10_data(num_training=49000, num_dev=500):  

    # Load the raw CIFAR-10 data  
    cifar10_dir = 'E:/PycharmProjects/ML/CS231n/cifar-10-batches-py'   # make a change
    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)  
    mask = range(num_training)  
    X_train = X_train[mask]  
    mask = np.random.choice(num_training, num_dev, replace=False)    
    X_dev = X_train[mask]  
    y_dev = y_train[mask]  

    X_train = np.reshape(X_train, (X_train.shape[0], -1))  
    X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))    

    mean_image = np.mean(X_train, axis=0)  
    X_dev -= mean_image  
    X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])  

    return X_dev, y_dev

X_dev, y_dev = get_CIFAR10_data()
# generate a random SVM weight matrix of small numbers
W = np.random.randn(3073, 10) * 0.0001
tic = time.time()
loss_naive, grad_naive = svm_loss_naive(W, X_dev, y_dev, 0.00001)
toc = time.time()
print 'Naive loss and gradient: computed in %fs' % (toc - tic)    # around 0.198s

tic = time.time()
loss_vectorized, grad_vectorized = svm_loss_vectorized(W, X_dev, y_dev, 0.00001)
toc = time.time()
print 'Vectorized loss and gradient: computed in %fs' % (toc - tic)    # around 0.005s

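Part 3: Neural Networks

A neural network is a multi-layer structure built from many artificial neurons. Artificial neurons are inspired by the human brain, but compared with biological neurons they are only a very coarse model. The figure below compares a biological neuron with its mathematical model: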

CS231n Convolutional Neural Networks for Visual Recognition.png

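From the mathematical model in the figure, an artificial neuron works as follows:
inner product of the input x and the weights w ----> the result is fed into the activation function ---> the activation function outputs the signal
The perceptron and the sigmoid neuron are two important artificial neurons that carry the key ideas of neural networks (see Neural Networks and Deep Learning by Michael Nielsen).

We start with the sigmoid neuron; here is a figure: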

Neural Networks and Deep Learning.png

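A sigmoid neuron has several inputs x₁, x₂, x₃, ...; each input has a weight w₁, w₂, ..., and there is one overall bias b. The output is output = σ(w·x+b), where σ is called the sigmoid function and is defined as: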

---------------------------------------------> σ(z) = 1/(1+e^-z) <---------------------------------------------

The curve of σ is:

Neural Networks and Deep Learning.png

Its shape is a smoothed version of the step function:

Neural Networks and Deep Learning.png

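The smoothness of σ is the key reason it works as an activation function: it means that small changes in the weights and bias, Δw_j and Δb, produce a small change Δoutput in the output of the neuron. In fact, Δoutput is well approximated by: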

Neural Networks and Deep Learning.png

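The formula shows that Δoutput is a linear function of the changes in the weights and bias. This linearity makes it easy to choose small changes in the weights and bias that achieve any desired small change in the output.

Next, the structure of a neural network (ps: this means a feedforward neural network, which has no loops; information always propagates forward and is never fed back). A neural network usually has the following structure: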

The Architecture of Neural Networks.png

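The figure shows a 3-layer neural network with two hidden layers, and adjacent layers are fully connected. The input layer holds the (preprocessed) image data, so its number of neurons equals the dimension of the input image. A neural network can have one or more hidden layers; a network with multiple layers is called an artificial neural network (ANN). The last hidden layer can be seen as a feature vector of the input image. The number of neurons in the output layer equals the number of classes, and the output values can be read as the scores on each class.

For classification, depending on the chosen loss function (SVM loss function or softmax loss function), the output layer of the network can also be viewed as an SVM layer or a Softmax layer. The activation functions are nonlinear, so a neural network is a nonlinear classifier.
---> (ps: the output-layer neurons do not contain the activation function f)

The multi-layer structure gives a neural network very strong representational power (the deeper the network and the more neurons, the stronger it is). In other words, a neural network can approximate any continuous function! A visual proof can be found here. However, the more hidden layers or neurons there are, the more easily overfitting occurs, and regularization (L2 regularization, dropout, and so on) is then needed to control it.

Next we discuss each part of a neural network in detail:

1. Choosing the activation function

We have already introduced the sigmoid function, but in practice it is rarely used because it has several drawbacks. First, look at the derivative of σ: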

Neural Networks and Deep Learning.png

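The figure shows that the derivative of the sigmoid lies between 0 and 0.25. During backpropagation, σ′ is multiplied into the gradient, and the gradient of an earlier layer is a product that contains the terms of the later layers. The gradient therefore shrinks toward the front of the network and slowly approaches 0; this is the vanishing gradient problem. To see why the gradient vanishes, consider a simplified 4-layer model with a single neuron per layer: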

The Vanishing Gradient Problem.png

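Here C is the cost function, a_j = σ(z_j) (note that a₄ = z₄), and z_j = w_j a_{j-1} + b_j; we call z_j the weighted input of the neuron. Now consider the gradient ∂C/∂b₁ of the first hidden neuron. We give the expression directly (for the proof, see here):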

The Vanishing Gradient Problem.png

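When the weights satisfy |w_j| < 1, ∂C/∂b₁ is 1/16 of ∂C/∂b₃ or smaller, because each additional layer contributes a factor w_jσ′(z_j) whose magnitude is at most 1/4; this is the root cause of the vanishing gradient. As a result, the earlier hidden layers of a deep network learn more slowly than the later ones, the more so the closer they are to the input, until they effectively stop learning.
---> ps: This problem appears whatever activation function is used, but some activation functions reduce it. Batch Normalization deserves a mention here: it alleviates the vanishing gradient problem to a large extent, bravo!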
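The 1/16 bound can be made concrete with a few numbers. In the one-neuron-per-layer chain, dividing the chain-rule expression for ∂C/∂b₁ by the one for ∂C/∂b₃ leaves σ′(z₁)·w₂·σ′(z₂)·w₃. The sketch below evaluates that ratio; the function names and the chosen values of z and w are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # never larger than 1/4, reached at z = 0

def bias_gradient_ratio(z1, z2, w2, w3):
    """(dC/db1) / (dC/db3) for the single-neuron chain: sigma'(z1) * w2 * sigma'(z2) * w3."""
    return sigmoid_prime(z1) * w2 * sigmoid_prime(z2) * w3

# Best case for the early layer: both derivatives at their maximum and |w| = 1.
best_case = bias_gradient_ratio(0.0, 0.0, 1.0, 1.0)

# A more typical case with |w| < 1 and inputs away from 0.
typical = bias_gradient_ratio(1.5, -2.0, 0.8, -0.5)
```

`best_case` equals 1/16, and any other choice with |w| ≤ 1 gives a ratio of smaller magnitude, so the first layer always receives the weaker signal.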

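Sigmoid has two further drawbacks:
First, when the input to the sigmoid is very small or very large, its derivative approaches 0, so the gradient in backpropagation approaches 0 as well; the neuron saturates early and can no longer be updated properly.
Second, the output (activation) of a sigmoid neuron is always greater than 0. Taking the simplified 4-layer model above as an example, during backpropagation the gradients are then all positive or all negative (depending on the sign of ∂C/∂a₄). In other words, all the weights w connected to the same neuron (including the bias b) increase together or decrease together. This is a problem because some weights may need to move in different directions (there is no rigorous proof, but this is more reasonable). We therefore usually want the output of an activation function to be symmetric about 0.

Below are some activation functions that perform better than sigmoid: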

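1) Tanh
A tanh neuron replaces the sigmoid function with the hyperbolic tangent, defined as:

-------------------------------------> tanh(z) = (e^z−e^-z)/(e^z+e^-z) <----------------------------------------

This can also be written as tanh(z) = 2σ(2z)−1, so tanh can be seen as a rescaled sigmoid. Its advantage over sigmoid is that its output is symmetric about 0. Its curve is: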

Neural Networks and Deep Learning.png

2) Rectified Linear Unit (ReLU)
ReLU has been a popular activation function for image recognition in recent years. It is defined as:

------------------------------------------> f(z) = max(0, z) <------------------------------------------------

Its curve is:

Neural Networks and Deep Learning.png

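The advantages of ReLU are that it does not saturate for positive inputs, it converges fast (it reaches an extremum of the cost function quickly), and it is cheap to compute. But this fast convergence also makes ReLU fragile: if an update is too large, a neuron can be pushed into the region where its input is below 0 before a good value has been found. Its gradient then becomes 0 and the neuron can no longer be updated; it is effectively dead. Controlling the learning rate is therefore very important for ReLU neurons. In addition, its output is not zero-mean.
---> ps: In the neural network part of assignment1, we choose ReLU as the activation function.

3) Leaky ReLU (LReLU)
Leaky ReLU is an improved version of ReLU that addresses this dead-neuron drawback. It is defined as:

-------------------------------------------> f(z)=max(αz, z) <------------------------------------------------

where α is a small positive value (e.g. 0.01). Its curve is: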

figure_4.png

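4) Maxout
Maxout is a generalization of ReLU and LReLU:

----------------------------------------------> max(z₁, z₂) <--------------------------------------------------

Here z₁ and z₂ are two separate weighted inputs, each with its own weights and bias, so this approach doubles the number of parameters.

5) Exponential Linear Units (ELU)
The formula of ELU is:

ELU.png

Its curve is: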

figure_5.png

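Besides the advantages of LReLU, ELU also has the nice property that its outputs are close to zero-mean; the price is a higher computational cost.
---> ps: We usually use only one kind of activation function in a neural network.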
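For reference, the activation functions of this section can be written in a few lines of numpy. This is a sketch, not the assignment code: the function names are chosen here, the ELU follows its usual definition (z for z > 0, α(e^z − 1) otherwise) with α = 1 as an assumed default, and maxout is written for two linear pieces.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)           # valid for 0 < alpha < 1

def elu(z, alpha=1.0):
    # z for z > 0, alpha * (exp(z) - 1) otherwise; np.minimum keeps exp from overflowing
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def maxout(x, W1, b1, W2, b2):
    # max(z1, z2) with two independent weighted inputs, hence twice the parameters
    return np.maximum(x.dot(W1) + b1, x.dot(W2) + b2)

z = np.linspace(-3.0, 3.0, 13)
tanh_identity_holds = np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

# ReLU is the special case of maxout in which one of the two pieces is identically 0.
x = np.array([[1.0, -2.0]])
maxout_as_relu = maxout(x, np.zeros((2, 2)), 0.0, np.eye(2), 0.0)
```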

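2. Data preprocessing

As in Part 1, suppose we have an image training set X, a matrix of size [N,D], where N is the number of samples and D is the dimension of each sample. x_i is the i-th row of X, i.e. the i-th sample. y is a vector of size [1,N]; y_i is the true class of the i-th sample, y_i = 1,2,3, ...,C.

Common preprocessing techniques:
· Mean subtraction
· Normalization
· Principal component analysis (PCA) and whitening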

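For images we usually apply only mean subtraction (benefit 1: natural images are stationary, i.e. the statistics of every dimension follow the same distribution. Mean subtraction removes the average brightness of an image; we are not interested in the illumination but in the content. Benefit 2: the data becomes symmetric about 0): X -= np.mean(X, axis=0). We can go one step further and normalize, i.e. divide each dimension by its standard deviation: X /= np.std(X, axis = 0). We usually do not whiten, because it is too expensive (it requires the eigendecomposition of the covariance matrix). For details on data preprocessing, see UFLDL and the course notes.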
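A minimal sketch of mean subtraction followed by normalization, using random numbers as a stand-in for flattened images (the array names and sizes are assumptions for illustration). The statistics are computed on the training set only and then reused for the validation set.

```python
import numpy as np

rng = np.random.RandomState(0)
X_train = rng.uniform(0, 255, size=(200, 12))   # stand-in for flattened training images
X_val = rng.uniform(0, 255, size=(50, 12))      # stand-in for validation images

mean = np.mean(X_train, axis=0)                 # per-dimension mean of the training set
std = np.std(X_train, axis=0)                   # per-dimension standard deviation

X_train_pre = (X_train - mean) / std            # zero mean, unit variance per dimension
X_val_pre = (X_val - mean) / std                # same statistics, never recomputed on X_val
```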

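---> PS1: There is one more preprocessing step: flattening the image into a vector. For an image of size [d₁,d₂], the flattened size is [1,D] with D=d₁d₂. We usually do not count this as preprocessing.

---> PS2: Why preprocess at all? Because preprocessing spreads the data out around the origin and speeds up convergence, i.e. it helps us reach a (local) minimum of the cost function faster. To make this intuitive, I drew the figure below (using 2-D data as an example):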

data preprocessing.png

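The figure uses ReLU neurons as an example, ReLU(wx+b) = max(wx+b,0), and the green and red lines are the lines wx+b=0. Before mean subtraction only the red lines split the data, so only a small part of the randomly initialized parameters does useful work, and convergence during backpropagation is slow. After mean subtraction most of the lines split the data, so convergence is much faster.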

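3. Choosing the weight initialization

We usually initialize the weights randomly from a Gaussian with mean 0 and a very small positive standard deviation (e.g. 0.001); in numpy: w = 0.001 * np.random.randn(n). This works for small neural networks (the programming part of assignment1 uses it).

For deep neural networks, however, this initialization is poor. Taking tanh as the activation function: if the standard deviation is small, the activations of the later layers all shrink toward 0 and the gradients in backpropagation also become very small; if the standard deviation is larger, the neurons saturate and the gradients approach zero.

To solve this, we can calibrate the variance:
· Experience shows that convergence is faster when the outputs of all neurons have similar distributions. The random initialization above makes the output distributions differ considerably from neuron to neuron.
· Ignoring the activation function, let the weighted input of a neuron be s=∑_iw_ix_i. The variances of s and x are related as follows: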

CS231n Convolutional Neural Networks for Visual Recognition.png

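The result shows that for s to have the same variance as the input x, the variance of w must be 1/n. The initialization therefore becomes: w = np.random.randn(n) / sqrt(n).

With ReLU as the activation function, the output distributions of the layers differ again. This paper (He et al., as cited in the CS231n notes) studies the problem and proposes the modification w = np.random.randn(n) * sqrt(2.0/n), which resolves it.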
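The effect of the two scalings can be observed by pushing random data through a stack of ReLU layers and measuring the mean square of the final activations. This is a rough experiment under assumed settings (width 256, 10 layers, batch of 200, fixed seed), not part of the assignment.

```python
import numpy as np

def mean_square_after_relu_layers(scale_fn, n=256, num_layers=10, batch=200, seed=0):
    """Mean of a**2 after `num_layers` fully-connected ReLU layers of width n."""
    rng = np.random.RandomState(seed)
    a = rng.randn(batch, n)                       # unit-variance input
    for _ in range(num_layers):
        W = rng.randn(n, n) * scale_fn(n)         # weight scale under test
        a = np.maximum(0.0, a.dot(W))             # ReLU layer without bias
    return np.mean(a ** 2)

ms_calibrated = mean_square_after_relu_layers(lambda n: 1.0 / np.sqrt(n))   # randn / sqrt(n)
ms_he = mean_square_after_relu_layers(lambda n: np.sqrt(2.0 / n))           # randn * sqrt(2/n)
```

With the 1/sqrt(n) scaling each ReLU layer roughly halves the signal, so after 10 layers it has shrunk by a factor of about 2¹⁰, while the sqrt(2/n) scaling keeps it at the same order of magnitude as the input.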

For the biases, we can simply initialize them to 0.

4. Batch Normalization

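Batch Normalization inserts a normalization layer between wx+b and f(wx+b) in every layer. It normalizes wx+b to mean 0 and variance 1; in the original paper the authors, for the sake of stability, add two parameters that scale and shift the data back again, and these two parameters are trained as well (the Assignment2 part covers this in detail). Put simply, the data of every layer is preprocessed once more. A figure for intuition: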

Batch Normalization.png

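This method speeds up convergence further, so the learning rate can be raised somewhat and training becomes faster. Overfitting is also alleviated to some degree, so Dropout can be dropped or reduced and the L2 regularization coefficient can be lowered, which speeds up training once more. In short, Batch Normalization reduces our reliance on regularization.
Today's deep neural networks almost all use Batch Normalization.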
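A minimal training-time forward pass of such a layer might look as follows. It is a sketch only: the function name, `eps`, and the toy data are assumptions, and the running averages needed at test time as well as the backward pass are left out.

```python
import numpy as np

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale by gamma and shift by beta."""
    mu = x.mean(axis=0)                       # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)     # mean 0, variance close to 1
    return gamma * x_hat + beta               # the two trainable parameters

rng = np.random.RandomState(0)
pre_activation = 5.0 * rng.randn(128, 4) + 3.0        # stands in for wx + b of one layer

normalized = batchnorm_forward_train(pre_activation, np.ones(4), np.zeros(4))
rescaled = batchnorm_forward_train(pre_activation, 2.0 * np.ones(4), np.ones(4))
```

With gamma = 1 and beta = 0 the output is simply the standardized input; other values let the network choose a different mean and scale for each feature.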

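5. Choosing the regularization

Here we keep using L2 regularization (for L1 regularization and max-norm constraints, see the course notes) to penalize the weights W and control overfitting. In deep neural networks (such as convolutional neural networks, covered in the later Assignment2 post) L2 regularization is also the usual choice, with Dropout added to control overfitting further. Dropout is covered in detail in the Assignment2 part.

6. Choosing the loss function

The loss (cost) function consists of two parts, the data loss and the regularization loss: L = 1/N ∑_iL_i + λR(W). The common choices are the SVM hinge loss and the softmax cross-entropy loss (we only consider datasets in which each sample has exactly one correct class; for other classification problems and for regression, see the course notes). Here we choose the softmax cross-entropy loss.

7. Computing the gradient with backpropagation

Taking a 3-layer neural network with ReLU as the activation function f and the softmax cross-entropy loss as an example, the figure gives the complete computation of the gradients of every layer (the image has a high resolution, so open it in a new tab and zoom in, or download it. In the figure, the size of W³ should be [H,C]):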

compute the gradient.jpg

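8. Parameter update strategies

1) Vanilla update
The simplest update, i.e. the standard form of what we usually call SGD.

2) Momentum update (SGD+Momentum)
This method improves on the vanilla update. To understand momentum, picture gradient descent as a ball rolling toward the bottom of a valley. Momentum changes gradient descent in two ways to resemble this physical picture. First, it introduces a velocity. The gradient changes the velocity rather than the position directly, just as a force in physics changes the velocity and only indirectly affects the position. Second, it introduces a friction term that gradually reduces the velocity. The update rule is:

----------------------------------------------) v --> v' = μv - λdx (-------------------------------------------
------------------------------------------------) x --> x' = x + v' (---------------------------------------------

Here x is the parameter being updated (W and b), λ is the learning rate, v starts at 0, and μ is the hyperparameter that controls the amount of friction, with a value in (0,1); the most common setting is 0.9 (cross-validation can also select μ, typically from [0.5, 0.9, 0.95, 0.99]).
The formula shows that the velocity is built up by repeatedly adding gradient terms, so it grows as long as successive gradients point the same way, which lets momentum run faster than standard gradient descent. At the same time μ makes the velocity decay gradually near the bottom of the valley, so the ball finally comes to rest there instead of oscillating back and forth.
---> ps: SGD+Momentum is the most common update; it is the one we use here.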
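To see the speed-up on a concrete problem, the sketch below runs the vanilla update and the momentum update on a narrow quadratic valley. The objective, the learning rate 0.02, μ = 0.9, and the 100 steps are illustrative choices made here, not values from the assignment.

```python
import numpy as np

curvature = np.array([1.0, 50.0])      # f(x) = 0.5 * sum(curvature * x**2), a narrow valley

def loss(x):
    return 0.5 * np.sum(curvature * x ** 2)

def grad(x):
    return curvature * x

def run_vanilla(x0, learning_rate, num_steps):
    x = x0.copy()
    for _ in range(num_steps):
        x += -learning_rate * grad(x)             # vanilla update
    return x

def run_momentum(x0, learning_rate, mu, num_steps):
    x = x0.copy()
    v = np.zeros_like(x)                          # velocity starts at 0
    for _ in range(num_steps):
        v = mu * v - learning_rate * grad(x)      # the gradient changes the velocity
        x += v                                    # the velocity changes the position
    return x

x0 = np.array([1.0, 1.0])
loss_vanilla = loss(run_vanilla(x0, 0.02, 100))
loss_momentum = loss(run_momentum(x0, 0.02, 0.9, 100))
```

In this setting the flat direction limits the vanilla update, whose error there shrinks by only 2% per step, while the momentum run ends with a clearly lower loss after the same number of steps.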

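3) Nesterov Momentum (SGD+Nesterov Momentum)
An improved version of the momentum update that also converges slightly better in practice. To understand Nesterov Momentum, merge the two momentum update rules into one:

--------------------------------------------) x --> x' = (x + μv) - λdx (--------------------------------------

The formula shows that (x + μv) is where x is about to go next. Yet the gradient is still evaluated at x, as dx, whereas we would like to look ahead and evaluate d(x + μv) so that the descent is faster. A figure to help (the big red dot marks the current position of the parameter x):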

CS231n Convolutional Neural Networks for Visual Recognition.png

The Nesterov Momentum update rule is then:

------------------------------------------> x_ahead = x + μv <---------------------------------------------
-----------------------------------------> v = μv - λdx_ahead <--------------------------------------------
-------------------------------------------------> x = x + v <--------------------------------------------------

In practice we modify it slightly; the corresponding code is:

v_prev = v                              
v = mu * v - learning_rate * dx         # same as in the momentum update
x += -mu * v_prev + (1 + mu) * v        # the new position update
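The rewritten update is the look-ahead rule expressed in terms of the stored variable x_ahead = x + μv. The sketch below runs both forms side by side on a toy quadratic and checks that they stay in step; the objective and the hyperparameter values are assumptions made for this illustration.

```python
import numpy as np

curvature = np.array([1.0, 10.0])

def grad(x):
    return curvature * x                 # gradient of 0.5 * sum(curvature * x**2)

mu, learning_rate = 0.9, 0.05

# Look-ahead form: the gradient is evaluated at x + mu * v.
x = np.array([1.0, -2.0])
v = np.zeros_like(x)

# Rewritten form: the stored variable plays the role of x_ahead = x + mu * v.
x_stored = x.copy()
v_stored = np.zeros_like(x)

for _ in range(50):
    x_ahead = x + mu * v
    v = mu * v - learning_rate * grad(x_ahead)
    x = x + v

    v_prev = v_stored
    v_stored = mu * v_stored - learning_rate * grad(x_stored)
    x_stored += -mu * v_prev + (1 + mu) * v_stored

forms_agree = np.allclose(x_stored, x + mu * v)
```

The practical benefit of the rewritten form is that the gradient is always evaluated at the stored parameters, so it plugs into the same training loop as the other updates.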

If you want to dig into the mathematics of Nesterov Momentum, see:
· Advances in optimizing Recurrent Networks by Yoshua Bengio, Section 3.5.
· Ilya Sutskever's thesis, contains an exposition of the topic in section 7.2

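8.1. Learning rate decay

During training it is usually worthwhile to decay the learning rate gradually as training progresses. The intuition is the same hill-to-valley picture: far from the valley we can take large steps, but as we approach it the steps must get smaller so that we do not overshoot.
Common ways to decay the learning rate:
1) Step decay: after each epoch (1 epoch = N/batch_size iterations) the learning rate is reduced by a constant factor, λ' = kλ; k is typically 0.9 or 0.95, and can also be chosen by cross-validation.
2) Exponential decay: α = α₀e^(−kt), where α is the learning rate (written λ above), α₀ and k are hyperparameters, and t is the iteration number.
3) 1/t decay: α = α₀/(1+kt), where α₀ and k are hyperparameters and t is the iteration number.

In practice step decay is the usual choice, because it has few hyperparameters and is cheap to compute.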
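
The three schedules can be written as small functions. This is only a sketch: the function names and the argument order are chosen for this example and do not come from the assignment code.

```python
import math

def step_decay(lr0, k, epoch):
    # Multiply the learning rate by k once per completed epoch
    return lr0 * (k ** epoch)

def exponential_decay(lr0, k, t):
    # alpha = alpha0 * exp(-k * t), t is the iteration number
    return lr0 * math.exp(-k * t)

def inverse_time_decay(lr0, k, t):
    # alpha = alpha0 / (1 + k * t), t is the iteration number
    return lr0 / (1.0 + k * t)

for epoch in range(3):
    print(epoch, step_decay(1e-3, 0.95, epoch))
```

In the train method later in this post, step decay corresponds to the line learning_rate *= learning_rate_decay, executed once per epoch.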

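Everything above assumes a single global learning rate. Tuning that learning rate is time-consuming and error-prone, so there has long been interest in methods that adapt the learning rate automatically, even per parameter. A few common adaptive methods are introduced briefly below:

1) Adagrad
Adagrad is an adaptive learning-rate algorithm proposed by Duchi et al. in the paper Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. It can be implemented as: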

# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)

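The benefit of this method is that weights receiving large gradients have their effective learning rate reduced, while weights receiving small gradients have theirs increased relative to the rest. Note that the square root is important. The smoothing term eps avoids division by zero and is usually set between 1e-4 and 1e-8.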
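
A minimal sketch of the per-parameter scaling, assuming a made-up gradient with one very large and one very small component. The function name adagrad_step and the numbers are illustrative only.

```python
import numpy as np

def adagrad_step(x, dx, cache, learning_rate=0.1, eps=1e-8):
    # Accumulate squared gradients, then scale each coordinate's step by them
    cache = cache + dx ** 2
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)
    return x, cache

x = np.array([1.0, 1.0])
cache = np.zeros_like(x)
dx = np.array([100.0, 0.01])   # one large and one small gradient component

x_new, cache = adagrad_step(x, dx, cache)
step = x - x_new
print(step)
```

On this first step both coordinates move by roughly the learning rate, even though the raw gradients differ by four orders of magnitude.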

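2) RMSprop
RMSprop is an effective adaptive learning-rate method that has not been formally published. It makes a simple change to Adagrad to temper its aggressive, ever-shrinking learning rate, replacing the running sum of squared gradients with a moving average: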

cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)

Here decay_rate is a hyperparameter, typically chosen from [0.9, 0.99, 0.999].
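
The difference between the two caches shows up most clearly under a constant gradient. The sketch below is an idealized setting assumed purely for illustration (real gradients are not constant); the variable names are chosen for this example.

```python
import numpy as np

learning_rate, eps, decay_rate = 0.01, 1e-8, 0.9
dx = 2.0   # constant gradient, an assumption made only for this comparison

cache_ada, cache_rms = 0.0, 0.0
for t in range(1000):
    cache_ada += dx ** 2                                             # Adagrad: running sum
    cache_rms = decay_rate * cache_rms + (1 - decay_rate) * dx ** 2  # RMSprop: moving average

# Size of the next step each method would take
step_ada = learning_rate * dx / (np.sqrt(cache_ada) + eps)
step_rms = learning_rate * dx / (np.sqrt(cache_rms) + eps)
print(step_ada, step_rms)
```

Adagrad's step shrinks like 1/sqrt(t) because its cache keeps growing, whereas the RMSprop cache levels off near dx**2 and the step stays close to the learning rate.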

3) Adam
Adam is somewhat like RMSProp combined with momentum, and it often works slightly better than RMSProp. A simplified version is:

m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)

The paper recommends eps = 1e-8, beta1 = 0.9, beta2 = 0.999. The full Adam update also includes a bias-correction step that compensates for m and v being initialized to zero.
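
A sketch of the update with the bias correction included, assuming t counts update steps starting from 1. The function name adam_step and the sample gradient are illustrative only.

```python
import numpy as np

def adam_step(x, dx, m, v, t, learning_rate=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving averages of the gradient and of its square
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * (dx ** 2)
    # Bias correction: undo the pull toward the zero initialization
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

x = np.array([1.0, -2.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
dx = np.array([0.5, -4.0])

x1, m, v = adam_step(x, dx, m, v, t=1)
print(x - x1)
```

With the correction, the very first step has magnitude close to learning_rate in each coordinate; without it, the first step of the simplified code would be about (1 - beta1) / sqrt(1 - beta2) times that, roughly 3.2 times larger with the recommended settings.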

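---> PS: SGD+Nesterov Momentum or Adam is the recommended choice for parameter updates.

Some other methods:
·Adadelta by Matthew Zeiler
·Unit Tests for Stochastic Optimization

The animations below illustrate how the loss is minimized under the different parameter update methods mentioned above: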

opt1.gif

opt2.gif

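9. Hyperparameter optimization

Training a neural network involves tuning many hyperparameters, which is usually done on the validation set. The hyperparameters we need to tune here are:
·the initial learning rate
·the learning rate decay factor
·the regularization strength (including the L2 penalty and the dropout ratio)

Training a deep neural network once takes a long time, so it is worth spending time on a hyperparameter search beforehand to settle on good values. The most direct approach is to build into the framework a worker that keeps sampling hyperparameters and training, recording for each setting the state and performance on the validation set after every epoch. In practice n-fold cross-validation is rarely used to choose these hyperparameters for neural networks; a single fixed validation set is usually enough.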

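For the initial learning rate, the usual search is learning_rate = 10 ** uniform(-6, 1): train for about 5 epochs, narrow the range, train for more epochs, and finally fix the initial learning rate, which typically ends up around 1e-3. The regularization strength is usually searched on a log scale in the same way; the sequence [0.5, 0.9, 0.95, 0.99] quoted earlier applies to the momentum coefficient μ, not to regularization.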
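
A sketch of sampling candidates on a log scale. The learning-rate range is the one quoted above; the regularization range (1e-5 to 10), the fixed seed and the function name sample_hyperparams are assumptions made for this example.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the sketch is repeatable

def sample_hyperparams(num_trials):
    trials = []
    for _ in range(num_trials):
        # Sample the exponent uniformly, i.e. search on a log scale
        learning_rate = 10 ** rng.uniform(-6, 1)
        reg = 10 ** rng.uniform(-5, 1)   # assumed range, adjust to your problem
        trials.append((learning_rate, reg))
    return trials

trials = sample_hyperparams(20)
for learning_rate, reg in trials[:3]:
    print('lr %.2e, reg %.2e' % (learning_rate, reg))
```

Each sampled pair would then be used for a short training run, keeping the settings with the best validation accuracy and narrowing the ranges around them.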

10. Visualizing the training process

1) Watch the loss function to judge whether your learning rate is well chosen:

loss function.jpeg

In practice the loss curve is not as smooth as in the figure above and fluctuates; the figure below shows the loss while actually training on CIFAR-10:

CIFAR10_loss.jpeg

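You may notice that the curve above wiggles up and down; this is related to the batch size. With a very small batch size the fluctuations are large, and with a larger batch size the curve is relatively more stable.

2) Watch the training/validation accuracy to judge whether overfitting has occurred: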

accuracies.jpeg

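Part 4: Python programming task (2-layer neural network)

· For the neural network part of Assignment1 we need to complete neural_net.py. Once done, use the code in two_layer_net.ipynb (part of which you must complete yourself) to debug the model, tune the hyperparameters, obtain the best model, and finally measure classification performance on the test set.
· The image dataset is again CIFAR-10.

The code for neural_net.py is as follows: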

__coauthor__ = 'Deeplayer'
# 6.14.2016 

import numpy as np

class TwoLayerNet(object):    
    """    
    A two-layer fully-connected neural network. The net has an input dimension of    
    D, a hidden layer dimension of H, and performs classification over C classes.    
    The network has the following architecture:    
    input - fully connected layer - ReLU - fully connected layer - softmax
    The outputs of the second fully-connected layer are the scores for each class.
    """
    def __init__(self, input_size, hidden_size, output_size, std=1e-4): 
        self.params = {}    
        self.params['W1'] = std * np.random.randn(input_size, hidden_size)   
        self.params['b1'] = np.zeros((1, hidden_size))    
        self.params['W2'] = std * np.random.randn(hidden_size, output_size)   
        self.params['b2'] = np.zeros((1, output_size))

    def loss(self, X, y=None, reg=0.0):
        """    
        Compute the loss and gradients for a two layer fully connected neural network.
        """
        # Unpack variables from the params dictionary
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        N, D = X.shape

        # Compute the forward pass
        scores = None
        h1 = ReLU(np.dot(X, W1) + b1)      # hidden layer 1  (N,H)
        out = np.dot(h1, W2) + b2          # output layer    (N,C)
        scores = out                       # (N,C)  
        if y is None:   
            return scores

        # Compute the loss
        loss = None
        # Considering the Numeric Stability
        scores_max = np.max(scores, axis=1, keepdims=True)    # (N,1)
        # Compute the class probabilities
        exp_scores = np.exp(scores - scores_max)              # (N,C)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)    # (N,C)
        # cross-entropy loss and L2-regularization
        correct_logprobs = -np.log(probs[range(N), y])        # (N,)
        data_loss = np.sum(correct_logprobs) / N
        reg_loss = 0.5 * reg * np.sum(W1*W1) + 0.5 * reg * np.sum(W2*W2)
        loss = data_loss + reg_loss

        # Backward pass: compute gradients
        grads = {}
        # Compute the gradient of scores
        dscores = probs                                 # (N,C)
        dscores[range(N), y] -= 1
        dscores /= N
        # Backprop into W2 and b2
        dW2 = np.dot(h1.T, dscores)                     # (H,C)
        db2 = np.sum(dscores, axis=0, keepdims=True)    # (1,C)
        # Backprop into hidden layer
        dh1 = np.dot(dscores, W2.T)                     # (N,H)
        # Backprop into ReLU non-linearity
        dh1[h1 <= 0] = 0
        # Backprop into W1 and b1
        dW1 = np.dot(X.T, dh1)                          # (D,H)
        db1 = np.sum(dh1, axis=0, keepdims=True)        # (1,H)
        # Add the regularization gradient contribution
        dW2 += reg * W2
        dW1 += reg * W1
        grads['W1'] = dW1
        grads['b1'] = db1
        grads['W2'] = dW2
        grads['b2'] = db2

        return loss, grads

    def train(self, X, y, X_val, y_val, learning_rate=1e-3, 
               learning_rate_decay=0.95, reg=1e-5, mu=0.9, num_epochs=10, 
               mu_increase=1.0, batch_size=200, verbose=False):   
        """    
        Train this neural network using stochastic gradient descent. 
        Inputs:    
        - X: A numpy array of shape (N, D) giving training data.    
        - y: A numpy array of shape (N,) giving training labels; y[i] = c means that
             X[i] has label c, where 0 <= c < C.    
        - X_val: A numpy array of shape (N_val, D) giving validation data.    
        - y_val: A numpy array of shape (N_val,) giving validation labels.    
        - learning_rate: Scalar giving learning rate for optimization.    
        - learning_rate_decay: Scalar giving factor used to decay the learning rate                           
                               after each epoch.    
        - reg: Scalar giving regularization strength.    
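        - mu: Scalar giving the momentum coefficient.
        - num_epochs: Number of epochs to train for.
        - mu_increase: Scalar factor by which mu is multiplied after each epoch.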
        - batch_size: Number of training examples to use per step.    
        - verbose: boolean; if true print progress during optimization.  
        """
        num_train = X.shape[0]
        iterations_per_epoch = max(num_train // batch_size, 1)
        # Use SGD to optimize the parameters
        v_W2, v_b2 = 0.0, 0.0
        v_W1, v_b1 = 0.0, 0.0
        loss_history = []
        train_acc_history = []
        val_acc_history = []

        for it in range(1, num_epochs * iterations_per_epoch + 1):
            X_batch = None   
            y_batch = None    
            # Sampling with replacement is faster than sampling without replacement.   
            sample_index = np.random.choice(num_train, batch_size, replace=True)   
            X_batch = X[sample_index, :]        # (batch_size,D)    
            y_batch = y[sample_index]           # (batch_size,)

            # Compute loss and gradients using the current minibatch 
            loss, grads = self.loss(X_batch, y=y_batch, reg=reg) 
            loss_history.append(loss)    

            # Perform parameter update (with momentum)    
            v_W2 = mu * v_W2 - learning_rate * grads['W2']    
            self.params['W2'] += v_W2   
            v_b2 = mu * v_b2 - learning_rate * grads['b2']    
            self.params['b2'] += v_b2   
            v_W1 = mu * v_W1 - learning_rate * grads['W1']    
            self.params['W1'] += v_W1   
            v_b1 = mu * v_b1 - learning_rate * grads['b1']  
            self.params['b1'] += v_b1    
            """    
            if verbose and it % 100 == 0:        
                print 'iteration %d / %d: loss %f' % (it, num_iters, loss) 
            """   
            # Every epoch, check train and val accuracy and decay learning rate.
            if it % iterations_per_epoch == 0:
                # Check accuracy
                epoch = it // iterations_per_epoch
                train_acc = (self.predict(X_batch) == y_batch).mean()
                val_acc = (self.predict(X_val) == y_val).mean()
                train_acc_history.append(train_acc)
                val_acc_history.append(val_acc)
                # Only the printing depends on verbose; the decay must always run
                if verbose:
                    print('epoch %d / %d: loss %f, train_acc: %f, val_acc: %f' %
                          (epoch, num_epochs, loss, train_acc, val_acc))
                # Decay learning rate
                learning_rate *= learning_rate_decay
                # Increase mu
                mu *= mu_increase

        return {   
            'loss_history': loss_history,   
            'train_acc_history': train_acc_history,   
            'val_acc_history': val_acc_history,
        }

    def predict(self, X):    
        """  
        Inputs:    
        - X: A numpy array of shape (N, D) giving N D-dimensional data points to        
             classify.    
        Returns:    
        - y_pred: A numpy array of shape (N,) giving predicted labels for each of           
                  the elements of X. For all i, y_pred[i] = c means that X[i] is 
                  predicted to have class c, where 0 <= c < C.   
        """    
        y_pred = None    
        h1 = ReLU(np.dot(X, self.params['W1']) + self.params['b1'])    
        scores = np.dot(h1, self.params['W2']) + self.params['b2']    
        y_pred = np.argmax(scores, axis=1)    

        return y_pred

def ReLU(x):    
    """ReLU non-linearity."""    
    return np.maximum(0, x)

After finishing neural_net.py, check that your implementation is correct (using the code in two_layer_net.ipynb); once it passes, the next step is to tune the hyperparameters.
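
One common way to check a backward pass is to compare the analytic gradient against a centered finite difference. The sketch below does this for the softmax cross-entropy step only, using a standalone function that mirrors the loss code above. The names softmax_loss and numerical_gradient and the random test data are assumptions for this example; this is not the checker shipped with the assignment.

```python
import numpy as np

def softmax_loss(scores, y):
    # Mean cross-entropy loss and its gradient with respect to the scores
    N = scores.shape[0]
    shifted = scores - np.max(scores, axis=1, keepdims=True)   # numeric stability
    exp_scores = np.exp(shifted)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    loss = -np.sum(np.log(probs[np.arange(N), y])) / N
    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1
    dscores /= N          # the loss is an average over N samples
    return loss, dscores

def numerical_gradient(f, x, h=1e-5):
    # Centered finite difference, one entry of x at a time
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old      # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.RandomState(0)
scores = rng.randn(4, 3)
y = np.array([0, 2, 1, 1])

loss, analytic = softmax_loss(scores, y)
numeric = numerical_gradient(lambda s: softmax_loss(s, y)[0], scores)
max_error = np.max(np.abs(analytic - numeric))
print(max_error)
```

The two gradients should agree to many decimal places. Removing the line dscores /= N makes the check fail, because the loss being differentiated is the mean over the N samples.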

PS: This post has reached the length limit; please continue reading in CS231n : Assignment1 (continued). :(|)

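Reader comments

7b6043e60381: Hello. In neural_net.py of your two_layer_net task, the loss on the scores uses cross-entropy. The definition I found online is H(x)=-ΣP(x)logP(x), but your code has correct_logprobs = -np.log(probs[range(N), y]). I don't understand how this line was derived; doesn't it amount to -ΣlogP(x)? If you have time, I would appreciate an explanation. Thanks!
纤晨: Reading this alongside the lecture notes really helps understanding. Thank you very much~
戴戴Day: Thanks for sharing. I'm taking cs231n 2017, and these posts are helping me review and clear things up~
帅气的我要加油: Hi, did you run the cs231n assignments on Google Cloud or in a local environment you set up yourself?
倪明明: Thank you very much for sharing; it made studying much easier. In the figure above, compute the gradient.jpg, the second term in the second line of the backpropagation formulas seems slightly off, but the NN code is correct.
2d5c19a1c51f: Hi, I'm working on Assignment1 but keep getting stuck on loading the images: it says the cs231n.data_utils module cannot be found. How do I fix this? python 2.X/ipython
b0aa5066ccdf: I reran the code yesterday and LinearClassifier_softmax_start.py raised an error:
ValueError: need more than 6 values to unpack. How can this be fixed?
21cfcc29fd9b: When computing the softmax gradient, why is dscore divided by N?
陈铭治: I learned a lot, thanks for sharing.
hhg121: A question: in linear_svm, with the regularization term added, the gradient works out to 2λW. So why does the code use dW += reg*W rather than dW += 2*reg*W?
Deepool: The regularization term is multiplied by 0.5.
3c43cbac0814: Hi, using your code, softmax gives much lower results than SVM. What is going on? The best SVM accuracy is about 37.1%, while softmax only reaches a bit over 10%.
3c43cbac0814: I found where I went wrong.
00b0b8e97f55: Could you share a way to contact you? I learned a lot from your summaries and code!
ps: in linear_classifier.py, y_pred = np.argmax(np.dot(self.W.T, X), axis=0)
should be changed to y_pred = np.argmax(np.dot(self.W.T, X.T), axis=0)

ps: in neural_net.py, self.params['b2'] += v_b2 should be changed to self.params['b2'] += v_b2[0], broadcasting on axis=0
00b0b8e97f55: It may be a numpy version issue; it would not run for me at the time. Thanks anyway for the introduction.
Deepool: My Sina Weibo is on my profile page.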

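"ps: in linear_classifier.py, y_pred = np.argmax(np.dot(self.W.T, X), axis=0)
should be changed to y_pred = np.argmax(np.dot(self.W.T, X.T), axis=0)": this change is optional; transposing the input is enough.

"ps: in neural_net.py, self.params['b2'] += v_b2 should be changed to self.params['b2'] += v_b2[0], broadcasting on axis=0": no change is needed here.
chenyu_21cn: Excellent!
Deepool: Thanks :)
2f54bac75202: Downloading the assignment1 and assignment2 data over my campus network is almost impossibly slow. Could you share the data you have already downloaded on Baidu Cloud? Thanks.
2f54bac75202: I mean the data that the get_datasets.sh script in assignment1/cs231n/datasets downloads; I cannot get it to download.
2f54bac75202:@Deeplayer Thanks. But what I need is the data required by the assignment, http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz, which is over 100 MB. I cannot download it over the campus network; could you share it?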
Deepool:@長安一片月 http://pan.baidu.com/s/1nuCIUt7
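c6d54d0b2317: Has anyone run into a MemoryError when loading the dataset?
9918c0f940a3: Found this online: when processing large data in Python on a machine with 16 GB of RAM, MemoryError appeared before even a quarter of the memory was in use. It turns out that 32-bit Python raises this error once it uses more than 2 GB of memory, with no other message. Switch to 64-bit Python.
Source: http://blog.csdn.net/w_zhongke/article/details/44807889
9918c0f940a3: I also hit a MemoryError. How did you solve it? Thanks @Deeplayer @禾子z心
The error is: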
Traceback (most recent call last):
File "C:/Users/sj/Desktop/exe.py", line 33, in <module>
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)
File "C:/Users/sj/Desktop/exe.py", line 26, in load_CIFAR10
Xtr = np.concatenate(xs)
MemoryError
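Deepool:@禾子z心 No, the dataset is not very large.
271ff93a7907: In the naive version of the softmax loss, is there a problem with the dW_each part? Shouldn't the previous value be reset to 0 before accumulating?
271ff93a7907:@Deeplayer sorry, I misread the formula...
271ff93a7907:@Deeplayer In[6]: Maybe the versions differ; this is what I get: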
import numpy as np
X = np.zeros(shape=(5,3))
print X
X[:,1]=[1,2,3,4,5]
print X
X[:,2]=[1,2,3,4,5]
print X
out:
[[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]]

[[ 0. 1. 0.]
[ 0. 2. 0.]
[ 0. 3. 0.]
[ 0. 4. 0.]
[ 0. 5. 0.]]

[[ 0. 1. 1.]
[ 0. 2. 2.]
[ 0. 3. 3.]
[ 0. 4. 4.]
[ 0. 5. 5.]]
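Deepool:@Repaint No need; the assignment itself already has the effect of "zeroing" the old value.
030_西山我就来: Hi, how do I install the data_util package?
The softmax program has two small errors: the prob line is missing a "(", and the final dw should be on a new line.
Deepool:@030_西山我就来 It is in the assignment1 starter-code link I gave in the preface.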
030_西山我就来:@Deeplayer from data_utils import load_CIFAR10
ImportError: No module named data_utils
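I don't have data_utils here; how do I install it?
Deepool:@030_西山我就来 Corrected, thanks. The data_util.py file does not need any changes. You can download the CIFAR-10 data manually into the target folder, or fetch it online.
6c1444355433: Hi, in the explanation that mean subtraction speeds up convergence during data preprocessing, why are there many lines satisfying wx+b=0 after mean subtraction? Isn't that a single line, so there should be only one? Am I misunderstanding?
6c1444355433:@Deeplayer I think I get it now! Well written, thanks!
Deepool:@decayed Each pair (w,b) corresponds to one line.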