美文网首页
CS231n Spring 2019 Assignment 1—

CS231n Spring 2019 Assignment 1—

作者: 赖子啊 | 来源:发表于2019-09-29 22:03 被阅读0次

    距离上一篇的Assignment1中的KNN已经过去快半个月了,因为我中间先把Assignment3的博客先写完了,发现回过头来再看一遍的感觉真的会深入很多。因为上次的KNN在现实中是不会用的(只是给我们一个图像分类的直观感受),原因有两个:(1)在测试耗费太多时间;(2)在生图像素上的距离衡量指标不能有效提供信息(informative)。所以引入线性分类模型,指导思想就是找到合适的loss function和优化方法optimization来减小loss。需要看的是Linear classification: Support Vector Machine, Softmax,主要有两个需要完成:

    现在主要按着作业的顺序来进行,更多关于内容理解可以看这里


    SVM

    这一节里面要用朴素循环和向量化的两种方式来写前向传播计算loss和反向传播计算gradient。线性模型就是通过线性映射:f:R^{D}\rightarrow R^{K}得到分值,分类器输入样本x_{i}后,对第j个类别的分值score是:s_{j}=f(x_{i},W)_{j}注意s_{j}是一个长度为类别数的向量】,而对于多分类支持向量机损失(multiclass support vector machine loss,也叫作hinge loss)就是:
    L_{i}=\sum_{j \neq y_{i}} max(0,s_{j}-s_{y_{i}}+1)其中L_{i}就是样本x_{i}对应的损失值,s_{y_{i}}是对应x_{i}正确类别的分值,这里有个阈值为1,这个也是一个超参数(hyperparameter)。从该损失函数的定义可以看出,只有在正确分类的类别分值比其他类别分值高出一个阈值 ∆ 的时候,损失值才为0,否则损失值都会是一个正值,因为我们的目标就是让物体能够被正确分类。

    反向传播的时候稍微麻烦一点,不过在Optimization:Stochastic Gradient Descent中已经给出了相应的梯度公式:
    \left\{\begin{aligned} \nabla_{w_{y_i}} L_i = & -\left(\sum_{j \ne y_i}\mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0)\right)x_i & j = y_i \\ \nabla_{w_j} L_i = & \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i & j \ne y_i \end{aligned}\right.其中\mathbb{1}(\cdot)是个逻辑算子,当其中为真时为返回1,否则返回0。
    乍看这一公式挺复杂的,仔细分析一下也就不麻烦:这里是某个样本对应的损失L_{i}w_{y_i}/w_{j}(也就是w的某一行(或某一列,具体视xw相乘的顺序决定,作业里面对应的是一列))的导数。因为针对一个数据样本,在求损失L_{i}的时候就多次出现{s_{y_{i}}},所以对应得到{s_{y_{i}}}{w_{y_i}}也就累加了,其余的w_{j}就出现一次。要是直观理解了上面的公式了,代码应该就可以了,下面是svm_loss_naive()函数:

    def svm_loss_naive(W, X, y, reg):
        """
        Structured SVM loss function, naive implementation (with loops).
    
        Inputs have dimension D, there are C classes, and we operate on minibatches
        of N examples.
    
        Inputs:
        - W: A numpy array of shape (D, C) containing weights.
        - X: A numpy array of shape (N, D) containing a minibatch of data.
        - y: A numpy array of shape (N,) containing training labels; y[i] = c means
          that X[i] has label c, where 0 <= c < C.
        - reg: (float) regularization strength
    
        Returns a tuple of:
        - loss as single float
        - gradient with respect to weights W; an array of same shape as W
        """
        dW = np.zeros(W.shape) # initialize the gradient as zero
    
        # compute the loss and the gradient
        num_classes = W.shape[1]
        num_train = X.shape[0]
        num_features = W.shape[0]
        loss = 0.0
        for i in range(num_train):
            scores = X[i].dot(W)
            correct_class_score = scores[y[i]]
            for j in range(num_classes):
                if j == y[i]:
                    continue
                margin = scores[j] - correct_class_score + 1 # note delta = 1
                if margin > 0:
                    loss += margin
                    dW[:, y[i]] += -X[i, :]
                    dW[:, j] += X[i, :]
                    
    
        # Right now the loss is a sum over all training examples, but we want it
        # to be an average instead so we divide by num_train.
        loss /= num_train
    
        # Add regularization to the loss.
        loss += reg * np.sum(W * W)
    
        #############################################################################
        # TODO:                                                                     #
        # Compute the gradient of the loss function and store it dW.                #
        # Rather that first computing the loss and then computing the derivative,   #
        # it may be simpler to compute the derivative at the same time that the     #
        # loss is being computed. As a result you may need to modify some of the    #
        # code above to compute the gradient.                                       #
        #############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
        dW /= num_train
        dW += 2 * reg * W
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        
        return loss, dW
    

    对于向量化的写法,感觉计算loss那部分还是挺容易想到的,但是grad那部分还是有点难,但是看过这篇矩阵对矩阵的求导:linear backpropexample就会有点思路,总结来说就是若是对于一个线性操作:f=xW(这里的xW都是矩阵),假设我们已知loss对于f的梯度\frac{\partial L}{\partial f},那么就有[可以从维度匹配角度辅助记忆]:\frac{\partial L}{\partial W}=x^{T} \frac{\partial L}{\partial f} \frac{\partial L}{\partial x}=\frac{\partial L}{\partial f} W^{T}所以我们可以先求出loss对于score的梯度margins,然后根据上面的式子求得dW:

    def svm_loss_vectorized(W, X, y, reg):
        """
        Structured SVM loss function, vectorized implementation.
    
        Inputs and outputs are the same as svm_loss_naive.
        """
        loss = 0.0
        dW = np.zeros(W.shape) # initialize the gradient as zero
    
        #############################################################################
        # TODO:                                                                     #
        # Implement a vectorized version of the structured SVM loss, storing the    #
        # result in loss.                                                           #
        #############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
        scores = X.dot(W)        
        num_classes = W.shape[1]
        num_train = X.shape[0]
    
        scores_correct = scores[np.arange(num_train), y]   # 1 by N
        scores_correct = np.reshape(scores_correct, (num_train, -1))  # N by 1
        margins = scores - scores_correct + 1    # N by C
        margins = np.maximum(0,margins)
        margins[np.arange(num_train), y] = 0
        loss += np.sum(margins) / num_train
        loss += reg * np.sum(W * W)
    
    
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
        #############################################################################
        # TODO:                                                                     #
        # Implement a vectorized version of the gradient for the structured SVM     #
        # loss, storing the result in dW.                                           #
        #                                                                           #
        # Hint: Instead of computing the gradient from scratch, it may be easier    #
        # to reuse some of the intermediate values that you used to compute the     #
        # loss.                                                                     #
        #############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
        # compute the gradient
        margins[margins > 0] = 1
        row_sum = np.sum(margins, axis=1)                  # 1 by N
        margins[np.arange(num_train), y] = -row_sum        
        dW += np.dot(X.T, margins)/num_train + 2 * reg * W     # D by C
    
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
        return loss, dW
    

    之后还得完善一下LinearClassifier.train()部分,主要是利用SGD来更新一下参数W(其实用的还是mini-batch gradeint descent,因为SGD是batch_size=1的特例,主要的思想还是vanilla gradient descent),现在就是把有用的部分贴出来:

    sample_index = np.random.choice(num_train, batch_size, replace=False)
    X_batch = X[sample_index, :]   # select the batch sample
    y_batch = y[sample_index]      # select the batch label
    
    # SGD
    self.W += -learning_rate * grad
    

    在这一部分里还得利用validation set去tune一下hyperparameters(主要是学习率和正则化强度),result采用字典的形式,键为一组参数组合,值为对应这组参数的训练和验证准确率,自己需要填写的部分:

    for lr in learning_rates:
        for rs in regularization_strengths:
            mySvm = LinearSVM()
            loss_hist = mySvm.train(X_train, y_train, learning_rate=lr, reg=rs,
                          num_iters=1000, verbose=True)
            y_train_pred = mySvm.predict(X_train)
            train_accuracy = np.mean(y_train == y_train_pred)
            y_val_pred = mySvm.predict(X_val)
            val_accuracy = np.mean(y_val == y_val_pred)
            
            results[(lr, rs)] = (train_accuracy, val_accuracy)
            
            if val_accuracy > best_val:
                best_val = val_accuracy
                best_svm = mySvm
    

    Softmax

    softmax classifier与svm calssifier相比,就是把hinge loss改成了cross-entropy loss交叉熵损失,也是binary logistic regression classifier的泛化,先给出公式:L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.05in} \text{或者等价于} \hspace{0.05in} L_i = -f_{y_i} + \log\sum_j e^{f_j}其中f也就是我们之前所说的score,但是我们这里叫logits(unnormalized log probabilities)。直观上来说,就是为了使正确分类的概率最大化,也就是接近于1,loss就越小。从信息熵的角度说,因为交叉熵(cross-entropy)就是:H(p,q) = - \sum_x p(x) \log q(x)=H(p) + D_{KL}(p||q)其中p指的是ground truth distribution,相当于标签;q是estimated distribution,也就是分类器输出。H(p)p分布的熵,当其用one-hot编码时,H(p)=0,所以H(p,q) = D_{KL}(p||q),也就是pq的Kullback-Leibler divergence,当损失减小时,预示分类器预估的分布靠近真实分布。

    反向传播时朴素的loss计算方法需要推导一下,推导后可得出:
    \left\{\begin{aligned} \nabla_{w_{y_i}} L_i = & -x_i + \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} x_i & j = y_i \\ \nabla_{w_j} L_i = & \frac{e^{f_j}}{\sum_j e^{f_j}} x_i & j \ne y_i \end{aligned}\right.
    至于向量化的方法,我们还是可以先求出loss对于f的梯度\frac{\partial L}{\partial f}(可参考这篇课程笔记:Putting it together: Minimal Neural Network Case Study,其实和上面两个公式是一致的),然后就按之前svm的方法求得梯度\frac{\partial L}{\partial W}。现一起给出:

    def softmax_loss_naive(W, X, y, reg):
        """
        Softmax loss function, naive implementation (with loops)
    
        Inputs have dimension D, there are C classes, and we operate on minibatches
        of N examples.
    
        Inputs:
        - W: A numpy array of shape (D, C) containing weights.
        - X: A numpy array of shape (N, D) containing a minibatch of data.
        - y: A numpy array of shape (N,) containing training labels; y[i] = c means
          that X[i] has label c, where 0 <= c < C.
        - reg: (float) regularization strength
    
        Returns a tuple of:
        - loss as single float
        - gradient with respect to weights W; an array of same shape as W
        """
        # Initialize the loss and gradient to zero.
        loss = 0.0
        dW = np.zeros_like(W)
    
        #############################################################################
        # TODO: Compute the softmax loss and its gradient using explicit loops.     #
        # Store the loss in loss and the gradient in dW. If you are not careful     #
        # here, it is easy to run into numeric instability. Don't forget the        #
        # regularization!                                                           #
        #############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
        num_classes = W.shape[1]
        num_train = X.shape[0]
        for i in range(num_train):
            scores = X[i].dot(W)
            # to keep numerical calculate stablly,minus maximum
            scores -= np.max(scores)
            exp_scores = np.exp(scores)
            prob = exp_scores / np.sum(exp_scores)
            for j in range(num_classes):
                if j == y[i]:
                    correct_class_prob = prob[j]
                    correct_class_logprob = -np.log(correct_class_prob)
                    loss += correct_class_logprob
                    dW[:,j] += (correct_class_prob - 1) * X[i,:]
                # 之前else没加,硬是找不出来这个bug
                else: 
                    dW[:,j] += prob[j] * X[i,:]
    
    
        loss /= num_train 
        loss += 0.5 * reg *np.sum(W * W)
    
        dW /= num_train
        dW += reg * W
    
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
        return loss, dW
    
    
    def softmax_loss_vectorized(W, X, y, reg):
        """
        Softmax loss function, vectorized version.
    
        Inputs and outputs are the same as softmax_loss_naive.
        """
        # Initialize the loss and gradient to zero.
        loss = 0.0
        dW = np.zeros_like(W)
    
        #############################################################################
        # TODO: Compute the softmax loss and its gradient using no explicit loops.  #
        # Store the loss in loss and the gradient in dW. If you are not careful     #
        # here, it is easy to run into numeric instability. Don't forget the        #
        # regularization!                                                           #
        #############################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
        num_classes = W.shape[1]
        num_train = X.shape[0]
        scores = X.dot(W)
        # to keep numerical calculate stablly,minus maximum
        scores = scores - np.max(scores,axis=1).reshape(-1,1)
        exp_scores = np.exp(scores)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        # compute the loss: average cross-entropy loss and regularization
        correct_logprobs = -np.log(probs[range(num_train),y])
        data_loss = np.sum(correct_logprobs)/num_train
        reg_loss = 0.5*reg*np.sum(W*W)
        loss = data_loss + reg_loss
    
        # compute the gradient on scores
        dscores = probs
        dscores[range(num_train),y] -= 1
        dscores /= num_train
        dW = np.dot(X.T, dscores) + reg * W
    
    
        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
        return loss, dW
    

    这个softmax的损失函数我们在之后会用到很多次,所以向量化写法一定要熟悉。至于tune hyperparameters部分和svm部分非常像,这里就不放了。

    结果

    注意这里之前我们做了一点bias trick,就是增加了训练集的一个维度,这样就只要单独优化W了(当然W也增加了一个维度,所以在可视化的时候,就得把bias这一维度去掉),这一过程可由下面这张图形象体现:

    bias trick
    这是我用svm classifier训练后对参数进行可视化后的效果,可以看到学习得到的权重矩阵其实就是类别的模板,因为矩阵相乘其实就是向量间的内积操作,要想矩阵相乘之后的score最大,只能两个向量比较接近,所以下面学习到的权重矩阵就是综合后的模板:
    经svm classifier训练后每一类学习得到的权重矩阵可视化
    这是我用softmax classifier训练后对参数进行可视化后的效果,感觉会有点不是那么平滑:
    经softmax classifier训练后每一类学习得到的权重矩阵可视化

    链接

    前后面的作业博文请见:

    相关文章

      网友评论

          本文标题:CS231n Spring 2019 Assignment 1—

          本文链接:https://www.haomeiwen.com/subject/ufbqpctx.html