
cs231n Study Notes: Normalization (5)

Author: Latet | Published 2019-08-24 15:43

    Preface

    This post is intended for study and note-taking. If you repost it, please cite the source: https://www.jianshu.com/p/59aaaaab746a

    Normalization

    Normalization refers to a family of techniques for processing network features so that they keep a well-behaved distribution during training. It is usually applied right before the activation function. The main approaches in current use are the following:

    [Figure: overview of the normalization methods]

    I. Batch Normalization

    For the theory behind BN, see the blog post Batch Normalization. BN normalizes a batch of feature maps per channel: the mean and variance have shape $1 \times C \times 1 \times 1$, and the normalized result is then scaled and shifted by the learnable parameters $\gamma$ and $\beta$.
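    For reference, the normalization applied in the code below is the standard per-feature transform from the BN paper: compute the batch statistics, normalize, then scale and shift with the learnable $\gamma$ and $\beta$:

    $$\mu_B = \frac{1}{N}\sum_{i=1}^{N} x_i,\qquad \sigma_B^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_B)^2,\qquad \hat{x}_i = \frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}},\qquad y_i = \gamma\,\hat{x}_i + \beta$$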

    import numpy as np


    def batchnorm_forward(x, gamma, beta, bn_param):
        """
        Input:
        - x: Data of shape (N, D)
        - gamma: Scale parameter of shape (D,)
        - beta: Shift parameter of shape (D,)
        - bn_param: Dictionary with the following keys:
          - mode: 'train' or 'test'; required
          - eps: Constant for numeric stability
          - momentum: Constant for running mean / variance.
          - running_mean: Array of shape (D,) giving running mean of features
          - running_var Array of shape (D,) giving running variance of features
    
        Returns a tuple of:
        - out: of shape (N, D)
        - cache: A tuple of values needed in the backward pass
        """
        mode = bn_param['mode']
        eps = bn_param.get('eps', 1e-5)
        momentum = bn_param.get('momentum', 0.9)
    
        N, D = x.shape
        running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
        running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))
    
        out, cache = None, None
        if mode == 'train':
            sample_mean = np.mean(x, axis=0)  # per-feature (column-wise) mean
            sample_var = np.var(x, axis=0)    # per-feature (column-wise) variance
            x_hat = (x- sample_mean)/(np.sqrt(sample_var+eps))
            
            out = gamma*x_hat+beta
            cache = (x, sample_mean, sample_var, x_hat, eps,gamma, beta)
            running_mean = momentum*running_mean +(1-momentum)*sample_mean
            running_var = momentum*running_var + (1-momentum)*sample_var
            
        elif mode == 'test':
            out = gamma * (x - running_mean) / (np.sqrt(running_var + eps)) + beta
        else:
            raise ValueError('Invalid forward batchnorm mode "%s"' % mode)
    
        # Store the updated running means back into bn_param
        bn_param['running_mean'] = running_mean
        bn_param['running_var'] = running_var
    
        return out, cache
    
    
    def batchnorm_backward(dout, cache):
        """
        Inputs:
        - dout: Upstream derivatives, of shape (N, D)
        - cache: Variable of intermediates from batchnorm_forward.
    
        Returns a tuple of:
        - dx: Gradient with respect to inputs x, of shape (N, D)
        - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
        - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
        """
        dx, dgamma, dbeta = None, None, None
        N = dout.shape[0]
        x, sample_mean, sample_var, x_hat, eps,gamma, beta = cache
        # gradients of the learnable scale and shift parameters
        dgamma = np.sum(dout * x_hat, axis=0)
        dbeta = np.sum(dout, axis=0)
        # backprop through out = gamma * x_hat + beta
        dhat = dout * gamma
        dx_1 = dhat / np.sqrt(sample_var + eps)
        # gradients with respect to the batch variance and mean
        dvar = np.sum(dhat * (x - sample_mean), axis=0) * (-0.5) * ((sample_var + eps) ** (-1.5))
        dmean = np.sum(-dhat, axis=0) / np.sqrt(sample_var + eps) + dvar * np.mean(2 * sample_mean - 2 * x, axis=0)

        # combine the three paths through which x affects the output
        dx_var = dvar * 2.0 * (x - sample_mean) / N
        dx_mean = dmean * 1.0 / N
        dx = dx_1 + dx_var + dx_mean
        return dx, dgamma, dbeta
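    A minimal usage sketch of the two functions above (the shapes and values here are purely illustrative):

    # assumes numpy (np) and batchnorm_forward/batchnorm_backward defined above
    x = np.random.randn(4, 5)                 # 4 samples, 5 features each
    gamma, beta = np.ones(5), np.zeros(5)
    bn_param = {'mode': 'train'}

    out, cache = batchnorm_forward(x, gamma, beta, bn_param)
    dx, dgamma, dbeta = batchnorm_backward(np.random.randn(4, 5), cache)

    # at test time the accumulated running statistics are used instead
    bn_param['mode'] = 'test'
    out_test, _ = batchnorm_forward(x, gamma, beta, bn_param)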
    

    The code above implements BN for fully connected layers. To use BN in a convolutional network, simply reshape the feature map produced by the conv layer to $(N \times H \times W, C)$ before applying it, as sketched below. Also note that during training BN computes the mean and variance of each batch at every iteration, while at test time it uses running (exponential moving) averages accumulated over the training data.
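    A minimal sketch of this reshape trick, assuming the batchnorm_forward defined above; the helper name spatial_batchnorm_forward follows the cs231n assignment, but the body below is only an illustrative sketch:

    def spatial_batchnorm_forward(x, gamma, beta, bn_param):
        # x has shape (N, C, H, W); fold N, H, W into a single axis so that
        # each of the C channels is normalized over N*H*W values
        N, C, H, W = x.shape
        x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)   # (N*H*W, C)
        out_flat, cache = batchnorm_forward(x_flat, gamma, beta, bn_param)
        out = out_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
        return out, cache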
    Advantages of BN

    • Allows larger learning rates;
    • Is more tolerant of poor weight initialization;
    • Acts as a form of regularization.

    Disadvantages of BN
    The mean and variance are computed over the batch. If the batch size is too small, these statistics do not represent the overall data distribution; if it is too large, it may exceed GPU memory and training and parameter updates become slow. Batch sizes such as 32, 64, or 128 are common choices.


    II. Layer Normalization

    LN normalizes each of the N feature maps over everything except the batch dimension: the mean and variance have shape $N \times 1 \times 1 \times 1$. In short, one mean and one variance are computed per sample, so the code is identical at training and test time and no running averages are needed.
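    In other words, for a sample $x_i \in \mathbb{R}^{D}$ the statistics are taken over the feature dimension:

    $$\mu_i = \frac{1}{D}\sum_{j=1}^{D} x_{ij},\qquad \sigma_i^2 = \frac{1}{D}\sum_{j=1}^{D}(x_{ij}-\mu_i)^2,\qquad y_{ij} = \gamma_j\,\frac{x_{ij}-\mu_i}{\sqrt{\sigma_i^2+\epsilon}} + \beta_j$$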

    def layernorm_forward(x, gamma, beta, ln_param):
        """
       
        Input:
        - x: Data of shape (N, D)
        - gamma: Scale parameter of shape (D,)
        - beta: Shift parameter of shape (D,)
        - ln_param: Dictionary with the following keys:
            - eps: Constant for numeric stability
    
        Returns a tuple of:
        - out: of shape (N, D)
        - cache: A tuple of values needed in the backward pass
        """
        out, cache = None, None
        eps = ln_param.get('eps', 1e-5)
       
        # transpose so that per-sample statistics are computed along axis 0
        x_T = x.T
        sample_mean = np.mean(x_T, axis=0)   # one mean per sample
        sample_var = np.var(x_T, axis=0)     # one variance per sample
        x_norm_T = (x_T - sample_mean) / np.sqrt(sample_var + eps)
        x_norm = x_norm_T.T
        out = x_norm * gamma +beta
        cache = (x,  sample_mean, sample_var,x_norm,eps, gamma, beta)
        return out, cache
    
    
    def layernorm_backward(dout, cache):
        dx, dgamma, dbeta = None, None, None
        x,  sample_mean, sample_var,x_norm,eps, gamma, beta = cache
        dgamma = np.sum(dout*x_norm, axis = 0) 
        dbeta = np.sum(dout, axis = 0)
        
        # work in the transposed view; here N is actually D, the number of features per sample
        dout = dout.T
        N = dout.shape[0]
        dhat = dout * gamma[:,np.newaxis]
        dx_1 = dhat/(np.sqrt(sample_var+eps))
        x = x.T
        dvar = np.sum(dhat*(x-sample_mean),axis=0)*(-0.5)*((sample_var+eps)**(-1.5))
        dmean = np.sum(-dhat,axis=0)/(np.sqrt(sample_var+eps))+dvar*np.mean(2*sample_mean-2*x,axis=0)
        
        dx_var = dvar*2.0*(x-sample_mean)/N
        dx_mean = dmean*1.0/N
        
        dx = dx_1+dx_var+dx_mean
        dx = dx.T
        
    
        return dx, dgamma, dbeta
    

    Advantages of LN: it does not rely on the batch, since normalization happens entirely within a single sample, so it can be used with a batch size of 1 and in RNNs (see the sketch below). In practice, BN tends to suit CNNs better than LN, while LN suits RNNs better than BN.
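    As a quick sketch, the layer-norm functions above can be applied to a batch of a single sample (shapes here are illustrative):

    # assumes numpy (np) and layernorm_forward/layernorm_backward defined above
    x = np.random.randn(1, 6)                 # one sample, D = 6 features
    gamma, beta = np.ones(6), np.zeros(6)

    out, cache = layernorm_forward(x, gamma, beta, {'eps': 1e-5})
    dx, dgamma, dbeta = layernorm_backward(np.random.randn(1, 6), cache)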

    III. Instance Normalization

    IN was proposed mainly for style-transfer networks. IN normalizes each channel of each sample over its spatial positions: the mean and variance have shape $N \times C \times 1 \times 1$. In image style transfer the generated result depends mainly on a single image instance, so normalizing across the batch or across channels is not appropriate; each channel of each instance should be normalized independently.
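    This post does not include IN code; a minimal forward-pass sketch matching the definition above (a hypothetical helper written in the same NumPy style as the functions in this post, not part of the cs231n assignment) might look like this:

    import numpy as np

    def instancenorm_forward(x, gamma, beta, eps=1e-5):
        # x: (N, C, H, W); each (sample, channel) map is normalized over H and W,
        # so the mean and variance have shape (N, C, 1, 1)
        mean = np.mean(x, axis=(2, 3), keepdims=True)
        var = np.var(x, axis=(2, 3), keepdims=True)
        x_hat = (x - mean) / np.sqrt(var + eps)
        out = gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
        return out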


    IV. Group Normalization

    GN was proposed to address the accuracy drop of BN at small batch sizes, where the batch statistics are poor estimates of the overall statistics.
    GN splits the input $x \in \mathbb{R}^{N \times C \times H \times W}$ into groups along the channel dimension, reshaping it to $N \times G \times (C/G) \times H \times W$, and normalizes within each group, so the computation does not depend on the batch size. The mean and variance have shape $N \times G \times 1 \times 1 \times 1$.

    def spatial_groupnorm_forward(x, gamma, beta, G, gn_param):
        """
        Input:
        - x: Data of shape (N, C, H, W)
        - gamma: Scale parameter of shape (1, C, 1, 1)
        - beta: Shift parameter of shape (1, C, 1, 1)
        - G: Integer number of groups to split into, should be a divisor of C
        - gn_param: Dictionary with key 'eps' for numeric stability

        Returns a tuple of:
        - out: of shape (N, C, H, W)
        - cache: A tuple of values needed in the backward pass
        """
        out, cache = None, None
        eps = gn_param.get('eps', 1e-5)
        
        N,C,H,W = x.shape
        x_group = np.reshape(x,(N,G,C//G,H,W))
        mean = np.mean(x_group,axis=(2,3,4),keepdims=True)
        var = np.var(x_group,axis=(2,3,4),keepdims=True)
        x_groupnorm = (x_group-mean)/np.sqrt(var+eps)
        x_norm = np.reshape(x_groupnorm,(N,C,H,W))
        out = x_norm*gamma+beta
        cache = (G,x,x_norm,mean,var,gamma,beta,eps)
       
        return out, cache
    
    
    def spatial_groupnorm_backward(dout, cache):
        dx, dgamma, dbeta = None, None, None
        G,x,x_norm,mean,var,gamma,beta,eps = cache
        N,C,H,W = dout.shape
        dbeta = np.sum(dout,axis=(0,2,3),keepdims=True)
        dgamma = np.sum(dout*x_norm,axis=(0,2,3),keepdims=True)
        
        dx_norm = dout*gamma
        dx_groupnorm = dx_norm.reshape((N,G,C//G,H,W))
        x_group = x.reshape((N,G,C//G,H,W))
        
        dvar = np.sum(dx_groupnorm*-1.0/2*(x_group-mean)*(var+eps)**(-1.5),axis=(2,3,4),keepdims=True)
        
        N_group = C//G*H*W
        dmean1 = np.sum(dx_groupnorm*-1.0/np.sqrt(var+eps),axis=(2,3,4),keepdims=True)
        dmean2 = dvar*-2.0/N_group*np.sum(x_group-mean,axis=(2,3,4),keepdims=True)
        dmean = dmean1+dmean2
        
        dx_group1 = dx_groupnorm*1.0/np.sqrt(var+eps)
        dx_group2 = dmean*1.0/N_group
        dx_group3 = dvar*2.0/N_group*(x_group-mean)
        dx_groups = dx_group1+dx_group2+dx_group3
        dx = dx_groups.reshape((N,C,H,W))
       
        return dx, dgamma, dbeta
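    A short usage sketch for the group-norm functions above (the shapes and the value of G are illustrative; G must divide C):

    # assumes numpy (np) and the spatial_groupnorm functions defined above
    N, C, H, W, G = 2, 6, 4, 4, 3             # 6 channels split into 3 groups of 2
    x = np.random.randn(N, C, H, W)
    gamma = np.ones((1, C, 1, 1))
    beta = np.zeros((1, C, 1, 1))

    out, cache = spatial_groupnorm_forward(x, gamma, beta, G, {'eps': 1e-5})
    dx, dgamma, dbeta = spatial_groupnorm_backward(np.random.randn(N, C, H, W), cache)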
    

    Summary

    1. BN normalizes over the batch and keeps the channel dimension; LN normalizes over the features and keeps the batch dimension N; IN normalizes over each image and keeps N and C; GN splits the channels into groups and normalizes within each group, keeping N and G (the number of groups).
    2. BN and GN are better suited to CNNs; LN is better suited to RNNs; IN is mainly used for style transfer.
    3. BN behaves differently at training and test time: at test time it uses running averages, and the momentum parameter of these running averages can be tuned to obtain more accurate estimates of the mean and standard deviation.

    References

    1. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
    2. Layer Normalization
    3. Instance Normalization: The Missing Ingredient for Fast Stylization
    4. Group Normalization
    5. cs231n course slides
