
About Loss Function All You Need

Author: IntoTheVoid | Published 2021-03-09 12:24

    Loss Functions

    cost function: a measure of the error between the model's predictions and the actual values; the cost function is the average loss over the entire training set.
    loss function: a synonym of cost function, but the loss function refers to a single training example; it is sometimes also called the error function.
    error function: a synonym of loss function.
    objective function: a more general term; a specific cost function is chosen as the objective function to be optimized.

    The optimization objective is usually to minimize the cost function, denoted J(\theta); we typically do this with gradient descent:

    \text{Repeat until convergence:} \quad \theta_{j} \leftarrow \theta_{j}-\alpha \frac{\partial}{\partial \theta_{j}} J(\theta)
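
    As a minimal illustration of this update rule (a sketch added here, not part of the original derivation), the loop below runs gradient descent on a toy cost J(\theta) = (\theta - 3)^2, whose gradient is 2(\theta - 3); the cost, starting point and hyperparameters are assumed for demonstration only.

    def gradient_descent(grad, theta0, learning_rate=0.1, epochs=100):
        """Generic descent loop: theta_j <- theta_j - alpha * dJ/dtheta_j."""
        theta = theta0
        for _ in range(epochs):
            theta = theta - learning_rate * grad(theta)
        return theta

    # Toy cost J(theta) = (theta - 3)^2, so dJ/dtheta = 2*(theta - 3)
    theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
    print(theta_star)  # converges toward 3.0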

    During optimization with gradient descent, each loss function is handled in the following five steps:

    • determine the predict function f(x) and the parameters it contains
    • determine the loss function, i.e. the loss for a single training example, L(\theta)
    • determine the cost function, i.e. the average loss over all training examples, J(\theta)
    • determine the gradients of the cost function with respect to each unknown parameter, \frac{\partial}{\partial\theta_j}J(\theta)
    • choose the learning rate and the number of epochs, then update the parameters

    The sections below walk through these five steps for each loss function.

    Regression Loss Functions

    Squared Error Loss

    Also called L2 loss: the square of the difference between the actual and predicted values.

    Advantages:

    • It is a positive quadratic function (of the form ax^2 + bx + c with a > 0), so it has only a global minimum and no local minima.
    • This guarantees that gradient descent, if it converges at all, converges to the global minimum.

    Disadvantages:

    • Because of the squaring, it has low robustness to outliers, i.e. it is very sensitive to them; if the data is prone to outliers, this loss should not be used.
    • predict function

    f(x_i) = mx_i +b \\ \theta = \{m,b\}

    • loss function

    L(\theta)=(y_i-f(x_i))^{2}

    • cost function

    J(\theta)=\frac{1}{N}\sum_{i=1}^{N}(y_i-f(x_i))^{2}

    • gradient of cost function

    \begin{aligned} &\text{for a single training example:} \\ \frac{\partial L(\theta)}{\partial m} &=-2(y_i-f(x_i))*x_i \\ \frac{\partial L(\theta)}{\partial b} &=-2(y_i-f(x_i)) \\ &\text{for all training examples:} \\ \frac{\partial J(\theta)}{\partial m} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial m} \\ \frac{\partial J(\theta)}{\partial b} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial b} \end{aligned}

    • update parameters
    def update_weights_MSE(m, b, X, Y, learning_rate):
        m_deriv = 0
        b_deriv = 0
        N = len(X)
        for i in range(N):
            # Calculate partial derivatives
            # -2x(y - (mx + b))
            m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))
    
            # -2(y - (mx + b))
            b_deriv += -2*(Y[i] - (m*X[i] + b))
    
        # We subtract because the derivatives point in direction of steepest ascent
        m -= (m_deriv / float(N)) * learning_rate
        b -= (b_deriv / float(N)) * learning_rate
    
        return m, b
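
    To complete step 5 (choosing a learning rate and number of epochs), here is a hypothetical usage sketch of update_weights_MSE; the toy data and hyperparameters below are assumptions for illustration, not from the original post.

    # Toy data generated from y = 2x + 1; run the update for a fixed number of epochs.
    X = [1.0, 2.0, 3.0, 4.0]
    Y = [3.0, 5.0, 7.0, 9.0]

    m, b = 0.0, 0.0
    for _ in range(5000):
        m, b = update_weights_MSE(m, b, X, Y, learning_rate=0.01)

    print(m, b)  # should approach m = 2, b = 1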
    
    Absolute Error Loss

    Also called L1 loss: the distance between the predicted and actual values, regardless of sign.

    Advantages:

    • Compared with MSE, it is more robust to outliers, i.e. less sensitive to them.

    Disadvantages:

    • The absolute-error curve is V-shaped: continuous, but not differentiable at y - f(x) = 0, which makes the derivative awkward to handle numerically.
    • predict function

    f(x_i) = mx_i +b \\ \theta = \{m,b\}

    • loss function

    L(\theta)=|y_i-f(x_i)|

    • cost function

    J(\theta)=\frac{1}{N}\sum_{i=1}^{N}|y_i-f(x_i)|

    • gradient of cost function

    \begin{aligned} &\text{for a single training example:} \\ \frac{\partial L(\theta)}{\partial m} &=-\frac{(y_i-f(x_i))*x_i}{|y_i-f(x_i)|} \\ \frac{\partial L(\theta)}{\partial b} &=-\frac{y_i-f(x_i)}{|y_i-f(x_i)|} \\ &\text{for all training examples:} \\ \frac{\partial J(\theta)}{\partial m} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial m} \\ \frac{\partial J(\theta)}{\partial b} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial b} \end{aligned}

    • update parameters
    def update_weights_MAE(m, b, X, Y, learning_rate):
        m_deriv = 0
        b_deriv = 0
        N = len(X)
        for i in range(N):
            # Calculate partial derivatives
            # -x(y - (mx + b)) / |y - (mx + b)|
            m_deriv += - X[i] * (Y[i] - (m*X[i] + b)) / abs(Y[i] - (m*X[i] + b))
    
            # -(y - (mx + b)) / |y - (mx + b)|
            b_deriv += -(Y[i] - (m*X[i] + b)) / abs(Y[i] - (m*X[i] + b))
    
        # We subtract because the derivatives point in direction of steepest ascent
        m -= (m_deriv / float(N)) * learning_rate
        b -= (b_deriv / float(N)) * learning_rate
    
        return m, b
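
    To make the robustness claim above concrete, the short sketch below (with assumed toy numbers) compares the mean squared error and mean absolute error on data containing a single large outlier.

    # Illustrative only: one outlier dominates the MSE but not the MAE.
    y_true = [1.0, 2.0, 3.0, 4.0, 100.0]   # the last point is an outlier
    y_pred = [1.1, 1.9, 3.2, 3.8, 4.0]

    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

    print(mse)  # ~1843: blown up by the squared outlier term
    print(mae)  # ~19.3: grows only linearly with the outlier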
    
    Huber Loss

    Huber loss combines the two and introduces a hyperparameter δ. The value of δ controls how much the loss leans toward MSE versus MAE: when |y − f(x)| ≤ δ it behaves like MSE, and when |y − f(x)| > δ it behaves like MAE.

    Advantages:

    • Reduced sensitivity to outliers.
    • Differentiable everywhere.
    • predict function

    f(x_i) = mx_i +b \\ \theta = \{m,b\}

    • loss function

    \begin{aligned} &L_{\delta}(\theta)=\left\{\begin{array}{l} \frac{1}{2}(y_i-{f(x_i)})^{2}, \text { if }|y_i-f(x_i)| \leq \delta. \\ \delta|y_i-f(x_i)|-\frac{1}{2} \delta^{2}, \quad \text { otherwise } \end{array}\right.\\ \end{aligned}

    • cost function

    J(\theta)=\frac{1}{N}\sum_{i=1}^{N}L_{\delta}(\theta)

    • gradient of cost function

    \begin{aligned} &\text{for a single training example:} \\ &\text{if } |y_i-f(x_i)| \leq \delta: \\ \frac{\partial L_{\delta}(\theta)}{\partial m} &=-(y_i-f(x_i))*x_i \\ \frac{\partial L_{\delta}(\theta)}{\partial b} &=-(y_i-f(x_i)) \\ &\text{otherwise:} \\ \frac{\partial L_{\delta}(\theta)}{\partial m} &=-\frac{\delta*(y_i-f(x_i))*x_i}{|y_i-f(x_i)|} \\ \frac{\partial L_{\delta}(\theta)}{\partial b} &=-\frac{\delta*(y_i-f(x_i))}{|y_i-f(x_i)|} \\ &\text{for all training examples:} \\ \frac{\partial J(\theta)}{\partial m} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L_{\delta}(\theta)}{\partial m} \\ \frac{\partial J(\theta)}{\partial b} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L_{\delta}(\theta)}{\partial b} \end{aligned}

    • update parameters
    def update_weights_Huber(m, b, X, Y, delta, learning_rate):
        m_deriv = 0
        b_deriv = 0
        N = len(X)
        for i in range(N):
            # derivative of quadratic for small values and of linear for large values
            if abs(Y[i] - m*X[i] - b) <= delta:
                m_deriv += -X[i] * (Y[i] - (m*X[i] + b))
                b_deriv += -(Y[i] - (m*X[i] + b))
            else:
                m_deriv += delta * X[i] * ((m*X[i] + b) - Y[i]) / abs((m*X[i] + b) - Y[i])
                b_deriv += delta * ((m*X[i] + b) - Y[i]) / abs((m*X[i] + b) - Y[i])
        
        # We subtract because the derivatives point in direction of steepest ascent
        m -= (m_deriv / float(N)) * learning_rate
        b -= (b_deriv / float(N)) * learning_rate
    
        return m, b
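
    As a small complement (assumed values, not from the original post), the sketch below evaluates the piecewise Huber loss itself and shows how delta switches between the quadratic and linear branches.

    def huber_loss(y, y_pred, delta=1.0):
        error = y - y_pred
        if abs(error) <= delta:
            return 0.5 * error ** 2                   # MSE-like branch
        return delta * abs(error) - 0.5 * delta ** 2  # MAE-like branch

    print(huber_loss(3.0, 2.5))  # |error| = 0.5 <= delta -> 0.125 (quadratic)
    print(huber_loss(3.0, 8.0))  # |error| = 5.0 >  delta -> 4.5   (linear)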
    

    Classification Loss Functions

    In classification tasks, the difference between the true class and the predicted class can be measured with entropy, where entropy means disorder or uncertainty: the larger the entropy of a probability distribution, the more uncertain it is, and a smaller value indicates a more certain distribution. For classification, a smaller entropy therefore means a more accurate prediction. Depending on the number of classes, we distinguish binary and multi-class classification.
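
    As a tiny illustration of this idea (the two distributions below are made up for demonstration), a nearly one-hot prediction has much lower entropy than a uniform one:

    import math

    def entropy(p):
        return -sum(pi * math.log(pi) for pi in p if pi > 0)

    print(entropy([0.5, 0.5]))    # ~0.693: maximally uncertain over two classes
    print(entropy([0.99, 0.01]))  # ~0.056: an almost certain prediction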

    Binary Classification Loss Functions
    Binary Cross Entropy Loss

    Probability of belonging to class 1 (the positive class) = p

    Probability of belonging to class 0 (the negative class) = 1 - p

    • predict function

    Here the f used above is replaced by p to emphasize that the output is a probability. In addition, the regression examples above used a single feature x_i; the function below uses two features, and the superscript (i) denotes the i-th example in the dataset.
    z(x_1^{(i)},x_2^{(i)}) = m_1*x_1^{(i)} + m_2 * x_2^{(i)} +b \\ p(z) = \frac{1}{1+e^{-z}} \\ \theta = \{m_1,m_2,b\}

    • loss function

    L(\theta)=-y * \log (p)-(1-y) * \log (1-p)=\left\{\begin{array}{ll} -\log (1-p), & \text { if } y=0 \\ -\log (p), & \text { if } y=1 \end{array}\right.

    • cost function

    J(\theta)=\frac{1}{N}\sum_{i=1}^{N}L(\theta)

    • gradient of cost function

    \begin{aligned} &\text{for a single training example:} \\ \frac{\partial L(\theta)}{\partial m_1} &=\frac{\partial L}{\partial p}\frac{\partial p}{\partial z}\frac{\partial z}{\partial m_1} \\ &= \left[-(\frac{y^{(i)}}{p} - \frac{1-y^{(i)}}{1-p})\right]*\left[p(1-p)\right]*\left[x_1^{(i)}\right] \\ &= (p-y^{(i)})*x_1^{(i)} \\ \frac{\partial L(\theta)}{\partial m_2} &= \frac{\partial L}{\partial p}\frac{\partial p}{\partial z}\frac{\partial z}{\partial m_2} \\ &= \left[-(\frac{y^{(i)}}{p} - \frac{1-y^{(i)}}{1-p})\right]*\left[p(1-p)\right]*\left[x_2^{(i)}\right] \\ &= (p-y^{(i)})*x_2^{(i)} \\ \frac{\partial L(\theta)}{\partial b} &= \frac{\partial L}{\partial p}\frac{\partial p}{\partial z}\frac{\partial z}{\partial b} \\ &= \left[-(\frac{y^{(i)}}{p} - \frac{1-y^{(i)}}{1-p})\right]*\left[p(1-p)\right]*\left[1\right] \\ &= (p-y^{(i)}) \\ &\text{for all training examples:} \\ \frac{\partial J(\theta)}{\partial m_1} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial m_1} \\ \frac{\partial J(\theta)}{\partial m_2} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial m_2} \\ \frac{\partial J(\theta)}{\partial b} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial b} \end{aligned}

    • update parameters
    import math

    def update_weights_BCE(m1, m2, b, X1, X2, Y, learning_rate):
        m1_deriv = 0
        m2_deriv = 0
        b_deriv = 0
        N = len(X1)
        for i in range(N):
            # sigmoid prediction p = 1 / (1 + e^(-z))
            s = 1 / (1 + math.exp(-m1*X1[i] - m2*X2[i] - b))

            # Calculate partial derivatives: (p - y) * x
            m1_deriv += X1[i] * (s - Y[i])
            m2_deriv += X2[i] * (s - Y[i])
            b_deriv += (s - Y[i])
    
        # We subtract because the derivatives point in direction of steepest ascent
        m1 -= (m1_deriv / float(N)) * learning_rate
        m2 -= (m2_deriv / float(N)) * learning_rate
        b -= (b_deriv / float(N)) * learning_rate
    
        return m1, m2, b
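
    To show what the loss values themselves look like, here is a small sketch (with assumed probabilities) of the binary cross-entropy for a single example, illustrating how confidently wrong predictions are penalized heavily.

    import math

    def bce(y, p):
        return -y * math.log(p) - (1 - y) * math.log(1 - p)

    print(bce(1, 0.9))  # ~0.105: confident and correct -> small loss
    print(bce(1, 0.1))  # ~2.303: confident but wrong   -> large loss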
    
    Hinge Loss

    Mainly used for support vector machine (SVM) classifiers with class labels -1 and 1, so make sure the class labels in the dataset are converted from {0, 1} to {-1, 1}.

    • predict function

    f(x_1^{(i)},x_2^{(i)}) = m_1*x_1^{(i)} + m_2 * x_2^{(i)} +b \\ \theta = \{m_1, m_2, b\}

    • loss function

    L(\theta)=\max (0,\quad1-y^{(i)} * f(x_1^{(i)},x_2^{(i)}))

    • cost function

    J(\theta)=\frac{1}{N}\sum_{i=1}^{N}L(\theta)

    • gradient of cost function

    \begin{aligned} &\text{for a single training example:} \\ &\text{if } y^{(i)} * f(x_1^{(i)},x_2^{(i)}) \leq 1: \\ \frac{\partial L(\theta)}{\partial m_1} &=-y^{(i)}*x_1^{(i)}\\ \frac{\partial L(\theta)}{\partial m_2} &=-y^{(i)}*x_2^{(i)} \\ \frac{\partial L(\theta)}{\partial b} &=-y^{(i)} \\ &\text{otherwise:} \\ \frac{\partial L(\theta)}{\partial m_1} &=\frac{\partial L(\theta)}{\partial m_2} =\frac{\partial L(\theta)}{\partial b} = 0 \\ &\text{for all training examples:} \\ \frac{\partial J(\theta)}{\partial m_1} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial m_1} \\ \frac{\partial J(\theta)}{\partial m_2} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial m_2} \\ \frac{\partial J(\theta)}{\partial b} &=\frac{1}{N}\sum_{i=1}^{N}\frac{\partial L(\theta)}{\partial b} \end{aligned}

    • update parameters
    def update_weights_Hinge(m1, m2, b, X1, X2, Y, learning_rate):
        m1_deriv = 0
        m2_deriv = 0
        b_deriv = 0
        N = len(X1)
        for i in range(N):
            # Calculate partial derivatives
            if Y[i]*(m1*X1[i] + m2*X2[i] + b) <= 1:
                m1_deriv += -X1[i] * Y[i]
                m2_deriv += -X2[i] * Y[i]
                b_deriv += -Y[i]
            # else derivatives are zero
    
        # We subtract because the derivatives point in direction of steepest ascent
        m1 -= (m1_deriv / float(N)) * learning_rate
        m2 -= (m2_deriv / float(N)) * learning_rate
        b -= (b_deriv / float(N)) * learning_rate
    
        return m1, m2, b
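
    As a brief complement (assumed scores, not from the original post), the sketch below evaluates the hinge loss itself with labels in {-1, +1}, showing the margin behaviour.

    def hinge(y, score):
        return max(0.0, 1.0 - y * score)

    print(hinge(+1, 2.5))  # 0.0: correct side of the margin, no loss
    print(hinge(+1, 0.3))  # 0.7: correct side but inside the margin
    print(hinge(-1, 0.3))  # 1.3: wrong side of the decision boundary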
    
    Multi-class Classification Loss Functions

    In an email classification task, an email can be classified as spam or not spam, but it can also be assigned to one of many more categories, such as work, family, social, promotions, and so on. That is a multi-class classification problem.

    Multi-Class Cross Entropy Loss

    For an input vector X_i and the corresponding one-hot encoded target vector Y_i, the loss is defined as follows.

    • predict function

    z = \text{some function of the input} \\ p(z_{i}) = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}} \quad \text { for } i=1, \ldots, K \text { and } \mathbf{z}=\left(z_{1}, \ldots, z_{K}\right) \in \mathbb{R}^{K}

    • loss function
    L(\theta)=-\sum_{k=1}^{K} y_{k} \log \left(p_{k}\right)
    • cost function

    J(\theta)=\frac{1}{N}\sum_{i=1}^{N}L(\theta)

    • gradient of cost function

    For simplicity, the loss here is differentiated only down to z; in a neural network this corresponds to the error gradient at the output layer.

    For the partial derivatives of the softmax itself, see the softmax gradient derivation.
    \begin{aligned} &\text{for a single training example with true class } k: \\ &\text{if } k = j: \\ \frac{\partial L(\theta)}{\partial z_j} &=\frac{\partial L}{\partial p_{k}}\frac{\partial p_{k}}{\partial z_j} = \left[-\frac{1}{p_k}\right]*\left[p_k(1-p_j)\right] = p_j-1 \\ &\text{if } k \ne j: \\ \frac{\partial L(\theta)}{\partial z_j} &=\frac{\partial L}{\partial p_{k}}\frac{\partial p_{k}}{\partial z_j} = \left[-\frac{1}{p_k}\right]*\left[-p_kp_j\right] = p_j \end{aligned}

    • update parameters
    # importing requirements
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import Adam
    
    # alpha = 0.001 as given in the lr parameter of the Adam() optimizer
    
    # build the model
    model_alpha1 = Sequential()
    model_alpha1.add(Dense(50, input_dim=2, activation='relu'))
    model_alpha1.add(Dense(3, activation='softmax'))
    
    # compile the model
    opt_alpha1 = Adam(lr=0.001)
    model_alpha1.compile(loss='categorical_crossentropy', optimizer=opt_alpha1, metrics=['accuracy'])
    
    # fit the model
    # dummy_Y is the one-hot encoded target
    # history_alpha1 is used to score the validation and accuracy scores for plotting 
    history_alpha1 = model_alpha1.fit(dataX, dummy_Y, validation_data=(dataX, dummy_Y), epochs=200, verbose=0)
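
    To connect the Keras example back to the gradient derived above, here is a small NumPy sketch (toy logits and a made-up one-hot target) that computes the softmax probabilities, the multi-class cross entropy, and the output-layer gradient p - y.

    import numpy as np

    z = np.array([2.0, 1.0, 0.1])      # toy logits
    y = np.array([0.0, 1.0, 0.0])      # one-hot target, true class k = 1

    p = np.exp(z) / np.sum(np.exp(z))  # softmax probabilities
    loss = -np.sum(y * np.log(p))      # multi-class cross entropy

    grad_z = p - y                     # p_j - 1 for the true class, p_j otherwise
    print(p, loss, grad_z)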
    
