[1]. Softmax vs. Softmax-Loss: Numerical Stability
using LinearAlgebra   # for diagm

function softmax(z)
    # z = z .- maximum(z)   # subtract the max for numerical stability (see [2])
    o = exp.(z)
    return o / sum(o)
end

# Gradient of the fused softmax-loss (cross-entropy on the logits) w.r.t. z.
function gradient_together(z, y)
    o = softmax(z)
    o[y] -= 1.0
    return o
end

# The same gradient obtained by chaining the softmax Jacobian ∂o/∂z with ∂f/∂o.
function gradient_separated(z, y)
    o = softmax(z)
    ∂o_∂z = diagm(o) - o * o'
    ∂f_∂o = zeros(size(o))
    ∂f_∂o[y] = -1.0 / o[y]
    return ∂o_∂z * ∂f_∂o
end
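
A quick check of the two routines above (the inputs are made up for illustration): for well-scaled logits they agree, but when the probability of the true class underflows to zero, the separated version divides by zero and the chained product turns into NaNs, while the fused softmax-loss gradient stays finite.

z, y = [1.0, 2.0, 3.0], 2
gradient_together(z, y)    # ≈ [ 0.090, -0.755, 0.665]
gradient_separated(z, y)   # same values, up to floating-point error

z, y = [-800.0, 0.0], 1    # exp(-800) underflows, so softmax(z)[1] == 0.0
gradient_together(z, y)    # [-1.0, 1.0]  -- still well defined
gradient_separated(z, y)   # [NaN, NaN]   -- because -1.0 / o[y] is -Inf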

[2]. Backpropagation, Part 1: The Softmax Function

Using this property, we subtract the maximum value from the input before exponentiating, which prevents overflow without changing the result of the computation.
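
The property in question is shift invariance: subtracting the same constant c from every component of z leaves the softmax output unchanged, because the common factor e^{-c} cancels between numerator and denominator; choosing c = max_j z_j keeps every exponent non-positive, so exp cannot overflow:

\mathrm{softmax}(z)_i
  = \frac{e^{z_i}}{\sum_j e^{z_j}}
  = \frac{e^{-c}\, e^{z_i}}{e^{-c} \sum_j e^{z_j}}
  = \frac{e^{z_i - c}}{\sum_j e^{z_j - c}},
  \qquad c = \max_j z_j.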
[3]. PyTorch - VGG output layer - no softmax?
The softmax layer is only needed at inference time. During training, the loss is computed directly from the raw logits (in PyTorch, nn.CrossEntropyLoss applies log-softmax internally), so an explicit softmax layer is unnecessary and the amount of computation is reduced.
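
The same idea can be sketched outside PyTorch (the function name below is hypothetical, not part of any library): during training, the cross-entropy loss can be evaluated directly from the logits with the log-sum-exp trick, so no explicit softmax layer is required.

function cross_entropy_from_logits(z, y)
    c = maximum(z)                              # shift for numerical stability
    return log(sum(exp.(z .- c))) + c - z[y]    # equals -log(softmax(z)[y])
end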
