1. NaN loss
This is a natural property of stochastic gradient descent: if the learning rate is too large, SGD can diverge, the weights and the loss blow up to infinity, and the loss eventually turns into NaN.
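To see the mechanism in isolation, here is a minimal NumPy sketch on a toy quadratic (hypothetical objective and learning rate, not the author's training setup): with a step size outside the stable region, plain gradient descent grows geometrically, overflows float32 to infinity, and the very next update produces NaN.

```python
import numpy as np

np.seterr(over="ignore", invalid="ignore")  # silence overflow warnings for the demo

# Gradient descent on f(x) = x^2. The stable region is lr < 1.0; with lr = 1.5
# the iterate follows x_{k+1} = (1 - 2*lr) * x_k = -2 * x_k and grows geometrically.
x = np.float32(1.0)
lr = np.float32(1.5)

for step in range(300):
    grad = np.float32(2.0) * x   # d/dx x^2 = 2x
    x = x - lr * grad            # diverging update
    loss = x * x                 # overflows to inf first
    if np.isnan(loss):
        # once x itself overflows to inf, the update computes inf - inf = NaN
        print("loss became NaN at step", step)
        break
```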
Solutions: 1) Reduce the learning rate.
2) Normalizing the inputs also helped.
3) What fixed it for me was using the built-in tf.losses.sparse_softmax_cross_entropy(y, logits) instead of my own "safe softmax" cross-entropy built on tf.nn.softmax (see the sketch below).
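A minimal sketch of point 3, written against the TF 1.x-style API used above; the logit values are hypothetical, chosen only to trigger underflow, and this is one plausible failure mode of a hand-rolled loss rather than the exact code from the post. Computing log(softmax) in two steps produces 0 * log(0) = NaN once a class probability underflows to zero, while the fused op evaluates log-softmax with the log-sum-exp trick and stays finite.

```python
import tensorflow as tf  # TF 1.x-style API, matching tf.losses.* above

# Hypothetical logits with a large spread, as they look when training starts to diverge.
logits = tf.constant([[200.0, -200.0]])
labels = tf.constant([0])

# Hand-rolled cross-entropy on top of tf.nn.softmax:
# exp(-400) underflows so probs == [1, 0], log(0) = -inf,
# and the one-hot mask then yields 0 * (-inf) = NaN.
probs = tf.nn.softmax(logits)
one_hot = tf.one_hot(labels, depth=2)
naive_loss = -tf.reduce_sum(one_hot * tf.log(probs), axis=-1)

# Fused op: log-softmax is computed stably, so the loss stays finite.
stable_loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

with tf.Session() as sess:
    print(sess.run([naive_loss, stable_loss]))  # -> [array([nan], dtype=float32), 0.0]
```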