Activation function
- Sigmoid
  - Saturated neurons "kill" the gradient (inputs that are very large positive or very negative numbers)
  - Sigmoid outputs are not zero-centered
  - exp() is computationally expensive
- tanh
  - zero-centered, which fixes the second problem of sigmoid
  - still kills the gradient when saturated
- ReLU
  - does not saturate (in the + region)
  - computationally efficient
  - converges much faster than sigmoid/tanh in practice
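A minimal NumPy sketch (not from the original notes) comparing the local gradients of the three activations; it illustrates why saturated sigmoid/tanh units pass back almost no gradient, while ReLU keeps a constant gradient in the positive region.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_grads(x):
    """Local gradients d(activation)/dx for sigmoid, tanh, and ReLU."""
    s = sigmoid(x)
    return {
        "sigmoid": s * (1 - s),           # at most 0.25 (x=0), ~0 when |x| is large
        "tanh":    1 - np.tanh(x) ** 2,   # also ~0 when |x| is large (saturated)
        "relu":    (x > 0).astype(float)  # constant 1 in the + region, 0 otherwise
    }

for x in [-10.0, 0.0, 10.0]:
    g = local_grads(np.array(x))
    print(x, {k: round(float(v), 4) for k, v in g.items()})
# At x = ±10 the sigmoid/tanh gradients are ~0 ("killed"); ReLU is 1 for x > 0.
```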
Preprocessing
- zero mean
  - in practice, always just use zero-mean centering
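A short sketch of zero-mean centering, assuming a data matrix X of shape (N, D); the per-feature mean is computed on the training set and reused for the test set (the shapes below are made up for illustration).

```python
import numpy as np

# Toy data: N examples, D features, deliberately not zero-mean.
X_train = np.random.randn(100, 3072) * 10 + 5
X_test  = np.random.randn(20, 3072) * 10 + 5

# Subtract the per-feature mean computed on the training set only.
mean = X_train.mean(axis=0)
X_train_centered = X_train - mean
X_test_centered  = X_test - mean   # reuse the training mean at test time

print(round(float(X_train_centered.mean()), 4))  # ~0
```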
Weight initialization
- Xavier initialization: scale random weights by 1/sqrt(fan_in) so activation variances stay roughly constant across layers
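A hedged sketch of Xavier initialization for fully connected layers (layer sizes here are arbitrary, chosen only for illustration); with tanh units, the standard deviation of the activations stays in a similar range from layer to layer instead of collapsing or blowing up.

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot initialization: W ~ N(0, 1/fan_in)."""
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# Illustrative 3-layer stack with tanh nonlinearities.
x = np.random.randn(64, 512)          # batch of 64 inputs
for fan_in, fan_out in [(512, 256), (256, 128), (128, 10)]:
    W = xavier_init(fan_in, fan_out)
    x = np.tanh(x @ W)
    print(fan_out, round(float(x.std()), 3))  # activation std stays comparable per layer
```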
Batch Normalization
- improves gradient flow through the network
- allows higher learning rates
- reduces the strong dependence on initialization
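A minimal sketch of the batch-norm forward pass at training time (the learnable gamma/beta parameters are passed in directly, and the running statistics used at test time are omitted for brevity).

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)                  # per-feature batch mean
    var = x.var(axis=0)                  # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta          # learnable scale and shift

# Usually inserted after a fully connected / conv layer, before the nonlinearity.
x = np.random.randn(32, 100) * 3 + 7     # toy pre-activations, not zero-mean
out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(round(float(out.mean()), 3), round(float(out.std()), 3))  # ~0, ~1
```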