preface:
deep learning: a family of data-modeling algorithms that learn highly complex representations through multiple layers of nonlinear transformations
in a sense, deep learning is roughly equivalent to deep neural networks (DNNs)
linear models have serious limitations
stacking multiple linear layers is equivalent to a single linear layer
activation functions are what introduce the nonlinearity
tensorflow provides 7 activation functions
such as: tf.nn.relu, tf.nn.sigmoid, tf.nn.tanh, etc.
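A minimal sketch (assuming TensorFlow 1.x) of applying three of these activations to the output of a linear layer; the tensors `x`, `w`, and `b` are illustrative placeholders:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=(None, 2))        # input batch
w = tf.Variable(tf.random_normal([2, 3], stddev=1.0))
b = tf.Variable(tf.zeros([3]))

linear = tf.matmul(x, w) + b        # linear transformation only
h_relu = tf.nn.relu(linear)         # max(0, z)
h_sigmoid = tf.nn.sigmoid(linear)   # 1 / (1 + exp(-z))
h_tanh = tf.nn.tanh(linear)         # tanh(z)
```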
multiple layers are needed to solve the exclusive-OR (XOR) problem
typical point: hidden layers extract compound features (see the sketch below)
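A hidden layer with a nonlinear activation lets the network build compound features from the two inputs, which is what makes XOR learnable. A hedged sketch (TensorFlow 1.x; the layer sizes and learning rate are illustrative):

```python
import tensorflow as tf

x  = tf.placeholder(tf.float32, shape=(None, 2))   # the two XOR inputs
y_ = tf.placeholder(tf.float32, shape=(None, 1))   # target 0/1

# hidden layer: the nonlinearity lets it form compound features of x1 and x2
w1 = tf.Variable(tf.random_normal([2, 4], stddev=1.0))
b1 = tf.Variable(tf.zeros([4]))
hidden = tf.nn.relu(tf.matmul(x, w1) + b1)

# output layer
w2 = tf.Variable(tf.random_normal([4, 1], stddev=1.0))
b2 = tf.Variable(tf.zeros([1]))
y = tf.nn.sigmoid(tf.matmul(hidden, w2) + b2)

loss = tf.reduce_mean(tf.square(y_ - y))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
```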
loss function
- cross entropy
- cross_entropy = -tf.reduce_mean(y_ * tf.log(y))
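A sketch of this in TensorFlow 1.x, where y is the predicted distribution and y_ the one-hot labels; the tf.clip_by_value call is added here to avoid log(0):

```python
import tensorflow as tf

y  = tf.placeholder(tf.float32, shape=(None, 10))  # predicted distribution
y_ = tf.placeholder(tf.float32, shape=(None, 10))  # one-hot labels

# clip y away from 0 before taking the log; reduce_mean averages over the batch
cross_entropy = -tf.reduce_mean(
    y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))
```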
how do we turn the results of forward propagation into a probability distribution?
answer: softmax is introduced
softmax.png (figure: the softmax formula)
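A sketch of how softmax turns the raw forward-propagation outputs (logits) into a probability distribution, plus the fused TensorFlow 1.x op that combines softmax and cross entropy (tensor names are illustrative):

```python
import tensorflow as tf

logits = tf.placeholder(tf.float32, shape=(None, 10))  # raw network outputs
labels = tf.placeholder(tf.float32, shape=(None, 10))  # one-hot targets

# softmax(y)_i = exp(y_i) / sum_j exp(y_j)
probs = tf.nn.softmax(logits)

# numerically more stable than applying softmax and cross entropy separately
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
```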
MSE (mean squared error)
MSE.png (figure: the MSE formula); custom loss functions can also be defined to match the needs of the application
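A sketch of both: the standard MSE, and a hand-written asymmetric loss of the kind the note alludes to (TensorFlow 1.x; the cost constants 1 and 10 are made up for illustration):

```python
import tensorflow as tf

y  = tf.placeholder(tf.float32, shape=(None, 1))  # prediction
y_ = tf.placeholder(tf.float32, shape=(None, 1))  # ground truth

# mean squared error
mse = tf.reduce_mean(tf.square(y_ - y))

# custom loss: penalize over-prediction and under-prediction differently
loss_more, loss_less = 1.0, 10.0          # illustrative costs
custom_loss = tf.reduce_sum(
    tf.where(tf.greater(y, y_),
             (y - y_) * loss_more,        # predicted too much
             (y_ - y) * loss_less))       # predicted too little
```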
optimization algorithms
- gradient descent
- back-propagation
- stochastic gradient descent
- mini-batch gradient descent as a trade-off between the two (see the sketch after this list)
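A sketch of how these fit together in TensorFlow 1.x: the optimizer computes gradients by back-propagation inside minimize(), and feeding small batches per step gives the stochastic/mini-batch trade-off (the dataset X, Y and batch_size are illustrative):

```python
import tensorflow as tf
import numpy as np

x  = tf.placeholder(tf.float32, shape=(None, 2))
y_ = tf.placeholder(tf.float32, shape=(None, 1))
w  = tf.Variable(tf.random_normal([2, 1], stddev=1.0))
y  = tf.matmul(x, w)

loss = tf.reduce_mean(tf.square(y_ - y))
# gradients are computed by back-propagation inside minimize()
train_step = tf.train.GradientDescentOptimizer(0.001).minimize(loss)

X = np.random.rand(128, 2)                 # illustrative dataset
Y = np.sum(X, axis=1, keepdims=True)
batch_size = 8

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1000):
        start = (i * batch_size) % 128     # pick a mini-batch, not the full set
        end = start + batch_size
        sess.run(train_step, feed_dict={x: X[start:end], y_: Y[start:end]})
```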
about the learning rate
- setup
exponential decay
# tf.train.exponential_decay
# formula
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
# ---------- usage
learning_rate = tf.train.exponential_decay(0.1, global_step, 100, 0.96, staircase=True)
# staircase=True, so the learning rate is multiplied by 0.96 every 100 steps, i.e. the decay curve is stair-shaped (see the sketch below)
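A sketch wiring the decayed rate into an optimizer (TensorFlow 1.x); passing global_step to minimize() is what increments the counter, so the rate actually decays. The loss here is a stand-in:

```python
import tensorflow as tf

global_step = tf.Variable(0, trainable=False)       # counts training steps
learning_rate = tf.train.exponential_decay(
    0.1, global_step, 100, 0.96, staircase=True)

w = tf.Variable(tf.random_normal([2, 1], stddev=1.0))
loss = tf.reduce_sum(tf.square(w))                   # illustrative loss

# global_step is incremented by 1 each time train_step runs
train_step = tf.train.GradientDescentOptimizer(learning_rate) \
                     .minimize(loss, global_step=global_step)
```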
over-fitting
definition: the model memorizes the random noise in the training data instead of learning the overall trend
way to avoid it: regularization
regularization adds to the loss a penalty term that measures the complexity of the model's weights
- L1 (one-norm) regularization, which makes the parameters sparse (more zeros)
- L2 (two-norm) regularization, which is differentiable everywhere and therefore more commonly used (sketched below)
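A sketch of adding the regularization term to the loss (TensorFlow 1.x; tf.contrib.layers provides ready-made L1/L2 regularizers, and the regularization weight 0.01 is illustrative):

```python
import tensorflow as tf

x  = tf.placeholder(tf.float32, shape=(None, 2))
y_ = tf.placeholder(tf.float32, shape=(None, 1))
w  = tf.Variable(tf.random_normal([2, 1], stddev=1.0))
y  = tf.matmul(x, w)

base_loss = tf.reduce_mean(tf.square(y_ - y))

# L1: scale * sum(|w|)     -> pushes many weights to exactly zero (sparse)
l1_term = tf.contrib.layers.l1_regularizer(0.01)(w)
# L2: scale * sum(w^2) / 2 -> differentiable everywhere, shrinks weights smoothly
l2_term = tf.contrib.layers.l2_regularizer(0.01)(w)

loss = base_loss + l2_term   # total loss = data term + complexity penalty
```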
- tf.train.ExponentialMovingAverage
# it is a class (an object is constructed from it)
# constructor parameters:
def __init__(self, decay, num_updates=None, zero_debias=False)
# args: decay controls how the shadow variable, i.e. object.average(variable), is computed;
# num_updates is used to update decay and is usually global_step
# global_step is an auxiliary variable that is incremented by 1 at every training step
# the member function apply() is called to create the shadow variables and the op that updates their values
# object.apply(self, var_list=None)
# decay schedule: decay = min{DECAY, (1.0 + num_updates) / (10.0 + num_updates)}, where DECAY is the fixed value passed in
# object.average() returns the value: shadow_variable = decay * shadow_variable + (1 - decay) * variable
# a decay close to 1 keeps the shadow value close to its history and slows down its change
# this mechanism does not change the parameters themselves; it affects the results of forward propagation
# by using the averaged (shadow) values, e.g. when evaluating the model
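A runnable sketch of the whole flow (TensorFlow 1.x): construct the object, call apply() to get the update op, and read the shadow value with average(); the decay 0.99 and the single variable v are illustrative:

```python
import tensorflow as tf

v = tf.Variable(0, dtype=tf.float32)
global_step = tf.Variable(0, trainable=False)

# decay = min(0.99, (1 + global_step) / (10 + global_step))
ema = tf.train.ExponentialMovingAverage(0.99, num_updates=global_step)
maintain_averages_op = ema.apply([v])     # creates the shadow variable and its update op

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    sess.run(tf.assign(v, 5))
    sess.run(maintain_averages_op)        # shadow = decay*shadow + (1-decay)*v
    print(sess.run([v, ema.average(v)]))  # [5.0, 4.5], since decay = 0.1 here
```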