Approach
By properly defining attention for convolutional neural networks, we can use this kind of information to significantly improve the performance of a student CNN by forcing it to mimic the attention maps of a more powerful teacher network.

To define a spatial attention mapping function, we make the implicit assumption that the absolute value of a hidden neuron's activation (obtained when the network is evaluated on a given input) can be used as an indication of the importance of that neuron with respect to the specific input.
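Concretely, an attention map can be built by aggregating these absolute values across the channel dimension, e.g. $F_{\mathrm{sum}}^{p}(A) = \sum_{i=1}^{C} |A_i|^p$. Below is a minimal PyTorch sketch of this mapping; the helper name `attention_map` and the default $p = 2$ are our choices for illustration:

```python
import torch
import torch.nn.functional as F

def attention_map(activation: torch.Tensor, p: int = 2) -> torch.Tensor:
    """Map a (batch, C, H, W) activation tensor to a flattened spatial
    attention map: sum |A_i|^p over the channel dimension, vectorize,
    then L2-normalize so maps from different layers are comparable."""
    am = activation.abs().pow(p).sum(dim=1)  # (batch, H, W)
    am = am.view(am.size(0), -1)             # vectorize: Q = vec(F(A))
    return F.normalize(am, p=2, dim=1)       # Q / ||Q||_2
```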

Mid-level attention maps have higher activation levels around the eyes, nose, and lips, while high-level activations correspond to the whole face.
- Attention transfer
In attention transfer, given the spatial attention maps of a teacher network, the goal is to train a student network that will not only make correct predictions but will also have attention maps similar to those of the teacher. Let S, T and $W_S$, $W_T$ denote the student, the teacher, and their respective weights, and let $\mathcal{L}(W, x)$ denote a standard cross-entropy loss. Let also $\mathcal{I}$ denote the indices of all teacher-student activation layer pairs for which we want to transfer attention maps. Then we can define the following total loss:

$$\mathcal{L}_{AT} = \mathcal{L}(W_S, x) + \frac{\beta}{2} \sum_{j \in \mathcal{I}} \left\| \frac{Q_S^j}{\|Q_S^j\|_2} - \frac{Q_T^j}{\|Q_T^j\|_2} \right\|_p$$

where $Q_S^j = \mathrm{vec}(F(A_S^j))$ and $Q_T^j = \mathrm{vec}(F(A_T^j))$ are the vectorized attention maps of the $j$-th student-teacher pair, $F$ is the attention mapping function, and $\beta$ weights the attention term (the paper uses $p = 2$).
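As a sketch, this objective can be implemented in PyTorch roughly as follows, reusing the `attention_map` helper above; implementations often use the mean squared difference between normalized maps rather than an exact $p$-norm, and the value of `beta` here is illustrative:

```python
import torch

def at_loss(ce_loss, student_acts, teacher_acts, beta=1e3):
    """Total loss: student cross entropy plus beta/2 times the summed
    distances between normalized student and teacher attention maps.
    student_acts / teacher_acts are lists of (batch, C, H, W) tensors
    from the chosen layer pairs (the index set I)."""
    at_term = 0.0
    for a_s, a_t in zip(student_acts, teacher_acts):
        q_s = attention_map(a_s)           # normalized vec(F(A_S^j))
        q_t = attention_map(a_t).detach()  # teacher maps are fixed targets
        at_term = at_term + (q_s - q_t).pow(2).mean()
    return ce_loss + (beta / 2) * at_term
```

During training the teacher stays frozen; detaching its attention maps ensures gradients flow only through the student's predictions and activations.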
- Gradient-based attention transfer
In this case we define attention as the gradient of the loss w.r.t. the input, which can be viewed as an input sensitivity map: the attention at an input spatial location encodes how sensitive the output prediction is to changes at that location. Defining $J_S = \frac{\partial}{\partial x}\mathcal{L}(W_S, x)$ and $J_T = \frac{\partial}{\partial x}\mathcal{L}(W_T, x)$, the total loss becomes

$$\mathcal{L}_{AT}(W_S, W_T, x) = \mathcal{L}(W_S, x) + \frac{\beta}{2} \left\| J_S - J_T \right\|_2,$$

which requires a second backward pass through the student (double backpropagation) to optimize.
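Since $J_S$ is itself a gradient, the penalty must be differentiated through it. A hedged PyTorch sketch, assuming `student` and `teacher` are classifiers and `criterion` is a cross-entropy loss (the function name and `beta` value are illustrative):

```python
import torch

def gradient_at_loss(student, teacher, x, y, criterion, beta=1e3):
    """Gradient-based attention transfer: match the input-sensitivity
    maps dL/dx of student and teacher. create_graph=True keeps the
    graph of the student's gradient so the penalty term can itself be
    optimized (double backpropagation)."""
    x = x.clone().requires_grad_(True)
    loss_s = criterion(student(x), y)
    (j_s,) = torch.autograd.grad(loss_s, x, create_graph=True)
    loss_t = criterion(teacher(x), y)
    (j_t,) = torch.autograd.grad(loss_t, x)
    return loss_s + (beta / 2) * (j_s - j_t.detach()).pow(2).sum()
```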

Experiment

References:
Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer, Sergey Zagoruyko and Nikos Komodakis, ICLR 2017