Capsule Networks
Demo - We'll run the TensorFlow code for CapsNet, which reports new state-of-the-art performance on MNIST, ahead of convolutional networks
What is the current State of the Art in Image Classification & Object Recognition?
- Image classification is a central problem in machine learning
- We would not have been successful had we simply connected a raw multi-layer perceptron to every pixel of an image.
- Besides quickly becoming intractable, this direct approach is inefficient because pixels are spatially correlated.
- So we first need to extract features that are:
---meaningful and
---low-dimensional
- And that's where convolutional neural networks come into play!
- Convolutional networks are the state-of-the-art algorithm.
- The basic idea is to show an algorithm labeled images (e.g. photos of dogs labeled "dog"); eventually it starts to abstract the features that are most likely to indicate the presence of an actual dog.
- We can use this model to classify new, unlabeled images.
- First, an input image is fed to the network.
- Filters of a given size scan the image and perform convolutions.
- The obtained features then go through an activation function. Then, the output goes through a succession of pooling and other convolution operations.
- Features are reduced in dimension as we go deeper into the network.
- At the end, high-level features are flattened and fed to fully connected layers, which will eventually yield class probabilities through a softmax layer.
- During training time, the network learns how to recognize the features that make a sample belong to a given class through backpropagation.
ConvNets are thus a way to learn features that we would otherwise have had to handcraft ourselves.
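The pipeline described above (convolution, activation, pooling, flatten, fully connected layer, softmax) can be sketched in a few lines of plain Python. This is only a toy illustration: the 6x6 image, the edge filter and the dense-layer weights are made-up values, there is a single filter and a single channel, and nothing here is learned by backpropagation.

```python
import math

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one filter."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(fmap):
    """Element-wise ReLU activation."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(image, kernel, weights, biases):
    """conv -> ReLU -> max pool -> flatten -> fully connected -> softmax."""
    features = max_pool(relu(conv2d(image, kernel)))
    flat = [v for row in features for v in row]
    logits = [sum(w * x for w, x in zip(w_row, flat)) + b
              for w_row, b in zip(weights, biases)]
    return softmax(logits)

# Toy 6x6 image containing a vertical edge, and a hand-picked edge filter.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
# The 4x4 conv output pools down to 2x2, i.e. 4 features; two made-up classes.
weights = [[0.5, 0.5, 0.5, 0.5], [-0.5, -0.5, -0.5, -0.5]]
biases = [0.0, 0.0]
probs = forward(image, kernel, weights, biases)
```

Note how the 6x6 input shrinks to a 4x4 feature map and then to 2x2 after pooling, which is the dimension reduction mentioned in the list.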
Improvements to CNNs
AlexNet
- Krizhevsky et al. introduced a better non-linearity with the ReLU activation, whose derivative is 0 for negative inputs and 1 for positive inputs. This proved to be efficient for gradient propagation.
- Popularized dropout as regularization. From a representation point of view, you force the network to forget things at random, so that it can see your next input data from a better perspective.
- Used data augmentation. When fed to the network, images are shown with random translations, horizontal reflections and crops. That way, the network is forced to be more aware of the attributes of the images, rather than the images themselves.
- Deeper. They stacked more convolutional layers before the pooling operations. The representation consequently captures finer features that turn out to be useful for classification.
This network largely outperformed what was state-of-the-art back in 2012, with a 15.4% top-5 error on the ImageNet dataset.
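Two of the tricks above are easy to show in code. The sketch below implements ReLU with its derivative, and dropout in its "inverted" form (survivors are rescaled during training). That form is my choice for illustration and an assumption: the notes do not say which variant AlexNet used. The input values and the 0.5 drop probability are made up.

```python
import random

def relu(x):
    """ReLU activation: passes positive inputs, clamps the rest to 0."""
    return x if x > 0 else 0.0

def relu_grad(x):
    """Derivative of ReLU: 0 for negative inputs, 1 for positive inputs."""
    return 1.0 if x > 0 else 0.0

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale survivors."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the run is repeatable
pre_activations = [-2.0, -0.5, 0.5, 2.0]
hidden = [relu(v) for v in pre_activations]
grads = [relu_grad(v) for v in pre_activations]
dropped = dropout(hidden, p_drop=0.5, rng=rng)
```

Because the gradient is exactly 1 wherever a unit is active, it does not shrink as it passes back through the activation, unlike with saturating functions such as sigmoid.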
VGGNet
- Went deeper.
GoogLeNet
- Convolutions with different filter sizes are applied to the same input and then concatenated together.
- This allows the model to take advantage of multi-level feature extraction at each step. For example, general features can be extracted by the 5x5 filters at the same time that more local features are captured by the 3x3 convolutions.
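A rough sketch of that idea, reduced to one dimension to keep it short: several filters of different widths read the same input, and their outputs are stacked as channels. The signal and the averaging kernels are made-up values, and a real inception module has more branches than this.

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation, zero-padded so the output is as long as the input.

    Assumes an odd kernel length.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(padded[i + k] * kernel[k] for k in range(len(kernel)))
            for i in range(len(signal))]

def inception_like(signal, kernels):
    """Apply every kernel to the same input and stack the results as channels."""
    return [conv1d_same(signal, k) for k in kernels]

signal = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
kernels = [
    [1.0],                        # width 1: passes the input through
    [1 / 3, 1 / 3, 1 / 3],        # width 3: local average
    [0.2, 0.2, 0.2, 0.2, 0.2],    # width 5: wider context
]
channels = inception_like(signal, kernels)
```

The single spike in the input shows up sharply in the first channel and progressively more spread out in the wider ones, so the next layer gets local and broader views of the same position side by side.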
ResNet
- At some point, we realize that stacking more layers does not lead to better performance. In fact, the exact opposite occurs. But why is that? The gradient: it has to survive the trip back through every layer.
- Every two layers, there is an identity mapping via an element-wise addition. This proved to be very helpful for gradient propagation, as the error can be backpropagated through multiple paths.
- This helps to combine different levels of features at each step of the network, just as we saw with the inception modules.
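To see why the identity path helps the gradient, here is a deliberately simplified scalar sketch. Each "layer" is just a multiplication by a weight, and each residual block wraps a single layer rather than two, so this is an assumption-laden toy and not the ResNet architecture itself. The weight 0.1 and the depth 20 are made up.

```python
def residual_block(x, w):
    """Output = identity path + transformed path (element-wise addition)."""
    return x + w * x

def plain_chain_grad(weights):
    """d(output)/d(input) when the layers are simply stacked: product of weights."""
    grad = 1.0
    for w in weights:
        grad *= w
    return grad

def residual_chain_grad(weights):
    """Same layers wrapped as y = x + w * x: each factor becomes (1 + w)."""
    grad = 1.0
    for w in weights:
        grad *= (1.0 + w)
    return grad

weights = [0.1] * 20
plain = plain_chain_grad(weights)        # shrinks geometrically toward zero
residual = residual_chain_grad(weights)  # the identity path keeps it away from zero
```

With small weights the plain stack's gradient collapses, while the "+ 1" contributed by every skip connection gives the error a direct route back to the early layers.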
DenseNet
- Proposes entire blocks in which the layers are all connected to one another.
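A minimal sketch of that connectivity pattern, assuming (as in the DenseNet paper, not stated in these notes) that each layer's output is concatenated onto everything that came before it. `mean_layer` is a hypothetical stand-in for a real convolutional layer.

```python
def dense_block(x, layers):
    """Each layer reads the concatenation of the input and all earlier outputs."""
    features = list(x)
    for layer in layers:
        features = features + layer(features)  # concatenate, do not overwrite
    return features

def mean_layer(features):
    """Hypothetical stand-in for a conv layer: emits one new feature."""
    return [sum(features) / len(features)]

out = dense_block([1.0, 3.0], [mean_layer, mean_layer])
```

The feature list only ever grows, so later layers can reuse early, low-level features directly.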
Patterns
- Networks are designed to be deeper and deeper.
- Computational tricks (ReLU, dropout, batch normalization) have also been introduced alongside them and have had a significant impact on performance.
- Increasing use of connections between the layers of the network, which helps produce diverse features and has proved useful for gradient propagation.
The Problem with Convolutional Networks
- Max-pooling throws away information about the precise position of the entity within the region.
- CNNs have trouble generalizing to novel viewpoints.
- The ability to deal with translation is built in, but for the other dimensions of an affine transformation we have to choose between replicating feature detectors on a grid that grows exponentially with the number of dimensions, or increasing the size of the labeled training set in a similarly exponential way.
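The first point in the list above is easy to demonstrate. In this small sketch (1-D, with made-up inputs), two signals with the same feature at different positions inside a pooling window become indistinguishable after max-pooling.

```python
def max_pool_1d(values, size):
    """Non-overlapping 1-D max pooling."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

a = [9, 0, 0, 0, 0, 0, 0, 0]   # feature at position 0
b = [0, 0, 0, 9, 0, 0, 0, 0]   # same feature, shifted to position 3

pooled_a = max_pool_1d(a, 4)   # [9, 0]
pooled_b = max_pool_1d(b, 4)   # [9, 0]
```

Everything downstream of the pooling layer sees the same `[9, 0]` for both inputs, so where exactly the feature sat within the window is gone for good.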
Oh! And don't forget this paper (6 days old at the time of writing) on fooling ConvNets by modifying just a few pixels:
https://arxiv.org/abs/1710.08864
The Capsule Network
- CNNs have no built-in way to handle rotation: if they are trained on objects in one orientation, they will have trouble when the orientation is changed.
- Pooling gives some translational invariance in much deeper layers, but only in a crude way.
- The human brain must achieve translational invariance in a much better way.
- Hinton posits that the brain has modules he calls "capsules", which are particularly good at handling different types of visual stimulus and encoding things like pose. For instance, there might be one for cars and another for faces.
- The brain must have a mechanism for "routing" low-level visual information to what it believes is the best capsule for handling it.
- According to Hinton, CNNs do routing by pooling. Pooling was introduced to reduce redundancy of representation and reduce the number of parameters, recognizing that precise location is not important for object detection.
- But pooling does routing in a very crude way. For instance, max pooling just picks the neuron with the highest activation, not the one that is most likely relevant to the task at hand.
What is it?
- The capsule is a neural network architecture where a typical layer is modified to contain sub-structures.
- The typical layer of units becomes a layer of capsules.
- Instead of making a neural network "deeper" in height, it makes it deeper in nesting or inner structure. That's all it is, basically.
- The model is robust to affine transformations
2 Key Features
Layer-based squashing & dynamic routing.
- In a typical neural network, only the output of a unit is squashed (by a non-linear activation function).
- In a capsule network, a capsule is squashed as a whole vector. This makes the "unit" of the neural net bigger.
- Is it more biologically plausible than a traditional network? I would say, "slightly". The new one relies less on backpropagation, and might work by associative learning between capsules.
- So it's replacing the scalar-output feature detectors of CNNs with vector-output capsules, and max-pooling with routing-by-agreement.
- To replicate learned knowledge across space, they made all but the last layer of capsules convolutional.
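The whole-vector squashing mentioned above can be sketched as follows. The formula is the one from the CapsNet paper (Sabour, Frosst and Hinton, 2017), which these notes do not spell out: v = (|s|^2 / (1 + |s|^2)) * (s / |s|). The `eps` term is my addition to avoid dividing by zero, and the example vectors are made up.

```python
import math

def squash(s, eps=1e-9):
    """Rescale vector s so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(c * c for c in s)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * c for c in s]

def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))

short = squash([0.1, 0.0])     # short vectors shrink to almost zero length
long_ = squash([30.0, 40.0])   # long vectors end up with length just below 1
```

Unlike a scalar activation applied to each unit, this acts on the capsule as a whole: the direction of the vector is untouched and only its length is squeezed.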
More
- For low level capsules, location information is “place-coded” by which capsule is active.
- As we ascend the hierarchy more and more of the positional information is “rate-coded” in the real-valued components of the output vector of a capsule.
- This shift from place-coding to rate-coding combined with the fact that higher-level capsules represent more complex entities with more degrees of freedom suggests that the dimensionality of capsules should increase as we ascend the hierarchy.
- There are many possible ways to implement the general idea of capsules.
The cost of this new architecture?
- The forward pass has an extra outer loop.
- It takes r iterations over all units (instead of 1) to compute the output.
- The data flow is more complicated
- That makes it harder to calculate gradients, and the model may suffer more from vanishing gradients.
- This could prevent the network from scaling and becoming a deep learning rock star.
- The NIPS paper did not provide stability analysis for the forward pass. We don't know the asymptotic behavior of the layers after r iterations. No idea how stable it will be for attacking difficult learning problems.
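To make the "extra outer loop" concrete, here is a small pure-Python sketch of routing-by-agreement, following the routing procedure in the paper as I understand it. It assumes the prediction vectors `u_hat` are already given (in the paper they come from multiplying lower-capsule outputs by learned weight matrices); the toy values, and the `eps` term in `squash`, are made up for illustration.

```python
import math

def squash(s, eps=1e-9):
    """Rescale vector s so its length lies in [0, 1), keeping its direction."""
    sq_norm = sum(c * c for c in s)
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * c for c in s]

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, iterations=3):
    """u_hat[i][j] is lower capsule i's prediction vector for upper capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]   # routing logits, start at zero
    for _ in range(iterations):                # the extra outer loop (r iterations)
        c = [softmax(row) for row in b]        # coupling coefficients per lower capsule
        v = []
        for j in range(n_out):
            # Weighted sum of predictions for upper capsule j, then squash.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):
            for j in range(n_out):
                # Agreement: dot product of the prediction with the current output.
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c

# Three lower capsules, two upper capsules, 2-D vectors.
# All three agree about upper capsule 0 and disagree about upper capsule 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = route(u_hat, iterations=3)
```

After a few iterations the coupling coefficients shift toward the upper capsule whose incoming predictions agree, and that capsule ends up with the longer output vector. Every extra iteration repeats the weighted sum, the squash and the agreement update over all capsule pairs, which is where the additional forward-pass cost comes from.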