Proposals

Generate region proposals for the input image (using Selective Search)
Pass each region proposal through a CNN to extract a feature vector for that proposal
Train a binary SVM classifier for each category
Use bounding box regression to adjust the predicted boxes
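
The four steps above can be read as a simple data flow. The sketch below is only an illustration of that flow, not the paper's implementation: `propose`, `extract_features`, `classifiers`, and `refine_box` are hypothetical stand-ins supplied by the caller, and the `score_threshold` cut-off is an assumption added to keep the output small.

```python
def detect(image, propose, extract_features, classifiers, refine_box,
           score_threshold=0.0):
    """Run the four R-CNN stages on one image and return scored detections.

    All callables are hypothetical stand-ins:
      propose(image)                -> list of boxes
      extract_features(image, box)  -> feature vector for one proposal
      classifiers                   -> dict: category name -> scoring function
      refine_box(box, feature)      -> adjusted box
    """
    detections = []
    for box in propose(image):                    # step 1: region proposals
        feature = extract_features(image, box)    # step 2: one CNN pass per proposal
        for category, score_fn in classifiers.items():  # step 3: one SVM per category
            score = score_fn(feature)
            if score > score_threshold:           # assumed cut-off, not from the source
                refined = refine_box(box, feature)  # step 4: box regression
                detections.append((category, score, refined))
    # Highest-scoring detections first.
    return sorted(detections, key=lambda d: d[1], reverse=True)


# Toy usage with stub stages (values are made up for illustration).
toy_detections = detect(
    image=None,
    propose=lambda img: [(0, 0, 2, 2), (1, 1, 3, 3)],
    extract_features=lambda img, box: [float(box[0])],
    classifiers={"cat": lambda f: f[0] - 0.5, "dog": lambda f: 0.5 - f[0]},
    refine_box=lambda box, f: box,
)
```

The loop makes the cost structure visible: the feature extractor runs once per proposal, and every feature vector is scored once per category.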

Module Design

System overview
[Figure: R-CNN diagram]
Region proposals
It is fairly easy for a CNN to classify an image, but it is hard to locate a specific object within an image. One solution is to first generate a set of region proposals, pass each proposal through the CNN to classify it, and finally output the region with the highest score. The problem then becomes how to generate good proposals. Many algorithms have been designed for this, such as objectness, selective search, and category-independent object proposals. The method used in this paper is Selective Search, which generates around 2000 region proposals per image.
Feature extraction
After a set of proposals is obtained, each proposal is passed through the CNN to generate a 4096-dimensional feature vector. All proposals are warped to fit the fixed input size of the CNN, which can cause distortion (Fast R-CNN later proposed the RoI pooling layer to address this problem).
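
To make the distortion concrete, here is a minimal sketch of warping a proposal to a fixed square size. It assumes a grayscale image stored as a list of rows and uses nearest-neighbour sampling; the function name `warp_region` and the box convention `(x1, y1, x2, y2)` with exclusive right and bottom edges are choices made for this example, not something taken from the paper.

```python
def warp_region(image, box, out_size):
    """Crop `box` from `image` and resize the crop to out_size x out_size.

    image: 2-D list of pixel values (rows of equal length).
    box:   (x1, y1, x2, y2); x2 and y2 are exclusive.
    Width and height are scaled independently, so the aspect ratio of the
    crop is NOT preserved. This is the source of the distortion.
    """
    x1, y1, x2, y2 = box
    crop_w, crop_h = x2 - x1, y2 - y1
    if crop_w <= 0 or crop_h <= 0:
        raise ValueError("box must have positive width and height")
    warped = []
    for row in range(out_size):
        src_y = y1 + (row * crop_h) // out_size   # nearest source row
        warped.append([
            image[src_y][x1 + (col * crop_w) // out_size]  # nearest source column
            for col in range(out_size)
        ])
    return warped


# A 4-wide, 2-tall region squeezed into a 2x2 square.
toy_image = [[1, 2, 3, 4],
             [5, 6, 7, 8]]
toy_warped = warp_region(toy_image, (0, 0, 4, 2), 2)
```

In this toy case the wide region loses half of its columns while keeping every row, which is the kind of shape change a wide or tall proposal undergoes when forced into a square input.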
SVM classification
The 4096-dimensional feature vector is used by the SVMs to classify each proposal. Because there are C different SVMs for C different categories, each feature vector is scored C times. (This makes the system slow to produce a result; Fast R-CNN instead removes the SVM stage and performs classification within the CNN model.)
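
The "scored C times" point can be sketched as follows, assuming each SVM is linear and therefore described by one weight vector and one bias. The function name `svm_scores` and the toy numbers are illustrative only; real feature vectors would have 4096 entries.

```python
def svm_scores(feature, weights, biases):
    """Score one feature vector with C linear SVMs (one per category).

    feature: list of floats (4096 values in R-CNN; 2 in the toy example).
    weights: list of C weight vectors, each the same length as `feature`.
    biases:  list of C floats.
    Returns a list of C scores; a larger score means a more confident match.
    """
    return [
        sum(w * f for w, f in zip(w_c, feature)) + b_c  # one dot product per category
        for w_c, b_c in zip(weights, biases)
    ]


# Toy example with C = 3 categories and a 2-dimensional feature.
toy_scores = svm_scores(
    feature=[1.0, 2.0],
    weights=[[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
    biases=[0.0, 0.5, 0.0],
)
best_category = max(range(len(toy_scores)), key=lambda c: toy_scores[c])
```

Every proposal repeats this work for all C categories, so the cost grows with both the number of proposals and the number of categories.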

Training

Supervised pre-training

The CNN is first pre-trained on ILSVRC 2012 using image-level annotations only (no bounding box labels).

Fine-tuning

After pre-training, the CNN is fine-tuned on the PASCAL VOC dataset: the final 1000-way classification layer is replaced by a 21-way classification layer (20 object classes plus one background class). A region proposal is labeled positive if it has at least 0.5 IoU overlap with a ground-truth bounding box, and all other proposals are labeled negative (background). The CNN is then trained using SGD with a batch size of 128 (32 positive windows and 96 background windows).
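
The labeling rule and the 32/96 batch split can be sketched as below. This is a simplified illustration under a few assumptions: boxes are `(x1, y1, x2, y2)` tuples, class ids run from 1 to 20 with 0 reserved for background, and batches are drawn with replacement so the toy example works with few proposals. The function names are made up for this example.

```python
import random


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0                      # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def label_proposal(proposal, gt_boxes, gt_classes, threshold=0.5):
    """Return the class of the best-overlapping ground-truth box,
    or 0 (background) if the best IoU is below `threshold`."""
    best_iou, best_class = 0.0, 0
    for box, cls in zip(gt_boxes, gt_classes):
        overlap = iou(proposal, box)
        if overlap > best_iou:
            best_iou, best_class = overlap, cls
    return best_class if best_iou >= threshold else 0


def sample_batch(labels, n_pos=32, n_bg=96, rng=random):
    """Pick proposal indices for one SGD batch: n_pos positives, n_bg background.

    Sampling is with replacement (an assumption of this sketch); it raises
    IndexError if there are no positives or no background proposals at all.
    """
    pos = [i for i, label in enumerate(labels) if label != 0]
    bg = [i for i, label in enumerate(labels) if label == 0]
    return rng.choices(pos, k=n_pos) + rng.choices(bg, k=n_bg)


# Toy usage: 5 positive proposals and 10 background proposals.
toy_labels = [1] * 5 + [0] * 10
toy_batch = sample_batch(toy_labels)
```

Fixing the ratio at 32 to 96 biases each batch toward positives, which are rare compared to background among the roughly 2000 proposals per image.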

SVM training

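An IoU threshold is used to label the feature vectors of region proposals for SVM training. In the paper, the ground-truth boxes of a class serve as its positive examples, proposals whose IoU with every ground-truth box of that class is below the threshold (0.3) are labeled negative, and the remaining proposals are ignored. The threshold has to be chosen carefully, because it has a noticeable effect on mAP.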
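
A minimal sketch of the labeling step for one class's SVM, assuming the scheme described in the R-CNN paper: ground-truth boxes are positives, proposals below the IoU threshold are negatives, and anything in between is left out of training. The function name `svm_label` and the use of `None` for ignored regions are choices made for this example.

```python
def svm_label(max_iou, is_ground_truth, neg_threshold=0.3):
    """Label one region for a single class's SVM.

    max_iou:         highest IoU between the region and any ground-truth
                     box of this class.
    is_ground_truth: True if the region is itself a ground-truth box.
    Returns +1 (positive), -1 (negative), or None (ignored in training).
    """
    if is_ground_truth:
        return 1
    if max_iou < neg_threshold:
        return -1
    return None  # overlaps the object too much to be a clean negative


# The same proposal changes label when the threshold changes.
label_at_03 = svm_label(0.4, is_ground_truth=False, neg_threshold=0.3)
label_at_05 = svm_label(0.4, is_ground_truth=False, neg_threshold=0.5)
```

Raising the threshold turns partially overlapping proposals into negatives, while lowering it discards them, which is why the choice of threshold affects the final mAP.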

Bounding box regression

A predicted box may cover the object only partially, so bounding box regression is applied to achieve better coverage. The details are not described here; the idea is to model the relationship between the predicted box coordinates and the ground-truth box as a parameterized function, and to learn a suitable set of parameters for that function.
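
As a rough illustration of what such a formulation can look like, the sketch below assumes the parameterization used in the R-CNN paper: boxes are given as center, width, and height, the center offset is scaled by the proposal size, and the size change is expressed in log space. Only the target computation and its inverse are shown; learning to predict the offsets from CNN features is omitted, and the function names are made up for this example.

```python
import math


def regression_targets(proposal, ground_truth):
    """Offsets (tx, ty, tw, th) that map `proposal` onto `ground_truth`.

    Both boxes are (center_x, center_y, width, height).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return (
        (gx - px) / pw,        # horizontal shift, relative to proposal width
        (gy - py) / ph,        # vertical shift, relative to proposal height
        math.log(gw / pw),     # log of the width ratio
        math.log(gh / ph),     # log of the height ratio
    )


def apply_regression(proposal, offsets):
    """Apply (tx, ty, tw, th) to a proposal and return the adjusted box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    return (px + pw * tx, py + ph * ty, pw * math.exp(tw), ph * math.exp(th))


# A proposal that is shifted left and too narrow compared to the object.
toy_proposal = (10.0, 10.0, 4.0, 4.0)
toy_truth = (12.0, 10.0, 8.0, 4.0)
toy_offsets = regression_targets(toy_proposal, toy_truth)
toy_adjusted = apply_regression(toy_proposal, toy_offsets)
```

Expressing the offsets relative to the proposal's own size keeps the targets comparable across small and large boxes.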