Notes on "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks"

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks by Ren et al. (2016)

Region-proposal-based approaches to object detection tend to achieve state-of-the-art performance on many benchmark datasets. Faster R-CNN uses a convolutional network to propose regions, which lets the proposal step share computation with the detector. As a result, Faster R-CNN is faster than R-CNN and Fast R-CNN (this is not real-time capable!). See the slides below, which are taken from a Coursera course on convolutional neural networks.


Faster R-CNN uses two convolutional neural networks: one for region proposal and the other for detection. As shown below, the classifier uses the proposals, meaning the "Region Proposal Network" (RPN) tells the classifier where to look.
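The data flow described above can be sketched with toy stand-ins. This is only an illustration of how the pieces connect, not the authors' implementation: the function names (`backbone`, `rpn`, `detector`, `faster_rcnn`), the file name, and the box values are all hypothetical placeholders.

```python
# Toy sketch of the Faster R-CNN data flow. Every function here is a
# hypothetical stand-in; no real network is evaluated.

calls = {"backbone": 0}  # counts how often the shared features are computed


def backbone(image):
    """Stand-in for the shared convolutional layers."""
    calls["backbone"] += 1
    return {"features_of": image}


def rpn(features):
    """Stand-in RPN: returns (box, objectness score) pairs.

    The boxes (x1, y1, x2, y2) and scores are made-up placeholder values.
    """
    return [((10, 10, 50, 80), 0.9), ((30, 40, 200, 220), 0.7)]


def detector(features, proposals):
    """Stand-in detector head: labels each proposed box."""
    return [{"box": box, "label": "object", "score": score}
            for box, score in proposals]


def faster_rcnn(image):
    features = backbone(image)             # computed once per image
    proposals = rpn(features)              # "where to look"
    return detector(features, proposals)   # "what is there"


detections = faster_rcnn("img.jpg")
print(len(detections), "detections,", calls["backbone"], "backbone pass")
```

The point of the sketch is that the feature map is computed a single time and then reused by both the proposal step and the detection step.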

To generate region proposals, a small convolutional network is slid over the feature map. Its input is an n by n spatial window of the feature map. Each sliding window is mapped to a lower-dimensional feature, which is then fed into a box-regression layer and a box-classification layer (the image below shows only one window, but this is repeated over all windows). The outputs are the proposed regions, expressed relative to k reference boxes called anchors at each window position (a pyramid of k anchors, which is more efficient than a pyramid of images or filters). Note that this approach is translation invariant (MultiBox methods, which use the k-means algorithm, are not!).
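A minimal sketch of the anchor idea, in plain Python, may help here. It assumes the default setting reported in the paper (3 scales and 3 aspect ratios, so k = 9, with a feature stride of 16); the function names and the choice to place each anchor at the centre of its cell are my own illustration, not the authors' code.

```python
import math


def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at the origin.

    Each box has area scale**2 and aspect ratio (height / width) equal to ratio.
    """
    anchors = []
    for s in scales:
        area = float(s * s)
        for r in ratios:
            w = math.sqrt(area / r)
            h = w * r
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors


def all_anchors(feat_h, feat_w, stride=16):
    """Shift the same k base anchors to every cell of a feat_h x feat_w feature map."""
    base = base_anchors()
    out = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Centre of cell (i, j) in image coordinates (an assumed convention).
            cx = (j + 0.5) * stride
            cy = (i + 0.5) * stride
            for x1, y1, x2, y2 in base:
                out.append((x1 + cx, y1 + cy, x2 + cx, y2 + cy))
    return out


k = len(base_anchors())
anchors = all_anchors(feat_h=3, feat_w=4)
print(k, len(anchors))  # k anchors per location, feat_h * feat_w * k in total
```

Because the same k boxes are reused at every position, moving an object in the image simply moves the matching anchor with it, which is the translation invariance mentioned above. For each position the box-classification layer then outputs 2k scores and the box-regression layer 4k coordinates.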

A 4-step training method is adopted so that the region proposal network and the detection network share features.
  1. The RPN is initialized with an ImageNet-pretrained model and fine-tuned end to end for the region proposal task.
  2. Fast R-CNN is initialized with an ImageNet-pretrained model and trained separately using the proposals generated by the step 1 RPN (at this point the two networks do not share convolutional layers).
  3. The detector network is used to initialize RPN training; the shared convolutional layers are fixed and only the layers unique to the RPN are fine-tuned (now the two networks share convolutional layers).
  4. The layers unique to Fast R-CNN are fine-tuned while the shared convolutional layers are kept fixed.
A rule in any machine learning training is: do not use two optimization functions at once! All the details and the source code are available. The method is evaluated on PASCAL VOC and other publicly available datasets.
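The 4-step schedule above can be written down as a small table of what is trained and what is frozen at each step. This is only a bookkeeping sketch: the group names (`conv`, `rpn_head`, `detector_head`) and the dictionary layout are hypothetical, and no actual training is performed.

```python
# Hypothetical parameter groups owned by each network.
NETWORK_GROUPS = {
    "rpn": ["conv", "rpn_head"],
    "detector": ["conv", "detector_head"],
}

# One entry per step of the alternating training schedule.
STAGES = [
    {"step": 1, "train": "rpn", "conv_init": "ImageNet", "frozen": set()},
    {"step": 2, "train": "detector", "conv_init": "ImageNet", "frozen": set()},
    {"step": 3, "train": "rpn", "conv_init": "step 2 detector", "frozen": {"conv"}},
    {"step": 4, "train": "detector", "conv_init": "shared (fixed)", "frozen": {"conv"}},
]


def trainable_groups(stage):
    """Return the parameter groups that are updated during this stage."""
    return [g for g in NETWORK_GROUPS[stage["train"]] if g not in stage["frozen"]]


for stage in STAGES:
    print(stage["step"], stage["train"], stage["conv_init"], trainable_groups(stage))
```

Only one network is optimized at each step, and from step 3 onward the convolutional layers are frozen, so both networks end up using the same features.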

