Notes on "Speed/accuracy trade-offs for modern convolutional object detectors"

"Speed/accuracy trade-offs for modern convolutional object detectors" by Huang et al (2017).

The aim of this paper is to evaluate modern convolutional object detection systems in terms of speed and accuracy. The authors separate "meta-architectures" (Faster R-CNN, R-FCN and SSD) from base feature extractors (VGG, Residual Networks, Inception) so that different combinations of the two can be evaluated.


The paper uses three meta-architectures, namely Single Shot Detector (SSD), Faster R-CNN and R-FCN. See the following posts to get familiar with these object detection algorithms. For these meta-architectures, six feature extractors are considered: VGG-16, Resnet-101, Inception v2, Inception v3, Inception Resnet and MobileNet. Other architecture configurations, such as the number of proposals and the output stride settings for Resnet and Inception Resnet, are also set up in a reasonable way (see the detail in the paper :P). Furthermore, the loss function configuration, input size configuration, training and hyperparameter tuning, benchmarking procedure and image size are clearly stated and chosen to make the comparison fair.
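The pairing of meta-architectures with feature extractors described above can be sketched as a simple cross product. This is only an illustration: the names come from these notes, while the dictionary layout and the function name `candidate_configs` are assumptions, and the paper does not necessarily report results for every pairing.

```python
from itertools import product

# Names as listed in these notes; the data layout is a hypothetical choice.
META_ARCHITECTURES = ["SSD", "Faster R-CNN", "R-FCN"]
FEATURE_EXTRACTORS = [
    "VGG-16", "Resnet-101", "Inception v2",
    "Inception v3", "Inception Resnet", "MobileNet",
]


def candidate_configs():
    """Return every (meta-architecture, feature extractor) pairing."""
    return [
        {"meta_architecture": meta, "feature_extractor": extractor}
        for meta, extractor in product(META_ARCHITECTURES, FEATURE_EXTRACTORS)
    ]


configs = candidate_configs()
print(len(configs))  # 3 meta-architectures * 6 feature extractors = 18
```

Each pairing would then be further varied by settings such as the number of proposals, output stride and input size.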

The results of this investigation are reported as mAP (mean average precision) vs GPU wall clock time for each meta-architecture and feature extractor, and as memory vs GPU time for the feature extractors only. They are summarized below:
  1. Critical points on the optimality frontier: (a) SSD models with Inception v2 and MobileNet are the most accurate of the fastest models. (b) Faster R-CNN with dense output Inception Resnet is the most accurate model. (c) The best balance between speed and accuracy seems to be achieved by R-FCN models using Residual Networks.
  2. The effects of adjusting the feature extractor: A strong correlation is found between classification and detection performance for Faster R-CNN and R-FCN, while SSD appears less reliant on the feature extractor's classification performance. SSD is also competitive for large objects but is typically poor for small objects.
  3. The effect of adjusting image size: Decreasing the image resolution lowers the accuracy but also reduces the inference time.
  4. The effect of adjusting the number of proposals: The number of proposals can be decreased while achieving similar performance. Increasing the number of box proposals increases the computational cost, but by how much depends on the feature extractor.
The authors of the paper used an ensemble of these models to win the 2016 COCO challenge. The main contribution is the extensive experimental comparison of some of the main aspects that influence the speed and accuracy of modern object detectors - which may help practitioners working on real world applications.
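To make the idea of an "optimality frontier" in point 1 concrete, here is a minimal sketch that keeps only the models not beaten on both GPU time and mAP by some other model. The `Result` type, the function name and all the numbers are hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    gpu_time_ms: float  # lower is better
    map_score: float    # higher is better


def optimality_frontier(results: List[Result]) -> List[Result]:
    """Keep results that no other result beats on both time and mAP."""
    frontier = []
    best_map = float("-inf")
    # Walk from fastest to slowest; on equal time, visit the higher mAP first.
    for r in sorted(results, key=lambda r: (r.gpu_time_ms, -r.map_score)):
        # A slower model earns its place only by being strictly more accurate.
        if r.map_score > best_map:
            frontier.append(r)
            best_map = r.map_score
    return frontier


# Made-up measurements, purely for illustration.
results = [
    Result("model_a", 30.0, 0.20),
    Result("model_b", 50.0, 0.18),   # slower and less accurate than model_a
    Result("model_c", 90.0, 0.30),
    Result("model_d", 120.0, 0.28),  # slower and less accurate than model_c
    Result("model_e", 400.0, 0.35),
]
frontier_names = [r.name for r in optimality_frontier(results)]
print(frontier_names)  # ['model_a', 'model_c', 'model_e']
```

Models on the frontier are the ones worth considering for a given time budget: the fastest end corresponds to case (a) above, the most accurate end to case (b), and the middle to case (c).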


SSD:
Faster R-CNN:
R-FCN:

Q&A:

Q) YOLO vs regional methods.
SSD is very similar to YOLO in that both use single-shot detection.
