Notes on "Single Shot MultiBox Detector"

February 05, 2018

Notes on Single Shot MultiBox Detector by Liu et al. (2016):

This paper introduces the Single Shot MultiBox Detector (SSD), a feed-forward convolutional neural network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. Compared to YOLO, the algorithm has the following features:

Multi-scale feature maps: Unlike YOLO, which operates on a single feature map, SSD uses multi-scale feature maps; convolutional layers that progressively decrease in size are added after the base network, which allows detection at multiple scales.
Convolutional predictors for detection: Each added feature layer (at a fixed size) produces detection predictions with a set of small convolutional filters; each filter outputs either a class score or a box shape offset, measured relative to a default box at that feature map location.
Default boxes and aspect ratios: A set of default bounding boxes with different aspect ratios is associated with each feature map cell.
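
To make the default boxes more concrete, here is a minimal Python sketch that tiles boxes over square feature maps of decreasing size. The function name `default_boxes`, the three-level `pyramid`, and the scale and aspect ratio values are illustrative assumptions, not the configuration used in the paper; the width and height rule (scale times or divided by the square root of the aspect ratio) follows the paper's description.

```python
import math


def default_boxes(fmap_size, scale, aspect_ratios):
    """Return (cx, cy, w, h) default boxes in [0, 1] image coordinates
    for a square feature map with fmap_size x fmap_size cells."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Centre of the cell, normalised by the feature map size.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Same area for every aspect ratio, different shape.
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes


# Hypothetical pyramid of (feature map size, scale) pairs:
# coarser feature maps are given larger default boxes.
pyramid = [(8, 0.2), (4, 0.45), (2, 0.7)]
ratios = [1.0, 2.0, 0.5]

all_boxes = [b for size, s in pyramid for b in default_boxes(size, s, ratios)]
print(len(all_boxes))  # (64 + 16 + 4) cells * 3 ratios = 252
```

The network then predicts, for every one of these boxes, a score per class and four offsets, so the number of outputs is fixed by the feature map sizes and the number of boxes per cell.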

The implementation details on training SSD are as follows:

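Matching strategy: Ground truth boxes are compared with the default boxes using the Jaccard overlap. Each ground truth box is first matched to the default box with the best overlap; then every default box whose overlap with a ground truth box exceeds 0.5 is also treated as a match. Allowing multiple matches per ground truth box simplifies the learning problem.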
Training objective: The overall objective is a weighted sum of the localization loss (Smooth L1 loss on the offsets to the center, width, and height of the default box) and the confidence loss (softmax loss over the class confidences).
Scales and aspect ratios for the boxes: Rules are presented for choosing the default boxes in terms of scale and aspect ratio (the scales are evenly spaced between a minimum and a maximum across the feature maps). The optimal selection is still an open question.
Hard negative mining: Because so many default boxes are generated, most of them match no ground truth box, which creates a large imbalance between negatives and positives. The negative default boxes are sorted by confidence loss and only the highest ones are kept, so that the ratio between negatives and positives is at most 3:1; this was found to make training more stable.
Data augmentation: To make the model more robust to object sizes and shapes, the training data is augmented by randomly sampling patches from the images.
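
The matching and hard negative mining steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: boxes are given as (xmin, ymin, xmax, ymax) tuples, the helper names `jaccard`, `match`, and `hard_negatives` are hypothetical, and the per-box confidence losses are made-up numbers rather than the output of a real network.

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(defaults, truths, threshold=0.5):
    """For each default box, return the index of its matched ground truth
    box, or None if the default box is a negative."""
    assigned = [None] * len(defaults)
    # Any default box whose best overlap exceeds the threshold is a positive.
    for d, dbox in enumerate(defaults):
        best_t, best_iou = None, threshold
        for t, tbox in enumerate(truths):
            iou = jaccard(dbox, tbox)
            if iou > best_iou:
                best_t, best_iou = t, iou
        assigned[d] = best_t
    # Every ground truth box also claims its single best default box,
    # even if that overlap is below the threshold.
    for t, tbox in enumerate(truths):
        best_d = max(range(len(defaults)), key=lambda d: jaccard(defaults[d], tbox))
        assigned[best_d] = t
    return assigned


def hard_negatives(assigned, conf_loss, ratio=3):
    """Indices of the negatives with the highest confidence loss,
    keeping at most `ratio` negatives per positive."""
    num_pos = sum(a is not None for a in assigned)
    negatives = [i for i, a in enumerate(assigned) if a is None]
    negatives.sort(key=lambda i: conf_loss[i], reverse=True)
    return negatives[: ratio * num_pos]


defaults = [
    (0.0, 0.0, 0.5, 0.5), (0.1, 0.1, 0.6, 0.6), (0.5, 0.5, 1.0, 1.0),
    (0.0, 0.5, 0.5, 1.0), (0.5, 0.0, 1.0, 0.5), (0.25, 0.25, 0.75, 0.75),
]
truths = [(0.0, 0.0, 0.5, 0.6)]
conf_loss = [0.1, 0.2, 0.9, 0.3, 1.5, 0.05]  # made-up per-box losses

assigned = match(defaults, truths)
print(assigned)                                      # [0, 0, None, None, None, None]
print(hard_negatives(assigned, conf_loss, ratio=1))  # [4, 2]
```

Only the positives and the selected negatives would then contribute to the confidence loss; the localization loss is computed on the positives alone.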

Note that VGG-16 pre-trained on the ILSVRC CLS-LOC dataset is used as the base network in this paper. Details of the training (choice of hyperparameters and so on) are given in the paper, and open-source code is available.

The PASCAL VOC2007, PASCAL VOC2012, and COCO data sets are used to compare SSD with various algorithms. SSD is found to be, at least in the experimental setup the authors used, competitive in both accuracy and speed with other state-of-the-art object detection algorithms.

Find the overview below.

Q&A:

Q1) Different overlap coefficients?
A) https://stats.stackexchange.com/questions/238684/what-are-the-difference-between-dice-jaccard-and-overlap-coefficients
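
As a quick illustration of how the three coefficients in that question differ, here is a small Python sketch on plain sets, using the standard set definitions. The function names and the example sets are made up for illustration, and both sets are assumed to be non-empty.

```python
def set_jaccard(a, b):
    """|A n B| / |A u B|"""
    return len(a & b) / len(a | b)


def set_dice(a, b):
    """2 |A n B| / (|A| + |B|)"""
    return 2 * len(a & b) / (len(a) + len(b))


def set_overlap(a, b):
    """|A n B| / min(|A|, |B|)"""
    return len(a & b) / min(len(a), len(b))


a = {1, 2, 3, 4}
b = {3, 4, 5, 6, 7, 8}

print(set_jaccard(a, b))  # 2 / 8  = 0.25
print(set_dice(a, b))     # 4 / 10 = 0.4
print(set_overlap(a, b))  # 2 / 4  = 0.5
```

All three share the same numerator idea (the size of the intersection) and differ only in how they normalise it, so on the same pair of sets the Jaccard value is the smallest and the overlap coefficient the largest.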

Jongseok´s Blog
