Auro Tripathy - Localizing with CNNs
Date posted: 21-Jan-2018
Uploaded by: auro-tripathy
Page 1: Auro tripathy -  Localizing with CNNs

[email protected] 1

How CNNs Localize Objects with Increasing Precision and Speed
Auro Tripathy
May 2017

How do I fine-tune the bounding box? What class?

Page 2: Auro tripathy -  Localizing with CNNs

Outline

•  Terms, concepts, and metrics for detection algorithms
•  Two-stage detectors
   •  Region-based Convolutional Neural Networks (R-CNN)
   •  Fast R-CNN
   •  Faster R-CNN
•  Unified (single-shot) detectors
   •  You Only Look Once (YOLO)
   •  Single-Shot Detector (SSD)

Page 3: Auro tripathy -  Localizing with CNNs

What Is to Classification as Where Is to Detection

“We’re in the midst of an Object Detection Renaissance” – Ross Girshick

What?
✓  Person, Probability=0.7
✓  Dog, Probability=0.8
✓  Horse, Probability=0.8

What & Where?
✓  Person, Location=(x1, y1, w1, h1), Confidence=90%
✓  Dog, Location=(x2, y2, w2, h2), Confidence=80%
✓  Horse, Location=(x3, y3, w3, h3), Confidence=90%

Page 4: Auro tripathy -  Localizing with CNNs

CNN-Based Detection Performance at a Glance
Two-Stage Techniques versus Single-Shot Techniques

[Scatter plot. X-axis: Frames per Sec (fps), 0 to 70; y-axis: mean Average Precision (mAP) VOC, 30 to 80. Points (fps, mAP): (0.02, 58.5), (0.5, 34.3), (0.4, 70), (7, 73.2), (21, 63.2), (58, 77), (19, 80). Methods plotted: Deformable Parts Model, R-CNN, Fast R-CNN, Faster R-CNN, YOLO, SSD300x300, SSD512x512.]

Page 5: Auro tripathy -  Localizing with CNNs

What’s Mean Average Precision (mAP)?

Precision = TP / (TP + FP)        Recall = TP / (TP + FN)

1. Compute the Average Precision of each class in your test set
2. Then take the mean of these per-class average precisions to get the mean Average Precision (mAP)

High precision corresponds to few false positives; high recall corresponds to few false negatives
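The definitions above translate directly into code. A minimal sketch with hypothetical TP/FP/FN tallies (a full mAP computation would also sweep a confidence threshold to trace out each class's precision-recall curve):

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): of all positive predictions, how many were right."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall = TP / (TP + FN): of all actual positives, how many were found."""
    return tp / (tp + fn)

def mean_average_precision(per_class_ap):
    """mAP is the mean of the per-class average precisions."""
    return sum(per_class_ap) / len(per_class_ap)

# Hypothetical tallies for one class: 8 true positives, 2 false positives,
# 4 false negatives (objects the detector missed).
p = precision(tp=8, fp=2)    # 0.8  -> few false positives
r = recall(tp=8, fn=4)       # ~0.667 -> some objects were missed
m = mean_average_precision([0.7, 0.8, 0.8])
```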

Page 6: Auro tripathy -  Localizing with CNNs

Region-based CNN (R-CNN) Kick-started Detection

[Speed-versus-accuracy scatter plot repeated from page 4, here highlighting R-CNN.]

Page 7: Auro tripathy -  Localizing with CNNs

Region-Based CNN (R-CNN)

[Pipeline diagram] Image → region proposal generator (~2000 regions) → CNN feature extractor, run per region → CNN output feature vector → linear SVM classifier per region (Airplane: no; Dog: yes; TV monitor: no) and bounding-box regressor

Source: Rich feature hierarchies for accurate object detection and semantic segmentation, Tech report (v5)

Page 8: Auro tripathy -  Localizing with CNNs

Using CNNs Broke New Ground; The Downside – High Workloads for Training and Testing

•  Training is a three-stage disjoint pipeline:
   1.  Fine-tune a CNN on region proposals using log loss
   2.  Fit SVMs (acting as object detectors) to the CNN features, replacing softmax
   3.  Learn to regress bounding boxes with squared (L2) loss
•  Relies on an external region-proposal algorithm
•  No sharing of computation among the ~2000 region proposals (the CNN runs separately on each)
•  The volume of data mandates storing intermediate features on disk

http://videolectures.net/iccv2015_girshick_fast_r_cnn/

Page 9: Auro tripathy -  Localizing with CNNs

What’s Bounding-Box Regression? Learn a Transformation W that Maps Proposal P to Ground Truth G

[Diagram: ground-truth box G and proposal box P, with center (x, y), width w, height h, and transformation d(P)]

d★(P) = W★ᵀ ϕ5(P), where ★ is one of x, y, w, h and ϕ5(P) are the pool5 features

The transformation d(P) is parameterized as four functions: dx(P), dy(P), dw(P), dh(P)

dx, dy are translations of the center of P’s bounding box; dw, dh are log-space translations of the width and height of P

We learn W by minimizing a standard least-squares objective with ridge-regression regularization
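The parameterization above can be sketched numerically. This follows the slide's scheme (center translations scaled by the proposal size, log-space width/height changes); the box values are made up for illustration:

```python
import math

def regression_targets(P, G):
    """R-CNN-style regression targets mapping proposal P to ground truth G.
    Boxes are (center_x, center_y, width, height)."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    tx = (gx - px) / pw        # center shift, scaled by proposal size
    ty = (gy - py) / ph
    tw = math.log(gw / pw)     # log-space width change
    th = math.log(gh / ph)     # log-space height change
    return (tx, ty, tw, th)

def apply_regression(P, d):
    """Apply predicted offsets d = (dx, dy, dw, dh) to proposal P."""
    px, py, pw, ph = P
    dx, dy, dw, dh = d
    return (pw * dx + px, ph * dy + py, pw * math.exp(dw), ph * math.exp(dh))

P = (50.0, 50.0, 20.0, 40.0)       # hypothetical proposal
G = (54.0, 48.0, 30.0, 36.0)       # hypothetical ground truth
t = regression_targets(P, G)
recovered = apply_regression(P, t)  # ~= G, up to floating-point error
```

At training time the network learns to predict t from the pool5 features; at test time the predicted offsets are applied to each proposal as in `apply_regression`.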

Page 10: Auro tripathy -  Localizing with CNNs


Learn to Only Regress Proposals that are “Nearby” to Ground Truth with Intersection over Union

IoU Threshold = 0.9 IoU Threshold = 0.7 IoU Threshold = 0.6

Used only if the Intersection over Union (IoU) between the predicted box and the ground truth box is greater than a threshold

https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/object_localization_and_detection.html
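IoU itself is a short computation. A minimal sketch with corner-format boxes (x1, y1, x2, y2) and made-up coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    iw = max(0.0, ix2 - ix1)    # zero if the boxes do not overlap
    ih = max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A proposal is used for regression only when IoU exceeds the threshold.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25/175 ~= 0.14 -> rejected at 0.6
print(iou((0, 0, 10, 10), (1, 1, 11, 11)))  # 81/119 ~= 0.68 -> accepted at 0.6
```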

Page 11: Auro tripathy -  Localizing with CNNs

Fast R-CNN Improved Detection with Single-Stage Training

[Speed-versus-accuracy scatter plot repeated from page 4, here highlighting Fast R-CNN.]

Page 12: Auro tripathy -  Localizing with CNNs

Fast R-CNN

•  Runs the CNN over the entire image instead of per region proposal
•  Shares convolution layers across proposals
•  Continues to use external region proposals
   •  Projects region proposals onto Conv5 of VGG16
•  Simultaneously predicts classes and bounding boxes via joint training
•  The network is designed with a classification “head” and a regression “head”

How do I fine-tune the bounding box? What class?
https://clipartfest.com/download/fb2cd25bdefb07cc8eb8cd28091ab62ea3519461.html

Page 13: Auro tripathy -  Localizing with CNNs

Fast R-CNN

[Architecture diagram] Conv1 … Conv5 → RoI projection (for each of the ~2000 regions from the region proposal generator) → RoI pooling layer → fully connected layers (FC6 + FC7) → 1024-dimension RoI feature vector → two FC heads: class probability and bounding-box prediction

Page 14: Auro tripathy -  Localizing with CNNs

Fast R-CNN – Forward and Back-Prop Paths Using a Multi-Task Loss

[Diagram] Conv1 … Conv5 → RoI projection (for each region from the region proposal generator, ~2000 regions) → RoI pooling layer (making regions FC-compatible) → FC layers → two heads: a linear + softmax layer for class probabilities and a linear layer for the bounding-box regressor; trained with log loss + smooth L1 loss, with forward and back-prop paths through both heads

https://andrewliao11.github.io/object_detection/faster_rcnn/

Page 15: Auro tripathy -  Localizing with CNNs

Multi-Task Loss = Log Loss + Smooth-L1 Loss

Loss_multitask = Loss_classification + λ * Loss_bounding-box-regression

Loss = −log(probability of true class u) + λ * Σ Smooth-L1(predicted offset − ground-truth regression target)

Smooth-L1(x) = 0.5x² if |x| < 1, |x| − 0.5 otherwise

Smooth-L1 loss is less sensitive to outliers than L2 loss
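The piecewise definition translates directly; the offsets below are hypothetical:

```python
def smooth_l1(x):
    """Smooth-L1: quadratic near zero, linear beyond |x| = 1,
    so large residuals (outliers) are penalized less than under L2."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def bbox_regression_loss(predicted, target):
    """Sum of Smooth-L1 over the four offsets (x, y, w, h)."""
    return sum(smooth_l1(p - t) for p, t in zip(predicted, target))

print(smooth_l1(0.5))  # 0.125 (quadratic region)
print(smooth_l1(3.0))  # 2.5   (linear region; 0.5 * L2 would give 4.5)
```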

Page 16: Auro tripathy -  Localizing with CNNs

Introduce the Region-of-Interest (RoI) Pooling Layer, for Compatibility with the Fully-Connected Layers Above

•  An RoI is a rectangular window into the feature map, given as (r, c, h, w)
•  Divide it into an H x W grid of sub-windows (e.g., 7x7)
•  Each sub-window is roughly h/H x w/W
•  Max-pool the values in each sub-window into the corresponding output grid cell
•  Back-propagation routes derivatives through the RoI pooling layer

[Diagram: feature map with an RoI anchored at (r, c), of height h and width w, divided into h/H x w/W sub-windows]
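A toy sketch of RoI max-pooling in plain Python. The integer sub-window assignment here is a simplification (the real layer also handles fractional bins and the feature map's spatial scale):

```python
def roi_max_pool(feature_map, roi, H=2, W=2):
    """Max-pool an RoI (r, c, h, w) of a 2-D feature map into a fixed
    H x W grid, so regions of any size feed the same FC layers."""
    r, c, h, w = roi
    out = [[float("-inf")] * W for _ in range(H)]
    for i in range(h):
        for j in range(w):
            gi = min(i * H // h, H - 1)  # which ~h/H x w/W sub-window
            gj = min(j * W // w, W - 1)  # this cell falls into
            out[gi][gj] = max(out[gi][gj], feature_map[r + i][c + j])
    return out

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
# A 4x4 RoI at the origin, pooled to 2x2: the max of each quadrant.
print(roi_max_pool(fmap, (0, 0, 4, 4)))  # [[6, 8], [14, 16]]
```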

Page 17: Auro tripathy -  Localizing with CNNs

Benefits of Fast R-CNN over R-CNN

•  Higher mAP than R-CNN
•  Training is single-stage, using a multi-task loss
•  Training can update all network layers
•  No disk storage is required for feature caching

Page 18: Auro tripathy -  Localizing with CNNs

Faster R-CNN Subsumes Region Proposals

[Speed-versus-accuracy scatter plot repeated from page 4, here highlighting Faster R-CNN.]

Page 19: Auro tripathy -  Localizing with CNNs

Faster R-CNN

•  Replaces external object proposals with a Region Proposal Network (RPN)
•  The RPN reuses the CNN features for object proposals
•  The RPN shares convolutions with the detection side of the network
   •  Big benefit: the marginal cost of computing proposals becomes small
•  Reuses the previously covered Fast R-CNN for detection
•  The training regime alternates between:
   •  First, fine-tuning for the region-proposal task
   •  Then, fine-tuning for object detection, keeping the proposals fixed

Page 20: Auro tripathy -  Localizing with CNNs

Novel “Anchor” Boxes Serve as References at Multiple Scales and Aspect Ratios

•  Pyramids of scaled images: feature maps are built for each scaled image, and the classifier is run at all scales
•  Pyramids of filters: multiple filters of different scales and sizes are run on the feature map
•  Pyramids of reference boxes (✓ new): “anchor” boxes in the regression functions serve as references at multiple scales and aspect ratios, on a single feature map

Page 21: Auro tripathy -  Localizing with CNNs

Region Proposal Network Training Classifies “Objectness” and Regresses Bounding Boxes

[Diagram] Conv1 … Conv5 → sliding window over the feature map → a 256-dimension vector for each anchor at each location → two sibling outputs: 2k class scores (object or background) and 4k box proposals (x, y, w, h), with k = 9

•  k = 9 “anchor” boxes per location, covering three scales (128, 256, 512) and three aspect ratios (2:1, 1:1, 1:2)
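The k = 9 anchor construction can be sketched as follows. The width/height formula (preserving area s² while fixing the height-to-width ratio) is a common convention and an assumption here, not spelled out on the slide:

```python
import math

def anchor_boxes(center_x, center_y,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """k = 9 anchors (3 scales x 3 aspect ratios) centered on one
    sliding-window position. r is the height-to-width ratio; each
    anchor covers roughly area s*s."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)   # keep w*h ~= s*s while h/w = r
            h = s * math.sqrt(r)
            anchors.append((center_x, center_y, w, h))
    return anchors

anchors = anchor_boxes(300, 300)
print(len(anchors))  # 9 -> the RPN emits 2*9 scores and 4*9 offsets here
```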

Page 22: Auro tripathy -  Localizing with CNNs

Step 1 – Train the RPN, Initialized with ImageNet Weights, to Output Region Proposals

[Diagram] Conv1 … Conv5 → RPN layers (FC, linear + softmax for objectness, bounding-box regressor) → RPN proposals; fine-tuned end-to-end from ImageNet weights

https://andrewliao11.github.io/object_detection/faster_rcnn/

Page 23: Auro tripathy -  Localizing with CNNs

Step 2 – Train Fast R-CNN with the Region Proposals Learned in Step 1

[Diagram] Conv1 … Conv5 → FC layers (linear + softmax, bounding-box regressor) → object class probabilities; takes the RPN proposals learned in Step 1 as input; fine-tuned end-to-end from ImageNet weights

Page 24: Auro tripathy -  Localizing with CNNs

Step 3 – Initialize the RPN from the Model Trained in Step 2 and Train the RPN Again

[Diagram] Conv1 … Conv5 (weights shared from Step 2 but locked, i.e., updates prevented) → RPN layers (FC, linear + softmax, bounding-box regressor) → RPN proposals

Page 25: Auro tripathy -  Localizing with CNNs

Step 4 – Fine-Tune the FC Layers of Fast R-CNN Using the Shared Convolution Weights from Step 3

[Diagram] Conv1 … Conv5 (weights shared from Step 3 but locked, i.e., updates prevented) → FC layers (linear + softmax, bounding-box regressor) → object class probabilities; takes the RPN proposals learned in Step 3 as input; only the layers unique to Fast R-CNN are fine-tuned

Page 26: Auro tripathy -  Localizing with CNNs

You Only Look Once (YOLO) Uses One Network, Runs Fast

[Speed-versus-accuracy scatter plot repeated from page 4, here highlighting YOLO.]

Page 27: Auro tripathy -  Localizing with CNNs

You-Only-Look-Once (YOLO): Do Away with Dual Networks (RPN + Classifier), Use a Single Network

•  Divide the image into an S x S grid of cells, with S = 7
•  Within each cell, predict:
   1.  B = 2 bounding boxes
   2.  C = 20 class probabilities
•  Each bounding box predicts 5 parameters: x, y, width, height, confidence
   •  x, y is the center of the box relative to the grid cell
•  Class probabilities are conditional (conditioned on the grid cell containing an object)
•  Output of the network: S * S * (5 * B + C) = 7 * 7 * (5 * 2 + 20) = 1470 values

[Diagram: per-cell bounding boxes + confidences, and per-cell class probability map]
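The output-size arithmetic above, written out:

```python
def yolo_output_size(S=7, B=2, C=20):
    """Each of the S*S grid cells predicts B boxes, each with 5 parameters
    (x, y, width, height, confidence), plus C conditional class
    probabilities shared by the cell."""
    return S * S * (5 * B + C)

print(yolo_output_size())  # 7 * 7 * (5 * 2 + 20) = 1470
```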

Page 28: Auro tripathy -  Localizing with CNNs

YOLO – Very Fast Direct Prediction Using a CNN
Output: S * S * (5 * B + C) = 7 * 7 * (5 * 2 + 20) = 1470 values

[Architecture diagram, 448 x 448 x 3 input:
•  Conv 7x7x64-s-2, MaxPool 2x2-s-2 → 112 x 112 x 192
•  Conv 3x3x192, MaxPool 2x2-s-2 → 56 x 56 x 256
•  Convs 1x1x128, 3x3x256, 1x1x256, 3x3x512, MaxPool 2x2-s-2 → 28 x 28 x 512
•  Convs (1x1x256, 3x3x512) x 4, 1x1x512, 3x3x1024, MaxPool 2x2-s-2 → 14 x 14 x 1024
•  Convs (1x1x512, 3x3x1024) x 2, 3x3x1024, 3x3x1024-s-2 → 7 x 7 x 1024
•  Convs 3x3x1024, 3x3x1024 → 7 x 7 x 1024
•  Fully connected layer → 4096
•  Fully connected layer → 7 x 7 x (5 * 2 + 20)]

Page 29: Auro tripathy -  Localizing with CNNs

YOLO’s 1x1 Convolutions Reduce Parameters, Run Fast
Simple Example Shows Parameters Reduced from 4860 to 1440

[Diagram, direct path: input feature map w x h x 18 → 3x3 kernels → output feature map w x h x 30; parameter size = 18 x (3 x 3) x 30 = 4860]

[Diagram, bottleneck path: input feature map w x h x 18 → 1x1 kernels → w x h x 5 (parameter size = 18 x (1 x 1) x 5 = 90) → 3x3 kernels → output feature map w x h x 30 (parameter size = 5 x (3 x 3) x 30 = 1350); total parameter size = 90 + 1350 = 1440]
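The slide's parameter arithmetic checks out in code (bias terms omitted, as in the slide's counts):

```python
def conv_params(c_in, k, c_out):
    """Weight count of a k x k convolution from c_in to c_out channels."""
    return c_in * (k * k) * c_out

# Direct 3x3 convolution: 18 channels -> 30 channels.
direct = conv_params(18, 3, 30)                             # 18 * 9 * 30 = 4860

# 1x1 bottleneck to 5 channels first, then the 3x3 convolution.
bottleneck = conv_params(18, 1, 5) + conv_params(5, 3, 30)  # 90 + 1350 = 1440

print(direct, bottleneck)  # 4860 1440 -> roughly 3.4x fewer parameters
```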

Page 30: Auro tripathy -  Localizing with CNNs

Non-Maximal Suppression via Intersection-over-Union

•  The confidence score reflects the Intersection-over-Union (IoU) between the predicted box and the ground truth
•  IoU = Intersection Area / Union Area

[Diagram: overlapping predicted and ground-truth boxes]
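Greedy non-maximal suppression built on IoU can be sketched as follows; the box format (x1, y1, x2, y2), scores, and threshold are illustrative:

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, threshold=0.5):
    """Keep the highest-scoring box, drop any box overlapping it by more
    than `threshold` IoU, and repeat on the survivors."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 (IoU ~0.68 with box 0) is suppressed
```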

Page 31: Auro tripathy -  Localizing with CNNs


•  “[YOLO] struggles with small objects that appear in groups, such as flocks of birds.”

•  “[YOLO] struggles to generalize to objects in new or unusual aspect ratios or configurations.”

•  “YOLO struggles to localize objects correctly.”

Limitations of YOLO

You Only Look Once: Unified, Real-Time Object Detection Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi

Page 32: Auro tripathy -  Localizing with CNNs

YOLOv2 Catches Up to SSD and Provides Tradeoffs Between Speed and Accuracy

[Scatter plot. X-axis: fps, 0 to 100; y-axis: mAP VOC, 60 to 90. Points (fps, mAP): YOLOv2 288x288 (91, 69), 352x352 (81, 73.7), 416x416 (67, 76.8), 480x480 (59, 77.8), 544x544 (40, 78.6); also plotted: YOLOv1 448x448 (21, 63.2) and SSD512x512 (19, 80).]

Source: YOLO9000: Better, Faster, Stronger – Joseph Redmon, Ali Farhadi, University of Washington, Allen Institute for AI

Page 33: Auro tripathy -  Localizing with CNNs

Single-Shot Detector (SSD): Faster than YOLO and as Accurate as Faster R-CNN

[Speed-versus-accuracy scatter plot repeated from page 4, here highlighting SSD300x300 and SSD512x512.]

Page 34: Auro tripathy -  Localizing with CNNs

Uses Default Boxes at Multiple Aspect Ratios and Scales

•  Six default boxes at each feature-map cell
   •  Similar to anchor boxes in Faster R-CNN
•  Six aspect ratios: { 1, 2, 3, 1/2, 1/3 } aspect-ratio boxes, plus 1 extra box with an aspect ratio of 1
•  In a convolutional fashion, the six default boxes are evaluated at each location in feature maps of different scales (e.g., 8 x 8 and 4 x 4)

[Diagram: default boxes overlaid on an 8x8 and a 4x4 feature map]

Page 35: Auro tripathy -  Localizing with CNNs

Single-Shot Detector Uses Feature Maps at Different Scales and Concatenates Them All at the Last Layer

[Diagram: a 19x19 feature map, followed by a stride-2 convolution producing a 10x10 feature map; each scale emits multiclass scores and bounding-box regressions, with forward and back-prop paths through both heads]

“…, by utilizing feature maps from several different layers in a single network for prediction we can mimic the same effect, while also sharing parameters across all object scales.”

Page 36: Auro tripathy -  Localizing with CNNs

SSD – Six Progressively Smaller Layers Concatenated

[Architecture diagram, 300 x 300 x 3 input, VGG-16 through the Pool5 layer:
•  Conv4_3: 38 x 38 x 512; 3 default boxes (to reduce computation); detections/class = 38 * 38 * 3
•  Conv6 (FC): 19 x 19 x 1024
•  Conv7 (FC): 19 x 19 x 1024; 6 default boxes; detections/class = 19 * 19 * 6
•  Conv8_2: 10 x 10 x 512; 6 default boxes; detections/class = 10 * 10 * 6
•  Conv9_2: 5 x 5 x 256; 6 default boxes; detections/class = 5 * 5 * 6
•  Conv10_2: 3 x 3 x 256; 6 default boxes; detections/class = 3 * 3 * 6
•  Pool 11: 1 x 1 x 256; 6 default boxes; detections/class = 1 * 1 * 6
Detections from all layers are concatenated (total detections/class: 7308) and passed through non-maximum suppression]
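The per-class detection total on this slide can be verified from the per-layer grid sizes and default-box counts:

```python
# (layer name, feature-map side, default boxes per cell), as on the slide.
layers = [("Conv4_3",  38, 3),   # only 3 boxes here, to reduce computation
          ("Conv7",    19, 6),
          ("Conv8_2",  10, 6),
          ("Conv9_2",   5, 6),
          ("Conv10_2",  3, 6),
          ("Pool11",    1, 6)]

total = sum(side * side * boxes for _, side, boxes in layers)
print(total)  # 7308 detections per class, concatenated before NMS
```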

Page 37: Auro tripathy -  Localizing with CNNs

SSD Has Many Tools that Progressively Improve mAP

•  Data augmentation (scaling and cropping) adds 6.7% mAP
•  Additionally, using lower feature maps (Conv4_3) for prediction adds 4% mAP
•  Use a variety of default-box shapes, similar to Faster R-CNN anchor boxes
   •  { 1, 2, 3, 1/2, 1/3 } aspect-ratio boxes, plus 1 box with an aspect ratio of 1
   •  The {2, 1/2, 3, 1/3} aspect ratios contribute 2.9% mAP
•  Use the atrous version of VGG16 (adds 0.7% mAP)
•  Use hard negative mining to balance the ratio of positive to negative samples

Page 38: Auro tripathy -  Localizing with CNNs

Summary

•  Single-shot methods are faster than two-stage methods
•  Single-shot mAP is comparable to Faster R-CNN, the best two-stage method
•  SSD is faster than YOLO and just as accurate as Faster R-CNN
•  YOLOv2 provides tradeoffs between speed and accuracy
•  The building blocks of the detection algorithms presented here can lead to higher precision and recall, i.e., more innovations to come

Page 39: Auro tripathy -  Localizing with CNNs

Links to Seminal Resources

Technique      Resource
R-CNN          Rich feature hierarchies for accurate object detection and semantic segmentation, Tech report (v5)
Fast R-CNN     Fast R-CNN
Faster R-CNN   Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
YOLO           You Only Look Once: Unified, Real-Time Object Detection
YOLOv2         YOLO9000: Better, Faster, Stronger
SSD            SSD: Single Shot MultiBox Detector