Learning Less is More - 6D Camera Localization via 3D Surface Regression
Eric Brachmann and Carsten Rother
Heidelberg University (HCI/IWR)
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 647769)
Code and trained models:
Problem Statement
Estimate the 6D camera pose (position + orientation) relative to a known scene from a single RGB image.
We show that learning less is more. See the figure (panels: Input, Estimated Camera Pose):
- Red: Learn everything. CNN predicts pose directly.
- Orange: Learn two components of a geometric pipeline. Our previous work [Bra17].
- Cyan: Learn one component of a geometric pipeline. This work.
- Green: Ground truth camera path.
Contributions
• Fully differentiable, robust pose optimization without learnable parameters on top of learned scene coordinate regression
• Learning scene coordinate regression without a 3D scene model or depth maps
• Stable end-to-end training due to a new approximation of refinement gradients, and controlling the entropy of pose hypotheses
• We exceed the state of the art in camera localization on three datasets (indoor and outdoor)
Previous Work: Differentiable RANSAC (DSAC) [Bra17]
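[Figure: DSAC pipeline. Input RGB -> Scene Coordinate Regression ($\mathbf{y}$) [Sho13] -> Hypothesis Sampling ($\mathbf{h}_1, \dots, \mathbf{h}_4$) -> Scoring ($s$), illustrated with the reprojection errors of $\mathbf{h}_2$ -> Hypothesis Selection ($\hat{\mathbf{h}}$) -> Refinement ($\mathbf{R}$). Learnable parameters are labeled $\mathbf{w}$ and $\mathbf{v}$.]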
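Probabilistic Pose Selection
$\hat{\mathbf{h}}^{\mathbf{w}} = \mathbf{h}_j^{\mathbf{w}}$, where $j \sim P(j \mid \mathbf{w})$
$P(j \mid \mathbf{w}) = \exp\big(s(\mathbf{h}_j^{\mathbf{w}})\big) \,/\, \sum_k \exp\big(s(\mathbf{h}_k^{\mathbf{w}})\big)$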
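The selection rule above can be illustrated with a minimal sketch: a softmax over hypothesis scores, followed by sampling an index rather than taking the best score. The function names and the example scores are hypothetical and not taken from the authors' code.

```python
import math
import random

def selection_probabilities(scores):
    """Softmax over hypothesis scores: P(j) = exp(s_j) / sum_k exp(s_k)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_hypothesis(scores, rng):
    """Draw an index j ~ P(j) instead of deterministically taking the argmax."""
    probs = selection_probabilities(scores)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]

# Hypothetical scores s(h_1), ..., s(h_4)
scores = [1.0, 3.0, 2.0, 0.5]
probs = selection_probabilities(scores)
j = select_hypothesis(scores, random.Random(0))
```

Sampling keeps every hypothesis reachable with non-zero probability, which is what makes the expectation in the learning objective well defined.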
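DSAC Learning Objective
$\frac{\partial}{\partial \mathbf{w}} \mathbb{E}_{j \sim P(j \mid \mathbf{w})}\big[\ell(\mathbf{R}(\mathbf{h}_j^{\mathbf{w}}, \mathbf{w}), \mathbf{h}^*)\big] = \mathbb{E}_{j \sim P(j \mid \mathbf{w})}\big[\ell(\cdot)\, \frac{\partial}{\partial \mathbf{w}} \log P(j \mid \mathbf{w}) + \frac{\partial}{\partial \mathbf{w}} \ell(\cdot)\big]$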
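The identity behind this objective can be checked numerically on a toy problem. The sketch below assumes a scalar parameter w, scores that are linear in w, and quadratic losses; these toy choices and all names are hypothetical and only illustrate the formula, not the actual training code.

```python
import math

A = [0.5, -1.0, 2.0]   # hypothetical score slopes: s_j(w) = A[j] * w
C = [0.0, 1.0, -0.5]   # hypothetical loss centers: l_j(w) = (w - C[j])**2

def probs(w):
    """Softmax selection probabilities P(j | w) for the toy scores."""
    scores = [a * w for a in A]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def expected_loss(w):
    """E_{j ~ P(j|w)} [ l_j(w) ]."""
    return sum(p * (w - c) ** 2 for p, c in zip(probs(w), C))

def expected_loss_grad(w):
    """E_j [ l_j * d/dw log P(j|w) + d/dw l_j ], the right-hand side."""
    p = probs(w)
    mean_slope = sum(pk * ak for pk, ak in zip(p, A))
    grad = 0.0
    for pj, aj, cj in zip(p, A, C):
        loss = (w - cj) ** 2
        dlogp = aj - mean_slope   # d/dw log P(j|w) for a softmax of A[j] * w
        dloss = 2.0 * (w - cj)    # d/dw l_j(w)
        grad += pj * (loss * dlogp + dloss)
    return grad

w0, eps = 0.3, 1e-6
numeric = (expected_loss(w0 + eps) - expected_loss(w0 - eps)) / (2 * eps)
analytic = expected_loss_grad(w0)
```

The central finite difference and the analytic expression should agree closely, since the expectation here is a finite sum and the identity is exact.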
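Updated Pipeline
[Figure: updated pipeline with hypotheses $\mathbf{h}_1, \dots, \mathbf{h}_4$, scores $s(\mathbf{h}_1), \dots, s(\mathbf{h}_4)$, the reprojection errors of $\mathbf{h}_2$, and the selected hypothesis $\hat{\mathbf{h}}$. Legend: Learned / Not learned but differentiable. Annotated changes: Fully Convolutional Network Architecture (image sizes 640x480 and 80x60), Learning w/o a 3D model, Soft Inlier Count, Entropy Control, Differentiable Refinement.]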
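Learning without a 3D Scene Model
Training scene coordinate regression in 3 stages:
1) Assume constant depth $d$: $\min_{\mathbf{w}} \sum_i \| \mathbf{y}_i(\mathbf{w}) - \mathbf{y}_i^* \|$, with $\mathbf{y}_i^* = \mathbf{h}^* \left( \frac{d x_i}{f}, \frac{d y_i}{f}, d, 1 \right)^T$
2) Optimize reprojection error: $\min_{\mathbf{w}} \sum_i \| C \mathbf{h}^{*-1} \mathbf{y}_i(\mathbf{w}) - \mathbf{p}_i \|$
3) End-to-end training (DSAC)
[Figure: Input RGB, Ground Truth Scene Coordinates, and the predicted scene coordinates after stages 1), 2), and 3).]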
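The stage 1 targets can be sketched as follows. The sketch assumes that $\mathbf{h}^*$ is a 4x4 camera-to-scene transform (as implied by the use of $\mathbf{h}^{-1}$ in the reprojection error), that $x_i, y_i$ are pixel coordinates relative to the principal point, and that $f$ is the focal length in pixels. The function name and the example pose are hypothetical.

```python
def heuristic_scene_coordinate(h_star, x, y, f, d):
    """Compute y* = h* (d*x/f, d*y/f, d, 1)^T for one pixel.

    h_star: 4x4 camera-to-scene transform as nested lists (ground truth pose).
    x, y:   pixel position relative to the principal point (assumption).
    f:      focal length in pixels.
    d:      assumed constant depth.
    """
    # Back-project the pixel to a 3D point at depth d in camera coordinates.
    cam_point = [d * x / f, d * y / f, d, 1.0]
    # Transform to scene coordinates; keep the first three rows only.
    return [sum(h_star[r][c] * cam_point[c] for c in range(4)) for r in range(3)]

# Hypothetical pose: identity rotation, camera shifted 1 unit along scene x.
h_star = [[1.0, 0.0, 0.0, 1.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]
target = heuristic_scene_coordinate(h_star, x=100.0, y=-50.0, f=500.0, d=10.0)
```

These targets only need the ground truth pose and the camera intrinsics, which is why no 3D model or depth map is required for initialization.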
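Hypothesis Score: Soft Inlier Count
Reprojection Error: $r_i(\mathbf{h}, \mathbf{w}) = \| C \mathbf{h}^{-1} \mathbf{y}_i(\mathbf{w}) - \mathbf{p}_i \|$
Inlier Count: $s(\mathbf{h}) = \sum_i \mathbb{1}\big(\tau - r_i(\mathbf{h}, \mathbf{w})\big)$ (not differentiable)
Soft Inlier Count: $s(\mathbf{h}) = \sum_i \mathrm{sig}\big(\tau - \beta r_i(\mathbf{h}, \mathbf{w})\big)$ (differentiable)
Previously: learned $s(\mathbf{h})$, which is hard to regularize and overfits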
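A minimal sketch of the two scores, assuming the reprojection errors have already been computed for one hypothesis. The residual values, the threshold, and the choice of beta are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def inlier_count(residuals, tau):
    """Hard count: number of reprojection errors below the threshold tau."""
    return sum(1 for r in residuals if r < tau)

def soft_inlier_count(residuals, tau, beta):
    """Soft count: sum_i sig(tau - beta * r_i), following the formula above."""
    return sum(sigmoid(tau - beta * r) for r in residuals)

# Hypothetical reprojection errors in pixels for one pose hypothesis.
residuals = [0.5, 2.0, 9.0, 40.0]
hard = inlier_count(residuals, tau=10.0)
soft = soft_inlier_count(residuals, tau=10.0, beta=1.0)
```

Unlike the hard count, the soft count changes smoothly when a residual moves, so gradients can flow back to the scene coordinate predictions.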
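Hypothesis Score: Entropy Control
Hypothesis Distribution: $P(j \mid \mathbf{w}, \alpha) = \exp\big(\alpha s(\mathbf{h}_j, \mathbf{w})\big) \,/\, \sum_k \exp\big(\alpha s(\mathbf{h}_k, \mathbf{w})\big)$
[Figure: three example distributions over the scores $s(\mathbf{h}_1), \dots, s(\mathbf{h}_4)$.]
- High Entropy: no separation, unstable
- Low Entropy: few learning signals, unstable
- Medium Entropy: rich learning signals, stable
Entropy: $S(\alpha) = -\sum_j P(j \mid \mathbf{w}, \alpha) \log P(j \mid \mathbf{w}, \alpha)$
Keep target entropy $S^*$ during training by adjusting $\alpha$: $\operatorname{argmin}_\alpha |S(\alpha) - S^*|$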
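One way to adjust $\alpha$ toward a target entropy is sketched below. The poster only states the objective; bisection is an assumption made here for illustration, relying on the fact that the softmax entropy does not increase as $\alpha \ge 0$ grows. The scores, the target value, and the search interval are hypothetical.

```python
import math

def softmax(scores, alpha):
    """P(j | alpha) = exp(alpha * s_j) / sum_k exp(alpha * s_k)."""
    scaled = [alpha * s for s in scores]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(scores, alpha):
    """S(alpha) = -sum_j P(j) log P(j); zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in softmax(scores, alpha) if p > 0.0)

def fit_alpha(scores, target_entropy, alpha_hi=100.0, iters=60):
    """Bisection on alpha in [0, alpha_hi] (assumed search interval)."""
    lo, hi = 0.0, alpha_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(scores, mid) > target_entropy:
            lo = mid   # distribution still too flat: sharpen it
        else:
            hi = mid   # distribution too peaked: flatten it
    return 0.5 * (lo + hi)

scores = [12.0, 30.0, 25.0, 8.0]   # hypothetical soft inlier counts
alpha = fit_alpha(scores, target_entropy=1.0)
achieved = entropy(scores, alpha)
```

The target must lie between 0 and the entropy of the uniform distribution (log of the number of hypotheses) for a solution to exist in the search interval.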
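Differentiable Refinement
Refinement $\mathbf{R}$ optimizes reprojection errors $\mathbf{r}_{\mathcal{I}}$ of inlier set $\mathcal{I}$: $\mathbf{R}(\mathbf{h}) = \operatorname{argmin}_{\mathbf{h}'} \| \mathbf{r}_{\mathcal{I}}(\mathbf{h}', \mathbf{w}) \|^2$
Gauss-Newton update step: $\mathbf{R}^{t+1} = \mathbf{R}^{t} - (J_{\mathbf{r}}^T J_{\mathbf{r}})^{-1} J_{\mathbf{r}}^T \mathbf{r}_{\mathcal{I}}(\mathbf{R}^{t}, \mathbf{w})$
Last update: $\mathbf{R}(\mathbf{h}) = \mathbf{h}^{O} - (J_{\mathbf{r}}^T J_{\mathbf{r}})^{-1} J_{\mathbf{r}}^T \mathbf{r}_{\mathcal{I}}(\mathbf{h}^{O}, \mathbf{w})$, with $\mathbf{h}^{O} = \mathbf{R}^{t=\infty}(\mathbf{h})$
Gradient approximation: $\frac{\partial}{\partial \mathbf{w}} \mathbf{R}(\mathbf{h}) \approx -(J_{\mathbf{r}}^T J_{\mathbf{r}})^{-1} J_{\mathbf{r}}^T \frac{\partial}{\partial \mathbf{w}} \mathbf{r}_{\mathcal{I}}(\mathbf{h}^{O}, \mathbf{w})$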
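The Gauss-Newton iteration and the gradient approximation can be tried on a scalar toy problem. The sketch below assumes a one-dimensional "pose" h and a scalar parameter w with residuals $r_i(h, w) = e^{a_i h} - w^{a_i}$, chosen so that the optimum is $h = \log w$. This toy problem and all names are hypothetical; a real pose has six degrees of freedom.

```python
import math

A = [1.0, 0.5, 1.5]   # hypothetical exponents of the toy residuals

def residuals(h, w):
    """r_i(h, w) = exp(a_i * h) - w**a_i; all zero at h = log(w)."""
    return [math.exp(a * h) - w ** a for a in A]

def jacobian_h(h):
    """J_r: derivative of each residual with respect to h."""
    return [a * math.exp(a * h) for a in A]

def dresiduals_dw(w):
    """Derivative of each residual with respect to w at fixed h."""
    return [-a * w ** (a - 1.0) for a in A]

def refine(h, w, iters=50):
    """Gauss-Newton: h <- h - (J^T J)^-1 J^T r, scalar case."""
    for _ in range(iters):
        J, r = jacobian_h(h), residuals(h, w)
        JtJ = sum(j * j for j in J)
        Jtr = sum(j * ri for j, ri in zip(J, r))
        h = h - Jtr / JtJ
    return h

def refine_grad_w(h_opt, w):
    """Approximation: dR/dw ~ -(J^T J)^-1 J^T dr/dw, evaluated at h_opt."""
    J = jacobian_h(h_opt)
    JtJ = sum(j * j for j in J)
    Jt_drdw = sum(j * d for j, d in zip(J, dresiduals_dw(w)))
    return -Jt_drdw / JtJ

w0, eps = 2.0, 1e-6
h_opt = refine(0.0, w0)
approx = refine_grad_w(h_opt, w0)
numeric = (refine(0.0, w0 + eps) - refine(0.0, w0 - eps)) / (2 * eps)
```

In this toy problem the residuals vanish at the optimum, so the approximation matches the finite-difference gradient of the full iteration; in general it treats the last update as a single linearized step and neglects higher-order terms.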
Results
7Scenes [Sho13] Results (Error < 5cm, 5°)
Method | % Correct Test Frames
ORB+PNP [Sho13] | 38.6%
DSAC (w/ 3D Model) | 55.9%
Our (w/o 3D Model) | 60.4%
DSAC (w/ Depth) | 62.5%
Our (w/ 3D Model) | 76.1%
Cambridge Landmarks [Ken15] Results
Avg. Median Err. | w/ 3D Model | w/o 3D Model
PoseNet [Ken17] | 1.43m, 2.9° | 1.63m, 2.8°
Active Search [Sat16] | 0.29m, 0.6° | -
DSAC [Bra17] | 0.31m, 0.8° | -
Our | 0.14m, 0.3° | 0.19m, 0.5°
[Figure: qualitative results for DSAC (w/ 3D Model), Our (w/ 3D Model), and Our (w/o 3D Model).]
[Sho13] “Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images”, Shotton et al., CVPR 2013
[Ken15] “PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization”, Kendall et al., ICCV 2015
[Ken17] “Geometric Loss Functions for Camera Pose Regression with Deep Learning”, Kendall and Cipolla, CVPR 2017
[Sat16] “Efficient & Effective Prioritized Matching for Large-Scale Image-Based Localization”, Sattler et al., PAMI 2016
[Bra17] “DSAC - Differentiable RANSAC for Camera Localization”, Brachmann et al., CVPR 2017
[Figure: Training Set (2 Images Total) and Test Image, comparing Estimation with Learned Score (DSAC) against Estimation with Soft Inlier Count (Our). Panels: Estimated Camera Poses and 3D Model Overlay for DSAC and Our.]