Devin White1, Sophie Voisin1,
Christopher Davis1, Andrew Hardin1,
Jeremy Archuleta2, David Eberius3
1Scalable and High Performance Geocomputation Team
Geographic Information Science and Technology Group
2Data Architectures Team
Computational Data Analytics Group
Oak Ridge National Laboratory
3Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science
University of Tennessee – Knoxville
GTC 2016 – April 5, 2016
A Fully-Automated High Performance
Geolocation Improvement Workflow for
Problematic Imaging Systems
Managed by UT-Battelle for the Department of Energy
Outline
Project background
System overview
Scientific foundation
Technological solution
Current system performance
Background
Overhead imaging systems (spaceborne and airborne) can vary substantially in their geopositioning accuracy
The sponsor wanted an automated, near-real-time geocoordinate correction capability at ground processing nodes upstream of their entire user community
The extensible, automated solution uses well-established photogrammetric, computer vision, and high performance computing techniques to reduce risk and uncertainty
Robust multi-year advanced R&D portfolio aimed at continually improving the system through science, engineering, software, and hardware innovation
We are moving towards on-board processing:
– Satellites
– Manned Aircraft
– Unmanned Aerial Systems
Isn’t This a Solved Problem?
Systemic constraints:
– Space
– Power
– Quality/reliability of components
– Subject matter expertise
– Time
– Budget
– Politics
Operational constraints:
– Collection conditions
– Sensor and platform health
– Existing software quality and performance
– System independence
Many of these issues are greatly amplified on UAS platforms
Sponsor Requirements
Solution must:
– Be completely automated
– Be government-owned and based on open source/GOTS code
– Be sensor agnostic by leveraging the Community Sensor Model framework
– Be standards-based (NITF, OGC, etc.) to enable interoperability
– Clearly communicate the quantified level of uncertainty using standard methods
– Be multithreaded and hardware accelerated
– Construct RPC and RSM replacement sensor models as well as generate SENSRB/GLAS and BLOCKA tagged record extensions (TREs)
– Improve geolocation accuracy to within a specific value
– Complete a run within a specific amount of time
The first sensor supported is one of the sponsor’s most important, but also its most problematic
Technical Approach (General)
1. Ingest and preprocessing
2. Trusted source selection
3. Global localization (coarse alignment, in ground space)
4. Image registration to generate GCPs (fine alignment, in image space)
5. Sensor model resection and uncertainty propagation
6. Generation and export of new and improved metadata
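The six steps above can be sketched as a simple chain of stage functions. Every name and value below is illustrative only (the actual pipeline is C++/CUDA), but it shows how each stage's output feeds the next:

```python
# Illustrative sketch of the six-stage PRIMUS flow.
# All function names, keys, and values are hypothetical, not the real API.

def ingest_and_preprocess(nitf):           # 1. decode and normalize the input
    return {"image": nitf, "meta": {}}

def select_trusted_sources(scene):         # 2. control imagery + elevation mosaic
    scene["control"] = "control-mosaic"
    return scene

def global_localization(scene):            # 3. coarse alignment, in ground space
    scene["coarse_shift"] = (12, -7)       # pixels; illustrative value
    return scene

def register(scene):                       # 4. fine alignment -> GCP tiepoints
    scene["gcps"] = [((10, 20), (22, 13))]
    return scene

def resect(scene):                         # 5. adjust sensor model, propagate uncertainty
    scene["sensor_model"] = "adjusted"
    return scene

def export_metadata(scene):                # 6. new RPC/RSM models and TREs
    return {"nitf": scene["image"], "model": scene["sensor_model"]}

def run_pipeline(nitf):
    scene = ingest_and_preprocess(nitf)
    for stage in (select_trusted_sources, global_localization, register, resect):
        scene = stage(scene)
    return export_metadata(scene)
```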
PRIMUS Pipeline
Photogrammetric Registration of Imagery from Manned and Unmanned Systems

PRIMUS: Input NITF → Preprocessing → Source Selection → Global Localization → Registration → Resection → Metadata → Output NITF
R2D2: Controlled Sources → Reprojection → Orthorectification → Mosaicking
Stages run as a mix of CPU and GPU implementations.

Core Libraries:
• NITRO (Glycerin)
• GDAL
• Proj.4
• libpq (Postgres)
• OpenCV
• CUDA
• OpenMP
• CSM
• MSP
Source Selection
Find and assemble trusted control imagery and elevation data that cover the spatial extent of an image.
Input: image
Output: elevation and imagery mosaics
Mosaic Generation
Start → Create bounding box → Grow bounding box (to 150%) → Query R2D2's DB (returns image paths) → Read images from disk → Mosaic imagery / Create (elevation + geoid) mosaic
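The bounding-box growth step can be sketched in a few lines. The function name and the 150% default are illustrative of the flow above, not the production code:

```python
def grow_bbox(min_x, min_y, max_x, max_y, factor=1.5):
    """Grow a bounding box about its center by `factor` (150% by default),
    so the control-data query covers slack around the image footprint.
    Illustrative sketch; not the actual PRIMUS/R2D2 code."""
    cx, cy = (min_x + max_x) / 2.0, (min_y + max_y) / 2.0
    half_w = (max_x - min_x) / 2.0 * factor
    half_h = (max_y - min_y) / 2.0 * factor
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)
```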
System Hardware
CPU/GPU hybrid architecture
– 12 Dell C4130 HPC nodes
– Each node has:
  • 48 logical processors
  • 256 GB of RAM
  • Dual high-speed SSDs
  • 4 Tesla K80s
– Virtual Machine option
A Note on Virtualization
We ran VMware on one of our nodes with mixed results
We were able to access one GPU on that node through a VM using PCI passthrough, but the other seven remained unavailable due to VMware software limitations
VMware, GPU, and OS resource requirements limited us to two VMs per node, which is not very helpful
We greatly appreciate the technical assistance NVIDIA provided as we conducted this experiment
Verdict: It’s still a little too early for virtualization to be really useful for high-density compute nodes with multiple GPUs
Orthorectification Process
Begin → Create bounding box → Grow bounding box → Query R2D2's DB (returns image paths) → Read images from disk → Create (elevation + geoid) mosaic → Orthorectify source image → on to Control Selection and Global Localization
Orthorectification Solution
Accelerate portions of our OpenMP-enabled code with GPUs using CUDA
– Sensor Model calculations
– Band Interpolation calculations
Optimize both of the CUDA kernels and their associated memory operations
Create in-house Transverse Mercator CUDA device functions
Combine the Sensor Model and Band Interpolation kernels
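As a rough illustration of what the two fused kernels compute per output pixel (a sensor-model projection followed by band interpolation), here is a CPU sketch. The caller-supplied `ground_to_image` callback is a hypothetical stand-in for the sensor model; the real implementation is a single optimized CUDA kernel:

```python
import numpy as np

def orthorectify(src, ground_to_image, out_shape):
    """Inverse-mapping orthorectification sketch: for each output ground
    cell, the sensor model maps (row, col) to a fractional image
    coordinate, then the band is bilinearly interpolated. The production
    code fuses these two steps into one CUDA kernel; this is a sketch."""
    out = np.zeros(out_shape, dtype=src.dtype)
    rows, cols = out_shape
    for r in range(rows):
        for c in range(cols):
            x, y = ground_to_image(r, c)            # sensor-model step
            if 0 <= x < src.shape[1] - 1 and 0 <= y < src.shape[0] - 1:
                x0, y0 = int(x), int(y)             # band-interpolation step
                fx, fy = x - x0, y - y0
                out[r, c] = ((1 - fx) * (1 - fy) * src[y0, x0]
                             + fx * (1 - fy) * src[y0, x0 + 1]
                             + (1 - fx) * fy * src[y0 + 1, x0]
                             + fx * fy * src[y0 + 1, x0 + 1])
    return out
```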
Orthorectification Optimized
Orthorectification Performance
• JPEG2000-compressed commercial image pair (36,000 x 30,000 pixels each)
• GPU-enabled RPC orthorectification to UTM
• Each image is processed in 8 seconds, using one eighth of a single node's horsepower
• 65,000,000,000 pixels per minute per node, running on multiple nodes
• That includes building HAE terrain models on the fly from tiled global sources
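A back-of-the-envelope check of the throughput figure, under the stated assumptions (8-second runs, each image using one eighth of a node, so eight images can run concurrently per node):

```python
# Arithmetic behind the ~65 billion pixels/minute/node figure.
pixels_per_image = 36_000 * 30_000      # 1.08e9 pixels
seconds_per_image = 8
concurrent_images = 8                   # each run uses 1/8 of a node
per_node_per_second = pixels_per_image / seconds_per_image * concurrent_images
per_node_per_minute = per_node_per_second * 60   # 6.48e10, i.e. ~65 billion
```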
Global Localization - Coarse Adjustment
Roughly determine where source (S) and control (C) images match.
Adjust the sensor model.
Triage step in the pipeline.
Input: source and control images
Output: coarse sensor model adjustments
Computation - Solution Space
Solution space: each possible shift of the source (S) within the control (C), searched exhaustively
Solution: a similarity coefficient between the source and the control sub-image at each shift
Similarity Metric
Normalized Mutual Information
Histogram with masked areas:
– Missing data
– Artifacts
– Homogeneous areas
Source image and mask: N_S x M_S pixels
Control image and mask: N_C x M_C pixels
Solution space: n x m NMI coefficients

NMI = (H_S + H_C) / H_J
H = -Σ_{i=0}^{k} p(i) log2 p(i)
where H is the entropy and p(i) the probability density function,
with k = 255 for S and C, and k = 65535 for the joint histogram J
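A minimal NumPy sketch of the metric as defined above (8-bit marginal histograms, a 65,536-bin joint histogram, masked pixels excluded). It mirrors the math, not the CUDA kernels:

```python
import numpy as np

def entropy(hist):
    """Shannon entropy in bits of a histogram; empty bins contribute nothing."""
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def nmi(source, control, mask=None):
    """NMI = (H_S + H_C) / H_J over co-registered 8-bit images.
    Pixels where mask == 0 (missing data, artifacts) are excluded.
    Sketch of the metric only, not the GPU implementation."""
    s, c = source.ravel(), control.ravel()
    if mask is not None:
        keep = mask.ravel() != 0
        s, c = s[keep], c[keep]
    h_s = entropy(np.bincount(s, minlength=256))
    h_c = entropy(np.bincount(c, minlength=256))
    joint = np.bincount(s.astype(np.int64) * 256 + c, minlength=65536)
    return (h_s + h_c) / entropy(joint)
```

For identical images H_J equals H_S, so the metric reaches its maximum value of 2.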
Visual Example
Histogram computation (for normalized mutual information)
– NVIDIA samples: Histogram64, Histogram256
– Literature: joint histograms of 80 x 80 bins
– Our problem: a (joint) Histogram65536, computed n x m times over N_S x M_S data
Kernel families
How to leverage the GPU to compute one solution (one joint histogram, 65,536 bins):
– 1 kernel per NMI computation
  Pros: shared memory to piecewise fill the histogram
  Cons: atomicAdd; syncthreads for reduction; a CPU call for each solution
– 1 block per NMI computation (K1, K2)
  Pros: shared memory to piecewise fill the histogram; 1 kernel evaluates all solutions
  Cons: atomicAdd; syncthreads for reduction
– 1 thread per NMI computation (K3, K4, K5)
  Pros: read-only global memory access; no atomicAdd; no syncthreads; 1 kernel evaluates all solutions
  Cons: 264,192-byte stack frame per thread
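The exhaustive search itself, independent of the kernel mapping, amounts to one similarity evaluation per candidate shift. In the sketch below, the `similarity` callback stands in for the NMI computation, and each (dy, dx) pair corresponds to one "solution" (one thread in the K3 to K5 mapping):

```python
import numpy as np

def best_shift(source, control, similarity):
    """Exhaustive-search sketch: score every placement of the source inside
    the larger control image and return the best shift plus all scores.
    `similarity` takes two equal-sized arrays and returns a float."""
    ns, ms = source.shape
    n = control.shape[0] - ns + 1
    m = control.shape[1] - ms + 1
    scores = np.empty((n, m))
    for dy in range(n):                  # each (dy, dx) is one solution
        for dx in range(m):
            scores[dy, dx] = similarity(source, control[dy:dy + ns, dx:dx + ms])
    dy, dx = np.unravel_index(scores.argmax(), scores.shape)
    return (int(dy), int(dx)), scores
```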
Kernel details

Kernels                  K1        K2        K3         K4         K5
occupancy                100%      100%      100%       100%       100%
threads / block          128       256       128        128        128
stack frame (bytes)      2048      1024      264192     264192     264192
total memory / block     0.26 MB   0.26 MB   33.81 MB   33.81 MB   33.81 MB
total memory / SM        4.19 MB   4.19 MB   541.06 MB  541.06 MB  541.06 MB
total memory / GPU       0.54 GB   0.54 GB   7.03 GB    7.03 GB    7.03 GB
memory %                 0.47%     0.47%     61.06%     61.06%     61.06%
spill stores – loads     0 – 0     0 – 0     0 – 0      0 – 0      0 – 0
registers                33        34        27         26         29
smem / block (bytes)     3072      3072      0          0          0
smem / SM (bytes)        49152     49152     0          0          0
smem %                   42.86%    42.86%    0.00%      0.00%      0.00%
cmem[0] – cmem[2]        448 – 20  448 – 20  448 – 20   448 – 20   448 – 20

K1, K2 (1 solution / block): partial entropy, atomicAdd, synchronization
K3 (1 solution / thread): 2D index for the joint histogram
K4 (1 solution / thread): 1D index for the joint histogram
K5 (1 solution / thread): 1D index for the joint histogram, no if condition for the mask
Kernel Timings with Respect to Solution Space
[Chart: time in seconds (0 to 100) vs. number of solutions (0 to ~61,440) for kernels K1 to K5, each with a 0% mask (K*) and a 50% mask (K); an inset covers 0 to 2.5 seconds over the first 10,000 solutions.
Source images: 512 x 256. Control images: 512 x 256 (1 solution) up to 991 x 383 (61,440 solutions).]
Summary for Global Localization
Global Localization provides a coarse adjustment of the sensor model
– Problem: a joint histogram must be computed for each solution
  No compromise on the number of bins (65,536)
  Exhaustive search
– Solution: leverage the K80's specifications
  12 GB of memory
  1 thread per solution
Less than 25 seconds for 61K solutions on a 131K-pixel image
Registration - Fine Adjustment
Account for the coarse resolution of Global Localization when matching source (S) to control (C).
Control (X, Y)         Descriptor
(152.511, 148.398)     (123, 122, …, 56)
(101.124, 88.6674)     (164, 45, …, 165)
⁞                      ⁞

Source (X, Y)          Descriptor
(157.511, 153.398)     (123, 122, …, 56)
(106.124, 93.6674)     (164, 45, …, 165)
⁞                      ⁞
Registration Workflows
Option "Match": detect keypoints and compute descriptors independently in the source and control images, then match the two descriptor lists with a metric to produce the tiepoint list.
Option "Detect From": detect keypoints in the source image only, then locate each one in the control image by evaluating a similarity metric over a search window.
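The "Match" option reduces to nearest-neighbour matching of descriptor vectors. A hypothetical brute-force sketch (the pipeline itself uses OpenCV and in-house GPU matchers; all names here are illustrative):

```python
import numpy as np

def match_descriptors(src_desc, ctl_desc, src_xy, ctl_xy):
    """Brute-force "Match" sketch: pair each source descriptor with its
    nearest control descriptor (L2 distance), yielding a tiepoint list of
    (source (x, y), control (x, y)) pairs. Illustrative only."""
    tiepoints = []
    for i, d in enumerate(src_desc):
        dists = np.linalg.norm(ctl_desc - d, axis=1)  # distance to every control descriptor
        j = int(dists.argmin())
        tiepoints.append((src_xy[i], ctl_xy[j]))
    return tiepoints
```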
OpenCV Library
Leverage OpenCV 2.4.11 ("~" as on the original slide marks options lacking a GPU implementation in 2.4.11; unmarked entries have both CPU and GPU implementations; "*" marks the in-house intensity-based option)
Detectors: BRISK~, DENSE~, FAST, GFTT (w/wo HARRIS)~, MSER~, ORB (HARRIS/FAST), SIFT~, SIMPLEBLOB~, STAR (CenSurE)~, SURF
Descriptors: BRIEF~, BRISK~, FREAK~, INTENSITY*, ORB (HARRIS/FAST), SIFT~, SURF
Matchers: BRUTEFORCE (option "Match"), FLANN~ (option "Match"), INTENSITY-based* (option "Detect From")
OpenCV limitation(s)
OpenCV 2.4.11:
– For the current source image, for each keypoint: point to the associated template/descriptor, point to the associated image/collection of descriptors, call the GPU function to compute the metric, and find the best match
– The keypoints and their associated templates/images are managed outside the GPU call; each template/image pair locks the GPU for the duration of its call
In-house:
– For the current source image, a single GPU call finds the best match for all keypoints, given the descriptor definition and the metric
– The keypoints and their associated templates/images are managed by the GPU call; all template/image pairs are processed within that one call
Visual comparison
What is the difference?
– OpenCV 2.4.11: the CPU manages the pointers to the sub-images for each keypoint; block and thread organization follows from the per-keypoint calls
– In-house: a single call lets the GPU manage the block and thread organization, with no per-keypoint pointer bookkeeping on the CPU
Back to NMI as Similarity Metric
Normalized Mutual Information
Small "images" but numerous keypoints:
– Up to 65,536 keypoints with the GPU SURF detector
– Descriptors: 11 x 11 intensity values
– Search windows: 73 x 73 pixels in the control sub-image
– Solution spaces: 63 x 63 NMI coefficients

NMI = (H_S + H_C) / H_J
H = -Σ_{i=0}^{k} p(i) log2 p(i)
where H is the entropy and p(i) the probability density function,
with k = 255 for S and C, and k = 65535 for the joint histogram J
Kernel details
Basic kernel (K1):
– Finds the best match for all keypoints: 1 block per keypoint
– Optimized for the 63 x 63 search window: 64 threads per block (1 idle); each thread computes a "row" of solutions
– Limited to 1 joint histogram per block: loop over the entire histogram to compute the entropy
Optimized kernel (K2):
– Sparse joint histogram: 65,536 bins but only 121 occupied values
– Leverages the 11 x 11 descriptor size:
  Create 2 lists (length 121) of intensity values: indices for the source and for the corresponding control subset
  Update the joint histogram counts from the lists
  Loop over the lists to retrieve each aggregate count
  Set an aggregate count to 0 after its first retrieval
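The K2 idea, that an 11 x 11 descriptor pair can occupy at most 121 of the 65,536 joint-histogram bins, means the joint entropy can be computed from the list of intensity pairs alone, never materializing the full histogram. A pure-Python sketch of that list-based aggregation (illustrative, not the CUDA kernel):

```python
import math

def sparse_joint_entropy(src_vals, ctl_vals):
    """Joint entropy (bits) from paired intensity lists. Each (source,
    control) pair maps to one of 65,536 joint bins, but with 121 pixels
    there are at most 121 occupied bins, so we aggregate counts by
    looping over the list instead of over the full histogram."""
    pairs = [s * 256 + c for s, c in zip(src_vals, ctl_vals)]  # joint bin ids
    n = len(pairs)
    seen, counts = [], []
    for p in pairs:                    # aggregate each bin the first time it is met
        if p not in seen:
            seen.append(p)
            counts.append(pairs.count(p))
    return -sum((c / n) * math.log2(c / n) for c in counts)
```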
Kernel Timings with Respect to Number of Keypoints
[Chart: time in seconds (0 to 400) vs. number of keypoints (0 to ~60,000) for kernels K1 and K2; an inset covers 0 to 35 seconds over the first 10,000 keypoints, with an annotated value of 17.272 seconds.]
Summary for Registration
Registration refines the adjustment of the sensor model
– Problem: a joint histogram must be computed for each solution
  No compromise on the number of bins (65,536)
  Exhaustive search
– Solution: leverage the K80's specifications
  12 GB of memory
  1 block per solution
  Leverage the number of distinct descriptor values: 121 (maximum) << 65,536
Less than 100 seconds for 65K keypoints (260M NMI coefficients)
About 10K keypoints in less than 20 seconds
PRIMUS Pipeline Timings

Dataset   Global Localization source   Solution space   Registration source
D1        200 x 131                    6834             3600 x 2674
D2        258 x 67                     4250             4571 x 1555
D3        259 x 88                     5980             4725 x 1607
D4        318 x 92                     5745             5745 x 1954
PRIMUS Pipeline Timings
[Charts: total time in seconds (0 to 40) for datasets D1 to D4, broken down into Source Images, Source Selection, Global Localization, Registration, Resection, and Misc; a companion chart shows each module's percentage of total time per dataset.]
Questions?