SLAM Accelerated: Using Hardware to improve SLAM algorithm performance
Project Overview
Team Members: Roy Lycke, Ji Li, Ryan Hamor
Take existing SLAM algorithm and implement on computer
Analyze Performance of algorithm to determine kernels to be accelerated in HW
Implement SLAM algorithm on PowerPC with previously identified kernels in HW
RH
What is SLAM?
SLAM stands for Simultaneous Localization and Mapping
Predict pose using previous and current data
  Types of pose sensors: Wheel Encoders, GPS
Detect landmarks and correlate them to the robot using the predicted pose
  Types of observation sensors: Sonar, Infrared, Laser Scanners, Video
RH
Current State of SLAM Algorithms
SLAM algorithms fall into two main categories
Extended Kalman Filter
  Large covariance matrix to process
Particle Filter
  Each particle contains a pose estimate and a map
RH
Particle Filter Algorithm
RH
What we have Decided to do
RH
Started with existing SLAM implementation
  ratbot-slam, developed by Kris Beevers
  ratbot-slam uses a particle filter algorithm and multiple observation scans, using just wheel encoders and 5 IR sensors
We modified ratbot-slam to use log files taken from radish.sourceforge.net
Ratbot-slam Modifications
RH
Create new observation function using laser scans vs. original IR sensors.
Modify motion model to use dead-reckoned odometry
Demo of Modified ratbot-slam
RH
Profile of Modified Code
RL
Areas that can be Accelerated
RL
Decided to accelerate the predict step, which includes:
  motion_model_deadreck
  gaussian_pose
Estimated maximum speedup: 39% reduction in run time, or 1.64x
Why not squared_distance_point_segment?
  Least understood of the algorithms we could accelerate
  If we had more time we would have developed this
Function Acceleration
RL
Design Decisions
Fixed or Floating Point?
  Fixed point: implementation done in fixed point
  Resources required to do floating point were significantly heavier
Heavily Pipeline or Create Predict Stage for each particle?
  Heavily pipelined
  Data is serially loaded through load and save function to co-processor
  It would take too many resources to implement predict stages in parallel for each particle
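To illustrate the fixed-point choice above, here is a minimal sketch of a fixed-point multiply. The slides do not state the number format, so the signed Q16.16 layout and the names `q16_t`, `q16_from_int`, and `q16_mul` are assumptions made for illustration; they are not the `fixed_t` and `fp_mul` definitions used in ratbot-slam.

```c
#include <stdint.h>

/* Assumption: signed Q16.16 (16 integer bits, 16 fractional bits).
 * The actual format behind fixed_t is not given in the slides. */
typedef int32_t q16_t;
#define Q16_FRAC_BITS 16
#define Q16_ONE ((q16_t)1 << Q16_FRAC_BITS)

/* Convert a small integer to Q16.16. */
q16_t q16_from_int(int32_t v)
{
    return (q16_t)(v * Q16_ONE);
}

/* Multiply two Q16.16 values. The product of two Q16.16 numbers has
 * 32 fractional bits, so it is formed in 64 bits and scaled back by
 * dividing by 2^16 (truncating toward zero). */
q16_t q16_mul(q16_t a, q16_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    return (q16_t)(wide / Q16_ONE);
}
```

In hardware the same operation is an integer multiplier followed by a fixed bit selection, which is one reason a fixed-point datapath tends to need fewer FPGA resources than a floating-point one.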
Top Level Design
RL
Motion Model C-Code
RH
MotionModel Data Flow
RH
MotionModel Data Flow
RH
MotionModel HDL Stats
RH
Gaussian Pose
JL
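void gaussian_pose(const pose_t *mean, const cov3_t *cov, pose_t *sample)
{
    /* Draw each pose component independently; stddev is the square root
       of the matching diagonal covariance entry. */
    sample->x = gaussian(mean->x, fp_sqrt(cov->xx));
    sample->y = gaussian(mean->y, fp_sqrt(cov->yy));
    sample->t = gaussian(mean->t, fp_sqrt(cov->tt));
}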
Gaussian Pose
JL
fixed_t gaussian(fixed_t mean, fixed_t stddev)
{
    static int cached = 0;
    static fixed_t extra;
    static fixed_t a, b, c, t;

    /* Each pass through the loop below yields two samples; the second
       one is cached and returned on the next call. */
    if(cached) {
        cached = 0;
        return fp_mul(extra, stddev) + mean;
    }

    // pick random point in unit circle
    do {
        a = fp_mul(fp_2, fp_rand_0_1()) - fp_1;
        b = fp_mul(fp_2, fp_rand_0_1()) - fp_1;
        c = fp_mul(a,a) + fp_mul(b,b);
    } while(c > fp_1 || c == 0);

    /* scale factor looked up from a precomputed table indexed by c */
    t = pgm_read_fixed(&unit_gaussian_table[c >> unit_gaussian_shift]);
    extra = fp_mul(t, a);
    cached = 1;
    return fp_mul(fp_mul(t, b), stddev) + mean;
}
Parallelism & Acceleration Techniques
JL
Parallelism
  The gaussian_pose function consists of three gaussian function calls
  Each gaussian function can be separated into two parts
Acceleration Techniques
  Pipeline
  Multi-thread
Top Level Diagram of gaussian_Pose
JL
Random Number Generator
JL
Xorshift random number generators were developed. They generate the next number in their sequence by repeatedly taking the exclusive or (XOR) of a number with a bit-shifted version of itself.
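The shift-and-XOR step described above can be sketched in a few lines of C. This is a minimal 32-bit example using the commonly cited shift triple 13, 17, 5 from Marsaglia's xorshift family; the word width, the shift constants, and the name `xorshift32` are assumptions, since the slides do not say which variant was built in hardware.

```c
#include <stdint.h>

/* One step of a 32-bit xorshift generator.
 * Assumption: shift triple (13, 17, 5); the project's actual constants
 * are not given in the slides.
 * The state must be seeded with a nonzero value, because a zero state
 * stays zero forever. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;   /* XOR with a left-shifted copy  */
    x ^= x >> 17;   /* XOR with a right-shifted copy */
    x ^= x << 5;    /* XOR with a left-shifted copy  */
    *state = x;
    return x;
}
```

Each step uses only shifts and XORs, with no multipliers or dividers, which makes this kind of generator a natural fit for FPGA logic.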
Random_Number_Manager
JL
Gaussian Entity
JL
Demo of FPGA System
RL
Timing Analysis of Original System
RL
Timing analysis was performed via run-time clock counts and print statements to minicom
Sections of code timed include: Predict Step, Multiscan Feature Extraction and Data Association Step, & Filter Health Evaluation and Re-sample Step
The Predict Step was implemented on the FPGA for acceleration
Initial timing analysis:
Operation                   Average Runtime (microseconds)   Present in percentage of runs
Predict Step - Original     107,502                          100%
Multiscan Step - Original   2,487,969                        2.17%
Filter Step - Original      3,394                            2.17%
Timing Analysis of Accelerated System
RL
Timing analysis for accelerated implementation was performed in same manner as original implementation
Results shown along with original timing analysis
From the data collected, the Predict Step was accelerated by 88%
Operation                      Average Runtime (microseconds)   Present in percentage of runs
Predict Step - Original        107,502                          100%
Multiscan Step - Original      2,487,969                        2.17%
Filter Step - Original         3,394                            2.17%
Predict Step - Accelerated     12,784                           100%
Multiscan Step - Accelerated   1,982,950                        1.94%
Filter Step - Accelerated      13,291                           1.94%
Result Analysis
RL
With the Predict Step accelerated by 88.108%, the overall system is accelerated by: 34% = 39% x 88%
Result is a reliable and sizable acceleration to the system execution time
Analysis of other components
  Multiscan Step accelerated by 20.29%
  Filter Step slowed by 74.46% (measured against the accelerated runtime: 3,394 vs. 13,291 microseconds)
  Differences may be due to different values generated by FPGA implementation vs. original implementation
    Both implementations use random values
    More accurate values may lead to longer calculation in other components
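The overall figure on this slide can be checked with Amdahl-style accounting. The sketch below assumes, as the slides do, that the Predict Step makes up 39% of the original run time and takes the measured Predict Step runtimes (107,502 and 12,784 microseconds) as inputs. The function names `runtime_saved` and `overall_speedup` are hypothetical helpers, not part of the project code.

```c
/* Fraction of total run time removed when a part that accounts for
 * `fraction` of the original run time becomes `part_speedup` times
 * faster. */
double runtime_saved(double fraction, double part_speedup)
{
    return fraction * (1.0 - 1.0 / part_speedup);
}

/* Whole-system speedup factor that corresponds to that saving. */
double overall_speedup(double fraction, double part_speedup)
{
    return 1.0 / (1.0 - runtime_saved(fraction, part_speedup));
}
```

Calling `runtime_saved(0.39, 107502.0 / 12784.0)` gives about 0.34, matching the 34% = 39% x 88% on the slide, and `overall_speedup` with the same arguments gives roughly 1.5x. Letting `part_speedup` grow very large approaches the ideal case of 39% saved, or about 1.64x.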
Difficulties with Project Implementation
RL
Networking issues
  Data transfer: differences between PowerPC and Linux
Limitations of FPGA
  Unpredictable execution halting
  Lack of resource libraries
    Timing performed with specialized Xilinx library
    Code needed to be modified to run
PC vs. FPGA Environment
  Output file format is different
Issue figuring out how to add multiple files to custom IP
Conclusions
RH
Based on the run-time analysis of our implementation of the accelerated SLAM algorithm, an appreciable speedup was achieved.
Our implementation achieved a speedup of approximately 34% or 1.51x out of an ideal 39% or 1.64x
This result shows that if more of the SLAM algorithm were implemented on an FPGA, there could be a greater acceleration.
The top issue in SLAM implementations is getting the algorithms implemented on embedded real-time systems
Future Directions
RL
Add more regions of the algorithm to the FPGA acceleration
  Current implementation only accelerates 39% of system
Run SLAM system on different FPGA
  FPGAs with more robust processors may overcome some of the limitations our implementation faced
Run different SLAM algorithm
  Current implementation is a particle filter algorithm; a Kalman filter algorithm would be next
Load data onto board rather than using PC interaction
  Load data via memory card
  Perform single data load and perform memory management on the FPGA
References
RL
1. Durrant-Whyte, Bailey, “Simultaneous Localization and Mapping: Part 1”, IEEE Robotics and Automation Magazine, June 2006, pg 99-108
2. Durrant-Whyte, Bailey, “Simultaneous Localization and Mapping: Part 2”, IEEE Robotics and Automation Magazine, September 2006, pg 108-117
3. Bonato, Peron, Wolf, Holanda, Marques, Cardoso, “An FPGA Implementation for a Kalman Filter with Application to Mobile Robotics”, Industrial Embedded Systems, 2007, pg 148-155
4. Bonato, Marques, Constantinides, “A Floating-point Extended Kalman Filter Implementation for Autonomous Mobile Robots”, Field Programmable Logic and Applications, 2007, pg 576-579
5. Beevers K.R., Huang, W.H., “SLAM with Sparse Sensing”, Robotics and Automation 2006, pg 2285-2290
Questions?
RL