Debugging and Optimizing RC Applications
Seth Koehler
John Curreri
Presentation Overview
• Introduction
• Background
  – Reconfigurable computing (RC) applications
  – Debug
  – Performance analysis
• Project overview and details
• ReCAP framework & tool
  – Special features
  – HLL-based debug & performance analysis
  – Case studies
• Conclusions
Introduction
• Debugging and optimization are an integral part of application development
  – Typically at the end of the development cycle (after the formulation and design phases)
  – Designers often spend longer debugging an application than designing it!*
• Optimization is often just left for a later version, if ever
  – Every optimization made must re-pass through the debug phase
• To improve productivity in application design, it is critical to address debug and optimization

* Debugging FPGA systems - ftp://ftp.altera.com/outgoing/download/education/events/highspeed/Tek_ALTERAFPGADEBUG_IPIntegration_final.pdf

[Figure: development phases — Formulation → Design → Translation → Execution]
Background – RC Applications
• Why reconfigurable computing (RC)?
  – General-purpose architectures can be wasteful in terms of performance and power
  – Impractical to have an ASIC for every application
• RC ~= FPGAs (Field-Programmable Gate Arrays)
  – Application-specific hardware and parallelism
  – Retain flexibility and programmability
• RC applications typically employ CPUs and FPGAs
  – Leverage strengths of both types of processors
  – Potential for higher performance using less power
  – Programmed using Hardware Description Languages (HDLs) or High-Level Languages (HLLs)
  – CPU is programmed with whatever conventional HLL is desired (C, C++, MPI, UPC, etc.)
• System and application complexity can make it difficult to achieve a correct, well-performing application
Background - Debug
• Debug: to detect and remove errors from a program*
• Debugging methods
  – Stare at code: at least it helps you "wrap your mind around your code"
  – Insert printf statements: requires some good guessing; can be tedious if more than a few printf's are needed
  – Use a debugger (e.g., gdb): much better – instant access to all data and support for indicating where/why a program crashed
  – Use a simulator: can provide more flexibility and information than a debugger, but simulators can be inaccurate and slow, not to mention hard to build
  – Write assertions: best – the application designer documents situations that are impossible; formal and dynamic verification methods check whether the assertions hold

* http://dictionary.reference.com
Background – Performance Analysis
• Performance analysis: investigate program behavior using information gathered during execution*
• Aids the designer in locating and remedying application bottlenecks, reducing guesswork in optimization
• Replaces tedious, error-prone manual analysis methods (timing routines and printf statements)
Optimization and Performance Analysis

[Figure: two optimization flows. Unassisted optimization flow: Application → Analyze (Manually) → "???" → Optimize → Optimized Application. Performance analysis optimization flow: Application → Instrument → Instrumented Application → Measure (on target system) → Performance Data → Analyze (Automatically) / Present → Visualizations / Potential Bottlenecks → Optimize → Optimized Application]
* http://en.wikipedia.org/wiki/Performance_analysis
Project Overview
• RC systems and applications are even more complex than in HPC
  – Heterogeneous components
  – Hierarchy of parallelism among components
  – Lack of visibility inside RC devices
• Optimizing applications is crucial for effective use of these systems
  – Debug and performance tools are relied on heavily in HPC to productively verify and optimize applications
  – Debug and performance tools are even more essential in RC due to additional system and application complexity, and yet research is lacking
• Objective: expand the notion and benefits of software debugging and performance analysis into the software-hardware realm of RC
[Figure: example RC system architecture. System level: a machine of many nodes connected by a network. Node level: node boards with CPUs, main memory, and a primary interconnect, linked over a secondary interconnect to FPGA boards. Each FPGA board holds a board interface, embedded CPU(s), on-board memory, and one or more FPGAs; each FPGA/device contains a device interface and multiple application cores, plus an optional top-level app and on-board FPGA. Legend distinguishes FPGA communication from traditional processor communication.]
ReCAP Framework
• Reconfigurable-computing application performance (ReCAP) framework
  – Adds assertion-based verification and performance analysis capabilities to the FPGA portion of an application
  – Builds upon existing assertions in HLLs AND on Parallel Performance Wizard (PPW) for performance analysis of the CPU portion of the application
• Three main components
  – HDL Instrumenter
  – Hardware Measurement Module (HMM)
  – RC-enhanced version of PPW (PPW+RC)
    • Backend (instrumentation and measurement)
    • Frontend (analysis and visualization)
[Figure: ReCAP components — HDL Instrumenter and HMM handle HDL instrumentation & measurement; PPW+RC backend handles CPU instrumentation & measurement; PPW+RC frontend handles CPU-FPGA analysis & visualization]
ReCAP: HDL Instrumenter
• Modifies HDL design files to monitor application data at runtime
• User can define "events" of interest (e.g., buffer full, cycles spent in a state)
• User can define "monitors" that determine what to record when an event occurs (e.g., summary statistics, full trace)
• User can enable a number of automatic analyses (e.g., decision coverage, assertions, profiling, automatic bottleneck detection)
[Figure: instrumentation process — the original RC application (HLL user application on CPU(s); HDL user application components A-E under a top-level component on FPGA(s)) is augmented by instrumentation: a Hardware Measurement Module plus measurement-and-interface logic and modified ports/interfaces on the FPGA, and a module query thread with lock and data manager inside the PPW+RC backend wrapper (FPGA access methods) on the CPU]
ReCAP: Hardware Measurement Module
• Hardware necessary to record, store, and retrieve data at runtime
  – Profiling, tracing, and sampling
  – Cycle counter and other module statistics (trace records dropped, counter overflow, etc.)
  – Buffers for storing trace data
  – Module control for performance-data retrieval and miscellaneous control (e.g., clear and stop)
[Figure: HMM internals — signal analysis blocks (combinational or sequential logic over application signals) generate triggers and data for profile, trace, and sample paths; a collector (with input and output bitwidths) writes trace records to collector memory; a cycle counter, module statistics, and module control service performance-data requests]
ReCAP: PPW+RC
• PPW+RC backend adds a thread to the software to query the HMM at runtime
  – Requires a lock (since we now have shared FPGA access)
  – Handles FPGA performance-data storage and migration to PPW data structures
  – Monitors FPGA API calls in addition to normal PPW software performance monitoring
• PPW+RC frontend analyzes and presents measured data for CPUs / FPGAs
  – Table and chart views across multiple experiments
  – Export to Jumpshot for timeline views
ReCAP Tool-Flow
• HDL source files are instrumented, then synthesized/implemented normally
• HLL source files are instrumented during compilation
  – Use ppwcc instead of gcc, or ppwupcc instead of upcc
• Program is executed normally on the system
• The performance data file produced can be viewed and analyzed with PPW+RC
[Figure: tool flow — HDL source → HDL Instrumenter → instrumented HDL source → synthesis & implementation → instrumented FPGA binary (configuration); HLL source → compile with PPW+RC → instrumented CPU executable; execute with PPW+RC → program results + performance data files → visualize and analyze with PPW+RC]
Common RC Bottleneck Detection

[Figure: RC bottleneck taxonomy spanning software (traditional HPC bottleneck categories), software/hardware, and hardware bottlenecks — excessive communication time (inefficient transfer type, inefficient transfer size, frequent small transfers, infrequent large transfers, polling), imbalance (load, stage), excessive overhead time, excessive idle time (contention; synchronization: control, late sender/receiver, barrier; buffering: full, empty; clear/flush), excessive HW stalling (application, interface, stall), sub-par channel efficiency, and sub-par computation time. Legend distinguishes HLL-based from HDL-based bottlenecks.]
• Automatically search for common RC bottlenecks
  – Reduces time and knowledge needed to find bottlenecks
• Requires some information from the user
  – We attempt to minimize the amount of information requested
• Currently produces a text file containing
  – All detected bottlenecks
  – Potential optimization strategies for each
  – Peak/ideal speedup if bottleneck resolved
Architecture-Aware Visualization
• Visualization within application & system context, with integrated common-bottleneck data
• Must be scalable to large systems
• Allow the user to experiment with different optimization scenarios to see what provides the best performance
[Figure: architecture-aware visualization mockups. Node-level view: per-core and per-kernel time breakdowns (legend: idle, overhead, work, external send, external receive, idle external send, idle external receive) across CPU cores, FPGA kernels (K1,1 through K3,2), buffers, SRAM, PCIe, and HyperTransport links. System-level view: measured bandwidth and utilization on the CPU interconnect, network, and CPU-FPGA / FPGA-FPGA channels for multiple CPUs and FPGAs.]
HLL Performance Analysis
• High-level languages: Impulse-C and Carte C
  – Convert a subset of C to HDL
  – Employ DMA and streaming communication
  – Speedup gained by pipelining loops, library functions, and replicated functions
  – Impulse C: pipelining of loops, determined by pragmas in the code
  – Carte (SRC): automatic pipelining of the innermost loop
  – Library functions: called as C functions, coded in HDL
• Automated instrumentation
  – Computation
    • State machines: used to preserve execution order in C functions and to control pipelines
    • Control and status signals used by library functions
  – Communication
    • Control and status signals
    • Streaming communication and DMA transfers
• User-assisted instrumentation
  – Application-specific variables: monitor meaningful values selected by the user
• Measurement: employ the HMM from the HDL framework
[Figure: HLL tool flow — uninstrumented project → instrumentation added to C source → C source for FPGA mapped to HDL → instrumentation added to HDL → implement hardware. The software-hardware mapping splits the application's C source between CPU processes/threads (wrapped by an HLL API wrapper with measurement extraction) and FPGA HDL (wrapped by an HLL hardware wrapper feeding instrumented signals and a loopback path to the Hardware Measurement Module); software compilation and hardware implementation yield the finished design.]
HLL Analysis & Visualizations
• Bottleneck detection (currently user-assisted)
  – Load balancing of replicated functions
  – Monitoring for pipeline stalls
  – Detecting streaming communication stalls
  – Finding shared-memory contention
• Integration with performance analysis tool
  – Profiling data
  – Pie charts showing time utilization
  – Tree view of CPU and FPGA timing

[Figure: mapping between C source and the generated HDL state machine (states b4s0-b4s4, b6s0-b6s1) for the main MD loop, showing input stream, pipeline transition, and output stream states]
HLL Assertion Debugging
• Based on the ANSI C assert function:

    int num, i, x[10];
    while (num == 0) {
        num = x[i++];
        assert(i < 10);
    }

• Failure halts the application, displaying an error:

    test.c:7: main: Assertion `i<10' failed.

• Assertions can be disabled via #define NDEBUG
• Most HLLs do not synthesize standard C library functions on the FPGA
  – Convert the assertion function to an if statement (renamed via Perl script)
  – Send the line number of failed assertions on the FPGA to the CPU
    • A communication stream is created and routed between hardware functions containing assertion statements and a software function
  – Perform failure actions via a software function (added via Perl script)
[Chart: application speedup over a single 3.2GHz Xeon — 8-node 3.2GHz Xeon: 7.9x; 8-node H101: 33.9x; optimized 8-node H101: 37.1x]
Case Study: N-Queens
• Overview
  – Find the number of distinct ways n queens can be placed on an n×n board without attacking each other (via a backtracking algorithm)
  – Multi-CPU/FPGA application (UPC/VHDL)
• Overhead
  – <= 6% area (sixteen 32-bit profile counters for state machines)
  – <= 2% memory (96-bit-wide trace buffer for core finish time)
  – Negligible frequency degradation observed
N-Queens results for board size of 16:

                                       XD1                      Xeon-H101
                              Original    Instr.          Original    Instr.
Slices (% rel. to device)        9,041    9,901 (+4%)       23,086    26,218 (+6%)
Block RAM (% rel. to device)        11       15 (+2%)           21        22 (0%)
Frequency (MHz, % rel. to orig.)   124      123 (-1%)          101       101 (0%)
Communication (KB/s)                <1       33                 <1        30
Case Study: 2D-PDF Estimation*
• Application
  – Estimate a 2D probability density function (i.e., a nearly smooth histogram) given a set of (x, y) coordinate data
  – 3.2GHz Xeon, Virtex-4 LX100 FPGA, PCI-X
• Results
  – Automatic bottleneck detection showed problematic communication and control
  – Based on tool suggestions, buffer sizes were increased and control logic restructured within a day, providing up to a 5.5x speedup for the 10-core design

* 2D-PDF code written by Karthik Nagarajan
[Chart: execution time (seconds) vs. number of FPGA cores (1-10) for the original and improved designs, broken down into FPGA read, FPGA write, and software-function time]
Case Study: Molecular Dynamics
• Molecular dynamics: simulates interaction of molecules over discrete time steps
• Impulse C version 2.2
• XD1000 platform
  – Dual-processor motherboard with a 2.2GHz Opteron
  – Stratix-II EP2S180 XD1000 module
• MD communication architecture
  – Chunks of MD data read from SRAM
  – Data streamed to multiple pipelined MD kernels
  – Results stored back to SRAM
[Figure: MD communication architecture — SRAM storage feeds an input memory access block and distributor, which streams data over input streams 1-16 to 16 pipelined MD kernels; output streams 1-16 feed a collector and output memory access block that store results back to SRAM]
• Stream buffer: increased buffer size by 32 times
• Speedup change
  – 6.2x vs. serial baseline before enhancements
  – 7.8x vs. serial baseline after enhancements
[Chart: MD kernel FPGA runtime (seconds) vs. stream buffer size (128-4096 bytes), broken down into pipeline, output stream, and other time]
HLL Debug Case Study

Impulse C code:

    void Logcontrol (...) {
        ...
        co_int64 big, test, update;
        small_1 = 321;
        small_2 = 123;
        big = 5000000000;
        test = 1073741824;
        IF_SIM(printf("HW big:%lld\n", big);)
        IF_SIM(printf("HW test:%lld\n", test);)
        i = 0;
        while (big < test) {
            co_stream_write(small_stream, &small_1, sizeof(co_int32));
            IF_SIM(printf("HW if passed\n");)
            small_1 = big & 4294967295;
            small_2 = big >> 32;
            i++;
            assert(i < 10);
        }
    }

Generated VHDL (only the low 32 bits are compared):

    ni192_suif_tmp <= ... & cmp_less_s(r_big(31 downto 0), r_test(31 downto 0));

big = 5000000000, whose low 32 bits are 705032704; test = 1073741824. Impulse C performs a 32-bit comparison with 64-bit values.
HLL Debug Case Study (cont.)

Simulation:

    C:\hwr\test4-assert> memtest.exe
    Small stream Open
    HW big:5000000000
    HW test:1073741824
    Big stream Open
    Small lower read:321
    Small upper read:123
    ...

Hardware execution:

    [root@xd1000-3 test4]# ./run_sw
    Small stream Open
    Big stream Open
    memtest_hw.c:31: Assertion 'i<10' failed.
    Small lower read:705032704
    Small upper read:1
    ...

Results
• In simulation, the loop does not execute and the assertion is never called
• In hardware, the loop executes infinitely
• In hardware with assert, the loop executes and the assertion fails

Overhead
• Streaming overhead generated per process
• Additional FPGA resource usage < 0.1%
EP2S180                      Original          Modified          Difference
Logic used (143,520)         13,927 (9.71%)    13,974 (9.74%)    +37 (+0.03%)
Comb. ALUTs (143,520)        7,930 (5.53%)     8,073 (5.63%)     +143 (+0.10%)
Registers (143,520)          10,013 (6.98%)    10,063 (7.01%)    +50 (+0.03%)
Block RAM (9,383,040 bits)   222,912 (2.37%)   223,488 (2.38%)   +576 (+0.01%)
Frequency (MHz)              143.68            142.03            -1.65 (-1.15%)
Conclusions
• Debug and performance analysis of RC applications are critical for improving productivity in obtaining a correctly functioning, well-performing application
• The ReCAP framework/tool aids designers with verification and performance analysis
  – Records and monitors application data on CPU and FPGA at runtime while minimizing overhead and user effort
  – Can perform a number of automated analyses, including common bottleneck detection, decision coverage, and assertion monitoring
  – Provides analysis and presentation of CPU/FPGA debug and performance data
• ReCAP represents the first RC application performance framework and tool (per extensive literature review); its debug capabilities are also not currently found in other tools