Debugging and Optimizing RC Applications
Seth Koehler
John Curreri
Presentation Overview
• Introduction
• Background
  – Reconfigurable computing (RC) applications
  – Debug
  – Performance analysis
• Project overview and details
• ReCAP framework & tool
  – Special features
  – HLL-based debug & performance analysis
  – Case studies
• Conclusions
Introduction
• Debugging and optimization are an integral part of application development
  – Typically at the end of the development cycle (after the formulation and design phases)
  – Designers often spend longer debugging an application than designing it!*
• Optimization is often just left for a later version, if ever
  – Every optimization made must re-pass through the debug phase
• To improve productivity in application design, it is critical to address debug and optimization

* Debugging FPGA systems - ftp://ftp.altera.com/outgoing/download/education/events/highspeed/Tek_ALTERAFPGADEBUG_IPIntegration_final.pdf

[Figure: development phases — Formulation → Design → Translation → Execution]
Background – RC Applications
• Why reconfigurable computing (RC)?
  – General-purpose architectures can be wasteful in terms of performance and power
  – Impractical to have an ASIC for every application
• RC ~= FPGAs (Field-Programmable Gate Arrays)
  – Application-specific hardware and parallelism
  – Retain flexibility and programmability
• RC applications typically employ CPUs and FPGAs
  – Leverage strengths of both types of processors
  – Potential for higher performance using less power
  – Programmed using Hardware Description Languages (HDLs) or High-Level Languages (HLLs)
  – CPU is programmed with whatever conventional HLL is desired (C, C++, MPI, UPC, etc.)
• System and application complexity can make it difficult to achieve a correct, well-performing application
Background - Debug
• Debug: to detect and remove errors from a program*
• Debugging methods
  – Stare at code: at least it helps you "wrap your mind around your code"
  – Insert printf statements: requires some good guessing; can be tedious if more than a few printf's are needed
  – Use a debugger (e.g., gdb): much better – instant access to all data and support for indicating where/why a program crashed
  – Use a simulator: can provide more flexibility and information than a debugger, but simulators can be inaccurate and slow, not to mention hard to build
  – Write assertions: best – the application designer documents situations that are impossible; formal and dynamic verification methods check whether the assertions hold

* http://dictionary.reference.com
Background – Performance Analysis
• Performance analysis: investigate program behavior using information gathered during execution*
• Aids the designer in locating and remedying application bottlenecks, reducing guesswork in optimization
• Replaces tedious, error-prone manual analysis methods (timing routines and printf statements)
Optimization and Performance Analysis

[Figure: two optimization flows. Unassisted optimization flow: Application → Analyze (Manually) → "???" → Optimize → Optimized Application. Performance analysis optimization flow: Application → Instrument → Instrumented Application → Measure (on target system) → Performance Data → Analyze (Automatically) / Present → Visualizations / Potential Bottlenecks → Optimize → Optimized Application]
* http://en.wikipedia.org/wiki/Performance_analysis
Project Overview
• RC systems and applications are even more complex than in HPC
  – Heterogeneous components
  – Hierarchy of parallelism among components
  – Lack of visibility inside RC devices
• Optimizing applications is crucial for effective use of these systems
  – Debug and performance tools are relied on heavily in HPC to productively verify and optimize applications
  – Debug and performance tools are even more essential in RC due to additional system and application complexity, and yet research is lacking
• Objective: expand the notion and benefits of software debugging and performance analysis into the software-hardware realm of RC
[Figure: example RC system architecture. System level: a machine of many nodes connected by a network. Node level: node boards with CPUs, main memory, and a primary interconnect, linked over a secondary interconnect to FPGA boards. Each FPGA board holds a board interface, embedded CPU(s), on-board memory, and one or more FPGAs; each FPGA/device contains a device interface and multiple application cores, plus an optional top-level app and on-board FPGA. Legend distinguishes FPGA communication from traditional processor communication.]
ReCAP Framework
• Reconfigurable-computing application performance (ReCAP) framework
  – Adds assertion-based verification and performance analysis capabilities to the FPGA portion of an application
  – Builds upon existing assertions in HLLs AND on Parallel Performance Wizard (PPW) for performance analysis of the CPU portion of the application
• Three main components
  – HDL Instrumenter
  – Hardware Measurement Module (HMM)
  – RC-enhanced version of PPW (PPW+RC)
    • Backend (instrumentation and measurement)
    • Frontend (analysis and visualization)
[Figure: ReCAP components — HDL Instrumenter and HMM handle HDL instrumentation & measurement; PPW+RC backend handles CPU instrumentation & measurement; PPW+RC frontend handles CPU-FPGA analysis & visualization]
ReCAP: HDL Instrumenter
• Modifies HDL design files to monitor application data at runtime
• User can define "events" of interest (e.g., buffer full, cycles spent in a state)
• User can define "monitors" that determine what to record when an event occurs (e.g., summary statistics, full trace)
• User can enable a number of automatic analyses (e.g., decision coverage, assertions, profiling, automatic bottleneck detection)
[Figure: instrumentation process — the original RC application (HLL user application on CPU(s); HDL user application components A-E under a top-level component on FPGA(s)) is augmented by instrumentation: a Hardware Measurement Module plus measurement-and-interface logic and modified ports/interfaces on the FPGA, and a module query thread with lock and data manager inside the PPW+RC backend wrapper (FPGA access methods) on the CPU]
ReCAP: Hardware Measurement Module
• Hardware necessary to record, store, and retrieve data at runtime
  – Profiling, tracing, and sampling
  – Cycle counter and other module statistics (trace records dropped, counter overflow, etc.)
  – Buffers for storing trace data
  – Module control for performance-data retrieval and miscellaneous control (e.g., clear and stop)
[Figure: HMM internals — signal analysis blocks (combinational or sequential logic over application signals) generate triggers and data for profile, trace, and sample paths; a collector (with input and output bitwidths) writes trace records to collector memory; a cycle counter, module statistics, and module control service performance-data requests]
ReCAP: PPW+RC
• PPW+RC backend adds a thread to the software to query the HMM at runtime
  – Requires a lock (since we now have shared FPGA access)
  – Handles FPGA performance-data storage and migration to PPW data structures
  – Monitors FPGA API calls in addition to normal PPW software performance monitoring
• PPW+RC frontend analyzes and presents measured data for CPUs / FPGAs
  – Table and chart views across multiple experiments
  – Export to Jumpshot for timeline views
ReCAP Tool-Flow
• HDL source files are instrumented, then synthesized/implemented normally
• HLL source files are instrumented during compilation
  – Use ppwcc instead of gcc, or ppwupcc instead of upcc
• Program is executed normally on the system
• The performance data file produced can be viewed and analyzed with PPW+RC
[Figure: tool flow — HDL source → HDL Instrumenter → instrumented HDL source → synthesis & implementation → instrumented FPGA binary (configuration); HLL source → compile with PPW+RC → instrumented CPU executable; execute with PPW+RC → program results + performance data files → visualize and analyze with PPW+RC]
Common RC Bottleneck Detection

[Figure: RC bottleneck taxonomy spanning software (traditional HPC bottleneck categories), software/hardware, and hardware bottlenecks — excessive communication time (inefficient transfer type, inefficient transfer size, frequent small transfers, infrequent large transfers, polling), imbalance (load, stage), excessive overhead time, excessive idle time (contention; synchronization: control, late sender/receiver, barrier; buffering: full, empty; clear/flush), excessive HW stalling (application, interface, stall), sub-par channel efficiency, and sub-par computation time. Legend distinguishes HLL-based from HDL-based bottlenecks.]
• Automatically search for common RC bottlenecks
  – Reduces time and knowledge needed to find bottlenecks
• Requires some information from the user
  – We attempt to minimize the amount of information requested
• Currently produces a text file containing
  – All detected bottlenecks
  – Potential optimization strategies for each
  – Peak/ideal speedup if bottleneck resolved
Architecture-Aware Visualization
• Visualization within application & system context, with integrated common-bottleneck data
• Must be scalable to large systems
• Allow the user to experiment with different optimization scenarios to see what provides the best performance
[Figure: architecture-aware visualization mockups. Node-level view: per-core and per-kernel time breakdowns (legend: idle, overhead, work, external send, external receive, idle external send, idle external receive) across CPU cores, FPGA kernels (K1,1 through K3,2), buffers, SRAM, PCIe, and HyperTransport links. System-level view: measured bandwidth and utilization on the CPU interconnect, network, and CPU-FPGA / FPGA-FPGA channels for multiple CPUs and FPGAs.]
HLL Performance Analysis
• High-level languages: Impulse-C and Carte C
  – Convert a subset of C to HDL
  – Employ DMA and streaming communication
  – Speedup gained by pipelining loops, library functions, and replicated functions
  – Impulse C: pipelining of loops, determined by pragmas in the code
  – Carte (SRC): automatic pipelining of the innermost loop
  – Library functions: called as C functions, coded in HDL
• Automated instrumentation
  – Computation
    • State machines: used to preserve execution order in C functions and to control pipelines
    • Control and status signals used by library functions
  – Communication
    • Control and status signals
    • Streaming communication and DMA transfers
• User-assisted instrumentation
  – Application-specific variables: monitor meaningful values selected by the user
• Measurement: employ the HMM from the HDL framework
[Figure: HLL tool flow — uninstrumented project → instrumentation added to C source → C source for FPGA mapped to HDL → instrumentation added to HDL → implement hardware. The software-hardware mapping splits the application's C source between CPU processes/threads (wrapped by an HLL API wrapper with measurement extraction) and FPGA HDL (wrapped by an HLL hardware wrapper feeding instrumented signals and a loopback path to the Hardware Measurement Module); software compilation and hardware implementation yield the finished design.]
HLL Analysis & Visualizations
• Bottleneck detection (currently user-assisted)
  – Load balancing of replicated functions
  – Monitoring for pipeline stalls
  – Detecting streaming communication stalls
  – Finding shared-memory contention
• Integration with performance analysis tool
  – Profiling data
  – Pie charts showing time utilization
  – Tree view of CPU and FPGA timing

[Figure: mapping between C source and the generated HDL state machine (states b4s0-b4s4, b6s0-b6s1) for the main MD loop, showing input stream, pipeline transition, and output stream states]
HLL Assertion Debugging
• Based on the ANSI C assert function:

    int num, i, x[10];
    while (num == 0) {
        num = x[i++];
        assert(i < 10);
    }

• Failure halts the application, displaying an error:

    test.c:7: main: Assertion `i<10' failed.

• Assertions can be disabled via #define NDEBUG
• Most HLLs do not synthesize standard C library functions on the FPGA
  – Convert the assertion function to an if statement (renamed via Perl script)
  – Send the line number of failed assertions on the FPGA to the CPU
    • A communication stream is created and routed between hardware functions containing assertion statements and a software function
  – Perform failure actions via a software function (added via Perl script)
[Chart: application speedup over a single 3.2GHz Xeon — 8-node 3.2GHz Xeon: 7.9x; 8-node H101: 33.9x; optimized 8-node H101: 37.1x]
Case Study: N-Queens
• Overview
  – Find the number of distinct ways n queens can be placed on an n×n board without attacking each other (via a backtracking algorithm)
  – Multi-CPU/FPGA application (UPC/VHDL)
• Overhead
  – <= 6% area (sixteen 32-bit profile counters for state machines)
  – <= 2% memory (96-bit-wide trace buffer for core finish time)
  – Negligible frequency degradation observed
N-Queens results for board size of 16:

                                       XD1                      Xeon-H101
                              Original    Instr.          Original    Instr.
Slices (% rel. to device)        9,041    9,901 (+4%)       23,086    26,218 (+6%)
Block RAM (% rel. to device)        11       15 (+2%)           21        22 (0%)
Frequency (MHz, % rel. to orig.)   124      123 (-1%)          101       101 (0%)
Communication (KB/s)                <1       33                 <1        30
Case Study: 2D-PDF Estimation*
• Application
  – Estimate a 2D probability density function (i.e., a nearly smooth histogram) given a set of (x, y) coordinate data
  – 3.2GHz Xeon, Virtex-4 LX100 FPGA, PCI-X
• Results
  – Automatic bottleneck detection showed problematic communication and control
  – Based on tool suggestions, buffer sizes were increased and control logic restructured within a day, providing up to a 5.5x speedup for the 10-core design

* 2D-PDF code written by Karthik Nagarajan
[Chart: execution time (seconds) vs. number of FPGA cores (1-10) for the original and improved designs, broken down into FPGA read, FPGA write, and software-function time]
Case Study: Molecular Dynamics
• Molecular dynamics: simulates interaction of molecules over discrete time steps
• Impulse C version 2.2
• XD1000 platform
  – Dual-processor motherboard with a 2.2GHz Opteron
  – Stratix-II EP2S180 XD1000 module
• MD communication architecture
  – Chunks of MD data read from SRAM
  – Data streamed to multiple pipelined MD kernels
  – Results stored back to SRAM
[Figure: MD communication architecture — SRAM storage feeds an input memory access block and distributor, which streams data over input streams 1-16 to 16 pipelined MD kernels; output streams 1-16 feed a collector and output memory access block that store results back to SRAM]
• Stream buffer: increased buffer size by 32 times
• Speedup change
  – 6.2x vs. serial baseline before enhancements
  – 7.8x vs. serial baseline after enhancements
[Chart: MD kernel FPGA runtime (seconds) vs. stream buffer size (128-4096 bytes), broken down into pipeline, output stream, and other time]
HLL Debug Case Study

Impulse C code:

    void Logcontrol (...) {
        ...
        co_int64 big, test, update;
        small_1 = 321;
        small_2 = 123;
        big = 5000000000;
        test = 1073741824;
        IF_SIM(printf("HW big:%lld\n", big);)
        IF_SIM(printf("HW test:%lld\n", test);)
        i = 0;
        while (big < test) {
            co_stream_write(small_stream, &small_1, sizeof(co_int32));
            IF_SIM(printf("HW if passed\n");)
            small_1 = big & 4294967295;
            small_2 = big >> 32;
            i++;
            assert(i < 10);
        }
    }

Generated VHDL (only the low 32 bits are compared):

    ni192_suif_tmp <= ... & cmp_less_s(r_big(31 downto 0), r_test(31 downto 0));

big = 5000000000, whose low 32 bits are 705032704; test = 1073741824. Impulse C performs a 32-bit comparison with 64-bit values.
HLL Debug Case Study (cont.)

Simulation:

    C:\hwr\test4-assert> memtest.exe
    Small stream Open
    HW big:5000000000
    HW test:1073741824
    Big stream Open
    Small lower read:321
    Small upper read:123
    ...

Hardware execution:

    [root@xd1000-3 test4]# ./run_sw
    Small stream Open
    Big stream Open
    memtest_hw.c:31: Assertion 'i<10' failed.
    Small lower read:705032704
    Small upper read:1
    ...

Results
• In simulation, the loop does not execute and the assertion is never called
• In hardware, the loop executes infinitely
• In hardware with assert, the loop executes and the assertion fails

Overhead
• Streaming overhead generated per process
• Additional FPGA resource usage < 0.1%
EP2S180                      Original          Modified          Difference
Logic used (143,520)         13,927 (9.71%)    13,974 (9.74%)    +37 (+0.03%)
Comb. ALUTs (143,520)        7,930 (5.53%)     8,073 (5.63%)     +143 (+0.10%)
Registers (143,520)          10,013 (6.98%)    10,063 (7.01%)    +50 (+0.03%)
Block RAM (9,383,040 bits)   222,912 (2.37%)   223,488 (2.38%)   +576 (+0.01%)
Frequency (MHz)              143.68            142.03            -1.65 (-1.15%)
Conclusions
• Debug and performance analysis of RC applications are critical for improving productivity in obtaining a correctly functioning, well-performing application
• The ReCAP framework/tool aids designers with verification and performance analysis
  – Records and monitors application data on CPU and FPGA at runtime while minimizing overhead and user effort
  – Can perform a number of automated analyses, including common bottleneck detection, decision coverage, and assertion monitoring
  – Provides analysis and presentation of CPU/FPGA debug and performance data
• ReCAP represents the first RC application performance framework and tool (per extensive literature review); its debug capabilities are also not currently found in other tools