Date post: | 28-Dec-2015 |
Category: |
Documents |
Upload: | brett-hines |
View: | 222 times |
Download: | 0 times |
John Curreri
Seth Koehler
Rafael Garcia
Project F2: Application Project F2: Application Performance AnalysisPerformance Analysis
Outline Introduction
Application mappers Historical background Performance analysis today
HLL runtime performance analysis tool Motivation Instrumentation Framework Visualization
Case study Molecular Dynamics
Conclusions & References
Application Mappers Translates C code to HDL
Higher level of abstraction Usually a subset of ANSI C
No pointers No standard C libraries for FPGA
HDL is generated as a project file for Xilinx or Altera tools
Built-in communication Separate C source files are
made for the CPU & FPGA Similar communication function
calls between CPU & FPGA
Application Mappers (continued) Computational parallelism
Pipelining of loops for(), while(), etc.
Use of library functions HDL coded functions called at HLL FFT, Floating point operations
Replication of functions defined in hardware
Types of communication DMA transfers
Efficient transfer of large chucks of data
Stream transfers Steady flow of data Buffered for transfer rate
changes
Introduction to the F2 project Goals for performance analysis in RC
Productively identify and remedy performance bottlenecks in RC applications (CPUs and FPGAs)
Motivations Complex systems are difficult to analyze by hand
Manual instrumentation is unwieldy Difficult to make sense of large volume of raw data
Tools can help quickly locate performance problems Collect and view performance data with little effort Analyze performance data to indicate potential bottlenecks Staple in HPC, limited in HPEC, and virtually non-existent in RC
Challenges How do we expand notion of software performance analysis into
software-hardware realm of RC? What are common bottlenecks for dual-paradigm applications? What techniques are necessary to detect performance bottlenecks? How do we analyze and present these bottlenecks to a user?
Historical Background Gettimeofday and printf
VERY cumbersome, repetitive, manual, not optimized for speed Profilers date back to 70’s with “prof” (gprof, 1982)
Provide user with information about application behavior Percentage of time spent in a function How often a function calls another function
Simulators / Emulators Too slow or too inaccurate Require significant development time
PAPI (Performance Application Programming Interface) Portable interface to hardware
performance counters on modern CPUs Provides information about caches,
CPU functional units, main memory, and more
Processor HW counters
UltraSparc II 2
Pentium 3 2
AMD Athlon 4
IA-64 4
POWER4 8
Pentium 4 18* Source: Wikipedia
7
Performance Analysis Today
Original Application
Instrument
Execute
Measure
Analyze (Automatically)Present
Optimize
Measured Data File
Execution Environment
Visualizations
Instrumented Application
Potential Bottlenecks
Analyze (Manually)
Modified Application
Optimized Application
What does performance analysis look like today? Goals
Low impact on application behavior
High-fidelity performance data Flexible Portable Automated Concise Visualization
Techniques Event-based, sample-based Profile, Trace
Above all, we want to understand application behavior in order to locate performance problems!
Related Research and Tools: Parallel Performance Wizard (PPW) Open-source tool developed by UPC Group at
University of Florida Performance analysis and optimization (PGAS*
systems and MPI support) Performance data can be analyzed for
bottlenecks Offers several ways of exploring performance
data Graphs and charts to quickly view high-level
performance information at a glance [right, top] In-depth execution statistics for identifying
communication and computational bottlenecks Interacts with popular trace viewers (e.g.
Jumpshot [right, bottom]) for detailed analysis of trace data
Comprehensive support for correlating performance back to original source code
* Partitioned Global Address Space languages allow partitioned memory to be treated as global shared memory by software.
Motivation for RC Performance Analysis Dual-paradigm applications gaining more
traction in HPC and HPEC Design flexibility allows best use of FPGAs
and traditional processors Drawback: More challenging to design
applications for dual-paradigm systems Parallel application tuning and FPGA core
debugging are hard enough!
Debug
Performance
Debug
Performance
Debug
Performance
Sequential
Parallel
Dual-Paradigm
Difficultylevel
Less
More
No existing holistic solutions for analyzing dual-paradigm applications Software-only views leave out low-level details Hardware-only views provide incomplete
performance information Need complete system view for effective tuning
of entire application
Motivation for RC Performance Analysis
Q: Is my runtime load-balancing strategy working? A: ???
ChipScope waveform
Motivation for RC Performance Analysis Q: How well is my core’s pipelining strategy working?
A: ???
gprof output (×N, one for each node!)
Flat profile:
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 51.52 2.55 2.55 5 510.04 510.04 USURP_Reg_poll 29.41 4.01 1.46 34 42.82 42.82 USURP_DMA_write 11.97 4.60 0.59 14 42.31 42.31 USURP_DMA_read 4.06 4.80 0.20 1 200.80 200.80 USURP_Finalize 2.23 4.91 0.11 5 22.09 22.09 localp 1.22 4.97 0.06 5 12.05 12.05 USURP_Load 0.00 4.97 0.00 10 0.00 0.00 USURP_Reg_write 0.00 4.97 0.00 5 0.00 0.00 USURP_Set_clk 0.00 4.97 0.00 5 0.00 931.73 rcwork 0.00 4.97 0.00 1 0.00 0.00 USURP_Init
Instrumentation Level
HighLevel
Language
HardwareDescriptionLanguage
FPGABit
Stream
Instrumentation Level
Requires timeconsuming
place and route
Requires FPGA specific
instrumentation
Application mapping
disturbed by instrumentation
Constrained to HLL timing functions
(if any)XXX X
X
Dis
adva
nta
ge
High-level language (HLL) Requires HLL timing functions Application mapping disturbed
by instrumentation Hardware Description
Language (HDL) Portable between HLL and
types FPGA families Selected level for
instrumentation FPGA bit stream
Requires targeting specific FPGA family
Instrument in minutes
Instrumentation Selection Automated - Computation
State machines Used for preserving execution order in C functions Used to control state of pipelines
Control and status signals Used by library function
Automated - Communication Control and status signals
Used for streaming communication Used for DMA transfers
Application specific Monitoring variables for meaningful values
Measurement Techniques Profiling
Counters Records number of occurrences of event
Low overhead Normally uses registers Block RAM can be used for state machines
Tracing Timestamps
Indicating when event occurred Data
Associated with each event Greater overhead
Uses memory to store timestamps and data Greater fidelity
Reconstruction of sequence of events
*
* Zaki, O., Lusk, E., Gropp, W., and Swider, D. 1999. Toward Scalable Performance Visualization with Jumpshot. Int. J. High Perform. Comput. Appl. 13, 3 (Aug. 1999), 277-288.
CPU-0
3
2
1
Time
Hardware Measurement Module
TriggersSignal
Analysis Module
SignalsData
Profile Counters0 1 2 P - 1
Trace Data
Trace Data
Trace Data
Cycle Counter
Module Statistics
Module Control
Request
Perf. Data
Sample Control
signal
value
comp trigger
On-chip
Onboard
data
...
Offboard
Mem
ory
Mo
du
le
Uninstrumented ProjectInstrumentation added to C sourceC source for FPGA mapped to HDLInstrumentation added to HDLImplement hardware
Adding Instrumentation & Measurement
HLL Hardware Wrapper
HLL API Wrapper
Application(C source)
MeasurementExtraction
Process/Thread
Application(C source)
FPGA(s)
CPU(s) HLL Tool Flow
C source
Compilesoftware
Implementhardware
Finisheddesign
Software-hardwaremapping
InstrumentedSignals
Instrumentation
Instrumentation
Loopback(C source)
Application(HDL) Loopback
(HDL)
HardwareMeasurement
Module
Reverse Mapping & Analysis
Mapping of HDL data back to HLL Variable name-matching Observing scope and other patterns
Bottleneck detection Load-balancing of replicated functions Monitoring for pipeline stalls Detecting streaming communication stalls Finding shared-memory contention
Example RC Visualization Need unified visualizations that accentuate
important statistics Must be scalable to many nodes
CPU 3CPU 2CPU 1CPU 0
904MB/s88%
10MB/s1%
812MB/s79%
914MB/s89%
1.79GB/s72%
6MB/s0%
2.50GB/s100%
0MB/s0%
0MB/s0%
FPGA 0 FPGA 1
2.76GB/s69%
CPU 4 CPU 5
1.98GB/s99%
Network
210MB/s10%
FPGA 2
Throughput (MB/s)
Time (sec)
IDLE75%
PHASE 19%
PHASE 216%
Potential Bottlenecks CPU Interconnect
121MB/s12%
691MB/s67%
Molecular Dynamics Simulation
Interactions between atoms and molecules discrete time intervals
Models forces Newtonian physics Van Der Walls forces Other interactions
Tracks molecules position and velocity X, Y and Z directions
http://en.wikipedia.org/wiki/Molecular_dynamics
Case Study Setup Impulse C v2.2 XD1000 platform
Opteron 2.2 GHz XD1000 module with Altera
Stratix-II EP2S180 FPGA in second processor socket
MD communication architecture Chunks of MD data are read
from SRAM Data is streamed to multiple
MD kernels that are pipelined Results are stored back to
SRAM
Input Memory Access
SRAMStorage
Output Memory AccessCollector
Output Stream 1 Output Steam 16
MD kernel 16
Input Stream 16
MD kernel 1
Input Steam 1
Distributor
Impulse-C Profile Percentages
Output stream of Molecular Dynamics kernel is a bottleneck.
void MD_kernel (co_stream in, co_stream out){
…for(t=0;t<16384;t++){
co_stream_read(in, &x, …);co_stream_read(in, &y, …);co_stream_read(in, &z, …);
…for(i=0;i<1024;i++){//Perform MD calculations
#pragma CO PIPELINE
…}co_stream_write(out, &x, …);co_stream_write(out, &y, …);co_stream_write(out, &z, …);
}
…}
51.67%
0.09%
0.26%
47.98%
Percent Runtime
(not shown)
Stream buffer size was increased by 32 times allowing application speedup to increase from 6.2 to 7.8 vs. serial baseline.
MD Kernal Runtime
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
128 256 512 1024 2048 4096
Stream buffer size (bytes)
FP
GA
ru
nti
me
(sec
on
ds)
Other
Output stream
Pipeline
Performance Analysis Overhead
EP2S180 Original Modified Difference
Logic Used(143520)
126252(87.97%)
131851(91.87%)
+5599(+3.90%)
Comb. ALUT(143520)
100344 (69.92%)
104262 (72.65%)
+3918 (+2.73%)
Registers(143520)
104882 (73.08%)
110188 (76.78%)
+5306 (+3.70%)
Block RAM (9383040 bits)
3437568 (36.64%)
3557376 (37.91%)
+119808 (+1.27%)
Frequency (MHz)
80.57 78.44 -2.13(-2.64%)
Additional FPGA resource usage Less than 4%
Frequency reduction Less than 3%
Conclusions Developed prototype HLL-oriented RC
performance analysis tool First such runtime performance analysis tool framework
(per extensive literature review) Tracing & profiling available Automated instrumentation in progress
Application case study performed Observed minimal overhead from tool Speedup achieved due to performance analysis
Future work SRC support, automated instrumentation and analysis,
integration with software PAT, further case studies
References Paul Graham, Brent Nelson, and Brad Hutchings. Instrumenting bitstreams for debugging
FPGA circuits. In Proc. of the the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 41-50, Washington, DC, USA, Apr. 2001. IEEE Computer Society.
Sameer S. Shende and Allen D. Malony. The Tau parallel performance system. International Journal of High Performance Computing Applications (HPCA), 20(2):287-311, May 2006.
C. EricWu, Anthony Bolmarcich, Marc Snir, DavidWootton, Farid Parpia, Anthony Chan, Ewing Lusk, and William Gropp. From trace generation to visualization: a performance framework for distributed parallel systems. In Proc. of the 2000 ACM/IEEE conference on Supercomputing (CDROM) (SC), page 50, Washington, DC, USA, Nov. 2000. IEEE Computer Society.
Adam Leko and Max Billingsley, III. Parallel performance wizard user manual. http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007.
S. Koehler, J. Curreri, and A. George, "Challenges for Performance Analysis in High-Performance Reconfigurable Computing," Proc. of Reconfigurable Systems Summer Institute 2007 (RSSI), Urbana, IL, July 17-20, 2007.
J. Curreri, S. Koehler, B. Holland, and A. George, "Performance Analysis with High-Level Languages for High-Performance Reconfigurable Computing," Proc. of 16th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Palo Alto, CA, Apr. 14-15, 2008.