Center for Embedded Computer Systems
University of California, Irvine and San Diego
http://www.cecs.uci.edu/~spark
SPARK: A Parallelizing High-Level Synthesis Framework
Supported by Semiconductor Research Corporation & Intel Inc
Sumit Gupta, Rajesh Gupta, Nikil Dutt, Alex Nicolau
Copyright Sumit Gupta 2003
System Level Synthesis
[Figure: system-level design flow. Labels: System Level Model, Task Analysis, HW/SW Partitioning, Hardware Behavioral Description, Software Behavioral Description, High Level Synthesis, Software Compiler, ASIC, FPGA, Processor Core, Memory, I/O]
High Level Synthesis
- Transform behavioral descriptions to RTL/gate level
- From C to CDFG to Architecture

Behavioral description:
x = a + b;
c = a < b;
if (c) then d = e - f;
else g = h + i;
j = d * g;
l = e + x;

[Figure: CDFG of the description above, with an If Node on c and its T/F branches, mapped to an architecture. Architecture labels: Memory, ALU, Control, Data path]

- Problem #1: Poor quality of HLS results beyond straight-line behavioral descriptions
- Problem #2: Poor/no controllability of the HLS results
High-level Synthesis
- Well-researched area: from early 1980's
- Renewed interest due to new system level design methodologies
- Large number of synthesis optimizations have been proposed
  - Either operation level: algebraic transformations on DSP codes
  - or logic level: Don't Care based control optimizations
- In contrast, compiler transformations operate at both operation level (fine-grain) and source level (coarse-grain)
- Parallelizing Compiler Transformations
  - Different optimization objectives and cost models than HLS
- Our aim: Develop Synthesis and Parallelizing Compiler Transformations that are "useful" for HLS
  - Beyond scheduling results: in Circuit Area and Delay
  - For large designs with complex control flow (nested conditionals/loops)
Our Approach: Parallelizing HLS (PHLS)
[Figure: PHLS flow. Labels: C Input, Original CDFG, Source-Level Compiler Transformations, Optimized CDFG, Scheduling & Binding, Scheduling Compiler & Dynamic Transformations, VHDL Output]
- Optimizing Compiler and Parallelizing Compiler transformations applied at Source-level (Pre-synthesis) and during Scheduling
  - Source-level code refinement using Pre-synthesis transformations
  - Code Restructuring by Speculative Code Motions
  - Operation replication to improve concurrency
  - Dynamic transformations: exploit new opportunities during scheduling
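Speculative code motion can be illustrated on the conditional from the High Level Synthesis slide. The following is a minimal sketch in C, not SPARK output: the function names, the extra incoming values of `d` and `g`, and the choice to return `j + l` are assumptions made so the fragment compiles. In the speculated version both branch operations are computed before the condition is known, so a scheduler may run them alongside the comparison; the conditional only selects which result is committed.

```c
/* Original form: e - f and h + i wait until the condition c is known. */
int original(int a, int b, int e, int f, int h, int i, int d, int g)
{
    int x = a + b;
    int c = a < b;
    if (c)
        d = e - f;
    else
        g = h + i;
    int j = d * g;
    int l = e + x;
    return j + l;
}

/* Speculated form: both operations are hoisted above the conditional. */
int speculated(int a, int b, int e, int f, int h, int i, int d, int g)
{
    int x = a + b;
    int c = a < b;
    int d_spec = e - f;   /* computed whether or not c is true */
    int g_spec = h + i;   /* computed whether or not c is false */
    if (c)
        d = d_spec;       /* commit only the result that is needed */
    else
        g = g_spec;
    int j = d * g;
    int l = e + x;
    return j + l;
}
```

Both functions return the same value for the same inputs; the difference is only in which operations depend on the condition.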
SPARK Parallelizing HLS Framework
- C input and Synthesizable RTL VHDL output
- Tool-box of Transformations and Heuristics
  - Each of these can be developed independently of the other
- Script-based application of transformations, passes, and heuristics: similar to Synopsys Design Compiler
- Hierarchical Intermediate Representation (HTGs)
  - Retains structural information about design (conditional blocks, loops)
  - Enables efficient and structured application of transformations
- Complete HLS tool: Does Resource Binding & Control Synthesis
- Enables Graphical Visualization of Design description and intermediate results (CDFG, DFG, HTG)
- Benchmarked on large set of multimedia & image processing designs
- SPARK System Release available for download
  - User Manual for running tool and changing synthesis scripts
  - Tutorial for the synthesis of a portion of an MPEG player
  - 100,000+ lines of C++ code
PHLS Transformations Organized into Four Groups
- Pre-Synthesis: Source-to-Source Transformations
  - Loop-Invariant Code Motions, Loop Unrolling, CSE
- Scheduling: synthesis & compiler transformations
  - Speculative Code Motions, Multi-cycling, Operation Chaining, Loop Shifting (Incremental Loop Pipelining technique)
- Dynamic: Transformations applied dynamically during scheduling
  - Dynamic CSE & Copy Propagation, Dynamic Branch Balancing
- Basic Compiler Transformations
  - Copy Propagation, Dead Code Elimination, constant propagation
- Application of these transformations is guided by Synthesis Scripts
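Two of the pre-synthesis transformations named above, common sub-expression elimination (CSE) and loop-invariant code motion, can be shown on a small loop. This is an illustrative sketch only: the function names and the array size `N` are made up here, and the code is hand-transformed rather than produced by SPARK.

```c
#define N 8

/* Before: a * b appears twice per iteration and is recomputed N times. */
void scale_before(const int in[N], int out[N], int a, int b)
{
    for (int i = 0; i < N; i++)
        out[i] = in[i] * (a * b) + (a * b);
}

/* After: CSE merges the two copies of a * b into one value, and
   loop-invariant code motion moves that value out of the loop. */
void scale_after(const int in[N], int out[N], int a, int b)
{
    int ab = a * b;               /* computed once, before the loop */
    for (int i = 0; i < N; i++)
        out[i] = in[i] * ab + ab;
}
```

In a hardware setting the benefit would be fewer multiplications to schedule and bind, rather than fewer executed instructions.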
Experiments
- We used SPARK to synthesize designs derived from several industrial designs
  - Example: MPEG-1, MPEG-2, GIMP Image Processing software
  - Case Study of Intel Instruction Length Decoder
- Quantified effects of individual transformations on QOR
  - Pre-synthesis transformations
  - Speculative Code Motions, Loop Pipelining
  - Dynamic Transformations
- Scheduling Results
  - Number of States in FSM
  - Cycles on Longest Path through Design
- VHDL: Logic Synthesis
  - Critical Path Length (ns)
  - Unit Area
Scheduling & Logic Synthesis Results
[Figure: two bar charts, "MPEG-1 Pred1 Function" and "MPEG-1 Pred2 Function". Y-axis ticks: 0 to 1.2. Metrics on the x-axis: Longest Path (l, cycles), Critical Path (c, ns), Total Delay (c*l), Unit Area. Bar series: Non-speculative CMs: Within BBs & Across Hier Blocks; + Speculative Code Motions; + Pre-Synthesis Transforms; + Dynamic CSE. Percentage annotations, in extraction order: 42%, 10%, 36%, 36%, 8%, 39%]
- Overall: 63-66 % improvement in Delay
- Almost constant Area
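The "Total Delay (c*l)" metric in the charts is the product of the critical path length c (the clock period, in ns) and the number of cycles l on the longest path. A minimal sketch of that arithmetic follows; the function names are made up for illustration and are not part of SPARK.

```c
/* Total delay in ns: cycles on the longest path (l) times clock period (c). */
double total_delay_ns(unsigned cycles, double clock_ns)
{
    return (double)cycles * clock_ns;
}

/* Percentage by which `improved` is smaller than `baseline`. */
double improvement_percent(double baseline, double improved)
{
    return 100.0 * (baseline - improved) / baseline;
}
```

As a purely hypothetical example (these numbers are not from the experiments): a design that goes from 100 cycles at 10 ns to 40 cycles at 12.5 ns has its total delay drop from 1000 ns to 500 ns, a 50 % improvement, even though the cycle count fell by 60 %. The longer clock period absorbs part of the gain.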
Example Design: ILD Block from Intel
- Case Study: A design derived from the Instruction Length Decoder of the Intel Pentium® class of processors
- Characteristics of Microprocessor functional blocks
  - Low Latency: Single or Dual cycle implementation
  - Consist of several small computations
  - Intermix of control and data logic
- Starting with a sequential, multi-cycle specification, we achieved a fully parallel, single-cycle design
  - Our toolbox approach enables us to develop a script to synthesize applications from different domains
  - Final design looks close to the actual implementation done by Intel
Key Insights from Project
- Coarse-grain and Fine-grain Parallelizing transformations and basic compiler transformations are essential to achieving high quality of synthesis results
- Language-level pre-synthesis optimizations are important due to the high level of abstraction of behavioral C
  - Also important for coarse-grain design space exploration
- Although a range of (compiler & synthesis) optimizations exist, they have to be carefully guided by heuristics and scripts to achieve desired results
- Transformations from compilers and parallelizing compilers do not directly translate over to synthesis
  - Need to be radically changed with completely different cost models and guiding principles
  - New parallelizing transformations (or transformations that are not useful for compilers) have to be developed for synthesis
Key Insights from Project
- Designers want script-based control over transformations and passes, similar to Synopsys Design Compiler
  - Designer insights can be used to guide transformations, especially coarse-grain code restructuring for design space exploration
- Optimizations that improve schedule length (cycles) do not necessarily improve circuit delay (due to longer critical paths, i.e., clock period)
  - For example, loop unrolling and loop pipelining: they increase the number of operations in the design and hence resource utilization; in turn, the size of multiplexers and controllers increases
- Traditional CDFG and DFG representations used in high-level synthesis are not sufficient for designs with complex control flow
  - A Hierarchical intermediate representation (Hierarchical Task Graphs, HTGs) is required for retaining control and structural information for efficient coarse-level optimizations
  - Full set of data dependencies (RAW, WAR, WAW) is required for correlating output VHDL and C with input C.
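The three dependence kinds mentioned above (RAW, WAR, WAW) can be seen in a three-statement fragment. This is a generic textbook-style sketch; the function names and the statement labels S1 to S3 are introduced here only for illustration.

```c
/* S2 must follow S1 (RAW), and S3 must follow both S1 (WAW) and S2 (WAR). */
int dependences(int a, int b)
{
    int x, y;
    x = a + b;   /* S1: writes x */
    y = x * 2;   /* S2: reads x, read-after-write (RAW) on S1 */
    x = b - 1;   /* S3: writes x, write-after-read (WAR) on S2
                        and write-after-write (WAW) on S1 */
    return x + y;
}

/* Renaming the second x removes the WAR and WAW constraints, so the
   subtraction no longer has to wait for S1 or S2. Only the RAW remains. */
int dependences_renamed(int a, int b)
{
    int x = a + b;    /* S1 */
    int x2 = b - 1;   /* S3, now independent of S1 and S2 */
    int y = x * 2;    /* S2: still RAW on S1 */
    return x2 + y;
}
```

Both versions compute the same result; the renamed one keeps the same name-to-value mapping harder to trace back to the source, which is one reason a tool might track all three dependence kinds.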
Conclusions
- Parallelizing code transformations enable a new range of HLS transformations
  - Provide the needed improvement in quality of HLS results
  - Possible to be competitive against manually designed circuits.
  - Can enable productivity improvements in microelectronic design
- Built a C-to-VHDL synthesis system with a range of code transformations
  - Platform for applying Coarse and Fine-grain Optimizations
  - Tool-box approach where transformations and heuristics can be developed
  - Enables the designer to find the right synthesis script for different application domains
- Performance improvements of 60-70 % across a number of designs
- We have shown its effectiveness on an Intel design
SPARK Release
- Available for download: http://www.cecs.uci.edu/~spark
- User Manual
  - Running the tool
  - Customizing the synthesis scripts
- Tutorial
  - Synthesis of Portion of the Motion Compensation algorithm in MPEG-1 player
Ongoing Work: Interface Synthesis
- Co-Design Targeting an FPGA Platform
- Developed novel memory mapping algorithm to fit memory elements/application onto FPGA platform
[Figure: co-design flow. Labels: C Input (MPEG-1 Pred Block), Execution Profiling, Manual HW/SW Partitioning, Hardware C Description, Software C Description, SPARK High-Level Synthesis, Software Compiler, Interface Synthesis, FPGA Platform (Processor Core, FPGA, Memory, I/O)]
Future Plans or What is Missing
- Need for ability to specify timing of signals
- Interface with logic synthesis tools to enable better module selection, operator chaining/merging
- Time-constrained synthesis
- Power Analysis of parallelizing optimizations
- More transformations such as loop fusion, range analysis required
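For readers unfamiliar with loop fusion, one of the transformations listed as still needed, here is a generic sketch. It is not taken from SPARK; the function names and the array length `LEN` are illustrative assumptions.

```c
#define LEN 4

/* Two separate loops over the same index range. */
void unfused(const int a[LEN], const int b[LEN], int s[LEN], int p[LEN])
{
    for (int i = 0; i < LEN; i++)
        s[i] = a[i] + b[i];
    for (int i = 0; i < LEN; i++)
        p[i] = a[i] * b[i];
}

/* Fused: one loop body holds both operations, which exposes the add and
   the multiply of the same iteration as candidates to run in parallel. */
void fused(const int a[LEN], const int b[LEN], int s[LEN], int p[LEN])
{
    for (int i = 0; i < LEN; i++) {
        s[i] = a[i] + b[i];
        p[i] = a[i] * b[i];
    }
}
```

Fusion is legal here because neither loop reads what the other writes.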
Synthesizable C
- ANSI-C front end from Edison Design Group (EDG)
- Features of C not supported for synthesis
  - Pointers
    - However, Arrays and passing by reference are supported
  - Recursive Function Calls
  - Gotos
- Features for which support has not been implemented
  - Multi-dimensional arrays
  - Structs
  - Continue, Breaks
- Hardware component generated for each function
  - A called function is instantiated as a hardware component in calling function
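The restrictions above suggest a coding style along the following lines. This is a sketch written only to respect the listed restrictions (no explicit pointer use, recursion, goto, multi-dimensional arrays, structs, break, or continue); whether SPARK accepts exactly this code has not been checked, and the function names and sizes are hypothetical.

```c
#define ROWS 2
#define COLS 3

/* Returns the index of the first element equal to key, or -1 if absent.
   The 2-D image is flattened into a 1-D array with a manual index, and a
   guard on `found` stands in for an early break. */
int find_first(const int image[ROWS * COLS], int key)
{
    int r, c, idx;
    int found = -1;
    for (r = 0; r < ROWS; r++) {
        for (c = 0; c < COLS; c++) {
            idx = r * COLS + c;
            if (found < 0 && image[idx] == key)
                found = idx;
        }
    }
    return found;
}

/* A caller: per the slide, the called function would be instantiated as a
   hardware component inside the component generated for this function. */
int contains(const int image[ROWS * COLS], int key)
{
    return find_first(image, key) >= 0;
}
```

The array parameter is the only form of by-reference passing used.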
Resource Utilization Graph
Scheduling
Example of Complex HTG
- Example of a real design: MPEG-1 pred2 function
  - Just for demonstration; you are not expected to read the text
- Multiple nested loops and conditionals
Target Applications

Design            # of Ifs   # of Loops   # Non-Empty Basic Blocks   # of Operations
MPEG-1 pred1      4          2            17                         123
MPEG-1 pred2      11         6            45                         287
MPEG-2 dp_frame   18         4            61                         260
GIMP tiler        11         2            35                         150
Scheduling & Logic Synthesis Results
[Figure: two bar charts, "MPEG-2 DpFrame Function" and "GIMP Tiler Function". Y-axis ticks: 0 to 1.2. Metrics on the x-axis: Longest Path (l, cycles), Critical Path (c, ns), Total Delay (c*l), Unit Area. Bar series: Non-speculative CMs: Within BBs & Across Hier Blocks; + Speculative Code Motions; + Pre-Synthesis Transforms; + Dynamic CSE. Percentage annotations, in extraction order: 14%, 20%, 1%, 33%, 41%, 52%]
- Overall: 48-76 % improvement in Delay
- Almost constant Area
Case Study: Intel Instruction Length Decoder
[Figure: Instruction Length Decoder operating on an Instruction Buffer. Labels: Stream of Instructions, Instruction Length Decoder, First Insn, Second Insn, Third Instruction, Instruction Buffer]
ILD Synthesis: Resulting Architecture
[Figure: Multi-cycle Sequential Architecture transformed into Single cycle Parallel Architecture. Transformation labels: Speculate Operations, Fully Unroll Loop, Eliminate Loop Index Variable]
- Our toolbox approach enables us to develop a script to synthesize applications from different domains
- Final design looks close to the actual implementation done by Intel
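The three steps named on this slide (speculate operations, fully unroll the loop, eliminate the loop index variable) can be sketched on a toy length decoder. Everything in the example is an assumption made for illustration: the 4-byte buffer, the made-up encoding in `insn_len`, and the function names. It is not the Intel design and not SPARK output.

```c
#define BUF 4

/* Hypothetical encoding: the low two bits of a byte give a length of 1..4. */
static int insn_len(unsigned char byte)
{
    return (byte & 3) + 1;
}

/* Sequential form: each iteration needs the previous instruction's length
   before it knows where the next instruction starts. */
void mark_sequential(const unsigned char buf[BUF], int start[BUF])
{
    int i, pos = 0;
    for (i = 0; i < BUF; i++)
        start[i] = 0;
    while (pos < BUF) {
        start[pos] = 1;
        pos += insn_len(buf[pos]);
    }
}

/* Parallel form: lengths are decoded speculatively at every byte position,
   the loop is fully unrolled, and the index is replaced by constants. */
void mark_parallel(const unsigned char buf[BUF], int start[BUF])
{
    int len0 = insn_len(buf[0]);   /* speculative decodes, independent */
    int len1 = insn_len(buf[1]);
    int len2 = insn_len(buf[2]);

    int next1 = len0;                          /* start of 2nd instruction */
    start[0] = 1;
    start[1] = (next1 == 1);
    int next2 = start[1] ? 1 + len1 : next1;   /* next start after byte 1 */
    start[2] = (next2 == 2);
    int next3 = start[2] ? 2 + len2 : next2;   /* next start after byte 2 */
    start[3] = (next3 == 3);
}
```

In the parallel form the length decodes no longer depend on one another; what remains between them is a short chain of comparisons and selections.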