High Performance Embedded Computing
© 2007 Elsevier
Lecture 18: Hardware/Software Codesign
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf
© 2006 Elsevier
Topics
Platforms.
Performance analysis.
Design representations.
Hardware/software partitioning.
Co-synthesis for general multiprocessors.
Optimization concepts.
Simulation.
Design platforms
Different levels of integration:
  PC + board.
  Custom board with CPU + FPGA or ASIC.
  Platform FPGA.
  System-on-chip.
CPU/accelerator architecture
CPU is sometimes called host.
Accelerator communicates via shared memory.
  May use DMA to communicate.
[Figure: CPU, memory, and accelerator]
Example: Xilinx Virtex-4
System-on-chip:
  FPGA fabric.
  PowerPC.
  On-chip RAM.
  Specialized I/O devices.
FPGA fabric is connected to PowerPC bus.
MicroBlaze CPU can be added in FPGA fabric.
Performance analysis
Must analyze accelerator performance to determine system speedup.
High-level synthesis helps:
  Use as estimator for accelerator performance.
  Use to implement accelerator.
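As a rough illustration of how an accelerator estimate turns into a system speedup figure, the sketch below applies Amdahl's law with an added communication term. The function name, its parameters, and the numbers are hypothetical, and the model assumes the CPU blocks while the accelerator runs.

```python
def system_speedup(t_total, t_kernel, accel_factor, t_comm=0.0):
    """Estimate system speedup when one kernel moves to an accelerator.

    t_total      -- software-only execution time of the whole program
    t_kernel     -- part of t_total spent in the kernel to be accelerated
    accel_factor -- how many times faster the accelerator runs the kernel
    t_comm       -- added time to move data to and from the accelerator
    Assumes the CPU blocks while the accelerator runs.
    """
    if not 0 <= t_kernel <= t_total:
        raise ValueError("t_kernel must lie between 0 and t_total")
    # Unaccelerated part + accelerated kernel + communication overhead.
    t_new = (t_total - t_kernel) + t_kernel / accel_factor + t_comm
    return t_total / t_new


# Hypothetical numbers: 100 ms program, 80 ms kernel, 10x accelerator.
print(system_speedup(100.0, 80.0, 10.0))               # no communication cost
print(system_speedup(100.0, 80.0, 10.0, t_comm=12.0))  # with 12 ms of transfers
```

Even a fast accelerator gives a modest system speedup once the unaccelerated code and the data transfers are counted.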
Data path/controller architecture
Data path performs regular operations, stores data in registers.
Controller provides required sequencing.
[Figure: data path and controller]
High-level synthesis
High-level synthesis creates register-transfer description from behavioral description.
Schedules and allocates:
  Operators.
  Variables.
  Connections.
Control step or time step is one cycle in system controller.
Components may be selected from technology library.
Models
Model as data flow graph.
Critical path is set of nodes on path that determines schedule length.
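A minimal sketch of these ideas, assuming an as-soon-as-possible (ASAP) schedule with unlimited operators: each data flow node starts in the first control step after all of its inputs are ready, and the critical path is traced back from the node that finishes last. The graph, node names, and delays are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def asap_schedule(preds, delay):
    """Return (start, length) for an ASAP schedule of a data flow graph.

    preds -- maps each node to the nodes it depends on
    delay -- maps each node to its duration in control steps
    """
    start = {}
    for node in TopologicalSorter(preds).static_order():
        # A node may start once every predecessor has finished.
        start[node] = max(
            (start[p] + delay[p] for p in preds.get(node, ())), default=0
        )
    length = max(start[n] + delay[n] for n in start)
    return start, length


def critical_path(preds, delay):
    """Trace one longest path back from the node that finishes last."""
    start, _ = asap_schedule(preds, delay)
    node = max(start, key=lambda n: start[n] + delay[n])
    path = [node]
    while preds.get(node):
        # Follow the predecessor that finishes latest.
        node = max(preds[node], key=lambda p: start[p] + delay[p])
        path.append(node)
    return path[::-1]


# Hypothetical data flow graph: sub = (mul + add1) - ...
dfg_preds = {"mul": (), "add1": (), "add2": ("mul", "add1"), "sub": ("add2",)}
dfg_delay = {"mul": 2, "add1": 1, "add2": 1, "sub": 1}
print(asap_schedule(dfg_preds, dfg_delay))
print(critical_path(dfg_preds, dfg_delay))
```

A real high-level synthesis tool also has to respect operator limits, which this sketch ignores.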
Accelerator estimation
How do we use high-level synthesis, etc. to estimate the performance of an accelerator?
We have a behavioral description of the accelerator function.
Need an estimate of the number of clock cycles.
Need to evaluate a large number of candidate accelerator designs. Can’t afford to synthesize them all.
Estimation methods
Hermann et al. used numerical methods.
  Estimated incremental costs due to adding blocks to the accelerator.
Henkel and Ernst used path-based scheduling.
  Cut CDFG into subgraphs: reduce loop iteration count; cut at large joins; divide into equal-sized pieces.
  Schedule each subgraph independently.
Vahid and Gajski estimate controller and data path costs incrementally.
Single- vs. multi-threaded
One critical factor is available parallelism:
  Single-threaded/blocking: CPU waits for accelerator.
  Multithreaded/non-blocking: CPU continues to execute along with accelerator.
To multithread, CPU must have useful work to do.
  But software must also support multithreading.
Execution time analysis
Single-threaded: Count execution time of all component processes.
Multi-threaded: Find longest path through execution.
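The two rules above can be sketched on a small process graph. The process names, dependencies, and times are hypothetical: P1 runs on the CPU, then the accelerator (ACC) and a second CPU process P2 may overlap, and P3 needs both results.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def single_threaded_time(time):
    """Blocking case: every component process runs one after another."""
    return sum(time.values())


def multi_threaded_time(preds, time):
    """Non-blocking case: execution time is the longest path in the graph."""
    finish = {}
    for node in TopologicalSorter(preds).static_order():
        ready = max((finish[p] for p in preds.get(node, ())), default=0)
        finish[node] = ready + time[node]
    return max(finish.values())


# Hypothetical process graph and execution times.
preds = {"P1": (), "ACC": ("P1",), "P2": ("P1",), "P3": ("ACC", "P2")}
time = {"P1": 5, "ACC": 8, "P2": 6, "P3": 4}

print(single_threaded_time(time))        # sum of all processes
print(multi_threaded_time(preds, time))  # longest path P1 -> ACC -> P3
```

The gap between the two results is the time recovered by letting the CPU do useful work while the accelerator runs.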
Hardware-software partitioning
Partitioning methods usually allow more than one ASIC.
Typically ignore CPU memory traffic in bus utilization estimates.
Typically assume that CPU process blocks while waiting for ASIC.
[Figure: CPU, two ASICs, and memory]
Synthesis tasks
Scheduling: make sure that data is available when it is needed.
Allocation: make sure that processes don’t compete for the PE.
Partitioning: break operations into separate processes to increase parallelism, put serial operations in one process to reduce communication.
Mapping: take PE, communication link characteristics into account.
Scheduling and allocation
Must schedule/allocate:
  Computation.
  Communication.
Performance may vary greatly with allocation choice.
[Figure: processes P1, P2, P3 allocated to CPU1 and ASIC1]
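To see how much the allocation choice can matter, the sketch below enumerates every allocation of P1, P2, and P3 to CPU1 and ASIC1 and computes the finish time of each. The dependencies (P3 consuming the results of P1 and P2), the execution times, and the communication delay are all assumptions made for illustration, and processes on the same PE run in the fixed order P1, P2, P3.

```python
from itertools import product

# Hypothetical execution time of each process on each PE.
EXEC = {
    "P1": {"CPU1": 6, "ASIC1": 2},
    "P2": {"CPU1": 5, "ASIC1": 2},
    "P3": {"CPU1": 4, "ASIC1": 3},
}
PREDS = {"P1": (), "P2": (), "P3": ("P1", "P2")}  # assumed dependencies
ORDER = ("P1", "P2", "P3")                        # a topological order
COMM = 3  # assumed transfer time when producer and consumer are on different PEs


def makespan(alloc):
    """Finish time of the last process under a given allocation."""
    pe_free = {"CPU1": 0, "ASIC1": 0}  # time at which each PE becomes idle
    finish = {}
    for p in ORDER:
        pe = alloc[p]
        # Inputs arrive later if they come from the other PE.
        ready = max(
            (finish[q] + (COMM if alloc[q] != pe else 0) for q in PREDS[p]),
            default=0,
        )
        start = max(ready, pe_free[pe])  # a PE runs one process at a time
        finish[p] = start + EXEC[p][pe]
        pe_free[pe] = finish[p]
    return max(finish.values())


for pes in product(("CPU1", "ASIC1"), repeat=len(ORDER)):
    alloc = dict(zip(ORDER, pes))
    print(alloc, makespan(alloc))
```

The model ignores hardware cost, so it only shows the performance side of the trade-off.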
Problems in scheduling/allocation
Can multiple processes execute concurrently?
Is the performance granularity of available components fine enough to allow efficient search of the solution space?
Do computation and communication requirements conflict?
How accurately can we estimate performance?
  Software.
  Custom ASICs.
Partitioning example
Before:
  r = p1(a,b);
  s = p2(c,d);
  z = r + s;
After:
  r = p1(a,b);    s = p2(c,d);
  z = r + s;
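The before/after forms can be sketched in software, assuming that p1 and p2 are independent and so can run as separate processes. The bodies of p1 and p2 are placeholders, since the slide does not define them, and the thread pool stands in for two processing elements.

```python
from concurrent.futures import ThreadPoolExecutor


def p1(a, b):
    return a * b  # placeholder body; the slide does not define p1


def p2(c, d):
    return c + d  # placeholder body; the slide does not define p2


def before(a, b, c, d):
    """One process: p1 and p2 run back to back."""
    r = p1(a, b)
    s = p2(c, d)
    return r + s


def after(a, b, c, d):
    """Two processes: p1 and p2 are issued together, then joined."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fr = pool.submit(p1, a, b)
        fs = pool.submit(p2, c, d)
        return fr.result() + fs.result()  # z = r + s waits for both


print(before(2, 3, 4, 5), after(2, 3, 4, 5))
```

In CPython the threads show the structure of the partitioned program rather than guarantee a speedup for CPU-bound work.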
Problems in partitioning
At what level of granularity must partitioning be performed?
How well can you partition the system without an allocation?
How does communication overhead figure into partitioning?
Problems in mapping
Mapping and allocation are strongly connected when the components vary widely in performance.
Software performance depends on bus configuration as well as CPU type.
Mappings of PEs and communication links are closely related.
Program representations
CDFG: single-threaded, executable, can extract some parallelism.
Task graph: task-level parallelism, no operator-level detail. TGFF generates random task graphs.
UNITY: based on parallel programming language.
Platform representations
Technology table describes PE, channel characteristics:
  CPU time.
  Communication time.
  Cost.
  Power.
Multiprocessor connectivity graph describes PEs, channels.
Type    Speed   Cost
ARM 7   50E6    10
MIPS    50E6    8
[Figure: multiprocessor connectivity graph with PE 1, PE 2, PE 3]
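One possible way to hold these two representations in code is sketched below. The technology table entries come from the table above. The channel list and the assignment of a type to each PE are assumptions, since the figure's edges are not recoverable, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PEType:
    name: str
    speed: float  # as listed in the table; the slide gives no units
    cost: float


# Technology table, taken from the slide.
TECHNOLOGY = {
    "ARM 7": PEType("ARM 7", 50e6, 10),
    "MIPS": PEType("MIPS", 50e6, 8),
}

# Connectivity graph: PEs are nodes, channels are edges.
# Both mappings below are assumed for illustration.
PE_TYPE = {"PE 1": "ARM 7", "PE 2": "MIPS", "PE 3": "MIPS"}
CHANNELS = {("PE 1", "PE 2"), ("PE 2", "PE 3")}


def connected(a, b):
    """True if a channel links the two PEs in either direction."""
    return (a, b) in CHANNELS or (b, a) in CHANNELS


def platform_cost():
    """Total cost of the PEs in the platform, using the technology table."""
    return sum(TECHNOLOGY[t].cost for t in PE_TYPE.values())


print(connected("PE 2", "PE 1"), connected("PE 1", "PE 3"))
print(platform_cost())
```

Keeping the table separate from the graph lets a co-synthesis tool change the PE types without touching the topology.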
Hardware/software partitioning assumptions
CPU type is known.
  Can determine software performance.
Number of processing elements is known.
  Simplifies system-level performance analysis.
Only one processing element can multi-task.
  Simplifies system-level performance analysis.
Two early HW/SW partitioning systems
Vulcan:
  Start with all tasks on accelerator.
  Move tasks to CPU to reduce cost.
COSYMA:
  Start with all functions on CPU.
  Move functions to accelerator to improve performance.
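The two starting points can be contrasted with a pair of greedy loops. This is a simplified illustration of the move directions only, not the actual Vulcan or COSYMA algorithms, which use more elaborate cost models and search strategies. The function table, the deadline, and the purely sequential timing model are hypothetical.

```python
# name: (software time, hardware time, hardware cost), all hypothetical
FUNCS = {
    "f1": (10, 2, 5),
    "f2": (8, 3, 4),
    "f3": (4, 3, 6),
}


def total_time(in_hw):
    """Sequential execution time for a given set of hardware functions."""
    return sum(hw if f in in_hw else sw for f, (sw, hw, _) in FUNCS.items())


def total_cost(in_hw):
    return sum(FUNCS[f][2] for f in in_hw)


def cosyma_like(deadline):
    """Start all-software; move functions to hardware until the deadline holds."""
    in_hw = set()
    while total_time(in_hw) > deadline:
        candidates = [f for f in FUNCS if f not in in_hw]
        if not candidates:
            break  # deadline cannot be met even with everything in hardware
        # Move the function with the most time saved per unit of hardware cost.
        in_hw.add(max(candidates,
                      key=lambda f: (FUNCS[f][0] - FUNCS[f][1]) / FUNCS[f][2]))
    return in_hw


def vulcan_like(deadline):
    """Start all-hardware; move functions to software while the deadline holds."""
    in_hw = set(FUNCS)
    while True:
        movable = [f for f in in_hw if total_time(in_hw - {f}) <= deadline]
        if not movable:
            return in_hw
        # Remove the function whose hardware is most expensive.
        in_hw.remove(max(movable, key=lambda f: FUNCS[f][2]))


print(sorted(cosyma_like(12)), sorted(vulcan_like(12)))
```

On this small table both directions arrive at the same partition, but in general the two starting points can lead to different results.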
Additional Co-synthesis Approaches
Vahid: Binary constraint search.
CoWare: communicating processes model.
Simulated annealing & Tabu search heuristics [Ele96].
LYCOS: CDFG representation [Mad97].
Several others in book (skim).
Multi-objective optimization
Operations research provides notions for optimization functions with multiple objectives.
Pareto optimality: a Pareto-optimal solution cannot be improved in one objective without making another objective worse.
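A small sketch of Pareto filtering, assuming two objectives that are both minimized. The design points, given as (cost, execution time) pairs, are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical designs as (cost, execution time); lower is better for both.
designs = [(10, 9), (12, 5), (15, 4), (14, 6), (11, 9)]
print(pareto_front(designs))
```

The designs (14, 6) and (11, 9) drop out because another design is at least as good in both objectives. Choosing among the remaining points is a trade-off the designer has to make.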
Large search space: Genetic algorithms
Modeled as:
  Genes = strings of symbols.
  Mutations = changes to strings.
Types of moves:
  Reproduction makes a copy of a string.
  Mutation changes a string.
  Crossover interchanges parts of two strings.
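The three moves can be sketched on plain strings. The encoding is an assumption made for illustration: each character marks one task as software ("S") or hardware ("H"). The function names are hypothetical.

```python
import random

ALPHABET = "SH"  # assumed encoding: S = software, H = hardware


def reproduce(gene):
    """Reproduction: the offspring is a copy of the parent string."""
    return str(gene)


def mutate(gene, rng):
    """Mutation: change one randomly chosen symbol to a different symbol."""
    i = rng.randrange(len(gene))
    choices = [c for c in ALPHABET if c != gene[i]]
    return gene[:i] + rng.choice(choices) + gene[i + 1:]


def crossover(a, b, rng):
    """Crossover: swap the tails of two equal-length strings at a random cut."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length strings of length >= 2")
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


rng = random.Random(0)  # fixed seed so runs are repeatable
parent_a, parent_b = "SSSS", "HHHH"
print(reproduce(parent_a))
print(mutate(parent_a, rng))
print(crossover(parent_a, parent_b, rng))
```

A full genetic algorithm would add a fitness function and a selection step around these moves.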
Hardware/software co-simulation
Must connect models with different models of computation, different time scales.
Simulation backplane manages communication.
Becker et al. used PLI in Verilog-XL to add C code that communicates with software models, and UNIX networking to connect the hardware simulator.
Mentor Graphics Seamless
Hardware modules described using standard HDLs.
Software can be loaded as C or binary.
Bus interface module connects hardware models to processor instruction set simulator.
Coherent memory server manages shared memory.