DARMSTADT, GERMANY - 11/07/2013
A Framework for Effective Exploitation of Partial Reconfiguration in Dataflow Computing
Riccardo Cattaneo∗, Xinyu Niu†, Christian Pilato∗, Tobias Becker†, Wayne Luk†, Marco D. Santambrogio∗
∗ Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
† Department of Computing, Imperial College London
International Workshop on Reconfigurable Communication-centric Systems-on-Chip (RECOSOC13)
Motivations
The design of heterogeneous, reconfigurable systems is a complex task
Adequate computer-aided design (CAD) tools are required
One of the foreseen predominant platforms of the future is the MPSoC
Lots of heterogeneous cores on a single chip
Typically, we want to accelerate an application or a class of applications on the MPSoC
Starting point should be the application, not the architecture alone
Decisions in the frontend phase may strongly affect the backend implementation
Iterative exploration is a practical requirement
This is an ongoing project at Politecnico di Milano to assist in the design of such complex systems
Contents
Framework Overview
Preliminary Results – Test Case
Conclusions and Future Work
Framework Overview
Inputs (single XML file):
• Information about the target device
• Application source files (.c) plus custom pragmas for additional information (e.g., task-level parallelism/kernels)
• Architectural template to use
Application Analysis:
• Task graph generation
• Dataflow graph generation (per function)
High Level Analysis:
• Estimates of resource consumption for each node (DFG based)
Mapping and Scheduling:
• Mapping, scheduling
• Refinement of the architectural template
Output:
• Project files ready for synthesis with back-end tools
XML Exchange Format
The entire project is contained inside an XML file:
• Architecture: components’ characteristics (e.g., reconfigurable regions), …
• Applications: source code files and profiling information
• Library: task implementations with their characterization (time, resources, ...)
• Partitions: task graph, mapping and scheduling, …
It allows a modular organization of the framework and the sharing of information among the different phases
Specific details of the target platform are taken into account only in the final phase (interaction with backend tools)
Task Graph Generation
Application source code files can be analyzed to extract the task graphs
Profiling information can drive the generation of such solutions
The task graph is then specified in the XML file as processing nodes connected by data transfers
#define BPP 3  /* bytes per output pixel; assumed, not shown on the slide */
#pragma omp task
void threshold(unsigned char *original1, unsigned char *result,
               unsigned char thresh, int *p)
{
    /* p packs the image width and the bounds of the window to process */
    int DIMH = p[0];
    int minH1 = p[1], maxH1 = p[2];
    int minV1 = p[3], maxV1 = p[4];
    int v, h;
    for (v = minV1; v < maxV1; v++)
        for (h = minH1; h < maxH1; h++) {
            /* white above the threshold, black otherwise */
            unsigned char out = original1[v*DIMH + h] > thresh ? 255 : 0;
            result[v*DIMH*BPP + h*BPP]     = out;
            result[v*DIMH*BPP + h*BPP + 1] = out;
            result[v*DIMH*BPP + h*BPP + 2] = out;
        }
}
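A task graph of "processing nodes connected by data transfers" fits in a very small data structure. The sketch below is only an illustration: the type and function names are hypothetical and do not reflect the framework's XML schema, and the linear chain S → B → E → T is an assumption (the task names are the ones used in the experimental results).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8
#define MAX_EDGES 16

/* One data transfer between two processing nodes (indices into nodes[]). */
typedef struct { int src; int dst; } transfer_t;

typedef struct {
    const char *nodes[MAX_NODES];  /* task names */
    int         n_nodes;
    transfer_t  edges[MAX_EDGES];  /* data transfers */
    int         n_edges;
} task_graph_t;

/* Append a transfer src -> dst; returns 0 on success, -1 if full or invalid. */
int tg_add_transfer(task_graph_t *g, int src, int dst)
{
    assert(g != NULL);
    if (g->n_edges >= MAX_EDGES) return -1;
    if (src < 0 || dst < 0 || src >= g->n_nodes || dst >= g->n_nodes) return -1;
    g->edges[g->n_edges].src = src;
    g->edges[g->n_edges].dst = dst;
    g->n_edges++;
    return 0;
}

/* Number of incoming transfers; 0 means the task has no predecessors. */
int tg_in_degree(const task_graph_t *g, int node)
{
    int i, d = 0;
    for (i = 0; i < g->n_edges; i++)
        if (g->edges[i].dst == node) d++;
    return d;
}

/* Build the assumed linear chain S -> B -> E -> T. */
task_graph_t tg_edge_detection(void)
{
    task_graph_t g = { { "S", "B", "E", "T" }, 4, { { 0, 0 } }, 0 };
    int i;
    for (i = 0; i + 1 < g.n_nodes; i++)
        tg_add_transfer(&g, i, i + 1);
    return g;
}
```

Calling tg_edge_detection() and then tg_in_degree() on each node identifies the task without predecessors, which is the kind of information a scheduler needs first.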
Library Generation: a collection of different implementations
LLVM-based compiler to extract the dataflow graph of each task
Estimation of required resources (including bit-width analysis)
• Possibility to interact with HLS tools to obtain more accurate results (trading off design time against estimation accuracy)
Generated implementations are then stored in the XML file to offer opportunities to the mapper and floorplacer
Politecnico di Milano/Imperial College London joint effort to integrate High Level Analysis techniques into the toolchain
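Bit-width analysis matters for resource estimation because a hardware operator only needs as many bits as its value range requires. The sketch below shows the basic idea under simple assumptions (unsigned values, worst-case bounds); the function names are hypothetical and this is not the analysis implemented in the LLVM-based compiler.

```c
#include <assert.h>

/* Minimum number of bits needed to represent every value in [0, max_value]. */
unsigned bits_for_unsigned(unsigned long max_value)
{
    unsigned bits = 1;  /* even the value 0 occupies one bit */
    while ((max_value >>= 1) != 0)
        bits++;
    return bits;
}

/* Bits needed for the sum of n operands, each bounded by max_value.
   Conservative: assumes all operands can reach the bound at once. */
unsigned bits_for_sum(unsigned long max_value, unsigned long n)
{
    assert(n > 0);
    assert(max_value <= (unsigned long)-1 / n);  /* bound must not overflow */
    return bits_for_unsigned(max_value * n);
}
```

For example, bits_for_unsigned(255) covers an 8-bit pixel such as the unsigned char data of the threshold kernel, and bits_for_sum(255, 3) gives the width of an adder that accumulates three such values, which is far narrower than a default 32-bit int.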
Mapping, Scheduling and Floorplacing
We generate one or more configurations where each task of the application is analyzed and assigned (via Mapping, Scheduling and Floorplanning, M/S/FP) to:
• An available and admissible implementation
• A component of the architecture (GPP, IP or reconfigurable region)
This makes it possible to:
• “share” implementations across different tasks (hardware sharing)
• move a task implementation to another processing element at run-time (task relocation)
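The assignment described above can be pictured as one record per task. This is a minimal sketch with hypothetical names (mapping_t, shares_hardware, relocate); it only illustrates what hardware sharing and task relocation mean for such a record and is not the framework's internal representation.

```c
#include <assert.h>
#include <stddef.h>

/* Kinds of architecture components a task can be mapped to. */
typedef enum { COMP_GPP, COMP_IP, COMP_RECONF_REGION } comp_kind_t;

typedef struct {
    const char *task;     /* task name */
    int         impl_id;  /* chosen implementation (index into the library) */
    comp_kind_t kind;     /* kind of component hosting the task */
    int         comp_id;  /* which instance of that kind */
} mapping_t;

/* Hardware sharing: two tasks use the same implementation
   on the same component instance. */
int shares_hardware(const mapping_t *a, const mapping_t *b)
{
    assert(a != NULL && b != NULL);
    return a->impl_id == b->impl_id
        && a->kind == b->kind
        && a->comp_id == b->comp_id;
}

/* Task relocation: move a task to another processing element
   while keeping its implementation. */
void relocate(mapping_t *m, comp_kind_t kind, int comp_id)
{
    assert(m != NULL);
    m->kind = kind;
    m->comp_id = comp_id;
}
```

Two tasks mapped to the same implementation in the same reconfigurable region share hardware; after relocate() moves one of them to a different region, they no longer do, but the implementation choice is unchanged.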
Architecture Exploration
During exploration, the target architecture can be refined
• Adding/removing processing elements (reconfigurable regions)
• Modifying their parameters
• Determining the proper interconnection topology
It can iteratively affect:
• mapping and scheduling: modification to the computational resources (especially the number of reconfigurable regions)
• floorplacing: resources might become scarcer or more available as more or fewer components have to be floorplaced
It allows a progressive and iterative refinement of the solution and a concurrent customization of both architecture and application
E.g.: mapping and floorplacing can suggest which resources should be added
Supported Platforms
Virtex-5 XC5VLX110T (embedded)
• Two XCF32P Platform Flash PROMs (32Mbyte each)
• SystemACE™ Compact Flash configuration controller
• 64-bit wide 256Mbyte DDR2 small outline DIMM (SODIMM)
Maxeler MaxWorkstation (HPC system)
• Intel i7 [email protected], 16GB RAM, 500GB HDD
• Max3 dataflow engine (DFE)
• Virtex 6 SX475T FPGA, 24GB memory
• DFE connected to CPU via PCI Express
[Figure: block diagrams of the two platforms. XUPV5: reconfigurable area, CPU0, CPU1, DDR2 (256MB). MaxWorkstation: CPUs, DRAM (16GB), MAX3 DFE with interface FPGA, compute FPGA and DRAM (24GB).]
Backend Toolchains
[Figure: backend flows for the two targets. Inputs: .c and .xml (source code for CPU, DFGs for HW tasks, mapping configurations). FPGA-based embedded system: CPU compiler, DFG-C, HLS (C-VHDL), manual VHDL implementations, bitstream generation. MaxWorkstation: DFG-MaxJ, HLS (MaxJ-VHDL), manual MaxJ implementations, MaxIDE, bitstream generation. Outputs: exec, bin, bit.]
The code can always be further optimized by hand, e.g., glue code for data transfers
Helper Graphical User Interface
Practical GUI to support the designer, to limit the errors in the interactions with the XML and to allow custom design methodologies
Preliminary results: edge detection
Edge detection application: 4 stages of computation
• C + custom #pragmas based description
• Extracted task graph and corresponding DFG of the first stage (Scale, 1x parallelism)
We generate 4 implementations with different levels of parallelism and resource consumption for each of the 4 tasks of the application
“parallelism X”: X pixels processed at once
Maxeler Backend
Experimental Results / 1
Static vs. reconfigurable design (both extracted using the framework)
We limit the available area to 10 kLUT and implement the best-performing design
Reconfigurable (parallelism 8), regions as labeled on the slide: R0:S,T  R1:B,E
Task Name     Area Occupation
S             664
B             64
E             7680
T             7376
Region Name   Final Area Occupation
R0            max(664,64)=664
R1            max(7680,7376)=7680
Total area consumption: 664+7680=8344
Static (parallelism 4), one IP core per task: IP0:S  IP1:B  IP2:E  IP3:T
Task Name     Area Occupation
S             332
B             32
E             3840
T             3688
Total area consumption: 332+32+3840+3688=7892
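The area figures follow two rules: a reconfigurable region must be as large as the biggest task it hosts (max), while blocks that are resident at the same time add up (sum). The sketch below replays the slide's numbers under that reading; the function names are hypothetical, and the grouping of tasks per region follows the operands of the max(...) expressions in the table.

```c
#include <assert.h>
#include <stddef.h>

/* Area of a reconfigurable region: it must fit its largest task. */
unsigned region_area(const unsigned *task_areas, size_t n)
{
    unsigned best = 0;
    size_t i;
    assert(task_areas != NULL && n > 0);
    for (i = 0; i < n; i++)
        if (task_areas[i] > best) best = task_areas[i];
    return best;
}

/* Area of blocks that are all resident at once (static IPs or regions). */
unsigned total_area(const unsigned *areas, size_t n)
{
    unsigned sum = 0;
    size_t i;
    assert(areas != NULL && n > 0);
    for (i = 0; i < n; i++)
        sum += areas[i];
    return sum;
}

/* Reconfigurable design at parallelism 8: two time-multiplexed regions. */
unsigned reconfigurable_total_p8(void)
{
    const unsigned r0[] = { 664, 64 };     /* operands of the R0 max() */
    const unsigned r1[] = { 7680, 7376 };  /* operands of the R1 max() */
    unsigned regions[2];
    regions[0] = region_area(r0, 2);
    regions[1] = region_area(r1, 2);
    return total_area(regions, 2);
}

/* Static design at parallelism 4: four IP cores side by side. */
unsigned static_total_p4(void)
{
    const unsigned ips[] = { 332, 32, 3840, 3688 };
    return total_area(ips, 4);
}

/* Hypothetical static design at parallelism 8, for comparison only. */
unsigned static_total_p8(void)
{
    const unsigned ips[] = { 664, 64, 7680, 7376 };
    return total_area(ips, 4);
}
```

reconfigurable_total_p8() reproduces the 8344 of the table. static_total_p8() comes out at 15784, above the 10 kLUT budget, which is consistent with the static design stopping at parallelism 4 while the reconfigurable one reaches parallelism 8.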
Experimental Results / 2
Reconfiguration time is automatically masked (when possible)
Partial reconfiguration improves application performance via automatic resource multiplexing
• Automatic because different schedules are explored
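One way to read "masked": a region is reconfigured while other tasks are still running, so the next task finds its region ready when its input data arrives. The sketch below computes the residual stall under that assumption; the function names and the abstract time units are hypothetical, and the framework's actual scheduler is not shown in the slides.

```c
#include <assert.h>
#include <limits.h>

/* Time the next task waits for its region.
   reconf_start: earliest time the region can start reconfiguring
   reconf_time:  duration of the partial reconfiguration
   data_ready:   time at which the next task's inputs are available */
unsigned reconf_stall(unsigned reconf_start, unsigned reconf_time,
                      unsigned data_ready)
{
    unsigned reconf_end;
    assert(reconf_time <= UINT_MAX - reconf_start);  /* no wrap-around */
    reconf_end = reconf_start + reconf_time;
    return reconf_end > data_ready ? reconf_end - data_ready : 0u;
}

/* Fully masked: the task never waits for its region. */
int reconf_is_masked(unsigned reconf_start, unsigned reconf_time,
                     unsigned data_ready)
{
    return reconf_stall(reconf_start, reconf_time, data_ready) == 0u;
}
```

A schedule that frees the region early enough gives a stall of zero; one that frees it late leaves part of the reconfiguration time exposed, which is why exploring different schedules can hide it.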
Experimental Results / 3
HLA estimates are fairly accurate, given that they are extracted in a matter of seconds on a commodity desktop machine.
Average values over the set of tasks
Average accuracy is > 85%
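The slides do not define how accuracy is measured. A common choice, assumed here, is one minus the relative error of the estimate against the value measured after synthesis, averaged over the tasks. The function names below are hypothetical and no data from the experiments is reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Accuracy of one estimate: 1 - |estimated - measured| / measured.
   Can become negative when the error exceeds 100% of the measured value. */
double estimate_accuracy(double estimated, double measured)
{
    double err;
    assert(measured > 0.0);
    err = estimated > measured ? estimated - measured : measured - estimated;
    return 1.0 - err / measured;
}

/* Average accuracy over a set of n tasks. */
double average_accuracy(const double *estimated, const double *measured,
                        size_t n)
{
    double sum = 0.0;
    size_t i;
    assert(estimated != NULL && measured != NULL && n > 0);
    for (i = 0; i < n; i++)
        sum += estimate_accuracy(estimated[i], measured[i]);
    return sum / (double)n;
}
```

Under this definition an estimate that is 10% too low and one that is 10% too high both score the same, so the average reflects the size of the errors and not their direction.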
Conclusions and Future Work
We presented a modular framework to design heterogeneous, reconfigurable systems
• Easy to plug in alternative methods for each of the phases
• Possibility to perform progressive refinement of both application and architecture
Critical part: multi-objective optimization strategy
• Different experiments with different heuristics or possibly different algorithms
• Easy to plug in different components
This is becoming part of a larger project (ASAP – Advanced Synthesis of Applications and Platforms)
• SystemC TLM backend for (co-)simulation and early validation
• More architectural templates
• Closer interaction with actual synthesis (e.g., high-level synthesis)
• Automated methodologies to accelerate the design
Thank you!
Riccardo Cattaneo
Research partially funded by the European Community’s Seventh Framework Programme, FASTER project.