Vector Processing as a Soft-core CPU Accelerator
Jason Yu, Guy Lemieux, Chris Eagleston
{jasony, lemieux, ceaglest}@ece.ubc.ca
University of British Columbia
Prepared for FPGA2008, Altera, and Xilinx
February 26-28, 2008

Motivation
  FPGAs for embedded processing
    High performance, computationally intensive
    Growing use of embedded processors on FPGA
    Nios/MicroBlaze too slow
  Faster performance
    Faster Nios/MicroBlaze
    Multiprocessor-on-FPGA
    Custom hardware accelerator
    Synthesized accelerator

Problems…
  Faster Nios/MicroBlaze not feasible
    2- or 4-way superscalar/VLIW register file maps inefficiently to FPGA
    Superscalar: complex dependency checking
  Multiprocessor-on-FPGA complexity
    Parallel programming and debugging
    System design
    Cache coherence, memory consistency
  Custom hardware accelerator cost
    Need hardware engineer
    Time-consuming to design and debug
    1 hardware accelerator per function

Possible Solutions…
  Automatically synthesized hardware accelerators
    Change software -> regenerate & recompile RTL
    Altera C2H
    Xilinx CHiMPS
    Mitrion Virtual Processor
    CriticalBlue Cascade
  Soft vector processor
    Change software -> same RTL, just recompile software
    Purely software-based
    Decouples hardware/software development teams

Advantages of Vector Processing
  Simple programming model
    Short to long vector data parallelism
    Regular, easy to accelerate
  Purely software-based
    One hardware accelerator supports many applications
  Scalable performance and area

Contributions
  Configurable soft vector processor
    Selectable performance/resource tradeoff
    Area customization
  FPGA-specific enhancements
    Partitioned register file
    Vector reductions using MAC chain
    Local vector datapath memory

Overview of Vector Processing

Acceleration with Vector Processing
  Organize data as long vectors
    Data-level parallelism
  Vector instruction execution
    Multiple vector lanes (SIMD)
    Repeated SIMD operation over length of vector
  [Figure: source vector registers feed the vector lanes, which write the destination vector register]
  Scalar loop:
    for (i=0; i<NELEM; i++) a[i] = b[i] * c[i];
  Equivalent vector instruction:
    vmult a, b, c
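
The loop and the single vmult instruction above compute the same result. As a rough illustration, the following C sketch models how one vector instruction might sweep over VL elements, NLANE elements per step. The names `NLANE` and `vmult_model` and the step counting are assumptions made for this sketch, not the processor's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lane count, chosen only for this sketch. */
#define NLANE 4

/*
 * Software model of "vmult a, b, c": a[i] = b[i] * c[i] for i < vl.
 * Each pass of the outer loop stands for one SIMD step in which every
 * lane handles one element. Returns the number of such steps.
 */
size_t vmult_model(int32_t *a, const int32_t *b, const int32_t *c, size_t vl)
{
    size_t steps = 0;

    assert(a != NULL && b != NULL && c != NULL);
    for (size_t base = 0; base < vl; base += NLANE) {
        /* Inner loop: the work done side by side by the lanes. */
        for (size_t lane = 0; lane < NLANE && base + lane < vl; lane++) {
            a[base + lane] = b[base + lane] * c[base + lane];
        }
        steps++;
    }
    return steps;
}
```

With 8 elements and 4 lanes the model takes 2 steps, which matches the idea of repeating a SIMD operation over the length of the vector.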

Compared to CPUs with SIMD Extensions
  Intel SSE2, PowerPC AltiVec, etc.
    Short, fixed-length vectors (e.g., 4)
    Single cycle per instruction
    Many data pack/unpack instructions
  [Figure: source SIMD registers feed a SIMD unit, which writes the destination SIMD register]

Hybrid vector-SIMD vs Traditional Vector
  for (i=0; i<NELEM; i++) { C[i] = A[i] + B[i]; E[i] = C[i] * D[i]; }
  [Figure: traditional vector processing compared with hybrid vector-SIMD processing for the loop above, showing C and E computed over elements 0-7]

Vector ISA Features
  Vector length (VL) register
  Conditional execution
    Vector flag registers
  Vector addressing modes
    Unit stride
    Constant stride
    Indexed offset
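
The three addressing modes listed above differ only in how the address of element i is formed. The C sketch below models each as a plain load loop. The function names are hypothetical, and strides and indices are counted in elements rather than bytes, which is an assumption made for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unit stride: consecutive elements starting at base. */
void vload_unit(int32_t *dst, const int32_t *base, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = base[i];
}

/* Constant stride: every stride-th element, stride counted in elements. */
void vload_stride(int32_t *dst, const int32_t *base, size_t stride, size_t vl)
{
    assert(stride > 0);
    for (size_t i = 0; i < vl; i++)
        dst[i] = base[i * stride];
}

/* Indexed offset: element i is fetched from base[index[i]] (a gather). */
void vload_indexed(int32_t *dst, const int32_t *base,
                   const size_t *index, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = base[index[i]];
}
```

Unit stride suits contiguous arrays, constant stride suits walking a column of a row-major matrix, and indexed offset suits table lookups.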
  [Figure: Vector Merge Operation. A flag register (0 1 0 0 1 0 1 0) selects, element by element, which source register is merged into the destination register]
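
The merge operation can be modeled as an element-wise select controlled by the flag register. The sketch below is illustrative only: the name `vmerge_model` is hypothetical, and the choice that a flag of 1 picks the first source is an assumption, since the slide does not state the polarity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Element-wise merge of two source vectors under a flag vector.
 * Assumed polarity: flag[i] != 0 takes src_a[i], otherwise src_b[i].
 */
void vmerge_model(int32_t *dst, const int32_t *src_a, const int32_t *src_b,
                  const uint8_t *flag, size_t vl)
{
    assert(dst != NULL && src_a != NULL && src_b != NULL && flag != NULL);
    for (size_t i = 0; i < vl; i++)
        dst[i] = flag[i] ? src_a[i] : src_b[i];
}
```

Because every element is processed regardless of its flag, conditional code becomes straight-line vector code with no branches.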

Example: Simple 5x5 Median Filtering
  Pseudocode (Bubble sort)
    Load the 25 pixel vectors P[0..24]
    For i=0 to 12 {
      minimum = P[i]
      For j=i to 24 {
        if (P[j] < minimum) { swap (minimum, P[j]) }
      }
    }
  Slide “window” over after 1 median
  Repeated over entire image
    Many windows
  [Figure: 5x5 window and its output pixel]

Example: Simple 5x5 Median Filtering (continued)
  Bubble sort on vector registers
    25 rows -> 25 vector registers
    “VL” pixels each
  Vector flag register to mask execution
  “VL” results at once!
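
To make the vectorized sort concrete, here is a minimal C model of the idea: each row p[k] stands in for one vector register holding pixel k of VL neighbouring windows, a compare fills a flag vector, and a masked swap exchanges only the flagged elements. The names `NPIX`, `VL`, and `median5x5_model` and the value VL = 8 are assumptions for this sketch. Unlike the pseudocode, it swaps directly with P[i] and starts the inner loop at i+1, which is a simplification rather than the authors' code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPIX 25  /* pixels in a 5x5 window */
#define VL   8   /* hypothetical vector length for this sketch */

/*
 * p[k][w] holds pixel k of window w, so p[k] plays the role of one
 * vector register. A partial sort of rows 0..12 leaves the median of
 * every window in row 12.
 */
void median5x5_model(uint8_t p[NPIX][VL], uint8_t median[VL])
{
    assert(p != NULL && median != NULL);
    for (int i = 0; i <= NPIX / 2; i++) {
        for (int j = i + 1; j < NPIX; j++) {
            uint8_t flag[VL];

            /* Vector compare: set the flag where p[j] < p[i]. */
            for (size_t w = 0; w < VL; w++)
                flag[w] = p[j][w] < p[i][w];

            /* Masked swap: only flagged elements exchange values. */
            for (size_t w = 0; w < VL; w++) {
                if (flag[w]) {
                    uint8_t t = p[i][w];
                    p[i][w] = p[j][w];
                    p[j][w] = t;
                }
            }
        }
    }
    for (size_t w = 0; w < VL; w++)
        median[w] = p[NPIX / 2][w];
}
```

Each inner-loop iteration corresponds to a few vector instructions, so one pass of the sort yields VL medians instead of one.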
Soft Vector Processor Architecture

[Figure: soft vector processor block diagram, with the following labels]
  Nios II core
  Shared instruction memory (scalar / vector instructions)
  Shared scalar / vector memory interface
  Distributed vector register file
  Overlapped scalar / vector execution
  Configurable memory width
  Configurable number of lanes

[Figure: distributed vector register file, highlighting the elements of one vector register (e.g., v0)]

[Figure: vector lane datapath with local vector datapath memory and MAC chain; the MAC result goes to VLane 0]

Vector Sum Reduction with MAC
  Sum reduction
    R = Σ A[i] * B[i]
    R = Σ A[i] (using B[i] = 1)
  Reduces VL elements in vector register to single number
  Two-instruction sequence:
    vmac: multiply-accumulate to accumulators
    vcczacc: compress copy and zero accumulators
  Side effect: can only reduce 18-bit inputs
  [Figure: accumulate chain]
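
The two-instruction reduction can be sketched in C as follows. This is a behavioural model under several assumptions that the slide does not confirm: the number of accumulators (`NACC`), the mapping of element i to accumulator i % NACC, signed 18-bit operands, and a final scalar addition of the partial sums. All function and type names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accumulator count; the real one depends on NLane and MACL. */
#define NACC 4

typedef struct {
    int64_t acc[NACC];
} mac_model;

/* Assumption: "18-bit inputs" means signed 18-bit operands. */
static int fits_18bit(int32_t x)
{
    return x >= -131072 && x <= 131071;
}

/* vmac: multiply-accumulate element pairs into the accumulators. */
void vmac_model(mac_model *m, const int32_t *a, const int32_t *b, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        assert(fits_18bit(a[i]) && fits_18bit(b[i]));
        /* Assumed mapping: element i feeds accumulator i % NACC. */
        m->acc[i % NACC] += (int64_t)a[i] * b[i];
    }
}

/* vcczacc: copy the accumulators to the start of dst, then zero them. */
size_t vcczacc_model(mac_model *m, int64_t *dst)
{
    for (size_t k = 0; k < NACC; k++) {
        dst[k] = m->acc[k];
        m->acc[k] = 0;
    }
    return NACC;
}

/* Full reduction: vmac, vcczacc, then a scalar sum of the partial results. */
int64_t sum_reduce_model(const int32_t *a, const int32_t *b, size_t vl)
{
    mac_model m = { {0} };
    int64_t partial[NACC];
    int64_t r = 0;

    vmac_model(&m, a, b, vl);
    size_t n = vcczacc_model(&m, partial);
    for (size_t k = 0; k < n; k++)
        r += partial[k];
    return r;
}
```

Passing a vector of ones as b gives the plain sum of a, which mirrors the "using B[i] = 1" case on the slide.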

Configurable Parameters
  Some configurable features
    Number of vector lanes
    Vector ALU width
    Vector memory access granularity (8, 16, 32b)
    Local memory size (or none)
  Strongly affect performance, area

Partial List of Configurable Parameters

                                                                            Soft vector processors
Parameter    Description                                    Typical         V4    V8    V16M32
Primary Parameters
NLane        Number of vector lanes                         4-128           4     8     16
MVL          Maximum vector length                          16-512          16    32    64
VPUW         Processor data width (bits)                    8, 16, 32       32    32    32
MemMinWidth  Minimum accessible data width in memory        8, 16, 32       8     8     32
Parameters for Optional Features
MultW        Multiplier width (bits, 0 is off)              0, 8, 16, 32    16    16    16
MACL         MAC chain length (0 is no MAC)                 0, 1, 2, 4      1     2     0
LMemN        Local memory number of words                   0-1024          256   256   0
LMemShare    Shared local memory address space within lane  On/Off          Off   Off   Off

Performance Results

Benchmarking
  3 sample application kernels
    5x5 median filter
    Motion estimation (full search block matching)
    128-bit AES encryption (MiBench)
  C code, 3 versions
    Nios II
    Nios II with inline vector assembly
    Nios II with C2H accelerator

Methodology and Assumptions
  Compile C code with nios2-gcc
  Run time = Instructions * cycles-per-instruction / Fmax
  Nios II
    Instruction: 1 cycle
    Memory load: 1 cycle
  Nios II with vectors
    Vector instruction: (VL / NLane) cycles
    Vector load: 2 * (VL / NLane) + 2 cycles
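
The cycle assumptions above translate directly into a small cost model. The C sketch below encodes them; the function names are hypothetical, and rounding VL / NLane up when VL is not a multiple of NLane is an assumption, since the slide gives only the exact-division form.

```c
#include <assert.h>

/* Cycles for one vector instruction: VL / NLane, rounded up (assumed). */
unsigned vector_instr_cycles(unsigned vl, unsigned nlane)
{
    assert(nlane > 0);
    return (vl + nlane - 1) / nlane;
}

/* Cycles for one vector load: 2 * (VL / NLane) + 2. */
unsigned vector_load_cycles(unsigned vl, unsigned nlane)
{
    return 2 * vector_instr_cycles(vl, nlane) + 2;
}

/* Run time in seconds = instructions * cycles-per-instruction / Fmax. */
double run_time_seconds(double instructions, double cpi, double fmax_hz)
{
    assert(fmax_hz > 0.0);
    return instructions * cpi / fmax_hz;
}
```

For example, with VL = 64 and NLane = 16 the model charges 4 cycles per vector instruction and 10 cycles per vector load.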

Altera C2H Compiler
  Nios II with C2H accelerator
    Synthesizes HW accelerator from a C function
    C memory reference = master port to that memory
    Current limitations:
      No automatic loop unrolling
      Up to user to efficiently partition memory
  [Figure: Nios II and C2H accelerators connect through master ports and arbiters in the Avalon fabric to the memories]

C2H Methodology
  Compile application kernels with C2H compiler
    Automatic pipelining and scheduling
    Manually unroll loops
    Manually “vectorize” C code
  Nios II with C2H accelerator
    C2H compiler reports # of clock cycles
    Includes memory arbitration overhead

C2H Example
  AES encryption round
    Shift 4 32-bit words (by different amounts)
    4 table lookups
    XOR results, XOR with key
  Acceleration steps
    1. Process multiple blocks in parallel (increase array sizes)
    2. Manually create 4 on-chip memories for 4 lookup tables
  [Figure: 32-bit word]
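
The round steps listed above (shift, four table lookups, XOR, XOR with key) match the shape of the common table-based AES formulation. The C sketch below shows one output word in that style. It is an assumption that the benchmark code is organized exactly this way; the function name is hypothetical, and the four tables are supplied by the caller rather than filled with real AES constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One output word of a table-based AES round. Each of the four input
 * words is shifted by a different amount to select one byte, that byte
 * indexes its own lookup table, and the four results are XORed together
 * and with the round-key word.
 */
uint32_t aes_round_word(const uint32_t t0[256], const uint32_t t1[256],
                        const uint32_t t2[256], const uint32_t t3[256],
                        uint32_t w0, uint32_t w1, uint32_t w2, uint32_t w3,
                        uint32_t round_key)
{
    assert(t0 != NULL && t1 != NULL && t2 != NULL && t3 != NULL);
    return t0[(w0 >> 24) & 0xff]
         ^ t1[(w1 >> 16) & 0xff]
         ^ t2[(w2 >> 8) & 0xff]
         ^ t3[w3 & 0xff]
         ^ round_key;
}
```

Keeping the four tables as separate arrays mirrors acceleration step 2: with one on-chip memory per table, the four lookups do not have to share a single memory port.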

Vector soft processor design flow
  [Flowchart boxes: Design vector algorithm; Develop C code, vector assembly; Compile source code, assemble with vector assembler; Result meets requirements? (Yes / No); Synthesize system, place and route; Configure soft processor parameters; Download application to processor; Determine FPGA resource budget]
  Annotation: Software-only optimizations. Programmers know how to do this!

Hardware accelerator design flow
  [Flowchart boxes: Develop C code for Nios II; Identify areas for HW acceleration; Isolate sections to accelerate into C functions; Analyze compilation estimates; Result meets requirements? (Yes / No); Tune system architecture; Apply optimizations to C source code; Run C2H compiler; Synthesize system, place and route]
  Annotation: Hardware-aware transformations
Resource Utilization
0
0.2
0.4
0.6
0.8
1
NormalizedResource
(to smallestStratix III)
Nios
II/ s
C2H M
edian
C2H M
otion
C2H AES V4 V8 V1
6
ALM M9K DSP Elements
Biggest Stratix III = 7x more resources
Note: These Vector processors include a large local memory in each vector lane (an optional feature), hence the high M9K utilization. Removal would save 60% of M9K in V16.

Resource Utilization Estimates

                             ALM     DSP Elements   M9K    Fmax
Smallest Stratix III         19000   216            108    -
Nios II/s                    489     8              4      153
  + C2H Median filtering     825     8              4*     147
  + C2H Motion estimation    977     10             4*     135
  + C2H AES encryption       2480    8              6*     119
UTIIe                        324     0              3      193
  + V4                       5215    21             32     115
  + V8                       7011    34             53     114
  + V16                      10266   58             95     113

* C2H results are obtained from compiling to Stratix II; uses M4K memories

Results: Clock Cycles
  [Figure: bar chart of clock cycles normalized to ideal Nios II (0 to 1) for Nios II, C2H, V4, V8 and V16; series: median filtering, motion estimation, AES encryption]

Speedup vs Resource Utilization Summary
  [Figure: speedup (0 to 30) versus normalized area in number of ALMs (0 to 30) for the C2H and vector implementations; labeled points include Nios II/s, V16 and V32; series: median filtering, AES encryption, motion estimation]

Summary of Effort
  C2H accelerators
    1. “Vectorize” code for C2H: 1 day
    2. Extra-effort optimization: 1 day
    3. Place-and-route waiting: 1 hour
    Each iteration = 1 day + P&R
  Vector soft processor
    1. Vector algorithm, write vector assembly: 2 days
    2. Revise vector algorithm: 0.5 day
    Each iteration = 0.5 day + SW compile only

Lessons from Vector Processor Design
  Register files
    2-read, 1-write memory very common for CPUs
    Multiple write ports for wide-issue processing
  Wide, flexible vector memory interface very costly
    Memory crossbars: several multi-bit multiplexers
    ~1/3 the resources of soft vector processor (128b, byte access)
  Stratix III specific
    DSP shift chain can no longer dynamically select input
    MAC chain is useful
      Would like 32-bit MAC chain

Current Progress
  Development toolchain integration
    Packaged as SOPC Builder component
    No built-in debug core
      Uses real Nios II processor to download code onto system
    Inline vector assembly in Nios II IDE
  Future work
    Compiler
    Floating-point

Conclusion
  Vector processing maps well to FPGA
    Many small memories, DSP blocks
    Simple programming model
  Soft vector processor
    Purely software-based acceleration
      No hardware design / RTL recompile needed, just program
    One hardware accelerator supports many applications
    Scalable performance and area
      More vector lanes -> more performance for more area
      Soft core parameters/features -> area customization

Conclusion
  FPGA-specific enhancements
    Partitioned register file reduces resource utilization
    MAC chain for efficient vector reduction
    Local vector datapath memory
      Table lookup operations
  Download the processor now! http://www.ece.ubc.ca/~jasony/