University of Michigan, Electrical Engineering and Computer Science
Reducing Control Power in CGRAs with Token Flow
Hyunchul Park, Yongjun Park, and Scott Mahlke
University of Michigan
Coarse-Grained Reconfigurable Architecture (CGRA)
• Array of PEs connected in a mesh-like interconnect
• High throughput with a large number of resources
• Distributed hardware offers low cost/power consumption
• High flexibility with dynamic reconfiguration
CGRA: Attractive Alternative to ASICs
• Suitable for running multimedia applications for future embedded systems
– High throughput, low power consumption, high flexibility
• Morphosys: 8x8 array with RISC processor
• SiliconHive: hierarchical systolic array
• ADRES: 4x4 array with tightly coupled VLIW
(Figure: example systems, with viterbi at 80 Mbps, h.264 at 30 fps, and 50-60 MOps/mW)
Control Power Explosion
• Large number of configuration signals
– Distributed interconnect, many resources to control
– Nearly 1000 bits each cycle
• No code compression technique developed for CGRAs
– Fully decoded instructions are stored in memory
– 45% of total power
(Figure: a single PE and its fully decoded PE instruction)
Code Compression Techniques
• Huffman encoding
– High efficiency, but sequential decoding process
• Dictionary-based
– Recurring patterns stored in a dictionary
– Not many patterns found in CGRAs
• Instruction-level code compression
– No-op compression: Itanium, DSPs
– Only 17% of operations are no-ops in CGRAs
Fine-grain Code Compression
• Compress unused fields rather than the whole instruction
– Opcode, MUX selection, register address
– Only 35% of fields contain valid information
• Instruction format needs to be stored in the memory
– Information regarding which fields exist in the memory
– Significant overhead: 172 bits (20%) for a 4x4 CGRA
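The idea can be sketched in a few lines of Python; the field names and widths below are illustrative assumptions, not the paper's actual encoding:

```python
# Sketch of fine-grain code compression: store only the fields that carry
# valid information, preceded by a per-field presence bitmask (this bitmask
# is the "instruction format"). Field names and widths are hypothetical.

FIELD_WIDTHS = {"opcode": 4, "src0_mux": 3, "src1_mux": 3, "dest": 5, "reg_addr": 4}

def compress(instr):
    """Return (format_bits, payload) keeping only the fields present in instr."""
    fmt, payload = [], []
    for name, width in FIELD_WIDTHS.items():
        present = name in instr
        fmt.append(1 if present else 0)
        if present:
            payload.append((instr[name], width))
    return fmt, payload

def compressed_size(instr):
    """Total bits stored: the format bitmask plus the valid fields."""
    fmt, payload = compress(instr)
    return len(fmt) + sum(width for _, width in payload)

full_size = sum(FIELD_WIDTHS.values())     # fully decoded instruction width
routing_only = {"src0_mux": 2, "dest": 7}  # a PE that only routes a value
print(full_size, compressed_size(routing_only))
```

The presence bitmask is exactly the per-cycle format overhead the slide quantifies as 172 bits (20%) for a full 4x4 array.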
Dynamic Instruction Format Discovery
• Resources need configuration only when data flows through them
• Instruction format can be discovered by looking at the data flow
• Token network from dataflow machines can be utilized
– A token is 1 bit of information indicating incoming data in the next cycle
– Each PE observes incoming tokens and determines the instruction format
(Figure: FU configured as dest <- src0 + src1; RF configured for a reg write)
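A minimal sketch of the discovery rule, assuming that each input port carrying a token next cycle needs its MUX-select field and that any activity at all requires the opcode field (port and field names are made up):

```python
# Sketch of dynamic instruction format discovery: a PE derives which
# instruction fields to fetch purely from the tokens it observes, so no
# format descriptor needs to be stored in memory.

def discover_format(incoming_tokens):
    """incoming_tokens: set of input-port names that will carry data next
    cycle. Returns the instruction fields this PE must fetch."""
    fields = []
    if incoming_tokens:                 # PE executes something next cycle
        fields.append("opcode")
    for port in sorted(incoming_tokens):
        fields.append(f"{port}_mux")    # select which neighbor feeds this port
    return fields

print(discover_format(set()))            # idle PE fetches nothing: []
print(discover_format({"src0", "src1"})) # ['opcode', 'src0_mux', 'src1_mux']
```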
Dynamic Configuration of PEs
• Each cycle, tokens are sent to the consuming PEs
– Consuming resources collect incoming tokens, discover instruction formats, and fetch only the necessary instruction fields
• Next cycle, resources can execute the scheduled operations
(Figure: cycle-by-cycle example over cycles 0-4 showing the dataflow graph, its mapping, and the resulting configuration; legend: configured, executed, routing node)
Token Generation
• Tokens are generated at the beginning of dataflow: live-in nodes in RFs
• Each RF read port needs token generation info: 26 read ports in a 4x4 CGRA
– 26 bits for token generation vs. 172 bits for instruction format
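The trade-off can be restated numerically; the constants come straight from the slides, while the variable names are mine:

```python
# Back-of-the-envelope comparison: with the token network, each RF read port
# needs only 1 token-generation bit per cycle, replacing the per-cycle
# instruction-format descriptor of the static scheme.

RF_READ_PORTS = 26      # read ports in the 4x4 CGRA (from the slide)
FORMAT_BITS   = 172     # static instruction-format overhead (earlier slide)

token_gen_bits = RF_READ_PORTS * 1      # one bit per read port per cycle
saved_bits = FORMAT_BITS - token_gen_bits
print(token_gen_bits, FORMAT_BITS, saved_bits)
```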
Token Network
• Token network between datapath and decoder
– No instruction format in the memory, only token generation info
– Adds 1 cycle between the IF and EX stages
• Created by cloning the datapath
– 1-bit interconnect with the same topology
– Each resource is translated to a token processing module
– Encodes dest fields, not src fields
Register File Token Module
• Write port MUXes are converted to token receivers
– Determine selection bits
• Read ports are converted to token senders
– Tokens are initially generated here
– Token generation information is stored in a separate memory (token_gen)
FU Token Module
• Input MUXes are converted to token receivers
• Opcode processor
– Fetches the opcode field if necessary
– Determines token type (data/pred) and latency
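A sketch of the opcode processor's role, with an assumed toy opcode table (the real opcode set and latencies are not given in the slides):

```python
# Sketch of an FU token module: when input tokens arrive, the opcode
# processor fetches the opcode field and derives the outgoing token's type
# (data or predicate) and its timing. Table values are hypothetical.

OPCODE_INFO = {  # opcode -> (token type produced, latency in cycles)
    "add": ("data", 1),
    "mul": ("data", 2),
    "cmp": ("pred", 1),
}

def process_opcode(incoming_tokens, opcode):
    """Return (token_type, latency) for the token this FU will send out,
    or None when no input tokens arrived (the FU stays unconfigured)."""
    if not incoming_tokens:
        return None                 # no data flows through: fetch nothing
    return OPCODE_INFO[opcode]      # opcode field fetched only when needed

print(process_opcode({"src0", "src1"}, "cmp"))  # ('pred', 1)
print(process_opcode(set(), "add"))             # None
```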
System Overview
(Figure: system overview showing the datapath, the token network, and the token generation memory)
Experimental Setup
• Target multimedia applications for embedded systems
– Modulo scheduling for compute-intensive loops in 3D graphics, AAC decoder, and AVC decoder (214 loops)
• Three different control path designs
– baseline: fully decoded instructions
– static: fine-grain code compression with the instruction format stored in the memory
– token: fine-grain code compression with the token network
Code Size / Performance
• Fine-grain code compression increases code efficiency
• Token network further improves code efficiency
• Performance degradation
– Sharing of fields, allowing only 2 dests
Power / Area
• SRAM read power is greatly reduced with the token network
– Introducing the token network slightly increases power and area
• Area overhead can be mitigated by the reduced SRAM area
• Hardware overhead for migrating staging predicates into the token network is minimal
Staging Predicates Optimization
• Modulo scheduled loops
– Prolog (filling the pipeline)
– Kernel code (steady state)
– Epilog (draining the pipeline)
• Only kernel code is stored in memory
– Staging predicates control the prolog/epilog phases
(Figure: overlapped execution of a modulo scheduled loop with initiation interval II; stages A, B, C of iterations i0, i1, i2 overlap in the steady state)
Migrating Staging Predicate
• Staging predicate
– Control information, not data dependent
– 10% of configurations are used for routing staging predicates
• Move staging predicates into the control path
– Extend each token by 1 bit: the staging predicate
– Only top nodes are guarded
– Staging predicates flow along with tokens
• Benefits
– Code size reduction
– Performance increase
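One way to picture the migration, as a sketch with assumed data structures: each token carries one extra predicate bit, only top (stage-0) nodes consult the staging predicate directly, and downstream nodes simply propagate the bit:

```python
# Sketch of staging predicates riding on the token network. The dict-based
# token representation and stage bookkeeping are illustrative assumptions.

def send_token(stage_enabled, node_stage):
    """Token emitted for a top node of the dataflow graph. stage_enabled[s]
    says whether pipeline stage s is active in the current prolog/epilog/
    kernel phase; only these top nodes read the staging predicate."""
    return {"valid": True, "pred": stage_enabled[node_stage]}

def forward_token(token):
    """Downstream nodes never consult the predicate memory: they just
    propagate the predicate bit along with the token."""
    return {"valid": token["valid"], "pred": token["pred"]}

prolog = [True, False, False]   # early prolog: only stage 0 is active
t = send_token(prolog, 0)
print(forward_token(t))         # {'valid': True, 'pred': True}
```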
(Figure: staging predicate flowing through the token network alongside data, guarding stages 0-3)
Code Size / Performance
• Code size reduced by 9%
• Migrating staging predicates improves performance by 7%
– a net 5% gain over the baseline
Power / Area
• Power/area of the token network increases due to the valid bit
• Reduced code size decreases SRAM power/area
• Overall overhead for migrating staging predicates is minimal
Overall Power
• System power measured for a kernel loop in AVC
• Introducing the token network reduces the overall system power by 25%, while achieving a 5% performance gain
(Figure: overall power, 226.4 mW baseline vs. 170.0 mW with the token network)
Conclusion
• Fine-grain code compression is a good fit for CGRAs
• Token network can eliminate the instruction format overhead
– Dynamic discovery of the instruction format
– Small overhead (< 3%)
– Migrating staging predicates to the token network improves performance
• Applicable to other highly distributed architectures
Questions?
Token Sender
• Each output port of a resource is converted into a token sender
– FU outputs, routing MUX outputs, register file read ports
• Send out tokens only to the consumers specified in dest fields
– Only two destinations are allowed for each output, which potentially limits performance
Token Receiver
• Input MUXes are converted to token receivers
– Dest fields are stored in the memory, not src fields
– MUX selection bits are determined by the incoming token position
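A sketch of how the selection bits can be derived rather than stored; the port list models a hypothetical 5-input operand MUX:

```python
# Sketch of a token receiver: MUX selection bits are not fetched from
# memory; they are derived from which input position the token arrives on,
# which is why only dest fields need to be encoded at the sender side.

MUX_INPUTS = ["north", "east", "south", "west", "rf_read"]  # assumed layout

def select_bits(token_position):
    """Map the arriving token's input position to MUX selection bits."""
    sel = MUX_INPUTS.index(token_position)
    width = (len(MUX_INPUTS) - 1).bit_length()   # bits needed for 5 inputs
    return format(sel, f"0{width}b")

print(select_bits("south"))   # '010': derived at runtime, never stored
```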
Who Generates Tokens?
• Tokens are generated at the start of dataflow
– Live-ins
– Tokens terminate when they enter a register file
• Tokens terminated in register files can be re-generated
• Read ports of register files generate tokens
– Token generation information at RF read ports is stored separately
– 26 read ports in a 4x4 CGRA
(Figure: token generation example; live-in values produce tokens, tokens terminate at a register file, and the RF read ports re-generate them for the downstream Add nodes)
Reducing Decoder Complexity
• Partitioning the configuration memory and decoder
– Trade-off between the number of memories and decoder complexity
• Design space exploration for memory partitioning
– Which fields are stored in the same memory?
– Sharing of field entries in the memory: under-utilized fields
(Figure: token network feeding multiple partitioned configuration memories, each with its own decoder)
Memory Partitioning
• Bundle fields with the same type: field width uniformity
• Design space exploration result for a 4x4 CGRA
– sharing degree = # total entries / # total fields
• Reduces decoder complexity by 33% over naïve partitioning
– Sharing incurs less than 1% performance degradation
type     | # fields | # memories | # entries per memory | # total entries | sharing degree
opcode   |    16    |     2      |          8           |       16        |      1.0
dest     |    96    |     8      |          8           |       64        |      0.75
const    |    16    |     2      |          6           |       12        |      0.75
reg addr |    48    |     4      |          6           |       24        |      0.5
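The sharing-degree column can be checked directly against the formula on the slide (row values are copied from the table; the helper names are mine):

```python
# Sanity check of the sharing-degree metric:
#   sharing degree = (# total entries) / (# total fields),
# where total entries = memories * entries per memory. Values below 1.0 mean
# several fields share one memory entry.

rows = {  # type: (total_fields, memories, entries_per_memory)
    "opcode":   (16, 2, 8),
    "const":    (16, 2, 6),
    "reg addr": (48, 4, 6),
}

def sharing_degree(total_fields, memories, entries_per_memory):
    return memories * entries_per_memory / total_fields

for name, (f, m, e) in rows.items():
    print(name, sharing_degree(f, m, e))  # opcode 1.0, const 0.75, reg addr 0.5
```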