University of Michigan, Electrical Engineering and Computer Science
Reducing Control Power in CGRAs with Token Flow
Hyunchul Park, Yongjun Park, and Scott Mahlke
University of Michigan
Coarse-Grained Reconfigurable Architecture (CGRA)
• Array of PEs connected in a mesh-like interconnect
• High throughput with a large number of resources
• Distributed hardware offers low cost/power consumption
• High flexibility with dynamic reconfiguration
CGRA: Attractive Alternative to ASICs
• Suitable for running multimedia applications for future embedded systems
– High throughput, low power consumption, high flexibility
• Morphosys: 8x8 array with RISC processor
• SiliconHive: hierarchical systolic array
• ADRES: 4x4 array with tightly coupled VLIW
(Figure: example systems, with viterbi at 80 Mbps, h.264 at 30 fps, and 50-60 MOps/mW)
Control Power Explosion
• Large number of configuration signals
– Distributed interconnect, many resources to control
– Nearly 1000 bits each cycle
• No code compression technique developed for CGRAs
– Fully decoded instructions are stored in memory
– 45% of total power
(Figure: a single PE and its fully decoded PE instruction)
Code Compression Techniques
• Huffman encoding
– High efficiency, but sequential decoding process
• Dictionary-based
– Recurring patterns stored in a dictionary
– Not many patterns found in CGRAs
• Instruction-level code compression
– No-op compression: Itanium, DSPs
– Only 17% of operations are no-ops in CGRAs
Fine-grain Code Compression
• Compress unused fields rather than the whole instruction
– Opcode, MUX selection, register address
– Only 35% of fields contain valid information
• Instruction format needs to be stored in the memory
– Information regarding which fields exist in the memory
– Significant overhead: 172 bits (20%) for a 4x4 CGRA
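The idea can be sketched in a few lines of Python; the field names and widths below are illustrative assumptions, not the paper's actual encoding:

```python
# Sketch of fine-grain code compression: store only the fields that carry
# valid information, preceded by a per-field presence bitmask (this bitmask
# is the "instruction format"). Field names and widths are hypothetical.

FIELD_WIDTHS = {"opcode": 4, "src0_mux": 3, "src1_mux": 3, "dest": 5, "reg_addr": 4}

def compress(instr):
    """Return (format_bits, payload) keeping only the fields present in instr."""
    fmt, payload = [], []
    for name, width in FIELD_WIDTHS.items():
        present = name in instr
        fmt.append(1 if present else 0)
        if present:
            payload.append((instr[name], width))
    return fmt, payload

def compressed_size(instr):
    """Total bits stored: the format bitmask plus the valid fields."""
    fmt, payload = compress(instr)
    return len(fmt) + sum(width for _, width in payload)

full_size = sum(FIELD_WIDTHS.values())     # fully decoded instruction width
routing_only = {"src0_mux": 2, "dest": 7}  # a PE that only routes a value
print(full_size, compressed_size(routing_only))
```

The presence bitmask is exactly the per-cycle format overhead the slide quantifies as 172 bits (20%) for a full 4x4 array.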
Dynamic Instruction Format Discovery
• Resources need configuration only when data flows through them
• Instruction format can be discovered by looking at the data flow
• Token network from dataflow machines can be utilized
– A token is 1 bit of information indicating incoming data in the next cycle
– Each PE observes incoming tokens and determines the instruction format
(Figure: FU configured as dest <- src0 + src1; RF configured for a reg write)
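A minimal sketch of the discovery rule, assuming that each input port carrying a token next cycle needs its MUX-select field and that any activity at all requires the opcode field (port and field names are made up):

```python
# Sketch of dynamic instruction format discovery: a PE derives which
# instruction fields to fetch purely from the tokens it observes, so no
# format descriptor needs to be stored in memory.

def discover_format(incoming_tokens):
    """incoming_tokens: set of input-port names that will carry data next
    cycle. Returns the instruction fields this PE must fetch."""
    fields = []
    if incoming_tokens:                 # PE executes something next cycle
        fields.append("opcode")
    for port in sorted(incoming_tokens):
        fields.append(f"{port}_mux")    # select which neighbor feeds this port
    return fields

print(discover_format(set()))            # idle PE fetches nothing: []
print(discover_format({"src0", "src1"})) # ['opcode', 'src0_mux', 'src1_mux']
```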
Dynamic Configuration of PEs
• Each cycle, tokens are sent to the consuming PEs
– Consuming resources collect incoming tokens, discover instruction formats, and fetch only the necessary instruction fields
• Next cycle, resources can execute the scheduled operations
(Figure: cycle-by-cycle example over cycles 0-4 showing the dataflow graph, its mapping, and the resulting configuration; legend: configured, executed, routing node)
Token Generation
• Tokens are generated at the beginning of dataflow: live-in nodes in RFs
• Each RF read port needs token generation info: 26 read ports in a 4x4 CGRA
– 26 bits for token generation vs. 172 bits for instruction format
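The trade-off can be restated numerically; the constants come straight from the slides, while the variable names are mine:

```python
# Back-of-the-envelope comparison: with the token network, each RF read port
# needs only 1 token-generation bit per cycle, replacing the per-cycle
# instruction-format descriptor of the static scheme.

RF_READ_PORTS = 26      # read ports in the 4x4 CGRA (from the slide)
FORMAT_BITS   = 172     # static instruction-format overhead (earlier slide)

token_gen_bits = RF_READ_PORTS * 1      # one bit per read port per cycle
saved_bits = FORMAT_BITS - token_gen_bits
print(token_gen_bits, FORMAT_BITS, saved_bits)
```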
Token Network
• Token network between datapath and decoder
– No instruction format in the memory, only token generation info
– Adds 1 cycle between the IF and EX stages
• Created by cloning the datapath
– 1-bit interconnect with the same topology
– Each resource is translated to a token processing module
– Encodes dest fields, not src fields
Register File Token Module
• Write port MUXes are converted to token receivers
– Determine selection bits
• Read ports are converted to token senders
– Tokens are initially generated here
– Token generation information is stored in a separate memory (token_gen)
FU Token Module
• Input MUXes are converted to token receivers
• Opcode processor
– Fetches the opcode field if necessary
– Determines token type (data/pred) and latency
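A sketch of the opcode processor's role, with an assumed toy opcode table (the real opcode set and latencies are not given in the slides):

```python
# Sketch of an FU token module: when input tokens arrive, the opcode
# processor fetches the opcode field and derives the outgoing token's type
# (data or predicate) and its timing. Table values are hypothetical.

OPCODE_INFO = {  # opcode -> (token type produced, latency in cycles)
    "add": ("data", 1),
    "mul": ("data", 2),
    "cmp": ("pred", 1),
}

def process_opcode(incoming_tokens, opcode):
    """Return (token_type, latency) for the token this FU will send out,
    or None when no input tokens arrived (the FU stays unconfigured)."""
    if not incoming_tokens:
        return None                 # no data flows through: fetch nothing
    return OPCODE_INFO[opcode]      # opcode field fetched only when needed

print(process_opcode({"src0", "src1"}, "cmp"))  # ('pred', 1)
print(process_opcode(set(), "add"))             # None
```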
System Overview
(Figure: system overview showing the datapath, the token network, and the token generation memory)
Experimental Setup
• Target multimedia applications for embedded systems
– Modulo scheduling for compute-intensive loops in 3D graphics, AAC decoder, and AVC decoder (214 loops)
• Three different control path designs
– baseline: fully decoded instructions
– static: fine-grain code compression with the instruction format stored in the memory
– token: fine-grain code compression with the token network
Code Size / Performance
• Fine-grain code compression increases code efficiency
• Token network further improves code efficiency
• Performance degradation
– Sharing of fields, allowing only 2 dests
Power / Area
• SRAM read power is greatly reduced with the token network
– Introducing the token network slightly increases power and area
• Area overhead can be mitigated by the reduced SRAM area
• Hardware overhead for migrating staging predicates into the token network is minimal
Staging Predicates Optimization
• Modulo scheduled loops
– Prolog (filling the pipeline)
– Kernel code (steady state)
– Epilog (draining the pipeline)
• Only kernel code is stored in memory
– Staging predicates control the prolog/epilog phases
(Figure: overlapped execution of a modulo scheduled loop with initiation interval II; stages A, B, C of iterations i0, i1, i2 overlap in the steady state)
Migrating Staging Predicate
• Staging predicate
– Control information, not data dependent
– 10% of configurations are used for routing staging predicates
• Move staging predicates into the control path
– Extend each token by 1 bit: the staging predicate
– Only top nodes are guarded
– Staging predicates flow along with tokens
• Benefits
– Code size reduction
– Performance increase
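One way to picture the migration, as a sketch with assumed data structures: each token carries one extra predicate bit, only top (stage-0) nodes consult the staging predicate directly, and downstream nodes simply propagate the bit:

```python
# Sketch of staging predicates riding on the token network. The dict-based
# token representation and stage bookkeeping are illustrative assumptions.

def send_token(stage_enabled, node_stage):
    """Token emitted for a top node of the dataflow graph. stage_enabled[s]
    says whether pipeline stage s is active in the current prolog/epilog/
    kernel phase; only these top nodes read the staging predicate."""
    return {"valid": True, "pred": stage_enabled[node_stage]}

def forward_token(token):
    """Downstream nodes never consult the predicate memory: they just
    propagate the predicate bit along with the token."""
    return {"valid": token["valid"], "pred": token["pred"]}

prolog = [True, False, False]   # early prolog: only stage 0 is active
t = send_token(prolog, 0)
print(forward_token(t))         # {'valid': True, 'pred': True}
```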
(Figure: staging predicate flowing through the token network alongside data, guarding stages 0-3)
Code Size / Performance
• Code size reduced by 9%
• Migrating staging predicates improves performance by 7%
– a net 5% gain over the baseline
Power / Area
• Power/area of the token network increases due to the valid bit
• Reduced code size decreases SRAM power/area
• Overall overhead for migrating staging predicates is minimal
Overall Power
• System power measured for a kernel loop in AVC
• Introducing the token network reduces the overall system power by 25%, while achieving a 5% performance gain
(Figure: overall power, 226.4 mW baseline vs. 170.0 mW with the token network)
Conclusion
• Fine-grain code compression is a good fit for CGRAs
• Token network can eliminate the instruction format overhead
– Dynamic discovery of the instruction format
– Small overhead (< 3%)
– Migrating staging predicates to the token network improves performance
• Applicable to other highly distributed architectures
Questions?
Token Sender
• Each output port of a resource is converted into a token sender
– FU outputs, routing MUX outputs, register file read ports
• Send out tokens only to the consumers specified in dest fields
– Only two destinations are allowed for each output, which potentially limits performance
Token Receiver
• Input MUXes are converted to token receivers
– Dest fields are stored in the memory, not src fields
– MUX selection bits are determined by the incoming token position
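A sketch of how the selection bits can be derived rather than stored; the port list models a hypothetical 5-input operand MUX:

```python
# Sketch of a token receiver: MUX selection bits are not fetched from
# memory; they are derived from which input position the token arrives on,
# which is why only dest fields need to be encoded at the sender side.

MUX_INPUTS = ["north", "east", "south", "west", "rf_read"]  # assumed layout

def select_bits(token_position):
    """Map the arriving token's input position to MUX selection bits."""
    sel = MUX_INPUTS.index(token_position)
    width = (len(MUX_INPUTS) - 1).bit_length()   # bits needed for 5 inputs
    return format(sel, f"0{width}b")

print(select_bits("south"))   # '010': derived at runtime, never stored
```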
Who Generates Tokens?
• Tokens are generated at the start of dataflow
– Live-ins
– Tokens terminate when they enter a register file
• Tokens terminated in register files can be re-generated
• Read ports of register files generate tokens
– Token generation information at RF read ports is stored separately
– 26 read ports in a 4x4 CGRA
(Figure: token generation example; live-in values produce tokens, tokens terminate at a register file, and the RF read ports re-generate them for the downstream Add nodes)
Reducing Decoder Complexity
• Partitioning the configuration memory and decoder
– Trade-off between the number of memories and decoder complexity
• Design space exploration for memory partitioning
– Which fields are stored in the same memory?
– Sharing of field entries in the memory: under-utilized fields
(Figure: token network feeding multiple partitioned configuration memories, each with its own decoder)
Memory Partitioning
• Bundle fields with the same type: field width uniformity
• Design space exploration result for a 4x4 CGRA
– sharing degree = # total entries / # total fields
• Reduces decoder complexity by 33% over naïve partitioning
– Sharing incurs less than 1% performance degradation
type     | # fields | # memories | # entries per memory | # total entries | sharing degree
opcode   |    16    |     2      |          8           |       16        |      1.0
dest     |    96    |     8      |          8           |       64        |      0.75
const    |    16    |     2      |          6           |       12        |      0.75
reg addr |    48    |     4      |          6           |       24        |      0.5
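The sharing-degree column can be checked directly against the formula on the slide (row values are copied from the table; the helper names are mine):

```python
# Sanity check of the sharing-degree metric:
#   sharing degree = (# total entries) / (# total fields),
# where total entries = memories * entries per memory. Values below 1.0 mean
# several fields share one memory entry.

rows = {  # type: (total_fields, memories, entries_per_memory)
    "opcode":   (16, 2, 8),
    "const":    (16, 2, 6),
    "reg addr": (48, 4, 6),
}

def sharing_degree(total_fields, memories, entries_per_memory):
    return memories * entries_per_memory / total_fields

for name, (f, m, e) in rows.items():
    print(name, sharing_degree(f, m, e))  # opcode 1.0, const 0.75, reg addr 0.5
```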