Date post: 19-Dec-2015
University of Michigan Electrical Engineering and Computer Science 1 Reducing Control Power in CGRAs with Token Flow Hyunchul Park, Yongjun Park, and Scott Mahlke University of Michigan
Transcript
Page 1: University of Michigan Electrical Engineering and Computer Science 1 Reducing Control Power in CGRAs with Token Flow Hyunchul Park, Yongjun Park, and Scott.

University of Michigan, Electrical Engineering and Computer Science

Reducing Control Power in CGRAs with Token Flow

Hyunchul Park, Yongjun Park, and Scott Mahlke

University of Michigan

Page 2

Coarse-Grained Reconfigurable Architecture (CGRA)

• Array of PEs connected in a mesh-like interconnect
• High throughput with a large number of resources
• Distributed hardware offers low cost/power consumption
• High flexibility with dynamic reconfiguration

Page 3

CGRA : Attractive Alternative to ASICs

• Suitable for running multimedia applications for future embedded systems
  – High throughput, low power consumption, high flexibility
• Morphosys : 8x8 array with RISC processor
• SiliconHive : hierarchical systolic array
• ADRES : 4x4 array with tightly coupled VLIW

[Figure: Morphosys, SiliconHive, and ADRES; labels: Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW]

Page 4

Control Power Explosion

• Large number of configuration signals
  – Distributed interconnect, many resources to control
  – Nearly 1000 bits each cycle
• No code compression technique developed for CGRAs
  – Fully decoded instructions are stored in memory
  – 45% of total power

[Figure: a single PE and its PE instruction]

Page 5

Code Compression

• Huffman encoding
  – High efficiency, but decoding is a sequential process
• Dictionary-based
  – Recurring patterns stored in a dictionary
  – Not many patterns found in CGRAs
• Instruction-level code compression
  – No-op compression : Itanium, DSPs
  – Only 17% of instructions are no-ops in CGRAs

Page 6

Fine-grain Code Compression

• Compress unused fields rather than the whole instruction
  – Opcode, MUX selection, register address
  – 35% of fields contain valid information
• Instruction format needs to be stored in the memory
  – Information regarding which fields exist in the memory
  – Significant overhead : 172 bits (20%) for a 4x4 CGRA
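The idea of storing only valid fields plus a per-instruction format can be sketched as follows. This is a minimal illustration, not the authors' encoding: the field names and bit widths in `FIELDS` are hypothetical, and the format is assumed to be a simple bitmask with one bit per field.

```python
# Hypothetical PE instruction layout: (field name, width in bits).
# Names and widths are illustrative assumptions, not taken from the slides.
FIELDS = [("opcode", 5), ("src0_mux", 3), ("src1_mux", 3), ("dest_reg", 4), ("const", 8)]


def compress(instr):
    """Return (format_mask, payload_bits) for one instruction.

    instr maps field name -> value and omits fields that are unused.
    format_mask has one bit per field (1 = field is stored in memory).
    """
    mask = 0
    payload = 0
    for i, (name, width) in enumerate(FIELDS):
        if name in instr:
            mask |= 1 << i
            payload += width
    return mask, payload


def stored_bits(instr):
    """Bits kept in memory: one format bit per field plus the valid fields."""
    _, payload = compress(instr)
    return len(FIELDS) + payload


full_width = sum(width for _, width in FIELDS)   # fully decoded width: 23 bits
routing_only = {"src0_mux": 2}                   # a PE that only routes a value
print(full_width, stored_bits(routing_only))     # 23 8
```

The format bits are pure overhead: they must be fetched every cycle even when few fields are valid, which is the cost the 172-bit figure above refers to.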

Page 7

Dynamic Instruction Format Discovery

• Resources need configuration only when data flows through them
• Instruction format can be discovered by looking at the data flow
• Token network from dataflow machines can be utilized
  – Token is 1 bit of information indicating incoming data in the next cycle
  – Each PE observes incoming tokens and determines the instruction format

[Figure: example instruction formats. FU : dest <- src0 + src1; RF : reg write]

Page 8

Dynamic Configuration of PEs

• Each cycle, tokens are sent to the consuming PEs
  – Consuming resources collect incoming tokens, discover instruction formats, and fetch only the necessary instruction fields
• Next cycle, resources can execute the scheduled operations

[Figure: dataflow graph, its mapping, and the configuration over cycles 0 to 4; legend: configured, executed, routing node]
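The cycle-by-cycle behavior described on this slide can be sketched as a small simulation. It is a simplified model under stated assumptions: every resource has a latency of one cycle, the resource names are hypothetical, and `dests` stands in for the dest fields that tell each producer where to send its token.

```python
def simulate(dests, sources, cycles):
    """Propagate 1-bit tokens through a dataflow mapping.

    dests:   resource -> list of consumer resources (its dest fields)
    sources: resources that generate tokens in cycle 0 (e.g. live-ins)
    Returns, for each cycle, the sorted list of resources configured in that
    cycle; each of them executes in the following cycle.
    """
    senders = set(sources)
    trace = []
    for _ in range(cycles):
        configured = set()
        for resource in senders:
            configured.update(dests.get(resource, []))  # token reaches consumer
        trace.append(sorted(configured))
        senders = configured          # configured now, executing and sending next
    return trace


# Hypothetical mapping: two live-ins feed an add, which is routed to a multiply.
dests = {"rf0": ["add"], "rf1": ["add"], "add": ["route"], "route": ["mul"]}
print(simulate(dests, ["rf0", "rf1"], 4))
# [['add'], ['route'], ['mul'], []]
```

Resources that never receive a token are never configured, so no instruction fields are fetched for them.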

Page 9

Token Generation

• Tokens are generated at the beginning of dataflow : live-in nodes in RFs
• Each RF read port needs token generation info : 26 read ports in a 4x4 CGRA
  – 26 bits for token generation vs. 172 bits for instruction format

Page 10

Token Network

• Token network between datapath and decoder
  – No instruction format, but token generation info in the memory
  – Adds 1 cycle between IF and EX stage
• Created by cloning the datapath
  – 1 bit interconnect with same topology
  – Each resource translated to a token processing module
  – Encode dest fields, not src fields

Page 11

Register File Token Module

• Write port MUXes are converted to token receivers
  – Determine selection bits
• Read ports are converted to token senders
  – Tokens are initially generated here
  – Token generation information stored in a separate memory

[Figure: register file token module; labels: token_gen, token sender]

Page 12

FU Token Module

• Input MUXes are converted to token receivers
• Opcode processor
  – Fetch opcode field if necessary
  – Determine token type (data/pred), latency

Page 13

System Overview

[Figure: system overview; labels: datapath, token generation]

Page 14

Experimental Setup

• Target multimedia applications for embedded systems
  – Modulo scheduling for compute-intensive loops in 3D graphics, AAC decoder, AVC decoder (214 loops)
• Three different control path designs
  – baseline : fully decoded instructions
  – static : fine-grain code compression with instruction format stored in the memory
  – token : fine-grain code compression with token network

Page 15

Code Size / Performance

• Fine-grain code compression increases code efficiency
• Token network further improves code efficiency
• Performance degradation
  – Sharing of fields, allowing only 2 dests

Page 16

Power / Area

• SRAM read power is greatly reduced with token network
  – Introducing token network slightly increases power and area
• Area overhead can be mitigated with the reduced SRAM area
• Hardware overhead for migrating staging predicates into token network is minimal

Page 17

Staging Predicates Optimization

• Modulo scheduled loops
  – Prolog (filling pipeline)
  – Kernel code (steady state)
  – Epilog (draining pipeline)
• Only kernel code is stored in memory
  – Staging predicates control the prolog/epilog phases

[Figure: Overlapped Execution; iterations i0, i1, i2, each with stages A, B, C; II marked between iteration starts]
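How staging predicates let a single kernel cover the prolog and epilog can be sketched as below. This is a generic illustration of modulo-scheduled loop control, assuming one predicate per stage; the function name and the choice of four loop iterations are hypothetical.

```python
def staging_predicates(num_stages, num_iters):
    """Yield, per kernel execution, one boolean staging predicate per stage.

    Stage s of kernel execution k works on loop iteration k - s, so it is
    enabled only when that iteration exists (0 <= k - s < num_iters).
    """
    for k in range(num_iters + num_stages - 1):
        yield [0 <= k - s < num_iters for s in range(num_stages)]


STAGES = "ABC"  # three stages, as in the overlapped-execution figure
for preds in staging_predicates(len(STAGES), 4):
    print("".join(name if on else "-" for name, on in zip(STAGES, preds)))
# A--
# AB-
# ABC
# ABC
# -BC
# --C
```

The first two rows correspond to the prolog, the middle rows to the steady-state kernel, and the last two to the epilog, all produced by the same kernel code with different predicate values.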

Page 18

Migrating Staging Predicate

• Staging predicate
  – Control information, not data dependent
  – 10% of configurations are used for routing staging predicates
• Move staging predicates into control path
  – Increase token by 1 bit : staging predicate
  – Only top nodes are guarded
  – Staging predicate flows along with tokens
• Benefits
  – Code size reduction
  – Performance increase

[Figure: data and staging predicate flowing through stage 0, stage 1, stage 2, stage 3]

Page 19

Code Size / Performance

• Code size reduction by 9%
• Migrating staging predicates improves performance by 7%
  – 5% increase over baseline

Page 20

Power / Area

• Power/area of token network increases due to valid bit
• Reduced code size decreases SRAM power/area
• Overall overhead for migrating staging predicates is minimal

Page 21

Overall Power

• System power measured for a kernel loop in AVC
• Introducing token network reduces the overall system power by 25%, while achieving 5% performance gain

[Figure: overall system power; values shown: 226.4 mW, 170.0 mW]
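The 25% figure can be checked against the two power values shown on this slide. The sketch assumes that 226.4 mW is the system power without the token network and 170.0 mW is the power with it; the slide text does not label them explicitly.

```python
def power_reduction(before_mw, after_mw):
    """Fractional reduction in power when going from before_mw to after_mw."""
    return (before_mw - after_mw) / before_mw


# Assumed pairing: 226.4 mW without the token network, 170.0 mW with it.
print(f"{power_reduction(226.4, 170.0):.1%}")   # 24.9%
```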

Page 22

Conclusion

• Fine-grain code compression is a good fit for CGRAs
• Token network can eliminate the instruction format overhead
  – Dynamic discovery of instruction format
  – Small overhead (< 3%)
  – Migrating staging predicates to token network improves performance
• Applicable to other highly distributed architectures

Page 23

Questions?

Page 24

Token Sender

• Each output port of a resource is converted into a token sender
  – FU output, routing MUX output, register file read ports
• Send out tokens only to the consumers specified in dest fields
  – Only two destinations are allowed for each output, which potentially limits performance
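A token sender with the two-destination limit can be sketched as follows. The class name, the consumer names, and the choice to reject over-long dest lists with an exception are assumptions made for illustration; the slides do not describe the hardware at this level.

```python
MAX_DESTS = 2  # the slides allow only two destinations per output


class TokenSender:
    """Forwards a 1-bit token from one output port to its configured consumers."""

    def __init__(self, dests):
        if len(dests) > MAX_DESTS:
            raise ValueError("an output may name at most two destinations")
        self.dests = list(dests)

    def send(self, token):
        """Return the set of consumers that receive a token this cycle."""
        return set(self.dests) if token else set()


fu_out = TokenSender(["pe1.src0", "rf0.wr"])   # hypothetical consumer names
print(sorted(fu_out.send(True)))    # ['pe1.src0', 'rf0.wr']
print(sorted(fu_out.send(False)))   # []
```

Under this limit, a value with more than two consumers would presumably have to be forwarded through an additional resource, which is one plausible source of the performance cost mentioned above.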

Page 25

Token Receiver

• Input MUXes are converted to token receivers
  – Dest fields are stored in the memory, not src fields
  – MUX selection bits are determined by the incoming token position
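Deriving MUX selection bits from the token position can be sketched as a one-hot to index conversion. The function name and the error raised when two tokens arrive at once are illustrative assumptions.

```python
def mux_select(incoming):
    """Derive MUX selection bits from the position of the incoming token.

    incoming: list of 0/1 token bits, one per MUX input.
    Returns the index of the input carrying a token, or None when no token
    arrives (the MUX needs no configuration that cycle).
    """
    positions = [i for i, bit in enumerate(incoming) if bit]
    if not positions:
        return None
    if len(positions) > 1:
        raise ValueError("two producers target the same MUX in one cycle")
    return positions[0]


print(mux_select([0, 0, 1, 0]))   # 2
print(mux_select([0, 0, 0, 0]))   # None
```

Because the selection is recovered from where the token arrives, the src field does not need to be stored in the configuration memory.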

Page 27

Who Generates Tokens?

• Tokens are generated at the start of dataflow
  – Live-ins
  – Terminate when they get into a register file
• Tokens terminated in register files can be re-generated
• Read ports of register files generate tokens
  – Token generation information at RF read ports is stored separately
  – 26 read ports in a 4x4 CGRA

[Figure: dataflow graphs with Live and Add nodes, with and without an intermediate RF node]

Page 28

Reducing Decoder Complexity

• Partitioning the configuration memory and decoder
  – Trade-off between number of memories and decoder complexity
• Design space exploration for memory partitioning
  – Which fields are stored in the same memory?
  – Sharing of field entries in the memory : under-utilized fields

[Figure: several MEM blocks, each with its own decoder, connected to the Token Network]

Page 29

Memory Partitioning

• Bundle fields with the same type : field width uniformity
• Design space exploration result for a 4x4 CGRA
  – sharing degree = # total entries / # total fields
• Reduces decoder complexity by 33% over naïve partitioning
  – Sharing incurs less than 1% performance degradation

type       # fields   # memories   # entries   # total entries   sharing degree
opcode     16         2            8           16                1.0
dest       96         8            8           64                0.75
const      16         2            6           12                0.75
reg addr   48         4            6           24                0.5
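The sharing degree formula can be applied to the table rows as a quick check. The sketch assumes that "# entries" is counted per memory, so that # total entries = # memories x # entries, which matches all four rows.

```python
# (type, # fields, # memories, # entries per memory), copied from the table
ROWS = [("opcode", 16, 2, 8), ("dest", 96, 8, 8), ("const", 16, 2, 6), ("reg addr", 48, 4, 6)]


def sharing_degree(fields, memories, entries):
    """sharing degree = # total entries / # total fields."""
    return memories * entries / fields


for name, fields, memories, entries in ROWS:
    total = memories * entries
    print(f"{name:8s} {total:3d} {sharing_degree(fields, memories, entries):.2f}")
```

For the opcode, const, and reg addr rows this reproduces the listed values. For the dest row the formula gives 64 / 96, about 0.67, whereas the table lists 0.75; the slides do not explain the difference.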

