A Framework for Studying Effects of VLIW Instruction Encoding and Decoding Schemes

Anup Gangwar
Embedded Systems Group, IIT Delhi
November 28, 2001

Overview

• The VLIW code size expansion problem

• What should such a framework support?

• Trimaran compiler infrastructure

• The HPL-PD architecture

• Extensions to the various modules of Trimaran

• Results

• Future work

• Acknowledgements

Choices for exploiting ILP

• The architectural choices for utilizing ILP
  – Superscalar processors
    • Try to extract ILP at run time
    • Complex hardware
    • Limited clock speeds and high power dissipation
    • Not suited to embedded applications
  – VLIW processors
    • Compiler has a lot of knowledge about the hardware
    • Compiler extracts ILP statically
    • Simplified hardware
    • Possible to attain higher clock speeds

Problems with VLIW processors

• A complex compiler is required to extract ILP from the application program
• Requires adequate hardware support for compiler-controlled execution
• Code size expansion due to explicit NOPs if
  – The application does not contain enough parallelism
  – The compiler is not able to extract parallelism from the application
• Hence the need for good instruction encoding and NOP compression schemes
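
The code size expansion can be illustrated with a small calculation. The sketch below is only an illustration: the five slot names are borrowed from the issue slots shown later in the talk, while the 32-bit operation width, the opcode names and the sample schedule are assumptions made for the example.

```python
# Hypothetical 5-slot machine; OP_BITS and the schedule are assumed values.
SLOTS = ["IALU.0", "IALU.1", "FALU.0", "MU.0", "BU.0"]
OP_BITS = 32  # assumed width of one operation field


def canonical_size_bits(schedule):
    """Uncompressed size: every cycle carries one field per slot, NOPs included."""
    return len(schedule) * len(SLOTS) * OP_BITS


def nop_count(schedule):
    """Number of slots that hold an explicit NOP."""
    return sum(len(SLOTS) - len(cycle) for cycle in schedule)


def useful_size_bits(schedule):
    """Size of the real operations only, ignoring all NOPs."""
    return sum(len(cycle) for cycle in schedule) * OP_BITS


# Each cycle maps the occupied slots to an (illustrative) opcode name.
schedule = [
    {"IALU.0": "ADD_W", "MU.0": "L_W_C1_C1"},
    {"IALU.0": "ADD_W"},
    {"BU.0": "BRU"},
]

print(canonical_size_bits(schedule), nop_count(schedule), useful_size_bits(schedule))
```

With little parallelism in the schedule, most of the canonical encoding is spent on NOPs, which is the gap an encoding scheme tries to close.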

What should such a framework support?

• Quick retargetability
• Studying the effect of a particular instruction encoding and decoding scheme on processor performance
• Studying the code size minimization achieved by a particular instruction encoding scheme
• Studying the memory bandwidth requirements imposed by a particular instruction decoding scheme

Trimaran Compiler Infrastructure

Flow: C Program → IMPACT → Bridge Code → ELCOR → ELCOR IR → SIMULATOR → STATISTICS
(the HMDES Machine Description parameterizes the flow)

• IMPACT
  – ANSI C parsing
  – Code profiling
  – Classical machine independent optimizations
  – Basic block formation
• ELCOR
  – Machine dependent code optimizations
  – Code scheduling
  – Register allocation
• SIMULATOR
  – ELCOR IR to low level C files
  – HPL-PD virtual machine
  – Cache simulation
  – Performance statistics
• STATISTICS
  – Compute and stall cycles
  – Cache stats
  – Spill code info

Various modules of Trimaran - 1

• IMPACT
  – Developed by UIUC’s IMPACT group
  – Trimaran uses only the IMPACT front-end
  – Performs classical machine independent optimizations
  – Outputs a low level IR, the Trimaran bridge code
• ELCOR
  – Developed by HPL’s CAR group
  – It is the compiler back-end
  – Performs register allocation and code scheduling
  – Parameterized by the HMDES machine description
  – Outputs ELCOR IR with annotated HPL-PD assembly

Various modules of Trimaran - 2

• HMDES
  – Developed by UIUC’s IMPACT group
  – Specifies resource usage and latency information for an architecture
  – Input is translated to a low level representation
  – Has efficient mechanisms for querying the database
  – Does not specify instruction format information
• HPL-PD Simulator
  – Developed by NYU’s REACT-ILP group
  – Converts ELCOR’s annotated IR to a low level C representation
  – Processor performance and cache simulation
  – Generates statistics and an execution trace

Various modules of Trimaran - 3

Example ELCOR Operation in IR

Op 7 ( ADD_W [ br<11 :I gpr 14>] [br<27 :I gpr 14> I<1> ]

p<t> s_time( 3 ) s_opcode( ADD_W.0 ) attr(lc ^52) flags( sched ) )
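
As a rough illustration of how the scheduling annotations in such an operation could be read back by a tool, the sketch below pulls the operation id, opcode, s_time and s_opcode out of the example text with regular expressions. It is not Trimaran's own parser; the function name and the returned dictionary layout are assumptions.

```python
import re

# The example operation from the slide, joined onto one logical line.
OP_TEXT = ("Op 7 ( ADD_W [ br<11 :I gpr 14>] [br<27 :I gpr 14> I<1> ] "
           "p<t> s_time( 3 ) s_opcode( ADD_W.0 ) attr(lc ^52) flags( sched ) )")


def parse_op(text):
    """Extract a few fields of one ELCOR op (hypothetical helper)."""
    head = re.match(r"\s*Op\s+(\d+)\s*\(\s*(\w+)", text)
    s_time = re.search(r"s_time\(\s*(\d+)\s*\)", text)
    s_opcode = re.search(r"s_opcode\(\s*([\w.]+)\s*\)", text)
    return {
        "id": int(head.group(1)),
        "opcode": head.group(2),
        "s_time": int(s_time.group(1)),      # cycle the op was scheduled in
        "s_opcode": s_opcode.group(1),       # scheduled opcode alternative
    }


print(parse_op(OP_TEXT))
```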

Various modules of Trimaran - 4

• HMDES Sections
  – Field_Type, e.g. REG, Lit etc.
  – Resource, e.g. Slot0, Slot1 etc.
  – Resource_Usage, e.g. RU_slot0 time( 0 )
  – Reservation_Table, e.g. RT_slot0 use( Slot0 )
  – Operation_Latency, e.g. lat1 ( time( 1 ) )
  – Scheduling_Alternative, e.g. (format(std1) resv(RT1) latency(lat1) )
  – Operation, e.g. ADD_W.0 ( Alt_1 Alt_2 )
  – Elcor_Operation, e.g. ADD_W( op( “ADD_W.0” “ADD_W.1” ) )

Various modules of Trimaran - 5

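HPL-PD Simulator in detail

[Figure: block diagram. REBEL and HMDES feed the Code Processor, which emits low level C files; the Native Compiler builds these together with the C libraries and the Emulation Library into an executable for the host platform.]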

Various modules of Trimaran - 7

HPL-PD Simulator in detail

[Figure: the HPL-PD Virtual Machine (Fetch Next Instruction, Fetch Data, Execute Instruction) sends instruction accesses and data accesses to the Dinero IV Cache Simulator, which models a Level I Instruction-Cache, a Level I Data-Cache and a Level II Unified Cache.]

The HPL-PD architecture

• Parameterized ILP architecture from HP Labs
• Possible to vary
  – Number and types of FUs
  – Number and types of registers
  – Width of instruction words
  – Instruction latencies
• Predicated instruction execution
• Compiler visible cache hierarchy
• Result multicast is supported for predicate registers
• Run time memory disambiguation instructions

The HPL-PD memory hierarchy

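[Figure: memory hierarchy, from top to bottom: Registers, L1 Cache, L2 Cache, Main Memory, with a Data Prefetch Cache alongside.]

Data Prefetch Cache
• Independent of the L1 cache
• Used to store large amounts of cache polluting data
• Doesn’t require a sophisticated cache replacement mechanism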

The Framework

[Figure: block diagram of the framework. Blocks: TRIMARAN, HMDES, Decoder Model, ASSEMBLER (using NJMC), Obj. File, DISASSEMBLER (using NJMC), Instruction Addressor. Outputs: Code Size, Perf. Stats, Cache Stats. Signals: Next Instr Request, Instruction Address, Bytes Fetched.]

Studying impact on performance

• HMDES modeling of the decompressor
  – Add a new resource with the latency of the decoder
  – Add a new resource usage section for this decoder
  – Add this resource usage to all the HPL-PD operations
• The results use two decompressor units with latency = 1
• The latency of the decompressor should be estimated or obtained from actual simulation
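
The effect of treating the decompressor as a shared resource can be sketched with a toy reservation model. This is not how ELCOR schedules: it ignores data dependences and all other resources, and the function name, the operation list and the greedy policy are assumptions. It only shows that with two decoder units of latency 1, a third operation wanting the same cycle slips to the next one.

```python
from collections import defaultdict


def schedule_with_decoders(ops, decoder_units=2):
    """ops: list of (name, earliest_cycle). Every op reserves one decoder
    unit for one cycle at issue. Returns {name: issue_cycle}."""
    used = defaultdict(int)   # cycle -> decoder units already reserved
    issue = {}
    for name, earliest in sorted(ops, key=lambda op: op[1]):
        cycle = earliest
        while used[cycle] >= decoder_units:
            cycle += 1        # resource conflict: try the next cycle
        used[cycle] += 1
        issue[name] = cycle
    return issue


# Illustrative operations; three of them would like to issue in cycle 0.
ops = [("ADD_W", 0), ("L_W_C1_C1", 0), ("MUL_W", 0), ("SUB_W", 1)]
print(schedule_with_decoders(ops))
```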

Studying code size minimization - 1

A simple template based instruction encoding scheme

Issue slots: IALU.0 IALU.1 FALU.0 MU.0 BU.0

Format: MUL_OP | OPCODE & OPERANDS | OPCODE & OPERANDS | …..

• Multi-ops are decided after profiling the generated assembly code.
• The multi-op field encodes:
  – Size and position of each uni-op
  – Number, size and position of the operands of each uni-op

Example, ADD_W and L_W_C1_C1: 00010 | IOP ; Sgpr1, Slit1, Dgpr2 | MemOP ; Sgpr1, Dgpr1
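
To make the template idea concrete, the sketch below computes the encoded size of the example multi-op from a table of templates. The 5-bit template width is read off the "00010" in the example; the 4-bit opcode and register fields and the 32-bit literal are assumptions taken from the field widths in the SLED fragment later in the talk, and the table layout itself is hypothetical.

```python
# Assumed field widths in bits.
FIELD_BITS = {"tmpl": 5, "opcode": 4, "gpr": 4, "lit": 32}

# Template id -> list of (uni-op class, operand kinds). Only the example
# template from the slide is listed.
TEMPLATES = {
    0b00010: [("IOP", ["gpr", "lit", "gpr"]), ("MemOP", ["gpr", "gpr"])],
}


def multi_op_bits(template_id):
    """Size of one encoded multi-op: template field plus each uni-op."""
    bits = FIELD_BITS["tmpl"]
    for _uni_op, operands in TEMPLATES[template_id]:
        bits += FIELD_BITS["opcode"] + sum(FIELD_BITS[kind] for kind in operands)
    return bits


print(multi_op_bits(0b00010))
```

Because the template fixes which slots are present, no bits are spent on the three slots this multi-op leaves empty.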

Studying code size minimization - 2

• Instrumenting ELCOR to generate assembly code
  1. Arrange all the ops in the IR in forward control order
  2. Choose the next basic block (BB) and initialize cycle to 0
  3. Walk the ops of this BB and dump those with s_time = cycle, advancing cycle until every op of the BB has been dumped
  4. If any BBs are left, go to step 2
  5. Dump the global data
• Actual instruction encoding is done using procedures created by NJMC
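
The dumping steps above can be sketched as follows. This is a simplified stand-in for the ELCOR instrumentation, not its actual code: the dictionary-based op records, the textual output format and the explicit NOP line for empty cycles are all assumptions.

```python
def dump_assembly(basic_blocks, global_data):
    """basic_blocks: list of (label, ops), assumed already in forward control
    order; each op is a dict with 's_time' and 's_opcode'."""
    lines = []
    for label, ops in basic_blocks:
        lines.append(f"{label}:")
        last_cycle = max((op["s_time"] for op in ops), default=-1)
        for cycle in range(last_cycle + 1):
            # All ops scheduled in the same cycle form one VLIW instruction.
            issued = [op["s_opcode"] for op in ops if op["s_time"] == cycle]
            lines.append("  " + (" ; ".join(issued) if issued else "NOP"))
    lines.extend(global_data)   # global data is dumped last
    return lines


blocks = [("BB1", [
    {"s_time": 0, "s_opcode": "ADD_W.0"},
    {"s_time": 0, "s_opcode": "L_W_C1_C1.0"},
    {"s_time": 2, "s_opcode": "BRU.0"},
])]
print("\n".join(dump_assembly(blocks, ["  .data sum 4"])))
```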

Studying code size minimization - 3

The New Jersey Machine Code Toolkit

• Deals with bits at symbolic level

• Can be used to write assemblers, disassemblers etc.

• Supports concatenation to emit large binary data

• Representation is specified in SLED

• Has been used to write assemblers for Sparc, i486 etc.

• VLIW instructions need to be broken up into tokens of at most 32 bits

• Emitted binary data must end on an 8-bit boundary
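
The last two constraints can be illustrated with a small helper that pads an instruction to a whole number of bytes and cuts it into tokens of at most 32 bits. This is a sketch, not toolkit code: the function name is hypothetical, and placing the zero padding at the low-order end is an assumption.

```python
def split_into_tokens(value, nbits, max_token=32):
    """Split an nbits-wide instruction into (token_value, width) pairs,
    most significant token first."""
    padded = -(-nbits // 8) * 8        # round up to a multiple of 8 bits
    value <<= padded - nbits           # assumed: pad with zeros at the low end
    tokens = []
    remaining = padded
    while remaining > 0:
        width = min(max_token, remaining)
        remaining -= width
        tokens.append(((value >> remaining) & ((1 << width) - 1), width))
    return tokens


# A 44-bit instruction becomes one 32-bit and one 16-bit token.
print([width for _, width in split_into_tokens(0, 44)])
```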

Studying code size minimization - 4

Machine specifications in SLED

bit 0 is least significant
fields of TOK32 (32) Dgpr_1 0:3 Slit_1_part1 4:31
fields of TOK8 (16) Slit_1_part2 0:3 Sgpr_1 4:7 IOP 8:11 tmpl 12:14
patterns
  IOP_pats is any of [ ADD MUL SUB ], which is tmpl = 1 & IOP = { 0 to 2 }
constructors
  IOP_pats Sgpr_1, Slit_1, Dgpr_1 is
    IOP_pats & Sgpr_1 & Slit_1_part2 = Slit_1 @[28:31];
    Slit_1_part1 = Slit_1 @[0:27] & Dgpr_1
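
For readers unfamiliar with SLED, the bit layout this specification describes can be mimicked by hand. The sketch below packs one IOP instruction into a 16-bit and a 32-bit token using the field positions given above; it is a hand-written stand-in, not the code NJMC generates, and the function name and returned tuple are assumptions. The literal's top four bits go into the 16-bit token and its low 28 bits into the 32-bit token.

```python
IOP_CODES = {"ADD": 0, "MUL": 1, "SUB": 2}   # IOP = { 0 to 2 }, in listed order
TMPL_IOP = 1                                  # tmpl = 1 selects this pattern


def encode_iop(name, sgpr_1, slit_1, dgpr_1):
    """Return (tok16, tok32) for one IOP instruction (hypothetical helper)."""
    assert 0 <= sgpr_1 < 16 and 0 <= dgpr_1 < 16 and 0 <= slit_1 < 2**32
    slit_part2 = (slit_1 >> 28) & 0xF         # Slit_1 @[28:31]
    slit_part1 = slit_1 & 0x0FFFFFFF          # Slit_1 @[0:27]
    # 16-bit token: Slit_1_part2 0:3, Sgpr_1 4:7, IOP 8:11, tmpl 12:14
    tok16 = (TMPL_IOP << 12) | (IOP_CODES[name] << 8) | (sgpr_1 << 4) | slit_part2
    # 32-bit token: Dgpr_1 0:3, Slit_1_part1 4:31
    tok32 = (slit_part1 << 4) | dgpr_1
    return tok16, tok32


tok16, tok32 = encode_iop("MUL", 3, 0x12345678, 5)
print(hex(tok16), hex(tok32))
```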

Studying code size minimization - 5

Toolkit encoder output

ADD( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );

MUL( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );

SUB( unsigned Sgpr_1, unsigned Slit_1, unsigned Dgpr_1 );

Specifying matcher for disassembler

match

| ADD( Sgpr_1, Slit_1, Dgpr_1 ) => //Do something

| MUL( Sgpr_1, Slit_1, Dgpr_1 ) => //Do something

| SUB( Sgpr_1, Slit_1, Dgpr_1 ) => //Do something

endmatch
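
What such a matcher does for the three IOP constructors can be sketched by hand. The code below is not toolkit output: it assumes the field layout from the SLED fragment (a 16-bit token holding tmpl, IOP, Sgpr_1 and the top four literal bits, followed by a 32-bit token holding the remaining literal bits and Dgpr_1), and the function name and return convention are hypothetical.

```python
IOP_NAMES = {0: "ADD", 1: "MUL", 2: "SUB"}


def match_iop(tok16, tok32):
    """Decode one IOP instruction, or return None if no arm matches."""
    tmpl = (tok16 >> 12) & 0x7                # tmpl 12:14
    iop = (tok16 >> 8) & 0xF                  # IOP 8:11
    if tmpl != 1 or iop not in IOP_NAMES:
        return None
    sgpr_1 = (tok16 >> 4) & 0xF               # Sgpr_1 4:7
    # Reassemble the literal from its two parts.
    slit_1 = ((tok16 & 0xF) << 28) | (tok32 >> 4)
    dgpr_1 = tok32 & 0xF                      # Dgpr_1 0:3
    return IOP_NAMES[iop], sgpr_1, slit_1, dgpr_1


print(match_iop(0x1131, 0x23456785))
```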

Studying code size minimization - 6

• The matcher application needs functions for fetching data
• Bit ordering is different on little and big endian machines
• The matcher fails when a large number of complex templates are given
• Breaking large multi-ops across 32-bit tokens makes the representation messy and error prone
• Specifying addresses for forward branches requires two passes
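
The need for two passes follows from variable-length instructions: the address of a forward label is unknown until every instruction before it has been sized. A minimal sketch, assuming a hypothetical program representation with per-instruction byte sizes and illustrative mnemonics:

```python
def assemble(program):
    """program: list of ("label", name) or ("op", mnemonic, size_bytes, target),
    where target is a label name or None. Returns (address, mnemonic,
    resolved_target) tuples."""
    # Pass 1: assign addresses and record where each label lands.
    labels, address = {}, 0
    for item in program:
        if item[0] == "label":
            labels[item[1]] = address
        else:
            address += item[2]
    # Pass 2: emit the ops now that forward targets are known.
    out, address = [], 0
    for item in program:
        if item[0] == "op":
            _, mnemonic, size, target = item
            resolved = labels[target] if target is not None else None
            out.append((address, mnemonic, resolved))
            address += size
    return out


program = [
    ("op", "BRU", 4, "done"),     # forward branch: target not yet seen
    ("op", "ADD_W", 6, None),
    ("label", "done"),
    ("op", "RTS", 4, None),
]
print(assemble(program))
```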

Studying impact on memory bandwidth - 1

The Typical VLIW Pipeline

Instruction Fetch → Align → Decode/Decompress → Instruction Decode → DF/AG → Execute → Store Results

Studying impact on memory bandwidth - 2

• The cache simulation requires the generation of
  – Instruction address
  – No. of bytes to fetch
• Instruction address can be generated by disassembling the instructions at run time and keeping track of jumps
• The matcher application returns the number of bytes required to disassemble an instruction
• The disassembled instruction can be compared with the instruction issued to check correctness
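
The address generation described above can be sketched as a loop that emits one (address, bytes) pair per fetched instruction. In the framework the byte count would come from the matcher; here it is read from a hypothetical table, branches in the table are always taken, and the function name and step limit are assumptions.

```python
def fetch_trace(code, entry=0, max_steps=100):
    """code: {address: (nbytes, branch_target_or_None)}.
    Returns the list of (address, nbytes) requests for the cache simulator."""
    trace, pc = [], entry
    for _ in range(max_steps):
        if pc not in code:
            break                          # ran off the end of the program
        nbytes, target = code[pc]
        trace.append((pc, nbytes))
        # Follow the jump if there is one, otherwise fall through.
        pc = target if target is not None else pc + nbytes
    return trace


# Variable-length instructions; the one at address 8 jumps to address 20.
code = {0: (8, None), 8: (4, 20), 12: (8, None), 20: (6, None)}
trace = fetch_trace(code)
print(trace, sum(nbytes for _, nbytes in trace))
```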

Studying impact on memory bandwidth - 3

• Run time verification of disassembled instructions can be turned off for faster simulation
• Because of the restricted size of the matcher, results could not be obtained for larger programs
• Memory access addresses and bytes to fetch have been generated by hand for the SumToN application

Results - Impact on code size (Strcpy)

[Bar chart: code size for Strcpy. Legend: X86, Sparc, HPL-PD. Data labels: 207, 280, 370. Axis: 0 to 400.]

Results - Impact on code size (SumToN)

[Bar chart: code size for SumToN. Legend: X86, Sparc, HPL-PD. Data labels: 59, 97, 159. Axis: 0 to 200.]

Results - Size of SLED specification for various archs.

[Bar chart: size of the SLED specification. Legend: X86, Sparc, HPL-PD. Data labels: 15553, 11500, 13199. Axis: 0 to 20000.]

Results - Cache performance comparison (SumToN)

[Bar chart: cache performance for SumToN, categories 1 and 2. Legend: Canonical, Encoded. Data labels: 320, 256, 196, 160. Axis: 0 to 350.]

Future work

• Need for automation in most parts of the framework
• Better representation for VLIW instructions than SLED
  – Unlimited token size
  – Facility to bind one field with multiple patterns
• Methodology for predicting the latency of the decompressor
• Framework for finding the optimal instruction formats

Acknowledgements

• Prof. M. Balakrishnan and Prof. Anshul Kumar

• Rodric M. Rabbah, Georgia Institute of Technology

• Shail Aditya, HP Labs

• All the friends at Philips Lab. for stimulating discussions

