CASES 2007

Florida State University

Chris Zimmer, Steve Hines, Prasad Kulkarni, Gary Tyson, David Whalley

Facilitating Compiler Optimizations Through the Dynamic Mapping of Alternate Register Structures

Motivation

Embedded processors have relatively few registers.

Compiler optimizations increase register pressure.

As a result, it is difficult to apply aggressive compiler optimizations on embedded systems.

Vector Multiply Example

Even before aggressive optimizations, 60% of available registers are already used.

Further optimizations like loop unrolling and software pipelining are inhibited.

int A[1000], B[1000];
void vmul() {
    int I;
    for (I = 2; I < 1000; I++)
        B[I] = A[I] * B[I-2];
}

.L3:
    ldr r1,[r2,r3, lsl #2]
    ldr r12,[r4], #4
    mul r0,r12,r1
    str r0,[r5,r3, lsl #2]
    add r3,r3,#1
    cmp r3, #1000
    blt .L3

Application Configurable Processors

Exploit common reference patterns found in code.

Small register files mimic these reference behaviors.

A map table provides register redirection.

The architecture is changed to add more registers while keeping the impact on ISA support minimal; in particular, the operand size does not increase.

Architectural Modifications

[Figure: block diagram of the register file, the map table, and the customized register structures (queues Q1, Q2, and Q3, stack Q4, and circular buffer Q5). Example map table entries: R0 -> R0, R1 -> Q1, R6 -> R6, R15 -> R15.]

Software Pipelining

Software pipelining is not often found in embedded compilers.

Software pipelining reduces the overall cycle time of a loop.

It extracts iterations, so that instructions from several iterations overlap.

It consumes stalls by filling them with useful instructions.

It also consumes registers!
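The effect can be sketched in C on an element-wise multiply loop like the vector multiply example. This is only an illustration: the function names, the pointer parameters, and the one-iteration overlap are assumptions, not the schedule produced in this work. The loads for iteration i+1 are issued before the multiply for iteration i, which hides load latency but keeps extra values live at the same time.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: each iteration loads, multiplies, and stores in turn. */
void vmul_plain(const int *a, const int *c, int *b, size_t n)
{
    assert(a && c && b);
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] * c[i];
}

/* Hand-pipelined loop with an assumed one-iteration overlap.
 * b must not overlap a or c. */
void vmul_pipelined(const int *a, const int *c, int *b, size_t n)
{
    assert(a && c && b);
    if (n == 0)
        return;

    /* Prolog: issue the loads for iteration 0. */
    int next_a = a[0];
    int next_c = c[0];

    /* Kernel: loads for iteration i+1 overlap the multiply for iteration i. */
    for (size_t i = 0; i + 1 < n; i++) {
        int cur_a = next_a;   /* cur_* and next_* are live together: */
        int cur_c = next_c;   /* this is the added register pressure */
        next_a = a[i + 1];
        next_c = c[i + 1];
        b[i] = cur_a * cur_c;
    }

    /* Epilog: finish the last iteration, whose loads were already issued. */
    b[n - 1] = next_a * next_c;
}
```

Both functions compute the same result; the pipelined form simply needs two more values in registers for each stage of overlap.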

Software Pipelining Example

int A[1000], B[1000], C[1000];
void vmul() {
    int I;
    for (I = 2; I < 1000; I++)
        B[I] = A[I] * C[I];
}

Compiled loop:

.L3:
    ldr r1,[r2,r3, lsl #2]
    ldr r12,[r4], #4
    mul r0,r12,r1
    str r0,[r5,r3, lsl #2]
    add r3,r3,#1
    cmp r3, #1000
    blt .L3

Stalls present when the loop is run:

.L3:
    ldr r1,[r2,r3, lsl #2]
    ldr r12,[r4], #4
    stall
    stall
    stall
    mul r0,r12,r1
    stall
    stall
    stall
    str r0,[r5,r3, lsl #2]
    add r3,r3,#1
    cmp r3, #1000
    blt .L3

Instruction

Goal: Minimal modification to the existing instruction set.

Single-cycle instruction latency.

Method: Add a single instruction to the ISA that maps and unmaps a common register specifier into a customized register structure.

qmap <Reg Specifier> <Custom reg map information> <Custom reg specifier>

qmap r3,#4,q3
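A minimal software model of what qmap does to the map table might look like the following. The type and function names are hypothetical, the middle operand is stored as an opaque value because its meaning is not spelled out here, and unmapping is modeled as a separate helper even though the ISA uses the same instruction for both directions.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_REGS = 16, UNMAPPED = -1 };

/* One map-table entry per architected register specifier. */
typedef struct {
    int     structure;  /* index of the customized structure, or UNMAPPED */
    uint8_t info;       /* "custom reg map information" operand, kept opaque */
} MapEntry;

typedef struct {
    MapEntry entry[NUM_REGS];
} MapTable;

/* Start with every specifier referring to the ordinary register file. */
void map_table_init(MapTable *t)
{
    for (int r = 0; r < NUM_REGS; r++) {
        t->entry[r].structure = UNMAPPED;
        t->entry[r].info = 0;
    }
}

/* Model of "qmap rN,#info,qM": redirect specifier rN to structure qM. */
void qmap(MapTable *t, int reg, uint8_t info, int structure)
{
    assert(reg >= 0 && reg < NUM_REGS && structure >= 0);
    t->entry[reg].structure = structure;
    t->entry[reg].info = info;
}

/* Assumed unmap form: the specifier refers to the register file again. */
void qunmap(MapTable *t, int reg)
{
    assert(reg >= 0 && reg < NUM_REGS);
    t->entry[reg].structure = UNMAPPED;
    t->entry[reg].info = 0;
}

int is_mapped(const MapTable *t, int reg)
{
    assert(reg >= 0 && reg < NUM_REGS);
    return t->entry[reg].structure != UNMAPPED;
}
```

With this model, `qmap(&table, 3, 4, 3)` corresponds to the example `qmap r3,#4,q3`.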

Architectural Modifications

[Figure: block diagram of the register file, the map table, and the customized register structures (queues Q1, Q2, and Q3, destructive queue Q4, and circular buffer Q5). Example map table entries: R1 -> Q1, R6 -> R6, R15 -> R15; R0 has no mapping.]

An access to R0, which has no mapping in the table, gets its data from the register file.

R1 is mapped into Q1 and retrieves its data from there.
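The redirection on a register read can be sketched as a simple selection between two data sources. Everything named here (`Machine`, `read_operand`, the array sizes) is an assumption made for illustration, and each customized structure is reduced to the single value it would currently supply, so the sketch shows only the lookup, not the queue behavior.

```c
#include <assert.h>

enum { NUM_REGS = 16, NUM_STRUCTS = 5, UNMAPPED = -1 };

typedef struct {
    int regfile[NUM_REGS];        /* architected register file */
    int struct_out[NUM_STRUCTS];  /* value each customized structure would supply */
    int map[NUM_REGS];            /* map table: structure index or UNMAPPED */
} Machine;

void machine_init(Machine *m)
{
    for (int r = 0; r < NUM_REGS; r++) {
        m->regfile[r] = 0;
        m->map[r] = UNMAPPED;
    }
    for (int s = 0; s < NUM_STRUCTS; s++)
        m->struct_out[s] = 0;
}

/* Every operand read consults the map table first. */
int read_operand(const Machine *m, int reg)
{
    assert(reg >= 0 && reg < NUM_REGS);
    int target = m->map[reg];
    if (target == UNMAPPED)
        return m->regfile[reg];      /* no mapping: use the register file */
    return m->struct_out[target];    /* mapped: use the customized structure */
}
```

If structure index 0 stands for Q1 and `map[1]` is set to 0, a read of R1 returns the value from Q1, while a read of R0 still returns `regfile[0]`.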

Software Pipelining Example

int A[1000], B[1000], C[1000];
void vmul() {
    int I;
    for (I = 2; I < 1000; I++)
        B[I] = A[I] * C[I];
}

[Figure: contents of queues Q1, Q2, and Q3 during the software-pipelined loop.]

Register Usage

Benchmark                   AR in Original Loop   AR Needed to Pipeline   AR Contained in Customized Structures
N Real Updates              10                    10                      6
Dot Product                 9                     9                       4
Matrix Multiply             9                     9                       4
Fir                         6                     6                       4
Mac                         10                    8                       10
Fir2Dim (3 similar loops)   10                    10                      4

Benchmark                   AR in Original Loop   AR Needed to Pipeline   AR Contained in Customized Structures
N Real Updates              10                    10                      6
Dot Product                 9                     9                       4
Matrix Multiply             9                     9                       4
Fir                         6                     6                       4
Mac                         10                    8                       12
Fir2Dim                     10                    10                      4

Benchmark                   AR in Original Loop   AR Needed to Pipeline   AR Contained in Customized Structures
N Real Updates              10                    10                      9
Dot Product                 9                     9                       8
Matrix Multiply             9                     9                       8
Fir                         6                     6                       12
Mac                         10                    8                       18
Fir2Dim                     10                    10                      8

Table captions: Loads 16x4 Register Savings Using Register Structures; Loads 32x4 Register Savings Using Register Structures; Loads 8x4 Register Savings Using Register Structures

Results: multiply latency varied, load latency set at four

[Figure: in-order issue. Percent cycle reduction (y-axis, 0 to 50) versus multiply latency (x-axis: 2, 4, 8, 16, 32) for Dot Product, Matrix, Fir, N Real Updates, Conv 45, Mac, and Fir2Dim.]

Results: load latency varied, multiply latency set at four

[Figure: in-order issue. Percent cycle reduction (y-axis, -10 to 60) versus load latency (x-axis: 2, 4, 8, 16, 32) for Dot Product, Matrix, Fir, N Real Updates, Conv 45, Mac, and Fir2Dim.]

Conclusions

Customized register structures reduce register pressure.

Software pipelining is viable in resource-constrained environments.

Performance can be improved with minor impact on the ISA.

Extras

Reference Behaviors

ldr r1,[r6,r4, lsl #4]
ldr r12,[r6,r4, lsl #8]
ldr r8,[r6,r4, lsl #12]
str r8,[r3,r4, lsl #16]
str r12,[r3,r4, lsl #20]
str r1,[r3,r4, lsl #24]

Stack reference behavior: the values are stored in the reverse of the order in which they were loaded.

Application Configurable Architecture

Application configurable processors are designed around a mapping table similar to the register rename table found in many out-of-order implementations.

The map table is read on every access to the architected register file.

This lookup determines whether a register specifier refers to the original architected register file or to a customized register structure.

Application Configurable Architecture

The customized register files are small, but they efficiently manage values that would otherwise require many architected registers.

The customized register files can mimic queues, stacks, and circular buffers.

These structures are accessed using the same register specifiers that are used to access the architected register file.
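The three access disciplines can be sketched with one small C type. The names, the capacity of four entries, and the exact read semantics are assumptions for illustration: queue and stack reads are modeled as consuming the value, and circular-buffer reads as cycling through the stored values without consuming them. The hardware may differ.

```c
#include <assert.h>

enum { CAPACITY = 4 };

typedef enum { KIND_QUEUE, KIND_STACK, KIND_CIRCULAR } Kind;

typedef struct {
    Kind kind;
    int  data[CAPACITY];
    int  count;  /* number of valid values */
    int  head;   /* next slot to read (queue and circular buffer) */
} RegStruct;

void rs_init(RegStruct *s, Kind kind)
{
    s->kind = kind;
    s->count = 0;
    s->head = 0;
    for (int i = 0; i < CAPACITY; i++)
        s->data[i] = 0;
}

/* A write through the mapped specifier appends a value. */
void rs_write(RegStruct *s, int value)
{
    assert(s->count < CAPACITY);
    /* Queue storage wraps around; stack and circular buffer fill from slot 0. */
    int slot = (s->kind == KIND_QUEUE) ? (s->head + s->count) % CAPACITY
                                       : s->count;
    s->data[slot] = value;
    s->count++;
}

/* A read through the mapped specifier returns a value chosen by the kind. */
int rs_read(RegStruct *s)
{
    assert(s->count > 0);
    int value;
    switch (s->kind) {
    case KIND_QUEUE:                      /* oldest value first, consumed */
        value = s->data[s->head];
        s->head = (s->head + 1) % CAPACITY;
        s->count--;
        return value;
    case KIND_STACK:                      /* newest value first, consumed */
        s->count--;
        return s->data[s->count];
    default:                              /* KIND_CIRCULAR: cycle, keep values */
        value = s->data[s->head];
        s->head = (s->head + 1) % s->count;
        return value;
    }
}
```

In each case the instruction stream names a single register specifier; only the structure behind it decides which stored value a read returns.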

Reference Behaviors

Original code, which exhibits stack reference behavior:

ldr r1,[r6,r4, lsl #4]
ldr r12,[r6,r4, lsl #8]
ldr r8,[r6,r4, lsl #12]
str r8,[r3,r4, lsl #16]
str r12,[r3,r4, lsl #20]
str r1,[r3,r4, lsl #24]

[Figure: a stack holding the values of R8, R12, and R1, accessed through r1.]

The same code with r1 mapped into a stack:

ldr r1,[r6,r4, lsl #4]
ldr r1,[r6,r4, lsl #8]
ldr r1,[r6,r4, lsl #12]
str r1,[r3,r4, lsl #16]
str r1,[r3,r4, lsl #20]
str r1,[r3,r4, lsl #24]

This frees up r8 and r12 for other uses.

Qmap Instruction

[Figure: R8, R12, and R1 shown mapped into the customized structure q0.]

This frees up r8 and r12 for other uses.

Modulo Scheduling

For our work we used modulo scheduling. This uses the dependences and latencies of the loop instructions to generate a modulo-scheduled loop.

The prolog and epilog are then built from this schedule.

The prolog and epilog require register renaming of loop-carried dependences to keep the loop correct. Renaming in embedded processors is often not possible.
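The textbook bounds that a modulo scheduler starts from can be written down in a few lines of C. This is a sketch of the standard formulation, not of the scheduler used in this work, and the function names are hypothetical. The initiation interval (II) is the number of cycles between the starts of consecutive iterations. It is bounded below by resource usage and by dependence cycles that span iterations.

```c
#include <assert.h>

/* Integer division rounded up. */
static int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Resource bound: instructions per iteration over the issue width. */
int res_mii(int num_instructions, int issue_width)
{
    return ceil_div(num_instructions, issue_width);
}

/* Recurrence bound for one dependence cycle: total latency around the
 * cycle divided by the number of iterations the cycle spans. */
int rec_mii(int cycle_latency, int iteration_distance)
{
    return ceil_div(cycle_latency, iteration_distance);
}

/* The minimum initiation interval is the larger of the two bounds. */
int min_ii(int res, int rec)
{
    return res > rec ? res : rec;
}

/* Number of overlapped stages; the prolog and epilog each cover
 * (stages - 1) partially filled kernel copies. */
int stage_count(int schedule_length, int ii)
{
    assert(schedule_length > 0);
    return ceil_div(schedule_length, ii);
}
```

For the vector multiply loop on a single-issue pipeline, `res_mii(7, 1)` is 7. Taking the single-iteration schedule length to be 13 cycles (the 7 instructions plus the 6 stall cycles in the stall listing, which is an assumption about how the stalls add up), `stage_count(13, 7)` is 2, so two iterations overlap in the kernel.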

Register Renaming Due to Software Pipelining

Renaming does not work because there are not enough registers.

Rotating registers would require a significant rewrite of the embedded ISA.

Instead, the loop-carried values can simply be mapped into a register queue that holds each value across several iterations.
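The idea can be sketched in C using the recurrence from the vector multiply example, B[I] = A[I] * B[I-2]. The queue type and function names are hypothetical; a two-entry software FIFO stands in for the register queue. Each product is pushed into the queue and read back two iterations later, so the loop body only ever names one queue instead of a separate renamed register per in-flight value.

```c
#include <assert.h>

enum { DIST = 2 };  /* b[i] depends on b[i - DIST] */

typedef struct {
    int slot[DIST];
    int head;   /* index of the oldest value */
    int count;  /* number of valid values */
} RegQueue;

static void rq_push(RegQueue *q, int value)
{
    assert(q->count < DIST);
    q->slot[(q->head + q->count) % DIST] = value;
    q->count++;
}

static int rq_pop(RegQueue *q)
{
    assert(q->count > 0);
    int value = q->slot[q->head];
    q->head = (q->head + 1) % DIST;
    q->count--;
    return value;
}

/* Reference loop: the carried value is re-read from memory. */
void vmul_reference(const int *a, int *b, int n)
{
    for (int i = DIST; i < n; i++)
        b[i] = a[i] * b[i - DIST];
}

/* Same loop with the carried values held in a queue. */
void vmul_queue(const int *a, int *b, int n)
{
    if (n <= DIST)
        return;

    RegQueue q = { {0}, 0, 0 };
    rq_push(&q, b[0]);  /* prime the queue with the first DIST values */
    rq_push(&q, b[1]);

    for (int i = DIST; i < n; i++) {
        int carried = rq_pop(&q);      /* value produced DIST iterations ago */
        int product = a[i] * carried;
        rq_push(&q, product);          /* becomes visible DIST iterations later */
        b[i] = product;
    }
}
```

Both functions produce the same array contents. The queue version makes the lifetime of each carried value explicit: it stays in the queue for exactly DIST iterations.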

Results: Register Savings

As instruction latency grows, more iterations of the loop are extracted to spread out the latency.

The extra registers that renaming would require were measured at 25% to 200% of the available registers on the ARM.

