University of Michigan, Electrical Engineering and Computer Science

Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability

Kevin Fan, Hyunchul Park, Manjunath Kudlur, Scott Mahlke

Advanced Computer Architecture Laboratory, University of Michigan

April 8, 2008


Introduction

• Emerging applications have high performance, cost, energy demands
  – H.264, wireless, software radio, signal processing
  – 10-100 Gops required
  – 200 mW power budget

• Applications dominated by tight loops processing large amounts of streaming data

[Figure: iPhone board]


Loop Accelerators

[Figure: a loop in C code is turned into hardware built from LD, +/-, and * units]


Hardware Implementations

• Customization gets order-of-magnitude performance and efficiency wins
  – Viterbi: 100x speedup vs. ARM9

[Figure: design space plotted as flexibility vs. efficiency and performance, with points for general-purpose processors, DSPs, CGRAs, FPGAs, multifunction loop accelerators, and loop accelerators/ASICs]


What About Programmability?

• Software changes – bug fixes, evolving standards
• dct_8x8() from H.264 reference implementation

Version 13.0:

    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
      i=pos_scan[coeff_ctr][0];
      j=pos_scan[coeff_ctr][1];
      run++;
      ilev=0;
      if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
      {
        MCcoeff = MC(coeff_ctr);
        runs[MCcoeff]++;
      }
      m7 = &curr_res[block_y + j][block_x];
      level = iabs (m7[i]);
      if (img->AdaptiveRounding)
      {
        fadjust8x8[j][block_x+i] = 0;
      }
      if (level != 0)
      {
        nonzero = TRUE;

        {
          *coeff_cost += MAX_VALUE;
          img->cofAC[b8+pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]);
          img->cofAC[b8+pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
          ++scan_pos;
          runs[MCcoeff]=-1;
        }
        else
        {

          ACLevel[scan_pos ] = isignab(level,m7[i]);
          ACRun [scan_pos++] = run;
          run=-1; // reset zero level counter
        }
        level = isignab(level, m7[i]);
        ilev = level;
      }
    }

Version 13.1 and Version 13.2 (both panels extract to the same surviving lines):

      run++;
      ilev=0;

      {

        runs[MCcoeff]++;
      }

      {

      }
      if (level != 0)
      {
        nonzero = TRUE;

        {

          img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]);
          img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
          ++scan_pos;
          runs[MCcoeff]=-1;
        }
        else
        {

        }

        ilev = level;
      }
    }


Programmable Loop Accelerator

• Reusable hardware → reduced NRE costs
• Generalize accelerator without losing efficiency

[Figure: design space plotted as flexibility vs. efficiency and performance, with points for general-purpose processors, DSPs, CGRAs, FPGAs, multifunction loop accelerators, programmable loop accelerators, and loop accelerators/ASICs]


Flexible Accelerators

• Generalize accelerator architecture
• Map new loops to existing hardware

[Figure: Loop 1 passes through the synthesis system to produce the hardware; Loop 2 passes through the compiler to run on that hardware]


Loop Accelerator Architecture

• Hardware realization of modulo scheduled loop
• Parameterized execution resources, storage, connectivity

[Figure: loop accelerator datapath. Function units (+, &, MEM) joined by point-to-point connections, local memory, an FSM producing control signals, CRF, and BR]


Programmable Accelerator Architecture

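[Figure: programmable accelerator datapath. Function units (+/-, &/|, MEM) with blocks labeled RR, joined by point-to-point connections plus a bus, with literals, local memory, a control memory producing control signals, CRF, and BR]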

• ~50% area overhead vs. non-programmable accelerator

• Generalize architectural features that limit programmability


Mapping Loops onto Hardware

                Processor                Accelerator
FUs             General-purpose          Customized
Storage         Central register file    Distributed registers
Connectivity    Homogeneous              Point-to-point

[Figure: processor with two ALUs sharing a CRF, next to an accelerator with LD, +/-, and * units]
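The contrast in the table can be made concrete with a small machine description. The sketch below is illustrative only: the `Machine` class, its field names, the FU names, and the specific wires are assumptions made for this example, not the machine description format used by the authors.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Machine:
    """Toy machine description: function units plus who can feed whom."""

    fus: dict[str, set[str]]  # FU name -> opcodes it supports (hypothetical)
    links: set[tuple[str, str]] = field(default_factory=set)  # direct wires
    central_rf: bool = False  # True: every FU reads and writes one shared CRF

    def supports(self, fu: str, opcode: str) -> bool:
        return opcode in self.fus[fu]

    def can_route(self, src: str, dst: str) -> bool:
        # With a central register file any producer can feed any consumer;
        # otherwise a dedicated point-to-point wire must exist.
        return self.central_rf or (src, dst) in self.links


# Processor: general-purpose FUs, central register file, homogeneous connectivity.
processor = Machine(
    fus={"ALU0": {"ld", "add", "sub", "mul"}, "ALU1": {"ld", "add", "sub", "mul"}},
    central_rf=True,
)

# Accelerator: customized FUs, distributed storage, point-to-point wires only.
accelerator = Machine(
    fus={"LD": {"ld"}, "ADDSUB": {"add", "sub"}, "MUL": {"mul"}},
    links={("LD", "ADDSUB"), ("ADDSUB", "MUL")},
)

print(processor.can_route("ALU1", "ALU0"))    # True
print(accelerator.can_route("LD", "ADDSUB"))  # True
print(accelerator.can_route("MUL", "LD"))     # False
```

On the processor every operation fits every FU and every value can reach every consumer, so placement is unconstrained. On the accelerator both the opcode and the wiring restrict where an operation may go, which is what makes mapping a new loop hard.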


Scheduling Example

[Figure: step-by-step schedule of operations LD1, +2, +3, +4, and +5 on ADDER1, ADDER2, and MEM over time slots 0-4 with II=2; the final frame marks +5 with "?"]
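The bookkeeping behind a schedule like this is usually a modulo reservation table: an operation placed at time t occupies row t mod II of its function unit in every iteration. The sketch below is a minimal illustration under that standard definition. The class name and the placements are assumptions chosen for the example and do not reproduce the slide's exact schedule.

```python
class ModuloReservationTable:
    """Tracks which FU is busy in which row, where row = time mod II."""

    def __init__(self, fus, ii):
        self.ii = ii
        self.rows = {fu: [None] * ii for fu in fus}

    def is_free(self, fu, time):
        return self.rows[fu][time % self.ii] is None

    def place(self, op, fu, time):
        """Reserve the slot and return True, or return False on a conflict."""
        if not self.is_free(fu, time):
            return False
        self.rows[fu][time % self.ii] = op
        return True


mrt = ModuloReservationTable(["ADDER1", "ADDER2", "MEM"], ii=2)
print(mrt.place("LD1", "MEM", 0))    # True
print(mrt.place("+2", "ADDER1", 1))  # True
print(mrt.place("+4", "ADDER1", 2))  # True: 2 mod 2 == 0, row 0 is free
print(mrt.place("+5", "ADDER1", 3))  # False: 3 mod 2 == 1, held by "+2"
```

This table covers resource conflicts only. On an accelerator with point-to-point wiring, a slot can be free and still unusable because the operand cannot reach that function unit.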


Modulo Scheduling for LAs

• Large search space, few solutions
• Op-centric approaches unable to find solutions
• Satisfiability Modulo Theories (SMT) formulation to solve linear and SAT constraints simultaneously

[Figure: compilation flow with stages Move Insertion, SMT Scheduling, and Register Allocation; inputs Loop and Machine description; output Control Signals; feedback edge Increment II]
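The "Increment II" feedback edge can be sketched as an outer loop that retries scheduling at successively larger II. In the sketch below an exhaustive search stands in for the SMT solver, and only resource and dependence constraints are modeled. All function names, the data layout, and the example loop are assumptions for illustration, not the authors' implementation.

```python
from itertools import product


def schedule_at_ii(ops, deps, fus, ii, horizon):
    """Exhaustively search for one (fu, time) per op; return a dict or None.

    ops:  {op_name: opcode}
    deps: [(producer, consumer, latency, distance)]
    fus:  {fu_name: set of supported opcodes}
    """
    names = list(ops)
    choices = [
        [(fu, t) for fu, codes in fus.items() if ops[n] in codes
         for t in range(horizon)]
        for n in names
    ]
    for combo in product(*choices):
        sched = dict(zip(names, combo))
        # Resource constraint: at most one op per FU per modulo row.
        rows = {(fu, t % ii) for fu, t in combo}
        if len(rows) < len(combo):
            continue
        # Dependence constraint: time(j) >= time(i) + lat - dist * II.
        if all(sched[j][1] >= sched[i][1] + lat - dist * ii
               for i, j, lat, dist in deps):
            return sched
    return None


def modulo_schedule(ops, deps, fus, min_ii=1, max_ii=8):
    """Try each II in turn; return (ii, schedule) or None if all fail."""
    for ii in range(min_ii, max_ii + 1):
        sched = schedule_at_ii(ops, deps, fus, ii, horizon=2 * ii + 2)
        if sched is not None:
            return ii, sched
    return None


# Hypothetical loop: one load feeding a chain of three adds, one adder.
ops = {"LD1": "ld", "+2": "add", "+3": "add", "+4": "add"}
deps = [("LD1", "+2", 1, 0), ("+2", "+3", 1, 0), ("+3", "+4", 1, 0)]
fus = {"MEM": {"ld"}, "ADDER1": {"add"}}

result = modulo_schedule(ops, deps, fus)
ii, sched = result
print(ii)  # 3: three adds cannot share one adder at II=1 or II=2
```

Exhaustive search is only workable for toy inputs like this one. The slide's point is that a real accelerator leaves few valid solutions in a large space, which motivates handing the constraints to a solver.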


SMT Formulation

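• Boolean variables $X_{i,f,t}$ are true if operation $i$ is scheduled on FU $f$ at time slot $t$.

• Integer variables $S_i$ represent the stage of operation $i$.

• For a dependence edge $i \to j$ with latency $\mathrm{lat}(i)$ and iteration distance $\mathrm{dist}(i,j)$:

$$\mathrm{sched\_time}(j) \ge \mathrm{sched\_time}(i) + \mathrm{lat}(i) - \mathrm{dist}(i,j) \cdot \mathrm{II}$$

• Expressed over the scheduling variables:

$$\left( X_{i,f_i,t_i} \wedge X_{j,f_j,t_j} \right) \rightarrow \left( S_j \cdot \mathrm{II} + t_j \ge S_i \cdot \mathrm{II} + t_i + \mathrm{lat}(i) - \mathrm{dist}(i,j) \cdot \mathrm{II} \right)$$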

• More details in paper
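A worked instance may help in reading the dependence constraint, taking the schedule time of an operation to be its stage times II plus its time slot. The numbers are hypothetical: assume $\mathrm{II} = 2$, $\mathrm{lat}(i) = 2$, $\mathrm{dist}(i,j) = 1$, and operation $i$ placed in stage $S_i = 0$ at slot $t_i = 1$.

```latex
\begin{align*}
S_j \cdot \mathrm{II} + t_j
  &\ge S_i \cdot \mathrm{II} + t_i + \mathrm{lat}(i) - \mathrm{dist}(i,j) \cdot \mathrm{II} \\
  &= 0 \cdot 2 + 1 + 2 - 1 \cdot 2 \\
  &= 1
\end{align*}
```

Under these assumptions, $j$ may sit at $S_j = 0$, $t_j = 1$ or any later time, while $S_j = 0$, $t_j = 0$ violates the constraint. If the edge were within one iteration ($\mathrm{dist}(i,j) = 0$), the bound would rise to 3, and the earliest legal placement would be $S_j = 1$, $t_j = 1$.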


Measuring Programmability

• How well can different loops be mapped onto the same hardware?

• Performance matters – how much does II increase?
• Need set of loops with different degrees of similarity

[Figure: several loops being mapped onto one piece of hardware made of FUs, marked with "?"]


Graph Perturbation

• Synthetically generated graphs
• More perturbations → less similar to original graph
• Iteratively apply random transformations:
  – Add edge between existing operations
  – Add edge with new producer
  – Add edge with new consumer
  – Remove edge
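The four transformations can be sketched as follows. This is a minimal illustration, not the authors' generator: the function name, the graph representation, the uniform choice among transformations, and the `op<k>` naming of new operations are all assumptions. It also does not model opcodes or edge distances, and a step can be a no-op (for example, adding an edge that already exists).

```python
import random


def perturb(nodes, edges, steps, seed=0):
    """Apply `steps` random transformations to a dataflow graph.

    nodes: non-empty list of operation names
    edges: set of (producer, consumer) pairs
    Returns new (nodes, edges); the inputs are not modified.
    """
    rng = random.Random(seed)
    nodes, edges = list(nodes), set(edges)

    def fresh_name():
        k = len(nodes)
        while f"op{k}" in nodes:
            k += 1
        return f"op{k}"

    for _ in range(steps):
        kind = rng.choice(
            ["add_edge", "new_producer", "new_consumer", "remove_edge"])
        if kind == "add_edge" and len(nodes) >= 2:
            src, dst = rng.sample(nodes, 2)
            edges.add((src, dst))
        elif kind == "new_producer":
            new = fresh_name()
            edges.add((new, rng.choice(nodes)))
            nodes.append(new)
        elif kind == "new_consumer":
            new = fresh_name()
            edges.add((rng.choice(nodes), new))
            nodes.append(new)
        elif kind == "remove_edge" and edges:
            edges.remove(rng.choice(sorted(edges)))
    return nodes, edges


base_nodes = ["ld", "add", "mul"]
base_edges = {("ld", "add"), ("add", "mul")}
new_nodes, new_edges = perturb(base_nodes, base_edges, steps=5)
print(len(new_nodes), len(new_edges))
```

Calling `perturb` with increasing `steps` on the same base graph yields a family of loops that drift progressively further from the original, which is the property the experiment needs.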


Results – Perturbed Graphs

[Figure: # perturbations (y-axis, 0-20) per benchmark (x-axis: dcac, dequant, fft, fir, fmradio, turbo, fsed, sobel, heat, lu; grouped as MPEG4, signal processing, image, math). Legend, average II increase: <1, <2, <3, ≥3. Base II values listed under the axis: 4, 8, 7, 2, 4, 4, 4, 4, 6, 9]


Results – Restricted Datapath

[Figure: II increase ratio (y-axis, 0-5) vs. # perturbations (x-axis, 1-20) for four datapaths: PLA, No RRF, No Bus, No Mux]


Conclusion

• Increase flexibility of customized hardware without sacrificing performance, efficiency

• Successfully map loops to heterogeneous hardware
• Compile times of 5 minutes – 1 hour
• Software changing faster than hardware → patchable ASIC


Questions?



Results – Cross Compilation
