Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability

Kevin Fan, Hyunchul Park, Manjunath Kudlur, Scott Mahlke

Advanced Computer Architecture Laboratory
University of Michigan, Electrical Engineering and Computer Science

April 8, 2008

Introduction

• Emerging applications have high performance, cost, and energy demands
  – H.264, wireless, software radio, signal processing
  – 10-100 Gops required
  – 200 mW power budget
• Applications are dominated by tight loops processing large amounts of streaming data

[Figure: iPhone board]

Loop Accelerators

[Figure: a loop in C code mapped to hardware; function unit labels: LD, +/-, *]

Hardware Implementations

• Customization gets order-of-magnitude performance and efficiency wins
  – Viterbi: 100x speedup vs. ARM9

[Figure: design space plotted as flexibility versus efficiency and performance. Points shown: general-purpose processors, DSPs, CGRAs, FPGAs, multifunction loop accelerators, loop accelerators and ASICs.]

What About Programmability?

• Software changes – bug fixes, evolving standards
• dct_8x8() from H.264 reference implementation

Version 13.0

for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
  i=pos_scan[coeff_ctr][0];
  j=pos_scan[coeff_ctr][1];
  run++;
  ilev=0;
  if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
  {
    MCcoeff = MC(coeff_ctr);
    runs[MCcoeff]++;
  }
  m7 = &curr_res[block_y + j][block_x];
  level = iabs (m7[i]);
  if (img->AdaptiveRounding)
  {
    fadjust8x8[j][block_x+i] = 0;
  }
  if (level != 0)
  {
    nonzero = TRUE;
    if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
    {
      *coeff_cost += MAX_VALUE;
      img->cofAC[b8+pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]);
      img->cofAC[b8+pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
      ++scan_pos;
      runs[MCcoeff]=-1;
    }
    else
    {
      *coeff_cost += MAX_VALUE;
      ACLevel[scan_pos ] = isignab(level,m7[i]);
      ACRun [scan_pos++] = run;
      run=-1; // reset zero level counter
    }
    level = isignab(level, m7[i]);
    ilev = level;
  }
}

Version 13.1

for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
  i=pos_scan[coeff_ctr][0];
  j=pos_scan[coeff_ctr][1];
  run++;
  ilev=0;
  if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
  {
    MCcoeff = MC(coeff_ctr);
    runs[MCcoeff]++;
  }
  m7 = &curr_res[block_y + j][block_x];
  level = iabs (m7[i]);
  if (img->AdaptiveRounding)
  {
    fadjust8x8[j][block_x+i] = 0;
  }
  if (level != 0)
  {
    nonzero = TRUE;
    if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
    {
      *coeff_cost += MAX_VALUE;
      img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]);
      img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
      ++scan_pos;
      runs[MCcoeff]=-1;
    }
    else
    {
      *coeff_cost += MAX_VALUE;
      ACLevel[scan_pos ] = isignab(level,m7[i]);
      ACRun [scan_pos++] = run;
      run=-1; // reset zero level counter
    }
    level = isignab(level, m7[i]);
    ilev = level;
  }
}

Version 13.2

for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
  i=pos_scan[coeff_ctr][0];
  j=pos_scan[coeff_ctr][1];
  run++;
  ilev=0;
  if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
  {
    MCcoeff = MC(coeff_ctr);
    runs[MCcoeff]++;
  }
  m7 = &curr_res[block_y + j][block_x];
  level = iabs (m7[i]);
  if (img->AdaptiveRounding)
  {
    fadjust8x8[j][block_x+i] = 0;
  }
  if (level != 0)
  {
    nonzero = TRUE;
    if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
    {
      *coeff_cost += MAX_VALUE;
      img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff] ] = isignab(level,m7[i]);
      img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
      ++scan_pos;
      runs[MCcoeff]=-1;
    }
    else
    {
      *coeff_cost += MAX_VALUE;
      ACLevel[scan_pos ] = isignab(level,m7[i]);
      ACRun [scan_pos++] = run;
      run=-1; // reset zero level counter
    }
    level = isignab(level, m7[i]);
    ilev = level;
  }
}

Programmable Loop Accelerator

• Reusable hardware → reduced NRE costs
• Generalize accelerator without losing efficiency

[Figure: design space plotted as flexibility versus efficiency and performance. Points shown: general-purpose processors, DSPs, CGRAs, FPGAs, multifunction loop accelerators, programmable loop accelerators, loop accelerators and ASICs.]

Flexible Accelerators

• Generalize accelerator architecture
• Map new loops to existing hardware

[Figure: Loop 1 → Synthesis System → Hardware; Loop 2 → Compiler → Hardware]

Loop Accelerator Architecture

• Hardware realization of modulo scheduled loop
• Parameterized execution resources, storage, connectivity

[Figure: accelerator datapath with function units (+, &, MEM, BR) joined by point-to-point connections, plus local memory, a CRF, and an FSM that generates the control signals]

Programmable Accelerator Architecture

• ~50% area overhead vs. non-programmable accelerator
• Generalize architectural features that limit programmability

[Figure: programmable datapath with function units (+/-, &/|, MEM, BR), point-to-point connections and a bus, RR blocks, literals, local memory, a CRF, and a control memory that generates the control signals]

Mapping Loops onto Hardware

                Processor               Accelerator
FUs             General-purpose         Customized
Storage         Central register file   Distributed registers
Connectivity    Homogeneous             Point-to-point

[Figure: processor datapath (ALU, ALU, CRF) beside accelerator datapath (LD, +/-, *), annotated "88 16"]

Scheduling Example

[Figure: animated modulo schedule with II=2. Columns: ADDER1, ADDER2, MEM. Rows: time 0 to 4. Operations LD1, +2, +3, +4, +5 are placed step by step; the final step leaves the placement of +5 as an open question ("+5 ?").]
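The bookkeeping behind a schedule like this one can be sketched with a modulo reservation table, in which an operation issued at time t occupies its FU in slot t mod II. This is a minimal illustration, not the tool's implementation: the class name, the issue times, and the extra operation "+6" are hypothetical, and the sketch models FU occupancy only, not the connectivity restrictions of a customized datapath.

```python
class ModuloReservationTable:
    """Tracks which (FU, time mod II) slots are taken in a modulo schedule."""

    def __init__(self, fus, ii):
        self.ii = ii
        # One row of II slots per function unit; None means the slot is free.
        self.slots = {fu: [None] * ii for fu in fus}

    def is_free(self, fu, time):
        return self.slots[fu][time % self.ii] is None

    def place(self, op, fu, time):
        """Reserve the slot for op; return False if it is already taken."""
        if not self.is_free(fu, time):
            return False
        self.slots[fu][time % self.ii] = op
        return True


# FU names and II follow the slide; the issue times are illustrative only.
mrt = ModuloReservationTable(["ADDER1", "ADDER2", "MEM"], ii=2)
placed = [
    mrt.place("LD1", "MEM", 0),
    mrt.place("+2", "ADDER1", 1),
    mrt.place("+3", "ADDER2", 1),
    mrt.place("+4", "ADDER1", 2),
    mrt.place("+5", "ADDER2", 2),
]
# Hypothetical sixth operation: time 3 maps to slot 1, which "+2" already holds.
conflict = mrt.place("+6", "ADDER1", 3)
```

With two adders and II=2 there are only four adder slots, so a fifth add cannot be placed on an adder without raising II.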

Modulo Scheduling for LAs

• Large search space, few solutions
• Op-centric approaches unable to find solutions
• Satisfiability Modulo Theories (SMT) formulation to solve linear and SAT constraints simultaneously

[Figure: scheduling flow. Inputs: loop, machine description. Phases: move insertion, SMT scheduling, register allocation. Output: control signals. Feedback path: increment II.]
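The "increment II" feedback in this flow can be sketched as a retry loop around a scheduling attempt. The sketch below is an assumption-laden illustration: `find_ii` and `resources_fit` are hypothetical names, and `resources_fit` is a toy resource-count check standing in for the move insertion, SMT scheduling, and register allocation phases, which are not modelled here.

```python
from collections import Counter


def resources_fit(op_kinds, fu_counts, ii):
    """Toy stand-in for a scheduling attempt: True when every FU kind offers
    enough issue slots (count * II) for the operations that need it."""
    demand = Counter(op_kinds)
    return all(demand[kind] <= fu_counts.get(kind, 0) * ii for kind in demand)


def find_ii(op_kinds, fu_counts, try_schedule=resources_fit, start_ii=1, max_ii=32):
    """Retry the scheduling attempt with a larger II until it succeeds."""
    for ii in range(start_ii, max_ii + 1):
        if try_schedule(op_kinds, fu_counts, ii):
            return ii
    return None  # no schedule found within the II budget


# Hypothetical loop body (one load, four adds) and machine (two adders, one memory unit).
ops = ["add", "add", "add", "add", "mem"]
machine = {"add": 2, "mem": 1}
ii = find_ii(ops, machine)
```

Passing the attempt in as `try_schedule` keeps the retry policy separate from whatever solver actually decides feasibility at a given II.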

SMT Formulation

• Boolean variables $X_{i,f,t}$ are true if operation $i$ is scheduled on FU $f$ at time slot $t$.
• Integer variables $S_i$ represent the stage of operation $i$.
• For a dependence edge from operation $i$ to operation $j$ with latency $lat(i)$ and distance $dist(i,j)$:

$$sched\_time(j) \ge sched\_time(i) + lat(i) - dist(i,j) \times II$$

$$(X_{i,f_i,t_i} \wedge X_{j,f_j,t_j}) \rightarrow (S_j \times II + t_j \ge S_i \times II + t_i + lat(i) - dist(i,j) \times II)$$

• More details in paper
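To make the dependence constraint concrete, the sketch below checks it for a candidate placement and searches tiny loops exhaustively. It is not an SMT solver and not the authors' formulation: the exhaustive search is a plainly swapped-in substitute, the one-operation-per-(FU, slot) rule is an assumed resource constraint, connectivity is ignored, and all names and the example loop are hypothetical.

```python
from itertools import product


def sched_time(stage, slot, ii):
    """Absolute issue time of an operation: S * II + t."""
    return stage * ii + slot


def dependence_ok(edge, placement, ii):
    """Check S_j*II + t_j >= S_i*II + t_i + lat(i) - dist(i,j)*II for one edge.

    edge is (i, j, lat_i, dist_ij); placement[op] is (fu, slot, stage).
    """
    i, j, lat, dist = edge
    _, t_i, s_i = placement[i]
    _, t_j, s_j = placement[j]
    return sched_time(s_j, t_j, ii) >= sched_time(s_i, t_i, ii) + lat - dist * ii


def brute_force_schedule(ops, fus_for, edges, ii, max_stage=3):
    """Exhaustive stand-in for the solver; only viable for very small loops."""
    choices = [
        [(fu, t, s) for fu in fus_for[op] for t in range(ii) for s in range(max_stage + 1)]
        for op in ops
    ]
    for combo in product(*choices):
        slots = [(fu, t) for fu, t, _ in combo]
        if len(set(slots)) < len(slots):
            continue  # two operations share an FU in the same modulo slot
        placement = dict(zip(ops, combo))
        if all(dependence_ok(e, placement, ii) for e in edges):
            return placement
    return None


# Hypothetical loop: load feeds an add, which feeds an accumulate with a recurrence.
ops = ["ld", "add", "acc"]
fus_for = {"ld": ["MEM"], "add": ["ADDER1", "ADDER2"], "acc": ["ADDER1", "ADDER2"]}
edges = [("ld", "add", 2, 0), ("add", "acc", 1, 0), ("acc", "acc", 1, 1)]
schedule = brute_force_schedule(ops, fus_for, edges, ii=1)
```

A recurrence edge with latency 2 and distance 1 cannot be satisfied at II=1, because the constraint reduces to time ≥ time + 1, but it can at II=2.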

Measuring Programmability

• How well can different loops be mapped onto the same hardware?
• Performance matters – how much does II increase?
• Need set of loops with different degrees of similarity

[Figure: several loops being mapped onto one piece of hardware made of FUs]

Graph Perturbation

• Synthetically generated graphs
• More perturbations → less similar to original graph
• Iteratively apply random transformations:
  – Add edge between existing operations
  – Add edge with new producer
  – Add edge with new consumer
  – Remove edge
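The four transformations can be sketched as random edits to a dataflow graph held as a node list and a set of (producer, consumer) edges. This is one possible reading, not the authors' generator: the function name, the uniform choice among transformations, and the naming of new nodes are assumptions, and the sketch ignores operation types, edge distances, and any legality checks such as cycle detection.

```python
import random

KINDS = ["add_edge", "new_producer", "new_consumer", "remove_edge"]


def perturb(nodes, edges, count, seed=0):
    """Apply `count` random edits and return new (nodes, edges).

    Assumes `nodes` is a non-empty list of strings that do not already use
    the generated "new<N>" names. The inputs are left unmodified.
    """
    rng = random.Random(seed)
    nodes, edges = list(nodes), set(edges)
    applied = 0
    while applied < count:
        kind = rng.choice(KINDS)
        if kind == "add_edge":
            candidates = [(a, b) for a in nodes for b in nodes
                          if a != b and (a, b) not in edges]
            if not candidates:
                continue  # graph is already fully connected; pick again
            edges.add(rng.choice(candidates))
        elif kind == "new_producer":
            new = f"new{len(nodes)}"
            edges.add((new, rng.choice(nodes)))
            nodes.append(new)
        elif kind == "new_consumer":
            new = f"new{len(nodes)}"
            edges.add((rng.choice(nodes), new))
            nodes.append(new)
        else:  # remove_edge
            if not edges:
                continue  # nothing to remove; pick again
            edges.remove(rng.choice(sorted(edges)))
        applied += 1
    return nodes, edges


# Hypothetical three-operation graph, perturbed five times.
base_nodes = ["ld", "add", "mul"]
base_edges = {("ld", "add"), ("add", "mul")}
new_nodes, new_edges = perturb(base_nodes, base_edges, 5, seed=1)
```

Fixing the seed makes a given perturbation sequence reproducible, which helps when the same perturbed graph has to be scheduled on several datapath variants.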

Results – Perturbed Graphs

[Figure: chart of results on perturbed graphs. X-axis: benchmarks dcac, dequant, fft, fir, fmradio, turbo, fsed, sobel, heat, lu, grouped as MPEG4, Signal processing, Image, Math. Y-axis: # Perturbations (0 to 20). Legend: average II increase <1, <2, <3, ≥3. Base II per benchmark, in the order listed: 4, 8, 7, 2, 4, 4, 4, 4, 6, 9.]

Results – Restricted Datapath

[Figure: chart of results on restricted datapaths. X-axis: # Perturbations (1 to 20). Y-axis: II increase ratio (0 to 5). Series: PLA, No RRF, No Bus, No Mux.]

Conclusion

• Increase flexibility of customized hardware without sacrificing performance, efficiency
• Successfully map loops to heterogeneous hardware
• Compile times of 5 minutes to 1 hour
• Software changing faster than hardware → patchable ASIC


Questions?


Results – Cross Compilation

