
Locality management using multiple SPMs on the Multi-Level Computing Architecture

Ahmed M. Abdelkhalek and Tarek S. Abdelrahman

The Edward S. Rogers Department of Electrical and Computer Engineering

University of Toronto

Oct. 26th, 2006
ESTIMedia, Seoul, Korea


Motivation for MLCA

1. Parallel programming is difficult
2. Need flexible MP-SoC architectures

Developed by:
  F. Karim, A. Mellan, A. Nguyen - STMicroelectronics
  U. Aydonat, T. Abdelrahman - Univ. of Toronto
  “A Multi-Level Computing Architecture for Multimedia Applications”, IEEE Micro, vol. 24, no. 3, 2004

A. M. Abdelkhalek - ESTIMedia, Oct. 26-27, 2006, Seoul


What is the MLCA?

[Figure: Abstract micro-processor architecture. Blocks: sequential program, Fetch & decode, Instr. queue, GPR, three FUs, Memory. Annotation: superscalar technology exploits instr. level parallelism.]

Isn’t parallel execution the goal of parallel programming?

What is the MLCA?

[Figure: Abstract micro-processor architecture. Blocks: sequential program of task instructions, Fetch & decode, Instr. queue, GPR, three FUs, Memory. Annotation: superscalar technology exploits task level parallelism.]

[Figure: Abstract Multi-Level Computing Architecture. Blocks: Control processor, Task dispatcher, Universal Register File, three PUs, Memory.]

What is the Multi-Level Computing Architecture?

Novel flexible MP-SoC architecture
New parallel programming model
Targets application TLP and ILP
Uses layered approach in HW and SW
  Upper layer exploits TLP
    HW: control processor, task dispatcher, and universal register file (URF)
    SW: control program
  Lower layer exploits ILP
    HW: processing units
    SW: task functions

MLCA Architecture & Programming Model

Sample control program:

    do {
        notzero = Add (in v1, in v2, out v3);
        if (notzero)
            Div (in v3, in v4, out v5);
        done = CheckDone (in v4, in v6, out v3);
    } while (done==0);

Sample task function:

    int Add () {
        int n1 = readArg(0);
        int n2 = readArg(1);
        writeArg(0, n1+n2);
        return (n1+n2)!=0;
    }

[Figure: MLCA block diagram. Blocks: Control processor, Task dispatcher, Universal Register File, three PUs, Memory. Annotations: "Exploit TLP" and "Exploit ILP (CPU, DSP, ASIC, etc.)".]
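
As a rough illustration of how the task function's readArg/writeArg calls might map onto the universal register file, the sketch below models the URF as a plain array and binds a task's in/out arguments to register indices before the call. Only `readArg`, `writeArg`, and `Add` come from the slide; the `urf` array, the `TaskCall` struct, `current_call`, `run_add`, and the sizes are hypothetical names and values introduced for this example, and the actual MLCA runtime may work quite differently.

```c
#include <assert.h>

#define URF_SIZE 16   /* assumed register-file size, for illustration only */
#define MAX_ARGS 4    /* assumed per-task argument limit */

/* Hypothetical model of the universal register file (URF). */
static int urf[URF_SIZE];

/* Hypothetical binding of one task call's arguments to URF registers. */
typedef struct {
    int in_reg[MAX_ARGS];   /* URF indices of the "in" arguments  */
    int out_reg[MAX_ARGS];  /* URF indices of the "out" arguments */
    int n_in, n_out;
} TaskCall;

static const TaskCall *current_call;  /* call currently being executed */

/* Read the i-th input argument of the running task from the URF. */
int readArg(int i) {
    assert(current_call && i >= 0 && i < current_call->n_in);
    return urf[current_call->in_reg[i]];
}

/* Write the i-th output argument of the running task to the URF. */
void writeArg(int i, int value) {
    assert(current_call && i >= 0 && i < current_call->n_out);
    urf[current_call->out_reg[i]] = value;
}

/* Task function from the slide, unchanged in behavior. */
int Add(void) {
    int n1 = readArg(0);
    int n2 = readArg(1);
    writeArg(0, n1 + n2);
    return (n1 + n2) != 0;
}

/* Models "notzero = Add (in v1, in v2, out v3);" with v1..v3 as URF indices. */
int run_add(int v1, int v2, int v3) {
    TaskCall call = { {v1, v2}, {v3}, 2, 1 };
    current_call = &call;
    int notzero = Add();
    current_call = 0;
    return notzero;
}
```

In this reading, the control program only names registers, while the task body stays unaware of which registers it was given; that separation is what lets the upper layer track dependences between tasks.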

Reduced SW complexity:
  no explicit parallel programming
  synchronization and communication separate from actual computations
Automatic extraction of parallelism
  superscalar technology
Flexibility
  PU number/types
  memory hierarchy
  scheduling policy

Optimizing system
  How to divide the application into tasks?
  How to decide on task arguments?
  Application-architecture matching
Simple path to initial solution exists

Outline

MLCA intro
Motivation
Target MLCA
Problem definition
Global task data mgmt
Evaluation
Conclusion

Motivation

MLCA flexible architecture:
  Opportunity for optimization
  Focus on memory hierarchy
Silicon technology scaling:
  Performance improving faster for gates than wires
  Cross-chip communication becoming more expensive
Avoid centralized memory:
  Better scalability for future MLCA chips

Target MLCA

MLCA naturally breaks down data into two types:
  Intra-task data: created and destroyed by task each time it executes, not needed by other tasks
  Inter-task data: needed by more than one task, identified through the URF

[Figure: target MLCA. Blocks: Control processor, Task dispatcher, URF Control, and PU 0 to PU 2, each with a Private memory and a URF data bank (banks 0 to 2). Annotations: private memories store intra-task data; the distributed-shared (NUMA) Scratch-Pad Memory banks store inter-task data.]
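
To make the NUMA property of the banks concrete, the sketch below estimates the cost of a task's inter-task data accesses depending on which PU runs it. It assumes PU i sits next to URF data bank i, as the figure suggests. The function names are hypothetical, and both latency constants are placeholders: the remote value merely echoes the 14-cycle point mentioned later in the results, and the local value is invented for illustration.

```c
#include <assert.h>

#define NUM_BANKS 3        /* matches the three banks in the figure       */
#define LOCAL_LATENCY 2    /* assumed cycles for a PU's own bank          */
#define REMOTE_LATENCY 14  /* assumed cycles for any other bank           */

/* Cycles for one access from PU `pu` to bank `bank` (PU i is next to bank i). */
int access_latency(int pu, int bank) {
    assert(pu >= 0 && pu < NUM_BANKS && bank >= 0 && bank < NUM_BANKS);
    return (pu == bank) ? LOCAL_LATENCY : REMOTE_LATENCY;
}

/* Total cycles for a task on `pu` that makes accesses[b] accesses to bank b. */
long task_access_cost(int pu, const int accesses[NUM_BANKS]) {
    long total = 0;
    for (int b = 0; b < NUM_BANKS; b++)
        total += (long)accesses[b] * access_latency(pu, b);
    return total;
}

/* PU with the lowest access cost for the given per-bank access counts. */
int best_pu(const int accesses[NUM_BANKS]) {
    int best = 0;
    for (int pu = 1; pu < NUM_BANKS; pu++)
        if (task_access_cost(pu, accesses) < task_access_cost(best, accesses))
            best = pu;
    return best;
}
```

Under these assumptions, a task that touches bank 2 far more often than bank 0 is cheapest on PU 2, which is the intuition behind placing tasks near the bank that holds their data.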

Problem definition

How do we efficiently use the target MLCA?
  How is global data allocated in the distributed banks?
  How to ensure access locality?
  Use static approach or allow dynamic data movement between banks?
How to easily integrate with MLCA 2-level programming model?
Focus on global data mgmt only
  Local task data handled by PU cache, etc.
Goal: better performance and ease of use

Global task data mgmt

Approach:
  Minimize cross-chip communication
  Execute task on PU near bank with global data it needs
Methodology:
  Bank memory allocation: task creates data in certain bank
  Task-bank association: indicate preference of where to schedule
  Bank data replication/migration: copy/move global data between banks
  Appropriate task scheduling policies
  Easy to use in control program

Example control program

    while (…) {
        setup (out x bank 1,
               out y bank 2,                     // bank memory allocation
               out z bank 3);

        taskA (in x, out x) on bank 1;           // task-bank association
        taskB (in y) on bank 2;
        taskC (in z) on bank 3;

        move x, bank 3;                          // bank data migration
        copy y, ycopy, bank 3;                   // bank data replication
        taskD (in x, in ycopy, in z) on bank 3;
        …
    }

Callout on the bank numbers: Bank identifier

Problem with loops:
  All iterations use same sets of banks
  Not desirable with independent iterations

Example control program

    while (…) {
        setup (out x bank 1, out y bank 2, out z bank 3);  // bank memory allocation

        taskA (in x, out x) on bank 1;           // task-bank association
        taskB (in y) on bank 2;
        taskC (in z) on bank 3;

        move x, bank 3;                          // bank data migration
        copy y, ycopy, bank 3;                   // bank data replication
        taskD (in x, in ycopy, in z) on bank 3;
        …
        remap bank 1, bank 2, bank 3;            // bank remapping
    }

Callout on the bank numbers: Virtual bank number

Solution for loops:
  Application uses virtual bank numbers
  Virtual numbers mapped to physical ones at run-time
Bank remapping: indicate next iteration can use different banks
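
The virtual bank numbers and the remap directive can be pictured as a small translation table consulted at run time. The sketch below is one possible reading, not the mechanism described in the talk: `BankMap`, `bank_map_init`, `resolve`, and `remap` are hypothetical names, the bank counts are invented, and shifting each listed virtual bank to the next physical bank is an assumed policy (the slides do not say how new physical banks are chosen).

```c
#include <assert.h>

#define NUM_PHYS_BANKS 4   /* assumed number of physical banks       */
#define MAX_VIRT_BANKS 8   /* assumed limit on virtual bank numbers  */

/* Hypothetical run-time table: virtual bank number -> physical bank. */
typedef struct {
    int phys[MAX_VIRT_BANKS];
} BankMap;

/* Start with virtual bank v mapped to physical bank v mod NUM_PHYS_BANKS. */
void bank_map_init(BankMap *m) {
    for (int v = 0; v < MAX_VIRT_BANKS; v++)
        m->phys[v] = v % NUM_PHYS_BANKS;
}

/* Translate a virtual bank number named in the control program. */
int resolve(const BankMap *m, int virt) {
    assert(virt >= 0 && virt < MAX_VIRT_BANKS);
    return m->phys[virt];
}

/* Models "remap bank a, bank b, ...": the next iteration may use different
 * physical banks. Assumed policy: shift each listed virtual bank by one. */
void remap(BankMap *m, const int *virt, int count) {
    for (int i = 0; i < count; i++) {
        assert(virt[i] >= 0 && virt[i] < MAX_VIRT_BANKS);
        m->phys[virt[i]] = (m->phys[virt[i]] + 1) % NUM_PHYS_BANKS;
    }
}
```

With such a table, the control program text stays the same in every iteration, while independent iterations can land on different physical banks instead of contending for the same three.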

Focus on optimization, not correctness
Limit copies to constant data

Task scheduling policies

Task-bank association serves as hint to scheduler
Various ways to deal with it at run-time:
  Completely ignore
    E.g. schedule first ready task on any PU
  Strictly adhere to
    E.g. only schedule task on preferred PU
  Somewhere in between
    E.g. schedule on preference, but ignore if wait too long
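
The three ways of treating the task-bank hint can be sketched as a single decision function. This is an illustrative model only: `HintMode`, `choose_pu`, the `waited`/`timeout` cycle counters, and the fallback of taking the first idle PU in index order are assumptions made for the example, not details from the talk.

```c
#include <assert.h>

#define NUM_PUS 3
#define NO_PU (-1)

/* Hypothetical modes: ignore the hint, obey it strictly, or obey it until a timeout. */
typedef enum { HINT_IGNORE, HINT_STRICT, HINT_TIMEOUT } HintMode;

/* Pick a PU for a ready task, or NO_PU to keep it waiting.
 * pu_free[i] is nonzero when PU i is idle; preferred is the PU next to the
 * task's associated bank; waited is how many cycles the task has been ready. */
int choose_pu(HintMode mode, const int pu_free[NUM_PUS], int preferred,
              int waited, int timeout) {
    assert(preferred >= 0 && preferred < NUM_PUS);

    /* Honor the hint whenever the mode uses it and the preferred PU is idle. */
    if (mode != HINT_IGNORE && pu_free[preferred])
        return preferred;

    /* Strict mode never falls back; timeout mode falls back only after waiting. */
    if (mode == HINT_STRICT || (mode == HINT_TIMEOUT && waited < timeout))
        return NO_PU;

    /* Fallback: first idle PU, in index order. */
    for (int pu = 0; pu < NUM_PUS; pu++)
        if (pu_free[pu])
            return pu;
    return NO_PU;
}
```

The timeout mode trades some locality for utilization: a task keeps its preference for a bounded time, then runs wherever a PU is idle.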

Evaluation

MLCA simulator
  C++/SystemC timed functional simulator
Media applications:
  MP3 decoder, FM radio demodulator, GSM voice encoder
Evaluate against:
  minimum support needed to use target MLCA
  round-robin for data allocation in banks and for task scheduling
Vary NUMA-ness of bank accesses

Results

Impact of individual techniques:
  Bank memory allocation and task-bank association: up to 21%
  Bank remapping: up to 18%
  Bank data replication/migration: up to 22%
Applications with various types of parallelism can benefit:
  GSM: pipeline parallelism across iterations: 19%
  MP3: parallelism within iteration and coarse pipeline parallelism: 33%
  FMR (FM radio): parallelism within iteration and fine pipeline parallelism: 40%
Impact increases with NUMA-ness of banks
  All apps benefit when remote bank access >= 14 PU cycles
Scheduling policies that favor local access are necessary
Only 6-14% potential for improvement remaining!

Conclusion

This work:
  Introduced distributed-shared memory MLCA
  Solution for global task data mgmt
    programming directives
    task scheduling policies
  Showed effectiveness of our approach at improving performance
Future work:
  Compiler support
  Hardware evaluation


Thank you!

Questions / Comments?

Task scheduling policies

FR: first ready task on first ready PU, visit PUs in round-robin fashion
CL: first ready task on closest ready PU
POLA: schedule task only on PU near bank preference, allow look ahead down task ready queue
POLATO: same as POLA but timeout after certain threshold and revert back to FR
POLAEM/POLAEMTO: same as POLA/POLATO but apply bank preference scheduling on moves and copies as well
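
The difference between FR and POLA can be sketched with two small selection routines. This is a simplified model under stated assumptions: `ReadyTask`, `pola_pick`, and `fr_pick_pu` are hypothetical names, each task's bank preference is reduced to a single preferred PU, and timeouts (POLATO) and move/copy handling (POLAEM) are left out.

```c
#include <assert.h>

#define NUM_PUS 3
#define NO_TASK (-1)

/* Hypothetical ready-queue entry: the PU next to the task's preferred bank. */
typedef struct {
    int preferred_pu;
} ReadyTask;

/* POLA-style selection: scan the ready queue from the head and return the
 * index of the first task whose preferred PU is idle, or NO_TASK if none is.
 * Tasks whose preferred PU is busy are skipped rather than blocking the queue. */
int pola_pick(const ReadyTask *queue, int n_ready, const int pu_free[NUM_PUS]) {
    for (int i = 0; i < n_ready; i++) {
        int pu = queue[i].preferred_pu;
        assert(pu >= 0 && pu < NUM_PUS);
        if (pu_free[pu])
            return i;
    }
    return NO_TASK;
}

/* FR-style selection for comparison: the head task goes to the next idle PU
 * found by a round-robin scan starting at *next_pu. Returns the chosen PU,
 * or -1 when all PUs are busy. */
int fr_pick_pu(const int pu_free[NUM_PUS], int *next_pu) {
    for (int k = 0; k < NUM_PUS; k++) {
        int pu = (*next_pu + k) % NUM_PUS;
        if (pu_free[pu]) {
            *next_pu = (pu + 1) % NUM_PUS;
            return pu;
        }
    }
    return -1;
}
```

In this model, look-ahead is what keeps POLA from stalling when the task at the head of the queue prefers a busy PU; FR never stalls for that reason but ignores locality entirely.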

