Processor Architectures and Program Mapping TU/e 5kk10 Henk Corporaal Jef van Meerbergen Bart Mesman...

transcript

Processor Architectures and Program Mapping

TU/e 5kk10Henk Corporaal

Jef van Meerbergen

Bart Mesman

Exploiting DLPSIMD architectures

04/18/23 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

flexibility

efficiency

Programmable CPU

Programmable DSP

Application specific instruction set

processor (ASIP)

Applicationspecific processor

SIMD PerformanceSIMD Performance

2 1 0.5 0.25 0.13 0.07

Computational efficiency [MOPS/W]

Feature size [um]

Application specific cores

Programmable processors[Roza]

What are we talking about?

ILP = Instruction Level Parallelism =

ability to perform multiple operations (or instructions),from a single instruction stream,

in parallel

VLIW = Very Long Instruction Word architecture

operation 1 operation 2 operation 3 operation 4

Instruction format:

operation 5

SIMD: Topics Overview• Enhance performance: architecture methods

• Data Level Parallelism

– Application area– Subword parallelism

• Locally connected SIMDs

– Xetal

• Fully connected SIMDs

– Imagine

• Communication in SIMD processors– RCSIMD

– DCSIMD

Enhance performance: 3 architecture methods

• (Super)-pipelining

• Powerful instructions– MD-technique

• multiple data operands per operation

– MO-technique• multiple operations per instruction

• Multiple instruction issue

Characteristics of Media Applications• Poorly matched to conventional architectures

– Caches

– Instruction-Level Parallelism

– Few arithmetic units

• Well-matched to modern VLSI technology

– Lots (100’s - 1000’s) of ALUs fit on a single chip

Communication bandwidth is the scarce resource

Architecture methods

Powerful Instructions (1)

MD-technique• Multiple data operands per operation• SIMD: Single Instruction Multiple Data

Vector instruction:

for (i=0, i++, i<64) c[i] = a[i] + 5*b[i];

c = a + 5*b

Assembly:

set vl,64ldv v1,0(r2)mulvi v2,v1,5ldv v1,0(r1)addv v3,v1,v2stv v3,0(r3)

SIMD computing• Exploit data locality of

e.g. image processing applications

• Effect on code size?• Effect on power

consumption?

SIMD Execution Method

Instruction 1

Instruction 2

Instruction 3

Instruction n

node1 node2 node-K

• Sub-word parallelism– SIMD on restricted scale:

– Used for Multi-media instructions

– Motivation: use a powerful 64-bit alu as 4 x 16-bit alus

• Examples– MMX, SUN-VIS, HP MAX-2, AMD-

K7/Athlon 3Dnow, Trimedia II

– Example: i=1..4|ai-bi| * * * *

LC-SIMD

LC-SIMD (Locally connected; e.g. Xetal, Imap) long communication delays: shift operations

PE1 PE2 PE319PE0

Instructions Bus

Memory

One wide port

FC-SIMDFC-SIMD (Fully Connected; Imagine)

expensive communication network

PE1 PE2 PE319PE0

Instructions Bus

Fully Connected Communication Network

LC: Xetal Objectives

High-degree of system integration

CMOS imaging + DSP

low cost camera systems

Low power consumption

mobile & remote sensing

Flexibility

programmable DSP and control functions

04/18/23 1

Xetal Architecture

Global ControllerGlobal Controller

tuned for Xetal Archit.

functions

loop/iteration control

system synchronization

exposure-time control

white balancing . . .

04/18/23 1

Xetal Architecture

Parallel Processing (SIMD) Parallel Processing (SIMD)

2 columns /processor

neighbour communication

low-speed clock (16 MHz)

clock gating

shared address decoding

minimal memory read access

LOW-POWER

Parallel Processing (Contd.) Parallel Processing (Contd.)

Xetal Specs & PerformanceXetal Specs & Performance

Simulation Results(1-input)Simulation Results(1-input)

Simulation Results(1-output)Simulation Results(1-output)

Simulation Results (2)Simulation Results (2)

Imagine

• Combining DLP (SIMD) and ILP (VLIW)– toplevel SIMD– per PE: VLIW

• Stereo Depth Extraction

• Polygon Rendering

• MPEG Encoding/Decoding

Encoded 2D Data 2D Video Stream

Encode/Decode

Imagine: Representative Applications

Render

101100010110001001

Stream Processing

• Little data reuse (pixels never revisited)• Highly data parallel (output pixels not dependent on other output pixels)• Compute intensive (60 arithmetic ops per memory reference)

Kernel StreamInput Data

Output Data

Image 1 convolve convolve

Image 0 convolve convolve

Depth Map

Stream Architecture Provides Data Bandwidth Hierarchy

2GB/s 32GB/s

ALU Cluster

544GB/s

ALU Cluster

SIMD/VLIW Control

Peak BW:

Application Data: Bandwidth Usage

2GB/s 32GB/s

SDRAMS

ALU Cluster

544GB/s

Memory BW Global RF BW Local RF BW

Depth Extractor 0.80 GB/s 18.45 GB/s 210.85 GB/s

MPEG Encoder 0.47 GB/s 2.46 GB/s 121.05 GB/s

Polygon Rendering 0.78 GB/s 4.06 GB/s 102.46 GB/s

QR Decomposition 0.46 GB/s 3.67 GB/s 234.57 GB/s

Stream Register File: Details

Single-ported128KB SRAM

(1024 x 32W)

Stream

32W/cycle

ArbiterTo/From:

Arithmetic Clusters,I/O,

Interprocessorcommunication,

and Main Memory

From SRF

To SRF

Cross Point

Local Register File

Arithmetic Cluster: Details

• Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions

– 4-cycle pipelined FMUL,FADD,FSUB,FTOI,ITOF,FFRAC

– 17-cycle FDIV (pipelined for 1 FDIV every 7 cycles)

The Imagine Stream Processor

Stream Register File:32kW SRAM

NetworkInterface

StreamController

Imagine Stream Processor

HostProcessor

SDRAMSDRAM SDRAMSDRAM

Streaming Memory System

Imagine Floorplan• 22 million

transistors

• 500 MHz

• TI GS30KA:– 0.15 m Ldrawn

– 0.m Leff

– CMOS process

Micro-Controller

ALU Cluster 0

ALU Cluster 1

ALU Cluster 2

ALU Cluster 3

ALU Cluster 4

ALU Cluster 5

ALU Cluster 6

ALU Cluster 7

SRFControl

StreamController

NetworkInterface

Imagine Programming Environment

StreamC

C++ compiler

KernelC

kernelscheduler

microcode

streamscheduler

Run-time

Compile-time

StereoDepthExtraction(…) {

// Load Input Images

// Run Kernels

convolve7x7 (RawImage,ConvImage);

convolve3x3 (ConvImage,Conv2Image);

// Store Output

Convolve7x7(…){

while(!In.empty()) {

p0 = k0 * in10;

p12 = k21 * in32;

p34 = k43 * in54;

p56 = k65 * in76;

sum = (p0 + p12)

+ (p34 + p56);

RC-SIMD

• Imagine support full interconnect between PEs

• Do we need this expensive interconnect?

• Alternative: RC-SIMD

Basic template of communication architecture

PE3PE0

Instructions Bus

Example

• 4-tap filter

PE 0 PE 1 PE 2 PE 3

0 LD 0 LD 0 LD 0 LD 0

1 * C0 * C0 * C0 * C0

2 LD +1 LD +1 LD +1 LD +1

3 * C1 * C1 * C1 * C1

4 LD +2 LD +2 LD +2 LD +2

5 * C2 * C2 * C2 * C2

6 LD +3 LD +3 LD +3 LD +3

7 * C3 * C3 * C3 * C3

8 sum sum sum sum

9 ST ST ST ST

Example

PE 0 PE 1 PE 2 PE 3

0 LD 0 LD 0 LD 0 LD 0

1 * C0 * C0 * C0 * C0

2 LD +1 LD +1 LD +1 LD +1

3 * C1 * C1 * C1 * C1

4 LD +2 LD +2 LD +2 LD +2

5 * C2 * C2 * C2 * C2

6 LD +3 LD +3 LD +3 LD +3

7 * C3 * C3 * C3 * C3

8 sum sum sum sum

9 ST ST ST ST

Resource sharing conflict

How to solve????

Pipeline (shift 1 cycle)

PE2 PE3PE0

RC-SIMD: Basic architecture

cycle PE 0 PE 1 PE 2 PE 3

0 LD 0 - - -

1 * C0 LD 0 - -

2 LD +1 * C0 LDP0 -

3 * C1 LD +1 * C0 LD 0

4 LD +2 * C1 LD +1 * C0

5 * C2 LD +2 * C1 LD +1

6 LD +3 * C2 LD +2 * C1

7 * C3 LD +3 * C2 LD +2

8 sum * C3 LD +3 * C2

9 ST sum * C3 LD +3

10 - ST sum * C3

11 - - ST sum

12 - - - ST

• Schedule with delay-line

PE2 PE3PE0

delay delay delay

Conflict model

Ld +2 S0 S1

• Schedule PE0 (using FACTS)

Node: resource usage

Sequence edge: timing dependency

Fact tools

Move problem From hardware to software

PE2 PE3PE0

delay delay delay

Basic architecture

• Valid schedule

PE2 PE3PE0

delay delay delay

cycle PE 0 PE 1 PE 2 PE 3

0 LD +3 - - -

1 LD +1 LD +3 - -

2 LD 0 LD +1 LD +3 -

3 * C0 LD 0 LD +1 LD +3

4 LD +2 * C0 LD 0 LD +1

5 * C1 LD +2 * C0 LD 0

6 * C2 * C1 LD +2 * C0

7 * C3 * C2 * C1 LD +2

8 sum * C3 * C2 * C1

9 ST sum * C3 * C2

10 - ST sum * C3

11 - - ST sum

12 - - - ST

Drawback

Ins 2 Ins 1

Ins 4 Ins 3

• 319 cycle between PE0 & PE319

• Size of conflict model (compile time)

PE 0 PE 1 PE 2 PE 3 PE 319

Update Architecture

Ins 2 Ins 1

Ins 4 Ins 3

Ins 2 Ins 1

Ins 4 Ins 3

Cycle 1

Cycle 2

Cycle 3

Cycle 4

Cycle 5

Cycle 6

Cycle 7

PE 0 PE 1 PE 2 PE 3 PE 4 PE 5 PE 5 PE 6

Updated RC-SIMD Architecture

PE2 PE3PE0

delay delay

Results of mapping several kernels

Kernel Num operation Num cycle in LC-SIMD

Num cycle in FC-SIMD

Num cycle in RC-SIMD

Comm. Overhead in LC-SIMD

Cycle Improvement (compare to LC-

4-tap filter 8 10 8 8 2 20%

Image sub-sampling

21 26 21 21 5 19%

Convolution 7x7

98 126 98 98 28 22%

Haar filter 162 216 162 162 54 25%

FFT 26 - 26 28 - -

Difficult SIMD Applications

• Algorithms need Dynamic communication:– lens distortion– bucket processing– Mirroring,…

DC-SIMD Architecture

PE_2 PE_3

PE_4 PE_5 PE_6

PE_6 PE_3

PE_4 PE_2V dst-add data src-add

Message format

PE_2 PE_3

PE_4 PE_5 PE_6

Larger distance: PE_7 PE_1

PE_2 PE_3

PE_4 PE_5 PE_6

PE_7 PE_5

PE_6 PE_2

Priority

DC-SIMD: arbitration

V des-add data src-add

xorRead data

write: give priority to further PES

PEn PEn+1 PEn+2

V des-add data src-add

Next reg.

Select (ab)

n+2 01

n+1 10

n 11a=v’.2’

b=a’.v’+a.1’

n+2 : 2.v

n+1 : (2+v).1

n : (1+2+v).0

Buffer instruction:

Measurements: Area overhead

1 2 4 8 16 32 64 128 256 512

Number of PEs

Measurements: Performance

0102030405060708090

1 2 3 4 5 6 7 8 9Number of Buses

Measurements: required instruction buffer size

22 43 64 85

Buffer Size

In PS3: CELL architecture

CELL Highlights• Observed clock speed

– – > 4 GHz

• Peak performance (single precision)– – > 256 GFlops

• Peak performance (double precision)– – >26 GFlops

• Area 221 mm2• Technology 90nm SOI• Total # of transistors 234M

Conclusions• SIMD nicely matches

– Image applications: data-level parallelism– VLSI efficiency: copy-paste of simple elements

• So – Very efficient architecture for image processing– Low power! Also by trading off clock vs peroformance

• But – Programmer is burdened with vector thinking– Vectorizing compilers are not good at recognizing opportunities for vector

executions– Need for a “control” processor for control code and if-then-else

• Communication is a problem:– Dimensioned for peak BW requirements -> RCSIMD– Unable to perform indirect PE addressing-> DCSIMD

Processor Architectures and Program Mapping TU/e 5kk10 Henk Corporaal Jef van Meerbergen Bart Mesman...

Documents