RICE UNIVERSITY

‘Stream’-based wireless computing

Sridhar Rajagopal

Research group meeting December 17, 2002

The figures used in the slides are borrowed from papers at VT and Stanford.

Motivation

‘Stream’-based computing: what does it mean?

Not a well-defined term
 - ‘computation’ that uses a flow of self-guided information
 - a ‘sequence of data’

Related to the flow of data through an architecture

Application to implementing wireless algorithms

Outline

Stallion: reconfigurable computing at Virginia Tech
 - ‘stream’-based computing #1
 - Custom Configurable Machines (CCM)

Imagine: media processing at Stanford
 - ‘stream’-based computing #2
 - programmable architectures

Stallion at VT

Wormhole Run-Time Reconfiguration (RTR)
 - coarse-grained structure reconfiguration using ‘streams’

‘Stream’ packets

A stream packet

Stream flow through architecture

Functional description of PE

Stream module description

4 states:
 - IDLE – reconf. in progress
 - BUSY – doing work
 - PROGRAM – load reconf. data
 - PASS – meant for next module

Need to output one packet per cycle

VALID – maintain sync.
 - set INVALID instead of wait states
 - strip information off stack
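The four states and the one-packet-per-cycle rule can be pictured as a small state machine. This is a minimal sketch, not Stallion's actual logic: the `Packet` fields, the `hops` counter standing in for the programming stack, the multiply-by-constant "configuration", and the fixed reconfiguration latency are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    IDLE = "reconf. in progress"
    BUSY = "doing work"
    PROGRAM = "load reconf. data"
    PASS = "meant for next module"


@dataclass(frozen=True)
class Packet:
    valid: bool = False      # INVALID packets fill cycles instead of wait states
    is_config: bool = False  # hypothetical flag: configuration vs. data word
    hops: int = 0            # hypothetical: modules to skip before the target
    payload: int = 0


INVALID = Packet()


class StreamModule:
    """Consumes one packet and emits exactly one packet per cycle."""

    def __init__(self, reconf_cycles=2):
        self.reconf_cycles = reconf_cycles  # assumed fixed reconfiguration latency
        self.remaining = 0
        self.scale = 1          # toy configuration: multiply data by this value
        self.pending = deque()  # data buffered while reconfiguring
        self.state = State.IDLE  # assumption: a module starts unconfigured

    def step(self, pkt):
        if pkt.valid and pkt.is_config and pkt.hops > 0:
            # Configuration meant for a module further downstream: forward it.
            self.state = State.PASS
            return Packet(True, True, pkt.hops - 1, pkt.payload)
        if pkt.valid and pkt.is_config:
            # Our own configuration: strip it off the stream and load it.
            self.state = State.PROGRAM
            self.scale = pkt.payload
            self.remaining = self.reconf_cycles
            return INVALID
        if pkt.valid:
            self.pending.append(pkt.payload)
        if self.remaining > 0:
            # Still reconfiguring: emit INVALID to keep the pipeline in sync.
            self.state = State.IDLE
            self.remaining -= 1
            return INVALID
        if self.pending:
            self.state = State.BUSY
            return Packet(True, False, 0, self.pending.popleft() * self.scale)
        return INVALID  # nothing to do this cycle; state is left unchanged
```

Feeding a configuration packet with `hops=0` followed by data packets shows the module emitting INVALID packets for `reconf_cycles` cycles and then draining its buffer one result per cycle.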

Processing layer

Static section
 - configures the reconf. section
 - buffers data during reconf. & sends ‘IDLE’ packets

Reconf. section
 - processing of the data done here

Higher layers convert algorithm to data and configuration patterns

Cart before the horse: Colt before the Stallion

Colt architecture (also at VT)

IFU Mesh – Mesh of interconnected func. units

Stallion chip

[Figure: Stallion chip block diagram. Label: 16-bit data, 4-control]

IFU mesh in Stallion

Dashed lines – skip buses

Can send operands over one or more IFUs

IFU details

Only left input can do barrel shifting

ALU based on LUT

Control register – stores control information for reconfiguration

Optional Delay Register - provides latency to synchronize path lengths of different pipeline streams

Cond. unit

Output control unit
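Two of the IFU blocks above lend themselves to a short illustration. The sketch below shows one common reading of a "LUT-based" ALU (a truth table applied to each bit pair) and a delay register that pads the shorter of two pipeline paths. It is illustrative only: the function `lut_alu`, the class `DelayRegister`, and the 16-bit default width are assumptions, not details of the Stallion IFU.

```python
from collections import deque


def lut_alu(truth_table, a, b, width=16):
    """Apply a 2-input truth table to each bit pair of a and b.

    truth_table[2 * a_bit + b_bit] gives the output bit, so
    [0, 0, 0, 1] behaves as AND and [0, 1, 1, 0] as XOR.
    """
    out = 0
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        out |= truth_table[2 * a_bit + b_bit] << i
    return out


class DelayRegister:
    """Delays a stream of values by a fixed number of cycles."""

    def __init__(self, cycles, fill=None):
        self.fifo = deque([fill] * cycles)

    def step(self, value):
        # Push the new value and pop the one that entered `cycles` steps ago.
        self.fifo.append(value)
        return self.fifo.popleft()


# A path that is two cycles shorter than its partner gets a 2-cycle delay
# so that both operands of a later unit arrive in the same cycle.
short_path = DelayRegister(2)
aligned = [short_path.step(v) for v in [1, 2, 3, 4]]
```

Changing the function of the ALU in this model means loading a different truth table, which is the kind of information a control register could hold.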

Radio testbed at VT

Stallion

Worm-hole routing

stream = worm, architecture = holes

multiple, independent streams can wind their way through the chip simultaneously

parts of system can be processing, parts could be reconfiguring

GOAL: Layered Software Radio Architecture

‘Stream’ processing at Stanford

Speeding up media applications

Need lots of computations per memory reference

Lots of data and sub-word parallelism

Current GPP architectures do not have enough ALUs

‘Stream’ processors to the rescue

Special-purpose processors

Fed by dedicated wires/memories

Lots (100s) of ALUs

Care and feeding of ALUs

[Figure: an ALU and the structure that feeds it. Labels: Data Bandwidth, Instruction Bandwidth, Regs, Instr. Cache, IR, IP]

‘Feeding’ structure dwarfs ALU

Architecture implications

Tremendous opportunities
 - media problems have lots of parallelism and locality
 - VLSI technology enables 100s of ALUs/chip (1000s soon)
   • (in 0.18 um: 0.1 mm^2 per integer adder, 0.5 mm^2 per FP adder)

Challenging problems
 - locality: global structures won’t work
 - explicit parallelism: ILP won’t keep 100 ALUs busy
 - memory: streaming applications don’t cache well

It’s time to try some new approaches
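The adder areas quoted on this slide give a feel for the ALU budget. The short calculation below assumes a hypothetical 100 mm^2 die devoted entirely to adders, which ignores registers, wiring, and control; the die size is an assumption for illustration, not a figure from the slide.

```python
INT_ADDER_MM2 = 0.1  # from the slide, 0.18 um process
FP_ADDER_MM2 = 0.5   # from the slide, 0.18 um process
DIE_MM2 = 100.0      # assumption: a 1 cm x 1 cm die


def adders_per_die(adder_mm2, die_mm2=DIE_MM2):
    """Upper bound on adders that fit if the die held nothing else."""
    # The small epsilon guards against floating-point round-off in the division.
    return int(die_mm2 / adder_mm2 + 1e-9)


int_adders = adders_per_die(INT_ADDER_MM2)
fp_adders = adders_per_die(FP_ADDER_MM2)
```

Under that assumption the raw silicon is not the limit on ALU count; keeping the ALUs fed is, which is the point the "challenging problems" list makes.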

Register file organization

Register file functions:
 - short-term storage for intermediate results
 - communication between multiple function units

Global register files don’t scale with #ALUs
 - need more registers to hold more results (grows with #ALUs)
 - need more ports to connect all of the units (grows with #ALUs^2)
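To see why a global register file stops scaling, the sketch below uses a common first-order area model that is not stated on the slide: each port adds one wire in each dimension of a storage cell, so cell area grows with the square of the port count. That is one reading of the slide's quadratic term for ports. The per-ALU register and port counts are hypothetical defaults.

```python
def global_rf_area(n_alus, regs_per_alu=8, ports_per_alu=3):
    """Relative area of one register file shared by n_alus ALUs.

    Assumes registers and ports both grow linearly with n_alus and
    that each storage cell's area grows with the port count squared.
    """
    registers = regs_per_alu * n_alus
    ports = ports_per_alu * n_alus
    return registers * ports ** 2


def relative_growth(n_alus):
    """Area relative to the register file that serves a single ALU."""
    return global_rf_area(n_alus) // global_rf_area(1)


growth = {n: relative_growth(n) for n in (1, 4, 16, 32)}
```

Under this model the shared register file grows with the cube of the ALU count, while the ALUs themselves grow only linearly.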

Register files dwarf ALUs

[Figure: register file size versus number of arithmetic units, drawn to a 1 cm scale for 1, 4, 16, and 32 ALUs. Labels: size of 1 ALU, size of RF to support 1 ALU, size of RF to support 32 ALUs]

Distributed register files

Distributed register files means:
 - not all functional units can access all data
 - each functional unit input/output no longer has a dedicated route from/to all register files

[Figure: functional units ADD0, L/S, and ADD1 on shared buses. Annotations: can write to either or both buses; can read from either bus]

Stream processing

[Figure: stream program. Image 0 and Image 1 each pass through two convolve kernels into a SAD kernel, which outputs the Depth Map. Legend: Kernel, Stream, Input Data, Output Data]

Little data reuse (pixels never revisited)

Highly data parallel (output pixels not dependent on other output pixels)

Compute intensive (60 operations per memory reference)
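SAD in the figure labels stands for sum of absolute differences. A minimal sketch of such a kernel follows; the function names `sad` and `sad_rows` and the list-of-rows layout are assumptions for illustration and are not taken from the Stanford application.

```python
def sad(block0, block1):
    """Sum of absolute differences between two equal-length pixel blocks."""
    if len(block0) != len(block1):
        raise ValueError("blocks must have the same length")
    return sum(abs(p0 - p1) for p0, p1 in zip(block0, block1))


def sad_rows(image0, image1):
    """One SAD value per pair of rows.

    Each output depends only on its own input rows, never on another
    output, which is the data-parallel property noted on the slide.
    """
    return [sad(r0, r1) for r0, r1 in zip(image0, image1)]


scores = sad_rows([[1, 2, 3], [4, 4, 4]], [[1, 1, 1], [0, 8, 4]])
```

Because the rows are independent, they could be handed to separate processing units without any coordination between them.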

Stream programming

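Streams: communication

    void main() {
        // Declare streams by element type and length
        Stream<int> a(256);
        Stream<int> b(256);
        Stream<int> c(256);
        Stream<int> d(1024);
        ...
        // Kernels communicate only through streams: c feeds example2
        example1(a, b, c);
        example2(c, d);
        ...
    }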

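Kernels: computation

    KERNEL example1(istream<int> a, istream<int> b, ostream<int> c) {
        loop_stream(a) {            // repeat until stream a is consumed
            int ai, bi, ci;
            a >> ai;                // read one element from each input stream
            b >> bi;
            ci = ai * 2 + bi * 3;   // local computation
            c << ci;                // append the result to the output stream
        }
    }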
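For readers without the stream toolchain, the kernel above can be mimicked in plain Python. This is a loose analogue, not the stream language: iterables stand in for streams, and `example1` here mirrors only the arithmetic of the kernel.

```python
def example1(a, b):
    """Yield ci = ai * 2 + bi * 3 for each pair of input elements."""
    for ai, bi in zip(a, b):
        yield ai * 2 + bi * 3


# Two 256-element input streams, matching the lengths declared in main().
a = range(256)
b = range(256)
c = list(example1(a, b))
```

The generator consumes its inputs one element at a time and never indexes into them, which is the access pattern a stream kernel is restricted to.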

Stream Processor

Instructions are Load, Store, and Operate
 - operands are streams

Operate performs a compound stream operation
 - read elements from input streams
 - perform a local computation
 - append elements to output streams
 - repeat until input stream is consumed
 - (e.g., triangle transform)
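The three instruction types can be sketched as a toy model. Everything here is hypothetical scaffolding: the `ToyStreamProcessor` class, its dictionary-based memory and stream register file, and the method names are illustrative and do not describe the Imagine instruction set.

```python
class ToyStreamProcessor:
    def __init__(self, memory):
        self.memory = memory  # name -> list, stands in for off-chip memory
        self.srf = {}         # name -> list, stands in for on-chip stream storage

    def load(self, name):
        """Load: copy a whole stream from memory into on-chip storage."""
        self.srf[name] = list(self.memory[name])

    def store(self, name):
        """Store: copy a whole stream from on-chip storage back to memory."""
        self.memory[name] = list(self.srf[name])

    def operate(self, kernel, inputs, output):
        """Operate: run a kernel over the input streams element by element."""
        out = []
        # zip stops when the shortest input stream is consumed.
        for elems in zip(*(self.srf[name] for name in inputs)):
            out.append(kernel(*elems))  # local computation, then append
        self.srf[output] = out


proc = ToyStreamProcessor({"a": [1, 2, 3], "b": [10, 20, 30]})
proc.load("a")
proc.load("b")
proc.operate(lambda x, y: x + y, ["a", "b"], "c")
proc.store("c")
```

Note that every operand of `load`, `store`, and `operate` is a stream name, never a single value.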

Imagine

[Figure: Imagine Stream Processor block diagram. Blocks: Host Processor, Stream Controller, Network Interface, Network, Stream Register File, Microcontroller, ALU Clusters 0 to 7, and a Streaming Memory System connected to four SDRAMs]

Arithmetic clusters

[Figure: arithmetic cluster. Units: + + + * * / and CU, each with a Local Register File, joined through Cross Points. Connections: From SRF, To SRF, Intercluster Network]

Bandwidth hierarchy

VLIW clusters with shared control

41.2 32-bit operations per word of memory bandwidth

[Figure: bandwidth hierarchy. SDRAM to Stream Register File: 2 GB/s; Stream Register File to ALU Clusters: 32 GB/s; within the ALU Clusters: 544 GB/s]
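The three bandwidth figures on this slide can be turned into ratios. The sketch assumes they belong, in the order shown, to the memory, stream register file, and cluster-local levels of the hierarchy; that mapping is read off the figure labels rather than stated in the text.

```python
# Bandwidths in GB/s, as printed on the slide; the level names are assumed.
BANDWIDTH_GBPS = {
    "memory (SDRAM)": 2,
    "stream register file": 32,
    "ALU clusters (local)": 544,
}


def ratios(levels):
    """Bandwidth of each level relative to the slowest one."""
    base = min(levels.values())
    return {name: bw / base for name, bw in levels.items()}


relative = ratios(BANDWIDTH_GBPS)
```

Each step toward the ALUs multiplies the available bandwidth, so a kernel that keeps its intermediate values close to the ALUs needs only a small fraction of its traffic to reach memory.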

Conclusions

‘Streams’ shown to be promising for reconfigurable computing
 - wireless may need reconfigurability

‘Streams’ shown to be promising for media processing
 - wireless may have similar workloads

Important to understand pros and cons of different methodologies for good wireless architectures

Important to have the right tools