Model Based Design for DSP: Presentation to Stevens

Will Plishker, Chung-Ching Shen, Nimish Sane, George Zaki, Soujanya Kedilaya, Shuvra S. Bhattacharyya

Maryland DSPCAD Research Group (http://www.ece.umd.edu/DSPCAD/home/dspcad.htm)
Department of Electrical and Computer Engineering, and Institute for Advanced Computer Studies
University of Maryland, College Park

Outline

• Model Based Design
• Dataflow Interchange Format
• Multiprocessor Scheduling
• Preliminary Setup and Results with GPUs
• Future Directions

Introduction

In modern, complex systems we would like to
• Create an application description independent of the target
• Interface with a diverse set of tools and teams
• Achieve high performance
• Arrive at an initial prototype quickly

But algorithms are far removed from their final implementation
• Low level programming environments
• Diverse and changing platforms
• Non-uniform functional verification
• Entrenched design processes
• Tool selection

Implementation Gap

[Figure: an abstract representation of an algorithm set against a low level, high performance implementation. Block labels: Threshold Module, Pattern comparator, Pattern (4 bits), Decision check, Decision (1 bit), E Adder, H Adder, E/Gamma (1 bit), Fine Grain OR (1 bit), Channel Et Adder, Channel Et (4x9 bits).]

Model-Based Design for Embedded Systems
• High level application subsystems are specified in terms of components that interact through formal models of computation
  • C or other “platform-oriented” languages can be used to specify intra-component behavior
  • Model-specific language can be used to specify inter-component behavior
  • Object-oriented techniques can be used to maintain libraries of components
• Popular models for embedded systems
  • Dataflow and KPNs (Kahn process networks)
  • Continuous time, discrete event
  • FSM and related control formalisms

Dataflow-based Design: Related Trends
• Dataflow-based design (in our context) is a specific form of model-based design
• Dataflow-based design is complementary to
  • Object-oriented design
  • DSP C compiler technology
  • Synthesis tools for hardware description languages (e.g., Verilog and VHDL)

Example: Dataflow-based design for DSP

[Figure: example from Agilent ADS tool]

Example: QAM Transmitter in National Instruments LabVIEW

[Figure: QAM transmitter block diagram with Rate Control, QAM Encoder, Transmit Filters, and Passband Signal stages. Source: [Evans 2005]]

Crossing the Implementation Gap: Design Flow Using DIF

[Figure: DIF design flow. Inputs: dataflow models (static: SDF, MDSDF, HSDF, CSDF; dynamic: CFDF, BDF; meta-modeling: PDF, BLDF) and DSP designs (signal processing, image/video, communication systems), captured as a DIF specification in The DIF Language (TDL). The DIF Package (TDP): front-end, DIF representation, algorithms, AIF / porting, DIF-to-C. Dataflow-based DSP design tools: Autocoding Toolset (DIF-AT Ex/Im), Ptolemy II (Ptolemy Ex/Im), other tools (other Ex/Im). DSP libraries: VSIPL, TI, other. Embedded processing platforms: C, DSP, Java, Java VM, Ada, VDM, other embedded platforms.]

Dataflow with Software Defined Radio: DIF + GNU Radio

[Figure: tool flow connecting GNU Radio and DIF. GNU Radio side: GRC, XML flowgraph (.grc), Python flowgraph (.py), GNU Radio Engine (Python/C++). DIF side: DIF specification (.dif), The DIF Package (TDP), Uniprocessor Scheduling, Schedule (.dif, .sched), DIF Lite, Platform Retargetable Library. Platforms: GPUs, multi-processors, Cell, FPGA. Legend: existing or completed vs. proposed.]

Steps in the flow:
1) Convert or generate .dif file (Complete)
2) Execute static schedules from DIF (Complete)
3a) Perform online scheduling
3b) Architecture specification (.arch?)
   • Processors
   • Memories
   • Interconnect
4) Architecture aware MP scheduling
   • (assignment, ordering, invocation)

Background: Dataflow Graphs
• Vertices (actors) represent computation
• Edges represent FIFO buffers
• Edges may have delays, implemented as initial tokens
• Tokens are produced and consumed on edges
• Different models have different rules for production (SDF=fixed, CSDF=periodic, BDF=dynamic)

[Figure: dataflow graph X → Y → Z over edges e1 and e2, annotated with production/consumption rates p1, c1, p2, c2 and the value 5.]
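The firing rule described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of DIF or GNU Radio: the `Edge` class, the `can_fire` and `fire` helpers, and the concrete rates (2/3 on the first edge, 1/1 with 5 initial tokens on the second) are all assumptions chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Edge:
    """A FIFO buffer between two actors (hypothetical helper class)."""
    src: str
    snk: str
    prd: int          # tokens produced per firing of src
    cns: int          # tokens consumed per firing of snk
    delay: int = 0    # number of initial tokens on the edge
    fifo: deque = field(default_factory=deque)

    def __post_init__(self):
        # Delays are implemented as initial tokens already sitting in the FIFO.
        self.fifo.extend([0] * self.delay)


def can_fire(actor, edges):
    """An actor may fire when every input edge holds at least cns tokens."""
    return all(len(e.fifo) >= e.cns for e in edges if e.snk == actor)


def fire(actor, edges):
    """Consume cns tokens from each input edge, then produce prd on each output."""
    if not can_fire(actor, edges):
        raise RuntimeError(f"{actor} does not have enough input tokens")
    for e in edges:
        if e.snk == actor:
            for _ in range(e.cns):
                e.fifo.popleft()
    for e in edges:
        if e.src == actor:
            e.fifo.extend([actor] * e.prd)


# Assumed example: X -> Y -> Z with made-up rates and a delay of 5 on the second edge.
edges = [Edge("X", "Y", prd=2, cns=3), Edge("Y", "Z", prd=1, cns=1, delay=5)]
fire("X", edges)
fire("X", edges)   # e1 now holds 4 tokens, enough for Y
fire("Y", edges)   # consumes 3 from e1, produces 1 on e2
```

With fixed `prd` and `cns` values per edge this corresponds to the SDF case; CSDF or BDF would need rates that change between firings.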

Evolution of Dataflow Models of Computation for DSP: Examples
• Computation Graphs and Marked Graphs [Karp 1966, Reiter 1968]
• Synchronous dataflow [Lee 1987]
  • Static multirate behavior
  • SPW (Cadence), National Instruments LabVIEW, and others
• Well behaved stream flow graphs [1992]
  • Schemas for bounded dynamics
• Boolean/integer dataflow [Buck 1994]
  • Turing complete models
• Multidimensional synchronous dataflow [Lee 1992]
  • Image and video processing
• Scalable synchronous dataflow [Ritz 1993]
  • Block processing
  • COSSAP (Synopsys)
• CAL [Eker 2003]
  • Actor-based dataflow language
• Cyclo-static dataflow [Bilsen 1996]
  • Phased behavior
  • Eonic Virtuoso Synchro, Synopsys El Greco and Cocentric, Angeles System Canvas
• Bounded dynamic dataflow [Pankert 1994]
  • Bounded dynamic data transfer
• The processing graph method [Stevens, 1997]
  • Reconfigurable dynamic dataflow
  • U. S. Naval Research Lab, MCCI Autocoding Toolset
• Stream-based functions [Kienhuis 2001]
• Parameterized dataflow [Bhattacharya 2001]
  • Reconfigurable static dataflow
  • Meta-modeling for more general dataflow graph reconfiguration
• Reactive process networks [Geilen 2004]
• Blocked dataflow [Ko 2005]
  • Image and video through parameterized processing
• Windowed synchronous dataflow [Keinert 2006]
• Parameterized stream-based functions [Nikolov 2008]
• Enable-invoke dataflow [Plishker 2008]
• Variable rate dataflow [Wiggers 2008]

Modeling Design Space

[Figure: dataflow models plotted by expressive power (vertical axis) against verification / synthesis power (horizontal axis). Plotted models: PSDF, PCSDF, C, BDF, DDF, SDF, CSDF, SSDF, MDSDF, WBDF.]

Dataflow Interchange Format
• Describe DF graphs in text
• Simple DIF file:

    dif graph1_1 {
      topology {
        nodes = n1, n2, n3, n4;
        edges = e1 (n1, n2),
                e2 (n2, n1),
                e3 (n1, n3),
                e4 (n1, n3),
                e5 (n4, n3),
                e6 (n4, n4);
      }
    }
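To make the textual format concrete, the sketch below pulls the node and edge lists out of the simple DIF file shown on this slide. It is only a regular-expression illustration written for this example; `read_topology` is a hypothetical helper, not the parser shipped in The DIF Package, and it handles nothing beyond a single `topology` block.

```python
import re

# The simple DIF file from the slide, reproduced as a string.
DIF_TEXT = """
dif graph1_1 {
  topology {
    nodes = n1, n2, n3, n4;
    edges = e1 (n1, n2), e2 (n2, n1), e3 (n1, n3),
            e4 (n1, n3), e5 (n4, n3), e6 (n4, n4);
  }
}
"""


def read_topology(text):
    """Return (nodes, edges) where edges are (edge_id, src, snk) tuples."""
    nodes_match = re.search(r"nodes\s*=\s*([^;]*);", text)
    edges_match = re.search(r"edges\s*=\s*([^;]*);", text)
    nodes = [name.strip() for name in nodes_match.group(1).split(",")]
    edges = re.findall(r"(\w+)\s*\(\s*(\w+)\s*,\s*(\w+)\s*\)",
                       edges_match.group(1))
    return nodes, edges


nodes, edges = read_topology(DIF_TEXT)

# Build a successor map; parallel edges (e3, e4) and the self-loop (e6) are kept.
successors = {n: [] for n in nodes}
for edge_id, src, snk in edges:
    successors[src].append((edge_id, snk))
```

Keeping the edge identifiers alongside the endpoints matters here because the graph is a multigraph: `e3` and `e4` both connect `n1` to `n3`.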

More features of DIF
• Ports

    interface {
      inputs = p1, p2:n2;
      outputs = p3:n3, p4:n4;
    }

• Hierarchy

    refinement {
      graph2 = n3;
      p1 : e3;
      p2 : e4;
      p3 : e5;
      p4 : p3;
    }

More features of DIF
• Production and consumption

    production {
      e1 = 4096;
      e10 = 1024;
      ...
    }
    consumption {
      e1 = 4096;
      e10 = 64;
      ...
    }

• Computation keyword
• User defined attributes

[Figure: graph edges annotated with rates 4096, 4096, 1024, 64]

The DIF Language Syntax

    dataflowModel graphID {
      basedon { graphID; }
      topology {
        nodes = nodeID, ...;
        edges = edgeID (srcNodeID, snkNodeID), ...;
      }
      interface {
        inputs = portID [:nodeID], ...;
        outputs = portID [:nodeID], ...;
      }
      parameter {
        paramID [:dataType];
        paramID [:dataType] = value;
        paramID [:dataType] : range;
      }
      refinement {
        subgraphID = supernodeID;
        subPortID : edgeID;
        subParamID = paramID;
      }
      builtInAttr {
        [elementID] = value;
        [elementID] = id;
        [elementID] = id1, id2, ...;
      }
      attribute usrDefAttr {
        [elementID] = value;
        [elementID] = id;
        [elementID] = id1, id2, ...;
      }
      actor nodeID {
        computation = stringValue;
        attrID [:attrType] [:dataType] = value;
        attrID [:attrType] [:dataType] = id;
        attrID [:attrType] [:dataType] = id1, ...;
      }
    }

Uniprocessor Scheduling for Synchronous Dataflow
• An SDF graph G = (V, E) has a valid schedule if it is deadlock-free and is sample rate consistent (i.e., it has a periodic schedule that fires each actor at least once and produces no net change in the number of tokens on each edge).
• Balance equations: $\forall e \in E:\ \mathrm{prd}(e) \times q[\mathrm{src}(e)] = \mathrm{cns}(e) \times q[\mathrm{snk}(e)]$.
• The repetition vector q is the minimum positive integer solution of the balance equations.
• A valid schedule is then a sequence of actor firings where each actor v is fired q[v] (repetition count) times and the firing sequence obeys the precedence constraints imposed by the SDF graph.
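As a worked illustration of the balance equations, the sketch below solves them for the CD-to-DAT chain that this deck uses as its sample rate conversion example. The function name `repetition_vector` and the `(src, snk, prd, cns)` edge tuples are assumptions made for this example, and the approach (propagating rational firing rates, then scaling to integers) is one common way to do it rather than the specific method used by the DIF package.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Solve prd(e) * q[src(e)] == cns(e) * q[snk(e)] for a connected SDF graph.

    edges is a list of (src, snk, prd, cns) tuples.
    """
    rate = {actors[0]: Fraction(1)}   # fix one actor, express the rest relative to it
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, snk, prd, cns = edge
            if src in rate and snk in rate:
                if rate[src] * prd != rate[snk] * cns:
                    raise ValueError("graph is not sample rate consistent")
            elif src in rate:
                rate[snk] = rate[src] * prd / cns
            elif snk in rate:
                rate[src] = rate[snk] * cns / prd
            else:
                continue              # neither endpoint reached yet, retry later
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Scaling by the LCM of the denominators yields the smallest integer solution,
    # because the reference actor has rate exactly 1.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    return {v: int(r * scale) for v, r in rate.items()}


# CD-to-DAT chain: A -> B -> C -> D -> E -> F with (prd, cns) per edge.
CD_DAT_EDGES = [("A", "B", 1, 1), ("B", "C", 2, 3), ("C", "D", 4, 7),
                ("D", "E", 5, 7), ("E", "F", 4, 1)]
q = repetition_vector(list("ABCDEF"), CD_DAT_EDGES)
print(q)  # {'A': 147, 'B': 147, 'C': 98, 'D': 56, 'E': 40, 'F': 160}
```

The resulting counts are the loop factors that appear in the flat schedule (147A)(147B)(98C)(56D)(40E)(160F).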

Example: Sample Rate Conversion
• Flat strategy
  • Topologically sort the graph and iterate each actor v q[v] times.
  • Low context switching but large buffer requirement and latency
• CD to DAT Flat Schedule: (147A)(147B)(98C)(56D)(40E)(160F)
• CD to DAT: 44.1 kHz to 48 kHz sampling rate conversion.

[Figure: SDF chain CD (A) → FIR1 (B) → FIR2 (C) → FIR3 (D) → FIR4 (E) → DAT (F). Production/consumption rates per edge: e1 = 1/1, e2 = 2/3, e3 = 4/7, e4 = 5/7, e5 = 4/1.]

Scheduling Algorithms
• Acyclic pairwise grouping of adjacent nodes (APGAN)
  • An adaptable (to different cost functions) and low-complexity heuristic to compute a nested looped schedule of an acyclic graph in a way that precedence constraints (topological sort) are preserved through the scheduling process.
• Dynamic programming post optimization (DPPO)
  • Dynamic programming over a given actor ordering (any topological sort).
  • GDPPO, CDPPO, SDPPO.
• Recursive procedure call (RPC) based MAS
  • Generate MASs for a given R-schedule through recursive graph decomposition.
  • The resulting schedule is bounded polynomially in the graph size.

Algorithm | Looped Schedule | Buffer Size
Flat | (147A)(147B)(98C)(56D)(40E)(160F) | 1273
APGAN | (49(3AB)(2C))(8(7D)(5E(4F))) | 438
DPPO | (7(7(3AB)(2C))(8D))(40E(4F)) | 347
RPC-based MAS | ((2(((7((AB)(2(AB)C))D)D)(5E(4F)))(2(((7((AB)(2(AB)C))D)D)(5E(4F)))(E(4F))))((((7((AB)(2(AB)C))D)D)(5E(4F)))(E(4F)))) | 69
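The buffer sizes in this comparison can be checked by executing a looped schedule and recording the peak token count on each edge. The sketch below does that for the CD-to-DAT chain. It assumes a non-shared buffer model (each edge gets its own buffer sized to its peak occupancy, and the total is the sum over edges), single-letter actor names, and the edge rates 1/1, 2/3, 4/7, 5/7, 4/1; `parse_looped_schedule` and `buffer_size` are hypothetical helper names.

```python
def parse_looped_schedule(text):
    """Parse '(2(3AB)C)'-style notation into actor names and (count, body) loops."""
    text = text.replace(" ", "")
    pos = 0

    def sequence():
        nonlocal pos
        items = []
        while pos < len(text) and text[pos] != ")":
            if text[pos] == "(":
                pos += 1
                start = pos
                while text[pos].isdigit():
                    pos += 1
                count = int(text[start:pos]) if pos > start else 1
                body = sequence()
                pos += 1                      # skip the closing ')'
                items.append((count, body))
            else:
                items.append(text[pos])       # single-letter actor name
                pos += 1
        return items

    return sequence()


def buffer_size(schedule_text, edges):
    """Sum over edges of the peak number of tokens held while running the schedule."""
    tokens = [0] * len(edges)
    peak = [0] * len(edges)

    def fire(actor):
        for k, (src, snk, prd, cns) in enumerate(edges):
            if snk == actor:
                tokens[k] -= cns
                if tokens[k] < 0:
                    raise ValueError(f"{actor} fired without enough input tokens")
        for k, (src, snk, prd, cns) in enumerate(edges):
            if src == actor:
                tokens[k] += prd
                peak[k] = max(peak[k], tokens[k])

    def run(items):
        for item in items:
            if isinstance(item, tuple):
                count, body = item
                for _ in range(count):
                    run(body)
            else:
                fire(item)

    run(parse_looped_schedule(schedule_text))
    return sum(peak)


CD_DAT = [("A", "B", 1, 1), ("B", "C", 2, 3), ("C", "D", 4, 7),
          ("D", "E", 5, 7), ("E", "F", 4, 1)]
for name, sched in [("Flat", "(147A)(147B)(98C)(56D)(40E)(160F)"),
                    ("APGAN", "(49(3AB)(2C))(8(7D)(5E(4F)))"),
                    ("DPPO", "(7(7(3AB)(2C))(8D))(40E(4F))")]:
    print(name, buffer_size(sched, CD_DAT))
```

Under these assumptions the Flat, APGAN, and DPPO schedules evaluate to 1273, 438, and 347 tokens, matching the corresponding rows of the comparison.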

Representative Dataflow Analyses and Optimizations
• Bounded memory and deadlock detection: consistency
• Buffer minimization: minimize communication cost
• Multirate loop scheduling: optimize code/data trade-off
• Parallel scheduling and pipeline configuration
• Heterogeneous task mapping and co-synthesis
• Quasi-static scheduling: minimize run-time overhead
• Probabilistic design: adapt system resources and exploit slack
• Data partitioning: exploit parallel data memories
• Vectorization: improve context switching, pipelining
• Synchronization optimization: self-timed implementation
• Clustering of actors into atomic scheduling units

Multiprocessor Scheduling
• Multiprocessor scheduling problem:
  • Actor assignment (mapping)
  • Actor ordering
  • Actor invocation
• Approaches to each of these tend to be platform specific
• Tools can be brought under a common formal umbrella

Multiprocessor Scheduling

[Figure: an application model G(V, E, t(v), C(e)) passed to mapping/scheduling]

Multiprocessor Mapping

[Figure: the application model G(V, E, t(v), C(e)) mapped onto processors P1, P2, P3, P4]

Invocation Example: Self-Timed (ST) scheduling

[Figure: application graph with actors A through H assigned to processors Proc 1 through Proc 5]

Execution times: A, B, F: 3; C, H: 5; D: 6; E: 4; G: 2

[Figure: Gantt chart for ST schedule across Proc 1 through Proc 5, with a time axis up to 18 and T_ST = 9]

• Assignment and ordering performed at compile-time.
• Invocation performed at run-time (via synchronization)

Multicore Schedules
• Traditional multicore scheduling
  • Convert application DAG to Homogeneous Synchronous Dataflow (HSDF)
  • Perform HSDF mapping
  • Problem: exponential graph explosion
• Our solution:
  • A single processor schedule (SPS) represented as a generalized schedule tree (GST)
  • Generate an equivalent multiprocessor schedule (MPS) to be represented as a forest of GSTs.

Traditional Dataflow Multiprocessor Scheduling (MPS)

[Figure: left, Synchronous Dataflow (SDF) representation of application with actors A, B, C; right, Homogeneous SDF representation of the same application, with replicated copies of A and B and a rate of 1 on every edge]
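The growth that comes with the HSDF conversion can be seen by expanding an SDF graph directly: every actor v becomes q[v] nodes, and each token on an edge links the firing that produces it to the firing that consumes it. The sketch below is a minimal version that assumes edges carry no delays; `hsdf_expand` is a hypothetical helper, and the CD-to-DAT rates and repetition counts from the sample rate conversion example are used as input.

```python
def hsdf_expand(q, edges):
    """Expand an SDF graph (without delays) into its homogeneous form.

    q maps each actor to its repetition count; edges are (src, snk, prd, cns).
    Returns (nodes, arcs) where a node is (actor, firing_index).
    """
    nodes = [(v, i) for v, count in q.items() for i in range(count)]
    arcs = set()
    for src, snk, prd, cns in edges:
        total_tokens = q[src] * prd           # equals q[snk] * cns by the balance equations
        for k in range(total_tokens):
            # Token k is produced by firing k // prd of src
            # and consumed by firing k // cns of snk.
            arcs.add(((src, k // prd), (snk, k // cns)))
    return nodes, arcs


CD_DAT_Q = {"A": 147, "B": 147, "C": 98, "D": 56, "E": 40, "F": 160}
CD_DAT_EDGES = [("A", "B", 1, 1), ("B", "C", 2, 3), ("C", "D", 4, 7),
                ("D", "E", 5, 7), ("E", "F", 4, 1)]
nodes, arcs = hsdf_expand(CD_DAT_Q, CD_DAT_EDGES)
print(len(CD_DAT_Q), "SDF actors ->", len(nodes), "HSDF nodes")  # 6 SDF actors -> 648 HSDF nodes
```

The node count is the sum of the repetition vector, so it depends on the rates rather than on the number of actors, which is why the expanded graph can become very large for multirate applications.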

GST Representation for MPS - Simple Example

[Figure: (a) An SDF graph; (b) SPS as a GST; (c) MPS represented as a forest of GSTs on processors P1, P2, P3]

Demonstration on GPUs: Start with parallel actors
• Within an actor (FIR Filter):

  $y[n] = \sum_{i=0}^{N} b_i \, x[n-i]$

• Limitation (IIR Filter):

  $y[n] = \sum_{i=0}^{P} b_i \, x[n-i] - \sum_{j=1}^{Q} a_j \, y[n-j]$
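The contrast between the two filters can be made concrete with a plain Python sketch. In the FIR case every output sample depends only on inputs, so each one could be handed to its own GPU thread; in the IIR case each output needs earlier outputs, which forces sequential evaluation. The function names are hypothetical, the code is a reference illustration rather than the CUDA kernels measured in this talk, and the IIR sign convention (feedback terms subtracted, a[0] taken as 1 and ignored) is an assumption.

```python
def fir(b, x):
    """y[n] = sum_i b[i] * x[n - i]; samples before the start are treated as zero."""
    # Each list element is computed independently of the others,
    # so the outer loop over n is trivially parallel.
    return [sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
            for n in range(len(x))]


def iir(b, a, x):
    """y[n] = sum_i b[i] * x[n - i] - sum_{j>=1} a[j] * y[n - j] (a[0] is ignored)."""
    y = []
    for n in range(len(x)):
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        # The feedback term reads outputs already computed in earlier iterations,
        # so iteration n cannot start before iteration n - 1 has finished.
        acc -= sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0)
        y.append(acc)
    return y


print(fir([1, 1], [1, 2, 3]))             # [1, 3, 5]
print(iir([1], [1, -0.5], [1, 0, 0, 0]))  # [1, 0.5, 0.25, 0.125]
```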

Individual Actor Results: CUDA FIR vs. Stock GR FIR

Individual Actor Results: Turbo Block Decode

Future Direction: Tackling the general MP scheduling problem with dataflow analysis
• Many dataflow analysis techniques are available once the problem is well defined in dataflow terms
• Maximize multicore utilization by replicating and fusing actors/blocks
  • Stateless vs. stateful
  • Computation to communication ratios
  • Firing rates/execution times to number of blocks
• Once application is mapped to blocks/processors
  • Single processor scheduling to minimize buffering
• Focus first on MP Scheduling for GPUs
  • Blocks
  • Threads
  • Memory

Refine to a simpler question: When to off-load onto a GPU?
• Given:
  • An application graph
  • Actor timing characteristics for communication and computation
  • A target architecture with heterogeneous multiprocessing
• Find optimal implementation
  • Latency
  • Throughput

[Figure: application graph with nodes 1 through 9 to be partitioned between GPU and CPU]
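A toy version of the off-load question can be posed as an exhaustive search. The sketch below assumes a simple chain of actors, a latency-only objective, and a fixed transfer cost whenever consecutive actors sit on different devices. All timing numbers, the actor names, and the `best_mapping` helper are invented for illustration and are not measurements or methods from this work.

```python
from itertools import product


def best_mapping(actors, cpu_time, gpu_time, transfer_time):
    """Try every CPU/GPU assignment of a chain of actors and return the fastest.

    Returns (latency, mapping) where mapping is a tuple of "CPU"/"GPU" labels.
    """
    best = None
    for mapping in product(("CPU", "GPU"), repeat=len(actors)):
        # Computation cost of each actor on its assigned device.
        latency = sum(cpu_time[a] if dev == "CPU" else gpu_time[a]
                      for a, dev in zip(actors, mapping))
        # Communication cost for every edge that crosses between devices.
        crossings = sum(d1 != d2 for d1, d2 in zip(mapping, mapping[1:]))
        latency += transfer_time * crossings
        if best is None or latency < best[0]:
            best = (latency, mapping)
    return best


# Hypothetical timings: the filter is much faster on the GPU, the I/O actors are not.
actors = ["src", "fir", "sink"]
cpu_time = {"src": 1, "fir": 10, "sink": 1}
gpu_time = {"src": 6, "fir": 2, "sink": 6}

print(best_mapping(actors, cpu_time, gpu_time, transfer_time=3))
# (10, ('CPU', 'GPU', 'CPU'))
print(best_mapping(actors, cpu_time, gpu_time, transfer_time=5))
# (12, ('CPU', 'CPU', 'CPU'))
```

The second call shows the point of the question: the same GPU speedup stops paying off once the communication cost grows. Exhaustive search is only workable for tiny graphs, since the number of assignments doubles with every actor.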

Summary

• Model Based Design
• Dataflow Interchange Format
• Multiprocessor Scheduling
• Preliminary Setup and Results with GPUs
• Future Directions

Date post:	23-Feb-2016