+ All Categories
Home > Documents > IBM Research: Software Technology © 2006 IBM Corporation Programming Technologies 1 Determinate...

IBM Research: Software Technology © 2006 IBM Corporation Programming Technologies 1 Determinate...

Date post: 01-Jan-2016
Category:
Upload: jasper-fields
View: 218 times
Download: 0 times
Share this document with a friend
41
IBM Research: Software Technology © 2006 IBM Corporation 1 Programming Technologies Determinate Imperative Programming Vijay Saraswat, Radha Jagadeesan, Armando Solar-Lezama, Christoph von Praun November, 2006 IBM Research This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004
Transcript

IBM Research: Software Technology

© 2006 IBM Corporation1

Pro

gram

min

g T

echn

olog

ies

Determinate Imperative Programming

Vijay Saraswat, Radha Jagadeesan, Armando Solar-Lezama, Christoph von Praun

November, 2006IBM Research

This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004

IBM Research: Software Technology

© 2006 IBM Corporation2

Pro

gram

min

g T

echn

olog

ies

Outline

Problem: – Many concurrent imperative

programs are determinate.– Determinacy is not

apparent from the syntax.– Design a language in which

all programs are guaranteed determinate.

Basic idea– A variable is the stream of

values written to it by a thread.

Many examples

Semantics

Implementation

Future work

IBM Research: Software Technology

© 2006 IBM Corporation3

Pro

gram

min

g T

echn

olog

ies

Acknowledgments

Recent Publications

1. "Concurrent Clustered Programming", V. Saraswat, R. Jagadeesan. CONCUR conference, August 2005.

2. "X10: An Object-Oriented Approach to Non-Uniform Cluster Computing", P. Charles, C. Donawa, K. Ebcioglu, C. Grothoff, A. Kielstra, C. von Praun, V. Saraswat, V. Sarkar. OOPSLA Onwards! conference, October 2005.

3. “A Theory of Memory Models”, V Saraswat, R Jagadeesan, M. Michael, C. von Praun, to appear PPoPP 2007.

4. “Experiences with an SMP Implementation for X10 based on the Java Concurrency Utilities Rajkishore Barik, Vincent Cave, Christopher Donawa, Allan Kielstra,Igor Peshansky, Vivek Sarkar. Workshop on Programming Models for Ubiquitous Parallelism (PMUP), September 2006.

5. "X10: an Experimental Language for High Productivity Programming of Scalable Systems", K. Ebcioglu, V. Sarkar, V. Saraswat. P-PHEC workshop, February 2005.

Tutorials TiC 2006, PACT 2006, OOPSLA06

X10 Core Team– Rajkishore Barik– Vincent Cave– Chris Donawa– Allan Kielstra– Igor Peshansky– Christoph von Praun – Vijay Saraswat – Vivek Sarkar– Tong Wen

X10 Tools– Philippe Charles – Julian Dolby– Robert Fuhrer– Frank Tip– Mandana Vaziri

Emeritus– Kemal Ebcioglu– Christian Grothoff

Research colleagues– R. Bodik, G. Gao, R. Jagadeesan,

J. Palsberg, R. Rabbah, J. Vitek– Several others at IBM

IBM Research: Software Technology

© 2006 IBM Corporation4

Pro

gram

min

g T

echn

olog

ies

A new era of mainstream parallel processing

The Challenge Parallelism scaling replaces frequency scaling as foundation for increased performance Profound impact on future software

Multi-core chips Cluster ParallelismHeterogeneous Parallelism

16B/cycle (2x)16B/cycle

BIC

FlexIOTM

MIC

Dual XDRTM

16B/cycle

EIB (up to 96B/cycle)

16B/cycle

64-bit Power Architecture with VMX

PPE

SPE

LS

SXUSPU

SMF

PXUL1

PPU

16B/cycle

L232B/cycle

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

SMF

16B/cycle (2x)16B/cycle

BIC

FlexIOTM

MIC

Dual XDRTM

16B/cycle

EIB (up to 96B/cycle)

16B/cycle

64-bit Power Architecture with VMX

PPE

SPE

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

LS

SXUSPU

SMF

PXUL1

PPU

16B/cycle

PXUL1

PPU

16B/cycle

L232B/cycle

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

SMF

LS

SXUSPU

LS

SXUSPU

SMF. . .

L2 Cache

PEs,L1 $

PEs,L1 $

. . .

. . .. . .

L2 Cache

PEs,L1 $

PEs,L1 $

. . .

Memory

PEs,

SMP NodePEs,

. . . . . .

Memory

PEs,

SMP NodePEs,

Interconnect

Our response: Use X10 as a new language for parallel hardware that builds onexisting tools, compilers, runtimes, virtual machines and libraries

IBM Research: Software Technology

© 2006 IBM Corporation5

Pro

gram

min

g T

echn

olog

ies

Server Trends: Concurrency, Distribution, Heterogeneity at all levels

2 processors

2 chips, 1x2x14 processors

11.2 GF/s1.0 GB

16 compute cards(16 compute, 0-2 IO

cards)64 processors

180 GF/s16 GB

32 Node Cards2048 processors

5.6 TF/s512 GB

20 KWatts1 m2 footprint

64 Racks, 64x32x32131,072 processors

360 TF/s32 TB

1.3M Watts

SystemRack

Mode Card

Compute Card

Chip

Network

Apps

Servers

Workload

Appliance

HPCScale Out

CommercialScale Out

Blade

Multi-CoreChip

Sh

ared

Ad

min

istr

ativ

e D

om

ain

IBM Research: Software Technology

© 2006 IBM Corporation6

Pro

gram

min

g T

echn

olog

ies

The X10 Programming Model

Place = collection of residentactivities & objects

Storage classes Immutable Data PGAS

– Local Heap– Remote Heap

Activity Local

Locality RuleAny access to a mutable datum must be performed by a local activity remote dataaccesses can be performed by creating remote activities

Ordering Constraints (Memory Model)Locally Synchronous: Guaranteed coherence for local heap Sequential consistency

Globally Asynchronous: No ordering of inter-place activities use explicit synchronization for coherence

Few concepts, done right.

IBM Research: Software Technology

© 2006 IBM Corporation7

Pro

gram

min

g T

echn

olog

ies

X10 v0.41 Cheat sheet

Stm:

async [ ( Place ) ] [clocked ClockList ] Stm

when ( SimpleExpr ) Stm

finish Stm

next; c.resume() c.drop()

for( i : Region ) Stm

foreach ( i : Region ) Stm

ateach ( I : Distribution ) Stm

Expr:

ArrayExpr

ClassModifier : Kind

MethodModifier: atomic

DataType:

ClassName | InterfaceName | ArrayType

nullable DataType

future DataType

Kind :

value | reference

x10.lang has the following classes (among others)

point, range, region, distribution, clock, array

Some of these are supported by special syntax.Forthcoming support: closures, generics, dependent types, place types, implicit syntax, array literals.

IBM Research: Software Technology

© 2006 IBM Corporation8

Pro

gram

min

g T

echn

olog

ies

X10 v0.41 Cheat sheet: Array supportArrayExpr:

new ArrayType ( Formal ) { Stm }

Distribution Expr -- Lifting

ArrayExpr [ Region ] -- Section

ArrayExpr | Distribution -- Restriction

ArrayExpr || ArrayExpr -- Union

ArrayExpr.overlay(ArrayExpr) -- Update

ArrayExpr. scan( [fun [, ArgList] )

ArrayExpr. reduce( [fun [, ArgList] )

ArrayExpr.lift( [fun [, ArgList] )

ArrayType:

Type [Kind] [ ]

Type [Kind] [ region(N) ]

Type [Kind] [ Region ]

Type [Kind] [ Distribution ]

Region:

Expr : Expr -- 1-D region

[ Range, …, Range ] -- Multidimensional Region

Region && Region -- Intersection

Region || Region -- Union

Region – Region -- Set difference

BuiltinRegion

Dist:

Region -> Place -- Constant distribution

Distribution | Place -- Restriction

Distribution | Region -- Restriction

Distribution || Distribution -- Union

Distribution – Distribution -- Set difference

Distribution.overlay ( Distribution )

BuiltinDistribution

Language supports type safety, memory safety, place safety, clock safety.

IBM Research: Software Technology

© 2006 IBM Corporation9

Pro

gram

min

g T

echn

olog

ies

Memory Model

X10 v 0.41 specifies sequential consistency per place.– Not workable.

We are considering a weaker memory model.

Built on the notion of atomic: identify a step as the basic building block.– A step is a partial write

function. Use links for non hb-reads.

A process is a pomset of steps closed under certain transformations:– Composition– Decomposition– Augmentation– Linking– Propagation

There may be opportunity for a weak notion of atomic: decouple atomicity from ordering.

Please see: http://www.saraswat.org/rao.html

Correctly synchronized programs behave as SC.

Correctly synchronized programs= programs whose SC executions have no races.

IBM Research: Software Technology

© 2006 IBM Corporation10

Pro

gram

min

g T

echn

olog

ies

async

async (P) S Creates a new child activity

at place P, that executes statement S

Returns immediately S may reference final

variables in enclosing blocks Activities cannot be named Activity cannot be aborted or

cancelled

// global dist. arrayfinal double a[D] = …; final int k = …;

async ( a.distribution[99] ) { // executed at A[99]’s // place atomic a[99] = k; }

Stmt ::= async PlaceExpSingleListopt Stmt

cf Cilk’s spawn

Memory model: hb edge between stm before async and start of async.

IBM Research: Software Technology

© 2006 IBM Corporation11

Pro

gram

min

g T

echn

olog

ies

finish

finish S Execute S, but wait until all (transitively)

spawned asyncs have terminated.

Rooted exception model Trap all exceptions thrown by spawned

activities. Throw an (aggregate) exception if any

spawned async terminates abruptly. implicit finish at main activity

finish is useful for expressing “synchronous” operations on (local or) remote data.

finish ateach(point [i]:A) A[i] = i;

finish async (A.distribution [j]) A[j] = 2;

// all A[i]=i will complete // before A[j]=2;

Stmt ::= finish Stmt

cf Cilk’s sync

Memory model: hb edge between last stm of each async and stm after finish S.

IBM Research: Software Technology

© 2006 IBM Corporation12

Pro

gram

min

g T

echn

olog

ies

foreach

foreach (point p: R) S Creates |R| async statements in parallel at current place.

Termination of all (recursively created) activities can be ensured with finish.

finish foreach is a convenient way to achieve master-slave fork/join parallelism (OpenMP programming model)

foreach ( FormalParam: Expr ) Stmt

for (point p: R) async { S }

foreach (point p:R) S

IBM Research: Software Technology

© 2006 IBM Corporation13

Pro

gram

min

g T

echn

olog

ies

atomic

Atomic blocks are conceptually executed in a single step while other activities are suspended: isolation and atomicity.

An atomic block ...– must be nonblocking– must not create concurrent

activities (sequential)– must not access remote data

(local) // push data onto concurrent // list-stackNode node = new Node(data);atomic { node.next = head; head = node; }

// target defined in lexically// enclosing scope.atomic boolean CAS(Object old, Object new) { if (target.equals(old)) { target = new; return true; } return false;}

Stmt ::= atomic StatementMethodModifier ::= atomic

Memory model: end of tx hb start of next tx in the same place.

IBM Research: Software Technology

© 2006 IBM Corporation14

Pro

gram

min

g T

echn

olog

ies

Clocks: Motivation

Activity coordination using finish and force() is accomplished by checking for activity termination

However, there are many cases in which a producer-consumer relationship exists among the activities, and a “barrier”-like coordination is needed without waiting for activity termination– The activities involved may be in the same place or in different places

Activity 0 Activity 1 Activity 2 . . .

Phase 0

Phase 1

. . .

IBM Research: Software Technology

© 2006 IBM Corporation15

Pro

gram

min

g T

echn

olog

ies

Clocks (1/2)

clock c = clock.factory.clock(); Allocate a clock, register current activity with it. Phase 0 of c starts.

async(…) clocked (c1,c2,…) Sateach(…) clocked (c1,c2,…) Sforeach(…) clocked (c1,c2,…) S Create async activities registered on clocks c1, c2, …

c.resume(); Nonblocking operation that signals completion of work by current

activity for this phase of clock c

next; Barrier --- suspend until all clocks that the current activity is registered

with can advance. c.resume() is first performed for each such clock, if needed.

Next can be viewed like a “finish” of all computations under way in the current phase of the clock

IBM Research: Software Technology

© 2006 IBM Corporation16

Pro

gram

min

g T

echn

olog

ies

Clocks (2/2)

c.drop(); Unregister with c. A terminating

activity will implicitly drop all clocks that it is registered on.

c.registered() Return true iff current activity is

registered on clock c c.dropped() returns the opposite

of c.registered() Activity is deregistered from a clock

when it terminates.

Static semantics– An activity may operate only on

those clocks it is registered with.– In finish S,S may not contain

any (top-level) clocked asyncs.

Dynamic semantics– A clock c can advance only when

all its registered activities have executed c.resume().

– An activity may not pass-on clocks on which it is not live to sub-activities. ClockUseException

– An activity may not transmit a clock into the scope of a finish. ClockUseException

No explicit operation to register a clock.

Memory model: hb edge between next stm of all registered activities on c, and the subsequent stm in each activity.

IBM Research: Software Technology

© 2006 IBM Corporation17

Pro

gram

min

g T

echn

olog

ies

Example (TutClock1.x10)finish async { final clock c = clock.factory.clock(); foreach (point[i]: [1:N]) clocked (c) { while ( true ) { int old_A_i = A[i]; int new_A_i = Math.min(A[i],B[i]); if ( i > 1 ) new_A_i = Math.min(new_A_i,B[i-1]); if ( i < N ) new_A_i = Math.min(new_A_i,B[i+1]); A[i] = new_A_i; next; int old_B_i = B[i]; int new_B_i = Math.min(B[i],A[i]); if ( i > 1 ) new_B_i = Math.min(new_B_i,A[i-1]); if ( i < N ) new_B_i = Math.min(new_B_i,A[i+1]);

B[i] = new_B_i; next; if ( old_A_i == new_A_i && old_B_i == new_B_i ) break; } // while } // foreach } // finish async

parent transmits clock to child

exiting from while loop terminates activity for iteration i, and automatically deregisters activity from clock

IBM Research: Software Technology

© 2006 IBM Corporation18

Pro

gram

min

g T

echn

olog

ies

Clocked final variables

Permit variables to be marked as clocked final, e.g.

clocked(c) final double[.] a = …

In each phase of the clock, the variable is immutable.

Writes for such a variable are performed on a shadow copy.

Main copy of variable updated with value in shadow copy when clock moves to the next phase.

Clocked final variables cannot introduce non-determinism.– Assuming multiple writers write

the same value in each phase.

IBM Research: Software Technology

© 2006 IBM Corporation19

Pro

gram

min

g T

echn

olog

ies

clocked (c) final int [0:M-1,0:N-1] G = …;

finish foreach (int i,j in [1:M-1,1:N-1]) clocked (c) {

for (int p in [0:TimeStep-1]) {

G[i,j] = omega/4*(G[i-1,j]+G[i+1,j]+G[i,j-1]+G[i,j+1])+(1-omega)*G[i,j];

next;

}

}

Clocked final example: Array relaxationG elements are assigned to at most once in each phase of clock c.

Wait for clock to advance.

Read current value of cell.

Each activity is registered on c.

Write visible (only) when clock advances.

Value written into ghost copy of G[i,j]

IBM Research: Software Technology

© 2006 IBM Corporation20

Pro

gram

min

g T

echn

olog

ies

Imperative Programming Revisited

Variables– Variable=Value in a Box– Read: fetch current value– Write: change value– Stability condition: Value does

not change unless a write is performed

Very powerful– Permit repeated many-writer,

many-reader communication through arbitrary reference graphs

– Mutability in the presence of sharing

– Permits different variables to change at different rates.

Asynchrony introduces indeterminacy

May write out either 0 or 1.

Bugs due to races are very difficult to debug.

int x = 0;

async x=1;

print(x);

Reader-reader, reader-writer, writer-writer conflicts.

IBM Research: Software Technology

© 2006 IBM Corporation21

Pro

gram

min

g T

echn

olog

ies

Determinate programming design patterns

NAS parallel benchmarks– Conjugate gradient– Multigrid– LU factorization

Stencil computations– Jacobi, SOR

Molecular dynamics

Detecting stable properties– Clocks!– Short circuit technique

Single producer, multiple copying consumers– Kahn networks, StreamIt– Pipelining

Graph algorithms– Connected components

Parallelism for performance/scaling, not control.

IBM Research: Software Technology

© 2006 IBM Corporation22

Pro

gram

min

g T

echn

olog

ies

Determinate programming anti-patterns

Reactive computing: arrival-order indeterminism– “Races in the world”– E.g. Bank accounts

Resource contention: any of several possible outcomes is acceptable– Mutual exclusion– Load balancing

• Shared work list

Algorithm may permit any one of many possible solutions– One solution for N-queens– Some minimal spanning tree– Some Delauney triangulation

But the program may still contain determinate concurrent components.

IBM Research: Software Technology

© 2006 IBM Corporation23

Pro

gram

min

g T

echn

olog

ies

Determinate Concurrent Imperative frameworks

Asynchronous Kahn networks– Nodes can be thought of as

(continuous) functions over streams.

– Pop/peek– Push– Node-local state may

mutate arbitrarily

Concurrent Constraint Programming– Tell constraints– Ask if a constraint is true– Subsumes Kahn networks

(dataflow).– Subsumes (det) concurrent

logic programming, lazy functional programming

Do not support arbitrary mutable variables.

IBM Research: Software Technology

© 2006 IBM Corporation24

Pro

gram

min

g T

echn

olog

ies

Determinate Concurrent Imperative Frameworks

Safe Asynchrony (Steele 1991)– Parent may communicate

with children.– Children may communicate

with parent.– Siblings may communicate

with each other only through commutative, associative writes (“commuting writes”).

int x=0;

finish foreach (int i in 1:N) {

x += i;

}

print(x); // N*(N+1)/2

int x=0;

finish foreach (int i in 1:N) {

x += i;

async print(x);

}

Good:

Bad:

Useful but limited. Does not permit dataflow synch.

IBM Research: Software Technology

© 2006 IBM Corporation25

Pro

gram

min

g T

echn

olog

ies

Determinate X10

Stm:

async [ ( Place ) ] [clocked ClockList ] Stm

when ( SimpleExpr ) Stm

finish Stm

next; c.resume() c.drop()

for( i : Region ) Stm

foreach ( i : Region ) Stm

ateach ( I : Distribution ) Stm

Expr:

ArrayExpr

ClassModifier : Kind

MethodModifier: atomic

DataType:

ClassName | InterfaceName | ArrayType

nullable DataType

future DataType

local DataType

det DataType

indet DataType

Kind :

value | reference

Constructs not available

Constructs added.

IBM Research: Software Technology

© 2006 IBM Corporation26

Pro

gram

min

g T

echn

olog

ies

local variables

Instances of value classes always considered local.

A mutable object is local only if it is marked as local when created:– new local T(…)

A value of a local type can be assigned only into a variable of local type, or a field of a local object.

Variables of a local type are not visible to contained asyncs.

Local objects of terminated activity become local objects of parent.

An async spawned in a finish may assign a value of a local type to a local variable of the parent activity.

A value may be cast to local T; the cast may fail.– E.g. local T x = (local T) this;

Invariant: Each activity owns the local objects and local variables it creates. Only local objects can reference local objects.

Ownership type system used to maintain locality.

IBM Research: Software Technology

© 2006 IBM Corporation27

Pro

gram

min

g T

echn

olog

ies

det locations

A det location is represented in memory as a stream (indexed sequence) of immutable values.

Each activity maintains an index i + clean/dirty bit for every det location. – Initially i=1, v[0] contains initial

value.– Read: If clean, block until v[i] is

written and return v[i++] else return v[i-1]. Mark as clean.

– Write: Write into v[i++]. Mark as dirty.

Note: index updated only as a result of activity’s operations.

World Map = Collection of indices for an activity.

Index transmission rules.– async: Activity initialized with

current world map of parent activity.

– finish: world map of activity is lubbed with world map of finished activities.

– (clean lub dirty = dirty)

The clock of clocked final is made implicit.

IBM Research: Software Technology

© 2006 IBM Corporation28

Pro

gram

min

g T

echn

olog

ies

Indet locations

Can be recovered as det locations + a mutable shared index (“current”).

All activities read and update location through current.

Therefore stream representation is not necessary, only the “current” value need be kept.

An activity’s world map does not need to contain index for indet locations.

IBM Research: Software Technology

© 2006 IBM Corporation29

Pro

gram

min

g T

echn

olog

ies

det example: Array relaxation

det int [0:M-1,0:N-1] G = …;

finish foreach (local int i,j in [1:M-1,1:N-1]) {

for (local int p in [0:TimeStep-1]) {

G[i,j] = omega/4*(G[i-1,j]+G[i+1,j]+G[i,j-1]+G[i,j+1])+(1-omega)*G[i,j];

}

}

All clock manipulations are implicit.

IBM Research: Software Technology

© 2006 IBM Corporation30

Pro

gram

min

g T

echn

olog

ies

Some simple examples

det int x=0;

finish {

async {

int r1 = x; int r2 = x;

println(r1); println(r2);

}

async {x=1;x=2;}

}

0

1

Only one result – independent of the scheduler!

i x A1 A2

0 0 read r1

1 1 read r2 write 1

2 2 write 2

Convention: A type not marked det is assumed marked local.

IBM Research: Software Technology

© 2006 IBM Corporation31

Pro

gram

min

g T

echn

olog

ies

Some simple examples

det int x=0;

finish {

async {int r1 = x; int r2 = x; println(r1); println(r2);}

async {x=1;}

async {x=1; int r3 = x; async {x=2;}}

}

println(x);

All programs are determinate.

0

1

2

i x A1 (0) A2 (0) A3 (0) A4 (2)

0 0 read r1

1 1 read r2 write 1 write 1; read r3

2 2 write 2

IBM Research: Software Technology

© 2006 IBM Corporation32

Pro

gram

min

g T

echn

olog

ies

Some StreamIt examples

void -> void pipeline Minimal {

add IntSource;

add IntPrinter;

}

void ->int filter IntSource {

int x;

init {x=0;}

work push 1 { push(x++);}

}

int->void filter IntPrinter {

work pop 1 { print(pop());}

}

det int x=0;

async while (true) x++;

async while (true) println(x);

StreamIt0

1

The communication is through assignment to x, so the same result is obtained with:

det int x=0;

async while (true) ++x;

async while (true) println(x);

0

1

Det X10

Each shared variable is a multi-reader, multi-writer stream.

IBM Research: Software Technology

© 2006 IBM Corporation33

Pro

gram

min

g T

echn

olog

ies

Some StreamIt examples: fibonacci

det int x=1, y=1;

async while (true) y=x;

async while (true) x+=y;

i y x

0 1 1

1 1 2

2 2 3

3 3 5

… … …

Activity 1

Activity 2

Can express any recursive, asynchronous Kahn network.

IBM Research: Software Technology

© 2006 IBM Corporation34

Pro

gram

min

g T

echn

olog

ies

StreamIt examples: Moving Average

void->void pipeline MovingAverage {

add intSource();

add Averager(10);

add IntPrinter();

}

int->int filter Average(int n) {

work pop 1 push 1 peek n {

int sum=0;

for (int i=0; i < n; i++)

sum += peek(i);

push(sum/n);

pop();

}

}

det int y=0;

det int x=0; async while (true) x++;

async while (true) {

int sum=x;

for (int i in 1:N-1) sum += peek(x, i);

y = sum/N;

}

• peek(x, i) reads the i’th future value, without popping it. Blocks if necessary.

IBM Research: Software Technology

© 2006 IBM Corporation35

Pro

gram

min

g T

echn

olog

ies

Canon matrix multiplication

void canon (det double[N,N] c, det double[N,N] a,

det double[N,N] b) {

finish foreach (int i,j in [0:N-1,0:N-1]) {

a[i,j] = a[i,(j+1) % N];

b[i,j] = b[(i+j)%N, j];

}

for (int k in [0:N-1])

finish foreach (int i,j in [0:N-1,0:N-1]) {

c[i,j] = c[i+j] + a[i,j]*b[i,j];

a[i,j] = a[i,(j+1)%N];

b[i,j] = b[(i+1)%N, j];

}

}The natural sequential program works (for finish foreach).

IBM Research: Software Technology

© 2006 IBM Corporation36

Pro

gram

min

g T

echn

olog

ies

Implementation

Each activity’s world map increases monotonically with time.

Use garbage collection to erase past unreachable values.

Programs with no sibling communication may be executed in buffers with unit windows.

Considering permitting user to specify bounds on variables (cf push/pop specifications in StreamIt).– This will force writes to

become blocking as well.

Scheduling strategy affects size of buffers, not result.

IBM Research: Software Technology

© 2006 IBM Corporation37

Pro

gram

min

g T

echn

olog

ies

Future work

Formalization– MJ/CF– Very straightforward additions

to field read/write.– Paper contains details.

Paper contains ideas on detecting deadlock (stabilities) at runtime and recovering from them. – Programmability being

investigated.– Devise static type system to

establish deadlock-freedom.

Implementation.– Leverage connection with

StreamIt, and static scheduling.

Coarser granularity for indices.– Use same clock for many

variables.– Permits “coordinated” changes

to multiple variables.

Introduce fusion operation (x -> y) to support CCP.

IBM Research: Software Technology

© 2005 IBM Corporation38

Pro

gram

min

g T

echn

olog

ies

Backup

IBM Research: Software Technology

© 2006 IBM Corporation39

Pro

gram

min

g T

echn

olog

ies

StreamIt examples: Bandpass filter

float->float pipeline BandPassFilter(float rate,

float low, float high, int taps) {

add BPFCore(rate, low, high, taps);

add Subtracter();}

float ->float splitjoin BPFCore

(float rate, float low,

float high, int taps) {

split duplicate;

add LowPass(rate, low, taps, 0);

add LowPass(rate, high, taps, 0);

join roundrobin;}

float->float filter Subtracter {

Work pop 2 push 1 {

push(peek(1)-peek(0));

pop(); pop();}}

float bandPassFilter(float rate, float low,

float high, int taps, int in) {

int tmp=in;

det int in1=tmp, in2=tmp;

async while (true) in1=in;

async while (true) in2=in;

det int o1 = lowPass(rate, low, taps, 0, in1),

o2 = lowPass(rate, high, taps, 0, in2);

det int o = o1-o2;

async while(true) o = o1-o2;

return o;

}

Functions return streams.

IBM Research: Software Technology

© 2006 IBM Corporation40

Pro

gram

min

g T

echn

olog

ies

Histogram

Permit “commuting” writes to be performed simultaneously in the same phase.

Phase is completed when all activities that can write have written.

<int N> [1:N][] histogram([1:N][] A) {

final int[] B = new int [1:N];

finish foreach(int i in A) B[A[i]]++;

return B;

}

B’s phase is not yet complete. A subsequent read will complete it.

IBM Research: Software Technology

© 2006 IBM Corporation41

Pro

gram

min

g T

echn

olog

ies

Cilk programs with races

int x;

cilk void foo() {

x = x +1;

}

cilk int main() {

x=0;

spawn foo();

spawn foo();

sync;

printf(“x is \%d\n”, x);

return 0;

}

Determinate: Will always print 1 in CF.

CF smoothly combines Cilk and StreamIt.


Recommended