Date post: | 01-Jan-2016 |
Category: |
Documents |
Upload: | jasper-fields |
View: | 218 times |
Download: | 0 times |
IBM Research: Software Technology
© 2006 IBM Corporation1
Pro
gram
min
g T
echn
olog
ies
Determinate Imperative Programming
Vijay Saraswat, Radha Jagadeesan, Armando Solar-Lezama, Christoph von Praun
November, 2006IBM Research
This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004
IBM Research: Software Technology
© 2006 IBM Corporation2
Pro
gram
min
g T
echn
olog
ies
Outline
Problem: – Many concurrent imperative
programs are determinate.– Determinacy is not
apparent from the syntax.– Design a language in which
all programs are guaranteed determinate.
Basic idea– A variable is the stream of
values written to it by a thread.
Many examples
Semantics
Implementation
Future work
IBM Research: Software Technology
© 2006 IBM Corporation3
Pro
gram
min
g T
echn
olog
ies
Acknowledgments
Recent Publications
1. "Concurrent Clustered Programming", V. Saraswat, R. Jagadeesan. CONCUR conference, August 2005.
2. "X10: An Object-Oriented Approach to Non-Uniform Cluster Computing", P. Charles, C. Donawa, K. Ebcioglu, C. Grothoff, A. Kielstra, C. von Praun, V. Saraswat, V. Sarkar. OOPSLA Onwards! conference, October 2005.
3. “A Theory of Memory Models”, V Saraswat, R Jagadeesan, M. Michael, C. von Praun, to appear PPoPP 2007.
4. “Experiences with an SMP Implementation for X10 based on the Java Concurrency Utilities Rajkishore Barik, Vincent Cave, Christopher Donawa, Allan Kielstra,Igor Peshansky, Vivek Sarkar. Workshop on Programming Models for Ubiquitous Parallelism (PMUP), September 2006.
5. "X10: an Experimental Language for High Productivity Programming of Scalable Systems", K. Ebcioglu, V. Sarkar, V. Saraswat. P-PHEC workshop, February 2005.
Tutorials TiC 2006, PACT 2006, OOPSLA06
X10 Core Team– Rajkishore Barik– Vincent Cave– Chris Donawa– Allan Kielstra– Igor Peshansky– Christoph von Praun – Vijay Saraswat – Vivek Sarkar– Tong Wen
X10 Tools– Philippe Charles – Julian Dolby– Robert Fuhrer– Frank Tip– Mandana Vaziri
Emeritus– Kemal Ebcioglu– Christian Grothoff
Research colleagues– R. Bodik, G. Gao, R. Jagadeesan,
J. Palsberg, R. Rabbah, J. Vitek– Several others at IBM
IBM Research: Software Technology
© 2006 IBM Corporation4
Pro
gram
min
g T
echn
olog
ies
A new era of mainstream parallel processing
The Challenge Parallelism scaling replaces frequency scaling as foundation for increased performance Profound impact on future software
Multi-core chips Cluster ParallelismHeterogeneous Parallelism
16B/cycle (2x)16B/cycle
BIC
FlexIOTM
MIC
Dual XDRTM
16B/cycle
EIB (up to 96B/cycle)
16B/cycle
64-bit Power Architecture with VMX
PPE
SPE
LS
SXUSPU
SMF
PXUL1
PPU
16B/cycle
L232B/cycle
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
SMF
16B/cycle (2x)16B/cycle
BIC
FlexIOTM
MIC
Dual XDRTM
16B/cycle
EIB (up to 96B/cycle)
16B/cycle
64-bit Power Architecture with VMX
PPE
SPE
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
LS
SXUSPU
SMF
PXUL1
PPU
16B/cycle
PXUL1
PPU
16B/cycle
L232B/cycle
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
SMF
LS
SXUSPU
LS
SXUSPU
SMF. . .
L2 Cache
PEs,L1 $
PEs,L1 $
. . .
. . .. . .
L2 Cache
PEs,L1 $
PEs,L1 $
. . .
Memory
PEs,
SMP NodePEs,
. . . . . .
Memory
PEs,
SMP NodePEs,
Interconnect
Our response: Use X10 as a new language for parallel hardware that builds onexisting tools, compilers, runtimes, virtual machines and libraries
IBM Research: Software Technology
© 2006 IBM Corporation5
Pro
gram
min
g T
echn
olog
ies
Server Trends: Concurrency, Distribution, Heterogeneity at all levels
2 processors
2 chips, 1x2x14 processors
11.2 GF/s1.0 GB
16 compute cards(16 compute, 0-2 IO
cards)64 processors
180 GF/s16 GB
32 Node Cards2048 processors
5.6 TF/s512 GB
20 KWatts1 m2 footprint
64 Racks, 64x32x32131,072 processors
360 TF/s32 TB
1.3M Watts
SystemRack
Mode Card
Compute Card
Chip
Network
Apps
Servers
Workload
Appliance
HPCScale Out
CommercialScale Out
Blade
Multi-CoreChip
Sh
ared
Ad
min
istr
ativ
e D
om
ain
IBM Research: Software Technology
© 2006 IBM Corporation6
Pro
gram
min
g T
echn
olog
ies
The X10 Programming Model
Place = collection of residentactivities & objects
Storage classes Immutable Data PGAS
– Local Heap– Remote Heap
Activity Local
Locality RuleAny access to a mutable datum must be performed by a local activity remote dataaccesses can be performed by creating remote activities
Ordering Constraints (Memory Model)Locally Synchronous: Guaranteed coherence for local heap Sequential consistency
Globally Asynchronous: No ordering of inter-place activities use explicit synchronization for coherence
Few concepts, done right.
IBM Research: Software Technology
© 2006 IBM Corporation7
Pro
gram
min
g T
echn
olog
ies
X10 v0.41 Cheat sheet
Stm:
async [ ( Place ) ] [clocked ClockList ] Stm
when ( SimpleExpr ) Stm
finish Stm
next; c.resume() c.drop()
for( i : Region ) Stm
foreach ( i : Region ) Stm
ateach ( I : Distribution ) Stm
Expr:
ArrayExpr
ClassModifier : Kind
MethodModifier: atomic
DataType:
ClassName | InterfaceName | ArrayType
nullable DataType
future DataType
Kind :
value | reference
x10.lang has the following classes (among others)
point, range, region, distribution, clock, array
Some of these are supported by special syntax.Forthcoming support: closures, generics, dependent types, place types, implicit syntax, array literals.
IBM Research: Software Technology
© 2006 IBM Corporation8
Pro
gram
min
g T
echn
olog
ies
X10 v0.41 Cheat sheet: Array supportArrayExpr:
new ArrayType ( Formal ) { Stm }
Distribution Expr -- Lifting
ArrayExpr [ Region ] -- Section
ArrayExpr | Distribution -- Restriction
ArrayExpr || ArrayExpr -- Union
ArrayExpr.overlay(ArrayExpr) -- Update
ArrayExpr. scan( [fun [, ArgList] )
ArrayExpr. reduce( [fun [, ArgList] )
ArrayExpr.lift( [fun [, ArgList] )
ArrayType:
Type [Kind] [ ]
Type [Kind] [ region(N) ]
Type [Kind] [ Region ]
Type [Kind] [ Distribution ]
Region:
Expr : Expr -- 1-D region
[ Range, …, Range ] -- Multidimensional Region
Region && Region -- Intersection
Region || Region -- Union
Region – Region -- Set difference
BuiltinRegion
Dist:
Region -> Place -- Constant distribution
Distribution | Place -- Restriction
Distribution | Region -- Restriction
Distribution || Distribution -- Union
Distribution – Distribution -- Set difference
Distribution.overlay ( Distribution )
BuiltinDistribution
Language supports type safety, memory safety, place safety, clock safety.
IBM Research: Software Technology
© 2006 IBM Corporation9
Pro
gram
min
g T
echn
olog
ies
Memory Model
X10 v 0.41 specifies sequential consistency per place.– Not workable.
We are considering a weaker memory model.
Built on the notion of atomic: identify a step as the basic building block.– A step is a partial write
function. Use links for non hb-reads.
A process is a pomset of steps closed under certain transformations:– Composition– Decomposition– Augmentation– Linking– Propagation
There may be opportunity for a weak notion of atomic: decouple atomicity from ordering.
Please see: http://www.saraswat.org/rao.html
Correctly synchronized programs behave as SC.
Correctly synchronized programs= programs whose SC executions have no races.
IBM Research: Software Technology
© 2006 IBM Corporation10
Pro
gram
min
g T
echn
olog
ies
async
async (P) S Creates a new child activity
at place P, that executes statement S
Returns immediately S may reference final
variables in enclosing blocks Activities cannot be named Activity cannot be aborted or
cancelled
// global dist. arrayfinal double a[D] = …; final int k = …;
async ( a.distribution[99] ) { // executed at A[99]’s // place atomic a[99] = k; }
Stmt ::= async PlaceExpSingleListopt Stmt
cf Cilk’s spawn
Memory model: hb edge between stm before async and start of async.
IBM Research: Software Technology
© 2006 IBM Corporation11
Pro
gram
min
g T
echn
olog
ies
finish
finish S Execute S, but wait until all (transitively)
spawned asyncs have terminated.
Rooted exception model Trap all exceptions thrown by spawned
activities. Throw an (aggregate) exception if any
spawned async terminates abruptly. implicit finish at main activity
finish is useful for expressing “synchronous” operations on (local or) remote data.
finish ateach(point [i]:A) A[i] = i;
finish async (A.distribution [j]) A[j] = 2;
// all A[i]=i will complete // before A[j]=2;
Stmt ::= finish Stmt
cf Cilk’s sync
Memory model: hb edge between last stm of each async and stm after finish S.
IBM Research: Software Technology
© 2006 IBM Corporation12
Pro
gram
min
g T
echn
olog
ies
foreach
foreach (point p: R) S Creates |R| async statements in parallel at current place.
Termination of all (recursively created) activities can be ensured with finish.
finish foreach is a convenient way to achieve master-slave fork/join parallelism (OpenMP programming model)
foreach ( FormalParam: Expr ) Stmt
for (point p: R) async { S }
foreach (point p:R) S
IBM Research: Software Technology
© 2006 IBM Corporation13
Pro
gram
min
g T
echn
olog
ies
atomic
Atomic blocks are conceptually executed in a single step while other activities are suspended: isolation and atomicity.
An atomic block ...– must be nonblocking– must not create concurrent
activities (sequential)– must not access remote data
(local) // push data onto concurrent // list-stackNode node = new Node(data);atomic { node.next = head; head = node; }
// target defined in lexically// enclosing scope.atomic boolean CAS(Object old, Object new) { if (target.equals(old)) { target = new; return true; } return false;}
Stmt ::= atomic StatementMethodModifier ::= atomic
Memory model: end of tx hb start of next tx in the same place.
IBM Research: Software Technology
© 2006 IBM Corporation14
Pro
gram
min
g T
echn
olog
ies
Clocks: Motivation
Activity coordination using finish and force() is accomplished by checking for activity termination
However, there are many cases in which a producer-consumer relationship exists among the activities, and a “barrier”-like coordination is needed without waiting for activity termination– The activities involved may be in the same place or in different places
Activity 0 Activity 1 Activity 2 . . .
Phase 0
Phase 1
. . .
IBM Research: Software Technology
© 2006 IBM Corporation15
Pro
gram
min
g T
echn
olog
ies
Clocks (1/2)
clock c = clock.factory.clock(); Allocate a clock, register current activity with it. Phase 0 of c starts.
async(…) clocked (c1,c2,…) Sateach(…) clocked (c1,c2,…) Sforeach(…) clocked (c1,c2,…) S Create async activities registered on clocks c1, c2, …
c.resume(); Nonblocking operation that signals completion of work by current
activity for this phase of clock c
next; Barrier --- suspend until all clocks that the current activity is registered
with can advance. c.resume() is first performed for each such clock, if needed.
Next can be viewed like a “finish” of all computations under way in the current phase of the clock
IBM Research: Software Technology
© 2006 IBM Corporation16
Pro
gram
min
g T
echn
olog
ies
Clocks (2/2)
c.drop(); Unregister with c. A terminating
activity will implicitly drop all clocks that it is registered on.
c.registered() Return true iff current activity is
registered on clock c c.dropped() returns the opposite
of c.registered() Activity is deregistered from a clock
when it terminates.
Static semantics– An activity may operate only on
those clocks it is registered with.– In finish S,S may not contain
any (top-level) clocked asyncs.
Dynamic semantics– A clock c can advance only when
all its registered activities have executed c.resume().
– An activity may not pass-on clocks on which it is not live to sub-activities. ClockUseException
– An activity may not transmit a clock into the scope of a finish. ClockUseException
No explicit operation to register a clock.
Memory model: hb edge between next stm of all registered activities on c, and the subsequent stm in each activity.
IBM Research: Software Technology
© 2006 IBM Corporation17
Pro
gram
min
g T
echn
olog
ies
Example (TutClock1.x10)finish async { final clock c = clock.factory.clock(); foreach (point[i]: [1:N]) clocked (c) { while ( true ) { int old_A_i = A[i]; int new_A_i = Math.min(A[i],B[i]); if ( i > 1 ) new_A_i = Math.min(new_A_i,B[i-1]); if ( i < N ) new_A_i = Math.min(new_A_i,B[i+1]); A[i] = new_A_i; next; int old_B_i = B[i]; int new_B_i = Math.min(B[i],A[i]); if ( i > 1 ) new_B_i = Math.min(new_B_i,A[i-1]); if ( i < N ) new_B_i = Math.min(new_B_i,A[i+1]);
B[i] = new_B_i; next; if ( old_A_i == new_A_i && old_B_i == new_B_i ) break; } // while } // foreach } // finish async
parent transmits clock to child
exiting from while loop terminates activity for iteration i, and automatically deregisters activity from clock
IBM Research: Software Technology
© 2006 IBM Corporation18
Pro
gram
min
g T
echn
olog
ies
Clocked final variables
Permit variables to be marked as clocked final, e.g.
clocked(c) final double[.] a = …
In each phase of the clock, the variable is immutable.
Writes for such a variable are performed on a shadow copy.
Main copy of variable updated with value in shadow copy when clock moves to the next phase.
Clocked final variables cannot introduce non-determinism.– Assuming multiple writers write
the same value in each phase.
IBM Research: Software Technology
© 2006 IBM Corporation19
Pro
gram
min
g T
echn
olog
ies
clocked (c) final int [0:M-1,0:N-1] G = …;
finish foreach (int i,j in [1:M-1,1:N-1]) clocked (c) {
for (int p in [0:TimeStep-1]) {
G[i,j] = omega/4*(G[i-1,j]+G[i+1,j]+G[i,j-1]+G[i,j+1])+(1-omega)*G[i,j];
next;
}
}
Clocked final example: Array relaxationG elements are assigned to at most once in each phase of clock c.
Wait for clock to advance.
Read current value of cell.
Each activity is registered on c.
Write visible (only) when clock advances.
Value written into ghost copy of G[i,j]
IBM Research: Software Technology
© 2006 IBM Corporation20
Pro
gram
min
g T
echn
olog
ies
Imperative Programming Revisited
Variables– Variable=Value in a Box– Read: fetch current value– Write: change value– Stability condition: Value does
not change unless a write is performed
Very powerful– Permit repeated many-writer,
many-reader communication through arbitrary reference graphs
– Mutability in the presence of sharing
– Permits different variables to change at different rates.
Asynchrony introduces indeterminacy
May write out either 0 or 1.
Bugs due to races are very difficult to debug.
int x = 0;
async x=1;
print(x);
Reader-reader, reader-writer, writer-writer conflicts.
IBM Research: Software Technology
© 2006 IBM Corporation21
Pro
gram
min
g T
echn
olog
ies
Determinate programming design patterns
NAS parallel benchmarks– Conjugate gradient– Multigrid– LU factorization
Stencil computations– Jacobi, SOR
Molecular dynamics
Detecting stable properties– Clocks!– Short circuit technique
Single producer, multiple copying consumers– Kahn networks, StreamIt– Pipelining
Graph algorithms– Connected components
Parallelism for performance/scaling, not control.
IBM Research: Software Technology
© 2006 IBM Corporation22
Pro
gram
min
g T
echn
olog
ies
Determinate programming anti-patterns
Reactive computing: arrival-order indeterminism– “Races in the world”– E.g. Bank accounts
Resource contention: any of several possible outcomes is acceptable– Mutual exclusion– Load balancing
• Shared work list
Algorithm may permit any one of many possible solutions– One solution for N-queens– Some minimal spanning tree– Some Delauney triangulation
But the program may still contain determinate concurrent components.
IBM Research: Software Technology
© 2006 IBM Corporation23
Pro
gram
min
g T
echn
olog
ies
Determinate Concurrent Imperative frameworks
Asynchronous Kahn networks– Nodes can be thought of as
(continuous) functions over streams.
– Pop/peek– Push– Node-local state may
mutate arbitrarily
Concurrent Constraint Programming– Tell constraints– Ask if a constraint is true– Subsumes Kahn networks
(dataflow).– Subsumes (det) concurrent
logic programming, lazy functional programming
Do not support arbitrary mutable variables.
IBM Research: Software Technology
© 2006 IBM Corporation24
Pro
gram
min
g T
echn
olog
ies
Determinate Concurrent Imperative Frameworks
Safe Asynchrony (Steele 1991)– Parent may communicate
with children.– Children may communicate
with parent.– Siblings may communicate
with each other only through commutative, associative writes (“commuting writes”).
int x=0;
finish foreach (int i in 1:N) {
x += i;
}
print(x); // N*(N+1)/2
int x=0;
finish foreach (int i in 1:N) {
x += i;
async print(x);
}
Good:
Bad:
Useful but limited. Does not permit dataflow synch.
IBM Research: Software Technology
© 2006 IBM Corporation25
Pro
gram
min
g T
echn
olog
ies
Determinate X10
Stm:
async [ ( Place ) ] [clocked ClockList ] Stm
when ( SimpleExpr ) Stm
finish Stm
next; c.resume() c.drop()
for( i : Region ) Stm
foreach ( i : Region ) Stm
ateach ( I : Distribution ) Stm
Expr:
ArrayExpr
ClassModifier : Kind
MethodModifier: atomic
DataType:
ClassName | InterfaceName | ArrayType
nullable DataType
future DataType
local DataType
det DataType
indet DataType
Kind :
value | reference
Constructs not available
Constructs added.
IBM Research: Software Technology
© 2006 IBM Corporation26
Pro
gram
min
g T
echn
olog
ies
local variables
Instances of value classes always considered local.
A mutable object is local only if it is marked as local when created:– new local T(…)
A value of a local type can be assigned only into a variable of local type, or a field of a local object.
Variables of a local type are not visible to contained asyncs.
Local objects of terminated activity become local objects of parent.
An async spawned in a finish may assign a value of a local type to a local variable of the parent activity.
A value may be cast to local T; the cast may fail.– E.g. local T x = (local T) this;
Invariant: Each activity owns the local objects and local variables it creates. Only local objects can reference local objects.
Ownership type system used to maintain locality.
IBM Research: Software Technology
© 2006 IBM Corporation27
Pro
gram
min
g T
echn
olog
ies
det locations
A det location is represented in memory as a stream (indexed sequence) of immutable values.
Each activity maintains an index i + clean/dirty bit for every det location. – Initially i=1, v[0] contains initial
value.– Read: If clean, block until v[i] is
written and return v[i++] else return v[i-1]. Mark as clean.
– Write: Write into v[i++]. Mark as dirty.
Note: index updated only as a result of activity’s operations.
World Map = Collection of indices for an activity.
Index transmission rules.– async: Activity initialized with
current world map of parent activity.
– finish: world map of activity is lubbed with world map of finished activities.
– (clean lub dirty = dirty)
The clock of clocked final is made implicit.
IBM Research: Software Technology
© 2006 IBM Corporation28
Pro
gram
min
g T
echn
olog
ies
Indet locations
Can be recovered as det locations + a mutable shared index (“current”).
All activities read and update location through current.
Therefore stream representation is not necessary, only the “current” value need be kept.
An activity’s world map does not need to contain index for indet locations.
IBM Research: Software Technology
© 2006 IBM Corporation29
Pro
gram
min
g T
echn
olog
ies
det example: Array relaxation
det int [0:M-1,0:N-1] G = …;
finish foreach (local int i,j in [1:M-1,1:N-1]) {
for (local int p in [0:TimeStep-1]) {
G[i,j] = omega/4*(G[i-1,j]+G[i+1,j]+G[i,j-1]+G[i,j+1])+(1-omega)*G[i,j];
}
}
All clock manipulations are implicit.
IBM Research: Software Technology
© 2006 IBM Corporation30
Pro
gram
min
g T
echn
olog
ies
Some simple examples
det int x=0;
finish {
async {
int r1 = x; int r2 = x;
println(r1); println(r2);
}
async {x=1;x=2;}
}
0
1
Only one result – independent of the scheduler!
i x A1 A2
0 0 read r1
1 1 read r2 write 1
2 2 write 2
Convention: A type not marked det is assumed marked local.
IBM Research: Software Technology
© 2006 IBM Corporation31
Pro
gram
min
g T
echn
olog
ies
Some simple examples
det int x=0;
finish {
async {int r1 = x; int r2 = x; println(r1); println(r2);}
async {x=1;}
async {x=1; int r3 = x; async {x=2;}}
}
println(x);
All programs are determinate.
0
1
2
i x A1 (0) A2 (0) A3 (0) A4 (2)
0 0 read r1
1 1 read r2 write 1 write 1; read r3
2 2 write 2
IBM Research: Software Technology
© 2006 IBM Corporation32
Pro
gram
min
g T
echn
olog
ies
Some StreamIt examples
void -> void pipeline Minimal {
add IntSource;
add IntPrinter;
}
void ->int filter IntSource {
int x;
init {x=0;}
work push 1 { push(x++);}
}
int->void filter IntPrinter {
work pop 1 { print(pop());}
}
det int x=0;
async while (true) x++;
async while (true) println(x);
StreamIt0
1
…
The communication is through assignment to x, so the same result is obtained with:
det int x=0;
async while (true) ++x;
async while (true) println(x);
0
1
…
Det X10
Each shared variable is a multi-reader, multi-writer stream.
IBM Research: Software Technology
© 2006 IBM Corporation33
Pro
gram
min
g T
echn
olog
ies
Some StreamIt examples: fibonacci
det int x=1, y=1;
async while (true) y=x;
async while (true) x+=y;
i y x
0 1 1
1 1 2
2 2 3
3 3 5
… … …
Activity 1
Activity 2
Can express any recursive, asynchronous Kahn network.
IBM Research: Software Technology
© 2006 IBM Corporation34
Pro
gram
min
g T
echn
olog
ies
StreamIt examples: Moving Average
void->void pipeline MovingAverage {
add intSource();
add Averager(10);
add IntPrinter();
}
int->int filter Average(int n) {
work pop 1 push 1 peek n {
int sum=0;
for (int i=0; i < n; i++)
sum += peek(i);
push(sum/n);
pop();
}
}
det int y=0;
det int x=0; async while (true) x++;
async while (true) {
int sum=x;
for (int i in 1:N-1) sum += peek(x, i);
y = sum/N;
}
• peek(x, i) reads the i’th future value, without popping it. Blocks if necessary.
IBM Research: Software Technology
© 2006 IBM Corporation35
Pro
gram
min
g T
echn
olog
ies
Canon matrix multiplication
void canon (det double[N,N] c, det double[N,N] a,
det double[N,N] b) {
finish foreach (int i,j in [0:N-1,0:N-1]) {
a[i,j] = a[i,(j+1) % N];
b[i,j] = b[(i+j)%N, j];
}
for (int k in [0:N-1])
finish foreach (int i,j in [0:N-1,0:N-1]) {
c[i,j] = c[i+j] + a[i,j]*b[i,j];
a[i,j] = a[i,(j+1)%N];
b[i,j] = b[(i+1)%N, j];
}
}The natural sequential program works (for finish foreach).
IBM Research: Software Technology
© 2006 IBM Corporation36
Pro
gram
min
g T
echn
olog
ies
Implementation
Each activity’s world map increases monotonically with time.
Use garbage collection to erase past unreachable values.
Programs with no sibling communication may be executed in buffers with unit windows.
Considering permitting user to specify bounds on variables (cf push/pop specifications in StreamIt).– This will force writes to
become blocking as well.
Scheduling strategy affects size of buffers, not result.
IBM Research: Software Technology
© 2006 IBM Corporation37
Pro
gram
min
g T
echn
olog
ies
Future work
Formalization– MJ/CF– Very straightforward additions
to field read/write.– Paper contains details.
Paper contains ideas on detecting deadlock (stabilities) at runtime and recovering from them. – Programmability being
investigated.– Devise static type system to
establish deadlock-freedom.
Implementation.– Leverage connection with
StreamIt, and static scheduling.
Coarser granularity for indices.– Use same clock for many
variables.– Permits “coordinated” changes
to multiple variables.
Introduce fusion operation (x -> y) to support CCP.
IBM Research: Software Technology
© 2006 IBM Corporation39
Pro
gram
min
g T
echn
olog
ies
StreamIt examples: Bandpass filter
float->float pipeline BandPassFilter(float rate,
float low, float high, int taps) {
add BPFCore(rate, low, high, taps);
add Subtracter();}
float ->float splitjoin BPFCore
(float rate, float low,
float high, int taps) {
split duplicate;
add LowPass(rate, low, taps, 0);
add LowPass(rate, high, taps, 0);
join roundrobin;}
float->float filter Subtracter {
Work pop 2 push 1 {
push(peek(1)-peek(0));
pop(); pop();}}
float bandPassFilter(float rate, float low,
float high, int taps, int in) {
int tmp=in;
det int in1=tmp, in2=tmp;
async while (true) in1=in;
async while (true) in2=in;
det int o1 = lowPass(rate, low, taps, 0, in1),
o2 = lowPass(rate, high, taps, 0, in2);
det int o = o1-o2;
async while(true) o = o1-o2;
return o;
}
Functions return streams.
IBM Research: Software Technology
© 2006 IBM Corporation40
Pro
gram
min
g T
echn
olog
ies
Histogram
Permit “commuting” writes to be performed simultaneously in the same phase.
Phase is completed when all activities that can write have written.
<int N> [1:N][] histogram([1:N][] A) {
final int[] B = new int [1:N];
finish foreach(int i in A) B[A[i]]++;
return B;
}
B’s phase is not yet complete. A subsequent read will complete it.
IBM Research: Software Technology
© 2006 IBM Corporation41
Pro
gram
min
g T
echn
olog
ies
Cilk programs with races
int x;
cilk void foo() {
x = x +1;
}
cilk int main() {
x=0;
spawn foo();
spawn foo();
sync;
printf(“x is \%d\n”, x);
return 0;
}
Determinate: Will always print 1 in CF.
CF smoothly combines Cilk and StreamIt.