Post on 12-Oct-2019

cso.io

Scalable Performance for Scala Message-Passing Concurrency

Andrew Bate

Department of Computer Science, University of Oxford

Motivation

Multi-core commodity hardware

Non-uniform shared memory

Expose potential parallelism

Correctness and formal verification

Compatibility

int arr[x][y];

EMBEDDED DOMAIN-SPECIFIC LANGUAGE

1. Embedded DSL
2. Bytecode rewriting
3. Channels
4. Scheduler
5. Deadlock detection

Why an Embedded DSL?

Ease of implementation

Leverage existing tools

Leverage known syntax

Higher-order functions

Rich type system

Lightweight syntax

Compile-time macros

def map[I, O](f: I => O)(in: ?[I], out: ![O]) = proc {
  repeat { out ! (f(in?)) }
  run (proc { in.closein } || proc { out.closeout })
}

[Diagram: the process map f reads each value v from in and writes f(v) to out.]
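The behaviour of the map process can be tried without the CSO2 library. The following is a minimal stand-in sketch, not the library's implementation: it assumes `java.util.concurrent.SynchronousQueue` as a substitute for a synchronous channel, and uses `None` as a hypothetical close signal. The names `MapSketch`, `Chan` and `demo` are invented for illustration.

```scala
import java.util.concurrent.SynchronousQueue

object MapSketch {
  // Stand-in for a synchronous channel: Some(v) carries a value, None signals closure.
  type Chan[T] = SynchronousQueue[Option[T]]

  // Reads values from `in`, applies `f`, and writes the results to `out`.
  // When `in` is closed, the close signal is forwarded and the thread ends.
  def map[I, O](f: I => O)(in: Chan[I], out: Chan[O]): Thread =
    new Thread(() => {
      var open = true
      while (open) {
        in.take() match {
          case Some(v) => out.put(Some(f(v)))
          case None =>
            out.put(None)
            open = false
        }
      }
    })

  // Feeds `inputs` through a doubling map process and collects the results.
  def demo(inputs: List[Int]): List[Int] = {
    val in  = new SynchronousQueue[Option[Int]]()
    val out = new SynchronousQueue[Option[Int]]()
    val worker = map((x: Int) => x * 2)(in, out)
    val producer = new Thread(() => {
      inputs.foreach(x => in.put(Some(x)))
      in.put(None) // close the input
    })
    worker.start()
    producer.start()
    val results =
      Iterator.continually(out.take()).takeWhile(_.isDefined).map(_.get).toList
    worker.join()
    producer.join()
    results
  }
}
```

Each stage here occupies a whole JVM thread, which is the cost the bytecode-rewriting approach in the talk is meant to avoid.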

Examples

def tee[@specialized T](in: ?[T], outs: Seq[![T]]) = proc {
  var v = null.asInstanceOf[T]  // typed placeholder; overwritten before first use
  val outputs = (|| (out <- outs) proc { out ! v })
  repeat { v = in?; run outputs }
  run (proc { in.closein } || (|| (out <- outs) proc { out.closeout }))
}

[Diagram: the process tee reads each value v from in and writes a copy of v to each of out1, out2, ..., outn.]

Examples

SCALABLE PERFORMANCE through bytecode rewriting

CPS Transformation

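[Diagram: two control-flow graphs of a method. One shows the nodes Init, Call n(), Return. The other shows Prelude, Init, Pre-call, Call n(f), Post-call, Return, with edges labelled "pausing" and "rewinding".]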
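The idea of pausing and rewinding can be illustrated by hand in source code. The sketch below is not the bytecode the tool generates; it is an assumption-laden illustration in which a method that performs two reads is rewritten as a resumable state machine. `Frame`, `Source`, `Outcome` and `sumTwo` are hypothetical names.

```scala
object CpsSketch {
  // Hypothetical saved frame: a resume point plus the live local variable.
  final class Frame(var pc: Int = 0, var acc: Int = 0)

  // Tells the caller whether the method finished or paused.
  sealed trait Outcome
  case object Paused extends Outcome
  final case class Done(result: Int) extends Outcome

  // Hypothetical non-blocking input: None means no value is available yet.
  final class Source(private var items: List[Option[Int]]) {
    def poll(): Option[Int] = items match {
      case head :: tail =>
        items = tail
        head
      case Nil => None
    }
  }

  // Blocking original, conceptually: acc = read() + read(); return acc.
  // Rewritten shape: dispatching on frame.pc plays the role of the prelude
  // ("rewinding"), and each call site may return early ("pausing") after
  // its state has been stored in the frame.
  @annotation.tailrec
  def sumTwo(src: Source, frame: Frame): Outcome = frame.pc match {
    case 0 => // first call site
      src.poll() match {
        case Some(v) =>
          frame.acc = v
          frame.pc = 1
          sumTwo(src, frame)
        case None => Paused
      }
    case 1 => // second call site
      src.poll() match {
        case Some(v) =>
          frame.acc += v
          frame.pc = 2
          sumTwo(src, frame)
        case None => Paused
      }
    case _ => Done(frame.acc)
  }
}
```

Calling `sumTwo` again with the same frame resumes at the saved call site instead of starting over, so no JVM thread has to block while the value is unavailable.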

Analysing the call graph

[Diagram: call graph over the methods do(), x(), y(), z() and the channel read ?(). The methods do() and y() are marked "Transform these methods".]
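One way to select the methods to transform is reverse reachability over the call graph: a method needs rewriting if it can reach an operation that may pause. The sketch below assumes that reading of the slide. The edge set used in the usage note is hypothetical, since the transcript does not preserve the arrows of the diagram.

```scala
object CallGraphSketch {
  // Returns every method from which some method in `pausing` is reachable.
  // `calls` maps each caller to the set of methods it calls.
  def needsTransform(calls: Map[String, Set[String]],
                     pausing: Set[String]): Set[String] = {
    // Invert the edges: callee -> set of callers.
    val callers: Map[String, Set[String]] =
      calls.toSeq
        .flatMap { case (caller, callees) => callees.toSeq.map(callee => callee -> caller) }
        .groupBy(_._1)
        .map { case (callee, pairs) => callee -> pairs.map(_._2).toSet }

    // Breadth-first walk backwards from the pausing operations.
    @annotation.tailrec
    def loop(frontier: Set[String], seen: Set[String]): Set[String] =
      if (frontier.isEmpty) seen
      else {
        val next = frontier.flatMap(m => callers.getOrElse(m, Set.empty[String])) -- seen
        loop(next, seen ++ next)
      }

    loop(pausing, Set.empty[String])
  }
}
```

With the assumed edges do() → x(), do() → y(), x() → z() and y() → ?(), the call `needsTransform(Map("do" -> Set("x", "y"), "x" -> Set("z"), "y" -> Set("?")), Set("?"))` selects `do` and `y`, while `x` and `z` are left untouched.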

Engineering

Live variable analysis

Lazy load and store

Constant inlining

Functional Expressions

for (i <- 0 until n; j <- i until n) println(i)

Compiles to

intWrapper(0).until(n).foreach(
  (i: Int) => intWrapper(i).until(n).foreach((j: Int) => println(i))
)

Transforms to

var i = 0
while (i < n) {
  var j = i
  while (j < n) { println(i); j += 1 }
  i += 1
}
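The equivalence of the two loop forms can be checked directly. This is a small sketch that records `i` in a buffer instead of printing it, so that the two versions can be compared; `LoopSketch`, `viaFor` and `viaWhile` are names chosen for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object LoopSketch {
  // The for-comprehension form, recording i instead of printing it.
  def viaFor(n: Int): List[Int] = {
    val seen = ArrayBuffer.empty[Int]
    for (i <- 0 until n; j <- i until n) seen += i
    seen.toList
  }

  // The nested while-loop form, which allocates no Range or closure objects.
  def viaWhile(n: Int): List[Int] = {
    val seen = ArrayBuffer.empty[Int]
    var i = 0
    while (i < n) {
      var j = i
      while (j < n) { seen += i; j += 1 }
      i += 1
    }
    seen.toList
  }
}
```

For `n = 3` both versions record `0, 0, 0, 1, 1, 2`.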

Tail call optimisations

Shared memory

SBT plugin support

More Features

CHANNELS

More Features

Generalised alt

Specialization for primitives

Optimised extended rendezvous

SCHEDULER

Scheduler States

Created

Waiting

Terminated

Paused

Running

Scheduling: Central FIFO

[Diagram: threads 1 to m all take processes P1, P2, P3, ..., Pn from a single central scheduler queue.]

Scheduling: FIFO per thread

[Diagram: threads 1 to m, each with its own scheduler queue; processes P1, P3, ..., Pn are shown distributed across the per-thread queues.]

Scheduling: Batches per thread

[Diagram: threads 1 to m, each with its own scheduler holding batches of processes.]

Scheduling: Batches per thread

[Diagram: a scheduler holding batches of processes: P1, P2, ..., Pn; Q1, ..., Qm; R1, ..., Rk.]

Dispatch Count = max(const × Batch Length, Dispatch Limit)
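As a worked instance of the formula, the sketch below reads it as the maximum of two quantities: a constant multiple of the batch length, and a fixed dispatch limit. The parameter name `factor` stands for the slide's "const", and the numbers in the usage note are made up for illustration.

```scala
object DispatchSketch {
  // Reads the slide's formula as max(const * batchLength, dispatchLimit).
  // `factor` plays the role of "const"; all names here are hypothetical.
  def dispatchCount(batchLength: Int, factor: Int, dispatchLimit: Int): Int =
    math.max(factor * batchLength, dispatchLimit)
}
```

Under this reading, a batch of 8 processes with a factor of 2 and a limit of 10 gets a dispatch count of 16, while a batch of 3 processes falls back to the limit of 10.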

DEADLOCK DETECTION

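Example

[Diagram: a process network built from Prefix 1, three Tee processes, the multipliers x2, x3 and x5, two Merge processes, and Console. A second copy of the diagram annotates the channels with pending ! (write) and ? (read) requests.]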

Deadlock detected! The cycle of ungranted requests is:
  Prefix1 -!-> Tee1
  Tee1 -!-> Tee2
  Tee2 -!-> Tee3
  Tee3 -!-> x5
  x5 -!-> Merge2
  Merge2 -!-> Prefix1
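The reported cycle can be reproduced with a simple walk over a waits-on graph. The sketch below is not the library's detector; it assumes each blocked process has exactly one ungranted request, as in the output above, and uses the six edges from that output as data. `DeadlockSketch`, `waitsOn` and `findCycle` are illustrative names.

```scala
object DeadlockSketch {
  // Ungranted requests taken from the slide: each process waits on one peer.
  val waitsOn: Map[String, String] = Map(
    "Prefix1" -> "Tee1",
    "Tee1"    -> "Tee2",
    "Tee2"    -> "Tee3",
    "Tee3"    -> "x5",
    "x5"      -> "Merge2",
    "Merge2"  -> "Prefix1"
  )

  // Follows the waits-on edges from `start`. Returns the cycle, in order,
  // if the walk revisits a process; returns None if the chain ends.
  def findCycle(graph: Map[String, String], start: String): Option[List[String]] = {
    @annotation.tailrec
    def walk(current: String, path: Vector[String]): Option[List[String]] = {
      val idx = path.indexOf(current)
      if (idx >= 0) Some(path.drop(idx).toList)
      else graph.get(current) match {
        case Some(next) => walk(next, path :+ current)
        case None       => None
      }
    }
    walk(start, Vector.empty[String])
  }
}
```

Starting from `Prefix1`, the walk returns the six processes in the order they appear in the message.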

PERFORMANCE EVALUATION

Ring topology

[Chart: time to pass a message 300 times around an n-process ring (ms) against the number n of processes spawned; axis ticks from 100 to 100000. Series: CSO2 FIFO Scheduler, Java primitives, CSO2 Batch Scheduler.]
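The shape of the ring benchmark can be sketched with plain JVM threads. This is not the benchmark code behind the chart; it is a minimal sketch that assumes one thread per ring node and `java.util.concurrent.SynchronousQueue` as the link between neighbours. The token carries a hop counter so the result can be checked.

```scala
import java.util.concurrent.SynchronousQueue

object RingSketch {
  // Passes one token `laps` times around a ring of `n` nodes.
  // Returns the number of hops performed and the elapsed time in ms.
  def run(n: Int, laps: Int): (Int, Long) = {
    require(n >= 2 && laps >= 1, "need at least two nodes and one lap")
    // Node i reads from chans(i) and writes to chans((i + 1) % n).
    val chans = Vector.fill(n)(new SynchronousQueue[Int]())
    val workers = (1 until n).map { i =>
      new Thread(() => {
        var lap = 0
        while (lap < laps) {
          val hops = chans(i).take()
          chans((i + 1) % n).put(hops + 1)
          lap += 1
        }
      })
    }
    workers.foreach(_.start())
    val started = System.nanoTime()
    // Node 0 runs on the calling thread: it injects the token,
    // forwards it on every lap but the last, and collects it at the end.
    chans(1).put(0)
    var lap = 1
    while (lap < laps) {
      chans(1).put(chans(0).take() + 1)
      lap += 1
    }
    val hops = chans(0).take() + 1
    val elapsedMs = (System.nanoTime() - started) / 1000000L
    workers.foreach(_.join())
    (hops, elapsedMs)
  }
}
```

A call such as `RingSketch.run(4, 3)` performs 12 hops. Because every node is a full JVM thread, this version is only practical for small n.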

Ring topology

[Chart: time to pass a message 300 times around an n-process ring (ms) against the number n of processes spawned; axis ticks from 0 to 300000. Series: CSO2 FIFO Scheduler, Java primitives, CSO2 Batch Scheduler.]

Fully connected topology

[Chart: time to pass n² messages (ms) against the number n of processes / actors spawned; axis ticks from 10 to 1000000. Series: Erlang, Scala Actors, JCSP, Java Primitives, Occam, CSO2 FIFO Scheduler, CSO2 Batch Scheduler, Go. The two CSO2 series are labelled on the plot.]

Fully connected topology

[Chart: time to pass n² messages (ms) against the number n of processes / actors spawned; axis ticks from 0 to 2000000. Series: Erlang, Scala Actors, JCSP, Java Primitives, Occam, CSO2 FIFO Scheduler, CSO2 Batch Scheduler, Go. The two CSO2 series are labelled on the plot.]

Fully connected topology

[Chart: time to pass n² messages (ms) against the number n of processes / actors spawned; axis ticks from 0 to 60000. Series: JCSP, Occam, CSO2 FIFO Scheduler, CSO2 Batch Scheduler, Go. The two CSO2 series are labelled on the plot.]

Fully connected topology

[Chart: time to pass n² messages (ms) against the number n of processes / actors spawned; axis ticks from 0 to 16000. Series: CSO2 Batch Scheduler, CSO2 FIFO Scheduler, Go.]

Summary

• High-performance library for building massively concurrent systems on the JVM

• Deadlock detection

• Outperforms Java primitives, JCSP, Scala Actors, and Occam, and comes very close to Go