
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias Boehm

Date post: 14-Apr-2017
Transcript
Page 1: Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias Boehm

© 2015 IBM Corporation

S7/8: SystemML’s Optimizer and Runtime

Matthias Boehm¹, Arvind C. Surve²

¹ IBM Research – Almaden   ² IBM Spark Technology Center

IBM Research

Page 2

Abstraction: The Good, the Bad and the Ugly


q = t(X) %*% (w * (X %*% v))

[adapted from Peter Alvaro: "I See What You Mean", Strange Loop, 2015]

The Good: Simple & Analysis-Centric · Data Independence · Platform Independence · Adaptivity

The Bad: (Missing) Size Information · Operator Selection · (Missing) Rewrites · Distributed Operations · Distributed Storage · (Implicit) Copy-on-Write · Data Skew · Load Imbalance · Latency · Complex Control Flow · Local/Remote Memory Budgets

The Ugly: Expectations ≠ Reality

→ Understanding the optimizer and runtime techniques underpinning declarative, large-scale ML is key to efficiency and performance.

Page 3

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
  – ParFor Optimizer/Runtime
  – Buffer Pool + Specific Optimizations
  – Spark-Specific Rewrites
  – Partitioning-Preserving Operations
  – Update In-Place
  – Ongoing Research (CLA)

Page 4

Optimization through ParFor
§ Motivation
  – SystemML focuses primarily on data parallelism
  – Dedicated parfor construct for task parallelism
§ ParFor approach
  – Complementary parfor parallelization strategies
  – Cost-based optimization framework for task-parallel ML
  – Memory budget as common constraint

Page 5

Recap: Basic HOP DAG Compilation — Example: Pearson Correlation
§ DML Script

X = read( "./in/X" );   # data on HDFS
Y = read( "./in/Y" );
m = nrow(X);
sigmaX = sqrt( centralMoment(X,2)*(m/(m-1.0)) );
sigmaY = sqrt( centralMoment(Y,2)*(m/(m-1.0)) );
r = cov(X,Y) / (sigmaX * sigmaY);
write( r, "./out/r" );

§ HOP DAG

[HOP DAG figure: reads X ("./in/X", 10⁶×1) and Y ("./in/Y", 10⁶×1); operators b(cm) with constant 2, b(*), b(-) over 1,000,000 and 1 (without constant folding, m/(m-1.0) stays unfolded rather than becoming the constant 1.000001), u(sqrt), b(cov), b(/), writing r ("./out/r"). Legend: u() … unary operator, b() … binary operator, cov … covariance, cm … central moment, sqrt … square root. The DAG computes ρ_{X,Y} = cov(X,Y) / (σ_X · σ_Y). Spark/MR data parallelism is exploited if beneficial/required.]

Page 6

Running Example: Pairwise Pearson Correlation
§ Representative of more complex bivariate statistics (Pearson's R, Anova F, chi-squared, degrees of freedom, P-value, Cramér's V, Spearman, etc.)

D = read("./input/D");
m = nrow(D);
n = ncol(D);
R = matrix(0, rows=n, cols=n);
parfor( i in 1:(n-1) ) {
  X = D[ ,i];
  m2X = centralMoment(X,2);
  sigmaX = sqrt( m2X*(m/(m-1.0)) );
  parfor( j in (i+1):n ) {
    Y = D[ ,j];
    m2Y = centralMoment(Y,2);
    sigmaY = sqrt( m2Y*(m/(m-1.0)) );
    R[i,j] = cov(X,Y) / (sigmaX*sigmaY);
  }
}
write(R, "./output/R");

Challenges:
• Triangular nested loop
• Column-wise access on unordered distributed data
• Bivariate all-to-all data shuffling pattern

Task and data parallelism are exploited if beneficial/required.
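For reference, the same triangular computation in plain Python — an illustrative sketch, not DML (function and helper names are my own); the point is that every (i, j) cell is independent, which is exactly what parfor exploits:

```python
import math

def pairwise_pearson(D):
    """Upper-triangular matrix R of pairwise Pearson correlations.

    Mirrors the DML above: iterations (i, j) with j > i are
    independent of each other, so they can run task-parallel.
    """
    m, n = len(D), len(D[0])
    cols = [[row[j] for row in D] for j in range(n)]

    def sigma(x):  # sample std. dev. via central moment, as in the DML
        mu = sum(x) / m
        m2 = sum((v - mu) ** 2 for v in x) / m  # 2nd central moment
        return math.sqrt(m2 * (m / (m - 1.0)))

    def cov(x, y):  # sample covariance
        mx, my = sum(x) / m, sum(y) / m
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (m - 1.0)

    R = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):          # parfor( i in 1:(n-1) )
        for j in range(i + 1, n):   # parfor( j in (i+1):n )
            R[i][j] = cov(cols[i], cols[j]) / (sigma(cols[i]) * sigma(cols[j]))
    return R
```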

Page 7

Overview of Parallelization Strategies
§ Conceptual design: master/worker
  – Task: group of parfor iterations
§ Task partitioning
  – Naive, static, fixed, factoring, factoring_cmax
  – Trade-off: task overhead vs. load balance
§ Task execution
  – Local, remote (Spark/MR), remote DP (Spark/MR)
  – Various runtime optimizations
  – Trade-off: degree of parallelism / IO / latency
§ Result aggregation
  – Local memory, local file, remote (Spark/MR)
  – With and without compare
  – Trade-off: result locality / IO / latency

n = 12
parfor( i in 1:(n-1) ) {
  X = D[ ,i];
  …
  R[i,j] = …
}

→ The optimizer leverages these strategies to generate efficient execution plans.

Page 8

Example Task Partitioning

§ Scenario: k = 24 workers, 10,000 iterations

[Figure: number of iterations per task under each partitioning scheme — Naive (1 iteration per task; tasks 1 to 10,000), Static (tasks 1 to 24), Fixed(250) (tasks 1 to 40), Factoring (tasks 1 to 208), and Factoring CMAX(150) (tasks 1 to 228).]
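The factoring scheme above can be sketched as follows. This is the textbook factoring algorithm (waves of k tasks, each wave covering half of the remaining iterations), not necessarily SystemML's exact implementation; the `cmax` parameter name is mine:

```python
def factoring(n_iter, k, cmax=None):
    """Task sizes under the 'factoring' scheme: waves of k tasks,
    each wave taking half the remaining iterations, so task sizes
    decay geometrically (large tasks first, small tasks last for
    good load balance). cmax optionally caps the task size
    (factoring_cmax)."""
    tasks, remaining = [], n_iter
    while remaining > 0:
        size = max(1, -(-remaining // (2 * k)))  # ceil(remaining / 2k)
        if cmax is not None:
            size = min(size, cmax)
        for _ in range(k):                        # one wave of k tasks
            if remaining == 0:
                break
            t = min(size, remaining)
            tasks.append(t)
            remaining -= t
    return tasks
```

For the slide's scenario (k=24, 10,000 iterations) this yields 208 tasks with sizes 209, …, 1, matching the plain-factoring chart.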

Page 9

Task Execution: Local and Remote Parallelism

Local execution (multicore): the local ParFOR master performs task partitioning and places tasks on a shared task queue (w1: i, {1,2,3} … w5: i, {11}); local ParWorkers 1…k each run

  while( w ← deq() ) foreach pi ∈ w: execute(prog(pi))

followed by parallel result aggregation.

Remote execution (cluster): ParFOR (remote) performs task partitioning and serializes tasks into a task file; Hadoop ParWorker mappers 1…k each run

  map(key, value): w ← parse(value); foreach pi ∈ w: execute(prog(pi))

producing result files (…A|MATRIX|./out/A7tmp), followed by parallel result aggregation.

Hybrid parallelism: combinations of local/remote parfor and data-parallel jobs.

Page 10

Task Execution: Runtime Optimizations
§ Data partitioning
  – Problem: repeated MR jobs for indexed access
  – Access-awareness (cost estimation, correct plan generation)
  – Operators: local file-based, remote MR job
§ Data locality
  – Problem: co-location of parfor tasks with partitions/matrices
  – Location reporting per logical parfor task (e.g., for parfor(i) → D[ ,i])

parfor( i in 1:(n-1) ) {
  X = D[ ,i]; …
  parfor( j in (i+1):n ) {
    Y = D[ ,j]; …
  }
}

[Figure: column partitions D1–D11 of D spread across Node 1 (D1, D2, D6–D8) and Node 2 (D3–D5, D9–D11); the task file (w1: i, {1,2,3} … w5: i, {11}) is annotated with reported locations (Node 1; Node 1, 2; Node 2) per logical parfor task.]

Page 11

Optimization Framework – Problem Formulation
§ Design: runtime optimization for each top-level parfor
§ Plan tree P
  – Nodes N_P, each with exec type et, parallelism k, and attributes A
  – Height h
  – Exec contexts EC_P
§ Plan tree optimization problem

[Figure: plan tree with ParFOR and Generic nodes over operators (b(cm), b(cov), RIX, LIX, …), mapped to execution contexts: ec0 with cm_ec = 600 MB, ck_ec = 1, and ec1 (MR) with cm_ec = 1024 MB, ck_ec = 16. Legend: ec … execution context, cm … memory constraint, ck … parallelism constraint.]

[M. Boehm et al.: Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB 7(7), 2014]

[M. Boehm et al.: Costing Generated Runtime Execution Plans for Large-Scale Machine Learning Programs. CoRR, 2015]

Page 12

Optimization Framework – Cost Model / Optimizer
§ Overview: heuristic optimizer
  – Time- and memory-based cost model w/o shared reads
  – Heuristic high-impact rewrites
  – Transformation-based search strategy with global optimization scope
§ Cost model
  – HOP DAG size propagation
  – Worst-case memory estimates
  – Time estimates
  – Plan tree statistics aggregation

[Figure: plan tree P (k=4) with mapped HOP DAGs. Size propagation annotates HOPs with dimensions (e.g., d1=1M, d2=1 for columns X and Y; d1=1M, d2=10 for D; d1=0, d2=0 for scalars) and memory estimates M = (<output mem>, <operation mem>), e.g., M = (80 MB, 80 MB) for D, M = (8 MB, 8 MB) for RIX, M = (8 MB, 88 MB) for b(cov), M = (0 MB, 8 MB) and M = (0 MB, 16 MB) for b(cm); plan tree statistics aggregate to estimates such as M = 88 MB and M = 352 MB.]

Page 13

Hands-On Lab: Task-Parallel ParFor Programs
§ Exercise: Pairwise Pearson Correlation
  – a) Simple for loop w/ -stats
  – b) Task-parallel parfor w/ -stats

D = rand(rows=100000, cols=100);
m = nrow(D);
n = ncol(D);
R = matrix(0, rows=n, cols=n);
parfor( i in 1:(n-1) ) {
  X = D[ ,i];
  m2X = centralMoment(X,2);
  sigmaX = sqrt( m2X*(m/(m-1.0)) );
  parfor( j in (i+1):n ) {
    Y = D[ ,j];
    m2Y = centralMoment(Y,2);
    sigmaY = sqrt( m2Y*(m/(m-1.0)) );
    R[i,j] = cov(X,Y) / (sigmaX*sigmaY);
  }
}
write(R, "./tmp/R", format="binary");

Page 14

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
  – ParFor Optimizer/Runtime
  – Buffer Pool + Specific Optimizations
  – Spark-Specific Rewrites
  – Partitioning-Preserving Operations
  – Update In-Place
  – Ongoing Research (CLA)

Page 15

Buffer Pool Overview
§ Motivation
  – Exchange of intermediates between local and remote operations (HDFS, RDDs, GPU memory)
  – Eviction of in-memory objects (integrated with garbage collection)
§ Primitives
  – acquireRead, acquireModify, release, exportData, getRdd, getBroadcast
§ Spark specifics
  – Lineage tracking of RDDs/broadcasts
  – Guarded RDD collect/parallelize
  – Partitioned broadcast variables

[Figure: MatrixObject/WriteBuffer with lineage tracking.]
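A toy sketch of the pin/evict life cycle behind these primitives. Class and method names are illustrative analogues, not SystemML's actual buffer-pool API; a dict stands in for local disk:

```python
class BufferPool:
    """Toy buffer pool: objects are pinned while an operator uses
    them, and unpinned objects may be evicted to 'disk' when the
    pool exceeds its capacity."""

    def __init__(self, capacity):
        self.capacity = capacity          # max resident objects
        self.mem, self.disk, self.pins = {}, {}, {}

    def put(self, name, data):
        self.mem[name] = data
        self._evict_if_needed()

    def acquire_read(self, name):         # cf. acquireRead
        if name not in self.mem:          # restore evicted object
            self.mem[name] = self.disk.pop(name)
        self.pins[name] = self.pins.get(name, 0) + 1
        self._evict_if_needed()
        return self.mem[name]

    def release(self, name):              # cf. release
        self.pins[name] -= 1

    def _evict_if_needed(self):
        while len(self.mem) > self.capacity:
            # evict the first unpinned object; pinned ones are safe
            victim = next((n for n in self.mem if self.pins.get(n, 0) == 0), None)
            if victim is None:            # everything pinned
                break
            self.disk[victim] = self.mem.pop(victim)
```

Pinned objects can never be evicted mid-operation, which is the property the real acquire/release protocol guarantees.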

Page 16

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
  – ParFor Optimizer/Runtime
  – Buffer Pool + Specific Optimizations
  – Spark-Specific Rewrites
  – Partitioning-Preserving Operations
  – Update In-Place
  – Ongoing Research (CLA)

Page 17

Spark-Specific Optimizations
§ Spark-specific rewrites
  – Automatic caching/checkpoint injection (MEM_DISK / MEM_DISK_SER)
  – Automatic repartition injection
§ Operator selection
  – Spark exec type selection
  – Transitive Spark exec type
  – Physical operator selection
§ Extended ParFor optimizer
  – Deferred checkpoint/repartition injection
  – Eager checkpointing/repartitioning
  – Fair scheduling for concurrent jobs
  – Local degree of parallelism
§ Runtime optimizations
  – Lazy Spark context creation
  – Short-circuit read/collect

Example: checkpoint injection for LinregCG

X = read($1);     # chkpt X MEM_DISK (injected)
y = read($2);
...
r = -(t(X) %*% y);
while(i < maxi & norm_r2 > norm_r2_trgt) {
  q = t(X) %*% (X %*% p) + lambda * p;
  alpha = norm_r2 / (t(p) %*% q);
  w = w + alpha * p;
  old_norm_r2 = norm_r2;
  r = r + alpha * q;
  norm_r2 = sum(r * r);
  beta = norm_r2 / old_norm_r2;
  p = -r + beta * p;
  i = i + 1;
}
...
write(w, $4);

Spark execution (24 cores): 25% user memory, 75% data & execution memory (50% min / 75% max).
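The conjugate-gradient loop above can be sketched in plain Python. This is an illustrative translation, not SystemML code; `matvec`/`tmatvec` are my own helpers. Note that X is read twice per iteration (X %*% p and t(X) %*% …), which is exactly why the optimizer injects a MEM_DISK checkpoint for X:

```python
def linreg_cg(X, y, lam=0.0, max_iter=10, tol=1e-12):
    """Conjugate gradient for ridge-regularized least squares,
    mirroring the DML loop on the slide."""
    m, n = len(X), len(X[0])

    def matvec(A, v):                 # A %*% v
        return [sum(a * b for a, b in zip(row, v)) for row in A]

    def tmatvec(A, v):                # t(A) %*% v, without building t(A)
        out = [0.0] * n
        for row, vi in zip(A, v):
            for j, aij in enumerate(row):
                out[j] += aij * vi
        return out

    w = [0.0] * n
    r = [-ri for ri in tmatvec(X, y)]         # r = -(t(X) %*% y)
    p = [-ri for ri in r]
    norm_r2 = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if norm_r2 <= tol:
            break
        # q = t(X) %*% (X %*% p) + lambda * p  -- X touched twice here
        q = [qi + lam * pi for qi, pi in zip(tmatvec(X, matvec(X, p)), p)]
        alpha = norm_r2 / sum(pi * qi for pi, qi in zip(p, q))
        w = [wi + alpha * pi for wi, pi in zip(w, p)]
        r = [ri + alpha * qi for ri, qi in zip(r, q)]
        old, norm_r2 = norm_r2, sum(ri * ri for ri in r)
        beta = norm_r2 / old
        p = [-ri + beta * pi for ri, pi in zip(r, p)]
    return w
```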

Page 18

SystemML on Spark: Lessons Learned
§ Spark over a custom framework
  – Well-engineered framework with a strong contributor base
  – Seamless data preparation and feature engineering
§ Stateful distributed caching
  – Standing executors with distributed caching and fast task scheduling
  – Challenges: task parallelism, memory constraints, fair resource management
§ Memory efficiency
  – Compact data structures to avoid cache spilling (serialization, CSR)
  – Custom serialization and compression
§ Lazy RDD evaluation
  – Automatic grouping of operations into distributed jobs, incl. partitioning
  – Challenges: multiple actions/repeated execution, runtime plan compilation
§ Declarative ML
  – Introducing the Spark backend did not require algorithm changes
  – Distributed caching and partitioning exploited automatically via rewrites

Page 19

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
  – ParFor Optimizer/Runtime
  – Buffer Pool + Specific Optimizations
  – Spark-Specific Rewrites
  – Partitioning-Preserving Operations
  – Update In-Place
  – Ongoing Research (CLA)

Page 20

Partitioning-Preserving Operations on Spark
§ Partitioning-preserving ops
  – An op is partitioning-preserving if the key is guaranteed unchanged
  – 1) Implicit: use restrictive APIs (mapValues() vs. mapToPair())
  – 2) Explicit: partition computation w/ declaration of partitioning-preserving behavior (memory efficiency via "lazy iterators")
§ Partitioning-exploiting ops
  – 1) Implicit: operations based on join, cogroup, etc.
  – 2) Explicit: custom physical operators on original keys (e.g., zipmm)

[Figure: physical blocking and partitioning.]

Page 21

Partitioning-Exploiting ZIPMM
§ Operation: Z = t(X) %*% y

§ Naive approach: transpose, join, multiplication — incurs a shuffle, because partitions are not preserved after the transpose (the block keys change: 1,1 / 2,1 / 3,1 become 1,1 / 1,2 / 1,3).

§ zipmm approach: join, then fused transpose & multiplication on the original row-block keys of X and y — avoids the unnecessary shuffle.
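The zipmm idea can be sketched with plain Python dicts standing in for keyed RDD blocks. This is illustrative only — the block layout and names are assumptions, not SystemML's actual Spark operator:

```python
def zipmm(x_blocks, y_blocks):
    """Z = t(X) %*% y computed as the sum over co-partitioned row
    blocks of t(X_b) %*% y_b: the join runs on the original
    row-block keys, so no transpose shuffle is needed.

    x_blocks: {block_id: list of rows}, y_blocks: {block_id: list
    of scalars}, with matching row partitioning."""
    n = len(next(iter(x_blocks.values()))[0])   # ncol(X)
    Z = [0.0] * n
    for key, Xb in x_blocks.items():            # "join" on block id
        yb = y_blocks[key]
        for row, yi in zip(Xb, yb):             # t(Xb) %*% yb, fused
            for j, xij in enumerate(row):
                Z[j] += xij * yi
    return Z
```

Each block's partial product is a length-ncol(X) vector, so only these small partials are combined — never the transposed blocks of X.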

Page 22

Example: Multiclass SVM
§ Vectors of length nrow(X) fit neither into the driver nor into a broadcast (MapMM not applicable)
§ ncol(X) ≤ Bc (zipmm applicable)

parfor(iter_class in 1:num_classes) {
  Y_local = 2 * (Y == iter_class) - 1;   # chkpt Y_local MEM_DISK
  g_old = t(X) %*% Y_local;              # zipmm
  ...
  while( continue ) {
    Xd = X %*% s;                        # repart, chkpt X MEM_DISK
    ... inner while loop (compute step_sz)
    Xw = Xw + step_sz * Xd;              # chkpt Xd, Xw MEM_DISK
    out = 1 - Y_local * Xw;
    out = (out > 0) * out;
    g_new = t(X) %*% (out * Y_local);
    ...

(Comments show the injected rewrites annotated on the slide.)

Page 23

Hands-On Lab: Partitioning-Preserving Operations
§ Exercise: Multiclass SVM
  – W/o repartition injection
  – W/ repartition injection

parfor(iter_class in 1:num_classes) {
  Y_local = 2 * (Y == iter_class) - 1;
  g_old = t(X) %*% Y_local;
  ...
  while( continue ) {
    Xd = X %*% s;
    ... inner while loop (compute step_sz)
    Xw = Xw + step_sz * Xd;
    out = 1 - Y_local * Xw;
    out = (out > 0) * out;
    g_new = t(X) %*% (out * Y_local);
    ...
  }
}

Page 24

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
  – ParFor Optimizer/Runtime
  – Buffer Pool + Specific Optimizations
  – Spark-Specific Rewrites
  – Partitioning-Preserving Operations
  – Update In-Place
  – Ongoing Research (CLA)

Page 25

Update In-Place
§ Loop update in-place
  – 1) ParFor result indexing / intermediates (w/ pinned matrix objects)
  – 2) For/while/parfor loops with pure left-indexing access to a variable
  – Both require pinning / shallow serialization to overcome buffer-pool serialization
  – Example of type 2:

for(i in 1:nrow(X))
  for(j in 1:ncol(X))
    X[i,j] = i+j;

§ Where update in-place cannot be applied
  – Matrix object cannot fit into the local memory budget (CP only)
  – Interleaving operations (a mix of update and reference, possibly non-obvious)
  – Example — updating X in place here would create incorrect results, because R still references the old value of X:

R = X;
X[i,j] = i+j;
y = sum(R);
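The same aliasing hazard exists in any language with reference semantics; a minimal Python analogy (illustrative only, not SystemML's buffer-pool mechanics):

```python
# Lists are reference-copied, so an in-place update to X is
# visible through R unless a copy is made first (copy-on-write).
X = [[1.0, 2.0], [3.0, 4.0]]

R = X                         # R is an alias, not a copy
X[0][0] = 99.0                # naive update in-place
assert R[0][0] == 99.0        # R silently changed: incorrect result

X = [[1.0, 2.0], [3.0, 4.0]]
R = [row[:] for row in X]     # copy before updating (copy-on-write)
X[0][0] = 99.0
assert R[0][0] == 1.0         # R keeps the old value: correct
```

This is why the optimizer only applies update in-place when it can prove no such interleaved reference exists.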

Page 26

Hands-On Lab: Update In-Place
§ Exercise: Update in-place (SystemML master/0.11 only)
  – a) Update in-place applies (investigate -explain and -stats):

for(i in 1:nrow(X))
  for(j in 1:ncol(X))
    X[i,j] = i+j;

  – b) Update in-place not applicable — why?

for(i in 1:nrow(X)) {
  for(j in 1:ncol(X)) {
    print(sum(X));
    X[i,j] = i+j;
  }
}

Page 27

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
  – ParFor Optimizer/Runtime
  – Buffer Pool + Specific Optimizations
  – Spark-Specific Rewrites
  – Partitioning-Preserving Operations
  – Update In-Place
  – Ongoing Research (CLA)

Page 28

Compressed Linear Algebra
§ Motivation / problem
  – Iterative ML algorithms w/ repeated read-only data access
  – IO-bound matrix-vector multiplications → crucial to fit the data in memory
  – General-purpose heavyweight compression is too slow; lightweight compression achieves only modest compression ratios
§ Goals
  – Performance close to uncompressed
  – Good compression ratios

[A. Elgohary, M. Boehm, P. J. Haas, F. R. Reiss, B. Reinwald: Compressed Linear Algebra for Large-Scale Machine Learning. PVLDB 9(12), 2016]

Page 29

Compressed Linear Algebra (2)
§ Approach
  – Database compression techniques
  – Linear algebra over the compressed representation
  – Column-compression schemes (OLE, RLE, UC)
  – Cache-conscious CLA operations
  – Sampling-based compression algorithm
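A minimal sketch of the RLE scheme named above, showing the core CLA idea: a dot product (one column of a matrix-vector multiply) computed directly on the compressed runs, without decompressing. OLE/UC and the sampling-based scheme selection are omitted; function names are my own:

```python
def rle_compress(col):
    """Run-length encode one column as (value, start, run_len)
    triples -- one triple per maximal run of equal values."""
    runs, i = [], 0
    while i < len(col):
        j = i
        while j < len(col) and col[j] == col[i]:
            j += 1
        runs.append((col[i], i, j - i))
        i = j
    return runs

def rle_dot(runs, v):
    """dot(column, v) on the compressed representation: each run
    contributes value * sum(v[start:start+run_len]), so cost is
    proportional to the number of runs, not the column length."""
    return sum(val * sum(v[s:s + ln]) for val, s, ln in runs)
```

For columns with few distinct values and long runs (common in ML feature matrices), this is both smaller and faster than operating on the uncompressed column.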

§ Results (end-to-end runtime)

[A. Elgohary, M. Boehm, P. J. Haas, F. R. Reiss, B. Reinwald: Compressed Linear Algebra for Large-Scale Machine Learning. PVLDB 9(12), 2016]

Algorithm | Dataset            | ULA     | Snappy  | CLA
GLM       | Mnist40m (90 GB)   | 409s    | 647s    | 397s
GLM       | Mnist240m (540 GB) | 74,301s | 23,717s | 2,787s
MLogreg   | Mnist40m (90 GB)   | 630s    | 875s    | 622s
MLogreg   | Mnist240m (540 GB) | 83,153s | 27,626s | 4,379s
L2SVM     | Mnist40m (90 GB)   | 394s    | 461s    | 429s
L2SVM     | Mnist240m (540 GB) | 14,041s | 8,423s  | 2,593s

Up to 26x improvement.

Page 30

SystemML is Open Source:
• Apache Incubator project (11/2015)
• Website: http://systemml.apache.org/
• Source code: https://github.com/apache/incubator-systemml

