
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias Boehm

Date post: 14-Apr-2017
Transcript
Page 1: Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias Boehm

© 2015 IBM Corporation

S7/8: SystemML’s Optimizer and Runtime

Matthias Boehm¹, Arvind C. Surve²

¹ IBM Research – Almaden   ² IBM Spark Technology Center

IBM Research

Page 2

Abstraction: The Good, the Bad and the Ugly


q = t(X) %*% (w * (X %*% v))

[adapted from Peter Alvaro: "I See What You Mean", Strange Loop, 2015]

The Good: Simple & Analysis-Centric · Data Independence · Platform Independence · Adaptivity

The Bad: (Missing) Size Information · Operator Selection · (Missing) Rewrites · Distributed Operations · Distributed Storage · (Implicit) Copy-on-Write · Data Skew · Load Imbalance · Latency · Complex Control Flow · Local/Remote Memory Budgets

The Ugly: Expectations ≠ Reality

→ Understanding the optimizer and runtime techniques underpinning declarative, large-scale ML is key to efficiency and performance.

Page 3

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
  – ParFor Optimizer/Runtime
  – Buffer Pool + Specific Optimizations
  – Spark-Specific Rewrites
  – Partitioning-Preserving Operations
  – Update In-Place
  – Ongoing Research (CLA)

Page 4

Optimization through ParFor
§ Motivation
  – SystemML focuses primarily on data parallelism
  – Dedicated parfor construct for task parallelism
§ ParFor approach
  – Complementary parfor parallelization strategies
  – Cost-based optimization framework for task-parallel ML
  – Memory budget as common constraint

Page 5

Recap: Basic HOP DAG Compilation — Example: Pearson Correlation
§ DML Script

X = read( "./in/X" );   # data on HDFS
Y = read( "./in/Y" );
m = nrow(X);
sigmaX = sqrt( centralMoment(X,2)*(m/(m-1.0)) );
sigmaY = sqrt( centralMoment(Y,2)*(m/(m-1.0)) );
r = cov(X,Y) / (sigmaX * sigmaY);
write( r, "./out/r" );

§ HOP DAG

[HOP DAG figure: reads X ("./in/X", 10⁶×1) and Y ("./in/Y", 10⁶×1); operators b(cm) with constant 2, b(*), b(-) over 1,000,000 and 1 (without constant folding, m/(m-1.0) stays unfolded rather than becoming the constant 1.000001), u(sqrt), b(cov), b(/), writing r ("./out/r"). Legend: u() … unary operator, b() … binary operator, cov … covariance, cm … central moment, sqrt … square root. The DAG computes ρ_{X,Y} = cov(X,Y) / (σ_X · σ_Y). Spark/MR data parallelism is exploited if beneficial/required.]

Page 6

Running Example: Pairwise Pearson Correlation
§ Representative of more complex bivariate statistics (Pearson's R, Anova F, chi-squared, degrees of freedom, P-value, Cramér's V, Spearman, etc.)

D = read("./input/D");
m = nrow(D);
n = ncol(D);
R = matrix(0, rows=n, cols=n);
parfor( i in 1:(n-1) ) {
  X = D[ ,i];
  m2X = centralMoment(X,2);
  sigmaX = sqrt( m2X*(m/(m-1.0)) );
  parfor( j in (i+1):n ) {
    Y = D[ ,j];
    m2Y = centralMoment(Y,2);
    sigmaY = sqrt( m2Y*(m/(m-1.0)) );
    R[i,j] = cov(X,Y) / (sigmaX*sigmaY);
  }
}
write(R, "./output/R");

Challenges:
• Triangular nested loop
• Column-wise access on unordered distributed data
• Bivariate all-to-all data shuffling pattern

Task and data parallelism are exploited if beneficial/required.
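For reference, the same triangular computation in plain Python — an illustrative sketch, not DML (function and helper names are my own); the point is that every (i, j) cell is independent, which is exactly what parfor exploits:

```python
import math

def pairwise_pearson(D):
    """Upper-triangular matrix R of pairwise Pearson correlations.

    Mirrors the DML above: iterations (i, j) with j > i are
    independent of each other, so they can run task-parallel.
    """
    m, n = len(D), len(D[0])
    cols = [[row[j] for row in D] for j in range(n)]

    def sigma(x):  # sample std. dev. via central moment, as in the DML
        mu = sum(x) / m
        m2 = sum((v - mu) ** 2 for v in x) / m  # 2nd central moment
        return math.sqrt(m2 * (m / (m - 1.0)))

    def cov(x, y):  # sample covariance
        mx, my = sum(x) / m, sum(y) / m
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (m - 1.0)

    R = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):          # parfor( i in 1:(n-1) )
        for j in range(i + 1, n):   # parfor( j in (i+1):n )
            R[i][j] = cov(cols[i], cols[j]) / (sigma(cols[i]) * sigma(cols[j]))
    return R
```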

Page 7

Overview of Parallelization Strategies
§ Conceptual design: master/worker
  – Task: group of parfor iterations
§ Task partitioning
  – Naive, static, fixed, factoring, factoring_cmax
  – Trade-off: task overhead vs. load balance
§ Task execution
  – Local, remote (Spark/MR), remote DP (Spark/MR)
  – Various runtime optimizations
  – Trade-off: degree of parallelism / IO / latency
§ Result aggregation
  – Local memory, local file, remote (Spark/MR)
  – With and without compare
  – Trade-off: result locality / IO / latency

n = 12
parfor( i in 1:(n-1) ) {
  X = D[ ,i];
  …
  R[i,j] = …
}

→ The optimizer leverages these strategies to generate efficient execution plans.

Page 8

Example Task Partitioning

§ Scenario: k = 24 workers, 10,000 iterations

[Figure: number of iterations per task under each partitioning scheme — Naive (1 iteration per task; tasks 1 to 10,000), Static (tasks 1 to 24), Fixed(250) (tasks 1 to 40), Factoring (tasks 1 to 208), and Factoring CMAX(150) (tasks 1 to 228).]
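The factoring scheme above can be sketched as follows. This is the textbook factoring algorithm (waves of k tasks, each wave covering half of the remaining iterations), not necessarily SystemML's exact implementation; the `cmax` parameter name is mine:

```python
def factoring(n_iter, k, cmax=None):
    """Task sizes under the 'factoring' scheme: waves of k tasks,
    each wave taking half the remaining iterations, so task sizes
    decay geometrically (large tasks first, small tasks last for
    good load balance). cmax optionally caps the task size
    (factoring_cmax)."""
    tasks, remaining = [], n_iter
    while remaining > 0:
        size = max(1, -(-remaining // (2 * k)))  # ceil(remaining / 2k)
        if cmax is not None:
            size = min(size, cmax)
        for _ in range(k):                        # one wave of k tasks
            if remaining == 0:
                break
            t = min(size, remaining)
            tasks.append(t)
            remaining -= t
    return tasks
```

For the slide's scenario (k=24, 10,000 iterations) this yields 208 tasks with sizes 209, …, 1, matching the plain-factoring chart.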

Page 9

Task Execution: Local and Remote Parallelism

Local execution (multicore): the local ParFOR master performs task partitioning and places tasks on a shared task queue (w1: i, {1,2,3} … w5: i, {11}); local ParWorkers 1…k each run

  while( w ← deq() ) foreach pi ∈ w: execute(prog(pi))

followed by parallel result aggregation.

Remote execution (cluster): ParFOR (remote) performs task partitioning and serializes tasks into a task file; Hadoop ParWorker mappers 1…k each run

  map(key, value): w ← parse(value); foreach pi ∈ w: execute(prog(pi))

producing result files (…A|MATRIX|./out/A7tmp), followed by parallel result aggregation.

Hybrid parallelism: combinations of local/remote parfor and data-parallel jobs.

Page 10

Task Execution: Runtime Optimizations
§ Data partitioning
  – Problem: repeated MR jobs for indexed access
  – Access-awareness (cost estimation, correct plan generation)
  – Operators: local file-based, remote MR job
§ Data locality
  – Problem: co-location of parfor tasks with partitions/matrices
  – Location reporting per logical parfor task (e.g., for parfor(i) → D[ ,i])

parfor( i in 1:(n-1) ) {
  X = D[ ,i]; …
  parfor( j in (i+1):n ) {
    Y = D[ ,j]; …
  }
}

[Figure: column partitions D1–D11 of D spread across Node 1 (D1, D2, D6–D8) and Node 2 (D3–D5, D9–D11); the task file (w1: i, {1,2,3} … w5: i, {11}) is annotated with reported locations (Node 1; Node 1, 2; Node 2) per logical parfor task.]

Page 11

Optimization Framework – Problem Formulation
§ Design: runtime optimization for each top-level parfor
§ Plan tree P
  – Nodes N_P, each with exec type et, parallelism k, and attributes A
  – Height h
  – Exec contexts EC_P
§ Plan tree optimization problem

[Figure: plan tree with ParFOR and Generic nodes over operators (b(cm), b(cov), RIX, LIX, …), mapped to execution contexts: ec0 with cm_ec = 600 MB, ck_ec = 1, and ec1 (MR) with cm_ec = 1024 MB, ck_ec = 16. Legend: ec … execution context, cm … memory constraint, ck … parallelism constraint.]

[M. Boehm et al.: Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB 7(7), 2014]

[M. Boehm et al.: Costing Generated Runtime Execution Plans for Large-Scale Machine Learning Programs. CoRR, 2015]

Page 12

Optimization Framework – Cost Model / Optimizer
§ Overview: heuristic optimizer
  – Time- and memory-based cost model w/o shared reads
  – Heuristic high-impact rewrites
  – Transformation-based search strategy with global optimization scope
§ Cost model
  – HOP DAG size propagation
  – Worst-case memory estimates
  – Time estimates
  – Plan tree statistics aggregation

[Figure: plan tree P (k=4) with mapped HOP DAGs. Size propagation annotates HOPs with dimensions (e.g., d1=1M, d2=1 for columns X and Y; d1=1M, d2=10 for D; d1=0, d2=0 for scalars) and memory estimates M = (<output mem>, <operation mem>), e.g., M = (80 MB, 80 MB) for D, M = (8 MB, 8 MB) for RIX, M = (8 MB, 88 MB) for b(cov), M = (0 MB, 8 MB) and M = (0 MB, 16 MB) for b(cm); plan tree statistics aggregate to estimates such as M = 88 MB and M = 352 MB.]

Page 13

Hands-On Lab: Task-Parallel ParFor Programs
§ Exercise: Pairwise Pearson Correlation
  – a) Simple for loop w/ -stats
  – b) Task-parallel parfor w/ -stats

D = rand(rows=100000, cols=100);
m = nrow(D);
n = ncol(D);
R = matrix(0, rows=n, cols=n);
parfor( i in 1:(n-1) ) {
  X = D[ ,i];
  m2X = centralMoment(X,2);
  sigmaX = sqrt( m2X*(m/(m-1.0)) );
  parfor( j in (i+1):n ) {
    Y = D[ ,j];
    m2Y = centralMoment(Y,2);
    sigmaY = sqrt( m2Y*(m/(m-1.0)) );
    R[i,j] = cov(X,Y) / (sigmaX*sigmaY);
  }
}
write(R, "./tmp/R", format="binary");

Page 14

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
  – ParFor Optimizer/Runtime
  – Buffer Pool + Specific Optimizations
  – Spark-Specific Rewrites
  – Partitioning-Preserving Operations
  – Update In-Place
  – Ongoing Research (CLA)

Page 15

Buffer Pool Overview
§ Motivation
  – Exchange of intermediates between local and remote operations (HDFS, RDDs, GPU memory)
  – Eviction of in-memory objects (integrated with garbage collection)
§ Primitives
  – acquireRead, acquireModify, release, exportData, getRdd, getBroadcast
§ Spark specifics
  – Lineage tracking of RDDs/broadcasts
  – Guarded RDD collect/parallelize
  – Partitioned broadcast variables

[Figure: MatrixObject/WriteBuffer with lineage tracking.]
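A toy sketch of the pin/evict life cycle behind these primitives. Class and method names are illustrative analogues, not SystemML's actual buffer-pool API; a dict stands in for local disk:

```python
class BufferPool:
    """Toy buffer pool: objects are pinned while an operator uses
    them, and unpinned objects may be evicted to 'disk' when the
    pool exceeds its capacity."""

    def __init__(self, capacity):
        self.capacity = capacity          # max resident objects
        self.mem, self.disk, self.pins = {}, {}, {}

    def put(self, name, data):
        self.mem[name] = data
        self._evict_if_needed()

    def acquire_read(self, name):         # cf. acquireRead
        if name not in self.mem:          # restore evicted object
            self.mem[name] = self.disk.pop(name)
        self.pins[name] = self.pins.get(name, 0) + 1
        self._evict_if_needed()
        return self.mem[name]

    def release(self, name):              # cf. release
        self.pins[name] -= 1

    def _evict_if_needed(self):
        while len(self.mem) > self.capacity:
            # evict the first unpinned object; pinned ones are safe
            victim = next((n for n in self.mem if self.pins.get(n, 0) == 0), None)
            if victim is None:            # everything pinned
                break
            self.disk[victim] = self.mem.pop(victim)
```

Pinned objects can never be evicted mid-operation, which is the property the real acquire/release protocol guarantees.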

Page 16

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
  – ParFor Optimizer/Runtime
  – Buffer Pool + Specific Optimizations
  – Spark-Specific Rewrites
  – Partitioning-Preserving Operations
  – Update In-Place
  – Ongoing Research (CLA)

Page 17

Spark-Specific Optimizations
§ Spark-specific rewrites
  – Automatic caching/checkpoint injection (MEM_DISK / MEM_DISK_SER)
  – Automatic repartition injection
§ Operator selection
  – Spark exec type selection
  – Transitive Spark exec type
  – Physical operator selection
§ Extended ParFor optimizer
  – Deferred checkpoint/repartition injection
  – Eager checkpointing/repartitioning
  – Fair scheduling for concurrent jobs
  – Local degree of parallelism
§ Runtime optimizations
  – Lazy Spark context creation
  – Short-circuit read/collect

Example: checkpoint injection for LinregCG

X = read($1);     # chkpt X MEM_DISK (injected)
y = read($2);
...
r = -(t(X) %*% y);
while(i < maxi & norm_r2 > norm_r2_trgt) {
  q = t(X) %*% (X %*% p) + lambda * p;
  alpha = norm_r2 / (t(p) %*% q);
  w = w + alpha * p;
  old_norm_r2 = norm_r2;
  r = r + alpha * q;
  norm_r2 = sum(r * r);
  beta = norm_r2 / old_norm_r2;
  p = -r + beta * p;
  i = i + 1;
}
...
write(w, $4);

Spark execution (24 cores): 25% user memory, 75% data & execution memory (50% min / 75% max).
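The conjugate-gradient loop above can be sketched in plain Python. This is an illustrative translation, not SystemML code; `matvec`/`tmatvec` are my own helpers. Note that X is read twice per iteration (X %*% p and t(X) %*% …), which is exactly why the optimizer injects a MEM_DISK checkpoint for X:

```python
def linreg_cg(X, y, lam=0.0, max_iter=10, tol=1e-12):
    """Conjugate gradient for ridge-regularized least squares,
    mirroring the DML loop on the slide."""
    m, n = len(X), len(X[0])

    def matvec(A, v):                 # A %*% v
        return [sum(a * b for a, b in zip(row, v)) for row in A]

    def tmatvec(A, v):                # t(A) %*% v, without building t(A)
        out = [0.0] * n
        for row, vi in zip(A, v):
            for j, aij in enumerate(row):
                out[j] += aij * vi
        return out

    w = [0.0] * n
    r = [-ri for ri in tmatvec(X, y)]         # r = -(t(X) %*% y)
    p = [-ri for ri in r]
    norm_r2 = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if norm_r2 <= tol:
            break
        # q = t(X) %*% (X %*% p) + lambda * p  -- X touched twice here
        q = [qi + lam * pi for qi, pi in zip(tmatvec(X, matvec(X, p)), p)]
        alpha = norm_r2 / sum(pi * qi for pi, qi in zip(p, q))
        w = [wi + alpha * pi for wi, pi in zip(w, p)]
        r = [ri + alpha * qi for ri, qi in zip(r, q)]
        old, norm_r2 = norm_r2, sum(ri * ri for ri in r)
        beta = norm_r2 / old
        p = [-ri + beta * pi for ri, pi in zip(r, p)]
    return w
```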

Page 18

SystemML on Spark: Lessons Learned
§ Spark over a custom framework
  – Well-engineered framework with a strong contributor base
  – Seamless data preparation and feature engineering
§ Stateful distributed caching
  – Standing executors with distributed caching and fast task scheduling
  – Challenges: task parallelism, memory constraints, fair resource management
§ Memory efficiency
  – Compact data structures to avoid cache spilling (serialization, CSR)
  – Custom serialization and compression
§ Lazy RDD evaluation
  – Automatic grouping of operations into distributed jobs, incl. partitioning
  – Challenges: multiple actions/repeated execution, runtime plan compilation
§ Declarative ML
  – Introducing the Spark backend did not require algorithm changes
  – Distributed caching and partitioning exploited automatically via rewrites

Page 19

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
  – ParFor Optimizer/Runtime
  – Buffer Pool + Specific Optimizations
  – Spark-Specific Rewrites
  – Partitioning-Preserving Operations
  – Update In-Place
  – Ongoing Research (CLA)

Page 20

Partitioning-Preserving Operations on Spark
§ Partitioning-preserving ops
  – An op is partitioning-preserving if the key is guaranteed unchanged
  – 1) Implicit: use restrictive APIs (mapValues() vs. mapToPair())
  – 2) Explicit: partition computation w/ declaration of partitioning-preserving behavior (memory efficiency via "lazy iterators")
§ Partitioning-exploiting ops
  – 1) Implicit: operations based on join, cogroup, etc.
  – 2) Explicit: custom physical operators on original keys (e.g., zipmm)

[Figure: physical blocking and partitioning.]

Page 21

Partitioning-Exploiting ZIPMM
§ Operation: Z = t(X) %*% y

§ Naive approach: transpose, join, multiplication — incurs a shuffle, because partitions are not preserved after the transpose (the block keys change: 1,1 / 2,1 / 3,1 become 1,1 / 1,2 / 1,3).

§ zipmm approach: join, then fused transpose & multiplication on the original row-block keys of X and y — avoids the unnecessary shuffle.
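The zipmm idea can be sketched with plain Python dicts standing in for keyed RDD blocks. This is illustrative only — the block layout and names are assumptions, not SystemML's actual Spark operator:

```python
def zipmm(x_blocks, y_blocks):
    """Z = t(X) %*% y computed as the sum over co-partitioned row
    blocks of t(X_b) %*% y_b: the join runs on the original
    row-block keys, so no transpose shuffle is needed.

    x_blocks: {block_id: list of rows}, y_blocks: {block_id: list
    of scalars}, with matching row partitioning."""
    n = len(next(iter(x_blocks.values()))[0])   # ncol(X)
    Z = [0.0] * n
    for key, Xb in x_blocks.items():            # "join" on block id
        yb = y_blocks[key]
        for row, yi in zip(Xb, yb):             # t(Xb) %*% yb, fused
            for j, xij in enumerate(row):
                Z[j] += xij * yi
    return Z
```

Each block's partial product is a length-ncol(X) vector, so only these small partials are combined — never the transposed blocks of X.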

Page 22

Example: Multiclass SVM
§ Vectors of length nrow(X) fit neither into the driver nor into a broadcast (MapMM not applicable)
§ ncol(X) ≤ Bc (zipmm applicable)

parfor(iter_class in 1:num_classes) {
  Y_local = 2 * (Y == iter_class) - 1;   # chkpt Y_local MEM_DISK
  g_old = t(X) %*% Y_local;              # zipmm
  ...
  while( continue ) {
    Xd = X %*% s;                        # repart, chkpt X MEM_DISK
    ... inner while loop (compute step_sz)
    Xw = Xw + step_sz * Xd;              # chkpt Xd, Xw MEM_DISK
    out = 1 - Y_local * Xw;
    out = (out > 0) * out;
    g_new = t(X) %*% (out * Y_local);
    ...

(Comments show the injected rewrites annotated on the slide.)

Page 23

Hands-On Lab: Partitioning-Preserving Operations
§ Exercise: Multiclass SVM
  – W/o repartition injection
  – W/ repartition injection

parfor(iter_class in 1:num_classes) {
  Y_local = 2 * (Y == iter_class) - 1;
  g_old = t(X) %*% Y_local;
  ...
  while( continue ) {
    Xd = X %*% s;
    ... inner while loop (compute step_sz)
    Xw = Xw + step_sz * Xd;
    out = 1 - Y_local * Xw;
    out = (out > 0) * out;
    g_new = t(X) %*% (out * Y_local);
    ...
  }
}

Page 24

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
  – ParFor Optimizer/Runtime
  – Buffer Pool + Specific Optimizations
  – Spark-Specific Rewrites
  – Partitioning-Preserving Operations
  – Update In-Place
  – Ongoing Research (CLA)

Page 25

Update In-Place
§ Loop update in-place
  – 1) ParFor result indexing / intermediates (w/ pinned matrix objects)
  – 2) For/while/parfor loops with pure left-indexing access to a variable
  – Both require pinning / shallow serialization to overcome buffer-pool serialization
  – Example of type 2:

for(i in 1:nrow(X))
  for(j in 1:ncol(X))
    X[i,j] = i+j;

§ Where update in-place cannot be applied
  – Matrix object cannot fit into the local memory budget (CP only)
  – Interleaving operations (a mix of update and reference, possibly non-obvious)
  – Example — updating X in place here would create incorrect results, because R still references the old value of X:

R = X;
X[i,j] = i+j;
y = sum(R);
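The same aliasing hazard exists in any language with reference semantics; a minimal Python analogy (illustrative only, not SystemML's buffer-pool mechanics):

```python
# Lists are reference-copied, so an in-place update to X is
# visible through R unless a copy is made first (copy-on-write).
X = [[1.0, 2.0], [3.0, 4.0]]

R = X                         # R is an alias, not a copy
X[0][0] = 99.0                # naive update in-place
assert R[0][0] == 99.0        # R silently changed: incorrect result

X = [[1.0, 2.0], [3.0, 4.0]]
R = [row[:] for row in X]     # copy before updating (copy-on-write)
X[0][0] = 99.0
assert R[0][0] == 1.0         # R keeps the old value: correct
```

This is why the optimizer only applies update in-place when it can prove no such interleaved reference exists.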

Page 26

Hands-On Lab: Update In-Place
§ Exercise: Update in-place (SystemML master/0.11 only)
  – a) Update in-place applies (investigate -explain and -stats):

for(i in 1:nrow(X))
  for(j in 1:ncol(X))
    X[i,j] = i+j;

  – b) Update in-place not applicable — why?

for(i in 1:nrow(X)) {
  for(j in 1:ncol(X)) {
    print(sum(X));
    X[i,j] = i+j;
  }
}

Page 27

Outline
§ Common Framework
§ Optimizer-Centric Techniques
§ Runtime-Centric Techniques
  – ParFor Optimizer/Runtime
  – Buffer Pool + Specific Optimizations
  – Spark-Specific Rewrites
  – Partitioning-Preserving Operations
  – Update In-Place
  – Ongoing Research (CLA)

Page 28

Compressed Linear Algebra
§ Motivation / problem
  – Iterative ML algorithms w/ repeated read-only data access
  – IO-bound matrix-vector multiplications → crucial to fit the data in memory
  – General-purpose heavyweight compression is too slow; lightweight compression achieves only modest compression ratios
§ Goals
  – Performance close to uncompressed
  – Good compression ratios

[A. Elgohary, M. Boehm, P. J. Haas, F. R. Reiss, B. Reinwald: Compressed Linear Algebra for Large-Scale Machine Learning. PVLDB 9(12), 2016]

Page 29

Compressed Linear Algebra (2)
§ Approach
  – Database compression techniques
  – Linear algebra over the compressed representation
  – Column-compression schemes (OLE, RLE, UC)
  – Cache-conscious CLA operations
  – Sampling-based compression algorithm
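A minimal sketch of the RLE scheme named above, showing the core CLA idea: a dot product (one column of a matrix-vector multiply) computed directly on the compressed runs, without decompressing. OLE/UC and the sampling-based scheme selection are omitted; function names are my own:

```python
def rle_compress(col):
    """Run-length encode one column as (value, start, run_len)
    triples -- one triple per maximal run of equal values."""
    runs, i = [], 0
    while i < len(col):
        j = i
        while j < len(col) and col[j] == col[i]:
            j += 1
        runs.append((col[i], i, j - i))
        i = j
    return runs

def rle_dot(runs, v):
    """dot(column, v) on the compressed representation: each run
    contributes value * sum(v[start:start+run_len]), so cost is
    proportional to the number of runs, not the column length."""
    return sum(val * sum(v[s:s + ln]) for val, s, ln in runs)
```

For columns with few distinct values and long runs (common in ML feature matrices), this is both smaller and faster than operating on the uncompressed column.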

§ Results (end-to-end runtime)

[A. Elgohary, M. Boehm, P. J. Haas, F. R. Reiss, B. Reinwald: Compressed Linear Algebra for Large-Scale Machine Learning. PVLDB 9(12), 2016]

Algorithm | Dataset            | ULA     | Snappy  | CLA
GLM       | Mnist40m (90 GB)   | 409s    | 647s    | 397s
GLM       | Mnist240m (540 GB) | 74,301s | 23,717s | 2,787s
MLogreg   | Mnist40m (90 GB)   | 630s    | 875s    | 622s
MLogreg   | Mnist240m (540 GB) | 83,153s | 27,626s | 4,379s
L2SVM     | Mnist40m (90 GB)   | 394s    | 461s    | 429s
L2SVM     | Mnist240m (540 GB) | 14,041s | 8,423s  | 2,593s

Up to 26x improvement.

Page 30

SystemML is Open Source:
• Apache Incubator project (11/2015)
• Website: http://systemml.apache.org/
• Source code: https://github.com/apache/incubator-systemml

