Scalable Multi-Language Data Analysis on BEAM
The Erlang Experience

Jörgen Brandt

Humboldt-Universität zu Berlin

2016-09-08

Erlang User Conference 2016

Jorgen Brandt (HU Berlin) The Erlang Experience 2016-09-08 1 / 29


About me

PhD student at Humboldt (Berlin)
Creator of Cuneiform workflow language
Major area of application: Bioinformatics


Motivation

Introduction to Cuneiform
How to implement a large-scale data analysis PL
How in Erlang?
Why in Erlang?


Scientific Workflow Systems


Workflows as DAGs

Scientific Workflows are DAGs
  Nodes are tasks
  Edges are data dependencies
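
The DAG view can be made concrete with a minimal sketch. The task names below are hypothetical and not taken from the talk, and the standard-library `graphlib` module stands in for whatever a real workflow system does; the point is only that data dependencies fix an order, while tasks with no path between them can run side by side.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks whose output it consumes.
workflow = {
    "align": {"decompress"},
    "call_variants": {"align", "index_reference"},
    "decompress": set(),
    "index_reference": set(),
}

def execution_waves(dag):
    """Group tasks into waves; tasks within one wave do not depend on each other."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose inputs are available
        waves.append(ready)
        sorter.done(*ready)                 # mark them finished to unlock successors
    return waves

waves = execution_waves(workflow)
```

Each inner list in `waves` is a set of tasks that a scheduler could, in principle, dispatch at the same time.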


The Next Generation Sequencing use case


Cuneiform is ...

Functional Language for Large-Scale Data Analysis
implemented in Erlang

Different from systems like Spark
  Standalone syntax, no fluent API in Scala
  Integration of foreign PLs

Different from distributed workflow languages like Snakemake
  Reduction of functional expression, no static dependency graph


The main idea

Functional Programming
+
Foreign Language Interfacing


Functional Programming

Functional Programming
  Very expressive
  Natural operations on lists (map, ...)
  Natural iteration (recursion)
Gain: General data analysis

  Parallelize independent sub-expressions
  Lazy Evaluation
Gain: Automatic Parallelism
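
A minimal sketch of what "parallelize independent sub-expressions" can mean in practice. This is not Cuneiform's evaluator; the tuple encoding `(function, arg, ...)` and the one-thread-per-argument strategy are assumptions chosen for brevity. Because the arguments of a pure function application do not depend on each other, they can be evaluated concurrently and joined before the function is applied.

```python
import operator
import threading

def evaluate(expr):
    """Evaluate a nested (function, arg, ...) tuple; sibling arguments run concurrently."""
    if not isinstance(expr, tuple):
        return expr  # a plain value is already fully evaluated
    func, *args = expr
    results = [None] * len(args)

    def work(index, arg):
        results[index] = evaluate(arg)

    # One thread per argument: the arguments share no data, so order does not matter.
    threads = [threading.Thread(target=work, args=(i, a)) for i, a in enumerate(args)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return func(*results)

# (2 * 3) + (4 * 5): the two products are independent sub-expressions.
result = evaluate((operator.add, (operator.mul, 2, 3), (operator.mul, 4, 5)))
```

The sketch ignores error propagation from worker threads and makes no attempt at laziness; it only illustrates that the parallelism falls out of the expression structure rather than from explicit annotations.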


Cuneiform Example

deftask gunzip( out( File ) : gz( File ) )in bash *{
  out=output.file
  gzip -c -d $gz > $out
}*

gunzip( gz: 'myarchive1.gz' 'myarchive2.gz' 'myarchive3.gz' );

[Figure: gunzip task graph]
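
The call above passes three archives to a task declared for a single file. A rough Python analogue, assuming the task is applied once per archive and that the applications are independent, might look as follows. The function names and the output naming scheme (stripping `.gz`) are hypothetical; the slide's task body writes to `output.file` instead.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def gunzip(gz_path):
    """Decompress one archive, comparable to `gzip -c -d $gz > $out` in the task body."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # hypothetical naming: 'myarchive1.gz' -> 'myarchive1'
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path

def gunzip_all(gz_paths):
    """Apply gunzip to every archive; the calls are independent, so they may overlap."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(gunzip, gz_paths))
```

Calling `gunzip_all(['myarchive1.gz', 'myarchive2.gz', 'myarchive3.gz'])` returns the output paths in input order, regardless of which decompression finishes first.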


Workflow Implementations Available

cuneiform-lang.org/examples

Available example workflows:
  Variant Calling (Varscan)
  Methylation
  RNA-Seq
  Variant Calling (GATK)
  etc.

[Figure: workflow graph with task nodes samtools-faidx, fastq-dump, cufflinks, cummerbund, tophat-single, cuffdiff, cuffmerge]


Example: RNA-Seq
cuneiform-lang.org/examples


Biology-Inspired Analogy

Large-scale data analysis systems as a 2-tier system:
  Algorithm level (fast, local)
  Workflow level (organizational, distributed)

Analogy to the cell:
  DNA transcription (fast, local)
  Cell-to-cell signaling (organizational, distributed)

Erlang is a good fit for the organizational part


Distributed Workflow System Architecture

[Figure: architecture diagram with a Scheduler, three FS/Work pairs, a Cache, and three Query components]


Two Modeling Challenges (i)

Interpreter
  Reduction of query expression

Execution Environment
  Distributed System


How to model programming languages?


Modeling the Cuneiform Interpreter

[Figure: pipeline diagram with labels BNF, Parser Generator (Leex/Yecc), Compiler, Program, Abstract Program, Operational Semantics, Transcription, Interpreter, Result]

Parser is generated from BNF
Interpreter is transcribed from Operational Semantics


Abstract Syntax
An expression in Cuneiform is ...

Constructors shown in the diagram (Expr is written x):
  Str      (str, s)
  Var      (var, n)
  Select   (select, c, u)
  App      (app, c, λ, fa)
  Cnd      (cnd, [xc1, ...], [xt1, ...], [xe1, ...])
  Fut      (fut, r, [po1, ...])
  Lam      (lam, s, b)
  Natbody  (natbody, fb)
  Forbody  (forbody, l, s)
  Sign     (sign, [po1, ...], [pi1, ...])
  Param    (param, n, pl)
  Correl   (correl, [n1, ...])

[Figure: type diagram; field-type labels String, N+, Expr, Var, Param, B, String→[Expr]; foreign languages bash, python, r]

Expressions can contain other expressions
Semantics define how expressions are reduced

x0 → x1 → ... → x*
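
The reduction chain above can be illustrated with a toy small-step interpreter. This is a sketch, not the Cuneiform semantics: it covers only three tuple forms borrowed from the slide (`str`, `var`, `cnd`), and it assumes that a variable is bound to a list of expressions and that a condition counts as true when it reduces to a non-empty list. The helper names `step` and `reduce_all` are hypothetical.

```python
# Toy expression forms, loosely modeled on the tuples from the slide:
#   ("str", s)                    string literal, already a value
#   ("var", n)                    variable reference, looked up in env
#   ("cnd", cond, then, else_)    conditional; cond/then/else_ are expression lists

def step(exprs, env):
    """Apply one reduction rule to a list of expressions; return None if none applies."""
    for i, x in enumerate(exprs):
        tag = x[0]
        if tag == "var":
            # Replace the variable by the expression list it is bound to.
            return exprs[:i] + env[x[1]] + exprs[i + 1:]
        if tag == "cnd":
            _, cond, then, else_ = x
            reduced = step(cond, env)
            if reduced is not None:
                # The condition itself can still be reduced: do that first.
                return exprs[:i] + [("cnd", reduced, then, else_)] + exprs[i + 1:]
            # Assumed toy rule: a non-empty condition list selects the then-branch.
            branch = then if cond else else_
            return exprs[:i] + branch + exprs[i + 1:]
    return None

def reduce_all(exprs, env):
    """Return the whole chain x0, x1, ..., x* until no rule applies anymore."""
    trace = [exprs]
    while True:
        nxt = step(trace[-1], env)
        if nxt is None:
            return trace
        trace.append(nxt)

env = {"a": [("str", "x")], "empty": []}
program = [("cnd", [("var", "a")], [("str", "yes")], [("str", "no")])]
trace = reduce_all(program, env)
```

Here `trace[0]` is the original program, each later entry is the result of exactly one rule application, and `trace[-1]` is the normal form. Writing each rule as one branch of a pure function is the kind of near-literal transcription of an operational semantics that the talk attributes to functional languages.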


Implementing an Operational Semantics in Erlang


Two Modeling Challenges (ii)

Interpreter
  Reduction of query expression

Execution Environment
  Distributed System


How to model distributed systems?


How to Model Distributed Systems

Petri Nets
  Mature theory
    Liveness
    Invariants
    Traps/Cotraps
    ...
  Local decision/synchronization
  Parallel execution of independent transitions
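
For readers unfamiliar with Petri nets, the following minimal sketch shows the firing rule. The net itself is hypothetical: the place names are invented, and the transition names `schedule` and `reply` merely echo labels from the talk's diagram without claiming to reproduce it. A transition is enabled when each of its input places holds enough tokens, which is a purely local check.

```python
from collections import Counter

class PetriNet:
    """Place/transition net with plain (uncolored) tokens."""

    def __init__(self, marking, transitions):
        self.marking = Counter(marking)   # place -> number of tokens
        self.transitions = transitions    # name -> (input places, output places)

    def enabled(self):
        """Names of transitions whose input places all hold enough tokens."""
        return sorted(
            name
            for name, (inputs, _) in self.transitions.items()
            if all(self.marking[p] >= n for p, n in Counter(inputs).items())
        )

    def fire(self, name):
        """Consume one token per input place and produce one per output place."""
        if name not in self.enabled():
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        self.marking.subtract(inputs)
        self.marking.update(outputs)

# Hypothetical net: two submitted jobs compete for a single idle worker.
net = PetriNet(
    {"submitted": 2, "idle_worker": 1},
    {
        "schedule": (["submitted", "idle_worker"], ["running"]),
        "reply": (["running"], ["done", "idle_worker"]),
    },
)
net.fire("schedule")            # one job starts, the worker is now busy
after_schedule = net.enabled()  # only "reply" can fire until the worker is free
net.fire("reply")               # job finishes, worker returns to the idle pool
```

Two transitions that share no input place never disable each other, which is the sense in which independent transitions may fire in parallel.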


Modeling the Cuneiform Execution Environment

[Figure: Petri net of the execution environment. Labels include monitor, decommission, resubmit, submit, schedule, reply, lookup, App, S3, S4, S5, P1, P2, P3, T3, T4, D, A, C; arc inscriptions a, p, (a,r), (a,q), (p,a), (p,s), (p,a,r), (p,s,a,q); guard pmatch(s,q)]

Execution Environment
a ∈ App, p ∈ Pid, r ∈ Result ∪ Err, q ∈ Req, s ∈ Spec


How to do Petri Nets in Erlang


Flow-based programming revisited

System languages for heavy lifting
Large-scale data analysis is hard
  Because programming languages are hard
  Because distributed systems are hard
Erlang is a good fit
  Because FP is already close to operational-semantics-style notation
  Because the Erlang process model is already close to Petri Nets


REPL

cuneiform-lang.org


Conclusion

Functional Programming + Foreign Languages
Distributed Execution
  Local Multicore
  HTCondor
  Hadoop
Automatic Parallelization
Expressive data analysis workflows
Foreign Language Interface
  Bash
  Perl
  ...

cuneiform-lang.org


Questions?
