Using R for Iterative and Incremental Processing

Page 1: Using R for Iterative and Incremental Processing

Using R for Iterative and Incremental Processing

Shivaram Venkataraman, Indrajit Roy, Alvin AuYoung, Robert Schreiber

UC Berkeley and HP Labs

Page 2: Using R for Iterative and Incremental Processing

Big Data, Complex Algorithms

PageRank (Dominant eigenvector)

Recommendations (Matrix factorization)

Anomaly detection (Top-K eigenvalues)

User Importance (Vertex Centrality)

Machine learning + Graph algorithms

Iterative Linear Algebra Operations

6/29/2012

Page 4: Using R for Iterative and Incremental Processing

PageRank Using Matrices

Dominant eigenvector

M = modified web graph matrix

p = PageRank vector

[Figure: the N x N matrix M and the vectors p and Z, each split into N/s row blocks P1, P2, ..., PN/s of s rows]

Simplified algorithm:

repeat { p = M*p + Z}
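
The simplified loop can be sketched in plain Python (not Presto or R). The slide only calls M the "modified web graph matrix" and does not define Z, so the choices here are assumptions made for illustration: a damping factor d = 0.85, M[i][j] = d / outdegree(j) when page j links to page i, Z = (1 - d) / N for every page, and a made-up 3-page graph.

```python
def matvec(M, p):
    """Dense matrix-vector product M * p."""
    return [sum(m_ij * p_j for m_ij, p_j in zip(row, p)) for row in M]


def pagerank(links, d=0.85, iters=100):
    """Repeat p = M*p + Z for a fixed number of iterations.

    links[j] lists the pages that page j links to (assumed non-empty).
    """
    n = len(links)
    # Assumed form of the "modified web graph matrix".
    M = [[0.0] * n for _ in range(n)]
    for j, targets in enumerate(links):
        for i in targets:
            M[i][j] = d / len(targets)
    Z = [(1.0 - d) / n] * n          # assumed constant vector
    p = [1.0 / n] * n                # start from a uniform vector
    for _ in range(iters):
        p = [mp + z for mp, z in zip(matvec(M, p), Z)]
    return p


# Hypothetical graph: page 0 -> 1, 2; page 1 -> 2; page 2 -> 0.
links = [[1, 2], [2], [0]]
ranks = pagerank(links)
print([round(r, 4) for r in ranks])
```

Page 2 collects links from both other pages, so it ends up with the highest rank in this toy graph.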

Page 5: Using R for Iterative and Incremental Processing

Breadth-first Search Using Matrices

G = adjacency matrix

X = BFS vector

G, with rows and columns labelled A to E:

      A B C D E
    A 1 1 1 0 0
    B 0 1 0 1 0
    C 0 1 1 0 0
    D 0 0 0 1 1
    E 0 0 0 0 1

X after each step, starting from vertex A (* marks a nonzero entry):

             A B C D E
    start    1 0 0 0 0
    step 1   * * * 0 0
    step 2   * * * * 0
    step 3   * * * * *

Simplified algorithm:

repeat { X = G*X }

Matrix operations: easy to express, efficient to implement
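
A minimal Python sketch of the same search, using the 5 x 5 matrix from the slide. Two details are assumptions of this sketch rather than statements from the deck: row i of G is read as the out-edges of vertex i, so the frontier grows through the row-vector product X*G (equivalently G transposed times a column vector), and a boolean OR replaces the sum so that entries stay 0 or 1.

```python
NODES = ["A", "B", "C", "D", "E"]
G = [
    [1, 1, 1, 0, 0],  # A -> A, B, C
    [0, 1, 0, 1, 0],  # B -> B, D
    [0, 1, 1, 0, 0],  # C -> B, C
    [0, 0, 0, 1, 1],  # D -> D, E
    [0, 0, 0, 0, 1],  # E -> E
]


def step(X, G):
    """One BFS step: vertex j is reached if some reached vertex i has an edge i -> j."""
    n = len(G)
    return [1 if any(X[i] and G[i][j] for i in range(n)) else 0 for j in range(n)]


def bfs_frontiers(G, start):
    """Return the BFS vector after every step until it stops changing."""
    X = [0] * len(G)
    X[start] = 1
    history = [X]
    while True:
        nxt = step(X, G)
        if nxt == X:
            return history
        history.append(nxt)
        X = nxt


for X in bfs_frontiers(G, NODES.index("A")):
    print(" ".join("*" if x else "0" for x in X))
```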

Page 10: Using R for Iterative and Incremental Processing

Linear Algebra on Existing Frameworks

Data-parallel frameworks – MapReduce/Dryad

– Process each record in parallel

– Use case: Computing sufficient statistics

Graph-centric frameworks – Pregel/GraphLab

– Process each vertex in parallel

– Use case: Graph models

Matrix Operations: Structured, coarse grained

Need global state

Page 13: Using R for Iterative and Incremental Processing

Challenge 1 – Sparse Matrices

[Figure: block density (normalized, log scale from 1 to 10000) against block ID (1 to 91) for the LiveJournal, Netflix, and ClueWeb-1B datasets]

Some blocks hold 1000x more elements than others, which causes computation imbalance.

Page 19: Using R for Iterative and Incremental Processing

Challenge 2 – Incremental Updates

Incremental computation on consistent view of data

[Figure: New movie ratings -> Refine recommendations -> Better suggestions]

Page 20: Using R for Iterative and Incremental Processing

Presto

Framework for large-scale iterative linear algebra

Extend R for scalability and incremental updates

Page 21: Using R for Iterative and Incremental Processing

Outline

• Motivation

• Programming model

• Design

• Applications and Results

Page 22: Using R for Iterative and Incremental Processing

Programming Model

One data structure: Distributed Array

    A <- darray(…)

Iteration: foreach

[Figure: four Compute tasks running side by side]

Incremental updates: onchange, update

[Figure: four Compute tasks, followed by "Data Updated"]

Page 28: Using R for Iterative and Incremental Processing

PageRank Using Presto

    # Create distributed arrays
    M <- darray(dim=c(N,N), blocks=c(s,N))
    P <- darray(dim=c(N,1), blocks=c(s,1))
    while(..){
      # Execute function in a cluster
      foreach(i, 1:len,
        # Pass array partitions
        calculate(p=splits(P,i), m=splits(M,i),
                  x=splits(P_old), z=splits(Z,i)) {
          p <- (m*x)+z
        }
      )
      P_old <- P
    }

[Figure: M (N x N) split into row blocks of s rows; P_old, Z, and P split into matching blocks P1, P2, ..., PN/s]
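
The partitioned loop can be imitated in plain Python to show how the row blocks line up. This is only a sketch and not the Presto API: list slicing stands in for splits, a sequential list comprehension stands in for the parallel foreach, and the 3 x 3 matrix M and vector Z are made-up values (damping 0.85 on a 3-page graph).

```python
def split_rows(rows, s):
    """Cut a list of rows into consecutive blocks of s rows (the last may be shorter)."""
    return [rows[k:k + s] for k in range(0, len(rows), s)]


def block_step(m_block, x, z_block):
    """p = (m * x) + z for one row block; x is the whole previous vector."""
    return [sum(a * b for a, b in zip(row, x)) + z
            for row, z in zip(m_block, z_block)]


def pagerank_blocked(M, Z, s, iters=100):
    n = len(M)
    M_splits = split_rows(M, s)
    Z_splits = split_rows(Z, s)
    P_old = [1.0 / n] * n
    for _ in range(iters):
        # Each block is independent of the others within one iteration,
        # which is what lets a cluster compute them in parallel.
        P_splits = [block_step(M_splits[i], P_old, Z_splits[i])
                    for i in range(len(M_splits))]
        P_old = [v for block in P_splits for v in block]
    return P_old


# Hypothetical inputs: page 0 -> 1, 2; page 1 -> 2; page 2 -> 0.
M = [[0.0, 0.0, 0.85],
     [0.425, 0.0, 0.0],
     [0.425, 0.85, 0.0]]
Z = [0.05, 0.05, 0.05]
p_two_blocks = pagerank_blocked(M, Z, s=2)
p_one_block = pagerank_blocked(M, Z, s=3)
print([round(v, 4) for v in p_two_blocks])
```

The block size changes only how the work is divided, so both calls produce the same vector.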

Page 33: Using R for Iterative and Incremental Processing

Incremental PageRank

    M <- darray(dim=c(N,N), blocks=c(s,N))
    P <- darray(dim=c(N,1), blocks=c(s,1))
    # Execute when data changes
    onchange(M) {
      while(..){
        foreach(i, 1:len,
          calculate(p=splits(P,i), m=splits(M,i),
                    x=splits(P_old), z=splits(Z,i)) {
            p <- (m*x)+z
            # Update PageRank vector
            update(p)
          })
        P_old <- P
      }
    }

[Figure: M (N x N) split into row blocks of s rows; P_old, Z, and P split into matching blocks P1, P2, ..., PN/s]
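
A rough Python illustration of the incremental idea: when the graph changes, the handler refines the current PageRank vector instead of starting over. Everything specific here is an assumption of the sketch: the function names, the damping factor 0.85, the convergence tolerance, and the 4-page graph. It is not Presto's onchange mechanism.

```python
def build_matrix(links, d=0.85):
    """M[i][j] = d / outdegree(j) if page j links to page i (assumed form of M)."""
    n = len(links)
    M = [[0.0] * n for _ in range(n)]
    for j, targets in enumerate(links):
        for i in targets:
            M[i][j] = d / len(targets)
    return M


def converge(M, Z, p, tol=1e-12, max_iters=10000):
    """Repeat p = M*p + Z until the largest change drops below tol."""
    for it in range(1, max_iters + 1):
        new_p = [sum(a * b for a, b in zip(row, p)) + z for row, z in zip(M, Z)]
        delta = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:
            return p, it
    return p, max_iters


def on_change(links, p_current, d=0.85):
    """Handler sketch: rebuild M for the changed graph and refine the current ranks."""
    n = len(links)
    Z = [(1.0 - d) / n] * n
    return converge(build_matrix(links, d), Z, p_current)


n = 4
uniform = [1.0 / n] * n
links = [[1, 2], [2], [0], [2]]            # hypothetical graph; page 3 has no in-links
p_before, _ = on_change(links, uniform)

links[2] = [0, 3]                          # a new link 2 -> 3 arrives
p_warm, it_warm = on_change(links, p_before)   # continue from the old ranks
p_cold, it_cold = on_change(links, uniform)    # recompute from scratch
print(it_warm, it_cold)
```

Both starting points reach the same ranks. Whether the warm start saves iterations depends on how much the change moves the ranks, so the sketch prints the two counts rather than promising a speedup.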

Page 37: Using R for Iterative and Incremental Processing

Dynamic Partitioning of Matrices

Profile execution

Partition

Programmers specify size invariants.

Up to 2x performance improvement
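
One way to picture the profile-then-partition cycle is the Python sketch below. The slides do not give the actual policy, so the details are assumptions: the number of non-zeros per block stands in for profiled execution time, and any block whose load exceeds a chosen multiple of the median load is split in half. Size invariants are not modelled.

```python
from statistics import median


def block_loads(row_nnz, blocks):
    """Load of each block = non-zeros in its rows (a proxy for profiled run time)."""
    return [sum(row_nnz[lo:hi]) for lo, hi in blocks]


def repartition(row_nnz, blocks, factor=2.0):
    """Split in half every multi-row block whose load exceeds factor x median load."""
    loads = block_loads(row_nnz, blocks)
    limit = factor * median(loads)
    new_blocks = []
    for (lo, hi), load in zip(blocks, loads):
        if load > limit and hi - lo > 1:
            mid = (lo + hi) // 2
            new_blocks += [(lo, mid), (mid, hi)]
        else:
            new_blocks.append((lo, hi))
    return new_blocks


# Hypothetical profile: rows 4 and 5 are far denser than the rest.
row_nnz = [1, 1, 1, 1, 50, 50, 1, 1]
blocks = [(0, 2), (2, 4), (4, 6), (6, 8)]   # half-open row ranges
new_blocks = repartition(row_nnz, blocks)
print(new_blocks, block_loads(row_nnz, new_blocks))
```

In this toy profile the heaviest block drops from 100 non-zeros to 50 after one round of splitting.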

Page 45: Using R for Iterative and Incremental Processing

Incremental Updates Using Consistent Snapshots

Web graph: pages P, Q, R, S (part of a larger graph)

Snapshot 1: onchange(M1) runs on the first adjacency matrix, then update P1

    Adjacency matrix (excerpt)    PageRank (excerpt)
    0 0 0 0                       0.035
    1 0 0 0                       0.006
    1 0 0 0                       0.008
    0 0 1 1                       0.032

Snapshot 2: the last row of the matrix changes; onchange(M2) runs, then update P2

    Adjacency matrix (excerpt)    PageRank (excerpt)
    0 0 0 0                       0.035
    1 0 0 0                       0.028
    1 0 0 0                       0.008
    0 1 1 1                       0.032

Page 51: Using R for Iterative and Incremental Processing

Versioned Distributed Arrays

Mechanics of versioning

– update: Increment version number

– onchange: Bind a version number for the array before executing the handler
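
The two rules can be mimicked in a few lines of Python. This is a single-process sketch under stated assumptions, not Presto's implementation: the class name VersionedArray and its snapshot method are hypothetical, every version is kept in memory, and handlers run synchronously inside update.

```python
class VersionedArray:
    """Keeps every published version so a handler reads one consistent snapshot."""

    def __init__(self, data):
        self._versions = [list(data)]    # version 0
        self._handlers = []

    @property
    def version(self):
        return len(self._versions) - 1

    def snapshot(self, version):
        """Return a private copy of the contents at the given version."""
        return list(self._versions[version])

    def onchange(self, handler):
        """Register handler(data, version) to run after each update."""
        self._handlers.append(handler)

    def update(self, data):
        """Publish new contents: increment the version, then run the handlers."""
        self._versions.append(list(data))
        bound = self.version             # version bound before the handlers run
        for handler in self._handlers:
            handler(self.snapshot(bound), bound)
        return bound


seen = []
M = VersionedArray([0, 0, 1, 1])
M.onchange(lambda data, version: seen.append((version, data)))
M.update([0, 1, 1, 1])
M.update([1, 1, 1, 1])
print(seen)
```

Each handler call receives the contents of exactly one version, so a later update never alters data that an earlier handler already captured.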

Page 53: Using R for Iterative and Incremental Processing

Applications Implemented in Presto

    Application              Algorithm                 Presto LOC
    PageRank                 Eigenvector calculation   41
    Triangle counting        Top-K eigenvalues         121
    Netflix recommendation   Matrix factorization      130
    Centrality measure       Graph algorithm           132
    k-path connectivity      Graph algorithm           30
    k-means                  Clustering                71
    Sequence alignment       Smith-Waterman            64

Each application takes fewer than 140 lines of code.

Page 55: Using R for Iterative and Incremental Processing

Presto is Fast!

PageRank per-iteration execution time

[Figure: time in seconds (log scale, 1 to 1000) against number of workers (8, 16, 32, 64) for Presto and Hadoop-InMem]

Data: 100M nodes, 1.2B edges. Setup: 10G network. 12 cores, 96GB RAM.

More than 20x faster than Hadoop (w/ in-memory storage)

Page 57: Using R for Iterative and Incremental Processing

More in the Paper

• Memory management, caching of partitions

• Scheduling operations

• Storage driver interface to HBase

• Fault tolerance

Page 58: Using R for Iterative and Incremental Processing

Conclusion

Linear Algebra is a powerful abstraction

Easily express machine learning, graph algorithms

Challenges: Sparse matrices, Incremental data

Presto – prototype extends R

Open source version soon!
