
GraphAlgorithms

Date post: 24-Jan-2016
Category:
Upload: ritesh-verma
Description:
Graph Algorithms Pseudocodes
© 2013 A. Haeberlen, Z. Ives NETS 212: Scalable and Cloud Computing 1 University of Pennsylvania Graph algorithms in MapReduce October 15, 2013
Transcript
Page 1: GraphAlgorithms

University of Pennsylvania

© 2013 A. Haeberlen, Z. Ives

NETS 212: Scalable and Cloud Computing

Graph algorithms in MapReduce

October 15, 2013

Page 2: GraphAlgorithms


Announcements

No class on October 22nd or 24th
Andreas at IMC in Barcelona
Please work on HW2 and HW3 (will be released this week)

Special 'catch-up' class on October 30th
4:30-6:00pm, Location TBA

Any questions about HW2?
If you haven't started yet: Please start early!!!

Page 3: GraphAlgorithms


What we have seen so far

In the first half of the semester, we saw how the map/reduce model could be used to filter, collect, and aggregate data values

This is useful for data with limited structure

We could extract pieces of input data items and collect them to run various reduce operations

We could “join” two different data sets on a common key

But that’s not enough…

Page 4: GraphAlgorithms


Beyond average/sum/count

Much of the world is a network of relationships and shared features

Members of a social network can be friends, and may have shared interests / memberships / etc.

Customers might view similar movies, and might even be clustered by interest groups

The Web consists of documents with links
Documents are also related by topics, words, authors, etc.

Page 5: GraphAlgorithms


Goal: Develop a toolbox

We need a toolbox of algorithms useful for analyzing data that has both relationships and properties

For the next ~2 lectures we’ll start to build this toolbox

Some of the problems are studied in courses you may not have taken yet:

CIS 320 (algorithms), CIS 391/520 (AI), CIS 455 (Web Systems)
So we’ll see both the traditional solution and the MapReduce one

Page 6: GraphAlgorithms


Plan for today

Representing data in graphs (NEXT)
Graph algorithms in MapReduce
  Computation model
  Iterative MapReduce
A toolbox of algorithms
  Single-source shortest path (SSSP)
  k-means clustering
  Classification with Naïve Bayes

Page 7: GraphAlgorithms


Thinking about related objects

We can represent related objects as a labeled, directed graph

Entities are typically represented as nodes; relationships are typically edges

Nodes all have IDs, and possibly other properties
Edges typically have values, possibly IDs and other properties

[Figure: example graph with nodes Alice, Sunita, Jose, Mikhail, Magna Carta, and Facebook, connected by fan-of and friend-of edges. Images by Jojo Mendoza, Creative Commons licensed.]

Page 8: GraphAlgorithms


Encoding the data in a graph

Recall basic definition of a graph:
G = (V, E) where V is vertices, E is edges of the form (v1,v2) where v1,v2 ∈ V

Assume we only care about connected vertices
Then we can capture a graph simply as the edges
... or as an adjacency list: vi goes to [vj, vj+1, …]

[Figure: the example graph with nodes Alice, Sunita, Jose, Mikhail, Magna Carta, and Facebook]

Page 9: GraphAlgorithms


Graph encodings: Set of edges

(Alice, Facebook)
(Alice, Sunita)
(Jose, Magna Carta)
(Jose, Sunita)
(Mikhail, Facebook)
(Mikhail, Magna Carta)
(Sunita, Facebook)
(Sunita, Alice)
(Sunita, Jose)

[Figure: the example graph with nodes Alice, Sunita, Jose, Mikhail, Magna Carta, and Facebook]
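The edge-set encoding can be converted into the adjacency-list encoding with a single grouping pass. The following is a minimal Python sketch; the function name `to_adjacency_list` is an illustrative assumption and does not come from the lecture.

```python
from collections import defaultdict

# Edge set from the slide: (source, target) pairs.
EDGES = [
    ("Alice", "Facebook"), ("Alice", "Sunita"),
    ("Jose", "Magna Carta"), ("Jose", "Sunita"),
    ("Mikhail", "Facebook"), ("Mikhail", "Magna Carta"),
    ("Sunita", "Facebook"), ("Sunita", "Alice"), ("Sunita", "Jose"),
]


def to_adjacency_list(edges):
    """Group directed edges by source vertex: vi -> [vj, ...]."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    return dict(adjacency)


if __name__ == "__main__":
    for vertex, neighbors in to_adjacency_list(EDGES).items():
        print(vertex, "->", neighbors)
```

Vertices that only appear as edge targets (Facebook, Magna Carta) get no entry of their own here, which matches the assumption that only connected vertices matter.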

Page 10: GraphAlgorithms


Graph encodings: Adding edge types

(Alice, fan-of, Facebook)
(Alice, friend-of, Sunita)
(Jose, fan-of, Magna Carta)
(Jose, friend-of, Sunita)
(Mikhail, fan-of, Facebook)
(Mikhail, fan-of, Magna Carta)
(Sunita, fan-of, Facebook)
(Sunita, friend-of, Alice)
(Sunita, friend-of, Jose)

[Figure: the example graph with each edge labeled fan-of or friend-of]

Page 11: GraphAlgorithms


Graph encodings: Adding weights

(Alice, fan-of, 0.5, Facebook)
(Alice, friend-of, 0.9, Sunita)
(Jose, fan-of, 0.5, Magna Carta)
(Jose, friend-of, 0.3, Sunita)
(Mikhail, fan-of, 0.8, Facebook)
(Mikhail, fan-of, 0.7, Magna Carta)
(Sunita, fan-of, 0.7, Facebook)
(Sunita, friend-of, 0.9, Alice)
(Sunita, friend-of, 0.3, Jose)

[Figure: the example graph with each edge labeled fan-of or friend-of and annotated with its weight]
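One possible in-memory form of the typed, weighted edges is sketched below in Python. The `Edge` type and the helper names are assumptions made for illustration. The adjacency list it builds has the shape `node -> [<otherNode, relType, strength>]`.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    source: str
    rel_type: str
    weight: float
    target: str


# Weighted, typed edge set from the slide.
EDGES = [
    Edge("Alice", "fan-of", 0.5, "Facebook"),
    Edge("Alice", "friend-of", 0.9, "Sunita"),
    Edge("Jose", "fan-of", 0.5, "Magna Carta"),
    Edge("Jose", "friend-of", 0.3, "Sunita"),
    Edge("Mikhail", "fan-of", 0.8, "Facebook"),
    Edge("Mikhail", "fan-of", 0.7, "Magna Carta"),
    Edge("Sunita", "fan-of", 0.7, "Facebook"),
    Edge("Sunita", "friend-of", 0.9, "Alice"),
    Edge("Sunita", "friend-of", 0.3, "Jose"),
]


def annotated_adjacency(edges):
    """Build node -> [(otherNode, relType, strength), ...]."""
    adjacency = defaultdict(list)
    for e in edges:
        adjacency[e.source].append((e.target, e.rel_type, e.weight))
    return dict(adjacency)


def edges_of_type(edges, rel_type):
    """Keep only the edges with the given label, e.g. 'friend-of'."""
    return [e for e in edges if e.rel_type == rel_type]
```

Filtering by label, as in `edges_of_type(EDGES, "friend-of")`, is the kind of step needed when a computation only cares about one relationship type.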

Page 12: GraphAlgorithms


Recap: Related objects

We can represent the relationships between related objects as a directed, labeled graph

Vertices represent the objects
Edges represent relationships

We can annotate this graph in various ways

Add labels to edges to distinguish different types
Add weights to edges
...

We can encode the graph in various ways

Examples: Edge set, adjacency list

Page 13: GraphAlgorithms


Plan for today

Representing data in graphs
Graph algorithms in MapReduce (NEXT)
  Computation model
  Iterative MapReduce
A toolbox of algorithms
  Single-source shortest path (SSSP)
  k-means clustering
  Classification with Naïve Bayes

Page 14: GraphAlgorithms


A computation model for graphs

Once the data is encoded in this way, we can perform various computations on it

Simple example: Which users are their friends' best friend?

More complicated examples (later): Page rank, adsorption, ...

This is often done by annotating the vertices with additional information, and propagating the information along the edges
"Think like a vertex"!

[Figure: the example graph with typed, weighted edges]

Page 15: GraphAlgorithms


A computation model for graphs

Example: Am I my friends' best friend?
Slightly more technical: How many of my friends have me as their best friend?

Step #1: Discard irrelevant vertices and edges

[Figure: the example graph with typed, weighted edges]

Page 16: GraphAlgorithms


A computation model for graphs

Example: Am I my friends' best friend?
Step #1: Discard irrelevant vertices and edges
Step #2: Annotate each vertex with list of friends
Step #3: Push annotations along each edge

[Figure: vertices Alice, Sunita, Jose, and Mikhail with the friend-of edges Alice/Sunita (0.9) and Sunita/Jose (0.3). Vertex annotations:
Alice: alice→sunita: 0.9
Sunita: sunita→alice: 0.9, sunita→jose: 0.3
Jose: jose→sunita: 0.3]

Page 17: GraphAlgorithms


A computation model for graphs

Example: Am I my friends' best friend?
Step #1: Discard irrelevant vertices and edges
Step #2: Annotate each vertex with list of friends
Step #3: Push annotations along each edge

[Figure: the friend-of subgraph after the annotations have been pushed along each edge. Vertex annotations:
Alice: alice→sunita: 0.9, sunita→alice: 0.9, sunita→jose: 0.3
Sunita: sunita→alice: 0.9, sunita→jose: 0.3, alice→sunita: 0.9, jose→sunita: 0.3
Jose: jose→sunita: 0.3, sunita→alice: 0.9, sunita→jose: 0.3]

Page 18: GraphAlgorithms


A computation model for graphs

Example: Am I my friends' best friend?
Step #1: Discard irrelevant vertices and edges
Step #2: Annotate each vertex with list of friends
Step #3: Push annotations along each edge
Step #4: Determine result at each vertex

[Figure: the friend-of subgraph after the annotations have been pushed along each edge. Vertex annotations:
Alice: alice→sunita: 0.9, sunita→alice: 0.9, sunita→jose: 0.3
Sunita: sunita→alice: 0.9, sunita→jose: 0.3, alice→sunita: 0.9, jose→sunita: 0.3
Jose: jose→sunita: 0.3, sunita→alice: 0.9, sunita→jose: 0.3]

Page 19: GraphAlgorithms


Can we do this in MapReduce?

Using adjacency list representation?

map(key: node, value: [<otherNode, relType, strength>]) {

}

reduce(key: ________, values: list of _________) {

}
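One way the blanks could be filled in for the adjacency-list representation is sketched below in Python, with MapReduce simulated in memory. This is only one possible answer and not necessarily the intended one: the tags `"friend"` and `"best-friend-of"` and the helper `run_mapreduce` are assumptions made for illustration. The mapper keeps only friend-of edges, tells each vertex who its friends are, and notifies each vertex's best friend. The reducer counts the friends who named this vertex as their best friend.

```python
from collections import defaultdict

# Adjacency-list input: node -> [(otherNode, relType, strength)].
GRAPH = {
    "Alice": [("Facebook", "fan-of", 0.5), ("Sunita", "friend-of", 0.9)],
    "Jose": [("Magna Carta", "fan-of", 0.5), ("Sunita", "friend-of", 0.3)],
    "Mikhail": [("Facebook", "fan-of", 0.8), ("Magna Carta", "fan-of", 0.7)],
    "Sunita": [("Facebook", "fan-of", 0.7), ("Alice", "friend-of", 0.9),
               ("Jose", "friend-of", 0.3)],
}


def mapper(node, edges):
    # Step 1: discard irrelevant edges (everything except friend-of).
    friends = [(other, s) for other, rel, s in edges if rel == "friend-of"]
    # Step 2: remember this node's own friend list.
    for other, _ in friends:
        yield node, ("friend", other)
    # Step 3: push "you are my best friend" along the strongest edge.
    if friends:
        best, _ = max(friends, key=lambda f: f[1])
        yield best, ("best-friend-of", node)


def reducer(node, values):
    # Step 4: count friends of `node` who have `node` as their best friend.
    friends = {who for tag, who in values if tag == "friend"}
    admirers = {who for tag, who in values if tag == "best-friend-of"}
    yield node, len(friends & admirers)


def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


if __name__ == "__main__":
    print(run_mapreduce(GRAPH.items(), mapper, reducer))
```

Mikhail has no friend-of edges, so the mapper emits nothing for him and he does not appear in the output.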

Page 20: GraphAlgorithms


Can we do this in MapReduce?

Using single-edge data representation?

map(key: node, value: <otherNode, relType, strength>) {

}

reduce(key: ________, values: list of _________) {

}

Page 21: GraphAlgorithms


A real-world use case

A variant that is actually used in social networks today:
"Who are the friends of multiple of my friends?"

Where have you seen this before?

Friend recommendation!
Maybe these people should be my friends too!
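The friend-recommendation variant can be sketched as a single simulated MapReduce job in Python. The small graph below is hypothetical (it is not from the lecture), friendships are assumed to be symmetric, and the threshold `min_common=2` is one reading of "multiple".

```python
from collections import Counter, defaultdict

# Hypothetical symmetric friendship graph: node -> list of friends.
FRIENDS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def mapper(node, friends):
    for friend in friends:
        # Let `node` know its own friends so they can be excluded later.
        yield node, ("friend", friend)
        # Every other friend of `node` is a friend-of-a-friend for `friend`.
        for other in friends:
            if other != friend:
                yield friend, ("candidate", other)


def reducer(node, values, min_common=2):
    friends = {who for tag, who in values if tag == "friend"}
    counts = Counter(who for tag, who in values if tag == "candidate")
    # Recommend people reachable through at least `min_common` friends.
    return sorted(who for who, n in counts.items()
                  if n >= min_common and who not in friends and who != node)


def recommend(graph):
    groups = defaultdict(list)
    for node, friends in graph.items():
        for key, value in mapper(node, friends):
            groups[key].append(value)
    return {node: reducer(node, values) for node, values in groups.items()}


if __name__ == "__main__":
    print(recommend(FRIENDS))
```

In this graph D is a friend of both B and C, who are A's friends, so D is suggested to A. E gets no suggestion because each candidate is reachable through only one friend.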

Page 22: GraphAlgorithms


Generalizing…

Now suppose we want to go beyond direct friend relationships

Example: How many of my friends' friends (distance-2 neighbors) have me as their best friend's best friend?

What do we need to do?

How about distance k>2?

To compute the answer, we need to run multiple iterations of MapReduce!


Page 23: GraphAlgorithms


Iterative MapReduce

The basic model:

copy files from input dir → staging dir 1
(optional: do some preprocessing)

while (!terminating condition) {
  map from staging dir 1
  reduce into staging dir 2
  move files from staging dir 2 → staging dir 1
}

(optional: postprocessing)
move files from staging dir 2 → output dir

Note that reduce output must be compatible with the map input!

What can happen if we filter out some information in the mapper or in the reducer?
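The loop structure can be sketched in Python with in-memory dictionaries standing in for the staging directories. The example job is an assumption chosen for illustration: it propagates the smallest vertex label along undirected edges until nothing changes. The mapper re-emits each vertex's own state so that the reduce output has the same shape as the map input. If that record were filtered out, the adjacency lists would be lost after the first round.

```python
from collections import defaultdict


def run_round(records, mapper, reducer):
    """One map + reduce pass over an in-memory 'staging directory'."""
    groups = defaultdict(list)
    for key, value in records.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {key: reducer(key, values) for key, values in groups.items()}


def iterate(records, mapper, reducer, max_rounds=50):
    staging = dict(records)
    for rounds in range(1, max_rounds + 1):
        output = run_round(staging, mapper, reducer)
        if output == staging:          # terminating condition: no change
            return output, rounds
        staging = output               # reduce output feeds the next map
    raise RuntimeError("no convergence within max_rounds")


def mapper(node, value):
    label, neighbors = value
    yield node, ("state", (label, neighbors))   # keep the graph structure
    for neighbor in neighbors:
        yield neighbor, ("label", label)        # push label along the edge


def reducer(node, values):
    neighbors, labels = (), []
    for tag, payload in values:
        if tag == "state":
            label, neighbors = payload
            labels.append(label)
        else:
            labels.append(payload)
    return (min(labels), neighbors)             # same shape as map input


# Hypothetical undirected graph: 1-2-3 and 4-5; value = (label, neighbors).
GRAPH = {1: (1, (2,)), 2: (2, (1, 3)), 3: (3, (2,)), 4: (4, (5,)), 5: (5, (4,))}

if __name__ == "__main__":
    final, rounds = iterate(GRAPH, mapper, reducer)
    print({node: label for node, (label, _) in final.items()}, rounds)
```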

Page 24: GraphAlgorithms


Graph algorithms and MapReduce

A centralized algorithm typically traverses a tree or a graph one item at a time (there’s only one “cursor”)

You’ve learned breadth-first and depth-first traversals

Most algorithms that are based on graphs make use of multiple map/reduce stages processing one “wave” at a time

Sometimes iterative MapReduce, other times chains of map/reduce

Page 25: GraphAlgorithms


Recap: MapReduce on graphs

Suppose we want to:
compute a function for each vertex in a graph...
... using data from vertices at most k hops away

We can do this as follows:
"Push" information along the edges
"Think like a vertex"
Finally, perform the computation at each vertex

May need more than one MapReduce phase

Iterative MapReduce: Outputs of stage i → inputs of stage i+1

Page 26: GraphAlgorithms


Plan for today

Representing data in graphs
Graph algorithms in MapReduce
  Computation model
  Iterative MapReduce
A toolbox of algorithms (NEXT)
  Single-source shortest path (SSSP)
  k-means clustering
  Classification with Naïve Bayes

Page 27: GraphAlgorithms


Path-based algorithms

Sometimes our goal is to compute information about the paths (sets of paths) between nodes

Edges may be annotated with cost, distance, or similarity

Examples of such problems (see CIS 121+320):

Shortest path from one node to another
Minimum spanning tree (minimal-cost tree connecting all vertices in a graph)
Steiner tree (minimal-cost tree connecting certain nodes)
Topological sort (node in a DAG comes before all nodes it points to)

Page 28: GraphAlgorithms


Single-Source Shortest Path (SSSP)

Given a directed graph G = (V, E) in which each edge e has a cost c(e):
Compute the cost of reaching each node from the source node s in the most efficient way (potentially after multiple 'hops')

[Figure: directed example graph with source s (distance 0) and nodes a, b, c, d (distances still unknown); edge costs 10, 5, 2, 3, 1, 9, 7, 2, 4, 6]

Page 29: GraphAlgorithms


SSSP: Intuition

We can formulate the problem using induction

The shortest path follows the principle of optimality: the last step (u,v) makes use of the shortest path to u

We can express this as follows:

bestDistanceAndPath(v) {
  if (v == source) then {
    return <distance 0, path [v]>
  } else {
    find argmin_u (bestDistanceAndPath[u] + dist[u,v])
    return <bestDistanceAndPath[u] + dist[u,v], path[u] + v>
  }
}
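The same inductive definition can be written as a recurrence. In this sketch, $d(v)$ denotes the cost of the cheapest path from the source $s$ to $v$ and $c(u,v)$ the cost of edge $(u,v)$; the symbol $d$ is introduced here only for illustration.

```latex
d(v) =
\begin{cases}
0 & \text{if } v = s,\\[4pt]
\min\limits_{(u,v) \in E} \bigl( d(u) + c(u,v) \bigr) & \text{otherwise.}
\end{cases}
```

The minimizing $u$ is the last hop before $v$ on a shortest path, which is what `path[u] + v` records in the pseudocode.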

Page 30: GraphAlgorithms


SSSP: CIS 320-style solution

Traditional approach: Dijkstra's algorithm

V: vertices, E: edges, S: start node

// Initialize length and last step of path to default values
foreach v in V
  dist_S_To[v] := infinity
  predecessor[v] := nil
dist_S_To[S] := 0
spSet := {}
Q := V
while (Q not empty) do
  u := Q.removeNodeClosestTo(S)
  spSet := spSet + {u}
  // Update length and path based on edges radiating from u
  foreach v in V where (u,v) in E
    if (dist_S_To[v] > dist_S_To[u] + cost(u,v)) then
      dist_S_To[v] := dist_S_To[u] + cost(u,v)
      predecessor[v] := u
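A runnable version of this pseudocode is sketched below in Python, with a binary heap from the standard `heapq` module standing in for `Q.removeNodeClosestTo(S)`. The edge list is an assumption: it is reconstructed from the worked example in this lecture (the textbook graph with nodes s, a, b, c, d) and should be checked against the original figure.

```python
import heapq
import math

# Assumed edge list for the running example: node -> {successor: cost}.
GRAPH = {
    "s": {"a": 10, "c": 5},
    "a": {"b": 1, "c": 2},
    "b": {"d": 4},
    "c": {"a": 3, "b": 9, "d": 2},
    "d": {"b": 6, "s": 7},
}


def dijkstra(graph, source):
    # Initialize length and last step of path to default values.
    dist = {v: math.inf for v in graph}
    predecessor = {v: None for v in graph}
    dist[source] = 0
    queue = [(0, source)]
    done = set()                      # plays the role of spSet
    while queue:
        d_u, u = heapq.heappop(queue)
        if u in done:
            continue                  # stale entry left over from an update
        done.add(u)
        # Update length and path based on edges radiating from u.
        for v, cost in graph[u].items():
            if dist[v] > d_u + cost:
                dist[v] = d_u + cost
                predecessor[v] = u
                heapq.heappush(queue, (dist[v], v))
    return dist, predecessor


if __name__ == "__main__":
    print(dijkstra(GRAPH, "s"))
```

Pushing a fresh heap entry on every improvement and skipping stale entries is a common substitute for a decrease-key operation, which `heapq` does not provide.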

Page 31: GraphAlgorithms


SSSP: Dijkstra in Action

[Figure: example graph with source s (distance 0) and nodes a, b, c, d; edge costs 10, 5, 2, 3, 1, 9, 7, 2, 4, 6. Example from CLR 2nd ed., p. 528]

Q = {s,a,b,c,d}
spSet = {}
dist_S_To: {(a,∞), (b,∞), (c,∞), (d,∞)}
predecessor: {(a,nil), (b,nil), (c,nil), (d,nil)}

Page 32: GraphAlgorithms


SSSP: Dijkstra in Action

[Figure: the example graph annotated with the current distance estimates]

Q = {a,b,c,d}
spSet = {s}
dist_S_To: {(a,10), (b,∞), (c,5), (d,∞)}
predecessor: {(a,s), (b,nil), (c,s), (d,nil)}

Page 33: GraphAlgorithms


SSSP: Dijkstra in Action

[Figure: the example graph annotated with the current distance estimates]

Q = {a,b,d}
spSet = {c,s}
dist_S_To: {(a,8), (b,14), (c,5), (d,7)}
predecessor: {(a,c), (b,c), (c,s), (d,c)}

Page 34: GraphAlgorithms


SSSP: Dijkstra in Action

[Figure: the example graph annotated with the current distance estimates]

Q = {a,b}
spSet = {c,d,s}
dist_S_To: {(a,8), (b,13), (c,5), (d,7)}
predecessor: {(a,c), (b,d), (c,s), (d,c)}

Page 35: GraphAlgorithms


SSSP: Dijkstra in Action

[Figure: the example graph annotated with the current distance estimates]

Q = {b}
spSet = {a,c,d,s}
dist_S_To: {(a,8), (b,9), (c,5), (d,7)}
predecessor: {(a,c), (b,a), (c,s), (d,c)}

Page 36: GraphAlgorithms


SSSP: Dijkstra in Action

[Figure: the example graph annotated with the final distances]

Q = {}
spSet = {a,b,c,d,s}
dist_S_To: {(a,8), (b,9), (c,5), (d,7)}
predecessor: {(a,c), (b,a), (c,s), (d,c)}

Page 37: GraphAlgorithms


SSSP: How to parallelize?

Dijkstra traverses the graph along a single route at a time, prioritizing its traversal to the next step based on total path length (and avoiding cycles)

No real parallelism to be had here!

Intuitively, we want something that “radiates” from the origin, one “edge hop distance” at a time

Each step outwards can be done in parallel, before another iteration occurs - or we are done

Recall our earlier discussion: Scalability depends on the algorithm, not (just) on the problem!

[Figure: a wave radiating outward from the source s (distance 0) toward nodes whose distances are still unknown]

Page 38: GraphAlgorithms


SSSP: Revisiting the inductive definition

Dijkstra’s algorithm carefully considered each u in a way that allowed us to prune certain points

Instead we can look at all potential u’s for each v

Compute iteratively, by keeping a “frontier set” of u nodes i edge-hops from the source

bestDistanceAndPath(v) {
  if (v == source) then {
    return <distance 0, path [v]>
  } else {
    find argmin_u (bestDistanceAndPath[u] + dist[u,v])
    return <bestDistanceAndPath[u] + dist[u,v], path[u] + v>
  }
}

Page 39: GraphAlgorithms


SSSP: MapReduce formulation

init:
For each node, node ID → <∞, -, {<succ-node-ID,edge-cost>}>

map:
take node ID → <dist, next, {<succ-node-ID,edge-cost>}>
For each succ-node-ID:
  emit succ-node ID → {<node ID, distance+edge-cost>}
emit node ID → distance,{<succ-node-ID,edge-cost>}

reduce:
distance := min cost from a predecessor; next := that predec.
emit node ID → <distance, next, {<succ-node-ID,edge-cost>}>

Repeat until no changes
Postprocessing: Remove adjacency lists

Notes:
Map input: the shortest path we have found so far from the source to nodeID has length dist, next is the next hop on that path, and the set is the adjacency list for nodeID
First emit: this is a new path from the source to succ-node-ID that we just discovered (not necessarily shortest)
Second emit (the node's own distance and adjacency list): Why is this necessary?
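The formulation above can be simulated in memory with a short Python sketch. Several things here are assumptions made for illustration: the edge list is reconstructed from the Dijkstra example, the source starts at distance 0 instead of ∞, the tags `"state"` and `"path"` distinguish the two kinds of emitted values, and the driver loop replaces a real job scheduler.

```python
import math
from collections import defaultdict

# Assumed edge list for the running example: node -> {successor: cost}.
GRAPH = {
    "s": {"a": 10, "c": 5},
    "a": {"b": 1, "c": 2},
    "b": {"d": 4},
    "c": {"a": 3, "b": 9, "d": 2},
    "d": {"b": 6, "s": 7},
}


def init(graph, source):
    # node ID -> <dist, next, adjacency list>
    return {n: (0 if n == source else math.inf, None, succ)
            for n, succ in graph.items()}


def mapper(node, value):
    dist, _next, successors = value
    # Re-emit the node's own record so the adjacency list survives the round.
    yield node, ("state", value)
    if dist < math.inf:
        for succ, cost in successors.items():
            # A newly discovered (not necessarily shortest) path to succ.
            yield succ, ("path", (dist + cost, node))


def reducer(node, values):
    best, via, successors = math.inf, None, {}
    for tag, payload in values:
        if tag == "state":
            dist, nxt, successors = payload
            candidate = (dist, nxt)
        else:
            candidate = payload
        if candidate[0] < best:
            best, via = candidate
    return best, via, successors


def sssp(graph, source, max_rounds=100):
    state = init(graph, source)
    for rounds in range(1, max_rounds + 1):
        groups = defaultdict(list)
        for node, value in state.items():
            for key, out in mapper(node, value):
                groups[key].append(out)
        new_state = {n: reducer(n, vals) for n, vals in groups.items()}
        if new_state == state:        # repeat until no changes
            # Postprocessing: remove adjacency lists.
            return {n: (d, via) for n, (d, via, _) in new_state.items()}, rounds
        state = new_state
    raise RuntimeError("no convergence within max_rounds")


if __name__ == "__main__":
    print(sssp(GRAPH, "s"))
```

On this graph the sketch reaches the same distances as Dijkstra's algorithm (a = 8, b = 9, c = 5, d = 7) and needs one extra round to detect that nothing changed.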

Page 40: GraphAlgorithms


Iteration 0: Base case

mapper: (a,<s,10>) (c,<s,5>) edges
reducer: (a,<10, ...>) (c,<5, ...>)

[Figure: the example graph showing the current "wave"; s has distance 0]

Page 41: GraphAlgorithms


Iteration 1

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,12>) (b,<a,11>) (b,<c,14>) (d,<c,7>) edges
reducer: (a,<8, ...>) (c,<5, ...>) (b,<11, ...>) (d,<7, ...>)

[Figure: the example graph showing the current "wave"; distances so far: s = 0, a = 10, c = 5]

Page 42: GraphAlgorithms


Iteration 2

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,10>) (b,<a,9>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,15>) edges
reducer: (a,<8>) (c,<5>) (b,<9>) (d,<7>)

[Figure: the example graph showing the current "wave"; distances so far: s = 0, a = 8, c = 5, b = 11, d = 7]

Page 43: GraphAlgorithms

Iteration 3

mapper: (a,<s,10>) (c,<s,5>) (a,<c,8>) (c,<a,10>) (b,<a,9>) (b,<c,14>) (d,<c,7>) (b,<d,13>) (d,<b,13>) edges
reducer: (a,<8>) (c,<5>) (b,<9>) (d,<7>)

No change! Convergence!

Question: If a vertex's path cost is the same in two consecutive rounds, can we be sure that this vertex has converged?

[Figure: the example graph with distances s = 0, a = 8, c = 5, b = 9, d = 7]

Page 44: GraphAlgorithms


Summary: SSSP

Path-based algorithms typically involve iterative map/reduce

They are typically formulated in a way that traverses in “waves” or “stages”, like breadth-first search

This allows for parallelism
They need a way to test for convergence

Example: Single-source shortest path (SSSP)

Original Dijkstra formulation is hard to parallelize
But we can make it work with the "wave" approach

Page 45: GraphAlgorithms


Plan for today

Representing data in graphs
Graph algorithms in MapReduce
  Computation model
  Iterative MapReduce
A toolbox of algorithms
  Single-source shortest path (SSSP)
  k-means clustering (NEXT)
  Classification with Naïve Bayes

Page 46: GraphAlgorithms


Learning (clustering / classification)

Sometimes our goal is to take a set of entities, possibly related, and group them

If the groups are based on similarity, we call this clustering

If the groups are based on putting them into a semantically meaningful class, we call this classification

Both are instances of machine learning


Page 47: GraphAlgorithms


The k-clustering Problem

Given: A set of items in an n-dimensional feature space

Example: data points from survey, people in a social network

Goal: Group the items into k “clusters”
What would be a 'good' set of clusters?

[Figure: scatter plot of items by Age and Expenses, with the items grouped into clusters]

Page 48: GraphAlgorithms


Approach: k-Means

Let m1, m2, …, mk be representative points for each of our k clusters

Specifically: the centroid of the cluster

Initialize m1, m2, …, mk to random values in the data

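For t = 1, 2, …:

Map each observation to the closest mean

$$S_i^{(t)} = \left\{ x_j : \lVert x_j - m_i^{(t)} \rVert \le \lVert x_j - m_{i^*}^{(t)} \rVert,\ i^* = 1, \ldots, k \right\}$$

Assign the mi to be a new centroid for each set

$$m_i^{(t+1)} = \frac{1}{\lvert S_i^{(t)} \rvert} \sum_{x_j \in S_i^{(t)}} x_j$$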
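The two alternating steps can be sketched in plain Python. The data points are the six (age, expenses) pairs used in the lecture's small example. The initial centers `(13, 9)` and `(15, 19)` are an assumption, since the randomly chosen centers on the slide are not recoverable from the transcript.

```python
import math

# The six data points from the lecture's small example.
POINTS = [(10, 10), (15, 12), (11, 16), (18, 20), (30, 21), (20, 21)]


def closest(point, means):
    """Index of the mean nearest to `point` (Euclidean distance)."""
    return min(range(len(means)), key=lambda i: math.dist(point, means[i]))


def centroid(points):
    """Component-wise average of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def k_means(points, means, max_rounds=100):
    clusters = []
    for _ in range(max_rounds):
        # Step 1: map each observation to the closest mean.
        clusters = [[] for _ in means]
        for p in points:
            clusters[closest(p, means)].append(p)
        # Step 2: each mean becomes the centroid of its set
        # (an empty cluster keeps its previous mean).
        new_means = [centroid(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:        # stable: fixpoint reached
            break
        means = new_means
    return means, clusters


if __name__ == "__main__":
    # Initial centers are an assumption, not taken from the lecture.
    print(k_means(POINTS, [(13, 9), (15, 19)]))
```

With these assumed starting centers the means pass through (12.5, 11) and (19.75, 19.5) and settle at roughly (12, 12.67) and (22.67, 20.67).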

Page 49: GraphAlgorithms


A simple example (1/4)

[Figure: scatter plot of Age vs. Expenses with the data points (10,10), (15,12), (11,16), (18,20), (30,21), (20,21)]

Page 50: GraphAlgorithms


A simple example (2/4)

[Figure: scatter plot of Age vs. Expenses with the data points (10,10), (15,12), (11,16), (18,20), (30,21), (20,21) and two randomly chosen initial centers]

Page 51: GraphAlgorithms


A simple example (3/4)

[Figure: scatter plot of Age vs. Expenses with the data points (10,10), (15,12), (11,16), (18,20), (30,21), (20,21); the cluster centers are now at (19.75,19.5) and (12.5,11)]

Page 52: GraphAlgorithms


A simple example (4/4)

[Figure: scatter plot of Age vs. Expenses with the data points (10,10), (15,12), (11,16), (18,20), (30,21), (20,21); the cluster centers are now at (22.67,20.67) and (12,12.67)]

Stable!

Page 53: GraphAlgorithms


k-Means in MapReduce

Map #1:
Input: node ID → <position, centroid ID, [centroid IDs and positions]>
Compute nearest centroid; emit centroid ID → <node ID, position>

Reduce #1:
Recompute centroid position from positions of nodes in it
Emit centroidID → <node IDs, positions> and for all other centroid IDs, emit otherCentroidID → centroid(centroidID,X,Y)
Each centroid will need to know where all the other centroids are

Map #2:
Pass through values to Reducer #2

Reduce #2:
For each node in the current centroid, emit node ID → <position, centroid ID, [centroid IDs and positions]>
Input for the next map iteration
Also, emit <X, <centroid ID, position>>
This will be the 'result' (remember that we wanted the centroids!)

Repeat until no change
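A simplified sketch of one round in Python follows. It covers only the Map #1 / Reduce #1 logic (assign each node to its nearest centroid, then recompute centroid positions). As a deliberate simplification, the current centroids are passed in as a shared dictionary instead of being carried inside every record, so the second map/reduce pair that redistributes centroid positions is not modeled. All function names and centroid IDs are illustrative assumptions.

```python
import math
from collections import defaultdict


def map_assign(node_id, position, centroids):
    """Map #1: emit centroid ID -> <node ID, position> for the nearest centroid."""
    nearest = min(centroids, key=lambda cid: math.dist(position, centroids[cid]))
    yield nearest, (node_id, position)


def reduce_recompute(centroid_id, members):
    """Reduce #1: recompute the centroid from the positions of its nodes."""
    positions = [pos for _, pos in members]
    new_position = tuple(sum(c) / len(positions) for c in zip(*positions))
    return centroid_id, new_position


def k_means_round(nodes, centroids):
    groups = defaultdict(list)
    for node_id, position in nodes.items():
        for key, value in map_assign(node_id, position, centroids):
            groups[key].append(value)
    updated = dict(centroids)         # centroids without members stay put
    for centroid_id, members in groups.items():
        key, position = reduce_recompute(centroid_id, members)
        updated[key] = position
    return updated


if __name__ == "__main__":
    nodes = dict(enumerate([(10, 10), (15, 12), (11, 16),
                            (18, 20), (30, 21), (20, 21)]))
    centroids = {"c1": (12.5, 11.0), "c2": (19.75, 19.5)}
    while True:                       # repeat until no change
        updated = k_means_round(nodes, centroids)
        if updated == centroids:
            break
        centroids = updated
    print(centroids)
```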

Page 54: GraphAlgorithms


Plan for today

Representing data in graphs
Graph algorithms in MapReduce
  Computation model
  Iterative MapReduce
A toolbox of algorithms
  Single-source shortest path (SSSP)
  k-means clustering
  Classification with Naïve Bayes (NEXT)

Page 55: GraphAlgorithms


Classification

Suppose we want to learn what is spam (or interesting, or …)

Predefine a set of classes with semantic meaning
Train an algorithm to look at data and assign a class
Based on giving it some examples of data in each class
… and the sets of features they have

Many probabilistic techniques exist
Each class has probabilistic relationships with others
e.g., p (spam | isSentLocally), p (isSentLocally | fromBob), …
Typically represented as a graph(ical model)! See CIS 520

But we’ll focus on a simple, “flat” model: Naïve Bayes

Page 56: GraphAlgorithms


A simple example

Suppose we just look at the keywords in the email's title:

Message(1, “Won contract”)
Message(2, “Won award”)
Message(3, "Won the lottery")
Message(4, “Unsubscribe”)
Message(5, "Millions of customers")
Message(6, "Millions of dollars")

What is the probability that the message "Won Millions" is spam?

$$p(\text{spam} \mid \text{containsWon}, \text{containsMillions}) = \frac{p(\text{spam})\, p(\text{containsWon}, \text{containsMillions} \mid \text{spam})}{p(\text{containsWon}, \text{containsMillions})}$$

(Bayes’ Theorem)

Page 57: GraphAlgorithms


Classification using Naïve Bayes

Basic assumption: Probabilities of events are independent
This is why it is called 'naïve'

Under this assumption,

$$\frac{p(\text{spam})\, p(\text{containsWon}, \text{containsMillions} \mid \text{spam})}{p(\text{containsWon}, \text{containsMillions})} = \frac{p(\text{spam})\, p(\text{containsWon} \mid \text{spam})\, p(\text{containsMillions} \mid \text{spam})}{p(\text{containsWon})\, p(\text{containsMillions})} = \frac{0.5 \cdot 0.67 \cdot 0.33}{0.5 \cdot 0.33} = 0.67$$
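The arithmetic above can be reproduced with a few lines of Python. The transcript does not preserve which of the six example messages were marked as spam, so the set `SPAM_IDS` below is an assumption, chosen so that the counts match the probabilities used above (three spam messages, two of them containing "Won" and one containing "Millions").

```python
# The six example message titles.
MESSAGES = {
    1: "Won contract",
    2: "Won award",
    3: "Won the lottery",
    4: "Unsubscribe",
    5: "Millions of customers",
    6: "Millions of dollars",
}
SPAM_IDS = {2, 3, 6}  # assumption: labels are not recoverable from the transcript


def contains(word, text):
    return word.lower() in text.lower().split()


def p_spam_given(words):
    """Naive Bayes estimate of p(spam | all given words are present)."""
    total = len(MESSAGES)
    spam = [MESSAGES[i] for i in SPAM_IDS]
    score = len(spam) / total                                   # p(spam)
    for word in words:
        p_word_given_spam = sum(contains(word, t) for t in spam) / len(spam)
        p_word = sum(contains(word, t) for t in MESSAGES.values()) / total
        score *= p_word_given_spam / p_word
    return score


if __name__ == "__main__":
    print(round(p_spam_given(["Won", "Millions"]), 2))
```

A word that never occurs in the training messages would make `p_word` zero here, so a real implementation would need smoothing, which this sketch leaves out.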

So how do we “train” a learner (compute the above probabilities) using MapReduce?


Page 58: GraphAlgorithms


What do we need to train the learner?

p(spam)
Count how many spam emails there are [Easy]
Count total number of emails [Easy]

p(containsXYZ | spam)
Count how many spam emails contain XYZ [1]
Count how many emails contain XYZ overall [2]

p(containsXYZ)
Count how many emails contain XYZ overall [2]
Count total number of emails [Easy]

Page 59: GraphAlgorithms


Training a Naïve Bayes Learner

map 1: takes messageId → <class, {words}>, emits <word, class> → 1
reduce 1: emits <word, class> → <count>
Count how many emails in the class contain the word (modified WordCount)

map 2: takes messageId → <class, {words}>, emits word → 1
reduce 2: emits word → <totalCount>
Count how many emails contain the word overall (WordCount)
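The two counting passes can be simulated in Python as follows. The class labels in `TRAINING` are an assumption (the transcript does not show which example messages are spam), and `run_pass` is a hypothetical in-memory stand-in for a MapReduce job.

```python
from collections import defaultdict

# messageId -> <class, {words}>; the class labels are assumed for illustration.
TRAINING = {
    1: ("ham", {"won", "contract"}),
    2: ("spam", {"won", "award"}),
    3: ("spam", {"won", "the", "lottery"}),
    4: ("ham", {"unsubscribe"}),
    5: ("ham", {"millions", "of", "customers"}),
    6: ("spam", {"millions", "of", "dollars"}),
}


def map1(message_id, value):
    """Emit <word, class> -> 1 for every distinct word in the message."""
    cls, words = value
    for word in words:
        yield (word, cls), 1


def map2(message_id, value):
    """Emit word -> 1 for every distinct word in the message."""
    _cls, words = value
    for word in words:
        yield word, 1


def reduce_count(key, values):
    return sum(values)


def run_pass(records, mapper, reducer):
    groups = defaultdict(list)
    for key, value in records.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {key: reducer(key, values) for key, values in groups.items()}


if __name__ == "__main__":
    per_class = run_pass(TRAINING, map1, reduce_count)   # pass 1
    overall = run_pass(TRAINING, map2, reduce_count)     # pass 2
    print(per_class[("won", "spam")], overall["won"])
```

Because each message contributes a set of words, every count is a number of emails containing the word, not a number of word occurrences.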

Page 60: GraphAlgorithms


Summary: Learning and MapReduce

Clustering algorithms typically have multiple aggregation stages or iterations

k-means clustering repeatedly computes centroids, maps items to them

Fixpoint computation

Classification algorithms can be quite complex

In general: need to capture conditional probabilities
Naïve Bayes assumes everything is independent
Training is a matter of computing probability distribution
Can be accomplished using two Map/Reduce passes

Page 61: GraphAlgorithms


Stay tuned

Next time you will learn about: PageRank and Adsorption