
Google News and the theory behind it Sections 4.5, 4.6, 4.7 of [KT]

Date post: 03-Jan-2016
Upload: marilynn-allison

Google News

Automatically collects news stories from web sources and classifies them.

Has to decide which stories can be put together.


How Google News Works

Collect news stories
Identify important keywords
Define distance between different stories
Cluster according to this distance

Exact algorithm proprietary
One approach: Hierarchical Clustering Based on Minimum spanning trees
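The "define distance" step can be made concrete with a small sketch. The slides do not say which distance Google News uses; the Jaccard distance on keyword sets, the function name `keyword_distance`, and the sample stories below are all assumptions chosen only for illustration.

```python
def keyword_distance(a: set[str], b: set[str]) -> float:
    """Jaccard distance 1 - |a ∩ b| / |a ∪ b| (an assumed choice, not Google's)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical keyword sets for three stories.
story1 = {"election", "senate", "vote"}
story2 = {"election", "vote", "recount"}
story3 = {"football", "final", "goal"}

print(keyword_distance(story1, story2))  # 0.5
print(keyword_distance(story1, story3))  # 1.0
```

Stories that share many keywords end up close together, which is what the clustering step needs.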


Minimum Spanning Tree

Minimum spanning tree. Given a connected graph G = (V, E) with real-valued edge weights c_e, an MST is a subset of the edges T ⊆ E such that T is a spanning tree whose sum of edge weights is minimized.

Cayley's Theorem. There are n^(n-2) spanning trees of K_n, so the problem can't be solved by brute force.

[Figure: a weighted graph G = (V, E) and a spanning tree T with Σ_{e ∈ T} c_e = 50.]
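To see how quickly brute force becomes hopeless, the following sketch enumerates every set of n-1 edges of the complete graph K_n and counts those that form a spanning tree. The function names are illustrative; the count should agree with Cayley's formula n^(n-2).

```python
from itertools import combinations

def is_spanning_tree(n, edges):
    """True if `edges` (n-1 pairs over nodes 0..n-1) contain no cycle."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:          # this edge would close a cycle
            return False
        parent[ru] = rv
    return True               # n-1 acyclic edges on n nodes form a spanning tree

def count_spanning_trees_complete(n):
    all_edges = list(combinations(range(n), 2))
    return sum(is_spanning_tree(n, t) for t in combinations(all_edges, n - 1))

for n in range(2, 6):
    print(n, count_spanning_trees_complete(n), n ** (n - 2))
```

Already for n = 20 the formula gives 20^18 trees, far too many to enumerate.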


Applications

MST is a fundamental problem with diverse applications.

Network design.
– telephone, electrical, hydraulic, TV cable, computer, road

Approximation algorithms for NP-hard problems.
– traveling salesperson problem, Steiner tree

Indirect applications.
– max bottleneck paths
– LDPC codes for error correction
– image registration with Rényi entropy
– learning salient features for real-time face verification
– reducing data storage in sequencing amino acids in a protein
– model locality of particle interactions in turbulent fluid flows
– autoconfig protocol for Ethernet bridging to avoid cycles in a network

Cluster analysis.


Greedy Algorithms

Kruskal's algorithm. Start with T = ∅. Consider edges in ascending order of cost. Insert edge e in T unless doing so would create a cycle.

Reverse-Delete algorithm. Start with T = E. Consider edges in descending order of cost. Delete edge e from T unless doing so would disconnect T.

Prim's algorithm. Start with some root node s and greedily grow a tree T from s outward. At each step, add the cheapest edge e to T that has exactly one endpoint in T.

Remark. All three algorithms produce an MST.


Cycles and Cuts

Cycle. Set of edges of the form a-b, b-c, c-d, …, y-z, z-a.

Cutset. A cut is a subset of nodes S. The corresponding cutset D is the subset of edges with exactly one endpoint in S.

[Figure: example graph on nodes 1 to 8, shown once with a cycle and once with a cut highlighted.]

Cycle C = 1-2, 2-3, 3-4, 4-5, 5-6, 6-1

Cut S = { 4, 5, 8 }
Cutset D = 5-6, 5-7, 3-4, 3-5, 7-8
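The cutset definition translates directly into code. The slide's figure is not recoverable, so the edge list below is a hypothetical graph, chosen only so that it contains the cycle and the cutset listed above.

```python
def cutset(edges, S):
    """Edges with exactly one endpoint in the node set S."""
    return [(u, v) for u, v in edges if (u in S) != (v in S)]

# Assumed edge list (not taken from the slide's figure).
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1),
         (3, 5), (5, 7), (7, 8), (4, 8)]
cycle = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]

D = cutset(edges, {4, 5, 8})
print(sorted(D))                 # [(3, 4), (3, 5), (5, 6), (5, 7), (7, 8)]
print(len(set(cycle) & set(D)))  # 2
```

The cycle crosses the cut twice (edges 3-4 and 5-6), an instance of the fact that a cycle and a cutset intersect in an even number of edges.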


Greedy Algorithms

Simplifying assumption. All edge costs c_e are distinct.

Cut property. Let S be any subset of nodes, and let e be the min cost edge with exactly one endpoint in S. Then the MST contains e.

Cycle property. Let C be any cycle, and let f be the max cost edge belonging to C. Then the MST does not contain f.

[Figure: a cycle C whose max cost edge f is not in the MST, and a cut S whose min cost crossing edge e is in the MST.]


Cut Property

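Cut property. Let S be any subset of nodes, and let e be the min cost edge with exactly one endpoint in S. Then the MST T* contains e.

Pf. (exchange argument)
Suppose e = (v, w) ∉ T*, with v ∈ S and w ∉ S.
Let P be the path from v to w in T*.
Let v’ be the last node on P in S and w’ the first node on P not in S.
Let T’ = T* - (v’, w’) + (v, w).
T’ is also a spanning tree.
cost(T’) < cost(T*), since (v’, w’) also has exactly one endpoint in S and e is the cheapest such edge. This contradicts the minimality of T*.

Note: T* - (v’, w’) + f need not be a spanning tree when f is some other edge leaving S. The exchange cannot use an arbitrary edge leaving S.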
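The cut property can be checked exhaustively on a small instance. The sketch below assumes a hypothetical 5-node graph with distinct costs, computes its MST with a compact Kruskal routine, and then tests every proper non-empty subset S.

```python
from itertools import combinations

def kruskal(n, edges):
    """edges: list of (cost, u, v) on nodes 0..n-1; returns the set of MST edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    tree = set()
    for c, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.add((c, u, v))
    return tree

def cut_property_holds(n, edges, mst):
    """For every cut S, is the cheapest crossing edge in the MST?"""
    for size in range(1, n):
        for S in map(set, combinations(range(n), size)):
            crossing = [e for e in edges if (e[1] in S) != (e[2] in S)]
            if min(crossing) not in mst:
                return False
    return True

# Hypothetical connected graph with distinct costs.
n = 5
edges = [(4, 0, 1), (8, 0, 2), (2, 1, 2), (7, 1, 3),
         (3, 2, 3), (9, 2, 4), (5, 3, 4)]
mst = kruskal(n, edges)
print(cut_property_holds(n, edges, mst))  # True
```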


Kruskal's Algorithm

Kruskal's algorithm. [Kruskal, 1956] Consider edges in ascending order of weight.
Case 1: If adding e to T creates a cycle, discard e according to the cycle property.
Case 2: Otherwise, insert e = (u, v) into T according to the cut property, where S = set of nodes in u's connected component.

[Figure: Case 1, edge e = (u, v) closes a cycle; Case 2, edge e leaves the component S.]


Kruskal’s Algorithm: Proof of Correctness

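Let T be the set of edges produced by the algorithm.
Consider any edge e = (u, v) added by the algorithm in iteration i.
Let S be the set of nodes to which u has a path in T just before this iteration.
u ∈ S, v ∉ S, since adding e does not create a cycle.
No edge from S to V-S has been considered before iteration i; otherwise the algorithm would have added it.
So e is the cheapest edge from S to V-S.
e belongs to every MST (cut property).

Suppose T is not a spanning tree. T has no cycles by construction, so T is not connected.
Then there is a set S such that S and V-S are not connected by edges of T.
Since G is connected, some edge of G crosses from S to V-S, and the algorithm would have added the first such edge it considered: a contradiction.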


Prim's Algorithm

Prim's algorithm. [Jarník 1930, Dijkstra 1957, Prim 1959]
Initialize S = any node.
Apply cut property to S.
Add min cost edge in cutset corresponding to S to T, and add one new explored node u to S.

[Figure: the explored set S.]


Prim’s algorithm: proof of correctness

In every iteration, T is a spanning tree of the nodes in S.
The edge chosen is the min cost edge in the cutset of S, so by the cut property it belongs to the MST.


Cycle Property

Cycle property. Let C be any cycle in G, and let e be the max cost edge belonging to C. Then the MST T* does not contain e.

Pf.
Suppose T* contains e.
Let S and V-S be the node sets of the two components of T* - e.
There must be another edge e’ of C in the cutset of S, since a cycle intersects a cutset in an even number of edges.
T* - e + e’ is a spanning tree of lesser cost, since e is the max cost edge of C: a contradiction.


Proof of Reverse-delete algorithm

Reverse-Delete algorithm. Start with T = E. Consider edges in descending order of cost. Delete edge e from T unless doing so would disconnect T.

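If edge e is deleted in some iteration, it lies on a cycle of the current T, and it is the first edge of that cycle to be considered, so it is the most expensive edge on the cycle.
By the cycle property, no deleted edge belongs to the MST.
The final T is connected, and it has no cycle: the most expensive edge on such a cycle would have been deleted when it was considered. So the final set of edges forms the MST.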
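A direct, unoptimized sketch of Reverse-Delete follows. It re-checks connectivity with a depth-first search after each tentative deletion, so it runs in roughly O(m (m + n)) time; the graph and the helper name `connected` are assumptions for illustration.

```python
def connected(n, edges):
    """Depth-first search: do `edges` connect all nodes 0..n-1?"""
    adj = {v: [] for v in range(n)}
    for _, u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def reverse_delete(n, edges):
    """edges: list of (cost, u, v); returns the edges that survive deletion."""
    T = list(edges)
    for e in sorted(edges, reverse=True):     # descending order of cost
        remaining = [f for f in T if f != e]
        if connected(n, remaining):           # deleting e keeps T connected
            T = remaining
    return sorted(T)

# Hypothetical connected graph with distinct costs.
edges = [(4, 0, 1), (8, 0, 2), (2, 1, 2), (7, 1, 3),
         (3, 2, 3), (9, 2, 4), (5, 3, 4)]
print(reverse_delete(5, edges))
# [(2, 1, 2), (3, 2, 3), (4, 0, 1), (5, 3, 4)]
```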


Implementation: Prim's Algorithm

Prim(G, c) {
   foreach (v ∈ V) a[v] ← ∞
   Initialize an empty priority queue Q
   foreach (v ∈ V) insert v onto Q
   Initialize set of explored nodes S ← ∅

   while (Q is not empty) {
      u ← delete min element from Q
      S ← S ∪ { u }
      foreach (edge e = (u, v) incident to u)
         if ((v ∉ S) and (c_e < a[v]))
            decrease priority a[v] to c_e
   }
}

Implementation. Use a priority queue à la Dijkstra.
Maintain set of explored nodes S.
For each unexplored node v, maintain attachment cost a[v] = cost of cheapest edge from v to a node in S.
O(n^2) with an array; O(m log n) with a binary heap.
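A binary-heap version of the idea above can be sketched in Python. One deliberate deviation from the pseudocode: Python's `heapq` has no decrease-key operation, so this sketch pushes candidate edges and skips stale entries instead of updating a[v] in place. The adjacency-list format and the sample graph are assumptions.

```python
import heapq

def prim(adj, root):
    """adj: {node: [(cost, neighbor), ...]} for a connected undirected graph.
    Returns (total cost, list of tree edges (u, v, cost))."""
    explored = {root}
    tree, total = [], 0
    heap = [(c, root, v) for c, v in adj[root]]
    heapq.heapify(heap)
    while heap and len(explored) < len(adj):
        c, u, v = heapq.heappop(heap)
        if v in explored:            # stale entry: v was reached more cheaply
            continue
        explored.add(v)
        tree.append((u, v, c))
        total += c
        for c2, w in adj[v]:
            if w not in explored:
                heapq.heappush(heap, (c2, v, w))
    return total, tree

# Hypothetical example graph.
adj = {
    "a": [(4, "b"), (8, "c")],
    "b": [(4, "a"), (2, "c"), (7, "d")],
    "c": [(8, "a"), (2, "b"), (3, "d"), (9, "e")],
    "d": [(7, "b"), (3, "c"), (5, "e")],
    "e": [(9, "c"), (5, "d")],
}
total, tree = prim(adj, "a")
print(total)  # 14
print(tree)   # [('a', 'b', 4), ('b', 'c', 2), ('c', 'd', 3), ('d', 'e', 5)]
```

The heap can hold up to m entries, so the running time is still O(m log n), at the cost of some extra memory compared with a true decrease-key queue.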


Kruskal’s algorithm: Implementation

Maintain a set of components.
Find the shortest edge e whose end points are in different components.
Add e and merge the components containing the end points of e.

Operations needed
– Find: the component containing node u
– Union/Merge: the components containing the end points of e

Abstract Set Operations
– Maintain a collection of sets of elements
– Find(u): set containing element u
– Merge(A,B): merge the sets A and B


Union-Find Data Structure

Name each set S by one of the elements in S: the representative.
Store a pointer with each element u that leads to the name of the set.
Also store the size of each set.
Initially: each node points to itself (singleton sets).
Each set is maintained as a tree, with the root being the representative.
Find(u): follow pointers to the root of the tree containing u.
Merge(A,B): if A is smaller than B, the root of A points to the root of B; otherwise the root of B points to the root of A.


Union-Find Data Structure

Sequence of operations: Union(w,u), Union(s,u), Union(t,v), Union(z,v), Union(i,x), Union(y,j), Union(x,j), Union(u,v)
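The pointer-based structure and the sequence of operations above can be replayed with a short sketch. The class and method names are illustrative; `union` takes two elements and merges the sets containing them, pointing the smaller tree's root at the larger one.

```python
class UnionFind:
    def __init__(self, elements):
        self.parent = {x: x for x in elements}   # each element is its own root
        self.size = {x: 1 for x in elements}

    def find(self, u):
        while self.parent[u] != u:
            u = self.parent[u]
        return u

    def union(self, u, v):
        a, b = self.find(u), self.find(v)
        if a == b:
            return a
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a           # smaller tree's root points to the larger's
        self.size[a] += self.size[b]
        return a

uf = UnionFind("stuvwxyzij")
for a, b in [("w", "u"), ("s", "u"), ("t", "v"), ("z", "v"),
             ("i", "x"), ("y", "j"), ("x", "j"), ("u", "v")]:
    uf.union(a, b)

print(uf.find("s") == uf.find("z"))  # True: s and z ended up in the same set
print(uf.find("s") == uf.find("i"))  # False: i, x, y, j form a separate set
```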


Union-Find: Complexity

Merge(A,B): takes O(1) time.
Find(u): trace pointers from u to the root of the tree containing u.
Follow pointer from node u to node v: # nodes in subtree rooted at v ≥ 2 (# nodes in subtree rooted at u).
So the depth of any tree is O(log n), and Find(u) takes O(log n) time.


Improvements: Union-Find

When Find(u) is done, redirect all pointers on the path from u to the root so that they point directly at the root (path compression).
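Path compression can be sketched as a two-pass Find: one pass to locate the root, a second pass to redirect the pointers. The dictionary-based `parent` map and the sample chain are assumptions for illustration.

```python
def find(parent, u):
    """Return the root of u's tree, then point every node on the path at it."""
    root = u
    while parent[root] != root:
        root = parent[root]
    while parent[u] != root:
        nxt = parent[u]
        parent[u] = root      # redirect this pointer straight to the root
        u = nxt
    return root

# Hypothetical chain e -> d -> c -> b -> a, where a is the root.
parent = {"a": "a", "b": "a", "c": "b", "d": "c", "e": "d"}
print(find(parent, "e"))  # a
print(parent)             # every node now points directly at a
```

Later Find calls on any node of that path then take a single step.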


Implementation: Kruskal's Algorithm

Kruskal(G, c) {
   Sort edge weights so that c_1 ≤ c_2 ≤ ... ≤ c_m.
   T ← ∅

   foreach (u ∈ V) make a set containing singleton u

   for i = 1 to m
      (u, v) = e_i
      if (u and v are in different sets) {    // are u and v in different connected components?
         T ← T ∪ {e_i}
         merge the sets containing u and v    // merge two components
      }
   return T
}

Implementation. Use the union-find data structure.
Build set T of edges in the MST.
Maintain set for each connected component.
O(m log n) for sorting and O(m α(m, n)) for union-find.

Notes: m ≤ n^2, so log m is O(log n); α(m, n) is essentially a constant.
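The pseudocode above can be turned into a runnable sketch as follows. The union-find here is inlined with union by size and path compression; the edge format `(cost, u, v)` and the sample graph are assumptions.

```python
def kruskal(nodes, edges):
    """edges: list of (cost, u, v). Returns the list of MST edges in the order added."""
    parent = {v: v for v in nodes}
    size = {v: 1 for v in nodes}

    def find(u):
        root = u
        while parent[root] != root:
            root = parent[root]
        while parent[u] != root:      # path compression
            nxt = parent[u]
            parent[u] = root
            u = nxt
        return root

    T = []
    for cost, u, v in sorted(edges):  # ascending order of cost
        ru, rv = find(u), find(v)
        if ru != rv:                  # u and v are in different components
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru           # merge the smaller component into the larger
            size[ru] += size[rv]
            T.append((cost, u, v))
    return T

# Hypothetical example graph.
edges = [(4, "a", "b"), (8, "a", "c"), (2, "b", "c"), (7, "b", "d"),
         (3, "c", "d"), (9, "c", "e"), (5, "d", "e")]
mst = kruskal("abcde", edges)
print(mst)                       # [(2, 'b', 'c'), (3, 'c', 'd'), (4, 'a', 'b'), (5, 'd', 'e')]
print(sum(c for c, _, _ in mst)) # 14
```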


Lexicographic Tiebreaking

To remove the assumption that all edge costs are distinct: perturb all edge costs by tiny amounts to break any ties (e.g., if all edge costs are integers, perturb the cost of edge e_i by i / n^2).

Impact. Kruskal and Prim only interact with costs via pairwise comparisons. If perturbations are sufficiently small, the MST with perturbed costs is an MST with the original costs.

Implementation. Can handle arbitrarily small perturbations implicitly by breaking ties lexicographically, according to index.

boolean less(i, j) {
   if      (cost(e_i) < cost(e_j)) return true
   else if (cost(e_i) > cost(e_j)) return false
   else if (i < j)                 return true
   else                            return false
}
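In Python the same lexicographic rule falls out of tuple comparison. A minimal sketch, assuming edges are identified by their position in a hypothetical `costs` list:

```python
# Hypothetical edge costs with ties; the list position is the edge index.
costs = [5, 3, 5, 3, 1]

def less(i, j):
    """Compare edges i and j by cost, breaking ties by index."""
    return (costs[i], i) < (costs[j], j)

order = sorted(range(len(costs)), key=lambda i: (costs[i], i))
print(order)       # [4, 1, 3, 0, 2]
print(less(1, 3))  # True: equal costs, lower index wins
print(less(3, 1))  # False
```

Sorting on the pair (cost, index) gives a strict total order, so the distinct-cost arguments apply without ever computing an explicit perturbation.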


4.7 Clustering

[Figure: outbreak of cholera deaths in London in the 1850s.]
Reference: Nina Mishra, HP Labs


Clustering

Clustering. Given a set U of n objects labeled p_1, …, p_n, classify into coherent groups (e.g., photos, documents, micro-organisms).

Distance function. Numeric value specifying "closeness" of two objects (e.g., the number of corresponding pixels whose intensities differ by some threshold).

Fundamental problem. Divide into clusters so that points in different clusters are far apart.
– Routing in mobile ad hoc networks.
– Identify patterns in gene expression.
– Document categorization for web search.
– Similarity searching in medical image databases.
– Skycat: cluster 10^9 sky objects into stars, quasars, galaxies.


Clustering of Maximum Spacing

k-clustering. Divide objects into k non-empty groups.

Distance function. Assume it satisfies several natural properties.
– d(p_i, p_j) = 0 iff p_i = p_j (identity of indiscernibles)
– d(p_i, p_j) ≥ 0 (nonnegativity)
– d(p_i, p_j) = d(p_j, p_i) (symmetry)

Spacing. Min distance between any pair of points in different clusters.

Clustering of maximum spacing. Given an integer k, find a k-clustering of maximum spacing.

[Figure: a clustering with k = 4, with its spacing marked.]
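The spacing of a given clustering can be computed directly from its definition. A small sketch, assuming hypothetical 1-d points and absolute difference as the distance function:

```python
from itertools import combinations

def spacing(clusters, d):
    """Min distance between any pair of points in different clusters."""
    return min(d(p, q)
               for A, B in combinations(clusters, 2)
               for p in A for q in B)

# Hypothetical 3-clustering of points on a line.
clusters = [[0, 1, 2], [10, 11], [20]]
print(spacing(clusters, lambda p, q: abs(p - q)))  # 8
```

Here the closest pair in different clusters is 2 and 10, so the spacing is 8.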


Greedy Clustering Algorithm

Single-link k-clustering algorithm.
– Form a graph on the vertex set U, corresponding to n clusters.
– Find the closest pair of objects such that each object is in a different cluster, and add an edge between them.
– Repeat n-k times until there are exactly k clusters.

Key observation. This procedure is precisely Kruskal's algorithm (except we stop when there are k connected components).

Remark. Equivalent to finding an MST and deleting the k-1 most expensive edges.
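The single-link procedure can be sketched as Kruskal's algorithm on the complete graph, stopped once k components remain. The function name, the 1-d sample points, and the absolute-difference distance are assumptions for illustration.

```python
from itertools import combinations

def single_link_clusters(points, k, d):
    """Kruskal's algorithm on the complete graph, stopped at k components."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    pairs = sorted(combinations(range(len(points)), 2),
                   key=lambda ij: d(points[ij[0]], points[ij[1]]))
    components = len(points)
    for i, j in pairs:                      # closest pairs first
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:                        # objects are in different clusters
            parent[ri] = rj
            components -= 1

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return sorted(groups.values())

points = [0, 1, 2, 10, 11, 20]
print(single_link_clusters(points, 3, lambda p, q: abs(p - q)))
# [[0, 1, 2], [10, 11], [20]]
```

Sorting all pairs costs O(n^2 log n), which is fine for a sketch but dominates the running time on large inputs.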


Greedy Clustering Algorithm: Analysis

Theorem. Let C* denote the clustering C*_1, …, C*_k formed by deleting the k-1 most expensive edges of an MST. C* is a k-clustering of max spacing.

Pf. Let C denote some other clustering C_1, …, C_k.
– The spacing of C* is the length d* of the (k-1)st most expensive edge.
– Let p_i, p_j be in the same cluster in C*, say C*_r, but different clusters in C, say C_s and C_t.
– Some edge (p, q) on the p_i-p_j path in C*_r spans two different clusters in C.
– All edges on the p_i-p_j path have length ≤ d* since Kruskal chose them.
– Spacing of C is ≤ d* since p and q are in different clusters. ▪

[Figure: the p_i-p_j path inside C*_r, with edge (p, q) crossing from cluster C_s to cluster C_t.]


MST Algorithms: Theory

Deterministic comparison based algorithms.
– O(m log n). [Jarník, Prim, Dijkstra, Kruskal, Borůvka]
– O(m log log n). [Cheriton-Tarjan 1976, Yao 1975]
– O(m β(m, n)). [Fredman-Tarjan 1987]
– O(m log β(m, n)). [Gabow-Galil-Spencer-Tarjan 1986]
– O(m α(m, n)). [Chazelle 2000]

Holy grail. O(m).

Notable.
– O(m) randomized. [Karger-Klein-Tarjan 1995]
– O(m) verification. [Dixon-Rauch-Tarjan 1992]

Euclidean.
– 2-d: O(n log n). Compute MST of edges in Delaunay triangulation.
– k-d: O(k n^2). Dense Prim.
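The "dense Prim" bound for k-dimensional points can be illustrated with an array-based sketch that never builds the complete graph explicitly. Each of the n rounds scans all nodes and computes one k-dimensional distance per node, giving O(k n^2) overall. The function name and sample points are assumptions.

```python
import math

def dense_prim(points):
    """Array-based Prim on the implicit complete graph over k-d points.
    Returns the total length of a Euclidean MST."""
    n = len(points)
    in_tree = [False] * n
    attach = [math.inf] * n      # a[v]: cheapest distance from v to the tree
    attach[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the unexplored node with the smallest attachment cost.
        u = min((v for v in range(n) if not in_tree[v]), key=attach.__getitem__)
        in_tree[u] = True
        total += attach[u]
        for v in range(n):
            if not in_tree[v]:
                attach[v] = min(attach[v], math.dist(points[u], points[v]))
    return total

# Hypothetical points: corners of a unit square plus one far point.
print(dense_prim([(0, 0), (1, 0), (1, 1), (0, 1), (4, 1)]))  # 6.0
```

Three sides of the square contribute length 3, and the far point attaches to (1, 1) at distance 3.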

