A Graph-Theoretic Approach to Webpage Segmentation

Deepayan Chakrabarti (deepay@yahoo-inc.com)

Ravi Kumar (ravikuma@yahoo-inc.com)

Kunal Punera (kpunera@yahoo-inc.com)

Motivation and Related Work

Divide a webpage into visually and semantically cohesive sections

[Figure: example webpage with its sections labeled Header, Navigation bar, Primary content, Related links, Copyright]

Motivation and Related Work

Sectioning can be useful in:
  Webpage classification
  Displaying webpages on mobile phones and small-screen devices
  Webpage ranking
  Duplicate detection
  …

A lot of recent interest:
  Informative Structure Mining [Cai+/2003, Kao+/2005]
  Displaying webpages on small screens [Chen+/2005, Baluja/2006]
  Template detection [Bar-Yossef+/2002]
  Topic distillation [Chakrabarti+/2001]

Existing approaches are based solely on visual, or content, or DOM-based cues

Mostly heuristic approaches

Our contributions:
  Combine visual, DOM, and content-based cues
  Propose a formal graph-based combinatorial optimization approach
  Develop two instantiations, both with:
    Approximation guarantees
    Automatic determination of the number of sections
  Develop methods for automatic learning of graph weights

Outline

  Motivation and Related Work
  Proposed Work
  Experiments
  Conclusions

Proposed Work

A graph-based approach:
  Construct a neighborhood graph of DOM tree nodes
  Nodes are neighbors if they are close according to:
    DOM tree distance, or
    visual distance when rendered on the screen, or
    similar content types
  Partition the neighborhood graph to optimize a cost function
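The construction step can be sketched in a few lines of Python. This is only an illustration: the `Node` fields, the three distance stand-ins, and the thresholds `max_dom` and `max_px` are assumptions made for the example, not values from the slides.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Node:
    name: str
    path: tuple      # path from the root, ending at this node
    box: tuple       # rendered bounding box (x, y, width, height)
    kind: str        # coarse content type, e.g. "text", "link", "image"

def dom_distance(a, b):
    """Number of tree edges between two nodes, via their common path prefix."""
    common = 0
    for x, y in zip(a.path, b.path):
        if x != y:
            break
        common += 1
    return (len(a.path) - common) + (len(b.path) - common)

def visual_distance(a, b):
    """Distance between box centers (a simple stand-in for on-screen distance)."""
    ax, ay = a.box[0] + a.box[2] / 2, a.box[1] + a.box[3] / 2
    bx, by = b.box[0] + b.box[2] / 2, b.box[1] + b.box[3] / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def neighborhood_graph(nodes, max_dom=2, max_px=100.0):
    """Connect two nodes if they are close by any one of the three cues."""
    edges = set()
    for a, b in combinations(nodes, 2):
        if (dom_distance(a, b) <= max_dom
                or visual_distance(a, b) <= max_px
                or a.kind == b.kind):
            edges.add(frozenset((a.name, b.name)))
    return edges

nodes = [
    Node("nav1", ("html", "body", "ul", "li[0]"), (0, 0, 80, 20), "link"),
    Node("nav2", ("html", "body", "ul", "li[1]"), (90, 0, 80, 20), "link"),
    Node("para", ("html", "body", "div", "p[0]"), (0, 400, 600, 200), "text"),
]
print(sorted(sorted(e) for e in neighborhood_graph(nodes)))
# [['nav1', 'nav2']]
```

In this toy page the two navigation links become neighbors, while the paragraph is far from both by every cue and stays unconnected.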

[Figure: a DOM tree and the neighborhood graph built from it]

Proposed Work

A graph-based approach:
  What is a good cost function?
    Intuitive
    Has polynomial-time algorithms that can get provably close to the optimal
    Candidates: Correlation Clustering, Energy-minimizing Graph Cuts
  How should we set weights in the neighborhood graph?

Correlation Clustering

Assign each DOM node p to a section S(p)

Vpq are edge weights in the neighborhood graph

Penalty for having DOM nodes p and q in different sections

[Figure: neighborhood graph with edge weights VAB and VAE]
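The objective formula on this slide did not survive extraction. One common way to write a correlation-clustering cost that agrees with the annotations above is given here as an assumption, with the weights taken to be normalized to [0, 1]: separated pairs pay Vpq, and pairs kept together pay the complementary amount.

```latex
\min_{S} \; \sum_{(p,q)\,:\,S(p) \neq S(q)} V_{pq} \;+\; \sum_{(p,q)\,:\,S(p) = S(q)} \bigl(1 - V_{pq}\bigr)
```

Under this reading, a weight near 1 pulls p and q into the same section and a weight near 0 pushes them apart.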

Rendering Constraint: each pixel on the screen must belong to at most one section
  Parent section = child section
  Constraint only applies to DOM nodes “aimed” at visual rendering

[Figure: DOM tree with nodes A, B, and C] Either SA=SB=SC, or SA≠SB and SA≠SC

Rendering Constraint: each pixel on the screen must belong to at most one section
  Not enforced by CCLUS
  Workaround: use only leaf nodes in the neighborhood graph
  But content cues may be too noisy at the leaf level

Algorithm [Ailon+/2005]:
  Pick a random leaf node p
  Create a new section containing p and all nodes q that are strongly connected to p
  Remove p and the q’s from the neighborhood graph
  Iterate

Within a factor of 2 of the optimal
Number of sections picked automatically
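The steps above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the slide does not define "strongly connected", so the cutoff `threshold=0.5` on the edge weight is an assumption, as are the function name and the toy weights.

```python
import random

def pivot_sections(nodes, v, threshold=0.5, seed=0):
    """Greedy pivot clustering in the style of [Ailon+/2005].

    nodes: iterable of hashable leaf-node ids.
    v: dict mapping frozenset({p, q}) -> weight in [0, 1]; missing pairs count as 0.
    threshold: hypothetical cutoff for "strongly connected" (an assumption).
    """
    rng = random.Random(seed)
    remaining = list(nodes)
    sections = []
    while remaining:
        pivot = rng.choice(remaining)              # pick a random node p
        section = [pivot] + [
            q for q in remaining
            if q != pivot and v.get(frozenset((pivot, q)), 0.0) > threshold
        ]
        sections.append(section)                   # new section: p and its strong neighbors
        members = set(section)
        remaining = [n for n in remaining if n not in members]  # remove them, iterate
    return sections

# Toy example: A, B, C are mutually close; D and E are close to each other only.
weights = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.8,
    frozenset(("B", "C")): 0.85,
    frozenset(("D", "E")): 0.95,
    frozenset(("A", "D")): 0.1,
}
sections = pivot_sections(["A", "B", "C", "D", "E"], weights)
print(sorted(sorted(s) for s in sections))
# [['A', 'B', 'C'], ['D', 'E']]
```

The number of sections is never passed in: it falls out of how many pivots are needed to exhaust the graph.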

Energy-minimizing Graph Cuts

Extra: a predefined set of labels
Assign to each node p a label S(p)
The energy has two parts:
  Dp: distance of node p to its label
  Vpq: distance between pairs of nodes
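The energy formula itself was lost in extraction. The standard form used in the energy-minimization literature, which matches the two annotated terms, is shown here as an assumption about what the slide displayed:

```latex
E(S) \;=\; \sum_{p} D_p\bigl(S(p)\bigr) \;+\; \sum_{(p,q)} V_{pq}\bigl(S(p),\, S(q)\bigr)
```

The first sum runs over nodes and the second over edges of the neighborhood graph.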

Energy-minimizing Graph Cuts

Difference from CCLUS:
  Node weights Dp in addition to edge weights Vpq
  Dp and Vpq can depend on the labels (not just “same” or “different”)

How can we fit the Rendering Constraint?
  Have a special “invisible” label ξ
  Parent is invisible, unless all children have the same label
  Can set the Vpq values accordingly
  Automatically infer “rendering” versus “structural” DOM nodes

[Figure: DOM tree example; the parent's label SA is initially unknown (?) and is set to ξ]
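The parent rule can be written as a small check. This is only a sketch of the rule as a hard constraint; in the approach described here it is encoded through the Vpq values rather than computed directly, and the function name is hypothetical.

```python
INVISIBLE = "ξ"  # the special "invisible" label from the slide

def parent_label(child_labels):
    """Label a parent may take, given the labels of its (non-empty) children.

    If every child carries the same visible label, the parent shares it
    (a "rendering" node); otherwise the parent is invisible (a purely
    "structural" node).
    """
    distinct = set(child_labels)
    if len(distinct) == 1 and INVISIBLE not in distinct:
        return next(iter(distinct))
    return INVISIBLE

print(parent_label([5, 5]))   # children agree: parent takes section 5
print(parent_label([5, 7]))   # children disagree: parent is ξ
```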

Energy-minimizing Graph Cuts

Why couldn’t we use this trick in CCLUS as well?
  CCLUS only asks: are nodes p and q in the same section or not?
  It cannot handle “special” sections like the invisible section
  Hence, labels give us extra power

Advantages:
  Can use all DOM nodes, while still obeying the Rendering Constraint
    Better than CCLUS
  Factor of 2 approximation of the optimal, by performing iterative min-cuts of specially constructed graphs
    We extend [Kolmogorov+/2004]
  Number of sections is picked automatically

Energy-minimizing Graph Cuts

Theorem: Vpq must obey the constraint
  Separation cost ≥ Merge cost

Set Vpq(different) >> Vpq(same) for nodes that are extremely close
  Cost minimization tries to place them in the same section

However, we cannot use Vpq to push two nodes to be in different sections
  Use Dp instead

To separate nodes p and q:
  Ensure that either Dp(α) or Dq(α) is large, for any label α
  So, assigning both p and q to the same label will be too costly
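A toy calculation illustrates the last point. All numbers are made up for the example: two nodes, two labels, node costs chosen so that for every label at least one of Dp, Dq is large, and an edge cost that satisfies separation cost ≥ merge cost.

```python
from itertools import product

def energy(assignment, d, v_same, v_diff):
    """Energy of a two-node labeling: both node costs plus one edge cost."""
    sp, sq = assignment
    edge = v_same if sp == sq else v_diff
    return d["p"][sp] + d["q"][sq] + edge

labels = ["alpha", "beta"]
# Hypothetical node costs: p is close to alpha, q is close to beta.
d = {"p": {"alpha": 1, "beta": 50},
     "q": {"alpha": 50, "beta": 1}}
v_same, v_diff = 0, 10   # separation cost >= merge cost

for a in product(labels, repeat=2):
    print(a, energy(a, d, v_same, v_diff))

best = min(product(labels, repeat=2), key=lambda a: energy(a, d, v_same, v_diff))
print("minimum:", best)
```

Even though the edge term prefers merging, either shared label costs 51 while the split labeling costs 12, so the minimizer separates p and q.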

Invisible label lets us use the parent-child DOM tree structure

Ensures that nodes with very different content or visual features are split up

Ensures that nodes with very similar content or visual features are merged

Learning graph weights

Extract content and visual features from training data

Learning Vpq(.)
  Learn a logistic regression classifier (prob. that p and q belong to the same section)
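A dependency-free sketch of such a classifier is given below. The two pair features, the six training pairs, and the learning-rate and epoch settings are all hypothetical. The slide does not say how the predicted probability is turned into Vpq, so that step is left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(samples, labels, lr=0.5, epochs=2000):
    """Fit weights (plus a bias) by batch gradient descent on the log-loss."""
    dim = len(samples[0])
    w = [0.0] * (dim + 1)                     # last entry is the bias
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in zip(samples, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1]) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
            grad[-1] += err
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad)]
    return w

def same_section_prob(w, x):
    """Predicted probability that a node pair with features x shares a section."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])

# Hypothetical pair features: (normalized DOM distance, normalized visual distance).
pairs = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.3), (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
same = [1, 1, 1, 0, 0, 0]                     # 1 = hand-labeled as same section

w = train_logistic(pairs, same)
print(same_section_prob(w, (0.15, 0.2)) > 0.5)   # a close pair
print(same_section_prob(w, (0.85, 0.8)) < 0.5)   # a distant pair
```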

Learning Dp(.)
  Training data does not provide labels
  Set of labels = set of DOM tree nodes in that webpage
  Dp(α) = distance in some feature space
  Learn a Mahalanobis distance metric between nodes (distances within a section < distances across sections)
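For reference, a Mahalanobis distance is a Euclidean distance reweighted by a positive semidefinite matrix M. The sketch below only evaluates the distance; the matrix `M` is a made-up stand-in for a learned metric, and the learning procedure itself is not shown.

```python
import math

def mahalanobis(x, y, m):
    """sqrt((x - y)^T M (x - y)) for a symmetric positive semidefinite matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    m_diff = [sum(m_ij * d_j for m_ij, d_j in zip(row, diff)) for row in m]
    return math.sqrt(sum(d_i * md_i for d_i, md_i in zip(diff, m_diff)))

# Hypothetical learned metric: the first feature counts 9x more than the second.
M = [[9.0, 0.0],
     [0.0, 1.0]]

print(mahalanobis((0.0, 0.0), (1.0, 0.0), M))  # 3.0
print(mahalanobis((0.0, 0.0), (0.0, 1.0), M))  # 1.0
```

With M set to the identity this reduces to the ordinary Euclidean distance. Learning M amounts to stretching the feature directions that separate sections and shrinking the ones that do not.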

Experiments

Manually sectioned 105 randomly chosen webpages to get 1088 sections

Two measures were used:
  Adjusted Rand: fraction of leaf node pairs which are correctly predicted to be together or apart (over and above random sectioning)
  Normalized Mutual Information
Both are between 0 and 1, with higher values indicating better results.
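The first measure can be computed with the standard adjusted Rand index formula, sketched below in plain Python. The toy labelings are invented, and the evaluation code actually used for the experiments is not shown in the slides.

```python
from collections import Counter
from math import comb

def adjusted_rand(truth, pred):
    """Adjusted Rand index between two labelings of the same items (needs >= 2 items)."""
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_truth = sum(comb(c, 2) for c in Counter(truth).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_truth * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = (sum_truth + sum_pred) / 2
    if max_index == expected:      # degenerate case, e.g. both use a single section
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]         # hand-labeled sections of six leaf nodes
perfect = [5, 5, 5, 9, 9, 9]       # same grouping, different section ids
noisy = [0, 0, 1, 1, 1, 1]         # one leaf placed in the wrong section

print(adjusted_rand(truth, perfect))  # 1.0
print(adjusted_rand(truth, noisy))
```

Section ids do not need to match between the two labelings; only which leaf pairs end up together matters.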

Experiments

Adjusted Rand:
  CCLUS: only 20% of the webpages score better than 0.6
  GCUTS: almost 50% of the webpages score better than 0.6

Experiments

GCUTS is better than CCLUS

Over all webpages

Experiments

Application to duplicate detection on the Web
  Collected lyrics of the same songs from 3 different sites (~2300 webpages)
    Nearly similar content
    Different template structures
  Our approach:
    Section all webpages
    Perform duplicate detection using only the largest section (primary content)
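The effect of restricting comparison to the primary content can be illustrated with a small sketch. The slides do not name the duplicate detector, so word shingling with Jaccard similarity is swapped in here as an assumption. "Largest" is taken to mean the section with the most text, and the toy pages are invented.

```python
def shingles(text, k=3):
    """Set of k-word shingles of a text (k is a hypothetical choice)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def largest_section(sections):
    """Pick the section with the most text as the primary content."""
    return max(sections, key=len)

# Two toy pages with the same body text but different templates.
body = "the quick brown fox jumps over the lazy dog and runs far away tonight"
page1 = ["Home | Artists | Charts", body, "(c) site one"]
page2 = ["Lyrics Portal - login - register - top 100", body, "Contact us | Terms of use"]

whole = jaccard(shingles(" ".join(page1)), shingles(" ".join(page2)))
primary = jaccard(shingles(largest_section(page1)), shingles(largest_section(page2)))
print(whole < primary, primary)
```

Comparing whole pages lets the differing templates dilute the similarity score, while comparing only the largest sections recovers the exact match.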

Experiments

Sectioning > No sectioning
GCUTS > CCLUS

Conclusions

Combined visual, DOM, and content-based cues

Optimization on a neighborhood graph
  Node and edge weights are learnt from training data

Developed CCLUS and GCUTS, both with:
  Approximation guarantees
  Automatic determination of the number of sections

What is such a Dp(.) function?
  Use the set of internal DOM nodes as the set of labels
  Dp(α) measures the difference in feature vectors between node p and internal node (label) α
  If nodes p and q are very different, Dp(α) and Dq(α) will differ for all α

CCLUS does not enforce the Rendering Constraint:
  Each pixel on the screen must belong to at most one section
  Parent nodes should have the same section as their children
  Apply the rule only for ancestors “aimed” at visual rendering

Workaround: consider only leaf nodes in the neighborhood graph
  But content cues may be too noisy at the leaf level

[Figure: DOM tree example with children labeled SB=5 and SC=7; because they differ, the parent's label SA is ξ]

What is the set of labels?
  The set of internal DOM nodes
    Available at the beginning of the algorithm
    The labels are themselves nodes, with feature vectors
  Dp(α) = distance in some feature space
    “Tuned” to the current webpage
