Structured Data Extraction and Template Detection
Bing Liu
Department of Computer Science
University of Illinois at Chicago (UIC)
[email protected]
Bing Liu, UIC Yandex 2
Introduction
The Web is perhaps the single largest data source in the world.
Web mining aims to develop new techniques to extract useful knowledge from the Web.
Due to the heterogeneity and lack of structure of Web data, automated discovery of useful information is a challenging task.
This talk focuses on two problems:
Data extraction and template detection
1. Structured Data Extraction
Wrapper induction
Automatic data extraction
Introduction
A large amount of information on the Web is contained in regularly structured data objects, often data records retrieved from databases.
Such Web data records are important: they are often lists of products and services.
Applications: gather data to provide value-added services, e.g., comparative shopping, object search (rather than page search), etc.
Two types of pages with structured data: list pages and detail pages.
List Page – two lists of products
[Figure: screenshot of a list page containing two lists of products]
Detail Page – detailed description
[Figure: screenshot of a detail page]
Extraction Task: an illustration
image 1 | Cabinet Organizers by Copco | 9-in.   | Round Turntable: White              | ***** | $4.95
image 1 | Cabinet Organizers by Copco | 12-in.  | Round Turntable: White              | ***** | $7.95
image 2 | Cabinet Organizers          | 14.75x9 | Cabinet Organizer (Non-skid): White | ***** | $7.95
image 2 | Cabinet Organizers          | 22x6    | Cookware Lid Rack                   | ***** | $19.95
nesting
Data Model and Solution
Web data model: nested relations. See formal definitions in (Grumbach and Mecca, ICDT-99; Liu, Web Data Mining book 2006).
Solving the problem: two main types of techniques
Wrapper induction – supervised
Automatic extraction – unsupervised
Information that can be exploited
Source files (e.g., Web pages in HTML), represented as strings or trees
Visual information (e.g., rendering information)
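The tree representation of a source file can be obtained with any HTML parser. The following is a minimal sketch, assuming Python's standard `html.parser` module; the helper name `tag_tree`, the `(tag, children)` tuple encoding, and the short list of void elements are illustrative choices, not part of the talk. Text and attributes are ignored, so only the tag structure is kept.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial, illustrative list).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagTreeBuilder(HTMLParser):
    """Builds a tree of (tag, children) tuples from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = ("root", [])
        self.stack = [self.root]          # path from the root to the open node

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)    # attach to the currently open node
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if any.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth][0] == tag:
                del self.stack[depth:]
                break


def tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


if __name__ == "__main__":
    print(tag_tree("<table><tr><td>a</td><td>b<br></td></tr></table>"))
```

The same nested-tuple encoding can serve as input to tree matching, where only tag names and child order matter.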
Tree and Visual information
[Figure: HTML tag tree of a page (HTML, HEAD, BODY, TABLE, TBODY, TR, TD and P nodes), with the subtrees of data record 1 and data record 2 marked]
Road map
Wrapper induction (supervised)
Given a set of manually labeled pages, a machine learning method is applied to learn extraction rules or patterns.
Automatic data extraction (unsupervised)
Given only a single page with multiple data records, generate extraction patterns.
Given a set of positive pages, generate extraction patterns.
Wrapper induction
Using machine learning to generate extraction rules.
The user marks/labels the target items in a few training pages. The system learns extraction rules from these pages. The rules are applied to extract target items from other pages.
Many wrapper induction systems exist, e.g., WIEN (Kushmerick et al, IJCAI-97), Softmealy (Hsu and Dung, 1998), Stalker (Muslea et al. Agents-99), BWI (Freitag and Kushmerick, AAAI-00), WL2 (Cohen et al. WWW-02), IDE (Zhai and Liu. WISE-05).
We will only focus on Stalker, which also has a commercial version called Fetch.
Stalker: A hierarchical wrapper induction system (Muslea et al. Agents-99)
Hierarchical wrapper learning
Extraction is isolated at different levels of the hierarchy.
This is suitable for nested data records (embedded lists).
Each item is extracted independently of others.
Manual labeling is needed for each level.
Each target item is extracted using two rules:
A start rule for detecting the beginning of the target item.
An end rule for detecting the ending of the target item.
Hierarchical extraction based on tree
To extract each target item (a node), the wrapper needs a rule that extracts the item from its parent.
Name: John Smith
Birthday: Oct 5, 1950
Cities:
Chicago: (312) 378 3350, (312) 755 1987
New York: (212) 399 1987

Person
  Name
  Birthday
  List(Cities)
    City
    List(PhoneNo)
      Area Code
      Number
Wrapper Induction (Muslea et al., Agents-99)
Using machine learning to generate extraction rules.
The user marks the target items in a few training pages. The system learns extraction rules from these pages. The rules are applied to extract items from other pages.
Training examples
E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008
Output extraction rules
      Start rules:     End rules:
R1:   SkipTo(()        SkipTo())
R2:   SkipTo(-<b>)     SkipTo(</b>)
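To make the rules R1 and R2 concrete, here is a minimal sketch of how such an ordered rule list could be applied to the training examples E1 to E4. It assumes plain string landmarks, a start rule that scans forward from the beginning, and an end rule that scans backward from the end; the function names are hypothetical, and real Stalker landmarks are token sequences that may include wildcards.

```python
def apply_rule(text, start_landmark, end_landmark):
    """Return the text between the two landmarks, or None if the rule fails."""
    start = text.find(start_landmark)    # start rule: scan forward
    end = text.rfind(end_landmark)       # end rule: scan backward (assumption)
    if start < 0 or end < 0:
        return None
    start += len(start_landmark)
    return text[start:end] if start <= end else None


# Ordered (start landmark, end landmark) pairs mirroring R1 and R2.
RULES = [("(", ")"), ("-<b>", "</b>")]


def extract_area_code(text, rules=RULES):
    """Try each rule in order and return the first successful extraction."""
    for start_landmark, end_landmark in rules:
        item = apply_rule(text, start_landmark, end_landmark)
        if item is not None:
            return item
    return None


if __name__ == "__main__":
    examples = [
        "513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515",
        "90 Colfax, <b>Palms</b>, Phone (800) 508-1570",
        "523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293",
        "403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008",
    ]
    for e in examples:
        print(extract_area_code(e))
```

R1 handles the examples that write the area code in parentheses (E2, E4); R2 handles the ones that write it in bold after a dash (E1, E3).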
Learning extraction rules
Stalker uses sequential covering to learn extraction rules for each target item.
In each iteration, it learns a perfect rule that covers as many positive examples as possible without covering any negative example.
Once a positive example is covered by a rule, it is removed.
The algorithm ends when all the positive examples are covered. The result is an ordered list of all learned rules.
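The sequential covering loop can be sketched as follows. This is a simplified illustration, not Stalker itself: it assumes each rule is a single string landmark `SkipTo(l)`, that candidate landmarks are the few characters immediately before a labeled item, and that a rule is "perfect" if it never stops at a wrong position in the remaining examples. The names `skip_to` and `learn_start_rules` and the parameter `max_len` are hypothetical.

```python
def skip_to(text, landmark):
    """Position just after the first occurrence of landmark, or None."""
    i = text.find(landmark)
    return None if i < 0 else i + len(landmark)


def learn_start_rules(examples, max_len=4):
    """examples: list of (text, start index of the labeled target item)."""
    remaining = list(examples)
    rules = []
    while remaining:
        # Candidate landmarks: the 1..max_len characters before each item.
        candidates = set()
        for text, start in remaining:
            for n in range(1, min(max_len, start) + 1):
                candidates.add(text[start - n:start])
        best, best_covered = None, []
        for landmark in sorted(candidates, key=lambda c: (len(c), c)):
            hits = [(ex, skip_to(ex[0], landmark)) for ex in remaining]
            # Reject rules that stop at a wrong position in any example.
            if any(pos is not None and pos != ex[1] for ex, pos in hits):
                continue
            covered = [ex for ex, pos in hits if pos == ex[1]]
            if len(covered) > len(best_covered):
                best, best_covered = landmark, covered
        if best is None:
            raise ValueError("no perfect single-landmark rule found")
        rules.append(best)
        # Covered positive examples are removed before the next iteration.
        remaining = [ex for ex in remaining if ex not in best_covered]
    return rules
```

Each iteration keeps the shortest landmark among those covering the most remaining examples, so the returned list is ordered by when each rule was learned.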
Some other issues in wrapper learning
Active learning
How to automatically choose examples for the user to label (Muslea et al, AAAI-00).
IDE (Zhai & Liu, WISE-05) uses instance-based learning and is automatically active.
Wrapper verification
Check whether the current wrapper still works properly (Kushmerick, 2003).
Wrapper maintenance
If the wrapper no longer works properly, is it possible to re-label automatically? (Kushmerick AAAI-99; Lerman et al, JAIR-03)
Limitations of Supervised Learning
Manual labeling is labor intensive and time consuming, especially if one wants to extract data from a huge number of sites.
Wrapper maintenance is very costly if Web sites change frequently:
It is necessary to detect when a wrapper stops working properly.
Any change may make existing extraction rules invalid.
Re-learning is needed, and most likely manual re-labeling as well.
Automatic Extraction
There are two main problem formulations:
Problem 1: Extraction based on a single list page (Liu et al. KDD-03; Liu, Web Data Mining book 2007).
Problem 2: Extraction based on multiple input pages of the same type (list pages or detail pages) (Grumbach and Mecca, ICDT-99).
Problem 1 is more general: Algorithms for solving Problem 1 can solve Problem 2.
Thus, we only discuss Problem 1.
Automatic Extraction: Problem 1
[Figure: a list page with two data regions, data region 1 and data region 2, each containing data records]
Solution Techniques
Identify data regions and data records by finding repeated patterns:
string matching (treat the HTML source as a string)
tree matching (treat the HTML source as a tree)
Align data items: multiple alignment
Many multiple alignment algorithms exist; however, they tend to make unnecessary commitments in early alignments (which can be wrong), and they are inefficient.
A new algorithm, called Partial Tree Alignment, was proposed to deal with these problems (Zhai and Liu, WWW-05).
String edit distance (definition)
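As a point of reference, the following is a minimal sketch of the standard (Levenshtein) string edit distance, assuming unit cost for insertion, deletion, and substitution. The function name `edit_distance` is illustrative; the cost model is an assumption rather than something stated on the slide.

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))        # distances from "" to prefixes of t
    for i, a in enumerate(s, 1):
        cur = [i]                         # distance from s[:i] to ""
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,              # delete a
                cur[j - 1] + 1,           # insert b
                prev[j - 1] + (a != b),   # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Because it works on any sequences, the same function can be applied to lists of HTML tags instead of characters.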
Tree matching
There are many definitions of tree matching and tree edit distances. E.g.,
Here we only briefly discuss a restricted tree matching algorithm, called Simple Tree Matching (Yang, 1991), which is quite effective for data extraction.
No node replacement and no level crossing are allowed. It has a dynamic programming solution.
Simple tree matching (Yang 1991; Liu, Web Data Mining book 2007)
Let A = RA:⟨A1, …, Ak⟩ and B = RB:⟨B1, …, Bn⟩ be two trees, where RA and RB are the roots of A and B, and Ai and Bj are their i-th and j-th first-level subtrees.
Let W(A, B) be the number of pairs in the maximum matching of trees A and B.
If RA and RB contain identical symbols, then W(A, B) = m(⟨A1, …, Ak⟩, ⟨B1, …, Bn⟩) + 1, where m(⟨A1, …, Ak⟩, ⟨B1, …, Bn⟩) is the number of pairs in the maximum matching of ⟨A1, …, Ak⟩ and ⟨B1, …, Bn⟩.
If RA ≠ RB, W(A, B) = 0.
Simple tree match formulation (Liu, Web Data Mining Book 2007)
A dynamic programming formulation
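A minimal sketch of the dynamic programming computation of W(A, B), following the definition of Simple Tree Matching given earlier. Trees are assumed to be encoded as `(label, children)` tuples, which is an illustrative choice; the table `M[i][j]` holds the maximum matching between the first i subtrees of A and the first j subtrees of B.

```python
def simple_tree_matching(a, b):
    """Number of node pairs in the maximum matching of trees a and b."""
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return 0                              # different roots: no match at all
    k, n = len(kids_a), len(kids_b)
    # M[i][j]: best matching between the first i and first j subtrees.
    M = [[0] * (n + 1) for _ in range(k + 1)]
    for i in range(1, k + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i][j - 1],
                M[i - 1][j],
                M[i - 1][j - 1] + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return M[k][n] + 1                        # +1 for the matched roots


if __name__ == "__main__":
    A = ("p", [("a", []), ("b", []), ("e", [])])
    B = ("p", [("b", []), ("c", []), ("d", []), ("e", [])])
    print(simple_tree_matching(A, B))
```

Since children are matched in order and only at the same level, the recurrence has the same shape as a longest-common-subsequence table, with subtree scores in place of 0/1 matches.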
Multiple alignment
Pairwise alignment/matching is not sufficient because a Web page usually contains more than two data records.
We need multiple alignment.
Optimal multiple alignment/matching takes exponential time.
We discuss two techniques:
Center star method
Partial tree alignment
Center star method
A simple and classic technique. It is often used for multiple string alignment, but can be adapted for trees.
Let the set of strings to be aligned be S. In the method, a string s_c that minimizes
$$\sum_{s_i \in S} d(s_c, s_i)$$
is first selected as the center string, where d(s_c, s_i) is the distance between two strings. Selecting the center requires O(k²n²) time for the pair-wise matches (for k strings of length n).
The algorithm then iteratively computes the alignment of the rest of the strings with s_c.
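The selection of the center string can be sketched as below. This assumes unit-cost edit distance as the distance d, which the slide does not fix; the function names are illustrative, and the subsequent step of aligning every other string with the center is not shown.

```python
def edit_distance(s, t):
    """Unit-cost edit distance (assumed here as the distance d)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def center_string(strings, distance=edit_distance):
    """Return the string minimizing the sum of distances to all strings."""
    return min(strings, key=lambda c: sum(distance(c, s) for s in strings))


if __name__ == "__main__":
    print(center_string(["kitten", "sitten", "sittin", "mitten"]))
```

All pairs of strings are compared, which is the source of the quadratic number of pair-wise matches.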
Partial tree alignment (Zhai and Liu, WWW-05)
Choose a seed tree: a seed tree, denoted by Ts, is picked with the maximum number of data items. The seed tree is similar to the center string, but without the O(k²n²) pair-wise tree matching to choose it.
Tree matching: for each unmatched tree Ti (i ≠ s),
match Ts and Ti. Each pair of matched nodes is linked (aligned).
For each unmatched node nj in Ti, expand Ts by inserting nj into Ts if a position for insertion can be uniquely determined in Ts.
The expanded seed tree Ts is then used in subsequent matching.
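The insertion step can be illustrated on a single level of children. The sketch below is a simplification under stated assumptions: it treats the children of Ts and Ti as flat label sequences, aligns them with a longest common subsequence instead of tree matching, and considers an insertion position unique only when the two matched neighbours of the unmatched nodes are adjacent in Ts. The names `lcs_pairs` and `expand_seed` and the example labels are illustrative.

```python
def lcs_pairs(a, b):
    """Index pairs (i, j) of one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                L[i][j] = 1 + L[i + 1][j + 1]
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif L[i + 1][j] >= L[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def expand_seed(seed, other):
    """Insert unmatched nodes of other into seed where the position is unique.

    Returns (expanded seed, nodes that could not be inserted).
    """
    # Virtual matches before the first and after the last node act as borders.
    bounds = [(-1, -1)] + lcs_pairs(seed, other) + [(len(seed), len(other))]
    result, leftover, offset = list(seed), [], 0
    for (s_left, o_left), (s_right, o_right) in zip(bounds, bounds[1:]):
        gap = other[o_left + 1:o_right]       # unmatched nodes between matches
        if not gap:
            continue
        if s_right == s_left + 1:             # neighbours adjacent in the seed
            pos = s_right + offset
            result[pos:pos] = gap
            offset += len(gap)
        else:                                 # position is ambiguous
            leftover.extend(gap)
    return result, leftover


if __name__ == "__main__":
    print(expand_seed(list("abe"), list("bcde")))   # insertion is possible
    print(expand_seed(list("abe"), list("axe")))    # insertion is not possible
```

In the second call, the unmatched node could go either before or after `b`, so it is left out and reported as not inserted; in the full method such trees are kept for later matching against the expanded seed.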
[Figure: Partial tree alignment of two trees. Two cases are shown for a seed tree Ts and a tree Ti, both rooted at p: one where insertion is possible (the new part of Ts is marked) and one where insertion is not possible.]
More information …
IEPAD (Chang and Lui, WWW-01)
MDR (Liu et al. KDD-03)
DeLa (Wang and Lochovsky, WWW-03)
(Lerman et al, SIGMOD-04)
DEPTA (Zhai and Liu, WWW-2005)
NET (Liu and Zhai, WISE-2005)
(Zhao et al, WWW-05)
(Liu, Web Data Mining book, 2007) contains formulations, algorithms and more …
The RoadRunner System (Crescenzi et al. VLDB-01)
Given a set of positive examples (multiple sample pages). Each contains one or more data records.
From these pages, generate a wrapper as a union-free regular expression (i.e., no disjunction).
The approach
To start, a sample page is taken as the wrapper.
The wrapper is then refined by solving mismatches between the wrapper and each sample page, which generalizes the wrapper.
A mismatch occurs when some token in the sample does not match the grammar of the wrapper.
The EXALG System (Arasu and Garcia-Molina, SIGMOD-03)
The same setting as for RoadRunner: need multiple input pages of the same template.
The approach:
Step 1: find sets of tokens (called equivalence classes) having the same frequency of occurrence in every page.
Step 2: expand the sets by differentiating “roles” of tokens using contexts. The same token in different contexts is treated as different tokens.
Step 3: build the page template using the equivalence classes, based on what is in between two consecutive tokens: empty, data, or list.
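Step 1 can be sketched as grouping tokens by their occurrence vector, i.e., the tuple of their counts in each page. The sketch below assumes whitespace tokenization and two hypothetical thresholds, `min_size` (minimum tokens per class) and `min_support` (minimum number of pages a token must occur in); it does not cover the role differentiation of Step 2 or the template construction of Step 3.

```python
from collections import Counter, defaultdict


def equivalence_classes(pages, min_size=2, min_support=None):
    """Group tokens whose occurrence counts are identical in every page.

    Returns a dict mapping an occurrence vector to the sorted tokens sharing it.
    """
    if min_support is None:
        min_support = len(pages)              # default: token occurs in all pages
    counts = [Counter(page.split()) for page in pages]
    vocabulary = set().union(*counts)
    groups = defaultdict(list)
    for token in vocabulary:
        vector = tuple(c[token] for c in counts)
        if sum(1 for v in vector if v > 0) >= min_support:
            groups[vector].append(token)
    return {v: sorted(toks) for v, toks in groups.items() if len(toks) >= min_size}


if __name__ == "__main__":
    pages = [
        "<html> <b> Ann </b> <b> Bob </b> </html>",
        "<html> <b> Cy </b> </html>",
    ]
    for vector, tokens in equivalence_classes(pages).items():
        print(vector, tokens)
```

Template tokens such as opening and closing tags tend to share an occurrence vector, while data tokens rarely satisfy the support threshold.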
2. Template Detection and Page Segmentation
Introduction
Most web sites, especially commercial sites, use well designed templates.
A templatized page is one among a number of pages sharing a common look and feel.
A templatized page typically contains many blocks: main content blocks, navigation blocks, advertisements, etc.
Each block is basically a (micro) information unit (Li and Liu, CIKM-02).
Words in different blocks should not be combined in search.
Information in unimportant blocks can adversely affect search ranking, IR and DM algorithms.
Road map
Site-level template detection
Page-level template detection
Identify what you need
Frequent pagelet (template) detection (Bar-Yossef and Rajagopalan, WWW-02)
Templates as frequent pagelets. A pagelet is a self-contained logical region within a page that has a well defined topic or functionality.
Definition 1 [Pagelet - semantic definition] A pagelet is a region of a web page that (1) has a single well-defined topic or functionality; and (2) is not nested within another region that has exactly the same topic or functionality.
Definition 2 [Pagelet - syntactic definition] An HTML element in the parse tree of a page p is a pagelet if (1) none of its children contains at least k (=3) hyperlinks; and (2) none of its ancestor elements is a pagelet.
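Definition 2 translates into a top-down traversal: an element is reported as a pagelet as soon as none of its children contains at least k hyperlinks, and its descendants are then skipped because they have a pagelet ancestor. The sketch below follows the definition exactly as worded here, assuming a parse tree of `(tag, children)` tuples in which hyperlinks are nodes tagged `a`; the function names and the path-based result format are illustrative.

```python
def count_links(node):
    """Number of hyperlink (<a>) nodes in the subtree rooted at node."""
    tag, children = node
    return (tag == "a") + sum(count_links(child) for child in children)


def find_pagelets(node, k=3, path=()):
    """Return the child-index paths of all pagelets under node."""
    tag, children = node
    # Condition (1): no child contains at least k hyperlinks.
    if all(count_links(child) < k for child in children):
        return [path]
    # Otherwise descend; condition (2) holds because no ancestor was a pagelet.
    found = []
    for index, child in enumerate(children):
        found.extend(find_pagelets(child, k, path + (index,)))
    return found


if __name__ == "__main__":
    page = ("body", [
        ("div", [("a", []), ("a", []), ("a", [])]),   # navigation-like block
        ("div", [("p", []), ("a", [])]),              # content-like block
    ])
    print(find_pagelets(page))
```

A path such as `(0,)` identifies the first child of the root; an empty path means the root itself is a pagelet.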
Templates as a collection of pages
Definition 3 [Template - semantic definition] A template is a collection of pages that (1) share the same look and feel and (2) are controlled by a single authority.
Definition 4 [Template - syntactic definition] A template is a collection of pagelets p1,…,pk that satisfies the following two requirements:
C(pi) = C(pj) for all 1 ≤ i ≠ j ≤ k, where C(p) is the content of p.
O(p1), …, O(pk) form an undirected connected (graph) component, where O(p) is the page owning p.
Content equality or similarity is determined using the shingle technique in (Broder et al, WWW-97).
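The idea behind shingling can be sketched as follows: a text is reduced to its set of contiguous w-token sequences (shingles), and two texts are compared by the overlap of their shingle sets. This is a minimal illustration that assumes whitespace tokenization and the Jaccard ratio as the similarity measure; the function names and the default `w=4` are illustrative, and the sampling and fingerprinting used to make the technique scale are omitted.

```python
def shingles(text, w=4):
    """Set of all contiguous w-token sequences in text."""
    tokens = text.split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard overlap of the shingle sets of a and b, in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0                            # both too short to compare
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(resemblance("a b c d", "a b c e", w=2))
```

Two pagelets could then be treated as having equal content when their resemblance exceeds a chosen threshold.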
Templates as frequent DOM trees (Yi and Liu, KDD-03 and IJCAI-03)
Pages with similar “look and feel” are basically reflected by their similar DOM trees.
Similar layout or presentation Style
Given a set of pages, the method merges their DOM trees to build a Compressed Structure Tree (CST).
Merge similar branches and
Split on differences
The final CST represents a set of templates. The algorithm uses the CST to clean Web pages and find main content blocks.
Compressed Structure Tree
[Figure: two DOM trees, d1 and d2, are merged into a CST. Each tree has a root and a BODY node with bc=white. One tree contains TABLE (width=800), SPAN and TABLE (bc=red) under BODY; the other contains TABLE (width=800) and TABLE (bc=red).]
Style sets shown in the CST:
{<BODY, {bc=white}>}
{<(TABLE,{width=800}), (SPAN,{}), (TABLE, {bc=red})>, <(TABLE,{width=800}), (TABLE, {bc=red})>}
Merging trees (or pages)
Element node: E = (Tag, Attr, TAGs, STYLEs, CHILDs)
Tag: tag name, e.g., TABLE, IMG
Attr: display attributes of Tag
TAGs: actual tag nodes
STYLEs: presentation styles
CHILDs: pointers to child element nodes
Many techniques and their combinations can be used in merging:
HTML tags
Visual information, e.g., size, location, and background color
Tree edit distance and alignment
Text contents at the leaf nodes
A combination of the above methods
Finding the main content and noisy blocks
Inner node importance:
$$\mathrm{NodeImp}(E) = \begin{cases} -\sum_{i=1}^{l} p_i \log_m p_i & \text{if } m > 1 \\ 1 & \text{if } m = 1 \end{cases} \qquad (1)$$
l = |E.STYLEs|, m = |E.TAGs|
p_i: percentage of tag nodes (in E.TAGs) using the i-th presentation style
Inner NodeImp(E): diversity of presentation styles
Leaf node importance:
$$\mathrm{NodeImp}(E) = \frac{\sum_{i=1}^{N} \bigl(1 - H_E(a_i)\bigr)}{N} = 1 - \frac{\sum_{i=1}^{N} H_E(a_i)}{N} \qquad (2)$$
N: number of features in E
a_i: a feature of content in E
(1 - H_E(a_i)): information contained in a_i
Leaf NodeImp(E): content diversity of E
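The inner node importance of equation (1) is an entropy over presentation styles, taken with logarithm base m so that it lies in [0, 1]. A minimal sketch, assuming each tag node in E.TAGs uses exactly one presentation style, so that a node can be summarized by the number of tag nodes per style; the function name and the input format are illustrative.

```python
import math


def inner_node_importance(style_counts):
    """Equation (1): style diversity of an inner element node.

    style_counts[i] is the number of tag nodes using the i-th presentation
    style; their sum is m = |E.TAGs| and their number is l = |E.STYLEs|.
    """
    m = sum(style_counts)
    if m < 1:
        raise ValueError("an element node needs at least one tag node")
    if m == 1:
        return 1.0
    total = 0.0
    for count in style_counts:
        if count > 0:
            p = count / m
            total += p * math.log(p, m)       # log base m
    return -total


if __name__ == "__main__":
    print(inner_node_importance([1, 1, 1, 1]))   # every page styles it differently
    print(inner_node_importance([4]))            # all pages share one style
```

A node whose tag nodes all share one style scores close to 0 (template-like), while a node styled differently on every page scores 1.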
Identify the main content blocks or weighting features (words)
Similarly, features or words can also be evaluated. Based on the computation, one can combine the evaluations to:
Remove noisy blocks, e.g., navigation, site description, etc.
Weight features or words for data mining, as a feature selection mechanism based on the site structure of the pages.
Tree matching is also applicable.
Automatically extracting Web news (Reis et al, WWW-04)
The setting is similar to (Yi & Liu, KDD-03).
Given a set of crawled pages, find the template patterns to identify news articles in the pages.
It first generates clusters of pages that share the same templates.
The distance measure is based on tree edit distance.
Each cluster is then generalized into an extraction tree by tree matching.
Pattern: a regular expression for trees.
Fragment (or block) detection (Ramaswamy et al, WWW-04)
This paper also uses the shingling method in (Broder et al, WWW-97).
Its block (called fragment in the paper) segmentation is more complex. It is based on the AF-tree (augmented fragment tree), which is a compact DOM tree with
text formatting tags removed and
shingle values (encoding) attached to nodes.
The method detects Shared Fragments.
Detecting shared fragments
Given a set of AF-trees, it uses the following to detect shared fragments in a set of pages.
Minimum Fragment Size (MinFragSize): This parameter specifies the minimum size of the detected fragment.
Sharing Factor (ShareFactor): This indicates the minimum number of pages that should share a segment in order for it to be declared a fragment.
Minimum Matching Factor (MinMatchFactor): This parameter specifies the minimum overlap between the SubtreeShingles to be considered as a shared fragment.
Site-level detection - summary
Templates = “page-fragments” that recur across several pages of a website.
E.g., copyright notices, navigation links.
A page-fragment can be
HTML code (tags)
Visible text
DOM nodes (structure + text)
Simple two-pass algorithms:
Hash page-fragments and count occurrences (or near duplicates)
Mark templates in the second pass
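The two passes can be sketched as follows, assuming each page has already been split into page-fragments given as strings. Fragments are hashed after whitespace and case normalization (an assumed, crude stand-in for near-duplicate detection), and a fragment is marked as template when it occurs on at least `min_pages` pages; the function names and the threshold are illustrative.

```python
import hashlib
from collections import Counter


def fragment_key(fragment):
    """Hash of a fragment after whitespace and case normalization."""
    normalized = " ".join(fragment.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def mark_templates(pages, min_pages=2):
    """pages: list of pages, each a list of fragment strings.

    Returns, for each page, a list of (fragment, is_template) pairs.
    """
    # Pass 1: count on how many pages each fragment hash occurs.
    page_counts = Counter()
    for page in pages:
        page_counts.update({fragment_key(f) for f in page})
    # Pass 2: mark fragments that recur across enough pages.
    return [
        [(f, page_counts[fragment_key(f)] >= min_pages) for f in page]
        for page in pages
    ]


if __name__ == "__main__":
    site = [
        ["Home | About", "Story one", "(c) 2007"],
        ["Home | About", "Story two", "(c)  2007"],
    ]
    for page in mark_templates(site):
        print(page)
```

Counting each hash once per page avoids treating a fragment repeated within a single page as a site-level template.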
Learning Block Importance Models (Song et al, WWW-04)
Different blocks in a page are not equally important.
Web page designers tend to organize page contents in a reasonable way: giving prominence to important items and de-emphasizing unimportant parts with features such as position, size, color, word, image, link, etc.
A block importance model is a function that maps from features to the importance of each block.
Blocks are categorized into a few importance levels.
A machine learning method is used to learn the model.
Block segmentation is done using a visual-based method (Cai et al. APWeb-03).
Machine learning and user study
Feature engineering
Spatial features: BlockCenterX, BlockCenterY, BlockRectWidth, BlockRectHeight
Content features: {ImgNum, ImgSize, LinkNum, LinkTextLength, InnerTextLength, InteractionNum, InteractionSize, FormNum, FormSize}
Learning methods: SVM and neural networks
SVM performs better
A user study is also done, showing that there is general agreement on block importance.
Page-level detection without labeling (Chakrabarti et al. WWW-07)
Building a classifier to detect page-level templates without manually labeling training data.
Use site-level template detection to generate training data.
Identify a set of features for learning, e.g., placement on screen, background color, link density, fraction of text outside anchor text, etc.
Use logistic regression to learn a two-class classifier, which is used to label each HTML element in a new page.
Applications of Page Segmentation
Removing noise or identifying main content blocks of a page, e.g., for information retrieval and data mining purposes (Lin and Ho, KDD-02; Yi & Liu, IJCAI-03; Yi, Liu & Li, KDD-03; Reis et al, WWW-04; etc).
Information unit-based or block-based Web search and ranking (e.g., Li et al, CIKM 02; Cai et al, SIGIR-04; Bar-Yossef and Rajagopalan, WWW-02).
Browsing on small mobile devices (Gupta et al, WWW-03; Ying and Lee WWW-04).
Cost-effective caching (Ramaswamy et al, WWW-04).
Etc.
Identify what you need
Instead of detecting templates and removing noise, we can directly find what we want:
News: a lot of text
Products: a lot of regularly structured objects
…
Unwanted information is thrown away.
Conclusions
This talk introduced two important topics of Web data mining:
Structured data extraction
Template detection
The coverage is by no means exhaustive. Both are still actively researched, e.g., WWW-07.
They both are also ready for applications. Both visual cues and HTML tag information are useful and important.