1 A Hierarchical Approach to Wrapper Induction Presentation by Tim Chartrand of A paper bypaper Ion...

1

A Hierarchical Approach to Wrapper Induction

Presentation by Tim Chartrand of

A paper by

Ion Muslea, Steve Minton and

Craig Knoblock

http://blondie.cs.byu.edu/CS652/muslea99hierarchical.pdf

2

Introduction

IE from web pages is important because of the amount of semistructured information on the web

IE depends on the construction of wrappers

Manual wrapper construction is tedious and hard

Previous wrapper learning systems require a lot of hand-marked training data

STALKER is a supervised learning algorithm for inducing wrappers

STALKER requires fewer triaining examples than other approaches and is able to wrap more pages

3

Structure of a document

Semistructured documents follow a formal grammar

An Embedded Catalog (EC) represents the structure of a document as a tree Leaves are items of interest Internal nodes are lists of k-tuples Each item of a k-tuple can be a leaf or another

list

5

Extraction Rules

Each tree node represents a sequence of tokens

The root node represents the entire document

Each node’s sequence is a subsequence of its parent’s

An extraction rule is associated with each edge of the tree Specifies how to extract the child content x from the parent content p In other words describes how to match the prefix of x w.r.t p --

Prefixx(p) Each child’s extraction rule is independent of its siblings

Use Landmarks – either tokens or wildcards (token classes)

Can be disjunctive – apply rule R1 or rule R2

6

Extraction Rules – examples

SkipTo(<b>)

SkipTo(Name)SkipTo(<b>)

SkipTo(Name Symbol HtmlTag)

SkipTo(</b>)

SkipTo(,)SkipUntil(Num)

SkipTo(AllCaps)NextLMark(Num)

7

Extraction Rules as Finite Automata

An extraction rule is equivalent to an FSA

Transition conditions correspond to the landmarks used in the extraction rules

Empty looping transitions are taken when a landmark has not been reached

R1 = SkipTo(()

R2 = SkipTo(Phone)SkipTo(<B>)

8

Learning Algorithm

Sequential Covering Algorithm

Choose the rule that covers most examples and remove the examples it covers

Return the disjunction of all rules found

11

Related Work

Manual wrapper construction TSIMMIS, procedural languages, etc Hard and error prone

Automatic wrapper construction WEIN

Less expressive – only uses the equivalent of SkipTo() without wildcards Not able to express arbitrarily deep tuples

SoftMealy Generates rules as finite transducers More expressive than WEIN but strictly less expressive than STALKER Must see all possible item orderings

WHISK, RAPIER, and SRV Use NLP Techniques Use landmarks similar to STALKER

Ontology approach – DEG Can handle lists with multiplicity constraints Character based rather than token based landmarks

12

Conclusions

STALKER uses EC formalism to turns a hard problem into several smaller ones Unseen permutations of data items can be recognized Arbitrarily long lists can be recognized The entire document can be interpreted as a list of tuples

STALKER rules use an expressive landmark based format

High accuracy wrappers can be induced automatically based on very few training examples compared with other systems

13

Related Work – BYU DEG

RAPIER rules correspond closely to DEG data frames. Data frames are finer-grained, based on character patterns,

whereas rules are based on word patterns Pre-filler and Post-filler patterns correspond closely to data frame

contexts and key words Semantic categories correspond closely with lexicons

Not mentioned how RAPIER handles multiple record documentsRapier data structure is given by the template (slots) defined in the input dataRAPIER is very similar in purpose to what Joe is trying to do – learn extraction rules based on a filled in form

14

Conclusions

Extracting desired pieces of information from NL text is importantManually constructing IE systems too hardRAPIER uses relational learning to build a set of pattern-match rules given a database of texts and filled templatesLearned patterns employ syntactic and semantic information to match slot fillers and contextFairly accurate results can be obtained for a real-world problem with relatively small datasetsRAPIER compares favorably with other IE learning systems

Date post:	02-Jan-2016
Category:	Documents
Upload:	cuthbert-fowler
View:	214 times
Download:	0 times

1 A Hierarchical Approach to Wrapper Induction Presentation by Tim Chartrand of A paper bypaper Ion...

Documents