DIADEM: domain-centric intelligent automated data extraction methodology
Automatically Learning Gazetteers from the Deep Web
Christian Schallhart
April 19th, 2012 @ WWW in Lyon
joint work with Tim Furche, Giovanni Grasso, Giorgio Orsi, and Cheng Wang
Friday, May 11, 2012
AMBER: Extraction from Result Pages
2

<offer>
  <price><currency>GBP</currency><amount>4000000</amount></price>
  <bedrooms>5</bedrooms>
  <location>Radcliffe House, Boars Hill, Oxford, Oxfordshire</location>
</offer>
<offer>
  <price><currency>GBP</currency><amount>3950000</amount></price>
  <bedrooms>7</bedrooms>
  <location>Jarn Way, Boars Hill, Oxford, Oxfordshire</location>
</offer>
<offer>
  <price><currency>GBP</currency><amount>3950000</amount></price>
  <bedrooms>6</bedrooms>
  <location>Old Boars Hill, Oxford</location>
</offer>

[Chart: precision and recall for data areas, records, and attributes, all above 97.5%; >98.5% F1 score]

Domain Knowledge (no per-site training):
• Little Ontology (mandatory attribute types) — quite easy.
• Gazetteers (term lists) — a lot of work!
AMBER: From Extraction to Learning
3

Gazetteers (term lists): a lot of work!
Leverage the repeated structure in result pages to learn new terms.
AMBER: Automatically Learning Gazetteers
4

(Example <offer> records as on slide 2.)

Gazetteers (term lists): a lot of work! (>98.5% F1 score)

• Page Segmentation
  ‣ clusters attribute instances
  ‣ analyses repeated structures
• Attribute Alignment: matches knowledge with observations
• Gazetteer Learning: turns phrases into terms

AMBER annotates first, to integrate semantic information into its repeated structure analysis.
AMBER: Page Segmentation
5

Page Retrieval: Mozilla via XULRunner, GATE annotations
AMBER: Page Segmentation
6

Page Retrieval: Mozilla via XULRunner, GATE annotations
Data Area Identification: pivot node clustering

A data area d is a maximal DOM subtree which contains ≥2 pivot nodes that are
• depth consistent (depth(n) = k ± ε),
• distance consistent (pathlen(n, n') = k ± δ), and
• continuous,
such that their least common ancestor is d's root.

[Diagram: example DOM tree with root R, data areas D, list nodes L, pivot nodes P, attribute nodes A, and noise X]
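The depth and distance consistency conditions above can be sketched as a few lines of Python. This is a hypothetical simplification for illustration, not AMBER's actual implementation: it assumes pivot depths and the path lengths between consecutive pivots have already been computed from the DOM, and the tolerances ε and δ are illustrative defaults.

```python
# Sketch of the data area consistency checks (illustrative, not AMBER's API).

def depth_consistent(depths, eps=1):
    """All pivot node depths lie within k +/- eps for some k."""
    return max(depths) - min(depths) <= 2 * eps

def distance_consistent(path_lengths, delta=1):
    """All path lengths between consecutive pivots lie within k +/- delta."""
    return max(path_lengths) - min(path_lengths) <= 2 * delta

def is_data_area(pivot_depths, pivot_distances):
    """A candidate subtree qualifies if it has >=2 consistent pivot nodes."""
    return (len(pivot_depths) >= 2
            and depth_consistent(pivot_depths)
            and distance_consistent(pivot_distances))
```

For instance, three pivots at depths 5, 5, 6 with pairwise distances 4, 4 pass the check, while wildly varying depths fail it.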
AMBER: Page Segmentation
7-8

Page Retrieval: Mozilla via XULRunner, GATE annotations
Data Area Identification: pivot node clustering
Record Segmentation: head/tail cut-off, segmentation boundary shifting

A result record is a sequence of children of the data area root.
A result record segmentation divides a data area
• into non-overlapping records,
• containing the same number of siblings,
• each based on a single selected pivot node.

[Diagram: record boundaries in the example DOM tree]
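The segmentation definition above (non-overlapping records of equal width, one per pivot, with head and tail cut off) can be sketched as follows. This is a simplified illustration under the assumption that pivot positions among the data area root's children are already known; it omits AMBER's boundary shifting.

```python
# Sketch of equal-width record segmentation over a data area's children
# (illustrative simplification; boundary shifting is not modelled).

def segment_records(children, pivot_positions):
    """Cut the child sequence into one record per pivot, each spanning the
    same number of siblings; children before the first pivot (head) and
    beyond the last record (tail) are cut off."""
    if len(pivot_positions) < 2:
        return []
    width = pivot_positions[1] - pivot_positions[0]
    return [children[p:p + width] for p in pivot_positions]
```

For example, with children `h A B x A B y A B t` and pivots at positions 1, 4, 7, the head `h` is discarded and three records `ABx`, `ABy`, `ABt` remain.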
AMBER: Attribute Alignment
9

The tag path of a node n in a record r is the tag sequence occurring on the child/next-sibling path from r's root to n.
The support of a type/tag path pair (t, p) is the fraction of records having an annotation for t at path p.

• Attribute Cleanup: discard attributes with low support
• Attribute Disambiguation: discard ambiguous attributes with lower support
• Attribute Generalisation: add new un-annotated attributes with sufficient support

[Diagram: cleanup, disambiguation, and generalisation on the example DOM tree]
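The support of a type/tag path pair can be sketched directly from its definition. The data model below (each record as a set of (type, tag path) annotations) is an assumption for illustration only.

```python
# Sketch of support computation for type/tag-path pairs; each record is
# modelled as a set of (type, tag_path) annotations (illustrative model).
from collections import Counter

def support(records):
    """Return, per (type, tag_path) pair, the fraction of records that
    carry an annotation of that type at that path."""
    counts = Counter(pair for rec in records for pair in set(rec))
    return {pair: n / len(records) for pair, n in counts.items()}

recs = [
    {("price", "div/span"), ("location", "div/p")},
    {("price", "div/span"), ("location", "div/p")},
    {("price", "div/span")},
    {("location", "div/p")},
]
```

Here both ("price", "div/span") and ("location", "div/p") have support 0.75; generalisation could then add the un-annotated node at the same path in the remaining record as a new attribute instance.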
AMBER: Gazetteer Learning
10

"Oxford, Walton Street, top-floor apartment"
→ Oxford | Walton Street | top-floor apartment

• Term Formulation
  ‣ split newly generated attributes into terms
  ‣ discard terms on black-lists and from non-overlapping attributes
• Term Validation
  ‣ track term relevance
  ‣ discard irrelevant ones
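Term formulation can be sketched as splitting an attribute value on separators and filtering against a blacklist. The splitting rule and the blacklist entries below are illustrative assumptions; AMBER's actual tokenisation and blacklists are not specified on this slide.

```python
# Sketch of term formulation: split a newly generalised attribute value
# into candidate terms, dropping blacklisted ones (illustrative only).
import re

BLACKLIST = {"top-floor apartment"}  # hypothetical non-location noise term

def formulate_terms(attribute_value):
    """Split on commas/semicolons, trim, and discard blacklisted terms."""
    terms = [t.strip() for t in re.split(r"[,;]", attribute_value) if t.strip()]
    return [t for t in terms if t.lower() not in BLACKLIST]
```

With the example attribute from the slide, "Oxford" and "Walton Street" survive as candidate location terms, while the blacklisted phrase is discarded.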
AMBER: Evaluation
11

Learning locations from 250 pages from 150 sites (UK real estate market)
Starting with a 25% sample of our full gazetteer (containing 33,243 terms)
• initially failed to annotate 328 locations
• after 3 learning rounds, learned 265 of those (recall: 80.6%, precision: 95.1%)
AMBER: Evaluation

[Chart: learning precision and recall per round (rounds 1-3)]
[Chart: locations learned vs. correct per round (rounds 1-3, counts 0-300)]
DEMO
AMBER GUI
1. Webpage with identified records and attributes
2. Domain schema concepts
3. Learned terms with confidence values
4. URLs for analysis
5. Seed Gazetteer

AMBER Architecture:
• Reasoning: Attribute Alignment, Record Segmentation, Data Area Identification
• Annotation: GATE, Domain Gazetteers
• Web Access: Browser Common API, WebKit/Mozilla
AMBER Learning Evaluation

[Charts: learning accuracy (precision/recall per round) and terms learnt per round (rounds 1-3)]
Learning Accuracy / Terms Learnt
• Learning locations from 250 pages from 150 sites (UK real estate)
• Starting with 25% of the full gazetteer (containing 33,243 terms)

Table 1: Total learned instances
              unannotated instances (328)    total instances (1484)
rnd.  aligned  corr.  prec.   rec.           prec.   rec.
1     226      196    86.7%   59.2%          84.7%   81.6%
2     261      248    95.0%   74.9%          93.2%   91.0%
3     271      265    97.8%   80.6%          95.1%   93.8%
4     271      265    97.8%   80.6%          95.1%   93.8%

Table 2: Incrementally recognized instances and learned terms
rnd.  unannot.  recog.  corr.  prec.    rec.   terms
1     331       225     196    86.7%    59.2%  262
2     118       34      32     94.1%    27.1%  29
3     79        16      16     100.0%   20.3%  4
4     63        0       0      100.0%   0%     0
miles around Oxford. AMBER uses the content of the seed gazetteer to identify the position of known terms such as locations. In particular, AMBER identifies three potential new locations, viz. "Oxford", "Witney", and "Wallingford", with confidences of 0.70, 0.62, and 0.61, respectively. Since the acceptance threshold for new items is 50%, all three locations are added to the gazetteer.

We repeat the process for several websites and show how AMBER identifies new locations with increasing confidence as the number of analyzed websites grows. We then leave AMBER to run over 250 result pages from 150 sites of the UK real estate domain, in a configuration for fully automated learning, i.e., g = l = u = 50%, and we visualize the results on sample pages.

Starting with the sparse gazetteer (i.e., 25% of the full gazetteer), AMBER performs four learning iterations before it saturates, i.e., no longer learns any new terms. Table 1 shows the outcome of each of the four rounds. Using the incomplete gazetteer, we initially fail to annotate 328 of 1484 attribute instances. In the first round, the gazetteer learning step identifies 226 unannotated instances; 196 of those are correctly identified, which yields a precision and recall of 86.7% and 59.2% on the unannotated instances, and 84.7% and 81.6% on all instances. The increase in precision is stable across all learning rounds, so that, at the end of the fourth iteration, AMBER achieves a precision of 97.8% and a recall of 80.6% on the unannotated instances, and an overall precision and recall of 95.1% and 93.8%, respectively.

Table 2 shows the incremental improvements made in each round. For each round, we report the number of unannotated instances, the number of instances recognized through attribute alignment, and the number of correctly identified instances. For each round we also show the corresponding precision and recall metrics, as well as the number of new terms added to the gazetteer. Note that the number of learned terms is larger than the number of instances in round 1, as splitting the instances yields multiple terms. Conversely, in rounds 2 to 4, the number of terms is smaller than the number of instances, as terms occur in multiple instances simultaneously or are already blacklisted.

We also show the behavior of AMBER with different settings for the threshold g. In particular, increasing the value of g (i.e., the required support for discovered attributes) leads to higher precision of the learned terms at the cost of lower recall. The learning algorithm also converges faster for higher values of g.
Figure 4 illustrates our evaluation of AMBER on the real-estate domain. We evaluate AMBER on 150 UK real-estate web sites, randomly selected among 2810 web sites named in the yellow pages. For each site, we submit its main form with a fixed sequence of fillings to obtain one or, if possible, two result pages with at least two result records, and compare AMBER's results against a manually annotated gold standard. Using a full gazetteer, AMBER extracts data area, records, price, detailed page link, location, legal status, postcode, and bedroom number with more than 98% precision and recall. For less regular attributes such as property type, reception number, and bathroom number, precision remains at 98%, but recall drops to 94%. The results of our evaluation show that AMBER is able to generate human-quality examples for any web site in a given domain.

[Figure 4: AMBER evaluation on the real-estate domain — precision and recall per attribute (data area, records, price, details URL, location, legal, postcode, bedroom, property type, reception, bath)]
• Fails to annotate 328 of 1,484 locations
• Saturates after 3 rounds
[Chart: precision and recall for data areas, records, and attributes (97.5%-100%)]
[Chart: per-attribute precision and recall (94%-100%): reception, price, bathroom, legal status, detailed page, bedroom, location, postcode, property type]
Real Estate
[Chart: overall precision and recall for data areas, records, and attributes (97.5%-100%)]

Used Cars
[Chart: per-attribute precision and recall (90%-100%): door num, detail page, make, fuel type, color, price, mileage, car type, trans, model, engine size, location]

[Chart: per-attribute comparison, 250 pages (manual) vs. 2,215 pages (automatic): price, location, detailed page, bedroom, legal status, postcode, property type, bathroom, reception]

[Table: corpus statistics (sites, pages, records, attributes) for the real estate (manual) and used car (large scale, automatic, 2,215 pages) evaluations]
extracts attributes with >99% precision and >98% recall
Automatically Learning Gazetteers from the Deep Web

AMBER Learning Cycle
1. Page Segmentation
   • Page Retrieval: Mozilla, GATE annotations
   • Data Area Identification: pivot node (mandatory fields) clustering
   • Record Segmentation: head/tail cut-off, segment boundary shifting
2. Attribute Alignment
   • Attribute Cleanup: discard attributes of low support
   • Attribute Disambiguation: discard redundant attributes of lower support
   • Attribute Generalization: add new attributes of sufficient support
3. Gazetteer Learning
   • Term Formulation: split new attributes into terms
   • Term Validation: track term relevance, discard irrelevant terms
[Diagram 1: the learning cycle on an example DOM tree; the newly aligned attribute "Oxford, Walton Street, top-floor apartment" is split into the terms "Oxford", "Walton Street", and "top-floor apartment", which feed the gazetteers]
A data area is a maximal DOM subtree which contains ≥2 pivot nodes that are depth consistent (depth(n) = k ± ε), distance consistent (pathlen(n, n') = k ± δ), and continuous, such that their least common ancestor is the data area's root.
A result record is a sequence of children of the data area root.
A result record segmentation divides a data area into non-overlapping records, containing the same number of siblings, each based on a single selected pivot node.
[Diagrams 2-3: data areas and record boundaries in the example DOM tree]
The tag path of a node n in a record r is the tag sequence occurring on the child/next-sibling path from r's root to n.
The support of a type/tag path pair (t, p) is the fraction of records having an annotation for t at path p.

Remove terms which occur
• in black lists,
• in other gazetteers.

Compute confidence based on
• the support of its type/tag path pair,
• the relative size of the term within the entire attribute.
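The confidence heuristic just described can be sketched as below. How the two signals are combined is an assumption made for illustration (an equal-weight average); the poster names the ingredients but not the exact formula.

```python
# Illustrative sketch of term confidence: combine (a) the support of the
# term's type/tag-path pair with (b) the term's relative size within the
# whole attribute value. Equal weighting is an assumption, not AMBER's
# documented formula.
def term_confidence(path_support, term_len, attribute_len, w=0.5):
    relative_size = term_len / attribute_len
    return w * path_support + (1 - w) * relative_size
```

A term covering half of an attribute whose path has support 0.8 would, under this assumed weighting, get confidence 0.65.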
Authors: Tim Furche, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang
diadem.cs.ox.ac.uk/amber
diadem-amber@cs.ox.ac.uk
Digital Home
Reasoning in Datalog (DLV) rules:
• stratified negation
• finite domains
• non-recursive aggregation
→ easy integration with domain knowledge

Number of Rules:
• Data Area Identification: 11
• Record Segmentation: 32
• Attribute Alignment: 34
AMBER Applications
• Result Page Analysis: data extraction; example generation for wrapper induction
• Gazetteer Learning: gazetteer, ontology

Part of DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology): analyzing the pages reached via OPAL to generate OXPath expressions for efficient extraction.
... but usable independently of DIADEM as well ...

Sponsors

AMBER Evaluation / AMBER Architecture

DIADEM: domain-centric intelligent automated data extraction methodology
diadem.cs.ox.ac.uk
Diagram callouts:
• P is only allowed to appear once, thus the second P with less support is dropped.
• X only occurs once and has too low support to be kept.
• A has a support of 3/4 at this node, hence we add the annotation.
• We inferred that this node is of type A, hence we learn its terms.

Friday all day, otherwise during breaks! Booth Number 1 (well-hidden in the corner).
Friday, May 11, 2012
!
"
#
$
%
AMBER GUI
1. Webpage with identified records and attributes
2. Domain schema concepts
3. Learned terms with confidence values
4. URLs for analysis
5. Seed Gazetteer
• Reasoning: Attribute Alignment, Record Segmentation, Data Area Identification
• Annotation: GATE, Domain Gazetteers
• Web Access: Browser Common API (WebKit, Mozilla)
AMBER Learning Evaluation
[Chart: learning precision, learning recall, total precision, and total recall over rounds 1–3]
[Charts "Learning Accuracy" and "Terms learnt": learnt vs. correct locations (0–300) over rounds 1–3]
• Learning locations from 250 pages from 150 sites (UK real estate)
• Starting with 25% of the full gazetteer (containing 33,243 terms)
              unannotated instances (328)    total instances (1484)
rnd.  aligned  corr.   prec.    rec.         prec.    rec.
1     226      196     86.7%    59.2%        84.7%    81.6%
2     261      248     95.0%    74.9%        93.2%    91.0%
3     271      265     97.8%    80.6%        95.1%    93.8%
4     271      265     97.8%    80.6%        95.1%    93.8%

Table 1: Total learned instances
rnd.  unannot.  recog.  corr.   prec.     rec.    terms
1     331       225     196     86.7%     59.2%   262
2     118       34      32      94.1%     27.1%   29
3     79        16      16      100.0%    20.3%   4
4     63        0       0       100.0%    0%      0

Table 2: Incrementally recognized instances and learned terms
miles around Oxford. AMBER uses the content of the seed gazetteer to identify the positions of known terms such as locations. In particular, AMBER identifies three potential new locations, viz. “Oxford”, “Witney”, and “Wallingford”, with confidences of 0.70, 0.62, and 0.61, respectively. Since the acceptance threshold for new items is 50%, all three locations are added to the gazetteer.
We repeat the process for several websites and show how AMBER identifies new locations with increasing confidence as the number of analyzed websites grows. We then let AMBER run over 250 result pages from 150 sites of the UK real estate domain, in a configuration for fully automated learning, i.e., g = l = u = 50%, and we visualize the results on sample pages.
Starting with the sparse gazetteer (i.e., 25% of the full gazetteer), AMBER performs four learning iterations before it saturates, i.e., does not learn any new terms. Table 1 shows the outcome of each of the four rounds. Using the incomplete gazetteer, we initially fail to annotate 328 out of 1,484 attribute instances. In the first round, the gazetteer learning step identifies 226 unannotated instances. 196 of those instances are correctly identified, which yields a precision and recall of 86.7% and 59.2% on the unannotated instances, and 84.7% and 81.6% on all instances. Precision increases steadily across the learning rounds, so that, at the end of the fourth iteration, AMBER achieves a precision of 97.8% and a recall of 80.6% on the unannotated instances, and an overall precision and recall of 95.1% and 93.8%, respectively.
Table 2 shows the incremental improvements made in each round. For each round, we report the number of unannotated instances, the number of instances recognized through attribute alignment, and the number of correctly identified instances. For each round, we also show the corresponding precision and recall metrics, as well as the number of new terms added to the gazetteer. Note that the number of learned terms is larger than the number of instances in round 1, as splitting the instances yields multiple terms. Conversely, in rounds 2 to 4, the number of terms is smaller than the number of instances, as some terms occur in multiple instances simultaneously or are already blacklisted.
We also show the behavior of AMBER with different settings for the threshold g. In particular, increasing the value of g (i.e., the required support for discovered attributes) leads to higher precision of the learned terms at the cost of lower recall. The learning algorithm also converges faster for higher values of g.
Figure 4 illustrates our evaluation of AMBER on the real-estate domain. We evaluate AMBER on 150 UK real-estate web sites, randomly selected among 2,810 web sites named in the yellow pages. For each site, we submit its main form with a fixed sequence of
fillings to obtain one or, if possible, two result pages with at least two result records, and compare AMBER’s results with a manually annotated gold standard. Using a full gazetteer, AMBER extracts data area, records, price, detailed-page link, location, legal status, postcode, and bedroom number with more than 98% precision and recall. For less regular attributes such as property type, reception number, and bathroom number, precision remains at 98%, but recall drops to 94%. The results of our evaluation show that AMBER generates human-quality examples for any web site in a given domain.

[Figure 4: AMBER evaluation on the real-estate domain; per-attribute precision and recall (94–100%) for data area, records, price, details URL, location, legal, postcode, bedroom, property type, reception, bath]
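The comparison against the gold standard reduces to set overlap. A minimal sketch, where representing instances as hashable tuples such as (page, attribute) is an assumption for illustration:

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted instances against a manually
    annotated gold standard; both arguments are sets of hashable
    instance identifiers, e.g. (page, record, attribute, value) tuples."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 1.0
    recall = correct / len(gold) if gold else 1.0
    return precision, recall
```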
4. REFERENCES
[1] V. Crescenzi and G. Mecca. Automatic Information Extraction from Large Websites. Journal of the ACM, 51(5):731–779, 2004.
[2] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N. Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Damljanovic, T. Heitz, M. A. Greenwood, H. Saggion, J. Petrak, Y. Li, and W. Peters. Text Processing with GATE (Version 6). The University of Sheffield, Department of Computer Science, 2011.
[3] N. N. Dalvi, P. Bohannon, and F. Sha. Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model. In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 335–348, 2009.
[4] N. N. Dalvi, R. Kumar, and M. A. Soliman. Automatic Wrappers for Large Scale Web Extraction. Proceedings of the VLDB Endowment, 4(4):219–230, 2011.
[5] I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical Wrapper Induction for Semistructured Information Systems. Autonomous Agents and Multi-Agent Systems, 4:93–114, 2001.
[6] P. Senellart, A. Mittal, D. Muschick, R. Gilleron, and M. Tommasi. Automatic Wrapper Induction from Hidden-Web Sources with Domain Knowledge. In Proc. of WIDM, pages 9–16, 2008.
[7] K. Simon and G. Lausen. ViPER: Augmenting Automatic Information Extraction with Visual Perceptions. In Proc. of the 14th ACM Conference on Information and Knowledge Management, pages 381–388, 2005.
[8] W. Su, J. Wang, and F. H. Lochovsky. ODE: Ontology-Assisted Data Extraction. ACM Transactions on Database Systems, 34(2), 2009.
[9] Y. Zhai and B. Liu. Structured Data Extraction from the Web Based on Partial Tree Alignment. IEEE Transactions on Knowledge and Data Engineering, 18(12):1614–1628, 2006.
• Fails to annotate 328 of 1,484 locations
• Saturates after three rounds
Real Estate
• Overall: [chart, precision and recall (97.5–100%) for data areas, records, attributes]
• Attributes: [chart, precision and recall (94–100%) for reception, price, bathroom, legal status, detailed page, bedroom, location, postcode, property type]
• Large scale: [chart, precision and recall (0–100%) for price, location, detailed page, bedroom, legal status, postcode, property type, bathroom, reception]

Used Cars
• Overall: [chart, precision and recall (97.5–100%) for data areas, records, attributes]
• Attributes: [chart, precision and recall (90–100%) for door num, detail page, make, fuel type, color, price, mileage, car type, trans, model, engine size, location]
[Table: evaluation scale; real estate: 250 pages, manually annotated gold standard; used car (large scale): 2,215 pages, automatic, with 20,723 records and 114,714 attributes]
extracts attributes with >99% precision and >98% recall
Automatically Learning Gazetteers from the Deep Web
AMBER Learning Cycle
1. Page Segmentation
   • Page Retrieval: Mozilla, GATE annotations
   • Data Area Identification: pivot node (mandatory fields) clustering
   • Record Segmentation: head/tail cut-off, segment boundary shifting
2. Attribute Alignment
   • Attribute Cleanup: discard attributes of low support
   • Attribute Disambiguation: discard redundant attributes of lower support
   • Attribute Generalization: add new attributes of sufficient support
3. Gazetteer Learning
   • Term Formulation: split new attributes into terms
   • Term Validation: track term relevance, discard irrelevant terms
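The learning cycle can be sketched as a fixpoint loop that re-runs segmentation, alignment, and term learning until a full pass adds no new terms (saturation). All stage functions are passed in as parameters here and are hypothetical stand-ins for AMBER's Datalog components:

```python
def amber_learning_cycle(pages, gazetteer, segment, align,
                         formulate_terms, validate):
    """Iterate page segmentation, attribute alignment, and gazetteer
    learning until saturation, i.e. until one full pass over the pages
    learns no new terms (the paper reports saturation in four rounds)."""
    gazetteer = set(gazetteer)
    while True:
        new_terms = set()
        for page in pages:
            records = segment(page, gazetteer)       # page segmentation
            attributes = align(records, gazetteer)   # attribute alignment
            new_terms |= validate(formulate_terms(attributes), gazetteer)
        if not new_terms:                            # fixpoint: saturated
            return gazetteer
        gazetteer |= new_terms
```

With toy stand-ins that split pages into tokens and validate any unknown token, one round learns the new terms and the next round confirms saturation.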
Example (Term Formulation): the attribute "Oxford, Walton Street, top-floor apartment" is split into the terms "Oxford", "Walton Street", and "top-floor apartment", which feed the gazetteers.
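The split shown above could be sketched as follows. The delimiter set (commas, semicolons, slashes) is an assumption; hyphens are deliberately not delimiters, so "top-floor" survives inside a term:

```python
import re

def formulate_terms(attribute_text):
    """Split a newly inferred attribute value into candidate gazetteer
    terms at punctuation boundaries (delimiter choice is illustrative)."""
    parts = re.split(r"\s*[,;/]\s*", attribute_text.strip())
    return [part for part in parts if part]
```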
A data area d is a maximal DOM subtree which contains ≥2 pivot nodes that are
• depth consistent (depth(n) = k ± ε),
• distance consistent (pathlen(n, n') = k ± δ), and
• continuous,
such that their least common ancestor is d's root.
A result record is a sequence of children of the data area root.
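A sketch of the two consistency checks on a candidate pivot set, treating ε and δ as configurable tolerances; testing the spread against a common central value, and the default tolerances, are assumptions:

```python
def consistent_pivots(depths, distances, eps=1, delta=2):
    """Check depth and distance consistency of a candidate pivot set:
    all pivot depths must lie within eps of a common value k, and all
    path lengths between consecutive pivots within delta of a common
    value, so the spreads are bounded by 2*eps and 2*delta."""
    if len(depths) < 2:           # a data area needs >= 2 pivot nodes
        return False
    depth_ok = max(depths) - min(depths) <= 2 * eps
    distance_ok = max(distances) - min(distances) <= 2 * delta
    return depth_ok and distance_ok
```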