Toward Tomorrow’s Semantic Web An Approach Based on Information Extraction Ontologies David W. Embley Brigham Young University Funded in part by the National Science Foundation

Toward Tomorrow’s Semantic Web
An Approach Based on Information Extraction Ontologies

David W. Embley
Brigham Young University

Funded in part by the National Science Foundation

Presentation Outline
• Grand Challenge
• Meaning, Knowledge, Information, Data
• Fun and Games with Data
• Information Extraction Ontologies
• Applications
• Limitations and Pragmatics
• Summary and Challenges

Grand Challenge

Semantic Understanding

Can we quantify & specify the nature of this grand challenge?

Grand Challenge

Semantic Understanding

“If ever there were a technology that could generate trillions of dollars in savings worldwide …, it would be the technology that makes business information systems interoperable.”

(Jeffrey T. Pollock, VP of Technology Strategy, Modulant Solutions)

Grand Challenge

Semantic Understanding

“The Semantic Web: … content that is meaningful to computers [and that] will unleash a revolution of new possibilities … Properly designed, the Semantic Web can assist the evolution of human knowledge …”

(Tim Berners-Lee, …, Weaving the Web)

Grand Challenge

Semantic Understanding

“20th Century: Data Processing”
“21st Century: Data Exchange”
“The issue now is mutual understanding.”

(Stefano Spaccapietra, Editor in Chief, Journal on Data Semantics)

Grand Challenge

Semantic Understanding

“The Grand Challenge [of semantic understanding] has become mission critical. Current solutions … won’t scale. Businesses need economic growth dependent on the web working and scaling (cost: $1 trillion/year).”

(Michael Brodie, Chief Scientist, Verizon Communications)

What is Semantic Understanding?

Understanding: “To grasp or comprehend [what’s] intended or expressed.”

Semantics: “The meaning or the interpretation of a word, sentence, or other language form.”

- Dictionary.com


Can We Achieve Semantic Understanding?

“A computer doesn’t truly ‘understand’ anything.”

But computers can manipulate terms “in ways that are useful and meaningful to the human user.”

- Tim Berners-Lee

Key Point: it only has to be good enough. And that’s our challenge and our opportunity!



Information Value Chain

Meaning

Knowledge

Information

Data

Translating data into meaning


Foundational Definitions

• Meaning: knowledge that is relevant or activates
• Knowledge: information with a degree of certainty or community agreement (ontology)
• Information: data in a conceptual framework
• Data: attribute-value pairs

- Adapted from [Meadow92]


Data

Attribute-Value Pairs
• Fundamental for information
• Thus, fundamental for knowledge & meaning

Data Frame
• Extensive knowledge about a data item
  - Everyday data: currency, dates, time, weights & measures
  - Textual appearance, units, context, operators, I/O conversion
• Abstract data type with an extended framework
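
To make the data frame idea concrete, here is a minimal Python sketch of one for US-dollar prices. It is only an illustration of the listed ingredients (textual appearance, units, context, operators, I/O conversion); the class name, the regular expression, and the keyword list are assumptions, not part of the talk.

```python
import re
from decimal import Decimal


class PriceDataFrame:
    """Illustrative data frame for a US-dollar price (all names are assumptions)."""

    unit = "USD"                                    # units
    context_keywords = ("price", "asking", "cost")  # context
    # Textual appearance: "$" followed by digits, optional thousands groups and cents.
    value_pattern = re.compile(r"\$(\d{1,3}(?:,\d{3})+|\d+)(\.\d{2})?")

    def recognize(self, text):
        """Return every substring of text that looks like a price."""
        return [m.group(0) for m in self.value_pattern.finditer(text)]

    def to_internal(self, appearance):
        """Input conversion: '$11,995' becomes Decimal('11995')."""
        return Decimal(appearance.lstrip("$").replace(",", ""))

    def to_text(self, value):
        """Output conversion: Decimal('11995') becomes '$11,995.00'."""
        return f"${value:,.2f}"

    def less_than(self, a, b):
        """Operator: compare two textual prices by their internal values."""
        return self.to_internal(a) < self.to_internal(b)


frame = PriceDataFrame()
found = frame.recognize("Asking only $11,995. Call now")
```

The recognizer and the conversions live in one place, which is what distinguishes a data frame from a bare abstract data type.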


?

Olympus C-750 Ultra Zoom

• Sensor Resolution: 4.2 megapixels
• Optical Zoom: 10 x
• Digital Zoom: 4 x
• Installed Memory: 16 MB
• Lens Aperture: F/8-2.8/3.7
• Focal Length min: 6.3 mm
• Focal Length max: 63.0 mm

Digital Camera

?

• Year: 2002
• Make: Ford
• Model: Thunderbird
• Mileage: 5,500 miles
• Features: Red, ABS, 6 CD changer, keyless entry
• Price: $33,000
• Phone: (916) 972-9117

Car Advertisement

?

Flight #    Class   From   Time/Date            To    Time/Date           Stops
Delta 16    Coach   JFK    6:05 pm  02 01 04    CDG   7:35 am  03 01 04   0
Delta 119   Coach   CDG    10:20 am 09 01 04    JFK   1:00 pm  09 01 04   0

Airline Itinerary

?

Monday, October 13, 2003

Group A       W   L   T   GF   GA   Pts.
USA           3   0   0   11   1    9
Sweden        2   1   0   5    3    6
North Korea   1   2   0   3    4    3
Nigeria       0   3   0   0    11   0

Group B       W   L   T   GF   GA   Pts.
Brazil        2   0   1   8    2    7
…

World Cup Soccer

?

• Calories: 250 cal
• Distance: 2.50 miles
• Time: 23.35 minutes
• Incline: 1.5 degrees
• Speed: 5.2 mph
• Heart Rate: 125 bpm

Treadmill Workout

?

• Place: Bonnie Lake
• County: Duchesne
• State: Utah
• Type: Lake
• Elevation: 10,000 feet
• USGS Quad: Mirror Lake
• Latitude: 40.711°N
• Longitude: 110.876°W

Maps

• Place: Bonnie Lake
• County: Duchesne
• State: Utah
• Type: Lake
• Elevation: 10,100 feet
• USGS Quad: Mirror Lake
• Latitude: 40.711°N
• Longitude: 110.876°W


Information Extraction Ontologies

[Figure: a Source and a Target, with the labels “Information Extraction” and “Information Exchange”]

What is an Extraction Ontology?

Augmented Conceptual-Model Instance
• Object & relationship sets
• Constraints
• Data frame value recognizers

Robust Wrapper (Ontology-Based Wrapper)
• Extracts information
• Works even when site changes or when new sites come on-line
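
The three ingredients of the augmented conceptual-model instance can be sketched as a small Python data structure. This is a hedged illustration, not the author's implementation: the class names, the `violations` helper, and the specific constraints chosen for Mileage and Feature are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObjectSet:
    name: str
    value_pattern: Optional[str] = None  # data frame value recognizer (regex), if any


@dataclass
class RelationshipSet:
    target: str                # object set related to the primary object set
    min_count: int             # minimum participation
    max_count: Optional[int]   # maximum participation; None stands for "*"


@dataclass
class ExtractionOntology:
    primary: str
    object_sets: dict = field(default_factory=dict)
    relationship_sets: list = field(default_factory=list)

    def violations(self, extracted):
        """Return the object sets whose number of extracted values breaks a constraint."""
        bad = []
        for rel in self.relationship_sets:
            n = len(extracted.get(rel.target, []))
            too_few = n < rel.min_count
            too_many = rel.max_count is not None and n > rel.max_count
            if too_few or too_many:
                bad.append(rel.target)
        return bad


car_ads = ExtractionOntology(
    primary="CarAds",
    object_sets={
        "Mileage": ObjectSet("Mileage", r"[1-9]\d{0,2}[kK]"),
        "Feature": ObjectSet("Feature"),
    },
    relationship_sets=[
        RelationshipSet("Mileage", 0, 1),     # assumed: at most one mileage per ad
        RelationshipSet("Feature", 0, None),  # assumed: any number of features
    ],
)
```

Because nothing here refers to page layout, the same structure applies unchanged when a site is redesigned, which is the sense in which the wrapper is robust.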

CarAds Extraction Ontology

[Figure: conceptual model for CarAds. Object sets: CarAds, Year, Make, Model, Trim, ModelTrim, Mileage, Price, PhoneNr, Color, Feature, Accessory, BodyType, OtherFeature, Engine, Transmission. Relationship sets are labeled “has” and carry participation constraints such as 0:1, 0:*, 1:*, 0:0.7:1, 0:0.78:1, and 0:0.9:1.]

<ObjectSet x="329" y="51" lexical="true" name="Mileage" id="osmx50">
  <DataFrame>
    <InternalRepresentation>
      <DataType typeName="String"/>
    </InternalRepresentation>
    <ValuePhraseList>
      <ValuePhrase hint="Mileage Pattern 1">
        <ValueExpression color="ffffff">
          <ExpressionText>[1-9]\d{0,2}[kK]</ExpressionText>
        </ValueExpression>
        <LeftContextExpression color="ffffff">
        …
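
As a rough illustration of how such a data frame could drive recognition, the Python sketch below reads the value expression out of the XML and applies it to a piece of text. The fragment is a closed-off copy of what the slide shows, with the elided left-context expression simply left out; the sample ad text and the helper name are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Closed-off copy of the slide's fragment; the elided parts are omitted, not guessed.
FRAGMENT = r"""
<ObjectSet x="329" y="51" lexical="true" name="Mileage" id="osmx50">
  <DataFrame>
    <InternalRepresentation>
      <DataType typeName="String"/>
    </InternalRepresentation>
    <ValuePhraseList>
      <ValuePhrase hint="Mileage Pattern 1">
        <ValueExpression color="ffffff">
          <ExpressionText>[1-9]\d{0,2}[kK]</ExpressionText>
        </ValueExpression>
      </ValuePhrase>
    </ValuePhraseList>
  </DataFrame>
</ObjectSet>
"""


def load_value_patterns(xml_text):
    """Return the object-set name and its compiled value expressions."""
    root = ET.fromstring(xml_text)
    patterns = [re.compile(e.text) for e in root.iter("ExpressionText")]
    return root.get("name"), patterns


name, patterns = load_value_patterns(FRAGMENT)
sample_ad = "2001 Honda Civic, 82K miles, one owner"  # invented sample text
matches = [m.group(0) for p in patterns for m in p.finditer(sample_ad)]
```

Here "Mileage Pattern 1" picks up values written as one to three digits followed by k or K, so `matches` holds only "82K" and the year is ignored.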

Extraction Ontologies: An Example of Semantic Understanding

• “Intelligent” Symbol Manipulation
• Gives the “Illusion of Understanding”
• Obtains Meaningful and Useful Results


A Variety of Applications
• Information Extraction
• Semantic Web Page Annotation
• Free-Form Semantic Web Queries
• Task Ontologies for Free-Form Service Requests
• High-Precision Classification
• Schema Mapping for Ontology Alignment
• Record Linkage
• Accessing the Hidden Web
• Ontology Discovery and Generation
• Challenging Applications (e.g. BioInformatics)


Application #1

Information Extraction

'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888

Constant/Keyword Recognition

Descriptor/String/Position(start/end)

Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
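
A recognition pass of this kind can be sketched in a few lines of Python. The regular expressions below are assumptions written for this one ad, not the recognizers of the actual system, and only a few descriptors are covered. Positions are reported 1-based and inclusive, in the Descriptor|String|start|end layout.

```python
import re

AD = ("'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. "
      "Previous owner heart broken! Asking only $11,995. #1415. "
      "JERRY SEINER MIDVALE, 566-3800 or 566-3888")

# Hypothetical recognizers, listed in the order used to break position ties.
RECOGNIZERS = [
    ("Year", r"(?<=')\d{2}\b"),
    ("Price", r"(?<=\$)\d{1,3},\d{3}\b"),
    ("Mileage", r"\b\d{1,3},\d{3}\b"),
    ("KEYWORD(Mileage)", r"\bmiles\b"),
]


def recognize(text):
    """Return (descriptor, string, start, end) tuples with 1-based inclusive positions."""
    hits = []
    for descriptor, pattern in RECOGNIZERS:
        for m in re.finditer(pattern, text):
            hits.append((descriptor, m.group(0), m.start() + 1, m.end()))
    hits.sort(key=lambda h: h[2])  # stable sort keeps recognizer order on ties
    return hits


lines = ["|".join(str(part) for part in hit) for hit in recognize(AD)]
```

Note that 11,995 is reported twice, once as a Price and once as a Mileage: recognition deliberately over-generates and leaves the choice to a later step.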

Heuristics

• Keyword proximity
• Subsumed and overlapping constants
• Functional relationships
• Nonfunctional relationships
• First occurrence without constraint violation
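
The slides name the heuristics without spelling out their implementation. The Python sketch below is one plausible reading of two of them, subsumed constants and keyword proximity, applied to part of the recognition output for the Chevy ad; the function names and the distance measure are assumptions.

```python
# (descriptor, string, start, end), 1-based inclusive positions
candidates = [
    ("Year", "97", 2, 3),
    ("Make", "CHEV", 5, 8),
    ("Make", "CHEVY", 5, 9),
    ("Mileage", "7,000", 38, 42),
    ("Price", "11,995", 100, 105),
    ("Mileage", "11,995", 100, 105),
]
keywords = [("Mileage", "miles", 44, 48)]


def drop_subsumed(cands):
    """Drop a candidate that lies inside a longer candidate of the same descriptor."""
    kept = []
    for c in cands:
        subsumed = any(
            o is not c and o[0] == c[0]
            and o[2] <= c[2] and c[3] <= o[3]
            and (o[3] - o[2]) > (c[3] - c[2])
            for o in cands
        )
        if not subsumed:
            kept.append(c)
    return kept


def keyword_proximity(cands, kws):
    """Where a descriptor has a keyword, keep only the candidate nearest to it."""
    result = []
    for desc in dict.fromkeys(c[0] for c in cands):  # descriptors in first-seen order
        group = [c for c in cands if c[0] == desc]
        near = [k for k in kws if k[0] == desc]
        if near and len(group) > 1:
            def gap(c):
                # Characters between the candidate span and the closest keyword span.
                return min(max(k[2] - c[3], c[2] - k[3], 0) for k in near)
            group = [min(group, key=gap)]
        result.extend(group)
    return result


resolved = keyword_proximity(drop_subsumed(candidates), keywords)
```

In this reading, CHEV is discarded because CHEVY covers it, and 7,000 wins the Mileage slot because it sits right next to "miles", which leaves 11,995 as the Price alone.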


Database-Instance Generator

insert into Car values(1001, '97', 'CHEVY', 'Cavalier', '7,000', '11,995', '566-3800')
insert into CarFeature values(1001, 'Red')
insert into CarFeature values(1001, '5 spd')
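
A generator like this can be sketched with Python's built-in sqlite3 module. The slide shows only the inserted values, so the column names and the `insert_car` helper below are assumptions; the table names and values follow the slide.

```python
import sqlite3

record = {"Year": "97", "Make": "CHEVY", "Model": "Cavalier",
          "Mileage": "7,000", "Price": "11,995", "PhoneNr": "566-3800"}
features = ["Red", "5 spd"]

conn = sqlite3.connect(":memory:")
# Column names are assumed; the slide gives values only.
conn.execute("CREATE TABLE Car (CarId INTEGER PRIMARY KEY, Year TEXT, Make TEXT, "
             "Model TEXT, Mileage TEXT, Price TEXT, PhoneNr TEXT)")
conn.execute("CREATE TABLE CarFeature (CarId INTEGER, Feature TEXT)")


def insert_car(connection, car_id, rec, feats):
    """Store single-valued attributes in Car and multi-valued features in CarFeature."""
    connection.execute(
        "INSERT INTO Car VALUES (?, ?, ?, ?, ?, ?, ?)",
        (car_id, rec["Year"], rec["Make"], rec["Model"],
         rec["Mileage"], rec["Price"], rec["PhoneNr"]),
    )
    connection.executemany(
        "INSERT INTO CarFeature VALUES (?, ?)",
        [(car_id, f) for f in feats],
    )


insert_car(conn, 1001, record, features)
car_row = conn.execute("SELECT Make, Price FROM Car WHERE CarId = 1001").fetchone()
feature_rows = conn.execute(
    "SELECT Feature FROM CarFeature WHERE CarId = 1001 ORDER BY rowid").fetchall()
```

The split into two tables mirrors the ontology: functional relationships (one value per car) become columns of Car, while the nonfunctional Feature relationship gets its own table.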


Application #2

Semantic Web Page Annotation

Annotated Web Page (Demo)

OWL

<owl:Class rdf:ID="CarAds">
  <rdfs:label xml:lang="en">CarAds</rdfs:label>
  ......
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMileage" />
      <owl:minCardinality rdf:datatype="&xsd;nonNegativeInteger">0</owl:minCardinality>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMileage" />
      <owl:maxCardinality rdf:datatype="&xsd;nonNegativeInteger">1</owl:maxCardinality>
    </owl:Restriction>
  </rdfs:subClassOf>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMileage" />
      <owl:allValuesFrom rdf:resource="#Mileage" />
    </owl:Restriction>
  </rdfs:subClassOf>
  ……
</owl:Class>
……
<owl:Class rdf:ID="Mileage">
  <rdfs:label xml:lang="en">Mileage</rdfs:label>
  ……
</owl:Class>
……
<CarAds rdf:ID="CarAdsIns2">
  <CarAdsValue rdf:datatype="&xsd;string">2</CarAdsValue>
</CarAds>
……
<Mileage rdf:ID="MileageIns2">
  <StartingCharPosition rdf:datatype="&xsd;nonNegativeInteger">237</StartingCharPosition>
  <EndingCharPosition rdf:datatype="&xsd;nonNegativeInteger">241</EndingCharPosition>
</Mileage>
…….
<owl:Thing rdf:about="#CarAdsIns2">
  <hasMake rdf:resource="#MakeIns2" />
  <hasModel rdf:resource="#ModelIns2" />
  <hasYear rdf:resource="#YearIns2" />
  <hasMileage rdf:resource="#MileageIns2" />
  <hasPhoneNr rdf:resource="#PhoneNrIns2" />
  <hasPrice rdf:resource="#PriceIns2" />
</owl:Thing>
……


Application #3

Free-Form Semantic Web Queries

Find Ontology

“Tell me about cruises on San Francisco Bay. I’d like to know scheduled times, cost, and the duration of cruises on Friday of next week.”

Formulate Query

• Selection Constants: San Francisco Bay; Friday, Oct. 29th
• Projection: scheduled times, cost, duration

[Figure: a query expression of the form “Result = ( )”, annotated with its Projection, Selection Constants, and Join Path]

Source 1
• StartTime: 10:45 am, 12:00 pm, 1:15, 2:30, 4:00
• Price: $20.00, $16.00, $12.00
• Duration: (blank)

Source 2
• StartTime: 10:00 am, 10:45 am, 11:15 am, 12:00 pm, 12:30 pm, 1:15 pm, 1:45 pm, 2:30 pm, 3:00 pm, 3:45 pm, 4:15 pm, 5:00 pm
• Price: $17.00, $16.00, $12.00
• Duration: 1 Hour


Application #4

Task Ontologies for Free-Form Service Requests

Basic Idea

Service Request

Match with Task Ontology
• Domain Ontology
• Process Ontology

Complete, Negotiate, Finalize

I want to see a dermatologist next week; any day would be ok for me, at 4:00 p.m. The dermatologist must be within 20 miles from my home and must accept my insurance.

Domain Ontology

[Figure: domain ontology for appointments. Object sets: Appointment, Date, Time, Place, Address, Person, Name, Service Provider, Service Description, Duration, Cost, Insurance, Medical Service Provider, Doctor, Dermatologist, Pediatrician, Auto Service Provider, Auto Mechanic. Relationship-set labels: has, is at, is on, is with, is for, provides, accepts. Constants shown: "IHC", "DMBA". The diagram also carries an arrow marker (->).]

Appointment …
  context keywords/phrase: “appointment |want to see a |…”

Dermatologist …
  context keywords/phrases: “([D|d]ermatologist) | …”

I want to see a dermatologist next week; any day would be ok for me, at 4:00 p.m. The dermatologist must be within 20 miles from my home and must accept my insurance.
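
Matching a request against context keywords can be sketched as below. The two patterns are taken from the fragments shown on the slide, with the elided alternatives left out; the dictionary layout and function name are assumptions. The character class is written [Dd] here, since inside brackets the "|" of [D|d] would be matched literally.

```python
import re

# Context keywords per object set, limited to the alternatives visible on the slide.
CONTEXT_KEYWORDS = {
    "Appointment": r"appointment|want to see a",
    "Dermatologist": r"[Dd]ermatologist",
}

REQUEST = ("I want to see a dermatologist next week; any day would be ok for me, "
           "at 4:00 p.m. The dermatologist must be within 20 miles from my home "
           "and must accept my insurance.")


def matched_object_sets(request):
    """Map each object set to the keyword phrases that the request triggers."""
    found = {}
    for name, pattern in CONTEXT_KEYWORDS.items():
        phrases = [m.group(0) for m in re.finditer(pattern, request)]
        if phrases:
            found[name] = phrases
    return found


marked = matched_object_sets(REQUEST)
```

The object sets that come back are the parts of the domain ontology the request touches; constraints such as the time and distance would need their own recognizers.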

Page 70: Toward Tomorrow’s Semantic Web An Approach Based on Information Extraction Ontologies David W. Embley Brigham Young University Funded in part by the National.

Appointment

Place

Insurance

Service Provider

Person

NameDoctor

Pediatrcian

Service Description

Duration

Medical Service Provider

Auto Service Provider Auto Mechanic

Dermatologist

Address

Cost

Date

Time

has

is at

is on

has

provides

has

accepts

hashas

"IHC"

is with

is for

is at

is at

has

"DMBA"

is at

->Appointment

Place

Insurance

Service Provider

Person

NameDoctor

Pediatrcian

Service Description

Duration

Medical Service Provider

Auto Service Provider Auto Mechanic

Dermatologist

Address

Cost

Date

Time

has

is at

is on

has

provides

has

accepts

hashas

"IHC"

is with

is for

is at

is at

has

"DMBA"

is at

->

Appointment …

context keywords/phrase: “appointment |want to see a |…”

Dermatologist …

context keywords/phrases: “([D|d]ermatologist) | …”

I want to see a dermatologist next week; any day would

be ok for me, at 4:00 p.m. The dermatologist must be

within 20 miles from my home and must accept my

insurance.

Page 73:

Date
…
NextWeek(d1: Date, d2: Date)
    returns (Boolean {T, F})
    context keywords/phrases: next week | week from now | …

Distance
    internal representation: real;
    input (s: String)
    context keywords/phrases: miles | mile | mi | kilometers | kilometer | meters | meter | centimeter | …
…
Within(d1: Distance, “20”)
    returns (Boolean {T or F})
    context keywords/phrases: within | not more than | ≤ | …
    return (d1 ≤ d2)
…
end;
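A data frame such as Distance pairs a value recognizer with operations that context keywords trigger. The following is a minimal Python sketch of that idea, not the system's actual implementation. The names `parse_distance_miles`, `within`, and `find_within_constraint`, the regular expressions, and the unit table are assumptions made for illustration.

```python
import re

# Hypothetical unit table: conversion factors to miles (illustrative subset).
UNIT_TO_MILES = {
    "miles": 1.0, "mile": 1.0, "mi": 1.0,
    "kilometers": 0.621371, "kilometer": 0.621371,
    "meters": 0.000621371, "meter": 0.000621371,
}

# Value recognizer: a number followed by one of the unit keywords.
# Longer unit names come first so "miles" is preferred over "mi".
UNITS = "|".join(sorted(UNIT_TO_MILES, key=len, reverse=True))
DISTANCE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(" + UNITS + r")\b", re.IGNORECASE)

# Context keywords that signal the Within operation.
WITHIN_RE = re.compile(r"\b(within|not more than)\b", re.IGNORECASE)


def parse_distance_miles(text):
    """Return the first recognized distance in miles, or None."""
    match = DISTANCE_RE.search(text)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2).lower()
    return value * UNIT_TO_MILES[unit]


def within(d1, d2):
    """Within(d1, d2): true when distance d1 does not exceed d2."""
    return d1 <= d2


def find_within_constraint(request):
    """Return the distance bound in miles if the request states a Within constraint."""
    if WITHIN_RE.search(request) is None:
        return None
    return parse_distance_miles(request)


request = "The dermatologist must be within 20 miles from my home."
bound = find_within_constraint(request)
print(bound)                # 20.0
print(within(12.5, bound))  # True
```

The sketch takes the first distance it finds anywhere in the request; a fuller recognizer would tie the value to the position of the matched keyword.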

Page 74:

[Figure: appointment domain ontology, with the same object sets and relationship labels as on the preceding slides.]

Page 75:

[Figure: reduced appointment ontology. Object sets: Appointment, Place, Dermatologist, Person, Name, Address, Date, Time. Relationship labels: has, is at, is on, is with, is for.]

Page 76:

Process Ontology

[Figure: state net for the scheduling process. Recoverable labels, in flow order:]

@process ontology(domain ontology)
    initialize:
        task-view = create-task-view(domain ontology);
        task-constraints = create-task-constraints(task-view);

initial-task-view ready
    no missing information
    missing information:
        task-view = get-from-system(task-view);
        if (still missing values) task-view = get-from-user(task-view);

ready to schedule
    task-view = null:
        report that the appointment cannot be scheduled
    task-view != null:
        schedule-appointment(task-view.Person.Name, task-view.Service Provider.Name, task-view.Date, task-view.Time, task-view.Address);
        report that the appointment is scheduled;

Other labels in the figure: ready, @create, …
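The flow in the process ontology can be sketched as a small Python function. This is an illustrative reading of the slide, not the author's implementation. The function `run_process`, the dictionary-based task view, its field names, and the callback parameters are all assumptions, and the sample values only reuse names that appear elsewhere in the deck.

```python
REQUIRED = ("person", "provider", "date", "time", "address")


def missing(task_view):
    """List the required fields that still have no value."""
    return [field for field in REQUIRED if task_view.get(field) is None]


def run_process(task_view, get_from_system, get_from_user, schedule):
    """Walk the states 'initial-task-view ready' and 'ready to schedule'."""
    # State: initial-task-view ready.
    if missing(task_view):
        task_view = get_from_system(task_view)
        if task_view is not None and missing(task_view):
            task_view = get_from_user(task_view)

    # State: ready to schedule. A null task view means the request failed.
    if task_view is None:
        return "the appointment cannot be scheduled"
    schedule(**{field: task_view[field] for field in REQUIRED})
    return "the appointment is scheduled"


booked = []
result = run_process(
    {"person": "Lynn Jones", "provider": None, "date": "5 Jan 05",
     "time": "4:00", "address": None},
    # Hypothetical lookups: the system fills in the provider and address.
    get_from_system=lambda view: {**view, "provider": "Dr. Carter",
                                  "address": "Orem 600 State St."},
    get_from_user=lambda view: view,
    schedule=lambda **details: booked.append(details),
)
print(result)
```

Here the lookups signal failure by returning `None`, which mirrors the `task-view = null` branch on the slide.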

Page 77:

Specification Satisfaction

"Dr. Carter" "Lynn Jones"

Dermatologist0 "IHC" "DMBA"

Appointment7 "4:00" "5 Jan 05"

Person100

"Orem 600 State St." "Provo 300 State St."

Date(“28 Dec 04”) and NextWeek(“28 Dec 04”, “5 Jan 05”)
Dermatologist(Dermatologist0) is at Address(“Orem 600 State St.”) and Within(DistanceBetween(“Provo 300 State St.”, “Orem 600 State St.”), “22”)
∃i2 (Dermatologist(Dermatologist0) accepts Insurance(i2) and Equal(“IHC”, i2))
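Two of the constraints above can be checked mechanically. The Python sketch below is a hedged illustration: `parse_date`, `next_week`, and `accepts_insurance` are hypothetical helpers, and reading "next week" as the Monday-to-Sunday week after the current one is an assumption. The distance constraint is left out because it would need a geocoding service that the slides do not specify.

```python
from datetime import datetime, timedelta


def parse_date(text):
    """Parse dates written like '28 Dec 04'."""
    return datetime.strptime(text, "%d %b %y").date()


def next_week(d1, d2):
    """NextWeek(d1, d2): d2 lies in the calendar week after d1's week (assumed reading)."""
    monday1 = d1 - timedelta(days=d1.weekday())
    monday2 = d2 - timedelta(days=d2.weekday())
    return monday2 - monday1 == timedelta(days=7)


def accepts_insurance(accepted, wanted):
    """There exists an i2 among the accepted plans with Equal(wanted, i2)."""
    return any(i2 == wanted for i2 in accepted)


today = parse_date("28 Dec 04")
appointment = parse_date("5 Jan 05")
print(next_week(today, appointment))              # True
# Assumption: both plans shown next to Dermatologist0 are accepted plans.
print(accepts_insurance({"IHC", "DMBA"}, "IHC"))  # True
```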

Page 78:

Application #5

High-Precision Classification

Page 79:

An Extraction Ontology Solution

Page 80:

Density Heuristic

Document 1: Car Ads

Document 2: Items for Sale or Rent

Page 81:

Expected Values Heuristic

            Document 1:   Document 2:
            Car Ads       Items for Sale or Rent
Year        3             1
Make        2             0
Model       3             0
Mileage     1             1
Price       1             0
Feature     15            0
PhoneNr     3             4

Page 82:

Vector Space of Expected Values

            OV      D1    D2
Year        0.98    16    6
Make        0.93    10    0
Model       0.91    12    0
Mileage     0.45    6     2
Price       0.80    11    8
Feature     2.10    29    0
PhoneNr     1.15    15    11

D1: 0.996
D2: 0.567

[Figure: vectors OV, D1, and D2.]
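The slide does not name the measure behind the two scores, but the reported values are consistent with the cosine of the angle between each document vector and OV. A short Python check, with the vectors copied from the table and `cosine` as an illustrative helper name:

```python
import math

# Order: Year, Make, Model, Mileage, Price, Feature, PhoneNr.
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
D1 = [16, 10, 12, 6, 11, 29, 15]
D2 = [6, 0, 0, 2, 8, 0, 11]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(round(cosine(OV, D1), 3))  # 0.996
print(round(cosine(OV, D2), 3))  # 0.567
```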

Page 83:

Grouping Heuristic

Document 1: Car Ads

Year Make Model Price Year Model Year Make Model Mileage …

Document 2: Items for Sale or Rent

Year Mileage … Mileage Year Price Price …

Page 84:

Grouping

Car Ads                        Distinct
Year Year Make Model           3
Price Year Model Year          3
Make Model Mileage Year        4
Model Mileage Price Year       4
…
Grouping: 0.875

Sale Items                     Distinct
Year Year Year Mileage         2
Mileage Year Price Price       3
Year Price Price Year          2
Price Price Price Price        1
…
Grouping: 0.500

Expected Number in Group = $\lfloor \sum_{\text{1-Max}} \text{Ave} \rfloor = 4$ (for our example)

$$\text{Grouping} = \frac{\text{Sum of Distinct 1-Max Object Sets in each Group}}{\text{Number of Groups} \times \text{Expected Number in a Group}}$$

Car Ads: $\frac{3+3+4+4}{4 \times 4} = 0.875$

Sale Items: $\frac{2+3+2+1}{4 \times 4} = 0.500$
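The grouping measure can be reproduced with a few lines of Python. This is a sketch under two assumptions: the averages are taken from the OV column of the expected-values table for the five object sets treated here as 1-Max (Year, Make, Model, Mileage, Price), and `grouping` is an illustrative function name.

```python
import math

# Assumed averages for the 1-Max object sets (sum is 4.07, floor is 4).
AVERAGES = {"Year": 0.98, "Make": 0.93, "Model": 0.91, "Mileage": 0.45, "Price": 0.80}


def grouping(sequence, averages):
    """Distinct object sets per group, normalized by groups times expected group size."""
    size = math.floor(sum(averages.values()))
    groups = [sequence[i:i + size] for i in range(0, len(sequence), size)]
    distinct = sum(len(set(group)) for group in groups)
    return distinct / (len(groups) * size)


car_ads = ("Year Year Make Model Price Year Model Year "
           "Make Model Mileage Year Model Mileage Price Year").split()
sale_items = ("Year Year Year Mileage Mileage Year Price Price "
              "Year Price Price Year Price Price Price Price").split()

print(grouping(car_ads, AVERAGES))     # 0.875
print(grouping(sale_items, AVERAGES))  # 0.5
```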

Page 85:

Application #6

Schema Mapping for Ontology Alignment

Page 86:

Problem: Different Schemas

Target Database Schema
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

Different Source Table Schemas
• {Run #, Yr, Make, Model, Tran, Color, Dr}
• {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
• {Vehicle, Distance, Price, Mileage}
• {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

Page 87:

Solution: Remove Internal Factoring

Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*

Unnest: $\mu_{(\text{Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD})^*}\ \mu_{(\text{Year, Colour, Price, Auto, Air Cond, AM/FM, CD})^*}\ \text{Table}$

[Figure: source car table; visible cell values include ACURA and Legend.]
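Unnesting copies each factored-out Make and Model value into every row beneath it. A minimal Python sketch follows, assuming the nested table is held as a dictionary from Make to Model to rows. The function name `unnest` and the ACURA row values are made up for illustration, and the Honda row reuses values shown on a later slide.

```python
def unnest(table):
    """Flatten Make -> Model -> [rows] into one flat row per car."""
    flat = []
    for make, models in table.items():
        for model, rows in models.items():
            for row in rows:
                flat.append({"Make": make, "Model": model, **row})
    return flat


nested = {
    "ACURA": {"Legend": [{"Year": 1992, "Colour": "Grey"}]},  # hypothetical values
    "Honda": {"Civic EX": [{"Year": 1995, "Colour": "White"}]},
}

for car in unnest(nested):
    print(car)
```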

Page 88:

Solution: Replace Boolean Values

$\beta^{\text{Yes,}}_{\text{Auto}}\ \beta^{\text{Yes,}}_{\text{Air Cond}}\ \beta^{\text{Yes,}}_{\text{AM/FM}}\ \beta^{\text{Yes,}}_{\text{CD}}\ \text{Table}$

[Figure: source car table in which each “Yes” entry in the Auto, Air Cond., AM/FM, and CD columns is replaced by the column name.]
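One way to read the β operator is as a per-column rewrite: a true value becomes the attribute name and anything else becomes empty. The Python sketch below illustrates that reading only. The function `replace_boolean`, the "Yes" convention for true, and the sample row are assumptions.

```python
def replace_boolean(rows, attribute, true_value="Yes"):
    """Replace a true entry with the attribute name and any other entry with ''."""
    return [
        {**row, attribute: attribute if row.get(attribute) == true_value else ""}
        for row in rows
    ]


# Hypothetical source row with Boolean feature columns.
rows = [{"Model": "Civic EX", "Auto": "Yes", "Air Cond.": "Yes", "AM/FM": "Yes", "CD": ""}]
for attribute in ("Auto", "Air Cond.", "AM/FM", "CD"):
    rows = replace_boolean(rows, attribute)

print(rows[0])
```

After the rewrite, the feature columns carry values that a feature recognizer can match directly.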

Page 89:

Solution: Form Attribute-Value Pairs

[Figure: source car table.]

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >

Page 90:

Solution: Adjust Attribute-Value Pairs

[Figure: source car table.]

<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>

Page 91:

Solution: Do Extraction

[Figure: source car table.]

Page 92:

Solution: Infer Mappings

[Figure: source car table.]

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

Each row is a car.

$\pi_{\text{Model}}\ \mu_{(\text{Year, Colour, Price, Auto, Air Cond, AM/FM, CD})^*}\ \text{Table}$

$\pi_{\text{Make}}\ \mu_{(\text{Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD})^*}\ \mu_{(\text{Year, Colour, Price, Auto, Air Cond, AM/FM, CD})^*}\ \text{Table}$

$\pi_{\text{Year}}\ \text{Table}$

Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car).

Page 93:

Solution: Infer Mappings

[Figure: source car table.]

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

$\pi_{\text{Model}}\ \mu_{(\text{Year, Colour, Price, Auto, Air Cond, AM/FM, CD})^*}\ \text{Table}$

Page 94:

Solution: Do Extraction

[Figure: source car table.]

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

$\pi_{\text{Price}}\ \text{Table}$

Page 95:

Solution: Do Extraction

[Figure: source car table.]

{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}

$\rho_{\text{Colour} \leftarrow \text{Feature}}\ \pi_{\text{Colour}}\ \text{Table}$
$\cup\ \rho_{\text{Auto} \leftarrow \text{Feature}}\ \pi_{\text{Auto}}\ \beta^{\text{Yes,}}_{\text{Auto}}\ \text{Table}$
$\cup\ \rho_{\text{Air Cond.} \leftarrow \text{Feature}}\ \pi_{\text{Air Cond.}}\ \beta^{\text{Yes,}}_{\text{Air Cond.}}\ \text{Table}$
$\cup\ \rho_{\text{AM/FM} \leftarrow \text{Feature}}\ \pi_{\text{AM/FM}}\ \beta^{\text{Yes,}}_{\text{AM/FM}}\ \text{Table}$
$\cup\ \rho_{\text{CD} \leftarrow \text{Feature}}\ \pi_{\text{CD}}\ \beta^{\text{Yes,}}_{\text{CD}}\ \text{Table}$
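The Feature mapping is a union of per-column projections, each renamed to Feature. A hedged Python sketch of that union follows. It uses the row index as a stand-in for the car's OID; `feature_values` and the sample row are illustrative assumptions, and the Boolean columns are taken to have been rewritten to column names already.

```python
def feature_values(rows, attributes):
    """Union of the projections of each source attribute, as (car OID, Feature) pairs."""
    features = set()
    for car_oid, row in enumerate(rows):
        for attribute in attributes:
            value = row.get(attribute, "")
            if value:  # skip empty values left where a Boolean entry was false
                features.add((car_oid, value))
    return features


rows = [{"Colour": "White", "Auto": "Auto", "Air Cond.": "Air Cond.",
         "AM/FM": "AM/FM", "CD": ""}]
pairs = feature_values(rows, ["Colour", "Auto", "Air Cond.", "AM/FM", "CD"])
print(sorted(pairs))
```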

Page 96:

Application #7

Record Linkage

Page 97:

“Kelly Flanagan” Query

Page 98:

A Multi-faceted Approach

Gather evidence from each of several different facets
• Attributes
• Links
• Page Similarity

Combine the evidence

Page 99:

Attributes

Phone number, email address, state, city, zip code

Data-frame recognizers

Page 100:

Links

Page 101:

Page Similarity

“adjacent cap-word pairs”: Cap-Word (Connector | Preposition (Article)? | (Capital-Letter Dot))? Cap-Word
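The cap-word pattern translates almost directly into a regular expression. The Python sketch below is one possible rendering, not the recognizer used in the experiments. The connector, preposition, and article word lists are assumptions, and `findall` returns non-overlapping matches only.

```python
import re

CAP_WORD = r"[A-Z][a-z]+"
CONNECTOR = r"(?:and|or|&)"           # assumed connector list
PREPOSITION = r"(?:of|for|in|at|on)"  # assumed preposition list
ARTICLE = r"(?:the|a|an)"             # assumed article list
INITIAL = r"[A-Z]\."                  # Capital-Letter Dot

MIDDLE = rf"(?:{CONNECTOR}|{PREPOSITION}(?:\s+{ARTICLE})?|{INITIAL})"
PAIR_RE = re.compile(rf"\b{CAP_WORD}(?:\s+{MIDDLE})?\s+{CAP_WORD}\b")


def cap_word_pairs(text):
    """Return the non-overlapping adjacent cap-word pairs found in text."""
    return PAIR_RE.findall(text)


sample = "Kelly Flanagan, Department of Computer Science, and David W. Embley"
print(cap_word_pairs(sample))
```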

Page 102:

Confidence Matrix for Each Facet

      C1    C2    …    Ci    …    Cj    …    Cn
C1    1     C12        C1i        C1j        C1n
C2          1          C2i        C2j        C2n
:                      :          :          :
Ci                     1          Cij        Cin
:                                 :          :
Cj                                1          Cjn
:                                            :
Cn                                           1

$$C_{ij} = \begin{cases} P(C_i \text{ and } C_j \text{ refer to the same person} \mid \text{evidence for a facet } f) \\ 0 \quad \text{if no evidence for a facet } f \end{cases}$$

Training set to compute the conditional probabilities

Page 103:

Confidence Matrix for Attributes
Confidence Matrix for Links
Confidence Matrix for Page Similarity

0.96 + 0 + 0.78 - 0.96 * 0 - 0.96 * 0.78 - 0.78 * 0 + 0.96 * 0 * 0.78 = 0.9912

Final Matrix
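The sum on this slide is the inclusion-exclusion expansion of 1 − (1 − a)(1 − b)(1 − c), the probability that at least one of three independent pieces of evidence holds. A small Python sketch of that combination, with `combine` as an illustrative name and independence of the facets as an assumption:

```python
def combine(confidences):
    """Combine facet confidences as 1 minus the product of their complements."""
    remaining = 1.0
    for p in confidences:
        remaining *= 1.0 - p
    return 1.0 - remaining


print(round(combine([0.96, 0.0, 0.78]), 4))  # 0.9912
```

A facet with no evidence contributes 0 and therefore leaves the combined value unchanged.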

Page 104:

Grouping Algorithm

Input: final confidence matrix
Output: citations grouped by same person
The idea: if {Ci, Cj} and {Cj, Ck}, then {Ci, Cj, Ck}
The threshold we use for “highly confident” is 0.8.
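The transitive merging idea can be sketched with a union-find structure. This Python example is an illustration, not the algorithm as implemented by the author. `group_citations`, the matrix values, and treating a confidence equal to the threshold as "highly confident" are assumptions.

```python
def group_citations(matrix, threshold=0.8):
    """Group citation indices whose pairwise confidence reaches the threshold."""
    n = len(matrix)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Only the upper triangle of the confidence matrix is read.
    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical final confidence matrix for four citations.
final = [
    [1, 0.9912, 0.20, 0.10],
    [0, 1,      0.85, 0.30],
    [0, 0,      1,    0.40],
    [0, 0,      0,    1],
]
print(group_citations(final))  # [[0, 1, 2], [3]]
```

Citations 0 and 2 end up together even though their direct confidence is low, because each is confidently linked to citation 1.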

Page 105:

Experimental Results

Page 106:

Application #8

Accessing the Hidden Web

Page 107:

Obtaining Data Behind Forms

• Web information is stored in databases

• Databases are accessed through forms

• Forms are designed in various ways

Page 108:

Hidden Web Extraction System

[Figure: system diagram. Labeled components: User Query, Application Extraction Ontology, Input Analyzer, Site Form, Retrieved Page(s), Output Analyzer, Extracted Information.]

“Find green cars costing no more than $9000.”

Page 109:

Application #9

Ontology Discovery & Generation

Page 110:

TANGO: Table Analysis for Generating Ontologies

• Recognize and normalize table information
• Construct mini-ontologies from tables
• Discover inter-ontology mappings
• Merge mini-ontologies into a growing ontology

Page 111:

Recognize Table Information

                                  Religion
Country       Population          Albanian    Muslim   Roman      Shi’a     Sunni     other
              (July 2001 est.)    Orthodox             Catholic   Muslim    Muslim
Afghanistan   26,813,057                                          15%       84%       1%
Albania       3,510,484           20%         70%      30%

Page 112:

Construct Mini-Ontology

[Figure: mini-ontology constructed from the Country, Population, and Religion table.]

Page 113:

Discover Mappings

Page 114:

Merge

Page 115:

Application #10

Challenging Applications (e.g. BioInformatics)

Page 116:

Large Extraction Ontologies

Page 117:

Complex Semi-Structured Pages

Page 118:

Additional Analysis Opportunities

• Sibling Page Comparison
• Semi-automatic Lexicon Update
• Seed Ontology Recognition

Page 119:

Sibling Page Comparison

Page 120:

Sibling Page Comparison: Attributes

Page 121:

Sibling Page Comparison

Page 122:

Sibling Page Comparison

Page 123:

Semi-automatic Lexicon Update

Additional Protein Names

Additional Source Species or Organisms

Page 124:

Seed Ontology Recognition

[Figure: seed ontology annotated with recognized values. Recoverable values:]

nucleus; zinc ion binding; nucleic acid binding;
linear;
NP_079345;
9606;
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo;
Homo sapiens; human;
GTTTTTGTGTT……….ATAAGTGCATTAACGGCCCACATG;
FLJ14299
msdspagsnprtpessgsgsgg………tagpyyspyalygqrlasasalgyq;
hypothetical protein FLJ14299;
8; eight;
“8:?p\s?12”; “8:?p11.2”; “8:?p11.23”; …
“37,?612,?680”; “37,?610,?585”;

Page 125:

Seed Ontology Recognition

Page 126:

Presentation Outline
• Grand Challenge
• Meaning, Knowledge, Information, Data
• Fun and Games with Data
• Information Extraction Ontologies
• Applications
• Limitations and Pragmatics
• Summary and Challenges

Page 127:

Limitations and Pragmatics

• Data-Rich, Narrow Domain
• Ambiguities ~ Context Assumptions
• Incompleteness ~ Implicit Information
• Common Sense Requirements
• Knowledge Prerequisites
• …

Page 128:

Busiest Airport in 2003?

Chicago
  - 928,735 Landings (Nat. Air Traffic Controllers Assoc.)
  - 931,000 Landings (Federal Aviation Admin.)
Atlanta
  - 58,875,694 Passengers (Sep., latest numbers available)
Memphis
  - 2,494,190 Metric Tons (Airports Council Int’l.)

Page 131:

Busiest Airport in 2003?

• Ambiguous
• Whom do we trust? (How do they count?)

Page 132:

Busiest Airport in 2003?

• Important qualification

Page 133:

Dow Jones Industrial Average

             High        Low         Last        Chg
30 Indus     10527.03    10321.35    10409.85    +85.18
20 Transp    3038.15     2998.60     3008.16     +9.83
15 Utils     268.78      264.72      266.45      +1.72
66 Stocks    3022.31     2972.94     2993.12     +19.65

44.07

10,409.85

Graphics, Icons, …

Page 134:

Dow Jones Industrial Average

44.07

10,409.85

Reported on same date

Weekly
Daily

Implicit information: weekly stated in upper corner of page; daily not stated.

Page 135:

Presentation Outline
• Grand Challenge
• Meaning, Knowledge, Information, Data
• Fun and Games with Data
• Information Extraction Ontologies
• Applications
• Limitations and Pragmatics
• Summary and Challenges

Page 136:

Some Key Ideas

Data, Information, and Knowledge
Data Frames
• Knowledge about everyday data items
• Recognizers for data in context
Ontologies
• Resilient Extraction Ontologies
• Shared Conceptualizations
Limitations and Pragmatics

Page 137:

Some Research Issues

Building a library of open source data recognizers
Precisely finding and gathering relevant information
• Subparts of larger data
• Scattered data (linked, factored, implied)
• Data behind forms in the hidden web
Improving concept matching
• Indirect matching
• Calculations, unit conversions, alternative representations, …
…

Page 138:

Some Research Challenges

Web Page Understanding
• Suppose extraction is ~85% accurate
• Generate a page grammar
  - Increased recall (more extracted)
  - Increased precision (fewer false positives)
  - Fast extraction from same-site sibling pages
Universal Rules for Schema Matching
• Must rules be domain-specific?
• Can some rules be “universal”?
Boundaries of Usefulness: When should machine learning not be used?
Application to Significant Problems
• Like those above
• Many more …

www.deg.byu.edu

(Machine Learning)

