Post on 13-Dec-2015
Using linked data to interpret tables
Varish Mulwad
September 14, 2010
Interpreting a table
Link cell value to an entity: http://dbpedia.org/resource/Baltimore
http://dbpedia.org/ontology/PopulatedPlace
Find relationships between columns: LargestCity
1000 reasons why it's important …
• Annotate web tables
• Confirm existing facts in LOD
• Discover knowledge and new facts
• Search query over web tables
• Data integration
Interpreting a table
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .

"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
… …
Overview
• Introduction
• Related Work & Motivation
• Approach
• Results
• Upcoming Work
• Conclusion
Introduction
The World Wide Web …
[Figure: example web pages, including a talk announcement of the form "Talk: abc, By: xyz, Venue: some location"]
The World Wide Web …
Good for you and me …
… not so good for machines
Images from http://www.bbc.co.uk/blogs/radiolabs/s5/linked-data/s5.html
Web of Data – The Semantic Web
Image – www.linkeddata.org
Linked Data
The principles of Linked Data outline the best practices to share and expose structured data on the World Wide Web
Every resource has a URI, e.g. Baltimore: http://dbpedia.org/resource/Baltimore
Related Work and Motivation
Chicken, Egg … No Chicken
• More than a trillion documents on the Web
• ~14.1 billion tables, 154 million with high-quality relational data (Cafarella et al., 2008)
• Where is the structured data?
Automate the process
• We need systems that can generate data from existing sources
• It is not practical for humans to encode all this into RDF manually
On the Semantic Web …
• Mapping relational databases to RDF [W3C working group – RDB2RDF]
• Mapping spreadsheets to RDF [RDF123, XLWrap]
• Practical and helpful systems, but …
  – Require significant manual work
  – Do not generate linked data
… elsewhere
• Learning to index tables to improve search experience (Cafarella et al., 2008)
• Expanding attributes (columns) of web tables (Lin et al., 2010)
• Interpreting web tables to answer complex search queries over the web tables (Limaye et al., 2010)
Interpreting a Table
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Predicting class labels for a column
[Figure: a column headed "City" with the cells Baltimore, Boston and New York; diagram labels: Instance, Type, Class Type for the column]
Querying the Knowledge Base
Column: City
Baltimore: 1. Baltimore, 2. Baltimore County, 3. John Baltimore
Boston: 1. Boston, 2. Boston_(band), 3. Boston_University
New York: 1. New_York_City, 2. New_York, 3. New_York_(album)
Types
dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, dbpedia-owl:City, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople
dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, …
…
Scoring the classes
Possible classes for the column: dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, dbpedia-owl:City, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:Band, dbpedia-owl:Organisation, …
[Baltimore, dbpedia-owl:City], [Boston, dbpedia-owl:City], [New York, dbpedia-owl:City], … [Baltimore, dbpedia-owl:Band], [Boston, dbpedia-owl:Band], …
E.g. processing class "dbpedia-owl:City"
String: Baltimore
(R = 1) Baltimore: dbpedia-owl:City, dbpedia-owl:Place [PR = 6]
(R = 2) Baltimore County: dbpedia-owl:AdministrativeRegion [PR = 4]
(R = 3) John Baltimore: yago:AmericanConductors, yago:LivingPeople [PR = 5]
Score = w × (1 / R) + (1 − w) × (normalized PageRank)
[Baltimore, dbpedia-owl:City] = (0.25 × 1/1) + (0.75 × 6/7) = 0.892
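The scoring rule above can be tried out with a short Python sketch. The pair score follows the slide's formula. Two things are assumptions made for illustration and are not stated in the slides: that PageRank is normalized by dividing by a maximum value (7 in the worked example), and that per-class scores are combined across cells by summing the best score each cell gives to a class.

```python
from collections import defaultdict


def pair_score(rank, page_rank, max_page_rank, w=0.25):
    """Score = w * (1 / R) + (1 - w) * (normalized PageRank)."""
    return w * (1.0 / rank) + (1.0 - w) * (page_rank / max_page_rank)


def score_classes(column_results, max_page_rank, w=0.25):
    """Rank candidate classes for a column.

    column_results maps each cell string to its KB results, each given as
    (rank, page_rank, [class names]).
    ASSUMPTION: a class gets, per cell, the best score among the results
    that carry it, and these per-cell scores are summed over the column.
    """
    totals = defaultdict(float)
    for results in column_results.values():
        best = {}
        for rank, page_rank, classes in results:
            s = pair_score(rank, page_rank, max_page_rank, w)
            for cls in classes:
                best[cls] = max(best.get(cls, 0.0), s)
        for cls, s in best.items():
            totals[cls] += s
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# The "Baltimore" results from the slide (R = rank, PR = PageRank).
baltimore = {
    "Baltimore": [
        (1, 6, ["dbpedia-owl:City", "dbpedia-owl:Place"]),
        (2, 4, ["dbpedia-owl:AdministrativeRegion"]),
        (3, 5, ["yago:AmericanConductors", "yago:LivingPeople"]),
    ]
}
ranked_classes = score_classes(baltimore, max_page_rank=7)
```

With w = 0.25, the pair [Baltimore, dbpedia-owl:City] scores about 0.8929, which the slide reports truncated to 0.892.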
Approach
Table cell + column header + row data + column type
Requery KB with predicted class labels as additional evidence
Generate a feature vector for the top N results of the query
Classifier ranks the entities within the set of possible results
Select the highest ranked entity
Classifier decides whether to link or not
Link to "NIL", or link to the top ranked instance
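The flow above can be read as a small decision procedure. The sketch below is only an illustration of that control flow: `rank_score` and `should_link` are hypothetical stand-ins for the two trained classifiers, and the toy scores are invented.

```python
NIL = "NIL"


def link_cell(candidates, rank_score, should_link):
    """Pick an entity for one table cell, or NIL.

    candidates  : top N KB results for the cell (after requerying the KB)
    rank_score  : callable giving a ranking score per candidate
                  (stand-in for the ranking classifier)
    should_link : callable (top, ranked) -> bool
                  (stand-in for the link / no-link classifier)
    """
    if not candidates:
        return NIL
    ranked = sorted(candidates, key=rank_score, reverse=True)
    top = ranked[0]
    return top if should_link(top, ranked) else NIL


# Toy scores, invented for illustration only.
toy_scores = {"dbpedia:Baltimore": 0.9, "dbpedia:Baltimore_County": 0.4}
linked = link_cell(list(toy_scores), toy_scores.get,
                   lambda top, ranked: toy_scores[top] > 0.5)
unlinked = link_cell([], toy_scores.get, lambda top, ranked: True)
```

Keeping the two classifiers as separate callables mirrors the two decision boxes in the flow: one orders the candidates, the other may still reject the winner.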
Learning to Rank
• We trained an SVMrank classifier, which learnt to rank entities within a given set
Feature vector
Similarity measures:
• Levenshtein distance
• Dice score
Popularity measures:
• Wikitology score
• PageRank
• Page length
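As a rough illustration of the feature vector, the two similarity measures can be computed with the standard library alone. The slides do not say how the Dice score was tokenized, so character bigrams are an assumption here. The popularity values are simply passed in as numbers, since they come from the knowledge base.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def dice(a, b):
    """Dice coefficient over character bigrams (ASSUMED tokenization)."""
    bigrams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not bigrams_a and not bigrams_b:
        return 1.0 if a == b else 0.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def feature_vector(cell_text, entity_label, wikitology_score, page_rank, page_length):
    """Similarity measures followed by popularity measures."""
    return [levenshtein(cell_text, entity_label),
            dice(cell_text, entity_label),
            wikitology_score, page_rank, page_length]
```

For example, `levenshtein("Boston", "Boston_(band)")` is 7, so the exact match "Boston" is preferred on the similarity features.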
"To Link or not to Link …"
• The highest ranked entity may not be the correct one to link to …
  – Because the string we are querying may not be in the KB
  – Top N results may not include the correct answer
• We trained an SVM classifier to determine whether to link to the top ranked entity or not
"To Link or not to Link …"
• The feature vector included the feature vector of the top ranked entity and two additional features:
  – The SVMrank score of the top ranked entity
  – The difference in scores between the top two ranked entities
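A minimal sketch of how that extended feature vector could be assembled. The function name is hypothetical, and using a gap of 0.0 when only one candidate exists is an assumption, since the slides do not cover that case.

```python
def link_decision_features(top_entity_features, ranked_scores):
    """Feature vector for the link / no-link classifier.

    top_entity_features : ranking features of the top ranked entity
    ranked_scores       : ranking scores of the candidates, best first
    ASSUMPTION: with a single candidate the score gap is set to 0.0.
    """
    top_score = ranked_scores[0]
    gap = top_score - ranked_scores[1] if len(ranked_scores) > 1 else 0.0
    return list(top_entity_features) + [top_score, gap]


example = link_decision_features([0, 1.0, 0.8], [2.5, 1.0, 0.25])
```

A large gap suggests a clear winner, while a small gap signals an ambiguous cell where linking to "NIL" may be safer.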
Relation between columns
City        State
Baltimore   Maryland
Boston      Massachusetts
New York    New York
Relation between columns
Maryland - Baltimore: dbonto:LargestCity
Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
New York - New York: dbonto:LargestCity
Candidate relations: dbonto:Capital, dbonto:LargestCity
Scoring the relations
Maryland - Baltimore: dbonto:LargestCity
Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
New York - New York: dbonto:LargestCity
Candidates: dbonto:Capital, dbonto:LargestCity
dbonto:Capital, score: 1
dbonto:LargestCity, score: 3
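The scores on this slide are consistent with counting, for each candidate relation, the number of row pairs in which the knowledge base asserts it. A small sketch under that reading (the function name is hypothetical):

```python
from collections import Counter


def score_relations(relations_per_row):
    """Count in how many rows each candidate relation holds."""
    counts = Counter()
    for relations in relations_per_row:
        counts.update(set(relations))  # one vote per relation per row
    return counts.most_common()


# Relations found in the KB for each (State, City) row pair on the slide.
rows = [
    ["dbonto:LargestCity"],                    # Maryland - Baltimore
    ["dbonto:LargestCity", "dbonto:Capital"],  # Massachusetts - Boston
    ["dbonto:LargestCity"],                    # New York - New York
]
relation_scores = score_relations(rows)
```

On the slide's data this gives 3 for dbonto:LargestCity and 1 for dbonto:Capital, so LargestCity is chosen as the relation between the two columns.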
Annotating web tables for the Semantic Web
Table as linked RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .

"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
"MD"@en is rdfs:label of dbpedia:Maryland .
dbpedia:Maryland a dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:range dbpedia-owl:City .

"City"@en is rdfs:label of dbpedia-owl:City .
  "City" is the common human name for the class dbpedia-owl:City.
dbpedia:Baltimore a dbpedia-owl:City .
  dbpedia:Baltimore is of type (an instance of) dbpedia-owl:City.
dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
  The subjects of the triples using the property have to be instances of dbpedia-owl:AdministrativeRegion.
dbpprop:LargestCity rdfs:range dbpedia-owl:City .
  The objects of the triples using the property have to be instances of dbpedia-owl:City.
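To make the output format concrete, here is a minimal sketch that prints statements in the same N3 style for one annotated column. The function name and its inputs are hypothetical, and skipping cells that were linked to "NIL" is an assumption rather than something the slides state.

```python
def annotate_column(header, class_uri, linked_cells):
    """Return N3 statements for one annotated column.

    header       : column header text, e.g. "City"
    class_uri    : predicted class, e.g. "dbpedia-owl:City"
    linked_cells : list of (cell text, entity URI or None)
    ASSUMPTION: cells linked to NIL (None here) produce no statements.
    """
    lines = [f'"{header}"@en is rdfs:label of {class_uri} .']
    for text, entity in linked_cells:
        if entity is None:
            continue
        lines.append(f'"{text}"@en is rdfs:label of {entity} .')
        lines.append(f'{entity} a {class_uri} .')
    return lines


city_statements = annotate_column(
    "City", "dbpedia-owl:City",
    [("Baltimore", "dbpedia:Baltimore"), ("Unknownville", None)],
)
print("\n".join(city_statements))
```

The prefix declarations are left out of the sketch; they would be written once at the top of the generated document.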
Results
Dataset summary
Number of tables: 15
Total number of rows: 199
Total number of columns: 56 (52)
Total number of entities: 639 (611)
The numbers in brackets exclude columns that contained numbers.
Evaluation for class label predictions
Evaluation 1 (MAP)
• Compared the system's ranked list of labels against a human-ranked list of labels
• Metric: Mean Average Precision (MAP)
• Commonly used in the Information Retrieval domain to compare two ranked sets
Evaluation 1 (MAP)
MAP = 80.76%
System ranked: 1. Person, 2. Politician, 3. President
Evaluator ranked: 1. President, 2. Politician, 3. OfficeHolder
Evaluation 2 (Recall)
Recall > 0.6 (75%)
System ranked: 1. Person, 2. Politician, 3. President
Evaluator ranked: 1. President, 2. Politician, 3. OfficeHolder
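Recall for the same example can be sketched in a few lines, assuming it is measured as the share of the evaluator's labels that also appear in the system's list. `share_above` is a hypothetical helper for a summary figure such as the share of columns with recall above 0.6.

```python
def recall(system_labels, evaluator_labels):
    """Fraction of the evaluator's labels that the system also returned."""
    relevant = set(evaluator_labels)
    if not relevant:
        return 0.0
    return len(relevant & set(system_labels)) / len(relevant)


def share_above(recalls, threshold=0.6):
    """Fraction of columns whose recall exceeds the threshold."""
    recalls = list(recalls)
    return sum(r > threshold for r in recalls) / len(recalls)


column_recall = recall(["Person", "Politician", "President"],
                       ["President", "Politician", "OfficeHolder"])
```

For the example lists, two of the three evaluator labels are recovered, so this column has a recall of about 0.67 and counts towards the "recall > 0.6" group.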
Evaluation 3 (Correctness)
• Evaluated whether our predicted class labels were "fair and correct"
• A class label may not be the most accurate one, but may still be correct
  – E.g. dbpedia:PopulatedPlace is not the most accurate, but is still a correct label for a column of cities
• Three human judges evaluated our predicted class labels
Evaluation 3 (Correctness)
• A category-wise breakdown for class label correctness
Overall accuracy: 76.92%
Column – Nationality, Prediction – MilitaryConflict
Column – Birth Place, Prediction – PopulatedPlace
Evaluation for linking table cells to entities
Category-wise accuracy for linking table cells
Overall accuracy: 66.12%
Relation between columns
• Idea – Ask human evaluators to identify relations between columns in a given table
• Pilot experiment – Asked three evaluators to annotate five random tables from our dataset
• Evaluators identified 20 relations
• Our accuracy – 5 out of 20 (25%) were correct
Future Work
Current
Automatic / semi-automatic template learning
Confirming LD Facts
Baltimore, MD, S. Rawlings, …
For Baltimore, DBpedia says:
dbpprop:LeaderName – S. Dixon
dbpprop:LeaderName – S. Dixon, S. Rawlings
Discover knowledge relations
• Inception rdf:type dbpedia-owl:Movie
• Howard County rdf:type dbpedia:AdministrativeRegion
• David Beckham dbpedia-owl:Team dbpedia:Los_Angeles_Galaxy
Conclusion
• There's a lot of data stored in HTML tables, spreadsheets, databases and documents
• We presented an automatic framework to interpret such data
• We believe our work will contribute to materializing the web of data vision
References
• Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A.: RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference. Springer (2008)
• Langegger, A., Wöß, W.: XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. In: 8th International Semantic Web Conference (ISWC2009) (2009)
• Cafarella, M.J., Halevy, A.Y., Wang, Z.D., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. PVLDB 1 (2008) 538-549
• Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB) (2010)
• Lin, C.X., Zhao, B., Weninger, T., Han, J., Liu, B.: Entity relation discovery from web tables and links. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.) WWW, 1145–1146. ACM (2010)
The World Wide Web hellip
helliphelliphellip
helliphelliphellip
helliphelliphellip
helliphelliphellip
hellip helliphellip
hellip helliphellip
Talk abcBy xyzVenue some location
Talk abcBy xyzVenue some location
hellip helliphellip
hellip helliphellip
7
The World Wide Web hellip
Good for you and me hellip
hellip not so good for machinesImages from httpwwwbbccoukblogsradiolabss5linked-datas5html
8
Web of Data ndash The Semantic Web
Image ndash wwwlinkeddataorg9
Linked Data
The principles of Linked Data outline the best practices to share and expose structured data on the World Wide Web
Every resource has a URI Baltimore httpdbpediaorgresourceBaltimore
10
Related Work and Motivation
11
Chicken Egg hellip No Chicken
bull More than a trillion documents on the Web
bull ~ 141 billion tables 154 million with high quality relational data (Cafarella et al 2008)bull Where is structured
data 13
Automate the process
bull We need systems that can generate data from existing sources
bull Not practical for humans to encode all this into RDF manually
14
On the Semantic Web hellip
bull Mapping Relational databases to RDF [W3C working group ndash RDB2RDF]
bull Mapping spreadsheets to RDF [RDF123 XLWrap]
bull Practical and helpful systems but hellip ndash Require significant manual workndash Do not generate linked data
hellip elsewhere
bull Learning to index tables to improve search experience (Cafarella et al 2008)
bull Expanding attributes (columns) of web tables (Lin et al 2010)
bull Interpreting web tables to answer complex search queries over the web tables (Limaye et al 2010)
Interpreting a Table
17
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
T2LD Framework
18
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
19
Predicting Class Labels for column
City
Baltimore
Boston
New York
Type
Instance
Type
Type
Type
Class Type for the column
Querying the Knowledge Base
Top three results returned for each cell of the City column:
Baltimore: 1. Baltimore 2. Baltimore County 3. John Baltimore
Boston: 1. Boston 2. Boston_(band) 3. Boston_University
New York: 1. New_York_City 2. New_York 3. New_York_(album)
Types of the results:
dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, dbpedia-owl:City, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople
dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, …
…
Scoring the classes
Possible classes for the column: dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, dbpedia-owl:City, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:Band, dbpedia-owl:Organisation, …
Class-instance pairs: [Baltimore, dbpedia-owl:City], [Boston, dbpedia-owl:City], [New York, dbpedia-owl:City], … [Baltimore, dbpedia-owl:Band], [Boston, dbpedia-owl:Band], …
E.g. processing class "dbpedia-owl:City"
String: Baltimore
(R = 1) Baltimore: dbpedia-owl:City, dbpedia-owl:Place [PR = 6]
(R = 2) Baltimore County: dbpedia-owl:AdministrativeRegion [PR = 4]
(R = 3) John Baltimore: yago:AmericanConductors, yago:LivingPeople [PR = 5]
Score = w × (1 / R) + (1 – w) × (Normalized PageRank)
[Baltimore, dbpedia-owl:City] = (0.25 × 1/1) + (0.75 × 6/7) = 0.892
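The scoring formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and parameters are hypothetical, and the normalizer 7 is simply the divisor that appears in the slide's worked example (the slides do not say how it is chosen).

```python
def class_score(rank: int, page_rank: float, pr_normalizer: float, w: float = 0.25) -> float:
    """Score one (cell string, class) pair as w * (1/R) + (1 - w) * normalized PageRank.

    rank          -- R, position of the entity in the KB result list (1 = top)
    page_rank     -- PR value of the entity, as shown on the slide
    pr_normalizer -- value PR is divided by (assumption: 7, from the slide example)
    w             -- weight of the rank term (0.25 in the slide example)
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    normalized_pr = page_rank / pr_normalizer
    return w * (1.0 / rank) + (1.0 - w) * normalized_pr


# Worked example from the slide: Baltimore, R = 1, PR = 6, divided by 7.
baltimore_city = class_score(rank=1, page_rank=6, pr_normalizer=7)
print(f"[Baltimore, dbpedia-owl:City] = {baltimore_city:.4f}")
```

The example evaluates to roughly 0.8929, which the slide reports as 0.892.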
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Approach
Table Cell + Column Header + Row Data + Column Type
Requery KB with predicted class labels as additional evidence
Generate a feature vector for the top N results of the query
Classifier ranks the entities within the set of possible results
Select the highest ranked entity
Classifier decides whether to link or not
Link to "NIL" or link to the top ranked instance
Learning to Rank
• We trained an SVMrank classifier that learned to rank entities within a given set
Feature Vector
Similarity Measures
• Levenshtein distance
• Dice Score
Popularity Measures
• Wikitology Score
• PageRank
• Page Length
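The two similarity measures named above can be sketched as follows. This is a minimal illustration, not the implementation used in the work: the slides do not say which units the Dice score is computed over, so character bigrams are an assumption here, and both function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def dice_score(a: str, b: str) -> float:
    """Dice coefficient over sets of character bigrams (assumed unit)."""
    def bigrams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # two strings too short to have bigrams count as identical
    return 2 * len(x & y) / (len(x) + len(y))


# Comparing a table cell string with a candidate entity label.
print(levenshtein("Baltimore", "Baltimore County"))
print(dice_score("Baltimore", "Baltimore County"))
```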
"To Link or not to Link …"
• The highest ranked entity may not be the correct one to link to …
  – Because the string we are querying may not be in the KB
  – Top N results may not include the correct answer
• We trained an SVM classifier to determine whether to link to the top one or not
"To Link or not to Link …"
• The feature vector included the feature vector of the top ranked entity and two additional features:
  – The SVMrank score of the top ranked entity
  – The difference in scores between the top two ranked entities
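As a rough sketch, the input to the link/no-link classifier could be assembled like this. The function name, argument layout and the handling of a single-candidate list are assumptions made for illustration; the slides only state which features are included.

```python
from typing import List, Sequence


def link_decision_features(top_entity_features: Sequence[float],
                           ranked_scores: Sequence[float]) -> List[float]:
    """Build the feature vector for the link / no-link classifier.

    top_entity_features -- ranking features of the top ranked entity
    ranked_scores       -- SVMrank scores of the candidates, best first
    """
    if not ranked_scores:
        raise ValueError("at least one ranked candidate is required")
    top_score = ranked_scores[0]
    # Assumption: with only one candidate there is no runner-up, so use 0.0.
    gap = top_score - ranked_scores[1] if len(ranked_scores) > 1 else 0.0
    return list(top_entity_features) + [top_score, gap]


features = link_decision_features([0.1, 0.9], [2.5, 1.0, 0.3])
print(features)
```

A small gap between the top two scores suggests the ranker could not separate the candidates well, which is one signal a classifier might use to prefer "NIL".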
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Relation between columns
City        State
Baltimore   Maryland
Boston      Massachusetts
New York    New York
Relation between columns
Maryland - Baltimore: dbonto:LargestCity
Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
New York - New York: dbonto:LargestCity
Candidate relations: dbonto:Capital, dbonto:LargestCity
Scoring the relations
Maryland - Baltimore: dbonto:LargestCity
Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
New York - New York: dbonto:LargestCity
Candidates: dbonto:Capital, dbonto:LargestCity
dbonto:Capital: Score 1
dbonto:LargestCity: Score 3
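The scores on this slide are consistent with a simple vote count: each candidate relation gets one point for every row pair in which it holds. The sketch below assumes that reading; the function name and the dictionary layout are hypothetical, and the per-pair relations are copied from the slide rather than fetched from a knowledge base.

```python
from collections import Counter


def score_relations(pair_relations: dict) -> Counter:
    """Count, for each candidate relation, the number of row pairs it holds for."""
    scores = Counter()
    for relations in pair_relations.values():
        scores.update(set(relations))  # one vote per relation per row pair
    return scores


# Relations found between the linked entities of each row (from the slide).
pairs = {
    ("Maryland", "Baltimore"): ["dbonto:LargestCity"],
    ("Massachusetts", "Boston"): ["dbonto:LargestCity", "dbonto:Capital"],
    ("New York", "New York"): ["dbonto:LargestCity"],
}

scores = score_relations(pairs)
best_relation, best_score = scores.most_common(1)[0]
print(scores["dbonto:Capital"], scores["dbonto:LargestCity"], best_relation)
```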
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Annotating web tables for the Semantic Web
Table as linked RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .
"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
"MD"@en is rdfs:label of dbpedia:Maryland .
dbpedia:Maryland a dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:range dbpedia-owl:City .
Reading the statements:
"City"@en is rdfs:label of dbpedia-owl:City: "City" is the common human name for the class dbpedia-owl:City
dbpedia:Baltimore a dbpedia-owl:City: dbpedia:Baltimore is of type (is an instance of) dbpedia-owl:City
dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion: the subjects of the triples using the property have to be instances of dbpedia-owl:AdministrativeRegion
dbpprop:LargestCity rdfs:range dbpedia-owl:City: the objects of the triples using the property have to be instances of dbpedia-owl:City
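To make the output step concrete, here is a small sketch that prints statements of the kind shown on the slide from the three pieces of annotation (column classes, cell links, column relation). The function name and input layout are hypothetical, and the sketch does no escaping or validation, so it is an illustration of the shape of the output rather than a serializer.

```python
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dbpedia": "http://dbpedia.org/resource/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "dbpprop": "http://dbpedia.org/property/",
}


def table_to_n3(column_classes: dict, cell_links: dict, relation: tuple) -> str:
    """Render table annotations as N3-style statements.

    column_classes -- header text -> predicted class
    cell_links     -- cell text -> (linked entity, class of the entity)
    relation       -- (property, domain class, range class)
    """
    lines = [f"@prefix {prefix}: <{uri}> ." for prefix, uri in PREFIXES.items()]
    for header, cls in column_classes.items():
        lines.append(f'"{header}"@en is rdfs:label of {cls} .')
    for text, (entity, cls) in cell_links.items():
        lines.append(f'"{text}"@en is rdfs:label of {entity} .')
        lines.append(f"{entity} a {cls} .")
    prop, domain, range_ = relation
    lines.append(f"{prop} rdfs:domain {domain} .")
    lines.append(f"{prop} rdfs:range {range_} .")
    return "\n".join(lines)


n3 = table_to_n3(
    {"City": "dbpedia-owl:City", "State": "dbpedia-owl:AdministrativeRegion"},
    {"Baltimore": ("dbpedia:Baltimore", "dbpedia-owl:City"),
     "MD": ("dbpedia:Maryland", "dbpedia-owl:AdministrativeRegion")},
    ("dbpprop:LargestCity", "dbpedia-owl:AdministrativeRegion", "dbpedia-owl:City"),
)
print(n3)
```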
Results
Dataset summary
Number of tables: 15
Total number of rows: 199
Total number of columns: 56 (52)
Total number of entities: 639 (611)
The numbers in brackets exclude columns that contained numbers
Evaluation for class label predictions
Evaluation 1 (MAP)
• Compared the system's ranked list of labels against a human-ranked list of labels
• Metric: Mean Average Precision (MAP)
• Commonly used in the Information Retrieval domain to compare two ranked sets
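For reference, the textbook form of the metric can be sketched as below. The slides do not spell out the exact variant used, so this assumes the standard information-retrieval definition, with the evaluator's labels treated as the relevant set; the function names are hypothetical. The example lists are the ones shown on the evaluation slides.

```python
def average_precision(ranked, relevant) -> float:
    """Average of precision@k over the positions k that hold a relevant item."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k  # precision at this cut-off
    return precision_sum / len(relevant)


def mean_average_precision(cases) -> float:
    """Mean of average precision over (ranked list, relevant labels) pairs."""
    cases = list(cases)
    if not cases:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in cases) / len(cases)


system_ranked = ["Person", "Politician", "President"]
evaluator_ranked = ["President", "Politician", "OfficeHolder"]
ap = average_precision(system_ranked, evaluator_ranked)
print(f"{ap:.4f}")
```

For this single column the relevant labels appear at positions 2 and 3, giving (1/2 + 2/3) / 3 = 7/18, about 0.389.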
Evaluation 1 (MAP)
80.76 %
System ranked: 1. Person 2. Politician 3. President
Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
Evaluation 2 (Recall)
Recall > 0.6 (75 %)
System ranked: 1. Person 2. Politician 3. President
Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
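Recall for one column can be computed as the share of the evaluator's labels that also appear in the system's list. The sketch below assumes that set-based definition (the slides do not define it further) and uses a hypothetical function name, applied to the example lists above.

```python
def recall(predicted, relevant) -> float:
    """Fraction of the relevant labels that appear among the predicted labels."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(predicted) & relevant) / len(relevant)


system_labels = ["Person", "Politician", "President"]
evaluator_labels = ["President", "Politician", "OfficeHolder"]
column_recall = recall(system_labels, evaluator_labels)
print(f"{column_recall:.3f}")
```

Two of the three evaluator labels are recovered, so this column has a recall of 2/3 and would count among the columns with recall above 0.6.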
Evaluation 3 (Correctness)
• Evaluated whether our predicted class labels were "fair and correct"
• Class label may not be the most accurate one but may be correct
  – E.g. dbpedia:PopulatedPlace is not the most accurate but still a correct label for a column of cities
• Three human judges evaluated our predicted class labels
Evaluation 3 (Correctness)
• A category-wise breakdown for class label correctness
Overall Accuracy: 76.92 %
Column – Nationality, Prediction – MilitaryConflict
Column – Birth Place, Prediction – PopulatedPlace
Evaluation for linking table cells to entities
Category-wise accuracy for linking table cells
Overall Accuracy: 66.12 %
Relation between columns
• Idea – Ask human evaluators to identify relations between columns in a given table
• Pilot experiment – Asked three evaluators to annotate five random tables from our dataset
• Evaluators identified 20 relations
• Our accuracy – 5 out of 20 (25 %) were correct
Future Work
Current
Automatic/Semi-automatic template learning
Confirming LD Facts
Baltimore, MD, S. Rawlings, …
For Baltimore, DBpedia says:
dbpprop:LeaderName – S. Dixon
dbpprop:LeaderName – S. Dixon, S. Rawlings
Discover knowledge relations
• Inception rdf:type dbpedia-owl:Movie
• Howard County rdf:type dbpedia:AdministrativeRegion
• David Beckham dbpedia-owl:Team dbpedia:Los_Angeles_Galaxy
Conclusion
• There's a lot of data stored in HTML tables, spreadsheets, databases and documents
• We presented an automatic framework to interpret such data
• We believe our work will contribute to materializing the web of data vision
References
• Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A.: RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference. Springer (2008)
• Langegger, A., Wöß, W.: XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. In: 8th International Semantic Web Conference (ISWC2009) (2009)
• Cafarella, M.J., Halevy, A.Y., Wang, Z.D., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. PVLDB 1 (2008) 538-549
• Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB) (2010)
• Lin, C.X., Zhao, B., Weninger, T., Han, J., Liu, B.: Entity relation discovery from web tables and links. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.) WWW, 1145–1146. ACM (2010)
Linked Data
The principles of Linked Data outline the best practices to share and expose structured data on the World Wide Web
Every resource has a URI Baltimore httpdbpediaorgresourceBaltimore
10
Related Work and Motivation
11
Chicken Egg hellip No Chicken
bull More than a trillion documents on the Web
bull ~ 141 billion tables 154 million with high quality relational data (Cafarella et al 2008)bull Where is structured
data 13
Automate the process
bull We need systems that can generate data from existing sources
bull Not practical for humans to encode all this into RDF manually
14
On the Semantic Web hellip
bull Mapping Relational databases to RDF [W3C working group ndash RDB2RDF]
bull Mapping spreadsheets to RDF [RDF123 XLWrap]
bull Practical and helpful systems but hellip ndash Require significant manual workndash Do not generate linked data
hellip elsewhere
bull Learning to index tables to improve search experience (Cafarella et al 2008)
bull Expanding attributes (columns) of web tables (Lin et al 2010)
bull Interpreting web tables to answer complex search queries over the web tables (Limaye et al 2010)
Interpreting a Table
17
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
T2LD Framework
18
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
19
Predicting Class Labels for column
City
Baltimore
Boston
New York
Type
Instance
Type
Type
Type
Class Type for the column
Querying the KnowledgendashBase
City
Baltimore
Boston
New York
1Baltimore2 Baltimore County3 John Baltimore
1Boston2 Boston_(band)3 Boston_University
1 New_York_City2 New_York3 New_York_(album)
21
dbpedia-owlPlace dbpedia-owlAdminstrativeRegion dbpedia-owlCity dbpedia-owlAreayagoAmericanConductorsyagoLivingPeople
Types
dbpedia-owlPlace dbpedia-owlPopulatedPlace dbpedia-owlBand dbpedia-owlOrganisation hellip hellip hellip
helliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip
Scoring the classesPossible Classes for the column - dbpedia-owlPlacedbpedia-owlAdminstrativeRegiondbpediaowlCityyagoAmericanConductorsyagoLivingPeople dbpedia-owlBanddbpedia-owlOrganisationhelliphelliphellip
[Baltimore dbpedia-owlCity][Boston dbpedia-owlCity][New York dbpedia-owlCity] helliphellip[Baltimoredbpedia-owlBand][Bostondbpedia-owlBand]helliphelliphellip
Eg Processing class ndash ldquodbpedia-owlCityrdquo
String Baltimore (R = 1) Baltimore dbpedia-owlCity dbpedia-owlPlace [PR = 6](R = 2) Baltimore County dbpedia-owlAdministrativeRegion [PR = 4](R = 3) John Baltimore yagoAmericanConductorsyagoLivingPeople [PR = 5]
Score = w x ( 1 R ) + (1 ndash w) x (Normalized Page Rank)[Baltimore dbpediaCity] = (025 x 1 1 ) + (075 x 6 7) = 0892
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
23
Approach
Table Cell + Column Header + Row Data
+ Column Type
Requery KB with predicted class labels as additional evidence
Generate a feature vector for the top N results of the query
Classifier ranks the entities within the set
of possible results
Select the highest ranked entity
Classifier decides whether to link or
not
Link to ldquoNILrdquoLink to the top
ranked instance
24
Learning to Rank
bull We trained a SVMrank classifier which learnt to rank entities within a given set
Feature Vector
Similarity Measures
Popularity Measures
bull Levenshtein distancebull Dice Score
bull Wikitology Scorebull PageRankbull Page Length
25
ldquoTo Link or not to Link hellip rsquorsquo
bull The highest ranked entity may not the correct one to link to hellip ndash Because the string we are querying may not be in
the KBndash Top N results may not include the correct answer
bull We trained an SVM classifier which would determine whether to link to the top one or not
26
ldquoTo Link or not to Link hellip rsquorsquo
bull Feature vector included the feature vector of the top ranked entity and additional two features ndash
ndash The SVMrank score of the top ranked entityndash The difference in scores between the top two
ranked entities
27
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
28
Relation between columns
City
Baltimore
Boston
New York
State
Maryland
Massachusetts
New York
29
Relation between columns
Maryland - Baltimore
Massachusetts - Boston
New York - New York
dbontoLargestCity
dbontoLargestCitydbontoCapital
dbontoLargestCity
dbontoCapital dbontoLargestCity
Candidate relations
30
Scoring the relations
Maryland - Baltimore
Massachusetts - Boston
New York - New York
dbontoLargestCity
dbontoLargestCitydbontoCapital
dbontoLargestCity
Candidates dbontoCapital
dbontoLargestCity
dbontoCapital Score0
dbontoCapital Score1
dbontoLargestCity Score3
31
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Annotating web tables for the Semantic Web
Table as linked RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .

"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .

"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
"MD"@en is rdfs:label of dbpedia:Maryland .
dbpedia:Maryland a dbpedia-owl:AdministrativeRegion .

dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:range dbpedia-owl:City .

Reading the statements:
"City"@en is rdfs:label of dbpedia-owl:City
  "City" is the common human-readable name for the class dbpedia-owl:City
dbpedia:Baltimore a dbpedia-owl:City
  dbpedia:Baltimore is of type (an instance of) dbpedia-owl:City
dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion
  The subjects of the triples using the property have to be instances of dbpedia-owl:AdministrativeRegion
dbpprop:LargestCity rdfs:range dbpedia-owl:City
  The objects of the triples using the property have to be instances of dbpedia-owl:City
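Once column classes, cell links and the column relation are known, producing this kind of output is mostly string assembly. The sketch below is an illustration under assumptions, not the T2LD serializer: the function name and argument shapes are invented for the example, it writes plain subject-predicate-object Turtle rather than the "is … of" shorthand used on the slide, and it does not escape special characters in labels.

```python
from typing import Dict, List, Tuple

PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dbpedia": "http://dbpedia.org/resource/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "dbpprop": "http://dbpedia.org/property/",
}


def table_to_turtle(columns: Dict[str, str],
                    cells: List[Tuple[str, str, str]],
                    relation: Tuple[str, str, str]) -> str:
    """Serialize one interpreted table as Turtle text.

    columns:  column header -> predicted class
    cells:    (cell text, linked entity, entity class)
    relation: (property, domain class, range class)
    """
    lines = [f"@prefix {p}: <{iri}> ." for p, iri in PREFIXES.items()]
    lines.append("")
    for header, cls in columns.items():
        lines.append(f'{cls} rdfs:label "{header}"@en .')
    for text, entity, cls in cells:
        lines.append(f'{entity} rdfs:label "{text}"@en .')
        lines.append(f"{entity} a {cls} .")
    prop, domain, rng = relation
    lines.append(f"{prop} rdfs:domain {domain} .")
    lines.append(f"{prop} rdfs:range {rng} .")
    return "\n".join(lines)


print(table_to_turtle(
    {"City": "dbpedia-owl:City", "State": "dbpedia-owl:AdministrativeRegion"},
    [("Baltimore", "dbpedia:Baltimore", "dbpedia-owl:City"),
     ("MD", "dbpedia:Maryland", "dbpedia-owl:AdministrativeRegion")],
    ("dbpprop:LargestCity", "dbpedia-owl:AdministrativeRegion", "dbpedia-owl:City"),
))
```

A production version would more likely build the graph with an RDF library so that escaping and validation are handled for it.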
Results
Dataset summary
Number of tables: 15
Total number of rows: 199
Total number of columns: 56 (52)
Total number of entities: 639 (611)
Numbers in brackets exclude columns that contained numbers.
Evaluation for class label predictions
Evaluation 1 (MAP)
• Compared the system's ranked list of labels against a human-ranked list of labels
• Metric: Mean Average Precision (MAP)
• Commonly used in the Information Retrieval domain to compare two ranked sets
Evaluation 1 (MAP)
80.76%
Rank  System ranked  Evaluator ranked
1     Person         President
2     Politician     Politician
3     President      OfficeHolder
Evaluation 2 (Recall)
Recall > 0.6 (75%)
Rank  System ranked  Evaluator ranked
1     Person         President
2     Politician     Politician
3     President      OfficeHolder
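For the example column on these slides, both metrics can be worked through by hand. The sketch below uses the textbook definition of average precision and treats the evaluator's labels as an unordered set of relevant labels. The talk does not say which exact variant was used, so this choice and the function names are assumptions. MAP is then the mean of the per-column average precision values.

```python
from typing import List, Set


def average_precision(system: List[str], relevant: Set[str]) -> float:
    """Average of precision@k over the ranks k where a relevant label appears,
    normalized by the number of relevant labels."""
    hits = 0
    total = 0.0
    for k, label in enumerate(system, start=1):
        if label in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def recall(system: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant labels that the system returned at all."""
    return len(set(system) & relevant) / len(relevant) if relevant else 0.0


system = ["Person", "Politician", "President"]
evaluator = {"President", "Politician", "OfficeHolder"}

print(round(average_precision(system, evaluator), 3))  # 0.389
print(round(recall(system, evaluator), 3))             # 0.667
```

Here the system finds two of the three evaluator labels, so this column would count towards the "recall > 0.6" group.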
Evaluation 3 (Correctness)
• Evaluated whether our predicted class labels were “fair and correct”
• A class label may not be the most accurate one but may still be correct
  - E.g. dbpedia:PopulatedPlace is not the most accurate label for a column of cities, but it is still a correct one
• Three human judges evaluated our predicted class labels
Evaluation 3 (Correctness)
• A category-wise breakdown for class label correctness
Overall accuracy: 76.92%
Column: Nationality, Prediction: MilitaryConflict
Column: Birth Place, Prediction: PopulatedPlace
Evaluation for linking table cells to entities
Category-wise accuracy for linking table cells
Overall accuracy: 66.12%
Relation between columns
• Idea: ask human evaluators to identify relations between columns in a given table
• Pilot experiment: asked three evaluators to annotate five random tables from our dataset
• Evaluators identified 20 relations
• Our accuracy: 5 out of 20 (25%) were correct
Future Work
Current
Automatic/Semi-automatic template learning
Confirming LD Facts
Baltimore, MD, S. Rawlings …
For Baltimore, DBpedia says
dbpprop:LeaderName - S. Dixon
dbpprop:LeaderName - S. Dixon, S. Rawlings
Discover knowledge relations
• Inception rdf:type dbpedia-owl:Movie
• Howard County rdf:type dbpedia:AdministrativeRegion
• David Beckham dbpedia-owl:Team dbpedia:Los_Angeles_Galaxy
Conclusion
• There's a lot of data stored in HTML tables, spreadsheets, databases and documents
• We presented an automatic framework to interpret such data
• We believe our work will contribute to materializing the web of data vision
References
• Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A.: RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference. Springer (2008)
• Langegger, A., Wöß, W.: XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. In: 8th International Semantic Web Conference (ISWC2009) (2009)
• Cafarella, M.J., Halevy, A.Y., Wang, Z.D., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. PVLDB 1 (2008) 538-549
• Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB) (2010)
• Lin, C.X., Zhao, B., Weninger, T., Han, J., Liu, B.: Entity relation discovery from web tables and links. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.): WWW, 1145–1146. ACM (2010)
On the Semantic Web hellip
bull Mapping Relational databases to RDF [W3C working group ndash RDB2RDF]
bull Mapping spreadsheets to RDF [RDF123 XLWrap]
bull Practical and helpful systems but hellip ndash Require significant manual workndash Do not generate linked data
hellip elsewhere
bull Learning to index tables to improve search experience (Cafarella et al 2008)
bull Expanding attributes (columns) of web tables (Lin et al 2010)
bull Interpreting web tables to answer complex search queries over the web tables (Limaye et al 2010)
Interpreting a Table
17
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
T2LD Framework
18
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
19
Predicting Class Labels for column
City
Baltimore
Boston
New York
Type
Instance
Type
Type
Type
Class Type for the column
Querying the Knowledge Base
Column header: City
Baltimore: 1. Baltimore 2. Baltimore County 3. John Baltimore
Boston: 1. Boston 2. Boston_(band) 3. Boston_University
New York: 1. New_York_City 2. New_York 3. New_York_(album)
Types
dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, dbpedia-owl:City, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople
dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, …
…
Scoring the classes
Possible classes for the column: dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, dbpedia-owl:City, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:Band, dbpedia-owl:Organisation, …
[Baltimore, dbpedia-owl:City], [Boston, dbpedia-owl:City], [New York, dbpedia-owl:City], … [Baltimore, dbpedia-owl:Band], [Boston, dbpedia-owl:Band], …
E.g. processing class "dbpedia-owl:City"
String: Baltimore
(R = 1) Baltimore: dbpedia-owl:City, dbpedia-owl:Place [PR = 6]
(R = 2) Baltimore County: dbpedia-owl:AdministrativeRegion [PR = 4]
(R = 3) John Baltimore: yago:AmericanConductors, yago:LivingPeople [PR = 5]
Score = w × (1 / R) + (1 − w) × (Normalized PageRank)
[Baltimore, dbpedia-owl:City] = (0.25 × 1/1) + (0.75 × 6/7) = 0.892
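The scoring formula can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the worked example: the function name `class_score` and its parameters are hypothetical, and dividing the PageRank by 7 is assumed to be the normalization because that is the denominator used for Baltimore.

```python
def class_score(rank: int, page_rank: float, max_page_rank: float, w: float = 0.25) -> float:
    """Score = w * (1 / R) + (1 - w) * normalized PageRank.

    rank          -- 1-based position R of the entity in the KB query results
    page_rank     -- PageRank value PR reported for that entity
    max_page_rank -- value used to normalize PR (assumed to be 7, as in the example)
    w             -- weight of the rank term (0.25 in the example)
    """
    if rank < 1:
        raise ValueError("rank is 1-based and must be >= 1")
    normalized_pr = page_rank / max_page_rank
    return w * (1.0 / rank) + (1.0 - w) * normalized_pr


# [Baltimore, dbpedia-owl:City]: R = 1, PR = 6, normalized by 7
score = class_score(rank=1, page_rank=6, max_page_rank=7)
print(f"{score:.4f}")
```

The result is 0.25 + 0.75 × 6/7 ≈ 0.8929, which the example above reports truncated as 0.892.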
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Approach
Table Cell + Column Header + Row Data + Column Type
Requery KB with predicted class labels as additional evidence
Generate a feature vector for the top N results of the query
Classifier ranks the entities within the set of possible results
Select the highest ranked entity
Classifier decides whether to link or not
  – Link to "NIL"
  – Link to the top ranked instance
Learning to Rank
• We trained an SVMrank classifier which learnt to rank entities within a given set
Feature Vector
Similarity measures
• Levenshtein distance
• Dice score
Popularity measures
• Wikitology score
• PageRank
• Page length
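The two similarity measures can be sketched as follows. This is a minimal illustration, not the implementation used in T2LD: the slides do not say which units the Dice score was computed over, so character bigrams (as a set, case-insensitive) are an assumption, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def bigrams(s: str) -> set:
    """Set of lower-cased character bigrams (assumed unit for the Dice score)."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_score(a: str, b: str) -> float:
    """Dice coefficient 2|X ∩ Y| / (|X| + |Y|) over the bigram sets of a and b."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings too short to have bigrams: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


print(levenshtein("Baltimore", "Baltimore County"))
print(dice_score("Boston", "Boston_(band)"))
```

A cell string and a candidate entity label would each contribute one Levenshtein value and one Dice value to the feature vector, next to the popularity measures.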
"To Link or not to Link …"
• The highest ranked entity may not be the correct one to link to …
  – Because the string we are querying may not be in the KB
  – Top N results may not include the correct answer
• We trained an SVM classifier which would determine whether to link to the top one or not
"To Link or not to Link …"
• The feature vector included the feature vector of the top ranked entity and two additional features:
  – The SVMrank score of the top ranked entity
  – The difference in scores between the top two ranked entities
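How the input to the link-or-not classifier is assembled can be sketched like this. The function name and argument layout are hypothetical, and using 0.0 as the score gap when only one candidate exists is an assumption, since the slides do not cover that case.

```python
from typing import List, Sequence


def link_decision_features(top_entity_features: Sequence[float],
                           ranked_scores: Sequence[float]) -> List[float]:
    """Feature vector of the top entity plus its rank score and the top-two score gap."""
    if not ranked_scores:
        raise ValueError("need at least one ranked candidate")
    ordered = sorted(ranked_scores, reverse=True)
    top_score = ordered[0]
    # Assumption: with a single candidate there is no runner-up, so the gap is 0.0.
    gap = top_score - ordered[1] if len(ordered) > 1 else 0.0
    return list(top_entity_features) + [top_score, gap]


# Two illustrative similarity/popularity features and three ranking scores
features = link_decision_features([0.1, 0.9], [2.5, 1.0, 0.3])
print(features)
```

The resulting vector would then be passed to a binary classifier (an SVM in the talk) that decides between linking to the top ranked entity and linking to "NIL".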
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Relation between columns
City        State
Baltimore   Maryland
Boston      Massachusetts
New York    New York
Relation between columns
Maryland - Baltimore: dbonto:LargestCity
Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
New York - New York: dbonto:LargestCity
Candidate relations: dbonto:Capital, dbonto:LargestCity
Scoring the relations
Maryland - Baltimore: dbonto:LargestCity
Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
New York - New York: dbonto:LargestCity
Candidates: dbonto:Capital, dbonto:LargestCity
dbonto:Capital, score: 0
dbonto:Capital, score: 1
dbonto:LargestCity, score: 3
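One way to read the relation scores is as a vote count: each row whose pair of linked entities is connected by a relation in the knowledge base adds one vote for that relation. The sketch below assumes that reading; the function name and the hard-coded row data are illustrative, and a real system would obtain the per-row relations by querying the KB.

```python
from collections import Counter
from typing import Iterable, List


def score_relations(row_relations: Iterable[List[str]]) -> Counter:
    """Count, for every candidate relation, the number of rows that support it."""
    votes: Counter = Counter()
    for relations in row_relations:
        votes.update(set(relations))  # a row votes at most once per relation
    return votes


rows = [
    ["dbonto:LargestCity"],                    # Maryland - Baltimore
    ["dbonto:LargestCity", "dbonto:Capital"],  # Massachusetts - Boston
    ["dbonto:LargestCity"],                    # New York - New York
]
votes = score_relations(rows)
best_relation, best_votes = votes.most_common(1)[0]
print(best_relation, best_votes)
```

Counting this way gives dbonto:LargestCity three votes and dbonto:Capital one, in line with the scores of 3 and 1 listed above.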
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Annotating web tables for the Semantic Web
Table as linked RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .
"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
"MD"@en is rdfs:label of dbpedia:Maryland .
dbpedia:Maryland a dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:range dbpedia-owl:City .
How to read these statements:
"City"@en is rdfs:label of dbpedia-owl:City: "City" is the common human name for the class dbpedia-owl:City.
dbpedia:Baltimore a dbpedia-owl:City: dbpedia:Baltimore is of type (is an instance of) dbpedia-owl:City.
dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion: the subjects of the triples using the property have to be instances of dbpedia-owl:AdministrativeRegion.
dbpprop:LargestCity rdfs:range dbpedia-owl:City: the objects of the triples using the property have to be instances of dbpedia-owl:City.
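The domain and range statements can be illustrated with a small check over typed entities. This is a sketch under the slide's reading of rdfs:domain and rdfs:range as requirements on subjects and objects; RDFS itself treats them as entailment rules that infer types rather than reject triples. The dictionaries and the function name are hypothetical stand-ins for KB lookups.

```python
from typing import Dict, Set


def respects_domain_and_range(subject: str, prop: str, obj: str,
                              types: Dict[str, Set[str]],
                              domains: Dict[str, str],
                              ranges: Dict[str, str]) -> bool:
    """True if the subject has the property's domain class and the object its range class."""
    subject_ok = domains[prop] in types.get(subject, set())
    object_ok = ranges[prop] in types.get(obj, set())
    return subject_ok and object_ok


# Types and property constraints taken from the annotated example table
types = {
    "dbpedia:Maryland": {"dbpedia-owl:AdministrativeRegion"},
    "dbpedia:Baltimore": {"dbpedia-owl:City"},
}
domains = {"dbpprop:LargestCity": "dbpedia-owl:AdministrativeRegion"}
ranges = {"dbpprop:LargestCity": "dbpedia-owl:City"}

ok = respects_domain_and_range("dbpedia:Maryland", "dbpprop:LargestCity",
                               "dbpedia:Baltimore", types, domains, ranges)
print(ok)
```

Swapping subject and object (Baltimore as the subject of dbpprop:LargestCity) fails the check, which is how the column order of a candidate relation could be sanity-tested.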
Results
Dataset summary
Number of tables: 15
Total number of rows: 199
Total number of columns: 56 (52)
Total number of entities: 639 (611)
The numbers in brackets exclude columns that contained numbers.
Evaluation for class label predictions
Evaluation 1 (MAP)
• Compared the system's ranked list of labels against a human-ranked list of labels
• Metric: Mean Average Precision (MAP)
• Commonly used in the Information Retrieval domain to compare two ranked sets
Evaluation 1 (MAP)
80.76%
System ranked: 1. Person 2. Politician 3. President
Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
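Average precision for one column can be sketched as below, with MAP being the mean over all columns. The slides do not state how the human-ranked list was turned into relevance judgments, so this sketch assumes every label in the evaluator's list counts as relevant and uses the standard Information Retrieval normalization by the number of relevant labels; the function names are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def average_precision(system_ranked: Sequence[str], relevant: Iterable[str]) -> float:
    """Mean of precision@k over the positions k where the system lists a relevant label."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, label in enumerate(system_ranked, start=1):
        if label in relevant_set:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant_set)


def mean_average_precision(columns: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Average of the per-column average precision values."""
    if not columns:
        return 0.0
    return sum(average_precision(s, r) for s, r in columns) / len(columns)


system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
print(average_precision(system, evaluator))
```

For the lists on the slide, the system hits a relevant label at positions 2 and 3, so the average precision under these assumptions is (1/2 + 2/3) / 3 = 7/18.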
Evaluation 2 (Recall)
Recall > 0.6 (75%)
System ranked: 1. Person 2. Politician 3. President
Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
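Recall for a single column can be sketched as the fraction of the evaluator's labels that also appear in the system's list. The function name is hypothetical, and treating both lists as unordered sets is an assumption about how recall was computed here.

```python
from typing import Iterable


def recall(system_labels: Iterable[str], evaluator_labels: Iterable[str]) -> float:
    """Fraction of the evaluator's labels that the system also predicted."""
    gold = set(evaluator_labels)
    if not gold:
        return 0.0
    return len(gold & set(system_labels)) / len(gold)


system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
column_recall = recall(system, evaluator)
print(round(column_recall, 2))
```

In this example the system recovers two of the three evaluator labels (Politician and President), giving a recall of 2/3, which is above the 0.6 threshold.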
Evaluation 3 (Correctness)
• Evaluated whether our predicted class labels were "fair and correct"
• A class label may not be the most accurate one but may still be correct
  – E.g. dbpedia-owl:PopulatedPlace is not the most accurate but still a correct label for a column of cities
• Three human judges evaluated our predicted class labels
Evaluation 3 (Correctness)
• A category-wise breakdown for class label correctness
Overall accuracy: 76.92%
Column: Nationality, Prediction: MilitaryConflict
Column: Birth Place, Prediction: PopulatedPlace
Evaluation for linking table cells to entities
Category-wise accuracy for linking table cells
Overall accuracy: 66.12%
Relation between columns
• Idea: ask human evaluators to identify relations between columns in a given table
• Pilot experiment: asked three evaluators to annotate five random tables from our dataset
• Evaluators identified 20 relations
• Our accuracy: 5 out of 20 (25%) were correct
Future Work
Current
Automatic / semi-automatic template learning
Confirming LD facts
Table row: Baltimore, MD, S. Rawlings, …
For Baltimore, DBpedia says:
dbpprop:LeaderName – S. Dixon
dbpprop:LeaderName – S. Dixon, S. Rawlings
Discover knowledge relations
• Inception rdf:type dbpedia-owl:Movie
• Howard County rdf:type dbpedia-owl:AdministrativeRegion
• David Beckham dbpedia-owl:Team dbpedia:Los_Angeles_Galaxy
Conclusion
• There is a lot of data stored in HTML tables, spreadsheets, databases and documents
• We presented an automatic framework to interpret such data
• We believe our work will contribute to materializing the web of data vision
References
• Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A.: RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference. Springer (2008)
• Langegger, A., Wöß, W.: XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. In: 8th International Semantic Web Conference (ISWC2009) (2009)
• Cafarella, M.J., Halevy, A.Y., Wang, Z.D., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. PVLDB 1 (2008) 538-549
• Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB) (2010)
• Lin, C.X., Zhao, B., Weninger, T., Han, J., Liu, B.: Entity relation discovery from web tables and links. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.): WWW, 1145–1146. ACM (2010)
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Predicting Class Labels for column
[Figure: a column with header “City” and cells Baltimore, Boston, New York. Each cell is an instance with a type; the types of the instances give the class type for the column.]
Querying the Knowledge-Base
Column: City
Ranked results for each cell value:
• Baltimore: 1. Baltimore, 2. Baltimore County, 3. John Baltimore
• Boston: 1. Boston, 2. Boston_(band), 3. Boston_University
• New York: 1. New_York_City, 2. New_York, 3. New_York_(album)
Types of the results:
• dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, dbpedia-owl:City, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople
• dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, …
Scoring the classes
Possible classes for the column: dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, dbpedia-owl:City, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:Band, dbpedia-owl:Organisation, …
Pairs to score: [Baltimore, dbpedia-owl:City], [Boston, dbpedia-owl:City], [New York, dbpedia-owl:City], …, [Baltimore, dbpedia-owl:Band], [Boston, dbpedia-owl:Band], …
E.g. processing class “dbpedia-owl:City” for the string Baltimore:
• (R = 1) Baltimore: dbpedia-owl:City, dbpedia-owl:Place [PR = 6]
• (R = 2) Baltimore County: dbpedia-owl:AdministrativeRegion [PR = 4]
• (R = 3) John Baltimore: yago:AmericanConductors, yago:LivingPeople [PR = 5]
Score = w × (1 / R) + (1 – w) × (Normalized PageRank)
[Baltimore, dbpedia-owl:City] = (0.25 × 1/1) + (0.75 × 6/7) ≈ 0.893
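The worked example above can be reproduced with a few lines of Python. This is only a sketch: the function name `pair_score` is made up for illustration, and dividing the PageRank by 7 is an assumption read off the "6 / 7" in the example. The slide does not say how the pair scores are combined into a final class for the column, so that step is left out.

```python
def pair_score(rank, page_rank, w=0.25, max_page_rank=7):
    """Score one (cell string, class) pair.

    rank          -- position R of the matching entity in the query results (1-based)
    page_rank     -- PageRank value PR of that entity
    w             -- weight from the slide's example (0.25)
    max_page_rank -- assumed normalization constant (the example divides by 7)
    """
    if rank < 1:
        raise ValueError("rank is 1-based")
    normalized_pr = page_rank / max_page_rank
    return w * (1.0 / rank) + (1.0 - w) * normalized_pr


# [Baltimore, dbpedia-owl:City]: the entity Baltimore has R = 1 and PR = 6
print(round(pair_score(rank=1, page_rank=6), 4))  # 0.8929
```

A lower-ranked or less popular match gives a smaller score, which is what lets dbpedia-owl:City outrank a class such as yago:LivingPeople for this cell.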
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Approach
Input: Table Cell + Column Header + Row Data + Column Type
• Requery KB with predicted class labels as additional evidence
• Generate a feature vector for the top N results of the query
• Classifier ranks the entities within the set of possible results
• Select the highest ranked entity
• Classifier decides whether to link or not
  – Link to the top ranked instance
  – Link to “NIL”
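The flow on this slide can be sketched as one function. Everything named here is hypothetical: `query_kb`, `rank_score` and `should_link` are placeholder callables standing in for the knowledge-base query, the ranking classifier (including feature extraction) and the link/no-link classifier, and the toy stand-ins at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, Optional, Sequence


def link_cell(
    cell: str,
    column_type: str,
    query_kb: Callable[[str, str], Sequence[str]],
    rank_score: Callable[[str, str], float],
    should_link: Callable[[str, float, float], bool],
    top_n: int = 3,
) -> Optional[str]:
    """Return the entity to link the cell to, or None to stand for "NIL"."""
    # Requery the KB, using the predicted column type as additional evidence.
    candidates = list(query_kb(cell, column_type))[:top_n]
    if not candidates:
        return None
    # Rank the candidates; rank_score stands in for feature vector + ranker.
    scored = sorted(((rank_score(cell, c), c) for c in candidates), reverse=True)
    top_score, top_entity = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Second classifier: link to the top ranked entity, or to NIL.
    if should_link(top_entity, top_score, top_score - runner_up):
        return top_entity
    return None


# Toy stand-ins, purely illustrative.
toy_kb = {"Baltimore": ["Baltimore", "Baltimore_County", "John_Baltimore"]}

def toy_query(cell, column_type):
    return toy_kb.get(cell, [])

def toy_rank(cell, entity):
    return 1.0 if entity == cell else 0.2

def toy_decide(entity, score, margin):
    return margin > 0.5

print(link_cell("Baltimore", "dbpedia-owl:City", toy_query, toy_rank, toy_decide))  # Baltimore
print(link_cell("Atlantis", "dbpedia-owl:City", toy_query, toy_rank, toy_decide))   # None
```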
Learning to Rank
• We trained an SVMrank classifier which learnt to rank entities within a given set
Feature Vector
• Similarity Measures
  – Levenshtein distance
  – Dice Score
• Popularity Measures
  – Wikitology Score
  – PageRank
  – Page Length
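The two similarity measures in the feature vector are standard string metrics, and a minimal sketch of both follows. The slide does not say what units the Dice score was computed over, so the use of character bigrams (and lower-casing) is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def dice_score(a: str, b: str) -> float:
    """Dice coefficient over sets of character bigrams (assumed tokenization)."""
    def bigrams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


# Comparing a cell string with a candidate entity label
print(levenshtein("Baltimore", "Baltimore County"))  # 7
print(dice_score("Baltimore", "Baltimore County"))
```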
“To Link or not to Link …”
• The highest ranked entity may not be the correct one to link to …
  – Because the string we are querying may not be in the KB
  – Top N results may not include the correct answer
• We trained an SVM classifier which would determine whether to link to the top one or not
• Feature vector included the feature vector of the top ranked entity and two additional features:
  – The SVMrank score of the top ranked entity
  – The difference in scores between the top two ranked entities
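Building the input for the link/no-link classifier amounts to appending two numbers to the top entity's feature vector. The sketch below assumes the ranker scores arrive sorted in descending order; the function name and the fallback of 0.0 when there is only one candidate are illustrative assumptions, not something the slides specify.

```python
from typing import List, Sequence


def link_decision_features(top_features: Sequence[float],
                           ranked_scores: Sequence[float]) -> List[float]:
    """Feature vector for the link / do-not-link classifier.

    top_features  -- feature vector of the top ranked entity
    ranked_scores -- ranker scores of the candidates, highest first
    """
    if not ranked_scores:
        raise ValueError("need at least one ranked entity")
    top = ranked_scores[0]
    # Assumption: with a single candidate, the runner-up score is taken as 0.0.
    second = ranked_scores[1] if len(ranked_scores) > 1 else 0.0
    return list(top_features) + [top, top - second]


print(link_decision_features([0.0, 1.0, 0.8], [2.5, 1.0, 0.3]))
# [0.0, 1.0, 0.8, 2.5, 1.5]
```

A small gap between the top two scores signals an ambiguous match, which is the kind of case where linking to “NIL” may be the safer choice.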
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Relation between columns
City        State
Baltimore   Maryland
Boston      Massachusetts
New York    New York
Relation between columns
Candidate relations for each row:
• Maryland - Baltimore: dbonto:LargestCity
• Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
• New York - New York: dbonto:LargestCity
Candidate relations: dbonto:Capital, dbonto:LargestCity
Scoring the relations
• dbonto:Capital: Score 1
• dbonto:LargestCity: Score 3
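The scores on this slide are consistent with simple voting: each row contributes one vote to every relation that holds between its two cell values in the knowledge base. A minimal sketch under that assumption, with the per-row candidates copied from the example rather than queried from DBpedia:

```python
from collections import Counter


def score_relations(candidates_per_row):
    """Count, for each candidate relation, the number of rows it holds for."""
    scores = Counter()
    for relations in candidates_per_row:
        scores.update(set(relations))  # one vote per relation per row
    return scores


rows = {
    ("Maryland", "Baltimore"): ["dbonto:LargestCity"],
    ("Massachusetts", "Boston"): ["dbonto:LargestCity", "dbonto:Capital"],
    ("New York", "New York"): ["dbonto:LargestCity"],
}

scores = score_relations(rows.values())
print(scores.most_common())  # [('dbonto:LargestCity', 3), ('dbonto:Capital', 1)]
```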
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Annotating web tables for the Semantic Web
Table as linked RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .

"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .

"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
"MD"@en is rdfs:label of dbpedia:Maryland .
dbpedia:Maryland a dbpedia-owl:AdministrativeRegion .

dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:range dbpedia-owl:City .

Reading the statements:
• "City"@en is rdfs:label of dbpedia-owl:City
  – “City” is the common human name for the class dbpedia-owl:City
• dbpedia:Baltimore a dbpedia-owl:City
  – dbpedia:Baltimore is an instance of the type dbpedia-owl:City
• dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion
  – The subjects of the triples using the property have to be instances of dbpedia-owl:AdministrativeRegion
• dbpprop:LargestCity rdfs:range dbpedia-owl:City
  – The objects of the triples using the property have to be instances of dbpedia-owl:City
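As an illustration of how such annotations might be emitted, the sketch below generates the label and type statements for one column. It is not the T2LD implementation: the function name is hypothetical, the statements are written subject-first (the plain Turtle equivalent of the "is rdfs:label of" form), and no escaping of quotes in cell text is attempted.

```python
def annotate_column(header, class_curie, cells):
    """Return Turtle-style statements for one annotated column.

    header      -- column header text, e.g. "City"
    class_curie -- predicted class, e.g. "dbpedia-owl:City"
    cells       -- mapping from cell text to the linked entity
    """
    lines = [f'{class_curie} rdfs:label "{header}"@en .']
    for text, entity in cells.items():
        lines.append(f'{entity} rdfs:label "{text}"@en .')
        lines.append(f'{entity} a {class_curie} .')
    return lines


for line in annotate_column("City", "dbpedia-owl:City",
                            {"Baltimore": "dbpedia:Baltimore"}):
    print(line)
```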
Results
Dataset summary
Number of tables: 15
Total number of rows: 199
Total number of columns: 56 (52)
Total number of entities: 639 (611)
The numbers in brackets exclude columns that contained numbers.
Evaluation for class label predictions
Evaluation 1 (MAP)
• Compared the system’s ranked list of labels against a human-ranked list of labels
• Metric: Mean Average Precision (MAP)
• Commonly used in the Information Retrieval domain to compare two ranked sets
[Chart value: 80.76%]
• System ranked: 1. Person, 2. Politician, 3. President
• Evaluator ranked: 1. President, 2. Politician, 3. OfficeHolder
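For readers unfamiliar with the metric, here is a small sketch of average precision applied to the example lists on this slide. It uses the common textbook definition, treating the evaluator's labels as an unordered relevant set and dividing by the size of that set; whether the original evaluation normalized the same way is not stated, so treat this as an assumption.

```python
def average_precision(system_ranked, relevant):
    """Average precision of a ranked list against a set of relevant labels."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, label in enumerate(system_ranked, start=1):
        if label in relevant:
            hits += 1
            total += hits / k  # precision at this cut-off
    return total / len(relevant)


def mean_average_precision(pairs):
    """Mean of average precision over (system_ranked, relevant) pairs."""
    pairs = list(pairs)
    return sum(average_precision(s, r) for s, r in pairs) / len(pairs)


system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
print(round(average_precision(system, evaluator), 3))  # 0.389
```

In this example the system only hits at ranks 2 and 3, giving (1/2 + 2/3) / 3.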
Evaluation 2 (Recall)
Recall > 0.6 (75%)
• System ranked: 1. Person, 2. Politician, 3. President
• Evaluator ranked: 1. President, 2. Politician, 3. OfficeHolder
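Recall for one column can be sketched as the share of the evaluator's labels that also appear in the system's list. This is the usual set-based definition and is assumed here; the helper name is illustrative.

```python
def recall(system_ranked, evaluator_labels):
    """Fraction of the evaluator's labels that the system also returned."""
    wanted = set(evaluator_labels)
    if not wanted:
        return 0.0
    return len(wanted & set(system_ranked)) / len(wanted)


system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
print(round(recall(system, evaluator), 3))  # 0.667
```

For the example lists, two of the three evaluator labels are recovered, so this column would count towards the columns with recall above 0.6.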
Evaluation 3 (Correctness)
• Evaluated whether our predicted class labels were “fair and correct”
• Class label may not be the most accurate one, but may be correct
  – E.g. dbpedia-owl:PopulatedPlace is not the most accurate, but still a correct label for a column of cities
• Three human judges evaluated our predicted class labels
• A category-wise breakdown for class label correctness
Overall Accuracy: 76.92%
• Column – Nationality, Prediction – MilitaryConflict
• Column – Birth Place, Prediction – PopulatedPlace
Evaluation for linking table cells to entities
Category-wise accuracy for linking table cells
Overall Accuracy: 66.12%
Relation between columns
• Idea – Ask human evaluators to identify relations between columns in a given table
• Pilot Experiment – Asked three evaluators to annotate five random tables from our dataset
• Evaluators identified 20 relations
• Our accuracy – 5 out of 20 (25%) were correct
Future Work
Current
Automatic/Semi-automatic template learning
Confirming LD Facts
Table row: Baltimore, MD, S. Rawlings, …
For Baltimore, DBpedia says:
dbpprop:LeaderName – S. Dixon
dbpprop:LeaderName – S. Dixon, S. Rawlings
Discover knowledge relations
• Inception rdf:type dbpedia-owl:Movie
• Howard County rdf:type dbpedia-owl:AdministrativeRegion
• David Beckham dbpedia-owl:Team dbpedia:Los_Angeles_Galaxy
Conclusion
• There’s a lot of data stored in HTML tables, spreadsheets, databases and documents
• We presented an automatic framework to interpret such data
• We believe our work will contribute to materializing the web of data vision
References
• Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A.: RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference. Springer (2008)
• Langegger, A., Wöß, W.: XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. In: 8th International Semantic Web Conference (ISWC2009) (2009)
• Cafarella, M.J., Halevy, A.Y., Wang, Z.D., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. PVLDB 1 (2008) 538–549
• Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int’l Conference on Very Large Databases (VLDB) (2010)
• Lin, C.X., Zhao, B., Weninger, T., Han, J., Liu, B.: Entity relation discovery from web tables and links. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.) WWW, 1145–1146. ACM (2010)
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
23
Approach
Table Cell + Column Header + Row Data
+ Column Type
Requery KB with predicted class labels as additional evidence
Generate a feature vector for the top N results of the query
Classifier ranks the entities within the set
of possible results
Select the highest ranked entity
Classifier decides whether to link or
not
Link to ldquoNILrdquoLink to the top
ranked instance
24
Learning to Rank
bull We trained a SVMrank classifier which learnt to rank entities within a given set
Feature Vector
Similarity Measures
Popularity Measures
bull Levenshtein distancebull Dice Score
bull Wikitology Scorebull PageRankbull Page Length
25
“To Link or not to Link …”
• The highest ranked entity may not be the correct one to link to …
  - Because the string we are querying may not be in the KB
  - Top N results may not include the correct answer
• We trained an SVM classifier which would determine whether to link to the top one or not
“To Link or not to Link …”
• The feature vector included the feature vector of the top ranked entity and two additional features:
  - The SVMrank score of the top ranked entity
  - The difference in scores between the top two ranked entities
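The input to the link/no-link classifier can be assembled as sketched below. The tuple layout, the function name and the handling of a single-candidate list are assumptions made for illustration; the slides only name the three ingredients.

```python
from typing import List, Sequence, Tuple

# (entity, SVMrank score, ranking feature vector), best candidate first
Ranked = Sequence[Tuple[str, float, Sequence[float]]]


def link_decision_features(ranked: Ranked) -> List[float]:
    """Top entity's features + its SVMrank score + score gap to the runner-up.

    Expects a non-empty list that is already sorted by descending score.
    """
    _, top_score, top_features = ranked[0]
    # Assumption: with a single candidate, the runner-up score is taken as 0.0.
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    return [*top_features, top_score, top_score - runner_up_score]


# Made-up scores and features for two candidates of the cell "Baltimore".
ranked_example = [
    ("dbpedia:Baltimore", 2.5, [0.0, 1.0]),
    ("dbpedia:Baltimore_County", 1.0, [7.0, 0.72]),
]
decision_features = link_decision_features(ranked_example)
```

A large gap between the top two scores suggests an unambiguous match, while a small gap or a low top score is the kind of signal that could push the classifier towards “NIL”.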
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Relation between columns
City        State
Baltimore   Maryland
Boston      Massachusetts
New York    New York
Relation between columns
Maryland - Baltimore: dbonto:LargestCity
Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
New York - New York: dbonto:LargestCity
Candidate relations: dbonto:Capital, dbonto:LargestCity
Scoring the relations
Maryland - Baltimore: dbonto:LargestCity
Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
New York - New York: dbonto:LargestCity
Candidates: dbonto:Capital, dbonto:LargestCity
dbonto:Capital Score: 0
dbonto:Capital Score: 1
dbonto:LargestCity Score: 3
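One way to read the scoring above is as a vote: every row pair contributes one point to each relation the knowledge base holds between its two entities, and the relation with the most points is chosen for the column pair. The sketch below follows that reading, which is an assumption; the per-row candidates are the ones listed on the slide, and the function name is illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def score_relations(row_candidates: Iterable[Iterable[str]]) -> Counter:
    """Count, for each candidate relation, the number of rows that support it."""
    votes: Counter = Counter()
    for relations in row_candidates:
        votes.update(set(relations))  # at most one vote per relation per row
    return votes


# Relations found between the linked (State, City) entities of each row.
rows: Dict[Tuple[str, str], List[str]] = {
    ("Maryland", "Baltimore"): ["dbonto:LargestCity"],
    ("Massachusetts", "Boston"): ["dbonto:LargestCity", "dbonto:Capital"],
    ("New York", "New York"): ["dbonto:LargestCity"],
}

scores = score_relations(rows.values())
best_relation, best_score = scores.most_common(1)[0]
```

With these rows, dbonto:Capital ends at 1 and dbonto:LargestCity at 3, so dbonto:LargestCity is selected.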
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Annotating web tables for the Semantic Web
Table as linked RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .

"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .

"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
"MD"@en is rdfs:label of dbpedia:Maryland .
dbpedia:Maryland a dbpedia-owl:AdministrativeRegion .

dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:range dbpedia-owl:City .

"City"@en is rdfs:label of dbpedia-owl:City
  “City” is the common human name for the class dbpedia-owl:City
dbpedia:Baltimore a dbpedia-owl:City
  dbpedia:Baltimore is of type (an instance of) dbpedia-owl:City
dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion
  The subjects of the triples using the property have to be instances of dbpedia-owl:AdministrativeRegion
dbpprop:LargestCity rdfs:range dbpedia-owl:City
  The objects of the triples using the property have to be instances of dbpedia-owl:City
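As a rough illustration, the annotations for a table could be written out with plain string formatting. This sketch emits ordinary subject-predicate-object Turtle rather than the "is ... of" form shown on the slide (both state the same triples). The function name and argument layout are assumptions, and literals are not escaped, so it is only suitable for simple labels.

```python
from typing import Dict, List, Tuple

PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dbpedia": "http://dbpedia.org/resource/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "dbpprop": "http://dbpedia.org/property/",
}


def to_turtle(
    column_classes: Dict[str, str],           # column header -> predicted class
    cell_links: List[Tuple[str, str, str]],   # (cell text, linked entity, class)
    relation: Tuple[str, str, str],           # (property, domain class, range class)
) -> str:
    """Serialize column, cell and relation annotations as Turtle text."""
    lines = [f"@prefix {name}: <{uri}> ." for name, uri in PREFIXES.items()]
    for header, cls in column_classes.items():
        lines.append(f'{cls} rdfs:label "{header}"@en .')
    for text, entity, cls in cell_links:
        lines.append(f'{entity} rdfs:label "{text}"@en .')
        lines.append(f"{entity} a {cls} .")
    prop, domain, range_ = relation
    lines.append(f"{prop} rdfs:domain {domain} .")
    lines.append(f"{prop} rdfs:range {range_} .")
    return "\n".join(lines)


turtle = to_turtle(
    {"City": "dbpedia-owl:City", "State": "dbpedia-owl:AdministrativeRegion"},
    [("Baltimore", "dbpedia:Baltimore", "dbpedia-owl:City"),
     ("MD", "dbpedia:Maryland", "dbpedia-owl:AdministrativeRegion")],
    ("dbpprop:LargestCity", "dbpedia-owl:AdministrativeRegion", "dbpedia-owl:City"),
)
```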
Results
Dataset summary
Number of tables: 15
Total number of rows: 199
Total number of columns: 56 (52)
Total number of entities: 639 (611)
The numbers in brackets exclude columns that contained numbers
Evaluation for class label predictions
Evaluation 1 (MAP)
• Compared the system’s ranked list of labels against a human-ranked list of labels
• Metric - Mean Average Precision (MAP)
• Commonly used in the Information Retrieval domain to compare two ranked sets
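For reference, a minimal sketch of the standard MAP computation. It assumes the evaluator's labels are treated as an unordered set of relevant labels for each column, which may differ from how the comparison was actually set up; the example lists are the President/Politician lists that appear with Evaluation 2.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision@k over the ranks k at which a relevant label occurs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, label in enumerate(ranked, start=1):
        if label in relevant:
            hits += 1
            total += hits / k  # precision at rank k
    return total / len(relevant)


def mean_average_precision(columns: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-column average precision values."""
    values: List[float] = [average_precision(r, rel) for r, rel in columns]
    return sum(values) / len(values) if values else 0.0


system = ["Person", "Politician", "President"]
evaluator = {"President", "Politician", "OfficeHolder"}
ap = average_precision(system, evaluator)  # (1/2 + 2/3) / 3
```

Here "Person" is not in the evaluator's list, so the two matching labels are found at ranks 2 and 3, and the missing "OfficeHolder" lowers the average further.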
Evaluation 1 (MAP)
80.76%
Evaluation 2 (Recall)
System Ranked: 1. Person 2. Politician 3. President
Evaluator Ranked: 1. President 2. Politician 3. OfficeHolder
Recall > 0.6 (75%)
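Recall for a single column can be worked out directly from the two lists above. The sketch assumes recall is the fraction of the evaluator's labels that also appear anywhere in the system's list, ignoring rank; the function name is illustrative.

```python
from typing import Iterable


def recall(predicted: Iterable[str], relevant: Iterable[str]) -> float:
    """Fraction of the relevant labels that appear among the predicted labels."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(predicted) & relevant_set) / len(relevant_set)


system_ranked = ["Person", "Politician", "President"]
evaluator_ranked = ["President", "Politician", "OfficeHolder"]
column_recall = recall(system_ranked, evaluator_ranked)  # 2 of 3 labels found
```

Two of the evaluator's three labels (Politician, President) are recovered, so this column has a recall of 2/3 and counts towards the columns with recall above 0.6.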
Evaluation 3 (Correctness)
• Evaluated whether our predicted class labels were “fair and correct”
• A class label may not be the most accurate one but may still be correct
  - E.g. dbpedia:PopulatedPlace is not the most accurate but still a correct label for a column of cities
• Three human judges evaluated our predicted class labels
Evaluation 3 (Correctness)
• A category-wise breakdown for class label correctness
Overall Accuracy: 76.92%
Column - Nationality, Prediction - MilitaryConflict
Column - Birth Place, Prediction - PopulatedPlace
Evaluation for linking table cells to entities
Category-wise accuracy for linking table cells
Overall Accuracy: 66.12%
Relation between columns
• Idea - Ask human evaluators to identify relations between columns in a given table
• Pilot experiment - Asked three evaluators to annotate five random tables from our dataset
• Evaluators identified 20 relations
• Our accuracy - 5 out of 20 (25%) were correct
Future Work
Current
Automatic/Semi-automatic template learning
Confirming LD Facts
Baltimore | MD | S.Rawlings | …
For Baltimore, DBpedia says:
dbpprop:LeaderName - S.Dixon
dbpprop:LeaderName - S.Dixon, S.Rawlings
Discover knowledge relations
• Inception rdf:type dbpedia-owl:Movie
• Howard County rdf:type dbpedia:AdministrativeRegion
• David Beckham dbpedia-owl:Team dbpedia:Los_Angeles_Galaxy
Conclusion
• There’s a lot of data stored in HTML tables, spreadsheets, databases and documents
• We presented an automatic framework to interpret such data
• We believe our work will contribute to materializing the web of data vision
References
• Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A.: RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference. Springer (2008)
• Langegger, A., Wöß, W.: XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. In: 8th International Semantic Web Conference (ISWC2009) (2009)
• Cafarella, M.J., Halevy, A.Y., Wang, Z.D., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. PVLDB 1 (2008) 538-549
• Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB) (2010)
• Lin, C.X., Zhao, B., Weninger, T., Han, J., Liu, B.: Entity relation discovery from web tables and links. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.) WWW, 1145-1146. ACM (2010)
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
28
Relation between columns
City
Baltimore
Boston
New York
State
Maryland
Massachusetts
New York
29
Relation between columns
Maryland - Baltimore
Massachusetts - Boston
New York - New York
dbontoLargestCity
dbontoLargestCitydbontoCapital
dbontoLargestCity
dbontoCapital dbontoLargestCity
Candidate relations
30
Scoring the relations
Maryland - Baltimore
Massachusetts - Boston
New York - New York
dbontoLargestCity
dbontoLargestCitydbontoCapital
dbontoLargestCity
Candidates dbontoCapital
dbontoLargestCity
dbontoCapital Score0
dbontoCapital Score1
dbontoLargestCity Score3
31
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
32
Annotating web tables for the Semantic Web
Table as linked RDFprefix rdfs lthttpwwww3org200001rdf-schemagt prefix dbpedia lthttpdbpediaorgresourcegt prefix dbpedia-owl lthttpdbpediaorgontologygt prefix dbpprop lthttpdbpediaorgpropertygt
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoStaterdquoen is rdfslabel of dbpedia-owlAdminstrativeRegion
ldquoBaltimorerdquoen is rdfslabel of dbpediaBaltimore dbpediaBaltimore a dbpedia-owlCity ldquoMDrdquoen is rdfslabel of dbpediaMaryland dbpediaMaryland a dbpedia-owlAdministrativeRegion
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion dbppropLargestCity rdfsrange dbpedia-owlCity
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoCityrdquo is the common human name for the class dbpedia-owlCity
dbpediaBaltimore a dbpedia-owlCity dbpediaBaltimore is a type (instance) dbpedia-owlCity
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion The subjects of the triples using the property have to be instances of dbpedia-
owlAdminstrativeRegion
dbppropLargestCity rdfsrange dbpedia-owlCity The objects of the triples using the property have to be instances of dbpedia-owlCity
34
Results
35
Dataset summary
Number of Tables 15
Total Number of rows 199
Total Number of columns 56 (52)
Total Number of entities 639 (611)
The number in the brackets indicates excluding columns that contained numbers
36
Dataset summary
37
Dataset summary
38
Evaluation for class label predictions
39
Evaluation 1 (MAP)
bull Compared the systemrsquos ranked list of labels against a human ranked list of labels
bull Metric - Mean Average Precision (MAP)
bull Commonly used in the Information Retrieval domain to compare two ranked sets
40
Evaluation 1 (MAP)
41
8076
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 2 (Recall)
Recall > 0.6 (75%)
System ranked: 1. Person 2. Politician 3. President
Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
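For the same example lists, recall can be sketched as the share of evaluator labels that also appear in the system's list. This is a minimal Python illustration assuming set-based recall that ignores rank order; the function name is hypothetical.

```python
def recall(system_labels, evaluator_labels):
    """Fraction of the evaluator's labels that the system also proposed."""
    expected = set(evaluator_labels)
    if not expected:
        return 0.0
    return len(expected & set(system_labels)) / len(expected)


system_ranked = ["Person", "Politician", "President"]
evaluator_ranked = ["President", "Politician", "OfficeHolder"]

r = recall(system_ranked, evaluator_ranked)
print(round(r, 3), r > 0.6)
```

Two of the three evaluator labels ("President" and "Politician") are in the system's list, so this column would fall above the 0.6 threshold.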
Evaluation 3 (Correctness)
• Evaluated whether our predicted class labels were "fair and correct"
• A class label may not be the most accurate one but may still be correct – e.g. dbpedia:PopulatedPlace is not the most accurate, but still a correct, label for a column of cities
• Three human judges evaluated our predicted class labels
• A category-wise breakdown for class label correctness
Overall accuracy: 76.92%
Column – Nationality, Prediction – MilitaryConflict
Column – Birth Place, Prediction – PopulatedPlace
Evaluation for linking table cells to entities
Category-wise accuracy for linking table cells
Overall accuracy: 66.12%
Relation between columns
• Idea – Ask human evaluators to identify relations between columns in a given table
• Pilot experiment – Asked three evaluators to annotate five random tables from our dataset
• Evaluators identified 20 relations
• Our accuracy – 5 out of 20 (25%) were correct
Future Work
Current
Automatic/Semi-automatic template learning
Confirming LD Facts
Baltimore, MD, S.Rawlings, …
For Baltimore, DBpedia says:
dbpprop:LeaderName – S.Dixon
dbpprop:LeaderName – S.Dixon, S.Rawlings
Discover knowledge relations
• Inception rdf:type dbpedia-owl:Movie
• Howard County rdf:type dbpedia:AdministrativeRegion
• David Beckham dbpedia-owl:Team dbpedia:Los_Angeles_Galaxy
Conclusion
• There's a lot of data stored in HTML tables, spreadsheets, databases and documents
• We presented an automatic framework to interpret such data
• We believe our work will contribute to materializing the Web of Data vision
References
• Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A.: RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference. Springer (2008)
• Langegger, A., Wöß, W.: XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. In: 8th International Semantic Web Conference (ISWC2009) (2009)
• Cafarella, M.J., Halevy, A.Y., Wang, Z.D., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. PVLDB 1 (2008) 538–549
• Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB) (2010)
• Lin, C.X., Zhao, B., Weninger, T., Han, J., Liu, B.: Entity relation discovery from web tables and links. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.) WWW, 1145–1146. ACM (2010)
31
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
32
Annotating web tables for the Semantic Web
Table as linked RDFprefix rdfs lthttpwwww3org200001rdf-schemagt prefix dbpedia lthttpdbpediaorgresourcegt prefix dbpedia-owl lthttpdbpediaorgontologygt prefix dbpprop lthttpdbpediaorgpropertygt
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoStaterdquoen is rdfslabel of dbpedia-owlAdminstrativeRegion
ldquoBaltimorerdquoen is rdfslabel of dbpediaBaltimore dbpediaBaltimore a dbpedia-owlCity ldquoMDrdquoen is rdfslabel of dbpediaMaryland dbpediaMaryland a dbpedia-owlAdministrativeRegion
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion dbppropLargestCity rdfsrange dbpedia-owlCity
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoCityrdquo is the common human name for the class dbpedia-owlCity
dbpediaBaltimore a dbpedia-owlCity dbpediaBaltimore is a type (instance) dbpedia-owlCity
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion The subjects of the triples using the property have to be instances of dbpedia-
owlAdminstrativeRegion
dbppropLargestCity rdfsrange dbpedia-owlCity The objects of the triples using the property have to be instances of dbpedia-owlCity
34
Results
35
Dataset summary
Number of Tables 15
Total Number of rows 199
Total Number of columns 56 (52)
Total Number of entities 639 (611)
The number in the brackets indicates excluding columns that contained numbers
36
Dataset summary
37
Dataset summary
38
Evaluation for class label predictions
39
Evaluation 1 (MAP)
bull Compared the systemrsquos ranked list of labels against a human ranked list of labels
bull Metric - Mean Average Precision (MAP)
bull Commonly used in the Information Retrieval domain to compare two ranked sets
40
Evaluation 1 (MAP)
41
8076
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 2 (Recall)
Recall gt 06 (75 )
42
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 3 (Correctness)
bull Evaluated whether our predicted class labels were ldquofair and correctrdquo
bull Class label may not be the most accurate one but may be correct ndash Eg dbpediaPopulatedPlace is not the most accurate but still a
correct label for column of cities
bull Three human judges evaluated our predicted class labels
43
Evaluation 3 (Correctness)
bull A category-wise breakdown for class label correctnessOverall
Accuracy 7692
44
Column ndash NationalityPrediction ndash MilitaryConflict
Column ndash Birth PlacePrediction ndash PopulatedPlace
Evaluation for linking table cells to entities
45
Category-wise accuracy for linking table cells
Overall Accuracy 6612
46
Relation between columns
bull Idea ndash Ask human evaluators to identify relations between columns in a given table
bull Pilot Experiment ndash Asked three evaluators to annotate five random tables from our dataset
bull Evaluators identified 20 relations
bull Our accuracy ndash 5 out of 20 (25 ) were correct
47
Future Work
48
Current
AutomaticSemi-automatic template learning
Confirming LD Facts
Baltimore MD SRawlings hellipFor Baltimore Dbpedia says
DbppropLeaderName ndash SDixon
DbppropLeaderName ndash SDixon SRawlings
Discover knowledge relations
bull Inception rdftype dbpedia-owlMovie
bull Howard County rdftype dbpediaAdminstrativeRegion
bull David Beckham dbpedia-owlTeam dbpedia Los_Angeles_Galaxy
Conclusion
bull Therersquos lot of data that is stored in html tables spreadsheets databases and documents
bull We presented an automatic framework to interpret such data
bull We believe our work will contribute in materializing the web of data vision
Referencesbull Han L Finin T Parr C Sachs J Joshi A RDF123 from Spreadsheets to RDF In Seventh
International Semantic Web Conference Springer (2008)
bull Langegger A Wob W Xlwrap - querying and integrating arbitrary spreadsheets with sparql In 8th International Semantic Web Conference (ISWC2009) (2009)
bull Cafarella MJ Halevy AYWang ZDWu E Zhang Y Webtables exploring the power of tables on the web PVLDB 1 (2008) 538 - 549
bull Limaye G Sarawagi S Chakrabarti S Annotating and searching web tables using entities types and relationships In Proc of the 36th Intl Conference on Very Large Databases (VLDB) (2010)
bull Lin C X Zhao BWeninger T Han J and Liu B 2010 Entity relation discovery from web tables and links In Rappa M Jones P Freire J and Chakrabarti S eds WWW 1145ndash1146 ACM
Relation between columns
Maryland - Baltimore
Massachusetts - Boston
New York - New York
dbontoLargestCity
dbontoLargestCitydbontoCapital
dbontoLargestCity
dbontoCapital dbontoLargestCity
Candidate relations
30
Scoring the relations
Maryland - Baltimore
Massachusetts - Boston
New York - New York
dbontoLargestCity
dbontoLargestCitydbontoCapital
dbontoLargestCity
Candidates dbontoCapital
dbontoLargestCity
dbontoCapital Score0
dbontoCapital Score1
dbontoLargestCity Score3
31
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
32
Annotating web tables for the Semantic Web
Table as linked RDFprefix rdfs lthttpwwww3org200001rdf-schemagt prefix dbpedia lthttpdbpediaorgresourcegt prefix dbpedia-owl lthttpdbpediaorgontologygt prefix dbpprop lthttpdbpediaorgpropertygt
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoStaterdquoen is rdfslabel of dbpedia-owlAdminstrativeRegion
ldquoBaltimorerdquoen is rdfslabel of dbpediaBaltimore dbpediaBaltimore a dbpedia-owlCity ldquoMDrdquoen is rdfslabel of dbpediaMaryland dbpediaMaryland a dbpedia-owlAdministrativeRegion
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion dbppropLargestCity rdfsrange dbpedia-owlCity
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoCityrdquo is the common human name for the class dbpedia-owlCity
dbpediaBaltimore a dbpedia-owlCity dbpediaBaltimore is a type (instance) dbpedia-owlCity
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion The subjects of the triples using the property have to be instances of dbpedia-
owlAdminstrativeRegion
dbppropLargestCity rdfsrange dbpedia-owlCity The objects of the triples using the property have to be instances of dbpedia-owlCity
34
Results
35
Dataset summary
Number of Tables 15
Total Number of rows 199
Total Number of columns 56 (52)
Total Number of entities 639 (611)
The number in the brackets indicates excluding columns that contained numbers
36
Dataset summary
37
Dataset summary
38
Evaluation for class label predictions
39
Evaluation 1 (MAP)
bull Compared the systemrsquos ranked list of labels against a human ranked list of labels
bull Metric - Mean Average Precision (MAP)
bull Commonly used in the Information Retrieval domain to compare two ranked sets
40
Evaluation 1 (MAP)
41
8076
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 2 (Recall)
Recall gt 06 (75 )
42
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 3 (Correctness)
bull Evaluated whether our predicted class labels were ldquofair and correctrdquo
bull Class label may not be the most accurate one but may be correct ndash Eg dbpediaPopulatedPlace is not the most accurate but still a
correct label for column of cities
bull Three human judges evaluated our predicted class labels
43
Evaluation 3 (Correctness)
bull A category-wise breakdown for class label correctnessOverall
Accuracy 7692
44
Column ndash NationalityPrediction ndash MilitaryConflict
Column ndash Birth PlacePrediction ndash PopulatedPlace
Evaluation for linking table cells to entities
45
Category-wise accuracy for linking table cells
Overall Accuracy 6612
46
Relation between columns
bull Idea ndash Ask human evaluators to identify relations between columns in a given table
bull Pilot Experiment ndash Asked three evaluators to annotate five random tables from our dataset
bull Evaluators identified 20 relations
bull Our accuracy ndash 5 out of 20 (25 ) were correct
47
Future Work
48
Current
AutomaticSemi-automatic template learning
Confirming LD Facts
Baltimore MD SRawlings hellipFor Baltimore Dbpedia says
DbppropLeaderName ndash SDixon
DbppropLeaderName ndash SDixon SRawlings
Discover knowledge relations
bull Inception rdftype dbpedia-owlMovie
bull Howard County rdftype dbpediaAdminstrativeRegion
bull David Beckham dbpedia-owlTeam dbpedia Los_Angeles_Galaxy
Conclusion
bull Therersquos lot of data that is stored in html tables spreadsheets databases and documents
bull We presented an automatic framework to interpret such data
bull We believe our work will contribute in materializing the web of data vision
Referencesbull Han L Finin T Parr C Sachs J Joshi A RDF123 from Spreadsheets to RDF In Seventh
International Semantic Web Conference Springer (2008)
bull Langegger A Wob W Xlwrap - querying and integrating arbitrary spreadsheets with sparql In 8th International Semantic Web Conference (ISWC2009) (2009)
bull Cafarella MJ Halevy AYWang ZDWu E Zhang Y Webtables exploring the power of tables on the web PVLDB 1 (2008) 538 - 549
bull Limaye G Sarawagi S Chakrabarti S Annotating and searching web tables using entities types and relationships In Proc of the 36th Intl Conference on Very Large Databases (VLDB) (2010)
bull Lin C X Zhao BWeninger T Han J and Liu B 2010 Entity relation discovery from web tables and links In Rappa M Jones P Freire J and Chakrabarti S eds WWW 1145ndash1146 ACM
Scoring the relations
Maryland - Baltimore
Massachusetts - Boston
New York - New York
dbontoLargestCity
dbontoLargestCitydbontoCapital
dbontoLargestCity
Candidates dbontoCapital
dbontoLargestCity
dbontoCapital Score0
dbontoCapital Score1
dbontoLargestCity Score3
31
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
32
Annotating web tables for the Semantic Web
Table as linked RDFprefix rdfs lthttpwwww3org200001rdf-schemagt prefix dbpedia lthttpdbpediaorgresourcegt prefix dbpedia-owl lthttpdbpediaorgontologygt prefix dbpprop lthttpdbpediaorgpropertygt
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoStaterdquoen is rdfslabel of dbpedia-owlAdminstrativeRegion
ldquoBaltimorerdquoen is rdfslabel of dbpediaBaltimore dbpediaBaltimore a dbpedia-owlCity ldquoMDrdquoen is rdfslabel of dbpediaMaryland dbpediaMaryland a dbpedia-owlAdministrativeRegion
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion dbppropLargestCity rdfsrange dbpedia-owlCity
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoCityrdquo is the common human name for the class dbpedia-owlCity
dbpediaBaltimore a dbpedia-owlCity dbpediaBaltimore is a type (instance) dbpedia-owlCity
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion The subjects of the triples using the property have to be instances of dbpedia-
owlAdminstrativeRegion
dbppropLargestCity rdfsrange dbpedia-owlCity The objects of the triples using the property have to be instances of dbpedia-owlCity
34
Results
35
Dataset summary
Number of Tables 15
Total Number of rows 199
Total Number of columns 56 (52)
Total Number of entities 639 (611)
The number in the brackets indicates excluding columns that contained numbers
36
Dataset summary
37
Dataset summary
38
Evaluation for class label predictions
39
Evaluation 1 (MAP)
bull Compared the systemrsquos ranked list of labels against a human ranked list of labels
bull Metric - Mean Average Precision (MAP)
bull Commonly used in the Information Retrieval domain to compare two ranked sets
40
Evaluation 1 (MAP)
41
8076
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 2 (Recall)
Recall gt 06 (75 )
42
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 3 (Correctness)
bull Evaluated whether our predicted class labels were ldquofair and correctrdquo
bull Class label may not be the most accurate one but may be correct ndash Eg dbpediaPopulatedPlace is not the most accurate but still a
correct label for column of cities
bull Three human judges evaluated our predicted class labels
43
Evaluation 3 (Correctness)
bull A category-wise breakdown for class label correctnessOverall
Accuracy 7692
44
Column ndash NationalityPrediction ndash MilitaryConflict
Column ndash Birth PlacePrediction ndash PopulatedPlace
Evaluation for linking table cells to entities
45
Category-wise accuracy for linking table cells
Overall Accuracy 6612
46
Relation between columns
bull Idea ndash Ask human evaluators to identify relations between columns in a given table
bull Pilot Experiment ndash Asked three evaluators to annotate five random tables from our dataset
bull Evaluators identified 20 relations
bull Our accuracy ndash 5 out of 20 (25 ) were correct
47
Future Work
48
Current
AutomaticSemi-automatic template learning
Confirming LD Facts
Baltimore MD SRawlings hellipFor Baltimore Dbpedia says
DbppropLeaderName ndash SDixon
DbppropLeaderName ndash SDixon SRawlings
Discover knowledge relations
bull Inception rdftype dbpedia-owlMovie
bull Howard County rdftype dbpediaAdminstrativeRegion
bull David Beckham dbpedia-owlTeam dbpedia Los_Angeles_Galaxy
Conclusion
bull Therersquos lot of data that is stored in html tables spreadsheets databases and documents
bull We presented an automatic framework to interpret such data
bull We believe our work will contribute in materializing the web of data vision
Referencesbull Han L Finin T Parr C Sachs J Joshi A RDF123 from Spreadsheets to RDF In Seventh
International Semantic Web Conference Springer (2008)
bull Langegger A Wob W Xlwrap - querying and integrating arbitrary spreadsheets with sparql In 8th International Semantic Web Conference (ISWC2009) (2009)
bull Cafarella MJ Halevy AYWang ZDWu E Zhang Y Webtables exploring the power of tables on the web PVLDB 1 (2008) 538 - 549
bull Limaye G Sarawagi S Chakrabarti S Annotating and searching web tables using entities types and relationships In Proc of the 36th Intl Conference on Very Large Databases (VLDB) (2010)
bull Lin C X Zhao BWeninger T Han J and Liu B 2010 Entity relation discovery from web tables and links In Rappa M Jones P Freire J and Chakrabarti S eds WWW 1145ndash1146 ACM
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
32
Annotating web tables for the Semantic Web
Table as linked RDFprefix rdfs lthttpwwww3org200001rdf-schemagt prefix dbpedia lthttpdbpediaorgresourcegt prefix dbpedia-owl lthttpdbpediaorgontologygt prefix dbpprop lthttpdbpediaorgpropertygt
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoStaterdquoen is rdfslabel of dbpedia-owlAdminstrativeRegion
ldquoBaltimorerdquoen is rdfslabel of dbpediaBaltimore dbpediaBaltimore a dbpedia-owlCity ldquoMDrdquoen is rdfslabel of dbpediaMaryland dbpediaMaryland a dbpedia-owlAdministrativeRegion
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion dbppropLargestCity rdfsrange dbpedia-owlCity
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoCityrdquo is the common human name for the class dbpedia-owlCity
dbpediaBaltimore a dbpedia-owlCity dbpediaBaltimore is a type (instance) dbpedia-owlCity
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion The subjects of the triples using the property have to be instances of dbpedia-
owlAdminstrativeRegion
dbppropLargestCity rdfsrange dbpedia-owlCity The objects of the triples using the property have to be instances of dbpedia-owlCity
34
Results
35
Dataset summary
Number of Tables 15
Total Number of rows 199
Total Number of columns 56 (52)
Total Number of entities 639 (611)
The number in the brackets indicates excluding columns that contained numbers
36
Dataset summary
37
Dataset summary
38
Evaluation for class label predictions
39
Evaluation 1 (MAP)
bull Compared the systemrsquos ranked list of labels against a human ranked list of labels
bull Metric - Mean Average Precision (MAP)
bull Commonly used in the Information Retrieval domain to compare two ranked sets
40
Evaluation 1 (MAP)
41
8076
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 2 (Recall)
Recall gt 06 (75 )
42
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 3 (Correctness)
bull Evaluated whether our predicted class labels were ldquofair and correctrdquo
bull Class label may not be the most accurate one but may be correct ndash Eg dbpediaPopulatedPlace is not the most accurate but still a
correct label for column of cities
bull Three human judges evaluated our predicted class labels
43
Evaluation 3 (Correctness)
bull A category-wise breakdown for class label correctnessOverall
Accuracy 7692
44
Column ndash NationalityPrediction ndash MilitaryConflict
Column ndash Birth PlacePrediction ndash PopulatedPlace
Evaluation for linking table cells to entities
45
Category-wise accuracy for linking table cells
Overall Accuracy 6612
46
Relation between columns
bull Idea ndash Ask human evaluators to identify relations between columns in a given table
bull Pilot Experiment ndash Asked three evaluators to annotate five random tables from our dataset
bull Evaluators identified 20 relations
bull Our accuracy ndash 5 out of 20 (25 ) were correct
47
Future Work
48
Current
AutomaticSemi-automatic template learning
Confirming LD Facts
Baltimore MD SRawlings hellipFor Baltimore Dbpedia says
DbppropLeaderName ndash SDixon
DbppropLeaderName ndash SDixon SRawlings
Discover knowledge relations
bull Inception rdftype dbpedia-owlMovie
bull Howard County rdftype dbpediaAdminstrativeRegion
bull David Beckham dbpedia-owlTeam dbpedia Los_Angeles_Galaxy
Conclusion
bull Therersquos lot of data that is stored in html tables spreadsheets databases and documents
bull We presented an automatic framework to interpret such data
bull We believe our work will contribute in materializing the web of data vision
Referencesbull Han L Finin T Parr C Sachs J Joshi A RDF123 from Spreadsheets to RDF In Seventh
International Semantic Web Conference Springer (2008)
bull Langegger A Wob W Xlwrap - querying and integrating arbitrary spreadsheets with sparql In 8th International Semantic Web Conference (ISWC2009) (2009)
bull Cafarella MJ Halevy AYWang ZDWu E Zhang Y Webtables exploring the power of tables on the web PVLDB 1 (2008) 538 - 549
bull Limaye G Sarawagi S Chakrabarti S Annotating and searching web tables using entities types and relationships In Proc of the 36th Intl Conference on Very Large Databases (VLDB) (2010)
bull Lin C X Zhao BWeninger T Han J and Liu B 2010 Entity relation discovery from web tables and links In Rappa M Jones P Freire J and Chakrabarti S eds WWW 1145ndash1146 ACM
Annotating web tables for the Semantic Web
Table as linked RDFprefix rdfs lthttpwwww3org200001rdf-schemagt prefix dbpedia lthttpdbpediaorgresourcegt prefix dbpedia-owl lthttpdbpediaorgontologygt prefix dbpprop lthttpdbpediaorgpropertygt
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoStaterdquoen is rdfslabel of dbpedia-owlAdminstrativeRegion
ldquoBaltimorerdquoen is rdfslabel of dbpediaBaltimore dbpediaBaltimore a dbpedia-owlCity ldquoMDrdquoen is rdfslabel of dbpediaMaryland dbpediaMaryland a dbpedia-owlAdministrativeRegion
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion dbppropLargestCity rdfsrange dbpedia-owlCity
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoCityrdquo is the common human name for the class dbpedia-owlCity
dbpediaBaltimore a dbpedia-owlCity dbpediaBaltimore is a type (instance) dbpedia-owlCity
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion The subjects of the triples using the property have to be instances of dbpedia-
owlAdminstrativeRegion
dbppropLargestCity rdfsrange dbpedia-owlCity The objects of the triples using the property have to be instances of dbpedia-owlCity
34
Results
35
Dataset summary
Number of Tables 15
Total Number of rows 199
Total Number of columns 56 (52)
Total Number of entities 639 (611)
The number in the brackets indicates excluding columns that contained numbers
36
Dataset summary
37
Dataset summary
38
Evaluation for class label predictions
39
Evaluation 1 (MAP)
bull Compared the systemrsquos ranked list of labels against a human ranked list of labels
bull Metric - Mean Average Precision (MAP)
bull Commonly used in the Information Retrieval domain to compare two ranked sets
40
Evaluation 1 (MAP)
41
8076
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 2 (Recall)
Recall gt 06 (75 )
42
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 3 (Correctness)
bull Evaluated whether our predicted class labels were ldquofair and correctrdquo
bull Class label may not be the most accurate one but may be correct ndash Eg dbpediaPopulatedPlace is not the most accurate but still a
correct label for column of cities
bull Three human judges evaluated our predicted class labels
43
Evaluation 3 (Correctness)
bull A category-wise breakdown for class label correctnessOverall
Accuracy 7692
44
Column ndash NationalityPrediction ndash MilitaryConflict
Column ndash Birth PlacePrediction ndash PopulatedPlace
Evaluation for linking table cells to entities
45
Category-wise accuracy for linking table cells
Overall Accuracy 6612
46
Relation between columns
bull Idea ndash Ask human evaluators to identify relations between columns in a given table
bull Pilot Experiment ndash Asked three evaluators to annotate five random tables from our dataset
bull Evaluators identified 20 relations
bull Our accuracy ndash 5 out of 20 (25 ) were correct
47
Future Work
48
Current
AutomaticSemi-automatic template learning
Confirming LD Facts
Baltimore MD SRawlings hellipFor Baltimore Dbpedia says
DbppropLeaderName ndash SDixon
DbppropLeaderName ndash SDixon SRawlings
Discover knowledge relations
bull Inception rdftype dbpedia-owlMovie
bull Howard County rdftype dbpediaAdminstrativeRegion
bull David Beckham dbpedia-owlTeam dbpedia Los_Angeles_Galaxy
Conclusion
bull Therersquos lot of data that is stored in html tables spreadsheets databases and documents
bull We presented an automatic framework to interpret such data
bull We believe our work will contribute in materializing the web of data vision
Referencesbull Han L Finin T Parr C Sachs J Joshi A RDF123 from Spreadsheets to RDF In Seventh
International Semantic Web Conference Springer (2008)
bull Langegger A Wob W Xlwrap - querying and integrating arbitrary spreadsheets with sparql In 8th International Semantic Web Conference (ISWC2009) (2009)
bull Cafarella MJ Halevy AYWang ZDWu E Zhang Y Webtables exploring the power of tables on the web PVLDB 1 (2008) 538 - 549
bull Limaye G Sarawagi S Chakrabarti S Annotating and searching web tables using entities types and relationships In Proc of the 36th Intl Conference on Very Large Databases (VLDB) (2010)
bull Lin C X Zhao BWeninger T Han J and Liu B 2010 Entity relation discovery from web tables and links In Rappa M Jones P Freire J and Chakrabarti S eds WWW 1145ndash1146 ACM
Table as linked RDFprefix rdfs lthttpwwww3org200001rdf-schemagt prefix dbpedia lthttpdbpediaorgresourcegt prefix dbpedia-owl lthttpdbpediaorgontologygt prefix dbpprop lthttpdbpediaorgpropertygt
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoStaterdquoen is rdfslabel of dbpedia-owlAdminstrativeRegion
ldquoBaltimorerdquoen is rdfslabel of dbpediaBaltimore dbpediaBaltimore a dbpedia-owlCity ldquoMDrdquoen is rdfslabel of dbpediaMaryland dbpediaMaryland a dbpedia-owlAdministrativeRegion
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion dbppropLargestCity rdfsrange dbpedia-owlCity
ldquoCityrdquoen is rdfslabel of dbpedia-owlCity ldquoCityrdquo is the common human name for the class dbpedia-owlCity
dbpediaBaltimore a dbpedia-owlCity dbpediaBaltimore is a type (instance) dbpedia-owlCity
dbppropLargestCity rdfsdomain dbpedia-owlAdminstrativeRegion The subjects of the triples using the property have to be instances of dbpedia-
owlAdminstrativeRegion
dbppropLargestCity rdfsrange dbpedia-owlCity The objects of the triples using the property have to be instances of dbpedia-owlCity
34
Results
35
Dataset summary
Number of Tables 15
Total Number of rows 199
Total Number of columns 56 (52)
Total Number of entities 639 (611)
The number in the brackets indicates excluding columns that contained numbers
36
Dataset summary
37
Dataset summary
38
Evaluation for class label predictions
39
Evaluation 1 (MAP)
bull Compared the systemrsquos ranked list of labels against a human ranked list of labels
bull Metric - Mean Average Precision (MAP)
bull Commonly used in the Information Retrieval domain to compare two ranked sets
40
Evaluation 1 (MAP)
41
8076
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 2 (Recall)
Recall gt 06 (75 )
42
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 3 (Correctness)
bull Evaluated whether our predicted class labels were ldquofair and correctrdquo
bull Class label may not be the most accurate one but may be correct ndash Eg dbpediaPopulatedPlace is not the most accurate but still a
correct label for column of cities
bull Three human judges evaluated our predicted class labels
43
Evaluation 3 (Correctness)
bull A category-wise breakdown for class label correctnessOverall
Accuracy 7692
44
Column ndash NationalityPrediction ndash MilitaryConflict
Column ndash Birth PlacePrediction ndash PopulatedPlace
Evaluation for linking table cells to entities
45
Category-wise accuracy for linking table cells
Overall Accuracy 6612
46
Relation between columns
bull Idea ndash Ask human evaluators to identify relations between columns in a given table
bull Pilot Experiment ndash Asked three evaluators to annotate five random tables from our dataset
bull Evaluators identified 20 relations
bull Our accuracy ndash 5 out of 20 (25 ) were correct
47
Future Work
48
Current
AutomaticSemi-automatic template learning
Confirming LD Facts
Baltimore MD SRawlings hellipFor Baltimore Dbpedia says
DbppropLeaderName ndash SDixon
DbppropLeaderName ndash SDixon SRawlings
Discover knowledge relations
bull Inception rdftype dbpedia-owlMovie
bull Howard County rdftype dbpediaAdminstrativeRegion
bull David Beckham dbpedia-owlTeam dbpedia Los_Angeles_Galaxy
Conclusion
bull Therersquos lot of data that is stored in html tables spreadsheets databases and documents
bull We presented an automatic framework to interpret such data
bull We believe our work will contribute in materializing the web of data vision
Referencesbull Han L Finin T Parr C Sachs J Joshi A RDF123 from Spreadsheets to RDF In Seventh
International Semantic Web Conference Springer (2008)
bull Langegger A Wob W Xlwrap - querying and integrating arbitrary spreadsheets with sparql In 8th International Semantic Web Conference (ISWC2009) (2009)
bull Cafarella MJ Halevy AYWang ZDWu E Zhang Y Webtables exploring the power of tables on the web PVLDB 1 (2008) 538 - 549
bull Limaye G Sarawagi S Chakrabarti S Annotating and searching web tables using entities types and relationships In Proc of the 36th Intl Conference on Very Large Databases (VLDB) (2010)
bull Lin C X Zhao BWeninger T Han J and Liu B 2010 Entity relation discovery from web tables and links In Rappa M Jones P Freire J and Chakrabarti S eds WWW 1145ndash1146 ACM
Results
35
Dataset summary
Number of Tables 15
Total Number of rows 199
Total Number of columns 56 (52)
Total Number of entities 639 (611)
The number in the brackets indicates excluding columns that contained numbers
36
Dataset summary
37
Dataset summary
38
Evaluation for class label predictions
39
Evaluation 1 (MAP)
bull Compared the systemrsquos ranked list of labels against a human ranked list of labels
bull Metric - Mean Average Precision (MAP)
bull Commonly used in the Information Retrieval domain to compare two ranked sets
40
Evaluation 1 (MAP)
41
8076
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 2 (Recall)
Recall gt 06 (75 )
42
System Ranked1 Person2 Politician3 President
Evaluator Ranked1 President2 Politician3 OfficeHolder
Evaluation 3 (Correctness)
bull Evaluated whether our predicted class labels were ldquofair and correctrdquo
bull Class label may not be the most accurate one but may be correct ndash Eg dbpediaPopulatedPlace is not the most accurate but still a
correct label for column of cities
bull Three human judges evaluated our predicted class labels
43
Evaluation 3 (Correctness)
bull A category-wise breakdown for class label correctnessOverall
Accuracy 7692
44
Column ndash NationalityPrediction ndash MilitaryConflict
Column ndash Birth PlacePrediction ndash PopulatedPlace
Evaluation for linking table cells to entities
Category-wise accuracy for linking table cells
Overall accuracy: 66.12%
Relation between columns
• Idea – Ask human evaluators to identify relations between columns in a given table
• Pilot experiment – Asked three evaluators to annotate five random tables from our dataset
• Evaluators identified 20 relations
• Our accuracy – 5 out of 20 (25%) were correct
Future Work
Current
Automatic/Semi-automatic template learning
Confirming LD Facts
Baltimore | MD | S. Rawlings | …
For Baltimore, DBpedia says:
Dbpprop:LeaderName – S. Dixon
Dbpprop:LeaderName – S. Dixon, S. Rawlings
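One way to picture the Baltimore example is a lookup against the facts already known for an entity: a value read from the table is either confirmed or treated as a candidate new fact. The sketch below uses an in-memory dictionary as a stand-in for the linked data source. The data structure and function names are assumptions, not the system described in the talk.

```python
def check_fact(known, subject, prop, value):
    """Return 'confirmed' if the value is already known, otherwise 'new'."""
    existing = known.get((subject, prop), set())
    return "confirmed" if value in existing else "new"


def add_fact(known, subject, prop, value):
    """Record a value for (subject, prop), keeping any existing values."""
    known.setdefault((subject, prop), set()).add(value)


# Hypothetical snapshot of what the knowledge base says about Baltimore.
known = {("Baltimore", "LeaderName"): {"S. Dixon"}}

# The table row asserts a different leader name.
status = check_fact(known, "Baltimore", "LeaderName", "S. Rawlings")
if status == "new":
    add_fact(known, "Baltimore", "LeaderName", "S. Rawlings")
```

After the update both leader names are recorded, mirroring the second Dbpprop:LeaderName line above.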
Discover knowledge relations
• Inception rdf:type dbpedia-owl:Movie
• Howard County rdf:type dbpedia:AdministrativeRegion
• David Beckham dbpedia-owl:Team dbpedia:Los_Angeles_Galaxy
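Discovered facts like these could be written out as RDF triples. The following is a minimal sketch that expands prefixed names into full IRIs and prints one N-Triples line. The prefix table follows the namespaces declared earlier in the talk, while the helper names and the resource IRI chosen for Inception are assumptions.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dbpedia": "http://dbpedia.org/resource/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
}


def expand(curie):
    """Expand a prefixed name such as 'rdf:type' into a bracketed IRI."""
    prefix, local = curie.split(":", 1)
    return f"<{PREFIXES[prefix]}{local}>"


def to_ntriple(subject, predicate, obj):
    """Format one subject-predicate-object statement as an N-Triples line."""
    return f"{expand(subject)} {expand(predicate)} {expand(obj)} ."


# The class name is taken from the slide; the subject IRI is assumed.
line = to_ntriple("dbpedia:Inception", "rdf:type", "dbpedia-owl:Movie")
print(line)
```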
Conclusion
• There's a lot of data stored in HTML tables, spreadsheets, databases and documents
• We presented an automatic framework to interpret such data
• We believe our work will contribute to materializing the Web of Data vision
References
• Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A.: RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference. Springer (2008)
• Langegger, A., Wöß, W.: XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. In: 8th International Semantic Web Conference (ISWC2009) (2009)
• Cafarella, M.J., Halevy, A.Y., Wang, Z.D., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. PVLDB 1 (2008) 538–549
• Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB) (2010)
• Lin, C.X., Zhao, B., Weninger, T., Han, J., Liu, B.: Entity relation discovery from web tables and links. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.) WWW, 1145–1146. ACM (2010)