Post on 02-Jan-2016
Machines learnt how to understand tables. What happens next will shock you.
Welcome to the PhD dissertation defense of Varish Mulwad!
2
TABEL – Domain Independent and Extensible Framework to Infer the Semantics of Tables
Varish Mulwad
Ph.D. Dissertation Defense
Adviser: Dr. Tim Finin
January 8, 2015
3
4
Zareen Syed, Tim Finin, Varish Mulwad, and Anupam Joshi, "Exploiting a Web of Semantic Data for Interpreting Tables", In 2nd Web Science Conference (WebSci 2010), Raleigh, NC, USA, Apr. 2010
Semantics of a Table
Name            Team          Position        Height
Michael Jordan  Chicago       Shooting Guard  1.98
Allen Iverson   Philadelphia  Point Guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power Forward   2.11
NationalBasketballAssociationTeams
http://dbpedia.org/resource/Allen_Iverson
Map literals as property values
playsFor
5
Semantics of a Table
Name            Team          Position        Height
Michael Jordan  Chicago       Shooting Guard  1.98
Allen Iverson   Philadelphia  Point Guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power Forward   2.11
Linked Data
tab:cell_01 a tab:ColumnHeader ;
    tab:cellLabel "Name"^^xsd:string ;
    tab:columnIndex "1"^^xsd:integer ;
    tab:valueType dbpedia-owl:BasketballPlayer .
tab:cell_11 a tab:DataCell ;
    tab:cellLabel "Michael Jordan"^^xsd:string ;
    tab:columnIndex "1"^^xsd:integer ;
    tab:rowIndex "1"^^xsd:integer ;
    tab:entity dbpedia:Michael_Jordan .
All this in a completely automated way!
6
TABEL – Domain Independent & Extensible Framework to Infer the Semantics of Tables
7
Thesis Statement
It is possible to generate high quality linked data from tables by jointly inferring the semantics of column headers, values (string and literal) in table cells, and relations between columns augmented with background knowledge from open data sources such as the Linked Open Data cloud.
8
Contributions
o A probabilistic graphical model to jointly infer the semantics of a table, with a novel inference technique, Semantic Message Passing
o A proof-of-concept interactive application to generate meta-analysis reports automatically
o Development and exploration of a human-in-the-loop paradigm
o A novel technique to generate candidate properties from literal values
9
Outline: Why, How, Evaluation, Application, Wrap up
10
Tables are everywhere!
154 million high quality relational tables on the web
~400,000 CSVs on data.gov
Healthcare, Financial and other domains
11
The Semantic Web & the Web
Spreadsheets/CSVs to RDF/OWL
12
Evidence Based Medicine
Combine: all studies that compare organic milk vs. grass-fed cow milk
Produce unified report: Organic milk is better!
Meta-analysis reports:
Correlation between cardiovascular risk factors and venous thrombosis
Duration of proton pump inhibitors as first line of treatment for Helicobacter pylori eradication
13
Tables are valuable
14
Meta-Analysis: Today
Correlation between cardiovascular risk factors and venous thrombosis [1]
Initial search >> 1949 studies
Final # of studies selected >> 22!
• Keyword based search
• Initial search yields large # of results
• Manually filter out irrelevant results
[1] W. Ageno, C. Becattini, T. Brighton, R. Selby, and P. W. Kamphuisen, "Cardiovascular risk factors and venous thromboembolism: a meta-analysis," Circulation, vol. 117, no. 1, pp. 93–102, 2008.
15
Not restricted to healthcare …
16
Related Work
Databases & Spreadsheets to RDF:
Existing solutions: largely manual or semi-automatic
Number of ontologies, classes, relations
Automatic solutions: "Row as RDF node"; local mappings
No links to existing classes, properties, entities
17
Related Work
Semantics of Table:
Infer semantics for only parts of the table [header cells; relation between headers; data cell values or a combination of the two]
Fail to generate RDF Linked Data representation
Poor support for literals
18
Related Work
Limaye et al. [Sep. 2010] [Soumen Chakrabarti's group @ IIT-B]
RDF Linked Data representation
Literal values
Knoblock et al. [May 2012] [Craig Knoblock's group @ USC – ISI]
Largely focuses on header cell semantics & relation between headers
Requires initial user input before automatic predictions from the system
Venetis et al. [Sep. 2011] [Alon Halevy's group @ Google]
Column header and relation semantics
Literal values; RDF Linked Data
19
What TABEL brings to the "table"
Infers the complete semantics of a table
Generates an RDF Linked Data representation
Supports tables with different structures over a variety of domains [medical tables]
Incorporates user feedback to improve the quality of inferred semantics
Infers the semantics of literal values* [numerical values]
20
Outline: Why, How, Evaluation, Application, Wrap up
21
TABEL – TABle Extracted as Linked Data
[Figure: TABEL architecture. Labels in the figure: DECODE, AAD, Pre-processing modules ("Your module here!"), Query and Rank, Joint Inference, Generate RDF Linked Data, Verify (optional), Store / Publish. Input: the example table of players (Name, Team, Position, Height).]
Varish Mulwad, Tim Finin and Anupam Joshi, "A Domain Independent Framework for Extracting Linked Semantic Data from Tables", In Search Computing, ISBN 978-3-642-34212-7, vol. 7538, 2012.
22
Query – Candidate Entities
Query: Chicago + Context {Team} + Context {Michael Jordan, Shooting Guard, 1.98}
Query results: 1. Chicago 2. Judy_Chicago 3. Chicago_Bulls
Re-rank – Classifier (String Similarity, Popularity)
Re-ranked: 1. Chicago_Bulls 2. Chicago 3. Judy_Chicago
Varish Mulwad, Tim Finin, Zareen Syed and Anupam Joshi, “Using linked data to interpret tables”, In 1st Int. Workshop on Consuming Linked Data, held at the 9th Int. Semantic Web Conf. (ISWC 2010), Shanghai, China, Nov. 2010.
23
Query – Candidate Classes
Column: Team (Chicago, Philadelphia, Houston, San Antonio)
Candidate entities (Instance) for Chicago: 1. Chicago_Bulls 2. Chicago 3. Judy_Chicago
Classes (Class) of the candidate entities, per cell:
{Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams}
{Place, PopulatedPlace, Film, NationalBasketballAssociationTeams, …, …}
{…}
Candidate classes for the column: Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams, PopulatedPlace, Film, …
24
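Query – Candidate Relations
Columns: Name (Michael Jordan, Allen Iverson, Yao Ming, Tim Duncan) and Team (Chicago, Philadelphia, Houston, San Antonio)
Candidate entities for Michael Jordan: 1. Michael_Jordan 2. Michael_I_Jordan 3. Jordan_River
Candidate entities for Chicago: 1. Chicago_Bulls 2. Chicago 3. Judy_Chicago
Relations between candidate entities, per row: playsFor, livesIn, …
Candidate relations for the column pair: playsFor, livesIn, born, …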
25
Query – Literals* [numerical data]
Team: Chicago, Philadelphia, Houston, San Antonio
Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams, PopulatedPlace, Film
Chicago
26
Query – Literals
Column values: …, 1.98, 1.83, 2.29, 2.11, …
Candidate property: ?
27
NumKB
Population, Income: …, 320,900, 183,120, 229,198, 211,123, …
Height: …, 1.98, 1.83, 2.29, 2.11, …
Person, BasketBallPlayer (?)
NumKB: encodes distributional features for Linked Data properties
Allows queries using literal values (and optionally a property name)
Provides information on property domains
Example query values: 250,000; 1.95
28
Identify property domains
Property: seatingCapacity
Steps: get instances; get instance types; order by frequency
Instances: Queen's_Film_Theatre, Restaurant_Gordon_Ramsay, M&T_Bank_Stadium
Instance types: Theatre, Stadium, Restaurant
Ranked duplets: 1. seatingCapacity_Stadium [1] 2. seatingCapacity_Theatre [0.70] 3. seatingCapacity_Restaurant [0.57]
Duplet score: 1/
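The "get instances, get instance types, order by frequency" steps can be sketched as a simple frequency count. This is only an illustration: the function name `property_domains` and the instance `Hypothetical_Arena` are assumptions added for the example, and the duplet score itself is not reproduced because its formula is cut off on the slide.

```python
from collections import Counter

def property_domains(instance_types):
    """Rank the types of instances that carry a property by frequency.

    instance_types maps each instance to the list of its types.
    Returns (type, count) pairs, most frequent first.
    """
    counts = Counter(t for types in instance_types.values() for t in types)
    return counts.most_common()

# Instances that have a seatingCapacity value. The first three names come from
# the slide; Hypothetical_Arena is made up so that one type is more frequent.
SEATING_CAPACITY_INSTANCES = {
    "M&T_Bank_Stadium": ["Stadium"],
    "Queen's_Film_Theatre": ["Theatre"],
    "Restaurant_Gordon_Ramsay": ["Restaurant"],
    "Hypothetical_Arena": ["Stadium"],
}

ranked = property_domains(SEATING_CAPACITY_INSTANCES)
duplets = [f"seatingCapacity_{t}" for t, _ in ranked]
```

Each ranked type is paired with the property name to form a duplet such as seatingCapacity_Stadium.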
29
Identify property domain duplet values
Property, domain: [seatingCapacity, Stadium]
Steps: get property values; sort; trim front & back tails; compute µ & σ; compute ranges
1777720767500
Ranges: µ - σ : µ + σ; µ - 2σ : µ + 2σ
-212 : 25743 [86.66 %]
-13190 : 38721 [6.56 %]
-26168 : 51699 [4.67 %]
-39146 : 64677 [2.08 %]
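The range computation (sort, trim the tails, compute µ and σ, then build bands µ ± kσ) might look like the sketch below. The trim fraction, the use of the population standard deviation, and the function name `sigma_ranges` are assumptions; the slide does not state them. The share returned for each band is cumulative, which is a choice made for this sketch.

```python
from statistics import mean, pstdev

def sigma_ranges(values, trim_fraction=0.05, max_k=4):
    """Return (mu, sigma, bands) where bands are (k, low, high, share) tuples.

    trim_fraction is the assumed share of values dropped from each tail
    before computing mu and sigma.
    """
    ordered = sorted(values)
    cut = int(len(ordered) * trim_fraction)          # values dropped per tail
    trimmed = ordered[cut:len(ordered) - cut] if cut else ordered
    mu, sigma = mean(trimmed), pstdev(trimmed)
    bands = []
    for k in range(1, max_k + 1):
        low, high = mu - k * sigma, mu + k * sigma
        inside = sum(low <= v <= high for v in trimmed)
        bands.append((k, low, high, inside / len(trimmed)))  # cumulative share
    return mu, sigma, bands
```

For the seatingCapacity and Stadium example, the slide's first band (-212 : 25743) corresponds to k = 1 and the following bands to k = 2, 3, 4.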
30
Query – Literals
Query: 1.98, height
NumKB results: 1. height 2. diameter 3. minimumElevation
Range check: minRange < 1.98 < maxRange
Fuzzy string match (ColHeaderString, PropertyName)
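A minimal sketch of this lookup, assuming NumKB can be viewed as a mapping from property name to a (minRange, maxRange) pair. The ranges below are made-up numbers for illustration, and `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the system actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical NumKB view: property -> (minRange, maxRange). Illustrative values only.
NUMKB = {
    "height": (1.2, 2.4),
    "diameter": (0.5, 50.0),
    "minimumElevation": (-10.0, 900.0),
    "populationTotal": (1000.0, 5_000_000.0),
}

def candidate_properties(value, header, kb=NUMKB):
    """Keep properties with minRange < value < maxRange, ranked by header similarity."""
    hits = [prop for prop, (low, high) in kb.items() if low < value < high]

    def similarity(prop):
        return SequenceMatcher(None, header.lower(), prop.lower()).ratio()

    return sorted(hits, key=similarity, reverse=True)

candidates = candidate_properties(1.98, "Height")
```

The range check filters out properties whose values never look like 1.98, and the string match then prefers the property whose name resembles the column header.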
31
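Graphical Model for Tables
[Figure: graphical model with column header variables C1, C2, C3 and row value variables R11, R12, R13, R21, R22, R23, R31, R32, R33. Examples shown alongside: the header Team as the Class of the Instance values Chicago, Philadelphia, Houston, San Antonio; the headers Name, Vice-President, Office Held; the row values Beetle, Red, Gasoline.]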
32
Parameterized Graphical Model
[Figure: factor graph over the variable nodes C1, C2, C3 (column headers) and R11 to R33 (row values), with factor nodes 𝝍𝟑, 𝝍𝟒 and 𝝍𝟓.]
Variable node: column header
Variable node: row value
𝝍𝟑: function that captures the affinity between the column headers and row values
𝝍𝟒: captures interaction between row values
𝝍𝟓: captures interaction between column headers
33
Semantic Message Passing
[Figure: one round of Semantic Message Passing on the example table. Column headers C1: [BasketballPlayer], C2: [NBATeam], C3: [BasketBallPositions]; row values Michael_I_Jordan, Allen_Iverson, Yao_Ming, Chicago_Bulls. The factor nodes 𝝍𝟑, 𝝍𝟒 and 𝝍𝟓 send "Change" or "No Change" messages, together with the class BasketballPlayer and the relation playsFor.]
34
Semantic Message Passing
V – Variable Nodes; F – Factor Nodes
[V] Send current values
[F] Identify outliers
[F] Send semantics
[V] Pick new value
Semantically aware factor nodes
Varish Mulwad, Tim Finin and Anupam Joshi, "Semantic Message Passing for Generating Linked Data from Tables", In 12th Int. Semantic Web Conf. (ISWC 2013), Sydney, Australia, Oct. 2013.
35
𝝍𝟑 – Column Header & Row Value Agreement
Input to 𝝍𝟑: [Michael_I_Jordan, Allen_Iverson, Yao_Ming]
Column: Name (Michael_I_Jordan, Allen_Iverson, Yao_Ming)
Candidate classes: GeoPopulatedPlace, BasketBallPlayer, ArtWork
Classes of the row values: Athlete, BasketballPlayer, ArtificialIntelligenceResearchers
Ranked classes: 1. BasketBallPlayer 2. GeoPopulatedPlace …
Top class: BasketBallPlayer; topClassScore =
36
𝝍𝟑 – Column Header & Row Value Agreement
Check: topClassScore < thresholdClass ?
If so, update column header annotation = "No-Annotation"
Otherwise, use the topClass in the message passing process and send topClassScore as the confidence score
Example: column Name with topClass BasketBallPlayer; each row value (Michael_I_Jordan, Allen_Iverson, Yao_Ming) receives "Change" or "No-Change"
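The behaviour of the 𝝍𝟑 factor can be sketched as a vote over the classes of the current row value assignments. The definition of topClassScore is cut off on the slide, so the fraction of row values voting for the top class is used here as an assumption; the function name `psi3`, the default threshold, and the choice to send "No-Change" when no class is picked are also assumptions. The class lists in the example are illustrative.

```python
from collections import Counter

def psi3(assignments, classes_of, threshold_class=0.5):
    """Return (annotation, score, messages) for one column.

    assignments: current entity for each row value.
    classes_of: entity -> list of classes it belongs to.
    """
    votes = Counter(c for e in assignments for c in classes_of.get(e, ()))
    if not votes:
        return "No-Annotation", 0.0, {e: "No-Change" for e in assignments}
    top_class, count = votes.most_common(1)[0]
    score = count / len(assignments)        # assumed definition of topClassScore
    if score < threshold_class:
        return "No-Annotation", score, {e: "No-Change" for e in assignments}
    # Row values that do not agree with the top class are asked to change.
    messages = {
        e: "No-Change" if top_class in classes_of.get(e, ()) else "Change"
        for e in assignments
    }
    return top_class, score, messages

CLASSES = {
    "Michael_I_Jordan": ["ArtificialIntelligenceResearchers"],
    "Allen_Iverson": ["BasketballPlayer", "Athlete"],
    "Yao_Ming": ["BasketballPlayer"],
}
top, score, messages = psi3(["Michael_I_Jordan", "Allen_Iverson", "Yao_Ming"], CLASSES)
```

In this example the outlier Michael_I_Jordan is the only row value asked to change.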
37
𝝍𝟒 – Relation between Columns
Input to 𝝍𝟒: [Michael_I_Jordan, Chicago_Bulls], [Allen_Iverson, Philadelphia_76ers], [Yao_Ming, Houston_Rockets]
Columns: Name (Michael_I_Jordan, Allen_Iverson, Yao_Ming) and Team (Chicago_Bulls, Philadelphia_76ers, Houston_Rockets)
Candidate relations: playsFor, livesIn, …
Relations found for the rows: No-rel, playsFor, playsFor
Ranked relations: 1. playsFor 2. livesIn …
Top relation: playsFor
topRelScore =
38
𝝍𝟒 – Relation between Columns
Check: topRelScore < thresholdRelation ?
If so, update Rel Annotation = "No-Annotation"
Otherwise, use the topRel in the message passing process and send topRelScore as confidence
Example: column Name with topRel playsFor; each row value (Michael_I_Jordan, Allen_Iverson, Yao_Ming) receives "Change" or "No-Change"
39
Variable Node Update
Variable node R11: Michael Jordan
Messages received from 𝝍𝟑, 𝝍𝟒 and 𝝍𝟒 [labels in the figure: (Team), (Chicago), (Shooting Guard)]:
Change [BasketBallPlayer, 0.8]
Change [playsFor, 0.6]
No-Change [0.55]
avgChangeConfidenceScore [= 0.70] > avgNoChangeConfidenceScore [0.55] ?
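The update decision compares the average confidence of the "Change" messages with that of the "No-Change" messages. A minimal sketch, with the function name `should_update` and the handling of the empty cases as assumptions:

```python
from statistics import mean

def should_update(messages):
    """messages: list of (kind, confidence), kind is 'Change' or 'No-Change'."""
    change = [conf for kind, conf in messages if kind == "Change"]
    keep = [conf for kind, conf in messages if kind == "No-Change"]
    if not change:
        return False          # nothing asks for a change
    if not keep:
        return True           # every factor asks for a change
    return mean(change) > mean(keep)

# Messages from the slide: two "Change" (0.8, 0.6) and one "No-Change" (0.55).
R11_MESSAGES = [("Change", 0.8), ("Change", 0.6), ("No-Change", 0.55)]
decision = should_update(R11_MESSAGES)
```

With the slide's numbers the average change confidence is (0.8 + 0.6) / 2 = 0.70, which exceeds 0.55, so the node picks a new value.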
40
Variable Node Update
Variable node R11: Michael Jordan
Constraints received: (1) [Class: BasketBallPlayer, 0.8] (2) [Relation: playsFor, 0.6]
Candidate entities: Michael_I_Jordan, …, Michael_Jordan, …
Pick a candidate by relaxing the constraints in this order:
Satisfy constraints: [1, 2, 3]
Satisfy constraints: [1, 2]
Satisfy constraints: [1, 3]
Satisfy constraints: [2, 3]
Satisfy constraints: [1]
Satisfy constraints: [2]
Satisfy constraints: [3]
Choose "No Annotation"
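The relaxation order listed on the slide (all three constraints, then each pair, then each single constraint, then "No Annotation") matches the order in which `itertools.combinations` enumerates subsets of decreasing size. The sketch below assumes constraints are plain predicates over a candidate entity; the `FACTS` table and the third constraint are hypothetical.

```python
from itertools import combinations

def pick_entity(candidates, constraints):
    """Return (entity, satisfied constraint numbers), relaxing constraints in order."""
    indexes = range(len(constraints))
    for size in range(len(constraints), 0, -1):
        for subset in combinations(indexes, size):
            for entity in candidates:                 # candidates are already ranked
                if all(constraints[i](entity) for i in subset):
                    return entity, [i + 1 for i in subset]
    return "No Annotation", []

# Hypothetical facts about the two candidates, for illustration only.
FACTS = {
    "Michael_I_Jordan": {"ArtificialIntelligenceResearchers"},
    "Michael_Jordan": {"BasketballPlayer", "playsFor:Chicago_Bulls"},
}
CONSTRAINTS = [
    lambda e: "BasketballPlayer" in FACTS[e],          # (1) class from the column header
    lambda e: "playsFor:Chicago_Bulls" in FACTS[e],    # (2) relation with the Team column
    lambda e: "position:Shooting_Guard" in FACTS[e],   # (3) assumed third constraint
]
choice = pick_entity(["Michael_I_Jordan", "Michael_Jordan"], CONSTRAINTS)
```

Here no candidate satisfies all three constraints, so the first relaxed set [1, 2] selects Michael_Jordan.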
41
Halting Condition
Ideal Case – No variable node receives a 'CHANGE' message
Practical Case – Fraction of variable nodes that receive a 'CHANGE' message < thresholdChange
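Both halting cases reduce to a single check at the end of each iteration. A small sketch, where the function name `should_halt` and the default threshold value are assumptions:

```python
def should_halt(messages, threshold_change=0.05):
    """messages: one 'Change' or 'No-Change' entry per variable node."""
    changed = sum(1 for m in messages if m == "Change")
    if changed == 0:
        return True                                   # ideal case
    return changed / len(messages) < threshold_change  # practical case
```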
42
Tables Ontology
dbpedia-owl:BasketBallTeam
dbpedia:Michael_Jordan
dbpedia-owl:playsFor
43
RDF Linked Data Representation
tab:cell_01 a tab:ColumnHeader ;
    tab:cellLabel "Name"^^xsd:string ;
    tab:columnIndex "1"^^xsd:integer ;
    tab:valueType dbpedia-owl:BasketballPlayer .
tab:cell_11 a tab:DataCell ;
    tab:cellLabel "Michael Jordan"^^xsd:string ;
    tab:columnIndex "1"^^xsd:integer ;
    tab:rowIndex "1"^^xsd:integer ;
    tab:entity dbpedia:Michael_Jordan .
tab:HeaderRelation_12 a tab:TableRelation ;
    tab:relFromColumn tab:cell_01 ;
    tab:relToColumn tab:cell_02 ;
    tab:relLabel dbpedia-owl:team .
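Cell descriptions of this shape can be produced with plain string formatting. The sketch below is an assumption about how one might serialize a cell, not the tool's actual generator; the function name `cell_to_turtle` and the convention that row 0 is the header row are hypothetical. It uses the standard lowercase XSD datatype names xsd:string and xsd:integer.

```python
def cell_to_turtle(row, col, label, entity=None, value_type=None):
    """Serialize one table cell as Turtle; row 0 is treated as the header row."""
    is_header = row == 0
    safe = label.replace("\\", "\\\\").replace('"', '\\"')   # escape for a Turtle string
    parts = [
        f"tab:cell_{row}{col} a {'tab:ColumnHeader' if is_header else 'tab:DataCell'}",
        f'tab:cellLabel "{safe}"^^xsd:string',
        f'tab:columnIndex "{col}"^^xsd:integer',
    ]
    if not is_header:
        parts.append(f'tab:rowIndex "{row}"^^xsd:integer')
    if value_type:
        parts.append(f"tab:valueType {value_type}")
    if entity:
        parts.append(f"tab:entity {entity}")
    return " ;\n    ".join(parts) + " ."

header_cell = cell_to_turtle(0, 1, "Name", value_type="dbpedia-owl:BasketballPlayer")
data_cell = cell_to_turtle(1, 1, "Michael Jordan", entity="dbpedia:Michael_Jordan")
```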
44
Human in the loop
[Figure: the TABEL pipeline (labels: AAD, DECODE, Query and Rank, Joint Inference, Generate RDF Linked Data, Verify (optional), Store / Publish) applied to the example table of players, with markers showing where user feedback can enter: Before, During, After.]
45
Human in the loop – Before
No.  Name            Team          Position        Height
1    Michael Jordan  Chicago       Shooting Guard  1.98
2    Allen Iverson   Philadelphia  Point Guard     1.83
3    Yao Ming        Houston       Center          2.29
4    Tim Duncan      San Antonio   Power Forward   2.11
46
Human in the loop – Before
Candidate classes for Team: WomenArtist, BasketBallTeam, City, PopulatedPlace, SportsTeam, …
Candidate entities for Michael Jordan: Michael_I_Jordan, Michael_Jordan, Michael_Jackson, Michael_Wodruff, …
Candidate relations for Name, Team: livesIn, team, …
Assignments treated as "true values"
47
Human in the loop – During
𝝍𝟑: Team [0.2]
𝝍𝟒: Name, Team [0.1]
Candidate classes: WomenArtist, BasketBallTeam, City, SportsTeam, …
48
Human in the Loop – Impact on Joint Inference
𝝍𝟑: Name [BasketballPlayer]
Messages to Michael_I_Jordan, Allen_Iverson, Yao_Ming: "Change" or "No-Change", with class BasketBallPlayer
R11 (Michael Jordan): [Class: BasketBallPlayer, 1.0] [Fixed] [Relation: playsFor, 0.6]
𝝍𝟒: Name, Team [playsFor]
Messages to Michael_I_Jordan, Allen_Iverson, Yao_Ming: "Change" or "No-Change", with relation playsFor
R11 (Michael Jordan): [Class: BasketBallPlayer, 0.8] [Relation: playsFor, 1.0] [Fixed]
49
Human in the Loop – Impact on Joint Inference
R11: Chicago [Chicago_Bulls]
Candidate classes: WomenArtist, BasketBallTeam, City, PopulatedPlace, SportsTeam, …
Candidate relations: livesIn, team, …
50
Outline: Why, How, Evaluation, Application, Wrap up
51
Datasets
Dataset        Tables used in col. and rel. annotations   Tables used in data cell annotations   Avg. columns, rows
Web_Manual     150                                        371                                    2, 36
Web_Relation   28                                         –                                      4, 67
Wiki_Manual    25                                         39                                     4, 35
Wiki_Links     –                                          80                                     3, 16
Subset of the IIT-B dataset: Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. 36th VLDB (2010)
52
Ground Truth
Human annotators marked each class and relation as 'vital', 'okay', or 'incorrect'
To compute precision, assign scores to the class & relation predicted by the system:
1 – if the class was vital
0.5 – if the class was okay, but could have been better (e.g. Place vs. City)
0 – if it was incorrect
To compute recall, assign a score of 1 if vital or okay, 0 if incorrect
Ground truth for data cell value annotations comes from the IIT-B dataset
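The scoring scheme can be written down directly. The score tables below follow the slide; the denominators (number of predictions for precision, number of items that should be annotated for recall) are assumptions, since the slide gives only the per-label scores, and the function name `precision_recall` is hypothetical.

```python
PRECISION_SCORE = {"vital": 1.0, "okay": 0.5, "incorrect": 0.0}
RECALL_SCORE = {"vital": 1.0, "okay": 1.0, "incorrect": 0.0}

def precision_recall(judgements, n_expected):
    """judgements: annotator label for each predicted class or relation.

    n_expected: number of items that should have been annotated (assumed
    denominator for recall).
    """
    if not judgements or not n_expected:
        return 0.0, 0.0
    precision = sum(PRECISION_SCORE[j] for j in judgements) / len(judgements)
    recall = sum(RECALL_SCORE[j] for j in judgements) / n_expected
    return precision, recall

p, r = precision_recall(["vital", "okay", "incorrect", "vital"], n_expected=5)
```

An 'okay' label such as Place where City was expected counts half towards precision but fully towards recall.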
53
Column Header Annotations
[Figure: bar chart, % of relevant labels at rank 1, split into vital and okay, for Web_Manual, Web_Relation and Wiki_Manual; y-axis: percentage, 0 to 90. Data labels as extracted: 55.90, 36.17, 60.00, 24.31, 47.87, 18.57.]
54
Column Header Annotations
[Figure: % of relevant labels at different ranks; x-axis: rank, 1 to 10; y-axis: percentage, 0 to 90; series: Web_Manual, Web_Relation, Wiki_Manual.]
55
Column Header Annotations
Precision, Recall and F-score at rank 1
Dataset        Precision  Recall  F-score
Web_Manual     0.68       0.49    0.57
Web_Relation   0.60       0.42    0.49
Wiki_Manual    0.69       0.53    0.60
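Reading the chart values as precision 0.68, 0.60, 0.69 and recall 0.49, 0.42, 0.53 for Web_Manual, Web_Relation and Wiki_Manual, the reported F-scores are consistent with the usual harmonic mean of precision and recall. The check below is a sketch; the function name `f_score` is an assumption.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

RANK1 = {
    "Web_Manual": (0.68, 0.49),
    "Web_Relation": (0.60, 0.42),
    "Wiki_Manual": (0.69, 0.53),
}
f_scores = {name: round(f_score(p, r), 2) for name, (p, r) in RANK1.items()}
```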
56
Column Header Annotations
[Figure: Precision vs. Recall at ranks 1-10; x-axis: rank (k), 1 to 10; y-axis: 0 to 0.9; precision and recall curves for Web_Manual, Web_Relation, Wiki_Manual.]
57
Column Header Annotations
Semantic Message Passing vs. the rest: F-scores
[Figure: bar chart of F-scores for SMP, IIT-B and GOOG on Web_Manual and Wiki_Manual; y-axis: 0 to 0.8. Data labels as extracted: 0.57, 0.6, 0.43, 0.56, 0.65, 0.67.]
58
Example Column Header Predictions
Column: Constituency
Predicted: N.A.
DBpedia classes [Ranks 2-10]: OfficeHolder, PrimeMinister, Politician, ElectionEvent, AdministrativeRegion, PopulatedPlace, University, EducationalInstitution
Column: Name of Elected M.P.
Predicted: OfficeHolder
DBpedia classes [Ranks 2-10]: ElectionEvent, PrimeMinister, Politician, Country, PopulatedPlace, Settlement, University, EducationalInstitution
59
Relation Annotations
[Figure: bar chart, % of relevant relations at rank 1, split into vital and okay, for Web_Manual_dbp, Web_Manual_yago, Web_Relation_dbp, Web_Relation_yago, Wiki_Manual_dbp, Wiki_Manual_yago; y-axis: percentage, 0 to 80. Data labels as extracted: 34.43, 33.33, 50.00, 50.00, 53.85, 66.67, 3.28, 6.25.]
60
Relation Annotations
[Figure: % of relevant relations at ranks 1-10; x-axis: rank, 1 to 10; y-axis: percentage, 0 to 80; series: Web_Manual, Web_Relation and Wiki_Manual, each against DBpedia (_dbp) and Yago (_yago).]
61
Relation Annotations
Semantic Message Passing vs. the rest: F-scores
[Figure: bar chart of F-scores for SMP and IIT-B on Web_Manual, Web_Relation and Wiki_Manual; y-axis: 0 to 1. Data labels as extracted: 0.89, 0.86, 0.97, 0.51, 0.63, 0.68.]
62
Example Relation Predictions
Column Pair: President – Birth state
Predicted: N.A.
DBpedia rels [Ranks 2-10]: location, deathPlace, locatedInArea, birthPlace, isPartOf, largestCity, almaMater, region, state
Column Pair: Name of Elected M.P. – Party Affiliation
Predicted: party
DBpedia rels [Ranks 2-8]: affiliation, otherParty, primeMinister, deathPlace, birthPlace, region, NA
63
Data Cell Value Annotations
[Figure: bar chart, % of correctly linked entities; y-axis: 0 to 80. Bars in order: Wiki_Link, Web_Manual, Wiki_Manual. Data labels as extracted: 75.89, 63.07, 67.42.]
64
How long did it run?
[Figure: number of variables that received a "change" message at the end of an iteration (y-axis, 0 to 300) against iteration number (x-axis, 0 to 12). Each line represents a table.]
65
Literals – Experimental Setup
Subset of 16 tables [17 literal value columns] from the Wiki_Link Dataset
Generate property candidate set by querying against NumKB
Manually annotated each literal column with an appropriate DBpedia property
66
Header Cell Annotations for Literals
Percentage of correct properties at ranks 1-10 (two charts; the caption of the second chart is not recoverable)
Rank      1      2      3      4     5     6     7     8     9     10
Chart 1   35.29  23.53  17.65  5.88  5.88  0.00  0.00  5.88  0.00  0.00
Chart 2   64.71  23.53  0.00   0.00  5.88  0.00  0.00  5.88  0.00  0.00
67
Human in the loop – Experimental Setup
Subset of 11 tables from the Wiki_Link dataset
User feedback: Correct column header class [1 column in 9 tables and 2 for the remaining 2 tables]
The rest of the experimental setup is the same.
68
Data Cell Annotations
Human in the Loop (HIL) vs. No Human in the Loop
[Figure: % of correctly annotated data cells (y-axis, 0 to 100) per table number (x-axis, 1 to 11); series: No HIL, HIL.]
          Correct entities   Total   %
HIL       286                402     71.14
No-HIL    245                402     60.95
69
Outline: Why, How, Evaluation, Application, Wrap up
70
Interpreting Medical Tables as Linked Data for Generating Meta-Analysis Reports
71
TABEL – TABle Extracted as Linked Data
[Figure: TABEL architecture with an added pre-processing module. Labels in the figure: AAD, DECODE, Pre-processing modules ("Your module here!", Normalize), Query and Rank, Joint Inference, Generate RDF Linked Data, Verify (optional), Store / Publish. Input: the example table of players (Name, Team, Position, Height).]
Varish Mulwad, Tim Finin and Anupam Joshi, "Interpreting Medical Tables as Linked Data to Generate Meta–Analysis Reports", In 15th IEEE Int. Conf. on Information Reuse and Integration (IRI 2014), San Francisco, USA, Aug. 2014.
72
Preprocessing – Normalize
73
Preprocessing – Normalize
Split header cells into Query String and Metadata
Example: Patients with Secondary Thrombosis; N = 146
Normalize data cells; identify types or units
Example: Smoker, no. (%): no. --> 49; % --> 33.6
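The two normalization steps can be sketched with regular expressions. The raw cell layout is not shown on the slide, so the forms "Patients with Secondary Thrombosis N = 146" for a header and "49 (33.6)" for a "no. (%)" data cell are assumptions, as are the function names and the output dictionaries.

```python
import re

# Header: free text followed by a group size such as "N = 146" (assumed layout).
HEADER_RE = re.compile(r"^(?P<query>.*?)\s*\(?\s*N\s*=\s*(?P<n>\d+)\s*\)?\s*$")
# Data cell in "no. (%)" form, e.g. "49 (33.6)" (assumed layout).
CELL_RE = re.compile(r"^\s*(?P<count>\d+)\s*\(\s*(?P<pct>\d+(?:\.\d+)?)\s*%?\s*\)\s*$")

def split_header(text):
    """Split a header cell into a query string and metadata."""
    m = HEADER_RE.match(text)
    if m:
        return {"query": m.group("query"), "metadata": {"N": int(m.group("n"))}}
    return {"query": text.strip(), "metadata": {}}

def normalize_cell(text):
    """Split a 'no. (%)' data cell into its count and percentage."""
    m = CELL_RE.match(text)
    if m:
        return {"no.": int(m.group("count")), "%": float(m.group("pct"))}
    return {"raw": text.strip()}

header = split_header("Patients with Secondary Thrombosis N = 146")
cell = normalize_cell("49 (33.6)")
```

The query string is what gets sent to the knowledge sources, while N and the unit labels are kept as metadata for the RDF representation.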
74
Query – Candidate Classes* [DBpedia]
Query: Hypertension
Query results: (1) Idiopathic intracranial hypertension (2) Pulmonary hypertension (3) Hypertension
Re-rank – Classifier (String Similarity, Popularity)
Re-ranked: (1) Hypertension (2) Pulmonary hypertension (3) Idiopathic intracranial hypertension
Also evaluated against SNOMED CT & UMLS
75
Query – Candidate Classes [Hybrid]
[Figure: hybrid lookup for the query "Hypertension". Labels shown: SNOMED CT, API, "No results?". Both result lists read: (1) Hypertension (2) Pulmonary hypertension (3) Idiopathic intracranial hypertension.]
76
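Modeling Medical Tables as RDF
[Figure: RDF model for a medical table.
PatientGroup --numberOfIndividuals--> xsd:integer (example: 146)
PatientGroup --hasGroupAttribute--> owl:Thing (example: umls:Secondary_Thrombosis)
Value --hasType--> xsd:string (example: %)
Value --hasRawValue--> xsd:double (example: 33.6)]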
77
Interactive tool to generate Meta – Analysis reports
User interface to define meta-analysis parameters and select studies
Tool automatically generates relevant SPARQL queries
78
Evaluation
79
Header Cell Annotations
Distribution of header cell concepts at different ranks (percentage of header cells)
Dataset: 7 tables (122 header cells)
NF: correct concept not found in the candidate set
Legend order in the chart: SNOMED CT, UMLS, HYBRID, DBPEDIA; the text does not show which series (A to D) is which.
Rank     A      B      C      D
1        29.51  60.66  59.02  29.51
2-5      10.66  8.2    10.66  20.49
6-10     4.1    2.46   4.92   3.28
11-25    3.28   2.46   4.1    4.92
26-101   18.85  4.1    9.02   16.39
NF       33.61  22.13  12.3   25.41
80
Retrieval (Find) Evaluation Experimental Setup
• Generated Linked Data from four tables
• Executed retrieval SPARQL queries to find tables that reported a correlation between venous thrombosis and four different cardiovascular risk factors
• Average Precision: 0.79; Average Recall: 0.75
81
Outline: Why, How, Evaluation, Application, Wrap up
82
Conclusions
I claimed:
"It is possible to generate high quality linked data from tables by jointly inferring the semantics of column headers, values (string and literal) in table cells, and relations between columns augmented with background knowledge from open data sources such as the Linked Open Data cloud."
83
Conclusions
It is possible to generate high quality linked data from tables by jointly inferring the semantics
TABEL jointly inferred the semantics; thorough evaluation showed promising results
… the semantics of column headers, values (string and literal) in table cells, and relations between columns
A novel technique to generate candidate properties from literal values
84
Conclusions
It is possible to generate high quality linked data from tables
Tables ontology to represent the inferred semantics
Demonstrated domain independence and extensibility and support for tables with different structures
Explored different models for Human in the loop
85
Future Work
Schema + Data driven approach
Build on the work on inferring literals; NumKB
Further develop Human in the loop
Tool to generate meta-analysis reports
86
Acknowledgements
Dr. Tim Finin
Dr. Anupam Joshi
Dr. Tim Oates
Dr. Yun Peng
Dr. L V Subramaniam
Dr. Indrajit Bhattacharya
Lab mates & Friends!
Thank You! Our papers on this research topic have garnered 93 citations!