Post on 02-Jan-2016

Machines learnt how to understand tables. What happens next will shock you.

Welcome to the PhD dissertation defense of Varish Mulwad!

2

TABEL – Domain Independent and Extensible Framework to Infer the Semantics of Tables

Varish Mulwad

Ph.D. Dissertation Defense

Adviser: Dr. Tim Finin

January 8, 2015

3

4

Zareen Syed, Tim Finin, Varish Mulwad, and Anupam Joshi, "Exploiting a Web of Semantic Data for Interpreting Tables", In 2nd Web Science Conference (WebSci 2010), Raleigh, NC, USA, Apr. 2010

Semantics of a Table

Name            Team          Position        Height
Michael Jordan  Chicago       Shooting Guard  1.98
Allen Iverson   Philadelphia  Point Guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power Forward   2.11

NationalBasketballAssociationTeams

http://dbpedia.org/resource/Allen_Iverson

Map literals as property values

playsFor

5

Semantics of a Table

Name            Team          Position        Height
Michael Jordan  Chicago       Shooting Guard  1.98
Allen Iverson   Philadelphia  Point Guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power Forward   2.11

Linked Data

tab:cell_01 a tab:ColumnHeader;
    tab:cellLabel "Name"^^xsd:string;
    tab:columnIndex "1"^^xsd:integer;
    tab:valueType dbpedia-owl:BasketballPlayer.

tab:cell_11 a tab:DataCell;
    tab:cellLabel "Michael Jordan"^^xsd:string;
    tab:columnIndex "1"^^xsd:integer;
    tab:rowIndex "1"^^xsd:integer;
    tab:entity dbpedia:Michael_Jordan.

All this in a completely automated way!

6

TABEL – Domain Independent & Extensible Framework to Infer the Semantics of Tables

7

Thesis Statement

It is possible to generate high quality linked data from tables by jointly inferring the semantics of column headers, values (string and literal) in table cells, and relations between columns augmented with background knowledge from open data sources such as the Linked Open Data cloud.

8

Contributions

o Probabilistic Graphical Model to jointly infer the semantics, and a novel inference technique, Semantic Message Passing

o A proof-of-concept, user-interactive application to generate meta-analysis reports automatically

o Develop & Explore Human in the Loop paradigm

o A novel technique to generate candidate properties from literal values

9

Outline: Why, How, Evaluation, Application, Wrap up

10

Tables are everywhere!

154 million high quality relational tables on the web

~400,000 CSVs on data.gov

Healthcare, Financial and other domains

11

The Semantic Web & the Web

Spreadsheets/CSVs to RDF/OWL

12

Evidence Based Medicine

Combine: All studies that compare organic milk v/s grass-fed cow milk

Produce unified report: Organic milk is better!

Meta-Analysis report

Correlation between Cardiovascular risk factors and Venous Thrombosis

Duration of proton pump inhibitors as first line of treatment for Helicobacter pylori eradication

13

Tables are valuable

14

Meta – Analysis: Today

Correlation between Cardiovascular risk factors and Venous Thrombosis [1]

Initial search >> 1949 studies

Final # of studies selected >> 22!

[1] W. Ageno, C. Becattini, T. Brighton, R. Selby, and P. W. Kamphuisen, "Cardiovascular risk factors and venous thromboembolism: a meta-analysis," Circulation, vol. 117, no. 1, pp. 93–102, 2008.

• Keyword based search

• Initial search yields large # of results

• Manually filter out irrelevant results

15

Not restricted to healthcare …

16

Related Work

Databases & Spreadsheets to RDF:

Existing solutions: Largely manual or semi-automatic

Number of ontologies, classes, relations

Automatic solutions: “Row as RDF node”; local mappings

No links to existing classes, properties, entities

17

Related Work

Semantics of Table:

Infer semantics for only parts of the table [header cells; relation between headers; data cell values or a combination of the two]

Fail to generate RDF Linked Data representation

Poor support for literals

18

Related Work

Limaye et al. [Sep. 2010] [Soumen Chakrabarti’s group @ IIT-B]

RDF Linked Data representation

Literal values

Knoblock et al. [May 2012] [Craig Knoblock’s group @ USC – ISI]

Largely focuses on header cell semantics & relation between headers

Requires initial user input before automatic predictions from the system

Venetis et al. [Sep. 2011] [Alon Halevy’s group @ Google]

Column header and relation semantics

Literal values; RDF Linked Data

19

What TABEL brings to the “table”

Infers the complete semantics of a table

Generates an RDF Linked Data representation

Supports tables with different structures over a variety of domains [medical tables]

Incorporates user feedback to improve the quality of inferred semantics

Infers the semantics of literal values* [numerical values]

20

Outline: Why, How, Evaluation, Application, Wrap up

21

TABEL – TABle Extracted as Linked Data

DECODE AAD

[Figure: TABEL pipeline: Pre-processing modules → Query and Rank → Joint Inference → Generate RDF Linked Data → Verify (optional) → Store / Publish, applied to the example table below]

Name            Team          Position        Height
Michael Jordan  Chicago       Shooting Guard  1.98
Allen Iverson   Philadelphia  Point Guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power Forward   2.11

Your module here!

Varish Mulwad, Tim Finin and Anupam Joshi, “A Domain Independent Framework for Extracting Linked Semantic Data from Tables”, In Search Computing, ISBN 978-3-642-34212-7, vol. 7538, 2012.

22

Query – Candidate Entities

Query: Chicago + Context {Team} + Context {Michael Jordan, Shooting Guard, 1.98}

Initial ranking: 1. Chicago 2. Judy_Chicago 3. Chicago_Bulls

Re-rank – Classifier (String Similarity, Popularity)

Re-ranked: 1. Chicago_Bulls 2. Chicago 3. Judy_Chicago
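The re-ranking step can be illustrated with a small sketch. The slide names a classifier over string similarity and popularity; the code below swaps in a plain weighted sum as a stand-in for that classifier, and the popularity values and weights are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher


def string_similarity(query: str, entity: str) -> float:
    """Similarity in [0, 1] between the cell string and an entity name."""
    name = entity.replace("_", " ").lower()
    return SequenceMatcher(None, query.lower(), name).ratio()


def rerank(query, popularity, w_sim=0.5, w_pop=0.5):
    """Order candidate entities by a weighted sum of the two features.

    Assumption: a weighted sum replaces the trained classifier from the slide.
    """
    def score(entity):
        return w_sim * string_similarity(query, entity) + w_pop * popularity[entity]

    return sorted(popularity, key=score, reverse=True)


# Hypothetical popularity values in [0, 1]; not taken from the slides.
popularity = {"Chicago": 0.6, "Judy_Chicago": 0.1, "Chicago_Bulls": 0.9}
print(rerank("Chicago", popularity))
```

With all weight on string similarity the exact match "Chicago" comes first; shifting weight to popularity (or adding context features, as the query on the slide does) is what can move Chicago_Bulls to the top.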

Varish Mulwad, Tim Finin, Zareen Syed and Anupam Joshi, “Using linked data to interpret tables”, In 1st Int. Workshop on Consuming Linked Data, held at the 9th Int. Semantic Web Conf. (ISWC 2010), Shanghai, China, Nov. 2010.

23

Query – Candidate Classes

Class

Instance

Candidate entities: 1. Chicago_Bulls 2. Chicago 3. Judy_Chicago

{Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams}

{Place, PopulatedPlace, Film, NationalBasketballAssociationTeams, …, …}

{…}

Candidate classes: Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams, PopulatedPlace, Film, …

Team

Chicago

Philadelphia

Houston

San Antonio
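A minimal sketch of how the candidate classes for a column could be collected: take the types of every candidate entity of every cell and form their union. The function name and the entity-to-type mapping are hypothetical; in practice the types would come from a knowledge base such as DBpedia.

```python
def candidate_classes(column_candidates, types_of):
    """Union of the types of all candidate entities in a column.

    column_candidates: one list of candidate entities per cell.
    types_of: hypothetical lookup from entity to its set of types.
    """
    classes = set()
    for cell_candidates in column_candidates:
        for entity in cell_candidates:
            classes |= types_of.get(entity, set())
    return classes


# Illustrative type assignments only; not copied from a knowledge base.
types_of = {
    "Chicago_Bulls": {"NationalBasketballAssociationTeams"},
    "Chicago": {"Place", "City"},
    "Judy_Chicago": {"WomenArtist", "LivingPeople"},
    "Houston": {"Place", "PopulatedPlace"},
}
column = [["Chicago_Bulls", "Chicago", "Judy_Chicago"], ["Houston"]]
print(sorted(candidate_classes(column, types_of)))
```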

24

Query – Candidate Relations

Name

Michael Jordan

Allen Iverson

Yao Ming

Tim Duncan

Team

Chicago

Philadelphia

Houston

San Antonio

1. Chicago_Bulls 2. Chicago 3. Judy_Chicago

1. Michael_Jordan 2. Michael_I_Jordan 3. Jordan_River

playsFor, livesIn, …

…

Candidate relations: playsFor, livesIn, born, …

25

Query – Literals* [numeral data]

Team

Chicago

Philadelphia

Houston

San Antonio

Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams, PopulatedPlace, Film

Chicago

26

Query – Literals

1.98

1.83

2.29

2.11

?

27

NumKB

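[Figure: NumKB maps sets of literal values to candidate properties. Values such as 320,900; 183,120; 229,198; 211,123 suggest Population or Income. Values such as 1.98, 1.83, 2.29, 2.11 suggest Height, with domain Person or BasketBallPlayer(?). Example query values: 250,000 and 1.95]

NumKB: Encodes distributional features for Linked Data properties

Allows query using literal values (and optionally property name)

Provides information on property domains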

28

Identify property domains

Property: seatingCapacity

Get Instances → Get Instance Types → Order by frequency

Instances: Queen's_Film_Theatre, Restaurant_Gordon_Ramsay, M&T_Bank_Stadium

Instance types: Theatre, Stadium, Restaurant

1. seatingCapacity_Stadium [1]
2. seatingCapacity_Theatre [0.70]
3. seatingCapacity_Restaurant [0.57]

Duplet score: 1/
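The three steps on this slide (get instances, get instance types, order by frequency) can be sketched as below. The instance lists are hypothetical, and the sketch returns raw frequencies only: the duplet score formula is cut off in the transcript, so no attempt is made to reproduce the bracketed scores.

```python
from collections import Counter


def property_domains(prop, instances, types_of):
    """Return (property_domain duplet, frequency) pairs, most frequent first.

    instances: entities that carry the property (assumed to be given).
    types_of: hypothetical lookup from entity to its list of types.
    """
    counts = Counter()
    for instance in instances:
        counts.update(types_of.get(instance, []))
    return [(f"{prop}_{domain}", n) for domain, n in counts.most_common()]


# Illustrative data; only the three named instances appear on the slide.
types_of = {
    "M&T_Bank_Stadium": ["Stadium"],
    "Hypothetical_Stadium_A": ["Stadium"],
    "Hypothetical_Stadium_B": ["Stadium"],
    "Queen's_Film_Theatre": ["Theatre"],
    "Hypothetical_Theatre_A": ["Theatre"],
    "Restaurant_Gordon_Ramsay": ["Restaurant"],
}
duplets = property_domains("seatingCapacity", list(types_of), types_of)
print(duplets)
```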

29

Identify property domain duplet values

Property, domain: [seatingCapacity, Stadium]

Get Property Values → Sort; Trim front & back tails; Compute µ & σ → Compute Ranges

1777720767500

Ranges:

µ - σ : µ + σ
µ - 2σ : µ + 2σ

-212 : 25743 [86.66 %]
-13190 : 38721 [6.56 %]
-26168 : 51699 [4.67 %]
-39146 : 64677 [2.08 %]
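A rough sketch of the range computation described here: sort the property values, trim both tails, compute µ and σ, then build the ranges µ ± kσ. The trim fraction, the use of the population standard deviation, and the function name are assumptions; the slide does not state them. The four numeric ranges on the slide are consistent with k = 1 to 4, so that is the default.

```python
import statistics


def duplet_ranges(values, trim_fraction=0.05, max_k=4):
    """Return [(µ - kσ, µ + kσ) for k = 1..max_k] after trimming both tails.

    Assumptions: trim_fraction of the values is dropped from each end,
    and σ is the population standard deviation of what remains.
    """
    ordered = sorted(values)
    cut = int(len(ordered) * trim_fraction)
    trimmed = ordered[cut:len(ordered) - cut]
    mu = statistics.mean(trimmed)
    sigma = statistics.pstdev(trimmed)
    return [(mu - k * sigma, mu + k * sigma) for k in range(1, max_k + 1)]


# Toy values with one outlier on each side, removed by the trimming step.
ranges = duplet_ranges([100, 1, 2, 3, 4, 5, -100], trim_fraction=0.2)
print(ranges[0])
```

A literal value can then be matched against a duplet by checking which range it falls into, with narrower ranges indicating a stronger match.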

30

Query – Literals

Query: 1.98, height

NumKB results: 1. height 2. diameter 3. minimumElevation

minRange < 1.98 < maxRange

Fuzzy string match (ColHeaderString, PropertyName)
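The two checks on this slide can be combined in a small sketch: keep the properties whose range contains the literal, then order them by a fuzzy match between the column header string and the property name. The NumKB entries and their ranges below are hypothetical, and `difflib.SequenceMatcher` is used as one possible fuzzy matcher; the slides do not say which one TABEL uses.

```python
from difflib import SequenceMatcher


def query_numkb(value, col_header, numkb):
    """Properties whose range contains value, best header match first.

    numkb: hypothetical mapping property name -> (minRange, maxRange).
    """
    hits = [prop for prop, (lo, hi) in numkb.items() if lo < value < hi]

    def header_match(prop):
        return SequenceMatcher(None, col_header.lower(), prop.lower()).ratio()

    return sorted(hits, key=header_match, reverse=True)


# Illustrative ranges only; not taken from NumKB.
numkb = {
    "height": (1.2, 2.4),
    "diameter": (0.5, 5000.0),
    "minimumElevation": (-100.0, 4000.0),
    "populationTotal": (1000.0, 1e7),
}
print(query_numkb(1.98, "Height", numkb))
```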

31

Graphical Model for Tables

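[Figure: graphical model with column header variable nodes C1, C2, C3 and row value variable nodes R11 to R33. Example: the Team column (class) with values Chicago, Philadelphia, Houston, San Antonio (instances). Other example labels: Name, Vice-President, Office Held; Beetle, Red, Gasoline]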

32

Parameterized Graphical Model

[Figure: factor graph over column header nodes C1, C2, C3 and row value nodes R11 to R33, connected through factor nodes 𝝍𝟑, 𝝍𝟒, 𝝍𝟓]

Variable Node: Column header

Variable Node: Row value

𝝍𝟑: Function that captures the affinity between the column headers and row values

𝝍𝟒: Captures interaction between row values

𝝍𝟓: Captures interaction between column headers

33

Semantic Message Passing

[Figure: factor nodes 𝝍𝟑, 𝝍𝟒, 𝝍𝟓 send messages to variable nodes. Column headers: C1: [BasketballPlayer], C2: [NBATeam], C3: [BasketBallPositions]. Row values shown: Michael_I_Jordan, Chicago_Bulls, Yao_Ming, Allen_Iverson. Messages shown: “Change” with BasketballPlayer, “Change” with playsFor, and “No Change”]

34

Semantic Message Passing

[V] Send current values

[F] Identify Outliers

[F] Send semantics

[V] Pick new value

V – Variable Nodes; F – Factor Nodes

Semantically Aware Factor Nodes

Varish Mulwad, Tim Finin and Anupam Joshi, "Semantic Message Passing for Generating Linked Data from Tables", In 12th Int. Semantic Web Conf. (ISWC 2013), Sydney, Australia, Oct. 2013.

35

𝝍𝟑 – Column Header & Row Value Agreement

𝝍𝟑 [Michael_I_Jordan, Allen_Iverson, Yao_Ming]

Name column candidate classes: GeoPopulatedPlace, BasketBallPlayer, ArtWork

Michael_I_Jordan: ArtificialIntelligenceResearchers

Allen_Iverson, Yao_Ming: Athlete, BasketballPlayer

1. BasketBallPlayer 2. GeoPopulatedPlace …

Top Class: BasketBallPlayer

topClassScore =

36

𝝍𝟑 – Column Header & Row Value Agreement

topClassScore < thresholdclass ? Update Column Header Annotation = “No-Annotation”

Use the topClass in Message Passing process

Send topClassScore as confidence score

Name: Michael_I_Jordan, Allen_Iverson, Yao_Ming

Messages: Change [BasketBallPlayer] / No-Change
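The behaviour of 𝝍𝟑 can be sketched as a vote among the current row values. The formula for topClassScore is cut off in the transcript, so the sketch assumes it is the fraction of row values that support the top class; the threshold value, the tie-breaking rule and the type assignments are also assumptions.

```python
from collections import Counter


def psi3(row_entities, types_of, threshold_class=0.5):
    """Pick the top class for a column and decide the message per row value.

    Assumption: topClassScore = votes for top class / number of row values.
    Returns (annotation, topClassScore, {entity: "Change" | "No-Change"}).
    """
    votes = Counter()
    for entity in row_entities:
        votes.update(set(types_of.get(entity, [])))
    if not votes:
        return "No-Annotation", 0.0, {}
    # Highest vote count wins; ties are broken alphabetically (an assumption).
    top_class = min(votes, key=lambda c: (-votes[c], c))
    top_class_score = votes[top_class] / len(row_entities)
    if top_class_score < threshold_class:
        return "No-Annotation", top_class_score, {}
    messages = {
        entity: "No-Change" if top_class in types_of.get(entity, []) else "Change"
        for entity in row_entities
    }
    return top_class, top_class_score, messages


# Illustrative type assignments for the current row values.
types_of = {
    "Michael_I_Jordan": ["ArtificialIntelligenceResearchers"],
    "Allen_Iverson": ["Athlete", "BasketballPlayer"],
    "Yao_Ming": ["BasketballPlayer"],
}
annotation, score, messages = psi3(["Michael_I_Jordan", "Allen_Iverson", "Yao_Ming"], types_of)
print(annotation, round(score, 2), messages)
```

The outlier (here Michael_I_Jordan, the researcher) is the row value asked to change, and the top class travels with the message so the variable node knows what to look for.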

37

𝝍𝟒 – Relation between Columns

𝝍𝟒: [Michael_I_Jordan, Chicago_Bulls], [Allen_Iverson, Philadelphia_76ers], [Yao_Ming, Houston_Rockets]

Name: Michael_I_Jordan, Allen_Iverson, Yao_Ming

Team: Chicago_Bulls, Philadelphia_76ers, Houston_Rockets

Candidate relations: playsFor, livesIn, …

Relation per row pair: No-rel, playsFor, playsFor

1. playsFor 2. livesIn …

Top relation: playsFor

topRelScore =

38

𝝍𝟒 – Relation between Columns

topRelScore < thresholdrelation ? Update Rel Annotation = “No-Annotation”

Use the topRel in Message Passing process

Send topRelScore as confidence

Name: Michael_I_Jordan, Allen_Iverson, Yao_Ming

Messages: Change [playsFor] / No-Change

39

Variable Node Update

R11: Michael Jordan

Messages received from 𝝍𝟑, 𝝍𝟒, 𝝍𝟒:

Change [BasketBallPlayer, 0.8]

Change [playsFor, 0.6]

No-Change [0.55]

Neighbors shown: (Team), (Chicago), (Shooting Guard)

avgChangeConfidenceScore [0.70] > avgNoChangeConfidenceScore [0.55] ?
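The decision on this slide is a comparison of two averages, which a few lines of code make concrete. The message representation (a list of kind and confidence pairs) and the treatment of an empty group as 0.0 are assumptions for illustration.

```python
def should_update(messages):
    """Compare the average confidence of Change and No-Change messages.

    messages: list of (kind, confidence) pairs received by a variable node.
    Returns (update?, avgChangeConfidenceScore, avgNoChangeConfidenceScore).
    """
    change = [conf for kind, conf in messages if kind == "Change"]
    no_change = [conf for kind, conf in messages if kind == "No-Change"]
    # Assumption: a group with no messages contributes an average of 0.0.
    avg_change = sum(change) / len(change) if change else 0.0
    avg_no_change = sum(no_change) / len(no_change) if no_change else 0.0
    return avg_change > avg_no_change, avg_change, avg_no_change


# The three messages shown for R11 (Michael Jordan).
received = [("Change", 0.8), ("Change", 0.6), ("No-Change", 0.55)]
update, avg_change, avg_no_change = should_update(received)
print(update, round(avg_change, 2), round(avg_no_change, 2))
```

For the values on the slide, (0.8 + 0.6) / 2 = 0.70 exceeds 0.55, so the node goes on to pick a new value.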

40

Variable Node Update

[Class: BasketBallPlayer, 0.8]
[Relation: playsFor, 0.6]

R11: Michael Jordan

Constraints: (1) BasketBallPlayer (2) playsFor

Candidates: Michael_I_Jordan, …, Michael_Jordan, …

Order of preference:

Satisfy constraints: [1, 2, 3]
Satisfy constraints: [1, 2]
Satisfy constraints: [1, 3]
Satisfy constraints: [2, 3]
Satisfy constraints: [1]
Satisfy constraints: [2]
Satisfy constraints: [3]
Choose “No Annotation”
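The preference order listed on this slide (all constraints first, then smaller subsets, then “No Annotation”) can be generated rather than hard-coded. The sketch below assumes each candidate entity comes with a set of facts it satisfies; the fact sets, function names and the rule of taking the first candidate in ranked order are hypothetical.

```python
from itertools import combinations


def preference_order(constraint_ids):
    """Yield constraint subsets from largest to smallest, as on the slide."""
    ids = sorted(constraint_ids)
    for size in range(len(ids), 0, -1):
        yield from combinations(ids, size)


def pick_entity(ranked_candidates, facts, constraints):
    """Return the first candidate satisfying the most preferred subset.

    constraints: mapping constraint id -> required class or relation.
    facts: hypothetical mapping entity -> set of classes/relations it has.
    """
    for subset in preference_order(constraints):
        required = {constraints[i] for i in subset}
        for entity in ranked_candidates:
            if required <= facts.get(entity, set()):
                return entity
    return "No Annotation"


# Illustrative facts for the two candidates named on the slide.
facts = {
    "Michael_I_Jordan": {"Scientist"},
    "Michael_Jordan": {"BasketBallPlayer", "playsFor"},
}
constraints = {1: "BasketBallPlayer", 2: "playsFor"}
print(pick_entity(["Michael_I_Jordan", "Michael_Jordan"], facts, constraints))
```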

41

Halting Condition

Ideal Case – No variable node receives a ‘CHANGE’ message

Practical Case – Fraction of variable nodes that receive a ‘CHANGE’ message < thresholdChange
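The halting condition fits naturally into the outer loop of message passing. In the sketch below, `step` stands for one full round (variable nodes send values, factor nodes reply, variable nodes update) and is assumed to return how many variable nodes received a ‘CHANGE’ message; the iteration cap and the threshold value are additional assumptions, not taken from the slides.

```python
def run_message_passing(step, num_variables, threshold_change=0.05, max_iterations=50):
    """Repeat message-passing rounds until the halting condition holds.

    step: hypothetical callable running one round and returning the number
          of variable nodes that received a 'CHANGE' message.
    Returns the number of rounds executed.
    """
    for iteration in range(1, max_iterations + 1):
        changed = step()
        # Ideal case: nobody was asked to change.
        # Practical case: the changed fraction drops below the threshold.
        if changed == 0 or changed / num_variables < threshold_change:
            return iteration
    return max_iterations


# Simulated rounds with a shrinking number of CHANGE messages.
counts = iter([250, 80, 20, 3, 0])
rounds = run_message_passing(lambda: next(counts), num_variables=300)
print(rounds)
```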

42

Tables Ontology

dbpedia-owl:BasketBallTeam

dbpedia:Michael_Jordan

dbpedia-owl:playsFor

43

RDF Linked Data Representation

tab:cell_01 a tab:ColumnHeader;
    tab:cellLabel "Name"^^xsd:string;
    tab:columnIndex "1"^^xsd:integer;
    tab:valueType dbpedia-owl:BasketballPlayer.

tab:cell_11 a tab:DataCell;
    tab:cellLabel "Michael Jordan"^^xsd:string;
    tab:columnIndex "1"^^xsd:integer;
    tab:rowIndex "1"^^xsd:integer;
    tab:entity dbpedia:Michael_Jordan.

tab:HeaderRelation_12 a tab:TableRelation;
    tab:relFromColumn tab:cell_01;
    tab:relToColumn tab:cell_02;
    tab:relLabel dbpedia-owl:team.
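As an illustration of how such triples might be emitted, the sketch below builds the Turtle text for one data cell with plain string formatting. It follows the `tab:` property names shown on the slide and writes the datatypes with the standard lowercase XSD names (`xsd:string`, `xsd:integer`); the function name and the cell naming scheme `cell_<row><column>` are inferred from the example and should be read as assumptions.

```python
def data_cell_turtle(row, col, label, entity):
    """Serialize one annotated data cell as Turtle (prefixes declared elsewhere)."""
    # Escape backslashes and double quotes so the label stays a valid literal.
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"tab:cell_{row}{col} a tab:DataCell;\n"
        f'    tab:cellLabel "{safe_label}"^^xsd:string;\n'
        f'    tab:columnIndex "{col}"^^xsd:integer;\n'
        f'    tab:rowIndex "{row}"^^xsd:integer;\n'
        f"    tab:entity {entity} .\n"
    )


turtle = data_cell_turtle(1, 1, "Michael Jordan", "dbpedia:Michael_Jordan")
print(turtle)
```

A real system would more likely use an RDF library for serialization; string formatting is used here only to keep the example self-contained.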

44

Human in the loop

[Figure: the TABEL pipeline (Query and Rank → Joint Inference → Generate RDF Linked Data → Verify (optional) → Store / Publish) applied to the example table, with human input points labeled Before, During, and After]

45

Human in the loop – Before

No.  Name            Team          Position        Height
1    Michael Jordan  Chicago       Shooting Guard  1.98
2    Allen Iverson   Philadelphia  Point Guard     1.83
3    Yao Ming        Houston       Center          2.29
4    Tim Duncan      San Antonio   Power Forward   2.11

46

Human in the loop – Before

Team: WomenArtist, BasketBallTeam, City, PopulatedPlace, SportsTeam, …

Michael Jordan: Michael_I_Jordan, Michael_Jordan, Michael_Jackson, Michael_Wodruff, …

Name, Team: livesIn, team, …

Assignments treated as “true values”

Human in the loop – During

47

𝝍𝟑: Team [0.2]

𝝍𝟒: Name, Team [0.1]

WomenArtist, BasketBallTeam, City, SportsTeam, …

48

Human in the Loop – Impact on Joint Inference

𝝍𝟑: Name [BasketballPlayer]

[Class: BasketBallPlayer, 1.0] [Fixed]
[Relation: playsFor, 0.6]

Name column (Michael_I_Jordan, Allen_Iverson, Yao_Ming): Change / No-Change, BasketBallPlayer

𝝍𝟒: Name, Team [playsFor]

[Class: BasketBallPlayer, 0.8]
[Relation: playsFor, 1.0] [Fixed]

Name column (Michael_I_Jordan, Allen_Iverson, Yao_Ming): Change / No-Change, playsFor

R11: Michael Jordan

49

Human in the Loop – Impact on Joint Inference

R11: Chicago [Chicago_Bulls]

Candidate classes: WomenArtist, BasketBallTeam, City, PopulatedPlace, SportsTeam, …

Candidate relations: livesIn, team, …

50

Outline: Why, How, Evaluation, Application, Wrap up

51

Datasets

Dataset       # of tables used in      # of tables used in     Average number of
              Col. and Rel.            Data Cell Annotations   columns, rows
              Annotations
Web_Manual    150                      371                     2, 36
Web_Relation  28                       –                       4, 67
Wiki_Manual   25                       39                      4, 35
Wiki_Links    –                        80                      3, 16

Subset of the IIT-B dataset

Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. 36th VLDB (2010)

52

Ground Truth

Human annotators marked each class and relation as ‘vital’, ‘okay’, or ‘incorrect’

To compute precision, assign scores to the class & relation predicted by the system:
1 – if the class was vital
0.5 – if the class was okay, but could have been better (e.g. Place v/s City)
0 – if it was incorrect

To compute recall, assign a score of 1 if vital or okay, 0 for incorrect

Ground truth for data cell value annotations from the IIT-B dataset
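The scoring scheme can be made concrete with a short sketch. The per-label scores come from the slide; how they are aggregated is not stated, so the sketch assumes precision is the mean precision score over the system's predictions and recall is the summed recall score divided by the number of items that have a ground-truth label. Both denominators are assumptions.

```python
PRECISION_SCORE = {"vital": 1.0, "okay": 0.5, "incorrect": 0.0}
RECALL_SCORE = {"vital": 1.0, "okay": 1.0, "incorrect": 0.0}


def graded_precision(judgments):
    """Mean precision score over the predictions the system made."""
    if not judgments:
        return 0.0
    return sum(PRECISION_SCORE[j] for j in judgments) / len(judgments)


def graded_recall(judgments, num_expected):
    """Summed recall score over num_expected ground-truth items (assumed denominator)."""
    if num_expected == 0:
        return 0.0
    return sum(RECALL_SCORE[j] for j in judgments) / num_expected


# Hypothetical judgments for four predicted column classes, five columns expected.
judgments = ["vital", "okay", "incorrect", "vital"]
print(graded_precision(judgments), graded_recall(judgments, num_expected=5))
```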

Column Header Annotations

53

[Figure: bar chart, % of relevant labels at rank 1, for Web_Manual, Web_Relation, and Wiki_Manual; series: vital, okay; y-axis: percentage (0 to 90). Data labels as extracted: 55.90, 36.17, 60.00, 24.31, 47.87, 18.57]

54

Column Header Annotations

[Figure: % of relevant labels at different ranks; x-axis: rank (1 to 10); y-axis: percentage (0 to 90); series: Web_Manual, Web_Relation, Wiki_Manual]

55

Column Header Annotations

Precision, Recall and F-score at rank 1

Dataset       Precision  Recall  F-score
Web_Manual    0.68       0.49    0.57
Web_Relation  0.60       0.42    0.49
Wiki_Manual   0.69       0.53    0.60
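The nine data labels on this chart can be read as (precision, recall, F-score) triples for Web_Manual, Web_Relation and Wiki_Manual. That reading is an assumption, but it can be checked: the F-score is the harmonic mean of precision and recall, and each reported F-score matches the value computed from its pair to two decimals.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) as read from the chart labels.
reported = {
    "Web_Manual": (0.68, 0.49, 0.57),
    "Web_Relation": (0.60, 0.42, 0.49),
    "Wiki_Manual": (0.69, 0.53, 0.60),
}
for dataset, (p, r, f) in reported.items():
    print(dataset, round(f_score(p, r), 2), f)
```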

56

Column Header Annotations

[Figure: precision v/s recall at ranks 1-10; x-axis: rank (k), 1 to 10; y-axis: 0 to 0.9; series: precision and recall for Web_Manual, Web_Relation, Wiki_Manual]

57

Column Header Annotations

[Figure: bar chart, Semantic Message Passing v/s the rest, F-scores for SMP, IIT-B, and GOOG on Web_Manual and Wiki_Manual; y-axis: 0 to 0.8. Data labels as extracted: 0.57, 0.6, 0.43, 0.56, 0.65, 0.67]

58

Example Column Header Predictions

Column: Constituency
Predicted: N.A.

DBpedia classes [Ranks 2-10]: OfficeHolder, PrimeMinister, Politician, ElectionEvent, AdministrativeRegion, PopulatedPlace, University, EducationalInstitution

Column: Name of Elected M.P.
Predicted: OfficeHolder

DBpedia classes [Ranks 2-10]: ElectionEvent, PrimeMinister, Politician, Country, PopulatedPlace, Settlement, University, EducationalInstitution

59

Relation Annotations

[Figure: bar chart, % of relevant relations at rank 1, for Web_Manual_dbp, Web_Manual_yago, Web_Relation_dbp, Web_Relation_yago, Wiki_Manual_dbp, Wiki_Manual_yago; series: vital, okay; y-axis: percentage (0 to 80). Data labels as extracted: 34.43, 33.33, 50.00, 50.00, 53.85, 66.67, 3.28, 6.25]

60

Relation Annotations

[Figure: % of relevant relations at ranks 1-10; x-axis: rank (1 to 10); y-axis: percentage (0 to 80); series: Web_Manual, Web_Relation, Wiki_Manual, each against DBpedia (_dbp) and Yago (_yago)]

61

Relation Annotations

[Figure: bar chart, Semantic Message Passing v/s the rest, F-scores for SMP and IIT-B on Web_Manual, Wiki_Manual, and Web_Relation; y-axis: 0 to 1.00. Data labels as extracted: 0.89, 0.86, 0.97, 0.51, 0.63, 0.68]

62

Example Relation Predictions

Column Pair: President – Birth state
Predicted: N.A.

DBpedia rels [Ranks 2-10]: location, deathPlace, locatedInArea, birthPlace, isPartOf, largestCity, almaMater, region, state

Column Pair: Name of Elected M.P. – Party Affiliation
Predicted: party

DBpedia rels [Ranks 2-8]: affiliation, otherParty, primeMinister, deathPlace, birthPlace, region, NA

63

Data Cell Value Annotations

% of correctly linked entities

Dataset      % correct
Wiki_Link    75.89
Web_Manual   63.07
Wiki_Manual  67.42

64

How long did it run ?

[Figure: number of variables that received a “change” message at the end of an iteration; x-axis: iteration number (0 to 12); y-axis: number of variables (0 to 300); each line represents a table]

65

Literals – Experimental Setup

Subset of 16 tables [17 literal value columns] from the Wiki_Link Dataset

Generate property candidate set by querying against NumKB

Manually annotated each literal column with an appropriate DBpedia property

66

Header Cell Annotations for Literals

Percentage of correct properties at ranks 1-10 (two charts)

Rank          1      2      3      4     5     6     7     8     9     10
First chart   35.29  23.53  17.65  5.88  5.88  0.00  0.00  5.88  0.00  0.00
Second chart  64.71  23.53  0.00   0.00  5.88  0.00  0.00  5.88  0.00  0.00

67

Human in the loop – Experimental Setup

Subset of 11 tables from the Wiki_Link dataset

User feedback: Correct column header class [1 column in 9 tables and 2 for the remaining 2 tables]

The rest of the experimental setup is the same.

68

Data Cell Annotations

[Figure: Human in the Loop (HIL) v/s No Human in the Loop; x-axis: table number (1 to 11); y-axis: % of correctly annotated data cells (0 to 100); series: No HIL, HIL]

         Correct Entities  Total  %
HIL      286               402    71.14
No-HIL   245               402    60.95

69

Outline: Why, How, Evaluation, Application, Wrap up

70

Interpreting Medical Tables as Linked Data for Generating Meta-Analysis Reports

71

TABEL – TABle Extracted as Linked Data

AAD DECODE

[Figure: TABEL pipeline: Pre-processing modules → Query and Rank → Joint Inference → Generate RDF Linked Data → Verify (optional) → Store / Publish, applied to the example table below]

Name            Team          Position        Height
Michael Jordan  Chicago       Shooting Guard  1.98
Allen Iverson   Philadelphia  Point Guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power Forward   2.11

Your module here!

Normalize

Varish Mulwad, Tim Finin and Anupam Joshi, "Interpreting Medical Tables as Linked Data to Generate Meta–Analysis Reports", In 15th IEEE Int. Conf. on Information Reuse and Integration (IRI 2014), San Francisco, USA, Aug. 2014.

72

Preprocessing – Normalize

73

Preprocessing – Normalize

Patients with Secondary Thrombosis

N = 146

no. --> 49; % -->33.6

no. (%)

Smoker

Split header cells into Query String and Metadata

Normalize data cells; identify types or units
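The two normalization steps can be sketched with regular expressions. The header text and the values 146, 49 and 33.6 come from the slide; the raw cell form "49 (33.6)" under a "no. (%)" column is an assumption about how the original table cell looked, and the function names are hypothetical.

```python
import re


def split_header(header):
    """Split a header cell into a query string and metadata such as N."""
    match = re.search(r"\bN\s*=\s*(\d+)", header)
    if not match:
        return header.strip(), {}
    query = (header[:match.start()] + header[match.end():]).strip()
    return query, {"N": int(match.group(1))}


def normalize_cell(cell):
    """Split a 'count (percent)' data cell into typed values."""
    match = re.fullmatch(r"\s*(\d+)\s*\(\s*(\d+(?:\.\d+)?)\s*\)\s*", cell)
    if not match:
        # Unknown layout: keep the raw string rather than guess.
        return {"raw": cell.strip()}
    return {"no.": int(match.group(1)), "%": float(match.group(2))}


print(split_header("Patients with Secondary Thrombosis N = 146"))
print(normalize_cell("49 (33.6)"))
```

Cells that do not match the assumed pattern are passed through unchanged, so a later module can decide how to handle them.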

74

Query – Candidate Classes* [DBpedia]

Hypertension

Initial ranking:
(1) Idiopathic intracranial hypertension
(2) Pulmonary hypertension
(3) Hypertension

Re-rank – Classifier (String Similarity, Popularity)

Re-ranked:
(1) Hypertension
(2) Pulmonary hypertension
(3) Idiopathic intracranial hypertension

Also evaluated against SNOMED CT & UMLS

75

Query – Candidate Classes [Hybrid]

Hypertension

(1) Hypertension
(2) Pulmonary hypertension
(3) Idiopathic intracranial hypertension

No results? → SNOMED CT API

76

Modeling Medical Tables as RDF

[Figure: RDF model for medical tables. A PatientGroup has numberOfIndividuals (xsd:integer, e.g. 146) and hasGroupAttribute (owl:Thing, e.g. umls:Secondary_Thrombosis). A Value has hasType (xsd:string, e.g. %) and hasRawValue (xsd:double, e.g. 33.6)]

77

Interactive tool to generate Meta – Analysis reports

User interface to define meta-analysis parameters and select studies

Tool automatically generates relevant SPARQL queries

78

Evaluation

79

Header Cell Annotations

[Figure: bar chart, distribution of header cell concepts at different ranks, for SNOMED CT, UMLS, HYBRID, and DBPEDIA; x-axis rank bins: 1, 2-5, 6-10, 11-25, 26-101, NF; y-axis: percentage (0 to 70). Data labels as extracted: 29.51, 60.66, 59.02, 29.51, 10.66, 8.2, 10.66, 20.49, 4.1, 2.46, 4.92, 3.28, 3.28, 2.46, 4.1, 4.92, 18.85, 4.1, 9.02, 16.39, 33.61, 22.13, 12.3, 25.41]

NF: Correct concept not found in the candidate set

Dataset: 7 tables (122 header cells)

80

Retrieval (Find) Evaluation Experimental Setup

• Generated Linked Data from four tables

• Executed retrieval SPARQL queries to find tables that included the correlation between venous thrombosis and four different cardiovascular risk factors

• Average Precision: 0.79; Average Recall: 0.75

81

Outline: Why, How, Evaluation, Application, Wrap up

82

Conclusions

I claimed:

“It is possible to generate high quality linked data from tables by jointly inferring the semantics of column headers, values (string and literal) in table cells, and relations between columns augmented with background knowledge from open data sources such as the Linked Open Data cloud.”

83

Conclusions

It is possible to generate high quality linked data from tables by jointly inferring the semantics

TABEL jointly inferred the semantics; thorough evaluation showed promising results

… the semantics of column headers, values (string and literal) in table cells, and relations between columns

A novel technique to generate candidate properties from literal values

84

Conclusions

It is possible to generate high quality linked data from tables

Tables ontology to represent the inferred semantics

Demonstrated domain independence and extensibility and support for tables with different structures

Explored different models for Human in the loop

85

Future Work

Schema + Data driven approach

Build on the work on inferring literals; NumKB

Further develop Human in the loop

Tool to generate meta-analysis reports

86

Acknowledgements

Dr. Tim Finin

Dr. Anupam Joshi

Dr. Tim Oates

Dr. Yun Peng

Dr. L V Subramaniam

Dr. Indrajit Bhattacharya

Lab mates & Friends!

Thank You!

Our papers on this research topic have garnered 93 citations!