Machines learnt how to understand tables. What happens next will shock you.

Welcome to the PhD dissertation defense of Varish Mulwad!

TABEL – Domain Independent and Extensible Framework to Infer the Semantics of Tables

Varish Mulwad

Ph.D. Dissertation Defense

Adviser: Dr. Tim Finin

January 8, 2015

Zareen Syed, Tim Finin, Varish Mulwad, and Anupam Joshi, "Exploiting a Web of Semantic Data for Interpreting Tables", In 2nd Web Science Conference (WebSci 2010), Raleigh, NC, USA, Apr. 2010

Semantics of a Table

Name | Team | Position | Height
Michael Jordan | Chicago | Shooting Guard |
Allen Iverson | Philadelphia | Point Guard | 1.83
Yao Ming | Houston | Center | 2.29
Tim Duncan | San Antonio | Power Forward |

Column class: NationalBasketballAssociationTeams

Cell entity: http://dbpedia.org/resource/Allen_Iverson

Relation between columns: playsFor

Map literals as property values

Semantics of a Table

Linked

tab:cell_01 a tab:ColumnHeader ;
    tab:cellLabel "Name"^^xsd:string ;
    tab:columnIndex "1"^^xsd:integer ;
    tab:valueType dbpedia-owl:BasketballPlayer .

tab:cell_11 a tab:DataCell ;
    tab:cellLabel "Michael Jordan"^^xsd:string ;
    tab:columnIndex "1"^^xsd:integer ;
    tab:rowIndex "1"^^xsd:integer ;
    tab:entity dbpedia:Michael_Jordan .

All this in a completely automated way!

TABEL – Domain Independent & Extensible Framework to Infer the Semantics of Tables

Thesis Statement

It is possible to generate high quality linked data from tables by jointly inferring the semantics of column headers, values (string and literal) in table cells, and relations between columns augmented with background knowledge from open data sources such as the Linked Open Data cloud.

Contributions

o A probabilistic graphical model to jointly infer the semantics, plus a novel inference technique, Semantic Message Passing

o A proof-of-concept, user-interactive application to generate meta-analysis reports automatically

o Development and exploration of the Human in the Loop paradigm

o A novel technique to generate candidate properties from literal values

Outline: Why, How, Evaluation, Application, Wrap up

Tables are everywhere!

154 million high quality relational tables on the web

~400,000 CSVs on data.gov

Healthcare, Financial and other domains

The Semantic Web & the Web

Spreadsheets/CSVs to RDF/OWL

Evidence Based Medicine

Combine: all studies that compare organic milk v/s grass-fed cow milk

Produce unified report: Organic milk is better!

Meta – Analysis report

Correlation between cardiovascular risk factors and venous thrombosis

Duration of proton pump inhibitors as first line of treatment for Helicobacter pylori eradication

Tables are valuable

Meta – Analysis: Today

Correlation between cardiovascular risk factors and venous thrombosis [1]

Initial search >> 1949 studies

Final # of studies selected >> 22!

[1] W. Ageno, C. Becattini, T. Brighton, R. Selby, and P. W. Kamphuisen, “Cardiovascular risk factors and venous thromboembolism: a meta-analysis,” Circulation, vol. 117, no. 1, pp. 93–102, 2008.

• Keyword based search

• Initial search yields large # of results

• Manually filter out irrelevant results

Not restricted to healthcare …

Related Work

Databases & Spreadsheets to RDF:

Existing solutions: Largely manual or semi-automatic

Number of ontologies, classes, relations

Automatic solutions: “Row as RDF node”; local mappings

No links to existing classes, properties, entities

Related Work

Semantics of Table:

Infer semantics for only parts of the table [header cells; relation between headers; data cell values or a combination of the two]

Fail to generate RDF Linked Data representation

Poor support for literals

Related Work

Limaye et al. [Sep. 2010] [Soumen Chakrabarti’s group @ IIT-B]

RDF Linked Data representation; Literal values

Knoblock et al. [May 2012] [Craig Knoblock’s group @ USC – ISI]

Largely focuses on header cell semantics & relation between headers

Requires initial user input before automatic predictions from the system

Venetis et al. [Sep. 2011] [Alon Halevy’s group @ Google]

Column header and Relation semantics

Literal values; RDF Linked Data

What TABEL brings to the “table”

Infers the complete semantics of a table

Generates an RDF Linked Data representation

Supports tables with different structures over a variety of domains [medical tables]

Incorporates user feedback to improve the quality of inferred semantics

Infers the semantics of literal values* [numerical values]

TABEL – TABle Extracted as Linked Data

DECODE AAD

Pre-processing modules

Query and Rank

Generate RDF Linked

Verify (optional)

Store / Publish

Joint Inference

Michael Jordan Chicago Shooting Guard

Tim Duncan San Antonio Power Forward 2.11

Your module here!

Varish Mulwad, Tim Finin and Anupam Joshi, “A Domain Independent Framework for Extracting Linked Semantic Data from Tables”, In Search Computing, ISBN 978-3-642-34212-7, vol. 7538, 2012.

Query – Candidate Entities

Chicago + Context {Team} + Context {Michael Jordan, Shooting Guard, 1.98}

1. Chicago
2. Judy_Chicago
3. Chicago_Bulls

1. Chicago_Bulls
2. Chicago
3. Judy_Chicago

1. Chicago
2. Judy_Chicago
3. Chicago_Bulls

Re-rank – Classifier (String Similarity, Popularity)

Varish Mulwad, Tim Finin, Zareen Syed and Anupam Joshi, “Using linked data to interpret tables”, In 1st Int. Workshop on Consuming Linked Data, held at the 9th Int. Semantic Web Conf. (ISWC 2010), Shanghai, China, Nov. 2010.
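
The re-ranking step above can be illustrated with a minimal sketch. It assumes a simple weighted sum of the two features named on the slide; the slides use a trained classifier, so the weights, the popularity numbers, and the function names `string_similarity` and `rerank` are all hypothetical.

```python
from difflib import SequenceMatcher


def string_similarity(mention: str, entity: str) -> float:
    """Similarity in [0, 1] between a cell string and an entity name."""
    normalized = entity.replace("_", " ").lower()
    return SequenceMatcher(None, mention.lower(), normalized).ratio()


def rerank(mention, candidates, w_sim=0.7, w_pop=0.3):
    """Order candidates by a weighted sum of similarity and popularity.

    `candidates` maps entity name -> popularity in [0, 1] (hypothetical values).
    The weights are illustrative stand-ins for a trained classifier.
    """
    def score(entity):
        return w_sim * string_similarity(mention, entity) + w_pop * candidates[entity]

    return sorted(candidates, key=score, reverse=True)


# Popularity values below are invented for illustration only.
candidates = {"Chicago": 0.9, "Judy_Chicago": 0.2, "Chicago_Bulls": 0.8}
ranked = rerank("Chicago", candidates)
print(ranked)
```

With these invented numbers the exact string match ranks first, which suggests why string and popularity features alone may not be enough and why the table context matters in the later joint inference step.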

Query – Candidate Classes

Column cells (instances): Chicago, Philadelphia, Houston, San Antonio

Classes of the candidate instances, one set per cell:

{Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams}

{Place, PopulatedPlace, Film, NationalBasketballAssociationTeams, … , …}

{…}

Candidate classes for the column: Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams, PopulatedPlace, Film ….
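
One way to read this slide is that the candidate classes for a column are the union of the classes of every cell's candidate entities. The sketch below assumes exactly that; the mapping of the second class set to Philadelphia is a guess, and `candidate_classes_for_column` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical lookup: classes of the candidate entities retrieved for each cell.
cell_candidate_classes = {
    "Chicago": {"Place", "City", "WomenArtist", "LivingPeople",
                "NationalBasketballAssociationTeams"},
    "Philadelphia": {"Place", "PopulatedPlace", "Film",
                     "NationalBasketballAssociationTeams"},
}


def candidate_classes_for_column(per_cell):
    """Union of the per-cell class sets, with how many cells support each class."""
    support = Counter()
    for classes in per_cell.values():
        support.update(classes)
    return support


column_classes = candidate_classes_for_column(cell_candidate_classes)
print(sorted(column_classes))
```

Keeping the per-class support count (rather than a plain set) is a design choice of this sketch; it makes a later ranking step easier.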

Query – Candidate Relations

Column 1: Michael Jordan, Allen Iverson, Yao Ming, Tim Duncan

Column 2: Chicago, Philadelphia, Houston, San Antonio

Candidate entities: 1. Michael_Jordan 2. Michael_I_Jordan 3. Jordan_River

Relations per row: playsFor, livesIn, …

…… ……

Candidate relations for the column pair: playsFor, livesIn, born, …….

Query – Literals* [numerical data]

String column: Chicago, Philadelphia, Houston, San Antonio

Candidate classes: Place, City, WomenArtist, LivingPeople, NationalBasketballAssociationTeams, PopulatedPlace, Film

Query – Literals

Literal column: 320,900; 183,120; 229,198; 211,123

Population

Income

Height

Person

BasketBallPlayer(?)

NumKB: Encodes distributional features for Linked Data properties

Allows query using literal values (and optionally property name)

Provides information on property domains

Example literal values: 250,000; 1.95

Identify property domains (example: seatingCapacity)

Get Instances --> Get Instance Types --> Order by frequency

Instances: Queen's_Film_Theatre, Restaurant_Gordon_Ramsay, M&T_Bank_Stadium

Types: Theatre, Stadium, Restaurant

1. seatingCapacity_Stadium [1]
2. seatingCapacity_Theatre [0.70]
3. seatingCapacity_Restaurant [0.57]

Duplet score: 1/

Identify property domain duplet values

Property, domain [seatingCapacity,Stad

Get Property Values --> Sort; Trim front & back tails; Compute µ & σ --> Compute Ranges

1777720767500

µ - σ : µ + σ
µ - 2σ : µ + 2σ

-212 : 25743 [86.66 %]
-13190 : 38721 [6.56 %]
-26168 : 51699 [4.67 %]
-39146 : 64677 [2.08 %]
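
The range computation above (sort, trim the tails, compute µ and σ, then build µ ± kσ ranges) might be sketched as follows. The trim fraction, the use of the population standard deviation, and the sample values are assumptions; the slides do not state them.

```python
from statistics import mean, pstdev


def sigma_ranges(values, trim_fraction=0.05, max_k=4):
    """Sort, trim both tails, then return (mu, sigma, [(low, high), ...]).

    trim_fraction is an assumed parameter; the slides do not say how much
    of each tail is trimmed.
    """
    ordered = sorted(values)
    cut = int(len(ordered) * trim_fraction)
    trimmed = ordered[cut:len(ordered) - cut] if cut else ordered
    mu, sigma = mean(trimmed), pstdev(trimmed)
    ranges = [(mu - k * sigma, mu + k * sigma) for k in range(1, max_k + 1)]
    return mu, sigma, ranges


def band_percentages(values, ranges):
    """Percentage of values that first fall inside each successive range."""
    counts = [0] * len(ranges)
    for v in values:
        for i, (low, high) in enumerate(ranges):
            if low <= v <= high:
                counts[i] += 1
                break
    return [100.0 * c / len(values) for c in counts]


# Made-up seating capacities, not NumKB data.
capacities = [120, 150, 200, 250, 300, 350, 400, 450, 500, 90000]
mu, sigma, ranges = sigma_ranges(capacities, trim_fraction=0.1)
shares = band_percentages(capacities, ranges)
print(mu, round(sigma, 2), shares)
```

Trimming before computing µ and σ keeps a single extreme value (90000 here) from stretching every range.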

Query – Literals

1.98, height

NumKB

1. height
2. diameter
3. minimumElevation

minRange < 1.98 < maxRange

Fuzzy string match (ColHeaderString, PropertyName)
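
A literal query of this kind could look like the sketch below: keep the properties whose range contains the value, then order them by a fuzzy match against the column header. The `NUMKB` contents are invented, and `difflib` stands in for whatever fuzzy matcher the system actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical NumKB rows: property -> (minRange, maxRange). Values are invented.
NUMKB = {
    "height": (1.2, 2.4),
    "diameter": (0.5, 50.0),
    "minimumElevation": (-400.0, 5000.0),
    "populationTotal": (1000.0, 10_000_000.0),
}


def candidate_properties(value, header=None, kb=NUMKB):
    """Properties whose range contains `value`, best header match first."""
    hits = [p for p, (low, high) in kb.items() if low < value < high]
    if header is None:
        return hits

    def match(prop):
        return SequenceMatcher(None, header.lower(), prop.lower()).ratio()

    return sorted(hits, key=match, reverse=True)


props = candidate_properties(1.98, header="height")
print(props)
```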

Graphical Model for Tables

[Figure: graphical model of a table, with column header variables C1, C2, C3 connected to the row values (instances) of each column, e.g. Chicago, Philadelphia, Houston, San Antonio]

Parameterized Graphical Model

[Figure: factor graph over the column headers C1, C2, C3 and their row values]

Variable nodes: column headers and row values

𝝍𝟑: function that captures the affinity between the column headers and row values

𝝍𝟒: captures interaction between row values

𝝍𝟓: captures interaction between column headers

Semantic Message Passing

[Figure: factor nodes 𝝍𝟑, 𝝍𝟒 and 𝝍𝟓 exchange messages with the variable nodes. Column headers: C1: [BasketballPlayer], C2: [NBATeam], C3: [BasketBallPositions]. Row values shown: Michael_I_Jordan, Chicago_Bulls, Yao_Ming, Allen_Iverson. Messages carry a class (BasketballPlayer) or a relation (playsFor) and are marked “Change” or “No Change”.]

Semantic Message Passing

[V] Pick new

[V] Send current values

[F] Identify Outliers

[F] Send semantics

V – Variable Nodes

F – Factor Nodes

Semantically Aware Factor

Varish Mulwad, Tim Finin and Anupam Joshi, "Semantic Message Passing for Generating Linked Data from Tables", In 12th Int. Semantic Web Conf. (ISWC 2013), Sydney, Australia, Oct. 2013.

𝝍𝟑 – Column Header & Row Value Agreement

𝝍𝟑 [Michael_I_Jordan, Allen_Iverson, Yao_Ming]

Column header: Name

Candidate classes: GeoPopulatedPlace, BasketBallPlayer, ArtWork

Row values: Michael_I_Jordan, Allen_Iverson, Yao_Ming

Classes of the row values: Athlete, BasketballPlayer, ArtificialIntelligenceResearchers

1. BasketBallPlayer
2. GeoPopulatedPlace
….

Top Class: BasketBallPlayer

topClassScore =

𝝍𝟑 – Column Header & Row Value Agreement

topClassScore < threshold_class ?

Yes: Update Column Header Annotation = “No-Annotation”

No: Use the topClass (BasketBallPlayer) in the Message Passing process; send topClassScore as confidence score

Row values Michael_I_Jordan, Allen_Iverson, Yao_Ming each receive “Change” or “No-Change”
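
The behaviour of the 𝝍𝟑 factor described on these two slides might be sketched as below. The slide's formula for topClassScore is cut off, so the score used here (fraction of row entities that carry the top class) is an assumption, as are the threshold value and the function name `psi3`.

```python
from collections import Counter

NO_ANNOTATION = "No-Annotation"


def psi3(entity_classes, threshold_class=0.5):
    """Return (column annotation, score, per-entity messages).

    entity_classes maps each row value's current entity to its set of classes.
    The score is the fraction of entities that carry the top class; this is
    an assumed definition, not the one from the slides.
    """
    votes = Counter()
    for classes in entity_classes.values():
        votes.update(classes)
    top_class, count = votes.most_common(1)[0]
    top_score = count / len(entity_classes)
    if top_score < threshold_class:
        return NO_ANNOTATION, top_score, {}
    messages = {
        entity: ("No-Change" if top_class in classes else "Change",
                 top_class, top_score)
        for entity, classes in entity_classes.items()
    }
    return top_class, top_score, messages


# Class sets are illustrative, loosely following the example on the slide.
column = {
    "Michael_I_Jordan": {"ArtificialIntelligenceResearchers"},
    "Allen_Iverson": {"BasketballPlayer"},
    "Yao_Ming": {"Athlete", "BasketballPlayer"},
}
annotation, score, messages = psi3(column)
print(annotation, round(score, 2), messages["Michael_I_Jordan"][0])
```

Here the outlier Michael_I_Jordan is told to change while the two agreeing entities are left alone, which mirrors the "identify outliers, send semantics" steps listed earlier.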

𝝍𝟒 – Relation between Columns

𝝍𝟒 [Michael_I_Jordan, Chicago_Bulls], [Allen_Iverson, Philadelphia_76ers], [Yao_Ming, Houston_Rockets]

Column 1 row values: Michael_I_Jordan, Allen_Iverson, Yao_Ming

Column 2 row values: Chicago_Bulls, Philadelphia_76ers, Houston_Rockets

Relations per row pair: playsFor, livesIn, …; No – rel; playsFor

1. playsFor
2. livesIn
….

Top relation: playsFor

topRelScore =

𝝍𝟒 – Relation between Columns

topRelScore < threshold_relation ?

Yes: Update Rel Annotation = “No-Annotation”

No: Use the topRel (playsFor) in the Message Passing process; send topRelScore as confidence

Row values Michael_I_Jordan, Allen_Iverson, Yao_Ming each receive “Change” or “No-Change”

Variable Node Update

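Messages received by the variable node “Michael Jordan” from 𝝍𝟑 and 𝝍𝟒 (neighbouring nodes labelled in the figure: Team, Chicago, Shooting Guard):

Change [BasketBallPlayer, 0.8]

Change [playsFor, 0.6]

No-Change [0.55]

avgChangeConfidenceScore [= 0.70] > avgNoChangeConfidenceScore [0.55] ?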

Variable Node Update

[Class: BasketBallPlayer, 0.8] [Relation: playsFor, 0.6]

Michael Jordan

Constraints: (1) BasketBallPlayer (2) playsFor

Candidate entities: Michael_I_Jordan, …, Michael_Jordan, …

Satisfy constraints: [1, 2, 3]
Satisfy constraints: [1, 2]
Satisfy constraints: [1, 3]
Satisfy constraints: [2, 3]
Satisfy constraints: [1]
Satisfy constraints: [2]
Satisfy constraints: [3]
Choose “No Annotation”
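
The variable node update could be sketched in two parts: decide whether to change by comparing average confidences, then pick the first candidate that satisfies the largest subset of constraints, in the order listed above. Representing constraints and entity facts as plain strings is a simplification of this sketch, and both function names are hypothetical.

```python
from itertools import combinations


def should_change(messages):
    """messages: list of (kind, confidence); kind is 'Change' or 'No-Change'."""
    change = [c for kind, c in messages if kind == "Change"]
    keep = [c for kind, c in messages if kind == "No-Change"]
    if not change:
        return False
    avg_change = sum(change) / len(change)
    avg_keep = sum(keep) / len(keep) if keep else 0.0
    return avg_change > avg_keep


def pick_entity(candidates, constraints):
    """Pick the first candidate satisfying the largest subset of constraints.

    candidates: ranked list of (entity, set of facts the entity satisfies).
    constraints: ordered list of required facts (class first, then relations).
    For three constraints the subsets are tried in the order
    [1,2,3], [1,2], [1,3], [2,3], [1], [2], [3].
    """
    for size in range(len(constraints), 0, -1):
        for subset in combinations(constraints, size):
            for entity, facts in candidates:
                if set(subset) <= facts:
                    return entity
    return "No Annotation"


# Confidence values follow the slide; the entity facts are illustrative.
received = [("Change", 0.8), ("Change", 0.6), ("No-Change", 0.55)]
ranked_candidates = [
    ("Michael_I_Jordan", {"Scientist"}),
    ("Michael_Jordan", {"BasketBallPlayer", "playsFor"}),
]
if should_change(received):
    chosen = pick_entity(ranked_candidates, ["BasketBallPlayer", "playsFor"])
else:
    chosen = ranked_candidates[0][0]
print(chosen)
```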

Halting Condition

Ideal Case – No variable node receives a ‘CHANGE’ message

Practical Case – Fraction of variable nodes that receive ‘CHANGE’ message < threshold_change
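
Both halting cases can be covered by one check, sketched below. The default threshold is an arbitrary placeholder, not a value from the slides, and `should_halt` is a hypothetical name.

```python
def should_halt(messages_per_node, threshold_change=0.1):
    """Stop when the share of nodes that got a 'Change' message is small enough.

    messages_per_node: for each variable node, the list of message kinds it
    received in the current iteration. The ideal case (no 'Change' at all)
    is covered because a share of 0 is below any positive threshold.
    """
    if not messages_per_node:
        return True
    changed = sum(1 for kinds in messages_per_node if "Change" in kinds)
    return changed / len(messages_per_node) < threshold_change


print(should_halt([["No-Change"], ["No-Change", "No-Change"]]))
```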

Tables Ontology

dbpedia-owl:BasketBallTeam

dbpedia:Michael_Jordan

dbpedia-owl:playsFor

RDF Linked Data Representation

tab:cell_01 a tab:ColumnHeader ;
    tab:cellLabel "Name"^^xsd:string ;
    tab:columnIndex "1"^^xsd:integer ;
    tab:valueType dbpedia-owl:BasketballPlayer .

tab:cell_11 a tab:DataCell ;
    tab:cellLabel "Michael Jordan"^^xsd:string ;
    tab:columnIndex "1"^^xsd:integer ;
    tab:rowIndex "1"^^xsd:integer ;
    tab:entity dbpedia:Michael_Jordan .

tab:HeaderRelation_12 a tab:TableRelation ;
    tab:relFromColumn tab:cell_01 ;
    tab:relToColumn tab:cell_02 ;
    tab:relLabel dbpedia-owl:team .

Human in the loop

[Figure: the TABEL pipeline (AAD DECODE, Query and Rank, Joint Inference, Generate RDF Linked, Verify (optional), Store / Publish), drawn twice to mark the two points of user feedback: Before and During]

Human in the loop – Before

No. | Name | Team | Position | Height
1 | Michael Jordan | | |
2 | Allen Iverson | Philadelphia | Point Guard | 1.83
3 | Yao Ming | Houston | Center | 2.29
4 | Tim Duncan | San Antonio | Power Forward |

Human in the loop – Before

Candidate classes: WomenArtist, BasketBallTeam, City, PopulatedPlace, SportsTeam, …

Michael Jordan

Candidate entities: Michael_I_Jordan, Michael_Jordan, Michael_Jackson, Michael_Wodruff, …

Name, Team

Candidate relations: livesIn, team, …

Assignments treated as “true values”

Human in the loop – During

𝝍𝟑: Team [0.2]

𝝍𝟒: Name, Team [0.1]

Candidate classes: WomenArtist, BasketBallTeam, City, SportsTeam, …

Human in the Loop – Impact on Joint Inference

𝝍𝟑 Name [BasketballPlayer]

Row values Michael_I_Jordan, Allen_Iverson, Yao_Ming each receive “Change” or “No-Change” for BasketBallPlayer

Michael Jordan: [Class: BasketBallPlayer, 1.0] [Fixed], [Relation: playsFor, 0.6]

𝝍𝟒 Name, Team [playsFor]

Row values Michael_I_Jordan, Allen_Iverson, Yao_Ming each receive “Change” or “No-Change” for playsFor

Michael Jordan: [Class: BasketBallPlayer, 0.8], [Relation: playsFor, 1.0] [Fixed]

Human in the Loop – Impact on Joint Inference

R11 Chicago [Chicago_Bulls]

Candidate classes: WomenArtist, BasketBallTeam, City, PopulatedPlace, SportsTeam, …

Candidate relations: livesIn, team, …

Datasets

Dataset | # of tables used in Col. and Rel. Annotations | # of tables used in Data Cell Annotations | Average number of columns, rows
Web_Manual | 150 | 371 | 2, 36
Web_Relation | 28 | – | 4, 67
Wiki_Manual | 25 | 39 | 4, 35
Wiki_Links | – | 80 | 3, 16

Subset of the IIT-B dataset

Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. 36th VLDB (2010)

Ground Truth

Human annotators marked each class, relation as ‘vital’, ‘okay’, ‘incorrect’

To compute precision, assign scores to class & relation predicted by the system:

1 – if the class was vital
0.5 – if the class was okay, but could have been better (e.g. Place v/s City)
0 – if it was incorrect

To compute recall, assign a score of 1 if vital or okay, 0 for incorrect

Ground truth for data cell value annotations from the IIT-B dataset
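
The scoring scheme above can be turned into a small calculation. The per-label scores come from the slide; the denominators (number of predictions for precision, number of expected annotations for recall) and the harmonic-mean F-score are assumptions of this sketch.

```python
PRECISION_SCORE = {"vital": 1.0, "okay": 0.5, "incorrect": 0.0}
RECALL_SCORE = {"vital": 1.0, "okay": 1.0, "incorrect": 0.0}


def precision_recall(judgements, num_expected):
    """judgements: annotator label for each prediction the system made.

    num_expected: number of columns (or column pairs) with a relevant label
    in the ground truth. Using it as the recall denominator is an assumption;
    the slides only give the per-prediction scores.
    """
    if not judgements or num_expected == 0:
        return 0.0, 0.0, 0.0
    precision = sum(PRECISION_SCORE[j] for j in judgements) / len(judgements)
    recall = sum(RECALL_SCORE[j] for j in judgements) / num_expected
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


# Four predictions judged by annotators, five annotations expected (made-up data).
p, r, f = precision_recall(["vital", "okay", "incorrect", "vital"], num_expected=5)
print(p, r, round(f, 4))
```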

Column Header Annotations

[Chart: % of relevant labels at rank 1 for Web_Manual, Web_Relation and Wiki_Manual]

[Chart: % of relevant labels at different ranks (1 to 10) for Web_Manual, Web_Relation and Wiki_Manual]

[Chart: Precision, Recall and F-score at rank 1 for Web_Manual, Web_Relation and Wiki_Manual; legible values: 0.53, 0.57]

[Chart: Precision v/s Recall at ranks (k) 1 to 10 for Web_Manual, Web_Relation and Wiki_Manual]

Semantic Message Passing v/s the rest

[Chart: F-scores of GOOG and SMP on Web_Manual and Wiki_Manual; legible values: 0.57, 0.6, 0.65, 0.67]

Example Column Header Predictions

Column: Constituency

Predicted: N.A.

DBpedia classes [Ranks 2-10]: OfficeHolder, PrimeMinister, Politician, Election, Event, AdministrativeRegion, PopulatedPlace, University, EducationalInstitution

Column: Name of Elected M.P.

Predicted: OfficeHolder

DBpedia classes [Ranks 2-10]: Election, Event, PrimeMinister, Politician, Country, PopulatedPlace, Settlement, University, EducationalInstitution

Relation Annotations

[Chart: % of relevant relations at rank 1 for Web_Manual, Web_Relation and Wiki_Manual; legible values: 34.43, 33.33, 50.00, 50.00, 53.85, 66.67]

[Chart: % of relevant relations at ranks 1 to 10; series: Web_Manual_dbp, Web_Manual_yago, Web_Relation_dbp, Web_Relation_yago, Wiki_Manual_dbp, Wiki_Manual_yago]

Semantic Message Passing v/s the rest

[Chart: F-scores compared with IIT-B on Web_Manual, Wiki_Manual and Web_Relation (DBpedia); legible values: 1.00, 0.89, 0.86, 0.63, 0.68]

Example Relation Predictions

Column Pair: President – Birth state

Predicted: N.A.

DBpedia rels [Ranks 2-10]: location, deathPlace, locatedInArea, birthPlace, isPartOf, largestCity, almaMater, region, state

Column Pair: Name of Elected M.P. – Party Affiliation

Predicted: party

DBpedia rels [Ranks 2-8]: affiliation, otherParty, primeMinister, deathPlace, birthPlace, region, NA

Data Cell Value Annotations

[Chart: % of correctly linked entities for Wiki_Link, Web_Manual and Wiki_Manual; legible values: 80.00, 75.89, 63.07, 67.42]

How long did it run?

[Chart: number of variables that received a “change” message at the end of an iteration, plotted against the iteration number (0 to 12); each line represents a table]

Literals – Experimental Setup

Subset of 16 tables [17 literal value columns] from the Wiki_Link Dataset

Generate property candidate set by querying against NumKB

Manually annotated each literal column with an appropriate DBpedia property

Header Cell Annotations for Literals

[Charts: percentage of correct properties at ranks 1-10; legible values in the first chart: 5.88, 5.88, 0.00, 0.00; in the second chart: 70.00, 64.71, 0.00, 0.00]

Human in the loop – Experimental Setup

Subset of 11 tables from the Wiki_Link dataset

User feedback: Correct column header class [1 column in 9 tables and 2 for the remaining 2 tables]

The rest of the experimental setup is the same.

Data Cell Annotations

Human in the Loop (HIL) v/s No Human in the Loop

[Chart: HIL v/s No HIL for table numbers 1 to 11, on a scale of 0.00 to 100.00]

 | Correct Entities | Total | %
HIL | 286 | 402 | 71.14
No – HIL | 245 | 402 | 60.95

Interpreting Medical Tables as Linked Data for Generating Meta–Analysis Reports

TABEL – TABle Extracted as Linked Data

AAD DECODE

Pre-processing modules

Query and Rank

Generate RDF Linked

Verify (optional)

Store / Publish

Joint Inference

Your module here!

Normalize

Varish Mulwad, Tim Finin and Anupam Joshi, "Interpreting Medical Tables as Linked Data to Generate Meta–Analysis Reports", In 15th IEEE Int. Conf. on Information Reuse and Integration (IRI 2014), San Francisco, USA, Aug. 2014.

Preprocessing – Normalize

Patients with Secondary Thrombosis

N = 146

no. --> 49; % -->33.6

no. (%)

Smoker

Split header cells into Query String and Metadata

Normalize data cells; identify types or units
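
The two normalization steps could be sketched as follows. The raw cell text "49 (33.6)" is an assumed input format inferred from the "no. (%)" sub-header and the normalized output shown above; `split_header` and `normalize_cell` are hypothetical helper names.

```python
import re


def split_header(header):
    """Split a header such as '... N = 146' into a query string and metadata."""
    match = re.search(r"\bN\s*=\s*(\d+)", header)
    if not match:
        return header.strip(), {}
    query = header[:match.start()].strip(" ,;(")
    return query, {"N": int(match.group(1))}


def normalize_cell(cell, unit_header):
    """Map a cell like '49 (33.6)' under a header like 'no. (%)' to typed values."""
    units = re.findall(r"[^\s()]+", unit_header)    # e.g. ['no.', '%']
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cell)  # e.g. ['49', '33.6']
    return {unit: float(num) if "." in num else int(num)
            for unit, num in zip(units, numbers)}


query, metadata = split_header("Patients with Secondary Thrombosis N = 146")
values = normalize_cell("49 (33.6)", "no. (%)")
print(query, metadata, values)
```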

Query – Candidate Classes* [DBpedia]

Hypertension

(1) Idiopathic intracranial hypertension
(2) Pulmonary hypertension
(3) Hypertension

Re-rank – Classifier (String Similarity, Popularity)

(1) Hypertension
(2) Pulmonary hypertension
(3) Idiopathic intracranial hypertension

Also evaluated against SNOMED CT & UMLS

Query – Candidate Classes [Hybrid]

Hypertension

No results?

SNOMED CT

Modeling Medical Tables as RDF

[Figure: RDF model of a medical table. Node types: PatientGroup, owl:Thing, xsd:integer, xsd:String, xsd:double. Properties: numberOfIndividuals, hasGroupAttribute, hasType, hasRawValue. Example values: umls:Secondary_Thrombosis, %, 33.6.]

Interactive tool to generate Meta – Analysis reports

User interface to define meta-analysis parameters and select studies

Tool automatically generates relevant SPARQL queries

Evaluation

Header Cell Annotations

[Chart: distribution of header cell concepts at different ranks (1, 2-5, 6-10, 11-25, 26-101, NF), one panel each for SNOMED CT, UMLS, HYBRID and DBPEDIA; legible values: 60.66, 59.02; 10.66, 20.49; 2.46, 4.92; 3.28, 2.46; 4.1, 9.02]

NF: Correct concept not found in the candidate set

Dataset: 7 tables (122 header cells)

Retrieval (Find) Evaluation Experimental Setup

• Generated Linked Data from four tables

• Executed retrieval SPARQL queries to find tables that included correlations between venous thrombosis and four different cardiovascular risk factors

• Average Precision: 0.79; Average Recall: 0.75

Conclusions

I claimed:

“It is possible to generate high quality linked data from tables by jointly inferring the semantics of column headers, values (string and literal) in table cells, and relations between columns augmented with background knowledge from open data sources such as the Linked Open Data cloud.”

Conclusions

It is possible to generate high quality linked data from tables by jointly inferring the semantics

TABEL jointly inferred the semantics; thorough evaluation showed promising results

… the semantics of column headers, values (string and literal) in table cells, and relations between columns

A novel technique to generate candidate properties from literal values

Conclusions

It is possible to generate high quality linked data from tables

Tables ontology to represent the inferred semantics

Demonstrated domain independence and extensibility and support for tables with different structures

Explored different models for Human in the loop

Future Work

Schema + Data driven approach

Build on the work on inferring literals; NumKB

Further develop Human in the loop

Tool to generate meta-analysis reports

Acknowledgements

Dr. Tim Finin

Dr. Anupam Joshi

Dr. Tim Oates

Dr. Yun Peng

Dr. L V Subramaniam

Dr. Indrajit Bhattacharya

Lab mates & Friends!

Thank You!

Our papers on this research topic have garnered 93 citations!
