Entity Linkage in the Linked Data: approaches and analysis

Wei Hu ([email protected])
Department of Computer Science and Technology
National Key Laboratory for Novel Software Technology
Nanjing University, China

Université d'Artois, 2016.6.7

Websoft (http://ws.nju.edu.cn)

- Research topics
  - Semantic Web
  - Web science
  - Big data
- Academic records
  - Papers
    - WWW, IJCAI, AAAI, ISWC …
    - Best paper award & nominee
  - Grants: 863, NSFC …
- Collaborations
  - Stanford, VUA, KIT, Aberdeen …
  - IBM, Samsung, ZTE …

Yuzhong Qu, Wei Hu, Gong Cheng

Outline

- Introduction to Semantic Web and entity linkage
- A bootstrapping approach to entity linkage
- Link analysis of biomedical linked data
- (Two applications)

Semantic Web

- The Semantic Web was an idea from Tim Berners-Lee
- Give formal meanings to Web information: semantics
  - Web 1.0 (page) → Web 2.0 (social) → Web 3.0 (a web of data)
- The Semantic Web is about
  1. common formats for
     - integration and combination of data drawn from diverse sources
  2. languages for
     - recording how the data relates to real-world objects

RDF (Resource Description Framework)

http://en.wikipedia.org/wiki/Alan_Kay
http://www.smalltalk.org/

RDF triple: <subject, predicate, object>
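As a minimal sketch of the triple model, the snippet below writes one statement as a Python tuple and serializes it as an N-Triples line. The two URIs are the ones shown on the slide, and treating them as subject and object is an assumption. The predicate `http://example.org/vocab/designed` is a hypothetical placeholder, since the predicate on the slide is not recoverable from the transcript.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Subject and object are the URIs from the slide; the predicate is a
# made-up placeholder for illustration only.
statement = Triple(
    subject="http://en.wikipedia.org/wiki/Alan_Kay",
    predicate="http://example.org/vocab/designed",  # hypothetical
    obj="http://www.smalltalk.org/",
)


def to_ntriples(t: Triple) -> str:
    """Serialize a triple whose three terms are all URIs as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


print(to_ntriples(statement))
```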

Linked data

- As a realization of the Semantic Web
  - Linked Data refers to a collection of interrelated datasets
    - Used for large-scale integration of, and reasoning on, data on the Web
- Linked data principles
  1. Use URIs to name things
  2. Use HTTP URIs (can be "dereferenced")
  3. Provide useful information using the open Web standards (e.g. RDF)
  4. Include links to other related things

Linking open data (LOD) cloud

[Figure: LOD cloud diagram, 1,014 datasets. Domain labels: publication, life science, social network, geography, government data, media, language, misc., UGC]

Knowledge graph

- Knowledge Graph is a knowledge base used by Google to enhance its search engine's search results with semantic-search information gathered from a wide variety of sources
  - Nodes: entities or concepts
  - Edges: attributes or relations

Entity linkage

- Semantic Web data has reached a scale of billions of entities
- Many different entities refer to the same real-world thing
  - Typically denoted by URIs, from distributed data sources
- e.g. Wei Hu
  - http://data.semanticweb.org/person/wei-hu
  - http://ws.nju.edu.cn/people/whu
  - http://ontoworld.org/wiki/Special:URIResolver/Wei_Hu
  - …
- Entity linkage: link different entities that refer to the same object
  - a.k.a. coreference resolution, entity matching …
  - Out of 31B RDF statements, fewer than 500M are links across sources

Outline

- Introduction to Semantic Web and entity linkage
- A bootstrapping approach to entity linkage
- Link analysis of biomedical linked data
- (Two applications)

Background

- In LOD, millions of entities have already been linked
  - However, potential candidates are still numerous
- Current solutions
  1. Equivalence reasoning
     - owl:sameAs, inverse functional properties …
     - At present, probably misses many potential candidates
  2. Similarity computation (also in the database area)
     - Compare properties and values of entities
     - Inaccurate (heterogeneity), less scalable (pairwise comparison)
  3. To improve, machine learning
     - Time-consuming and labor-intensive to build a large-scale training set
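Equivalence reasoning over owl:sameAs amounts to taking the symmetric, transitive closure of the stated links. The sketch below does this with a union-find structure. The function name `same_as_clusters` is illustrative, and the sample links (including the `ex:` identifiers) are hypothetical assertions made up for the example, not statements taken from any dataset.

```python
def same_as_clusters(pairs):
    """Group URIs into clusters implied by symmetric, transitive sameAs links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        parent[root_a] = root_b

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


# Hypothetical sameAs statements between URIs used for illustration.
links = [
    ("http://data.semanticweb.org/person/wei-hu", "http://ws.nju.edu.cn/people/whu"),
    ("http://ws.nju.edu.cn/people/whu", "http://ontoworld.org/wiki/Special:URIResolver/Wei_Hu"),
    ("ex:x", "ex:y"),
]
clusters = same_as_clusters(links)
```

The first and third URIs end up in one cluster although no statement links them directly, which is the kind of candidate that plain lookup of stated links would miss.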

Definition

How to combine? Our solution: bootstrapping

- Query-driven entity linkage
- Use scenarios
  1. Search / browsing: a system knows “what to link” only at query time
  2. Analyze small portions of a very large dataset to answer on-demand queries

Definition 1. Let $U$ be the set of entities in a set $D$ of data sources. Given an entity $u \in U$, the entity linkage for $u$ is to query a subset of entities $\varepsilon(u) \subseteq U$ for which a relation $\varepsilon$ holds:

$$\varepsilon(u) = \{\, v \in U \mid \varepsilon(u, v) \,\}$$

where $\varepsilon$ links all the entities in $U$ that refer to the same object as $u$ does, i.e. are coreferent with $u$.

Our contribution

[Figure: overview of the approach. Other labels in the figure: external knowledge, labeled entities, unresolved entities]

Input: an entity

1. Build a kernel (initialize training set)
   - Automatically infer semantically coreferent entities based on OWL/SKOS semantics
2. Bootstrapping: learn discriminative property-value pairs
   - Iterative process
   - Assumptions: (1) coreferent entities share some similar property-value pairs; (2) a few property-value pairs are more important for linking entities
3. Frequent property combinations
   - Some properties are more natural to use together

Output: a set of coreferent entities

Wei Hu, Cunxin Jia. A Bootstrapping Approach to Entity Linkage on the Semantic Web. J. Web Semantics, 2015

Running example

dbpedia:Nanjing (DBpedia)
  rdfs:label          "Nanjing"
  owl:sameAs          geo:1799962

geo:1799962 (GeoNames)
  geo:lat             "32 N"
  geo:long            "118 E"
  geo:alternateName   "Nanjing"
  geo:alternateName   "Nan-ching"

fb:m.05gqy (Freebase)
  rdfs:label          "Nanjing"
  geo:lat             "32 N"
  geo:long            "118 E"

ex:NationalCity
  geo:long            "117 W"
  geo:lat             "32 N"
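The running example can be traced with a toy bootstrapping loop. This is a simplified sketch and not the algorithm of the cited paper: the kernel is the pair joined by owl:sameAs, and the weighting scheme (a property-value pair counts for 1 divided by the number of entities carrying it, so rarer pairs weigh more) and the acceptance threshold of 1.0 are assumptions chosen for illustration. The threshold here is unrelated to the discriminability threshold reported in the experiments.

```python
from collections import defaultdict


def bootstrap_linkage(entities, kernel, threshold=1.0, max_iterations=4):
    """Grow the labeled set from a kernel by matching weighted property-value pairs."""
    # Assumed weighting: the fewer entities carry a pair, the more it counts.
    carriers = defaultdict(set)
    for uri, pairs in entities.items():
        for pair in pairs:
            carriers[pair].add(uri)
    weight = {pair: 1.0 / len(uris) for pair, uris in carriers.items()}

    labeled = set(kernel)
    for _ in range(max_iterations):
        known_pairs = set()
        for uri in labeled:
            known_pairs |= entities[uri]
        newly_linked = {
            uri
            for uri, pairs in entities.items()
            if uri not in labeled
            and sum(weight[p] for p in pairs & known_pairs) >= threshold
        }
        if not newly_linked:
            break  # nothing new was learned in this round
        labeled |= newly_linked
    return labeled


entities = {
    "dbpedia:Nanjing": {("rdfs:label", "Nanjing")},
    "geo:1799962": {
        ("geo:lat", "32 N"),
        ("geo:long", "118 E"),
        ("geo:alternateName", "Nanjing"),
        ("geo:alternateName", "Nan-ching"),
    },
    "fb:m.05gqy": {("rdfs:label", "Nanjing"), ("geo:lat", "32 N"), ("geo:long", "118 E")},
    "ex:NationalCity": {("geo:long", "117 W"), ("geo:lat", "32 N")},
}
kernel = {"dbpedia:Nanjing", "geo:1799962"}  # linked by owl:sameAs
coreferent = bootstrap_linkage(entities, kernel)
```

Under these assumptions fb:m.05gqy joins the kernel because it shares the label, latitude and longitude, while ex:NationalCity shares only the latitude and stays out.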

Experiment

- Dataset
  - Billion Triples Challenge (BTC) 2011

    Entities               > 100 million
    RDF stat.              > 2 billion
    Same-as stat.          3,446,029
    IFP stat.              1,799,976
    FP stat.               2,279,474
    Exact-match stat.      22,398
    Cardinality stat.      148
    Has-key stat.          2
    Different-from stat.   691
    All-different stat.    89

- Testing entities
  - Top-50 in 364 thousand query logs

    People         15    Places           10
    Tech terms      8    Music / movies    5
    Universities    4    Companies         3
    Publications    2    Others            3

- Evaluation procedure and metrics
  - 30 graduates, 2 judges + 1 arbitrator per link, Fleiss's κ = 0.8 (sufficient agreement)
  - Precision & relative recall (RR)
    - RR = correct links in one system / total correct unique links in all systems
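The two metrics can be sketched as follows. Relative recall is computed against the pool of correct links found by any of the compared systems, since the full set of true links is unknown. The system names and link identifiers below are hypothetical.

```python
def precision(found, judged_correct):
    """Share of a system's links that the judges accepted."""
    return len(found & judged_correct) / len(found) if found else 0.0


def relative_recall(system_links, judged_correct):
    """RR per system: its correct links over the unique correct links of all systems."""
    pool = set()
    for links in system_links.values():
        pool |= links & judged_correct
    return {
        name: (len(links & pool) / len(pool) if pool else 0.0)
        for name, links in system_links.items()
    }


# Hypothetical judgments and outputs of two hypothetical systems.
judged_correct = {"l1", "l3", "l4", "l5"}
system_links = {"A": {"l1", "l2", "l3"}, "B": {"l3", "l4"}}
rr = relative_recall(system_links, judged_correct)
```

Link `l5` is correct but found by neither system, so it does not enter the pool; this is what makes the recall relative rather than absolute.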

Experiment

- Bootstrapping curve
  - Maximum iteration = 4
  - Discriminability threshold = 0.05
- Linkage accuracy
- Running time on 5,000 samples: avg. 11.3 links in 12.6 s

Outline

- Introduction to Semantic Web and entity linkage
- A bootstrapping approach to entity linkage
- Link analysis of biomedical linked data
- (Two applications)

Background

- Network analysis has long been used to study link structures
  - Network medicine: cellular networks and implications
  - The “bow tie” structure of the Web
- Linked data for the life sciences
  - e.g. Bio2RDF, Chem2Bio2RDF, Neurocommons, W3C LODD
    - Millions of links over hundreds of overlapping datasets
  - Network analysis can help
    - understand structures to express data
    - facilitate large-scale data integration
    - improve overall quality of biomedical data

No such analysis yet!

Preliminaries

- Graph: nodes and edges
  - (outgoing / incoming) degree
    - Sink, source, isolated node
- Power law distribution
  - $P(k) \propto k^{-\gamma}$: the fraction of nodes with degree $k$ decays as a power of $k$
  - Scale-free
- Weakly connected component
  - Size: number of nodes
- Average distance
- Clustering coefficient → small-world phenomenon
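The notions above can be made concrete on a small directed graph. This is a plain-Python sketch with illustrative function names and a made-up five-node graph; the clustering coefficient is the local, undirected variant, which is one of several common definitions.

```python
from collections import defaultdict


def degrees(edges):
    """Return (out_degree, in_degree) dictionaries for a directed edge list."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return out_deg, in_deg


def undirected_neighbours(nodes, edges):
    """Adjacency sets that ignore edge direction."""
    neighbours = {n: set() for n in nodes}
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    return neighbours


def weakly_connected_components(nodes, edges):
    """Components of the graph once edge direction is dropped."""
    neighbours = undirected_neighbours(nodes, edges)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


def clustering_coefficient(node, neighbours):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    around = sorted(neighbours[node])
    k = len(around)
    if k < 2:
        return 0.0
    linked = sum(
        1 for i, u in enumerate(around) for v in around[i + 1:] if v in neighbours[u]
    )
    return 2 * linked / (k * (k - 1))


# Made-up graph: a cycle a -> b -> c -> a, a source d, and an isolated node e.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
out_deg, in_deg = degrees(edges)
components = weakly_connected_components(nodes, edges)
cc_a = clustering_coefficient("a", undirected_neighbours(nodes, edges))
```

Here `d` is a source (no incoming edge), `e` is isolated and forms its own component of size 1, and `a` has three neighbours of which only one pair (`b`, `c`) is linked.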

Our contribution

- We conduct an empirical link analysis of Bio2RDF
  - Bio2RDF is an open source project that uses Semantic Web technologies to build and provide the largest network of life science Linked Data
    - Ensures the significance of our empirical study
  1. Dataset link analysis (using RDF data model)
  2. Entity link analysis (using a special kind of cross-references)
  3. Term link analysis (using ontology matching)
- For each perspective, we investigate the graph features of Bio2RDF vis-à-vis what has been previously reported
- Symmetry and transitivity of entity links
- Benchmark to evaluate entity matching approaches

Wei Hu, Honglei Qiu, Michel Dumontier. Link Analysis of Life Science Linked Data. ISWC, 2015

Dataset overview

What is the status of Bio2RDF?

- 35 datasets, 11B RDF triples
- 1B entities
- 2K classes
- 4K properties

Observations

1. Well linked
2. Average distance = 2.77, clustering coefficient = 0.22 → small-world phenomenon
3. Hubs and authorities
4. Good resilience

Entity link analysis

How well do entities link to each other?

- 76% of entity links come from a special kind of RDF triples
  - e.g. <kegg:D03455, kegg:x-drugbank, drugbank:DB00002>
  - x-relations have under-specified semantics
    - Refer to a related resource, e.g. article
    - Truly identical
- Degree distribution
  - Three types of entities in OMIM, NCBI, KEGG
  - Does not follow a power law
    - Exponent is too large (close to 5), p-value is too small (close to 0)

[Figure label: BTC2010]

Symmetry and transitivity

- Symmetry
  - Borrow and reverse
  - Different classes
    - Phenotype vs. Disorder
- Transitivity
  - Weak intermediate
  - Modeling divergence
    - Hard even for humans
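Checking symmetry and transitivity over a set of directed entity links can be sketched as below. The first link reuses the kegg/drugbank pair quoted in the entity link analysis; the reverse link and the `ex:entity` node are hypothetical additions made only so that both checks have something to report.

```python
def symmetric_links(links):
    """Links whose reverse direction is also stated."""
    return {(s, t) for (s, t) in links if (t, s) in links}


def missing_transitive_links(links):
    """Pairs (a, c) implied by a -> b and b -> c but not stated directly."""
    successors = {}
    for s, t in links:
        successors.setdefault(s, set()).add(t)
    missing = set()
    for a, b in links:
        for c in successors.get(b, ()):
            if c != a and (a, c) not in links:
                missing.add((a, c))
    return missing


links = {
    ("kegg:D03455", "drugbank:DB00002"),
    ("drugbank:DB00002", "kegg:D03455"),  # hypothetical reverse link
    ("ex:entity", "kegg:D03455"),         # hypothetical third entity
}
symmetric = symmetric_links(links)
missing = missing_transitive_links(links)
```

Whether a missing pair such as (`ex:entity`, `drugbank:DB00002`) should actually be asserted depends on what the intermediate links mean, which is the difficulty the slide points to.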

Discussion of findings

- The entity link graph does not share the same characteristics as the hypertext Web / Semantic Web
  - Degree distribution does not follow a power law
- A dominant part of entities have been linked using x-relations, but their intended semantics differ
  - Classes are identical or equivalent → entity links represent logical equivalence
- Symmetric and transitive entity links exist, but their effectiveness is weakened due to their small number
  - Meanings of entity links may shift under transitivity
  - KEGG, DrugBank and OMIM are the most prominent knowledge bases

Applications: BioSearch

- Keyword search is the most popular paradigm for information retrieval
  - Keywords can be ambiguous and have multiple meanings
- Semantic search aims to improve search accuracy by understanding user intent and search context
  - Heterogeneity between local schemas
- Our solution
  1. Semantic query + faceted filtering
     - Not only plain keywords but also semantic tags
  2. Ontology-based query answering
     - Rewrite queries from SIO to local schemas
  3. Entity browsing
  - Result: effectiveness +22.4%, usability +28.8%

http://ws.nju.edu.cn/biosearch/

[Figure: BioSearch architecture. Labels: server side, browser side, external resources, images, ontology matching, ontology-based query answering, ontology mappings, semantic query input, faceted filtering, entity browsing, SPARQL queries, semantic query, Semanticscience Integrated Ontology, Gene]

Applications: Clinga

- Chinese geographical data is small in scale, e.g. 4.6% in GeoNames
- Chinese linked geographical dataset (Clinga)
  1. Extract data from the largest Chinese wiki encyclopedia
  2. Design a geo-ontology to classify geographical entity types
  3. Automatic discovery of links to existing knowledge bases
  - Result: 624K entities, 230K links
- Use scenario
  - Major knowledge base for answering Chinese geographical questions in our National Higher Education Entrance Examination (called GaoKao)

http://ws.nju.edu.cn/clinga/

Conclusion

- Entity linkage is to link different entities that refer to the same real-world object
- Large scale and heterogeneity are challenging existing entity linkage solutions
- Entity linkage approaches often involve knowledge representation, data mining, network analysis, crowdsourcing and many other techniques

Comments?

Contact: Wei Hu ([email protected])

Thank you for your invitation and time!

Université d'Artois, 2016.6.7

