KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association
Institute of Applied Informatics and Formal Description Methods (AIFB)
www.kit.edu
Efficient Graph-based Document Similarity
Christian Paul, Achim Rettinger, Aditya Mogadala, Craig A. Knoblock, Pedro Szekely
06/01/2016
Common task: Related-document Search
[Figure: a query document compared against a document collection; example texts:]
• Apple breaks laptop sales record
• He drinks apple juice during half-time break
• All-time high in MacBooks sold
• U2 record pre-installed on iPhones
• ...
Matching words do not always indicate similarity
[Figure: a query document compared against a document collection; example texts:]
• He drinks apple juice during half-time break
• Apple breaks laptop sales record
• All-time high in MacBooks sold
• U2 record pre-installed on iPhones
• ...
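The point about matching words can be checked with a small sketch. It scores two of the example texts against "Apple breaks laptop sales record" by plain word overlap (Jaccard similarity on lowercased, whitespace-split tokens). The tokenization and the choice of Jaccard are assumptions made for illustration; the slides do not name a specific word-matching measure.

```python
def tokens(text):
    """Lowercased, whitespace-split word set (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity: shared words divided by all distinct words."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


news = "Apple breaks laptop sales record"
juice = "He drinks apple juice during half-time break"
macbook = "All-time high in MacBooks sold"

juice_score = jaccard(news, juice)      # shares the word "apple"
macbook_score = jaccard(news, macbook)  # shares no word at all
```

Under these assumptions the apple-juice sentence scores higher than the MacBook sentence, although only the latter is about the same topic.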
Word co-occurrence can be misleading, too
[Figure: a query document compared against a document collection; example texts:]
• Apple breaks laptop sales record
• All-time high in MacBooks sold
• U2 record pre-installed on iPhones
• He drinks apple juice during half-time break
• ...
Semantic Technologies: resolve ambiguity & exploit relational knowledge
[Figure: query document and document collection, annotated with knowledge graph entities; example texts:]
• Apple breaks laptop sales record
• All-time high in MacBooks sold
• U2 record pre-installed on iPhones
• He drinks apple juice during half-time break
• ...
Knowledge graph shown: MacBook (developer: Apple Inc.; type: Laptop), iPhone (developer: Apple Inc.), Apple Juice
Expensive graph traversal
Related Work
Distributional: + scalable, fast; - no explicit disambiguation and conceptual relations
• Explicit Semantic Analysis (ESA) [GM07]
• TF-IDF, Vector Space Model
• Salient Semantic Analysis (SSA) [HM11]
Knowledge-based: + rich semantic knowledge; - expensive graph traversal
• PathSim [SHY+11]
• HeteSim [SKH+14]
• AnnSim: 1-1 matching, hierarchical similarity [PVH+13]
• Schuhmacher, Ponzetto: Graph Edit Distance [SP14]
• Nunes et al.: Transversal doc. similarity [NKF+13]
Bridging the gap
• Efficient Graph-based Document Similarity is positioned between the distributional and the knowledge-based approaches of the related work: scalable and fast, while using rich semantic knowledge
Core Contributions
➢ Scalable related-document search process
➢ Graph traversal during pre-processing
➢ Light-weight tasks at search time
We achieve computational efficiency similar to that of statistical approaches
➢ Bag-of-entities document model & similarity
➢ Document similarity as a combination of pairwise entity similarities
➢ Exploits hierarchical & transversal knowledge graph relations
In our experiments, we achieve higher correlation with the human notion of document similarity than the competition
Related-document Search using Graph-based Similarity
1) Semantic Document Expansion
• Enrich query document with relational knowledge
2) Inclusion in corpus
• Store & index expanded document
3) Pre-search
• Use inverted index to generate candidate set
4) Full search
• Entity-level, path-based similarities
Semantic Document Expansion
• Enrich document annotations
• Hierarchically
  - Categories & their ancestors + hierarchical depths
• Transversally
  - Weight neighboring entities based on
    • number of paths
    • length of paths
\[ w(e) = \sum_{l=1}^{L} \beta^{l} \cdot \left| paths_{a,e}^{(l)} \right| \]
[Figure: an annotation of DocA and its neighboring entities, labeled with example weights between 0.25 and 1.5]
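The weighting of neighboring entities can be sketched as follows. The function counts simple paths of each length l up to L from an annotation a to every reachable entity e and adds β^l per path, as in the formula for w(e). The toy graph (built from the entities of the Apple example), the undirected treatment of edges, the restriction to simple paths, and the values β = 0.5 and L = 2 are all assumptions for illustration, not the authors' settings.

```python
from collections import defaultdict


def expand(graph, annotation, max_len=2, beta=0.5):
    """Return w(e) = sum over l of beta**l * (number of paths of length l)."""
    weights = defaultdict(float)

    def walk(node, visited, length):
        if length == max_len:
            return
        for neighbor in graph.get(node, ()):
            if neighbor in visited:  # simple paths only (assumption)
                continue
            weights[neighbor] += beta ** (length + 1)
            walk(neighbor, visited | {neighbor}, length + 1)

    walk(annotation, {annotation}, 0)
    return dict(weights)


# Hypothetical undirected toy graph, stored as adjacency lists.
graph = {
    "MacBook": ["Apple_Inc.", "Laptop"],
    "Apple_Inc.": ["MacBook", "iPhone"],
    "Laptop": ["MacBook"],
    "iPhone": ["Apple_Inc."],
}
weights = expand(graph, "MacBook")
```

Here the direct neighbors of MacBook receive 0.5 each, and iPhone, reachable only over one path of length 2, receives 0.25. Because the traversal happens once per annotation, the resulting weights can be stored with the document at pre-processing time.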
Pre-Search: Generate Candidate Set
• Inverted index from entities to documents
  - Retrieve candidates efficiently
• Assumption: entity overlap → contextual similarity
  - Coarse, document-level assessment
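A minimal sketch of the pre-search step: an inverted index maps each entity to the documents that contain it, and candidates are ranked by how many query entities they share. The document ids, entity names, and the plain overlap count are hypothetical; the sketch ignores the entity weights produced by document expansion.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map every entity to the set of document ids annotated with it."""
    index = defaultdict(set)
    for doc_id, entities in docs.items():
        for entity in entities:
            index[entity].add(doc_id)
    return index


def pre_search(index, query_entities, k=15):
    """Return up to k document ids, ordered by entity overlap with the query."""
    overlap = Counter()
    for entity in set(query_entities):
        for doc_id in index.get(entity, ()):
            overlap[doc_id] += 1
    return [doc_id for doc_id, _ in overlap.most_common(k)]


docs = {
    "d1": {"Apple_Inc.", "MacBook", "Laptop"},
    "d2": {"Apple_Inc.", "iPhone", "U2"},
    "d3": {"Apple_juice"},
}
index = build_index(docs)
candidates = pre_search(index, {"Apple_Inc.", "Laptop"})
```

Only documents that share at least one entity with the query are touched, which keeps this step cheap compared with scoring every document in the collection.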
Full Search: Graph-based Document Similarity
• For each candidate document, reconstruct the query-candidate annotation subgraph (hierarchical & transversal)
➢ Compute all pairwise entity similarity scores
➢ Combine into document score
[Figure: annotation subgraph connecting DocA and DocB]
Hierarchical entity similarity
• Using stored ancestors & depths to compute
\[ hierSim_{dps}(x, y) = \frac{d(root, lca(x, y))}{d(root, lca(x, y)) + d(lca(x, y), x) + d(lca(x, y), y)} \]
• Example:
\[ hierSim_{dps}(ent_1, ent_2) = \frac{1}{1 + 2 + 2} = 0.2 \]
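The hierarchical measure can be sketched on a toy hierarchy. The function finds the lowest common ancestor (lca) of two entities and applies the formula for hierSim_dps. The hierarchy, its node names, and the representation as a single-rooted tree with one parent per node are assumptions; a real category hierarchy may allow several parents per node.

```python
def path_to_root(parent, node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def hier_sim(parent, x, y):
    """d(root, lca) / (d(root, lca) + d(lca, x) + d(lca, y))."""
    path_x, path_y = path_to_root(parent, x), path_to_root(parent, y)
    on_path_y = set(path_y)
    lca = next(node for node in path_x if node in on_path_y)
    d_root_lca = len(path_to_root(parent, lca)) - 1
    d_lca_x = path_x.index(lca)
    d_lca_y = path_y.index(lca)
    total = d_root_lca + d_lca_x + d_lca_y
    if total == 0:
        return 1.0  # x and y are both the root; scoring this as 1 is an assumption
    return d_root_lca / total


# Hypothetical hierarchy: root -> A -> {B -> ent1, C -> ent2}
parent = {"A": "root", "B": "A", "C": "A", "ent1": "B", "ent2": "C"}
score = hier_sim(parent, "ent1", "ent2")
```

In this toy tree the lca of ent1 and ent2 lies at depth 1 and both entities are two steps below it, which reproduces the example value 1 / (1 + 2 + 2) = 0.2.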
Transversal entity similarity
• Use stored neighbors & weights to compute:
\[ transSim(a, b) = \sum_{l=1}^{L \cdot 2} \beta^{l} \cdot \left| paths_{a,b}^{(l)} \right| \]
• Example:
\[ transSim(ent_1, ent_2) = 0.5^{2} + 2 \cdot 0.25^{2} + 0.5 \cdot 0.25 = 0.5 \]
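The example arithmetic multiplies pairs of stored weights, so one plausible reading is that two entities are compared through the neighbors they share: a path of length l1 from a to a shared neighbor and a path of length l2 from b to the same neighbor combine into a path of length l1 + l2 with weight β^(l1 + l2). The sketch below follows that reading and computes the sum of weight products over shared neighbors. The neighbor names and the assignment of weights are hypothetical, chosen only to reproduce the terms of the example; how the authors handle direct links between the two entities is not covered here.

```python
def trans_sim(weights_a, weights_b):
    """Sum of weight products over the neighbors stored for both entities."""
    shared = weights_a.keys() & weights_b.keys()
    return sum(weights_a[n] * weights_b[n] for n in shared)


# Hypothetical stored neighbor weights of two annotated entities.
ent1 = {"n1": 0.5, "n2": 0.25, "n3": 0.25, "n4": 0.5}
ent2 = {"n1": 0.5, "n2": 0.25, "n3": 0.25, "n4": 0.25}

# 0.5*0.5 + 0.25*0.25 + 0.25*0.25 + 0.5*0.25
score = trans_sim(ent1, ent2)
```

Under this reading no graph traversal is needed at search time, because only the stored weight maps of the two entities are consulted.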
Document similarity: bipartite graph of entity similarities
1. Annotation pair similarity: combine transversal & hierarchical scores
2. Determine maxGraph: for each annotation, choose the max. score edge (bold)
3. Compute document score based on the max. edges $(a_{1i}, matched(a_{1i}))$ for each annotation $a_{1i}$ of DocA:
\[ docSim(docA, docB) = \frac{\sum_{a_{1i} \in A_1} entSim(a_{1i}, matched(a_{1i}))}{|A_1| + |A_2|} \]
[Figure: bipartite graph between the annotations of DocA and DocB]
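The combination step can be sketched as below. The formula as extracted shows a single sum over the annotations of DocA, while the DBpedia example of the deck adds six scores for 3 + 3 annotations. The sketch assumes the second reading, in which every annotation of either document contributes its maximum-score edge and the total is divided by |A1| + |A2|. The annotation names and pair scores are made up, and the way transversal and hierarchical scores are combined into one entity score is left out because the slides do not specify it.

```python
def doc_sim(ent_sim, doc_a, doc_b):
    """Average max-edge score over the annotations of both documents."""
    if not doc_a or not doc_b:
        return 0.0
    best_a = [max(ent_sim(a, b) for b in doc_b) for a in doc_a]
    best_b = [max(ent_sim(a, b) for a in doc_a) for b in doc_b]
    return (sum(best_a) + sum(best_b)) / (len(doc_a) + len(doc_b))


# Hypothetical entity similarity scores for all annotation pairs.
pair_scores = {
    ("a1", "b1"): 0.9, ("a1", "b2"): 0.2,
    ("a2", "b1"): 0.4, ("a2", "b2"): 0.6,
}


def lookup(a, b):
    """Entity similarity read from the precomputed table above."""
    return pair_scores[(a, b)]


score = doc_sim(lookup, ["a1", "a2"], ["b1", "b2"])  # (0.9 + 0.6 + 0.9 + 0.6) / 4
```

Taking the maximum per annotation means that several annotations may match the same partner, unlike a strict 1-1 matching.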
Document similarity: DBpedia example
• Example documents score:
\[ docSim(docA, docB) = \frac{0.53 + 0.92 + 0.43 + 0.53 + 0.58 + 0.81}{3 + 3} \approx 0.63 \]
Evaluation
• Task: Measure correlation with human notion of similarity
• Datasets
  • Document similarity: Lee50 [1]
  • Sentence similarity: 2012-MSRvid-Test [2], 2015-Images [3]
• ... using [logo] and the X-LiSA [ZR14] entity extractor
[1] https://webfiles.uci.edu/mdlee/LeePincombeWelsh.zip
[2] http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/
[3] http://ixa2.si.ehu.es/stswiki/index.php/
Document Similarity: Lee50 corpus
• 50 short news articles (51 to 126 words)
• Gold standard set of full pairwise document similarity scores
➢ Outperforming baselines & competition:
  • Statistical (LSA, ESA, SSA)
  • Knowledge-based (GED)
Sentence Similarity
• Compared to related unsupervised approaches (on texts with one or more extracted entities)
• 2012-MSRvid-Test: Video descriptions from MSR Video Paraphrase Corpus
• 2015-Images: Flickr image descriptions
➢ Outperforming baselines & competition
  • Statistical (Polyglot)
  • Knowledge-based (Tiantianzhu7, IRIT, WSL)
Related-document Search: Pre-Search, Full Search & Efficiency
➢ Ranking score (nDCG) improves from Pre-Search to Full Search
➢ Computation time grows linearly with candidate set size
➢ Here: candidate set of size ~15 achieves high performance
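For readers unfamiliar with the ranking score, nDCG compares the discounted gain of a ranking with that of the ideal ordering of the same items. The sketch below uses linear gain and a log2 rank discount, which is one common variant; the slides do not state which variant was used, and the relevance grades are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades given in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(ranked_relevances):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades of three candidates in two different orders.
coarse_ranking = [1, 3, 2]   # most relevant document not ranked first
refined_ranking = [3, 2, 1]  # ideal order

coarse_score = ndcg(coarse_ranking)
refined_score = ndcg(refined_ranking)
```

A ranking that moves the most relevant candidate to the top scores closer to 1, which is the kind of improvement the first bullet reports for Full Search over Pre-Search.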
Conclusion & Outlook
• Efficient Graph-based Document Similarity
  • … combines hierarchical & transversal relational knowledge
  • … outperforms related distributional & knowledge-based approaches, on both articles and sentences
  • … is computationally efficient: related-document search
• Lessons learned
  ➢ Value of DBpedia for semantic similarity
  ➢ The more entities (at least one) per document, the better:
  ➢ Few entities: disambiguation helps
  ➢ Many entities: maxGraph entity pairing emphasizes meaningful relations
• Resources (code, data, documents): http://people.aifb.kit.edu/amo/eswc2016/
References I
• [TMS08] Thiagarajan, Manjunath, Stumptner. Computing semantic similarity using ontologies. In ISWC08, the International Semantic Web Conference (ISWC), 2008.
• [LD08] Lemaire, Denhière. Effects of high-order co-occurrences on word semantic similarities.
• [GM07] Gabrilovich, Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI, volume 7, pages 1606–1611, 2007.
• [HM11] Hassan, Mihalcea. Semantic relatedness using salient semantic analysis. In AAAI, 2011.
• [SP14] Schuhmacher, Ponzetto. Knowledge-based graph document modeling. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM '14.
• [NKF+13] Nunes, Kawase, Fetahu, Dietze, Casanova, Maynard. Interlinking documents based on semantic graphs. Procedia Computer Science, 22:231–240, 2013.
• [PSA08] Potthast, Stein, Anderka. A Wikipedia-based multilingual retrieval model. In Advances in Information Retrieval, pages 522–530. Springer, 2008.
• [SHY+11] Sun, Han, Yan, Yu, Wu. PathSim: Meta path-based top-k similarity search in heterogeneous information networks. VLDB '11, 2011.
• [SKH+14] Chuan, Xiangnan, Yue, Yu, Bin. HeteSim: A general framework for relevance measure in heterogeneous networks. IEEE Transactions on Knowledge & Data Engineering.
• [PVH+13] Palma, Vidal, Haag, Raschid, Thor. Measuring relatedness between scientific entities in annotation datasets. In Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics, BCB '13.
• [ZR14] Zhang, Rettinger. X-LiSA: Cross-lingual semantic annotation. Proceedings of the VLDB Endowment (PVLDB), the 40th International Conference on Very Large Data Bases (VLDB).
• [KJC+15] Pavan Kapanipathi, Prateek Jain, Chitra Venkataramani, Amit Sheth. Hierarchical interest graph, 21 January 2015. wiki.knoesis.org/index.php/Hierarchical_Interest_Graph, last accessed 07/15/2015.
References II
• [LIJ+15] Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., et al.: DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2), 167–195 (2015)