
Chapter 3: Similarity Measures

Data Mining Technology

Written by Kevin E. Heinrich

Presented by Zhao Xinyou

youzx@ai.is.uec.ac.jp

2007.6.7

Some materials (examples) are taken from websites.

Searching Process

Input Text → Process (Index, Query) → Sorting → Show Text Result

Example: the user enters keywords and clicks Search; the system returns a ranked list of results (1. XXXXXXX, 2. YYYYYYY, 3. ZZZZZZZZ, ...).

Similarity Measures

A similarity measure can represent the similarity between two documents, two queries, or one document and one query.

It is possible to rank the retrieved documents in the order of presumed importance

A similarity measure is a function which computes the degree of similarity between a pair of text objects

There are a large number of similarity measures proposed in the literature, because the best similarity measure doesn't exist (yet!)

PP27-28

Classic Similarity Measures

All similarity measures should map to the range [-1, 1] or [0, 1], where

0 or -1 indicates minimum similarity (incompatible),

1 indicates maximum similarity (absolute similarity).

PP28

Conversion

For example, suppose 1 indicates incompatible (minimum) similarity and 10 indicates absolute (maximum) similarity. To map the range [1, 10] onto [0, 1]:

s' = (s - 1) / 9

Generally, we may use: s' = (s - min_s) / (max_s - min_s)

This is a linear mapping; non-linear mappings are also possible.

PP28
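Below is a minimal sketch of this min-max rescaling in Python (the function name rescale is illustrative, not from the slides):

```python
def rescale(s, min_s, max_s):
    """Linearly map a similarity score s from [min_s, max_s] onto [0, 1]."""
    return (s - min_s) / (max_s - min_s)

# The slide's example: scores originally in [1, 10]
print(rescale(1, 1, 10))   # 0.0 -> incompatible (minimum) similarity
print(rescale(10, 1, 10))  # 1.0 -> absolute (maximum) similarity
```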

Vector-Space Model-VSM

In the 1960s, Salton et al. proposed the Vector-Space Model (VSM), which was successfully applied in SMART (a text retrieval system).

PP28-29

Example: D is a set containing m Web documents:

D = {d1, d2, …, di, …, dm}, i = 1, 2, …, m

There are n distinct terms among the m documents, so each document is a vector of term weights: di = {wi1, wi2, …, wij, …, win}, i = 1, 2, …, m, j = 1, 2, …, n

The query is likewise a weight vector: Q = {q1, q2, …, qi, …, qn}, i = 1, 2, …, n

PP28-29

If similarity(q, di) > similarity(q, dj), we conclude that di is more relevant to the query than dj.
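As a minimal sketch of this ranking idea (the vectors, the plain dot product used as a stand-in similarity measure, and all names below are illustrative, not from the slides):

```python
# Each document di and the query q are n-dimensional term-weight vectors.
documents = {
    "d1": [2, 3, 5],
    "d2": [3, 7, 1],
}
query = [0, 0, 2]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

# Rank documents by a similarity measure (dot product used here only as a stand-in).
ranked = sorted(documents, key=lambda d: dot(query, documents[d]), reverse=True)
print(ranked)  # documents in order of presumed relevance
```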

Simple Measure Technology

PP29

[Venn diagram over the document set: A = retrieved, B = relevant, A∩B = retrieved and relevant]

Precision = Returned Relevant Documents / Total Returned Documents

Recall = Returned Relevant Documents / Total Relevant Documents

P(A,B) = |A∩B| / |A|

R(A,B) = |A∩B| / |B|

Example--Simple Measure Technology

PP29

[Venn diagram of the document set: relevant only: A, C, E, G, H, I, J; retrieved & relevant: B, D, F; retrieved only: W, Y]

|B| = |{relevant}| = |{A, B, C, D, E, F, G, H, I, J}| = 10

|A| = |{retrieved}| = |{B, D, F, W, Y}| = 5

|A∩B| = |{relevant} ∩ {retrieved}| = |{B, D, F}| = 3

P = precision = 3/5 = 60%

R = recall = 3/10 = 30%
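The same numbers can be checked with a short Python sketch using sets:

```python
relevant  = {"A", "B", "C", "D", "E", "F", "G", "H", "I", "J"}
retrieved = {"B", "D", "F", "W", "Y"}

both = relevant & retrieved             # {"B", "D", "F"}
precision = len(both) / len(retrieved)  # 3/5 = 0.6
recall    = len(both) / len(relevant)   # 3/10 = 0.3
print(precision, recall)
```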

Precision-Recall Graph (Curves)

There is a tradeoff between precision and recall, so measure precision at different levels of recall.

PP29-30

[Precision-recall curves: one query vs. two queries]

It is difficult to determine which of these two hypothetical results is better.

Similarity measures based on VSM:

Dice coefficient
Overlap coefficient
Jaccard coefficient
Cosine coefficient
Asymmetric measure
Dissimilarity measures (Euclidean, Manhattan)
Other measures

PP30

Dice Coefficient - Cont'

Definition of harmonic mean: for x1, x2, …, xn, their harmonic mean E equals n divided by (1/x1 + 1/x2 + … + 1/xn), that is:

E = n / (1/x1 + 1/x2 + … + 1/xn)

For the harmonic mean E of precision P and recall R:

1/E = (1/2)(1/P + 1/R), so E = 2 / (1/P + 1/R) = 2 / (|A|/|A∩B| + |B|/|A∩B|) = 2|A∩B| / (|A| + |B|)

PP30

Dice Coefficient - Cont'

Denotation of Dice coefficient:

sim(q, dj) = D(A, B) = |A∩B| / (α|A| + (1 - α)|B|),  α ∈ (0, 1)

           = Σ_{k=1..n} w_qk · w_kj / (α Σ_{k=1..n} w_qk² + (1 - α) Σ_{k=1..n} w_kj²)

PP30

If α = 1/2, then D(A, B) = |A∩B| / ((|A| + |B|)/2) = 2|A∩B| / (|A| + |B|) = E

α>0.5 : precision is more important

α<0.5 : recall is more important

Usually α=0.5
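As a sketch of the weighted Dice coefficient defined above (the function name dice and the sample vectors are illustrative):

```python
def dice(q, d, alpha=0.5):
    """Generalized Dice coefficient between query and document weight vectors."""
    num = sum(wq * wd for wq, wd in zip(q, d))
    den = alpha * sum(wq * wq for wq in q) + (1 - alpha) * sum(wd * wd for wd in d)
    return num / den

# With alpha = 0.5 and binary weights this reduces to 2|A∩B| / (|A| + |B|),
# i.e. the harmonic mean of precision and recall.
print(dice([0, 0, 2], [2, 3, 5]))  # 10 / (0.5*4 + 0.5*38) ≈ 0.48
```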

Overlap Coefficient

PP30-31

[Venn diagram over the document set: A = query terms, B = document terms]

sim(q, dj) = O(A, B) = |A∩B| / min(|A|, |B|)

           = Σ_{k=1..n} w_qk · w_kj / min(Σ_{k=1..n} w_qk², Σ_{k=1..n} w_kj²)
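A quick sketch of the set form of the overlap coefficient (the term sets below are made up for illustration):

```python
def overlap(a, b):
    """Overlap coefficient |A ∩ B| / min(|A|, |B|) for term sets."""
    return len(a & b) / min(len(a), len(b))

q_terms = {"semantic", "similarity", "measures"}                # illustrative query terms
d_terms = {"similarity", "measures", "documents", "retrieval"}  # illustrative document terms
print(overlap(q_terms, d_terms))  # 2 / min(3, 4) ≈ 0.67
```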

Jaccard Coefficient - Cont'

[Venn diagram over the document set: A = query terms, B = document terms]

sim(q, dj) = J(A, B) = |A∩B| / (|A| + |B| - |A∩B|)

           = Σ_{k=1..n} w_qk · w_kj / (Σ_{k=1..n} w_qk² + Σ_{k=1..n} w_kj² - Σ_{k=1..n} w_qk · w_kj)

PP31

Example- Jaccard Coefficient

D1 = 2T1 + 3T2 + 5T3, (2,3,5)

D2 = 3T1 + 7T2 + T3 , (3,7,1)

Q = 0T1 + 0T2 + 2T3, (0,0,2)

J(D1 , Q) = 10 / (38+4-10) = 10/32 = 0.31

J(D2 , Q) = 2 / (59+4-2) = 2/61 ≈ 0.03

PP31

sim(q, dj) = Σ_{k=1..n} w_qk · w_kj / (Σ_{k=1..n} w_qk² + Σ_{k=1..n} w_kj² - Σ_{k=1..n} w_qk · w_kj)
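These example values can be reproduced with a short sketch of the weighted Jaccard formula (the function name jaccard is illustrative):

```python
def jaccard(q, d):
    """Weighted Jaccard coefficient between query and document weight vectors."""
    num = sum(wq * wd for wq, wd in zip(q, d))
    return num / (sum(wq * wq for wq in q) + sum(wd * wd for wd in d) - num)

Q, D1, D2 = [0, 0, 2], [2, 3, 5], [3, 7, 1]
print(round(jaccard(Q, D1), 2))  # 10 / (4 + 38 - 10) ≈ 0.31
print(round(jaccard(Q, D2), 2))  # 2 / (4 + 59 - 2)  ≈ 0.03
```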

Cosine Coefficient - Cont'

sim(q, dj) = C(A, B) = |A∩B| / √(|A| · |B|)  (the geometric mean of P and R)

           = (q · dj) / (‖q‖ ‖dj‖)

           = Σ_{k=1..n} w_qk · w_kj / √(Σ_{k=1..n} w_qk² · Σ_{k=1..n} w_kj²)

PP31-32

[Figure: query vector (q1, q2, …, qn) and document vectors (d11, d12, …, d1n), (d21, d22, …, d2n) in term space; similarity is the cosine of the angle between the query and document vectors]

Example - Cosine Coefficient

Q = 0T1 + 0T2 + 2T3, D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3

C(D1, Q) = (2·0 + 3·0 + 5·2) / √((2² + 3² + 5²) · (0² + 0² + 2²)) = 10 / √(38·4) = 0.81

C(D2, Q) = 2 / √(59·4) = 0.13

PP31-32

[Figure: vectors (2, 3, 5), (3, 7, 1) and (0, 0, 2) in term space]
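A short sketch that reproduces these cosine values (the function name cosine is illustrative):

```python
from math import sqrt

def cosine(q, d):
    """Cosine coefficient between query and document weight vectors."""
    num = sum(wq * wd for wq, wd in zip(q, d))
    return num / sqrt(sum(wq * wq for wq in q) * sum(wd * wd for wd in d))

Q, D1, D2 = [0, 0, 2], [2, 3, 5], [3, 7, 1]
print(round(cosine(Q, D1), 2))  # 10 / sqrt(38 * 4) ≈ 0.81
print(round(cosine(Q, D2), 2))  # 2 / sqrt(59 * 4)  ≈ 0.13
```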

Asymmetric

PP31

sim(q, dj) = A(q, dj) = Σ_{k=1..n} min(w_qk, w_kj) / Σ_{k=1..n} w_qk

This measure is asymmetric: in general A(q, dj) ≠ A(dj, q).
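A minimal sketch of this asymmetric measure (the function name asymmetric is illustrative), showing that swapping the arguments changes the result:

```python
def asymmetric(q, d):
    """Asymmetric measure: overlap with d relative to the total weight of q."""
    return sum(min(wq, wd) for wq, wd in zip(q, d)) / sum(q)

Q, D1 = [0, 0, 2], [2, 3, 5]
print(asymmetric(Q, D1))  # 2 / 2  = 1.0
print(asymmetric(D1, Q))  # 2 / 10 = 0.2, so A(q, d) != A(d, q)
```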

Euclidean distance

dis_E(q, dj) = √( Σ_{k=1..n} (w_qk - w_kj)² )

PP32

Manhattan block distance

dis_M(q, dj) = Σ_{k=1..n} |w_qk - w_kj|

PP32
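A quick sketch of both distance measures on the running example vectors (the names euclidean and manhattan are illustrative):

```python
from math import sqrt

def euclidean(q, d):
    """Euclidean distance between query and document weight vectors."""
    return sqrt(sum((wq - wd) ** 2 for wq, wd in zip(q, d)))

def manhattan(q, d):
    """Manhattan (city-block) distance between the same vectors."""
    return sum(abs(wq - wd) for wq, wd in zip(q, d))

Q, D1 = [0, 0, 2], [2, 3, 5]
print(euclidean(Q, D1))  # sqrt(4 + 9 + 9) ≈ 4.69
print(manhattan(Q, D1))  # 2 + 3 + 3 = 8
```

Note that these are dissimilarity measures: smaller values mean the document is closer to the query.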

Other Measures

We may use a priori/context knowledge.

For example: Sim(q,dj)= [content identifier similarity]+

[objective term similarity]+

[citation similarity]

PP32

Comparison

PP34

Simple matching: |A∩B|
Dice's coefficient: D = |A∩B| / ((|A| + |B|)/2) = 2|A∩B| / (|A| + |B|)
Cosine coefficient: C = |A∩B| / √(|A| · |B|)
Overlap coefficient: O = |A∩B| / min(|A|, |B|)
Jaccard's coefficient: J = |A∩B| / (|A| + |B| - |A∩B|)

Comparing the denominators:

|A| + |B| - |A∩B| ≥ (|A| + |B|)/2   (since |A| ≥ |A∩B| and |B| ≥ |A∩B|)
(|A| + |B|)/2 ≥ √(|A| · |B|)
√(|A| · |B|) ≥ min(|A|, |B|)

Therefore O ≥ C ≥ D ≥ J.

PP34
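A small sketch checking this ordering numerically on made-up term sets (all names and sets here are illustrative):

```python
from math import sqrt

def all_measures(a, b):
    """Set-based O, C, D, J for two term sets; expect O >= C >= D >= J."""
    inter = len(a & b)
    o = inter / min(len(a), len(b))
    c = inter / sqrt(len(a) * len(b))
    d = 2 * inter / (len(a) + len(b))
    j = inter / len(a | b)
    return o, c, d, j

a = {"semantic", "similarity", "measures"}
b = {"similarity", "measures", "documents", "retrieval", "information"}
print(all_measures(a, b))  # ≈ (0.67, 0.52, 0.50, 0.33) -- O >= C >= D >= J
```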

Example - Documents, Terms, Query - Cont'

D1: A search Engine for 3D Models
D2: Design and Implementation of a string database query language
D3: Ranking of documents by measures considering conceptual dependence between terms
D4: Exploiting hierarchical domain structure to compute similarity
D5: An approach for measuring semantic similarity between words using multiple information sources
D6: Determining semantic similarity among entity classes from different ontologies
D7: Strong similarity measures for ordered sets of documents in information retrieval

T1: search(ing), T2: Engine(s), T3: Models, T4: database, T5: query, T6: language, T7: documents, T8: measur(es, ing), T9: conceptual, T10: dependence, T11: domain, T12: structure, T13: similarity, T14: semantic, T15: ontologies, T16: information, T17: retrieval

Query: Semantic similarity measures used by search engines and other information searching mechanisms

PP33

Example - Term-Document Matrix - Cont'

     T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17
D1    1  1  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
D2    0  0  0  1  1  1  0  0  0  0   0   0   0   0   0   0   0
D3    0  0  0  0  0  0  1  1  1  1   0   0   0   0   0   0   0
D4    0  0  0  0  0  0  0  0  0  0   1   1   1   0   0   0   0
D5    0  0  0  0  0  0  0  1  0  0   0   0   1   1   0   1   0
D6    0  0  0  0  0  0  0  0  0  0   0   0   1   1   1   0   0
D7    0  0  0  0  0  0  1  1  0  0   0   0   1   0   0   1   1
Q     2  1  0  0  0  0  0  1  0  0   0   0   1   1   0   1   0

PP34

Dice coefficient (α = 1/2):

D(D1, q) = 2 Σ_{k=1..n} w_qk · w_1k / (Σ_{k=1..n} w_qk² + Σ_{k=1..n} w_1k²)

         = 2 (2·1 + 1·1 + 0·1 + 0·0 + … + 0·0) / ((2·2 + 1·1 + … + 0·0) + (1·1 + 1·1 + 1·1 + … + 0·0))

         = 2 · 3 / (9 + 3) = 6 / 12 = 0.5

PP30, PP34
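The same value can be reproduced from the matrix above with a short sketch (the vectors below are the D1 and Q rows as reconstructed; the function name dice is illustrative):

```python
def dice(q, d, alpha=0.5):
    """Generalized Dice coefficient between two term-weight vectors."""
    num = sum(wq * wd for wq, wd in zip(q, d))
    return num / (alpha * sum(w * w for w in q) + (1 - alpha) * sum(w * w for w in d))

# Rows of the term-document matrix over T1..T17.
D1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Q  = [2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0]
print(dice(Q, D1))  # 3 / (0.5*9 + 0.5*3) = 0.5
```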

Final Results

PP34

O≥C≥D≥J

Current Applications

Multi-Dimensional Modeling, Hierarchical Clustering, Bioinformatics

PP35-38

Discussion