UNIVERSITY OF SCIENCE AND TECHNOLOGY OF CHINA
SCHOOL OF SOFTWARE ENGINEERING OF USTC

Distances
Measuring similarities

Devert Alexandre
School of Software Engineering of USTC

December 7, 2012


Table of Contents

1 Introduction

2 Strings

3 Data semantic

4 Perceptive models

Distances

Many data-mining algorithms, like k-means, rely on a distance measure.

The distance measure expresses how two items of the dataset relate to each other.

So far, we considered points in R^n and the Euclidean distance.

But what does a distance measure look like for

• Text documents?

• Sounds?

• Shapes?

Is the Euclidean distance good enough for all cases?

Binary strings

Let’s consider binary strings of fixed length

001010011101010100111100011111

Hamming distance

A convenient distance measure for strings is the Hamming distance: the number of positions at which the digits differ.

s1 = 001010011101010100111100011111
s2 = 001010011101010000111100011111
d(s1, s2) = 1

s1 = 001010011101010100111100011111
s2 = 001010001101000100110100011111
d(s1, s2) = 3

A quick way to compute it when the bit strings are stored as integers

• Take the exclusive or (XOR) of the 2 strings

• Count the ones in the result
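The two steps above can be sketched in Python as follows. This is a minimal illustration, assuming the bit strings fit in ordinary non-negative integers; the function name `hamming_int` is made up for this example.

```python
def hamming_int(a: int, b: int) -> int:
    """Hamming distance between two bit strings stored as non-negative integers."""
    # XOR leaves a 1 exactly at the positions where the two inputs differ.
    diff = a ^ b
    # Count the ones in the result (int.bit_count() does the same on Python 3.10+).
    return bin(diff).count("1")


# The two 30-digit strings from the first Hamming example.
s1 = int("001010011101010100111100011111", 2)
s2 = int("001010011101010000111100011111", 2)
print(hamming_int(s1, s2))  # 1
```

Leading zeros are lost when a string is parsed as an integer, which is harmless here because both strings have the same fixed length.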

Fixed size strings

The Hamming distance also works for alphabets other than {0, 1}

s1 = GATEAU
s2 = BATEAU
d(s1, s2) = 1

s1 = BIGLOTRON
s2 = BAFFOTRON
d(s1, s2) = 3
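For arbitrary alphabets the same idea is a position-by-position comparison. The sketch below is one possible way to write it; the name `hamming` and the choice to raise an error on unequal lengths are assumptions of this example.

```python
def hamming(s1: str, s2: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        # The Hamming distance is only defined for strings of the same length.
        raise ValueError("Hamming distance needs strings of equal length")
    # Each comparison is True (1) or False (0), so the sum counts mismatches.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming("GATEAU", "BATEAU"))        # 1
print(hamming("BIGLOTRON", "BAFFOTRON"))  # 3
```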

Variable length strings

But for many practical applications, we need to compare strings with different lengths

CYCLOTRON

SYNCHROTRON

SYNCHROPHASOTRON

BIGLOTRON

Levenshtein distance

The Levenshtein distance is the minimum number of edits needed to transform one string into the other. Three kinds of edits are allowed:

1 insertion of a character

2 deletion of a character

3 substitution of a character

Insertion of a character

FAT ⇒ FAST

Deletion of a character

FART ⇒ FAT

Substitution of a character

FAT ⇒ CAT

Example: from CYCLOTRON to SYNCHROTRON

CYCLOTRON
SYCLOTRON    substitution C ⇒ S
SYNCLOTRON   insertion YC ⇒ YNC
SYNCHOTRON   substitution L ⇒ H
SYNCHROTRON  insertion HO ⇒ HRO

The distance is 4: these 4 steps are the fewest that turn CYCLOTRON into SYNCHROTRON

Computing Levenshtein distance

Initial step for s1 = CAAT and s2 = CAT

We will fill a matrix step by step, with the distance between each prefix of s1 and each prefix of s2. The first row and the first column hold the distance to the empty prefix.

         C  A  A  T
      0  1  2  3  4
   C  1
   A  2
   T  3

Let's fill the matrix

Prefixes C and C are the same, distance is 0. Prefix C of s2 and prefixes CA, CAA, CAAT of s1 have distance 1, 2, 3. Prefix C of s1 and prefixes CA, CAT of s2 have distance 1, 2.

         C  A  A  T
      0  1  2  3  4
   C  1  0  1  2  3
   A  2  1  0  1  2
   T  3  2  1  1  1

The bottom-right cell holds the result: distance is 1 !

CUT and CAT

         C  U  T
      0  1  2  3
   C  1  0  1  2
   A  2  1  1  2
   T  3  2  2  1

Distance is 1 !

The general idea

          x1  x2  ...  xn
       0   1   2  ...   n
   y1  1
   y2  2
   ...          d_{i,j}
   ym  m

d_{i,j} is the distance between the prefixes {x1, x2, ..., xi} and {y1, y2, ..., yj}

If x_i = y_j

d_{i,j} = d_{i-1,j-1}

If x_i ≠ y_j

d_{i,j} = 1 + min(d_{i,j-1}, d_{i-1,j}, d_{i-1,j-1})

CYCLOTRON and SYNCHROTRON

         C   Y   C   L   O   T   R   O   N
     0   1   2   3   4   5   6   7   8   9
 S   1   1   2   3   4   5   6   7   8   9
 Y   2   2   1   2   3   4   5   6   7   8
 N   3   3   2   2   3   4   5   6   7   7
 C   4   3   3   2   3   4   5   6   7   8
 H   5   4   4   3   3   4   5   6   7   8
 R   6   5   5   4   4   4   5   5   6   7
 O   7   6   6   5   5   4   5   6   5   6
 T   8   7   7   6   6   5   4   5   6   6
 R   9   8   8   7   7   6   5   4   5   6
 O  10   9   9   8   8   7   6   5   4   5
 N  11  10  10   9   9   8   7   6   5   4

The bottom-right cell holds the result: distance is 4

Recursive & lazy style

    def lev(a, b):
        # An empty string is turned into the other one by insertions only
        if not a:
            return len(b)
        if not b:
            return len(a)
        return min(lev(a[1:], b[1:]) + (a[0] != b[0]),  # substitution
                   lev(a[1:], b) + 1,                   # deletion
                   lev(a, b[1:]) + 1)                   # insertion

Most programming languages do not deal well with such code: the same pairs of suffixes are recomputed many times unless the results are cached
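One common way to keep the recursive formulation usable is to cache intermediate results. The sketch below is an illustration under that assumption, not the method of these slides: it recurses on positions instead of string slices and relies on `functools.lru_cache` from the Python standard library. The names `lev_memo` and `go` are made up for this example.

```python
from functools import lru_cache


def lev_memo(a: str, b: str) -> int:
    """Levenshtein distance, recursive with cached sub-results."""

    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> int:
        # Distance between the suffixes a[i:] and b[j:].
        if i == len(a):
            return len(b) - j          # only insertions left
        if j == len(b):
            return len(a) - i          # only deletions left
        return min(go(i + 1, j + 1) + (a[i] != b[j]),  # substitution
                   go(i + 1, j) + 1,                   # deletion
                   go(i, j + 1) + 1)                   # insertion

    return go(0, 0)


print(lev_memo("CYCLOTRON", "SYNCHROTRON"))  # 4
```

Each pair (i, j) is now computed once, so the work is proportional to len(a) × len(b). Very long strings can still hit Python's recursion limit.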

Imperative & eager style

    def lev(s, t):
        # Pad both strings so that index 0 stands for the empty prefix
        s, t, d = ' ' + s, ' ' + t, {}
        for i in range(len(s)):
            d[i, 0] = i
        for j in range(len(t)):
            d[0, j] = j
        for j in range(1, len(t)):
            for i in range(1, len(s)):
                if s[i] == t[j]:
                    d[i, j] = d[i-1, j-1]
                else:
                    d[i, j] = min(d[i-1, j], d[i, j-1], d[i-1, j-1]) + 1
        return d[len(s)-1, len(t)-1]

This can be improved: there is no need to store the complete matrix
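The remark about storage can be illustrated as follows: each row of the matrix depends only on the row just above it, so two rows are enough. This is a minimal sketch of that idea; the name `lev_two_rows` is made up for this example.

```python
def lev_two_rows(s: str, t: str) -> int:
    """Levenshtein distance keeping only the previous and current rows."""
    # Row for the empty prefix of s: distance to each prefix of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            if cs == ct:
                curr.append(prev[j - 1])
            else:
                curr.append(1 + min(prev[j],       # deletion
                                    curr[j - 1],   # insertion
                                    prev[j - 1]))  # substitution
        prev = curr
    return prev[-1]


print(lev_two_rows("CYCLOTRON", "SYNCHROTRON"))  # 4
```

Memory use drops from len(s) × len(t) cells to about 2 × len(t) cells, at the cost of no longer being able to read the edit sequence back from the matrix.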

Damerau–Levenshtein distance

The Damerau–Levenshtein distance is like the Levenshtein distance, with one more edit operation

1 insertion of a character

2 deletion of a character

3 substitution of a character

4 transposition of 2 adjacent characters

Computing Damerau–Levenshtein distance

It works almost like the Levenshtein distance, with one extra row and one extra column so that d_{i-2,j-2} can always be read

              x1  x2  ...  xn
       0   0   1   2  ...   n
       0   0   1   2  ...   n
   y1  1   1
   y2  2   2
   ...              d_{i,j}
   ym  m   m

d_{i,j} is the distance between the prefixes {x1, x2, ..., xi} and {y1, y2, ..., yj}

If x_i = y_j

d_{i,j} = d_{i-1,j-1}

If x_i ≠ y_j

d_{i,j} = 1 + min(d_{i,j-1}, d_{i-1,j}, d_{i-1,j-1})

If, in addition, x_i = y_{j-1} and x_{i-1} = y_j (the last 2 characters are swapped)

d_{i,j} = 1 + min(d_{i-2,j-2}, d_{i,j-1}, d_{i-1,j}, d_{i-1,j-1})

The d_{i-2,j-2} term may only be used when the swap condition holds. This restricted form of the recurrence is usually called the optimal string alignment distance.
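The recurrence can be sketched in Python as below. This sketch assumes the restricted variant (often called optimal string alignment), in which the transposition term is only considered when the last two characters of both prefixes are swapped; it checks indices explicitly instead of padding the matrix. The name `osa_distance` is made up for this example.

```python
def osa_distance(x: str, y: str) -> int:
    """Levenshtein distance extended with adjacent transpositions (restricted variant)."""
    n, m = len(x), len(y)
    # d[i][j] is the distance between the prefixes x[:i] and y[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            # Transposition: the last two characters are the same pair, swapped.
            if i > 1 and j > 1 and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


print(osa_distance("THOMAS", "TOHMAS"))  # 1
```

Without the swap condition the extra term would undercount: for AB and CD it would give 1, although 2 substitutions are needed.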

Robustness

The Damerau–Levenshtein distance tells the following strings apart from THOMAS

THOMAS ⇒ TOHMAS

THOMAS ⇒ THOMASS

THOMAS ⇒ TOMAS

Each of them is at distance 1 from the string THOMAS

Those differences are likely due to typing errors !

THOMAS

TOHMAS

THOMASS

TOMAS

The Damerau–Levenshtein distance might make our algorithms sensitive to noise, if used for typed text

Jaro distance

A distance specialized for name records: the Jaro distance

d(s_1, s_2) = \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m} \right)

• d(s1, s2) = 1 means exact match

• d(s1, s2) = 0 means no similarity

• m is the number of matching characters

• t is half the number of transpositions

2 characters from s1 and s2 are matching if they are equal and not farther apart than

\left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1

THOMAS

TOHMAS

6 matching characters, m = 6

2 matching characters are transposed if they do not come in the same order in s1 and s2

t is equal to half the number of transposed characters

THOMAS

TOHMAS

Transposed characters H/O and O/H, t = \frac{2}{2} = 1

Jaro distance for THOMAS and TOHMAS is

\frac{1}{3} \left( \frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6} \right) = 0.944


Jaro distance

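Compute which characters are matching and which are not matching

    D  I  X  O  N
D   1  0  0  0  0
I   0  1  0  0  0
C   0  0  0  0  0
K   0  0  0  0  0
S   0  0  0  0  0
O   0  0  0  1  0
N   0  0  0  0  1
X   0  0  0  0  0

• $|s_1| = 5$, $|s_2| = 8$

• match window is 3

• $m = 4$, $t = 0$

• distance is $\frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) = \frac{1}{3}\left(\frac{4}{5} + \frac{4}{8} + \frac{4}{4}\right) = 0.767$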
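The matching and transposition rules of the previous slides can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `jaro` and the greedy left-to-right matching strategy are assumptions, and the match window uses integer division, which gives the same result as the formula above because positions are integers.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro score of two strings (hypothetical helper for illustration)."""
    if not s1 and not s2:
        return 1.0
    # Two equal characters match only if their positions differ by at most `window`.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used = [False] * len(s2)   # characters of s2 that are already matched
    matched1 = []              # matched characters, in s1 order
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                matched1.append(c)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    # Matched characters, in s2 order.
    matched2 = [c for c, u in zip(s2, used) if u]
    # t is half the number of matched characters that are out of order.
    t = sum(a != b for a, b in zip(matched1, matched2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


print(round(jaro("THOMAS", "TOHMAS"), 3))  # 0.944
```

For THOMAS and TOHMAS the sketch finds m = 6 and t = 1, as in the worked example.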


Jaro distance

Jaro distance is an example of how to deal with noise in data ⇒ a distance that treats as nearly identical two strings that are likely to be the same thing, but with some typing errors. The score is 1 for identical strings and 0 when no characters match.


Distance between texts

A simple and popular way to measure distance between large texts for data-mining ⇒ bag of words

• Find a large list of common words $L$ likely to be in texts $A$ and $B$.

• Build vectors $X(A)$ and $X(B)$, where $X(A)_i$ is the number of occurrences of the word $L_i$ in $A$.

• Distance is the dot product $X(A) \cdot X(B)$

Used for spam-filtering for instance
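The three steps above can be sketched as follows. The word list `L`, the sample texts and the helper names are hypothetical, chosen only to make the example run; splitting on whitespace is a simplifying assumption, real text needs proper tokenization.

```python
from collections import Counter


def bag_of_words(text: str, vocabulary: list[str]) -> list[int]:
    """Count how many times each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def dot(x: list[int], y: list[int]) -> int:
    """Dot product of two vectors of the same length."""
    return sum(a * b for a, b in zip(x, y))


# Hypothetical word list L and texts A, B, C.
L = ["free", "money", "meeting", "report"]
A = "free money free prizes"
B = "claim your free money now"
C = "meeting about the quarterly report"

x_a, x_b, x_c = (bag_of_words(t, L) for t in (A, B, C))
print(x_a, x_b, x_c)   # [2, 1, 0, 0] [1, 1, 0, 0] [0, 0, 1, 1]
print(dot(x_a, x_b))   # 3
print(dot(x_a, x_c))   # 0
```

Note that the dot product grows when the two texts share more words, so in this sketch a larger value means more similar texts.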


Table of Contents

1 Introduction

2 Strings

3 Data semantic

4 Perceptive models


UFO sightings data

Let’s consider geographical data : UFO sightings


UFO sightings data

The raw data looks like this

description   ufo
location      Austin, Texas, USA
sight date    20020804
shape         circle

InfoChimp UFO dataset ⇒ 60000 entries !


UFO sightings data

Using a string distance for the date string would be silly

d(20020804, 20050804) = 1

d(20020804, 20020704) = 1

d(20020804, 20020814) = 1

One year, one month or ten days are not the same thing !
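A quick check of the three values above, assuming the string distance is the Hamming distance (the number of positions where two strings of equal length differ). The function name is hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


reference = "20020804"
for other in ("20050804", "20020704", "20020814"):
    # Each date differs from the reference by exactly one character.
    print(other, hamming(reference, other))   # prints 1 three times
```

A three-year gap, a one-month gap and a ten-day gap all get the same distance, which is why the raw date string is a poor representation.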


UFO sightings data

Using number difference for the date string would be silly

d(20020804, 20050321) = 29517

Dates are not a single base 10 number !


UFO sightings data

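We need to convert dates into single numbers

description   ufo
location      Austin, Texas, USA
sight date    1028419200
shape         circle

We can convert them to Unix time, the number of seconds since 1970-01-01 00:00:00 UTC (beware, many UFO sightings occurred before 1970, so their timestamps are negative)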
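One way to do this conversion with the Python standard library is sketched below. The function name `to_unix_time` is hypothetical, and the sketch assumes every date is in `YYYYMMDD` form and taken at midnight UTC; the 1947 date is only an example of a pre-1970 value.

```python
from datetime import datetime, timezone


def to_unix_time(date: str) -> int:
    """Seconds between 1970-01-01 00:00:00 UTC and midnight UTC of a YYYYMMDD date."""
    day = datetime.strptime(date, "%Y%m%d").replace(tzinfo=timezone.utc)
    return int(day.timestamp())


print(to_unix_time("20020804"))   # 1028419200
print(to_unix_time("19470624"))   # a negative number: the date is before 1970

# Differences between timestamps are now meaningful durations.
seconds = to_unix_time("20050321") - to_unix_time("20020804")
print(seconds // 86400)           # 960 (days)
```

Once dates are plain numbers of seconds, subtracting two of them gives a real duration, which is what a distance on dates should measure.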


UFO sightings data

Using string distance for location names would be silly

d(SUZHOU, FUZHOU) = 1

d(HEFEI, HEBEI) = 1


UFO sightings data

We need to convert locations into positions

description          ufo
location latitude    30.25
location longitude   97.75
sight date           1028419200
shape                circle

Geolocation services can convert a place name into geographic coordinates


UFO sightings data

We will consider just the location and the time

location latitude    30.25
location longitude   97.75
sight date           1028419200

Can we use Euclidean distance now, to cluster those data?


Spherical coordinates

The locations are spherical coordinates

[Figure: a point A on a sphere, with spherical coordinates (r, φ, λ)]


Spherical coordinates

Spherical coordinates are angles: they do not work like the usual Cartesian coordinates


Spherical coordinates

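The distance between 2 points with spherical coordinates $(\phi_a, \lambda_a)$ and $(\phi_b, \lambda_b)$, with latitudes $\phi$ and longitudes $\lambda$ on a sphere of radius $r$, is not the usual Euclidean distance

$$r \arccos\left(\sin\phi_a \sin\phi_b + \cos\phi_a \cos\phi_b \cos(\lambda_b - \lambda_a)\right)$$

Use the Vincenty formula to actually compute this: the $\arccos$ form loses precision when the two points are close to each other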
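Both forms can be sketched in Python as below. The function names are hypothetical, the inputs are assumed to be in degrees, and the default radius of 6371 km (a common mean Earth radius) is an assumption, not a value from the slides. The second function is the special case of the Vincenty formula for a sphere, written with `atan2`.

```python
import math


def great_circle_arccos(lat_a, lon_a, lat_b, lon_b, r=6371.0):
    """Great-circle distance with the arccos form (degrees in, same unit as r out)."""
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_lambda = math.radians(lon_b - lon_a)
    c = (math.sin(phi_a) * math.sin(phi_b)
         + math.cos(phi_a) * math.cos(phi_b) * math.cos(d_lambda))
    # Clamp to [-1, 1]: rounding errors can push c slightly outside the domain of acos.
    return r * math.acos(max(-1.0, min(1.0, c)))


def great_circle_vincenty(lat_a, lon_a, lat_b, lon_b, r=6371.0):
    """Same distance with the Vincenty formula for a sphere, stable for all distances."""
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_lambda = math.radians(lon_b - lon_a)
    y = math.hypot(
        math.cos(phi_b) * math.sin(d_lambda),
        math.cos(phi_a) * math.sin(phi_b)
        - math.sin(phi_a) * math.cos(phi_b) * math.cos(d_lambda),
    )
    x = (math.sin(phi_a) * math.sin(phi_b)
         + math.cos(phi_a) * math.cos(phi_b) * math.cos(d_lambda))
    return r * math.atan2(y, x)


# From the north pole to a point on the equator: a quarter of a great circle.
print(great_circle_arccos(90.0, 0.0, 0.0, 0.0))
print(great_circle_vincenty(90.0, 0.0, 0.0, 0.0))
```

Both calls should print a value close to r·π/2; the two functions differ mainly for points that are very close together or nearly opposite on the sphere.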


Table of Contents

1 Introduction

2 Strings

3 Data semantic

4 Perceptive models


Picture segmentation

Using the k-means algorithm, one can segment a picture into similar areas

Colors are [r, g, b] triplets, so we can use Euclidean distance, right?


Color perception

Electronic sensors record color information as 3 signals

[Figure: three panels, red R, green G, blue B]


Color perception

Other color spaces, like YUV, separate luminance and chrominance

[Figure: three panels, luminance Y', chrominance U, chrominance V]


Color interpolation

Let's interpolate 2 colors $C_a$ and $C_b$, using different color spaces

$$C_\alpha = C_a + \alpha (C_b - C_a), \quad \alpha \in [0, 1]$$
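The interpolation formula applies component by component, whatever the color space. A minimal sketch, assuming colors are tuples of 3 floats; the function name and the sample colors are hypothetical.

```python
def interpolate(c_a, c_b, alpha):
    """Linear interpolation C_a + alpha * (C_b - C_a), applied to each component."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return tuple(a + alpha * (b - a) for a, b in zip(c_a, c_b))


# Hypothetical example: from pure red to pure blue, as RGB triplets in [0, 1].
red, blue = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(alpha, interpolate(red, blue, alpha))
```

With alpha = 0 the result is the first color, with alpha = 1 the second one, and alpha = 0.5 gives the midpoint (0.5, 0.0, 0.5).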


Color interpolation

Interpolating colors in different color spaces gives different results

Some color spaces introduce new colors when interpolating from one to another!


Color perception

When you choose to represent colors in a given color space, you choose

• which colors are alike and which colors are very different

• a color "neighbourhood"


YUV color space

The YUV color space has interesting properties

• color interpolation in YUV looks perceptually more correct

• the human eye is much more sensitive to luminance Y' than to chrominance UV


RGB and YUV color spaces

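The transformation from RGB to YUV is linear

$$\begin{pmatrix} Y' \\ U \\ V \end{pmatrix} = \begin{pmatrix} 0.299 & 0.587 & 0.114 \\ -0.14713 & -0.28886 & 0.436 \\ 0.615 & -0.51499 & -0.10001 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix}$$

$R$, $G$ and $B$ are in the $[0, 1]$ range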
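The matrix product can be written directly in Python, which also makes it easy to compare Euclidean distances in both spaces. This is only a sketch: the names `RGB_TO_YUV`, `rgb_to_yuv` and the sample colors are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

# Rows of the RGB to Y'UV matrix given above.
RGB_TO_YUV = (
    (0.299, 0.587, 0.114),
    (-0.14713, -0.28886, 0.436),
    (0.615, -0.51499, -0.10001),
)


def rgb_to_yuv(rgb):
    """Convert an (R, G, B) triplet in [0, 1] to (Y', U, V) by a matrix product."""
    return tuple(sum(w * c for w, c in zip(row, rgb)) for row in RGB_TO_YUV)


black = (0.0, 0.0, 0.0)
green = (0.0, 1.0, 0.0)
blue = (0.0, 0.0, 1.0)

# In RGB, green and blue are equally far from black.
print(math.dist(green, black), math.dist(blue, black))

# In YUV they are not: green carries much more luminance than blue.
print(math.dist(rgb_to_yuv(green), rgb_to_yuv(black)),
      math.dist(rgb_to_yuv(blue), rgb_to_yuv(black)))
```

The same pair of colors gets a different Euclidean distance in each space, so an algorithm such as k-means may group pixels differently depending on the representation.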


RGB and YUV color spaces

RGB and YUV color spaces represent the same thing: colors. But

• RGB is the signal as it comes out of the sensors

• YUV takes into account human perception of color

Some color spaces like LAB are even better models for human color perception


Human perception

There are perceptive models for

• colors

• shape

• depth

• sounds

• speech

• motion

• . . .


Human perception

Without a perceptive model, a technically correct data-mining approach can return completely meaningless results

In speech recognition, a perceptive model is essential just to obtain a practically usable system
