Distance Functions on Hierarchies
Eftychia Baikousi
Outline
Definition of metric & similarity
Various distance functions: Minkowski, set based, edit distance
Basic concepts of OLAP
Lattice
Distance in the same level of hierarchy
Distance in different levels of hierarchy
Definition of metric
A distance function on a given set M is a function $d: M \times M \to \mathbb{R}$ that satisfies the following conditions:
$d(x,y) \geq 0$, and $d(x,y) = 0$ iff $x = y$. The distance is positive between two different points and is zero precisely from a point to itself.
It is symmetric: $d(x,y) = d(y,x)$. The distance between x and y is the same in either direction.
It satisfies the triangle inequality: $d(x,z) \leq d(x,y) + d(y,z)$. The distance between two points is the shortest distance along any path.
A function d that satisfies all of these conditions is a metric.
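As an illustration of the conditions above, the following minimal sketch checks them by brute force on a small finite set. The helper name `is_metric` and the sample points are assumptions made for this example only.

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Brute-force check of the metric conditions on a finite set of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:                     # non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):       # d(x,y) = 0 iff x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # triangle inequality
            return False
    return True

points = [0, 1, 3, 7]                      # hypothetical sample values

def abs_diff(x, y):
    return abs(x - y)

def squared_diff(x, y):
    return (x - y) ** 2

print(is_metric(points, abs_diff))      # absolute difference passes all checks
print(is_metric(points, squared_diff))  # fails: d(0,3)=9 > d(0,1)+d(1,3)=5
```

The squared difference is symmetric and non-negative, yet it breaks the triangle inequality, so it is a distance-like score rather than a metric.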
Definition of similarity metric
Let s(x,y) be the similarity between two points x and y. Then the following properties hold:
$s(x,y) = 1$ only if $x = y$ (with $0 \leq s \leq 1$)
$s(x,y) = s(y,x)$ for all x and y (symmetry)
The triangle inequality does not necessarily hold
Minkowski Family
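norm-1, City-Block, Manhattan: $L_1(x,y) = \sum_i |x_i - y_i|$
norm-2, Euclidean: $L_2(x,y) = \left(\sum_i |x_i - y_i|^2\right)^{1/2}$
norm-p, Minkowski: $L_p(x,y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$
infinity norm: $L_\infty(x,y) = \lim_{p \to \infty} \left(\sum_i |x_i - y_i|^p\right)^{1/p} = \max_i |x_i - y_i|$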
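The Minkowski family can be sketched in a few lines. The function name `minkowski` and the sample vectors are assumptions for illustration; `p = float("inf")` is used here to select the infinity norm.

```python
def minkowski(x, y, p):
    """L_p distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)                  # infinity norm: largest coordinate gap
    return sum(d ** p for d in diffs) ** (1.0 / p)

x, y = (0, 0), (3, 4)                      # hypothetical sample vectors
print(minkowski(x, y, 1))                  # Manhattan: 3 + 4
print(minkowski(x, y, 2))                  # Euclidean: sqrt(9 + 16)
print(minkowski(x, y, float("inf")))       # infinity norm: max(3, 4)
```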
Set Based
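Simple matching coefficient: $SMC = \frac{\#\,\text{matching attribute values}}{\#\,\text{attributes}}$
Jaccard coefficient (distance form): $J(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
Extended Jaccard, Tanimoto (vector based): $T(A,B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}$
Cosine (vector based): $\cos(x,y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
Dice's coefficient: $s = \frac{2\,|X \cap Y|}{|X| + |Y|}$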
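The set-based and vector-based coefficients listed on this slide can be sketched as follows. All function names and sample inputs are assumptions for illustration. `jaccard` returns the similarity $|A \cap B| / |A \cup B|$, and subtracting it from 1 gives the distance form.

```python
import math

def smc(u, v):
    """Simple matching coefficient: share of attribute positions that agree."""
    return sum(1 for a, b in zip(u, v) if a == b) / len(u)

def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def dice(x, y):
    """Dice's coefficient of two non-empty sets."""
    return 2 * len(x & y) / (len(x) + len(y))

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def tanimoto(a, b):
    """Extended Jaccard (Tanimoto) coefficient of two vectors."""
    return dot(a, b) / (dot(a, a) + dot(b, b) - dot(a, b))

A, B = {"a", "b", "c"}, {"b", "c", "d"}    # hypothetical sets
u, v = (1, 0, 1), (1, 1, 0)                # hypothetical vectors
print(jaccard(A, B), 1 - jaccard(A, B))    # similarity and distance form
print(dice(A, B), cosine(u, v), tanimoto(u, v))
print(smc([1, 0, 1, 1], [1, 1, 1, 0]))
```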
Edit Distance: Levenshtein distance
The edit distance between two strings $x = x_1 \dots x_n$ and $y = y_1 \dots y_m$ is defined as the minimum number of atomic edit operations needed to transform x into y
Insert: $ins(x,i,c) = x_1 x_2 \dots x_i \, c \, x_{i+1} \dots x_n$
Delete: $del(x,i) = x_1 x_2 \dots x_{i-1} x_{i+1} \dots x_n$
Replace: $rep(x,i,c) = x_1 x_2 \dots x_{i-1} \, c \, x_{i+1} \dots x_n$
Assign a cost to every edit operation: $c(o) = 1$
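A minimal sketch of the unit-cost edit distance, using the usual dynamic-programming table (the slide does not prescribe an algorithm, so this choice is an assumption). The function name `levenshtein` is hypothetical.

```python
def levenshtein(x, y):
    """Minimum number of unit-cost insert/delete/replace operations turning x into y."""
    prev = list(range(len(y) + 1))         # distance from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        cur = [i]                          # deleting the first i characters of x
        for j, cy in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # replace cx by cy (free if equal)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("MARTHA", "MARHTA"))     # T->H and H->T: two replacements
print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, which is enough when just the distance (not the edit script) is needed.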
Edit distances
Needleman-Wunsch distance or Sellers Algorithm
Insert
a character: $ins(x,i,c) = x_1 x_2 \dots x_i \, c \, x_{i+1} \dots x_n$ with cost(o) = 1
a gap: $ins_g(x,i,g) = x_1 x_2 \dots x_i \, g \, x_{i+1} \dots x_n$ with cost(o) = g
Delete
a character: $del(x,i) = x_1 x_2 \dots x_{i-1} x_{i+1} \dots x_n$ with cost(o) = 1
a gap: $del_g(x,i) = x_1 x_2 \dots x_{i-1} x_{i+1} \dots x_n$ with cost(o) = g
Replace
a character: $rep(x,i,c) = x_1 x_2 \dots x_{i-1} \, c \, x_{i+1} \dots x_n$ with cost(o) = 1
Edit distances
Jaro distance
Let s and t be two strings, and let
s' = the characters in s that are common with t
t' = the characters in t that are common with s
$T_{s',t'}$ = the number of transpositions of characters in s' relative to t'
$\mathrm{Jaro}(s,t) = \frac{1}{3}\left(\frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{2\,|s'|}\right)$
Edit distances
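Jaro distance: Example
Let s = MARTHA and t = MARHTA
|s'| = 6, |t'| = 6
$T_{s',t'}$ = 2/2 = 1, since the mismatched characters are T/H and H/T
$\mathrm{Jaro}(s,t) = \frac{1}{3}\left(\frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{2\,|s'|}\right) = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{12}\right) = 0.8055$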
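The example can be reproduced with a short sketch that follows the formula as written on these slides, with $2|s'|$ in the last denominator. The commonly cited form of Jaro divides that term by $|s'|$ instead and gives about 0.944 for the same pair. The matching window of max(|s|, |t|)/2 - 1 positions is the usual convention and is an assumption here, since the slide does not define "common"; the names `common_chars` and `jaro_slide` are hypothetical.

```python
def common_chars(s, t, window):
    """Characters of s that also occur in t within `window` positions, in s order."""
    used = [False] * len(t)
    common = []
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and t[j] == ch:
                used[j] = True             # each character of t matches at most once
                common.append(ch)
                break
    return common

def jaro_slide(s, t):
    """Jaro score in the form given on the slide (last term divided by 2|s'|)."""
    window = max(max(len(s), len(t)) // 2 - 1, 0)   # assumed matching window
    s_c = common_chars(s, t, window)
    t_c = common_chars(t, s, window)
    if not s_c or not t_c:
        return 0.0
    mismatched = sum(1 for a, b in zip(s_c, t_c) if a != b)
    transpositions = mismatched / 2        # T/H and H/T count as one transposition
    return (len(s_c) / len(s)
            + len(t_c) / len(t)
            + (len(s_c) - transpositions) / (2 * len(s_c))) / 3

print(round(jaro_slide("MARTHA", "MARHTA"), 4))
```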
Edit distances
Jaro-Winkler
JWS(s,t) = Jaro(s,t) + (prefixLength * PREFIXSCALE * (1.0 - Jaro(s,t)))
Where:
prefixLength: the length of the common prefix at the start of the two strings
PREFIXSCALE: a constant scaling factor that gives more favourable ratings to strings that match from the beginning, for a set prefix length
Edit distances
Jaro-Winkler: Example
Let s = MARTHA, t = MARHTA and PREFIXSCALE = 0.1
Jaro(s,t) = 0.8055, prefixLength = 3
JWS(s,t) = Jaro(s,t) + (prefixLength * PREFIXSCALE * (1.0 - Jaro(s,t)))
= 0.8055 + (3 * 0.1 * (1 - 0.8055)) = 0.86385
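The arithmetic of this example can be checked with a small sketch. It takes an already computed Jaro score as input; the function names are hypothetical, and the cap of 4 characters on the prefix is Winkler's usual convention, assumed here as the "set prefix length".

```python
def common_prefix_length(s, t, max_prefix=4):
    """Length of the shared prefix of s and t, capped at max_prefix (assumed cap)."""
    n = 0
    for a, b in zip(s, t):
        if a != b or n == max_prefix:
            break
        n += 1
    return n

def jaro_winkler(jaro_score, s, t, prefix_scale=0.1):
    """JWS(s,t) = Jaro + prefixLength * PREFIXSCALE * (1 - Jaro)."""
    prefix_length = common_prefix_length(s, t)
    return jaro_score + prefix_length * prefix_scale * (1.0 - jaro_score)

print(common_prefix_length("MARTHA", "MARHTA"))      # shared prefix "MAR"
print(jaro_winkler(0.8055, "MARTHA", "MARHTA"))      # about 0.86385
```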
Basic OLAP Concepts
OLAP concerns the analysis of measurable quantities (measures): sales, inventory, profit, ...
Dimensions: parameters that define the context of the measures: date, product, location, salesperson, ...
Cubes: combinations of dimensions that determine some measures. A cube defines a multidimensional space of dimensions, with the measures being points in that space
Cubes for OLAP
[Figure: a data cube with axes REGION (N, S, W), PRODUCT (Juice, Cola, Soap) and MONTH (Jan, ...); the visible cells hold the values 10 and 13]
Basic OLAP Concepts
The data are regarded as stored in a multi-dimensional array, which is also called a cube or hypercube.
The cube is a group of data cells. Each cell is uniquely identified by the corresponding values of the dimensions of the cube.
The contents of a cell are called measures and represent the assessed values of the real world.
Level hierarchies for OLAP
A dimension models all the ways in which the data can be aggregated with respect to a specific parameter of their context. Date, Product, Location, Salesperson, ...
Each dimension has an associated hierarchy of levels of data aggregation. This means that the dimension can be viewed at several levels of granularity. Date: day, week, month, year, ...
Level Hierarchies
Level hierarchies: each dimension is organized into different levels of granularity
The user can navigate from one level to another, creating new cubes each time
Coarseness (the proper Greek term is αδρομέρεια): the opposite of detail
[Figure: the Date hierarchy, with Day at the bottom, Month and Week above it, and Year at the top]
Cubes & dimension hierarchies for OLAP
Dimensions: Product, Region, Date
[Figure: a cube with axes Month, Region and Product; the measure is Sales volume]
Dimension hierarchies:
Product: Industry, Category, Product
Region: Country, Region, City, Store
Date: Year, Quarter, Month and Week, Day
Lattice
A lattice is a partially ordered set (poset) in which every pair of elements has a unique supremum and a unique infimum
The hierarchy of levels is formally defined as a lattice (L, <) such that L = (L1, ..., Ln, ALL) is a finite set of levels and < is a partial order defined among the levels of L such that L1 < Li < ALL for every 1 < i ≤ n.
The upper bound is always the level ALL, so that we can group all values into the single value 'all'.
The lower bound of the lattice is the most detailed level of the dimension.
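A small sketch of such a lattice of levels, assuming a Date dimension with the levels Day, Month, Week, Year and ALL. The `PARENTS` table and the helper names are hypothetical; the supremum of two levels is computed as their least common upper bound.

```python
# Each level maps to the levels directly above it (assumed Date hierarchy).
PARENTS = {
    "Day": ["Month", "Week"],
    "Month": ["Year"],
    "Week": ["Year"],
    "Year": ["ALL"],
    "ALL": [],
}

def upper_set(level):
    """All levels that are equal to or above `level` in the partial order."""
    seen, stack = {level}, [level]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def leq(a, b):
    """True if level a is equal to or below level b."""
    return b in upper_set(a)

def supremum(a, b):
    """Least upper bound of a and b, or None if it is not unique."""
    common = upper_set(a) & upper_set(b)   # never empty: ALL is above everything
    least = [c for c in common if all(leq(c, other) for other in common)]
    return least[0] if len(least) == 1 else None

print(supremum("Month", "Week"))   # the lowest level above both
print(leq("Month", "Week"))        # Month and Week are incomparable
```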
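Distances in the same level of Hierarchy
Let D be a dimension with hierarchy levels $L_1 < L_i < ALL$, and let x and y be two specific values such that $x, y \in L_i$
[Figure: the levels of the hierarchy, L1 at the bottom, L2 above it, and All at the top]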
Distances in the same level of Hierarchy
Explicit
Minkowski
Set Based
Highway
With respect to the detailed level
Attribute Based
Distances in the same level of Hierarchy
Explicit assignment: $n^2$ distances for the n values of dom($L_i$)
Minkowski family: reduces to the Manhattan distance $|x - y|$
Set based family: reduces to {0, 1}, where
$dist(x,y) = \begin{cases} 0, & \text{if } x = y \\ 1, & \text{if } x \neq y \end{cases}$
Distances in the same level of Hierarchy
Highway distance
Let the values of level $L_i$ form a set of k clusters, where each cluster has a representative $r_k$
$dist(x, y) = dist(x, r_x) + dist(r_x, r_y) + dist(y, r_y)$
Specify
$k^2$ distances: $dist(r_x, r_y)$, and
k distances: $dist(x, r_x)$
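The highway distance can be sketched with two small lookup tables: one for the distance of each value to its cluster representative, and one for the distances between representatives. All values, clusters and numbers below are made up for illustration, and returning 0 for identical values is an added assumption that the slide does not state.

```python
# value -> representative of its cluster (hypothetical clustering of one level)
REPRESENTATIVE = {"a1": "a1", "a2": "a1", "b1": "b1", "b2": "b1"}
# dist(x, r_x): distance of each value to its own representative
TO_REPRESENTATIVE = {"a1": 0.0, "a2": 2.0, "b1": 0.0, "b2": 1.5}
# dist(r_x, r_y): explicit distances between representatives
BETWEEN_REPRESENTATIVES = {
    ("a1", "a1"): 0.0, ("a1", "b1"): 10.0,
    ("b1", "a1"): 10.0, ("b1", "b1"): 0.0,
}

def highway(x, y):
    """dist(x, y) = dist(x, r_x) + dist(r_x, r_y) + dist(y, r_y)."""
    if x == y:
        return 0.0                         # assumption: identical values are at distance 0
    r_x, r_y = REPRESENTATIVE[x], REPRESENTATIVE[y]
    return (TO_REPRESENTATIVE[x]
            + BETWEEN_REPRESENTATIVES[(r_x, r_y)]
            + TO_REPRESENTATIVE[y])

print(highway("a2", "b2"))   # 2.0 + 10.0 + 1.5
print(highway("a2", "a1"))   # same cluster: only the hop to the representative
```

The appeal of this scheme is that far fewer distances must be specified by hand than with a fully explicit assignment over all pairs of values.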
Distances in the same level of Hierarchy
With respect to the detailed level
$dist(x, y) = f\left(desc^{L_x}_{L_1}(x),\; desc^{L_y}_{L_1}(y)\right)$
f is a function that picks one of the descendants
Attribute based
level L has attributes $a^L_1, a^L_2, \dots, a^L_n$; each value $v \in dom(L)$ is described by $[v_1 \dots v_n]$
Distance can be defined with respect to the attributes
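For the attribute-based case, a minimal sketch: each value of a level carries a vector of numeric attributes, and the distance between two values is a vector distance over those attributes. The product names, the attribute values and the choice of the Euclidean distance are all assumptions for illustration.

```python
import math

# value -> [v1 ... vn], here hypothetical (price, weight) attributes of a Product level
ATTRIBUTES = {
    "Juice": (2.0, 1.0),
    "Cola": (1.5, 1.5),
    "Soap": (5.0, 0.2),
}

def attribute_distance(x, y):
    """Euclidean distance between the attribute vectors of two values of the same level."""
    return math.dist(ATTRIBUTES[x], ATTRIBUTES[y])

print(attribute_distance("Juice", "Cola"))
print(attribute_distance("Juice", "Soap"))
```

In practice the attributes would usually be normalized first, so that no single attribute dominates the result.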
Distances in different levels of Hierarchy
Explicit
dist1 + dist2
dist3 + dist4
With respect to the detailed level
With respect to their least common ancestor
Highway
Attribute Based
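Distances in different levels of Hierarchy
Let D be a dimension with hierarchy levels $L_1 < L_i < ALL$, and let x and y be two specific values such that $x \in L_x$, $y \in L_y$ and $L_x < L_y$
$x_y = anc^{L_y}_{L_x}(x)$: the ancestor of x in level $L_y$
$y_x = desc^{L_y}_{L_x}(y)$: a descendant of y in level $L_x$
[Figure: x and $y_x$ lie in level $L_x$, y and $x_y$ lie in level $L_y$; dist1 links x to $x_y$, dist2 links $x_y$ to y, dist3 links y to $y_x$, dist4 links $y_x$ to x]
Explicit assignment: define $dist_{L_x,L_y}(x, y)$ for all $x \in L_x$, $y \in L_y$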
Distances in different levels of Hierarchy
dist1 + dist2
$dist = dist_1 + dist_2 = dist\left(x,\, anc^{L_y}_{L_x}(x)\right) + dist\left(anc^{L_y}_{L_x}(x),\, y\right)$
where $dist\left(anc^{L_y}_{L_x}(x),\, y\right)$ is a distance between two values from the same level of hierarchy
Special case: if y is an ancestor of x, then dist2 = 0
[Figure: x, y, $x_y$ and $y_x$ with dist1 to dist4 between levels $L_x$ and $L_y$]
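The dist1 + dist2 decomposition can be sketched on a toy City to Country hierarchy. Everything below is hypothetical: the values, the table of same-level distances between countries, and in particular the fixed cost assumed for dist1 (moving from a value to its ancestor), which the slide leaves unspecified.

```python
# x lives in level L_x (City), y lives in level L_y (Country); all data is made up.
ANCESTOR = {"Athens": "Greece", "Patras": "Greece", "Rome": "Italy"}
COUNTRY_DISTANCE = {frozenset(("Greece", "Italy")): 1.0}   # same-level distances
ANCESTOR_COST = 0.5   # assumed fixed cost of dist(x, anc(x))

def same_level_distance(a, b):
    """Distance between two values of level L_y."""
    return 0.0 if a == b else COUNTRY_DISTANCE[frozenset((a, b))]

def distance_via_ancestor(x, y):
    """dist1 + dist2 = dist(x, anc(x)) + dist(anc(x), y)."""
    x_y = ANCESTOR[x]                      # ancestor of x in the level of y
    dist1 = ANCESTOR_COST
    dist2 = same_level_distance(x_y, y)    # 0 when y is the ancestor of x
    return dist1 + dist2

print(distance_via_ancestor("Athens", "Italy"))    # 0.5 + 1.0
print(distance_via_ancestor("Athens", "Greece"))   # special case: dist2 = 0
```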
Distances in different levels of Hierarchy
dist3 + dist4
$dist = dist_3 + dist_4 = dist\left(y,\, desc^{L_y}_{L_x}(y)\right) + dist\left(desc^{L_y}_{L_x}(y),\, x\right)$
where $dist\left(desc^{L_y}_{L_x}(y),\, x\right)$ is a distance between two values from the same level of hierarchy
Special case: if y is an ancestor of x, then dist4 = 0
[Figure: x, y, $x_y$ and $y_x$ with dist1 to dist4 between levels $L_x$ and $L_y$]
Distances in different levels of Hierarchy
With respect to the detailed level
Let $x_1 = f\left(desc^{L_x}_{L_1}(x)\right)$ and $y_1 = f\left(desc^{L_y}_{L_1}(y)\right)$
$dist(x, y) = dist(x, x_1) + dist(x_1, y_1) + dist(y_1, y)$
where $dist(x_1, y_1)$ is a distance between two values from the same level of hierarchy
Distances in different levels of Hierarchy
With respect to their common ancestor
Let $L_z$ be the level of hierarchy where x and y have their first common ancestor
$dist(x, y) = dist\left(x,\, anc^{L_z}_{L_x}(x)\right) + dist\left(anc^{L_z}_{L_y}(y),\, y\right)$
Options: the number of "hops" needed to reach the first common ancestor; normalizing according to the height of the level
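A sketch of the "hops" reading of this distance on a toy hierarchy stored as parent pointers. The values are made up, and the normalization shown (dividing by twice the height of the hierarchy) is only one possible interpretation of "normalizing according to the height", assumed here for illustration.

```python
# child -> parent; a hypothetical City < Country < Continent < all hierarchy
PARENT = {
    "Athens": "Greece", "Patras": "Greece", "Rome": "Italy",
    "Greece": "Europe", "Italy": "Europe",
    "Europe": "all",
}
HEIGHT = 3   # number of hops from the most detailed level up to "all"

def path_to_root(value):
    """The value followed by all of its ancestors, bottom-up."""
    path = [value]
    while value in PARENT:
        value = PARENT[value]
        path.append(value)
    return path

def hops_via_common_ancestor(x, y):
    """Hops from x plus hops from y to their first common ancestor, and that ancestor."""
    hops_from_y = {v: i for i, v in enumerate(path_to_root(y))}
    for hops_from_x, v in enumerate(path_to_root(x)):
        if v in hops_from_y:
            return hops_from_x + hops_from_y[v], v
    raise ValueError("values do not share an ancestor")

hops, ancestor = hops_via_common_ancestor("Athens", "Italy")
print(hops, ancestor)                # 2 hops up from Athens, 1 hop up from Italy
print(hops / (2 * HEIGHT))           # assumed normalization into [0, 1]
```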
Distances in different levels of Hierarchy
Highway distance
Let every level $L_i$ be clustered into $k_i$ clusters, where every cluster has its own representative $r_{k_i}$
$dist(x, y) = dist(x, r_x) + dist(r_x, r_y) + dist(r_y, y)$
Attribute Based
level L has attributes; each value $v \in dom(L)$ is described by $[v_1 \dots v_n]$
Distance can be defined with respect to the attributes
Types of Levels
Nominal (=, ≠): values hold the distinctness property; values can be explicitly distinguished
Ordinal (<, >): values hold the distinctness property and the order property; values abide by an order
Interval (+, -): values hold the distinctness, order and addition properties; a unit of measurement exists, so the difference between two values is meaningful