Post on 16-Feb-2016
OCCT: A One-Class Clustering Tree
for Implementing One-to-Many Data Linkage
Ben-Gurion University of the Negev
Faculty of Engineering Sciences
Department of Information Systems Engineering
Ma'ayan Gafny, Asaf Shabtai, Lior Rokach, Yuval Elovici
Definitions
TA – a given table A
TB – a given table B (our goal is to link records from table TA with one or more records from TB)
|TA| – the number of records in TA
|TB| – the number of records in TB
A – the set of attributes of table TA, where ai is the i-th attribute
|A| – the number of attributes in TA
B – the set of attributes of table TB, where bi is the i-th attribute
|B| – the number of attributes in TB
r(a) ∈ TA – a record from table TA
r(b) ∈ TB – a record from table TB
TA × TB – the table generated by applying the Cartesian product of TA and TB
r = (r(a), r(b)) ∈ TA × TB – a record of TA × TB
TAB ⊆ TA × TB – the set of matching records
T̄AB ⊆ TA × TB – the set of non-matching records (TAB with an overline)
d – a node in the OCCT model
Ad ⊆ A – the subset of attributes of TA that were already selected as splitting attributes in the path from the root of the tree to node d
TAB(d) ⊆ TAB – the subset of matching instances at node d of the OCCT tree
split_a(TAB(d)) = TAB(d)(a) – the splitting of TAB(d) into n subsets according to attribute a, such that ∀ i = 1..n: TAB(di)(a) = {r ∈ TAB(d) | a = vi}
σp(TAB(d)) – the selection operator, used to select the records in TAB(d) that satisfy the given predicate p (in this case p is a = vi)
πA(TAB(d)) – the projection operator, used to select the subset of attributes in TAB(d) that appear in the attribute collection A
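The split, selection and projection operators defined above can be tried out on a toy table. The following is a minimal sketch in Python, assuming records are plain dictionaries; the function names `select`, `project` and `split` and the sample rows are hypothetical and do not come from the slides.

```python
from collections import defaultdict

def select(records, predicate):
    """Selection: keep the records that satisfy the predicate p."""
    return [r for r in records if predicate(r)]

def project(records, attributes):
    """Projection: keep only the listed attributes of every record."""
    return [{a: r[a] for a in attributes} for r in records]

def split(records, attribute):
    """Split the records into one subset per distinct value of the attribute."""
    subsets = defaultdict(list)
    for r in records:
        subsets[r[attribute]].append(r)
    return dict(subsets)

# Hypothetical matching records at some node d.
t_ab_d = [
    {"a1": "x", "a2": 1, "b1": "u"},
    {"a1": "y", "a2": 2, "b1": "v"},
    {"a1": "x", "a2": 3, "b1": "w"},
]

subsets = split(t_ab_d, "a1")
selected = select(t_ab_d, lambda r: r["a1"] == "x")
projected = project(t_ab_d, ["a1", "a2"])

print(sorted(subsets))
print(selected == subsets["x"])
print(projected[0])
```

Each subset produced by `split` equals the selection with the predicate a = vi, which is the relationship the definitions express.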
Definitions
TA: columns a1, a2, a3, a4, …, an
A = {a1, a2, a3, …, an}, |A| = n
|TA| = number of records in TA
r(a) = a record from TA
TB: columns b1, b2, b3, b4, …, bm
B = {b1, b2, b3, …, bm}, |B| = m
|TB| = number of records in TB
r(b) = a record from TB
Definitions
TA × TB: columns a1, a2, a3, a4, …, an, b1, b2, b3, b4, …, bm
r = (r(a), r(b))
Definitions
TA × TB with a Target column: columns a1, a2, a3, a4, …, an, b1, b2, b3, b4, …, bm, Target
Records whose Target is "match" form TAB.
Records whose Target is "no-match" form T̄AB (TAB with an overline).
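The relationship between TA × TB, the matching set and the non-matching set can be sketched in a few lines. This is only an illustration, assuming two tiny hypothetical tables and a hand-picked set of matching pairs; none of the values come from the slides.

```python
from itertools import product

# Hypothetical toy tables: each record is a tuple of attribute values.
t_a = [("req1", "Berlin"), ("req2", "Bonn")]
t_b = [("cust1", "Berlin"), ("cust2", "Bonn")]

# TA x TB: every record r = (r(a), r(b)).
cartesian = list(product(t_a, t_b))

# Assumed for illustration: the known matching pairs (TAB).
t_ab = {(t_a[0], t_b[0]), (t_a[1], t_b[1])}

# The non-matching set is the rest of the Cartesian product.
t_ab_bar = [r for r in cartesian if r not in t_ab]

for r in cartesian:
    target = "match" if r in t_ab else "no-match"
    print(r, target)
```

The two sets partition the Cartesian product, so their sizes add up to |TA| · |TB|.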
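Definitions
[Figure: node d is split into child nodes d1 and d2. Both children hold records over the columns a1, a2, …, an, b1, …, bm; the records in d1 share the value v1 and the records in d2 share the value v2.]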
Definitions
[Figure: a tree with nodes d1, d2, d3, d4 and d5, annotated with Ad2 = {a1} and Ad4 = {a1, a2}.]
Ad ⊆ A – the subset of attributes of TA that were already selected as splitting attributes in the path from the root of the tree to node d.
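The definition of Ad can be made concrete by walking from a node up to the root and collecting the attributes used for splitting along the way. The sketch below assumes a particular tree layout (d1 is the root and splits on a1; d2 is a child of d1 and splits on a2; d4 and d5 are children of d2), chosen so that it agrees with Ad2 = {a1} and Ad4 = {a1, a2}; the `Node` class and `path_attributes` function are hypothetical names.

```python
class Node:
    """A tree node; split_attribute is the attribute used to split this node."""

    def __init__(self, name, parent=None, split_attribute=None):
        self.name = name
        self.parent = parent
        self.split_attribute = split_attribute


def path_attributes(node):
    """Return the attributes used for splitting on the path from the root to node."""
    attrs = set()
    current = node.parent
    while current is not None:
        attrs.add(current.split_attribute)
        current = current.parent
    return attrs


# Assumed layout: d1 is the root and splits on a1; d2 splits on a2.
d1 = Node("d1", split_attribute="a1")
d2 = Node("d2", parent=d1, split_attribute="a2")
d3 = Node("d3", parent=d1)
d4 = Node("d4", parent=d2)
d5 = Node("d5", parent=d2)

print(sorted(path_attributes(d2)))
print(sorted(path_attributes(d4)))
```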
Running Examples
The data set
Customer Type  Customer City  Request Location  Request Day Of Week  Request Part Of Day  Request ID
private        Berlin         Berlin            Friday               Afternoon            1
private        Hamburg        Hamburg           Wednesday            Afternoon            2
business       Berlin         Berlin            Wednesday            Morning              3
private        Berlin         Berlin            Wednesday            Morning              4
private        Berlin         Berlin            Saturday             Afternoon            5
private        Berlin         Berlin            Thursday             Morning              6
private        Berlin         Berlin            Friday               Afternoon            7
business       Berlin         Berlin            Saturday             Afternoon            8
private        Berlin         Berlin            Saturday             Afternoon            9
business       Hamburg        Hamburg           Friday               Afternoon            10
business       Hamburg        Hamburg           Monday               Afternoon            11
private        Hamburg        Hamburg           Saturday             Afternoon            12
private        Berlin         Berlin            Monday               Afternoon            13
private        Bonn           Berlin            Monday               Afternoon            14
private        Berlin         Berlin            Monday               Afternoon            15
private        Bonn           Bonn              Saturday             Morning              16
private        Hamburg        Hamburg           Saturday             Morning              17
private        Hamburg        Hamburg           Saturday             Morning              18
private        Hamburg        Hamburg           Friday               Afternoon            19
The data set – cont.
Customer Type  Customer City  Request Location  Request Day Of Week  Request Part Of Day  Request ID
private        Bonn           Hamburg           Friday               Afternoon            20
private        Berlin         Hamburg           Friday               Morning              21
business       Berlin         Berlin            Friday               Morning              22
private        Berlin         Berlin            Friday               Morning              23
private        Berlin         Berlin            Wednesday            Afternoon            24
private        Berlin         Berlin            Thursday             Afternoon            25
business       Berlin         Berlin            Thursday             Afternoon            26
business       Bonn           Bonn              Monday               Afternoon            27
private        Hamburg        Bonn              Monday               Afternoon            28
business       Berlin         Bonn              Monday               Afternoon            29
business       Bonn           Bonn              Wednesday            Afternoon            30
private        Bonn           Bonn              Friday               Afternoon            31
Coarse Grained Jaccard
Coarse Grained Jaccard – Splitting the root of the tree
Three candidates for split:
• Request location
• Request day of week
• Request part of day
CGJ – Splitting the root of the tree
Candidate split: Request location
reqLocation = Berlin vs. reqLocation != Berlin: W1 = 16/31, Score1 = 1/23
reqLocation = Hamburg vs. reqLocation != Hamburg: W2 = 9/31, Score2 = 2/23
reqLocation = Bonn vs. reqLocation != Bonn: W3 = 6/31, Score3 = 1/23
Score(Split_reqLocation) = W1 * Score1 + W2 * Score2 + W3 * Score3 = 0.0561
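The request-location score is a weighted sum of the per-value scores, and the arithmetic can be checked with exact fractions. The sketch below only reproduces the numbers shown on the slide; the helper name `weighted_score` is hypothetical.

```python
from fractions import Fraction

def weighted_score(weights, scores):
    """Sum of weight * score over the values of the candidate attribute."""
    return sum(w * s for w, s in zip(weights, scores))

# Weights and per-value scores as read off the request-location slide
# (Berlin, Hamburg, Bonn).
weights = [Fraction(16, 31), Fraction(9, 31), Fraction(6, 31)]
scores = [Fraction(1, 23), Fraction(2, 23), Fraction(1, 23)]

total = weighted_score(weights, scores)
print(total, round(float(total), 4))  # 40/713 0.0561
```

The weights are the shares of the 31 requests located in Berlin (16), Hamburg (9) and Bonn (6), so they sum to 1.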
CGJ – Splitting the root of the tree
Candidate split: Request day of week
dayOfWeek = Monday vs. dayOfWeek != Monday: W1 = 7/31, Score1 = 3/15
dayOfWeek = Wednesday vs. dayOfWeek != Wednesday: W2 = 5/31, Score2 = 5/15
dayOfWeek = Thursday vs. dayOfWeek != Thursday: W3 = 3/31, Score3 = 3/15
dayOfWeek = Friday vs. dayOfWeek != Friday: W4 = 9/31, Score4 = 5/15
dayOfWeek = Saturday vs. dayOfWeek != Saturday: W5 = 7/31, Score5 = 3/15
Score(Split_dayOfWeek) = W1 * Score1 + W2 * Score2 + W3 * Score3 + W4 * Score4 + W5 * Score5 = 0.260
CGJ – Splitting the root of the tree
Candidate split: Request part of day
partOfDay = Afternoon vs. partOfDay = Morning: Score1 = 4/23
Score(Split_partOfDay) = 0.173
Coarse Grained Jaccard – Splitting the root of the tree
Three candidates for split:
• Request location: 0.0561
• Request day of week: 0.260
• Request part of day: 0.173
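The per-value scores can also be recomputed from the running-example data. The sketch below rests on an assumed reading of the slides, not a confirmed definition: for each value of the candidate attribute, the records with that value are compared with all remaining records, using the Jaccard similarity of the two sets of distinct records projected onto the other four attributes, and the per-value scores are weighted by the share of records with that value. The column names and function names are hypothetical. Under this reading the day-of-week split gives 121/465 ≈ 0.260 and the part-of-day split gives 4/23, as on the slides. The request-location split gives about 0.052, with a denominator of 25 where the slide shows 23, so the slide may rest on slightly different data or counting.

```python
from fractions import Fraction

COLUMNS = ["custType", "custCity", "reqLocation", "dayOfWeek", "partOfDay"]
DATA = """
private Berlin Berlin Friday Afternoon
private Hamburg Hamburg Wednesday Afternoon
business Berlin Berlin Wednesday Morning
private Berlin Berlin Wednesday Morning
private Berlin Berlin Saturday Afternoon
private Berlin Berlin Thursday Morning
private Berlin Berlin Friday Afternoon
business Berlin Berlin Saturday Afternoon
private Berlin Berlin Saturday Afternoon
business Hamburg Hamburg Friday Afternoon
business Hamburg Hamburg Monday Afternoon
private Hamburg Hamburg Saturday Afternoon
private Berlin Berlin Monday Afternoon
private Bonn Berlin Monday Afternoon
private Berlin Berlin Monday Afternoon
private Bonn Bonn Saturday Morning
private Hamburg Hamburg Saturday Morning
private Hamburg Hamburg Saturday Morning
private Hamburg Hamburg Friday Afternoon
private Bonn Hamburg Friday Afternoon
private Berlin Hamburg Friday Morning
business Berlin Berlin Friday Morning
private Berlin Berlin Friday Morning
private Berlin Berlin Wednesday Afternoon
private Berlin Berlin Thursday Afternoon
business Berlin Berlin Thursday Afternoon
business Bonn Bonn Monday Afternoon
private Hamburg Bonn Monday Afternoon
business Berlin Bonn Monday Afternoon
business Bonn Bonn Wednesday Afternoon
private Bonn Bonn Friday Afternoon
"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in DATA.strip().splitlines()]


def jaccard(attribute, value, rows):
    """Jaccard similarity between the rows with attribute == value and the rest,
    comparing distinct records on all other attributes (an assumed reading)."""
    others = [c for c in COLUMNS if c != attribute]
    inside = {tuple(r[c] for c in others) for r in rows if r[attribute] == value}
    outside = {tuple(r[c] for c in others) for r in rows if r[attribute] != value}
    return Fraction(len(inside & outside), len(inside | outside))


def split_score(attribute, rows):
    """Weighted sum of the per-value Jaccard scores for one candidate split."""
    total = Fraction(0)
    for value in sorted({r[attribute] for r in rows}):
        weight = Fraction(sum(r[attribute] == value for r in rows), len(rows))
        total += weight * jaccard(attribute, value, rows)
    return total


scores = {a: split_score(a, ROWS) for a in ["reqLocation", "dayOfWeek", "partOfDay"]}
best = min(scores, key=scores.get)  # assumes the lowest similarity is preferred
for name, score in scores.items():
    print(name, score, round(float(score), 4))
print("lowest score:", best)
```

Picking the lowest score selects the request location, which agrees with the later slides that show Req. Location at the root; that the lowest similarity wins is an assumption here.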
The split in the root
Fine Grained Jaccard
Fine Grained Jaccard – Splitting the root of the tree
[Figure: root node d split into Req. Location = Berlin and Req. Location != Berlin.]
Least Probable Intersections
LPI – Splitting the root of the tree
[Figure: root node d split into Req. Location = Berlin and Req. Location != Berlin.]
[Figure: the running-example data set (Customer Type, Customer City, Request Location, Request Day Of Week, Request Part Of Day, Request ID), divided into the records with Req. Location = Berlin and the records with Req. Location != Berlin.]
Maximum Likelihood Estimation
[Figure: Request Location at the root; Cust. City and Cust. Type appear beneath it three times.]
MLE – Splitting the root of the tree
p(Cust. City | Cust. Type), p(Cust. Type | Cust. City)
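The slide names the two conditional distributions but does not show how they are estimated or combined into a split score. As a hedged illustration, the sketch below estimates both by relative frequencies (the maximum likelihood estimate for a categorical distribution) on the records whose request location is Berlin. The function name `conditional` and the choice of the Berlin branch are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction

# (custType, custCity, reqLocation) for the 31 requests of the running example.
DATA = """
private Berlin Berlin
private Hamburg Hamburg
business Berlin Berlin
private Berlin Berlin
private Berlin Berlin
private Berlin Berlin
private Berlin Berlin
business Berlin Berlin
private Berlin Berlin
business Hamburg Hamburg
business Hamburg Hamburg
private Hamburg Hamburg
private Berlin Berlin
private Bonn Berlin
private Berlin Berlin
private Bonn Bonn
private Hamburg Hamburg
private Hamburg Hamburg
private Hamburg Hamburg
private Bonn Hamburg
private Berlin Hamburg
business Berlin Berlin
private Berlin Berlin
private Berlin Berlin
private Berlin Berlin
business Berlin Berlin
business Bonn Bonn
private Hamburg Bonn
business Berlin Bonn
business Bonn Bonn
private Bonn Bonn
"""
ROWS = [tuple(line.split()) for line in DATA.strip().splitlines()]
TYPE, CITY, LOCATION = 0, 1, 2


def conditional(rows, target, given):
    """Relative-frequency estimate of p(target | given).

    Returns a dict keyed by (given_value, target_value)."""
    pair_counts = Counter((r[given], r[target]) for r in rows)
    given_counts = Counter(r[given] for r in rows)
    return {(g, t): Fraction(n, given_counts[g]) for (g, t), n in pair_counts.items()}


# Records in the branch Request Location = Berlin (chosen for illustration).
berlin = [r for r in ROWS if r[LOCATION] == "Berlin"]
p_city_given_type = conditional(berlin, target=CITY, given=TYPE)
p_type_given_city = conditional(berlin, target=TYPE, given=CITY)

print(p_city_given_type[("private", "Berlin")])  # 11/12
print(p_type_given_city[("Berlin", "business")])  # 4/15
```

Of the 16 requests located in Berlin, 12 come from private customers, 11 of whom live in Berlin, which gives the 11/12 above.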