An Experimental Comparison of Some Triclustering Algorithms
Dmitry V. Gnatyshak, Dmitry I. Ignatov*, Sergei O. Kuznetsov
School of Applied Mathematics and Information Science & Intelligence Systems and Structural Analysis Lab
NRU Higher School of Economics, Moscow, Russia
LORIA Orpailleur meeting, Nancy, France, 2013
Outline
1. Motivation and problem setting
2. FCA basic definitions
3. Triclustering methods
4. Experiments
5. Conclusion
Motivation
A large amount of structured and unstructured data generates triadic data.
Example: a folksonomy is a set of triples (user, object, tag)
Examples:
Bibsonomy.org (user, bookmark, tag)
Social networking sites (user, group, interest)
Delicious (user, link, tag)
Main goals
1. Comparison of some triclustering methods
2. Development of a toolbox for triclustering experiments
3. New, possibly better methods
4. Possible applications
FCA: basic definitions
Biology Mathematics Computer Science
Chemistry
Kate x x
Mike x x x
Alex x x
Pete x x x
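Def. A formal context is a triple $\mathbb{K} = (G, M, I)$, where $G$ is a set of objects, $M$ is a set of attributes, and $I \subseteq G \times M$ is an incidence relation.
Example: $G$ – students, $M$ – courses, $gIm$ means that student $g$ takes course $m$.
(R. Wille, 1982; B. Ganter, R. Wille, 1999)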
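FCA: basic definitions
Def. Galois operators (concept-forming operators): for $A \subseteq G$ and $B \subseteq M$,
$A' = \{ m \in M \mid gIm \text{ for all } g \in A \}$ is the set of attributes common to all objects in $A$;
$B' = \{ g \in G \mid gIm \text{ for all } m \in B \}$ is the set of objects that possess all attributes from $B$.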
Examples:
Biology Mathematics Computer Science
Chemistry
Kate x x
Mike x x x
Alex x x
Pete x x x
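The two concept-forming operators can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `up` and `down` are hypothetical, and the toy incidence relation is invented for the example (it is not the student/course table, whose crosses cannot be read from the slide).

```python
# Toy formal context K = (G, M, I); the crosses are invented for illustration.
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"),
     ("g2", "a"), ("g2", "b"), ("g2", "c"),
     ("g3", "b"), ("g3", "c")}

def up(A):
    """A': the attributes shared by every object in A."""
    return {m for m in M if all((g, m) in I for g in A)}

def down(B):
    """B': the objects that have every attribute in B."""
    return {g for g in G if all((g, m) in I for m in B)}

shared = up({"g1", "g2"})   # attributes common to g1 and g2
owners = down({"b", "c"})   # objects having both b and c
```

Note that `up(set())` returns all of `M`, because the condition holds vacuously for the empty set of objects.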
FCA: basic definitions
Def. A pair $(A, B)$ with $A \subseteq G$ and $B \subseteq M$ is called a formal concept of the context $\mathbb{K}$ if $A' = B$ and $B' = A$.
Examples:• is a formal concept is not a formal concept
Biology Mathematics Computer Science
Chemistry
Kate x x
Mike x x x
Alex x x
Pete x x x
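Checking whether a pair is a formal concept amounts to applying both derivation operators and comparing the results. The sketch below assumes a small invented context (not the table above) and hypothetical helper names `up`, `down`, and `is_formal_concept`.

```python
# Toy formal context; the incidence relation is an assumption made for this example.
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"),
     ("g2", "a"), ("g2", "b"), ("g2", "c"),
     ("g3", "b"), ("g3", "c")}

def up(A):
    """Attributes common to all objects in A."""
    return {m for m in M if all((g, m) in I for g in A)}

def down(B):
    """Objects possessing all attributes in B."""
    return {g for g in G if all((g, m) in I for m in B)}

def is_formal_concept(A, B):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return up(A) == B and down(B) == A

closed = is_formal_concept({"g1", "g2"}, {"a", "b"})   # True: both conditions hold
not_closed = is_formal_concept({"g1"}, {"a", "b"})     # False: g2 also has a and b
```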
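Triadic FCA: basic definitions
(F. Lehmann, R. Wille, 1995)
Def. A quadruple $\mathbb{K} = (G, M, B, I)$ is a triadic formal context (tricontext), where $G$, $M$, and $B$ are sets of objects, attributes, and conditions respectively, and $I \subseteq G \times M \times B$ is a ternary relation.
Let $X \subseteq G$, $Y \subseteq M$, $Z \subseteq B$; then the concept-forming operators in the triadic case are:
$(X, Y)' = \{ b \in B \mid (g, m, b) \in I \text{ for all } g \in X, m \in Y \}$
$(X, Z)' = \{ m \in M \mid (g, m, b) \in I \text{ for all } g \in X, b \in Z \}$
$(Y, Z)' = \{ g \in G \mid (g, m, b) \in I \text{ for all } m \in Y, b \in Z \}$
Def. A triple $(X, Y, Z)$ is called a triadic formal concept (triconcept) of the triadic context $\mathbb{K}$ if $(X, Y)' = Z$, $(X, Z)' = Y$, and $(Y, Z)' = X$; $X$ is the extent, $Y$ is the intent, and $Z$ is the modus.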
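A minimal Python sketch of the triadic concept-forming operators, assuming the tricontext is stored as a set of (object, attribute, condition) triples. The data and the function names (`prime_xy`, `prime_xz`, `prime_yz`, `is_triconcept`) are hypothetical and serve only to illustrate the definitions.

```python
# Toy tricontext; all names and triples are invented for illustration.
G = {"u1", "u2"}
M = {"m1", "m2"}
B = {"t1", "t2"}
I = {("u1", "m1", "t1"), ("u1", "m1", "t2"),
     ("u2", "m1", "t1"), ("u1", "m2", "t1")}

def prime_xy(X, Y):
    """Conditions under which every object in X has every attribute in Y."""
    return {b for b in B if all((g, m, b) in I for g in X for m in Y)}

def prime_xz(X, Z):
    """Attributes that every object in X has under every condition in Z."""
    return {m for m in M if all((g, m, b) in I for g in X for b in Z)}

def prime_yz(Y, Z):
    """Objects that have every attribute in Y under every condition in Z."""
    return {g for g in G if all((g, m, b) in I for m in Y for b in Z)}

def is_triconcept(X, Y, Z):
    """A triple is a triconcept iff each component is the derivation of the other two."""
    return prime_xy(X, Y) == Z and prime_xz(X, Z) == Y and prime_yz(Y, Z) == X

maximal = is_triconcept({"u1"}, {"m1"}, {"t1", "t2"})   # True
too_small = is_triconcept({"u1"}, {"m1"}, {"t1"})       # False: t2 could be added
```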
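OAC-triclusters (based on box operators)
Def. Let $\mathbb{K} = (G, M, B, I)$ be a triadic context. The OAC-tricluster based on box operators built on a triple $(g, m, b) \in I$ is the triple of sets $T = (g^{\square}, m^{\square}, b^{\square})$.
$\rho(T) = \frac{|I \cap (g^{\square} \times m^{\square} \times b^{\square})|}{|g^{\square}| \cdot |m^{\square}| \cdot |b^{\square}|}$ is the density of $T$.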
Box operators
(D. Ignatov et al., 2011)
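OAC-triclusters (based on prime operators)
Def. Let $\mathbb{K} = (G, M, B, I)$ be a triadic context. The OAC-tricluster based on prime operators for a triple $(g, m, b) \in I$ is the triple $T = ((m, b)', (g, b)', (g, m)')$.
$\rho(T) = \frac{|I \cap ((m, b)' \times (g, b)' \times (g, m)')|}{|(m, b)'| \cdot |(g, b)'| \cdot |(g, m)'|}$ is the density of $T$.
Prime operators of singletons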
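The prime-based tricluster of a single triple, together with its density, could be computed roughly as follows. This is a sketch under assumptions: the tricontext is a plain set of triples, the toy data is invented, and the names `oac_prime` and `density` are hypothetical rather than taken from the authors' toolbox.

```python
# Toy ternary relation I; invented for illustration.
I = {("u1", "m1", "t1"), ("u1", "m1", "t2"),
     ("u2", "m1", "t1"), ("u1", "m2", "t1")}

def oac_prime(g, m, b, I):
    """Build ((m, b)', (g, b)', (g, m)') for the generating triple (g, m, b)."""
    X = {x for (x, y, z) in I if y == m and z == b}   # objects with m under b
    Y = {y for (x, y, z) in I if x == g and z == b}   # attributes of g under b
    Z = {z for (x, y, z) in I if x == g and y == m}   # conditions where g has m
    return X, Y, Z

def density(X, Y, Z, I):
    """Fraction of the cuboid X x Y x Z that is filled with triples of I."""
    hits = sum((x, y, z) in I for x in X for y in Y for z in Z)
    return hits / (len(X) * len(Y) * len(Z))

T = oac_prime("u1", "m1", "t1", I)
rho = density(*T, I)   # 4 of the 8 cells of the cuboid are in I
```

Because the generating triple always belongs to its own tricluster, none of the three sets is empty and the density is well defined.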
OAC-triclustering
The algorithm’s idea:
INPUT: $\mathbb{K} = (G, M, B, I)$ is a tricontext; $\rho_{min}$ is a density threshold
OUTPUT: $\mathcal{T}$ is a set of triclusters
For each triple of the context, the algorithm builds a tricluster by the definitions and keeps it if its density is at least $\rho_{min}$.
Notes:
1. It makes sense to use a hash function to avoid duplicates
2. Triple enumeration is easy to parallelize
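The enumeration scheme with hash-based duplicate removal might look like the sketch below. It assumes the prime-operator variant, a tricontext given as a set of triples, and a hypothetical function name `oac_prime_triclustering`; it is a sequential illustration, not the authors' parallel implementation.

```python
def oac_prime_triclustering(triples, min_density=0.0):
    """Enumerate all triples, build one tricluster per triple, drop duplicates."""
    I = set(triples)
    seen = set()      # hash set of already generated triclusters
    result = []
    for g, m, b in sorted(I):
        X = frozenset(x for (x, y, z) in I if y == m and z == b)
        Y = frozenset(y for (x, y, z) in I if x == g and z == b)
        Z = frozenset(z for (x, y, z) in I if x == g and y == m)
        T = (X, Y, Z)
        if T in seen:             # many triples generate the same tricluster
            continue
        seen.add(T)
        hits = sum((x, y, z) in I for x in X for y in Y for z in Z)
        if hits / (len(X) * len(Y) * len(Z)) >= min_density:
            result.append(T)
    return result

# Two disjoint 2x2x2 cuboids of ones: 16 triples, but only 2 distinct triclusters.
cuboid_a = {(g, m, b) for g in ("a1", "a2") for m in ("x1", "x2") for b in ("p1", "p2")}
cuboid_b = {(g, m, b) for g in ("b1", "b2") for m in ("y1", "y2") for b in ("q1", "q2")}
triclusters = oac_prime_triclustering(cuboid_a | cuboid_b, min_density=1.0)
```

Each iteration of the loop depends only on the read-only relation `I`, which is why the enumeration lends itself to parallelization; only the duplicate check needs to be merged afterwards.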
TriBox
Model: let $\mathbb{K} = (G, M, B, I)$ be a tricontext; then its set of triclusters follows the model below:
is a parameter, is a constant, is a residual, is a Boolean variable, which shows that is in tricluster (similarly for and ).
The method’s idea: minimize the residuals using a greedy approach. It enumerates all triples, adding an object (attribute or condition) at each step.
Notes:
1. It makes sense to use a hash function to avoid duplicates
2. Triple enumeration is easy to parallelize
(A. Kramarenko & B. Mirkin, 2011)
Spectral Triclustering: SpecTric
The algorithm’s idea: It sequentially splits the tricontext into subtricontexts according to the normalized mincut criterion while they have sufficient size, then returns them as a set of triclusters.
Notes:
1. We consider a graph $(V, E)$ built from the tricontext, where $V$ is the set of vertices and $E$ is the set of edges derived from the triples of the context. Then we split the graph using an approximation of the eigenvector that corresponds to the second smallest eigenvalue of the Laplacian matrix of the input graph.
2. To find the partition vector we solve the generalized eigenvalue problem $L x = \lambda D x$, where $L$ is the Laplacian matrix and $D$ is the diagonal matrix of vertex degrees.
(D. Ignatov & Z. Sekinaeva, 2011; Ignatov et al. 2013)
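One bipartition step of this kind can be sketched without external libraries. The code below is an illustration under several assumptions: it works on a generic unweighted graph (how the graph is derived from the tricontext is not specified here), it approximates the second eigenvector of $L x = \lambda D x$ by power iteration with deflation, and it splits by the sign of the vector instead of searching for the best normalized cut. The function name `fiedler_split` is hypothetical.

```python
import math

def fiedler_split(n, edges, iters=500):
    """Split vertices 0..n-1 by the sign of the second eigenvector of L x = lambda D x."""
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    deg = [sum(row) for row in adj]
    s = [math.sqrt(d) for d in deg]          # D^(1/2); assumes no isolated vertices
    # With y = D^(1/2) x the problem becomes an ordinary one for
    # S = D^(-1/2) A D^(-1/2), whose top eigenvector is proportional to D^(1/2) 1.
    norm = math.sqrt(sum(v * v for v in s))
    top = [v / norm for v in s]
    y = [float(i + 1) for i in range(n)]     # deterministic start vector
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(y, top))
        y = [a - dot * b for a, b in zip(y, top)]      # deflate the trivial solution
        # Multiply by (S + I) / 2, whose eigenvalues lie in [0, 1].
        y = [0.5 * (y[i] + sum(adj[i][j] * y[j] / (s[i] * s[j]) for j in range(n)))
             for i in range(n)]
        length = math.sqrt(sum(a * a for a in y))
        y = [a / length for a in y]
    x = [y[i] / s[i] for i in range(n)]      # back to the generalized eigenvector
    left = {i for i in range(n) if x[i] < 0}
    return left, set(range(n)) - left

# Two triangles joined by a single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part_a, part_b = fiedler_split(6, edges)
```

For larger graphs a sparse eigensolver would be the usual choice; the power iteration here only keeps the example self-contained.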
TRIAS
It finds all formal triconcepts with a predefined minimal support for each of the sets $G$, $M$, and $B$. In this task a formal triconcept is a tricluster with density 1.
The algorithm’s idea: It applies the NextClosure algorithm (B. Ganter, 1987) sequentially. First it finds a formal concept in a dyadic formal context; second, it works on a dyadic context built on the extent of each previously found concept. The resulting sets are then combined into a final triadic concept.
(R. Jäschke, 2006)
Experiments
Main goals:
Fault-tolerance test
Comparison by criteria: time, quantity, mean density, coverage, and diversity
For TriBox and OAC-triclustering we implemented parallel versions; they were included in the comparison
Data
Fault-tolerance test: contexts with three cuboids of ones on the main diagonal and different noise probabilities
Time, tricluster number, average density, coverage, and diversity:
Random contexts
Top-250 IMDB
BibSonomy
Comparison Criteria
Fault-tolerance is the ability of a triclustering method to find triclusters maximally similar to the initial cuboids; the measure is computed from the number of cuboids and the obtained set of triclusters.
Coverage is defined as the fraction of the triples of the input context that are included in at least one of the obtained triclusters.
Diversity is defined via a Boolean function on two triclusters.
The diversity of a set of triclusters is then derived from this function.
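Coverage and diversity can be illustrated with a short Python sketch. The exact formulas are not legible on the slide, so the definitions below are assumptions: two triclusters are taken to intersect when they overlap in all three components, and diversity is one minus the share of intersecting pairs. The function names and toy data are hypothetical.

```python
def coverage(I, triclusters):
    """Fraction of triples of I that fall inside at least one tricluster."""
    covered = {t for t in I
               if any(t[0] in X and t[1] in Y and t[2] in Z for X, Y, Z in triclusters)}
    return len(covered) / len(I)

def intersect(T1, T2):
    """Assumed Boolean function: True if extents, intents and modi all overlap."""
    return all(a & b for a, b in zip(T1, T2))

def diversity(triclusters):
    """Assumed definition: 1 minus the fraction of intersecting pairs."""
    n = len(triclusters)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 1.0    # convention chosen here for fewer than two triclusters
    overlapping = sum(intersect(triclusters[i], triclusters[j])
                      for i in range(n) for j in range(i + 1, n))
    return 1 - overlapping / pairs

I = {("u1", "m1", "t1"), ("u1", "m1", "t2"), ("u2", "m1", "t1"), ("u1", "m2", "t1")}
T1 = ({"u1"}, {"m1"}, {"t1", "t2"})
T2 = ({"u2"}, {"m1"}, {"t1"})
T3 = ({"u1", "u2"}, {"m1"}, {"t1"})

cov = coverage(I, [T1, T2])          # 3 of 4 triples are covered
div_two = diversity([T1, T2])        # T1 and T2 share no objects
div_three = diversity([T1, T2, T3])  # T3 overlaps both T1 and T2
```

Under these assumptions a partition of the context always has diversity 1, while a large set of heavily overlapping triclusters pushes the value toward 0.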
Results (fault-tolerance)
OAC-prime triclustering example
IMDB
Results (time, quantity, average density, coverage, diversity)
Method       T, ms    #      Density, %  Coverage, %  Diversity, %
Uniform random context
OAC (box)    407      73     9.88        100.00       0.00    0.00    0.00    0.00
OAC (prime)  312      2659   32.23       100.00       92.51   60.07   59.80   59.45
SpecTric     277      5      8.74        8.84         100.00  100.00  100.00  100.00
TriBox       6218     1011   74.00       96.02        97.42   66.25   79.53   84.80
TRIAS        29367    38356  100.00      100.00       99.99   99.93   4.07    3.51
IMDB
OAC (box)    2314     1500   1.84        100.00       15.65   9.67    0.70    7.87
OAC (prime)  547      1274   53.85       100.00       96.55   94.56   92.14   28.52
SpecTric     98799    21     17.07       20.88        100.00  100.00  100.00  100.00
TriBox       197136   328    91.65       98.90        98.89   98.46   95.21   30.94
TRIAS        102554   1956   100.00      100.00       99.89   99.69   52.52   26.18
BibSonomy
OAC (box)    19297    398    4.16        100.00       79.59   67.28   42.83   79.54
OAC (prime)  13556    1289   94.66       100.00       99.74   88.58   99.51   99.53
SpecTric     5906563  2      50.00       100.00       100.00  100.00  100.00  100.00
TriBox       Time > 24 hours
TRIAS        110554   1305   100.00      100.00       99.98   91.70   99.78   99.92
Results (time, quantity, average density, coverage, diversity)
Method       Time                                                 Quantity    Average density  Coverage         Diversity           Efficiency of parallel version
OAC (box)    average                                              large       low              high ~ very low  very low ~ average  high
OAC (prime)  small                                                large       average          high ~ average   average ~ high      low
SpecTric     small for small contexts                             small       low              average ~ high   1                   –
TriBox       high                                                 average     high             high             high                high
TRIAS        Strongly depends on , and the triconcepts structure  very large  1                high ~ low       high ~ low          –
Conclusion
There is no clear winner according to the comparison criteria
TriBox shows the best results, but it requires a huge amount of computation time
OAC-triclustering based on prime operators gives the second-best results and is sufficiently fast
Details by methods:
TRIAS
High elapsed time
Too many small, well-interpreted triclusters (triconcepts)
OAC (box operators)
Large triclusters of low density
High coverage, low diversity
Efficient parallelization
OAC (prime operators)
High computation speed
Large number of dense, well-interpreted triclusters
Low parallelization efficiency
Spectral Triclustering
High computational speed on small contexts
Well-interpreted triclusters, but of low density
Diversity always equals 1, but this results in low coverage
TriBox
A moderate number of well-interpreted triclusters
High elapsed time
Efficient parallelization
Reasonably high coverage and diversity
Thank you very much!
Questions?