Date post: | 17-Dec-2015 |
GRAPH-BASED HIERARCHICAL CONCEPTUAL CLUSTERING
by
Istvan Jonyer,
Lawrence B. Holder and
Diane J. Cook
The University of Texas at Arlington
Outline
What is hierarchical conceptual clustering?
Overview of Subdue
Conceptual clustering in Subdue
Evaluation of hierarchical clusterings
Experiments and results
Conclusions
What is clustering?
What is hierarchical conceptual clustering?
Unsupervised concept learning
Generating hierarchies to explain data
Applications
– Hypothesis generation and testing
– Prediction based on groups
– Finding taxonomies
Example hierarchical conceptual clustering
Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three
The Problem
Hierarchical conceptual clustering in discrete-valued structural databases
Existing systems:
– Continuous-valued
– Discrete but unstructured
– We can do better! (The field is underexplored)
Related Work
Cobweb
Labyrinth
AutoClass
Snob
In Euclidean space: Chameleon, Cure
Unsupervised learning algorithms
The Solution
Take Subdue and extend it!
Overview of Subdue
Data mining in graph representations of structural databases
[Figure: example input graph with labels A-F (upper case) and a-g (lower case)]
Overview of Subdue
Iteratively searching for the best substructure using the MDL heuristic
[Figure: a candidate substructure with labels A, B, C, D and a, b, c]
Overview of Subdue
Compress using best substructure
[Figure: the compressed graph, containing two vertices labeled S alongside the remaining labels E, F and d, e, f, g]
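The effect of compressing by a substructure can be illustrated with a rough size calculation. The sketch below is not Subdue's MDL encoding: as a loudly simplified stand-in it measures a graph by its vertex count plus edge count, and it assumes that instances do not overlap and that each instance is replaced by a single vertex. The names `GraphSize` and `compression_ratio` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphSize:
    """Vertex and edge counts of a graph (hypothetical helper type)."""
    vertices: int
    edges: int

    @property
    def size(self) -> int:
        # Crude stand-in for description length: vertices + edges.
        return self.vertices + self.edges


def compression_ratio(graph: GraphSize, sub: GraphSize, instances: int) -> float:
    """Return (size(S) + size(G|S)) / size(G); lower means better compression.

    Assumes non-overlapping instances, each collapsed into one new vertex.
    Edges that connect an instance to the rest of the graph are kept.
    """
    if instances < 0:
        raise ValueError("instances must be non-negative")
    removed_vertices = instances * (sub.vertices - 1)
    removed_edges = instances * sub.edges
    if removed_vertices > graph.vertices or removed_edges > graph.edges:
        raise ValueError("substructure instances do not fit in the graph")
    compressed = GraphSize(graph.vertices - removed_vertices,
                           graph.edges - removed_edges)
    return (sub.size + compressed.size) / graph.size


# Made-up numbers: a 20-vertex, 19-edge graph holding 4 instances of a
# 4-vertex, 3-edge substructure.
ratio = compression_ratio(GraphSize(20, 19), GraphSize(4, 3), instances=4)
```

Under this simplified measure a value below 1 means the substructure pays for itself, which is the intuition behind ranking substructures by how well they compress the graph.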
Overview of Subdue
Fuzzy match
– Inexact matching of subgraphs
– Applications:
   Defining fuzzy concepts
   Evaluation of clusterings
Conceptual Clustering with Subdue
Use Subdue to identify clusters
– The best subgraph in an iteration defines a cluster
When to stop within an iteration?
1) Use -limit option
2) Use -size option
3) Use first minimum heuristic (new)
The First Minimum Heuristic
Use subgraph at first local minimum
– Detect it using -prune2 option
[Figure: plot illustrating the first local minimum; axis range 0.75 to 1.05]
The First Minimum Heuristic
Not a greedy heuristic!
– Although first local minimum is usually the global minimum
– First local minimum is caused by a smaller, more frequently occurring subgraph
– Subsequent minima are caused by bigger, less frequently occurring subgraphs
=> First subgraph is more general
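The idea of stopping at the first local minimum can be sketched on a plain sequence of numbers. This is an illustration only, not Subdue's implementation of its pruning option: it assumes the search yields one value per successively larger subgraph (lower is better), and the function name and sample values are hypothetical.

```python
from typing import Sequence


def first_local_minimum(values: Sequence[float]) -> int:
    """Return the index of the first local minimum in `values`.

    Scans left to right and stops as soon as a value is larger than its
    predecessor. Plateaus (equal neighbours) do not stop the scan. If the
    sequence never rises, the last index is returned.
    """
    if not values:
        raise ValueError("values must be non-empty")
    for i in range(1, len(values)):
        if values[i] > values[i - 1]:
            return i - 1
    return len(values) - 1


# Made-up values with two minima: the first at index 2, a deeper one at index 4.
values = [1.0, 0.9, 0.85, 0.95, 0.8, 1.0]
chosen = first_local_minimum(values)
```

In this made-up sequence the scan picks index 2 even though index 4 holds the global minimum, which mirrors the preference for the smaller, more general subgraph.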
The First Minimum Heuristic
A multi-minimum search space:
[Figure: plot of a search space with several minima; axis range 0.6 to 1.1]
Lattice vs. Tree
Previous work defined classification trees
– Inadequate in structured domains
Better hierarchical description: classification lattice
– A cluster can have more than one parent
– A parent can be at any level (not only one level above)
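The two lattice properties can be made concrete with a small data structure. The sketch below is an assumption-laden illustration, not Subdue's internal representation: it stores only a set of parents per cluster, and the class name, method names, and cluster names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class ClassificationLattice:
    """Minimal parent-set representation of a classification lattice."""

    def __init__(self) -> None:
        self._parents: Dict[str, Set[str]] = defaultdict(set)

    def add_cluster(self, name: str, parents: Iterable[str] = ()) -> None:
        parents = list(parents)
        self._parents[name].update(parents)
        for parent in parents:
            # Make sure every parent is itself registered as a cluster.
            self._parents.setdefault(parent, set())

    def parents(self, name: str) -> Set[str]:
        return set(self._parents.get(name, ()))

    def level(self, name: str) -> int:
        """Length of the longest path from a root down to `name`."""
        ps = self._parents.get(name, ())
        return 0 if not ps else 1 + max(self.level(p) for p in ps)

    def is_tree(self) -> bool:
        """True when no cluster has more than one parent."""
        return all(len(ps) <= 1 for ps in self._parents.values())


lattice = ClassificationLattice()
lattice.add_cluster("Root")
lattice.add_cluster("A", ["Root"])
lattice.add_cluster("B", ["A"])
# "C" has two parents that sit at different levels of the hierarchy.
lattice.add_cluster("C", ["Root", "B"])
```

Here cluster "C" is a child of both "Root" (level 0) and "B" (level 2), something a classification tree cannot express.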
Hierarchical Clustering in Subdue
Subdue can compress by a subgraph after each iteration
Subsequent clusters may be defined in terms of previously defined clusters
This results in a hierarchy
Hierarchical Conceptual Clustering of an Artificial Domain
[Figure: classification lattice for the artificial domain; the top node is labeled Root]
Evaluation of Clusterings
Traditional evaluation:

$$\text{ClusteringQuality} = \frac{\text{InterClusterDistance}}{\text{IntraClusterDistance}}$$

– Not applicable to hierarchical domains
No known evaluation for hierarchical clusterings
– Most hierarchical evaluations are anecdotal
New Evaluation Heuristic for Hierarchical Clusterings
Properties of a good clustering:
– Small number of clusters
   Large coverage => good generality
– Big cluster descriptions
   More features => more inferential power
– Minimal or no overlap between clusters
   More distinct clusters => better defined concepts
New Evaluation Heuristic for Hierarchical Clusterings
Big clusters: bigger distance between disjoint clusters
Overlap: less overlap => bigger distance
Few clusters: averaging comparisons
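
$$
CQ_C = \frac{\displaystyle\sum_{i=1}^{c-1}\sum_{j=i+1}^{c} \frac{\displaystyle\sum_{k=1}^{|H_i|}\sum_{l=1}^{|H_j|} \frac{\mathrm{distance}(H_{i,k}, H_{j,l})}{\max\big(\mathrm{size}(H_{i,k}),\, \mathrm{size}(H_{j,l})\big)}}{|H_i|\,|H_j|}}{\displaystyle\sum_{i=1}^{c-1} i} + \sum_{i=1}^{c} CQ_{H_i}
$$
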
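One reading of this quality measure is: for every pair of sibling clusters, average the size-normalized distance over all pairs of their instances; average those scores over the sibling pairs; then add the quality of each child recursively. The sketch below follows that reading under several assumptions that are not from the slides: instances are modeled as sets of attribute strings, the distance is the size of the symmetric difference (a toy stand-in for an inexact graph match cost), every cluster has non-empty instances, a node with fewer than two children contributes 0 at its own level, and the hierarchy is walked as a tree. All names are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, FrozenSet, List

Instance = FrozenSet[str]


@dataclass
class Cluster:
    instances: List[Instance]
    children: List["Cluster"] = field(default_factory=list)


def toy_distance(a: Instance, b: Instance) -> float:
    # Toy stand-in: number of attributes present in exactly one instance.
    return float(len(a ^ b))


def toy_size(a: Instance) -> int:
    return len(a)


def clustering_quality(node: Cluster,
                       distance: Callable[[Instance, Instance], float] = toy_distance,
                       size: Callable[[Instance], int] = toy_size) -> float:
    """Recursive quality: averaged sibling-pair distance plus children's quality."""
    if not node.children:
        return 0.0
    pair_scores = []
    for h_i, h_j in combinations(node.children, 2):
        total = sum(distance(a, b) / max(size(a), size(b))
                    for a in h_i.instances for b in h_j.instances)
        pair_scores.append(total / (len(h_i.instances) * len(h_j.instances)))
    # Assumption: with a single child there are no pairs, so this level adds 0.
    own = sum(pair_scores) / len(pair_scores) if pair_scores else 0.0
    return own + sum(clustering_quality(child, distance, size)
                     for child in node.children)


left = Cluster([frozenset({"a", "b", "c"})])
right = Cluster([frozenset({"a", "d", "e"})])
root = Cluster(left.instances + right.instances, [left, right])
quality = clustering_quality(root)
```

Averaging over sibling pairs keeps the score from growing merely because a node has many children, which is how the measure rewards having few, well-separated clusters.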
Experiments and Results
Validation in an artificial domain
Validation in unstructured domains
Comparison to existing systems
Real world applications
The Animal Domain

Name       Body Cover      Heart Chamber    Body Temp.    Fertilization
mammal     hair            four             regulated     internal
bird       feathers        four             regulated     internal
reptile    cornified-skin  imperfect-four   unregulated   internal
amphibian  moist-skin      three            unregulated   external
fish       scales          two              unregulated   external

[Figure: graph representation of one animal: a central "animal" vertex with edges Name -> mammal, BodyCover -> hair, HeartChamber -> four, BodyTemp -> regulated, Fertilization -> internal]
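The attribute-value rows of the animal domain can be turned into this kind of star-shaped graph mechanically. The sketch below is one plausible encoding, not Subdue's input file format: vertices are kept as a list of labels, edges as (source index, target index, label) triples, and the function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

ANIMALS: List[Dict[str, str]] = [
    {"Name": "mammal", "BodyCover": "hair", "HeartChamber": "four",
     "BodyTemp": "regulated", "Fertilization": "internal"},
    {"Name": "bird", "BodyCover": "feathers", "HeartChamber": "four",
     "BodyTemp": "regulated", "Fertilization": "internal"},
    {"Name": "reptile", "BodyCover": "cornified-skin", "HeartChamber": "imperfect-four",
     "BodyTemp": "unregulated", "Fertilization": "internal"},
    {"Name": "amphibian", "BodyCover": "moist-skin", "HeartChamber": "three",
     "BodyTemp": "unregulated", "Fertilization": "external"},
    {"Name": "fish", "BodyCover": "scales", "HeartChamber": "two",
     "BodyTemp": "unregulated", "Fertilization": "external"},
]

Edge = Tuple[int, int, str]


def animal_to_graph(record: Dict[str, str]) -> Tuple[List[str], List[Edge]]:
    """Build a star graph: an 'animal' hub with one labeled edge per attribute."""
    vertices = ["animal"]          # index 0 is the hub vertex
    edges: List[Edge] = []
    for attribute, value in record.items():
        vertices.append(value)     # one vertex per attribute value
        edges.append((0, len(vertices) - 1, attribute))
    return vertices, edges


mammal_vertices, mammal_edges = animal_to_graph(ANIMALS[0])
```

Because every record shares the hub label and the edge labels, substructures such as "BodyTemp -> regulated, HeartChamber -> four" recur across graphs and can be discovered as clusters.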
Hierarchical Clustering of the Animal Domain
Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three
Hierarchical Clustering of the Animal Domain by Cobweb
animals
  amphibian/fish
    fish
    amphibian
  mammal/bird
    mammal
    bird
  reptile
Comparison of Subdue and Cobweb
Quality of Subdue's lattice (tree): 2.60
Quality of Cobweb's tree: 1.74
Therefore Subdue's clustering scores higher
Reasons for a higher score:
– Better generalization, resulting in fewer clusters
– Eliminating overlap between (reptile) and (amphibian/fish)
Chemical Application: Clustering of a DNA sequence
Coverage
– 61%
– 68%
– 71%
[Figure: classification lattice rooted at "DNA"; the clusters are chemical substructures, including a fragment containing O=P-OH and carbon-nitrogen ring fragments]
Conclusions
Goal of hierarchical conceptual clustering of structured databases was achieved
Synthesized classification lattice
Developed new evaluation heuristic for hierarchical clusterings
Good performance in comparison to other systems, even in unstructured domains
Future Work
More experiments on real-world domains
Comparison to other systems
Incorporation of evaluation tool into Subdue