Date posted: 14-Apr-2017
Category: Education
Uploaded by: mahmoud-alfarra
Prepared by: Mahmoud Rafeek Alfarra
Seminar Program
Document Clustering and Classification
Outline
• Classification and its techniques
• Clustering and its techniques
• Document clustering
• Comparison
Classification: Definition
Given a collection of records (the training set):
– Each record contains a set of attributes; one of the attributes is the class.
– Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Classification: Definition
[Figure: the training set is fed to a learning algorithm, which learns a model (induction); the model is then applied to the test set (deduction).]

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
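The induction and deduction steps can be illustrated with a minimal sketch in Python. It reuses the ten labelled records from the training table above, holds out the last three as a stand-in test set (the real test records 11 to 15 have no known class), and uses a 1-nearest-neighbour rule on Attrib3 only. The split size, the choice of 1-nearest-neighbour and the function names are assumptions made for illustration, not part of the seminar material.

```python
# Labelled records from the training table: (Tid, Attrib1, Attrib2, Attrib3 in K, Class).
RECORDS = [
    (1, "Yes", "Large", 125, "No"),
    (2, "No", "Medium", 100, "No"),
    (3, "No", "Small", 70, "No"),
    (4, "Yes", "Medium", 120, "No"),
    (5, "No", "Large", 95, "Yes"),
    (6, "No", "Medium", 60, "No"),
    (7, "Yes", "Large", 220, "No"),
    (8, "No", "Small", 85, "Yes"),
    (9, "No", "Medium", 75, "No"),
    (10, "No", "Small", 90, "Yes"),
]


def split(records, n_train):
    """Divide the data set into a training part and a held-out test part."""
    return records[:n_train], records[n_train:]


def predict_1nn(train, attrib3):
    """Deduction: return the class of the training record closest in Attrib3.

    Assumption: only the numeric attribute is used, to keep the sketch short.
    """
    nearest = min(train, key=lambda record: abs(record[3] - attrib3))
    return nearest[4]


def accuracy(train, test):
    """Fraction of held-out records whose predicted class matches the true class."""
    correct = sum(predict_1nn(train, record[3]) == record[4] for record in test)
    return correct / len(test)


# "Induction" for a memory-based method is simply storing the training records.
train_set, test_set = split(RECORDS, 7)
print("accuracy on held-out records:", accuracy(train_set, test_set))
```

With so few records the accuracy figure says little about real performance; the point is only the workflow of building on one part of the data and validating on the other.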
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Artificial Neural Networks (ANN)
X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0
[Figure: inputs X1, X2, X3 enter a black box that produces the output Y.]
Output Y is 1 if at least two of the three inputs are equal to 1.
Artificial Neural Networks (ANN)
[Figure: the black box is replaced by input nodes X1, X2, X3, each linked to the output node Y with weight 0.3; the output node has threshold t = 0.4.]
$Y = I(0.3 X_1 + 0.3 X_2 + 0.3 X_3 - 0.4 > 0)$
where $I(z) = 1$ if $z$ is true, and $0$ otherwise.
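As a quick check, the following Python sketch evaluates this thresholded sum for all eight input combinations, using the weights 0.3 and the threshold 0.4 from the slide. The function names are chosen here for illustration.

```python
from itertools import product

WEIGHTS = (0.3, 0.3, 0.3)  # link weights from the slide
THRESHOLD = 0.4            # threshold t from the slide


def indicator(condition):
    """I(z): 1 if the condition z is true, 0 otherwise."""
    return 1 if condition else 0


def ann_output(x1, x2, x3):
    """Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0)."""
    weighted_sum = sum(w * x for w, x in zip(WEIGHTS, (x1, x2, x3)))
    return indicator(weighted_sum - THRESHOLD > 0)


# Print the full truth table computed by the network.
for inputs in product((0, 1), repeat=3):
    print(inputs, ann_output(*inputs))
```

One active input gives a weighted sum of 0.3, which stays below the threshold, while two or three active inputs give about 0.6 or 0.9, so the output is 1 exactly when at least two inputs are 1.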
Artificial Neural Networks (ANN)
Model is an assembly of inter-connected nodes and weighted links
The output node sums its input values, each weighted by the weight of its link
The weighted sum is compared against some threshold t
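[Figure: input nodes X1, X2, X3 are linked to the output node Y with weights w1, w2, w3; the output node has threshold t.]
Perceptron Model
$Y = I\left(\sum_i w_i X_i - t > 0\right)$
or
$Y = \operatorname{sign}\left(\sum_i w_i X_i - t\right)$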
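The two output conventions of the perceptron model differ only in how the thresholded sum is encoded: the indicator form returns 0 or 1, the sign form returns -1 or +1. A minimal Python sketch follows; the weights and threshold in the example are hypothetical, and mapping a sum of exactly zero to -1 is an assumption made here.

```python
def net_input(weights, inputs, threshold):
    """Weighted sum of the inputs minus the threshold t."""
    return sum(w * x for w, x in zip(weights, inputs)) - threshold


def perceptron_indicator(weights, inputs, threshold):
    """Indicator form: 1 if the net input is positive, else 0."""
    return 1 if net_input(weights, inputs, threshold) > 0 else 0


def perceptron_sign(weights, inputs, threshold):
    """Sign form: +1 if the net input is positive, else -1.

    Assumption: a net input of exactly 0 is mapped to -1.
    """
    return 1 if net_input(weights, inputs, threshold) > 0 else -1


# Hypothetical weights and threshold, for illustration only.
example_weights = (0.5, -0.2, 0.1)
example_threshold = 0.2
for example_inputs in [(1, 0, 1), (0, 1, 0)]:
    print(
        example_inputs,
        perceptron_indicator(example_weights, example_inputs, example_threshold),
        perceptron_sign(example_weights, example_inputs, example_threshold),
    )
```

Under that assumption the two forms are related by sign output = 2 × indicator output − 1.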
Clustering Definition
Clustering is a division of data into groups of similar objects.
Each group is called a cluster and consists of objects that are similar to one another and dissimilar to objects in other groups.
Clustering Definition
[Figure: data points grouped into three clusters, C1, C2 and C3.]
Document clustering
Document clustering is an automatic grouping of text documents into clusters so that documents within a cluster have high similarity in comparison to one another, but are dissimilar to documents in other clusters.
The challenge
The problem of document clustering is how to organize a large set of documents on various topics into a satisfactory arrangement. It can be stated as follows:
Given: A huge set of documents on various topics (shared, related, totally different).
Required: Group the documents into a number of clusters such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized.
The challenge
[Figure: three document clusters; intra-cluster similarity is measured between documents of the same cluster, inter-cluster similarity between documents of different clusters.]
Inter-Cluster Sim. < Intra-Cluster Sim.
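To make the two quantities concrete, the sketch below computes an average intra-cluster and inter-cluster similarity for a toy grouping in Python. Cosine similarity over raw term counts is assumed here as the similarity measure; the slides do not fix one, and the example documents and cluster names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy clusters, for illustration only.
CLUSTERS = {
    "animals": ["cat ate cheese", "cat ate mouse", "mouse ate cheese"],
    "computing": ["python runs code", "compiler runs code"],
}


def vectorize(text):
    """Represent a document as a bag of words with term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def mean(values):
    return sum(values) / len(values) if values else 0.0


def intra_cluster_similarity(clusters):
    """Average similarity over all document pairs that share a cluster."""
    return mean([
        cosine(vectorize(a), vectorize(b))
        for docs in clusters.values()
        for a, b in combinations(docs, 2)
    ])


def inter_cluster_similarity(clusters):
    """Average similarity over all document pairs from different clusters."""
    return mean([
        cosine(vectorize(a), vectorize(b))
        for docs_a, docs_b in combinations(clusters.values(), 2)
        for a in docs_a
        for b in docs_b
    ])


intra = intra_cluster_similarity(CLUSTERS)
inter = inter_cluster_similarity(CLUSTERS)
print("intra:", intra, "inter:", inter)
```

A good clustering pushes the first number up and the second down; in this toy grouping the two clusters share no words, so the inter-cluster average is zero.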
The Clustering Process
[Figure: document samples pass through four numbered stages, producing clusters and, finally, knowledge.]
1. Document Data Model Representation
• Document Cleaning
• Feature Selection or Extraction
2. Clustering Algorithm
• Similarity Measure
• Criterion of Clustering
3. Cluster Validation
• External Indices
• Internal Indices
4. Results Interpretation
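The stages of the process can be sketched end to end in a few lines of Python. Everything specific below is an assumption chosen for brevity: the stop-word list, the sample documents and their reference labels are hypothetical, a single-pass "leader" clustering with a cosine threshold stands in for the clustering algorithm, and purity stands in for an external validation index.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "too"}  # hypothetical stop-word list


def clean(text):
    """Stage 1a, document cleaning: lowercase, drop punctuation and stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def represent(tokens):
    """Stage 1b, feature extraction: bag-of-words term counts."""
    return Counter(tokens)


def cosine(a, b):
    """Stage 2, similarity measure: cosine similarity of two count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def leader_clustering(vectors, threshold=0.5):
    """Stage 2, clustering: join the first cluster whose leader is similar enough,
    otherwise start a new cluster with this document as its leader."""
    leaders, assignments = [], []
    for vec in vectors:
        for cluster_id, leader in enumerate(leaders):
            if cosine(vec, leader) >= threshold:
                assignments.append(cluster_id)
                break
        else:
            leaders.append(vec)
            assignments.append(len(leaders) - 1)
    return assignments


def purity(assignments, labels):
    """Stage 3, external validation: share of documents carrying the majority
    reference label of their cluster."""
    by_cluster = {}
    for cluster_id, label in zip(assignments, labels):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(labels)


# Hypothetical document samples with reference labels.
SAMPLES = [
    ("The cat ate the cheese.", "animals"),
    ("A cat ate a mouse!", "animals"),
    ("Python runs the code.", "computing"),
    ("The compiler runs code, too.", "computing"),
]
vectors = [represent(clean(text)) for text, _ in SAMPLES]
assignments = leader_clustering(vectors)
print("clusters:", assignments)
print("purity:", purity(assignments, [label for _, label in SAMPLES]))
```

Stage 4, results interpretation, is left to the reader: it amounts to inspecting which documents landed together and whether the grouping is meaningful.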
Clustering Techniques
Clustering methods in general can be viewed from different perspectives. Those most widely applied in the text domain are:
Hierarchical Clustering
Partitioning Clustering
Neural Network based Clustering
Clustering Techniques
Suffix Tree Clustering algorithm
D1: cat ate cheese
D2: mouse ate cheese too
D3: cat ate mouse too
If $\frac{|B_m \cap B_n|}{|B_m|} > 0.5$ and $\frac{|B_m \cap B_n|}{|B_n|} > 0.5$
Then base clusters $B_m$ and $B_n$ are merged.
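The overlap test for combining base clusters in Suffix Tree Clustering can be sketched in Python as follows. The base clusters are written out by hand from the phrases shared by documents D1 to D3 rather than built from an actual suffix tree, and the strict "greater than 0.5" comparison follows the usual statement of the algorithm; both are assumptions of this sketch.

```python
from itertools import combinations

# Base clusters for D1-D3: each shared phrase maps to the set of documents
# containing it (derived by hand; no suffix tree is constructed here).
BASE_CLUSTERS = {
    "cat ate": {1, 3},
    "ate": {1, 2, 3},
    "cheese": {1, 2},
    "mouse": {2, 3},
    "too": {2, 3},
    "ate cheese": {1, 2},
}


def should_merge(b_m, b_n, threshold=0.5):
    """True if the overlap covers more than `threshold` of both base clusters.

    Assumes both base clusters are non-empty.
    """
    overlap = len(b_m & b_n)
    return overlap / len(b_m) > threshold and overlap / len(b_n) > threshold


# List every pair of base clusters that passes the overlap test.
for (phrase_m, docs_m), (phrase_n, docs_n) in combinations(BASE_CLUSTERS.items(), 2):
    if should_merge(docs_m, docs_n):
        print(f"merge '{phrase_m}' with '{phrase_n}'")
```

For example, "cat ate" {D1, D3} and "ate" {D1, D2, D3} share two documents, which is all of the first cluster and two thirds of the second, so they pass the test. "cat ate" and "cheese" share only D1, exactly half of each, so they do not.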
Clustering Techniques
Document Index Graph for clustering (DIG)
Clustering Techniques
Graph-based growing hierarchical SOM
Comparison
Thanks