Clustering
By: Avshalom Katz
We will be talking about…
• What is Clustering?
• Different Kinds of Clustering
• What is DBSCAN?
• Pseudocode
• Example of Clustering
• Definitions of parameters
• Complexity
What is Clustering?
• Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.
Different types of Clustering
• Biology
• Information retrieval
• Climate
• Business
• Clustering for utility
• Summarization
Example
DIFFERENT KINDS OF CLUSTERS
Well-separated
Prototype-based
Graph-based
Density-based
Shared property (conceptual clusters)
DBSCAN - Introduction
Density-Based Spatial Clustering of Applications with Noise
• Since databases came into widespread use, the amount of information we work with has been growing rapidly. As a result, automatic algorithms are now being applied in almost every field.
Database Example
Density-Based Spatial Clustering of Applications with Noise
There are four main steps in the algorithm, and the algorithm takes two parameters:
• 1. The minimum number of points required for a dense region (MINPTS).
• 2. The distance around a point within which the density is checked (EPS).
Definition 1
• The Eps-neighborhood of a point P, written Neps(P), is the set of all points adjacent to P. A point is called “adjacent” to P only if the distance between it and P is at most EPS.
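The neighborhood lookup in Definition 1 can be sketched in a few lines of Python. This is only an illustration: the function name `region_query`, the use of Euclidean distance and the sample coordinates are assumptions, not part of the slides.

```python
import math

def region_query(points, p, eps):
    """Return Neps(p): every point whose distance to p is at most eps.

    The point p itself is included, because its distance to itself is 0.
    """
    return [q for q in points if math.dist(p, q) <= eps]

# Hypothetical sample data: three points close together and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.8), (3.0, 3.0)]
neighbors = region_query(points, (0.0, 0.0), eps=1.0)
print(neighbors)
```

This linear scan compares p with every point in the set; a spatial index would answer the same query faster on large data.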
Definition 2
• Directly density-reachable: a point p is directly density-reachable from a point q if p is included in Neps(q) and q is a core point, that is, the size of Neps(q) is at least MINPTS.
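The check in Definition 2 is usually called "directly density-reachable" in the DBSCAN literature. A minimal sketch follows, assuming Euclidean distance; the helper names and the sample points are illustrative only. It also shows that the relation is not symmetric when one of the two points is not a core point.

```python
import math

def region_query(points, p, eps):
    """Neps(p): all points within distance eps of p, p included."""
    return [x for x in points if math.dist(p, x) <= eps]

def is_core(points, q, eps, min_pts):
    """A core point has at least min_pts points in its Eps-neighborhood."""
    return len(region_query(points, q, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in Neps(q) and q is a core point."""
    return math.dist(p, q) <= eps and is_core(points, q, eps, min_pts)

# Hypothetical data: `core` has four neighbors, `border` has only two.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.4, 0.0)]
core, border = (0.5, 0.0), (1.4, 0.0)

forward = directly_density_reachable(points, border, core, eps=1.0, min_pts=3)
backward = directly_density_reachable(points, core, border, eps=1.0, min_pts=3)
print(forward, backward)
```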
Definition 3
• Density-reachable: the point p is density-reachable from point q if there is a sequence of points in which the first is q and the last is p, and every point in the sequence is directly density-reachable from the one before it.
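Density-reachability can be tested by following chains of neighborhoods outward from q and continuing only through core points. The sketch below uses a breadth-first search for this; the function names, the search strategy and the sample points are assumptions chosen for illustration.

```python
import math
from collections import deque

def region_query(points, p, eps):
    """Neps(p): all points within distance eps of p, p included."""
    return [x for x in points if math.dist(p, x) <= eps]

def density_reachable(points, p, q, eps, min_pts):
    """True if a chain of directly density-reachable steps leads from q to p."""
    seen = {q}
    queue = deque([q])
    while queue:
        current = queue.popleft()
        neighbors = region_query(points, current, eps)
        if len(neighbors) < min_pts:
            continue  # not a core point, so the chain cannot continue through it
        for n in neighbors:
            if n == p:
                return True
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return False

# Hypothetical data: five points on a line plus one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (10.0, 10.0)]
end_from_core = density_reachable(points, (4.0, 0.0), (1.0, 0.0), eps=1.0, min_pts=3)
core_from_end = density_reachable(points, (1.0, 0.0), (4.0, 0.0), eps=1.0, min_pts=3)
print(end_from_core, core_from_end)
```

The end point (4, 0) has only two points in its neighborhood, so it can be reached from a core point but no chain can start from it.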
Definition 4
• Density-connected: two points P and Q are density-connected if there is a single point O from which both are density-reachable, possibly in different directions. For example, in the diagram P and Q are both density-reachable from O; therefore, P and Q are density-connected.
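A direct way to test Definition 4 is to try every point as a candidate O and ask whether both P and Q are density-reachable from it. The sketch below does exactly that; it is a brute-force illustration with assumed names and sample data, not an efficient procedure.

```python
import math

def region_query(points, p, eps):
    """Neps(p): all points within distance eps of p, p included."""
    return [x for x in points if math.dist(p, x) <= eps]

def reachable_from(points, o, eps, min_pts):
    """Set of all points that are density-reachable from o."""
    reached, expanded, stack = set(), set(), [o]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue
        expanded.add(current)
        neighbors = region_query(points, current, eps)
        if len(neighbors) < min_pts:
            continue  # only core points extend the chain
        for n in neighbors:
            reached.add(n)
            if n not in expanded:
                stack.append(n)
    return reached

def density_connected(points, p, q, eps, min_pts):
    """True if some point o exists from which both p and q are density-reachable."""
    for o in points:
        reached = reachable_from(points, o, eps, min_pts)
        if p in reached and q in reached:
            return True
    return False

# Hypothetical data: the two end points of the line are not core points,
# yet both can be reached from the core points between them.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (10.0, 10.0)]
ends_connected = density_connected(points, (0.0, 0.0), (4.0, 0.0), eps=1.0, min_pts=3)
outlier_connected = density_connected(points, (0.0, 0.0), (10.0, 10.0), eps=1.0, min_pts=3)
print(ends_connected, outlier_connected)
```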
Definition 5
• A cluster C with respect to EPS and MINPTS is a non-empty subset of the database that satisfies the two conditions below:
1. If p is a member of C, q is density-reachable from p, and the size of Neps(p) is at least MINPTS, then q is also a member of C.
2. If p and q are both members of C, then p and q are density-connected to each other.
Definition 6
• The database is divided into clusters; each point that does not belong to any cluster is called “noise”.
[Figure: example point set with points labeled A to V, the radius ε, and a legend entry for noise]
DBSCAN (Eps = ε, MinPts = 3)
• number of adjacent: 5; stack: B,C,D,E,F; current ClusterId: green
• number of adjacent: 8; stack: C,D,E,F,G,H,I; current ClusterId: green
• number of adjacent: 8; stack: D,E,F,G,H,I; current ClusterId: green
• number of adjacent: 7; stack: E,F,G,H,I; current ClusterId: green
• number of adjacent: 9; stack: F,G,H,I,J; current ClusterId: green
• number of adjacent: 9; stack: G,H,I,J; current ClusterId: green
• number of adjacent: 6; stack: H,I,J; current ClusterId: green
• number of adjacent: 7; stack: I,J; current ClusterId: green
• number of adjacent: 7; stack: J; current ClusterId: green
• number of adjacent: 5; stack: (empty); current ClusterId: green
• number of adjacent: (not shown); stack: (empty); current ClusterId: purple
• number of adjacent: 0; stack: (empty); current ClusterId: purple
• number of adjacent: 3; stack: O,P,Q; current ClusterId: purple
• number of adjacent: 2; stack: P,Q; current ClusterId: purple
• number of adjacent: 5; stack: Q,R,S,T; current ClusterId: purple
• number of adjacent: 1; stack: (empty); current ClusterId: purple
Pseudocode of the algorithm
DBSCAN (SetOfPoints, Eps, MinPts)
  // SetOfPoints is UNCLASSIFIED
  ClusterId := nextId(NOISE);
  FOR i FROM 1 TO SetOfPoints.size DO
    Point := SetOfPoints.get(i);
    IF Point.ClId = UNCLASSIFIED THEN
      IF ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) THEN
        ClusterId := nextId(ClusterId)
      END IF
    END IF
  END FOR
END; // DBSCAN

ExpandCluster(SetOfPoints, Point, ClId, Eps, MinPts) : Boolean;
  seeds := SetOfPoints.regionQuery(Point, Eps);
  IF seeds.size < MinPts THEN // no core point
    SetOfPoints.changeClId(Point, NOISE);
    RETURN False;
  ELSE // all points in seeds are density-reachable from Point
    SetOfPoints.changeClIds(seeds, ClId);
    seeds.delete(Point);
    WHILE seeds <> Empty DO
      currentP := seeds.first();
      result := SetOfPoints.regionQuery(currentP, Eps);
      IF result.size >= MinPts THEN
        FOR i FROM 1 TO result.size DO
          resultP := result.get(i);
          IF resultP.ClId IN {UNCLASSIFIED, NOISE} THEN
            IF resultP.ClId = UNCLASSIFIED THEN
              seeds.append(resultP);
            END IF;
            SetOfPoints.changeClId(resultP, ClId);
          END IF; // UNCLASSIFIED or NOISE
        END FOR;
      END IF; // result.size >= MinPts
      seeds.delete(currentP);
    END WHILE; // seeds <> Empty
    RETURN True;
  END IF
END; // ExpandCluster
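The pseudocode above translates almost line by line into Python. The version below is a minimal sketch under a few assumptions: points are coordinate tuples, distance is Euclidean, labels are kept in a separate list indexed like the points, cluster ids start at 0, and noise is marked with -1. These representation choices and the sample data are not part of the slides.

```python
import math

UNCLASSIFIED = None  # label of a point that has not been visited yet
NOISE = -1           # label of a point that belongs to no cluster

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def expand_cluster(points, labels, i, cluster_id, eps, min_pts):
    """Grow a cluster from point i; return False if i is not a core point."""
    seeds = region_query(points, i, eps)
    if len(seeds) < min_pts:              # no core point
        labels[i] = NOISE
        return False
    for j in seeds:                       # all seeds are density-reachable from i
        labels[j] = cluster_id
    seeds.remove(i)
    while seeds:
        current = seeds.pop(0)
        result = region_query(points, current, eps)
        if len(result) >= min_pts:        # current is a core point
            for j in result:
                if labels[j] is UNCLASSIFIED:
                    seeds.append(j)       # expand from this point later
                    labels[j] = cluster_id
                elif labels[j] == NOISE:
                    labels[j] = cluster_id  # former noise becomes a border point
    return True

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is UNCLASSIFIED:
            if expand_cluster(points, labels, i, cluster_id, eps, min_pts):
                cluster_id += 1
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (20.0, 20.0)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

With these sample points the first four end up in one cluster, the next three in a second cluster, and the isolated point is labeled as noise.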
Example
Choosing the value of parameter EPS for a given MINPTS:
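The figure for this slide is not available. One common way to pick EPS once MINPTS is fixed is the sorted k-distance heuristic: compute each point's distance to its k-th nearest neighbor, sort the values in descending order, and look for the place where they drop sharply. The sketch below assumes that this is the heuristic meant here; the function name, the choice of k and the sample data are illustrative.

```python
import math

def sorted_k_distances(points, k):
    """Distance from each point to its k-th nearest other point, sorted descending."""
    k_dists = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sorted(k_dists, reverse=True)

# Hypothetical data: two dense groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (20.0, 20.0)]
k_dists = sorted_k_distances(points, k=2)
for d in k_dists:
    print(round(d, 2))
```

In this sample the first value is far larger than the rest, which suggests that the isolated point is noise and that an EPS just above the remaining values (for example 1.5) separates clustered points from noise.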
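The complexity
A region query issued by ExpandCluster() costs O(log n) on average on a database of size n when a spatial index (such as an R*-tree) is available, and at most one region query is run for each of the n points, so the overall complexity is O(n log n). Without an index, each region query scans all n points and the overall complexity is O(n²).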
Bibliography
• Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. SIGMOD Rec., 28(2):49-60.
• Clustering. (2010, April 19). In Wikipedia, The Free Encyclopedia. Retrieved 14:14, April 19, 2010, from http://en.wikipedia.org/w/index.php?title=Clustering&oldid=357078594
• Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise.
• Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1995). A Database Interface for Clustering in Large Spatial Databases. Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining, Montreal, Canada, AAAI Press, 1995.
• Schikuta, E. and Erhart, M. (1997). The BANG-clustering system: Grid-based data analysis. Proc. Sec. Int. Symp. IDA-97, Vol. 1280 LNCS, London, UK, Springer-Verlag.