Clustering

By: Avshalom Katz

We will be talking about…

• What is Clustering?
• Different Kinds of Clustering
• What is DBSCAN?
• Pseudocode
• Example of Clustering
• Definitions of parameters
• Complexity

What is Clustering?

• Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.

Different types of Clustering

• Biology
• Information retrieval
• Climate
• Business
• Clustering for utility
• Summarization

Example

DIFFERENT KINDS OF CLUSTERS

Well Separated

Prototype based

Graph based

Density based

Shared property (conceptual clusters)

DBSCAN: Introduction

Density-Based Spatial Clustering of Applications with Noise

• Since society started using databases, the amount of information we handle has been growing exponentially. As a result, automatic algorithms are being introduced into every field.

Database Example

Density-Based Spatial Clustering of Applications with Noise

The algorithm has four main steps and takes two parameters:

• 1. The minimum number of points required for a dense neighborhood (MINPTS).

• 2. The radius of the neighborhood in which the density is checked (EPS).

Definition 1

• Eps-neighborhood: find all points adjacent to a point P. A point counts as “adjacent” to P only if its distance from P is at most EPS. All adjacent points are collected into Neps(P).
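The neighborhood lookup of Definition 1 can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the slide author's code: the name `region_query` is borrowed from the pseudocode later in the deck, the points are assumed to be coordinate tuples, and the comparison uses `<=` as in the original DBSCAN paper (swap in `<` for a strict bound).

```python
import math

def region_query(points, p, eps):
    """Return Neps(p): every point whose distance from p is at most eps.

    p itself is part of the result because its distance to itself is 0.
    """
    return [q for q in points if math.dist(p, q) <= eps]

# Hypothetical sample data, chosen only for illustration
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.8), (3.0, 3.0)]
print(region_query(points, (0.0, 0.0), 1.0))
```

A linear scan like this costs one distance computation per point; a spatial index would replace the list comprehension in a real database setting.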

Definition 2

• Directly density-reachable: a point p is directly density-reachable from a point q if p is included in Neps(q) and q is a core point, that is, the size of Neps(q) is at least MINPTS.
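A small sketch of the Definition 2 check, assuming the convention of the DBSCAN paper and of the pseudocode's `seeds.size<MinPts` test: the core condition applies to q and means at least MinPts points in its neighborhood. The helper names and the sample points are illustrative assumptions.

```python
import math

def neighborhood(points, p, eps):
    """Neps(p): all points within distance eps of p (p included)."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    """A core point has at least min_pts points in its Eps-neighborhood."""
    return len(neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in Neps(q) and q is a core point."""
    return p in neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)

# Hypothetical data: `core` has four points in range, `border` only two
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.2, 0.0)]
core, border = (0.5, 0.0), (1.2, 0.0)
print(directly_density_reachable(points, border, core, 0.8, 3))
print(directly_density_reachable(points, core, border, 0.8, 3))
```

The two calls differ because the relation is not symmetric: a border point can be reached from a core point, but not the other way around.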

Definition 3

• Density-reachable: a point p is density-reachable from a point q if there is a sequence of points whose first point is q and whose last point is p, such that every point in the sequence is directly density-reachable from the one before it.
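Density-reachability in Definition 3 amounts to following a chain of core points. The sketch below is one possible way to test it, using a breadth-first search that only expands core points; the function names and the sample line of points are assumptions made for illustration, and the case p = q is not treated specially.

```python
import math
from collections import deque

def neighborhood(points, p, eps):
    """Neps(p): all points within distance eps of p (p included)."""
    return [q for q in points if math.dist(p, q) <= eps]

def density_reachable(points, p, q, eps, min_pts):
    """True if a chain of directly density-reachable steps leads from q to p."""
    visited = {q}
    queue = deque([q])
    while queue:
        current = queue.popleft()
        neighbors = neighborhood(points, current, eps)
        if len(neighbors) < min_pts:
            continue  # not a core point, so the chain cannot continue here
        if p in neighbors:
            return True
        for r in neighbors:
            if r not in visited:
                visited.add(r)
                queue.append(r)
    return False

# Hypothetical data: four points on a line plus one far away
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(density_reachable(points, (3.0, 0.0), (1.0, 0.0), 1.0, 3))
print(density_reachable(points, (1.0, 0.0), (3.0, 0.0), 1.0, 3))
```

With Eps = 1 and MinPts = 3, only the two middle points are core points, so the end point (3, 0) can be reached from (1, 0) but cannot start a chain itself.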

Definition 4

• Density-connected: a point p is density-connected to a point q if there is a single point o from which both p and q are density-reachable, possibly in different directions. For example, in the diagram P and Q are both density-reachable from O. Therefore, P and Q are density-connected.
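Definition 4 can be tested directly by searching for a common point o. The following is a brute-force sketch under the same assumptions as the definitions (Euclidean distance, core condition of at least MinPts points); the function names and sample points are hypothetical.

```python
import math
from collections import deque

def neighborhood(points, p, eps):
    """Neps(p): all points within distance eps of p (p included)."""
    return [q for q in points if math.dist(p, q) <= eps]

def density_reachable(points, p, q, eps, min_pts):
    """True if p can be reached from q through a chain of core points."""
    visited = {q}
    queue = deque([q])
    while queue:
        current = queue.popleft()
        neighbors = neighborhood(points, current, eps)
        if len(neighbors) < min_pts:
            continue  # only core points extend the chain
        if p in neighbors:
            return True
        for r in neighbors:
            if r not in visited:
                visited.add(r)
                queue.append(r)
    return False

def density_connected(points, p, q, eps, min_pts):
    """True if some point o reaches both p and q."""
    return any(
        density_reachable(points, p, o, eps, min_pts)
        and density_reachable(points, q, o, eps, min_pts)
        for o in points
    )

# Hypothetical data: `left` and `right` are border points of the same chain
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
left, right, far = points[0], points[3], points[4]
print(density_connected(points, left, right, 1.0, 3))
print(density_connected(points, left, far, 1.0, 3))
```

Neither border point is density-reachable from the other, yet they are density-connected through the core points between them, which is why the cluster definition relies on this symmetric relation.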

Definition 5

• A cluster C with respect to EPS and MINPTS is a non-empty subset of the database that satisfies the following two conditions:

1. If p is a member of C, q is density-reachable from p, and |Neps(p)| ≥ MINPTS, then q is also a member of C.

2. If p and q are both members of C, then p and q are density-connected to each other.

Definition 6

• Given the set of clusters, every point that does not belong to any cluster is called “noise”.

[Figure: example data set with points labeled A to V and the radius ε marked; the legend identifies noise points.]

DBSCAN (Eps = ε, MinPts = 3)

• number of adjacent: 5; stack: B, C, D, E, F; current ClusterId: green
• number of adjacent: 8; stack: C, D, E, F, G, H, I; current ClusterId: green
• number of adjacent: 8; stack: D, E, F, G, H, I; current ClusterId: green
• number of adjacent: 9; stack: F, G, H, I, J; current ClusterId: green
• number of adjacent: 7; stack: E, F, G, H, I; current ClusterId: green
• number of adjacent: 9; stack: G, H, I, J; current ClusterId: green
• number of adjacent: 6; stack: H, I, J; current ClusterId: green
• number of adjacent: 7; stack: I, J; current ClusterId: green
• number of adjacent: 7; stack: J; current ClusterId: green
• number of adjacent: 5; stack: (empty); current ClusterId: green
• number of adjacent: (blank); stack: (empty); current ClusterId: purple
• number of adjacent: 0; stack: (empty); current ClusterId: purple
• number of adjacent: 3; stack: O, P, Q; current ClusterId: purple
• number of adjacent: 2; stack: P, Q; current ClusterId: purple
• number of adjacent: 5; stack: Q, R, S, T; current ClusterId: purple
• number of adjacent: 1; stack: (empty); current ClusterId: purple

Pseudocode of the algorithm

DBSCAN (SetOfPoints, Eps, MinPts)
  // SetOfPoints is UNCLASSIFIED
  ClusterId := nextId(NOISE);
  FOR i FROM 1 TO SetOfPoints.size DO
    Point := SetOfPoints.get(i);
    IF Point.ClId = UNCLASSIFIED THEN
      IF ExpandCluster(SetOfPoints, Point, ClusterId, Eps, MinPts) THEN
        ClusterId := nextId(ClusterId)
      END IF
    END IF
  END FOR
END; // DBSCAN

ExpandCluster(SetOfPoints, Point, ClId, Eps, MinPts) : Boolean;
  seeds := SetOfPoints.regionQuery(Point, Eps);
  IF seeds.size < MinPts THEN // no core point
    SetOfPoints.changeClId(Point, NOISE);
    RETURN False;
  ELSE // all points in seeds are density-reachable from Point
    SetOfPoints.changeClIds(seeds, ClId);
    seeds.delete(Point);
    WHILE seeds <> Empty DO
      currentP := seeds.first();
      result := SetOfPoints.regionQuery(currentP, Eps);
      IF result.size >= MinPts THEN
        FOR i FROM 1 TO result.size DO
          resultP := result.get(i);
          IF resultP.ClId IN {UNCLASSIFIED, NOISE} THEN
            IF resultP.ClId = UNCLASSIFIED THEN
              seeds.append(resultP);
            END IF;
            SetOfPoints.changeClId(resultP, ClId);
          END IF; // UNCLASSIFIED or NOISE
        END FOR;
      END IF; // result.size >= MinPts
      seeds.delete(currentP);
    END WHILE; // seeds <> Empty
    RETURN True;
  END IF
END; // ExpandCluster
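For readers who want to run the pseudocode, here is a minimal Python sketch that follows it step by step. Several details are assumptions made for illustration: points are coordinate tuples, labels are kept in a separate list indexed like the points, cluster ids are the integers 0, 1, 2, ..., noise is marked with -1, and `region_query` is a brute-force scan rather than an indexed lookup.

```python
import math

UNCLASSIFIED = None
NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def expand_cluster(points, labels, i, cluster_id, eps, min_pts):
    """Grow a cluster from point i; return False if i is not a core point."""
    seeds = region_query(points, i, eps)
    if len(seeds) < min_pts:  # no core point
        labels[i] = NOISE
        return False
    for j in seeds:  # all seeds are density-reachable from point i
        labels[j] = cluster_id
    seeds.remove(i)
    while seeds:
        current = seeds.pop(0)
        result = region_query(points, current, eps)
        if len(result) >= min_pts:  # current is a core point
            for j in result:
                if labels[j] is UNCLASSIFIED:
                    seeds.append(j)  # still has to be expanded later
                    labels[j] = cluster_id
                elif labels[j] == NOISE:
                    labels[j] = cluster_id  # former noise becomes a border point
    return True

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is UNCLASSIFIED:
            if expand_cluster(points, labels, i, cluster_id, eps, min_pts):
                cluster_id += 1
    return labels

# Hypothetical data: two small squares and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Each square becomes its own cluster and the isolated point is left as noise. The brute-force `region_query` makes this version quadratic in the number of points.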

Example

Determining the value of the parameter EPS from MINPTS:
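One heuristic for this, described in the original DBSCAN paper, is the sorted k-dist graph: compute each point's distance to its k-th nearest neighbor, sort these distances in descending order, and look for the first sharp bend as a candidate EPS. Whether the slide's missing figure showed exactly this is an assumption. The sketch below only computes the sorted distances; the function name, the choice of k, and the sample data are illustrative.

```python
import math

def sorted_k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, largest first."""
    k_dists = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sorted(k_dists, reverse=True)

# Hypothetical data: three close points and one outlier
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
print(sorted_k_distances(points, 2))  # [9.0, 2.0, 2.0, 1.0]
```

Here the jump from 9.0 down to 2.0 separates the outlier from the dense points, so a value of EPS around 2 would be a plausible candidate for this toy data.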

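The complexity

• With a spatial index such as an R*-tree, a single region query on a database of size n costs O(log n) on average. DBSCAN runs at most one region query for each of the n points, so its average run time is O(n * log n). Without an index, each region query scans all n points and the total becomes O(n^2).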

