DATA MINING AND MACHINE LEARNING - Lecture 8: Unsupervised learning

Transcript

Page 1

DATA MINING AND MACHINE LEARNING
Lecture 8: Unsupervised learning

Lecturer: Simone Scardapane

Academic Year 2016/2017

Page 2

Table of contents

Dimensionality reduction
- Principal component analysis
- Autoencoder networks
- Dictionary learning
- t-Stochastic Neighbor Embedding

Clustering
- K-means clustering
- Gaussian mixture model
- DBSCAN
- Hierarchical clustering

Page 3

Dimensionality reduction

Dimensionality reduction is the problem of reducing the number of features for a model. Feature selection does this by keeping the original features, while feature extraction provides new features.

The benefits are similar to those of feature selection, i.e.:

- We can remove redundant / noisy features in our dataset, thus generalizing better.

- Training is faster.

- Inference is faster, which is a benefit in many embedded / mobile devices.

- For reduction to 2D / 3D spaces, we obtain a tool for visualizing the data.

Page 4

Assumptions for PCA

The most famous dimensionality reduction technique is principal component analysis (PCA), also known as the Karhunen-Loève transformation.

PCA is a linear technique, seeking the ‘optimal’ projection matrix P such that:

Z = XP . (1)

Each row in Z is a projected point, while the columns of P are called the principal components (PCs). By varying the dimensionality of P, we can choose to keep more or less information on the original data matrix X.

Page 5

The assumptions of PCA

Apart from linearity, we need to make some assumptions in order to define what is an optimal projection:

- To maximize the information contained in each transformed feature, we impose that P is orthogonal.

- The importance of a feature can be measured by its variance in the dataset.

A simple pseudocode for PCA is thus:

1. Find the transformation z_{i1} = p_1^T x_i maximizing the variance of the points z_{i1}.

2. Find the transformation p_2, orthogonal to p_1, such that the points z_{i2} have maximum variance.

3. Iterate step 2 until the desired number of components is reached.

Page 6

Visualization of PCA

Figure 1: https://sebastianraschka.com/faq/docs/lda-vs-pca.html

Page 7

Extracting the first principal component

Denoting by x̄ the empirical mean of X, the sample variance of the dataset after projection is given by:

1/(N−1) Σ_{i=1}^N (p_1^T x_i − p_1^T x̄)² = p_1^T S p_1 , (2)

where:

S = 1/(N−1) Σ_{i=1}^N (x_i − x̄)(x_i − x̄)^T . (3)

This is trivially maximized by letting ‖p_1‖ → ∞. To avoid this trivial solution, we further impose that p_1 is a unit vector:

p_1^T p_1 = 1 . (4)

Page 8

Solving for the first PC

Using a Lagrange multiplier λ_1, we want to maximize:

p_1* = arg max { p_1^T S p_1 − λ_1 (p_1^T p_1 − 1) } . (5)

Setting the gradient with respect to p_1 equal to zero, we obtain:

S p_1 = λ_1 p_1 , (6)

meaning that the optimal projection must be an eigenvector of the covariance matrix S. If we left-multiply by p_1^T and exploit the unit-norm condition:

p_1^T S p_1 = λ_1 . (7)

Page 9

Solving for the second PC

The first PC is then the eigenvector of S corresponding to the largest eigenvalue λ_1 of the covariance matrix S.

To find the second PC p_2, we impose the further condition that p_1^T p_2 = 0, using a second Lagrange multiplier φ:

p_2* = arg max { p_2^T S p_2 − λ_2 (p_2^T p_2 − 1) − φ p_2^T p_1 } . (8)

By the optimality condition for p_2 we obtain:

S p_2 − λ_2 p_2 − φ p_1 = 0 . (9)

Page 10

The final PCA algorithm

Left-multiplying (9) by p_1^T, and using p_1^T p_2 = 0 (which also implies p_1^T S p_2 = λ_1 p_1^T p_2 = 0) together with p_1^T p_1 = 1, we obtain φ = 0, so that:

S p_2 = λ_2 p_2 , (10)

meaning that p_2 is the eigenvector of S corresponding to the second-largest eigenvalue.

This generalizes to further PCs, which are all the eigenvectors of S, ordered in importance by their corresponding eigenvalues. The value λ_i / Σ_{j=1}^d λ_j is called the ‘explained variance’ of the i-th PC.
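As a concrete illustration, the whole procedure reduces to an eigendecomposition of S. A minimal NumPy sketch (the function name and the toy data are mine, not from the slides) computing the PCs, the projections Z = XP, and the explained variances:

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the sample covariance matrix S."""
    X_centered = X - X.mean(axis=0)              # remove the empirical mean
    S = np.cov(X_centered, rowvar=False)         # covariance matrix S, Eq. (3)
    eigvals, eigvecs = np.linalg.eigh(S)         # eigh returns ascending eigenvalues for symmetric S
    order = np.argsort(eigvals)[::-1]            # sort PCs by decreasing eigenvalue
    P = eigvecs[:, order[:n_components]]         # principal components as columns of P
    explained = eigvals[order][:n_components] / eigvals.sum()
    return X_centered @ P, P, explained          # Z = XP, the PCs, and the explained variances

# Toy usage on random data
X = np.random.randn(100, 5)
Z, P, ev = pca(X, n_components=2)
print(Z.shape, ev)
```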

Page 11

PCA as minimizing the projection error

Let us back-project the points Z into the original space:

X̂ = Z P^T . (11)

PCA can be derived in an alternative fashion, as the projection matrix P minimizing the mean-squared reconstruction error given by:

J(P) = ‖X − X̂‖₂² , (12)

under the orthonormality condition on P.

This has motivated a wide range of extensions of the standard PCA algorithm, which attempt to modify the previous cost function or to impose additional regularization terms on P.
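The same quantities are available off the shelf in scikit-learn; a minimal usage sketch (the toy data and parameter values are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(200, 10)           # toy data matrix
pca = PCA(n_components=3)
Z = pca.fit_transform(X)               # projected points Z
X_rec = pca.inverse_transform(Z)       # back-projection (adds back the mean)
print(pca.explained_variance_ratio_)   # explained variance of each PC
print(np.mean((X - X_rec) ** 2))       # mean-squared reconstruction error, as in Eq. (12)
```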

Page 12

Table of contents

Dimensionality reduction
- Principal component analysis
- Autoencoder networks
- Dictionary learning
- t-Stochastic Neighbor Embedding

Clustering
- K-means clustering
- Gaussian mixture model
- DBSCAN
- Hierarchical clustering

Page 13

PCA as a compression algorithm

Consider the full expression of the reconstructed dataset:

X̂ = X P P^T . (13)

The matrix P can be seen as an ‘encoding’ matrix, compressing each point x_i into a lower-dimensional vector z_i. The matrix P^T works as a decoder, undoing the compression step.

We can generalize this using a nonlinear encoding function g, and a nonlinear decoding function h:

x̂_i = h(g(x_i)) . (14)

Page 14

Autoencoder networks

Practically, g and h can be implemented as two layers in a generic neural network, whose weights can be found by minimizing the reconstruction error:

g(x_i) = z_i = tanh(W_1 x_i) , (15)

h(z_i) = x̂_i = tanh(W_2 z_i) . (16)

The resulting scheme is called an autoencoder neural network. By removing the nonlinearities, an autoencoder finds the same subspace as PCA, but the projection matrix is no longer orthonormal. With the nonlinearities, the autoencoder can find more expressive transformations.

Page 15

Optimization problem for an autoencoder

Practically, we can minimize:

J(·) = ‖X − X̂‖₂² + C ‖vect{W_1}‖₂² + C ‖vect{W_2}‖₂² . (17)

In many implementations, it is common to further impose W_2 = W_1^T, in analogy to what happens in PCA. This is called an autoencoder with tied weights, and it can simplify the training phase.

An overcomplete version of the autoencoder (the denoising autoencoder) was essential in the first wave of deep learning architectures.
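A minimal PyTorch sketch of Eqs. (15)-(17); the framework, layer sizes, and hyperparameters are illustrative choices and not part of the slides (weight decay plays the role of the C penalty, and tied weights are omitted for brevity):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Single-hidden-layer autoencoder with tanh encoder/decoder, Eqs. (15)-(16)."""
    def __init__(self, d, k):
        super().__init__()
        self.enc = nn.Linear(d, k, bias=False)   # W1
        self.dec = nn.Linear(k, d, bias=False)   # W2
    def forward(self, x):
        z = torch.tanh(self.enc(x))              # g(x) = tanh(W1 x)
        return torch.tanh(self.dec(z))           # h(z) = tanh(W2 z)

X = 0.5 * torch.randn(256, 20)                   # toy data, roughly within the tanh output range
model = Autoencoder(d=20, k=5)
opt = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-4)

for epoch in range(200):
    opt.zero_grad()
    loss = ((model(X) - X) ** 2).mean()          # reconstruction error
    loss.backward()
    opt.step()
```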

Page 16

Architecture of an autoencoder network

Figure 2: Generic structure of an autoencoder network [Wikipedia].

Page 17

Table of contents

Dimensionality reduction
- Principal component analysis
- Autoencoder networks
- Dictionary learning
- t-Stochastic Neighbor Embedding

Clustering
- K-means clustering
- Gaussian mixture model
- DBSCAN
- Hierarchical clustering

Page 18

Signal decomposition

PCA can be seen as extracting from the ‘signal’ x_i its underlying components (the PCs), such that x_i can be represented as a set of coefficients with respect to this new basis.

In practice, it is natural to think of a sparse representation, where each x_i can be made by combining only a few components from a large set, and not all components are used for all the patterns.

Under a linearity assumption, the resulting model is called dictionary learning. If the components are already known, it is called the sparse coding problem. It significantly generalizes PCA, and it has become a fundamental component of many recent machine learning applications.

Page 19

Decomposition with a dictionary

Consider the data matrix X ∈ R^{N×d}. We want to represent each x_i as a linear combination of some elements taken from the rows of a dictionary matrix D ∈ R^{k×d}, with k ≫ d being the number of atoms in the dictionary.

Given x_i, we represent it with a sparse coefficient vector z_i ∈ R^k, such that:

X ≈ ZD , (18)

where each row of Z ∈ R^{N×k} is a coefficient vector.

Page 20

Dictionary learning problem

Extending our previous discussion, we would like to solve the following dictionary learning problem:

min_{D,Z} { ‖X − ZD‖₂² + C ‖Z‖₁ } . (19)

In general, we can also constrain each row of D to have norm less than a constant c, in order to avoid degenerate solutions. The previous problem is convex with respect to D, convex with respect to Z, but it is not jointly convex with respect to the two variables.

Common methods to solve the previous problem consider alternating minimization between the two sets of variables.

Page 21

Solving with respect to the coefficients

For a fixed dictionary D, we need to solve N independent problems of the form:

min_{z_i} { ‖x_i − D^T z_i‖₂² + C ‖z_i‖₁ } , (20)

which is basically a LASSO formulation.

Note that once the optimal dictionary is found, the previous problem must be solved for any new point we wish to represent in our dictionary. This is in stark contrast with PCA, where a single matrix multiplication is needed.
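A usage sketch with scikit-learn's DictionaryLearning (sizes and the regularization value are arbitrary); note that transform solves one LASSO-type problem per point, which illustrates the contrast with PCA mentioned above:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

X = np.random.randn(500, 64)                 # toy data, e.g. flattened 8x8 patches

dl = DictionaryLearning(n_components=100,    # k atoms
                        alpha=1.0,           # sparsity weight, playing the role of C
                        max_iter=50,
                        transform_algorithm='lasso_lars')
Z = dl.fit_transform(X)                      # sparse codes, shape (N, k)
D = dl.components_                           # dictionary, shape (k, d), so that X ≈ Z D
print(np.mean(Z != 0))                       # fraction of non-zero coefficients
```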

Page 22

Solving with respect to the dictionary

For the dictionary problem, we obtain a standard least-squares problem, where we use k Lagrange multipliers to enforce the constraints:

min_D { Tr{ (X − ZD)^T (X − ZD) } + Σ_{i=1}^k α_i ( ‖d_i‖₂² − c ) } , (21)

where d_i denotes the i-th row of D. This can be optimized efficiently using any second-order method over the dual variables, and solving the previous problem analytically.

This is only a sketch of a basic algorithm to solve the dictionary problem; in practice, more advanced algorithms can be implemented, based on stochastic updates of both variables (see the links at the end of the lecture).

Page 23

Example of dictionary learning

Figure 3: Example of learning image patches from raccoon images [https://lijiancheng0614.github.io/scikit-learn/auto_examples/decomposition/plot_image_denoising.html].

Page 24

Table of contents

Dimensionality reduction
- Principal component analysis
- Autoencoder networks
- Dictionary learning
- t-Stochastic Neighbor Embedding

Clustering
- K-means clustering
- Gaussian mixture model
- DBSCAN
- Hierarchical clustering

Page 25

The problem of data visualization

The previous techniques are effective in reducing dimensionality for, e.g., providing the new features as input to a classifier. For data visualization, we have the additional constraint that the features must lie in a 2D or 3D space.

In this case, alternative methods are generally preferred. Instead of looking for a representation having small reconstruction error, the idea underlying most of these methods is to find a low-dimensional representation such that most pairwise distances among points (thus, the topological structure of the data) are preserved.

In the remainder of the section, we describe a state-of-the-art method called t-Distributed Stochastic Neighbor Embedding (t-SNE).

Page 26

Modeling pairwise distance

Denote by p_{j|i} the probability that x_i has x_j as a neighbor, if neighbors are extracted according to a probability distribution centered on x_i:

p_{j|i} = exp(−‖x_i − x_j‖₂² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖₂² / 2σ_i²) , (22)

where the scales σ_1, ..., σ_N are for now free parameters. By definition we set p_{i|i} = 0. The joint distribution is obtained by symmetrizing the conditional distributions:

p_{ij} = (p_{j|i} + p_{i|j}) / (2N) . (23)

Page 27

Setting the scale parameters

The scales σ_1, ..., σ_N model the locality of the information which is considered. Denote by P_i the conditional distribution with respect to a single point i. The perplexity is defined as:

Perp(P_i) = 2^{H(P_i)} , (24)

where H(P_i) is the entropy of P_i:

H(P_i) = −Σ_{j=1}^N p_{j|i} log₂ p_{j|i} . (25)

Note that the entropy is maximal when neighbors are chosen uniformly at random, and it gets lower as the distribution concentrates on a few points very close to x_i.

Page 28

Setting perplexity

In a sense, the perplexity can be understood as a ‘smooth’ measure of the number of neighbors we consider when computing distances.

The algorithm works by setting the scales such that the perplexity equals a number predefined by the user. Common values (deriving from the original t-SNE paper) are between 5 and 50.

The reasons for using the perplexity instead of directly the number of neighbors are several, including its smoothness and a good level of robustness.
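In scikit-learn, the perplexity is exposed directly as a parameter (internally, each σ_i is found by binary search so that the perplexity of P_i matches the requested value). A minimal sketch with arbitrary toy data:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.randn(300, 50)                 # toy high-dimensional data
Z = TSNE(n_components=2,                     # 2-D embedding for visualization
         perplexity=30.0,                    # user-defined perplexity (typical range 5-50)
         init='pca',
         random_state=0).fit_transform(X)
print(Z.shape)                               # (300, 2)
```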

Page 29

Distribution of the projected points

The basic idea of t-SNE is to model the distribution Q of pairwise distances between the transformed points z_i, which are then found by making Q and P as similar as possible.

For high-dimensional spaces, using Gaussian distributions for Q gives rise to the so-called crowding problem: moderate distances in the original space require points very far apart in the low-dimensional space.

t-SNE solves this by using a distribution with heavier tails, namely, a t-Student distribution with one degree of freedom:

q_{ij} = (1 + ‖z_i − z_j‖₂²)^{−1} / Σ_{k≠l} (1 + ‖z_k − z_l‖₂²)^{−1} . (26)

Page 30

Visualizing the t-Student distribution

[Figure: probability q_{ij} as a function of the distance between points, for a normal and a t-Student distribution.]

Figure 4: Comparison of a normal distribution and a t-Student distribution with one degree of freedom.

Page 31

The optimization problem of t-SNE

The Kullback-Leibler divergence is an information-theoretic measure, which quantifies the dissimilarity between two probability distributions:

J(z_1, ..., z_N) = KL(P ‖ Q) = Σ_{i=1}^N Σ_{j=1}^N p_{ij} log(p_{ij} / q_{ij}) . (27)

Optimizing J(·) with respect to z_1, ..., z_N gives us the desired projections. A single gradient term is given by:

∂J/∂z_i = 4 Σ_{j=1}^N (p_{ij} − q_{ij}) (z_i − z_j) (1 + ‖z_i − z_j‖₂²)^{−1} . (28)
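A bare-bones NumPy sketch of gradient descent on Eq. (27), using the gradient of Eq. (28); it assumes the matrix P of Eq. (23) has already been computed, and it omits the tricks (early exaggeration, momentum) used by practical implementations:

```python
import numpy as np

def tsne_step(Z, P, lr=100.0):
    """One gradient-descent step on the KL divergence, following Eq. (28)."""
    diff = Z[:, None, :] - Z[None, :, :]              # pairwise differences z_i - z_j
    W = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))      # (1 + ||z_i - z_j||^2)^(-1)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                                   # low-dimensional affinities q_ij, Eq. (26)
    grad = 4.0 * np.sum((P - Q)[:, :, None] * W[:, :, None] * diff, axis=1)
    return Z - lr * grad

# Usage sketch with a random (symmetric, normalized) P just to exercise the code
N = 100
P = np.random.rand(N, N); P = P + P.T; np.fill_diagonal(P, 0.0); P /= P.sum()
Z = 1e-4 * np.random.randn(N, 2)                      # random initialization of the embedding
for _ in range(500):
    Z = tsne_step(Z, P)
```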

Page 32

Visualizing t-SNE

For an interactive visualization of t-SNE, refer to the following blog post: https://colah.github.io/posts/2014-10-Visualizing-MNIST/

Page 33

Table of contents

Dimensionality reduction
- Principal component analysis
- Autoencoder networks
- Dictionary learning
- t-Stochastic Neighbor Embedding

Clustering
- K-means clustering
- Gaussian mixture model
- DBSCAN
- Hierarchical clustering

Page 34

Clustering problem

Clustering is the problem of partitioning a set of data {x_i}_{i=1}^N into K clusters (or segments), such that points in the same cluster are more similar to each other than to points in different clusters.

Clustering is a problem which is less well defined than supervised learning or dimensionality reduction; for this reason, choosing a correct approach and evaluating its results are more difficult.

It is possible to think of clustering as an extreme form of dimensionality reduction, where we want to describe each point using a single value, i.e., the index of its corresponding cluster.

Page 35

Objective function of K-means

Each of the K clusters is represented by a centroid μ_k. A point belongs to the cluster corresponding to the closest centroid.

We use new variables r_ik, such that:

r_ik = 1 if x_i belongs to cluster k, 0 otherwise. (29)

Given this, the objective of K-means is to find a cluster configuration such that the distance between points and centroids is minimized:

J(·) = Σ_{i=1}^N Σ_{k=1}^K r_ik ‖x_i − μ_k‖₂² . (30)

Page 36

Iterative solution of K-means

If we fix the centroids, the points can be trivially assigned to the nearest centroid as:

r_ik = 1 if k = arg min_j ‖x_i − μ_j‖₂², 0 otherwise. (31)

If we fix instead the assignments, J(·) can be analytically minimized as:

μ_k = ( Σ_{i=1}^N r_ik x_i ) / ( Σ_{i=1}^N r_ik ) . (32)

K-means iterates between the previous two updates until convergence, which is guaranteed to a local minimum (each step decreases the objective function).
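The two updates translate directly into a few lines of NumPy (a sketch with toy data; the initialization strategy is the simplest possible one):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: alternate the assignment step (31) and the centroid update (32)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]               # centroids initialized on random points
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances to the centroids
        labels = d2.argmin(axis=1)                              # assignment step
        for k in range(K):                                      # update step
            if np.any(labels == k):
                mu[k] = X[labels == k].mean(axis=0)
    return labels, mu

X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [4, 4], [0, 5])])
labels, mu = kmeans(X, K=3)
```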

Page 37

An example of K-means clustering

[Figure: two-dimensional scatter plot of the points to be clustered.]

Figure 5: A simple set of points to cluster. Note that, in practice, the colors are not available even during training.

Page 38

An example of K-means clustering (2)

[Figure: three scatter plots showing the evolution of the clusters: (a) Initialization, (b) 1 iteration, (c) 5 iterations.]

Figure 6: With a correct choice of K, the algorithm converges to a ‘good’ assignment very rapidly (except for specific choices of the initial clusters).

Page 39

An example of K-means clustering (3)

[Figure: three scatter plots showing the evolution of the clusters: (a) Initialization, (b) 1 iteration, (c) 5 iterations.]

Figure 7: The algorithm works just as well with a different choice of K. In this case, the clusters are not as easy to interpret visually.

Page 40

Evaluation and model selection in clustering

The K of K-means is a hyper-parameter, similar to the k of k-NN or the regularization factors in supervised models. However, model selection (e.g., grid search) and evaluation are not as straightforward for clustering as for classification/regression.

There are two broad families of evaluation methods:

- Internal evaluation: indexes that evaluate the cohesion and separation of the clusters using only the data itself.

- External evaluation: a human judge provides a set of correct cluster assignments (a gold standard) to compare against.

One can also use data visualization techniques in order to obtain some visual, empirical confirmation.

Page 41

Internal indexes

As an example of an internal evaluation index, the Davies-Bouldin index evaluates a clustering partition as:

DB = (1/K) Σ_{k=1}^K max_{j≠k} ( (σ_k + σ_j) / ‖μ_k − μ_j‖₂² ) , (33)

where σ_k is the average distance of the points in cluster k from μ_k.

Note how the index balances intra-cluster scatter (the σ terms) against the distance between centroids.
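A direct NumPy implementation of Eq. (33) as written above (note that some references define the index with the unsquared distance between centroids):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index, Eq. (33); lower values indicate better partitions."""
    ks = np.unique(labels)
    mus = np.array([X[labels == k].mean(axis=0) for k in ks])                  # centroids
    sigmas = np.array([np.linalg.norm(X[labels == k] - mus[i], axis=1).mean()  # avg distance to centroid
                       for i, k in enumerate(ks)])
    K = len(ks)
    db = 0.0
    for i in range(K):
        db += max((sigmas[i] + sigmas[j]) / np.linalg.norm(mus[i] - mus[j]) ** 2
                  for j in range(K) if j != i)
    return db / K

# Toy usage with a crude two-cluster partition
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels = np.array([0] * 50 + [1] * 50)
print(davies_bouldin(X, labels))
```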

Page 42

Example of DB index

[Figure: DB index as a function of the number of clusters.]

Figure 8: Note how internal indexes can give spurious results, like in this case with K=11.

Page 43

Confusion matrix for clustering

Given instead a gold standard, one can compute a confusion matrix over pairs of points as follows:

- True positive (TP): pairs of points which are in the same cluster in both the clustering and the gold standard.

- True negative (TN): pairs of points which are in different clusters in both partitions.

- False negative (FN): pairs of points which are in different clusters, but in the same partition in the gold standard.

- False positive (FP): pairs of points which are in the same cluster, but in different partitions in the gold standard.

Page 44

Indexes for external evaluation

The equivalent of accuracy in clustering is the Rand index, defined as:

R = (TP + TN) / (TP + TN + FP + FN) . (34)

Alternatively, one can use precision, recall, F-measures, and so on (see the lecture on classifier evaluation).

In practice, one has to combine several indexes (both internal and external), particularly when working in high-dimensional spaces, in order to correctly assess the clustering procedure.
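A small pair-counting implementation of the four quantities and of Eq. (34) (the labels are arbitrary toy values); scikit-learn also offers a chance-corrected variant, adjusted_rand_score:

```python
from itertools import combinations

def rand_index(pred, gold):
    """Rand index, Eq. (34), computed by counting pairs of points."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1                      # same cluster in both partitions
        elif not same_pred and not same_gold:
            tn += 1                      # different clusters in both partitions
        elif same_pred:
            fp += 1                      # same cluster, but different in the gold standard
        else:
            fn += 1                      # different clusters, but same in the gold standard
    return (tp + tn) / (tp + tn + fp + fn)

print(rand_index([0, 0, 1, 1, 2], [0, 0, 1, 1, 1]))
```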

Page 45

Table of contents

Dimensionality reduction
- Principal component analysis
- Autoencoder networks
- Dictionary learning
- t-Stochastic Neighbor Embedding

Clustering
- K-means clustering
- Gaussian mixture model
- DBSCAN
- Hierarchical clustering

Page 46

Generalizing K-means

Despite its wide popularity, K-means is severely limited in what it can represent. As an example, all clusters are assumed to be perfectly spherical around their centroid, which is hardly true in practice.

One way to relax this assumption is to consider a generic Gaussian shape with mean μ_k and covariance Σ_k for each cluster.

If we also replace the hard assignments r_ik with probabilistic assignments, we obtain the very popular Gaussian mixture model (GMM) algorithm.

Page 47

GMM formulation

Formally, we suppose that our data is generated according to a superposition of K Gaussian distributions as follows:

p(x) = Σ_{k=1}^K π_k N(x | μ_k, Σ_k) , (35)

where each Gaussian is called a component of the distribution. The weights π_1, ..., π_K are called mixing weights, and can be seen as the prior probabilities of assigning a point to a given cluster. By the rules of probability, they must satisfy:

Σ_{k=1}^K π_k = 1 , 0 ≤ π_k ≤ 1 . (36)

Page 48

Maximum likelihood for GMM

In order to fit the parameters of the GMM, we consider maximum likelihood estimation:

log p(X | π, μ, Σ) = Σ_{i=1}^N log { Σ_{k=1}^K π_k N(x_i | μ_k, Σ_k) } . (37)

In principle, we can solve the previous problem via gradient descent or similar optimization tools. In practice, a different method is used, which closely resembles the optimization of K-means: the expectation-maximization (EM) algorithm.

Page 49

EM intuition

EM has a wide applicability for probabilistic models. Here, we only see its informal application to GMMs.

Similarly to K-means, EM works by alternating expectation steps (E steps), where we consider the most probable cluster assignments for the points, with maximization steps (M steps), where we consider the most probable configuration for the parameters.

A quantity that will appear repeatedly is the responsibility of component k, defined as:

γ_k(x) = π_k N(x | μ_k, Σ_k) / Σ_{j=1}^K π_j N(x | μ_j, Σ_j) . (38)

Page 50

Responsibilities and M step

The responsibility can be seen as the posterior probability of assigning point x to cluster k, given the prior probability π_k and the value of the k-th component.

In a sense, responsibilities generalize the hard assignments r_ik of K-means. The E step of EM in this case simply recomputes the responsibilities for each point and cluster using the previous equation.

The M step is more complex than in K-means, because we need to maximize the log probability with respect to all three sets of parameters.

Page 51

Equations for the M step

Maximizing the log probability with respect to the mean μ_k we obtain:

μ_k = (1/N_k) Σ_{i=1}^N γ_k(x_i) x_i , (39)

where:

N_k = Σ_{i=1}^N γ_k(x_i) . (40)

Note the similarity with respect to K-means, where the mean is replaced with a weighted average of the points. For the k-th covariance Σ_k, we similarly obtain:

Σ_k = (1/N_k) Σ_{i=1}^N γ_k(x_i) (x_i − μ_k)(x_i − μ_k)^T . (41)

Page 52

Equations for the M step (2)

The optimal values for the mixing coefficients can be obtained by solving:

max log p(X | π, μ, Σ) + λ ( Σ_{k=1}^K π_k − 1 ) . (42)

Setting the gradient to zero, and after some rearrangements, we obtain an intuitive expression for the mixing coefficients:

π_k = N_k / N . (43)

The EM algorithm works by alternating the three maximization steps with the computation of the new responsibilities.
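A compact NumPy/SciPy sketch of one full EM iteration (E step plus the three M-step updates); the data, K, and the initialization are toy choices, and no check for the singularities discussed later in the lecture is included:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a GMM: E step (Eq. 38) followed by the M step (Eqs. 39-41, 43)."""
    N, K = X.shape[0], len(pis)
    dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                            for k in range(K)])
    gamma = dens / dens.sum(axis=1, keepdims=True)        # responsibilities gamma_k(x_i)
    Nk = gamma.sum(axis=0)                                # effective number of points per component
    mus = (gamma.T @ X) / Nk[:, None]                     # Eq. (39)
    Sigmas = [(gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k]
              for k in range(K)]                          # Eq. (41)
    pis = Nk / N                                          # Eq. (43)
    return pis, mus, Sigmas

# Toy usage with K = 2 components
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])
pis = np.ones(2) / 2
mus = X[np.random.choice(len(X), 2, replace=False)]
Sigmas = [np.eye(2) for _ in range(2)]
for _ in range(50):
    pis, mus, Sigmas = em_step(X, pis, mus, Sigmas)
```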

Page 53

Example of application of GMM and EM

Figure 9: Taken from (Bishop, 2006).

Page 54

Singularities for GMM

Differently from K-means, GMM can have some singular solutions whenever a mean μ_k coincides with a data point x_i. Considering for simplicity a covariance Σ_k = σ_k² I, we have:

p(x_i | μ_k, Σ_k) ∝ 1/σ_k² . (44)

A maximum likelihood solution will then drive σ_k² → 0, making the log probability diverge.

Any real application of EM has to contain some check for these singular solutions, together with a reinitialization of the corresponding component.

Page 55

Table of contents

Dimensionality reduction
- Principal component analysis
- Autoencoder networks
- Dictionary learning
- t-Stochastic Neighbor Embedding

Clustering
- K-means clustering
- Gaussian mixture model
- DBSCAN
- Hierarchical clustering

Page 56

Families of clustering algorithms

K-means is a prototype-based clustering algorithm, which defines each cluster in terms of a single centroid point. GMM is instead a distribution-based algorithm, which fits the most likely parameters of a given probability distribution.

Density-based models try to avoid the need to specify K by constructing a nonparametric density model from the data. In this case, close points are considered to be a cluster, while isolated points are considered outliers.

The most prominent density-based clustering algorithm is DBSCAN, briefly described next.

Page 57

DBSCAN

In DBSCAN, we iteratively inspect all the points in our database and mark each of them as one of three possibilities:

1. The point is a core point if it has at least P points within a distance ε. Both P and ε are specified by the user.

2. Points which are in the neighborhood of a core point, but are not core points themselves, are said to be reachable points.

3. All other points are outliers.

The union of all core and reachable points which are connected forms a cluster.
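A usage sketch with scikit-learn's DBSCAN, where eps corresponds to ε and min_samples to P in the notation above (toy data and parameter values are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.vstack([0.3 * np.random.randn(100, 2),            # a dense blob
               0.3 * np.random.randn(100, 2) + 3,        # another dense blob
               np.random.uniform(-2, 5, size=(20, 2))])  # scattered noise
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(np.unique(db.labels_))                  # cluster indices; -1 marks outliers
print((db.labels_ == -1).sum(), "outliers")
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True     # distinguishes core points from reachable (border) points
```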

Page 58

Visualizing DBSCAN

Figure 10: Red: core points. Yellow: reachable points. Blue: outliers. [Taken from Wikipedia]

Page 59

Advantages and disadvantages of DBSCAN

DBSCAN works best when P and ε can be selected with somedomain knowledge. A strong advantage of DBSCAN is that itis able to detect some level of noise in the data using the notionof outliers.

Sometimes setting the parameters is extremely difficult. This isparticularly true for datasets with high dimensionality (due tothe curse of dimensionality), and for datasets whose dimensionshave different scalings.

Despite this, DBSCAN can find clusters with extremely irregularshapes, standing apart from K-means and GMM.

Page 60

Table of contents

Dimensionality reduction
- Principal component analysis
- Autoencoder networks
- Dictionary learning
- t-Stochastic Neighbor Embedding

Clustering
- K-means clustering
- Gaussian mixture model
- DBSCAN
- Hierarchical clustering

Page 61

Hierarchical clustering

Some clustering strategies seek to build a hierarchy of clusters, as opposed to a flat set of them. They can be roughly classified into two categories:

- Agglomerative clustering: we start with ≈ N clusters (typically one per point), and iteratively merge them until a proper termination condition is met.

- Divisive clustering: we start with a single cluster, and iteratively partition one or more clusters to get smaller ones.

Hierarchical clustering can be particularly useful in some contexts, but it tends to be less scalable than the alternative flat clustering strategies.

Page 62

Examples of hierarchical clustering

A baseline for agglomerative clustering is single-linkage clustering, where at every iteration we merge the two clusters containing the closest pair of points.

A common way to do divisive clustering, instead, is bisecting K-means, where at each iteration we partition the biggest cluster with a run of K-means with K = 2.

The resulting hierarchical data structure is called a dendrogram.
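A sketch of single-linkage clustering and of the resulting dendrogram with SciPy (toy data; the cut into two flat clusters is just for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
Z = linkage(X, method='single')                    # single-linkage agglomerative clustering
labels = fcluster(Z, t=2, criterion='maxclust')    # cut the dendrogram into 2 flat clusters

dendrogram(Z)                                      # plot the hierarchy as a dendrogram
plt.show()
```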

Page 63

Visualizing a dendrogram

Figure 11: An example of a dendrogram on genetic data (https://dienekes.blogspot.it/2010/12/human-genetic-variation-first.html).

Page 64

Further readings

Some material for this lecture is taken from Chapter 14 of the book.

A fantastic introduction to low-rank models for dimensionality reduction:

[1] Udell, M., Horn, C., Zadeh, R. and Boyd, S., 2016. Generalized low rank models. Foundations and Trends® in Machine Learning, 9(1), pp. 1-118.

On dictionary learning and t-SNE:

[2] Mairal, J., Bach, F., Ponce, J. and Sapiro, G., 2009. Online dictionary learning for sparse coding. In Proc. 26th ICML (pp. 689-696).

[3] Maaten, L.V.D. and Hinton, G., 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), pp. 2579-2605.

