
Dipartimento di Informatica e Scienze dell'Informazione


Kernel Methods for Unsupervised Learning

by

Francesco Camastra

Theses Series DISI-TH-2004-XX

DISI, Università di Genova

v. Dodecaneso 35, 16146 Genova, Italy http://www.disi.unige.it/


Università degli Studi di Genova

Dipartimento di Informatica e

Scienze dell’Informazione

Dottorato di Ricerca in Informatica

Ph.D. Thesis in Computer Science

Kernel Methods for Unsupervised Learning

by

Francesco Camastra

March, 2004


Dottorato di Ricerca in Informatica
Dipartimento di Informatica e Scienze dell'Informazione

Università degli Studi di Genova

DISI, Università di Genova
Via Dodecaneso 35
I-16146 Genova, Italy
http://www.disi.unige.it/

Ph.D. Thesis in Computer Science

Submitted by Francesco Camastra
DISI, Università di Genova
[email protected]

Date of submission: March 31st, 2004

Title: Kernel Methods for Unsupervised Learning

Advisor and Supervisor: Prof. Alessandro Verri
DISI, Università di Genova
[email protected]

Ext. Reviewers: Prof. Marcello Pelillo
Dipartimento di Informatica, Università Ca' Foscari di Venezia

[email protected]

Dr. Massimiliano Pontil
Department of Computer Science, University College London

[email protected]


Abstract

Kernel Methods are algorithms that project input data into a new space (the Feature Space) by a nonlinear mapping. In this thesis we have investigated Kernel Methods for Unsupervised Learning, namely Kernel Methods that do not require target values for the data. Two classical unsupervised learning problems have been tackled with Kernel Methods: Data Dimensionality Estimation and Clustering.

The dimensionality of a data set, called its Intrinsic Dimension (ID), is the minimum number of free variables needed to represent the data without information loss. Kernel PCA has been investigated and compared, as an ID estimator, with a classical dimensionality estimation method, Principal Component Analysis (PCA). The study has been carried out both on a synthetic data set of known dimensionality and on real data benchmarks, i.e. the MIT-CBCL Face database and an Indoor-Outdoor Image database. The investigations have shown that Kernel PCA performs better than PCA, as a dimensionality estimator, only when the adopted kernel is very close to the function that generated the data set; otherwise Kernel PCA can perform even worse than PCA.

With regard to the Clustering problem, we have proposed a novel Kernel Method, the Kernel Grower [Cam04]. Unlike several classical clustering algorithms, the convergence of Kernel Grower is guaranteed, since it is an Expectation-Maximization algorithm. Its main quality, which distinguishes it from the clustering algorithms published in the literature, is that it produces nonlinear separation surfaces among the data. Kernel Grower compares favorably with popular clustering algorithms, namely K-MEANS, Neural Gas and Self-Organizing Maps, on a synthetic data set and on two UCI real data benchmarks, i.e. the IRIS data and the Wisconsin breast cancer database. Kernel Grower is the main original result of the thesis.


To my family.


In meinen freien Stunden, deren ich viele habe, bin ich meinen Fall durchgegangen und habe darüber nachgedacht, wie die Welt der Wissenschaft, zu der ich mich selber nicht mehr zähle, ihn zu beurteilen haben wird. Selbst ein Wollhändler muß, außer billig einkaufen und teuer verkaufen, auch noch darum besorgt sein, daß der Handel mit Wolle unbehindert vor sich gehen kann. Der Verfolg der Wissenschaft scheint mir diesbezüglich besondere Tapferkeit zu erheischen. Sie handelt mit Wissen, gewonnen durch Zweifel. Wissen verschaffend über alles für alle, trachtet sie, Zweifler zu machen aus allen.¹ (B. Brecht, Leben des Galilei)

¹ Nel tempo che ho libero - e ne ho, di tempo libero - mi è avvenuto di rimeditare il mio caso e di domandarmi come sarà giudicato da quel mondo della scienza al quale non credo più di appartenere. Anche un venditore di lana, per quanto abile sia ad acquistarla a buon prezzo per poi rivenderla cara, deve preoccuparsi che il commercio della lana possa svolgersi liberamente. Quanto a questo mi pare che la pratica della scienza richieda particolare coraggio. Essa tratta il sapere, che è un prodotto del dubbio; e col procacciare sapere a tutti su ogni cosa, tende a destare il dubbio in tutti. (Italian translation by E. Castellani.)


Acknowledgements

Firstly, I wish to thank Professor Alessandro Verri for his valuable advice and encouragement. I also wish to thank all the people at DISI for the friendship they have shown me during my Ph.D.


Table of Contents

Chapter 1  Introduction
  1.1  Taxonomy of Machine Learning Research
    1.1.1  Rote Learning
    1.1.2  Learning from instruction
    1.1.3  Learning by analogy
  1.2  Learning from examples
    1.2.1  Supervised Learning
    1.2.2  Reinforcement Learning
  1.3  Unsupervised Learning
  1.4  Kernel Methods
    1.4.1  Statistical Learning Theory
  1.5  The dissertation aim
    1.5.1  Multidimensional Scaling
    1.5.2  Clustering
  1.6  How to read the dissertation

Chapter 2  Mathematical Foundations of Kernel Methods
  2.1  Introduction
  2.2  Scalar Products, Norms and Metrics
  2.3  Positive Definite Kernels and Matrices
    2.3.1  How to make a Mercer kernel
  2.4  Conditionally Positive Definite Kernels and Matrices
  2.5  Negative Definite Kernels and Matrices
  2.6  Relations between Positive and Negative Definite Kernels
    2.6.1  Infinitely Divisible Kernels
  2.7  Metric computation by Mercer kernels
  2.8  Hilbert Space Representation of Positive Definite Kernels
  2.9  Conclusions

Chapter 3  Kernel Methods for Supervised Learning
  3.1  Introduction
  3.2  Lagrange Method and Kuhn-Tucker Theorem
    3.2.1  Lagrange Multipliers Method
    3.2.2  Kuhn-Tucker Theorem
  3.3  Support Vector Machines for Classification
    3.3.1  Optimal hyperplane algorithm
    3.3.2  Support Vector Machine Construction
    3.3.3  Algorithmic Approaches
  3.4  Support Vector Machines for Regression
    3.4.1  Regression with Quadratic ε-insensitive loss
    3.4.2  Kernel Ridge Regression
    3.4.3  Regression with Linear ε-insensitive loss
    3.4.4  Other Approaches to Support Vector Regression
  3.5  Gaussian Processes
    3.5.1  Regression with Gaussian Processes
  3.6  Kernel Fisher Discriminant
    3.6.1  Linear Fisher Discriminant
    3.6.2  Fisher Discriminant in Feature Space
  3.7  Conclusions

Chapter 4  Data Dimensionality Estimation Methods
  4.1  Introduction
  4.2  Local methods
    4.2.1  Fukunaga-Olsen's algorithm
    4.2.2  The Near Neighbor Algorithm
    4.2.3  TRN-based methods
  4.3  Global Methods
    4.3.1  Projection techniques
    4.3.2  Multidimensional Scaling Methods
  4.4  Fractal-Based Methods
    4.4.1  Box-Counting Dimension
    4.4.2  Correlation Dimension
    4.4.3  Methods of Estimation of Fractal Dimension
    4.4.4  Limitations of Fractal Methods
  4.5  Applications
  4.6  Conclusions

Chapter 5  Data Dimensionality Estimation with Kernels
  5.1  Introduction
  5.2  Kernel PCA Overview
    5.2.1  Centering in Feature Space
  5.3  A Case Study
    5.3.1  Trivial Kernel Perturbations
  5.4  Experiments on the Indoor-Outdoor Image Database
  5.5  Experiments on the MIT-CBCL Face Database
  5.6  Conclusions

Chapter 6  Clustering Data Methods
  6.1  Introduction
  6.2  Expectation and Maximization algorithm
    6.2.1  Basic EM
  6.3  Basic Definitions and Facts
    6.3.1  Codebooks and Codevectors
    6.3.2  Quantization Error
    6.3.3  Entropy Maximization
    6.3.4  Winner-Takes-All Learning
  6.4  K-MEANS
    6.4.1  Batch K-MEANS
    6.4.2  Online K-MEANS
  6.5  Self-Organizing Maps
    6.5.1  SOM drawbacks
  6.6  Neural Gas and Topology Representing Network
    6.6.1  Neural Gas
    6.6.2  Topology Representing Network
    6.6.3  Drawbacks
  6.7  General Topographic Mapping
    6.7.1  Latent Variables
    6.7.2  Optimization by EM Algorithm
    6.7.3  GTM versus SOM
  6.8  Fuzzy Clustering algorithms
    6.8.1  FCM
  6.9  Conclusion

Chapter 7  Clustering with Kernels
  7.1  Introduction
  7.2  Algorithms with Kernelised Metric
    7.2.1  Drawbacks
  7.3  Support Vector Clustering
    7.3.1  Optimization
    7.3.2  Optimization algorithm
  7.4  Kernel Grower algorithm
  7.5  Experiments on the Delta Set
  7.6  Experiments on the Iris Data
  7.7  Experiments on the Wisconsin Breast Cancer Database
  7.8  KG Critical Evaluation
    7.8.1  KG Qualities
    7.8.2  KG Drawbacks
  7.9  Conclusion

Chapter 8  Conclusion
  8.1  Data Dimensionality Estimation
  8.2  Clustering
  8.3  Future Work

Bibliography


Chapter 1

Introduction

The ability to learn is one of the distinctive attributes of intelligent behavior. Following Carbonell et al., "Learning processes include the acquisition of new declarative knowledge, the development of motor and cognitive skills through instruction or practice, the organization of new knowledge into general, effective representations, and the discovery of new facts and theories through observation and experimentation" [Car83].
The study and computer modeling of learning processes in their multiple manifestations constitutes the topic of Machine Learning. Machine Learning has been developed around three primary research lines:

• Task-Oriented Studies, that is, the development of learning systems to improve performance in a predetermined set of tasks.

• Cognitive Simulation, namely the investigation and computer simulation of human learning processes.

• Theoretical Analysis, i.e. the theoretical investigation of possible learning methods and algorithms independently of the application domain.

1.1 Taxonomy of Machine Learning Research

This section presents a taxonomy of Machine Learning, introducing useful criteria for classifying and comparing most Machine Learning investigations. Although Machine Learning systems can be classified from different viewpoints [Car83], a common choice is to classify them on the basis of the underlying learning strategy. In Machine Learning two entities, the teacher and the learner, play a crucial role. The teacher is the entity that has the required knowledge to perform a given task; the learner is the entity that has to learn this knowledge in order to perform the task.


We can distinguish learning strategies by the amount of inference the learner performs on the information provided by the teacher. Consider the two extreme cases, namely performing no inference and performing a remarkable amount of inference. If a computer system (the learner) is programmed directly, its knowledge increases but it performs no inference, since all the cognitive effort is supplied by the programmer (the teacher). On the other hand, if a system independently discovers new theories or invents new concepts, it must perform a very substantial amount of inference: it is deriving organized knowledge from experiments and observations. An intermediate case could be a student determining how to solve a math problem by analogy to problem solutions contained in a textbook. This process requires inference, but much less than discovering a new theorem in Mathematics.
As the amount of inference that the learner is capable of performing increases, the burden on the teacher decreases. The taxonomy of Machine Learning below tries to capture this trade-off between the amount of effort required of the learner and of the teacher [Car83]. Hence we can identify four different learning types: Rote Learning, Learning from instruction, Learning by analogy and Learning from examples. The first three learning types are described below, while the next section is devoted to the last type.

1.1.1 Rote Learning

Rote Learning consists in the direct implanting of new knowledge in the learner. No inference or other transformation of the knowledge is required on the part of the learner. Variants of this method include:

• Learning by being programmed or modified by an external entity. It requires no effort on the part of the learner. For instance, the usual style of computer programming.

• Learning by memorization of given facts and data, with no inferences drawn from the incoming information. For instance, primitive database systems.

1.1.2 Learning from instruction

Learning from instruction (or Learning by being told) consists in acquiring knowledge from a teacher or another organized source, such as a textbook; it requires that the learner transform the knowledge from the input language into an internal representation. The new information is integrated with prior knowledge for effective use. The learner is required to perform some inference, but a large fraction of the cognitive burden remains with the teacher, who


must present and organize knowledge in a way that incrementally increases the learner's actual knowledge. Learning from instruction mimics education methods. Therefore, the Machine Learning task is to build a system that can accept instruction and can store and apply the learned knowledge effectively. Systems that use learning from instruction are described in [Haa83][Mos83][Ryc83].

1.1.3 Learning by analogy

Learning by analogy consists in acquiring new facts or skills by transforming and augmenting existing knowledge that bears a strong similarity to the desired new concept or skill into a form effectively useful in the new situation. A learning-by-analogy system might be applied to convert an existing computer program into one that performs a closely related function for which it was not originally designed. Learning by analogy requires more inference on the part of the learner than does Rote Learning or Learning from instruction. A fact or skill analogous in relevant parameters must be retrieved from memory; then the retrieved knowledge must be transformed, applied to the new situation, and stored for future use. Systems that use learning by analogy are described in [And83] [Car83b].

1.2 Learning from examples

Given a set of examples of a concept, the learner induces a general concept description that describes the examples. The amount of inference performed by the learner is much greater than in Learning from instruction and in Learning by analogy. Learning from examples has become so popular in recent years that it is often called simply learning; in a similar way, the examples are referred to as data. In the rest of the dissertation these conventions will be adopted.
The learning problem can be described as finding a general rule that explains the data, given only a sample of limited size. The difficulty of this task is similar to the problem of children learning to speak from the sounds emitted by grown-up people.
The learning problem can thus be stated as follows: given a sample of examples of limited size, find a concise description of the data. Learning techniques can be grouped into three broad families: Supervised Learning, Reinforcement Learning and Unsupervised Learning.

1.2.1 Supervised Learning

In Supervised Learning (or Learning with a teacher), the data is a sample of input-outputpatterns. In this case, a concise description of the data is the function that can yield the


output, given the input. This problem is called supervised learning because the objects under consideration are already associated with target values, e.g. classes or real values. Examples of this learning task are the recognition of handwritten letters and digits and the prediction of stock market indexes.
In supervised learning, given a sample of input-output pairs, called the training sample (or training set), the task is to find a deterministic function that maps any input to an output and that can predict future input-output observations, minimizing the errors as much as possible. Whenever asked for the target value of an object present in the training sample, it can return the value that appeared most frequently together with this object in the training sample. According to the type of the outputs, supervised learning can be divided into Classification and Regression Learning.

1.2.1.1 Classification Learning

If the output space has no structure except whether two of its elements are equal or not, the problem is called Classification Learning (or simply classification). Each element of the output space is called a class. The learning algorithm that solves the classification problem is called a classifier. In classification problems the task is to assign new inputs to one of a number of discrete classes or categories. This problem characterizes most pattern recognition tasks. A typical classification problem is to assign to a character bitmap the correct letter of the alphabet.

1.2.1.2 Regression

If the outputs represent the values of continuous variables, for instance the prediction of a stock exchange index at some future time, then the learning task is known as the problem of Regression or Function Learning [Her03]. Typical examples of Regression are predicting the value of shares in the stock exchange market and estimating the value of a physical quantity (e.g. pressure, temperature) in a section of a thermoelectric plant.

1.2.2 Reinforcement Learning

Reinforcement learning has its roots in control theory. It considers the scenario of a dynamic environment that produces state-action-reward triples as the data. The difference between reinforcement and supervised learning is that in reinforcement learning no optimal action is given for a state; instead, the learning algorithm must identify actions so as to maximize the expected reward over time. The concise description of the data is the strategy that maximizes the reward.


The problem of reinforcement learning is to learn what to do, i.e. how to map situations to actions, in order to maximize a given reward. Unlike in supervised learning, the learning algorithm is not told which actions to take in a given situation. Instead, the learner is assumed to gain information about the actions taken through some reward, which does not necessarily arrive immediately after the action is taken. An example of such a problem is learning to play chess. Each board configuration, namely the position of the chess pieces on the chess board, is a state; the actions are the possible moves in a given configuration. The reward for a given action (the move of a piece) is winning the game; conversely, the punishment is losing the game. This reward, or punishment, is delayed, which is very typical of reinforcement learning. Since a given state has no optimal action, one of the biggest challenges of a reinforcement learning algorithm is to find a trade-off between exploration and exploitation. In order to maximize the reward (or minimize the punishment), a learning algorithm must choose actions which have been tried out in the past and found to be effective in producing reward, that is, it must exploit its current knowledge. On the other hand, to discover such actions the learning algorithm has to choose actions not tried in the past and thus explore the state space. There is no general solution to this dilemma, but it is clear that neither of the two options can exclusively lead to an optimal strategy.
A comprehensive survey on reinforcement learning can be found in [Sut98].

1.3 Unsupervised Learning

If the data is only a sample of objects without associated target values, the problem is known as Unsupervised Learning. In Unsupervised Learning there is no teacher. Hence a concise description of the data could be a set of clusters or a probability density stating how likely it is to observe a certain object in the future. Typical examples of Unsupervised Learning tasks include the problem of image and text segmentation and the task of novelty detection in process control. Unsupervised Learning is one of the topics of this dissertation.
In Unsupervised Learning we are given a training sample of objects (e.g. images) with the aim of extracting some structure from them, for instance identifying indoor or outdoor images or extracting face pixels in an image. If some structure exists in the training data, one can take advantage of the redundancy and find a short description of the data. A general way to represent data is to specify a similarity between any pair of objects. If two objects share much structure, it should be possible to reproduce the data from the same prototype. This idea underlies clustering algorithms, which form a rich subclass of unsupervised algorithms.
Clustering algorithms are based on the following idea. Given a fixed number of clusters, we aim to find a grouping of the objects such that similar objects belong to the same cluster. If it is possible to find a clustering such that the similarities of the objects within one cluster are much greater than the similarities among objects from different clusters, we have extracted structure from the training sample, so that a whole cluster can be represented by one


representative data point.
From a statistical point of view, the idea of a concise representation of the data is closely related to the problem of Probability Density Estimation of the input data. Probability Density Estimation consists in estimating the unknown Probability Density Function (PDF) of the input knowing only some input samples. There are some methods, such as mixture models, that try to estimate the function explicitly. Clustering algorithms try to reproduce the PDF without estimating it explicitly.
In addition to clustering algorithms, unsupervised learning includes techniques whose aim is to represent high-dimensional data in low-dimensional manifolds while trying to preserve the original information in the data. These techniques, called Multidimensional Scaling (MDS), are particularly important for the following reasons. The use of more dimensions than strictly necessary leads to several problems. The first one is the space needed to store the data: as the amount of available information increases, compression for storage purposes becomes ever more important. The speed of algorithms using the data depends on the dimension of the vectors, so a reduction of the dimension can result in reduced computation time. Finally, it can be hard to build reliable classifiers when the dimensionality of the input data is high (the curse of dimensionality [Bel61]). Therefore the use of vectors of smaller dimension often leads to improved classification performance.
In this dissertation we will investigate the Clustering and MDS problems by means of Kernel Methods, an emerging family of Machine Learning techniques.

1.4 Kernel Methods

First of all, we present a brief historical survey of learning-from-examples techniques. The success of these techniques essentially coincides with the rise of Neural Networks (or Neural Nets). Neural Networks were developed with the aim of modeling the behaviour of the human brain. In 1943 McCulloch and Pitts [Mcu43] presented the first work on Neural Networks. In 1957 Rosenblatt [Ros62] proposed a biologically motivated neural network, the perceptron, that can solve classification problems in which the classes are linearly separable. In 1969 Minsky and Papert [Min69] presented a textbook in which they investigated whether or not it was realistic to expect machines to learn complex tasks. Minsky and Papert showed that Rosenblatt's perceptron could not solve elementary nonlinear problems such as XOR. This negative result practically stopped active research in the field of neural networks for the next ten years. In 1986 Rumelhart et al. [Rum86a] [Rum86b] revived the interest in neural networks and in the problem of Machine Learning. They proposed a generalization of the perceptron, the Multi-Layer Perceptron (MLP) [Bis95][Her91], that was capable of solving an adequately wide class of problems. Learning in the MLP was performed by means of the Back Propagation algorithm, based on the maximum likelihood principle [Bis95]. The success of the MLP led to a large development of research in


Neural Networks. Many neural network models have been developed in the following years. Neural Networks have been used to solve many machine learning problems effectively, e.g. classification, regression and clustering problems.

1.4.1 Statistical Learning Theory

For the sake of simplicity, we consider only the classification problem. In order to work, classifiers have to undergo a learning process. During the learning phase it is necessary to provide the classifier with the training set, which it uses to learn the function that describes the input-output data. After the learning, the performance is measured on a new dataset (the test set), disjoint from the training set. The performance is expressed in terms of an error, measured by a loss function (e.g. the Mean Square Error), that quantifies how far the classifier is from the desired behaviour. Vladimir Vapnik, in his statistical theory of learning [Vap79], related the error (or Expected Risk) of the classifier on the test set to the error (or Empirical Risk) on the training set. The Expected Risk (EX) and the Empirical Risk (EM) are connected by the following relation:

$$EX \le EM + ES \qquad (1.1)$$

where ES is the Estimation Error. Vapnik proved that

$$ES \simeq \sqrt{\frac{h}{l}\,\log\!\left(1 + \frac{2l}{h}\right)}$$

where l is the number of samples in the training set and h is the Vapnik-Chervonenkis (VC) dimension [Vap71] of the classifier. Roughly speaking, the VC dimension is the maximum number of samples that the classifier can classify correctly, and it satisfies h ≤ l. The VC dimension is often called the classifier capacity. If the ratio h/l is not negligible, the Expected Risk differs notably from the Empirical Risk. Therefore during classifier training it is necessary not only to minimize the Empirical Risk but also to control the classifier capacity, i.e. to prevent the ratio h/l from becoming too large. Neural Networks, trained with the maximum likelihood principle, try to minimize the Empirical Risk without taking the VC dimension into account. Hence it can happen that, although the trained neural classifier has a small Empirical Risk on the training set, it performs poorly on the test set.
In 1995 Corinna Cortes and Vladimir Vapnik [Cor95] proposed a new learning algorithm, Support Vector Machines (SVM) [Vap95] [Vap98], motivated by the Statistical Learning Theory results exposed above. Unlike Neural Networks, the SVM tries to minimize the Expected Risk, instead of only the Empirical Risk. This is possible since the VC dimension is controlled during SVM training. A distinctive feature of Support Vector Machines is the use of Mercer kernels [Ber84] to perform inner products (the kernel trick). SVMs are more and more widely applied in the Machine Learning community, since they often show higher performance than other learning algorithms. The great success of SVMs has led to the development of a new branch of Machine Learning, Kernel Methods, i.e. the algorithms that use the kernel trick.
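To give a rough numerical feel for the bound above (a sketch, not part of the thesis, using hypothetical values of the training-set size l and of the VC dimension h), the snippet below evaluates the estimation error ES for growing ratios l/h; the admissible gap between Expected and Empirical Risk shrinks as l/h grows.

```python
import math

def estimation_error(l, h):
    """Approximate estimation error ES = sqrt((h / l) * log(1 + 2l / h)),
    with l the number of training samples and h the VC dimension."""
    return math.sqrt((h / l) * math.log(1.0 + 2.0 * l / h))

# Hypothetical classifier of VC dimension h = 100 and training sets of growing size.
h = 100
for l in (200, 1000, 10000, 100000):
    print(f"l = {l:6d}   l/h = {l / h:7.1f}   ES ~ {estimation_error(l, h):.3f}")
```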

1.5 The dissertation aim

Kernel Methods have historically been developed for supervised learning. A large number of papers have been published about supervised Kernel Methods and their successful applications. On the contrary, the problem of developing effective Kernel Methods for Unsupervised Learning has been tackled only recently. The aim of this dissertation is to investigate and to develop Kernel Methods for Unsupervised Learning. Two classical unsupervised learning problems have been tackled using Kernel Methods: the former is Multidimensional Scaling, the latter is Clustering.

1.5.1 Multidimensional Scaling

We have selected a specific topic in Multidimensional Scaling, namely Data Dimensionality Estimation. Hence the first goal of the dissertation is to investigate the problem of estimating the data dimensionality using the Kernel Methods available in the literature. The dimensionality (Intrinsic Dimensionality) of a data set is the minimum number of free variables needed to represent the data without information loss. Dimensionality Estimation is important in Machine Learning for many reasons. The use of more dimensions than strictly necessary leads to several problems. The first one is the space needed to store the data: as the amount of available information increases, compression for storage purposes becomes ever more important. The speed of algorithms using the data depends on the dimension of the vectors, so a reduction of the dimension can result in reduced computation time. Finally, it can be hard to build reliable classifiers when the dimension of the input data is high (the curse of dimensionality [Bel61]).

1.5.2 Clustering

The second goal of the dissertation is to tackle the clustering problem using Kernel Methods. Clustering is a classical problem in Machine Learning. In spite of the importance of clustering, the literature does not offer Kernel Methods that can perform clustering effectively. The aim of the dissertation is therefore to propose an original Kernel Method for Clustering.


The formulation of a Kernel Method for Clustering is the main original result of the dissertation.

1.6 How to read the dissertation

The dissertation consists of three parts, each formed by two chapters. The first part is propaedeutic to the other two; on the contrary, there are no propaedeutic ties between the second and the third part, so they can be read in any order.
The first part, formed by the second and the third chapter, aims to provide the basic facts and concepts of Kernel Methods. The second chapter presents the mathematical foundations of Kernel Methods. The third chapter presents a survey of supervised Kernel Methods. Although the topic of the dissertation is unsupervised Kernel Methods, a review of the supervised ones is necessary, since many techniques used in unsupervised Kernel Methods have been borrowed from the supervised ones.
The second part, formed by the fourth and the fifth chapter, is devoted to the problem of estimating the Intrinsic Dimensionality (ID) of data. The fourth chapter presents the state of the art of non-kernel-based ID estimation methods; this survey has been published in similar form in Pattern Recognition [Cam03]. The fifth chapter is devoted to ID estimation by means of Kernel Methods. First the Kernel Method used to estimate the ID, Kernel PCA, is presented; then the experimental investigations, using Kernel PCA, on real and synthetic data are described. These investigations represent the original contribution of the dissertation to the problem of ID estimation.
The third part, formed by the sixth and the seventh chapter, is devoted to the problem of data clustering and has the same organization as the second part. The sixth chapter presents the state of the art of non-kernel-based clustering algorithms. The seventh chapter is devoted to the problem of clustering data by means of Kernel Methods. In this chapter a novel kernel-based clustering algorithm, Kernel Grower, is presented; Kernel Grower is the main original result of the dissertation. Kernel Grower compares favorably, on synthetic and real data benchmarks, with popular clustering algorithms such as K-MEANS, Self-Organizing Maps and Neural Gas.
The content of each chapter of the dissertation is as follows.
The second chapter, Mathematical Foundations of Kernel Methods, presents a tutorial on kernel theory, focusing on the theoretical aspects that are relevant for the dissertation topics, such as the computation of inner products and metrics by means of Mercer kernels. First the definitions of scalar product, norm and metric are recalled. Then Positive and Negative Definite Kernels are described; in particular, the connections between Positive and Negative Definite Kernels are discussed. Then it is shown how to compute a metric by means of a Positive Definite Kernel. Finally, it is described how a


Positive Definite Kernel can be represented by means of a Hilbert space.
The third chapter, Kernel Methods for Supervised Learning, provides an overview of supervised Kernel Methods. The chapter presents the main tools of optimization theory used in Kernel Methods. Then Support Vector Machines for Classification and Regression are presented. Gaussian Processes are also described, underlining their connections with Support Vector Machines for Regression. Finally, the Kernel Fisher Discriminant is discussed.
The fourth chapter, Data Dimensionality Estimation Methods, presents the state of the art of non-kernel-based data dimensionality estimation methods. The approaches for estimating the data dimensionality can be divided into local and global ones. In the former, the dimensionality is estimated using the information contained in sample neighborhoods, avoiding the projection of the data onto a lower-dimensional space. Unlike local approaches, which use only the information contained in the neighborhood of each data sample, global approaches make use of the whole data set. The most common local and global approaches are reviewed in the chapter. Special attention is paid to fractal-based approaches.
The fifth chapter, Data Dimensionality Estimation with Kernels, is devoted to the problem of estimating the dimensionality of a data set by means of Kernel Methods. First the chapter presents an overview of Kernel PCA (KPCA). Then the use of KPCA as a Data Dimensionality Estimator is investigated. KPCA has been tested on a synthetic dataset, whose dimensionality is known, and on real data. It is shown that KPCA performance is strongly influenced by the choice of the kernel: KPCA is a good dimensionality estimator only when the kernel is very close to the function that generated the data set; otherwise KPCA seems to behave poorly.
The sixth chapter, Clustering Data Methods, presents the state of the art of non-kernel-based methods for clustering data, paying special attention to neural-based algorithms. First the chapter reviews the Expectation-Maximization algorithm, which is used in the most common clustering algorithms. Then the basic concepts of clustering algorithms, such as the Quantization Error and the Voronoi Tessellation, are presented. The most popular clustering algorithm, K-MEANS, is discussed. Besides, the main soft competitive learning algorithms, such as Self-Organizing Maps (SOM), Neural Gas and Topology Representing Networks, are presented. General Topographic Mapping is also discussed, underlining its connections with SOM. Finally, the chapter provides some sketches of fuzzy clustering algorithms.
The seventh chapter, Clustering with Kernels, investigates the problem of clustering data using kernels. In this chapter a novel kernel-based clustering algorithm, Kernel Grower, is presented; it represents the main original result of the dissertation. First Support Vector Clustering, the basic step of Kernel Grower, is introduced. Then the Kernel Grower algorithm is presented. Finally, Kernel Grower is tested on synthetic and real datasets, on which it compares notably favorably with K-MEANS, SOM and Neural Gas.


The eighth chapter, Conclusions and Future Work, summarizes the main results obtained in the thesis and indicates further themes that could be investigated in the near future.


Chapter 2

Mathematical Foundations of Kernel Methods

2.1 Introduction

The theory of kernel functions was developed during the first four decades of the twentieth century by some of the most brilliant mathematicians of the century. The concept of positive definite kernel was introduced for the first time by David Hilbert [Hil04]. Moreover, remarkable contributions to kernel theory were provided by Mercer [Mer09], Schur [Sch11], Schoenberg [Sch38a] [Sch38b] [Sch42], Kolmogorov [Kol41] and Aronszajn [Aro44] [Aro50].
In Machine Learning, the use of kernel functions to perform computations was proposed for the first time by Aizerman et al. [Aiz64] in 1964. In 1995 Cortes and Vapnik [Cor95] proposed a learning algorithm, Support Vector Machines (SVM), that uses kernels to compute inner products between data vectors. SVM is becoming a more and more widespread algorithm in the Machine Learning community, since it has shown, in several cases, better performance than other learning algorithms. The success of SVM has led to the extension of the use of kernels to other learning algorithms. At present there is quite a number of algorithms (e.g. SVM, Kernel PCA [Sch98a]), called Kernel Methods, that use kernels to compute inner products between vectors. The performance of Kernel Methods is strongly influenced by the choice of the kernel used to compute the product [Bar02]. Besides, in many applications it is often necessary to design new kernels that incorporate a priori knowledge about the application. Therefore a good knowledge of kernel theory is becoming more and more necessary for Machine Learning practitioners.
The aim of this chapter is to present a tutorial on kernel theory, focusing on the theoretical aspects that are relevant for the dissertation topics, such as the computation of inner products and metrics by means of Mercer kernels.


The chapter is organized as follows: in Section 2.2 the definitions of scalar product, norm and metric are recalled; in Section 2.3 Positive Definite Kernels and Matrices are presented; Section 2.4 is devoted to Conditionally Positive Definite Kernels and Matrices; Negative Definite Kernels and Matrices are described in Section 2.5; Section 2.6 presents the connections between negative and positive definite kernels; Section 2.7 presents how a metric can be computed by means of a positive definite kernel; Section 2.8 describes how a positive definite kernel can be represented by means of a Hilbert space.

2.2 Scalar Products, Norms and Metrics

The aim of this section is to recall the concepts of inner product, norm and metric. These concepts are widely covered in many books on functional analysis (e.g. [Rud64]). For simplicity's sake, throughout the chapter we consider vectors in R^n instead of C^n, as is usually done in analysis books.

Definition 1. Let X be a set. A scalar product (or inner product) is an application · : X × X → R satisfying the following conditions:

a) y · x = x · y x, y ∈ X

b) (x + y) · z = (x · z) + (y · z) x, y, z ∈ X

c) (αx) · y = α(x · y) x, y ∈ X α ∈ R

d) x · x ≥ 0 x ∈ X

e) x · x = 0 ⇐⇒ x = 0

f) x · (y + z) = (x · y) + (x · z) x, y, z ∈ X

Axioms (a) and (b) imply (f). Using axiom (d), it is possible to associate to the inner product a quadratic form, called the norm, ‖ · ‖, such that:

‖x‖² = x · x
More generally, the norm can be defined in the following way.

Definition 2. The norm ‖ · ‖ : X → R is a function that has the following properties:

‖x‖ ≥ 0 ∀x ∈ X

‖x‖ = 0 ⇐⇒ x = 0

‖αx‖ = |α|‖x‖ ∀α ∈ R ∀x ∈ X

‖x + y‖ ≤ ‖x‖+ ‖y‖


Norms and inner products are connected by the Cauchy-Schwarz inequality:

|x · y| ≤ ‖x‖ ‖y‖

Definition 3. Let X be a set. A function ρ : X × X → R is called a distance on X if:

a) ρ(x, y) ≥ 0 ∀x, y ∈ X

b) ρ(x, y) = ρ(y, x) ∀x, y ∈ X

c) ρ(x, x) = 0 ∀x ∈ X

The pair (X, ρ) is called a distance space. If ρ satisfies, in addition, the triangle inequality

d) ρ(x, y) ≤ ρ(x, z) + ρ(y, z) ∀x, y, z ∈ X

then ρ is called a semimetric on X. Besides, if

e) ρ(x, y) = 0 ⇒ x = y

then (X, ρ) is called a metric space.

It is easy to show that the function ρ(x, y) = ‖x − y‖ is a metric.
We conclude the section by introducing the concept of Lp spaces.

Definition 4. Consider countable sequences of real numbers and let 1 ≤ p < ∞. The space $L^p$ is the set of sequences $z = z_1, \dots, z_n, \dots$ such that

$$\|z\|_p = \left( \sum_{i=1}^{\infty} |z_i|^p \right)^{\frac{1}{p}} < \infty$$
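As a quick numerical illustration (a NumPy sketch, not part of the thesis; the vectors are arbitrary), the snippet below evaluates the p-norm of a finite sequence for a few values of p and checks the triangle inequality of Definition 2 and the Cauchy-Schwarz inequality for p = 2.

```python
import numpy as np

def lp_norm(z, p):
    """p-norm of a finite sequence z: ||z||_p = (sum_i |z_i|^p)^(1/p)."""
    return np.sum(np.abs(z) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)

# Triangle inequality ||x + y||_p <= ||x||_p + ||y||_p for a few values of p.
for p in (1, 2, 3):
    print(p, lp_norm(x + y, p) <= lp_norm(x, p) + lp_norm(y, p))

# Cauchy-Schwarz inequality |x . y| <= ||x|| ||y|| for the Euclidean norm (p = 2).
print(abs(np.dot(x, y)) <= lp_norm(x, 2) * lp_norm(y, 2))
```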

2.3 Positive Definite Kernels and Matrices

We introduce the concept of positive definite matrices.

Definition 5. An $n \times n$ matrix $A = (a_{ij})$, $a_{ij} \in \mathbb{R}$, is called a positive definite matrix iff¹

$$\sum_{j=1}^{n} \sum_{k=1}^{n} c_j c_k a_{jk} \ge 0 \qquad (2.1)$$

for all $c_1, \dots, c_n \in \mathbb{R}$.

¹ iff stands for if and only if


The basic properties of positive definite matrices are underlined by the following result.

Theorem 1. A matrix

• is positive definite iff it is symmetric;

• is positive definite iff it has all eigenvalues ≥ 0.

A matrix is called strictly positive definite if all its eigenvalues are positive. The following result (Sylvester's criterion), whose proof is omitted, is a useful tool to establish whether a matrix is strictly positive definite.

Theorem 2. Let $A = (a_{jk})$ be a symmetric $n \times n$ matrix. Then A is strictly positive definite iff

$$\det\left( (a_{jk})_{j,k \le p} \right) > 0, \qquad p = 1, \dots, n,$$

i.e. all its leading principal minors are positive.
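For a concrete check (a NumPy sketch; the symmetric matrix below is an arbitrary example, not taken from the thesis), Sylvester's criterion can be compared with the eigenvalue characterization of Theorem 1:

```python
import numpy as np

def sylvester_strictly_positive_definite(A):
    """Sylvester's criterion: a symmetric matrix is strictly positive definite
    iff every leading principal minor det(A[:p, :p]), p = 1..n, is positive."""
    n = A.shape[0]
    return all(np.linalg.det(A[:p, :p]) > 0 for p in range(1, n + 1))

# Arbitrary symmetric test matrix.
A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])

print(sylvester_strictly_positive_definite(A))   # True
print(np.linalg.eigvalsh(A))                     # all eigenvalues strictly positive
```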

Now we introduce the concept of positive definite kernels.

Definition 6. Let X be a nonempty set. A function $\varphi : X \times X \to \mathbb{R}$ is called a positive definite kernel (or Mercer kernel) iff

$$\sum_{j=1}^{n} \sum_{k=1}^{n} c_j c_k \,\varphi(x_j, x_k) \ge 0$$

for all $n \in \mathbb{N}$, $x_1, \dots, x_n \in X$ and $c_1, \dots, c_n \in \mathbb{R}$.

The following result, which we do not prove, underlines the basic properties of positive definite kernels.

Theorem 3. A kernel ϕ on X × X

• is positive definite iff it is symmetric;

• is positive definite iff for every finite subset X₀ ⊆ X the restriction of ϕ to X₀ × X₀ is positive definite.

Besides, if ϕ is positive definite, then ϕ(x, x) ≥ 0 for all x ∈ X.

An example of Mercer kernel is the inner product.

Corollary 1. The inner product is a positive definite kernel.


Proof. Applying the properties of the inner product, we have:

$$\sum_{j=1}^{n} \sum_{k=1}^{n} c_j c_k \, x_j \cdot x_k = \left( \sum_{j=1}^{n} c_j x_j \right) \cdot \left( \sum_{j=1}^{n} c_j x_j \right) = \left\| \sum_{j=1}^{n} c_j x_j \right\|^2 \ge 0$$
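Corollary 1 can be checked numerically (a NumPy sketch with an arbitrary random sample, not part of the thesis): the Gram matrix of the plain inner product on any finite set of points has no negative eigenvalues, up to round-off.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))          # 20 arbitrary points in R^3

K = X @ X.T                            # Gram matrix K_ij = x_i . x_j
print(np.linalg.eigvalsh(K).min())     # >= 0 up to numerical round-off
```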

For Mercer kernels an inequality analogous to the Cauchy-Schwarz inequality holds.

Theorem 4. For any positive definite kernel ϕ the following inequality holds:

$$|\varphi(x, y)|^2 \le \varphi(x, x)\,\varphi(y, y) \qquad (2.2)$$

Proof. By Theorem 3, the restriction of ϕ to the pair {x, y} is positive definite, so it suffices to consider a 2 × 2 matrix

$$A = \begin{pmatrix} a & b \\ b & d \end{pmatrix}$$

where $a, b, d \in \mathbb{R}$. Then, for $w, z \in \mathbb{R}$, we have:

$$\begin{pmatrix} w & z \end{pmatrix} \begin{pmatrix} a & b \\ b & d \end{pmatrix} \begin{pmatrix} w \\ z \end{pmatrix} = a w^2 + 2 b z w + d z^2 = a \left[ w + \frac{b}{a} z \right]^2 + \frac{z^2}{a}\left[ a d - b^2 \right] \qquad (\forall a \ne 0)$$

Hence the matrix A is positive definite iff $a \ge 0$, $d \ge 0$ and

$$\det \begin{pmatrix} a & b \\ b & d \end{pmatrix} = a d - b^2 \ge 0$$

Applying this to $a = \varphi(x, x)$, $b = \varphi(x, y)$ and $d = \varphi(y, y)$, for any positive definite kernel ϕ we obtain

$$|\varphi(x, y)|^2 \le \varphi(x, x)\,\varphi(y, y)$$

Since both sides of the inequality are non-negative, we get:

$$|\varphi(x, y)| \le \sqrt{\varphi(x, x)}\,\sqrt{\varphi(y, y)} \qquad (2.3)$$

If we define the pseudonorm $\|x\|_\varphi \triangleq \sqrt{\varphi(x, x)}$, inequality 2.3 becomes

$$|\varphi(x, y)| \le \|x\|_\varphi \, \|y\|_\varphi$$

which recalls the Cauchy-Schwarz inequality for the inner product.
The following remark underlines that $\|x\|_\varphi$ is only a pseudonorm.

Remark 1. ‖x‖ϕ is not a norm, since x = 0 does not imply ‖x‖ϕ = 0.


Proof. We consider the kernel $\varphi(x, y) = \cos(x - y)$, $x, y \in \mathbb{R}$.
The kernel ϕ is a Mercer kernel, since we have:

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \cos(x_i - x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \left[ \cos(x_i)\cos(x_j) + \sin(x_i)\sin(x_j) \right] = \left[ \sum_{i=1}^{n} c_i \cos(x_i) \right]^2 + \left[ \sum_{i=1}^{n} c_i \sin(x_i) \right]^2 \ge 0$$

But $\|x\|_\varphi = 1$ for all x.

Now we introduce the result, due to Mercer, that allows Mercer kernels to be used to compute inner products.

Theorem 5. Let K be a symmetric function such that for all $x, y \in X$, $X \subseteq \mathbb{R}^n$,

$$K(x, y) = \Phi(x) \cdot \Phi(y) \qquad (2.4)$$

where $\Phi : X \to F$, and F, endowed with an inner product, is called the Feature Space.
K can be represented in the form 2.4 iff the matrix $K = (K(x_i, x_j))_{i,j=1}^{n}$ is positive semidefinite, i.e. K is a Mercer kernel.
Besides, K defines an explicit mapping if Φ is known; otherwise the mapping is implicit.

Proof. We prove the proposition in the case of a finite-dimensional space. Consider a space $X = [x_1, \dots, x_n]$ and suppose that $K(x, y)$ is a symmetric function on X. Consider the matrix $K = (K(x_i, x_j))_{i,j=1}^{n}$. Since K is symmetric, there exists an orthogonal matrix $V = [v_1, \dots, v_n]$ such that $K = V \Lambda V^T$, where Λ is a diagonal matrix whose elements are the eigenvalues $\lambda_t$ of K, while the $v_t$ are the eigenvectors of K.
Now we consider the following mapping $\Phi : X \to \mathbb{R}^n$:

$$\Phi(x_i) = \left( \sqrt{\lambda_t}\, v_{ti} \right)_{t=1}^{n}$$

Therefore we have:

$$\Phi(x_i) \cdot \Phi(x_j) = \sum_{t=1}^{n} \lambda_t v_{ti} v_{tj} = (V \Lambda V^T)_{ij} = K_{ij} = K(x_i, x_j)$$

The requirement that K has all eigenvalues non-negative follows from the definition of Φ, since the argument of the square root must be non-negative.

For the sake of completeness, we quote Mercer's Theorem, which generalizes the representation 2.4 to infinite-dimensional spaces.


Theorem 6. Let $X \subseteq \mathbb{R}^n$ be a compact set. If K is a continuous symmetric function such that the operator $T_K$:

$$(T_K f)(\cdot) = \int_X K(\cdot, x) f(x)\, dx \qquad (2.5)$$

is positive definite, i.e.

$$\int_{X \times X} K(x, y) f(x) f(y)\, dx\, dy \ge 0 \qquad \forall f \in L^2(X) \qquad (2.6)$$

then we can expand $K(x, y)$ in a uniformly convergent series in terms of eigenfunctions $\Phi_j \in L^2(X)$ and positive eigenvalues $\lambda_j > 0$:

$$K(x, y) = \sum_{j=1}^{\infty} \lambda_j \Phi_j(x) \Phi_j(y) \qquad (2.7)$$

It is necessary to point out the following remark.

Remark 2. The condition 2.6 corresponds to the condition 2.4 of the definition of Mercer kernels in the finite case.

Now we provide examples of Mercer kernels that define implicit and explicit mappings. The kernel K(x, y) = cos(x − y), x, y ∈ ℝ, defines an explicit mapping. In fact, we have

K(x, y) = cos(x − y) = cos(x) cos(y) + sin(x) sin(y)

which is the inner product in a feature space F defined by the mapping Φ : ℝ → ℝ²

Φ(x) = (cos(x), sin(x))

On the contrary, the Gaussian² G = exp(−‖x − y‖²) is a case of a Mercer kernel with an implicit mapping, since Φ is unknown. The possibility of using Mercer kernels to perform inner products makes their study quite important for Computer Science. In the rest of the section we will present methods to make Mercer kernels.
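As a quick check of the explicit mapping above, the following sketch (NumPy assumed; not part of the thesis) verifies numerically that Φ(x) = (cos(x), sin(x)) reproduces the kernel cos(x − y):

import numpy as np

# Verify that the explicit feature map Phi(x) = (cos(x), sin(x)) reproduces
# the kernel K(x, y) = cos(x - y) through an ordinary inner product.
def phi(x):
    return np.array([np.cos(x), np.sin(x)])

rng = np.random.default_rng(0)
for x, y in rng.normal(size=(5, 2)):
    assert np.isclose(phi(x) @ phi(y), np.cos(x - y))
print("cos(x - y) = Phi(x) . Phi(y) verified on random pairs")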

2.3.1 How to make a Mercer kernel

The following result shows that Mercer kernels satisfy quite a number of closure properties.

2For the proof of the positive definiteness of the Gaussian see the Corollary 7


Theorem 7. Let ϕ_1 and ϕ_2 be Mercer kernels over X × X, X ⊆ ℝⁿ, let a ∈ ℝ⁺, and let · and ⊗ denote respectively the product and the tensor product. Then the following functions are Mercer kernels:

1. ϕ(x, z) = ϕ_1(x, z) + ϕ_2(x, z)

2. ϕ(x, z) = aϕ_1(x, z)

3. ϕ(x, z) = ϕ_1(x, z) · ϕ_2(x, z)

4. ϕ(x, z) = ϕ_1(x, z) ⊗ ϕ_2(x, z)

Proof. The proofs of the first and the second item are immediate:

∑_{i=1}^n ∑_{j=1}^n c_i c_j [ϕ_1(x_i, x_j) + ϕ_2(x_i, x_j)] = ∑_{i=1}^n ∑_{j=1}^n c_i c_j ϕ_1(x_i, x_j) + ∑_{i=1}^n ∑_{j=1}^n c_i c_j ϕ_2(x_i, x_j) ≥ 0

∑_{i=1}^n ∑_{j=1}^n c_i c_j [aϕ_1(x_i, x_j)] = a ∑_{i=1}^n ∑_{j=1}^n c_i c_j ϕ_1(x_i, x_j) ≥ 0

Since the elementwise (Schur) product of two positive definite matrices is still positive definite, the third item is immediately proved. We pass to prove the last item. The tensor product of two positive definite matrices is positive semidefinite, since its eigenvalues are all the pairwise products of the eigenvalues of the two factors.

The following corollaries provide useful methods in order to make Mercer kernels.

Corollary 2. Let ϕ(x, y) : X × X → ℝ be positive definite. The following kernels are also positive definite:

1. K(x, y) = ∑_{i=0}^n a_i [ϕ(x, y)]^i, with a_i ∈ ℝ⁺

2. K(x, y) = exp(ϕ(x, y))

Proof. The first item is an immediate consequence of Theorem 7. Regarding the second item, the exponential can be represented as:

exp(ϕ(x, y)) = 1 + ∑_{i=1}^∞ [ϕ(x, y)]^i / i!

and is a limit of linear combinations of Mercer kernels. Since Mercer kernels are closed under taking pointwise limits, the item is proved.


Corollary 3. Let f(·) : X → ℝ; then ϕ(x, y) = f(x)f(y) is positive definite.

Proof.

∑_{i=1}^n ∑_{j=1}^n c_i c_j f(x_i) f(x_j) = ( ∑_{i=1}^n c_i f(x_i) )² ≥ 0
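The closure properties above can be illustrated numerically. The following sketch (NumPy assumed; an illustration, not part of the thesis) forms the sum, a positive rescaling, the pointwise product and the exponential of base kernels and checks that the resulting Gram matrices have no significant negative eigenvalue:

import numpy as np

# Closure properties of Mercer kernels: derived kernels should yield positive
# semidefinite Gram matrices (up to numerical round-off) on any sample.

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))

def gram(kernel):
    return np.array([[kernel(a, b) for b in X] for a in X])

k1 = lambda x, y: np.dot(x, y)                         # linear kernel
k2 = lambda x, y: np.exp(-np.linalg.norm(x - y) ** 2)  # Gaussian kernel

derived = {
    "sum":     lambda x, y: k1(x, y) + k2(x, y),
    "scaling": lambda x, y: 3.0 * k1(x, y),
    "product": lambda x, y: k1(x, y) * k2(x, y),
    "exp":     lambda x, y: np.exp(k1(x, y)),
}

for name, k in derived.items():
    min_eig = np.linalg.eigvalsh(gram(k)).min()
    print(f"{name:8s} min eigenvalue = {min_eig:.2e}")   # >= 0 up to round-off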

The propositions described above are very useful for building new Mercer kernels from existing ones. Nevertheless, proving that a kernel is positive definite is generally not a trivial task. The following propositions, which we do not prove, provide useful criteria for establishing whether a kernel is positive definite.

Theorem 8 (Bochner). If K(x − y) is a continuous positive definite function, then there exists a bounded nondecreasing function V(u) such that K(x − y) is the Fourier–Stieltjes transform of V(u), that is:

K(x − y) = ∫_{−∞}^{+∞} e^{i(x−y)u} dV(u)

If the function K(x − y) satisfies this condition, then it is positive definite.

Theorem 9 (Schoenberg). Let us call a function F(u) completely monotonic on (0, ∞) provided that it is in C^∞(0, ∞) and satisfies the condition:

(−1)^n F^(n)(u) ≥ 0,   u ∈ (0, ∞), n = 0, 1, . . .

Then the function F(‖x − y‖) is positive definite iff F(√‖x − y‖) is continuous and completely monotonic.

Theorem 10 (Polya). Any real, even, continuous function F(u) which is convex on (0, ∞), i.e. satisfies F(αu_1 + (1 − α)u_2) ≤ αF(u_1) + (1 − α)F(u_2) for all u_1, u_2 and α ∈ (0, 1), is positive definite.

On the basis of these theorems, one can construct different Mercer kernels of the typeK(x− y).

2.4 Conditionally Positive Definite Kernels and Matrices

Although the class of Mercer kernels is already large, it is useful to identify kernel functions that, although not Mercer kernels, can be used in a similar way.


To this purpose we define conditionally positive definite matrices and kernels [Mic86].

Definition 7. An n × n matrix A = (a_ij), a_ij ∈ ℝ, is called a conditionally positive definite matrix of order r if it has n − r non-negative eigenvalues.

Definition 8. We call the kernel ϕ a conditionally positive definite kernel of order r iff it is symmetric (i.e. ϕ(x, y) = ϕ(y, x) ∀x, y ∈ X) and

∑_{j=1}^n ∑_{k=1}^n c_j c_k ϕ(x_j, x_k) ≥ 0

for all n ≥ 2, x_1, . . . , x_n ∈ X and c_1, . . . , c_n ∈ ℝ with ∑_{j=1}^n c_j P(x_j) = 0, where P(x) is a polynomial of order r − 1.

Examples of conditionally positive definite kernels are³:

k(x, y) = −√(‖x − y‖² + c²)   (Hardy multiquadric, m = 1)
k(x, y) = ‖x − y‖² ln ‖x − y‖   (thin plate spline, m = 2)

As pointed out by Girosi et al. [Gir93], conditionally positive definite kernels are admissible for methods that use kernels to perform inner products. This is underlined by the following result.

Theorem 11. Let h(‖x − y‖²) be conditionally positive definite and let ∏^n_m denote the space of polynomials of degree lower than m on ℝⁿ. Then h generates an admissible kernel for SV expansions on the space of functions f orthogonal to ∏^n_m, by setting k(x, y) := h(‖x − y‖²).

Proof. In [Dyn91] [Mad90] it was shown that conditionally positive definite kernels h generate semi-norms ‖·‖_h by

‖f‖²_h = ∫ h(‖x − y‖²) f(x) f(y) dx dy   (2.8)

provided that the projection of f onto the space of polynomials of degree lower than m is zero. Since ‖f‖²_h is a seminorm, the right-hand side of (2.8), which is the Mercer condition for h(‖x − y‖²), is obviously ≥ 0. Therefore Mercer's condition is fulfilled, h(‖x − y‖²) defines a scalar product in some feature space, and hence it can be used to perform an inner product. This result enlarges remarkably the class of kernels that can be used in order to perform inner products.

³The conditional positive definiteness of the Hardy multiquadric is shown in Corollary 5.


2.5 Negative Definite Kernels and Matrices

We introduce the concept of negative definite matrices.

Definition 9. An n × n matrix A = (a_ij), a_ij ∈ ℝ, is called a negative definite matrix iff

∑_{j=1}^n ∑_{k=1}^n c_j c_k a_jk ≤ 0   (2.9)

for all n > 2 and c_1, . . . , c_n ∈ ℝ.

Since the previous definition involves integers n > 2, it is necessary to point out that any1 × 1 matrix A = (a11) with a11 ∈ R is called negative definite. The basic properties ofnegative definite matrices are underlined by the following result.

Theorem 12. A matrix is negative definite iff it is symmetric and has all eigenvalues ≤ 0.

A matrix is called strictly negative definite if all its eigenvalues are negative. Now we introduce the concept of negative definite kernels.

Definition 10. We call the kernel ϕ a negative definite kernel iff it is symmetric (i.e. ϕ(x, y) = ϕ(y, x) ∀x, y ∈ X) and

∑_{j=1}^n ∑_{k=1}^n c_j c_k ϕ(x_j, x_k) ≤ 0

for all n ≥ 2, x_1, . . . , x_n ∈ X and c_1, . . . , c_n ∈ ℝ with ∑_{j=1}^n c_j = 0.

In analogy with the positive definite kernel, the following result holds:

Theorem 13. A kernel ϕ is negative definite iff for every finite subset X_0 ⊆ X the restriction of ϕ to X_0 × X_0 is negative definite.

An example of negative definite kernel is the square of the Euclidean distance.

Corollary 4. The kernel ϕ(x, y) = ‖x− y‖2 is negative definite.


Proof.

∑_{i=1}^n ∑_{j=1}^n c_i c_j ‖x_i − x_j‖² = ∑_{i=1}^n ∑_{j=1}^n c_i c_j [‖x_i‖² − 2(x_i · x_j) + ‖x_j‖²]
= ∑_{j=1}^n c_j ∑_{i=1}^n c_i ‖x_i‖² + ∑_{i=1}^n c_i ∑_{j=1}^n c_j ‖x_j‖² − 2 ∑_{i=1}^n ∑_{j=1}^n c_i c_j (x_i · x_j)
= −2 ∑_{i=1}^n ∑_{j=1}^n c_i c_j (x_i · x_j)   (since ∑_{j=1}^n c_j = 0)
≤ 0

since the inner product is positive definite.
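Corollary 4 can be checked empirically. The following sketch (NumPy assumed; an illustration, not part of the thesis) evaluates the quadratic form of the squared-distance matrix for many random coefficient vectors with zero sum:

import numpy as np

# For coefficient vectors c with sum(c) = 0, the quadratic form of the squared
# Euclidean distance matrix is always <= 0, i.e. ||x - y||^2 is negative definite.

rng = np.random.default_rng(2)
X = rng.normal(size=(7, 4))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared distances

for _ in range(1000):
    c = rng.normal(size=len(X))
    c -= c.mean()                       # enforce sum(c) = 0
    assert c @ D2 @ c <= 1e-9           # quadratic form is non-positive
print("all zero-sum quadratic forms were non-positive")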

Important properties of negative definite kernels are established by the following lemma, whose proof [Ber84] is omitted for the sake of brevity.

Lemma 1. If ψ : X × X → ℝ is negative definite and satisfies ψ(x, x) ≥ 0 ∀x ∈ X, then the following kernels are also negative definite:

• ψ^α for 0 < α < 1;

• log(1 + ψ).

A consequence of the lemma is that the Hardy multiquadric is a conditionally positive definite kernel.

Corollary 5. The Hardy multiquadric −√(α² + ‖x − y‖²) is a conditionally positive definite kernel of order 1, for α ∈ ℝ.

Proof. The kernel ψ(x, y) = α² + ‖x − y‖² is negative definite:

∑_{i=1}^n ∑_{j=1}^n c_i c_j [α² + ‖x_i − x_j‖²] = α² ( ∑_{i=1}^n c_i )² + ∑_{i=1}^n ∑_{j=1}^n c_i c_j ‖x_i − x_j‖²
= ∑_{i=1}^n ∑_{j=1}^n c_i c_j ‖x_i − x_j‖²   (since ∑_{i=1}^n c_i = 0)
≤ 0

by Corollary 4. Therefore, by the previous lemma, ϕ(x, y) = ψ(x, y)^{1/2} is still negative definite. Hence the opposite of ϕ, i.e. the Hardy multiquadric, is a conditionally positive definite kernel of order 1.


A further consequence of Lemma 1 is the following fundamental result on negative definite kernels.

Corollary 6. The Euclidean distance is negative definite. More generally, the kernel ψ(x, y) = ‖x − y‖^α is negative definite for 0 < α ≤ 2.

Proof. The result is an immediate consequence of Corollary 4 and Lemma 1.

2.6 Relations between Positive and Negative Definite Kernels

Positive and negative definite kernels are closely connected. If K is positive definite then −K is negative definite. On the contrary, if K is negative definite, then −K is a conditionally positive definite kernel of order 1. Besides, positive and negative definite functions are related by the following lemma.

Lemma 2. Let X be a nonempty set, x_0 ∈ X, and let ψ : X × X → ℝ be a symmetric kernel. Put ϕ(x, y) := ψ(x, x_0) + ψ(y, x_0) − ψ(x, y) − ψ(x_0, x_0). Then ϕ is positive definite if and only if ψ is negative definite. If ψ(x_0, x_0) ≥ 0 and ϕ_0(x, y) := ψ(x, x_0) + ψ(y, x_0) − ψ(x, y), then ϕ_0 is positive definite if and only if ψ is negative definite.

Proof. For c_1, . . . , c_n ∈ ℝ with ∑_{j=1}^n c_j = 0 and x_1, . . . , x_n ∈ X we have

∑_{j=1}^n ∑_{i=1}^n c_i c_j ϕ(x_i, x_j) = ∑_{j=1}^n ∑_{i=1}^n c_i c_j ϕ_0(x_i, x_j) = − ∑_{j=1}^n ∑_{i=1}^n c_i c_j ψ(x_i, x_j)

Therefore the positive definiteness of ϕ implies the negative definiteness of ψ.
On the other hand, suppose that ψ is negative definite. Let c_1, . . . , c_n ∈ ℝ and x_1, . . . , x_n ∈ X. We put c_0 = − ∑_{j=1}^n c_j, so that ∑_{j=0}^n c_j = 0. Then

0 ≥ ∑_{j=0}^n ∑_{i=0}^n c_i c_j ψ(x_i, x_j)

0 ≥ ∑_{j=1}^n ∑_{i=1}^n c_i c_j ψ(x_i, x_j) + ∑_{j=1}^n c_j c_0 ψ(x_j, x_0) + ∑_{i=1}^n c_i c_0 ψ(x_i, x_0) + c_0² ψ(x_0, x_0)

0 ≥ ∑_{j=1}^n ∑_{i=1}^n c_i c_j [ψ(x_i, x_j) − ψ(x_j, x_0) − ψ(x_i, x_0) + ψ(x_0, x_0)]

0 ≥ − ∑_{j=1}^n ∑_{i=1}^n c_i c_j ϕ(x_i, x_j)

Hence ϕ is positive definite. Finally, if ψ(x_0, x_0) ≥ 0 then

∑_{j=1}^n ∑_{i=1}^n c_i c_j ϕ_0(x_i, x_j) = ∑_{j=1}^n ∑_{i=1}^n c_i c_j ϕ(x_i, x_j) + ψ(x_0, x_0) ( ∑_{j=1}^n c_j )² ≥ 0

The following theorem, due to Schoenberg, is very important in kernel applications.

Theorem 14 (Schoenberg). Let X be a nonempty set and let ψ : X × X → ℝ. Then ψ is negative definite iff exp(−tψ) is positive definite ∀t > 0.

Proof. If exp(−tψ) is positive definite then 1 − exp(−tψ) is negative definite:

∑_{i=1}^n ∑_{j=1}^n c_i c_j [1 − exp(−tψ(x_i, x_j))] = ( ∑_{i=1}^n c_i )² − ∑_{i=1}^n ∑_{j=1}^n c_i c_j exp(−tψ(x_i, x_j))
= − ∑_{i=1}^n ∑_{j=1}^n c_i c_j exp(−tψ(x_i, x_j))   (since ∑_{i=1}^n c_i = 0)
≤ 0

since exp(−tψ) is positive definite. Negative definite is also the limit

lim_{t→0⁺} (1/t)(1 − exp(−tψ)) = ψ

On the other side, suppose that ψ is negative definite. We show that for t = 1 the kernel exp(−tψ) is positive definite. We choose x_0 ∈ X and, by Lemma 2, we have:

−ψ(x, y) = ϕ(x, y) − ψ(x, x_0) − ψ(y, x_0) + ψ(x_0, x_0)

where ϕ is positive definite. Therefore

exp(−ψ(x, y)) = exp(ϕ(x, y)) exp(−ψ(x, x_0)) exp(−ψ(y, x_0)) exp(ψ(x_0, x_0))

Since exp(ϕ) is positive definite (Corollary 2), exp(−ψ(x, x_0)) exp(−ψ(y, x_0)) is of the form f(x)f(y) (Corollary 3) and exp(ψ(x_0, x_0)) is a positive constant, we conclude that exp(−ψ) is positive definite. The general case, ∀t > 0, follows by applying the same argument to tψ, which is still negative definite.


An immediate consequence of the previous theorem is the following result.

Corollary 7. The Gaussian exp(−‖x − y‖²/σ²) is positive definite, for x, y ∈ ℝⁿ and σ ∈ ℝ. More generally, ψ(x, y) = exp(−a‖x − y‖^α), with a > 0 and 0 < α ≤ 2, is positive definite.

Proof. The kernel ‖x − y‖^α with 0 < α ≤ 2 is negative definite, as shown in Corollary 6. Therefore, by Theorem 14, exp(−a‖x − y‖^α) is positive definite; in particular the Gaussian is.
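Schoenberg's theorem can be observed numerically. The following sketch (NumPy assumed; an illustration, not part of the thesis) checks that the Gram matrix of exp(−t‖x − y‖^α) stays positive semidefinite for several t > 0 and 0 < α ≤ 2:

import numpy as np

# Since ||x - y||^alpha is negative definite for 0 < alpha <= 2, the Gram matrix
# of exp(-t * ||x - y||^alpha) should be positive semidefinite for every t > 0.

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 5))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances

for alpha in (0.5, 1.0, 2.0):
    for t in (0.1, 1.0, 10.0):
        K = np.exp(-t * D ** alpha)
        min_eig = np.linalg.eigvalsh(K).min()
        print(f"alpha={alpha:3.1f} t={t:5.1f} min eigenvalue = {min_eig:.2e}")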

We conclude this section by reporting, without proof, the following results.

Lemma 3. A kernel ψ : X × X → ℝ is negative definite iff (t + ψ)^{−1} is positive definite ∀t > 0.

Theorem 15. The kernel ψ : X × X → ℝ is negative definite iff its Laplace transform

L(tψ) = ∫_0^{+∞} exp(−tsψ) dμ(s)

is positive definite ∀t > 0.

A consequence of Lemma 3 is the following result.

Corollary 8. The inverse Hardy multiquadric ψ(x, y) = (α² + ‖x − y‖²)^{−1/2}, α ∈ ℝ, is positive definite.

Proof.

∑_{i=1}^n ∑_{j=1}^n c_i c_j (α² + ‖x_i − x_j‖²)^{−1/2} ≥ ∑_{i=1}^n ∑_{j=1}^n c_i c_j (α² + ‖x_i − x_j‖²)^{−1} ≥ 0

by Lemma 3.

2.6.1 Infinitely Divisible Kernels

We conclude the section by introducing the concept of infinitely divisible kernels [Hor69].

Definition 11. A positive definite kernel ϕ is called infinitely divisible if for each n ∈ ℕ there exists a positive definite kernel ϕ_n such that ϕ = (ϕ_n)^n.

If ψ is negative definite then ϕ = exp(−ψ) is infinitely divisible, since ϕ_n = exp(−ψ/n) is positive definite and (ϕ_n)^n = ϕ. It is necessary to make the following remark.


Remark 3. Let ϕ be infinitely divisible. Then

|ϕ| = |ϕ_n|^n = (|ϕ_{2n}|²)^n = [(|ϕ_{2nk}|²)^k]^n,   k, n ≥ 1

so that the nonnegative kernel |ϕ| is infinitely divisible inside the family of all nonnegative positive definite kernels, each n-th root |ϕ_n| again being an infinitely divisible kernel.

Finally, we quote a result that underlines the connections between infinitely divisible and negative definite kernels.

Theorem 16. Let N(X) be the closure of the set of all real-valued negative definite kernels on X × X. For a positive definite kernel ϕ ≥ 0 the following statements are equivalent:

• ϕ is infinitely divisible;

• − log ϕ ∈ N(X);

• ϕ^t is positive definite ∀t > 0.

2.7 Metric computation by Mercer kernels

In this section we show how to compute a metric by means of a Mercer kernel. If a Mercer kernel K is independent ([Hau99]), namely it can be represented as K(x, y) = Φ(x) · Φ(y) where Φ : X → F, computing the metric is trivial. It is easy to verify that the kernel ρ(x, y),

ρ(x, y) := √(K(x, x) − 2K(x, y) + K(y, y))

is a metric. Using the definition of independent kernel and the bilinearity axioms of the inner product, we have:

ρ(x, y) = √((Φ(x) · Φ(x)) − 2(Φ(x) · Φ(y)) + (Φ(y) · Φ(y)))
= √((Φ(x) − Φ(y)) · (Φ(x) − Φ(y)))
= ‖Φ(x) − Φ(y)‖   (2.10)

Therefore the kernel ρ(x, y) is a metric.
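The kernel-induced distance is computable even when the feature map is implicit. The following sketch (NumPy assumed; an illustration, not part of the thesis) computes ρ_K for a Gaussian kernel and spot-checks the triangle inequality on random triples:

import numpy as np

# Kernel-induced distance rho_K(x, y) = sqrt(K(x,x) - 2K(x,y) + K(y,y)) for the
# Gaussian kernel, whose feature map is implicit; the triangle inequality is checked.

def k(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / sigma ** 2)

def rho(x, y):
    return np.sqrt(max(k(x, x) - 2 * k(x, y) + k(y, y), 0.0))

rng = np.random.default_rng(4)
for _ in range(1000):
    x, y, z = rng.normal(size=(3, 2))
    assert rho(x, z) <= rho(x, y) + rho(y, z) + 1e-12   # triangle inequality
print("triangle inequality held on all sampled triples")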

If the kernel K is not independent, it is not possible to express the kernel in terms of an inner product. Therefore the bilinearity axioms are not valid and equation 2.10 cannot hold. Nevertheless, thanks to a fundamental result due to Schoenberg [Sch38a][Sch38b], it is possible to associate a metric to the kernel. In order to show this, we consider, associated to a Mercer kernel K, the kernel d(x, y):

d(x, y) := K(x, x) − 2K(x, y) + K(y, y)

The kernel d(x, y) is negative definite.

Corollary 9. If K(x, y) is positive definite then d(x, y) is negative definite. Besides, √d(x, y) is negative definite.

Proof.

∑_{j=1}^n ∑_{i=1}^n c_j c_i d(x_j, x_i) = ∑_{j=1}^n ∑_{i=1}^n c_j c_i [K(x_i, x_i) − 2K(x_i, x_j) + K(x_j, x_j)]
= ∑_{j=1}^n c_j ∑_{i=1}^n c_i K(x_i, x_i) + ∑_{i=1}^n c_i ∑_{j=1}^n c_j K(x_j, x_j) − 2 ∑_{i=1}^n ∑_{j=1}^n c_j c_i K(x_i, x_j)
= −2 ∑_{i=1}^n ∑_{j=1}^n c_j c_i K(x_i, x_j)   (since ∑_{j=1}^n c_j = 0)
≤ 0

since K is positive definite. Now we show that d(x, y) ≥ 0:

d(x, y) = K(x, x) − 2K(x, y) + K(y, y)
≥ K(x, x) − 2√(K(x, x) K(y, y)) + K(y, y)   (by inequality 2.3)
= [√K(x, x) − √K(y, y)]²
≥ 0

Hence, by Lemma 1, √d(x, y) is negative definite.

Now we introduce Schoenberg’s result [Sch38a][Sch38b].

Theorem 17. Let X be a nonempty set and ψ : X × X → ℝ be negative definite. Then there is a space H ⊆ ℝ^X and a mapping x ↦ ϕ_x from X to H such that

ψ(x, y) = ‖ϕ_x − ϕ_y‖² + f(x) + f(y)

where f : X → ℝ. The function f is nonnegative whenever ψ is. If ψ(x, x) = 0 ∀x ∈ X, then f = 0 and √ψ is a metric on X.


Proof. We fix some x_0 ∈ X and define

ϕ(x, y) = ½ [ψ(x, x_0) + ψ(y, x_0) − ψ(x, y) − ψ(x_0, x_0)]

which is positive definite by Lemma 2. Let H be the space associated to ϕ and put ϕ_x(y) = ϕ(x, y). Then

‖ϕ_x − ϕ_y‖² = ϕ(x, x) + ϕ(y, y) − 2ϕ(x, y) = ψ(x, y) − ½ [ψ(x, x) + ψ(y, y)]

Setting f(x) := ½ ψ(x, x) we have:

ψ(x, y) = ‖ϕ_x − ϕ_y‖² + f(x) + f(y)

The other statements follow immediately.

As pointed out by Deza and Laurent [Dez93], the negative definiteness of the metric is a property of L₂ spaces. Schoenberg's theorem can be reformulated in the following way.

Theorem 18. Let X be an L₂ space. Then the kernel ψ : X × X → ℝ is negative definite iff √ψ is a metric.

An immediate consequence of Schoenberg's theorem is the following result.

Corollary 10. Let K(x, y) be a positive definite kernel. Then the kernel

ρ_K(x, y) := √(K(x, x) − 2K(x, y) + K(y, y))

is a metric.

Proof. By Corollary 9 the kernel d(x, y) = K(x, x) − 2K(x, y) + K(y, y) is negative definite. Since d(x, x) = 0 ∀x ∈ X, applying Theorem 17 we get that ρ_K(x, y) := √d(x, y) is a metric.

Hence, it is always possible to compute a metric by means of a Mercer kernel, even if an implicit mapping is associated to the Mercer kernel. When an implicit mapping is associated to the kernel, one cannot compute the positions Φ(x) and Φ(y) in the Feature Space of two points x and y; nevertheless one can compute their distance ρ_K(x, y) in the Feature Space. Finally, we conclude this section by providing examples of metrics that can be derived from Mercer kernels.


Corollary 11. The following kernels ρ : X × X → ℝ⁺

• ρ(x, y) = √(2 − 2 exp(−‖x − y‖^α)), with 0 < α < 2

• ρ(x, y) = √((‖x‖² + 1)^n + (‖y‖² + 1)^n − 2(x · y + 1)^n), with n ∈ ℕ

are metrics.

Proof. Since (x · y + 1)^n and exp(−‖x − y‖^α) with 0 < α < 2 are Mercer kernels, the statement follows immediately from Corollary 10.

2.8 Hilbert Space Representation of Positive Definite Kernels

First, we recall some basic definitions in order to introduce the concept of Hilbert space.

Definition 12. A set X is a linear space (or vector space) if addition and multiplication by a scalar are defined on X such that, ∀x, y ∈ X and α ∈ ℝ:

x + y ∈ X
αx ∈ X
1x = x
0x = 0
α(x + y) = αx + αy

Definition 13. A sequence x_n in a normed linear space⁴ is said to be a Cauchy sequence if ‖x_n − x_m‖ → 0 for n, m → ∞. A space is said to be complete when every Cauchy sequence converges to an element of the space. A complete normed linear space is called a Banach space. A Banach space whose norm is induced by an inner product is called a Hilbert space.

Now we pass to representing positive definite kernels in terms of a reproducing kernel Hilbert space (RKHS). Let X be a nonempty set and ϕ : X × X → ℝ be positive definite. Let H_0 be the subspace of ℝ^X generated by the functions {ϕ_x | x ∈ X}, where ϕ_x(y) = ϕ(x, y).

⁴A normed linear space is a linear space on which a norm function ‖ · ‖ : X → ℝ is defined, mapping each element x ∈ X into ‖x‖.

If f = ∑_j c_j ϕ_{x_j} and g = ∑_i d_i ϕ_{y_i}, with f, g ∈ H_0, then

∑_i d_i f(y_i) = ∑_{i,j} c_j d_i ϕ(x_j, y_i) = ∑_j c_j g(x_j)   (2.11)

The above expression does not depend on the chosen representations of f and g, and is denoted ⟨f, g⟩. In particular ⟨f, f⟩ = ∑_{i,j} c_i c_j ϕ(x_i, x_j) ≥ 0, since ϕ is positive definite. Besides, the form ⟨·, ·⟩ is linear in both arguments. A consequence of 2.11 is the reproducing property

⟨f, ϕ_x⟩ = ∑_j c_j ϕ(x_j, x) = f(x)   ∀f ∈ H_0, ∀x ∈ X
⟨ϕ_x, ϕ_y⟩ = ϕ(x, y)   ∀x, y ∈ X

Besides, using the Cauchy-Schwarz inequality, we have:

|⟨f, ϕ_x⟩|² ≤ ⟨ϕ_x, ϕ_x⟩ ⟨f, f⟩
|f(x)|² ≤ ⟨f, f⟩ ϕ(x, x)   (2.12)

Therefore ⟨f, f⟩ = 0 ⟺ f(x) = 0 ∀x ∈ X. Hence the form ⟨·, ·⟩ is an inner product and H_0 is a pre-Hilbertian space⁵. H, the completion of H_0, is a Hilbert space in which H_0 is a dense subspace. The Hilbert function space H is usually called the reproducing kernel Hilbert space (RKHS) associated to the Mercer kernel ϕ. Hence, the following result has been proved.

Theorem 19. Let ϕ : X × X → ℝ be a Mercer kernel. Then there is a Hilbert space H ⊆ ℝ^X and a mapping x ↦ ϕ_x from X to H such that

⟨ϕ_x, ϕ_y⟩ = ϕ(x, y)   ∀x, y ∈ X

i.e. ϕ is the reproducing kernel of H.
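The reproducing property can be verified on a finite sample. The following sketch (NumPy assumed; an illustration, not part of the thesis) represents a function of H_0 as a kernel expansion and checks that its inner product with ϕ_x recovers its value at x:

import numpy as np

# A function in H_0 is a kernel expansion f = sum_j c_j phi_{x_j}; with
# <phi_x, phi_y> = k(x, y) the inner product of two expansions is c^T K d,
# and <f, phi_{x_i}> recovers f(x_i).

def k(x, y):
    return np.exp(-np.linalg.norm(x - y) ** 2)

rng = np.random.default_rng(5)
Xs = rng.normal(size=(5, 2))                 # expansion points x_j
c = rng.normal(size=5)                       # coefficients c_j
K = np.array([[k(a, b) for b in Xs] for a in Xs])

def f(x):                                    # f(x) = sum_j c_j k(x_j, x)
    return sum(cj * k(xj, x) for cj, xj in zip(c, Xs))

# <f, phi_{x_i}> = sum_j c_j k(x_j, x_i) = (K c)_i, which equals f(x_i).
assert np.allclose(K @ c, [f(xi) for xi in Xs])
# <f, f> = c^T K c >= 0 since K is positive semidefinite.
assert c @ K @ c >= -1e-12
print("reproducing property verified on the sample points")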

2.9 Conclusions

In this chapter the foundations of the Kernel Methods have been presented. The chapter has been focused on the aspects of the kernel theory that are relevant for the dissertation topics, such as the computation of inner products and metrics by means of Mercer kernels.

⁵A pre-Hilbertian space is a normed space, not necessarily complete, on which an inner product can be defined.


Positive and Negative Definite Kernels have been described, underlining the connections between them. Then it has been shown how to compute a metric by means of a Positive Definite Kernel. Finally, it has been described how a Positive Definite Kernel can be represented by means of a Hilbert space. Special attention has been paid to the theory concerning inner product and metric computation by Mercer kernels.


Chapter 3

Kernel Methods for Supervised Learning

3.1 Introduction

As pointed out in the Introduction, the learning problem can be stated as finding a concise data description, given a data set of limited size. When the data set (training set) is formed by input-output pairs, the learning is called supervised. In supervised learning the data description is a deterministic function that maps any input to an output in such a way that the disagreement with future input-output observations is minimized. Supervised learning problems can be grouped into two big families: classification problems and regression problems.

If the output space has no structure except whether two of its elements are equal or not, this is called the problem of classification. Each element of the output space is called a class. This problem emerges in any pattern recognition task. For instance, the classification of handwritten images of digits is a standard benchmark problem for classification learning algorithms. Among classification problems, the problem of binary classification has great importance. In binary classification the output space contains only two elements, the former understood as the positive class, the latter as the negative class. If the output space is not formed by discrete values, but assumes a continuous spectrum of real values, we have a regression problem.

The aim of this chapter is to provide an overview of supervised Kernel Methods. Kernel Methods have historically been developed for supervised learning. Although the topic of the thesis is unsupervised Kernel Methods, a review of the supervised Kernel Methods is appropriate for the sake of completeness. The chapter is organized as follows: Section 3.2 presents the main tools of the optimization theory used in Kernel Methods; in Section 3.3 the Support Vector Machines for Classification are reviewed; in Section 3.4 Support Vector Machines for Regression are presented; Gaussian Processes are described in Section 3.5; Section 3.6 is devoted to presenting the Kernel Fisher Discriminant; finally some conclusions are drawn in Section 3.7.

3.2 Lagrange Method and Kuhn-Tucker Theorem

In this section we describe the basic tools of the optimization theory used in the construction of Kernel Methods. The first general analytical method for solving optimization problems was discovered by Fermat in 1629. He described a method for finding the minimum or the maximum of functions defined on the entire space, without constraints. Fermat stated the following theorem, whose proof is omitted for the sake of brevity.

Theorem 20 (Fermat). Let f be a function of n variables differentiable at the point x*. If x* is a point of local extremum of the function f(x), then the differential of the function at the point x*, Df(x*), is

Df(x*) = 0   (3.1)

which implies

∂f(x*)/∂x_1 = ∂f(x*)/∂x_2 = · · · = ∂f(x*)/∂x_n = 0   (3.2)

Fermat’s theorem shows a way to find the stationary points of functions, that satisfy thenecessary conditions to be a maximum point. In order to find these points it is necessaryto solve the system (3.2) of n equations with n unknown values x? = (x?

1, x?2, . . . , x

?n).

3.2.1 Lagrange Multipliers Method

The next important step in optimization theory was made by Lagrange in 1788. Lagrange suggested his rule for solving the optimization problem with constraints (constrained optimization problem). The problem is to minimize (or to maximize) the function f, f : ℝⁿ → ℝ, under the m constraints

g_1(x) = g_2(x) = . . . = g_m(x) = 0   (3.3)

We consider only functions g_r, r = 1, . . . , m, that possess some differentiability properties. We assume that in the subset X of the space ℝⁿ all functions g_r and their partial derivatives are continuous. We have the following definition.


Definition 14. Let X ⊆ ℝⁿ and f : ℝⁿ → ℝ. We say that x* ∈ X is a point of local minimum in the problem of minimizing f under the constraints (3.3) if there exists ε > 0 such that, for all x that satisfy (3.3) and

‖x − x*‖ < ε

the inequality f(x) ≥ f(x*) holds.

The definition of maximum is analogous. Now we consider the function L (the Lagrangian),

L(x, λ, λ_0) = λ_0 f(x) + ∑_{k=1}^m λ_k g_k(x)   (3.4)

where the values λ_0, λ_1, . . . , λ_m are called Lagrange multipliers. Lagrange proved the following theorem, whose proof is omitted for the sake of brevity.

Theorem 21 (Lagrange). Let the functions f(x) and g_k(x), k = 1, . . . , m, be continuous and differentiable in a vicinity of x*. If x* is the point of a local extremum, then one can find Lagrange multipliers λ* = (λ*_1, λ*_2, . . . , λ*_m) and λ*_0, not all equal to zero, such that the differential of the Lagrangian DL(x*, λ*, λ*_0) is null (stationarity condition), i.e.

DL(x*, λ*, λ*_0) = 0   (3.5)

which implies

∂L(x*, λ*, λ*_0)/∂x_i = 0,   i = 1, 2, . . . , n   (3.6)

To guarantee that λ_0 ≠ 0 it is sufficient that the m gradients ∇g_1(x*), ∇g_2(x*), . . . , ∇g_m(x*) are linearly independent.

Therefore, to find the stationary point x*, the system formed by the following n + m equations

∂/∂x_i ( λ_0 f(x) + ∑_{k=1}^m λ_k g_k(x) ) = 0,   i = 1, . . . , n   (3.7)

g_1(x) = g_2(x) = . . . = g_m(x) = 0   (3.8)

must be solved. The system has n + m equations in n + m + 1 unknown values, so it would seem to be indeterminate. However, the Lagrange multipliers are defined only up to a common multiplier.


If λ_0 ≠ 0 (this is the most important case, since λ_0 = 0 means that the goal function is not connected with the constraints), then one can multiply all Lagrange multipliers by a constant so as to obtain λ_0 = 1. Hence the number of equations becomes equal to the number of unknowns and the system assumes the final form:

∂/∂x_i ( f(x) + ∑_{k=1}^m λ_k g_k(x) ) = 0,   i = 1, . . . , n   (3.9)

g_1(x) = g_2(x) = . . . = g_m(x) = 0   (3.10)
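A small worked example may help. The following sketch (NumPy/SciPy assumed; an illustration, not part of the thesis) minimizes f(x, y) = x² + y² under the constraint x + y = 1, for which the Lagrange conditions 2x + λ = 0, 2y + λ = 0, x + y = 1 give x = y = 1/2, λ = −1:

import numpy as np
from scipy.optimize import minimize

# Minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0 and compare the
# numerical solution with the analytic one obtained from the Lagrange conditions.
f = lambda v: v[0] ** 2 + v[1] ** 2
g = {"type": "eq", "fun": lambda v: v[0] + v[1] - 1.0}

res = minimize(f, x0=[0.0, 0.0], constraints=[g])
print(res.x)            # approximately [0.5, 0.5], matching the analytic solution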

3.2.2 Kuhn-Tucker Theorem

In 1951 Kuhn and Tucker [Kuh51] suggested an extension of the Lagrange method in order to cope with constraints of inequality type. Kuhn and Tucker suggested a solution to the convex optimization problem, i.e. minimizing a convex objective function under certain convex constraints of inequality type. We recall the concept of convexity.

Definition 15. The set A is called convex if, ∀x, y ∈ A, it contains the interval

[x, y] = {z : z = αx + (1 − α)y, 0 ≤ α ≤ 1}

that connects these points.

Definition 16. The function f is called convex if ∀x, y the inequality (Jensen inequality)

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y),   0 ≤ α ≤ 1

holds true.

We consider the following convex optimization problem:

Problem 1. Let X be a linear space, let A be a convex subset of this space, and let f(x) and g_k(x), k = 1, . . . , m, be convex functions. Minimize the function f(x) subject to the constraints

x ∈ A   (3.11)

g_k(x) ≤ 0,   k = 1, . . . , m   (3.12)

To solve this problem we consider the Lagrangian function

L(x, λ, λ_0) = λ_0 f(x) + ∑_{k=1}^m λ_k g_k(x)

where λ = (λ_1, . . . , λ_m). We have the following theorem.


Theorem 22 (Kuhn-Tucker). If x* minimizes the function f(x) under the constraints (3.11) and (3.12), then there exist Lagrange multipliers λ*_0 and λ* = (λ*_1, . . . , λ*_m), not simultaneously equal to zero, such that the following three conditions hold true:

1. The minimum principle

L(x*, λ*_0, λ*) = min_{x∈A} L(x, λ*_0, λ*)   (3.13)

2. The nonnegativeness conditions

λ*_k ≥ 0,   k = 0, 1, . . . , m   (3.14)

3. The Kuhn-Tucker conditions

λ*_k g_k(x*) = 0,   k = 1, . . . , m   (3.15)

If λ*_0 ≠ 0, then conditions (1), (2) and (3) are sufficient for x* to be the solution of the optimization problem. To get λ*_0 ≠ 0, it is sufficient that there exists x̄ such that the following conditions (Slater conditions)

g_i(x̄) < 0,   i = 1, . . . , m

hold.

The following corollary is a consequence of the Kuhn-Tucker theorem.

Corollary 12. If the Slater condition is satisfied, then one can choose λ_0 = 1 and rewrite the Lagrangian in the form

L(x, 1, λ) = f(x) + ∑_{k=1}^m λ_k g_k(x)   (3.16)

Now the Lagrangian is defined on n + m variables and the conditions of the Kuhn-Tucker theorem are equivalent to the existence of a saddle point (x*, λ*) of the Lagrangian, i.e.

min_{x∈A} L(x, 1, λ*) = L(x*, 1, λ*) = max_{λ≥0} L(x*, 1, λ)   (3.17)

Proof. The left equality of (3.17) follows from condition (1) of the Kuhn-Tucker theorem and the right equality follows from conditions (3) and (2) of the same theorem.

The Lagrange method and the Kuhn-Tucker theorem are the basic optimization tools of the Kernel Methods described further in the dissertation.
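A brief numerical illustration of the Kuhn-Tucker conditions (SciPy assumed; an illustration, not part of the thesis): minimizing a quadratic objective under an inequality constraint that turns out to be active at the optimum, so that complementary slackness λ g(x*) = 0 holds with λ > 0:

import numpy as np
from scipy.optimize import minimize

# Minimize f(x, y) = (x - 2)^2 + (y - 1)^2 subject to g(x, y) = x + y - 1 <= 0.
# The analytic solution is x* = (1, 0); the constraint is active, g(x*) = 0.
f = lambda v: (v[0] - 2.0) ** 2 + (v[1] - 1.0) ** 2
g = lambda v: v[0] + v[1] - 1.0                       # feasible iff g(v) <= 0

res = minimize(f, x0=[0.0, 0.0],
               constraints=[{"type": "ineq", "fun": lambda v: -g(v)}])
x_star = res.x
print("x* =", x_star, " g(x*) =", g(x_star))          # g(x*) ~ 0: active constraint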


3.3 Support Vector Machines for Classification

In this section we describe the most popular Kernel Method, i.e. the Support Vector Machine (SVM) for classification. For the sake of simplicity, we consider a binary classification problem, that is, a training set with only two classes. Let D be a training set formed by l patterns p_i. Each pattern p_i is a couple of values (x_i, y_i), where the first term x_i (x_i ∈ ℝⁿ) is called the input and the second term (the output) y_i can assume only two possible discrete values, which we fix conventionally at +1 and −1. The patterns with output +1 are called positive patterns, while the others are called negative patterns. Finally, we assume that each pattern p_i has been generated according to an unknown probability distribution P(x, y). The problem of learning how to classify patterns correctly consists in estimating a function f : ℝⁿ → {±1} using the training set patterns

(x_1, y_1), . . . , (x_l, y_l) ∈ ℝⁿ × {±1}   (3.18)

such that f will correctly classify unseen examples (x, y), i.e. f(x) = y for examples (x, y) generated from the same probability distribution P(x, y) as the training set. Usually one makes the assumption that the patterns (x_i, y_i) are i.i.d., i.e. independent and identically distributed. The underlying idea of the SVM is the optimal hyperplane algorithm.

Figure 3.1: A binary classification problem: to separate circles from disks. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes and intersects it half-way between the two classes.


3.3.1 Optimal hyperplane algorithm

Vapnik and Lerner [Vap63] and Vapnik and Chervonenkis [Vap64] considered the class of hyperplanes

w · x + b = 0,   w, x ∈ ℝⁿ, b ∈ ℝ   (3.19)

corresponding to the decision functions¹

f(x) = sgn(w · x + b)   (3.20)

and proposed a learning algorithm, the Generalized Portrait for separable problems, that computes f from empirical data. Vapnik observed that among all hyperplanes separating the data there exists a unique one yielding the maximum margin of separation between the classes:

max_{w,b} min{‖x − x_i‖ : x ∈ ℝⁿ, w · x + b = 0, i = 1, . . . , l}   (3.21)

In order to construct the Optimal Hyperplane, the following optimization problem has to be solved:

minimize τ(w) = ½ ‖w‖²   (3.22)

subject to y_i((w · x_i) + b) ≥ 1,   i = 1, . . . , l   (3.23)

This constrained optimization problem can be solved by introducing Lagrange multipliers α_i ≥ 0 and a Lagrangian function L (see Section 3.2):

L(w, b, α) = ½ ‖w‖² − ∑_{i=1}^l α_i [y_i((x_i · w) + b) − 1]   (3.24)

The Lagrangian L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables α_i, i.e. a saddle point has to be found. The optimization problem can be solved by means of the Kuhn-Tucker theorem (see Section 3.2). The Kuhn-Tucker theorem implies that, at the saddle point, the derivatives of L with respect to the primal variables must vanish:

∂L(w, b, α)/∂b = 0,   ∂L(w, b, α)/∂w = 0   (3.25)

which leads to

∑_{i=1}^l α_i y_i = 0   (3.26)

1The function signum sgn(x) is defined as follows: sgn(x) = 1 if x > 0; sgn(x) = −1 if x < 0;sgn(x) = 0 if x = 0.


and

w = ∑_{i=1}^l α_i y_i x_i   (3.27)

Hence the solution vector w is an expansion in terms of a subset of the training set patterns, namely those patterns whose α_i are ≠ 0. These patterns are called Support Vectors (SV). The Kuhn-Tucker theorem implies that the α_i must satisfy the Karush-Kuhn-Tucker (KKT) conditions

α_i · [y_i((x_i · w) + b) − 1] = 0,   i = 1, . . . , l   (3.28)

This condition implies that the Support Vectors lie on the margin. All remaining samples of the training set are irrelevant for the optimization, since their α_i is null. This implies that the hyperplane is completely determined by the patterns closest to it; the solution does not depend on the other patterns of the training set. Therefore (3.27) can be written as

w = ∑_{x_i ∈ SV} α_i y_i x_i   (3.29)

Plugging (3.26) and (3.27) into L, one eliminates the primal variables and the optimization problem becomes:

maximize W(α) = ∑_{i=1}^l α_i − ½ ∑_{i,j=1}^l α_i α_j y_i y_j (x_i · x_j)   (3.30)

subject to α_i ≥ 0,   i = 1, . . . , l   (3.31)

∑_{i=1}^l α_i y_i = 0   (3.32)

Therefore the hyperplane decision function can be written as

f(x) = sgn( ∑_{i=1}^l α_i y_i (x · x_i) + b )   (3.33)

The optimal hyperplane algorithm can solve only linear problems. It cannot solve simple nonlinear problems such as XOR, as underlined by Minsky and Papert [Min69]. In order to build a classifier that can solve nonlinear problems, one has to find a method to perform the optimal hyperplane algorithm in a Feature Space nonlinearly related to the input space [Aiz64]. Theorem 5 in Section 2.3 states that Mercer kernels permit performing scalar products in Feature Spaces that are nonlinearly related to the input space.


In particular, each Mercer kernel K(x, y), K : X × X → ℝ, can be written as

K(x, y) = (Φ(x) · Φ(y)) (3.34)

where Φ : X → F and F is called the Feature Space. Hence it is sufficient to substitute in formula (3.33) the inner product x · x_i with the kernel K(x, x_i) in order to perform the optimal hyperplane algorithm in the Feature Space F. This method is called the kernel trick [Sch00a].

3.3.2 Support Vector Machine Construction

In order to construct SVMs, an optimal hyperplane in some Feature Space has to becomputed. In order to do that, it is sufficient to substitute each training example xi withits corresponding image in the Feature Space Φ(xi). The weight vector (3.27) becomes anexpansion of vectors in the Feature Space

w = ∑_{i=1}^l α_i y_i Φ(x_i).   (3.35)

Hence the weight vector is not directly computable when the mapping Φ is unknown. Since the Φ(x_i) occur only in scalar products, the scalar products can be substituted by an appropriate Mercer kernel K, leading to a generalization of the decision function (3.33):

f(x) = sgn( ∑_{i=1}^l α_i y_i (Φ(x_i) · Φ(x)) + b ) = sgn( ∑_{i=1}^l α_i y_i K(x_i, x) + b )   (3.36)

and the following quadratic problem to optimize:

maximize W(α) = ∑_{i=1}^l α_i − ½ ∑_{i,j=1}^l α_i α_j y_i y_j K(x_i, x_j)   (3.37)

subject to α_i ≥ 0,   i = 1, . . . , l   (3.38)

∑_{i=1}^l α_i y_i = 0   (3.39)

In most practical problems a separating hyperplane may not exist. Due to the presence of noise, there is usually some overlap between the classes. Therefore it is necessary to allow


the possibility that some examples violate (3.23). In order to do that, it is necessary to use slack variables [Cor95] [Vap95]

ξi ≥ 0 i = 1 . . . l (3.40)

relaxing the constraints

yi · ((w · xi) + b) ≥ 1− ξi i = 1 . . . l (3.41)

Therefore the constructed classifier, the Support Vector Machine, controls at the same time the margin (‖w‖) and the number of training errors, minimizing the objective function:

τ(w, ξ) = ½ ‖w‖² + C ∑_{i=1}^l ξ_i   (3.42)

subject to the constraints (3.40) and (3.41), for some value of the constant C ≥ 0 determining the trade-off. Plugging the constraints into (3.42) and rewriting in terms of Lagrange multipliers, we obtain the following problem to maximize:

maximize W(α) = ∑_{i=1}^l α_i − ½ ∑_{i,j=1}^l α_i α_j y_i y_j K(x_i, x_j)   (3.43)

subject to 0 ≤ α_i ≤ C,   i = 1, . . . , l   (3.44)

∑_{i=1}^l α_i y_i = 0   (3.45)

The only difference from the separable case is the upper bound C on the Lagrange multipliers α_i. As in the separable case, the decision function assumes the form (3.36). The threshold b can be computed by exploiting the fact that for all SVs x_i with α_i < C the slack variable ξ_i is zero, therefore

∑_{j=1}^l y_j α_j K(x_i, x_j) + b = y_i   (3.46)

The solution of the system formed by equations (3.43), (3.44), (3.45) requires Quadratic Programming (QP) techniques, which are not always efficient. However, it is possible to use in SVMs different approaches that do not require QP techniques.
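For illustration, the soft-margin dual (3.43)-(3.45) can be solved on a toy problem with a generic solver. The following sketch (NumPy/SciPy assumed; a production SVM would use a dedicated QP or SMO routine) also recovers the threshold b from the margin support vectors, as in equation (3.46):

import numpy as np
from scipy.optimize import minimize

# Solve the soft-margin kernel dual on two Gaussian blobs with a Gaussian kernel,
# showing the box constraints 0 <= alpha_i <= C and the equality sum(alpha_i y_i)=0.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-1.5, 1.0, (10, 2)), rng.normal(1.5, 1.0, (10, 2))])
y = np.array([-1.0] * 10 + [1.0] * 10)
C = 10.0
K = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=-1) ** 2)

def neg_W(alpha):                        # minimize -W(alpha) of equation (3.43)
    return -alpha.sum() + 0.5 * (alpha * y) @ K @ (alpha * y)

res = minimize(neg_W, np.zeros(len(y)), bounds=[(0.0, C)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

# threshold b averaged over margin support vectors, cf. equation (3.46)
margin = (alpha > 1e-5) & (alpha < C - 1e-5)
if not margin.any():
    margin = alpha > 1e-5
b = np.mean(y[margin] - K[margin] @ (alpha * y))

def predict(x):
    kx = np.exp(-np.linalg.norm(X - x, axis=-1) ** 2)
    return np.sign((alpha * y) @ kx + b)

acc = np.mean([predict(xi) == yi for xi, yi in zip(X, y)])
print("training accuracy:", acc)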

3.3.2.1 A Linear programming approach to classification

Instead of quadratic programming, it is also possible to derive a kernel classifier in which the learning task involves Linear Programming (LP). Training the classifier


involves the minimization

min [ ∑_{i=1}^l α_i + C ∑_{i=1}^l ξ_i ]   (3.47)

subject to

y_i [ ∑_{j=1}^l α_j K(x_i, x_j) + b ] ≥ 1 − ξ_i   (3.48)

where α_i ≥ 0 and ξ_i ≥ 0.

By minimizing ∑_{i=1}^l α_i we can obtain a solution which is sparse, i.e. relatively few datapoints have α_i non-null and hence have to be used in the computations. Moreover, the efficient simplex method [Kor68] exists for solving linear programming problems, so this is a practical alternative to conventional SVMs based on QP approaches. This linear programming approach [Man65] evolved independently of the QP approach to SVMs. It is also possible to handle multi-class problems using linear programming techniques [Wes99].

3.3.3 Algorithmic Approaches

The methods we have considered involve linear or quadratic programming. Linear programming can be implemented using the simplex method; LP packages are included in the most popular mathematical software packages such as Matlab and Mathematica. For quadratic programming there are also many applicable techniques, including conjugate gradient and primal-dual interior point methods [Lue84]. Certain QP packages are readily applicable, such as MINOS and LOQO. These methods can be used to train an SVM rapidly, but they have the disadvantage that the l × l matrix K(x_i, x_j) (Gram matrix) is stored in memory. For small datasets this is possible, but for large datasets alternative techniques have to be used. These techniques can be grouped into three categories: techniques in which kernel components are evaluated and discarded during learning, working set methods in which an evolving subset of data is used, and new algorithms that explicitly exploit the structure of the problem. For the first category the most obvious approach is to sequentially update the α_i, and this is the approach used by the kernel adatron algorithm (KA) [Fri98]. For binary classification, with no soft margin or bias, this is a simple gradient ascent procedure on (3.43) in which α_i ≥ 0 initially and the α_i are subsequently sequentially updated using

α_i ← β_i θ(β_i)   (3.49)

where

β_i = α_i + η [ 1 − y_i ∑_{j=1}^l α_j y_j K(x_i, x_j) ]   (3.50)


and θ(β) is the Heaviside step function. The optimal learning rate is η = 1/K(x_i, x_i), and a sufficient condition for convergence is 0 < ηK(x_i, x_i) < 2. This method is very easy to implement and can give a quick impression of the performance of SVMs on classification tasks. However, it is not as fast as most QP routines, especially on small datasets.
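The kernel adatron is simple enough to implement in a few lines. The following sketch (NumPy assumed; an illustration of the update (3.49)-(3.50), without soft margin or bias) runs it on a toy dataset:

import numpy as np

# Kernel adatron: gradient ascent on the dual with projection alpha_i >= 0,
# following equations (3.49)-(3.50).
def kernel_adatron(K, y, eta=None, epochs=200):
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            rate = eta if eta is not None else 1.0 / K[i, i]   # suggested step size
            beta = alpha[i] + rate * (1.0 - y[i] * np.sum(alpha * y * K[:, i]))
            alpha[i] = max(beta, 0.0)                          # beta * theta(beta)
    return alpha

# toy usage on two Gaussian blobs
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-2, 1, (15, 2)), rng.normal(2, 1, (15, 2))])
y = np.array([-1.0] * 15 + [1.0] * 15)
K = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=-1) ** 2)

alpha = kernel_adatron(K, y)
pred = np.sign((alpha * y) @ K)
print("training accuracy:", np.mean(pred == y))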

3.3.3.1 Chunking and Decomposition

Rather than sequentially updating the α_i, the alternative is to update the α_i in parallel, but using only a subset or chunk of data at each stage. Thus a QP routine is used to optimize the Lagrangian on an initial arbitrary subset of data. The support vectors found are retained and all other datapoints (with α_i = 0) are discarded. A new working set of data is then derived from these support vectors and additional datapoints which maximally violate the KKT conditions. This chunking [Osu97] process is then iterated until the margin is maximized. This procedure may still fail because the dataset is too large or the hypothesis modelling the data is not sparse, i.e. most α_i are non-null. In this case decomposition [Osu99] methods provide a better approach: these algorithms only use a fixed size subset of data, with the α_i for the remainder kept fixed.

3.3.3.2 Sequential minimal optimization

The limiting case of decomposition is the Sequential Minimal Optimization (SMO) algorithm of Platt [Pla99], in which only two α_i are optimized at each iteration. The smallest set of parameters which can be optimized at each iteration has cardinality two if the constraint ∑_{i=1}^l α_i y_i = 0 holds. If only two parameters are optimized and the rest are kept fixed, then it is possible to derive an analytical solution which can be executed using few numerical operations. The method consists of a heuristic step for finding the best pair of multipliers to optimize and the use of an analytical expression to ensure that the Lagrangian increases monotonically. For the hard margin case the latter is easy to derive from the maximization of δW with respect to the additive corrections a, b in α_i → α_i + a and α_j → α_j + b (i ≠ j). For the soft margin case it is necessary to avoid violating the box constraints on the α_i, leading to bounds on these corrections. The SMO algorithm has been refined to improve speed [Kee99]. It is worth mentioning that SVM packages such as SVMTorch [Col01] and SVM-Light [Joa99] use working set methods.


3.3.3.3 Further optimization algorithms

The third optimization approach consists in creating new algorithms. Keerthi et al. [Kee00] have proposed a very effective binary classification algorithm based on the dual geometry of finding the two closest points in the convex hulls. These approaches have been particularly effective for linear SVM problems. The Lagrangian SVM (LSVM) method of Mangasarian and Musicant [Man00] reformulates the classification problem as an unconstrained optimization task and then solves it using a method based on the Sherman-Morrison-Woodbury formula, which only requires the solution of systems of linear equalities. Finally, it is worth mentioning the interior-point [Fer00a] and semi-smooth support vector [Fer00b] methods of Ferris and Munson, which seem quite effective in solving linear classification problems with huge training sets.

3.4 Support Vector Machines for Regression

The Support Vector method can also be extended to the case of regression, maintaining all the main features that characterize the maximal margin algorithm: that is, a non-linear function is learned by a linear learning machine in a Feature Space, while the capacity of the system is controlled by a parameter that does not depend on the dimensionality of the space. As in the classification case, the learning algorithm minimises a convex functional and its solution is sparse. In the regression task the underlying idea is to define a loss function that ignores errors that are within a certain distance of the true value. This type of function is referred to as an ε-insensitive loss function.

Definition 17. Let D = {(x_i, y_i), i = 1, . . . , l} with x_i ∈ ℝⁿ, y_i ∈ ℝ, and let f : X ⊆ ℝⁿ → ℝ be a function. The linear ε-insensitive loss function L_ε(x, y, f) is defined by

L_ε(x, y, f) = |y − f(x)|_ε   (3.51)

where x ∈ X, y ∈ ℝ and |u|_ε := max(0, |u| − ε). In an analogous way, the quadratic ε-insensitive loss function is defined by

L_ε(x, y, f) = |y − f(x)|²_ε   (3.52)

We can visualize this as a tube of size 2ε around the function f(x) and any points outsidethis tube can be viewed as training errors. This approach is called ε-SV regression [Vap95]and is the most common approach to SV regression, though not the only one [Vap98].


3.4.1 Regression with Quadratic ε-insensitive loss

We can optimise the generalization of our regressor by minimising the sum of the quadraticε-insensitive losses. As in the classification case we introduce a parameter C to measurethe trade-off between complexity and losses.The primal problem can therefore be defined as follows:

min [ ‖w‖² + C ∑_{i=1}^l (ξ_i² + ξ̂_i²) ]   (3.53)

subject to y_i − (w · x_i + b) ≤ ε + ξ_i,   i = 1, . . . , l   (3.54)

(w · x_i + b) − y_i ≤ ε + ξ̂_i,   i = 1, . . . , l   (3.55)

ξ_i ≥ 0,   i = 1, . . . , l   (3.56)

ξ̂_i ≥ 0,   i = 1, . . . , l   (3.57)

ξ_i ξ̂_i = 0,   i = 1, . . . , l   (3.58)

where we have introduced two slack variables: the first one for exceeding the target value by more than ε, the second one for being more than ε below the target. The constrained optimization problem can be solved by the usual techniques, i.e. Lagrange multipliers and the Kuhn-Tucker theorem, taking into account that (3.58) induces, for the corresponding Lagrange multipliers α_i and α̂_i, the relation

α_i α̂_i = 0,   i = 1, . . . , l.

Hence we get the following objective function to maximize:

W(α, α̂) = ∑_{i=1}^l y_i(α_i − α̂_i) − ε ∑_{i=1}^l (α_i + α̂_i) − ½ ∑_{i,j=1}^l (α_i − α̂_i)(α_j − α̂_j)(x_i · x_j + (1/C) δ_ij)   (3.59)

subject to

∑_{i=1}^l α_i = ∑_{i=1}^l α̂_i   (3.60)

α_i ≥ 0,   i = 1, . . . , l   (3.61)

α̂_i ≥ 0,   i = 1, . . . , l   (3.62)

where δ_ij is the Kronecker symbol. The corresponding KKT complementarity conditions are

α_i((w · x_i + b) − y_i − ε − ξ_i) = 0,   i = 1, . . . , l   (3.63)

α̂_i(y_i − (w · x_i + b) − ε − ξ̂_i) = 0,   i = 1, . . . , l   (3.64)

ξ_i ξ̂_i = 0,   i = 1, . . . , l   (3.65)

α_i α̂_i = 0,   i = 1, . . . , l   (3.66)


If we define β = α − α̂, the problem (3.59) assumes a form similar to the classification case:

max ∑_{i=1}^l y_i β_i − ε ∑_{i=1}^l |β_i| − ½ ∑_{i=1}^l ∑_{j=1}^l β_i β_j ((x_i · x_j) + (1/C) δ_ij)   (3.67)

subject to ∑_{i=1}^l β_i = 0   (3.68)

Finally, using the kernel trick, i.e. substituting in (3.67) the dot products x_i · x_j with K(x_i, x_j), where K(·, ·) is an appropriate Mercer kernel, we get

max ∑_{i=1}^l y_i β_i − ε ∑_{i=1}^l |β_i| − ½ ∑_{i=1}^l ∑_{j=1}^l β_i β_j (K(x_i, x_j) + (1/C) δ_ij)   (3.69)

subject to ∑_{i=1}^l β_i = 0   (3.70)

Then the function modelling the data is

f(z) = ∑_{i=1}^l β_i K(x_i, z) + b   (3.71)

For y_i ∈ {−1, +1}, the similarity with the classification case becomes even more apparent when ε = 0: if we use the variables β̂_i = y_i β_i, the only difference is that the β̂_i are not constrained to be positive, unlike the classification case.

3.4.2 Kernel Ridge Regression

We consider again the final formulation of the regression with quadratic ε-insensitive loss, i.e.

max ∑_{i=1}^l y_i β_i − ε ∑_{i=1}^l |β_i| − ½ ∑_{i=1}^l ∑_{j=1}^l β_i β_j (K(x_i, x_j) + (1/C) δ_ij)   (3.72)

subject to ∑_{i=1}^l β_i = 0   (3.73)

It is necessary to make some remarks. When ε ≠ 0, we introduce an extra weight factor involving the dual parameters. On the other hand, when ε is null the problem corresponds to standard least squares linear regression with a weight decay factor controlled by the regularization parameter C. This approach to regression is also known as ridge


regression, and it is equivalent to techniques derived from Gaussian Processes, that we willexamine in Section 3.5.We will give here an independent derivation. First of all, we ignore the bias term b.Therefore we consider the problem that can be stated as follows:

min λ‖w‖² + ∑_{i=1}^l ξ_i²   (3.74)

subject to y_i − (w · x_i) = ξ_i,   i = 1, . . . , l   (3.75)

Hence we derive the Lagrangian:

L(w, ξ, α) = λ‖w‖² + ∑_{i=1}^l ξ_i² + ∑_{i=1}^l α_i(y_i − (w · x_i) − ξ_i)   (3.76)

If we impose

∂L/∂w = 0,   ∂L/∂ξ_i = 0

we get

w = (1/2λ) ∑_{i=1}^l α_i x_i   (3.77)

ξ_i = α_i / 2   (3.78)

Plugging into (3.76) we have:

max_α W(α) = ∑_{i=1}^l y_i α_i − (1/4λ) ∑_{i=1}^l ∑_{j=1}^l α_i α_j (x_i · x_j) − (1/4) ∑_{i=1}^l α_i²   (3.79)

and using the kernel trick, i.e. substituting x_i · x_j with the kernel k(x_i, x_j), where k is an appropriate Mercer kernel, we get the final form:

max_α W(α) = ∑_{i=1}^l y_i α_i − (1/4λ) ∑_{i=1}^l ∑_{j=1}^l α_i α_j k(x_i, x_j) − (1/4) ∑_{i=1}^l α_i²   (3.80)

Equation (3.80) can be rewritten in matrix form as

W(α) = y^T α − (1/4λ) α^T K α − (1/4) α^T α   (3.81)

where y and α are the vectors formed respectively by the y_i and the α_i, and K is the Gram matrix whose generic element is K_ij = k(x_i, x_j). If we impose

∂W/∂α = 0


we get

−(1/2λ) Kα − (1/2) α + y = 0   (3.82)

Hence

α = 2λ (K + λI)^{−1} y   (3.83)

where I is the identity matrix. The corresponding regression function is:

f(x) = y^T (K + λI)^{−1} k(x)   (3.84)

where k(x) is the vector whose generic element is k_i(x) = k(x_i, x).
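Kernel ridge regression is a few lines of code once (3.83)-(3.84) are in hand. The following sketch (NumPy assumed; the 1D toy data are an illustration, not part of the thesis) fits a noisy sine curve with a Gaussian kernel:

import numpy as np

# Kernel ridge regression following (3.83)-(3.84): f(x) = y^T (K + lam I)^{-1} k(x).
def k(x, y, sigma=0.5):
    return np.exp(-(x - y) ** 2 / sigma ** 2)

rng = np.random.default_rng(8)
X = np.sort(rng.uniform(0, 2 * np.pi, 40))          # 1D inputs
y = np.sin(X) + 0.1 * rng.normal(size=X.shape)      # noisy targets
lam = 0.1

K = k(X[:, None], X[None, :])
coeffs = np.linalg.solve(K + lam * np.eye(len(X)), y)   # (K + lam I)^{-1} y

def predict(x_new):
    return k(X, x_new) @ coeffs                      # y^T (K + lam I)^{-1} k(x)

print("prediction at pi/2:", predict(np.pi / 2))     # close to sin(pi/2) = 1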

3.4.3 Regression with Linear ε-insensitive loss

We can optimise the generalization of our regressor by minimising the sum of the linearε-insensitive losses. For a linear ε-insensitive loss function, the task is to minimize

min [ ½ ‖w‖² + C ∑_{i=1}^l (ξ_i + ξ̂_i) ]   (3.85)

subject to y_i − (w · x_i + b) ≤ ε + ξ_i,   i = 1, . . . , l   (3.86)

(w · x_i + b) − y_i ≤ ε + ξ̂_i,   i = 1, . . . , l   (3.87)

ξ_i ≥ 0,   i = 1, . . . , l   (3.88)

ξ̂_i ≥ 0,   i = 1, . . . , l   (3.89)

Plugging the constraints into (3.85) we get the following objective function to maximize:

W(α, α̂) = ∑_{i=1}^l y_i(α_i − α̂_i) − ε ∑_{i=1}^l (α_i + α̂_i) − ½ ∑_{i,j=1}^l (α_i − α̂_i)(α_j − α̂_j)(x_i · x_j)   (3.90)

subject to

∑_{i=1}^l α_i = ∑_{i=1}^l α̂_i   (3.91)

0 ≤ α_i ≤ C,   i = 1, . . . , l   (3.92)

0 ≤ α̂_i ≤ C,   i = 1, . . . , l   (3.93)

Using the kernel trick we finally get:

W(α, α̂) = ∑_{i=1}^l y_i(α_i − α̂_i) − ε ∑_{i=1}^l (α_i + α̂_i) − ½ ∑_{i,j=1}^l (α_i − α̂_i)(α_j − α̂_j) K(x_i, x_j)   (3.94)

subject to

∑_{i=1}^l α_i = ∑_{i=1}^l α̂_i   (3.95)

0 ≤ α_i ≤ C,   i = 1, . . . , l   (3.96)

0 ≤ α̂_i ≤ C,   i = 1, . . . , l   (3.97)

where K(·, ·) is an appropriate Mercer kernel. Finally, we have to compute the bias b. In order to do that, we consider the KKT conditions for regression. Before using the kernel trick, the KKT conditions are

α_i(ε + ξ_i − y_i + w · x_i + b) = 0   (3.98)

α̂_i(ε + ξ̂_i + y_i − w · x_i − b) = 0   (3.99)

where

w = ∑_{j=1}^l (α_j − α̂_j) x_j   (3.100)

(C − α_i) ξ_i = 0   (3.101)

(C − α̂_i) ξ̂_i = 0   (3.102)

From the latter conditions we see that the slack variables are non-null only when α_i = C or α̂_i = C. These samples of the training set correspond to points outside the ε-insensitive tube. Hence, using (3.100), we can find the bias from a non-bound example with 0 < α_i < C via b = y_i − w · x_i − ε, and similarly for 0 < α̂_i < C we can obtain it from b = y_i − w · x_i + ε. Though the bias b can be obtained using only one sample of the training set, it is preferable to compute it as an average over all points on the margin.

3.4.4 Other Approaches to Support Vector Regression

Apart from the formulations given here it is possible to define other loss functions givingrise to different dual objective functions. In addition, rather than specifying ε a priori itis possible to specify an upper bound ν (0 ≤ ν ≤ 1) on the fraction of the points lyingoutside the band and then find ε by optimizing over the primal objective function

½ ‖w‖² + C ( νlε + ∑_{i=1}^l |y_i − f(x_i)| )   (3.103)

with ε acting as an additional parameter to minimize over [Sch98b].As for classification it is possible to formulate a linear programming approach to regressionwith [Wes98]

min [ ∑_{i=1}^l α_i + ∑_{i=1}^l α̂_i + ∑_{i=1}^l ξ_i + ∑_{i=1}^l ξ̂_i ]   (3.104)

subject to

y_i − ε − ξ_i ≤ [ ∑_{j=1}^l (α_j − α̂_j) K(x_i, x_j) ] + b ≤ y_i + ε + ξ̂_i   (3.105)

Minimizing the sum of the α_i and α̂_i approximately minimizes the number of support vectors, which favours sparse hypotheses with smooth functional approximations of the data. In this approach the kernel does not need to satisfy Mercer's condition [Wes98].

3.5 Gaussian Processes

Gaussian Processes are an emerging branch of Kernel Methods. Unlike SVMs, which are designed to solve mainly classification problems, Gaussian Processes are designed to solve essentially regression problems. Although there are some attempts [Wil98] at using Gaussian Processes for classification, the problem of solving a classification task with Gaussian Processes remains still open. Gaussian Processes are not a novelty. Within the geostatistics field, Matheron [Mat63] proposed a framework for regression using optimal linear estimators, which he called kriging, in honour of D.G. Krige, a South African mining engineer. This framework is identical to Gaussian Processes as currently used in machine learning. Kriging [Cre93] has been developed considerably in the last thirty years in geostatistics, even though the model has been developed mainly for the solution of low-dimensional problems, at most problems in ℝ³. The Machine Learning community completely ignored Gaussian Processes until Neal [Nea96] rediscovered them. Neal argued that there is no reason to believe that, for real problems, neural networks should be limited to nets containing only a small number of hidden nodes. A neural network model with a huge number of nodes cannot be trained with a backpropagation algorithm based on the maximum likelihood principle [Dud73] [Fuk90], since the trained neural net overfits the data. Neal investigated the behavior of the net when the number of hidden nodes goes to infinity, and showed that good performance can be obtained using Bayesian learning [Mac92] instead of the maximum likelihood strategy.

In the Bayesian approach to neural networks a prior distribution over the weights induces a prior distribution over functions. This prior is combined with a noise model, which specifies the probability of observing the targets t_i given function values y_i, to yield a posterior over functions which can then be used for predictions. Neal proved that the most popular neural model, the Multi-Layer Perceptron [Bis95], converges to a Gaussian Process prior over functions in the limit of an infinite number of hidden units. Although infinite networks are a method of creating Gaussian Processes, it is also possible to specify them directly using parametric forms for the mean and covariance functions. The advantage of the Gaussian Process formulation, in comparison with infinite networks, is that the integrations, which have to be approximated for neural nets, can be carried out exactly, using matrix computations. In the following section it is described how to perform regression by means of Gaussian Processes.

3.5.1 Regression with Gaussian Processes

A stochastic process is a collection of random variables {Y(x) | x ∈ X} indexed by a set X. The stochastic process is specified by giving the probability distribution of every finite subset of variables Y(x_1), ..., Y(x_k) in a consistent manner. A Gaussian Process is a stochastic process which can be fully specified by its mean function µ(x) = E[Y(x)] and its covariance function C(x, x′) = E[(Y(x) − µ(x))(Y(x′) − µ(x′))]; every finite subset of variables has a joint multivariate Gaussian distribution.
In this section we consider Gaussian Processes with µ(x) ≡ 0. This is the case for many neural network priors [Nea96]. Otherwise we assume that any known offset or trend in the data has been removed. A non-zero µ(x) can be incorporated into the framework, but leads to extra notational complexity.
Given a prior covariance function C_P(x, x′), a noise process C_N(x, x′) (with C_N(x, x′) = 0 for x ≠ x′) and a data set D = ((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)), the prediction for the distribution of Y corresponding to a test point x is obtained simply by marginalizing the (n + 1)-dimensional joint distribution to obtain the mean and variance

ŷ(x) = y^T (K_P + K_N)^{-1} k_P(x)    (3.106)

σ²_y(x) = C_P(x, x) + C_N(x, x) − k_P(x)^T (K_P + K_N)^{-1} k_P(x)    (3.107)

where [K_P]_{ij} = C_P(x_i, x_j), [K_N]_{ij} = C_N(x_i, x_j), k_P(x) = (C_P(x, x_1), ..., C_P(x, x_n))^T and y = (y_1, ..., y_n)^T. σ²_y(x) gives a measure of the error of the prediction.

If we assume that the variance of the noise process σ² does not depend on the sample x, we have K_N = σ²I. Substituting into the previous equations we have:

ŷ(x) = y^T (K_P + σ²I)^{-1} k_P(x)    (3.108)

σ²_y(x) = C_P(x, x) + C_N(x, x) − k_P(x)^T (K_P + σ²I)^{-1} k_P(x)    (3.109)

The prediction value in (3.108) is the same that can be obtained with Kernel Ridge Regression (see equation 3.84), where the optimization is performed on the L2 norm. The big difference between Gaussian Processes (GP) and SVM for Regression is that GP, unlike SVM, permit computing the variance of the prediction value σ²_y(x), providing an estimate of the prediction reliability. This peculiarity makes GP very appealing for applications


that require not only a prediction, but also a measure of the reliability of the prediction itself. Examples of such applications are portfolio management and geostatistics.
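As a concrete illustration of equations (3.108) and (3.109), the following NumPy sketch implements the GP prediction with K_N = σ²I. The Gaussian prior covariance, the function names and all parameter values are illustrative assumptions, not part of the original text.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian (RBF) prior covariance C_P between two sets of points."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def gp_predict(X, y, X_test, sigma_noise=0.1, sigma_kernel=1.0):
    """GP regression prediction following (3.108)-(3.109) with K_N = sigma^2 I."""
    K_P = rbf(X, X, sigma_kernel)                    # prior covariance on training inputs
    A = K_P + sigma_noise ** 2 * np.eye(len(X))      # K_P + sigma^2 I
    A_inv_y = np.linalg.solve(A, y)                  # (K_P + sigma^2 I)^{-1} y
    k_P = rbf(X, X_test, sigma_kernel)               # columns: k_P(x) for every test point
    mean = k_P.T @ A_inv_y                           # y^T (K_P + sigma^2 I)^{-1} k_P(x)
    var = (1.0 + sigma_noise ** 2                    # C_P(x,x) + C_N(x,x) for the RBF kernel
           - np.sum(k_P * np.linalg.solve(A, k_P), axis=0))
    return mean, var

# toy usage: noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
mean, var = gp_predict(X, y, X_test)
print(mean, var)
```

Note that the predictive variance is returned alongside the mean, which is precisely the feature that distinguishes GP regression from SVM regression in the discussion above.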

3.6 Kernel Fisher Discriminant

We conclude the chapter with the description of the Kernel Fisher Discriminant, namely the generalization to the Feature Space of one of the most popular algorithms of Pattern Recognition, i.e. the Fisher discriminant [Fis36]. A classical issue of Pattern Recognition is the curse of dimensionality: the design of a good classifier becomes more difficult as the dimension of the input space increases. A way to cope with the curse of dimensionality is to preprocess the input data in order to reduce their dimensionality before applying a classification algorithm. The Fisher discriminant aims to achieve an optimal linear dimensionality reduction; therefore it is not strictly a classifier, but it is used in the preprocessing stage of the classifier construction. We now describe the algorithm.

3.6.1 Linear Fisher Discriminant

Let X_1 = (x^1_1, x^1_2, ..., x^1_{l_1}) and X_2 = (x^2_1, x^2_2, ..., x^2_{l_2}) be samples from two different classes and X = X_1 ∪ X_2 = (x_1, ..., x_l) their union. We define the means m_1 and m_2 of the two classes:

m_1 = (1/l_1) Σ_{j=1}^{l_1} x^1_j ,    m_2 = (1/l_2) Σ_{j=1}^{l_2} x^2_j

Fisher's linear discriminant is given by the vector w which maximizes

J(w) = (w^T S_B w) / (w^T S_W w)    (3.110)

where

S_B = (m_1 − m_2)(m_1 − m_2)^T    (3.111)

S_W = Σ_{x∈X_1} (x − m_1)(x − m_1)^T + Σ_{x∈X_2} (x − m_2)(x − m_2)^T    (3.112)

S_B and S_W are called the between-class and within-class scatter matrices respectively. The intuition behind maximizing J(w) is to find a direction which maximizes the separation of the projected class means (the numerator) while minimizing the class variance along this direction (the denominator). If we set

∂J/∂w = 0

we have:

(w^T S_B w) S_W w = (w^T S_W w) S_B w    (3.113)

From (3.111) we see that S_B w is always in the direction of (m_2 − m_1). We do not care about the magnitude of w, only its direction. Thus we can drop any scalar factors in (3.113), obtaining:

S_W w ∝ (m_2 − m_1)    (3.114)

Multiplying both sides of (3.114) by S_W^{-1} we then obtain

w ∝ S_W^{-1}(m_2 − m_1)    (3.115)

This is known as Fisher's linear discriminant, although strictly it is not a discriminant but rather a specific choice of direction for the projection of the data onto one dimension. The projected data y(x) = w · x can subsequently be used to construct a discriminant, by choosing a threshold y_0 so that we classify a new point as belonging to X_1 if y(x) ≥ y_0 and as belonging to X_2 otherwise. It can be proved that the vector w maximizing (3.110) has the same direction as the discriminant in the corresponding Bayes optimal classifier. Fisher's linear discriminant has proven very powerful. One reason is certainly that a linear model is rather robust against noise and most likely will not overfit. Crucial is the estimation of the scatter matrices, which might be highly biased: using simple plug-in estimates as in (3.111) when the number of samples is small compared to the dimensionality results in high variability.
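The direction (3.115) can be computed directly from the two samples. The following is a minimal NumPy sketch; the toy data at the end are an illustrative assumption.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w ~ S_W^{-1}(m2 - m1), cf. (3.115).
    X1, X2: arrays of shape (l1, d) and (l2, d) holding the two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter matrix S_W as in (3.112)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

# toy usage: two Gaussian clouds in R^2
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
X2 = rng.normal([2.0, 1.0], 0.5, size=(50, 2))
w = fisher_direction(X1, X2)
print("projection direction:", w)
```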

3.6.2 Fisher Discriminant in Feature Space

Clearly, for most real-world data the Fisher discriminant is not complex enough. To strengthen the discriminant we could look for nonlinear directions. As pointed out before, assuming general distributions would cause trouble. Here we restrict ourselves to finding nonlinear directions by first mapping the data nonlinearly into some Feature Space F and computing Fisher's linear discriminant there, thus implicitly yielding a nonlinear discriminant in input space.
Let Φ be a nonlinear mapping from the input space to some Feature Space F. To find the linear discriminant in F we need to maximize

J(w) = (w^T S^Φ_B w) / (w^T S^Φ_W w)    (3.116)


where w ∈ F and S^Φ_B, S^Φ_W are the corresponding matrices in F:

S^Φ_B = (m^Φ_1 − m^Φ_2)(m^Φ_1 − m^Φ_2)^T    (3.117)

S^Φ_W = Σ_{x∈X_1} (Φ(x) − m^Φ_1)(Φ(x) − m^Φ_1)^T + Σ_{x∈X_2} (Φ(x) − m^Φ_2)(Φ(x) − m^Φ_2)^T    (3.118)

with

m^Φ_1 = (1/l_1) Σ_{j=1}^{l_1} Φ(x^1_j) ,    m^Φ_2 = (1/l_2) Σ_{j=1}^{l_2} Φ(x^2_j).

Since the mapping Φ can be unknown, it is impossible to solve the problem directly. In order to overcome this difficulty we use the kernel trick that has been successfully used in the other Kernel Methods. Instead of mapping the data explicitly, we seek a formulation of the algorithm which uses only scalar products (Φ(x) · Φ(y)) of the training patterns. Since we can compute these scalar products by means of Mercer kernels, we can solve the problem without mapping the data explicitly into the Feature Space F.
In order to find Fisher's discriminant in the Feature Space F, we first need a formulation of (3.116) in terms of scalar products of input patterns only, which we then replace by some Mercer kernel K(·, ·). From the theory of reproducing kernels we know that any solution w ∈ F must lie in the span of all training samples in F. Therefore we can find an expansion for w of the form

w = Σ_{i=1}^{l} α_i Φ(x_i)    (3.119)

Using the expansion (3.119) and the definitions of m^Φ_1 and m^Φ_2 we write

w^T m^Φ_i = (1/l_i) Σ_{j=1}^{l} Σ_{k=1}^{l_i} α_j K(x_j, x^i_k) = α^T M_i ,    i = 1, 2    (3.120)

where we defined

(M_i)_j = (1/l_i) Σ_{k=1}^{l_i} K(x_j, x^i_k) ,    i = 1, 2    (3.121)

and replaced the scalar products by means of the Mercer kernel K(·, ·).
Now we consider the numerator of (3.116). Using (3.117) and (3.120) the numerator can be rewritten as

w^T S^Φ_B w = α^T M α    (3.122)

where M = (M_1 − M_2)(M_1 − M_2)^T.


We now consider the denominator. Using (3.119), the definition of m^Φ_i and a similar transformation as in (3.122), we find:

w^T S^Φ_W w = α^T N α    (3.123)

where we set

N = Σ_{j=1}^{2} P_j (I − 1_{l_j}) P_j^T

P_j is an l × l_j matrix with (P_j)_{nm} = K(x_n, x^j_m), I is the identity matrix and 1_{l_j} is the matrix with all entries equal to 1/l_j.
Finally, combining (3.122) and (3.123), we can find Fisher's linear discriminant in the Feature Space F by maximizing

J(α) = (α^T M α) / (α^T N α)    (3.124)

This problem can be solved by finding the leading eigenvector of N^{-1}M. This approach is called Kernel Fisher Discriminant (KFD) [Mik99a]. The projection of a new pattern x onto w is given by

(w · Φ(x)) = Σ_{i=1}^{l} α_i K(x_i, x).    (3.125)

Obviously, the proposed setting is ill-posed: we are estimating l-dimensional covariance structures from l samples. Besides the numerical problems, which can cause the matrix N not to be positive definite, we need a way of controlling capacity in F. In order to do so, we simply add a multiple of the identity matrix to N, i.e. we replace N by N_µ, where

N_µ = N + µI    (3.126)

therefore the problem becomes finding the leading eigenvector of (N_µ)^{-1}M. The use of N_µ brings some advantages: the problem becomes numerically more stable, since for µ large enough N_µ becomes positive definite; N_µ can be seen, in analogy to [Fri89], as decreasing the bias in sample-based estimation of eigenvalues; and a regularization on ‖α‖² is imposed, favouring solutions with small expansion coefficients.
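The whole KFD computation (3.120)-(3.126) can be summarized in a short sketch. The Gaussian kernel, the value of µ and the toy data below are illustrative assumptions.

```python
import numpy as np

def kfd(X1, X2, kernel, mu=1e-3):
    """Kernel Fisher Discriminant: expansion coefficients alpha of the leading
    eigenvector of (N_mu)^{-1} M, cf. (3.120)-(3.126)."""
    X = np.vstack([X1, X2])
    l, l1, l2 = len(X), len(X1), len(X2)
    K = np.array([[kernel(a, b) for b in X] for a in X])    # l x l Gram matrix
    P1, P2 = K[:, :l1], K[:, l1:]                           # (P_j)_nm = K(x_n, x^j_m)
    M1, M2 = P1.mean(axis=1), P2.mean(axis=1)               # (M_i)_j as in (3.121)
    M = np.outer(M1 - M2, M1 - M2)
    N = sum(P @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ P.T
            for P, lj in [(P1, l1), (P2, l2)])
    N_mu = N + mu * np.eye(l)                               # regularization (3.126)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(N_mu, M))
    return np.real(eigvecs[:, np.argmax(np.real(eigvals))])

def project(alpha, X_train, x_new, kernel):
    """Projection of a new pattern onto w, cf. (3.125)."""
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X_train))

# toy usage with a Gaussian kernel
gauss = lambda a, b, s=1.0: np.exp(-np.sum((a - b) ** 2) / s ** 2)
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 0.3, size=(20, 2))
X2 = rng.normal(1.5, 0.3, size=(20, 2))
alpha = kfd(X1, X2, gauss)
print(project(alpha, np.vstack([X1, X2]), np.array([0.0, 0.0]), gauss))
```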

3.7 Conclusions

In this chapter we have provided an overview of supervised Kernel Methods. First of all, we recalled the basic tools of optimization theory used in Kernel Methods, namely Lagrange multipliers and the Kuhn-Tucker theorem. Then Support Vector Machines for Classification and Regression were presented. Gaussian Processes were described, underlining their connection with Kernel Ridge Regression; besides, it was shown that a Gaussian Process can be viewed as a Multi-Layer Perceptron with an infinite number of neurons. Finally, the Kernel Fisher Discriminant was reviewed.


Chapter 4

Data Dimensionality Estimation Methods

4.1 Introduction

Machine Learning problems deal with data represented as vectors of dimension d. The data is then embedded in R^d, but this does not necessarily imply that its actual dimension is d. The dimension of a data set is the minimum number of free variables needed to represent the data without information loss. In more general terms, following Fukunaga [Fuk82], a data set Ω ⊂ R^d is said to have Intrinsic Dimensionality (ID) equal to M if its elements lie entirely within an M-dimensional subspace of R^d (where M < d).
ID estimation is important for many reasons, since the use of more dimensions than strictly necessary leads to several problems. The first one is the space needed to store the data: as the amount of available information increases, compression for storage purposes becomes ever more important. The speed of algorithms using the data depends on the dimension of the vectors, so a reduction of the dimension can result in reduced computation time. Moreover, it can be hard to build reliable classifiers when the dimension of the input data is high (curse of dimensionality [Bel61]); therefore the use of vectors of smaller dimension often leads to improved classification performance. Besides, when using an autoassociative neural network [Kir01] to perform a nonlinear feature extraction (e.g. nonlinear principal component analysis), the ID can suggest a reasonable value for the number of hidden neurons. Finally, ID estimation methods are used to fix the model order in a time series, which is crucial in order to make reliable time series predictions.
Following the classification proposed in [Jai88], the approaches for estimating ID can be


grouped in two big families. The approaches of the former family, called local, estimate the ID using the information contained in sample neighbourhoods, avoiding the projection of the data onto a lower-dimensional space. The approaches of the latter family, called global, unfold the data set in the d-dimensional space: unlike local approaches, which only use the information contained in the neighbourhood of each data sample, global approaches make use of the whole data set at the same time.
The aim of this chapter is to present the state of the art of non-kernel-based methods for ID estimation. Kernel methods for ID estimation will be investigated in the fifth chapter of the dissertation.
The chapter is organized as follows: in Section 4.2 local approaches are reviewed; Section 4.3 presents global approaches to estimate the ID; Section 4.4 is devoted to a specific family of global approaches, i.e. fractal-based techniques; in Section 4.5 some applications are described; in Section 4.6 a few conclusions are drawn.

4.2 Local methods

Local (or topological) methods try to estimate the topological dimension of the data manifold. The definition of topological dimension was given by Brouwer [Hey75] in 1913. The topological dimension is the basis dimension of the local linear approximation of the hypersurface on which the data reside, i.e. the tangent space. For example, if the data set lies on an m-dimensional submanifold, then it has an m-dimensional tangent space at every point in the set. Therefore a sphere has a two-dimensional tangent space at every point and may be viewed as a two-dimensional manifold. Since the ID of the sphere is three, the topological dimension represents a lower bound of the ID. If the data do not lie on a manifold, the definition of topological dimension does not directly apply. Sometimes the topological dimension is also referred to simply as the local dimension; this is the reason why the methods that estimate the topological dimension are called local. The basic algorithm to estimate the topological dimension was proposed by Fukunaga and Olsen [Fuk71]. Alternative approaches to the Fukunaga-Olsen algorithm have been proposed to estimate the ID locally; among them, the Near Neighbor Algorithm [Pet79] and the methods based on Topology Representing Networks (TRN) [Mar94] are the most popular.

4.2.1 Fukunaga-Olsen’s algorithm

Fukunaga-Olsen's algorithm is based on the observation that, for vectors embedded in a linear subspace, the dimension is equal to the number of non-zero eigenvalues of the covariance matrix. Besides, Fukunaga and Olsen assume that the intrinsic dimensionality of a data set can be computed by dividing the data set into small regions (Voronoi tessellation of the data space). The Voronoi tessellation can be performed by means of a clustering algorithm, e.g. K-MEANS (see Section 6.4 in the sixth chapter). In each region (Voronoi set) the surface on which the vectors lie is approximately linear and the eigenvalues of the local covariance matrix are computed. The eigenvalues are normalized by dividing them by the largest eigenvalue. The intrinsic dimensionality is defined as the number of normalized eigenvalues that are larger than a threshold T. Although Fukunaga and Olsen proposed for T, on the basis of heuristic motivations, values such as 0.05 and 0.01, it is not possible to fix a threshold value T that is good for every problem.
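The following sketch implements the algorithm in this spirit, using a simple nearest-centre partition in place of K-MEANS; the number of regions, the threshold and the toy data are illustrative assumptions.

```python
import numpy as np

def fukunaga_olsen_id(data, n_regions=10, threshold=0.05, seed=0):
    """Local ID estimate: partition the data into regions, compute local
    covariance eigenvalues and count those above a threshold."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), n_regions, replace=False)]
    labels = np.argmin(((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    local_ids = []
    for r in range(n_regions):
        region = data[labels == r]
        if len(region) <= 2:
            continue
        eigvals = np.linalg.eigvalsh(np.cov(region.T))[::-1]
        normalized = eigvals / eigvals[0]
        local_ids.append(int(np.sum(normalized > threshold)))
    return int(np.round(np.mean(local_ids)))

# toy usage: a noisy circle embedded in R^3 (intrinsic dimension 1)
t = np.linspace(0, 2 * np.pi, 500)
data = np.c_[np.cos(t), np.sin(t), 0.01 * np.random.default_rng(1).standard_normal(500)]
print(fukunaga_olsen_id(data))
```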

4.2.2 The Near Neighbor Algorithm

The first attempt to use near neighbor techniques to estimate the ID is due to Trunk [Tru76]. Trunk's method works as follows. An initial value of an integer parameter k is chosen and the k nearest neighbors of each pattern in the given data set are identified. The subspace spanned by the vectors from the i-th pattern to its k nearest neighbors is constructed for all patterns. The angle between the (k + 1)-th near neighbor of pattern i and the subspace constructed for pattern i is then computed for all i. If the average of these angles is below a threshold, the ID is k; otherwise k is incremented by 1 and the process is repeated. The weakness of Trunk's method is that it is not clear how to fix a suitable value for the threshold.
An improvement of Trunk's method (the Near Neighbor Algorithm) was proposed by Pettis et al. [Pet79]. Assuming that the data are locally uniformly distributed, they derive the following expression for the ID:

ID = ⟨r_k⟩ / [(⟨r_{k+1}⟩ − ⟨r_k⟩) k]    (4.1)

where ⟨r_k⟩ is the mean of the distances from each pattern to its k nearest neighbors. The algorithm presents some problems. It is necessary to fix a suitable value for k, and this is done on a heuristic basis. Besides, Pettis et al. derived the ID, using equation (4.1), for the special case of three uniformly distributed one-dimensional vectors and found ID = 0.9; therefore it seems that Pettis' estimator is biased even in this simple case. Moreover, Pettis described an iterative algorithm, based on an arbitrary number of neighbors, for the ID estimation. Verveer and Duin [Ver95] later found that Pettis' iterated algorithm yielded an incorrect value for the ID. Therefore Verveer and Duin proposed an algorithm (the near neighbor estimator) that provides a non-iterative solution for the ID estimation. If ⟨r_k⟩ is observed for k = k_m to k = k_M, a least squares regression line can be fit to ⟨r_k⟩ as a function of (⟨r_{k+1}⟩ − ⟨r_k⟩)k.


Verveer and Duin obtained the following estimate of the ID:

ID = [ Σ_{k=k_m}^{k_M−1} (⟨r_{k+1}⟩ − ⟨r_k⟩) ⟨r_k⟩/k ] [ Σ_{k=k_m}^{k_M−1} (⟨r_{k+1}⟩ − ⟨r_k⟩)² ]^{-1}    (4.2)

Since the estimate yielded by the Verveer-Duin algorithm is generally not an integer, it has to be rounded to the nearest integer. Since the vectors are usually only locally uniformly distributed, Verveer and Duin advise that the values k = k_m and k = k_M should be as small as possible. When the data is very noisy, Verveer and Duin suggest ignoring the first nearest neighbor, i.e. taking k_m > 1. As a general comment, it is necessary to remark that both Pettis' and Verveer-Duin's algorithms are sensitive to outliers: the presence of outliers tends to affect the ID estimate significantly. Another problem is the influence of the edge effect. Data close to the cluster edge are not uniformly distributed; therefore if the percentage of such data in the whole data set is not negligible, the ID estimate is distorted. This happens when the dimensionality of the data set is high and the data density is low.
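The basic estimator (4.1) can be evaluated directly for a single value of k, as in the following sketch; the data and the values of k are illustrative assumptions.

```python
import numpy as np

def mean_knn_distance(data, k):
    """<r_k>: mean distance from each pattern to its k-th nearest neighbor."""
    d = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(-1))
    d_sorted = np.sort(d, axis=1)          # column 0 is the distance to the point itself
    return d_sorted[:, k].mean()

def near_neighbor_id(data, k):
    """ID estimate of equation (4.1) for a single value of k."""
    rk = mean_knn_distance(data, k)
    rk1 = mean_knn_distance(data, k + 1)
    return rk / ((rk1 - rk) * k)

# toy usage: points on a 2-dimensional plane embedded in R^5
rng = np.random.default_rng(0)
plane = rng.uniform(size=(400, 2))
data = np.c_[plane, np.zeros((400, 3))]
print([round(near_neighbor_id(data, k), 2) for k in (5, 10, 15)])
```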

4.2.3 TRN-based methods

Topology Representing Network (TRN) (see Section 6.6 in the sixth chapter) is an unsupervised neural network proposed by Martinetz and Schulten [Mar94]. They proved that TRNs are optimal topology preserving maps, i.e. a TRN preserves in the map the topology originally present in the data. Bruske and Sommer [Bru98] proposed to improve Fukunaga-Olsen's algorithm by using a TRN to perform the Voronoi tessellation of the data space. In detail, the algorithm proposed by Bruske and Sommer is the following. An optimal topology preserving map G is computed by means of a TRN. Then, for each neuron i ∈ G, a PCA is performed on the set Q_i consisting of the differences between the neuron i and all of its m_i closest neurons in G. Bruske-Sommer's algorithm shares the same limitations with Fukunaga-Olsen's: since none of the eigenvalues of the covariance matrix will be null due to noise, it is necessary to use heuristic thresholds in order to decide whether an eigenvalue is significant or not.
Finally, Frisone et al. [Fri95] use Topology Representing Networks to get an ID estimate directly. If the data manifold Ω is approximated by a TRN, the number n of cross-correlations learnt by each neuron of the TRN is an indicator of the local dimension of the data set Ω. Frisone et al. conjectured that the number n is close to the number k of spheres that touch a given sphere in the Sphere Packing Problem (SPP) [Con88]. For space dimensions from 1 to 8, k is: 2, 6, 12, 24, 40, 72, 126, 240. Besides, the SPP has been solved only for a 24-dimensional space, where k is 196560. Hence it suffices to measure k to infer the ID. Frisone's approach presents some drawbacks: the conjecture has not been proved yet, the number k is known exactly only for a few dimension values and it tends to grow exponentially with the space dimension. This last peculiarity strongly limits the use of the conjecture in practical applications, where data can have high dimensionality.

4.3 Global Methods

Global methods try to estimate the ID of a data set by unfolding the whole data set in the d-dimensional space. Unlike local methods, which only use the information contained in the neighborhood of each data sample, global methods make use of the whole data set at the same time. Global methods can be grouped into three big families: Projection techniques, Multidimensional Scaling methods and Fractal-Based methods.

4.3.1 Projection techniques

Projection techniques search for the best subspace onto which to project the data by minimizing the projection error. These methods can be divided into two families: linear and nonlinear.

Figure 4.1: Ω data set. The data set is formed by points lying on the upper semicircumference of equation x² + y² = 1. The ID of Ω is 1. Nevertheless PCA yields two non-null eigenvalues. The principal components are indicated by u and v.


Principal Component Analysis (PCA) [Jol86] [Kir01] is a widely used linear method. PCA projects the data along the directions of maximal variance. The method consists of computing the eigenvalues and eigenvectors of the covariance matrix of the data. Each of the eigenvectors is called a principal component. The ID is given by the number of non-null eigenvalues. The method presents some problems: PCA is an inadequate estimator, since it tends to overestimate the ID [Bis95]. As shown in Figure 4.1, a data set formed by points lying on a circumference has dimension 2 for PCA rather than 1.
In order to cope with these problems, nonlinear algorithms have been proposed to obtain a nonlinear PCA. There are two different possible approaches: an autoassociative approach (Nonlinear PCA) [Kar94] [Kir01] and the one based on the use of Mercer kernels (Kernel PCA) [Sch98a]. Nonlinear PCA is performed by means of a five-layer neural network with a typical bottleneck structure, shown in Figure 4.2.

Figure 4.2: A neural net for Nonlinear PCA.


The first (input) and the last (output) layer have the same number of neurons, while the remaining hidden layers have fewer neurons than the first and the last ones. The second, third and fourth layers are called respectively the mapping, bottleneck and demapping layer. The mapping and demapping layers usually have the same number of neurons. The number of neurons of the bottleneck layer provides an ID estimate. The targets used to train Nonlinear PCA are simply the input vectors themselves. The network is trained with the backpropagation algorithm, minimizing the square error; as optimization algorithm, the conjugate-gradient algorithm [Pre89] is generally used.
Though Nonlinear PCA performs better than linear PCA in some contexts [Fot97], it presents drawbacks when estimating the ID. As underlined by Malthouse [Mal98], the projections onto curves and surfaces are suboptimal; besides, NLPCA cannot model curves or surfaces that intersect themselves. ID estimation with Kernel PCA will be investigated in the fifth chapter.
Among projection techniques it is worth mentioning the Whitney reduction network recently proposed by Broomhead and Kirby [Bro00] [Kir01]. This method is based on Whitney's concept of a good projection [Whi36], namely a projection obtained by means of an injective mapping. An injective mapping between two sets U and V is a mapping that associates a unique element of V to each element of U. As pointed out in [Kir01], finding projections by means of injective mappings can be difficult and can sometimes involve empirical considerations.

4.3.2 Multidimensional Scaling Methods

Multidimensional Scaling (MDS) [Rom72a] [Rom72b] methods are projection techniques that tend to preserve, as much as possible, the distances among data. Therefore data that are close in the original data set should be projected in such a way that their projections, in the new space (output space), are still close. Among multidimensional scaling algorithms, the best known example is MDSCAL, by Kruskal [Kru64] and Shepard [She62]. The criterion for the goodness of the projection used by MDSCAL is the stress, which depends only on the distances between data. When the rank order of the distances in the output space is the same as the rank order of the distances in the original data space, the stress is zero. Kruskal's stress S_K is:

S_K = { Σ_{i<j} [rank(d(x_i, x_j)) − rank(D(x_i, x_j))]² / Σ_{i<j} rank(d(x_i, x_j))² }^{1/2}    (4.3)

where d(x_i, x_j) is the distance between the data x_i and x_j and D(x_i, x_j) is the distance between the projections of the same data in the output space.


When the stress is zero a perfect projection exists. The stress is minimized by iteratively moving the data in the output space from their initial, randomly chosen positions according to a gradient-descent algorithm. The intrinsic dimensionality is determined in the following way: the minimum stress for projections of different dimensionalities is computed, and a plot of the minimum stress versus the dimensionality of the output space is drawn. The ID is the dimensionality value at which there is a knee or a flattening of the curve. Kruskal and Shepard's algorithm presents a main drawback: the knee or the flattening of the curve might not exist. MDS approaches close to Kruskal and Shepard's are Bennett's algorithm [Ben68] and Sammon's mapping [Sam69].

4.3.2.1 Bennett’s algorithm

Bennett's algorithm is based on the assumption that the data are uniformly distributed inside a sphere of radius r in an L-dimensional space. Let X_1 and X_2 be random variables representing points in the sphere and let R_L be the normalized Euclidean distance (the interpoint distance) between them. If

R_L = |X_1 − X_2| / (2r)    (4.4)

then the variance of RL is a decreasing function of L, which may be expressed as:

L var(RL) ≈ constant (4.5)

where var(R_L) is the variance of R_L. Therefore increasing the variance of the interpoint distances has the effect of decreasing the dimensionality of the representation, i.e. it "flattens" the data set. Bennett's algorithm involves two stages. The first stage moves the patterns, in the original input space, in order to increase the variance of the interpoint distances. The second stage adjusts the positions of the patterns so that the rank orders of the interpoint distances in local regions are preserved. These steps are repeated until the variance of the interpoint distances levels off. Finally, the covariance matrix of the whole data set yielded by the previous stages is computed. The ID is determined by the number of significant eigenvalues of the covariance matrix. Bennett's algorithm presents some drawbacks. First of all, as in Fukunaga-Olsen's algorithm, in order to decide whether an eigenvalue is significant it is necessary to heuristically fix a threshold value. Besides, as underlined previously in the PCA description, this method tends to overestimate the dimensionality of a data set. Chen and Andrews [Che74] proposed to improve Bennett's algorithm by introducing a cost function that makes Bennett's rank-order criterion more sensitive to local data regions. The basic idea is still to maintain the rank order of local distances in the two spaces.


4.3.2.2 Sammon’s mapping

Sammon proposed to minimize a stress measure similar to Kruskal's. The stress S_S proposed by Sammon has the following expression:

S_S = [ Σ_{i<j} (d(x_i, x_j) − D(x_i, x_j))² / d(x_i, x_j) ] [ Σ_{i<j} d(x_i, x_j) ]^{-1}    (4.6)

where d(x_i, x_j) is the distance between patterns x_i and x_j in the original data space and D(x_i, x_j) is the distance in the two- or three-dimensional output space. The stress is minimized by a gradient-descent algorithm. Kruskal [Kru71] demonstrated how a data projection very similar to Sammon's mapping could be generated from MDSCAL.
An improvement of Kruskal's and Sammon's methods has been proposed by Chang and Lee [Cha73]. Unlike Sammon and Kruskal, who move all points simultaneously in the output space to minimize the stress, Chang and Lee suggested minimizing the stress by moving the points two at a time. In this way the method tries to preserve local structure while minimizing the stress. The method requires heavy computational resources even when the cardinality of the data set is moderate; besides, the results are influenced by the order in which the points are coupled.
Several other approaches to ID estimation have been proposed. It is worth mentioning Shepard and Carroll's index of continuity [She64], Kruskal's indices of condensation [Kru72] and Kruskal and Carroll's parametric mapping [Kru69]. Surveys of the classical Multidimensional Scaling methods can be found in [Rom72a] [Rom72b] [She74]. Recently, local versions of MDS methods, i.e. the ISOMAP algorithm [Ten00] and Local Linear Embedding [Row00], have been proposed; we do not describe these methods for the sake of brevity.
Finally, it is worth mentioning the Curvilinear Component Analysis (CCA) proposed by Demartines and Herault [Dem97]. The principle of CCA is a self-organizing neural network performing two tasks: vector quantization of the data set, whose dimensionality is n, and a nonlinear projection of the quantizing vectors onto a space of dimensionality p (p < n). The first task is performed by means of a SOM [Koh97]; the second task is performed by means of a technique very similar to the MDS methods previously described. Since an MDS that preserves all distances is not possible, a cost function E measures the goodness of the projection. The cost function E is the following:

E = (1/2) Σ_i Σ_{j≠i} (d(x_i, x_j) − D(x_i, x_j))² F(D(x_i, x_j), λ)    (4.7)

where d(x_i, x_j) are the Euclidean distances between the points x_i and x_j in the data space and D(x_i, x_j) are the Euclidean distances between the projections of the points in the output space;


λ is a set of parameters to be tuned and F(·) is a function (e.g. a decreasing exponential or a sigmoid) to be chosen appropriately. CCA seems to have performances very close to those of Shepard's MDS based on the index of continuity [Dem97].
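Sammon's stress (4.6) is easy to evaluate for any given projection, as in the following sketch; the toy data and the comparison at the end are illustrative assumptions.

```python
import numpy as np

def sammon_stress(X, Y):
    """Sammon's stress (4.6) between the original patterns X and their
    low-dimensional projections Y (same number of rows)."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # original distances
    D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))   # output-space distances
    iu = np.triu_indices(len(X), k=1)                             # pairs i < j
    d_ij, D_ij = d[iu], D[iu]
    return np.sum((d_ij - D_ij) ** 2 / d_ij) / np.sum(d_ij)

# toy usage: compare a good and a random 2-D projection of 3-D points
rng = np.random.default_rng(0)
X = np.c_[rng.uniform(size=(100, 2)), 0.01 * rng.standard_normal(100)]
print(sammon_stress(X, X[:, :2]))                      # nearly isometric: small stress
print(sammon_stress(X, rng.uniform(size=(100, 2))))    # random projection: large stress
```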

4.4 Fractal-Based Methods

Fractal-based techniques are global methods that have been successfully applied to estimate the attractor dimension of the underlying dynamical system generating a time series [Kap95]. Unlike other global methods, they can provide a non-integer value as ID estimate. Since fractals are generally1 characterized by a non-integer dimensionality (for instance, the dimensions of Cantor's set and of Koch's curve [Man77] are respectively ln 2/ln 3 and ln 4/ln 3), these methods are called fractal. In nonlinear dynamics many definitions of fractal dimension [Eck85] have been proposed; the Box-Counting and the Correlation dimension are the most popular.

4.4.1 Box-Counting Dimension

The first definition of dimension (the Hausdorff dimension) [Eck85] [Ott93] is due to Hausdorff [Hau18]. The Hausdorff dimension D_H of a set Ω is defined by introducing the quantity

Γ^d_H(r) = inf_{s_i} Σ_i (r_i)^d    (4.8)

where the set Ω is covered by cells s_i with variable diameter r_i, and all diameters satisfy r_i < r. That is, we look for the collection of covering sets s_i with diameters less than or equal to r which minimizes the sum in (4.8), and we denote the minimized sum by Γ^d_H(r). The d-dimensional Hausdorff measure is then defined as

Γ^d_H = lim_{r→0} Γ^d_H(r)    (4.9)

The d-dimensional Hausdorff measure generalizes the usual notions of total length, area and volume of simple sets. Hausdorff proved that Γ^d_H, for every set Ω, is +∞ if d is less than some critical value D_H and is 0 if d is greater than D_H. The critical value D_H is called the Hausdorff dimension of the set.

1 Fractals do not always have a non-integer dimensionality. For instance, the dimension of Peano's curve is 2.


Since the Hausdorff dimension is not easy to evaluate, in practical applications it is replaced by an upper bound that differs from it only in some specially constructed examples: the Box-Counting dimension (or Kolmogorov capacity) [Ott93]. The Box-Counting dimension D_B of a set Ω is defined as follows: if ν(r) is the number of boxes of size r needed to cover Ω, then D_B is

D_B = lim_{r→0} ln(ν(r)) / ln(1/r)    (4.10)

It can be shown that if, in the definition of the Hausdorff dimension, the cells have the same diameter r, the Hausdorff dimension reduces to the Box-Counting dimension. Although efficient algorithms [Gra90] [Keg03] [Tol03] have been proposed, the Box-Counting dimension can be computed only for low-dimensional sets, because the algorithmic complexity grows exponentially with the set dimension.
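A naive finite-sample estimate of (4.10) counts the occupied boxes over a range of box sizes and fits the slope of ln(ν(r)) versus ln(1/r); the box sizes and toy data below are illustrative, and, as noted above, the approach is only practical in low dimension.

```python
import numpy as np

def box_counting_dimension(data, sizes):
    """Estimate the Box-Counting dimension (4.10): count occupied boxes nu(r)
    for several box sizes r and fit the slope of ln(nu(r)) versus ln(1/r)."""
    counts = []
    for r in sizes:
        # map every point to the index of the box of side r that contains it
        boxes = np.unique(np.floor(data / r), axis=0)
        counts.append(len(boxes))
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(counts), 1)
    return slope

# toy usage: points on a segment embedded in R^2 (dimension 1)
t = np.random.default_rng(0).uniform(size=5000)
data = np.c_[t, t]
print(box_counting_dimension(data, sizes=[0.2, 0.1, 0.05, 0.02, 0.01]))
```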

4.4.2 Correlation Dimension

A good substitute for the Box-Counting dimension is the Correlation dimension [Gra83]. Due to its computational simplicity, the Correlation dimension is successfully used to estimate the dimension of attractors of dynamical systems. The Correlation dimension is defined as follows. Let Ω = {x_1, x_2, ..., x_N} be a set of points in R^n of cardinality N. If the correlation integral C_m(r) is defined as

C_m(r) = lim_{N→∞} [2 / (N(N − 1))] Σ_{i=1}^{N} Σ_{j=i+1}^{N} I(‖x_j − x_i‖ ≤ r)    (4.11)

where I is an indicator function2, then the Correlation dimension D of Ω is:

D = lim_{r→0} ln(C_m(r)) / ln(r)    (4.12)

The Correlation and Box-Counting dimensions are strictly related: it can be shown that both dimensions are special cases of the generalized Renyi dimension. The generalized correlation integral C_p is defined as

C_p(r) = [1 / (N(N − 1)^{p−1})] Σ_{i=1}^{N} ( Σ_{j≠i}^{N} I(‖x_j − x_i‖ ≤ r) )^{p−1}    (4.13)

2I(λ) is 1 iff condition λ holds, 0 otherwise.


The generalized Renyi dimension Dp is defined in the following way:

D_p = lim_{r→0} [1/(p − 1)] ln(C_p(r)) / ln(r)    (4.14)

It can be shown [Gra83] that for p = 0 and p = 2, D_p reduces respectively to the Box-Counting and the Correlation dimension. Besides, it can be proved that the Correlation dimension is a lower bound of the Box-Counting dimension. Nevertheless, due to noise, the difference between the two dimensions is negligible in applications with real data.

4.4.3 Methods of Estimation of Fractal Dimension

The most popular method to estimate the Box-Counting and Correlation dimensions is the log-log plot. This method consists in plotting ln(C_m(r)) versus ln(r): the Correlation dimension is the slope of the linear part of the curve (Figure 4.3).

Figure 4.3: Plot of ln(C_m(r)) vs ln(r).


The method to estimate the Box-Counting dimension is analogous, but ln(ν(r)) replaces ln(C_m(r)). The methods to estimate the Correlation and Box-Counting dimensions present some drawbacks. Although the Correlation and Box-Counting dimensions are asymptotic results that hold only for r → 0, r cannot be too small, since with too few observations one cannot get reliable dimension estimates; in fact, noise has most influence at small distances. Therefore there is a trade-off between taking r small enough to avoid nonlinear effects and taking r sufficiently large to reduce the statistical errors due to lack of data. The use of the least-squares method makes the dimension estimate not adequately robust towards outliers. Moreover, the log-log plot method does not allow computing the error of the dimension estimate. Some methods [Bro89] [Smi92] [Tak84] have been studied to obtain an optimal estimate of the correlation dimension. Takens [Tak84] has proposed a method, based on Fisher's method of Maximum Likelihood [Dud73] [Fuk90], that allows estimating the correlation dimension with a standard error.
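The log-log estimate of the Correlation dimension from (4.11)-(4.12) can be sketched as follows; the range of radii and the toy data are illustrative assumptions.

```python
import numpy as np

def correlation_integral(data, r):
    """Finite-sample correlation integral C_m(r) of equation (4.11)."""
    d = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(len(data), k=1)           # pairs i < j
    return 2.0 * np.sum(d[iu] <= r) / (len(data) * (len(data) - 1))

def correlation_dimension(data, radii):
    """Slope of ln(C_m(r)) versus ln(r) over the given range of radii (log-log plot)."""
    c = np.array([correlation_integral(data, r) for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(c), 1)
    return slope

# toy usage: points filling a 2-dimensional square embedded in R^3
rng = np.random.default_rng(0)
data = np.c_[rng.uniform(size=(800, 2)), np.zeros(800)]
print(correlation_dimension(data, radii=np.linspace(0.05, 0.2, 8)))
```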

4.4.3.1 Takens’ method

Let Q be the set Q = {q_k | q_k < r}, where q_k is the Euclidean distance between a generic couple of points of Ω and r (the cut-off radius) is a real positive number. Using the Maximum Likelihood principle it can be proved that the expectation value of the Correlation dimension ⟨D_c⟩ is:

⟨D_c⟩ = − [ (1/|Q|) Σ_{k=1}^{|Q|} ln(q_k / r) ]^{-1}    (4.15)

where |Q| stands for the cardinality of Q. Takens' method presents some drawbacks. It requires some heuristics to set the radius [The90]. Besides, the method is optimal only if the correlation integral C_m(r) assumes the form C_m(r) = a r^D [1 + b r² + o(r²)], where a and b are constants; otherwise Takens' estimator can perform poorly [The88].

4.4.4 Limitations of Fractal Methods

In addition to the drawbacks previously exposed, estimation methods based on fractal techniques have a fundamental limitation. It has been proved [Eck92] [Smi88] that, in order to get an accurate estimate of the dimension D, the set cardinality N has to satisfy the following inequality:

D < 2 log10 N (4.16)


Inequality (4.16) shows that the number N of data points needed to accurately estimate the dimension of a D-dimensional set is at least 10^{D/2}. Even for low-dimensional sets this leads to huge values of N. In order to cope with this problem and to improve the reliability of the measure for low values of N, the method of surrogate data [The92] has been proposed. The method of surrogate data is an application of a well-known statistical technique called bootstrap [Efr93]. Given a data set Ω, the method of surrogate data consists of creating a new synthetic data set Ω′, with greater cardinality, that has the same statistical properties as Ω, namely the same mean, variance and Fourier spectrum. Although the cardinality of Ω′ can be chosen arbitrarily, the method of surrogate data cannot be used when the dimensionality of the data set is high: as pointed out previously, a data set whose dimension is 18 requires, on the basis of (4.16), at least 10^9 points, so the method of surrogate data becomes computationally burdensome.
Finally, heuristic methods [Cam01a] [Cam02a] have been proposed in order to estimate how much fractal techniques underestimate the dimension of a data set when the data set cardinality is inadequate. These heuristic methods permit inferring the actual dimensionality of the data set; however, since they are not theoretically well-grounded, they have to be used with caution.

4.5 Applications

As pointed out in the introduction of this chapter, methods for estimating the dimension of a data set are useful in pattern recognition to develop powerful feature extractors. For instance, when using an autoassociative neural network to perform a nonlinear feature extraction (e.g. nonlinear principal component analysis), the ID can suggest a reasonable value for the number of hidden neurons. Besides, the ID has been used as a feature for the characterization of human faces [Kir90] and, in general, some authors [Hua94] [Mei97] have measured the fractal dimension of an image with the aim of establishing whether the dimensionality is a distinctive feature of the image itself.
The estimate of the dimensionality of a data set is crucial in the analysis of signals and time series. For instance, ID estimation is fundamental in the study of chaotic systems (e.g. the Henon map, the Rossler oscillator) [Ott93], in the analysis of ecological time series (e.g. the Canadian lynx population) [Ish93], in biomedical signal analysis [Chi90] [Tir00], in radar clutter identification [Hay95], in speech analysis [Som03] and in the prediction of financial time series [Sch89].
Finally, ID estimation methods are used to fix the model order of a time series, which is fundamental to make reliable time series predictions. In order to understand the importance of the knowledge of the ID, we consider a time series x(t), with t = 1, 2, ..., N. It can be


described by the equation:

x(t) = f(x(t− 1), x(t− 2), . . . , x(t− (d− 2)), x(t− (d− 1))) + εt (4.17)

The term ε_t represents an indeterminable part originating either from unmodelled dynamics of the process or from real noise. The function f(·) is the skeleton of the time series [Kan97] [Ton90]. If f(·) is linear we have an autoregressive model of order d−1 (AR(d−1)), otherwise a nonlinear autoregressive model of order d−1 (NAR(d−1)). The key problem in autoregressive models is to fix the model order d−1. Fractal-based techniques can be used to fix the order of a time series; in particular, these techniques can be used for the model reconstruction (or reconstruction of the dynamics) of the time series. This is performed by the method of delays [Pac80]. The time series in equation (4.17) is represented as a series of points X(t) = [x(t), x(t − 1), ..., x(t − d + 1)] in a d-dimensional space. If d is adequately large, there is a diffeomorphism3 between the manifold M generated by the points X(t) and the attractor U of the dynamical system that generated the time series. The Takens-Mane embedding theorem [Tak81] [Man81] states that, in order to obtain a faithful reconstruction of the system dynamics, it must be:

2S + 1 ≤ d (4.18)

where S is the dimension of the system attractor and d is called the embedding dimension of the system. Therefore it suffices to measure S to infer the embedding dimension d and the order d − 1 of the time series. The estimation of S can be performed by means of the fractal techniques (e.g. Box-Counting and Correlation dimension estimation) previously described in Section 4.4. There are many applications of fractal techniques to fix the model order of natural time series: in the economic field [Sch89], in engineering [Cam99], in the analysis of electroencephalogram data [Dvo91], in meteorology [Lor63] [Hou91], in the analysis of astronomical data [Sca90] and many others. A good review of these applications can be found in [Ish93].

4.6 Conclusions

The estimation of the intrinsic dimension (ID) of a data set is a classical problem of Machine Learning.

3 M is diffeomorphic to U iff there is a differentiable map m : M → U whose inverse m^{-1} exists and is also differentiable.


Recently, the use of fractal-based techniques and neural autoassociators seems to have given new impulse to the research on reliable methods for estimating the dimensionality of a data set. In this chapter we have provided a survey of the non-kernel-based methods for data set dimensionality estimation, focusing on the methods based on fractal techniques and neural autoassociators. Fractal techniques have been quite effective, in comparison with the other methods, in estimating the dimensionality of a data set. Nevertheless, they seem to fail dramatically when the cardinality of the data set is low and, at the same time, its dimensionality is high. Obtaining reliable estimators when the dimensionality is high and the set cardinality is low still remains an open problem. In the next chapter we will investigate the problem of ID estimation with kernels.


Chapter 5

Data Dimensionality Estimation with Kernels

5.1 Introduction

In this chapter the problem of estimating the intrinsic dimension of a data set by means of Kernel Methods is investigated. As pointed out in the fourth chapter, the Intrinsic Dimension (ID) of a data set is the minimum number of free variables needed to represent the data without information loss. Principal Component Analysis (PCA) (see Section 4.3.1 in the fourth chapter) is the most common ID estimator; we briefly recall some basic facts about it. PCA projects the data along the directions of maximal variance. The method consists of computing the eigenvalues and eigenvectors of the covariance matrix of the data; each of the eigenvectors is called a principal component. The eigenvalues are normalized by dividing them by the largest eigenvalue. The ID is the number of normalized eigenvalues (significant eigenvalues) that are larger than a threshold Θ. Fukunaga and Olsen [Fuk71] proposed for Θ, on the basis of heuristic motivations, values such as 0.05, 0.03 and 0.01 (the Fukunaga-Olsen thresholds). As underlined in the fourth chapter, PCA is not a good ID estimator, since it tends to overestimate the ID: for instance, a set formed by points lying on a curve, as shown in Figure 4.1, has dimension 2 for PCA rather than 1.
Recently Scholkopf, Smola and Muller [Sch98a] have proposed a Kernel Method, the Kernel Principal Component Analysis (KPCA), to extract nonlinear principal component features from data. KPCA has been successfully applied to denoising images and extracting features [Gir02] [Rom99] [Ros01a] [Wil02]. As far as we know, there are no results on the use of KPCA as an ID estimator; therefore the investigation of KPCA as an ID estimator represents the original contribution of this dissertation to the ID estimation problem.


In this chapter we investigate the reliability of KPCA as an ID estimator; in particular, we show that the KPCA performance is strongly influenced by the choice of the kernel. The chapter is organized as follows: Section 5.2 presents an overview of Kernel PCA; Section 5.3 describes a case study, whose ID is known, on which KPCA has been tested; Sections 5.4 and 5.5 present the KPCA experiments on real data; finally, some conclusions are drawn in Section 5.6.

5.2 Kernel PCA Overview

Firstly we recall the definition of Mercer kernel (see Section 2.3 in the second chapter).

Definition 18. Let X be a nonempty set. A function K : X × X → R is called a positive definite kernel (or Mercer kernel) if and only if

Σ_{j,k=1}^{n} c_j c_k K(x_j, x_k) ≥ 0

for all n ∈ N, {c_1, ..., c_n} ⊆ R and {x_1, ..., x_n} ⊆ X.

Mercer kernels have the following property.

Theorem 23. Each Mercer kernel K can be represented as:

K(x, y) = 〈Φ(x), Φ(y)〉

where 〈·, ·〉 is the inner product and Φ : X → F , F is called Feature Space.

Examples of Mercer kernels are the Gaussian kernel G(x, y) = exp(−‖x − y‖²/σ²) and the polynomial kernel p(x, y) = (⟨η(x), η(y)⟩ + 1)^n, where σ ∈ R and n ∈ N; for more details see the second chapter. We now introduce the Kernel Principal Component Analysis (KPCA) algorithm. Let Ω = {x_1, x_2, ..., x_N} be a data set of points in R^n. The KPCA algorithm consists of the following steps:

1. The Gram matrix G is created. G is a square matrix of rank N whose generic element is G_{ij} = K(x_i, x_j), where x_i, x_j are samples of Ω and K is a Mercer kernel.

2. The matrix G̃ = (I − 1_N)G(I − 1_N) is computed, where I is the identity matrix of rank N and 1_N is a square matrix of rank N whose elements are all equal to 1/N.

3. The eigenvalues and eigenvectors of the matrix G̃ are computed.


4. The ID of Ω is given by the number of significant normalized eigenvalues of G̃.

The meaning of each step of KPCA is the following. The first step maps the data into a Feature Space F by means of a nonlinear mapping; the second step is performed in order to ensure that the data projections have zero mean; the last step projects the data along the directions of maximal variance in the Feature Space F.
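The four steps above translate directly into code; the following sketch is a minimal NumPy transcription, where the kernel, the threshold Θ and the toy data are illustrative assumptions.

```python
import numpy as np

def kpca_normalized_spectrum(data, kernel):
    """Normalized eigenvalue spectrum of KPCA (steps 1-3 above)."""
    N = len(data)
    G = np.array([[kernel(a, b) for b in data] for a in data])   # Gram matrix
    C = np.eye(N) - np.full((N, N), 1.0 / N)                     # I - 1_N
    G_tilde = C @ G @ C                                          # centering in Feature Space
    eigvals = np.sort(np.linalg.eigvalsh(G_tilde))[::-1]
    return eigvals / eigvals[0]

def kpca_id(data, kernel, theta=0.05):
    """ID estimate: number of normalized eigenvalues above the threshold theta (step 4)."""
    return int(np.sum(kpca_normalized_spectrum(data, kernel) > theta))

# toy usage: linear kernel on points lying on a plane embedded in R^4
linear = lambda a, b: float(np.dot(a, b))
rng = np.random.default_rng(0)
data = np.c_[rng.uniform(size=(60, 2)), np.zeros((60, 2))]
print(kpca_id(data, linear))        # expected: 2
```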

5.2.1 Centering in Feature Space

In this subsection we show that the computation of G̃ ensures that the data projections in Feature Space have zero mean, i.e.

Σ_{n=1}^{N} Φ̃(x_n) = 0    (5.1)

In order to show this, we note that for any mapping Φ and for any data set Ω = {x_1, ..., x_N}, the points in the Feature Space

Φ̃(x_i) = Φ(x_i) − (1/N) Σ_{m=1}^{N} Φ(x_m)    (5.2)

will be centered. Hence we go on to define the covariance matrix and the dot product matrix K̃_{ij} = Φ̃(x_i)^T Φ̃(x_j) in the Feature Space F. We arrive at the eigenvalue problem

λα = K̃α    (5.3)

where α is the vector of the expansion coefficients of an eigenvector in the Feature Space F in terms of the points Φ̃(x_i), i.e.

V = Σ_{i=1}^{N} α_i Φ̃(x_i)    (5.4)

Since Φ can be unknown, we cannot compute K̃ directly; however, we can express it in terms of its non-centered counterpart K. We consider G_{ij} = K(x_i, x_j) = Φ(x_i)^T Φ(x_j) and we make use of the notation 1_{ij} = 1 for all i, j. We have:

K̃_{ij} = Φ̃(x_i)^T Φ̃(x_j)
= (Φ(x_i) − (1/N) Σ_{m=1}^{N} Φ(x_m))^T (Φ(x_j) − (1/N) Σ_{m=1}^{N} Φ(x_m))
= Φ(x_i)^T Φ(x_j) − (1/N) Σ_{m=1}^{N} Φ(x_m)^T Φ(x_j) − (1/N) Σ_{n=1}^{N} Φ(x_i)^T Φ(x_n) + (1/N²) Σ_{m,n=1}^{N} Φ(x_m)^T Φ(x_n)
= G_{ij} − (1/N) Σ_{m=1}^{N} 1_{im} G_{mj} − (1/N) Σ_{n=1}^{N} G_{in} 1_{nj} + (1/N²) Σ_{m,n=1}^{N} 1_{im} G_{mn} 1_{nj}    (5.5)

If we define the matrix (1_N)_{ij} = 1/N and denote by I the identity matrix, we have:

K̃ = G − 1_N G − G 1_N + 1_N G 1_N
  = IG − 1_N G + (1_N G − G) 1_N
  = (I − 1_N)G + (1_N G − IG) 1_N
  = (I − 1_N)G I − (I − 1_N)G 1_N
  = (I − 1_N)G(I − 1_N)
  = G̃    (5.6)

An immediate result, since the projections of data are zero mean, is the following:

Figure 5.1: Ω data set. The data set is formed by points lying on the upper semicircumference of equation x² + y² = 1 with y ≥ 0.


Theorem 24. The matrix G̃ is singular.

Proof. The elements of the matrix C = I − 1_N are equal to 1 − 1/N if they are on the diagonal, otherwise they are equal to −1/N. If we sum the rows of C we get the null row; therefore the determinant of C is null, since its rows are linearly dependent. By the Binet theorem [Kor68], the determinant of G̃ = CGC is also null. Hence G̃ is singular and has at least one null eigenvalue.

This result implies that at least the last eigenvector, i.e. the eigenvector associated with the smallest eigenvalue, must be discarded. Besides, it provides a requirement, namely that the smallest eigenvalue of G̃ be null, which the eigenvalue spectra yielded by numerical methods should satisfy. The computation of the eigenvalues and eigenvectors of G̃ requires the diagonalization of the matrix, which can be a heavy computational operation when the rank of G̃ is high. Recently Rosipal and Girolami [Ros01b] have proposed a computationally efficient method, based on the EM algorithm [Dem77], to extract eigenvalues and eigenvectors, which seems to overcome this bottleneck. Finally, if KPCA is performed with the Gaussian kernel (GKPCA), a theoretical result has been established: Twining and Taylor [Twi03] have proved that GKPCA, in the case of an infinite number of data points, approaches PCA for large values of the variance σ.

5.3 A Case Study

In order to test KPCA as an ID estimator, a data set Ω whose ID is known has been considered. The Ω data set, shown in Figure 5.1, is formed by 100 points lying on the upper semicircumference of equation x² + y² = 1 with y ≥ 0. Although the ID of Ω is 1, PCA yields two non-null eigenvalues: if we normalize the eigenvalue spectrum, i.e. divide each eigenvalue by the largest one, the second normalized eigenvalue is 0.167. Hence the ID for PCA is 2 instead of 1. Therefore Ω is a good benchmark to test KPCA as an ID estimator.
Afterwards Kernel PCA has been tried. The goal of the experimentation has been to show that KPCA, using an appropriate Mercer kernel, has only one non-null eigenvalue, i.e. the ID is 1. In order to show this, we consider the so-called Trivial kernel T(x, y), since it is strictly related to the function that has generated Ω:

T(x, y) = ⟨η(x), η(y)⟩

where ⟨·, ·⟩ is the inner product and η : R² → R², η(x_1, x_2) = (x_1², x_2²).

T(x, y) is a Mercer kernel since it is an inner product (see Corollary 1 in the second chapter). Besides, if we compute T(x, y) with x = (x_1, x_2) and y = (y_1, y_2) we have:

T(x, y) = x_1²y_1² + x_2²y_2²

which corresponds to an inner product in a two-dimensional Feature Space F whose axes are (x_1², x_2²). In the space F the data set Ω lies on a line of equation X + Y = 1, where X = x_1² and Y = x_2². As shown in Table 5.1, KPCA with the Trivial kernel has only one significant eigenvalue.
Then we investigated KPCA on the Ω data set using Mercer kernels that are successfully used in Support Vector Machines (SVM) [Cri00] [Vap98], such as the Polynomial and the Square kernel. The use of the Square kernel S(x, y) = (⟨x, y⟩)² corresponds to projecting the data in a three-dimensional Feature Space F. If we compute S(x, y) with x = (x_1, x_2) and y = (y_1, y_2) we have:

S(x, y) = x_1²y_1² + x_2²y_2² + 2x_1x_2y_1y_2

which corresponds to an inner product in a three-dimensional Feature Space F whose axes are (x_1², x_2², √2 x_1x_2).
The polynomial kernel of degree two P(x, y) = (⟨x, y⟩ + 1)² projects the data in a six-dimensional Feature Space F whose axes are (x_1², x_2², √2 x_1x_2, √2 x_1, √2 x_2, 1).
Therefore the use of the square and polynomial kernels implies that the data set Ω is projected in a Feature Space whose dimension is higher than the one in which Ω was originally embedded. Hence it is reasonable to assume that the number of significant eigenvalues of KPCA cannot be lower than that of PCA. As shown in Table 5.1, KPCA with the square kernel has 2 significant eigenvalues; hence the ID estimated with KPCA using the Square kernel is 2. KPCA with the polynomial kernel can have more than 2 significant eigenvalues if, as threshold Θ, one of the two smallest Fukunaga-Olsen values, i.e. 0.03 or 0.01, is adopted. Therefore we can assume that the ID estimated with KPCA using the Polynomial kernel is more than 2.

                   Trivial Kernel   Square Kernel   Polynomial Kernel
1st eigenvalue          1                1                  1
2nd eigenvalue      3.12·10⁻⁷          0.746              0.346
3rd eigenvalue      3.09·10⁻⁷        2.32·10⁻⁷            0.0311
4th eigenvalue      2.90·10⁻⁷        2.24·10⁻⁷           2.91·10⁻³
5th eigenvalue      2.82·10⁻⁷        2.16·10⁻⁷           8.14·10⁻⁸

Table 5.1: Five top eigenvalues of KPCA, computed on the Ω data set with the Trivial, Square and Polynomial kernels.
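The spectra in Table 5.1 can be reproduced, up to the details of the sampling of Ω, with a short numerical sketch. The following Python fragment (NumPy is assumed; the function and variable names are illustrative and not part of the thesis software) builds the centered Gram matrix of a kernel on the semicircle and returns the normalized KPCA eigenvalue spectrum.

import numpy as np

# 100 points on the upper unit semicircle (the Omega data set).
theta = np.linspace(0.0, np.pi, 100)
X = np.column_stack((np.cos(theta), np.sin(theta)))

def kpca_spectrum(data, kernel):
    # Gram matrix of the kernel on the data set.
    n = len(data)
    K = np.array([[kernel(x, y) for y in data] for x in data])
    # Centering of the Gram matrix in the Feature Space.
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J
    eig = np.sort(np.linalg.eigvalsh(Kc))[::-1]
    return eig / eig[0]          # normalize by the largest eigenvalue

trivial = lambda x, y: np.dot(x**2, y**2)   # T(x, y)
square  = lambda x, y: np.dot(x, y)**2      # S(x, y)

print(kpca_spectrum(X, trivial)[:5])   # one significant eigenvalue
print(kpca_spectrum(X, square)[:5])    # two significant eigenvalues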


Finally we have investigated the problem of how different from the Trivial kernel a Mercer kernel can be, in order to still yield a KPCA spectrum with only one significant eigenvalue. To answer this question, we perturb the Trivial kernel.

5.3.1 Trivial Kernel Perturbations

Two different kinds of perturbations of the Trivial kernel have been investigated. The first kind of perturbation has been designed so that the data are still projected in a 2-dimensional Feature Space. The following kernel Kc(x,y) has been considered:

Kc(x,y) = 〈ηc(x), ηc(y)〉

where ⟨·, ·⟩ is the inner product and ηc : R² → R², ηc(x₁, x₂) = ((x₁ − c)², (x₂ − c)²). Kc(x,y) is a Mercer kernel since it is an inner product (see Corollary 1 in the first chapter). Besides, it is immediate to show that Kc(x,y) projects the data in a 2-dimensional Feature Space whose axes are ((x₁ − c)², (x₂ − c)²).
The behaviour of the eigenvalue spectrum of KPCA with Kc(x,y) has been investigated for different values of c. Figure 5.2 shows the ratio between the second and the first eigenvalue as the parameter c changes. Only for small values of the parameter (c < 0.1) can the second eigenvalue be considered negligible and the ID assumed equal to 1.
A second kind of perturbation has been studied. This perturbation has been designed in order to project the data in a Feature Space whose dimension is higher than the original one, that is 2. The following kernel Ks(x,y):

Ks(x,y) = T (x,y) + sH(x,y)

has been considered, where s ∈ R⁺, T(x,y) is the Trivial kernel and H(x,y) is a Mercer kernel (e.g. the Square or the Polynomial kernel) that projects the data in a Feature Space whose dimension is higher than 2. Since Ks(x,y) is a sum of Mercer kernels, Ks is still a Mercer kernel (see Theorem 7 in the second chapter).
If H(x,y) is the Square kernel, it is easy to show that Ks projects the data in a three-dimensional space whose axes are (√(1+s) x₁², √(1+s) x₂², √(2s) x₁x₂). If the parameter s is very small the third axis vanishes and the Feature Space becomes two-dimensional. In a similar way, when H(x,y) is the Polynomial kernel of degree 2, Ks projects the data in a six-dimensional space whose axes are (√(1+s) x₁², √(1+s) x₂², √(2s) x₁x₂, √(2s) x₁, √(2s) x₂, √s). If the parameter s is very small the last four axes vanish and the Feature Space becomes two-dimensional.
The behaviour of the eigenvalue spectrum of KPCA with Ks(x,y) was investigated for different values of s. Figures 5.3 and 5.4 show the ratio between the second and the first eigenvalue when the Trivial kernel is perturbed with the Square and the Polynomial kernel, respectively.


Figure 5.2: Ratio between the 2nd and the 1st eigenvalue of the KPCA spectrum with Kc(x, y), as a function of the parameter c.

Figure 5.3: Ratio between the 2nd and the 1st eigenvalue of the KPCA spectrum with Ks(x,y), as a function of the parameter s. The Square kernel is used as H(x,y).


Figure 5.4: Ratio between the 2nd and the 1st eigenvalue of the KPCA spectrum with Ks(x,y), as a function of the parameter s. The Polynomial kernel of degree 2 is used as H(x,y).

Figure 5.5: Normalized eigenvalue spectra of PCA (green) and KPCA (red) computed on the Indoor-Outdoor Image data set. The top 100 eigenvalues are shown; eigenvalue magnitudes are reported on the M axis.


In both cases, only when the parameter is very small (s ≤ 0.01) can the second eigenvalue of KPCA be considered negligible and the ID assumed equal to 1.
Summing up, KPCA yields one significant eigenvalue only if the kernel differs negligibly from the Trivial kernel. Otherwise KPCA can perform as PCA or even worse than PCA. In order to confirm this assumption, Kernel PCA has been tested on data sets formed by real data.
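The ratios plotted in Figures 5.2-5.4 can be obtained in the same way, by feeding the perturbed kernels to the spectrum routine of the previous sketch. The fragment below is again only a sketch under the same assumptions; it reuses kpca_spectrum and the data set X defined above and prints the ratio between the second and the first eigenvalue for a few values of c and s.

import numpy as np
# kpca_spectrum and X are those of the previous sketch.

def K_c(c):
    # Perturbation of the Trivial kernel in a 2-dimensional Feature Space.
    return lambda x, y: np.dot((x - c)**2, (y - c)**2)

def K_s(s, H):
    # Perturbation of the Trivial kernel with a higher-dimensional kernel H.
    return lambda x, y: np.dot(x**2, y**2) + s * H(x, y)

square = lambda x, y: np.dot(x, y)**2

for c in (0.01, 0.1, 0.5, 1.0):
    lam = kpca_spectrum(X, K_c(c))
    print("c =", c, "  2nd/1st eigenvalue =", lam[1] / lam[0])

for s in (0.001, 0.01, 0.1, 1.0):
    lam = kpca_spectrum(X, K_s(s, square))
    print("s =", s, "  2nd/1st eigenvalue =", lam[1] / lam[0])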

Figure 5.6: Normalized eigenvalue spectra of PCA (green) and KPCA with the Gaussian kernel (red) on the MIT-CBCL Face data set. The top 100 eigenvalues are shown; eigenvalue magnitudes are reported on the M axis. The kernel variance σ is 1.

5.4 Experiments on the Indoor-Outdoor Image Database

Kernel PCA has been tried on two real databases whose ID is unknown, namely the Indoor-Outdoor Image Database and the MIT-CBCL Face Database¹ [Hei00]. The experiments on the MIT-CBCL database are presented in the next section.

¹The MIT-CBCL database can be downloaded from the following address: www.ai.mit.edu/projects/cbcl


The Indoor-Outdoor Image Database is formed by 300 indoor and 300 outdoor color images of various sizes. Each image was used to construct a color histogram in the HSV (Hue-Saturation-Value) color space, formed by 15 × 15 × 15 bins. Kernel PCA has been tested on the Indoor-Outdoor Image database using a specific Mercer kernel, i.e. the color histogram intersection kernel Kint. Roughly speaking, Kint measures the degree of similarity between two color histograms; for details see [Bar02]. Kint has been chosen since SVMs using this kernel perform significantly better than SVMs with general purpose kernels (e.g. polynomial and Gaussian) in correctly classifying indoor and outdoor images [Bar02].
The KPCA and PCA spectra have been extracted. The PCA and Kernel PCA (with Kint) spectra are shown in figure 5.5. Table 5.2 shows the ID estimated with Kernel PCA and PCA, using the Fukunaga-Olsen thresholds.
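The exact kernel used here is the one of [Bar02]; a commonly used form of the color histogram intersection kernel, sufficient to convey the idea, is K(h, h′) = ∑ᵢ min(hᵢ, h′ᵢ) computed over the flattened 15 × 15 × 15 HSV histograms. The following sketch (NumPy assumed, names illustrative) shows this form and the construction of the corresponding Gram matrix.

import numpy as np

def hsv_histogram(pixels_hsv, bins=15):
    # 15 x 15 x 15 HSV histogram, flattened and L1-normalized.
    # pixels_hsv: array of shape (num_pixels, 3) with values in [0, 1].
    hist, _ = np.histogramdd(pixels_hsv, bins=(bins, bins, bins),
                             range=((0, 1), (0, 1), (0, 1)))
    hist = hist.ravel()
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    # A common form of the color histogram intersection kernel.
    return np.minimum(h1, h2).sum()

def gram_matrix(histograms):
    n = len(histograms)
    return np.array([[histogram_intersection(histograms[i], histograms[j])
                      for j in range(n)] for i in range(n)])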

Figure 5.7: Normalized eigenvalue spectra of PCA (green) and KPCA with the Gaussian kernel (red) on the MIT-CBCL Face data set. The top 100 eigenvalues are shown; eigenvalue magnitudes are reported on the M axis. The kernel variance σ is 3.

5.5 Experiments on the MIT-CBCL Face Database

The MIT-CBCL face database consists of low-resolution grey-level images of faces and non-faces. In the experiments we have considered the 2429 faces of the training set.


Figure 5.8: Normalized eigenvalue spectra of PCA (green) and KPCA with the Gaussian kernel (red) on the MIT-CBCL Face data set. The top 100 eigenvalues are shown; eigenvalue magnitudes are reported on the M axis. The kernel variance σ is 5.


  Θ      PCA     KPCA
0.05      44       86
0.03      58      174
0.01     107      542

Table 5.2: ID estimated by means of PCA and KPCA (with the Kint kernel), using different Fukunaga-Olsen thresholds Θ, on the Indoor-Outdoor dataset.

In this experiment we tried several KPCAs with the Gaussian kernel (GKPCA), using different values of the variance σ. As a comparison, PCA has also been tried. Figures 5.6, 5.7, 5.8 and 5.9 show the GKPCA eigenvalue spectrum for different σ values. Table 5.3 shows the ID estimated with the GKPCAs and with PCA, using the Fukunaga-Olsen thresholds. As shown in the table, the ID estimated with GKPCA for large σ tends to the ID value estimated with PCA, confirming the Twining-Taylor theoretical result.
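The ID values in Tables 5.2 and 5.3 follow from the Fukunaga-Olsen criterion: the eigenvalues are normalized by the largest one and the ID is the number of normalized eigenvalues exceeding the threshold Θ. A minimal sketch of this counting rule, applied here to the Polynomial-kernel column of Table 5.1, is the following (NumPy assumed, function name illustrative).

import numpy as np

def fukunaga_olsen_id(eigenvalues, theta):
    # Number of normalized eigenvalues above the threshold theta.
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    lam = lam / lam[0]
    return int(np.sum(lam > theta))

# Polynomial-kernel eigenvalues of Table 5.1.
spectrum = [1.0, 0.346, 0.0311, 2.91e-3, 8.14e-8]
for theta in (0.05, 0.03, 0.01):
    print("Theta =", theta, " ID =", fukunaga_olsen_id(spectrum, theta))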

Figure 5.9: Normalized eigenvalue spectra of PCA (green) and KPCA with the Gaussian kernel (red) on the MIT-CBCL Face data set. The top 100 eigenvalues are shown; eigenvalue magnitudes are reported on the M axis. The kernel variance σ is 9.


  Θ    GKPCA σ=1   GKPCA σ=5   GKPCA σ=10   GKPCA σ=20   GKPCA σ=40    PCA
0.05      2425         33           23           22           22        21
0.03      2425         56           34           31           31        30
0.01      2425        175           76           62           62        60

Table 5.3: ID estimated by means of PCA and GKPCA, using different Fukunaga-Olsen thresholds Θ, on the CBCL dataset.

5.6 Conclusions

In this chapter, the problem of estimating the intrinsic dimension (ID) of a data set by means of Kernel PCA has been investigated. Kernel PCA has been tested on a synthetic data set whose ID is known. It has been shown that Kernel PCA is a good ID estimator only when the Mercer kernel used in the Kernel PCA is very close to the function that has generated the data set. Otherwise Kernel PCA, as an ID estimator, seems to perform as PCA or even worse.
Besides, it has been shown that general purpose kernels (e.g. polynomial, Gaussian) do not seem fit to be used in KPCA for this task. In fact, using these kernels, the intrinsic dimensionality of the data projected in the Feature Space can often exceed the original data dimensionality in the Input Space.
Finally, some investigations have been performed on real databases, confirming the Twining-Taylor theoretical result, which states that Gaussian Kernel PCA, for large values of the variance σ, tends to behave as PCA.


Chapter 6

Data Clustering Methods

6.1 Introduction

In unsupervised learning a training sample of objects, e.g. images, is given and the goal is to extract some structure from them, for instance identifying indoor or outdoor images or differentiating between face and background pixels. This problem can be better stated as learning a concise representation of the data: if some structure exists in the training objects, we can take advantage of the redundancy and find a short data description.
One of the most general ways to represent data is to specify a similarity among objects. If some objects share much structure, it should be possible to reproduce them from the same prototype. This idea underlies clustering methods. It is not possible to provide a formal definition of clustering; only an intuitive definition can be given. Given a fixed number of clusters, clustering methods aim at finding a grouping of objects such that similar objects belong to the same cluster. We view all objects within one cluster as being similar to each other. If it is possible to find a clustering such that the similarities of the objects in one cluster are much greater than the similarities among objects from different clusters, we have extracted structure from the training sample, so that each single cluster can be represented by one representative.
From a statistical point of view, the idea of finding a concise data representation is closely related to the idea of mixture models, where the overlap of the high density regions of the individual mixture components is as small as possible. Since we do not observe the mixture component that generated a particular training object, we have to treat the assignment of training examples to the mixture components as hidden variables. This makes the estimation of the unknown probability measure quite intricate. Therefore clustering methods, in some cases, represent the only viable option, since it can be quite hard to apply mixture models.


Following Jain et al. [Jai99], clustering methods can be categorized into hierarchical clustering and partitioning clustering. Hierarchical schemes sequentially build nested clusters with a graphical representation known as a dendrogram. Partitioning methods directly assign all the data points, according to some appropriate criteria such as similarity and density, to different groups (clusters). The dissertation is focused on prototype-based clustering (PBC) algorithms, which form the most popular class of partitioning clustering methods. PBC algorithms are so popular that they are often simply called clustering algorithms; in the rest of the dissertation this convention is adopted.
This chapter presents a survey of clustering algorithms, paying special attention to neural-based algorithms. The chapter does not include kernel-based algorithms, which are widely discussed and investigated in the next chapter.
The chapter is organized as follows: Section 6.2 reviews the EM algorithm, which is a basic tool of several clustering algorithms; Section 6.3 presents the basic concepts and the definitions common to all clustering algorithms; Section 6.4 describes the K-MEANS algorithm; Sections 6.5 and 6.6 review some soft competitive learning algorithms, namely Self-Organizing Maps, Neural Gas and Topology Representing Networks; the Generative Topographic Mapping is discussed in Section 6.7; Section 6.8 provides some sketches of fuzzy clustering techniques; finally, in Section 6.9 some conclusions are drawn.

6.2 The Expectation-Maximization Algorithm

First of all, we recall the definition of the maximum-likelihood problem. We have a density function p(x|Θ) that is governed by the set of parameters Θ. We also have a data set X = (x₁, . . . , x_N), which we suppose drawn from the probability distribution p; we assume that the data vectors of X are i.i.d. with distribution p. Therefore, the resulting density for the samples is

p(X|Θ) = ∏_{i=1}^{N} p(xᵢ|Θ) = L(Θ|X)     (6.1)

The function L(Θ|X) is called the likelihood of the parameters given the data, or simply the likelihood function. The likelihood is thought of as a function of the parameters Θ, where the data set X is fixed. Therefore we can state the Maximum Likelihood Problem:

Problem 2. Find the parameter Θ* that maximizes the likelihood L(Θ|X), that is

Θ* = arg max_Θ L(Θ|X).     (6.2)

We often maximize its logarithm, logL(Θ|X ), since it is easier.


6.2.1 Basic EM

The Expectation-Maximization (EM) algorithm [Dem77] [Wu83] [Bis95] is a general method for finding the maximum-likelihood estimate of the parameters of an underlying distribution from a given data set, when the data is incomplete or has missing values.
There are two main applications of the EM algorithm. The former occurs when the data indeed has missing values, due to limitations of the observation process. The latter occurs when optimizing the likelihood function is analytically intractable, but the likelihood function can be simplified by assuming the existence of additional missing (or hidden) parameters. The latter application is the one commonly used in clustering.
We assume that the data set X is generated by some probability distribution p. We call X the incomplete data. We assume that a complete data set Z = (X, Y) exists and also assume a joint density function:

p(z|Θ) = p(x, y|Θ) = p(y|x, Θ)p(x|Θ)

This joint density often arises from the marginal density function p(x|Θ) and the assumption of hidden variables and parameter value guesses. In other cases (e.g. missing data values in samples of a distribution), we must assume a joint relationship between the missing and the observed values. With this new density function, we can define a new likelihood function,

L(Θ|Z) = L(Θ|X ,Y) = p(X ,Y|Θ) (6.3)

called the complete data likelihood. This function is a random variable, since the missing information Y is unknown, random, and presumably governed by an underlying distribution. Hence we can write

L(Θ|X ,Y) = hX ,Θ(Y)

where X and Θ are constants, Y is a random variable and h(·) is an unknown function. The original likelihood function L(Θ|X) is called the incomplete data likelihood function.

6.2.1.1 E-step

First of all, the EM algorithm finds the expected value of the complete data log likelihood log p(X, Y|Θ) with respect to the unknown data Y, given the observed data X and the current parameter estimates Θ. We define

Q(Θ, Θ⁽ⁱ⁻¹⁾) = E[ log p(X, Y|Θ) | X, Θ⁽ⁱ⁻¹⁾ ]     (6.4)


where Θ⁽ⁱ⁻¹⁾ are the current parameter estimates, used to evaluate the expectation, and Θ are the new parameters that we optimize to increase Q. We observe that X and Θ⁽ⁱ⁻¹⁾ are constants, Θ is the variable that we want to adjust, and Y is a random variable governed by the distribution f(y|X, Θ⁽ⁱ⁻¹⁾). The right side of (6.4) can be rewritten as:

E[ log p(X, Y|Θ) | X, Θ⁽ⁱ⁻¹⁾ ] = ∫_{y∈Υ} log p(X, y|Θ) f(y|X, Θ⁽ⁱ⁻¹⁾) dy     (6.5)

f(y|X, Θ⁽ⁱ⁻¹⁾) is the marginal distribution of the unobserved data; it depends on X and on the current parameters Θ⁽ⁱ⁻¹⁾, and Υ is the space of values y can take on. In the best cases, f(·) is a simple analytical expression of its inputs; in the worst cases, an analytical expression of f(·) is very hard to obtain. The evaluation of equation (6.5) is called the E-step of the algorithm.

6.2.1.2 M-step

The second step (the M-step) of the EM algorithm maximizes the expectation computed in the first step. We find:

Θ⁽ⁱ⁾ = arg max_Θ Q(Θ, Θ⁽ⁱ⁻¹⁾).     (6.6)

The two steps of the EM algorithm are repeated until the algorithm converges. Each iteration is guaranteed to increase the log-likelihood, and the algorithm is guaranteed to converge to a local maximum of the likelihood function. Although there are many papers (e.g. [Dem77][Red84][Wu83]) on the rate of convergence of the EM algorithm, in practice the algorithm converges after a few iterations. This is one of the reasons for the popularity of the EM algorithm in the Machine Learning community.
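As an illustration of the two steps, the following sketch applies EM to a two-component, one-dimensional Gaussian mixture, where the hidden variables are the component assignments. It is only a toy example under simple assumptions (NumPy, simple initial guesses), not one of the models used later in the thesis.

import numpy as np

def em_gmm_1d(x, n_iter=100):
    # Parameters Theta = (pi, mu, var) of a two-component 1D Gaussian mixture.
    pi = 0.5
    mu = np.array([x.min(), x.max()])
    var = np.array([1.0, 1.0])
    for _ in range(n_iter):
        # E-step: responsibilities of the first component for each point.
        p1 = pi * np.exp(-(x - mu[0])**2 / (2*var[0])) / np.sqrt(2*np.pi*var[0])
        p2 = (1 - pi) * np.exp(-(x - mu[1])**2 / (2*var[1])) / np.sqrt(2*np.pi*var[1])
        r = p1 / (p1 + p2)
        # M-step: parameters that maximize the expected complete log likelihood.
        pi = r.mean()
        mu = np.array([np.sum(r*x)/np.sum(r), np.sum((1-r)*x)/np.sum(1-r)])
        var = np.array([np.sum(r*(x - mu[0])**2)/np.sum(r),
                        np.sum((1-r)*(x - mu[1])**2)/np.sum(1-r)])
    return pi, mu, var

x = np.concatenate([np.random.normal(-2.0, 1.0, 200),
                    np.random.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x))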

6.3 Basic Definitions and Facts

This section provides the basic definitions and concepts of clustering.

6.3.1 Codebooks and Codevectors

Let X be a data set of cardinality m, formed by vectors ζ ∈ Rⁿ. We call codebook the set W = (w₁, w₂, . . . , w_{K−1}, w_K), where each element w_c ∈ Rⁿ and K ≪ m.
The Voronoi Region (R_c) of the codevector w_c is the set of all vectors in Rⁿ for which w_c is the nearest codevector (the winner):

R_c = { ζ ∈ Rⁿ | c = arg min_j ‖ζ − w_j‖ }

Each Voronoi region R_i is a convex area, i.e.

(∀ ζ₁, ζ₂ ∈ R_i) ⇒ (ζ₁ + α(ζ₂ − ζ₁) ∈ R_i)     (0 ≤ α ≤ 1)

The partition of Rⁿ formed by all Voronoi polygons is called the Voronoi Tessellation (or Dirichlet Tessellation). Efficient algorithms to compute it are only known for two-dimensional data sets [Pre90]. If one connects all pairs of points whose respective Voronoi regions share an edge, i.e. an (n − 1)-dimensional hyperface for spaces of dimension n, one gets the Delaunay Triangulation. This triangulation is special among all possible triangulations: it is the only triangulation in which the circumcircle of each triangle contains no point of the original point set other than the vertices of that triangle. The Delaunay triangulation has been shown to be optimal for function interpolation [Omo90].
As shown in Figure 6.1, the Voronoi Set (V_c) of the codevector w_c is the set of all vectors in X for which w_c is the nearest codevector:

V_c = { ζ ∈ X | c = arg min_j ‖ζ − w_j‖ }

6.3.2 Quantization Error

A goal of clustering techniques is the minimization of the Expected Quantization Error (or Expected Distortion Error). In the case of a continuous input distribution p(ζ), the Expected Quantization Error E(p(ζ)) is:

E(p(ζ)) = ∑_{c=1}^{m} ∫_{R_c} ‖ζ − w_c‖² p(ζ) dζ     (6.7)

where R_c is the Voronoi Region of the codevector w_c and m is the cardinality of the codebook W. When we have a finite data set X, the goal can be to minimize the Empirical Quantization Error E(X), that is:

E(X) = (1 / 2|X|) ∑_{c=1}^{m} ∑_{ζ∈V_c} ‖ζ − w_c‖²     (6.8)

where |X| is the cardinality of the data set X and V_c is the Voronoi Set of the codevector w_c.


Figure 6.1: The clusters, formed by the black points, can be represented by their codevectors (red points). Dashed circles identify the Voronoi sets associated with each codevector. The codevectors induce a tessellation of the input space.


A typical application where error minimization is used is vector quantization [Lin80] [Gray84]. In vector quantization, data is transmitted over limited-bandwidth communication channels by transmitting, for each data vector, only the index of the nearest codevector. The codebook is assumed to be known both to the sender and to the receiver; therefore, the receiver can use the transmitted index to retrieve the corresponding codevector. There is an information loss, equal to the distance between the current data vector and the nearest codevector. The expected value of this error is described by equations (6.7) and (6.8). If the data distribution contains subregions of high probability density, dramatic compression rates can be achieved with a small quantization error.

6.3.3 Entropy Maximization

Sometimes the codebook should be distributed such that each codevector has the same probability of being the winner s(ζ) for a randomly chosen input ζ:

P(s(ζ) = w_c) = 1/m     ∀ w_c ∈ W     (6.9)

If we interpret the choice of an input and the subsequent mapping onto the winner codevector in W as a random experiment which assigns a value x ∈ X to the random variable X, then (6.9) is equivalent to maximizing the entropy

H(X) = − ∑_{x∈X} P(x) log(P(x)) = E( log(1/P(x)) )     (6.10)

where E(·) is the expectation operator. If the data can be modeled by a continuous probability distribution p(ζ), then (6.9) is equivalent to

∫_{R_c} p(ζ) dζ = 1/m     (∀ w_c ∈ W)     (6.11)

where R_c is the Voronoi Region of w_c and m is the cardinality of W. When the data set X is finite, (6.9) corresponds to the situation where each Voronoi set V_c contains the same number of data points:

|V_c| / |X| ≈ 1 / |W|     (∀ w_c ∈ W)     (6.12)

An advantage of choosing the codevectors in order to maximize entropy is the inherent robustness of the resulting system: the removal of any codevector affects only a limited fraction of the data. Entropy maximization and error minimization cannot usually be achieved at the same time; if the data distribution is highly non-uniform, the two goals differ notably.


For instance, consider a data set where 50% of the points are distributed in a very small region of the input space, whereas the rest of the data are uniformly distributed in the input space. To maximize entropy, half of the codevectors should be positioned in each region. To minimize the quantization error, however, only one single codevector should be positioned in the small dense region and all the others should be uniformly distributed in the rest of the input space.

6.3.4 Winner-Takes-All Learning

Clustering algorithms often use learning methods (winner-takes-all learning) where each input determines the adaptation of the nearest codevector, the winner, only. Different specific methods, presented in the next sections of the chapter, can be obtained by performing either batch or on-line updates. In batch methods all inputs are evaluated first, before any adaptation is done; this procedure is repeated a number of times. On-line methods perform an update after each input. Among the on-line methods, variants with a constant adaptation rate can be distinguished from variants with decreasing adaptation rates of different kinds.
A general problem occurring with winner-takes-all learning is the possible existence of dead codevectors, i.e. codevectors which are never the winner for any input and keep their position forever. A common way to avoid dead codevectors is to initialize the codevectors with distinct samples drawn according to the input distribution probability. However, if the codevectors are initialized randomly according to the input distribution probability p(ζ), their expected initial local density is proportional to p(ζ). This may be suboptimal for certain goals. For instance, if the goal is error minimization and p(ζ) is highly non-uniform, it is better to undersample the regions with high probability density, i.e. to use fewer codevectors than suggested by p(ζ), and to oversample the other regions.
Another problem of winner-takes-all learning is that different random initializations can yield very different results; the local adaptations may not be able to get the system out of the poor local minimum where it was started. One way to cope with this problem is to modify winner-takes-all learning into a soft competitive learning, where not only the winner but also some other codevectors are adapted.

6.4 K-MEANS

6.4.1 Batch K-MEANS

Batch K-MEANS [For65] appears in the literature under other names: in speech recognition it is called LBG [Lin80], and in older pattern recognition books it is also called the generalized Lloyd algorithm.


Batch K-MEANS works by repeatedly moving all codevectors to the arithmetic mean of their Voronoi sets. The theoretical foundation of this procedure is that a necessary condition for a codebook W to minimize the Quantization Error

E(X) = (1 / 2|X|) ∑_{c=1}^{m} ∑_{ζ∈V_c} ‖ζ − w_c‖²

is that each codevector w_c fulfills the centroid condition [Gray92]. In the case of a finite data set X and of the Euclidean distance, the centroid condition reduces to

w_c = (1 / |V_c|) ∑_{ζ∈V_c} ζ     (6.13)

where V_c is the Voronoi set of the codevector w_c. The Batch K-MEANS algorithm is formed by the following steps:

1. Initialize the codebook W to contain K codevectors w_i (W = (w₁, w₂, . . . , w_K)), chosen randomly from the training set X.

2. Compute for each codevector w_i ∈ W its Voronoi Set V_i.

3. Move each codevector w_i to the mean of its Voronoi Set:

   w_i = (1 / |V_i|) ∑_{ζ∈V_i} ζ     (6.14)

4. Go to step 2 if any codevector w_i has been changed in step 3.

5. Return the codebook.

The second and third steps form a Lloyd iteration. It is guaranteed that after a Lloyd iteration the quantization error does not increase. Besides, Batch K-MEANS can be viewed as an EM algorithm (see Section 6.2): the second and the third steps are respectively the Expectation and the Maximization stage. Since every EM algorithm is convergent, Batch K-MEANS is convergent, too.
The main drawback of K-MEANS is its lack of robustness with respect to outliers. If we consider (6.14), we can see that the presence of outliers can notably affect the codevector computation.
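A compact sketch of the batch algorithm above is given below (NumPy assumed; an empty Voronoi set simply leaves the corresponding codevector unchanged, a detail the pseudocode does not specify).

import numpy as np

def batch_kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), K, replace=False)].astype(float)   # step 1
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: Voronoi sets via the nearest-codevector rule.
        d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: move each codevector to the mean of its Voronoi set, eq. (6.14).
        W_new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else W[k] for k in range(K)])
        if np.allclose(W_new, W):        # step 4: stop if no codevector changed
            break
        W = W_new
    return W, labels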


6.4.2 Online K-MEANS

When the data set X is huge, batch methods become unfeasible and the on-line update becomes a necessity. The on-line version of K-MEANS (online K-MEANS) can be described as follows:

1. Initialize the codebook W to contain K codevectors w_i (W = (w₁, w₂, . . . , w_K)), chosen randomly from the training set X.

2. Choose randomly an input ζ according to the input probability function p(ζ).

3. Determine the nearest codevector, i.e. the winner w_s = s(ζ):

   s(ζ) = arg min_{w_c∈W} ‖ζ − w_c‖     (6.15)

4. Adapt the winner towards ζ:

   Δw_s = ε(ζ − w_s)     (6.16)

5. Go to step 2 until the maximum number of steps is reached.

Online K-MEANS performs a hard competitive learning, since only the winner is modified at each iteration. The learning rate ε determines how much the winner is adapted towards the input.

6.4.2.1 Learning rate

If the learning rate is constant, i.e.

ε = ε0 (0 < ε0 ≤ 1)

then it can be shown that the value of the codevector at time t, w_c(t), can be expressed as an exponentially decaying average of those inputs for which the codevector has been the winner, that is

w_c(t) = (1 − ε₀)ᵗ w_c(0) + ε₀ ∑_{i=1}^{t} (1 − ε₀)^{t−i} ζᶜᵢ     (6.17)

From (6.17) it is obvious that the influence of past inputs decays exponentially fast with the number of inputs for which the codevector w_c has been the winner. The most recent input always determines a fraction ε₀ of the current value of w_c. As a consequence, the algorithm does not converge: even after a large number of inputs, the current input can cause a remarkable change in the winner codevector.


In order to cope with this problem, it has been proposed to use a learning rate that decreases over time. In particular, MacQueen [Mac67] proposed

ε(t) = 1/t     (6.18)

where t stands for the number of inputs for which the codevector has been the winner in the past. Some authors, when referring to K-MEANS, mean only online K-MEANS with a learning rate such as (6.18). This is motivated by the fact that, with this rate, each codevector is always the exact arithmetic mean of the inputs for which it has been the winner in the past. We have:

w_c(0) = ζᶜ₀
w_c(1) = w_c(0) + ε(1)(ζᶜ₁ − w_c(0)) = ζᶜ₁
w_c(2) = w_c(1) + ε(2)(ζᶜ₂ − w_c(1)) = (ζᶜ₁ + ζᶜ₂) / 2
. . .
w_c(t) = w_c(t−1) + ε(t)(ζᶜₜ − w_c(t−1)) = (ζᶜ₁ + ζᶜ₂ + · · · + ζᶜₜ) / t     (6.19)

The set of inputs ζᶜ₁, ζᶜ₂, . . . , ζᶜₜ for which a particular codevector w_c has been the winner may contain elements which lie outside its current Voronoi region. Therefore, although w_c(t) represents the arithmetic mean of the inputs it has been the winner for, at time t some of these inputs may well lie in Voronoi regions belonging to other units. Another important point about this algorithm is that there is no strict convergence, as there is in Batch K-MEANS, since the harmonic series diverges:

lim_{n→∞} ∑_{i=1}^{n} 1/i = ∞

Since the series is divergent, even after a large number of inputs and for low values of the learning rate ε(t), large modifications could still happen in the winner codevector. However, such large modifications are very improbable, and many simulations show that the codebook rather quickly assumes values that are not changed notably in the further course of the simulation. It has been shown that online K-MEANS with a learning rate such as (6.18) [Mac67] converges asymptotically to a configuration where each codevector w_c is positioned such that it coincides with the expectation value

E(ζ | ζ ∈ R_c) = ∫_{R_c} ζ p(ζ) dζ     (6.20)

of its Voronoi region R_c. Equation (6.20) is the generalization, in the continuous case, of the centroid condition


(6.13). Finally, another possible decaying adaptation rule has been proposed by Ritter et al. [Rit91], who propose an exponential decay according to

ε(t) = ε_i (ε_f / ε_i)^{t / t_max}     (6.21)

where ε_i and ε_f are the initial and final values of the learning rate and t_max is the total number of adaptation steps.
The most relevant drawback of online K-MEANS is its sensitivity to the ordering of the input sequence: changing the sequence ordering, the performance of the algorithm can change notably. This drawback is shared by the other on-line algorithms described later.
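The on-line algorithm with the MacQueen learning rate (6.18) can be sketched as follows (NumPy assumed; a per-codevector win counter plays the role of t).

import numpy as np

def online_kmeans(X, K, t_max=10000, seed=0):
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), K, replace=False)].astype(float)
    wins = np.zeros(K)                       # number of wins of each codevector
    for _ in range(t_max):
        zeta = X[rng.integers(len(X))]                        # random input
        s = int(np.argmin(np.linalg.norm(W - zeta, axis=1)))  # winner, eq. (6.15)
        wins[s] += 1
        W[s] += (zeta - W[s]) / wins[s]      # update with eps(t) = 1/t, eq. (6.18)
    return W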

6.5 Self-Organizing Maps

The Self-Organizing Map (SOM) [Koh97] [Koh82] was proposed by Kohonen [Koh82], who based the model on earlier work by Willshaw and von der Malsburg [Wil76]. Unlike the other clustering algorithms, SOM has a biological plausibility, which is one of the causes of the popularity of the SOM algorithm in the Neural Network community. A peculiar characteristic of the SOM model is that its topology is constrained to be a two-dimensional grid (a_ij). The grid is inspired by the retinotopic map that connects the retina to the visual cortex in higher vertebrates. The grid does not change during self-organization. The distance on the grid is used to determine how strongly a unit r = a_km is adapted when the unit s = a_ij is the winner. The metric d₁(·,·) on the grid is the usual L₁ norm (Manhattan distance):

d1(r, s) = |i− k|+ |j −m| (6.22)

The complete SOM algorithm is the following:

1. Initialize the codebook W to contain K codevectors w_i (W = (w₁, w₂, . . . , w_K)), chosen randomly from the training set X. Initialize the connection set C to form a rectangular N₁ × N₂ grid, with K = N₁N₂; each codevector is mapped onto a unit of the grid. Initialize the time parameter t:

   t = 0

2. Choose randomly an input ζ from the training set X.

3. Determine the winner s(ζ) = s:

   s(ζ) = arg min_{w_c∈W} ‖ζ − w_c‖     (6.23)


4. Adapt each codevector w_r according to:

   Δw_r = ε(t) h(d₁(r, s)) (ζ − w_r)     (6.24)

   where:

   h(d₁(r, s)) = exp( − d₁(r, s)² / 2σ² )     (6.25)

   λ(t) = λ_i (λ_f / λ_i)^{t / t_max}     (6.26)

   ε(t) = ε_i (ε_f / ε_i)^{t / t_max}     (6.27)

   σ(t) = σ_i (σ_f / σ_i)^{t / t_max}     (6.28)

and d₁(r, s) is the Manhattan distance between the units r and s that are the images of the codevectors w_r and w_s on the grid.

5. Increase the time parameter t:

   t = t + 1     (6.29)

6. If t < t_max go to step 2.

It is necessary to remark that (6.25) can be substituted by any decreasing function of the arguments σ and d₁(r, s). SOM performs a soft competitive learning since, unlike online K-MEANS, other codevectors, in addition to the winner, can be modified.
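A minimal sketch of the SOM adaptation loop is given below (NumPy assumed; the exponential schedules follow (6.27) and (6.28), and the neighbourhood function follows (6.25)).

import numpy as np

def som(X, n1, n2, t_max=20000, eps=(0.5, 0.01), sigma=(3.0, 0.5), seed=0):
    rng = np.random.default_rng(seed)
    K = n1 * n2
    W = X[rng.choice(len(X), K, replace=False)].astype(float)
    grid = np.array([(i, j) for i in range(n1) for j in range(n2)])
    for t in range(t_max):
        frac = t / t_max
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac        # eq. (6.27)
        sig_t = sigma[0] * (sigma[1] / sigma[0]) ** frac  # eq. (6.28)
        zeta = X[rng.integers(len(X))]
        s = int(np.argmin(np.linalg.norm(W - zeta, axis=1)))   # winner, eq. (6.23)
        d1 = np.abs(grid - grid[s]).sum(axis=1)           # Manhattan distance, eq. (6.22)
        h = np.exp(-d1.astype(float)**2 / (2 * sig_t**2)) # eq. (6.25)
        W += eps_t * h[:, None] * (zeta - W)              # eq. (6.24)
    return W, grid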

6.5.1 SOM drawbacks

SOM shares with the other on-line algorithms the sensitivity to initialization, to the order of the input vectors and to the presence of a large number of outliers. Besides, further problems can be identified. Following Bishop [Bis98], the list below describes some SOM drawbacks.

• The SOM algorithm is not derived by optimizing an objective function. Indeed, Erwin et al. [Erw92] proved that such an objective function cannot exist for the SOM algorithm.

• Neighbourhood-preservation is not guaranteed by the SOM procedure.

• There is no assurance that the codebook vectors will converge using SOM.


• SOM does not define a probability density model. Attempts have been made to interpret the density of the codebook vectors as a model of the data distribution, with limited success.

• It is difficult to know by what criteria to compare different runs of the SOM procedure.

6.6 Neural Gas and Topology Representing Network

In this section we describe models that do not impose a topology of fixed dimensionality on the network. In the case of Neural Gas there is no topology at all; in the case of Topology Representing Networks the topology of the network depends on the local dimensionality of the data and can vary within the input space.

6.6.1 Neural Gas

The Neural Gas algorithm [Mar91] sorts, for each input ζ, the codevectors according to their distance to ζ. Using this rank order, a certain number of codevectors is adapted. The Neural Gas algorithm is the following:

1. Initialize the codebook W to contain K codevectors w_i (W = (w₁, w₂, . . . , w_K)), chosen randomly from the training set X. Initialize the time parameter t:

t = 0

2. Choose randomly an input ζ from the training set X.

3. Order all elements of W according to their distance to ζ, i.e. find the sequence of indices (i₀, i₁, . . ., i_{K−1}) such that w_{i₀} is the nearest codevector to ζ, w_{i₁} is the second-closest to ζ, and so on; w_{i_{p−1}} is the p-th closest to ζ. Following Martinetz and Schulten [Mar93], we denote with k_i(ζ, X) the rank number associated with the codevector w_i.

4. Adapt the codevectors according to:

∆wi = ε(t) hλ(ki(ζ,X )) (ζ − wi) (6.30)

where:

λ(t) = λ_i (λ_f / λ_i)^{t / t_max}     (6.31)

ε(t) = ε_i (ε_f / ε_i)^{t / t_max}     (6.32)

h_λ(k_i) = e^{−k_i / λ(t)}     (6.33)

5. Increase the time parameter t:

   t = t + 1     (6.34)

6. If t < t_max go to step 2.

Neural Gas performs a soft competitive learning since, unlike online K-MEANS, other codevectors, in addition to the winner, can be modified. Neural Gas is usually not used on its own; its extension, the Topology Representing Network (TRN) [Mar94], is generally used instead.
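A sketch of the Neural Gas adaptation loop is the following (NumPy assumed; the double argsort yields the rank k_i of every codevector with respect to the current input).

import numpy as np

def neural_gas(X, K, t_max=20000, eps=(0.5, 0.005), lam=(10.0, 0.5), seed=0):
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), K, replace=False)].astype(float)
    for t in range(t_max):
        frac = t / t_max
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac        # eq. (6.32)
        lam_t = lam[0] * (lam[1] / lam[0]) ** frac        # eq. (6.31)
        zeta = X[rng.integers(len(X))]
        d = np.linalg.norm(W - zeta, axis=1)
        ranks = np.argsort(np.argsort(d))                 # rank k_i of each codevector
        h = np.exp(-ranks / lam_t)                        # eq. (6.33)
        W += eps_t * h[:, None] * (zeta - W)              # eq. (6.30)
    return W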

6.6.2 Topology Representing Network

In addition to the Neural Gas adaptation, the TRN model at each adaptation step creates a connection between the winner and the second-nearest codevector. Since the codevectors are adapted according to the Neural Gas method, a mechanism is needed to remove connections which are no longer valid. This is performed by a local connection aging mechanism. The complete Topology Representing Network algorithm is the following:

1. Initialize the codebook W to contain K codevectors w_i (W = (w₁, w₂, . . . , w_K)), chosen randomly from the training set X. Initialize the connection set C, C ⊆ W × W, to the empty set:

   C = ∅

   Initialize the time parameter t:

   t = 0

2. Choose randomly an input ζ from the training set X .

3. Order all elements of W according to their distance to ζ, i.e. find the sequence of indices (i₀, i₁, . . ., i_{K−1}) such that w_{i₀} is the nearest codevector to ζ, w_{i₁} is the second-closest to ζ, and so on; w_{i_{p−1}} is the p-th closest to ζ. We denote with k_i(ζ, X) the rank number associated with the codevector w_i.

4. Adapt the codevectors according to:

∆wi = ε(t) hλ(ki(ζ,X )) (ζ − wi) (6.35)


where:

λ(t) = λ_i (λ_f / λ_i)^{t / t_max}     (6.36)

ε(t) = ε_i (ε_f / ε_i)^{t / t_max}     (6.37)

h_λ(k_i) = e^{−k_i / λ(t)}     (6.38)

5. If it does not exist already, create a connection between i0 and i1:

C = C ∪ {(i₀, i₁)}     (6.39)

Set the age of the connection between i0 and i1 to zero, refresh the connection:

age(i0,i1) = 0

6. Increment the age of all edges emanating from i0:

age(i0,i) = age(i0,i) + 1 (∀i ∈ Ni0) (6.40)

where Ni0 is the set of direct topological neighbors of the codevector wi0 .

7. Remove connections with an age larger than maximal age T (t)

T(t) = T_i (T_f / T_i)^{t / t_max}     (6.41)

8. Increase the time parameter t:

   t = t + 1     (6.42)

9. If t < t_max go to step 2.

For the time-dependent parameters, suitable initial values (λ_i, ε_i, T_i) and final values (λ_f, ε_f, T_f) have to be chosen. Finally, we remark that the cardinality of C can be used to estimate the intrinsic dimensionality of the training set X; see Section 4.2.3 for more details.

6.6.3 Drawbacks

Neural Gas and TRN share with the other on-line algorithms the sensitivity to initialization, to the order of the input vectors and to the presence of a large number of outliers. Besides, the convergence of Neural Gas and TRN is not guaranteed.


6.7 Generative Topographic Mapping

A goal of unsupervised learning is to develop a representation of the distribution p(t) from which the data were generated. The Generative Topographic Mapping (GTM) [Bis98] tries to model p(t) in terms of a number (usually two) of latent or hidden variables. By considering a particular class of such models, a formulation in terms of a constrained Gaussian mixture, which can be trained using the EM algorithm, is obtained. The topographic nature of the representation is an intrinsic feature of the model and does not depend on the details of the learning process. We now describe the algorithm in detail.

6.7.1 Latent Variables

The goal of a latent variable model is to find a representation for the distribution p(t) of data in a D-dimensional space t = (t₁, . . . , t_D) in terms of a number L of latent variables x = (x₁, . . . , x_L). This is achieved by first considering a non-linear function y(x; W), governed by a set of parameters W, which maps points x in the latent space into corresponding points y(x; W) in the data space. Typically we are interested in the situation in which the dimensionality L of the latent space is less than the dimensionality D of the data space, since our premise is that the data itself has an intrinsic dimensionality which is less than D. The transformation y(x, W) then maps the latent space into an L-dimensional non-Euclidean manifold embedded within the data space. If we define a probability distribution p(x) on the latent space, this will induce a corresponding distribution p(y|W) in the data space. We shall refer to p(x) as the prior distribution of x. Since L < D, the distribution in t-space would be confined to a manifold of dimension L and hence would be singular. Since in reality the data will only approximately live on a lower-dimensional manifold, it is appropriate to include a noise model for the t vector. We therefore define the distribution of t, for given x and W, to be a spherical Gaussian centred on y(x, W) with variance β⁻¹, so that p(t|x, W, β) ∼ N(t | y(x, W), β⁻¹I). The distribution in t-space, for a given value of W, is then obtained by integration over the x-distribution

p(t|W, β) = ∫ p(t|x, W, β) p(x) dx     (6.43)

For a given data set D = (t₁, . . . , t_N) of N data points, we can determine the parameter matrix W and the inverse variance β using the maximum likelihood principle [Dud73], where the log-likelihood function is given by

L(W, β) = ∑_{n=1}^{N} ln p(t_n|W, β)     (6.44)


In principle we can now seek the maximum likelihood solution for the weight matrix, once we have specified the prior distribution p(x) and the functional form of the mapping y(x; W), by maximizing L(W, β). The latent variable model can be related to the SOM algorithm by choosing p(x) to be a sum of delta functions centred on the nodes of a regular grid in latent space

p(x) = (1/K) ∑_{j=1}^{K} δ(x − x_j)

where δ(·) is the Dirac delta function. This form of p(x) allows us to compute the integral in (6.43) analytically. Each point x_j is then mapped to a corresponding point y(x_j, W) in data space, which forms the centre of a Gaussian density function. Hence the distribution function in data space takes the form of a Gaussian mixture model

p(t|W, β) = (1/K) ∑_{j=1}^{K} p(t|x_j, W, β)

and the log likelihood function (6.44) becomes

L(W, β) = ∑_{n=1}^{N} ln [ (1/K) ∑_{j=1}^{K} p(t_n|x_j, W, β) ]     (6.45)

This distribution is a constrained Gaussian mixture, since the centres of the Gaussians cannot move independently but are related through the function y(x, W). Since the mapping function y(x, W) is smooth and continuous, the projected points y(x_j, W) will necessarily have a topographic ordering.

6.7.2 Optimization by EM Algorithm

If we choose a particular parametrized form for y(x, W) which is a differentiable function of W, we can use standard techniques for non-linear optimization, such as conjugate gradients or quasi-Newton methods, to find a weight matrix W* and an inverse variance β* which maximize L(W, β). GTM maximizes (6.45) by means of an EM algorithm. By making a careful choice of the model y(x, W), we will see that the M-step can be solved exactly. In particular we shall choose y(x, W) to be given by a generalized linear network model of the form

y(x,W ) = Wφ(x) (6.46)

where the elements of φ(x) consist of M fixed basis functions φ_i(x), and W is a D × M matrix with elements w_ki.


Generalized linear networks possess the same universal approximation capabilities as MLPs [Bis95], provided the basis functions φ_i(x) are chosen appropriately. By setting the derivatives of (6.45) with respect to w_ki to zero, we obtain

Φᵀ G Φ Wᵀ = Φᵀ R T     (6.47)

where Φ is a K × M matrix with elements Φ_ji = φ_i(x_j), T is an N × D matrix with elements t_nk, and R is a K × N matrix with elements R_jn given by

R_jn(W, β) = p(t_n|x_j, W, β) / ∑_{s=1}^{K} p(t_n|x_s, W, β)     (6.48)

which represents the posterior probability, or responsibility, of the mixture component j for the data point n. Finally, G is a K × K diagonal matrix with elements G_jj

G_jj = ∑_{n=1}^{N} R_jn(W, β)

Equation (6.47) can be solved for W using standard matrix inversion techniques. Similarly, optimizing with respect to β we obtain

1/β = (1/ND) ∑_{j=1}^{K} ∑_{n=1}^{N} R_jn(W, β) ‖y(x_j, W) − t_n‖²     (6.49)

Equation (6.48) corresponds to the E-step, while equations (6.47) and (6.49) correspond to the M-step. Typically the EM algorithm gives satisfactory convergence after a few tens of cycles. An on-line version of this algorithm can be obtained by using the Robbins-Monro procedure to find a zero of the objective function gradient, or by using an on-line version of the EM algorithm.
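The E-step quantities are easy to compute once the projected latent points y(x_j, W) are available. The following sketch (NumPy assumed; Y = Φ Wᵀ collects the rows y(x_j, W)) evaluates the responsibilities (6.48) and the diagonal matrix G; the spherical-Gaussian normalization constants cancel in the ratio.

import numpy as np

def gtm_responsibilities(T, Y, beta):
    # R[j, n] = responsibility of component j (centred on Y[j]) for data point T[n].
    d2 = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)   # squared distances, shape (K, N)
    logp = -0.5 * beta * d2
    logp -= logp.max(axis=0, keepdims=True)    # numerical stabilization
    p = np.exp(logp)
    return p / p.sum(axis=0, keepdims=True)    # eq. (6.48)

def gtm_G(R):
    # Diagonal matrix G with G_jj = sum over n of R[j, n].
    return np.diag(R.sum(axis=1))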

6.7.3 GTM versus SOM

The list below describes some of the problems with SOM and how the GTM algorithm solves them.

• The SOM algorithm is not derived by optimizing an objective function, unlike GTM.

• In GTM the neighbourhood-preserving nature of the mapping is an automatic consequence of the choice of a smooth, continuous function y(x, W). Neighbourhood-preservation is not guaranteed by the SOM procedure.

• There is no assurance that the codebook vectors will converge using SOM. Convergence of the batch GTM algorithm is guaranteed by the EM algorithm, and the Robbins-Monro theorem provides a convergence proof for the on-line version.

• GTM defines an explicit probability density function in data space. In contrast, SOM does not define a density model. The advantages of having a density model include the ability to deal with missing data in a principled way, and the straightforward possibility of using a mixture of such models, again trained using EM.

• For SOM the choice of how the neighbourhood function should shrink over time during training is arbitrary, and so it must be optimized empirically. There is no neighbourhood function to select for GTM.

• It is difficult to know by what criteria to compare different runs of the SOM procedure. For GTM one simply compares the likelihood of the data under the model, and standard statistical tests can be used for model comparison.

Nevertheless, there are very close similarities between the SOM and GTM techniques. At an early stage of the training, the responsibility for representing a particular data point is spread over a relatively large region of the map. As the EM algorithm proceeds, this responsibility bubble shrinks automatically. The responsibilities (computed in the E-step) govern the updating of W and β in the M-step and, together with the smoothing effect of the basis functions φ_i(x), play an analogous role to the neighbourhood function in the SOM algorithm. While the SOM neighbourhood function is arbitrary, however, the shrinking responsibility bubble in GTM arises directly from the EM algorithm.

6.8 Fuzzy Clustering Algorithms

We have described algorithms that assign each data point to exactly one cluster. In this section we briefly describe fuzzy clustering techniques. Fuzzy clustering algorithms are based on the idea of assigning each data point to several clusters with varying degrees of membership. The idea is based on the observation that, in real data, clusters usually overlap to some extent; therefore, some data vectors cannot be assigned with certainty to exactly one cluster, and it seems more reasonable to assign them partially to several clusters. In this section we only describe the most famous fuzzy clustering algorithm, i.e. the Fuzzy C-Means algorithm (FCM), proposed by Bezdek [Bed81]. For comprehensive surveys on fuzzy clustering algorithms, see [Bar99] [Jai99].


6.8.1 FCM

Like the K-MEANS algorithm, FCM assumes that the number of clusters C is known a priori. FCM minimizes the cost function:

J_FCM = ∑_{i=1}^{C} ∑_{j=1}^{m} u_ij^S ‖ζ_j − w_i‖²     (6.50)

subject to the m probabilistic constraints:

∑_{i=1}^{C} u_ij = 1,     j = 1, . . . , m

Here u_ij is the membership value of the input vector ζ_j with respect to cluster i, and S stands for the degree of fuzziness. Using the method of Lagrange multipliers, the conditions for a local minimum of J_FCM are derived as

u_ij = [ ∑_{k=1}^{C} ( ‖ζ_j − w_i‖ / ‖ζ_j − w_k‖ )^{2/(S−1)} ]⁻¹     ∀ i, j     (6.51)

and

w_i = ( ∑_{j=1}^{m} u_ij^S ζ_j ) / ( ∑_{j=1}^{m} u_ij^S )     ∀ i     (6.52)

The final cluster centers can be obtained by an iterative optimization scheme, the Alternating Optimization (AO) method [Pal97]. The on-line version of the optimization of J_FCM with a stochastic gradient descent method is known as fuzzy competitive learning [Chu94].
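One AO iteration simply alternates (6.52) and (6.51); a sketch, under the usual NumPy assumption and with a small constant guarding against zero distances, is the following.

import numpy as np

def fcm(X, C, S=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((C, len(X)))
    U /= U.sum(axis=0, keepdims=True)            # probabilistic constraints
    for _ in range(n_iter):
        W = (U**S @ X) / (U**S).sum(axis=1, keepdims=True)           # eq. (6.52)
        d = np.linalg.norm(X[None, :, :] - W[:, None, :], axis=2)    # d[i, j] = ||zeta_j - w_i||
        d = np.fmax(d, 1e-12)                    # avoid division by zero
        U = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0/(S - 1))).sum(axis=1)  # eq. (6.51)
    return W, U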

6.9 Conclusion

In this chapter we have presented the state of the art of prototype-based clustering algorithms, paying special attention to neural-based algorithms. We have first described the EM algorithm, which is a basic tool of several clustering algorithms. Then we have presented K-MEANS and some soft competitive learning algorithms, i.e. SOM, Neural Gas and TRN. Besides, the Generative Topographic Mapping has been discussed, underlining its connections with SOM. Finally, we have provided some sketches of the fuzzy clustering algorithms.


Kernel-based clustering algorithms have not been considered, since they are the topic of the next chapter, where they will be investigated. We have described only algorithms whose number of codevectors has to be fixed a priori; clustering algorithms whose number of codevectors does not necessarily have to be fixed can be found in [Fri94] [Fri95].
We conclude the chapter by identifying one of the criteria for choosing a clustering algorithm, namely robustness. According to Huber [Hub81], a robust clustering algorithm should possess the following properties:

1. it should have a reasonably good accuracy at the assumed model;

2. small deviations from the model assumption should impair the performance only by a small amount;

3. larger deviations from the model assumption should not cause a catastrophe.

The algorithms described in this chapter often perform well under the first condition, whereas the other two conditions sometimes cannot be satisfied.


Chapter 7

Clustering with Kernels

7.1 Introduction

In this chapter we discuss clustering algorithms based on Mercer kernels. In particular, we present a Kernel Method, Kernel Grower, that can detect more than one cluster in the input data. In our algorithm, data are mapped from the Input Space to a high-dimensional Feature Space using a Mercer kernel. Then we consider K centers and compute, for each center, the smallest ball that encloses the images of the input data that are closest to the center. Kernel Grower uses a K-MEANS-like strategy, i.e. it repeatedly moves the centers, computing for each center the smallest ball, until no center changes. Kernel Grower is an EM algorithm and its convergence is guaranteed.
The chapter is organized as follows: Section 7.2 describes the algorithms with a kernelised metric; Section 7.3 describes Support Vector Clustering, the basic step of Kernel Grower; Section 7.4 presents the Kernel Grower algorithm; Sections 7.5, 7.6 and 7.7 describe the experimental results obtained on a synthetic data set and on two UCI benchmarks; a critical analysis of Kernel Grower is presented in Section 7.8; finally, some conclusions are drawn in Section 7.9.

7.2 Algorithms with Kernelised Metric

In this section we describe the state of the art of clustering with kernels. It is necessary to remark that there are very few works on clustering with kernels, and all of them [Yu02] [Zha03] are based on kernelising the metric, i.e. the metric is computed by means of a Mercer kernel in a Feature Space. In the second chapter we have seen that the metric between two data points can be computed in the Feature Space even if we do not know the mapping Φ explicitly.


Given two data points x, y ∈ Rⁿ, the metric d_G(x, y) between Φ(x) and Φ(y) in the Feature Space is

d_G(x, y) = √( G(x, x) − 2G(x, y) + G(y, y) )     (7.1)

where G(·,·) is an appropriate Mercer kernel.
Now we recall that most classical clustering algorithms are based on the minimization of the Quantization Error. Given a data set D = (x_i ∈ Rⁿ, i = 1, 2, . . . , m), the goal is to get a codebook W = (w_i ∈ Rⁿ, i = 1, 2, . . . , K) that minimizes the quantization error E(D)

E(D) = (1 / 2|D|) ∑_{c=1}^{K} ∑_{x_i∈V_c} ‖x_i − w_c‖²     (7.2)

where V_c is the Voronoi set of the codevector w_c. Since distances appear in the expression (7.2), we can think of computing them in a Feature Space, i.e. we have:

E_G(D) = (1 / 2|D|) ∑_{c=1}^{K} ∑_{x_i∈V_c} [ G(x_i, x_i) − 2G(x_i, w_c) + G(w_c, w_c) ]     (7.3)

A naive solution to minimize EG(D), as suggested in [Cam01b], consists in computing

∂E_G(D) / ∂w_c     (7.4)

and using a steepest gradient descent algorithm. As observed in [Cam01b], some classical clustering algorithms can be kernelised in this way. For instance, consider online K-MEANS. Its learning rule is

Δw_c = ε(ζ − w_c) = ε ∂E(D)/∂w_c     (7.5)

where ζ is the input vector and w_c is the winner codevector for the input ζ. Hence the rule can be rewritten as:

Δw_c = ε ∂E_G(D)/∂w_c     (7.6)

In the case of the Gaussian kernel G(x, y) = exp(−‖x − y‖²/σ²), the equation (7.6) becomes

Δw_c = ε (ζ − w_c) σ⁻² exp( −‖ζ − w_c‖²/σ² )     (7.7)
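A sketch of online K-MEANS with the kernelised metric (7.1) and the Gaussian-kernel update (7.7) is given below (NumPy assumed; the codevectors still live in the input space and, as remarked in the next subsection, convergence is not guaranteed).

import numpy as np

def kernelised_online_kmeans(X, K, sigma=1.0, eps=0.05, t_max=10000, seed=0):
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), K, replace=False)].astype(float)

    def gauss(x, y):
        return np.exp(-np.sum((x - y)**2) / sigma**2)

    for _ in range(t_max):
        zeta = X[rng.integers(len(X))]
        # Squared kernel-induced distance d_G^2, monotone in d_G of eq. (7.1).
        dG2 = [gauss(zeta, zeta) - 2*gauss(zeta, w) + gauss(w, w) for w in W]
        c = int(np.argmin(dG2))
        # Update of the winner, eq. (7.7).
        W[c] += eps * (zeta - W[c]) / sigma**2 * gauss(zeta, W[c])
    return W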

In a similar way, Neural Gas can be kernelised. Its rule changes from

∆wc = ε(t) hλ(kc(ζ,X )) (ζ − wc) (7.8)


into

∆w_c = ε(t) h_λ(k_c(ζ, X)) ∂E_G(D)/∂w_c    (7.9)

In the case of the Gaussian Kernel G(x, y) = exp(−‖x − y‖²/σ²), the equation (7.9) becomes

∆w_c = ε(t) h_λ(k_c(ζ, X)) (ζ − w_c) σ⁻² exp( −‖ζ − w_c‖² / σ² )    (7.10)

7.2.1 Drawbacks

The method presents some drawbacks:

• it can be applied only to particular clustering online algorithms.

• the convergence of the kernelised algorithm is not guaranteed.

• it does not provide a support vector description of data.

We do not experiment with kernelised algorithms in this thesis, since these algorithms do not provide a support vector description of the data. In the next section we introduce an algorithm based on a support vector description of the data.

7.3 Support Vector Clustering

Support Vector Clustering (SVC) [Ben01], also called one-class classification [Tax99] or novelty detection [Sch00a], is an unsupervised Kernel Method based on a support vector description of a data set. SVC computes the smallest sphere enclosing the images, in the Feature Space, of the data points. Therefore, in spite of its name, it cannot be considered a clustering algorithm: SVC yields only one "codevector", the center of the sphere.
Let D = {x_i ∈ R^n, i = 1, 2, . . . , m} ⊆ χ, with χ ⊆ R^n. Using a nonlinear transformation Φ from χ to some high-dimensional Feature Space, SVC looks for the smallest enclosing sphere of radius R. This is described by the constraints:

‖Φ(x_j) − a‖² ≤ R²    ∀j

where ‖ · ‖ is the Euclidean norm and a is the center of the sphere. The constraints can be relaxed using slack variables ξ_j:

‖Φ(x_j) − a‖² ≤ R² + ξ_j    (7.11)


with ξ_j ≥ 0. In order to solve the problem, the Lagrangian is introduced:

L = R² − ∑_{j=1}^{m} (R² + ξ_j − ‖Φ(x_j) − a‖²) β_j − ∑_{j=1}^{m} ξ_j µ_j + C ∑_{j=1}^{m} ξ_j    (7.12)

where β_j ≥ 0 and µ_j ≥ 0 are Lagrange multipliers, C is a constant and C ∑_{j=1}^{m} ξ_j is a penalty term. If we set

∂L/∂R = 0,    ∂L/∂a = 0,    ∂L/∂ξ_j = 0

we get

∑_{j=1}^{m} β_j = 1    (7.13)

a = ∑_{j=1}^{m} β_j Φ(x_j)    (7.14)

β_j = C − µ_j    (7.15)

The Karush-Kuhn-Tucker conditions yield

ξ_j µ_j = 0    (7.16)

(R² + ξ_j − ‖Φ(x_j) − a‖²) β_j = 0    (7.17)

It follows from (7.17) that the image of a point x_j with ξ_j > 0 and β_j > 0 lies outside the Feature Space sphere. Equation (7.16) states that such a point has µ_j = 0, hence we conclude from equation (7.15) that β_j = C. Such a point will be called a bounded support vector (BSV). A point x_j with ξ_j = 0 is mapped to the inside or to the surface of the Feature Space sphere. If 0 < β_j < C, then (7.17) implies that its image Φ(x_j) lies on the surface of the Feature Space sphere. Such a point will be referred to as a support vector (SV). Support vectors lie on cluster boundaries, BSVs lie outside the boundaries and all other points lie inside them. The constraint (7.13) implies that when C ≥ 1 no BSVs exist. Using these relations we may eliminate the variables R, a and µ_j, turning the Lagrangian into the Wolfe dual form, which is a function of the variables β_j only:

W = ∑_{j=1}^{m} Φ(x_j)² β_j − ∑_{i=1}^{m} ∑_{j=1}^{m} β_i β_j Φ(x_i) · Φ(x_j)    (7.18)


Since the variables µ_j do not appear in the Lagrangian, they may be replaced with the constraints:

0 ≤ β_j ≤ C,    j = 1, . . . , m    (7.19)

We compute the dot products Φ(x_i) · Φ(x_j) by an appropriate Mercer kernel G(x_i, x_j). Therefore the Lagrangian W becomes

W = ∑_{j=1}^{m} G(x_j, x_j) β_j − ∑_{i=1}^{m} ∑_{j=1}^{m} β_i β_j G(x_i, x_j)    (7.20)

At each point x, the distance of its image in the Feature Space from the center of the sphere is:

R²(x) = ‖Φ(x) − a‖²    (7.21)

Using (7.14) we have:

R²(x) = G(x, x) − 2 ∑_{j=1}^{m} β_j G(x_j, x) + ∑_{i=1}^{m} ∑_{j=1}^{m} β_i β_j G(x_i, x_j)    (7.22)

The radius of the sphere is:

R = { R(x_i) | x_i is a support vector }    (7.23)

On the basis of (7.23), SVs lie on cluster boundaries, BSVs are outside and all other points lie inside the clusters.
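For illustration, the dual problem (7.13), (7.19), (7.20) and the squared distance (7.22) can be prototyped with a generic constrained optimizer instead of the SMO procedure described below; the use of scipy and all function names are our own assumptions, not the implementation adopted in this thesis:

import numpy as np
from scipy.optimize import minimize

def svc_fit(X, kernel, C=1.0):
    # Solve the SVC dual (7.20) under the constraints (7.13) and (7.19)
    m = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])

    def neg_W(beta):
        # W = sum_j G(x_j, x_j) beta_j - sum_ij beta_i beta_j G(x_i, x_j); we minimize -W
        return -(np.diag(K) @ beta - beta @ K @ beta)

    constraints = {"type": "eq", "fun": lambda beta: np.sum(beta) - 1.0}   # (7.13)
    bounds = [(0.0, C)] * m                                                # (7.19)
    res = minimize(neg_W, np.full(m, 1.0 / m), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x, K

def svc_radius2(x, X, beta, kernel, K):
    # Equation (7.22): squared distance of Phi(x) from the sphere center a
    k_x = np.array([kernel(x, xj) for xj in X])
    return kernel(x, x) - 2.0 * beta @ k_x + beta @ K @ beta

The radius R of (7.23) is then obtained by evaluating svc_radius2 at any point whose multiplier satisfies 0 < β_j < C, i.e. at any support vector.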

7.3.1 Optimization

In the previous section we have formulated Support Vector Clustering as a Quadratic Programming problem. The problem can be solved using standard QP packages when the size of the training set is quite limited. In the other cases, the best solution is to use a modified version of SMO [Pla99]. The strategy of SMO is to break up the constrained minimization of (7.20) into the smallest possible optimization steps. Due to the constraint on the sum of the dual variables, it is impossible to modify individual variables separately without possibly violating the constraint. Therefore the optimization has to be performed over pairs of multipliers. The algorithm is based on an elementary optimization step.

7.3.1.1 Elementary Optimization Step

For instance, consider optimizing over α_1 and α_2 with all other variables fixed. If we define G_ij = G(x_i, x_j), (7.20) becomes:

min_{α_1, α_2}   (1/2) ∑_{i=1}^{2} ∑_{j=1}^{2} α_i α_j G_ij + ∑_{i=1}^{2} α_i C_i + C    (7.24)


where

C_i = ∑_{j=3}^{l} α_j G_ij,    C = ∑_{i=3}^{l} ∑_{j=3}^{l} α_i α_j G_ij

subject to

0 ≤ α_1 ≤ 1/(νl)    (7.25)

0 ≤ α_2 ≤ 1/(νl)    (7.26)

∑_{i=1}^{2} α_i = ∆ = 1 − ∑_{i=3}^{l} α_i    (7.27)

We discard C, which is independent of α1 and α2, and eliminate α1 to obtain

min_{α_2}  W(α_2) = (1/2)(∆ − α_2)² G_11 + (∆ − α_2) α_2 G_12 + (1/2) α_2² G_22 + (∆ − α_2) C_1 + α_2 C_2    (7.28)

Computing the derivative of W and setting it to zero, we have:

−(∆ − α_2) G_11 + (∆ − 2α_2) G_12 + α_2 G_22 − C_1 + C_2 = 0    (7.29)

Solving the equation for α2, we get:

α_2 = [ ∆(G_11 − G_12) + C_1 − C_2 ] / ( G_11 + G_22 − 2G_12 )    (7.30)

Once α_2 is found, α_1 can be recovered from α_1 = ∆ − α_2. If the new point (α_1, α_2) is outside of [0, 1/(νl)], the constrained optimum is found by projecting α_2 from (7.30) into the region allowed by the constraints and then recomputing α_1. The offset is recomputed after every such step. Additional insight can be obtained by rewriting the last equation in terms of the outputs of the kernel expansion on the examples x_1 and x_2 before the optimization step. Let α*_1, α*_2 denote the values of their Lagrange parameters before the step. Then the corresponding outputs are

O_i = G_1i α*_1 + G_2i α*_2 + C_i    (7.31)

Using the latter to eliminate the C_i, we end up with an update equation for α_2 which does not explicitly depend on α*_1:

α_2 = α*_2 + (O_1 − O_2) / ( G_11 + G_22 − 2G_12 )    (7.32)

which shows that the update is essentially the ratio between the first and second derivative of the objective function along the direction of constraint satisfaction. Clearly, the same elementary optimization step can be applied to any pair of variables, not just α_1 and α_2. We next briefly describe how the overall optimization is carried out.
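A minimal sketch of this elementary two-variable step, including the clipping to the box [0, 1/(νl)], is given below; variable names and the helper signature are illustrative only:

import numpy as np

def elementary_step(alpha, i, j, G, nu, l):
    # Jointly optimize alpha[i] and alpha[j] with all other variables fixed (eqs. 7.24-7.30)
    upper = 1.0 / (nu * l)
    delta = alpha[i] + alpha[j]                      # (7.27): their sum is preserved
    mask = np.ones(l, dtype=bool)
    mask[[i, j]] = False
    Ci = G[i, mask] @ alpha[mask]                    # interaction with the fixed variables
    Cj = G[j, mask] @ alpha[mask]
    denom = G[i, i] + G[j, j] - 2.0 * G[i, j]
    if denom <= 0.0:                                 # degenerate pair: skip the step
        return alpha
    a_j = (delta * (G[i, i] - G[i, j]) + Ci - Cj) / denom          # (7.30)
    a_j = np.clip(a_j, max(0.0, delta - upper), min(upper, delta)) # keep both variables in the box
    alpha[j] = a_j
    alpha[i] = delta - a_j
    return alpha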


7.3.2 Optimization algorithm

The initialization of the algorithm is the following. We start by setting a random fraction ν of all α_i to 1/(νl). If νl is not an integer, then one of the examples is set to a value in (0, 1/(νl)) to ensure that ∑_{i=1}^{l} α_i = 1. Besides, we set the initial ρ to

ρ = max_{i ∈ [l], α_i > 0} O_i

Then we select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SV_nb for the indices of the variables which are not at bound, i.e.

SV_nb = { i : i ∈ [l], 0 < α_i < 1/(νl) }

These correspond to points that will sit exactly on the hyperplane and that will therefore have a strong influence on its precise position. The pair of parameters on which the elementary optimization step is applied is selected by using the following heuristics:

1. We scan over the entire data set until we find a variable violating a KKT condition, i.e. a point such that (O_i − ρ) α_i > 0 or (ρ − O_i)(1/(νl) − α_i) > 0. Once we have found one, say α_i, we pick α_j according to

   j = arg max_{n ∈ SV_nb} |O_i − O_n|    (7.33)

2. The same as the above item, but the scan is only performed over SV_nb.

In practice, one scan of the first type is followed by multiple scans of the second type. If the first-type scan finds no KKT violations, the optimization terminates.
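The two scans can be sketched as follows, assuming the outputs O_i of (7.31), the current ρ and the box bound 1/(νl) are available; the names are illustrative and not part of any library:

def violates_kkt(O_i, rho, alpha_i, upper, tol=1e-8):
    # First-type scan: KKT violation test of this section
    return (O_i - rho) * alpha_i > tol or (rho - O_i) * (upper - alpha_i) > tol

def select_partner(i, O, sv_nb):
    # Heuristic (7.33): pick the non-bound index whose output differs most from O[i]
    return max(sv_nb, key=lambda n: abs(O[i] - O[n]))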

7.3.2.1 Remarks

In unusual circumstances the choice heuristic cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress. These other heuristics are the same as in the case of classification. SMO converges in most cases; however, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly [Kee99].


7.4 Kernel Grower algorithm

Since SVC is able to detect only one cluster, the goal of our research is to formulate an algorithm, based on the support vector description of the data set, that can detect more clusters. We propose an algorithm, Kernel Grower (KG), similar to K-MEANS, that performs clustering in the Feature Space. We assume that the number of clusters K is known a priori. Given a data set D, we map our data into some Feature Space F by means of a nonlinear mapping Φ. We consider K centers in the Feature Space (a_i ∈ F, i = 1, . . . , K). We call the set A = {a_1, . . . , a_K} Feature Space Codebook since, in our representation, the centers in the Feature Space play the same role as the codevectors in the Input Space. In analogy with the codevectors in the Input Space, we define for each center a_c its Voronoi Region and Voronoi Set in the Feature Space. The Voronoi Region in Feature Space (FR_c) of the center a_c is the set of all vectors in F for which a_c is the closest center

FR_c = { ξ ∈ F | c = arg min_j ‖ξ − a_j‖ }    (7.34)

The Voronoi Set in Feature Space (FV_c) of the center a_c is the set of all vectors ζ in D such that a_c is the closest center to their images Φ(ζ) in the Feature Space

FV_c = { ζ ∈ D | c = arg min_j ‖Φ(ζ) − a_j‖ }    (7.35)

These definitions, as in the case of the codevectors in the Input Space, induce a Voronoi Tessellation of the Feature Space. Continuing the analogy, we can also define a measure of the quality of our clustering in the Feature Space, i.e. the Quantization Error in Feature Space FE(D):

FE(D) = (1 / (2|D|)) ∑_{c=1}^{K} ∑_{ζ ∈ FV_c} ‖Φ(ζ) − a_c‖²    (7.36)

where |D| is the cardinality of the data set D and FV_c is the Voronoi set in Feature Space of the center a_c. We now describe the KG algorithm. KG uses a K-MEANS-like strategy, i.e. it repeatedly moves all centers a_c in the Feature Space, computing SVC on their Voronoi sets in the Feature Space FV_c. In order to make KG more robust with respect to outliers, SVC is actually computed on the set FV_c(ρ) of each center a_c, defined as

FV_c(ρ) = { ζ ∈ D | (c = arg min_j ‖Φ(ζ) − a_j‖) ∧ (‖Φ(ζ) − a_c‖ < ρ) }    (7.37)

FV_c(ρ) is the Voronoi set in the Feature Space of the center a_c without outliers, that is without the data points whose distance from a_c is larger than ρ. The parameter ρ can be set by means of


model selection techniques. KG has the following steps:

1. Project the data set D into a Feature Space F by means of a nonlinear mapping Φ. Initialize the centers a_c ∈ F, c = 1, . . . , K.

2. Compute for each center a_c, c = 1, . . . , K, the set FV_c(ρ), i.e. the data ζ such that a_c is the closest center to their images Φ(ζ) and whose distance from the center is lower than ρ.

3. Apply SVC to each FV_c(ρ) and assign a_c = SVC(FV_c(ρ)).

4. Go to step 2 until no a_c changes.

5. Return the Feature Space codebook.

KG is an EM algorithm, since its second and third steps are respectively the expectation and the maximization stage of an EM algorithm. Hence the convergence of KG is guaranteed, since every EM algorithm is convergent. The initialization of the Feature Space codebook in step one can be performed by considering a small part RD of the data set D, dividing it into K disjoint subsets RD_i, i = 1, . . . , K, and assigning a_i = SVC(RD_i), i = 1, . . . , K. In the experiments described in Sections 7.5, 7.6 and 7.7, RD is formed by a tenth of the points of the data set D, randomly chosen. Finally we underline that the use of moderate values of ρ implies that the Voronoi sets of the centers tend to grow after each iteration. This assertion has been confirmed in the experiments performed. KG has been tried on three data sets: a synthetic data set (Delta Set), the IRIS Data and the Wisconsin breast cancer database.
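For illustration only, the whole KG loop can be sketched as follows, reusing the svc_fit and svc_radius2 helpers sketched in Section 7.3; this is a simplified sketch under our own naming assumptions, not the implementation used for the experiments below:

import numpy as np

def kernel_grower(D, K, kernel, C=1.0, rho=np.inf, max_iter=50):
    # Each center a_c is represented by its SVC expansion (points, beta, kernel matrix)
    rng = np.random.default_rng(0)

    def fit_center(X):
        beta, Kmat = svc_fit(X, kernel, C)
        return X, beta, Kmat

    def distance(x, center):
        X, beta, Kmat = center
        return np.sqrt(max(svc_radius2(x, X, beta, kernel, Kmat), 0.0))

    # Step 1: initialize the centers with SVC on K disjoint random subsets (a tenth of D)
    sample = rng.permutation(len(D))[: max(K, len(D) // 10)]
    centers = [fit_center(D[chunk]) for chunk in np.array_split(sample, K)]

    previous = None
    for _ in range(max_iter):
        # Step 2: Voronoi sets in the Feature Space, dropping points farther than rho (eq. 7.37)
        dists = np.array([[distance(x, c) for c in centers] for x in D])
        closest = dists.argmin(axis=1)
        assignment = [(int(c), bool(d[c] < rho)) for c, d in zip(closest, dists)]
        if assignment == previous:      # Step 4 (simplified): the Voronoi sets no longer change
            break
        previous = assignment
        # Step 3: apply SVC to each FV_c(rho)
        for k in range(K):
            V = D[[i for i, (c, keep) in enumerate(assignment) if c == k and keep]]
            if len(V) > 0:
                centers[k] = fit_center(V)
    return centers                      # Step 5: the Feature Space codebook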

7.5 Experiments on Delta Set

The synthetic data set (Delta Set)1 is a bidimensional set formed by 424 points, shown in figure 7.1. Delta Set is formed by points of two classes that are nonlinearly separated. Therefore the two classes cannot be separated by clustering algorithms that use only two codevectors in the Input Space, since two codevectors permit only a linear separation of the data. To confirm that, we tried K-MEANS with two codevectors on Delta Set. As shown in figure 7.2, K-MEANS cannot separate the clusters. This K-MEANS limitation is shared by other non-kernel-based clustering algorithms, e.g. SOM [Koh82] [Koh97] and Neural Gas [Mar93]. We then tried KG on Delta Set using only two centers.

1 The Delta data set can be downloaded from the following ftp address: ftp.disi.unige.it/person/CamastraF/delta.dat


Figure 7.1: Delta Set


Figure 7.2: K-MEANS on Delta Set. The codevectors are indicated by the black points. The solid line indicates the separation line determined by the codevectors.


Figure 7.3: KG algorithm on Delta Set after 1st iteration. The blue data points belong to the cluster of the center a1. The green data points belong to the cluster of the center a2.


Figure 7.4: KG algorithm on Delta Set after 2nd iteration. The blue data points belong to the cluster of the center a1. The green data points belong to the cluster of the center a2.


Figure 7.5: KG algorithm on Delta Set after 3rd iteration. The blue data points belong to the cluster of the center a1. The green data points belong to the cluster of the center a2.


Figure 7.6: KG algorithm on Delta Set after 4th iteration. The blue data points belong to the cluster of the center a1. The green data points belong to the cluster of the center a2.


Figure 7.7: KG algorithm on Delta Set after 5th iteration. The blue data points belong to the cluster of the center a1. The green data points belong to the cluster of the center a2.


Figure 7.8: KG algorithm on Delta Set after 6th iteration. The blue data points belong to the cluster of the center a1. The green data points belong to the cluster of the center a2.


Figure 7.9: Kernel Grower on Delta Set. The region delimited by the blue line identifies the input data whose images in the Feature Space have distance from the center a1 less than 0.75. The region delimited by the green line identifies the input data whose images in the Feature Space have distance from the center a2 less than 0.84. The counterimages Φ⁻¹(a1) and Φ⁻¹(a2), in the Input Space, of the centers a1 and a2 do not exist.


Figure 7.10: Kernel Grower on Delta Set. The blue lines identify the input data whose images in the Feature Space have distance from the center a1 less than 0.65, 0.70 and 0.75. The green lines identify the input data whose images in the Feature Space have distance from the center a2 less than 0.80 and 0.84.


As shown in figures 7.9 and 7.10, KG, unlike K-MEANS, can separate the two clusters. Besides, it is important to remark that the counterimages Φ⁻¹(a1) and Φ⁻¹(a2), in the Input Space, of the centers a1 and a2 do not exist. Cluster sizes, measured in terms of data points, during the KG learning are shown in Table 7.1. As shown in Figures 7.3, 7.4, 7.5, 7.6, 7.7 and 7.8, the cluster sizes grow during the learning, justifying the algorithm name.

iteration   First Cluster   Second Cluster
    1             82              118
    2            112              142
    3            121              201
    4            136              212
    5            175              212
    6            212              212
    7            212              212

Table 7.1: Cluster sizes during KG learning on Delta Set

7.6 Experiments on Iris Data

Iris Data2 is one of the most famous real data benchmarks in Machine Learning.

Iris Data was proposed by Fisher [Fis36] in 1936. It is formed by 150 points that belong to three different classes. One class is linearly separable from the other two, but the other two are not linearly separable from each other. The normalized eigenvalue spectrum is shown in Table 7.2. Since the Iris Data dimension is 4, Iris Data is usually represented by projecting the data along their principal components. Iris Data projected along the two and three major principal components is shown in figures 7.11 and 7.12 respectively. We tried KG, K-MEANS, Neural Gas and SOM on Iris Data using three centers, one center for each class. The results using SOM and Neural Gas are shown in figures 7.15 and 7.14 respectively. The results using K-MEANS and KG are shown in figures 7.13 and 7.16 respectively. Even in this case, it is important to remark that the counterimages, in the Input Space, of the centers do not exist. Table 7.3 shows the average KG, K-MEANS, SOM and Neural Gas performances over 20 runs, obtained by changing the algorithm initializations and parameters (e.g. the Gaussian kernel parameter σ and the ρ value). As shown in the table, the KG performances are better than those of the other clustering algorithms.

2 The Iris data set can be downloaded from the following ftp address: ftp.ics.uci.edu/pub/machine-learning-databases/iris


Figure 7.11: Iris Data Set. The data are projected along the two major principal components. The three classes are represented by different colours.

Figure 7.12: Iris Data Set. The data are projected along the three major principal components. The three classes are represented by different colours.


Figure 7.13: K-MEANS on Iris Data Set. Filled disks indicate the data correctly classified by K-MEANS. Circles represent the data misclassified.


Figure 7.14: Neural Gas on Iris Data Set. Filled disks indicate the data correctly classified by the Neural Gas. Circles represent the data misclassified.


Figure 7.15: SOM on Iris Data Set. Filled disks indicate the data correctly classified by the SOM. Circles represent the data misclassified.


Figure 7.16: Kernel Grower on Iris Data Set. Filled disks indicate the data correctly classified by the Kernel Grower. Circles represent the data misclassified.


                 Magnitude
1st eigenvalue   1
2nd eigenvalue   0.057
3rd eigenvalue   0.019
4th eigenvalue   0.00056

Table 7.2: Normalized eigenvalue spectrum of Iris Data

model        Points Correctly Classified   Classification Rate
SOM          121.5 ± 1.5                   81.0%
K-MEANS      133.5 ± 0.5                   89.0%
Neural Gas   137.5 ± 1.5                   91.7%
KG           142 ± 1                       94.7%

Table 7.3: Average KG, SOM, K-MEANS and Neural Gas performances on Iris Data. The results have been obtained using twenty different runs for each algorithm.

7.7 Experiments on Wisconsin Breast Cancer Database

The Wisconsin breast cancer database3 was proposed by the physician Wolberg [Wol90] [Man90a] [Man90b] [Ben92] in 1990. The database collects 699 diagnostic cases. We have removed 16 samples with missing values, therefore the database considered in the experiments has 683 patterns. The patterns belong to two different classes: the former has 444 samples, the latter has 239 samples. The Wisconsin breast cancer database projected along the two and three major principal components is shown in figures 7.17 and 7.18 respectively.

Each pattern (patient) was represented by a nine-dimensional feature vector. We tried KG, K-MEANS, Neural Gas and SOM on the Wisconsin breast cancer database using two centers, one center for each class. The results using SOM and Neural Gas are shown in figures 7.21 and 7.20 respectively. The results using K-MEANS and KG are shown in figures 7.19 and 7.22 respectively.

Table 7.4 shows the average performances of KG, K-MEANS, Neural Gas and SOM on the breast cancer database over 20 runs, obtained by changing the algorithm initializations and parameters. As shown in the table, the KG performances are better than those of the other clustering algorithms.

3 The Wisconsin breast cancer database can be downloaded from the following ftp address: ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin


Figure 7.17: Wisconsin Breast Cancer Database. The data are projected along the two major principal components. The two classes are represented by different colours.


Figure 7.18: Wisconsin Breast Cancer Database. The data are projected along the three major principal components. The two classes are represented by different colours.


Figure 7.19: K-MEANS on Wisconsin Breast Cancer Database. The data are projected along the two major principal components. The data correctly classified are represented by red (first class) and green (second class) points. The data incorrectly classified are represented by violet (first class) and blue (second class) points.


Figure 7.20: Neural Gas on Wisconsin Breast Cancer Database. The data are projected along the two major principal components. The data correctly classified are represented by red (first class) and green (second class) points. The data incorrectly classified are represented by violet (first class) and blue (second class) points.


Figure 7.21: SOM on Wisconsin Breast Cancer Database. The data are projected along the two major principal components. The data correctly classified are represented by red (first class) and green (second class) points. The data incorrectly classified are represented by violet (first class) and blue (second class) points.


Figure 7.22: KG on Wisconsin Breast Cancer Database. The data are projected along the two major principal components. The data correctly classified are represented by red (first class) and green (second class) points. The data incorrectly classified are represented by violet (first class) and blue (second class) points.


model        Points Correctly Classified   Classification Rate
SOM          660.5 ± 0.5                   96.7%
K-MEANS      656.5 ± 0.5                   96.1%
Neural Gas   656.5 ± 0.5                   96.1%
KG           662.5 ± 0.5                   97.0%

Table 7.4: Average KG, SOM, K-MEANS and Neural Gas performances on the Wisconsin breast cancer database. The results have been obtained using twenty different runs for each algorithm.

7.8 KG Critical Evaluation

In this section we discuss the main KG qualities and drawbacks.

7.8.1 KG Qualities

KG is an EM algorithm, hence its convergence is guaranteed. Under this aspect KG compares favourably with on-line clustering algorithms such as SOM and Neural Gas, whose convergence is not guaranteed. KG is a batch clustering algorithm, therefore the pattern ordering in the training set does not affect the algorithm performance, unlike on-line clustering algorithms, which are heavily affected by pattern ordering. In comparison with all the clustering algorithms presented in the sixth chapter, KG is more robust with respect to outliers, since it has a mechanism that permits discarding outliers during training. However, the main KG quality consists, unlike all clustering algorithms published in the literature, in producing nonlinear separation surfaces among the data. As shown in Section 7.5, KG can separate classes of data that are not linearly separable by other clustering algorithms.

7.8.2 KG Drawbacks

The main KG drawback is the heavy computation time required by the algorithm. In fact, if we use K centers and P is the number of loops required by the algorithm to reach convergence, the number N of quadratic programming problems to solve is N = KP. If the training set has thousands of samples, as happens in handwriting recognition applications, the time required to solve the N quadratic programming problems can be huge. Another part of the algorithm that can require large computational resources is step 2, which involves the evaluation of equation (7.22). This operation has computational complexity O(s²e), where s is the number of support vectors and e is the time required to


compute the exponential function. If the training set has m samples and we use K centers, the computational complexity of the step is O(Kms²e). Therefore, if the training set is very large and the number of clusters is high, the computational load can be cumbersome.

7.9 Conclusion

In this chapter a Kernel Method, the Kernel Grower, that can detect more than one cluster in the data set, has been presented. Kernel Grower is an EM algorithm and its convergence is guaranteed. Under this aspect KG compares favourably with on-line clustering algorithms such as Self Organizing Maps and Neural Gas, whose convergence is not guaranteed. Kernel Grower is a batch clustering algorithm, therefore the pattern ordering in the training set does not affect the algorithm performance, unlike on-line clustering algorithms, which are heavily affected by pattern ordering.
KG often performs well under the first Huber condition for robust clustering, whereas the other two conditions cannot always be satisfied.
The main KG quality consists, unlike all clustering algorithms published in the literature, in producing nonlinear separation surfaces among the data.
Kernel Grower has been tried on a synthetic data set and on two UCI benchmarks, the IRIS Data and the Wisconsin breast cancer database. Kernel Grower performs better than K-MEANS on these benchmarks. These results encourage the use of Kernel Grower for the solution of real world problems.


Chapter 8

Conclusion

Kernel Methods are algorithms that project input data by a nonlinear mapping into a new space (Feature Space). In this thesis we have investigated Kernel Methods for Unsupervised Learning, namely Kernel Methods that do not require targeted data. In this chapter we summarize the main results of the thesis. Using Kernel Methods, we have tackled two classical unsupervised learning problems: the former is Data Dimensionality Estimation, the latter is Clustering.

8.1 Data Dimensionality Estimation

The dimensionality of a data set, called the Intrinsic Dimension (ID), is the minimum number of free variables needed to represent the data without information loss. We have investigated Kernel PCA as an ID estimator. Kernel PCA has been compared with a classical dimensionality estimation method, Principal Component Analysis (PCA). The study has been carried out both on a synthetic data set of known dimensionality and on real data benchmarks, i.e. the MIT-CBCL Face database and the Indoor-Outdoor Image database. The investigations have highlighted that Kernel PCA is reliable as an ID estimator only when the kernel adopted in Kernel PCA is very close to the function that has generated the data set. Otherwise Kernel PCA can perform even worse than PCA. Besides, we have shown that common Mercer Kernels, such as the polynomial or the Gaussian one, imply that the Intrinsic Dimension of the data projected into the Feature Space is higher than that of the data in the Input Space. The obtained results imply that classical methods, e.g. fractal techniques and PCA, are to be preferred to Kernel PCA.


8.2 Clustering

With regard to the Clustering problem, we have proposed a novel Kernel Method, the Kernel Grower [Cam04]. We report the main Kernel Grower qualities. The Kernel Grower convergence is guaranteed, since it is an Expectation-Maximization algorithm. Under this aspect KG compares favourably with on-line clustering algorithms such as Self Organizing Maps and Neural Gas, whose convergence is not guaranteed. Kernel Grower is a batch clustering algorithm, therefore the pattern ordering in the training set does not affect the algorithm performance, unlike on-line clustering algorithms, which are heavily affected by pattern ordering. In comparison with popular clustering algorithms, KG is more robust with respect to outliers, since it has a mechanism that permits discarding outliers during training. However, the main KG quality consists, unlike all clustering algorithms published in the literature, in producing nonlinear separation surfaces among the data. As shown in Section 7.5, KG can separate classes of data that are not linearly separable by other clustering algorithms.
Kernel Grower compares favourably with popular clustering algorithms, namely K-MEANS, Neural Gas and Self Organizing Maps, on a synthetic data set and two UCI real data benchmarks, i.e. the IRIS data and the Wisconsin breast cancer database.
Kernel Grower is the main original result of the thesis.

8.3 Future Work

In this section we briefly present future research activities. As pointed out in Section 7.8, the main Kernel Grower drawback is the heavy computation time required by the algorithm. The computational load of the Kernel Grower can be burdensome for training sets with many thousands of samples, as in handwritten character recognition. Therefore a first line of research involves the optimization of the Kernel Grower. Kernel Grower requires the solution of quite a number of quadratic programming problems. The investigation of fast and effective techniques, alternative to SMO, for the solution of quadratic programming problems is a research line to be developed in the near future. Another future research line is the development of kernels, other than the Gaussian one, that can be used in the Kernel Grower.
Finally, the application of the Kernel Grower to the solution of real world problems will be carried out in the near future.


Bibliography

[Aiz64] M. Aizerman, E. Braverman and L. Rozonoer. Theoretical foundations of the po-tential function method in pattern recognition learning. Automation and RemoteControl, 25, 821-837, 1964.

[And83] J.R. Anderson. Acquisition of Proof Skills in Geometry. Machine Learning, J.G.Carbonell, R.S. Michalski and T.M. Mitchell (eds.), 191-220, Tioga PublishingCompany, 1983.

[Aro44] N. Aronszajn. La theorie generale de noyaux reproduisants et ses applications.Proc. Cambridge Philos. Soc., 39, 133-153, 1944.

[Aro50] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68,337-404.

[Bar99] A. Baraldi, P. Blonda. A survey of fuzzy clustering algorithms for Pattern Recog-nition. IEEE Trans. on System, Man and Cybernetics-B, 29(6), 778-801, 1999.

[Bar02] A. Barla, E. Franceschi, F. Odone and A. Verri. Image Kernels. Proceedings ofSVM2002, 83-96, 2002.

[Bed81] J.C. Bedzek. Pattern Recognition with Fuzzy Objective Function Algorithms.Plenum Press, NewYork, 1981.

[Bel61] R. Bellmann. Adaptive Control Processes: A Guided Tour. Princeton UniversityPress, New Jersey 1961.

[Ben68] R.S. Bennett. The Intrinsic Dimensionality of Signal Collections. IEEE Trans-actions on Information Theory, 15, 517-525, 1969.

[Ben92] K. P. Bennett and O. L. Mangasarian. Robust linear programming discriminationof two linearly inseparable sets. Optimization Methods and Software, 1, 23-34,1992.


[Ben01] A. Ben-Hur, D. Horn, H.T. Siegelmann, V. Vapnik. Support Vector Clustering.Journal of Machine Learning Research, 2, 125-137, MIT Press, 2001.

[Ber84] C. Berg, J.P.R. Christensen and P. Ressel. Harmonic analysis on semigroups.Springer e Verlag, New York, USA: 1984.

[Bil98] J.A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application toParameter Estimation for Gaussian Mixture and Hidden Markov Models. Tech.Report TR-97-021, ICSI, U.C. Berkeley, April 1998.

[Bis95] C. M. Bishop. Neural Networks for Pattern Recognition. Cambridge UniversityPress, Cambridge, UK: 1995.

[Bis98] C. M. Bishop, M. Svensen and C.K.I. Williams. GTM: the generative topo-graphic mapping. Neural Computation, 10, 215-234, 1998.

[Bro89] D.S. Broomhead and R. Jones. Time series analysis. Proc. R. Soc. Lond., A423,103-121, 1989.

[Bro00] D.S. Broomhead and M. Kirby. A New Approach to Dimensionality Reduction:Theory and Algorithms. SIAM Journal of Applied Mathematics, 60(6), 2114-2142, 2000.

[Bru98] J. Bruske and G. Sommer. Intrinsic Dimensionality Estimation with OptimallyTopology Preserving Maps. IEEE Transaction on Pattern Analysis and MachineIntelligence, 20(5), 572-575, IEEE Computer Society, 1995.

[Buh95] J.M. Buhmann. Data clustering and learning”. The Handbook of Brain Theoryand Neural Networks, M.A. Arbib (ed.), 278-281, MIT Press, 1995

[Bur98] C.J. Burges. A tutorial on support vector machines for pattern recognition. DataMining and Knowledge Discovery, 2(2), 1-47, 1998.

[Cam99] F. Camastra and A.M. Colla. Neural short-term prediction based on dynamicsreconstruction. Neural Processing Letters, 9(1), 45-52, Kluwer Academic Pub-lishers, 1999.

[Cam01a] F. Camastra and A. Vinciarelli. Intrinsic Estimation of data: An approach basedon Grassberger-Procaccia’s Algorithm. Neural Processing Letters, 14(1), 27-34,Kluwer Academic Publishers, 2001.

[Cam01b] F. Camastra. Kernel Methods for Computer Vision: Theory and ApplicationsPhd. Thesis Proposal, University of Genova, 2001.


[Cam02a] F. Camastra and A. Vinciarelli. Estimating the Intrinsic Dimension of Data witha Fractal-Based Method. IEEE Transaction on Pattern Analysis and MachineIntelligence, 24(10), 1404-1407, IEEE Computer Society, 2002.

[Cam02b] F. Camastra. Kernel Methods for Unsupervised Learning. Phd. Thesis ProgressReport, University of Genova, Italy, 2002.

[Cam03] F. Camastra. Data Dimensionality Estimation methods: A survey. PatternRecognition, 36(12), 2945-2954, Elsevier, 2003.

[Cam04] F. Camastra, A. Verri. Clustering with a Kernel Grower. in preparation, 2004.

[Cam02c] C. Campbell. Kernel methods: A survey of current techniques. Neurocomputing,48, 63-84, Elsevier, 2002.

[Car83] J.G. Carbonell, R.S. Michalski and T.M. Mitchell. An Overview of MachineLearning. Machine Learning, J.G. Carbonell, R.S. Michalski and T.M. Mitchell(eds.), 3-23, Tioga Publishing Company, 1983.

[Car83b] J.G. Carbonell. Learning by Analogy: Formulating and Generalizing Plans fromPast Experience. Machine Learning, J.G. Carbonell, R.S. Michalski and T.M.Mitchell (eds.), 137-162, Tioga Publishing Company, 1983.

[Cha73] C.L. Chang and R.C.T. Lee. A heuristic relaxation method for nonlinear map-ping in cluster analysis. IEEE Transactions on Computers, C-23, 178-184, IEEEComputer Society, 1974.

[Che74] C.K. Chen and H.C. Andrews. Nonlinear Intrinsic Dimensionality Computations.IEEE Transactions on System Man and Cybernetics, SMC-3, 197-200, IEEEComputer Society, 1973.

[Chi90] D.R. Chialvo, R.F. Gilmour and J. Jalife. Low-dimensional Chaos in CardiacTissue. Nature, 343, 653-658, 1990.

[Chu94] F.L. Chung and T. Lee. Fuzzy competitive learning. Neural Networks, 7(3),539-551, 1994.

[Col01] R. Collobert, S. Bengio. SVMTorch: Support Vector Machines for Large-ScaleRegression Problems. Journal of Machine Learning Research, vol. 1, 143-160,2001.

[Con88] J.H. Conway, N.J.A. Sloane. Sphere Packings, Lattices and Groups. Springer eVerlag, New York, USA: 1988.

[Cor95] C. Cortes and V. Vapnik. Support Vector Networks. Machine Learning, 20, 1-25,1995


[Cox94] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall, London, 1994.

[Cre93] N. Cressie. Statistics for Spatial Data. John Wiley, New York, USA: 1993.

[Cri00] N. Cristianini and J. Shawe-Taylor. Sphere Packings, Lattices and Groups. Cam-bridge University Press, Cambridge, UK: 2000.

[Dem97] P. Demartines and J. Herault. Curvilinear component analysis: A self-organizingneural network for nonlinear mapping in cluster analysis. IEEE Transactions onNeural Networks, 8(1), 148-154, 1997.

[Dem77] A.P. Dempster, N.M. Laird and D.B. Rubin. Maximum Likelihood from Incom-plete Data via the EM algorithm. Journal Royal Statistical Society, 39(1), 1-38,1977.

[Dev96] L. Devroye, L. Gyorfi, G. Lugosi. A Probabilistic Theory of Pattern Recognition.Springer-Verlag, New York, USA: 1996.

[Dez93] M. Deza and M. Laurent. Measure Aspects of Cut Polyhedra: l1-embeddabilityand Probability. Technical Report, Departement de Mathematiques et d’ Infor-matique, Ecole Normale Superieure, LIENS-93-17, 1993.

[Dyn91] N. Dyn. Interpolation and approximation by radial and related functions. Ap-proximation Theory, C.K. Chui, L.L. Schumaker and D.J.Ward (eds.), 211-234,Academic Press, New York, USA: 1991.

[Dud73] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. JohnWiley, New York, 1973.

[Dun63] N. Dumford and T.J. Schwarz. Linear Operators Part II: Spectral Theory, SelfAdjoint Operators in Hilbert Spaces. John Wiley & Sons, No. 7 in Pure andApplied Mathematics, New York, 1963.

[Dvo91] I. Dvorak and A. Holden. Mathematical Approaches to Brain Functioning Di-agnostics. Manchester University Press, Manchester, UK: 1991.

[Eck85] J.P. Eckmann and D. Ruelle. Ergodic Theory of Chaos and Strange Attractors.Review of Modern Physics, 57, 617-659, 1985.

[Eck92] J.P. Eckmann and D. Ruelle. Fundamental Limitations for Estimating Dimen-sions and Lyapounov Exponents in Dynamical Systems. Physica, D56, 185-187,1992.

[Efr93] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman andHall, New York, USA: 1993.


[Erw92] E. Erwin, K. Obermayer and K. Schulten. Self-organizing maps:ordering, conver-gence properties and energy functions. Biological Cybernetics, 67, 47-55, 1992.

[Fer00a] M. Ferris, T. Munson. Interior point method for massive support vector ma-chines. Data Mining Institute Technical Report 00-05, Computer Sciences De-partment, University of Wisconsin, Madison, Wisconsin, 2000.

[Fer00b] M. Ferris, T. Munson. Semi-smooth support vector machines. Data MiningInstitute Technical Report 00-09, Computer Sciences Department, University ofWisconsin, Madison, Wisconsin, 2000.

[Fis36] R.A. Fisher. The use of multiple measurements in taxonomic problems. Annalsof Eugenics, 7, 179-188, 1936.

[For65] E. Forgy. Cluster analysis of multivariate data; efficiency vs. interpretability ofclassifications. Biometrics, 21, 768, 1965.

[Fot97] D. Fotheringhame and R.J. Baddeley. Nonlinear principal component analysisof neuronal spike train data. Biological Cybernetics, 77, 282-288, 1997.

[Fri89] J. Friedman. Regularized discriminant analysis. Journal of the American Sta-tistical Association, 84(405), 165-175, 1989.

[Fri98] T.T. Friess, N. Cristianini, C. Campbell. The kernel adatron algorithm: a fastand simple learning procedure for support vector machines. Proceedings of 15th

International Conference on Machine Learning, Morgan Kaufman Publishers,188-196, 1998.

[Fri95] F. Frisone, F. Firenze, P. Morasso and L. Ricciardiello . Application ofTopological-Representing Networks to the Estimation of the Intrinsic Dimen-sionality of Data. Proceedings of International Conference on Artificial NeuralNetworks, ICANN95, 323-329, 1995.

[Fri94] B. Fritzke. Growing cell structures- a self organizing network for unsupervisedand supervised learning. Neural Networks, 7(9), 1441-1460, 1994.

[Fri95] B. Fritzke. A growing neural gas learns topologies. Advances in Neural Infor-mation Processing Systems 7, G. Tesauro, D.S. Touretzky and T.K. Leen (eds.),625-632, Mit Press, Cambridge(USA), 1995.

[Fri97] B. Fritzke. Some Competitive Learning Methods. Tech. Report Ruhr-UniversitatBochum, 1997.

[Fuk71] K. Fukunaga and D.R. Olsen. An Algorithm for Finding Intrinsic Dimensionalityof Data. IEEE Transactions on Computers, 20(2), 165-171, IEEE ComputerSociety, 1976.


[Fuk82] K. Fukunaga. Intrinsic Dimensionality Extraction. Classification, Pattern Recog-nition and Reduction of Dimensionality, Vol. 2 of Handbook of Statistics, P.R.Krishnaiah and L.N. Kanal (eds.), 347-360, North Holland, Amsterdam, NL:1982.

[Fuk90] K. Fukunaga. An Introduction to Statistical Pattern Recognition. AcademicPress, New York, USA: 1990.

[Gib98] M.N. Gibbs and D.J.C. MacKay. Variational Gaussian Process Classifiers. Tech.Report. Cavendish Laboratory, University of Cambridge, 1998.

[Gir93] F. Girosi, M. Jones and T. Poggio. Priors, stabilizers and basis functions: Fromregularization to radial,tensor and additive splines. A.I. Memo No. 1430, MIT,1993.

[Gir02] M. Girolami. Orthogonal Series Density Estimation and the Kernel EigenvalueProblem. Neural Computation, 14(3), 669-688, MIT Press, 2002.

[Gra83] P. Grassberger and I.Procaccia Measuring the Strangeness of Strange Attractors.Physica, D9, 189-208, 1983.

[Gra90] P. Grassberger. An optimized box-assisted algorithm for fractal dimension.Physics Letters, A148, 63-68, 1990.

[Gray84] R.M. Gray. Vector quantization. IEEE ASSP Magazine, 1, 4-29, 1984.

[Gray92] R.M. Gray. Vector Quantization and Signal Compression. Kluwer AcademicPress, 1992.

[Haa83] N. Haas and G.G. Hendrix. Learning by Being Told: Acquiring Knowledge forInformation Management. Machine Learning, J.G. Carbonell, R.S. Michalskiand T.M. Mitchell (eds.), 405-428, Tioga Publishing Company, 1983.

[Hau18] F. Hausdorff. Dimension und ausseres Mass. Math. Annalen, 157(79), 1918.

[Hau99] D. Haussler. Convolution Kernels on Discrete Structures. Tech. Report UCSC-CRL-99-10, 1999.

[Hay95] S. Haykin and X. Bo Li. Detection of Signals in Chaos. Proceedings of IEEE,83(1), 95-122, 1995.

[Hei00] B. Heisele, T. Poggio and M. Pontil. Face detection in still gray images. TechnicalReport AI Memo 1687- CBCL Paper 187, MIT, May 2000.

[Her03] R. Herbrich. Learning Kernel Classifiers Mit Press, 2003.


[Her91] J. Hertz, A. Krogh and R. Palmer. Introduction to the Theory of Neural Com-putation. Addison-Wesley, 1991.

[Hey75] A. Heyting and H. Freudenthal. Collected Works of L.E.J Brouwer. NorthHolland Elsevier, Amsterdam, NL: 1975.

[Hil04] D. Hilbert. Grundzuge einer allgemeinen Theorie der linearen Integralgleichun-gen. nachr. Gottinger Akad. Wiss. Math. Phys. Klasse, 49-91, 1904.

[Hor69] R.A. Horn. The theory of infinitely divisible matrices and kernels. Trans. Amer.Soc., 136, 269-286, 1969.

[Hou91] J. Houghton. The Bakerian Lecture 1991: the predictability of weather andclimates. Phil. Trans. R. Soc. Lond., A337, 521-572, 1991.

[Hua94] Q. Huang, J.R. Lorch and R.C. Dubes. Can the Fractal Dimension of Images beMeasured?. Pattern Recognition, 27(3), 339-349, Elsevier, 1994.

[Hub81] P.J. Huber. Robust Statistics. Wiley, 1981.

[Ish93] V. Isham. Statistical Aspects of Chaos: a Review. Networks and Chaos-Statistical and Probabilistic Aspects, O.E. Barndorff-Nielsen, J.L. Jensen andW.S. Kendall (eds.), 124-200, Chapman and Hall, 1993.

[Jai88] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[Jai99] A.K. Jain, M.N. Murty and P.J. Flynn. Data Clustering: A review. ACMComput. Surveys, 31(3), 264-323, 1999.

[Jak98] T.S. Jakkola and D. Haussler. Exploiting generative models in discriminativeclassifiers. Advances in Neural Information Processing Systems, M.S. Kearns, S.Solla and D.A. Cohn (eds.), 11, MIT Press, 1998.

[Joa99] T. Joachim. Making Large-Scale SVM Learning Practical. Advances in KernelMethods, B. Scholkopf, C. Burges, A. Smola (eds.), 169-184, MIT Press, 1999.

[Jol86] I.T. Jolliffe. Principal Component Analisys. Springer-Verlag, New York, USA:1986.

[Kan97] H. Kanz and T. Schreiber. Nonlinear Time Series Analisys. Cambridge Univer-sity Press, Cambridge, UK: 1997.

[Kap95] D. Kaplan and L. Glass. Understanding Nonlinear Dynamics. Springer-Verlag,New York, USA: 1995.


[Kar94] J. Karhunen and J. Joutsensalo. Representations and Separation of Signals UsingNonlinear PCA Type Learning. Neural Networks, 7(1), 113-127, Pergamon Press,1994.

[Kee99] S. Keerthi, S. Shevde, C. Bhattacharyya, K. Murthy. Improvements to Platt’sSMO algorithm for SVM classifier design. Technical Report, Deaprtment of CSA,Bangalore, India, 1999.

[Kee00] S. Keerthi, S. Shevde, C. Bhattacharyya, K. Murthy. A fast iterative nearestpoint algorithm for support vector machine design. IEEE Transaction on NeuralNetworks, 11, 124-136, IEEE Computer Society, 2000.

[Keg03] B. Kegl. Intrinsic dimension estimation using packing numbers. in Advances inNeural Information and Processing 15, 7(1), MIT Press, 2003.

[Kir90] M. Kirby and L. Sirovich Application of the Karhunen-Loeve procedure for thecharacterization of human faces. IEEE Transaction on Pattern Analysis andMachine Intelligence, 12(1), 103-108, IEEE Computer Society, 1990.

[Kir01] M. Kirby. Geometric Data Analysis: An Empirical Approach to DimensionalityReduction and the Study of Patterns. John Wiley and Sons, New York, USA:2001.

[Koh82] T. Kohonen. Self-Organized formation of topologically correct feature maps.Biological Cybernetics, 43, 59-69, 1982.

[Koh97] T. Kohonen. Self-Organizing Map. Springer-Verlag, New York, USA: 1997.

[Kol41] A.N. Kolmogorov. Stationary sequences in Hilbert space. Bull. Math. Univ.Moscow, 2, 1-40.

[Kor68] G.A. Korn and T.M. Korn. Mathematical Handbook for Scientists and Engineers.Second edition. McGraw-Hill, New York, USA: 1968.

[Kru64] J.B. Kruskal. Multidimensional Scaling by Optimizing Goodness of Fit to aNonmetric Hypothesis. Psychometrika, 29, 1-27, 1964.

[Kru69] J.B. Kruskal and J.D. Carroll. Geometrical models and badness-of-fit functions.Multidimensional Analisys, Vol. 2, Krishnaiah (ed.), 639-671, Academic Press,1969.

[Kru71] J.B. Kruskal. A nonlinear mapping for data structure analysis. IEEE Transactionon Computers, C-20, 1614-1614, 1971.


[Kru72] J.B. Kruskal. Linear transformation of multivariate data to reveal clustering.Multidimensional Scaling, Vol. I, Theory, Romney, Shepard, Nerlove (eds.), 101-115, Academic Press, 1972.

[Kuh51] H.W. Kuhn and A.W. Tucker. Nonlinear programming. Proceedings of 2nd Berke-ley Symposium on Mathematical Statistics and Probabilistics, 481-492, Universityof California Press, Berkeley, 1951.

[Lin80] Y. Linde, A. Buzo and R. Gray. Least Square quantization in PCM. IEEETransaction on Information Theory, 28(2), 129-137, IEEE Computer Society,1982.

[Llo82] S.P. Lloyd. An algorithm for vector quantizer design. IEEE Transaction onCommunications, 28(1), 84-95, IEEE Computer Society, 1980.

[Lor63] E.N. Lorenz. Deterministic non-periodic flow. Journal of Atmospheric Science,20, 130-141, 1963.

[Lue84] D. Lueberger. Linear and Nonlinear Programming. Addison-Wesley, Reading,MA, 1984.

[Mac92] D.J.C. MacKay. A Practical Bayesian Framework for backpropagation Networks.Neural Computation, 4(3), 448-472, 1992.

[Mac67] J. Mac Queen. Some methods for classifications and analysis of multivariateobservations. Proceedings of the Fifth Berkeley Symposium on Mathematicalstatistics and probability, vol. I, 281-297, Berkeley, University of California Press,1967.

[Mad90] W.R. Madych and S.A. Nelson. Multivariate interpolation and conditionallypositive definite functions. Mathematics of Computation, 54(189) 211-230, 1990.

[Mal98] E.C. Malthouse. Limitations of nonlinear PCA as performed with generic neuralnetworks. IEEE Transaction on Neural Networks, 9(1), 165-173, IEEE ComputerSociety, 1998.

[Man77] B. Mandelbrot. Fractals: Form, Chance and Dimension. Freeman, 1977.

[Man81] R. Mane. On the dimension of compact invariant sets of certain nonlinear maps.Dynamical Systems and Turbolence, Proceedings Warwick 1980, D. Rand andL.S. Young (eds.), 230-242, Lecture Notes in Mathematics 898, Springer-Verlag,1981.

[Man65] O.L. Mangasarian. Linear and non-linear separation of patterns by linear pro-gramming. Oper. Res., 13, 444-452, June 1965.


[Man90a] O.L. Mangasarian and W.H. Wolberg. Cancer diagnosis via linear programming.SIAM News, 23(5), 1-18, September 1990.

[Man90b] O.L. Mangasarian, R. Setiono and W.H. Wolberg. Pattern recognition via lin-ear programming: Theory and application to medical diagnosis. Large-scalenumerical optimization, Thomas F. Coleman and Yuying Li (eds.), 22-30, SIAMPublications, Philadelphia 1990.

[Man00] O.L. Mangasarian, D. Musicant. Lagrangian Support Vector Regression. Datamining Institute Technical Report 00-06, June 2000.

[Mar91] T.E. Martinetz and K.J. Schulten. A ”neural gas” network learns topologies.T.Kohonen, K. Makisara, O. Simula, J. Kangas (eds.), Artificial Neural Net-works, 397-402, North-Holland, 1991.

[Mar93] T.E. Martinetz and K.J. Schulten. Neural-gas network for vector quantizationand its application to time-series prediction. IEEE Transaction on Neural Net-works, 4(4), 558-569, IEEE Computer Society, 1993.

[Mar94] T.E. Martinetz and K.J. Schulten. Topology Representing Networks. NeuralNetworks, 3, 507-522, Pergamon Press, 1994.

[Mat63] G.Matheron. Principles of geostatistics. Economic Geology, 58, 1246-1266, 1963.

[Mcu43] W.S. McCulloch and W. Pitts. A Logical Calculus of Ideas Immanent in NervousActivity. Bulletin of Mathemathical Biophysics, 5, 115-133, 1943

[Mei97] L.V. Meisel and M.A. Johnson. Convergence of Numerical Box-Counting andCorrelation Integral Multifractal Analysis Techniques. Pattern Recognition,30(9), 1565-1570, Elsevier, 1997.

[Mer09] J. Mercer. Functions of positive and negative type and their connection with thetheory of integral equations. Philos. Trans. Royal Soc., A 209, 415-446, 1909.

[Mic86] C.A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2, 11-22, 1986.

[Mik99a] S. Mika, G. Ratsch, J. Weston, B. Scholkopf and K.R. Muller. Fisher Discriminant Analysis with Kernels. Proceedings of IEEE Neural Networks for Signal Processing Workshop, 41-48, 1999.

[Mik99b] S. Mika, B. Scholkopf, A. Smola, K.R. Muller, M. Scholz and G. Ratsch. Kernel PCA for feature extraction and denoising in feature space. In M.S. Kearns, S.A. Solla, D.A. Cohn (eds.), Advances in Neural Information Processing Systems, vol. 11, 536-542, MIT Press, Cambridge, 1999.

[Min69] M.L. Minsky and S.A. Papert. Perceptrons. MIT Press, 1969.

[Mos83] D.J. Mostow. Machine Transformation of Advice into a Heuristic Search Procedure. Machine Learning, J.G. Carbonell, R.S. Michalski and T.M. Mitchell (eds.), 367-404, Tioga Publishing Company, 1983.

[Nea96] R. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, New York, 1996.

[Omo90] S.M. Omohundro. The Delaunay triangulation and function learning. Tech. Report 90-001, International Computer Science Institute, Berkeley, 1990.

[Osu97] E. Osuna, R. Freund and F. Girosi. An improved training algorithm for support vector machines. Neural Networks for Signal Processing VII, Proceedings of the 1997 IEEE Workshop, J. Principe, L. Giles, N. Morgan and E. Wilson (eds.), 276-285, New York, IEEE Press, 1997.

[Osu99] E. Osuna and F. Girosi. Reducing the Run-time Complexity in Support Vector Machines. Advances in Kernel Methods, B. Scholkopf, C. Burges, A. Smola (eds.), 271-284, MIT Press, 1999.

[Ott93] E. Ott. Chaos in Dynamical Systems. Cambridge University Press, Cambridge, UK, 1993.

[Pac80] N. Packard, J. Crutchfield, J. Farmer and R. Shaw. Geometry from a time series. Physical Review Letters, 45(1), 712-716, 1980.

[Pal97] N.R. Pal, K. Pal and J.C. Bezdek. A mixed c-means clustering model. Proc. of IEEE Int. Conf. on Fuzzy Syst., 11-21, 1997.

[Pet79] K. Pettis, T. Bailey, A. Jain and R. Dubes. An Intrinsic Dimensionality Estimator From Near-Neighbor Information. IEEE Transaction on Pattern Analysis and Machine Intelligence, 1(1), 25-37, IEEE Computer Society, 1979.

[Pla99] J.C. Platt. Fast Training of Support Vector Machines using Sequential Minimal Optimization. Advances in Kernel Methods, B. Scholkopf, C. Burges, A. Smola (eds.), 185-208, MIT Press, 1999.

[Pre90] F.P. Preparata and M.I. Shamos. Computational geometry. Springer, New York, 1990.

[Pre89] W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge, UK, 1989.

[Red84] R. Redner and H. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2), 1984.

[Rit91] H.J. Ritter, T.M. Martinetz and K.J. Schulten. Neuronale Netze. Addison-Wesley, Munchen, 1991.

[Rom99] S. Romdhani, S. Gong and A. Psarrou. A multi-view nonlinear active shape model using kernel PCA. Proceedings of the 10th British Machine Vision Conference (BMVC99), T. Pridmore, D. Elliman (eds.), 483-492, BMVA Press, 1999.

[Rom72a] A.K. Romney, R.N. Shepard and S.B. Nerlove. Multidimensional Scaling, vol. I, Theory. Seminar Press, 1972.

[Rom72b] A.K. Romney, R.N. Shepard and S.B. Nerlove. Multidimensional Scaling, vol. II, Applications. Seminar Press, 1972.

[Ros62] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan, 1962.

[Ros01a] R. Rosipal, M. Girolami, L. Trejo and A. Cichocki. Kernel PCA for Feature Extraction and De-Noising in Non-Linear Regression. Neural Computing and Applications, 10(3), 231-243, 2001.

[Ros01b] R. Rosipal and M. Girolami. An Expectation Maximisation Approach to Nonlinear Component Analysis. Neural Computation, 13, 505-510, MIT Press, 2001.

[Row00] S. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323-2326, 2000.

[Rud64] W. Rudin. Real and Complex Analysis. McGraw Hill, 1966.

[Rum86a] D.E. Rumelhart, G.E. Hinton and R.J. Williams. Learning representations by Back-propagating Errors. Nature, 323, 533-536, 1986.

[Rum86b] D.E. Rumelhart, G.E. Hinton and R.J. Williams. Learning Internal Representations by Error Propagation. Parallel Distributed Processing, vol. 1, Chapter 8, MIT Press, 1986.

[Ryc83] M.D. Rychener. An Overview of Machine Learning. Machine Learning, J.G. Carbonell, R.S. Michalski and T.M. Mitchell (eds.), 429-459, Tioga Publishing Company, 1983.

[Sam69] J.W. Sammon Jr. A nonlinear mapping for data structure analysis. IEEE Transaction on Computers, C-18, 401-409, IEEE Computer Society, 1969.

[Sca90] J. Scargle. Studies in Astronomical Time Series Analysis. IV. Modeling chaotic and random processes with linear filters. Astrophysical Journal, 359, 469-482, 1990.

[Sch89] J.A. Scheinkman and B. Le Baron. Nonlinear Dynamics and Stock Returns. Journal of Business, 62, 311-337, 1989.

[Sch38a] I.J. Schoenberg. Metric spaces and completely monotone functions. Ann. of Math., 39, 811-841, 1938.

[Sch38b] I.J. Schoenberg. Metric spaces and positive definite functions. Trans. Amer. Math. Soc., 44, 522-536, 1938.

[Sch42] I.J. Schoenberg. Positive definite functions on spheres. Duke Math. J., 9, 96-108, 1942.

[Sch98a] B. Scholkopf, P. Bartlett, A.J. Smola and R. Williamson. Support Vector Regression with automatic accuracy control. L. Niklasson, M. Boden, T. Ziemke (eds.), Proceedings of ICANN98, Perspectives in Neural Computing, Springer-Verlag, 1998.

[Sch98b] B. Scholkopf, A.J. Smola and K.R. Muller. Nonlinear Component Analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319, 1998.

[Sch00a] B. Scholkopf, R.C. Williamson, A.J. Smola, J. Shawe-Taylor and J. Platt. Support vector method for novelty detection. Advances in Neural Information Processing Systems 12: Proceedings of the 1999 Conference, Sara A. Solla, Todd K. Leen and Klaus-Robert Muller (eds.), 2000.

[Sch00b] B. Scholkopf. A kernel trick for distances. Microsoft Research Report, 2000.

[Sch11] I. Schur. Bemerkungen zur Theorie der beschrankten Bilinearformen mit unendlich vielen Veranderlichen. J. Reine Angew. Math., 140, 1-29, 1911.

[She62] R.N. Shepard. The analysis of proximities: Multidimensional scaling with an unknown distance function. Psychometrika, 27, 125-140, 1962.

[She64] R.N. Shepard and J.D. Carroll. Parametric representation of nonlinear data structures. Multivariate Analysis, P.R. Krishnaiah (ed.), 561-592, Academic Press, 1969.

[She74] R.N. Shepard. Representation of structure in similarity data: problems and prospects. Psychometrika, 39, 373-421, 1974.

[Smi88] L.A. Smith. Intrinsic Limits on Dimension Calculations. Physics Letters, A123, 283-288, 1988.

[Smi92] R.L. Smith. Optimal Estimation of Fractal Dimension. Nonlinear Modeling and Forecasting, SFI Studies in the Sciences of Complexity vol. XII, M. Casdagli and S. Eubank (eds.), 115-135, Addison-Wesley, 1992.

[Som03] P. Somervuo. Speech Dimensionality Analysis on Hypercubical Self-Organizing Maps. Neural Processing Letters, 17(2), 125-136, 2003.

[Sut98] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[Tak81] F. Takens. Detecting strange attractors in turbulence. Dynamical Systems and Turbulence, Proceedings Warwick 1980, D. Rand and L.S. Young (eds.), 366-381, Lecture Notes in Mathematics 898, Springer-Verlag, 1981.

[Tak84] F. Takens. On the Numerical Determination of the Dimension of an Attractor. Dynamical Systems and Bifurcations, Proceedings Groningen 1984, B. Braaksma, H. Broer and F. Takens (eds.), 115-135, Lecture Notes in Mathematics 1125, Springer-Verlag, 1984.

[Tax99] D.M.J. Tax and R.P.W. Duin. Support Vector domain description. Pattern Recognition Letters, 20, 1191-1199, Elsevier Science Publishers, 1999.

[Ten00] J.B. Tenenbaum, V. de Silva and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319-2323, 2000.

[The88] J. Theiler. Lacunarity in a Best Estimator of Fractal Dimension. Physics Letters, A133, 195-200, 1988.

[The90] J. Theiler. Statistical Precision of Dimension Estimators. Physical Review, A41, 3038-3051, 1990.

[The92] J. Theiler, S. Eubank, A. Longtin, B. Galdrikian and J.D. Farmer. Testing for nonlinearity in time series: the method of surrogate data. Physica, D58, 77-94, 1992.

[Tir00] W.S. Tirsch, M. Keidel, S. Perz, H. Scherb and G. Sommer. Inverse Covariation of Spectral Density and Correlation Dimension in Cyclic EEG Dimension of the Human Brain. Biological Cybernetics, 82, 1-14, 2000.

[Tol03] C.R. Tolle, T.R. McJunkin and D.J. Gorisch. Suboptimal Minimum Cluster Volume Cover-Based Method for Measuring Fractal Dimension. IEEE Transaction on Pattern Analysis and Machine Intelligence, 25(1), 32-41, IEEE Computer Society, 2003.

[Ton90] H. Tong. Nonlinear Time Series. Oxford University Press, Oxford, UK, 1990.

[Tru76] G.V. Trunk. Statistical Estimation of the Intrinsic Dimensionality of a Noisy Signal Collection. IEEE Transaction on Computers, 25, 165-171, IEEE Computer Society, 1976.

[Twi03] C.J. Twining and C.J. Taylor. The use of kernel principal component analysis to model data distributions. Pattern Recognition, 36(1), 217-227, 2003.

[Vap63] V.N. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.

[Vap64] V.N. Vapnik and A.Ya. Chervonenkis. A note on one class of perceptrons. Automation and Remote Control, 25, 1964.

[Vap71] V.N. Vapnik and A.Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16, 264-280, 1971.

[Vap79] V.N. Vapnik. Estimation of Dependencies Based on Empirical Data. Nauka, 1979.

[Vap95] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

[Vap98] V.N. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.

[Ver95] P.J. Verveer and R. Duin. An Evaluation of Intrinsic Dimensionality Estimators. IEEE Transaction on Pattern Analysis and Machine Intelligence, 17(1), 81-86, IEEE Computer Society, 1995.

[Wes98] J. Weston, A. Gammerman, M. Stitson, V. Vapnik, V. Vovk and C. Watkins. Support vector density estimation. Advances in Kernel Methods, B. Scholkopf, C. Burges, A. Smola (eds.), 293-306, MIT Press, 1999.

[Wes99] J. Weston and C. Watkins. Multi-Class Support Vector Machines. Proceedings of ESANN99, M. Verleysen (ed.), 219-224, D. Facto Press, Brussels, 1999.

[Whi36] H. Whitney. Differentiable manifolds. Annals of Math., 37, 645-680, 1936.

[Wil97] C.K.I. Williams and C. Rasmussen. Gaussian Processes for regression. Advances in Neural Information Processing Systems 8, D.S. Touretzky, M.C. Mozer and M.E. Hasselmo (eds.), MIT Press, 1996.

[Wil98] C.K.I. Williams and D. Barber. Bayesian Classification with Gaussian Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1342-1351, 1998.

[Wil02] C.K.I. Williams. On a Connection between KPCA and Metric Multidimensional Scaling. Advances in Neural Information Processing Systems 13, T.K. Leen, T.G. Dietterich, V. Tresp (eds.), MIT Press, 2001.

[Wil76] D.J. Willshaw and C. von der Malsburg. How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society London, B194, 431-445, 1976.

[Wol90] W.H. Wolberg and O. Mangasarian. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, U.S.A., 87, 9193-9196, December 1990.

[Wu83] C.F.J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1), 95-103, 1983.

[Yu02] K. Yu, L. Ji and X. Zhang. Kernel Nearest-Neighbor Algorithm. Neural Processing Letters, 15(2), 147-156, 2002.

[Zha03] D. Zhang and S. Chen. Clustering Incomplete Data using Kernel-Based Fuzzy C-Means Algorithm. Neural Processing Letters, 18(3), 155-162, 2003.
