DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering Maastricht University
Page 1: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

DATA MINING from data to information

Ronald WestraDep. MathematicsKnowledge EngineeringMaastricht University

Page 2: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

PART 1

Introduction

Page 3: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

 

All information on the math part of the course is available at:

http://www.math.unimaas.nl/personal/ronaldw/DAM/DataMiningPage.htm

Page 4: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 5: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 6: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Data mining - a definition

 

"Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data

in order to discover meaningful patterns and results."  

(Berry & Linoff, 1997, 2000)

Page 7: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

DATA MINING 

Course Description:

In this course the student will be made familiar with the main topics in Data Mining and its important role in current Computer Science. We will mainly focus on algorithms, methods, and techniques for the representation and analysis of data and information.

Page 8: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

DATA MINING 

Course Objectives:

To get a broad understanding of data mining and knowledge discovery in databases.

To understand major research issues and techniques in this new area and conduct research.

To be able to apply data mining tools to practical problems.

Page 9: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 1: Introduction

1. Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996), From Data Mining to Knowledge Discovery in Databases: http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf

2. Hand, D., Mannila, H., Smyth, P. (2001), Principles of Data Mining, MIT Press, Cambridge, MA, USA. MORE INFORMATION ON: ELEUM and: http://www.math.unimaas.nl/personal/ronaldw/DAM/DataMiningPage.htm

Page 10: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Hand, D., Mannila, H., Smyth, P. (2001),

Principles of Data Mining,

MIT Press, Cambridge, MA, USA

+ MORE INFORMATION ON: ELEUM or DAM-website

Page 11: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 1: Introduction

What is Data Mining?

• data → information → knowledge

• patterns → structures → models

The use of Data Mining

• increasingly large databases, up to terabytes (TB)

• N datapoints and K components (fields) per datapoint

• not accessible for fast inspection

• incomplete, noise, wrong design

• different numerical formats, alphanumerical and semantic fields

• necessity to automate the analysis

Page 12: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 1: Introduction

Applications

• astronomical databases

• marketing/investment

• telecommunication

• industrial

• biomedical/genetics

Page 13: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 1: Introduction

Historical Context

• in mathematical statistics, 'data mining' has a negative connotation:

• danger of overfitting and erroneous generalisation

Page 14: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 1: Introduction

Data Mining Subdisciplines

• Databases

• Statistics

• Knowledge Based Systems

• High-performance computing

• Data visualization

• Pattern recognition

• Machine learning

Page 15: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 1: Introduction

Data Mining methods

• Clustering

• classification (off- & on-line)

• (auto)-regression

• visualisation techniques: optimal projections and PCA (principal component analysis)

• discriminant analysis

• decomposition

• parametric modelling

• non-parametric modelling

Page 16: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 1: Introduction

Data Mining essentials

• model representation

• model evaluation

• search/optimisation

Data Mining algorithms

• Decision trees/Rules

• Nonlinear Regression and Classification

• Example-based methods

• AI-tools: NN, GA, ...

Page 17: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 1: Introduction

Data Mining and Mathematical Statistics

• when Statistics and when DM?

• is DM a sort of Mathematical Statistics?

Data Mining and AI

• AI is instrumental in finding knowledge in large chunks of data

Page 18: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Mathematical Principles in Data Mining

Part I: Exploring Data Space

* Understanding and Visualizing Data Space

Provide tools to understand the basic structure in databases. This is done by probing and analysing metric structure in data-space, comprehensively visualizing data, and analysing global data structure by e.g. Principal Components Analysis and Multidimensional Scaling.

* Data Analysis and Uncertainty

Show the fundamental role of uncertainty in Data Mining. Understand the difference between uncertainty originating from statistical variation in the sensing process and uncertainty originating from imprecision in the semantic modelling. Provide frameworks and tools for modelling uncertainty, especially the frequentist and subjective/conditional frameworks.

Page 19: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Mathematical Principles in Data Mining

PART II: Finding Structure in Data Space

* Data Mining Algorithms & Scoring Functions

Provide a measure for fitting models and patterns to data. This enables the selection between competing models. Data Mining Algorithms are discussed in the parallel course.

* Searching for Models and Patterns in Data Space

Describe the computational methods used for model and pattern fitting in data mining algorithms. Most emphasis is on search and optimisation methods, which are required to find the best fit between the model or pattern and the data. Special attention is devoted to parameter estimation under missing data using the maximum-likelihood EM algorithm.

Page 20: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Mathematical Principles in Data Mining

PART III: Mathematical Modelling of Data Space

* Descriptive Models for Data Space

Present descriptive models in the context of Data Mining. Describe specific techniques and algorithms for fitting descriptive models to data. Main emphasis here is on probabilistic models.

* Clustering in Data Space

Discuss the role of data clustering within Data Mining. Show how clustering relates to classification and search. Present a variety of paradigms for clustering data.

Page 21: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

EXAMPLES

* Astronomical Databases

* Phylogenetic trees from DNA-analysis

Page 22: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Example 1: Phylogenetic Trees

The last decade has witnessed a major and historical leap in biology and all related disciplines. The date of this event can be set almost exactly to November 1999, when the Human Genome Project (HGP) was declared completed. The HGP resulted in (almost) the entire human genome, consisting of about 3.3 × 10^9 base pairs (bp) of code, constituting all of the approximately 35,000 human genes. Since then the genomes of many more animal and plant species have become available. For our purposes, we can consider the human genome as a huge database, consisting of a single string of 3.3 × 10^9 characters from the set {C,G,A,T}.

Page 23: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Example 1: Phylogenetic Trees

This data constitutes the human 'source code'. From this data – in principle – all 'hardware' characteristics, such as physiological and psychological features, can be deduced. In this block we will concentrate on another aspect that is hidden in this information: phylogenetic relations between species. The famous evolutionary biologist Dobzhansky once remarked: 'Everything makes sense in the light of evolution, nothing makes sense without the light of evolution'. This most certainly applies to the genome. Hidden in the data is the evolutionary history of the species. By systematically comparing several species with various degrees of relatedness, we can reconstruct this evolutionary history. For instance, consider a species that lived at a certain time in Earth's history. It will be marked by a set of genes, each with a specific code (or rather, a statistical variation around the average).

Page 24: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Example 1: Phylogenetic Trees

If this species is for some reason distributed over a variety of non-connected areas (e.g. islands, oases, mountainous regions), animals of the species will not be able to mate at random. In the course of time, due to the accumulation of random mutations, the genomes of the separated groups will increasingly differ. This will result in the origin of sub-species, and eventually new species. Comparing the genomes of the new species will shed light on the evolutionary history, in that we can: draw a phylogenetic tree of the sub-species leading back to the 'founder' species; given the rate of mutation, estimate how long ago the founder species lived; and reconstruct the most probable genome of the founder species.

Page 25: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 26: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 27: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 28: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 29: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 30: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 31: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Example 2: data mining in astronomy

Page 32: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Example 2: data mining in astronomy

Page 33: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Example 2: data mining in astronomy

Page 34: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 35: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 36: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

DATA AS SETS OF MEASUREMENTS AND OBSERVATIONS

Data Mining Lecture II [Chapter 2 from Principles of Data Mining by Hand, Mannila, Smyth]

Page 37: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 2: DATA AS SETS OF MEASUREMENTS AND OBSERVATIONS

Readings:

• Chapter 2 from Principles of Data Mining by Hand, Mannila, Smyth.

Page 38: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.1 Types of Data

2.2 Sampling

1. (re)sampling

2. oversampling/undersampling, sampling artefacts

3. Bootstrap and Jack-Knife methods

2.3 Measures for Similarity and Difference

1. Phenomenological

2. Dissimilarity coefficient

3. Metric in Data Space based on distance measure

Page 39: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Types of data

Sampling :

– the process of collecting new (empirical) data

Resampling :

– selecting data from a larger already existing collection

Page 40: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Sampling

–Oversampling

–Undersampling

–Sampling artefacts (aliasing, Nyquist frequency)

Page 41: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Sampling artefacts (aliasing, Nyquist frequency)

Moiré fringes

Page 42: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Resampling

Resampling is any of a variety of methods for doing one of the following:

– Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (= jackknife) or drawing randomly with replacement from a set of data points (= bootstrapping)

– Exchanging labels on data points when performing significance tests (permutation test, also called exact test, randomization test, or re-randomization test)

– Validating models by using random subsets (bootstrap, cross validation)

Page 43: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Bootstrap & Jack-Knife methods

These methods use inferential statistics to account for randomness and uncertainty in the observations. These inferences may take the form of answers to essentially yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modelling of relationships (regression).

Page 44: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Bootstrap method

Bootstrapping is a method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample.

"Bootstrap" means that resampling one available sample gives rise to many others, reminiscent of pulling yourself up by your bootstraps.

cross-validation: verify replicability of results

Jackknife: detect outliers

Bootstrap: inferential statistics
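As an illustration of these resampling ideas, here is a minimal Python sketch (not part of the original slides; it assumes NumPy and a hypothetical observed sample) that estimates the standard error of the median by the bootstrap and by the jackknife:

    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=5.0, scale=2.0, size=100)   # hypothetical observed sample

    def bootstrap_se(data, statistic=np.median, n_boot=2000):
        """Standard error of a statistic via resampling with replacement."""
        stats = np.empty(n_boot)
        for b in range(n_boot):
            resample = rng.choice(data, size=len(data), replace=True)
            stats[b] = statistic(resample)
        return stats.std(ddof=1)

    def jackknife_se(data, statistic=np.median):
        """Standard error via leaving out one observation at a time."""
        n = len(data)
        stats = np.array([statistic(np.delete(data, i)) for i in range(n)])
        return np.sqrt((n - 1) / n * np.sum((stats - stats.mean()) ** 2))

    print("bootstrap SE of the median:", bootstrap_se(sample))
    print("jackknife SE of the median:", jackknife_se(sample))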

Page 45: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.3 Measures for Similarity and Dissimilarity

1. Phenomenological

2. Dissimilarity coefficient

3. Metric in Data Space based on distance measure

Page 46: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.4 Distance Measure and Metric

1. Euclidean distance

2. Metric

3. Commensurability

4. Normalisation

5. Weighted Distances

6. Sample covariance

7. Sample covariance correlation coefficient

8. Mahalanobis distance

9. Normalised distance and Cluster Separation (see supplementary text)

10. Generalised Minkowski

Page 47: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.4 Distance Measure and Metric

1. Euclidean distance

Page 48: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.4 Distance Measure and Metric

2. Generalized p-norm

Page 49: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Generalized Norm / Metric

Page 50: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Minkowski Metric

Page 51: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Minkowski Metric

Page 52: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Generalized Minkowski Metric

A structure is already present in the data space.

The structure is represented by the correlations and is given by the covariance matrix G.

The generalized Minkowski norm of a vector x is:

$\|x\|_G^2 = x^T G x$
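A minimal Python sketch of these distance notions (the functions, example vectors, and structure matrix G below are illustrative assumptions, not from the slides):

    import numpy as np

    def minkowski_distance(x, y, p=2):
        """Minkowski p-norm distance between x and y (p=2 is the Euclidean distance)."""
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    def generalized_norm(v, G):
        """Generalized Minkowski norm ||v||_G = sqrt(v^T G v) induced by a structure matrix G."""
        return float(np.sqrt(v @ G @ v))

    x = np.array([1.0, 2.0])
    y = np.array([4.0, 6.0])
    G = np.array([[2.0, 0.5],
                  [0.5, 1.0]])              # hypothetical covariance-like structure matrix

    print(minkowski_distance(x, y, p=1))    # Manhattan distance: 7.0
    print(minkowski_distance(x, y, p=2))    # Euclidean distance: 5.0
    print(generalized_norm(x - y, G))       # norm of the difference induced by G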

Page 53: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.4 Distance Measure and Metric

1. Euclidean distance

2. Metric

3. Commensurability

4. Normalisation

5. Weighted Distances

6. Sample covariance

7. Sample covariance correlation coefficient

8. Mahalanobis distance

9. Normalised distance and Cluster Separation (see supplementary text)

10. Generalised Minkowski

Page 54: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.4 Distance Measure and Metric

Mahalanobis distance

Page 55: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.4 Distance Measure and Metric

8. Mahalanobis distance

The Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936.

It is based on correlations between variables by which different patterns can be identified and analysed. It is a useful way of determining similarity of an unknown sample set to a known one.

It differs from Euclidean distance in that it takes into account the correlations of the data set.

Page 56: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.4 Distance Measure and Metric

8. Mahalanobis distance

The Mahalanobis distance of a multivariate vector $x = (x_1, x_2, \ldots, x_p)^T$ from a group of values with mean $\mu = (\mu_1, \mu_2, \ldots, \mu_p)^T$ and covariance matrix $\Sigma$ is defined as:

$D_M(x) = \sqrt{(x - \mu)^T \, \Sigma^{-1} \, (x - \mu)}$

Page 57: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.4 Distance Measure and Metric

8. Mahalanobis distance

The Mahalanobis distance can also be defined as a dissimilarity measure between two random vectors $\vec{x}$ and $\vec{y}$ of the same distribution with covariance matrix $\Sigma$:

$d(\vec{x}, \vec{y}) = \sqrt{(\vec{x} - \vec{y})^T \, \Sigma^{-1} \, (\vec{x} - \vec{y})}$

Page 58: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.4 Distance Measure and Metric

8. Mahalanobis distance

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, the resulting measure is called the normalized Euclidean distance:

$d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{p} \frac{(x_i - y_i)^2}{\sigma_i^2}}$

where $\sigma_i$ is the standard deviation of $x_i$ over the sample set.
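An illustrative Python sketch (hypothetical correlated sample; NumPy assumed, not part of the slides) comparing the Euclidean and Mahalanobis distances of a point from a distribution:

    import numpy as np

    rng = np.random.default_rng(1)
    # hypothetical correlated 2-D sample
    data = rng.multivariate_normal(mean=[0.0, 0.0],
                                   cov=[[4.0, 1.5],
                                        [1.5, 1.0]],
                                   size=500)

    mu = data.mean(axis=0)
    sigma = np.cov(data, rowvar=False)        # sample covariance matrix Σ
    sigma_inv = np.linalg.inv(sigma)

    def mahalanobis(x, mu, sigma_inv):
        """Mahalanobis distance of x from a distribution with mean mu and covariance Σ."""
        d = x - mu
        return float(np.sqrt(d @ sigma_inv @ d))

    x = np.array([2.0, 2.0])
    print("Euclidean  :", np.linalg.norm(x - mu))
    print("Mahalanobis:", mahalanobis(x, mu, sigma_inv))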

Page 59: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.4 Distance measures and Metric

8. Mahalanobis distance

Page 60: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.4 Distance measures and Metric

8. Mahalanobis distance

Page 61: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.4 Distance measures and Metric

8. Mahalanobis distance

Page 62: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

2.5 Distortions in Data Sets

1. outliers

2. Variance

3. sampling effects

2.6 Pre-processing data with mathematical transformations

2.7 Data Quality

• Data quality of individual measurements [GIGO]

• Data quality of Data collections

Page 63: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 64: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Part II. Exploratory Data Analysis

Page 65: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

VISUALISING AND EXPLORING DATA-SPACE

Data Mining Lecture III [Chapter 3 from Principles of Data Mining by Hand, Mannila, Smyth]

Page 66: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 3: Visualising and Exploring Data-Space

Readings:

• Chapter 3 from Principles of Data Mining by Hand, Mannila, Smyth.

3.1 Obtain insight into the structure of Data Space

1. distribution over the space

2. Are there separate and disconnected parts?

3. is there a model?

4. data-driven hypothesis testing

5. Starting point: use strong perceptual powers of humans

Page 67: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 3: Visualising and Exploring Data-Space

3.2 Tools to represent a variable

1. mean, variance, standard deviation, skewness

2. plot

3. moving-average plot

4. histogram, kernel
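As an illustration (not part of the original slides), a minimal Python sketch computing these summaries for a hypothetical variable, assuming NumPy and SciPy:

    import numpy as np
    from scipy import stats

    x = np.random.default_rng(5).gamma(shape=2.0, scale=3.0, size=1000)  # hypothetical variable

    # summary statistics
    print("mean     :", x.mean())
    print("variance :", x.var(ddof=1))
    print("std. dev.:", x.std(ddof=1))
    print("skewness :", stats.skew(x))

    # histogram (counts per bin) and a Gaussian kernel density estimate
    counts, edges = np.histogram(x, bins=20)
    kde = stats.gaussian_kde(x)
    grid = np.linspace(x.min(), x.max(), 200)
    density = kde(grid)          # smooth, kernel-based estimate of the distribution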

Page 68: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

histogram

Page 69: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 3: Visualising and Exploring Data-Space

3.3 Tools for representing two variables

1. scatter plot

2. moving-average plots

Page 70: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

scatter plot

Page 71: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

scatter plots

Page 72: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

LECTURE 3: Visualising and Exploring Data-Space

3.4 Tools for representing multiple variables

1. all or selection of scatter plots

2. idem moving-average plots

3. 'trellis' or other parameterised plots

4. icons: star icons, Chernoff’s faces

Page 73: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Chernoff’s faces

Page 74: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Chernoff’s faces

Page 75: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Chernoff’s faces

Page 76: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

3.6 MDS: Multidimensional Scaling

DIMENSION REDUCTION

Page 77: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

1. With sub-scatter plots we already noticed that the best projections were the ones that resulted in the optimal spreading of the set of points, i.e. the direction of the largest variance. This idea is now worked out.

Page 78: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

2. Underlying idea: suppose you have a high-dimensional, normally-distributed data set. This will take the shape of a high-dimensional ellipsoid.

An ellipsoid is structured from its centre by orthogonal vectors with different radii. The largest radii have the strongest influence on the shape of the ellipsoid. The ellipsoid is described by the covariance matrix of the set of data points. The axes are defined by the orthogonal eigenvectors (from the centre – the centroid – of the set); the radii are defined by the associated eigenvalues.

So determine the eigenvalues and order them in decreasing size: $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \ldots \geq \lambda_N$.

The first $n$ ordered eigenvectors thus 'explain' the following fraction of the data: $\sum_{i=1}^{n} \lambda_i \,\big/\, \sum_{i=1}^{N} \lambda_i$.
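A minimal Python sketch of this procedure (hypothetical data; NumPy assumed, not part of the original slides): compute the covariance matrix, order its eigenvalues, and read off the 'explained' fraction:

    import numpy as np

    rng = np.random.default_rng(2)
    # hypothetical data: n points with p = 3 components, correlated
    X = rng.multivariate_normal([0.0, 0.0, 0.0],
                                [[3.0, 1.0, 0.5],
                                 [1.0, 2.0, 0.3],
                                 [0.5, 0.3, 1.0]], size=200)

    Xc = X - X.mean(axis=0)                  # centre the data on the centroid
    S = np.cov(Xc, rowvar=False)             # covariance matrix of the data points

    eigvals, eigvecs = np.linalg.eigh(S)     # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]        # order eigenvalues in decreasing size
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    explained = np.cumsum(eigvals) / eigvals.sum()
    print("fraction of data 'explained' by the first n eigenvectors:", explained)

    Z = Xc @ eigvecs[:, :2]                  # project onto the first two principal axes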

Page 79: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

Page 80: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

Page 81: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

(figure: a cloud of data points with its MEAN indicated)

Page 82: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

(figure: the same cloud of data points with its MEAN and Principal axis 1 and Principal axis 2 indicated)

Page 83: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

3. Plot the ordered eigenvalues versus the index number and inspect where a 'shoulder' occurs: this determines the number of eigenvalues you take into account. This is a so-called 'scree plot'.

Page 84: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

4. Another derivation is by maximisation of the variance of the projected data (see book).

This leads to an eigenvalue problem for the covariance matrix, i.e. the solution described above.

Page 85: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

5. For $n$ points of $p$ components, $O(np^2 + p^3)$ operations are required. Use LU-decomposition, etc.

Page 86: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

6. Many benefits: considerable data-reduction, necessary for computational techniques like ‘Fisher-discriminant-analysis’ and ‘clustering’.

This works very well in practice.

Page 87: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

7. NB: Factor Analysis is often confused with PCA but is different:

the explanation of p-dimensional data by a smaller number of m < p factors.

Page 88: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set of variables, that nonetheless retains most of the sample's information.

By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.

Page 89: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

overview

• geometric picture of PCs

• algebraic definition and derivation of PCs

• usage of PCA

• astronomical application

Page 90: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Geometric picture of principal components (PCs)

A sample of n observations in the 2-D space

Goal: to account for the variation in a sample in as few variables as possible, to some accuracy

Page 91: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Geometric picture of principal components (PCs)

PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones:

• the 1st PC is a minimum-distance fit to a line in the space

• the 2nd PC is a minimum-distance fit to a line in the plane perpendicular to the 1st PC

Page 92: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Algebraic definition of PCs

Given a sample of $n$ observations on a vector of $p$ variables $x = (x_1, x_2, \ldots, x_p)^T$, define the first principal component of the sample by the linear transformation

$z_1 = a_1^T x = \sum_{j=1}^{p} a_{j1} x_j$,

where the vector $a_1 = (a_{11}, a_{21}, \ldots, a_{p1})^T$ is chosen such that $\mathrm{var}(z_1)$ is maximum.

Page 93: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Algebraic definition of PCs

Likewise, define the $k$th PC of the sample by the linear transformation

$z_k = a_k^T x$, for $k = 1, \ldots, p$,

where the vector $a_k = (a_{1k}, a_{2k}, \ldots, a_{pk})^T$ is chosen such that $\mathrm{var}(z_k)$ is maximum, subject to $\mathrm{cov}(z_k, z_l) = 0$ for $k > l \geq 1$ and to $a_k^T a_k = 1$.

Page 94: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Algebraic derivation of coefficient vectors

To find $a_1$, first note that $\mathrm{var}(z_1) = a_1^T S a_1$,

where $S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$ is the covariance matrix for the variables $x$.

Page 95: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Algebraic derivation of coefficient vectors

To find $a_1$: maximize $a_1^T S a_1$ subject to $a_1^T a_1 = 1$.

Let $\lambda$ be a Lagrange multiplier; then maximize $a_1^T S a_1 - \lambda (a_1^T a_1 - 1)$.

By differentiating with respect to $a_1$: $S a_1 - \lambda a_1 = 0$, therefore $a_1$ is an eigenvector of $S$ corresponding to eigenvalue $\lambda$.

Page 96: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Algebraic derivation of coefficient vectors

We have maximized $\mathrm{var}(z_1) = a_1^T S a_1 = \lambda_1$. So $\lambda_1$ is the largest eigenvalue of $S$.

The first PC retains the greatest amount of variation in the sample.

Page 97: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Algebraic derivation of coefficient vectors

To find the next coefficient vector $a_2$: maximize $a_2^T S a_2$ subject to $\mathrm{cov}(z_2, z_1) = 0$ and to $a_2^T a_2 = 1$.

First note that $\mathrm{cov}(z_2, z_1) = a_1^T S a_2 = \lambda_1 a_1^T a_2$, then let $\lambda$ and $\phi$ be Lagrange multipliers, and maximize $a_2^T S a_2 - \lambda (a_2^T a_2 - 1) - \phi\, a_2^T a_1$.

Page 98: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Algebraic derivation of coefficient vectors

We find that $a_2$ is also an eigenvector of $S$, whose eigenvalue $\lambda_2$ is the second largest.

In general:

• The $k$th largest eigenvalue of $S$ is the variance of the $k$th PC.

• The $k$th PC retains the $k$th greatest fraction of the variation in the sample.

Page 99: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Algebraic formulation of PCA

Given a sample of $n$ observations on a vector of $p$ variables $x$, define a vector of $p$ PCs $z = (z_1, \ldots, z_p)^T$ according to $z = A^T x$,

where $A$ is an orthogonal $p \times p$ matrix whose $k$th column is the $k$th eigenvector $a_k$ of $S$.

Then $\Lambda = A^T S A$ is the covariance matrix of the PCs, being diagonal with elements $\lambda_1, \lambda_2, \ldots, \lambda_p$.

Page 100: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

usage of PCA: Probability distribution for sample PCs

If (i) the $n$ observations of $x$ in the sample are independent, and (ii) $x$ is drawn from an underlying population that follows a $p$-variate normal (Gaussian) distribution with known covariance matrix $\Sigma$, then the sample covariance matrix $S$ follows a Wishart distribution; else utilize a bootstrap approximation.
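As an illustration of the bootstrap alternative mentioned above (hypothetical sample; NumPy assumed, not from the slides), approximate the sampling distribution of the largest eigenvalue by resampling:

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2],
                                             [1.2, 1.0]], size=150)   # hypothetical sample

    def largest_eigenvalue(data):
        """Largest eigenvalue of the sample covariance matrix."""
        return np.linalg.eigvalsh(np.cov(data, rowvar=False))[-1]

    # bootstrap approximation to the sampling distribution of the largest eigenvalue
    boot = np.array([largest_eigenvalue(X[rng.integers(0, len(X), len(X))])
                     for _ in range(1000)])
    print("bootstrap 95% interval:", np.percentile(boot, [2.5, 97.5]))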

Page 101: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

usage of PCA: Probability distribution for sample PCs

If (i) $S$ follows a Wishart distribution, and (ii) the population eigenvalues are all distinct, then the following results hold as $n \to \infty$ (a tilde denotes a population quantity):

• all the sample eigenvalues $\lambda_k$ are independent of all the sample eigenvectors $a_k$

• the $\lambda_k$ and the $a_k$ are jointly normally distributed

Page 102: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

usage of PCA: Probability distribution for sample PCs

(asymptotic expressions for the means and variances of the sample eigenvalues and eigenvectors; a tilde denotes a population quantity)

Page 103: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

usage of PCA: Inference about population PCs

If $x$ follows a $p$-variate normal distribution, then analytic expressions exist* for:

• MLEs of the population eigenvalues, eigenvectors, and covariance matrix

• confidence intervals for the population eigenvalues and eigenvectors

• hypothesis testing for the population eigenvalues and eigenvectors

else bootstrap and jackknife approximations exist.

*see references, esp. Jolliffe

Page 104: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

usage of PCA: Practical computation of PCs

In general it is useful to define standardized variables by $x_j' = x_j / s_j$, where $s_j$ is the sample standard deviation of $x_j$.

If the $x_j$ are each measured about their sample mean, then the covariance matrix of the standardized variables $x'$ will be equal to the correlation matrix of the original variables $x$, and the PCs will be dimensionless.
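A small Python check of this statement (hypothetical mixed-scale data; NumPy assumed, not part of the slides): the covariance matrix of the standardized, mean-centred variables equals the correlation matrix of the originals:

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])   # hypothetical, mixed scales

    Xc = X - X.mean(axis=0)                  # measure each variable about its sample mean
    Xs = Xc / Xc.std(axis=0)                 # standardized variables

    cov_of_standardized = np.cov(Xs, rowvar=False, bias=True)
    corr_of_original = np.corrcoef(Xc, rowvar=False)
    print(np.allclose(cov_of_standardized, corr_of_original))      # True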

Page 105: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

usage of PCA: Practical computation of PCs

Given a sample of $n$ observations on a vector of $p$ variables $x$ (each measured about its sample mean), compute the covariance matrix

$S = \frac{1}{n} X^T X$,

where $X$ is the $n \times p$ matrix whose $i$th row is the $i$th observation $x_i^T$.

Then compute the $n \times p$ matrix $Z = X A$, whose $i$th row is the PC score $z_i^T$ for the $i$th observation.

Page 106: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

usage of PCA: Practical computation of PCs

Write $x_i^T = z_i^T A^T = \sum_{k=1}^{p} z_{ik}\, a_k^T$ to decompose each observation into its PCs.

Page 107: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

usage of PCA: Data compression

Because the $k$th PC retains the $k$th greatest fraction of the variation, we can approximate each observation by truncating the sum at the first $m < p$ PCs:

$x_i^T \approx \sum_{k=1}^{m} z_{ik}\, a_k^T$

Page 108: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

usage of PCA: Data compression

Reduce the dimensionality of the data from $p$ to $m < p$ by approximating $X \approx Z_m A_m^T$,

where $Z_m$ is the $n \times m$ portion of $Z$ and $A_m$ is the $p \times m$ portion of $A$.
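A minimal Python sketch of this compression step (hypothetical data matrix; NumPy assumed, not part of the slides): keep the first m PC scores and the first m eigenvectors, and reconstruct X ≈ Z_m A_m^T:

    import numpy as np

    rng = np.random.default_rng(3)
    # hypothetical n x p data matrix, measured about its sample mean
    X = rng.multivariate_normal([0.0] * 4, np.diag([5.0, 2.0, 0.5, 0.1]), size=300)
    X = X - X.mean(axis=0)

    S = (X.T @ X) / len(X)                     # covariance matrix S = X^T X / n
    eigvals, A = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, A = eigvals[order], A[:, order]   # columns of A are the eigenvectors a_k

    Z = X @ A                                  # PC scores, one row per observation

    m = 2                                      # keep the first m < p PCs
    X_approx = Z[:, :m] @ A[:, :m].T           # X ≈ Z_m A_m^T

    rel_err = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
    print("relative reconstruction error with m =", m, ":", rel_err)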

Page 109: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Dressler, et al. 1987

astronomical application: PCs for elliptical galaxies

Rotating to PC in BT – Σ space improves Faber-Jackson relation

as a distance indicator

Page 110: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

astronomical application: Eigenspectra (KL transform)

Connolly, et al. 1995

Page 111: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

references

Connolly, A.J., and Szalay, A.S., et al., "Spectral Classification of Galaxies: An Orthogonal Approach", AJ, 110, 1071-1082, 1995.

Dressler, A., et al., "Spectroscopy and Photometry of Elliptical Galaxies. I. A New Distance Estimator", ApJ, 313, 42-58, 1987.

Efstathiou, G., and Fall, S.M., "Multivariate analysis of elliptical galaxies", MNRAS, 206, 453-464, 1984.

Johnston, D.E., et al., "SDSS J0903+5028: A New Gravitational Lens", AJ, 126, 2281-2290, 2003.

Jolliffe, I.T., 2002, Principal Component Analysis (Springer-Verlag New York, Secaucus, NJ).

Lupton, R., 1993, Statistics in Theory and Practice (Princeton University Press, Princeton, NJ).

Murtagh, F., and Heck, A., Multivariate Data Analysis (D. Reidel Publishing Company, Dordrecht, Holland).

Yip, C.W., and Szalay, A.S., et al., "Distributions of Galaxy Spectral Types in the SDSS", AJ, 128, 585-609, 2004.

Page 112: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.5 PCA: Principal Component Analysis

Page 113: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 114: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 115: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 116: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 117: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

(figure panels: reconstructions using 1, 2, 3, and 4 principal components)

Page 118: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.6 Multidimensional Scaling [MDS]

1. Same purpose: represent a high-dimensional data set.

2. In the case of MDS not by projections, but by reconstruction from the distance table. The computed points are represented in a Euclidean sub-space – preferably a 2D plane.

3. MDS performs better than PCA in the case of strongly curved sets.

Page 119: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.6 Multidimensional Scaling

The purpose of multidimensional scaling (MDS) is to provide a visual representation of the pattern of proximities (i.e., similarities or distances) among a set of objects

INPUT: distances dist[Ai,Aj] where A is some class of objects

OUTPUT: positions X[Ai] where X is a D-dimensional vector
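A minimal sketch of classical (Torgerson) MDS in Python, assuming NumPy and a hypothetical distance table; this is one standard way of computing positions from distances, not necessarily the exact algorithm used in the course:

    import numpy as np

    def classical_mds(dist, n_dims=2):
        """Classical (Torgerson) MDS: recover coordinates from a table of distances."""
        n = dist.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n            # centring matrix
        B = -0.5 * J @ (dist ** 2) @ J                 # double-centred squared distances
        eigvals, eigvecs = np.linalg.eigh(B)
        order = np.argsort(eigvals)[::-1][:n_dims]     # keep the largest eigenvalues
        scale = np.sqrt(np.maximum(eigvals[order], 0.0))
        return eigvecs[:, order] * scale               # positions X[Ai], one row per object

    # hypothetical objects: four points on a line; only their mutual distances are given
    D = np.array([[0, 1, 2, 3],
                  [1, 0, 1, 2],
                  [2, 1, 0, 1],
                  [3, 2, 1, 0]], dtype=float)
    print(classical_mds(D, n_dims=1))   # recovers the positions up to translation/reflection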

Page 120: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.6 Multidimensional Scaling

Page 121: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.6 Multidimensional Scaling

Page 122: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.6 Multidimensional Scaling

INPUT: distances dist[Ai,Aj] where A is some class of objects

Page 123: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.6 Multidimensional Scaling

OUTPUT: positions X[Ai] where X is a D-dimensional vector

Page 124: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.6 Multidimensional Scaling

How many dimensions ??? SCREE PLOT

Page 125: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Multidimensional Scaling: Dutch dialects

Page 126: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 127: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 128: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.6 Kohonen’s Self Organizing Map (SOM) and Sammon mapping

1. Same purpose: DIMENSION REDUCTION: represent a high-dimensional set in a smaller sub-space, e.g. a 2D plane.

2. SOM gives better results than Sammon mapping, but is strongly sensitive to initial values.

3. This is close to clustering!

Page 129: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.6 Kohonen’s Self Organizing Map (SOM)

Page 130: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

3.6 Kohonen’s Self Organizing Map (SOM)

Page 131: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

Sammon mapping

Page 132: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering
Page 133: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

1. Clustering versus Classification

• classification: give a pre-determined label to a sample

• clustering: provide the relevant labels for classification from structure in a given dataset

• clustering: maximal intra-cluster similarity and maximal inter-cluster dissimilarity

• Objectives: - 1. segmentation of space

- 2. find natural subclasses

Page 134: DATA MINING from data to information Ronald Westra Dep. Mathematics Knowledge Engineering

The End

