Multivariate description
Visualisation
Reduction of dimensionality
Data Mining course
Master in Information Technologies
Enginyeria Informàtica
Tomàs Aluja
Two types of datasets to analyze
Data in Data Mining: massive, secondary, not random, with errors and missing values
• Data to explore: organised by topics (socio-economic, opinions, products)
• Data to model: inputs and output(s)
Course DM: Multivariate Visualisation. T. Aluja
Data exploration: Visualisation + “clustering”
• Data contain information about the generating phenomenon.
• Visualization. The human eyes …
  – We accept a loss of information in exchange for a gain in interpretability.
• Synthesis of reality (clustering)
  – Reality is complex; we make it operational by simplifying it into a limited number of clusters.
Snow’s Cholera Map, 1855
South and North Korea at night
South Korea: guess where Seoul is.
North Korea: notice how dark it is.
Graph visualisation
Ggobi project
Parallel coordinates of IRIS data
Iris versicolor
Iris virginica
Iris setosa
Visualization of the table: BCN quarters × profession of inhabitants
Spanish Inquisition 1567-1600: sentences & crimes
Visualisation of international cities according to their salaries. USB 1994.
Microarray data: 64 cancers, 6830 genes (chromatography)
M. Turk and A. Pentland. Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1), 1991.
Reconstitution of images
Actual image
Reconstituted image
Monitoring of the inner temperatures of Lascaux cave (France):
Multivariate Visualization: selection of the active topic
• Exploratory situation (without a response variable but with illustrative variables).
[Figure: data matrix with n individuals (rows) and p variables (columns); the columns are split into active variables and illustrative variables.]
Active topic Multivariate technique
Continuous variables PCA - Principal Component Analysis
Count variables CA - (Simple) Correspondence Analysis
Categorical variables MCA - Multiple Correspondence Analysis
PCA, CA, MCA can be useful for …
• Visualisation of the information contained in a data matrix
• Detection of “outliers”
• Reduction of the dimensionality (feature selection)
• Image compression
• Extraction of new derived (latent) variables, “feature extraction”
• Smoothing of data (error reduction, avoiding collinearity)
• Preprocessing of the explanatory variables as a first phase of modeling
Principal Component Analysis
• Cloud of points associated with the rows of the data matrix.
• Total information contained in the cloud of points: the inertia with respect to G.
[Figure: the n × p data matrix X, whose rows i and i' are points of a cloud in R^p (axes var1, var2, var3).]
Harold Hotelling, 1895-1973, American statistician
• Purpose:
  – To project the cloud of points onto a subspace (a plane) that retains the maximum of the information in the original cloud.
Principal Component Analysis
• Fitness criterion
  – Find the subspace maximizing the projected inertia.
• Decomposition of the inertia along orthogonal directions (factorial axes):
  $I_{total} = I_1 + I_2 + \dots + I_p$, with $I_1 \geq I_2 \geq \dots \geq I_p$
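Fit in R^p
Let X be the data matrix: centered or standardized.
Let u ∈ R^p be the unit vector defining the direction maximizing the projected inertia, and let $\psi = Xu$ be the projections of the individuals on it. N is the diagonal matrix of the weights $p_i$.
$\max_u \sum_{i=1}^{n} p_i \psi_i^2 = \psi' N \psi = u' X' N X u$, subject to $u'u = 1$
Diagonalization of the correlation matrix (or covariance):
$X' N X = Cov(X)$ if X is centered, $X' N X = Cor(X)$ if X is standardized
$X' N X u = \lambda u$
$diag(X' N X) \rightarrow \lambda_1, \dots, \lambda_r$ and $u_1, \dots, u_r$, with $r = rank(X)$
$\max_u u' X' N X u = u_1' X' N X u_1 = \lambda_1$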
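The diagonalization step can be illustrated with a minimal sketch. It assumes uniform weights N = (1/n) I, a made-up toy data set, and power iteration as the eigen-solver; the function names are hypothetical, and none of these choices come from the course. The sketch checks that the inertia projected on u1 equals the first eigenvalue.

```python
import math

def center(X):
    """Subtract the column means so that the cloud is centered at G."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def weighted_cross_product(X):
    """Return X'NX with uniform weights N = (1/n) I (assumption)."""
    n, p = len(X), len(X[0])
    return [[sum(row[a] * row[b] for row in X) / n for b in range(p)]
            for a in range(p)]

def first_axis(S, iterations=200):
    """Power iteration: approximate the leading unit eigenvector and eigenvalue of S."""
    p = len(S)
    u = [1.0] * p
    for _ in range(iterations):
        w = [sum(S[a][b] * u[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        u = [v / norm for v in w]
    lam = sum(u[a] * sum(S[a][b] * u[b] for b in range(p)) for a in range(p))
    return u, lam

# Hypothetical toy data: 5 individuals, 2 variables.
X = center([[2.0, 1.0], [3.0, 2.5], [4.0, 3.0], [5.0, 4.5], [6.0, 5.0]])
S = weighted_cross_product(X)                      # covariance matrix X'NX
u1, lambda1 = first_axis(S)
psi = [sum(x * u for x, u in zip(row, u1)) for row in X]   # psi = X u1
projected_inertia = sum(v * v for v in psi) / len(psi)     # psi' N psi
```

In practice a library eigen-solver would replace the power iteration; it is used here only to keep the sketch dependency-free.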
Principal components (derived latent variables, factors, …)
[Figure: multidimensional cloud in R^p with Axis 1 and Axis 2.]
Direction maximizing the projected inertia: $u_1$. Direction maximizing the projected inertia orthogonal to $u_1$: $u_2$. ...
$\psi_\alpha = X u_\alpha$
Inertia of a PC: $\psi_\alpha' N \psi_\alpha = \lambda_\alpha$
Assessing the importance of the orthogonal directions: scree plot of the eigenvalues.
[Figure: scree plot of Clarity-Quality; x-axis: component number (1 to 6); y-axis: eigenvalue (0 to 3).]
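Since each eigenvalue is the inertia of one principal component, a scree plot is often read together with the share of total inertia carried by each axis. The following sketch computes those shares; the eigenvalues are hypothetical (six standardized variables, so they sum to 6) and the function name is an assumption for illustration.

```python
def inertia_shares(eigenvalues):
    """Return the share of total inertia per axis and the cumulative shares."""
    total = sum(eigenvalues)
    shares = [lam / total for lam in eigenvalues]
    cumulative = []
    running = 0.0
    for share in shares:
        running += share
        cumulative.append(running)
    return shares, cumulative

# Hypothetical eigenvalues, sorted in decreasing order.
eigenvalues = [2.5, 1.5, 1.0, 0.5, 0.3, 0.2]
shares, cumulative = inertia_shares(eigenvalues)
for axis, (share, cum) in enumerate(zip(shares, cumulative), start=1):
    print(f"axis {axis}: {share:.1%} of inertia, {cum:.1%} cumulated")
```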
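Cloud of points associated with the columns of a data matrix in R^n
[Figure: cloud of the variables in R^n (axes ind1, ind2, ind3), built from the n × p data matrix X. Variable j is drawn as a vector, labelled with $s_j$. Two panels: centered variables and standardized variables. The example variables x, y, z, w illustrate highly correlated variables, an orthogonal variable, and a variable with a strongly negative correlation with x and y.]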
Fit in R^n (standardized data)
[Figure: the original cloud of the variables v1, v2, v3, v4 with Axis 1 and Axis 2, and its projection onto the first factorial plane.]
First factorial plane: optimal joint visualisation of the correlations between variables.
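Fit in R^n
Data matrix X: centered or standardized.
Let v ∈ R^n be the unit vector defining the direction maximizing the inertia:
$\varphi = X' N^{1/2} v$
$\max_v \sum_{j=1}^{p} \varphi_j^2 = \varphi'\varphi = v' N^{1/2} X X' N^{1/2} v$, subject to $v'v = 1$
$\Rightarrow N^{1/2} X X' N^{1/2} v = \lambda v$
$\varphi_\alpha = X' N^{1/2} v_\alpha$, with $\varphi_\alpha' \varphi_\alpha = \lambda_\alpha$
Transition relationships between both fits:
$u_\alpha = \lambda_\alpha^{-1/2} X' N^{1/2} v_\alpha$
$v_\alpha = \lambda_\alpha^{-1/2} N^{1/2} X u_\alpha$
Indirect projection formulas:
$\varphi_\alpha = \lambda_\alpha^{1/2} u_\alpha$
$\psi_\alpha = \lambda_\alpha^{1/2} N^{-1/2} v_\alpha$
Interpreting the projections:
$\varphi_{j\alpha} = \lambda_\alpha^{-1/2} (X' N \psi_\alpha)_j = cor(x_j, \psi_\alpha)$ for standardized data, and $s_j \, cor(x_j, \psi_\alpha)$ for centered data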
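As a check on the transition relationships, the short derivation below shows why the coordinates of the variables are the axes of R^p rescaled by the square root of the eigenvalue. It is a sketch that uses only the definitions of this slide: the projection of the variables, the transition formula for v, and the eigen-equation of the fit in R^p.

```latex
\begin{align*}
\varphi_\alpha &= X' N^{1/2} v_\alpha
  && \text{(projection of the variables in } \mathbb{R}^n\text{)} \\
 &= X' N^{1/2} \left( \lambda_\alpha^{-1/2} N^{1/2} X u_\alpha \right)
  && \text{(transition formula for } v_\alpha\text{)} \\
 &= \lambda_\alpha^{-1/2} \, X' N X u_\alpha \\
 &= \lambda_\alpha^{-1/2} \, \lambda_\alpha u_\alpha
  = \lambda_\alpha^{1/2} u_\alpha
  && \text{(since } X' N X u_\alpha = \lambda_\alpha u_\alpha\text{)}
\end{align*}
```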
PCA is a device for finding artificial latent variables from the observed ones.
[Figure: the observed variables (Var. 1, Var. 2, ..., Var. p) belong to the real world; the factors (Fac. 1, ..., Fac. q) belong to the world of ideas, concepts, theories, …; PCA links the two.]
PCA: $\psi_\alpha = u_{\alpha 1} x_1 + u_{\alpha 2} x_2 + \dots + u_{\alpha p} x_p$
$\Psi_{(n,p)} = X_{(n,p)} U_{(p,p)}$
But only the first q factors convey structural information; the remaining ones are noise.
PCA in practice
• Role of the variables: normed or non-normed analysis
  – Normed PCA gives all variables the same importance; we achieve this by standardizing the data (diagonalization of the correlation matrix).
  – Non-normed PCA gives each variable an importance proportional to its standard deviation; we achieve this by working with the data matrix that is only centered (diagonalization of the covariance matrix).
• Which variables to analyze?
  – This is the most crucial decision. Often the information contained is obvious; in that case, try performing partial analyses. PCA is an exploration device.
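The difference between the two analyses comes down to how the data matrix is prepared. The sketch below, on hypothetical height (cm) and weight (kg) measurements and with hypothetical function names, shows that centering keeps each variable's own variance while standardizing gives every variable unit variance.

```python
import math

def column_means_sds(X):
    """Column means and (population) standard deviations of a list of rows."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n)
           for j in range(p)]
    return means, sds

def centered(X):
    """Non-normed PCA input: subtract the column means only."""
    means, _ = column_means_sds(X)
    return [[x - m for x, m in zip(row, means)] for row in X]

def standardized(X):
    """Normed PCA input: subtract the means and divide by the standard deviations."""
    means, sds = column_means_sds(X)
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in X]

def variances(X):
    _, sds = column_means_sds(X)
    return [s * s for s in sds]

# Hypothetical data: height in cm, weight in kg.
X = [[170.0, 65.0], [180.0, 80.0], [160.0, 55.0], [175.0, 72.0], [165.0, 58.0]]
var_centered = variances(centered(X))          # each variable keeps its own variance
var_standardized = variances(standardized(X))  # every variable has variance 1
```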
PCA in practice
• How many factorial directions are significant?
  – Difficult to assess. How many axes remain stable with independent data?
  – Use the scree plot.
  – Perform random perturbations of the data to assess stability.
• How to interpret the axes
  – The significant axes convey structural (deterministic) information about the phenomenon under study; they can be interpreted and given a name (this is the most appealing outcome).
  – Interpretation is done on the basis of the correlations between the principal components (the new artificial latent variables) and the original variables; a principal component is a mean variable of the variables most correlated with it.
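To make the interpretation rule concrete, the sketch below computes the correlation between the scores on one principal component and two original variables. The scores, the variables and the function name are all hypothetical; the point is only that a variable with a correlation close to ±1 helps name the axis, while one with a correlation near 0 does not.

```python
import math

def correlation(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cov / (sd_a * sd_b)

# Hypothetical scores of 5 individuals on one principal component.
psi = [-2.0, -1.0, 0.0, 1.0, 2.0]
x1 = [1.0, 2.1, 2.9, 4.2, 5.0]   # grows with psi: strongly correlated
x2 = [3.0, 1.0, 4.0, 1.0, 3.0]   # unrelated to psi

cor_x1 = correlation(x1, psi)
cor_x2 = correlation(x2, psi)
```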
Projection of illustrative variables
• Continuous
  – We depict their correlations with the factorial axes.
• Categorical
  – We represent a categorical variable by the set of centres of gravity of the subclouds of individuals corresponding to each of its levels.
Very useful … It allows each illustrative variable to be related to the active topic as a whole.
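The projection of a categorical illustrative variable can be sketched as follows, assuming uniform weights for the individuals. The factorial coordinates, the labels and the function name are hypothetical; each level is placed at the centre of gravity of the individuals that carry it.

```python
def level_centroids(scores, labels):
    """Centre of gravity, per level, of the individuals' factorial coordinates."""
    sums, counts = {}, {}
    for point, label in zip(scores, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for k, value in enumerate(point):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

# Hypothetical coordinates of 5 individuals on the first two factorial axes.
scores = [(-2.0, 0.5), (-1.0, -0.5), (0.0, 1.0), (1.0, -1.0), (2.0, 0.0)]
# Hypothetical illustrative categorical variable with levels "a" and "b".
labels = ["a", "a", "b", "b", "b"]

centroids = level_centroids(scores, labels)
```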
Finding the PCA solution iteratively (NIPALS)
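Initialize X_1 ← X
For h = 1, ..., r = rank(X):
    ψ_h = mean column of X_h
    Repeat until convergence of u_h:
        u_h = X_h' ψ_h
        u_h = u_h / |u_h|
        ψ_h = X_h u_h
    X_{h+1} = X_h - ψ_h u_h'
[Figure: X_h maps u_h in R^p to ψ_h in R^n, and X_h' maps ψ_h back to R^p.]
At convergence: $X_h' X_h u_h \propto u_h$ and $X_h X_h' \psi_h \propto \psi_h$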
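The iteration above can be written as a short runnable sketch. It assumes a centered data matrix stored as a list of rows, uses hypothetical toy data, and adds two details that are not on the slide: a convergence tolerance and a fallback starting vector when the mean column is numerically zero.

```python
import math

def nipals(X, n_components, tol=1e-12, max_iter=1000):
    """Return (scores, axes): the psi_h and u_h vectors for the first components."""
    Xh = [row[:] for row in X]                     # X_1 <- X
    p = len(Xh[0])
    scores, axes = [], []
    for _ in range(n_components):
        psi = [sum(row) / p for row in Xh]         # mean column of X_h
        if sum(v * v for v in psi) < 1e-20:        # fallback (assumption): first column
            psi = [row[0] for row in Xh]
        u = [0.0] * p
        for _ in range(max_iter):
            # u_h = X_h' psi_h, then normalise to unit length
            w = [sum(row[j] * s for row, s in zip(Xh, psi)) for j in range(p)]
            norm = math.sqrt(sum(v * v for v in w))
            new_u = [v / norm for v in w]
            # psi_h = X_h u_h
            psi = [sum(a * b for a, b in zip(row, new_u)) for row in Xh]
            change = max(abs(a - b) for a, b in zip(new_u, u))
            u = new_u
            if change < tol:
                break
        scores.append(psi)
        axes.append(u)
        # Deflation: X_{h+1} = X_h - psi_h u_h'
        Xh = [[x - s * uj for x, uj in zip(row, u)] for row, s in zip(Xh, psi)]
    return scores, axes

# Hypothetical toy data (5 individuals, 3 variables), centered column by column.
raw = [[2.0, 1.0, 4.0], [3.0, 2.5, 1.0], [4.0, 3.0, 3.0], [5.0, 4.5, 2.0], [6.0, 5.0, 5.0]]
means = [sum(row[j] for row in raw) / len(raw) for j in range(3)]
X = [[x - m for x, m in zip(row, means)] for row in raw]
scores, axes = nipals(X, 2)
```

One appeal of this scheme is that it extracts the components one at a time, so it can stop after the first few axes.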
A relevant application: Google
• Google™ uses SVD to accelerate finding relevant web pages. Define a web site as an authority if many sites link to it. Define a web site as a hub if it links to many sites. We want to compute a ranking $x_1, \dots, x_N$ of authorities and $y_1, \dots, y_M$ of hubs.
As a first pass, we can compute the ranking scores as follows: $x_i^0$ is the number of links pointing to i and $y_i^0$ is the number of links going out of i. But not all links should be weighted equally. For example, links from authorities (or hubs) should count more. So we can revise the rankings as follows:
$x^1 = A' y^0$
$y^1 = A x^0$
where A is the adjacency matrix, with $a_{ij} = 1$ if i links to j (of order $10^9$).
But an authority also depends on the pages linking to the pages that link to it. Hence, iterating:
$x^k = A'A \, x^{k-1}$, $y^k = AA' \, y^{k-1}$
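The iteration can be tried on a tiny graph. The sketch below is an illustration on a hypothetical four-page web, not a description of any production system; the rescaling of the scores to unit length at each pass is an added assumption that keeps the numbers bounded without changing the ranking. Each pass applies x = A'y and y = Ax, so two passes amount to one multiplication by A'A (for x) or AA' (for y).

```python
import math

def hubs_and_authorities(A, iterations=50):
    """Iterate x = A'y (authorities) and y = Ax (hubs) on adjacency matrix A."""
    n = len(A)
    x = [float(sum(A[i][j] for i in range(n))) for j in range(n)]  # x0: incoming links
    y = [float(sum(A[i][j] for j in range(n))) for i in range(n)]  # y0: outgoing links
    for _ in range(iterations):
        new_x = [sum(A[i][j] * y[i] for i in range(n)) for j in range(n)]  # A' y
        new_y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]  # A x
        norm_x = math.sqrt(sum(v * v for v in new_x))
        norm_y = math.sqrt(sum(v * v for v in new_y))
        x = [v / norm_x for v in new_x]   # rescaling: assumption, not on the slide
        y = [v / norm_y for v in new_y]
    return x, y

# Hypothetical web of 4 pages: a[i][j] = 1 if page i links to page j.
A = [
    [0, 1, 0, 1],   # page 0 links to pages 1 and 3
    [0, 0, 0, 1],   # page 1 links to page 3
    [0, 0, 0, 1],   # page 2 links to page 3
    [0, 0, 0, 0],   # page 3 links to nothing
]
x, y = hubs_and_authorities(A)
best_authority = x.index(max(x))
best_hub = y.index(max(y))
```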
prcomp(x, retx=T, center=T, scale.=F, tol = NULL, ...)
Arguments:
x: a numeric matrix (or data frame) which provides the data.
retx: a logical value indicating whether the rotated variables should be returned.
center: a logical value indicating whether the variables should be shifted to be zero centered.
scale.: a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place.
tol: a value indicating the magnitude below which components should be omitted.
Attributes:
sdev: the standard deviations of the principal components (i.e., the square roots of the eigenvalues).
rotation: the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).
x: if 'retx' is true the value of the rotated data (the centred (and scaled if requested) data multiplied by the 'rotation' matrix) is returned.
biplot(x, y, var.axes = TRUE, main = NULL, ...)
Arguments:
x: The first set of points (a two-column matrix), usually associated with observations.
y: The second set of points (a two-column matrix), usually associated with variables.
var.axes: If 'TRUE' the second set of points have arrows representing them as (unscaled) axes.
Beyond PCA ⇒ MCA
• PCA only analyzes continuous variables through their correlations, hence it can only reveal linear relationships between variables.
• Thus, transform the original variables: recode them to ordinal to take non-linearities into account.
[Figure: variable j with levels $a_{j1}, \dots, a_{jk}$; the value $x_{ij}$ is recoded as an indicator vector such as 001000, and the factor Ψ is obtained as a function f(X) of the data.]
Ludovic Lebart, French statistician, promoter of MCA
MCA of hypercubes
• Dimensions (= categorical variables)
• Measured variables in cells (= responses; they may be continuous or categorical)
• The hypercube can be explicit or implicit in a relational DB.

Hypercube dimensions    Numerical coding (binning) (= Z)
                        A1 A2 A3 A4   B1 B2   C1 C2 C3
A1 B1 C1                1000          10      100
A1 B2 C2                1000          01      010
A3 B1 C3                0010          10      001
A2 B2 C1                0100          01      100
…                       …
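The numerical coding of the table can be reproduced with a few lines. This is a minimal sketch with a hypothetical function name; the level lists and the four rows are the ones shown on the slide.

```python
def indicator_coding(rows, levels):
    """Code each row of level labels as the concatenation of 0/1 indicator blocks."""
    coded = []
    for row in rows:
        bits = []
        for value, dimension_levels in zip(row, levels):
            bits.extend(1 if value == level else 0 for level in dimension_levels)
        coded.append(bits)
    return coded

# Levels of the three dimensions A, B and C, as on the slide.
levels = [["A1", "A2", "A3", "A4"], ["B1", "B2"], ["C1", "C2", "C3"]]
# The four hypercube cells listed on the slide.
rows = [("A1", "B1", "C1"), ("A1", "B2", "C2"), ("A3", "B1", "C3"), ("A2", "B2", "C1")]

Z = indicator_coding(rows, levels)
for row, bits in zip(rows, Z):
    print(" ".join(row), "->", "".join(str(b) for b in bits))
```

Every coded row contains exactly one 1 per dimension, so each row of Z sums to the number of dimensions.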
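• Active variables: Dimensions
• Illustrative variables: Responses
[Figure: n × p data matrix (individuals in rows, variables in columns) whose columns are split into dimensions, coded as indicators such as 1000 10 100, and response variables.]
We will visualize the responses on the grid provided by the dimensions.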
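MCA: active grid
[Figure: indicator coding of the active variables Age, CSP and Income level. An individual with levels (2, 1, 3) is coded 010 100 001. The indicator matrix has n rows, column totals $n_j$, row totals p and grand total np. The factorial plane displays the levels Ed1, Ed2, Ed3 (Age), CSP1, CSP2, CSP3 and ing1, ing2, ing3 (Income).]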
MCA as a nonlinear PCA
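[Figure: the indicator matrix Z, with n rows (individuals) and J columns (modalities 1 … j … J, grouped into p variables). Row i, e.g. 0010 01 010, sums to p; column j sums to $n_j$; the grand total is np. D is the J × J diagonal matrix of the column totals $n_1, \dots, n_J$.]
$n_j = \sum_{i=1}^{n} z_{ij}$, $p_i = \frac{p}{np} = \frac{1}{n}$
Row profile: $\frac{1}{p} Z$
Chi-square metric: $\left( \frac{1}{np} D \right)^{-1}$
$\psi = \frac{1}{p} Z \, np D^{-1} u$
$\max_u \sum_{i=1}^{n} p_i \psi_i^2 = \frac{1}{n} \psi'\psi = n \, u' D^{-1} Z' Z D^{-1} u$, subject to $np \, u' D^{-1} u = 1$
$\Rightarrow \frac{1}{p} Z' Z D^{-1} u = \lambda u$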
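The basic quantities of this slide can be computed directly from an indicator matrix. The sketch below uses a small hypothetical Z (n = 4 individuals, p = 2 variables with 3 and 2 modalities) and hypothetical variable names; it derives the column totals n_j, the diagonal of D, the row profiles Z/p and the individual weights p_i = 1/n.

```python
# Hypothetical indicator matrix: 4 individuals, 2 variables (3 + 2 modalities).
Z = [
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],
]
n = len(Z)            # number of individuals
J = len(Z[0])         # total number of modalities
p = sum(Z[0])         # number of variables: every row sums to p

# Column totals n_j = sum_i z_ij; they form the diagonal of D.
n_j = [sum(row[j] for row in Z) for j in range(J)]

# Row profiles (1/p) Z: each profile sums to 1.
row_profiles = [[z / p for z in row] for row in Z]

# Weight of each individual: p_i = p / (n p) = 1 / n.
weights = [sum(row) / (n * p) for row in Z]

# Diagonal of the chi-square metric (D / (n p))^{-1}.
chi_square_metric = [n * p / count for count in n_j]
```

The last line assumes that every modality is observed at least once; a modality with n_j = 0 would have to be dropped before inverting D.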
What are the factors in MCA?
[Figure: in R^n, the quantified variables $z_1$, $z_2$, $z_3$ (Age, CSP, Income level) and the factor Ψ.]
$\max \sum_{j=1}^{p} cor^2(z_j, \Psi)$, where $z_j = \sum_k u_{jk} a_{jk}$
⇒ Optimal quantification of the categorical variables
Original categorical data → MCA → Equivalent continuous factors
But we will work with more dimensions than in PCA.
Interesting properties of the MCA displays
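• Every individual is the centre of gravity (cdg) of the modalities it has chosen (apart from a multiplicative factor $1/\sqrt{\lambda_\alpha}$):
  $\psi_{i\alpha} = \frac{1}{\sqrt{\lambda_\alpha}} \sum_{j=1}^{J} \frac{z_{ij}}{p} \varphi_{j\alpha}$
• A modality (= level) is the centre of gravity of the individuals having chosen it (apart from a multiplicative factor $1/\sqrt{\lambda_\alpha}$):
  $\varphi_{j\alpha} = \frac{1}{\sqrt{\lambda_\alpha}} \sum_{i=1}^{n} \frac{z_{ij}}{n_j} \psi_{i\alpha}$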
MCA iterative algorithm
Initialize Z ← [Z_1, ..., Z_p]; Y_1 ← Z; D = diag(Z'Z); D_k = Z_k' Z_k
For h = 1, ..., rank(Z):
    Ψ_h = row mean of Y_h
    Repeat until convergence of u_h:
        u_h = D^{-1} Y_h' Ψ_h
        u_h = u_h / |u_h|
        Ψ_h = (1/p) Y_h u_h
    Y_{h+1} = Y_h - Ψ_h u_h'
    z_k = Z_k u_k; u_k = D_k^{-1} Z_k' Ψ_h, for k = 1, ..., p
Projection of the illustrative variables
• Continuous
  – From their correlations with the factorial axes.
• Categorical
  – As the set of centres of gravity of the individuals having chosen each level of the categorical variable.
library(MASS)
mca(df, nf = 2, abbrev = FALSE)
Arguments:
df: A data frame containing only factors.
nf: The number of dimensions for the MCA.
Attributes:
rs: The coordinates of the rows, in 'nf' dimensions.
cs: The coordinates of the column vertices, one for each level of each factor.
fs: Weights for each row, used to interpolate additional factors in 'predict.mca'.
d: The singular values for the 'nf' dimensions.
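Simple Correspondence Analysis: analysis of crosstables
Jean-Paul Benzécri, father of Analyse des Données
[Figure: the crosstable Quarters × CSP with cell counts $n_{kj}$, margins $n_k$ and $n_j$, and total n. Equivalently, each of the n individuals is coded by two indicator blocks, e.g. 00010000 for Quarters and 00000010000 for CSP. In R^n, the two quantified variables $z_1$ and $z_2$ and the factor Ψ.]
$\max cor(z_1, z_2)$, where $z_1 = \sum_k u_k a_k$ and $z_2 = \sum_j v_j b_j$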
library(MASS)
corresp(x, data, ...)
Arguments:
x: A two-way frequency table. Currently accepted forms are matrices, data frames ...
nf: The number of factors to be computed (max. value = min(nrow-1, ncol-1)).