MULTIVARIATE STATISTICAL ANALYSIS FOR FOOD SCIENCE AND AGRICULTURE: AN INTRODUCTION
4. MULTIDIMENSIONAL SCALING
Prof. Eugenio Parente
Scuola di Scienze Agrarie - Università della Basilicata
05/01/2013 Multistat 3 cfu, Dec 2012 - Jan 2013
Outline
• Multidimensional scaling (MDS)
  • objectives of MDS
  • metric and monotonic MDS
  • MDS output
  • examples
Multidimensional scaling
Multidimensional scaling is a group of techniques used to fit a set of points in a q-dimensional space such that the distances between the points (dij) match as closely as possible the dissimilarities (δij) between the original objects in the p-dimensional space, in order to obtain a simple spatial model (map). The model does not need statistical distribution assumptions, but the data should satisfy metric conditions:
• the distance of an object from itself is 0
• the distance from object A to object B is the same as the distance from B to A (the dissimilarity matrix should be symmetrical)
• the distance from A to C is less than or equal to the sum of the distances from A to B and from B to C (triangle inequality)
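The three metric conditions can be checked directly on a square dissimilarity matrix before running MDS. The sketch below is only an illustration in Python (standard library only): the function name `check_metric_conditions`, the tolerance `tol` and the two small matrices are hypothetical and do not come from the course files.

```python
from itertools import permutations


def check_metric_conditions(d, tol=1e-9):
    """Check the three metric conditions on a square dissimilarity matrix.

    d is a list of lists; tol is a small numerical tolerance (an assumption).
    """
    n = len(d)
    # 1. The distance of an object from itself is 0.
    zero_diagonal = all(abs(d[i][i]) <= tol for i in range(n))
    # 2. The matrix is symmetrical: d(A, B) = d(B, A).
    symmetric = all(abs(d[i][j] - d[j][i]) <= tol
                    for i in range(n) for j in range(i + 1, n))
    # 3. Triangle inequality: d(A, C) <= d(A, B) + d(B, C) for all triplets.
    triangle = all(d[i][k] <= d[i][j] + d[j][k] + tol
                   for i, j, k in permutations(range(n), 3))
    return {"zero_diagonal": zero_diagonal,
            "symmetric": symmetric,
            "triangle_inequality": triangle}


# Hypothetical matrices for three objects.
good = [[0.0, 3.0, 4.0],
        [3.0, 0.0, 5.0],
        [4.0, 5.0, 0.0]]
bad = [[0.0, 1.0, 9.0],
       [1.0, 0.0, 2.0],
       [9.0, 2.0, 0.0]]  # 9 > 1 + 2, so the triangle inequality fails

print(check_metric_conditions(good))
print(check_metric_conditions(bad))
```

A failed check does not rule out MDS altogether, but it suggests that monotonic (rank-based) scaling may be a safer choice than metric scaling.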
MDS vs. PCA
• Advantages
  • MDS will usually find a solution with fewer dimensions than PCA
  • If the purpose is just finding or visualizing natural groups of objects, a map is easier to explain than a score plot
  • Clusters of objects are easier to visualize / highlight in an MDS map
• Disadvantages
  • It is usually more difficult to find relationships between dimensions and original variables
  • MDS is usually effective only when observations are fairly well spread out in the space
Multidimensional scaling
The model can be written as:
\[ d_{ij} = f(\delta_{ij}), \qquad d_{ij} = h(x_i, x_j) \]
where:
• xi and xj are the vectors of the coordinates of objects i and j in the q-dimensional space (q<p)
• f(δij) is the assumed functional relationship between the dissimilarities and the distances
• h is the distance function (usually Euclidean, but Minkowski metrics can be used)
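As an illustration of the distance function h, the following sketch computes a Minkowski distance between two coordinate vectors; the exponent r = 2 gives the usual Euclidean distance and r = 1 the city-block distance. The function name `minkowski_distance` and the parameter name `r` are hypothetical choices made for this example.

```python
def minkowski_distance(x_i, x_j, r=2.0):
    """Minkowski distance between two coordinate vectors.

    r = 2 gives the Euclidean distance, r = 1 the city-block distance.
    """
    if len(x_i) != len(x_j):
        raise ValueError("both vectors must have the same number of dimensions")
    return sum(abs(a - b) ** r for a, b in zip(x_i, x_j)) ** (1.0 / r)


# Two hypothetical objects in a 2-dimensional configuration.
print(minkowski_distance([0.0, 0.0], [3.0, 4.0]))         # Euclidean
print(minkowski_distance([0.0, 0.0], [3.0, 4.0], r=1.0))  # city-block
```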
Metric multidimensional scaling
A direct numerical comparison between fitted distances and dissimilarities (usually based on a least squares criterion) is used. The coordinates are calculated iteratively to minimize a goodness-of-fit statistic (stress). In linear metric scaling a linear model is used to relate distances to dissimilarities:
\[ d_{ij} = \alpha + \beta \delta_{ij} + \varepsilon_{ij} \]
Metric multidimensional scaling
A fit criterion which is invariant both under rigid transformations (rotations, reflections, translations) and under non-rigid transformations (stretching and shrinking obtained by multiplication of the coordinates by a factor k) is:
\[ S_3 = \frac{\sum_{i<j} \left( d_{ij} - f(\delta_{ij}) \right)^2}{\sum_{i<j} d_{ij}^2} \]
whose square root is known as stress.
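The stress criterion on this slide can be evaluated for any trial configuration. The sketch below is a minimal Python illustration, assuming Euclidean distances between the fitted points and, by default, an identity function for f; the function name `stress` and the small example configuration are hypothetical.

```python
import math
from itertools import combinations


def stress(coords, delta, f=lambda x: x):
    """Square root of S3 for a configuration and a dissimilarity matrix.

    coords: list of coordinate tuples (the fitted configuration)
    delta:  square matrix of observed dissimilarities
    f:      assumed transformation of the dissimilarities (identity by default)
    """
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(len(coords)), 2):  # all pairs with i < j
        d_ij = math.dist(coords[i], coords[j])        # fitted Euclidean distance
        numerator += (d_ij - f(delta[i][j])) ** 2
        denominator += d_ij ** 2
    return math.sqrt(numerator / denominator)


# Hypothetical configuration of three objects in two dimensions.
coords = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
exact = [[0.0, 3.0, 4.0],
         [3.0, 0.0, 5.0],
         [4.0, 5.0, 0.0]]
print(stress(coords, exact))  # the distances reproduce the dissimilarities
```

In an actual MDS run the coordinates would be updated iteratively until this value stops decreasing; the sketch only evaluates the criterion for a fixed configuration.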
Non-metric (monotonic) MDS
When observed proximities contain information on rank order rather than on real distances, assuming a linear relationship between observed and fitted distances may be inappropriate, and monotonic regression should be used:
\[ d_{ij} = \hat{d}_{ij} + \varepsilon_{ij} \]
The fitted distances are chosen to satisfy a weak monotonicity condition:
\[ \delta_{i_1 j_1} < \delta_{i_2 j_2} < \dots < \delta_{i_N j_N} \;\Rightarrow\; \hat{d}_{i_1 j_1} \le \hat{d}_{i_2 j_2} \le \dots \le \hat{d}_{i_N j_N} \]
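One common way to obtain fitted values that satisfy the weak monotonicity condition is the pool-adjacent-violators algorithm: the distances are put in the rank order of the dissimilarities, and neighbouring values that violate the order are replaced by their mean. The sketch below illustrates this idea only; it is not necessarily the procedure used by the course software, and the function name `monotone_fit` and the example values are hypothetical.

```python
def monotone_fit(delta, d):
    """Monotonic (isotonic) regression of distances on dissimilarities.

    delta: observed dissimilarities, one value per pair of objects
    d:     distances between the same pairs in the current configuration
    Returns fitted values that never decrease as delta increases.
    """
    # Rank the pairs by increasing dissimilarity (ties keep their input order).
    order = sorted(range(len(delta)), key=lambda k: delta[k])
    blocks = []  # each block stores [sum of distances, number of pairs]
    for k in order:
        blocks.append([d[k], 1])
        # Pool adjacent blocks while the previous mean exceeds the last mean.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Give every pair the mean of the block it ended up in.
    d_hat = [0.0] * len(d)
    position = 0
    for total, count in blocks:
        for k in order[position:position + count]:
            d_hat[k] = total / count
        position += count
    return d_hat


# Hypothetical example: the second and third distances are out of rank order.
print(monotone_fit([1, 2, 3, 4], [1.0, 3.0, 2.0, 4.0]))  # [1.0, 2.5, 2.5, 4.0]
```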
MDS input data
• Dissimilarity matrices obtained in a direct way (ask assessors to state how different two objects are, take measurements from a map, etc.)
• Dissimilarity matrices calculated from rectangular (n x p) data matrices
  • Euclidean distance (on standardized or unstandardized data)
  • Negative correlation (beware: high positive correlation = most similar, high negative correlation = most dissimilar)
  • Other distance or correlation measures (Spearman, Guttman)
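The sketch below shows how a dissimilarity matrix might be derived from a rectangular data matrix, using Euclidean distances on standardized data and a correlation-based measure. It is a standard-library Python illustration under stated assumptions: the conversion 1 - r is only one common way of turning a correlation into a dissimilarity, correlations are computed here between rows, and all function names and data values are hypothetical.

```python
import math
from statistics import mean, stdev


def standardize(data):
    """Centre each column on its mean and scale it to unit standard deviation."""
    columns = list(zip(*data))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # sample standard deviation (n - 1)
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in data]


def euclidean_dissimilarities(data):
    """Square matrix of Euclidean distances between the rows (objects)."""
    return [[math.dist(a, b) for b in data] for a in data]


def pearson(x, y):
    """Pearson correlation between two profiles of equal length."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_dissimilarities(data):
    """1 - r between rows: 0 when r = +1 (most similar), 2 when r = -1."""
    return [[1.0 - pearson(a, b) for b in data] for a in data]


# Hypothetical data: 3 objects (rows) measured on 2 variables with different scales.
samples = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(euclidean_dissimilarities(samples))               # dominated by variable 2
print(euclidean_dissimilarities(standardize(samples)))  # both variables weigh equally
```

Standardizing first prevents the variable with the largest scale from dominating the distances; whether that is desirable depends on the data.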
Adjustable parameters in the analysis
• Type of scaling (monotonic, metric: linear, log, power)
• Number of dimensions
• Stress function
• Iteration and convergence parameters
MDS output
In the MDS output look for:
• final configuration in the q dimensions (q<p), coordinates and plots
• final stress and proportion of the variance; according to Kruskal (1964):
  • stress 0.20 -> poor fit
  • stress 0.10 -> fair fit
  • stress 0.05 -> good fit
  • stress 0.025 -> excellent fit
• Shepard diagram (plot of observed vs fitted distances)
Individual differences MDS
• Uses multiple dissimilarity matrices (for example different judges evaluating a common set of products)
• The input is a rectangular matrix containing stacked triangular dissimilarity matrices
• Scales both objects and judges in a common space in order to
  • Find a common configuration for all objects
  • Calculate weights for judges in the common space
  • Assess goodness of fit for both objects and judges
[Screenshots: The data file; The configuration; The output]
MDS examples
• Open file MDSRAPD.syo for examples of MDS on RAPD-PCR data (including the bootstrapping/MDS procedure, which can be found in file sardiniabread.syo)
• Look at the original data and final configuration for the MDS on RAPD data in file breadlab.xls
• Look at file MDSboot.xls for the final bootstrapping/MDS configuration
• Look at the command files for details on the pretreatment procedures (boot.syc, Bread\fornextloop.syc)
MDS examples Open file mds.syo for MDS examples on the RP-HPLC dataset for smear cheese.
MDS of RAPD patterns of LAB
[Figure: RAPD-PCR gel images, panels A to D, lanes 1 to 47 flanked by marker lanes (Mk); marker bands labelled 1.000 bp and 5.000 bp]
MDS of RAPD patterns of LAB
[Figure: isolates labelled by two-letter group codes (e.g. ZB, ZE, ZG, CD, CM, MB, MG, SA, SB, SC, SD); legend "Species": W. confusa, Leuc. citreum, Lb. sanfranciscensis, Lb. plantarum, Lb. pentosus, Lb. brevis]
MDS of RAPD patterns of LAB
[Figure: two-dimensional MDS configuration, dim(1) (-3 to 2) vs. dim(2) (-1.5 to 1.5), with groups MB, CG, ZB, ZG, ZE, SD, SC, SA and SB labelled]
MDS of RP-HPLC data from smear cheese
RP-HPLC data from smear cheese: PCA vs. MDS of Euclidean distance matrix
Some rights reserved
This presentation was created by Eugenio Parente, 2008 (revised: 2012). With the exception of figures and tables taken from published articles, the material included in this presentation is covered by the Creative Commons Public License “by-nc-sa” (http://creativecommons.org/licenses/by-nc-sa/2.5/deed.en).