
Multivariate Statistical Analysis Methods

Date post: 12-Jan-2016
Upload: rainer
Author: JEHANZEB

Basic statistical concepts and tools

Statistics

Statistics is concerned with the 'optimal' methods of analyzing data generated by some chance mechanism (random phenomena).

'Optimal' means an appropriate choice of what is to be computed from the data to carry out the statistical analysis.

Random variables

A random variable is a numerical quantity that, in an experiment involving some degree of randomness, takes one value from a set of possible values.

The probability distribution is the set of values that the random variable can take, together with their associated probabilities.

The Normal distribution

Proposed by Gauss (1777-1855) to describe the distribution of errors in astronomical observations (error function).

Arises in many biological processes.

By the central limit theorem, it is the limiting distribution of the standardized sum (or mean) of a large number of independent observations with finite variance.

Whenever a natural phenomenon is the result of many contributing factors, each making a small contribution, its distribution is approximately Normal.

The Quincunx

Bell-shaped distribution
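The quincunx (Galton board) is easy to simulate. The following is a minimal sketch, assuming each ball bounces left or right with probability 1/2 at every row of pins; the function name `quincunx`, the 12 rows, and the 10,000 balls are illustrative choices, not taken from the slides.

```python
import random
from collections import Counter

ROWS = 12  # assumed number of pin rows

def quincunx(n_balls=10_000, rows=ROWS, seed=0):
    """Simulate a Galton board: each ball makes `rows` left/right bounces."""
    rng = random.Random(seed)
    # Final bin = number of bounces to the right, a Binomial(rows, 1/2) draw.
    return Counter(sum(rng.random() < 0.5 for _ in range(rows))
                   for _ in range(n_balls))

bins = quincunx()
for k in range(ROWS + 1):
    # Crude text histogram: one '#' per 50 balls.
    print(f"{k:2d} {'#' * (bins[k] // 50)}")
```

The printed histogram is roughly bell-shaped: the sum of many small independent contributions is approximately Normal.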

Distribution function

The distribution function is defined as F(x) = Pr(X ≤ x)

F is called the cumulative distribution function (cdf) and f the probability density function (pdf) of X

$$F(t) = \int_{-\infty}^{t} f(x)\,dx \quad \text{where} \quad f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

μ and σ² are respectively the mean and the variance of the distribution
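The relation between the Normal pdf and cdf can be checked numerically. This is a small sketch using only the Python standard library; the function names and the midpoint-rule integration are illustrative assumptions, not part of the slides.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density f(x) of a Normal(mu, sigma^2) variable."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(t, mu=0.0, sigma=1.0):
    """F(t) = Pr(X <= t), written with the error function."""
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

def cdf_by_integration(t, mu=0.0, sigma=1.0, steps=20_000):
    """Approximate F(t) by the midpoint rule on [mu - 10 sigma, t]."""
    lo = mu - 10.0 * sigma
    h = (t - lo) / steps
    return h * sum(normal_pdf(lo + (i + 0.5) * h, mu, sigma) for i in range(steps))

# The two ways of computing F(1) should agree closely.
print(normal_cdf(1.0), cdf_by_integration(1.0))
```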

Moments of a distribution

The kth moment is defined as

$$\mu'_k = E(X^k) = \int_{-\infty}^{\infty} x^k f(x)\,dx$$

The first moment is the mean μ

The kth moment about the mean is

$$\mu_k = E\left[(X-\mu)^k\right] = \int_{-\infty}^{\infty} (x-\mu)^k f(x)\,dx$$

The second moment about the mean is called the variance σ²

Kurtosis: a useful function of the moments

Kurtosis: $\kappa_4 = \mu_4 - 3\mu_2^2$

$\kappa_4 = 0$ for a Normal distribution, so it is a measure of Normality
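As an illustration, the quantity μ4 − 3μ2² can be estimated from simulated samples. This is a sketch under stated assumptions: the helper names, the sample sizes, and the choice of a uniform distribution as the non-Normal comparison are all illustrative.

```python
import random

def central_moment(xs, k):
    """kth sample moment about the sample mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def kappa4(xs):
    """Sample version of mu_4 - 3 * mu_2**2."""
    return central_moment(xs, 4) - 3.0 * central_moment(xs, 2) ** 2

rng = random.Random(1)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
uniform_sample = [rng.uniform(-1.0, 1.0) for _ in range(100_000)]

print(kappa4(normal_sample))   # close to 0
print(kappa4(uniform_sample))  # clearly negative: flatter than a Normal
```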

Observations

Observations xi are realizations of a random variable X

The pdf of X can be visualized by a histogram: a graphic showing the frequency of observations in classes

Estimating moments

The mean of X is estimated from a set of n observations (x1, x2, ..., xn) as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The variance is estimated by

$$\widehat{\mathrm{Var}}(X) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
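The two estimators translate directly into code. A minimal sketch follows; the data values are made up for illustration, and `statistics.variance` from the standard library is used only as a cross-check.

```python
import statistics

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Variance estimate: sum of squared deviations divided by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up observations
print(sample_mean(data), sample_variance(data))
print(statistics.variance(data))  # should match sample_variance(data)
```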

The fundamentals of statistics

Drawing conclusions about a population on the basis of a set of measurements or observations on a sample from that population

Descriptive: draw conclusions from summary measures and graphics (data driven)

Inferential: test hypotheses formulated before collecting the data (hypothesis driven)

What about having many variables?

Let X = (X1, X2, ..., Xp) be a set of p variables

What is the marginal distribution of each of the variables Xi, and what is their joint distribution?

If f(x1, x2, ..., xp) is the joint pdf, then the marginal pdf of Xi is obtained by integrating over all the other variables

$$f(x_i) = \int \cdots \int f(x_1, \ldots, x_p)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_p$$

Independence

Variables are said to be independent if their joint pdf is the product of the marginal pdfs

f(x1, x2, ..., xp) = f(x1) · f(x2) · ... · f(xp)

Covariance and correlation

Covariance is the first joint moment of two variables about their means, that is

$$\mathrm{Cov}(X,Y) = E\left[(X-\mu_X)(Y-\mu_Y)\right] = E(XY) - E(X)\,E(Y)$$

Correlation: a standardized covariance

$$\rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\cdot\mathrm{Var}(Y)}}$$

ρ is a number between -1 and +1
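Sample versions of both quantities can be sketched as follows. The n − 1 divisor and the data values are assumptions made for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance standardized by the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]            # made-up values, roughly 2 * x
print(covariance(x, y), correlation(x, y))
print(correlation(x, [-v for v in x]))   # exactly linear and decreasing
```

The correlation of `x` with `y` is close to +1, and the correlation of `x` with its negation sits at the lower bound of −1.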

For example: a bivariate Normal

Two variables X and Y have a bivariate Normal distribution if

$$f(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2}\right]\right\}$$

ρ is the correlation between X and Y

Uncorrelatedness and independence

If ρ = 0 (Cov(X,Y) = 0) we say that the variables are uncorrelated

If the joint distribution of two variables is bivariate Normal, then uncorrelated variables are also independent; in general, uncorrelated variables need not be independent

Two independent variables are necessarily uncorrelated
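A standard counterexample shows why uncorrelated does not imply independent outside the bivariate Normal case. The sketch below assumes X is standard Normal and Y = X²; the example and the sample size are illustrative and not taken from the slides.

```python
import random

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
y = [v * v for v in x]          # y is a deterministic function of x

def mean(vs):
    return sum(vs) / len(vs)

# Cov(X, Y) = E(XY) - E(X)E(Y), which is E(X^3) = 0 in theory.
cov_xy = mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)

# Dependence check: compare Pr(|X| > 1 and Y > 1) with Pr(|X| > 1) * Pr(Y > 1).
p_x = mean([abs(a) > 1 for a in x])
p_y = mean([b > 1 for b in y])
p_joint = mean([abs(a) > 1 and b > 1 for a, b in zip(x, y)])
print(cov_xy, p_joint, p_x * p_y)
```

The covariance is near zero, yet the joint probability differs markedly from the product of the marginal probabilities, so the factorization required for independence fails.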

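Bivariate Normal

If ρ = 0 then

$$f(x,y) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left\{-\frac{1}{2}\left[\frac{(x-\mu_1)^2}{\sigma_1^2} + \frac{(y-\mu_2)^2}{\sigma_2^2}\right]\right\} = \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} \cdot \frac{1}{\sqrt{2\pi\sigma_2^2}}\, e^{-\frac{(y-\mu_2)^2}{2\sigma_2^2}}$$

So f(x,y) = f(x) · f(y)

The two variables are thus independent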

Many variables

We can calculate the covariance or correlation matrix of (X1, X2, ..., Xp)

$$C = \mathrm{Var}(X) = \begin{pmatrix} v(x_1) & c(x_1,x_2) & \cdots & c(x_1,x_p) \\ c(x_1,x_2) & v(x_2) & \cdots & c(x_2,x_p) \\ \vdots & \vdots & \ddots & \vdots \\ c(x_1,x_p) & c(x_2,x_p) & \cdots & v(x_p) \end{pmatrix}$$

A square (p × p), symmetric matrix
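A covariance matrix and the corresponding correlation matrix can be computed from an n × p table of observations without any external library. This is a minimal sketch; the function names and the data are illustrative assumptions.

```python
def covariance_matrix(rows):
    """rows: n observations, each a list of p values. Returns a p x p nested list."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
             for k in range(p)] for j in range(p)]

def correlation_matrix(rows):
    """Divide each covariance by the product of the two standard deviations."""
    c = covariance_matrix(rows)
    sd = [c[j][j] ** 0.5 for j in range(len(c))]
    return [[c[j][k] / (sd[j] * sd[k]) for k in range(len(c))] for j in range(len(c))]

# Made-up data: n = 4 observations of p = 3 variables.
data = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 1.0], [4.0, 3.0, 2.5]]
C = covariance_matrix(data)
R = correlation_matrix(data)
for row in C:
    print([round(v, 3) for v in row])
```

The diagonal of `C` holds the variances, the off-diagonal entries hold the covariances, and `R` has ones on its diagonal.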

A Short Excursion into Matrix Algebra

What is a matrix?

Operations on matrices

Transpose

Properties

Some important properties

Other particular operations

Eigenvalues and Eigenvectors

Singular value decomposition

Multivariate Data

Data for which each observation consists of values for more than one variable

For example: each observation is a measure of the expression level of gene i in tissue j

Usually displayed as a data matrix

Biological profile data

The data matrix

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

n observations (rows) for p variables (columns): an n × p matrix

Contingency tables

Arise when observations on two categorical variables are cross-classified.

Entries in each cell are the number of individuals with the corresponding combination of variable values

Eye colour    Hair colour
              Fair    Red    Medium    Dark
Blue           326     38       241     110
Medium         343     84       909     412
Dark            98     48       403     681
Light          688    116       584     188
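One way to read such a table is to compare the observed counts with the counts expected if eye colour and hair colour were independent. The sketch below uses the counts from the table above; the chi-square statistic is a standard summary that the slides do not mention, so treat it as an added assumption.

```python
table = {
    "Blue":   [326, 38, 241, 110],
    "Medium": [343, 84, 909, 412],
    "Dark":   [98, 48, 403, 681],
    "Light":  [688, 116, 584, 188],
}
hair = ["Fair", "Red", "Medium", "Dark"]

row_tot = {eye: sum(counts) for eye, counts in table.items()}
col_tot = [sum(counts[j] for counts in table.values()) for j in range(len(hair))]
n = sum(row_tot.values())

# Expected count under independence: (row total * column total) / n.
expected = {eye: [row_tot[eye] * col_tot[j] / n for j in range(len(hair))]
            for eye in table}

# Chi-square statistic: large values indicate departure from independence.
chi2 = sum((table[e][j] - expected[e][j]) ** 2 / expected[e][j]
           for e in table for j in range(len(hair)))
print(n, col_tot, round(chi2, 1))
```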

Multivariate data analysis

Exploratory Data Analysis

Data analysis that emphasizes the use of informal graphical procedures not based on prior assumptions about the structure of the data or on formal models for the data

Data = smooth + rough, where the smooth is the underlying regularity or pattern in the data. The objective of EDA is to separate the smooth from the rough with minimal use of formal mathematical or statistical methods

Reduce dimensionality without losing much information

Overview of the techniques

Factor analysis
Principal components analysis
Correspondence analysis
Discriminant analysis
Cluster analysis

Factor analysis

A procedure that postulates that the correlations between a set of p observed variables arise from the relationship of these variables to a small number k of underlying, unobservable, latent variables, usually known as common factors where k<p

Principal components analysis

A procedure that transforms a set of variables into new ones that are uncorrelated and account for decreasing proportions of the variance in the data

The new variables, named principal components (PCs), are linear combinations of the original variables

PCA

If the first few PCs account for a large percentage of the variance (say >70%), then we can display the data in a graphic that depicts the original observations quite well
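For two variables, the principal components can be written in closed form from the 2 × 2 covariance matrix. The sketch below is an illustration only: the function name `pca_2d` and the simulated data are assumptions, and real analyses with p > 2 variables would normally rely on a linear algebra library.

```python
import math
import random

def pca_2d(points):
    """PCA of 2-D points via the eigen-decomposition of their covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)            # Var(X1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)            # Var(X2)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)   # Cov(X1, X2)
    half_gap = math.hypot((a - c) / 2.0, b)
    lam1, lam2 = (a + c) / 2.0 + half_gap, (a + c) / 2.0 - half_gap
    # Eigenvector of [[a, b], [b, c]] for lam1 (fall back to an axis if b == 0).
    v = (b, lam1 - a) if b != 0 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(v[0], v[1])
    pc1 = (v[0] / norm, v[1] / norm)
    pc2 = (-pc1[1], pc1[0])                                         # orthogonal direction
    scores = [((p[0] - mx) * pc1[0] + (p[1] - my) * pc1[1],
               (p[0] - mx) * pc2[0] + (p[1] - my) * pc2[1]) for p in points]
    return (lam1, lam2), (pc1, pc2), scores

# Simulated data: two noisy variables driven by one latent factor t.
rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(500)]
pts = [(t + rng.gauss(0.0, 0.3), 2.0 * t + rng.gauss(0.0, 0.3)) for t in latent]

(lam1, lam2), (pc1, pc2), scores = pca_2d(pts)
explained = lam1 / (lam1 + lam2)
print(f"PC1 explains {explained:.1%} of the variance")
```

Here the first PC carries most of the variance, so plotting the first score alone would already depict the observations well.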

Example

Correspondence Analysis

A method for displaying relationships between categorical variables in a scatter plot

The new factors are combinations of rows and columns

A small number of these derived coordinate values (usually two) are then used to allow the table to be displayed graphically

Example: analysis of codon usage and gene expression in E. coli (McInerny, 1997)

A gene can be represented by a 59-dimensional vector (universal code): one value for each codon that has at least one synonym

A genome consists of hundreds (thousands) of these genes

Variation in the variables (RSCU values) might be governed by only a small number of factors

For each gene and each codon i, calculate RSCU = (observed number of codon i) / (expected number of codon i)

Codon usage in bacterial genomes

Evidence that all synonymous codons were not used with equal frequency: Fiers et al., 1975, A-protein gene of bacteriophage MS2, Nature 256, 273-278

UUU Phe  6    UCU Ser  5    UAU Tyr  4    UGU Cys  0
UUC Phe 10    UCC Ser  6    UAC Tyr 12    UGC Cys  3
UUA Leu  8    UCA Ser  8    UAA Ter  *    UGA Ter  *
UUG Leu  6    UCG Ser 10    UAG Ter  *    UGG Trp 12

CUU Leu  6    CCU Pro  5    CAU His  2    CGU Arg  7
CUC Leu  9    CCC Pro  5    CAC His  3    CGC Arg  6
CUA Leu  5    CCA Pro  4    CAA Gln  9    CGA Arg  6
CUG Leu  2    CCG Pro  3    CAG Gln  9    CGG Arg  3

AUU Ile  1    ACU Thr 11    AAU Asn  2    AGU Ser  4
AUC Ile  8    ACC Thr  5    AAC Asn 15    AGC Ser  3
AUA Ile  7    ACA Thr  5    AAA Lys  5    AGA Arg  3
AUG Met  7    ACG Thr  6    AAG Lys  9    AGG Arg  4

GUU Val  8    GCU Ala  6    GAU Asp  8    GGU Gly 15
GUC Val  7    GCC Ala 12    GAC Asp  5    GGC Gly  6
GUA Val  7    GCA Ala  7    GAA Glu  5    GGA Gly  2
GUG Val  9    GCG Ala 10    GAG Glu 12    GGG Gly  5
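The RSCU values can be computed from such counts one amino acid at a time. The sketch below assumes the usual convention that the expected count is the amino acid's total divided equally among its synonymous codons; the slides do not spell this out, and the function name `rscu` is illustrative. The leucine counts are read from the MS2 A-protein table above.

```python
def rscu(counts):
    """counts: dict codon -> observed count for ONE amino acid's synonymous codons."""
    total = sum(counts.values())
    expected = total / len(counts)      # assumed: equal usage of all synonyms
    return {codon: n / expected for codon, n in counts.items()}

# Leucine codon counts from the MS2 A-protein gene table.
leu = {"UUA": 8, "UUG": 6, "CUU": 6, "CUC": 9, "CUA": 5, "CUG": 2}
leu_rscu = rscu(leu)
for codon, value in leu_rscu.items():
    print(codon, round(value, 2))
```

Values above 1 mark codons used more often than expected under equal usage, and values below 1 mark codons used less often.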

Multivariate reduction

Attempts to reduce a high-dimensional space to a lower-dimensional one. In other words, it tries to simplify the data set.

Many of the variables might co-vary, so there might be only one, or a few, sources of variation in the dataset

A gene can be represented by a 59-dimensional vector (universal code)

A genome consists of hundreds (thousands) of these genes

Variation in the variables (RSCU values) might be governed by only a small number of factors

Plot of the two most important axes (figure labels: highly expressed genes, lowly expressed genes, recently acquired genes)

Discriminant analysis

Techniques that aim to assess whether or not a set of variables distinguishes or discriminates between two or more groups of individuals

Linear discriminant analysis (LDA): uses linear functions (called canonical discriminant functions) of the variables giving maximal separation between groups (assumes that the covariance matrices within the groups are the same)

If they are not the same, use quadratic discriminant analysis (QDA)
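For two groups and two variables, a linear discriminant can be sketched with a pooled covariance matrix and a 2 × 2 inverse. This is an illustration under stated assumptions: it uses Fisher's two-group rule with equal priors, and the function names and feature scores are made up rather than taken from the slides.

```python
def mean_vec(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def scatter(pts, m):
    """Sums of squares and products about the mean m, as (sxx, sxy, syy)."""
    sxx = sum((p[0] - m[0]) ** 2 for p in pts)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
    syy = sum((p[1] - m[1]) ** 2 for p in pts)
    return sxx, sxy, syy

def fit_lda(g1, g2):
    """Return the discriminant direction w and the cut-off between the groups."""
    m1, m2 = mean_vec(g1), mean_vec(g2)
    s1, s2 = scatter(g1, m1), scatter(g2, m2)
    dof = len(g1) + len(g2) - 2
    a, b, c = ((s1[i] + s2[i]) / dof for i in range(3))   # pooled covariance
    det = a * c - b * b
    d = (m1[0] - m2[0], m1[1] - m2[1])
    # w = S^-1 (m1 - m2), using the explicit inverse of a 2 x 2 matrix.
    w = ((c * d[0] - b * d[1]) / det, (-b * d[0] + a * d[1]) / det)
    cut = w[0] * (m1[0] + m2[0]) / 2 + w[1] * (m1[1] + m2[1]) / 2
    return w, cut

def classify(p, w, cut):
    """Group 1 if the discriminant score exceeds the cut-off, else group 2."""
    return 1 if w[0] * p[0] + w[1] * p[1] > cut else 2

exons = [(2.0, 2.1), (2.5, 1.8), (3.0, 2.6), (2.2, 2.9), (2.8, 2.2)]      # made-up scores
non_exons = [(0.5, 0.4), (1.0, 0.2), (0.2, 1.1), (0.9, 0.9), (0.4, 0.6)]  # made-up scores
w, cut = fit_lda(exons, non_exons)
print(w, cut, classify((2.4, 2.0), w, cut))
```

Pooling the two scatter matrices is what encodes the equal-covariance assumption; QDA would instead keep one covariance matrix per group.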

Example: Internal Exon prediction

Data: a set of exons and non-exons

Variables: a set of features
donor/acceptor site recognizers
octonucleotide preferences for coding region
octonucleotide preferences for intron interiors on either side

LDA or QDA

Cluster analysis

A set of methods (hierarchical clustering, K-means clustering, ...) for constructing a sensible and informative classification of an initially unclassified set of data

Can be used to cluster individuals or variables
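K-means clustering, one of the methods named above, alternates between assigning points to the nearest centre and recomputing the centres. The sketch below is a plain version of Lloyd's algorithm on 2-D points; the function name, the random initialisation, and the data are illustrative assumptions.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns (centres, labels)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)     # assumed initialisation: k random points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [min(range(k),
                      key=lambda j: (p[0] - centres[j][0]) ** 2 + (p[1] - centres[j][1]) ** 2)
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        new = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new.append((sum(m[0] for m in members) / len(members),
                            sum(m[1] for m in members) / len(members)))
            else:
                new.append(centres[j])  # keep an empty cluster's centre in place
        if new == centres:
            break
        centres = new
    return centres, labels

pts = [(0.1, 0.2), (0.0, -0.1), (0.3, 0.1), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]  # made-up
centres, labels = kmeans(pts, k=2)
print(centres, labels)
```

The same routine clusters variables instead of individuals if each point holds one variable's values.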

Example: Microarray data

Other Methods

Independent component analysis (ICA): similar to PCA, but the components are defined as independent and not only uncorrelated; moreover, they are not necessarily orthogonal and are determined only up to order and scale

Multidimensional Scaling (MDS): a technique that constructs a low-dimensional geometrical representation of a distance matrix (classical MDS is also known as principal coordinates analysis)

Useful books: Data analysis

Useful book: R language