Microarrays & Expression profiling Matthias E. Futschik Institute for Theoretical Biology...

Post on 12-Jan-2016

212 views 0 download

Tags:

transcript

Microarrays & Expression profiling

Matthias E. Futschik

Institute for Theoretical Biology

Humboldt-University, Berlin, Germany

Bioinformatics Summer School, Istanbul-Sile, 2004

Yeast cDNA microarray

The Problem

MicroarraysMicroarraysThousands of simultaneously measured gene activitiesThousands of simultaneously measured gene activities

Genetic networksGenetic networksComplex regulation of gene expressionComplex regulation of gene expression

Medical Medical applicationsapplicationsNew drug discovery by New drug discovery by knowledge discoveryknowledge discovery

Methodology

1. From Image to Numbers: Image-Analysis

2. Well begun is half done: Design of experiments

3. Cleaning up: Preprocessing and Normalisation

4. Go fishing: Significance of Differences in Gene Expression Data

5. Who is with whom: Clustering of samples and genes

6. Gattica becomes alive: Classification and Disease Profiling

7. The whole picture: An integrative approach

What are microarrays? Microarrays consist of localised

spots of oligonucleotides or cDNA attached on glass surface or nylon filter

Microarrays are based on base-pair complementarity

Different production: Spotted microarrays Photolithographicly

synthesised microarrays (Affymetrix)

Different read-outs: Two-channel (or two-

colour) microarrays One-channel (or one-

colour) microarrays

Applications:

Clustering of arrays: finding

new disease subclasses

Clustering of genes:

Co-expression and co-regulation go together enabling functional annotation

Clustering of time series

Reconstruction of Reconstruction of gene networksgene networks ((Reverse engineeringReverse engineering))

Classification of tissue samples and marker gene identification

Typical cDNA microarray experiment

Microarray technology I Two-colour microarray (cDNA and spotted

oligonucleotide microarrays) Probes are PCR products based on a chosen cDNA library or

synthesized oligonucleotides (length 50-70) optimized for specifity and binding properties >> probe design

Probes are mechanically spotted. To control variation of amount of printed cDNA/oligos and spot morphology, reference RNA sample is included. Thus, ratios are considered as basic units for analysing gene expression. Absolute intensities should be interpreted with care.

MOVIE 1: Array production - Galbraith labMOVIE 1: Array production - Galbraith lab

MOVIE 2: Principles – Schreiber lab MOVIE 2: Principles – Schreiber lab

Affymetrix GeneChip technologyAffymetrix GeneChip technology

Production by photolitographyProduction by photolitography

Hybridisation process Hybridisation process and biotin labelíng;and biotin labelíng;Fragmentation aims Fragmentation aims to destroy higher to destroy higher order structures of order structures of cRNAcRNA

Microarray technologies II One-colour microarrays (Affymetrix GeneChips)

Measurement of hybridisation of target RNA to sets of 25-oligonucleotides (probes).

Probes are paired: Perfect match (PM) and mis-match (MM). PM are complementary to the gene sequence of interest. MM include a single nucleotide changed in the middle position of the oligonucleotide. MM serve for controlling of experimental variation and non-specific cross-hybridisation. Thus, MMs constitute internal references (on the probe site).

Average (PM-MM) delivers measure for gene expression. However, different methods to calculates summary indices exist (e.g. MAS,dchip, RMA...)

From Images to NumbersImpurities, Impurities, overlapping spotsoverlapping spots

Donut-shaped spots, Donut-shaped spots, Inhomogeneous Inhomogeneous intensitiesintensities

Local BackgroundLocal Background

Spatial biasSpatial bias

Image Analysis

1. Localisation of spots: locate centres after (manual) adjustment of grid

2. Segmentation: classification of pixels either as signal or background. Different procedures to define background.

3. Signal extraction: for each spot of the array, calculates signal intensity pairs, background and quality measures.

Data acquisition

Scans of slides are usually stored in 16-bit TIFF files. Thus, scanned intensities vary between 0 and 216.

Scanning of separate channels can adjusted by selection of laser power and gain of photo-multiplier. Common aim: balancing of channels. Common problems: avoiding of saturation of high intensity

spots while increasing signal to noise ratios.

Data acquisition

Image processing software produces a variety of measures: Spot intensities, local background, spot morphology measures. Software vary in computational approaches of image segmentation and read-out.

Open issues: local background correction derivation of ratios for spot intensities flagging of spots, multiple scanning procedures

Design of experimentTwo channel microarrays incorporate a reference sample. Choice of reference determines follow-up analysis.

A1

A2 A

3

RReference design:

● All samples are co-hybridised with common reference sample

● Advantage: Robust and scalable. Length of path of direct comparison equals 2.● Disadvantage: Half of the measurements are made on reference sample which is commonly of little or no interest

Alternative Designs:

● Dye-swap design: each comparison includes dye-swap to distinguish dye effects from differential expression (important for direct labelling method)

● Loop-design: No reference sample is involved. Increase of efficiency is, however, accompanied with a decrease of robustness.

● Latin-square design: classical design to separate effects of different experimental factors

A1

A2

A3

Comparison of designsComparison of designs::

Yang and Speed, Yang and Speed, Nature genetics reviews, 2002Nature genetics reviews, 2002

Define before experiment Define before experiment what differences (contrasts) what differences (contrasts) should be determined to should be determined to make best use out of make best use out of (usually) limited number of (usually) limited number of arraysarrays

Sources of variation in gene expression measurements using microarrays● Microarray platform

1. Manufacturing or spotting process

1. Manufacturing batch

2. Amplification by PCR and purification

3. Amount of cDNA spotted, morphology of spot and binding of cDNA to substrate

● mRNA extraction and preparation

1. Protocol of mRNA extraction and amplification

2. Labelling of mRNA

Sources of variation in gene expression measurements using microarrays II

1. Hybridisation

1. Hybridisation conditions such as temperature, humidity, hyb-buffer

● Scanning● scanner● scanning intensity and PMT settings

1. Imaging

1. software

2. flagging, background correction,...

Design of experiment

Important issues for DOE:

● Technical replicates assess variability induced by experimental procedures.

● Biological replicates (assess generality of results).

● Number of replicates depends on desired sensitivity and sensibility of measurements and research goal.

● Randomisation to avoid confounding of experimental factors. Blocking to reduce number of experimental factors.

Design of experiment ● Control spots

● assess reproducibility within and between array, background intensity, cross-hybridisation and/or sensitivity of measurement● can consists of empty spots or hybridisation-buffer, genomic DNA, foreign DNA, house-holding genes

● Foreign (non-cross-hybridising) cDNA can be 'spiked in'. Use of dilution series can assess sensitivity of detecting differential expression by 'ratio controls'.

Validation of results:● by other experimental techniques (e.g. Northern, RT-PCR)● by comparison with independent experiments.

Data storage

Microarrays experiments produce large amounts of data: data storage and accessibility are of major importance for the follow-up analysis.

Not only signal values have to be stored but also:● TIFF images and imaging read-out● Gene annotation● Experimental protocol● Information about samples● Results of pre-processing, normalization and further analysis

WorkflowWorkflow

Transformation

Plots,Histograms,and Tables

Experiment Explorer

Data Export

Plug-In Architecture

QuantitatedData Matrix

Image(s)

Sample Annotation

Biomaterial

Reporter Annotation

Array

Well Annotation

Hybridization

Filter

Analysis

Annotation Type

Extracts Array Design

Experiment

AdministratorUsersDefinitionsArray LIMS

Labeled Extracts

In fact, data of the whole experiment has to be stored, In fact, data of the whole experiment has to be stored, and its internal structure i.e. which sample was extracted by and its internal structure i.e. which sample was extracted by what methods was hybridised on which batch of slides bywhat methods was hybridised on which batch of slides bywhom and when?whom and when?

Type of storage ● Flat file: for small-scale experiments and one-off analysis. Example NCBI - GenBank

● Database: necessary for large scale experiments.

Microarray DBs typically relational, SQL-based models. Their internal relational structure should reflect the experiment structure.

BASE 1.x Software DiagramBASE 1.x Software Diagram

USER

WEB SERVER (Apache)

EXT. DATABASES

Extracts

Labeled Extracts

Experiment

Transformation

Plots,Histograms,and Tables

Experiment Explorer

Data Export

Plug-In Architecture

DATABASE LAYER (MySQL)

QuantitatedData Matrix

Image(s)

Sample Annotation

USER

APPLICATION LAYER (PHP, C++)

Biomaterial

Reporter Annotation

Array

Well Annotation

Array Design

Hybridization

Filter

Analysis

Annotation Type

These types of databasesThese types of databaseswill become essientialwill become essientialtools for post-genomic analysis.tools for post-genomic analysis.

Data storage II

Standardization of information by MGED:● MIAME (minimal information about a microarray experiment)● MAGE-ML based on XML for data exchange

Sharing microarray data: Sharing microarray data: • NCBINCBI• EBIEBI• StanfordStanford• Journals ie NatureJournals ie Nature

Take-home messages I

• Remember: Microarrays are shadows of genetic Remember: Microarrays are shadows of genetic networksnetworks• Watch out for experimental variationWatch out for experimental variation•The complexity of microarray experiments should The complexity of microarray experiments should be reflected in the structure used for data storagebe reflected in the structure used for data storage

Roadmap: Where are we?

Good news: We are almost ready for ‘higher` data analysis !

Data-Preprocessing

Background subtraction: May reduce spatial artefacts May increase variance as both

foreground and background intensities are estimates ( “arrow-like” plots MA-plots)

Preprocessing: Thresholding: exclusion of low

intensity spots or spots that show saturation

Transformation: A common transformation is log-transformation for stabilitation of variance across intensity scale and detection of dye related bias.

Log-transformationLog-transformation

The problem:

Are all low intensity genes down-regulated?? Are all genes spotted on the

left side up-regulated ??

Hybridisation model• Microarrays do not assess gene activities directly, but indirectly by measuring the fluorescence intensities of labelled target cDNA hybridised to probes on the array. So how do we get what we are interested in? Answer: Find the relation between flourescance spot intensities and mRNA abundance!

• Explicitly modelling the relation between signal intensities and changes in gene expression can separate the measured error into systematic and random errors.

• Systematic errors are errors which are reproducible and might be corrected in the normalisation procedure, whereas random errors cannot be corrected, but have to be assessed by replicate experiments.

Hybridisation model for two-colour arrays

I = N(θ) A + ε

A first attempt:For two-colour microarrays, the fundamental variables are the fluorescence intensities of spots in the red (Ir) and the green channel (Ig). These intensities

are functions of the abundance of labelled transcripts Ar/g. Under ideal

circumstances, this relation of I and A is linear up to an additional experimental error ε:

N : normalisation factor determined by experimental parameters θ such as the laser power amplification of the scanned signal.

Frequently, however, this simple relation does not hold for microarrays due to effects such as intensity background, and saturation.

Hybridisation model for two-colour arrays

( )( )

r r r r

g g g g

I k AR

I k A

M - κ (θ) = D + ε

Let`s try a more flexible approach based on ratio R (pairing of intensities reduces variablity due to spot morphology)

After some calculus (homework! I will check tomorrow) we get

How do we get κ (θ)?

κ: non-linear normalisation factors (functions) dependent on experimental parameters.

D = log2(Ar/Ag)

M = log2(Ir/Ig)

Normalization – bending data to make it look nicer...

Normalization describes a variety of data transformations aiming to correct for experimental variation

Within – array normalization Normalization based on 'householding genes' assumed to be

equally expressed in different samples of interest

Normalization using 'spiked in' genes: Ajustment of intensities so that control spots show equal intensities across channels and arrays

Global linear normalisation assumes that overall expression in samples is constant. Thus, overall intensitiy of both channels is linearly scaled to have value.

Non-linear normalisation assumes symmetry of differential

expression across intensity scale and spatial dimension of array

Normalization by local regression

Regression of local intensity >> residuals are 'normalized' log-fold changes

Common presentation:MA-plots: A = 0.5* log2(Cy3*Cy5)

M = log2(Cy5/Cy3)>> Detection of intensity-dependentbias!

Similarly, MXY-plots for detection of spatial bias. M, and thus κ, is function of A, X and Y

Normalized expression changesshow symmetry across intensity scale and slide dimension

Normalisation by local regression and problem of model selection

Example: Correction of intensity-dependent bias in data by loess (MA-regression: A=0.5*(log

2(Cy5)+log

2(Cy3)); M = log

2(Cy5/Cy3);

Raw data Local regression

Corrected data

However, local regression and thus correction depends onchoice of parameters.

Correction: M- M

reg

? ??

Different choices of paramters lead to different normalisations.

Optimising by cross-validation and iteration

Iterative local regression by locfit (C.Loader): 1) GCV of MA-regression 2) Optimised MA-regression 3) GCV of MXY-regression 4) Optimised MXY-regression

2 iterations generally sufficient

GCV of MA

Optimised local scaling

Iterative regression of M and spatial dependent scaling of M: 1) GCV of MA-regression 2) Optimised MA-regression 3) GCV of MXY-regression 4) Optimised MXY-regression 5) GCV of abs(M)XY-regression 6) Scaling of abs(M)

Comparison of normalisation procedures

MA-plots: 1) Raw data 2) Global lowess (Dudoit et al.) 3) Print-tip lowess (Dudoit et al.) 4) Scaled print-tip lowess (Dudoit et al.) 5) Optimised MA/MXY regression by locfit 6) Optimised MA/MXY regression wit1h scaling

=> Optimised regression leads to a reduction of variance (bias)

Comparison II: Spatial distribution

=> Not optimally normalised data show spatial bias

MXY-plots canindicate spatial bias

MXY-plots:

Averaging by sliding window reveals un-corrected bias

Distribution of median M within a window of 5x5 spots:

=> Spatial regression requires optimal adjustment to data

Statistical significance testing by permutation test

M

Original distribution

What is the probabilty to observe a median M within a window by chance?

Comparison with empirical distribution=> Calculation of probability(p-value) using Fisher’s method

Mr1

Mr2

Mr3

Randomised distributions

Statistical significance testing by permutation test

Histogram of p-values for a window size of 5x5Number of permutation: 106

p-values for negative M p-values for positive M

Statistical significance testing by permutation test

MXY of p-values for a window size of 5x5Number of permutation: 106

Red: significant positive MGreen: significant negative M

M. Futschik and T. Crompton, Genome Biology, 2004

Normalization makes results of different microarrays comparable

Between-array normalization scaling of arrays linearly or e.g. by quantile-quantile

normalization Usage of linear model e.g. ANOVA or mixed-models:

yijg

= µ + Ai + D

j + AD

ig + G

g + VG

kg + DG

jg + AG

ig + ε

ijg

Classical hypothesis testing:

1) Setting up of null hypothesis H0(e.g. gene X is not

differentially expressed) and alternative hypothesis H

a (e.g. Gene X is differentially expressed)

2) Using a test statistic to compare observed values with values predicted for H

0.

3) Define region for the test statistic for which H0 is

rejected in favour of Ha.

Going fishing: What is differentially expressed

Significance of differential gene expression

Typical test statistics1) Parametric tests e.g. t-test, F-test assume

a certain type of underlying distribution 2) Non-parametric tests (i.e. Sign test,

Wilcoxon rank test) have less stringent assumptions

t = ( t = (

P-value: probability of occurrence by chance

Two kinds of errors in hypothesis testing:1) Type I error: detection of false positive2) Type II error: detection of false negative

Level of significance :α = P(Type I error)Power of test : 1- P(Type II error) = 1 – β

Detection of differential expression What makes differential

expression differential expression? What is noise?

Foldchanges are commonly used to quantify differenitial expression but can be misleading (intensity-dependent).

Basic challange: Large number of (dependent/correlated) variables compared to small number of replicates (if any).

Can you spot the interesting spots?

Criteria for gene selection

Accuracy: how closely are the results to the true values

Precision: how variable are the results compared to the true value

Sensitivity: how many true posítive are detected Specificity: how many of the selected genes are

true positives.

>> Multiple testing required with large number of tests but small number of replicates.

>> Adjustment of significance of tests necessary

Example: Probability to find a true H

0 rejected for α=0.01 in 100 independent

tests: P = 1- (1-α) 100 ~ 0.63

Multiple testing poses challanges

Compound error measures:

Per comparison error rate: PCER= E[V]/N Familiywise error rate: FWER=P(V≥1) False discovery rate: FDR= E[V/R]

N: total number of tests V: number of reject true H

0 (FP)

R: number of rejected H (TP+FP) Aim to control the error rate: 1) by p-value adjustment (step-down procedures: Bonferroni, Holm, Westfall-Young, ...) 2) by direct comparison with a background distribution (commonly generated by random permuation)

Alternative approach:Treat spots as replicates

For direct comparison: Gene X is significantly differentially expressed if corresponding fold change falls in chosen rejection region. The parameters of the underlying distribution are derived from all or a subset of genes.

Since gene expression is usually heteroscedastic with respect to abundance,variance has to be stabalised by local variance estimation. Alternatively, local estimates of z-score can be derived.

Constistency of replicationsCase study: SW480/620 cell line comparisonSW480: derived from primary tumourSW 620: derived from lymphnode metastisis of same patient Model for cancer progression

Experimental design:• 4 independent hybridisations,• 4000 genes• cDNA of SW620 Cy5-labelled,• cDNA of SW480 Cy3-labelled. This design poses a problem! Can you spot it?

Usage of paired t-test

d

dt

sd: average differences of paired intensitiessd: standard deviation of d

p-value < 0.01 Bonferroni adjusted p-value < 0.01

Robust t-test

2 2 2 tot gene gene exp

Adjust estimation of variance:Compound error model:

Gene-specific error

Experiment-specificerror

This model avoidsselection of control spots

M. Futschik et al, Genome Letters, 2002

Another look at the results

Significant genes as red spots:3 σ-error bars do not overlap with M=0 axis. That‘s good!

Take-home messages II

Don‘t download and analyse array data blindly Visualise distributions: the eye is astonishing

good in finding interesting spots Use different statistics and try to understand the

differences Remember: Statistical significance is not

necessary biological significance!

Approaches of modelling in molecular biology

Bottom-upapproach

Top-down approach

System-wide measurements

Set of measurements of single component

Underlying molecular mechanism

Network of interactions of single components

Visualisation

• Important tool for detection of patterns remains visualization of results.

• Examples are MA-plots, dendrograms, Venn-diagrams or projection derived fro multi-dimensional scaling.

• Do not underestimate the ability of the human eye

Principal component analysisPCA: • linear projection of data onto majorprincipal components defined by the eigenvectors of the covariance matrix.• PCA is also used for reducing the dimensionality of the data.• Criterion to be minimised: square of the distance between the original and projected data. This is fulfilled by the Karhuven-Loeve transformation

Px Px

1( )( )

1

ti i

i

x xn

C

P is composed by eigenvectors of the covariance matrix

Example: Leukemia data sets by Golub et al.: Classification of ALL and AML

Sammon`s mapping:• Non-linear multi-dimensional scaling such as Sammon's mapping aim to optimally conserve the distances in an higher dimensional space in the 2/3-dimensional space.• Mathematically: Minimalisation of error function E by steepest descent method:

Multi-linear scaling

Example: DLBCL prognosis – cured vs featal cases

2( )1

Nij ij

Ni j ijiji j

D dE

DD

Clustering: Birds of a feather flock together Clustering of genes

Co-expression indicates co-regulation: functional annotation

Clustering of time series

Clustering of array: finding new subclasses in sample-

space

Two-way clustering: Parallel clustering of samples and

genes

Clustering methodsUnsupervised classification of genes and/or samples. Movtivation: Co-expression indicates co-regualtion

General division into hierarchical and partitional clustering

Hierachical clustering • can be divisive or agglomerative producing nested clusters. • Results are usually visualised by tree structures dendrogram. • Clustering depends on the linkage procedure used: single, complete, average, Ward,... • A related family of methods are based on graph-theoretical approach (e.g. CLICK).

Profilíng of breast cancer by Perou et al

Example for hierarchical clustering

A Alizadeh et al, Nature. 2000Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling

Clustering methods II

Partitional clustering • divides data into a (pre-)chosen number of classes.• Examples: k-means, SOMs, fuzzy c-means, simulated annealing, model-based clustering, HMMs,...• Setting the number of clusters is problematic

Cluster validity: • Most cluster algorithms always detect clusters, even in random data. • Cluster validation approaches address the number of existing clusters. • Approaches are based on objective functions, figures of merits, resampling, adding noise ....

Hard clustering vs. soft clusteringHard clustering:

Based on classical set theory

Assigns a gene to exactly one cluster

No differentiation how well gene is represented by cluster centroid

Examples: hierachical clustering, k-means, SOMs, ...

Soft clustering:

Can assign a gene to several cluster

Differentiate grade of representation (cluster membership)

Example: Fuzzy c-means, HMMs, ...

K-means clustering

1

1

0 1

1

0

ij

kk N

hc ij iji

N

ijj

i j

M U R j

N i

2( ) i ji j

E d x c

Partitional clustering is frequently based on the optimisation of a given objective function. If the data is given as a set of N dimensional vectors, a common objective function is the square error function:

where d is the distance metric and cj is the centre of clusters.

• Partitional clustering splits the data in k partitions with a given integer k. • Partition can represented by a partition matrix U that contains the membership values μij of each object i for each cluster j.• For clustering methods, which is based on classical set theory, clusters are mutually exclusive. This leads to the so called hard partitioning of the data.

Hard partions are defined as

k is the number of clusters and N is the number of data objects.

K-means algorithm

• Initiation: Choose k random vectors as cluster centres cj ;

• Partitioning: Assign xi to if for all k with k ;

• Calculation of cluster centres cj based on the partition derived in

step 2: The cluster centre cj is defined as the mean value of all

vectors within the cluster; • Calculation of the square error function E; • If the chosen stop criterion is met, stop; otherwise continue with

step 2.

For the distance metric D, the Euclidean distance is generally chosen.

Hard clustering is sensitive to noise

Example data set:

Yeast cell cylce data by Cho et al.

Standard procedure is pre-filteringof genes based on variation due to noise sensitvity of hard clustering.However, no obvious threshold exists!(Heyer et al.: ca. 4000 genes, Tavazoe et al.: 3000genes, Tamayo et al.: 823 genes)

=> Risk of essential losing information

=> Need of noise robust clustering method

Standard deviation of expression

Soft clustering is more noise robust

Hard clustering always detects clusters, even in random data

Soft clustering differentiates clusterstrength and, thus, can avoid detection of 'random' clusters

Genes with high membership valuescluster together inspite of added noise

Differentiation in cluster membership allows profiling of cluster cores● A gene can be assigned to several clusters

1. Each gene is assigned to a cluster with a membership value between 0 and 1

2. The membership values of a gene add up to one

3. Genes with lower membership values are not well represented by the cluster centroid

4. Expression of genes with high membership values are close to cluster centroid

=> Clusters have internal structures

Membership value > 0.5 Membership value > 0.7

Hard clustering

Varitation in cluster parameter reveals cluster stability

Variation of fuzzification parameter m determines 'hardness' of clustering:m → 1: Fuzzy c-means clustering becomes equivalent to k-means m → ∞: All genes are equivally assigned

to all clusters.

By variation of m clusters can be distinguished by their stability.

Weak cluster lose their core

Strong clusters maintain their core forincreasing m

m=1.1 m=1.3

Periodic and aperiodic clusters

Periodic clusters of yeast cell cycle:

Aperiodic clusters:

=> Aperiodic clusters were generally weaker than periodic clusters

Global clustering structure

c-means clustering allows definitionof overlap of clusters i.e. how manygenes are shared by two clusters. This enables to define a similarity measure between clusters. Global clustering structures can be visualised by graphs i.e. edges representing overlap.

Increasing number of clusters

Non-linear 2D-projection by Sammon's Mapping

=> Sub-clustering reveals sub-structures

M. Futschik and B. Carlisle, Noise robust, soft clustering of gene expression data (in preparation)

Classifaction of microarray data

Many diseases involve (unknown) complex interaction of multiple genes, thus ``single gene approach´´ is limited genome-wide approaches may reveal this interactions

To detect this patterns, supervised learníng techniques from pattern recognition, statistics and artificial intelligence can be applied.

The medical applications of these “arrays of hope” are various and include the identification of markers for classification, diagnosis, disease outcome prediction, therapeutic responsiveness and target identification.

Gattica – the Art of Classifaction

In contrast to clustering approaches, algorithms for supervised classification are based on labelled data.

Labels assign data objects to a predefined set of classes.

Frequently the class distributions are not known, so the learning of the classifiers is inductive.

Task for classification methods is the correct assignment of new examples based on a set of examples of known classes. Generalisation

Class 1

Class 2

?

?

?

Challanges in classification of microarray data

• Microarray data inherit large experimental and biological variances

• experimental bias + tissue hetrogenity• cross-hybridisation• ‘bad design’: confounding effects

• Microarray data are sparse • high-dimensionality of gene (feature) space• low number of samples/arrays• Curse of dimensionality

• Microarray data are highly redundant•Many genes are co-expressed, thus their expression is strongly

correlated.

Classification I: Models K-nearest neighbour

Simple and quick method

Decision Trees Easy to follow the classification process

Bayesian classifier Inclusion of prior knowledge possible

Neural Networks No model assumed

Support Vector Machines Based on statistical learning theory; today`s state of art.

Criteria for classification

Accuracy: how closely are the results to the true values

Precision: how variable are the results compared to the true value

Sensitivity: how many true posítive are detected

Specificity: how many of the selected genes are true positives.

1-SensitivityS

pec

ific

ity

ROC curve

Getting a good team: Feature Extraction - gene selection

Selection of genes based on : Parametric tests e.g. t-test Non-parametric tests eg. Wilcoxon-Rank test But the 11 best players do not necessary form the best team!

Selection of groups of genes which act as good: Sensitivity measure

Genetic Algorithms

Decision tree

SVD

Bayesian Classifiers

 

( ) ( ) ( ) P C P C Px x x

 

The most fundamental classifier in statistical pattern recognition is the Bayes classifier which is directly derived from the Bayes theorem.

Suppose a vector x belongs to one of k classes. The probability P(C,x) of observing x belonging to class C is

P(x|C): conditional probability for x given class C is observed

P(C): prior probability for class C

Similarly, the joint probability P(C,x) can be expressed by

P(C|x): conditional probability for C given object x is observed P(x): is the prior probability of observing x.

( ) ( ) ( ) P C P C P Cx x

Bayesian Classifiers

( ) ( ) ( ) ( ) P C P C P C Px x x

( ) ( )( )

( )

P C P CP C

P

xx

x

( ) ( ) j kP C P Cx x

 

 

Since equations 1 and 2 describe the same probability P(C,x), we can derive

for all classes Cj with . This rule constitutes the Bayes classifier.

This is the famous Bayes theorem, which can be applied for classification as follows. We assigned x to class Cj if

Decision trees• Stepwise classification into two classes• Comprehensible ( in contrast to black box approaches)• Overfitting can occur easily• Complex (non-linear) interactions of genes may not be reflected in tree structure

Aritificial neural networks

 

1

1

i iiw x b

ye

•ANN were originally inspired by the functioning of biological neurons.

•Two major components: neurons and connections between them.

•A neuron receives inputs xi and determines the output y based on an

activation function. An example of a non-linear activation function is the sigmoid function

• Multi-layer perceptrons are hierachically structured and trained by backpropagation

Support vector machines

( )

( )

( )( )( )

( )

( )( )

(.)( )

( )

( )

( )( )

( )

( )

( )( )

( )

Feature spaceInput space

( ) ( ) dK x y x y

•SVMs are based on statistical learning theory and belong to the class of kernel based methods.

•The basic concept of SVMs is the transformation of input vectors into a highly dimensional feature space where a linear separation may be possible between the positive and negative class members

Example study: Tumour/Normal classificationMotivation: Colon cancer should be detected as early as possible to avoid invasive treatment.

Data: Study by Alon et al. based on expression profiling of 60 samples with Affymetrix GeneChips containing over 6000 genes.

Method: Adaptive neural networks

M.Futschik et al, AI in medicine, 2003

Network structure can be translated in

linguistic rules

Take-home messages III

1. Visualisation of results helps their interpretation

2. A lot of tools for classification are around – know their strength and weakness

The Whole Picture

??

MetabolitesMetabolitesProtein FunctionsProtein Functions

Protein StructuresProtein Structures

Medical expertMedical expert knowledgeknowledge

MicroarraysMicroarraysDNA DNA

Chromosomal LocationChromosomal Location

Networks of Genes

Gene expression is regulated

by complex genetic networks

with a variety of interactions on

different levels (DNA, RNA, protein),

on many different

time scales (seconds to years)

and at various locations

(nucleus, cytoplasma, tissue).

Models: Boolean networks Bayesian networks Differential equations

Onthologíes: Categorising and labeling objects

Onthology: restricted vocabulary with structuríng rules describing relationship between terms

Representation as graph:•Terms as nodes•Edges as rules•Transitivity rule•Parent and child nodes

Bard and Rhee,Nature Gen. Rev., 2004

Gene Onthology

Consists of three independent onthologies:• molecular function e.g. enzyme• biological process e.g. signal transduction• cellular component

Gene sets / clusters can easily be analysed based on gene onthology terms

Mapping of gene expression to chromosomal location

Significance analysisof chromosomal location of differential gene expression (SW620 vs SW480)

1

0

1

k

i

s g s

i n iP

g

s

The p-value for finding at least k from a total of s significant differentially expressed genes within a cytoband window is

where g is the total number of genes with cytoband location and n the total number of genes within the cytoband window.

Relating number of gene copies and geneexpression I

Pollack et al., PNAS, 2001• Study of chromosomal abnormalities in breast cancer• usage of genomic DNA and cDNA arrays• hotspots of increased number of gene copies

Relating number of gene copies and gene expression II

Correlation of gene copy number and transcriptional levéls detected

Correlation of mRNA and protein abundanceIdeker et al, Science, 2001• Study of yeast galactose-utilisation pathway• Use of microarrays, quantative proteomics and databases of protein interactions• Dissection of transcriptional and post-translational control

New genes and interactions in GAL pathway were found

Av. Correlation of 0.63 between transcript and protein levels

Genome, Transcriptome and Translatome

Greenbaum et al, Genome Research, 2001• Interrelating geneome, transcriptome and translatome• Similar compostion based on functional categories of translatome and transcriptome • Differing composition of genome

Linking expression to drug effectiveness

Relevance networks:Butte et al, PNAS, 2000• Correlation between growth

inhibition by drugs and gene expression for NCI60 cell lines

• Gene expression based on Affymetrix chips (7245 genes)

• 5000 anticancer agents• Significance testing based

on randomisation• Significant link between

LCP1 and NSC 624044

Combinining gene expression data with clinical parameters

Diffuse large B-cell Lymphoma (DLBCL)• Most common lymphoid malignancy in adults• Treatment by multi-agent chemotherapy• In case of a relapse: bone marrow transplantation• Clinical course of DLBCL is widely variable: Only 40% of

treatments successful => Accurate outcome prediction is crucial for stratifying

patients for intensified therapy

Case study: DLBCL

Current prognostic model: International Prediction Index (IPI)Alternative: Microarray-based prediction of treatment outcome

DLBCL study by Shipp et al. (Nature Medicine, 2002, 8(1):68-74)

• expression profiles of 58 patients using Hu6800 Affymetrix chips (corresponding to ca. 6800 genes)• Prediction accuracy of outcome using leave-one-out procedure: Knn: 70.7%; WV: 75.9%; SVM:77.6%

Sammon's Mapping of top 22 genes ranked by signal-to-noise: Large overlap between classes with ‘cured’ and ‘fatal’ outcome.

Low correlation of gene expression with classes: Only 3 genes with correlation coef > 0.4

<=> Leukemia study by Golub et al : 263 genes

<=> Colon cancer study by Alon et al.: 215 genes

Limitation of microarray approach: Only mRNA abundance is measured.However, many different factors (patient and tumour related) determine outcome of therapy: Integration might be necessary!

DLBCL outcome prediction is challenging!

Prognostic models for DLBCL

Clinical predictor:• IPI based on five risk factors (age, tumour stage, patient’s

performance, number of extranodal sites, LDH concentration)• Survival rate determined in clinical study:

Low risk: 73%, low-intermediate: 51%, intermediate-high: 42%, high: 26%

• Conversion of IPI into Bayesian classifier using survival rates as conditional probabilities P:

e.g. Sample belongs to class ‘cured’ if P(‘cured’|IPI)> P(‘fatal’|IPI)

=> Overall accuracy of 73.2%.

Prognostic models for DLBCL

Microarray-based predictor:

•Identifies clusters by unsupervised learning

•Supervised classification

•EfuNN as five layered neural network

•Based on 17 genes using signal-to-noise criterion

•Accuracy using leave-one-out: 78.5%

Independence of predictors

C1

C28

3311

C1 C2

U

Microarray-based predictor

IPI-based predictor

Set theory:

For 19 of 56 samples complementary (8 samples only correctly classified by IPI-based predictor, 11 only by microarray-based predictor)

Setting upper threshold to 92.6% (52 out of 56 samples)

Mutual Information x,y = (0,1) : microarray-based, IPI-based predictions of class (cured -fatal)

P,Q : probability of microarray-, IPI-based predictions

R(x,y): joint probability of predictions by microarray- and IPI-based predictors

I = Σ x,y R(x,y) log2(R(x,y)/[P(x)Q(y)]) ~ 0.05

=> Microarray-based and IPI-based predictor statistically independent!

Hierarchical modular decision system

IPI

EFuNNBayesianClassifier

Class A Class B

Combined Prediction

1-

1

21

2

Class A/Class B

Decision layer

Class unit layer

Predictor modulelayer

1-

Three layered hierarchical modelPredictor module layer consisting of independently trained predictorsClass unit layer integrating prediction by single predictorsDecision layer producing final prediction

Model parameters: α, β1,β2

Training: error backpropagation with parallel trainingof neural network

Integration of predictions in class units: weighted sum

Validation: leave-one-out

Improved prediction by integration

Significantly improved accuracy of modular hierarchical system (parameter values:α=0.4, β1= 0.8, β2= 0.75)

73.2%IPI

78.5%EFuNN

87.5%Hierarchical model

AccuracyModel

Constructive and destructiveinterference:Both microarray-based and clinical predictor are necessary for improvement

Identification of areas of expertise

IPI Microarrayß = ß = 1 ß 2

=> Data stratification can be used to detect areas of expertise e.g. IPI risk group low, low-intermediate, intermediate- high for microarray-based classifier=> Identification by data stratification can indicate limits of models e.g. IPI risk group high for microarray-based classifier

Stratification of data set by IPI category

M.E. Futschik et al., Prediction of clinical behaviour and treatment for cancers, OMJ Applied Bioinformatics, 2003

TEŞEKKÜR !