Download - Joaquín Dopazo. CNIO. SOM y SOTA: Clustering methods in the analysis of massive biological data.

Joaquín Dopazo. CNIO.

SOM y SOTA: Clustering methods in the analysis of massive biological data

From genotype to phenotype. (only the genetic component)

>protein kunase

acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....

…code for the structure of proteins...

…which accounts for the function...

…providing they are expressed in the proper moment and place...

…in cooperation with other proteins…

…conforming complex interaction networks...

Genes in the DNA... …whose final

effect can be different because of the variability.

Between 30.000 and 100.000.

40-60% display alternative splicing

Each protein has an average of 8 interactions

A typical tissue is expressing among 5000 and 10000

genes

That undergo post-

translational modifications

More than 3 millon SNPs have been

mapped

>protein kunase

acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....

Pre-genomics scnario in the lab

Sequence

Molecular databases

Search results

Phylogenetic tree

alignment

Conserved region

MotifMotif

databases

Information

Secondary and tertiary protein structure

Bioinformatics tools for pre-genomicsequence data analysis

The aim:

Extracting as much information as possible for one single data

Genome sequencing

2-hybrid systems,Mass spectrometry for protein complexes

Post-genomic vision

ExpressionArrays

Literature, databases

Who?

Where, when and how much?

What do we know?

In what way?

SNPs

And who else?

genes

interactions

Post-genomic vision

Gene expression

Information

polimorphisms

InformationDatabases

The new tools:Clustering

Feature selectionMultiple correlation

Datamining

Brain and computers

Brain computes in a different way from digital computers

Structural components

Brain Computers

Neurons (Ramón y Cajal, 1911) chips

Speed slow (10-3s) fast (10-9s)

Procesing units 10 billion neurons, massively

interconnected (60 trillion synapses)

One or few

Brain is a highly complex, nonlinear, and parallel computerNeurons are organized to perform complex computations many times faster than the fastest computers.

Neural Networks

What is a neural network?

A Neural network is a massively parallel distributed processor able to store experiential knowledge and to make it available for use.

It resembles to brain in two respects:

Knowledge is acquired by the network through a learning process

Interneuron connection strengths (synaptic weights) are used to store the knowledge.

Neural Net classifiers

Supervised Unsupervised

Perceptrons Kohonen SOM

Growing cell structures

SOTA

Summing junction

Output

w1

w2

Activation function

uk

Threshold

k

X1

X2

:

:

:

:

Xp

Input signals

w2

:

X1

X2

:

:

:

:

Xp

Supervised learning: the perceptron

Supervised learning : training

Summing junction

w1

w2

Activation function

11111110000000

00000001111111

Training set

up down

u

u =x1*w1+x2*w2W1 = 1

W2=0

up =1down = 0

1 if u 1

0 if u<1u

Supervised learning: application

Summing junction

1

0

X

1

0 u

u =1*1+0*0= 1 1 if u 1

0 if u<1u

1 up

Supervised vs. Unsupervised learning

Supervised:The structure of the data is known beforehand. After a training process in

which the network learns how to distinguish among classes, you use the

network for assigning new items to the predefined classes

Unsupervised:

The structure of the data is not know beforehand. The network learns how data are distributed among classes, based on a function of distance

Sensory pathways in the brain are organised in such a way that its arrangement reflects some physical characteristic of the external stimulus being sensed.

Brain of higher animals seems to contain many kind of “maps” in the cortex.

In visual areas there are orientation and color maps In the auditory cortex there exist the so-called tonotopic

maps• The somatotopic maps represents the skin surface

The basis

Unsupervised learning:Kohonen self-organizing maps

Kohonen SOMThe causes of self-organisation

Kohonen SOM mimics two-dimensional arrangements of neurons in the brain. Effects leading to spatially organized maps are:

• Spatial concentration of the network activity on the neuron best tuned to the present input

• Further sensitization of the best matching neuron and its topological neighborhood.

Kohonen SOMThe topology

Two-dimensional network of cells with a hexagonal or rectangular (or other) arrangement.

x1, x2..xn

input

Outputnodes

Neighborhood

Neighborhood of a cell is defined as a time dependent function

Kohonen SOM The algorithm

Step 1. Initialize nodes to random values.Set the initial radius of the neighborhood.

Step 2. Present new input: Compute distances to all nodes. Euclidean distances are commonly used

Step 3. Select output node j* with minimum distance dj. Update node j* and neighbors. Nodes updated for the neighborhood NEj*(t) as:

wij(t+1) = wij(t) + (t)(xi(t) - wij(t)); for j NEj*(t)(t) is a gain term than decreases in time.

Step4 Repeat by going to Step 2 until convergence.Input

Kohonen SOM Limitations

Arbitrary number of clustersThe number of clusters is arbitrarily fixed from the beginning. Some clusters can remain unoccupied.

Lack of the tree structureThe use of a two-dimensional structure for the net makes impossible to recover a tree structure that relates the clusters

and subclusters among them.

Non proportional clustering

Clusters are made based on the number of items so, distances among them are not proportional.

Growing cell structures

Growing cell structures produce distribution-preserving mapping. The number of clusters and the connections among them are dynamically assigned during the training of the network.

Kohonen SOM produce topology-preserving mapping. That is, the topology of the network and the number of clusters are fixed before to the training of the network

Insertion and deletion of neurons•After a fixed number of adaptations, every neuron q with a signal counter value hq > hc (a threshold) is used to create a new neuron

The direct neighbor f of the neuron q having the greatest signal counter value is used to insert a new neuron between them.

The new neuron is connected to preserve the topology of the network.

• Signal counter values are adjusted in the neighborhood

Similarly, neurons with signal counter values below a threshold can be removed.

Growing cell structuresNetwork dynamics

Similar to the used by Kohonen SOM, but with several important differences:

Adaptation strength is constant over time (b and n for the best matching cell and its neighborhood).Only the best-matching cell and its neighborhood are adapted.Adaptation implies the increment of signal counter for the best-matching cell and the decrement in the remaining cells.New cells can be inserted and existent cells can be removed in order to adapt the output map to the distribution of the input vectors.

Growing cell structures Limitations

Arbitrary number of clustersThe number of clusters is arbitrarily fixed from the beginning. Some clusters can remain unoccupied.

Lack of the tree structureThe use of a two-dimensional structure for the net makes impossible to recover a tree structure that relates the clusters

and subclusters among them.

Non proportional clustering

Clusters are made based on the number of items so, distances among them are not proportional.

Many molecular data have different levels of structured information.

Ej, phylogenies, molecular population data, DNA expression data (to same

extent), etc.

But, sometimes behing the real world there is some hierarchy...

A

B

C

D

20 items

Simulation

Mapping a hierarchical structure using a non-hierarchical method (SOM)

A,B

C,DE,F

G

H

Self Organising Tree Algorithm (SOTA)

A new neural network designed to deal with data that are related among them by means of a binary tree topology

Dopazo & Carazo, 1997, J. Mol. Evol.44:226-233

Derived from the Kohonen SOM and the growing cell structures but with several key differences:

The topology of the network is a binary tree.Only growing of the network is allowed.The growing mimics a speciation event, producing two new neurons from the most heterogeneous neuron.Only terminal neurons are directly adapted by the input data, internal neurons are adapted through terminal neurons.

SOTA:The algorithm

Input

SOTA, unlike other hierarchical methods, grows from top to bottom until an appropriate level of variability is reached

The Self Organising Tree Algorithm (SOTA) is a hierarchical divisive method based on a neural network

Step 1. Initialize nodes to random values.

Step 2. Present new input: Compute distances to all terminal nodes.

Step 3. Select output node j* with minimum distance dj. Update node j* and neighbors. Nodes updated for the neighborhood NEj*(t) as:

wij(t+1) = wij(t) + (t)(xi(t) - wij(t)); for j NEj*(t)(t) is a gain term than decreases in time.

Step 4 Repeat by going to Step 2 until convergence.

Step 5 Reproduce the node with highest variability.

Dopazo, Carazo (1997)

Herrero, Valencia, Dopazo (2001)

SOTA algorithm (neighborhood)

w

a s

Initial state

Actualization Growing and different neighborhoods

SOTA algorithm

EPOCH

YES

NO

NO

Cycle

Cycleconvergence?

Addcell

winner

sister mother

Initialisesystem

Networkconvergence?

YES

End

Cycle: repeat as many epochs as necessary to get convergence in the present state of the network. Convergence: relative error of the network falls below a threshold

When a cycle finishes, the network size increases: two new neurons are attached to the neuron with higher resources. This neuron becomes mother neuron and does not receive direct inputs any more.

Applications

Sequence analysis

Microarray data analysis

Population data analysis

• Massive data

• Information

•redundancy

Sequence analysis in the genomics era

Codification

Indeterminaciones.

R = {A ó G}; N= {A ó G ó C ó T}

Vectores de N x 4 (nucleótidos) o N x 20 (aminoácidos); más una componente para representar las deleciones

Other possible codifications: Frequencies of dipeptides or dinucleotides

Updating the neurons

Missing

Updated

Classifying proteins with SOM

Ferrán, Pflugfelder and Ferrara (1994) Self-organized neural maps of human protein sequences. Prot. Sci. 3:507-521.

Cy5 Cy3

cDNA arrays Oligonucleotide arrays

Gene expression analysis using DNA microarrays

Research paradigm is shifting

Hipothesis driven: one PhD per gene

Ignorance driven: paralelized automated approach

GbKb MbTb - Pb

sequences DNA arrays

Expression patterns

1 2 3 4

Patterns can be:• time series• dosage series• different patients• different tissues• etc.

Different DNA-arrays

The data

Characteristics of the data:

• Low signal to noise ratio

• High redundancy and intra-gene correlations

• Most of the genes are not informative with respect to the trait we are studying (account forunrelated physiological conditions, etc.)

• Many genes have no annotation!!

Genes(thousands)

Experimental conditions (from tens up to no more than a few houndreds)

A B C

Expression profile of a gene across the experimental conditions

Expression profile of all the genes for a experimental condition (array)

Different classes of experimental conditions, e.g. Cancer types, tissues, drug treatments, time survival, etc.

Co-expressing genes... What do they have in common?

Different phenotypes...

What genes are responsible for?

Genes interacting in a network (A,B,C..)...

How is the network?

A

B C

DE

Genes of a classWhat profile(s) do they display? and... Are there more

genes?

Molecular classification of samples

Study of many conditions.Types of problemsCan we find groups of

experiments with similar gene expression profiles?

Unsupervised

Supervised

Reverse engineering

100/1 = 100 2

10/1 = 10 1

1/1 = 1 0

1/10 = 0.1 -1

1/100 = 0.01 -2

What are we measuring?

red

greenA (background)

B (expression)

Differential expression

B/A

Problem: is asymetrical solution: log-transformation

transformation

Distance

A

B

C

Differences

B<=>C

Correlation

A<=>B

Clustering methods

deterministic

NN

Non hierarchical Hierarchical

K-means, PCA UPGMA

SOM SOTA

Provides different levels of

information

Robust

Properties

Aggregative hierarchical clustering

CLUSTER

Relationships among profiles are represented by branch lengths.

Links recursively the closest pair of profiles until the complete hierarchy is reconstructed

Allows to explore the relationship among groups of related genes at higher levels.

Problems• lack of robustness

• solution may be not unique

• dependent on the data order

Aggregative hierarchical clustering

What level would you consider for defining a cluster?

Subjective cluster definition

Properties of neural networks for molecular data classification•Robust

• Manage real-world data sets containing noisy, ill-defined items with irrelevant variables and outliers

• Statistical distributions do not need to be parametric

• Fast and scalable to big data sets

Kohonen SOMApplied to microarray data

t1 t2 .. tp

sample1 a11 a12 .. a1p


: : : :

samplen an1 an2 .. anp

Group11

samplea, sampleb ...

Group12


Group13


Group14


node44 x1 x2 .. xp

node34 y1 y2 .. yp

node24 z1 z2 .. zp

Kohonen SOMmicroarray patterns

gen1 gen2 .. genp



: : : :

samplen an1 an2 .. anp

Kohonen SOMExample

Response of human fibroblasts to serum

Iyer et al., 1999 Science 283:83-87

The Self Organising Tree Algorithm (SOTA)

SOTA,opposite to other clustering methods, grows from top to bottom: growing can be stopped at the desired level of variability

SOTA nodes are weighted averages of every item under the node

SOTA

The Self Organising Tree Algorithm (SOTA) is a divisive hierarchical method based on a neural network

Advantages of SOTA

Clusters´patterns

Each node of the tree has a pattern associated wich corresponds to the cluster under itself.

Divisive algorithmSOTA grows from top to bottom: growing can be stopped at any desired level of variability.

Distribution preserving

The number of clusters depends on the variability of the data.

Robusteness against noise

TEST

From low resolution...

...to high resolution.Where stop growing?

exp1 exp2 .. expp

gen1 a11 a12 .. a1p

gen2 a21 a22 .. a2p

: : : :

genn an1 an2 .. anp

exp1 exp2 .. expp

gen1 a14 a17 .. a1q

gen2 a23 a21 .. a2r

: : : :

genn an9 an4 .. ans

95%

Permutation test for cluster size definition

TEST

are dij > 0.4?

SOTA/SOM vs classical clustering (UPGMA)

SOTA vs SOM

Acuracy: the silhouette

Axx

iji

ji

AxxxdA

ia,

),(1||

1)(

Cx

jiAC

i

j

xxdC

Cxd ),(||

1),(

),()( Cxdminib iAC

)W(1)()(

1)(

)()()(

)()(

)()(

)O(1)()(

1)(

)()()(

)()(

)()(

)(),()()(

)( rongiaib

iaiaib

isibia

ibiaBx

Kibia

ibiaib

isibia

ibiaAx

ibiamaxiaib

isi

i

Is the object closer to its cluster or to the closer cluster?