+ All Categories
Home > Documents > Introduction to Bioinformatics

Introduction to Bioinformatics

Date post: 26-Jan-2016
Category:
Upload: liora
View: 24 times
Download: 1 times
Share this document with a friend
Description:
Introduction to Bioinformatics. Biological Networks Department of Computing Imperial College London March 11, 2010 Lecture hours 16-17. Nataša Pržulj [email protected]. Data Clustering. find relationships and patterns in the data get insights in underlying biology - PowerPoint PPT Presentation
Popular Tags:
63
Introduction to Bioinformatics Biological Networks Department of Computing Imperial College London March 11, 2010 Lecture hours 16-17 Nataša Pržulj natasha@imperial 1
Transcript
Page 1: Introduction to Bioinformatics

Introduction to Bioinformatics

Biological NetworksDepartment of ComputingImperial College London

March 11, 2010

Lecture hours 16-17

Nataša Prž[email protected]

1

Page 2: Introduction to Bioinformatics

Data Clustering

• find relationships and patterns in the data • get insights in underlying biology• find groups of similar genes/proteins/samples

• deal with numerical values of biological data• they have many features (not just colour)

2

Page 3: Introduction to Bioinformatics

Data Clustering

3

Page 4: Introduction to Bioinformatics

• There are many possible distance metrics between objects

• Theoretical properties of distance metrics, d:– d(a,b) >= 0– d(a,a) = 0– d(a,b) = 0 a=b– d(a,b) = d(b,a) – symmetry– d(a,c) <= d(a,b) + d(b,c) – triangle inequality

Data Clustering

4

Page 5: Introduction to Bioinformatics

Example distances:• Euclidean (L2) distance• Manhattan (L1) distance• Lm: (|x1-x2|m+|y1-y2|m)1/m

• L∞: max(|x1-x2|,|y1-y2|)• Inner product: x1x2+y1y2

• Correlation coefficient• For simplicity we will concentrate on Euclidean

and Manhattan distances

Clustering Algorithms

5

Page 6: Introduction to Bioinformatics

Distance/Similarity matrices:• Clustering is based on distances –

distance/similarity matrix:• Represents the distance between objects• Only need half the matrix, since it is symmetric

Clustering Algorithms

6

Page 7: Introduction to Bioinformatics

Hierarchical Clustering:1. Scan dist. matrix for the minimum

2. Join items into one node

3. Update matrix and repeat from step 1

Clustering Algorithms

7

Page 8: Introduction to Bioinformatics

Hierarchical Clustering:Distance between two points – easy to compute Distance between two clusters – harder to compute:

1. Single-Link Method / Nearest Neighbor2. Complete-Link / Furthest Neighbor3. Average of all cross-cluster pairs

Clustering Algorithms

8

Page 9: Introduction to Bioinformatics

Clustering AlgorithmsHierarchical Clustering:1. Example: Single-Link (Minimum) Method:

Resulting Tree, orDendrogram:

9

Page 10: Introduction to Bioinformatics

Clustering AlgorithmsHierarchical Clustering:2. Example: Complete-Link (Maximum) Method:

Resulting Tree, orDendrogram:

10

Page 11: Introduction to Bioinformatics

Clustering AlgorithmsHierarchical Clustering:In a dendrogram, the length of each tree branch represents the distancebetween clusters it joins.

Different dendrograms may arise when different Linkage methods are used.

11

Page 12: Introduction to Bioinformatics

K-Means Clustering:• Basic Ideas : use cluster centroids (means) to represent cluster.

• Assigning data elements to the closet cluster (centroid).

• Goal: Minimize intra-cluster dissimilarity.

Clustering Algorithms

12

Page 13: Introduction to Bioinformatics

K-Means ClusteringExample:

Clustering Algorithms

13

Page 14: Introduction to Bioinformatics

• Differences between the two clustering algorithms:– Hierarchical Clustering:

• Need to select Linkage Method• to perform any analysis, it is necessary to partition the dendrogram into k disjoint

clusters, cutting the dendrogram at some point. A limitation is that it is not clear how to choose this k

– K-means: Need to select K– In both cases: Need to select distance/similarity measure

Clustering Algorithms

14

Page 15: Introduction to Bioinformatics

Nearest neighbours clustering:

Clustering Algorithms

15

Page 16: Introduction to Bioinformatics

Nearest neighbours clustering: Example:

Clustering Algorithms

Pros and cons:1.No need to know the number of clusters to discover beforehand (different than

in k-means and hierarchical).2. We need to define the threshold . 16

Page 17: Introduction to Bioinformatics

k-nearest neighbors clustering -- classification algorithm, but we use the idea here to do clustering:– For point v, create the cluster containing v and top k

closest points to v (closest training examples);e.g., based on Euclidean distance.

– Do this for all points v. – All of the clusters are of size k, but they can overlap.

• The challenge: choosing k.

Clustering Algorithms

17

Page 18: Introduction to Bioinformatics

What is Classification?The goal of data classification is to organize and

categorize data in distinct classes. A model is first created based on the data distribution. The model is then used to classify new data. Given the model, a class can be predicted for new data.

Example:

Application: medical diagnosis, treatment effectiveness analysis, protein function prediction, interaction prediction, etc.

18

Page 19: Introduction to Bioinformatics

Classification = Learning the Model

19

Page 20: Introduction to Bioinformatics

What is Clustering?

• There is no training data (objects are not labeled)• We need a notion of similarity or distance (over what features?)• Should we know a priori how many clusters exist?• How do we characterize members of groups?• How do we label groups?

20

Page 21: Introduction to Bioinformatics

Supervised and Unsupervised

Supervised Classification = Classification We know the class labels and the number of classes

Unsupervised Classification = Clustering We do not know the class labels and may not know the

number of classes

21

Page 22: Introduction to Bioinformatics

Classification vs. Clustering

(we can compute it without the need of knowing the correct solution)

22

Page 23: Introduction to Bioinformatics

Classification is a 3-step process

1. Model Construction (Learning):– Each tuple is assumed to belong to a predefined

class, as determined by one of the attributes, called the class label.

– The set of all tuples used for construction of the model is called the training set.

– The model is presented in forms such as:• Classification rules (IF-THEN statements)• Mathematical formulae

23

Page 24: Introduction to Bioinformatics

2. Model Evaluation (Accuracy):Estimate accuracy rate of the model based on a test set.

• The known label of test sample is compared withthe classified result from the model.

• Accuracy rate is the percentage of test set samplesthat are correctly classified by the model.

• Test set is independent of training set, otherwise over-fitting will occur.

Classification is a 3-step process

24

Page 25: Introduction to Bioinformatics

3. Model Use (Classification):The model is used to classify unseen objects.• Give a class label to a new tuple• Predict the value of an actual attribute

Classification is a 3-step process

25

Page 26: Introduction to Bioinformatics

k-Nearest Neighbours (k-NN) Classification

• In k-nearest-neighbour classification, the training dataset is used to classify each member of a "target" dataset.

• There is no model created during a learning phase but the training set itself.

• It is called a lazy-learning method.• Rather than building a model and referring to it

during the classification, k-NN directly refers to the training set for classification.

26

Page 27: Introduction to Bioinformatics

Simple Nearest Neighbour Classification

• It is very simple. The training is nothing more than sorting the training data and storing it in a list.

• To classify a new entry, this entry is compared to the list to find the closest record, with value as similar as possible to the entry to classify (i.e., nearest neighbour). The class of this record is simply assigned to the new entry.

• Different measures of similarity or distance can be used.

27

Page 28: Introduction to Bioinformatics

Simple Nearest Neighbour Classification

28

Page 29: Introduction to Bioinformatics

• The k-Nearest Neighbour is a variation of Nearest Neighbour. Instead of looking for only the closest record to the entry to classify, we look for the k records closest to it.

• To assign a class label to the new entry, from all the labels of the k nearest records we take the majority class label.

• Nearest Neighbour is a case of k-Nearest Neighbours with k=1.

k-Nearest Neighbours (k-NN) Classification

29

Page 30: Introduction to Bioinformatics

k-Nearest Neighbours (k-NN) Classification

30

Page 31: Introduction to Bioinformatics

Correctness of methods

• Clustering is used for making predictions:– E.g., protein function, involvement in disease,

interaction prediction.

• Other methods are used for classifying the data (have disease or not) and making predictions.

• Have to evaluate the correctness of the predictions made by the approach.

• Commonly used method for this is ROC Curves.

31

Page 32: Introduction to Bioinformatics

Definitions (e.g., for PPIs):• A true positive (TP) interaction:

– an interaction exists in the cell and is discovered by an experiment (biological or computational).

• A true negative (TN) interaction: – an interaction does not exist and is not discovered by an

experiment.• A false positive (FP) interaction:

– an interaction does not exist in the cell, but is discovered by an experiment.

• A false negative (FN) interaction: – an interaction exists in the cell, but is not discovered by an

experiment.

Correctness of methods

32

Page 33: Introduction to Bioinformatics

• If TP stands for true positives, FP for false positives, TN for true negatives, and FN for false negatives, then:

• Sensitivity = TP / (TP + FN)• Specificity = TN / (TN + FP)

• In other words, sensitivity measures the fraction of items out of all possible ones that truly exist in the biological system that our method successfully identifies (fraction of correctly classified existing items), while

• specificity measures the fraction of the items out of all items that truly do not exist in the biological system for which our method correctly determines that they do not exist (fraction of correctly classified non-existing items).

• Thus, 1-specificity measures the fraction of all non-existing items in the system that are incorrectly identified as existing.

Correctness of methods

33

Page 34: Introduction to Bioinformatics

34

• Receiver Operating Curves (ROC curves) provide a standard measure of the ability of a test to correctly classify objects.

• E.g., the biomedical field uses ROC curves extensively to assess the efficacy of diagnostic tests in discriminating between healthy and diseased individuals.

• ROC curve is a graphical plot of the true positive rate, i.e., sensitivity, vs. false positive rate, i.e., (1−specificity), for a binary classifier system as its discrimination threshold is varied (see above for definitions).

• It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).

• The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test; the closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test. The area under the curve (AUC) is a measure of a test’s accuracy.

ROC Curve

Page 35: Introduction to Bioinformatics

ROC curveExample:• Embed nodes of a PPI network into 3-D Euclidean unit box

(use MDS – not required in this class, see reference in the footer if interested)• Like in GEO, choose a radius r to determine node connectivity• Vary r between 0 and sqrt(3) (diagonal of the box)

– r=0 makes a graph with no edges (TP=0, FP=0)– r=sqrt(3) makes a complete graph (all possible edges, FN=TN=0)

• For each r in [0, sqrt(3)]: – measure TP, TN, FP, FN – compute sensitivity and 1- specificity– draw the point

• Set of these points is the ROC curve

Note: • For r=0, sensitivity=0 and 1-specificity=0, since TP=0, FP=0 (no edges)• For r=sqrt(3), sensitivity=1 and 1-specificity=1 (or 100%), since FN=0, TN=0

D. J. Higham, M. Rasajski, N. Przulj, “Fitting a Geometric Graph to a Protein-Protein Interaction Network”, Bioinformatics, 24(8), 1093-1099, 2008.

Sensitivity = TP / (TP + FN)Specificity = TN / (TN + FP)

35

Page 36: Introduction to Bioinformatics

• Molecular networks: the backbone of molecular activity within the cell

• Comparative approaches toward interpreting them -- contrasting networks of different species molecular types under varying conditions

• Comparative biological network analysis and its applications to elucidate cellular machinery and predict protein function and interactions.

36

Types of Biological Network Comparisons:

Sharan and Ideker (2006) Sharan and Ideker (2006) Nature BiotechnologyNature Biotechnology 2424(4): 427-433(4): 427-433

Page 37: Introduction to Bioinformatics

• Data on molecular interactions are increasing exponentially• This flood of information parallels that seen for genome

sequencing in the recent past• Presents exciting new opportunities for understanding

cellular biology and disease in the future• The challenge: develop new strategies and theoretical

frameworks to filter, interpret and organize interaction data into models of cellular function

37

Types of Biological Network Comparisons:

Page 38: Introduction to Bioinformatics

38

• As with sequences, comparative/evolutionary view is a powerful base to address this challenge

• We will survey computational methodology it requires and biological questions it may be able to answer

• Conceptually, network comparison is the process of contrasting two or more interaction networks, representing different: • species, • conditions, • interaction types, or • time points

Types of Biological Network Comparisons:

Page 39: Introduction to Bioinformatics

3939

• Based on the found similarities, answer a number of fundamental biological questions:• Which proteins, protein interactions and groups of

interactions are likely to have equivalent functions across species?

• Can we predict new functional information about proteins and interactions that are poorly characterized?

• What do these relationships tell us about the evolution of proteins, networks, and whole species?

Types of Biological Network Comparisons:

Page 40: Introduction to Bioinformatics

404040

• Noise in the data – screens for PPI detection report large numbers of false-positives and negatives:• Which interactions represent true binding events?• Confidence measures on interactions should be taken into

account before network comparison.• However, since a false-positive interaction is unlikely to be

reproduced across the interaction maps of multiple species, network comparison itself increases confidence in the set of molecular interactions found to be conserved

Types of Biological Network Comparisons:

Page 41: Introduction to Bioinformatics

41414141

Types of Biological Network Comparisons:

41

• Such questions have motivated 3 types (modes) of comparative methods:1. Network alignment2. Network integration3. Network querying

Page 42: Introduction to Bioinformatics

424242424242

1. Network alignment:• The process of overall comparison of two or more

networks to identify regions of similarity and dissimilarity• Commonly applied to detect subnetworks that are

conserved across species and hence likely to present true functional modules

Types of Biological Network Comparisons:

Page 43: Introduction to Bioinformatics

43434343434343

2. Network integration:• The process of combining several networks

encompassing interactions of different types over the same set of elements (e.g., PPI and genetic interactions)

• To study their interrelations• Can assist in predicting protein interactions and

uncovering protein modules supported by interactions of different types

• The main conceptual difference from network alignment: the integrated networks are defined on the same set

of elements.

Types of Biological Network Comparisons:

Page 44: Introduction to Bioinformatics

4444444444444444

3. Network querying:• A given network is searched for subnetworks that are

similar to a subnetwork query of interest• This basic database search operation is aimed at

transferring biological knowledge within and across species

Summary:

Types of Biological Network Comparisons:

Page 45: Introduction to Bioinformatics

454545

1. Network Alignment

45

• Finding structural similarities between two networks

Page 46: Introduction to Bioinformatics

46464646

1. Network Alignment

46

RecallSubgraph isomorphism (NP-complete):• An isomorphism is a bijection between nodes of two networks G and H

that preserves edge adjacency

• Exact comparisons inappropriate in biology (biological variation)• Network alignment

– More general problem of finding the best way to “fit” G into H even if G does not exist as an exact subgraph of H

A 1B 2C 4D 3

G HC D 3 4

A B 1 2

Page 47: Introduction to Bioinformatics

47G H

1. Network Alignment

Page 48: Introduction to Bioinformatics

484848484848

1. Network Alignment

48

• Methods vary in these aspects:A. Global vs. local

B. Pairwise vs. multiple

C. Functional vs. topological information

Page 49: Introduction to Bioinformatics

4949494949

1. Network Alignment

49

• Methods vary in these aspects:A. Global vs. local

B. Pairwise vs. multiple

C. Functional vs. topological information

A. Local alignment: Mappings are chosen independently for each region of

similarity Can be ambiguous, with one node having pairings in

different local alignments Example algorithms:

PathBLAST, NetworkBLAST, MaWISh, Graemlin

Page 50: Introduction to Bioinformatics

505050505050

1. Network Alignment

50

• Methods vary in these aspects:A. Global vs. local

B. Pairwise vs. multiple

C. Functional vs. topological information

A. Global alignment: Provides a unique alignment from every node in the smaller

network to exactly one node in the larger network May lead to inoptimal machings in some local regions Example algorithms:

GRAAL, IsoRank, IsoRankN, Extended Graemlin

Page 51: Introduction to Bioinformatics

51515151515151

1. Network Alignment

51

• Methods vary in these aspects:A. Global vs. local

B. Pairwise vs. multiple

C. Functional vs. topological information

B. Pairwise alignment: Two networks aligned Example algorithms:

GRAAL, PathBLAST, MaWISh, IsoRankMultiple alignment: More than two networks aligned Example algorithms:

Greamlin, Extended PathBLAST, Extended IsoRank

Page 52: Introduction to Bioinformatics

5252525252525252

1. Network Alignment

52

• Methods vary in these aspects:A. Global vs. local

B. Pairwise vs. multiple

C. Functional vs. topological information

C. Functional information Information external to network topology used, e.g., protein sequence, to define

“similarity” between nodes Careful: mixing different biological data types, that might agree or contradict Example algorithms:

all except for GRAAL; some can exclude sequence, e.g. IsoRank, but then perform poorly

Topological information Only network topology used to define node “similarity” Good -- since it answers how much and what type of biological information can

be extracted from topology only

Page 53: Introduction to Bioinformatics

53535353

1. Network Alignment

53

• Since in general, the problem is computationally hard (generalizing subgraph isomorphism), heuristic approaches are devised.

• Key algorithmic components of network alignment algorithms:1. Scoring function – measures:

the similarity of each subnetwork to a predefined structure of interest

the level of conservation of this structure across the subnetworks being scored

the alignment quality

2. Search procedure: methods for searching for conserved subnetworks of interest

Page 54: Introduction to Bioinformatics

54

1. Measuring the alignment quality1) Edge correctness (EC)

• Percentage of edges in G that are aligned to edges in H

G H

1. Network Alignment

Page 55: Introduction to Bioinformatics

55

1. Measuring the alignment quality1) Edge correctness (EC)

• Percentage of edges in G that are aligned to edges in H

G H

EC=1/2=50%

1. Network Alignment

Page 56: Introduction to Bioinformatics

5656

1. Network Alignment

Page 57: Introduction to Bioinformatics

5757

1. Measuring the alignment quality3) Can the alignment be attributed to chance?

• Compare it with a random alignment of the two networks

• Compare it with the amount of alignment found between model networks (random graphs) of the size of the data

4) Biological quality of the alignment:• Do the aligned protein pairs have the same biological

function?

• Does the alignment identify evolutionary conserved functional modules?

• How much of the network alignment supported by sequence alignment? Note: We should not expect networks and sequences to give identical results!!

1. Network Alignment

Page 58: Introduction to Bioinformatics

585858

2. The search procedure – commonly used: How is “similarity” between nodes defined?

• Using information external to network topology, e.g., the sequence alignment score for protein pairs

• Using only network topology, e.g., node degree, common neighbourhood, graphlet degree vectors (e.g., GRAAL – GRAph Aligner – uses GDVs)

Use the most “similar” nodes across the two networks as “anchors” or

“seed nodes” in the two networks “Extend around” the seed nodes – “seed and extend method” – greedy

approach

1. Network Alignment

Page 59: Introduction to Bioinformatics

59595959

2. The search procedure – commonly used: How is “similarity” between nodes defined?

• Using information external to network topology, e.g., sequence alignment score of the proteins

• Using only network topology, e.g., node degree, common neighbourhood, graphlet degree vectors

Use the most “similar” nodes across the two networks as anchors or

“seed nodes” in the two networks “Extend around” the seed nodes – “seed and extend method” – greedy

approach

1. Network Alignment

Page 60: Introduction to Bioinformatics

60606060606060

1. Network Alignment

60

In a network alignment, we have interaction: matches (conserved interactions) – contribute to EC mismatches (gaps):

1. Insertions 2. Deletions

G1G2

If G2 evolved from G1:Gaps: B’1C’ vs BC (insertion of node 1)

C’2D’ vs CD (insertion of node 2)

If G1 evolved from G2:Gaps: B’1C’ vs BC (deletion of node 1)

C’2D’ vs CD (deletion of node 2)

Page 61: Introduction to Bioinformatics

6161616161

1. Network Alignment

61

One heuristic approach: A merged representation of the two networks being compared is

created, called a “network alignment graph” in which:• Nodes represent sets of molecules, one from each network• Edges represent conserved molecular interactions across different

networks• The alignment is simple when there exists a 1-to-1 correspondence

between molecules across the two networks, but in general there may be a complex many-to-many correspondence

Then apply a greedy algorithm for identifying conserved subnetworks embedded in the “network alignment graph”

Sharan and Ideker (2006) Sharan and Ideker (2006) Nature BiotechnologyNature Biotechnology 2424(4): 427-433(4): 427-433

Page 62: Introduction to Bioinformatics

626262626262

1. Network Alignment

62

“Network alignment graph”:

Sharan and Ideker (2006) Sharan and Ideker (2006) Nature BiotechnologyNature Biotechnology 2424(4): 427-433(4): 427-433

Page 63: Introduction to Bioinformatics

636363636363

1. Network Alignment

63

“Network alignment graph”:• Facilitates the search for conserved network regions• E.g.,

conserved dense clusters of interactions may indicate protein complexes,

conserved linear paths may correspond to signalling pathways• Finding conserved pathways was done by finding “high-scoring”

paths in the alignment graph (Kelley et al., PNAS, 2003): PathBLAST Identified five regions conserved across PPI networks of yeast S.

Cerevisiae and Helicobacter pylori Later extended to detect conserved protein clusters rather than

paths


Recommended