CSCE555 Bioinformatics

Date post: 21-Jan-2016
Description:
CSCE555 Bioinformatics. Lecture 6: Sequence Alignment (Part III). Meeting: MW 4:00PM-5:15PM, SWGN2A21. Instructor: Dr. Jianjun Hu. Course page: http://www.scigen.org/csce555. University of South Carolina, Department of Computer Science and Engineering, 2008, www.cse.sc.edu.
Transcript
Page 1: CSCE555 Bioinformatics

CSCE555 Bioinformatics

Lecture 6: Sequence Alignment (Part III)

Meeting: MW 4:00PM-5:15PM, SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555

University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu

Page 2: CSCE555 Bioinformatics

Roadmap

Hashing-function-based quick search
Heuristic algorithms: FASTA, BLAST
Multiple sequence alignment algorithm:
◦ Clustal W
Summary

Page 3: CSCE555 Bioinformatics

Hash Table for Quick Search

[Figure: a table of (name, age) records (Smith 18, Alice 19, Bob 18, Lucy 28, Alicia 32, Dan 30, Ron 32, George 32), shown twice and annotated with the lookup costs O(n), O(log(n)), and O(1).]

Page 4: CSCE555 Bioinformatics

Searching

Consider the problem of searching an array for a given value
◦ If the array is not sorted, the search requires O(n) time
    If the value isn't there, we need to search all n elements
    If the value is there, we search n/2 elements on average
◦ If the array is sorted, we can do a binary search
    A binary search requires O(log n) time
    About equally fast whether the element is found or not
◦ It doesn't seem like we could do much better
    How about an O(1), that is, constant time search?
    We can do it if the array is organized in a particular way
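The three search costs can be compared side by side. The following is a minimal sketch, not part of the original slides: the function names are illustrative, the ages are the ones listed in the hash-table figure, and Python's built-in dict stands in for "an array organized in a particular way".

```python
from bisect import bisect_left

def linear_search(items, target):
    """O(n): scan every element until the target is found."""
    for index, value in enumerate(items):
        if value == target:
            return index
    return -1

def binary_search(sorted_items, target):
    """O(log n): repeatedly halve a sorted list (via bisect_left)."""
    index = bisect_left(sorted_items, target)
    if index < len(sorted_items) and sorted_items[index] == target:
        return index
    return -1

ages = [18, 19, 18, 28, 32, 30, 32, 32]   # ages from the hash-table figure
sorted_ages = sorted(ages)

# O(1) expected time: a dict hashes the key straight to its slot.
age_positions = {}
for index, age in enumerate(ages):
    age_positions.setdefault(age, index)   # remember the first position of each age
```

The linear and binary searches return -1 for a missing value, while the dict answers membership with `age in age_positions` without scanning.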

Page 5: CSCE555 Bioinformatics

Hashing

Suppose we were to come up with a "magic function" that, given a value to search for, would tell us exactly where in the array to look
◦ If it's in that location, it's in the array
◦ If it's not in that location, it's not in the array
This function is called a hash function because it "makes hash" of its inputs

Page 6: CSCE555 Bioinformatics

(Magic) Hashing Function

A hash function is a function that:
◦ When applied to an Object, returns a number
◦ When applied to equal Objects, returns the same number for each
◦ When applied to unequal Objects, is very unlikely to return the same number for each
Hash functions turn out to be very important for searching, that is, looking things up fast

Page 7: CSCE555 Bioinformatics

Example (ideal) hash function

Suppose our hash function gave us the following values:

hashCode("apple") = 5
hashCode("watermelon") = 3
hashCode("grapes") = 8
hashCode("cantaloupe") = 7
hashCode("kiwi") = 0
hashCode("strawberry") = 9
hashCode("mango") = 6
hashCode("banana") = 2

Resulting table (index: entry):

0: kiwi
1: (empty)
2: banana
3: watermelon
4: (empty)
5: apple
6: mango
7: cantaloupe
8: grapes
9: strawberry
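A lookup in this ideal table needs one hash computation and one comparison. The sketch below hard-codes the hash values given on the slide; `hash_code` and `contains` are hypothetical helper names, and returning -1 for unknown words is an assumption made only so the example is complete.

```python
# Hash values exactly as listed on the slide.
SLIDE_HASH_CODES = {
    "apple": 5, "watermelon": 3, "grapes": 8, "cantaloupe": 7,
    "kiwi": 0, "strawberry": 9, "mango": 6, "banana": 2,
}

def hash_code(word):
    """Stand-in for the slide's 'magic' function (-1 for words it does not know)."""
    return SLIDE_HASH_CODES.get(word, -1)

# Place each word in the slot named by its hash value; slots 1 and 4 stay empty.
table = [None] * 10
for word, slot in SLIDE_HASH_CODES.items():
    table[slot] = word

def contains(word):
    """One hash computation and one comparison, independent of table size."""
    slot = hash_code(word)
    return 0 <= slot < len(table) and table[slot] == word
```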

Page 8: CSCE555 Bioinformatics

Example of Hash Function

    /* Hash a NUL-terminated string into the range [0, size); size must be > 0. */
    static int hash_number(const char *key, int size) {
        int hash = 0;
        if (key) {
            const char *ptr = key;
            /* Multiply by 3, add the next byte, and reduce mod size at every step. */
            for (; *ptr; ptr++)
                hash = (int) ((hash * 3 + (*(unsigned char *) ptr)) % size);
        }
        return hash;
    }
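To experiment with this function, a direct Python port is sketched below. Real hash functions occasionally map unequal keys to the same slot, so the sketch stores a small list (a "chain") per slot; chaining is an assumption of this example and is not discussed on the slides, and `build_table` and `lookup` are illustrative names.

```python
def hash_number(key, size):
    """Python port of the C routine: multiply by 3, add each byte, reduce mod size."""
    hash_value = 0
    for byte in key.encode("utf-8"):
        hash_value = (hash_value * 3 + byte) % size
    return hash_value

def build_table(keys, size):
    """Store each key in the bucket chosen by its hash; colliding keys share a bucket."""
    buckets = [[] for _ in range(size)]
    for key in keys:
        bucket = buckets[hash_number(key, size)]
        if key not in bucket:
            bucket.append(key)
    return buckets

def lookup(buckets, key):
    """Only the one bucket selected by the hash is inspected."""
    return key in buckets[hash_number(key, len(buckets))]
```

For example, `hash_number("ab", 10)` evaluates to 9: the byte 97 gives 97 % 10 = 7, then (7 * 3 + 98) % 10 = 9.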

Page 9: CSCE555 Bioinformatics

FASTA (Fast Alignment)

Page 10: CSCE555 Bioinformatics

BLAST (Basic Local Alignment Search Tool)

Approach (BLAST) (Altschul et al. 1990, developed by NCBI)
◦ View sequences as sequences of short words (k-tuples)
    DNA: 11 bases, protein: 3 amino acids
◦ Create a hash table of neighborhood (closely matching) words
◦ Use statistics to set the threshold for "closeness"
◦ Start from exact matches to neighborhood words
Motivation
◦ Good alignments should contain many close matches
◦ Statistics can determine which matches are significant
    Much more sensitive than % identity
◦ Hashing can look up each word in O(1) expected time
◦ Extending matches in both directions finds the alignment
    Yields high-scoring/maximum segment pairs (HSP/MSP)
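The seed-and-extend idea can be sketched in a few lines. This is a simplified illustration, not BLAST itself: it seeds only on exact word matches (no neighborhood words), uses +1/-1 scoring, and stops extending with a simple drop-off rule. All function names and default parameters are assumptions of this sketch.

```python
from collections import defaultdict

def build_word_index(subject, k):
    """Hash table mapping every k-letter word of the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index

def extend_hit(query, subject, q_pos, s_pos, k, match=1, mismatch=-1, drop_off=3):
    """Ungapped extension of a word hit in both directions.

    Extension in a direction stops once the running score falls drop_off below
    the best score seen. Returns (score, q_start, q_end, s_start), q_end exclusive.
    """
    best = current = k * match
    q_start, q_end, s_start = q_pos, q_pos + k, s_pos
    qi, si = q_pos + k, s_pos + k
    while qi < len(query) and si < len(subject) and best - current < drop_off:
        current += match if query[qi] == subject[si] else mismatch
        qi, si = qi + 1, si + 1
        if current > best:
            best, q_end = current, qi
    current = best
    qi, si = q_pos - 1, s_pos - 1
    while qi >= 0 and si >= 0 and best - current < drop_off:
        current += match if query[qi] == subject[si] else mismatch
        if current > best:
            best, q_start, s_start = current, qi, si
        qi, si = qi - 1, si - 1
    return best, q_start, q_end, s_start

def find_hsps(query, subject, k):
    """Look up every query word in the index and extend each hit; best score first."""
    index = build_word_index(subject, k)
    hsps = set()
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            hsps.add(extend_hit(query, subject, q_pos, s_pos, k))
    return sorted(hsps, reverse=True)
```

With `find_hsps("ACGTACGT", "TTACGTACGTTT", 4)`, the top-ranked segment pair covers the whole query at subject offset 2.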

Page 11: CSCE555 Bioinformatics

BLAST (Basic Local Alignment Search Tool)

Page 12: CSCE555 Bioinformatics

Multiple Sequence Alignment

Alignment containing multiple DNA / protein sequences
Look for conserved regions → similar function
Example:

#Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit   ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human    ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Opossum  ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken  ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog     ---ATGGGTTTGACAGCACATGATCGT---CAGCT

Page 13: CSCE555 Bioinformatics

Multiple Sequence Alignment: Why?

Identify highly conserved residues
◦ Likely to be essential sites for structure/function
◦ More precision from multiple sequences
◦ Better structure/function prediction, pairwise alignments
Building gene/protein families
◦ Use conserved regions to guide search
Basis for phylogenetic analysis
◦ Infer evolutionary relationships between genes
Develop primers & probes
◦ Use conserved region to develop
    Primers for PCR
    Probes for DNA micro-arrays

Page 14: CSCE555 Bioinformatics

Multiple Alignment Model

$X_1 = x_{11}, \ldots, x_{1 m_1}$
$X_2 = x_{21}, \ldots, x_{2 m_2}$
…
$X_N = x_{N1}, \ldots, x_{N m_N}$

Model: scoring function $s: A \to \mathbb{R}$

Possible alignments of all $X_i$'s: $A = \{a_1, \ldots, a_k\}$

Find the best alignment(s):

$$a^* = \arg\max_{a} \; s(a(X_1, X_2, \ldots, X_N))$$

Example from the figure: S(a*) = 21

Q1: How should we define s?
Q2: How should we define A?
Q3: How can we find a* quickly?
Q4: Is the alignment biologically meaningful?

Page 15: CSCE555 Bioinformatics

Minimum Entropy Scoring

Intuition:
◦ A perfectly aligned column has one single symbol (least uncertainty)
◦ A poorly aligned column has many distinct symbols (high uncertainty)

$$S(m_i) = -\sum_a p_{ia} \log p_{ia}, \qquad p_{ia} = \frac{c_{ia}}{\sum_{a'} c_{ia'}}$$

where $c_{ia}$ is the count of symbol a in column i
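The entropy score is easy to compute column by column: count each symbol, turn the counts into frequencies, and sum -p log p. The sketch below makes two assumptions not stated on the slide: it uses the natural logarithm, and it counts the gap character as an ordinary symbol. The function names are illustrative.

```python
import math
from collections import Counter

def column_entropy(column):
    """Entropy of one alignment column from its symbol frequencies p = c / total."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def alignment_entropy(rows):
    """Sum of column entropies (columns are scored independently); lower is better."""
    return sum(column_entropy(column) for column in zip(*rows))
```

A perfectly conserved column such as "AAAA" scores 0, while an evenly split column such as "AACC" scores log 2, matching the intuition that more distinct symbols mean more uncertainty.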

Page 16: CSCE555 Bioinformatics

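Multidimensional Dynamic Programming

Assumptions: (1) columns are independent (2) linear gap cost

$$S(m) = G + \sum_i s(m_i), \qquad G = \gamma(g) = d\,g$$

$\alpha_{i_1, i_2, \ldots, i_N}$ = maximum score of an alignment up to the subsequences ending with $x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}$

$$\alpha_{0,0,\ldots,0} = 0$$

$$\alpha_{i_1, i_2, \ldots, i_N} = \max
\begin{cases}
\alpha_{i_1-1,\, i_2-1,\, \ldots,\, i_N-1} + S(x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}) \\
\alpha_{i_1,\, i_2-1,\, \ldots,\, i_N-1} + S(-, x^2_{i_2}, \ldots, x^N_{i_N}) \\
\alpha_{i_1-1,\, i_2,\, \ldots,\, i_N-1} + S(x^1_{i_1}, -, \ldots, x^N_{i_N}) \\
\quad\vdots \\
\alpha_{i_1-1,\, i_2-1,\, \ldots,\, i_N} + S(x^1_{i_1}, x^2_{i_2}, \ldots, -) \\
\quad\vdots \\
\alpha_{i_1,\, i_2,\, \ldots,\, i_N-1} + S(-, -, \ldots, x^N_{i_N})
\end{cases}$$

Alignment: a path from (0, 0, …, 0) to (|x1|, …, |xN|)

We can vary both the model and the alignment strategies

Multiple alignment is an NP-complete problem. High complexity: the recursion has 2^N − 1 cases per cell and on the order of L^N cells for N sequences of length L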
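For a handful of very short sequences the full N-dimensional recursion can be run directly. The sketch below is an illustration under stated assumptions: each column is scored by sum of pairs (+1 match, -1 mismatch, -2 for a residue against a gap, 0 for gap against gap), which is one possible column score with a linear gap cost and is not specified on the slide. The function names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations, product

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols within a column (assumed scoring scheme)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def column_score(column):
    """Sum-of-pairs score of a single column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def msa_score(sequences):
    """Best alignment score over all N sequences by N-dimensional dynamic programming."""
    n = len(sequences)
    # Each move says which sequences contribute a residue to the next column:
    # 2^N - 1 possibilities (at least one sequence must advance).
    moves = [m for m in product((0, 1), repeat=n) if any(m)]

    @lru_cache(maxsize=None)
    def alpha(position):
        if not any(position):
            return 0                      # alpha at the origin is 0
        best = None
        for move in moves:
            previous = tuple(p - m for p, m in zip(position, move))
            if min(previous) < 0:
                continue                  # this move would step outside the table
            column = tuple(seq[p - 1] if m else "-"
                           for seq, p, m in zip(sequences, position, move))
            candidate = alpha(previous) + column_score(column)
            if best is None or candidate > best:
                best = candidate
        return best

    return alpha(tuple(len(seq) for seq in sequences))
```

Because the memo table holds one entry per cell, this is only practical for toy inputs, which is the motivation for approximate methods.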

Page 17: CSCE555 Bioinformatics

Approximate Algorithms for Multiple Alignment

Two major methods (but it remains a worthy research topic)
◦ Reduce a multiple alignment to a series of pairwise alignments and then combine the result (e.g., Feng-Doolittle alignment)
◦ Using HMMs (Hidden Markov Models)
Feng-Doolittle alignment (4 steps)
◦ Compute all possible pairwise alignments
◦ Convert alignment scores to distances
◦ Construct a "guide tree" by clustering
◦ Progressive alignment based on the guide tree (bottom up)
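Steps 2 and 3 can be sketched as follows. The score-to-distance conversion uses the normalized form D = -log((S_obs - S_rand) / (S_max - S_rand)) associated with Feng-Doolittle; the clustering here is plain average-linkage merging, swapped in for simplicity rather than the clustering method of the original algorithm. The distances in the example are made-up values, and the function names are illustrative.

```python
import math

def score_to_distance(observed, random_score, max_score):
    """Convert a pairwise alignment score into a distance (normalized, then -log)."""
    effective = (observed - random_score) / (max_score - random_score)
    return -math.log(effective)

def build_guide_tree(names, distance):
    """Repeatedly merge the two closest clusters (average linkage).

    distance maps frozenset({a, b}) to the distance between sequences a and b.
    Returns a nested tuple; inner pairs are aligned first (bottom up).
    """
    clusters = [(name, [name]) for name in names]   # (subtree, member names)

    def average_distance(first, second):
        pairs = [(a, b) for a in first[1] for b in second[1]]
        return sum(distance[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(clusters) > 1:
        candidates = [(i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        i, j = min(candidates,
                   key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical distances: rat and mouse are close, frog is far from both.
distances = {
    frozenset(("rat", "mouse")): 0.1,
    frozenset(("rat", "frog")): 0.9,
    frozenset(("mouse", "frog")): 0.8,
}
tree = build_guide_tree(["rat", "mouse", "frog"], distances)
```

Here rat and mouse are joined first, so a progressive aligner would align that pair before adding frog.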

Page 18: CSCE555 Bioinformatics

Progressive Alignment

Page 19: CSCE555 Bioinformatics

How to Align One Sequence to an Existing Alignment?

Add a sequence to an existing group: a sequence s: CGAAATC, which we want to align to an existing alignment

s1 AG-AT-
s2 -GAATC

The high-scoring pairwise alignment is

s2 -G-AATC
s  CGAAATC

Hence, s is merged into the group alignment as:

s1 AG--AT-
s2 -G-AATC
s  CGAAATC

The existing alignment stays fixed; gaps are added if needed
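The merge step can be written as a single pass over the pairwise alignment. This sketch assumes the anchor row of the pairwise alignment (s2 here) is the old row with its existing gaps kept and, possibly, new gap columns inserted; `add_sequence` and its parameter names are illustrative.

```python
def add_sequence(group, anchor, aligned_anchor, new_name, aligned_new):
    """Merge a new sequence into an existing alignment.

    group maps row names to equal-length rows; anchor names the row that was
    pairwise-aligned with the new sequence. Old columns are copied unchanged,
    and every column newly inserted into the anchor becomes a gap in all old rows.
    """
    old_anchor = group[anchor]
    merged = {name: [] for name in group}
    old_col = 0
    for symbol in aligned_anchor:
        if old_col < len(old_anchor) and symbol == old_anchor[old_col]:
            for name, row in group.items():      # existing column: keep it fixed
                merged[name].append(row[old_col])
            old_col += 1
        elif symbol == "-":
            for name in group:                   # new column: add a gap everywhere
                merged[name].append("-")
        else:
            raise ValueError("aligned anchor does not match the existing row")
    result = {name: "".join(chars) for name, chars in merged.items()}
    result[new_name] = aligned_new
    return result

group = {"s1": "AG-AT-", "s2": "-GAATC"}
merged = add_sequence(group, "s2", "-G-AATC", "s", "CGAAATC")
```

On the slide's example the new gap in s2 is propagated to s1, giving s1 = AG--AT-.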

Page 20: CSCE555 Bioinformatics

How to Align a Group to Another Group?

Two groups:

S1 ATTGCCATT--
S2 ATC-CAATTTT

S3 ATGGCCATT
S4 ATCTTC-TT

The highest score alignment is S1 - S3, so it is used for aligning the two groups as

S2 ATC-CAATTTT
S1 ATTGCCATT--
S3 ATGGCCATT--
S4 ATCTTC-TT--

Page 21: CSCE555 Bioinformatics

Limitation of Feng-Doolittle Alignment

Problems of Feng-Doolittle alignment
◦ All alignments are completely determined by pairwise alignment (restricted search space)
◦ No backtracking (subalignment is "frozen")
    No way to correct an early mistake
    Non-optimality: mismatches and gaps in highly conserved regions should be penalized more, but we can't tell where the highly conserved regions are early in the process
Iterative Refinement
◦ Re-assign a sequence to a different cluster/profile
◦ Repeat this a fixed number of times or until the score converges
◦ Essentially, this enlarges the search space

Page 22: CSCE555 Bioinformatics

Clustal W: A Multiple Alignment Tool

CLUSTAL and its variants are software packages often used to produce multiple alignments
Essentially following Feng-Doolittle
◦ Do pairwise alignment (dynamic programming)
◦ Do score conversion/normalization (Kimura's model)
◦ Construct a guide tree (neighbour-joining clustering)
◦ Progressively align all sequences using profile alignment
Offers the option of using substitution matrices such as BLOSUM or PAM
Many heuristics

Page 23: CSCE555 Bioinformatics

One example of MSA using ClustalW

Page 24: CSCE555 Bioinformatics
Page 25: CSCE555 Bioinformatics
Page 26: CSCE555 Bioinformatics

More Advanced MSA algorithms

Kalign
MAFFT (Multiple Alignment using Fast Fourier Transform)
MUSCLE stands for MUltiple Sequence Comparison by Log-Expectation. MUSCLE is claimed to achieve both better average accuracy and better speed than ClustalW2 or T-Coffee
T-Coffee allows you to combine results obtained with several alignment methods

Page 27: CSCE555 Bioinformatics

Measuring Alignment Significance

The statistical significance of an alignment score is used to try to determine whether an alignment is the result of homology or just random chance.

The E-value of an alignment score is the expected number of unrelated sequences in a database that would have a score at least as good.

Page 28: CSCE555 Bioinformatics

E-values and p-values

The E-value of a particular score is determined by multiplying the number of sequences in the database, n, times the p-value of the score.

The p-value of score X is the probability of a single random alignment having score X or larger.

E-value(X) = n • p-value(X)

Page 29: CSCE555 Bioinformatics

Computing p-values

To compute the p-value of X, we must know how random scores are distributed.

The p-value of X is equal to the area under the distribution curve to the right of X.

For ungapped local alignments, the distribution can be computed analytically.

For gapped alignments, it must be estimated empirically.
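One common way to estimate a p-value empirically is to score the query against many shuffled copies of the subject and count how often the random score reaches the observed one. The sketch below assumes shuffling as the null model and accepts any scoring function as a parameter; `identity_score` is a toy stand-in for a real alignment score, and all names and defaults are illustrative. The E-value then follows from the relation E-value(X) = n • p-value(X).

```python
import random

def identity_score(a, b):
    """Toy score: number of positions where the two strings agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

def empirical_p_value(query, subject, score, trials=1000, seed=0):
    """Fraction of shuffled subjects scoring at least as well as the real one.

    Adding 1 to numerator and denominator keeps the estimate above zero.
    """
    rng = random.Random(seed)
    observed = score(query, subject)
    letters = list(subject)
    at_least_as_good = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if score(query, "".join(letters)) >= observed:
            at_least_as_good += 1
    return (at_least_as_good + 1) / (trials + 1)

def e_value(p_value, database_size):
    """E-value(X) = n * p-value(X)."""
    return database_size * p_value

p = empirical_p_value("ACGTACGTACGTACGT", "ACGTACGTACGTACGT", identity_score)
```

With 1000 trials the smallest reportable p-value is 1/1001, so more trials are needed to resolve very significant scores.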

Page 30: CSCE555 Bioinformatics

Summary

Hashing for quick search
BLAST and FASTA
Progressive multiple sequence alignment
Testing significance of alignments

Page 31: CSCE555 Bioinformatics

Next Lecture

Profiles and HMM
Reading:
◦ Textbook (CG) chapter 4
◦ Textbook (EB) chapter 6

