
Sequence Alignment II

Date post: 18-Mar-2016
Upload: bozica
Transcript
Page 1: Sequence Alignment II

Sequence Alignment II

K-tuple methods

Statistics of alignments

Page 2: Sequence Alignment II

Database searches

What is the problem?

A large number of sequences must be searched with your query sequence.

Various indexing schemes and heuristics are used to make this feasible, one of which is BLAST.

A heuristic is a technique for solving a problem that ignores whether the solution can be proven correct but usually produces a good solution. Heuristics are intended to gain computational performance or conceptual simplicity, potentially at the cost of accuracy or precision.

Source: http://en.wikipedia.org/wiki/Heuristics#Computer_science

Page 3: Sequence Alignment II

K-tuple methods

http://creativecommons.org/licenses/by-sa/2.0/

Page 4: Sequence Alignment II

Concepts of Sequence Similarity Searching

The premise: The sequence itself is not informative; it must be analyzed by comparative methods against existing databases to develop hypotheses concerning relatives and function.

Page 5: Sequence Alignment II

Important Terms for Sequence Similarity Searching (with very different meanings)

Similarity: The extent to which nucleotide or protein sequences are related. In BLAST, similarity refers to a positive matrix score.

Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.

Homology: Similarity attributed to descent from a common ancestor.

Page 6: Sequence Alignment II

Sequence Similarity Searching: The Approach

Sequence similarity searching involves the use of a set of algorithms (such as the BLAST programs) to compare a query sequence to all the sequences in a specified database.

Comparisons are made in a pairwise fashion. Each comparison is given a score reflecting the degree of similarity between the query and the sequence being compared.

Page 7: Sequence Alignment II

QUERY sequence(s)

BLAST database

BLAST program

BLAST results

Blast

Page 8: Sequence Alignment II

Topics:

There are different BLAST programs

Understanding the BLAST algorithm
• Word size
• HSPs (High Scoring Pairs)

Understanding BLAST statistics
• The alignment score (S)
• Scoring matrices
• Dealing with gaps in an alignment
• The expectation value (E)

BLAST program

Page 9: Sequence Alignment II

The BLAST algorithm

The BLAST programs (Basic Local Alignment Search Tool) are a set of sequence comparison algorithms, introduced in 1990, that search a database for high-scoring local alignments to a query. Because BLAST is a heuristic, the alignments it reports are not guaranteed to be optimal.

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” NAR 25:3389-3402.

Page 10: Sequence Alignment II

http://www.ncbi.nlm.nih.gov/BLAST

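blastn: nucleotide query against a nucleotide database
blastp: protein query against a protein database
blastx: translated nucleotide query against a protein database
tblastn: protein query against a translated nucleotide database
tblastx: translated nucleotide query against a translated nucleotide database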

Page 11: Sequence Alignment II

Other BLAST programs

BLAST 2 Sequences (bl2seq)
• Aligns two sequences of your choice
• Gives dot-plot-like output

Page 12: Sequence Alignment II

More BLAST programs

BLAST against genomes
• Many available
• BLAST parameters pre-optimized
• Handy for mapping a query to a genome

Search for short exact matches
• BLAST parameters pre-optimized
• Great for checking probes and primers

Page 13: Sequence Alignment II

How Does BLAST Work?

The BLAST programs improved the overall speed of searches while retaining good sensitivity (important as databases continue to grow) by breaking the query and database sequences into fragments ("words") and initially seeking matches between fragments.

Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold “T”.
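The word-matching step described above can be sketched in a few lines of MATLAB. This is only an illustration under stated assumptions: the two sequences and the word size w = 4 are example values, and only exact word matches are collected (real BLAST also uses neighborhood words for proteins and a precomputed index rather than a repeated string search).

```matlab
% Sketch: exact word (k-tuple) hits between a query and one database sequence.
% The sequences and the word size are illustrative assumptions.
query   = 'ACGAAGTAAGGTCCAGT';
subject = 'AGCGTTAGGTCCTAGTC';
w = 4;

hits = zeros(0, 2);                      % rows: [query start, subject start]
for i = 1:(length(query) - w + 1)
    word = query(i:i+w-1);               % i-th word of the query
    starts = strfind(subject, word);     % every exact occurrence in the subject
    for s = starts
        hits(end+1, :) = [i, s];         %#ok<AGROW>
    end
end

for k = 1:size(hits, 1)
    fprintf('%s: query %d, subject %d\n', ...
        query(hits(k,1):hits(k,1)+w-1), hits(k,1), hits(k,2));
end
```

All hits found here lie on the same diagonal (query start minus subject start is constant), which is what makes them good seeds for extension.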

Page 14: Sequence Alignment II

Picture used with permission from Chapter 11 of “Bioinformatics:A Practical Guide to the Analysis of Genes and Proteins”

Page 15: Sequence Alignment II

Each BLAST “hit” generates an alignment that can contain one or more high scoring pairs (HSPs)

Page 17: Sequence Alignment II

Where does the score (S) come from?

The quality of each pair-wise alignment is represented as a score and the scores are ranked.

Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein).

The alignment score will be the sum of the scores for each position.

Page 18: Sequence Alignment II

What’s a scoring matrix?

Substitution matrices are used for amino acid alignments. These are matrices in which each possible residue substitution is given a score reflecting the probability that it is related to the corresponding residue in the query.

Page 19: Sequence Alignment II

PAM vs. BLOSUM scoring matrices

BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

Page 20: Sequence Alignment II

PAM vs. BLOSUM scoring matrices

The PAM family
• PAM matrices are based on global alignments of closely related proteins.
• PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
• Other PAM matrices are extrapolated from PAM1.

The BLOSUM family
• BLOSUM matrices are based on local alignments.
• BLOSUM 62 is a matrix calculated from comparisons of sequences with no more than 62% identity (sequences more similar than that are clustered and counted as one).
• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.

Page 21: Sequence Alignment II

What happens if you have a gap in the alignment?

A gap is a position in the alignment at which a letter is paired with a null.

Gap scores are negative. Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is frequently ascribed more significance than the length of the gap. Hence the opening of a gap is penalized heavily, whereas a lesser penalty is assigned to each subsequent residue in the gap.
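As a rough illustration of this gap scheme, the sketch below scores a fixed alignment with a heavier penalty for the first gapped position and a lighter one for each further position. The alignment and all score values (match, mismatch, gapOpen, gapExtend) are assumptions chosen for the example, not BLAST defaults, and the code treats any run of gapped columns as a single gap.

```matlab
% Sketch: scoring an existing alignment with open/extend gap penalties.
% All numeric scores and the alignment itself are hypothetical.
a = 'ACGTTTGCA';
b = 'ACG---GCA';
matchScore = 2;  mismatchScore = -3;
gapOpen = -5;                     % charged for the first position of a gap
gapExtend = -2;                   % charged for each additional position

score = 0;
inGap = false;
for k = 1:length(a)
    if a(k) == '-' || b(k) == '-'
        if inGap
            score = score + gapExtend;
        else
            score = score + gapOpen;
        end
        inGap = true;
    else
        if a(k) == b(k)
            score = score + matchScore;
        else
            score = score + mismatchScore;
        end
        inGap = false;
    end
end
fprintf('alignment score = %d\n', score);
```

With these values the single three-residue gap costs -5 - 2 - 2 = -9, less than three separately opened gaps would cost.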

Page 22: Sequence Alignment II

Percent Sequence Identity

The extent to which two nucleotide or amino acid sequences are invariant.

A C C T G A G - A G
A C G T G - G C A G

70% identical (7 of 10 aligned columns). Column 3 is a mismatch; columns 6 and 8 are indels.
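The identity figure for this example can be checked with a short MATLAB sketch. The gap character '-' and the choice to divide by the full alignment length (indel columns included) are assumptions that match the 70% stated on the slide.

```matlab
% Sketch: percent identity of the two aligned sequences from the slide.
a = 'ACCTGAG-AG';
b = 'ACGTG-GCAG';

isGap   = (a == '-') | (b == '-');     % indel columns
isMatch = (a == b) & ~isGap;           % identical, non-gap columns

nIdentical  = sum(isMatch);
nIndel      = sum(isGap);
nMismatch   = length(a) - nIdentical - nIndel;
pctIdentity = 100 * nIdentical / length(a);

fprintf('%d identical, %d mismatch, %d indel: %.0f%% identity\n', ...
    nIdentical, nMismatch, nIndel, pctIdentity);
```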

Page 23: Sequence Alignment II

BLAST algorithm

• Keyword search of all words of length w in the query (length n) against the database (length m), keeping hits with score above a threshold
  • by default, w = 11 for nucleotide queries and 3 for proteins
• Do local alignment extension for each hit of the keyword search
  • Extend the result until the longest match above the threshold is achieved, then output it

Page 24: Sequence Alignment II

BLAST algorithm (cont’d)

Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD

Keyword: GVK

Neighborhood words (with scores): GVK 18, GAK 16, GIK 16, GGK 14, GLK 13, GNK 12, GRK 11, GEK 11, GDK 11

Neighborhood score threshold: T = 13 (GVK, GAK, GIK, GGK and GLK reach the threshold)

Extension of a word hit gives a High-scoring Pair (HSP):

Query:  22  VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK  60
            +++DN +G +   IR L    G+K I+ L+ E+ RG++K
Sbjct: 226  IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263

Page 25: Sequence Alignment II

Local alignment

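Find the best local alignment between two strings v and w, over the recurrence:

$$
s_{i,j} = \max
\begin{cases}
0 \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j) \\
s_{i-1,j-1} + \delta(v_i, w_j)
\end{cases}
$$

where $\delta$ is the scoring function given by the scoring matrix.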

Page 26: Sequence Alignment II

Local alignment (cont’d)

Input: strings v and w and a scoring matrix δ

Output: substrings of v and w whose global alignment, as defined by δ, is maximal among all global alignments of all substrings of v and w
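A minimal MATLAB sketch of the local alignment recurrence follows. It fills the score table only (no traceback), and it assumes a very simple scoring function: +1 for a match, -1 for a mismatch and -2 for any gap. Those values and the two example strings are assumptions for illustration.

```matlab
% Sketch: dynamic-programming table for local alignment (score only).
% Scoring values are assumed: match +1, mismatch -1, gap -2.
v = 'GTAAGGTCCAGT';
w = 'GTTAGGTCAGT';
matchScore = 1;  mismatchScore = -1;  gapScore = -2;

n = length(v);  m = length(w);
s = zeros(n + 1, m + 1);     % s(i+1, j+1) holds s_{i,j}; first row/column are 0
for i = 1:n
    for j = 1:m
        if v(i) == w(j)
            d = matchScore;
        else
            d = mismatchScore;
        end
        s(i+1, j+1) = max([0, ...
            s(i,   j+1) + gapScore, ...   % v_i aligned to a gap
            s(i+1, j)   + gapScore, ...   % w_j aligned to a gap
            s(i,   j)   + d]);            % v_i aligned to w_j
    end
end

[best, idx] = max(s(:));
[iEnd, jEnd] = ind2sub(size(s), idx);
fprintf('best local score %d, ending at v(%d) and w(%d)\n', ...
    best, iEnd - 1, jEnd - 1);
```

The 0 in the maximum is what makes the alignment local: a poorly scoring prefix is simply dropped instead of being carried along.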

Page 27: Sequence Alignment II

Original BLAST

Dictionary
• All words of length w

Alignment
• Ungapped extensions until the score falls below the statistical threshold T

Output
• All local alignments with score > statistical threshold

Page 28: Sequence Alignment II

Original BLAST: Example

[Figure: dot matrix comparing the query A C G A A G T A A G G T C C A G T (horizontal axis) with the database sequence A G C G T T A G G T C C T A G T C (vertical axis, read bottom to top)]

• w = 4, T = 4
• Exact keyword match of GGTC
• Extend diagonals with mismatches until the score falls below a threshold
• Output result:

GTAAGGTCC
GTTAGGTCC

From lectures by Serafim Batzoglou (Stanford)
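The ungapped extension in this example can be imitated with the MATLAB sketch below. Several things are assumptions: the database sequence is the vertical axis of the figure as read from the extracted slide text, the scores are +1 for a match and -1 for a mismatch, and the stopping rule (give up once the running score falls more than xdrop below the best score seen) is one common way to make "until the score falls below a threshold" concrete.

```matlab
% Sketch: ungapped extension of the exact word hit GGTC in both directions.
% Scores and the drop-off rule are assumed for illustration.
query   = 'ACGAAGTAAGGTCCAGT';
subject = 'AGCGTTAGGTCCTAGTC';
word = 'GGTC';  w = length(word);
qs = strfind(query, word);       % start of the hit in the query
ss = strfind(subject, word);     % start of the hit in the subject
matchScore = 1;  mismatchScore = -1;  xdrop = 2;

% Extend to the right of the word hit.
score = 0;  rightGain = 0;  rightLen = 0;  k = 0;
while qs + w + k <= length(query) && ss + w + k <= length(subject)
    if query(qs + w + k) == subject(ss + w + k)
        score = score + matchScore;
    else
        score = score + mismatchScore;
    end
    k = k + 1;
    if score > rightGain
        rightGain = score;  rightLen = k;    % new best right end
    elseif rightGain - score > xdrop
        break;                               % dropped too far: stop
    end
end

% Extend to the left of the word hit.
score = 0;  leftGain = 0;  leftLen = 0;  k = 0;
while qs - 1 - k >= 1 && ss - 1 - k >= 1
    if query(qs - 1 - k) == subject(ss - 1 - k)
        score = score + matchScore;
    else
        score = score + mismatchScore;
    end
    k = k + 1;
    if score > leftGain
        leftGain = score;  leftLen = k;      % new best left end
    elseif leftGain - score > xdrop
        break;
    end
end

hspQuery   = query(qs - leftLen : qs + w - 1 + rightLen);
hspSubject = subject(ss - leftLen : ss + w - 1 + rightLen);
hspScore   = w * matchScore + leftGain + rightGain;
fprintf('%s\n%s\nscore = %d\n', hspQuery, hspSubject, hspScore);
```

Under these assumptions the extension stops at the same pair of segments shown as the output result of the example.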

Page 29: Sequence Alignment II

Gapped BLAST: Example

[Figure: dot matrix comparing the query A C G A A G T A A G G T C C A G T (horizontal axis) with the database sequence (vertical axis)]

• Original BLAST exact keyword search, THEN:
• Extend with gaps in a zone around the ends of the exact match
• Output result:

GTAAGGTCCAGT
GTTAGGTC-AGT

From lectures by Serafim Batzoglou (Stanford)

Page 30: Sequence Alignment II

Gapped BLAST: Example (cont’d)

• Original BLAST exact keyword search, THEN:
• Extend with gaps around the ends of the exact match until the score falls below T, then merge nearby alignments
• Output result:

GTAAGGTCCAGT
GTTAGGTC-AGT

[Figure: dot matrix comparing the query A C G A A G T A A G G T C C A G T (horizontal axis) with the database sequence (vertical axis)]

From lectures by Serafim Batzoglou (Stanford)

Page 31: Sequence Alignment II

Topics:

The different BLAST databases provided by the NCBI
• Protein databases
• Nucleotide databases
• Genomic databases

Considerations for choosing a BLAST database

Custom databases for BLAST

BLAST databases

Page 32: Sequence Alignment II

BLAST protein databases available through the blastp web interface @ NCBI

blastp db

Page 33: Sequence Alignment II

Considerations for choosing a BLAST database

First consider your research question:

• Are you looking for an ortholog in a particular species? BLAST against the genome of that species.
• Are you looking for additional members of a protein family across all species? BLAST against nr; if you can’t find hits, check wgs, htgs, and the trace archives.
• Are you looking to annotate genes in your species of interest? BLAST against known genes (RefSeq) and/or ESTs from a closely related species.

Page 34: Sequence Alignment II

When choosing a database for BLAST…

It is important to know your reagents.
• Changing your choice of database changes your search space completely.
• Database size affects the BLAST statistics.
• Record BLAST parameters, database choice, and database size in your bioinformatics lab book, just as you would for your wet-bench experiments.
• Databases change rapidly and are updated frequently; it may be necessary to repeat your analyses.

Page 35: Sequence Alignment II

Topics:
• Choosing the right BLAST program
• Running a blastp search
  • BLAST parameters and options to consider
• Viewing BLAST results
  • Look at your alignments
  • Using the BLAST taxonomy report

BLAST results

Page 36: Sequence Alignment II

BLAST parameters and options to consider:

conserved domains

Entrez query

E-value cutoff

Word size

Page 37: Sequence Alignment II

More BLAST parameters and options to consider:

filtering

matrix

gap penalties

Page 38: Sequence Alignment II

Run your BLAST search:

BLAST

Page 39: Sequence Alignment II

The BLAST Queue:

click for more info

Note your RID

Page 40: Sequence Alignment II

Formatting and Retrieving your BLAST results:

options

Results

Page 41: Sequence Alignment II

A graphical view of your BLAST results:

Page 42: Sequence Alignment II

The BLAST “hit” list:

alignment

GenBank

Score E-Value

EntrezGene

Page 43: Sequence Alignment II

The BLAST pairwise alignments

Identity Similarity

Page 44: Sequence Alignment II

Sample BLAST output

                                                                     Score     E
Sequences producing significant alignments:                          (bits)  Value

gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757...    171   3e-44
gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer...    170   7e-44
gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D...    170   7e-44
gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio]                   168   3e-43

ALIGNMENTS
>gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]
Length = 148

Score = 171 bits (434), Expect = 3e-44
Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)

Query: 1   MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60
           MV  T  E++A+  LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPK
Sbjct: 1   MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60

Query: 61  VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120
           V AHG+ V+G     + ++DN+K T+A LS +H +KLHVDP+NFRLL + +    A  FG
Sbjct: 61  VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120

Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147
           +  F   VQ A+QK +A V +AL  +YH
Sbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148

• BLAST of human beta globin protein against zebrafish


Page 45: Sequence Alignment II

Sample BLAST output (cont’d)

                                                                     Score     E
Sequences producing significant alignments:                          (bits)  Value

gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1...    289   1e-75
gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end         289   1e-75
gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge...    280   1e-72
gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin      260   1e-66
gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud...    151   7e-34
gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob...    149   3e-33

ALIGNMENTS
>gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome 11
Length = 81706

Score = 149 bits (75), Expect = 3e-33
Identities = 183/219 (83%)
Strand = Plus / Plus

Query: 267   ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326
             || ||| | ||    | || |  |||||| ||||| |||||||||||    ||||||||
Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468

Query: 327   ctgcactgtgacaagctgcatgtggatcctgagaacttc 365
             ||||||||| |||||||||| ||||| ||||||||||||
Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507

• BLAST of human beta globin DNA against human DNA

Page 46: Sequence Alignment II

What do the Score and the E-value really mean?

The quality of the alignment is represented by the Score.

Score (S): The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (PAM, BLOSUM), whereas gap scores are assigned empirically.

The significance of each alignment is computed as an E value.

E value (E): Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.

Page 48: Sequence Alignment II

Assessing sequence homology

Need to know how strong an alignment can be expected from chance alone

“Chance” is the comparison of:
• Real but non-homologous sequences
• Real sequences that are shuffled to preserve compositional properties
• Sequences that are generated randomly based upon a DNA or protein sequence model (favored)

Page 49: Sequence Alignment II

High Scoring Pairs (HSPs)

All segment pairs whose scores can not be improved by extension or trimming

Need to model a random sequence to analyze how high the score is in relation to chance

Page 50: Sequence Alignment II

Expected number of HSPs

The expected number of HSPs with a score of at least S is the E-value E for the score S:

$$E = K m n \, e^{-\lambda S}$$

Given:
• Two sequences, of length n and m
• The statistics of HSP scores are characterized by two parameters, K and λ
  • K: scale for the search space size
  • λ: scale for the scoring system
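To get a feel for the E-value formula, the MATLAB sketch below evaluates it for a few raw scores. The values of K and lambda are assumed, illustrative numbers (they depend on the scoring matrix and gap penalties), and the lengths m and n are simply example sequence lengths.

```matlab
% Sketch: E = K*m*n*exp(-lambda*S) for a few raw scores S.
% K and lambda are assumed illustrative values, not taken from the slides.
K = 0.041;  lambda = 0.267;
m = 147;    n = 148;                 % example sequence lengths

evalue = @(S) K * m * n * exp(-lambda * S);

for S = [30 40 50 60]
    fprintf('S = %d   E = %.3g\n', S, evalue(S));
end

% Raw-score increase that halves the expected number of chance HSPs
halving = log(2) / lambda;
fprintf('E halves for every %.2f raw score units\n', halving);
```

The exponential dependence on S is the point of the example: doubling the search space m*n doubles E, but a modest increase in S cancels that out.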

Page 51: Sequence Alignment II

BLAST statistics to record in your bioinformatics labbook

Record the statistics that are found at the bottom of your BLAST results page

Page 52: Sequence Alignment II

Scoring matrices

Amino acid substitution matrices
• PAM
• BLOSUM

Page 53: Sequence Alignment II

Bit Scores

Normalized score, so that alignments scored with different scoring systems can be compared

Bit score:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

E-value of bit score:

$$E = m n \, 2^{-S'}$$
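The relation between raw score, bit score and E-value can be checked numerically. In this MATLAB sketch K and lambda are again assumed, illustrative values, and the raw score of 434 is the one shown in parentheses in the sample blastp output; the code confirms only that both E-value formulas agree, not that these are the parameters BLAST actually used.

```matlab
% Sketch: raw score -> bit score, and the E-value computed both ways.
% K and lambda are assumed illustrative values.
K = 0.041;  lambda = 0.267;
m = 147;    n = 148;
S = 434;                                   % raw score

bits   = (lambda * S - log(K)) / log(2);   % bit score S'
E_raw  = K * m * n * exp(-lambda * S);     % from the raw score
E_bits = m * n * 2^(-bits);                % from the bit score

fprintf('bit score = %.1f\n', bits);
fprintf('E (raw) = %.3g, E (bits) = %.3g\n', E_raw, E_bits);
```

Because K and lambda are folded into the bit score, the second formula needs only the search space size m*n.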

Page 54: Sequence Alignment II

Assessing the significance of an alignment

How do we assess the significance of an alignment when a protein of length m is compared to a database containing many different proteins of varying lengths?

Calculate a "database search" E-value: multiply the pairwise-comparison E-value by N/n, where N is the total length of the database in residues and n is the length of the database sequence in the pairwise comparison.

Page 55: Sequence Alignment II

Homology: Some Guidelines
• Similarity can be indicative of homology
• Generally, if two sequences are significantly similar over their entire length, they are likely homologous
• Low complexity regions can be highly similar without being homologous
• Homologous sequences are not always highly similar

Page 56: Sequence Alignment II

Homology: Some Guidelines

Suggested BLAST Cutoffs (source: Chapter 11 – Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins)

• For nucleotide-based searches, one should look for hits with E-values of 10^-6 or less and sequence identity of 70% or more
• For protein-based searches, one should look for hits with E-values of 10^-3 or less and sequence identity of 25% or more

Page 57: Sequence Alignment II

Contributors

Special thanks to David Wishart, Andy Baxevanis, Stephanie Minnema, Sohrab Shah, and Francis Ouellette for their contributions to these materials

http://creativecommons.org/licenses/by-sa/2.0/

Page 58: Sequence Alignment II

FASTA

A FASTA search begins by breaking the search sequence into words.

For genomic sequences, a word size of 4 or 6 nucleotides is used; 1 or 2 for polypeptide sequences.

Page 59: Sequence Alignment II

FASTA

Next, a table of residue positions is constructed for the query sequence (word size is 1), e.g. FAMLGFIKYLPGCM. The table is filled in one residue type at a time:

A: 2

Page 60: Sequence Alignment II

FASTA

A: 2; C: 13

Page 61: Sequence Alignment II

FASTA

A: 2; C: 13; F: 1, 6

Page 62: Sequence Alignment II

FASTA

A: 2; C: 13; F: 1, 6; G: 5, 12

Page 63: Sequence Alignment II

FASTA

A: 2; C: 13; F: 1, 6; G: 5, 12; I: 7

Page 64: Sequence Alignment II

FASTA

The table for the query sequence FAMLGFIKYLPGCM is complete:

A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
2  13       1  5     7  8  4  3     11                   9
            6  12          10 14

Page 65: Sequence Alignment II

FASTA

Compare the query sequence table with the target sequence:

Query: FAMLGFIKYLPGCM (the indices of G are 5 and 12)

Target: TGFIKYLPGACT (the indices of G are 2 and 9)

Subtract 2 from 5 and 12, producing 3 and 10. Subtract 9 from 5 and 12, producing -4 and 3.

Position  1   2   3   4   5   6   7   8   9   10  11  12
Target    T   G   F   I   K   Y   L   P   G   A   C   T
Offsets       3                           -4
              10                          3

[Figure: the query position table shown alongside]

Page 66: Sequence Alignment II

FASTA

Compare the query sequence table with the target sequence:

Query: FAMLGFIKYLPGCM (the indices of F are 1 and 6)

Target: TGFIKYLPGACT (the index of F is 3)

Subtract 3 from 1 and 6, producing -2 and 3.

Position  1   2   3   4   5   6   7   8   9   10  11  12
Target    T   G   F   I   K   Y   L   P   G   A   C   T
Offsets       3   -2                      -4
              10  3                       3

[Figure: the query position table shown alongside]

Page 67: Sequence Alignment II

FASTA

Compare the query sequence table with the target sequence:

Query: FAMLGFIKYLPGCM (the indices of F are 1 and 6)

Target: TGFIKYLPGACT (the index of F is 3)

Subtract 3 from 1 and 6, producing -2 and 3. Repeating the subtraction for every residue of the target fills in the remaining offsets:

Position  1   2   3   4   5   6   7   8   9   10  11  12
Target    T   G   F   I   K   Y   L   P   G   A   C   T
Offsets       3   -2  3   3   3   -3  3   -4  -8  2
              10  3               3       3

[Figure: the query position table shown alongside]

Page 68: Sequence Alignment II

FASTA

The most frequent offset is 3 (it occurs 8 times), so the target is aligned to the query with an offset of 3:

FAMLGFIKYLPGCM
    ||||||||
   TGFIKYLPGACT

Offset by 3

Position  1   2   3   4   5   6   7   8   9   10  11  12
Target    T   G   F   I   K   Y   L   P   G   A   C   T
Offsets       3   -2  3   3   3   -3  3   -4  -8  2
              10  3               3       3

[Figure: the query position table shown alongside]
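The offset bookkeeping from the FASTA slides can be reproduced with a short MATLAB sketch for word size 1. It uses the query and target from the example; the variable names are illustrative, and a direct search of the query stands in for the precomputed position table.

```matlab
% Sketch: FASTA-style offsets (query index minus target index), word size 1.
query  = 'FAMLGFIKYLPGCM';
target = 'TGFIKYLPGACT';

offsets = [];
for j = 1:length(target)
    qpos = find(query == target(j));     % positions of this residue in the query
    offsets = [offsets, qpos - j];       %#ok<AGROW>
end

vals   = unique(offsets);                               % distinct offsets
counts = arrayfun(@(o) sum(offsets == o), vals);        % how often each occurs
[bestCount, k] = max(counts);
bestOffset = vals(k);

fprintf('most frequent offset: %d (seen %d times)\n', bestOffset, bestCount);
```

Each offset corresponds to one diagonal of a dot plot, so counting offsets is a cheap way of finding the diagonal with the most matches without building the full matrix.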

Page 69: Sequence Alignment II

Fasta (word size = 2)

Page 70: Sequence Alignment II

Database searches

Page 71: Sequence Alignment II

Odds score in sequence alignment

The chance of an aligned amino acid pair being found in alignments of related sequences compared to the chance of that pair being found in random alignments of unrelated sequences.

Page 72: Sequence Alignment II

Statistical significance of an alignment

The probability that random or unrelated sequences could be aligned to produce the same score. The smaller the probability, the better.

Page 73: Sequence Alignment II

Probability

What is the probability that a coin toss will yield a head?

What is the probability that the next pair of nucleotides will be a ‘match’ or ‘mismatch’?

Page 74: Sequence Alignment II

Bernoulli trials

A series of n independent trials, each with the same outcome probabilities and the same number of choices (e.g., head or tail; match (m) or mismatch (mi)).

P(hhhhh)

P(mmmmm)

Page 75: Sequence Alignment II

Head or Tail: longest run of heads or tails

What is the longest run of heads one would get in a random series of coin tosses?
• Fair coin: p = 0.5; 1/p = 1/0.5 = 2
• Erdős and Rényi: longest run = $\log_{1/p}(n)$
• If n = 100, the longest run is $\log_2(100) \approx 6.64$

Page 76: Sequence Alignment II

Alignment analogy

You have two sequences a and b of equal length:

a1 a2 a3 a4
b1 b2 b3 b4

If an = bn, then it is a head (match). If an does not equal bn, then it is a tail (mismatch).

Page 77: Sequence Alignment II

Alignment Statistics:

For two sequences of length n and m, n times m comparisons are being made; thus the expected length of the longest match would be $\log_{1/p}(mn)$.

Page 78: Sequence Alignment II

Alignment Statistics:

The expectation value, or mean longest match, would be $E(M) = \log_{1/p}(Kmn)$, where K is a constant that depends on amino acid or base composition and p is the probability of a match. This is only true for ungapped local alignments.

Page 79: Sequence Alignment II

Distribution of alignment scores

resembles Gumbel extreme value distribution.

Page 80: Sequence Alignment II

Extreme Value Distribution

Page 81: Sequence Alignment II

Extreme Value Distribution

In this distribution, the probability of a score being at least x is given by:

$$P(S \ge x) = 1 - \exp\left(-K m n \, e^{-\lambda x}\right)$$

• m and n are the lengths of the sequences compared
• K and λ can be calculated from the data in the matrix used and from the relative frequencies of the amino acids (or nucleotides)

Page 82: Sequence Alignment II

Alignment Statistics:

For two sequences of length n and m, n times m comparisons are being made; thus the expected length of the longest match would be $\log_{1/p}(mn)$.

For a pair of random DNA sequences of length 100 and p = 0.25 (equal A, T, C, G), the longest expected run of matches would be $\log_{1/p}(mn) = \log_4(100 \times 100) = 2 \times \log_4(100) \approx 6.64$.
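Both longest-run figures can be recomputed with the change-of-base rule, since MATLAB has no built-in logarithm to an arbitrary base. The sketch below uses only the numbers given on the slides; the variable names are illustrative.

```matlab
% Sketch: expected longest run, log base 1/p, via log(x)/log(1/p).
% Coin tosses: p = 0.5, n = 100 tosses
pCoin = 0.5;  nTosses = 100;
runCoin = log(nTosses) / log(1 / pCoin);

% Two random DNA sequences: p = 0.25, lengths m = n = 100
pDNA = 0.25;  m = 100;  n = 100;
runDNA = log(m * n) / log(1 / pDNA);

fprintf('coin tosses: %.2f\n', runCoin);
fprintf('DNA pair:    %.2f\n', runDNA);
```

The two values coincide because log base 4 of 100 squared equals log base 2 of 100.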

Page 83: Sequence Alignment II

Alignment Statistics

$E(M) = \log_{1/p}(Kmn)$ means that the match length grows as the log of the product of the sequence lengths. Amino acid substitution matrices turn match lengths into alignment scores (S).

More commonly, $\lambda = \ln(1/p)$ is used. The number of HSPs with a score of at least S is then estimated as $E = K m n \, e^{-\lambda S}$.

How good a sequence score is gets evaluated based on how many HSPs (i.e., the E value) one would expect for that score.

Page 84: Sequence Alignment II

Alignment Statistics

Two ways to get K and λ:
• For 10000 random amino acid sequences with various gap penalties, the K and λ parameters have been tabulated.
• Calculate the distribution for the two sequences being aligned by keeping one of them fixed and scrambling the other, thus preserving both the sequence length and the amino acid composition.

Page 85: Sequence Alignment II

Generate random sequences

You may use the function randperm

>> help randperm

RANDPERM(n) is a random permutation of the integers from 1 to n.

For example, RANDPERM(6) might be [2 4 5 6 1 3].

Page 86: Sequence Alignment II

Align a sequence with its randomly permuted state

>> x = 'atagacagacca'
>> l = length(x)
l = 12
>> ind = randperm(12)
ind =
  Columns 1 through 9
     9  4  5  7  3  11  2  8  6
  Columns 10 through 12
     10  1  12
>> y = x(ind)
y =
agaaactgccaa
>> align1
atagacagacca
agaaactgccaa
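Repeating the shuffle many times gives an empirical picture of the scores expected by chance. The MATLAB sketch below is a toy version under clear assumptions: the second sequence (fixed) is hypothetical, the "score" is just the number of identical positions in an ungapped, end-to-end comparison, and 1000 shuffles is an arbitrary choice.

```matlab
% Sketch: chance score distribution by shuffling one sequence (toy score).
x     = 'atagacagacca';
fixed = 'atacacagagca';          % hypothetical second sequence, same length
nShuffles = 1000;

observed = sum(x == fixed);      % identities in the real comparison

scores = zeros(1, nShuffles);
for t = 1:nShuffles
    y = x(randperm(length(x)));  % same length and composition as x
    scores(t) = sum(y == fixed); % score of the shuffled comparison
end

pEmpirical = mean(scores >= observed);
fprintf('observed = %d, mean shuffled = %.2f, empirical p = %.3f\n', ...
    observed, mean(scores), pEmpirical);
```

A real analysis would replace the identity count with an alignment score and fit an extreme value distribution to the shuffled scores to estimate K and lambda.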

Page 87: Sequence Alignment II

Alignment Statistics

Page 88: Sequence Alignment II

Alignment Statistics

Page 89: Sequence Alignment II

Alignment Statistics

Page 90: Sequence Alignment II

Alignment Statistics

Page 91: Sequence Alignment II

Probability Distributions: Binomial Distribution

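The probability of observing an event x times in n trials is given by the binomial distribution:

$$P(x) = \binom{n}{x} p^{x} q^{n-x}$$

• $\binom{n}{x}$ is the binomial coefficient
• p is the probability of event 1 and q is the probability of event 2
• n, p, and q are constant; x varies
• n and x are discrete
• p + q = 1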

Page 92: Sequence Alignment II

Binomial Distribution

Only two outcomes are possible on each of n trials.

The probability of success for each trial is constant (p, and q does not change).

All trials are independent of each other.

Page 93: Sequence Alignment II

Matlab: binopdf function

Y = binopdf(x,n,p)

where x is the number of successes (outcome), n is the total number of trials, and p is the probability of success in a single trial.

Page 94: Sequence Alignment II

Matlab: binopdf function

>> x = 0:10 % number of successes: 0, 1, 2, ..., 10

>> y = binopdf(x,10,0.5) % probability of each x in 10 trials with p = 0.5

>> plot(x,y,'+') % plot y against x using + markers

Page 95: Sequence Alignment II

Binomial probability density function

Page 96: Sequence Alignment II

Applications

Calculate the probability that exactly 2 of a couple’s 10 children (mother AA and father AB genotype) have the AB blood type.
• n = 10 % total number of children
• x = 2 % number of children with AB blood
• p = 0.5 % probability of having AB genotype
• q = 0.5 % probability of having AA genotype

Page 97: Sequence Alignment II

Matlab

>> p = 0.5;
>> q = 1-p;
>> n = 10;
>> x = 2;
>> fn = factorial(n);
>> fx = factorial(x);
>> fnminusx = factorial(n-x);
>> binocoef = fn./(fx.*fnminusx)
>> Pr = binocoef*p^x*q^(n-x)

Page 98: Sequence Alignment II

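Use parentheses to control the order of operations

Without parentheses around the denominator, ./ and .* are evaluated left to right, so fn./fx.*fnminusx computes (fn./fx).*fnminusx, which is not the binomial coefficient:

>> p = 0.5;
>> q = 1-p;
>> n = 10;
>> x = 2;
>> fn = factorial(n);
>> fx = factorial(x);
>> fnminusx = factorial(n-x);
>> binocoef = fn./fx.*fnminusx % wrong: parentheses missing
>> Pr = binocoef*p^x*q^(n-x)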

Page 99: Sequence Alignment II

Try this!

>> n = 1:100;
>> y = binopdf(n,100,0.5);
>> plot(n,y,'+')

Page 100: Sequence Alignment II

Binomial distribution

Page 101: Sequence Alignment II

Binomial Cumulative Distribution Function

Adds the probability value of the previous case to the next.

>> x = 0:10
>> n = 10
>> p = 0.5
>> y = binocdf(x,n,p)
>> plot(x,y,'r+')

Page 102: Sequence Alignment II

Cumulative distribution

Page 103: Sequence Alignment II

Expected value = mean value

The mean or expected value of an outcome (e.g., getting an H from a coin toss) for n trials would be:
• E(H) = np
• p = E(H)/n
• variance: $\sigma^2 = np(1-p)$

Page 104: Sequence Alignment II

Null hypothesis in statistics

States equality (or, in some cases, greater than or less than) between an observed and an expected value.

To test a null hypothesis:
• perform a statistical test
• calculate a p value
• reject or do not reject the null hypothesis using a threshold.

Page 105: Sequence Alignment II

Example

If a baseball team plays 162 games in a season and has a 50-50 chance of winning any game (p = winning = 0.5; q = losing = 0.5), then the probability of that team winning more than 100 games in a season is:

>> 1 - binocdf(100,162,0.5)

The result is 0.001 (i.e., 1-0.999). If a team wins more than 100 games in a season, this result suggests that it is likely that the team's true probability of winning any game is greater than 0.5.

Page 106: Sequence Alignment II

Example

In a population of Drosophila, the frequency of AA genotype is p (0.5) and the frequency of AB genotype is q (0.5).

If you sample from this population, the number of AA or AB individuals in the sample will be a function of their relative frequencies and the sample size (n). If n individuals are selected and x AB individuals are found, is this number greater or less than what could be obtained by chance alone?

>> binopdf(7,10,0.5)
ans = 0.1172
>> binopdf(70,100,0.5)
ans = 2.3171e-005

Page 107: Sequence Alignment II

Normal Distribution

A standard normal distribution will have a mean of 0 and variance of 1.

Page 108: Sequence Alignment II

Normal Probability Distribution

>> x = -5:0.05:5;
>> y = normpdf(x);
>> plot(x,y)

Page 109: Sequence Alignment II

Plot(x,y)

Page 110: Sequence Alignment II

Normal cumulative distribution

What is the probability that an observation from a standard normal distribution will fall on the interval [-1 1]?

>> p = normcdf([-1 1]);
>> p(2) - p(1)
ans = 0.6827

Page 111: Sequence Alignment II

PAM-2

Page 112: Sequence Alignment II

PAM-250

Page 113: Sequence Alignment II

PAM-250

Page 114: Sequence Alignment II

PAM-250

Page 115: Sequence Alignment II

PAM-250

Page 116: Sequence Alignment II

PAM-250

Page 117: Sequence Alignment II

PAM-250

Page 118: Sequence Alignment II

Multiple Sequence Alignment

Page 119: Sequence Alignment II

Multiple Sequence Alignment

Page 120: Sequence Alignment II

MegaBLAST

megaBLAST
• For aligning sequences which differ slightly due to sequencing errors etc.
• Very efficient for long query sequences
• Uses big word (k-tuple) sizes to start the search, which makes it very fast
• Accepts batch submissions of ESTs
• Can upload files of sequences as queries

More detailed info: see megaBLAST pages

Page 121: Sequence Alignment II

P-values

The probability of finding b HSPs with a score of at least S is given by:

$$\frac{e^{-E} E^{b}}{b!}$$

For b = 0, that chance is $e^{-E}$.

Thus the probability of finding at least one such HSP is:

$$P = 1 - e^{-E}$$
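A small MATLAB sketch makes the relation between E and P concrete. The E-values below are arbitrary examples; the point is that P is close to E when E is small and approaches 1 when E is large.

```matlab
% Sketch: P-value from E-value, plus the Poisson probabilities behind it.
E = [10 1 0.1 0.001];              % example E-values (arbitrary)
P = 1 - exp(-E);                   % probability of at least one such HSP

for k = 1:numel(E)
    fprintf('E = %-6g  P = %.4g\n', E(k), P(k));
end

% Probability of exactly b HSPs (b = 0..3) when E = 1
b = 0:3;
poissonProb = @(Ev) exp(-Ev) .* Ev.^b ./ factorial(b);
pb = poissonProb(1);
fprintf('b = %d: %.4f\n', [b; pb]);
```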

Page 122: Sequence Alignment II

Alignment Statistics

Page 123: Sequence Alignment II

Alignment Statistics

