bioinfo davidmount 1

following list.

Chapter 1: Historical Introduction and Overview
Chapter 2: Collecting and Storing Sequences in the Laboratory
Chapter 3: Alignment of Pairs of Sequences
Chapter 4: Introduction to Probability and Statistical Analysis of Sequence Alignments
Chapter 5: Multiple Sequence Alignment
Chapter 6: Sequence Database Searching for Similar Sequences
Chapter 7: Phylogenetic Prediction
Chapter 8: Prediction of RNA Secondary Structure
Chapter 9: Gene Prediction and Regulation
Chapter 10: Protein Classification and Structure Prediction
Chapter 11: Genome Analysis
Chapter 12: Bioinformatics Programming Using Perl and Perl Modules
Chapter 13: Analysis of Microarrays

Chapter 1: Historical Introduction and Overview

This chapter describes how bioinformatics has evolved into a new field of scientific investigation, describes the roles of biological and computational research in this field, and provides a brief historical account. Also provided is an overview of the chapters in this second edition. References to earlier and current reference books, articles, reviews, and journals provide a broader view of the field.

Chapter 2: Collecting and Storing Sequences in the Laboratory

This chapter summarizes methods used to collect sequences of DNA molecules and store them in computer files. Procedures ranging from the actual sequencing, through determination of accuracy, choice of sequence format, conversions from one format to another, storage in databases, and accessing sequences in databases are described.

Table 2.5. Major sequence databases accessible through the Internet

1. GenBank at the National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland, accessible from: http://www.ncbi.nih.gov/Entrez/

2. European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England: http://www.ebi.ac.uk/embl/index.html

3. DNA DataBank of Japan (DDBJ) at Mishima, Japan: http://www.ddbj.nig.ac.jp/

4. Protein Information Resource (PIR) database at the National Biomedical Research Foundation in Washington, D.C. (see Barker et al. 1998), an annotated protein database: http://www-nbrf.georgetown.edu/pirwww/

5. The SwissProt protein sequence database at ISREC, Swiss Institute for Experimental Cancer Research in Epalinges/Lausanne, an annotated protein database: http://www.expasy.org/cgi-bin/sprot-search-de

6. The Sequence Retrieval System (SRS) at the European Bioinformatics Institute allows both simple and complex concurrent searches of one or more sequence databases. The SRS system may also be used on a local machine to assist in the preparation of local sequence databases: http://srs6.ebi.ac.uk

The databases are available at the indicated addresses and return sequence files through an Internet browser. Many of the sites shown provide access to multiple databases. The first three database centers are updated daily and exchange new sequences daily, so that it is only necessary to access one of them. Additional Web addresses of databases of protein families and structure, and genomic databases, are given in Chapters 10 and 11. These databases can also provide access to sequences of a protein family or organism.

The annotated protein data banks traditionally examine the scientific literature for physical evidence that the protein is actually produced in cells. The presence of mRNA sequences reveals that the gene is expressed but does not reveal whether or not the mRNA is translated into a protein. However, some proteins may be difficult to detect because they are made in small quantities, in specific cells or tissues, or at a particular time in development. Codon use by the mRNA of suspect genes can be examined for consistency with codon use by other genes that are known to be translated, as discussed in Chapter 9.

Problems > Chapter 2

THE WWW SITES TO USE FOR THESE PROBLEMS ARE:

Entrez http://www.ncbi.nlm.nih.gov/entrez/
LocusLink http://www.ncbi.nlm.nih.gov/LocusLink/
SRS http://srs.ebi.ac.uk/
SGD http://www.yeastgenome.org/
PIR http://pir.georgetown.edu/
SwissProt http://www.expasy.ch/sprot/
READSEQ http://searchlauncher.bcm.tmc.edu/seq-util/readseq.html (or do a Web search for Readseq to locate another site)
The Institute for Genomic Research (TIGR) http://www.tigr.org

1. This problem practices using the Entrez search program at the National Center for Biotechnology Information (NCBI) to perform a search for the amino acid sequence of the human heat shock factor HSF1. Normally a large number of matches are found in such searches. We will use the Entrez Boolean search features, which restrict the reported matches to a series of required conditions. This feature allows us to narrow the search to the sequence that we want.

The SRS Web site given above also provides powerful database search routines especially designed for the retrieval of large data sets. The student is encouraged to repeat some of the following exercises on this site.

a. Go to the Entrez Web site and choose Protein from the drop-down window in the upper left.

b. Enter the terms <heat shock factor> (without the angled brackets) in the search window and click the mouse on GO. This search finds any sequence entry in the available protein sequence databases that has these three words anywhere in the text. Show how many matches (hits) are found by clicking History.

c. Now reduce the search by entering the same terms but surrounding them by quotes "heat shock factor". The matches must now include this phrase. This time click Preview to go directly to the number of hits in the protein database. What is the number now?

d. Now limit the search by clicking the mouse on Preview/Index, go to add terms, choose organism in the first box, type human in the second, then click AND to limit the search to just human proteins, and then click Preview. The history will now show the results of a search for database entries with the term "heat shock factor" AND originating from humans as the organism. How many hits are there now?

e. We can limit the hits to matches to RefSeq, which is GenBank's annotated sequence database, to give a best representative sequence entry for each protein. Click the mouse on Limits, and in the Limited To section of the page, ignore the boxes on the left, and choose RefSeq in the right box. Then click GO and History. Now we have all human heat shock factors in RefSeq.

f. The gene of interest is HSF1. Click clear in the text entry box at the top of the page, type HSF1, and click Preview. There should now be one entry left in History. Clicking on the number 1 provides the sequence.

g. There are other ways of arriving at this final sequence. As another example, pull out all human protein sequences in RefSeq and all HSF1 sequences in all organisms and then select the human one using another Boolean search feature of Entrez. First clear History, clear the upper text box, and reselect Limits, or else just reload the Entrez page and choose Protein in the upper left box. Enter human in the text box at the top, click Limits, and then in the Limited To area, choose Organism in the upper left box and RefSeq in the right box. Click GO and then History. Now we have a complete list of all human proteins in RefSeq.

h. Now replace human with HSF1 in the upper text field, click Limits, and in the Limited To area, choose gene name in the upper left box and RefSeq in the right box. Click GO and then History. The result should be a small number of HSF1 proteins.

i. Finally, note the numbers at the beginning of the two lines that start with a pound sign (#) in history that were found by the last two searches. Go to the upper text box and type <#1 AND #2> (assuming the numbers are 1 and 2) and omit the angled brackets. This now creates a new search in which only protein sequences are matched that are from humans and which are the HSF1 gene, i.e., the new search is an intersection of the previous two. Again, 1 protein should be left.

j. Note the RefSeq accession number starting with "NP" and use the mouse links to display the sequence in FASTA format. "NP" identifies the sequence as a curated protein sequence. The sequence may then be copied and pasted into the page of a simple text editor and saved as a local computer file.

k. While on the page with the target sequence, click on Links and choose the Nucleotide option. Now the mRNA and genome sequence corresponding to the protein should become available. Note that the RefSeq numbers start with NM for annotated mRNA sequence and NT for the annotated genome/chromosome 8 sequence. There are also links to a display of the genome/chromosome map location of the gene and other useful information to explore at leisure.
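The Boolean narrowing practiced in steps c through i can also be written as a single query string. The sketch below is only an illustration of how the phrase, organism, and gene-name conditions combine with AND. The NCBI E-utilities esearch address and the field tags [Organism] and [Gene Name] are not part of the original problem, which uses the Web form, and should be treated as assumptions to check against current NCBI documentation. The code only builds the query and the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities search endpoint (not given in the text).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def combine_and(*terms):
    """Join search terms with AND, like typing '#1 AND #2' in the Entrez history."""
    return " AND ".join(f"({term})" for term in terms)


def protein_search_url(term):
    """Build (but do not send) a search URL for the protein database."""
    return ESEARCH_URL + "?" + urlencode({"db": "protein", "term": term})


# The three conditions used in the problem; the field tags are assumptions.
phrase = '"heat shock factor"'   # quoted so the words must occur as a phrase
organism = "human[Organism]"     # restrict to human entries
gene = "HSF1[Gene Name]"         # restrict to the HSF1 gene

query = combine_and(phrase, organism, gene)
print(query)
print(protein_search_url(query))
```

Each additional condition can only shrink the set of matches, which is why the hit counts fall from step b to step f.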

2. Another useful NCBI search tool is LocusLink, which can be used to search for information on genes and proteins based on the known location of the genes on chromosomes that have been sequenced. Eventually Entrez and LocusLink will probably be combined at NCBI to create an even more powerful search machine. We will retrieve information about the HSF1 protein using LocusLink.

a. Go to NCBI LocusLink address given above. In the first box choose LocusLink, then Brief in the second, and Human as the search organism in the third. Then enter HSF1 as the query and click GO. A small number of entries match the query and one of them should be HSF1. The position column shows the relative numbered position on the long arm (q for the long arm) of chromosome 8. The colored boxes provide sequence of the gene with direct links to RefSeq. Clicking on the green P will give the protein sequence entries of the protein including the RefSeq sequence labeled NP.

b. Click on the empty box beside the sequence and then click view to produce a page with a great deal of information about the gene, including gene structure, genome location, RefSeq protein and nucleic acid sequence identifiers, and much useful information about the evidence on which the gene sequence is based. Click on OMIM (Online Mendelian Inheritance in Man) to see a biological summary of the HSF1 gene functions.

3. Visit the Saccharomyces cerevisiae (budding yeast) genome database (SGD) Web site to learn about the yeast transcription factor HSF1. Go to SGD and look up the following information using the global gene hunter. Enter the name of the gene, limit the search by unclicking boxes as needed, and click Submit. Use the links on the following page to answer these next questions. (Note: There is a large group of Ph.D. fellows who scan the literature frequently and add the information to the SGD database.)

a. On what chromosome does the gene reside?

b. What is the mature length of the protein?

c. What are the SwissProt and PIR accession numbers?

d. Is the gene found in other species (not at all, one or two, or many)? If so, give an example of the name of the similar gene in another species.

4. Using any accession number found above, retrieve the sequence in FASTA format from SwissProt and save the file on your PC as ScXXXX.pro, where XXXX is the gene name. Note that, traditionally, SwissProt only includes proteins for which there is physical evidence that they exist; e.g., they can be seen as a spot or a band on a gel.

5. READSEQ is a very useful utility for converting among sequence formats. Read through the online help file before continuing.

a. Retrieve the mRNA sequence of the yeast SNF2 gene from GenBank (the accession number is YSCSNF2A, but try using other fields).

b. Now go to a Web-based READSEQ conversion page and copy and paste the GenBank sequence into the sequence input box and choose Pearson/FASTA format as the output format. Click Perform Conversion and a new box will appear with the sequence in FASTA format. Copy and paste the sequence into the text editor and save as a file called snf2mRNA.seq on your computer.
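To make the conversion in this problem concrete, here is a minimal sketch of what a GenBank-to-FASTA conversion does: it takes the entry name from the LOCUS line, collects the residues between ORIGIN and the closing "//", and strips the position numbers and spaces. It is not READSEQ and handles only this simple case; the function name and the tiny example record are made up for illustration.

```python
def genbank_to_fasta(genbank_text, line_width=60):
    """Convert one GenBank flat-file record to FASTA text (simplified)."""
    name = "unknown"
    residues = []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]          # entry name follows the LOCUS keyword
        elif line.startswith("ORIGIN"):
            in_origin = True              # sequence lines start after ORIGIN
        elif line.startswith("//"):
            in_origin = False             # '//' terminates the record
        elif in_origin:
            # Keep letters only, dropping position numbers and spaces.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(residues).upper()
    rows = [sequence[i:i + line_width] for i in range(0, len(sequence), line_width)]
    return "\n".join([">" + name] + rows) + "\n"


# A made-up record, not a real GenBank entry.
example = """LOCUS       DEMO1                     24 bp    DNA     linear   UNK
ORIGIN
        1 atggatccgt ggtaaatgga tccg
//
"""

print(genbank_to_fasta(example), end="")
```

The annotation in the GenBank entry (features, references) is discarded in this conversion, which is why the original GenBank file is worth keeping.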

6. In addition to individual genes, whole genomes of organisms are becoming available, including many prokaryotes, organelles, and viruses. One good way to retrieve these genomic sequences is through the NCBI Entrez page for genomes and the taxonomy browser.

a. Go to the NCBI Entrez page and then to the genome page (on the bar at the top of the Entrez page). Enter "Homo sapiens mitochondrion" and click on the entry that appears for the human mitochondrion. Note that the RefSeq accession number starts with NC (nucleotide sequence of chromosome). Examine the sequence of the Homo sapiens mitochondrion. What is the length? Roughly outline the genes that are present. Click on the map to see the genes that are present and then on the gene blocks to see the sequences.

b. Another resource for microbial genomes is at The Institute for Genomic Research. Go to the Comprehensive Microbial Resource page, choose genomes, and click on the genome name of Synechocystis sp. PCC 6803 under the group Cyanobacteria. This is an ancient organism that produces oxygen from light and puts oxygen in the atmosphere. What is the size of the genome and how many proteins are encoded? What does the color code of the genome represent?

Chapter 3: Alignment of Pairs of Sequences

Knowing how to align a pair of nucleic acid or protein sequences is a fundamentally important area of bioinformatics. For very similar sequences with runs of identical or commonly found substitutions, this is quite readily done; but as the sequences become more divergent, they also become more difficult to align. Thus, a method is required to find the very best possible or optimal alignment given an expected pattern of variation in sequences that are related but have also diverged over evolutionary time. Even if such requirements are met, human judgment may have to be used because there may be more than one possible alignment and some regions may align much better than others, leaving the poorly aligning regions in doubt. This chapter discusses how to perform pair-wise sequence alignments and how to score these alignments. The chapter following this one describes how to evaluate the significance of alignment scores.

Table 3.1. Web sites for alignment of sequence pairs

Name of site | Web address | References
Bayes block aligner (a) | http://www.wadsworth.org/resnres/bioinfo | Zhu et al. (1998)
Likelihood-weighted sequence alignment (b) | http://stateslab.bioinformatics.med.umich.edu/service | see Web site
PipMaker (percent identity plot), a graphical tool for assessing long alignments | http://www.bx.psu.edu/miller_lab/ | Schwartz et al. (2000)
BCM Search Launcher (c) | http://searchlauncher.bcm.tmc.edu/ | see Web site
SIM: local similarity program for finding alternative alignments | http://us.expasy.org/ | Huang et al. (1990); Huang and Miller (1991); Pearson and Miller (1992)
Global alignment programs (GAP, NAP) | http://genome.cs.mtu.edu/align/align.html | Huang (1994)
FASTA program suite (d) | http://fasta.bioch.virginia.edu/ | Pearson and Miller (1992); Pearson (1996)
Pairwise BLAST (e) | http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html | Altschul et al. (1990)
AceView (f), shows alignment of mRNAs and ESTs to the genome sequence | http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly | see Web site
BLAT (f), fast alignment for finding genes in genome | http://genome.ucsc.edu | Kent (2002)
GeneSeqer (f), predicts genes and aligns mRNA and genome sequences | http://www.bioinformatics.iastate.edu/bioinformatics2go/ | Usuka et al. (2000)
SIM4 (f) | http://globin.cse.psu.edu | Florea et al. (1998)

(a) See Chapter 4 for description and examples.

(b) The probabilistic method of aligning two sequences is described in Durbin et al. (1998) and Chapter 4. A related topic, hidden Markov models for multiple sequence alignments, is discussed in Chapter 5.

(c) This server provides access to a number of Web sites offering pair-wise alignments between nucleic acid sequences, between protein sequences, or between a nucleic acid and a protein sequence.

(d) The FASTA algorithm normally used for sequence database searches (see Chapter 6) provides an alternative method to dynamic programming for producing an alignment between sequences. Briefly, all short patterns of a certain length are located in both sequences. If multiple patterns are found in the same order in both sequences, these provide the starting point for an alignment by the dynamic programming algorithm. Older versions of FASTA performed a global alignment, but more recent versions perform a local alignment with statistical evaluations of the scores. The program PLFASTA in the FASTA program suite provides a plot of the best-matching regions, much like a dot matrix analysis, and thus gives an indication of alternative alignments. The FASTA suite is also available from Genestream at http://vega.igh.cnrs.fr/. Programs include ALIGN (global, Needleman–Wunsch alignment), LALIGN (local, Smith–Waterman alignment), LALIGNO (Smith–Waterman alignment, no end gap penalty), FASTA (local alignment, FASTA method), and PRSS (local alignment with scrambled copies of second sequence to do statistical analysis). Versions of these programs that run with a command-line interface on MS-DOS and Macintosh microcomputers are available by anonymous FTP from ftp.virginia.edu/pub/fasta.

(e) The BLAST algorithm normally used for database similarity searches (Chapter 6) can also be used to align two sequences.

(f) Program useful for aligning expressed gene sequences (ESTs or mRNA) to genomic DNA.


PART I. DOT MATRIX ANALYSIS

Using DNA Strider on a Macintosh (DNA Strider is available from Dr. Christian Marck, [email protected])

A. Protein Sequence Comparison

Compare two Escherichia coli phage repressor protein sequences by copying FASTA-formatted sequences of phage λ cI repressor protein (accession no. RPBPL) and the phage p22 c2 repressor protein (accession no. RPBP22) from the GenBank display window at http://www.ncbi.nlm.nih.gov/entrez/ into two DNA Strider protein windows as follows:

1. Highlight lamc1.pro (RPBPL) with the mouse and, using the Copy option in the edit window, copy the sequence into the clipboard.

2. Start DNA Strider and open a new protein sequence window.

3. Paste the lamc1.pro sequence into the new protein window using the Paste command in the Edit window.

4. Place the cursor at the start of the sequence.

5. Return to the editor and, using the same procedure as above, copy the p22c2.pro sequence (RPBP22) into the clipboard, and then into a second new protein window in DNA Strider.

6. Place the cursor at the start of the sequence. The sequences in the two windows, one in the top window and the other in the bottom window, may now be compared using the matrix option of DNA Strider.

7. Hold down the Option key, choose the matrix drag-down window, and choose the protein matrix option. Then release the mouse button.

8. In the window that appears, set the protein matrix options on the right. Choose a window and stringency of 1, and a scale of auto by sliding the cursor in the windows to these choices. Choose an identity matrix. Then click on the matrix button for proteins.

9. Examine the matrix for the presence of a row of dots that represents a region of sequence similarity. Note the background matching that also appears, and which will be eliminated below by using a larger window.

10. Close the matrix window and then repeat the matrix analysis by choosing a window of 2 and a stringency of 2. Note that the similarity stands out much more clearly.

11. Repeat the matrix analysis again looking for a stringency of 2 in a window of 3 amino acids. Note that the region of similarity stands out more clearly still but that the resolution, i.e., the exact position of the individual amino acid matches, is not as clear.

12. Repeat the analysis using the amino acid scoring matrix BLOSUM62.
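The effect of the window and stringency settings in steps 8 through 11 can be sketched in a few lines of code. This is an illustration only, not the DNA Strider algorithm: it assumes that a dot is placed at position (i, j) when at least "stringency" of the "window" residues starting at i in the first sequence and at j in the second are identical (an identity matrix). The function names and the short test sequence are made up.

```python
def dot_matrix(seq1, seq2, window=1, stringency=1):
    """Return the set of (i, j) window start positions (0-based) that earn a dot."""
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            # Count identities between the two windows (identity matrix scoring).
            matches = sum(1 for k in range(window) if seq1[i + k] == seq2[j + k])
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq1, seq2, dots):
    """Draw the matrix as text: rows follow seq1, columns follow seq2."""
    rows = []
    for i in range(len(seq1)):
        rows.append("".join("*" if (i, j) in dots else "." for j in range(len(seq2))))
    return "\n".join(rows)


sequence = "ABCA"  # made-up sequence compared with itself
print(render(sequence, sequence, dot_matrix(sequence, sequence, window=1, stringency=1)))
print()
print(render(sequence, sequence, dot_matrix(sequence, sequence, window=2, stringency=2)))
```

With window 1 every chance identity produces a dot, including the off-diagonal matches between the two A residues. Requiring two matches in a window of two removes those isolated dots and leaves only the diagonal, which is the background reduction described in steps 9 and 10.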

B. DNA Sequence Comparison

1. To retrieve the DNA sequences of the above two repressor genes, go to the GenBank entries for the phage λ and p22 genomes and retrieve the gene sequences from the features table (NC_001416, complementary strand positions 37227..37940 and NC_002371, complementary strand positions 12764..13414, respectively).

2. Find the coding sequence entry (CDS) for the repressor genes in Features and click the mouse on the CDS link. A new window will come up with the DNA sequence.

3. Copy and paste the sequences into two DNA windows in DNA Strider.

4. Use the DNA matrix option in the matrix window to obtain a dot matrix analysis of these DNA sequences using stringencies and windows of 1 and 1, and 7 and 10, respectively, using the identity matrix.

C. Self-comparison for Finding Repeated Sequences

1. Open a GenBank window for the haptoglobin hp2 protein sequence (accession no. 1006264A).

2. Copy the sequence into a new protein window in DNA Strider.

3. Use the protein self-matrix option to compare the sequence to itself. Use window 1, stringency 1, and identity matrix.

4. Note the presence of any repeated elements and where they are.

D. Complex Repeated Elements

1. Obtain the human and chicken erythroid transcription factors (accession nos. CAA35120 and P17678, respectively) from GenBank.

2. Copy and paste the sequence into protein windows in DNA Strider.

3. Compare these sequences, first each to itself and then to each other using the same stringency and window settings in each case (2/3 or much higher, such as 15/23).

4. What primary structure features do these proteins share? Look at the sequences and see if you can identify any features, e.g., repeats of the same amino acid, that are affecting the appearance of the dot matrix.

E. Sequence Complexity

When the same sequence characters are repeated many times, the complexity of the sequence is said to be low; i.e., the number of different sequence characters present is quite small, or only a single character may be present. These regions can make alignments look artificially good and score artificially high. They become quite apparent on the self-matrix as horizontal or vertical rows of dots.

1. Examine the self-matrix pattern for the human erythroid factor above (match of stringency 1 to window 1, identity matrix) and describe what is observed around sequence positions 55 and 265.

2. Examine the sequence and report what is found in the sequence at these positions.
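One simple way to locate such regions without a dot matrix is to slide a window along the sequence and count how many different characters it contains. The sketch below is an illustration under that assumption, not the method used by any particular program; the function name, the window size, and the test sequence are made up.

```python
def low_complexity_windows(sequence, window=10, max_distinct=2):
    """Return 1-based start positions of windows with few distinct characters."""
    hits = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        if len(set(segment)) <= max_distinct:
            hits.append(start + 1)  # report positions the way sequence viewers do
    return hits


# Made-up sequence with a run of ten alanines starting at position 4.
test_sequence = "MKT" + "A" * 10 + "QL"
print(low_complexity_windows(test_sequence, window=5, max_distinct=1))
```

Positions reported by a scan like this should correspond to the horizontal and vertical streaks seen on the self-matrix.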

Using EMBOSS Dot Matrix Software

For the instructor. A knowledgeable computer support person will need to compile the EMBOSS programs on a UNIX or Linux server (Mac OS X is an alternative, but more time-consuming, option) and then provide X server access to PCs from the server as discussed in Figure 3.5 and the text. The EMBOSS programs are well documented, and online help is accessed through the tfm program followed by the program name, e.g., dotmatcher. Some of the displays done above with DNA Strider cannot be shown because there is a minimal window size of 3 in dotmatcher. An alternative dot matrix program, dotter, is described in the text. The sequences should be retrieved as text files in FASTA format in a convenient location for student access on the server. This task is good practice for students to do themselves if they have an account on the server. For now, they could save the GenBank files on a PC and then move them to the server, for example. It is also a good idea to make a protein scoring matrix that scores identities as 1 and mismatches as 0 using a text editor and place this matrix in the EMBOSS data directory with the other scoring matrices. In Chapter 12, students will learn how to retrieve sequences from GenBank directly using Perl scripts.

For the students. For the above problems, it is best to have FASTA files of the sequences (other sequence formats can also be used).

1. Run dotmatcher on the remote server using the X-Window client program. The program prompts for sequence names, reads in the sequences, and prompts for window size and stringency.

2. For most input queries, hit Return to give reasonable choices.

3. You can also use different scoring matrices to test the effects on the results, but these must be entered as options when you type in the name of the program, e.g., "dotmatcher -matrixfile=mychoice".

4. Type "tfm dotmatcher" to read about all of the options.

5. If all goes well, the results will be displayed in a window.

PART II. ALIGNMENT OF TWO SEQUENCES BY THE DYNAMIC PROGRAMMING ALGORITHM

In this section, protein sequence pairs will be aligned using Internet servers.

1. Using one of the Web sites listed below and the default conditions provided by the Web site, align the protein sequences for the phage λ and p22 phage repressors. Cut and paste FASTA files of the sequences already available into these sites.

2. Record the resulting percent identity and similarity and briefly describe what each represents.

Internet Sites for Sequence Alignment

The following are Web sites that will perform sequence alignment of two sequences by the dynamic programming algorithm.

1. LALIGN (http://fasta.bioch.virginia.edu/) at the University of Virginia. This program is also available for Mac and PC computers but without a windows/mouse interface. The program finds not just one, but n nonoverlapping alignments of two sequences according to the SIM algorithm discussed in the text. In these alignments, the same two residues will never be found together more than once.

2. SIM (http://us.expasy.org/tools/sim-prot.html) uses the same algorithm as the above site.

3. BCM (http://searchlauncher.bcm.tmc.edu/). The Baylor College of Medicine Web site offers a variety of methods of sequence alignment. Read the "h" option to see how these programs work. Not all of these programs use dynamic programming as the sole method. LFASTA and BLAST2 search for common words and then align on the basis of these words. The program align is a global alignment program based on the Needleman–Wunsch alignment algorithm instead of the Smith–Waterman local alignment algorithm. Unless dealing with strongly similar sequences of the same length, and alike along their entire lengths, a global alignment will not be useful.

PART III. CALCULATION OF SEQUENCE ALIGNMENT SCORES

Calculation of Log Odds and Odds Scores by the BLOSUM Method

In one column of an alignment of a set of related, similar sequences, amino acid D changes to amino acid E at a frequency of 0.10, and the number of times this change is expected based on the number of occurrences of D and E in the column is 0.05.

1. What is the odds score of finding a D-to-E substitution in an alignment?

2. What is the log odds score for the D-to-E substitution in bits? (Note: log to base 2 = natural log / 0.693.)

3. What would be the entry in the BLOSUM amino acid scoring matrix for this substitution? Compare your result to the actual entry in the BLOSUM62 matrix.

4. In the same column, D does not change at all at a frequency of 0.80, and the expected frequency of D not changing is 0.10. Calculate the corresponding log odds score and the BLOSUM62 entry for D not changing.
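The arithmetic behind these questions can be sketched as follows. The odds score is the observed frequency divided by the expected frequency, the log odds score in bits is the base-2 logarithm of that ratio, and the sketch assumes that a BLOSUM matrix entry is the log odds score expressed in half-bits and rounded to the nearest integer. The frequencies used in the example are hypothetical and deliberately differ from those in the problem.

```python
import math


def odds_score(observed, expected):
    """Ratio of the observed substitution frequency to the chance expectation."""
    return observed / expected


def log_odds_bits(observed, expected):
    """Log odds score in bits (log to base 2 of the odds score)."""
    return math.log2(odds_score(observed, expected))


def blosum_entry(observed, expected):
    """Matrix entry, assuming half-bit units rounded to the nearest integer."""
    return round(2 * log_odds_bits(observed, expected))


# Hypothetical frequencies for illustration only (not the values in the problem).
observed, expected = 0.12, 0.03
print(odds_score(observed, expected))
print(log_odds_bits(observed, expected))
print(blosum_entry(observed, expected))
```

A ratio above 1 gives a positive score (the substitution is seen more often than chance predicts), and a ratio below 1 gives a negative score.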

Log Odds and Odds Score of a Short Alignment

1. Using the above values, what is the log odds score of the following alignment in bits? (Note that these two short sequences have very low sequence complexity by having only two amino acids of the available 20. These sequences were chosen to simplify the calculations. Alignments of low complexity sequences can give quite high scores that are misleading as to the sequence similarity, as discussed in Chapter 6.)

DEDEDEDE
DDDDDDDD

2. What is the odds score of the above alignment?
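For an alignment without gaps, the log odds score is the sum of the log odds scores of the aligned columns, and the odds score is the product of the column odds, which equals 2 raised to the total score in bits. The sketch below shows this relationship with hypothetical column scores and made-up sequences; the function names are illustrative.

```python
def alignment_log_odds(seq1, seq2, column_scores):
    """Sum the per-column log odds scores (in bits) of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(column_scores[(a, b)] for a, b in zip(seq1, seq2))


def alignment_odds(total_bits):
    """Convert a total log odds score in bits back to an odds score."""
    return 2 ** total_bits


# Hypothetical scores in bits, keyed by (residue in seq1, residue in seq2).
scores = {("A", "A"): 2.0, ("G", "A"): 0.5}

bits = alignment_log_odds("AGAG", "AAAA", scores)
print(bits)
print(alignment_odds(bits))
```

Adding logarithms instead of multiplying odds is the reason scoring matrices store log odds values.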

PART IV. COMPARING ALIGNMENT SCORES WITH SMALL AND LARGE GAP PENALTIES

For this question, use the program LALIGN on the University of Virginia FASTA server http://fasta.bioch.virginia.edu/. This program aligns sequences by a local dynamic programming algorithm and includes end gap penalties. It produces as many different alignments as specified, with no two alignments including a match of the same two sequence positions.

1. Obtain the following two sequences from GenBank in FASTA format: recA.pro (P03017) from the bacterium E. coli and rad51.pro (P25454) from budding yeast (Saccharomyces cerevisiae). These proteins have the same function, i.e., promoting the pairing of homologous single-stranded DNAs. They almost certainly have the same three-dimensional structure but have diverged enough that they are difficult to align.

2. Use LALIGN to align the above two sequences with gap penalties of –12 and –2. Note the length of the alignment, the percent identity, and the score of the alignment.

3. Repeat the alignment with gap penalties of –5 and –1 and note the features of the alignment.

4. Describe what happened when the gap penalties were reduced. Which of these alignments looks like a local alignment and which looks like a global alignment?

PART V. USING THE DYNAMIC PROGRAMMING METHOD TO CALCULATE THE LOCAL ALIGNMENT OF TWO SHORT SEQUENCES BY HAND

The BLASTP algorithm performs a local alignment between a query sequence and a matching database sequence using the dynamic programming algorithm with the BLOSUM62 scoring matrix, a gap opening penalty of –11, and a gap extension penalty of –1 (i.e., a gap of length 1 has a penalty of –11, one of length 2, –12, etc.). Align the sequences MDPW and MEDPW using the Smith–Waterman algorithm described in the dynamic programming notes by following the global alignment example given in the notes, but using the Smith–Waterman algorithm.

1. Make a matrix for keeping track of best scores and a second matrix to keep track of the moves that give the best scores. (Hint: The alignment of M's, P's, and W's all give high scores, so the problem boils down to how to align D with ED and is actually quite a trivial problem.)

2. Use the BLOSUM62 matrix and BLASTP gap penalties of –11,–1. What is the optimal alignment and score between these two sequences?
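A hand calculation can be checked against a small program. The sketch below is one way to write the Smith–Waterman recurrence with the gap penalty described above (11 for a gap of length 1, plus 1 for each further position); it is not the BLASTP implementation. It tries every gap length directly, which is slow but easy to follow for short sequences. Only the BLOSUM62 entries needed for the residues M, D, E, P, and W are included, and the function names are made up.

```python
# BLOSUM62 entries for the residues that occur in MDPW and MEDPW only.
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("D", "D"): 6, ("E", "E"): 5, ("P", "P"): 7, ("W", "W"): 11,
    ("M", "D"): -3, ("M", "E"): -2, ("M", "P"): -2, ("M", "W"): -1,
    ("D", "E"): 2, ("D", "P"): -1, ("D", "W"): -4,
    ("E", "P"): -1, ("E", "W"): -3, ("P", "W"): -4,
}


def substitution_score(a, b):
    """Look up a symmetric matrix entry stored under either residue order."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def gap_penalty(length, gap_open=11, gap_extend=1):
    """A gap of length 1 costs -11, length 2 costs -12, and so on."""
    return -(gap_open + gap_extend * (length - 1))


def smith_waterman(seq1, seq2):
    """Return (best score, aligned seq1, aligned seq2) for a local alignment."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]         # matrix of best scores
    move = [[None] * cols for _ in range(rows)]   # matrix of moves
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # Zero comes first so that a tie at 0 ends the local alignment.
            candidates = [(0, None)]
            diagonal = H[i - 1][j - 1] + substitution_score(seq1[i - 1], seq2[j - 1])
            candidates.append((diagonal, ("diag", 1)))
            for k in range(1, i + 1):             # gap of length k in seq2
                candidates.append((H[i - k][j] + gap_penalty(k), ("up", k)))
            for k in range(1, j + 1):             # gap of length k in seq1
                candidates.append((H[i][j - k] + gap_penalty(k), ("left", k)))
            H[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest score until a cell with no move (score 0).
    i, j = best_pos
    top, bottom = [], []
    while move[i][j] is not None:
        kind, k = move[i][j]
        if kind == "diag":
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif kind == "up":
            for _ in range(k):
                top.append(seq1[i - 1]); bottom.append("-")
                i -= 1
        else:
            for _ in range(k):
                top.append("-"); bottom.append(seq2[j - 1])
                j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, aligned1, aligned2 = smith_waterman("MDPW", "MEDPW")
print(score)
print(aligned1)
print(aligned2)
```

Comparing the printed score with the score of the gapped alternative that includes the initial M shows why a large gap opening penalty favors a shorter local alignment here.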

Chapter 4: Introduction to Probability and Statistical Analysis of Sequence Alignments

One of the most important recent advances in sequence analysis is the development of methods to assess the significance of a local alignment between DNA or protein sequences. For sequences that are obviously related—two proteins that are clearly in the same family, or two matching or overlapping DNA fragments—such an analysis is hardly necessary. The question of significance arises when comparing two sequences that are not so clearly similar but are shown to align in a promising way. In such a case, a significance test can help the biologist to decide whether an alignment found by the computer program is one that would be expected between related sequences or would just as likely be found if the sequences were not related. A significance test is also critical for evaluating the results of a database search for sequences that are found to be similar to a query sequence using the BLAST and FASTA programs (Chapter 6). The test is applied to every sequence matched so that the most significant matches can be reported. Finally, a significance test can also help to identify regions in a single sequence that have an unusual composition suggestive of an interesting function.

Our goal here is to examine the significance of sequence alignment scores obtained by the dynamic programming method. Adequate theory has been developed and supportive experimental data have been obtained that together provide a reliable evaluation of local sequence alignments. This chapter outlines some of the major features of statistical testing and probability calculations and shows how to use these features to evaluate the significance of a sequence alignment.

Chapter 4 Web Search Terms

Probability and statistics courses with supportive Web pages are so commonly taught in colleges and universities that a large amount of supplementary information to this chapter can readily be found using search terms found throughout the chapter. The National Center for Biotechnology Information (NCBI) also provides informative manuals on statistical analysis of sequence alignment scores used in BLAST searches. The following search terms may also provide updates to Web site locations.

FASTA: a suite of computer programs by Dr. W.R. Pearson that includes tools for finding alternative alignments of sequences and statistical evaluation of sequence alignment scores.

LALIGN: a program for finding alternative local alignments of sequences as a test for validity of the highest scoring alignment.

PRSS: a sequence scrambling tool that is a part of the FASTA suite of programs and that may be used to evaluate the statistical significance of local alignment scores.

SIM: see LALIGN.

SAPS: a program for evaluation of statistical features of repeats and amino acid patterns and clusters in the same sequence.


1. Log odds and odds score of a short alignment. This question is a continuation of the question in Part III in Chapter 3 (p. 119).

a. Using the values calculated in Part III, Chapter 3, what is the log odds score of the following alignment in bits? (Note that these two short sequences have very low sequence complexity by having only two amino acids of the available 20. These sequences were chosen to simplify the calculations. Such alignments are quite high scoring, but the low complexity means that the score can be misleading as discussed in Chapter 5, p. 254.)

b. What is the odds score of the following alignment?

DEDEDEDE
DDDDDDDD

c. Using the section "Quick Determination of the Significance of an Alignment Score" (p. 139), and assuming that the above alignment was found by aligning two sequences of length 250, is the alignment significant at the 0.05 level? (That is, could an alignment of two random sequences of the same length achieve such a score with a probability of 0.05?)

d. If the gap penalty was very high, e.g., gap opening of 8 and gap extension of 8, so that no gaps were produced, and the BLOSUM62 scoring matrix was used, calculate the significance of the alignment using Equation 8. You will need to find the value of K and λ in Table 4.3 (p. 142) and note that λ in this table assumes that the alignment score is in half-bits so that the alignment score must be in these units also.
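The significance calculation in part d can be sketched in a few lines. This is a minimal illustration that assumes the extreme value form P(S ≥ x) = 1 − exp(−K·m·n·e^(−λx)) commonly used for local alignment scores; whether the book's Equation 8 is written in exactly this form is not confirmed here. The function name `evd_p_value` and the values of K and λ below are placeholders, not entries from Table 4.3, and the score must be expressed in the units that λ assumes.

```python
import math

def evd_p_value(score, m, n, K, lam):
    """Probability that two random sequences of lengths m and n give a
    local alignment score >= score, under an extreme value distribution."""
    expected_hits = K * m * n * math.exp(-lam * score)  # expected number of chance alignments
    return 1.0 - math.exp(-expected_hits)

# Placeholder parameters (assumptions for illustration only).
K, lam = 0.1, 0.3
p = evd_p_value(score=40, m=250, n=250, K=K, lam=lam)
print(f"P(S >= 40) = {p:.4f}; significant at 0.05: {p < 0.05}")
```

Substituting the tabulated K and λ for the chosen scoring matrix and gap penalties, together with the actual alignment score, gives the probability asked for in the problem.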

2. Statistical evaluation of sequence alignment scores. This question is a continuation of Part IV in Chapter 3 (p. 119).

We will calculate significance of an alignment between two sequences by scrambling one sequence many times and recalculating the alignment scores to see how they compare.

a. The program PRSS on the Pearson FASTA Web site http://fasta.bioch.virginia.edu/ will scramble the second sequence and calculate many alignment scores. Scrambling can be done at the individual amino acid level or with a window of amino acids to keep repetitive sequences intact.

b. A plot of the scores of the scrambled sequence alignments is shown on the Web page, and these scores are compared to the original alignment score between the sequences.

c. The scores are fitted to an extreme value distribution curve and K and λ are calculated. Note that when many such comparisons are made, e.g., when the first sequence is compared to 100 scrambled second sequences, the expected number of alignments achieving the original score has to be calculated. If the probability that one alignment with a scrambled sequence achieves the original score is 1/10,000 and 100 scrambled sequences were tested, then the expected value for 100 sequences is 1/10,000 × 100 = 1/100.

Obtain the same two sequences from GenBank in FASTA format as done previously.

a. Use PRSS to align the reca.pro (P03017 from the bacterium E. coli) and rad51.pro (P25454 from budding yeast S. cerevisiae) sequences downloaded in Chapter 3 problems with gap penalties of –12 and –2 and perform 1000 scrambled alignments. Note the expect value for the alignment score found between these proteins.

b. Repeat the analysis with gap penalties of –5 and –1 and note the expect score.

c. Describe what happened when the gap penalties were reduced.
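The scrambling idea behind this problem can be sketched as follows. This is not PRSS: it uses a simple Smith-Waterman score with a linear gap penalty and reports the raw fraction of shuffles that reach the original score, whereas PRSS fits an extreme value distribution and uses affine gap penalties. The scoring values, the function names, and the two short test sequences are all hypothetical.

```python
import random

def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best

def shuffle_test(a, b, n_shuffles=200, seed=0):
    """Score a against b, then against shuffled copies of b.
    Returns the original score and the fraction of shuffles that match or beat it."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # scramble at the single-residue level
        if local_score(a, "".join(letters)) >= observed:
            hits += 1
    return observed, hits / n_shuffles

# Hypothetical toy sequences, not the RecA/Rad51 proteins.
observed, fraction = shuffle_test("MKVLAAGIVGLTEEQ", "MKVLAGGIVALTDEQ")
print(observed, fraction)

# Expected number of chance hits when many comparisons are made:
p_single, n_tests = 1 / 10_000, 100
print(p_single * n_tests)  # the 1/100 from the example in the text
```

Shuffling preserves the length and composition of the second sequence, so the scrambled scores show what the scoring system yields for unrelated sequences of the same makeup.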

3. A Bayesian method for estimating evolutionary distance between nucleic acid sequences.

a. A great deal can be learned about the use of PAM matrices using the DNA PAM matrices as an example (p. 95).

i. First an alignment between two DNA sequences without any gaps is found. The object is to figure out how long ago (in PAM units of time where 1 PAM equals 10 million years) the sequences might have diverged to give the observed variation (the number of mismatches) in the alignment.

ii. A PAM1 matrix is made for an expected model of evolution, e.g., each base can change into any other base and the overall rate of change in the sequences is 1% (see Table 3.4, p. 107).

iii. For longer periods, e.g., PAM10, the PAM1 matrix is multiplied by itself n times (10 for PAM10).

iv. The more time, the greater the amount of change expected and these changes are reflected in the log odds scores of each particular PAM matrix. These are shown in the table of nucleic acid substitution matrices (Table 3.6 on p. 108) that assumes a uniform rate of mutation among nucleotides and that a 1% change in sequence represents 10 my (million years) of mutation.

v. Examine the substitution rates in the following sequences and decide approximately how many years ago they became separated.

AGTTG ACTAA GCCAG GTCAC
ACTTG CCGGA GCCTC GTGTC

b. What log odds and odds scores are found for the alignment for PAM distances of 10, 25, 50, 100, and 125, and which score is highest?

c. Add up the odds scores and determine the ratio of each to the total. What is the sum of these numbers, and what do these numbers represent?
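The steps of this problem can be sketched in code. The sketch assumes the uniform mutation model described above (each base changes to each of the other three with equal probability, 1% total change per PAM), reads the eight blocks in part v as two aligned 20-base sequences, uses unrounded log odds scores rather than the integer values of a printed table, and treats every PAM distance as equally likely beforehand. The function names are illustrative.

```python
import math

BASES = "ACGT"

def pam1(rate=0.01):
    """PAM1 transition matrix for a uniform mutation model."""
    off = rate / 3
    return [[1 - rate if i == j else off for j in range(4)] for i in range(4)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def pam(n):
    """PAM-n matrix: PAM1 multiplied by itself n times."""
    m1 = pam1()
    result = m1
    for _ in range(n - 1):
        result = matmul(result, m1)
    return result

def alignment_log_odds(seq1, seq2, n, background=0.25):
    """Sum of log2(M[i][j] / background) over the aligned positions, in bits."""
    m = pam(n)
    return sum(math.log2(m[BASES.index(x)][BASES.index(y)] / background)
               for x, y in zip(seq1, seq2))

seq1 = "AGTTGACTAAGCCAGGTCAC"
seq2 = "ACTTGCCGGAGCCTCGTGTC"
distances = [10, 25, 50, 100, 125]

log_odds = {n: alignment_log_odds(seq1, seq2, n) for n in distances}
odds = {n: 2 ** s for n, s in log_odds.items()}
total = sum(odds.values())
posterior = {n: odds[n] / total for n in distances}  # each odds score as a share of the total

for n in distances:
    print(n, round(log_odds[n], 2), round(posterior[n], 3))
```

The ratios in `posterior` sum to 1, and under the equal-prior assumption each one can be read as the relative support for that PAM distance given the alignment.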

Chapter 5: Multiple Sequence Alignment

One of the most important contributions of biological sequences to evolutionary analysis is the discovery that sequences of different organisms are often related. Similar genes are conserved across widely divergent species, often performing a similar or even identical function, and at other times, mutating or rearranging to perform an altered function through the forces of natural selection. Thus, many genes are represented in highly conserved forms in a wide range of organisms. Through simultaneous alignment of the sequences of these genes, the patterns of change in the sequences may be analyzed. Because the potential for learning about the structure and function of molecules by multiple sequence alignment (msa) is so great, the necessary computational methods have received a great deal of attention. In msa, sequences are aligned optimally by bringing the greatest number of similar characters into register in the same column of the alignment, just as described in Chapter 3 for the alignment of two sequences.

As with aligning a pair of sequences, the difficulty in aligning a group of sequences varies considerably, being much greater as the degree of sequence similarity decreases. If the amount of sequence variation is minimal, it is quite straightforward to align the sequences, even without the assistance of a computer program. However, if the amount of sequence variation is great, it may be very difficult to find an optimal alignment of the sequences because so many combinations of substitutions, insertions, and deletions, each predicting a different alignment, are possible.

Table 5.1. Examples of programs for multiple sequence alignment

Name | Uses | Reference

Global alignments including progressive

CLUSTALW | standard progressive alignment | Thompson et al. (1994a, 1997)
CLUSTALX (graphical interface) | most useful for similar sequences | Higgins et al. (1996)
MAFFT: rapid multiple sequence alignment based on Fourier transform (progressive and iterative programs) | fast, accurate msa alignments with novel scoring systems | Katoh et al. (2002)
MAVID for progressive msa of genome sequences | progressive alignment method for large numbers of DNA sequences with viewer | Bray and Pachter (2003)
MSA | optimal alignment using dynamic programming; limited to few/short sequences | Lipman et al. (1989); Gupta et al. (1995)
MULTIPIPMAKER | produces percent identity plot of multiple DNA sequences | Schwartz et al. (2003)
POA | fast, accurate alignment of large numbers of sequences (ESTs) by partial-order graphs | Lee et al. (2002)
PRALINE | versatile tool kit for producing msa's by different strategies | Heringa (1999); Simossis and Heringa (2003)
T-COFFEE | uses CLUSTALW method but with pair-wise alignments to increase accuracy; flexible | Poirot et al. (2003)

Iterative and other methods

DIALIGN | segment alignment; very accurate msa method for DNA and protein sequences; aligns based on matching segments without gap penalties | Morgenstern et al. (1998)
PRRP | progressive global alignment method repeatedly improves msa; produced by progressive alignment using command line options | Gotoh (1996)
SAGA | genetic algorithm; user intense method based on biologically relevant method | Notredame and Higgins (1996)

Local alignments of proteins

Aligned Segment Statistical Evaluation Tool (Asset) | sophisticated pattern-finding and statistical analysis method (command line) | Neuwald and Green (1994)
BLOCKS Web site | finds blocks (ungapped domains) by pattern search or Gibbs sampling | Henikoff and Henikoff (1991, 1992)
eMOTIF Web server | useful analysis of protein families to find most significant patterns in families | Nevill-Manning et al. (1998)
GIBBS, the Gibbs sampler statistical method | finds patterns in unaligned sequences by statistical method (command line) | Lawrence et al. (1993); Liu et al. (1995); Neuwald et al. (1995)
HMMER hidden Markov model software | tools for producing a profile hidden Markov model to represent an msa | Eddy (1998)
MACAW, a workbench for multiple alignment construction and analysis | aligner/editor for locating and adjusting local alignment blocks on PC | Schuler et al. (1991)
MEME Web site, expectation maximization method | locates localized sequence blocks "motifs" by statistical method | Bailey and Elkan (1995); Grundy et al. (1996, 1997); Bailey and Gribskov (1998)
Profile analysis at UCSD | produces a sequence profile from an msa | Gribskov and Veretnik (1996)
SAM hidden Markov model Web site | produces an HMM for an msa | Krogh et al. (1994); Hughey and Krogh (1996)


To locate Web sites of programs specified in the problems, perform a Web search using the program name. Instructors may also wish to set up local copies of these programs, which is best done on a UNIX or Linux server.

1. Practice using the CLUSTALW program to align the set of proteins in the RAD51-RECA group. These proteins all promote homologous DNA strand interactions during genetic recombination between DNA molecules. The sequences may be retrieved from the SwissProt server (perform Web search to find link to SwissProt) by their accession numbers in FASTA format: P25454, P25453, P03017, P48295. Use a simple text editor to make a FASTA multiple sequence by catenating these individual sequence files into one FASTA msa file (see p. 53).

a. Locate a CLUSTALW Web site. This program is available for PCs and also on a Web site at Baylor College of Medicine (BCM) searchlauncher site.

b. Copy and paste the catenated FASTA sequence file into the CLUSTALW data window.

c. Use the default alignment conditions provided by the program.

d. Note the two kinds of msa output formats. One is the align format with numbers, and the second is the FASTA format with the aligned sequences joined end to end in FASTA format, with gaps in each sequence corresponding to the alignment.

e. Save this file for later reference.
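Joining the individual FASTA files into one multiple-sequence FASTA file can also be scripted instead of done in a text editor. The following is a minimal sketch; the function name `concatenate_fasta` and the file names are hypothetical, and the demonstration uses two toy records rather than the real protein sequences.

```python
from pathlib import Path
import tempfile

def concatenate_fasta(paths, out_path):
    """Write the records from several FASTA files into one file,
    one after another, each starting on its own '>' header line."""
    records = []
    for path in paths:
        text = Path(path).read_text().strip()
        if not text.startswith(">"):
            raise ValueError(f"{path} does not look like a FASTA file")
        records.append(text)
    Path(out_path).write_text("\n".join(records) + "\n")
    return len(records)

# Demonstration with toy records written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "a.fasta")
    second = Path(tmp, "b.fasta")
    first.write_text(">seqA\nMKV\n")
    second.write_text(">seqB\nMRT")  # no trailing newline, a common cause of fused records
    combined = Path(tmp, "all.fasta")
    n_records = concatenate_fasta([first, second], combined)
    merged = combined.read_text()

print(merged)
```

Stripping each file before joining guards against the most common problem with manual concatenation, a header line that ends up attached to the end of the previous sequence.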

2. Go to the BAliBASE Web site and retrieve the SwissProt accession numbers of the 1csy (SH2) group of proteins that align in the 20-40% identity, reference 1 range.

a. Retrieve these sequences from SwissProt and, using a simple text editor, place them in the FASTA msa format. Note that the BAliBASE alignments are based on known structural alignments and therefore are a test of the ability of msa programs to provide an msa that is structurally correct.

b. Try to align these proteins by searching for the POA, DIALIGN, and CLUSTALW Web sites and pasting the sequences into the program sequence window.

c. Compare the alignments to the correct ones on the BAliBASE site and note which program, if any, does the best job.

3. When a global msa can be made, one can pick out the most conserved regions (motifs), make a scoring matrix, and search for other sequences that have this same motif. The matrix will take into account the variation found in the sequences. We will make a position-specific scoring matrix (PSSM, also called a scoring matrix, or weight matrix) by hand corresponding to a short msa and then use the matrix to scan a sequence. Here is a table showing the frequency of each base in an alignment that is four bases long:

a. Assuming that the background frequency is 0.25 for each base, calculate a log odds score for each table position; i.e., log to the base 2 of the ratio of each observed value to the expected frequency.

b. Align the matrix with each position in the sequence TGAGCTAA starting at position 1, 2, etc., and calculate the log odds score for the matrix to match that position.

c. Now convert the alignment scores to ODDS scores, sum them, and calculate the probability of the best matching position.

4. In question 3, we assumed that we already have a global alignment of a set of sequences so that a scoring matrix could be made from the alignment. Although we may know that a set of sequences has the same function, and thus should align, the sequences may vary so much that it is difficult to align them globally. In this case, we have to resort to a statistical analysis to find conserved patterns. The following problem goes through the first few steps required to find the best alignment by a statistical method. Students will need to study first the example of the expectation maximization algorithm in the text.

Analyze the following ten DNA sequences by the expectation maximization algorithm. Assume that the background base frequencies are each 0.25 and that the middle three positions are a motif. The size of the motif is a guess that is based on a molecular model. The alignment of the sequences is also a guess.

seq1 C CAG A
seq2 G TTA A
seq3 G TAC C
seq4 T TAT T
seq5 C AGA T
seq6 T TTT G
seq7 A TAC T
seq8 C TAT G
seq9 A GCT C
seq10 G TAG A

a. To start the PSSM, make a table with three columns (position in motif) and four rows (1 for each base).

b. Calculate the observed frequency of each base at each of the three middle positions in the alignment.

c. Using the frequencies in the column tables, and the background frequencies, calculate the odds likelihood of finding the motif at each of the possible locations in sequence 5.

d. Calculate the probability of finding the motif at each position in sequence 5.

e. Calculate what change will be made to the base count in each column of the motif table as a result of matching the motif to the first position in sequence 5. This is usually a fractional number of one base.

f. What other steps are taken to update or maximize the table values?
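Steps a through e can be sketched as a single pass of the expectation step. This is an illustration under the problem's own assumptions (background frequency 0.25 for every base, a three-base motif initially guessed to occupy the middle positions); the variable and function names are hypothetical, and no pseudocounts are used because none of the columns has a zero count for this data set.

```python
BASES = "ACGT"
seqs = ["CCAGA", "GTTAA", "GTACC", "TTATT", "CAGAT",
        "TTTTG", "ATACT", "CTATG", "AGCTC", "GTAGA"]
WIDTH = 3    # guessed motif width
START = 1    # guessed motif start (0-based), i.e. the middle three bases
BG = 0.25    # background frequency of each base

# Steps a and b: count and frequency tables, one dict per motif column.
counts = [{b: 0 for b in BASES} for _ in range(WIDTH)]
for s in seqs:
    for k in range(WIDTH):
        counts[k][s[START + k]] += 1
freqs = [{b: col[b] / len(seqs) for b in BASES} for col in counts]

def site_odds(seq, pos):
    """Odds that the motif starts at pos; flanking bases contribute a factor of 1."""
    odds = 1.0
    for k in range(WIDTH):
        odds *= freqs[k][seq[pos + k]] / BG
    return odds

# Steps c and d: odds and probability of each possible location in sequence 5.
target = seqs[4]
odds = [site_odds(target, p) for p in range(len(target) - WIDTH + 1)]
probs = [o / sum(odds) for o in odds]

# Step e: fractional base counts contributed by the first location.
update = [{b: 0.0 for b in BASES} for _ in range(WIDTH)]
for k in range(WIDTH):
    update[k][target[k]] += probs[0]

print(odds, probs)
```

In a full expectation maximization run the same weighted counts would be accumulated for every location in every sequence before the table is recalculated and the cycle repeated.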

5. MEME is a server that will take as input a set of sequences and find alignment by the expectation maximization method.

a. Paste the same unaligned RAD51-RECA sequences (problem 1) into the sequence window and use the defaults provided by the program. Students will need to provide their own E-mail address to receive results.

b. Examine the results and note how many conserved regions were found.

c. Save these results for later analysis.

6. A simplified hidden Markov Model (HMM) is shown below. (Red square, match state; green diamond, insert state; blue circle, delete state—probability of 1; arrows, probability of going from one state to the next.)

a. Calculate the probability of the sequence TAG by following a path through the model starting at Begin, going through each of the three match states (red squares), and ending at End.

b. Repeat step a for a path that, starting at Begin, goes first to the first insert state (green diamond), then to a match state (red square), then to a delete state (blue circle, probability 1 for any character in this state), then to a match state, and finishes at End.

c. Which of the two paths is the more probable one, and what is the ratio of the probability of the higher to the lower one? The highest-scoring path is the best alignment of the sequence with the model.

d. To improve the model, we keep adjusting the scores for the states and transition probabilities by aligning additional sequences with the model using an HMM adaptation of the expectation maximization algorithm. In the expectation step, we calculate all of the possible paths through the model, sum the scores, and then calculate the probability of each path. Each state and transition probability is then updated by the maximization step of the algorithm to make the model better predict the new sequence.

For this example, suppose that the model has been made from 30 sequences, and that the alignment of TAG in step a has a probability of 1.0; i.e., this path is overwhelmingly the best of all possible paths through the model. What would be the new fractions in the first match state? (Note that only a fraction of the sequences originally passed through the first state—think carefully about how many actually did.) Similarly, what would be the new values for the transition probabilities from Begin to this first match state, assuming that 0.7 of the 30 sequences followed the path from Begin to the first match state? (Hint: Updating the match state frequencies should be done by going back to the raw base numbers in each column. Similarly, updating the transition probability should be done by counting the number of sequences that would have followed the path from Begin to the first match state.)

e. Change all the values in each of the states to log odds scores, assuming that the frequency of each base is 0.25. Also change the transition probabilities to log odds, i.e., log to base 2 of the ratio of observed transition probability to background probability. (Note: Transition background is an equal probability of making a transition to each subsequent state and will be calculated by dividing 1 by the number of possible transitions from each state; i.e., the background probability will be one of 1/2 = 0.5, 1/3 = 0.33, or 1/1.) Now calculate the probability in step a as a log odds score.
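The calculations in steps a, d, and e follow a simple pattern that can be sketched in code. Because the model figure is not reproduced here, every emission probability, transition probability, and count of outgoing transitions below is a hypothetical stand-in; only the 30 sequences and the fraction 0.7 come from the problem text.

```python
import math

# Hypothetical model values for a path through three match states.
emissions = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},   # match state 1
    {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.1},   # match state 2
    {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},   # match state 3
]
transitions = [0.7, 0.8, 0.8, 0.9]   # Begin->M1, M1->M2, M2->M3, M3->End
n_choices = [2, 3, 3, 2]             # assumed number of possible transitions at each step

def path_probability(seq):
    """Product of transition and emission probabilities along the match path."""
    p = transitions[0]
    for k, base in enumerate(seq):
        p *= emissions[k][base] * transitions[k + 1]
    return p

def path_log_odds(seq, base_bg=0.25):
    """Same path scored in bits; transition background is 1 / number of choices."""
    score = math.log2(transitions[0] * n_choices[0])
    for k, base in enumerate(seq):
        score += math.log2(emissions[k][base] / base_bg)
        score += math.log2(transitions[k + 1] * n_choices[k + 1])
    return score

print(path_probability("TAG"), path_log_odds("TAG"))

# Step d, transition update from raw counts: 0.7 of 30 sequences used
# Begin->M1, and the new sequence adds one more to both totals.
updated_begin_to_m1 = (0.7 * 30 + 1) / (30 + 1)
print(updated_begin_to_m1)
```

Working in log odds turns the products into sums, which is why the scores for states and transitions along a path are added rather than multiplied.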

7. Analyze the following ten DNA sequences by the Gibbs sampling algorithm.

seq1 C CAG A
seq2 G TTA A
seq3 G TAC C
seq4 T TAT T
seq5 C AGA T
seq6 T TTT G
seq7 A TAC T
seq8 C TAT G
seq9 A GCT C
seq10 G TAG A

a. Assuming that the background base frequencies are 0.25, calculate a log odds matrix for the central three positions.

b. Assuming that another sequence G TTT G is the left-out sequence, slide the log odds matrix along the left-out sequence and find the log odds score at each of three possible positions.

c. Change each log odds score to an odds score and sum the odds scores. Calculate the probability of a match at each position in the left-out sequence. (Odds score = 2 raised to the power of the log odds score.)

d. How do we choose a possible location for the motif in the left-out sequence?
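One cycle of the procedure in this problem can be sketched as follows. The sketch builds the log odds matrix from the central three positions of the ten sequences, slides it along the left-out sequence, and then draws a location at random in proportion to the resulting probabilities, which is one common way a Gibbs sampler picks the next motif position. The names are illustrative, the random seed is arbitrary, and a real implementation would add pseudocounts to avoid taking the log of zero.

```python
import math
import random

BASES = "ACGT"
BG = 0.25
# Central three positions of the ten sequences in the problem.
motif_sites = ["CAG", "TTA", "TAC", "TAT", "AGA", "TTT", "TAC", "TAT", "GCT", "TAG"]

def log_odds_matrix(sites):
    """One dict per motif column: log2(observed frequency / background)."""
    matrix = []
    for k in range(len(sites[0])):
        column = [s[k] for s in sites]
        matrix.append({b: math.log2(column.count(b) / len(sites) / BG) for b in BASES})
    return matrix

def scan(matrix, seq):
    """Log odds score of the matrix placed at each possible start in seq."""
    width = len(matrix)
    return [sum(matrix[k][seq[p + k]] for k in range(width))
            for p in range(len(seq) - width + 1)]

matrix = log_odds_matrix(motif_sites)
scores = scan(matrix, "GTTTG")          # the left-out sequence
odds = [2 ** s for s in scores]         # odds score = 2 ** log odds score
probs = [o / sum(odds) for o in odds]

rng = random.Random(1)
chosen = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print(scores, probs, chosen)
```

Sampling rather than always taking the best-scoring position lets the search occasionally move to a less likely location, which helps it escape a poor initial guess.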

8. This problem explores the information content of a scoring matrix by the relative entropy method (ignores background frequencies). Read the notes on information content of sequences on page 213 before trying this problem.

a. Calculate the entropy or uncertainty (Hc) for each column and for the entire matrix.

b. Calculate the decrease in uncertainty or amount of information (Rc) for column 1 due to these data (for DNA, Rc = 2 – Hc and for proteins, Rc = 4.32 – Hc).

c. Calculate the amount that the uncertainty is reduced (or the amount of information contributed) for each base in column 1.
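The quantities in this problem can be computed with a few lines of code. The sketch below uses the standard uncertainty formula Hc = −Σ p·log2(p) and the relation Rc = 2 − Hc for DNA given above. The column frequencies are hypothetical because the matrix for the problem is not reproduced here, and splitting Rc among the bases in proportion to their frequencies is one common convention (the one used for sequence logos), assumed here rather than taken from the text.

```python
import math

def column_entropy(freqs):
    """Uncertainty Hc of one column in bits; zero-frequency bases contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)

# Hypothetical base frequencies for column 1 of a DNA scoring matrix.
column1 = {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}

Hc = column_entropy(column1)
Rc = 2 - Hc                                              # information gained, max 2 bits for DNA
contribution = {b: p * Rc for b, p in column1.items()}   # share of Rc per base

print(round(Hc, 3), round(Rc, 3), contribution)
```

A column with equal frequencies for all four bases gives Hc = 2 and Rc = 0, while a column containing a single base gives Hc = 0 and Rc = 2, the two extremes between which every real column falls.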

Chapter 6: Database Searching for Similar Sequences

Similarity searches in sequence databases have become a mainstay of bioinformatics, and large sequencing projects in which all of the genomic DNA of an organism is obtained have become quite commonplace. Similarity searches can also be remarkably useful for finding the function of genes whose sequences have been determined in the laboratory but for which there is no biological information. In these searches, the sequence of the gene of interest is compared to every sequence in a sequence database, and the similar ones are identified. Alignments with the best-matching sequences are shown and scored. If a query sequence can be readily aligned to a database sequence of known function, structure, or biochemical activity, the query sequence is predicted to have similar properties. The strength of these predictions depends on the quality of the alignment between the sequences. As a rough rule, for searches of a protein sequence database with a query protein sequence, if more than one-half of the amino acid sequence is identical in the sequence alignments, the prediction is very strong. For searches of a nucleic acid sequence database with a nucleic acid query sequence, the sequences should be translated if they encode proteins because related protein sequences are more readily identified. If only nucleic acid sequences are compared, then most of the sequences should be identical with few gaps for a strong prediction. As the degree of similarity decreases, confidence in the prediction also decreases. The programs used for these database searches provide statistical evaluations that serve as a guide for evaluation of the alignment scores.

Previous chapters have described methods for aligning sequences or for finding common patterns within sequences. The purpose of making alignments is to discover whether sequences are homologous, i.e., likely to be derived from a common ancestor sequence. If a strong homology relationship can be established, the sequences are likely to have maintained the same function as they diverged from each other during evolution. If an alignment can be found that would rarely be observed between random sequences, the sequences can be predicted to be related with a high degree of confidence. The presence of one or more conserved patterns in a group of sequences is also useful for establishing evolutionary and structure–function relationships among them.

The methods used for establishing sequence relationships in database searches are summarized in Table 6.1. In addition to standard searches of a sequence database with a query sequence, a matrix representation of a family of related protein sequences may be used to search a sequence database for additional proteins that are in the same family, or a query protein sequence may be searched for the presence of sequence patterns that represent a protein family to determine whether the sequence belongs to that particular family. Genomic DNA sequences may also be searched for consensus regulatory patterns such as those representing transcription-factor-binding sites, promoter recognition signals, or mRNA splicing sites; these types of searches are discussed in Chapter 9.

Table 6.1. Types of database searches for proteins

Columns: Type of search | Target database | Method | Type of query data | Examples of programs used, location (also see Tables 6.2, 6.4, 6.7, and 6.8) | Results of database search

A. Sequence similarity search with query sequence
Target database: protein sequence database (or genomic sequences [a])
Method: search for database sequence that can be aligned with query sequence
Type of query data: single sequence, e.g., DAHQSNGA
Examples of programs used, location: FASTA (TFASTA [a]), SSEARCH, http://fasta.bioch.virginia.edu/fasta/; BLASTP (TBLASTN [a]), http://www.ncbi.nlm.nih.gov/BLAST/; WU-BLAST, http://blast.wustl.edu/
Results of database search: list of database sequences having the most significant similarity scores

B. Alignment search with profile (scoring matrix [b,d] with gap penalties)
Target database: protein sequence database
Method: prepare profile from a multiple sequence alignment (Profilemake) and align profile with database sequence
Type of query data: profile representing gapped multiple sequence alignment, e.g.,
D-HQSNGA
ESHQ-YTM
EAHQSN-L
EGVQSYSL
Examples of programs used, location: Profilesearch, ftp.sdsc.edu/pub/sdsc/biology
Results of database search: list of database sequences that can be aligned with the profile

C. Search with position-specific scoring matrix [c,d] (PSSM) representing ungapped sequence alignment (BLOCK)
Method: prepare PSSM from ungapped region of multiple sequence alignment or search for patterns of same length in unaligned sequences, [c] then use for database search
Type of query data: PSSM representing ungapped alignment, e.g.,
DAHQSN
ESHQSY
EAHQSN
EGVQSY
Examples of programs used, location: MAST, http://meme.sdsc.edu/meme/website/mast.html
Results of database search: list of database sequences with one or more patterns represented by PSSM but not necessarily in the same order

D. Iterative alignment search for similar sequences that starts with a query sequence, builds a gapped multiple alignment, and then uses the alignment to augment the search [d]
Method: uses initial matches to query sequence to build a type of scoring matrix and searches for additional matches to the matrix by an iterative search method [d]
Type of query data: builds matches to query sequence, e.g., DAHQSNGA; iteration 1: H-SNGA EAHQSN -L; further iterations
Examples of programs used, location: PSI-BLAST, http://www.ncbi.nlm.nih.gov/BLAST/
Results of database search: PSI-BLAST finds a set of sequences related to each other by the presence of common patterns (not every sequence may have same patterns)

E. Search query sequence for patterns representative of protein families [e]
Target database: database of patterns found in protein families
Method: search for patterns represented by scoring matrix or hidden Markov model (profile HMM) [e]
Type of query data: single sequence, e.g., DAHQSNGA
Examples of programs used, location: Prosite, http://www.expasy.ch/prosite; INTERPRO, http://www.ebi.ac.uk/interpro; Pfam, http://www.sanger.ac.uk/Pfam; CDD/CDART, http://www.ncbi.nlm.nih.gov/BLAST (also see Table 10.5)
Results of database search: list of sequence patterns found in query sequence

[a] Searches of this type include the use of programs that search nucleic acid databases for matches to a query protein sequence by automatically translating the nucleic acid sequences in all six possible reading frames (TFASTA, TBLASTN). These searches may be useful when only genomic sequences or partial cDNA sequences (expressed sequence tag or EST sequences) of an organism are available. Genomic sequences that encode proteins may also have been found by gene prediction programs (Chapter 9). The predicted protein is then usually entered in the protein sequence databases. Matches to these predicted proteins may be found by searches of the protein sequence databases. These gene predictions are error-prone (see Chapter 9).

[b] A multiple sequence alignment that includes gaps may be represented by a profile, a type of scoring matrix discussed in Chapter 5, page 189. The consecutive rows of the matrix represent columns of the multiple sequence alignment, and the column values represent the distribution of amino acids in each column of the alignment. The profile includes extra columns with gap opening and extension penalties. The profile is aligned to a sequence by sliding the profile along the sequence and finding the position with the best alignment score by means of a dynamic programming method. The alignment may include gaps in the database sequence. The best scoring alignments are with database sequences that have a pattern similar to that represented by the profile.

[c] The position-specific scoring matrix (PSSM), or weight matrix as it is sometimes called, is a representation of a multiple sequence alignment that has no gaps (a BLOCK). The matrix may be made from a multiple sequence alignment or by searching for patterns of the same length in a set of sequences using pattern-finding or statistical methods, e.g., expectation maximization, Gibbs sampling, ASSET, and by aligning these patterns, as discussed in Chapter 5. The consecutive columns of the matrix represent columns of the aligned patterns and the rows represent the distribution of amino acids in each column of the alignment. The PSSM columns include log odds scores for evaluating matches with a target sequence. The matrix is used to search a sequence for comparable patterns by sliding the matrix along the sequence and, at each position in the sequence, evaluating the match at each column position using the matrix values for that column. The log odds scores for each column are added to obtain a log odds score for the alignment to that sequence position. High log odds scores represent a significant match.

[d] Using a scoring matrix instead of a single query sequence can enhance a database search because the matrix represents the greater amount of sequence variation found in a multiple sequence alignment. Amino acid representation in each column of the alignment is also reflected in the matrix scores for that column; the more common an amino acid, the higher the score for a match to that amino acid. Note also that the matrix does not store any information about correlations between sequence positions. Thus, if two amino acids are commonly found together in the sequences at two positions of the alignment, these will each be independently scored by the matrix, but there will be no information as to their co-occurrence (or covariation) in the sequences. Since this type of information is missing, the matrix can give high scores to patterns that include new combinations of amino acids not found in the original set of sequences. Scoring covariation in sequence positions is discussed further in Chapters 8, 9, and 10.

[e] Pattern databases are described in Chapter 10.

Table 6.2. Web resources for performing database searches with a simple query sequence

Server/program | Web address or FTP site | References

BLAST (Basic Local Alignment Search Tool) [a] | http://www.ncbi.nlm.nih.gov/BLAST; FTP to ftp.ncbi.nih.gov/blast/executables | Altschul et al. (1990, 1997); Altschul and Gish (1996)
WU-BLAST [b] | sites that run WU-BLAST 2.0 are listed at http://blast.wustl.edu; programs obtainable at http://blast.wustl.edu/blast/executables with licensing agreement | Altschul et al. (1990, 1997); Altschul and Gish (1996)
FASTA [c] | http://fasta.bioch.virginia.edu/fasta; FTP to ftp.virginia.edu/pub/fasta | Pearson (1995, 1996, 1998, 2000)
BCM Search Launcher (Baylor College of Medicine) | http://searchlauncher.bcm.tmc.edu/ | see Web site
TIGR gene indices search | http://www.tigr.org | see Web site

There are also many other BLAST and FASTA servers on the Web, including ones for searches in specific organisms (see Chapter 11). The TIGR site is given as an example of such a site.

[a] A stand-alone BLAST server may also be established on a local machine running Windows, UNIX, or Mac OS.

[b] Executable programs for UNIX platforms are available from the FTP site.

[c] Executable programs that run on PC, Macintosh, or UNIX platforms are available from the FTP site. The FASTA package also includes programs for performing pair-wise sequence alignments and for a statistical analysis of alignment scores (see Chapter 3). A number of Web sites offer FASTA database search, including the FASTA server and the BCM Search Launcher.

Table 6.5. Databases available on BLAST Web server

Database/Description

A. Peptide Sequence Databases

nr

All non-redundant GenBank CDS translations+RefSeq Proteins+PDB+SwissProt+PIR+PRF

swissprot

Last major release of the SwissProt protein sequence database (no updates)

pat

Proteins from the Patent division of GenPept

Yeast

Yeast (Saccharomyces cerevisiae) genomic CDS translations

ecoli

Escherichia coli genomic CDS translations

pdb

Sequences derived from the three-dimensional structure from Brookhaven Protein Data Bank

Drosophila genome

Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP)

month

All new or revised GenBank CDS translation+PDB+SwissProt+PIR+PRF released in the last 30 days

B. Nucleotide Sequence Databases

nr

All GenBank+RefSeq Nucleotides+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1, or 2 HTGS sequences); no longer "non-redundant"

est

Database of GenBank+EMBL+DDBJ sequences from EST Divisions

est_human

Human subset of GenBank+EMBL+DDBJ sequences from EST Divisions

est_mouse

Mouse subset of GenBank+EMBL+DDBJ sequences from EST Divisions

est_others

Non-Mouse, non-Human sequences of GenBank+EMBL+DDBJ sequences from EST Divisions

gss

Genome survey sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences

htgs

Unfinished high-throughput genomic sequences: phase 0, 1, and 2 (finished, phase 3 HTG sequences are in nr)

pat

Nucleotides from the Patent division of GenBank

yeast

Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences

mito

Database of mitochondrial sequences

vector

Vector subset of GenBank(R), NCBI, in ftp://ftp.ncbi.nih.gov/blast/db

E. coli

Escherichia coli genomic nucleotide sequences

pdb

Sequences derived from the three-dimensional structure from Brookhaven Protein Data Bank

Drosophila genome

Drosophila genome provided by Celera and Berkeley Drosophila Genome Project (BDGP)

month

All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days

alu

Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ftp.ncbi.nih.gov (under the /pub/jmc/alu directory). See "Alu alert" by Claverie and Makalowski (1994)

dbsts

Database of GenBank+EMBL+DDBJ sequences from STS Divisions

chromosome

Searches complete genomes, complete chromosome, or contigs from the NCBI Reference Sequence project

C. Human Genome Blast Databases

genome

Human genomic contig sequences with NT_#### accessions

mrna

Human RefSeq mrna with NM_#### or XM_#### accessions

protein

Human RefSeq proteins with NP_#### or XP_#### accessions

gscan mrna

Predicted mRNA sequences generated by running GenomeScan program on human genomic contigs

gscan protein

CDS translations from gscan mrna set

D. CDD Search

Compares protein sequences to the Conserved Domain Database (CDD). The CDD is a collection of functional and/or structural domains derived from two popular collections, Smart and Pfam, plus contributions from colleagues at NCBI. For more information, see the CDD homepage.

Source: http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html#protein_databases

Table 6.7. Examples of guest Web sites for performing a database search based on the Smith–Waterman dynamic programming algorithm

Server/program

Reference

Web address

BCM Search Launcher (with programming links to several servers)

Baylor College of Medicine

http://searchlauncher.bcm.tmc.edu/seq-search/protein-search.html

MPsrch (a)

EMBL/EBI

http://www.ebi.ac.uk/MPsrch/

Scanps

G. Barton, European Bioinformatics Institute

http://www.compbio.dundee.ac.uk; http://www.ebi.ac.uk/scanps

Swat (b)

Phil Green, University of Washington

http://www.genome.washington.edu/UWGC/analysistools/Swat.cfm

SWsrch

DNA Databank of Japan

http://www.dna.affrc.go.jp/search/

A comprehensive list of servers for these types of analyses may be found at http://restools.sdsc.edu/biotools/biotools1.html

a MPsearch is an extremely fast implementation of the Smith–Waterman dynamic programming algorithm by J.F. Collins and S. Sturrock, Biocomputing Resource Unit, the University of Edinburgh, distribution rights by Oxford Molecular Ltd. Some versions of the MPsearch algorithm at this site use the same penalty for all gaps, others use gap opening and extension penalties. The former is designed to find similar sequences in which gaps are less important in the alignment, the latter the more distant sequence alignments. Current versions of these programs rank the sequences found by two kinds of scoring systems. A statistical analysis is performed but the scores do not appear to be length-normalized. Hence, the sensitivity of the program may not exceed that shown by FASTA (Pearson 1996).

b Includes Smith–Waterman and Needleman–Wunsch search algorithms. Calculates statistical significance using extreme value statistics (like FASTA and BLAST).

Table 6.8. Programs and Web sites for database similarity searches with a regular expression, motif, block, or profile

Program

Database searched

Source or location of analysis

1. Regular expression and motifs (a)

EMOTIF Scan

SwissProt and Genpept

http://dna.stanford.edu/emotif/emotif-scan.html

Prosite patterns

SwissProt and TrEMBL

http://au.expasy.org/tools/scanprosite/

ISREC pattern-finding service

SwissProt and non-redundant EMBL database

http://hits.isb-sib.ch/cgi-bin/hits_patsearch/

fpat

PDB, SwissProt, Genpept

http://www.ibc.wustl.edu/fpat/ (Web site not currently active)

PHI-BLAST

BLAST databases

http://www.ncbi.nlm.nih.gov/

MOTIF

SwissProt, PDB, PIR, PRF, Genes

http://motif.genome.jp/

2. Blocks

BLOCKS (b)

most databases

http://blocks.fhcrc.org/blocks/make_blocks.html

MAST (c)

most databases

http://meme.sdsc.edu/meme/website/

BLIMPS (d)

locally available databases

anonymous FTP ftp.ncbi.nih.gov/repository/blocks/unix/blimps

Probe (e)

BLAST databases

anonymous FTP ftp.ncbi.nih.gov/pub/neuwald/probe1.0

Genefind (f)

PIR

http://pir.georgetown.edu/gfserver

3. Profiles

Profilesearch (g)

locally available databases

anonymous FTP ftp.sdsc.edu/pub/sdsc/biology/profile_programs

Profile-SS (h)

most databases

http://www.psc.edu/general/software/packages/profiless/profiless.html

These resources search for similarity to a sequence pattern. Resources for producing patterns from aligned or unaligned sequences are described in Chapter 4. An individual sequence may also be searched for matches to a motif database, and this procedure is discussed in Chapter 9. Additional resources for database searching are listed in Bork and Gibson (1996). A statistical estimate of finding the site by random chance in a sequence is sometimes but not always given. Reading how these estimates are derived by the individual programs is strongly recommended. The statistical theory for sequence alignments described in Chapter 3 can be used in these types of analyses (Bailey and Gribskov 1998) but may not always be implemented.

a The Scan Web page shows how to compile a regular expression. Mismatches with the expression are allowed. The Prosite form of a regular expression is at http://www.expasy.ch/tools/scnpsit3.html. PHI-BLAST is a BLAST derivative that searches a given sequence for a regular expression and then searches iteratively for other sequences matching the pattern found, at each iteration including the newly found sequences to expand the search.

b The BLOCKS server will send a new block analysis to the MAST server.

c MAST is the Motif Alignment and Search Tool (Bailey and Gribskov 1998). Available protein databases are similar to those on the BLAST server. It is also possible to search translated nucleotide sequence databases.

d BLIMPS will prepare a PSSM from a motif and perform a database search with the PSSM (see README file on FTP site).

e PROBE (Neuwald et al. 1997) is described in the text.

f The GENEFIND site has the program MOTIFIND for Motif Identification by Neural Design (Wu et al. 1996). This motif finder uses a neural network design to generate motifs and a search strategy for those motifs. The method compared favorably in sensitivity and selectivity with others such as BLIMPS and Profilesearch and is, in addition, very fast. Neural networks are described in Chapters 8 and 9.

g Profilesearch is one of a set of programs in the GCG suite (see text). It is important to review the parameters of the program, which if used inappropriately can lead to incomplete or low-efficiency searches (Bork and Gibson 1996).

h A version of Profilesearch running at the University of Pittsburgh Supercomputing Center.
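As a rough illustration of footnote a, a Prosite-style pattern can be translated into an ordinary regular expression and used to scan a sequence. The sketch below is a simplified translation written for this note, not the code used by any of the servers in Table 6.8: the function name `prosite_to_regex` is hypothetical, only the common syntax elements are handled (`x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors), and no mismatches are allowed. The test sequence is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a Prosite-style pattern into a Python regular expression.

    Simplified sketch (hypothetical helper): handles x, [..], {..},
    (n) / (n,m) repeats, and the < and > terminal anchors only.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {ABC} means "any residue except A, B, or C"
        element = element.replace("{", "[^").replace("}", "]")
        # lowercase x means "any residue"
        element = element.replace("x", ".")
        # (n) or (n,m) are repeat counts
        element = element.replace("(", "{").replace(")", "}")
        # < and > anchor the pattern to the sequence ends
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


# The ATP/GTP-binding site motif A (P-loop) written in Prosite syntax.
P_LOOP = "[AG]-x(4)-G-K-[ST]"

if __name__ == "__main__":
    regex = re.compile(prosite_to_regex(P_LOOP))
    made_up_sequence = "MAIDGPESSGKTTLA"  # invented test string
    for match in regex.finditer(made_up_sequence):
        print(match.start() + 1, match.group())  # 1-based start position
```

A search server would apply the compiled expression to every database sequence in turn and report the positions of the matches.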


Web sites mentioned in the problems can be found in the chapter or by using a search engine. Sequences may be retrieved in FASTA format from Entrez or a protein sequence database such as PIR or SwissProt.

1. FASTA uses a lookup table as a rapid way to find common letters and words in the same order and of approximately the same separation in two sequences. Produce a lookup table for single amino acids in the following two protein sequences, and then explain how this information will be used to determine what the alignment should be.

sequence 1: ACNGTSCHQE

sequence 2: GCHCLSAGQD

Amino Acid    Position in Sequence 1    Position in Sequence 2    Offset Value*

*Offset value is defined as the value in sequence 1 less the value in sequence 2
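The lookup-table idea in Problem 1 can be sketched in a few lines of Python. This is an illustration of the principle only, not the FASTA implementation: the function names are hypothetical, the word length is fixed at one residue, and positions are counted from 1. Offsets shared by many residues point to a common diagonal, which is where an alignment is most likely to lie.

```python
from collections import Counter, defaultdict


def position_table(seq: str) -> dict:
    """Map each residue to the list of 1-based positions where it occurs."""
    table = defaultdict(list)
    for pos, residue in enumerate(seq, start=1):
        table[residue].append(pos)
    return table


def offset_counts(seq1: str, seq2: str) -> Counter:
    """Count how often each offset (position in seq1 minus position in seq2)
    occurs among residues present in both sequences."""
    table1, table2 = position_table(seq1), position_table(seq2)
    offsets = Counter()
    for residue in table1.keys() & table2.keys():
        for p1 in table1[residue]:
            for p2 in table2[residue]:
                offsets[p1 - p2] += 1
    return offsets


if __name__ == "__main__":
    counts = offset_counts("ACNGTSCHQE", "GCHCLSAGQD")
    for offset, n in sorted(counts.items()):
        print(f"offset {offset:+d}: {n} match(es)")
```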

2. Retrieve the protein sequence of the E. coli RecA protein from SwissProt or Entrez and then submit the sequence to the University of Virginia FASTA server. The PIR identifier of the query sequence is RQECA and the GenBank index is 72985 (search for gi|72985 in Entrez). An identifier number or else the sequence itself in FASTA format may be pasted into the sequence entry window using the "Enter query sequence" drop-down window. Search the database described as the NCBI Human proteins library and use the default search parameters provided by the program.

Answer the following questions:

a. Identify the name and gi (GenBank index) of the highest scoring sequence.

b. How many standard deviations above the mean is this score? Note that z´ is a normalized score, calculated as z´ = 50 + 10z, where z is the raw z score. This raw z score represents the number of standard deviations that a given score s is from the mean, calculated by z = (s - m)/σ, where m is the mean and σ is the standard deviation.

c. Using Equation 1 relating z score to probability of such a score between unrelated sequences, what is the probability of an alignment between unrelated sequences achieving this high a z score?

d. How many database sequences were searched? What is the expect value (E) for a search of this many sequences achieving a score as high as z?

e. By looking at the scores and E values from this search, what is the approximate value of z´ (z´ = 50 + 10z) that corresponds to an expect value of 0.02 (an approximate cutoff for significance)? How many sequences reached this high a score?

f. Is the alignment of the highest scoring sequence with RecA protein significant and why? How could the significance be further tested?

g. What biological information (protein structure and function) does this match suggest about the bacterial RecA protein and the human protein?

h. What was the lowest reported score in this search, and is this score significant?

i. What scoring matrix and gap penalties were used as default values by the FASTA program?
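The arithmetic behind parts b through e can be sketched as follows. Equation 1 is not reproduced in this section, so the code assumes it has the extreme value form usually quoted for FASTA z scores, P(Z > z) = 1 - exp(-e^(-1.2825z - 0.5772)); check this against the chapter text before relying on it. The function names and the example numbers are hypothetical.

```python
import math


def raw_z(z_prime: float) -> float:
    """Invert z' = 50 + 10z to recover the raw z score."""
    return (z_prime - 50.0) / 10.0


def evd_probability(z: float) -> float:
    """P(Z > z) for unrelated sequences.

    ASSUMPTION: Equation 1 is the extreme value form
    1 - exp(-exp(-1.2825*z - 0.5772)); expm1 keeps precision for large z.
    """
    return -math.expm1(-math.exp(-1.2825 * z - 0.5772))


def expect_value(z: float, n_sequences: int) -> float:
    """Expected number of unrelated sequences scoring at least z
    in a search of n_sequences database entries."""
    return n_sequences * evd_probability(z)


if __name__ == "__main__":
    z_prime = 120.0        # made-up normalized score
    library_size = 30000   # made-up number of database sequences
    z = raw_z(z_prime)
    print(z, evd_probability(z), expect_value(z, library_size))
```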

3. For protein database searches, the BLASTP algorithm first makes a list of three-letter words in the query sequence and then scores these words for matches with themselves and with all other possible words using the BLOSUM62 scoring matrix. The 50 highest-scoring matches are kept. Database sequences are then scanned for matches to these high-scoring words, and if such are found, a local alignment is made with the query sequence by dynamic programming. Use the BLOSUM62 scoring matrix in Figure 3.16, page 105. Note that the matrix values are in half-bit units.

a. Suppose that the three-letter word HFA is in the query sequence, what is the log odds score of a match of HFA with itself?

b. Scan through the table and find the highest-scoring match with H (say amino acid X). What would be the score for HFA in our query sequence matching XFA in the database sequence?

c. Scan again and find any worst-scoring match with H. What is the score for a match of HFA with YFA?

d. Repeat the last two questions for the second and third letters in HFA.

e. How many possible matches are there with HFA? (BLASTP uses approximately the best 50.)

f. How many words will be searched for, starting with a query sequence that is 300 amino acids long?
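The word-scoring step in Problem 3 can be sketched as below. Only the three BLOSUM62 diagonal entries needed for the word HFA are included here (H/H = 8, F/F = 6, A/A = 4, in half-bit units); the remaining entries should be read from the matrix in Figure 3.16. The function names and the partial `BLOSUM62_PARTIAL` dictionary are hypothetical helpers for illustration, not part of BLAST.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Partial matrix: only the diagonal entries for H, F, and A are filled in.
# Add further (residue, residue) pairs from Figure 3.16 as needed.
BLOSUM62_PARTIAL = {("H", "H"): 8, ("F", "F"): 6, ("A", "A"): 4}


def query_words(seq: str, k: int = 3) -> list:
    """Return every overlapping word of length k in the query sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_score(word1: str, word2: str, matrix: dict) -> int:
    """Sum the substitution scores position by position (no gaps)."""
    return sum(matrix[(a, b)] for a, b in zip(word1, word2))


if __name__ == "__main__":
    print(word_score("HFA", "HFA", BLOSUM62_PARTIAL))  # score in half-bits
    print(len(AMINO_ACIDS) ** 3)                       # possible 3-letter words
    print(len(query_words("A" * 300)))                 # words in a 300-residue query
```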

4. Run the E. coli RecA protein against the yeast genome on the BLAST server. Choose the BLASTP program and carefully review the various option windows on the page that comes up. Choose yeast as the genome database to be searched. Enter the RecA sequence in FASTA format or the PIR identifier into the input data window and indicate which choice was made in the small option window just above the input data window. Otherwise, use the default parameters provided by the program. You must wait in a queue for the results, then click on the format results window.

Answer the following questions:

a. In the diagram that comes up, click the mouse on the yeast sequence which best matches the RecA query sequence. Identify the name and gi (GenBank index) of the highest-scoring sequence and the score in bits.

b. What scoring matrix and gap penalties were used?

c. What values of K and λ were used for calculating the expect values (E) for the gapped alignment (note that there are two sets of these parameters, one for ungapped and one for gapped alignments)? Where do these values come from?

d. The score shown in the program output is in units of "normalized bits" = [(λ × raw score) - ln K] / ln 2. The raw score is shown in parentheses. What are the units of the raw score (those of the BLOSUM62 matrix)? Calculate the raw score in bits from the "normalized bits."

e. How many database sequences were searched?

f. Calculate the expect value E for a search of this many sequences achieving a score as high as that found in part 1. In the formula, be sure to use the effective lengths of the sequences given in the program output.

g. By looking at the scores and E values from this search, what is the approximate value of the alignment score in normalized bits that corresponds to an expect value E of 0.06 (close to an approximate cutoff of 0.02–0.05 for significance)? How many sequences reached this high a score?

h. Is the alignment of the highest-scoring sequence with RecA protein significant and why? What biological information (protein structure and function) does this match suggest about the bacterial RecA protein and the yeast protein?

i. What was the lowest reported score in this search and is this score significant?
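The conversions asked for in parts d and f follow directly from the formula in part d together with the standard relation E = m × n × 2^(-S'), where S' is the score in normalized bits and m and n are the effective lengths of the query and the database. The sketch below shows the arithmetic only; the function names are hypothetical, and the values of λ, K, and the effective lengths are placeholders that must be replaced by the numbers printed in the actual BLAST output.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized bits = (lambda * raw score - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def raw_from_bits(bits: float, lam: float, k: float) -> float:
    """Invert the bit-score formula to recover the raw score."""
    return (bits * math.log(2) + math.log(k)) / lam


def expect_value(bits: float, m_eff: float, n_eff: float) -> float:
    """E = m * n * 2^(-S'), using effective query and database lengths."""
    return m_eff * n_eff * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, k = 0.267, 0.041         # placeholder gapped parameters
    m_eff, n_eff = 300, 1_000_000  # placeholder effective lengths
    bits = bit_score(100, lam, k)
    print(bits, raw_from_bits(bits, lam, k), expect_value(bits, m_eff, n_eff))
```

The same E value is obtained from the raw score as K × m × n × e^(-λS), which is a useful cross-check on the parameters read from the output.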

5. PSI-BLAST is a version of the BLAST algorithm that uses the results from an initial search for similar protein sequences to construct a type of scoring matrix that can then be used for additional rounds of searches, called iterations. The variability found in each column of the scoring matrix allows additional sequences that have different combinations of amino acids in the sequence positions to be found. The algorithm provides a rapid but less precise search than other methods because the scoring matrix produced is only approximate and includes most of the original query sequence. A note of caution: The iterations can lead to more sequences being added that do not share a region in common with the original query sequence, but share a totally different region in some of the added sequences; i.e., these new sequences are not true family members but alien sequences. The process will stop when no more sequences are found. The user can control the number of sequences to be included at each iteration or else use the score cutoff recommended by the program. The method is often used to perform a rapid and preliminary search for members of a sequence family. The sequences found can then be multiply aligned by other better-defined methods.

Perform the following analysis and answer these questions:

We provide a protein sequence of a DNA polymerase called iota that replicates past sites of DNA damage and makes mutations. This is a mouse homolog (Entrez search for gi|6755274) of a yeast gene called RAD30. Submit the sequence to PSI-BLAST searching the nr (nonredundant) Genpro database. Use the given (default) options of the program. Repeat the search for an additional iteration using the cutoff scores recommended by the program.

a. How many matches were found above the cutoff score after the initial search? b. Using the Web links provided, identify some of the highest-scoring sequences. What

classes of organisms do the matched genes originate from? Is this sequence representative of a protein family found in just a few or many organisms?

c. How many additional matches were found after the first iteration, and do most appear to be the same type of function, e.g., DNA repair or replication?
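The kind of scoring matrix that PSI-BLAST rebuilds at each iteration can be illustrated with a much simplified position-specific scoring matrix. The sketch below is not the PSI-BLAST procedure: it assumes an ungapped block of equal-length sequences, weights every sequence equally, uses a fixed pseudocount, and takes a uniform background frequency of 1/20. The function name `build_pssm` and the toy block are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned: list, pseudocount: float = 1.0) -> list:
    """Return one {residue: log-odds score in bits} dict per alignment column.

    Simplifying assumptions: ungapped block, equal sequence weights,
    fixed pseudocount, uniform background frequency of 1/20.
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


if __name__ == "__main__":
    toy_block = ["ACD", "ACE", "ACD"]  # invented three-column block
    for i, column in enumerate(build_pssm(toy_block), start=1):
        best = max(column, key=column.get)
        print(i, best, round(column[best], 2))
```

Columns with mixed residues give positive scores to several amino acids, which is what lets later iterations pick up sequences that differ from the original query at those positions.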

6. MAST search with PSSMs obtained from MEME and BLOCKS alignments. We will use two Web sites that search for common patterns in a submitted group of protein sequences: the BLOCKS server at the Frederick Hutchinson Cancer Facility, University of Washington, and the MEME server at the University of California at San Diego supercomputing center. These sites provide examples of well-defined pattern analyses. A family of related sequences from a PSI-BLAST search should usually be subjected to further analysis by these other methods. These searches produce a log odds scoring matrix (position-specific scoring matrix or PSSM; see Chapter 5) that may then be used to search through other sequences for the same pattern. There is no provision for gaps. The MAST program, also at UCSD, searches every sequence in a protein sequence database for those sequences that have high-scoring matches to the patterns. The BLOCKS server has a number of very useful programs for sequence analysis and maintains a database of aligned sequence patterns from related sequences called the BLOCKS database. BLOCKS define a region of similarity that is a signature of a particular protein family. A family may be defined by one or more BLOCKS. A single sequence may be aligned with all of the existing BLOCKS in the database to determine whether the sequence carries any of the patterns represented by the database. The BLOCKS server searches sequentially through the sequences for common patterns and also uses the Gibbs sampler to locate patterns. MEME uses the expectation maximization algorithm to locate patterns.

These servers produce large volumes of output and MEME E-mails the results in Web page (HTML) format. A family of five related protein sequences that are repair proteins in the RecA-Rad51 family were analyzed for common patterns (search for gi|54866, gi|118683, gi|132224, gi|3914552, and gi|1350566 in Entrez). These proteins bind to single-stranded and double-stranded DNAs and promote base-pairing between the molecules that can lead to genetic recombination. Retrieve them in FASTA format and paste them together in series in the FASTA msa format (see Chapter 2, p. 53) using a simple text editor.

a. BLOCKS search: Perform a BLOCKS search of these protein sequences on the BLOCKS Web site and answer these questions:

i. How many blocks were found by the MOTIF program and by the Gibbs sampler, and approximately how long were they?

ii. Were any of the patterns found by the MOTIF and Gibbs sampling the same ones?

iii. Are the patterns convincing; i.e., do at least some of the columns have a majority of one amino acid or is there a lot of variation?

iv. How do the relative positions of each pattern in the five original sequences compare?

b. MEME search: Submit the same five sequences to the MEME Web site, requesting a search for three patterns that may or may not be present in all of the sequences with one copy per sequence. Use the default options of MEME. Examine the results of the MEME analysis and answer the following questions. (Note that MEME sends two files: the first one showing the patterns found, and the second a map of the sequence showing the relative positions of the patterns.)

i. How many patterns were found and approximately how long were they?

ii. How does the relative position of each pattern in the five original sequences compare?

c. MAST search: Use the first MEME output file to search the SwissProt database to find additional family members that share the same patterns. A very large output file will be produced. Scan the file, noting the expect values for the aligned regions, and answer the following questions:

i. Can additional members of this family be identified by this approach? Give three examples of different types of organisms that are in the matched list.

ii. How does the relative order of the patterns in the matched sequences compare with those in the query sequences? Would you expect these sequences to align well?

iii. In the PSSM-to-sequence alignments shown, how was the alignment score determined?
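Because these PSSM searches make no provision for gaps, the score of a PSSM-to-sequence alignment is simply the sum of the column scores for the residues in one window of the sequence. The sketch below slides a small PSSM along a sequence and reports the best window. It illustrates the scoring principle only and is not the MAST program: the function name, the hand-made matrix, the default score for unlisted residues, and the test sequence are all invented for the example.

```python
def best_pssm_match(pssm: list, seq: str, default: float = -4.0) -> tuple:
    """Return (best score, 0-based start) over all ungapped windows.

    pssm is a list of {residue: score} dicts, one per motif column;
    residues missing from a column receive the (assumed) default score.
    """
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if score > best[0]:
            best = (score, start)
    return best


if __name__ == "__main__":
    # Invented three-column motif: G, then K, then S or T.
    toy_pssm = [{"G": 2.0}, {"K": 3.0}, {"S": 1.5, "T": 1.5}]
    print(best_pssm_match(toy_pssm, "AAGKTAA"))
```

A database search repeats this scan for every sequence and then converts the best scores into expect values.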

Chapter 7: Phylogenetic Prediction

A phylogenetic analysis of a family of related nucleic acid or protein sequences is a determination of how the family members might have been derived during evolution. The evolutionary relationships among the sequences are depicted by using a graph called a tree. Sequences are placed as outer branches of a tree, and the branching relationships of the inner part of the tree then reflect the degree to which different sequences are related. For example, two sequences that are very much alike will be located at neighboring outside branches and will be joined to a common branch beneath them. Less related sequences will be on branches that are more distant from each other on the tree. The object of phylogenetic analysis is to discover the branch arrangements and branch lengths in trees that best represent the relationship among all the sequences.
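The statement that very similar sequences end up on neighboring outer branches can be made concrete with a small distance calculation. The sketch below computes the fraction of differing positions between each pair of aligned sequences and picks the closest pair, which is the first grouping that a simple distance-based tree method would make. It is an illustration under stated assumptions (pre-aligned, equal-length sequences; uncorrected distances; invented data and hypothetical function names), not a full tree-building procedure.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(aligned: dict) -> tuple:
    """Return the names of the two sequences with the smallest distance."""
    return min(
        combinations(sorted(aligned), 2),
        key=lambda pair: p_distance(aligned[pair[0]], aligned[pair[1]]),
    )


if __name__ == "__main__":
    toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTACCA"}  # invented data
    for x, y in combinations(sorted(toy), 2):
        print(x, y, p_distance(toy[x], toy[y]))
    print(closest_pair(toy))
```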

Phylogenetic analysis of nucleic acid and protein sequences is an important area of sequence analysis, for example, in the study of the evolution of a family of sequences. Using this type of analysis, sequences that are the most closely related can be identified by their occupying neighboring branches on a tree. Thus, when a gene family is found in an organism or group of organisms, phylogenetic relationships among the genes can help to predict which ones might have an equivalent function that has been conserved during evolution of the corresponding organisms. Such functional predictions can then be tested by genetic experiments.

Phylogenetic analysis may also be used to follow the changes occurring in a rapidly changing species, such as a virus. Analysis of the types of changes within a population can reveal, for example, whether or not a particular gene is under selection (McDonald and Kreitman 1991; Nielsen and Yang 1998), or the timing of genetic variation in the human genome (Toomajian et al. 2003).

Procedures for phylogenetic analysis are strongly linked to those for sequence alignment, which was already discussed in Chapters 3 and 5. Similar problems are also encountered. For example, just as two very similar sequences can be easily aligned even by eye, a group of sequences that are very similar but with a small level of variation throughout can easily be organized into a tree. Conversely, as sequences become more and more different through evolutionary change, they can be much more difficult to align. A phylogenetic analysis of very different sequences is also difficult to do because there are so many possible evolutionary paths that could have been followed to produce the observed sequence variation. Because of the complexity of this problem, considerable expertise is required for difficult situations.

Table 7.1. Phylogenetic relationships among organisms

Site Name

Address

Description

Reference

Entrez

http://www3.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html

taxonomically related structures or group of organisms

see Web page

RDP (Ribosomal database project)

http://rdp.cme.msu.edu

ribosomal RNA-derived trees

Maidak et al. (1999)

Tree of life

http://phylogeny.arizona.edu/tree/phylogeny.html

information about phylogeny and biodiversity

Maddison and Maddison (1992)

Date post:	13-Nov-2014
Category:	Documents
Upload:	patelrutvij