SysGen10v6     5/8/13     1  

10. Genome Annotation and Databases

Once a genome sequence is assembled and its initial quality confirmed, how do you go about using all this data? Suppose we have identified a mutation in a genetic screen (as in Chapter 6), and we have mapped it between two SNPs (as in Chapter 9). What genes might be affected by the allele identified in our genetic screen? In other words, what genes lie between the two SNPs? If the genome were completely sequenced and properly assembled, we could determine all the genomic sequence between the two SNPs used for mapping. However, a file containing those assembled bases is only minimally useful as is, and it must be analyzed further before we can determine what genes lie between the two SNPs.

A file containing the assembled bases of a genome becomes much more useful to biologists once these bases are annotated. The process of annotation involves identifying structural and functional features of the genome, and noting them as such. For example, which regions of the genome likely encode genes? Which regions of the genome are used as centromeric sequences? Some of this information can be identified and noted without performing any further wet lab experiments, in two ways: computationally, by applying rules biologists have learned to predict where important features are and what their functions might be; and by linking the genome information with experimentally determined knowledge.

Specifically, if a gene is predicted computationally (sometimes referred to as bioinformatically), it is useful to know whether there is any experimental evidence for the existence of that gene. Experimentally determined knowledge, acquired either before or concurrently with the genome sequence, can be associated with relevant parts of the genome. Experimental evidence relating to the function of a gene and its gene product, to expression levels and conditions, and to genetic variation are all examples of information that can be valuable annotations to the genome. These experiments can include large-scale genome-wide experiments as well as small-scale in-depth experiments on individual genes and their functions.

The biological information resources that organize information around genomes have become essential tools of life science research. For any given organism’s genome, the completeness of genome annotation can vary greatly, mainly depending on the resources available for the genome's continued annotation.

The organization of the genome and its associated annotation is carried out using computer-based databases. Humans are amazingly good at dealing with a hodge-podge of information, with a brain that can sort through data and can sometimes retrieve information that has not been accessed for years. We do not yet fully understand how the human brain works and how the databases in our heads are built. A computer-based database, by contrast, needs instructions as to how to organize information. There are now thousands of biological databases that range in size, complexity, purpose, and interface, depending on whether the database is meant for direct use by humans, computers, or both. As the extent of genome-scale data surpasses a human’s ability to comprehend it at one time, scientists rely on databases of genomic information.

Since genomic databases are interposed between primary data and the geneticist, it is crucial to understand how the information gets into such databases, how one assesses the quality of these data, and the concepts underlying their storage, query and integration. These databases require the information within them to be computable; biological ontologies define terms and their relationships and allow information to be structured.

In this chapter, we will begin by discussing sequence annotation, focusing on gene annotation. Specifically, we will discuss the types of computational analyses that can be carried out on the genome to predict features and function. Next, as a common way to organize genomic data is to attach the information to the linear structure of the genome, we will discuss the different types of experimental data that get attached to the genome, and how the information is organized so as to be computable and displayed for use. Finally, we will discuss
where the information in genome databases comes from, how complete this information is, and the importance of data standards.

A. Annotating the Genome

Genome annotations indicate the locations of structural and functional features of the genome. Without annotation, the genome is an extremely difficult to interpret string of A's, C's, T's and G's. In our dreams, the completely annotated genome would indicate the properties of the organism and how the genome is used to determine these properties. In practice, we are often most interested in the set of genes encoded by the genome.

Genome annotations are important at different scales (Figure 1 [SGF-1316]), and these annotations can be thought of as marks upon the genome, visualized as decorations along a linear string of nucleotides. At the largest scale, we need to know which sequences are associated with which chromosomes and, more specifically, which arm of the chromosome. This type of information can be annotated because of mapping experiments that occur during genome assembly, as discussed in Chapter 9. Similarly, important chromosomal features such as centromeric and telomeric sequences, which may have been identified during genome assembly, can also be associated with the genome sequence. At a smaller scale, we need to know where specific genes lie within the genome sequence and, if genetic map data exist, how the genetic map relates to the genome. At the gene level, we want to know the sequences associated with a particular gene. Useful details about a gene include: Does it encode a protein? Are there introns and, if so, where? Where are the gene regulatory sequences on the genome, and are there functional elements, such as binding sites for gene regulatory proteins, known to be associated with these regulatory sequences?

One goal of annotating a gene in a genome is to associate specific bases within the genome with that gene. A gene, as found in a genome, will span many bases, and a gene model is typically included during annotation. A gene model is the best estimate of the structure of a gene along a genome and typically includes information about exons and introns, transcriptional start and end sites, and, for protein-coding genes, potential translational start and stop codons. These gene models are derived from different types of evidence, including de novo predictions of genes, analysis of RNA and protein gene products, comparative studies of other genomes, and molecular lesions that affect gene activity.

In this section, we focus on computational methods important for annotations at the gene level. These gene level annotations often come from a combination of computational analyses of the assembled genome sequences, confirmation from experimental results, and comparative studies from other genomes. Once gene level annotations are made, these genes are often used as nodes to which other types of information are attached, layer by layer, to build genome databases. Annotations of gene regulatory sequences and genome-scale experimental methods for studying genes and their products will be discussed in later chapters.
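To make the idea of a gene model concrete, the following Python sketch shows one possible way to record exons and coding coordinates and to ask the question posed at the start of the chapter: which genes lie between two mapped positions? The class name, field names, coordinate convention (1-based, inclusive) and example genes are all assumptions made for illustration; they do not correspond to any particular genome database.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeneModel:
    """Hypothetical minimal gene model; coordinates are 1-based and inclusive."""
    name: str
    chrom: str
    strand: str                      # "+" or "-"
    exons: List[Tuple[int, int]]     # (start, end) pairs, sorted by start
    cds_start: int = 0               # first base of the start codon (0 if non-coding)
    cds_end: int = 0                 # last base of the stop codon (0 if non-coding)

    @property
    def start(self) -> int:
        return self.exons[0][0]

    @property
    def end(self) -> int:
        return self.exons[-1][1]

    def introns(self) -> List[Tuple[int, int]]:
        """Introns are the gaps between consecutive exons."""
        return [(a_end + 1, b_start - 1)
                for (_, a_end), (b_start, _) in zip(self.exons, self.exons[1:])]


def genes_between(models, chrom, left, right):
    """Names of gene models on `chrom` that overlap the interval [left, right]."""
    return [m.name for m in models
            if m.chrom == chrom and m.start <= right and m.end >= left]


# Made-up example annotations
models = [
    GeneModel("geneA", "chrI", "+", [(100, 200), (300, 450)], 120, 430),
    GeneModel("geneB", "chrI", "-", [(900, 1200)], 950, 1150),
    GeneModel("geneC", "chrII", "+", [(50, 500)]),
]
print(genes_between(models, "chrI", 400, 1000))   # genes overlapping the mapped interval
print(models[0].introns())                        # the single intron of geneA
```

Here the two SNP positions would be passed as `left` and `right`. A real annotation would carry much more evidence per gene, but the interval-overlap test is the core of the query.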

Figure 1 [SGF-1316]. Example of annotations to a human chromosome at different scales.

Genomes can be scanned for potential gene-encoding regions

Once a genome is sequenced and assembled, a typical first analysis is to determine what genes are encoded by that genome. This prediction of genes uses rules that we have learned about how genetic information is encoded. Algorithms scan the genomic sequence for regions that follow these rules and thus are likely to encode genes. Of course, the better we understand these rules, the better we are at designing algorithms to predict genes.

We know quite a bit about protein coding genes and how they are decoded by the cell’s machinery. Because we understand how open reading frames are read by the ribosome to code for proteins, we can use that information to identify potential protein coding regions in the genome using computational methods. Similarly, our understanding of how pre-mRNA processing occurs informs our ability to predict where intron/exon splice sites lie within a gene. This information is then annotated on the genome as the gene model for a protein coding gene, which typically includes the start and stop codons of the open reading frame and also the locations of introns and exons. These methods are discussed in more detail below.

Some genes do not encode proteins; their products are called non-coding RNAs (ncRNAs, see Table [SGT_464]). The signatures of ncRNA genes are markedly distinct from those of protein-coding genes. Typically, ncRNAs derive their functions from the structures they can fold into, and thus rules that predict RNA structures and features are important for finding potential ncRNAs within a genome. Although we now appreciate that many types of non-coding RNAs are present in each organism and that many of the genes in a genome encode these RNAs, we generally know less about ncRNAs and thus we are less capable of predicting them from
genomic sequence alone. As the rules for each type of ncRNA vary, the prediction of each type of ncRNA is carried out differently.

Figure 2 [SGF-1224]. A typical gene and its protein product. There are signals for each step.

An open reading frame suggests the existence of a protein coding gene

How might you scan a genome for a protein coding gene? We know that the ribosome synthesizes proteins by reading mRNAs using tRNAs that carry the appropriate amino acids to be joined to a growing polypeptide (see Chapter 2 for details). Each tRNA binds to the mRNA at a codon, which is 3 nucleotides long, and the next tRNA comes in and binds to the next 3 nucleotides. Translation by the ribosome begins at a start codon (AUG in an mRNA; sometimes more formally called an initiator methionine codon because the tRNA that binds to AUG brings along the methionine amino acid) and ends at a stop codon (typically UAA, UAG, or UGA in an mRNA). Using these rules, gene prediction algorithms can be designed to scan genomes for potential start codons (ATG in the DNA) that are followed by a string of codons that eventually ends with a stop codon (TAA, TAG, or TGA in the DNA). A string of codons that is uninterrupted by a stop codon is called an open reading frame, and suggests the existence of a protein coding gene. Because either strand of the DNA can encode an RNA molecule, both strands must be scanned for open reading frames. Because the ribosome reads consecutive non-overlapping triplets, each strand has to be examined in all three reading frames, giving six reading frames in total.

For organisms with no introns or few introns (i.e., most prokaryotes and some eukaryotes, including the baker's yeast Saccharomyces cerevisiae), gene prediction can be as simple as looking for open reading frames flanked by start and stop codons. Initially, a minimum size for the open reading frame is typically chosen, to reduce the chance that an open reading frame that arose by chance is called a gene. Figure 3 [SGF-1303] shows a random DNA strand where all possible open reading frames flanked by start and stop codons have been noted.
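The scanning procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: the function names and the `min_codons` threshold are assumptions, each open reading frame is taken to run from the first ATG in a frame to the next in-frame stop codon, and coordinates reported for the minus strand refer to the reverse-complemented sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_strand(dna, min_codons):
    """Yield (frame, start, end) for ATG...stop ORFs; 0-based start, end exclusive."""
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this frame
            elif start is not None and codon in STOPS:
                # count codons from ATG up to, but not including, the stop codon
                if (i - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None                   # look for the next ORF in this frame


def find_orfs(dna, min_codons=100):
    """Scan both strands in all three frames (six reading frames in total)."""
    dna = dna.upper()
    hits = [("+", f, s, e) for f, s, e in orfs_in_strand(dna, min_codons)]
    # minus-strand coordinates are relative to the reverse complement
    hits += [("-", f, s, e)
             for f, s, e in orfs_in_strand(reverse_complement(dna), min_codons)]
    return hits


# Toy sequence with one short ORF (ATG AAA TTT TAG) on the plus strand
print(find_orfs("CCATGAAATTTTAGGG", min_codons=3))
```

Raising `min_codons` is the simple filter mentioned in the text: short open reading frames are common in random sequence, so a length threshold removes many chance hits at the cost of missing genuinely short genes.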

Figure 3 [SGF-1303]. Random sequence with ORFs indicated. Start codons in green; stop codons in red; ORFs in green; DNA in blue; amino acids in black in three-letter code. 1000 random nucleotides have 13 ORFs in one strand, the longest is 48 amino acids. Oc*, ochre stop codon (TAA); Am*, amber stop codon (TAG); Op*, opal stop codon (TGA).

Pre-mRNA splicing can be predicted by characteristic sequence properties

In most eukaryotes, pre-mRNA splicing (the excision of intervening sequences, or introns) occurs, making the final mature mRNA sequence different from the sequence found in the genome. Introns complicate gene prediction because they interrupt the open reading frame, so that simple scanning for open reading frames flanked by start and stop codons is no longer sufficient. However, our understanding of how the spliceosome works allows us to look within the genome for sequences that the spliceosome can use to remove introns from pre-mRNA molecules. These signals are called splice donor and acceptor sequences, and when found within a DNA region that is transcribed, they can be used to predict where pre-mRNAs are spliced.

The splice donor and acceptor sequences are almost invariant across eukaryotes. The donor sequence occurs right after the end of an exon and is GU in the pre-mRNA (GT in the DNA). The acceptor sequence, which occurs right before the next exon, is AG (see Figure 4 [SGF-1255]). Because any given dinucleotide will appear in the genome quite frequently by chance, it is also useful to know what other sequences designate splice sites to the spliceosome. For splicing in mammals, there is also a branch point sequence (BPS) important for pre-mRNA splicing. The human consensus BPS is yUnAy, and it is typically found about 25 nucleotides upstream of the 3’ acceptor splice site (Figure 5 [SGF-1274]). This BPS consensus is defined by examining many different known BPSs and seeing whether there are preferences for certain bases at particular sites in the pre-mRNA. See Box-331 for logo definition. A polypyrimidine tract is often found between the branch point and the 3' acceptor site. Another characteristic of splicing for protein coding genes is that the exons joined must be in frame, to maintain the coding of a protein.

Thus, an exon in the middle of a protein coding gene in humans will be preceded by a 3' splice acceptor site, which is near a polypyrimidine tract and a BPS, will be followed by a 5' splice donor site, and will be able to splice in frame to the next exon. This information is incorporated into algorithms that can scan the genome for protein coding genes.
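As a rough illustration of how these signals can be combined, the Python sketch below looks for GT...AG spans that also contain a branch-point-like match to yUnAy at a plausible distance upstream of the acceptor, and then checks whether removing a candidate intron leaves an uninterrupted reading frame. The function names, the length limits, and the 15 to 40 nucleotide branch point window are assumptions chosen for the example; real gene finders weigh these signals statistically rather than requiring exact matches.

```python
import re

STOPS = {"TAA", "TAG", "TGA"}
BPS = re.compile(r"[CT]T[ACGT]A[CT]")   # yUnAy written with DNA letters


def candidate_introns(dna, min_len=30, max_len=500, bps_window=(15, 40)):
    """Return (start, end) of GT...AG spans with a branch-point-like motif.

    Coordinates are 0-based with `end` exclusive, so dna[start:end] begins
    with GT and ends with AG.
    """
    dna = dna.upper()
    donors = [m.start() for m in re.finditer(r"(?=GT)", dna)]
    acceptor_ends = [m.start() + 2 for m in re.finditer(r"(?=AG)", dna)]
    near, far = bps_window
    hits = []
    for d in donors:
        for a in acceptor_ends:
            if not (min_len <= a - d <= max_len):
                continue
            # part of the intron where a branch point would be expected
            region = dna[max(d + 2, a - far):a - near]
            if BPS.search(region):
                hits.append((d, a))
    return hits


def splice_out(dna, intron):
    """Join the sequence on either side of the intron."""
    start, end = intron
    return dna[:start] + dna[end:]


def reads_through(cds):
    """True if no stop codon occurs before the last codon of `cds`."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    return all(c not in STOPS for c in codons[:-1])


# Toy gene: exon 1, a 48-nt intron (GT ... CTGAC ... pyrimidines ... AG), exon 2
intron = "GT" + "AAGC" * 5 + "CTGAC" + "TTTC" * 4 + "TTCAG"
gene = "ATGGCC" + intron + "GCCTAA"
for hit in candidate_introns(gene):
    spliced = splice_out(gene, hit)
    print(hit, spliced, reads_through(spliced))
```

In this toy case the shorter GT...AG spans inside the intron are rejected by the minimum length, which mirrors the point made above that a dinucleotide alone is far too common to define a splice site.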

Figure 4 [SGF-1255]. Splicing in frame. Small human intron of Homo sapiens cadherin 1, type 1, E-cadherin (epithelial) (CDH1). The BPS consensus is yUnAy, where y = weak preference for a pyrimidine, U and A are strong preferences, and n = any nucleotide. Note that a polypyrimidine tract is enriched in U and C nucleotides in RNA, and thus T and C nucleotides in DNA. {missing nt in bps in read above}

Figure 5 [SGF-1274]. Reference: Gao et al. 2008, Nucleic Acids Research, Volume 36, Issue 7, pp. 2257-2267.

ncRNA genes can be identified by predicted secondary structure

Our understanding of the production and processing of ncRNAs lags behind our understanding of how protein coding genes are read from the genome, and this is one reason why we are less able to predict ncRNAs than protein coding genes. Because the structures of non-coding RNAs are most important for their function, algorithms that predict particular types of ncRNAs compare sequences in the genome to models of RNA structure and features. Important RNA secondary structures include stems, loops, and branches (Figure [SGF-1063]). These RNA structures are produced through base-pairing within the RNA molecule, and thus the specific sequence at the structure can change while still allowing the structure to form, as long as the complementary sequence needed for pairing is maintained. For example, in Figure [SGF-1026], you can see that a similar stem-loop structure can be maintained given two very different primary RNA sequences.

The prediction of ncRNAs mainly depends on the ability of particular sequences to form structures characteristic of the type of ncRNA, and each type of ncRNA is predicted using different rules. Unlike protein coding genes, which have well-defined characteristics such as open reading frames, start codons, stop codons, and splice consensus sequences, ncRNAs of a similar class and function can vary greatly in primary sequence. Typically, other types of experiments, such as sequence comparisons between genomes and the direct examination of RNA, are used to annotate ncRNAs in a newly sequenced genome. Because we have known about rRNAs and tRNAs for some time, we are fairly good at predicting these genes. On the other hand, our knowledge of miRNAs is newer, and their generally small size makes computational prediction more difficult. For ncRNAs, RNA experiments are therefore especially valuable (see below).
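The point that structure can be preserved while sequence changes can be illustrated with a small Python check. This sketch only tests whether the two arms of a candidate hairpin can base-pair; it is not a structure prediction algorithm, and the function name, the fixed stem and loop lengths, and the example sequences are all made up for illustration.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def forms_stem_loop(rna, stem_len, loop_len, allow_gu=True):
    """True if the first `stem_len` bases can pair with the last `stem_len`
    bases of a hairpin laid out as 5' arm / loop / 3' arm."""
    rna = rna.upper().replace("T", "U")
    if len(rna) != 2 * stem_len + loop_len:
        return False
    allowed = WATSON_CRICK | WOBBLE if allow_gu else WATSON_CRICK
    five_prime_arm = rna[:stem_len]
    # read the 3' arm backwards so that position i pairs with position i
    three_prime_arm = rna[-stem_len:][::-1]
    return all(pair in allowed for pair in zip(five_prime_arm, three_prime_arm))


# Two unrelated primary sequences that can form the same 5-bp stem, 4-nt loop
print(forms_stem_loop("GGGACUUCGGUCCC", 5, 4))
print(forms_stem_loop("CAUCGGAAACGAUG", 5, 4))
# A change on one arm only breaks the stem ...
print(forms_stem_loop("GGGACUUCGGUCGC", 5, 4))
# ... and a compensating change on the other arm restores it
print(forms_stem_loop("GCGACUUCGGUCGC", 5, 4))
```

Compensating changes of this kind, seen when homologous sequences are compared, are one of the signals that a region may fold into a conserved RNA structure.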

Figure [SGF-1063]. RNA secondary structure. A. Stem and loop structure. B. An internal loop. C. A branch.

Figure [SGF-1026]. Compensating changes and co-variation (GC to CG).

Ribosomal RNAs (rRNAs) are key components of the ribosome. They are highly conserved in evolution and are among the first parts of a genome to be analyzed because they are used in phylogenetic analysis. For eukaryotes, these include the 28S, 18S and 5S rRNAs (the S refers to the Svedberg unit of sedimentation, from a classic method of separating macromolecules). It is worth noting that G•U base pairs in RNA are about as stable as A•U base pairs, which are less stable than G•C base pairs.

Genome comparisons can help refine computational gene predictions

Once a genome is scanned for potential genes, the genome can be compared to other previously sequenced and annotated genomes. For two genomes that are closely related, many gene placements may be conserved between species. If there are no closely related genomes, comparative studies of gene products between genomes can still be informative, particularly for the conserved set of proteins found in most organisms. Genome sequences can be used to search databases for similar sequences in other genomes that have been previously annotated to encode genes.

For protein coding genes, the search may be carried out differently than for ncRNAs. Instead of looking for sequence similarity at the DNA/RNA level, the peptides potentially encoded by possible open reading frames can be compared to peptides that have been previously annotated. The protein product is likely what is conserved, and given the redundancy in amino acid specification, the same amino acid sequence may be coded for by very different nucleotide sequences. The similarity between a predicted protein encoded by the genome and a protein sequence found in the database is measured as the proportion of identical amino acids and also by the proportion of similar amino acids. Similarity among amino acids is based on chemical structure as well as the patterns of substitutions in homologous proteins.

Examples of similar amino acids are serine and threonine, which are both polar amino acids and which can both be phosphorylated. In most cases, experiments have demonstrated that substitution of serine for threonine or vice versa will not adversely affect the structure or function of the protein, and thus these amino acids are considered very similar. Similarly, isoleucine and valine can both be considered hydrophobic amino acids, but isoleucine is slightly bulkier than valine. Assessing the similarity of two peptides involves judgments about how similar two amino acids are to each other and how their substitution will affect protein function. Another judgment that needs to be made is: over how many amino acids must there be identity or similarity for there to be functional equivalence?

There are several methods for identifying similar proteins from genomic sequence. Some methods are less computationally intensive, and thus can be done more quickly. For rapid searches, BLAT and BLAST are two common programs that check for regions of alignment to known proteins given a putative peptide sequence. In these algorithms, a rapid computational search is made for alignment of short seed sequences, and then the matches are extended outward. The likelihood of obtaining such a match by chance is calculated based on the length and quality of the match and the size of the sequence database.
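The two proportions described above can be computed directly from an existing alignment, as in the Python sketch below. The similarity groups used here (for example, serine with threonine, and isoleucine with valine) are a deliberately crude assumption made for illustration; alignment programs normally score substitutions with an empirically derived matrix rather than fixed groups. The function name and the example peptides are also made up.

```python
# Assumed, simplified groups of "similar" amino acids (one-letter codes)
SIMILAR_GROUPS = [set("ST"), set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("NQ")]


def identity_and_similarity(a, b):
    """Return (fraction identical, fraction identical-or-similar).

    `a` and `b` are two already-aligned sequences of equal length,
    with '-' marking gaps. Columns containing a gap are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(a.upper(), b.upper())
               if x != "-" and y != "-"]
    if not columns:
        return 0.0, 0.0
    identical = sum(x == y for x, y in columns)
    similar = sum(x == y or any(x in g and y in g for g in SIMILAR_GROUPS)
                  for x, y in columns)
    return identical / len(columns), similar / len(columns)


# Made-up aligned peptides: 5 of 8 columns identical, 7 of 8 identical or similar
print(identity_and_similarity("MTEYKLVV", "MSEYKIVA"))
```

Note that the result depends on how gap columns are treated and on which residues are called similar, which is part of the judgment the text describes.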

However, if time is not an issue, a rigorous alignment of the predicted protein against the full length of a known protein product is preferred. Dynamic programming methods such as Smith-Waterman alignment can do this. When we find alignment across the full length of the protein, we can be more confident in annotating the gene as similar to the previously known one.
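For readers who want to see what such an alignment involves, here is a compact Python sketch of the Smith-Waterman algorithm. It is a teaching version under stated assumptions: the scoring scheme (match +2, mismatch -1, linear gap penalty -2) is arbitrary, whereas protein comparisons in practice use a substitution matrix and usually separate gap-opening and gap-extension penalties. Smith-Waterman reports the best-scoring local alignment, so coverage of the full protein length has to be checked from the result.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]          # dynamic programming table
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table; scores are never allowed to drop below zero
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,             # gap in b
                          H[i][j - 1] + gap)             # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Made-up sequences sharing the segment MTEYKLVV
print(smith_waterman("AAAMTEYKLVVGGG", "CCMTEYKLVVCC"))
```

The table has one cell per pair of positions, so the cost grows with the product of the two sequence lengths. This is why seeded searches such as BLAST are used for database-wide scans and exhaustive alignment is reserved for the most promising candidates.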

Figure [SGF-1058] shows the alignment of the Ras proteins from various species. The protein sequences are not only similar; there is a large amount of identity between the three proteins, particularly at the N-terminus. This is striking, as these proteins have maintained identity despite being separated by over half a billion years of animal evolution. If a genome sequence search for similar proteins shows such an alignment, we would feel relatively confident annotating the sequence coding for this protein as a Ras gene.

Figure [SGF-1058]. RAS proteins are highly similar in amino acid sequence. *, identical amino acid; :, highly similar amino acid; ., similar amino acid. N.B. Ras proteins are not really 1:1 orthologs…

Similar sequences can have different relationships and should be noted during annotation

When sequence comparisons are made to define genes and sequence identity is found throughout the genes across different species, these sequences are called homologs. Homologous genes are derived from an ancient common ancestor, and the differences we see in these sequences in our modern day organisms are due to divergence over evolutionary time. Presumably, the similarity in sequence suggests the gene products can perform similar functions within the cell and the organism. The RAS genes we discussed above are considered homologs. When there is a single homolog in each species and this homolog is thought to come from a single gene in a common ancestor, these homologs are called orthologs. Orthologs often have the same function, and thus the identification of orthologs is useful for annotating genomes.

However, finding genes with similar sequences does not necessarily mean that the gene products are doing the same thing. One reason is that gene duplication happens over evolutionary time, and this duplication of sequence can result in more than one sequence in organism B that is homologous to a single sequence in organism A. The duplicated genes in organism B are called paralogs, because they are homologous genes that arose due to a duplication event. One way this situation could have occurred is that an ancient ancestor of organism B underwent a duplication of this gene while no gene duplication occurred in the lineage of organism A. Paralogs can arise before or after speciation, and once a gene has duplicated, the copies can then diverge in function. Figure SGF-1045 shows the relationships of several homologous genes, A, B, and C.

The relationship of the EGF-receptor gene family can be seen in Figure XXXX (need to make). Here, Drosophila melanogaster and Caenorhabditis elegans each have a single EGF-receptor gene while humans and mouse each have four (EGF-receptor, HER2/c-neu, HER3, HER4). The fly and worm genes are clearly orthologous and both are orthologous to the four mammalian genes.
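One simple way to record such relationships during annotation is to count the members of a homologous family in each species. The Python sketch below does this for the EGF-receptor example, using the gene counts given in the text; the function names, the labels for the fly and worm genes, and the wording of the categories are assumptions for illustration. Copy numbers alone do not establish orthology, which also depends on the gene tree.

```python
def copy_number_pattern(family):
    """family maps species name -> list of homologous genes in that species.
    Returns a string such as '1:2:1', in the order the species are given."""
    return ":".join(str(len(genes)) for genes in family.values())


def describe(family):
    """Give a coarse label for the pattern of copy numbers."""
    counts = [len(genes) for genes in family.values()]
    if all(c == 1 for c in counts):
        return "one copy per species (candidate strict orthologs)"
    if any(c == 0 for c in counts):
        return "loss (contraction) in at least one species"
    return "duplication (expansion) in at least one lineage, paralogs present"


# Gene counts from the text; fly and worm gene labels are placeholders
egf_receptor_family = {
    "D. melanogaster": ["fly EGF-receptor"],
    "C. elegans": ["worm EGF-receptor"],
    "human": ["EGF-receptor", "HER2/c-neu", "HER3", "HER4"],
    "mouse": ["EGF-receptor", "HER2/c-neu", "HER3", "HER4"],
}
print(copy_number_pattern(egf_receptor_family), "-", describe(egf_receptor_family))
```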

Figure SGF-1045. Orthologies. A and D show strict 1:1:1 orthology for the genes in the three species. B shows a duplication (expansion) in species B. This results in a 1:2:1 orthology relationship. C shows a loss (contraction) in species B. This results in a 1:0:1 orthology relationship.

Orthology to a reference genome helps define gene structures

In addition to being a good way to identify genes in newly sequenced genomes, comparing a predicted protein to its homologs from other species provides a check on computational gene prediction. Especially in organisms with introns, the computational prediction of proteins can miss exons, so that parts of the gene are missing from the gene model. Another reason a part of a gene may be missing is that the start codon may be misidentified (since AUG is also used within the protein to code for methionine), in which case the 5’ end of the gene may be missing. Assembly problems in the genome can also lead to missing sequences and thus missing gene parts.

When proteins are aligned to homologs, any potentially missing domains can be identified, and the genome can then be searched for them. Aligning a protein from a new genome to homologs from other species thus provides a reality check on our annotation. The most clear-cut cases involve missing domains. For example, as shown in Figure [SGF1313], if we find the red and green domains but not the blue domain, we would make it a priority to look for the blue domain in the genome sequence.

Figure [SGF1313]. Fixing gene models by orthology. A shows a protein in one species. B shows the predicted gene model, which misses the blue domain. Searching the genome identifies the blue domain in genome sequence (C), which allows the gene model to be fixed in D.


ncRNAs do not show wobble

ncRNAs can also be predicted by genome comparisons. Because protein-coding genes account for much of the similarity detected between genomes, ncRNAs are often predicted by taking the sequences that do not correspond to protein-coding genes and checking these for similarity.

Protein sequence comparisons focus on domains and architecture

Domains are basic units of protein structure and are typically not split during evolution. Many domains are known from crystallography or other structural studies, but many are predicted based on multiple alignment of sequences. Domain definitions are often a moving target, as new domains are defined by additional sequence information, structural information, and functional data. Domains can be defined by sequence identity and similarity, function, and structure.

There are several active protein domain database projects

InterPro provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites using signatures provided by ten databases. These include the Pfam project (www.pfam.janelia.org), which curates protein domain families and scans new sequences for the presence of domains; SMART (http://smart.embl-heidelberg.de/); and PRINTS (http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/index.php), which identifies conserved motifs in proteins. The member databases are:

CATH/Gene3D at University College, London, UK
PANTHER at University of Southern California, CA, USA
PIRSF at the Protein Information Resource, Georgetown University Medical Centre, Washington DC, USA
Pfam at the Wellcome Trust Sanger Institute, Hinxton, UK
PRINTS at the University of Manchester, UK
ProDom at PRABI Villeurbanne, France
PROSITE and HAMAP at the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland
SMART at EMBL, Heidelberg, Germany
SUPERFAMILY at the University of Bristol, UK
TIGRFAMs at the J. Craig Venter Institute, Rockville, MD, US

Some of these domains are well known and have well-defined function. Other domains have no known function and are called DUFs, for Domain of Unknown Function.

Many proteins contain highly independent modules

Many proteins contain highly independent modules. These modules often work independently and are identified by molecular genetic structure-function studies, by structural studies, or by sequence comparison. For example, in Figure [SGF-1028], a hypothetical protein might have a catalytic domain, two different domains that localize the protein within the cell, and a regulatory domain.


Figure [SGF-1028]. Modular proteins. A. Schematic of some types of domains in a multi-domain protein. B. Schematic of a protein sequence comparison by domains. Each domain can have a different extent of sequence similarity, e.g., amino acid identity. The architecture of the proteins is conserved, as both species have all three recognized domains.

An example of a modular protein is Grb2. Grb2 comprises three domains: an amino-terminal SH3 domain, an SH2 domain and a second SH3 domain (Figure [SGF-1253]). The conservation is strong throughout except in the small spacer between the SH2 domain and the carboxyl-terminal SH3 domain. SH3 domains bind proline-rich SH3-binding domains, while SH2 domains bind short phosphotyrosine-containing peptides (Figure [SGF-1319]).

Figure [SGF-1319]. Schematic of Grb2. Maybe add label “phosphotyrosine peptide”


Figure [SGF-1314]. The Grb2 family of modular proteins with three domains. The first line is human Grb2; the second is Drosophila drk; the third is C. elegans SEM-5.

Training

Gene prediction programs use various parameters that are often genome specific. How can you know what values to use if you have not yet analyzed the genome? One solution to this Catch-22 is to train the program with some genes that have been verified by cDNA analysis. Indeed, inclusion of RNA-seq transcriptome data is now standard in a genome sequencing project, and the programs that predict coding genes can be trained with these RNA data. Such trainable systems (often called machine learning systems) use existing data to teach the computer about a desired result, based on positive and negative training sets. For example, a set of hundreds of sequences that we believe include genes serves as a positive training set, and another set of sequences that we believe do not contain genes serves as a negative training set. The program learns from these training sets and can then predict whether a new sequence is likely to contain a gene.

Experimental evidence helps define gene structures

While predictions are often quite good at finding genes, they are less reliable at placing splice sites correctly, and here experimental evidence is most often better. One major type of experimental data is cDNA sequence, for example obtained from RNA-seq reads (Figure [SGF-1062]). Long reads are better at finding the correct phasing of alternatively spliced mRNAs (Figure [SGF-1329]).
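The idea of training on positive and negative sets, described under Training above, can be made concrete with a deliberately tiny Python sketch. It scores a sequence by the log-odds of its 3-mers in the positive versus the negative set, which is a naive-Bayes-style stand-in and not the method of any real gene predictor (those typically use much richer models). All sequences, function names and the choice of k are assumptions made for illustration.

```python
from collections import Counter
from math import log

def kmer_counts(seqs, k=3):
    """Count all overlapping k-mers in a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def train(positive, negative, k=3):
    """Return a log-odds score per k-mer, using add-one smoothing."""
    pos, neg = kmer_counts(positive, k), kmer_counts(negative, k)
    kmers = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(kmers)
    neg_total = sum(neg.values()) + len(kmers)
    return {
        km: log((pos[km] + 1) / pos_total) - log((neg[km] + 1) / neg_total)
        for km in kmers
    }

def score(model, seq, k=3):
    """Sum the log-odds of the k-mers in seq; unseen k-mers contribute 0."""
    seq = seq.upper()
    return sum(model.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))

# Made-up training sets: "gene-like" (positive) and "non-gene" (negative) sequences.
positive = ["ATGGCCGCGGCC", "ATGGCGGCCGCG"]
negative = ["ATATATATTTAA", "TTTTAAATATAT"]
model = train(positive, negative)

print(score(model, "GCCGCGGCC") > 0)   # resembles the positive set
print(score(model, "ATATTTTAA") < 0)   # resembles the negative set
```

A positive score means the new sequence looks more like the positive training set. With only two sequences per set this says nothing biological; it only shows the mechanics of learning parameters from examples instead of fixing them in advance.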


Figure 6 [SGF-1062]. Mapping RNA reads to confirm exon-intron boundaries. Top, assembled cDNA. Bottom, mapping reads to a reference genome sequence.

Figure 7 [SGF-1329]. cDNA evidence for alternative splicing.

Pseudogenes are non-functional genes

Pseudogenes are similar to normal genes but are non-functional. In general, pseudogenes are considered to have evolved from functional genes. Some pseudogenes arise by duplication of a gene with concomitant or subsequent modification to a non-functional state; these are called duplicated or non-processed pseudogenes. Another type of pseudogene arises from retrotransposition, in which an mRNA is reverse transcribed into DNA and the DNA copy (cDNA) is inserted into the genome. Such processed pseudogenes lack introns.

Figure 8 [SGF-1304]. Types of inactivating mutations causing pseudogenes.


Figure 9. ACYL3 gene disrupted during human evolution. A stop codon occurred after humans diverged from gorillas. Reprinted from Zhu J, Sanborn JZ, Diekhans M, Lowe CB, Pringle TH, et al. (2007) Comparative Genomics Search for Losses of Long-Established Genes on the Human Lineage. PLoS Comput Biol 3(12): e247. doi:10.1371/journal.pcbi.0030247 (figure doi:10.1371/journal.pcbi.0030247.g002). Reuse permitted by creativecommons.org/licenses/by/2.5/

The typical hallmarks of pseudogenes are that they are not expressed and cannot encode functional proteins. For example, accumulation of stop codons in the protein-coding sequence is indicative of a pseudogene. Some pseudogenes, however, are expressed. For example, the ACYL3 o-acyltransferase gene appears to have become inactive in humans due to the presence of a stop codon, TGA, in place of the tryptophan-encoding TGG codon (Figure 9).

Table [SGT-463]. Pseudogene types.

Pseudogene type: Definition
Processed pseudogene: Created by retrotransposition of the mRNA of a functional protein-coding parent gene, followed by accumulation of disabling mutations
Duplicated pseudogene: Created by genomic duplication of a functional protein-coding parent gene, followed by accumulation of disabling mutations
Unitary pseudogene: Pseudogene for which the ortholog in a reference species (mouse) is coding but the human locus has accumulated disabling mutations
Polymorphic pseudogene: Locus coding in some individuals but with disabling mutations in the reference genome

Source: Pei et al. Genome Biology 2012 13:R51 doi:10.1186/gb-2012-13-9-r51
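One of the disabling mutations discussed above, a premature stop codon such as the TGG to TGA change in ACYL3, is easy to screen for computationally. The following Python sketch reports in-frame stop codons that occur before the last codon of a coding sequence. The two toy sequences are invented for illustration and are not the ACYL3 sequence; the function name is likewise an assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def premature_stops(cds: str) -> list[tuple[int, str]]:
    """Return (codon_index, codon) for every in-frame stop before the final codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [(i, c) for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]

# Hypothetical sequences: the second differs by a single G->A change (TGG -> TGA).
functional = "ATGGCTTGGAAATAA"   # ATG GCT TGG AAA TAA
disabled   = "ATGGCTTGAAAATAA"   # ATG GCT TGA AAA TAA

print(premature_stops(functional))  # no premature stop
print(premature_stops(disabled))    # stop at codon index 2
```

A hit from such a scan is only a hint. As the text notes, a locus can be polymorphic, and other kinds of evidence (expression, conservation) are needed before calling something a pseudogene.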

Gene models can change over time


As evidence arises, our understanding of each gene can change, and gene structure models can be altered. Often what were considered two separate genes are merged into one gene (Figure [SGF-1029]), or what was thought to be one gene is split into two genes.

Figure [SGF-1029]. Splits and merges of gene models. A predicted gene might be split into two separate genes. Two adjacent predicted genes can be merged into one gene. In A, exons 1-4 are in the same gene model. In B, exons 1 and 2 are in one gene model, while exons 3 and 4 are in a distinct gene model.

Genomes contain repeated sequences of known or unknown function

Genomes include many classes of repeats. Some of these are due to transposable elements (discussed in more detail in Chapter…), others are multigene families, and others are structural components of the genome such as centromeres and telomeres. Still others remain mysteries. The repeat structure of a genome can be determined by C0t analysis. The rate of renaturation of denatured genomic DNA depends on the concentration of complementary strands, so each abundance class renatures at a different rate (C0t value). A C0t curve reveals the proportion of the genome that is repetitive and the relative abundance of each class. Repeats are conveniently classified as tandem repeats or dispersed repeats. Dispersed repeats include transposons, tRNA genes, pseudogenes, and paralogs (see below). Tandem repeats include tandem arrays of paralogs, so-called satellite DNA, and ribosomal DNA.
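The C0t analysis described above can be sketched numerically. The Python example below assumes ideal second-order renaturation kinetics, in which the fraction of a sequence class that is still single-stranded is 1 / (1 + k * C0t); this model is a common textbook idealization and is an assumption here, not something stated in this chapter. The three abundance classes and their rate constants are invented for illustration.

```python
def fraction_single_stranded(c0t, components):
    """Fraction of DNA still single-stranded at a given C0t value.

    components: list of (fraction_of_genome, rate_constant_k) pairs, one per
    abundance class. Each class is assumed to follow ideal second-order
    kinetics, fraction / (1 + k * C0t).
    """
    return sum(f / (1.0 + k * c0t) for f, k in components)

# Hypothetical genome: 20% highly repetitive, 30% moderately repetitive, 50% single copy.
# Abundant classes find partners faster, so they get larger rate constants.
genome = [(0.2, 100.0), (0.3, 1.0), (0.5, 0.001)]

for c0t in (0.0, 0.01, 1.0, 1000.0):
    print(c0t, round(fraction_single_stranded(c0t, genome), 3))
```

Plotting this function against log(C0t) would give a curve with one step per abundance class, which is how the proportion of each class can be read off.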

Figure [SGF-1330]. C0t analysis defines the extent of repetitive sequences in a genome.

At least half the human genome is derived from transposable elements, which include the long interspersed elements (LINEs). The most common is the LINE-1 or L1 element, some copies of which are still active. Functional L1 elements have two ORFs in their 6000 nt. The 5’ UTR contains a promoter that directs expression of the two ORFs, and the 3’ UTR ends in a polyA tail. ORF1 encodes an RNA-binding protein of poorly understood function; ORF2 is involved in retrotransposition and has domains with endonuclease and reverse transcriptase activity as well as a 3′ terminal zinc finger-like domain.


From the introduction to: Penzkofer T, Dandekar T, Zemojtel T. L1Base: from functional annotation to prediction of active LINE-1 elements. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D498-500. PubMed PMID: 15608246; PubMed Central PMCID: PMC539998.

RAS proteins are highly conserved (Figure [SGF-1058]). For example, about 90% of the amino acids in the first 80% of the protein are identical over half a billion years of animal evolution. ADD yeast and Dicty?

Figure [SGF-1058]. RAS proteins are highly similar in amino acid sequence. *, identical amino acid; :, highly similar amino acid; ., similar amino acid.

Figure [SGF-1253]. How domains can shuffle without change in protein structure

Search to find similar proteins

A standard method in sequence analysis is to use one sequence to search databases for similar sequences. Similarity is measured as the proportion of identical amino acids or nucleotides or, in the case of peptide sequences, the proportion of similar amino acids. Similarity among amino acids is based on chemical structure as well as the patterns of substitutions in homologous proteins. For example, serine and threonine are both polar amino acids and both can be phosphorylated. Isoleucine and valine are both hydrophobic amino acids. In most cases substitution of serine for threonine or vice versa will not adversely affect the structure or function of the protein. Different methods of identifying similar proteins exist. For rapid searches, BLAT or BLAST are favorites. Given enough time, full alignments of proteins are preferred (e.g., Figure [SGF-1058]).

Protein alignments allow predicted evolutionary trees to be drawn

An evolutionary tree shows the inferred relationship among species, genes or proteins. The tree structure implies a common ancestor. A phylogenetic or evolutionary tree is rooted and has a sense of time. An unrooted tree just shows the relationship of the proteins. As the number of proteins in the tree increases, the number of possible trees increases even faster. For example, with three proteins there is one possible unrooted tree and three possible rooted trees (Figure [SGF-1277]). However, with four proteins there are three possible unrooted trees and 15 possible rooted trees (Figure [SGF-1278]). The more trees, the more ambiguity in the exact tree that is consistent with the existing data. For this reason, among others, trees can change as new data are added. There are many ways to construct a tree. See Graur and Li (Fundamentals of Molecular Evolution, Sinauer) for some details.
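The tree counts quoted above (one unrooted and three rooted trees for three proteins, three and 15 for four) follow a standard combinatorial result for strictly bifurcating trees: there are (2n-5)!! unrooted and (2n-3)!! rooted topologies for n sequences, where !! is the double factorial. The Python sketch below computes these numbers; the function names are chosen for illustration.

```python
def double_factorial(m: int) -> int:
    """Product m * (m-2) * (m-4) * ... down to 1 or 2; defined as 1 for m <= 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def n_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating tree topologies for n >= 3 sequences."""
    if n < 3:
        raise ValueError("need at least three sequences")
    return double_factorial(2 * n - 5)

def n_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating tree topologies for n >= 2 sequences."""
    if n < 2:
        raise ValueError("need at least two sequences")
    return double_factorial(2 * n - 3)

for n in (3, 4, 5, 10):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

For ten sequences there are already more than two million unrooted topologies, which is why tree-building methods search heuristically and why added data can change the preferred tree.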

Figure [SGF-1277]. Trees of three proteins: the one possible unrooted tree and three possible rooted trees.

Figure [SGF-1278]. Trees of four proteins. Three possible unrooted and fifteen possible rooted trees with four proteins.


Figure SGF-1279. Example of a tree. Left, unrooted: four sequences with differences indicated in red; the inferred tree is shown below. Right, rooted: the same four sequences with an outgroup that allows the tree to be rooted.

Let’s consider a simple example of a gene tree (Figure [SGF-1279]). Four sequences of length 22 nucleotides are very likely related, as 18 of the positions are identical (shown in black in Figure [SGF-1279]). Sequences a and b differ by one nucleotide, and sequences c and d also differ by one nucleotide. However, since there are four differences between a and c, between a and d, between b and c, and between b and d, we can draw the unrooted tree shown, in which a groups with b and c groups with d. If we have a slightly less related sequence, namely B, differing in, for example, four positions from each of a, b, c and d, we can use B as an outgroup to root the tree.

Some annotations are to Domains of Unknown Function (DUFs)

The function of these domains remains to be discovered. An example chosen at random is shown in Figure [SGF-1320].


Figure SGF-1320. DUF1041: a Domain of Unknown Function.

Functional equivalence can be assessed by cross-species complementation

Many people consider that orthologs have to have the same function, but this is an inference, not a requirement. When an ortholog from one species can substitute for the gene in another species, we call the two functionally equivalent orthologs. See chapter – or Box.

Table SGT- ..

Genotype    DNA           Phenotype
+           -             wild type
A(lf)       -             mutant
A(lf)       A             wild type
+           A             wild type
A(lf)       species2-A    wild type
+           species2-A    wild type

The meaning of percent identity of proteins depends on the rate of their evolution, which differs among protein families and even among domains. C. elegans ced-9 and human bcl-2 complement in spite of limited sequence similarity: only 43 of 280 amino acids (15%) are identical. Include somewhere a table of conservative changes: basic: K, R, H; acidic: D, E; etc.
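The 43/280 figure above is a percent identity, and the value depends on what is used as the denominator. The Python sketch below counts identical columns in a pairwise alignment and lets the caller choose the denominator: by default the number of columns with no gap, or a fixed length such as the 280 residues of CED-9 used in the text. The short aligned strings are made up for illustration and are not the CED-9/BCL2 alignment.

```python
def percent_identity(aligned_a: str, aligned_b: str, denominator=None) -> float:
    """Percent identity between two aligned sequences ('-' marks a gap).

    If denominator is None, the number of gap-free alignment columns is used;
    otherwise the given length (for example, the length of one protein).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = list(zip(aligned_a.upper(), aligned_b.upper()))
    identical = sum(1 for x, y in pairs if x == y and x != "-")
    if denominator is None:
        denominator = sum(1 for x, y in pairs if x != "-" and y != "-")
    return 100.0 * identical / denominator

# Toy alignment (hypothetical): 4 identical residues in 5 gap-free columns.
print(percent_identity("MKT-LV", "MRTALV"))                 # 80.0
print(round(percent_identity("MKT-LV", "MRTALV", 6), 1))    # 66.7

# The calculation quoted in the text: 43 identical residues out of 280.
print(round(100.0 * 43 / 280, 1))                           # 15.4
```

Because the same alignment gives 80% or 67% depending on the denominator, a percent identity is only interpretable when the convention is stated, as the text does by naming the 280-residue length of CED-9.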


Figure SGF-1057. Alignment of C. elegans CED-9 and human BCL2 proteins shows minimal similarity, but human BCL2 can replace CED-9 in transgenic C. elegans. *, amino acid identity; :, strongly similar amino acids; ., similar amino acids. CED-9 is 280 amino acids long, of which 43 (15%) are identical in BCL2.

Secretion signals

A very common protein feature is a signal that directs insertion into the ER, a secretion signal called the signal peptide. Predictions made by software such as SignalP are often used for genome annotation (Figure [SGF-1281]). Note that some proteins are secreted without a signal peptide, and of course not all proteins predicted to be secreted are in fact secreted.


Figure [SGF-1281]. Signal peptide prediction. First 70 amino acids of chicken EGFR with predicted likelihood of being part of the signal peptide (green) and of the cleavage site (blue, red).

B. Keeping track of annotations: ontologies

Biological databases store information about many types of things, called entities. Entities include species, strains, alleles, genes, proteins, regulatory elements, repeats, plasmids, people, institutions, reagents, and so forth. It is crucial to have distinct names for specific entities. Databases include descriptions of these entities. They also include the relationships between entities, such as what genes interact with other genes, what alleles cause what phenotypes and what cells express what genes.

Stable, unique identifiers help maintain data integrity

Imagine a gene with eight names. For example, the yeast SIN3 gene is also known as YOL004W, CPE1, GAM2, RPD1, SDI1, SDS16, and UME4. Databases store synonyms and make it easy to keep these straight. Or imagine a gene with a variety of types of names: human Epidermal Growth Factor Receptor, or EGFR, is also known as “v-erb-b avian erythroblastic leukemia viral oncogene homolog oncogene ERBB”, “ERBB1”, “HER1”, and “species antigen 7 (sa7)”. Now imagine two names that look alike but mean different things: cdc25 is a Schizosaccharomyces pombe phosphatase and activator of a protein kinase, whereas CDC25 in Saccharomyces cerevisiae is a guanine nucleotide exchange factor for RAS. Finally, imagine genes that change names because they merge or split. These nightmares happen frequently and can lead to great confusion. An excellent solution to these situations is to assign a stable identifier to each gene.

The largest database of biomedical abstracts (PubMed) does not store first names of authors of papers. Many individual researchers have the same initials, and thus these names are ambiguous in the database. It would be an enormous task to disambiguate these names. However, a new project to keep track of unique identifiers for researchers, ORCID (www.orcid.org), might well solve this problem. As humans we would prefer to use whatever symbol we like, and computer programs can often help us translate our symbols into hidden unique identifiers. When there is ambiguity, we might get to choose. For example, if one searches PubMed for “elegans”, the returns include C. elegans the nematode and S. elegans the turtle. Searching for “C. elegans”, you might get C. elegans the flowering plant (Camelia elegans). NLM-NCBI uses unique identifiers for each taxon, so with discipline the user can find specifically the species of interest (Figure [SGF-1267]).
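The synonym problem described above is typically handled by mapping every name to a stable identifier and letting context resolve ambiguity. The Python sketch below shows one way this could look. The identifiers ("GENE:0001" and so on), the record layout and the function names are hypothetical; only the gene symbols and species come from the text.

```python
from collections import defaultdict

# Hypothetical stable identifiers; gene symbols are those mentioned in the text.
RECORDS = {
    "GENE:0001": {
        "species": "Saccharomyces cerevisiae",
        "names": ["SIN3", "YOL004W", "CPE1", "GAM2", "RPD1", "SDI1", "SDS16", "UME4"],
    },
    "GENE:0002": {"species": "Schizosaccharomyces pombe", "names": ["cdc25"]},
    "GENE:0003": {"species": "Saccharomyces cerevisiae", "names": ["CDC25"]},
}

def build_index(records):
    """Map each lower-cased name to the set of identifiers that use it."""
    index = defaultdict(set)
    for gene_id, record in records.items():
        for name in record["names"]:
            index[name.lower()].add(gene_id)
    return index

def resolve(name, index, records, species=None):
    """Return the sorted identifiers matching a name, optionally limited to one species."""
    hits = index.get(name.lower(), set())
    if species is not None:
        hits = {g for g in hits if records[g]["species"] == species}
    return sorted(hits)

index = build_index(RECORDS)
print(resolve("UME4", index, RECORDS))    # one gene, whichever synonym is used
print(resolve("cdc25", index, RECORDS))   # ambiguous without a species
print(resolve("cdc25", index, RECORDS, species="Schizosaccharomyces pombe"))
```

The user types whatever symbol is familiar, and the program either returns a single stable identifier or reports the ambiguity so that the user can choose.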


Figure [SGF-1267]. Examples of NCBI taxon records

Controlled Vocabularies

Many projects have tried to rein in the confusion by using only particular terms. Since the project controls the terms used, they are called Controlled Vocabularies (CVs). CVs are often shown as a drop-down menu on a webform to enforce their use upon data entry into a database. We are all familiar with controlled vocabularies such as a list of countries. One example of a CV used in database curation is the list of allele types specified in a webform by the Mouse Genome Database (Figure [SGF-1322]).

Figure [SGF-1322]. Mouse Genome Informatics CV and webform for types of alleles

Ontologies organize information

Ontologies formalize some types of information. An ontology is a description of the relationships among defined terms. Both the terms and their relationships are defined, and this structured information allows computers to utilize the information effectively. A hierarchy or outline is a simple type of ontology, but is often unable to accurately represent the knowledge. For example, if a cell type is part of two tissues, an outline form would have to have that cell type in two places (Figure [SGF-1315]). Ontologies also allow computers to make rigorous inferences from available data by reasoning over the relationships connecting entities (BOX 332).


Figure [SGF-1315]. Ontology structures: nested hierarchy versus directed acyclic graph. Arrows indicate “part-of” relationship

A nested hierarchy and the equivalent directed acyclic graph (DAG) will have the same number of connections (edges), but the DAG can have fewer nodes (Figure [SGF-1315]). A real example of such a nested hierarchy is the NLM’s Medical Subject Headings (MeSH) (Figure [SGF-1323]).
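The difference between an outline and a DAG, and the kind of inference an ontology supports, can be seen in a small Python sketch. A term is allowed to have more than one "part-of" parent, so a cell type that belongs to two tissues is stored once, and a simple traversal infers everything it is part of. The term names and the function name are hypothetical placeholders, not taken from any real anatomy ontology.

```python
# Each term maps to its direct "part-of" parents. A term may have several
# parents, which a strict outline (nested hierarchy) cannot express without
# listing the term twice.
PART_OF = {
    "cell type X": ["tissue A", "tissue B"],
    "tissue A": ["organ"],
    "tissue B": ["organ"],
    "organ": [],
}

def all_ancestors(term, part_of):
    """Return every term that `term` is directly or indirectly part of."""
    seen = set()
    stack = list(part_of.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(part_of.get(parent, []))
    return seen

print(sorted(all_ancestors("cell type X", PART_OF)))
# -> ['organ', 'tissue A', 'tissue B']
```

Here the DAG has four nodes and four edges. The equivalent outline would need "cell type X" in two places, giving five entries for the same four connections.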

Figure [SGF-1323]. Example from NCBI MeSH. The entries under a term are different in different trees.

Ontologies cover many types of information

As the number of biological databases expands into the thousands, there is increasing use of ontologies to cover all sorts of terms. It is easier to use an existing ontology than to start from scratch, and this reuse is helping to allow cross-database coordination of information. An anatomy ontology defines the parts of an organism and their relationships. For example, a mouse foot is part of a hindlimb. The foot contains parts such as blood vessels, bone and connective tissue (Figure [SGF-1282]).


Figure [SGF-1282]. A small part of the Mouse Anatomy Ontology

A phenotype ontology contains terms that describe the phenotypes and their relationships. For example, Nematode organism behavior variants include body posture variants (Figure [SGF-1284]).

Figure [SGF-1284]. Small part of Worm Phenotype Ontology

The Sequence Ontology describes aspects of sequences and sequence analysis. A region is a span of sequence; a biological region is a region that has some biological relevance; a sequence alteration is a change in the sequence. A deletion is both a biological region and a sequence alteration.

Figure [SGF-1271] Small part of Sequence Ontology.

The Evidence Code Ontology describes experimental evidence of all sorts. Most relevant to this book so far, Figure [SGF-1285] shows a very small part of types of experimental phenotypic evidence.


Figure [SGF-1285]. Small part of the Evidence Code Ontology.

The Gene Ontologies capture some basic information about genes and their products

We would like to transfer information and knowledge about genes and gene products among organisms based on sequence orthology. Orthologous genes, in general, have the same function. This Orthology Hypothesis always has to be tested in detail, but it is highly useful for annotating genomes, or for any genome-scale study where we want to make inferences but cannot immediately test all the specifics. To transfer such knowledge, we need to make the information computable, that is, put it in a standardized form so that computer programs can utilize the mass of information about thousands or tens of thousands of genes. A major effort to standardize information about gene and gene product function is the Gene Ontology Consortium, which continues to develop the Gene Ontologies. This information is organized in three sets of standardized terms with defined relationships among them (an ontology): Cellular Component, Molecular Function, and Biological Process.

Cellular Component. The Cellular Component Ontology describes where in a cell (or outside a cell) gene products are active.

Molecular Function. The Molecular Function Ontology describes the biochemical activity of a gene product.

Biological Process. The Biological Process Ontology describes the biological objective to which a gene product contributes.

Annotation of a gene product to these three categories is a useful first step in having some idea of what the gene product does. For example, knowing that a protein is in the cellular component cytoplasm, has the molecular function of protein kinase activity, and takes part in the biological process of the cell cycle tells us that it is a Cytoplasmic Protein Kinase involved in the Cell Cycle, a concise description of its general function. Consider the human protein calmodulin. There are a number of Cellular Component annotations (Figure [SGF-1305]). The nucleus is an intracellular organelle. So is the centrosome. The cytoplasm is part of a cell, as is the plasma membrane. There are also a number of Molecular Function annotations for calmodulin (Figure [SGF-1306]). Finally, there are many Biological Process annotations for human calmodulin (Figure [SGF-1325]).
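The three-aspect annotation scheme described above maps naturally onto a simple record. The Python sketch below stores one annotation per (gene product, aspect, term) and gathers them into the kind of concise summary used in the protein kinase example. The gene product name "kinaseX", the field names and the function are hypothetical, and the terms are written as plain names for readability where a real database would use GO identifiers.

```python
from dataclasses import dataclass

ASPECTS = ("cellular_component", "molecular_function", "biological_process")

@dataclass(frozen=True)
class Annotation:
    gene_product: str   # hypothetical name, for illustration only
    aspect: str         # one of ASPECTS
    term: str           # term name; a real database would store a GO identifier

def summarize(annotations, gene_product):
    """Group the terms annotated to one gene product by ontology aspect."""
    summary = {aspect: [] for aspect in ASPECTS}
    for a in annotations:
        if a.gene_product == gene_product:
            summary[a.aspect].append(a.term)
    return summary

annotations = [
    Annotation("kinaseX", "cellular_component", "cytoplasm"),
    Annotation("kinaseX", "molecular_function", "protein kinase activity"),
    Annotation("kinaseX", "biological_process", "cell cycle"),
]

for aspect, terms in summarize(annotations, "kinaseX").items():
    print(aspect, terms)
```

Because every annotation has the same shape, the same code works for one protein or for tens of thousands, which is what makes the information computable.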

Figure [SGF-1305] Cellular component annotations for human calmodulin


Figure [SGF-1306]. Molecular function annotations for human and yeast calmodulin.

Figure [SGF-1325]. Some Biological Process annotations for human calmodulin.

Gene Ontologies (GO) include evidence for the associations

Gene Ontology annotations include the type of evidence used for the assertion. The type of evidence is specified using a simple ontology of evidence codes. These are of three main types: experimental, computational and automatic.

Table SGT- Gene Ontology Evidence Codes (partial list)

Experimental Evidence Codes
IDA: Inferred from Direct Assay. Example: in vitro enzyme assay
IPI: Inferred from Physical Interaction. Example: yeast 2-hybrid experiment with Name of Protein
IMP: Inferred from Mutant Phenotype. Example: based on phenotype of mutant
IGI: Inferred from Genetic Interaction. Example: genetic interaction with gene that has this GO term based on experiment
IEP: Inferred from Expression Pattern. Example: regulation implies function

Computational Analysis Evidence Codes
ISO: Inferred from Sequence Orthology. Example: known ortholog has this GO term
ISA: Inferred from Sequence Alignment. Example: alignment of sequence with experimentally characterized gene products
ISM: Inferred from Sequence Model. Example: prediction methods for non-coding RNA genes, e.g., tRNAscan-SE, Snoscan, Rfam
IGC: Inferred from Genomic Context. Example: co-occurrence in an operon
IKR: Inferred from Key Residues. Example: used when gene product lacks key residues

Automatically-assigned Evidence Codes
IEA: Inferred from Electronic Annotation. Example: unreviewed sequence-based matches

Characterization of a genome by GO

The set of gene products in a genome can be characterized by the aggregate of GO annotations. Getting such an overview requires focusing only on high-level terms, which leads to a superficial view, as shown in the example of the top 25 GO terms in the chicken genome (Figure [SGF-1326]). Annotation of a gene directly to Biological Process, Molecular Function or Cellular Component, that is, to the root of the tree, indicates that nothing more specific is known (for example, that the biological process is unknown). The evidence code ND (no data) indicates that there was no data at the time of annotation.

Figure [SGF-1326]. Top Gene Ontology terms assigned to genes in the chicken genome. [generated by PWS in Excel from supplemental GO terms in the chicken genome paper]

SLIMs

For many analyses, trimmed-down versions of the full ontologies are used; these are called GO-SLIMs and contain only high-level terms.

Extending annotations using protein family trees

Protein family analysis can allow rigorous extension of annotation to other species. We are now comfortable with inferring common function from orthology, and the logic can be broken down as follows (Figure [SGF-1073]). IEA annotations automatically assign function based on the presence of domains or on orthology; annotations done systematically across species but checked by expert curators are considered more accurate and thus more useful. For example, a computer program might transfer an eye development annotation from proteins of insects and mice to a protein of nematodes, which do not have eyes! Of course, one can define the taxa that have certain processes and limit automatic extension to these; this is done by UniProt, for example.
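The safeguard mentioned above, limiting automatic extension of an annotation to taxa that actually have the process, can be sketched in a few lines of Python. This is an illustration of the idea only and does not reproduce how UniProt or any other resource implements it. The constraint table, the taxon sets and the function name are all hypothetical.

```python
# Hypothetical constraint table: process -> taxa in which the process is allowed.
# Terms without an entry are treated as unconstrained.
ALLOWED_TAXA = {"eye development": {"Insecta", "Mammalia"}}

def transfer_annotation(term, target_lineage):
    """Return an automatically assigned (IEA) annotation for the target species,
    or None if a taxon constraint forbids the term in that lineage."""
    allowed = ALLOWED_TAXA.get(term)
    if allowed is not None and allowed.isdisjoint(target_lineage):
        return None
    return {"term": term, "evidence": "IEA"}

mouse = {"Metazoa", "Mammalia"}
nematode = {"Metazoa", "Nematoda"}

print(transfer_annotation("eye development", mouse))     # annotation is transferred
print(transfer_annotation("eye development", nematode))  # blocked: None
print(transfer_annotation("cell cycle", nematode))       # unconstrained term
```

The annotation is tagged IEA because no curator has reviewed it, so downstream users can still filter by evidence code.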


Figure [SGF-1073]. PAINT method of extending GO annotations. Blue box, annotated directly from experimental evidence; green box, inferred function of ancestral protein; yellow box, inferred annotation based on descent from ancestor. IDA, inferred from direct assay. ISA, inferred from sequence alignment.

Ontologies are evolving

An annotation such as "cytoplasmic protein kinase involved in the cell cycle" does not tell us enough mechanistic detail to fully appreciate the product: we would also like to know what regulates the protein kinase, what its substrates are, and so forth. The Gene Ontologies are moving towards including more mechanistically relevant relationships. Duck eye lens epsilon-crystallin is identical to duck heart lactate dehydrogenase (LDH) B4. These proteins would be annotated to two molecular functions and two biological processes. A bag of terms with these annotations would be confusing. However, if the molecular function is annotated in the context of a biological process, then this amazing dual use would be clear (Figure [SGF-1331]). The duck gene products are not annotated, but mammalian crystallins do have a similarly bewildering set of annotations.

Figure [SGF-1331]. Extending the GO to connect annotations.

C. Storing and accessing genome annotations


A genome database typically organizes information around a genome sequence

A genome database typically is organized around a genomic sequence. For example, the database might simply store the genome sequence and some description of features of that sequence. Such descriptions are called annotations. However, annotations can get rather complex, with many thousands of specific types and sub-types of features. What types of information do we want? We might want to know what regions of the genome are repetitive, or the extent of each gene. There are a number of ways to organize genomic information. One major way is around the genome sequence, but there are other vantage points, including protein, mRNA, cell (or part of anatomy) and biological process. The genome comprises DNA sequence, which can be connected to gene models by gene predictions and experimental evidence as discussed above (Figure [SGF-1286]). The genome sequence is linked to genetically defined alleles by sequence variation. Alleles confer phenotypes, and this links gene models to biological processes via alleles. Gene models are linked to anatomy by gene expression, and so forth.

Figure [SGF-1286]. Generic relationships among entities in a genomic database.

Data tracks organize information in a genome browser

Since there are many types of data that can be laid onto a genome, different information can be separated into tracks (Figure [SGF-1307]).
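The track idea can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the `Feature` class, the `features_in_view` function, the track names and all coordinates are hypothetical and are not taken from any particular genome browser. Each track is simply a named list of features, and "looking at a region" means asking every track which of its features overlap that region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One annotated interval on a chromosome (1-based, inclusive ends)."""
    chrom: str
    start: int
    end: int
    name: str


def features_in_view(tracks, chrom, start, end):
    """Return, for each track, the features that overlap chrom:start-end."""
    view = {}
    for track_name, features in tracks.items():
        view[track_name] = [
            f for f in features
            # Two intervals overlap if each one starts before the other ends.
            if f.chrom == chrom and f.start <= end and f.end >= start
        ]
    return view


# Hypothetical tracks; every name and coordinate here is made up.
tracks = {
    "gene models": [
        Feature("chrI", 100, 900, "geneA"),
        Feature("chrI", 1500, 2400, "geneB"),
    ],
    "repeats": [Feature("chrI", 950, 1100, "repeat1")],
    "variants": [
        Feature("chrI", 1600, 1600, "snp1"),
        Feature("chrII", 50, 50, "snp2"),
    ],
}

view = features_in_view(tracks, "chrI", 1000, 2000)
for track_name, features in view.items():
    print(track_name, [f.name for f in features])
```

Keeping each data type in its own track means a new kind of measurement can be added as another entry without touching the existing ones.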

Figure [SGF-1307]. Track concept.

Genomic databases have a variety of content

DNA sequence (Chapter 8) is submitted by the producer of the data to one of the large public DNA sequence databases (for example, GenBank at NCBI or the European Nucleotide Archive at EMBL-EBI). The sequence typically includes some annotations.

RNA or cDNA sequence is obtained by submission of the data upon publication.

Primary annotations to the genomic sequence include the extent of clones, genes, and other elements.

Gene expression data relate a gene or specific DNA sequence to a level, time, place or condition of gene expression. They might be derived from direct assay of the mRNA or protein, or from a reporter gene construct. For large-scale gene expression data (see Chapter 11) such as microarray or RNA-seq, the data are submitted to standard databases. For other assays there is no requirement for submission of data upon publication, and most such data are curated by hand.

Genetic variants relative to the reference sequence. There is no perfect way to display variation. One way is to have a reference genome and indicate differences in other individuals or strains. Another way is to have a superset of all genes (pangenome).

Sequence conservation. A quantitative measure of the extent of conservation is calculated from aligning multiple sequences (Chapter ??).

Genetic mapping data (Chapter 9). Genetic map data and the correlation of genetic and physical maps are easily displayed on a genome browser. Map positions for many loci are interpolated from empirically measured distances.

Association of genetic variants to phenotypes (Chapters 3, 4). These are cases where phenotypic variation is identified as caused by, or associated with, a physical change in the DNA sequence.

Figure [SGF-1082]. Part of human RAB3A locus (saved from the UCSC Genome Browser).

Figure [SGF-1084]. Repeated elements.

Results can be tied to reagents to keep track of an inference chain

Imagine an RNAi experiment is done with a particular sequence that is uniquely mappable to the genome and is assigned to a gene, T. We associate T with Phenotype W using this RNAi reagent. Now, new sequencing of cDNA reveals that T, which had been predicted from the genomic sequence and had only partial cDNA support, is split into two genes, T and U. If the RNAi sequence is now in U, then to keep the database straight, someone has to realize this and change the association of the Phenotype to U. However, if the RNAi experiment is associated with a sequence that is continually remapped to the genome, then the Phenotype is automatically associated correctly with gene U.

Many data are curated by humans

In many cases, data are entered into a genomic database after examination and processing by a professional curator. For example, a biologist reads a research paper and extracts information using controlled vocabularies and often an ontology. Such information, in simple terms, consists of statements such as: RNAi of gene A confers phenotype B; gene C is expressed in cell D; and so forth. In rare cases, data are extracted automatically from papers. It would be delightful if researchers entered their own data, but this has not yet caught on. Wikis and similar tools allow community annotation, which can be wonderful if there are enough people and there is some editing. The main problem is that the user does not know whether the information is highly valuable, highly suspect or somewhere in between.

Accessing data

There are a number of ways to access genome-scale data. You can download all the data and deal with it yourself. You can look at webpages that are built to be (one hopes) easy to use. Another way is to query the database via specialized interfaces. Databases seek to keep only a single copy of a datum in one place and have pointers between parts of the database rather than have multiple instances of each.
For example, if a person's address were stored in every place where that person is referred to in the database, then changing the address would mean changing it in all of those places. A normalized database has each datum in one place, thereby facilitating maintenance. On the other hand, for rapid presentation of data on the internet, it is often preferable to have all the information to support a webpage in one place, even if duplicated many times over. Many fast websites rely on denormalized databases that are periodically built from easy-to-maintain normalized databases. The InterMine platform makes use of denormalization and "pre-computed" queries to serve data rapidly over the internet.

EXAMPLE OF COMPLEX QUERIES AND USE CASES.

Suppose you want to find all protein kinases with a sterile phenotype in an organism. This is the type of query that a database allows (Figure [SGF-1327]).
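A minimal sketch of both ideas, normalization and a complex query, is given below using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the table names, column names, gene names and annotations are hypothetical, and this is not the schema or query language of InterMine or of any real model organism database. The normalized tables store each gene name once; the query joins them to find kinases with a sterile phenotype; the last step builds a denormalized, page-ready table from the normalized ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized layout: each gene name is stored once, and the annotation
# tables point back to it by gene_id. All data here are hypothetical.
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE molecular_function (gene_id INTEGER REFERENCES gene(gene_id), term TEXT);
CREATE TABLE phenotype (gene_id INTEGER REFERENCES gene(gene_id), term TEXT);
""")
conn.executemany("INSERT INTO gene VALUES (?, ?)",
                 [(1, "geneA"), (2, "geneB"), (3, "geneC")])
conn.executemany("INSERT INTO molecular_function VALUES (?, ?)",
                 [(1, "protein kinase activity"),
                  (2, "protein kinase activity"),
                  (3, "transcription factor activity")])
conn.executemany("INSERT INTO phenotype VALUES (?, ?)",
                 [(1, "sterile"), (2, "slow growth"), (3, "sterile")])

# Complex query: genes annotated as protein kinases AND with a sterile phenotype.
query = """
SELECT DISTINCT g.name
FROM gene g
JOIN molecular_function mf ON mf.gene_id = g.gene_id
JOIN phenotype p ON p.gene_id = g.gene_id
WHERE mf.term = 'protein kinase activity' AND p.term = 'sterile'
ORDER BY g.name
"""
sterile_kinases = [row[0] for row in conn.execute(query)]
print(sterile_kinases)

# Denormalized "gene page" table, rebuilt from the normalized tables so that
# a web page can be served from a single row without any joins.
conn.execute("""
CREATE TABLE gene_page AS
SELECT g.name AS name,
       (SELECT group_concat(mf.term, '; ') FROM molecular_function mf
         WHERE mf.gene_id = g.gene_id) AS functions,
       (SELECT group_concat(p.term, '; ') FROM phenotype p
         WHERE p.gene_id = g.gene_id) AS phenotypes
FROM gene g
""")
page = conn.execute(
    "SELECT functions, phenotypes FROM gene_page WHERE name = 'geneA'"
).fetchone()
print(page)
```

In this arrangement, corrections are made only in the normalized tables, and the gene_page table is thrown away and rebuilt periodically.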

Figure [SGF-1327]. Complex query of yeast database identifies the five protein serine/threonine kinases that mutate to a sterile phenotype.

Let us return to the question posed at the beginning of this chapter: what genes might be affected by the allele identified in our genetic screen? If we have mapped our mutation between snp1 and snp2, we can then look at a genome browser to learn that genes C, D, E, F, and G are candidates for being affected by our mutation (Figure [SGF-1332]).
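The same lookup can be sketched programmatically. In the example below the gene coordinates and SNP positions are invented for illustration (the chapter gives no coordinates), and `genes_between` is a hypothetical helper, not a function of any genome browser. A gene is treated as a candidate if any part of it lies between the two flanking SNPs.

```python
def genes_between(genes, chrom, snp_a, snp_b):
    """Return names of genes on chrom that overlap the interval between two SNPs."""
    lo, hi = sorted((snp_a, snp_b))
    return [
        name
        for name, (gene_chrom, start, end) in genes.items()
        if gene_chrom == chrom and start <= hi and end >= lo
    ]


# Hypothetical gene models: name -> (chromosome, start, end).
genes = {
    "A": ("chrI", 100, 400),
    "B": ("chrI", 500, 900),
    "C": ("chrI", 1000, 1400),
    "D": ("chrI", 1500, 1900),
    "E": ("chrI", 2000, 2300),
    "F": ("chrI", 2400, 2800),
    "G": ("chrI", 2900, 3300),
    "H": ("chrI", 3500, 3900),
}

# Hypothetical positions of the two flanking SNPs.
snp1, snp2 = 1200, 3000

candidates = genes_between(genes, "chrI", snp1, snp2)
print(candidates)
```

Genes C and G are included because each is only partly inside the interval; whether to keep such partial overlaps is a judgment call that depends on how precisely the SNPs bound the mutation.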

Figure [SGF-1332]. Using information in a database to limit candidate genes.

HIGHLIGHTS

Annotations to a genome indicate the location of interesting features and regions.

Protein-coding genes can be predicted by combinations of features.

Pre-mRNA splicing sites have characteristic features.

mRNA and protein evidence for genes greatly improves the annotations.

Gene models are hypotheses about the structure of a gene.

As annotation of a genome proceeds, gene models often change by refinements, splits and merges.

Non-coding RNA genes can often be predicted.

Some features of proteins such as signal peptides can be predicted.

Protein domains can be recognized by amino acid sequence alignment and evolutionary conservation.

Homologs are genes that are similar by descent from a common ancestor.

Orthologs are homologs that differ only by divergence of the species.

Pseudogenes are apparently non-functional genes.

Gene and protein trees describe the evolutionary relationships among genes and proteins.

Orthology is based on structure, not function, but we infer that orthologs have similar function.

Functional complementation can test whether orthologs share function.

Unique identifiers help keep track of entities.

Controlled vocabularies are strictly limited sets of terms.

An ontology is a defined set of terms with defined relationships.

Ontologies help organize information and allow computational analysis of that information.

Many ontologies are used by biological databases and knowledgebases.

The Gene Ontologies capture basic information about genes and gene products.

Genome databases organize information around DNA sequence.

There are thousands of biological databases.

Databases allow questions to be asked that would be tedious or impossible to ask without them.

References

Wang Z. and Burge C.B. Splicing regulation: From a parts list of regulatory elements to an integrated splicing code. RNA 2008. 14: 802-813.

Ashburner M. et al. Gene Ontology: tool for the unification of biology. Nature Genetics 2000. 25: 25-29. A key reference on the need for the Gene Ontologies. www.geneontology.org

Hunter S. et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Research 2011. doi: 10.1093/nar/gkr948

Model Organism Database (MOD)

organism or group                      database URL
sea urchin                             http://www.spbase.org/SpBase/
cellular slime mold                    http://dictybase.org/
Arabidopsis                            http://www.arabidopsis.org/
human                                  uniprot…..ensemble, ucsc ncbi
Drosophila melanogaster (fruitfly)     http://flybase.org/
and other nematodes                    wormbase.org
yeast budding                          http://www.yeastgenome.org/
yeast fission                          http://www.pombase.org/
Gramene                                http://www.gramene.org/
mouse                                  http://www.informatics.jax.org/
rat                                    http://rgd.mcw.edu/
zebrafish                              http://zfin.org/
frog                                   http://www.xenbase.org/

BOX 332. Types of relations in the Gene Ontology

The ontologies of the Gene Ontology (GO) Consortium are structured as a graph, with terms as nodes and the relations between terms as arrows. The relations between GO terms are categorized and defined. The current relations are: is a (is a subtype of); part of; regulates, negatively regulates and positively regulates. The arrowhead indicates the direction of the relationship; dotted lines represent an inferred relationship, i.e., one that has not been expressly stated by an annotation. See: http://www.geneontology.org/GO.ontology.relations.shtml

Figure [SGF-1257]. Reasoning over relationships in an ontology.
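Reasoning of this kind can be sketched as a small rule table applied repeatedly to the stated relations. The sketch below covers only the is a and part of relations, using the composition rules that is a followed by part of, or part of followed by is a or part of, yields part of, and that is a followed by is a yields is a. The term names are chosen for illustration and are not quoted from the GO files; `infer_relations` is a hypothetical helper, not part of any GO software.

```python
# Composition rules: (relation of A to B, relation of B to C) -> relation of A to C.
# Only is_a and part_of are handled in this sketch.
COMPOSE = {
    ("is_a", "is_a"): "is_a",
    ("is_a", "part_of"): "part_of",
    ("part_of", "is_a"): "part_of",
    ("part_of", "part_of"): "part_of",
}


def infer_relations(asserted):
    """Return the asserted (child, relation, parent) triples plus all inferred ones."""
    known = set(asserted)
    changed = True
    while changed:  # repeat until no new relation can be derived
        changed = False
        for (a, r1, b) in list(known):
            for (b2, r2, c) in list(known):
                if b == b2 and (r1, r2) in COMPOSE:
                    new = (a, COMPOSE[(r1, r2)], c)
                    if new not in known:
                        known.add(new)
                        changed = True
    return known


# Illustrative terms and relations (assumed for this example).
asserted = {
    ("mitochondrial membrane", "part_of", "mitochondrion"),
    ("mitochondrion", "is_a", "organelle"),
    ("organelle", "part_of", "cell"),
}

inferred = infer_relations(asserted) - asserted
for triple in sorted(inferred):
    print(triple)
```

The inferred triples correspond to the dotted lines described in the box: relationships that follow from the stated ones but were never stated directly.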

BOX 331. Logo representation

Figure [SGF-1308]. http://weblogo.berkeley.edu/ creates sequence logos. Uncorrected scores show the information at each position above that of random sequence. With the 8 sequences shown, position 7 has two each of A, C, T and G, so there is no information and thus a zero bit score. Position 1 is invariant and gets a score of two bits (see BOX 8: Bits, Bytes and Bases).
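The bit scores described above can be reproduced with a short calculation. The sketch below assumes the usual uncorrected definition: the score at a position is the maximum possible information (log2 of the alphabet size, so 2 bits for DNA) minus the Shannon entropy of the bases observed at that position. The eight sequences are made up to match the description (position 1 invariant, position 7 with two each of A, C, G and T); they are not the sequences in the figure, and `information_content` is a hypothetical helper rather than WebLogo's own code.

```python
import math
from collections import Counter


def information_content(seqs, alphabet_size=4):
    """Return the uncorrected information content (bits) at each aligned position."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    n = len(seqs)
    bits = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        # Shannon entropy of the observed letter frequencies at this position.
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        bits.append(math.log2(alphabet_size) - entropy)
    return bits


# Eight hypothetical aligned DNA sequences, 7 positions each.
seqs = [
    "ACGTACA", "ACGTACA",
    "ACGTTCC", "ACGTTCC",
    "ACGATGG", "ACGATGG",
    "ACCATGT", "ACCATGT",
]

bits = information_content(seqs)
for position, score in enumerate(bits, start=1):
    print(position, round(score, 3))
```

Calling the function with alphabet_size=20 gives the corresponding scores for protein alignments, where an invariant position scores log2(20), a little over 4.3 bits.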

Logos also work for proteins.

Figure [SGF-1309]. See Table.. for one letter amino acid code.

