Date post: | 13-Jan-2016 |
Upload: | alexandrina-shaw |
RNA Structure Prediction
Chapter 16
Primary, Secondary and Tertiary Structures
RNA Structures
Ab Initio
Prediction based on a single RNA sequence
•Search for the RNA structure with the lowest free energy
•Free energy calculated from base pairs, in the order G-C < A-U < G-U < unpaired
•Stacking between aromatic rings (van der Waals interactions) gives rise to cooperativity
•Neighboring loops or bulges impose an unfavorable entropic change
•Find all possible base-pairing interactions
•Calculate the energy of each and choose the lowest-energy configuration
Dot Matrices
•Plot all interactions in a self-alignment plot
•Find diagonals after applying a sliding window
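The sliding-window search for diagonals can be sketched as follows. This is a minimal illustration, not the method of any particular program: the function names, the window size of 4, and the rule that the two paired windows may not overlap are assumptions made for the example.

```python
# Bases that may pair: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def stem_candidates(seq, window=4):
    """Return (i, j) positions where the window starting at i (read left to
    right) can pair base by base with the window ending at j (read right to
    left). Each hit is one anti-diagonal run in the self-comparison plot."""
    hits = []
    n = len(seq)
    for i in range(n):
        for j in range(n):
            # The two windows must not overlap each other.
            if i + window - 1 >= j - window + 1:
                continue
            if all((seq[i + k], seq[j - k]) in PAIRS for k in range(window)):
                hits.append((i, j))
    return hits

hits = stem_candidates("GGGGAAAACCCC", window=4)
print(hits)
```

Each reported pair marks a candidate stem: position i pairs with j, i+1 with j-1, and so on for the length of the window.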
Dynamic Programming
•Find the single optimal match
•Use Watson-Crick and wobble base pairing scores
•Conformations with slightly higher energies may exist without optimal base pairing
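A minimal sketch of the dynamic programming idea, using the classic Nussinov-style recursion. It is a simplified stand-in: it maximizes the number of Watson-Crick and wobble pairs instead of minimizing free energy, and the minimum hairpin loop length of 3 is an assumption chosen for the example.

```python
# Allowed pairs: Watson-Crick plus G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq, requiring at least
    min_loop unpaired bases inside every hairpin loop."""
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    # Fill the table by increasing subsequence length.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])      # i or j left unpaired
            if (seq[i], seq[j]) in PAIRS:
                best = max(best, dp[i + 1][j - 1] + 1)   # i pairs with j
            for k in range(i + 1, j):                    # split into two parts
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]

print(max_pairs("GGGGAAAACCCC"))
```

Energy-based programs replace the pair count with thermodynamic parameters, but the table-filling order is the same.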
Partition Function
Use a probability distribution to generate sub-optimal structures within a given energy range
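One way to picture the probability distribution is as Boltzmann weights over candidate structures. The sketch below assumes the free energies (in kcal/mol) of a handful of structures are already known; the function name, the example energies, and the temperature of 310.15 K are illustrative assumptions, and real partition-function programs sum over all structures recursively and do not enumerate them.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal / (mol * K)

def boltzmann_probabilities(energies_kcal, temperature_k=310.15):
    """Probability of each structure given its free energy, p = exp(-E/RT) / Z."""
    rt = GAS_CONSTANT * temperature_k
    weights = [math.exp(-e / rt) for e in energies_kcal]
    z = sum(weights)  # partition function over the listed structures
    return [w / z for w in weights]

# Hypothetical energies: one optimal and two sub-optimal structures.
probs = boltzmann_probabilities([-5.0, -4.0, -4.0])
print([round(p, 3) for p in probs])
```

Structures within a small energy range of the optimum keep a non-negligible probability, which is why sub-optimal folds are worth reporting.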
Mfold
•http://mfold.bioinfo.rpi.edu/applications/mfold/
•Dynamic programming and thermodynamic calculation
RNAfold
•http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi
•Extends alignment to more than one diagonal in the dot plot to calculate thermodynamic stability of structures
Comparative Approach
Assumption that homologous RNA sequences fold into same structure
Covariation
•Covariant regions in homologous sequences are likely to be base-paired
•Predict consensus structure based on predictions for all aligned sequences
RNAalifold
•http://rna.tbi.univie.ac.at/cgi-bin/RNAalifold.cgi
•Requires prealignment
•Predictions based on covariance and minimum free energy; dynamic programming finds the optimal structure for the entire alignment
Foldalign
•No prealignment required
•http://foldalign.ku.dk/
•Clustal alignment and dynamic programming
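Covariation between two alignment columns is often quantified with mutual information. The sketch below is an illustration of that measure only, not the scoring used by RNAalifold or Foldalign; the function name and the toy alignment are assumptions.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment
    given as a list of equal-length sequences."""
    n = len(alignment)
    col_i = Counter(s[i] for s in alignment)
    col_j = Counter(s[j] for s in alignment)
    joint = Counter((s[i], s[j]) for s in alignment)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi

# Toy alignment: columns 0 and 3 change together and stay complementary,
# columns 1 and 2 are invariant.
toy = ["GAAC", "CAAG", "AAAU", "UAAA"]
print(mutual_information(toy, 0, 3), mutual_information(toy, 1, 2))
```

High mutual information flags column pairs that change together, which is the signature of a conserved base pair; invariant columns score zero even if they are paired.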
Chapter 17
Genome Mapping, Assembly and Comparison
Definitions
Genomics – study of genomes
Structural genomics (genome analysis) – identification of genes, annotation of gene features, comparison of genome structures
Functional genomics – analysis of genome wide gene expression and gene functions
Genome Mapping
Cytological map
•Banding pattern of metaphase chromosomes
•Low resolution (Dustin units)
Genetic map
•Relative positions of genetic markers
•Marker associated with specific genetic trait
•The closer the markers, the lower the probability of separation in a cross-over event and of independent inheritance
Physical map
•Order of clone fragments using a library of radio-labeled probes
Genome Sequencing
Shotgun approach
•Sequence a large number of randomly cloned DNA fragments
•Number of fragments to be sequenced is large, to allow overlaps to reconstruct the entire genome
•Requires no knowledge of physical map
•Typically the equivalent of 6 genome lengths (“6× coverage”) must be sequenced to ensure correct assembly
•Gaps filled in with PCR “chromosome walking” (successive sequencing from primers designed from the last round of sequencing results)
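The 6× figure can be made concrete with the usual Poisson approximation for random shotgun reads, under which the expected fraction of bases left uncovered at coverage c is about e^(-c). The sketch below assumes uniformly random reads of equal length; the function names and the example genome size of 3 Mb are illustrative.

```python
import math

def fraction_uncovered(coverage):
    """Expected fraction of the genome hit by no read (Poisson approximation)."""
    return math.exp(-coverage)

def reads_needed(genome_bp, read_bp, coverage):
    """Number of reads of length read_bp needed to reach the given coverage."""
    return math.ceil(coverage * genome_bp / read_bp)

# Hypothetical 3 Mb genome sequenced with 500 bp reads at 6x coverage.
print(reads_needed(3_000_000, 500, 6))
print(f"{fraction_uncovered(6):.4%}")
```

Under these assumptions roughly a quarter of a percent of bases remain unsequenced at 6×, which is consistent with the need for a gap-filling step.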
Hierarchical approach
•Clone very large fragments (100-300 kb) into Bacterial Artificial Chromosomes (BACs)
•Map BAC inserts by restriction enzyme analysis
•Arrange in order
•Choose the smallest number of BACs that cover the entire genome (“golden tiling path”)
•Sub-clone BAC insert fragments into bacterial vectors and sequence
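Choosing the smallest set of mapped clones that covers a region is an interval-cover problem, for which a greedy scan gives a minimal answer. The sketch below assumes each BAC has already been placed as a (name, start, end) interval on a common coordinate system; the clone names and coordinates are hypothetical.

```python
def minimal_tiling_path(clones, genome_length):
    """Greedy minimal cover of [0, genome_length) by half-open clone intervals.

    clones: list of (name, start, end). Raises ValueError if a gap exists."""
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered = 0
    idx = 0
    while covered < genome_length:
        best = None
        # Among clones starting inside the covered region, keep the one
        # that extends furthest to the right.
        while idx < len(ordered) and ordered[idx][1] <= covered:
            if best is None or ordered[idx][2] > best[2]:
                best = ordered[idx]
            idx += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path

clones = [("A", 0, 150), ("B", 100, 300), ("C", 120, 200),
          ("D", 250, 400), ("E", 290, 420)]
print(minimal_tiling_path(clones, 420))
```

Clones C and D are skipped because another clone already spans their region and reaches further.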
Genome Sequence Assembly
Short sequence 500bp runs → 5-10kb contigs → 30-50kb supercontigs (scaffolds)
Major challenges
•Sequence errors
•Vector DNA contamination (filtering programs)
•Repetitive sequence regions (RepeatMasker)
Dealing with repeats (almost…)
•Forward-reverse constraint
Base calling: Phred
•http://www.phrap.org/
•Fourier analysis to resolve fluorescent traces
•Assigns a base to each peak, together with a probability (quality) score
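The probability score is conventionally reported on the Phred quality scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A short sketch of the conversion in both directions (the function names are illustrative):

```python
import math

def phred_quality(p_error):
    """Phred quality score for an error probability 0 < p_error <= 1."""
    return -10 * math.log10(p_error)

def error_probability(quality):
    """Error probability corresponding to a Phred quality score."""
    return 10 ** (-quality / 10)

for q in (10, 20, 30, 40):
    print(q, error_probability(q))
```

On this scale Q20 corresponds to one expected error in 100 calls and Q30 to one in 1,000.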
Sequence assembly: Phrap
•http://www.phrap.org/
•Takes Phred files as input
•Performs Smith-Waterman local alignment
•Progressively merges sequence pairs from highest to lowest similarity score, removing overlaps
•Outputs contigs
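The progressive merging step can be illustrated with a toy greedy assembler. This is a deliberate simplification and not Phrap itself: it scores pairs by exact suffix-prefix overlap length instead of Smith-Waterman alignment, ignores quality values and reverse complements, and the minimum overlap of 3 bases and the example reads are assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if it is shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap
    until no overlap of at least min_len remains. Returns the contigs."""
    contigs = list(reads)
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        # Join the two sequences, keeping the overlapping bases only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]

print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCGGA"]))
```

Reads that share no sufficient overlap with anything else are returned as separate contigs.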
Base calling and assembly programs
→ Nucleotide sequence
Additional software
VecScreen
•Removes “contaminating” vector DNA sequences from genomes
•http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
•Performs BLAST screen of submitted sequence against the UniVec non-redundant vector database
•Matches are displayed
TIGR Assembler (last updated 2003)
•http://www.jcvi.org/cms/research/software/
•Uses forward-reverse constraints
•Smith-Waterman sequence assembly
ARACHNE
•http://www.broad.mit.edu/wga/
•Gives statistical scores to overlaps
•Corrects errors in multiple overlaps
•Outputs contigs or supercontigs
EULER
•http://nbcr.sdsc.edu/euler/
•Uses an Eulerian path approach (de Bruijn graph), avoiding the traveling-salesman-like search of overlap-based assembly
•Useful for assembly of sequences with repeats
Genome Annotation
•Sequence
•Gene structures (GenScan, FgenesH)
•Predictions verified by BLAST against sequence database, cDNA and EST (GeneWise, Spidey, SIM4, EST2Genome)
•Manually verified by human curators
•Functional assignment of proteins by BLAST searches of protein database
•Further functional description from Pfam and InterPro and literature
Gene Ontology
Uses limited vocabulary to describe
•Cellular components•Biological processes•Molecular functions
Vocabulary arranged in a hierarchical manner from widest to most specific description
GO: “cytochrome c oxidase gene” in Ensembl
Automated Genome Annotation
•Genome data generated at an exponential rate require automatic genome annotation
•Based on homologies
Genequiz
•http://swift.cmbi.kun.nl/swift/genequiz/
•BLAST and FASTA homology searches of databases
•Domain analysis with PROSITE and Blocks databases
•Analysis of secondary and supersecondary structure (e.g., coiled coils)
•All results compiled to produce a summary with an assigned confidence level
Annotation of hypothetical proteins
•In a newly sequenced genome, as many as 40% of proteins are “hypothetical”
To assign function:
•Homology searches in databases
•Search for similar motifs, domains and secondary structures
•Identify conserved functional sites by HMM
•Predict structure with fold recognition or threading
•Assign broad function to protein
•Test assigned function experimentally
How many genes in a genome?
•Total number of human genes ~25,000
•Equivalent to that in mouse
•4× more than Saccharomyces cerevisiae
•It is not the number of cells in an organism that counts, but the number of specialized cell types (tissues) and response conditions
Genome Economy
•One gene → one protein is not true
•EST data suggest >100,000 proteins in humans (from 25,000 genes?)
Alternative splicing
•Joining different exons from a single transcript to form different proteins
Exon shuffling
•Joining exons from different genes
•Drosophila Dscam gene contains 115 exons, 20 of which are constitutively spliced and 95 of which are alternatively spliced
•Can express up to 38,016 different mRNAs by virtue of alternative splicing
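The Dscam figures follow from simple combinatorics if one variant is chosen from each of four clusters of mutually exclusive exons. The cluster sizes used below (12, 48, 33 and 2 alternatives for exons 4, 6, 9 and 17) are not stated in the slide; they are the breakdown commonly reported in the literature and are used here as an assumption.

```python
from math import prod

# Assumed cluster sizes: number of mutually exclusive variants per exon cluster.
clusters = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

alternative_exons = sum(clusters.values())   # exons that are alternatively spliced
possible_mrnas = prod(clusters.values())     # one variant chosen per cluster

print(alternative_exons, possible_mrnas)
```

The sum reproduces the 95 alternatively spliced exons and the product reproduces the 38,016 possible mRNAs.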
Trans-splicing
•Drosophila mdg4 gene
•Joins 4 exons on sense strand and 2 exons on anti-sense strand
•A single transcript encodes dentin phosphoprotein and dentin sialoprotein; the protein is cleaved to form the two different proteins
•Human transcript for Prostate Specific Antigen (PSA) also encodes PSA-LM in the 4th intron
Comparative Genomics
•Compare genomes from different organisms
Whole Genome Alignment
•Extent of genome conservation
•Mechanism of genome evolution
•MUMmer and BLASTZ
•Modified BLAST to align long genome sequences
Finding a minimal genome
•What is the minimum number of genes needed to support a free-living cellular entity?
•Useful to identify genes constituting essential metabolic pathways
Lateral Gene Transfer
•Identify by G-C skew
•GC%
•Codon bias
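GC% and G-C skew are usually computed in windows along the genome, and regions whose values depart from the genome-wide background are candidates for laterally transferred DNA. The sketch below uses the common definition skew = (G − C) / (G + C); the function names, window size and example sequence are illustrative assumptions.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gc_skew(seq):
    """(G - C) / (G + C); 0.0 when the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0

def windowed(seq, window, step, stat):
    """Apply stat to successive windows of the given size and step."""
    return [stat(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]

genome = "GGGGCCCCATATATAT"
print(windowed(genome, 4, 4, gc_content))
print(windowed(genome, 4, 4, gc_skew))
```

In practice the windows span hundreds to thousands of bases, and codon usage is compared against the host genome in the same windowed fashion.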
Gene order comparisons
• Where gene order is conserved between genomes, it is called synteny
• Synteny may indicate functional relationships
• Often indicates physical interaction of proteins
• Genes encoding proteins catalyzing consecutive steps of a metabolic pathway are sometimes ordered – co-regulation as an “operon”?
• MAL cluster in yeast: multigene complex that encodes the MAL23 trans-acting MAL-activator, MAL21 maltose permease, and MAL22 maltase in order on chromosomes 2, 3, 7, 9 and 10