Lecture 6. Topics in RNA Bioinformatics
The Chinese University of Hong Kong
CSCI5050 Bioinformatics and Computational Biology
CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015 2
Lecture outline
1. Identification of RNAs
2. Identification of RNA structures, interactions and functions
Last update: 6-Oct-2015
IDENTIFICATION OF RNAS
Part 1
Understanding machine language
• This is what a PDF file looks like when we open it in binary mode (shown as hexadecimal numbers).
• How do we interpret it?
Understanding machine language
[Figure: the PDF content with three elements annotated: version number, language, and number of pages]
Want to know more? Look for a standard called ISO 32000.
Understanding machine language
• We looked for elements that were easy to interpret
  – The meanings of many other parts were not as obvious
  – It would be more complicated if the file were an executable program, which would contain both control and data elements
• In general, we tried to separate the long piece of content into elements/element types, and annotate each of them
  – The meanings of some elements can be determined with the help of other elements (e.g., number of pages)
  – The next (more difficult) step is to understand the relative locations of the different elements and how they interact with each other
Understanding genomic language
• Now, how do we interpret the human genome?
......TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGACACATGCTAGCGCGTCGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTACCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCACCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGC......
Understanding genomic language
• Again, we first look for functional elements
• We focus on genes in this lecture
• Classification:
  – By end-product: protein-coding vs. non-coding
  – By type: mRNAs, tRNAs, miRNAs, lncRNAs, ...
  – Sub-elements at the transcriptional level: whole transcripts, exons, introns, ...
  – Sub-elements at the translational level: 5’UTR, coding sequence, 3’UTR, ...
Structure of a protein-coding gene
Image source: http://www.carolguze.com/text/442-1-humangenome.shtml
Human gene annotation sets
• RefSeq (NCBI, National Center for Biotechnology Information, US National Institutes of Health)
  – Standard for most biologists
• Ensembl (EMBL-EBI, European Molecular Biology Laboratory-European Bioinformatics Institute)
  – Automatic annotation
• Havana (Wellcome Trust Sanger Institute)
• Gencode (ENCODE, Encyclopedia of DNA Elements)
  – Based on latest experimental data
  – Level 1: Experimentally validated
  – Level 2: Manually checked, but without experimental support
  – Level 3: Automatic annotation
• UCSC (University of California at Santa Cruz)
Each has different versions
Comparison of gene annotation sets
Image source: Harrow et al., Genome Research 22(9):1760-1774, (2012)
Comparison of gene annotation sets
Example: p53
[Figure: annotations of the gene from UCSC, Gencode v17, Gencode v14, Gencode v7, RefSeq and Ensembl]
Annotation file formats
• GFF format (from http://genome.ucsc.edu/FAQ/FAQformat.html): tab-delimited. Fields:
  1. seqname - The name of the sequence. Must be a chromosome or scaffold.
  2. source - The program that generated this feature.
  3. feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".
  4. start - The starting position of the feature in the sequence. The first base is numbered 1.
  5. end - The ending position of the feature (inclusive).
  6. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
  7. strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
  8. frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.
  9. group - All lines with the same group are linked together into a single item.
• GTF format: Similar to GFF, except that the group field is replaced by a list of attributes in <name>, <value> pairs
Example
• Gencode v12 GTF file:
chr1 ENSEMBL exon 17021 17055 . - . gene_id "ENSG00000227232.3"; transcript_id "ENST00000430492.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "WASH7P-202"; level 3; havana_gene "OTTHUMG00000000958.1";
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.1"; transcript_id "ENSG00000243485.1"; gene_type "antisense"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "antisense"; transcript_status "NOVEL"; transcript_name "MIR1302-11"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
...
chr1 HAVANA gene 34554 36081 . - . gene_id "ENSG00000237613.2"; transcript_id "ENSG00000237613.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A"; level 2; havana_gene "OTTHUMG00000000960.1";
chr1 HAVANA transcript 34554 36081 . - . gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";
chr1 HAVANA exon 35721 36081 . - . gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";
chr1 HAVANA CDS 35721 35736 . - 0 gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";
chr1 HAVANA start_codon 35734 35736 . - 0 gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";
Key (highlighted fields): annotation set, feature, gene name, transcript type, annotation level
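A GTF record like the ones in the example can be split into its fields with a few lines of code. The following is a minimal sketch, assuming tab-delimited input (the tabs appear as spaces on the slide) and attribute values that contain no semicolons; `parse_gtf_line` is a hypothetical helper name, not part of any Gencode tool.

```python
def parse_gtf_line(line):
    """Split one GTF record into its 8 fixed fields plus an attribute dict.

    Hypothetical helper for illustration only.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields

    attributes = {}
    for item in attr_text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        # Each attribute is a name, a space, then the value (often quoted).
        name, _, value = item.partition(" ")
        # Simplification: a repeated name (e.g. several "tag" entries)
        # keeps only the last value.
        attributes[name] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based
        "end": int(end),      # inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


record = parse_gtf_line(
    'chr1\tHAVANA\texon\t35721\t36081\t.\t-\t.\t'
    'gene_id "ENSG00000237613.2"; gene_name "FAM138A"; level 2;'
)
exon_length = record["end"] - record["start"] + 1
```

Because both coordinates are 1-based and inclusive, the length of the FAM138A exon is `end - start + 1`, which is 361 bases here.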
Gene annotation: The process
• How to discover genes?
  – Experimental (require observed expression):
    • EST (Expressed Sequence Tag) libraries
    • Tiling microarrays
    • RNA sequencing
    • ...
  – Computational:
    • Similarity search
    • Simple features
    • Machine learning
    • Hidden Markov Models
    • ...
Computational gene finding – similarity search
• Find sequences that are similar to annotated genes
  – DNA (blastn)
  – Protein (blastx/tblastx): 6-frame translation
Reading frames (image credit: Wikipedia):
+3  L V R T
+2  T C S Y
+1  N L F V
    5’-AACTTGTTCGTACA-3’
    3’-TTGAACAAGCATGT-5’
-1  K N T C
-2  S T R V
-3  V Q E Y
[Figure: a matrix comparing two sequences, labelled s and r on the slide: GCGTGACTTTCT and ACGTTGCT]
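The 6-frame translation used for protein-level similarity search can be sketched as follows. This is an illustration with the standard genetic code, assuming the input contains only A, C, G and T; the function names are hypothetical. Minus-strand peptides are written in their own N-to-C direction, so they read right-to-left relative to the reading-frame figure.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to peptide strings."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translation("AACTTGTTCGTACA")
```

For the sequence in the figure, frame +1 gives NLFV and frame -1 gives CTNK, which is the figure's "K N T C" read in the opposite direction.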
Computational gene finding – simple features
• Based on sequence information only – “Ab initio gene finding”
  – Open reading frame (ORF)
    • Existence of start and stop codons in-frame and within a reasonable distance
      – More complicated when introns are present
  – Splice junctions
    • Grammar rules or probabilistic models
  – Promoter signals
    • TATA boxes
    • CpG islands
    • ...
  – Codon bias
  – ...
Image source: http://www.blackwellpublishing.com/ridley/a-z/codon_bias.asp
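The open reading frame feature can be illustrated with a small scanner. This is a sketch under simplifying assumptions: forward strand only, no introns, ATG as the only start codon, and a hypothetical `min_codons` threshold standing in for "a reasonable distance".

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and half-open; the stop codon is included.
    min_codons is an illustrative threshold (start and stop codons counted).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


orfs = find_orfs("CCATGAAATTTTAGGG")
```

In the toy sequence the only ORF is ATG AAA TTT TAG, found in the third frame. Handling introns would require combining this with splice junction models, which the sketch does not attempt.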
Combining features
• How to combine the various features?
• Essentially a machine learning problem
  – For each window (e.g., 100-400bp), compute the various features
  – Gather some positive examples (known coding genes)
  – Gather some negative examples (known non-genic regions)
  – Train a statistical model that can tell whether the window (or the middle nucleotide) is likely genic/coding
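The first step, computing features per window, might look like the sketch below. The two features (GC content and the CpG observed/expected ratio) are illustrative choices, not the feature set of any particular gene finder, and the function names and default window sizes are assumptions. The resulting feature vectors, together with positive and negative labels, would then be passed to a classifier of choice, which is not shown.

```python
def window_features(window):
    """Compute two toy sequence features for one window (illustrative only)."""
    window = window.upper()
    n = len(window)
    if n == 0:
        raise ValueError("empty window")
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")
    gc_content = (c + g) / n
    # CpG observed/expected ratio: (#CG * length) / (#C * #G)
    cpg_obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {"gc_content": gc_content, "cpg_obs_exp": cpg_obs_exp}


def sliding_windows(seq, size=100, step=50):
    """Yield (start, window) pairs of full-length windows along the sequence."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


features = [
    (start, window_features(window))
    for start, window in sliding_windows("CGCGAATT" * 25, size=100, step=50)
]
```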
Computational gene finding – machine learning
• GRAIL: Neural network-based method
Image credit: Uberbacher and Mural, PNAS 88(24):11261-11265, (1991)
Fine-grained modeling
• All the above methods have limitations:
  – Similarity search: Only for genes with annotated homologs
  – Simple features: Each feature is weak, and thus can lead to false positives and false negatives
  – Machine learning (in that form): Does not fully utilize information about neighboring positions, and cannot tell precise element boundaries
• Need methods that provide finer-grained modeling of gene structures
Hidden Markov Models (HMMs)
• Hidden Markov Models are statistical models for inferring unobserved information from an observed data sequence
  – Observed data: DNA sequence
  – Unobserved information:
    • State of each nucleotide (exon, intron, etc.)
    • Transition probabilities between states
    • Emission probabilities: E.g., what is the probability of emitting a certain nucleotide in the exon state?
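HMM example
• Suppose you have two coins, one biased and one unbiased. Which coin was used each time if you observed the sequence <T, H, T>?
[Figure: three unknown states (?, ?, ?) above the observations T, H, T]
A possible model:
[Figure: two-state model with states A and B, each emitting H or T; probabilities shown in the figure: 0.5, 0.5, 0.9, 0.1, 0.8, 0.2, 0.5, 0.5, 0.25, 0.75]
A possible run:
States:       B A A
Observations: T H T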
HMM algorithms
• There are algorithms for the following problems:
  – Given a model $\lambda$, compute the likelihood of an observed sequence $O$, $\Pr(O \mid \lambda)$
    • Forward algorithm
    • Backward algorithm
  – Given a model and an observed sequence, determine the most likely state sequence, $\arg\max_Q \Pr(Q \mid O, \lambda)$
    • Viterbi algorithm
  – Given a set of states and a series of observed data sequences, estimate the transition and emission probabilities
    • Baum-Welch algorithm
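The forward and Viterbi algorithms can be sketched on the two-coin example. The parameter values below reuse numbers that appear in the model figure, but which number belongs to which arrow cannot be recovered from the slide text, so the assignment here (A is the fair coin, B the biased one) is an assumption made for illustration. The sketch multiplies raw probabilities; real implementations usually work in log space to avoid underflow on long sequences.

```python
# Assumed model: state A = fair coin, state B = biased coin (hypothetical mapping).
STATES = ("A", "B")
START = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
EMIT = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}


def forward(obs, states, start, trans, emit):
    """Forward algorithm: likelihood of the observations, summed over all state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(obs, states, start, trans, emit):
    """Viterbi algorithm: return (probability, path) of the most likely state path."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Best predecessor for state s at this position.
            prob, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


observations = ["T", "H", "T"]
likelihood = forward(observations, STATES, START, TRANS, EMIT)
best_prob, best_path = viterbi(observations, STATES, START, TRANS, EMIT)
```

Under these assumed parameters the most likely path for <T, H, T> is A, A, A. The run B, A, A shown on the slide is one possible run, not necessarily the most likely one.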
Computational gene finding – HMMs
• GENSCAN:
  – Models both transcription-level (exon/intron) and translation-level (UTR/CDS) elements
  – Positive and negative strands
  – Single-exon vs. multi-exon genes
  – Three different frames
• One type of generalized HMMs (GHMMs): Emission of a sequence instead of a single nucleotide
Image credit: Burge and Karlin, Journal of Molecular Biology 268(1):78-94, (1997)
Computational gene finding – HMMs
• VEIL: Multi-level models
[Figures: the overall model, and the sub-model for an exon and stop codon]
Image credit: Henderson et al., Journal of Computational Biology 4(2):127-141, (1997)
Gene finding in the post-NGS era
• With the invention of RNA-seq, the ability to experimentally discover gene locations has been greatly improved:
  1. Sequence all RNAs
  2. Map them to a reference genome or perform de novo assembly
• Issues:
  – Experimental noise
  – Mapping:
    • Availability of a good reference genome
    • Mapping of split reads and paired-end reads
  – Assembly:
    • Lots of ambiguity
  – Cell/tissue/condition-specific expression
    • Over- and under-representation of certain transcripts
  – Biochemical activity vs. biological function
Split mapping
• TopHat2
Image credit: Kim et al., Genome Biology 14(4):R36, (2013)
Transcript isoforms [Project]
• Given a set of RNA-seq short reads mapped to a gene, determine the transcript isoforms present and their relative abundance
• Cufflinks
Image credit: Trapnell et al., Nature Biotechnology 28(5):511-515, (2010)
Non-coding RNAs (ncRNAs)
• Non-coding RNAs are RNAs that function without being translated into proteins
• Many types:
Type Abbreviation Function
Ribosomal RNA rRNA Translation
Transfer RNA tRNA Translation
Small nuclear RNA snRNA Splicing
Small nucleolar RNA snoRNA Nucleotide modifications
MicroRNA miRNA Gene regulation
Small interfering RNA siRNA Gene regulation
Long non-coding RNA (>200nt) lncRNA Various (mostly unknown)
… … …
Identifying non-coding RNAs [project]
• Some features:
  – Strong evolutionary conservation
  – Strong secondary structure
  – Weak coding potential
  – (For small RNA) Strong signals in RNA-seq experiments that select for small RNA
  – (For non-polyadenylated RNA) Weak signals in RNA-seq experiments that enrich for poly-A RNA
  – ...
Machine learning for identifying ncRNAs
Image credit: Lu, Yip et al., Genome Research 21(2):276-285, (2011)
Identifying long non-coding RNAs
Image credit: Nam and Bartel, Genome Research 22(12):2529-2540, (2012)
Structural models for ncRNA
• Some small RNAs have strong structural features, which can be used to identify them from genomic sequences
[Figures: tRNA and snoRNA structures]
Image sources: http://www.bio.miami.edu/dana/pix/tRNA.jpg, http://lowelab.ucsc.edu/images/CDBox.jpg
Covariance models
Input multiple sequence alignment and consensus structure:
Construction of guide tree from consensus structure:
Image credit: INFERNAL user’s guide
Output CM:
Node Description
MATP Pair
MATL Single strand, left
MATR Single strand, right
BIF Bifurcation
ROOT root
BEGL Begin, left
BEGR Begin, right
END End
Rfam
• For RNA, analogous to Pfam (protein family)
• Mirrors:
  – Wellcome Trust Sanger Institute, UK: http://rfam.sanger.ac.uk/
  – Howard Hughes Medical Institute Janelia Farm Research Campus, USA: http://rfam.janelia.org/
Rfam
• Three classes of families:
  – Non-coding RNA genes
  – Structured cis-regulatory elements
  – Self-splicing RNAs
• Each family provides the following:
  – Covariance models (CMs, slightly more complicated than profile HMMs) (patterns)
  – Multiple sequence alignments (conservation)
    • Seed alignment (one or more experimentally validated examples, possibly with other high-confidence predicted members)
    • Full alignment (based on CMs built with the Infernal software)
  – Consensus secondary structures (conservation)
• Current status:
  – Version 12.0 (July 2014), with 2450 families
Rfam
• If a sequence is queried against a family, a bit score is given to indicate how likely it is to belong to the family as compared to the background: bit-score = $\log_2(P_{CM} / P_{null})$
• Source of secondary structures in Rfam
  – Literature
    • Experimentally validated
    • Predictions
  – Predictions using the WAR software
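The bit score formula can be worked through numerically. The sketch below takes natural-log probabilities as input, since the raw probabilities of long sequences are usually too small to represent directly; the function name and the example probabilities are made up for illustration and are not Infernal output.

```python
import math


def bit_score(log_p_cm, log_p_null):
    """Bit score log2(P_CM / P_null), computed from natural-log probabilities."""
    return (log_p_cm - log_p_null) / math.log(2)


# Hypothetical probabilities: the CM assigns 0.008, the null model 0.001.
score = bit_score(math.log(0.008), math.log(0.001))
```

A score of about 3 bits means the family model explains the sequence about 2^3 = 8 times better than the background; a negative score means the background explains it better.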
Example: RF00005
• Alignment
Image credit: Rfam
Example: RF00005
• Secondary structure
Image credit: Rfam
Pseudogenes
• Pseudogenes are former genes that have lost their ability to code for (the original) protein
• Classification:
  – By mechanism of creation:
    • Non-processed pseudogenes: Mutation (e.g., premature stop codon)
    • Processed pseudogenes: Reverse transcription (missing introns)
  – By copy of gene:
    • Duplicated copy
    • The only copy (unitary pseudogenes)
Identifying pseudogenes
• Look for sequences similar to annotated coding genes or with strong coding potential
• Consider those that cannot produce the corresponding protein
Image credit: Zhang et al., Bioinformatics 22(12):1437-1439, (2006)
Circular RNAs
• Some RNAs take a circular form
  – Due to back-splicing of exons
  – More stable
  – May act as miRNA decoys
Image credit: Wilusz and Sharp, Science 340(6131):440-441, (2013)
Identification of circular RNAs [project]
• Detection of back-splicing
  – Based on genomic location of exon annotation
  – Need to be distinguished from structural variants (SVs)
Image credit: Gao et al., Genome Biology 16(1):4, (2015)
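The idea of detecting back-splicing from exon annotation can be sketched as a coordinate check on a split-read junction. This is a simplified illustration, not the method of the cited paper: the function names are hypothetical, a junction is assumed to be already summarized as a donor position and an acceptor position on a known strand, and nothing here distinguishes a true back-splice from a structural variant.

```python
def is_back_splice(donor, acceptor, strand):
    """True if the acceptor lies upstream of the donor in transcript direction.

    donor: genomic position of the last exonic base before the junction.
    acceptor: genomic position of the first exonic base after the junction.
    """
    if strand == "+":
        return acceptor < donor
    return acceptor > donor


def annotated_back_splice(donor, acceptor, strand, exons):
    """Check that a back-splice junction joins annotated exon boundaries.

    exons: list of (start, end) pairs, 1-based inclusive, all on `strand`.
    """
    if strand == "+":
        donor_sites = {end for _, end in exons}
        acceptor_sites = {start for start, _ in exons}
    else:
        # On the minus strand an exon ends (in transcript order) at its
        # smaller genomic coordinate and begins at its larger one.
        donor_sites = {start for start, _ in exons}
        acceptor_sites = {end for _, end in exons}
    return (
        is_back_splice(donor, acceptor, strand)
        and donor in donor_sites
        and acceptor in acceptor_sites
    )


exons = [(100, 200), (300, 400), (500, 600)]
circular = annotated_back_splice(600, 300, "+", exons)  # exon 3 end joined to exon 2 start
linear = annotated_back_splice(200, 300, "+", exons)    # ordinary exon 1 to exon 2 splice
```

In the example, the junction from the end of the third exon back to the start of the second is flagged, while the ordinary junction between the first and second exons is not.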
IDENTIFICATION OF RNA STRUCTURES, INTERACTIONS AND FUNCTIONS
Part 2
Identification of RNA structures [lecture]
• Sequence-based
  – Sequence conservation/co-conservation
  – Minimizing free energy
  – Partition function: Sample from the probabilistic distribution of structures
• Sequencing-based
  – RNA footprinting
  – High-throughput versions
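To give a feel for sequence-based structure prediction, the sketch below uses a deliberately simplified stand-in for free-energy minimization: a Nussinov-style dynamic program that maximizes the number of base pairs. It is not a thermodynamic model, and real folding programs score stacking and loop energies instead. The function name and the `min_loop` parameter (minimum number of unpaired bases enclosed by a pair) are assumptions for illustration.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence (Nussinov-style)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    # Watson-Crick pairs plus the G-U wobble pair.
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    # dp[i][j] = best pair count for the subsequence seq[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


hairpin_pairs = nussinov_max_pairs("GGGAAAUCC")
```

For the toy hairpin GGGAAAUCC the optimum is three pairs (G-C, G-C and G-U) closing a three-base loop.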
Identification of RNA interactions
• With DNAs
  – Sequence complementarity
• With RNAs
  – Sequence complementarity (more specific)
• With proteins
  – More difficult
  – High-throughput methods [project]
Micro RNAs (miRNAs)
• A miRNA can have its own gene, or can be within the intron of another gene
• After a number of processing steps, the final product is a single-stranded RNA that is part of the RNA-induced silencing complex (RISC)
Image credit: Narayanese at Wikipedia
miRNA targeting
• miRNA triggers mRNA cleavage or translational repression
Image credit: Kelvinsong at Wikipedia
miRNA naming convention
• Pre-miRNA: mir-<number> (e.g., mir-29)
• Mature miRNA: miR-<number> (e.g., miR-29)
• Specifying the species of origin: <species>-miR-<number> (e.g., hsa-miR-29)
• Nearly identical miRNAs: miR-<number><letter> (e.g., miR-29b)
• Pre-miRNAs at different genomic locations that code for 100% identical mature miRNAs: mir-<number>-<number> (e.g., mir-194-1)
• If two mature miRNAs are from different arms of the same pre-miRNA:
  – Standard: miR-<number>-<3p | 5p> (e.g., miR-337-3p)
  – If expression levels are known, the one with the lower expression level: miR-<number>* (e.g., miR-123*)
• Could combine multiple things (e.g., hsa-miR-125a-5p)
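The naming rules can be turned into a small parser. The regular expression below is a sketch that covers only the patterns listed on this slide; the group names, the function name, and the assumption that species prefixes are 3 or 4 lowercase letters are illustrative choices, not an official miRBase grammar.

```python
import re

# Hypothetical pattern covering the slide's naming rules only.
MIRNA_NAME = re.compile(
    r"^(?:(?P<species>[a-z]{3,4})-)?"  # optional species prefix, e.g. hsa
    r"(?P<form>miR|mir)-"              # miR = mature, mir = precursor
    r"(?P<number>\d+)"                 # family number
    r"(?P<letter>[a-z])?"              # nearly identical family member
    r"(?:-(?P<locus>\d+))?"            # distinct genomic locus of the precursor
    r"(?:-(?P<arm>3p|5p))?"            # arm of the precursor hairpin
    r"(?P<star>\*)?$"                  # lower-expressed arm
)


def parse_mirna_name(name):
    """Split a miRNA name into its components; raise ValueError if it does not match."""
    match = MIRNA_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized miRNA name: {name!r}")
    parts = match.groupdict()
    parts["mature"] = parts["form"] == "miR"
    parts["star"] = parts["star"] is not None
    return parts


parsed = parse_mirna_name("hsa-miR-125a-5p")
```

For hsa-miR-125a-5p this yields species hsa, number 125, letter a and arm 5p, with the name marked as a mature miRNA.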
Prediction of miRNA targets
• MicroRNA recognizes and binds specific features on target mRNAs (usually at 3’UTR for animals)
  – In plants, usually near exact match
  – In animals, usually good match at ~6 nucleotide “seed region”
    • Possibly effects from other positions
  – Secondary structure
• No perfect prediction algorithms
  – Poor consistency between predictions by different algorithms
• Number of validated examples is small
  – How many?
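Seed matching, the simplest ingredient of animal target prediction, can be sketched as a search for the reverse complement of the seed in a 3’UTR. The sketch assumes the seed is nucleotides 2 to 7 of the mature miRNA and requires a perfect Watson-Crick match; it ignores the other positions and the secondary structure mentioned above. The function name, parameters and both example sequences are for illustration only.

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_sites(mirna, utr, seed_start=1, seed_len=6):
    """Return 0-based start positions in the UTR that pair perfectly with the seed.

    Both sequences are given 5' to 3'. seed_start=1 and seed_len=6 select
    miRNA nucleotides 2-7 (an assumed seed definition).
    """
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    seed = mirna[seed_start:seed_start + seed_len]
    # The target site is the reverse complement of the seed.
    site = seed.translate(RNA_COMPLEMENT)[::-1]
    positions = []
    pos = utr.find(site)
    while pos != -1:
        positions.append(pos)
        pos = utr.find(site, pos + 1)
    return positions


# Example miRNA and a made-up UTR fragment containing two seed matches.
sites = seed_sites("UGAGGUAGUAGGUUGUAUAGUU", "AAAUACCUCAAGGUACCUCUU")
```

A six-nucleotide match occurs by chance roughly once every 4^6 = 4096 positions, which is one reason seed matching alone yields many false positives and different algorithms add different extra criteria.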
Identification of miRNA targets
Image credit: Thomas et al., Nature Structural & Molecular Biology 17(10):1169-1174, (2010)
Prediction of miRNA targets
• Comparison of some prediction methods:
Image credit: Thomas et al., Nature Structural & Molecular Biology 17(10):1169-1174, (2010)
Experimental validation of miRNA targets
• Classification of evidence based on TarBase 7.0:
  – Not all are direct evidence
Method | Throughput | Intended use
Reporter Genes | Low | Validation of miRNA:UTR (or binding region) interaction
Northern Blotting | Low | Relative effect of miRNA on mRNA levels
qPCR | Low | Quantification of miRNA effect on mRNA levels
Western Blot | Low | Relative assessment of miRNA effect on protein concentration
ELISA | Low | Quantification of miRNA effect on protein concentration
5’ RLM-RACE | Low | Identification of cleaved mRNA targets
Microarrays | High | High-throughput assessment of miRNA effect on mRNA expression
RNA-Seq | High | High-throughput assessment of miRNA effect on mRNA expression
Quantitative Proteomics (e.g., pSILAC) | High | High-throughput assessment of miRNA effects on protein concentration
AGO-IP | High | Identification of enriched transcripts (miRNAs and mRNAs) in AGO immunoprecipitates
HITS-CLIP | High | Sequencing of AGO binding regions on targeted transcripts
PAR-CLIP | High | Sequencing of AGO binding regions on targeted transcripts
CLASH / PAR-CLIP + Ligation | High | Sequencing of AGO binding regions on targeted transcripts. Production of chimeric miRNA:mRNA reads for the identification of interacting pairs.
Biotin miRNA tagging | High/Low | Pull-down of biotin-tagged miRNAs and estimation of bound transcript content using qPCR (Low yield), microarrays (High-throughput) and RNA-Seq (High-throughput)
IMPACT-Seq | High | Pull-down of biotin-tagged miRNAs, identification of interacting pairs and binding regions.
PARE / Degradome-Seq | High | High-throughput identification of cleaved mRNA targets
3Life | High | High-throughput reporter gene assay
miTRAP | High | miRNA trapping by RNA baiting
Table credit: Vlachos et al., Nucleic Acids Research 43(D1):D153-D159, (2015)
MiRNA networks
Image credit: Liu et al., Briefings in Bioinformatics 15(1):1-19, (2014)
MiRNA networks (continued)
Image credit: Liu et al., Briefings in Bioinformatics 15(1):1-19, (2014)
The ceRNA hypothesis
• Competing endogenous RNA: different targets compete for their common targeting miRNAs
Image credit: Salmena et al., Cell 146(3):353-358, (2011)
miRBase
• List of miRNA (families)
  – Latest release: Release 21 (June 2014), 28645 entries
Modeling RNA-RNA interactions
• Computational pipeline:
Image credit: Schmitz et al., Nucleic Acids Research 42(12):7539-7552, (2014)
Protein-RNA interactions
• Experimental methods:
  – Probed by immuno-precipitation (with or without cross-linking), or oligo(dT) pull-down
  – Many types of experiment: RIP-seq, HITS-CLIP, PAR-CLIP, iCLIP, gPAR-CLIP, ...
  – Each type of data has its own properties and biases
    • Need proper data processing and normalization
RNA functions
• By sequence homology
  – Borrowing the known annotated function of another RNA with a similar sequence
• By structure
  – Borrowing the known annotated function of another RNA with a similar structure
• By target
  – Borrowing the known annotated function of the target gene
• De novo annotation
Function of lncRNAs
• More variability, less known
• Four main archetypes:
Image credit: Wang and Chang, Molecular Cell 43(6):904-914, (2011)
Summary
• Identification of RNAs in a genome
  – mRNAs
  – Structured RNAs
  – miRNAs
  – Pseudogenes
  – Long non-coding RNAs
  – Circular RNAs
• Identification of RNA structures, interactions and functions