Lecture 6. Topics in RNA Bioinformatics The Chinese University of Hong Kong CSCI5050 Bioinformatics...

Lecture 6. Topics in RNA Bioinformatics

The Chinese University of Hong Kong
CSCI5050 Bioinformatics and Computational Biology

CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015

Lecture outline
1. Identification of RNAs
2. Identification of RNA structures, interactions and functions

Last update: 6-Oct-2015

IDENTIFICATION OF RNAS

Part 1


Understanding machine language
• This is what a PDF file looks like when we open it in binary mode (shown as hexadecimal numbers).
• How do we interpret it?



Understanding machine language

[Figure: hexadecimal dump of the PDF file with three elements labelled: version number, language and number of pages.]

Want to know more? Look for a standard called ISO 32000.


Understanding machine language
• We looked for elements that were easy to interpret
  – There were many parts whose meanings were not as obvious
  – It would be more complicated if it were an executable program instead, as it would contain both control and data elements
• In general, we tried to separate the long piece of content into elements/element types, and annotate each of them
  – Meanings of some elements can be determined with the help of other elements (e.g., number of pages)
  – Next (more difficult) step is to understand the relative locations of the different elements and how they interact with others



Understanding genomic language
• Now, how do we interpret the human genome?


......TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAACCCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCTGTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGACACATGCTAGCGCGTCGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTACCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCACCGCGCCGGCGCAGGCGCAGAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGC......


Understanding genomic language
• Again, we first look for functional elements
• We focus on genes in this lecture
• Classification:
  – By end-product: protein-coding vs. non-coding
  – By type: mRNAs, tRNAs, miRNAs, lncRNAs, ...
  – Sub-elements at the transcriptional level: whole transcripts, exons, introns, ...
  – Sub-elements at the translational level: 5’UTR, coding sequence, 3’UTR, ...



Structure of a protein-coding gene


Image source: http://www.carolguze.com/text/442-1-humangenome.shtml


Human gene annotation sets
• RefSeq (NCBI, National Center for Biotechnology Information, US National Institutes of Health)
  – Standard for most biologists
• Ensembl (EMBL-EBI, European Molecular Biology Laboratory-European Bioinformatics Institute)
  – Automatic annotation
• Havana (Wellcome Trust Sanger Institute)
• Gencode (ENCODE, Encyclopedia of DNA Elements)
  – Based on latest experimental data
  – Level 1: Experimentally validated
  – Level 2: Manually checked, but without experimental support
  – Level 3: Automatic annotation
• UCSC, University of California at Santa Cruz

Each has different versions



Comparison of gene annotation sets


Image source: Harrow et al., Genome Research 22(9):1760-1774, (2012)


Comparison of gene annotation sets

Example: p53

[Figure: p53 as annotated in the UCSC, Gencode v17, Gencode v14, Gencode v7, RefSeq and Ensembl gene sets.]


Annotation file formats
• GFF format (from http://genome.ucsc.edu/FAQ/FAQformat.html): tab-delimited. Fields:
  1. seqname - The name of the sequence. Must be a chromosome or scaffold.
  2. source - The program that generated this feature.
  3. feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".
  4. start - The starting position of the feature in the sequence. The first base is numbered 1.
  5. end - The ending position of the feature (inclusive).
  6. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
  7. strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
  8. frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.
  9. group - All lines with the same group are linked together into a single item.
• GTF format: Similar to GFF, except that the group field is replaced by a list of attributes in <name>, <value> pairs
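As a rough illustration of the nine fields listed above, the following Python sketch splits one GTF record into its fixed columns and its attribute pairs. The function name `parse_gtf_line` and the returned dictionary layout are choices made for this example only. It assumes tab-delimited input and that no quoted attribute value contains a semicolon; repeated attribute names (such as multiple `tag` entries) would keep only the last value.

```python
def parse_gtf_line(line):
    """Parse one tab-delimited GTF record into a dictionary (illustrative sketch)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields

    # The ninth field is a list of '<name> <value>;' pairs.
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition(" ")
        attributes[name] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based
        "end": int(end),      # inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


# A shortened record in the style of a Gencode GTF exon line.
example = "\t".join([
    "chr1", "HAVANA", "exon", "35721", "36081", ".", "-", ".",
    'gene_id "ENSG00000237613.2"; gene_type "protein_coding"; '
    'gene_name "FAM138A"; level 2;',
])
record = parse_gtf_line(example)
exon_length = record["end"] - record["start"] + 1  # both ends are inclusive
```

Because start and end are 1-based and inclusive, the feature length is `end - start + 1`, which is a common source of off-by-one errors when converting to 0-based formats.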




Example
• Gencode v12 GTF file:

chr1 ENSEMBL exon 17021 17055 . - . gene_id "ENSG00000227232.3"; transcript_id "ENST00000430492.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "WASH7P-202"; level 3; havana_gene "OTTHUMG00000000958.1";
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.1"; transcript_id "ENSG00000243485.1"; gene_type "antisense"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "antisense"; transcript_status "NOVEL"; transcript_name "MIR1302-11"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
...
chr1 HAVANA gene 34554 36081 . - . gene_id "ENSG00000237613.2"; transcript_id "ENSG00000237613.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A"; level 2; havana_gene "OTTHUMG00000000960.1";
chr1 HAVANA transcript 34554 36081 . - . gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";
chr1 HAVANA exon 35721 36081 . - . gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";
chr1 HAVANA CDS 35721 35736 . - 0 gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";
chr1 HAVANA start_codon 35734 35736 . - 0 gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";

Key: Annotation set, Feature, Gene name, Transcript type, Annotation level


Gene annotation: The process
• How to discover genes?
  – Experimental:
    • EST (Expressed Sequence Tag) libraries
    • Tiling microarrays
    • RNA sequencing
    • ...
    (Require observed expression)
  – Computational:
    • Similarity search
    • Simple features
    • Machine learning
    • Hidden Markov Models
    • ...



Computational gene finding – similarity search

• Find sequences that are similar to annotated genes
  – DNA (blastn)
  – Protein (blastx/tblastx): 6-frame translation


[Figure: reading frames. Image credit: Wikipedia]

Six-frame translation example:

+3    L  V  R  T
+2   T  C  S  Y
+1  N  L  F  V
5’-AACTTGTTCGTACA-3’
3’-TTGAACAAGCATGT-5’
-1    K  N  T  C
-2   S  T  R  V
-3  V  Q  E  Y
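The six-frame translation of 5’-AACTTGTTCGTACA-3’ can be reproduced with a short Python sketch. It uses the standard genetic code; the function names are chosen for this example only, and unknown or incomplete codons are written as "X" by assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate a DNA sequence in the three forward and three reverse frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translation("AACTTGTTCGTACA")
# frames["+1"] is "NLFV"; frames["-1"] is "CTNK"
```

The reverse frames are returned in the 5’ to 3’ direction of the reverse strand, so frame -1 reads C T N K here, whereas the slide prints it right to left (K N T C) to line it up under the bottom strand.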


Computational gene finding – simple features

• Based on sequence information only – “Ab initio gene finding”
  – Open reading frame (ORF)
    • Existence of start and stop codons in-frame and within a reasonable distance
      – More complicated when introns are present
  – Splice junctions
    • Grammar rules or probabilistic models
  – Promoter signals
    • TATA boxes
    • CpG islands
    • ...
  – Codon bias
  – ...
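The ORF criterion (an in-frame start and stop codon within a reasonable distance) can be sketched as below. This is a minimal illustration, not a gene finder: it scans the forward strand only, ignores introns, and the `min_codons` threshold is an arbitrary parameter introduced for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    start/end are 0-based, end-exclusive, and the span includes the stop codon.
    min_codons counts the codons from ATG up to (not including) the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


orfs = find_orfs("CCATGAAATTTTAGGG")
# orfs is [(2, 14, 2)], i.e. ATG AAA TTT TAG in the third frame
```

To cover both strands, the same function could be applied to the reverse complement of the sequence, with the coordinates mapped back accordingly.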


Image source: http://www.blackwellpublishing.com/ridley/a-z/codon_bias.asp


Combining features
• How to combine the various features?
• Essentially a machine learning problem
  – For each window (e.g., 100-400bp), compute the various features
  – Gather some positive examples (known coding genes)
  – Gather some negative examples (known non-genic regions)
  – Train a statistical model that can tell whether the window (or the middle nucleotide) is likely genic/coding
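The first step, computing features for each window, might look like the sketch below. The two features shown (GC content and the CpG observed/expected ratio) are picked here only as examples of sequence-derived features; the window size, step and function name are likewise assumptions. The resulting feature rows would then be passed, with positive and negative labels, to whatever statistical model is being trained.

```python
def window_features(seq, size=200, step=50):
    """Compute simple sequence features for sliding windows (illustrative only)."""
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - size + 1, step):
        window = seq[start:start + size]
        c = window.count("C")
        g = window.count("G")
        cpg = window.count("CG")
        gc_content = (c + g) / size
        # Observed CpG count relative to the count expected from C and G frequencies.
        cpg_obs_exp = (cpg * size) / (c * g) if c and g else 0.0
        rows.append(
            {"start": start, "gc_content": gc_content, "cpg_obs_exp": cpg_obs_exp}
        )
    return rows


rows = window_features("CGCGCGCGAT", size=10, step=10)
# one window: gc_content 0.8, cpg_obs_exp 2.5
```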



Computational gene finding – machine learning

• GRAIL: Neural network-based method


Image credit: Uberbacher and Mural, PNAS 88(24):11261-11265, (1991)


Fine-grained modeling
• All the above methods have limitations:
  – Similarity search: Only for genes with annotated homologs
  – Simple features: Each feature is weak, and thus can lead to false positives and false negatives
  – Machine learning (in that form): Does not fully utilize information about neighboring positions, also not able to tell precise element boundaries
• Need methods that provide finer-grained modeling of gene structures



Hidden Markov Models (HMMs)
• Hidden Markov Models are statistical models for modeling unobserved information based on observed data sequence
  – Observed data: DNA sequence
  – Unobserved information:
    • State of each nucleotide (exon, intron, etc.)
    • Transition probability between states
    • Emission probabilities: E.g., what is the probability of emitting a certain nucleotide in the exon state?



HMM example
• Suppose you have two coins, one biased and one unbiased. Which coin was used for each toss if you observed the sequence <T, H, T>?


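[Figure: the observed sequence T, H, T with an unknown coin ("?") behind each toss.]

A possible model:

[Figure: two-state model with states A and B, each emitting H or T. The probabilities shown are 0.5, 0.5, 0.9, 0.1, 0.8, 0.2, 0.5, 0.5, 0.25 and 0.75.]

A possible run:

States:       B A A
Observations: T H T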
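The coin question can be answered with dynamic programming over the hidden states. The sketch below assigns the probabilities to a two-state model in one plausible way (A is the unbiased coin, B is biased towards tails); this assignment is an assumption made for the example, not a reading of the slide's figure. It scores a given run and finds the most probable run with the Viterbi recursion.

```python
# Assumed model parameters (illustrative assignment).
START = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
EMIT = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}


def joint_probability(states, obs):
    """Pr(states, obs | model) for one specific run."""
    p = START[states[0]] * EMIT[states[0]][obs[0]]
    for prev, cur, symbol in zip(states, states[1:], obs[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][symbol]
    return p


def viterbi(obs):
    """Return the most probable state path and its joint probability."""
    states = list(START)
    best = [{s: START[s] * EMIT[s][obs[0]] for s in states}]
    back = []
    for symbol in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this position.
            prev = max(states, key=lambda p: best[-1][p] * TRANS[p][s])
            scores[s] = best[-1][prev] * TRANS[prev][s] * EMIT[s][symbol]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


run_probability = joint_probability(["B", "A", "A"], ["T", "H", "T"])
best_path, best_probability = viterbi(["T", "H", "T"])
```

Under these assumed parameters the run B, A, A has joint probability 0.016875, while the most probable run is A, A, A with probability 0.050625. A different assignment of the probabilities would generally give a different answer.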


HMM algorithms
• There are algorithms for the following problems:
  – Given a model λ, compute data likelihood of observed sequence O, Pr(O|λ)
    • Forward algorithm
    • Backward algorithm
  – Given a model and an observed sequence, determine the most likely state sequence, argmax_Q Pr(Q|O,λ)
    • Viterbi algorithm
  – Given a set of states and a series of observed data sequences, estimate the transition and emission probabilities
    • Baum-Welch algorithm
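For the first problem, the forward algorithm sums over all state paths instead of maximizing. The sketch below computes the likelihood of the coin-toss sequence T, H, T under a two-state model whose parameters are assumed for illustration (A unbiased, B biased towards tails); they are not taken from any figure.

```python
# Assumed model parameters (illustrative only).
START = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
EMIT = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}


def forward_likelihood(obs):
    """Pr(obs | model), summed over all hidden state sequences."""
    states = list(START)
    # alpha[s] = Pr(observations so far, current state = s)
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: sum(alpha[p] * TRANS[p][s] for p in states) * EMIT[s][symbol]
            for s in states
        }
    return sum(alpha.values())


likelihood = forward_likelihood(["T", "H", "T"])
```

With these parameters the likelihood is 0.135625. For long sequences the products underflow, so practical implementations work with log probabilities or rescale alpha at each step.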


Computational gene finding – HMMs

• GENSCAN:
  – Both transcription (exon/intron) and translation (UTR/CDS)
  – Positive and negative strands
  – Single-exon vs. multi-exon genes
  – Three different frames
• One type of generalized HMMs (GHMMs): Emission of a sequence instead of a single nucleotide


Image credit: Burge and Karlin, Journal of Molecular Biology 268(1):78-94, (1997)


Computational gene finding – HMMs

• VEIL: Multi-level models


Overall:

Exon and stop codon:

Image credit: Henderson et al., Journal of Computational Biology 4(2):127-141, (1997)


Gene finding in the post-NGS era
• With the invention of RNA-seq, the ability to experimentally discover gene locations has been greatly improved:
  1. Sequence all RNAs
  2. Map them to reference genome or perform de novo assembly
• Issues:
  – Experimental noise
  – Mapping:
    • Availability of good reference genome
    • Mapping of split reads and paired-end reads
  – Assembly:
    • Lots of ambiguity
  – Cell/tissue/condition-specific expression
    • Over- and under-representation of certain transcripts
  – Biochemical activity vs. biological function



Split mapping
• TopHat2


Image credit: Kim et al., Genome Biology 14(4):R36, (2013)

http://genomebiology.com/2013/14/4/R36/figure/F6


Transcript isoforms [Project]
• Given a set of RNA-seq short reads mapped to a gene, determine the transcript isoforms present and their relative abundance
• Cufflinks
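The abundance part of this problem can be illustrated with a generic expectation-maximization (EM) sketch. This is not the Cufflinks algorithm; it is a simplified model that assumes the candidate isoforms are already known, that each read comes with the set of isoforms it is compatible with, and that isoform lengths can be ignored (so the result is the fraction of reads per isoform). All names are chosen for this example.

```python
def estimate_isoform_fractions(read_compat, isoforms, n_iter=100):
    """Estimate the fraction of reads from each isoform with a simple EM.

    read_compat: list of sets, one per read, naming the compatible isoforms.
    """
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(n_iter):
        # E-step: split each read among its compatible isoforms by current theta.
        counts = {iso: 0.0 for iso in isoforms}
        for compatible in read_compat:
            total = sum(theta[iso] for iso in compatible)
            if total == 0:
                continue
            for iso in compatible:
                counts[iso] += theta[iso] / total
        # M-step: re-estimate fractions from the expected counts.
        n = sum(counts.values())
        theta = {iso: counts[iso] / n for iso in isoforms}
    return theta


# Three reads unique to isoform A, one unique to B, four compatible with both.
reads = [{"A"}] * 3 + [{"B"}] * 1 + [{"A", "B"}] * 4
theta = estimate_isoform_fractions(reads, ["A", "B"])
```

In this toy case the ambiguous reads end up split 3:1 like the unique ones, so the estimate converges to 0.75 for A and 0.25 for B.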


Image credit: Trapnell et al., Nature Biotechnology 28(5):511-515, (2010)


Non-coding RNAs (ncRNAs)
• Non-coding RNAs are RNAs that function without being translated into proteins
• Many types:


Type | Abbreviation | Function
Ribosomal RNA | rRNA | Translation
Transfer RNA | tRNA | Translation
Small nuclear RNA | snRNA | Splicing
Small nucleolar RNA | snoRNA | Nucleotide modifications
MicroRNA | miRNA | Gene regulation
Small interfering RNA | siRNA | Gene regulation
Long non-coding RNA (>200nt) | lncRNA | Various (mostly unknown)
… | … | …


Identifying non-coding RNAs [project]

• Some features:
  – Strong evolutionary conservation
  – Strong secondary structure
  – Weak coding potential
  – (For small RNA) Strong RNA-seq signals selected for small RNA
  – (For non-polyadenylated RNA) Weak RNA-seq signals enriched for poly-A RNA
  – ...



Machine learning for identifying ncRNAs


Image credit: Lu, Yip et al., Genome Research 21(2):276-285, (2011)


Identifying long non-coding RNAs


Image credit: Nam and Bartel, Genome Research 22(12):2529-2540, (2012)


Structural models for ncRNA
• Some small RNAs have strong structural features, which can be used to identify them from genomic sequences


tRNA snoRNA

Image sources: http://www.bio.miami.edu/dana/pix/tRNA.jpg, http://lowelab.ucsc.edu/images/CDBox.jpg


Covariance models


Input multiple sequence alignment and consensus structure:

Construction of guide tree from consensus structure:

Image credit: INFERNAL user’s guide

Output CM:

Node Description

MATP Pair

MATL Single strand, left

MATR Single strand, right

BIF Bifurcation

ROOT root

BEGL Begin, left

BEGR Begin, right

END End


Rfam
• For RNA, analogous to Pfam (protein family)
• Mirrors:
  – Sanger Institute, Wellcome Trust Foundation, UK: http://rfam.sanger.ac.uk/
  – Howard Hughes Medical Institute Janelia Farm Research Campus, USA: http://rfam.janelia.org/


Rfam
• Three classes of families:
  – Non-coding RNA genes
  – Structured cis-regulatory elements
  – Self-splicing RNAs
• Each family provides the following:
  – Covariance models (CMs, slightly more complicated than profile HMMs) (patterns)
  – Multiple sequence alignments (conservation)
    • Seed alignment (one or more experimentally validated examples, possibly with other high-confidence predicted members)
    • Full alignment (based on CMs built from the Infernal software)
  – Consensus secondary structures (conservation)
• Current status:
  – Version 12.0 (July 2014), with 2450 families



Rfam
• If a sequence is queried against a family, a bit score is given to indicate how likely it really belongs to the family as compared to the background: bit-score = log2(P_CM / P_null)
• Source of secondary structures in Rfam
  – Literature
    • Experimentally validated
    • Predictions
  – Predictions using the WAR software
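The bit score is a base-2 log-odds ratio, which a few lines of Python make concrete. The probabilities below are made-up numbers for illustration, and the function name is chosen for this example; real tools work with log probabilities throughout to avoid underflow.

```python
import math


def bit_score(p_cm, p_null):
    """log2 ratio of the sequence probability under the CM vs. the null model."""
    return math.log2(p_cm / p_null)


# A sequence that is 8 times more probable under the family model than under
# the background scores about 3 bits; equal probabilities score 0 bits.
score = bit_score(0.008, 0.001)
neutral = bit_score(0.001, 0.001)
```

Each additional bit therefore corresponds to a doubling of the likelihood ratio in favour of the family model, and negative scores mean the background explains the sequence better.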



Example: RF00005
• Alignment


Image credit: Rfam


Example: RF00005
• Secondary structure


Image credit: Rfam


Pseudogenes
• Pseudogenes are former genes that have lost their ability to code for (the original) protein
• Classification:
  – By mechanism of creation:
    • Non-processed pseudogenes: Mutation (e.g., premature stop codon)
    • Processed pseudogenes: Reverse transcription (missing introns)
  – By copy of gene:
    • Duplicated copy
    • The only copy (unitary pseudogenes)



Identifying pseudogenes
• Look for sequences similar to annotated coding genes or with strong coding potential
• Consider those that cannot produce the corresponding protein
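One simple way a sequence can fail to produce the corresponding protein is a premature in-frame stop codon. The sketch below checks for that single disabling feature, assuming the candidate sequence is already aligned in frame with the parent coding sequence; real pipelines also look for frameshifts and other disruptions. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds):
    """Return the codon index of the first in-frame stop before the last codon.

    Returns None if no premature stop codon is found.
    """
    cds = cds.upper()
    n_codons = len(cds) // 3
    for index in range(n_codons - 1):  # the final codon may be a normal stop
        if cds[3 * index:3 * index + 3] in STOP_CODONS:
            return index
    return None


intact = first_premature_stop("ATGAAACCCTAA")       # ATG AAA CCC TAA
disabled = first_premature_stop("ATGAAATAGCCCTAA")  # ATG AAA TAG CCC TAA
```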


Image credit: Zhang et al., Bioinformatics 22(12):1437-1439, (2006)


Circular RNAs
• Some RNAs take a circular form
  – Due to back-splicing of exons
  – More stable
  – May act as miRNA decoys


Image credit: Wilusz and Sharp, Science 340(6131):440-441, (2013)


Identification of circular RNAs [project]

• Detection of back-splicing
  – Based on genomic location of exon annotation
  – Need to be distinguished from SVs
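The core signal of back-splicing is a split read whose two parts map in the "wrong" genomic order. The sketch below tests only that ordering. It assumes each read part is given as a hypothetical `Segment` tuple with 0-based, end-exclusive coordinates, that the parts are listed in transcript 5’ to 3’ order, and that the strand is that of the transcript. It does not attempt to separate true back-splicing from structural variants (SVs) or mapping artifacts.

```python
from collections import namedtuple

# Hypothetical representation of one mapped part of a split read.
Segment = namedtuple("Segment", ["chrom", "start", "end", "strand"])


def is_back_splice(first, second):
    """True if the second read part maps upstream (in transcript terms) of the first."""
    if first.chrom != second.chrom or first.strand != second.strand:
        return False
    if first.strand == "+":
        # Linear splicing on '+' would place the second part at higher coordinates.
        return second.end <= first.start
    # On '-', transcript order runs towards lower genomic coordinates.
    return second.start >= first.end


linear = is_back_splice(
    Segment("chr1", 1000, 1050, "+"), Segment("chr1", 2000, 2050, "+")
)
circular = is_back_splice(
    Segment("chr1", 2000, 2050, "+"), Segment("chr1", 1000, 1050, "+")
)
```

Comparing the two breakpoints with annotated exon boundaries, as the slide suggests, is a natural next filter on candidates found this way.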


Image credit: Gao et al., Genome Biology 16(1):4, (2015)

IDENTIFICATION OF RNA STRUCTURES, INTERACTIONS AND FUNCTIONS

Part 2


Identification of RNA structures [lecture]

• Sequence-based
  – Sequence conservation/co-conservation
  – Minimizing free energy
  – Partition function: Sample from the probabilistic distribution of structures
• Sequencing-based
  – RNA footprinting
  – High-throughput versions
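As a simplified stand-in for free-energy minimization, the classic Nussinov-style dynamic program maximizes the number of base pairs instead of minimizing an energy. The sketch below is that simplification, not a thermodynamic folding method; allowing G-U wobble pairs and requiring at least three unpaired bases in a hairpin loop are assumptions of this example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style recursion)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


pairs = max_base_pairs("GGGAAAUCC")
# pairs is 3: G-C, G-C and G-U closing a three-base AAA loop
```

Energy-based methods follow the same recursive decomposition but score stacks and loops with measured thermodynamic parameters rather than counting pairs.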



Identification of RNA interactions
• With DNAs
  – Sequence complementarity
• With RNAs
  – Sequence complementarity (more specific)
• With proteins
  – More difficult
  – High-throughput methods [project]



Micro RNAs (miRNAs)
• A miRNA can have its own gene, or can be within the intron of another gene
• After a number of processing steps, it finally becomes a single-stranded RNA that is part of the RNA-induced silencing complex (RISC)


Image credit: Narayanese at Wikipedia


miRNA targeting
• miRNA triggers mRNA cleavage or translational repression


Image credit: Kelvinsong at Wikipedia


miRNA naming convention
• Pre-miRNA: mir-<number> (e.g., mir-29)
• Mature miRNA: miR-<number> (e.g., miR-29)
• Specifying the species of origin: <species>-miR-<number> (e.g., hsa-miR-29)
• Nearly identical miRNAs: miR-<number><letter> (e.g., miR-29b)
• Pre-miRNAs at different genomic locations that code for 100% identical mature miRNAs: mir-<number>-<number> (e.g., mir-194-1)
• If two mature miRNAs are from different arms of the same pre-miRNA:
  – Standard: miR-<number>-<3p | 5p> (e.g., miR-337-3p)
  – If expression levels are known, the one with the lower expression level: miR-<number>* (e.g., miR-123*)
• Could combine multiple things (e.g., hsa-miR-125a-5p)
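The naming rules above are regular enough to be captured by a regular expression. The pattern below is a sketch written for this example: the group names and the three-to-four lowercase letter species prefix are assumptions, and it covers only the forms listed above, not every name found in practice.

```python
import re

MIRNA_NAME = re.compile(
    r"(?:(?P<species>[a-z]{3,4})-)?"  # optional species prefix, e.g. hsa
    r"(?P<kind>miR|mir)-"             # miR = mature, mir = precursor
    r"(?P<number>\d+)"
    r"(?P<letter>[a-z])?"             # nearly identical family members
    r"(?:-(?P<copy>\d+))?"            # distinct loci, identical mature product
    r"(?:-(?P<arm>[35]p))?"           # arm of the precursor
    r"(?P<star>\*)?"                  # less-expressed arm
)


def parse_mirna_name(name):
    """Split a miRNA name into its components, or return None if it does not fit."""
    match = MIRNA_NAME.fullmatch(name)
    return match.groupdict() if match else None


combined = parse_mirna_name("hsa-miR-125a-5p")
locus = parse_mirna_name("mir-194-1")
star = parse_mirna_name("miR-123*")
```

For `hsa-miR-125a-5p` this yields species `hsa`, kind `miR`, number `125`, letter `a` and arm `5p`, with the unused parts set to `None`.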



Prediction of miRNA targets
• MicroRNA recognizes and binds specific features on target mRNAs (usually at 3’UTR for animals)
  – In plants, usually near exact match
  – In animals, usually good match at ~6 nucleotide “seed region”
    • Possibly effects from other positions
  – Secondary structure
• No perfect prediction algorithms
  – Poor consistency between predictions by different algorithms
• Number of validated examples is small
  – How many?
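The animal "seed region" rule can be turned into a naive scanner: take the seed from the miRNA, reverse-complement it, and look for exact matches in the 3’UTR. The sketch below assumes the seed is nucleotides 2-7 of the miRNA (a common convention, adopted here as an assumption) and uses made-up sequences. It ignores the other factors listed above, which is one reason such simple predictions have many false positives.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_sites(mirna, utr, seed_start=1, seed_len=6):
    """Return 0-based UTR positions that perfectly match the miRNA seed.

    seed_start is 0-based, so the defaults select miRNA nucleotides 2-7.
    """
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    seed = mirna[seed_start:seed_start + seed_len]
    site = seed.translate(COMPLEMENT)[::-1]  # sequence expected on the mRNA
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)
    return hits


# Illustrative sequences only.
hits = seed_sites("UGAGGUAGUAGGUUGUAUAGUU", "AAACUACCUCAGGG")
# seed GAGGUA pairs with UACCUC, found at position 4 of the toy UTR
```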



Identification of miRNA targets


Image credit: Thomas et al., Nature Structural & Molecular Biology 17(10):1169-1174, (2010)


Prediction of miRNA targets
• Comparison of some prediction methods:


Image credit: Thomas et al., Nature Structural & Molecular Biology 17(10):1169-1174, (2010)


Experimental validation of miRNA targets

• Classification of evidence based on TarBase 7.0:
  – Not all are direct evidence

Method | Throughput | Intended use
Reporter Genes | Low | Validation of miRNA:UTR (or binding region) interaction
Northern Blotting | Low | Relative effect of miRNA on mRNA levels
qPCR | Low | Quantification of miRNA effect on mRNA levels
Western Blot | Low | Relative assessment of miRNA effect on protein concentration
ELISA | Low | Quantification of miRNA effect on protein concentration
5’ RLM-RACE | Low | Identification of cleaved mRNA targets
Microarrays | High | High-throughput assessment of miRNA effect on mRNA expression
RNA-Seq | High | High-throughput assessment of miRNA effect on mRNA expression
Quantitative Proteomics (e.g., pSILAC) | High | High-throughput assessment of miRNA effects on protein concentration
AGO-IP | High | Identification of enriched transcripts (miRNAs and mRNAs) in AGO immunoprecipitates
HITS-CLIP | High | Sequencing of AGO binding regions on targeted transcripts
PAR-CLIP | High | Sequencing of AGO binding regions on targeted transcripts
CLASH / PAR-CLIP + Ligation | High | Sequencing of AGO binding regions on targeted transcripts. Production of chimeric miRNA:mRNA reads for the identification of interacting pairs.
Biotin miRNA tagging | High/Low | Pull-down of biotin-tagged miRNAs and estimation of bound transcript content using qPCR (Low yield), microarrays (High-throughput) and RNA-Seq (High-throughput)
IMPACT-Seq | High | Pull-down of biotin-tagged miRNAs, identification of interacting pairs and binding regions.
PARE / Degradome-Seq | High | High-throughput identification of cleaved mRNA targets
3Life | High | High-throughput reporter gene assay
miTRAP | High | miRNA trapping by RNA baiting

Table credit: Vlachos et al., Nucleic Acids Research 43(D1):D153-D159, (2015)


MiRNA networks


Image credit: Liu et al., Briefings in Bioinformatics 15(1):1-19, (2014)


MiRNA networks


Image credit: Liu et al., Briefings in Bioinformatics 15(1):1-19, (2014)


The ceRNA hypothesis
• Competing endogenous RNA: different targets compete for their common targeting miRNAs


Image credit: Salmena et al., Cell 146(3):353-358, (2011)


miRBase
• List of miRNA (families)
  – Latest release: Release 21 (June 2014), 28645 entries



Modeling RNA-RNA interactions
• Computational pipeline:


Image credit: Schmitz et al., Nucleic Acids Research 42(12):7539-7552, (2014)


Protein-RNA interactions
• Experimental methods:
  – Probed by immuno-precipitation (with or without cross-linking), or oligo(dT) pull-down
  – Many types of experiment: RIP-seq, HITS-CLIP, PAR-CLIP, iCLIP, gPAR-CLIP, ...
  – Each type of data has its own properties and biases
    • Need proper data processing and normalization



RNA functions
• By sequence homology
  – Borrowing the known annotated function of another RNA with a similar sequence
• By structure
  – Borrowing the known annotated function of another RNA with a similar structure
• By target
  – Borrowing the known annotated function of the target gene
• De novo annotation



Function of lncRNAs
• More variability, less known
• Four main archetypes:


Image credit: Wang and Chang, Molecular Cell 43(6):904-914, (2011)


Summary
• Identification of RNAs in a genome
  – mRNAs
  – Structured RNAs
  – miRNAs
  – Pseudogenes
  – Long non-coding RNAs
  – Circular RNAs
• Identification of RNA structures, interactions and functions

