Annotating ncRNAs with Rfam

Luca Cozzuto @ Bioinformatics Core

http://rfam.sanger.ac.uk/

Non-coding RNA genes encode a functional RNA product rather than a protein.

Families of functional RNAs, by biological function:

-  Involved in protein synthesis: tRNA, rRNA, SRP RNA, tmRNA

-  Post-transcriptional modification or DNA replication: snRNA, snoRNA, SmY, scaRNA, gRNA, RNase P, RNase MRP, Y RNA, telomerase RNA

-  Regulatory RNAs: aRNA, NAT, crRNA, long ncRNA, miRNA, piRNA, siRNA, tasiRNA, rasiRNA, 7SK

-  Parasitic RNAs: retrotransposon, viroid, satellite RNA

Most functional RNAs fold into stable structures that are essential for their biological activity.

[Figure: example secondary structures of a microRNA precursor, a tRNA, the U2 spliceosomal RNA, and part of a riboswitch]

Unlike protein-coding genes, functional RNAs often show no significant sequence similarity but preserve a base-paired secondary structure.

This makes it very difficult to find those genes by sequence similarity alone (e.g. with BLAST or FASTA).

ncRNA_1 AAAAAAGGGGTTTTTT

ncRNA_2 AAATAAGGGGTTATTT

Struct  ((((((....))))))
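The toy alignment above can be checked programmatically. The following is a minimal sketch, not part of Rfam or Infernal, and the function names are illustrative. It assumes a plain dot-bracket structure and Watson-Crick pairs written with T, as in the example. It shows that the two sequences differ at two positions, yet both remain compatible with the same structure because the changes compensate each other.

```python
def base_pairs(structure):
    """Return sorted (i, j) index pairs for a dot-bracket string using '(' and ')'."""
    stack, pairs = [], []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


# Watson-Crick pairs only, written with T as in the toy sequences.
COMPLEMENTARY = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def fits_structure(sequence, structure):
    """True if every paired position holds complementary bases."""
    return all((sequence[i], sequence[j]) in COMPLEMENTARY
               for i, j in base_pairs(structure))


ncrna_1 = "AAAAAAGGGGTTTTTT"
ncrna_2 = "AAATAAGGGGTTATTT"
struct = "((((((....))))))"

# Positions (0-based) where the two sequences differ.
differing = [i for i, (a, b) in enumerate(zip(ncrna_1, ncrna_2)) if a != b]

print(differing)                        # [3, 12]
print(fits_structure(ncrna_1, struct))  # True
print(fits_structure(ncrna_2, struct))  # True
```

Positions 3 and 12 form one base pair in the structure, so the A-T pair of ncRNA_1 becomes a T-A pair in ncRNA_2: the sequence changes while the pairing is preserved.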

In the Rfam database, a functional RNA family is represented by a multiple sequence alignment and a covariance model.

The model takes into account both sequence and structure and can be used to scan a genomic sequence to detect new members of the same family.

The Rfam Seed alignment for the U12 minor spliceosomal RNA family.

Only one sequence, up to 10 kb

Search methodology

The query sequence is scanned against a library of Rfam sequences using WU-BLAST, with an E-value threshold of 1.0. Any matches are then scanned against the corresponding covariance model, using the hand-curated threshold for that family.
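The two-stage logic described above can be sketched as follows. This is only an illustration of the filtering idea, not the Rfam server code: the data structures, the function name, the family labels and all numbers in the usage example are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
BLAST_EVALUE_CUTOFF = 1.0  # E-value threshold quoted in the text


def two_stage_search(blast_hits, cm_score, family_thresholds):
    """Keep regions that pass the sequence pre-filter and then the family CM threshold.

    blast_hits: iterable of (family, region, evalue) from the sequence pre-filter.
    cm_score: callable (family, region) -> bit score from the covariance model.
    family_thresholds: dict mapping family -> curated bit-score threshold.
    """
    reported = []
    for family, region, evalue in blast_hits:
        if evalue > BLAST_EVALUE_CUTOFF:
            continue  # rejected by the fast pre-filter, never scored by the CM
        score = cm_score(family, region)
        if score >= family_thresholds[family]:
            reported.append((family, region, score))
    return reported


# Hypothetical values, for illustration only.
hits = [("FAM_A", "1-80", 0.001), ("FAM_A", "200-280", 5.0), ("FAM_B", "300-400", 0.5)]
scores = {("FAM_A", "1-80"): 60.0, ("FAM_B", "300-400"): 20.0}
thresholds = {"FAM_A": 40.0, "FAM_B": 35.0}

result = two_stage_search(hits, lambda fam, reg: scores[(fam, reg)], thresholds)
print(result)  # [('FAM_A', '1-80', 60.0)]
```

The expensive covariance-model step runs only on regions that survive the pre-filter, which is where the speed-up comes from.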

Results

Positive hits are reported together with the bit score, E-value and alignment to the family CM.

Bit score: how well the sequence matches your model. The score reflects whether the sequence matches better to the profile model (positive score) or to the null model of nonhomologous sequences (negative score).

E-value: expected number of false positives with bit scores at least as high as that of your hit. The value depends on the size of the database used for the search.
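A small numerical sketch may help with the two definitions above. It assumes the usual log-odds form of a bit score (base-2 logarithm of the ratio between the probability of the sequence under the model and under the null model) and the usual linear dependence of the E-value on search-space size. The function names and all numbers are illustrative; the actual statistics are computed internally by the search program.

```python
import math


def bit_score(p_model, p_null):
    """Log-odds score in bits: log2 of P(sequence | model) / P(sequence | null)."""
    return math.log2(p_model / p_null)


def rescale_evalue(evalue, old_db_size, new_db_size):
    """Rescale an E-value to another database size, assuming linear scaling."""
    return evalue * new_db_size / old_db_size


# The model explains the sequence better than the null model: positive score.
print(bit_score(0.5, 0.125))   # 2.0
# The null model explains the sequence better: negative score.
print(bit_score(0.125, 0.5))   # -2.0

# Same hit, database 100 times larger: the E-value grows 100-fold.
larger = rescale_evalue(0.01, 1_000_000, 100_000_000)
```

This is why the same bit score can be significant in a small search and insignificant in a genome-scale one.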

I Predicted secondary structure:

-  “<>”, “[ ]”, “{ }”: base pairs
-  “_”: hairpin loop
-  “-”: interior loop and bulge
-  “,”: single-stranded residue in a multifurcation loop
-  “:”: external single-stranded residues
-  “.”: insertion relative to the consensus

II Consensus of the query model

III Alignment to the model and scoring system:

-  Capital letter: maximum score
-  “:” and “+”: score >= 0 for base pairs and single-stranded positions, respectively
-  “ ” (blank): negative score

IV Target sequence
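The structure line (I) can be summarised with a few lines of code. This is a minimal sketch that only uses the symbols listed in the legend above. The function name and the example string are made up for illustration, and any symbol outside the legend is counted as "other".

```python
from collections import Counter

# Symbol classes taken from the legend for the predicted secondary structure line.
SYMBOL_CLASS = {
    "<": "base pair", ">": "base pair",
    "[": "base pair", "]": "base pair",
    "{": "base pair", "}": "base pair",
    "_": "hairpin loop",
    "-": "interior loop or bulge",
    ",": "multifurcation loop",
    ":": "external single stranded",
    ".": "insertion",
}


def summarize_structure(structure_line):
    """Count how many alignment columns fall into each structural class."""
    return Counter(SYMBOL_CLASS.get(symbol, "other") for symbol in structure_line)


# Hypothetical structure line, not taken from a real Rfam hit.
example = "::<<<___>>>,,[[-[[____]]]]::"
summary = summarize_structure(example)
for name, count in sorted(summary.items()):
    print(name, count)
```

Because base pairs are counted per bracket, the number of pairs is half the "base pair" count.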

Going to the family information

-  A summary of the family taken from Wikipedia is shown together with the information stored in the database.

-  The sequences that belong to the family can be viewed (if there are not too many).

-  Both seed and full alignments of the members can be displayed.

-  The secondary structure can be viewed.

-  The tree of genomes containing members of the family can also be browsed.

-  If a PDB entry is available, the three-dimensional structure can also be viewed.

-  Publications about the family are linked.

Problems in searching sequences

-  To speed up the search, a filtering step based on a BLAST search is necessary. This decreases the sensitivity in finding true homologues of the functional RNA family.

-  The genomes of higher eukaryotes contain many ncRNA-derived pseudogenes and repeats that look like structured functional RNAs.

Gardner PP, et al. Bateman A. Rfam: updates to the RNA families database. Nucleic Acids Res. 2009

Batch search

You can upload a file containing several sequences in FASTA format. A job generally takes 48 hours.

Files must have fewer than 100,000 lines and fewer than 1000 sequences with a size shorter than 200,000 nucleotides.
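Before uploading, the file can be checked against these limits locally. The sketch below is not an official Rfam tool, and the function name is made up. It assumes that the 200,000-nucleotide limit applies to each individual sequence, which the wording above does not make explicit.

```python
MAX_LINES = 100_000            # file must have fewer lines than this
MAX_SEQUENCES = 1000           # file must have fewer sequences than this
MAX_SEQUENCE_LENGTH = 200_000  # assumed to be a per-sequence limit


def check_batch_fasta(lines):
    """Return a list of problems found in an iterable of FASTA lines (empty if none)."""
    n_lines = 0
    lengths = []  # one entry per sequence, in file order
    for line in lines:
        n_lines += 1
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # header line starts a new sequence
        elif line and lengths:
            lengths[-1] += len(line)   # residue line of the current sequence

    problems = []
    if n_lines >= MAX_LINES:
        problems.append(f"{n_lines} lines (limit: fewer than {MAX_LINES})")
    if len(lengths) >= MAX_SEQUENCES:
        problems.append(f"{len(lengths)} sequences (limit: fewer than {MAX_SEQUENCES})")
    too_long = sum(1 for length in lengths if length >= MAX_SEQUENCE_LENGTH)
    if too_long:
        problems.append(f"{too_long} sequence(s) of {MAX_SEQUENCE_LENGTH} nt or more")
    return problems


# Usage with a real file (the file name is hypothetical):
# with open("batch.fasta") as handle:
#     print(check_batch_fasta(handle))
print(check_batch_fasta([">seq1", "ACGUACGU", ">seq2", "GGGAAACCC"]))  # []
```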

Browsing by genome

Genomes scanned for the presence of Rfam families are listed in the Browse tab.

The species, kingdom, number of Rfam families and number of members found within the species (Regions) are reported.

You may install the Infernal program locally; it is available at http://infernal.janelia.org/.

To speed up the search, you may also install the rfam_scan.pl script, available at ftp://ftp.sanger.ac.uk/pub/databases/Rfam/tools/, which relies on the BLAST program.

Running a complete search for a whole genome.

Typical usage of Infernal:

cmsearch -o output.aln --tabfile output.tab Rfam.cm infile.fna

Typical usage of rfam_scan.pl:

perl rfam_scan.pl -blastdb Rfam.fasta -o outfile.out Rfam.cm infile.fna
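When several genomes have to be scanned, the two commands can be assembled from a script. The sketch below only builds the argument lists. It assumes the Infernal and rfam_scan.pl versions these slides refer to, with the model file given before the sequence file, and it assumes that rfam_scan.pl takes its output file through a `-o` option. Newer Infernal releases replaced `--tabfile` with `--tblout`, so check `cmsearch -h` for the installed version. The helper names are made up for illustration.

```python
import subprocess


def cmsearch_command(cm_file, fasta_file, alignment_out, table_out):
    """Argument list for a cmsearch run; flags as used on the slide (assumed Infernal 1.0)."""
    return ["cmsearch", "-o", alignment_out, "--tabfile", table_out, cm_file, fasta_file]


def rfam_scan_command(cm_file, fasta_file, blast_db, out_file, script="rfam_scan.pl"):
    """Argument list for rfam_scan.pl; the -o output option is an assumption."""
    return ["perl", script, "-blastdb", blast_db, "-o", out_file, cm_file, fasta_file]


def run(command):
    """Run a command and raise an error if it exits with a non-zero status."""
    subprocess.run(command, check=True)


command = cmsearch_command("Rfam.cm", "infile.fna", "output.aln", "output.tab")
print(" ".join(command))
# To actually launch the search (requires Infernal to be installed):
# run(command)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.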

Thanks!