Luca Cozzuto @ Bioinformatics Core
http://rfam.sanger.ac.uk/
Non-coding RNA genes encode a functional RNA product rather than a protein.
Families of functional RNAs (biological function: RNA families):
- Involved in protein synthesis: tRNA, rRNA, SRP RNA, tmRNA
- Post-transcriptional modification or DNA replication: snRNA, snoRNA, SmY, scaRNA, gRNA, RNase P, RNase MRP, Y RNA, telomerase RNA
- Regulatory RNAs: aRNA, NAT, crRNA, long ncRNA, miRNA, piRNA, siRNA, tasiRNA, rasiRNA, 7SK
- Parasitic RNA: retrotransposon, viroid, satellite RNA
The majority of functional RNAs fold into stable structures that are essential for their biological activity.
Examples shown: microRNA precursor, tRNA, U2 spliceosomal RNA, part of a riboswitch.
Unlike protein-coding genes, functional RNAs often show no significant sequence similarity but preserve a base-paired secondary structure.
This makes it very difficult to find those genes by sequence similarity alone (e.g. with BLAST or FASTA).
ncRNA_1 AAAAAAGGGGTTTTTT
ncRNA_2 AAATAAGGGGTTATTT
Struct  ((((((....))))))
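The toy alignment above can be checked with a few lines of Python. This is only a sketch, not part of Rfam or Infernal: the helper names parse_pairs and pairs_are_complementary are hypothetical, and only Watson-Crick pairs written in DNA letters (A-T, G-C) are accepted, as in the example.

```python
def parse_pairs(structure):
    """Return (i, j) index pairs for matching parentheses in a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))
    return pairs

# Assumption: only canonical pairs, in DNA letters, as in the slide example.
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}

def pairs_are_complementary(sequence, structure):
    """True if every paired position in the structure holds a Watson-Crick pair."""
    return all((sequence[i], sequence[j]) in WATSON_CRICK
               for i, j in parse_pairs(structure))

ncrna_1 = "AAAAAAGGGGTTTTTT"
ncrna_2 = "AAATAAGGGGTTATTT"
struct = "((((((....))))))"

# Both sequences are compatible with the same structure.
print(pairs_are_complementary(ncrna_1, struct))
print(pairs_are_complementary(ncrna_2, struct))

# Positions where the two sequences differ (0-based).
differing = [i for i, (a, b) in enumerate(zip(ncrna_1, ncrna_2)) if a != b]
print(differing)
```

The two sequences differ at two positions that are paired with each other in the structure: a compensatory change that keeps the base pair intact while lowering sequence identity.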
In the Rfam database, each functional RNA family is represented by a multiple sequence alignment and a covariance model.
The model takes into account both sequence and structure and can be used to scan a genomic sequence to detect new members of the same family.
The Rfam Seed alignment for the U12 minor spliceosomal RNA family.
Only one sequence, up to 10 kb
Search methodology
The query sequence is scanned against a library of Rfam sequences using WU-BLAST, with an E-value threshold of 1.0. Any matches are then scanned against the corresponding covariance model, using the hand-curated threshold for that family.
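The two-step logic can be sketched as follows. This is an illustration only, not the Rfam implementation: the Candidate class, the family names and all scores and thresholds are invented for the example, and the BLAST cutoff is assumed to be inclusive (E-value <= 1.0).

```python
from dataclasses import dataclass

BLAST_EVALUE_CUTOFF = 1.0  # pre-filter threshold quoted in the text

@dataclass
class Candidate:
    family: str          # hypothetical family name
    blast_evalue: float  # E-value from the sequence-only pre-filter
    cm_bit_score: float  # in a real search, computed only if the pre-filter passes

def two_stage_search(candidates, family_thresholds):
    """Keep candidates that pass the BLAST pre-filter and then the family CM threshold."""
    prefiltered = [c for c in candidates if c.blast_evalue <= BLAST_EVALUE_CUTOFF]
    return [c for c in prefiltered
            if c.cm_bit_score >= family_thresholds[c.family]]

# Invented per-family thresholds and candidate matches.
thresholds = {"FAM_A": 40.0, "FAM_B": 55.0}
candidates = [
    Candidate("FAM_A", 1e-5, 62.3),  # passes both steps
    Candidate("FAM_B", 0.5, 30.1),   # passes BLAST, fails the family threshold
    Candidate("FAM_A", 5.0, 70.0),   # good CM score, but lost at the BLAST step
]

hits = two_stage_search(candidates, thresholds)
print([(h.family, h.cm_bit_score) for h in hits])
```

The third candidate shows the cost of the filter: a sequence with a good structural match is never scored against the covariance model if its sequence similarity is too weak.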
Results
Positive hits are reported together with the score, E-value and alignment to the family CM.
Bit score: how well the sequence matches the model. A positive score means the sequence matches the profile model better than the null model of nonhomologous sequences; a negative score means the opposite.
E-value: the expected number of false positives with bit scores at least as high as your hit. The value depends on the size of the database searched.
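A small numerical sketch may help with both definitions. It assumes the usual log-odds form of a bit score (log base 2 of the ratio between the probability under the model and under the null model) and assumes that, for a fixed score, the E-value grows linearly with the size of the search space. The function names and the numbers are hypothetical; Infernal's own E-value calculation is more involved.

```python
import math

def bit_score(p_model, p_null):
    """Log-odds score in bits: positive if the model explains the sequence better."""
    return math.log2(p_model / p_null)

def rescale_evalue(evalue, old_db_size, new_db_size):
    """Assumption: for a fixed score, expected false positives scale with database size."""
    return evalue * new_db_size / old_db_size

# Sequence 16 times more likely under the family model than under the null model.
print(bit_score(2 ** -10, 2 ** -14))

# Sequence 4 times more likely under the null model.
print(bit_score(2 ** -12, 2 ** -10))

# The same hit searched in a database ten times larger.
print(rescale_evalue(0.001, 1_000_000, 10_000_000))
```

The same bit score therefore becomes less significant when the search is run against a larger database.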
I Predicted secondary structure
- “<> [ ] { }” base pairs
- “_” hairpin loop
- “-” interior bulge and loop
- “,” single stranded multifurcation loop
- “:” external single stranded residues
- “.” insertion to the consensus
II Consensus of the query model
III Alignment to the model and scoring system
- “Capital letter” max score
- “: +” score >= 0 for base pairs and single stranded
- “ ” negative score
IV Target sequence
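The structure line of a result can be summarised with a short script. This is a sketch that uses only the symbols listed in the legend; the structure string below is invented for the example and summarise_structure is a hypothetical helper name.

```python
from collections import Counter

# Symbol classes taken from the legend of the structure line.
SYMBOL_CLASSES = {
    "<": "base pair", ">": "base pair",
    "[": "base pair", "]": "base pair",
    "{": "base pair", "}": "base pair",
    "_": "hairpin loop",
    "-": "interior bulge and loop",
    ",": "multifurcation loop",
    ":": "external single stranded",
    ".": "insertion to the consensus",
}

def summarise_structure(structure_line):
    """Count how many positions of the structure line fall in each class."""
    return Counter(SYMBOL_CLASSES.get(symbol, "other") for symbol in structure_line)

# Invented example: a single hairpin with unpaired flanking residues.
example = "::<<<<____>>>>::"
summary = summarise_structure(example)
for name, count in sorted(summary.items()):
    print(name, count)
```

Each base pair contributes two symbols (one opening, one closing), so the number of pairs is half of the "base pair" count.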
Going to the family information
- A summary of the family, taken from Wikipedia, is shown together with the information stored in the database.
- The sequences belonging to the family can be viewed (if there are not too many).
- Both seed and full alignments of members can be displayed.
- The secondary structure can be viewed.
- The tree of genomes containing members of the family can also be browsed.
- If a PDB entry is available, the three-dimensional structure can also be viewed.
- Some publications on the family can be reached.
Problems in searching sequences
- To speed up the search, a filtering step based on a BLAST search is necessary. This decreases the sensitivity in finding true homologues of the functional RNA family.
- The genomes of higher eukaryotes contain many ncRNA-derived pseudogenes and repeats that look like structured functional RNAs.
Gardner PP, et al. Bateman A. Rfam: updates to the RNA families database. Nucleic Acids Res. 2009
Batch search
You can upload a file containing several sequences in FASTA format. A job generally takes 48 hours.
Files must have fewer than 100,000 lines and fewer than 1000 sequences, with a size shorter than 200,000 nucleotides.
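Before uploading, a file can be checked against these limits locally. The following is a minimal sketch: check_batch_fasta is a hypothetical helper, and it assumes that the 200,000 nucleotide limit applies to each sequence rather than to the whole file.

```python
MAX_LINES = 100_000       # file must have fewer lines than this
MAX_SEQUENCES = 1_000     # file must have fewer sequences than this
MAX_SEQ_LENGTH = 200_000  # assumption: limit applies to each sequence

def check_batch_fasta(text):
    """Return a list of problems found in FASTA text; an empty list means it looks fine."""
    problems = []
    lines = text.splitlines()
    if len(lines) >= MAX_LINES:
        problems.append(f"too many lines: {len(lines)}")

    records = []  # one [header, length] entry per sequence
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:], 0])
        elif line and records:
            records[-1][1] += len(line)

    if len(records) >= MAX_SEQUENCES:
        problems.append(f"too many sequences: {len(records)}")
    for header, length in records:
        if length >= MAX_SEQ_LENGTH:
            problems.append(f"sequence too long: {header} ({length} nt)")
    return problems

small = ">seq1\nACGUACGU\n>seq2\nGGGAAACCC\n"
print(check_batch_fasta(small))
```

To check a real file, read it first, for example with check_batch_fasta(open("infile.fna").read()).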
Browsing for genome
Genomes scanned for the presence of Rfam families are listed in the Browse tab.
For each genome, the species, kingdom, number of Rfam families and number of members found within the species (Regions) are reported.
You may install the Infernal program locally; it is available at http://infernal.janelia.org/.
To speed up the search, you may also install the rfam_scan.pl script, available at ftp://ftp.sanger.ac.uk/pub/databases/Rfam/tools/, which relies on BLAST.
Running a complete search for a whole genome.
Typical usage of Infernal:
cmsearch -o output.aln --tabfile output.tab Rfam.cm infile.fna
Typical usage of rfam_scan.pl:
perl rfam_scan.pl -blastdb Rfam.fasta -o outfile.out Rfam.cm infile.fna
Thanks!