Operon Prediction
Cao Fan
Operon
• A functioning unit of genomic material containing a cluster of genes under the control of a single regulatory signal or promoter
• Exists primarily in prokaryotes, also found in eukaryotes
Approaches – wet lab
• Demonstrate co-transcription of the candidate gene cluster via RT-PCR of whole cell RNA
• Reverse transcribe a specific RNA into a cDNA using a gene specific primer
• Amplify the cDNA via PCR using primers designed from genes within the gene cluster
• Successful PCR amplification indicates that the genes are members of an operon
Maritza Guacucano, Gloria Levican, David S. Holmes, Eugenia Jedlicki. An RT-PCR artifact in the characterization of bacterial operons. http://www.ejbiotechnology.info/content/vol3/issue3/full/5/index.html
Approaches – dry lab
Features used:
• Intergenic distance (IG)
• Conserved gene clusters (CG)
• Functional relations (FR)
• Experimental evidence (EE)
• Sequence-based features (SF)
• Phylogenetic profiles (PP)
Intergenic distance
• IG(contiguous genes, same operon) < IG(contiguous genes, different operons)
• The most widely used parameter for operon prediction
• Best single predictor
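As a rough illustration of the IG feature, the sketch below computes the intergenic distance between each pair of contiguous genes. The `Gene` record, the 1-based inclusive coordinates, and the example genes are assumptions made for illustration only; they do not come from any particular annotation format.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def intergenic_distances(genes):
    """Return (left gene, right gene, gap in bp) for each contiguous gene pair.

    A negative gap means the two genes overlap.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    return [(a.name, b.name, b.start - a.end - 1)
            for a, b in zip(ordered, ordered[1:])]


# Hypothetical genes: A and B are close together, C is far downstream.
example = [Gene("A", 1, 100, "+"), Gene("B", 105, 200, "+"), Gene("C", 500, 800, "+")]
distances = intergenic_distances(example)
```

Under the inequality above, the short A/B gap would make that pair a better same-operon candidate than the B/C pair.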
Conserved gene clusters
• Genes in an operon tend to be preserved across phylogenetically related organisms
• Order of genes in an operon may not be conserved
• Sequence comparison between non-redundant genomes is usually performed to identify conserved clusters
Functional relations
• Genes in the same operon tend to encode functionally related proteins
• E.g. members of the same protein complex, enzymes part of a single metabolic pathway
Functional relations
Functional classifications:
• Riley’s functional annotation
• Metabolic pathways
• Clusters of orthologous groups of proteins (COG)
• Gene ontologies (GO)
Sequence-based features
• Overrepresented sequence motifs and other sequence elements, such as promoters and terminators, are used
• Gene length ratio is also used; the ratio has been shown to be genome specific
Phylogenetic profiles
• Indicate a general trend for a set of genes to be simultaneously present or absent in related organisms
• PP is shown to be genome specific
Features (figure: feature combinations used by operon prediction methods; surviving labels: IG only; CG only; IG, SF, EE; SF)
Rutger W.W. Brouwer, Oscar P. Kuipers and Sacha A.F.T. van Hijum. The relative value of operon predictions. Briefings in Bioinformatics 2008
Using both genome-specific and general genomic information
• Phuongan Dam, Victor Olman, Kyle Harris, Zhengchang Su and Ying Xu
• Features used:
– Intergenic distance
– Neighborhood conservation
– Phylogenetic distance
– Short DNA motifs
– Similarity score between GO terms
– Length ratio
Prediction of operons in microbial genomes
• by Maria D. Ermolaeva, Owen White and Steven L. Salzberg
• Features:
– Conserved gene clusters
• Scoring method:
– Log-likelihood scores
Prediction of operons in microbial genomes
• Gene pair: two adjacent genes separated by ≤200 bp
• Conserved gene pair: two adjacent genes (A,B) for which a homologous gene pair (A’,B’) can be found in another genome.
• Similarity(A,B) < Similarity(B,B’) and Similarity(A,B) < Similarity(A,A’)
• Use BLASTP to find homologs
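The two definitions above can be sketched as simple predicates. The function names are hypothetical, and the similarity values are assumed to be any scores where larger means more similar (for example BLASTP bit scores); the slides do not say which score is used.

```python
def is_gene_pair(gap_bp, max_gap=200):
    """Two adjacent genes form a gene pair if separated by at most max_gap bp."""
    return gap_bp <= max_gap


def is_conserved_pair(sim_ab, sim_a_a2, sim_b_b2):
    """Check the similarity condition for a conserved gene pair.

    sim_ab   : similarity between A and B (the two genes of the pair)
    sim_a_a2 : similarity between A and its homolog A' in another genome
    sim_b_b2 : similarity between B and its homolog B' in another genome

    A and B must each be more similar to their own homolog than to each other.
    """
    return sim_ab < sim_b_b2 and sim_ab < sim_a_a2


# Hypothetical scores: A and B are dissimilar to each other, but each has a strong homolog.
conserved = is_gene_pair(35) and is_conserved_pair(sim_ab=20.0, sim_a_a2=310.0, sim_b_b2=275.0)
```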
Prediction of operons in microbial genomes
• S pair: genes in the pair on the same strand
• D pair: genes in the pair on different strands
• SO pair: gene pair belonging to the same operon
• SN pair: gene pair belonging to different operons
• Directon: a maximal set of adjacent genes located on the same DNA strand
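A minimal sketch of how S pairs, D pairs, and directons could be derived from the strand of each gene in genome order. It treats the genome as linear and ignores the ≤200 bp requirement for gene pairs; both simplifications, and the function names, are assumptions for illustration.

```python
from itertools import groupby


def classify_pairs(strands):
    """Label each adjacent pair of genes 'S' (same strand) or 'D' (different strands)."""
    return ["S" if a == b else "D" for a, b in zip(strands, strands[1:])]


def directons(strands):
    """Return (strand, number of genes) for each maximal same-strand run of genes."""
    return [(strand, len(list(run))) for strand, run in groupby(strands)]


# Hypothetical strand order of six consecutive genes.
strands = list("++-+++")
pairs = classify_pairs(strands)
runs = directons(strands)
```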
Prediction of operons in microbial genomes
• Probability of a conserved S pair being an SO pair:
P = 1 – P[SN|(conserved, S)] – Pchance
• By Bayes' rule:
P[SN|(conserved, S)] = P[conserved|SN] · P(SN|S) / P[conserved|S]
Prediction of operons in microbial genomes
Calculate P(SN|S):
• Assumption: orientation of operons is random
• N(operons) = 2N(directons)
• N(SN pairs) = N(operons) – N(adjacent, non-pairs) – N(D pairs)
= 2N(directons) – (N(genes) – N(pairs)) – N(D pairs)
= 2N(directons) + N(S pairs) – N(genes)
• P(SN|S) = N(SN pairs) / N(S pairs)
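The counting argument above reduces to a one-line estimate. The sketch below applies it to made-up counts; the function name and the numbers are hypothetical and are not taken from any genome.

```python
def p_sn_given_s(n_directons, n_s_pairs, n_genes):
    """Estimate P(SN|S) from genome-wide counts.

    Uses N(SN pairs) = 2*N(directons) + N(S pairs) - N(genes),
    which relies on the assumption that operon orientation is random.
    """
    if n_s_pairs <= 0:
        raise ValueError("need at least one S pair")
    n_sn_pairs = 2 * n_directons + n_s_pairs - n_genes
    return n_sn_pairs / n_s_pairs


# Hypothetical counts: 4000 genes, 600 directons, 3000 S pairs.
# N(SN pairs) = 1200 + 3000 - 4000 = 200
estimate = p_sn_given_s(n_directons=600, n_s_pairs=3000, n_genes=4000)
```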
Prediction of operons in microbial genomes
Calculating Pchance:
Pchance = (0.1G / N(conserved S))^h
G is the number of genomes searched; h is the number of genomes in which homologs for a given gene are found
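A small sketch of the Pchance term, reading the trailing h in the formula as an exponent (an assumption about the original slide layout). The function name and the input values are hypothetical.

```python
def p_chance(n_genomes, n_conserved_s, h):
    """Compute Pchance = (0.1 * G / N(conserved S)) ** h.

    n_genomes     : G, number of genomes searched
    n_conserved_s : N(conserved S), number of conserved S pairs
    h             : number of genomes in which homologs were found
    """
    return (0.1 * n_genomes / n_conserved_s) ** h


# Hypothetical values: 34 genomes searched, 340 conserved S pairs, homologs in 2 genomes.
chance = p_chance(n_genomes=34, n_conserved_s=340, h=2)
```

Under this reading, when the base is below 1 the term shrinks as h grows, so a pair conserved in more genomes is penalized less.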
Prediction of operons in microbial genomes
• Result: 7699 gene pairs in 34 bacterial genomes whose genes belong to the same operon with probability ≥ 0.98
• Sensitivity: 30%–50%
OperonDB
• Gene pair: co-linear genes, which may be separated by other genes with the same orientation
• Modified probability estimation with integration of intergenic distances:
P = 1 – P(SN|(conserved, S))* – Pchance
The starred term is adjusted using P(l|D) and P(l|S), the probabilities that a given D pair or S pair, respectively, has intergenic distance l.
OperonDB
Result:
• Sensitivity > 60%
• Maximum accuracy: 80%
Relation to UROP