Evolution of bacterial regulatory systems
Mikhail Gelfand
Research and Training Center “Bioinformatics”, Institute for Information Transmission Problems, Moscow, Russia
CASB-20, UCSD, La Jolla, 13-14.III.2009
Plan
• Co-evolution of transcription factors and their binding motifs
• Evolution of regulatory systems and regulons
Regulators and their motifs
• Cases of motif conservation at surprisingly large distances
• Subtle changes at close evolutionary distances
• Correlation between contacting nucleotides and amino acid residues
NrdR (regulator of ribonucleotide reductases and some other replication-related genes): conservation at large distances
DNA motifs and protein-DNA interactions
CRP PurR
IHF TrpR
Entropy at aligned sites and the number of contacts (heavy atoms in a base pair at a distance <cutoff from a protein atom)
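The per-position entropy mentioned above can be sketched as follows. This is a minimal illustration, assuming Shannon entropy in bits over the four nucleotides with gaps ignored; the function names and the toy sites are hypothetical and do not come from the slides. Counting contacts requires a protein-DNA structure and is not covered here.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of one alignment column; non-ACGT symbols are ignored."""
    counts = Counter(c for c in column.upper() if c in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Summing the negated terms keeps a fully conserved column at +0.0.
    return sum(-(k / total) * math.log2(k / total) for k in counts.values())


def site_entropies(sites: list[str]) -> list[float]:
    """Entropy of every position in a set of aligned, equal-length sites."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("sites must be aligned to the same length")
    return [column_entropy("".join(col)) for col in zip(*sites)]


# Hypothetical toy sites, for illustration only.
toy_sites = ["TGTGA", "TGTGA", "TGTCA", "TGAGA"]
entropies = site_entropies(toy_sites)
```

Low entropy marks conserved positions. The slide compares such values with the number of protein contacts at each position.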
The CRP/FNR family of regulators
FNR
HcpR
CooA
Gamma
Desulfovibrio
Desulfovibrio
TGTCGGCnnGCCGACA
TTGTgAnnnnnnTcACAA
TTGTGAnnnnnnTCACAA
TTGATnnnnATCAA
Correlation between contacting nucleotides and amino acid residues
• CooA in Desulfovibrio spp.
• CRP in Gamma-proteobacteria
• HcpR in Desulfovibrio spp.
• FNR in Gamma-proteobacteria
DD COOA ALTTEQLSLHMGATRQTVSTLLNNLVR
DV COOA ELTMEQLAGLVGTTRQTASTLLNDMIR
EC CRP  KITRQEIGQIVGCSRETVGRILKMLED
YP CRP  KXTRQEIGQIVGCSRETVGRILKMLED
VC CRP  KITRQEIGQIVGCSRETVGRILKMLEE
DD HCPR DVSKSLLAGVLGTARETLSRALAKLVE
DV HCPR DVTKGLLAGLLGTARETLSRCLSRMVE
EC FNR  TMTRGDIGNYLGLTVETISRLLGRFQK
YP FNR  TMTRGDIGNYLGLTVETISRLLGRFQK
VC FNR  TMTRGDIGNYLGLTVETISRLLGRFQK
TGTCGGCnnGCCGACA
TTGTgAnnnnnnTcACAA
TTGTGAnnnnnnTCACAA
TTGATnnnnATCAA
Contacting residues: REnnnR
TG: 1st arginine
GA: glutamate and 2nd arginine
The correlation holds for other factors in the family
The LacI family: subtle changes in motifs at close distances
The LacI family: systematic analysis
• 1369 DNA-binding domains in 200 orthologous rows, <Id>=35%, <L>=71 aa
• 4484 binding sites, L=20 nt, <Id>=45%
• Calculate mutual information between columns of TF and site alignments
• Set threshold on mutual information of correlated pairs
Definitions
Protein alignment (columns i):
LAFDHDQILQMAQERLQGKVRYQP-IGFELLPEKFSLRQLQRMYETVLGRS---LDKRNF
LAFDHNQILDYGYQRLRNKLEYSP-IAFEVLPELFTLNDLFQLYTTVLGED--FADYSNF
LSFDHNEILAYGHRRLRNKLEYSP-VAFEVLPEMFTLNDLYQLYTTVLGEN--FSDYSNF
LSFDHNEILAYGHRRLRNKLEYSP-VAFEVLPEMFTLNDLYQLYTTVLGEN--FSDYSNF
LAFDHSKILAYGHRRLCNKLEYSP-VAFDVLPEYFTLNDLYQFYSTVLGAN--FSDYSNF
LAFDHSKILAYGHRRLCNKLEYSP-VAFDVLPEYFTLNDLYQFYSTVLGAN--FSDYSNF
LAFDHSKILAYGHRRLCNKLEYSP-VAFDVLPEYFTLNDLYQFYSTVLGAN--FSDYSNF
LAFDHNQILDYGYQRLRNKLEYSP-IAFEVLPELFTLNDLFQLYTTVLGED--FADYSNF
LSFDHNEILAYGHRRLRNKLEYSP-VAFEVLPEMFTLNDLYQLYTTVLGEN--FSDYSNF
LSFDHNEILAYGHRRLRNKLEYSP-VAFEVLPEMFTLNDLYQLYTTVLGEN--FSDYSNF
Sites (columns j):
tTAaTGgCTTTAtGcCACTAT
TTAaaGTAAtAaTTACCATAA
AaAtTGTCTTTAtGcCACTAT
TTATGGTAAATTcTACCATAA
TTATGGTAAATTcTACCATAA
TTATgGTCAgTTTcACcAaAA
tTAaTGgCTTTAtGcCACTAT
TTaGTCgAAATAaccaACtAA
TTATCGTCAtCtcGACGACAA
TttAGGTAAgTTATACTTTTA
Mutual information:
I_{i,j} = \sum_{a_i=1}^{20} \sum_{n_j=1}^{4} p(a_i, n_j) \log \frac{p(a_i, n_j)}{p(a_i)\, p(n_j)}
Z-score:
Z_{i,j} = \frac{I_{i,j} - E(\tilde{I}_{i,j})}{\sigma(\tilde{I}_{i,j})}
Correlated pairs
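A minimal sketch of the two quantities defined on this slide, for one protein-alignment column i and one site-alignment column j. The slide does not say how the expectation and spread of the reference mutual information are obtained. The sketch assumes they are estimated by shuffling the site column relative to the protein column, which is only one possible choice. Function names, parameters and the toy columns are hypothetical.

```python
import math
import random
from collections import Counter
from typing import Sequence


def mutual_information(col_a: Sequence[str], col_n: Sequence[str]) -> float:
    """I = sum over observed (a, n) of p(a,n) * log2(p(a,n) / (p(a) * p(n)))."""
    if len(col_a) != len(col_n):
        raise ValueError("columns must have one entry per TF/site pair")
    total = len(col_a)
    pair_counts = Counter(zip(col_a, col_n))
    a_counts, n_counts = Counter(col_a), Counter(col_n)
    # p(a,n) / (p(a) p(n)) simplifies to c * total / (count_a * count_n).
    return sum(
        (c / total) * math.log2(c * total / (a_counts[a] * n_counts[n]))
        for (a, n), c in pair_counts.items()
    )


def z_score(col_a: Sequence[str], col_n: Sequence[str],
            shuffles: int = 200, seed: int = 0) -> float:
    """Z = (I - mean(I_shuffled)) / sd(I_shuffled); the shuffling null is an assumption."""
    rng = random.Random(seed)
    observed = mutual_information(col_a, col_n)
    shuffled = list(col_n)
    null = []
    for _ in range(shuffles):
        rng.shuffle(shuffled)
        null.append(mutual_information(col_a, shuffled))
    mean = sum(null) / len(null)
    sd = math.sqrt(sum((x - mean) ** 2 for x in null) / len(null))
    if sd == 0:
        return 0.0  # degenerate null distribution, e.g. a fully conserved column
    return (observed - mean) / sd


# Hypothetical toy columns: residue A always pairs with T, residue S with C.
toy_residues = "A" * 20 + "S" * 20
toy_bases = "T" * 20 + "C" * 20
toy_mi = mutual_information(toy_residues, toy_bases)
toy_z = z_score(toy_residues, toy_bases)
```

Column pairs whose score exceeds a chosen threshold would then be reported as correlated pairs. Gap handling and small-sample corrections are left out of this sketch.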
Higher order correlations
-ATIKDVAKRANVSTTTV-
AATTGTGAGCGCTCACT
SL SQ
TL
TQ
Not a phylogenetic trace
[Figure: tree whose leaf labels have the form number[count]_residues_nucleotides, e.g. 38[3]_AR_GAA, 2[54]_MR_GTT_GAT, 81[65]_SR_GCA_GAA. The residue field is AR, SR, MR or TR; the nucleotide fields include GTA, GAA, GCA, GTT, GAT, GGA, GGT and others.]
NrtR (regulator of NAD metabolism)
Comparison with the recently solved structure: correlated positions indeed bind the DNA (more exactly, form a hydrophobic cluster)
Catalog of events
• Expansion and contraction of regulons
• New regulators (where from?)
• Duplications of regulators with or without regulated loci
• Loss of regulators with or without regulated loci
• Re-assortment of regulators and structural genes
• … especially in complex systems
• Horizontal transfer
Regulon expansion, or how FruR has become CRA
• CRA (a.k.a. FruR) in Escherichia coli:
– global regulator
– well-studied in experiment (many regulated genes known)
• Going back in time: looking for candidate CRA/FruR sites upstream of (orthologs of) genes known to be regulated in E. coli
Common ancestor of gamma-proteobacteria
[Figure: metabolic map. Genes: icdA, aceA, aceB, aceEF, pckA, ppsA, pykF, adhE, gpmA, pgk, tpiA, gapA, pfkA, fbp, fruK, fruBA, eda, edd, epd, ptsHI-crr, manXYZ, mtlD, mtlA. Substrates: fructose, glucose, mannose, mannitol. Legend: Gamma-proteobacteria.]
Common ancestor of the Enterobacteriales
[Figure: metabolic map. Genes: icdA, aceA, aceB, aceEF, pckA, ppsA, pykF, adhE, gpmA, pgk, tpiA, gapA, pfkA, fbp, fruK, fruBA, eda, edd, epd, ptsHI-crr, manXYZ, mtlD, mtlA. Substrates: fructose, glucose, mannose, mannitol. Legend: Gamma-proteobacteria; Enterobacteriales.]
Common ancestor of Escherichia and Salmonella
[Figure: metabolic map. Genes: icdA, aceA, aceB, aceEF, pckA, ppsA, pykF, adhE, gpmA, pgk, tpiA, gapA, pfkA, fbp, fruK, fruBA, eda, edd, epd, ptsHI-crr, manXYZ, mtlD, mtlA. Substrates: fructose, glucose, mannose, mannitol. Legend: Gamma-proteobacteria; Enterobacteriales; E. coli and Salmonella spp.]
Regulation of amino acid biosynthesis in the Firmicutes
• Interplay between regulatory RNA elements and transcription factors
• Expansion of T-box systems (RNA structures that normally regulate aminoacyl-tRNA synthetases)
Recent duplications and bursts: ARG-T-box in Clostridium difficile
[Figure: tree of ARG-T-boxes. Leaves: LJ_ARGS, LME_ARGS, LR_ARGS, LP_ARGS, CBE_ARGS, CPE_ARGS, CB_ARGS, CTC_ARGS, CAC_ARGS, CDF_YQIXYZ, RDF02391, CDF_ARGC, CDF_ARGH, BC_ARGS2, EF_ARGS, BH_ARGS, LSA_ARGS, PPE_ARGS, LGA_ARGS.]
[Scheme labels: Bacillales, Lactobacillales, Clostridiales, others: argS. Clostridium difficile: argS, yqiXYZ, RDF02391, argCJBDF, argG, argH; predicted amino acid transporters (NEW); amino acid biosynthetic genes (NEW).]
Legend: ARG-specific T-box regulatory site; aminoacyl-tRNA synthetase; biosynthetic genes; amino acid transporters
… caused by loss of transcription factor AhrC
Expansion of T-box regulon
regulation of expression of arginine biosynthetic and transport genes by T-box antitermination
: ARG-specific T-box regulatory site
Binding to 5’ UTR gene region → regulation of gene expression
Other clostridia spp. (CA, CTC, CTH, CPE, CB, CPE)
yqiXYZ
argC
argH
yqiXYZ
argC
argG
argH
AhrC regulatory protein (negative regulation of arginine metabolism, positive regulation of arginine catabolism)
...AhrC site
: AhrC binding site
Gram+ bacteria:
Clostridium difficile:
AhrC is lost
5’
Duplications and changes in specificity: ASN/ASP/HIS T-boxes
[Figure: tree of ASN/ASP/HIS T-boxes. Leaves: CB_ASNS2, CDF_ASNA, EF_HISS, EX_HISS, BCL_HISS, BH_HISS, OB_HISS, BC_HISS, TTE_HISS, DRE_HISS, CH_HISS, CTH_HISS, PL_HISS, BE_HISS, BL_HISS, BS_HISS, LME_HISXYZ, CDF_HISZX, LRE_HISXYZ, LSA_HISXYZ, OOE_HISXYZ, LP_HISXYZ, SGO_HISC, SMU_HISC, EF_HISXYZ, LMO_HISXYZ, EF_HISXYZ, LME_HIS(Z G\ ), LL_HISC, LP_HISZ, LCA_HISZ, CB_ASNS3, CAC_ASNS32, BC_ASNS2, PPE_HISXYZ, PPE_ASNS, LB_ASNA, LD_ASNA, LJ_GLNQHMP, LJ_ASNA, PPE_ASNA, LP_ASNA, EX_ASNA, LB_ASNS2, CTC_ASNS2, PPE_HISS, LP_HISS, LB_HISS, LJ_HISS, LRE_HISS, LRE_ASPS, LCA_HISS, CPE_ASNA, BC_ASNA, CBE_ASNS2, CTC_ASNA, CDF_ASNS2, CPE_ASNS2.]
[Scheme labels. T-box specificities: HIS, ASP, ASN, ASP\ASN. Genes and operons: his operon, hisXYZ, hisS, aspS, asnS, asnA, glnQHMP. Groups: Bacillales, Lactobacillales, Clostridiales, other Gram+, L. johnsonii, L. reuteri, P. pentosaceus. Leaves: SMU_ASPS2, SG_ASPS2. Annotations: NEW; rapid mutation of regulatory codons (AAC, GAC).]
Blow-up 1
[Figure: subtree with leaves PPE_ASNS2, LB_ASNA, LD_ASNA, LJ_GLNQHMP, LJ_ASNA, PPE_ASNA, LP_ASNA, PPE_HISS, LP_HISS, LB_HISS, LJ_HISS, LRE_HISS, LRE_ASPS, LCA_HISS, PPE_HISXYZ.]
[Scheme labels. Lactobacillales: hisS, aspS, asnA with HIS, ASP and ASN T-boxes. L. reuteri: hisS, aspS; codons CAC, GAC; disruption of hisS-aspS operon; mutation of regulatory codon. L. johnsonii: asnA, glnQHMP; ASP, ASN. P. pentosaceus: hisXYZ, asnS; HIS, ASP, ASN; codons AAC, CAC, GAC.]
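A small sketch of why a change of specificity between ASN, ASP and HIS T-boxes needs only one mutation. The three regulatory codons shown on these slides (AAC, GAC, CAC) encode Asn, Asp and His in the standard genetic code and differ only in their first nucleotide. The helper names below are hypothetical.

```python
from itertools import combinations

# Regulatory codons shown on the slides, with their standard-code amino acids.
REGULATORY_CODONS = {"AAC": "ASN", "GAC": "ASP", "CAC": "HIS"}


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length codons differ."""
    if len(x) != len(y):
        raise ValueError("codons must have equal length")
    return sum(a != b for a, b in zip(x, y))


def single_mutation_switches(codons: dict[str, str]) -> list[tuple[str, str, int]]:
    """Pairs of specificities reachable by one substitution, with the mutated position (1-based)."""
    switches = []
    for (c1, aa1), (c2, aa2) in combinations(codons.items(), 2):
        if hamming(c1, c2) == 1:
            position = next(i for i, (a, b) in enumerate(zip(c1, c2), start=1) if a != b)
            switches.append((aa1, aa2, position))
    return switches


switches = single_mutation_switches(REGULATORY_CODONS)
```

Every pair among the three specificities is one substitution apart, always at codon position 1.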
Blow-up 2. Prediction
Regulators lost in lineages with expanded HIS-T-box regulon??
… and validation
• conserved motifs upstream of HIS biosynthesis genes
• candidate transcription factor yerC co-localized with the his genes
• present only in genomes with the motifs upstream of the his genes
• genomes with neither YerC motif nor HIS-T-boxes: attenuators
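The co-occurrence argument in the bullets above can be checked mechanically once each genome is annotated with three presence/absence flags. The sketch below is a hypothetical illustration: the genome names and flag values are invented, and the real analysis may have used different criteria.

```python
from typing import NamedTuple


class GenomeFeatures(NamedTuple):
    has_yerc: bool       # candidate transcription factor gene present
    has_motif: bool      # conserved motif upstream of the his genes
    has_his_tbox: bool   # HIS-T-box upstream of the his genes


def check_cooccurrence(genomes: dict[str, GenomeFeatures]) -> dict[str, list[str]]:
    """Report genomes that break the TF/motif pattern and those left for attenuators."""
    report = {"tf_without_motif": [], "motif_without_tf": [], "neither_motif_nor_tbox": []}
    for name, g in genomes.items():
        if g.has_yerc and not g.has_motif:
            report["tf_without_motif"].append(name)
        if g.has_motif and not g.has_yerc:
            report["motif_without_tf"].append(name)
        if not g.has_motif and not g.has_his_tbox:
            report["neither_motif_nor_tbox"].append(name)
    return report


# Hypothetical annotations, for illustration only.
toy_genomes = {
    "genome_A": GenomeFeatures(has_yerc=True, has_motif=True, has_his_tbox=False),
    "genome_B": GenomeFeatures(has_yerc=False, has_motif=False, has_his_tbox=True),
    "genome_C": GenomeFeatures(has_yerc=False, has_motif=False, has_his_tbox=False),
}
report = check_cooccurrence(toy_genomes)
```

Empty `tf_without_motif` and `motif_without_tf` lists are consistent with the factor and the motif occurring together. Genomes in `neither_motif_nor_tbox` are the ones to inspect for attenuators.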
Bacillales (his operon)
Clostridiales
Thermoanaerobacteriales
Halanaerobiales
Bacillales
The evolutionary history of the his genes regulation in the Firmicutes
T-boxes: Summary / History
Life without Fur
Regulation of iron homeostasis (the Escherichia coli paradigm)
Iron:
• essential cofactor (limiting in many environments)
• dangerous at large concentrations
FUR (responds to iron):
• synthesis of siderophores
• transport (siderophores, heme, Fe2+, Fe3+)
• storage
• iron-dependent enzymes
• synthesis of heme
• synthesis of Fe-S clusters
Similar in Bacillus subtilis
Regulation of iron homeostasis in α-proteobacteria
Experimental studies:
• FUR/MUR: Bradyrhizobium, Rhizobium and Sinorhizobium
• RirA (Rrf2 family): Rhizobium and Sinorhizobium
• Irr (FUR family): Bradyrhizobium, Rhizobium and Brucella
[Scheme labels. Transcription factors and signals: Fur, Fe; RirA, FeS; Irr, heme; IscR, FeS status of cell; "degraded". Regulated functions: iron uptake systems (siderophore uptake, Fe2+/Fe3+ uptake), iron storage ferritins, FeS synthesis, heme synthesis, iron-requiring enzymes [iron cofactor]. Conditions: [+Fe], [-Fe].]
Distribution of transcription factors in genomes
Search for candidate motifs and binding sites using standard comparative genomic techniques
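One common ingredient of such searches is scanning upstream regions with a positional weight matrix built from known sites. The sketch below is a generic illustration under stated assumptions: log-odds scores against a uniform background, a pseudocount of 1, forward strand only. It is not the software used for the slides, and the training sites, sequence and threshold are hypothetical.

```python
import math


def build_pwm(sites: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Log2-odds weight matrix from aligned sites, assuming a uniform 0.25 background."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("sites must be aligned to the same length")
    pwm = []
    for column in zip(*(s.upper() for s in sites)):
        denominator = len(sites) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / denominator) / 0.25)
            for base in "ACGT"
        })
    return pwm


def scan(sequence: str, pwm: list[dict[str, float]],
         threshold: float) -> list[tuple[int, str, float]]:
    """Return (start, window, score) for forward-strand windows scoring at least threshold."""
    sequence = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous symbols
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical training sites and upstream region, for illustration only.
toy_pwm = build_pwm(["TTGATC", "TTGATA", "TTGACC"])
hits = scan("ACGTTGATCGGA", toy_pwm, threshold=5.0)
```

In a comparative setting, a candidate site gains support when sites above the threshold are found upstream of orthologous genes in several genomes.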
Regulation of genes in functional subsystems
Rhizobiales
Bradyrhizobiaceae
Rhodobacteriales
The Zoo (likely ancestral state)
Reconstruction of history
Appearance of the iron-Rhodo motif
Frequent co-regulation with Irr
Strict division of function with Irr
All logos and Some Very Tempting Hypotheses:
1. Cross-recognition of FUR and IscR motifs in the ancestor.
2. When FUR had become MUR, and IscR had been lost in Rhizobiales, emerging RirA (from the Rrf2 family, with a rather different general consensus) took over their sites.
3. Iron-Rhodo boxes are recognized by IscR: directly testable
Summary and open problems
• Regulatory systems are very flexible
– easily lost
– easily expanded (in particular, by duplication)
– may change specificity
– rapid turnover of regulatory sites
• With more stories like these, we can start thinking about a general theory
– catalog of elementary events; how frequent?
– mechanisms (duplication, birth e.g. from enzymes, horizontal transfer)
– conserved (regulon cores) and non-conserved (marginal regulon members) genes in relation to metabolic and functional subsystems/roles
– (TF family-specific) protein-DNA recognition code
– distribution of TF families in genomes; distribution of regulon sizes; etc.
People
• Andrei A. Mironov – software, algorithms
• Alexandra Rakhmaninova – SDP, protein-DNA correlations
• Anna Gerasimova (now at LBNL) – NadR
• Olga Kalinina (on loan to EMBL) – SDP
• Yuri Korostelev – protein-DNA correlations
• Olga Laikova – LacI
• Dmitry Ravcheev – CRA/FruR
• Dmitry Rodionov (on loan to Burnham Institute) – iron etc.
• Alexei Vitreschak – T-boxes and riboswitches
• Andy Jonson (U. of East Anglia) – experimental validation (iron)
• Leonid Mirny (MIT) – protein-DNA, SDP
• Andrei Osterman (Burnham Institute) – experimental validation
• Howard Hughes Medical Institute
• Russian Foundation of Basic Research
• Russian Academy of Sciences, program “Molecular and Cellular Biology”