
Direct inference of protein–DNA interactions using compressed sensing methods

Mohammed AlQuraishi a,b,c and Harley H. McAdams a,1

a Department of Developmental Biology, Stanford University School of Medicine, Stanford, CA 94305; b Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305; and c Department of Statistics, Stanford University, Stanford, CA 94305

Edited* by Stephen J. Benkovic, Pennsylvania State University, University Park, PA, and approved July 6, 2011 (received for review April 25, 2011)

Compressed sensing has revolutionized signal acquisition by enabling complex signals to be measured with remarkable fidelity using a small number of so-called incoherent sensors. We show that molecular interactions, e.g., protein–DNA interactions, can be analyzed in a directly analogous manner and with similarly remarkable results. Specifically, mesoscopic molecular interactions act as incoherent sensors that measure the energies of microscopic interactions between atoms. We combine concepts from compressed sensing and statistical mechanics to determine the interatomic interaction energies of a molecular system exclusively from experimental measurements, resulting in a “de novo” energy potential. In contrast, conventional methods for estimating energy potentials are based on theoretical models premised on a priori assumptions and extensive domain knowledge. We determine the de novo energy potential for pairwise interactions between protein and DNA atoms from (i) experimental measurements of the binding affinity of protein–DNA complexes and (ii) crystal structures of the complexes. We show that the de novo energy potential can be used to predict the binding specificity of proteins to DNA with approximately 90% accuracy, compared to approximately 60% for the best performing alternative computational methods applied to this fundamental problem. This de novo potential method is directly extendable to other biomolecule interaction domains (enzymes and signaling molecule interactions) and to other classes of molecular interactions.

DNA motifs ∣ structural biology ∣ machine learning ∣ protein–DNA binding ∣ DNA binding sites

The foundation of molecular analyses of chemical and biological phenomena is the energy potential, a mathematical description of the energy of every possible interaction in a molecular system (Fig. 1B). The accuracy of computational and laboratory studies of phenomena ranging from pharmaceutical drug interactions and protein folding to material phase transitions and thin film growth is often limited by the accuracy of these energy potentials. Currently, potentials are inferred using a mixture of theoretical modeling and experimental data (Fig. 1A). “Physical potentials” rely on theoretical models to specify the potential’s mathematical form and use experimental data to fit few model parameters (1). In contrast, “statistical potentials” fit many parameters to experimental data and use theoretical models for the expected statistics of interactions under randomness to infer a potential (2). In both approaches, theoretical models shape and constrain the inferred potential, resulting in a so-called parametric model. There are several drawbacks to this: (i) The a priori assumptions underlying the inferred potentials may be inaccurate. (ii) Substantial domain knowledge is required (often exceeding what is known). (iii) Potential modeling is lengthy and technically difficult. The theoretical development of some potentials has taken decades (3). To overcome these problems, potentials could in principle be determined strictly from experimental data without recourse to theoretical modeling by experimentally measuring the energies of all distinct interactions. In practice, direct measurement of interatomic potentials has been possible only for the simplest systems, due to a combinatorial explosion in the number of possible interactions that renders experiment-based inference intractable. We have developed a general method for the inference of “de novo” potentials that circumvents the experimental intractability barrier by exploiting recent discoveries in information theory known as compressed sensing. This approach results in a nonparametric potential that does not require an a priori assumption of a theoretical model, overcoming a fundamental limitation of both physical and statistical potentials.

Below, we demonstrate our method by applying it to the prediction of sequence-specific protein–DNA-binding interactions, a classic problem in molecular biology. Sequence-specific protein–DNA binding is a central phenomenon underlying transcriptional regulation of the cell in all organisms. Here, we describe a de novo potential for interatomic protein–DNA interactions and

[Figure 1. Panel A: classification of potentials along data-driven vs. physics-driven and parametric vs. nonparametric axes. Statistical potentials assume a model of the expected statistics of random interactions; physical potentials assume simplifying physical approximations (e.g., bonds are springs); de novo potentials are inferred exclusively from experimental data without prior modeling assumptions. Panel B: heat map of an energy potential; axes: separation distance and interacting atom pair (atom types); color: energy scale.]

Fig. 1. Types of energy potentials. (A) A potential V(r, i) mathematically specifies the energies of all microscopic interactions in a molecular system in terms of distance r and interaction type i. Conventional physical and statistical potentials are parametric mathematical models similar to the examples shown. Our de novo potentials are nonparametric; i.e., they do not assume a mathematical model. (B) A potential can be visualized as a heat map where the interaction energy of every atom pair as a function of the atoms’ separation distance is represented by a color (pink: high potential energy, repulsive region; blue: low potential energy, attractive region).

Author contributions: M.A. designed research; M.A. performed research; M.A. analyzed data; and M.A. and H.H.M. wrote the paper.

The authors declare no conflict of interest.

*This Direct Submission article had a prearranged editor.

See Commentary on page 14713.

1 To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1106460108/-/DCSupplemental.

www.pnas.org/cgi/doi/10.1073/pnas.1106460108 PNAS ∣ September 6, 2011 ∣ vol. 108 ∣ no. 36 ∣ 14819–14824

BIOPHYSICS AND COMPUTATIONAL BIOLOGY; APPLIED PHYSICAL SCIENCES; SEE COMMENTARY


use this potential to computationally predict the DNA-binding sites of proteins with near experimental accuracy. A schematic overview summarizing the key steps in our approach is shown in Fig. S1.

The crux of our method is a unique mathematical formulation that recasts the determination of potentials as a signal acquisition problem using compressed sensing techniques (4, 5). By exploiting a key property of compressed sensing (discussed below), we circumvent the experimental intractability of determining energy potentials. This formulation rests on three key observations: (i) In certain systems, e.g., biomolecular interactions, only a few types of interatomic interactions, such as hydrogen bonds, are energetically important. (ii) Mesoscopic interactions, e.g., protein–DNA binding, can be viewed as sensors of the underlying microscopic potential. (iii) There is a correspondence between the mathematical formulation of logistic regression and the statistical mechanical concept of the canonical ensemble. Using these ideas, we formulate the determination of energy potentials as a tractable signal acquisition problem as described in Methods. Central to this approach is the distinction between microscopic and mesoscopic interactions. The term “mesoscopic interactions” refers to interactions between molecules or molecular complexes, whereas “microscopic interactions” refers to interactions at the atomic scale. This distinction is ultimately application-specific and depends on what interaction energies are to be determined and what experimental data are available. An essential requirement is that mesoscopic interactions be composed of microscopic interactions, such that the energy of a mesoscopic interaction is the sum of the energies of its constituent microscopic interactions. The identities of the microscopic interactions constituting each mesoscopic interaction must also be known. When these requirements are met, the method can determine the energies of the microscopic interactions using observed energies or probabilities of the mesoscopic interactions as the experimental data. For protein–DNA interactions, the mesoscopic interactions are protein–nucleotide binding events represented by protein–DNA crystal structures with known binding energies or relative binding probabilities, and the microscopic interactions are pairwise, distance-dependent contacts between protein atoms and nucleotide atoms [standard atom type categories are used (6), and distances are segmented into discrete bins whose widths are treated as model parameters as described in SI Text S1, Section 2]. Thus, each protein–nucleotide crystal structure characterizes a mesoscopic interaction whose constituent microscopic interactions are readily identified from the structure. Intraprotein and intra-DNA energies are ignored as protein–DNA interactions are our focus. We use these data together with the known energies or probabilities of protein–nucleotide binding events to infer the interatomic protein–DNA potentials, and we show that these de novo potentials can be used to predict protein–DNA-binding motifs with unprecedented accuracy.
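The decomposition of a mesoscopic interaction into binned microscopic contacts can be sketched as follows. This is only an illustration under stated assumptions: the atom type names, coordinates, and energies are made up, the helper names (`distance_bin`, `sensor_counts`, `mesoscopic_energy`) are hypothetical, and the bin width and cutoff are borrowed from the baseline model reported later in Results rather than from SI Text S1.

```python
import math
from collections import Counter

# Assumed settings, taken from the baseline model described in Results.
BIN_WIDTH = 1.3   # angstroms
CUTOFF = 5.9      # angstroms


def distance_bin(d, bin_width=BIN_WIDTH, cutoff=CUTOFF):
    """Return the index of the distance bin containing d, or None at or beyond the cutoff."""
    if d >= cutoff:
        return None
    return int(d // bin_width)


def sensor_counts(protein_atoms, dna_atoms):
    """Count the microscopic interactions present in one protein-nucleotide complex.

    Each atom is (atom_type, (x, y, z)). The result maps
    (protein_type, dna_type, bin_index) -> number of occurrences.
    """
    counts = Counter()
    for p_type, p_xyz in protein_atoms:
        for d_type, d_xyz in dna_atoms:
            b = distance_bin(math.dist(p_xyz, d_xyz))
            if b is not None:
                counts[(p_type, d_type, b)] += 1
    return counts


def mesoscopic_energy(counts, potential):
    """Sum the energies of the constituent microscopic interactions."""
    return sum(n * potential.get(key, 0.0) for key, n in counts.items())


# Toy example with made-up atom types, coordinates, and energies.
protein = [("N_donor", (0.0, 0.0, 0.0)), ("C_aliph", (4.0, 0.0, 0.0))]
dna = [("O_accept", (3.0, 0.0, 0.0)), ("C_arom", (10.0, 0.0, 0.0))]
counts = sensor_counts(protein, dna)
potential = {("N_donor", "O_accept", 2): -1.5, ("C_aliph", "O_accept", 0): 0.4}
energy = mesoscopic_energy(counts, potential)
```

In this toy case two contacts fall within the cutoff, so the count vector has two nonzero entries and the mesoscopic energy is their sum. In the inference problem the roles are reversed: the counts and the mesoscopic energies are known, and the potential is the unknown.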

Model

Overview. We outline our signal-based formulation of the potential inference problem here and derive it formally in SI Text S1, Section 1. Signal acquisition comprises three parts: the signal (conventionally an image), the sensor (camera photographing the image), and the sensor measurements (light intensities of the image) (Fig. 2A). A signal is represented by a vector whose elements are the signal intensities at different locations (e.g., different image pixels). Conventional sensing theory (Nyquist–Shannon) stipulates that signal acquisition requires twice as many

[Figure 2. Columns: Signal, Sensor, Measurement, Result. Panel A (conventional sensing): image, light sensor, light intensity, image (many measurements needed). Panel B (compressed sensing): image, incoherent light sensor, integrated light intensity, ℓ1 minimization, reconstructed image (few measurements needed). Panel C (de novo potential): microscopic potential, mesoscopic interaction (protein–DNA binding), mesoscopic interaction energy, ℓ1 minimization, reconstructed microscopic potential (few measurements needed). Heat map axes: separation distance and interacting atom pair (atom types); color: energy scale.]

Fig. 2. Comparison of conventional sensing, compressed sensing, and de novo potential determination. (A) In conventional image sensing the intensities of all pixels are acquired directly. (B) In compressed sensing, the image is inferred using ℓ1 minimization from a relatively small number of measurements that sum the signal intensity of multiple image pixels. (C) Potential determination as an application of compressed sensing. The potential is represented by a heat map of microscopic interaction energies ranging from repulsive (dark pink) to attractive (dark blue). The interatomic protein–DNA potential can be inferred by ℓ1 minimization from measurements provided by a small number of sensors (protein–DNA structures + binding energies). See text for additional details and Fig. S2 for a more mathematical treatment.


measurements as the length of this vector for complete recovery of the signal. However, the compressed sensing (CS) framework has shown that under certain conditions, far fewer measurements are necessary when the signal is inferred using ℓ1 minimization (5). We exploit this property to circumvent the combinatorial explosion noted earlier that causes experimental intractability. The CS technique requires two conditions for applicability (5): (i) The signal must be nearly sparse; i.e., most vector elements must have negligible intensity. (ii) The sensors must be incoherent; i.e., they measure the integrated intensity of multiple signal vector elements (Fig. 2B), and the set of vector elements sensed must be highly variable (ideally, random) between sensors (SI Text S1, Section 1). Also, the identity of the vector elements sensed by each sensor must be known. We reformulate potential determination as a CS problem by treating the interatomic potential as the signal we wish to acquire, with mesoscopic interactions as the sensors and mesoscopic interaction energies as the measurements (Fig. 2C). The signal’s vector consists of the energies of all distinct microscopic interactions, with different vector elements corresponding to different microscopic interactions and signal intensity corresponding to interaction energy. In the protein–DNA application, we treat distinct combinations of protein atoms, nucleotide atoms, and distance bins as distinct interactions (SI Text S1, Section 2). This leads to a combinatorial explosion in the number of possible interactions, causing the signal’s vector to be extremely long (up to approximately 50,000 elements) (SI Text S1, Section 2). However, the vector will be nearly sparse because most interactions are energetically negligible (7). In our crystal structure dataset, we found that only 9% of interaction energies were nonnegligible (Results). This satisfies the first condition.

Regarding the second condition, the energies of mesoscopic interactions are incoherent measurements, because (i) they are the summed energies of the microscopic interactions and thus integrate the intensity of multiple vector elements, and (ii) the set of microscopic interactions present in each mesoscopic interaction is highly variable as discussed below. Because the vector elements sensed by each measurement must be known, the microscopic interactions comprising each mesoscopic interaction must be known. Protein–DNA crystal structures provide a dataset of mesoscopic interactions whose constituent microscopic interactions are identified from the positions of the protein and nucleotide atoms in the contact regions of the structure. We used a set of 63 such nonredundant structures (Dataset S1), combined with their measured binding affinities, as the dataset for the de novo potential determination described below. The nonredundancy of these structures ensures that each mesoscopic interaction samples a different set of microscopic interactions, because the intrinsic variability of the structures due to their varying spatial conformations and different amino acid compositions results in high variability in the microscopic interactions that constitute each mesoscopic interaction. (The degree of incoherence is quantified later in Results.) Now, because we have recast potential determination as a CS problem, only a small number of incoherent measurements, i.e., experimentally characterized protein–nucleotide binding events, are needed. This circumvents the experimental combinatorial explosion problem cited earlier.
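The core CS claim, that a sparse vector can be recovered from fewer measurements than unknowns by ℓ1 minimization, can be illustrated on synthetic data. The sketch below is not the authors' pipeline: the sensors are random Gaussian vectors rather than contact counts from crystal structures, the sizes and the regularization weight `lam` are arbitrary choices, NumPy is assumed to be available, and the solver is a plain iterative soft-thresholding (ISTA) loop chosen for brevity.

```python
import numpy as np


def ista_lasso(A, y, lam, n_iter=5000):
    """Minimise 0.5*||A x - y||^2 + lam*||x||_1 by iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2           # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - y))           # gradient step on the quadratic term
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x


rng = np.random.default_rng(0)
n_interactions, n_sensors, n_active = 200, 60, 6

# Sparse synthetic "potential": only a few microscopic interactions carry energy.
potential = np.zeros(n_interactions)
active = rng.choice(n_interactions, size=n_active, replace=False)
signs = rng.choice([-1.0, 1.0], size=n_active)
potential[active] = signs * rng.uniform(1.0, 2.0, size=n_active)

# Each row is one sensor that mixes many interactions (random, hence incoherent).
sensors = rng.standard_normal((n_sensors, n_interactions)) / np.sqrt(n_sensors)
energies = sensors @ potential                       # noiseless mesoscopic measurements

# l1-regularised estimate versus the minimum-norm least-squares solution.
estimate = ista_lasso(sensors, energies, lam=0.02)
min_norm = np.linalg.lstsq(sensors, energies, rcond=None)[0]

rel_error = np.linalg.norm(estimate - potential) / np.linalg.norm(potential)
min_norm_error = np.linalg.norm(min_norm - potential) / np.linalg.norm(potential)
```

With 60 measurements of a 200-element vector the linear system is underdetermined, so the minimum-norm solution spreads energy over all elements, whereas the ℓ1 penalty concentrates it on the few active ones. Comparing `rel_error` with `min_norm_error` makes the difference visible.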

Mathematical Formulations. We show that ℓ1-regularized linear regression (8) infers potentials from mesoscopic interaction energies in SI Text S1, Section 1 (see also Fig. S2). We also derive a probability-based formulation that uses the relative probability of a mesoscopic interaction within a collection of possible alternative interactions (e.g., alternative DNA sites where the protein binds) as experimental data. This collection must form a canonical ensemble, i.e., a set of physical states in which the energy may vary, but the volume, temperature, and number of particles are fixed. Multiple distinct canonical ensembles can be used to infer a single potential (e.g., multiple protein–DNA complexes can be used to infer a single protein–DNA potential). We derive this formulation using a constrained version of ℓ1-regularized multinomial logistic regression (9) (SI Text S1, Section 1). In our protein–DNA application, a collection of protein–nucleotide complexes in which the protein is fixed and individual nucleotides are varied forms a canonical ensemble, and the protein’s relative binding probabilities to different nucleotides are the experimental data. These probabilities are obtained from experimentally determined position weight matrices (PWMs) of protein binding sites or from consensus binding sequences (by assuming that consensus nucleotides bind with 100% probability).
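The correspondence between the canonical ensemble and multinomial logistic regression can be written out in a few lines. The notation here is an assumption made for illustration and is not taken from SI Text S1: x_{e,s} is the vector of microscopic interaction counts for state s (e.g., a nucleotide) in ensemble e, v is the potential, β = 1/kT, p_{e,s} is the experimentally observed relative probability, and λ is the regularization weight. The additional constraints mentioned in the text are not reproduced.

```latex
\begin{align}
E_{e,s} &= \mathbf{x}_{e,s}^{\top}\mathbf{v}
  && \text{(mesoscopic energy as a sum of microscopic energies)} \\
P(s \mid e) &= \frac{\exp(-\beta E_{e,s})}{\sum_{s'} \exp(-\beta E_{e,s'})}
  && \text{(Boltzmann distribution over the ensemble)} \\
\hat{\mathbf{v}} &= \arg\min_{\mathbf{v}}
  \; -\sum_{e}\sum_{s} p_{e,s}\,\log P(s \mid e) \;+\; \lambda\,\lVert \mathbf{v} \rVert_{1}
  && \text{($\ell_1$-regularized multinomial logistic regression)}
\end{align}
```

The second line has the form of a softmax over linear scores, which is why fitting the observed probabilities reduces to a multinomial logistic regression in which the regression coefficients play the role of interaction energies.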

Results

Application to Prediction of Protein Binding Sites. We have used the probability-based formulation to determine the protein–DNA potential of helix-turn-helix (HTH) proteins and predict their consensus binding sequences and PWM motifs. We focus on HTH proteins as they are the most widely distributed family of DNA-binding proteins, occurring in all biological kingdoms, with a large number of structures in the Protein Data Bank (10). HTH proteins include virtually all bacterial transcription factors and about 25% of human transcription factors (11). For the prediction of consensus binding sequences, the potential is inferred using probabilities derived from the consensus sequences of protein–DNA structures in a dataset reserved for training the algorithm (SI Text S1, Section 3). A separate set of protein–DNA structures is used to test predictions made with the inferred potential. For each protein–DNA structure in the test set, every DNA sequence position is mutated in silico to every possible pair of nucleotides, and the relative binding affinities of the mutated structures are computed (SI Text S1, Section 3). In silico mutagenesis was carried out using the 3DNA software package (12, 13), which maintains the backbone atoms of the DNA molecule, but replaces the base pair atoms in a way that is consistent with the backbone orientation in the crystal. We assume independence of DNA positions and repeat this process for every position. The most probable nucleotides at all positions are predicted with 12.9% error, compared to 42.1% error by the leading alternative method (Table 1, Baseline model). For the more complex problem of predicting quantitative PWMs, we determine the potential using probabilities derived from published experimentally determined PWMs of the 63 protein–DNA structures in the dataset (Dataset S2 and SI Text S1, Section 3).

Compared to leading physical and statistical potentials (6, 14–16), our de novo potential method produces the best PWM score (Table 1) on the symmetric Kullback–Leibler divergence (SKLD) metric (SI Text S1, Section 3). Note that the second and third best performing potentials require consensus binding sequences as input. Providing that input significantly simplifies the problem, whereas our method infers the consensus binding sequences.
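The two evaluation quantities used above can be sketched in a few lines. The exact definitions are given in SI Text S1, Section 3, which is not reproduced here, so the following are assumptions: binding energies are converted to a PWM column with a Boltzmann weighting, SKLD is taken as KL(p‖q) + KL(q‖p) with a small pseudocount to avoid log(0), and consensus error is the fraction of positions whose most probable base is wrong. All function names and the toy energies are hypothetical.

```python
import math

BASES = "ACGT"


def boltzmann_column(energies, beta=1.0):
    """Convert per-base binding energies into relative binding probabilities."""
    shift = min(energies.values())             # subtract the minimum for numerical stability
    weights = {b: math.exp(-beta * (energies[b] - shift)) for b in BASES}
    total = sum(weights.values())
    return {b: weights[b] / total for b in BASES}


def skld(p, q, pseudocount=1e-6):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two PWM columns."""
    def smooth(col):
        total = sum(col[b] + pseudocount for b in BASES)
        return {b: (col[b] + pseudocount) / total for b in BASES}

    p, q = smooth(p), smooth(q)
    # KL(p||q) + KL(q||p) simplifies to the sum of (p - q) * log(p / q).
    return sum((p[b] - q[b]) * math.log(p[b] / q[b]) for b in BASES)


def consensus_error(predicted_pwm, true_consensus):
    """Fraction of positions whose most probable base differs from the consensus."""
    calls = [max(BASES, key=col.get) for col in predicted_pwm]
    wrong = sum(c != t for c, t in zip(calls, true_consensus))
    return wrong / len(true_consensus)


# Toy example: made-up energies for a two-position binding site.
site_energies = [
    {"A": -2.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": -1.0, "T": 0.0},
]
predicted = [boltzmann_column(e) for e in site_energies]
uniform = {b: 0.25 for b in BASES}
```

Here `consensus_error(predicted, "AG")` is zero because the lowest-energy base is the most probable one at both positions, and `skld(predicted[0], uniform)` is positive because the predicted column is far from the random model used as a reference in Table 1.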

Generalizations. We also consider two generalizations that relax physical constraints, yielding pseudopotentials that perform better in practice. First, we mathematically transform the regression inputs to improve their statistical and numerical properties (SI Text S1, Section 2). Second, we infer distinct potentials for interactions occurring in different regions of the HTH–DNA-binding interface, motivated by the observation that binding affinity is strongest in the core region of the binding interface and gets progressively weaker away from the core region (17) (SI Text S1, Section 2 and Fig. S3). We tested these generalizations individually and in combination (see Table 1). The consensus sequence predictions are slightly improved (10.1% vs. 12.9% error), but the improvement in PWM prediction is dramatic (SKLD of 1.699 vs. 1.960), larger than the gain obtained in going from the Quasichemical (6) to the DNAPROT (16) algorithm (2.248 vs. 1.991), which accounts for intra-DNA interactions and requires consensus sequences. The positive impact of this


second generalization on PWM prediction, but not on consensus sequence prediction, results because consensus sequences do not capture the relative binding strength of protein–DNA interactions for alternative DNA sequences. Fig. 3 shows a bar chart of the accuracy of all 63 predictions made using our de novo potential with both generalizations, along with representative best, average, and worst case predictions of consensus sequences and PWMs. Fig. 4 compares predictions for the proteins where the de novo algorithm exhibited the greatest improvements relative to other methods.

Characterization of Best Performing Model. As discussed earlier, a collection of sensors must be incoherent to ensure high-quality reconstruction of the underlying potential with compressive sensing methods, and the potential must be sufficiently sparse relative to the number of available measurements (SI Text S1, Section 1). To determine the degree to which these requirements are satisfied by the potentials derived from the protein–DNA complexes in our dataset, we consider the best performing baseline model, applied without the two generalizations discussed in the previous section. This model produced an energy potential with a total of 2,997 unique microscopic interactions using 1.3-Å wide distance bins and a 5.9-Å cutoff distance (SI Text S1, Section 2). The total number of sensors in the dataset is 592, as each protein–DNA crystal structure yields multiple sensors because we make the common assumption of independence between DNA base pair positions. As previously noted, accurate inference is still possible despite having a smaller number of sensors than the number of unique microscopic interactions, if the potential is sufficiently sparse. Of the 2,997 unique interactions, only 270 (about 9%) have nonzero energy, suggesting that our dataset will yield accurate potentials. In fact, it is likely that the best performing choices of binning width and cutoff distance used for the baseline model represent the optimal trade-off between spatial resolution and statistical power.

An additional way to address the suitability of the dataset for potential inference is based on the consideration of all the pairwise angles between the sensor vectors in the dataset. The distribution of the absolute values of the cosines of these pairwise angles (Fig. S4) characterizes the incoherence of the sensors (18). The mean and median values of this distribution are 0.081 and 0.041, respectively. These values are significantly lower than 1, thus indicating that the set of sensors comprised by the

Table 1. Performance of de novo potential and other leading potentials

| Potential | Type | Intra-DNA interactions | Consensus sequence error | PWM symmetric KL divergence |
|---|---|---|---|---|
| Random model | N/A | N/A | 75% | 3.335 |
| Methods requiring consensus sequences | | | | |
| Rosetta (12) | physical | yes | N/A | 2.632 |
| Cumulative contacts (13) | statistical | no | N/A | 2.033 |
| DNAPROT (14) | statistical | yes | N/A | 1.991 |
| Methods not requiring consensus sequences | | | | |
| DNAPROT* (14) | statistical | no | 60.2% | 3.279 |
| Rosetta* (12) | physical | no | 50.8% | 2.719 |
| Quasichemical (6) | statistical | no | 42.1% | 2.248 |
| Our methods (do not require consensus sequences) | | | | |
| Baseline | de novo | no | 12.9% | 1.960 |
| Transformed inputs | de novo | no | 10.2% | 1.861 |
| Region specific | de novo | no | 13.7% | 1.792 |
| Both generalizations | de novo | no | 10.1% | 1.699 |

Performance is assessed based on predictions of consensus sequences and PWMs averaged over the 63 structures in the dataset. For consensus sequence prediction, error is measured by the percentage of incorrectly predicted bases. For PWM predictions, the average SKLD over all DNA positions is reported (lower is better). A random model in which all DNA base pairs are assumed to be equally likely is also shown for reference.
*Only the direct readout components of potentials are used in those tests because they do not require consensus sequences as input.

[Figure 3. Panel A: bar chart of consensus sequence error (axis: % error, 0% to 40%). Panel B: bar chart of symmetric Kullback–Leibler divergence (axis: 0 to 6).]

Fig. 3. Representative performance of de novo potential in predicting DNA-binding sites of 63 proteins. (A) Bar chart of the errors (fraction of incorrect bases) in consensus sequences predicted using de novo potential method. Each bar represents a single prediction made by the algorithm, with shorter bars corresponding to better predictions. Highlighted examples (pink bars) represent best, average, and worst cases, with insets comparing experimentally determined consensus sequences (Top) to predictions (Bottom). (B) SKLD scores (lower is better) for PWM predictions, with insets comparing experimentally determined PWMs (Top) to predictions (Bottom).

14822 ∣ www.pnas.org/cgi/doi/10.1073/pnas.1106460108 AlQuraishi and McAdams

protein:DNA crystal structures in our dataset has good incoherence properties (18, 19).

Discussion

Potential for Improving Performance. The protein–DNA-binding site predictions we report are an application of our de novo potential inference method, and they exhibit a dramatic improvement over the leading alternative methods, with predictions within the experimental error of the PWMs for at least half of the cases studied. Although the accuracy depicted in Fig. 3 is substantially better than that achieved with alternative methods, the predictions for the proteins in the lower quarter of Fig. 3 need to be improved. Several factors could account for these errors. The 63 protein–DNA complexes in our database may provide biased or insufficient coverage of some of the microscopic interactions. There could also be significant variance in the quality of the crystal structures determined for the different complexes in the database that we curated from the Protein Data Bank (10). Additionally, prediction errors might reflect the effects of other mechanisms that affect the shape and accessibility of the DNA in vivo so that the structure of the crystallized complex differs from the in vivo structure. Finally, some transcription factors have been observed to bind DNA with two or more distinct motifs (20), and the alternate motifs would be missing from our dataset.

Principled Selection of Crystallization Targets. As in other CS application domains, the accuracy of the inferred potentials depends on the characteristics of the sensor matrix: in this case, the collection of protein–DNA structures available. As discussed above, quantitative measures such as coherence can be used to assess the suitability of a sensor matrix for compressive sensing (5, 18, 19). These measures can provide a principled framework for selecting additional crystallization targets that will maximally enhance the sensing performance of a protein–DNA structural dataset. We showed above that our current dataset has good incoherence properties, yet we expect that the addition of more protein–DNA crystal structures, specifically chosen to yield a sensor matrix with even lower coherence, will yield more accurate energy potentials and better binding site predictions.

Advantages over Statistical Potentials. Although statistical potentials and our de novo potentials both use experimental datasets to derive the final energy potential, de novo potentials are nonparametric; i.e., they do not assume an underlying mathematical form. In contrast, statistical potentials rely on experimental datasets to fit a parametric model with a fixed mathematical form. De novo potentials overcome additional limitations specific to statistical potentials. First, although statistical potentials utilize only atomic data such as crystal structures to fit their parameters, de novo potentials combine structural information and experimental binding data into a single formulation for inference. In the field of protein–DNA-binding site prediction, combining these types of data has been a long-standing objective (21–24). Second, statistical potentials implicitly assume that all structures come from the same canonical ensemble. This oversimplification ignores the chain connectivity and amino acid composition of proteins, and it is thought to be the cause of common anomalies observed in statistical potentials (2). In de novo potentials, this assumption is eliminated in the energy-based formulation, and it is relaxed significantly in the probability-based formulation, so that only subsets of the data are assumed to form canonical ensembles (SI Text S1, Section 1). Third, by assigning equal weight to microscopic interactions observed in different structures, statistical potentials implicitly assume that different structures have the same binding or formation energy. This is not the case, as different protein–DNA complexes are known to have different binding affinities. From a statistical mechanical standpoint, high affinity complexes correspond to more frequent mesoscopic interactions than low affinity complexes, requiring that the underlying microscopic interactions be given proportionally greater weight. De novo potentials take this into account, as the binding energies of mesoscopic interactions are an explicit part of the formulation. Fourth, there is no theoretical assurance that as more data are added for the statistical potential fitting process, the inferred energies will ultimately converge to the true underlying interaction energies. The same observation holds for physical potentials. In contrast, for de novo potentials, if the formal requirements of compressed sensing are satisfied, then additional sensors in the dataset will lead to ever closer estimates of the interaction energies.

De Novo Potentials in Other Molecular Interaction Domains. The important insight that underlies our de novo methodology is that experimental datasets relating to mesoscopic interactions in a wide range of fields can be cast as incoherent measurements of microscopic interactions in the compressed sensing framework. In these cases, powerful compressed sensing methods can be used to determine de novo potentials. In the current protein–DNA application, microscopic interactions were defined to always involve one protein atom and one DNA atom, thus neglecting intra-DNA interactions. If microscopic interactions are defined to include noncovalent contacts between two DNA atoms, then the indirect readout component of protein–DNA interactions can also be modeled, capturing intra-DNA interactions.

Similarly, to infer a potential for protein–protein interactions or for protein folding, noncovalent contacts between protein atoms would be treated as the microscopic interactions. However, interactions need not be restricted to those involving pairs of atoms. Interactions involving multiple atoms, as well as coarse-grained potentials in which the “atoms” of the system are residues, for example, could be used. For mesoscopic measurements of protein–protein interactions, biochemical data on protein–protein binding kinetics can be used. The Protein–Protein Interaction Thermodynamic Database (PINT) is a database of such measurements (25). For mesoscopic measurements of protein folding, the kinetics or mean folding times of proteins are necessary. Although such measurements are difficult to obtain, significant progress in experimental techniques has been made recently (26–28). The advent of these recent experimental techniques promises to make such measurements more readily available in the future.

Fig. 4. Examples highlighting significant improvement in prediction quality between de novo potential and other leading potentials. (A) Experimental and predicted consensus sequences for the Drosophila melanogaster Ultrabithorax Hox protein (Left) and the Saccharomyces cerevisiae MATα2 (Right) protein are shown. (Top to Bottom) Experimental, de novo potential, Rosetta (direct readout only), Quasichemical, and DNAPROT (direct readout only). (B) Experimental and predicted PWMs for the Homo sapiens Pax6 Paired domain (Left) and D. melanogaster Engrailed homeodomain (Right) are shown. (Top to Bottom) Experimental, de novo potential, Rosetta, Cumulative Contacts, Quasichemical, and DNAPROT.

AlQuraishi and McAdams PNAS ∣ September 6, 2011 ∣ vol. 108 ∣ no. 36 ∣ 14823

ACKNOWLEDGMENTS. R. Altman, G. Bejerano, J. Boyd Kozdon, A. Deacon, S. Hong, V. Pande, K. Sachs, and L. Shapiro provided helpful comments. We thank E. Candes, T. Hastie, and M. Levitt for insightful discussions, A. Morozov for helpful advice on the Rosetta protein–DNA module, and K. Arya and G. Cooperman for customizing the DMTCP checkpointing software for our purposes. Wolfram Research provided the Mathematica software environment necessary for the analyses performed. This work was supported by Department of Energy Office of Science Grant DE-FG02-05ER64136 (to H.H.M.). We used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the US Department of Energy under Contract DE-AC02-05CH11231. M.A. was supported by the Stanford Genome Training Program (Grant T32 HG00044 from the National Human Genome Research Institute).

1. Jorgensen WL, Tirado-Rives J (2005) Potential energy functions for atomic-level simulations of water and organic and biomolecular systems. Proc Natl Acad Sci USA 102:6665–6670.

2. Zhou Y, Zhou HY, Zhang C, Liu S (2006) What is a desirable statistical energy function for proteins and how can it be obtained? Cell Biochem Biophys 46:165–174.

3. Ponder JW, Case DA (2003) Force fields for protein simulations. Adv Protein Chem 66:27–85.

4. Chartrand R, Baraniuk RG, Eldar YC, Figueiredo MAT, Tanner J (2010) Introduction to the issue on compressive sensing. IEEE J Sel Top Signal Process 4:241–243.

5. Candes EJ, Wakin MB (2008) An introduction to compressive sampling. IEEE Signal Process Mag 25:21–30.

6. Donald JE, Chen WW, Shakhnovich EI (2007) Energetics of protein-DNA interactions. Nucleic Acids Res 35:1039–1047.

7. Brändén C-I, Tooze J (1999) Introduction to Protein Structure (Garland, New York), 2nd Ed, p xiv.

8. Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Series B Stat Methodol 58:267–288.

9. Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33:1–22.

10. Berman HM, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242.

11. Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM (2009) A census of human transcription factors: Function, expression and evolution. Nat Rev Genet 10:252–263.

12. Lu XJ, Olson WK (2003) 3DNA: A software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res 31:5108–5121.

13. Lu XJ, Olson WK (2008) 3DNA: A versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nat Protoc 3:1213–1227.

14. Morozov AV, Havranek JJ, Baker D, Siggia ED (2005) Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res 33:5781–5798.

15. Morozov AV, Siggia ED (2007) Connecting protein structure with predictions of regulatory sites. Proc Natl Acad Sci USA 104:7068–7073.

16. Angarica VE, Perez AG, Vasconcelos AT, Collado-Vides J, Contreras-Moreira B (2008) Prediction of TF target sites based on atomistic models of protein-DNA complexes. BMC Bioinformatics 9:436.

17. Wintjens R, Rooman M (1996) Structural classification of HTH DNA-binding domains and protein-DNA interaction modes. J Mol Biol 262:294–313.

18. Tropp JA (2008) On the conditioning of random subdictionaries. Appl Comput Harmon Anal 25:1–24.

19. Candes EJ, Romberg J (2006) Quantitative robust uncertainty principles and optimally sparse decompositions. Found Comput Math 6:227–254.

20. Badis G, et al. (2009) Diversity and complexity in DNA recognition by transcription factors. Science 324:1720–1723.

21. Mirny LA, Gelfand MS (2002) Structural analysis of conserved base pairs in protein-DNA complexes. Nucleic Acids Res 30:1704–1711.

22. Hoglund A, Kohlbacher O (2004) From sequence to structure and back again: Approaches for predicting protein-DNA binding. Proteome Sci 2:3.

23. Eisen M (2005) All motifs are NOT created equal: Structural properties of transcription factor-DNA interactions and the inference of sequence specificity. Genome Biol 6:P7.

24. Moroni E, Caselle M, Fogolari F (2007) Identification of DNA-binding protein target sequences by physical effective energy functions: Free energy analysis of lambda repressor-DNA complexes. BMC Struct Biol 7:61.

25. Kumar MD, Gromiha MM (2006) PINT: Protein-Protein Interactions Thermodynamic Database. Nucleic Acids Res 34:D195–198.

26. Vendruscolo M, Paci E (2003) Protein folding: Bringing theory and experiment closer together. Curr Opin Struct Biol 13:82–87.

27. Oliveberg M, Wolynes PG (2005) The experimental survey of protein-folding energy landscapes. Q Rev Biophys 38:245–288.

28. Mello CC, Barrick D (2004) An experimentally determined protein folding energy landscape. Proc Natl Acad Sci USA 101:14102–14107.

Supporting Information

AlQuraishi and McAdams 10.1073/pnas.1106460108

SI Methods

SI Methods describe the general methodology for de novo potential determination using compressed sensing, the determination of protein–DNA potentials, and the prediction of protein–DNA-binding sites and testing methodology. There are three sections. Section I contains the derivation of the general methodology for de novo potential determination using compressed sensing. The formulation in Section I is not specific to a particular application, but is applicable to a range of biological and chemical systems. The mathematical notation and terminology used throughout are introduced in Section I.

This work brings together disparate concepts from information theory, statistical mechanics, and structural biology. The following background references may be useful to the reader:

• Linear and logistic regression (1).
• Compressed sensing (2, 3).
• Statistical mechanical ensembles (4, 5).
• Structural basis of protein–DNA interactions (6, 7).

Section II describes the application of the general methodology from Section I to the determination of a de novo protein–DNA potential. We adapt the abstract constructs of the general methodology to the specifics of protein–DNA interactions and introduce several modifications that exploit the unique properties of protein–DNA interactions. Our choices for metaparameters and implementation details are also described in Section II.

Section III contains a description of the use of the protein–DNA potentials described in Section II to predict protein–DNA-binding sites. We detail our structure-based approach to protein–DNA-binding site prediction, the dataset used for training and testing, and the quantitative metrics used to compare results between our de novo potential method and previously published methods.

I. General De Novo Potential Determination Using Compressed Sensing. This section introduces our general methodology for inferring potentials using compressed sensing. First, we show how the energies of microscopic interactions can be determined from measurements of the interaction energies or relative probabilities of mesoscopic interactions by reformulating potential inference as one of two possible regression problems: (i) linear regression when the measurements are energies of mesoscopic interactions or (ii) multinomial logistic regression when the measurements are the relative probabilities of mesoscopic interactions. Next, having reformulated the potential determination problem as a regression problem, we show how compressed sensing methods can be employed to infer a de novo potential using a relatively small number of these experimental measurements.

Notation and preliminary assumptions. $I$ is the set of all possible microscopic interactions. A microscopic interaction may be as simple as two atoms existing within a predefined distance of one another or as complex as a system of multiple molecules in conjunction with the solvent. Note that an interaction may depend on the distance between the interacting elements, in which case the same elements interacting at different distances would correspond to different interactions in our formulation. Distance may be defined in terms of discrete distance ranges, or bins. For each microscopic interaction $i \in I$, we denote its energy by $e_i$.

$K$ is the set of mesoscopic interactions. Intuitively, a mesoscopic interaction $k \in K$ is a large-scale interaction in which many microscopic interactions combine. For example, a protein binding DNA is a mesoscopic interaction composed of all the energetic interactions occurring between protein and DNA atoms. Formally, we associate with every mesoscopic interaction $k \in K$ a vector $C_k \in \mathbb{Z}_{\ge 0}^{|I|}$ that contains, for each $i \in I$, a count $C_{k,i}$ of the number of times that microscopic interaction $i$ is observed in mesoscopic interaction $k$. For microscopic interactions unobserved in $k \in K$, the corresponding counts are set to 0.

We denote the energy of a mesoscopic interaction $k$ by $E_k$ and define it as

$$E_k = \sum_{i \in I} e_i C_{k,i}.$$

This formalizes the notion of a mesoscopic interaction by defining the energy of a mesoscopic interaction $k$ to be the sum of the energies of its constituent microscopic interactions.

Inference using mesoscopic interaction energies. Given a set of mesoscopic interactions $K$ for which we know the constituent microscopic interactions, and a corresponding set of energies $\{E_k\}_{k \in K}$, linear regression (1) can be used to infer the energies $\{e_i\}_{i \in I}$ of the microscopic interactions. The mapping is direct (Fig. S2). We treat $E_k$ as the response variable $y_k$, and $C_{k,i}$ as the input variable $x_{k,i}$. In linear regression,

$$y_k = \sum_{i \in I} \beta_i x_{k,i},$$

thus the inferred coefficients $\{\beta_i\}_{i \in I}$ will be the microscopic interaction energies $\{e_i\}_{i \in I}$. Alternatively, in matrix notation, we set the energy vector $E = \{E_k\}_{k \in K}$ to equal the response vector $y = \{y_k\}_{k \in K}$, and the counts matrix $C = \{C_{k,i}\}_{k \in K, i \in I}$ to equal the design matrix $X = \{x_{k,i}\}_{k \in K, i \in I}$, to obtain the standard linear regression relationship $y = X\beta$. Within the compressed sensing framework, $y$ is the set of compressive measurements, $X$ is the sensor matrix (set of sensors), and $\beta$, the underlying microscopic potential, is the signal.
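The energy-based mapping can be illustrated with a small numerical sketch. The counts matrix, the "true" energies, and all variable names below are invented for illustration, and ordinary least squares is used in place of any regularized fit; the sketch only shows that with enough noise-free measurements the coefficients of $y = X\beta$ coincide with the microscopic energies.

```python
import numpy as np

rng = np.random.default_rng(0)

n_micro = 5                                        # |I|, toy size (assumed)
e_true = np.array([-1.5, 0.0, 0.7, -0.3, 2.0])     # hypothetical energies e_i

# Counts matrix C[k, i]: identity rows guarantee full column rank in this toy;
# the remaining rows are random small counts.
C = np.vstack([np.eye(n_micro),
               rng.integers(0, 4, size=(7, n_micro))]).astype(float)

# Mesoscopic energies E_k = sum_i e_i * C[k, i].
E = C @ e_true

# Linear regression y = X beta with y = E and X = C.
e_hat, *_ = np.linalg.lstsq(C, E, rcond=None)
```

Here there are more measurements (12) than unknowns (5), so least squares suffices; the compressed sensing setting discussed later concerns the opposite regime.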

Inference using relative probabilities of mesoscopic interactions. If the absolute energies of mesoscopic interactions are unknown, but their relative probabilities are known, then a constrained version of multinomial logistic regression can be shown to infer the energies of the microscopic interactions. A key feature of our formulation is the use of a specific type of statistical mechanical ensemble, the canonical ensemble (4). An ensemble is a collection of physical states, for example, the different structural conformations of a protein. In canonical ensembles, the temperature, volume, and number of particles are assumed identical across the states, but the energy may vary between states. In our formulation, different mesoscopic interactions correspond to different states of a canonical ensemble. For the protein–DNA-binding problem, this corresponds to the protein binding to alternate DNA sequences with different binding energies. The probabilities of finding a system in each of the different states of a canonical ensemble follow the Boltzmann distribution (4). We can exploit this property to reformulate potential determination as an instance of multinomial logistic regression. In this formulation, the relative probabilities of mesoscopic interactions are used as experimental data. By relative probabilities we mean that the probability of a mesoscopic interaction only has to be normalized with respect to two or more mesoscopic interactions within the same canonical ensemble. Multiple distinct canonical ensembles can be used, and so not all mesoscopic interactions need to come from the same canonical ensemble. We first derive the general formulation, and then we show its formal equivalence to multinomial logistic regression under appropriate constraints.

Formally, let $K$ be the set of mesoscopic interactions for which we have measurements (described below). We induce a partition $\{K_1, \ldots, K_L\}$ on $K$ such that mesoscopic interactions within a block $K_l$ come from the same canonical ensemble, and there are $L$ distinct canonical ensembles. In other words, for a fixed $l \le L$, for all $k \in K_l$, the temperature, volume, and number of particles are assumed fixed. Note that $k$ is now being used to refer to a mesoscopic interaction within a given canonical ensemble.

For each canonical ensemble $K_l$, we require measurements corresponding to the relative probabilities of every measured mesoscopic interaction $k \in K_l$ in the ensemble. In other words, the measured probabilities $\{p_k\}_{k \in K_l}$ obey

$$\sum_{k \in K_l} p_k = 1$$

for all $l \le L$. Because by definition all mesoscopic interactions $k \in K_l$ come from the same canonical ensemble, the measured probability $p_k$ of a given mesoscopic interaction $k$ is described by the Boltzmann distribution (4):

$$p_k = \frac{1}{Z_l} e^{-\alpha_l E_k},$$

where $E_k$ is the energy of the mesoscopic interaction as previously defined, $Z_l$ is the partition function of the canonical ensemble $K_l$, and $\alpha_l$ is the inverse temperature parameter for the ensemble (we depart from the standard convention of using $\beta$ for the inverse temperature to avoid confusion with the notation for regression coefficients).

Given a set of mesoscopic interactions $K$ for which we know the constituent microscopic interactions, and a partitioning of the mesoscopic interactions such that the probabilities $\{p_k\}_{k \in K_l}$ are known for all $l \le L$, we can infer the energies of the microscopic interactions. Recall the previous definition of the energy $E_k$, and obtain

$$p_k = \frac{1}{Z_l} e^{-\alpha_l E_k} = \frac{1}{Z_l} e^{-\alpha_l \sum_{i \in I} e_i C_{k,i}}.$$

Expanding the partition function yields

$$p_k = \frac{e^{-\alpha_l \sum_{i \in I} e_i C_{k,i}}}{\sum_{j \in K_l} e^{-\alpha_l \sum_{i \in I} e_i C_{j,i}}}.$$

Multiplying the numerator and denominator by $e^{\alpha_l \sum_{i \in I} e_i C_{k,i}}$, we obtain

$$p_k = \frac{e^{-\alpha_l \sum_{i \in I} e_i C_{k,i}}}{\sum_{j \in K_l} e^{-\alpha_l \sum_{i \in I} e_i C_{j,i}}} \cdot \frac{e^{\alpha_l \sum_{i \in I} e_i C_{k,i}}}{e^{\alpha_l \sum_{i \in I} e_i C_{k,i}}} = \frac{1}{\sum_{j \in K_l} e^{\alpha_l \sum_{i \in I} e_i (C_{k,i} - C_{j,i})}}.$$

This yields a system of equations that can be solved for $\{e_i\}_{i \in I}$. The inverse temperature parameters $\{\alpha_l\}_{1 \le l \le L}$ may either be treated as free parameters or set from the experimentally known temperatures of the mesoscopic interactions.
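As a sanity check on the algebra, the Boltzmann form and the count-difference form of $p_k$ can be compared numerically. The sketch below uses an invented ensemble of four mesoscopic interactions over three microscopic interaction types; the energies, counts, and function names are illustrative assumptions only.

```python
import math

def boltzmann_probs(ensemble_counts, energies, alpha):
    """p_k = exp(-alpha * E_k) / Z_l, with E_k = sum_i e_i * C[k][i]."""
    weights = [math.exp(-alpha * sum(e * c for e, c in zip(energies, counts)))
               for counts in ensemble_counts]
    z = sum(weights)  # partition function of this ensemble
    return [w / z for w in weights]

def difference_form_probs(ensemble_counts, energies, alpha):
    """p_k = 1 / sum_j exp(alpha * sum_i e_i * (C[k][i] - C[j][i]))."""
    probs = []
    for ck in ensemble_counts:
        denom = sum(
            math.exp(alpha * sum(e * (a - b) for e, a, b in zip(energies, ck, cj)))
            for cj in ensemble_counts
        )
        probs.append(1.0 / denom)
    return probs

energies = [-1.0, 0.5, 0.2]                                # hypothetical e_i
ensemble = [[2, 0, 1], [1, 1, 0], [0, 3, 1], [1, 0, 2]]    # hypothetical counts C[k][i]
alpha = 1.0                                                # assumed inverse temperature

p_boltzmann = boltzmann_probs(ensemble, energies, alpha)
p_difference = difference_form_probs(ensemble, energies, alpha)
```

The two lists agree to floating-point precision, and the partition function never has to be known in absolute terms because only count differences within an ensemble enter the second form.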

Instead of directly solving the above systems of equations, we recast the problem as a constrained version of multinomial regression. For this formulation, the inverse temperature parameters have to be known a priori. For biological applications we assume that interactions occur under “standard biological conditions” (7). Furthermore, without loss of generality, we assume that the set $K$ is partitioned into blocks of equal cardinality $M$ and indexed such that the first $M$ mesoscopic interactions are in $K_1$, the second $M$ mesoscopic interactions are in $K_2$, and so forth, yielding $\{k\}_{1 \le k \le M} = K_1$, $\{k\}_{M+1 \le k \le 2M} = K_2$, $\ldots$, $\{k\}_{M(L-1)+1 \le k \le ML} = K_L$. No loss of generality results from such a partitioning because zero-probability mesoscopic interactions (e.g., all atoms are in the same position) can be used to pad canonical ensembles to be of equal size. Using this partitioning, each canonical ensemble $K_l$ can then be treated as a single data point in an $M$-class multinomial regression problem. In the expressions that follow, $I$ also denotes the number of microscopic interactions, $|I|$. The regression input vectors $\{X_l \in \mathbb{R}^{I(M-1)}\}_{1 \le l \le L}$ have the form

$$X_l = \{\alpha_l (C_{Ml,i} - C_{M(l-1)+m,i})\}_{i \in I,\, 1 \le m \le M-1}.$$

Expanded:

$$X_l = \{\alpha_l (C_{Ml,1} - C_{M(l-1)+1,1}), \ldots, \alpha_l (C_{Ml,I} - C_{M(l-1)+1,I}), \alpha_l (C_{Ml,1} - C_{M(l-1)+2,1}), \ldots, \alpha_l (C_{Ml,I} - C_{Ml-1,I})\}.$$

The design matrix is $\{X_l\}_{1 \le l \le L} = X \in \mathbb{R}^{L \times I(M-1)}$. The output vectors $\{Y_l \in \mathbb{R}^{M-1}\}_{1 \le l \le L}$ have the following form:

$$Y_l = \{p_{M(l-1)+m}\}_{1 \le m \le M-1}.$$

Expanded:

$$Y_l = \{p_{M(l-1)+1}, \ldots, p_{Ml-1}\}.$$

The output matrix is $\{Y_l\}_{1 \le l \le L} = Y \in \mathbb{R}^{L \times (M-1)}$. Using this formulation, the multinomial logistic regression problem becomes

$$Y_{l,m} = \frac{e^{X_l \beta_m}}{1 + \sum_{n=1}^{M-1} e^{X_l \beta_n}} = \frac{e^{\sum_{p=1}^{I(M-1)} \beta_{m,p} X_{l,p}}}{1 + \sum_{n=1}^{M-1} e^{\sum_{p=1}^{I(M-1)} \beta_{n,p} X_{l,p}}},$$

where $\beta_{m,p}$ is the $p$th coefficient for the $m$th class in the regression coefficient matrix, $X_l$ is indexed as previously described, and $Y_{l,m}$ is the output of the $m$th class for the $l$th data point. The $M$th class is treated as the comparison category in the regression. We now introduce two constraints that, when enforced, make the statistical mechanical model derived earlier and multinomial logistic regression equivalent. The two constraints are

$$\forall\, m \in [1, M-1],\; p \notin [I(m-1)+1, Im]:\quad \beta_{m,p} = 0$$

$$\forall\, m, n \in [1, M-1],\; i \in [1, I]:\quad \beta_{m,I(m-1)+i} = \beta_{n,I(n-1)+i}.$$

The first constraint leads to cancellations that yield

$$Y_{l,m} = \frac{e^{\sum_{p=I(m-1)+1}^{Im} \beta_{m,p} X_{l,p}}}{1 + \sum_{n=1}^{M-1} e^{\sum_{p=I(n-1)+1}^{In} \beta_{n,p} X_{l,p}}}.$$

Using the second constraint and simplifying the notation by setting $\beta_{m,I(m-1)+i} = \beta'_i$ for all $m$ and $i$, we obtain

$$Y_{l,m} = \frac{e^{\sum_{i=1}^{I} \beta'_i X_{l,I(m-1)+i}}}{1 + \sum_{n=1}^{M-1} e^{\sum_{i=1}^{I} \beta'_i X_{l,I(n-1)+i}}}.$$

Inserting the values for $X_l$ as previously defined,

$$Y_{l,m} = \frac{e^{\sum_{i=1}^{I} \beta'_i \alpha_l (C_{Ml,i} - C_{M(l-1)+m,i})}}{1 + \sum_{n=1}^{M-1} e^{\sum_{i=1}^{I} \beta'_i \alpha_l (C_{Ml,i} - C_{M(l-1)+n,i})}}.$$

Dividing the numerator and denominator by the expression in the numerator, we obtain

$$Y_{l,m} = \frac{1}{e^{\sum_{i=1}^{I} \beta'_i \alpha_l (C_{M(l-1)+m,i} - C_{Ml,i})} + \sum_{n=1}^{M-1} e^{\sum_{i=1}^{I} \beta'_i \alpha_l (C_{Ml,i} - C_{M(l-1)+n,i} - C_{Ml,i} + C_{M(l-1)+m,i})}}$$

$$Y_{l,m} = \frac{1}{e^{\sum_{i=1}^{I} \beta'_i \alpha_l (C_{M(l-1)+m,i} - C_{Ml,i})} + \sum_{n=1}^{M-1} e^{\sum_{i=1}^{I} \beta'_i \alpha_l (C_{M(l-1)+m,i} - C_{M(l-1)+n,i})}}$$

$$Y_{l,m} = \frac{1}{\sum_{n=1}^{M} e^{\alpha_l \sum_{i=1}^{I} \beta'_i (C_{M(l-1)+m,i} - C_{M(l-1)+n,i})}}.$$

Finally, by letting $k$ be the $m$th mesoscopic interaction of the $l$th partition of $K$, and noting that $\forall\, l$: $K_l = [M(l-1)+1, Ml]$, we can rewrite the above expression as

$$Y_{l,m} = \frac{1}{\sum_{j \in K_l} e^{\alpha_l \sum_{i=1}^{I} \beta'_i (C_{k,i} - C_{j,i})}},$$

which yields the sought equivalence of the statistical mechanical model to multinomial logistic regression, where $p_k = Y_{l,m}$ and the inferred coefficient $\beta'_i = e_i$ for all $i \in I$. Thus using constrained logistic regression with the previously defined input vectors $\{X_l \in \mathbb{R}^{I(M-1)}\}_{1 \le l \le L}$ and output vectors $\{Y_l \in \mathbb{R}^{M-1}\}_{1 \le l \le L}$ will yield the energies of all microscopic interactions.
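The equivalence can also be checked numerically for a single ensemble. The sketch below builds the input vector $X_l$ from count differences against the $M$th (comparison) interaction, builds a coefficient matrix that satisfies both constraints, and compares the multinomial outputs with Boltzmann probabilities. The ensemble, energies, and helper names are hypothetical; this verifies the forward model only and does not fit anything.

```python
import math

def boltzmann_probs(ensemble_counts, energies, alpha):
    """p_k = exp(-alpha * E_k) / Z_l with E_k = sum_i e_i * C[k][i]."""
    weights = [math.exp(-alpha * sum(e * c for e, c in zip(energies, counts)))
               for counts in ensemble_counts]
    z = sum(weights)
    return [w / z for w in weights]

def regression_input(ensemble_counts, alpha):
    """X_l: alpha * (C_ref - C_m) for m = 1..M-1, concatenated class by class."""
    ref = ensemble_counts[-1]  # the M-th interaction is the comparison category
    return [alpha * (r - c)
            for counts in ensemble_counts[:-1]
            for r, c in zip(ref, counts)]

def constrained_beta(shared, n_classes):
    """Row m is zero outside block m (constraint 1); every block equals `shared` (constraint 2)."""
    n_micro = len(shared)
    rows = []
    for m in range(n_classes - 1):
        row = [0.0] * (n_micro * (n_classes - 1))
        row[n_micro * m:n_micro * (m + 1)] = shared
        rows.append(row)
    return rows

def multinomial_outputs(x, beta):
    """Y_{l,m} = exp(x . beta_m) / (1 + sum_n exp(x . beta_n))."""
    scores = [math.exp(sum(b * v for b, v in zip(row, x))) for row in beta]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores]

energies = [-1.0, 0.5, 0.2]                                # hypothetical e_i
ensemble = [[2, 0, 1], [1, 1, 0], [0, 3, 1], [1, 0, 2]]    # M = 4 interactions
alpha = 1.0                                                # assumed inverse temperature

y = multinomial_outputs(regression_input(ensemble, alpha),
                        constrained_beta(energies, len(ensemble)))
p = boltzmann_probs(ensemble, energies, alpha)
```

With the shared coefficients set to the energies, `y` reproduces the first $M-1$ Boltzmann probabilities in `p`; the probability of the comparison category is the remainder.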

This is a key result in our formulation. Once we recognize that the de novo potential determination problem can be cast as a regression problem, we can consider applying the methods of compressed sensing (2) to the regression problem.

Applying compressed sensing to de novo potential determination. Compressed sensing is a theoretical and applied framework that enables the recovery of a signal with high accuracy from relatively few measurements. In the context of our regression formulation, the signal is the set of microscopic interaction energies $\{e_i\}_{i \in I}$, the sensors are mesoscopic interactions such as protein–DNA or protein–protein binding events, and the measurements are the energies or relative probabilities of the mesoscopic interactions. Compressed sensing enables the use of relatively few measurements to infer the energies of the microscopic interactions underlying the mesoscopic interactions.

The application of compressed sensing to regression is achieved by imposing an $\ell_1$ regularization penalty on the regression problem (8). Several compressed sensing approaches have been developed to accomplish this task. Here, we utilize the lasso penalty for linear regression (8) and its generalization for logistic regression (9). Other possibilities include the Dantzig selector (10) for linear regression and suitable generalizations for logistic regression (11).
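To make the role of the $\ell_1$ penalty concrete, here is a minimal sketch of lasso-penalized linear regression recovering a sparse signal from fewer measurements than unknowns. It uses iterative soft thresholding (ISTA), a simple solver swapped in for illustration rather than the coordinate descent method cited above, and a random Gaussian sensor matrix, which is an assumption of this toy and unlike the fixed counts matrix of the protein–DNA application. All names and parameter values are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)             # gradient of the least-squares term
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(1)
n_meas, n_signal = 40, 100                   # fewer measurements than unknowns
X = rng.standard_normal((n_meas, n_signal)) / np.sqrt(n_meas)

e_true = np.zeros(n_signal)                  # sparse "potential": 3 nonzero energies
e_true[[3, 27, 64]] = [2.0, -1.5, 1.0]
y = X @ e_true                               # noise-free measurements

e_hat = lasso_ista(X, y, lam=0.02)
support = set(np.argsort(-np.abs(e_hat))[:3].tolist())
```

The estimate is slightly shrunk toward zero by the penalty, which is the usual price of the lasso; the largest entries nevertheless identify the nonzero interactions.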

Our formulation casts potential determination as a compressed sensing problem. Accurate inference of energy potentials ultimately depends on two characteristics of the particular application: the sparsity of the microscopic potential to be inferred and the coherence of the mesoscopic measurements used for inference. The compressed sensing literature contains many theoretical guarantees about the expected number of measurements required, denoted by $m$, given a signal of length $n$ with $s$ nonzero entries and a sensor matrix $X$ (the design matrix in regression) with coherence $\mu$. The majority of these results concern sensor matrices with some degree of randomness, such as the guarantees in Candes and Plan (12). In our specific application, the sensor matrix is fixed and defined by the set of protein:DNA structures in the dataset. Theoretical guarantees for fixed matrices are discussed in the papers by Candes and Romberg (13) and Tropp (14).

In general, most results relate $m$ to an increasing function of $n$, $s$, and $\mu$. In our context $n$ is the total number of possible microscopic interactions, $s$ is the number of microscopic interactions $I' \subset I$ whose energies are nonnegligible, i.e., for all $i \in I'$, $|e_i| \gg 0$, and $\mu$ is the coherence of the sensor matrix $X$. Thus, if only a small number of the microscopic interactions are nonnegligible (i.e., $s$ is small), then only a small number of measurements are needed for accurate inference of the potential. The role that $\mu$ plays is slightly more subtle (the formal definition of $\mu$ can be found in ref. 2). Intuitively, coherence relates to the diversity of microscopic interactions that constitute each mesoscopic interaction. The greater the variety of microscopic interactions participating in each experimentally characterized mesoscopic interaction, the smaller the coherence. Thus, to minimize the number of experimental measurements needed, one needs to minimize the coherence of the sensor matrix. This implies that the mesoscopic interactions must incorporate as wide a range of microscopic interactions as possible. However, the degree to which the coherence of mesoscopic interactions can be controlled is dependent on the application. For biological systems, the choice of biomolecules comprising the set of mesoscopic interactions directly impacts the coherence of the sensor matrix.

In particular, for the protein–DNA application, the coherence of the sensor matrix is dependent on the set of protein–DNA complexes used as mesoscopic interactions. For a given set of protein–DNA complexes whose atomic structures have been determined, the coherence $\mu$ can be computed from the counts of atomic interactions present in the structures (2). This provides a principled approach for choosing new protein–DNA complexes for crystallization, based on their predicted atomic structures and the resulting value for $\mu$. By carefully selecting protein–DNA complexes that yield the smallest predicted coherence of the sensor matrix, the accuracy of the inferred protein–DNA potential can be increased with a small number of targeted crystallization experiments.
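The selection idea can be sketched with one common coherence measure, the mutual coherence of a matrix: the largest absolute normalized inner product between two distinct columns. This is an assumption made for illustration; the definition the authors rely on is the one in their ref. 2 and may differ. The counts and function names below are hypothetical, and a candidate complex is represented simply as a new row of counts.

```python
import math

def mutual_coherence(X):
    """Largest |<x_a, x_b>| / (||x_a|| * ||x_b||) over distinct columns a, b of X."""
    n_cols = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_cols)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    mu = 0.0
    for a in range(n_cols):
        for b in range(a + 1, n_cols):
            if norms[a] == 0 or norms[b] == 0:
                continue  # an interaction never observed carries no information
            dot = sum(u * v for u, v in zip(cols[a], cols[b]))
            mu = max(mu, abs(dot) / (norms[a] * norms[b]))
    return mu

def best_candidate(X, candidates):
    """Pick the candidate row whose addition gives the lowest mutual coherence."""
    return min(candidates, key=lambda row: mutual_coherence(X + [row]))

# Two existing structures whose counts make the two interaction types indistinguishable.
existing = [[1, 1], [1, 1]]
# Hypothetical predicted counts for three candidate complexes.
candidates = [[1, 1], [1, 0], [4, 0]]
chosen = best_candidate(existing, candidates)
```

In this toy the candidate that observes one interaction type many times without the other separates the two columns most, so it lowers the coherence the furthest.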

II. De Novo Determination of Protein–DNA Potential. In this section the general methodology derived in Section I will be applied to the specific problem of determining protein–DNA potentials. Several different approaches to potential determination will be discussed. We first delineate the details common to all inferred protein–DNA potentials and then describe the baseline potential and modified potentials that relax physical constraints.

Common details of all protein–DNA potentials. In all protein–DNA potentials in this section, the microscopic interactions are defined as pairwise contacts between DNA and protein atoms at specified distance ranges. Intra-DNA interactions, known as indirect readout, are ignored, as are intraprotein interactions. We use the atom classification scheme of the Quasichemical potential (15), which takes into consideration the chemical identity of the atom as well as its local moiety. In this scheme there are 37 DNA atom types and 27 protein atom types. Contact distances are categorized into bins of fixed width (see Metaparameters below). Thus, a model with $D$ distance bins contains a total of $37 \times 27 \times D$ interaction types. The potential inference problem is to determine the energy of each of these microscopic interaction types using experimental measurements of mesoscopic interactions, specifically protein–DNA-binding events. As described below, these binding events will be based on protein–DNA crystal structures and their associated binding energies.
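One way to picture the resulting feature space is as a flat index over (DNA atom type, protein atom type, distance bin). The sketch below counts contacts into such a vector. The atom-type counts 37 and 27 come from the text; the bin width, the number of bins, the integer encoding of atom types, and all function names are assumptions for illustration, since the metaparameters are specified elsewhere.

```python
N_DNA_TYPES = 37       # DNA atom types (from the text)
N_PROTEIN_TYPES = 27   # protein atom types (from the text)

def n_interaction_types(n_bins):
    """Total number of microscopic interaction types for D = n_bins distance bins."""
    return N_DNA_TYPES * N_PROTEIN_TYPES * n_bins

def distance_bin(distance, bin_width, n_bins):
    """Index of the fixed-width bin containing `distance`, or None if out of range."""
    b = int(distance // bin_width)
    return b if 0 <= b < n_bins else None

def interaction_index(dna_type, protein_type, distance, bin_width, n_bins):
    """Flat index of one contact; atom types are assumed to be 0-based integers."""
    b = distance_bin(distance, bin_width, n_bins)
    if b is None:
        return None
    return (dna_type * N_PROTEIN_TYPES + protein_type) * n_bins + b

def count_vector(contacts, bin_width, n_bins):
    """Counts C_k for one complex from (dna_type, protein_type, distance) contacts."""
    counts = [0] * n_interaction_types(n_bins)
    for dna_type, protein_type, distance in contacts:
        idx = interaction_index(dna_type, protein_type, distance, bin_width, n_bins)
        if idx is not None:
            counts[idx] += 1
    return counts

# Hypothetical contacts: two fall in the same bin, one lies beyond the last bin.
contacts = [(0, 0, 0.5), (0, 0, 0.7), (36, 26, 2.5), (5, 5, 9.0)]
c_k = count_vector(contacts, bin_width=1.0, n_bins=3)
```

Contacts beyond the last bin are simply ignored here, which amounts to an assumed distance cutoff.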

For this inference problem, the multinomial logistic regression approach derived in Section I is used. The canonical ensembles are created by fixing the protein and varying the DNA sequence in silico by substitution of the alternative nucleotides at each site in the interaction region, as described in the next paragraph. We make the conventional assumption of independence of the DNA base pair positions in the interaction region, and we treat each set of protein–DNA base pair complexes derived from separate protein–DNA crystal structures as a separate canonical ensemble. Thus each ensemble comprises four mesoscopic interactions, corresponding to one protein binding to each of the four possible nucleotide base pairs. Previous studies have provided justification for treating such ensembles as canonical ensembles (16).

The experimental data we use to infer the microscopic energies for the protein–DNA-binding problem are crystallized protein:DNA structures in conjunction with their experimentally determined DNA-binding motifs. A single protein–DNA-binding event corresponds to a protein–DNA base pair complex. For the regression input vectors, the counts of the interaction types as defined above are required for each protein–DNA-binding event. We obtain these counts from the atomic coordinates of protein and DNA atoms in the X-ray crystal structures of protein:DNA complexes. For each protein:DNA crystal complex, we generate in silico structural mutants so that every base pair in each DNA position is in silico mutated to every possible alternative nucleotide. (For example, if there is an “A” at a particular position, we construct the three alternative structures with “T,” “G,” and “C” at that position.) Thus, for each protein:DNA complex with a DNA sequence of length N, we obtain a total of 4N structures which are grouped into N canonical ensembles as previously described.
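The bookkeeping behind the 4N structures can be sketched at the sequence level. This sketch only enumerates the mutated sequences; building the corresponding mutated atomic structures is a separate modeling step that is not shown. The function name is hypothetical.

```python
NUCLEOTIDES = "ACGT"

def in_silico_ensembles(sequence):
    """Return N ensembles, each holding the four sequences that differ
    only at one position (the native base plus its three substitutions)."""
    ensembles = []
    for pos in range(len(sequence)):
        ensembles.append(
            [sequence[:pos] + nt + sequence[pos + 1:] for nt in NUCLEOTIDES]
        )
    return ensembles
```

For a sequence of length N this yields N ensembles of four members each, so the native sequence appears once in every ensemble and the total member count is 4N.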

For the regression output vectors, the relative probabilities of the mesoscopic interaction events within each canonical ensemble are required. These relative probabilities correspond to the probabilities with which a protein binds the four DNA nucleotides at a given position. When inferring a protein–DNA potential to predict consensus sequences from the crystal structures, we set the probability of binding to the nucleotide observed in the structure to 1 and the probability of binding to any other nucleotide to 0. When inferring a protein–DNA potential for predicting position weight matrices (PWMs), the probabilities are derived from experimentally determined PWMs that are manually curated for each protein:DNA complex in our dataset (see Dataset below). A PWM is a set of probability distributions, one for each base pair position in the DNA-binding site, over the four possible DNA nucleotides. The probability we assign to each protein–DNA-binding event is the weight of that event given in the experimental PWM.
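The two kinds of regression targets described above can be sketched as follows. The function names are hypothetical, and the normalization step in `pwm_target` is an assumption added so that raw weights or frequencies can be passed in.

```python
NUCLEOTIDES = "ACGT"

def consensus_target(observed):
    """Target for consensus inference: 1 for the observed base, 0 otherwise."""
    return [1.0 if nt == observed else 0.0 for nt in NUCLEOTIDES]

def pwm_target(pwm_column):
    """Target for PWM inference: the experimental weights at one position.

    pwm_column: dict mapping each nucleotide to a non-negative weight.
    """
    total = sum(pwm_column[nt] for nt in NUCLEOTIDES)
    return [pwm_column[nt] / total for nt in NUCLEOTIDES]
```

Each canonical ensemble receives one such four-element target, ordered A, C, G, T in this sketch.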

All regressions are regularized using the lasso penalty (see Applying compressed sensing to de novo potential determination above). Additionally, we assume that all protein–DNA interactions occur under standard biological conditions so that there is a common temperature across all ensembles. Because the temperature is fixed, it can be factored out of the inferred microscopic energies, and so we set it equal to 1 to simplify the calculation.

Metaparameters. All protein–DNA potentials include two spatial metaparameters that require fitting: binning width and cutoff distance. Binning width corresponds to the resolution at which contact distances are discretized. Cutoff distance is the distance beyond which interactions are ignored. A third parameter, known as the regularization parameter λ (1), adjusts the expected degree of sparsity of the microscopic potential (see Applying compressed sensing to de novo potential determination above). Because the sparsity of the microscopic potential is not known a priori, this parameter must also be fitted.

To find the optimal combination of binning width, cutoff distance, and λ, we vary these three metaparameters and test the resulting potential against a criterion that minimizes prediction error (see Testing Methodology in Section III). Binning width is varied from 0.6 to 3 Å, in steps of 0.1 Å. Cutoff distance is varied from 2 to 20 Å, in steps equal to the binning width. For λ, the maximum value considered is the smallest value such that all input features in the model are weighted 0. The minimum value considered is equal to $20^{-3}$ times the maximum value. This is a standard approach that ensures that sufficiently small values are considered to include the regime where the model overfits (1).
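The search grid can be sketched as below. Several details are assumptions rather than statements from the text: cutoffs are read literally as 2 Å plus whole multiples of the binning width, the λ values are log-spaced between the two endpoints (as is common for lasso paths), the number of λ values is arbitrary, and the minimum ratio is taken as 20 to the power −3. Function names are hypothetical.

```python
def spatial_grid():
    """(binning width, cutoff distance) pairs in Å.

    Integer tenths of an ångström are used internally to avoid
    floating-point drift when stepping.
    """
    grid = []
    for w10 in range(6, 31):       # widths 0.6 .. 3.0 Å in 0.1 Å steps
        c10 = 20                   # cutoffs start at 2 Å
        while c10 <= 200:          # and do not exceed 20 Å
            grid.append((w10 / 10, c10 / 10))
            c10 += w10             # step equal to the binning width
    return grid

def lambda_path(lambda_max, n_values=100, min_ratio=20 ** -3):
    """Log-spaced λ values from lambda_max down to min_ratio * lambda_max."""
    return [lambda_max * min_ratio ** (k / (n_values - 1)) for k in range(n_values)]
```

Each combination of a grid entry and a λ value defines one candidate potential to be trained and evaluated.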

Model solving. We perform ℓ1-regularized multinomial logistic regression using the R-based glmnet package (9). Solutions for the constrained regression problem we derived earlier are feasible for the unconstrained regression problem. Consequently, the constraints are not enforced in the implementation as the unconstrained optimization is more computationally efficient and enforcing the constraints is unlikely to improve performance.

Variants of protein–DNA potentials. The previous subsection described details common to all protein–DNA potentials. We infer three variants of protein–DNA potentials in this work. This subsection will describe the specific choices and motivations for each variant.

Variant 1: The baseline. The baseline protein–DNA potential uses differences of counts of microscopic interactions as the regression input vectors, as previously derived in Inference using relative probabilities of mesoscopic interactions. Our formulation also requires that one state within each canonical ensemble, where the states correspond to the four nucleotides, is used as a comparison category. We arbitrarily choose guanine for this purpose. See Inference using relative probabilities of mesoscopic interactions for more details.
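A minimal sketch of the baseline inputs follows. It takes the difference as the reference count minus the alternative count, matching the sign convention of the difference terms in the region-specific formula given later in this section. The scale factor α is defined in Section I and is simply passed in here with a placeholder default; the function name is hypothetical.

```python
def baseline_inputs(counts_by_nt, reference="G", alpha=1.0):
    """Difference-of-counts inputs for one canonical ensemble.

    counts_by_nt: dict mapping each nucleotide to its interaction-count
    vector for the corresponding in silico structure.
    Returns one difference vector per non-reference nucleotide.
    """
    ref = counts_by_nt[reference]
    return {
        nt: [alpha * (r - c) for r, c in zip(ref, counts)]
        for nt, counts in counts_by_nt.items()
        if nt != reference
    }
```

With four nucleotides and guanine as the comparison category, each ensemble contributes three difference vectors.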

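Variant 2: Transformed inputs. The “transformed inputs” protein–DNA potential uses ratio of counts of microscopic interactions as the regression input vectors. Specifically, the regression input vectors $\{X_l \in \mathbb{R}^{IM}\}_{1 \le l \le L}$ have the following form:

$$X_l = \left[ \alpha_l \left( \frac{C_{M(l-1)+m,\,i}}{\sum_{n=1}^{M} C_{M(l-1)+n,\,i}} \right) \right]_{i \in I,\; 1 \le m \le M}.$$

Expanded:

$$X_l = \left[ \alpha_l \left( \frac{C_{M(l-1)+1,\,1}}{\sum_{n=1}^{M} C_{M(l-1)+n,\,1}} \right), \cdots, \alpha_l \left( \frac{C_{M(l-1)+1,\,I}}{\sum_{n=1}^{M} C_{M(l-1)+n,\,I}} \right), \alpha_l \left( \frac{C_{M(l-1)+2,\,1}}{\sum_{n=1}^{M} C_{M(l-1)+n,\,1}} \right), \cdots, \alpha_l \left( \frac{C_{Ml,\,I}}{\sum_{n=1}^{M} C_{M(l-1)+n,\,I}} \right) \right].$$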

In this protein–DNA potential all elements of the regression input vectors are also standardized such that their mean is zero and variance is one. Input vectors with this property typically exhibit better statistical properties (1).
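The transformed inputs and their standardization can be sketched as follows. Two choices here are assumptions not stated in the text: an interaction type absent from every state of an ensemble is given ratio 0, and standardization uses the population standard deviation computed per element across all input vectors. Function names are hypothetical.

```python
import statistics

def ratio_inputs(counts, alpha=1.0):
    """Ratio-of-counts vector for one ensemble.

    counts: M lists (one per state), each holding I interaction counts.
    Returns a flat vector of length I * M, ordered state by state.
    """
    n_types = len(counts[0])
    totals = [sum(state[i] for state in counts) for i in range(n_types)]
    return [
        alpha * state[i] / totals[i] if totals[i] else 0.0
        for state in counts
        for i in range(n_types)
    ]

def standardize(vectors):
    """Give every vector element zero mean and unit variance across vectors."""
    cols = list(zip(*vectors))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [
        [(x - m) / s if s else 0.0 for x, m, s in zip(v, means, sds)]
        for v in vectors
    ]
```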


Variant 3: Region-specific potentials. In the “region-specific” protein–DNA potential, the energies of microscopic interactions are allowed to vary as a function of position along the protein–DNA-binding interface. The motivation behind this generalization will be addressed first, followed by a description of how region specificity is incorporated into the protein–DNA potential.

Potentials typically define interaction energies independently of the absolute spatial position of the interaction. This is a desirable property when the position has no bearing on the energetics. However, if the position of the interaction is indicative of a consistent physico-chemical environment, properties of this environment can be exploited when modeling the energies of interactions. This is usually exploited on a small scale, for example, by treating two carbon atoms as distinct atom types depending on their chemical moiety, as in the C1′ and C2′ atoms of DNA (15). In that case, the carbon atom’s local chemical moiety is the consistent physico-chemical environment, and it is used in modeling the system to influence the interaction energies of the atoms, by treating carbon atoms with distinct chemical moieties as distinct atom types. For protein–DNA potentials, we asked whether this notion can be generalized to a larger spatial context, by deriving a potential where the microscopic interaction energies are allowed to vary as a function of position along the protein:DNA-binding interface. A key requirement for such a potential to work is that any given position along the binding interface must exhibit a consistent physico-chemical environment across structures, for example, to have essentially the same steric constraints. For the protein–DNA potential, we hypothesized that this is the case when a set of proteins employs the same binding modality for docking into DNA, as is often true for members of the same protein family. This property was previously exploited in modeling the binding of zinc finger proteins to DNA (17). We conjectured that a region-specific model may also be suitable for the helix-turn-helix (HTH) family, because (i) the sequence specificity of HTHs is largely mediated by interactions between the DNA and the recognition α helix of HTHs (18) and (ii) the relative orientation of the two core α helices that make up the HTH domain is conserved across HTH families, despite the broad structural diversity of HTH domains (18–20).

To accomplish this, we structurally aligned all 63 HTH:DNA complexes in our dataset (see Dataset) so that the DNA molecules are superimposed and the variation in the orientation and position of the recognition helices is minimized. Formally, if $\mathrm{rmsd}_{\mathrm{DNA}}$ is the rmsd between the backbone carbon atoms of two aligned DNA molecules and $\mathrm{rmsd}_{\mathrm{HTH}}$ is the rmsd between the Cα atoms of two aligned HTH recognition helices, then we solve the following optimization problem for all pairwise comparisons of HTH:DNA complexes:

$$\min_{\text{alignments}} \; \mathrm{rmsd}_{\mathrm{HTH}} \quad \text{subject to} \quad \mathrm{rmsd}_{\mathrm{DNA}} < \delta,$$

where δ is a parameter of the algorithm (a value of 2 Å is used throughout). Any HTH:DNA complex in the dataset can be used as a baseline for a multiple alignment, so we select the complex that minimizes the average $\mathrm{rmsd}_{\mathrm{HTH}}$ to all other complexes. The resulting multiple alignment, shown in Fig. S3, produces a unified coordinate system along the HTH:DNA-binding interface such that the physico-chemical environment at a given spatial position is comparable across all structures.
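The choice of the baseline complex for the multiple alignment can be sketched as below. The sketch assumes the pairwise optimization has already been solved and its optimal recognition-helix rmsd values are stored in a symmetric matrix; the function name is hypothetical.

```python
def pick_reference(rmsd_hth):
    """Index of the complex with the smallest mean rmsd to all others.

    rmsd_hth: symmetric matrix (list of lists) of pairwise recognition-helix
    rmsd values obtained under the DNA rmsd constraint.
    """
    n = len(rmsd_hth)

    def mean_to_others(i):
        return sum(rmsd_hth[i][j] for j in range(n) if j != i) / (n - 1)

    return min(range(n), key=mean_to_others)
```

All other complexes would then be aligned onto the selected one to define the shared coordinate system.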

To exploit this alignment in the determination of the protein–DNA potential, the resolution of the coordinate system is reduced to the level of individual DNA base pairs. In other words, the number of distinct positions in the coordinate system is made equal to the number of distinct DNA base pair positions in the HTH–DNA-binding interface. In this way, a DNA base in one HTH:DNA complex is comparable with exactly one other DNA base in every other complex. In total there are 13 distinct DNA base pair positions, encompassing the largest observed HTH:DNA-binding interfaces in our dataset.

To incorporate region specificity into the inferred protein–DNA potential, the elements of every regression input vector are duplicated 13 times. Formally, the regression input vectors have the following form:

$$X_l^{(n)} = \left\{ \alpha_l \left( C_{Ml,\,i} - C_{M(l-1)+m,\,i} \right) \right\}_{i \in I,\; 1 \le m \le M-1} \;\cup\; \left\{ \mathbf{1}\{r = n\}\, \alpha_l \left( C_{Ml,\,i} - C_{M(l-1)+m,\,i} \right) \right\}_{i \in I,\; 1 \le m \le M-1,\; 1 \le r \le 13}.$$

In the above formulation, the original set of elements in the input vectors are left unchanged, and the nth duplicate set in an ensemble representing the nth DNA base pair is also left unchanged. All other vector elements are set to 0. $X_l^{(n)}$ is a canonical ensemble that corresponds to a binding event at the nth DNA base pair position, and $\mathbf{1}\{r = n\}$ is the indicator function testing whether r equals n. This encoding of the regression input vectors allows the inference algorithm to model the shared interaction energies common across all positions, as well as capture position-specific adjustments to the core potential.
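The region-specific encoding can be sketched as a simple block construction: the shared copy of the input vector is followed by 13 position-specific blocks, of which only the one for the current base pair position is nonzero. The function name and the 1-based position argument are illustrative choices.

```python
N_POSITIONS = 13  # distinct DNA base pair positions in the aligned interface

def region_specific(base_vector, position):
    """Expand an input vector for an ensemble at base pair `position` (1..13).

    Returns the shared copy followed by 13 blocks; block r equals the
    input vector when r == position and is all zeros otherwise.
    """
    out = list(base_vector)
    for r in range(1, N_POSITIONS + 1):
        if r == position:
            out.extend(base_vector)
        else:
            out.extend([0.0] * len(base_vector))
    return out
```

The coefficients fitted on the first block then act as the shared core potential, while those on the remaining blocks act as per-position adjustments.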

III. Application of De Novo Protein–DNA Potentials to the Prediction of Protein–DNA-Binding Sites. In this section the de novo protein–DNA potentials described in Section II are used to predict the DNA-binding sites of proteins. This is accomplished by first using the inferred protein–DNA potentials to compute the binding energies of the protein:DNA structures, and then using those computed energies to predict the DNA-binding sites of proteins. We first describe the structure-based approach to binding site prediction and then detail the testing methodology and metrics used.

Structure-based prediction of protein DNA-binding sites. Structure-based methods for the prediction of protein–DNA-binding sites follow a somewhat standardized approach when predicting the DNA-binding affinity of proteins (15, 17, 21–23). Starting with a protein:DNA X-ray crystal structure, structure-based methods transform the structure into a set of microscopic interactions (in our case protein atom to DNA atom contacts using the atom-type categories discussed earlier). In some approaches, the structure is first relaxed using a molecular mechanics force field (24), but we do not employ this step in our approach. Once the set of microscopic interactions is identified, the overall binding energy of a protein:DNA structure is computed by adding up the individual energetic contributions from all the microscopic interactions observed in the structure. How the interaction energies are defined depends on the choice of potential used; in our case, we use the de novo protein–DNA potentials described in Section II to compute the energies. Formally, if I is the set of all interactions observed in a structure, $e_i$ is the energetic contribution of interaction i, and $C_i$ is the number of times interaction i is observed in the structure, then the binding energy of the structure, denoted by ΔG, is

$$\Delta G = \sum_{i \in I} e_i C_i.$$

The relative affinity of a protein to two different DNA sequences can be evaluated by computing the binding energy of the protein to those two sequences. This is done by fixing the protein in the protein:DNA complex and mutating the DNA sequence in silico as described next. To make this problem computationally tractable, it is often assumed that the energetics for one DNA position in the binding interface can be computed separately from other positions (15, 17, 21–23). We adopt this assumption. Doing so simplifies the computation by requiring for a DNA sequence of length N only 4N energetic calculations, where each base pair in the DNA is mutated in silico to every possible nucleotide independently from other base pairs, and the binding energy of each of these in silico mutagenized protein:DNA structures is computed. Using the computed binding energies, the Boltzmann formula (16) can then be used to compute the probability of observing nucleotide m at position n, denoted by $p_m^{(n)}$:

$$p_m^{(n)} = \frac{e^{-\beta \Delta G_m^{(n)}}}{\sum_{k \in \{A,C,G,T\}} e^{-\beta \Delta G_k^{(n)}}},$$

where $\Delta G_m^{(n)}$ denotes the binding energy of the structure when position n is mutated to nucleotide m, and β is the inverse temperature parameter that is set to unity in our computations. Performing this computation for every position n and every nucleotide m yields the predicted PWM for the DNA-binding sites of the protein. When the consensus sequence is to be predicted, the most probable nucleotide at every position is chosen as the consensus nucleotide.
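The energy summation and the Boltzmann step can be sketched together. The function names are hypothetical, and subtracting the minimum energy before exponentiating is an added numerical-stability measure that cancels in the ratio and does not change the probabilities.

```python
import math

def binding_energy(energies, counts):
    """Sum of e_i * C_i over interaction types."""
    return sum(e * c for e, c in zip(energies, counts))

def position_probabilities(energies, counts_by_nt, beta=1.0):
    """Boltzmann probabilities over the four nucleotides at one position.

    counts_by_nt: dict mapping each nucleotide to the interaction counts of
    the structure in which this position is mutated to that nucleotide.
    """
    dg = {nt: binding_energy(energies, c) for nt, c in counts_by_nt.items()}
    g_min = min(dg.values())  # shift for stability; cancels in the ratio
    weights = {nt: math.exp(-beta * (g - g_min)) for nt, g in dg.items()}
    z = sum(weights.values())
    return {nt: w / z for nt, w in weights.items()}

def consensus(pwm):
    """Most probable nucleotide at each position of a predicted PWM."""
    return "".join(max(col, key=col.get) for col in pwm)
```

Calling `position_probabilities` once per position gives the columns of the predicted PWM, and `consensus` reduces that PWM to a single sequence.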

Note that our approach does not require the energy computation step because the probabilities can be computed directly using the inferred regression model. This is done by first converting each position in a protein:DNA structure into the appropriate regression input vector, as we did previously for the potential inference step, and then using the regression coefficient matrix to compute the binding probabilities. We use this approach for all of our DNA-binding site predictions, including the computations that produce the results in Figs. 3 and 4 and Table 1 of the main text.

Testing methodology. Quality metrics. We used two metrics to test the quality of predictions made by our de novo potentials applied to the protein–DNA-binding site prediction problem. For consensus binding sites, we compute the fraction of bases correctly predicted by the algorithm for a given protein. To count as correct a predicted base must exactly match the identity of the base in the protein:DNA X-ray structure.

To test the performance of the algorithm in predicting position weight matrices for protein–DNA-binding sites, we compare the algorithm’s predictions to published experimentally determined PWMs. To compare two PWMs, a distance measure over their probability distributions is required to assess the “closeness” of the prediction. The most commonly used such measure is the symmetric Kullback–Leibler divergence (SKLD) (25). The SKLD between two PWMs P and Q is defined as

$$\mathrm{SKLD}(P, Q) = \frac{1}{N} \sum_{n=1}^{N} \sum_{m \in \{A,C,G,T\}} \left[ p_m^{(n)} \ln \frac{p_m^{(n)}}{q_m^{(n)}} + q_m^{(n)} \ln \frac{q_m^{(n)}}{p_m^{(n)}} \right],$$

where $p_m^{(n)}$ and $q_m^{(n)}$ are the probabilities of observing nucleotide m at position n in P and Q, respectively.
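Both quality metrics can be sketched in a few lines. The SKLD here uses the standard symmetric form, in which the second term is q ln(q/p). The small pseudocount `eps` is an assumption added to avoid taking the logarithm of zero when a PWM entry is exactly 0; the text does not say how such entries were handled. Function names are hypothetical.

```python
import math

def consensus_accuracy(predicted, observed):
    """Fraction of positions where the predicted base matches the structure."""
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return matches / len(observed)

def skld(p, q, eps=1e-9):
    """Symmetric Kullback-Leibler divergence between two PWMs, per position.

    p, q: lists of dicts mapping each nucleotide to its probability.
    """
    total = 0.0
    for p_col, q_col in zip(p, q):
        for nt in "ACGT":
            a = p_col[nt] + eps
            b = q_col[nt] + eps
            total += a * math.log(a / b) + b * math.log(b / a)
    return total / len(p)
```

Identical PWMs give a divergence of 0, and larger values indicate a worse match between the predicted and experimental PWMs.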

Dataset. We obtained a set of HTH:DNA complex structures from the Protein Data Bank (26) by searching for X-ray crystal structures that contain an HTH domain and DNA molecules. We considered complexes to be redundant if they shared the same sequence of amino acids within a 10-residue window of the recognition α-helix, and we retained only one representative for such complexes. We chose this criterion due to the dominant role that recognition α-helices play in effecting the sequence specificity of HTH proteins, and the fact that HTHs with otherwise highly similar sequences may still exhibit differential DNA-binding properties (27, 28). In addition, we removed complexes with pathologies such as a large number of missing heavy atoms in the published structure. The resulting dataset comprises 63 nonredundant HTH:DNA complexes (Dataset S1). For each of the HTH:DNA complexes in this dataset, a PWM was derived based on experimental data curated from multiple sources (Dataset S2) (23, 29–72).

Validation. To test the performance of our models, we used 9-fold cross-validation. The 63 complexes were split into 9 groups of 7 complexes each. Each group yielded a testing configuration where 7 complexes comprise the test set, and the remaining 56 the training set. The model was trained on the training set and tested on the test set using all 9 testing configurations, and the average was taken over all test sets. Because some metaparameters vary the number of degrees of freedom in the model, fitting them strictly on the training set was not possible, as the algorithm would always maximize the number of degrees of freedom. Consequently the 3 metaparameters were fit by finding the value that minimizes average error over all test sets. Results are shown in Figs. 3 and 4 and Table 1 in the main text.
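The 9-fold configuration can be sketched as follows. How complexes were assigned to groups is not stated, so the consecutive assignment used here is an assumption, and the function names are hypothetical.

```python
def nine_fold_splits(n_complexes=63, n_folds=9):
    """Yield (train, test) index lists: 9 folds of 7 test complexes each."""
    ids = list(range(n_complexes))
    size = n_complexes // n_folds
    for k in range(n_folds):
        test = ids[k * size:(k + 1) * size]
        train = ids[:k * size] + ids[(k + 1) * size:]
        yield train, test

def best_setting(mean_test_error_by_setting):
    """Pick the metaparameter setting with the lowest average test error."""
    return min(mean_test_error_by_setting, key=mean_test_error_by_setting.get)
```

Here `best_setting` expects a dict that maps each metaparameter combination to its error averaged over the 9 test sets, mirroring the selection rule described above.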

Assessment of other methods. Our objective was to create a consistent testing environment for all methods considered for comparison with our algorithm. When original code implementations were available, they were used; otherwise, the method was reimplemented, as for the Quasichemical potential (15). Recommended default settings were chosen when applicable, and all parameter settings fit using the original methods’ datasets were left unchanged. For the Rosetta potential, the energetic terms were set to the following values: fa_atr = 0.947733, fa_rep = 0.577238, hb_sc = 1.596235, gb_elec = 0.203353, fa_sol = 0.507356, dna_bs = 0.1, and dna_bp = 0.1. These values are based on the fitted parameters from ref. 23. Depending on the original datasets used for training each algorithm, some methods may have an advantage over other methods if their training dataset included structures that were also in our curated test sets. Nonetheless, given the heterogeneous set of methods tested and their reliance on different types of input data, the methodology used provided a consistent and objective comparison.

When testing the consensus sequence predictions of the Rosetta (23) and DNAPROT (21) algorithms, their indirect readout components had to be disabled as they explicitly rely on the consensus sequence itself as input. For Rosetta, this was done by setting the dna_bs and dna_bp terms to zero weight, to eliminate the indirect readout component from the potential. For DNAPROT, this was done by running it with the setting “-p′ -P -1 -e -c -D 0′”, which disables DNAPROT’s indirect readout component and its Cumulative Contacts (73) component, both of which rely on the consensus sequence as input.

1. Hastie T, Tibshirani R, Friedman JH (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, New York), 2nd Ed, p xxii.

2. Candes EJ, Wakin MB (2008) An introduction to compressive sampling. IEEE Signal Process Mag 25:21–30.

3. Donoho DL (2006) Compressed sensing. IEEE Trans Inf Theory 52:1289–1306.

4. McQuarrie DA (2000) Statistical Mechanics (University Science Books, Sausalito, CA), p xii.

5. Reif F (1965) Fundamentals of Statistical and Thermal Physics (McGraw-Hill, New York), p x.

6. Brändén C-I, Tooze J (1999) Introduction to Protein Structure (Garland, New York), 2nd Ed, p xiv.

7. Berg JM, Tymoczko JL, Stryer L (2007) Biochemistry (Freeman, New York), 6th Ed.

8. Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Series B Stat Methodol 58:267–288.

9. Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33:1–22.

10. Candes E, Tao T (2007) The Dantzig selector: Statistical estimation when p is much larger than n. Ann Stat 35:2313–2351.

11. James GM, Radchenko P (2009) A generalized Dantzig selector with shrinkage tuning. Biometrika 96:323–337.

12. Candes EJ, Plan Y (2010) A probabilistic and RIPless theory of compressed sensing. IEEE Trans Inf Theory (in press).

13. Candes EJ, Romberg J (2006) Quantitative robust uncertainty principles and optimally sparse decompositions. Found Comput Math 6:227–254.


14. Tropp JA (2008) On the conditioning of random subdictionaries. Appl Comput Harmon Anal 25:1–24.

15. Donald JE, Chen WW, Shakhnovich EI (2007) Energetics of protein-DNA interactions. Nucleic Acids Res 35:1039–1047.

16. Berg OG, Vonhippel PH (1987) Selection of DNA-binding sites by regulatory proteins—statistical-mechanical theory and application to operators and promoters. J Mol Biol 193:723–743.

17. Kaplan T, Friedman N, Margalit H (2005) Ab initio prediction of transcription factor targets using structural knowledge. PLoS Comput Biol 1:e1.

18. Wintjens R, Rooman M (1996) Structural classification of HTH DNA-binding domains and protein-DNA interaction modes. J Mol Biol 262:294–313.

19. Suzuki M, Gerstein M (1995) Binding geometry of alpha-helices that recognize DNA. Proteins 23:525–535.

20. Pabo CO, Nekludova L (2000) Geometric analysis and comparison of protein-DNA interfaces: Why is there no simple code for recognition? J Mol Biol 301:597–624.

21. Angarica VE, Perez AG, Vasconcelos AT, Collado-Vides J, Contreras-Moreira B (2008) Prediction of TF target sites based on atomistic models of protein-DNA complexes. BMC Bioinformatics 9:436.

22. Moroni E, Caselle M, Fogolari F (2007) Identification of DNA-binding protein target sequences by physical effective energy functions: Free energy analysis of lambda repressor-DNA complexes. BMC Struct Biol 7:61.

23. Morozov AV, Havranek JJ, Baker D, Siggia ED (2005) Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res 33:5781–5798.

24. Jorgensen WL, Tirado-Rives J (2005) Potential energy functions for atomic-level simulations of water and organic and biomolecular systems. Proc Natl Acad Sci USA 102:6665–6670.

25. Hastie T (1987) A closer look at the deviance. Am Stat 41:16–20.

26. Berman HM, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242.

27. Gajiwala KS, Burley SK (2000) Winged helix proteins. Curr Opin Struc Biol 10:110–116.

28. Mo Y, Vaessen B, Johnston K, Marmorstein R (2000) Structure of the elk-1-DNA complex reveals how DNA-distal residues affect ETS domain recognition of DNA. Nat Struct Biol 7:292–297.

29. Matys V, et al. (2006) TRANSFAC and its module TRANSCompel: Transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34:D108–110.

30. Kazakov AE, et al. (2007) RegTransBase—a database of regulatory sequences and interactions in a wide range of prokaryotic genomes. Nucleic Acids Res 35:D407–D412.

31. Halfon MS, Gallo SM, Bergman CM (2008) REDfly 2.0: An integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila. Nucleic Acids Res 36:D594–598.

32. Portales-Casamar E, et al. (2010) JASPAR 2010: The greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res 38:D105–110.

33. Newburger DE, Bulyk ML (2009) UniPROBE: An online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res 37:D77–D82.

34. Munch R, et al. (2003) PRODORIC: Prokaryotic database of gene regulation. Nucleic Acids Res 31:266–269.

35. Gama-Castro S, et al. (2008) RegulonDB (version 6.0): Gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res 36:D120–124.

36. Sierro N, Makita Y, de Hoon M, Nakai K (2008) DBTBS: A database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res 36:D93–96.

37. Jagannathan V, Roulet E, Delorenzi M, Bucher P (2006) HTPSELEX—a database of high-throughput SELEX libraries for transcription factor binding sites. Nucleic Acids Res 34:D90–94.

38. Down TA, Bergman CM, Su J, Hubbard TJ (2007) Large-scale discovery of promoter motifs in Drosophila melanogaster. PLoS Comput Biol 3:e7.

39. Palaniswamy SK, et al. (2006) AGRIS and AtRegNet. A platform to link cis-regulatory elements and transcription factors into regulatory networks. Plant Physiol 140:818–829.

40. Bulow L, Engelmann S, Schindler M, Hehl R (2009) AthaMap, integrating transcriptional and post-transcriptional data. Nucleic Acids Res 37:D983–986.

41. Kumar MDS, et al. (2006) ProTherm and ProNIT: Thermodynamic databases for proteins and protein-nucleic acid interactions. Nucleic Acids Res 34:D204–D206.

42. Yellaboina S, Ranjan S, Chakhaiyar P, Hasnain SE, Ranjan A (2004) Prediction of DtxR regulon: Identification of binding sites and operons controlled by Diphtheria toxin repressor in Corynebacterium diphtheriae. BMC Microbiol 4:38.

43. Franks AH, Griffiths AA, Wake RG (1995) Identification and characterization of new DNA-replication terminators in Bacillus subtilis. Mol Microbiol 17:13–23.

44. Griffiths AA, Wake RG (1997) Search for additional replication terminators in the Bacillus subtilis 168 chromosome. J Bacteriol 179:3358–3361.

45. Griffiths AA, Andersen PA, Wake RG (1998) Replication terminator protein-based replication fork-arrest systems in various Bacillus species. J Bacteriol 180:3360–3367.

46. Sugisaki H, Kanazawa S (1981) New restriction endonucleases from Flavobacterium okeanokoites (FokI) and Micrococcus luteus (MluI). Gene 16:73–78.

47. Falvey E, Grindley NDF (1987) Contacts between gamma-delta-resolvase and the gamma-delta-res site. EMBO J 6:815–821.

48. Moskowitz IP, Heichman KA, Johnson RC (1991) Alignment of recombination sites in Hin-mediated site-specific DNA recombination. Genes Dev 5:1635–1645.

49. Rosandic M, et al. (2006) CENP-B box and pJ alpha sequence distribution in human alpha satellite higher-order repeats (HOR). Chromosome Res 14:735–753.

50. Tronche F, Yaniv M (1992) HNF1, a homeoprotein member of the hepatic transcription regulatory network. Bioessays 14:579–587.

51. Liston DR, Johnson PJ (1999) Analysis of a ubiquitous promoter element in a primitive eukaryote: early evolution of the initiator element. Mol Cell Biol 19:2380–2388.

52. Shen WF, et al. (1997) AbdB-like Hox proteins stabilize DNA binding by the Meis1 homeodomain proteins. Mol Cell Biol 17:6448–6458.

53. Kostelidou K, Thomas CM (2000) The hierarchy of KorB binding at its 12 binding sites on the broad-host-range plasmid RK2 and modulation of this binding by IncC1 protein. J Mol Biol 295:411–422.

54. Garcia-Castellanos R, et al. (2004) On the transcriptional regulation of methicillin resistance—MecI repressor in complex with its operator. J Biol Chem 279:17888–17896.

55. Colloms SD, van Luenen HG, Plasterk RH (1994) DNA binding activities of the Caenorhabditis elegans Tc3 transposase. Nucleic Acids Res 22:5548–5554.

56. Prakash P, Yellaboina S, Ranjan A, Hasnain SE (2005) Computational prediction and experimental verification of novel IdeR binding sites in the upstream sequences of Mycobacterium tuberculosis open reading frames. Bioinformatics 21:2161–2166.

57. Wilson DS, Guenther B, Desplan C, Kuriyan J (1995) High resolution crystal structure of a paired (Pax) class cooperative homeodomain dimer on DNA. Cell 82:709–719.

58. Hughes KT, Gaines PCW, Karlinsey JE, Vinayak R, Simon MI (1992) Sequence-specific interaction of the Salmonella Hin recombinase in both major and minor grooves of DNA. EMBO J 11:2695–2705.

59. Hoey T, Levine M (1988) Divergent homeo box proteins recognize similar DNA sequences in Drosophila. Nature 332:858–861.

60. White CE, Winans SC (2007) The quorum-sensing transcription factor TraR decodes its DNA binding site by direct contacts with DNA bases and by detection of DNA flexibility. Mol Microbiol 64:245–256.

61. Harbison CT, et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature 431:99–104.

62. Chen SF, et al. (2001) Indirect readout of DNA sequence at the primary-kink site in the CAP-DNA complex: Alteration of DNA-binding specificity through alteration of DNA kinking. J Mol Biol 314:75–82.

63. Koudelka GB, Lam CY (1993) Differential recognition of OR1 and OR3 by bacteriophage 434 repressor and Cro. J Biol Chem 268:23812–23817.

64. Koudelka GB, Harrison SC, Ptashne M (1987) Effect of non-contacted bases on the affinity of 434 operator for 434 repressor and Cro. Nature 326:886–888.

65. Schumacher MA, Lau AOT, Johnson PJ (2003) Structural basis of core promoter recognition in a primitive eukaryote. Cell 115:413–424.

66. Smale ST, et al. (1998) The initiator element: A paradigm for core promoter heterogeneity within metazoan protein-coding genes. Cold Spring Harb Symp Quant Biol 63:21–31.

67. Lo K, Smale ST (1996) Generality of a functional initiator consensus sequence. Gene 182:13–22.

68. Javahery R, Khachi A, Lo K, Zenziegregory B, Smale ST (1994) DNA-sequence requirements for transcriptional initiator activity in mammalian-cells. Mol Cell Biol 14:116–127.

69. Huerta AM, Francino MP, Morett E, Collado-Vides J (2006) Selection for unequal densities of sigma(70) promoter-like signals in different regions of large bacterial genomes. PloS Genet 2:1740–1750.

70. Fischer SEJ, van Luenen HGAM, Plasterk RHA (1999) Cis requirements for transposition of Tc1-like transposons in C. elegans. Mol Gen Genet 262:268–274.

71. Rodgers DW, Harrison SC (1993) The complex between phage 434 repressor DNA-binding domain and operator site OR3: structural differences between consensus and non-consensus half-sites. Structure 1:227–240.

72. van Luenen HGAM, Plasterk RHA (1994) Target site choice of the related transposable elements Tc1 and Tc3 of Caenorhabditis-Elegans. Nucleic Acids Res 22:262–269.

73. Morozov AV, Siggia ED (2007) Connecting protein structure with predictions of regulatory sites. Proc Natl Acad Sci USA 104:7068–7073.


[Figure S1 graphic omitted. Panel A, Potential determination as compressed sensing: key observations (mesoscopic interactions combine microscopic interactions; correspondence between statistical mechanics and logistic regression; limited set of microscopic interactions) shown alongside compressed sensing requirements (incoherent sensors; mathematical formalism; sparse signal). Panel B, Protein-DNA potential determination: training data set (protein-DNA structures augmented with in silico mutated DNA bases; corresponding PWMs); potential determination method (counts of microscopic interactions as incoherent sensors; relative DNA binding energies as incoherent measurements; minimization to infer potential); matrix of microscopic interaction energies (axis labels: atom types, separation distance, interacting atom pair; legends: energy scale, interaction count). Panel C, DNA binding site prediction: experimental or homology-based structure; counts of microscopic interactions; energy computation; predicted PWM.]

Fig. S1. Overview. (A) Three key observations are combined to cast the determination of energy potentials from crystal structures of molecular complexes as a compressed sensing problem: (i) Mesoscopic molecular interactions arise from a large number of microscopic interactions, enabling molecular complexes to be treated as incoherent sensors. (ii) The correspondence between the mathematics of statistical mechanics and logistic regression facilitates a compressed sensing formulation that uses relative binding probabilities as measurements. (iii) Biomolecular interactions are mediated by a small number of energetically dominant atomic-level interactions, resulting in a sparse vector of microscopic interaction energies (sparse signal). These three observations satisfy the requirements of compressed sensing and underlie our method for inferring energy potentials exclusively from experimental data. (B) Determination of de novo potential of protein–DNA interactions. Two types of training data are used in inferring the potential: (i) protein–DNA crystal structures, which serve as incoherent sensors of the microscopic interactions in a protein–DNA-binding event, and (ii) experimental protein–DNA-binding probabilities, which serve as incoherent measurements of the binding energies of the mesoscopic protein–DNA-binding events. Because the crystal structures of protein–DNA complexes are typically determined only for a single DNA sequence, whereas proteins bind with varying affinity to a family of sequences, we augment the dataset with structures in which the cognate DNA sequence is in silico mutated to alternative DNA sequences. The microscopic interaction energies for each distinct pair of protein atom and DNA atom are inferred from these datasets. (C) Using the inferred microscopic protein–DNA potential, the PWMs characterizing the DNA-binding sites of other proteins can be computed.
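The correspondence named in observation (ii) can be illustrated with a short derivation. This is only a sketch: it assumes that two candidate DNA sequences a and b compete for the same protein under Boltzmann statistics at a common temperature, and the symbols $E_a$, $E_b$, $k_B T$, $A_a$, $A_b$, and $\sigma$ are introduced here for illustration rather than taken from the caption.

```latex
\begin{align}
P(a \mid \{a, b\})
  &= \frac{e^{-E_a / k_B T}}{e^{-E_a / k_B T} + e^{-E_b / k_B T}}
   = \frac{1}{1 + e^{(E_a - E_b)/k_B T}}
   = \sigma\!\left(-\frac{E_a - E_b}{k_B T}\right),
   \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \\
E_a - E_b
  &= (A_a - A_b) \cdot x .
\end{align}
```

Here $A_a$ and $A_b$ stand for the interaction-count vectors of the two complexes and $x$ for the vector of microscopic interaction energies, so each binding energy is the inner product of a count vector with $x$. Under these assumptions the relative binding probability is a logistic function of a linear combination of count differences, which is the functional form fitted by logistic regression.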


[Figure S2 graphic omitted. The panel depicts y = A × x, where A holds the counts of microscopic interactions in structures of protein–DNA complexes (sensors), x holds the interaction energies (signal), and y holds the binding energies of protein–DNA complexes (measurements). Legends: energy scale, interaction count.]

y_i = binding energy of the ith crystal structure

x_j = interaction energy of the jth interaction type (interaction type is determined by protein atom, DNA atom, and distance bin)

A_ij = number of times the jth interaction type occurs in the ith crystal structure

Given y and A, infer x

Fig. S2. Casting potential determination as a compressed sensing problem. In compressed sensing, the objective is to infer an unknown signal, represented by the vector x, from a set of measurements represented by the vector y. The relationship between the measurements and the signal is assumed to satisfy the equation y = A × x; i.e., each measurement y_i is formed by the inner product of a row A_i with the vector x. Thus each row of A represents a distinct sensor vector. The number of measurements available (i.e., the length of y) is typically much smaller than the length of the signal vector x, which would make the equation y = A × x impossible to solve in general. However, for a sparse signal vector x, compressed sensing techniques enable inference of the original signal. In the protein–DNA application, the signal is the vector of microscopic interaction energies (protein–DNA potential), where each entry in the signal vector x corresponds to the energy of an interaction between a protein atom type and a DNA atom type within a discrete distance bin. Each row of A arises from a distinct protein–DNA crystal structure, and each column corresponds to a distinct type of microscopic interaction (combination of protein atom type, DNA atom type, and distance bin). An element A_ij of A encodes the number of times (indicated by gray levels) that the jth interaction type occurs in the ith crystal structure. The set of measurements y are the experimental binding energies of the corresponding protein–DNA complexes. Because the binding energy of a complex is the sum of the energies of all microscopic interactions in the complex, it is equal to the inner product between the row of A that encodes the complex and the signal vector x.
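The recovery problem described in this caption can be sketched numerically. The example below is a toy illustration, not the method used in the paper: it draws a synthetic Gaussian sensor matrix in place of real interaction counts, builds a sparse signal, and recovers it with iterative soft-thresholding (ISTA) applied to an L1-penalized least-squares objective. The function names `soft_threshold` and `ista`, the penalty weight `lam`, and all problem sizes are assumptions chosen for illustration.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal step for the L1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista(A, y, lam=1.0, n_iter=5000):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by iterative soft-thresholding."""
    # Step size 1/L, where L is the largest eigenvalue of A^T A.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)  # gradient of the least-squares term
        x = soft_threshold(x - step * grad, step * lam)
    return x


rng = np.random.default_rng(0)
n_structures, n_interaction_types, n_nonzero = 60, 200, 5

# Synthetic stand-in for the sensor matrix (hypothetical, not real counts).
A = rng.normal(size=(n_structures, n_interaction_types))

# Sparse signal: only a few interaction types carry nonzero energy.
x_true = np.zeros(n_interaction_types)
support = rng.choice(n_interaction_types, size=n_nonzero, replace=False)
signs = rng.choice([-1.0, 1.0], size=n_nonzero)
x_true[support] = signs * rng.uniform(1.0, 2.0, size=n_nonzero)

# Fewer measurements (60) than unknowns (200).
y = A @ x_true

x_hat = ista(A, y)
rel_error = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"relative recovery error: {rel_error:.3f}")
```

The L1 penalty is what makes the underdetermined system tractable here: among the many vectors consistent with y, it favors those with few nonzero entries. Real interaction-count matrices are nonnegative and integer valued, so their recovery behavior may differ from this Gaussian toy case.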

Fig. S3. Two views of 63 structurally aligned HTH:DNA complexes. The Cα traces of recognition α helices are shown in red and C3′ traces of DNA helices are shown in blue.


[Figure S4 graphic omitted. Histogram; horizontal axis: |cos(θ)|, ticks from 0.2 to 1; vertical axis: fraction of angles, ticks from 0.05 to 0.3.]

Fig. S4. Distribution of the absolute values of the cosines of angles between sensor vectors. For every pair of sensor vectors derived from the dataset and used in the best performing “Baseline” model (see Table 1 in the main text), the acute angle θ was calculated and used to compute the histogram of |cos θ| values shown. This distribution characterizes the incoherence of the sensor matrix used. For an effective sensor matrix, it is desirable to have as many of the values be close to 0 as possible. The observed distribution suggests that the dataset used can act as an effective sensor matrix, with mean and median values of 0.081 and 0.041 (in units of |cos θ|), respectively.
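The incoherence check described in this caption can be sketched as follows. The code computes |cos θ| for every pair of rows of a matrix and summarizes the distribution as a histogram of fractions. The matrix here is a synthetic sparse count matrix drawn from a Poisson distribution, a hypothetical stand-in for the real sensor matrix, and the function name `pairwise_abs_cosines` is introduced only for this illustration.

```python
import numpy as np


def pairwise_abs_cosines(A):
    """Return |cos(theta)| for every unordered pair of distinct rows of A."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    unit = A / norms  # each row scaled to unit length
    gram = np.clip(np.abs(unit @ unit.T), 0.0, 1.0)  # guard against rounding above 1
    i, j = np.triu_indices(A.shape[0], k=1)  # strictly upper triangle: i < j
    return gram[i, j]


rng = np.random.default_rng(0)

# Synthetic sparse count matrix (hypothetical): 50 sensors, 400 interaction types.
A = rng.poisson(0.05, size=(50, 400)).astype(float)
A = A[np.linalg.norm(A, axis=1) > 0]  # drop any all-zero rows before normalizing

cosines = pairwise_abs_cosines(A)

# Histogram expressed as the fraction of pairs falling in each bin.
counts, edges = np.histogram(cosines, bins=20, range=(0.0, 1.0))
fractions = counts / counts.sum()

print(f"mean |cos(theta)|:   {cosines.mean():.3f}")
print(f"median |cos(theta)|: {np.median(cosines):.3f}")
```

Values concentrated near 0 indicate nearly orthogonal sensor vectors, which is the property the caption identifies as desirable. The numbers printed for this synthetic matrix are not comparable to the mean and median reported in the caption.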

Other Supporting Information Files

Dataset S1 (XLSX)

Dataset S2 (XLSX)


