
J. theor. Biol. (2000) 205, 483–503, doi:10.1006/jtbi.2000.2082, available online at http://www.idealibrary.com


Modeling DNA Mutation and Recombination for Directed Evolution Experiments

GREGORY L. MOORE AND COSTAS D. MARANAS*

Department of Chemical Engineering, The Pennsylvania State University, University Park, PA 16802, U.S.A.

(Received 28 October 1999, Accepted in revised form 15 April 2000)

Directed evolution experiments rely on the cyclical application of mutagenesis, screening and amplification in a test tube. They have led to the creation of novel proteins for a wide range of applications. However, directed evolution currently requires an uncertain, typically large, number of labor-intensive and expensive experimental cycles before proteins with improved function are identified. This paper introduces predictive models for quantifying the outcome of the experiments, aiding in the setup of directed evolution for maximizing the chances of obtaining DNA sequences encoding enzymes with improved activities. Two methods of DNA manipulation are analysed: error-prone PCR and DNA recombination. Error-prone PCR is a DNA replication process that intentionally introduces copying errors by imposing mutagenic reaction conditions. The proposed model calculates the probability of producing a specific nucleotide sequence after a number of PCR cycles. DNA recombination methods rely on the mixing and concatenation of genetic material from a number of parent sequences. This paper focuses on modeling a specific DNA recombination protocol, DNA shuffling. Three aspects of the DNA shuffling procedure are modeled: the fragment size distribution after random fragmentation by DNase I, the assembly of DNA fragments, and the probability of assembling specific sequences or combinations of mutations. Results obtained with the proposed models compare favorably with experimental data.

© 2000 Academic Press

Introduction and Background

Unprecedented opportunities are now within our reach for generating novel enzymes and biocatalysts using sophisticated techniques that mutate, recombine and amplify nucleic acid sequences. Such nucleic acid manipulations are exploited within the framework of directed evolution experiments pioneered by Stemmer (1994a, b) and Arnold (1996). In directed evolution the processes of natural evolution are accelerated in a test tube for selecting proteins with the desired properties. A typical experimental cycle of directed evolution begins with the selection of a library of parent DNA sequences encoding for proteins that involve to some extent the sought-after property. The diversity of sequences being explored is next increased through the mutagenesis step by introducing random point nucleotide mutations and/or by recombining DNA fragments. The mutagenesis and fragmentation step renders all but very few of the sequences inactive. These DNA sequences are then ligated into an expression vector and transformed into Escherichia coli cells. A screening procedure is next employed to isolate the few out of the many E. coli transformants containing the sequences encoding for active enzymes or functional proteins. These selected sequences are then amplified and the cycle of mutagenesis, screening and amplification is repeated multiple times until proteins with the desired property or function are found. Recently, remarkable successes of directed evolution have been reported, ranging from industrial enzymes with substantially improved activities and thermostabilities to vaccines and pharmaceuticals (Schmidt-Dannert & Arnold, 1999). These successes mark the onset of enormous possibilities for future uses of directed evolution in basic research for understanding protein function and in industry for creating new biocatalysts.

*Author to whom correspondence should be addressed. E-mail: [email protected]

Except for the work of Moore et al. (1997), which examines the effect of library size and screening capacity, directed evolution experiments have largely been based on empirical information and lack quantitative description of the DNA recombination process. Although they have led to exciting successes, directed evolution methods require a large number of expensive and time-consuming mutagenesis and/or recombination experiments, and often many proteins must be screened before one with the desired property is identified. The enormous potential of these methods will be better realized if the experimental design is improved to be more efficient and less expensive. This challenge provides the main motivation for this paper, in which the necessary modeling framework to enable prediction of size, nucleotide sequence, and activity information in directed evolution experiments is developed.

A key step in the directed evolution experimental cycle is the introduction of new genetic diversity to the library. There are two basic ways for introducing diversity: error-prone PCR and DNA recombination. Error-prone PCR protocols were used in early directed evolution experiments (Arnold, 1996). Polymerase chain reaction (PCR) is a DNA amplification technique in which an initial small amount of DNA is replicated in consecutive cycles, increasing its concentration exponentially (see Fig. 1).

FIG. 1. Three cycles of PCR produce $2^3 = 8$ total double strands after the third cycle, or 16 single strands of nucleotides. Of these 16, two are the original DNA double strand, six are the result of one extension step, six are the result of two extension steps, and two are the result of three extension steps. Strands are shown more lightly shaded as they undergo more extension steps.

The error-prone PCR replication process (Leung et al., 1989; Cadwell & Joyce, 1992; Lin-Goerke et al., 1997) intentionally introduces copying errors by imposing mutagenic reaction conditions (e.g. through the addition of Mn$^{2+}$ or Mg$^{2+}$). The first step of PCR is the denaturation of the DNA into single strands. The second step is the annealing of a primer to the DNA single strands. Primers consist of two DNA oligonucleotides with lengths of 15–30 base pairs complementary to the ends of the amplified region. The third step is primer extension by a polymerase (typically Taq). Nucleotides complementary to the single-strand template are added by using the original sequence as a template, extending the complementary strands until normal DNA double strands are recovered. Unavoidable mutations occur in this step when non-complementary nucleotides are incorporated into the chain. Eckert & Kunkel (1991)


report mutation rates for Taq ranging from $10^{-7}$ up to $10^{-3}$ mutations per nucleotide polymerized. These mutation rates are nucleotide dependent (Cadwell & Joyce, 1992; Shafikhani et al., 1997). The control of these highly variable (spanning four orders of magnitude) copying errors is vital for mutagenesis since the "right" number of mutations will provide just enough diversity for evolutionary advancement without producing a build-up of deleterious errors. However, the ability of error-prone PCR alone to successively improve a DNA sequence through continuously improving single-point mutations is somewhat limited since the build-up of deleterious mutations typically overwhelms the beneficial ones. This has been recognized by researchers and currently DNA recombination, capable of filtering out deleterious mutations while retaining the improving ones, is employed in directed evolution experiments.

Unlike error-prone PCR, where no exchange of genetic material occurs between parent sequences, DNA recombination methods rely on the mixing and concatenation of genetic material from a number of parent sequences. Recombination protocols include DNA shuffling (sexual PCR) (Stemmer, 1994a, b), staggered extension process (StEP) (Zhao et al., 1998), random-priming recombination (RPR) (Shao et al., 1998), and incremental truncation (Ostermeier et al., 1999). A thorough review of currently employed DNA recombination protocols can be found in Volkov & Arnold (1999). Directed evolution experiments utilizing DNA recombination (shuffling) as the mutagenesis step are briefly described as follows (see also Fig. 2).

FIG. 2. DNA shuffling occurs in three steps, the most important of which is a PCR reaction without primers in which reassembly of parent sequences occurs. The product will have a combination of genetic features from all of the parent sequences.

First, an initial set of parent sequences sharing a number of desired traits is selected for recombination. Next, the selected sequences undergo random fragmentation, typically using DNase I. Double-stranded fragments within a certain size range (e.g., 100–200 bp) are retained. The retained fragments are then reassembled by thermocycling with a DNA polymerase (PCR without added primers). As in regular PCR, this involves first the denaturation of the double-stranded fragments into single-stranded ones. Denaturing is followed by annealing, where single-stranded fragments anneal to other fragments overlapping by a sufficiently large number of complementary bases to form 3′ or 5′ overhangs. The third step is polymerase extension (see Fig. 2). Note that the 3′ overhangs are not changed because DNA polymerase only possesses 5′→3′ activity. These three steps are repeated and the average fragment length increases after each cycle. After a number of cycles, DNA sequences of the original length are obtained. Finally, regular PCR with primers is utilized to amplify the reassembled strands. The key advantage of DNA shuffling over error-prone PCR is that it can recombine a large number of mutations within a few selection cycles, quickly yielding functional blocks with combinations of beneficial mutations.

In recent years, directed evolution principles have been successfully applied to enhance a number of protein properties. These include enhancements in enzyme thermostability (Arnold & Moore, 1997; Kuchner & Arnold, 1997; Zhao & Arnold, 1997b; Giver et al., 1998; Matsumura & Ellington, 1999; Lin et al., 1999) and psychrophilicity (Taguchi et al., 1998); alterations in substrate specificity (Zhang et al., 1997; Kumamaru et al., 1998; Hansson et al., 1999) and foreign media activity (Chen & Arnold, 1993; Moore & Arnold, 1996); improved stereoselectivity (Reetz et al., 1997; Bornscheuer et al., 1998); development of pharmaceuticals and vaccines (Patten et al., 1997); bioremediation of polychlorinated biphenyls (Wackett, 1998); detoxification of an arsenate pathway (Crameri et al., 1997); augmentation of the stability of folded antibody fragments (Proba et al., 1998; Martineau et al., 1998); and increased sensitivity to AZT for HIV research (Christians et al., 1999).

At the same time that new directed evolution success stories are published and the potential for discovering truly novel biocatalysts is gaining acceptance, it is becoming apparent that the process is limited by key unanswered questions regarding the optimal mix, scheduling and setup of error-prone PCR and DNA recombination steps; the optimal selection of parent sequences for recombination; and the effect of parameters such as recombinatory fragment length, annealing temperature and number of shuffling cycles on the assembly of full-length product sequences. To answer these questions, a set of quantitative models is introduced. The remainder of the paper is organized as follows. In the next section, a model of error-prone PCR is presented, and the predictions are compared to experimental data. Then, three models describing the DNA shuffling process are discussed. The first (random fragmentation model) describes the fragment size distribution after treatment with DNase I. The second (fragment assembly model) predicts the fragment size distribution after each annealing/extension step. The third (sequence matching model) estimates the fraction of fully assembled genes whose nucleotide sequence matches a target one. For all models, examples are provided along with comparisons with experimental data.

Modeling Error-prone PCR

While lately error-prone PCR has been largely replaced by DNA recombination as the mutagenesis step, modeling single-point mutations is still important since they will occur within any recombination protocol. Quantitative studies of PCR have so far addressed PCR efficiency (Weiss & Haeseler, 1995), reaction kinetics (Hsu et al., 1997), effect of annealing temperatures (Rychlik et al., 1990), and primer lengths (Wu et al., 1991; Sakuma & Nishigaki, 1994). Eckert & Kunkel (1991) proposed the following simple equation:

$$f = \frac{Np}{2}$$

for predicting the overall error rate $f$ after $N$ PCR cycles given that the per cycle error rate is $p$. This relation does not account for the fact that copying errors depend on the nucleotide being replicated. For example, A miscopies to C, G or T with different probabilities (Shafikhani et al., 1997; Cadwell & Joyce, 1992; Lin-Goerke et al., 1997). This omission thus may yield inaccurate estimates.

In the proposed model, mutations that occur during the extension step when nucleotides are added via polymerase are treated as being nucleotide dependent. A per cycle mutation matrix $M$ is defined that models these different mutation rates with elements $M_{ij}$ representing the probability of nucleotide $i$ mutating to nucleotide $j$:

$$M = \begin{pmatrix} M_{AA} & M_{AT} & M_{AC} & M_{AG} \\ M_{TA} & M_{TT} & M_{TC} & M_{TG} \\ M_{CA} & M_{CT} & M_{CC} & M_{CG} \\ M_{GA} & M_{GT} & M_{GC} & M_{GG} \end{pmatrix}.$$

These values depend on the experimental conditions. The per cycle mutation rate matrix $M$ can then be used to identify the mutation rate matrix $C^n$ after $n$ extension steps. This matrix measures the mutation rates of a sequence obtained after $n$ extension events starting from the original sequence. Because the occurrence of mutations in one extension step is independent of mutations that occurred in previous extension steps, a recursive relation for $C^n$ is derived as follows:

$$C^n_{ij} = \begin{cases} \delta_{ij}, & n = 0, \\ M_{ij}, & n = 1, \\ \sum_{k \in \{A,C,T,G\}} M_{kj}\, C^{n-1}_{ik}, & n \ge 2, \end{cases}$$

where $\delta_{ij}$ equals one if $i = j$ and zero otherwise.

However, after $N$ PCR cycles not all sequences in the reaction mixture result after exactly $N$ extensions of the original sequence. This is due to the fact that after a sequence is formed, it remains in the mixture to serve as a template in subsequent extension steps. For example, after three PCR cycles (see Fig. 1), 16 single strands of DNA are produced, of which two are the original DNA double strand ($n = 0$), six are the result of one extension step ($n = 1$), six are the result of two extension steps ($n = 2$), and two are the result of three extension steps ($n = 3$).

This result is generalized for $N$ PCR cycles (see Fig. 3). In Appendix A, it is proven by induction that after $N$ PCR cycles the number of sequences which are the product of exactly $n$ extensions of the original DNA strand is equal to

$$Z_{N,n} = 2\binom{N}{n}.$$
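The counting argument can be checked with a short simulation. The sketch below is not part of the original paper: it assumes ideal PCR in which every strand present is copied exactly once per cycle, and the helper name `strand_counts` is hypothetical.

```python
from math import comb


def strand_counts(num_cycles):
    """Count single strands by number of extension steps after ideal PCR.

    counts[n] is the number of single strands that are n extension steps
    away from the original template. Assumes every strand in the mixture
    is copied exactly once per cycle (100% efficiency).
    """
    counts = [2] + [0] * num_cycles  # original double strand: two strands with n = 0
    for _ in range(num_cycles):
        new_counts = counts[:]  # existing strands remain in the mixture
        for n in range(num_cycles):
            # each template with n extensions yields one copy with n + 1 extensions
            new_counts[n + 1] += counts[n]
        counts = new_counts
    return counts


if __name__ == "__main__":
    N = 3
    print(strand_counts(N))                         # simulated counts for n = 0..N
    print([2 * comb(N, n) for n in range(N + 1)])   # closed form 2 * C(N, n)
```

Under these assumptions the simulated counts for three cycles match the two, six, six and two strands quoted above, and their sum doubles with every cycle.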

FIG. 3. After $N$ PCR cycles, the reaction mixture contains a number of strands that have been through $n$ extension steps. These strands ($Z_{N,n}$) originate from either (i) old templates that have been through $n$ extension steps prior to the $N$-th PCR cycle ($Z_{N-1,n}$); or (ii) new strands extended from templates that had already been through $n-1$ extension steps ($Z_{N-1,n-1}$).

The total number of single-stranded sequences present in the reacting mixture after $N$ PCR cycles is equal to

$$2 \cdot 2^N$$

since every PCR cycle doubles their number. Therefore, the fraction of the sequences present in the reaction mixture after $N$ PCR steps that are the result of $n$ extension events is equal to

$$\frac{1}{2^N}\binom{N}{n}.$$

This relation is used in conjunction with matrix $C^n$ to construct matrix $P^N$ with elements $P^N_{ij}$ representing the probability of nucleotide $i$ mutating to nucleotide $j$ after $N$ PCR cycles:

$$P^N_{ij} = \frac{1}{2^N}\sum_{n=0}^{N}\binom{N}{n}\, C^n_{ij}.$$

By exploiting the assumption that mutations at different locations along the sequence are independent of each other, the probability $\Pi^N_{S^0,S}$ of assembling a sequence $S$ through successive single point mutations on an original sequence $S^0$ after $N$ PCR cycles is given by

$$\Pi^N_{S^0,S} = \frac{1}{2^N}\sum_{n=0}^{N}\binom{N}{n}\prod_{j=1}^{B}\,[P^n]_{s^0_j,\, s_j},$$

where $B$ is the length of the two sequences and $s^0_j$ and $s_j$ are the nucleotides at position $j$ for sequences $S^0$ and $S$, respectively. This relation provides the quantitative means to a priori estimate the fraction of the sequences obtained after $N$ PCR steps that conform to some target sequence $S$ given the mutation matrix $M$. Therefore, by adjusting the reaction conditions to control the mutation rate, an experimenter can control the probability of achieving a desired target sequence.

TABLE 1
An example of mutation matrix calculation given reported mutation bias for zero Mn$^{2+}$ concentration

PCR mutation matrix after 13 cycles (Shafikhani et al., 1997):

$$P^{13} = \begin{pmatrix} 99.522\% & 0.227\% & 0.046\% & 0.205\% \\ 0.227\% & 99.522\% & 0.205\% & 0.046\% \\ 0.046\% & 0.137\% & 99.817\% & 0.000\% \\ 0.137\% & 0.046\% & 0.000\% & 99.817\% \end{pmatrix} \begin{matrix} (A) \\ (T) \\ (C) \\ (G) \end{matrix}$$

Calculated mutation matrix:

$$M = \begin{pmatrix} 99.926\% & 0.035\% & 0.007\% & 0.032\% \\ 0.035\% & 99.926\% & 0.032\% & 0.007\% \\ 0.007\% & 0.021\% & 99.972\% & 0.000\% \\ 0.021\% & 0.007\% & 0.000\% & 99.972\% \end{pmatrix} \begin{matrix} (A) \\ (T) \\ (C) \\ (G) \end{matrix}$$

Average per-cycle mutation rate calculated = 0.016%
Reported per-cycle mutation rate (Ling et al., 1991) = 0.02%

FIG. 4. The GC content of a DNA strand can significantly alter the number of mutations produced by error-prone PCR. Data shown here is for a 12-cycle PCR with no Mn$^{2+}$ added. Moore & Maranas (solid line), $f = Np/2$ (Eckert & Kunkel, 1991) (dashed line).
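The construction of $P^N$ from $M$ can be sketched in a few lines. The code below is an illustrative sketch, not the authors' implementation: it uses the Table 1 values converted to fractions, keeps the row and column order A, T, C, G, and the function names are hypothetical. It builds $C^n$ through the recursion $C^n_{ij} = \sum_k M_{kj} C^{n-1}_{ik}$ and accumulates the binomially weighted sum.

```python
from math import comb

# Per-cycle mutation matrix M from Table 1 (fractions; rows/columns A, T, C, G).
M_TABLE1 = [
    [0.99926, 0.00035, 0.00007, 0.00032],
    [0.00035, 0.99926, 0.00032, 0.00007],
    [0.00007, 0.00021, 0.99972, 0.00000],
    [0.00021, 0.00007, 0.00000, 0.99972],
]

# 13-cycle matrix from Table 1 (Shafikhani et al., 1997), as fractions.
P13_REPORTED = [
    [0.99522, 0.00227, 0.00046, 0.00205],
    [0.00227, 0.99522, 0.00205, 0.00046],
    [0.00046, 0.00137, 0.99817, 0.00000],
    [0.00137, 0.00046, 0.00000, 0.99817],
]


def matmul(A, B):
    """Plain matrix product of two square matrices given as nested lists."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def pcr_mutation_matrix(M, num_cycles):
    """P^N_ij = (1 / 2^N) * sum_n C(N, n) * C^n_ij, with C^0 = I and C^n = C^(n-1) M."""
    size = len(M)
    C = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    P = [[0.0] * size for _ in range(size)]
    for n in range(num_cycles + 1):
        weight = comb(num_cycles, n) / 2 ** num_cycles  # fraction of strands with n extensions
        for i in range(size):
            for j in range(size):
                P[i][j] += weight * C[i][j]
        C = matmul(C, M)  # advance the recursion to C^(n+1)
    return P


if __name__ == "__main__":
    for row in pcr_mutation_matrix(M_TABLE1, 13):
        print(["%.3f%%" % (100 * value) for value in row])
```

With these inputs the off-diagonal entries of the computed matrix come out at roughly $N/2 = 6.5$ times those of $M$, which is in line with the $f = Np/2$ rule of Eckert & Kunkel (1991) for small mutation rates; small differences from the tabulated $P^{13}$ are expected because the entries of $M$ are rounded.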

Next, the proposed model is verified by calculating the per cycle mutation rate matrix $M$ given the mutation rate matrix $P^{13}$ reported by Shafikhani et al. (1997) after 13 PCR cycles (see Table 1). The average per-cycle mutation rate, assuming an equal concentration of each type of nucleotide throughout the sequence, is calculated to be 0.016%. Note that the data presented by Shafikhani et al. (1997) correspond to experimental conditions similar to the ones reported by Ling et al. (1991). In the latter PCR study, an average per-cycle mutation rate of 0.02% is reported, which is very close to the value 0.016% that the proposed model predicts. Fig. 4 illustrates the effect of the sequence GC content on the total number of mutations expected after 12 PCR cycles. Data from error-prone PCR with no Mn$^{2+}$ added (Shafikhani et al., 1997) are used to derive the per-cycle mutation matrix. As shown in Fig. 4, a GC-rich strand can reduce the number of mutations produced by almost one-half.
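The paper does not state how $M$ was extracted from $P^{13}$. One possible route, offered here only as an assumption, follows from the recursion itself: since $C^n$ is the $n$-th matrix power of $M$, the binomial sum collapses to $P^N = \left(\tfrac{1}{2}(I + M)\right)^N$, so that $M = 2\,(P^N)^{1/N} - I$. The sketch below evaluates the matrix root with a truncated binomial series, which converges quickly because $P^N$ is close to the identity; all function names are hypothetical.

```python
from math import comb

# 13-cycle matrix from Table 1 (fractions; rows/columns A, T, C, G).
P13_REPORTED = [
    [0.99522, 0.00227, 0.00046, 0.00205],
    [0.00227, 0.99522, 0.00205, 0.00046],
    [0.00046, 0.00137, 0.99817, 0.00000],
    [0.00137, 0.00046, 0.00000, 0.99817],
]


def identity(size):
    return [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]


def matmul(A, B):
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def pcr_mutation_matrix(M, num_cycles):
    """Forward model: P^N = (1 / 2^N) * sum_n C(N, n) * M^n."""
    size = len(M)
    C, P = identity(size), [[0.0] * size for _ in range(size)]
    for n in range(num_cycles + 1):
        weight = comb(num_cycles, n) / 2 ** num_cycles
        P = [[P[i][j] + weight * C[i][j] for j in range(size)] for i in range(size)]
        C = matmul(C, M)
    return P


def per_cycle_matrix(P, num_cycles, terms=30):
    """Estimate M = 2 * P^(1/N) - I (assumed inversion, not taken from the paper).

    The root is the series (I + X)^(1/N) = sum_k binom(1/N, k) X^k with X = P - I,
    which is only appropriate when P is close to the identity matrix.
    """
    size = len(P)
    eye = identity(size)
    X = [[P[i][j] - eye[i][j] for j in range(size)] for i in range(size)]
    root, power, coeff = identity(size), identity(size), 1.0
    for k in range(1, terms + 1):
        coeff *= (1.0 / num_cycles - (k - 1)) / k  # generalized binomial coefficient
        power = matmul(power, X)
        root = [[root[i][j] + coeff * power[i][j] for j in range(size)] for i in range(size)]
    return [[2.0 * root[i][j] - eye[i][j] for j in range(size)] for i in range(size)]


if __name__ == "__main__":
    for row in per_cycle_matrix(P13_REPORTED, 13):
        print(["%.3f%%" % (100 * value) for value in row])
```

Because the published $P^{13}$ is rounded, entries that are reported as exactly zero can come back as very small negative numbers; these should be read as zero within the precision of the data.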

In the proposed model, the PCR efficiency is assumed to be 100%, meaning that the amount of DNA present doubles from one cycle to the next. In practice, this is not always true since a lack of excess primer or nucleotides may result in incomplete amplification. This assumption affects both the calculation of the amount of DNA present after $N$ cycles and $Z_{N,n}$. For a PCR efficiency $e$ an amplification of $(1+e)^N$ instead of $2^N$ is achieved. The calculation of $Z_{N,n}$ also needs to be changed. Furthermore, it is assumed that no mutational "hot spots", or positions in the sequence with an increased mutation rate, are produced. The lack of "hot spots" is reported by Cadwell & Joyce (1992) and also by Shafikhani et al. (1997). Finally, nucleotide insertions and/or deletions are not modeled because such events are reported to comprise less than 5% of all mutations (Shafikhani et al., 1997). Nevertheless, by augmenting the mutation matrix $M$ to include deletions and insertions in addition to nucleotide mutations, such events can be accommodated at the expense of increased dimensionality.

Modeling DNA Recombination

The modeling of three different aspects of the DNA recombination process is addressed:

1. Random fragmentation model. In this model the size distribution of the DNA fragments after treatment of the parent sequences with DNase I is examined. This provides the quantitative information regarding fragment size distribution necessary for modeling the subsequent DNA shuffling step.

2. Fragment assembly model. Given the initial fragment size distribution, the objective here is to model the fragment size distribution after each annealing/extension step. This allows tracking of how effectively the recombination protocol assembles full-length genes without regard to sequence or function of the assembled sequences.

3. Sequence matching model. After all shuffling cycles have been completed, the fraction of fully assembled genes whose nucleotide sequence matches a given target (e.g., AGGTCC) is quantified.

RANDOM FRAGMENTATION MODEL

After a gene of length $B$ is treated with DNase I (random fragmentation), a random distribution of nucleotide fragments is obtained. Random fragmentation implies that each one of the $B-1$ nucleotide-nucleotide bonds has an equal probability $P_{cut}$ of being broken. The resulting fragment size probability distribution, denoted by $Q^0_L$, is desired to describe the fraction of fragments of different lengths $L$ present in the reaction mixture.

First the special case $L = B$ is addressed. The only possible way for a fragment of length $B$ to result is if none of the $B-1$ bonds are cut. The probability of a single bond remaining intact is $(1-P_{cut})$. The random nature of fragmentation implies that bond-breaking events are independent, therefore

$$Q^0_B = (1-P_{cut})^{B-1}.$$

While the generation of a fragment of length $B$ requires that all $B-1$ bonds must remain intact, a fragment of length $L$ can be formed after having different numbers of bonds being broken. The total number of broken bonds cannot exceed $B-L$ because in that case at least one of the $L-1$ bonds in a fragment of length $L$ must break. Therefore, the calculation of $Q^0_L$ requires enumerating all possible ways of generating a fragment of length $L$ after breaking $s = 1, \ldots, B-L$ bonds. Mathematically, this implies that $Q^0_L$ is equal to the sum of the products of the conditional probabilities $P_{L|s}$ of generating a fragment of length $L$ given that $s$ bonds are broken times the probability $P_s$ of breaking $s$ bonds:

$$Q^0_L = \sum_{s=1}^{B-L} P_s\, P_{L|s}, \qquad L = 1, \ldots, B-1.$$

There exist $\binom{B-1}{s}$ alternatives for breaking $s$ out of $B-1$ bonds. Because bond cutting and bond preservation are independent events, each one of these alternatives has a probability

$$(P_{cut})^{s}\,(1-P_{cut})^{B-1-s}$$

of occurring. By combining these two results we obtain

$$P_s = \binom{B-1}{s}\, P_{cut}^{\,s}\,(1-P_{cut})^{B-1-s}.$$

Random fragmentation implies that the order in which fragments are produced does not affect their respective probabilities of occurrence. For example, two cuts that produce fragments of lengths $a$, $b$, and $c$ occur with the same probability as two cuts that produce fragments of lengths $c$, $a$, and $b$. This greatly simplifies the analysis by allowing the placement of the fragment of length $L$ at the beginning without any loss of generality. Specifically, given that after breaking $s$ bonds a fragment of length $L$ is formed, the formation of the fragment of length $L$ can be assumed to occur first without any loss of generality. This means that there exist

$$\binom{B-1-L}{s-1}$$

alternatives to form the remaining $s-1$ cuts. Each one of these alternatives signifies a way of generating a fragment of length $L$. Because there exist

$$\binom{B-1}{s}$$

ways of creating $s$ cuts, the conditional probability $P_{L|s}$ is equal to

$$P_{L|s} = \binom{B-1-L}{s-1} \bigg/ \binom{B-1}{s}.$$

By combining the expressions for $P_s$ and $P_{L|s}$ the following result for $Q^0_L$ is obtained:

$$Q^0_L = \sum_{s=1}^{B-L} \binom{B-1-L}{s-1}\, P_{cut}^{\,s}\,(1-P_{cut})^{B-1-s}.$$
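The sum over $s$ can be evaluated numerically as a sanity check. The following sketch is illustrative only (function names are hypothetical): it multiplies $P_s$ and $P_{L|s}$ term by term and compares the total with the geometric form $P_{cut}(1-P_{cut})^{L-1}$, to which the sum should collapse by the binomial theorem.

```python
from math import comb


def prob_s_breaks(B, s, p_cut):
    """P_s: probability that exactly s of the B - 1 bonds are broken."""
    return comb(B - 1, s) * p_cut ** s * (1.0 - p_cut) ** (B - 1 - s)


def prob_length_given_s(B, L, s):
    """P_{L|s}: probability of a fragment of length L given s broken bonds."""
    return comb(B - 1 - L, s - 1) / comb(B - 1, s)


def fragment_fraction_sum(B, L, p_cut):
    """Q^0_L as the explicit sum over s = 1..B-L (valid for 1 <= L <= B - 1).

    Intended for modest gene lengths: for very large B the binomial
    coefficients no longer fit into a float.
    """
    return sum(prob_s_breaks(B, s, p_cut) * prob_length_given_s(B, L, s)
               for s in range(1, B - L + 1))


def fragment_fraction_closed(L, p_cut):
    """Geometric form P_cut * (1 - P_cut)^(L - 1)."""
    return p_cut * (1.0 - p_cut) ** (L - 1)


if __name__ == "__main__":
    for B in (200, 300):
        print(B, fragment_fraction_sum(B, 50, 0.01), fragment_fraction_closed(50, 0.01))
```

Running the check for two different gene lengths also illustrates that the result for $L \le B-1$ does not depend on $B$.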

After rearranging terms,

$$Q^0_L = P_{cut}\,(1-P_{cut})^{L-1} \times \left[\sum_{s=1}^{B-L} \binom{B-1-L}{s-1}\, P_{cut}^{\,s-1}\,(1-P_{cut})^{B-L-s}\right]$$

and invoking the binomial distribution properties (the bracketed sum equals one) this expression simplifies further to $Q^0_L = P_{cut}(1-P_{cut})^{L-1}$. Therefore, the fragment size probability distribution after random fragmentation is

$$Q^0_L = \begin{cases} P_{cut}\,(1-P_{cut})^{L-1} & \text{for } 1 \le L \le B-1, \\ (1-P_{cut})^{B-1} & \text{for } L = B. \end{cases}$$

It is interesting to note that the resulting expressions for $L \le B-1$ are independent of the length $B$ of the original gene. Furthermore, it can be shown (see Appendix B) that for small values of $P_{cut}$, $Q^0_L$ approaches the exponential distribution $P_{cut}\exp(-P_{cut}L)$ (see also Table 2) with a mean of $1/P_{cut}$. A graph of the expected fragment size distribution after treatment with DNase I is shown in Fig. 5. Typically, only a range of fragments between $L_1$ and $L_2$ are retained (e.g., $L_1 = 50$, $L_2 = 150$) in subsequent DNA shuffling experiments. In this case, $Q^0_L$ must be renormalized by dividing by $\sum_{L=L_1}^{L_2} Q^0_L$. Note also that $Q^0_L$ is a monotonically decreasing function of $L$, implying that irrespective of the size of $B$ and the fragmentation intensity, quantified by $P_{cut}$, "small" fragments are always more ubiquitous than "large" ones.

TABLE 2
Comparison of discrete model vs. exponential approximation for fragment size probability calculation

$P_{cut}$      $Q^0_{100}$, discrete model      $Q^0_{100}$, exponential approximation
$10^{-4}$      0.00990%                         0.00990%
$10^{-3}$      0.0906%                          0.0905%
$10^{-2}$      0.370%                           0.368%
$10^{-1}$      0.000295%                        0.000454%

FIG. 5. Fragment size distribution after a 1000 bp gene is fragmented with DNase I with $P_{cut} = 0.01$, resulting in a mean fragment length of 100 bp. The dotted lines indicate that only a portion of these fragments are retained for shuffling.

FIG. 6. Calculated agarose gel intensities for $P_{cut}$ = 0.01, 0.02 and 0.04 for a 1 kb gene. $P_{cut} = 0.01$ (solid line); $P_{cut} = 0.02$ (short dashes); $P_{cut} = 0.04$ (long dashes).

FIG. 7. Calculated agarose gel intensities for $P_{cut}$ = 0.002, 0.004, 0.01, 0.04 and 0.1 (top to bottom). The gel runs from a maximum of $L = 2000$ at the left down to $L = 1$ at the right.

Comparison of the proposed model predictions with the bands obtained after agarose gel electrophoresis requires converting the fragment size distribution to corresponding signal intensities. The intensity of an agarose gel band, composed of fragments of length $L$, is proportional to the amount of intercalated ethidium bromide. This is approximately proportional to fragment length since ethidium bromide stains DNA sequences evenly. Therefore, the relative intensity of a band $I^0_L$ is proportional to the particular size fragment distribution $Q^0_L$ times the number of nucleotides $L$ in the fragment. Thus, the following expression describes the relative intensity distribution:

$$I^0_L = \begin{cases} L\,P_{cut}\,(1-P_{cut})^{L-1} & \text{for } 1 \le L \le B-1, \\ B\,(1-P_{cut})^{B-1} & \text{for } L = B. \end{cases}$$

Unlike $Q^0_L$, which is monotonically decreasing, $I^0_L$ exhibits a sharp maximum in intensity for $L = 1/P_{cut}$. It is interesting that the location of the peak depends only on the bond-breaking probability $P_{cut}$.
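The distributions above are simple enough to tabulate directly. The sketch below (hypothetical function names, not from the paper) evaluates the discrete $Q^0_L$, its exponential approximation, and the gel intensity $I^0_L$, and locates the intensity peak by brute force over all lengths.

```python
from math import exp


def fragment_fraction(L, B, p_cut):
    """Q^0_L for a gene of length B with bond-breaking probability p_cut."""
    if L == B:
        return (1.0 - p_cut) ** (B - 1)
    return p_cut * (1.0 - p_cut) ** (L - 1)


def exponential_approx(L, p_cut):
    """Small-p_cut approximation P_cut * exp(-P_cut * L)."""
    return p_cut * exp(-p_cut * L)


def gel_intensity(L, B, p_cut):
    """Relative band intensity I^0_L = L * Q^0_L."""
    return L * fragment_fraction(L, B, p_cut)


def intensity_peak(B, p_cut):
    """Fragment length with the largest relative intensity (brute-force search)."""
    return max(range(1, B + 1), key=lambda L: gel_intensity(L, B, p_cut))


if __name__ == "__main__":
    # Same quantities as Table 2: Q^0_100 in percent, discrete vs. exponential.
    for p_cut in (1e-4, 1e-3, 1e-2, 1e-1):
        print(p_cut,
              100 * fragment_fraction(100, 1000, p_cut),
              100 * exponential_approx(100, p_cut))
    print(intensity_peak(1000, 0.01))
```

For integer lengths the intensities at $L = 1/P_{cut} - 1$ and $L = 1/P_{cut}$ are equal, so the brute-force search may return either of the two neighbouring values.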

A plot of relative gel intensities $I^0_L$ after the random fragmentation of a 1 kb gene for $P_{cut}$ = 0.01, 0.02 and 0.04 is shown in Fig. 6. As $P_{cut}$ increases the peak migrates to smaller fragment lengths and the relative intensity distribution broadens. Density plots of the relative intensity shown in Fig. 7 simulate the appearance of an agarose gel after DNase I fragmentation of a 2 kb gene. Distributions for $P_{cut}$ = 0.002, 0.004, 0.01, 0.04 and 0.1 are shown (top to bottom), which produce intensity peaks at $L$ = 500, 250, 100, 25 and 10 bp, respectively. The horizontal length scale shown is logarithmic due to the typical rate of DNA migration through a gel. These plots conform to the qualitative features exhibited by agarose gels.

These predictions are next compared with agarose gel data quantifying the fragment size distribution at different points in time. Table 3 summarizes the location of the intensity peak at different digestion times observed on an agarose gel for a system examined by Volkov & Arnold (1999). The proposed model predicts that the peak intensity must occur at $1/P_{cut}$ (bp). This implies that based on the experimentally observed peak intensities a model-based estimate of $P_{cut}$ can be derived (see Table 3).

TABLE 3
Random fragmentation reaction progress (Volkov & Arnold, 1999)

Digestion time (min)      Fluorescence maximum, $1/P_{cut}$      $P_{cut}$
0.5                       600 bp                                 0.17%
1                         300 bp                                 0.33%
2                         120 bp                                 0.83%
3                         70 bp                                  1.4%
5                         40 bp                                  2.5%

FIG. 8. First-order kinetics of DNase I digestion.

$P_{cut}$ can alternatively be expressed as the extent of digestion

$$P_{cut} = \frac{C^0_b - C_b}{C^0_b},$$

where $C_b$ equals the concentration of unbroken nucleotide-nucleotide bonds and $C^0_b$ equals the initial concentration of bonds. $C^0_b$ can be represented as $C_{gene}B$, where $C_{gene}$ is the concentration of the gene in solution. Because DNase I is in excess, a first-order rate expression can be used to fit the rate of digestion:

$$C_b = C^0_b \exp(-kt).$$

This leads to the following expression for $P_{cut}$:

$$P_{cut} = 1 - \exp(-kt).$$

After substituting the model predictions for $P_{cut}$ a straight line is obtained after plotting $-\ln(1-P_{cut})$ vs. $t$ as shown in Fig. 8. The slope of this straight line is equal to the rate constant of 0.320 hr$^{-1}$, verifying the model predictions.
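The first-order fit can be reproduced from the Table 3 data. The sketch below is an illustration under stated assumptions: $P_{cut}$ is taken as the reciprocal of the observed peak position, the line is fitted by ordinary least squares with a free intercept (the paper does not say which fitting procedure was used), and the slope is converted from min$^{-1}$ to hr$^{-1}$. The helper `fit_line` is hypothetical.

```python
from math import log


def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Table 3 (Volkov & Arnold, 1999): digestion time and fluorescence maximum.
times_min = [0.5, 1.0, 2.0, 3.0, 5.0]
peaks_bp = [600, 300, 120, 70, 40]

# Model-based estimate: the intensity peak sits at 1 / P_cut.
p_cut = [1.0 / peak for peak in peaks_bp]

# First-order kinetics: -ln(1 - P_cut) = k * t.
y = [-log(1.0 - p) for p in p_cut]

k_per_min, intercept = fit_line(times_min, y)
k_per_hr = 60.0 * k_per_min  # convert min^-1 to hr^-1

if __name__ == "__main__":
    print("k = %.3f per hour, intercept = %.5f" % (k_per_hr, intercept))
```

Under these assumptions the fitted slope lands close to the 0.320 hr$^{-1}$ quoted above; forcing the line through the origin gives a somewhat smaller value.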

FRAGMENT ASSEMBLY MODEL

The goal of this model is to quantitatively describe how the fragment size distribution changes after a shuffling step. The value of this analysis is two-fold: first, it identifies how many shuffling cycles are necessary for reassembling the full-length gene. Second, by modeling fragment size distribution, which is experimentally accessible, it provides a unique way of matching experimental with modeling results, quantifying important parameters in the model. Such experimental studies are currently under investigation. In DNA shuffling, fragments are assembled by a PCR-like reaction without added primers. Denatured fragments prime each other during the annealing step, creating regions of overlap, where annealing has taken place, and overhangs, where the fragments do not align. The overhangs then serve as templates for Taq-catalysed extension.

In the proposed model it is assumed that tertiary collisions are not important and that annealing only occurs between pairs of fragments. In compliance with Taq polymerase function, fragment assembly only occurs in the direction from 5′ to 3′. Sequences of length no greater than that of the original gene are assembled since the fragments are assumed to anneal only along areas of high sequence identity. This requires that the gene does not have a high amount of repetition. The fraction of fragments that fail to anneal during each annealing step is represented by parameter NA, which is assumed to depend on reaction conditions such as concentration and temperature. Fragment annealing is assumed to be governed by second-order kinetics so that the probability of a fragment of length $X$ and a fragment of length $Y$ annealing is proportional to the product of their relative concentrations. The proportionality constant, denoted by $A(X, Y, V)$, is assumed to be a function of only overlap ($V$) and annealed fragment lengths ($X$, $Y$). A minimum overlap of $V_{min}$ nucleotides is assumed to be necessary for annealing. $V_{min}$ depends on the degree of identity shared by the parent sequences and reaction conditions and it is usually between 5 and 15 nucleotides (Stemmer, 1994a). Fragments with an overlap smaller than $V_{min}$ are assumed to denature before extension takes place.

FIG. 9. Possible overlap alternatives between two annealed sequences.

Given the original fragment size distribution $Q^0_L$ obtained after random fragmentation, the next step is to quantify how this distribution will be reshaped after a shuffling step. The fragment probability size distribution after $N$ shuffling cycles is denoted by $Q^N_L$. During the shuffling step pairs of DNA fragments randomly anneal and subsequently extend, giving rise to successively larger DNA fragments from one shuffling cycle to the next. The fragment growth depends on the allowable overlap choices between fragments and their respective chances of annealing and extending. The allowable range of overlap for successful annealing between two fragments of lengths $X$ and $Y$, respectively, is illustrated in Fig. 9. The maximum possible overlap is equal to the length of the smaller of the two fragments, or $\min(X, Y)$. Every overlap value from $V_{min}$ up to $\min(X, Y) - 1$ occurs twice, once for each of the two fragment overhang orientations (5′ and 3′). The maximum overlap $\min(X, Y)$, however, occurs for $|X - Y| + 1$ internal annealing choices. This means that the multiplicity (degeneracy) $d_V$ for different overlap values $V$ is as follows:

$$d_V = \begin{cases} 2 & \text{for } V_{min} \le V \le \min(X, Y) - 1, \\ |X - Y| + 1 & \text{for } V = \min(X, Y). \end{cases}$$

The probability of observing a particular annealing choice shown in Fig. 9 depends on the extent of overlap. The following annealing probability model is postulated, where high or low overlap values are favored depending on the sign of the exponent $a$:

$$A(X, Y, V) = \frac{d_V V^a}{\sum_{V'=V_{\min}}^{\min(X,Y)} d_{V'} V'^a}.$$

For $a = -0.5$ this annealing probability becomes inversely proportional to the square root of the overlap length, as Wetmur & Davidson (1967) suggest, thus favoring shorter overlap values. They assumed DNA annealing to be a two-step process: an initial rate-determining nucleation step and a fast "zippering" step. In their analysis, nucleation is taken to be an elementary second-order reaction, thus supporting the second-order assumption above. The inverse square-root dependence is caused by an excluded volume effect, which can be verified by approximating the DNA by an ideal random coil.
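The degeneracy and annealing probability defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `degeneracy` and `annealing_probability` are hypothetical, lengths and overlaps are positive integers, and $V_{\min} \ge 1$ so that $V^a$ is defined for negative $a$.

```python
def degeneracy(x, y, v):
    """Multiplicity d_V of an overlap of v nucleotides between fragments of lengths x and y."""
    if v == min(x, y):
        return abs(x - y) + 1  # internal annealing choices at full overlap
    return 2                   # one choice per overhang orientation (5' and 3')


def annealing_probability(x, y, v, v_min, a=-0.5):
    """A(X, Y, V): weight d_V * V**a normalized over all allowed overlaps.

    Assumes v_min >= 1; overlaps outside [v_min, min(x, y)] get probability 0.
    """
    shorter = min(x, y)
    if not (v_min <= v <= shorter):
        return 0.0
    norm = sum(degeneracy(x, y, u) * u ** a for u in range(v_min, shorter + 1))
    return degeneracy(x, y, v) * v ** a / norm


# Usage: the probabilities over all allowed overlaps of a 60-mer and a 100-mer sum to one.
probs = [annealing_probability(60, 100, v, v_min=15) for v in range(15, 61)]
total = sum(probs)
```

Because the weights are normalized over the allowed overlap range, `total` equals one up to rounding, whatever the value of the exponent.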

After establishing an annealing probability model, the next step is to identify all mechanisms that generate a fragment of a particular length after a single annealing/extension cycle is completed. Six different pathways for producing a fragment of length $L$ are considered, which exhaustively enumerate all possibilities (Fig. 10). A fragment of length $L$ can be produced by (i) the extension of smaller fragments to length $L$ (first two pathways); (ii) a fragment of length $L$ that fails to extend after annealing (next three pathways); or (iii) a fragment of length $L$ that fails to anneal (last pathway). The first five pathways listed above require two fragments to collide and anneal. These collision pathways depend on three probability terms. First, the fragments must anneal, and this occurs with probability $(1 - \mathrm{NA})$, where NA denotes the probability of having a failed annealing. Second, the collision probability between two fragments of lengths $X$ and $Y$ is proportional to the product of their relative concentrations (or size probability distributions):

$$Q^{N-1}_X \, Q^{N-1}_Y.$$

Because many fragment combinations can combine to form a fragment of a particular length $L$, a summation over all $X$ and $Y$ values that give fragments of length $L$ after extension is necessary. Third, the annealing probability $A(X, Y, V)$ multiplying the product of the fragment size probability distributions is assumed to be a function of the fragment lengths $X$, $Y$ and the nucleotide overlap $V$. These three factors govern the collision and annealing of two fragments. Each of the five possible collision pathways is examined in detail next.

The first pathway (outer extension) describes the 5'→3' successful annealing and extension of two fragments whose lengths $X$, $Y$ are smaller than $L$ and whose overlap $V = X + Y - L$ is such that two single-stranded fragments of length $L$ are recovered after denaturing. The length of the first fragment $X$ may vary between $L_1$ and $L$, while the second fragment $Y$ is bounded between $L - X + V_{\min}$ and $L$. The three probability terms listed above result in the following expression for the size distribution of fragments of length $L$ obtained through the outer extension pathway after the $N$-th shuffling cycle:

$$Q^N_L(\text{outer extension}) = (1 - \mathrm{NA}) \sum_{X=L_1}^{L} Q^{N-1}_X \sum_{Y=L-X+V_{\min}}^{L} Q^{N-1}_Y \, A(X, Y, X + Y - L).$$
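As an illustration, the outer extension term can be evaluated directly from its double sum. The sketch below is not the authors' implementation: the function names are hypothetical, the previous-cycle distribution $Q^{N-1}$ is assumed to be stored as a dictionary mapping fragment length to probability, and $V_{\min} \ge 1$ is assumed.

```python
def annealing_probability(x, y, v, v_min, a=-0.5):
    """A(X, Y, V) with the degeneracy d_V folded in (assumes v_min >= 1)."""
    shorter = min(x, y)
    if not (v_min <= v <= shorter):
        return 0.0

    def weight(u):
        d = abs(x - y) + 1 if u == shorter else 2
        return d * u ** a

    return weight(v) / sum(weight(u) for u in range(v_min, shorter + 1))


def outer_extension(q_prev, length, l1, v_min, na, a=-0.5):
    """Outer-extension contribution to Q^N_L from the distribution q_prev = Q^{N-1}."""
    total = 0.0
    for x in range(l1, length + 1):                                # X = L1 .. L
        qx = q_prev.get(x, 0.0)
        if qx == 0.0:
            continue
        for y in range(max(1, length - x + v_min), length + 1):   # Y = L-X+Vmin .. L
            qy = q_prev.get(y, 0.0)
            if qy == 0.0:
                continue
            overlap = x + y - length
            total += qx * qy * annealing_probability(x, y, overlap, v_min, a)
    return (1.0 - na) * total


# Usage: a pool made only of 50-mers, NA = 50%, V_min = 15.
pool = {50: 1.0}
terms = {L: outer_extension(pool, L, l1=50, v_min=15, na=0.5) for L in range(50, 101)}
```

For this single-length pool only the pair $X = Y = 50$ contributes, so the terms are nonzero for $L$ between 50 and 85 and add up to $1 - \mathrm{NA}$.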

The second pathway (inner extension) considers the case when a smaller fragment anneals completely within a fragment larger than $L$. Given an appropriate placement, the smaller fragment can then be extended to produce a fragment of length $L$. Similarly, the corresponding size probability distribution term accounting for the inner extension pathway is

$$Q^N_L(\text{inner extension}) = (1 - \mathrm{NA}) \sum_{X=L+1}^{B} Q^{N-1}_X \sum_{Y=L_1}^{L-1} Q^{N-1}_Y \, A(X, Y, Y).$$

FIG. 10. The six pathways for producing a fragment of length $L$ by extension, failed extension and failed annealing.

The third, fourth and fifth pathways describe cases when fragments of length $L$ are retained after annealing but unsuccessful extension. This occurs when a 3' overhang is created, causing the Taq-catalysed extension to fail. The three failed extension pathways refer to the case where the second fragment is smaller than $L$ ($L^-$ failed extension); larger than $L$ ($L^+$ failed extension); or equal to $L$ ($L$ failed extension). The following probability terms quantify the contribution of the third, fourth and fifth pathways to $Q^N_L$:

$$Q^N_L(L^- \text{ failed extension}) = (1 - \mathrm{NA}) \, Q^{N-1}_L \sum_{Y=L_1}^{L} Q^{N-1}_Y \times \left( \sum_{V=V_{\min}}^{Y-1} A(L, Y, V) + (L - Y) \, A(L, Y, Y) \right),$$

$$Q^N_L(L^+ \text{ failed extension}) = (1 - \mathrm{NA}) \, Q^{N-1}_L \sum_{Y=L+1}^{B} Q^{N-1}_Y \sum_{V=V_{\min}}^{L} A(L, Y, V),$$

$$Q^N_L(L \text{ failed extension}) = (1 - \mathrm{NA}) \, Q^{N-1}_L \, Q^{N-1}_L \sum_{V=V_{\min}}^{L-1} A(L, L, V).$$

FIG. 11. Fragment size distributions after $N = 5$, 10 and 15 shuffling cycles of a ($L_1 = 50$, $L_2 = 100$) random fragment pool of a 1000 bp gene (NA = 50%, $a = -0.5$, $V_{\min} = 15$). $N = 5$ (----); $N = 10$ (- - - -); $N = 15$ (-- - -- -).

FIG. 12. Fragment size distributions after $N = 25$, 30, 35, 40 and 45 shuffling cycles of a ($L_1 = 10$, $L_2 = 50$) random fragment pool of a 1000 bp gene (NA = 70%, $a = -0.5$, $V_{\min} = 5$). $N = 25$ (----); $N = 30$ (....); $N = 35$ (- - - -); $N = 40$ (------); $N = 45$ (-- - --).

Finally, fragments of length $L$ may remain in the reaction mixture after failing to anneal. Failed annealing occurs with a probability of NA, so the following expression represents the portion of fragments of length $L$ that remain unchanged after failed annealing:

$$Q^N_L(\text{failed annealing}) = \mathrm{NA} \, Q^{N-1}_L.$$

The sum of the contributions of the six pathways generates a recursive model for $Q^N_L$ that tracks the fragment size distribution from one shuffling cycle to the next. An internal consistency check verifies that $\sum_L Q^N_L = 1$ is preserved. The only adjustable parameters in this model are the minimum allowable overlap $V_{\min}$, the probability of failed annealing NA, and the exponent $a$ in the annealing probability expression. Resolving the recursion requires stepping back through the shuffling cycles until the original fragment size distribution $Q^0_L$, obtained after random fragmentation, is encountered as the input.

Figure 11 illustrates the fragment size distribution predicted by the model after 5, 10 and 15 shuffling cycles. The original 1 kb gene is first randomly fragmented and only fragments with sizes between 50 and 150 bp are retained for shuffling. After only 5 shuffling steps the signature of the original fragment pool is still evident in the form of a sharp peak. After 10 cycles this sharp peak is nearly eliminated and a single broad maximum can be found in the fragment size distribution. Finally, after 15 cycles this maximum has migrated to reach the end of the length range and a large portion of the fragments have assembled into full-length genes.

Comparisons with experimental data are encouraging. Stemmer (1994b) initially studied the assembly of a 1 kb gene. The experiment began with random fragmentation to an approximate mean fragment length of 100 bp, verified on an agarose gel, implying a value for $P_{\mathrm{cut}}$ of 1%. Then fragments sized from 10 to 50 bp were assembled, and aliquots taken after $N = 25$, 30, 35, 40 and 45 shuffling steps were analysed on a gel to monitor the progress of the reaction. After 25 cycles, an intensity peak could be seen at approximately $L = 250$. After 30 cycles, a peak could be seen near $L = 450$. As the assembly progressed further, the fluorescence broadened, and full-length genes were reassembled. The proposed model matches these experimental observations, as illustrated in Fig. 12. Parameter values of $P_{\mathrm{cut}} = 1\%$, $L_1 = 10$ and $L_2 = 50$ are selected to match the ones employed in Stemmer's work. A value of $a = -0.5$ was chosen (Wetmur & Davidson, 1967). Furthermore, the last two parameters were set at NA = 70% and $V_{\min} = 5$.
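The step from a mean fragment length of 100 bp to a cutting probability of 1% can be made explicit with a short worked calculation. It assumes, as in Appendix B, that random fragmentation yields the geometric size distribution $Q^0_L = P_{\mathrm{cut}}(1 - P_{\mathrm{cut}})^{L-1}$ over all lengths $L \ge 1$; the symbol $\bar{L}$ for the mean fragment length is introduced here for illustration only.

```latex
\bar{L} \;=\; \sum_{L=1}^{\infty} L \, P_{\mathrm{cut}} \, (1 - P_{\mathrm{cut}})^{L-1}
        \;=\; \frac{1}{P_{\mathrm{cut}}}
\quad\Longrightarrow\quad
P_{\mathrm{cut}} \;=\; \frac{1}{\bar{L}} \;=\; \frac{1}{100} \;=\; 1\%.
```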

SEQUENCE MATCHING MODEL

In the fragment assembly model, the process of recovering full-length sequences was analysed without regard to the nucleotide sequence of the assembled genes. In the target sequence matching model, the goal is to relate the nucleotide sequence of the fully assembled genes, obtained after recombination, to the nucleotide sequence and concentration of the parent sequences and to the experimental conditions. Specifically, given the precise nucleotide sequence of the parent sequences available for recombination, the objective is to find the fraction of the fully assembled sequences whose nucleotide sequence matches a prespecified target (e.g. ATTGG). This target can be (i) sequence identity, (ii) percent sequence homology or (iii) a desired number of crossovers. The work presented here focuses on matching the nucleotide sequence identity of a prespecified target. Moore et al. (1997) study a simplified model assuming that the lengths of the fragments to be reassembled are less than the distances between mutations. Later, Sun (1998) considers larger fragment lengths and addresses the case of single (Sun, 1998) and multiple (Sun, 1999) mutations. Also, Bogarad & Deem (1999) model molecular evolution with Monte Carlo simulations. By building on these contributions, this modeling effort addresses the general case of multiple mutations per DNA strand and arbitrary selections for the fragment lengths.

In our analysis, the nucleotide sequence of only complete DNA products of full length is analysed. The fraction of the sequences achieving full length can be estimated based on the results presented in the previous section. Also, the parent sequences are assumed to have a high degree of homology so that fragment annealing is possible along the entire gene length. As in the fragment assembly model, a minimum overlap of $V_{\min}$ nucleotides is assumed to be necessary for annealing and subsequent assembly, and assembly is assumed to proceed only from 5' to 3'. Furthermore, it is assumed that the assembly process from a position $i$ until the end $B$ of the sequence is independent of assembly that has occurred before position $i$. In other words, the annealing of a fragment is independent of all prior fragment annealing that occurred in previous shuffling cycles. Therefore, if $P_i$ is the probability of reproducing the portion of a target sequence between position $i$ and its end $B$, then $P_i$ is independent of all $P_j$ where $j < i$.

The correct assembly of a target sequence is achieved if and only if a cascade of four independent events occurs, as shown in Fig. 13. Each one of these events contributes a probability term to $P_i$. The first step is to choose a fragment of length $L$ to add to the sequence. Assuming random fragmentation, a fragment of length $L$ is chosen with probability $Q^0_L$, discussed earlier. The second step in the assembly process is the annealing of the fragment of length $L$ to the rest of the previously assembled sequence. The overlap must be at least $V_{\min}$ nucleotides. Thus, the non-overlapping portion of the fragment adds at most $L - V_{\min}$ new nucleotides to the sequence. Therefore, there are $L - V_{\min}$ possible ways for a fragment to align itself during annealing, with overlaps $V$ ranging from $V_{\min}$ to $L - 1$. The probability of adding $L - V$ new nucleotides with a fragment of length $L$ is denoted as $A_{L-V,L}$ and is defined in the same way as the annealing probability $A(X, Y, V)$ described in the previous section:

$$A_{L-V,L} = \frac{V^a}{\sum_{V'=V_{\min}}^{L-1} V'^a}.$$

After summing over all possible overlap values, this contributes a factor of $\sum_{V=V_{\min}}^{L-1} A_{L-V,L}$ to $P_i$.

The third step is to calculate the probability that the extended sequences will contribute nucleotides that match the ones in the target sequence. Starting from a nucleotide at position $i$ and assuming that a fragment of length $L$ has annealed with an overlap of $V$ nucleotides, the probability of matching the target nucleotide sequence from $i$ to position $i + (L - V) - 1$ is equal to the fraction of the parent sequences that exactly match the target sequence from position $i$ to position $i + (L - V) - 1$. Let parameter $D_{a,b}$ denote the number of parent sequences that match the target sequence from positions $a$ to $b$. Matching between positions $i$ and $i + (L - V) - 1$ then occurs with probability $D_{i,i+L-V-1}/K$, where $K$ is the number of parents available for recombination. The fourth and final step is to calculate the probability of reproducing the remainder of the target sequence after adding $L - V$ new nucleotides. Because the annealing of additional fragments is independent of prior additions, simple multiplication by $P_{i+L-V}$ suffices. This establishes a function for $P_i$ that must be evaluated recursively. These four steps result in the expression for $P_i$ shown below, where $B$ is the sequence length in nucleotides, $L_1$ and $L_2$ are the smallest and largest recombinatory fragments, $V_{\min}$ is the minimum annealing overlap, and $K$ is the number of parent sequences:

$$P_i = \begin{cases} 1, & i > B, \\[4pt] \dfrac{D_{i,i}}{K}, & i = B, \\[8pt] \displaystyle\sum_{L=L_1}^{L_2} Q^0_L \left[ \sum_{V=V_{\min}}^{L-1} A_{L-V,L} \left( \dfrac{D_{i,i+L-V-1}}{K} \right) P_{i+L-V} \right], & i < B. \end{cases}$$

FIG. 13. Four necessary steps of the annealing process as described in Sequence Matching Model.
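The recursion for $P_i$ lends itself to a bottom-up evaluation from the end of the sequence toward position 1. The Python sketch below illustrates this under several assumptions that are not stated in the text: $Q^0_L$ is taken as the geometric distribution of Appendix B renormalized over the retained range $L_1 \le L \le L_2$, a match interval that runs past position $B$ is clipped at $B$, and $L_1 > V_{\min} \ge 1$ so that every fragment has at least one allowed overlap. The function names are hypothetical.

```python
def fragment_size_distribution(p_cut, l1, l2):
    """Q^0_L on L1..L2: geometric weights renormalized over the retained range (assumption)."""
    weights = {L: p_cut * (1.0 - p_cut) ** (L - 1) for L in range(l1, l2 + 1)}
    total = sum(weights.values())
    return {L: w / total for L, w in weights.items()}


def match_probability(parents, target, p_cut, l1, l2, v_min, a=-0.5):
    """P_1: probability that a fully assembled sequence is identical to `target`."""
    b = len(target)   # sequence length B
    k = len(parents)  # number of parents K
    q0 = fragment_size_distribution(p_cut, l1, l2)

    def matching_parents(i, j):
        # D_{i,j}: parents equal to the target on positions i..j (1-based, inclusive).
        # Positions beyond B are ignored (assumption for fragments overrunning the end).
        j = min(j, b)
        return sum(p[i - 1:j] == target[i - 1:j] for p in parents)

    prob = [1.0] * (b + l2 + 2)   # P_i = 1 for every i > B
    for i in range(b, 0, -1):     # evaluate P_B, P_{B-1}, ..., P_1
        total = 0.0
        for length, q in q0.items():
            norm = sum(v ** a for v in range(v_min, length))
            for v in range(v_min, length):
                added = length - v  # new nucleotides contributed by this fragment
                a_term = v ** a / norm
                d_term = matching_parents(i, i + added - 1) / k
                total += q * a_term * d_term * prob[i + added]
        prob[i] = total
    return prob[1]


# Usage: two parents that differ from each other at a single position.
target = "ACGT" * 15
mutant = target[:29] + "A" + target[30:]
p_match = match_probability([target, mutant], target, 0.01, 20, 30, 5)
```

In this usage example exactly one added segment covers the differing position on every assembly path, so the probability of recovering the target comes out at one half. With a single parent identical to the target the function returns one.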

The above recursive formula calculates the probability $P_i$ of obtaining an assembled sequence that is identical to some target sequence $S$ from nucleotide position $i$ onward. Therefore, $P_1$ is equal to the probability of assembling a sequence identical to the target. This target may be either a specific pattern or an entire gene.

In this analysis, the target sequence is assumed to be the entire assembled sequence. If only a portion of the assembled sequence is to be analyzed, the probability of annealing for the first fragment at $i = 1$ must be adjusted to include previous fragment additions ($i < 1$). In Moore & Maranas (2000), a renewal probability analysis is performed to account for this.

The predictions of the sequence matching model are consistent with experimental data. Stemmer (1994a) recombined two markers 75 bp apart from random fragments of size between 100 and 200 bp and reported that only 11% of the reassembled fragments contained both mutations. Note that independent assembly of the two mutations would have predicted a 25% value. Assuming a required minimum overlap for annealing of $V_{\min} = 15$ and $a = -1/2$, this model estimates this probability for the average fragment size of $L = 150$ to be 12.4%, which is very close to the experimentally observed value.

Next, the possibility of increasing the probability of containing both mutations in the recombined sequences by appropriately choosing the fragment length is examined. The estimated probability of assembling a two-mutation sequence is plotted as a function of fragment length in Fig. 14. This probability is a strong function of fragment length, exhibiting a sharp maximum of 21.4% at around $L = 110$ bp. These results clearly demonstrate the importance of being able to predict this "right" fragment length.

Further comparisons with experimental results (Zhao & Arnold, 1997a) are shown in Tables 4 and 5. Zhao & Arnold (1997a) shuffle two 1.3 kb sequences, one with no mutations (wild-type) and the other with multiple-point mutations. In Table 4, the results of modeling an 83 bp portion of this sequence are compared with the experimental results. The experimental method is used to parameterize the model, so that $P_{\mathrm{cut}} = 0.83\%$ (2 min DNase I digestion from Table 3), $(L_1, L_2) = (30, 50)$ (fragments less than 50 bp), $V_{\min} = 15$, $a = -0.5$ (standard annealing conditions). The modeling results match the trends found experimentally. The variations are most likely due to the small number of sequenced products reported. In addition, the modeling results confirm the experimentally observed tendency of the mutations at positions 35 and 47 to be "linked". The results shown in Table 5 more clearly demonstrate this tendency by examining only the recombination of the closely spaced mutations.

FIG. 14. Probability of recombining two markers 75 bp apart as a function of the fragment length $L$.

Summary and Conclusions

In this paper, quantitative models for predicting the outcome of DNase I fragmentation, error-prone PCR and DNA shuffling experiments were introduced. Specifically, the random fragmentation model and the fragment assembly model provided the quantitative means of tracking the size probability distribution of fragments in the reacting mixture during DNase I fragmentation and DNA shuffling, respectively. The PCR model and the sequence matching model, on the other hand, established a formalism for estimating the probability of matching a prespecified nucleotide target. These models can be used in combination with optimization algorithms based on mixed-integer linear technologies to "home in" on the optimum fragment length and parent set without resorting to exhaustive enumeration of all alternatives (Moore & Maranas, 2000).

The predictions of these models were tested against experimental data available in the open literature. Unfortunately, such published data on directed evolution experiments do not contain sufficient detail on the size and nucleotide order of the recombined sequences to allow for a complete model validation and optimization. Currently, research is being conducted to overcome this limitation by designing directed evolution experiments on a test gene to specifically provide data for our modeling effort. These experiments will provide information on fragment sizes to validate and parameterize the proposed models. In addition, work is underway to apply the modeling framework presented to other recombination protocols, particularly the new technique of incremental truncation (Ostermeier et al., 1999).

The combination of theoretical, experimental and analytical approaches will lead to the improved application of directed evolution methods, yielding higher success rates and lower costs.

TABLE 4
DNA shuffling calculations for $L_1 = 30$, $L_2 = 50$, $P_{\mathrm{cut}} = 0.83\%$, and $V_{\min} = 15$

Parent sequences (2): one parent carrying point mutations at positions 1, 35, 47 and 83 of the 83 bp portion, and one parent carrying none. For each shuffled sequence, "x" marks a mutation present at the given position and "-" marks its absence.

Pos. 1   Pos. 35   Pos. 47   Pos. 83   Calculated probability   Reported frequency (Zhao & Arnold, 1997a)
x        x         x         x         8.2%                     20%
x        x         x         -         8.2%                     10%
x        x         -         x         4.3%                     0%
x        x         -         -         4.3%                     0%
x        -         x         x         4.3%                     0%
x        -         x         -         4.3%                     0%
x        -         -         x         8.2%                     0%
x        -         -         -         8.2%                     0%
-        x         x         x         8.2%                     20%
-        x         x         -         8.2%                     0%
-        x         -         x         4.3%                     0%
-        x         -         -         4.3%                     10%
-        -         x         x         4.3%                     0%
-        -         x         -         4.3%                     0%
-        -         -         x         8.2%                     20%
-        -         -         -         8.2%                     20%

TABLE 5
DNA shuffling calculations for $L_1 = 30$, $L_2 = 50$, $P_{\mathrm{cut}} = 0.83\%$, and $V_{\min} = 15$

Parent sequences (2): one parent carrying point mutations at positions 1 and 13, and one parent carrying none. For each shuffled sequence, "x" marks a mutation present at the given position and "-" marks its absence.

Pos. 1   Pos. 13   Calculated probability   Reported frequency (Zhao & Arnold, 1997a)
x        x         32.8%                    50%
x        -         17.2%                    10%
-        x         17.2%                    0%
-        -         32.8%                    40%

Financial support by the PSU Innovative Biotechnology Research Fund, NSF Career Award CTS-9701771 and computing hardware support by the IBM Shared University Research Program 1996, 1997 and 1998 is gratefully acknowledged.


REFERENCES

ARNOLD, F. (1996). Directed evolution: creating biocatalysts for the future. Chem. Eng. Sci. 51, 5091-5102.

ARNOLD, F. & MOORE, J. (1997). Optimizing industrial enzymes by directed evolution. Advan. Biochem. Eng. 58, 1-14.

BOGARAD, L. & DEEM, M. (1999). A hierarchical approach to protein molecular evolution. Proc. Natl. Acad. Sci. U.S.A. 96, 2591-2595.

BORNSCHEUER, U., ALTENBUCHNER, J. & MEYER, H. (1998). Directed evolution of an esterase for the stereoselective resolution of a key intermediate in the synthesis of epothilones. Biotechnol. Bioeng. 58, 554-559.

CADWELL, R. & JOYCE, G. (1992). Randomization of genes by PCR mutagenesis. PCR Meth. Appl. 2, 28-33.

CHEN, K. & ARNOLD, F. (1993). Tuning the activity of an enzyme for unusual environments: sequential random mutagenesis of subtilisin E for catalysis in dimethylformamide. Proc. Natl. Acad. Sci. U.S.A. 90, 5618-5622.

CHRISTIANS, F., SCAPOZZA, L., CRAMERI, A., FOLKERS, G. & STEMMER, W. (1999). Directed evolution of thymidine kinase for AZT phosphorylation using DNA family shuffling. Nat. Biotechnol. 17, 259-264.

CRAMERI, A., DAWES, G., RODRIGUEZ JR., E., SILVER, S. & STEMMER, W. (1997). Molecular evolution of an arsenate detoxification pathway by DNA shuffling. Nat. Biotechnol. 15, 436-438.

ECKERT, K. & KUNKEL, T. (1991). DNA polymerase fidelity and the polymerase chain reaction. PCR Meth. Appl. 1, 17-24.

GIVER, L., GERSHENSON, A., FRESKGARD, P. & ARNOLD, F. (1998). Directed evolution of a thermostable esterase. Proc. Natl. Acad. Sci. U.S.A. 95, 12809-12813.

HANSSON, L., BOLTON-GROB, R., MASSOUD, T. & MANNERVIK, B. (1999). Evolution of differential substrate specificities in Mu class glutathione transferases probed by DNA shuffling. J. Mol. Biol. 287, 265-276.

HSU, J., DAS, S. & MOHAPATRA, S. (1997). Polymerase chain reaction engineering. Biotechnol. Bioeng. 55, 359-366.

KREYSZIG, E. (1993). Advanced Engineering Mathematics, 7th Edn., p. 1164. New York: Wiley.

KUCHNER, O. & ARNOLD, F. (1997). Directed evolution of enzyme catalysts. Trends Biotechnol. 15, 523-530.

KUMAMARU, T., SUENAGA, H., MITSUOKA, M., WATANABE, T. & FURUKAWA, K. (1998). Enhanced degradation of polychlorinated biphenyls by directed evolution of biphenyl dioxygenase. Nat. Biotechnol. 16, 663-666.

LEUNG, D., CHEN, E. & GOEDDEL, D. (1989). A method for random mutagenesis of a defined DNA segment using a modified polymerase chain reaction. Technique 1, 11-15.

LIN, Z., THORSEN, T. & ARNOLD, F. (1999). Functional expression of horseradish peroxidase in E. coli by directed evolution. Biotechnol. Prog. 15, 467-471.

LIN-GOERKE, J., ROBBINS, D. & BURCZAK, J. (1997). PCR-based random mutagenesis using manganese and reduced dNTP concentration. Biotechniques 23, 409-412.

LING, L., KEOHAVONG, P., DAIS, C. & THILLY, W. (1991). Optimization of the polymerase chain reaction with regard to fidelity: modified T7, Taq, and Vent DNA polymerases. PCR Meth. Appl. 1, 63-69.

MARTINEAU, P., JONES, P. & WINTER, G. (1998). Expression of an antibody fragment at high levels in the bacterial cytoplasm. J. Mol. Biol. 280, 117-127.

MATSUMURA, I. & ELLINGTON, A. (1999). In vitro evolution of thermostable p53 variants. Protein Sci. 8, 731-740.

MOORE, J. & ARNOLD, F. (1996). Directed evolution of a para-nitrobenzyl esterase for aqueous-organic solvents. Nat. Biotechnol. 14, 458-467.

MOORE, G. & MARANAS, C. (2000). Modeling and optimization of DNA recombination. Comput. Chem. Eng. (in press).

MOORE, J., JIN, H., KUCHNER, O. & ARNOLD, F. (1997). Strategies for the in vitro evolution of protein function: enzyme evolution by random recombination of improved sequences. J. Mol. Biol. 272, 336-347.

OSTERMEIER, M., NIXON, A. & BENKOVIC, S. (1999). Incremental truncation as a strategy in the engineering of novel biocatalysts. Bioorg. Med. Chem. 7, 2139-2144.

PATTEN, P., HOWARD, R. & STEMMER, W. (1997). Applications of DNA shuffling to pharmaceuticals and vaccines. Curr. Opin. Biotechnol. 8, 724-733.

PROBA, K., WORN, A., HONEGGER, A. & PLUCKTHUN, A. (1998). Antibody scFv fragments without disulfide bonds made by molecular evolution. J. Mol. Biol. 275, 245-253.

REETZ, M., ZONTA, A., SCHIMOSSEK, K., LIEBETON, K. & JEGER, K. (1997). Creation of enantioselective biocatalyses for organic chemistry by in vitro evolution. Angew. Chem. Int. Ed. Engl. 36, 2830-2835.

RYCHLIK, W., SPENCER, W. & RHOADS, R. (1990). Optimization of the annealing temperature for DNA amplification in vitro. Nucl. Acids Res. 18, 6409-6412.

SAKUMA, Y. & NISHIGAKI, K. (1994). Computer prediction of general PCR products based on dynamical solution structures of DNA. J. Biochem. 116, 736-741.

SCHMIDT-DANNERT, C. & ARNOLD, F. (1999). Directed evolution of industrial enzymes. Trends Biotechnol. 17, 135-136.

SHAFIKHANI, S., SIEGEL, R., FERRARI, E. & SCHELLENBERGER, V. (1997). Generation of large libraries of random mutants in Bacillus subtilis by PCR-based plasmid multimerization. Biotechniques 23, 304-310.

SHAO, Z., ZHAO, H., GIVER, L. & ARNOLD, F. (1998). Random-priming in vitro recombination: an effective tool for directed evolution. Nucl. Acids Res. 26, 681-683.

STEMMER, W. (1994a). Rapid evolution of a protein in vitro by DNA shuffling. Nature 370, 389-391.

STEMMER, W. (1994b). DNA shuffling by random fragmentation and reassembly: in vitro recombination for molecular evolution. Proc. Natl. Acad. Sci. U.S.A. 91, 10747-10751.

SUN, F. (1998). Modeling DNA shuffling. Proc. 2nd Ann. Inte. Conf. Comput. Mol. Biol., p. 251.

SUN, F. (1999). Modeling DNA shuffling. J. Comput. Biol. 6, 77-90.

TAGUCHI, S., OZAKI, A. & MOMOSE, H. (1998). Engineering of a cold-adapted protease by sequential random mutagenesis and a screening system. Appl. Environ. Microbiol. 64, 492-495.

VOLKOV, A. & ARNOLD, F. (1999). Methods for in vitro DNA recombination and chimeragenesis. Meth. Enzymol. (in press).

WACKETT, L. (1998). Directed evolution of new enzymes and pathways for environmental biocatalysis. Ann. NY Acad. Sci. 864, 142-152.

WEISS, G. & HAESELER, A. (1995). Modeling the polymerase chain reaction. J. Comput. Biol. 2, 49-61.

WETMUR, J. & DAVIDSON, N. (1967). Kinetics of renaturization of DNA. J. Mol. Biol. 31, 349-370.

WU, D., UGOZZOLI, L., PAL, B., QIAN, J. & WALLACE, R. (1991). The effect of temperature and oligonucleotide primer length on the specificity and efficiency of amplification by the polymerase chain reaction. DNA Cell Biol. 10, 233-238.

ZHANG, J., DAWES, G. & STEMMER, W. (1997). Directed evolution of a fucosidase from a galactosidase by DNA shuffling and screening. Proc. Natl. Acad. Sci. U.S.A. 94, 4504-4509.

ZHAO, H. & ARNOLD, F. (1997a). Optimization of DNA shuffling for high fidelity recombination. Nucl. Acids Res. 25, 1307-1308.

ZHAO, H. & ARNOLD, F. (1997b). Functional and nonfunctional mutations distinguished by random recombination of homologous genes. Proc. Natl. Acad. Sci. U.S.A. 94, 7997-8000.

ZHAO, H., GIVER, L., SHAO, Z., AFFHOLTER, J. & ARNOLD, F. (1998). Molecular evolution by staggered extension process (StEP) in vitro recombination. Nat. Biotechnol. 16, 258-261.

APPENDIX A

Calculation of $Z_{N,n}$

Let $Z_{N,n}$ represent the number of strands that have been through $n$ extension steps after $N$ PCR cycles. Information about the value of $Z_{N,n}$ can be discerned for some values of $N$ and $n$. Initially, as stated above, two single strands of DNA are present, so $Z_{0,0} = 2$. These two strands are the only two that are not the product of an extension step; therefore, $Z_{N,0} = 2$ for all $N$. Also, after $N$ cycles no DNA strands will be the result of more extension steps than $N$ ($Z_{N,n} = 0$ for $n > N$). After the $N$-th PCR cycle, a strand that is produced after $n$ extension steps is either one that was just produced in the $N$-th PCR cycle or one that was already in the reaction mixture before the $N$-th PCR cycle began. In the first case, this implies that a sequence that has undergone $(n - 1)$ extension steps after $(N - 1)$ PCR cycles served as the template to produce the sequence in question. In the second case, the sequence in question has already undergone $n$ extensions by the $(N - 1)$-th PCR cycle. See Fig. 3 for an illustration of these two cases. This implies that $Z_{N,n} = Z_{N-1,n} + Z_{N-1,n-1}$. Based on this relation a proof by induction of $Z_{N,n} = 2\binom{N}{n}$ is constructed. First, this result is shown to be valid for $n = 0$, 1 and 2.

For $n = 0$,

$$Z_{N,0} = 2 = 2\binom{N}{0}.$$

For $n = 1$,

$$\begin{aligned} Z_{N,1} &= Z_{N-1,1} + Z_{N-1,0} = Z_{N-1,1} + 2 \\ &= (Z_{N-2,1} + Z_{N-2,0}) + 2 \\ &= (Z_{N-2,1} + 2) + 2 = Z_{N-2,1} + 2(2) \\ &= Z_{N-3,1} + 2(3) \\ &= Z_{N-k,1} + 2k, \quad \forall\, 0 \le k \le N. \end{aligned}$$

To resolve the recursion, set $k = N$:

$$Z_{N,1} = Z_{0,1} + 2N = 0 + 2N = 2N = 2\binom{N}{1}.$$

For $n = 2$,

$$\begin{aligned} Z_{N,2} &= Z_{N-1,2} + Z_{N-1,1} \\ &= (Z_{N-2,2} + Z_{N-2,1}) + Z_{N-1,1} \\ &= Z_{1,1} + Z_{2,1} + \cdots + Z_{N-1,1} \\ &= \sum_{k=1}^{N-1} Z_{k,1} = \sum_{k=1}^{N-1} 2k \\ &= 2\left[\frac{N(N-1)}{2}\right] = 2\binom{N}{2}. \end{aligned}$$

After the postulated result is shown to be valid for $n = 0$, 1 and 2, it is assumed that $Z_{N,n} = 2\binom{N}{n}$. To complete the proof by induction, this assumption is utilized to prove that $Z_{N,n+1} = 2\binom{N}{n+1}$:

$$\begin{aligned} Z_{N,n+1} &= Z_{N-1,n+1} + Z_{N-1,n} \\ &= (Z_{N-2,n+1} + Z_{N-2,n}) + Z_{N-1,n} \\ &= (Z_{N-3,n+1} + Z_{N-3,n}) + Z_{N-2,n} + Z_{N-1,n} \\ &= Z_{n,n} + Z_{n+1,n} + \cdots + Z_{N-1,n} \\ &= \sum_{k=n}^{N-1} Z_{k,n} = \sum_{k=n}^{N-1} 2\binom{k}{n}. \end{aligned}$$

Let $s = k - n$, then

$$Z_{N,n+1} = 2 \sum_{s=0}^{(N-n)-1} \binom{n+s}{n}.$$

This expression can equivalently be rewritten as (Kreyszig, 1993)

$$Z_{N,n+1} = 2 \sum_{s=0}^{(N-n)-1} \binom{n+s}{n} = 2\binom{(N-n)+n}{n+1} = 2\binom{N}{n+1}.$$
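The closed form can also be checked numerically against the recurrence. The short Python sketch below builds $Z_{N,n}$ row by row from $Z_{N,n} = Z_{N-1,n} + Z_{N-1,n-1}$ with the boundary values $Z_{N,0} = 2$ and $Z_{0,n} = 0$ for $n > 0$, and compares it with $2\binom{N}{n}$; the function name `strand_counts` is an illustrative choice, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def strand_counts(n_cycles):
    """Rows Z[N][n] for N = 0..n_cycles, built only from the recurrence."""
    rows = [[2] + [0] * n_cycles]  # Z_{0,0} = 2 and Z_{0,n} = 0 for n > 0
    for _ in range(n_cycles):
        prev = rows[-1]
        # Z_{N,0} = 2; Z_{N,n} = Z_{N-1,n} + Z_{N-1,n-1} for n >= 1
        rows.append([2] + [prev[n] + prev[n - 1] for n in range(1, n_cycles + 1)])
    return rows


z = strand_counts(10)
closed_form_holds = all(
    z[N][n] == 2 * comb(N, n) for N in range(11) for n in range(11)
)
```

Summing a row gives the total number of strands after $N$ cycles, $\sum_n 2\binom{N}{n} = 2^{N+1}$, which is the doubling per cycle starting from two strands.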

APPENDIX B

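Approximation of $Q^0_L$ with the Exponential Distribution

$$Q^0_L = P_{\mathrm{cut}} (1 - P_{\mathrm{cut}})^{L-1} = \frac{P_{\mathrm{cut}}}{1 - P_{\mathrm{cut}}} \left[ \left( 1 - \frac{1}{1/P_{\mathrm{cut}}} \right)^{-1/P_{\mathrm{cut}}} \right]^{-P_{\mathrm{cut}} L}.$$

For small values of $P_{\mathrm{cut}}$ we can write

$$\frac{1}{1 - P_{\mathrm{cut}}} \approx 1, \quad \text{so that} \quad \frac{P_{\mathrm{cut}}}{1 - P_{\mathrm{cut}}} \approx P_{\mathrm{cut}}, \quad \text{and} \quad \left( 1 - \frac{1}{1/P_{\mathrm{cut}}} \right)^{-1/P_{\mathrm{cut}}} \approx \exp(1).$$

Therefore, $Q^0_L \approx P_{\mathrm{cut}} \exp(-P_{\mathrm{cut}} L)$ for small values of $P_{\mathrm{cut}}$.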
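How good the exponential approximation is for a given cutting probability can be checked numerically. The Python sketch below compares the exact geometric expression with its exponential approximation over a range of fragment lengths; the function names and the particular values of $P_{\mathrm{cut}}$ and maximum length are illustrative choices, not taken from the paper.

```python
import math


def q0_exact(length, p_cut):
    """Geometric form: P_cut * (1 - P_cut)**(L - 1)."""
    return p_cut * (1.0 - p_cut) ** (length - 1)


def q0_exponential(length, p_cut):
    """Exponential approximation: P_cut * exp(-P_cut * L)."""
    return p_cut * math.exp(-p_cut * length)


def max_relative_error(p_cut, max_length):
    """Largest |approximation / exact - 1| over L = 1..max_length."""
    return max(
        abs(q0_exponential(L, p_cut) / q0_exact(L, p_cut) - 1.0)
        for L in range(1, max_length + 1)
    )


small_cut = max_relative_error(0.01, 200)  # P_cut = 1%
large_cut = max_relative_error(0.10, 50)   # P_cut = 10%
```

For $P_{\mathrm{cut}} = 1\%$ the two expressions stay within about one percent of each other over lengths up to 200, while for $P_{\mathrm{cut}} = 10\%$ the deviation is noticeably larger, in line with the small-$P_{\mathrm{cut}}$ condition above.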
