JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 7, Numbers 3/4, 2000
Mary Ann Liebert, Inc.
Pp. 395–408

Contig Selection in Physical Mapping

STEFFEN HEBER,1,2 JENS STOYE,1 MARCUS FROHME,2 JÖRG HOHEISEL,2 and MARTIN VINGRON1

ABSTRACT

In physical mapping, one orders a set of genetic landmarks or a library of cloned fragments of DNA according to their position in the genome. Our approach to physical mapping divides the problem into smaller and easier subproblems by partitioning the probe set into independent parts (probe contigs). For this purpose we introduce a new distance function between probes, the averaged rank distance (ARD) derived from bootstrap resampling of the raw data. The ARD measures the pairwise distances of probes within a contig and smoothes the distances of probes across different contigs. It shows distinct jumps at contig borders. This makes it appropriate for contig selection by clustering. We have designed a physical mapping algorithm that makes use of these observations and seems to be particularly well suited to the delineation of reliable contigs. We evaluated our method on data sets from two physical mapping projects. On data from the recently sequenced bacterium Xylella fastidiosa, the probe contig set produced by the new method was evaluated using the probe order derived from the sequence information. Our approach yielded a basically correct contig set. On this data we also compared our method to an approach which uses the number of supporting clones to determine contigs. Our map is much more accurate. In comparison to a physical map of Pasteurella haemolytica that was computed using simulated annealing, the newly computed map is considerably cleaner. The results of our method have already proven helpful for the design of experiments aimed at further improving the quality of a map.

Key words: clone-probe hybridization mapping, contig selection, bootstrap.

1. INTRODUCTION

The goal of physical mapping is to order a set of genetic landmarks or a library of cloned fragments of DNA according to their position in the genome. Physical maps are powerful tools for localization and isolation of genes, studying the organization and evolution of genomes and as a preparatory step for efficient sequencing. Even in the postgenome era, it is quite probable that genome-wide functional analyses will precede the sequencing of various organisms. For many such techniques, however, mapping information will still be an important requirement to see functions in their genomic perspective and also to make them accessible to function-directed sequence analysis. Different experimental techniques are used in physical mapping. Roughly, these are clone-probe hybridization mapping (Hoheisel et al., 1993), STS mapping (Hudson et al., 1995), restriction mapping (Coulson et al., 1995), radiation-hybrid mapping (Slonim et al., 1997), and optical mapping (Lin et al., 1999). Here we focus on a physical mapping strategy based on hybridization experiments (Hoheisel et al., 1993; Scholler et al., 1995; Hanke et al., 1998). This procedure starts with a library of clones which correspond to subintervals of a larger contiguous piece of DNA $G$, all subintervals having the same size. Experimentally this can approximately be achieved by size selection methods described in Hoheisel et al. (1996).

1 German Cancer Research Center (DKFZ), Theoretical Bioinformatics (H0300), Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany.

2 German Cancer Research Center (DKFZ), Functional Genome Analysis (H0800), Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany.

In a more formal setting, from this clone library $CL$ we select a subset $P \subseteq CL$ of probes. Each probe $p_i \in P$ is labeled and tested against the clone library. If a clone contains sufficient sequence similarity to the probe sequence, the probe will hybridize to this clone and a positive hybridization signal can be detected. The result of these experiments is a binary clone/probe hybridization matrix $A = (a_{ij})$ where

\[
a_{ij} :=
\begin{cases}
1 & \text{if probe } p_j \text{ hybridizes to clone } c_i, \\
0 & \text{otherwise.}
\end{cases}
\]

The physical mapping problem is to find the order of the probes in $P$ that corresponds to their real position in $G$. A subsequent problem would then be to extend this order to the whole clone library. Here, we do not deal with the latter question, though. In the error-free case, the physical mapping problem can be translated into the following optimization problem (Greenberg and Istrail, 1995): Given a hybridization matrix, find a permutation of the columns (probes) such that the reordered matrix has the consecutive ones property, i.e., every row has at most one block of consecutive ones.
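The consecutive ones property can be checked directly for a given probe order. The following is a minimal sketch on a made-up toy matrix; the function names and the data are illustrative assumptions, not part of the original work.

```python
def count_blocks_of_ones(row):
    """Number of maximal runs of consecutive 1s in a 0/1 sequence."""
    blocks = 0
    previous = 0
    for value in row:
        if value == 1 and previous == 0:
            blocks += 1
        previous = value
    return blocks


def has_consecutive_ones(matrix, probe_order):
    """True if every row has at most one block of 1s after reordering columns."""
    for row in matrix:
        reordered = [row[j] for j in probe_order]
        if count_blocks_of_ones(reordered) > 1:
            return False
    return True


# Hypothetical toy hybridization matrix: rows are clones, columns are probes.
A = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]

print(has_consecutive_ones(A, [0, 1, 2, 3]))  # original column order
print(has_consecutive_ones(A, [0, 2, 1, 3]))  # two middle probes swapped
```

In this toy case the original column order splits the ones of the first and third clone into two blocks, while swapping the two middle probes gives every clone a single block.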

Unfortunately, physical mapping by hybridization experiments is highly influenced by errors and ambiguities: there are high rates of false positive and false negative hybridization signals and inconsistent hybridization signals caused by repetitive sequences, chimeric clones, or clones containing deletions. Additionally, there is variation in library coverage and in clone size. Note that even in the error-free case ambiguities may occur due to multiple solutions to the consecutive ones problem.

In the absence of errors, all admissible probe orders can be found and characterized efficiently using the PQ-tree data structure defined in Booth and Lueker (1976). However, in the presence of noise there is no generalization of the PQ-tree approach and the problem becomes ill defined. Our approach to this problem can be described as follows: we partition the probe set into independent parts (probe contigs). Based on these probe contigs, we clean the hybridization data. Then the probes are ordered inside the probe contigs. Finally the data is reinvestigated and additional experiments are suggested in order to improve and extend the map. This procedure can be iterated several times. In the rest of the paper we will focus only on the partitioning of the probe set into probe contigs, the essential step in the procedure.

Our mapping strategy is based on clustering of probes under a particular distance function. This distance is based on the evaluation of rank differences of probe orders as derived from multiple bootstrap replicates of the original hybridization data. We demonstrate certain properties of this distance function on idealized data that we believe make it particularly appropriate for use in conjunction with a clustering algorithm. The result of the clustering is a partitioning of probes into contigs. We also present methods to order the probes within the contigs.

There are several computational approaches which could be adapted for our physical mapping setting. Most of them globally optimize a certain objective function to construct a preliminary order for all markers/clones and offer then the possibility of interaction to improve this order. In the context of STS-content mapping, Alizadeh et al. (1995a,b) present both a detailed formal analysis and several computational approaches for finding a good marker order. This work contains an approach which relies on maximizing the posterior probability of a marker order, an approach which relies on solving the Hamming distance travelling salesman problem (TSP) and on algorithms for obtaining a good initial probe order and for data cleaning. The authors also discuss and evaluate several combinations of these methods. Cuticchia et al. (1992) use simulated annealing to order a clone set according to a binary clone fingerprint, implemented in the program ODS. Wang et al. (1993) use a random cost algorithm to order a clone set according to objective functions based on the Hamming distance of binary clone fingerprints. Mott et al. (1993) describe the programs PROBEORDER, BARR, and COSTIG which use simulated annealing and tree-search techniques to compute a map based on a maximum-likelihood distance measure between neighboring probes. SEGMAP (Green and Green, 1991) is a powerful interactive graphical tool for analyzing STS-content data which computes an optimal marker order by exhaustively rearranging some supplied suboptimal orders. In the special settings of unique end-probes and nonoverlapping probes, Christof et al. (1997) and Christof and Kececioglu (1999) apply a branch-and-cut approach to determine a probe order.


Computational approaches which primarily divide the data into different contigs before computing a marker order are the mapping software CONTIGMAKER of the WI/MIT group (Hudson et al., 1995) and the program CONTIG EXPLORER described in Nadkarni et al. (1996). In contrast to our approach, they rely on the number of clones which share a certain probe pair for their contig definition.

Bootstrap resampling was introduced in Efron (1979) as a computer-based method for assigning measures of accuracy to statistical estimates. In physical mapping, Wang et al. (1994) and Liu (1998) used this technique to determine the reliability of a clone/marker order. A clustering strategy similar to ours was used in Mayraz and Shamir (1999) in the context of oligonucleotide fingerprinting. An introduction to rank correlation methods can be found in Kendall (1970).

The following section contains the basic definitions and algorithms. It starts by summarizing the initial steps of our procedure where we draw on established methods first for computing one physical map and then for bootstrapping. Next, the averaged rank distance on probes will be defined. Properties of this distance follow. The clustering algorithm presented afterwards uses this distance. In the Results section we apply our method to maps of Xylella fastidiosa and Pasteurella haemolytica. An assessment of the approach and some directions for future development are given in the Discussion section.

2. ALGORITHMS

Our strategy is the following. First we repeatedly apply a standard map construction algorithm based on simulated annealing to bootstrap resamplings of the hybridization data. The resulting bootstrap replicates form the basis for our probe distance function, the averaged rank distance. This distance is then used for constructing contigs by a modified clustering method. Finally, the probes within a contig need to be ordered.

2.1. Basic algorithm for map construction

We focus on ordering the probe set $P$. To compute the order of probes in $P$ we use a vector-TSP (Cuticchia et al., 1992; Alizadeh et al., 1995a,b) formulation based on the Hamming distance between the columns of the clone/probe hybridization matrix $A$. The probe set $P$ is extended by a dummy probe $p_0$ to yield $\tilde{P} := P \cup \{p_0\}$ and likewise the hybridization matrix $A$ is extended by a dummy column consisting only of 0's to give $\tilde{A}$. We construct a complete weighted graph $G = (\tilde{P}, E, c)$ where weight $c((p_i, p_j))$ is defined as the Hamming distance of column $i$ and $j$ in $\tilde{A}$. Now the optimization problem consists of finding in $G$ a Hamiltonian cycle of minimal weight. Such a minimal Hamiltonian cycle corresponds to a probe order which minimizes the number of blocks of consecutive ones in the hybridization matrix with reordered probes. This order is supposed to approximate the correct solution (Greenberg and Istrail, 1995; Xiong et al., 1996). For the minimization we use the simulated annealing algorithm of Press et al. (1992).
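To make the link between cycle weight and blocks of consecutive ones concrete, the sketch below evaluates the Hamming-distance cycle weight for a fixed probe order, including the all-zero dummy column. It only scores a given order; the simulated annealing search is not shown. The matrix and function names are illustrative assumptions.

```python
def hamming(u, v):
    """Number of positions at which two equally long 0/1 vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def cycle_weight(matrix, probe_order):
    """Weight of the cycle: dummy probe, probes in the given order, dummy probe."""
    dummy = [0] * len(matrix)  # all-zero column of the dummy probe p_0
    columns = [[row[j] for row in matrix] for j in probe_order]
    cycle = [dummy] + columns + [dummy]
    return sum(hamming(cycle[i], cycle[i + 1]) for i in range(len(cycle) - 1))


def total_blocks(matrix, probe_order):
    """Total number of blocks of consecutive ones over all rows."""
    blocks = 0
    for row in matrix:
        previous = 0
        for j in probe_order:
            if row[j] == 1 and previous == 0:
                blocks += 1
            previous = row[j]
    return blocks


# Hypothetical toy hybridization matrix: rows are clones, columns are probes.
A = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]

for order in ([0, 1, 2, 3], [0, 2, 1, 3]):
    print(order, cycle_weight(A, order), total_blocks(A, order))
```

Each block of ones in a row contributes exactly one 0-to-1 and one 1-to-0 transition along the cycle, so the cycle weight is twice the total number of blocks. Minimizing one therefore minimizes the other.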

2.2. Bootstrap resampling

In order to simulate independent replications of the physical mapping experiment in silico we resample the data set using a bootstrap strategy (Efron, 1979) which is similar to the approach of Wang et al. (1994); however, with the roles of clones and probes interchanged. We create new hybridization data matrices by resampling $|CL|$ times with replacement from the rows of $A$. This corresponds to repeating the hybridization experiments using the same set of probes $P$ but creating new clone libraries by resampling from the original clone library $CL$.

In order to determine how often the procedure had to be reproduced, we tested the variance in independent experiment repetitions using different numbers of bootstrap replicates. With more than 200 resamplings, the results are well reproducible.
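The row resampling described above can be sketched as follows. This is a minimal illustration using Python's standard `random` module; the function names, the seed handling, and the toy matrix are assumptions made for the example.

```python
import random


def bootstrap_replicate(matrix, rng):
    """Draw len(matrix) rows (clones) with replacement; the probe columns stay fixed."""
    n_clones = len(matrix)
    return [list(matrix[rng.randrange(n_clones)]) for _ in range(n_clones)]


def bootstrap_replicates(matrix, n_replicates, seed=0):
    """Create several resampled hybridization matrices from one original matrix."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    return [bootstrap_replicate(matrix, rng) for _ in range(n_replicates)]


# Hypothetical toy hybridization matrix: rows are clones, columns are probes.
A = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]

replicates = bootstrap_replicates(A, n_replicates=200)
print(len(replicates), len(replicates[0]), len(replicates[0][0]))
```

Each replicate would then be passed to the map construction step to obtain one probe permutation per replicate.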

2.3. Averaged rank distance

While “contig” usually refers to an ordered set of overlapping clones representing a contiguous stretch of DNA, we here introduce the notion of a probe contig.

Let $P = \{p_1, \dots, p_n\}$ denote the set of given probes and let $\Pi$ be a family of permutations of $P$. $\Pi$ may, for example, be the result of bootstrapping the physical mapping data. Then $C = \{p_{i_1}, \dots, p_{i_m}\} \subseteq P$ is a probe contig if it is a maximal set of probes occurring as a “fixed block” in all permutations of $\Pi$. This means the probes occur continuously in a fixed order $\overrightarrow{C} = (p_{i_1}, \dots, p_{i_m})$ or its reverse $\overleftarrow{C} = (p_{i_m}, \dots, p_{i_1})$ in each permutation of $\Pi$ and there is no superset of $C$ with this property.

As an example, consider a set of probes $P = \{p_1, \dots, p_5\}$ and a family of permutations $\Pi = \{(p_1, p_2, p_5, p_4, p_3), (p_2, p_1, p_3, p_4, p_5)\}$. This yields two probe contigs, $C_1 = \{p_1, p_2\}$ and $C_2 = \{p_3, p_4, p_5\}$.
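One way to compute probe contigs in this idealized sense is to collect the probe pairs that are neighbours in every permutation and take the connected components of the resulting graph. A chain of such shared adjacencies must appear as one block, forwards or backwards, in each permutation. The sketch below follows this idea; it is an illustration of the definition, not the procedure used on noisy bootstrap data, and all names are assumptions.

```python
def common_adjacencies(permutations):
    """Unordered probe pairs that are direct neighbours in every permutation."""
    shared = None
    for perm in permutations:
        pairs = {frozenset(pair) for pair in zip(perm, perm[1:])}
        shared = pairs if shared is None else shared & pairs
    return shared or set()


def probe_contigs(permutations):
    """Maximal fixed blocks, returned as sorted lists of probe names."""
    probes = list(permutations[0])
    parent = {p: p for p in probes}  # union-find forest over the probes

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for pair in common_adjacencies(permutations):
        a, b = tuple(pair)
        parent[find(a)] = find(b)

    groups = {}
    for p in probes:
        groups.setdefault(find(p), []).append(p)
    # Names are sorted alphabetically here; this is not the order within the contig.
    return sorted(sorted(group) for group in groups.values())


family = [
    ("p1", "p2", "p5", "p4", "p3"),
    ("p2", "p1", "p3", "p4", "p5"),
]
print(probe_contigs(family))  # [['p1', 'p2'], ['p3', 'p4', 'p5']]
```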

In contrast to the idealized definition of probe contigs, when investigating bootstrap replicates of physical mapping experiments, one typically finds sets of probes where the interior order and integrity are only approximately maintained. This is due partly to the particular data selection in the bootstrap replicates, partly to suboptimal optimization in map construction (simulated annealing), and partly to ambiguity in the raw data. Therefore, in order to determine the probe contigs of a physical map by investigating bootstrap replicates, we use a distance function between probes that tries to correct for this fuzziness.

Let $\mathrm{rk}_\pi(p_i)$ denote the position (rank) of probe $p_i$ in permutation $\pi$. Given a family of probe permutations $\Pi$, the averaged rank distance (ARD) between two probes $p_i$ and $p_j$ is defined as

\[
\mathrm{ARD}_\Pi(p_i, p_j) := \frac{1}{|\Pi|} \sum_{\pi \in \Pi} \left| \mathrm{rk}_\pi(p_i) - \mathrm{rk}_\pi(p_j) \right|.
\]

We omit the subscript $\Pi$ when there is no ambiguity. This distance averages the rank distances of probes in the bootstrap replicates. The idea is that, in the different bootstrap replicates, the probes which belong to the same contig should occur close to each other with a high reliability even if their correct order is not exactly defined. In contrast, the position and orientation of different contigs should be random and therefore the distances of probes belonging to different contigs should be significantly higher and show a higher variability. In the following, we show some properties of the ARD.
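The definition translates directly into code. The following sketch computes all pairwise ARD values for the two-permutation example family given earlier in this section, using exact fractions; the function name and the dictionary layout are illustrative assumptions.

```python
from fractions import Fraction


def averaged_rank_distance(permutations):
    """Return a dict mapping (p, q) to the ARD of probes p and q."""
    probes = list(permutations[0])
    # rank of every probe in every permutation
    ranks = [{p: r for r, p in enumerate(perm)} for perm in permutations]
    ard = {}
    for p in probes:
        for q in probes:
            total = sum(abs(rk[p] - rk[q]) for rk in ranks)
            ard[p, q] = Fraction(total, len(permutations))
    return ard


family = [
    ("p1", "p2", "p5", "p4", "p3"),
    ("p2", "p1", "p3", "p4", "p5"),
]
ard = averaged_rank_distance(family)
print(ard["p1", "p2"], ard["p3", "p5"], ard["p1", "p3"])  # 1 2 5/2
```

Probes of the same contig keep their small rank distance (1 and 2 here), whereas the distance between p1 and p3, which lie in different contigs, averages the values 4 and 1 from the two permutations.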

Theorem 1. The averaged rank distance is a metric.

Proof. The rank distance $d_\pi(p, q) := |\mathrm{rk}_\pi(p) - \mathrm{rk}_\pi(q)|$ of two elements $p, q \in P$ in a permutation $\pi \in \Pi$ is a metric. Therefore the average of these values over all permutations is a metric as well.

Theorem 2. Within a probe contig $\overrightarrow{C} = (p_1, p_2, \dots, p_m)$ the ARD distance between $p_i$ and $p_j$ is $|i - j|$ (see Figure 1).

Proof. By the definition of probe contig this property holds for each permutation $\pi \in \Pi$, and hence it also holds for the average.

Our intention is to analyze the permutations resulting from bootstrapping a physical mapping experiment. In those permutations, we observed that while the contig structure is generally maintained, there seems to be no preference as to the order in which contigs occur. Likewise, there is no obvious preference as to the orientation of the individual contigs. To model this behavior, we define for a given set of contigs the space $\tilde{\Pi}$, which consists of all possible probe permutations compatible with the contig set. More precisely, for each possible contig order and each contig occurring in its two orientations, $\tilde{\Pi}$ contains the implied probe permutation.

 ...
 p7    6   5   4   3   2   1   0
 p6    5   4   3   2   1   0   1
 p5    4   3   2   1   0   1   2
 p4    3   2   1   0   1   2   3
 p3    2   1   0   1   2   3   4
 p2    1   0   1   2   3   4   5
 p1    0   1   2   3   4   5   6
      p1  p2  p3  p4  p5  p6  p7

FIG. 1. The ARD distance matrix within a probe contig.


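Theorem 3. Let $C_1, C_2$ be two probe contigs, $C_1 \neq C_2$. Then for all $p_1, p_2 \in C_1$; $q_1, q_2 \in C_2$,

\[
\mathrm{ARD}_{\tilde{\Pi}}(p_1, q_1) = \mathrm{ARD}_{\tilde{\Pi}}(p_2, q_2) = \text{const}.
\]

Proof. Let $\overrightarrow{C_1} = (p_1, \dots, p_k)$ and $\overrightarrow{C_2} = (q_1, \dots, q_l)$. We will show that

\[
\mathrm{ARD}_{\tilde{\Pi}}(p_i, q_j) = \mathrm{ARD}_{\tilde{\Pi}}(p_{i-1}, q_j) \quad \text{for } 1 < i \leq k. \tag{1}
\]

First, let $C_1 <_\pi C_2$ if and only if $\mathrm{rk}_\pi(p) < \mathrm{rk}_\pi(q)$ for all $p \in C_1$ and $q \in C_2$. (Note that this is a valid definition by our definition of a probe contig.) Then, following immediately from the definition of the ARD and the property that all permutations of probe contigs occur equally often in $\tilde{\Pi}$, we have

\[
\mathrm{ARD}_{\tilde{\Pi}}(p_i, q_j) = \frac{1}{2} \left( \mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_1 <_\pi C_2\}}(p_i, q_j) + \mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_2 <_\pi C_1\}}(p_i, q_j) \right).
\]

Similarly, using the definition of the ARD and the property of $\tilde{\Pi}$ that in $\{\pi \in \tilde{\Pi} : C_1 <_\pi C_2\}$ both orientations of $C_1$ occur equally often, we have

\[
\mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_1 <_\pi C_2\}}(p_i, q_j) = \frac{1}{2} \left( \mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_1 <_\pi C_2 \,\wedge\, \overrightarrow{C_1}\}}(p_i, q_j) + \mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_1 <_\pi C_2 \,\wedge\, \overleftarrow{C_1}\}}(p_i, q_j) \right).
\]

Using

\[
\mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_1 <_\pi C_2 \,\wedge\, \overrightarrow{C_1}\}}(p_i, q_j) = \mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_1 <_\pi C_2 \,\wedge\, \overrightarrow{C_1}\}}(p_{i-1}, q_j) - 1,
\]

and the symmetric equality for the second term, we get

\[
\begin{aligned}
\mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_1 <_\pi C_2\}}(p_i, q_j)
&= \frac{1}{2} \left( \mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_1 <_\pi C_2 \,\wedge\, \overrightarrow{C_1}\}}(p_{i-1}, q_j) - 1 + \mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_1 <_\pi C_2 \,\wedge\, \overleftarrow{C_1}\}}(p_{i-1}, q_j) + 1 \right) \\
&= \mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_1 <_\pi C_2\}}(p_{i-1}, q_j).
\end{aligned}
\]

Similarly we obtain $\mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_2 <_\pi C_1\}}(p_{i-1}, q_j) = \mathrm{ARD}_{\{\pi \in \tilde{\Pi} \,:\, C_2 <_\pi C_1\}}(p_i, q_j)$, and Equation (1) follows. From this one can easily derive the proposition.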

Theorem 4. Based on Theorem 3 one may speak of $\mathrm{ARD}_{\tilde{\Pi}}(C_1, C_2)$. There holds

\begin{align}
\mathrm{ARD}_{\tilde{\Pi}}(C_1, C_2) &= \frac{2|P| + |C_1| + |C_2|}{6} \tag{2} \\
&\geq \frac{|P| + 1}{3}. \tag{3}
\end{align}

Proof. Suppose $C_1$ and $C_2$ are the only probe contigs. It is easy to verify that

\[
\mathrm{ARD}_{\tilde{\Pi}}(C_1, C_2) = \frac{|C_1| + |C_2|}{2}.
\]

Now suppose we have $k > 2$ probe contigs $C_1, C_2, \dots, C_k$. By the properties of $\tilde{\Pi}$, each of the probe contigs $C_i \in \{C_3, \dots, C_k\}$ has probability $\frac{1}{3}$ to occur between the probe contigs $C_1$ and $C_2$. Therefore its contribution to $\mathrm{ARD}_{\tilde{\Pi}}(C_1, C_2)$ is $\frac{1}{3}|C_i|$. Summing up, we get

\[
\sum_{i=3}^{k} \frac{1}{3} |C_i| = \frac{1}{3} \left( |P| - |C_1| - |C_2| \right).
\]

Additionally we have the contribution of $C_1$ and $C_2$ as calculated above. Together Equality (2) follows. Inequality (3) is obvious.
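Theorems 3 and 4 can be checked numerically on a small contig set by enumerating the whole space of compatible permutations. The sketch below does this by brute force; the contig set and all names are made up for illustration, and the enumeration is only feasible for a handful of contigs.

```python
from fractions import Fraction
from itertools import permutations, product


def compatible_permutations(contigs):
    """All probe orders implied by every contig order and every orientation."""
    result = []
    for order in permutations(range(len(contigs))):
        for flips in product([False, True], repeat=len(contigs)):
            perm = []
            for idx in order:
                block = contigs[idx]
                perm.extend(reversed(block) if flips[idx] else block)
            result.append(tuple(perm))
    return result


def ard(perms, p, q):
    """Averaged rank distance of probes p and q over the given permutations."""
    total = sum(abs(perm.index(p) - perm.index(q)) for perm in perms)
    return Fraction(total, len(perms))


# Hypothetical contig set with |P| = 10 probes.
contigs = [("a1", "a2", "a3"), ("b1", "b2"), ("c1",), ("d1", "d2", "d3", "d4")]
space = compatible_permutations(contigs)
n_probes = sum(len(c) for c in contigs)

c1, c2 = contigs[0], contigs[1]
predicted = Fraction(2 * n_probes + len(c1) + len(c2), 6)  # Equality (2)
observed = {ard(space, p, q) for p in c1 for q in c2}
print(observed == {predicted}, predicted >= Fraction(n_probes + 1, 3))
```

Flipping a singleton contig produces the same permutation twice. Every permutation is then counted equally often, so the averages are unaffected.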

Note that we can also generalize Theorems 3 and 4 to the case where the order of probes within the probe contigs is not constant, if we assume that this order is independent of the (internal) order, position, or orientation of other probe contigs. It suffices to show that all averages of the rank differences of $p_i$ and $q_j$ over all permutations where $p_i$ and $q_j$ are at fixed positions in $C_1$ and $C_2$, are constant. But this follows with exactly the same proof as above.

The motivation for defining the ARD distance was the relation between the bootstrap results and the contig permutations. ARD distance matrices based on probe orders derived from bootstrap replications of real data will be shown in Figures 3, 6, and 7. Theorem 4 predicts strong “jumps” of the distance values at contig borders, which we indeed observe on the real data as well. Moreover, by Theorem 3, the distances between probes from two contigs should be constant, yielding a “chess board pattern.” This feature, too, can be recognized on the real data. Thus, the idealized properties derived for ARD on $\tilde{\Pi}$ seem to describe bootstrap data quite well.

2.4. The contig construction algorithm

Encouraged by the results presented above, we proceed to utilize the ARD for clustering on probes in order to define contigs as clusters. The algorithm is similar to the map construction algorithm described in Mayraz and Shamir (1999). It is a modification of a greedy clustering algorithm, where a special contig distance function is combined with a merge criterion that decides which growing contigs may be merged, based on an intercontig distance.

In order to prepare for the algorithm, we first define the contig distance function and the merge criterion.

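Contig distance function. Given two ordered probe contigs, $\overrightarrow{C_1} = (p_1, \dots, p_k)$ and $\overrightarrow{C_2} = (q_1, \dots, q_l)$, with $p_i, q_j \in P$ and a family of probe permutations $\Pi$, we consider all four possible concatenations $\tilde{C} = \{\overrightarrow{C_1}\,\overrightarrow{C_2},\ \overrightarrow{C_1}\,\overleftarrow{C_2},\ \overleftarrow{C_1}\,\overrightarrow{C_2},\ \overleftarrow{C_1}\,\overleftarrow{C_2}\}$ and compute

\[
d(C_1, C_2) := \min_{C \in \tilde{C}} \left\{ \frac{1}{|C_1||C_2|} \sum_{p \in C_1,\, q \in C_2} \left( \mathrm{ARD}_\Pi(p, q) - \left| \mathrm{rk}_C(p) - \mathrm{rk}_C(q) \right| \right)^2 \right\}. \tag{4}
\]

This value defines the contig distance of $C_1$ and $C_2$.

The contig distance measures the mean square deviation of the ARD values from an ideal ARD distance matrix corresponding to the putative linear order $C$. A similar distance was discussed in Weeks and Lange (1987) in the context of linkage analysis.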

The merge criterion. In order to prevent merging different probe contigs, we test if their measured ARD values could be better explained by a merged contig pair or by two unmerged probe contigs. In analogy to the contig distance $d$, we define for two probe contigs $C_1$ and $C_2$ the intercontig distance

\[
d^*(C_1, C_2) := \frac{1}{|C_1||C_2|} \sum_{p \in C_1,\, q \in C_2} \left( \mathrm{ARD}_\Pi(p, q) - \frac{2|P| + |C_1| + |C_2|}{6} \right)^2. \tag{5}
\]

This function measures the mean square deviation of the ARD values from the ARD values for two independent probe contigs $C_1$ and $C_2$ as predicted by Theorem 4. For each putative pair of probe contigs $C_1$ and $C_2$ to be merged, we compare this value to the contig distance $d(C_1, C_2)$ and allow merging only if $d(C_1, C_2) < d^*(C_1, C_2)$.
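As a worked instance of the merge criterion, consider the example family $\Pi = \{(p_1, p_2, p_5, p_4, p_3), (p_2, p_1, p_3, p_4, p_5)\}$ from Section 2.3, for which $|P| = 5$. The ARD values below were computed by hand from these two permutations, and the choice of candidate pairs is purely illustrative.

```latex
% ARD values for the example family:
%   ARD(p3,p4) = 1,  ARD(p3,p5) = 2,
%   ARD(p,q) = 5/2 for every p in {p1,p2} and q in {p3,p4,p5}.
\begin{align*}
\text{Case 1: } & C_1 = \{p_3\},\ C_2 = \{p_4, p_5\},\ \text{best order } C = (p_3, p_4, p_5): \\
d(C_1, C_2) &= \tfrac{1}{2}\left[(1 - 1)^2 + (2 - 2)^2\right] = 0, \\
d^*(C_1, C_2) &= \tfrac{1}{2}\left[\left(1 - \tfrac{13}{6}\right)^2 + \left(2 - \tfrac{13}{6}\right)^2\right] = \tfrac{25}{36}, \\
& d < d^* \ \Rightarrow\ \text{merging is allowed.} \\[6pt]
\text{Case 2: } & C_1 = \{p_1, p_2\},\ C_2 = \{p_3, p_4, p_5\}
  \ \text{(all four concatenations give the same value)}: \\
d(C_1, C_2) &= \tfrac{1}{6}\left[4 \cdot \left(\tfrac{1}{2}\right)^2 + 2 \cdot \left(\tfrac{3}{2}\right)^2\right] = \tfrac{11}{12}, \\
d^*(C_1, C_2) &= \tfrac{1}{6} \cdot 6 \cdot \left(\tfrac{5}{2} - \tfrac{2 \cdot 5 + 2 + 3}{6}\right)^2 = 0, \\
& d > d^* \ \Rightarrow\ \text{merging is rejected.}
\end{align*}
```

In the second case the measured ARD values coincide with the value predicted by Theorem 4 for two independent contigs, so the criterion keeps the two contigs apart.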

The algorithm. We now describe the algorithm that, from a set of probes, constructs a set of probe contigs. It consists of three steps:

1. Initialize the contig set such that each single probe corresponds to a contig, $C_i := \{p_i\}$. Initialize distance matrices $D[i, j] := d(C_i, C_j)$ and $D^*[i, j] := d^*(C_i, C_j)$ for all $(i, j)$.
2. Repeat while further merges are possible:
   a. Search the contig distance matrix for the smallest distance $D[i_0, j_0]$.
   b. If the merge criterion is fulfilled, i.e., if $d(C_{i_0}, C_{j_0}) < d^*(C_{i_0}, C_{j_0})$,
      i. merge contigs $C_{i_0}$ and $C_{j_0}$;
      ii. update the distance matrices.
      Otherwise
      i. set the contig distance to infinity: $D[i_0, j_0] := \infty$.
3. Output the contig set.

Whenever two probe contigs are merged in Step 2.b.i, the corresponding orientation $C \in \tilde{C}$ that yielded the minimum in (4) is used.

Updating the distance matrices in Step 2.b.ii is straightforward. One first removes the rows and columns of the two merged contigs in $D$ and $D^*$, and then one inserts a new row and a new column for the merged contig $C$ where the distances, as in the initialization, are computed using Equations (4) and (5), respectively.
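The three steps can be sketched compactly as follows. This is a simplified illustration, not the authors' C++ implementation: instead of maintaining the matrices $D$ and $D^*$ it recomputes the distances of all remaining pairs in every round, pairs whose merge was rejected are kept in a set instead of being assigned infinity, and ties are broken by contig index. All names are assumptions.

```python
from fractions import Fraction


def ard_table(permutations):
    """ARD of every probe pair over the given permutations."""
    ranks = [{p: r for r, p in enumerate(perm)} for perm in permutations]
    probes = list(permutations[0])
    return {(p, q): Fraction(sum(abs(rk[p] - rk[q]) for rk in ranks), len(ranks))
            for p in probes for q in probes}


def contig_distance(ard, c1, c2):
    """Equation (4): returns (distance, concatenation attaining the minimum)."""
    best = None
    for left in (c1, c1[::-1]):
        for right in (c2, c2[::-1]):
            merged = left + right
            rank = {p: r for r, p in enumerate(merged)}
            total = sum((ard[p, q] - abs(rank[p] - rank[q])) ** 2
                        for p in c1 for q in c2)
            score = total / (len(c1) * len(c2))
            if best is None or score < best[0]:
                best = (score, merged)
    return best


def intercontig_distance(ard, c1, c2, n_probes):
    """Equation (5): deviation from the value predicted by Theorem 4."""
    expected = Fraction(2 * n_probes + len(c1) + len(c2), 6)
    total = sum((ard[p, q] - expected) ** 2 for p in c1 for q in c2)
    return total / (len(c1) * len(c2))


def build_contigs(permutations):
    ard = ard_table(permutations)
    n_probes = len(permutations[0])
    contigs = [(p,) for p in permutations[0]]  # Step 1: one contig per probe
    rejected = set()                           # pairs whose merge was refused
    while True:                                # Step 2
        candidates = []
        for i in range(len(contigs)):
            for j in range(i + 1, len(contigs)):
                if frozenset((contigs[i], contigs[j])) not in rejected:
                    dist, merged = contig_distance(ard, contigs[i], contigs[j])
                    candidates.append((dist, i, j, merged))
        if not candidates:
            return contigs                     # Step 3
        dist, i, j, merged = min(candidates, key=lambda c: c[:3])
        if dist < intercontig_distance(ard, contigs[i], contigs[j], n_probes):
            contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
            contigs.append(merged)             # keep the best orientation
        else:
            rejected.add(frozenset((contigs[i], contigs[j])))


family = [
    ("p1", "p2", "p5", "p4", "p3"),
    ("p2", "p1", "p3", "p4", "p5"),
]
print(build_contigs(family))
```

On the example family from Section 2.3 this recovers the two probe contigs of the idealized definition. A production implementation would follow the matrix-update scheme described above to avoid the repeated recomputation.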

2.5. Probe ordering within a contig

We have already obtained a linear order of the probes within a contig, which results from the orientation of the contigs at the merging step in the contig construction algorithm. However, the computation of the order was not the primary goal of the clustering algorithm, and hence more sophisticated re-orderings might yield better results. We present two alternative possibilities:

1. We assign each clone to a single contig using a maximum likelihood approach similar to the algorithm for fitting clones to a probe order described in Mott et al. (1993) and erase its hybridization signals in other probe contigs. Now the order of probes within a contig can be recomputed by any physical mapping algorithm (for example the basic algorithm for map construction described in the Algorithms section), using only the hybridization data of the clone set which was assigned to this contig.

2. We can also form a “consensus” of the bootstrap maps. We first delete in each bootstrap map the probes which do not belong to the investigated contig. Then, for each of these maps, we determine the orientation which best fits the probe order obtained by the contig construction algorithm. Using this orientation, we rank all probes. If we now order the probes according to the sum of their allotted ranks in the different bootstrap maps, it can be shown (Kendall, 1970) that this order has the highest averaged Spearman rank correlation to all bootstrap replicates and can therefore be used as a “consensus order.”
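The second possibility, the rank-sum consensus, can be sketched as follows. The text does not specify how the best-fitting orientation is measured; the sketch assumes the sum of squared rank differences to the reference order (equivalent to choosing the orientation with the higher Spearman correlation) and breaks rank-sum ties by the reference order. Names and data are hypothetical.

```python
def squared_rank_misfit(order, reference_rank):
    """Sum of squared differences between ranks in `order` and reference ranks."""
    return sum((r - reference_rank[p]) ** 2 for r, p in enumerate(order))


def consensus_order(bootstrap_maps, reference_order):
    """Order the probes of one contig by their rank sums over all bootstrap maps."""
    contig = set(reference_order)
    reference_rank = {p: r for r, p in enumerate(reference_order)}
    rank_sum = {p: 0 for p in reference_order}
    for bootstrap_map in bootstrap_maps:
        # delete the probes that do not belong to the investigated contig
        restricted = [p for p in bootstrap_map if p in contig]
        # assumed orientation rule: keep the direction closer to the reference
        flipped = restricted[::-1]
        if (squared_rank_misfit(flipped, reference_rank)
                < squared_rank_misfit(restricted, reference_rank)):
            restricted = flipped
        for r, p in enumerate(restricted):
            rank_sum[p] += r
    # ties in the rank sum are broken by the reference order (an assumption)
    return sorted(reference_order, key=lambda p: (rank_sum[p], reference_rank[p]))


# Hypothetical contig order and bootstrap maps; "x" and "y" belong to other contigs.
reference = ["a", "b", "c", "d"]
maps = [
    ("x", "a", "b", "d", "c"),
    ("d", "c", "b", "y", "a"),
    ("b", "a", "c", "d", "x"),
]
print(consensus_order(maps, reference))  # ['a', 'b', 'c', 'd']
```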

2.6. Analysis and implementation of the algorithms

Assume $k$ permutations (the bootstrap replicates) of the $n$ probes are given. Then a straightforward algorithm that computes the $n(n-1)/2$ ARD values between all probes runs in total time $O(n^2 k)$ and uses $O(n^2)$ space. Using these precomputed values, it is easy to compute for a given pair $(C_1, C_2)$ the two values $d(C_1, C_2)$ and $d^*(C_1, C_2)$ in time $O(|C_1||C_2|)$. In particular, the complete initialization of tables $D[i, j]$ and $D^*[i, j]$ in Step 1 of the contig construction algorithm takes $O(n^2 k)$ time.

It can easily be seen that, using a priority queue to store the distance table $D$, the greedy clustering of $n$ elements can be computed in time $O(n^2 \log n + nt)$ where $t$ is the time required to compute all distances of a newly created (merged) cluster $C$ to the remaining clusters. In our case, $t$ is $O(|C|n)$, which is bounded by $O(n^2)$. Moreover, in our modified greedy clustering algorithm, before merging we have to test if the merge criterion is fulfilled. Each such test can easily be done in constant time. Hence, the complete clustering (Step 2 of the contig construction algorithm) takes $O(n^3)$ time in the worst case.

The algorithms for map construction and bootstrapping were written in C++ in the LEDA 3.8 environment (Mehlhorn and Näher, 1999). For solving the vector-TSP, we adapted the simulated annealing routine of Press et al. (1992). Visualizations of the distance and variance matrices were done in MATLAB; visualizations of the clone/probe hybridization matrices were done using the program package Programs for Analysing Hybridisation Data, version 2, by R. Mott and A. Grigoriev, described in Mott et al. (1993).

The complete computation for the Pasteurella haemolytica data set (255 probes and 1025 clones), including the 200 bootstrap resamplings, took about 135 minutes on a SUN Ultra Enterprise 450 with 400 MHz. Note that, using the bootstrap approach, our method obviously was not designed to run as fast as possible, but rather to yield results of the highest possible quality.


3. RESULTS

3.1. Validation of the ARD and the clustering

In order to validate our algorithm, we tested it on the hybridization data of Xylella fastidiosa. This data set was created by Frohme et al. (unpublished) for the Xylella fastidiosa Genome Project.

During the development of our algorithm the sequence was unknown. While finishing, however, the sequence became available (Simpson et al., unpublished) such that we are now able to obtain the exact position of 181 probes in the genome. A visualization of the hybridization data matrix using this “correct” probe order (corresponding to the sequence position) is shown in Figure 2.

We used the hybridization data to create 1000 resampled hybridization matrices, and then we computed the corresponding probe orders and their ARD values. Figure 3 shows a visualization of the ARD values (left) and the variances of the ARD values (right) using the “correct” probe order. Apart from a few outliers which are persistently misplaced by our map construction algorithm, the ARD values show the structure predicted by Theorems 2, 3, and 4. On the main diagonal, one finds blocks of small values which correspond to probe contigs (Theorem 2). These blocks show distinct “jumps” at the borders (Theorem 4). Moreover, by Theorem 3 the distances between probes from two contigs should be constant yielding a “chess board pattern.” This feature, too, can be recognized on the real data. Additionally, the variances of the ARD values (Figure 3, right) also confirm our prediction that ARD values within a probe contig should show a small variance compared to the variances between probes of different contigs. Our contig construction algorithm applied to this data set yielded twenty probe contigs including three singletons (see Table 1).

The selected contigs correspond to the blocks on the main diagonal of the ARD distance matrix and the corresponding variances. We found six incorrectly placed probes in the contig set: probes 14, 30, 37, 70, 126, 159. A re-examination of these probes on the sequence level showed that probes 14 and 37 overlap with large repeats. Clearly, these probes were placed at a wrong occurrence of these repeats in the genome. Probe 70 also overlaps with a repeated sequence, but the other occurrence does not match the position of this probe either. The remaining probes 30, 126 and 159 have a misleading hybridization pattern (strong

FIG. 2. A physical map of Xylella fastidiosa produced by procedures as described in Hoheisel et al. (1993) using the “correct” probe order.


FIG. 3. ARD distance matrix of the Xylella fastidiosa data set using the “correct” probe order based on 1000 bootstrap replicates (left). Variance of these ARD values (right). Both axes of each panel index the probes.

signal in another region of the genome) which cannot be explained on the sequence level and which we believe to result from a mix-up of clones.

Apart from these wrongly placed probes, the probe contigs are essentially correct, i.e., no contiguous stretches of probes of different positions in the genome are merged together in the same probe contig. It seems likely that repeat sequences may lead to increased ARD values by causing ambiguity in the probe order. In our case, this has not prevented the wrong placement of some single probes, but it prevented the algorithm from merging large probe stretches which do not belong together. It remains an interesting open

Table 1. Our Contig Construction Algorithm Applied to the Xylella fastidiosa Data Set Yielded 20 Probe Contigs (Left Column). Based on the Correct Order We Assigned Each Probe Contig a Position Corresponding to the Position of the Majority of its Probes (Center Column). Probes Inconsistent with This Position Were Counted as Wrongly Assigned and Are Listed in the Right Column (“−” Corresponds to Missing and “+” Corresponds to Wrongly Included Probes)

| Contig | Position | Wrongly assigned probes |
|---|---|---|
| 1 | 0–7 | |
| 2 | 8–13 | +70 |
| 3 | 15–36 | −30 |
| 4 | 38 | |
| 5 | 39–55 | |
| 6 | 56–61 | |
| 7 | 62–71 | +14, +126, −70 |
| 8 | 72 | |
| 9 | 73–81 | +159 |
| 10 | 82–84 | +30 |
| 11 | 85–88 | |
| 12 | 89–96 | |
| 13 | 97–110 | |
| 14 | 111–125 | +37 |
| 15 | 127–133 | |
| 16 | 134–140 | |
| 17 | 141 | |
| 18 | 142–157 | |
| 19 | 158–173 | −159 |
| 20 | 174–180 | |


FIG. 4. Correlation of the probe order found by our clustering algorithm (x-axis, “order in contigs”) and the “correct” order (y-axis, “correct order”); corr = .974. The contig borders are marked on the x-axis.

question to explore this behaviour in greater detail and to test the algorithm’s performance on eukaryotic DNA, whose repeats are much more complex than those of prokaryotes.

To demonstrate the quality of our clustering, we arranged the probe contigs in the correct order without changing the probe order inside the probe contigs (see Figure 4). Although our contig construction algorithm was mainly intended to compute a probe partition, it produced a remarkably good probe order.
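The comparison summarized in Figure 4 can be illustrated as follows. This is a small sketch, assuming that each probe is identified by its index in the “correct” order and that a contig is placed according to the median of its probes' indices (a simplification of the majority rule used for Table 1). The helper `order_correlation` and the toy contigs are hypothetical.

```python
import numpy as np

def order_correlation(contigs):
    """Arrange contigs by true position, keep their inner order, correlate with truth.

    contigs: list of lists; each inner list holds the true positions of the
             probes of one contig, in the order computed for that contig.
    """
    # Place contigs by the median true position of their probes (an assumption).
    arranged = sorted(contigs, key=lambda c: float(np.median(c)))
    true_pos = np.array([p for contig in arranged for p in contig], dtype=float)
    map_pos = np.arange(len(true_pos), dtype=float)  # position in the contig-wise order
    return np.corrcoef(map_pos, true_pos)[0, 1]

# Toy data: one contig with a local swap, one contig in reversed orientation.
contigs = [[5, 6, 7, 8], [0, 1, 3, 2], [12, 11, 10, 9]]
print(round(order_correlation(contigs), 3))
```

A reversed contig lowers the correlation even though its internal order is correct up to orientation, so the value should be read together with the contig borders.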

Comparison with k-linked contigs. Hudson et al. (1995) (in the context of STS mapping) defined contigs based on the number of supporting clones. We adapted this method in the following way. Two probes are k-linked if they hybridize to at least k clones simultaneously. We assembled the Xylella fastidiosa probe set into k-linked contigs and evaluated the contig set. The result is shown in Table 2.

For values of k which produced a reasonable number of contigs, this approach always merges consecutive stretches of probes and results in a higher number of wrongly assigned probes compared to the results of our algorithm. We are well aware that the k-linkage approach is not designed to be a stand-alone method and that the resulting k-linked contigs could be improved by additional cleaning steps. Nevertheless, this demonstrates that the probe contig set, at least for our data sets, is a reasonable alternative to this approach.
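The k-linkage criterion can be sketched as follows. This is a minimal illustration, assuming a binary clone-by-probe hybridization matrix and assuming that k-linked contigs are the connected components of the graph that joins every k-linked pair of probes. The function name `k_linked_contigs` and the toy matrix are hypothetical.

```python
import numpy as np

def k_linked_contigs(hyb, k):
    """Group probes into connected components of the k-link graph.

    hyb: binary array of shape (n_clones, n_probes); hyb[c, p] = 1 if
         clone c hybridizes to probe p.
    """
    hyb = np.asarray(hyb, dtype=int)
    n_probes = hyb.shape[1]
    shared = hyb.T @ hyb  # shared[p, q] = number of clones hitting both p and q
    parent = list(range(n_probes))

    def find(p):
        # Follow parent pointers to the component representative (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p in range(n_probes):
        for q in range(p + 1, n_probes):
            if shared[p, q] >= k:
                parent[find(p)] = find(q)

    groups = {}
    for p in range(n_probes):
        groups.setdefault(find(p), []).append(p)
    return sorted(groups.values())

# Toy matrix (assumed): probes 0-1 and 2-3 share two clones, probes 1-2 share one.
hyb = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 1, 1, 0],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
for k in (1, 2, 3):
    print(k, k_linked_contigs(hyb, k))
```

Raising k splits the probe set into more and smaller contigs, which mirrors the trend in the number of contigs reported in Table 2.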

3.2. Application to the Pasteurella haemolytica data set

In order to demonstrate the robustness of our clustering method we applied it to a noisy data set of Pasteurella haemolytica which is very difficult to process (Hanke et al., unpublished). A conventional approach using simulated annealing to optimize the above-described vector-TSP formulation produced only an unsatisfactory result (Figure 5, left). A visualization of the ARD distance matrix (Figure 6, left) ordered with respect to this solution immediately highlights large regions which seem to be incorrectly ordered. A closer look with a higher magnification (Figure 7) also reveals local disorder.

Our cluster algorithm determined 39 contigs (Figure 6, right) which appear more homogeneous than the result derived by simulated annealing. We arranged these contigs (for presentation) in an order which minimizes the contig distance function (Equation 4). A visualization of the clone/probe hybridization matrix corresponding to this order is shown in Figure 5 (right). The improvements over the physical map based on simulated annealing (Figure 5, left) are obvious.

Table 2. The Number of k-Linked Contigs, Merged Contigs, and Wrongly Assigned Probes as Compared to the Correct Probe Order of the Xylella fastidiosa Data for Different Values of k

| k | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of contigs | 2 | 4 | 7 | 13 | 16 | 22 | 29 | 36 | 40 | 45 | 73 |
| Merged contigs | - | 2 | 2 | 5 | 5 | 7 | 5 | 3 | 3 | 3 | 2 |
| Wrongly assigned probes | - | 85 | 84 | 69 | 54 | 44 | 33 | 21 | 21 | 20 | 13 |


FIG. 5. Clone/probe hybridization matrix of Pasteurella haemolytica based on the best output of 200 simulated annealing runs (left). Map based on our cluster construction algorithm (right).

FIG. 6. ARD distance matrix of Pasteurella haemolytica ordered according to the simulated annealing result described in the text. The region p180–p240 shows putative disorder (left). The reordered ARD distance matrix is shown on the right. Both axes of each panel index the probes.

Left panel (order of the simulated annealing result):

| | p123 | p124 | p125 | p126 | p127 | p128 | p129 |
|---|---|---|---|---|---|---|---|
| p129 | 3.86 | 4.25 | 4.33 | 6.77 | 6.10 | 2.80 | 0.00 |
| p128 | 5.22 | 5.07 | 4.50 | 6.39 | 5.19 | 0.00 | 2.80 |
| p127 | 7.02 | 6.91 | 5.76 | 2.03 | 0.00 | 5.19 | 6.10 |
| p126 | 6.79 | 6.45 | 5.75 | 0.00 | 2.03 | 6.39 | 6.77 |
| p125 | 2.80 | 2.34 | 0.00 | 5.75 | 5.76 | 4.50 | 4.33 |
| p124 | 1.20 | 0.00 | 2.34 | 6.45 | 6.91 | 5.07 | 4.25 |
| p123 | 0.00 | 1.20 | 2.80 | 6.79 | 7.02 | 5.22 | 3.86 |

Right panel (proposed order):

| | p123 | p124 | p125 | p129 | p128 | p126 | p127 |
|---|---|---|---|---|---|---|---|
| p127 | 7.02 | 6.91 | 5.76 | 6.10 | 5.19 | 2.03 | 0.00 |
| p126 | 6.79 | 6.45 | 5.75 | 6.77 | 6.39 | 0.00 | 2.03 |
| p128 | 5.22 | 5.07 | 4.50 | 2.80 | 0.00 | 6.39 | 5.19 |
| p129 | 3.86 | 4.25 | 4.33 | 0.00 | 2.80 | 6.77 | 6.10 |
| p125 | 2.80 | 2.34 | 0.00 | 4.33 | 4.50 | 5.75 | 5.76 |
| p124 | 1.20 | 0.00 | 2.34 | 4.25 | 5.07 | 6.45 | 6.91 |
| p123 | 0.00 | 1.20 | 2.80 | 3.86 | 5.22 | 6.79 | 7.02 |

FIG. 7. Enlargement of the ARD distance matrix of Pasteurella haemolytica ordered according to the simulated annealing result (left). We suppose that the probe order is locally incorrect. A configuration which fits better to an ARD distance matrix within a probe contig could be achieved if probes 129 and 128 (in this order) would be placed between probes 125 and 126 (right).
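One rough way to quantify how well an order fits the within-contig pattern is to count how often the ARD values fail to grow when moving away from the main diagonal. The sketch below applies such a count to the values of Figure 7. The criterion itself and the function name `monotonicity_violations` are assumptions made for illustration; they are not the measure used in the text.

```python
import numpy as np

probes = [123, 124, 125, 126, 127, 128, 129]
# ARD values of Figure 7, rows and columns in the order of `probes`.
ard = np.array([
    [0.00, 1.20, 2.80, 6.79, 7.02, 5.22, 3.86],
    [1.20, 0.00, 2.34, 6.45, 6.91, 5.07, 4.25],
    [2.80, 2.34, 0.00, 5.75, 5.76, 4.50, 4.33],
    [6.79, 6.45, 5.75, 0.00, 2.03, 6.39, 6.77],
    [7.02, 6.91, 5.76, 2.03, 0.00, 5.19, 6.10],
    [5.22, 5.07, 4.50, 6.39, 5.19, 0.00, 2.80],
    [3.86, 4.25, 4.33, 6.77, 6.10, 2.80, 0.00],
])

def monotonicity_violations(dist, order):
    """Count steps away from the diagonal at which the distance decreases."""
    idx = [probes.index(p) for p in order]
    d = dist[np.ix_(idx, idx)]          # matrix rearranged to the given order
    count = 0
    for i in range(len(idx)):
        right = d[i, i:]                # from the diagonal to the right end of row i
        left = d[i, i::-1]              # from the diagonal to the left end of row i
        count += int(np.sum(np.diff(right) < 0)) + int(np.sum(np.diff(left) < 0))
    return count

annealing_order = [123, 124, 125, 126, 127, 128, 129]
proposed_order = [123, 124, 125, 129, 128, 126, 127]
print(monotonicity_violations(ard, annealing_order),
      monotonicity_violations(ard, proposed_order))
```

For these values the proposed order yields fewer violations than the annealing order, in line with the visual impression from the figure, although neither order is free of them.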


The results of our cluster algorithm have already been used in the Pasteurella haemolytica physical mapping project. At probe contig borders, additional probes for contig extension and gap closure were selected and used for further hybridization experiments. In addition, the computed probe order was used to select a clone set for ordering a plasmid library.

4. DISCUSSION

Clustering the set of probes into independent contigs and subsequently ordering these contigs is a natural approach to physical mapping. It divides the optimization problem into smaller and, it is hoped, easier subproblems that can be dealt with independently. At the same time, though, it introduces the danger that errors in the contig selection propagate. In this work we presented a method for contig selection that apparently performs very well on real data.

The source of the robustness of the resulting contig definitions is probably twofold. First, bootstrapping is the in silico equivalent of repeating an experiment. For each resampled data set, we compute a physical map using a standard algorithm. Particularities of any one solution are averaged out, and thus the sensitivity to outliers or peculiarities of the data is reduced.
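The resampling step can be sketched as follows. This is a minimal illustration, assuming that the data form a clone-by-probe hybridization matrix and that one bootstrap replicate is obtained by drawing clones (rows) with replacement. The function name `bootstrap_replicates` is hypothetical, and the map construction that would be applied to each replicate is left out.

```python
import numpy as np

def bootstrap_replicates(hyb, n_replicates, seed=0):
    """Yield resampled hybridization matrices of the same shape as `hyb`."""
    hyb = np.asarray(hyb)
    rng = np.random.default_rng(seed)
    n_clones = hyb.shape[0]
    for _ in range(n_replicates):
        # Draw n_clones row indices with replacement; probes (columns) are kept.
        rows = rng.integers(0, n_clones, size=n_clones)
        yield hyb[rows, :]

# Toy matrix (assumed): 4 clones, 3 probes.
hyb = np.array([[1, 0, 0],
                [1, 1, 0],
                [0, 1, 1],
                [0, 0, 1]])
replicates = list(bootstrap_replicates(hyb, 200))
print(len(replicates), replicates[0].shape)
```

Each replicate would then be passed to the map construction algorithm, and the resulting probe orders are what the distance function combines.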

Second, in order to combine the results of these computations we define a distance function between probes which averages the rank differences of probe pairs in these bootstrap maps. This approach can be interpreted as a generalization of the bootstrap procedure for physical mapping (Efron, 1979; Wang et al., 1994; Liu, 1998) which not only takes into account the consecutive occurrence of two probes, but also uses the information of more distant connections. This leads to a robust and reliable distance function with interesting and useful properties. Averaged rank distance is largely independent of factors like coverage depth because it accounts only for distances in rank of probes. Within contigs, the averaged rank distance behaves much like other distance measures. Between contigs, however, individual distances between probes are less important because all probes in one contig tend to have roughly the same distance to any probe in a particular other contig. This between-contig distance tends to be much larger than the distances between neighboring probes in the same contig. In this respect, ARD on idealized data resembles an ultrametric in that all distances between elements of two clusters are equal. Hence, such a distance should be more easily approximated by a tree and allow for good clustering results.
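The equal-distance property can be made concrete with a small calculation. The following is a sketch under simplifying assumptions that are made here only for illustration: two adjacent contigs A and B with m and n probes, A placed to the left of B, a fixed internal probe order, and an orientation of each contig that is flipped independently with probability 1/2 in each bootstrap map. The symbols r (rank), i (internal position of a probe a in A) and j (internal position of a probe b in B) are introduced for this example only.

```latex
\begin{align*}
r(a) &\in \{\, i,\; m+1-i \,\}, \qquad
r(b) \in \{\, m+j,\; m+n+1-j \,\}, \qquad r(b) > r(a), \\
\mathbb{E}\,\lvert r(b)-r(a)\rvert
  &= \mathbb{E}\,r(b) - \mathbb{E}\,r(a)
   = \Bigl(m + \tfrac{n+1}{2}\Bigr) - \tfrac{m+1}{2}
   = \tfrac{m+n}{2}.
\end{align*}
```

Under these assumptions the expected rank distance does not depend on i or j, and by symmetry the same value results when B lies to the left of A. Two neighboring probes inside one contig have rank distance 1, so the between-contig value is larger as soon as m + n > 2.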

The results shown are very encouraging. In addition, the distance matrices can also be used to visualize the reliability of a given probe ordering and to highlight dubious regions (see Figures 6, left, and 7). This has proven very helpful for deriving hypotheses about possible orderings and about experiments which increase the quality of the map. Similar drawings for the bootstrap values are less meaningful because they incorporate only next-neighbor connections.

Several lines of future work can be anticipated. The problem of contig construction is particularly challenging in physical mapping using STS-content data. For example, large STS mapping data sets were collected by the CEPH/Généthon and WI/MIT teams, but an assembly into comprehensive contig maps was impossible (Harley et al., 1999). We plan to adapt our method so that it becomes applicable to STS-content data. On the theoretical side, we are working on a probabilistic model that allows one to formulate the partitioning of probes into contigs as an optimization problem. Another interesting project could be to investigate the influence of other perturbation strategies, like subsampling, oversampling, or data perturbation, on our method.

ACKNOWLEDGMENTS

We are very grateful to the Organization for Nucleotide Sequencing and Analysis (ONSA), especially to João Carlos Setubal and João Meidanis, for providing us with the sequence data of the Xylella fastidiosa clone library. We would also like to thank Richard Desper for many helpful discussions. Jens Hanke provided us with the data set and physical map of Pasteurella haemolytica.

REFERENCES

Alizadeh, F., Karp, R., Newberg, L., and Weisser, D. 1995a. Physical mapping of chromosomes: A combinatorial problem in molecular biology. Algorithmica 13, 52–76.


Alizadeh, F., Karp, R., Weisser, D., and Zweig, G. 1995b. Physical mapping of chromosomes using unique probes. J. Comp. Biol. 2, 159–184.

Booth, K., and Lueker, G. 1976. Testing for the consecutive ones property, interval graphs and graph planarity using PQ-tree algorithms. J. Comput. Syst. Sci. 13, 333–379.

Christof, T., Jünger, M., Kececioglu, J., Mutzel, P., and Reinelt, G. 1997. A branch-and-cut approach to physical mapping of chromosomes by unique end-probes. J. Comp. Biol. 4, 433–447.

Christof, T., and Kececioglu, J. 1999. Computing physical maps of chromosomes with nonoverlapping probes by branch-and-cut, 115–123. In Istrail, S., Pevzner, P., and Waterman, M., eds., Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB 99), ACM, New York.

Coulson, A., Huynh, C., Kozono, Y., and Shownkeen, R. 1995. The physical map of the Caenorhabditis elegans genome. Methods Cell Biol. 48, 533–550.

Cuticchia, A., Arnold, J., and Timberlake, W. 1992. The use of simulated annealing in chromosome reconstruction experiments based on binary scoring. Genetics 132, 591–601.

Efron, B. 1979. Bootstrap methods: Another look at the jackknife. Ann. Stat. 7, 1–26.

Green, E., and Green, P. 1991. Sequence-tagged site (STS) content mapping of human chromosomes: theoretical considerations and early experiences. PCR Methods Appl. 1, 77–90.

Greenberg, D., and Istrail, S. 1995. The chimeric mapping problem: Algorithmic strategies and performance evaluation on synthetic genomic data. J. Comp. Biol. 2, 219–274.

Hanke, J., Frohme, M., Laurent, J.-P., Swindle, J., and Hoheisel, J. 1998. Hybridization mapping of Trypanosoma cruzi chromosome III and IV. Electrophoresis 19, 482–485.

Harley, E., Bonner, A., and Goodman, N. 1999. Revealing hidden interval graph structure in STS-content data. Bioinformatics 15, 278–285.

Hoheisel, J., Maier, E., Mott, R., and Lehrach, H. 1996. Integrated genome mapping by hybridization, 319–346. In Birren, B., and Lai, E., eds., Analysis of Non-Mammalian Genomes: A Practical Guide. Academic Press, San Diego.

Hoheisel, J., Maier, E., Mott, R., McCarthy, L., Grigoriev, A., Schalkwyk, L., Nizetic, D., Francis, F., and Lehrach, H. 1993. High resolution cosmid and P1 maps spanning the 14 Mb genome of the fission yeast S. pombe. Cell 73, 109–120.

Hudson, T., Stein, L., Gerety, S., Ma, J., Castle, A., Silva, J., Slonim, D., Baptista, R., Kruglyak, L., Xu, S., Hu, X., Colbert, A., Rosenberg, C., Reeve-Daly, M., Rozen, S., Hui, L., Wu, X., Vestergaard, C., Wilson, K., Bae, J., Maitra, S., Ganiatsas, S., Evans, C., DeAngelis, M., Ingalls, K., Nahf, R., Horton Jr., L., Anderson, M., Collymore, A., Ye, W., Kouyoumjian, V., Zemsteva, I., Tam, J., Devine, R., Courtney, D., Renaud, M., Nguyen, H., O’Connor, T., Fizames, C., Fauré, S., Gyapay, G., Dib, C., Morissette, J., Orlin, J., Birren, B., Goodman, N., Weissenbach, J., Hawkins, T., Foote, S., Page, D., and Lander, E. 1995. An STS-based map of the human genome. Science 270, 1945–1954.

Kendall, M. 1970. Rank Correlation Methods. Griffin, London.

Lin, J., Qi, R., Aston, C., Jing, J., Anantharaman, T., Mishra, B., White, O., Daly, M., Minton, K., Venter, C., and Schwartz, D. 1999. Whole-genome shotgun optical mapping of Deinococcus radiodurans. Science 285, 1558–1562.

Liu, B. 1998. Statistical Genomics: Linkage, Mapping, and QTL Analysis. CRC Press LLC, Boca Raton, FL.

Mayraz, G., and Shamir, R. 1999. Construction of physical maps from oligonucleotide fingerprints data, 268–277. In Istrail, S., Pevzner, P., and Waterman, M., eds., Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB 99), ACM, New York.

Mehlhorn, K., and Näher, S. 1999. LEDA: A Platform for Combinatorial and Geometric Computing. Cambridge University Press, Cambridge.

Mott, R., Grigoriev, A., Maier, E., Hoheisel, J., and Lehrach, H. 1993. Algorithms and software tools for ordering clone libraries: Application to the mapping of the genome of Schizosaccharomyces pombe. Nucl. Acids Res. 21, 1965–1974.

Nadkarni, P., Banks, A., Montgomery, K., LeBlanc-Stracewski, J., Miller, P., and Krauter, K. 1996. CONTIG EXPLORER: Interactive marker-content map assembly. Genomics 31, 301–310.

Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. 1992. Numerical Recipes in C. Cambridge University Press, New York.

Scholler, P., Karger, A., Meier-Ewert, S., Lehrach, H., Delius, H., and Hoheisel, J. 1995. Fine-mapping of shotgun template-libraries: An efficient strategy for the systematic sequencing of genomic DNA. Nucl. Acids Res. 23, 3842–3849.

Slonim, D., Kruglyak, L., Stein, L., and Lander, E. 1997. Building human genome maps with radiation hybrids. J. Comp. Biol. 4, 487–504.

Wang, Y., Prade, R., Griffith, J., Timberlake, W., and Arnold, J. 1993. A fast random cost algorithm for physical mapping. Proc. Natl. Acad. Sci. USA 91, 11094–11098.

Wang, Y., Prade, R., Griffith, J., Timberlake, W., and Arnold, J. 1994. ODS_BOOTSTRAP: Assessing the statistical reliability of physical maps by bootstrap resampling. CABIOS 10, 625–634.


Weeks, D., and Lange, K. 1987. Preliminary ranking procedures for multilocus ordering. Genomics 1, 236–242.

Xiong, M., Chen, R., Prade, R., Wang, J., Griffith, W., Timberlake, W., and Arnold, J. 1996. On the consistency of a physical mapping method to reconstruct a chromosome in vitro. Genetics 142, 267–284.

Address correspondence to:
Steffen Heber
German Cancer Research Center (DKFZ)
Theoretical Bioinformatics (H0300)
Im Neuenheimer Feld 280
D-69120 Heidelberg, Germany

E-mail: [email protected]

