Article No. jmbi.1999.3004, available online at http://www.idealibrary.com. J. Mol. Biol. (1999) 291, 789–799

Gene Organisation, Sequence Variation and Isochore Structure at the Centromeric Boundary of the Human MHC

Richard Stephens1, Roger Horton2, Sean Humphray2, Lee Rowen3, John Trowsdale1 and Stephan Beck2*

1Immunology, Department of Pathology, University of Cambridge, Tennis Court Road, Cambridge CB2 1QP, UK
2The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
3Department of Molecular Biotechnology, University of Washington, Seattle, WA 98195-7730, USA

E-mail address of the corresponding author: [email protected]

Abbreviations used: HLA, human leucocyte antigens; HERV, human endogenous retrovirus; MHC, major histocompatibility complex; ORF, open reading frame; indel, insertion/deletion.

0022-2836/99/340789–11 $30.00/0

We have mapped and sequenced the region immediately centromeric of the human major histocompatibility complex (MHC). A cluster of 13 genes/pseudogenes was identified in a 175 kb PAC linking the TAPASIN locus with the class II region. It includes two novel human genes (BING4 and SACM2L) and a thus far unnoticed human leucocyte antigen (HLA) class II pseudogene, termed HLA-DPA3. Analysis of the G + C content revealed an isochore boundary which, together with the previously reported telomeric boundary, defines the MHC class II region as one of the first completely sequenced isochores in the human genome. Comparison of the sequence with limited sequence from other cell lines shows that the high sequence variation found within the classical class II region extends beyond the identified isochore boundary, leading us to propose the concept of an "extended MHC". By comparative analysis, we have precisely identified the mouse/human synteny breakpoint at the centromeric end of the extended MHC class II region between the genes HSET and PHF1.

© 1999 Academic Press

Keywords: chromosome banding; extended MHC; G + C content; isochore; single nucleotide polymorphism (SNP); synteny breakpoint

*Corresponding author

Introduction

The human MHC, the HLA system, comprises about 4 Mb on chromosome 6p21.3. MHC regions are organized from centromere to telomere as class II, class III and class I. We have recently sequenced a gene-rich region positioned centromeric of the human class II region (Herberg et al., 1998a). This region contains TAPASIN, a gene required for antigen presentation by MHC class I molecules (Herberg et al., 1998b; Ortmann et al., 1997); DAXX, which encodes an effector of Fas that stimulates apoptosis through the Jun kinase (JNK) pathway (Chang et al., 1998); RGL2, which encodes an effector of Ras; and several other novel genes (Herberg et al., 1998a; Peterson et al., 1996). The presence of the TAPASIN gene raises the possibility that the immune-related cluster of genes known as the MHC extends further centromeric than previously thought. In support of this view, a new MHC locus, Cim2, that influences class I presentation has apparently been mapped between K and Pb in mouse (Simmons et al., 1997b). This region has a conserved syntenic relationship in human, mouse and rat (Hanson & Trowsdale, 1991; Walter et al., 1996).

The human genome is composed of giant mosaic structures or isochores (Bernardi, 1993). These are long stretches of DNA which vary in their G + C composition from less than 38 % G + C to more than 55 % G + C. These isochores have been grouped into five classes, the A + T-rich L1 and L2 (approximately 40 % G + C) and the increasingly G + C-rich H1 (45 %), H2 (50 %) and H3 (53 %) classes. The gene density found within each isochore type has been found to correlate with the G + C content, being greater in those isochores with a higher G + C content (Craig & Bickmore, 1993). Isochores are also present within the MHC. The isochore boundary separating the class II and III regions has been characterised and found to contain sequences similar to the pseudoautosomal boundaries of the human sex chromosomes (Fukagawa et al., 1996, 1995). Quantitative analysis of the class II and class III isochores confirmed the correlation with chromosome band Giemsa staining but no clear correlation could be established between isochores and gene/repeat density (Beck et al., 1996). A further correlation was found between these isochores and DNA replication timing with precise switching at the isochore boundary from late replication in the class II region to early replication in the class III region (Tenzen et al., 1997). This is a possible indication of isochore involvement in the elusive replicon structure of the human genome.

Figure 1. Isochore structure of the MHC class II region. The mean G + C% is plotted per 1 kb for the entire region. An averaged G + C% smoothed per 100 kb interval for each 1 kb is also shown with points plotted at the midpoint of the interval starting at 50 kb. The upper bar represents an idealised genomic banding pattern for 6p21.3 illustrating the possible correlation of 6p21.32 with the low G + C isochore of the classical class II region (for more detailed discussion, see the main text). The approximate positions of a number of landmark genes are indicated. The non-redundant consensus sequence was constructed from overlapping sequence data (see Materials and Methods). Two small gaps (//) are still present in the sequence. The sizes of both these gaps are estimated to be about 10 kb. The classical class II region forms a single, L2 isochore of 40.2 % average G + C content. The transition regions at both boundaries are clearly visible where the average G + C content rapidly increases to around 50 %.

Here we present the sequence of PAC clone 1033B10, which links the TAPASIN locus and the classical MHC class II region. The resulting region, termed the extended class II region, has been analysed in detail for gene and G + C content and has been compared for synteny with the corresponding mouse sequence.

Results and Discussion

PAC 1033B10 was isolated from the RPCI-1 library (Ioannou et al., 1994) using a number of PCR-amplified probes (Materials and Methods). The 175 kb clone was sequenced in its entirety as part of the Sanger Centre's effort to complete the sequence of human chromosome 6 and was found to be moderately gene rich, containing two pockets of nine genes and five pseudogenes in total, interspersed by a region of high-density Alu repeats (see Figure 3, below).

Isochore structure

Figure 1 shows the G + C content of the genomic DNA sequence encompassing the class II region. The non-redundant sequence was constructed by splicing together overlapping sequence data representing more than 1 Mb of DNA from clones derived from several different cell lines (see Materials and Methods). The two small gaps still present in the sequence are currently being closed. Analysis of the base composition shows where the isochore boundary regions are located, disclosing sharp G + C transitions at both the centromeric and telomeric boundaries. As previously reported, the class II/III boundary contains sequences highly homologous with the pseudoautosomal boundary of the short arm of the human sex chromosomes PAB1X and PAB1Y (Fukagawa et al., 1995). These are the interface boundaries between sex-specific and pseudoautosomal regions. No similar features have been found at the centromeric isochore boundary of the MHC.

In summary, three distinctly different regions of G + C content are identifiable in Figure 1, corresponding to an L-family isochore for the classical class II region, an H3 isochore for the telomeric class III region and an H2 isochore for the region centromeric of the classical class II region, termed the extended class II region. The region between the two boundaries corresponding to the classical class II region correlates with the presence of a Giemsa dark band, possibly 6p21.32 (Spring et al., 1985). However, mapping by metaphase FISH suggests that the entire MHC is located within chromosome band 6p21.31 (Senger et al., 1993) but, in accordance with its small size, it is quite possible that the classical class II region is located on an unrecorded dark band within 6p21.31. More fine mapping is required to clarify this.

Figure 2. Dot matrix comparison of the centromeric MHC class II boundary regions of mouse (y-axis) and human (x-axis). The clones used to construct the respective sequences are indicated by filled bars. For the purpose of this plot, the sequences of cICK0721Q and cICB2046 were joined; the small gap is currently being closed. Genes and their transcriptional directions are shown on the left and upper axes. Regions of similarity are identified as a concentration of dots forming diagonal lines. The region of homology extends from COL11A2 to HSET, except for the H2-K locus, which is mouse-specific. The same syntenic order is maintained for all the orthologous genes. Orthologous genes are indicated by black arrows and non-orthologous genes by white arrows. The region centromeric of HSET shows no further similarity on the DNA level or gene content, indicating the presence of a synteny breakpoint. The region (including the HSET gene) leading up to the breakpoint has been enlarged to show more detail.

Gene organisation

The region immediately centromeric to the classical class II region contains a number of genes including RING1, RING2, KE4, RXRB and COL11A2, which are conserved and syntenic in human, rat and mouse. In contrast to human, a second class I gene-containing region, H2-K in mouse and RT1-A in rat, is found centromeric to the class II region between the SACM2L and the RING1 genes (Hanson & Trowsdale, 1991; Walter et al., 1996). This organisation is assumed to be due to a class I gene translocation from the telomeric part of the MHC during evolution of these rodent species. The MHC displays conservation between mouse on chromosome 17 and man on chromosome 6 over its entire length of around 4 Mb (Lai et al., 1994; Yoshino et al., 1997). The synteny breakpoints at both ends of the MHC have been mapped and shown to be located in man between D6S131E and RFP at the telomeric boundary (Yoshino et al., 1997) and between the genes HSET and PHF1 at the centromeric boundary (this study). By comparison of the murine and human sequences, we have been able to show the precise location of the centromeric breakpoint. With the exception of the mouse-specific H2-K locus and the human-specific BING3 locus, Figure 2 shows the entire region up to the breakpoint to be highly conserved and syntenic over 15 genes: COL11A2, RXRB, KE4, RING2, RING1, SACM2L, RPS18, B3GALT4, BING4, KE2, RGL2, TAPASIN, BING1, DAXX and HSET. The synteny stops immediately centromeric of HSET with no further similarity in DNA sequence or gene order.

Figure 3. Detailed feature map of PAC clone 1033B10. The diagram is orientated from centromere (left) to telomere (right). Exons of genes and pseudogenes are shown on the upper-most line with arrows indicating the direction of transcription. All features are shown to scale with repeats drawn as arrows indicating orientation. CpG islands and repeats (separated into Alu and non-Alu) are displayed on different lines for clarity.

Table 1. Breakdown of the variation found between the two clones 1033B10 (AL031228) and D84401 over a 43 kb region

Variation            Number (frequency/kb)
Indel
  Deletion           42 (0.99)
  Insertion          308 (7.26)
  Subtotal           350 (8.25)
Substitution
  Transversion       14 (0.33)
  Transition         22 (0.52)
  Subtotal           36 (0.85)
Overall total        386 (9.10)

The level of overall variation in this region is high (9.10/kb) compared with that of the genome average (1-2/kb). This high level of variation is mainly due to the number of insertions/deletions (7.26/kb).
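The counts and per-kb frequencies in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below is not part of the original analysis; it simply divides each reported count by its reported frequency to recover the comparison length that the table implies (roughly 42 to 43 kb, close to the stated 43 kb) and forms the indel-to-substitution ratio. The dictionary layout and variable names are illustrative assumptions.

```python
# Counts and per-kb frequencies exactly as reported in Table 1.
table1 = {
    "deletion": (42, 0.99),
    "insertion": (308, 7.26),
    "transversion": (14, 0.33),
    "transition": (22, 0.52),
}

indels = table1["deletion"][0] + table1["insertion"][0]
substitutions = table1["transversion"][0] + table1["transition"][0]
total = indels + substitutions

# Region length (kb) implied by each row: count / (events per kb).
implied_kb = {name: count / freq for name, (count, freq) in table1.items()}

print("indels:", indels, "substitutions:", substitutions, "total:", total)
print("implied length (kb):", {k: round(v, 1) for k, v in implied_kb.items()})
print("indel : substitution ratio:", round(indels / substitutions, 1))
```

All four rows point to nearly the same comparison length, which suggests the frequencies were computed against a single aligned length rather than per feature.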

BING4

PAC clone 1033B10 overlaps with and is telomeric to a fully sequenced cosmid F0811 (Z97184) by 5281 bp (Figure 2). The BING4 gene spans both these clones and consists of 15 exons. A pseudogene is present within intron 11 at the centromeric end of PAC 1033B10 and has been tentatively named BING5 (Figure 3) (Herberg et al., 1998a). The Genscan prediction for BING5 shows a single exon gene transcribed in the opposite direction to that of BING4. An EST (accession AA382427) with >99 % homology has been found corresponding to BING5 but it contains a stop codon at the 3′ end of the predicted gene. BING4 also contains a tandem repeat within intron 12. The repeat is some 70 bp long and persists throughout the whole intron. The repeat is not conserved in the corresponding mouse intron. There are numerous nuclear localisation signals situated at both the proximal and distal ends of the protein. Analysis of the protein sequence by PSORT (Nakai & Horton, 1999) and the use of the k-nearest neighbour (k-NN) algorithm (91.3 %) indicates that the protein is most likely to be localised in the nucleus (Nakai & Kanehisa, 1992). There are a number of orthologues to hypothetical proteins in the databases including those for Mus musculus, Saccharomyces cerevisiae, Caenorhabditis elegans and Manduca sexta.

Analysis of the predicted amino acid sequence of BING4 and its orthologues identified a number of conserved regions, as shown in Figure 4. The class I aminoacyl-tRNA synthetase motif (residues 305-318) is conserved in all known orthologues over that region (PROSITE, Bairoch, 1993). A large region of homology also exists between residues 455-530 with a nuclear localization signal embedded within it.

B3GALT4

The human gene for UDP-galactose:beta-N-acetyl-glucosamine beta-1,3-galactosyltransferase-T4 (B3GALT4) is one member of a family of four β3-galactosyltransferases, three of which have been mapped to chromosome positions 1q31, 3q25 and 6p21.3 (MHC). Each gene is a single exon and the four homologues have different expression patterns (Amado et al., 1998).

Figure 4. Comparison of human HsBING4 to orthologues in M. musculus (AF100956), C. elegans (f28d1.1), S. cerevisiae (YER082C) and M. sexta (MNG10). Alignments were made using the PILEUP and BOXSHADE programs. Identical amino acid residues are shown on black and conservative substitutions on grey background. BING4 is a predicted nuclear protein with conserved nuclear localisation signals throughout the gene (positions 12-18, 21-24, 44-50, 76-82 and 490-496). Residues 327 to 337 of BING4 match the adenylate binding site of class I aminoacyl-tRNA synthetase, the amino acid motif being conserved in all orthologues examined.

Interestingly, the ancestral MHC appears to have duplicated into four regions. The first duplication may have occurred just before the time of vertebrate emergence (6p23.1 and 9q33-q34). Further duplications arose later on 19p13.1-p13.3 and 1q21-q25. A growing list of homologous genes can be found within these four duplicated regions including RING3, COL11A2, RXRB, C4 and NOTCH4 (Sugaya et al., 1994; Endo et al., 1997; Kasahara, 1999). Some of the β3-galactosyltransferase family members loosely fit into this ancestral pattern and may be part of the duplicated complex.

SACM2L

The gene SACM2L was recently identified and mapped in humans, rat and mouse (Walter & Günther, 1998). In the rat the gene is ubiquitously expressed. The human gene is made up of 20 exons, which span 21.6 kb, and contains a large intron 17 of 11.5 kb. The gene contains a putative coiled-coil domain (amino acid residues 122-149) with three putative transmembrane regions (Figure 5). A feature of human SACM2L predicted by PSORT is the four di-leucine motifs at positions 473, 533, 591 and 603, which may constitute receptor sorting motifs.

COL11A2

Collagen 11A2 (COL11A2) is a structural component of the cartilage extracellular matrix which plays an important role in skeletal morphogenesis (Burgeson & Hollister, 1979). The gene is separated by only 1.1 kb from RXRB and is in the same transcriptional orientation. Linkage analysis with the marker D6S276 implicates COL11A2 as the causative gene for Stickler syndrome in approximately 50 % of cases (Brunner et al., 1994). COL11A2 is made up of 66 exons and undergoes alternative splicing (Lui et al., 1996b; Tsumaki & Kimura, 1995). A number of deletion experiments have shown that some promoter regions are important for spatial expression in the mouse gene (Tsumaki et al., 1996, 1998). We have examined the gene for homologous sequences in mouse and man using dot matrix analysis and have found high levels of homology within introns 1, 6 and 8 (data not shown). There is also a high degree of homology in the 1.1 kb between RXRB and COL11A2, which has been reported. The homologous sequences in the COL11A2 introns 1 and 6 appear to have discrete exonic structure, suggesting the possibility of further alternative exons within the collagen gene or an alternative, separately transcribed mRNA. The sequence from COL11A2 intron 1 has no database matches. The homologous sequence in intron 6 has been shown to correspond to an expressed sequence, KE5 (not shown in Figures 2 and 3), a separately transcribed gene contained within COL11A2 (Lui et al., 1996a). Intron 8 has weaker mouse-human homology but contains a 100 % match with an IMAGE clone (AA502575) derived from liposarcoma tissue. The clone contains the COL11A2 intronic sequence only and has a poly(A) tract in the same transcriptional orientation as RXRB and COL11A2. It has been noted that some transcripts from the adjacent RXRB gene extend into COL11A2 when undergoing RT-PCR (Vandenberg et al., 1996) and it is possible that this EST could represent extended run-through transcripts from RXRB. This may indicate coordinate expression of the two genes.

Figure 5. cDNA sequence of the human SACM2L gene. The stop codon is indicated by an asterisk, the putative coiled-coil domain is underlined, putative membrane-spanning regions are indicated by a dotted line and the predicted di-leucine repeat motifs (predicted by PSORT) are doubly underlined.

Figure 6. Sequence variation within the extended MHC class II region. The frequency (calculated per 1000 bases) and the positions of insertions/deletions and substitutions are shown between PAC 1033B10 (EMBL AL031228) and cosmid 519 (EMBL D84401). The positions of the RING1, RING2 and HKE4 genes are marked. No variation is found around the HKE4 locus.

HLA-DPA3 and DPB2

A previously unknown pseudogene, which we have termed HLA-DPA3, has been found just centromeric to the pseudogene HLA-DPB2. Phylogenetic analysis using the neighbour-joining method classifies DPA3 as most closely related to DPA2 (data not shown). The DPB2 pseudogene is large in comparison to other class II genes or pseudogenes (Radley et al., 1994) owing to the insertion of HERV fragments. This is in keeping with other pseudogenes which tend to accumulate repeats due to the absence of functional selection.

Sequence variation

A 43 kb section of PAC 1033B10 has been mapped and sequenced previously using a cosmid subcloned from a YAC clone (Kikuti et al., 1997). Contained within the subcloned cosmid clone are three genes, RING1, RING2 and HKE4. These are all transcribed in a telomeric to centromeric direction. No known function has been attributed to either RING2 or HKE4, while RING1 is known to associate with the polycomb group of proteins and to act as a transcriptional repressor (Satijn et al., 1997).

Comparison of the cosmid sequence D84401 (Kikuti et al., 1997) with that of PAC 1033B10 reveals sequence variation (9.1 variations/kb) that is higher than the average of 1-2 variations per kb (Table 1 and Figure 6). This is perhaps not surprising as the sequence variation in the adjacent classical class II region is also much higher (up to 48 variations/kb) than the genome average, although the dramatically different (over 100-fold) ratio of substitutions versus indels is puzzling (Horton et al., 1998). The continuation of high sequence variation from the classical class II region further centromeric suggests that this region may be part of an "extended MHC".

When analysing sequence data from the subcloned YAC (D84401), the RING1 (exons 4 and 5) and RING2 (exons 1, 2, 6 and 8) ORFs are seen to be disrupted by indels (data not shown). It has not been possible to assess whether these indels are due to true variation or sequencing errors. The gene HKE4, in contrast, displays no sequence variation over the 4 kb genomic region (Figure 6).

Extended MHC

Detailed genomic analyses of the MHC and its flanking regions have revealed new features which justify reassessment of the classical division of the MHC into the class I, class II and class III regions. By comparison with mouse, we have precisely determined a synteny breakpoint (between HSET and PHF1) defining the centromeric end of the human MHC. We propose to term the ~300 kb region between HSET and HLA-DP (start of the classical class II region) the "extended class II region". We support the distinction between extended and classical class II regions with functional and physical evidence as follows.

The extended class II region contains the TAPASIN locus, which is part of the pathway for endogenous peptide presentation together with LMP, TAP and CLASS I genes. The TAPASIN protein facilitates peptide loading by forming a bridge between the transporter associated with antigen presentation (TAP) and the class I molecule (Ortmann et al., 1997). Further evidence linking the extended class II region with MHC-associated immune function comes from studies in rodents where the orthologous region between RING1 and COL11A2 has been found to exert a broad effect on peptide loading of both transgenic HLA-B27 and mouse class I alleles. It has been termed the Cim2 locus in both mouse and rat (Simmons et al., 1997a,b).

On the physical side, the most compelling evidence, which also reflects the very different genomic environment in terms of G + C content and repeat composition between the two regions, is the presence of an isochore boundary. According to this boundary, the region defined as extended class II region falls onto a high G + C H2-isochore, while the classical class II region falls onto a low G + C L2-isochore. Such isochore boundaries have been shown to separate early and late replicating chromosomal segments (Tenzen et al., 1997). On the other hand, the finding of higher than average sequence variation in both the extended and the classical class II regions is physical evidence that the selection in both regions may somehow be connected.

Interestingly, the situation at the telomeric end of the MHC is quite similar, giving rise to an "extended class I region". The synteny to mouse clearly continues beyond HLA-F at the end of the classical class I region (Yoshino et al., 1997) and the region has been found to be in extreme linkage disequilibrium with the MHC (Malfroy et al., 1997). In addition, several MHC-similar genes (HFE and several butyrophilins) have been found in the extended class I region (Ruddy et al., 1997; Tazi-Ahnini et al., 1997).

Materials and Methods

Isolation of PAC clones

PCR primers and conditions. The standard PCR conditions used were 94 °C for four minutes, then 30 cycles at 94 °C for one minute, 55 °C for one minute and 72 °C for one minute, followed by a final extension step at 72 °C for five minutes. PCR was carried out in 50 mM Tris-HCl (pH 9.0), 1.5 mM MgCl2, 0.1 % (v/v) Triton X-100, 0.2 mM dNTPs, 1 µM concentrations of each primer and 0.5 unit of Taq DNA polymerase in a 50 µl reaction volume. Primer pairs were used to amplify total genomic DNA; the products were used as hybridisation probes for the identification of gridded PAC clones (library RPCI-1):
B2046l agagatattcagctactggg, B2046r aagcatcattctgaattacg.
RXRBl gtttgccaagctgctgctacg, RXRBr catctccatgaggaaggtgtc.
HSETl agaacttgcgtgcttgtgtcc, HSETr tccgttcttcctgcaattcc.
RING1l gtgccctatctgcctggacatgctg, RING1r ctcaatgctggagctcaatg.
COL11A2l acctggaatcccacaacatc, COL11A2r ctaacgggtaacaggctcca.
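As a quick sanity check on the primer pairs listed above, their length and base composition can be tabulated. The sketch below is an illustration only: the function name `primer_stats` is hypothetical, and the melting temperature uses the rough Wallace rule (2 °C per A/T, 4 °C per G/C), which is an assumption introduced here and not a calculation described in the text. The rule is intended for short oligonucleotides, so the values are at best a coarse guide to the 55 °C annealing step.

```python
# Primer sequences as listed in the text (l = left, r = right).
primers = {
    "B2046l": "agagatattcagctactggg",
    "B2046r": "aagcatcattctgaattacg",
    "RXRBl": "gtttgccaagctgctgctacg",
    "RXRBr": "catctccatgaggaaggtgtc",
    "HSETl": "agaacttgcgtgcttgtgtcc",
    "HSETr": "tccgttcttcctgcaattcc",
    "RING1l": "gtgccctatctgcctggacatgctg",
    "RING1r": "ctcaatgctggagctcaatg",
    "COL11A2l": "acctggaatcccacaacatc",
    "COL11A2r": "ctaacgggtaacaggctcca",
}


def primer_stats(seq):
    """Return length, G + C percentage and a Wallace-rule Tm estimate (hypothetical helper)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return {
        "length": len(seq),
        "gc_percent": 100.0 * gc / len(seq),
        "wallace_tm": 2 * at + 4 * gc,  # rough estimate in degrees C
    }


for name, seq in primers.items():
    stats = primer_stats(seq)
    print(f"{name:9s} {stats['length']:2d} nt  "
          f"{stats['gc_percent']:5.1f} % G+C  Tm ~{stats['wallace_tm']} C")
```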

DNA sequencing and analysis

PAC 1033B10 was isolated from the RPCI-1 PAC library (HSF7 cell line: Ioannou et al., 1994). The entire clone was randomly subcloned into M13mp18 and pUC18 (Bankier et al., 1987). Recombinant clones (80 % M13s, 20 % pUCs) were picked, amplified and purified in 96 well microtitre plates (Beck & Alderton, 1993; Mardis, 1994). The DNA sequence was determined using the enzymatic dideoxy chain termination sequencing chemistry (Sanger et al., 1977) and automated ABI 373/377 DNA sequencers (Applied Biosystems, Foster City, USA). The generated reads were quality clipped, screened for cloning and sequencing vectors, assembled into contigs and analysed as described (The Sanger Centre and GSC Washington University, 1998). The finished sequence was analysed using the Sanger Centre analysis strategy (http://www.sanger.ac.uk/Teams/Informatics/Humana/). Dot matrix comparisons were carried out using the DOTTER program (Sonnhammer & Durbin, 1995), which is available at http://www.sanger.ac.uk/software/Dotter. Motifs on the amino acid level were identified using the PSORT (Nakai & Horton, 1999) program and the PROSITE (Bairoch, 1993) database. The sequence reported here has been deposited in the EMBL database under the accession number AL031228.
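The idea behind a dot matrix comparison can be illustrated with a very small sketch. This is not the DOTTER algorithm, which scores sliding windows with a substitution matrix; the version below is a simplification that places a dot wherever two sequences share an exact word of a fixed length, so that conserved, collinear segments show up as diagonal runs. The function names and the word length of 4 are assumptions chosen for the example.

```python
def dot_matrix(seq_x, seq_y, word=4):
    """Return sorted (x, y) start positions where seq_x and seq_y share an exact word."""
    seq_x, seq_y = seq_x.upper(), seq_y.upper()
    # Index every word of seq_x by its start position.
    index = {}
    for x in range(len(seq_x) - word + 1):
        index.setdefault(seq_x[x:x + word], []).append(x)
    # Look up every word of seq_y in that index.
    dots = []
    for y in range(len(seq_y) - word + 1):
        for x in index.get(seq_y[y:y + word], []):
            dots.append((x, y))
    return sorted(dots)


def render(dots, width, height):
    """Draw the dots as a text grid, x across and y down."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for x, y in dots:
        grid[y][x] = "*"
    return "\n".join("".join(row) for row in grid)


human = "ACGTACGGTT"   # toy sequences, not real data
mouse = "TTACGTACGG"
dots = dot_matrix(human, mouse, word=4)
print(render(dots, len(human) - 3, len(mouse) - 3))
```

In the toy example the shared segment appears as a diagonal offset by two positions, and the short internal repeat adds one off-diagonal dot. A synteny breakpoint such as the one centromeric of HSET would appear as the point where the diagonal stops.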

Variation analysis

The sequence variation between PAC 1033B10 (cell line HSF7; EMBL accession AL031228) and cosmid 519, derived from YAC clone 42 (B cell line CGM1; EMBL accession number D84401; Kikuti et al., 1997), was determined using the methods previously described (Horton et al., 1998). The systematic analysis of all three types of variation, substitutions (S), insertions (I) and deletions (D), was carried out with the Cross_Match program (P. Green, personal communication). The program compares two sequences in FASTA format and generates an output file listing any differences. The output file was further sorted and annotated with various PERL scripts (R.H., unpublished but available on request). During the analysis 1033B10 was regarded as the query sequence and D84401 as the subject sequence. Individual variation events were defined as follows: (1) substitution as a single base variation in one sequence with respect to the other; (2) insertion as a single event in which one or more bases were inserted in the query sequence with respect to the subject sequence; (3) deletion as a single event in which one or more bases were deleted in the query sequence with respect to the subject sequence.
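The three event definitions above can be expressed compactly in code. The following sketch is not the Cross_Match output format or the unpublished PERL scripts; it assumes, for illustration, that the query and subject have already been aligned into two equal-length strings with `-` marking gaps. Each run of consecutive gap columns counts as one insertion or deletion event, and substitutions are split into transitions and transversions as in Table 1. The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_variation(query, subject):
    """Count variation events between two aligned, equal-length sequences ('-' = gap)."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    counts = {"transition": 0, "transversion": 0, "insertion": 0, "deletion": 0}
    open_event = None  # indel run currently being extended, if any
    for q, s in zip(query.upper(), subject.upper()):
        if s == "-" and q != "-":
            kind = "insertion"   # bases present in the query only
        elif q == "-" and s != "-":
            kind = "deletion"    # bases missing from the query
        else:
            kind = None
            if q != s and q in "ACGT" and s in "ACGT":
                same_class = {q, s} <= PURINES or {q, s} <= PYRIMIDINES
                counts["transition" if same_class else "transversion"] += 1
        if kind is not None and kind != open_event:
            counts[kind] += 1    # a new gap run starts: one event per run
        open_event = kind
    return counts


def per_kb(counts, compared_bases):
    """Convert event counts to events per 1000 compared bases."""
    return {k: 1000.0 * v / compared_bases for k, v in counts.items()}


# Toy alignment: one transition, one 2-base insertion, one deletion, one transversion.
print(classify_variation("ACGTTTGA-CA", "ACAT--GATCC"))
```

Counting a multi-base gap as a single event follows definitions (2) and (3); counting every gapped column instead would inflate the indel frequencies.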


Phylogenetic analysis

The phylogenetic analysis of DPA3 was carried out by two different methods using the CLUSTALW (Thompson et al., 1997) and PHYLO_WIN (Galtier et al., 1996) packages. Due to HLA-DPA3 being a gene fragment, the sequence alignment was only 43 amino acid residues long. Based on distance estimates derived from the Dayhoff PAM substitution matrix, the neighbour-joining (Saitou & Nei, 1987) and maximum parsimony (Fitch, 1971) methods were used for tree construction. Both methods produced essentially identical trees confirmed by 100 bootstrap replicates (data not shown).
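For readers unfamiliar with neighbour joining, a minimal sketch of the algorithm is given below. It is not the PHYLO_WIN implementation and it does not compute Dayhoff PAM distances; it assumes a precomputed pairwise distance matrix and returns the tree as a list of edges. The function name, the `nodeN` labels for internal nodes and the five-taxon toy matrix are all illustrative assumptions.

```python
def neighbour_joining(names, dist):
    """Build an unrooted tree; returns (node, neighbour, branch_length) edges.

    `dist` is a dict of dicts holding symmetric pairwise distances.
    """
    nodes = list(names)
    d = {a: dict(dist[a]) for a in nodes}
    edges, counter = [], 0
    while len(nodes) > 2:
        n = len(nodes)
        totals = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Choose the pair that minimises Q = (n - 2) * d(a, b) - r(a) - r(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - totals[a] - totals[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        counter += 1
        u = f"node{counter}"
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (totals[a] - totals[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges += [(a, u, len_a), (b, u, len_b)]
        # Distances from the new node to every remaining node.
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


# Toy additive distance matrix for five taxa (not data from this study).
names = ["a", "b", "c", "d", "e"]
rows = [[0, 5, 9, 9, 8], [5, 0, 10, 10, 9], [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3], [8, 9, 7, 3, 0]]
dist = {x: {y: rows[i][j] for j, y in enumerate(names)} for i, x in enumerate(names)}
for edge in neighbour_joining(names, dist):
    print(edge)
```

Bootstrap support, as reported in the text, would be obtained by resampling alignment columns with replacement, recomputing the distances and repeating the tree construction for each replicate.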

G + C content analysis

A non-redundant, semi-contiguous consensus sequence across the entire class II region was constructed by splicing together several clones available in the chromosome 6 database (6ace) at the Sanger Centre. The clone names, accession numbers and cell lines given below are in order from centromere to telomere: cICK0721Q (AL021366, RPETO1); cICB2046 (Z97183, RPETO1); cICF0811 (Z97184, RPETO1); dJ1033B10 (AL031228, HSF7); LC11 (K03014, lung carcinoma tissue); A1 (Z95437, RPETO1); O19AO14 (Z99705, Clontech 6550-1); O14 (Z84497, MANN); O27 (Z96104, MANN); DV19 (Z84490, RPETO1); F1121 (Z80899, RPETO1); E1448 (Z80898, RPETO1); dJ93N13 (Z84489, HSF7); dJ265J14 (Z84477, HSF7); dJ172K2 (Z84814, HSF7); 1077I5 (AL034394, HSF7); C47 (AF044083, ICE5); A5 (U89335, MANN); W5A (U89336, MANN); W6A (U89337, MANN). The consensus sequence can be obtained from http://www.sanger.ac.uk/HGP/Chr6/MHC.shtml. The positions of two small gaps (//) in the consensus are indicated in Figure 1.
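The two curves described for Figure 1 (mean G + C% per 1 kb, and a 100 kb average plotted at the interval midpoint) can be reproduced from a consensus sequence along the following lines. This is a sketch under assumptions: the function names are hypothetical, ambiguous bases are simply ignored, and the smoothing is read as a sliding average over 100 consecutive 1 kb values advanced 1 kb at a time, which is one plausible interpretation of the figure legend.

```python
def gc_percent_windows(seq, window=1000):
    """Return G + C percentage for each consecutive, non-overlapping window."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        counted = sum(chunk.count(base) for base in "ACGT")  # skip N and gaps
        gc = chunk.count("G") + chunk.count("C")
        values.append(100.0 * gc / counted if counted else float("nan"))
    return values


def smooth(values, span=100):
    """Slide a `span`-window average along `values`.

    Returns (midpoint_index, mean) pairs; with 1 kb windows and span=100 the
    first point sits at 50 kb, as described for Figure 1.
    """
    points = []
    for i in range(0, len(values) - span + 1):
        block = values[i:i + span]
        points.append((i + span // 2, sum(block) / span))
    return points


# Toy example with tiny windows: a G + C-rich half followed by an A + T-rich half.
toy = "GGCC" * 5 + "ATAT" * 5
per_window = gc_percent_windows(toy, window=4)
print(per_window)
print(smooth(per_window, span=2))
```

On real data an isochore boundary would show up as the smoothed curve moving between plateaus, for example from around 40 % in the classical class II region to around 50 % in the flanking regions.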

Acknowledgements

Human DNA sequencing at the Sanger Centre is funded by the Wellcome Trust. We thank Denise Sheer for comments and all members of the chromosome 6 project group (http://www.sanger.ac.uk/HGP/Chr6/). J.T. is supported by a Wellcome Programme grant, R.S. is funded by the Leukaemia Research Fund and L.R. is supported by the National Institutes of Health (USA).

References

Amado, M., Almeida, R., Carneiro, F., Levery, S. B., Holmes, E. H., Nomoto, M., Hollingsworth, M. A., Hassan, H., Schwientek, T., Nielsen, P. A., Bennett, E. P. & Clausen, H. (1998). A family of human beta3-galactosyltransferases. Characterization of four members of a UDP-galactose:beta-N-acetyl-glucosamine/beta-N-acetyl-galactosamine beta-1,3-galactosyltransferase family. J. Biol. Chem. 273, 12770-12778.

Bairoch, A. (1993). The PROSITE dictionary of sites and patterns in proteins, its current status. Nucl. Acids Res. 21, 3097-3103.

Bankier, A. T., Weston, K. M. & Barrell, B. G. (1987). Random cloning and sequencing by the M13/dideoxynucleotide chain termination method. Methods Enzymol. 155, 51-93.

Beck, S. & Alderton, R. P. (1993). A strategy for the amplification, purification, and selection of M13 templates for large-scale DNA sequencing. Anal. Biochem. 212, 498-505.

Beck, S., Abdulla, S., Alderton, R. P., Glynne, R. J., Gut, I. G., Hosking, L. K., Jackson, A., Kelly, A., Newell, W. R., Sanseau, P., Radley, E., Thorpe, K. L. & Trowsdale, J. (1996). Evolutionary dynamics of non-coding sequences within the class II region of the human MHC. J. Mol. Biol. 255, 1-13.

Bernardi, G. (1993). The isochore organization of the human genome and its evolutionary history - a review. Gene, 135, 57-66.

Brunner, H. G., van Beersum, S. E., Warman, M. L., Olsen, B. R., Ropers, H. H. & Mariman, E. C. (1994). A Stickler syndrome gene is linked to chromosome 6 near the COL11A2 gene. Hum. Mol. Genet. 3, 1561-1564.

Burgeson, R. E. & Hollister, D. W. (1979). Collagen heterogeneity in human cartilage: identification of several new collagen chains. Biochem. Biophys. Res. Commun. 87, 1124-1131.

Chang, H. Y., Nishitoh, H., Yang, X., Ichijo, H. & Baltimore, D. (1998). Activation of apoptosis signal-regulating kinase 1 (ASK1) by the adapter protein Daxx. Science, 281, 1860-1863.

Craig, J. M. & Bickmore, W. A. (1993). Chromosome bands - flavours to savour. BioEssays, 15, 349-354.

Endo, T., Imanishi, T., Gojobori, T. & Inoko, H. (1997). Evolutionary significance of intra-genome duplications on human chromosomes. Gene, 205, 19-27.

Fitch, W. M. (1971). Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20, 406-416.

Fukagawa, T., Sugaya, K., Matsumoto, K., Okumura, K., Ando, A., Inoko, H. & Ikemura, T. (1995). A boundary of long-range G + C% mosaic domains in the human MHC locus: pseudoautosomal boundary-like sequence exists near the boundary. Genomics, 25, 184-191.

Fukagawa, T., Nakamura, Y., Okumura, K., Nogami, M., Ando, A., Inoko, H., Saitou, N. & Ikemura, T. (1996). Human pseudoautosomal boundary-like sequences: expression and involvement in evolutionary formation of the present-day pseudoautosomal boundary of human sex chromosomes. Hum. Mol. Genet. 5, 23-32.

Galtier, N., Gouy, M. & Gautier, C. (1996). SEAVIEW and PHYLO_WIN: two graphic tools for sequence alignment and molecular phylogeny. Comput. Appl. Biosci. 12, 543-548.

Hanson, I. M. & Trowsdale, J. (1991). Colinearity of novel genes in the class II regions of the MHC in mouse and human. Immunogenet. 34, 5-11.

Herberg, J. A., Beck, S. & Trowsdale, J. (1998a). TAPASIN, DAXX, RGL2, HKE2 and four new genes (BING 1, 3 to 5) form a dense cluster at the centromeric end of the MHC. J. Mol. Biol. 277, 839-857.

Herberg, J. A., Sgouros, J., Jones, T., Copeman, J., Humphray, S. J., Sheer, D., Cresswell, P., Beck, S. & Trowsdale, J. (1998b). Genomic analysis of the Tapasin gene, located close to the TAP loci in the MHC. Eur. J. Immunol. 28, 459-467.

Horton, R., Niblett, D., Milne, S., Palmer, S., Tubby, B., Trowsdale, J. & Beck, S. (1998). Large-scale sequence comparisons reveal unusually high levels of variation in the HLA-DQB1 locus in the class II region of the human MHC. J. Mol. Biol. 282, 71-97.

Ioannou, P. A., Amemiya, C. T., Garnes, J., Kroisel, P. M., Shizuya, H., Chen, C., Batzer, M. A. & de Jong, P. J. (1994). A new bacteriophage P1-derived vector for the propagation of large human DNA fragments. Nature Genet. 6, 84-89.

Kasahara, M. (1999). The chromosomal duplication model of the major histocompatibility complex. Immunol. Rev. 167, 17-32.

Kikuti, Y. Y., Tamiya, G., Ando, A., Chen, L., Kimura, M., Ferreira, E., Tsuji, K., Trowsdale, J. & Inoko, H. (1997). Physical mapping 220 kb centromeric of the human MHC and DNA sequence analysis of the 43-kb segment including the RING1, HKE6, and HKE4 genes. Genomics, 42, 422-435.

Lai, F., Stubbs, L., Lehrach, H., Huang, Y., Yeom, Y. & Artzt, K. (1994). Genomic organization and expressed sequences of the mouse extended H-2K region. Genomics, 23, 338-343.

Lui, V. C., Ng, L. J., Sat, E. W. & Cheah, K. S. (1996a). The human alpha 2(XI) collagen gene (COL11A2): completion of coding information, identification of the promoter sequence, and precise localization within the major histocompatibility complex reveal overlap with the KE5 gene. Genomics, 32, 401-412.

Lui, V. C. H., Ng, L. J., Sat, E. W. Y., Nicholls, J. & Cheah, K. S. E. (1996b). Extensive alternative splicing within the amino-propeptide coding domain of alpha2(XI) procollagen mRNAs. Expression of transcripts encoding truncated pro-alpha chains. J. Biol. Chem. 271, 16945-16951.

Malfroy, L., Roth, M. P., Carrington, M., Borot, N., Volz, A., Ziegler, A. & Coppin, H. (1997). Heterogeneity in rates of recombination in the 6 Mb region telomeric to the human major histocompatibility complex. Genomics, 43, 226-231.

Mardis, E. R. (1994). High-throughput detergent extraction of M13 subclones for fluorescent DNA sequencing. Nucl. Acids Res. 22, 2173-2175.

Nakai, K. & Horton, P. (1999). PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem. Sci. 24, 34-36.

Nakai, K. & Kanehisa, M. (1992). A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14, 897-911.

Ortmann, B., Copeman, J., Lehner, P. J., Sadasivan, B., Herberg, J. A., Grandea, A. G., Riddell, S. R., Tampe, R., Spies, T., Trowsdale, J. & Cresswell, P. (1997). A critical role for tapasin in the assembly and function of multimeric MHC class I-TAP complexes. Science, 277, 1306-1309.

Peterson, S. N., Trabalzini, L., Brtva, T. R., Fischer, T., Altschuler, D. L., Martelli, P., Lapetina, E. G., Der, C. J. & White, G. C., II (1996). Identification of a novel RalGDS-related protein as a candidate effector for Ras and Rap1. J. Biol. Chem. 271, 29903-29908.

Radley, E., Alderton, R. P., Kelly, A., Trowsdale, J. & Beck, S. (1994). Genomic organization of HLA-DMA and HLA-DMB. Comparison of the gene organization of all six class II families in the human major histocompatibility complex. J. Biol. Chem. 269, 18834-18838.

Ruddy, D. A., Kronmal, G. S., Lee, V. K., Mintier, G. A., Quintana, L., Domingo, R., Jr, Meyer, N. C., Irrinki, A., McClelland, E. E., Fullan, A., Mapa, F. A., Moore, T., Thomas, W., Loeb, D. B. & Harmon, C., et al. (1997). A 1.1-Mb transcript map of the hereditary hemochromatosis locus. Genome Res. 7, 441-456.

Saitou, N. & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406-425.

Sanger, F., Nicklen, S. & Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proc. Natl Acad. Sci. USA, 74, 5463-5467.

Satijn, D. P., Gunster, M. J., van der Vlag, J., Hamer, K. M., Schul, W., Alkema, M. J., Saurin, A. J., Freemont, P. S., van Driel, R. & Otte, A. P. (1997). RING1 is associated with the polycomb group protein complex and acts as a transcriptional repressor. Mol. Cell Biol. 17, 4105-4113.

Senger, G., Ragoussis, J., Trowsdale, J. & Sheer, D. (1993). Fine mapping of the human MHC class II region within chromosome band 6p21 and evaluation of probe ordering using interphase fluorescence in situ hybridization. Cytogenet. Cell Genet. 64, 49-53.

Simmons, W. A., Roopenian, D. C., Summerfield, S. G., Jones, R. C. & Galocha, B. (1997a). A new MHC locus that influences class I peptide presentation. Immunity, 7, 641-651.

Simmons, W. A., Roopenian, D. C., Summerfield, S. G., Jones, R. C., Galocha, B., Christianson, G. J., Maika, S. D., Zhou, M., Gaskell, S. J., Bordoli, R. S., Ploegh, H. L., Slaughter, C. A., Lindahl, K. F., Hammer, R. E. & Taurog, J. D. (1997b). A new MHC locus that influences class I peptide presentation. Immunity, 7, 641-651.

Sonnhammer, E. L. & Durbin, R. (1995). A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene, 167, GC1-10.

Spring, B., Fonatsch, C., Müller, C., Pawelec, G., Kömpf, J., Wernet, P. & Ziegler, A. (1985). Refinement of HLA gene mapping with induced B-cell line mutants. Immunogenet. 21, 277-291.

Sugaya, K., Fukagawa, T., Matsumoto, K., Mita, K., Takahashi, E., Ando, A., Inoko, H. & Ikemura, T. (1994). Three genes in the human MHC class III region near the junction with the class II: gene for receptor of advanced glycosylation end products, PBX2 homeobox genes and a Notch homolog, human counterpart of mouse mammary tumor gene int-3. Genomics, 23, 408-419.

Tazi-Ahnini, R., Henry, J., Offer, C., Bouissou-Bouchouata, C., Mather, I. H. & Pontarotti, P. (1997). Cloning localisation and structure of new members of the butyrophilin gene family in the juxta-telomeric region of the major histocompatibility complex. Immunogenet. 47, 55-63.

Tenzen, T., Yamagata, T., Fukagawa, T., Sugaya, K., Ando, A., Inoko, H., Gojobori, T., Fujiyama, A., Okumura, K. & Ikemura, T. (1997). Precise switching of DNA replication timing in the GC content transition area in the human major histocompatibility complex. Mol. Cell Biol. 17, 4043-4050.

The Sanger Centre GSC Washington University (1998). Toward a complete human genome sequence. Genome Res. 8, 1097-1108.

Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F. & Higgins, D. H. (1997). The CLUSTAL X windows interface: flexible strategies for multiple alignment aided by quality analysis tools. Nucl. Acids Res. 25, 4876-4882.

Tsumaki, N. & Kimura, T. (1995). Differential expression of an acidic domain in the amino-terminal propeptide of mouse pro-alpha 2(XI) collagen by complex alternative splicing. J. Biol. Chem. 270, 2372-2378.

Tsumaki, N., Kimura, T., Matsui, Y., Nakata, K. & Ochi, T. (1996). Separable cis-regulatory elements that contribute to tissue- and site-specific alpha 2(XI) collagen gene expression in the embryonic mouse cartilage. J. Cell Biol. 134, 1573-1582.

Tsumaki, N., Kimura, T., Tanaka, K., Kimura, J. H., Ochi, T. & Yamada, Y. (1998). Modular arrangement of cartilage- and neural tissue-specific cis-elements in the mouse alpha2(XI) collagen promoter. J. Biol. Chem. 273, 22861-22864.

Vandenberg, P., Vuoristo, M. M., Ala-Kokko, L. & Prockop, D. J. (1996). The mouse col11a2 gene. Some transcripts from the adjacent rxr-beta gene extend into the col11a2 gene. Matrix Biol. 15, 359-367.

Walter, L. & Günther, E. (1998). Identification of a novel highly conserved gene in the centromeric part of the MHC. Genomics, 52, 298-304.

Walter, L., Fischer, K. & Günther, E. (1996). Physical mapping of the Ring1, Ring2, Ke6, Ke4, Rxrb, Col11a2, and RT1.Hb genes in the rat major histocompatibility complex. Immunogenet. 44, 218-221.

Yoshino, M., Xiao, H., Jones, E. P., Kumanovics, A., Amadou, C. & Fischer Lindahl, K. (1997). Genomic evolution of the distal Mhc class I region on mouse Chr 17. Hereditas, 127, 141-148.

Note added in proof

The two small gaps indicated in Figure 1 have now been closed and the complete sequence of the entire MHC is now available from http://www.sanger.ac.uk/HGP/Chr6/.

Edited by J. Karn

(Received 16 April 1999; received in revised form 7 July 1999; accepted 7 July 1999)
