An integrated map of genetic variation from 1,092 human genomes

The 1000 Genomes Project Consortium (http://www.1000genomes.org)

Nature 491, 56–65 (01 November 2012)

Primary goal

• To create a complete and detailed catalogue of human genetic variation, which in turn can be used for association studies relating genetic variation to disease.

Primary goal

• To discover >95% of the variants (e.g. SNPs, CNVs, indels) with minor allele frequencies as low as 1% across the genome and 0.1–0.5% in gene regions.

• To estimate the population frequencies, haplotype backgrounds and linkage disequilibrium patterns of variant alleles.

Secondary goals

• Support of better SNP and probe selection for genotyping platforms in future studies.

• Improvement of the human reference sequence.

• The completed database will be a useful tool for studying regions under selection and variation in multiple populations, and for understanding the underlying processes of mutation and recombination.

Project design

• To sequence each sample to about 4X coverage; at this depth sequencing cannot provide the complete genotype of each sample, but should allow the detection of most variants with frequencies as low as 1%.

• Combining the data from 2500 samples should allow highly accurate estimation (imputation) of the variants and genotypes for each sample that were not seen directly by the light sequencing.
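The trade-off behind the 4X design can be illustrated with a back-of-the-envelope calculation. The sketch below is not the project's calling pipeline; it assumes Poisson-distributed read depth, reads drawn from either chromosome copy with equal probability, no sequencing errors, and a hypothetical rule that a variant counts as "detected" once at least two reads across all samples carry it. The function names are illustrative.

```python
import math


def prob_both_alleles_seen(mean_depth: float) -> float:
    """Chance that both alleles of a heterozygous site are each read at least once.

    Assumption: total depth is Poisson(mean_depth) and each read comes from
    either allele with probability 1/2, so the per-allele read counts are
    independent Poisson(mean_depth / 2) variables.
    """
    p_one_allele_seen = 1.0 - math.exp(-mean_depth / 2.0)
    return p_one_allele_seen ** 2


def prob_variant_detected(mean_depth: float, allele_freq: float,
                          n_samples: int, min_reads: int = 2) -> float:
    """Chance that at least `min_reads` reads, pooled over all samples, carry the variant.

    Assumption: the number of variant chromosome copies is fixed at its
    expectation, 2 * n_samples * allele_freq, and each copy is covered by
    Poisson(mean_depth / 2) reads.
    """
    variant_copies = 2 * n_samples * allele_freq
    lam = variant_copies * mean_depth / 2.0  # expected variant-supporting reads
    p_too_few = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                    for k in range(min_reads))
    return 1.0 - p_too_few


single_sample = prob_both_alleles_seen(4.0)
pooled = prob_variant_detected(4.0, allele_freq=0.01, n_samples=1000)
print(f"heterozygote fully observed at 4X: {single_sample:.2f}")
print(f"1% variant seen in 1000 samples at 4X: {pooled:.4f}")
```

Under these assumptions roughly a quarter of heterozygous sites in a single 4X sample would be missing one allele, while a 1% variant is almost certain to be seen somewhere in a pooled set of 1000 samples. This is the gap that imputation across samples is meant to close.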

Project design / Stages

• The 1000 Genomes full project has been divided into phases to represent the dispersed nature of the sample collection.

Project design / Stages / Pilot

Three pilot studies provided data to inform the design of the full-scale project:

• Pilot 1: low coverage pilot (2–4X, WGS of 180 samples)

• Pilot 2: high coverage pilot (20–60X, WGS of 2 mother-father-adult child trios)

• Pilot 3: the exon targeted pilot (50X, 1000 gene regions in 900 samples)

The pilot was completed in 2009.

Project design / Stages / Phase 1

Phase 1 represents the low coverage and exome data analysis available for the first 1092 samples.

Done! Results published in Nature 491, 56–65 (01 November 2012).

But that's not all!

Project design / Stages / Phase 2

• Phase 2 represents an expanded set of samples, around 1700 in number (the sequence data has been finalized).

• This data is being used for method development, both to improve on existing methods from phase 1 and to develop new methods that handle features like multi-allelic variant sites and true integration of complex variation and structural variants.

Project design / Stages / Phase 3

• Phase 3 represents 2500 samples, including new African samples and samples from South Asia. The new methods developed in phase 2 will be applied to this data set and a final catalogue of variation will be released.

Amounts of Data

• Full genomic sequence of 1,700 individuals is now available (200 TB of genomic data).

Amounts of Data

• More than 2 human genomes sequenced every 24 hours

• 60-fold more sequence data than has been published in DNA databases over the past 25 years.

Samples

• 14 populations

• 4 ancestry-based groups

Samples / Ancestry-based groups

• Europe (IBS (Iberian Populations in Spain), GBR (British from England and Scotland), CEU (Utah residents with ancestry from northern and western Europe), FIN (Finnish in Finland), TSI (Toscani in Italia));

• East Asia (JPT (Japanese in Tokyo, Japan), CHB (Han Chinese in Beijing, China), CHS (Han Chinese South));

• Africa (ASW (African Ancestry in SW USA), YRI (Yoruba in Ibadan, Nigeria), LWK (Luhya in Webuye, Kenya));

• Americas (MXL (Mexican Ancestry in Los Angeles, CA, USA), PUR (Puerto Ricans in Puerto Rico), CLM (Colombians in Medellin, Colombia)).

Data

• Combination of low-coverage (2–6x) whole-genome sequence data, targeted deep (50–100x) exome sequence data and dense SNP genotype data.

• The approach was augmented with statistical methods for selecting higher-quality variant calls from candidates obtained using multiple algorithms, and for integrating SNPs, indels and larger structural variants within a single framework.

• A key goal of the 1000 Genomes Project was to identify more than 95% of SNPs at 1% frequency in a broad set of populations.

• The current resource includes ~50%, 98% and 99.7% of the SNPs with frequencies of ~0.1%, 1.0% and 5.0%, respectively, in ~2,500 UK-sampled genomes.

Genetic variation

• 3.60 million single nucleotide polymorphisms (SNPs), of which 24,000 were in GENCODE (coding) regions

• 344,000 small indels (440 coding), which gives a ratio of about 1:10 with SNPs in human genomes and demonstrates the strong selection against indels in coding regions.

• 717 large deletions (the most confident category of SVs that we currently can detect), of which 39 overlapped GENCODE regions.
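The indel-to-SNP comparison is simple arithmetic on the counts quoted above, and working it through makes the selection argument concrete. The sketch below uses only those four numbers; the variable names are illustrative.

```python
# Counts quoted on the slide.
snps_total = 3_600_000
snps_coding = 24_000
indels_total = 344_000
indels_coding = 440

# Indels per SNP, genome-wide and within coding regions.
genome_ratio = indels_total / snps_total
coding_ratio = indels_coding / snps_coding

# How much rarer indels are (relative to SNPs) in coding sequence.
depletion = genome_ratio / coding_ratio

print(f"genome-wide: 1 indel per {1 / genome_ratio:.1f} SNPs")
print(f"coding:      1 indel per {1 / coding_ratio:.1f} SNPs")
print(f"relative depletion of coding indels: {depletion:.1f}-fold")
```

Genome-wide there is about one indel per 10 SNPs, but in coding regions only about one per 55 SNPs, a roughly five-fold depletion. That gap is what the slide attributes to strong selection against coding indels.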

• Most common variants (94% of variants with frequency ≥5%) were known before the current phase of the project and had their haplotype structure mapped through earlier projects.

• Only 62% of variants in the range 0.5–5% and 13% of variants with frequencies of <0.5% had been described previously.

• Variants present at 10% and above across the entire sample are almost all found in all of the populations studied.

• By contrast, 17% of low-frequency variants in the range 0.5–5% were observed in a single ancestry group, and 53% of rare variants at 0.5% were observed in a single population.

• The derived allele frequency distribution shows substantial divergence between populations below a frequency of 40%, such that individuals from populations with substantial African ancestry carry up to three times as many low-frequency variants (0.5–5%) as those of European or East Asian origin, reflecting ancestral bottlenecks in non-African populations.

• However, individuals from all populations show an enrichment of rare variants (<0.5% frequency), reflecting recent explosive increases in population size and the effects of geographic differentiation.

• Variants present twice across the entire sample (referred to as f2 variants), typically the most recent of informative mutations, are found within the same population in 53% of cases.

• However, between-population sharing identifies recent historical connections.

• At the most highly conserved coding sites, 85% of non-synonymous variants and more than 90% of stop-gain and splice-disrupting variants are below 0.5% in frequency, compared with 65% of synonymous variants.

• Individuals typically carry more than 2,500 non-synonymous variants at conserved positions, of which 20–40 are likely to be damaging (2–5 of which are rare), and 150 loss-of-function variants (splice-site variants, stop gains, frameshift indels), of which 10–20 are rare.

• Rare (<0.5%) variants per individual: 130–400 non-synonymous variants, 10–20 LOF variants, 2–5 damaging mutations, and 1–2 variants identified previously from cancer genome sequencing.

Bonus Track

• The non-synonymous to synonymous ratio among rare (<0.5%) variants is typically in the range 1–2, and among common variants in the range 0.5–1.5, suggesting that 25–50% of rare non-synonymous variants are deleterious.

• However, the segregating rare load among gene groups in KEGG pathways varies substantially.
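The slide does not spell out how the 25–50% figure follows from the two ratios. One plausible reading, offered here as an assumption and not as the consortium's stated method, is that common variants show the non-synonymous to synonymous ratio that selection tolerates, so any excess of non-synonymous variants among rare variants reflects deleterious alleles that have not yet been removed. The function name and the example ratio pairs below are illustrative values picked from within the quoted ranges.

```python
def excess_deleterious_fraction(ns_s_rare: float, ns_s_common: float) -> float:
    """Fraction of rare non-synonymous variants in excess of the common-variant baseline.

    Assumption (not from the source): the common-variant ratio approximates the
    ratio expected for tolerated variants, so the excess among rare variants is
    1 - ns_s_common / ns_s_rare, floored at zero.
    """
    if ns_s_rare <= 0 or ns_s_common < 0:
        raise ValueError("ratios must be positive (rare) and non-negative (common)")
    return max(0.0, 1.0 - ns_s_common / ns_s_rare)


# Illustrative pairs taken from within the quoted ranges (rare 1–2, common 0.5–1.5).
upper = excess_deleterious_fraction(ns_s_rare=2.0, ns_s_common=1.0)
lower = excess_deleterious_fraction(ns_s_rare=1.0, ns_s_common=0.75)
print(f"excess fraction: {lower:.0%} to {upper:.0%}")
```

With these example inputs the estimate spans 25% to 50%, matching the quoted range. Other pairs within the stated ratio ranges give different values, so the calculation should be read as a consistency check on the reasoning, not a reproduction of the published analysis.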
