Human Sequencing
Stefano Lise
Bioinformatics & Statistical Genetics (BSG) Core
The Wellcome Trust Centre for Human Genetics (WTCHG), Oxford
Email: [email protected]
Outline
● Human genetic variation in health and disease
  – How do we identify pathogenic mutations amongst many genomic variants?
● The WGS500 project
  – Whole-genome sequencing of 500 genomes of clinical significance
Human Genome
● The (haploid) reference human genome is about 3 x 10^9 bases
  – The human genome is diploid => ~ 2 x 3 x 10^9 bases
  – The exome is ~ 30-60 Mb (1-2% of the genome)
● Some more numbers (from GENCODE, Nov 2012)
  – 20,387 protein-coding genes
    ● 81,626 protein-coding transcripts
  – 13,220 long non-coding RNA genes
  – 9,173 small non-coding RNA genes
  – 13,419 pseudogenes
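The size figures above can be checked with a quick calculation. This is only a back-of-the-envelope sketch using the rounded numbers quoted on the slide; the variable names are illustrative.

```python
# Rounded figures quoted on the slide.
HAPLOID_BASES = 3e9           # ~3 x 10^9 bases in the haploid reference
EXOME_BASES = (30e6, 60e6)    # exome is ~30-60 Mb

# Diploid genome: two copies of the haploid sequence.
diploid_bases = 2 * HAPLOID_BASES

# Exome as a fraction of the haploid genome (lower and upper estimate).
exome_fraction = tuple(size / HAPLOID_BASES for size in EXOME_BASES)

print(f"Diploid genome: ~{diploid_bases:.0e} bases")
print(f"Exome fraction: {exome_fraction[0]:.0%} to {exome_fraction[1]:.0%}")
```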
Sequence Variants
● Single nucleotide variants (SNV)
● Small insertions/deletions (INDEL)
● Structural variants
  – Large insertions/deletions
  – Inversions
  – Copy number variants
  – Translocations
  – …
Human Genome Variation
● The 1000 Genomes Project (www.1000genomes.org) provides a catalogue of all (most) types of human genetic variation
  – Population-scale genome sequencing
● Phase 1 (October 2012)
  – High-throughput sequencing of 1092 human genomes
  – Identified up to 98% of all SNPs with a frequency > 1% in the population
● 1,500 additional genomes in the next (final) phase
Human Genome Variation (1000 Genomes Project, Nature 491, 56-65, 2012)
Allele frequency         < 0.5%          0.5 - 5%        > 5%            Total
All variants             30 - 150 K      120 - 680 K     3.6 - 3.9 M     3.7 - 4.7 M
Synonymous               139 - 640       480 - 2470      12 - 13 K       13 - 16 K
Non-synonymous           220 - 800       540 - 2400      10 - 11 K       11 - 14 K
  (at conserved sites)   (130 - 400)     (240 - 910)     (2.3 - 2.7 K)   (2.7 - 4 K)
LOF                      10 - 20         20 - 55         85 - 105        115 - 180
HGMD-DM                  4 - 8           10 - 33         28 - 43         40 - 85
  (at conserved sites)   (2.5 - 5)       (4.8 - 17)      (11 - 18)       (18 - 40)
LOF = loss-of-function variant (stop-gain, frameshift indel, essential splice site)
Conserved sites = sites with GERP conservation score > 2
Rare and Common Diseases
adapted from TA Manolio et al. Nature 461, 747-753 (2009)
Only 1 or 2 causal variants
Sequencing Strategies
● Targeted sequencing
  – E.g. screening of known genes associated with cardiomyopathies or ataxia
  – Applications in clinical diagnostics
● Whole exome sequencing
  – Protein-coding regions
● Whole genome sequencing
  – Can detect all types of information relevant to pathology in a single go
  – Still costly, but the cost is decreasing rapidly
Identifying causal variants: Assumptions and Filters
● After variant calling, filter out low quality (confidence) calls
● Variant is unique to patients or at least very rare in the general population, e.g. < 1%
  – Use of in-house databases too
● Variant has complete penetrance: every carrier will have the phenotype
● In general these steps will not identify the pathogenic variant uniquely but will restrict the list of candidates. Further analysis required
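A minimal sketch of the quality and rarity filters listed above. The `Variant` fields, the thresholds and the example records are all hypothetical; a real pipeline would read these values from annotated variant calls.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    qual: float          # call quality (hypothetical Phred-like scale)
    pop_af: float        # allele frequency in a population catalogue
    in_house_count: int  # carriers among unrelated in-house samples


def is_candidate(v, min_qual=30.0, max_af=0.01, max_in_house=0):
    """Apply the quality and rarity filters; thresholds are illustrative."""
    return (
        v.qual >= min_qual
        and v.pop_af < max_af
        and v.in_house_count <= max_in_house
    )


variants = [
    Variant("var1", qual=12.0, pop_af=0.000, in_house_count=0),  # low quality
    Variant("var2", qual=95.0, pop_af=0.150, in_house_count=0),  # common
    Variant("var3", qual=80.0, pop_af=0.002, in_house_count=7),  # seen in-house
    Variant("var4", qual=88.0, pop_af=0.001, in_house_count=0),  # kept
]

candidates = [v.name for v in variants if is_candidate(v)]
print(candidates)
```

As the slide notes, filters of this kind only shorten the candidate list; they do not single out the pathogenic variant.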
Ideal Scenario
● Variant is common amongst all affected and absent in all unaffected
● Variant is in a gene with known function and disrupts the protein
Variant Prioritization
● Focus first on protein-coding regions (exome)
  – Nonsense and missense mutations
  – Frame-shift indels
  – Essential splice site disruptions
● Easier to interpret the consequences of the variant
  – E.g. mutation affects catalytic residues in an enzyme
● Targeted exome sequencing has been very successful in disease gene discovery
● Cautionary note: on average each “normal, healthy” individual carries
  – 10-20 rare LOF variants
  – 2-5 rare, disease-associated variants
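The "coding first" step can be sketched as a simple filter on the annotated consequence. The consequence labels and the example records below are hypothetical; real annotation tools use their own vocabularies.

```python
# Consequence classes to examine first (label strings are assumptions).
PRIORITY_CONSEQUENCES = {
    "stop_gain",              # nonsense
    "missense",
    "frameshift_indel",
    "essential_splice_site",
}

# Hypothetical (variant, consequence) annotations.
annotated = [
    ("var1", "synonymous"),
    ("var2", "missense"),
    ("var3", "intronic"),
    ("var4", "stop_gain"),
    ("var5", "essential_splice_site"),
]

# Keep only variants whose consequence is in the first-pass set.
first_pass = [name for name, csq in annotated if csq in PRIORITY_CONSEQUENCES]
print(first_pass)
```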
Non-coding variants
● Many functional elements lie outside protein-coding regions (ENCODE)
● Variants can disrupt
  – Regulatory elements, e.g. transcription factor binding sites
  – Splicing regulatory elements (branch sites, intronic splicing enhancers/inhibitors, …)
  – ncRNA transcripts
  – …
● Many non-coding variants in individual genome sequences lie in ENCODE-annotated functional regions
  – At least as many as in protein-coding genes
Disease models
● Diseases can be
  – Mendelian
    ● Dominant, recessive or X-linked
  – Sporadic
    ● De novo mutation
  – Cancer
    ● Driver mutations
● Analysis strategy needs to be adjusted to each disease category
Autosomal Dominant Disease
● Familial, inherited disorder
● Search for heterozygous variants
  – Present in affected individuals, absent in unaffected ones
● Linkage analysis can substantially narrow the genomic search space
  – E.g. genotype all family members on a SNP array and sequence one or two affected members
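The segregation test for a dominant model can be sketched as follows, assuming complete penetrance as stated earlier. Sample names and genotypes are hypothetical; genotypes are coded as the number of alternate alleles.

```python
def segregates_dominant(genotypes, affected, unaffected):
    """genotypes maps sample -> number of alternate alleles (0, 1 or 2)."""
    return (
        all(genotypes[s] == 1 for s in affected)        # heterozygous in every affected
        and all(genotypes[s] == 0 for s in unaffected)  # absent in every unaffected
    )


# Hypothetical pedigree labels.
affected = ["II-1", "III-2"]
unaffected = ["II-2", "III-1"]

variants = {
    "varA": {"II-1": 1, "III-2": 1, "II-2": 0, "III-1": 0},  # segregates
    "varB": {"II-1": 1, "III-2": 0, "II-2": 0, "III-1": 0},  # missing in one affected
    "varC": {"II-1": 1, "III-2": 1, "II-2": 1, "III-1": 0},  # carried by an unaffected
}

hits = [
    name for name, g in variants.items()
    if segregates_dominant(g, affected, unaffected)
]
print(hits)
```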
Recessive Disease
● Suspected consanguinity
  – Search for homozygous variants
    ● Heterozygous in parents
  – Homozygosity mapping by SNP arrays can substantially reduce the number of variants for follow-up
● No indication of consanguinity
  – Search for compound heterozygous variants
    ● Affected individual carries two separate variants in the same gene
    ● Each parent carries one of the two variants
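Both recessive searches can be sketched on trio genotypes. The records and gene names are hypothetical, and genotypes are coded as alternate-allele counts for (child, mother, father).

```python
from collections import defaultdict

# Each record: (variant, gene, child, mother, father) alt-allele counts.
calls = [
    ("v1", "GENE1", 2, 1, 1),  # homozygous in child, heterozygous in both parents
    ("v2", "GENE2", 1, 1, 0),  # heterozygous, inherited from mother
    ("v3", "GENE2", 1, 0, 1),  # heterozygous, inherited from father, same gene
    ("v4", "GENE3", 1, 1, 0),  # single heterozygous variant, no partner
]

# Homozygous model: child carries two copies, each parent carries one.
homozygous = [v for v, _, c, m, f in calls if (c, m, f) == (2, 1, 1)]

# Compound heterozygous model: group heterozygous variants by gene and parent of origin.
maternal, paternal = defaultdict(list), defaultdict(list)
for v, gene, c, m, f in calls:
    if (c, m, f) == (1, 1, 0):
        maternal[gene].append(v)
    elif (c, m, f) == (1, 0, 1):
        paternal[gene].append(v)

# A gene qualifies when it has at least one maternal and one paternal variant.
compound_het = {
    gene: (maternal[gene], paternal[gene])
    for gene in maternal
    if gene in paternal
}
print(homozygous, compound_het)
```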
Sporadic Genetic Disease
● Dominant disorder, parents are unaffected
● Search for de novo mutations
  – Present in child and not in parents
● Expect 50-100 de novo mutations in a “normal, healthy” individual
  – Father’s age effect, 2 extra mutations per year (Kong et al, Nature 488, 471–475, 2012)
● Sometimes difficult to distinguish from a recessive disease
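A sketch of the trio filter for de novo mutations, plus the paternal-age arithmetic implied by the quoted rate of 2 extra mutations per year. The trio genotypes and the helper names are hypothetical.

```python
def is_de_novo(child, mother, father):
    """Alt-allele counts: heterozygous in the child, absent from both parents."""
    return child == 1 and mother == 0 and father == 0


# Hypothetical trio calls: variant -> (child, mother, father).
trio_calls = {
    "v1": (1, 0, 0),  # candidate de novo
    "v2": (1, 1, 0),  # inherited from mother
    "v3": (2, 1, 1),  # recessive pattern, not de novo
}
de_novo = [name for name, gts in trio_calls.items() if is_de_novo(*gts)]

EXTRA_PER_PATERNAL_YEAR = 2  # rate quoted on the slide (Kong et al, 2012)


def extra_mutations(father_age, reference_age):
    """Expected additional de novo mutations relative to a younger father."""
    return EXTRA_PER_PATERNAL_YEAR * (father_age - reference_age)


print(de_novo, extra_mutations(40, 30))
```

In practice, apparent de novo calls also include genotyping errors in the parents, so candidates usually need validation.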
Cancer
● Matched normal and tumour samples
● Search for somatic variants
  – Present in tumour(s), absent in normal sample
● Identify driver mutations
● More on this tomorrow, JB Cazier’s lecture
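The tumour-versus-normal comparison can be sketched on read counts. The minimum read support and the example counts are hypothetical; dedicated somatic callers model sequencing error and tumour purity, which this sketch ignores.

```python
def is_somatic(tumour_alt_reads, normal_alt_reads, min_tumour_reads=5):
    """Variant supported in the tumour and not seen in the matched normal."""
    return tumour_alt_reads >= min_tumour_reads and normal_alt_reads == 0


# Hypothetical counts of reads supporting the alternate allele: (tumour, normal).
read_support = {
    "v1": (18, 0),   # somatic candidate
    "v2": (20, 15),  # also in normal: germline
    "v3": (2, 0),    # too little tumour support
}

somatic = [name for name, (t, n) in read_support.items() if is_somatic(t, n)]
print(somatic)
```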
Predicting Phenotypic Consequences
● Methods based on comparative genomics
● Evolution as a measure of deleteriousness
  – Variants at conserved positions more likely to be deleterious
● Several conservation scores
  – phyloP - single-site score (http://compgen.bscb.cornell.edu/phast/)
  – GERP - single-site score (http://mendel.stanford.edu/sidowlab/downloads/gerp/index.html)
  – phastCons - region-based score (http://compgen.bscb.cornell.edu/phast/)
  – …
Conservation Scores: Benign vs Pathogenic Variants
Gilissen et al, European Journal of Human Genetics (2012) 20, 490–497
Protein Sequence Variants
● Most established methods. They exploit
  – Amino acid properties, e.g. charge, size, …
  – Structural information, e.g. local secondary structure, surface/core amino acid, …
  – Evolutionary information, e.g. pattern of observed substitutions
  – Database information, e.g. known binding site
● Several methods available
  – SIFT (http://sift.bii.a-star.edu.sg/)
  – PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/)
  – …
PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/)
● Prediction based on sequence, phylogenetic and structural information characterizing the substitution
  – 8 sequence-based properties
  – 3 structure-based properties
● The 11 properties (features) are used as input to a probabilistic classifier
  – Trained to differentiate benign from pathogenic variants
Non-coding variants
● A substantial fraction of disease-causing mutations are not exonic
  – Probably under-represented in databases
● Regulatory variants can have a large effect
● More difficult to discover
  – Non-coding positions less conserved than coding positions
● ENCODE has provided a detailed map of regulatory regions
  – Search for variants that disrupt a consensus sequence motif within a known binding site
Gene Prioritization Methods
● Methods focus on genes rather than on variants
  – Identify the genes most likely to cause a given disease in a list of candidates
● Methods combine heterogeneous pieces of information
  – Shared biological pathways with other disease genes
  – Orthologous genes involved in similar diseases in model organisms
  – Localization in affected tissue
  – …
Follow up
● Definite proof of pathogenicity requires
  – Validation in independent patient cohort
    ● But many diseases are genetically heterogeneous and caused by extremely rare variants
  – In vitro functional experiments
    ● Evaluate molecular consequences, e.g. disruption of expression or protein folding
  – In vivo experiments in model organisms
    ● Is the human phenotype reproduced in, e.g., a knock-out mouse?
Bioinformatics Challenges
● How reliably can we read and annotate an individual’s genome?
● How well can we interpret genetic variation in the context of a clinical presentation?
● Community experiments to objectively assess computational methods
  – Critical Assessment of Genome Interpretation (CAGI 2012)
    ● Distinguish between exomes of Crohn’s disease patients and healthy individuals
    ● PGP genomes: predict clinical phenotypes from genome data, and match individuals to their health records
    ● Whole genomes of a family affected by primary congenital glaucoma: discover the genetic basis of the disease
    ● ...
  – Critical Assessment of Massive Data Analysis (CAMDA 2013)
    ● Reliable variant calling
    ● …
The WGS500 Project
● Collaboration involving the WTCHG, Oxford BRC, Oxford University Hospitals and Illumina
● Sequence 500 genomes of clinical significance
  – Mendelian diseases
  – Immunological disorders
  – Cancers
● Target coverage: 25x (50x for cancer)
● Diverse set of experimental designs
  – Familial: Linkage information
  – De novo: trios
  – Cancer: Tumour-normal, metastases, multiple-mets, ..
● Substantial follow-up (screening and functional) to establish candidacy
Overview of processing
[Flow diagram; boxes listed below]
● Oxford Genomics (400 genomes): read alignment (Stampy), individual/group variant calls (Platypus)
● Illumina (100 genomes): read alignment and calls (Eland/Casava)
● Individual genotypes
● Large-scale CNV scan
● Homozygosity scan
● Union file
● QC
● Annotated genotypes
  – Frequency (1000G, EVS)
  – Conservation
  – Coding consequence (x2)
  – Predicted effect (x3)
  – Pathogenicity (HGMD)
  – Regulatory annotation
● Reference-compressed archive
● Web server
Case Study
PI: Dr A Nemeth
• 3 affected individuals from a highly consanguineous family
  – Childhood developmental ataxia
  – Cognitive impairment
Targeted Sequencing
• Targeted sequencing on V3 using a panel of > 100 known ataxia genes
  – Found a homozygous stop codon in SPTBN2
  – Mutation present as homozygous in all 3 affected individuals and as heterozygous in parents of V3, by Sanger sequencing
• Mutations in SPTBN2 cause spinocerebellar ataxia type 5 (SCA5)
  – Sometimes referred to as “Lincoln ataxia”
  – Autosomal dominant, slowly progressing, adult onset
• Is the cognitive impairment due to the mutation in SPTBN2?
  – Could be caused by mutations in a second gene (homozygous or compound heterozygous)
• Investigated this possibility using a combination of SNP array and whole genome sequencing
Homozygosity Mapping
• SNP array genotyped V1, V2, V3, IV3 and IV4 (~300K SNPs)
• Identified regions of homozygosity (ROH) shared by V1, V2 and V3 and not present in either IV3 or IV4
  – Homozygosity mapping with PLINK
  – Found 23 regions totalling 28.7 Mb
  – Largest segments on chromosome 11
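The idea behind the shared-ROH search can be sketched on toy genotype data. This is not PLINK's algorithm, which is more involved; it is a simplified illustration in which genotypes are two-letter strings per marker, and the minimum run length and all example data are hypothetical.

```python
def is_hom(genotype):
    """True for a homozygous call such as 'AA'."""
    return genotype[0] == genotype[1]


def shared_roh(affected, parents, min_snps=3):
    """Return (start, end) marker index ranges, end exclusive.

    A run is kept when every affected sample carries the same homozygous
    genotype at each marker and no parent is homozygous across the whole run.
    """
    n = len(affected[0])
    shared = [
        all(is_hom(s[i]) and s[i] == affected[0][i] for s in affected)
        for i in range(n)
    ]
    runs, start = [], None
    for i, flag in enumerate(shared + [False]):  # sentinel closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_snps:
                runs.append((start, i))
            start = None
    return [
        (a, b) for a, b in runs
        if not any(all(is_hom(p[i]) for i in range(a, b)) for p in parents)
    ]


# Hypothetical genotypes at nine consecutive markers.
child1 = ["AA", "AG", "CC", "TT", "GG", "AG", "AA", "AA", "CC"]
child2 = ["AA", "GG", "CC", "TT", "GG", "AA", "AA", "AA", "CC"]
parent1 = ["AG", "AG", "CT", "TT", "AG", "AA", "AA", "AA", "CC"]
parent2 = ["AA", "GG", "CC", "CT", "GG", "AG", "AA", "AG", "CC"]

regions = shared_roh([child1, child2], [parent1, parent2])
print(regions)
```

The second shared run (markers 6 to 8) is discarded because one parent is homozygous across all of it.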
Whole Genome Sequencing
• Searched for rare, homozygous variants in shared ROH
  – Present in 1000 Genomes with an allele frequency < 1%
  – Not observed in other WGS500 samples
• Found 68 candidate variants
• Based on evolutionary conservation and available information in databases (e.g. HGMD), the only likely pathogenic variant is the stop codon in SPTBN2
• Also excluded a compound heterozygous model (data not shown)
Functional class      Number of variants
Exonic                2
  Stop gain           (1)
  Synonymous          (1)
ncRNA                 1
UTR                   3
Intronic              40
Upstream              1
Intergenic            21
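A sketch of the candidate search described on this slide: keep variants that fall inside a shared ROH, are homozygous in the affected individuals, are rare in 1000 Genomes and are not seen in other WGS500 samples. The intervals, coordinates and field names are hypothetical. The last lines check that the functional-class counts in the table add up to the 68 candidates reported.

```python
# Hypothetical shared ROH intervals: (chromosome, start, end), inclusive.
roh_regions = [("11", 60_000_000, 70_000_000), ("2", 1_000_000, 2_500_000)]


def in_shared_roh(chrom, pos):
    """True when the position falls inside any shared ROH interval."""
    return any(c == chrom and start <= pos <= end for c, start, end in roh_regions)


# Hypothetical variant records; field names are assumptions for illustration.
variants = [
    {"id": "v1", "chrom": "11", "pos": 66_000_000, "hom_in_affected": True,
     "af_1000g": 0.0, "in_other_wgs500": False},    # kept
    {"id": "v2", "chrom": "11", "pos": 66_500_000, "hom_in_affected": True,
     "af_1000g": 0.2, "in_other_wgs500": False},    # too common
    {"id": "v3", "chrom": "5", "pos": 1_000, "hom_in_affected": True,
     "af_1000g": 0.0, "in_other_wgs500": False},    # outside shared ROH
    {"id": "v4", "chrom": "2", "pos": 2_000_000, "hom_in_affected": True,
     "af_1000g": 0.004, "in_other_wgs500": True},   # seen in other samples
]

candidates = [
    v["id"] for v in variants
    if in_shared_roh(v["chrom"], v["pos"])
    and v["hom_in_affected"]
    and v["af_1000g"] < 0.01
    and not v["in_other_wgs500"]
]
print(candidates)

# Counts per functional class, copied from the table on the slide.
class_counts = {"Exonic": 2, "ncRNA": 1, "UTR": 3, "Intronic": 40,
                "Upstream": 1, "Intergenic": 21}
total_candidates = sum(class_counts.values())
print(total_candidates)
```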
SPTBN2 variant
• The position is actually not well conserved
  – E.g. G->A in gorilla, baboon and mouse
  – GERP = -6.71
  – PhyloP = -1.28
• TGT and TGC encode cysteine
• TGA is a stop codon
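The codon facts above can be illustrated by listing every single-base change at the third position of a cysteine codon. The table below is a small subset of the standard genetic code; the slide does not state which base the patient mutation changes, so this only shows that a third-position change to A turns either cysteine codon into a stop.

```python
# Subset of the standard genetic code (TGN codons only).
CODON_TABLE_SUBSET = {"TGT": "Cys", "TGC": "Cys", "TGA": "Stop", "TGG": "Trp"}


def third_base_changes(codon):
    """Yield (new_codon, amino_acid) for each substitution at position 3."""
    for base in "ACGT":
        if base != codon[2]:
            new_codon = codon[:2] + base
            yield new_codon, CODON_TABLE_SUBSET[new_codon]


for cys_codon in ("TGT", "TGC"):
    print(cys_codon, dict(third_base_changes(cys_codon)))
```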
SPTBN2 knock-out mouse
• Investigated a mouse knock-out of SPTBN2 (Mandy Jackson Lab, Edinburgh)
  – Ataxia (previously reported)
  – Morphological abnormalities in neurons from prefrontal cortex, an area believed to be important in humans for cognitive tasks
  – Deficits in object recognition tasks
• The mouse model supports the hypothesis that both ataxia and cognitive impairment are caused by the recessive mutation in SPTBN2
WGS500 overview of findings (as of Dec 2012)
• Project about 75% complete, with 292 samples (195 case studies) over 38 projects with initial analysis
• In 75/195 cases there is at least one candidate viewed by the PI and analysts as a strong candidate for causing (or strongly contributing to) the phenotype
  – 45/82 in Mendelian
  – 19/61 in Immune
  – 11/52 in Cancer
• Papers in press/submitted to date on
  – Ataxia, CMS, CLL, Multiple adenomas
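The per-category counts above can be turned into success rates and checked against the overall 75/195 figure. The numbers are taken from the slide; the variable names are illustrative.

```python
# (cases with a strong candidate, total cases) per category, from the slide.
findings = {"Mendelian": (45, 82), "Immune": (19, 61), "Cancer": (11, 52)}

total_hits = sum(hits for hits, _ in findings.values())
total_cases = sum(cases for _, cases in findings.values())

for category, (hits, cases) in findings.items():
    print(f"{category}: {hits}/{cases} = {hits / cases:.0%}")
print(f"Overall: {total_hits}/{total_cases} = {total_hits / total_cases:.0%}")
```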