Human Sequencing

Human Sequencing

Stefano LiseBioinformatics & Statistical Genetics (BSG) Core

The Wellcome Trust Centre for Human Genetics (WTCHG), Oxford

Email: [email protected]

Outline

● Human genetic variation in health and disease– How do we identify pathogenic mutations amongst

many genomic variants?

● The WGS500 project– Whole-genome sequencing of 500 genomes of

clinical significance

Human Genome● The (haploid) reference human genome is about 3 x

109 bases– Human genome is diploid => ~ 2 x 3 x 3 109 bases– The exome is ~ 30-60 Mb (1-2% of the genome)

● Some more numbers (from GENCODE, Nov 2012)– 20,387 protein-coding genes

● 81,626 protein-coding transcripts– 13,220 long non-coding RNA genes– 9,173 small non-coding RNA genes– 13,419 pseudogenes

Sequence Variants● Single nucleotide variants (SNV)

● Small insertions/deletions (INDEL)

● Structural variants– Large insertions/deletions– Inversions– Copy number variants– Translocations– ….

Functional Consequences

From Ensembl

Human Genome Variation● The 1000 Genomes Project (www.1000genomes.org)

provides a catalogue of all (most) types of human genetic variation– Population-scale genome sequencing

● Phase 1 (October 2012)– High-throughput sequencing of 1092 human genomes– Identified up to 98% of all SNPs with a frequency > 1% in

the population● 1,500 additional genomes in the next (final) phase

Human Genome Variation(1000 Genomes Project, Nature 491, 56-65, 2012 )

LOF=loss-of-function variant (stop-gain, frameshift indel, essential splice site)Conserved sites = sites with GERP conservation score > 2

Allele Frequency

< 0.5 % 0.5 - 5 % > 5% Total

All variants 30 -150 K 120 – 680 K

3.6 – 3.9 M 3.7 – 4.7 M

Synonymous 139 - 640 480 - 2470 12 – 13 K 13 – 16 K

Non-synonymous(at conserved sites)

220 – 800(130 - 400)

540 -2400(240 - 910)

10 – 11 K(2.3 – 2.7 K)

11 – 14 K(2.7 – 4 K)

LOF 10 - 20 20 - 55 85 - 105 115 - 180

HGMD-DM(at conserved sites)

4 – 8(2.5 -5)

10 – 33(4.8 - 17)

28 – 43(11 - 18)

40 – 85(18 - 40)

Rare and Common Diseases

adapted from TA Manolio et al. Nature 461, 747-753 (2009)

Only 1 or 2 causal variants

Sequencing Strategies● Targeted sequencing

– E.g. screening of known genes associated with cardiomyopathies or ataxia

– Applications in clinical diagnostic● Whole exome sequencing

– Protein coding regions● Whole genome sequencing

– Can detect all types of information relevant to pathology in a single go

– Still costly, but decreasing rapidly

Identifying causal variants: Assumptions and Filters

● After variant calling, filter out low quality (confidence) calls

● Variant is unique in patients or at least very rare in the general population, e.g. < 1%– Use of in-house databases too

● Variant has complete penetrance: every carrier will have the phenotype

● In general these steps will not identify the pathogenic variant uniquely but will restrict the list of candidates. Further analysis required

Ideal Scenario

● Variant is common amongst all affected and absent in all unaffected

● Variant is in a gene with known function and disrupts the protein

(Almost) Ideal Scenario

Variant Prioritization● Focus first on protein-coding regions (exome)

– Nonsense and missense mutations– Frame-shift indels– Essential splice sites disruptions

● Easier to interpret the consequences of the variant– E.g. mutation affects catalytic residues in an enzyme

● Targeted exome sequencing has been very successful in disease gene discovery

● Cautionary note: on average each “normal, healthy” individual carries– 10-20 rare LOF variants– 2-5 rare, disease-associated variants

Non-coding variants● Many functional elements lie outside protein-coding regions

(ENCODE)● Variants can disrupt

– Regulatory elements, e.g. transcription factor binding sites– Splicing regulatory elements (branch sites, intronic splicing

enhancers/inhibitors, …)– ncRNA transcripts– …

● Many non-coding variants in individual genome sequences lie in ENCODE-annotated functional regions– At least as many as in protein-coding genes

Disease models● Diseases can be

– Mendelian● Dominant, recessive or X-linked

– Sporadic ● De novo mutation

– Cancer● Driver mutations

● Analysis strategy needs to be adjusted to each disease category

Autosomal Dominant Disease

● Familiar, inherited disorder● Search for heterozygous variants

– Present in affected individuals, absent in non-affected ones● Linkage analysis can substantially narrow the genomic search space

– E.g. SNP array all family members and sequence one or two affected members

Recessive Disease

● Suspected consanguinity● Search for homozygous variants

– Heterozygous in parents● Homozygosity mapping by SNP

arrays can substantially reduce the number of variants for follow-up

● No indication of consanguinity● Search for compound

heterozygous variants– Affected individual carries two

separate variants in the same gene

– Each parent carries one of the two variants

Sporadic Genetic Disease

● Dominant disorder, parents are unaffected● Search for de novo mutations

– Present in child and not in parents ● Expect 50-100 de novo mutations in “normal, healthy” individual

– Father’s age effect, 2 extra mutations per year (Kong et al, Nature 488, 471–475, 2012)

● Sometimes difficult to distinguish from a recessive disease

Cancer

● Matched normal to tumour samples● Search for somatic variants

– Present in tumour(s), absent in normal sample● Identify driver mutations● More on this tomorrow, JB Cazier’s lecture

Predicting Phenotypic Consequences

● Methods based on comparative genomics

● Evolution as a measure of deleteriousness– Variants at conserved positions more likely to be deleterious

● Several conservation scores– phyloP - single-site score (http://compgen.bscb.cornell.edu/phast/)– GERP - single-site score

(http://mendel.stanford.edu/sidowlab/downloads/gerp/index.html)– phastCons – region-based score

(http://compgen.bscb.cornell.edu/phast/)– …

Conservation ScoresBenign vs Pathogenic Variants

Gilissen et al, European Journal of Human Genetics (2012) 20, 490–497;

b-haemoglobin locus

From GM Cooper & J Shendure, Nature Reviews Genetics 12, 628-640 (2011)

Protein Sequence Variants● Most established methods. They exploit

– Amino acid properties, e.g. charge, size, …– Structural information, e.g. local secondary structure, surface/core

amino acid, …– Evolutionary information, e.g. pattern of observed substitutions – Database information, e.g. known binding site

● Several methods available– SIFT (http://sift.bii.a-star.edu.sg/)– Polyphen-2 (http://genetics.bwh.harvard.edu/pph2/)– …

PolyPhen-2(http://genetics.bwh.harvard.edu/pph2/)

● Prediction based on sequence, phylogenetic and structural information characterizing the substitution– 8 sequence-based properties– 3 structure-based properties

● The 11 properties (features) used as input of a probabilistic classifier– Trained to differentiate benign from pathogenic variants

Non-coding variants● A substantial fraction of disease causing mutations are not

exonic– Probably under-represented in databases

● Regulatory variants can have a large effect● More difficult to discover

– Non-coding positions less conserved than coding positions● ENCODE has provided a detailed map of regulatory regions

– Search for variants that disrupt a consensus sequence motif within a known binding site

Gene Prioritization Methods● Methods focus on genes rather than on variants

– Identify the genes most likely to cause a given disease in a list of candidates

● Methods combine heterogeneous pieces of information– Shared biological pathways with other disease genes– Orthologues genes involved in similar diseases in model

organisms– Localization in affected tissue– …

Follow up● Definite proof of pathogenicity requires

– Validation in independent patient cohort● But many diseases are genetically heterogeneous and

caused by extremely rare variants– In vitro functional experiments

● Evaluate molecular consequences, e.g. disruption of expression or protein folding

– In vivo experiments in model organisms● Is the human phenotype reproduced in, e.g., a knock-out

mouse?

Bioinformatics Challenges● How reliably can we read and annotate an individual’s genome?● How well can we interpret genetic variation in the context of a clinical

presentation?● Community experiment to objectively assess computational methods

– Critical Assessment of Genome Interpretation (CAGI 2012)● Distinguish between exomes of Crohn’s disease patients and healthy individuals● PGP genomes: predict clinical phenotypes from genome data, and match individuals to their

health records● Whole genomes of a family affected by primary congenital glaucoma: discover the genetic

basis of the disease● ...

– Critical Assessment of Massive Data Analysis (CAMDA 2013)● Reliable variant calling ● …

The WGS500 Project● Collaboration involving the WTCHG, Oxford BRC, Oxford University

Hospitals and Illumina● Sequence 500 genomes of clinical significance

– Mendelian diseases– Immunological disorders– Cancers

● Target coverage: 25x (50x for cancer)● Diverse set of experimental designs

– Familial: Linkage information– De novo: trios– Cancer: Tumour-normal, metastases, multiple-mets, ..

● Substantial follow-up (screening and functional) to establish candidacy

Overview of processing

Oxford Genomics Illumina

Read alignment (Stampy)

Individual/group variant

calls (Platypus)

Individual genotypes

Annotated genotypes

400 genomes 100 genomes

Read alignment and calls (Eland/ Casava)

Large-scale CNV scan

Homozygosity scan

Union file

QC

• Frequency (1000G, EVS)• Conservation• Coding consequence (x2)• Predicted effect (x3)• Pathogenicity (HGMD)• Regulatory annotation

Reference-compressed

Archive

Web server

Case StudyPI: Dr A Nemeth

• 3 affected individuals from a highly consanguineous family– Childhood developmental ataxia – Cognitive impairment

Targeted Sequencing• Targeted sequencing on V3 using a panel of > 100 known ataxia

genes – Found an homozygous stop codon in SPTBN2– Mutation present as homozygous in all 3 affected individuals and as

heterozygous in parents of V3, by Sanger sequencing• Mutations in SPTBN2 cause spinocerebellar ataxia type 5 (SCA5)

– Sometimes referred to as “Lincoln ataxia”– Autosomal dominant, slowly progressing, adult onset

• Is the cognitive impairment due to the mutation in SPTBN2?– Could be caused by mutations in a second gene (homozygous or

compound heterozygous)• Investigated this possibility using a combination of SNP array and

whole genome sequencing

Homozygosity Mapping

• SNP array genotyped V1, V2, V3, IV3 and IV4 (~300K SNPs)• Identified regions of homozygosity (ROH) shared by V1, V2 and V3

and not present in either IV3 or IV4– Homozygosity mapping with PLINK– Found 23 regions totalling 28.7 Mb– Largest segments on chromosome 11

Chromosome 11

Whole Genome Sequencing• Searched for rare, homozygous variants in shared ROH

– Present in 1000 Genomes with an allele frequency < 1%– Not observed in other WGS500 samples

• Found 68 candidate variants

• Based on evolutionary conservation and available information in databases (eg HGMD) the only likely pathogenic variant is the stop codon in SPTBN2

• Excluded also a compound heterozygous model (data not shown)

Functional class Number of variants

Exonic• Stop gain • Synonymous

2(1)(1)

ncRNA 1

UTR 3

Intronic 40

Upstream 1

Intergenic 21

SPTBN2 variant

• The position is actually not well conserved– E.g. G->A in gorilla, baboon and mouse– GERP = -6.71– PhyloP = -1.28

• TGT and TGC encode for cysteine• TGA is a stop codon

SPTBN2 knock-out mouse• Investigated a mouse knock-out of SPTBN2 (Mandy

Jackson Lab, Edinburgh) – Ataxia (previously reported)– Morphological abnormalities in neurons from prefrontal

cortex, an area believed to be important in human for cognitive tasks

– Deficits in object recognition tasks• The mouse model supports the hypothesis that both

ataxia and cognitive impairment are caused by the recessive mutation in SPTBN2

WGS500 overview of findings (as of Dec 2012)

• Project about 75% complete, with 292 samples (195 case studies) over 38 projects with initial analysis

• 75/195 cases there is at least one candidate viewed by the PI and analysts as a strong candidate for causing (strongly contributing to) the phenotype– 45/82 in Mendelian– 19/61 in Immune– 11/52 in Cancer

• Papers in press/submitted to date on– Ataxia, CMS, CLL, Multiple adenomas

Date post:	24-Feb-2016
Category:	Documents
Upload:	uta
View:	42 times
Download:	1 times

Human Sequencing

Documents