
Rockefeller pop gen 2017 - Baylor College of Medicine ·...

Date post: 07-Jun-2018
Transcript
Page 1: Rockefeller pop gen 2017 - Baylor College of Medicine

Intro to population genetics

Shamil Sunyaev

Broad Institute of M.I.T. and Harvard

Forces responsible for genetic change

Mutation: µ

Selection: s

Drift: Ne

Population structure: FST

Mutations

Mutation rate in humans and flies

2.5 × 10^-8 per nt (Nachman & Crowell), 1.8 × 10^-8 per nt (Kondrashov)

~10^2 changes per genome

Other events: indels (10^-9)

repeat extensions/contractions (10^-5)

large events (?)

NGS estimates: ~1.2 × 10^-8 per nt
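The link between the per-nucleotide rates and the "~10^2 changes per genome" figure is simple arithmetic. The sketch below works it through under one assumption that is not on the slide: a haploid genome size of roughly 3.2 × 10^9 nt (the name `HAPLOID_GENOME_NT` is illustrative).

```python
# Back-of-the-envelope: per-nucleotide mutation rate -> new mutations per genome.
# HAPLOID_GENOME_NT is an assumed round figure, not a value from the slides.
HAPLOID_GENOME_NT = 3.2e9


def new_mutations_per_genome(rate_per_nt: float, diploid: bool = True) -> float:
    """Expected number of new point mutations per generation."""
    copies = 2 if diploid else 1
    return rate_per_nt * HAPLOID_GENOME_NT * copies


# Rates quoted on the slide (per nt per generation).
estimates = {
    "Nachman & Crowell": 2.5e-8,
    "Kondrashov": 1.8e-8,
    "NGS": 1.2e-8,
}

for source, rate in estimates.items():
    print(f"{source}: ~{new_mutations_per_genome(rate):.0f} per diploid genome")
```

All three rates give a count of order 10^2 per diploid genome, which is consistent with the slide's rounded figure.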

Mutation rate is variable along the genome

Regional variation of mutation rate

Context dependence of mutation rate

Replication fidelity

DNA damage

DNA repair

CpG deamination

Genetic drift

Page 2

Drift is a random change of allele frequencies

Drift depends on population size
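A minimal sketch of both statements, assuming the textbook Wright-Fisher model (binomial resampling of 2N gene copies each generation, no mutation or selection). The model choice, function names, and parameter values are illustrative and not taken from the slides.

```python
import random


def wright_fisher(n_diploid: int, p0: float, generations: int, seed: int = 0) -> list[float]:
    """Neutral Wright-Fisher trajectory: resample 2N gene copies every generation."""
    rng = random.Random(seed)
    copies = 2 * n_diploid
    freq = p0
    trajectory = [freq]
    for _ in range(generations):
        # Each copy in the next generation picks a parent allele at random.
        count = sum(1 for _ in range(copies) if rng.random() < freq)
        freq = count / copies
        trajectory.append(freq)
    return trajectory


def mean_abs_change(n_diploid: int, p0: float = 0.5, generations: int = 10, seed: int = 0) -> float:
    """Average per-generation change in allele frequency along one trajectory."""
    traj = wright_fisher(n_diploid, p0, generations, seed)
    return sum(abs(b - a) for a, b in zip(traj, traj[1:])) / generations


for n in (10, 1000):
    print(f"N = {n}: mean |change in frequency| per generation = {mean_abs_change(n):.4f}")
```

The frequency changes with no selection acting at all, and the per-generation change shrinks as the population grows, which is the sense in which drift depends on population size.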

Demographic history

Selection

New mutation: Neutral / Deleterious / Advantageous

Functional

Nonfunctional

Selection indicates functional mutations, whether or not the tested trait is under selection

Selective effect of mutation

Most functional mutations are deleterious

Page 3

Methods of mathematical population genetics

Dynamics of allelic substitution

[Figure: allele frequency (0 to 1) versus time]

Mathematically, allele frequency change in a population follows a one-dimensional random walk

Diffusion approximation

A random walk that does not jump long distances can be approximated by a diffusion process

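$$\frac{\partial \phi(x,p,t)}{\partial t} = -\frac{\partial}{\partial x}\left[M\,\phi(x,p,t)\right] + \frac{1}{2}\,\frac{\partial^2}{\partial x^2}\left[V\,\phi(x,p,t)\right]$$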
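The slide does not spell out M and V. As a hedged illustration, under the standard Wright-Fisher model with genic selection (an assumption here, along with the symbols s and N_e), the mean and variance of the per-generation change in frequency x would be:

```latex
M(x) = s\,x(1-x), \qquad V(x) = \frac{x(1-x)}{2N_e}
```

With s = 0 the drift term M vanishes and only the random component V remains; s < 0 corresponds to a deleterious allele.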

Coalescent theory

Instead of modeling a population, we can model our sample

Time goes backwards!
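A minimal sketch of modeling the sample backwards in time, assuming the standard Kingman coalescent (constant population size, neutrality, time measured in units of 2N generations). These assumptions and all names below are illustrative, not taken from the slides.

```python
import random


def coalescent_times(n_samples: int, rng: random.Random) -> list[float]:
    """Waiting times while k = n, n-1, ..., 2 lineages remain (units of 2N generations)."""
    times = []
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / 2  # number of lineage pairs that could coalesce
        times.append(rng.expovariate(rate))
    return times


def expected_tmrca(n_samples: int) -> float:
    """Theoretical mean time to the most recent common ancestor: 2 * (1 - 1/n)."""
    return 2 * (1 - 1 / n_samples)


def simulate_mean_tmrca(n_samples: int, replicates: int = 20000, seed: int = 1) -> float:
    """Monte Carlo estimate of the mean TMRCA for a sample of n lineages."""
    rng = random.Random(seed)
    total = sum(sum(coalescent_times(n_samples, rng)) for _ in range(replicates))
    return total / replicates


print("simulated mean TMRCA:", round(simulate_mean_tmrca(10), 3))
print("theoretical mean TMRCA:", expected_tmrca(10))
```

Only the sampled lineages are tracked, so the cost depends on the sample size rather than on the population size.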

Natural selection in protein coding regions

Page 4

Effect of new missense mutations

Computer simulations

$$\frac{\partial \phi(x,p,t)}{\partial t} = -\frac{\partial}{\partial x}\left[M\,\phi(x,p,t)\right] + \frac{1}{2}\,\frac{\partial^2}{\partial x^2}\left[V\,\phi(x,p,t)\right]$$

Demographic history

Natural selection

[Figure: chart with a time axis and a value axis from 0 to 0.3]

• Can we find additional evidence in sequence data?

• Is there any information beyond frequency? Can we tell alleles under selection from neutral alleles if they are of the same frequency?

Page 5

Maruyama effect (1974): at any given frequency, advantageous or deleterious alleles are younger than neutral alleles

[Figure: simulated allele trajectories; x-axis: time (generations), −150 to 0; y-axis: allele count, 0 to 300]

[Diagram: two trajectories from frequency 0% to frequency x over time. Longer trajectory: 6 jumps. Shorter trajectory: 4 jumps.]

Intuition: shorter trajectories require fewer lucky jumps

[Diagram: allele frequency versus time]

Neutral alleles: equal time at each frequency

Selected alleles: faster through higher frequencies

Idea: low accumulation of mutations at linked sites indicates selection

Diffusion theory: deleterious alleles pass fast through higher frequencies

[Figure 1 panels. a: population frequency (%) versus time (generations before present, in 2N units) for a neutral and a deleterious allele. b: mean age (2N generations) versus selection coefficient 2Ns, for population frequencies of 7%, 5%, and 3%. c: mean sojourn time (2N generations) versus intermediate allele frequency (%), for selection coefficients (2Ns) of 0 (neutral), −2 (weakly deleterious), and −10 (deleterious), at a current frequency of 3%.]

Figure 1. Simulation and theoretical results for allelic age and sojourn times. a. Example trajectories for a neutral and deleterious allele with current population frequencies 3% (indicated by an arrow). The shaded areas indicate sojourn times at frequencies above 5%. b. Mean ages for neutral and deleterious alleles at a given population frequency (lines show theoretical predictions, dots show simulation results with standard error bars). The graph shows that deleterious alleles at a given frequency are younger than neutral alleles, and that the effect is greater for more strongly selected alleles. c. Mean sojourn times for neutral and deleterious alleles. Vertical line denotes the current population frequency of the variant (3%). Mean sojourn times have been computed in bins of 1%. Line connects theoretical predictions for each frequency bin. Dots show simulation results. The graph illustrates that deleterious alleles spend much less time than neutral alleles at higher population frequencies in the past even if they have the same current frequency.

Neighborhood clock (fuzzy clock)

[Diagram labels: variant; closest rarer linked variant; closest variant beyond recombination event]

[Figure: cumulative distribution of the NC statistic at MAC=3; x-axis: NC statistic (3 to 6); y-axis: proportion with NC <= x (0 to 1); series: missense ancestral, missense derived, probably damaging derived, synonymous derived]

Neighborhood clock is consistent with Maruyama-effect expectations

Data: pilot Genome of the Netherlands dataset

MAC  Variants           N     Mean NC  Effect size  95% CI            P
2    coding-synon       2813  4.97     baseline
2    missense           3957  5.02     0.089        (0.0387, 0.138)   0.0012
2    benign             1772  5.02     0.088        (0.0361, 0.136)   0.0083
2    possibly damaging  708   4.99     0.040        (−0.013, 0.091)   0.141
2    probably damaging  1136  5.05     0.142        (0.0914, 0.188)   0.0003
3    coding-synon       1708  4.68     baseline
3    missense           2277  4.75     0.134        (0.0726, 0.197)   2.17 × 10^-5
3    benign             1035  4.74     0.118        (0.0521, 0.183)   0.00213
3    possibly damaging  368   4.75     0.137        (0.0714, 0.202)   0.0149
3    probably damaging  650   4.79     0.211        (0.146, 0.275)    1.58 × 10^-6
4    coding-synon       1216  4.46     baseline
4    missense           1496  4.56     0.16         (0.088, 0.238)    2.68 × 10^-5
4    benign             695   4.54     0.127        (0.050, 0.207)    0.00817
4    possibly damaging  254   4.59     0.217        (0.144, 0.287)    0.000512
4    probably damaging  376   4.59     0.212        (0.140, 0.284)    0.000124
5    coding-synon       935   4.37     baseline
5    missense           1102  4.42     0.0966       (0.010, 0.188)    0.00934
5    benign             530   4.42     0.0922       (0.005, 0.176)    0.0454
5    possibly damaging  181   4.4      0.0596       (−0.028, 0.158)   0.312
5    probably damaging  277   4.52     0.266        (0.185, 0.353)    2.73 × 10^-5
6    coding-synon       814   4.24     baseline
6    missense           896   4.28     0.082        (−0.015, 0.171)   0.0562
6    benign             432   4.26     0.047        (−0.044, 0.136)   0.291
6    possibly damaging  145   4.29     0.101        (0.012, 0.187)    0.183
6    probably damaging  215   4.37     0.243        (0.149, 0.338)    0.000826
2-6  coding-synon       7486           baseline
2-6  missense           9728                                          1.79 × 10^-10
2-6  benign             4464                                          5.30 × 10^-6
2-6  possibly damaging  1656                                          0.001
2-6  probably damaging  2654                                          3.25 × 10^-13

Table 1. Discrimination of derived missense alleles by the NC statistic. Missense alleles are sub-classified into categories based on PolyPhen-2 predictions. Effect sizes were calculated as standard deviations from the mean of the NC statistic for synonymous variants at the same minor allele count (MAC). Within each MAC class, P-values were calculated by 1-sided Mann-Whitney test. Combined P-values for MAC 2-6 were computed by meta-analysis (Methods).
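The two per-MAC quantities in the table caption, an effect size in baseline standard deviations and a 1-sided Mann-Whitney P-value, can be sketched as follows. This is an illustration on made-up numbers, not the authors' pipeline: the function names and toy values are hypothetical, the test direction (test group larger than baseline) is assumed, and the P-value uses a normal approximation without tie correction.

```python
import math
from statistics import mean, stdev


def effect_size(test_values, baseline_values):
    """Shift of the test mean from the baseline mean, in baseline standard deviations."""
    return (mean(test_values) - mean(baseline_values)) / stdev(baseline_values)


def mann_whitney_u(test_values, baseline_values):
    """U = number of (test, baseline) pairs with test > baseline; ties count 0.5."""
    u = 0.0
    for t in test_values:
        for b in baseline_values:
            if t > b:
                u += 1.0
            elif t == b:
                u += 0.5
    return u


def one_sided_p(test_values, baseline_values):
    """Normal approximation to P(U >= observed) under the null (no tie correction)."""
    n1, n2 = len(test_values), len(baseline_values)
    u = mann_whitney_u(test_values, baseline_values)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical NC values for two variant classes at the same MAC.
synonymous_nc = [4.1, 4.5, 4.6, 4.7, 4.9, 5.0, 5.2]
missense_nc = [4.4, 4.8, 4.9, 5.1, 5.3, 5.4, 5.6]

print("effect size (baseline SDs):", round(effect_size(missense_nc, synonymous_nc), 3))
print("one-sided P (normal approx.):", round(one_sided_p(missense_nc, synonymous_nc), 4))
```

For sample sizes in the thousands, as in the table, the normal approximation is usually adequate; an exact or tie-corrected test would be preferable for small samples.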


Page 6

Can signals of selection guide prioritization?

Genes of interest should be highly selectively constrained

Can we estimate fitness loss directly?

Several methods to estimate gene-based selective constraint exist (pLI, RVIS)

The ExAC dataset combines exomes of >60,000 individuals

Selection inference using frequencies of individual SNPs

Change in allele frequency = Mutation + Selection + Drift

Mutation: of the order of 10^-8

Demographic history

Population structure

Focusing on rare deleterious PTVs

PTV: protein-truncating variant (a.k.a. nonsense)

Combine all PTVs per gene: we assume that they have identical effects

Consider each gene as a bi-allelic locus: PTV / no PTV

Selection inference using combined frequency of PTVs

Change in allele frequency = Mutation + Selection + Drift
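Written out for a rare deleterious allele class, the three terms can be approximated as below. This is the standard mutation-selection balance argument, given as a sketch: x is the combined PTV frequency, U the per-gene PTV mutation rate, hs the heterozygous selection coefficient, and N_e (a symbol introduced here) the effective population size; x is assumed small.

```latex
\mathbb{E}[\Delta x] \approx \underbrace{U}_{\text{mutation}} \; - \underbrace{hs\,x}_{\text{selection}},
\qquad
\operatorname{Var}[\Delta x] \approx \underbrace{\frac{x}{2N_e}}_{\text{drift}},
\qquad
\mathbb{E}[\Delta x] = 0 \;\Rightarrow\; \hat{x} = \frac{U}{hs}
```

Mutation adds new PTV copies at rate U, selection removes a fraction hs of the existing ones each generation, and drift adds noise around the resulting equilibrium frequency.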

Combined frequency of rare deleterious PTVs is expected to be Poisson distributed with λ = U/(hs)
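A small numeric sketch of this relationship. The values of U, hs, and the number of sampled chromosomes are hypothetical, and treating the PTV allele count in a sample of n chromosomes as Poisson with mean n·U/(hs) is a reading of the slide rather than something it states.

```python
import math


def expected_ptv_frequency(u: float, hs: float) -> float:
    """Equilibrium combined PTV frequency under mutation-selection balance."""
    return u / hs


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of observing k events with mean lam (lam > 0)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def naive_hs(u: float, n_chromosomes: int, observed_count: int) -> float:
    """Invert the expected count n * U / hs to get a point estimate of hs."""
    return u * n_chromosomes / observed_count


# Hypothetical inputs: per-gene PTV mutation rate, selection, sample size.
U = 1e-6
HS = 0.01
N_CHROMOSOMES = 120_000

lam = N_CHROMOSOMES * expected_ptv_frequency(U, HS)
print("expected PTV allele count:", round(lam, 2))
print("P(no PTVs observed):", poisson_pmf(0, lam))
print("naive hs from 12 observed PTV alleles:", naive_hs(U, N_CHROMOSOMES, 12))
```

The naive estimate is noisy when the observed count is small, which is the motivation for treating selection coefficients as random variables drawn from a distribution to be estimated.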

Simulations

The model

PTV counts in each gene are Poisson distributed, but we lack sufficient data to estimate selection coefficients

We can treat selection coefficients as random variables with a distribution to be estimated

Page 7

Distribution of selection coefficients

[Figure: inferred density P(s_het | α, β) (y-axis, 0 to 60) versus heterozygous selection coefficient s_het (x-axis, log scale from 10^-4 to 1)]

Estimates for each gene

combine the results in a mixture distribution with equal weights. The mean mutation rates in the three terciles are U_1 = 4.6 · 10^-?, U_2 = 1.1 · 10^-?, and U_3 = 2.6 · 10^-? (exponent digits lost in extraction). We estimate (α_1, β_1) = (0.057±0.010, 0.0052±0.0003), (α_2, β_2) = (0.046±0.005, 0.0087±0.0004), and (α_3, β_3) = (0.074±0.005, 0.0160±0.0005), with error margins denoting two s.d. from 100 bootstrapping replicates of the set of ~5,333 genes in each tercile. This error estimate is intended to quantify the effect of the sampling noise in the data set on the parameter inference while local mutation rate estimates are assumed fixed. The resulting fitted distributions of counts are shown in Supplementary Figure 9 together with the corresponding p(n), while Figure 1 shows the inferred P(s_het; α, β) = [IG(s_het; α_1, β_1) + IG(s_het; α_2, β_2) + IG(s_het; α_3, β_3)]/3. The choice for the functional form of P(s_het) is motivated by the shape of the empirical distribution of the naïve estimator ν/n (given by a simple inversion of Eq. 3). We also compared the log-likelihood of the fit to p(n) obtained with this model to that obtained from two other two-parameter distributions, s_het ~ Gamma and s_het ~ InvGamma, and chose the model with the highest likelihood, which is s_het ~ IG.

Inference of s_het on individual genes

From the inferred distributions P(s_het; α_t, β_t) in each tercile t of the mutation rate U, we construct a per-gene estimator of s_het for genes in the tercile using the posterior probability given n, which mitigates the stochasticity of the observed PTV count:

$$P(s_{\mathrm{het},i} \mid n_i; \nu_i) = \frac{P(n_i \mid s_{\mathrm{het},i}; \nu_i)\, P(s_{\mathrm{het},i}; \alpha_t, \beta_t)}{\int P(n_i \mid s; \nu_i)\, P(s; \alpha_t, \beta_t)\, ds}, \qquad (7)$$

where the denominator is given by Eq. 5. Supplementary Table 1 provides the mean values derived from these posterior probabilities for each gene.

Predicted mode of inheritance in clinical exome cases

We trained a Naïve Bayes classifier to predict the mode of inheritance in a set of solved clinical exome sequencing cases from Baylor College of Medicine (N=283 cases)22 and UCLA23 (N=176 cases). Using data from UCLA as the training dataset, we are able to cross-predict the mode of inheritance in separately ascertained Baylor cases with classification accuracy of 88.0%, sensitivity of 86.1%, specificity of 90.2%, and an AUC of 0.931. Genes that were related to diagnosis in both clinics (overlapping genes) were removed from the larger Baylor set (Supplementary Figure 2).

Using a logistic regression based on the full set of cases from Baylor and UCLA, we generated predictions for all 15,998 genes where there is an s_het value (Supplementary Table 4).

Mouse knockout comparative analysis

We reviewed mouse knockout enrichments from two datasets: the full set of mouse knockouts from a neutrally-ascertained mouse knockout screen (N=2,179 genes) generated by the International Mouse Phenotyping Consortium25. Genes were classified as ‘Viable’, ‘Sub-Viable’, or ‘Lethal’ based on the results for the assay.

PubMed gene score and enrichment analysis

bioRxiv preprint first posted online Sep. 16, 2016; doi: http://dx.doi.org/10.1101/075523. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.

The estimated distribution over selection coefficients can now be used as a prior, with per-gene estimates obtained from the posteriors
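A hedged numerical sketch of this prior-to-posterior step. Several pieces are assumptions rather than details confirmed by the text: the inverse Gaussian is parameterized by mean α and shape β, the PTV count is Poisson with mean ν / s_het, s_het is restricted to [10^-5, 1], and the value of ν is made up. The prior parameters are the first-tercile estimates quoted in the text.

```python
import math


def log_inverse_gaussian_pdf(s: float, mean: float, shape: float) -> float:
    """Log density of an inverse Gaussian with given mean and shape (assumed parameterization)."""
    return (0.5 * math.log(shape / (2.0 * math.pi * s ** 3))
            - shape * (s - mean) ** 2 / (2.0 * mean ** 2 * s))


def log_poisson_pmf(n: int, lam: float) -> float:
    """Log probability of n events under a Poisson with mean lam."""
    return -lam + n * math.log(lam) - math.lgamma(n + 1)


def posterior_mean_s_het(n_observed: int, nu: float, mean: float, shape: float,
                         grid_points: int = 2000) -> float:
    """Posterior mean of s_het on a log-spaced grid over [1e-5, 1]."""
    log_lo, log_hi = math.log(1e-5), math.log(1.0)
    grid, log_weights = [], []
    for i in range(grid_points):
        s = math.exp(log_lo + (log_hi - log_lo) * (i + 0.5) / grid_points)
        # Likelihood x prior x Jacobian of the log-spaced grid (ds = s * dlog s).
        log_w = (log_poisson_pmf(n_observed, nu / s)
                 + log_inverse_gaussian_pdf(s, mean, shape)
                 + math.log(s))
        grid.append(s)
        log_weights.append(log_w)
    peak = max(log_weights)  # subtract the maximum to avoid underflow
    weights = [math.exp(lw - peak) for lw in log_weights]
    return sum(s * w for s, w in zip(grid, weights)) / sum(weights)


ALPHA, BETA = 0.057, 0.0052   # first-tercile estimates quoted in the text
NU = 1.0                      # hypothetical per-gene value of nu

for n in (0, 5, 50):
    print(f"n = {n:2d} observed PTVs -> posterior mean s_het = "
          f"{posterior_mean_s_het(n, NU, ALPHA, BETA):.4f}")
```

Genes with fewer observed PTVs than expected are pulled toward larger s_het, while the prior keeps the estimate finite even when no PTVs are observed.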

AD and AR Mendelian genes

Figure 2: Separation of disease genes and clinical cases by mode of inheritance. [a] The distribution of genes associated with exclusively autosomal dominant (AD, N=867) disorders versus autosomal recessive (AR, N=1,482) disorders as annotated by the Clinical Genomic Database (CGD). Logarithmic bins are ordered from greatest to smallest s_het values. [b] Overall, AD genes have significantly higher s_het values than AR genes [Mann-Whitney p-value 3.14 × 10^-64]. [c] Similarly, in solved Mendelian clinical exome sequencing cases (Baylor)22, s_het values can help discriminate between AR and AD disease genes, as annotated by clinical geneticists. [d] An s_het value of 0.04 can be used as a simple classification threshold for AD genes with a PPV of 96%. [e] This finding is replicated in a separately ascertained sample from UCLA. Box plots range from 25th-75th percentile values and whiskers include 1.5 times the interquartile range.

In a set of 504 clinical exome cases that resulted in a Mendelian diagnosis22, we find a similar enrichment of cases by MOI and selection value (Figure 2[c]). We find that 90.4% of novel, dominant variants are associated with heterozygous fitness loss greater than 0.04 (Figure 2[d]). Among disease variants, a cutoff of s_het > 0.04 provides a 96% positive predictive value for discriminating between AD and AR modes of inheritance.
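The threshold rule described above (call a gene dominant when s_het > 0.04) and the metrics quoted for it can be sketched as follows. The gene values and labels are made up for illustration, and the function name is hypothetical.

```python
def threshold_metrics(s_het_values, is_dominant, threshold=0.04):
    """PPV, sensitivity, and specificity of the rule 'AD if s_het > threshold'."""
    tp = fp = tn = fn = 0
    for s, dominant in zip(s_het_values, is_dominant):
        predicted_dominant = s > threshold
        if predicted_dominant and dominant:
            tp += 1
        elif predicted_dominant and not dominant:
            fp += 1
        elif not predicted_dominant and not dominant:
            tn += 1
        else:
            fn += 1
    nan = float("nan")
    return {
        "ppv": tp / (tp + fp) if tp + fp else nan,          # AD among predicted AD
        "sensitivity": tp / (tp + fn) if tp + fn else nan,  # AD genes recovered
        "specificity": tn / (tn + fp) if tn + fp else nan,  # AR genes rejected
    }


# Hypothetical genes: s_het estimates and annotated mode of inheritance.
toy_s_het = [0.5, 0.1, 0.05, 0.01, 0.001, 0.2]
toy_is_ad = [True, True, True, False, False, False]

print(threshold_metrics(toy_s_het, toy_is_ad))
```

PPV depends on the mix of AD and AR genes in the evaluated set, so a value measured in one clinical cohort need not transfer unchanged to another.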

[Figure 2 panels. [a] Mode of Inheritance [Clinical Genomic Database]: fraction of genes in each s_het bin (0% to 12%) for AD and AR disease genes, bins from >= 0.3 down to 0.0003. [b] s_het distributions for AD and AR disease genes (log scale, 0.0001 to 1). [c] Mode of Inheritance in Molecular Diagnoses [Baylor]: number of observed genes and fraction of genes by mode of inheritance (AD, AR) per s_het bin. [d] Baylor and [e] UCLA: mode of inheritance for s_het < 0.04 versus s_het > 0.04.]


Age of onset, penetrance and severity

To test the generalizable utility of s_het values in prioritizing candidate genes in Mendelian sequencing studies, we compared the overall prevalence of genes with s_het > 0.04 to the corresponding fraction in an independently ascertained dataset of new dominant Mendelian diagnoses (Figure 2[e])23. This analysis suggests that restricting to genes with s_het > 0.04 would provide a three-fold reduction of candidate variants, given the overall distribution of s_het values. Thus, initial effort in clinical cases can be focused on just a few genes for functional validation, familial segregation studies, and patient matching. We summarize the classification accuracy for all possible thresholds (AUC 0.9312) and probabilities for the mode of inheritance in each gene, generated using the full set of clinical sequencing cases (Supplementary Figure 2 and Supplementary Table 2). Beyond mode of inheritance, we find that s_het can help predict phenotypic severity, age of onset, penetrance, and the fraction of de novo variants in a set of high-confidence haploinsufficient disease genes (Figure 3). In broader sets of known disease genes, s_het estimates significantly correlate with the number of references in OMIM MorbidMap and the number of HGMD disease “DM” variants (Supplementary Figure 3).

Figure 3: Enrichments of s_het in known haploinsufficient disease genes of high confidence (ClinGen Project). In N=127 autosomal genes, we annotate the s_het scores of genes associated with each disease category and classification. Higher s_het values are associated with increased phenotypic severity (Mann-Whitney p-value 4.87 × 10^-3), earlier age of onset (p = 1.46 × 10^-2), high or unspecified penetrance (p = 1.79 × 10^-2), and a larger fraction of de novo variants (p = 8 × 10^-5). Box plots range from 25th-75th percentile values and whiskers include 1.5 times the interquartile range.

Gene-specific fitness loss values allow us to plot the distribution of selective effects for different disorders. This provides information about the breadth and severity of selection associated with various disorder groups using both well-established genes (Figure 4[a]) and new findings from Mendelian exome cases (Figure 4[b]). Overall, genes involved in neurologic phenotypes and congenital heart disease appear to be under more intense selection when compared with other disorder groups, tolerated knockouts in a consanguineous cohort, or in all genes (Figure 4[c,d])24. Interestingly, genes recessive for these disorders appear to have only partially


Concordance with mouse knockout data

... viability, while those with the lowest s_het estimates are depleted for embryonic lethality [Mann-Whitney p = 2.95 × 10^-28] (Figure 5[a,b]).

Figure 5: High-throughput screens of gene essentiality in mice and cell assays. [a] Proportion of orthologous mouse knockout genes by phenotype, from a neutrally-ascertained set of genes generated by the International Mouse Phenotyping Consortium (IMPC). Logarithmic bins are ordered from greatest to smallest s_het values. [b] IMPC mice are separated into viable (N=1,057), sub-viable (N=211) and lethal knockouts (N=477), and lethal knockouts have significantly higher s_het values than viable [Mann-Whitney p-value 2.95 × 10^-28]. [c] Cell-essential genes as reported by Wang et al. from a genome-wide KBM-7 tumor cell CRISPR assay (N=1,740) have significantly higher s_het values [p-value 5.13 × 10^-16], [d] as do genes that were characterized as essential in a gene trap assay (N=1,081) [p-value 4.90 × 10^-18]. In the CRISPR assay, all genes with adjusted p-values < 0.05 and negative assay scores are included, and genes with gene trap scores below 0.4 are included. Box plots range from 25th-75th percentile values and whiskers include 1.5 times the interquartile range.

[Figure 5 panels. [a] Orthologous mouse knockouts by phenotype: percentage of genes in each s_het bin by phenotype (Lethal, Subviable, Viable), bins from >= 0.3 down to 0.0003. [b] Distribution of s_het values for Lethal, Subviable, and Viable knockouts (log scale, 0.0001 to 1). [c] Cell-Essential by KBM7 CRISPR Assay: percentage of genes classified as essential per s_het bin (0% to 25%). [d] Cell-Essential by Yeast Gene Trap Assay: percentage of genes classified as essential per s_het bin (0% to 20%).]


Concordance with cell essentiality screens

[Slide repeats Figure 5 and its caption, shown above; panels [c] and [d] cover the cell essentiality screens.]

Page 8

Black hole in knowledge

Supplementary Figure 7: Most published and least published genes from top s_het decile

The proportion of annotations related to genes with the fewest and most publications in Entrez Gene. From the set of genes under the strongest selection (top 10% of s_het values), we create two sets of 250 genes. The first set of genes has the fewest publications associated with each gene, as defined by our PubMed gene score (Methods), and the second set has the greatest number of associated publications. Between the two groups, we compare the s_het values, number of protein-protein interactions, viability of orthologous mouse knockouts (IMPC), and cell essentiality assays (KBM-7 CRISPR score and Gene Trap Score). These results suggest that the genes in the least published set are similar to those in the most published set, and are also potentially important developmental genes.

[Figure ("Black Hole Figure"): percentage of genes in each group (0.0 to 0.6), comparing the Fewest Publications and Most Publications sets across five measures: Non-Viable Sanger Mice, KBM7 Human Cell Line, Protein-Protein Interactions, s_het Value, and Yeast Gene Trap Score.]
