Intro to population genetics
Shamil Sunyaev
Broad Institute of M.I.T. and Harvard
Forces responsible for genetic change
Mutation µ
Selection s
Drift Ne
Population structure FST
Mutations
Mutation rate in humans and flies
~10^2 changes per genome
Per nt: 2.5x10^-8 (Nachman & Crowell), 1.8x10^-8 (Kondrashov)
Other events: indels (10^-9)
repeat extensions/contractions (10^-5)
large events (?)
NGS estimates: ~1.2x10^-8 per nt
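The per-nucleotide and per-genome figures above are linked by simple arithmetic. A minimal sketch, assuming a haploid genome of about 3.2x10^9 nt (this genome size is an assumption, not a number from the slides):

```python
# Assumption (not from the slides): haploid human genome of ~3.2e9 nt,
# so a diploid genome carries ~6.4e9 mutable sites.
HAPLOID_GENOME_NT = 3.2e9

def new_mutations_per_genome(rate_per_nt, diploid=True):
    """Expected number of new point mutations per genome per generation."""
    sites = HAPLOID_GENOME_NT * (2 if diploid else 1)
    return rate_per_nt * sites

# Per-nt rates quoted on the slide.
rates = {
    "Nachman & Crowell": 2.5e-8,
    "Kondrashov": 1.8e-8,
    "NGS": 1.2e-8,
}
for source, rate in rates.items():
    print(f"{source}: ~{new_mutations_per_genome(rate):.0f} new mutations per diploid genome")
```

Under this assumption every quoted rate gives on the order of 10^2 new mutations per diploid genome per generation.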
Mutation rate is variable along the genome
Regional variation of mutation rate
Context dependence of mutation rate
Replication fidelity
DNA damage
DNA repair
CpG deamination
Genetic drift
Drift is a random change of allele frequencies
Drift depends on population size
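A minimal sketch of how drift depends on population size, assuming the standard neutral Wright-Fisher model (binomial resampling of chromosomes each generation). The population sizes, number of generations, and seed are illustrative choices, not values from the slides:

```python
import numpy as np

def wright_fisher(n_chrom, p0, generations, rng):
    """Neutral Wright-Fisher drift: binomial resampling of n_chrom chromosomes."""
    freqs = [p0]
    p = p0
    for _ in range(generations):
        p = rng.binomial(n_chrom, p) / n_chrom
        freqs.append(p)
    return freqs

def final_frequency_variance(n_chrom, p0=0.5, generations=20, replicates=2000, seed=1):
    """Variance of the allele frequency across replicate populations after drift."""
    rng = np.random.default_rng(seed)
    finals = [wright_fisher(n_chrom, p0, generations, rng)[-1] for _ in range(replicates)]
    return float(np.var(finals))

small = final_frequency_variance(n_chrom=50)     # small population
large = final_frequency_variance(n_chrom=5000)   # large population
print(f"variance after 20 generations: small={small:.4f}, large={large:.4f}")
```

Replicate populations diverge much faster when the population is small, which is the sense in which drift is stronger in small populations.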
Demographic history
Selection
Neutral, deleterious, advantageous
New mutation
Functional
Nonfunctional
Selection indicates functional mutations, whether or not the tested trait is under selection
Selective effect of mutation
Most functional mutations are deleterious
Methods of mathematical population genetics
Dynamics of allelic substitution
[Figure: allele frequency, between 0 and 1, against time]
Mathematically, allele frequency change in a population follows a one-dimensional random walk
Diffusion approximation
Random walk that does not jump long distances can be approximated by a diffusion process
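$$\frac{\partial \phi(x,p,t)}{\partial t} = -\frac{\partial\,[M\,\phi(x,p,t)]}{\partial x} + \frac{1}{2}\,\frac{\partial^{2}[V\,\phi(x,p,t)]}{\partial x^{2}}$$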
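For a concrete reading of the two coefficients, here is a sketch under a common textbook choice: a Wright-Fisher population of effective size N_e with additive (genic) selection coefficient s. The slides do not specify M and V, so these forms are an assumption. Here \phi(x,p,t) is read as the density of allele frequency x at time t given initial frequency p, M is the mean change in frequency per generation, and V is its variance:

```latex
M(x) = s\,x(1-x), \qquad V(x) = \frac{x(1-x)}{2N_e}
```

With s = 0 the first term vanishes and only the drift term, which scales as 1/(2N_e), remains.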
Coalescent theory
Instead of modeling a population, we can model our sample
Time goes backwards!
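A minimal sketch of modeling the sample backwards in time, assuming the standard Kingman coalescent for a constant-size population (an assumption; the slide names no specific model). With k lineages, the waiting time to the next coalescence is exponential with rate k(k-1)/2 when time is measured in units of 2N generations:

```python
import random

def sample_tmrca(n, rng):
    """Time to the most recent common ancestor of n lineages (units of 2N generations)."""
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2  # number of lineage pairs that can coalesce
        t += rng.expovariate(rate)
    return t

def mean_tmrca(n, replicates=20000, seed=7):
    """Monte Carlo estimate of the expected TMRCA for a sample of n lineages."""
    rng = random.Random(seed)
    return sum(sample_tmrca(n, rng) for _ in range(replicates)) / replicates

n = 10
print(f"simulated mean TMRCA: {mean_tmrca(n):.3f}; theory 2(1 - 1/n): {2 * (1 - 1 / n):.3f}")
```

Only the n sampled lineages are simulated, never the whole population, which is what makes the backward view cheap.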
Natural selection in protein coding regions
Effect of new missense mutations
Computer simulations
$$\frac{\partial \phi(x,p,t)}{\partial t} = -\frac{\partial\,[M\,\phi(x,p,t)]}{\partial x} + \frac{1}{2}\,\frac{\partial^{2}[V\,\phi(x,p,t)]}{\partial x^{2}}$$
Demographic history
Natural selection
• Can we find additional evidence in sequence data?
• Is there any information beyond frequency? Can we tell alleles under selection from neutral alleles if they are of the same frequency?
Maruyama effect (1974): at any given frequency, advantageous or deleterious alleles are younger than neutral alleles
[Figure: allele count against time (generations), with a trajectory rising from frequency 0% to frequency x]
At a given frequency, deleterious and advantageous alleles are younger than neutral alleles
Longer trajectory: 6 jumps
Shorter trajectory: 4 jumps
[Figure: longer and shorter trajectories from frequency 0% to frequency x over time]
Intuition: shorter trajectories require fewer lucky jumps
[Figure: allele frequency against time]
Neutral alleles: equal time at each frequency
Selected alleles: faster through higher frequencies
Idea: low accumulation of mutations at linked sites indicates selection
Diffusion theory: deleterious alleles pass fast through higher frequencies
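The age difference can be checked with a small forward simulation. This is a sketch under stated assumptions, not the method behind the figures: a haploid Wright-Fisher population of `n_chrom` chromosomes, a derived allele with relative fitness 1 + s, and illustrative parameter values. Each generation a trajectory spends at the target count contributes its current age, which mimics sampling alleles at that frequency under a steady influx of new mutations:

```python
import numpy as np

def mean_age_at_count(n_chrom, s, target, replicates, seed):
    """Mean age (generations) of alleles observed at `target` copies."""
    rng = np.random.default_rng(seed)
    age_sum, visits = 0, 0
    for _ in range(replicates):
        count, age = 1, 0            # each replicate starts as one new mutation
        while 0 < count < n_chrom:   # follow it until loss or fixation
            if count == target:
                age_sum += age
                visits += 1
            p = count / n_chrom
            p_sel = p * (1 + s) / (1 + s * p)  # frequency after selection
            count = rng.binomial(n_chrom, p_sel)
            age += 1
    return age_sum / visits if visits else float("nan")

# Illustrative values: 200 chromosomes, alleles observed at 10 copies (5%).
neutral = mean_age_at_count(200, 0.0, 10, replicates=20000, seed=3)
deleterious = mean_age_at_count(200, -0.05, 10, replicates=20000, seed=3)
print(f"mean age at 5%: neutral={neutral:.1f}, deleterious={deleterious:.1f} generations")
```

Deleterious trajectories that are still at the target frequency got there with fewer detours, so their recorded ages are shorter on average.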
[Figure 1 panels: (a) population frequency (%) against time (generations before present, in 2N units) for a neutral and a deleterious allele; (b) mean age (2N generations) against selection coefficient 2Ns, for population frequencies 3%, 5%, and 7%; (c) mean sojourn time (2N generations) against intermediate allele frequency (%), for selection coefficients 2Ns = 0 (neutral), −2 (weakly deleterious), and −10 (deleterious)]
Figure 1. Simulation and theoretical results for allelic age and sojourn times. a. Example trajectories for a neutral and a deleterious allele with current population frequency 3% (indicated by an arrow). The shaded areas indicate sojourn times at frequencies above 5%. b. Mean ages for neutral and deleterious alleles at a given population frequency (lines show theoretical predictions, dots show simulation results with standard error bars). The graph shows that deleterious alleles at a given frequency are younger than neutral alleles, and that the effect is greater for more strongly selected alleles. c. Mean sojourn times for neutral and deleterious alleles. The vertical line denotes the current population frequency of the variant (3%). Mean sojourn times have been computed in bins of 1%. The line connects theoretical predictions for each frequency bin. Dots show simulation results. The graph illustrates that deleterious alleles spend much less time than neutral alleles at higher population frequencies in the past even if they have the same current frequency.
Neighborhood clock (fuzzy clock)
[Diagram: a variant, the closest rarer linked variant, and the closest variant beyond a recombination event]
[Figure: proportion of variants with NC <= x against the NC statistic, at MAC=3, for missense ancestral, missense derived, probably damaging derived, and synonymous derived variants]
Neighborhood clock is consistent with Maruyama-effect expectations
Data: pilot Genome of Netherlands dataset
MAC  Variants            N     Mean NC  Effect size  95% CI            P
2    coding-synon        2813  4.97     baseline
2    missense            3957  5.02     0.089        (0.0387, 0.138)   0.0012
2    benign              1772  5.02     0.088        (0.0361, 0.136)   0.0083
2    possibly damaging   708   4.99     0.040        (−0.013, 0.091)   0.141
2    probably damaging   1136  5.05     0.142        (0.0914, 0.188)   0.0003
3    coding-synon        1708  4.68     baseline
3    missense            2277  4.75     0.134        (0.0726, 0.197)   2.17x10^-5
3    benign              1035  4.74     0.118        (0.0521, 0.183)   0.00213
3    possibly damaging   368   4.75     0.137        (0.0714, 0.202)   0.0149
3    probably damaging   650   4.79     0.211        (0.146, 0.275)    1.58x10^-6
4    coding-synon        1216  4.46     baseline
4    missense            1496  4.56     0.16         (0.088, 0.238)    2.68x10^-5
4    benign              695   4.54     0.127        (0.050, 0.207)    0.00817
4    possibly damaging   254   4.59     0.217        (0.144, 0.287)    0.000512
4    probably damaging   376   4.59     0.212        (0.140, 0.284)    0.000124
5    coding-synon        935   4.37     baseline
5    missense            1102  4.42     0.0966       (0.010, 0.188)    0.00934
5    benign              530   4.42     0.0922       (0.005, 0.176)    0.0454
5    possibly damaging   181   4.4      0.0596       (−0.028, 0.158)   0.312
5    probably damaging   277   4.52     0.266        (0.185, 0.353)    2.73x10^-5
6    coding-synon        814   4.24     baseline
6    missense            896   4.28     0.082        (−0.015, 0.171)   0.0562
6    benign              432   4.26     0.047        (−0.044, 0.136)   0.291
6    possibly damaging   145   4.29     0.101        (0.012, 0.187)    0.183
6    probably damaging   215   4.37     0.243        (0.149, 0.338)    0.000826
2-6  coding-synon        7486           baseline
2-6  missense            9728                                          1.79x10^-10
2-6  benign              4464                                          5.30x10^-06
2-6  possibly damaging   1656                                          0.001
2-6  probably damaging   2654                                          3.25x10^-13
Table 1. Discrimination of derived missense alleles by the NC statistic. Missense alleles are sub-classified into categories based on PolyPhen-2 predictions. Effect sizes were calculated as standard deviations from the mean of the NC statistic for synonymous variants at the same minor allele count (MAC). Within each MAC class, P-values were calculated by 1-sided Mann-Whitney test. Combined P-values for MAC 2-6 were computed by meta-analysis (Methods).
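The effect-size column can be read as a standardized shift of the mean NC statistic relative to synonymous variants. A minimal sketch, assuming the standard deviation is that of the synonymous class (the caption does not say which standard deviation is used) and using made-up toy values rather than the study data:

```python
from statistics import mean, stdev

def nc_effect_size(test_values, synonymous_values):
    """Shift of the test-class mean, in standard deviations of the synonymous class.

    Assumption: the synonymous-class SD is the scaling unit; the source
    caption does not specify this.
    """
    return (mean(test_values) - mean(synonymous_values)) / stdev(synonymous_values)

# Hypothetical NC values at one minor allele count (not data from the study).
synonymous_nc = [4.1, 4.6, 5.0, 5.3, 5.8]
missense_nc = [4.3, 4.9, 5.2, 5.5, 6.0]
print(f"effect size: {nc_effect_size(missense_nc, synonymous_nc):.3f}")
```

A positive value means the test class has a larger mean NC than synonymous variants at the same MAC.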
Can signals of selection guide prioritization?
Genes of interest should be highly selectively constrained
Can we estimate fitness loss directly?
Several methods to estimate gene-based selective constraint exist (pLI, RVIS)
ExAC dataset combines exomes of >60,000 individuals
Selection inference using frequencies of individual SNPs
Change in allele frequency = Mutation + Selection + Drift
Of the order of 10^-8
Demographic history
Population structure
Focusing on rare deleterious PTVs
PTV – protein truncating variant (a.k.a. nonsense)
Combine all PTVs per gene – we assume that they have identical effects
Consider each gene as a bi-allelic locus: PTV / no PTV
Selection inference using combined frequency of PTVs
Change in allele frequency = Mutation + Selection + Drift
Combined frequency of rare deleterious PTVs is expected to be Poisson distributed with λ = U/(hs)
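A short sketch of where a mean of U/(hs) comes from, ignoring drift. The reading of the symbols is an assumption, since the slide does not define them: U is the per-gene PTV mutation rate, x is the combined PTV frequency (rare, so x is much smaller than 1), and hs is the selection coefficient against heterozygous carriers:

```latex
\begin{align}
\mathbb{E}[\Delta x] &\approx \underbrace{U(1-x)}_{\text{mutation}} - \underbrace{hs\,x(1-x)}_{\text{selection}} \approx U - hs\,x, \\
\mathbb{E}[\Delta x] = 0 &\;\Rightarrow\; \hat{x} = \frac{U}{hs}.
\end{align}
```

For example, with illustrative values U = 10^{-6} and hs = 0.01, the balance frequency is 10^{-4}.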
Simulations
The model
PTV counts in each gene are Poisson distributed, but we lack sufficient data to estimate selection coefficients
We can treat selection coefficients as random variables with a distribution to be estimated
Distribution of selection coefficients
[Figure: inferred density P(s_het | α, β) against the heterozygous selection coefficient s_het, on a logarithmic axis from 10^-4 to 1]
Estimates for each gene
combine the results in a mixture distribution with equal weights. The mean mutation rates in the
three terciles are U_1 = 4.6 x 10^-7, U_2 = 1.1 x 10^-6, and U_3 = 2.6 x 10^-6. We estimate (α_1, β_1) =
(0.057±0.010, 0.0052±0.0003), (α_2, β_2) = (0.046±0.005, 0.0087±0.0004), and (α_3, β_3) =
(0.074±0.005, 0.0160±0.0005), with error margins denoting two s.d. from 100 bootstrapping
replicates of the set of ~5,333 genes in each tercile. This error estimate is intended to quantify
the effect of the sampling noise in the data set on the parameter inference while local mutation
rate estimates are assumed fixed. The resulting fitted distributions of counts are shown in
Supplementary Figure 9 together with the corresponding P(n), while Figure 1 shows the
inferred P(s_het; α, β) = [IG(s_het; α_1, β_1) + IG(s_het; α_2, β_2) + IG(s_het; α_3, β_3)]/3. The choice for the
functional form of P(s_het) is motivated by the shape of the empirical distribution of the naïve
estimator ν/n (given by a simple inversion of Eq. 3). We also compared the log-likelihood of the
fit to P(n) obtained with this model to that obtained from two other two-parameter distributions,
s_het ~ Gamma and s_het ~ InvGamma, and chose the model with the highest likelihood, which is
s_het ~ IG.
Inference of s_het on individual genes
From the inferred distributions P(s_het; α_t, β_t) in each tercile t of the mutation rate U, we construct
a per-gene estimator of s_het for genes in the tercile using the posterior probability given n, which
mitigates the stochasticity of the observed PTV count:
$$P(s_{het,i} \mid n_i; \nu_i) = \frac{P(n_i \mid s_{het,i}; \nu_i)\, P(s_{het,i}; \alpha_t, \beta_t)}{\int P(n_i \mid s; \nu_i)\, P(s; \alpha_t, \beta_t)\, ds}, \qquad (7)$$
where the denominator is given by Eq. 5. Supplementary Table 1 provides the mean values
derived from these posterior probabilities for each gene.
Predicted mode of inheritance in clinical exome cases
We trained a Naïve Bayes classifier to predict the mode of inheritance in a set of solved clinical
exome sequencing cases from Baylor College of Medicine (N=283 cases)22 and UCLA23 (N=176
cases). Using data from UCLA as the training dataset, we are able to cross-predict the mode of
inheritance in separately ascertained Baylor cases with classification accuracy of 88.0%,
sensitivity of 86.1%, specificity of 90.2%, and an AUC of 0.931. Genes that were related to
diagnosis in both clinics (overlapping genes) were removed from the larger Baylor set
(Supplementary Figure 2).
Using a logistic regression based on the full set of cases from Baylor and UCLA, we generated
predictions for all 15,998 genes where there is an s_het value (Supplementary Table 4).
Mouse knockout comparative analysis
We reviewed mouse knockout enrichments from two datasets: the full set of mouse knockouts
from a neutrally-ascertained mouse knockout screen (N=2,179 genes) generated by the
International Mouse Phenotyping Consortium25. Genes were classified as ‘Viable’, ‘Sub-Viable’,
or ‘Lethal’ based on the results for the assay.
PubMed gene score and enrichment analysis
bioRxiv preprint first posted online Sep. 16, 2016; doi: http://dx.doi.org/10.1101/075523. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
The estimated distribution over selection coefficients can now be used as a prior, and per-gene estimates are obtained from the posteriors
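A minimal sketch of the prior-to-posterior step for one gene, evaluated on a grid. Several pieces are assumptions rather than the authors' implementation: the observed PTV count n is taken to be Poisson with rate ν/s_het, where ν stands for the expected count scale (for example, sampled chromosomes times the mutation rate U); the inverse Gaussian prior uses the standard (mean, shape) parameterization, whose mapping to the source's (α, β) is not stated here; and all numbers are illustrative:

```python
import math

def inverse_gaussian_pdf(s, mean, shape):
    """Inverse Gaussian density with the standard (mean, shape) parameters."""
    return math.sqrt(shape / (2 * math.pi * s**3)) * math.exp(
        -shape * (s - mean) ** 2 / (2 * mean**2 * s)
    )

def poisson_pmf(n, rate):
    """Poisson probability of n events, computed in log space for stability."""
    return math.exp(n * math.log(rate) - rate - math.lgamma(n + 1))

def posterior_mean_s_het(n_obs, nu, prior_mean, prior_shape, grid_size=2000):
    """Posterior mean of s_het on a log-spaced grid over [1e-5, 1]."""
    grid = [10 ** (-5 + 5 * i / (grid_size - 1)) for i in range(grid_size)]
    # Unnormalized posterior: Poisson likelihood times inverse Gaussian prior.
    post = [
        poisson_pmf(n_obs, nu / s) * inverse_gaussian_pdf(s, prior_mean, prior_shape)
        for s in grid
    ]
    numerator = denominator = 0.0
    for i in range(grid_size - 1):  # trapezoid rule
        ds = grid[i + 1] - grid[i]
        denominator += 0.5 * (post[i] + post[i + 1]) * ds
        numerator += 0.5 * (post[i] * grid[i] + post[i + 1] * grid[i + 1]) * ds
    return numerator / denominator

# Hypothetical gene-level inputs: nu = 0.12, prior mean 0.05, prior shape 0.01.
few_ptvs = posterior_mean_s_het(n_obs=0, nu=0.12, prior_mean=0.05, prior_shape=0.01)
many_ptvs = posterior_mean_s_het(n_obs=20, nu=0.12, prior_mean=0.05, prior_shape=0.01)
print(f"posterior mean s_het: n=0 -> {few_ptvs:.4f}, n=20 -> {many_ptvs:.4f}")
```

A gene with no observed PTVs is pulled toward strong selection, while a gene with many PTVs is pulled toward weak selection; the prior keeps either estimate from being driven by count noise alone.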
AD and AR Mendelian genes
Figure 2: Separation of disease genes and clinical cases by mode of inheritance. [a] The distribution of genes associated with exclusively autosomal dominant (AD, N=867) disorders versus autosomal recessive (AR, N=1,482) disorders as annotated by the Clinical Genomics Database (CGD). Logarithmic bins are ordered from greatest to smallest s_het values. [b] Overall, AD genes have significantly higher s_het values than AR genes [Mann-Whitney p-value 3.14x10^-64]. [c] Similarly, in solved Mendelian clinical exome sequencing cases (Baylor)22, s_het values can help discriminate between AR and AD disease genes, as annotated by clinical geneticists. [d] An s_het value of 0.04 can be used as a simple classification threshold for AD genes with a PPV of 96%. [e] This finding is replicated in a separately ascertained sample from UCLA. Box plots range from 25th-75th percentile values and whiskers include 1.5 times the interquartile range.
In a set of 504 clinical exome cases that resulted in a Mendelian diagnosis22, we find a similar enrichment of cases by MOI and selection value (Figure 2[c]). We find that 90.4% of novel, dominant variants are associated with heterozygous fitness loss greater than 0.04 (Figure 2[d]). Among disease variants, a cutoff of s_het > 0.04 provides a 96% positive predictive value for discriminating between AD and AR modes of inheritance.
[Figure 2 panels: [a] Mode of Inheritance [Clinical Genomic Database]: fraction of genes in each s_het bin, AD versus AR disease genes; [b] s_het distributions for AD and AR disease genes; [c] Mode of Inheritance in Molecular Diagnoses [Baylor]: number of observed genes and fraction of genes by mode of inheritance in each s_het bin; [d] Baylor and [e] UCLA: mode of inheritance (AD, AR) for s_het < 0.04 versus s_het > 0.04]
Age of onset, penetrance and severity
To test the generalizable utility of s_het values in prioritizing candidate genes in Mendelian sequencing studies, we compared the overall prevalence of genes with s_het > 0.04 to the corresponding fraction in an independently ascertained dataset of new dominant Mendelian diagnoses (Figure 2[e])23. This analysis suggests that restricting to genes with s_het > 0.04 would provide a three-fold reduction of candidate variants, given the overall distribution of s_het values. Thus, initial effort in clinical cases can be focused on just a few genes for functional validation, familial segregation studies, and patient matching. We summarize the classification accuracy for all possible thresholds (AUC 0.9312) and probabilities for the mode of inheritance in each gene, generated using the full set of clinical sequencing cases (Supplementary Figure 2 and Supplementary Table 2). Beyond mode of inheritance, we find that s_het can help predict phenotypic severity, age of onset, penetrance, and the fraction of de novo variants in a set of high-confidence haploinsufficient disease genes (Figure 3). In broader sets of known disease genes, s_het estimates significantly correlate with the number of references in OMIM MorbidMap and the number of HGMD disease “DM” variants (Supplementary Figure 3).
Figure 3: Enrichments of s_het in known haploinsufficient disease genes of high confidence (ClinGen Project). In (N=127) autosomal genes, we annotate the s_het scores of genes associated with each disease category and classification. Higher s_het values are associated with increased phenotypic severity (Mann-Whitney p-value 4.87x10^-3), earlier age of onset (p=1.46x10^-2), high or unspecified penetrance (p=1.79x10^-2), and a larger fraction of de novo variants (p=8x10^-5). Box plots range from 25th-75th percentile values and whiskers include 1.5 times the interquartile range.
Gene-specific fitness loss values allow us to plot the distribution of selective effects for different disorders. This provides information about the breadth and severity of selection associated with various disorder groups using both well-established genes (Figure 4[a]) and new findings from Mendelian exome cases (Figure 4[b]). Overall, genes involved in neurologic phenotypes and congenital heart disease appear to be under more intense selection when compared with other disorder groups, tolerated knockouts in a consanguineous cohort, or in all genes (Figure 4[c,d])24. Interestingly, genes recessive for these disorders appear to have only partially
Concordance with mouse knockout data
viability, while those with the lowest s_het estimates are depleted for embryonic lethality [Mann-Whitney p=2.95x10^-28] (Figure 5[a,b]).
Figure 5: High-throughput screens of gene essentiality in mice and cell assays. [a] Proportion of orthologous mouse knockout genes by phenotype, from a neutrally-ascertained set of genes generated by the International Mouse Phenotyping Consortium (IMPC). Logarithmic bins are ordered from greatest to smallest s_het values. [b] IMPC mice are separated into viable (N=1,057), sub-viable (N=211) and lethal knockouts (N=477), and lethal knockouts have significantly higher s_het values than viable [Mann-Whitney p-value 2.95x10^-28]. [c] Cell-essential genes as reported by Wang et al. from genome-wide KBM-7 tumor cell CRISPR assay (N=1,740) have significantly higher s_het values [p-value 5.13x10^-16], [d] as do genes that were characterized as essential in a gene trap assay (N=1,081) [p-value = 4.90x10^-18]. In the CRISPR assay, all genes with adjusted p-values < 0.05 and negative assay scores are included, and genes with gene trap scores below 0.4 are included. Box plots range from 25th-75th percentile values and whiskers include 1.5 times the interquartile range.
[Figure 5 panels: [a] Orthologous mouse knockouts by phenotype (Lethal, Subviable, Viable): percentage of genes in each s_het bin; [b] Distribution of s_het values by phenotype; [c] Cell-Essential by KBM7 CRISPR Assay: percentage of genes classified as essential in each s_het bin; [d] Cell-Essential by Yeast Gene Trap Assay: percentage of genes classified as essential in each s_het bin]
Concordance with cell essentiality screens
Black hole in knowledge
Supplementary Figure 7: Most published and least published genes from top s_het decile. The proportion of annotations related to genes
with the fewest and most publications in Entrez Gene. From the set of genes under the strongest selection (top 10%
of s_het values), we create two sets of 250 genes. The first set of genes has the fewest publications associated with
each gene, as defined by our PubMed gene score (Methods), and the second set has the greatest number of
associated publications. Between the two groups, we compare the s_het values, number of protein-protein interactions,
viability of orthologous mouse knockouts (IMPC), and cell essentiality assays (KBM-7 CRISPR score and Gene Trap
Score). These results suggest that the genes in the least published set are similar to those in the most published set,
and are also potentially important developmental genes.
[Supplementary Figure 7 ("Black Hole Figure"): percentage of genes in each group, fewest versus most publications, for Non-Viable Sanger Mice, KBM7 Human Cell Line, Protein-Protein Interactions, s_het Value, and Yeast Gene Trap Score]