Identifying dominant gene candidates with GEMINI
Please refer to the following Github Gist to find each command for this session. Commands should be copy/pasted from this Gist
Aaron Quinlan University of Utah
!!!!!quinlanlab.org
1
https://gist.github.com/arq5x/9e1928638397ba45da2e#file-autosomal-dominant-sh
Autosomal dominant pedigrees
Jessica Chong2
Create a PED file (done for you)
Jessica Chong3
Create a GEMINI database from a VCF
Notes: 1. The VCF has been normalized and decomposed with VT 2. The VCF has been annotated with VEP.
$ curl https://s3.amazonaws.com/gemini-‐tutorials/trio.trim.vep.vcf.gz > trio.trim.vep.vcf.gz
$ curl https://s3.amazonaws.com/gemini-‐tutorials/dominant.ped > dominant.ped
$ gemini load -‐-‐cores 4 \
-‐v trio.trim.vep.vcf.gz \
-‐t VEP \
-‐-‐skip-‐gene-‐tables \
-‐p dominant.ped \ ! trio.trim.vep.dominant.db
Note: copy and paste the full command from the Github Gist to avoid errors
http://gemini.readthedocs.org/en/latest/content/preprocessing.html#step-1-split-left-align-and-trim-variants
4
Running the autosomal_dominant tool
$ gemini autosomal_dominant --columns "chrom, start, end, ref, alt, gene, impact, cadd_raw" \ trio.trim.vep.dominant.db \ | head \ | column -t
Note: copy and paste the full command from the Github Gist
chrom start end ref alt gene impact cadd_raw variant_id family_id family_members family_genotypes samples family_count chr15 59508889 59508890 T C AC092756.1 mature_miRNA -‐0.78 8480 family1 1805,1847,4805 T/C,T/T,T/C 1805,4805 1 chr15 84868493 84868494 G C AC103965.1 downstream -‐0.37 9360 family1 1805,1847,4805 G/C,G/G,G/C 1805,4805 1 chr15 75966884 75966885 G A AC105020.1 upstream -‐0.74 8906 family1 1805,1847,4805 G/A,G/G,G/A 1805,4805 1 chr15 84945184 84945185 C T AC136704.1 upstream 0.01 9370 family1 1805,1847,4805 C/T,C/C,C/T 1805,4805 1 chr15 84945511 84945512 C T AC136704.1 upstream 0.74 9371 family1 1805,1847,4805 C/T,C/C,C/T 1805,4805 1 chr15 84949950 84949951 C G AC136704.1 mature_miRNA -‐0.43 9378 family1 1805,1847,4805 C/G,C/C,C/G 1805,4805 1 chr15 89398552 89398553 C A ACAN non_syn_coding 2.57 9519 family1 1805,1847,4805 C/A,C/C,C/A 1805,4805 1 chr15 100649247 100649248 G A ADAMTS17 synonymous_coding 1.51 9856 family1 1805,1847,4805 G/A,G/G,G/A 1805,4805 1 chr15 100692844 100692845 A G ADAMTS17 non_syn_coding -‐1.16 9858 family1 1805,1847,4805 A/G,A/A,A/G 1805,4805 1
5
Time to start filtering
$ gemini autosomal_dominant \ --columns "chrom, start, end, ref, alt, gene, impact, cadd_raw" \ trio.trim.vep.dominant.db \ | wc -l
Note: copy and paste the full command from the Github Gist
1515 candidates
6
Again, use the -‐-‐filter option
$ gemini autosomal_dominant \ --columns "chrom, start, end, ref, alt, gene, impact, cadd_raw" \ --filter "filter is NULL" \ trio.trim.vep.dominant.db \ | wc -l
Note: copy and paste the full command from the Github Gist
1288 candidates
Exclude variants that failed GATK filters
7
Further restrict variants with functional consequence
$ gemini autosomal_dominant \ --columns "chrom, start, end, ref, alt, gene, impact, cadd_raw" \ --filter "filter is NULL and impact_severity != 'LOW'" \ trio.trim.vep.dominant.db \ | wc -l
Note: copy and paste the full command from the Github Gist
347 candidates
8
$ gemini autosomal_dominant \ --columns "chrom, start, end, ref, alt, gene, impact, cadd_raw" \ --filter "filter is NULL and impact_severity != 'LOW' and (aaf_esp_ea <= 0.01 or aaf_esp_ea is NULL) and (aaf_exac_all <= 0.01 or aaf_exac_all is NULL)" trio.trim.vep.dominant.db \ wc -l
Use ESP and ExAC to focus on rare variants
Note: copy and paste the full command from the Github Gist
40 candidates
9
$ gemini autosomal_dominant \ --columns "chrom, start, end, ref, alt, gene, impact, cadd_raw" \ --filter "filter is NULL and impact_severity == 'HIGH' and (aaf_esp_ea <= 0.01 or aaf_esp_ea is NULL) and (aaf_exac_all <= 0.01 or aaf_exac_all is NULL)" trio.trim.vep.dominant.db \ wc -l
Let’s be more strict with the functional consequence
Note: copy and paste the full command from the Github Gist
chrom start end ref alt gene impact cadd_raw variant_id family_id family_members family_genotypes samples family_count chr17 28778816 28778817 T C CPD stop_loss 1.09 11860 family1 1805,1847,4805 T/C,T/T,T/C 1805,4805 1 chr2 21236250 21236251 G A APOB stop_gain 7.82 492 family1 1805,1847,4805 G/A,G/G,G/A 1805,4805 1 chr22 21363743 21363744 T C TUBA3FP splice_acceptor 2.46 15272 family1 1805,1847,4805 T/C,T/T,T/C 1805,4805 1
10
Are any of these variants known to underlie a clinical phenotype?
chrom start end ref alt gene impact cadd_raw variant_id family_id family_members family_genotypes samples family_count chr17 28778816 28778817 T C CPD stop_loss 1.09 11860 family1 1805,1847,4805 T/C,T/T,T/C 1805,4805 1 chr2 21236250 21236251 G A APOB stop_gain 7.82 492 family1 1805,1847,4805 G/A,G/G,G/A 1805,4805 1 chr22 21363743 21363744 T C TUBA3FP splice_acceptor 2.46 15272 family1 1805,1847,4805 T/C,T/T,T/C 1805,4805 1
How could you extend the query from the previous slide to answer this? Hint: use the gemini documentation
11
Custom analyses: Use the query module to
identify autosomal dominant variants
12
Using the -‐-‐gt-‐filter
We need to use the query module to enforce an autosomal dominant inheritance pattern.
13
Using the -‐-‐gt-‐filter
Note: copy and paste the full command from the Github Gist
$ gemini query \ -q "SELECT chrom, start, end, ref, alt, gene, impact, (gts).(*) \ FROM variants" \ --header \ --gt-filter "gt_types.4805 == HET \ and gt_types.1805 == HET \ and gt_types.1847 == HOM_REF" \ trio.trim.vep.dominant.db \ | head \ | column -t
chrom start end ref alt gene impact gts.1805 gts.1847 gts.4805 chr2 229933 229934 A G SH3YL1 intron A/G A/A A/G chr2 277249 277250 G A ACP1 downstream G/A G/G G/A chr2 675830 675831 G T TMEM18 intron G/T G/G G/T chr2 905441 905442 A T AC113607.2 upstream A/T A/A A/T chr2 1636903 1636904 C T PXDN UTR_3_prime C/T C/C C/T chr2 1642789 1642790 T C PXDN intron T/C T/T T/C chr2 3653841 3653842 C A COLEC11 intron C/A C/C C/A chr2 3660868 3660869 G A COLEC11 intron G/A G/G G/A chr2 3661001 3661002 G A COLEC11 intron G/A G/G G/A
14
What about large pedigrees or multiple families? --gt-‐filter “wildcards”
15
Using the --gt-‐filter “wildcards”
Note: copy and paste the full command from the Github Gist
$ gemini query \ -q "SELECT chrom, start, end, ref, alt, gene, impact, (gts).(*) \ FROM variants" \ --header \ --gt-filter "(gt_types).(phenotype==2).(==HET).(all) \ and (gt_types).(phenotype==1).(==HOM_REF).(all)” \ trio.trim.vep.dominant.db \ | head \ | column -t
Affected individuals must be HETUnffected individuals must be HOM_REF
chrom start end ref alt gene impact gts.1805 gts.1847 gts.4805 chr2 229933 229934 A G SH3YL1 intron A/G A/A A/G chr2 277249 277250 G A ACP1 downstream G/A G/G G/A chr2 675830 675831 G T TMEM18 intron G/T G/G G/T chr2 905441 905442 A T AC113607.2 upstream A/T A/A A/T chr2 1636903 1636904 C T PXDN UTR_3_prime C/T C/C C/T chr2 1642789 1642790 T C PXDN intron T/C T/T T/C chr2 3653841 3653842 C A COLEC11 intron C/A C/C C/A chr2 3660868 3660869 G A COLEC11 intron G/A G/G G/A chr2 3661001 3661002 G A COLEC11 intron G/A G/G G/A16
Can apply multiple --gt-‐filter “wildcards”
Note: copy and paste the full command from the Github Gist
$ gemini query \ -q "SELECT chrom, start, end, ref, alt, gene, impact, \ (gts).(*), (gt_depths).(*) \ FROM variants" \ --header \ --gt-filter "(gt_types).(phenotype==2).(==HET).(all) \ and (gt_types).(phenotype==1).(==HOM_REF).(all) \ and (gt_depths).(*).(>=20).(all)" \ trio.trim.vep.dominant.db \ | head \ | column -t
Affected individuals must be HETUnaffected individuals must be HOM_REF
chrom start end ref alt gene impact gts.1805 gts.1847 gts.4805 gt_depths.1805 gt_depths.1847 gt_depths.4805 chr2 229933 229934 A G SH3YL1 intron A/G A/A A/G 161 225 231 chr2 277249 277250 G A ACP1 downstream G/A G/G G/A 183 237 234 chr2 905441 905442 A T AC113607.2 upstream A/T A/A A/T 90 142 246 chr2 1636903 1636904 C T PXDN UTR_3_prime C/T C/C C/T 54 87 174 chr2 1642789 1642790 T C PXDN intron T/C T/T T/C 74 120 179 chr2 3653841 3653842 C A COLEC11 intron C/A C/C C/A 123 223 199 chr2 3660868 3660869 G A COLEC11 intron G/A G/G G/A 76 96 250 chr2 3661001 3661002 G A COLEC11 intron G/A G/G G/A 70 86 182 chr2 6879884 6879885 T A LINC00487 intron T/A T/T T/A 83 110 106
Everyone must have sequence depth >=20
17
We have the inheritance model, now apply the annotation filters
Note: copy and paste the full command from the Github Gist
$ gemini query \ -q "SELECT chrom, start, end, ref, alt, gene, impact, \ (gts).(*), (gt_depths).(*) \ FROM variants \ WHERE filter is NULL and impact_severity == 'HIGH' and (aaf_esp_ea <= 0.01 or aaf_esp_ea is NULL) and (aaf_exac_all <= 0.01 or aaf_exac_all is NULL)" \ --header \ --gt-filter "(gt_types).(phenotype==2).(==HET).(all) \ and (gt_types).(phenotype==1).(==HOM_REF).(all) \ and (gt_depths).(*).(>=20).(all)" \ trio.trim.vep.dominant.db \ | column -t
chrom start end ref alt gene impact gts.1805 gts.1847 gts.4805 gt_depths.1805 gt_depths.1847 gt_depths.4805 chr2 21236250 21236251 G A APOB stop_gain G/A G/G G/A 46 72 112 chr17 28778816 28778817 T C CPD stop_loss T/C T/T T/C 144 171 207 chr22 21363743 21363744 T C TUBA3FP splice_acceptor T/C T/T T/C 36 56 241
Note that (pleasingly) these are the same three candidates as detected with the autosomal_dominant tool.
18
Load the following files into IGV (Load from URL) and inspect your candidates
BAM alignment files: !https://s3.amazonaws.com/gemini-‐tutorials/1805.workshop.bam https://s3.amazonaws.com/gemini-‐tutorials/1847.workshop.bam https://s3.amazonaws.com/gemini-‐tutorials/4805.workshop.bam
VCF variant file: !https://s3.amazonaws.com/gemini-‐tutorials/trio.trim.vep.vcf.gz
!
19