Date post: | 01-Jul-2015 |
Upload: | skills-matter |
MongoDB and academia
Jan Aerts, PhD
Wellcome Trust Sanger Institute
Hinxton, UK
[email protected]
@jandot
Disclaimer 1
Disclaimer 2
Acknowledgments
MongoDB community
Caren Brockington
10gen
transcriptomics
genomics
proteomics
*omics
instantiationomics
metabolomics
spliceomics
interactomics
metallomics
lipidomics
orfeomics
phenomics
histomics
Academia != industry
heterogeneous systems
transitory
little optimization
slow adoption of new technology
(don't break anything that works)
data management = afterthought
money
Who are the players?
large genome/data centers
genome hackers (lone bioinformaticians)
bench-based scientists
Drawings by Morag Ann Lewis
large genome/data centers
* heavy investment in infrastructure/pipelines
* data exchange => standards!
genome hackers (lone bioinformaticians)
* little investment in infrastructure
* little time/effort for optimization
* one-off
* getting it done
* creating legacy
* need IT support for heavier work
* often self-taught
bench-based scientists
* use whatever everyone else is using
* "normalization?"
The data landscape
1. Flat text files
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  PUBMED    7871890
FEATURES             Location/Qualifiers
     gene            687..3158
                     /gene="AXL2"
     gene            complement(3300..4037)
                     /gene="REV7"
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
      301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa
      361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat
      421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctc...
//
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) ...
1. Flat text files
##format=PCFv1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
#CHROM  POS      ID  REF  ALT  QUAL      FILTER  INFO                            FORMAT       NA00001
1       967433   .   G    A    151.43    0       AB=0.42;AC=1                    GT:DP:GQ     1/0:11:99.00
1       970323   .   G    A    492.61    0       AB=0.41;AC=1;AF=0.50            GT:DP:GQ     1/0:28:99.00
1       970950   .   A    G    1287.90   0       AB=0.55;AC=1;AF=0.50            GT:DP:GQ     0/1:108:99.00
1       972804   .   T    C    210.56    0       AB=0.53;AC=1;AF=0.50            GT:DP:GQ     1/0:13:99.00
1       972857   .   T    C    846.18    0       AB=0.53;AC=1;AF=0.50;AN=2       GT:DP:GQ     1/0:58:99.00
1       974165   .   T    C    810.47    0       AB=0.38;AC=1;AF=0.50;AN=2       GT:DP:GQ     1/0:6:67.05
1       977063   .   C    T    1110.31   0       AB=0.50;AC=1;AF=0.50;AN=2       GT:DP:GQ     0/1:67:99.00
1       1006892  .   C    G    62.39     SF      AC=2;AF=1.00;AN=2               GT:DP:GQ     1/1:2:6.02
1       1148494  .   A    G    5237.88   0       AC=2;AF=1.00;AN=2               GT:DP:GQ     1/1:160:99.00
1       1149380  .   T    C    165.10    0       AC=2;AF=1.00;AN=2               GT:DP:GQ     1/1:6:18.05
1       1212553  .   C    T    426.61    0       AB=0.26;AC=1;AF=0.50;AN=2       GT:DP:GQ     0/1:18:99.00
1       1235867  .   A    G    1158.08   0       AC=2;AF=1.00;AN=2               GT:DP:GQ     1/1:30:90.28
1       1237357  .   T    C    142.01    0       AC=2;AF=1.00;AN=2               GT:DP:GQ     1/1:5:15.04
1       1239050  .   G    A    13952.03  0       AC=2;AF=1.00;AN=2               GT:DP:GQ     1/1:340:99.00
20      14370    .   G    A    29        0       NS=58;DP=258;AF=0.786           GT:GQ:DP:HQ  0|0:48:1:51,51
20      13330    .   T    A    3         q10     NS=55;DP=202;AF=0.024           GT:GQ:DP:HQ  0|0:49:3:58,50
20      1110696  .   A    G,T  67        0       AF=0.421,0.579;AA=T;DB          GT:GQ:DP:HQ  1|2:21:6:23,27
20      10237    .   T    .    47        0       NS=57;DP=257;AA=T               GT:GQ:DP:HQ  0|0:54:7:56,60
...
perl
java
python
ruby
“tab-delimited” is king
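A tab-delimited row often hides further structure inside single columns. As a minimal sketch (not taken from the talk), the Python below splits one of the variant rows shown earlier and unpacks its semicolon-separated INFO column and colon-separated genotype column into nested dictionaries. The function names `parse_info` and `parse_sample` and the lower-case keys of `record` are assumptions made for illustration.

```python
def parse_info(info):
    """Turn 'AF=0.421,0.579;AA=T;DB' into a dict; bare flags become True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_sample(format_spec, sample):
    """Pair the keys in the FORMAT column with the values of one sample column."""
    return dict(zip(format_spec.split(":"), sample.split(":")))


# One data row from the example file, with real tab separators restored.
line = "20\t1110696\t.\tA\tG,T\t67\t0\tAF=0.421,0.579;AA=T;DB\tGT:GQ:DP:HQ\t1|2:21:6:23,27"
chrom, pos, var_id, ref, alt, qual, flt, info, fmt, sample = line.split("\t")

record = {
    "chrom": chrom,
    "pos": int(pos),
    "ref": ref,
    "alt": alt.split(","),  # several alternative alleles are comma-separated
    "qual": float(qual),
    "info": parse_info(info),
    "sample": parse_sample(fmt, sample),
}
print(record)
```

The nested result is already close in shape to a JSON-style document, which is one reason such files map fairly naturally onto a document store.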
2. Binary compressed flat files
One experiment
=> One datafile as text: 40-70 GB
=> Compressed to 11-20 GB
Toolkits to access data (and generate tab-delimited)
C
java
3. MySQL and Oracle
Curated data
Meta-data
Raw data: BLOBs
Sequencing:
>6 TB/week and growing…
Departmental project:
40 individuals x 42 million datapoints/individual
=> joins?
Denormalized copy
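To make the scale concrete: 40 individuals times 42 million datapoints is about 1.68 billion rows on one side of any join. The Python sketch below works through that arithmetic and then shows what a "denormalized copy" could look like, with the individual's details copied into every datapoint so that no join is needed at query time. The table contents, field names and the `denormalize` function are hypothetical and only illustrate the idea.

```python
individuals = 40
datapoints_per_individual = 42_000_000
total_rows = individuals * datapoints_per_individual
print(f"{total_rows:,} datapoint rows")  # 1,680,000,000 datapoint rows

# Hypothetical normalized tables: one row per individual, one row per datapoint.
individual_rows = {
    1: {"name": "individual_01", "cohort": "example"},
}
datapoint_rows = [
    {"individual_id": 1, "chrom": "1", "pos": 967433, "genotype": "1/0"},
    {"individual_id": 1, "chrom": "1", "pos": 970323, "genotype": "1/0"},
]


def denormalize(datapoint_rows, individual_rows):
    """Copy the matching individual into each datapoint instead of joining later."""
    for row in datapoint_rows:
        doc = {key: value for key, value in row.items() if key != "individual_id"}
        doc["individual"] = dict(individual_rows[row["individual_id"]])
        yield doc


denormalized = list(denormalize(datapoint_rows, individual_rows))
print(denormalized[0])
```

The trade-off is the usual one: reads become simple lookups, while the copied fields take extra space and must be rewritten everywhere if an individual's details change.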
4. AceDB - A Caenorhabditis elegans database
object-oriented
Author "Patel B"
Full_name "Bala Patel"
Laboratory CB
Paper [cgc1011]
Paper [cgc533]
Mail "Laboratory of Molecular Biology"
Mail "Hills Road, Cambridge"
Fax "050 3456789"

Paper [cgc533]
Title "Yet more of those Genes"
Journal "Cell Reports"
Volume 3
Year 1993
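An object written as tag/value lines like the Author record above is close in shape to a nested document. The following Python is a rough sketch, assuming the simplified layout shown on the slide (first line is class and object name, every later line is a tag followed by its value); it is not a parser for the full AceDB file format. The function name `ace_to_document` and the `class`/`name` keys are illustrative choices.

```python
import shlex


def ace_to_document(text):
    """Convert tag/value lines into a dict; repeated tags become lists."""
    lines = [line for line in text.splitlines() if line.strip()]
    class_name, object_name = shlex.split(lines[0])
    doc = {"class": class_name, "name": object_name}
    for line in lines[1:]:
        tag, *values = shlex.split(line)  # shlex keeps quoted strings together
        value = " ".join(values)
        if tag in doc:
            if not isinstance(doc[tag], list):
                doc[tag] = [doc[tag]]
            doc[tag].append(value)
        else:
            doc[tag] = value
    return doc


author_record = '''Author "Patel B"
Full_name "Bala Patel"
Laboratory CB
Paper [cgc1011]
Paper [cgc533]
Mail "Laboratory of Molecular Biology"
Mail "Hills Road, Cambridge"
Fax "050 3456789"
'''

author = ace_to_document(author_record)
print(author)
```

Repeated tags such as Paper and Mail end up as lists inside one document, where a relational schema would typically need separate tables.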
Challenges in *omics
Where can MongoDB play a role?
explosion of data
every researcher must be able to handle data
low barrier to entry into big data for bench-based scientists
Takeoff within research community?
widespread?
Cannot manage all data in-house <= data exchange!
=> focus more on file formats than on technology
smaller scale
Implement MongoDB for
* local storage and querying (load file from standard file format into custom DB)
* encourage non-informaticians to use MongoDB
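The "load file from standard file format" step can be sketched in a few lines of Python. The reader below streams a tab-delimited variant file, skips the `##` meta lines, takes column names from the `#CHROM` header and yields one dictionary per row. The function name `read_variants` and the in-memory example file are assumptions for illustration, and the query is done in plain Python so that the sketch runs without a database.

```python
import io


def read_variants(handle):
    """Yield one dict per data row of a tab-delimited variant file."""
    columns = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or meta-information line
        if line.startswith("#"):
            columns = line[1:].split("\t")  # header line names the columns
            continue
        if columns is None:
            raise ValueError("data row found before the header line")
        doc = dict(zip(columns, line.split("\t")))
        doc["POS"] = int(doc["POS"])
        doc["QUAL"] = float(doc["QUAL"])
        yield doc


# A three-row stand-in for a real file, using rows from the example above.
example_file = io.StringIO(
    "##fileDate=20090805\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\n"
    "1\t967433\t.\tG\tA\t151.43\t0\tAB=0.42;AC=1\tGT:DP:GQ\t1/0:11:99.00\n"
    "1\t970323\t.\tG\tA\t492.61\t0\tAB=0.41;AC=1;AF=0.50\tGT:DP:GQ\t1/0:28:99.00\n"
    "20\t14370\t.\tG\tA\t29\t0\tNS=58;DP=258;AF=0.786\tGT:GQ:DP:HQ\t0|0:48:1:51,51\n"
)

docs = list(read_variants(example_file))
high_quality = [d for d in docs if d["CHROM"] == "1" and d["QUAL"] > 200]
print([d["POS"] for d in high_quality])
```

With a running MongoDB server and the pymongo driver, the same documents could be stored with `collection.insert_many(docs)` and the filter expressed as `collection.find({"CHROM": "1", "QUAL": {"$gt": 200}})`; the collection itself would have to be set up separately.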