
2012 hpcuserforum talk

Description:
My talk at the Dearborn, MI HPC User Forum (hpcuserforum.com) on Sep 18, 2012.


Transcript
1. Why HPCs hate biologists (and what we're doing about it)
C. Titus Brown, Assistant Professor, CSE, MMG, BEACON, Michigan State University
September 2012
[email protected]

2. Outline
The basic problem(s): resequencing and assembly.
Data processing & data flow.
Compression approaches.
Some thoughts for the future.

3. Shotgun genomics
Collect samples; extract DNA; feed into sequencer; computationally analyze. (Wikipedia: Environmental shotgun sequencing.)

4. Analogy: shredding books
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
=> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
...but for lots and lots of fragments!

5. Sequencers also produce errors
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
=> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
(A toy Python simulation of this shredding-with-errors appears after slide 16.)

6. Three basic problems
Resequencing, counting, and assembly.

7. 1. Resequencing analysis
We know a reference genome, and want to find variants (blue) in a background of errors (red).

8. 2. Counting
We have a reference genome (or gene set) and want to know how much of it we have. Think gene expression / microarrays.

9. 3. Assembly
We don't have a genome or any reference, and we want to construct one. (This is how all new genomes are sequenced.)

10. Noisy observations => information
(The same error-riddled fragments as slide 5, resolving to the clean original sentence.)

11. The scale of the problem is stunning.
I estimate a worldwide capacity for DNA sequencing of 15 petabases/yr (it's probably larger). Individual labs can generate ~100 Gbp in ~1 week for $10k. This sequencing is at a boutique level:
- Sequencing formats are semi-standard.
- Basic analysis approaches are ~80% cookbook.
- Every biological prep, problem, and analysis is different.
Traditionally, biologists receive no training in computation, and our computational infrastructure is optimized for high performance computing, not ...

12. Three types of data scientists (Bob Grossman, U. Chicago, at XLDB)
1. Your data gathering rate is slower than Moore's Law.
2. Your data gathering rate matches Moore's Law.
3. Your data gathering rate exceeds Moore's Law.

13. http://www.genome.gov/sequencingcosts/

14. Three types of data scientists.
1. Your data gathering rate is slower than Moore's Law => be lazy; all will work out.
2. Your data gathering rate matches Moore's Law => you need to write good software, but all will work out.
3. Your data gathering rate exceeds Moore's Law => you need serious help.

15. A few use cases
1. Real-time pathogen analysis
2. Cancer genome analysis => diagnosis & treatment regimen
3. Evolution of drug resistance in HIV
4. Gene expression analysis in agricultural animals
5. Examining microbial community change in response to agriculture and/or global climate change
6. Gene discovery & genome sequencing in non-model organisms

16. Three basic problems
Resequencing, counting, and assembly.
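A minimal Python sketch of the shredding analogy from slides 4-5: sample overlapping fragments ("reads") from a known sentence and occasionally inject a wrong character, as a sequencer would. The function name and all parameters here are illustrative, not from the talk.

```python
import random
import string

TEXT = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness")

def shotgun(text, n_reads=12, read_len=40, error_rate=0.02):
    """Sample overlapping fragments ("reads") and inject single-character errors."""
    reads = []
    for _ in range(n_reads):
        start = random.randrange(len(text) - read_len + 1)
        read = list(text[start:start + read_len])
        for i in range(read_len):
            if random.random() < error_rate:
                read[i] = random.choice(string.ascii_uppercase)  # a "sequencing error"
        reads.append("".join(read))
    return reads

if __name__ == "__main__":
    for read in shotgun(TEXT):
        print(read)
```

Running it a few times makes the problem concrete: with enough overlapping reads the original sentence is recoverable, and the occasional wrong letters are exactly what resequencing and assembly must see past.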
17. [Diagram: raw data (~10-100 GB) => analysis, requiring ~2 GB to 2 TB of single-chassis RAM => several "information" products (~1 GB) => database & integration.]
A few interesting computational challenges:
- How do we store raw data for (1) provenance and (2) reanalysis?
- What kind of infrastructure (hardware/software) is the right approach?
- Can we reduce the cost of analysis?

18. A few use cases
1. Real-time pathogen analysis
2. Cancer genome analysis => diagnosis & treatment regimen
3. Evolution of drug resistance in HIV
4. Gene expression analysis in agricultural animals
5. Examining microbial community change in response to agriculture and/or global climate change
6. Gene discovery & genome sequencing in non-model organisms

19. What kind of approaches?
Hardware solutions may not be appropriate. Everyone in biology/biomedical informatics has, or will have, these data sets; we need commodity solutions (=> ~cloud?), but current commodity hardware is optimized for processing power, while memory and I/O are expensive. Are there algorithmic approaches with which we can apply leverage?

20. [Diagram: raw data (~10-100 GB) => compression (~2 GB) => analysis => "information" (~1 GB) => database & integration.]
A software & algorithms approach: can we develop lossy compression approaches that
1. reduce data size & remove errors => efficient processing?
2. retain all information? (Think JPEG.)
If so, then we can store only the compressed data for later reanalysis.
Short answer is: yes, we can.

21. My research, driven by my problems
Est. ~50 Tbp to comprehensively sample the microbial composition of a gram of soil. Currently we have approximately 2 Tbp spread across 9 soil samples for one project, and 1 Tbp across 10 samples for another. We need 3 TB of RAM on a single chassis to assemble 300 Gbp; I estimate 500 TB of RAM for 50 Tbp of sequence. As it turns out, if we can solve that problem, we can solve the rest.

22. My lab at MSU: theoretical => applied solutions.
Theoretical advances in data structures and algorithms => practically useful & usable implementations, at scale => demonstrated effectiveness on real data.

23. 1. CountMin Sketch
To add an element: increment the associated counter at all hash locales.
To get a count: retrieve the minimum counter across all hash locales.
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
(A Python sketch of these two operations appears after slide 28.)

24. 2. Online, streaming, lossy compression (NOVEL)
Transcriptomes, microbial genomes (incl. MDA), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results. Core algorithm is single pass, low memory. (A rough sketch of the single-pass idea also appears after slide 28.)

25. Streaming Twitter analysis.

26. 3. Compressible de Bruijn graphs (NOVEL)
[Figure: graph renderings at 1%, 5%, 10%, and 15%.]
Jason Pell & Arend Hintze

27. Concluding thoughts
Our approaches provide significant and substantial practical and theoretical leverage on some really challenging problems. They provide a path to the future: many-core compatible; distributable? Decreased memory footprint => cloud computing can be used for many analyses. They are in use: ~dozens of labs are using digital normalization, although we're still in the process of publishing them.

28. There is nothing up my sleeves.
Everything discussed here:
Code: github.com/ged-lab/ ; BSD license
Blog: http://ivory.idyll.org/blog ("titus brown blog")
Twitter: @ctitusbrown
Grants on lab web site: http://ged.msu.edu/interests.html
Preprints: on arXiv, q-bio (search: diginorm arxiv)
...and I welcome e-mails: [email protected]
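Slide 23 describes exactly two operations, and they fit in a few lines of Python. This is a minimal sketch, not the lab's implementation; the table count, table size, and salted-MD5 hashing are illustrative choices.

```python
import hashlib

class CountMinSketch:
    """Approximate counter: may overestimate (hash collisions), never underestimates."""

    def __init__(self, num_tables=4, table_size=10000):
        self.tables = [[0] * table_size for _ in range(num_tables)]

    def _locales(self, item):
        # One hash locale per table, from a per-table salted digest (illustrative).
        for i, table in enumerate(self.tables):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield table, int(digest, 16) % len(table)

    def add(self, item):
        # To add an element: increment the associated counter at all hash locales.
        for table, idx in self._locales(item):
            table[idx] += 1

    def count(self, item):
        # To get a count: retrieve the minimum counter across all hash locales.
        return min(table[idx] for table, idx in self._locales(item))

sketch = CountMinSketch()
for kmer in ["ATGGC", "ATGGC", "TTGAC"]:
    sketch.add(kmer)
print(sketch.count("ATGGC"))  # 2, barring collisions
```

Memory is fixed up front no matter how many distinct k-mers stream through, which is what makes this structure attractive for the counting problem of slide 8.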
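Slide 24's single-pass, low-memory algorithm is the digital normalization mentioned in slide 27. A rough sketch of the idea, under stated assumptions: an exact Counter stands in for the probabilistic counting the real code uses, and k, the coverage cutoff, and all names are illustrative.

```python
from collections import Counter

def kmers(read, k):
    """All overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def median(values):
    s = sorted(values)
    return s[len(s) // 2]  # upper median; good enough for a sketch

def diginorm(reads, k=5, cutoff=3):
    """Single pass: keep a read only if its region isn't already well covered."""
    counts = Counter()  # Counter returns 0 for unseen k-mers
    kept = []
    for read in reads:
        km = kmers(read, k)
        if km and median(counts[x] for x in km) < cutoff:
            kept.append(read)
            counts.update(km)  # count only k-mers from reads we keep
    return kept
```

Because reads covering already-saturated regions are discarded as they stream past, the retained data set stays small while still containing what assembly needs, which is the sense in which slide 20's lossy compression retains the information.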
29. Thank you for the invitation!
Acknowledgements
Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Rosangela Canino-Koning, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald.
Collaborators: Jim Tiedje, MSU; Billie Swalla, UW; Janet Jansson, LBNL; Susannah Tringe, JGI.
Funding: USDA NIFA; NSF IOS; BEACON.

