Date post: | 16-Apr-2017 |
Category: |
Science |
Upload: | lex-nederbragt |
Coding & Best Practice in Programming
Why it matters so much in the NGS era
Lex Nederbragt
Norwegian Sequencing Centre and
Centre for Evolutionary and Ecological [email protected]
@lexnederbragt
OK
Who am I
@lexnederbragt flxlexblog.wordpress.com
How I became a bioinformatician
2007: a grant
GS FLX from Roche/454
Genome Analyzer from Solexa/Illumina
?
Let’s try them out!
Specimen
• Planktothrix rubescens NIVA CYA 98
• Cyanobacteria
• (blue-green algae)
Planktothrix
Half a million reads, average length 260 nt
10 million reads, 33 nucleotides each
Perl
Planktothrix
Newbler SHARCGS
Assembly
Half a million reads, average length 260 nt
10 million reads, 33 nucleotides each
Atlantic cod genome project
850 million bases (Mbp)
‘Wild-caught’
GS FLX from Roche/454
Atlantic cod genome project phase 1
Cod genome project phase 2
From Wikimedia commons, user Sagar Joshi
In summary
From flickr, user lesterpubliclibrary
Challenges in the next-generation sequencing era
High-throughput sequencing
Phase 1: more is better
Phase 2: smaller is better
Phase 3: single-molecule
Phase 4: nanopores
Democratization of sequencing
MinION
512 nanopores, 150 Mb/hour
Up to 6 hours, $900
Sequencing cost
Thanks to Matt Clark (TGAC), modified from http://bit.ly/1iiajcS
[Figure: sequencing cost, annotated with platforms: 454 & polony, Solexa & SOLiD, GAII, HiSeq, HiSeq X Ten]
End of the gold rush?
More more more
Data Software
Mathias Bigge, Ricordisamoa, others (wikimedia commons)
TCTCCTAACAACCCCCcACACACACACACTGGTACTGATGCCATTCTGCTTTACACCTATACACATCATATACATtATACACACACACACACACACACAACACTCTCCTAACCCACACACACTGGTACAGATGCCAGTCTGCTTAACACCTACGCACGTATTATACACACACACACACAACGCTCTCCTAACCCACACACACACCAGTCTGCTTTAAACCTACACACATATTATACAAACGAGTTGGTGACGTAAGGTTGATAAGGGATATTGGTAAGGGTTAAGGGTAGGGTTGGTGTTAGGGGCAAGGGTTAGGGTTAGTGTAAGGGGTAAGGGTTAGTGTAaGGAGTAAGGGTTAGTGTAAGGGGTTAGTGTTATTGTAAGGGGCTAGTGTTAGTGTTAGTGTTCAGGGTTAGTGTTAGGGGTAGGGTTAATgTTTAGGGTAATGTTTAGGGTTAGGGGTATGGGTTAGTGCTAGGGGTCAGGGTTAGTGTTAGGGTTAGACAACCCACCTGAGAGAACCAGTGCGATGCCGCCGCAGGCGTTGGGCGAGGACATGGAGGTGCCGTTCATCAGCTGGGTCCCCCGGAGGGTCCAGTTGGGGACGGAGGCGATGGCTCCCCCCGGAGCGCTGATGCTGACCCCCAGGGCGCCGTCGATGCTGGGTCCCCGAGACGACCAGGTGTACTGGTTGGCCGGGAGCTTCTCCCTCAGGGAGTACTCCGCCACCATCATGTCGGGGGTCACGTAGGCCCCAACCCCTGGGGACAGACGGAGCGCGTTACACACCTCAACCCCTTACCCTCGGAGCCTACATAACCCAACCCTCTGGAGACGGCAATGCTTGCATAGTCAGAAATAGaGCTGACCGATTCATCAAATTCAAACGTCATCGCTATATAATAGCGGGgTTTGATTTGCCATTTGCAAATTGCAAAGGCTGCAATgtttttttttttt
Software
Constant stream of new software
http://wwwdev.ebi.ac.uk/fg/hts_mappers
88 short-read mappers
Software
Constant stream of new software
http://neidetcher.com/ubuntu_package_dependency.html
Installation
Judging quality
Wikimedia commons, user Thebestofall007
Do we need to be worried?
Self-taught bioinformaticians
TCTCCTAACAACCCCCcACACACACACACTGGTACTGATGCCATTCTGCTTTACACCTATACACATCATATACATtATACACACACACACACACACACAACACTCTCCTAACCCACACACACTGGTACAGATGCCAGTCTGCTTAACACCTACGCACGTATTATACACACACACACACAACGCTCTCCTAACCCACACACACACCAGTCTGCTTTAAACCTACACACATATTATACAAACGAGTTGGTGACGTAAGGTTGATAAGGGATATTGGTAAGGGTTAAGGGTAGGGTTGGTGTTAGGGGCAAGGGTTAGGGTTAGTGTAAGGGGTAAGGGTTAGTGTAaGGAGTAAGGGTTAGTGTAAGGGGTTAGTGTTATTGTAAGGGGCTAGTGTTAGTGTTAGTGTTCAGGGTTAGTGTTAGGGGTAGGGTTAATgTTTAGGGTAATGTTTAGGGTTAGGGGTATGGGTTAGTGCTAGGGGTCAGGGTTAGTGTTAGGGTTAGACAACCCACCTGAGAGAACCAGTGCGATGCCGCCGCAGGCGTTGGGCGAGGACATGGAGGTGCCGTTCATCAGCTGGGTCCCCCGGAGGGTCCAGTTGGGGACGGAGGCGATGGCTCCCCCCGGAGCGCTGATGCTGACCCCCAGGGCGCCGTCGATGCTGGGTCCCCGAGACGACCAGGTGTACTGGTTGGCCGGGAGCTTCTCCCTCAGGGAGTACTCCGCCACCATCATGTCGGGGGTCACGTAGGCCCCAACCCCTGGGGACAGACGGAGCGCGTTACACACCTCAACCCCTTACCCTCGGAGCCTACATAACCCAACCCTCTGGAGACGGCAATGCTTGCATAGTCAGAAATAGaGCTGACCGATTCATCAAATTCAAACGTCATCGCTATATAATAGCGGGgTTTGATTTGCCATTTGCAAATTGCAAAGGCTGCAATgtttttttttttt
lots of data
lots of software
recipe for disaster?
Correctness of results
http://www.it.bton.ac.uk/staff/je/java/jewl/tutorial/tutorial.html
Reproducibility
doi:10.1038/sj.embor.7401143
A reproducibility crisis?
Reproducibility and reusability
http://upload.wikimedia.org/wikipedia/commons/4/48/Recycle.jpg
What it boils down to
TRUST
My (given) title
Coding & Best Practice in Programming
Why it matters so much in the NGS era
Why it matters so much in science
Next-generation sequencing specific?
Diagnostic sequencing
Wikimedia commons, user Bill Branson
Diagnostic sequencing
Solutions
Flickr: http://farm4.staticflickr.com/3319/3265787219_bfbc654b5e_o.jpg Wikimedia commons
Best practices
10.1371/journal.pbio.1001745
Best practices
Automate repetitive tasks
Wikimedia commons, user Pzucchel
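As a minimal sketch of automating a repetitive task, the Python below counts the records in every FASTA file of a directory in one call, instead of inspecting each file by hand. The function names, the `*.fasta` pattern and the directory layout are assumptions made for illustration; they are not taken from the talk.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count sequence records in one FASTA file (header lines start with '>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_records_in_directory(directory, pattern="*.fasta"):
    """Apply the same counting step to every matching file.

    Returns a dict mapping file name to record count.
    """
    return {
        fasta.name: count_fasta_records(fasta)
        for fasta in sorted(Path(directory).glob(pattern))
    }
```

Calling `count_records_in_directory("reads/")` (a hypothetical directory) would then replace one manual check per file with a single, repeatable step.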
Best practices
Coding styles, variable naming etc
def test_seq(seq):          # unclear name: tests what?
def sequence_is_DNA(seq):   # descriptive name: says what it checks
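The naming contrast above can be made concrete with a short sketch. The body below is an assumption about what such a function might do (accepting A, C, G, T and the ambiguity code N, in either case); the slide only shows the name.

```python
def sequence_is_DNA(sequence):
    """Return True if the sequence is non-empty and uses only A, C, G, T or N.

    Accepting 'N' and lower-case letters is an illustrative assumption.
    """
    allowed_bases = set("ACGTN")
    return len(sequence) > 0 and set(sequence.upper()) <= allowed_bases
```

With a name like this, a call such as `if sequence_is_DNA(read):` reads almost like a sentence, which is the point of the slide.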
Best practices
Use version control
https://www.atlassian.com/git/workflows
Best practices
From my own work:
$ cd scripts
$ ls
blat_parse4.pl  old_versions  snps_flanks_2_fastq.pl
$ ls old_versions/
blat_parse2.pl  blat_parse_attemp1.pl  blat_parse.pl.bak  blat_parse.pl  blat_parse3_backup.pl  blat_parse3.pl
Best practices
test, test, test
def test_zero():
    assert run_the_function(0) == 0

assert x > 0, "cannot handle negative numbers"
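The two ideas on this slide, small test functions and defensive assertions, can be combined in one runnable sketch. `square_root` is a hypothetical stand-in for `run_the_function`, chosen only so that the example is self-contained; the assertion here uses `>=` so that zero remains valid input.

```python
import math


def square_root(x):
    """Return the square root of a non-negative number."""
    # Defensive assertion: fail early, with a clear message, on bad input.
    assert x >= 0, "cannot handle negative numbers"
    return math.sqrt(x)


def test_zero():
    assert square_root(0) == 0


def test_known_value():
    assert square_root(9) == 3


def test_negative_input_is_rejected():
    try:
        square_root(-1)
    except AssertionError:
        return  # the function refused the bad input, as intended
    raise RuntimeError("expected an AssertionError for negative input")
```

Functions named `test_...` like these can be called directly or picked up by a test runner such as pytest.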
Best practices
Document well
Best practices
Collaborate
http://howdoitradestocks.com/wp-content/uploads/2011/12/share-ideas1.jpg
khmer, a 'case study'
khmer
Crusoe et al. doi: 10.6084/m9.figshare.979190
Michael Crusoe
Titus Brown
khmer
https://github.com/ged-lab/2013-paper-ssspe
khmer
Integrated code coverage analysis
The “GitHub Flow” model of code review
Semantic versioning
Continuous integration
Integration and acceptance testing
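Of the practices listed above, semantic versioning is the easiest to show in a few lines. The sketch below assumes plain `MAJOR.MINOR.PATCH` strings and ignores pre-release and build suffixes; the function names are made up for illustration and are not part of khmer.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible_upgrade(installed, candidate):
    """True if candidate has the same major version and is not older.

    Under semantic versioning, a change in the major number signals an
    incompatible change, so only same-major upgrades are accepted here.
    """
    old, new = parse_version(installed), parse_version(candidate)
    return new[0] == old[0] and new >= old
```

Comparing integer tuples rather than strings matters: as text, "1.10.0" sorts before "1.2.3", while the parsed tuples order correctly.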
Beyond best coding practices
Benchmarks
http://assemblathon.org/
Benchmarks
http://www.genome.org/cgi/doi/10.1101/gr.131383.111
Benchmarks
http://www.genomeinabottle.org/
~8300 vials of 10 µg DNA for NA12878
(Assembly) validation
Assembly
doi:10.1186/1471-2105-15-126
Reproducibility ‘platforms’
usegalaxy.org
taverna.org.uk/
pythonhosted.org/Sumatra/
Action points
Attend a Software Carpentry Boot Camp
http://software-carpentry.org/
Action points
Look for signs of best practice
Action points
Look for signs of best practice
during peer review
nature.com
Action points
Benchmarking/validation
Action points
Develop (under)graduate curriculum
My goal today
Flickr: http://farm4.staticflickr.com/3319/3265787219_bfbc654b5e_o.jpg