Date posted: 13-Apr-2017 | Category: Science | Uploaded by: annika-eriksson
Reproducibility: 10 Simple Rules
And more!
Sandve, Geir Kjetil, et al. "Ten simple rules for reproducible computational research." PLoS computational biology 9.10 (2013): e1003285.
Rule 1: For Every Result, Keep Track of How It Was Produced
http://xkcd.com/
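A minimal sketch of what "keeping track" can look like in practice: a hypothetical helper that appends a JSON record (command, input checksums, timestamp, interpreter version) every time a result file is produced. The function name and log layout are illustrative, not from the talk.

```python
import datetime
import hashlib
import json
import sys

def log_provenance(output_path, command, inputs, log_path="provenance.json"):
    """Append one JSON line recording how output_path was produced
    (hypothetical helper; field names are illustrative)."""
    record = {
        "output": output_path,
        "command": command,
        # Checksum each input so you can later verify it hasn't changed.
        "inputs": {p: hashlib.sha256(open(p, "rb").read()).hexdigest()
                   for p in inputs},
        "timestamp": datetime.datetime.now().isoformat(),
        "python": sys.version,
    }
    with open(log_path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
```

Appending one line per result keeps the log greppable and lets you reconstruct the full chain from raw data to figure.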
Rule 2: Avoid Manual Data Manipulation Steps
• “Stop clicking, start typing” – Matt Frost, Charlottesville, VA
• Use scripts for even small changes
• Split commonly used code off into functions/classes, and put these into libraries
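For example, a unit conversion that might otherwise be done by hand in a spreadsheet can live in a short script instead. The CSV layout and column name here are invented for illustration:

```python
import csv

def normalize_units(in_path, out_path):
    """Convert the 'weight_g' column from grams to kilograms in a script,
    instead of editing the spreadsheet by hand (column name is illustrative)."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # The one "small change" -- now recorded, repeatable, and reviewable.
            row["weight_g"] = str(float(row["weight_g"]) / 1000.0)
            writer.writerow(row)
```

Even a three-line transformation like this documents itself: rerunning it on corrected raw data takes seconds, whereas a manual edit would have to be remembered and repeated.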
Rule 3: Archive the Exact Versions of All External Programs Used
• Level 0: Note the names and versions of all packages
• Level 1: Use a package management system (packrat, anaconda/conda)
• Boss Level: Save an image of the entire system
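Level 0 can even be automated from inside Python. This sketch (assuming Python 3.8+ for `importlib.metadata`) dumps the interpreter version and every installed distribution to a JSON file; the file name and layout are arbitrary choices:

```python
import importlib.metadata
import json
import sys

def snapshot_environment(path="environment.json"):
    """Record the Python version and every installed package version
    (Level 0: note names and versions of all packages)."""
    packages = {dist.metadata["Name"]: dist.version
                for dist in importlib.metadata.distributions()}
    with open(path, "w") as fh:
        json.dump({"python": sys.version, "packages": packages}, fh, indent=2)
    return packages
```

Committing this file next to each set of results is the cheapest possible insurance; `pip freeze` or `conda env export` from the shell achieve the same thing at Level 1.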
Rule 4: Version Control All Custom Scripts
http://www.slideshare.net/sjcockell/reproducibility-the-myths-and-truths-of-pipeline-bioinformatics
• Also, version control your workflows (suggestions for good workflow management systems welcome!)
• Use the commit message to write something useful to your future self (“pwew pwew pwew” is not useful)
Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
• “Explicit is better than implicit” – Tim Peters, The Zen of Python
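One low-effort habit: funnel every intermediate result through a single helper that writes plain JSON, a standardized format any tool can read, rather than an opaque binary dump. The names below are illustrative:

```python
import json
import os

def save_intermediate(step_name, data, outdir="intermediate"):
    """Write an intermediate result as JSON -- standardized, diffable, and
    readable without the code that produced it (names are illustrative)."""
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, f"{step_name}.json")
    with open(path, "w") as fh:
        # sort_keys makes successive runs byte-for-byte comparable.
        json.dump(data, fh, indent=2, sort_keys=True)
    return path
```

Explicit, on-disk intermediates also make debugging cheaper: when a late step misbehaves, you can inspect every upstream result without rerunning the pipeline.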
Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
• This goes for all parameters that may change
• Separate code from configuration, e.g. use config files (another gift to your future self!)
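These two bullets combine naturally: put the random seed in the config file alongside the other parameters, so one file records everything needed to rerun the analysis. A minimal sketch (the parameter names are invented):

```python
import json
import random

def load_config(path):
    """Read all tunable parameters, including the seed, from one config file."""
    with open(path) as fh:
        return json.load(fh)

def run_analysis(config):
    # Seeding from the noted value makes the random draws repeatable.
    random.seed(config["seed"])
    return [random.random() for _ in range(config["n_samples"])]
```

Two runs from the same config file now produce identical output, which is exactly the property Rule 6 asks for.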
Rule 7: Always Store Raw Data behind Plots
• (and the plot-generating code, too)
• Make raw data read-only
• Separate folders for raw and pre-processed data
https://inspguilfoyle.wordpress.com/2014/02/19/straight-lines/
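All three bullets fit in one small helper: write the points behind a figure to a raw-data folder, then flip the file to read-only so it cannot be silently edited. The folder layout and function name are illustrative:

```python
import csv
import os
import stat

def archive_plot_data(points, name, raw_dir="data/raw"):
    """Store the raw (x, y) points behind a plot in a dedicated raw-data
    folder and mark the file read-only (layout is illustrative)."""
    os.makedirs(raw_dir, exist_ok=True)
    path = os.path.join(raw_dir, f"{name}.csv")
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["x", "y"])
        writer.writerows(points)
    # Remove write permission so the raw data can't be accidentally modified.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path
```

With the numbers on disk next to the plot-generating script, anyone (including future you) can regenerate or restyle the figure without re-running the analysis.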
Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
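One way to read this rule: emit a top-level summary file, with per-item detail files underneath it, so a reader can drill down only where needed. A sketch with invented names and layout:

```python
import json
import os

def write_hierarchical(results, outdir="output"):
    """Write a top-level summary plus one detail file per group underneath,
    so readers can inspect increasing levels of detail (layout is illustrative)."""
    os.makedirs(os.path.join(outdir, "detail"), exist_ok=True)
    # Layer 1: a compact summary (here, the mean of each group).
    summary = {name: round(sum(vals) / len(vals), 3)
               for name, vals in results.items()}
    with open(os.path.join(outdir, "summary.json"), "w") as fh:
        json.dump(summary, fh, indent=2)
    # Layer 2: the full per-group values, one file each, under detail/.
    for name, vals in results.items():
        with open(os.path.join(outdir, "detail", f"{name}.json"), "w") as fh:
            json.dump(vals, fh)
    return summary
```

A skeptical reviewer can stop at `summary.json`, or open `detail/` to see every underlying number behind each summary statistic.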
Rule 9: Connect Textual Statements to Underlying Results
Rule 10: Provide Public Access to Scripts, Runs, and Results
• GitHub
• Synapse
• Open Science Framework
• ReadTheDocs
• RunMyCode
• ???
Documentation
• Is it clear where to begin? (e.g., can someone picking the project up see where to start running it?)
• Can you determine which file(s) was/were used as input in a process that produced a derived file?
• Whom do I cite? (code, data, etc.)
• Is there documentation about every result?
• Have you noted the exact version of every external application used in the process?
• For analyses that include randomness, have you noted the underlying random seed(s)?
• Have you specified the license under which you're distributing your content, data, and code?
• Have you noted the license(s) for other people's content, data, and code used in your analysis?
http://ropensci.github.io/reproducibility-guide/sections/checklist/
Organization
• Which is the most recent data file/code?
• Which folders can I safely delete?
• Do you keep older files/code, or delete them?
• Can you find the file for a particular replicate of your research project?
• Have you stored the raw data behind each plot?
• Is your analysis output organized hierarchically? (allowing others to find more detailed output underneath a summary)
• Do you run backups on all files associated with your analysis?
• How many times has a particular file been generated in the past?
• Why was the same file generated multiple times?
• Where did a file that I didn't generate come from?
Automation
• Are there many manual data manipulation steps?
• Are all custom scripts under version control?
• Is your writing (content) under version control?
Publication
• Have you archived the exact version of every external application used in your process(es)?
• Did you include a reproducibility statement or declaration at the end of your paper(s)?
• Are textual statements connected/linked to the supporting results or data?
• Have you archived preprints of resulting papers in a public repository?
• Did you release the underlying code at the time of publishing a paper?
• Are you providing public access to your scripts, runs, and results?
Best Practices for Scientific Computing
• Write programs for people, not computers.
• Let the computer do the work.
• Make incremental changes.
• DRY: Don’t repeat yourself (or others).
• Plan for mistakes. (“Defensive programming”)
• Use pair programming.
• Document design and purpose, not mechanics.
Wilson, Greg, et al. "Best practices for scientific computing." PLoS biology 12.1 (2014): e1001745.
Suggested Training Topics
• version control and use of online repositories
• modern programming practice, including unit testing and regression testing
• maintaining “notebooks” or “research compendia”
• recording the provenance of final results relative to code and/or data
• numerical/floating-point reproducibility and nondeterminism
• reproducibility on parallel systems
• dealing with large datasets
• dealing with complicated software stacks and use of virtual machines
• documentation and literate programming
• IP and licensing issues, proper citation and attribution
http://icerm.brown.edu/tw12-5-rcem/
Resources
• http://projecttemplate.net/ - Project automation (R)
• http://www.nature.com/news/2010/101013/full/467753a.html - Publish your computer code: it is good enough
• http://www.carlboettiger.info/ - Open lab notebook
• http://wiki.stodden.net/ICERM_Reproducibility_in_Computational_and_Experimental_Mathematics:_Readings_and_References
• http://rrcns.readthedocs.org/ - Best practices tutorial
• http://www.bioinformaticszen.com/