Date post: | 24-Jan-2018 |
Category: |
Software |
Upload: | sri-ambati |
Confidential | August 28, 2014 | Investor Deck
Who am I?
Enterprise data science platform for analytically sophisticated organizations
Previously built analytical software at a big hedge fund
BA, MS in computer science
My goals for this talk
Convey why reproducibility and collaboration are important
Share insights, tips, principles, technologies to help implement best practices
Motivation
Individual productivity
• Less wasted time tracking and reproducing past work
• Less wasted time on environment setup
Team efficiency
• Work compounds; don’t re-invent the wheel
• More feedback, faster iteration
• Faster onboarding of new team members
More insights
• Shared context and discussion facilitate idea generation
Methodology/regulatory
• Some disciplines / industries have auditing requirements
Challenges in a data science context
• Analytical work is much more than just source code
• Data, results, parameters all important to tracking progress and sharing
• Generating results requires running code — can’t just store files
• Running code requires hardware and software/packages
  • Setting these up can be a pain
  • Software/packages can differ between people and over time
• Source control (e.g., git) too complex for many data scientists
• Hard to mandate behavior top down — have to incentivize it bottom up
Technical
Organizational
Ten Simple Rules for Reproducible Computational Research
1. For Every Result, Keep Track of How It Was Produced
2. Avoid Manual Data Manipulation Steps
3. Archive the Exact Versions of All External Programs Used
4. Version Control All Custom Scripts
5. Record All Intermediate Results, When Possible in Standardized Formats
6. For Analyses That Include Randomness, Note Underlying Random Seeds
7. Always Store Raw Data behind Plots
8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
9. Connect Textual Statements to Underlying Results
10. Provide Public Access to Scripts, Runs, and Results
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285
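Rules 1 and 6 can be illustrated with a minimal Python sketch. It assumes a toy analysis (the mean of pseudo-random draws); the function name `run_analysis`, the record layout, and the `run_id` convention are hypothetical and are not taken from the talk or the paper.

```python
import hashlib
import json
import random


def run_analysis(seed: int, n_samples: int) -> dict:
    """Toy analysis (hypothetical): mean of n_samples pseudo-random draws."""
    # A local generator means the recorded seed fully determines the draws.
    rng = random.Random(seed)
    draws = [rng.random() for _ in range(n_samples)]
    result = sum(draws) / n_samples

    # Rule 1 and Rule 6: keep the parameters and the seed next to the result.
    params = {"seed": seed, "n_samples": n_samples}
    # Short fingerprint of the parameters, usable as a run identifier
    # (an illustrative convention, not a standard).
    run_id = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:12]
    return {"run_id": run_id, "params": params, "result": result}


record = run_analysis(seed=42, n_samples=1000)
# Re-running from the recorded parameters reproduces the same record.
rerun = run_analysis(**record["params"])
print(json.dumps(record, indent=2))
```

Because the seed is stored with the result, anyone holding the record can regenerate the number instead of trusting a copy of it.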
Strategy
Give individuals something they want.
Package it in a solution that facilitates best practices.
Solution: a central hub to run, track, and share work
Easy Access to Scalable Compute
• Long-running scripts or interactive work
• Run multiple experiments in parallel
• Elastic resources via cloud infrastructure
Turnkey Deployment & Operationalization
• Package analyses into self-service web UIs
• Execute models through REST APIs
• Schedule automated recurring tasks
Version Control & Reproducibility
• Automatic tracking of code, data, and results
• Supports concurrent development
Collaboration
• Share code, data, and results
• Discuss, comment on results and other work
• Search and security
Incentivize centralization:
Capitalize on centralization:
Containers (Docker)
• Standardized, centrally managed “environments” with software and configuration already set up
• Any language/software, even commercial / proprietary (e.g., Matlab)
• Including interactive tools, e.g., Jupyter, RStudio
• Able to change environments per project, per run, possibly without going through IT
• Containers can be tracked and stored
• Can change environment without affecting others
Flexibility for users
Reproducibility and comparison
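A container image pins the whole environment, including system libraries. As a much lighter approximation that is not a substitute for Docker, a Python run can at least record which interpreter and package versions it used. The sketch below relies only on the standard library; the function name `snapshot_environment` and the snapshot layout are hypothetical.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment() -> dict:
    """Collect interpreter and installed-package versions for the current run."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


snapshot = snapshot_environment()
# Store this JSON next to the run's results so two runs can be compared later.
snapshot_json = json.dumps(snapshot, indent=2)
print(snapshot_json[:200])
```

Comparing two such snapshots shows which package versions changed between runs, which is one way to explain why results differ between people or over time.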
Useful tools and tips
1. For Every Result, Keep Track of How It Was Produced
2. Avoid Manual Data Manipulation Steps
3. Archive the Exact Versions of All External Programs Used
4. Version Control All Custom Scripts
5. Record All Intermediate Results, When Possible in Standardized Formats
6. For Analyses That Include Randomness, Note Underlying Random Seeds
7. Always Store Raw Data behind Plots
8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
9. Connect Textual Statements to Underlying Results
10. Provide Public Access to Scripts, Runs, and Results
Automatic
Automatic
Docker
Pickle, Rda/Rds
“stats” json
Discussion / comments
Notebooks / knitr
Pickle, Rda/Rds
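Two of the tools named above, Pickle and a “stats” JSON file, can be combined in a short Python sketch: intermediate objects are pickled, and headline numbers go into a small JSON file that is easy to read and diff. The file names `intermediate.pkl` and `stats.json`, the helper functions, and the example values are all made up for illustration.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_run_outputs(out_dir: Path, intermediate: object, stats: dict) -> None:
    """Write an intermediate object (pickle) and summary numbers (JSON)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "intermediate.pkl", "wb") as f:
        pickle.dump(intermediate, f)
    with open(out_dir / "stats.json", "w") as f:
        json.dump(stats, f, indent=2, sort_keys=True)


def load_run_outputs(out_dir: Path) -> tuple:
    """Read back what save_run_outputs wrote."""
    with open(out_dir / "intermediate.pkl", "rb") as f:
        intermediate = pickle.load(f)
    with open(out_dir / "stats.json") as f:
        stats = json.load(f)
    return intermediate, stats


# Demo in a throwaway directory; the values are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run-001"
    fitted = {"coefficients": [0.5, -1.25], "n_rows": 200}
    save_run_outputs(run_dir, fitted, {"r_squared": 0.87, "n_rows": 200})
    restored, stats = load_run_outputs(run_dir)
```

Pickle files are Python-specific and should only be loaded from trusted sources; the JSON file is the part meant for people and for other tools.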