Open Methodology and Reproducibility in Computational Science

Victoria Stodden
Department of Statistics
Columbia University

Numerical Cosmology 2012
Centre of Theoretical Cosmology
DAMTP, University of Cambridge, UK
July 18, 2012

The Changing Concept of a Scientific Fact
  The Scientific Record
  Computational Science
  Examples
  The Credibility Crisis

Survey of the Machine Learning Community

Responses and Open Questions


The Concept of a Scientific Fact

In Opus Tertium (1267) Roger Bacon distinguishes experimental science by:

1. verification of conclusions by direct experiment,
2. discovery of truths unreachable by other approaches,
3. investigation of the secrets of nature, opening us to a knowledge of past and future.

- Bacon described a repeating cycle of observation, hypothesis, experimentation, and the need for independent verification,
- He recorded his experiments (e.g. on the nature and cause of the rainbow) in enough detail to permit reproducibility by others.


Inductive Scientific Reasoning

In Novum Organum (1620) Francis Bacon proposes:

1. the gathering of facts, by observation or experimentation,
2. verification of general principles.

“There are and can be only two ways of searching into and discovering truth. The one flies from the senses and particulars to the most general axioms, and from these principles, the truth of which it takes for settled and immoveable. ... The other derives axioms from the senses and particulars, rising by a gradual and unbroken ascent, so that it arrives at the most general axioms last of all. This is the true way, but as yet untried.”


The Scientific Record

- The Royal Society of London founded 1660 (the “Invisible College”),
- members discussed Francis Bacon’s “new science” from 1645,
- Society correspondence reviewed by the first Secretary, Henry Oldenburg,
- Oldenburg became the founder, editor, author, and publisher of Philosophical Transactions, launched in 1665.


Scientific Research is Changing

Scientific computation emerging as central to the scientific method:

- Simulation of the complete evolution of a physical system, systematically changing parameters,
- Data driven, machine-generated hypotheses.

Conjecture: Today’s academic scientist probably has more in common with a large corporation’s information technology manager than with a philosophy or English professor at the same university.


1. Examples of Pervasiveness of Computational Methods

- For example, in statistics:

JASA June   Computational Articles   Code Publicly Available
1996        9 of 20                  0%
2006        33 of 35                 9%
2009        32 of 32                 16%
2011        29 of 29                 21%

- Social network data and the quantitative revolution in social science (Lazer et al. 2009);
- Computation reaches into traditionally nonquantitative fields: e.g. the WordHoard project at Northwestern examining word distributions by Shakespearian play.
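The JASA counts on this slide can be turned into percentages to make the trend easier to read. The following is a minimal sketch: the dictionary layout and the helper name `computational_share` are illustrative assumptions, and the code-availability percentages are passed through unchanged because the slide does not state their denominator.

```python
# JASA June-issue figures transcribed from the slide:
# year -> (computational articles, total articles, % with code publicly available)
jasa = {
    1996: (9, 20, 0),
    2006: (33, 35, 9),
    2009: (32, 32, 16),
    2011: (29, 29, 21),
}


def computational_share(computational, total):
    """Return the percentage of articles in the issue that are computational."""
    return 100.0 * computational / total


# Hypothetical helper structure: share of computational articles per year.
shares = {year: computational_share(c, t) for year, (c, t, _) in jasa.items()}

for year, (c, t, code_pct) in jasa.items():
    print(f"{year}: {shares[year]:.0f}% computational, {code_pct}% with public code")
```

Read this way, the share of computational articles rises from under half in 1996 to all of them by 2009, while the reported code availability stays far lower.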


2. Dynamic modeling of macromolecules: SaliLab UCSF


3. Mathematical “proof” by simulation and grid search

[Figure: cover of Phil. Trans. R. Soc. A, vol. 367, no. 1906, pp. 4235–4470, 13 Nov 2009. Theme issue “Statistical challenges of high-dimensional data”, compiled and edited by D. L. Banks, P. J. Bickel, Iain M. Johnstone and D. Michael Titterington.]


Evidence of a problem...

Relaxed practices regarding the communication of computational details are creating a credibility crisis in computational science.

- Re-establish reproducibility, via code and data sharing


The Last Update to the Scientific Method: 1665

- The “Invisible College” included Robert Boyle, the “father of chemistry,”
- Boyle introduced standards for scientific communication: enough information must be included to allow others to independently reproduce the finding.
- This delineates science: the concept of reproducibility permits verification and knowledge transfer,
- The knowledge is in the method, not in the finding itself.


Controlling Error is Central to Scientific Progress

“The scientific method’s central motivation is the ubiquity of error - the awareness that mistakes and self-delusion can creep in absolutely anywhere and that the scientist’s effort is primarily expended in recognizing and rooting out error.”
David Donoho et al. (2009)


The Third Branch of the Scientific Method

- Branch 1: Deductive/Theory: e.g. mathematics; logic,
- Branch 2: Inductive/Empirical: e.g. the machinery of hypothesis testing; statistical analysis of controlled experiments,
- Branch 3? Large scale extrapolation and prediction, using simulation and other data-intensive methods.


Toward a Resolution of the Credibility Crisis

- Typical scientific communication doesn’t include sufficient detail for reproducibility, i.e. the code and data that generated the findings.
- Most published computational scientific results today are nearly impossible to replicate.

Thesis: Computational science cannot be elevated to a third branch of the scientific method until it generates routinely verifiable knowledge. (Donoho, Stodden, et al. 2009)

Sharing of underlying code and data is a necessary part of this solution, enabling Reproducible Research.
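As one illustration of what sharing “the code and data that generated the findings” can look like in practice, the sketch below records a fingerprint of the input data, the random seed, and the software environment alongside a result. It is not a method described in the talk: the function names (`sha256_of_bytes`, `run_analysis`, `make_manifest`), the toy bootstrap analysis, the input values, and the seed are all hypothetical choices made for the example.

```python
import hashlib
import json
import platform
import random
import sys


def sha256_of_bytes(data: bytes) -> str:
    """Fingerprint the exact input bytes so others can confirm they hold the same data."""
    return hashlib.sha256(data).hexdigest()


def run_analysis(values, seed):
    """Hypothetical stand-in analysis: a seeded bootstrap estimate of the mean."""
    rng = random.Random(seed)  # local generator, so the result depends only on the seed
    means = []
    for _ in range(200):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)


def make_manifest(raw_data: bytes, seed: int) -> dict:
    """Bundle the result with what a reader would need to re-run and check it."""
    values = [float(x) for x in raw_data.decode().split()]
    return {
        "data_sha256": sha256_of_bytes(raw_data),
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "result": run_analysis(values, seed),
    }


raw = b"1.0 2.0 3.0 4.0 5.0"  # toy data, assumed for illustration only
manifest = make_manifest(raw, seed=2012)
print(json.dumps(manifest, indent=2))
```

Publishing such a record next to the code and data lets a reader check that they are running the same inputs with the same seed before comparing results.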


Survey of Machine Learning Community (Stodden 2010)

Question: Why isn’t reproducibility practiced more widely?

Answer builds on literature of free revealing and open innovation in industry, and the sociology of science.

- Sample: American academics registered at the Machine Learning conference NIPS.
- Respondents: 134 responses from 593 requests (∼23%).
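The quoted response rate follows directly from the two counts on the slide. A minimal check, with variable names chosen only for this example:

```python
# Counts as stated on the slide.
responses = 134
requests = 593

# Response rate as a fraction of survey requests sent.
rate = responses / requests
non_respondents = requests - responses

print(f"Response rate: {rate:.1%} ({non_respondents} requests unanswered)")
```

Rounded to the nearest whole percent this gives the ∼23% reported, so the percentages on the following slides describe roughly one in four of the academics contacted.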


Top Reasons Not to Share

Code   Reason                                  Data
77%    Time to document and clean up           54%
52%    Dealing with questions from users       34%
44%    Not receiving attribution               42%
40%    Possibility of patents                  -
34%    Legal barriers (i.e. copyright)         41%
-      Time to verify release with admin       38%
30%    Potential loss of future publications   35%
30%    Competitors may get an advantage        33%
20%    Web/Disk space limitations              29%


Top Reasons to Share

Code   Reason                               Data
91%    Encourage scientific advancement     81%
90%    Encourage sharing in others          79%
86%    Be a good community member           79%
82%    Set a standard for the field         76%
85%    Improve the caliber of research      74%
81%    Get others to work on the problem    79%
85%    Increase in publicity                73%
78%    Opportunity for feedback             71%
71%    Finding collaborators                71%


Grassroots Efforts in Many Fields, Policies

Independent efforts by researchers:

- AMP 2011 “Reproducible Research: Tools and Strategies for Scientific Computing”
- AMP / ICIAM 2011 “Community Forum on Reproducible Research Policies”
- SIAM Geosciences 2011 “Reproducible and Open Source Software in the Geosciences”
- ENAR International Biometric Society 2011: Panel on Reproducible Research
- AAAS 2011: “The Digitization of Science: Reproducibility and Interdisciplinary Knowledge Transfer”
- SIAM CSE 2011: “Verifiable, Reproducible Computational Science”
- Yale 2009: Roundtable on Data and Code Sharing in the Computational Sciences
- ACM SIGMOD conferences
- ...

Policy changes:

- NSF/OCI report on Grand Challenge Communities (Dec 2010)
- NSF report “Changing the Conduct of Science in the Information Age” (Aug 2011)
- IOM “Review of Omics-based Tests for Predicting Patient Outcomes in Clinical Trials”
- NIH, NSF multiple requests for input on data policies
- Journal policy movement toward code and data requirements (i.e. Science, Feb 2011)
- ...


A Solution: Web-based Executable Dissemination Platforms

Effort I have been involved in: RunMyCode.org


RunMyCode: Companion Websites


Open Questions

- Code complexity: Massive codes, installation, software support, parallel and multicore implementations,
- Streaming data, massive data access
- Tools for ease of implementation, i.e. data provenance and workflow (“progress depends on artificial aids becoming so familiar they are regarded as natural”, I.J. Good, 1958),
- Taleb Effect - scientific discoveries as (misused) black boxes,
- Nefarious uses / public misinterpretation,
- Black boxes and opacity in software - testing and design,
- Lock-in: calcification of ideas in software?
- Independent replication discouraged?
- Policy maker engagement: finding support for our norms.
