
Exploring the Coming Repositories of Reproducible Experiments: Challenges and Opportunities

Juliana Freire, New York University

[email protected]

Philippe Bonnet, IT University of Copenhagen

[email protected]

Dennis Shasha, New York University

[email protected]

ABSTRACT

Computational reproducibility efforts in many communities will soon give rise to validated software and data repositories of high quality. A scientist in a field may want to query the components of such repositories to build new software workflows, perhaps after adding the scientist's own algorithms. This paper explores the research challenges that must be met to achieve this goal.

1. INTRODUCTION

A hallmark of the scientific method is that experiments should be described in enough detail that they can be repeated and perhaps generalized. The idea in natural science is that if a scientist claims an experimental result, then another scientist should be able to check it. Similarly, in a computational environment, it should be possible to repeat a computational experiment as the authors have run it, or to change the experiment to see how robust the authors' conclusions are to changes in parameters or data (a concept called workability). Our goal of reproducibility thus encompasses both repeatability and workability. As computational experiments become ubiquitous in many scientific disciplines, there has been great interest in the publication of reproducible papers as well as of infrastructure that supports them [14, 6, 15, 19, 26, 3, 11, 17]. Unlike traditional papers, which aim to describe ideas and results using text only, reproducible papers include data, the specification of computational processes, and the code used to derive the results. Motivated in part by cases of academic dishonesty as well as honest mistakes [9, 23, 24], some institutions have started to adopt reproducibility guidelines. For example, the ETH Zurich research ethics guidelines [8] require that all steps from input data to final figures be archived and made available upon request. Conferences, publishers, and funding agencies are also encouraging authors to include reproducible experiments in their papers [17, 10]. In many ways, this is an extension of the very healthy demo-or-die philosophy that the database community follows for systems papers.

Science will greatly benefit as different communities start to follow these guidelines. Although in computer science the publication of reproducible results and data sets is still in its infancy, in other fields there is already an established culture for doing so; see, e.g., [1, 11, 13]. When collections of computational experiments (along with their source code, raw data, and workflows) are documented, reproducible, and available in community-accessible repositories, new software can be built upon this verified base. This enables scientific advances that combine previous tools as well as ideas. For example, members of the community can search for related experiments (e.g., "find experiments similar to mine") and better understand tools that have been created and how they are used. Furthermore, such repositories enable the community to evaluate the impact of a contribution not only through the citations to a paper, but also through the use of the proposed software and data components.

The database community has taken the lead in encouraging reproducibility for computational experiments [17, 16]. We can also take the lead in showing how to advance science by providing new techniques and tools for exploring the information in these repositories. Before discussing the challenges of exploring repositories, let's take a look at the data: reproducible papers themselves.

2. REPRODUCIBLE EXPERIMENTS, PAPERS, AND REPOSITORIES

In reproducible papers, the reported results, including data, plots, and visualizations, are linked to the experiments and inputs. Having access to these, reviewers and readers can examine the results, then repeat or modify an execution. A number of ongoing efforts provide infrastructure and guidelines to make the production of such papers easier for authors [14, 10, 27, 15, 19].

Madagascar [15] is an open-source system for multi-dimensional data analysis that provides support for reproducible computational experiments. Authors describe their experiments in SCons, a rule-based language analogous to make (see the sketch below). A reproducible publication can then be created by including the rules in a LaTeX document.

Koop et al. [14] describe a provenance-based infrastructure that uses the VisTrails system [29] to support the life-cycle of publications: their creation, review, and re-use. As scientists explore a given problem, VisTrails systematically captures the provenance of the exploration, including the workflows created and the versions of source code and libraries used. The infrastructure also includes methods to link results to their provenance, reproduce results, explore parameter spaces, interact with results through a Web-based interface, and upgrade the specification of computational experiments to work in different environments and with newer versions of software. Documents (including LaTeX, PowerPoint, Word, wiki, and HTML pages) can be created that link to provenance information that allows the results to be reproduced. (Videos demonstrating this infrastructure in action are available at http://www.vistrails.org/index.php/ExecutablePapers.)
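
To make the rule-based model concrete, here is a minimal sketch of the kind of SCons rule file (SConstruct) an author might write. It uses only the stock SCons Command() builder rather than Madagascar's own helpers, and every file name, script, and parameter in it is an invented placeholder, not something taken from the paper.

    # SConstruct -- hedged sketch of a rule-based reproducible experiment.
    # Each rule declares its inputs and outputs, so (as with make) a rebuild
    # re-runs only the steps whose inputs have changed.

    # Rule 1: derive a cleaned data set from the archived raw input.
    Command('cleaned.csv', 'raw_measurements.csv',
            'python clean.py < $SOURCE > $TARGET')

    # Rule 2: run the experiment on the cleaned data with fixed parameters.
    Command('results.csv', ['cleaned.csv', 'experiment.py'],
            'python experiment.py --input ${SOURCES[0]} --seed 42 --out $TARGET')

    # Rule 3: render the figure that the LaTeX document includes.
    Command('figure1.pdf', 'results.csv',
            'python plot.py $SOURCE $TARGET')

Running scons would replay the experiment end to end; a reader probing workability can edit one rule's parameters and rebuild only what depends on it.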


Figure 1: An executable paper describing the ALPS 2.0 release. Figure 3 in that paper (center bottom) shows a plot with a deep caption consisting of the workflow used to derive the plot, the underlying libraries invoked by the workflow, and the input data. This provenance information allows the plot to be reproduced. In the PDF version of the paper, the figure is active: when clicked, the workflow is loaded into a workflow management system and executed on the reader's machine.

This year, the SIGMOD Repeatability effort has included extensive software infrastructure and guidelines to facilitate the creation of reproducible software. Authors have used those facilities to archive their software, configuration files, and workflows. For the sake of concreteness, we now give examples of reproducible experiments along with their software.

2.1 Examples of Executable Papers

Anatomy of a Reproducible Tuning Experiment. To show how a particular experimental graph was obtained, the experiment package includes the data used, the software that ran on that data, configuration parameters, and a workflow to run the software components in a particular order, perhaps with branching. It also includes a description of the hardware and operating system platform required (or a virtual machine that embodies those). Besides directly repeating the experiment, a "reader" of the paper can change the data, the configuration parameters, or the workflow. The data output of each step of the workflow can be examined or potentially be used as input to another software component (see [28] for details); that will be important in what we see below. A sketch of such a package's ingredients follows.
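
The paper does not prescribe a concrete package format, so the following is only a hedged sketch: it models the ingredients just listed (data, configuration, platform description, and an ordered workflow with optional branching) as a small Python structure. All class and field names here are our own invention.

    from dataclasses import dataclass, field
    from typing import Callable, Optional

    @dataclass
    class Step:
        """One workflow step; its output can be examined or fed onward."""
        name: str
        run: Callable[[dict], dict]                      # software component
        branch: Optional[Callable[[dict], bool]] = None  # optional branch guard

    @dataclass
    class ExperimentPackage:
        data: str                    # path to the archived input data
        config: dict                 # configuration parameters
        platform: str                # required hardware/OS, or a VM image id
        workflow: list = field(default_factory=list)     # ordered Steps

        def repeat(self) -> dict:
            """Re-run the experiment as published. A 'reader' testing
            workability would first swap out data, config, or workflow."""
            state = {"data": self.data, **self.config}
            for step in self.workflow:
                if step.branch is None or step.branch(state):
                    state = step.run(state)  # each step's output stays inspectable
            return state

Keeping each step's intermediate state inspectable is what lets one package's output feed another's software component, the property exploited in Section 3.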

Anatomy of a Reproducible WikiQuery Experiment. This experiment includes a series of workflows that were used to derive the experimental results reported in [22]. To run these workflows, readers may either copy and run the experiments locally, or run the experiment on the authors' machines and have the results shipped back.

ALPS 2.0 and Physics Simulations. The ALPS project (Algorithms and Libraries for Physics Simulations) is an open-source initiative for the simulation of large quantum many-body systems [2], which has been used in about two hundred research projects over the past six years. One of its core goals has been to simplify the archival longevity and repeatability of simulations by standardizing input and result file formats. The paper describing ALPS 2.0 [4], shown in Figure 1, is an example of a reproducible paper. It reports results from large-scale simulations that are time-consuming and run on high-performance hardware. The experiments are thus split into two parts: simulations and a set of analysis workflows. The simulation results are stored in (and made available from) an archival site, and the analysis workflows access the archival site and perform a sequence of analyses. The figures in the paper are active: clicking on a figure activates a "deep caption" which retrieves the workflow associated with the figure and executes the calculation leading to the figure on the user's machine. This paper makes use of the VisTrails publication infrastructure [14], which enables the linkage of results in a paper to their provenance.

2.2 Experiment and Workflow Repositories

With the growing awareness of the importance of reproducibility and sharing, several repositories have been created which cater to different aspects of this problem. nanoHUB [21] offers simulation tools which users can access from their web browsers in order to simulate nanotechnology devices. The creators of nanoHUB claim that papers whose simulation tools are made available through the site enjoy a greater number of citations. Sites like crowdLabs [7, 18] and myExperiment [20] support the sharing of workflows which describe computational experiments, data analyses, and visualizations. crowdLabs also supports a Web-based interface for executing workflows and displaying their results in a Web browser. PubZone (http://www.pubzone.org) is a new resource for the scientific community that provides a discussion forum and wiki for publications. The idea for PubZone emerged as part of the initiative to ensure the reproducibility of experiments reported in SIGMOD papers. The results of such reproducibility experiments will be published in PubZone.


Figure 2: A repository where scientists can publish their workflows and source code opens up new opportunities for knowledge sharing and re-use. While environmental scientists at the STC CMOP at OHSU use 2D plots to visualize the results of their simulations (left), at the SCI Institute visualization experts are developing state-of-the-art volume rendering techniques (center). By combining their work, it is possible to generate more detailed (and insightful) 3D renderings of the CMOP simulation results (right).

3. VISION

Many researchers have tools, software, and data they are willing to share in a widely available repository. To take best advantage of such contributions, we would like to propose a vision of "repository exploration" to help researchers re-use software as well as build new software components from existing ones. Here are some of the opportunities opened up by having an exploration platform, and the challenges involved in building such a system.

1. How do I find tools/software/data that are helpful to me or related to my work? Consider the following example: find an experiment that uses MySQL as a back-end to store salinity information about the Columbia River and that performs volume rendering using an algorithm developed at the SCI Institute. This query spans the meta-data of the system configuration (for MySQL), the algorithm author (SCI Institute), and the data type (salinity information about a specific river). The repository querier may get very lucky and find an exact match, but will often find only a partial match. For example, the repository may have one entry that has a MySQL back-end and salinity information about a different river without volume rendering, and another that does volume rendering. This challenge entails the construction of an intuitive query interface that allows users to explore the data in the repository. Although there has been work on querying workflows [25, 5], experiments may or may not be specified as workflows. Even when they are, they can be specified at different levels of granularity. This creates new challenges for querying. In particular, techniques are needed that infer workflows from lower-level specifications (e.g., scripts) and that support exploratory, vague queries. Furthermore, besides workflows, these repositories will contain other kinds of information, including source code, input and derived data. This requires a flexible query system that is able to cross the boundaries between structured and unstructured data, perhaps in a similar way to querying in dataspace systems [12]. A sketch of the partial-match behavior appears below.
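
As a rough illustration of that partial-match behavior, the sketch below scores repository entries facet by facet instead of demanding an exact match, so near-misses (right back-end, wrong river) still surface. The facet names and the two example entries are invented for this sketch.

    # Hedged sketch: rank repository entries by the fraction of query
    # facets they satisfy, rather than returning exact matches only.
    query = {
        "backend": "MySQL",
        "data_type": "salinity:Columbia River",
        "algorithm_author": "SCI Institute",
    }

    repository = [  # invented entries, not real repository contents
        {"id": "exp-101", "backend": "MySQL",
         "data_type": "salinity:Willamette River",
         "algorithm_author": "OHSU"},
        {"id": "exp-202", "backend": "PostgreSQL",
         "data_type": "volume_rendering:Columbia River",
         "algorithm_author": "SCI Institute"},
    ]

    def score(entry: dict, q: dict) -> float:
        """Fraction of query facets the entry matches exactly."""
        return sum(entry.get(k) == v for k, v in q.items()) / len(q)

    for entry in sorted(repository, key=lambda e: score(e, query), reverse=True):
        print(entry["id"], round(score(entry, query), 2))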

2. Given several relevant entries from the repository, can they be combined into a coherent whole? In the case of our running example, can we pipe the salinity information from one repository item into the volume rendering software of another? A workflow representation of each published computational experiment will be useful for this challenge, because a workflow exposes the specification of the computation tasks, including intermediate steps, in a structured fashion. As Figure 2 illustrates, one can imagine taking two workflows and creating a third one, perhaps with extra steps for data conversion. To support this, techniques need to be developed to automatically find correspondences between computational modules (e.g., module A performs a function that is similar to module B), as well as to determine their compatibility (e.g., can module A be connected to module B?); a sketch of that check follows.
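
At its simplest, the compatibility question can be approximated by comparing a module's declared output type with another's input type. The module declarations and type strings below are invented for illustration, not drawn from any real repository.

    # Hedged sketch: can module A's output feed module B's input?
    modules = {  # invented declarations
        "salinity_reader": {"inputs": [], "outputs": ["grid3d:salinity"]},
        "volume_renderer": {"inputs": ["grid3d:*"], "outputs": ["image"]},
    }

    def compatible(out_type: str, in_type: str) -> bool:
        """Match 'grid3d:salinity' against patterns like 'grid3d:*'."""
        kind, _, detail = out_type.partition(":")
        want_kind, _, want_detail = in_type.partition(":")
        return kind == want_kind and want_detail in ("*", detail)

    def can_connect(a: str, b: str) -> bool:
        return any(compatible(o, i)
                   for o in modules[a]["outputs"]
                   for i in modules[b]["inputs"])

    print(can_connect("salinity_reader", "volume_renderer"))  # -> True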


3. What is a good query language for finding repository items and assembling several repository items together? Ideally, the query system would provide ways to query the different components of an experiment, including meta-data about the data used, the structure of and parameters in workflows, software headers, and system configuration parameters. Assembling different repository items together entails finding sub-parts of workflows that link together, perhaps at some cost. A query processing system that incorporates such a cost measure to find the "best" answer to a query would be most useful. Thus, the query language, if successful, would answer the first two challenges.

4. Support "standing queries". Once a consumer of the repository has identified a need, he or she can pose a query. If unsatisfied, the consumer can declare the query to be "standing", which means that new entries to the repository will be matched against the query to see whether they are more helpful.

5. What is the "executable impact" of a given paper/result? Given an executable paper A, go through other papers that use the components of A (directly or indirectly) and count them (see the sketch below). To be most effective, this will tie into a visualization that shows a virtual collaboration graph to help answer questions like: who uses (re-uses) what; what are the most influential tools and results. Some mechanisms to support such an "executable impact" measure include: (i) the ability to capture the meta-data of a publication associated with an executable component, so the user of that component can cite the publication in an "executable bibliography"; and (ii) the ability to discover similarities of components in order to trace copyright.
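
Counting executable impact as just described is essentially a reachability count over a component-usage graph. The sketch below makes that concrete on an invented toy graph; the uses mapping and paper names are placeholders, not data from any real repository.

    from collections import deque

    # uses[X] = papers that directly use components of paper X
    # (a toy graph invented for illustration)
    uses = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}

    def executable_impact(paper: str) -> int:
        """Count papers that build on `paper`'s components,
        directly or indirectly (breadth-first reachability)."""
        seen, queue = set(), deque(uses.get(paper, []))
        while queue:
            p = queue.popleft()
            if p not in seen:
                seen.add(p)
                queue.extend(uses.get(p, []))
        return len(seen)

    print(executable_impact("A"))  # -> 3: B and C directly, D indirectly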

4. USERS OF REPRODUCIBLE EXPERIMENT REPOSITORY EXPLORATION

Scientists would be our first target users. A base of validated, workflow-described software will allow a kind of "workflow mashup" which, if combined with a capable query language, may enable the creation of a new targeted tool in days rather than years. But what's good for scientists will also help inventors and, through them, venture capitalists, as new products will be able to come online using the most advanced technology available. Repository exploration will be a tool for the nimble.

Acknowledgments. This work has been partially funded by the National Science Foundation under grants IIS-1050422, IIS-0905385, IIS-0746500, CNS-0751152, N2010 IOB-0519985, N2010 DBI-0519984, IIS-0414763, DBI-0445666, DBI-0421604, and MCB-0209754.

5. REFERENCES

[1] The SAO/NASA Astrophysics Data System. http://adsabs.harvard.edu.
[2] The ALPS project. http://alps.comp-phys.org/.
[3] ICIAM Workshop on Reproducible Research: Tools and Strategies for Scientific Computing. http://www.mitacs.ca/events/index.php?option=com_content&view=article&id=214&Itemid=230&lang=en, 2011.
[4] B. Bauer et al. The ALPS project release 2.0: Open source software for strongly correlated systems, Jan. 2011. Accepted for publication in Journal of Statistical Mechanics: Theory and Experiment. Paper available at http://arxiv.org/pdf/1101.2646 and workflows at http://arxiv.org/abs/1101.2646.
[5] C. Beeri, A. Eyal, S. Kamenkovich, and T. Milo. Querying business processes. In VLDB, pages 343–354, 2006.
[6] Beyond the PDF Workshop. https://sites.google.com/site/beyondthepdf, 2011.
[7] CrowdLabs. http://www.crowdlabs.org.
[8] Guidelines for Research Integrity and Good Scientific Practice at ETH Zurich. http://www.vpf.ethz.ch/services/researchethics/Broschure.
[9] ETH Zurich's head of research resigns. http://www.ethlife.ethz.ch/archive_articles/090921_Peter_Chen_Ruecktritt_MM/index_EN.
[10] The Executable Paper Grand Challenge. http://www.executablepapers.com, 2011.
[11] S. Fomel and J. Claerbout. Guest editors' introduction: Reproducible research. Computing in Science & Engineering, 11(1):5–7, 2009.
[12] M. J. Franklin, A. Y. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4):27–33, 2005.
[13] GenBank. http://www.ncbi.nlm.nih.gov/genbank.
[14] D. Koop, E. Santos, P. Mates, H. Vo, P. Bonnet, B. Bauer, B. Surer, M. Troyer, D. Williams, J. Tohline, J. Freire, and C. Silva. A provenance-based infrastructure to support the life cycle of executable papers. In Proceedings of the International Conference on Computational Science, pages 648–657, 2011.
[15] Madagascar. http://www.reproducibility.org/wiki/Main_Page.
[16] S. Manegold, I. Manolescu, L. Afanasiev, J. Feng, G. Gou, M. Hadjieleftheriou, S. Harizopoulos, P. Kalnis, K. Karanasos, D. Laurent, M. Lupu, N. Onose, C. Re, V. Sans, P. Senellart, T. Wu, and D. Shasha. Repeatability & workability evaluation of SIGMOD 2009. SIGMOD Record, 38(3):40–43, 2009.
[17] I. Manolescu, L. Afanasiev, A. Arion, J. Dittrich, S. Manegold, N. Polyzotis, K. Schnaitter, P. Senellart, S. Zoupanos, and D. Shasha. The repeatability experiment of SIGMOD 2008. SIGMOD Record, 37(1):39–45, 2008.
[18] P. Mates, E. Santos, J. Freire, and C. Silva. CrowdLabs: Social analysis and visualization for the sciences. In Proceedings of SSDBM, 2011. To appear.
[19] J. Mesirov. Accessible reproducible research. Science, 327(5964):415–416, 2010.
[20] myExperiment. http://www.myexperiment.org.
[21] nanoHUB. http://nanohub.org.
[22] H. Nguyen, T. Nguyen, H. Nguyen, and J. Freire. Querying Wikipedia documents and relationships. In Proceedings of WebDB, 2010.
[23] Nobel laureate retracts two papers unrelated to her prize. http://www.nytimes.com/2010/09/24/science/24retraction.html?_r=1&emc=eta1, September 2010.
[24] It's science, but not necessarily right. http://www.nytimes.com/2011/06/26/opinion/sunday/26ideas.html?_r=2, June 2011.
[25] C. E. Scheidegger, H. T. Vo, D. Koop, J. Freire, and C. T. Silva. Querying and creating visualizations by analogy. IEEE Transactions on Visualization and Computer Graphics, 13(6):1560–1567, 2007.
[26] SIAM Mini-symposium on Verifiable, Reproducible Research and Computational Science. http://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=11845.
[27] Repeatability section of ACM SIGMOD 2011. http://www.sigmod2011.org/calls_papers_sigmod_research_repeatability.shtml.
[28] Computational repeatability: Tuning case study. http://effdas.itu.dk/repeatability/tuning.html.
[29] VisTrails. http://www.vistrails.org.
