Reproducible Research with Stata

using version control, GitHub, and MarkDoc

E. F. Haghish

Nov. 17th, 2016


Reproducible Analysis

Overview: Definition

Figure 1: Reproducible Analysis

To do so, we will need:

- the same version of the software, data, and code (and the same OS, depending on the software)
- literate programming software

Figure 2: version control and literate programming also imply coding the analysis

- the analysis should be reproduced with identical software
  - we should be able to access the software without requesting it from the author.
- the data, code, and software should be publicly accessible
  - all versions of the software used for running the analysis should be accessible.
- archiving older versions becomes crucial. For example, the Statistical Software Components (SSC) archive does not keep different versions of a package, in contrast to CRAN
- for developing computational programs, version control becomes much more important for fixing bugs and cooperating on the software

Concerns about package archiving

While the idea and importance of archiving versions is clear, some users may have concerns such as:

1 having access to different versions of a piece of software might confuse users and lead them to install old software

2 it can also confuse users about where they should install their software from

3 some would argue that we simply don’t need to archive older software because there is no use for it

4 software updates fix bugs; what is the point of using previous versions if we know they are buggy?

5 what is the point of reproducing the same results, using the same software version, when we know they are bugged?

GitHub for Stata community

GitHub is a general platform that is used for a variety of purposes:

1 sharing data
2 sharing code
3 developing and collaborating on software
4 hosting software for R, Stata, ...
5 archiving software versions
6 documenting software, using the GitHub Wiki
7 reading code within the browser

Learning GitHub

- Using GitHub has a learning curve
- Using the GitHub desktop application can considerably reduce the learning curve.
- GitHub has a desktop GUI for Windows and Mac. Linux users have several third-party software options
  - I recommend SmartGit for Linux users
- When using GitHub, you still write and update your code on your computer. Once you have made a change, you can register your commit on your machine (via the app or the command line), and when you are through, you can push it to the repository on the GitHub website. Therefore, the workflow for programming does not change much.
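
The commit-and-push cycle described above can also be run without leaving Stata, through Stata's shell command. The following is a minimal sketch, assuming Git is installed and on the system path, that Stata's working directory is a local clone of your repository, and that a remote is already configured. The file name and commit message are hypothetical placeholders.

```stata
* Sketch only: assumes git is on the PATH and the current working
* directory is a local clone with a configured remote.
* "myprogram.ado" and the commit message are hypothetical placeholders.

* List the files changed since the last commit
shell git status

* Stage the file you edited
shell git add myprogram.ado

* Register the commit on your machine
shell git commit -m "Fix option parsing"

* Publish the local commits to the repository on GitHub
shell git push
```

The desktop application performs the same three steps (stage, commit, push) behind its buttons, so either route leaves the repository in the same state.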

Figure 3: a screenshot of the github package on my local drive, where programming takes place

Figure 4: once you’re done with coding, commit the changes and push them to GitHub


Figure 5: viewing the history of changes

The github package

- It’s similar to the ssc command in Stata, but it is used for searching, installing, and uninstalling Stata packages hosted on GitHub.
- The package can be installed from GitHub using:

. net install github, from("https://raw.githubusercontent.com/haghish/github/master/")

- Such a net install command is usually required for installing any Stata package hosted on GitHub, but the github command makes life easier in many ways

Examples

- let’s search for a package named markdoc on GitHub, using the github search command followed by the keyword
- this first searches for all repositories named markdoc that have Stata as their language and are installable packages (have the pkg and toc files in the repository)
- the output shows a description of the package, along with its dependencies, which will be installed automatically

. github search markdoc

--------------------------------------------------------------------------------
repository  Author   Install  Description
--------------------------------------------------------------------------------
MarkDoc     haghish  Install  A literate programming package for Stata
                     3937k    which develops dynamic documents, slides,
                              and help files in various formats
                              homepage: http://haghish.com/markdoc
                              Hits:49 Stars:5 Lang:Stata (Depend)
--------------------------------------------------------------------------------

- The github command allows package authors to specify the dependencies of a package, which are installed automatically after the package.
- The dependencies are declared in a file named dependency.do, which includes the code for installing either a particular version of each dependency or its latest version. Pinning a particular version ensures that the package works as its author expects and that recent development of the dependency packages does not yield unexpected results.
- You can install the package with a mouse click, or type github install followed by the username/repository name:

. github install haghish/markdoc

- executing the command shows that markdoc installs the weaver package, and that the weaver package installs another package called statax, which is its own dependency
- having the option to install dependencies allows authors to break their packages into pieces, which allows others to rely on the smaller pieces in their programs. The version option makes it safe to use a particular version of a package.
- that also means more citations


The versions are in fact GitHub releases, which are so easy to make

Figure 6: Viewing the software releases on GitHub

Clicking on the releases button opens a page where all the previous releases are listed, the fixed bugs are explained, and you can download the old as well as the newest source code

Figure 7: Creating a new release


Figure 8: publishing the new release

Accessing releases via Stata

- Once a new release is made on GitHub, or the package master is updated, the new version becomes available to all users instantly.
- You can view all of the available versions using the github query command followed by the username/repository:

. github query haghish/markdoc

----------------------------------------
Version    Release Date    Install
----------------------------------------
3.8.8      2016-11-16      Install
3.8.7      2016-11-10      Install
3.8.6      2016-11-10      Install
3.8.5      2016-10-16      Install
3.8.4      2016-10-13      Install
3.8.3      2016-10-03      Install
3.8.2      2016-10-01      Install
3.8.1      2016-09-29      Install
3.8.0      2016-09-24      Install
3.7.9      2016-09-20      Install
3.7.8      2016-09-19      Install
3.7.7      2016-09-18      Install
3.7.6      2016-09-13      Install
3.7.5      2016-09-08      Install
3.7.4      2016-09-07      Install
3.7.3      2016-09-06      Install
3.7.2      2016-09-05      Install
3.7.0      2016-08-23      Install
3.6.9      2016-08-16      Install
3.6.7      2016-02-27      Install
----------------------------------------

- Clicking on the Install text installs the corresponding version, including any of the previous ones
- Alternatively, we can use the version(tag) option to install any version. The tag is the version that we specify for each release. For example, version 3.8.7 of MarkDoc (an old version) can be installed as follows:

. github install haghish/markdoc, version(3.8.7)

- the same procedure can be used in the dependency.do file to install a particular version of a package
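
A dependency.do file for a hypothetical package might look like the sketch below, which pins one dependency and tracks the latest release of another. The repository names haghish/weaver and haghish/statax are assumed from the package names mentioned in this talk, and the version tag is a placeholder that must be replaced with a tag reported by github query.

```stata
* dependency.do (sketch; the repository names and the tag are assumptions)

* Pin a dependency to one release so that later changes cannot alter results.
* Replace 1.0.0 with a tag listed by:  github query haghish/weaver
github install haghish/weaver, version(1.0.0)

* Install the latest available version of another dependency
github install haghish/statax
```

Pinning trades convenience for stability: the package keeps working as tested, but bug fixes in the dependency arrive only when the author updates the tag.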

Other github subcommands

- you can uninstall a package, which only requires the repository name

. github uninstall markdoc

- you can check whether a repository is installable. This confirms that the packagename.pkg and stata.toc files exist in the repository. The github search command also carries out this check and only shows the Install text if the package is installable

. github check haghish/markdoc
stata.toc file was found
pkg file was found
haghish/markdoc is installable

- You can view the Stata packages that are popular, and you have plenty of options for searching different repositories. Try:

. github hot

. github hot, n(30)

. github hot, all

. github hot, all language(Python)

- the data is available on GitHub: https://raw.githubusercontent.com/haghish/github/master/data/archive.dta
- you can build a fresh archive of Stata repositories on GitHub at any time; it takes about 10 minutes to run. The command creates a dataset with the given name.

. github list stata, language(all) in(all) all save(archive) append
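
Because the archive is an ordinary Stata dataset, it can be opened directly from the URL given above. This is a minimal sketch, assuming the file is still hosted at that address and an internet connection is available. The variable names in the dataset are not documented in this talk, so the example only inspects its structure.

```stata
* Sketch only: assumes the archive is still available at this URL.
use "https://raw.githubusercontent.com/haghish/github/master/data/archive.dta", clear

describe          // list the variables stored for each repository
count             // number of repositories in the archive
```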

Literate Programming

Reproducible documentation: Idea

Figure 9: Literate Programming Process

- The main idea is to make the code more readable and well-written by preparing it for others to read and comprehend. The document is only a byproduct.
- Literate programming must not be reduced to generating dynamic documents!
- It is meant to:
  1 make reading and comprehending source code and data analysis code easier by including the documentation
  2 make the analysis and documentation reproducible
  3 make writing documentation easier

- The documentation is written inside the code file to make the code more understandable; therefore, the readability of the code is central to the literate programming paradigm
- the markup language used for documentation should be as simple as possible, to avoid unnecessary complications in the source code
- the markup language should not impose a learning curve
- using HTML and LaTeX for documentation is only popular among nerds

Workflow

- The same workflow is used in statistics for data analysis
- The documentation is specified with a special notation
  - Therefore, the source code is not directly sourceable
- The most popular programs only include the Weave process.

Workflow

This workflow can be improved in a variety of ways:

- Interactive literate programming
- Real-time weaving
- Supporting different markup languages
- Supporting all documentation features required by statisticians
- Producing documents in various formats, using the same source
- Thinking about procedures to improve the readability of the code
- Keeping the source code sourceable

Making literate programming convenient

- Real-time document update in different formats
  - Docx, HTML, LaTeX, PDF, slides, etc.
- A simple GUI to facilitate working with the packages
- Support for LaTeX mathematical notation in all document formats
- Automated process for capturing, saving, and including figures in the document
- Creating automatic layouts for markup languages (improving readability)
- Ensuring that the code files remain sourceable

MarkDoc package

- MarkDoc is a general-purpose literate programming package for Stata
- it supports Markdown, LaTeX, and HTML
- it provides a holistic approach to reproducible documentation
- it can produce various formats from the same source:
  - Microsoft Office Docx
  - OpenOffice ODT
  - LaTeX
  - HTML
  - Beamer slides
  - Web-based HTML slides
  - epub
  - Stata help files (sthlp)
  - Stata package vignette in all formats mentioned above


Figure 10: MarkDoc includes several engines for processing do, smcl,ado, and mata files

MarkDoc’s features

- MarkDoc supports an interactive workflow
  - using the smcl log as the source allows you to view the document in any of the formats after making a change, without re-executing the whole code
  - using the do file as the source requires re-executing the whole source code to generate the document
  - using ado and mata files, which are used for programming, only extracts the documentation for generating Stata help files and package vignettes.
- Script files written for MarkDoc remain sourceable in Stata, i.e. the analysis can be executed as usual in Stata, because the documentation is written in a special comment format that is only meaningful to MarkDoc
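
The smcl-log workflow described above can be sketched as follows. This is a minimal illustration, assuming MarkDoc and its dependencies are installed, along with any third-party software MarkDoc relies on for the requested formats. The log name example and the auto dataset are placeholders chosen for the illustration.

```stata
* Sketch only: assumes markdoc is installed (github install haghish/markdoc).
* "example" is a placeholder log name.

quietly log using example, replace smcl   // record the session in example.smcl

/***
Fuel efficiency
===============

A short look at the auto dataset that ships with Stata.
***/

sysuse auto, clear
summarize price mpg

quietly log close

* The smcl log is now the source: produce documents in different formats
* without re-running the analysis above.
markdoc example, export(html) replace
markdoc example, export(docx) replace
```

Because the documentation sits inside an ordinary Stata comment, the same lines run unchanged in Stata with or without MarkDoc; only the two markdoc calls depend on the package.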

MarkDoc’s features

- has a very simple method for capturing and including graphs in the dynamic document
- supports writing dynamic text
- supports creating dynamic tables conveniently
- defines markers for keeping the output document simple, while preserving all of the analysis code and results:
  1 //ON and //OFF for activating and deactivating the results of a code chunk
  2 /**/ for hiding a command
  3 /***/ for hiding output
  4 //IMPORT filename for importing documentation from external files, which helps to keep the do-files clean
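
The markers listed above might be combined in a do-file along the following lines. This is a hedged sketch based only on the descriptions given on this slide: the placement of each marker, the file name notes.md, and the use of the auto dataset are assumptions for illustration, so check the MarkDoc manual for the exact rules.

```stata
* Sketch only: marker placement follows the slide's descriptions and is
* an assumption; notes.md is a hypothetical external documentation file.

* Pull documentation text from an external file into the document
//IMPORT notes.md

* Keep the commands and results of this chunk out of the document
//OFF
sysuse auto, clear
generate gp100m = 100 / mpg
//ON

* Hide the command itself
/**/ summarize price mpg gp100m

* Hide the output
/***/ regress price mpg weight
```

Every marker is a valid Stata comment, so the file still runs as a plain do-file; the markers only change what MarkDoc places in the exported document.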

Installation

- MarkDoc is hosted on SSC as well as GitHub, and can be installed from GitHub with:

. github install haghish/markdoc

- Since Feb 2016, all of the releases of MarkDoc are accessible via GitHub:

. github query haghish/markdoc

- an analysis that was documented with an older version of MarkDoc remains reproducible, because you can install that older version of MarkDoc to generate the documentation

Documentation

The documentation is written as a comment, between the /*** and ***/ comment signs. For example, your do file could look like:

. stata command

/***
Markdown heading 1
==================

Markdown heading 2
------------------

documentation text.
***/

. stata command

- Markdown syntax is explained on the GitHub Wiki, which holds the manual of the package: https://github.com/haghish/MarkDoc/wiki/Markdown-tutorial
- examples of the MarkDoc package can be found on GitHub as well: https://github.com/haghish/MarkDoc/tree/master/Examples

Learning MarkDoc

- MarkDoc comes with a graphical user interface, which makes learning the package much easier.
- Typing db markdoc opens the dialog box
- The dialog box has 3 independent tabs; each works with a particular engine and has its own options.


Figure 11: MarkDoc GUI

Applications in research

Reproducible research software can improve analysis transparency by:

- Documenting the process of data analysis
- Allowing the whole analysis to be reproduced step by step
- Embedding the interpretation
- Automating the reporting of results and eliminating untraceable human errors
- Combining the results, figures, and all of the interpretations in a publication-ready document that ideally should be available in various document formats (Docx, LaTeX, PDF, OpenOffice ODT).

Applications in education

Reproducible research tools can be used by students to:

- Create notes for themselves within statistical packages
- Document their code
- Read and comprehend code that is written by others in the same fashion
- Practice data analysis in a more disciplined way

Applications in education

- Reproducible research tools are very useful for teaching statistics at any level, for teachers, students, and programmers
- Teachers can use them for creating educational materials from the same “source files”. The source files can be used to produce:
  - Presentation slides
  - Handouts
  - HTML documents for websites or blog posts
  - eBooks
- Documents with different levels of documentation can be produced, e.g. slides and detailed handouts can be produced from the same source by specifying how and to what extent the documentation should be included
- The documentation can be reused in other formats, which encourages practicing literate programming

Applications in statistical programming

Programmers can benefit from this paradigm for documenting their own code:

- Making it more comprehensible for others
- Encouraging others to read their code
- Using the documentation to produce various documentation formats
  - Help files, package vignettes, etc.

Reading

If you found this talk interesting, you can expect a book that covers the topics of this presentation in much more detail and with more examples: https://leanpub.com/reproducible/

Figure 12: Reproducible Research With Stata

