SMARTSkills Workshop
Reproducible Analyses and Data Management
Coilín Minto
Marine and Freshwater Research Centre
Galway-Mayo Institute of Technology
October 24th, 2013
Outline
1 Reproducible Analyses
2 Data management
3 Example database project
4 Summary
1 Reproducible Analyses
Research is reproducible if it can be reproduced by others
Baggerly and Berry (2011)
Image source: http://www.therooms.ca
Emigration to Newfoundland
[Figure: population growth rate (1/year) plotted against catch per man (quintals/man) and catch per boat (quintals/boat) for Trinity Bay; points are labelled by year, 1720 to 1827.]
Myers (2001)
Emigration to Newfoundland
Analysis was:
• Conducted in 2000
• Run on a Sun server with documented (READMEs) folders containing:
  • Data
  • Text
  • Analysis code (S-Plus)
• Archived
Emigration to Newfoundland
Year 2009: Contacted by a Norwegian researcher wishing to re-run the analysis, but the sole author (RAM) had very unfortunately passed away in 2007.
In many cases, this would signal the end of the line, and we go back to collating the data all over again or forget about it.
But in this case, three steps:
$ ssh server
$ cd relevant_folder
$ make
recovered the complete analysis, including figure and table preparation, dynamically linked to a fresh write-up
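The single `make` call works because the chain from data to write-up is declared once and rebuilt on demand. As a rough illustration only (this is not the original project's build file, and the function and file names are hypothetical), the timestamp rule that make applies can be sketched in Python: a target is rebuilt when it is missing or older than any of its inputs.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency.

    This mirrors the basic timestamp rule used by make; it is a sketch,
    not a replacement for make itself.
    """
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    # Any input modified after the target means the target is stale.
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)
```

A call such as `needs_rebuild("doc/figures/fig1.pdf", ["data/raw.csv"])` (hypothetical paths) would then decide whether the figure script has to be re-run.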
Getting the structure right
analyses
  project name
    doc
      figures
      ms
      tables
    data
    R
      functions
      scripts
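A folder skeleton like this is easiest to keep consistent across projects if it is created by code rather than by hand. The sketch below is one possible way to do that in Python; the nesting (doc holding figures, ms and tables; R holding functions and scripts) is an assumed reading of the slide's diagram, and `create_project` is a hypothetical helper.

```python
from pathlib import Path

# Assumed reading of the slide's folder diagram (hypothetical layout).
SUBFOLDERS = [
    "data",
    "doc/figures",
    "doc/ms",
    "doc/tables",
    "R/functions",
    "R/scripts",
]


def create_project(root, project_name):
    """Create the folder skeleton under root/analyses/project_name."""
    project = Path(root) / "analyses" / project_name
    for sub in SUBFOLDERS:
        # exist_ok=True makes the call safe to repeat on an existing project.
        (project / sub).mkdir(parents=True, exist_ok=True)
    return project
```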
Fastidious data management is paramount for reproducibility
Image source: moods of norway
2 Data management
What’s data?
Ultimately, a stored array of electrical charges, but I like to think of data as the map and mode of transport that gets you from the start of a research project or program to the final product
Image source: http://www.deviantart.com
What’s data?
It’s not just a spreadsheet!
Data encompasses
• Metadata on what the work was about (who, what, where, when and why?)
• Records: measurements, dates, treatments, etc.
• Code: data extraction and analysis
• Results (value-added collections of records): figures, tables, calculations
• Reporting: documents, mark-up
Losing our way
In science we often lose our map and mode of transport via:
• Damage to files or storage device: “Error: cannot open ...”
• Purported storage device ageing or becoming redundant: “That was three laptops ago”
• Software change: house of punch-cards
• Personnel change: “They left with the laptop”
• Bounce to the next project
Why do some scientists treat data poorly?
Among other reasons:
• Incentive potentially lacking in a highly competitive publishing arena
• Focus on the publication as the self-contained product of the business
• Data husbandry viewed as diminishing returns
• Shoulders of giants mis-interpreted
• Illusion of ownership
Why these reasons don’t cut the mustard now
Among other reasons:
• Large collaborative initiatives consisting of many sub-projects necessitate data management
• Journal publishing ethics changing and valuing data husbandry, e.g.,
• Debes PV, Fraser DJ, McBride MC, Hutchings JA (2013) Multigenerational
hybridisation and its consequences for maternal effects in Atlantic salmon.
Heredity 111: 238-247. doi:10.1038/hdy.2013.43
• Debes PV, McBride MC, Fraser DJ, Hutchings JA (2013) Data from:
Multigenerational hybridisation and its consequences for maternal effects in
Atlantic salmon. Dryad Digital Repository. doi:10.5061/dryad.9cs2v
• Granting bodies requesting data management planning
Data management
ONCE COLLECTED AND ELECTRONICALLY ENTERED, DON’T TOUCH THE DATA!
Tempting as it might be to fire up a spreadsheet and start creating worksheets and pasting specially, this will only lead to data woes.
To avoid wondering whether
data new.xls
or
data updated.xls
is the relevant copy, leave the data in the data folder or repository alone
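One way to check that the raw data really has been left alone is to record a checksum for each file when it is first archived and compare against it later. The Python sketch below assumes a flat data folder; the function names are hypothetical and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(data_dir):
    """Map each file name in data_dir to its checksum."""
    return {
        p.name: file_checksum(p)
        for p in sorted(Path(data_dir).iterdir())
        if p.is_file()
    }


def changed_files(data_dir, recorded):
    """List files added, removed or modified since `recorded` was taken."""
    current = record_checksums(data_dir)
    return sorted(
        name
        for name in set(current) | set(recorded)
        if current.get(name) != recorded.get(name)
    )
```

An empty list from `changed_files` indicates that every archived file still matches its recorded checksum.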
Data management: Spreadsheet Tales
“In the process of copying, the scribes made (deliberately or otherwise) changes, which were themselves copied.”
Barbrook et al. (1998). Nature (394) p.839.
Data management: solution
All data manipulations should be done programmatically
• Read raw data in analytical software
• Subset, remove, adjust via code
• Leaves a reproducible trail and
• Leaves the original (hard-won) data intact
• Pipe results dynamically into your document (e.g., Sweave, knitr)
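The workflow in this list can be sketched as follows. The talk points to tools such as Sweave and knitr; Python is used here purely for illustration, and the column names (`year`, `catch`) and function names are hypothetical. The raw file is only ever read, every exclusion is expressed in code, and derived data goes to a separate file.

```python
import csv


def read_records(path):
    """Read the raw file into a list of dicts; the file is opened read-only."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def clean_records(records, value_field="catch"):
    """Drop rows with a missing value and convert the rest to float."""
    cleaned = []
    for row in records:
        if row[value_field].strip() == "":
            continue  # the exclusion is recorded in code, not by deleting cells
        new_row = dict(row)
        new_row[value_field] = float(row[value_field])
        cleaned.append(new_row)
    return cleaned


def write_records(records, path):
    """Write derived records to a separate output file."""
    if not records:
        raise ValueError("no records to write")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
```

Because the cleaning step is a function, re-running it on the untouched raw file reproduces the derived data exactly.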
A contention
My over-arching contention with the status quo is that an individual’s laptop or PC is not an acceptable research environment, as it:
• Risks complete data loss
• Fosters the “Chaucer” effect (more later)
• Is anti-collaborative
• Is license hungry and therefore costly
• Is less powerful, slower
A back-to-the-future solution
Need to return to a common research environment - the server
Image source: http://my.opera.com
A back-to-the-future solution
Inasmuch as we have the focal point of the wet-lab to process specimen samples, we should have a central place for data storage and processing, as it:
• Keeps single copies of data centrally
• Has a longer life than the project
• Has a longer life than the researchers (??)
• Gives everyone equal access to high-performance architecture (no need for a new laptop, just use laptop for )
• Managed centrally
A back-to-the-future solution
Many institutes have servers, but these are rarely used as a common research environment outside of the physical sciences
But the coming of age of high-performance computing now necessitates that we make the move back
Example: data-poor stock status
• FAO and Conservation International project to globally assess status of “data-poor” stocks
• Two research teams from 8 different countries
• Had to work in a central environment: hexagon.bccs.uib.no in Bergen, Norway.
Example: data-poor stock status
• 576 (scenarios) x 10 (iterations) x 4 (methods) = 23,040 stock assessments
• For agreed convergence level (MCMC, SIR) requires 19.5 CPU years on single processor
• Completed work in walltime of 7.5 days on Hexagon cluster
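The figures on this slide can be checked with a few lines of arithmetic. The sketch below (Python, for illustration) reproduces the assessment count and derives two numbers that are not on the slide: the average number of processors implied by the wall time, and the CPU time per assessment. Both assume a 365.25-day year and perfectly efficient parallel use, so they are rough indications only.

```python
scenarios, iterations, methods = 576, 10, 4
total_assessments = scenarios * iterations * methods

cpu_years = 19.5        # from the slide: single-processor estimate
walltime_days = 7.5     # from the slide: elapsed time on the cluster
days_per_year = 365.25  # assumption: average year length

cpu_days = cpu_years * days_per_year
# Average concurrency needed to fit the CPU time into the wall time.
implied_processors = cpu_days / walltime_days
cpu_hours_per_assessment = cpu_days * 24 / total_assessments

print(total_assessments)                   # 23040
print(round(implied_processors))           # 950
print(round(cpu_hours_per_assessment, 1))  # 7.4
```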
3 Example database project
Original database
Ransom Myers’ Stock Recruitment Database
• Approximately 640 stocks.
• Used in many publications on fish population dynamics, e.g.
  • Relationship between recruitment and spawning stock size
  • Density dependence
  • Depensation (Allee effects)
  • Productivity rates across taxa
  • Patterns of depletion and recovery
• Housed in flat text files
• Archived (no longer updated) version available from: http://www.mscs.dal.ca/~myers/welcome.html
Why an updated database?
• Many stocks 15 years out of date
• New data often at:
  • Low population levels
  • Reduced fishing intensities
[Figure: biomass (1000 tonnes) by year, 1985 to 2005, for stock COD2J3KL]
• Interest in:
  • Effects of exploitation on trends in abundance across taxa from many ecosystems
  • Efficacy of harvest policies
  • Recovery trajectories post fishing mortality reductions
• Relational database to support reproducible analyses
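To make the contrast with flat text files concrete, here is a minimal sketch of a relational layout using Python's built-in sqlite3 module. The two-table schema, the table and column names, and the rows are all hypothetical placeholders for illustration; they are not the actual database's design or data.

```python
import sqlite3


def build_example_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    # Ask SQLite to enforce the foreign-key link between the tables.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE stock (
            stock_id TEXT PRIMARY KEY,
            species  TEXT NOT NULL
        );
        CREATE TABLE biomass (
            stock_id TEXT NOT NULL REFERENCES stock(stock_id),
            year     INTEGER NOT NULL,
            tonnes   REAL,
            PRIMARY KEY (stock_id, year)
        );
        """
    )
    # Placeholder rows for illustration only; not real assessment values.
    conn.execute("INSERT INTO stock VALUES ('STOCK_A', 'Example species')")
    conn.executemany(
        "INSERT INTO biomass VALUES (?, ?, ?)",
        [("STOCK_A", 2000, 100.0), ("STOCK_A", 2001, 90.0)],
    )
    conn.commit()
    return conn


def series_for(conn, stock_id):
    """Return (year, tonnes) rows for one stock, ordered by year."""
    query = (
        "SELECT b.year, b.tonnes FROM biomass AS b "
        "JOIN stock AS s ON s.stock_id = b.stock_id "
        "WHERE s.stock_id = ? ORDER BY b.year"
    )
    return conn.execute(query, (stock_id,)).fetchall()
```

Every analysis then pulls its input through a query such as `series_for(conn, "STOCK_A")`, so the extraction step is itself part of the reproducible trail.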
Geographic coverage
[Figure: map of geographic coverage; legend classes: 1-4, 5-9, 10-19, 20-29, 30+]
Temporal coverage: orca plots
[Figure: assessment count by year, 1850 to 2000, in panels A, B and C, each with an inset histogram of time-series span (years, 10 to 110) against frequency]
Taxonomic coverage
[Figure: circular taxonomic tree of the assessed species under Animalia, with labelled groups Arthropoda, Mollusca, Chondrichthyes, Perciformes, Scorpaeniformes, Clupeiformes, Pleuronectiformes and Gadiformes]
356 assessments
Order               N    %
Gadiformes         71   20
Perciformes        66   19
Pleuronectiformes  57   16
Scorpaeniformes    45   13
Clupeiformes       36   10
Invertebrates      42   12
Used in 27 publications since inception in 2009
http://depts.washington.edu/ramlegac/
4 Summary
Summary
• Reproducibility is a central component of science
• To date, our general approach to data has been poor, bordering on careless
• Scale of the problems and collaborations now necessitates change for the better
• A laptop/desktop is not a research environment
• Data management increasingly recognized
• Putting in the spade work of data management can reap good rewards
Acknowledgements
Paulha McGrane, John Boyd and the organizing committee
Julia Baum
Deirdre Brophy
Olaf Jensen
Ray Hilborn
Rick Officer
Daniel Ricard
Conservation International
University of Washington
FAO
Marine and Freshwater Research Centre, GMIT, Galway
Dalhousie University, Nova Scotia
Baggerly, KA and Berry, DA (2011). Reproducible Research. Amstat News, January 2011.
Myers, RA (2001). Testing ecological models: the influence of catch rates on settlement of fishermen in Newfoundland, 1710-1833. Research in Maritime History, 21, 13-29.