Date posted: 14-Jul-2015
Category: Science
Upload: gigascience-bgi-hong-kong
0000-0001-6444-1436
@SCEdmunds
NEW MODEL
Open data
publishing
Scott Edmunds
Balti Bioinformatics
The problems with publishing
• "Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible" --- Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995
• Lack of transparency, lack of credit for anything other than 350-year-old-style “dead tree” publication
• Traditional publishing policies and practices are a hindrance (licensing & access, embargoes, Ingelfinger, closed doors, anti-granularity & forking)
The consequences: growing replication gap
1. Ioannidis et al. (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 142.
2. Ioannidis JPA (2005). Why Most Published Research Findings Are False. PLoS Med 2(8).
Out of 18 microarray papers, results from 10 could not be reproduced
Consequences: increasing number of retractions (>15X increase in last decade)
At the current rate of increase, by 2045 as many papers will be retracted as are published
1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html
2. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
STAP paper demonstrates problems:
Nature Editorial, 2nd July 2014:
“We have concluded that we and the referees could not have detected the problems that fatally undermined the papers. The referees’ rigorous reports quite rightly took on trust what was presented in the papers.”
http://www.nature.com/news/stap-retracted-1.15488
STAP paper demonstrates problems:
Need:
• …to publish protocols BEFORE analysis
• …better access to supporting data
• …more transparent & accountable review
• …to publish replication studies
• Review
• Data
• Software
• Models
• Pipelines
• Re-use…
= Credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
New incentives/credit
Not just carrots…
“The data discovery index (DDI) enabled through bioCADDIE is to do for data what PubMed (and PubMed Central) did for the literature.”
Things we need to reward
[Diagram: Methods, Answer, Metadata, Software, Analysis (Pipelines), Workflows/Environments, Idea, Study, Data; each rewarded via Publication, DOI, etc.]
1. Transparency: Open peer review
The only drawback?
End reviewer 3 Downfall parody videos, now!
Publons + AcademicKarma = credit for reviewers' efforts
http://publons.com/
http://academickarma.org/
Reward pre-prints
http://tmblr.co/ZzXdssfOMJfy
arXiv + blogged reviews = real-time open-review
2. Data: Reward Open Data
IRRI GALAXY
Rice 3K project: 3,000 rice genomes, 13.4TB public data
2. (Big) Data
2. Data: Reward Intermediate Data
Nanopore MinION E. coli genome released via GigaDB 10-Sep-2014
Curated & converted to ISA-tab, & worked with EBI to get raw data there
Data Note submitted & preprint version out 26th September
Peer reviewed & published 20th October
2. Data: Reward Faster Data Release
http://www.gigasciencejournal.com/content/3/1/22
Real time sequencing era needs real time publication!
• Used as test data for “minoTour”: real time data analysis tools for MinION data
• Nanopore data already used in (CC0 GitHub based) teaching materials
• Next stop…Erratums, Updates & more (see later)
1. minoTour http://minotour.nottingham.ac.uk/
2. https://github.com/lexnederbragt/INF-BIOx121_fall2014_de_novo_assembly
OMERO: providing access to imaging data
Already used by JCB.
View, filter, measure raw images with direct links from journal article.
See all image data, not just cherry picked examples.
Download and reprocess.
2. Data: Reward Imaging Data
The alternative...
...look but don't touch
3. Software
https://www.change.org/p/everyone-in-the-research-community-we-must-accept-that-software-is-fundamental-to-research-or-we-will-lose-our-ability-to-make-groundbreaking-discoveries
galaxy.cbiit.cuhk.edu.hk
4. Workflows: Reward Sharing of Workflows
Visualisations & DOIs for workflows
http://www.gigasciencejournal.com/series/Galaxy
• Can facilitate reproducibility, reuse & sharing with tools like: knitr, Sweave, IPython Notebook
5. Open Documents: Reward Open/Dynamic Workbooks
5. Virtual Machines
?http://ivory.idyll.org/blog/vms-considered-harmful.html
http://dx.doi.org/10.5524/100106
http://www.gigasciencejournal.com/content/3/1/23
Taking a microscope to the publication process
How reproducible can we get?
Data sets: linked to DOI
Analyses: linked to DOI
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18
>33,000 accesses & 270 citations
Open-Code
7 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-Pipelines
Open-Workflows
DOI:10.5524/100038
Open-Data
78GB CC0 data
Code on SourceForge under GPLv3: http://soapdenovo2.sourceforge.net/
>36,000 downloads
Enabled code to be picked apart by bloggers in a wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2
Post publication: bloggers pull apart code/reviews in blogs + wiki:
SOAPdenovo2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2
Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/
Reward open & transparent review
SOAPdenovo2 workflows implemented in
galaxy.cbiit.cuhk.edu.hk
Implemented entire workflow in our Galaxy server, inc.:
• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post-processing step
• Evaluation and visualization tools
Can we reproduce results? SOAPdenovo2 S. aureus pipeline
The SOAPdenovo2 Case study
Subject to and test with 3 models:
Types of resources in an RO:
• Data
• Method/Experimental protocol
• Findings
Models to describe each resource type:
• Wfdesc/ISA-TAB/ISA2OWL
See: http://biorxiv.org/content/early/2014/12/08/011973
1. While there are huge improvements to the quality of the resulting assemblies, other than in the tables it was not stressed in the text that SOAPdenovo2 can be slightly slower than SOAPdenovo v1.
2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer.
3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer.
4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
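The corrections above are stated as fold differences in N50. For readers unfamiliar with the metric, N50 is the length L such that sequences of length L or longer contain at least half of all assembled bases. Below is a minimal Python sketch of that calculation and of a fold comparison. The helper names `n50` and `fold_difference` are hypothetical (not part of any SOAPdenovo tooling), and the toy lengths are illustrative only, not values from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def fold_difference(new_value, old_value):
    """How many times larger new_value is than old_value."""
    return new_value / old_value


# Toy assemblies (hypothetical numbers, NOT from the SOAPdenovo2 paper).
toy_new = [80, 70, 50, 40, 30, 20]   # total 290 bases
toy_old = [10] * 29                  # total 290 bases, far more fragmented

print(n50(toy_new))                                  # 70
print(n50(toy_old))                                  # 10
print(fold_difference(n50(toy_new), n50(toy_old)))   # 7.0
```

Re-running a calculation like this against the published tables is the kind of check that surfaced the mismatches between text and table 2 listed above.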
Lessons Learned
• Most published research findings are false. Or at least have errors
• It is possible to push button(s) & recreate a result from a paper
• Reproducibility is COSTLY. How much are you willing to spend?
• Much easier to do this before rather than after publication
The cost of staying with the status quo?
• Ioannidis estimates that 85% of research resources are wasted.
• Each retraction estimated to cost $400,000.
In Summary
• Make your data, software & other ROs open (CC0, OSI)
• Get credit for your reviewing
• Publish your research objects (with us!)
www.gigasciencejournal.com
@gigascience
facebook.com/GigaScience
Thanks to:
Our collaborators: Ruibang Luo (BGI/HKU), Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST)
Our team: Peter Li, Chris Hunter, Jesse Si Zhe, Rob Davidson, Nicole Nogoy, Laurie Goodman, Amye Kenall (BMC)
Case study: Marco Roos (LUMC), Mark Thompson (LUMC), Jun Zhao (Lancaster), Susanna Sansone (Oxford), Philippe Rocca-Serra (Oxford), Alejandra Gonzalez-Beltran (Oxford)
CBIIT
Funding from:
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
blogs.biomedcentral.com/gigablog/