Sandra Gesing
Center for Research Computing [email protected]
12 February 2016
Usability, Reusability and Reproducibility of Bioinformatic Applications
University of Notre Dame
http://chartsbin.com/view/1124
• In the middle of nowhere in northern Indiana (1.5 h from here)
• 4 undergraduate colleges
• ~35 research institutes and centers
• ~12,000 students
Center for Research Computing
• Software development and profiling
• Cyberinfrastructure/science gateway development
• Geographical Information Systems
• Visualization support
• Computational Scientist support
• Collaborative research/grant development
• System administration/design and acquisition
• ~40 researchers, research programmers, HPC specialists
CRC and OIT building http://crc.nd.edu
Center for Research Computing
• Computational resources: 25,000+ cores
• Storage resources: 3 PB
• Visualization systems
• Systems for virtual hosting
• Prototype architectures, e.g., Docker, OpenStack
• Access and interface to
  • XSEDE
  • Open Science Grid
  • Blue Waters
CRC HPC Center (old Union Station)
Bioinformatics
• Genomics
• Proteomics
• Metabolomics
• Immunomics
• Systems biology
• Molecular simulations
• Docking
• Epidemiology
• …
Black Swallowtail – larvae and butterfly
The Genomics Boom
February 16, 2001 biotech company Celera
February 15, 2001 The Human Genome Project
Big Data
• Explosion in the quantity, variety and complexity of data
• Questions can now be answered that were impossible to even ask about 10 years ago
• Costs far reduced (e.g., Human Genome Project: 15 years, ~$2 billion; today: ~3 days, $1000)
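As a rough worked example of the cost figures above, the following Python sketch computes the reduction factors. The dollar and time values are the approximate ones quoted on the slide; the 365-day year and all variable names are assumptions made for illustration.

```python
# Approximate figures quoted above; not precise accounting.
HGP_COST_USD = 2_000_000_000   # Human Genome Project, ~$2 billion
HGP_DAYS = 15 * 365            # ~15 years, assuming 365-day years
TODAY_COST_USD = 1_000         # ~$1000 per genome today
TODAY_DAYS = 3                 # ~3 days today

# Ratio of past to present cost and duration.
cost_factor = HGP_COST_USD / TODAY_COST_USD
time_factor = HGP_DAYS / TODAY_DAYS

print(f"Cost reduced by a factor of about {cost_factor:,.0f}")
print(f"Time reduced by a factor of about {time_factor:,.0f}")
```

Under these assumptions the cost drops by roughly six orders of magnitude and the time by roughly three.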
State of the Art
• Data and compute-intensive problems
• High-speed networks
• Users generally not IT specialists
• Tools and workflow engines
• Web-based agile frameworks
• Distributed data and computing infrastructures
Challenge for Developers
• Data and compute-intensive problems
• High-speed networks
• Tools and workflow engines
• Web-based agile frameworks
• Distributed data and computing infrastructures
• Users generally not IT specialists
Need for intuitive and self-explanatory user interfaces!
Usability
“After all, usability really just means making sure that something works well: that a person … can use the thing - whether it's a Web site, a fighter jet, or a revolving door - for its intended purpose without getting hopelessly frustrated.” (Steve Krug in “Don't Make Me Think!: A Common Sense Approach to Web Usability”, 2005)
Reusability
“The key to productivity is reusability. The easiest way to produce code is obviously to have it already!” (John R. Bourne in “Object-oriented Engineering: Building Engineering Systems Using Smalltalk-80”, 1992)
Reproducibility
“The closeness of agreement between independent results obtained with the same method on identical test material but under different conditions (different operators, different apparatus, different laboratories and/or after different intervals of time)…” (IUPAC (International Union of Pure and Applied Chemistry, iupac.org) Gold Book)
Science Gateways
“A Science Gateway is a community-developed set of tools, applications, and data that is integrated via a portal or a suite of applications, usually in a graphical user interface, that is further customized to meet the needs of a specific community.” TeraGrid/XSEDE
Science Gateways
It’s a Science Gateway
It’s a Research Portal
It’s a Collaboratory
It’s a Cyberinfrastructure
It’s e-Science / eResearch
It’s a Virtual Lab
Science Gateway Technologies
• Agile web frameworks (AngularJS, Semantic UI)
• Content management systems (Drupal)
• Libraries for implementation (Django)
• Science gateway frameworks (Galaxy, WS-PGRADE, Catania Science Gateway Framework, HubZero)
• Static layout
• Layout extendable
• Workflow-enabled
• APIs for implementation (Apache Airavata, Agave, Vine Toolkit)
Development of Science Gateways
Crucial Topics
• Close collaboration with user communities
• Knowledge about available technical solutions
Sounds easy but…
• Requirements of user communities often not so clear
• Technologies sometimes still under development for certain building blocks
→ Slow uptake of solutions
→ Larger effort for creating science gateways
New Science Gateways - Checklist
Organizational Aspects
Technical Aspects
Domain-Specific Aspects
Developers
Domain Experts
New Science Gateways - Checklist
Domain-specific aspects:
• Goal, target area and target users
• Visions/demands on the layout
• Priorities of features and options, e.g., a list from must-have to great-to-have options
• Integration of existing applications or development of applications
• Technologies of the applications
• Visualization
• Security demands
• Workflows
New Science Gateways - Checklist
Organizational aspects:
• Time constraints for the development, agreement on a (maybe even rough) project plan with milestones
• Agreement on alpha or beta testers
• Regular meetings
New Science Gateways - Checklist
Technical aspects:
• Experience with existing frameworks and programming languages
• Available infrastructure including security infrastructure and resources
• Available support of suitable technologies
• Scalability of suitable technologies
• Effort for extending existing technologies compared to novel developments
• Synergy effects with other science gateway projects
Science Gateways
A new era…
• Novel developments of web-based agile frameworks
• Infrastructure providers report that science gateways are used more than command lines
http://www.iplantcollaborative.org
But also always new challenges…
• Novel infrastructures
• Novel data sources such as the next Next-Gen Sequencing
→ Support of developers necessary
Science Gateway Institute
2012 NSF Software Institute conceptualization award
2015 NSF Software Institute implementation proposal ($15M)
Services
• Incubator
• Developer support team
• Gateway framework directory
• Workforce development
http://sciencegateways.org
Science Gateway Survey 2014
• 29,000-person survey
• 4957 responses from across domains
Bioinformatic Infrastructure Survey
• Nick Loman (Birmingham, UK)
• Thomas Connor (Cardiff, UK)
• October 2015
• 272 answers
https://drive.google.com/drive/folders/0B7KZv1TRi06fLUJCU1BYM3JScjg
Bioinformatic Infrastructure Survey
Figure (bar chart): Where do bioinformaticians do most of their work? Categories: Cloud, Institution-wide resource, Local resource, Personal computer. Axis: 0 to 120.
Figure (bar chart): Why do bioinformaticians use the software they use? Categories: Best for job, Good documentation, Word-of-mouth recommendation, Used in similar analysis, Quickest, Already installed on server, Other, Graphical interface. Axis: 0% to 90%.
Bioinformatic Infrastructure Survey
Questions about frustrations and limitations of using
• Bioinformatic software
• Bioinformatic resources
• HPC and cloud infrastructures
and about challenges in training students in bioinformatics
Answers often address
• Hurdles to using bioinformatic resources because of command-line access or unavailable software
• Quality of software documentation
• Need for parsers and converters for diverse data formats
• Long waiting times for support or even lack of support
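To illustrate the kind of parser and converter the survey answers ask for, here is a minimal Python sketch that converts FASTQ records to FASTA. It is an illustration only, not a tool named in the survey: it assumes the simple four-line FASTQ layout (no wrapped sequences), and the function names `parse_fastq` and `fastq_to_fasta` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from 4-line FASTQ records."""
    # Drop blank lines and trailing newlines before grouping into records.
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    for i in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield header[1:], sequence, quality


def fastq_to_fasta(lines: Iterable[str]) -> str:
    """Convert FASTQ text to FASTA text, discarding quality scores."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in parse_fastq(lines))


example = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
print(fastq_to_fasta(example), end="")
```

Real data often needs more care (compressed files, multi-line records, very large inputs), which is part of why such converters are a recurring request.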
Challenges
A world-wide research computing infrastructure
• Transparent service selection
  • e.g., Docker could be part of the solution
• Access to data irrespective of location
• Options to share data efficiently
• Appropriate privacy and security measures
• Optimized usage of resources
  • e.g., optimized usage of cloud computing and its business models
Challenges
Integration of data sources and instruments
• Different data formats
• Different interfaces
• Different hardware and technologies
… from small ones to the big ones…
Challenges
Software searchability, reproducibility and reusability
• Science gateways are a step in the right direction but …
much more work is necessary on searchability…
Not only finding any data for a research area but finding the right data
• Metadata approaches
• Dictionaries
• More involvement of librarians
Challenges
Software searchability, reproducibility and reusability
• Science gateways are a step in the right direction but …
much more work is necessary on reproducibility and reusability…
• Studies in medicine and pharmacology: 11% or 6% of the analysed research was reproducible
• myExperiment: only 20% of workflows reusable because of dependencies on hardware, local or distributed data, software versions
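One small step against the software-version dependencies listed above is to record the execution environment alongside each result. The following Python sketch shows the idea using only the standard library; the function name `capture_environment` and the choice of recorded fields are assumptions for illustration, not a method described in the talk.

```python
import json
import platform
from importlib import metadata


def capture_environment(packages):
    """Return a dict describing the interpreter, platform and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


# Hypothetical package list; replace with what the workflow actually imports.
snapshot = capture_environment(["numpy", "pip"])
print(json.dumps(snapshot, indent=2))
```

Storing such a snapshot next to workflow outputs does not make a workflow reusable by itself, but it documents which versions a rerun would need to match.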
Challenges
Software searchability, reproducibility and reusability
• Science gateways and workflow systems are a step in the right direction but …
much more work is necessary on reproducibility and reusability…
• Containerization approaches
• Migration approaches
• Combination of both
Projects - OSF
• Big Data
• Reproducibility
Open Access to Data and Projects could solve parts of the problems…
Projects - WSSSPE
Need for foundational building blocks and a reward system for software engineering!
https://github.com/wssspe
Early adopters
Publicity
Wider adoption
Funding ends
Scientists disillusioned
New project prototype
Projects – B3 Book
Biology, Bioinformatics and Big Data
arXiv:1511.02689 [cs.DC]
EU COST Action cHiPSet (IC1406)
cHiPSet – High Performance Modeling and Simulation for Big Data Applications
• April 2015 – April 2019
• 15 countries - 12 COST, 3 non-COST (US, China, Australia)
• 37 research organizations/companies (31 COST, 6 non-COST)
http://www.cost.eu/COST_Actions/ict/Actions/IC1406
cHiPSet - Collaborations
Projects that declared interest in collaboration
• NESUS (Network for Sustainable Ultrascale Computing) http://www.nesus.eu/
• KEYSTONE (Semantic keyword-based search on structured data sources) http://www.keystone-cost.eu/
• AAPELE (Algorithms, Architectures and Platforms for Enhanced Living Environment) http://aapele.eu/
And maybe YOU?
Information on Science Gateways
• Science Gateway Workshops
  Europe: IWSG - http://iwsg.info
  USA: GCE - http://sciencegateways.org
  Australasia: IWSG-A - http://iwsg.info
• Science Gateway Institute http://sciencegateways.org
• IEEE Technical Area on Science Gateways http://ieeesciencegateways.org
• XSEDE Science Gateways https://www.xsede.org/gateways-overview
• CRC Science Gateways https://crc.nd.edu/index.php/research/gateways