Distributed Tools Deployment and Management for Multiple Galaxy Instances in Globus Genomics

Dinanath Sulakhe, Computation Institute, Argonne National Laboratory and University of Chicago, Chicago, IL 60637 USA

Nilesh Kavthekar, Department of Computer Engineering, University of Pennsylvania, Philadelphia, PA 19104 USA

Ravi Madduri, Computation Institute, Argonne National Laboratory and University of Chicago, Chicago, IL 60637 USA

Alex Rodriguez, Computation Institute, Argonne National Laboratory and University of Chicago, Chicago, IL 60637 USA

Amol Parikh, School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, 380015, India

Paul Dave, Computation Institute, Argonne National Laboratory and University of Chicago, Chicago, IL 60637 USA

Nick Prozorovsky, Department of Computer Engineering, University of Illinois at Champaign-Urbana, Champaign, IL 61820 USA

Lukasz Lacinski, Computation Institute, Argonne National Laboratory and University of Chicago, Chicago, IL 60637 USA

Ian Foster, Computation Institute, Argonne National Laboratory and University of Chicago, Chicago, IL 60637 USA

ABSTRACT

Workflow systems play an important role in the analysis of the fast-growing genomics data produced by low-cost next-generation sequencing (NGS) technologies. Many biomedical research groups lack the expertise to assemble and run the sophisticated computational pipelines required for high-throughput analysis of such data. There is an urgent need for services that let researchers run analytical workflows in which they define their own research methodologies by selecting the tools of interest to them. We present the challenges associated with managing multiple Galaxy instances on the cloud for various research groups using Globus Genomics, a cloud-based platform-as-a-service (PaaS) that provides the Galaxy workflow system as a hosted service along with data management capabilities using Globus Online. We describe these challenges, our strategy, and a tool for automatically deploying and managing hundreds of analytical tools coming from the public Galaxy Tool Shed, new tools wrapped by our group, and tools wrapped by end users across multiple Galaxy instances hosted with Globus Genomics.

Categories and Subject Descriptors

H.m [Information Systems]; C.1.4 [Computer Systems Organization]: Parallel Architectures - Distributed architectures; J.3 [Computer Applications]: Life and Medical Sciences - Biology and genetics

General Terms

Design, Management, Performance, Economics, Reliability, Human Factors, Standardization.

Keywords

Globus Genomics, Globus Online, Galaxy, Galaxy Tool Shed, data transfer, data management, grid, cloud, next-generation sequencing, translational medicine.

1. INTRODUCTION

The availability of next-generation sequencing (NGS) and third-generation sequencing methodologies [1] has drastically reduced the cost [2] of sequencing exomes and whole genomes to just a few thousand dollars. Affordable sequencing capabilities allow a growing number of biomedical research groups to have patient DNA and RNA sequenced with the goal of gaining a better understanding of the molecular mechanisms involved in the disease phenotypes of interest. This has led research groups to generate and handle enormous amounts of data [3]. However, these groups often lack the advanced computational infrastructure needed to manage and analyze this type of data. The analysis of NGS data involves running many third-party tools, which continue to grow in number and capability to address the evolving needs of NGS analysis. Most of these tools are computationally intensive and require considerable computational expertise for their installation, execution, and optimization.

The Galaxy workflow system [4] helps to meet many of these challenges by providing an intuitive web-based platform for wrapping analytical tools in a form that permits easy invocation, creating complex computational pipelines that combine multiple such tools, and running applications and pipelines in a cluster or cloud environment. Galaxy has been widely adopted in recent years, especially as a growing number of NGS analysis tools have been made available in the Galaxy workflow environment. The Galaxy Tool Shed [5] allows users to share analytical tools and wrappers with other users, and to import tools and wrappers into their own local Galaxy instance for evaluation and use. While a standalone Galaxy instance can be useful to individual researchers, it introduces two major challenges: (1) the local compute resources available to research groups are insufficient to analyze the large amounts of sequence data they generate, so these groups require a Galaxy instance that can use scalable computational resources; and (2) administering a Galaxy instance and managing hundreds of complex genomic tools within Galaxy can be extremely cumbersome and requires computational expertise beyond that of most groups. Hence, there is a need for a well-managed, hosted Galaxy service that addresses these needs. A large number of individual research groups can benefit greatly from well-managed infrastructures that provide transparent, scalable, and reproducible computational analysis.

(c) 2013 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. WORKS’13, November 17–21, 2013, Denver, CO, USA. Copyright 2013 ACM 978-1-4503-2502-8/13/11…$15.00. http://dx.doi.org/10.1145/2534248.2534259

Many commercial solutions, such as DNANexus [6], Seven Bridges Genomics [7], Spiral Genetics [8], Maverix Biomics [9], and others, have emerged in recent times to provide cloud-based services for the analysis of NGS data. Globus Genomics [10], developed by our group at the Computation Institute, is unique in providing Galaxy as a service on the cloud. Globus Genomics uses commodity cloud compute resources such as Amazon Web Services (AWS) [11] and provides a hosted service integrating Galaxy with Globus Online [12]. In this context, Galaxy provides a flexible workflow definition environment, whereas Globus Online provides services such as high-performance, reliable, and secure file transfer as well as user/group management. Globus Genomics, as a hosted service, provides virtually unlimited compute resources and a powerful platform for various types of users. However, in delivering this type of solution, one significant challenge needs to be addressed: the management and deployment of newly wrapped tools as well as different versions of existing tools. Given the breadth of available tools (Galaxy instances launched within Globus Genomics currently host more than 500 tools, including tools imported from the public Galaxy Tool Shed and tools wrapped by our group), as well as the addition of many more tools to the Galaxy Tool Shed every day, a tool deployment and management strategy is clearly needed.

In this paper we discuss our tool deployment strategy and the “Automated Tools Deployment Manager” (ATDM) that we have implemented in Globus Genomics to manage the installation and deployment of newly wrapped tools and their dependencies from various public and local Galaxy tool repositories onto multiple Galaxy workflow systems hosted on AWS. The paper also highlights the use cases and challenges involved in hosting and managing multiple Galaxy instances on the cloud.

The remainder of the paper is organized as follows. Section 2 provides background on Globus Genomics, Galaxy, and the Galaxy Tool Shed and discusses challenges in tool management and deployment. Section 3 describes how we address these challenges, and Section 4 presents use cases that illustrate the value delivered by this approach. Section 5 reviews future plans, and we conclude in Section 6 with a brief summary.

2. GLOBUS GENOMICS: GALAXY PAAS

The value of a Reproducible Research System [13] that makes it easy for non-computational researchers to verify their results has already been established [14]. Reproducible Research Environments backed by scalable, on-demand cloud computing resources can provide a powerful system for researchers to carry out their everyday research experiments and, at the same time, publish their results using the system. Globus Genomics brings together several powerful components, namely Galaxy, Globus Online, and the elastic computational infrastructure provided by Amazon Web Services (AWS), to build a cloud-based platform as a service (PaaS) with which researchers can easily execute their computational research tasks and manage their research data.

Globus Genomics allows researchers to seamlessly move data between distributed locations (sequencing centers, collaborator institutions, local and remote storage, etc.) and the location(s) at which computation is to be performed. In the case of Globus Genomics, data can be moved into AWS for analysis. In conjunction, Galaxy gives users the ability to intuitively create new analytical workflows from scratch or choose from a variety of best-practice pipelines. The individual tools or pipelines can be run at scale using AWS elastic resources. Globus Genomics is currently being used by many research labs and hosts a large collection of publicly available tools as well as private tools for the analysis of NGS and genomic data. The applications include scripts and parsers written by various users in the Globus Genomics community. While Globus Genomics has been extremely useful as a hosted service for many research labs, it does require proper setup and handling of each group’s platform preferences. One key area specifically involves managing the tools wrapped in Galaxy.

2.1 Tools Management in Galaxy

Manual installation of a tool can be cumbersome. Administrators must determine and account for all dependencies and software required by the targeted application, as well as any data dependencies, such as reference genomes or other databases, required by the manually installed tools. The public Galaxy Tool Shed [5] simplifies these tasks for publicly available tools, allowing any user to wrap a tool and publish it for the general community to consume.

The Mercurial-based Tool Shed can build the appropriate binaries, as well as intelligently identify, install, and manage necessary dependencies. Each tool repository (a collection of tools) in the Tool Shed has a tool_dependencies.xml file in which one can specify other dependencies that need to be installed along with the specified tool. For example, as shown in Figure 1, RSeQC [15], an RNA-seq quality control tool wrapped for the Galaxy Tool Shed, depends on both the RSeQC binary itself and SAMtools [16]. As SAMtools is already included in the Galaxy Tool Shed as its own tool repository (provided by the Galaxy development team), the RSeQC tool_dependencies.xml can reference the SAMtools repository, creating a soft link to wherever the SAMtools executable was installed by the Galaxy Tool Shed. The tool_dependencies.xml also includes the instructions to download and build RSeQC. After both the RSeQC and SAMtools repositories are installed, executing the RSeQC tool will automatically load any environment variables set or appended by the SAMtools tool_dependencies.xml.

Figure 1: The dependency install tree for an example Galaxy tool, RSeQC
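To make the dependency mechanism described above more concrete, the following minimal sketch parses a simplified tool_dependencies.xml and separates packages that are delegated to another repository (as with SAMtools) from packages that carry their own build instructions (as with RSeQC). The XML fragment, version numbers, repository name, owner, and download URL are illustrative assumptions and are not copied from the actual RSeQC repository.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: names, versions, owner, and URL are assumptions,
# not the contents of the real RSeQC repository in the Tool Shed.
TOOL_DEPENDENCIES_XML = """
<tool_dependency>
    <package name="samtools" version="0.1.18">
        <repository name="package_samtools_0_1_18" owner="some_owner" />
    </package>
    <package name="rseqc" version="2.3.7">
        <install version="1.0">
            <actions>
                <action type="download_by_url">http://example.org/RSeQC-2.3.7.tar.gz</action>
                <action type="shell_command">python setup.py install</action>
            </actions>
        </install>
    </package>
</tool_dependency>
"""


def summarize_dependencies(xml_text):
    """Return (delegated, built) lists of (name, version) pairs.

    delegated: packages satisfied by referencing another repository.
    built:     packages that carry their own download/build instructions.
    """
    root = ET.fromstring(xml_text)
    delegated, built = [], []
    for package in root.findall("package"):
        entry = (package.get("name"), package.get("version"))
        if package.find("repository") is not None:
            delegated.append(entry)
        elif package.find("install") is not None:
            built.append(entry)
    return delegated, built


delegated, built = summarize_dependencies(TOOL_DEPENDENCIES_XML)
print("delegated to other repositories:", delegated)
print("built from instructions:", built)
```

Separating the two cases mirrors the split in Figure 1: one branch of the install tree is resolved by an existing repository, the other by building the binary.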

Tight integration of the Galaxy workflow system with the Tool Shed makes it easy for an administrator to install a tool from a Tool Shed directly into a Galaxy instance; the Tool Shed and Galaxy then handle all software dependencies automatically. However, while the Galaxy Tool Shed provides a large and growing collection of tools, most research labs also have in-house tools that they need for their analyses in addition to publicly available tools. Thus, tool management remains a challenge for individual Galaxy users.

2.2 Challenges

While managing a single instance of Galaxy for an individual user or research lab can be done with reasonable effort, it requires computational expertise that most scientific groups either lack or would have to acquire at the expense of their research. A cloud-based hosted service such as Globus Genomics can be useful for researchers, but it brings many challenges for the service provider. It can be challenging to meet all user requirements, including making sure that users' private tools are available in their own instance with the correct versions. Retaining flexibility is also critical. It is important that users have complete control over their toolkit and the flexibility to choose their tools of interest, as well as to wrap their own tools and make them available in their instance.

While we are exploring a multi-tenant architecture to support different groups in Globus Genomics, we currently deploy a dedicated instance of Galaxy on AWS for each group. Each group has different needs in terms of the number of applications required to perform analysis and each application can have different demands on the amount of memory and processors required.

Globus Genomics administrators should be able to manage the multiple instances from a centralized location rather than one at a time. This is key to reducing the time required to maintain multiple instances. A major component of maintaining a Galaxy instance is tool deployment. While installing a tool on a single instance via the Galaxy Tool Shed can be quick and simple, it becomes cumbersome to install the same set of applications on multiple Galaxy instances, since this task must be performed one tool and one instance at a time. A greater challenge arises when users introduce their own application and need it to be available in their Galaxy instance. Currently, the user must share the application, scripts, wrappers, and any dependencies with the Galaxy administrator, who must then manually install the tool in the instance. Furthermore, if the user allows the tool to be shared with other Galaxy instances, the administrator must manually install the tool in each of those instances individually.

In order to address these challenges and provide complete control to users in managing their own set of tools, we have implemented and integrated an administrative tool within Galaxy that provides a simple yet powerful framework for tool deployment and management. We leverage the Galaxy and Galaxy Tool Shed APIs and have built a deployment tool (the “Automated Tools Deployment Manager”) that allows us to select one or more tools from one or more Tool Sheds and programmatically install those tools on multiple Galaxy instances in Globus Genomics with error reporting and detailed logs.

3. OUR APPROACH

In building an administrative tool to meet the needs described above, we considered several requirements: provide a standard interface for users and administrators to wrap and publish their tools; provide a standard mechanism for deploying these tools on any Galaxy instance with appropriate privileges; allow for testing of tools and their deployment on a test instance before deploying on a production instance; and allow for the rapid deployment of multiple tools on multiple Galaxy instances. By leveraging Galaxy’s Tool Shed concept and implementing a deployment tool (ATDM), we have developed an intuitive and flexible tool installation and automated deployment strategy within Globus Genomics that supports all of these features from a simple web UI. We have installed ATDM in an Admin Galaxy instance that manages tool deployment on the other Galaxy instances created for individual research groups in Globus Genomics.

As mentioned earlier, we leverage both the Galaxy API and the Galaxy Tool Shed to provide a deployment tool that allows for the selection of one or more tools from multiple Tool Sheds and the programmatic installation of those tools on multiple Galaxy instances in the Globus Genomics environment. We use Galaxy Tool Sheds (the public Tool Shed and private Tool Sheds hosted by Globus Genomics) as the distributed tool repositories. We maintain a private Globus Genomics Tool Shed for applications developed in-house. Likewise, Globus Genomics users can install their own private Tool Shed for developing, testing, and maintaining their own applications (and tool versions). Additionally, we use the public Galaxy Tool Shed, where hundreds of tools are stored and shared by the biomedical community. It is expected that tools will be installed from a variety of sources onto a variety of platforms and architectures. The Tool Shed framework helps automate this process.

The Galaxy Tool Shed provides a comprehensive RESTful API that offers a programmatic mechanism to list all of the repositories and tools available in the Tool Shed and allows for remote installation of tools. We implemented the Python-based ATDM tool for Globus Genomics to invoke the Tool Shed API and obtain a listing of the tools it contains. The listing allows us (administrators) to select one or more tools and supply a Galaxy instance URL as well as a valid administrator API key for that Galaxy instance. The ATDM then uses the Tool Shed REST API to initiate installation of the tools on the selected Galaxy instance. Optional arguments for the ATDM include the ability to install tool dependencies and repository dependencies, as well as to install tools into a new or existing tool panel/folder on the left side of the Galaxy user interface.

Figure 2: Various steps involved in the installation of tools coming from distributed Tool Sheds and installed on multiple Galaxy instances within Globus Genomics
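The installation step described above can be sketched as the construction of a single POST request. This is a minimal illustration, not the ATDM source: the endpoint path and payload keys are assumptions modeled on the Galaxy API of the period and should be checked against the Galaxy release in use, and the instance URL, API key, Tool Shed URL, owner, and revision hash are hypothetical placeholders. The request is built but deliberately not sent, so the sketch runs without a Galaxy server.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint path and payload keys; verify both against the
# documentation of the Galaxy release actually deployed.
INSTALL_ENDPOINT = "/api/tool_shed_repositories/new/install_repository_revision"


def build_install_request(galaxy_url, api_key, tool_shed_url, name, owner,
                          changeset_revision,
                          install_tool_dependencies=True,
                          install_repository_dependencies=True,
                          new_tool_panel_section_label=None):
    """Build (but do not send) the POST request that installs one repository."""
    payload = {
        "tool_shed_url": tool_shed_url,
        "name": name,
        "owner": owner,
        "changeset_revision": changeset_revision,
        "install_tool_dependencies": install_tool_dependencies,
        "install_repository_dependencies": install_repository_dependencies,
    }
    # Optional: place the tool in a new section of the tool panel.
    if new_tool_panel_section_label:
        payload["new_tool_panel_section_label"] = new_tool_panel_section_label
    query = urllib.parse.urlencode({"key": api_key})
    url = galaxy_url.rstrip("/") + INSTALL_ENDPOINT + "?" + query
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical instance URL, API key, Tool Shed URL, owner, and revision.
request = build_install_request(
    "https://galaxy.example.org/", "ADMIN_API_KEY",
    "https://toolshed.example.org", "rseqc", "some_owner", "0123456789ab",
    new_tool_panel_section_label="RNA-seq QC",
)
print(request.get_method(), request.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Keeping request construction separate from sending makes the optional arguments (tool dependencies, repository dependencies, tool panel section) easy to inspect and test before anything is installed.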

As part of our effort, we wanted to take full advantage of the intuitive Galaxy user interface, namely its ability to take programs and tools originally run from a command line and present them in an easy, user-friendly manner. To this end, we wrapped our ATDM as a new tool within our Admin Galaxy instance, which provides a familiar, easy-to-use capability. Thus, an Admin Galaxy instance is used to manage other Galaxy instances. The web interface shown in Figure 3 illustrates how one can select multiple Tool Sheds by their URLs and specify multiple Galaxy instances on which the tools are to be deployed. For each Tool Shed that is selected, we get a listing of all the tools that it contains; we can then select tools for deployment from each Tool Shed. The ATDM then loops through all of the supplied instances, installing each specified tool on each instance with the POST request shown in Figure 2. The tool deployment is run as a regular Galaxy job, and at the end of the execution it generates a detailed log report of all the transactions, any failures, and the total time taken for each tool deployment. The report can be viewed from within the Galaxy UI.

We implement the actions of listing the tools on a Tool Shed and installing the tools as two distinct tasks in order to incorporate ATDM into the Galaxy UI as a wrapped tool. The tools info grabber fetches all tool names, their owners, and other details from a specified Tool Shed. It then writes these details to a text file in the tool-data directory for the ATDM, within the Admin Galaxy instance. The ATDM’s XML wrapper can then read options from this updated or newly created file and display them to the user in a filtered, multi-selectable list, making tool installation all the more convenient. The user can select any number of tools from this list; the script then takes each selected tool name and owner, separates them, makes an API call to get a valid revision number, and makes a POST request to install the tool. Administrators still have the option of entering tool names manually if they prefer to do so.
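The loop and log report described above can be sketched as follows. This is an illustration under stated assumptions rather than the ATDM implementation: the "name/owner" entry format and its separator are assumed, and the two callables `get_revision` and `install` are hypothetical stand-ins for the Tool Shed revision lookup and the Galaxy install request, so the sketch runs without any server.

```python
import time


def parse_selection(selection, separator="/"):
    """Split a selected list entry such as 'rseqc/owner_a' into (name, owner).

    The separator is an assumption; the source only says the two are separated.
    """
    name, owner = selection.split(separator, 1)
    return name.strip(), owner.strip()


def deploy(selections, instances, get_revision, install):
    """Install every selected tool on every instance and collect a log.

    get_revision(name, owner) returns a revision string (stand-in for the
    Tool Shed API call); install(instance, name, owner, revision) performs
    the installation and raises on failure. Both are supplied by the caller.
    """
    report = []
    for instance in instances:
        for selection in selections:
            name, owner = parse_selection(selection)
            started = time.time()
            try:
                revision = get_revision(name, owner)
                install(instance, name, owner, revision)
                status = "ok"
            except Exception as error:  # record the failure and keep going
                status = "failed: %s" % error
            report.append({
                "instance": instance,
                "tool": name,
                "owner": owner,
                "status": status,
                "seconds": round(time.time() - started, 3),
            })
    return report


# Stand-in callables so the sketch runs without a Galaxy server.
def fake_get_revision(name, owner):
    return "rev-" + name


def fake_install(instance, name, owner, revision):
    if name == "broken_tool":
        raise RuntimeError("dependency build failed")


report = deploy(
    ["rseqc/owner_a", "broken_tool/owner_b"],
    ["https://lab1.example.org", "https://lab2.example.org"],
    fake_get_revision,
    fake_install,
)
for row in report:
    print(row["instance"], row["tool"], row["status"], row["seconds"])
```

Recording a failure and continuing, rather than aborting, matches the goal of a single report that covers every tool and instance in one run.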

ATDM makes tool installation management considerably easier, as many tools can be installed across different Galaxy instances with just the click of a button. This approach has helped us standardize the complete process of tool deployment for the distributed model of Globus Genomics by providing an interface in the form of a private Tool Shed for our users to wrap their own tools and publish them in their Tool Shed. In addition, it allows us (Globus Genomics Administrators) to easily import and deploy those tools on a separate instance of Galaxy for testing. After the tools are tested and validated, we can deploy the tools into a production instance and make them available for the research group.
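The test-then-production flow described above can be illustrated with a short sketch. It assumes two hypothetical hooks: `install`, standing in for an ATDM deployment to one instance, and `validate`, standing in for whatever testing the administrators perform (which may well be manual in practice). Instance and tool names are placeholders.

```python
def promote(tools, test_instance, production_instance, install, validate):
    """Install tools on the test instance; promote only those that validate.

    install(instance, tool) and validate(instance, tool) are hypothetical
    hooks supplied by the caller. Returns (promoted, held_back).
    """
    promoted, held_back = [], []
    for tool in tools:
        install(test_instance, tool)
        if validate(test_instance, tool):
            install(production_instance, tool)
            promoted.append(tool)
        else:
            held_back.append(tool)
    return promoted, held_back


# Stand-in hooks that only record what would have happened.
installed = []


def record_install(instance, tool):
    installed.append((instance, tool))


def passes_validation(instance, tool):
    return tool != "experimental_caller"


promoted, held_back = promote(
    ["consensus_caller", "experimental_caller"],
    "test-instance", "production-instance",
    record_install, passes_validation,
)
print("promoted:", promoted)
print("held back:", held_back)
```

A tool that fails validation is never installed on the production instance, which is the property the staged workflow is meant to guarantee.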

4. USE CASES

The combination of Galaxy’s Tool Shed model and our Automated Tools Deployment Manager implemented in Globus Genomics has helped us put in place an automated, dynamic, and flexible tool management strategy for multiple Galaxy instances hosted on the cloud. We have leveraged Galaxy’s Tool Shed model to allow our users to contribute their own tools, and we have implemented ATDM to allow Globus Genomics administrators to deploy the tools automatically. The following use cases elaborate on each perspective.

4.1 End User Use Case

While many cloud-based analytical services (mentioned in Section 1) either take a black-box approach or restrict their users to a predefined set of analytical tools, the use of Galaxy gives Globus Genomics the flexibility to let its users select from hundreds of tools available on the public Tool Shed. Additionally, our tool deployment strategy allows our users to contribute their own tools, with the flexibility to keep them private on their Galaxy instance or publish them for broader community use. For example, the Cox lab is one of the users of Globus Genomics and has its own private Galaxy instance hosted in Globus Genomics. Members of the Cox lab at the University of Chicago have been developing a Single Nucleotide Polymorphism (SNP) consensus caller for their exome analysis pipeline. The application calls SNPs using multiple variant callers and reduces the number of false positives called by the primary variant callers by creating a consensus of all variants. This tool has been shown to significantly reduce the number of false positive variants called. While the group is actively developing and improving the tool, they wanted it to be made available in their Galaxy instance. The Tool Shed based approach simplified this process. We created a private local Tool Shed and helped the group to wrap their tool and install it into their private Tool Shed with appropriate versioning. The group maintains the different versions of this tool that have evolved over the last few months.

Figure 3: A screenshot of the Tools Installer implemented in Globus Genomics for the distributed deployment of tools

Once the Cox lab made the consensus caller tool available in their private Tool Shed, the Globus Genomics administrators deployed the tool into the Cox Galaxy instance using our ATDM UI for immediate use. We were able to deploy multiple versions of their tool over a period of time without getting involved in their development process or in the wrapping of the tools. More importantly, the Globus Genomics tool management strategy allows users to be in control of their research, permitting them to select their own analysis applications and to provide their own tools that meet their specific needs.

Using the Galaxy Tool Shed, our users can easily share their developed applications with other members of the Globus Genomics ecosystem or publish them to the public Galaxy Tool Shed for the rest of the biomedical community to consume. This feature allows the greater community to take advantage of the great work being done by individual users, which is clearly evident from the hundreds of tools in use within the Galaxy community.

4.2 Administrator Use Case

Managing a workflow system such as Galaxy as a service on the cloud is challenging for an administrator. It can be even more challenging when managing multiple instances of the system for different research groups, as is the case with Globus Genomics. The Galaxy Tool Shed based tool deployment strategy in Globus Genomics allows administrators to easily manage hundreds of genomics tools across multiple Galaxy instances. It standardizes and simplifies the process of wrapping new tools and deploying them via a Tool Shed instead of manually adding the tools to the Galaxy instance. Globus Genomics currently hosts tens of Galaxy instances on AWS cloud resources, supporting many research groups. Here we present a scenario with two groups, the Dobyns lab at the University of Washington and the Cox lab at the University of Chicago. The Dobyns lab was interested in creating an exome analysis pipeline in Galaxy following the Broad Institute’s best practices for variant detection using GATK [17]. All the tools required for this analysis were already available in the public Galaxy Tool Shed [5]. The Cox lab was also interested in an exome analysis pipeline, but one using multiple different variant callers, as described in Section 4.1. Not all of these tools were already wrapped and publicly available; hence we had to wrap some of them, and the Cox lab also wrapped the consensus caller into a private Tool Shed created for them. Once the Galaxy instances were created for each group, we used the ATDM tool UI to install all the tools for the Dobyns lab by selecting the public Galaxy Tool Shed as a source, without handling any tools or their dependencies manually. For the Cox lab instance, we selected the required tools from the public Galaxy Tool Shed and the private Tool Shed and installed them automatically on the Cox lab Galaxy instance.
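The two-lab scenario above can be expressed as a small declarative deployment plan, sketched below. Everything concrete in it is a placeholder: the instance and Tool Shed URLs are hypothetical, and the tool names other than GATK and the consensus caller are assumptions added only to make the example non-trivial.

```python
# Hypothetical URLs; the real instances and Tool Sheds differ.
PUBLIC_SHED = "https://public-toolshed.example.org"
COX_PRIVATE_SHED = "https://cox-toolshed.example.org"

# instance URL -> Tool Shed URL -> tools to install from that Tool Shed.
# "bwa" and "picard" are assumed examples; only GATK and the consensus
# caller are named in the text.
DEPLOYMENT_PLAN = {
    "https://dobyns.galaxy.example.org": {
        PUBLIC_SHED: ["gatk", "bwa", "picard"],
    },
    "https://cox.galaxy.example.org": {
        PUBLIC_SHED: ["gatk", "bwa"],
        COX_PRIVATE_SHED: ["consensus_caller"],
    },
}


def expand_plan(plan):
    """Flatten the plan into (instance, tool_shed, tool) installation tasks."""
    return [
        (instance, shed, tool)
        for instance, sheds in plan.items()
        for shed, tools in sheds.items()
        for tool in tools
    ]


tasks = expand_plan(DEPLOYMENT_PLAN)
for instance, shed, tool in tasks:
    print(instance, "<-", tool, "from", shed)
```

Each resulting task corresponds to one installation that an administrator would otherwise perform by hand, one tool and one instance at a time.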

Additionally, our tool deployment strategy using ATDM helps preserve separation of concerns, such that wrapping, testing, and deployment can be handled separately by Globus Genomics admins with the appropriate expertise.

5. FUTURE WORK

We will continue to expand the capabilities of our Automated Tools Deployment Manager for maintaining and monitoring multiple instances. One limitation of the Galaxy Tool Shed is that all dependencies of an application must reside in the same Tool Shed. It would be beneficial for ATDM to be able to search for and install dependencies from different Tool Sheds. ATDM is currently used only by the Globus Genomics administrators; however, it would be useful to give members of the Globus Genomics user community the ability to manage their tools from a simple web UI without the administrator’s intervention. We will work towards adding this functionality.

We are also exploring a new multi-tenant architecture to support multiple user groups within Globus Genomics and will continue to adapt the current tools deployment strategy for new architectures.

6. CONCLUSION

Biomedical research groups face substantial data management and data analysis challenges due to the large volume of genomic sequence data they generate. Galaxy has emerged as a popular tool for configuring and running genome sequence analysis pipelines. However, Galaxy users have until now had to choose between running on an overloaded central Galaxy server and configuring and running their own Galaxy instance. The Globus Genomics platform-as-a-service system provides researchers with a third option, namely outsourcing the task of operating Galaxy instances to the Globus Genomics team. That team then handles all of the mechanics associated with creating, configuring, deploying, and operating the Galaxy instance(s) for the user. In the current Globus Genomics system, deployment is always on AWS cloud resources, a convenient solution that makes it easy to address large and time-varying computational requirements. However, the Globus Genomics architecture can easily be adapted for other environments.

The key innovation in the Globus Genomics platform is the technology that allows the Globus Genomics team to manage (create, configure, update, monitor, etc.) large numbers of Galaxy instances. This task is challenging because each Galaxy instance may be configured with a different combination of tools: tools that may be selected from multiple Galaxy Tool Sheds containing hundreds of public and user-specific applications, and that may be deployed in different Galaxy instances with different versions and configuration options. The Galaxy Tool Shed based Automated Tools Deployment Manager within Globus Genomics allows us to manage distributed tools across multiple Galaxy instances from a simple web interface. The solution standardizes the process of tools management. Additionally, it allows Globus Genomics users to create and maintain new applications in their own private Tool Shed. Once development is complete and the application is available in the Tool Shed, it can easily be imported into their Galaxy instance in Globus Genomics.

7. ACKNOWLEDGEMENTS

This work was supported in part by the NIH through the NHLBI grant: The Cardiovascular Research Grid (R24HL085343) and by the U.S. Department of Energy under contract DE-AC02-06CH11357. We thank Amazon, Inc., for providing a research grant of Amazon Web Services resources that facilitated early experiments. We also thank the Dobyns Lab, Cox Lab and other users for helping us build new capabilities and being our early test users.

8. REFERENCES

[1] Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., Peluso, P., et al. Real-time DNA sequencing from single polymerase molecules. Science, 323(5910), 133-138, 2009.

[2] Drmanac, R., Sparks, A. B., Callow, M. J., Halpern, A. L., Burns, N. L., Kermani, B. G., et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science, 327(5961), 78-81, 2010.

[3] Schadt, E. E., Linderman, M. D., Sorenson, J., Lee, L., and Nolan, G. P. Computational solutions to large-scale data management and analysis. Nature Reviews. Genetics, 11(9), 647-657, 2010.

[4] Goecks, J., Nekrutenko, A., Taylor, J., & Team, T. G. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol, 11(8), R86.

[5] http://toolshed.g2.bx.psu.edu/

[6] https://dnanexus.com/

[7] https://www.sbgenomics.com/

[8] http://www.spiralgenetics.com/

[9] http://maverixbio.com/

[10] Madduri, R. K., Sulakhe, D., Liu, B., Lacinski, L., Dave, P., & Foster, I. T. (2013, July). Experiences in building a next-generation sequencing analysis service using Galaxy, Globus Online and Amazon web service. In Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery (p. 34). ACM.

[11] http://aws.amazon.com

[12] Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing, May/June, 70-73, 2011.

[13] Mesirov, J. P. Accessible reproducible research. Science, 327, 415-416, 2010.

[14] Special Issue on Reproducible Results. Comput. Sci. Eng. 11, 3 (2009).

[15] Wang, L., Wang, S., and Li, W. RSeQC: quality control of RNA-seq experiments. Bioinformatics, 28(16), 2184-2185, 2012.

[16] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078-2079, 2009.

[17] http://gatkforums.broadinstitute.org/discussion/1186/best-practice-variant-detection-with-the-gatk-v4-for-release-2-0

