Data Integration Data Mining Clinical Research

    Disclaimer and Confidentiality Notice

    This document is provided on an as-is basis, without any warranty, expressed, implied or otherwise, as to satisfactory quality or fitness for a particular purpose. In particular, any technical details of the software shown in this InforSense document may change without notice. The information provided herein is confidential.

    Copyright 2000-2006 InforSense Ltd. InforSense is a trademark application of InforSense Ltd. All trademarks are owned by their respective owners.

    TABLE OF CONTENTS

    EXECUTIVE SUMMARY
    A CHALLENGING TIME FOR LIFE SCIENCE RESEARCH AND DEVELOPMENT
        Challenges from an IT system perspective
        Challenges from a scientist's perspective
        Addressing the key challenges using Integrative Analytics
    ANALYTICAL WORKFLOWS
        Scientific workflows and automation
        The WfMC model
        Pitfalls of scientific workflow systems
        Analytical workflows from automation to innovation
    INTEGRATIVE ANALYTICS USING INFORSENSE KDE
        Workflow representation
        Scalable and open workflow execution middleware
        Interactive user environment
        Rapid application deployment
        Enterprise process knowledge management
    CASE STUDY: STREAMLINING THE LEAD OPTIMIZATION PROCESS
        Overview of tasks
        Deploying the process
        Case study summary
    CASE STUDIES IN TRANSLATIONAL MEDICINE
        Diffuse Large B-cell Lymphoma study
        Supporting translational medicine at the Windber Research Institute
        Case study summary
    SUMMARY AND CONCLUSIONS
        A user-oriented framework
        A flexible framework for life science R&D

    time. This causes increased complexity and maintenance, and often leads to further confusion.

    Traditionally, corporate or global informatics has dealt with these issues by either embarking on large and inflexible information systems projects, or by hacking together quick in-house solutions using scripting languages such as Perl, which can prove extremely difficult to reuse and maintain. Unfortunately, each solution often only meets the needs of a few people and rarely achieves the momentum to be adopted as a standard across a complete organization. The resulting vast amount of disparate knowledge is becoming unsustainable, and the cost of IT is rising accordingly.

    Challenges from a scientist's perspective

    Improving integration, structure and standardization of data and systems are key prerequisites for the future of innovative research and development. However, there are also other challenges to be overcome, such as empowering decision makers (scientists and business users) themselves.

    Support for Interactive Analysis: Scientific decision-making is invariably based on disparate data and information sources. It often requires users to interactively access, integrate and analyze the data using a trial-and-error approach. This dynamic decision-making process needs to be well supported, as it is iterative and requires the continual integration of new data and complex tools into an ever-evolving process.

    Support for Collaborative Analysis: Solutions to handle high-throughput data also need to be collaborative, as discovery increasingly needs to be conducted across different domains. For example, a project using high-throughput screening of hits towards lead identification requires decisions to be made by a team of biologists and chemists, all with varying degrees of expertise in different areas of drug discovery. Scientists need to be able to share and track their knowledge processes and best practices, and to work at different levels of abstraction depending on their level of expertise.

    Rather than providing support for the dynamic and iterative thought processes of scientists and business analysts, most IT solutions end up either restricting the innovation of these users, dictating how they can develop their processes and turn them into realizable solutions, or forcing decision makers to rely on sophisticated tools and/or external services to build special applications. The cost for life science R&D has proven to be high in a competitive industry.

    Addressing the key challenges using Integrative Analytics

    Integrative analytics is a paradigm pioneered by InforSense to provide a uniform informatics infrastructure for information-driven scientific and business decision-making.

    Figure 2: InforSense's Integrative Analytics layer provides a flexible framework allowing users to perform ad-hoc data and application integration.

    Integrative analytics is not just a data and application integration layer, but an enabling technology. This paradigm provides:

    A user-oriented framework that liberates the end user from worrying about the details of the underlying infrastructure. It allows them to focus on formulating their own decision-making processes and solutions.

    A flexible framework that can easily adapt to users with different functional and usability requirements, rather than the other way around.

    An open framework that supports decision makers' easy integration of information resources and software applications for building and deploying solutions as they are formulated.

    A collaborative framework that is able to support enterprise-wide knowledge production and management.

    InforSense KDE is a software platform that successfully implements the integrative analytics paradigm.

    In the remainder of this paper, we describe the InforSense integrative analytics approach in more detail. We start by describing the concept of analytical workflows, which plays a central role in the paradigm, followed by a description of the other key ingredients of the approach. We then give a brief overview of InforSense KDE itself, and present two case studies showing how the system addresses the challenges in high-throughput chemistry and translational medicine applications.

    ANALYTICAL WORKFLOWS

    Since the late 1980s, workflow technology has been used in many application areas where the primary driver has been to enable automated execution of repetitive tasks, including:

    Office Automation: where a workflow describes the automation of office procedures.

    Business Process Specification: where a workflow describes interaction and collaboration between different entities for the execution of a task.

    Planning and Scheduling: where a workflow describes the action logic of a system.

    Visual Scripting: where a workflow describes dataflow between a set of functions or commands within a system.

    Data Processing: where a workflow describes a process of transforming and analyzing data.

    Application Integration: where a workflow provides the glue to integrate distributed applications invoked, for example, through a web service interface.

    Informally, a workflow (see Figure 3) is an abstract description of the steps required for executing a particular real-world process, and the flow of information between these tasks. Work passes through the flow from start to finish, and the activities are executed by people or by system functions.

    Figure 3: Abstract workflow for a combined microarray data analysis.

    Workflows provide a way to describe the order of execution of tasks (or work units) and any relationships between them. More formally, a workflow is often best represented as a directed graph where tasks are represented as nodes (boxes) and information flow is represented as arcs (arrows).

    Workflow systems typically allow users to specify such workflows and automate their execution. In such a case, a workflow is constructed using a visual interface (see Figure 4). Once designed, a workflow can be submitted for execution to an execution engine that controls the invocation of, and data transfer between, the different activities. These activities are not restricted to a specific application area, and the integrative analytics paradigm can deal with complex multidisciplinary processes.
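    As a concrete illustration of the description above, the following sketch models a workflow as a directed graph of tasks and executes it in topological order. The `Workflow` class, the task names and the tiny engine are invented for this example; they are not the InforSense KDE API.

    ```python
    # Hypothetical sketch: a workflow as a directed graph of named tasks,
    # executed in dependency order by a minimal engine.
    from collections import deque

    class Workflow:
        def __init__(self):
            self.tasks = {}   # name -> callable taking a dict of upstream results
            self.arcs = []    # (producer, consumer) information-flow edges

        def add_task(self, name, fn):
            self.tasks[name] = fn

        def connect(self, src, dst):
            self.arcs.append((src, dst))

        def run(self):
            """Invoke each task once all of its upstream tasks have finished."""
            indeg = {n: 0 for n in self.tasks}
            for _, dst in self.arcs:
                indeg[dst] += 1
            ready = deque(n for n, d in indeg.items() if d == 0)
            results = {}
            while ready:
                name = ready.popleft()
                inputs = {s: results[s] for s, d in self.arcs if d == name}
                results[name] = self.tasks[name](inputs)
                for s, d in self.arcs:
                    if s == name:
                        indeg[d] -= 1
                        if indeg[d] == 0:
                            ready.append(d)
            return results

    wf = Workflow()
    wf.add_task("load", lambda ins: [3, 1, 2])
    wf.add_task("sort", lambda ins: sorted(ins["load"]))
    wf.add_task("report", lambda ins: f"min={ins['sort'][0]}")
    wf.connect("load", "sort")
    wf.connect("sort", "report")
    print(wf.run()["report"])  # min=1
    ```

    The arcs here carry both the execution order and the data handed from producer to consumer, mirroring the nodes-and-arrows picture in Figure 3.
    
    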

    Figure 4: A workflow provides a description of the steps required for executing a particular real-world process. A workflow can be authored using a visual front-end, and its execution is delegated to a workflow execution engine that handles the invocation of the remote applications.

    Scientific workflows and automation

    Since the early 2000s, scientific workflow technology has found wide acceptance in the fields of bioinformatics and cheminformatics, where it has been used as a means for developing and executing distributed applications and integrating multiple data sources and tools.

    In this case, a scientific workflow represents a machine-executable protocol for in silico scientific activities. The workflow shown in Figure 4 comprises different steps, including the analysis of microarray data using statistical methods, queries to a number of remote data sources (including the NCBI, SwissProt and PubMed data sources), a similarity search using a remote BLAST tool, as well as some text analysis operations.

    Its execution, in terms of invoking different tools and moving data between them, is delegated to an automatic execution engine. Developing the workflow using a visual editor reduces the barriers to technology adoption, as the required tools are accessed via an intuitive user interface, compared with coding the workflow in an application using classical programming, scripting and query languages (e.g. C++, Perl, SQL).

    The WfMC model

    As with workflow systems used for business process automation, the main model used for designing scientific workflow systems follows the model suggested by the WfMC (Workflow Management Coalition), as shown in Figure 5.

    Figure 5: WfMC (Workflow Management Coalition) reference model for workflow systems.

    The WfMC model is based on supporting the following operations:

    1. Workflow Authoring: The user selects data sources and tools from an available list and connects them according to the required data flow and control flow logic.

    2. Workflow Submission: The user submits the workflow for execution to a workflow execution engine that handles the invocation of the required tools (including other workflow systems) and passes the data between them.

    3. Workflow Execution: End users are able to execute the workflows from a workflow client interface (web portal or dedicated application), where they submit data and collect results.

    4. Workflow Monitoring: System administrators can monitor the execution of workflows through a monitoring interface.
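    The four operations above can be sketched as one minimal engine class. The method names mirror the operations in the list, but the class itself and its job-tracking details are illustrative assumptions, not the WfMC specification or the InforSense API.

    ```python
    # Hypothetical sketch of the four WfMC operations: authoring, submission,
    # execution and monitoring, collapsed into one tiny engine.
    import itertools

    class WorkflowEngine:
        def __init__(self):
            self._defs = {}                 # workflow name -> list of (step, tool)
            self._jobs = {}                 # job id -> job record
            self._ids = itertools.count(1)

        # 1. Authoring: register a workflow as an ordered list of (name, tool) steps.
        def author(self, wf_name, steps):
            self._defs[wf_name] = steps

        # 2. Submission: hand a workflow plus its input data to the engine.
        def submit(self, wf_name, data):
            job_id = next(self._ids)
            self._jobs[job_id] = {"workflow": wf_name, "data": data,
                                  "status": "queued"}
            return job_id

        # 3. Execution: the engine invokes each tool in turn, passing data along.
        def execute(self, job_id):
            job = self._jobs[job_id]
            job["status"] = "running"
            data = job["data"]
            for name, tool in self._defs[job["workflow"]]:
                data = tool(data)
            job["status"] = "completed"
            return data

        # 4. Monitoring: administrators inspect the status of a job.
        def monitor(self, job_id):
            return self._jobs[job_id]["status"]

    engine = WorkflowEngine()
    engine.author("normalize", [("scale", lambda xs: [x / 10 for x in xs]),
                                ("total", sum)])
    job = engine.submit("normalize", [10, 20, 30])
    result = engine.execute(job)  # 6.0
    ```
    
    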

    Pitfalls of scientific workflow systems

    Although the WfMC model itself is a generic descriptive abstraction, most scientific workflow systems based on the model can fail for a variety of reasons and may end up hindering scientists rather than empowering them. The main reasons can be summarized as:

    Focus on Automation: Many workflow systems have their roots in business workflow systems used for automating the execution of predefined steps. They are primarily designed for scripting and then automating the execution of a pre-designed process. It is essential to note that, as opposed to a workflow system that automates the execution of a predefined and rigid business process (designed once, executed many times), a scientific workflow system needs to enable users to integrate and access data based on ad-hoc queries and research problems that change dynamically. However, the workflow system still needs to retain its automation capabilities and provide different degrees of interactivity to suit the application and user needs.

    Focus on Predefined Tools/Data Sources/Tool Vendors: Other systems have been designed with a particular application domain in mind (e.g. cheminformatics, data mining, etc.). As such, they end up being too restrictive and inflexible in terms of the data and analysis types supported. They are usually optimized for a particular integration method, which often leads to research choices being driven or constrained by the informatics infrastructure.

    Focus on a Particular Execution Model: Many systems, driven by the quest to optimize the performance of automated execution, end up supporting only predefined execution models: for example, supporting only a data-pipelined model of communication between applications; supporting only web service methods for invocation and communication between applications; or simply being restricted to predefined hardware. Such systems tend to become too rigid for building scientific applications, and they end up dictating how scientists can integrate and analyze their data.

    Inability to Integrate Interactive Applications: Driven by automation, many workflow systems focus on integrating applications and data sources as backend systems. They fail to provide integrative interactivity tools, such as the visualization tools typically used by scientists and engineers. At best, they can simply invoke the execution of such tools to view the results generated from the execution of a particular workflow, but they fail to provide mechanisms that capture the user's interaction with these tools in the middle of a workflow, either as session tracking or as output capture.

    Poor Support for Workflow Authoring: Again driven by automation, many of these tools require an expert to design and implement the workflows themselves. Little support is given to the end-user scientist or domain expert in rapidly designing or modifying their workflows. The workflow authoring tools become detached from the execution environment and, at best, become simple visual scripting tools.

    Analytical workflows from automation to innovation

    It is our belief that the goal of a scientific workflow system is not simply to support the automated execution of predefined steps. These systems need to be designed to support the dynamic, interactive and iterative thought processes of scientific researchers in an information-rich environment.

    This calls for a fresh look at workflows themselves as analytical tools, going beyond their use for automation and towards their use for innovation. The key features of analytical workflow environments that enable this are summarized below:

    1. Interactive: Scientists may need to be able to follow a rapid trial-and-error approach with no programming involved. The analytical workflow environment needs to be able to support the thought processes of scientific researchers rather than curtail them.

    2. Tailored: Scientists need total freedom to define and build their own creative scientific processes. The analytical workflow environment needs to support, rather than limit, the innovation and creativity of end users.

    3. Integrative: By using workflows, multiple resources (data, software and instruments) can be dynamically combined in the same scientific study. The analytical workflow environment needs to allow this seamless integration on demand.

    4. Collaborative: Scientific research rarely proceeds as an individual activity. The analytical workflow environment needs to be able to support collaboration between multiple researchers, allowing them to share and track their modifications and changes within multidisciplinary scientific processes. It also needs to support knowledge production and management enterprise-wide.

    5. Ubiquitous: Scientific data are heterogeneous and disparate, and scientific activities are rich in variety. Analytical workflow environments must be able to handle different types of data sources and applications in each.

    6. Open: Scientists need access to new scientific algorithms and tools as they become available. The analytical workflow environment must be easily extensible to include such tools.

    7. Reusable: Scientists need to be able to record their discovery processes and make them available for re-use by others in various forms. The analytical workflow environment needs to support the representation and storage of these processes, both for modification and for re-execution.

    INTEGRATIVE ANALYTICS USING INFORSENSE KDE

    The InforSense integrative analytics approach places control and flexibility in the hands of the users, thereby addressing the key requirements of each class of user efficiently and easily. Specifically, InforSense workflow technology meets the needs of the following classes of users:

    Research Scientists: knowledge workers who need to discover new knowledge through the analysis of their data.

    Power Users: who develop new methods for data analysis and decision-making.

    Discovery IT personnel: who need to maintain, integrate and deliver the underlying data sources and software tools.

    Project Managers: who must be able to control, evaluate and forecast the progress of large-scale discovery projects.

    Figure 6: The InforSense integrative analytics approach provides a user-oriented framework for developing and using analytical workflows.

    The InforSense KDE integrative analytics approach builds on the use of analytical workflows and extends them with the following key elements:

    1. Scalable and Open Workflow Execution Middleware: supporting the execution of workflows, and the underlying data and application integration mechanisms, using a wide variety of efficient methods; supporting the easy integration of new data sources and components; and compliant with W3C standards.

    2. Interactive User Environment: supporting:

    a. Workflow Authoring and Execution: allowing users to interactively and rapidly construct and execute their analytical processes as workflows.

    b. Data Integration and Analysis: allowing users to query, access, integrate and analyze data from any data source at any time.

    c. Knowledge Discovery: providing a rich toolset of specialized data analysis and visualization components to support the end user in deriving new knowledge from the analysis of their data sets.

    d. Dynamic Integration of Visualization Tools: allowing users to integrate multiple interactive visualization tools in their analysis session.

    e. Automatic Workflow Capture: recording the user's interaction with knowledge discovery tools and capturing it as workflows for re-use in further applications.

    f. Analytical Wizards: guiding the user through the execution of complex processes while capturing them as workflows for re-use in further applications.

    3. Rapid Application Deployment Tools: providing a programming-free mechanism allowing users to construct new applications using the analytical workflow model, and making them available as interactive end-user applications across the enterprise via a variety of methods, including enterprise portals.

    4. Enterprise-Wide Process Knowledge Management Tools: providing support for enterprise-wide knowledge management by enabling multidisciplinary collaboration between different users, tracking the evolution of analytical workflows and allowing their re-use in different settings.

    5. Extensibility Tools: providing support for integrating both in-house and third-party data sources and tools, and supporting interoperability with other applications.

    Workflow representation

    InforSense KDE workflows are represented and stored using DPML (Discovery Process Markup Language), an XML-based representation of the workflows. The language supports both a data flow model of computation for analytical workflows and control flow operations for linking and orchestrating multiple analytical workflows. The workflows are constructed using a visual editor, and their execution is delegated to a workflow execution engine, which handles the invocation of the different computational tools and the data movement between them.

    Figure 7: InforSense KDE supports both data flow and control flow operations in workflows. Workflows are represented and recorded in XML.
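    DPML's actual schema is not reproduced in this paper, so the sketch below invents a DPML-like XML shape purely to illustrate the idea of recording a data-flow graph (nodes plus arcs) as an XML document; the element and attribute names are assumptions, not DPML.

    ```python
    # Hypothetical sketch: serializing a workflow graph to a DPML-like
    # XML record using the standard library.
    import xml.etree.ElementTree as ET

    def to_xml(name, nodes, arcs):
        """Record node names and (src, dst) arcs as an XML workflow document."""
        root = ET.Element("workflow", name=name)
        for n in nodes:
            ET.SubElement(root, "node", id=n)
        for src, dst in arcs:
            ET.SubElement(root, "arc", source=src, target=dst)
        return ET.tostring(root, encoding="unicode")

    doc = to_xml("microarray", ["load", "normalize", "cluster"],
                 [("load", "normalize"), ("normalize", "cluster")])
    ```

    Because the stored form is plain XML, a workflow recorded this way can be reloaded, diffed or re-executed later, which is the property the text attributes to DPML.
    
    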

    Workflow components

    Each component in a workflow (representing either a data source or a computational tool) is represented as a node in a graph. This provides a description of the input and output ports of the component, the type of data that can be passed to the component, and the parameters of the service that a user might want to change. Each node descriptor contains information, or metadata, covering three aspects: the tool's parameters, the service history within the context of the workflow (changes to parameter settings, user information, etc.) and user-added comments.

    Figure 8: Abstract representation of a workflow component.
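    The node descriptor just described (typed ports, user-editable parameters, and the three metadata aspects) might be sketched as follows; the class and field names are illustrative assumptions, not the actual KDE data model.

    ```python
    # Hypothetical sketch of a workflow-component descriptor: ports,
    # parameters, per-workflow service history and user comments.
    from dataclasses import dataclass, field

    @dataclass
    class NodeDescriptor:
        name: str
        input_ports: list                              # (port name, data type) pairs
        output_ports: list
        parameters: dict = field(default_factory=dict)
        history: list = field(default_factory=list)    # (user, change) records
        comments: list = field(default_factory=list)

        def set_parameter(self, user, key, value):
            """Change a parameter and record who changed it in the history."""
            self.parameters[key] = value
            self.history.append((user, f"{key}={value}"))

    blast = NodeDescriptor(
        name="BLAST search",
        input_ports=[("sequence", "fasta")],
        output_ports=[("hits", "table")],
        parameters={"e_value": 1e-5},
    )
    blast.set_parameter("alice", "e_value", 1e-10)
    ```

    Recording parameter changes alongside the user who made them is what lets the history aspect support the sharing and tracking described earlier.
    
    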

    Data flow operations

    Each analytical workflow is represented as a data flow graph of nodes. Within each analytical workflow, data is passed for processing along a chain of work steps, each performing a transformation of the data passed along its input ports to produce data on its output ports.

    Figure 9: Data flow within a workflow.
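    A minimal way to picture this port-to-port data flow is a chain of generators, each transforming the records arriving on its input and yielding them on its output, so downstream steps can start before upstream ones finish. The step names below are invented for illustration; this is a sketch of the dataflow idea, not KDE code.

    ```python
    # Hypothetical sketch: three work steps chained port-to-port, with each
    # record streaming through the whole chain as it is produced.
    def parse(lines):
        for line in lines:
            yield float(line)

    def threshold(values, cutoff):
        for v in values:
            if v >= cutoff:
                yield v

    def label(values):
        for v in values:
            yield f"pass:{v}"

    raw = ["0.2", "0.9", "0.5", "0.8"]
    pipeline = label(threshold(parse(raw), 0.5))
    print(list(pipeline))  # ['pass:0.9', 'pass:0.5', 'pass:0.8']
    ```
    
    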

    Control flow operations

    Control flow constructs also exist within the language for higher-level orchestration according to business rules or scientific heuristics. This includes support for iteration loops, conditionals and checkpointing.

    Figure 10: Control flow operations within a workflow.
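    One way to picture these three constructs together is a driver that repeats a step (iteration), tests an exit condition (conditional) and records each intermediate state (checkpointing). This is a hedged sketch of the concepts only, not DPML syntax.

    ```python
    # Hypothetical sketch of control flow around a data-flow step:
    # a loop, an exit conditional and a checkpoint list.
    def run_with_control_flow(step, state, done, max_iters=100):
        """Repeat `step` until `done(state)` is true, checkpointing each pass."""
        checkpoints = [state]
        for _ in range(max_iters):
            if done(state):              # conditional: exit test
                break
            state = step(state)          # iteration: re-run the subworkflow
            checkpoints.append(state)    # checkpoint: resumable intermediate state
        return state, checkpoints

    halve = lambda x: x / 2
    final, saved = run_with_control_flow(halve, 8.0, done=lambda x: x < 1)
    print(final)  # 0.5
    ```

    The checkpoint list is what would let a long-running orchestration resume from the last recorded state instead of restarting from scratch.
    
    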

    Scalable and open workflow execution middleware

    The InforSense KDE architecture is shown in Figure 11. The implementation is based on a high-performance and scalable service-based architecture for managing and integrating data from distributed sources and for coordinating the execution of distributed data analysis software components.

    Figure 11: Overview of InforSense KDE architecture.

    The workflow engine communicates with different workflow clients (specialized clients, web-based clients, etc.) via a variety of APIs (Application Programming Interfaces). It provides services for managing data and control flow operations, notification operations, as well as workflow scheduling operations. It also manages services for accessing and querying remote data sources and invoking remote computational services through standardized APIs. Finally, it supports services for managing workflow and data storage on behalf of the user.

    Implementation

    The architecture is implemented in pure Java and is J2EE compliant, ensuring that it can execute on a wide variety of hardware platforms and operating systems. Security is handled through JAAS (Java Authentication and Authorization Service), which allows the server to connect to most authentication mechanisms, such as LDAP servers, for user account information.

    Figure 12: Detailed overview of InforSense KDE workflow server.

    Workflows, data tables and other data required during workflow construction and execution are made persistent on the server in a user space. Each user has a private area as well as any number of shared areas. The user space provides role-based access to these shared areas, the roles being assigned through the authorization service.
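    The private/shared user space with role-based access might look like the following sketch; the `roles` table stands in for the authorization service, and all names are hypothetical rather than the KDE implementation.

    ```python
    # Hypothetical sketch: a user space with private areas and role-guarded
    # shared areas. A dict stands in for the authorization service.
    roles = {"alice": {"chemists"}, "bob": {"biologists"}}

    class UserSpace:
        def __init__(self):
            self.private = {}    # user -> {name: workflow}
            self.shared = {}     # area -> (required role, {name: workflow})

        def save_private(self, user, name, wf):
            self.private.setdefault(user, {})[name] = wf

        def create_shared(self, area, required_role):
            self.shared[area] = (required_role, {})

        def save_shared(self, user, area, name, wf):
            """Reject the write unless the user holds the area's role."""
            required, items = self.shared[area]
            if required not in roles.get(user, set()):
                raise PermissionError(f"{user} lacks role {required}")
            items[name] = wf

    space = UserSpace()
    space.create_shared("lead-opt", "chemists")
    space.save_shared("alice", "lead-opt", "docking", "<workflow>")
    ```
    
    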

    All workflows submitted to the server for execution are held in a job queue and scheduled. The server can be configured to use different persistency service providers for storing workflows. By default, it comes with an instance of the HyperSonic SQL database. It can also be configured to use an Oracle database as the persistency service provider.

    Data tables generated from workflow executions are persisted in the user space and can be streamed to the client through HTTP access. Intermediate results of workflow executions are cached in order to improve execution performance. The server handles the lifecycle of these cached files and ensures efficient re-use and clean-up.
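    The caching of intermediate results could work along these lines: results keyed by the node and a fingerprint of its inputs, so re-running an unchanged workflow prefix reuses the stored table instead of recomputing it. The keying scheme shown is an assumption for illustration, not InforSense's actual mechanism.

    ```python
    # Hypothetical sketch: a cache of intermediate workflow results keyed by
    # node name plus a hash of its inputs.
    import hashlib, json

    class ResultCache:
        def __init__(self):
            self.store = {}
            self.hits = 0

        def _key(self, node, inputs):
            blob = json.dumps([node, inputs], sort_keys=True)
            return hashlib.sha256(blob.encode()).hexdigest()

        def get_or_compute(self, node, inputs, compute):
            """Return the cached result, or compute and store it."""
            key = self._key(node, inputs)
            if key in self.store:
                self.hits += 1
            else:
                self.store[key] = compute(inputs)
            return self.store[key]

    cache = ResultCache()
    square = lambda xs: [x * x for x in xs]
    cache.get_or_compute("square", [1, 2], square)
    cache.get_or_compute("square", [1, 2], square)  # served from cache
    ```

    A real server would add the lifecycle management the text mentions (eviction and clean-up of stale entries), which is omitted here for brevity.
    
    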

    The client-side user interface can be easily deployed across an organization using Java Web Start and communicates with the server using two mechanisms: RMI for complex interactions with the server, and HTTP for efficient data upload/download. Furthermore, a web client provides thin-layer access to the server functionalities; it is based on JSP for fast access to the server-side components.

    Supporting multiple models of workflow execution

    The workflow execution engine is designed to enable a wide variety of execution models to ensure efficiency. This includes, but is not restricted to:

    Invocation of local or remote computations.

    Use of synchronous or iterative data transfer between computational components as needed.

    Use of in-database or out-of-database processing, allowing data to remain resident within an external database throughout processing or to be pulled out into a data manager server when needed.

    Support for pointer-based data manipulation working over distributed environments and file systems.

    Interaction with remote scheduling systems and tools.

    All these features allow for a flexible system that is not restricted to a particular model of computation, and ensure its applicability and efficiency across a large number of application domains.

    High performance server computing

    The workflow engine implementation is also backed by high-performance server clustering methods, enabling the use of multiple workflow execution servers in the same installation. The technology allows the delivery of enterprise-wide installations, achieving scalability in terms of data size, tasks and the number of users accessing the system.

    Built-in support for a wide variety of applications and data types

    InforSense KDE supports access to, and integration of, a wide variety of data sources and tools used in bioinformatics and cheminformatics, as well as data mining and text mining. The availability of these commonly used tools provides an out-of-the-box solution for many applications.

    Openness and Extensibility

    The workflow engine itself is extensible. The use of standards, including web services, for access to distributed data and resources, together with a well-defined API and SDK (Software Development Kit), enables users to rapidly integrate diverse resources and applications into their workflows.

    Interactive user environment

    InforSense KDE provides an interactive user environment, enabling experts and non-experts alike to rapidly construct and deploy their analytics applications with no programming effort. It provides a wide variety of interactive knowledge discovery tools, including the interactive visualization tools that are typically required in scientific research. Furthermore, it provides built-in mechanisms for recording end-user interactions with the visualization tools and storing such interactions as analytical workflows.

    Interactive workflow authoring tools

    InforSense KDE provides a user-friendly environment enabling informaticians, scientists and business analysts to design, implement and execute integrated analytical applications that are tailored to their own decision-making. The InforSense visual client interface technology is based on drag-and-drop components representing data sources and analytical applications, which are composed easily and interactively into analytical workflows. End users simply select and connect the required components into visual workflows that represent the logic of their processes.

    Figure 13: InforSense KDE supports a drag-and-drop visual interface for developing workflows from a defined component set.

    Interactive workflow execution and knowledge discovery

    The InforSense KDE workflow client interface is designed to support the dynamic, iterative and interactive decision-making processes of scientists. This is achieved through the inclusion of many interactive tools within the environment.

    The interactivity tools transform the workflow-authoring tool from a simple scripting environment into a fully fledged knowledge discovery environment, allowing users to interactively inspect the results of executing a workflow at any point. They also allow users to easily follow a trial-and-error approach to analyzing their data if needed.

    The basic interactivity features of the environment include:

    Dynamic Data Integration: Users can spontaneously query, analyze and integrate data from multiple data sources with no database or SQL programming expertise.

    Built-in Knowledge Discovery Tools: The interactive execution environment also provides a wide range of built-in data processing and transformation operations that can be applied to the data.

    Interactive visualization technology

    Figure 14: InforSense KDE provides a wide variety of interactive visualization tools for data exploration and analysis.

    The system provides an extensive set of generic and specialized visualization tools, enabling instant review, comparison and refinement of results to deliver quick insights and decision-making. These visualization tools can be launched from any node in a workflow. Each tool allows users to visually browse the data and models it is operating on, and to perform visual data selection and filtering. In addition, the system supports the use of wizards for guiding the execution of workflows, as well as for guiding predictive model development and optimization.

    Interactive Visualization Network technology

    InforSense KDE provides seamless interaction between independent visualization tools (including third party visualization tools registered into the system), enabling end users to easily define relations between independent datasets and dynamically transfer data selections between the tools. Multiple independent interactive visualization tools can be launched from any node in a workflow. The InforSense Visualization Network reflects data selections made in one visualization tool instantaneously in all other viewers provided.


    Figure 15: The InforSense Visualization Network technology allows interaction between independent visualization tools (including third party tools) for seamless exploration of complex and multi-dimensional datasets.

    Furthermore, using the Visualization Network technology users can, on demand, exchange data selections between different visualization tools that are invoked on multiple data sets, performing on-demand joining of the data.
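The linked-selection behavior described above can be sketched as a simple publish/subscribe pattern: a selection made in one viewer is broadcast to every other registered viewer. This is a minimal illustrative sketch, not the InforSense API; all class and method names here are invented.

```python
# Sketch of a "visualization network": independent viewers share data
# selections through a central hub (all names are illustrative).

class VisualizationNetwork:
    def __init__(self):
        self._viewers = []

    def register(self, viewer):
        self._viewers.append(viewer)
        viewer.network = self

    def broadcast(self, source, selected_ids):
        # Reflect a selection made in one viewer in all other viewers.
        for viewer in self._viewers:
            if viewer is not source:
                viewer.apply_selection(selected_ids)

class Viewer:
    def __init__(self, name):
        self.name = name
        self.network = None
        self.selection = set()

    def select(self, ids):
        self.selection = set(ids)
        if self.network:
            self.network.broadcast(self, self.selection)

    def apply_selection(self, ids):
        self.selection = set(ids)

net = VisualizationNetwork()
scatter, table = Viewer("scatter"), Viewer("table")
net.register(scatter)
net.register(table)
scatter.select([3, 7, 9])        # selection propagates to the table viewer
print(sorted(table.selection))   # → [3, 7, 9]
```

Registering a third viewer would require no change to the existing ones, which is the point of decoupling the viewers from one another.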

    Mid-workflow execution

    Figure 16: InforSense KDE supports interactive mid-workflow execution of visualization tools, allowing visualization tools (including third party tools) to be launched at pre-set interaction points.

    Mid-workflow execution of visualization tools is enabled by defining key points within a workflow to act as pre-set interaction points: data analysis and transformation steps are automatically executed by the workflow up to a specified interaction point, and then a visualization session is launched for interactive visual analysis and decision making. Once data items of interest are marked by the end user, the automatic execution of the workflow continues using predefined data transformation and analysis steps.
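The pause-and-resume pattern described above can be sketched in a few lines: automated steps run until a pre-set interaction point, a callback stands in for the interactive visualization session, and execution then resumes. This is a hypothetical sketch; the function and parameter names are not InforSense's.

```python
# Minimal sketch of mid-workflow execution with one interaction point.

def run_workflow(data, steps, interact_at, interact):
    for i, step in enumerate(steps):
        data = step(data)
        if i == interact_at:
            data = interact(data)  # user marks items of interest here
    return data

steps = [
    lambda xs: [x * 2 for x in xs],       # automated transformation
    lambda xs: [x for x in xs if x > 4],  # automated filtering
]
# The "interactive session" here simply keeps even values.
result = run_workflow([1, 2, 3, 4], steps, interact_at=0,
                      interact=lambda xs: [x for x in xs if x % 2 == 0])
print(result)  # → [6, 8]
```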

    Automatic capture of visual analysis into analytical workflows

    InforSense KDE supports a close model of interaction between visualization tools and analytical workflows. Data transformations and selections conducted by the user in a visualization tool can easily be recorded and captured as data processing components that are represented as a workflow.


    Recording such operations provides dual benefits:

    First, it provides an audit trail of the visual analysis steps, allowing users to inspect and review them.

    Second, it provides a mechanism for re-using the same analytic steps as workflow nodes in other applications.

    Figure 17: Interactive analysis steps can be easily recorded within interactive analysis sessions in visualization tools (including third party tools, such as Spotfire). These recorded steps can then be used in other workflows and applications.
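Both benefits, an audit trail and re-use, fall out of the same mechanism: logging each interactive operation and replaying the log as a workflow. This is an illustrative sketch under invented names, not the product's recording format.

```python
# Sketch: operations performed interactively in a viewer are recorded
# as an audit trail and replayed later as workflow steps on new data.

class RecordingSession:
    def __init__(self, data):
        self.data = data
        self.log = []   # audit trail of (operation, column, argument)

    def filter_greater(self, column, threshold):
        self.log.append(("filter_greater", column, threshold))
        self.data = [r for r in self.data if r[column] > threshold]

    def replay(self, new_data):
        # Re-use the recorded steps as a workflow on another dataset.
        for op, column, threshold in self.log:
            if op == "filter_greater":
                new_data = [r for r in new_data if r[column] > threshold]
        return new_data

s = RecordingSession([{"ic50": 5}, {"ic50": 50}])
s.filter_greater("ic50", 10)
print(s.data)                                  # → [{'ic50': 50}]
print(s.replay([{"ic50": 8}, {"ic50": 80}]))   # → [{'ic50': 80}]
```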

    Rapid application deployment

    The InforSense visual analytical workflow environment provides a programming-free approach that simplifies the creation and execution of data access, transformation and analysis tasks. Once created, InforSense analytical workflows are deployed easily and rapidly as end-user applications across the enterprise via a variety of methods, including a portal interface or embedding into another third party application. The deployment mechanism is also programming-free and requires no extra software or third party code. These same workflows can be executed as web services, via a command line or through the InforSense Server API, ensuring consistent enterprise-wide dissemination of applications.

    Interactive portals and dashboards

    Analytical workflows deployed as applications via a portal interface can make use of InforSense's interactive visualization tools, including charts and plots deployed as applets. This feature provides portal users with the same interactive analysis and decision-making facilities as the InforSense main client. Furthermore, the InforSense portal supports communication between multiple workflow applications, ensuring that the deployed interactive applications can easily form part of a bigger enterprise-wide solution.


    Figure 18: Workflows can be deployed for execution using a variety of techniques, including from within a portal.

    Guided execution of analytical workflows

    InforSense KDE technology supports the construction of end-user applications with specified interactive/visual decision points that occur within the automated execution of complete workflows. Using this feature, workflow authors develop and customize their own wizards (Action Maps) for guiding non-expert users through the execution of complex analytical workflows. This provides a novel way to allow end users to easily follow guided analytics steps.

    Figure 19: InforSense workflow technology allows for the easy development of guided analytical applications. From within the main client interface, this is enabled through mid-workflow invocation of visualization tools (including third party visualizers) through the use of Action Maps. Furthermore, analytical workflows can be easily deployed into portals and other third party tools for interactively guiding end users through the execution of other analytical procedures.


    Enterprise process knowledge management

    InforSense's integrative analytics approach goes beyond data and application integration. It provides a basis for intellectual property capture and management within an organization. Teams of users can collaboratively author their workflows and re-use them in different applications. Organizations can audit the analytical processes developed using workflows and manage portfolios of such projects more efficiently. The end results include capturing and sharing best practices, improved project management and improved decision-making across the organization. InforSense's workflow management technology is supported by:

    Analytical workflow storage and retrieval

    InforSense analytical workflows are easily stored, searched and retrieved by users. Complete workflows can be re-executed, and workflow templates can be modified and re-used in different applications by different groups. The workflow repository can easily be queried to retrieve workflows based on various search criteria provided by the user.
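A repository query of this kind amounts to matching workflow metadata against user-supplied criteria. The sketch below is purely illustrative: the repository records, tag scheme and `find_workflows` helper are invented for the example, not part of the product.

```python
# Sketch of repository search over workflow metadata.

repository = [
    {"name": "lead_opt", "author": "alice", "tags": {"chemistry", "pls"}},
    {"name": "lymphoma", "author": "bob",   "tags": {"clinical", "svm"}},
]

def find_workflows(repo, **criteria):
    # A "tag" criterion matches set membership; others match equality.
    def matches(wf):
        return all(v in wf["tags"] if k == "tag" else wf[k] == v
                   for k, v in criteria.items())
    return [wf["name"] for wf in repo if matches(wf)]

print(find_workflows(repository, tag="svm"))      # → ['lymphoma']
print(find_workflows(repository, author="alice")) # → ['lead_opt']
```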

    Analytical workflow annotation

    InforSense technology enables end users to annotate analytical workflows (whole workflows, workflow templates, workflow fragments and individual components within a workflow). Annotations can be entered as free-text comments or may be part of a controlled vocabulary/ontology. By annotating workflows, users capture and share further intellectual property about their processes and their knowledge about them, and allow other users to easily query and locate such workflows.

    Figure 20: The InforSense KDE workflow authoring environment allows users to attach various annotations to their workflows.


    Workflow change history tracking

    InforSense technology allows valuable information about each workflow to be automatically captured and stored. Such information is captured as the workflows are developed and includes which components were used in a workflow, the parameter change history for individual components and the user ID initiating each change. By using these features, users and organizations can easily backtrack through the change history of individual workflows and maintain different versions of the same analytical process.

    Figure 21: The InforSense KDE workflow authoring environment provides a history of all parameter changes.

    Audit trails

    In addition to tracking the evolution of individual workflows, InforSense technology allows storing and tracking of information regarding workflow execution. This includes information on the outputs of a workflow (data and models) and their relationship to the parameters used in each workflow. Such audit trails provide a valuable record of data provenance and form the input to project management and reporting tools.

    Automatic reporting

    InforSense analytical workflows are stored as XML; information about the history of these workflows is also stored in XML and is easily retrieved by users. The system provides access to automatic reporting tools that process such information and summarize it for end users within customizable views, enabling better decision making both within individual projects and across multiple projects.

    Figure 22: Both automatically generated and user-specific reports can be easily produced.
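Because the workflows are stored as XML, reporting can be built with standard XML tooling. The sketch below summarizes a workflow document with Python's standard library; the element and attribute names are invented for illustration and do not reflect the actual InforSense schema.

```python
import xml.etree.ElementTree as ET

# Sketch: summarizing a hypothetical workflow XML document.

doc = """<workflow name="lead_opt">
  <node id="1" type="DatabaseAccess"/>
  <node id="2" type="PLSRegression"/>
  <node id="3" type="Report"/>
</workflow>"""

root = ET.fromstring(doc)
summary = {
    "workflow": root.get("name"),
    "nodes": [n.get("type") for n in root.findall("node")],
}
print(summary)
# → {'workflow': 'lead_opt', 'nodes': ['DatabaseAccess', 'PLSRegression', 'Report']}
```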


    CASE STUDY: STREAMLINING THE LEAD OPTIMIZATION PROCESS

    In this section we present a case study applying the InforSense integrative analytics paradigm to high throughput chemistry, with particular respect to lead optimization.

    The work is conducted using InforSense ChemSense, which allows for the access and integration of various experimental and chemical compound databases, data cartridges and cheminformatics tools available from the major vendors.

    Figure 23: Overview of an InforSense workflow integrating tools from multiple vendors with InforSense analytics. The flexible execution mechanism of the InforSense workflow engine allows direct and concurrent access to chemical data. This ensures full in-database processing for a workflow. It also allows an out-of-database processing model when required by other tools.

    The case study illustrates the type of workflows that can be constructed using InforSense KDE. It also highlights some of the associated visual analysis tools that can be used to interactively browse and analyze the data and results, and to support collaboration through annotation and workflow sharing.

    Overview of tasks

    The high-level steps that typically form a lead optimization process are shown in Figure 24. In this case, for simplicity, the process has been split into five main tasks, each implemented using InforSense KDE. The process has also been divided into two distinct phases, an automated phase and an interactive phase. The objective is to select the best subset of relevant compounds for further testing in order to maximize the ROI per experiment.

    Figure 24: Overview of lead optimization process, divided into five maintasks.


    Task 1: Assay results for multiple libraries

    The process of lead optimization starts with assay results from multiple libraries. Before analyzing this data, a series of preprocessing steps needs to be conducted to filter out false positives and rescue false negatives.

    A number of statistical approaches can be used to achieve this. The approach used here builds a cross-validated predictive model using Partial Least Squares (PLS) regression to predict activity. Then, by selecting sufficiently large mismatches between the actual and the predicted values, we can remove or rescue compounds appropriately to create a list of hits for further processing. We have ensured that the new data set is statistically consistent with the training set used to build the PLS model. The InforSense KDE workflow used to address this task is shown below.

    Figure 25: Workflow to highlight false positives and false negatives.
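The residual-based triage idea above can be sketched compactly: fit a model of activity, then flag compounds whose measured activity deviates strongly from the prediction. In this sketch a plain one-variable least-squares line stands in for the cross-validated PLS model, and the descriptor, activities and cutoff are invented example values.

```python
# Sketch of residual-based flagging of suspect assay results.

def fit_line(xs, ys):
    # Ordinary least squares for one predictor (stand-in for PLS).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def flag_outliers(xs, ys, cutoff):
    slope, intercept = fit_line(xs, ys)
    return [i for i, (x, y) in enumerate(zip(xs, ys))
            if abs(y - (slope * x + intercept)) > cutoff]

# Synthetic assay data: compound 3 deviates strongly from the trend,
# so it is flagged as a likely false positive.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity   = [1.1, 2.0, 2.9, 9.0, 5.1]
print(flag_outliers(descriptor, activity, cutoff=2.0))  # → [3]
```

Flagged compounds with low measured activity would be candidates for rescue as false negatives; flagged compounds with high measured activity are false-positive candidates for removal.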

    Task 2: Compound Clustering

    The second task is to bin, or group, the hits based on structural similarity. Using a fingerprint method to describe each structure, the hits are compared using a hierarchical clustering algorithm such as Ward's method. The workflow can contain multiple branches to assess the dependence of the binning output on the clustering method used. The results in each bin represent clusters of structures most similar to one another. Activities and properties can complement the information contained in fingerprints.

    Figure 26: Workflow to cluster hits into distinct bins, and the dendrogram viewer used to visualize the results.
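The binning step can be illustrated with fingerprints represented as sets of "on" bits compared by Tanimoto similarity. For brevity, a greedy single-linkage pass stands in here for the hierarchical Ward clustering named above; the fingerprints and threshold are invented example data.

```python
# Sketch of binning hits by fingerprint (Tanimoto) similarity.

def tanimoto(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def bin_by_similarity(fps, threshold):
    bins = []
    for idx, fp in enumerate(fps):
        # Join the first bin containing a sufficiently similar member.
        for b in bins:
            if any(tanimoto(fp, fps[j]) >= threshold for j in b):
                b.append(idx)
                break
        else:
            bins.append([idx])
    return bins

# Fingerprints as sets of on-bits; 0/1 are similar, 2/3 are similar.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8}]
print(bin_by_similarity(fps, threshold=0.5))  # → [[0, 1], [2, 3]]
```

Running the same data through several clustering variants, as the multi-branch workflow does, would show how sensitive the bin membership is to the chosen method.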

  • 8/4/2019 6283008 Data Integration Data Mining Clinical Research

    28/42

    2006 InforSense Ltd 26

    Task 3: Maximum Common Substructure and R-Group deconvolution

    With structures clustered into bins, a Maximum Common Substructure (MCS) analysis can be performed on each bin of compounds. Using the resulting MCS, an R-Group deconvolution is performed. The result is a table of parent structures and corresponding R-Group structures for each position. At this phase, R-Group members can be plotted against assay results to find structures with interesting activity at different substitution points. These parent molecules are selected for further analysis. For instance, local QSAR models can be built to generate focused libraries.

    Figure 27: R-Group analysis of R-Group members vs. activity.
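Once deconvolution has produced a table of parent scaffolds and substituents, the "R-Group vs. activity" view is a simple pivot: mean activity per substituent at each position. The records, scaffold names and activity values below are placeholders invented for illustration; a real pipeline would obtain them from a cheminformatics MCS/deconvolution tool.

```python
from collections import defaultdict

# Sketch of tabulating mean activity per R1 substituent.

hits = [
    {"parent": "scaffold-A", "R1": "Me",  "activity": 6.0},
    {"parent": "scaffold-A", "R1": "Me",  "activity": 8.0},
    {"parent": "scaffold-A", "R1": "OMe", "activity": 4.5},
]

totals = defaultdict(list)
for h in hits:
    totals[h["R1"]].append(h["activity"])
mean_activity = {r: sum(v) / len(v) for r, v in totals.items()}
print(mean_activity)  # → {'Me': 7.0, 'OMe': 4.5}
```

Substituents with a markedly higher mean activity point to substitution positions worth exploring in a focused library.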

    Task 4: Predictive ADMET calculation

    Having selected the parent molecules for further processing, we can automatically score our similar hits against absorption, distribution, metabolism, excretion and toxicity (ADMET) profiles that are built in the database. This allows us to search and score our models using database-resident applications.

    Figure 28: Scoring molecules to create ADMET profiles. The models are built using in-database technology, which allows them to be updated every time new ADMET results are added to the repository.


    Task 5: Prioritizing hit series

    We now have an enhanced list of compounds from which to select candidates for further experimental screening. Using InforSense ChemStudio, a fully interactive tool, the user is able to browse, select and prioritize the lead groups and identify which compounds to process.

    Figure 29: Using InforSense ChemStudio to process the compounds.

    Deploying the process

    The overall process has been separated into five tasks. The goal of the next stage is to show how these workflows can be deployed into production.

    Part 1: Portal deployment

    For the end user who does not want to be exposed to the complexities of the underlying analysis, workflows can be deployed into the portal environment. In this case, the workflow author can control which key parameters are made available for the end user to change. Alternatively, the workflow author may choose not to expose any parameters, and the portal becomes a simple reporting tool delivering information to the rest of the project team.

    Figure 30: By deploying results from the predictive ADME workflow together with other project information, the portal provides an easy way to assess relevant information to enable project decision making and reporting.


    Part 2: Control flow for automation

    Another approach to deploying these tasks is to use InforSense Control Flow operations, where each task is defined as a step in a control flow. A control flow contains heuristics and business rules. At the end of each step, business rules can be applied to check, for example, that the required number of hits is still in the process.

    The process can be automatically invoked when the screening step is completed, and it can preprocess, select and deliver formatted results to the project team for prioritization, notifying them at each stage.

    Figure 31: Control flow to automate the process and notify about progress at each stage.
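A control flow of this kind reduces to running each task in sequence and checking a business rule between tasks, halting (and reporting where) if the rule fails. The sketch below is illustrative; the step names, rule and data are invented for the example.

```python
# Sketch of a control flow with a between-step business rule.

def run_control_flow(hits, steps, min_hits):
    for name, step in steps:
        hits = step(hits)
        if len(hits) < min_hits:    # business rule: enough hits remain?
            return name, hits       # halt and report the failing stage
    return "completed", hits

steps = [
    ("triage", lambda hs: [h for h in hs if h["active"]]),
    ("admet",  lambda hs: [h for h in hs if h["score"] > 0.5]),
]
hits = [{"active": True,  "score": 0.9},
        {"active": True,  "score": 0.2},
        {"active": False, "score": 0.8}]
status, remaining = run_control_flow(hits, steps, min_hits=1)
print(status, len(remaining))  # → completed 1
```

A notification hook at each iteration of the loop is all that is needed to report progress to the project team at every stage.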

    Part 3: Interactive tasks for selection and communications

    Task five is a collaborative, interactive task. To achieve this, the leads are loaded into ChemStudio. This interactive tool allows browsing and visualization of data but, equally important, it allows users to select views of the data, annotate them, save them and share them with other users. This unique ability allows project teams to build a consensus on which leads to prioritize and then send them on to follow-up screening.

    Figure 32: Collaboration within the ChemStudio browser using defined communication points.


    Case study summary

    By using a workflow model, it is straightforward to design and rapidly implement the different steps of a complex analysis process. In-house and commercial tools can be leveraged and integrated within a single interface. Using a combination of simple data flow and control flow operations, the whole process can easily be automated. Furthermore, with the use of various interactive analysis tools, such as InforSense ChemStudio, users can collaboratively browse, visualize and annotate the output results in a user-friendly environment. The availability of such an interactive framework is essential not only for enabling the analysis of the data and results, but also for enabling users to share their results effectively with their colleagues.


    CASE STUDIES IN TRANSLATIONAL MEDICINE

    With the recent drive towards translational medicine, it has become increasingly important that researchers are able to exploit the vast quantities of clinical data generated both from clinical trials and during clinical practice. Such data is a prerequisite for effective cross-domain strategies such as systems biology, biomarker discovery and personalized medicine. Ultimately, each of these approaches shares the common goal of improving patient care. To assist this, a successful informatics platform needs to enable the seamless transition from basic research through to decision support within a clinical setting. Moreover, such a platform has to enable movement in the reverse direction: not only from bench to bedside but also from bedside to bench.

    Current processes for accessing this valuable phenotypic data are highly inefficient, while researchers are demanding more flexible, yet simplified, interfaces to patient data. Meeting their needs means giving direct control back to the domain experts, rather than IT consultants, so that researchers become solution builders, rapidly developing and deploying their own applications.

    However, analysis of clinical data is only half of the translational medicine paradigm. With the advent of the high-throughput biology era, techniques such as gene expression, proteomics and genetics are now common research tools used to provide a better molecular definition of disease. Such experimental approaches generate ever-increasing volumes of data which need to be integrated with each other as well as with the clinical data silos.

    A major aspect of translational medicine is its collaborative nature. By definition, it requires the bridging of different disciplines and departments, most notably those involved in basic research and clinical science. It is also a very iterative and dynamic process, with constant information exchange required between these groups. Historically, the research and clinical domains have not interacted a great deal, due to the language barrier that often exists between them. Although this exchange is now happening in large pharmaceutical organizations, it is the research institutes that are better placed to embrace the concept first, due to their less rigid organizational structures.

    The case studies presented in this section focus on translational medicine and demonstrate how InforSense KDE can be used to bridge the gap between life science and healthcare. They do so by providing a cross-domain, enterprise-wide infrastructure designed to support a varied user community ranging from the bench biologist to the physician. The first study features examples of the analytical workflows that can be used in typical studies to better understand the molecular definition of disease, specifically a study into Diffuse Large B cell Lymphoma. The second case study shows an example of this approach in practice, illustrating how the InforSense KDE system is used for translational medicine at the Windber Research Institute, Pennsylvania, USA.


    Diffuse Large B cell Lymphoma study

    In this section, we focus on a particular case study of Diffuse Large B Cell Lymphoma, where the main aims of the study were to:

    a. Identify and further characterize, using a variety of analytical approaches, the genes that are important in predicting the outcome of disease based on gene expression.

    b. Build a predictive model of disease outcome based on gene expression data using methods such as supervised learning algorithms. This model is deployed to the end user (e.g. a clinical researcher) for further validation on other sample sets.

    The high-level steps that form the parts of the process above are shown in Figure 33, and the actual workflow constructed for the study is shown in Figure 34.

    Figure 33: Overview of Large B cell Lymphoma analysis.

    Figure 34: Actual workflow constructed for the Large B cell Lymphoma analysis.

    Task 1: Stratifying patient samples based on clinical data

    For clinical data analysis, the use of an OLAP browser allows a highly flexible way to bin and group sets of relevant data for easy browsing and selection of patient subpopulations. InforSense's browser tool (Figure 35) has the added advantage of allowing this hierarchy to be defined dynamically, and it does not encounter the dimension limit found in many other OLAP tools. In this Large B cell Lymphoma example, the tool can be used to define the different disease outcome categories, and the resulting data sets are passed into the workflow authoring environment for more complex downstream analysis.


    Figure 35: Use of the InforSense browser in dynamically defining the hierarchy for subsequent easy browsing of clinical data.
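At its core, this stratification step counts patients in cells of a cube whose dimensions are chosen dynamically at run time. The sketch below is a pure-Python stand-in for the interactive OLAP browser, with invented patient records and dimension names.

```python
from collections import Counter

# Sketch of OLAP-style stratification over clinical dimensions.

patients = [
    {"outcome": "cured", "stage": "II",  "age_band": "50-59"},
    {"outcome": "cured", "stage": "III", "age_band": "60-69"},
    {"outcome": "fatal", "stage": "III", "age_band": "60-69"},
]

def cube(rows, dimensions):
    # Count patients per cell of the chosen dimension hierarchy.
    return Counter(tuple(r[d] for d in dimensions) for r in rows)

# The hierarchy is chosen at run time, not fixed in advance:
counts = cube(patients, ["outcome", "stage"])
print(counts[("cured", "III")], counts[("fatal", "III")])  # → 1 1
```

Changing the `dimensions` list re-slices the same data along a different hierarchy, which is the "dynamic" property the browser provides interactively.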

    Task 2: Genotype-phenotype data integration

    Once clinical data is collected for the patient samples, the next task within the workflow is to integrate it with the experimental results from gene expression experiments. The relevant table for the Lymphoma data is loaded into the system and integrated using a previously deployed service workflow, shown as a single node called Data Integration in Figure 36.

    Figure 36: Data integration workflow.
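Conceptually, this integration step is a join of clinical records to expression results on a shared sample identifier. The sketch below uses invented field names and toy data to illustrate the join.

```python
# Sketch of joining clinical data to expression results by sample ID.

clinical = {"S1": {"outcome": "cured"}, "S2": {"outcome": "fatal"}}
expression = [
    {"sample": "S1", "gene": "BCL6", "level": 8.1},
    {"sample": "S2", "gene": "BCL6", "level": 2.3},
]

integrated = [
    {**row, **clinical[row["sample"]]}     # merge the two records
    for row in expression if row["sample"] in clinical
]
print(integrated[0])
# → {'sample': 'S1', 'gene': 'BCL6', 'level': 8.1, 'outcome': 'cured'}
```

The resulting rows carry both phenotype (outcome) and genotype (expression level) and can flow into the downstream analysis branches.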

    Task 3: Identification of disease relevant genes

    Once integrated, the combined phenotype-genotype data is then analyzed using two different approaches, as shown by the branching workflow below.

    Figure 37: Identification of genes important in predicting disease outcome.


    A) In the top branch, an attribute importance (AI) algorithm is applied to identify the critical genes for predicting disease outcome. The high-scoring genes from this analysis were then manually selected in an interactive step, using InforSense's Interactive Browser to plot the results and view the individual gene scores.

    Figure 38: Interactive browser showing a plot of the ranked high-scoring genes.

    The selected genes were then further characterized downstream in the workflow.
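An attribute-importance score of this kind ranks each gene by how strongly its expression separates the outcome groups. The sketch below uses the absolute difference of group means as a deliberately simple stand-in for the AI algorithm; gene names, expression values and labels are invented example data.

```python
# Sketch of ranking genes by a simple between-group separation score.

def rank_genes(samples, labels):
    genes = samples[0].keys()
    scores = {}
    for g in genes:
        good = [s[g] for s, l in zip(samples, labels) if l == "good"]
        poor = [s[g] for s, l in zip(samples, labels) if l == "poor"]
        scores[g] = abs(sum(good) / len(good) - sum(poor) / len(poor))
    return sorted(scores, key=scores.get, reverse=True)

samples = [{"BCL6": 8.0, "ACTB": 5.0},
           {"BCL6": 7.0, "ACTB": 5.5},
           {"BCL6": 2.0, "ACTB": 5.2},
           {"BCL6": 3.0, "ACTB": 5.1}]
labels = ["good", "good", "poor", "poor"]
print(rank_genes(samples, labels))  # → ['BCL6', 'ACTB']
```

Plotting the scores in rank order gives exactly the kind of view shown in Figure 38, from which the top genes are selected interactively.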

    B) In the bottom branch, a predictive model is built using a Support Vector Machine (SVM) algorithm. In addition to SVM prediction, InforSense KDE provides a wide variety of other predictive modeling tools, including Bayesian classification, Decision Trees, etc., that could be similarly used to develop a predictive model based on the available data sets. Once developed, the model can be easily applied to other data sets and assessed in terms of diagnostic potential.

    Task 4: Assembling a collection of disease-relevant information

    In addition to building a predictive model, the workflow also assembles a wealth of information about the genes thought to be important in affecting outcome. Such analyses are typical in drug discovery or biomarker studies, where researchers want to prioritize genes for further study based on diverse information. These approaches provide context and include:

    Visualization techniques that enable rapid determination of clusters of co-regulated genes.

    Pathway analysis to search for modulated genes which cluster into subsets of metabolic or signal transduction pathways.

    Sequence analysis using the BioSense extension to find homologues or orthologues in other species in order to identify animal models.

    Text analytics using the TextSense extension to investigate the state of knowledge in the scientific literature or patents.

    Chemical structure similarity searches using the ChemSense extension to look for small molecules that may inhibit or stimulate activity, often based on similarities to the natural ligand.


    Figure 39: Multiple analysis branches for the B Cell Lymphoma study.

    Using the InforSense KDE workflow model and knowledge discovery tools, it is simple to extend the analysis incrementally to cover each area.

    Task 5: Deployment of the disease outcome model

    The SVM model generated earlier can easily be made available to a wider community, such as clinicians, using the portal deployment technology, for use in validation with other data sets. In this situation, the complexity of the analysis can be hidden from the end user. Users are simply required to input the new data in the web interface and then click a button which executes the modeling workflow in the background; this in turn sends the results back to the portal for review.

    Figure 40: Portal technology being used for clinical decision support.


    Supporting translational medicine at the Windber Research Institute

    InforSense has been collaborating with the Windber Research Institute (WRI) to develop a flexible IT infrastructure for translational medicine applications. WRI is one of the leading biomedical research institutes in the USA and has a strong collaboration with the Walter Reed Army Medical Center. The main areas of research focus on breast disease, gynecologic cancers and cardiovascular disease.

    The collaboration between InforSense and WRI has resulted in a high-performance and highly agile integrative decision-support system for systems biology and clinical practice. Workflows for data integration and analysis, similar to those presented in the previous case study (section 5.1), have been built using the InforSense KDE system and deployed to a portal interface.

    Researchers use the analytical capabilities of the platform, as described in the case study above, to explore the relationship between clinical presentation and disease progression and ultimately identify patient cohorts of interest. These stratified patient populations then feed into InforSense's analytical framework (InforSense KDE) for further investigation using techniques such as those used for gene expression analysis.

    Figure 41: InforSense KDE workflows and visualization tools are easily deployed for execution from a portal interface. The portal interface provides a more familiar and simple-to-use interface for clinicians.

    Both researchers and clinicians can browse and dynamically drill down into data, choosing from hundreds of dimensions, to identify patient populations for further exploration. The identified populations can then be used as a starting point for web-deployed workflows that enable the complex analysis of experimental data. As data and analysis techniques change, new workflows can easily be published into the portal, enabling informatics staff to support a broad, disparate and ever-evolving user community within a single technology framework.


    Case study summary

    The InforSense KDE integrative analytics platform provides translational medicine researchers with the ability to integrate all their required data types and analytic tools. It also allows them to optimize all workflow steps across the different research disciplines, covering both molecular and clinical data types.

    Figure 42: Integrative analytics workflows can be used to bridge the gap between bedside and bench-side research.

    Within translational medicine applications, workflows integrate data from a large number of data sources across different domains. They are developed by the domain experts themselves, with each expert tailoring their workflows to suit their analysis needs. The result is wide coverage of a variety of different analyses. The interactive features of InforSense KDE provide users with the different tools required to browse and interpret their datasets and to refine their models. With a large number of such workflows developed by many researchers, the collaborative features of InforSense KDE become an essential part of the whole approach.

    Using the InforSense portal technology, it is straightforward to deploy a workflow as a web application. Deployment exposes the exact parameters that a clinician may need to access, together with the relevant actions that need to be carried out, such as uploading new data and running the model. The result of the deployed workflow guides the clinician to a therapy recommendation.

    Such a unifying framework, integrating both genotype and phenotype, enables clinical data to guide and accelerate the discovery of new diagnostics and therapeutics, and ultimately delivers this knowledge seamlessly to the clinician.


    SUMMARY AND CONCLUSIONS

    In this paper, we have presented an overview of the InforSense integrative analytics approach and described how it provides a practical and cost-effective way to address the underlying IT challenges faced by life science research and development activities. We have also presented the InforSense KDE system and highlighted how specific features support decision makers in their integrative analysis processes. Finally, we have presented case studies in high throughput chemistry and translational medicine to demonstrate the overall approach and tools used. The key benefits of the approach are summarized below.

    A user-oriented framework

    The InforSense integrative analytics approach provides a user-oriented framework for using analytical workflows to co-ordinate access to distributed resources as needed. It is designed to allow the domain experts themselves to construct the analytical workflows, providing a user-friendly environment in which to conduct this analysis, together with mechanisms for easily deploying their workflows for use by colleagues. The approach also enables them to manage their knowledge discovery processes using a range of collaborative and knowledge management tools.

    Figure 43: Integrative analytics provides a user-oriented framework for developing and using analytic workflows.


    A flexible framework for Life Science R&D

    The InforSense integrative analytics approach also provides a unifying framework for cross-domain analytics across Life Science R&D activities and beyond, into Manufacturing and Sales/Marketing. It enables research processes that span all drug discovery activities, from target identification and validation right through to the clinic. It also provides tools that address the needs of all the decision makers involved in the process: research scientists, power users and discovery IT personnel, together with the project managers in charge of controlling and evaluating the progress of large-scale discovery and development projects.

    Figure 44: Integrative analytics provides a unifying framework for cross-domain analytics in Life Sciences research and development.
