BeeSpace Software

BeeSpace Software: Plans, Design, and Development

Outline

Goals
Context
Approach
Software Process
Functionality
Design
Implementation Details
Future Prospects

Project Goals & Parameters

“This project will analyze social behavior… using Apis Mellifera as the model organism”.
Goal: support research and analysis of the Western honey bee.

Using “biology research (that) will generate a unique database of gene expressions…” and “microarray experiments (that will) utilize the recently sequenced genome, supported by state-of-the-art statistics.”
Goal: support application of biological methods and techniques for exploratory analysis.

And using “informatics research (that) will develop an interactive environment to analyze all information sources relevant to bee social behavior.”
Goal: support application of language processing methods for exploratory analysis.

“The BeeSpace environment will enable users to navigate a uniform space of diverse databases and literature sources for hypothesis development and testing.” (Ref: http://www.beespace.uiuc.edu/)
Goal: support dual analysis methodologies via an integrated analysis environment.

Parameter: 5 years to complete the project, including research, development, deployment, outreach, and documentation.

Parameter: annual milestones and workshops expected.

Context

There are voluminous amounts of biomedical and genomic literature containing valuable knowledge and research results.
Implication: Too much for human processing, and not in a machine-ready format for reasoning-based systems.

There exist novel language processing techniques that have been applied primarily in niche applications.
Implication: Emerging technologies (NLP, TM, etc.) can provide the backbone for a strategic solution, but their risks must be mitigated through controlled development cycles.

There exist numerous, but currently isolated, tools for bioinformatics data processing.
Implication: Opportunities exist for interoperability with disparate systems, but success hinges on standardization.

The web is seeing an increase in smaller, highly focused communities-of-interest.
Implication: Opportunities exist for supporting the creation and management of localized “knowledge-spaces”.

Context – Related Tools & Projects

3rd Millennium Inc. – “…development of an integration framework for genomic, gene expression, and interaction data (protein-protein as well as protein-DNA) from multiple sources and model organisms that can enable the display of the relationships between biochemical objects into the context of biological pathways and networks.”

iHOP – Information Hyperlinked Over Proteins: supports lookup and summarization of genes/proteins. “In general more than 90% of all active relations between proteins in the literature are expressed syntactically as ‘protein verb protein’”. Ref.

IntAct Database – “IntAct provides a freely available, open source database system and analysis tools for protein interaction data. All interactions are derived from literature curation or direct user submissions and are freely available.”

Entrez eUtils – A web services (SOAP) interface for programmatically querying and interacting with NCBI databases.
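
As a rough illustration of programmatic access to NCBI databases, the sketch below builds a query against the URL-based E-utilities `esearch` endpoint (a different interface from the SOAP one named above) and parses a response. The helper names, the search term, and the sample XML with its IDs are assumptions made for illustration; no network request is sent.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL of the URL-based E-utilities service (not the SOAP interface).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for the given database and query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_id_list(xml_text):
    """Extract record IDs from an esearch-style XML response."""
    root = ET.fromstring(xml_text)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


# Hand-written sample shaped like an esearch response (the IDs are made up).
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""

if __name__ == "__main__":
    print(build_esearch_url("pubmed", "Apis mellifera social behavior"))
    print(parse_id_list(SAMPLE_RESPONSE))
```

A real client would fetch the URL, handle errors, and respect NCBI's usage guidelines; those parts are left out here.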

Software Process: System Development Life Cycle (SDLC)

Identify project goals and critical success factors.
Investigate current methodologies and tools that have functional or domain overlap with project objectives.
Research the applicability of novel analysis techniques for extracting deeply embedded and stratified knowledge structures.
Build an integrated software suite that will allow for interactive analysis and augmentation of rich data sets.
Test and deploy software to focused user groups.
Document and publish research results.
Iterate the above process for continuous quality improvement.

Functionality

Should be a web-based system supporting lightweight GUI components and having minimal end-user requirements.
Should accommodate user-directed query-by-navigation (QBN) of “concept space”.
Should extract and normalize concepts as “equivalence classes” of things with highly similar meaning.
Should recognize and denote entities.
Should allow the user to drill down, drill up, and drill across concept space, e.g. text-to-concept, concept-to-concept, concept-to-theme, and the reverse directions as well.
Should allow the user to perform encyclopedia-style lookup of entities.
Should provide hooks for tie-in to 3rd party bioinformatics tools.
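
The idea of normalizing concepts as “equivalence classes” can be sketched with a union-find structure: terms judged to have highly similar meaning are merged into one class, and any member of a class resolves to the same concept. This is only a minimal sketch. The class name is borrowed from the ConceptNormalizer component in the design, but the methods, the lowercasing rule, and the example terms are assumptions, and how similarity is actually decided is outside its scope.

```python
class ConceptNormalizer:
    """Groups surface terms into equivalence classes using union-find."""

    def __init__(self):
        self._parent = {}

    def _find(self, term):
        # Register unseen terms as their own class, then walk up to the root.
        self._parent.setdefault(term, term)
        root = term
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited term directly at the root.
        while self._parent[term] != root:
            self._parent[term], term = root, self._parent[term]
        return root

    def merge(self, a, b):
        """Declare two terms equivalent (case-insensitive)."""
        root_a, root_b = self._find(a.lower()), self._find(b.lower())
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_concept(self, a, b):
        """Return True if both terms belong to the same equivalence class."""
        return self._find(a.lower()) == self._find(b.lower())

    def classes(self):
        """Return all equivalence classes as a list of sets of terms."""
        groups = {}
        for term in list(self._parent):
            groups.setdefault(self._find(term), set()).add(term)
        return list(groups.values())


if __name__ == "__main__":
    normalizer = ConceptNormalizer()
    normalizer.merge("honey bee", "honeybee")
    normalizer.merge("honeybee", "Apis mellifera")
    normalizer.merge("foraging", "forage")
    print(normalizer.same_concept("Honey Bee", "apis mellifera"))
    print(normalizer.classes())
```

Drill-across navigation (concept-to-concept) could then operate on class representatives rather than on raw strings.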

Design Principles

Maintainability
Portability
Extensible
Efficiency
Organized
Interoperability
Configurability
Ease-of-use
Trusted
“Quality without a Name”

References: “Code Complete”, 2nd ed.; “Pattern-Oriented Software Architecture”, Volume 1.

Design – Use Case Diagram

Design – Component Diagram (BeeSpace Design)

[Figure: component diagram. Layers: Application Layer; Query & Data Access Layer; Annotated Data, Meta-Data and Indices; Data Processing Layer; Data Sources. Components and data stores: BeeSpaceNavigator, FuzzyQueryEngine, DataAccess Component, XML Schemas, XML Data, Indices, EntityRecognizer, NP Chunker, POSTagger, ConceptNormalizer, Inverter, Concept Generator, TextBases.]

Design – Deployment Scenarios (BeeSpace Software Packaging)

[Figure: packaging diagram. Labels: Core Library, Data Processing Components, Query/Access Components, Extension Library, Communication Components, WebApp, Standalone GUI App, Agents/P2P Clients.]

Design – Class Diagram

Implementation Details

The current system is being constructed as follows:

The (v1.0) application is being developed as a web-based application.
Design Decision: The interface is built on top of lightweight technologies (e.g. HTML, DHTML & JavaScript). Typical web-app challenges, such as session management and security, need to be addressed.

The output of the data processing pipeline is a set of indices and annotated data files that the client application depends on.
Design Decision: There is a clear separation of concerns between the server-side processing and the client-side interface. XML is used as the data interchange format between software components.

The pipeline is composed of independent software components, but these components need to be interconnected.
Design Decision: Components are called as executables with defined interfaces.

Some components need to store their data aggregations persistently (and other components may need access to this data).
Design Decision: Currently each component handles this problem independently. A better long-term solution is to factor out this concern and address it globally, for example by using an ORDBMS.
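
One way to picture components called as executables that exchange XML is a driver that runs each stage as a separate process and pipes the XML text from one stage to the next. The sketch below is an assumption-laden illustration, not the project's code: the two stand-in stages are small Python scripts, and the `<doc>`/`<sentence>` elements and the `tokens` and `mentions_bee` attributes are invented for the example.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET

# Stand-in stage 1 (hypothetical): annotate each sentence with a token count.
COUNT_TOKENS = """
import sys, xml.etree.ElementTree as ET
root = ET.fromstring(sys.stdin.read())
for s in root.iter('sentence'):
    s.set('tokens', str(len(s.text.split())))
sys.stdout.write(ET.tostring(root, encoding='unicode'))
"""

# Stand-in stage 2 (hypothetical): flag sentences that mention bees.
FLAG_BEES = """
import sys, xml.etree.ElementTree as ET
root = ET.fromstring(sys.stdin.read())
for s in root.iter('sentence'):
    if 'bee' in s.text.lower():
        s.set('mentions_bee', 'true')
sys.stdout.write(ET.tostring(root, encoding='unicode'))
"""


def run_pipeline(stages, xml_input):
    """Run each stage as its own process, feeding it the previous stage's output."""
    data = xml_input
    for command in stages:
        result = subprocess.run(
            command, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


if __name__ == "__main__":
    stages = [
        [sys.executable, "-c", COUNT_TOKENS],
        [sys.executable, "-c", FLAG_BEES],
    ]
    source = "<doc><sentence>Honey bees forage</sentence><sentence>Flies do not</sentence></doc>"
    annotated = run_pipeline(stages, source)
    for sentence in ET.fromstring(annotated).iter("sentence"):
        print(sentence.text, sentence.attrib)
```

Because each stage only reads XML on standard input and writes XML on standard output, a stage can be replaced or reordered without changing the others, which is the practical benefit of the "defined interfaces" decision.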

Future Implementation Details

Support both a web interface (HTML, CSS, DHTML, JavaScript) and a full-blown GUI interface (Java Web Start app).
Consistent Java implementation for portability, maintainability, RAD, etc.
Incorporate a DBMS for consistent handling of “persistent storage”.
Library extensions for communication between distributed, heterogeneous applications (perhaps KIF).
Optimized data processing and communication.
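
To make the idea of a shared DBMS for “persistent storage” concrete, here is a minimal sketch in which components write to and read from one common store instead of each managing its own files. It uses Python's built-in `sqlite3` module purely as a stand-in (SQLite is a relational engine, not an ORDBMS); the table name, columns, and helper functions are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the shared store and make sure the (hypothetical) table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS concept_terms ("
        " concept TEXT NOT NULL,"
        " term TEXT NOT NULL,"
        " source TEXT NOT NULL,"
        " PRIMARY KEY (concept, term))"
    )
    return conn


def save_terms(conn, concept, terms, source):
    """Persist the terms a component assigned to a concept; duplicates are ignored."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO concept_terms VALUES (?, ?, ?)",
            [(concept, term, source) for term in terms],
        )


def terms_for(conn, concept):
    """Return all stored terms for a concept, sorted alphabetically."""
    rows = conn.execute(
        "SELECT term FROM concept_terms WHERE concept = ? ORDER BY term",
        (concept,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    store = open_store()
    save_terms(store, "honey_bee", ["honeybee", "apis mellifera"], "ConceptNormalizer")
    print(terms_for(store, "honey_bee"))
```

Passing a file path instead of `":memory:"` would keep the data across runs, so that one component's output remains available to the others.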

Climbing the Pyramid (Pyramid of Knowledge)

[Figure: two parallel pyramids, levels listed from base to apex.]
Data Mining: Raw Data (Txns); Aggregations (Reports); Predictions (Trends); Intelligent-driven Business (Profit); Computer Automated Business (Success)
Text Mining: Raw Text (Lit.); Semantics (Nodes); Hidden Relationships (Network); Intelligent-driven Research (Profit); Computer Automated Research (Success)

Future Prospects

Generalize the system so that it is NOT domain-specific and can be readily applied to other domains.
Allow for persistent sessions and sharing of knowledge-spaces amongst communities-of-interest.
Support a visual query system (VQS) interface and/or a query-by-example (QBE) interface.
Support all kinds of hypothesis generation: deduction, abduction & induction.
Support personalized annotations. (What constitutes a “good” KR structure: clarity, logic, expressiveness?)
Smooth the integration between the BeeSpace Navigator and the myriad web-based tools.
Support n-ary, semantically rich relations as opposed to just dyadic ones.

Visual Query in Text Mining Application

[Figure: visual query graph. Node labels: Polypeptide: p150Glued; Gene: Glued; Org: fly; Gene: ?x; Protein: ?y; Org: bee. Edge labels: Found-In, Has-Product, Similar-To (Threshold: 0.9).]
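
A visual query of this kind can be read as a set of relation patterns with variables (`?x`, `?y`) to be matched against extracted facts. The sketch below assumes one plausible reading of the figure: find a bee gene `?x` whose product `?y` is similar to p150Glued at a threshold of 0.9. The matcher, the facts, the gene and protein names other than Glued and p150Glued, and the similarity scores are all hypothetical.

```python
def is_var(item):
    """Variables are strings that start with '?'."""
    return isinstance(item, str) and item.startswith("?")


def match(pattern, fact, binding):
    """Unify one (subject, relation, object) pattern with one fact, or return None."""
    extended = dict(binding)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if extended.setdefault(p, f) != f:
                return None  # variable already bound to a different value
        elif p != f:
            return None
    return extended


def query(patterns, facts, binding=None):
    """Yield every variable binding that satisfies all patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    for fact in facts:
        extended = match(patterns[0], fact, binding)
        if extended is not None:
            yield from query(patterns[1:], facts, extended)


def similar_facts(scores, threshold):
    """Turn similarity scores into Similar-To facts at or above the threshold."""
    return [(a, "Similar-To", b) for (a, b), s in scores.items() if s >= threshold]


# Hypothetical facts and scores, invented for illustration only.
FACTS = [
    ("Glued", "Found-In", "fly"),
    ("Glued", "Has-Product", "p150Glued"),
    ("geneA", "Found-In", "bee"),
    ("geneA", "Has-Product", "proteinA"),
    ("geneB", "Found-In", "bee"),
    ("geneB", "Has-Product", "proteinB"),
]
SCORES = {("proteinA", "p150Glued"): 0.93, ("proteinB", "p150Glued"): 0.41}

PATTERNS = [
    ("?x", "Found-In", "bee"),
    ("?x", "Has-Product", "?y"),
    ("?y", "Similar-To", "p150Glued"),
]

if __name__ == "__main__":
    all_facts = FACTS + similar_facts(SCORES, threshold=0.9)
    print(list(query(PATTERNS, all_facts)))
```

This handles only dyadic relations; the n-ary, semantically rich relations listed under Future Prospects would need a richer fact representation than triples.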

Future BeeSpace Components (Future BeeSpace Design)

[Figure: component diagram. Layers: Application Layer; Query & Data Access Layer; Central Knowledge Base (ORDBMS); Data Processing Layer; Data Sources. Components: BeeSpaceNavigator, BeeSpaceWorkflowManager, BeeSpaceAnalyzer, Q/A Component, FuzzyQueryEngine, ExpertShell Component, DataAccess Component, EntityMapper, TextMiner, EntityRecognizer, NP Chunker, POSTagger, ConceptNormalizer, Inverter, Concept Generator, RelationExtractor, TopicDetection, RuleMiner, OntologyDetector. Data sources: DataBases, TextBases, WebBases.]

Snake Space?

Date post:	30-Dec-2015
Upload:	sade-snyder