BeeSpace Software: Plans, Design, and Development
Outline
Goals
Context
Approach
Software Process
Functionality
Design
Implementation Details
Future Prospects
Project Goals & Parameters
“This project will analyze social behavior… using Apis Mellifera as the model organism.”
Goal: support research and analysis of the Western honey bee.
Using “biology research (that) will generate a unique database of gene expressions…” and “microarray experiments (that will) utilize the recently sequenced genome, supported by state-of-the-art statistics.”
Goal: support application of biological methods and techniques for exploratory analysis.
And using “informatics research (that) will develop an interactive environment to analyze all information sources relevant to bee social behavior.”
Goal: support application of language processing methods for exploratory analysis.
“The BeeSpace environment will enable users to navigate a uniform space of diverse databases and literature sources for hypothesis development and testing.” (Ref: http://www.beespace.uiuc.edu/)
Goal: support dual analysis methodologies via an integrated analysis environment.
Parameter: 5 years to complete the project, including research, development, deployment, outreach and documentation.
Parameter: annual milestones and workshops expected.
Context
There are voluminous amounts of biomedical and genomic literature containing valuable knowledge and research results.
Implication: Too much for human processing, and not in a machine-ready format for reasoning-based systems.
There exist novel language processing techniques that have been applied primarily in niche applications.
Implication: Emerging technologies (NLP, TM, etc.) can provide the backbone for a strategic solution, but their risks must be mitigated through controlled development cycles.
There exist numerous, but currently isolated, tools for bioinformatics data processing.
Implication: Opportunities exist for interoperability with disparate systems, but success hinges on standardization.
The web is seeing an increase in smaller, highly focused communities-of-interest.
Implication: Opportunities exist for supporting the creation and management of localized “knowledge-spaces”.
Context – Related Tools & Projects
3rd Millennium Inc. – “…development of an integration framework for genomic, gene expression, and interaction data (protein-protein as well as protein-DNA) from multiple sources and model organisms that can enable the display of the relationships between biochemical objects into the context of biological pathways and networks.”
iHOP – Information Hyperlinked Over Proteins: supports lookup and summarization of genes/proteins. “In general more than 90% of all active relations between proteins in the literature are expressed syntactically as ‘protein verb protein’”. Ref.
IntAct Database – “IntAct provides a freely available, open source database system and analysis tools for protein interaction data. All interactions are derived from literature curation or direct user submissions and are freely available.”
Entrez eUtils – A web services (SOAP) interface for programmatically querying and interacting with NCBI databases.
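The iHOP observation that most protein relations surface as "protein verb protein" suggests how simple a first-pass extractor can be. The following is a minimal sketch of that idea, not iHOP's actual method; the protein and verb lists are hypothetical stand-ins for curated dictionaries.

```python
import re

# Hypothetical mini-lexicons for illustration only; a real system would
# load curated protein names and interaction verbs from a dictionary.
PROTEINS = ["p150Glued", "dynein", "kinesin"]
VERBS = ["binds", "activates", "inhibits", "phosphorylates"]


def find_relations(sentence):
    """Return (protein, verb, protein) triples that appear back to back."""
    names = "|".join(re.escape(p) for p in PROTEINS)
    verbs = "|".join(re.escape(v) for v in VERBS)
    pattern = re.compile(rf"\b({names})\s+({verbs})\s+({names})\b")
    return pattern.findall(sentence)


if __name__ == "__main__":
    print(find_relations("We show that p150Glued binds dynein in vivo."))
```

A pattern this strict misses passive constructions and intervening modifiers, which is one reason the later slides call for a tagger and chunker in the pipeline.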
Software Process
System Development Life Cycle (SDLC)
Identify project goals and critical success factors.
Investigate current methodologies and tools that have functional or domain overlap with project objectives.
Research the applicability of novel analysis techniques for extracting deeply embedded and stratified knowledge structures.
Build an integrated software suite that will allow for interactive analysis and augmentation of rich data sets.
Test and deploy software to focused user groups.
Document and publish research results.
Repeat the above process for continuous quality improvement.
Functionality
Should be a web-based system supporting lightweight GUI components and having minimal end-user requirements.
Should accommodate user-directed query-by-navigation (QBN) of “concept space”.
Should extract and normalize concepts as “equivalence classes” of things with highly similar meaning.
Should recognize and denote entities.
Should allow the user to drill down, drill up and drill across concept space, e.g. text-to-concept, concept-to-concept, concept-to-theme, and the reverse directions as well.
Should allow the user to perform encyclopedia-style lookup of entities.
Should provide hooks for tie-in to 3rd party bioinformatics tools.
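The requirement to normalize concepts into “equivalence classes” can be illustrated with a small sketch. It groups terms by a crude surface-form key (case, hyphens, spaces, plural "s"), which is only a stand-in for real similarity of meaning; the function names and the sample terms are assumptions made for illustration.

```python
from collections import defaultdict


def normalize(term):
    """Crude key: lowercase, drop hyphens and spaces, strip a trailing 's'."""
    key = term.lower().replace("-", "").replace(" ", "")
    if key.endswith("s") and len(key) > 3:
        key = key[:-1]
    return key


def equivalence_classes(terms):
    """Group terms that share the same normalized key."""
    classes = defaultdict(list)
    for term in terms:
        classes[normalize(term)].append(term)
    return dict(classes)


if __name__ == "__main__":
    sample = ["honey bee", "Honeybee", "honey-bees", "forager"]
    for key, members in equivalence_classes(sample).items():
        print(key, members)
```

Each class key can then serve as the node a user navigates to when drilling from text to concept.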
Design Principles
Maintainability
Portability
Extensible
Efficiency
Organized
Interoperability
Configurability
Ease-of-use
Trusted
“Quality without a Name”
References: “Code Complete”, 2nd ed.; “Pattern-Oriented Software Architecture”, volume 1.
Design – Use Case Diagram
Design - Component Diagram: BeeSpace Design
[Figure: component diagram. Layers: Application Layer; Query & Data Access Layer; Annotated Data, Meta-Data and Indices; Data Processing Layer; Data Sources. Components: BeeSpace Navigator, Fuzzy Query Engine, Data Access Component, XML Schemas, XML Data, Indices, Entity Recognizer, NP Chunker, POS Tagger, Concept Normalizer, Inverter, Concept Generator, TextBases.]
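The diagram lists an inverter among the data processing components and indices among the stored outputs. Assuming the inverter builds an inverted index over the text bases (an assumption; the slides do not define it), a minimal sketch might look like this, with hypothetical document ids and text:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.lower().split():
            token = raw.strip(".,;:()")  # trim surrounding punctuation
            if token:
                index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


if __name__ == "__main__":
    docs = {
        1: "Foraging behavior in honey bees.",
        2: "Gene expression and behavior.",
    }
    print(build_inverted_index(docs)["behavior"])
```

A query layer can then answer term lookups from the index without rescanning the source text.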
Design - Deployment Scenarios: BeeSpace Software Packaging
[Figure: deployment diagram. Libraries: Core Library, Extension Library. Components: Data Processing Components, Query/Access Components, Communication Components. Applications: Web App, Standalone GUI App, Agents/P2P Clients.]
Design – Class Diagram
Implementation Details
The current system is being constructed as follows:
The (v1.0) application is being developed as a web-based application.
Design Decision: The interface is built on top of lightweight technologies (e.g. HTML, DHTML & JavaScript). Typical web-app challenges, such as sessioning and security, need to be addressed.
The output of the data processing pipeline is a set of indices and annotated data files that the client application depends on.
Design Decision: There is a clear separation of concerns between the server-side processing and the client-side interface. XML is used throughout as the data interchange format between software components.
The pipeline is composed of independent software components, but these components need to be interconnected.
Design Decision: Components are called as executables with defined interfaces.
Some components need to store their data aggregations persistently (and other components may need access to this data).
Design Decision: Currently each component handles this problem independently. A better long-term solution is to extract this concern and address it globally, for example using an ORDBMS.
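To make the XML interchange decision concrete, here is a minimal sketch of one component writing annotated entities as XML and another reading them back. The element and attribute names (`document`, `entity`, `type`) are hypothetical; the slides do not show the actual BeeSpace schemas.

```python
import xml.etree.ElementTree as ET


def to_xml(doc_id, entities):
    """Serialize (name, type) entity annotations for one document."""
    root = ET.Element("document", id=str(doc_id))
    for name, etype in entities:
        ent = ET.SubElement(root, "entity", type=etype)
        ent.text = name
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text):
    """Parse the XML back into a document id and a list of (name, type) pairs."""
    root = ET.fromstring(xml_text)
    entities = [(e.text, e.get("type")) for e in root.findall("entity")]
    return root.get("id"), entities


if __name__ == "__main__":
    payload = to_xml(7, [("Glued", "gene"), ("p150Glued", "polypeptide")])
    print(payload)
    print(from_xml(payload))
```

Because each side depends only on the XML shape, a component invoked as a separate executable could be swapped out without touching its neighbors.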
Future Implementation Details
Support both a web interface (HTML, CSS, DHTML, JavaScript) and a full-blown GUI interface (Java Web Start app).
Consistent Java implementation for portability, maintainability, RAD, etc.
Incorporate a DBMS for consistent handling of “persistent storage”.
Library extensions for communication between distributed, heterogeneous applications (perhaps KIF).
Optimized data processing and communication.
Climbing the Pyramid
[Figure: Pyramid of Knowledge, two parallel pyramids.]
Data Mining: Raw Data (Txns); Aggregations (Reports); Predictions (Trends); Intelligent-driven Business (Profit); Computer Automated Business (Success)
Text Mining: Raw Text (Lit.); Semantics (Nodes); Hidden Relationships (Network); Intelligent-driven Research (Profit); Computer Automated Research (Success)
Future Prospects
Generalize the system so that it is NOT domain-specific and can be readily applied to other domains.
Allow for persistent sessioning and sharing of knowledge-spaces amongst communities-of-interest.
Support a visual query system (VQS) interface and/or a query-by-example (QBE) interface.
Support all kinds of hypothesis generation: deduction, abduction & induction.
Support personalized annotations. (What constitutes a “good” KR structure: clarity, logic, expressiveness?)
Smooth the integration between the BeeSpace Navigator and the myriad web-based tools.
Support n-ary, semantically rich relations as opposed to just dyadic.
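The difference between dyadic and n-ary relations can be sketched with a small data structure in which each argument carries a named role. The class, the predicate, and the role names below are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A relation whose arguments are (role, filler) pairs of any arity."""
    predicate: str
    roles: tuple

    @property
    def arity(self):
        return len(self.roles)

    def filler(self, role):
        """Return the filler for a role, or None if the role is absent."""
        return dict(self.roles).get(role)


if __name__ == "__main__":
    # Hypothetical ternary relation: a gene expressed in a tissue under a condition.
    rel = Relation(
        "expressed-in",
        (("gene", "geneA"), ("tissue", "brain"), ("condition", "foraging")),
    )
    print(rel.arity, rel.filler("tissue"))
```

A dyadic store would have to split this fact into several pairwise links and lose the connection between tissue and condition.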
Visual Query in Text Mining Application
[Figure: visual query graph. Nodes: Polypeptide: p150Glued; Gene: ?x; Protein: ?y; Org: bee; Org: fly; Gene: Glued. Edge labels: Found-In; Has-Product; Similar-To; Threshold: 0.9.]
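A visual query with variables such as ?x and ?y amounts to matching a graph pattern against stored facts. The sketch below is one simple way to do that over (subject, relation, object) triples. The fact set is hypothetical (the names geneX and protY are invented), and the Similar-To edge with its threshold is left out because it would need a similarity function the slides do not specify.

```python
def match(pattern, fact, binding):
    """Unify one triple pattern with one fact; return the extended binding or None."""
    new = dict(binding)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            # A variable must keep the value it was first bound to.
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new


def query(patterns, facts, binding=None):
    """Return every variable binding that satisfies all patterns."""
    binding = binding or {}
    if not patterns:
        return [binding]
    results = []
    for fact in facts:
        extended = match(patterns[0], fact, binding)
        if extended is not None:
            results.extend(query(patterns[1:], facts, extended))
    return results


if __name__ == "__main__":
    facts = [
        ("geneX", "Found-In", "bee"),          # hypothetical fact
        ("geneX", "Has-Product", "protY"),     # hypothetical fact
        ("Glued", "Found-In", "fly"),
        ("Glued", "Has-Product", "p150Glued"),
    ]
    patterns = [("?x", "Found-In", "bee"), ("?x", "Has-Product", "?y")]
    print(query(patterns, facts))
```

The same matcher could back a query-by-example interface, since an example with blanks is just a pattern with variables.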
Future BeeSpace Components: Future BeeSpace Design
[Figure: component diagram. Layers: Application Layer; Query & Data Access Layer; Central Knowledge Base; Data Processing Layer; Data Sources. Components: BeeSpace Navigator, BeeSpace Workflow Manager, BeeSpace Analyzer, Q/A Component, Fuzzy Query Engine, Expert Shell Component, Data Access Component, Entity Mapper, Text Miner, ORDBMS, Entity Recognizer, NP Chunker, POS Tagger, Concept Normalizer, Inverter, Concept Generator, Relation Extractor, Topic Detection, Rule Miner, Ontology Detector, DataBases, TextBases, WebBases.]
Snake Space?