FlyWeb: the way to go for biological data integration

Jun Zhao, Alistair Miles and Graham Klyne

Image Bioinformatics Research Group

Department of Zoology

University of Oxford

FlyWeb Application

To answer questions such as "What does this gene do?":
  Gene expression images
  Sequence and ESTs (expressed sequence tags) of the gene
  Publications about the gene
  ...

A first example of the Image Web that our group is developing

Investigate the feasibility of existing Semantic Web tools and technologies for real applications

Gene expression images

Reveal gene expression patterns at different developmental stages

Important for identifying genes of interest and verifying a picture of probable gene functions

FlyWeb demonstration

http://openflydata.org/flyui/build/apps/imagemashup2/ Run application: [go]

Two examples:
  Single gene query (aos1)
  Use gene synonyms to enhance gene matching (rbf)

More than one synonym for gene "rbf"

How does it work?

Data from 3 independent sources:
  www.flybase.org – model organism reference database, gene names and identifiers
  www.fruitfly.org (BDGP) – embryo in situ images
  www.fly-ted.org – testis in situ images

All data accessed via SPARQL

Pure Ajax user application

Essentially, a mashup using a SPARQL API

The client side FlyUI:

A library of JavaScript widgets as front ends to SPARQL data sources

Built on the Yahoo User Interface (YUI) library

Widgets are composed in a browser to create the complete application

Each widget provides:
  A Service that implements SPARQL queries
  A Model encapsulating SPARQL query results
  A Renderer
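The Service / Model / Renderer split can be sketched as follows. FlyUI itself is JavaScript on YUI; this is only an illustration in Python, and every class name, the canned result and the placeholder ID FBgn0000000 are assumptions, not FlyUI's actual API.

```python
import json


class StubService:
    """Service: would send a SPARQL query; here it returns a canned result."""

    def find_genes(self, name):
        # Shaped like the SPARQL JSON results format; the ID is a placeholder.
        return json.dumps({
            "head": {"vars": ["symbol", "flybaseID"]},
            "results": {"bindings": [
                {"symbol": {"type": "literal", "value": name},
                 "flybaseID": {"type": "literal", "value": "FBgn0000000"}},
            ]},
        })


class GeneModel:
    """Model: holds the rows parsed from a SPARQL JSON result document."""

    def __init__(self):
        self.rows = []

    def load(self, sparql_json):
        doc = json.loads(sparql_json)
        # Flatten each binding to a plain {variable: value} dictionary.
        self.rows = [
            {var: cell["value"] for var, cell in binding.items()}
            for binding in doc["results"]["bindings"]
        ]


class TextRenderer:
    """Renderer: turns the model into something displayable (plain text here)."""

    def render(self, model):
        return "\n".join(f'{r["symbol"]} ({r["flybaseID"]})' for r in model.rows)


def run_widget(service, model, renderer, name):
    """Wire the three parts together the way a widget would."""
    model.load(service.find_genes(name))
    return renderer.render(model)
```

Keeping the three roles separate means a different renderer or a different endpoint can be swapped in without touching the other two parts.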

The in situ search application

GeneFinderWidget

FlyTED ImageWidget

BDGP ImageWidget

Gene name mapping

FlyTED and BDGP use different gene names
  FlyTED data are derived from spreadsheets with an imperfectly controlled gene name vocabulary
  BDGP's data are annotated using FlyBase's unique FBgn numbers

Use FlyBase for automatic gene mapping

Additional inputs from scientists for disambiguating many-many mappings

Mappings are stored as a JSON file to assist the "GeneFinder" widget (we have no use for RDF/OWL reasoning at this stage)
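A minimal sketch of how a JSON mapping file could support name lookup and flag many-to-many cases for manual disambiguation. The file layout (gene name to list of FBgn identifiers), the names and the identifiers are all hypothetical; the actual FlyWeb mapping format is not shown in these slides.

```python
import json

# Hypothetical mapping content: FlyTED gene name -> FlyBase FBgn identifiers.
MAPPING_JSON = """
{
  "name-a": ["FBgn0000001"],
  "name-b": ["FBgn0000002", "FBgn0000003"]
}
"""


def load_mapping(text):
    """Parse the mapping and normalise keys to lower case."""
    raw = json.loads(text)
    return {name.lower(): ids for name, ids in raw.items()}


def resolve(mapping, name):
    """Return the FBgn identifiers for a name, or an empty list if unknown."""
    return mapping.get(name.strip().lower(), [])


def is_ambiguous(mapping, name):
    """True when a name maps to more than one gene and needs a curator's input."""
    return len(resolve(mapping, name)) > 1
```

A plain dictionary lookup like this is enough for the widget's purpose, which is consistent with the point that no RDF/OWL reasoning is needed at this stage.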

SPARQL queries

Free-text matching

Case-insensitive searching

Very important for our users

Too expensive using a SPARQL FILTER

Pre-generate lower-case gene names and load them into the FlyBase RDF DB
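The pre-generation step could look roughly like this: for each gene, emit one extra triple per distinct lower-cased name, so that a query can match an exact literal instead of running a regex filter. The predicate URI below is a placeholder, since the real `fbutil` namespace is not given in the slides, and the literals are typed as XML Schema strings on the assumption that `xs:` in the query is bound to that namespace.

```python
# Placeholder predicate URI; the real fbutil namespace is not given in the slides.
FBUTIL_ANYNAME = "http://example.org/fbutil/anyName"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def any_name_triples(gene_uri, names):
    """Yield one N-Triples line per distinct lower-cased name of a gene."""
    seen = set()
    for name in names:
        lowered = name.lower()
        if lowered in seen:
            continue  # "Rbf" and "RBF" collapse to the same lower-case form
        seen.add(lowered)
        yield (
            f'<{gene_uri}> <{FBUTIL_ANYNAME}> '
            f'"{escape_literal(lowered)}"^^<{XSD_STRING}> .'
        )
```

The application then lower-cases the user's input before querying, which turns a case-insensitive search into an exact match.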

SELECT * WHERE {
  ?gene fbutil:anyName "userInput"^^xs:string ;
        a chado:Feature ;
        chado:name ?symbol ;
        chado:uniquename ?flybaseID .
  OPTIONAL { ?gene chado:dbxref [ chado:accession ?annotationSymbol ] . }
  OPTIONAL { ?gene chado:synonym [ chado:name ?synonym ] . }
  OPTIONAL { ?gene chado:synonym [ a syntype:FullName ; chado:name ?fullName ] . }
}
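A hedged sketch of how the user's input might be placed into the gene query above: the input is trimmed, lower-cased to match the pre-generated names, and escaped so that a stray quote cannot break out of the string literal. The `@NAME@` placeholder and the function names are illustrative, the OPTIONAL clauses are left out for brevity, and the PREFIX declarations are omitted because the `fbutil` and `xs` namespaces are not given in the slides.

```python
# Non-optional part of the gene query, with a placeholder for the user's input.
QUERY_TEMPLATE = """SELECT * WHERE {
  ?gene fbutil:anyName "@NAME@"^^xs:string ;
        a chado:Feature ;
        chado:name ?symbol ;
        chado:uniquename ?flybaseID .
}"""


def escape_sparql_string(text):
    """Escape characters that would terminate or corrupt a SPARQL string."""
    out = text.replace("\\", "\\\\")  # backslash first, so later escapes survive
    out = out.replace('"', '\\"')
    return out.replace("\n", "\\n").replace("\r", "\\r")


def gene_query(user_input):
    """Build the query for one user-supplied gene name."""
    name = escape_sparql_string(user_input.strip().lower())
    return QUERY_TEMPLATE.replace("@NAME@", name)
```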

SELECT DISTINCT * WHERE {
  ?fullImageURL flyted:associatesToGene <http://openflydata.org/id/flyted/gene-geneName> ;
                flyted:associatesToGene ?gene ;
                flyted:thumbnail ?thumbnailURL ;
                rdfs:seeAlso ?flytedURL ;
                rdfs:label ?caption .
}

The RDF data sources

FlyBase and BDGP: relational databases

FlyTED, an image repository built using Eprints

FlyAtlas (forthcoming), tissue-specific Drosophila gene expression levels, as a single spreadsheet

Creating RDF from data sources

D2RQ mapping
  FlyBase and BDGP, native relational databases
  Conservative mapping, with minimum interpretation

OAI2SPARQL
  Harvesting N3 RDF metadata via the OAI-PMH protocol, with built-in support from Eprints
  Further details in the ESWC2008 paper

Custom Python program
  FlyAtlas
  Generating N3 from a spreadsheet table
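The spreadsheet-to-N3 step can be sketched as below. This is not the project's actual program: the column names (`probe`, `tissue`, `mean`), the `ex:` namespace and the property names are all hypothetical, chosen only to show the shape of such a conversion.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace; the real FlyAtlas vocabulary is not given in the slides.
BASE = "http://example.org/flyatlas/"
PREFIXES = f"@prefix ex: <{BASE}> .\n"


def escape_literal(text):
    """Escape backslashes and double quotes for an N3 string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_n3(csv_text):
    """Convert a CSV export of the spreadsheet into N3 statements."""
    reader = csv.DictReader(io.StringIO(csv_text))
    parts = [PREFIXES]
    for row in reader:
        # Percent-encode the identifier so it is safe inside a URI.
        subject = f'<{BASE}probe/{quote(row["probe"], safe="")}>'
        parts.append(
            f'{subject} ex:tissue "{escape_literal(row["tissue"])}" ;\n'
            f'    ex:meanExpression {float(row["mean"])} .\n'
        )
    return "".join(parts)
```

Calling `rows_to_n3` on a two-line CSV with a header row yields one subject with two properties, ready to be loaded into a triple store.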

More about the data sources

Bulk download
  http://openflydata.org/dump/flybase, ~8m triples
  http://openflydata.org/dump/bdgp, ~1m triples
  http://openflydata.org/dump/flyted, ~30,000 triples

SPARQL endpoint
  http://openflydata.org/query/flybase
  http://openflydata.org/query/bdgp
  http://openflydata.org/query/flyted

Schema
  http://purl.org/net/chado/schema/
  http://purl.org/net/flybase/synonym-types/
  http://purl.org/net/bdgp/schema/
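For readers who want to try the endpoints listed above from a script, a request following the standard SPARQL protocol (a `query` parameter on a GET request, with JSON results requested via the Accept header) could be prepared as below. The helper name is illustrative, and the sketch only builds the request; it does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def sparql_request(endpoint, query):
    """Prepare (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})
```

Passing the returned object to `urllib.request.urlopen` would perform the call, assuming the endpoint is reachable.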

SPARQL server

Amazon EC2 (Elastic Compute Cloud):
  To run SPARQL endpoints
  To host the demo you've just seen

Jena TDB as triple store
  For better loading performance: ~6K tps for ~9M triples to Amazon Elastic Block Storage (EBS)
  For better querying performance

SPARQLite
  Home-grown SPARQL protocol implementation
  More later

Apache, Tomcat, mod_jk, etc.

SPARQLite protocol

http://sparqlite.googlecode.com
  Also a platform for exploring SPARQL service quality concerns (more later)

Motivation
  Enable streaming
  Create a database connection pool

Designed for Jena TDB/SDB + Postgres

Restricted forms of query (SELECT, ASK)

Restricted query result format (e.g. only JSON)
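The idea of accepting only SELECT and ASK can be illustrated with a deliberately naive check that skips the PREFIX/BASE prologue and inspects the first keyword. This is an assumption-laden sketch, not how SPARQLite does it: it ignores comments, and a real service would be better served by parsing the query properly (for example with a SPARQL parser such as Jena ARQ).

```python
import re

ALLOWED_FORMS = {"SELECT", "ASK"}

# One PREFIX or BASE declaration at the start of the remaining text.
_PROLOGUE = re.compile(
    r"\s*(?:PREFIX\s+[^\s:]*:\s*<[^>]*>|BASE\s*<[^>]*>)", re.IGNORECASE
)


def query_form(query):
    """Return the upper-cased query form keyword, or None if none is found."""
    rest = query
    while True:
        match = _PROLOGUE.match(rest)
        if not match:
            break
        rest = rest[match.end():]  # drop the declaration and keep scanning
    word = re.match(r"\s*([A-Za-z]+)", rest)
    return word.group(1).upper() if word else None


def is_allowed(query):
    """True only for the query forms the restricted service accepts."""
    return query_form(query) in ALLOWED_FORMS
```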

Lessons

RDF provides a uniform and flexible data model
  RDF dump is cheaper and quicker
  Maintaining a separate SPARQL endpoint for each data source makes handling data updates easier than a data warehouse approach

RDF facilitates data re-use and re-purposing

SPARQL raises the point of departure for an application

Benefits for the future
  Linking to other data sources
  Querying genes using the Fly Anatomy ontology
  Magic of inference

Performance

Loading:
  Our datasets: ~10 million triples
  Jena / RDB / Postgres: OK with <1M triples
  Jena / SDB / Postgres: better, but problems with load performance with larger datasets
  Jena / TDB: much better load performance (~6K tps), even on a 32-bit system with Amazon EBS storage (but not so good with local EC2 store)
  Virtuoso performs reasonably well
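To put the load rate in perspective, a back-of-the-envelope calculation using the figures quoted for the SPARQL server (~9M triples at ~6K triples per second) gives a full load of roughly 25 minutes. The function name is illustrative, and the result assumes the rate stays constant throughout the load.

```python
def load_time_minutes(triples, triples_per_second):
    """Estimated bulk-load time in minutes at a constant load rate."""
    return triples / triples_per_second / 60


# ~9M triples at ~6K triples per second, as quoted on the slides.
estimate = load_time_minutes(9_000_000, 6_000)
```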

Querying, particularly text matching and case-insensitive search
  Problems with using the SPARQL regex filter, the only mechanism for case-insensitive search in SPARQL
  Tried with OpenLink Virtuoso: still ~10 seconds for a case-insensitive search
  Any suggestions?

Further lessons

SPARQL results streaming
  Resolves out-of-memory errors for large datasets
  Joseki / SDB / Postgres can be made to stream results, but using just a single JDBC connection, causing performance problems with concurrent requests
  Therefore, SPARQLite
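Why streaming avoids out-of-memory errors can be shown with a small generator that writes a SPARQL JSON result document one binding at a time, so the server never holds the full result set. This is a sketch of the general technique in Python rather than SPARQLite's Java implementation; the function name is illustrative and every value is treated as a plain literal for simplicity.

```python
import json


def stream_sparql_json(variables, rows):
    """Yield a SPARQL JSON results document in chunks, one row at a time."""
    yield '{"head": {"vars": %s}, "results": {"bindings": [' % json.dumps(variables)
    first = True
    for row in rows:  # rows may be a lazy iterator over a database cursor
        binding = {
            var: {"type": "literal", "value": value} for var, value in row.items()
        }
        # Separate bindings with commas without buffering the previous ones.
        yield ("" if first else ",") + json.dumps(binding)
        first = False
    yield "]}}"
```

Each chunk can be written to the HTTP response as soon as it is produced, so memory use stays proportional to a single row rather than to the whole result.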

The openness of SPARQL:
  SPARQL is an inherently open query language and protocol
  Open endpoints are vulnerable to simple queries that can overload the service, exposing them to denial-of-service-style attacks (whether intended or not)
  Futures: API key mechanism? Restricted SPARQL profiles?

Future directions

Adding new data sources:
  FlyAtlas tissue-specific Drosophila gene expression levels
  More information from FlyBase, e.g. references

More applications:
  Find all the gene expression images of a gene's neighbours
  Find all the genes related to "blood pressure"
  ...

Linked data (dereferenceable, follow-your-nose)
  We're thinking about this, but our application does not currently need it

How to control and predict quality of service for open SPARQL endpoints

Acknowledgement

Alistair Miles, Graham Klyne and David Shotton

Dr Helen White-Cooper and her research group

BBSRC, for funding the building of the FlyTED database

BDGP and FlyBase for making the data available

JISC, for funding the FlyWeb project

The Jena team, esp. Andy Seaborne

Date post:	19-Dec-2015