Date posted: 19-Dec-2015
FlyWeb: the way to go for biological data integration
Jun Zhao, Alistair Miles and Graham Klyne
Image Bioinformatics Research Group
Department of Zoology, University of Oxford
FlyWeb Application
To answer questions such as “What does this gene do?”
  Gene expression images
  Sequence and ESTs (expressed sequence tags) of the gene
  Publications about the gene
  ...
A first example of the Image Web that our group is developing
Investigate the feasibility of existing Semantic Web tools and technologies for real applications
Gene expression images
Reveal gene expression patterns at different developmental stages
Important for identifying genes of interest and for verifying a picture of probable gene functions
FlyWeb demonstration
http://openflydata.org/flyui/build/apps/imagemashup2/ Run application: [go]
Two examples:
  Single gene query (aos1)
  Use gene synonyms to enhance gene matching (rbf)
More than one synonym of gene “rbf”
How does it work?
Data from 3 independent sources:
  www.flybase.org – model organism reference database, gene names and identifiers
  www.fruitfly.org (BDGP) – embryo in situ images
  www.fly-ted.org – testis in situ images
All data accessed via SPARQL
Pure Ajax user application
Essentially, a mashup using a SPARQL API
The client-side FlyUI:
  A library of JavaScript widgets as front ends to SPARQL data sources
  Built on the Yahoo User Interface (YUI) library
Widgets are composed in a browser to create the complete application
Each widget provides:
  A Service that implements SPARQL queries
  A Model encapsulating SPARQL query results
  A Renderer
The in situ search application
GeneFinderWidget
FlyTED ImageWidget
BDGP ImageWidget
Gene name mapping
FlyTED and BDGP use different gene names
  FlyTED data derived from spreadsheets with an imperfectly controlled gene name vocabulary
  BDGP's data are annotated using FlyBase's unique FBgn numbers
Use FlyBase for automatic gene mapping
Additional input from scientists for disambiguating many-to-many mappings
Mappings are stored as a JSON file to assist the “GeneFinder” widget (no need for RDF/OWL reasoning at this stage)
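A minimal sketch of how such a JSON mapping could be consulted. The file layout, the placeholder FBgn values and the `load_mapping` / `resolve` helpers are all assumptions for illustration; the slides do not show the actual mapping format. A name that maps to more than one FlyBase identifier is flagged as ambiguous, which is where a scientist's input would be needed.

```python
import json

# Hypothetical mapping content: each FlyTED gene name maps to one or more
# FlyBase identifiers. The FBgn values below are made-up placeholders.
MAPPING_JSON = """
{
  "aos1": ["FBgn0000001"],
  "rbf":  ["FBgn0000002", "FBgn0000003"]
}
"""

def load_mapping(text):
    """Parse the JSON mapping, keyed by lower-cased gene name."""
    raw = json.loads(text)
    return {name.lower(): ids for name, ids in raw.items()}

def resolve(mapping, name):
    """Return (ids, ambiguous) for a user-supplied gene name."""
    ids = mapping.get(name.strip().lower(), [])
    return ids, len(ids) > 1

mapping = load_mapping(MAPPING_JSON)
```

For example, `resolve(mapping, "rbf")` returns both placeholder identifiers together with an ambiguity flag of `True`.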
SPARQL queries
Free-text matching
Case-insensitive searching
Very important for our users
Too expensive using a SPARQL FILTER
Pre-generate lower-case gene names and load them into the FlyBase RDF DB
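The pre-generation step could be sketched as below. This is an illustration only: the gene URI and names are hypothetical, the `fbutil:anyName` property and the `xs:string` datatype are taken from the query on this slide, and prefix declarations are assumed to be written elsewhere in the output file.

```python
def n3_literal(text):
    """Escape a string for use as a double-quoted N3 literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def any_name_triples(gene_uri, names):
    """Yield one fbutil:anyName triple per distinct lower-cased name."""
    seen = set()
    for name in names:
        lowered = name.lower()
        if lowered not in seen:
            seen.add(lowered)
            yield "<%s> fbutil:anyName %s^^xs:string ." % (
                gene_uri, n3_literal(lowered))

# Hypothetical gene URI and names, for illustration only.
lines = list(any_name_triples(
    "http://example.org/gene/1", ["Rbf", "RBF", "Example Gene"]))
```

With the lower-cased names loaded, a case-insensitive search becomes an exact match on the lower-cased user input, with no FILTER needed.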
SELECT * WHERE {
  ?gene fbutil:anyName "userInput"^^xs:string ;
        a chado:Feature ;
        chado:name ?symbol ;
        chado:uniquename ?flybaseID .
  OPTIONAL { ?gene chado:dbxref [ chado:accession ?annotationSymbol ] . }
  OPTIONAL { ?gene chado:synonym [ chado:name ?synonym ] . }
  OPTIONAL { ?gene chado:synonym [ a syntype:FullName ; chado:name ?fullName ] . }
}
SELECT DISTINCT * WHERE {
  ?fullImageURL flyted:associatesToGene <http://openflydata.org/id/flyted/gene-geneName> ;
                flyted:associatesToGene ?gene ;
                flyted:thumbnail ?thumbnailURL ;
                rdfs:seeAlso ?flytedURL ;
                rdfs:label ?caption .
}
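A sketch of how the gene lookup query might be assembled from user input at query time. The actual FlyUI code is JavaScript and is not shown on the slides, so the `build_gene_query` helper is hypothetical. It lower-cases the input to match the pre-generated names and escapes quotes and backslashes so the input cannot break out of the string literal. Prefix declarations are omitted, as they are on the slide.

```python
# Shortened version of the gene lookup query, with a slot for the literal.
GENE_QUERY = """SELECT * WHERE {
  ?gene fbutil:anyName %s^^xs:string ;
        a chado:Feature ;
        chado:name ?symbol ;
        chado:uniquename ?flybaseID .
}"""

def build_gene_query(user_input):
    """Lower-case and escape the user input, then fill in the query."""
    lowered = user_input.strip().lower()
    escaped = lowered.replace("\\", "\\\\").replace('"', '\\"')
    return GENE_QUERY % ('"' + escaped + '"')

query = build_gene_query("  RBF ")
```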
The RDF data sources
FlyBase and BDGP: relational databases
FlyTED: an image repository built using EPrints
FlyAtlas (forthcoming): tissue-specific Drosophila gene expression levels, as a single spreadsheet
Creating RDF from data sources
D2RQ mapping
  FlyBase and BDGP, native relational databases
  Conservative mapping, with minimum interpretation
OAI2SPARQL
  Harvesting N3 RDF metadata via the OAI-PMH protocol, with built-in support from EPrints
  Further details in the ESWC2008 paper
Custom Python program
  FlyAtlas
  Generating N3 from a spreadsheet table
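The spreadsheet-to-N3 step might look roughly like this. It is not the project's actual program: the `http://example.org/flyatlas/` namespace, the column names and the sample rows are invented for illustration, and the sketch assumes a well-formed CSV export with one gene per row.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/flyatlas/"  # hypothetical namespace

def n3_string(value):
    """Escape a cell value as a double-quoted N3 literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

def rows_to_n3(csv_text, key_column="gene"):
    """Emit one triple per non-empty cell, with one subject per row."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = "<%sgene/%s>" % (BASE, quote(row[key_column], safe=""))
        for column, value in row.items():
            if column == key_column or not value:
                continue  # skip the key itself and empty cells
            predicate = "<%sschema/%s>" % (BASE, quote(column, safe=""))
            lines.append("%s %s %s ." % (subject, predicate, n3_string(value)))
    return "\n".join(lines)

# Made-up sample data standing in for the spreadsheet export.
SAMPLE = "gene,tissue,level\ngene-1,testis,12.5\ngene-2,ovary,\n"
n3 = rows_to_n3(SAMPLE)
```

Each column becomes a predicate and each cell a plain literal, in keeping with the conservative, minimum-interpretation mapping used for the other sources.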
More about the data sources
Bulk download
  http://openflydata.org/dump/flybase, ~8M triples
  http://openflydata.org/dump/bdgp, ~1M triples
  http://openflydata.org/dump/flyted, ~30,000 triples
SPARQL endpoint
  http://openflydata.org/query/flybase
  http://openflydata.org/query/bdgp
  http://openflydata.org/query/flyted
Schema
  http://purl.org/net/chado/schema/
  http://purl.org/net/flybase/synonym-types/
  http://purl.org/net/bdgp/schema/
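A sketch of how a client could address one of the SPARQL endpoints listed above. It follows the standard SPARQL protocol (the query is sent in the `query` parameter of a GET request) and the standard SPARQL JSON results layout. Whether the endpoints are still online, and exactly what they accept, is not something the slides confirm, so the sketch only builds the URL and parses a sample response. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

def sparql_get_url(endpoint, query):
    """Build a SPARQL protocol GET URL with the query as a parameter."""
    return endpoint + "?" + urlencode({"query": query})

def bindings(results_json):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(results_json)
    return [{var: cell["value"] for var, cell in row.items()}
            for row in doc["results"]["bindings"]]

url = sparql_get_url("http://openflydata.org/query/flybase",
                     "ASK { ?s ?p ?o }")

# Hand-written sample response in the SPARQL JSON results format.
SAMPLE = ('{"head":{"vars":["symbol"]},"results":{"bindings":'
          '[{"symbol":{"type":"literal","value":"Rbf"}}]}}')
rows = bindings(SAMPLE)
```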
SPARQL server
Amazon EC2 (Elastic Compute Cloud):
  To run SPARQL endpoints
  To host the demo you've just seen
Jena TDB as triple store
  For better loading performance: ~6K triples per second (tps) for ~9M triples to Amazon Elastic Block Storage (EBS)
  For better querying performance
SPARQLite, a home-grown SPARQL protocol implementation
  More later
Apache, Tomcat, mod_jk, etc.
SPARQLite protocol
http://sparqlite.googlecode.com
  Also a platform for exploring SPARQL service quality concerns (more later)
Motivation
  Enable streaming
  Create a database connection pool
Designed for Jena TDB/SDB + Postgres
Restricted forms of query (SELECT, ASK)
Restricted query result format (e.g. only JSON)
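One way a service could enforce the restriction to SELECT and ASK is sketched below. This is not SPARQLite's implementation, which the slides do not show: the `query_form` and `is_allowed` helpers are hypothetical, and the check is deliberately naive (it skips BASE and PREFIX declarations but does not handle comments).

```python
import re

ALLOWED_FORMS = {"SELECT", "ASK"}

# Matches one leading BASE <...> or PREFIX p: <...> declaration.
PROLOGUE = re.compile(
    r"\s*(BASE\s*<[^>]*>|PREFIX\s+[^\s:]*:\s*<[^>]*>)", re.IGNORECASE)

def query_form(query):
    """Return the upper-cased query form keyword, or None if absent."""
    pos = 0
    while True:
        match = PROLOGUE.match(query, pos)
        if not match:
            break
        pos = match.end()  # step past each prologue declaration
    match = re.match(r"\s*([A-Za-z]+)", query[pos:])
    return match.group(1).upper() if match else None

def is_allowed(query):
    """Accept only the query forms the service is willing to run."""
    return query_form(query) in ALLOWED_FORMS
```

A production service would more likely reject disallowed forms using a real SPARQL parser; the regular expressions here only illustrate the idea.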
Lessons
RDF provides a uniform and flexible data model
  RDF dump is cheaper and quicker
  Maintaining a separate SPARQL endpoint for each data source makes handling data updates easier than with a data warehouse approach
RDF facilitates data re-use and re-purposing
SPARQL raises the point of departure for an application
Benefits for the future
  Linking to other data sources
  Querying genes using the Fly Anatomy ontology
  Magic of inference
Performance
Loading:
  Our datasets: ~10 million triples
  Jena / RDB / Postgres: OK with <1M triples
  Jena / SDB / Postgres: better, but problems with load performance with larger datasets
  Jena / TDB gives much better load performance (~6K tps), even on a 32-bit system with Amazon EBS storage (but not so good with local EC2 store)
  Virtuoso performs reasonably well
Querying, particularly text matching and case insensitive search
Problems with using SPARQL regex filter, the only mechanism for case-insensitive search in SPARQL
Tried with OpenLink Virtuoso, still ~10 seconds for a case-insensitive search
Any suggestions?
Further lessons
SPARQL results streaming
  Resolves out-of-memory errors for large datasets
  Joseki / SDB / Postgres can be made to stream results, but using just a single JDBC connection, causing performance problems with concurrent requests
  Therefore, SPARQLite
The openness of SPARQL:
  SPARQL is an inherently open query language and protocol
  Open endpoints are vulnerable to simple queries that can overload the service, exposing them to denial-of-service style attacks (whether intended or not)
  Futures: API key mechanism? Restricted SPARQL profiles?
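The results-streaming point can be illustrated with a generator that writes a SPARQL JSON results document one row at a time, so the server never holds the full result set in memory. This is only a sketch of the idea, not SPARQLite's code: `stream_results` is a hypothetical helper, and it treats every value as a plain literal for simplicity.

```python
import json

def stream_results(variables, rows):
    """Yield a SPARQL JSON results document in chunks, one row per chunk."""
    yield '{"head":{"vars":%s},"results":{"bindings":[' % json.dumps(variables)
    first = True
    for row in rows:
        binding = {var: {"type": "literal", "value": value}
                   for var, value in row.items()}
        # Put a comma before every row except the first.
        yield ("" if first else ",") + json.dumps(binding)
        first = False
    yield "]}}"

# Rows would normally come lazily from the triple store; an iterator over
# a small hand-written list stands in for that here.
chunks = stream_results(["symbol"], iter([{"symbol": "Rbf"}, {"symbol": "aos1"}]))
document = "".join(chunks)
```

In a real service each chunk would be written to the HTTP response as it is produced rather than joined into one string.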
Future directions
Adding new data sources:
  FlyAtlas tissue-specific Drosophila gene expression levels
  More information from FlyBase, e.g. references
More applications:
  Find all the gene expression images of a gene's neighbours
  Find all the genes related to “blood pressure”
  ...
Linked data (dereferenceable, follow-your-nose)
  We're thinking about this, but our application does not currently need it
How to control and predict quality of service for open SPARQL endpoints
Acknowledgement
Alistair Miles, Graham Klyne and David Shotton
Dr Helen White-Cooper and her research group
BBSRC, for funding the building of the FlyTED database
BDGP and FlyBase for making the data available
JISC, for funding the FlyWeb project
The Jena team, esp. Andy Seaborne