John Deck, University of California, Berkeley
Brian Stucky, University of Colorado, Boulder
Lukasz Ziemba, University of Florida, Gainesville
Nico Cellinese, University of Florida, Gainesville
Rob Guralnick, University of Colorado, Boulder
BiSciCol Team: Reed Beaman, Nico Cellinese, Jonathan Coddington, Neil Davies,
John Deck, Rob Guralnick, Bryan P. Heidorn, Chris Meyer, Tom Orrell, Rich Pyle,
Kate Rachwal, Brian Stucky, Rob Whitton, Lukasz Ziemba
BiSciCol: Tracking Biodiversity Objects to Brokering Standards
“Or, Gustav’s Big Problem”
Biological Science Collections Tracker
working towards building an infrastructure designed to tag and track scientific collections and all of their
derivatives.
National Science Foundation funded 2010 – 2014
Partners are University of Florida at Gainesville, University of Colorado at Boulder, Bishop Museum, University of California at Berkeley, Smithsonian Institution, University of Arizona at Tucson
Relies on globally unique identifiers (GUIDs) to track objects
Implements a Linked Data approach
Provides support for the Global Names Architecture
Why? Here is Gustav’s Problem….
(Prefers to collect stuff)
Lots of Data ….
Generates …
Due to project requirements and integration needs, Gustav is left navigating a plethora of redundant and disconnected distributed databases. Tracking objects and their derivatives takes a lot of effort.
Taxonomic Type Filter
Class Filter
X
X
Specimens
Tissues
Sequences
Functions
[X] Infer relationships across providers
A Biological Relationship Graph …
Moorea Biocode Example: Tracking biological material from field collection through analysis, across multiple systems
(Biocode Event)
(Essig Museum Specimen)
(Smithsonian Tissue)
(CAMERA Gut Sample Event)
(Genbank Sequence)
(metagenomic Sequencing)
Key: Blast*n, Taxon*n
Tracking Biological Object Relationships
Group like terms into classes. In Darwin Core, for example, we have the following “groups of terms”: Events, Locations, Occurrences, GeologicalContext, Identification, Taxon.
Assign identifiers. Use globally unique, resolvable, persistent identifiers for each class or term.
Link identifiers using relationship terms. For example, “This object is related to that object.”
Put this data on the Web.
Related Projects that are Grouping Like terms into Classes
Darwin-SW (http://code.google.com/p/darwin-sw/) Building an ontology of Darwin Core Terms to make it possible to describe biodiversity resources on the web.
Gene Ontology (http://www.geneontology.org/) Standardizing the representation of gene and gene product attributes across species and databases.
ENVO (http://environmentontology.org/) Annotating the environment for any biological sample.
OBO Foundry (http://www.obofoundry.org/) A suite of orthogonal interoperable reference ontologies in the biomedical domain.
Creating Globally Unique Identifiers (GUIDs)
Globally unique (mandatory)
Persistent (not mandatory, but very helpful)
Resolvable (not mandatory, but very helpful)
Resolution/Domain + Identifier
Examples:
Identifiers:
JDeckSpecimen1 (a named identifier)
+1-541-914-4739 (unique, at least for phones)
7217D220-836A-11DF-8395-0800200C9A66 (opaque)
Resolution domains:
http://mycollection.org/specimen/
http://example.org/urn:lsid:example.org:specimen/
Resolution domain + identifier:
http://mycollection.org/specimen/JDeckSpecimen1
http://mycollection.org/specimen/uuid=7217D220-836A-11DF-8395-0800200C9A66
http://example.org/urn:lsid:example.org:specimen/7217D220-836A-11DF-8395-0800200C9A66
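The "resolution domain + identifier" pattern can be sketched in a few lines of Python. This is only an illustration: `make_guid` is a hypothetical helper, not part of BiSciCol, and the opaque identifier here is a random (version 4) UUID chosen for convenience rather than the kind of UUID shown on the slide.

```python
import uuid


def make_guid(resolution_domain: str, local_id: str) -> str:
    """Join a resolution domain and a local identifier into one resolvable GUID."""
    # Normalise so that exactly one "/" separates the two parts.
    return resolution_domain.rstrip("/") + "/" + local_id.lstrip("/")


# A named identifier, using the domain and name from the slide.
named = make_guid("http://mycollection.org/specimen/", "JDeckSpecimen1")

# An opaque identifier: a random UUID, upper-cased to match the slide's style.
opaque_local = str(uuid.uuid4()).upper()
opaque = make_guid("http://mycollection.org/specimen/", opaque_local)

print(named)
print(opaque)
```

The named form is easier for people to read, while the opaque form avoids baking meaning into an identifier that is supposed to stay stable.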
Linking Identifiers Using Relationship Terms
An RDF Statement: Subject, Predicate, Object
relatedTo (Transitive):
GUID1 relatedTo GUID2 relatedTo GUID3
GUID1 <-> GUID2
GUID2 <-> GUID3
GUID1 <-> GUID3
OR
GUID1 Predicate GUID2
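As a rough illustration of what a transitive relatedTo implies, the sketch below stores statements as plain (subject, predicate, object) tuples and computes every pair that follows from them. Treating relatedTo as symmetric as well as transitive is an assumption based on the two-way arrows; `related_closure` is a hypothetical helper, not a BiSciCol function.

```python
from itertools import product

RELATED_TO = "relatedTo"


def related_closure(statements):
    """Return all (a, b) pairs implied by a symmetric, transitive relatedTo."""
    # Start from the explicit pairs, recorded in both directions.
    pairs = set()
    for subject, predicate, obj in statements:
        if predicate == RELATED_TO:
            pairs.add((subject, obj))
            pairs.add((obj, subject))
    # Keep chaining pairs (a, b) and (b, d) into (a, d) until nothing new appears.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(pairs), repeat=2):
            if b == c and a != d and (a, d) not in pairs:
                pairs.add((a, d))
                changed = True
    return pairs


statements = {
    ("GUID1", RELATED_TO, "GUID2"),
    ("GUID2", RELATED_TO, "GUID3"),
}
closure = related_closure(statements)
print(("GUID1", "GUID3") in closure)
```

Only two statements are stored, yet GUID1 and GUID3 end up related, which is the point of declaring the predicate transitive.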
A Simple BiSciCol Graph
(graph = set of RDF statements):
GUID1 relatedTo GUID2 relatedTo GUID3
Each GUID has a type (a) and a Date.
Types: Event, Tissue, Specimen
Dates: “2011-06-20”, “2011-05-01”, “2011-06-01”
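A graph of this kind can be held as a plain set of (subject, predicate, object) tuples. The sketch below is illustrative only: which type and date belong to which GUID is an assumption made here for the example, and `objects` is a hypothetical helper.

```python
# The graph is just a set of (subject, predicate, object) statements.
# ASSUMPTION: the pairing of types and dates with GUIDs is invented for illustration.
graph = {
    ("GUID1", "a", "Event"),
    ("GUID1", "Date", "2011-05-01"),
    ("GUID2", "a", "Specimen"),
    ("GUID2", "Date", "2011-06-01"),
    ("GUID3", "a", "Tissue"),
    ("GUID3", "Date", "2011-06-20"),
    ("GUID1", "relatedTo", "GUID2"),
    ("GUID2", "relatedTo", "GUID3"),
}


def objects(graph, subject, predicate):
    """Return the sorted objects of all statements matching subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


for guid in ("GUID1", "GUID2", "GUID3"):
    print(guid, objects(graph, guid, "a"), objects(graph, guid, "Date"),
          objects(graph, guid, "relatedTo"))
```

Because every fact is one statement, adding a new property or link never requires changing a schema; it is just one more tuple in the set.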
Getting the most out of your data:Inferring Object Relationships
Facebook Inferencing: “Let us sell you, to others (or vice-versa)”
BiSciCol Inferencing: “What relationships exist that haven’t been explicitly expressed”
Nodes: Location1 (Essig Museum), Organism1 (Essig Museum), Organism2 (Smithsonian), Tissue1 (Essig Museum), Tissue2 (Smithsonian), Georeference1 (BioGeomancer)
Links: relatedTo, sameAs, hasSpatialThing, inferred
Georeference: 48.198,16.371;crs=wgs84;u=40
Even though Tissue2 is not directly related to Location1, we can still infer its relationship through Organism1 and Organism2 being the same as each other.
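The inference described here can be sketched as a graph walk that follows both relatedTo and sameAs links. The exact statements below are an assumption reconstructed from the description (the original diagram is not fully recoverable), and `infer_related` is a hypothetical helper rather than BiSciCol code.

```python
from collections import defaultdict, deque

# ASSUMPTION: these statements are reconstructed from the description for illustration.
statements = [
    ("Tissue1", "relatedTo", "Organism1"),
    ("Organism1", "relatedTo", "Location1"),
    ("Tissue2", "relatedTo", "Organism2"),
    ("Organism1", "sameAs", "Organism2"),
]


def infer_related(statements, start):
    """Return every node reachable from start via relatedTo or sameAs links."""
    # Both predicates are followed in either direction (an assumption of this sketch).
    neighbours = defaultdict(set)
    for subject, predicate, obj in statements:
        if predicate in ("relatedTo", "sameAs"):
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)
    # Breadth-first walk outward from the starting node.
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


explicit = {o for s, p, o in statements if s == "Tissue2" and p == "relatedTo"}
inferred = infer_related(statements, "Tissue2") - explicit
print(sorted(inferred))
```

Location1 appears in the inferred set for Tissue2 only because the sameAs statement bridges the two institutions' records; removing that one statement removes the inference.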
Tissue1 (Essig Museum): inferred
Tissue2 (Smithsonian): inferred
Inferred Relationship Chains
Update Mechanisms
Gustav’s Watchlist:
GP12345-3939-33939 (Occurrence)
BE99999-3939-3dd39 (Event)
GP12346-3939-33II3 (Occurrence)
GP12dd6-3939-3xxxI (Tissue)
GP9999-xkx9d-dkdkd (Occurrence)
…
BiSciCol API (search on date and return graph of object)
Search Descendants (by recent modification)
Updates
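A watchlist update check of this kind might look roughly like the sketch below: for each watched identifier, walk its descendants and keep those modified since a given date. The identifiers, links, and timestamps are invented for illustration, and `descendants` and `updates` are hypothetical helpers, not the BiSciCol API.

```python
from datetime import datetime

# Hypothetical parent -> children links and last-modified times (illustrative data only).
children = {
    "occurrence-A": ["tissue-B"],
    "tissue-B": ["sequence-C"],
    "event-D": ["occurrence-E"],
}
modified = {
    "occurrence-A": datetime(2011, 10, 1),
    "tissue-B": datetime(2011, 10, 17),
    "sequence-C": datetime(2011, 10, 18),
    "event-D": datetime(2011, 9, 1),
    "occurrence-E": datetime(2011, 9, 2),
}


def descendants(node):
    """Collect all descendants of node with an iterative depth-first walk."""
    found, stack = [], list(children.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(children.get(current, []))
    return found


def updates(watchlist, since):
    """Map each watched identifier to its descendants modified on or after `since`."""
    result = {}
    for watched in watchlist:
        recent = [d for d in descendants(watched) if modified[d] >= since]
        if recent:
            result[watched] = sorted(recent)
    return result


report = updates(["occurrence-A", "event-D"], datetime(2011, 10, 15))
print(report)
```

Watched objects with no recent activity are left out of the report, so the result stays short even for a long watchlist.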
“Triplifier” linking biological objects
MySQL
KEMU
“Triplifier”: create links from native data formats
MySQL
BiSciCol
Darwin Core Archive
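The idea of creating links from native data formats can be illustrated with a small function that turns tabular rows into triples. This is not the Triplifier itself: `triplify`, the column names, and the `http://example.org/` prefixes are all assumptions made for the sketch, and the rows stand in for the result of a database query.

```python
def triplify(rows, subject_column, object_column,
             subject_prefix, object_prefix, predicate="relatedTo"):
    """Turn each row into one (subject, predicate, object) triple."""
    triples = []
    for row in rows:
        # Skip rows where either side of the link is missing.
        if not row.get(subject_column) or not row.get(object_column):
            continue
        triples.append((
            subject_prefix + str(row[subject_column]),
            predicate,
            object_prefix + str(row[object_column]),
        ))
    return triples


# Hypothetical rows, shaped like the output of a query against a tissue table.
rows = [
    {"tissue_id": "T1", "specimen_id": "S1"},
    {"tissue_id": "T2", "specimen_id": "S1"},
    {"tissue_id": "T3", "specimen_id": None},
]
triples = triplify(rows, "tissue_id", "specimen_id",
                   "http://example.org/tissue/", "http://example.org/specimen/")
for triple in triples:
    print(triple)
```

The foreign-key column in the source table becomes an explicit link between two identifiers, which is what lets records from different systems be joined later.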
Example Taxonomic Query
Client Interface:
Search Scientific Name: Aedes increpitus (Run)
BISCICOL SERVICE LOOKUP:
dwc:IdentificationID1 :relatedTo http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:126314
dwc:IdentificationID1 :relatedTo dwc:OccurrenceID1
dwc:IdentificationID2 :relatedTo http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:126317
dwc:IdentificationID2 :relatedTo dwc:OccurrenceID3
Results:
OccurrenceID1 (Aedes increpitus Dyar, 1916)
OccurrenceID3 (Aedes vittata Theobald, 1903)
Taxon SERVICE (ITIS / GNUB)
http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:126314
http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:126317
http://gnub.org/8E19F1DC-74BA-47D4-A505-6498414B4CCE
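The query flow can be sketched in two steps: resolve the name to taxon identifiers, then follow relatedTo links from identifications to occurrences. The dictionary below is a stand-in for the taxon service, and the assumption that a search on "Aedes increpitus" returns all three identifiers is made for the example; `occurrences_for_name` is a hypothetical function, not the BiSciCol service.

```python
# Stand-in for the taxon service (ITIS / GNUB): name -> taxon identifiers.
# ASSUMPTION: that this name resolves to all three identifiers is illustrative.
TAXON_SERVICE = {
    "Aedes increpitus": [
        "http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:126314",
        "http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:126317",
        "http://gnub.org/8E19F1DC-74BA-47D4-A505-6498414B4CCE",
    ],
}

# The statements listed under the BiSciCol service lookup.
TRIPLES = [
    ("dwc:IdentificationID1", ":relatedTo",
     "http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:126314"),
    ("dwc:IdentificationID1", ":relatedTo", "dwc:OccurrenceID1"),
    ("dwc:IdentificationID2", ":relatedTo",
     "http://lsid.itis.gov/urn:lsid:itis.gov:itis_tsn:126317"),
    ("dwc:IdentificationID2", ":relatedTo", "dwc:OccurrenceID3"),
]


def occurrences_for_name(name):
    """Find occurrences whose identification is linked to a taxon ID for this name."""
    taxon_ids = set(TAXON_SERVICE.get(name, []))
    # Step 1: identifications that point at any of the taxon identifiers.
    identifications = {s for s, p, o in TRIPLES
                       if p == ":relatedTo" and o in taxon_ids}
    # Step 2: occurrences that those identifications point at.
    return sorted(o for s, p, o in TRIPLES
                  if s in identifications and p == ":relatedTo"
                  and o.startswith("dwc:Occurrence"))


results = occurrences_for_name("Aedes increpitus")
print(results)
```

Keeping name resolution in a separate service means the occurrence graph only ever stores identifiers, so a change in taxonomy does not require rewriting the links.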
Working with Locations
E.g., tracking the location in space of a moving individual (whales)
EventID1
EventID2
EventID3
IndividualID1 GeoreferenceID1
GeoreferenceID2
GeoreferenceID3
Data Impact Factor – Graph Metrics
What’s New?
Occurrence: MBIO1234 (“2011-10-18 09:10:00”)
DNA Extraction: Extrac9999 (“2011-10-18 09:00:00”)
Sequence: s1113939999 (“2011-10-18 08:00:00”)
Occurrence: MBIO1235 (“2011-10-17 00:00:00”)
Photo: P123456 (“2011-10-17 00:00:00”)
Occurrences
MBIO99999 (1024 total descendants)
IMBL8888888 (723 total descendants)
Events
Biocode10234 (4234 direct children)
Expedition21234 (1023 direct children)
Collectors
Gustav Paulay (102,000 direct children)
Christopher Meyer (83,000 direct children)
Craig Moritz (523 direct children)
Graphs
[ ] GBIF Relations Graph
[X] Moorea Biocode
[X] SI MSNGR System
[+] Add New Graph
Summary
All objects are reusable in the semantic web. We only need to express an identifier once, and then anything else can link to it (either directly or indirectly).
By using sameAs relations, it is possible to infer relations that were not explicitly expressed in the data.
Queries are easily federated: it becomes possible to create global graphs and ask questions against heterogeneous databases.
Graph-based databases can help us understand the relevance of individual objects. For example, they can indicate the number of relations a particular object has at 1st, 2nd, 3rd, or nth order.
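Counting relations by order can be sketched with a breadth-first search that records how many hops each object is from a starting object. The edges below are invented for illustration, links are treated as undirected (an assumption), and `relation_counts_by_order` is a hypothetical helper.

```python
from collections import defaultdict, deque


def relation_counts_by_order(edges, start):
    """Count how many objects sit at each hop distance (order) from start."""
    # Treat every link as undirected (an assumption of this sketch).
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Breadth-first search gives the shortest hop distance to each object.
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    # Tally objects per distance, leaving out the starting object itself.
    counts = defaultdict(int)
    for hops in distance.values():
        if hops > 0:
            counts[hops] += 1
    return dict(sorted(counts.items()))


# Hypothetical links between objects (illustrative data only).
edges = [
    ("Event1", "Specimen1"),
    ("Event1", "Specimen2"),
    ("Specimen1", "Tissue1"),
    ("Tissue1", "Sequence1"),
]
counts = relation_counts_by_order(edges, "Event1")
print(counts)
```

An object with many relations at low orders is one that much other data depends on, which is the intuition behind using these counts as a relevance measure.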