Date post: | 11-May-2015 |
Category: | Technology |
Upload: | dag-endresen |
Virtual Biodiversity Research and Access Network for Taxonomy (ViBRANT)
Knowledge Organization System for GBIF
Dag Endresen, Knowledge Systems Engineer
Éamonn Ó Tuama, Senior Programme Officer, Inventory, Discovery, Access (IDA)
Global Biodiversity Information Facility (GBIF)
31 August 2012
Enabling interoperability for the GBIF network and beyond
“The ability of two or more systems or components to exchange information and to use the information that has been exchanged” (ref: IEEE Standard Computer Dictionary: Compilation of IEEE Standard Computer Glossaries, ISBN:155937079)
Key requirement: Common exchange standards and protocols for biodiversity data …
necessitate agreement on use of common vocabularies for the classes of objects and their properties
Knowledge Organisation Systems (KOS) - can help us manage our vocabularies
GEO BON, IPBES
Knowledge Organisation Systems
- Term lists: glossaries, dictionaries, gazetteers
- Classifications / categorisations: taxonomies
- Relationships: thesauri, ontologies
... to manage the vocabularies used for sharing biodiversity information.
Spectrum: from simple relationships to a model of a domain
e.g., Dewey Decimal Classification
taxonRank
higherClassification
taxonConceptID collectionCode
geodeticDatum specificEpithet
coordinatePosition
Darwin Core – a glossary of terms
collectionCode: The name, acronym, coden, or initialism identifying the collection or data set from which the record was derived. Examples: "Mammals", "Hildebrandt", "eBird".
AgroVoc vocabulary – a thesaurus
bt = broader term, nt = narrower term, uf = used for, rt = related term
bt Resources
nt Natural resources
nt Biological resources
nt Genetic resources
nt Germplasm
uf Genetic material
uf Germplasm resources
rt Protoplasm
rt Genes
rt Gene pools
rt Biodiversity
rt Germplasm collections
rt Gametes
http://aims.fao.org/standards/agrovoc/functionalities/hierarchy
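The bt/nt relations of a thesaurus can be sketched as a small lookup structure. The following is a minimal illustration only, assuming that the nt entries on the slide form a single chain from "Resources" down to "Germplasm" (the flattened slide text suggests this but does not confirm it). The function names are hypothetical.

```python
# Broader-term (bt) links; narrower-term (nt) links are derived as the inverse.
# Assumption: the slide's nt entries form one chain under "Resources".
BROADER = {
    "Natural resources": "Resources",
    "Biological resources": "Natural resources",
    "Genetic resources": "Biological resources",
    "Germplasm": "Genetic resources",
}


def broader_path(term):
    """Return the terms from `term` up to the top of the hierarchy."""
    path = [term]
    while path[-1] in BROADER:
        path.append(BROADER[path[-1]])
    return path


def narrower_terms(term):
    """Return the direct narrower terms of `term` (the inverse of bt)."""
    return sorted(child for child, parent in BROADER.items() if parent == term)
```

Storing only the bt links and deriving nt keeps the two directions consistent by construction.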
Ontology – a model of a domain
ontologies = computable dictionaries
sameAs: William Jefferson Clinton = Bill Clinton
differentFrom: NHM, Los Angeles County ≠ NHM, London
inverseOf: collectors take samples / samples are taken by collectors
transitiveProperty: hasAncestor (A hasAncestor B, B hasAncestor C, therefore A hasAncestor C)
Clinton image source: http://www.whitehouse.gov/sites/default/files/first-family/masthead_image/42bc_header_sm.jpg?1250887359
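What "computable" means for properties such as transitiveProperty and inverseOf can be illustrated with plain Python sets. This is a sketch of the inference idea only, not how an OWL reasoner is implemented; the function and variable names are made up for the illustration.

```python
def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are present, until nothing changes."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def inverse(pairs):
    """Swap subject and object, as an inverseOf declaration would imply."""
    return {(obj, subj) for subj, obj in pairs}


# Stated facts from the slide.
has_ancestor = {("A", "B"), ("B", "C")}
takes = {("collector", "sample")}

inferred_ancestors = transitive_closure(has_ancestor)  # includes ("A", "C")
is_taken_by = inverse(takes)                           # {("sample", "collector")}
```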
Term versus Concept
Dextre Clarke, S.G. and L. Zeng (2012). From ISO 2788 to ISO 25964: the evolution of thesaurus Standards towards Interoperability and data modeling. ISQ Information Standards Quarterly 24(1): 20-26.
“The SKOS (simple knowledge organization system) format is designed to present KOS data in a format that is suitable for machine inferencing and particularly for use in the Semantic Web (….) The model [ISO 25964] is based on the understanding that thesauri show the relationships between concepts – units of thought – and distinguishes these from the terms that are used to label these concepts. These terms may be in one or more languages, and one term per language is chosen as a preferred term for each concept. One or more additional terms for the same concept may be recorded in the thesaurus as non-preferred terms.” Will, L. (2012). The ISO 25964 Data Model for the Structure of an Information Retrieval Thesaurus. Bulletin of the American Society for Information Science and Technology 38(4): 48-51.
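The distinction between a concept and the terms that label it can be sketched as a small data structure: one preferred term per language, plus any number of non-preferred terms. This is an illustration of the idea in the quotation, not the ISO 25964 or SKOS data model itself. The class name, the example URI and the choice of labels (borrowed from the thesaurus slide) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A unit of thought, identified by a URI and labelled by terms."""
    uri: str
    preferred: dict = field(default_factory=dict)      # language -> one preferred term
    non_preferred: dict = field(default_factory=dict)  # language -> list of other terms

    def set_preferred(self, lang, term):
        """Make `term` the single preferred term for `lang`; demote any previous one."""
        previous = self.preferred.get(lang)
        if previous is not None and previous != term:
            self.non_preferred.setdefault(lang, []).append(previous)
        others = self.non_preferred.get(lang, [])
        if term in others:
            others.remove(term)
        self.preferred[lang] = term

    def add_non_preferred(self, lang, term):
        """Record an additional term for the same concept."""
        if self.preferred.get(lang) != term and term not in self.non_preferred.get(lang, []):
            self.non_preferred.setdefault(lang, []).append(term)

    def terms(self, lang):
        """All terms for `lang`, preferred term first."""
        head = [self.preferred[lang]] if lang in self.preferred else []
        return head + list(self.non_preferred.get(lang, []))


germplasm = Concept("http://example.org/concept/germplasm")  # hypothetical URI
germplasm.set_preferred("en", "Germplasm")
germplasm.add_non_preferred("en", "Genetic material")
germplasm.set_preferred("es", "Germoplasma")
```

The concept keeps its identity (the URI) when labels change, which is what makes translations and synonyms manageable.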
- New dedicated position at GBIF funded through external projects (ViBRANT, i4Life)
Knowledge Organisation Systems
- Review GBIF Vocabularies Service and develop vocabulary management system
- Engage with wider community:
  - participation in Dublin Core workshop, Sept 2011
  - KOS symposium at TDWG 2011 Conf, Oct 2011
  - TDWG Vocabulary Management Task Group, 2012
- Review recommendations in KOS task group report and develop implementation roadmap
KOS activities in GBIF work programme
Key requirement: a platform to support the development, maintenance and governance of vocabularies for the biodiversity community
ViBRANT: Task 4.1 Ontology platform (GBIF, JKI) Description of work:
• “[F]lexible, user-friendly ontology management environment, enabling users to create, define, extend and share their own terms and concepts where needed, providing options for discussions and annotation, while supporting re-use of terms from standardized ontologies wherever possible”.
• Extend the functionalities of existing vocabulary services (like GBIF).
• Collaborative community interface for users and user-networks, bottom-up, user-friendly and non-technical.
• Flexibility for biologists to express their knowledge regardless of whether the terminology has been standardized yet or not.
Text from the ViBRANT project summary, page 13 (my highlighting).
ViBRANT WP4: GBIF tasks and deliverables
Deliverable 4.2: Ontology tools:
• “Develop the GBIF ontology tool and produce an equivalent tool based on a semantic wiki. Deliver a single user interface for ontology creation and editing based on user-acceptance of the alternative technologies.”
Text from the ViBRANT project summary, page 14 (my highlighting).
http://community.gbif.org/pg/groups/21382/
Governance structure (TDWG VoMaG)
• Maximize the reuse of terms, focus on the definition and labels for basic terms.
• Low threshold for non-technical biologists and biodiversity domain experts to access terms and contribute (compared to richer ontologies).
• Preferred technology: RDF (resource description framework) and SKOS (simple knowledge organization system).
• Construction and maintenance of OWL ontologies are demanding with respect to expertise, effort and costs.
• Maintaining SKOS vocabularies is less demanding.
• RDF resources are designed to be easily extended.
• Ontologies (OWL) can be based on (extend) terms declared by an RDF/SKOS vocabulary.
• SKOS became a W3C recommendation in 2009.
Why use a flat vocabulary?
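To give a feel for how little is needed to declare a flat RDF/SKOS vocabulary, here is a minimal sketch that writes concepts as Turtle text. It uses plain string formatting rather than an RDF library, and handles only simple cases. The scheme URI is a placeholder, and the function names are hypothetical. The term URI and definition are those of the Darwin Core term collectionCode.

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def turtle_literal(text, lang="en"):
    """Quote a string as a language-tagged Turtle literal (basic escaping only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"@{lang}'


def skos_turtle(scheme_uri, concepts):
    """Serialise (uri, label, definition) tuples as a flat SKOS concept scheme."""
    lines = [
        f"@prefix skos: <{SKOS_NS}> .",
        "",
        f"<{scheme_uri}> a skos:ConceptScheme .",
    ]
    for uri, label, definition in concepts:
        lines += [
            "",
            f"<{uri}> a skos:Concept ;",
            f"    skos:inScheme <{scheme_uri}> ;",
            f"    skos:prefLabel {turtle_literal(label)} ;",
            f"    skos:definition {turtle_literal(definition)} .",
        ]
    return "\n".join(lines) + "\n"


document = skos_turtle(
    "http://example.org/vocabulary",  # placeholder scheme URI (assumption)
    [(
        "http://rs.tdwg.org/dwc/terms/collectionCode",
        "collectionCode",
        "The name, acronym, coden, or initialism identifying the collection "
        "or data set from which the record was derived.",
    )],
)
```

An OWL ontology can later refer to the same term URIs, which is the reuse path recommended on the slide.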
• OWL DL supports machine reasoning through machine-accessible formal semantics.
• OWL provides by default a URI as identifier for classes, properties, relations and instances.
• For example, OBO targets practical solutions in the biomedical / biology domain, while OWL is more generic and provides cross-domain interoperability.
• OWL 1.0 became a W3C recommendation in 2004, OWL 2.0 in 2009.
• http://www.w3.org/2007/OWL/
• Recommendation:
  • REUSE terms declared by flat vocabularies…
  • Start with SKOS - then explore OWL…
Why use OWL (web ontology language)?
Concept Vocabulary (rdf, skos)
Wiki Vocabulary Management
ISOcat Vocabulary Management
Excel, text, etc… Template for Vocabularies
GBIF Resources Browser
Resources Repository
1. Mint and maintain concepts and terms, in domain-expert working groups.
2. Release final version as a Concept Vocabulary.
3. REUSE terms from published concept vocabularies and ontologies when designing new DwC-A extensions & controlled value vocabularies.
4. Publish at the GBIF Resources Repository.
5. Browse at the GBIF Resources Browser.
GBIF Vocabularies
DwC-A extensions & controlled vocabularies
Evaluation of collaborative management tools http://kos.gbif.org/
proposed template processor
GBIF Vocabularies as a collaborative management tool for Darwin Core Archive extensions and controlled vocabularies.
Vocabulary management
GBIF Vocabularies
Darwin Core Archive extensions and controlled value vocabularies
GBIF Vocabularies as a collaborative management tool for Darwin Core Archive extensions and controlled value vocabularies.
Concept Vocabulary (rdf, skos)
Wiki Vocabulary Management
Resources Repository
ISOcat Vocabulary Management
MS Excel Template for Vocabularies
Evaluation of various tools for collaborative management of concept vocabularies (RDF).
DwC-A extensions & controlled vocabularies
GBIF IPT
Scratchpads
?
GBIF Vocabulary Server (Drupal)
GBIF Vocab Server is based on Drupal 6 / Scratchpads (v1) --> Drupal 7/Scratchpads2 --> Drupal 8 ?
Integration with Scratchpads2? Integration with the NPT?
Concept Vocabulary (rdf, skos)
Resources Repository
DwC-A extensions & controlled vocabularies
GBIF IPT
Scratchpads
Wiki Forum for Terms
Wiki forum for terms as an open community platform for description and maintenance of existing terms. Replacement tool also for the GBIF Vocabulary Server?
Semantic wiki forum for terms
Concept Vocabulary (rdf, skos)
Resources Repository
The GBIF Term Browser allows a user to browse for terms defined in widely used concept vocabularies such as Darwin Core, Dublin Core, FOAF, etc., including, where available, translations.
http://kos.gbif.org/termbrowser/
GBIF Term browser
Concept vocabularies stored/deposited at http://rs.gbif.org/terms/
Concept Vocabulary (rdf, skos)
Wiki tool inc. Ontology Management ??
Resources Repository (incl. ontologies?)
Ontologies (rdf, owl)
Biodiversity ontology management
Evaluation of tools for the development of biodiversity ontologies.
REUSE terms from RDF vocabularies
Evaluation of biodiversity ontology repository solutions.
BioPortal ontology repository
http://bioportal.bioontology.org/projects/168
Proposal: establish a biodiversity “slice” at the NCBO BioPortal.
• Loading biodiversity ontologies into the NCBO BioPortal promotes mapping (and reuse of terms) between bio-medical and biodiversity ontologies.
• An instance of the BioPortal software for biodiversity requires long-term obligations to host and maintain the resource – does e.g. GBIF have the resources to offer to host a BioPortal instance?
Concept vocabularies (skos:ConceptScheme, RDF)
• Darwin Core, Darwin Core “extensions”, NCD, GNA, Audubon Core (and other vocabularies of concepts).
as a basis and foundation for
Software application schema (XML, XML schema)
• Darwin Core Archive (DwC-A) extensions and controlled value vocabularies.
• Resources such as the DwC-A extensions and controlled value vocabularies REUSE terms (URI) from a vocabulary of terms.
GBIF KOS resources
Biodiversity KOS (based on Darwin Core)
Darwin Core (DwC) is a flat list of terms, expressed using RDF.
→ DwC “extensions” (flat vocabularies for declaration of concepts).
→ Reuse concepts from other vocabularies whenever possible.
Darwin Core Archive (DwC-A) has a star schema model.
• DwC-A core(s), extensions and controlled value vocabularies
  • declared as XML lists of terms.
• DwC-A resources should REUSE terms from Darwin Core and other flat concept vocabularies.
• New DwC-A core types (data types), e.g. sample? Formalize class entities (ontology). [Current types: Taxon & Occurrence]
→ Formalize a governance structure for maintaining KOS resources based on the principles established for Darwin Core (towards TDWG VoMaG).
Darwin Core Archive (DwC-A)
• DwC-A publishes DwC records including terms from DwC-A extensions.
• Simple text-based format.
• Zipped single-file archive.
Germplasm.txt
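The star schema can be sketched in a few lines: a core table of records, and extension rows that point back to a core record by its identifier. The example below is a simplified illustration only. It assumes tab-separated files with header rows and does not read the meta.xml descriptor that a real archive uses to describe its columns. The file names, the id / coreid column names and the biologicalStatus column are assumptions made for the sketch.

```python
import csv
import io
import zipfile


def read_table(archive, name, delimiter="\t"):
    """Read one delimited text file from an open zip archive into a list of dicts."""
    with archive.open(name) as handle:
        text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
        return list(csv.DictReader(text, delimiter=delimiter))


def join_star(core_rows, extension_rows, core_id="id", extension_id="coreid"):
    """Attach each extension row to the core record it points at (star schema)."""
    by_core = {}
    for row in extension_rows:
        by_core.setdefault(row[extension_id], []).append(row)
    return [
        {"core": row, "extension": by_core.get(row[core_id], [])}
        for row in core_rows
    ]


# Build a tiny archive in memory so that the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("occurrence.txt", "id\tscientificName\n1\tHordeum vulgare\n2\tAvena sativa\n")
    archive.writestr("germplasm.txt", "coreid\tbiologicalStatus\n1\tlandrace\n")

buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    joined = join_star(
        read_table(archive, "occurrence.txt"),
        read_table(archive, "germplasm.txt"),
    )
```

A core record may have zero, one or many extension rows, which is why each joined entry carries a list.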
Darwin Core Archive extension (XML term list)
http://rs.gbif.org/sandbox/extension/audubon.xml
Concept vocabulary (RDF/SKOS)
http://rs.gbif.org/terms/geotime/geotimeConcept.rdf
In progress: XSLT -> HTML for human readable version.
GBIF Vocabulary Server
The GBIF Vocabulary Server can assist a user to create and manage DwC-A extensions or controlled value vocabularies. However, it is not designed to create RDF/SKOS concept vocabulary resources with reusable concepts. It can export XML, but not RDF. It is based on Scratchpads (v1), which is built on Drupal 6.
XML export
edit interface
Global Names Architecture (GNA)
Many of the GNA term URI identifiers do not resolve (404 not found). The rowType identifiers simply resolve to the software application schema (to the DwC-A extension). We propose to formalize the GNA concept declarations using RDF/SKOS for improved re-usability of the GNA terms and concepts.
Global Names Architecture (GNA)
The Global Names Architecture (GNA) terms were originally simply declared by the DwC-A extension. We propose to formalize the GNA concept declarations using RDF/SKOS for improved re-usability of the GNA terms.
RDF/SKOS
XML
Darwin Core Archive extensions
• Global Names Architecture (GNA)
• Audubon Core (multimedia)
• Invasive species (GISIN)
• Genetic Resources (Germplasm)
• Natural Collections Description (NCD)
• Metadata profile (EML)
• EOL species profile
• Taxonomic Concept Schema (TCS)
• Genomics Standards Consortium (GSC)
• Meta-genomics (?)
• ABCD (?)
• …

• Geological time periods
  • chronostratigraphy
  • magnetostratigraphy
• Species interactions
  • saproxylic interactions
  • pollinators
• Country codes
• Language
• Basis of record
• Taxonomic rank
• Nomenclatural status
• Life form
• Life stage
• …
Controlled value vocabularies
a proposed workflow / brainstorming
Versioning resources
Move outdated vocabularies to a separate folder named “deprecated”? No versions? Will IPT be aware of this folder? Note that previous DwC-A datasets could be mapped to deprecated vocabulary resources…!
Versioning resources
Version the DwC-A vocabularies and extensions using a [_DATE] postfix. Could IPT be made aware of this postfix? Note that previous DwC-A datasets could be mapped to outdated vocabulary resources…!
Versioning RDF vocabularies
Move outdated vocabularies to a subfolder named “archive/[DATE]”? Same versioning model for extensions and vocabularies…?
Versioning RDF vocabularies
Deprecated and outdated vocabularies and DwC-A resources could declare their status, e.g. using dcterms:isReplacedBy…? Drawback: the XML document must be retrieved and parsed to read the resource status.
Versioning vocabulary resources
• Separate folder named “deprecated”?
• Postfix using [_DATE]?
• Subfolder named “archive/[DATE]”?
• dcterms:isReplacedBy
• Other ideas, solutions?
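The three path-based options can be compared side by side with a small sketch. It only computes where an outdated resource would end up under each convention. The ISO date format and the function names are assumptions, since the slides only say [DATE].

```python
from datetime import date
from pathlib import PurePosixPath


def deprecated_path(filename):
    """Option 1: move the outdated file into a folder named 'deprecated'."""
    return str(PurePosixPath("deprecated") / PurePosixPath(filename).name)


def dated_name(filename, released):
    """Option 2: keep the file in place and add a [_DATE] postfix to its name."""
    path = PurePosixPath(filename)
    return f"{path.stem}_{released.isoformat()}{path.suffix}"


def archive_path(filename, released):
    """Option 3: move the outdated file into a subfolder named archive/[DATE]."""
    return str(PurePosixPath("archive") / released.isoformat() / PurePosixPath(filename).name)


released = date(2012, 8, 31)  # example date only
options = {
    "deprecated": deprecated_path("dwc_translations.rdf"),
    "postfix": dated_name("dwc_translations.rdf", released),
    "archive": archive_path("dwc_translations.rdf", released),
}
```

Only the second and third options keep more than one old version apart. The first overwrites the previous deprecated copy unless combined with a date.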
a proposed workflow
Translation of vocabulary term descriptions
Term translations (SKOS/RDF) dwc_translations.rdf
Archive (SKOS/RDF) [DATE]/dwc_translations.rdf
Export working file format from the SKOS file (RDF/SKOS → CSV).
Expert working groups or a collaborative expert community develop new translations or refine previous translations.
Archive the translations each time the “active” SKOS file is updated.
The expert group provides their output as a CSV file, XML data or as a SKOS/RDF resource.
Translations for a given vocabulary of terms are maintained and published as a SKOS/RDF file at the GBIF Resources Repository (http://rs.gbif.org/terms/).
Example: master SKOS/RDF resource
http://rs.gbif.org/terms/dwc/dwc_translations.rdf
Languages shown in the example: en, es, zh, ja
Workflow for term translation
Term translations (SKOS/RDF) dwc_translations.rdf
Adding new term translations or updating previous term translations always starts and ends with the “active” SKOS/RDF resource for translations.
XSLT
dwc_translations_de.csv dwc_translations_es.csv dwc_translations_fr.csv dwc_translations_jp.csv dwc_translations_ru.csv dwc_translations_zh_Hans.csv …
dwc_translations_fr.csv (*) updated
XSLT
dwc_translations_de.csv dwc_translations_es.csv dwc_translations_fr.csv (*) dwc_translations_jp.csv dwc_translations_ru.csv dwc_translations_zh_Hans.csv … dwc_translations_pt.csv (**)
(*) Updated CSV files with translations simply replace the previously extracted translations in the XSLT split and merge cycle.
(**) Translations for a new language are added simply by adding the CSV resource to the XSLT cycle.
XSLT split and merge cycle
expert group
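The split and merge cycle can be sketched as two functions: one that splits the master translations into one CSV file per language, and one that merges an updated (or new) language file back. The slides propose XSLT over the SKOS/RDF file. Python and an in-memory dictionary are used here only to illustrate the cycle, and the CSV column names and function names are assumptions.

```python
import csv
import io


def split_by_language(master):
    """Split {term: {lang: text}} into {lang: CSV text}, one file per language."""
    languages = sorted({lang for labels in master.values() for lang in labels})
    files = {}
    for lang in languages:
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(["term", "translation"])
        for term in sorted(master):
            writer.writerow([term, master[term].get(lang, "")])
        files[lang] = out.getvalue()
    return files


def merge_language(master, lang, csv_text):
    """Return a copy of the master with the translations from one CSV merged in."""
    merged = {term: dict(labels) for term, labels in master.items()}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["translation"]:  # empty cells never overwrite existing text
            merged.setdefault(row["term"], {})[lang] = row["translation"]
    return merged


master = {
    "country": {"en": "Country", "fr": "Pays"},
    "locality": {"en": "Locality"},
}
files = split_by_language(master)

# (**) A new language joins the cycle by merging in one more CSV file.
with_portuguese = merge_language(master, "pt", "term,translation\ncountry,País\n")
```

Because the merge returns a new copy, the previous master can be archived unchanged before the updated one becomes the “active” resource.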
New data types?
- complement, not duplicate work
- GBIF as premier gateway to discovery, access
Genomic level observations
A roadmap developed by Q1 2013 - genomic data - ecological data
Ecological measurements associated with observations
Metadata
The GBIF metadata catalogue system allows interoperability across distributed metadata repositories http://metadata.gbif.org
Essential for discovery and access to new data types
The challenge ahead ... populating the catalogue with high quality, complete metadata
GBIF KOS work-program
Some suggested next steps
• GBIF Resources Repository (http://rs.gbif.org/)
• Further development of new DwC-A extensions and controlled value vocabularies.
• Workflow for the translation of term descriptions.
• Continue the evaluation of collaborative tools for management of flat vocabularies of terms (RDF/SKOS).
• Semantic Wiki, ISOcat, Protégé (web-protégé), …
• New semantic Wiki for description of terms / glossary of terms / community-driven discussion forum (with JKI, Gregor Hagedorn).
• Discussion, discovery and REUSE of existing terms.
• NCBO BioPortal as a repository for biodiversity ontologies.
• Will GBIF contribute to mint new biodiversity ontologies?
• BFO based OWL version of Darwin Core…?
• KOS governance structure developed and formalized by the (TDWG) Vocabulary Management Task Group (VoMaG).
• Roadmap for KOS into the GBIF infrastructure, portal, …!
Furthermore, I think that we need persistent identifiers!
Cato the Elder ended all his speeches in the senate of Rome with: "Ceterum autem censeo Carthaginem esse delendam" (English: "Furthermore, I think Carthage must be destroyed").