Post on 01-Jan-2016
e-SI Theme: Exploiting Diverse Sources of Scientific Data
Integrating Diverse Sources of Scientific Data:
Is it safe to match on names?
Prof. Jessie Kennedy
Exploiting Diverse Sources of Scientific Data
The wealth and diversity of scientific data collected and stored is growing rapidly
Increase in automation: genetic sequencing, remote sensing, astronomy satellites
Decrease in technological costs: computers more powerful, disk space greater for the same £
Huge potential for scientific discovery by exploiting this data, especially in multi-disciplinary research
The number, complexity and diversity of resources make this a difficult task
Case study: data integration, matching data sets on biological names
SEEK: Science Environment for Ecological Knowledge
USA National Science Foundation funding
Multidisciplinary project
Biology: Ecology, Taxonomy
Environmental science: Geography, Remote sensing, Meteorology, Climatology
Computer Science: Database, GRID/Web, Ontologies, Workflows, Algorithms, Human Computer Interaction
The SEEK Prototype: Ecological Niche Modeling
[Figure: Geographic Space and Ecological Space. Occurrence points on native species distribution -> ecological niche modeling -> model of niche in ecological dimensions (axes: temperature, precipitation) -> project back onto geography -> native range prediction and invaded range prediction]
Biodiversity information, e.g. data from museum specimens, ecological surveys
Geospatial and remotely sensed data
Results taken to integrate with other data realms (e.g., human populations, public health, etc.)
Species prediction map
Predicted distribution: Amur snakehead (Channa argus)
Image from http://www.lifemapper.org
SEEK - Informatics Challenges
Data is distributed
Data is heterogeneous:
Syntax, e.g. text, Excel, relational database…
Schema, e.g. names of the tables, columns in tables
Semantics: principal focus for SEEK
From many disciplines:
Biodiversity surveys, hydrology, atmospheric chemistry, spatial data, behavioural experiments,…
Data on economics, demographics, legal issues,…
SEEK Overview
Analysis and Modelling System (Kepler): modelling scientific workflows
EcoGrid: making diverse environmental data systems interoperate
Semantic Mediation System: “smart” data discovery and integration
Knowledge Representation WG: ontologies, metadata
Taxon WG: taxonomic name/concept resolution server
BEAM WG: Biodiversity and Ecological Analysis and Modelling
SEEK Overview: EcoGrid
EcoGrid Resources
AND
LUQ
NTL
VCR
HBR
Metacat node
Legacy system
SRB node
DiGIR node
VegBank node
Xanthoria node
Natural History Collections (>> 100)
LTER Network (24)
Organization of Biological Field Stations (180)
Partnership for Interdisciplinary Studies of Coastal Oceans (4)
UC Natural Reserve System (36)
Multi-agency Rocky Intertidal Network (60)
EcoGrid Data Access
EcoGrid registry to discover data sources
EML (Ecological Metadata Language):
Experimental data, survey data, spatial raster and vector data, etc.
XML based
Discovery information: creator, title, abstract, keyword, etc.
Coverage: geographic, temporal, and taxonomic extent
Logical and physical data structure
Data semantics via unit definitions and typing
Protocols and methods
DarwinCore: museum collections
EcoGrid Services
Service to Analysis and Modelling Layer:
Interaction with Kepler (workflows)
Interaction with Grid computing facilities (distributed computation)
Service to Semantic Mediation Layer:
Access to ontologies; Taxon services
Access to legacy apps: LifeMapper, Spatial Data Workbench
SEEK Overview: AMS (Analysis and Modelling System)
Scientific Workflows
Model the way scientists currently work with data: coordinate export and import of data among software systems
Workflows emphasize data flow
Output generation includes creating appropriate metadata
The analysis workflow itself becomes metadata
The workflow describes the data lineage as it has been transformed
Derived data sets can be stored in EcoGrid with provenance
Query EcoGrid to find data
Archive output to EcoGrid with workflow metadata
Scientific workflows
EML provides semi-automated data binding
Kepler: Ecological Niche Model
(200 to 500 runs per species × 2000 mammal species × 3 minutes/run) = 833 to 2083 days
(200 to 500 runs per species × 2000 mammal species × 3 minutes/run) / 100 nodes = 8 to 20 days
Utilize distributed computing resources
Execute single steps or sub-workflows on distributed machines
Grid-enable Kepler
KeplerGrid for Niche Modeling
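The run-time estimates quoted for the niche model can be checked with a few lines of arithmetic. This is only a sketch: it assumes the runs are independent and divide evenly across nodes with no scheduling overhead, which the slides do not state, and the function name is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60

def total_days(runs_per_species, species=2000, minutes_per_run=3, nodes=1):
    """Wall-clock days, assuming independent runs split evenly across nodes."""
    total_minutes = runs_per_species * species * minutes_per_run
    return total_minutes / MINUTES_PER_DAY / nodes

for runs in (200, 500):
    serial = total_days(runs)
    parallel = total_days(runs, nodes=100)
    print(f"{runs} runs/species: {serial:.0f} days on one node, "
          f"{parallel:.1f} days on 100 nodes")
```

The lower and upper bounds come out at roughly 833 and 2083 days on one machine, and a little over 8 and 20 days on 100 nodes, matching the figures quoted.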
SEEK Overview: SMS (Semantic Mediation System)
Metadata
Key information needed to read and machine process a data file is in the metadata
Physical descriptors (CSV, Excel, RDBMS, etc.)
Logical: Entity (table, image..) and Attribute (column) descriptions
Name
Type (integer, float, string…)
Codes (missing values, nulls...)
Integrity constraints
Semantic descriptions (ontology-based type systems)
Metadata driven data ingestion
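As a rough illustration of metadata driven ingestion, the sketch below reads a delimited file using only a metadata record that gives each attribute's name, type and missing-value codes. The metadata layout, field names and sample values are all invented for this example and are not EML element names or real data.

```python
import csv
import io

# Hypothetical metadata record; the keys are illustrative, not EML elements.
METADATA = {
    "delimiter": ",",
    "attributes": [
        {"name": "site", "type": str, "missing": {""}},
        {"name": "count", "type": int, "missing": {"-999", ""}},
        {"name": "area_m2", "type": float, "missing": {"NA", ""}},
    ],
}

def ingest(text, metadata):
    """Parse delimited text into typed rows, mapping missing-value codes to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=metadata["delimiter"])
    rows = []
    for raw in reader:
        row = {}
        for attr in metadata["attributes"]:
            value = raw[attr["name"]].strip()
            if value in attr["missing"]:
                row[attr["name"]] = None
            else:
                row[attr["name"]] = attr["type"](value)
        rows.append(row)
    return rows

# Made-up sample data.
SAMPLE = "site,count,area_m2\nHBR,12,4.0\nNTL,-999,NA\n"
rows = ingest(SAMPLE, METADATA)
print(rows)
```

The reader itself knows nothing about the file; changing the metadata record is enough to ingest a differently structured table.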
Ecological ontologies
What was measured (biomass or photosynthetic solar radiation)
Type of quantity measured (mass, length)
Context of measurement (Psychotria limonensis, wavelength band)
How it was measured (dry weight, total solar radiation)
Semantic Mediation
Label data with semantic types
Label inputs and outputs of analytical components with semantic types
Use reasoning engine to generate transformation step
Use reasoning engine to discover relevant component
[Figure labels: Data, Ontology, Workflow Components]
Data integration
Homogeneous data integration: integration via EML metadata is relatively straightforward
Heterogeneous data integration: requires advanced metadata and processing
Attributes must be semantically typed
Collection protocols must be known
Units and measurement scale must be known
Measurement relationships must be known, e.g., that ArealDensity = Count/Area
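A small sketch of why units and measurement relationships matter: one data set stores a count and a plot area, another stores a density directly, and the two are only comparable once the relationship ArealDensity = Count/Area and the area unit are known. The unit table, function name and numbers are assumptions made for illustration.

```python
# Conversion factors to square metres; a hypothetical, minimal unit table.
AREA_TO_M2 = {"m2": 1.0, "ha": 10_000.0, "km2": 1_000_000.0}

def areal_density(count, area, area_unit):
    """Return individuals per square metre: ArealDensity = Count / Area."""
    if area_unit not in AREA_TO_M2:
        raise ValueError(f"unknown area unit: {area_unit}")
    area_m2 = area * AREA_TO_M2[area_unit]
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return count / area_m2

# Data set A records a count on a 2 ha plot; data set B stores density per m2.
density_a = areal_density(count=500, area=2, area_unit="ha")
density_b = 0.025
print(density_a, density_b)
```

Without the unit on the area column, the same two numbers could be off by four orders of magnitude, which is the kind of error the semantic typing is meant to catch.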
Simple Example
Life Sciences Data
Much of the data gathered in ecological studies and used in ecological data analysis is bio-referenced data
Typically organisms are referenced by a Latin name, e.g. Picea rubens
Many analyses require integrating data originating in many locations and at various points in time
For most bio-referenced data, integration involves matching on organism name
SEEK Taxon is investigating the associated issues
Biological (Scientific) Names
Used for communicating information about known organisms and groups of organisms (taxa)
Framework for all biologists to communicate…
Arise from taxonomists applying them to species and higher taxa following classification
Formalized according to strict codes of nomenclature, which differ depending on kingdom
Use a Latin naming scheme: polynomial for species and below; monomial for genus and above
Quoted as: LatinName NameAuthors Year
Example: Carya floridana Sarg. 1913
Can cause problems in data analysis…
Classification, Concepts & Names
[Figure: a pile of specimens is classified into a taxonomic hierarchy (Genus, Species) of taxon concepts labelled Taxon_concept_a, Taxon_concept_b, Taxon_concept_c and Taxon_concept_d; type specimens are marked]
Classification, Concepts & Names
[Figure: classifying a pile of specimens; labels: Taxon_concept_d, Taxon_concept_d]
Taxonomic history of Aus L. 1758
[Figure legend: type specimen, genus name, genus concept, species concept, species name, publication, specimen]
Publications of Taxonomic Revisions
In Linnaeus 1758: (i) Aus L.1758; Aus aus L.1758
In Archer 1965: (ii) Aus L.1758; Aus aus L.1758; Aus bea Archer 1965
In Fry 1989: (iii) Aus L.1758; Aus aus L.1758; Aus bea Archer 1965; Aus cea BFry 1989
In Tucker 1991: (iv) Aus L.1758; Aus aus L.1758; Aus cea BFry 1989
In Pargiter 2003: (v) Aus L.1758; Aus aus L. 1758; Aus ceus BFry 1989; Xus Pargiter 2003; Xus beus (Archer) Pargiter 2003
Publications of Purely Nomenclatural Observation
In Pyle 1990: bea and cea noted as invalid names and replaced with beus and ceus.
Archer splits Aus aus L. 1758 into two species, retains the name for one and creates a new one
Fry splits Aus bea Archer 1965 into two species, retains the name for one and creates a new one
A diligent nomenclaturist, Pyle (1990), notes that the species epithets of Aus bea and Aus cea are of the wrong gender and publishes the corrected names Aus beus corrig. Archer 1965 and Aus ceus corrig. BFry 1989
Tucker finds new specimens and combines Aus aus L. 1758 and Aus bea Archer 1965 into one species, retains the name.
Tucker publishes his revision without noting Pyle’s corrigendum of the name of Aus cea
Pargiter decides to re-split Aus aus but believes bea (beus) is in a new genus Xus.
Pargiter publishes his revision using Pyle’s corrigendum of the epithet bea to beus and Aus cea to Aus ceus.
Problems with Taxonomic Names
Are not unique
“Re-use” of names with changed definition
Name is ambiguous without definition/context
Subject to alterations and 'corrections' in time
Often recorded inappropriately in datasets:
No author and/or year (e.g. Carya floridana)
Abbreviated (e.g. C. floridana)
Internal code (e.g. PicRub for Picea rubens)
Vernacular used (e.g. Scrub Hickory)
Misspelled
Taxon Concepts ……
The published expert opinion defining and describing a group of organisms which are given a (scientific) name
Scientific names qualified with a reference to the definition of a concept
Should be used for communicating about groups of organisms
Comparing or integrating data based on taxon concepts will be more accurate
Taxon Concepts…
Created by someone: an Author
Described in a Publication
Given a Name
Related to the type specimen
Definition
Referenced by: full scientific name + “according to” (Author + Publication + Date) Definition
Example: Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)
Taxon Concepts ……
Defined by:
Set of Specimens examined during classification
Set of common Characters: context dependent; differentiate taxa rather than fully describe them; use natural language with all its ambiguities
Relationships to other Taxon Concepts:
Taxon circumscription: the lower level taxa
Congruence, overlap, includes etc. to taxa in other classifications
Taxon Concepts ……
Original concept: 1st use of name as described by the taxonomist
Same author + date in scientific name and “according to”
TC_a: Carya floridana Sarg. (1913) “according to” Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)
Revised concept: re-classification of a group
TC_b: Carya floridana Sarg. (1913) “according to” Stone, Flora of North America 3:424 (1997)
Relationship between the taxon concepts: TC_b includes TC_a
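The distinction between a name and a concept can be sketched as a small data structure. The class, field names and relationship triple below are illustrative assumptions, not the TCS schema; only the two Carya floridana references and the "includes" relationship come from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaxonConcept:
    concept_id: str    # hypothetical local identifier, e.g. "TC_a"
    name: str          # full scientific name with author and year
    according_to: str  # publication in which the concept is defined

    def label(self):
        return f'{self.name} "according to" {self.according_to}'

tc_a = TaxonConcept(
    "TC_a",
    "Carya floridana Sarg. (1913)",
    "Charles Sprague Sargent, Trees & Shrubs 2:193 plate 177 (1913)",
)
tc_b = TaxonConcept(
    "TC_b",
    "Carya floridana Sarg. (1913)",
    "Stone, Flora of North America 3:424 (1997)",
)

# Relationships between concepts as (subject, relation, object) triples.
relationships = {("TC_b", "includes", "TC_a")}

def same_name(x, y):
    return x.name == y.name

def same_concept(x, y):
    return x.name == y.name and x.according_to == y.according_to

print(same_name(tc_a, tc_b))     # True: the names are identical
print(same_concept(tc_a, tc_b))  # False: the definitions differ
```

Two records carrying the same name string can therefore still refer to different groups of organisms, and the relationship triple records how the two definitions relate.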
Legacy Data …
In legacy data names often appear in place of concepts
Names are imprecise: inappropriate for referring to information regarding taxa, e.g. observational/collection data
BUT… sometimes that’s all we have
How do we interpret names? Potentially multiple definitions:
the sum of all definitions that exist for the name
one of the existing definitions
the “attributes” in common to all the definitions
represented by the type specimen
Names as Taxon Concepts
Nominal concepts
Sub-set of TaxonConcepts
Name but no AccordingTo
non-unique (concept) identifier attributes
can be given a unique concept identifier
No definition
Explicitly saying it’s something with this name but not really sure what is/was meant by the name
Encourage people to understand and address the issue of names
Allowing mark-up of data with names allows them to believe names are really good enough
Will improve long term usefulness of scientific data
Ease integration
SEEK Taxon’s Message…..
Scientific names are not unique identifiers for biological entities
Integrating data from different sources based on names alone could cause serious errors in analysis of the integrated data
Biologists must reference organisms precisely if datasets are to be of use long term or to other users
Reference by taxon concept rather than name
Integrate data for analysis on taxon concepts
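To make the risk concrete, the sketch below totals some observation counts first by name alone and then by name plus the "according to" reference. The records, counts and dataset labels are invented; the names follow the hypothetical Aus example, in which Archer 1965 splits Aus aus L.1758 into two species and keeps the old name for one of them.

```python
from collections import defaultdict

# Made-up observation records; "sec" holds the "according to" reference.
observations = [
    {"dataset": "survey_1960", "name": "Aus aus L.1758",
     "sec": "Linnaeus 1758", "count": 40},
    {"dataset": "survey_1970", "name": "Aus aus L.1758",
     "sec": "Archer 1965", "count": 15},
    {"dataset": "survey_1970", "name": "Aus bea Archer 1965",
     "sec": "Archer 1965", "count": 22},
]

def totals(records, key_fields):
    """Sum counts grouped by the given fields."""
    sums = defaultdict(int)
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        sums[key] += rec["count"]
    return dict(sums)

by_name = totals(observations, ["name"])
by_concept = totals(observations, ["name", "sec"])

print(by_name)
print(by_concept)
```

Grouping by name adds the 1960 and 1970 counts for "Aus aus L.1758" together, even though the later survey used a narrower definition. Grouping by concept keeps them apart so that an expert, or a concept relationship, can decide how they should be combined.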
Taxonomic Databases
Main taxonomic list servers are still name based:
single perspective on taxonomy
don’t represent multiple classifications
unclear what the definition is (don’t even try!)
provide non-standardised interface (web page, xml download)
SEEK Taxon aims to prototype a concept/name resolution service for ecologists working with SEEK:
Find concepts given a name
Compare concepts
Relate concepts
Mark up ecological data sets with concepts
First: need data on names and concepts; need an exchange standard…
Taxon Concept Schema
TCS: standard for exchange of taxonomic names/concept data
Taxonomic Databases Working Group (TDWG)
Global Biodiversity Information Facility (GBIF)
XML based exchange schema
Makes heavy use of Globally Unique Identifiers (GUIDs)
Not designed as the “correct way” to model a Taxon Concept
No “rules” as to what a taxon must have
Designed to accommodate different models
Includes Taxon Names: more constrained, by the codes of nomenclature
TCS/EML: TCS modifications to EML taxon coverage
Taxon Names and Taxon Concepts
Important to be able to pass names alone
For nomenclatural and some taxonomic purposes
But not for identifications/observations
Taxon Concepts refer to Names
By GUID
Names must not change
Can’t record original taxon concept
Taxon Concept/Name Resolution Server
Taxon Object Server (TOS)
Schema based on the TCS model
Implements the GUIDs using LSID technology
Tool to import/export data from TCS documents
TOS allows registration, retrieval of taxonomic datasets
Match concepts given names, concepts, etc.
Allows users to:
See different taxonomic opinions
Uses GUIDs to reference concepts (LSIDs)
Find concepts…
Author new concepts
Make new relationships between existing concepts
Integrated with Kepler workflow system
SEEK User Interface Tools
Concept mapper
A desktop tool to assist taxonomists to relate concepts from one source to another
For use in creating data sets for TOS or TCS
For creating new relationships between concepts in TOS
Taxonomy comparison visualisation
Visualisation tool to explore different classifications
Compare concepts
Concept Mapper Main GUI
[Screenshot labels: Query concepts; Concepts; Relationships]
Concept Comparison Visualisation
SEEK Summary
Environment to support large scale ecological data analysis
Scientific Workflows: Kepler
Semantic Mediation: ecological ontology creation/use for data integration
Grid/Web based data discovery
Resolution of Taxonomic Names/Concepts:
Standards development
Concept matching server
Visualisation tools
http://seek.ecoinformatics.org
Is it safe to match on names?
I hope I have convinced you that the answer is NO as a general rule…
BUT it depends on:
The purpose of the data, and therefore the accuracy required
The degree of automation used in matching: greater automation, greater potential problem
Expertise of person involved in the matching
Many Outstanding Issues…
Educating biologists of the inherent problem in names
Not limited to the Linnaean system of nomenclature
Lack of good taxon concept data
Widening usage and application of taxon concepts
Adopting GUIDs
Provision of reliable ‘look up’ facilities
Cross referencing of GUIDs
Reuse is vital
Must not create duplicate GUIDs if possible
Conversion of legacy data
Develop good matching algorithms
Potential move from XML schema -> semantic web technologies…
Acknowledgements
This material is based upon work supported by:
The National Science Foundation
SEEK Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Arizona State University, UC Davis
Matt Jones, for many of the slides…
Global Biodiversity Information Facility
eScience Institute Research Theme Programme
Malcolm Atkinson
Exploiting Diverse sources of Scientific Data
Upcoming Workshop discussing possible technology solutions
RDF, Ontologies and Meta-Data Workshop
7th – 9th June, 2006
e-Science Institute, 15 South College Street, Edinburgh
http://www.nesc.ac.uk/esi/events/683/