Science Environment for Ecological Knowledge: Ecogrid Interfaces Dave Vieglais

Post on 12-Jan-2016


Science Environment for Ecological Knowledge: Ecogrid Interfaces

Dave Vieglais
The Natural History Museum and Biodiversity Research Center

University of Kansas

Science Environment for Ecological Knowledge

Research Objectives

Access to ecological and environmental data
• Enable data sharing & re-use
• Enhance data discovery at global scales

Scalable analysis and synthesis
• Taxonomic, spatial, temporal, and conceptual integration of data
• Enable communication and collaboration for analysis
• Address data heterogeneity issues
• Enable re-use of analytical components

Informatics Challenges for SEEK

Data is heterogeneous
• Syntax
• Schema
• Semantics

From many disciplines
• Biodiversity surveys, hydrology, atmospheric chemistry, spatial data, behavioral experiments, …
• Data on economics, demographics, legal issues, …

Data is distributed

SEEK Components

EcoGrid
• Ecological, biodiversity and environmental data
• Computational access

Analysis and Modeling System
• Modeling scientific workflows

Semantic Mediation System
• “Smart” data discovery
• Knowledge-based data integration
• Knowledge-based analysis integration

Knowledge Representation
• Ontologies for describing ecology

Building the EcoGrid

[Figure: diagram of EcoGrid nodes. Node types: Metacat node, Site node, SRB node, DiGIR node. Labeled nodes: AND, SEV, LUQ, VCR, HBR, NTL, NRS, PISCO1, PISCO2, OBFS, SDSC, NET, KU, NCEAS.]

Networks shown:
• LTER Network (24)
• Organization of Biological Field Stations (180)
• UC Natural Reserve System (36)
• Partnership for Interdisciplinary Studies of Coastal Oceans (4)
• Multi-agency Rocky Intertidal Network (60)

SEEK EcoGrid

Integrate diverse data networks from ecology, biodiversity, and environmental sciences
• Metacat, DiGIR, SRB, Xanthoria, ...

EML is the core for data documentation

Access to computational resources via the Grid (OGSA)

Ecological Metadata Language (EML)

Metadata: a means to manage ecological data
• There is no universal data model for ecology
• Accommodate heterogeneity and dispersion

EML
• Discovery information: Creator, Title, Abstract, Keyword, etc.
• Coverage: geographic, temporal, and taxonomic extent
• Logical and physical data structure: data semantics via unit definitions and typing
• Protocols and methods

DiGIR Overview

DiGIR = Distributed Generic Information Retrieval
• A DiGIR client may communicate with any number of data providers
• A DiGIR data provider may expose any number of resources (databases)
• A DiGIR resource is a collection of objects described by a single federation schema

[Diagram: DiGIR Client, DiGIR Provider, and Data Resource, linked by 1..n relationships]
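The client / provider / resource relationships can be pictured with a small Python sketch. The class names, fields, and example endpoints are illustrative assumptions only; they are not part of the DiGIR protocol or of any DiGIR library.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A collection of objects described by a single federation schema."""
    name: str
    federation_schema: str
    records: list = field(default_factory=list)


@dataclass
class Provider:
    """A data provider that exposes any number of resources (databases)."""
    endpoint: str
    resources: list = field(default_factory=list)


@dataclass
class Client:
    """A client that may communicate with any number of providers."""
    providers: list = field(default_factory=list)

    def resources_for_schema(self, schema: str) -> list:
        # Collect every resource, across all providers, that uses the schema.
        return [
            res
            for prov in self.providers
            for res in prov.resources
            if res.federation_schema == schema
        ]


DARWIN = "http://digir.net/schema/conceptual/darwin/2003/1.0"

# Hypothetical providers and resources, for illustration only.
client = Client([
    Provider("http://example.org/provider-a", [
        Resource("MammalsDwC2", DARWIN),
        Resource("HypotheticalOther", "urn:example:other-schema"),
    ]),
    Provider("http://example.org/provider-b", [
        Resource("HypotheticalBirds", DARWIN),
    ]),
])

matching = [res.name for res in client.resources_for_schema(DARWIN)]
```

Because every resource declares exactly one federation schema, a client can fan a single query out to all resources that share that schema.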

EcoGrid Interfaces

Registry
• Resolves references to objects: interface definitions, data structures, service instances

Session
• Authentication
• Details on session information
• Coarse granularity of resource restriction

Query
• Search and retrieve metadata and data
• Different levels of “conformance”
• Low bar for participation in SEEK

Taxon
• System to reduce ambiguity in scientific names
• Commonly used to address synonymy

SMS
• Mechanism for relating and resolving data and metadata concepts

EcoGrid Query Interfaces

• Provides a mechanism for search and retrieval of metadata and federated data
• Supports third-party interaction with search results: forwarding of result set identifiers to another service instance for retrieval
• Different levels of compliance
• Low barrier for participation
• Bulk of data will be accessible through Type I

Query Interfaces Implemented

Initial requirement to support query and retrieval from:
• SRB
• Metacat
• DiGIR
• Xanthoria

Federated data sets that subscribe to a small set of federation schemas

EcoGrid Query Level I

Basic, entry level exposure of data and metadata for EcoGrid and SEEK

Response contains data – intended for direct communications rather than 3rd party indirection

ResultsetType query(SessionID,QueryType)

byte[] get(SessionID,objectID)
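As a rough illustration of the two Level I operations, here is a minimal in-memory sketch in Python. The class name, the simplified query arguments (a single concept/value pair instead of an XML query document), and the sample objects are all assumptions made for this example; session checking is omitted.

```python
class InMemoryLevelOneService:
    """Toy stand-in for a Level I service.

    query() returns matching records directly in the response;
    get() returns the raw bytes of one object.
    """

    def __init__(self, objects):
        # objects maps objectID -> (metadata dict, raw bytes)
        self._objects = objects

    def query(self, session_id, concept, value):
        # A single equality test stands in for a full query document.
        return [
            {"objectID": oid, **meta}
            for oid, (meta, _raw) in self._objects.items()
            if meta.get(concept) == value
        ]

    def get(self, session_id, object_id):
        return self._objects[object_id][1]


# Hypothetical holdings, for illustration only.
service = InMemoryLevelOneService({
    "obj.1": ({"Genus": "Peromyscus"}, b"id,lat,lon\n1,33,121\n"),
    "obj.2": ({"Genus": "Microtus"}, b"id,lat,lon\n2,40,100\n"),
})

hits = service.query("session-0", "Genus", "Peromyscus")
payload = service.get("session-0", hits[0]["objectID"])
```

The caller receives the data itself in both calls, which matches the "direct communications" intent of Level I: nothing here can be handed to a third party for later retrieval.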

Query Example

<egq:query queryId="query-digir.1.1" system="http://knb.ecoinformatics.org"
    xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1 ../../src/xsd/query.xsd">
  <namespace prefix="darwin">http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
  <returnfield>/ScientificName</returnfield>
  <returnfield>/Longitude</returnfield>
  <returnfield>/Latitude</returnfield>
  <title>Peromyscus genus query</title>
  <condition operator="LIKE" concept="Genus">Peromyscus</condition>
</egq:query>
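To show how a service might read such a query document, the following sketch parses it with Python's standard `xml.etree.ElementTree`. The `xsi` attributes are dropped for brevity, and the shape of the `summary` dictionary is an assumption of this example, not something defined by EcoGrid.

```python
import xml.etree.ElementTree as ET

QUERY_XML = """\
<egq:query queryId="query-digir.1.1" system="http://knb.ecoinformatics.org"
    xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1">
  <namespace prefix="darwin">http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
  <returnfield>/ScientificName</returnfield>
  <returnfield>/Longitude</returnfield>
  <returnfield>/Latitude</returnfield>
  <title>Peromyscus genus query</title>
  <condition operator="LIKE" concept="Genus">Peromyscus</condition>
</egq:query>
"""

root = ET.fromstring(QUERY_XML)

# Child elements carry no prefix, so they are looked up by plain tag name.
summary = {
    "queryId": root.get("queryId"),
    "title": root.findtext("title"),
    "namespaces": {ns.get("prefix"): ns.text for ns in root.findall("namespace")},
    "returnfields": [rf.text for rf in root.findall("returnfield")],
    "conditions": [
        (cond.get("concept"), cond.get("operator"), cond.text)
        for cond in root.findall("condition")
    ],
}
```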

Query Structure

Language independent representation of a query structure

Transformed into the appropriate native language of the data store

Example:
<AND>
  <condition operator="LIKE" concept="ScientificName">peromyscus man%</condition>
  <condition operator="NOT EQUALS" concept="DecimalLatitude">NULL</condition>
</AND>
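One way to picture the "transformed into the native language of the data store" step is a translator from the condition tree to a parameterised SQL WHERE clause. This is a sketch under stated assumptions: SQL as the target language, the `to_where` helper, the `EQUALS` operator, and the `OR` element are all introduced here for illustration (only `AND`, `LIKE`, and `NOT EQUALS` appear in the slides).

```python
import xml.etree.ElementTree as ET

# Query operator -> SQL operator. EQUALS is an assumed addition.
SQL_OPERATORS = {"LIKE": "LIKE", "EQUALS": "=", "NOT EQUALS": "<>"}


def to_where(node, params):
    """Recursively translate a condition tree into a WHERE clause.

    Literal values are appended to params and replaced by '?' placeholders.
    """
    if node.tag in ("AND", "OR"):  # OR is assumed by analogy with AND
        parts = [to_where(child, params) for child in node]
        return "(" + f" {node.tag} ".join(parts) + ")"
    if node.tag != "condition":
        raise ValueError(f"unexpected element: {node.tag}")

    concept = node.get("concept", "")
    if not concept.isidentifier():
        raise ValueError(f"unsafe concept name: {concept!r}")

    operator = node.get("operator")
    value = (node.text or "").strip()

    # SQL compares against NULL with IS / IS NOT rather than = / <>.
    if value == "NULL" and operator in ("EQUALS", "NOT EQUALS"):
        suffix = "IS NULL" if operator == "EQUALS" else "IS NOT NULL"
        return f"{concept} {suffix}"

    params.append(value)
    return f"{concept} {SQL_OPERATORS[operator]} ?"


QUERY = """
<AND>
  <condition operator="LIKE" concept="ScientificName">peromyscus man%</condition>
  <condition operator="NOT EQUALS" concept="DecimalLatitude">NULL</condition>
</AND>
"""

params = []
where = to_where(ET.fromstring(QUERY), params)
```

A different back end (for example a DiGIR provider or Metacat) would need its own translator over the same tree, which is the point of keeping the query structure language independent.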

Specifying the Resultset

Specify the list of concepts (fields) to be returned in the resultset

Simple paths used to identify elements or document subtrees

Effectively flattens the structure of the records, but allows generic representation

Example: <returnfield>/ScientificName</returnfield>

<returnfield>/Longitude</returnfield>

<returnfield>/Latitude</returnfield>
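The flattening effect of `returnfield` paths can be sketched as follows. Records are modelled as nested Python dictionaries, and the `select_fields` helper and the sample record (including its `Locality` subtree) are hypothetical.

```python
def select_fields(record, returnfields):
    """Flatten one nested record to the requested simple paths.

    A path such as "/Locality/Country" walks nested dictionaries; a path
    that names an inner node returns the whole subtree. Missing paths
    yield None.
    """
    flat = {}
    for path in returnfields:
        value = record
        for step in path.strip("/").split("/"):
            if not isinstance(value, dict):
                value = None
                break
            value = value.get(step)
        flat[path] = value
    return flat


# Hypothetical record, for illustration only.
record = {
    "ScientificName": "Peromyscus leucopus",
    "Longitude": 121,
    "Latitude": 33,
    "Locality": {"Country": "hypothetical"},
}

flat = select_fields(record, ["/ScientificName", "/Longitude", "/Latitude"])
subtree = select_fields(record, ["/Locality", "/Locality/Country", "/Missing"])
```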

Query Result Set Structure

<rs:resultset resultsetId="foo.1.1" system="urn:not://sure/what/to/put/here"
    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
  <resultsetMetadata>
    <sendTime>2003-05-02T16:45:50-09:00</sendTime>
    <startRecord>1</startRecord>
    <endRecord>2</endRecord>
    <recordCount>2</recordCount>
  </resultsetMetadata>
  <record number="1"
      system="http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2"
      identifier="mvz1"
      namespace="http://digir.net/schema/conceptual/darwin/2003/1.0"
      lastModifiedDate="2003-03-03T10:42:13"
      creationDate="2003-03-03T10:42:13">
    <darwin:ScientificName>PEROMYSCUS LEUCOPUS NOVEBORACENSIS</darwin:ScientificName>
    <darwin:Longitude>121</darwin:Longitude>
    <darwin:Latitude>33</darwin:Latitude>
  </record>
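A client could turn such a result set into plain rows along these lines. The document embedded below is a shortened, hypothetical variant of the slide's example: it declares the `darwin` prefix, contains a single record, and is closed so that it parses. Those changes are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

DARWIN = "http://digir.net/schema/conceptual/darwin/2003/1.0"

RESULTSET_XML = f"""\
<rs:resultset resultsetId="foo.1.1"
    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
    xmlns:darwin="{DARWIN}">
  <resultsetMetadata>
    <startRecord>1</startRecord>
    <endRecord>1</endRecord>
    <recordCount>1</recordCount>
  </resultsetMetadata>
  <record number="1" identifier="mvz1" namespace="{DARWIN}">
    <darwin:ScientificName>PEROMYSCUS LEUCOPUS NOVEBORACENSIS</darwin:ScientificName>
    <darwin:Longitude>121</darwin:Longitude>
    <darwin:Latitude>33</darwin:Latitude>
  </record>
</rs:resultset>
"""

root = ET.fromstring(RESULTSET_XML)
record_count = int(root.findtext("resultsetMetadata/recordCount"))

records = []
for rec in root.findall("record"):
    row = {"identifier": rec.get("identifier")}
    for child in rec:
        # Tags arrive as "{namespace-uri}LocalName"; keep only the local name.
        local_name = child.tag.split("}", 1)[-1]
        row[local_name] = (child.text or "").strip()
    records.append(row)
```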

EcoGrid Query Level II

• More detailed handling of results
• Uses RSIDs to identify resultsets: handles that can be passed to a third party

Resultset retrieve(SessionID,RSID,start,numrecs)

RSID search(SessionID,query)

query decodeResultsetIdentifier(SessionID,RSID)

statusinfo getResultStatus(SessionID)

int transfer(SessionID,sourceURL,destURL,ObjectID)
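The RSID idea can be sketched with a toy in-memory service. Everything here is an assumption made for illustration: the class name, the use of a Python callable as the "query", random hex strings as RSIDs, and 1-based `start` values (chosen to mirror `startRecord` in the result set example). `getResultStatus` and `transfer` are left out.

```python
import uuid


class InMemoryLevelTwoService:
    """Toy Level II service: search() returns a handle, not the data."""

    def __init__(self, records):
        self._records = records
        self._resultsets = {}  # RSID -> (query, matching records)

    def search(self, session_id, query):
        """Run the query and return only a resultset identifier (RSID)."""
        rsid = uuid.uuid4().hex
        matches = [rec for rec in self._records if query(rec)]
        self._resultsets[rsid] = (query, matches)
        return rsid

    def retrieve(self, session_id, rsid, start, numrecs):
        """Return one page of records; start is 1-based."""
        _query, matches = self._resultsets[rsid]
        return matches[start - 1:start - 1 + numrecs]

    def decode_resultset_identifier(self, session_id, rsid):
        """Return the query that produced the given resultset."""
        return self._resultsets[rsid][0]


def is_peromyscus(rec):
    return rec["Genus"] == "Peromyscus"


# Hypothetical records, for illustration only.
service = InMemoryLevelTwoService(
    [{"id": n, "Genus": "Peromyscus"} for n in range(1, 6)]
    + [{"id": 6, "Genus": "Microtus"}]
)

rsid = service.search("session-a", is_peromyscus)
# A third party that was handed only the RSID can page through the results.
page = service.retrieve("session-b", rsid, start=3, numrecs=2)
```

Separating `search` from `retrieve` is what lets the party that issues the query differ from the party that fetches the records.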

EcoGrid Write

Used to push data back to sources (e.g. publishing EML documents)

Depends on the availability of an authentication system

put(sessionID, objectID, object, type)

delete(sessionID,objectID)

Data Instance Query?

New requirement to support direct query and retrieval with arbitrary data sets

Generally no common schemas between different instances

Could either:
• Push data instance to a service that can query the object (e.g. the SRB)
• Implement interface at the data instance location

Simple JDBC / SQL interface?

dbSchema getDataSchema(sessionID,objectID)

dbResultset search(sessionID,objectID,SQL)
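To make the proposed SQL-style interface concrete, here is a minimal sketch that backs the two calls with Python's built-in `sqlite3` module. SQLite, the class name, the `load` helper, and the convention that the objectID doubles as a table name are all assumptions of this example. A real service would also have to restrict the SQL it accepts.

```python
import sqlite3


class SqliteDataInstanceService:
    """Toy data-instance service backed by an in-memory SQLite database."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")

    def load(self, object_id, columns, rows):
        # Hypothetical helper: store one data instance as a table.
        if not object_id.isidentifier():
            raise ValueError(f"unsafe object id: {object_id!r}")
        cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns)
        self._conn.execute(f"CREATE TABLE {object_id} ({cols})")
        marks = ", ".join("?" for _ in columns)
        self._conn.executemany(f"INSERT INTO {object_id} VALUES ({marks})", rows)

    def get_data_schema(self, session_id, object_id):
        if not object_id.isidentifier():
            raise ValueError(f"unsafe object id: {object_id!r}")
        info = self._conn.execute(f"PRAGMA table_info({object_id})").fetchall()
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        return [(row[1], row[2]) for row in info]

    def search(self, session_id, object_id, sql):
        # The SQL is expected to reference the table named by object_id.
        return self._conn.execute(sql).fetchall()


# Hypothetical data instance, for illustration only.
service = SqliteDataInstanceService()
service.load(
    "specimens",
    [("name", "TEXT"), ("lat", "REAL")],
    [("Peromyscus leucopus", 33.0), ("Microtus sp.", None)],
)

schema = service.get_data_schema("session-0", "specimens")
rows = service.search(
    "session-0", "specimens",
    "SELECT name FROM specimens WHERE lat IS NOT NULL",
)
```

Since there is generally no common schema between data instances, a caller would typically fetch the schema first and build its SQL from the column names it finds.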

Convergence with Globus?

EcoGrid originally intended to use Globus since it provided much of the infrastructure

Globus is not a viable infrastructure layer due to installation and reliability concerns

Should SEEK implement Globus infrastructure to support project requirements?

Likely to duplicate minimal service definitions and re-implement

Acknowledgements

This material is based upon work supported by:

The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.

The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.

The Andrew W. Mellon Foundation.

PBI Collaborators: NCEAS, University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)