
Hughes RDAP11 Data Publication Repositories

Date post: 30-May-2015
Description:
Leveraging Open Source Technologies to Enable Scientific Archiving and Discovery; Steve Hughes, NASA; Data Publication Repositories The 2nd Research Data Access and Preservation (RDAP) Summit An ASIS&T Summit March 31-April 1, 2011 Denver, CO In cooperation with the Coalition for Networked Information http://asist.org/Conferences/RDAP11/index.html
Page 1: Hughes RDAP11 Data Publication Repositories

National Aeronautics and Space Administration

Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California

Leveraging Open Source Technologies to Enable Scientific Archiving and Discovery

Research Data Access & Preservation

Denver, Colorado

March 31 - April 1, 2011

Steve Hughes

Dan Crichton

Chris Mattmann

Sean Kelly

Page 2: Hughes RDAP11 Data Publication Repositories


Topics

• E-Science Trends
• Software Architectures
• Open Source
• Object-Oriented Data Technology
• Use Case
• Data Driven

Leveraging Open Source Technologies to Enable Scientific Discovery

Page 3: Hughes RDAP11 Data Publication Repositories


“eScience” Trends

• Highly distributed, multi-organizational systems
  – Systems are moving towards loosely coupled systems or federations in order to solve science problems which span center and institutional environments

• Sharing of data and services which allow for the discovery, access, and transformation of data
  – Systems are moving towards publishing of services and data in order to address data- and computationally-intensive problems
  – Infrastructures are being built to handle future demand
  – Use of commodity services to address elasticity

• Address complex modeling, inter-disciplinary science and decision support needs
  – Need a dynamic environment where data and services can be used quickly as the building blocks for constructing predictive models and answering critical science questions
  – Need to ensure the information architecture supports the varying science needs

• Changing the way in which data analysis is performed
  – Moving towards analysis of distributed data to increase the study power
  – Enabling greater collaboration across centers
  – Systematizing, where possible

Page 4: Hughes RDAP11 Data Publication Repositories

Highly Distributed Science Environments

Planetary Data System: Distributed Planetary Science Archive

• Small Bodies Node: University of Maryland, College Park, MD
• Planetary Plasma Interactions Node: University of California Los Angeles, Los Angeles, CA
• Geosciences Node: Washington University, St. Louis, MO
• Imaging Node: JPL and USGS, Pasadena, CA and Flagstaff, AZ
• THEMIS Data Node: Arizona State University, Tempe, AZ
• Central Node: Jet Propulsion Laboratory, Pasadena, CA
• Navigation Ancillary Information Node: Jet Propulsion Laboratory, Pasadena, CA
• Rings Node: Ames Research Center, Moffett Field, CA
• Atmospheres Node: New Mexico State University, Las Cruces, NM

National Data Sharing Infrastructure: Supporting Collaboration in Biomedical Research for EDRN

• University of Michigan (CEC)
• Moffitt Cancer Center, Tampa (BDL)
• Creighton University (CEC)
• UT Health Science Center, San Antonio (CEC)
• University of Colorado (CEC)
• Fred Hutchinson Cancer Research Center, Seattle (DMCC)
• University of Pittsburgh (CEC)

Common characteristics

• Highly distributed/federated
• Collaborative
• Information-centric
• Discipline-specific
• Growing/evolving
• Heterogeneous (implementations)

Page 5: Hughes RDAP11 Data Publication Repositories


Why Software Architecture?

• Software Architecture: The fundamental organization of a system embodied in its components, their relationships to each other, and to the environment, and the principles guiding its design and evolution. (ANSI/IEEE Std. 1471-2000)

• Architecture is about strategy to address key architectural concerns…
  – How can we exploit common patterns to improve reuse?
  – Can we develop software product lines?
  – Can we improve interoperability?
  – Can we reduce dependencies?

• What are the architectural principles? Loosely coupled, data-driven, highly distributed, commodity services, service-oriented, collaborative/multi-institutional

Page 6: Hughes RDAP11 Data Publication Repositories

Notional Service Architectures Concept

• The service architecture concept exploits many of the architectural concepts discussed
  – Loosely coupled
  – Elasticity (e.g., commodity-based)
  – Multi-organizational
  – etc.

• At an enterprise scale, architectures don’t need to prescribe what’s inside services, just their interfaces, function, behavior, etc.

• Services might include:
  – Data discovery
  – Data access
  – Security
  – Transformation

[Figure: C2 architectural style. Labels: Client A, Client B, Service, Service Interface.]
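
The point that an architecture prescribes a service's interface rather than its internals can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DataDiscoveryService`, `InMemoryDiscovery`, `client_search`, and the catalog entries are hypothetical names invented here, not part of OODT or of any system described in this talk.

```python
from typing import Protocol


class DataDiscoveryService(Protocol):
    """Hypothetical service interface: the only thing a client depends on."""

    def find(self, keyword: str) -> list[str]:
        """Return identifiers of data products matching the keyword."""
        ...


class InMemoryDiscovery:
    """One possible implementation; its internals are invisible to clients."""

    def __init__(self, catalog: dict[str, str]) -> None:
        self._catalog = catalog  # product id -> free-text description

    def find(self, keyword: str) -> list[str]:
        needle = keyword.lower()
        return sorted(
            pid for pid, text in self._catalog.items() if needle in text.lower()
        )


def client_search(service: DataDiscoveryService, keyword: str) -> list[str]:
    """A client written only against the interface, never the implementation."""
    return service.find(keyword)


# Made-up catalog entries, for illustration only.
catalog = {
    "prod-001": "Saturn ring occultation profile",
    "prod-002": "Mars thermal infrared image",
    "prod-003": "Saturn magnetosphere plasma survey",
}
results = client_search(InMemoryDiscovery(catalog), "saturn")
```

Any other class with a matching `find` method could be passed to `client_search` unchanged, which is the loose coupling the slide describes.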

Page 7: Hughes RDAP11 Data Publication Repositories


What does this have to do with open source?

• The identification of core software product lines and tools that can be reused creates excellent opportunities for open source projects
  – Across a federation of organizations, systems and users, what can be developed and shared?
  – How can software components be developed in generic ways, but allow for extensions?

• Open source itself is a strategy
  – Can improve collaborations
  – Can drive a robust set of reusable software components and tools
  – Can push standards development
  – Can encourage use of common architectural patterns

Page 8: Hughes RDAP11 Data Publication Repositories

Open Source Models

• Software sharing with an open source license (e.g., BSD-style license)

• Software distribution through open source organizations (e.g., SourceForge)

• Software projects under the governance of an open source community/foundation (e.g., Apache Software Foundation)

• Ad hoc open source project communities with their own governance

Page 9: Hughes RDAP11 Data Publication Repositories

Open Source Models: Our Opinion

• Software sharing with an open source license (e.g., BSD-style license)
  – It’s a great start
  – Limited community involvement

• Software distribution through open source organizations (e.g., SourceForge)
  – Provides good software distribution support

• Software projects under the governance of an open source community/foundation (e.g., Apache Software Foundation)
  – This moves from just distribution support to collaboration and governance over the development

• Ad hoc open source project communities with their own governance
  – This can make a lot of sense for larger federations…

Page 10: Hughes RDAP11 Data Publication Repositories

The Apache Software Foundation

• Largest open source software development entity in the world
  – 2,300+ committers
  – 3,500+ contributors

• 84 Top Level Projects
  – 36 Incubating
  – 30 Lab Projects

• 8 retired projects in the “Attic”
• Over 1.2 million revisions
• Over 10M successful requests served a day across the world
• HTTPD web server used on 100+ million web sites (52+% of the market)

Page 11: Hughes RDAP11 Data Publication Repositories


OODT: An Open Source Framework for Building Distributed Science Data Mgmt Environments

• Focus on
  – distributed environments
  – science data generation
  – data capture, end-to-end
  – access to science data by the community

• A set of building blocks/services to exploit common system patterns for reuse

• 04-FEB-2011 - Apache OODT v0.2 Released

• Used for a number of science data system activities


http://oodt.apache.org/

Page 12: Hughes RDAP11 Data Publication Repositories


e-Science Examples and OODT

[Figure: node maps of the Planetary Data System (Distributed Planetary Science Archive) and of the EDRN National Data Sharing Infrastructure, listing the same nodes and sites as the “Highly Distributed Science Environments” slide.]

Planetary Science Data System
• Highly diverse (40 years of science data from NASA and int’l missions)
• Geographically distributed; moving int’l
• New centers plugging in (i.e., data nodes)
• Multi-center data system infrastructure
• Heterogeneous nodes with common interfaces
• Integrated based on enterprise-wide data standards
• Sits on top of COTS-based middleware

EDRN Cancer Research
• Highly diverse (30+ centers performing parallel studies using different instruments)
• Geographically distributed
• New centers plugging in (i.e., data nodes)
• Multi-center data system infrastructure
• Heterogeneous sites with common interfaces allowing access to distributed portals
• Integrated based on common data standards
• Secure (e.g., encryption, authentication, authorization)

Page 13: Hughes RDAP11 Data Publication Repositories

Mission Pipelines – Data Generation and Archive


• Leveraged OODT software framework for constructing ground data systems for earth science missions
  – Used OODT Catalog and Archive Service software
  – Focus is on “process management”

• Constructed “workflows”
  – Execution of “processors” based on a set of rules
  – Explicit separation of workflow management from management of computational resources

• Provided “lights out” operations

• Multiple Missions
  – SeaWinds
  – QuikSCAT
  – Orbiting Carbon Observatory (OCO), OCO-2…
  – NP Sounder PEATE
  – SMAP

[Figure: SeaWinds on ADEOS II (launched Dec 2002) ground data system. Labels: Spacecraft & Ancillary Files; Instrument Commands; File Transfer (FX); Pre-Processors (PP); Science Level Processors (LP); Science Analysis and Quality Reporting (SA); Engineering Analysis (EA); Product Delivery (PM); Science Products Released to PO.DAAC; User Interface (Process Monitoring & Control, Instrument Commanding, Data Verification); Data Management and Automatic Process Control (PM) using OODT.]


Credit: D. Freeborn, C. Mattmann, D. Woollard
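
The two workflow ideas on this slide, rule-based execution of processors and the separation of workflow management from resource management, can be sketched roughly as follows. This is not the OODT Catalog and Archive Service API. The classes `Processor`, `WorkflowManager`, `ResourceManager` and the product names (`raw`, `level0`, `level1`, `qa_report`) are assumptions made for illustration, and only the stage abbreviations PP, LP, and SA come from the slide's figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    name: str
    requires: frozenset[str]  # rule: product types that must exist before it runs
    produces: str             # product type made available once it has run


class WorkflowManager:
    """Decides WHAT may run next; knows nothing about where or how it runs."""

    def __init__(self, processors: list[Processor]) -> None:
        self.processors = processors

    def ready(self, available: set[str]) -> list[Processor]:
        return [
            p for p in self.processors
            if p.requires <= available and p.produces not in available
        ]


class ResourceManager:
    """Decides HOW a processor runs; this stand-in only records the call."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def execute(self, processor: Processor) -> str:
        self.log.append(processor.name)
        return processor.produces


def run_pipeline(workflow: WorkflowManager, resources: ResourceManager,
                 initial: set[str]) -> set[str]:
    """Apply the rules repeatedly until no processor is runnable."""
    available = set(initial)
    while True:
        runnable = workflow.ready(available)
        if not runnable:
            return available
        for processor in runnable:
            available.add(resources.execute(processor))


# Declared out of order on purpose: the rules, not the list order, drive execution.
processors = [
    Processor("SA", frozenset({"level1"}), "qa_report"),
    Processor("PP", frozenset({"raw"}), "level0"),
    Processor("LP", frozenset({"level0"}), "level1"),
]
resources = ResourceManager()
products = run_pipeline(WorkflowManager(processors), resources, {"raw"})
```

Because `WorkflowManager` never calls a processor itself, the `ResourceManager` stand-in could be replaced by something that dispatches to a cluster without touching the rules.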

Page 14: Hughes RDAP11 Data Publication Repositories


Conceptual Capabilities

• OODT Apache Suite (oodt.apache.org)
  – File Management
  – Workflow Management (for jobs/processing)
  – Data Transformation
  – Data Access
  – Metadata Query

• Registry (future addition to OODT)
  – Metadata Management based on ebXML registry specification
  – Used to manage different types of “extrinsic” objects (metadata descriptions of data, services, etc.)
    • “targets”, “science data products”, “documents”, “services”, etc.
  – Product identification, versioning, tracking, and subscription/notification
  – Indexing, Classification, and Associations

Page 15: Hughes RDAP11 Data Publication Repositories


Information Architecture

• OODT + Registry contains two different types of “models”
  – Core Infrastructure model
  – Discipline model

• Core infrastructure model is intrinsic (integrated with the software)
  – It is built in and used by the software; this never changes and you don’t need to worry about it
  – Services are part of the core infrastructure (“intrinsic”) but all other metadata objects are “extrinsic”

• Discipline model is extrinsic (defined outside the software)
  – It is dynamically configured
  – For example, the registry can be configured to use whatever “extrinsic” metadata objects are important to manage
  – This allows for the registry to be used for tracking artifacts, managing services, etc.
  – This is what projects need to define
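
A minimal sketch of the intrinsic/extrinsic split, assuming a toy in-memory registry: the `Registry` class stands in for the built-in core model, while the set of object types passed to it plays the role of the project-defined discipline model. None of these names come from OODT or the ebXML specification. They are hypothetical, and the version counter is just one simple way to illustrate product versioning.

```python
from dataclasses import dataclass, field


@dataclass
class ExtrinsicObject:
    object_type: str   # e.g. "target" or "document"; defined by the project
    logical_id: str
    version: int
    metadata: dict[str, str] = field(default_factory=dict)


class Registry:
    """Core ("intrinsic") behavior: fixed code that is unaware of any discipline."""

    def __init__(self, extrinsic_types: set[str]) -> None:
        # The discipline model arrives as configuration, not as code.
        self._types = set(extrinsic_types)
        self._objects: dict[str, list[ExtrinsicObject]] = {}

    def register(self, object_type: str, logical_id: str,
                 metadata: dict[str, str]) -> ExtrinsicObject:
        if object_type not in self._types:
            raise ValueError(f"unconfigured extrinsic type: {object_type!r}")
        versions = self._objects.setdefault(logical_id, [])
        obj = ExtrinsicObject(object_type, logical_id, len(versions) + 1,
                              dict(metadata))
        versions.append(obj)
        return obj

    def latest(self, logical_id: str) -> ExtrinsicObject:
        return self._objects[logical_id][-1]


# A planetary-style configuration; another project would pass different types.
registry = Registry({"target", "science data product", "document"})
first = registry.register("target", "target:saturn", {"name": "Saturn"})
second = registry.register("target", "target:saturn",
                           {"name": "Saturn", "type": "planet"})
```

Re-registering the same logical identifier creates a new version rather than overwriting the old one, and an object type that was never configured is rejected.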

Page 16: Hughes RDAP11 Data Publication Repositories


Observational Product – Concept Map

Page 17: Hughes RDAP11 Data Publication Repositories


PDS4 High Level Concept Map

Page 18: Hughes RDAP11 Data Publication Repositories


Defining Extrinsic Objects and their Context (Ontology)

Page 19: Hughes RDAP11 Data Publication Repositories


External Data Standards

• Open Archival Information System (OAIS) Reference Model - Defines the “Information Object”, a key component of the model.

• ISO/IEC 11179-3: Registry Metamodel and Basic Attributes - Provides the schema for the data dictionary. Defines the concepts of registration authority and steward for governance.

• Object-Oriented Data Modeling – Used as a standard modeling methodology.

• XML/XML Schema – Provides the label syntax and validation mechanism.

• OASIS/ebXML Registry Information Model - Provides attributes for object registration within a federated registry/repository.

• ISO 15836:2009 The Dublin Core Metadata Element Set – Provides standard web resource identification attributes.

• Semantics - RDF, RDFS, OWL - Provides W3C standards for knowledge representation.
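
As a rough illustration of how two of these standards combine, the sketch below builds a small XML label carrying Dublin Core elements with Python's standard library. The root element name `product_label`, the `REQUIRED` tuple, and the sample values are assumptions made here. The presence check is only a stand-in for real XML Schema validation, which the standard library does not provide.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

# Hypothetical project rule: elements every label must carry.
REQUIRED = ("title", "creator", "identifier")


def build_label(fields: dict[str, str]) -> ET.Element:
    """Build a label whose children are Dublin Core elements."""
    root = ET.Element("product_label")  # hypothetical root element name
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


def missing_elements(label: ET.Element) -> list[str]:
    """Simple presence check; a stand-in for XML Schema validation."""
    return [
        name for name in REQUIRED
        if label.find(f"{{{DC_NS}}}{name}") is None
    ]


label = build_label({
    "title": "Example observational product",
    "creator": "Example instrument team",
    "identifier": "urn:example:product:0001",
})
xml_text = ET.tostring(label, encoding="unicode")
```

In a production system the presence check would normally be replaced by validation against an XML Schema, using a library that supports it.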

Page 20: Hughes RDAP11 Data Publication Repositories


A perspective to leave you with…

• Agency science federations, based on an open source/collaborative model, are very attractive for the following reasons:

– Science benefits: can drive a growing enterprise of shared science services and software infrastructure support

– Technology benefits: can drive innovation through its peer review and collaboration process

– Infusion benefits: creates a defined process for contributing new ideas and capabilities

– Architecture benefits: helps you build towards a common architectural vision and drive community standards

– Cost benefits: can enable better leveraging and reuse of skills and capabilities across institutions

– Tech transfer benefits: may benefit other science (and non-science) disciplines

Page 21: Hughes RDAP11 Data Publication Repositories

Questions?

Thank You!!!

Steve Hughes

[email protected]

Chris Mattmann

[email protected]

Note: we have several papers and book chapters on data-intensive systems that we’d be happy to share! A few key ones…

D. Crichton, C. Mattmann, J. S. Hughes, S. Kelly, and A. Hart. “A Multi-Disciplinary, Model-Driven, Distributed Science Data System Architecture.” Guide to e-Science: Next Generation Scientific Research and Discovery. X. Yang, L. L. Wang, W. Jie, eds. Springer Verlag, 2010, To appear.

D. Crichton, S. Kelly, C. Mattmann, Q. Xiao, J. S. Hughes, J. Oh, M. Thornquist, D. Johnsey, S. Srivastava, L. Esserman, and B. Bigbee. “A Distributed Information Services Architecture to Support Biomarker Discovery in Early Detection of Cancer.” Accepted for publication at the 2nd IEEE International Conference on e-Science and Grid Computing, Amsterdam, the Netherlands, December 4th-6th, 2006.

C. Mattmann, D. Crichton, N. Medvidovic and S. Hughes. “A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications.” In Proceedings of the 28th International Conference on Software Engineering (ICSE06), pp. 721-730, Shanghai, China, May 20th-28th, 2006.

Dan Crichton

[email protected]

Sean Kelly

[email protected]

Page 22: Hughes RDAP11 Data Publication Repositories


Backup
