INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE Data Management
Peter Kunszt
Diligent – EGEE JRA1 Meeting, 2004 December 16
Contents
• Component Overview
• gLite Catalogs
  – Overview
  – Concepts
  – Implementations
  – Distribution
• gLite Transfer Management
  – Scheduling model
  – Implementation
• Deployment models
• Distribution mechanisms
• Discussion
Service Oriented Architecture
Guiding Principles
• Interoperability
• Portability
• Modularity
• Scalability
Web Services
Building on existing components in a lightweight manner: AliEn, LCG, Condor, Globus, SRM, ...
Data Management Tasks
• File Management
  – Storage
  – Access
  – Placement
  – Cataloguing
  – Security
• Metadata Management
  – Secure database access
  – Schema management
  – File-based metadata
  – Generic metadata
Product Overview
• File Storage
  – Storage Elements with SRM (Storage Resource Manager) interface
  – POSIX I/O interface through glite-io
  – Supports transfer protocols (bbftp, https, ftp, gsiftp, rfio, dcap, …)
• Catalogs
  – File and Replica Catalog
  – File Authorization Service
  – Metadata Catalog
  – Distribution of catalogs, conflict resolution (messaging)
• Transfer
  – Top-level Data Scheduler as global entry point (there may be many)
  – Site File Placement Service managing transfers and catalog interactions
  – Site File Transfer Service managing incoming transfers (the network resource)
File Movement and Management
• Data scheduling and high-level optimization
• Job-like data transfers (queuing, ordering, etc)
• Possibility to use reliable managed file transfer
• Site self-consistency (locality of reference)
• SRM-based managed storage (permanent and volatile)
File Movement and Management
Internals
Catalog Contents
[Figure: catalog contents. Entities shown:]
• Global Unique Identifier (GUID): unique, system-defined, immutable UUID
• Logical File Name (LFN): unique, user-defined, mutable
• SymLinks
• Storage URLs
• Metadata
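The relationships on this slide can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the names `CatalogEntry` and `ToyCatalog` and all method names are hypothetical and are not the gLite catalog interface, the GUID is assumed to be generated with `uuid4`, and symlinks are assumed to point at an LFN.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One logical file: immutable GUID, mutable LFN, replicas, metadata."""
    lfn: str                                    # user-defined, unique, mutable
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))  # system-defined
    surls: list = field(default_factory=list)   # Storage URLs of the replicas
    metadata: dict = field(default_factory=dict)


class ToyCatalog:
    """Hypothetical in-memory catalog; not the gLite interface."""

    def __init__(self):
        self._by_guid = {}    # GUID -> CatalogEntry
        self._lfn_index = {}  # LFN -> GUID
        self._symlinks = {}   # symlink path -> LFN (assumed target type)

    def _check_free(self, name):
        if name in self._lfn_index or name in self._symlinks:
            raise ValueError(f"name already in use: {name}")

    def create(self, lfn):
        self._check_free(lfn)
        entry = CatalogEntry(lfn=lfn)
        self._by_guid[entry.guid] = entry
        self._lfn_index[lfn] = entry.guid
        return entry.guid

    def add_replica(self, guid, surl):
        self._by_guid[guid].surls.append(surl)

    def symlink(self, link, target_lfn):
        if target_lfn not in self._lfn_index:
            raise KeyError(target_lfn)
        self._check_free(link)
        self._symlinks[link] = target_lfn

    def rename(self, old_lfn, new_lfn):
        """Change the mutable LFN; the GUID and the replicas are untouched."""
        self._check_free(new_lfn)
        guid = self._lfn_index.pop(old_lfn)
        self._lfn_index[new_lfn] = guid
        self._by_guid[guid].lfn = new_lfn
        # Re-point symlinks that targeted the old name.
        for link in list(self._symlinks):
            if self._symlinks[link] == old_lfn:
                self._symlinks[link] = new_lfn

    def resolve(self, name):
        """Return the replica SURLs for an LFN or for a symlink to one."""
        lfn = self._symlinks.get(name, name)
        return list(self._by_guid[self._lfn_index[lfn]].surls)
```

Because replicas hang off the GUID rather than the LFN, renaming a file in this sketch never touches the storage side; only the name index and the symlinks change.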
Concepts
• Directories
• Symlinks
• Authorization: ACL and base (unix) permissions
• File metadata (size, ctime, mtime, checksum, status, type)
• File-based metadata (key-value pairs on files); the schema is associated per directory
• Extensible metadata including schema manipulation
• Maybe virtual directories (cached metadata queries) in the future
Interface Design
[Figure: interface hierarchy.]
• Layers shown: Base Interfaces, Service Interfaces, Feature Interfaces, End-user Interface
• Interfaces shown: ServiceBase, FASBase, ReplicaCatalog, FileCatalog, MetadataBase, MetadataSchema, FiReMan, FAS, MetadataCatalog, SEIndex
Metadata Capabilities
• Metadata directly in the File Catalog
  – Like POSIX file metadata: key-value pairs are stored
  – The Metadata Schema (description of key-value pairs) may be different for each directory, but all files in the same directory share the same keys
  – Query and search capabilities are limited to a single directory or a single schema: the hierarchy has to restrict the query (we don’t allow a global find-like operation on metadata)
• Unconstrained Metadata
  – Any schema possible
  – Schema manipulation interface available
  – Generic query interface (just pass in a query string)
• Application-specific Metadata
  – On top of either of these two gLite specifications, applications can build their own metadata interface
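The first option (metadata directly in the File Catalog) can be illustrated with a minimal sketch. The class `DirectoryMetadata` and its methods are hypothetical names chosen for this example, not the gLite metadata interface; the sketch only shows the two constraints stated above: one schema per directory, and queries that are always scoped to a single directory.

```python
import posixpath


class DirectoryMetadata:
    """Hypothetical model of file-catalog metadata: one schema per directory."""

    def __init__(self):
        self._schemas = {}  # directory -> set of allowed keys
        self._values = {}   # file path -> {key: value}

    def set_schema(self, directory, keys):
        self._schemas[directory] = set(keys)

    def set_attributes(self, path, **attrs):
        # All files in a directory share that directory's keys.
        directory = posixpath.dirname(path)
        schema = self._schemas.get(directory)
        if schema is None:
            raise KeyError(f"no schema defined for {directory}")
        unknown = set(attrs) - schema
        if unknown:
            raise ValueError(
                f"keys not in schema of {directory}: {sorted(unknown)}"
            )
        self._values.setdefault(path, {}).update(attrs)

    def query(self, directory, **conditions):
        """Match files in ONE directory; there is deliberately no global find."""
        return sorted(
            path
            for path, attrs in self._values.items()
            if posixpath.dirname(path) == directory
            and all(attrs.get(k) == v for k, v in conditions.items())
        )


# Example usage with made-up paths and keys.
md = DirectoryMetadata()
md.set_schema("/grid/run1", ["detector", "quality"])
md.set_attributes("/grid/run1/a.dat", detector="ecal", quality="good")
good_files = md.query("/grid/run1", quality="good")
```

The `query` method takes the directory as a mandatory argument, which is one simple way to make the hierarchy restrict every query.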
gLite Catalog Implementations
• Fireman Interface
  – Oracle 9i implementation
  – MySQL implementation
• MetadataCatalog Interface
  – MySQL implementation
  – Oracle 9i implementation
• MetadataSchema Interface
  – MySQL implementation
  – Oracle 9i implementation
• Apply interfaces to existing implementations
  – Will have a Fireman interface also over the AliEn FC
  – Fireman interface over the LCG FC
  – MetadataCatalog and MetadataSchema over existing application catalogs
  – …
Legend: DONE / In progress or planning
Catalog Deployment Models
• Single central catalog (AliEn, LCG-2 model)
  – All operations go there
• Local catalogs with a central component
  – Update operation only on local catalogs
  – Update operation on both local and central catalogs
• Local catalogs, no central component – only indices for certain queries
Distribution Mechanism 1
• Data Scheduler (global and local schedulers)
  – Global scheduler (VO-specific) takes requests like
    • Copy set of files from A to B
    • Make set of files available at C
    • Upload files from GSIFTP server to D
    • Delete files
    • Maybe also metadata operations
  – Local scheduler fetches tasks from known global schedulers
    • Coupled tightly to a local transfer service
    • Manage transfer where the local site is a target
    • Assure atomicity of transfer and catalog operations
• Transfer Service
  – Queue data transfers to/from a given Storage Element (SRM)
  – Receives jobs from local scheduler
  – Manages transfers through a set of states
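The slide does not name the states a transfer passes through, so the following is only a sketch of the general idea: a queue of transfer jobs driven through an explicit state machine. The state names (`SUBMITTED`, `ACTIVE`, `DONE`, `FAILED`), the classes `TransferJob` and `TransferQueue`, and the `copy_function` callback are all assumptions made for illustration.

```python
from collections import deque
from enum import Enum


class State(Enum):
    # Hypothetical state names; the slide does not list the real ones.
    SUBMITTED = "submitted"
    ACTIVE = "active"
    DONE = "done"
    FAILED = "failed"


# Allowed transitions of the assumed state machine.
TRANSITIONS = {
    State.SUBMITTED: {State.ACTIVE},
    State.ACTIVE: {State.DONE, State.FAILED},
    State.DONE: set(),
    State.FAILED: set(),
}


class TransferJob:
    def __init__(self, source, destination):
        self.source = source
        self.destination = destination
        self.state = State.SUBMITTED

    def move_to(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state


class TransferQueue:
    """FIFO queue of transfers for a single storage element."""

    def __init__(self, copy_function):
        self._pending = deque()
        self._copy = copy_function  # callable(source, destination), may raise

    def submit(self, source, destination):
        job = TransferJob(source, destination)
        self._pending.append(job)
        return job

    def run_next(self):
        """Run the oldest pending job; return it, or None if the queue is empty."""
        if not self._pending:
            return None
        job = self._pending.popleft()
        job.move_to(State.ACTIVE)
        try:
            self._copy(job.source, job.destination)
        except Exception:
            job.move_to(State.FAILED)
        else:
            job.move_to(State.DONE)
        return job
```

Keeping the allowed transitions in one table makes it easy to reject illegal state changes, and a real service would presumably add retry and cancellation states to the same table.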
Distribution Mechanism 2
• Certainly possible to just rely on DB replication
• Middleware distribution of updates between catalogs
  – Using a messaging system (JMS using JORAM)
  – Publish updates to message queue locally
  – Subscribe to updates at central catalogs / index nodes
  – Asynchronous messaging queues take care of update delivery
  – Scales well to the number of sites we deal with
  – However, error messages have to be queued for retrieval as well
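The publish/subscribe flow above can be sketched with in-process queues. This is not JMS or JORAM code: Python's standard `queue.Queue` stands in for the message queues, and `LocalCatalog`, `CentralIndex`, `deliver`, and the message layout are hypothetical names invented for this example.

```python
import queue


class LocalCatalog:
    """Site catalog that publishes every update to a local outgoing queue."""

    def __init__(self, site):
        self.site = site
        self.entries = {}            # LFN -> SURL at this site
        self.outbox = queue.Queue()  # stand-in for the local message queue
        self.errors = queue.Queue()  # error messages queued for later retrieval

    def register(self, lfn, surl):
        self.entries[lfn] = surl
        self.outbox.put(
            {"site": self.site, "op": "add", "lfn": lfn, "surl": surl}
        )


class CentralIndex:
    """Central component that subscribes to updates from local catalogs."""

    def __init__(self):
        self.locations = {}  # LFN -> set of sites holding a replica

    def apply(self, message):
        if message["op"] != "add":
            raise ValueError(f"unsupported operation: {message['op']}")
        self.locations.setdefault(message["lfn"], set()).add(message["site"])


def deliver(catalog, index):
    """Drain one catalog's outbox into the index; return the number applied."""
    delivered = 0
    while True:
        try:
            message = catalog.outbox.get_nowait()
        except queue.Empty:
            return delivered
        try:
            index.apply(message)
            delivered += 1
        except ValueError as exc:
            # Failures travel back asynchronously as queued error messages.
            catalog.errors.put(str(exc))
```

The local `register` call returns as soon as the update is queued, so the site keeps working even if the central index is unreachable; the price is that failures also arrive late, which is why the sketch queues error messages for the site to collect.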
To be understood
• What to distribute and how
  – All of the data? (Replication)
  – Just parts? (Indexing)
  – Read-write mechanisms and updates between many copies (Policies)
• Metadata usage
  – Schema manipulation capabilities – what is really needed
  – Metadata services by experiments may interface with gLite or implement the gLite interfaces themselves
    • Is a set of canned queries good enough? If yes, the user does not need to have a generic query interface.
    • Does all of the metadata need to be local? Or will some metadata have to be fetched from remote sites?
    • What kinds of distributed queries are necessary at all?
    • What kind of metadata is for local/laptop usage?
    • What kinds of update semantics are needed, if at all? (Single instance, single master, multi master)
Summary
• gLite Data Management provides a complete set of file management middleware including data and catalog distribution
• Many extensible modules based on simple interfaces. Capabilities may easily be extended if needed.
• Actual usage patterns need to be understood in order to set up an efficient deployment scenario.
• Many difficult open questions remain, which have to be answered individually for each Grid VO.
We are looking forward to working with the community to address these issues.