
Session 24 - Distribute Data and Metadata Management with gLite

Date post: 11-May-2015
Upload: issgc-summer-school
INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org Antonio Calanducci, National Institute of Nuclear Physics (INFN), Catania. EGEE NA3 Training & Induction, ISSGC 2009, Sophia-Antipolis, Polytech’Nice-Sophia, 10/07/2009. Distribute Data and Metadata Management with gLite
Transcript
Page 1: Session 24 - Distribute Data and Metadata Management with gLite

INFSO-RI-031688

Enabling Grids for E-sciencE

www.eu-egee.org

Antonio Calanducci

National Institute of Nuclear Physics (INFN), Catania

EGEE NA3 Training & Induction, ISSGC 2009

Sophia-Antipolis, Polytech’Nice-Sophia, 10/07/2009

Distribute Data and Metadata Management with gLite

Page 2: Session 24 - Distribute Data and Metadata Management with gLite

ISSGC 09, Sophia-Antipolis, 10-07-09

Enabling Grids for E-sciencE

INFSO-RI-508833 2

Outline

• Grid Data Management Challenge

• Storage Elements and SRM

• File Catalogs and Data Management tools

• Metadata Service with use cases

Page 3: Session 24 - Distribute Data and Metadata Management with gLite

The Grid DM Challenge

• Heterogeneity
  – Data are stored on different storage systems using different access technologies
  – Need a common interface to storage resources: Storage Resource Manager (SRM)
• Distribution
  – Data are stored in different locations; in most cases there is no shared file system or common namespace
  – Data need to be moved between different locations
  – Need to keep track of where data is stored: File and Replica Catalogs
  – Need scheduled, reliable file transfer: File Transfer Service
• Data about data
  – Data stored can be huge: find files according to their contents
  – Need a way to describe files’ content and query their descriptions: Metadata service

Page 4: Session 24 - Distribute Data and Metadata Management with gLite

SRM in an example

She is running a job which needs:
• Data for physics event reconstruction
• Simulated data
• Some data analysis files
She will write files remotely too.

Where the files are:
• They are at CERN, in dCache
• They are at Fermilab, in a disk array
• They are at Nikhef, in a classic SE

Page 6: Session 24 - Distribute Data and Metadata Management with gLite

SRM in an example

• dCache: own system, own protocols and parameters
• Castor: no connection with dCache or DPM
• gLite DPM: independent system from dCache or Castor

You as a user need to know all the systems!!!

SRM: I talk to them on your behalf. I will even allocate space for your files, and I will use transfer protocols to send your files there.

Page 7: Session 24 - Distribute Data and Metadata Management with gLite

Storage Resource Managers

• Data are stored on disk pool servers or Mass Storage Systems
• Storage resource management needs to take into account
  – Transparent access to files (migration to/from disk pool)
  – File pinning
  – Space reservation
  – File status notification
  – Lifetime management
• The SRM (Storage Resource Manager) takes care of all these details
  – The SRM is a single interface that takes care of local storage interaction and provides a Grid interface to the outside world

Page 8: Session 24 - Distribute Data and Metadata Management with gLite

Data management with gLite

• Assumptions:
  – Users and programs produce and require data
  – The lowest granularity of the data is the file level (we deal with files rather than data objects or tables): Data = files
• Files:
  – Mostly write once, read many
  – Located in Storage Elements (SEs)
  – Several replicas of one file in different sites
  – Accessible by Grid users and applications from “anywhere”
  – Locatable by the WMS (data requirements in JDL)
• Also…
  – WMS can send (small amounts of) data to/from jobs: Input and Output Sandbox
  – Files may be copied from/to local filesystems (WNs, UIs) to the Grid (SEs)

Page 9: Session 24 - Distribute Data and Metadata Management with gLite

gLite Grid Storage Requirements

• The Storage Element is the service which allows a user or an application to store data for future retrieval
• Manage local storage (disks) and interface to Mass Storage Systems (tapes) like
  – HPSS, CASTOR, DiskeXtender (UNITREE), …
• Be able to manage different storage systems uniformly and transparently for the user (providing an SRM interface)
• Support basic file transfer protocols
  – GridFTP mandatory
  – Others if available (https, ftp, etc.)
• Support a native I/O (remote file) access protocol
  – POSIX(-like) I/O client library for direct access of data (GFAL)

Page 10: Session 24 - Distribute Data and Metadata Management with gLite

gLite Storage Element

[Figure: gLite Storage Element diagram; the only surviving label is “(http/https)”]

Page 11: Session 24 - Distribute Data and Metadata Management with gLite

gLite SE types history

• gLite data access protocols:
  – File Transfer: GSIFTP (GridFTP)/HTTP(S)
  – File I/O (Remote File access): gsidcap, rfio (insecure RFIO), gsirfio (secured RFIO)
• Classic SE:
  – GridFTP server
  – Insecure RFIO daemon (rfiod): file access limited to the LAN
  – Single disk or disk array
  – No quota management
  – Does not support the SRM interface

Page 12: Session 24 - Distribute Data and Metadata Management with gLite

gLite SE types history (II)

• Mass Storage Systems (Castor)
  – Files migrated between front-end disk and back-end tape storage hierarchies
  – GridFTP server
  – Insecure RFIO (Castor)
  – Provide an SRM interface with all the benefits
• Disk pool managers (dCache and gLite DPM)
  – Manage distributed storage servers in a centralized way
  – Physical disks or arrays are combined into a common (virtual) file system
  – Disks can be dynamically added to the pool
  – GridFTP server
  – Secure remote access protocols (gsidcap for dCache, gsirfio for DPM)
  – SRM interface

Page 13: Session 24 - Distribute Data and Metadata Management with gLite

EGEE Tutorial - Barcelona, 14th - 18th April 2008

EGEE-II INFSO-RI-031688

DPM Overview

• The Disk Pool Manager (DPM) is a lightweight solution for disk storage management, which offers the SRM interfaces (2.2 released in DPM version 1.6.3).
• Each DPM-type Storage Element (SE) is composed of a head node and one or more disk servers (a disk server can be on the same machine as the head node)
• The DPM handles the storage on Disk Servers.
  – It handles pools: a pool is a group of file systems, located on one or more disk servers. A DPM disk server can have multiple file systems in the pool.
• Supported protocols:
  – File Transfers: GridFTP, HTTP/HTTPS
  – File I/O: GSIRFIO

Page 14: Session 24 - Distribute Data and Metadata Management with gLite

DPM Architecture

• DPM Name Server
  – Namespace
  – Authorization
  – Physical files location
• Disk Servers
  – Physical files
• Direct data transfer from/to disk server (no bottleneck)

[Figure: DPM architecture. Labels: CLI, C API, SRM-enabled client, etc.; DPM head node; namespace /dpm/domain/home/vo/file (uid, gid1, …); DPM disk servers; data transfer]

Page 21: Session 24 - Distribute Data and Metadata Management with gLite

DPM architecture (details)

[Figure: DPM architecture in detail. Labels: CLI, C API, SRM-enabled client, etc.; DPM head node with SRMv1, SRMv2, SRMv2.2, DPM daemon and DPNS daemon; DPM database; DPNS database; namespace /dpm/domain/home/vo/file; DPM disk servers with Secure RFIO and GridFTP; data transfer]

Page 22: Session 24 - Distribute Data and Metadata Management with gLite

DPM strengths

• Easy to install/configure
  - Few configuration files
• Manageable storage
  - Logical Namespace
  - Easy to add/remove file systems
• Low maintenance effort
• Supports as many disk servers as needed
• Low memory footprint
• Low CPU utilization

Page 23: Session 24 - Distribute Data and Metadata Management with gLite

Files Naming conventions

• Logical File Name (LFN)
  – An alias created by a user to refer to some item of data, e.g. “lfn:/grid/gilda/20030203/run2/track1”
• Globally Unique Identifier (GUID)
  – A hard-to-remember unique identifier for an item of data, e.g. “guid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6”
• Site URL (SURL) (or Physical File Name (PFN) or Site FN)
  – The location of an actual piece of data on a storage system, e.g. “srm://grid009.ct.infn.it/dpm/ct.infn.it/gilda/output10_1” (SRM)
• Transport URL (TURL)
  – Temporary locator of a replica + access protocol: understood by a SE, e.g. “rfio://lxshare0209.cern.ch//data/alice/ntuples.dat”
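The four kinds of name can be told apart by their prefix. The following is a minimal Python sketch, assuming only the prefixes that appear in the examples on this slide and the access protocols named elsewhere in these slides. The function name `classify_grid_name` and the `TURL_SCHEMES` set are illustrative and are not part of any gLite API.

```python
# Access/transfer protocols mentioned in the slides; treated here as TURL schemes.
# This set is an assumption made for illustration, not an exhaustive list.
TURL_SCHEMES = {"gsiftp", "rfio", "gsirfio", "dcap", "gsidcap"}


def classify_grid_name(name: str) -> str:
    """Return 'LFN', 'GUID', 'SURL', 'TURL' or 'unknown' based on the prefix."""
    scheme, sep, _rest = name.partition(":")
    if not sep:
        return "unknown"  # no "scheme:" prefix at all
    scheme = scheme.lower()
    if scheme == "lfn":
        return "LFN"
    if scheme == "guid":
        return "GUID"
    if scheme == "srm":
        return "SURL"
    if scheme in TURL_SCHEMES:
        return "TURL"
    return "unknown"
```

For instance, `classify_grid_name("srm://grid009.ct.infn.it/dpm/ct.infn.it/gilda/output10_1")` falls into the SURL branch because of its `srm` prefix.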

Page 24: Session 24 - Distribute Data and Metadata Management with gLite

What is a file catalog

[Figure: a gLite UI, the File Catalog and three Storage Elements (SE)]

Page 25: Session 24 - Distribute Data and Metadata Management with gLite

Upload and file convention example

[issgc59@issgc-ui ~]$ lcg-cr -v -d aliserv6.ct.infn.it -l lfn:/grid/gilda/tutorials/issgc09/message0.sh file:/home/issgc59/message0.sh

Using grid catalog type: lfc

Using grid catalog : lfc-gilda.ct.infn.it

SE type: SRMv1

Destination SURL : srm://aliserv6.ct.infn.it/dpm/ct.infn.it/home/gilda/generated/2009-07-10/file273edfaa-e0ed-448c-b713-7996c2d09f88

Source SRM Request Token: 401296

Source URL: file:/home/issgc59/message0.sh

File size: 20

VO name: gilda

Destination specified: aliserv6.ct.infn.it

Destination URL for copy: gsiftp://aliserv6.ct.infn.it/aliserv6.ct.infn.it:/data01/gilda/2009-07-10/file273edfaa-e0ed-448c-b713-7996c2d09f88.401296.0

# streams: 1

20 bytes 0.01 KB/sec avg 0.01 KB/sec inst

Transfer took 3070 ms

Using LFN: lfn:/grid/gilda/tutorials/issgc09/message0.sh

Using GUID: guid:2eed7180-2505-46fb-bef2-19ef77f1b2fd

Registering LFN: /grid/gilda/tutorials/issgc09/message0.sh (2eed7180-2505-46fb-bef2-19ef77f1b2fd)

Registering SURL: srm://aliserv6.ct.infn.it/dpm/ct.infn.it/home/gilda/generated/2009-07-10/file273edfaa-e0ed-448c-b713-7996c2d09f88 (2eed7180-2505-46fb-bef2-19ef77f1b2fd)

guid:2eed7180-2505-46fb-bef2-19ef77f1b2fd

Page 26: Session 24 - Distribute Data and Metadata Management with gLite

The LFC (LCG File Catalog)

• It keeps track of the location of copies (replicas) of Grid files
• LFN acts as main key in the database. It has:
  – Symbolic links to it (additional LFNs)
  – Unique Identifier (GUID)
  – System metadata
  – Information on replicas
  – One field of user metadata
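The relationship between LFNs, symbolic links, the GUID and the replicas can be pictured with a toy in-memory model. This is only an illustrative Python sketch: the class `ToyCatalog`, its method names and the `example.org` hostnames are hypothetical, and it is not the LFC implementation or its API.

```python
import uuid


class ToyCatalog:
    """In-memory illustration of LFN -> GUID -> replicas (SURLs). Not the LFC API."""

    def __init__(self):
        self._guid_by_lfn = {}  # LFN or symbolic link -> GUID
        self._replicas = {}     # GUID -> list of SURLs

    def register(self, lfn, surl):
        """Register a new file under `lfn` with its first replica; return the GUID."""
        if lfn in self._guid_by_lfn:
            raise ValueError(f"LFN already registered: {lfn}")
        guid = str(uuid.uuid4())
        self._guid_by_lfn[lfn] = guid
        self._replicas[guid] = [surl]
        return guid

    def add_replica(self, lfn, surl):
        """Record one more physical copy of the file known as `lfn`."""
        self._replicas[self._guid_by_lfn[lfn]].append(surl)

    def symlink(self, lfn, link_name):
        """Create an additional LFN that refers to the same GUID."""
        self._guid_by_lfn[link_name] = self._guid_by_lfn[lfn]

    def list_replicas(self, lfn):
        """Return a copy of the replica list for an LFN or a symbolic link."""
        return list(self._replicas[self._guid_by_lfn[lfn]])
```

The point of the sketch is that every name resolves to one GUID, and the GUID, not the name, owns the list of replicas.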

Page 27: Session 24 - Distribute Data and Metadata Management with gLite

LFC Features

– Cursors for large queries
– Timeouts and retries from the client
– User exposed transactional API (+ auto rollback on failure)
– Hierarchical namespace and namespace operations (for LFNs)
– Integrated GSI Authentication + Authorization
– Access Control Lists (Unix Permissions and POSIX ACLs)
– Checksums
– Integration with VOMS (VirtualID and VirtualGID)

Page 28: Session 24 - Distribute Data and Metadata Management with gLite

lfc-ls

Listing the entries of a LFC directory

lfc-ls [-cdiLlRTu] [--class] [--comment] [--deleted] [--display_side] [--ds] path…

where path specifies the LFN pathname (mandatory)

– Remember that LFC has a directory tree structure
– /grid/<VO_name>/<you create it>
  (/grid/<VO_name> is the LFC namespace; the rest is defined by the user)
– All members of a VO have read-write permissions under their directory
– You can set LFC_HOME to use relative paths
> lfc-ls /grid/gilda/tony
> export LFC_HOME=/grid/gilda
> lfc-ls -l tony
> lfc-ls -l -R /grid

-l : long listing
-R : list the contents of directories recursively: Don’t use it!
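The effect of LFC_HOME on relative paths can be sketched in a few lines of Python. This is an illustration of the idea only, under the assumption that a relative path is simply joined to LFC_HOME; the helper `resolve_lfn` is hypothetical and does not call the LFC client.

```python
import os
import posixpath


def resolve_lfn(path, env=None):
    """Resolve an LFC path: absolute paths are kept, relative ones are joined to LFC_HOME.

    `env` defaults to the process environment; pass a dict to test without
    touching real environment variables.
    """
    env = os.environ if env is None else env
    if path.startswith("/"):
        return posixpath.normpath(path)
    home = env.get("LFC_HOME")
    if not home:
        raise ValueError("relative path given but LFC_HOME is not set")
    return posixpath.normpath(posixpath.join(home, path))
```

With `LFC_HOME=/grid/gilda`, the relative path `tony` resolves to `/grid/gilda/tony`, which matches the first two `lfc-ls` examples on the slide.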

Page 29: Session 24 - Distribute Data and Metadata Management with gLite

lfc-mkdir

Creating directories in the LFC

lfc-mkdir [-m mode] [-p] path...

• Where path specifies the LFC pathname
• Remember that while registering a new file (using lcg-cr, for example) the corresponding destination directory must be created in the catalog beforehand.
• Examples:
> lfc-mkdir /grid/gilda/tony/demo

You can just check the directory with:
> lfc-ls -l /grid/gilda/tony
drwxr-xrwx 0 19122 1077 0 Jun 14 11:36 demo

Page 30: Session 24 - Distribute Data and Metadata Management with gLite

lfc-ln

Creating a symbolic link

lfc-ln -s file linkname
lfc-ln -s directory linkname

Create a link to the specified file or directory with linkname

– Examples:
> lfc-ln -s /grid/gilda/tony/demo/test /grid/gilda/tony/aLink

Let’s check the link using lfc-ls with long listing (-l):
> lfc-ls -l
lrwxrwxrwx 1 19122 1077 0 Jun 14 11:58 aLink -> /grid/gilda/tony/demo/test
drwxr-xrwx 1 19122 1077 0 Jun 14 11:39 demo

Here aLink is the symbolic link and /grid/gilda/tony/demo/test is the original file.

Page 31: Session 24 - Distribute Data and Metadata Management with gLite

LFC commands

Summary of the LFC Catalog commands

lfc-chmod        Change access mode of the LFC file/directory
lfc-chown        Change owner and group of the LFC file/directory
lfc-delcomment   Delete the comment associated with the file/directory
lfc-getacl       Get file/directory access control lists
lfc-ln           Make a symbolic link to a file/directory
lfc-ls           List file/directory entries in a directory
lfc-mkdir        Create a directory
lfc-rename       Rename a file/directory
lfc-rm           Remove a file/directory
lfc-setacl       Set file/directory access control lists
lfc-setcomment   Add/replace a comment

Page 32: Session 24 - Distribute Data and Metadata Management with gLite

LFC C API

Low level methods (many POSIX-like):

lfc_access, lfc_aborttrans, lfc_addreplica, lfc_apiinit, lfc_chclass, lfc_chdir, lfc_chmod, lfc_chown, lfc_closedir, lfc_creat, lfc_delcomment, lfc_delete,
lfc_deleteclass, lfc_delreplica, lfc_endtrans, lfc_enterclass, lfc_errmsg, lfc_getacl, lfc_getcomment, lfc_getcwd, lfc_getpath, lfc_lchown, lfc_listclass, lfc_listlinks,
lfc_listreplica, lfc_lstat, lfc_mkdir, lfc_modifyclass, lfc_opendir, lfc_queryclass, lfc_readdir, lfc_readlink, lfc_rename, lfc_rewind, lfc_rmdir, lfc_selectsrvr,
lfc_setacl, lfc_setatime, lfc_setcomment, lfc_seterrbuf, lfc_setfsize, lfc_starttrans, lfc_stat, lfc_symlink, lfc_umask, lfc_undelete, lfc_unlink, lfc_utime, send2lfc

Page 33: Session 24 - Distribute Data and Metadata Management with gLite

GFAL: Grid File Access Library

• Interactions with SE require some components:
  - File catalog services to locate replicas
  - SRM
  - File access mechanism to access files from the SE on the WN
• GFAL does all these tasks for you:
  - Hides all these operations
  - Presents a POSIX interface for the I/O operations
  - Single shared library in threaded and unthreaded versions (libgfal.so, libgfal_pthr.so)
  - Single header file (gfal_api.h)
  - User can create all commands needed for storage management
  - It also offers an interface to SRM
• Supported protocols:
  - file (local or nfs-like access)
  - dcap, gsidcap and kdcap (dCache access)
  - rfio (castor access) and gsirfio (DPM)

Page 34: Session 24 - Distribute Data and Metadata Management with gLite

GFAL: File I/O API (I)

int gfal_access (const char *path, int amode);
int gfal_chmod (const char *path, mode_t mode);
int gfal_close (int fd);
int gfal_creat (const char *filename, mode_t mode);
off_t gfal_lseek (int fd, off_t offset, int whence);
int gfal_open (const char * filename, int flags, mode_t mode);
ssize_t gfal_read (int fd, void *buf, size_t size);
int gfal_rename (const char *old_name, const char *new_name);
ssize_t gfal_setfilchg (int, const void *, size_t);
int gfal_stat (const char *filename, struct stat *statbuf);
int gfal_unlink (const char *filename);
ssize_t gfal_write (int fd, const void *buf, size_t size);

Page 35: Session 24 - Distribute Data and Metadata Management with gLite

GFAL: FC API (II)

int gfal_closedir (DIR *dirp);
int gfal_mkdir (const char *dirname, mode_t mode);
DIR *gfal_opendir (const char *dirname);
struct dirent *gfal_readdir (DIR *dirp);
int gfal_rmdir (const char *dirname);

Page 36: Session 24 - Distribute Data and Metadata Management with gLite

GFAL: Catalog API

int create_alias (const char *guid, const char *lfn, long long size)
int guid_exists (const char *guid)
char *guidforpfn (const char *surl)
char *guidfromlfn (const char *lfn)
char **lfnsforguid (const char *guid)
int register_alias (const char *guid, const char *lfn)
int register_pfn (const char *guid, const char *surl)
int setfilesize (const char *surl, long long size)
char *surlfromguid (const char *guid)
char **surlsfromguid (const char *guid)
int unregister_alias (const char *guid, const char *lfn)
int unregister_pfn (const char *guid, const char *surl)

Page 37: Session 24 - Distribute Data and Metadata Management with gLite

GFAL: SRM API

int deletesurl (const char *surl)

int getfilemd (const char *surl, struct stat64 *statbuf)

int set_xfer_done (const char *surl, int reqid, int fileid, char *token, int oflag)

int set_xfer_running (const char *surl, int reqid, int fileid, char *token)

char *turlfromsurl (const char *surl, char **protocols, int oflag, int *reqid, int *fileid, char **token)

int srm_get (int nbfiles, char **surls, int nbprotocols, char **protocols, int *reqid, char **token, struct srm_filestatus **filestatuses)

int srm_getstatus (int nbfiles, char **surls, int reqid, char *token, struct srm_filestatus **filestatuses)

Page 38: Session 24 - Distribute Data and Metadata Management with gLite

GFAL Java API

• GFAL APIs are available for C/C++ programmers
• We wrote a wrapper around the C APIs using the Java Native Interface, and the Java APIs on top of it
• More information can be found here: https://grid.ct.infn.it/twiki/bin/view/GILDA/APIGFAL

Page 39: Session 24 - Distribute Data and Metadata Management with gLite

lcg_utils DM tools

• High level interface (CL tools and APIs) to
  – Upload/download files to/from the Grid (UI, CE and WN <---> SEs)
  – Replicate data between SEs and locate the best replica available
  – Interact with the file catalog
• Definition: A file is considered to be a Grid File if it is both physically present in a SE and registered in the File Catalog
• lcg-utils ensure the consistency between files in the Storage Elements and entries in the File Catalog
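The Grid File definition is a set intersection: files present on storage and also registered in the catalog. The Python sketch below illustrates that idea with plain sets of SURLs; the function name `check_consistency` and the result keys are hypothetical labels chosen for this example, and the code does not reflect how lcg-utils works internally.

```python
def check_consistency(surls_on_storage, surls_in_catalog):
    """Compare what is physically stored with what the catalog knows about.

    Returns a dict with three sorted lists:
      grid_files          present on storage AND registered (Grid Files)
      unregistered        present on storage but missing from the catalog
      missing_on_storage  registered in the catalog but not found on storage
    """
    on_storage = set(surls_on_storage)
    in_catalog = set(surls_in_catalog)
    return {
        "grid_files": sorted(on_storage & in_catalog),
        "unregistered": sorted(on_storage - in_catalog),
        "missing_on_storage": sorted(in_catalog - on_storage),
    }
```

Only entries in `grid_files` satisfy the definition on this slide; the other two lists are the inconsistencies that a tool which copies and registers in one step is meant to avoid.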

Page 40: Session 24 - Distribute Data and Metadata Management with gLite

lcg-utils commands

Replica Management

lcg-cp    Copies a grid file to a local destination
lcg-cr    Copies a file to a SE and registers the file in the catalog
lcg-del   Delete one file
lcg-rep   Replication between SEs and registration of the replica
lcg-gt    Gets the TURL for a given SURL and transfer protocol
lcg-sd    Sets file status to “Done” for a given SURL in a SRM request

File Catalog Interaction

lcg-aa    Add an alias in LFC for a given GUID
lcg-ra    Remove an alias in LFC for a given GUID
lcg-rf    Registers in LFC a file placed in a SE
lcg-uf    Unregisters in LFC a file placed in a SE
lcg-la    Lists the alias for a given SURL, GUID or LFN
lcg-lg    Get the GUID for a given LFN or SURL
lcg-lr    Lists the replicas for a given GUID, SURL or LFN

Page 41: Session 24 - Distribute Data and Metadata Management with gLite

LFC interfaces

[Figure: LFC interfaces. Labels: LFC server; LFC client C API; CLI (lfc-ls, lfc-mkdir, lfc-setacl, …); Python; GFAL; LCG UTIL; DLI; WMS; three SEs]

Page 42: Session 24 - Distribute Data and Metadata Management with gLite

Data movement introduction

• Grids are naturally distributed systems
• This means that data also needs to be distributed
  – First generation data distribution mainly concentrated on copy protocols in a grid environment:
    gridftp
    http + mod_gridsite
• But copies controlled by clients have problems…

Page 43: Session 24 - Distribute Data and Metadata Management with gLite

Direct Client Controlled Data Movement

• Although transport protocol may be robust, state is held inside client: inconvenient and fragile.
• Client only knows about local state, no sense of global knowledge about data transfers between storage elements.
  – Storage elements overwhelmed with replication requests
  – Multiple replications of the same data can happen simultaneously
  – Site has little control over balance of network resources - DoS

[Figure labels: Client; Source Storage Element; Destination Storage Element; Control Channels; Data Flow Channel]

Page 44: Session 24 - Distribute Data and Metadata Management with gLite

Transfer Service

• Clear need for a service for data transfer
  – Client connects to service to submit request
  – Service maintains state about transfer
  – Client can periodically reconnect to check status or cancel request
  – Service can have knowledge of global state, not just a single request
    Load balancing
    Scheduling

[Figure labels: Client; Transfer Service; Source Storage Element; Destination Storage Element; Control; Data Flow. Client operations (submit new request, monitor progress, cancel request) use SOAP via https]
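The idea that state lives in the service rather than in the client can be illustrated with a small Python toy. Everything here is an assumption made for the example: the class `ToyTransferService`, its method names, the state strings and the `example.org` hosts are hypothetical and do not describe the gLite FTS interface.

```python
import itertools
from urllib.parse import urlparse


class ToyTransferService:
    """Keeps transfer state on the service side, so clients can disconnect and poll later."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._jobs = {}  # request id -> {"source", "destination", "state"}

    def submit(self, source, destination):
        """Record a new request and return its id; no data is moved in this toy."""
        request_id = next(self._ids)
        self._jobs[request_id] = {
            "source": source,
            "destination": destination,
            "state": "Submitted",
        }
        return request_id

    def status(self, request_id):
        """Let a reconnecting client check on a request."""
        return self._jobs[request_id]["state"]

    def cancel(self, request_id):
        """Cancel a request that has not finished; return the resulting state."""
        job = self._jobs[request_id]
        if job["state"] in ("Submitted", "Active"):
            job["state"] = "Canceled"
        return job["state"]

    def pending_for(self, destination_host):
        """Global view: count unfinished requests targeting one storage host."""
        return sum(
            1
            for job in self._jobs.values()
            if job["state"] in ("Submitted", "Active")
            and urlparse(job["destination"]).hostname == destination_host
        )
```

`pending_for` is the part a single client could not compute on its own: it needs knowledge of all requests, which is what makes load balancing and scheduling possible on the service side.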

Page 45: Session 24 - Distribute Data and Metadata Management with gLite

gLite FTS: Channels

• FTS Service has a concept of channels
• A channel is a unidirectional connection between two sites
• Transfer requests between these two sites are assigned to that channel
• Channels usually correspond to a dedicated network pipe (e.g., OPN) associated with production
• But channels can also take wildcards:
  – * to MY_SITE : All incoming
  – MY_SITE to * : All outgoing
  – * to * : Catch all
• Channels control certain transfer properties: transfer concurrency, gridftp streams.
• Channels can be controlled independently: started, stopped, drained.
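Channel assignment with wildcards can be sketched as a lookup that tries the most specific channel first. The slides do not state the precedence between the wildcard forms, so the order used below (dedicated, then all incoming, then all outgoing, then catch-all) is an assumption, and the function `pick_channel` and the site names in the usage note are hypothetical.

```python
def pick_channel(source, destination, channels):
    """Return the (source, destination) channel a transfer would be assigned to.

    `channels` is an iterable of (src, dst) tuples where "*" is a wildcard.
    Precedence is an assumption of this sketch: most specific match first.
    Returns None when no channel matches.
    """
    defined = set(channels)
    candidates = [
        (source, destination),  # dedicated channel between the two sites
        ("*", destination),     # all incoming to the destination
        (source, "*"),          # all outgoing from the source
        ("*", "*"),             # catch-all
    ]
    for candidate in candidates:
        if candidate in defined:
            return candidate
    return None
```

For example, with channels `[("SITE_A", "SITE_B"), ("*", "SITE_B"), ("*", "*")]`, a transfer from SITE_C to SITE_B would be assigned to `("*", "SITE_B")`, while one from SITE_C to SITE_D would fall through to the catch-all.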

Page 46: Session 24 - Distribute Data and Metadata Management with gLite

Why Grid needs Metadata?

• Grids make it possible to store millions of files spread over several storage sites.
• Users and applications need an efficient mechanism
  – to describe files
  – to locate files based on their contents
• This is achieved by
  – associating descriptive attributes to files (metadata is data about data)
  – answering user queries against the associated information

Page 47: Session 24 - Distribute Data and Metadata Management with gLite

Basic Metadata Concept

• Entries – Representation of real world entities which we are attaching metadata to for describing them
• Attribute – key/value pair
  – Type – The type (int, float, string, …)
  – Name/Key – The name of the attribute
  – Value – Value of an entry's attribute
• Schema – A set of attributes
• Collection – A set of entries associated with a schema
• Metadata – List of attributes (including their values) associated with entries
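These concepts map naturally onto a small data structure. The Python sketch below is one possible way to model them for illustration; the `Collection` class and its `add_entry` method are hypothetical and are not the API of any gLite metadata service.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """A set of entries that share one schema."""

    name: str
    schema: dict  # attribute name -> Python type (the schema)
    entries: dict = field(default_factory=dict)  # entry name -> {attribute: value}

    def add_entry(self, entry_name, **attributes):
        """Attach metadata (attribute values) to an entry, checked against the schema."""
        for key, value in attributes.items():
            if key not in self.schema:
                raise KeyError(f"attribute {key!r} is not in the schema")
            if not isinstance(value, self.schema[key]):
                raise TypeError(f"attribute {key!r} expects {self.schema[key].__name__}")
        self.entries[entry_name] = dict(attributes)
```

A collection created as `Collection("/trailers", {"Title": str, "Runtime": int})` would accept `add_entry("e1", Title="Spider-man 2", Runtime=120)` and reject a `Runtime` given as a string.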

Page 48: Session 24 - Distribute Data and Metadata Management with gLite

Example: Movie Trailers

• Movie trailer files (entries) saved on Grid Storage Elements and registered in the File Catalogue
• We want to add metadata to describe movie content.
• A possible schema:
  – Title -- varchar
  – Runtime -- int
  – Cast -- varchar
  – LFN -- varchar
• A metadata catalogue will be the repository of the movies’ metadata and will make it possible to find movies satisfying users’ queries

Page 53: Session 24 - Distribute Data and Metadata Management with gLite

Trailer example

Collection /trailers

Entry name                           | Title                    | Runtime | Cast          | LFN
8c3315c1-811f-4823-a778-60a203439689 | My Best Friend’s wedding | 80      | Julia Roberts | lfn:/grid/gilda/movies/mybfwed.avi
51a18b7a-fd21-4b2c-aa74-4c53ee64846a | Spider-man 2             | 120     | Kirsten Dunst | lfn:/grid/gilda/movies/spiderman2.avi
401e6df4-c1be-4822-958c-ce3eb5c54fcb | The God Father           | 113     | Al Pacino     | lfn:/grid/gilda/movies/godfather.avi

Each column (Title, Runtime, Cast, LFN) is an attribute; together the columns form the schema. Each row is an entry, and the whole table is the collection /trailers.
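To make "finding files by their metadata" concrete, the sketch below loads the three trailer entries into an in-memory SQLite table and returns the LFNs of trailers whose Runtime exceeds a threshold. SQLite is only a stand-in chosen for this example: the slides do not show the metadata service's query syntax here, and the function `find_lfns_longer_than` is hypothetical.

```python
import sqlite3

# The three entries from the /trailers collection shown on the slide.
TRAILERS = [
    ("8c3315c1-811f-4823-a778-60a203439689", "My Best Friend’s wedding", 80,
     "Julia Roberts", "lfn:/grid/gilda/movies/mybfwed.avi"),
    ("51a18b7a-fd21-4b2c-aa74-4c53ee64846a", "Spider-man 2", 120,
     "Kirsten Dunst", "lfn:/grid/gilda/movies/spiderman2.avi"),
    ("401e6df4-c1be-4822-958c-ce3eb5c54fcb", "The God Father", 113,
     "Al Pacino", "lfn:/grid/gilda/movies/godfather.avi"),
]


def find_lfns_longer_than(min_runtime):
    """Return LFNs of trailers with Runtime > min_runtime, shortest first."""
    conn = sqlite3.connect(":memory:")
    try:
        # "Cast" is quoted because CAST is also an SQL keyword.
        conn.execute(
            'CREATE TABLE trailers (entry TEXT PRIMARY KEY, Title TEXT, '
            'Runtime INTEGER, "Cast" TEXT, LFN TEXT)'
        )
        conn.executemany("INSERT INTO trailers VALUES (?, ?, ?, ?, ?)", TRAILERS)
        rows = conn.execute(
            "SELECT LFN FROM trailers WHERE Runtime > ? ORDER BY Runtime",
            (min_runtime,),
        ).fetchall()
    finally:
        conn.close()
    return [lfn for (lfn,) in rows]
```

The returned LFNs are exactly what a user would then hand to the data management tools to locate and fetch the matching files.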

Metadata service on the Grid

• Information about files, but not only!
• Metadata can describe any grid entity/object
  – ex: JobIDs - add logging information to your jobs
• Monitoring of running applications:
  – ex: ongoing results from running jobs can be published on the metadata server
• Input set for a storm of parametric jobs
• Information exchange among grid peers
  – ex: producer/consumer job collections: master jobs produce data to be analyzed; slave jobs query the metadata server to retrieve input to "consume"
• Simplified DB access on the grid
  – Grid applications that need structured data can model their data schemas as metadata

Monitoring of running application

[Figure: a Scientist/Developer submitting jobs through the Workload Manager to a CE and its WNs, an SE, a Metadata Catalogue holding the /results collection, and a Customer/Scientist who is shown results as they are produced]

Inputset for parametric jobs

• /grid/my_simulation/input

|entry|x1     |x2     |y1     |y2     |step  |isTaken  |found     |output|
|-----|-------|-------|-------|-------|------|---------|----------|------|
|1    |9453.1 |9453.32|-439.93|-439.91|0.0006|JobID1234|No pillars|      |
|2    |9342.13|3435   |3423   |2343.2 |0.003 |No       |          |      |
|3    |34254.3|342342 |432.43 |132    |0.002 |No       |          |      |
| ...... and so on                                                        |

• This collection lists all the parameter sets to be run on the Grid
• On the WN, one of the input sets is selected and "isTaken" is set to the JOB_ID of the job that has fetched it
• Results are also written to the "found" column to monitor the simulation
  – so users can check the simulation from a UI, by querying the metadata server, or from a web page (using the APIs, for example)
• StdOutput can also be copied into the "output" text column

A possible parameter-get.sh script

#!/bin/bash

# Find the first set of parameters that has not been taken by anyone
ID=$(mdcli find /grid/my_simulation/input 'isTaken="No"' | head -1)

# Exit if all the parameter sets have already been analyzed
if [ -z "$ID" ]; then exit 1; fi

# Set isTaken to this job's JOB_ID so that no one else will analyze the same set of parameters
mdcli setattr /grid/my_simulation/input/$ID isTaken $GLITE_WMS_JOBID

# Retrieve the set of parameters to be scanned
X1=$(mdcli getattr /grid/my_simulation/input/$ID x1 | tail -1)
Y1=$(mdcli getattr /grid/my_simulation/input/$ID y1 | tail -1)
X2=$(mdcli getattr /grid/my_simulation/input/$ID x2 | tail -1)
Y2=$(mdcli getattr /grid/my_simulation/input/$ID y2 | tail -1)
STEP=$(mdcli getattr /grid/my_simulation/input/$ID step | tail -1)

# Run the scan with the proper parameters and save the output to output.txt
java -cp issgc_sfk_nesc.jar:sfkscanner.jar uk.ac.nesc.toe.sfk.radar.Scanner $X1 $Y1 $X2 $Y2 $STEP > output.txt

# The Scanner class prints "No pillars found in this area" or "Found area:",
# which gives useful information for monitoring during the run
mdcli setattr /grid/my_simulation/input/$ID found $(grep -i found output.txt)

# Save the output (and the pillar text, if found) on the metadata server
mdcli setattr /grid/my_simulation/input/$ID output $(cat output.txt)
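The claim step of the script can be sketched outside the Grid as follows. This is a minimal in-memory model in Python, not the AMGA API: the dictionary stands in for the /grid/my_simulation/input collection, the values come from the example table, and the name `claim_next` and the job IDs other than JobID1234 are illustrative assumptions. In this single-process model, finding and marking an entry happen in one step; in the shell script they are two separate mdcli calls, so two jobs could in principle select the same entry, and a job may want to re-read isTaken after setting it.

```python
# In-memory stand-in for the /grid/my_simulation/input collection (not the AMGA API).
inputs = {
    "1": {"x1": 9453.1, "x2": 9453.32, "y1": -439.93, "y2": -439.91,
          "step": 0.0006, "isTaken": "JobID1234"},
    "2": {"x1": 9342.13, "x2": 3435, "y1": 3423, "y2": 2343.2,
          "step": 0.003, "isTaken": "No"},
    "3": {"x1": 34254.3, "x2": 342342, "y1": 432.43, "y2": 132,
          "step": 0.002, "isTaken": "No"},
}


def claim_next(collection, job_id):
    """Mark the first free entry as taken by job_id; return its name or None."""
    for name in sorted(collection):
        entry = collection[name]
        if entry["isTaken"] == "No":
            entry["isTaken"] = job_id  # same role as 'setattr ... isTaken <JOB_ID>'
            return name
    return None  # every parameter set has already been claimed


# Hypothetical job IDs, used only for illustration.
first = claim_next(inputs, "JobID5678")
second = claim_next(inputs, "JobID9012")
third = claim_next(inputs, "JobID3456")
print(first, second, third)
```

Each call hands out a different entry until none are left, which is the behaviour the "isTaken" column is meant to guarantee.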


Use a metadata service to exchange data among running jobs

• Suppose we have two sets of jobs:
  – Producers: they generate a file, store it on an SE, and register it in the LFC File Catalogue, assigning it an LFN
  – Consumers: they take an LFN, download the file and process it
• A metadata collection can be used to share the information generated by the Producers; it could act as a "bag-of-LFNs" (bag-of-tasks model) from which Consumers can fetch files for further processing
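The bag-of-LFNs idea can be illustrated with a small in-memory model. This Python sketch does not talk to AMGA or the LFC: the class name `BagOfLFNs`, the consumer names and the LFN paths are hypothetical, chosen only to show producers putting LFNs into a shared collection and consumers each fetching a different one.

```python
from collections import deque


class BagOfLFNs:
    """Toy model of a shared metadata collection holding LFNs to be processed."""

    def __init__(self):
        self._pending = deque()   # LFNs published by producers, not yet fetched
        self.taken = {}           # LFN -> consumer that fetched it

    def put(self, lfn):
        """Producer side: publish the LFN of a newly registered file."""
        self._pending.append(lfn)

    def fetch(self, consumer_id):
        """Consumer side: take one LFN out of the bag, or None if it is empty."""
        if not self._pending:
            return None
        lfn = self._pending.popleft()
        self.taken[lfn] = consumer_id
        return lfn


bag = BagOfLFNs()
for i in range(3):
    # Hypothetical LFNs; a real producer would use the name it registered in the LFC.
    bag.put(f"lfn:/grid/gilda/demo/output_{i}.dat")

got = [bag.fetch("consumer-A"), bag.fetch("consumer-B")]
last = bag.fetch("consumer-A")
empty = bag.fetch("consumer-B")
print(got, last, empty)
```

On the Grid, the same roles would be played by entries in a metadata collection, with a "taken" attribute serving as the marker that a consumer has already picked up the file.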

Information exchange among grid peers

[Figure: a Scientist/Developer submitting jobs through the Workload Manager to two CEs with their WNs; Producer jobs put LFNs into, and Consumer jobs fetch LFNs from, the /bag-of-LFNs collection of the Metadata Catalogue; an SE is also shown]

The AMGA Metadata Catalogue

• Official metadata service for the gLite middleware
  – but with no dependencies on gLite software
  – it can be used with other grid technologies and in other environments
• AMGA: ARDA Metadata Grid Application
• Provides a complete but simple interface, so that all users can use it easily
• Designed with scalability in mind, to deal with a large number of entries
  – based on a lightweight, streamed, text-based protocol, like HTTP/SMTP
• Grid security is provided to grant different access levels to different users
• Flexible, with support for dynamic schemas, to serve several application domains
• Simple installation from a source tarball, RPMs or Yum/YAIM


AMGA Features

• Dynamic schemas
  – Schemas can be modified at runtime by the client
      Create, delete schemas
      Add, remove attributes
• AMGA collections are hierarchically organized
  – Collections can contain sub-collections
  – Sub-collections can inherit/extend the parent collection's schema
• Flexible queries
  – SQL-like query language
  – Different join types (inner, outer, left, right) between schemas are provided
      selectattr /gLibrary:FileName /gLAudio:Author /gLAudio:Album '/gLibrary:FILE=/gLAudio:FILE and like(/gLibrary:FileName, "%.mp3")'
  – Support for views, constraints, indexes
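How a sub-collection can inherit and extend its parent's schema can be pictured with a small Python model. This is only a sketch of the idea, not of how AMGA implements it: the `Collection` class, the collection paths and the attribute types are all assumptions made for illustration.

```python
class Collection:
    """Toy collection whose schema is its parent's schema plus its own attributes."""

    def __init__(self, path, attributes, parent=None):
        self.path = path
        self.own = dict(attributes)   # attribute name -> type, defined at this level
        self.parent = parent

    def add_attribute(self, name, type_name):
        """Dynamic schema: attributes can be added at runtime."""
        self.own[name] = type_name

    def schema(self):
        """Effective schema: inherited attributes first, then local ones."""
        merged = dict(self.parent.schema()) if self.parent else {}
        merged.update(self.own)
        return merged


# Hypothetical collections and types.
media = Collection("/media", {"FileName": "varchar(255)"})
audio = Collection("/media/audio", {"Author": "varchar(255)"}, parent=media)

audio.add_attribute("Album", "varchar(255)")   # extend the child only
media.add_attribute("FileSize", "int")         # extend the parent at runtime

print(sorted(audio.schema()))
print(sorted(media.schema()))
```

In this model a change to the parent schema is visible from the child immediately, while attributes added to the child stay local to it.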



AMGA Security

• Unix-style permissions: users and groups
• ACLs: per-collection or per-entry (table row)
• Secure client/server connections: SSL
• Client authentication based on
  – Username/password
  – General X.509 certificates (DN based)
  – Grid-proxy certificates (DN based)
• VOMS support:
  – VO attribute maps to a defined AMGA user
  – VOMS Role maps to a defined AMGA user
  – VOMS Group maps to a defined AMGA group

AMGA Implementation

• C++ multiprocess server
  – Back ends
      Oracle, MySQL 4/5, PostgreSQL, SQLite
  – Front ends
      TCP text streaming
        • High performance
        • Client API for C++, Java, Python, Perl, PHP
      SOAP (deprecated)
        • Interoperability
        • Scalability
      WS-DAIR interface (new in AMGA 2.0)
        • WS-enabled environment
• Standalone Python library implementation
  – Data stored on the file system
• AMGA server runs on SLC3/4, Fedora Core, Gentoo, Debian

AMGA Datatypes

[Table of AMGA datatypes not recovered]

- Using the above datatypes, you can be sure that your metadata can easily be moved to any supported back end
- If you do not care about DB portability, you can in principle use as entry attribute type ANY datatype supported by the back end, even the more esoteric ones (e.g. the PostgreSQL network address or geometric types)

Accessing AMGA from UI/WNs

• TCP streaming front end
  – mdcli & mdclient CLI and C++ API (md_cli.h, MD_Client.h)
  – Java client API and command-line tools mdjavaclient.sh & mdjavacli.sh (also under Windows!)
  – Python and Perl client API
  – PHP client API (new)
      Developed entirely by the GILDA team, INFN CT
  – AMGA Web Interface (AMGA WI) (new)
      Developed entirely by the GILDA team, INFN CT
      Based on the standard Java AMGA APIs
      Web application using standards such as JSP custom tags and servlets
• SOAP front end (WSDL)
  – C++ gSOAP
  – AXIS (Java)
  – ZSI (Python)

Advanced features: Metadata Replication

• AMGA provides replication/federation mechanisms
• Motivation
  – Scalability – Support hundreds/thousands of concurrent users
  – Geographical distribution – Hide network latency
  – Reliability – No single point of failure
  – DB-independent replication – Heterogeneous DB systems
  – Disconnected computing – Off-line access (laptops)
• Architecture
  – Asynchronous replication
  – Master-slave – writes only allowed on the master
  – Application-level replication
      Replicate metadata commands
  – Partial replication – supports replication of only sub-trees of the metadata hierarchy
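The architecture points above (asynchronous, master-slave, command-level, partial) can be sketched with a toy Python model. It is not AMGA's replication protocol: the `Master` and `Slave` classes, the `sync` method and the entry paths are assumptions for illustration, while the runtime values 113 and 120 are borrowed from the trailer example.

```python
class Master:
    """Accepts writes and keeps an ordered log of the metadata commands."""

    def __init__(self):
        self.log = []  # list of (entry path, attribute, value)

    def setattr(self, path, attr, value):
        self.log.append((path, attr, value))


class Slave:
    """Read-only replica of one sub-tree, updated by replaying the master's log."""

    def __init__(self, subtree):
        self.subtree = subtree.rstrip("/")
        self.applied = 0   # number of log records already seen
        self.data = {}     # entry path -> {attribute: value}

    def sync(self, master):
        """Asynchronous catch-up: replay only the commands under our sub-tree."""
        for path, attr, value in master.log[self.applied:]:
            if path == self.subtree or path.startswith(self.subtree + "/"):
                self.data.setdefault(path, {})[attr] = value
        self.applied = len(master.log)


master = Master()
master.setattr("/trailers/godfather", "Runtime", 113)      # hypothetical entry names
master.setattr("/results/run1", "found", "No pillars")

replica = Slave("/trailers")   # partial replication: only the /trailers sub-tree
replica.sync(master)

master.setattr("/trailers/spiderman2", "Runtime", 120)
stale = dict(replica.data)     # the replica lags until its next sync
replica.sync(master)
print(stale)
print(replica.data)
```

Between two syncs the replica serves slightly stale data, which is the trade-off asynchronous replication accepts in exchange for hiding network latency.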

Metadata Replication: Use cases

[Figure: four replication scenarios: full replication, partial replication, federation, proxy]

Importing existing SQL DBs

• Since AMGA 1.2.10, a new import feature allows access to existing DB tables
• Once you have imported into AMGA the tables, from one or more DBs, that you want to access through AMGA, you can exploit many AMGA features on your existing tables
• Advantages:
  – your DB tables can be accessed by grid users/applications, using grid authentication (VOMS proxies) and authorization with ACLs
  – by exploiting AMGA federation features you can access several databases together from the Grid

Set up AMGA to access your tables

• To remember: AMGA stores its own tables in its DB back end
• To access an existing DB you have 2 options:
  – import the tables of the DB you want to access into the AMGA DB back end
  – vice versa, add the AMGA DB back-end tables to the DB you want to access
• Use the import command as root to "mount" your tables into the AMGA collection hierarchy

Query> whoami
>> root
Query> createdir /world
Query> cd /world/
Query> import world.City /world/City
Query> import world.Country /world/Country
Query> import world.CountryLanguage /world/CountryLanguage

Set up AMGA to access your tables

• Properly set up authorization on the imported tables:

Query> acl_remove /world/City/ system:anyuser
Query> acl_remove /world/Country system:anyuser
Query> acl_add /world/ gilda:users rx
Query> acl_show /world
>> root rwx
>> gilda:users rx
>> system:anyuser rx
Query> selectattr City:CountryCode City:Name 'like(City:Name, "Am%") limit 5'
>> NLD
>> Amsterdam
>> NLD
>> Amersfoort
>> BRA
>> Americana
>> ECU
>> Ambato
>> IDN

‣ More information on existing DB access:
‣ http://amga.web.cern.ch/amga/importing.html
‣ https://grid.ct.infn.it/twiki/bin/view/GILDA/AMGADBaccess
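The effect of the ACL commands above can be pictured with a small Python model. This is a sketch under stated assumptions, not AMGA's authorization code: the rule that a request is allowed when the user, one of the user's groups, or system:anyuser holds the permission is an assumption, and the ACL shown for /world/City after acl_remove is hypothetical. Only the /world ACL is taken from the acl_show output.

```python
def allowed(acl, user, groups, perm):
    """Return True if the user, one of its groups, or system:anyuser holds perm.

    acl maps a principal name to a permission string such as "rwx" or "rx".
    The evaluation rule itself is an assumption made for this sketch.
    """
    principals = [user, *groups, "system:anyuser"]
    return any(perm in acl.get(p, "") for p in principals)


# ACL of /world as printed by acl_show in the session above.
world_acl = {"root": "rwx", "gilda:users": "rx", "system:anyuser": "rx"}

# Hypothetical ACL of /world/City once system:anyuser has been removed.
city_acl = {"root": "rwx", "gilda:users": "rx"}

checks = {
    "guest reads /world": allowed(world_acl, "guest", [], "r"),
    "guest reads /world/City": allowed(city_acl, "guest", [], "r"),
    "gilda user reads /world/City": allowed(city_acl, "alice", ["gilda:users"], "r"),
    "gilda user writes /world/City": allowed(city_acl, "alice", ["gilda:users"], "w"),
    "root writes /world/City": allowed(city_acl, "root", [], "w"),
}
for description, result in checks.items():
    print(description, "->", result)
```

The user name "alice" and the "guest" account are placeholders; in a real deployment the principals come from the client's certificate DN or VOMS attributes.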

Native SQL Support

• Objective:
  – implement native SQL query processing functionality in AMGA
• Current status:
  – direct SQL data statements at SQL92 Entry Level have been implemented in the 1.9 release
      Including 4 statements: SELECT, DELETE, UPDATE and INSERT
      All SQL commands should be issued in UPPERCASE
• Entry name:
  – when a new entry is created with addentry/addentries, a name has to be assigned (filling the "file" column in the AMGA DB back end)
      in the INSERT implementation, it is filled automatically with a random GUID

Native SQL example

Query> INSERT INTO `City` VALUES (1,'Kabul','AFG','Kabol',1780000)
>> Operation Success
Query> dir /world/City/
>> /world/City/80b4fe646ed11dda02100304873049
>> entry
Query> SELECT COUNT (*) FROM /world/City
>> 3429
Query> SELECT * FROM /world/City WHERE Name LIKE '%Catani%'
>> 1472
>> Catania
>> ITA
>> Sisilia
>> 337862
Query> SELECT /world/City:Name, /world/City:District, /world/Country:Name, /world/Country:Region, /world/Country:Continent FROM /world/City, /world/Country WHERE /world/City:Name LIKE '%Catani%' AND Code = 'ITA'
>> Catania
>> Sisilia
>> Italy
>> Southern Europe
>> Europe
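The last query joins City and Country. The same join can be tried locally with Python's built-in sqlite3 module, as a rough stand-in for an AMGA back end. This is only a sketch: the table layout is an assumption (the ID and Population column names are guesses; the other columns appear in the session above), the Afghanistan row is filler added so the join has something to exclude, and the query uses plain SQL table names rather than AMGA collection paths.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE City (ID INTEGER, Name TEXT, CountryCode TEXT,
                   District TEXT, Population INTEGER);
CREATE TABLE Country (Code TEXT, Name TEXT, Region TEXT, Continent TEXT);
-- Rows taken from the session above
INSERT INTO City VALUES (1, 'Kabul', 'AFG', 'Kabol', 1780000);
INSERT INTO City VALUES (1472, 'Catania', 'ITA', 'Sisilia', 337862);
INSERT INTO Country VALUES ('ITA', 'Italy', 'Southern Europe', 'Europe');
-- Illustrative filler row, not from the slides
INSERT INTO Country VALUES ('AFG', 'Afghanistan', 'Southern and Central Asia', 'Asia');
""")

# Explicit join condition on the country code, plus the LIKE filter on the city name.
rows = conn.execute(
    "SELECT City.Name, City.District, Country.Name, Country.Region, Country.Continent "
    "FROM City JOIN Country ON City.CountryCode = Country.Code "
    "WHERE City.Name LIKE ?",
    ("%Catani%",),
).fetchall()

count = conn.execute("SELECT COUNT(*) FROM City").fetchone()[0]
print(rows, count)
conn.close()
```

Unlike the slide's query, which filters the cross product with Code = 'ITA', this version states the join condition explicitly, so it keeps working when the city filter matches cities from several countries.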

Biomed - MDM

• Medical Data Manager – MDM
  – Store and access medical images and associated metadata on the Grid
  – Built on top of the gLite 1.5 data management system
  – Demonstrated at the last EGEE conference (October 05, Pisa)
• Strong security requirements
  – Patient data is sensitive
  – Data must be encrypted
  – Metadata access must be restricted to authorized users
• AMGA used as metadata server
  – Demonstrates authentication and encrypted access
  – Used as a simplified DB
• More details at
  – https://uimon.cern.ch/twiki/bin/view/EGEE/DMEncryptedStorage

gMOD: grid Movie On Demand

• gMOD provides a Video-On-Demand service
• The user chooses from a list of videos, and the chosen one is streamed in real time to the video client on the user's workstation
• For each movie many details (Title, Runtime, Country, Release Date, Genre, Director, Cast, Plot Outline) are stored, and users can search for a particular movie by querying on one or more attributes
• Two kinds of users can interact with gMOD: TrailersManagers, who can administer the DB of movies (uploading new ones and attaching metadata to them), and GILDA VO users (guests), who can browse, search and choose a movie to be streamed

gMOD under the hood

• Built on top of gLite services:
  – Storage Elements, located at different sites, physically contain the movie files
  – LFC, the File Catalogue, keeps track of which Storage Element holds a particular movie
  – AMGA is the repository of the detailed information for each movie, and makes queries on it possible
  – The Virtual Organization Membership Service (VOMS) is used to assign the right role to the different users
  – The Workload Management System (WMS) is responsible for retrieving the chosen movie from the right Storage Element and streaming it over the network down to the user's desktop or laptop

gMOD interactions

[Figure: the User reaches gMOD through the GENIUS Portal; components shown are VOMS (get Role), the AMGA Metadata Catalogue, the LFC File Catalogue, the Workload Management System, a CE with its WNs, and the Storage Elements]

gMOD screenshot

gMOD is accessible through the GENIUS Portal (https://glite-demo.ct.infn.it)

ISSGC’09, Taipei, 21st April 09

gLibrary features

• INFN-developed tool, entirely gLite based
• It allows users to store, organize, search and retrieve digital assets in a Grid environment through an intuitive front end
• What we mean by digital assets: [illustration not recovered]

[Figure: gLibrary as the iTunes for the Grid]

Browse & Search

• Assets can be browsed by selecting a type (or category) and then one or more filters:
  – attributes of the selected type, chosen from a defined list, used to narrow the result set
• Filter application is cascading and context-sensitive: the selection of a filter value dynamically influences subsequent filter values ("à la iTunes" browsing)
  – Classical search by description and keywords is available too
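Cascading, context-sensitive filtering can be sketched in a few lines of Python. This is an illustration of the browsing idea only, not gLibrary's implementation: the function name `filter_values`, the attribute names and the sample assets are invented for the example.

```python
# Invented sample assets; real ones would be entries in a metadata collection.
assets = [
    {"name": "a1", "Genre": "Rock", "Artist": "Band A", "Album": "First"},
    {"name": "a2", "Genre": "Rock", "Artist": "Band B", "Album": "Second"},
    {"name": "a3", "Genre": "Jazz", "Artist": "Trio C", "Album": "Third"},
]


def filter_values(items, attribute, selected):
    """Values still available for `attribute`, given the filters chosen so far."""
    matching = [a for a in items
                if all(a.get(key) == value for key, value in selected.items())]
    return sorted({a[attribute] for a in matching if attribute in a})


# With no selection every artist is offered; choosing a genre narrows the list,
# and choosing an artist narrows the albums in turn.
all_artists = filter_values(assets, "Artist", {})
rock_artists = filter_values(assets, "Artist", {"Genre": "Rock"})
albums = filter_values(assets, "Album", {"Genre": "Rock", "Artist": "Band B"})
print(all_artists, rock_artists, albums)
```

Each filter column is recomputed from the assets that survive the earlier selections, which is what makes the later columns context-sensitive.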

Organize assets

• "Types" and "Categories" are defined by repository providers/admins
• Assets are organized by type:
  – a list of specific attributes to describe each kind of asset to be managed by the system
  – hierarchical (a child type shares and extends its parent's attributes)
  – queried during searches
• and/or organized by collection:
  – Group together related assets of different types
  – Also useful to define subsets of assets belonging to the same type
  – Multiple category assignment per asset (tagging)

[Figure: Collections]

Store & Retrieve

• Users can upload their local assets to one or more Storage Elements of the Grid (creating replicas)
  – Files already on a grid SE can be registered in a gLibrary repository through the LFC File Catalogue browser
• Download from SEs to the user's laptop/desktop:
  – by selecting a replica link from a list
• Transfers are handled from the browser over HTTP/HTTPS, provided that users have imported their own X.509 Grid certificate

[Figure: gLibrary architecture]

Antonio Calanducci ISGC09 Grid Tutorial, Taipei 18-04-2009

Features

• Implemented as a Web 2.0 application
  – AJAX and JavaScript are heavily used to offer a desktop-like user experience
  – Business logic implemented using PHP 5 OOP support


Technologies used

• Web standards:
  – JavaScript/AJAX/JSON on the client side
  – PHP5 classes to implement business logic on the server side
• Grid technologies:
  – Storage Element SRM interface to get the TURLs (Transfer URLs)
  – Transfers handled with GridFTP (first release) and HTTPS with X.509 certificate authentication
  – X.509-based Globus Security Infrastructure with the VOMS extensions to handle authentication and authorization (ACL based) on Metadata and Storage Elements
  – All grid services implemented with the EGEE gLite middleware (DPM Storage Elements, AMGA Metadata Catalogue, LFC File Catalogue, VOMS Services)

De Roberto cultural heritage

• De Roberto, an Italian writer of the 19th/20th century who was born in Naples but spent his life in Catania, left numerous works to the humanities community
• These are made up of valuable and hard-to-manage pieces: manuscripts, typescripts, drafts with handwritten corrections, magazines, clippings, sketches, photos, etc.

Digitize to preserve them

• Some sheets are damaged (mold, crumbled pieces) and need physical restoration
• Digitization avoids the loss of these works, some of them still unpublished and relevant for the humanities community


Acquisition stage

• Digitization of manuscripts, typescripts, printed works
  – TIFF files, one per page, 600 dpi, about 100 MB for A3
      High-resolution scans for in-depth examination
  – PDF, one per work, 300 dpi, file sizes varying from 40 to 400 MB
      Overall examination of works
  – 8000 sheets/scans, 3 terabytes of disk space
  – Different physical formats: A3/A4/custom size
• Embedded metadata
  – TIFF with embedded metadata to provide the scan's physical features and information about the content
      ImageWidth, ImageHeight, XResolution, FileSize, CreationDate, ModifyDate
      Description, Keywords, CaptionWriter, Title, Author, Copyright Status, Copyright Notice
  – Added with Photoshop after the digitization phase (Adobe XMP format)


Goals and requirements

• Make those works accessible to the humanities community
  – Always online: 24 x 365
  – Available from everywhere
  – Simple and easy-to-use interface for non-expert people
• Quickly find the desired document
  – Document organization according to the physical and semantic metadata
      Organization by type/collections
      Dynamic filtering of search result sets according to the selection of one or more document metadata
• Long-term preservation (digital preservation)
  – Multiple copies (replicas) spread across different geographical sites
  – Reliability of storage systems and replica redundancy to achieve secure preservation


Page 124: Session 24 - Distribute Data and Metadata Management with gLite

ISSGC’09, Taipei, 21st April 09

Enabling Grids for E-sciencE

INFSO-RI-508833

• store the 8000 scans of De Roberto Heritage ----> Data Grid Storage Elements

• enable an ubiquitous and 24/24h access to scientists ---> Web Application

• document organization for a quick search ---> Metadata Services

• long-term digital preservation of data ---> redundancy through Replicas of files on several Storage Elements

• simple and easy-to-use system for searches, organization, upload and download of digitalized documents on the Grid ----->

What Data Grids can offer to them

82

Page 125: Session 24 - Distribute Data and Metadata Management with gLite


What Data Grids can offer to them

• store the 8000 scans of De Roberto Heritage ---> Data Grid Storage Elements
• enable ubiquitous, 24-hour access for scientists ---> Web Application
• document organization for a quick search ---> Metadata Services
• long-term digital preservation of data ---> redundancy through Replicas of files on several Storage Elements
• simple and easy-to-use system for searches, organization, upload and download of digitized documents on the Grid

Page 126: Session 24 - Distribute Data and Metadata Management with gLite


Data Grids to preserve Cultural Heritage


Page 127: Session 24 - Distribute Data and Metadata Management with gLite


Live Demo


Page 128: Session 24 - Distribute Data and Metadata Management with gLite


Data Management Services Summary

• Storage Element – stores data and provides a common interface
– Storage Resource Manager (SRM): dCache, DPM, ...
– Native access protocols: rfio, dcap, nfs, …
– Transfer protocols: gsiftp, https, …
• Catalogs – keep track of where data are stored
– File Catalog and Replica Catalog: LCG File Catalog (LFC)
– Metadata Catalog: AMGA Metadata Catalogue
• Data Movement – scheduled, reliable file transfer
– File Transfer Service: gLite FTS (manages physical transfers)
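
The role of a replica catalog, one logical file name resolved to several physical copies, can be sketched with a toy in-memory model in Python. This is only an illustration of the idea, not the LFC interface: the `ReplicaCatalog` class, its methods, the logical file name, and the `example.org` storage URLs are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


class ReplicaCatalog:
    """Toy in-memory catalogue: one logical file name (LFN) maps to many replicas."""

    def __init__(self) -> None:
        self._replicas: Dict[str, List[str]] = defaultdict(list)

    def register(self, lfn: str, storage_url: str) -> None:
        """Record that a copy of the logical file exists at storage_url."""
        if storage_url not in self._replicas[lfn]:
            self._replicas[lfn].append(storage_url)

    def replicas(self, lfn: str) -> List[str]:
        """Return all known physical locations of the logical file."""
        return list(self._replicas.get(lfn, []))

    def pick(self, lfn: str, unavailable: Iterable[str] = ()) -> str:
        """Return the first replica that is not on an unavailable storage element."""
        down = tuple(unavailable)
        for url in self.replicas(lfn):
            if not any(url.startswith(prefix) for prefix in down):
                return url
        raise LookupError(f"no reachable replica for {lfn}")


# Hypothetical logical file name and storage element URLs.
catalog = ReplicaCatalog()
lfn = "lfn:/grid/demo/scan_0001.tif"
catalog.register(lfn, "srm://se1.example.org/data/scan_0001.tif")
catalog.register(lfn, "srm://se2.example.org/data/scan_0001.tif")

# If the first storage element is down, the second replica is still usable.
chosen = catalog.pick(lfn, unavailable=["srm://se1.example.org"])
```

Keeping the logical name separate from the physical locations is what lets replicas be added or lost without changing how users refer to the file.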

Page 129: Session 24 - Distribute Data and Metadata Management with gLite


References

• gLite documentation homepage
– http://glite.web.cern.ch/glite/documentation/default.asp
• DM subsystem documentation
– http://egee-jra1-dm.web.cern.ch/egee-jra1-dm/doc.htm
• LFC and DPM documentation
– https://uimon.cern.ch/twiki/bin/view/LCG/DataManagementDocumentation

Page 130: Session 24 - Distribute Data and Metadata Management with gLite


References

• AMGA Web Site: http://cern.ch/amga
• AMGA Manual: http://amga.web.cern.ch/amga/downloads/amga-manual_1_3_0.pdf
• AMGA API Javadoc: http://amga.web.cern.ch/amga/javadoc/index.html
• AMGA Web Frontend: http://gilda-forge.ct.infn.it/projects/amgawi/
• AMGA Basic Tutorial: https://grid.ct.infn.it/twiki/bin/view/GILDA/AMGAHandsOn
• More information on existing DB access:
– http://amga.web.cern.ch/amga/importing.html
– https://grid.ct.infn.it/twiki/bin/view/GILDA/AMGADBaccess

Page 131: Session 24 - Distribute Data and Metadata Management with gLite

Antonio Calanducci ISGC09 Grid Tutorial, Taipei 18-04-2009


gLibrary references

• gLibrary project homepage:
– https://glibrary.ct.infn.it/
• gLibrary paper:
– https://glibrary.ct.infn.it/glibrary/downloads/gLibrary_paper_v2.pdf

Page 132: Session 24 - Distribute Data and Metadata Management with gLite


Questions…

