RWL Jones, Lancaster University
MAGDA
Main authors: Wensheng Deng, Torre Wenaus

Magda is a distributed data manager prototype for grid-resident data. The system is designed for rapid and flexible evolution of database schema and surrounding infrastructure, integration and interchange of third-party components, etc.
Web service (via perl::SOAP) and command line interfaces
C++, Java and Perl APIs for access to all components of the database
C++, Java APIs autogenerated by perl scripts from the mySQL database, so always synchronised
Developed as part of PPDG
Magda Documents and Installation Status
User guide: http://www.atlasgrid.bnl.gov/magdadoc/userguide.htm (in preparation; suggestions are welcome)
Useful introduction: SC2002 handout, http://www.atlasgrid.bnl.gov/magdademo/sc2002_poster.ppt
Servers (web and database) at BNL
AFS clients available at:
/afs/usatlas.bnl.gov/project/magda/current
/afs/cern.ch/atlas/maxidisk/d94/wenaus/wdeng/atlas_magda/magda_setup
Installation document: http://www.atlasgrid.bnl.gov/magdadoc/userguide.htm#3
Magda Usage
Total 327k files occupying 26 TB
38k DC1 files (mostly gathered using spiders)
Transferred more than 4 TB of data between CERN Castor and BNL HPSS since the start of DC1
4k U.S. Grid Testbed DC1 files and replicas registered using magda tools
Tokyo and Lyon tried for DC1, other sites being added progressively; RAL is a priority (large store)
GDMP and Reptor integration now underway, but we need these (production-level) tools now
Stores currently accessed
NFS and AFS disk areas at US ATLAS Tier 1 and CERN
ATLAS pools in the CERN staging system
CERN Castor mass store (ATLAS storage areas, e.g. testbeam data)
US ATLAS Tier 1 HPSS 'rftp' service (the HPSS access mode that US ATLAS currently has access to)
ATLAS code repository contents
Personal data areas
MSS locations at US ATLAS grid testbed sites (ANL, LBNL, Boston, Indiana)
Also Lyon, Tokyo, …
MAGDA Entities
prime: File catalog. Catalogs all instances of all files in the system.
logical: Logical filename catalog. Metadata about logical files (associated keys), not specific to particular physical instances.
site: A computing facility; may have many data stores, e.g. CERN CASTOR.
location: Data locations (e.g. directory, staging pool). Associated with a particular site. A given location is designated as either a 'prime' or 'replica' location.
host: Computers on which the system runs or which provide access. The means by which the spider knows where it is and what locations it can scan.
collection: Collections of logical files.
collectionContent: Logical file lists for collections.
task: Catalog of replication tasks.
generic_sig: Generic 'data signature' sufficient for regeneration. Identifies equivalent data sets.
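The entity list above can be illustrated with a minimal sketch. This is not the real Magda schema (which lives in MySQL with C++/Java/Perl APIs); the class and field names here are assumptions chosen to mirror the 'prime' and 'location' entities described.

```python
# Illustrative sketch only: toy records mirroring the 'location' and
# 'prime' (file catalog) entities, plus a lookup returning all physical
# instances of a logical file.
from dataclasses import dataclass

@dataclass
class Location:
    site: str        # owning site, e.g. a Tier 1 centre
    path: str        # directory or staging pool
    role: str        # 'prime' or 'replica'

@dataclass
class FileInstance:  # one row of the 'prime' file catalog
    lfn: str         # logical file name
    site: str
    location: str
    size: int

def instances_of(catalog, lfn):
    """Return all physical instances of a logical file."""
    return [f for f in catalog if f.lfn == lfn]

# A logical file with a prime copy and one replica:
catalog = [
    FileInstance("lfn://atlas.org/example.zebra", "siteA", "/data/prime", 100),
    FileInstance("lfn://atlas.org/example.zebra", "siteB", "/cache/replica", 100),
]
print(len(instances_of(catalog, "lfn://atlas.org/example.zebra")))  # 2
```

The key design point from the slide: physical instances ('prime') are catalogued separately from instance-independent metadata ('logical'), so replicas can come and go without touching logical-file records.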
Magda Architecture

[Architecture diagram: sites contain locations (disk, cache, mass store); spider processes on hosts scan the locations they can reach and synchronise with the catalog via the MySQL database; a replication task takes a collection of logical files to replicate, stages sources into a cache (stagein), transfers source to destination via gridftp, bbftp or scp, then registers replicas and makes catalog updates.]
Magda Command-line Tools
Type tools without parameters to get usage info
Calls 'globus-url-copy' internally, and 'globus-job-run' to interact with HPSS
magda_findfile: searches the magda database
magda_putfile: extended to work with Lyon HPSS recently
magda_getfile:
magda_delete:
Usage:
  magda_delete filerecord <filename> <site:location>
  magda_delete filerecord <site:location>
  magda_delete location <site:location>
Magda Examples
$ magda_findfile dc1.002107.simul.0024 --sub
LFN://atlas.org/test.dc1.002107.simul.0024.hlt.eta_scan.zebra site=usatlasrftp path=… size=28188000 primary
LFN://atlas.org/dc1.002107.simul.0024.hlt.eta_scan.zebra site=utatlasfarm path=… size=28188000
(also shows .his and .log files)

$ magda_getfile dc1.002107.simul.0024.hlt.eta_scan.log
… Instance at usatlasrftp:/home/grid_a/simul/002107/log remotely accessible.
Instance at utatlasfarm:/opt/testbed/cache/replica remotely accessible.
globus-url-copy -p 3 gsiftp://atlas000.uta.edu/opt/testbed/cache/replica/dc1.002107.simul.0024.hlt.eta_scan.log file:///tmp/dc1.002107.simul.0024.hlt.eta_scan.log 2>&1
File dc1.002107.simul.0024.hlt.eta_scan.log staged into local directory
LFN follows EDG form
Multiple versions are handled
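As a small illustration of the EDG-form names seen in the examples above (LFN://atlas.org/<filename>), here is a sketch of splitting an LFN into its authority and file name. The parsing is illustrative only, not Magda's actual code.

```python
# Toy parser for an EDG-style logical file name: scheme://authority/name.
def parse_lfn(lfn):
    scheme, rest = lfn.split("://", 1)
    assert scheme.lower() == "lfn", "not an LFN"
    authority, _, name = rest.partition("/")
    return authority, name

print(parse_lfn("LFN://atlas.org/dc1.002107.simul.0024.hlt.eta_scan.log"))
# ('atlas.org', 'dc1.002107.simul.0024.hlt.eta_scan.log')
```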
Magda Replication
Automated file replication is supported
Definition of replication tasks:
  collection of files to be replicated
  information on source location, including a cache collection if needed
  information on the file transport mechanism (currently gridftp, bbftp and scp)
  information on destination location, including a destination-side cache if necessary
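The four ingredients of a replication task listed above can be sketched as a single record. This is a hedged sketch: the field names are assumptions, not the columns of Magda's actual 'task' table; only the set of transports comes from the slide.

```python
# Illustrative replication-task record: collection, source, transport,
# destination, with optional source/destination caches.
from dataclasses import dataclass
from typing import Optional

TRANSPORTS = {"gridftp", "bbftp", "scp"}   # mechanisms named in the text

@dataclass
class ReplicationTask:
    collection: str                    # collection of logical files to replicate
    source: str                        # source site:location
    destination: str                   # destination site:location
    transport: str                     # one of TRANSPORTS
    source_cache: Optional[str] = None # staging cache, if needed
    dest_cache: Optional[str] = None   # destination-side cache, if needed

    def __post_init__(self):
        if self.transport not in TRANSPORTS:
            raise ValueError("unsupported transport: " + self.transport)

# Hypothetical task: stage from a Castor pool through a cache, then gridftp.
task = ReplicationTask(
    collection="dc1.002107.simul",
    source="cern:castor_pool",
    destination="bnl:hpss_store",
    transport="gridftp",
    source_cache="cern:stage_cache",
)
print(task.transport)  # gridftp
```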
Magda File Spider
File spider processes run as cron jobs on distributed hosts to fill the catalog and keep it up to date
Based on the host it is running on, the spider determines which sites and locations are accessible and updates them
A catalog entry is deleted if the file is removed
Run 'crontab -e' to set it up as a cron job; useful info in /afs/usatlas.bnl.gov/project/magda/current/*.cron
Spider can be invoked from the command line: dyFileSpider.pl [site:location]
magda_putfile is preferred for positive registration in production scripts
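The spider behaviour described above (add what appears on disk, delete catalog entries whose files are gone) can be mimicked in a toy pass over one location. This is an illustrative sketch; the real spider (dyFileSpider.pl) is a Perl cron job against the MySQL catalog.

```python
# Toy spider pass: synchronise a catalog (filename -> size) with the
# files actually present at a location, deleting entries for removed files.
import os
import tempfile

def spider_pass(location_path, catalog):
    on_disk = {name: os.path.getsize(os.path.join(location_path, name))
               for name in os.listdir(location_path)}
    for name in list(catalog):
        if name not in on_disk:      # file removed: delete catalog entry
            del catalog[name]
    catalog.update(on_disk)          # register new or changed files
    return catalog

with tempfile.TemporaryDirectory() as loc:
    with open(os.path.join(loc, "a.zebra"), "w") as f:
        f.write("data")
    catalog = {"stale.zebra": 99}    # entry for a file no longer on disk
    spider_pass(loc, catalog)
    print(sorted(catalog))           # ['a.zebra']
```

This also shows why magda_putfile is preferred for positive registration: a scan only discovers state after the fact, while an explicit put registers the file at the moment it is produced.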
Magda in U.S. Grid Testbed DC1

1. submit jobs
2. check status
3. move outputs, catalog them
4. check partition, may repeat step 3
5. list statistics

[Workflow diagram: a remote Linux farm, the MySQL server and mass storage. magda_putfile does third-party transfer, putting files into BNL HPSS directly and registering them with the magda database, or copies replicas to a disk store and registers them; magda_findfile searches the magda database.]
Magda Production Database
Magda production database capability was used in the U.S. Grid Testbed for DC1
Jobinfo: filename, submithost, processhost, joburl, moddate, primestore
Jobstatus: project, dataset, step, partition, finished, joburl, started, group, filename, dirname, extra
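The two tables above can be sketched in SQLite using the listed column names. The column types are assumptions (the real tables live in Magda's MySQL database); only the column names come from the slide. 'partition' and 'group' are quoted because they collide with SQL keywords.

```python
# Hedged sketch of the DC1 production tables; types are guesses.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobinfo (
    filename TEXT, submithost TEXT, processhost TEXT,
    joburl TEXT, moddate TEXT, primestore TEXT)""")
db.execute("""CREATE TABLE jobstatus (
    project TEXT, dataset TEXT, step TEXT, "partition" INTEGER,
    finished TEXT, joburl TEXT, started TEXT, "group" TEXT,
    filename TEXT, dirname TEXT, extra TEXT)""")

# Record a hypothetical simulation job for one partition, then count
# unfinished partitions -- the kind of check step 4 of the workflow needs.
db.execute('INSERT INTO jobstatus (project, dataset, step, "partition") '
           'VALUES (?, ?, ?, ?)', ("dc1", "002107", "simul", 24))
unfinished = db.execute(
    "SELECT COUNT(*) FROM jobstatus WHERE finished IS NULL").fetchone()[0]
print(unfinished)  # 1
```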
Very useful feature for general ATLAS DC production management
Magda Future Plans
Integration with the ATLAS MetaData Interface for DC analysis
Will integrate Hierarchical Resource Manager (HRM) with the command line tools
Implementation of managing files distributed on the local disks of each node of a Linux farm
When file records reach the order of millions, scalability becomes an important issue; will look into grid catalog service (RLS)
Being evaluated by other experiments (STAR)