
Posted on 31-Mar-2015



Upgrade D0 farm

Reasons for upgrade

• RedHat 7 needed for D0 software

• New versions of:
  – ups/upd v4_6
  – fbsng v1_3f+p2_1
  – sam

• Use of farm for MC and analysis

• Integration in farm network

MC production on farm

• Input: requests

• Request translated into an mc_runjob macro

• Stages:

1. mc_runjob on batch server (hoeve)

2. MC job on node

3. SAM store on file server (schuur)
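The three stages run as one chain; the fbs job description shown further on enforces the ordering with DEPEND=done(mcc). A minimal Python sketch of that dependency check (stage names follow the slides; the run logic is illustrative, not the farm code):

```python
# The MC pipeline as ordered stages, each with at most one dependency,
# mirroring DEPEND=done(mcc) in the FBS jdf.
stages = [
    ("mc_runjob", None),         # 1. build job on batch server (hoeve)
    ("mcc", "mc_runjob"),        # 2. MC job on a node
    ("sam_store", "mcc"),        # 3. store output from file server (schuur)
]

def run_order(stages):
    """Return stage names in run order, checking each dependency is done."""
    done, order = set(), []
    for name, dep in stages:
        if dep is not None and dep not in done:
            raise RuntimeError("%s scheduled before %s finished" % (name, dep))
        order.append(name)
        done.add(name)
    return order

print(run_order(stages))  # -> ['mc_runjob', 'mcc', 'sam_store']
```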

[Diagram: MC production, single fbs job with sections 1. mcc, 2. rcp, 3. sam. The farm server (40 GB) issues the mcc request; ~100 CPU nodes take mcc input and produce mcc output; the file server (1.2 TB) stores data to the FNAL/SARA datastore and metadata to the SAM DB. Control via fbs(mcc) and fbs(rcp,sam).]

[Diagram: MC production, revised scheme: the fbs job has sections 1. mcc, 2. rcp; the sam store runs from cron (cron:sam, willem:sam on schuur). On hoeve, fbsuser runs mc_runjob and fbs submit; on the node, fbsuser runs cp and mcc; fbsuser runs rcp to schuur. Control via fbs(mcc) and fbs(rcp[,sam]); data go to the FNAL/SARA datastore, metadata to the SAM DB; ~100 CPU's, 40 GB on the farm server, 1.2 TB on the file server.]

SECTION mcc
EXEC=/d0gstar/curr/minbias-02073214824/batch
NUMPROC=1
QUEUE=FastQ
STDOUT=/d0gstar/curr/minbias-02073214824/stdout
STDERR=/d0gstar/curr/minbias-02073214824/stdout

SECTION rcp
EXEC=/d0gstar/curr/minbias-02073214824/batch_rcp
NUMPROC=1
QUEUE=IOQ
DEPEND=done(mcc)
STDOUT=/d0gstar/curr/minbias-02073214824/stdout_rcp
STDERR=/d0gstar/curr/minbias-02073214824/stdout_rcp

#!/bin/sh

. /usr/products/etc/setups.sh
cd /d0gstar/mcc/mcc-dist
. mcc_dist_setup.sh

mkdir -p /data/curr/minbias-02073214824
cd /data/curr/minbias-02073214824
cp -r /d0gstar/curr/minbias-02073214824/* .
touch /d0gstar/curr/minbias-02073214824/.`uname -n`
sh minbias-02073214824.sh `pwd` > log
touch /d0gstar/curr/minbias-02073214824/`uname -n`
/d0gstar/bin/check minbias-02073214824

#!/bin/sh
i=minbias-02073214824
if [ -f /d0gstar/curr/$i/OK ]; then
  mkdir -p /data/disk2/sam_cache/$i
  cd /data/disk2/sam_cache/$i
  node=`ls /d0gstar/curr/$i/node*`
  node=`basename $node`
  job=`echo $i | awk '{print substr($0,length-8,9)}'`
  rcp -pr $node:/data/dest/d0reco/reco*${job}* .
  rcp -pr $node:/data/dest/reco_analyze/rAtpl*${job}* .
  rcp -pr $node:/data/curr/$i/Metadata/*.params .
  rcp -pr $node:/data/curr/$i/Metadata/*.py .
  rsh -n $node rm -rf /data/curr/$i
  rsh -n $node rm -rf /data/dest/*/*${job}*
  touch /d0gstar/curr/$i/RCP
fi

batch runs on node

batch_rcp runs on schuur
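In batch_rcp the job id is cut out of the job name with awk's `substr($0, length-8, 9)`, i.e. the last nine characters. A one-line Python equivalent (the helper name `job_id` is ours, for illustration only):

```python
# awk '{print substr($0, length-8, 9)}' keeps the last 9 characters;
# for minbias-02073214824 that is the trailing part of the run number.
def job_id(name):
    return name[-9:]

print(job_id("minbias-02073214824"))  # -> 073214824
```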

#!/bin/sh
locate(){
  file=`grep "import =" import_${1}_${job}.py | awk -F \" '{print $2}'`
  sam locate $file | fgrep -q [
  return $?
}
. /usr/products/etc/setups.sh
setup sam
SAM_STATION=hoeve
export SAM_STATION

tosam=$1
LIST=`cat $tosam`

for job in $LIST
do
  cd /data/disk2/sam_cache/${job}
  list='gen d0g sim'
  for i in $list
  do
    until locate $i || (sam declare import_${i}_${job}.py && locate ${i})
    do sleep 60; done
  done

  list='reco recoanalyze'
  for i in $list
  do
    sam store --descrip=import_${i}_${job}.py --source=`pwd`
    return=$?
    echo Return code sam store $return
  done
done
echo Job finished ...

declare gen, d0g, sim

store reco, recoanalyze

runs on schuur, called by fbs or cron
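The store script's `until locate $i || (sam declare ... && locate $i)` loop keeps retrying until a file is known to SAM before anything is stored. A sketch of that declare-then-recheck pattern, with the `sam locate` and `sam declare` commands replaced by in-memory stubs (the real script shells out to sam and sleeps 60 s between attempts):

```python
# Stub catalogue standing in for the SAM database.
catalogue = set()

def sam_locate(f):          # real script: sam locate $file | fgrep -q [
    return f in catalogue

def sam_declare(f):         # real script: sam declare import_<i>_<job>.py
    catalogue.add(f)
    return True

def ensure_declared(f, max_tries=10):
    """Retry until the file is declared and locatable in SAM."""
    for _ in range(max_tries):
        if sam_locate(f) or (sam_declare(f) and sam_locate(f)):
            return True
        # real script: sleep 60
    return False

print(ensure_declared("import_gen_073214824.py"))  # -> True
```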

Filestream

• Fetch input from sam

• Read input file from schuur

• Process data on node

• Copy output to schuur

[Diagram: filestream processing. mc_runjob on hoeve; fbs submit sends the job to the node, which runs rcp, d0exe, rcp; sam runs on schuur (cron) and attaches the filestream. Control flows from hoeve to the node; data move between node and schuur.]

Analysis on farm

• Stages:
  – Read files from sam
  – Copy files to node(s)
  – Perform analysis on node
  – Copy files to file server
  – Store files in sam
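The slides group these five stages into three fbs sections (1. sam + rcp, 2. analyze, 3. rcp + sam). A small sketch of that grouping; the queue names follow the jdf examples, and the mapping itself is our reading of the slides:

```python
# The five analysis stages grouped into three FBS sections (queue, work).
fbs_sections = {
    1: ("IOQ",   ["read files from sam", "rcp files to node"]),
    2: ("FastQ", ["run analysis on node"]),
    3: ("IOQ",   ["rcp files to file server", "store files in sam"]),
}

# Flatten in section order to recover the five stages.
steps = [s for _, (_, work) in sorted(fbs_sections.items()) for s in work]
print(len(steps))  # -> 5
```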

[Diagram: analysis on farm, first scheme: 1. sam + rcp, 2. analyze, 3. rcp + sam. fbs(1) and fbs(3) run on the file server triviaal (willem:sam, fbsuser:rcp); fbs(2), the analysis program (fbsuser), runs on node-2; input and output move between triviaal and node-2. Data go to the FNAL/SARA datastore (1.2 TB), metadata to the SAM DB; ~100 CPU's; control via fbs.]

SECTION sam
EXEC=/home/willem/batch_sam
NUMPROC=1
QUEUE=IOQ
STDOUT=/home/willem/stdout
STDERR=/home/willem/stdout

#!/bin/sh

. /usr/products/etc/setups.sh
setup sam
SAM_STATION=triviaal
export SAM_STATION

sam run project get_file.py --interactive > log

/usr/bin/rsh -n -l fbsuser triviaal rcp -r /stage/triviaal/sam_cache/boo node-2:/data/test >> log

batch.jdf

batch_sam

[Diagram: analysis on farm, second scheme: 1. sam, 2. rcp + analyze + rcp, 3. rcp + sam. fbs(1) and fbs(3) (willem:sam) run on triviaal; fbs(2) on node-2 does the rcp, runs the analysis program, and copies output back; fbsuser does the fbs submit. Data go to the FNAL/SARA datastore, metadata to the SAM DB; ~100 CPU's; control via fbs.]

SECTION sam
EXEC=/d0gstar/batch_node
NUMPROC=1
QUEUE=FastQ
STDOUT=/d0gstar/stdout
STDERR=/d0gstar/stdout

#!/bin/sh
uname -a
date

rsh -l fbsuser triviaal fbs submit ~willem/batch_node.jdf

#!/bin/sh
. /usr/products/etc/setups.sh
setup fbsng
setup sam
SAM_STATION=triviaal
export SAM_STATION
sam run project get_file.py --interactive > log
/usr/bin/rsh -n -l fbsuser triviaal fbs submit /home/willem/batch_node.jdf

SECTION sam
EXEC=/home/willem/batch
NUMPROC=1
QUEUE=IOQ
STDOUT=/home/willem/stdout
STDERR=/home/willem/stdout

SECTION ana
EXEC=/d0gstar/batch_node
NUMPROC=1
QUEUE=FastQ
STDOUT=/d0gstar/stdout
STDERR=/d0gstar/stdout

#!/bin/sh
rcp -pr server:/stage/triviaal/sam_cache/boo /data/test
. /d0/fnal/ups/etc/setups.sh
setup root -q KCC_4_0:exception:opt:thread
setup kailib
root -b -q /d0gstar/test.C

{
  gSystem->cd("/data/test/boo");
  gSystem->Exec("pwd");
  gSystem->Exec("ls -l");
}

#
# This file sets up and runs a SAM project.
#
import os, sys, string, time, signal
from re import *
from globals import *
import run_project
from commands import *

##########################################
# Set the following variables to appropriate values

# Consult database for valid choices
sam_station = "triviaal"

# Consult database for valid choices
project_definition = "op_moriond_p1014"

# A particular snapshot version, last or new
snapshot_version = 'new'

# Consult database for valid choices
appname = "test"
version = "1"
group = "test"

# The maximum number of files to get from sam
max_file_amt = 5

# for additional debug info use "--verbose"
#verbosity = "--verbose"
verbosity = ""

# Give up on all exceptions
give_up = 1

def file_ready(filename):
    # Replace this python subroutine with whatever you want to do
    # to process the file that was retrieved.
    # This function will only be called in the event of
    # a successful delivery.
    print "File ", filename, " has been delivered!"
#    os.system('cp '+filename+' /stage/triviaal/sam')
    return

get_file.py

Disk partitioning hoeve

/d0
  /fnal
    /d0dist
    /d0usr
    /mcc
      /mcc-dist
      /mc_runjob
      /curr
    /ups
      /db
      /etc
      /prd
    /fbsng

Symbolic links:
/fnal -> /d0/fnal
/d0usr -> /fnal/d0usr
/d0dist -> /fnal/d0dist
/usr/products -> /fnal/ups

ana_runjob

• Is analogous to mc_runjob

• Creates and submits analysis jobs

• Input:
  – get_file.py with SAM project name
    (the project defines the files to be processed)
  – analysis script

Integration with grid (1)

• At present separate clusters:
  – D0, LHCb, Alice, DAS cluster

• hoeve and schuur in farm network

Present network layout

[Diagram: hoeve, schuur, and the nodes hang off one switch; a router connects to hefnet and surfnet; ajax serves NFS.]

New network layout

[Diagram: a farm router connects separate switches for the D0/LHCb and Alice clusters; hoeve, schuur, and booder sit on the farm network; ajax serves NFS; uplinks go to hefnet and the lambda link.]

New network layout

[Diagram: the same layout with the das-2 cluster added behind the farm router; switches for D0/LHCb, Alice, and das-2; hoeve, schuur, booder, ajax (NFS); uplinks to hefnet and lambda.]

Server tasks

• hoeve
  – software server
  – farm server

• schuur
  – file server
  – sam node

• booder
  – home directory server
  – in backup scheme

Integration with grid (2)

• Replace fbs with pbs or condor
  – pbs on Alice and LHCb nodes
  – condor on das cluster

• Use EDG installation tool LCGF
  – Install d0 software with rpm

• Problem with sam (uses ups/upd)

Integration with grid (3)

• Package mcc in rpm

• Separate programs from working space

• Use cfg commands to steer mc_runjob

• Find better place for card files

• Input structure now created on node

Grid job

#!/bin/sh

macro=$1

pwd=`pwd`

cd /opt/fnal/d0/mcc/mcc-dist
. mcc_dist_setup.sh

cd $pwd
dir=/opt/fnal/d0/mcc/mc_runjob/py_script
python $dir/Linker.py script=$macro

[willem@tbn09 willem]$ cat test.pbs
# PBS batch job script
#PBS -o /home/willem/out
#PBS -e /home/willem/err
#PBS -l nodes=1

# Changing to directory as requested by user

cd /home/willem

# Executing job as requested by user

./submit minbias.macro

PBS job submit
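test.pbs above is a fixed file; one could generate such a script per macro. A sketch of that idea (the helper name `make_pbs` and the paths are illustrative, not part of the farm setup):

```python
# Sketch: build a PBS script like test.pbs for a given mc_runjob macro.
def make_pbs(macro, home="/home/willem"):
    lines = [
        "# PBS batch job script",
        "#PBS -o %s/out" % home,
        "#PBS -e %s/err" % home,
        "#PBS -l nodes=1",
        "cd %s" % home,
        "./submit %s" % macro,
    ]
    return "\n".join(lines)

print(make_pbs("minbias.macro"))
```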

RunJob class for grid

class RunJob_farm(RunJob_batch):
    def __init__(self, name=None):
        RunJob_batch.__init__(self, name)
        self.myType = "runjob_farm"

    def Run(self):
        self.jobname = self.linker.CurrentJob()
        self.jobnaam = string.splitfields(self.jobname, '/')[-1]
        comm = 'chmod +x ' + self.jobname
        commands.getoutput(comm)
        if self.tdconf['RunOption'] == 'RunInBackground':
            RunJob_batch.Run(self)
        else:
            bq = self.tdconf['BatchQueue']
            dirn = os.path.dirname(self.jobname)
            print dirn
            comm = 'cd ' + dirn + '; sh ' + self.jobnaam + ' `pwd` >& stdout'
            print comm
            runcommand(comm)

To be decided

• Location of minimum bias files

• Location of MC output

Job status

• Job status is recorded in:
  – fbs
  – /d0/mcc/curr/<job_name>
  – /data/mcc/curr/<job_name>

SAM servers

• On master node:
  – station
  – fss

• On master and worker nodes:
  – stager
  – bbftp