Second PANDA Grid Data Challenge
The challenge is scheduled for November 17-21, 2008. Site admins are kindly asked to update their sites with the required packages by the evening of November 11.
General goals
The main goals are to test the availability of sites, installed software, and storage.
Testing of package compilation is not part of the data challenge itself.
Site admins are in charge of:
- making sure all grid services on their sites are up and running
- making sure all software packages are installed and compiled successfully
- making sure jobs running on their sites finish and save output successfully
Please read this and other grid wikis attentively and report problems as early as possible.
Schedule
The job submission will start on November 17, at 08:00 AM Vienna time.
Timeline:
| Date | Task | Who / where |
| 17.10.08 | agreement to have DC02 on November 17-21 | PANDAGrid workshop, Sinaia |
| until 29.10.08 | agreement on packages to use | Paul, Johan, Tobias |
| 29.10.08 | email announcement to all site admins with instructions for DC02 preparation | Paul |
| until 11.11.08 | evaluation of the status (packages, sites) | Paul, Dan |
| | selection of macros to run | Paul, Tobias, Johan, Dan |
| | installation of packages on all sites | site managers |
| | preparation of jdls | Paul, Johan, Dan |
| | check availability of all sites | site admins |
| | definition of job submission schedule | Paul, Dan |
| 17-20.11.08 | job execution | Paul |
| from 21.11.08 | evaluation of results | Paul, Dan, Johan, Tobias |
Links
Prerequisites
Some older alien versions have a bug which causes jobs to end with an ERROR_IB. Please update all sites to the latest version of alien (currently v2-15.61 or later).
Alien can be updated with the following procedure:
> wget http://alien.cern.ch/alien-auto-installer
> chmod +x alien-auto-installer
> ./alien-auto-installer
With this the latest version will be installed.

ATTENTION: The update modifies the directory alien.v2-15 in $ALIEN_ROOT/../. Make a copy of this directory first if you want to preserve your current alien version.
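The backup step mentioned above could be scripted roughly as follows. This is only a sketch: the function name backup_alien_dir and the naming of the backup directory are illustrative assumptions, not part of alien or its installer. It copies the directory alien.v2-15 next to $ALIEN_ROOT into a timestamped sibling directory before the installer is run.

```python
import os
import shutil
import time


def backup_alien_dir(alien_root, name="alien.v2-15"):
    """Copy <alien_root>/../<name> to a timestamped sibling directory.

    Returns the path of the backup, or None if there is nothing to back up.
    The backup naming scheme is an assumption made for this sketch.
    """
    parent = os.path.abspath(os.path.join(alien_root, os.pardir))
    source = os.path.join(parent, name)
    if not os.path.isdir(source):
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = os.path.join(parent, "%s.backup-%s" % (name, stamp))
    # symlinks=True keeps symbolic links as links instead of following them.
    shutil.copytree(source, target, symlinks=True)
    return target


if __name__ == "__main__":
    root = os.environ.get("ALIEN_ROOT")
    if root:
        print(backup_alien_dir(root))
    else:
        print("ALIEN_ROOT is not set")
```

Run it before ./alien-auto-installer; it prints the location of the copy, or None if no alien.v2-15 directory was found.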
Packages
For DC02 the following software packages need to be installed and working on all sites:
pbarprod@mlcert::1.0
pbarprod@panda_extern::jul08
pbarprod@july08_patch1_jgm::v1.0
pbarprod@july08_patch2_jgm::v1.0
pbarprod@urqmd::root520
pbarprod@tobias::v3.5

After every alien update you have to redo the installation of mlcert. Please see the notes here.
The status of the packages can be seen at
http://mlr2.gla.ac.uk:7001/packages/list.jsp
Testing Packages
The jdl /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl can be used to test the physics packages.
Usage: submit testMvdUrqmdSim.jdl [site] [runid]
[site] : CE (e.g. PANDA::Vienna::smigrid01 or PANDA::GSI::grid8 or ...)
[runid]: results are saved into directory /panda/user/p/pbarprod/dc02/output/Mvd/run[runid]/
You can get the list of CEs with the alien command services CE.
This script produces 10 UrQMD events (at Ppbar=4.0 GeV/c on a C12 nucleus) and then traces these events through the PANDA detector (using package pbarprod@tobias::v3.5). The job is split into 2 subjobs. Upon successful completion the following three files will have been created for each subjob in directories /panda/user/p/pbarprod/dc02/output/Mvd/run[runid]/[1,2]/
Urqmd.root
Mvd_GridUrqmdSim.root
Mvd_GridUrqmdSimParam.root
The following example commands can be used to test specific sites:
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::Glasgow::ce2 01
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::Bucharest::panda01 03
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::Vienna::smigrid01 04
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::KVI::kvip81 05
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::Juelich::ce642 06
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::Juelich::ikp663 07
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::GSI::grid8 08
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::Frascati::ed22lf 10
etc.
To see a list of sites that match your JDL, use jobListMatch JOBID, replacing JOBID with the job's id (after submission).
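When many sites have to be tested, the submit lines can be generated instead of typed. The sketch below only builds the command strings in the form shown in the examples above, to be pasted into the alien shell; it does not talk to alien itself. The function name build_submit_commands and the consecutive two-digit run ids are assumptions made for illustration.

```python
TEST_JDL = "/panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl"


def build_submit_commands(sites, first_runid=1, jdl=TEST_JDL):
    """Return one 'submit' line per site, with consecutive two-digit run ids."""
    commands = []
    for offset, site in enumerate(sites):
        runid = "%02d" % (first_runid + offset)
        commands.append("submit %s %s %s" % (jdl, site, runid))
    return commands


if __name__ == "__main__":
    # Site names taken from the examples above; replace with the output of
    # "services CE" for the sites you actually want to test.
    for line in build_submit_commands(["PANDA::Glasgow::ce2", "PANDA::GSI::grid8"]):
        print(line)
```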
Run list
- The jdl used to run the jobs is /panda/user/p/pbarprod/dc02/jdl/runMvdUrqmdSim.jdl
Usage: submit runMvdUrqmdSim.jdl [runid] [nsplit] [nr of events] [momentum in GeV/c] [A] [Z]
(e.g. submit dc02/jdl/runMvdUrqmdSim.jdl 123 100 5 4.0 12 6)
- All output goes to /panda/user/p/pbarprod/dc02/output/Mvd/run[runid]/[split]
- Some typical numbers for MvdUrqmd: 100 events take approx. 15 minutes and produce 100 MB; 500 events take 2-3 h and produce 650 MB
| jdl | runid | nr of splits | nr of events/split | momentum [GeV/c] | submit time | MasterJob ID | DONE |
| MvdUrqmd | pbar + Xe132 | | | | | | 01.12.2008, 10:00 |
| runMvdUrqmdSim | 100 | 100 | 100 | 2.0 | 17.11.2008, 08:00 | 123849 | 100 |
| | 101 | 100 | 500 | 2.0 | 17.11.2008, 08:25 | 124152 | 99 |
| | 102 | 200 | 500 | 2.0 | 17.11.2008, 09:05 | 124455 | 183 |
| | 103 | 500 | 500 | 2.0 | 17.11.2008, 21:30 | 125400 | 383 |
| | 104 | 100 | 500 | 2.0 | 20.11.2008, 07:10 | 128950 | 98 |
| | 105 | 200 | 500 | 2.0 | 20.11.2008, 09:00 | 129257 | 123 |
| | 106 | 200 | 500 | 2.0 | 20.11.2008, 16:50 | 129984 | 188 |
| | 107 | 200 | 500 | 2.0 | 20.11.2008, 23:15 | 130673 | 196 |
| | | | | | | | |
| | 200 | 100 | 100 | 6.2 | 17.11.2008, 08:01 | 123950 | 99 |
| | 201 | 100 | 500 | 6.2 | 17.11.2008, 08:26 | 124253 | 98 |
| | 202 | 200 | 500 | 6.2 | 17.11.2008, 09:05 | 124656 | 192 |
| | 203 | 500 | 500 | 6.2 | 17.11.2008, 21:30 | 126903 | 452 |
| | 204 | 100 | 500 | 6.2 | 20.11.2008, 07:10 | 128976 | 97 |
| | 205 | 200 | 500 | 6.2 | 20.11.2008, 09:00 | 129258 | 12 |
| | 206 | 200 | 500 | 6.2 | 20.11.2008, 16:50 | 129987 | 192 |
| | 207 | 200 | 500 | 6.2 | 20.11.2008, 23:15 | 130674 | 193 |
| | | | | | | | |
| | 300 | 100 | 100 | 15.0 | 17.11.2008, 08:02 | 124051 | 100 |
| | 301 | 100 | 500 | 15.0 | 17.11.2008, 08:26 | 124354 | 93 |
| | 302 | 200 | 500 | 15.0 | 17.11.2008, 09:06 | 124816 | 179 |
| | 303 | 500 | 500 | 15.0 | 17.11.2008, 21:30 | 126917 | 168 |
| | 304 | 100 | 500 | 15.0 | 20.11.2008, 07:10 | 129017 | 88 |
| | 305 | 200 | 500 | 15.0 | 20.11.2008, 09:00 | 129329 | 34 |
| | 306 | 200 | 500 | 15.0 | 20.11.2008, 16:50 | 130028 | 198 |
| | 307 | 200 | 500 | 15.0 | 20.11.2008, 23:15 | 130679 | 194 |
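The table can be tallied to get the number of events per beam momentum. The sketch below assumes that the DONE column counts completed subjobs and that every DONE subjob produced its full number of events; both are assumptions, not something the table states. The names RUNS, MOMENTUM and events_per_momentum are illustrative.

```python
# (runid, events per split, subjobs DONE) copied from the run list table.
RUNS = [
    (100, 100, 100), (101, 500, 99), (102, 500, 183), (103, 500, 383),
    (104, 500, 98), (105, 500, 123), (106, 500, 188), (107, 500, 196),
    (200, 100, 99), (201, 500, 98), (202, 500, 192), (203, 500, 452),
    (204, 500, 97), (205, 500, 12), (206, 500, 192), (207, 500, 193),
    (300, 100, 100), (301, 500, 93), (302, 500, 179), (303, 500, 168),
    (304, 500, 88), (305, 500, 34), (306, 500, 198), (307, 500, 194),
]

# The hundreds digit of the runid encodes the momentum in the table.
MOMENTUM = {1: 2.0, 2: 6.2, 3: 15.0}  # GeV/c


def events_per_momentum(runs):
    """Sum events (events per split * DONE subjobs) for each momentum."""
    totals = {}
    for runid, events_per_split, done in runs:
        momentum = MOMENTUM[runid // 100]
        totals[momentum] = totals.get(momentum, 0) + events_per_split * done
    return totals


if __name__ == "__main__":
    totals = events_per_momentum(RUNS)
    for momentum in sorted(totals):
        print("%5.1f GeV/c: %d events" % (momentum, totals[momentum]))
    print("total: %d events" % sum(totals.values()))
```

Under these assumptions the table gives 645,000 events at 2.0 GeV/c, 627,900 at 6.2 GeV/c and 487,000 at 15.0 GeV/c, i.e. 1,759,900 in total. This reflects the DONE counts in the table and need not agree with counts noted at other times.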
Accessing results
When the DC02 is finished the resulting data files will be grouped into collections. The results of each runid will be packed into a collection using
find -c dc02_run[runid]_collection . run[runid]/*/dc2_MvdUrqmd_*.zip
The collections will be placed into directory /panda/user/p/pbarprod/dc02/output/Mvd/. They can be used for further analysis and can be copied to a local directory with
get dc02_run[runid]_collection <local dir>
Journal
- 17.11.08, 08:00: first jobs submitted: runs 100, 200, 300 (123849, 123950, 124051)
- 17.11.08, 08:04: jobs are nicely picked up by sites and are running; kvit14.KVI.nl is producing ERROR_V
- 17.11.08, 08:20: first jobs of run 100 DONE (on lxpanda.jinr.ru)
- 17.11.08, 08:26: new jobs submitted: runs 101, 201, 301 (124152, 124253, 124354)
- 17.11.08, 09:05: currently 210 subjobs are running, need to submit new jobs: runs 102, 202, 302 (124455, 124656, 124816)
- 17.11.08, 10:19: first jobs of run 101 finished at 09:55
- 17.11.08, 10:53: started to resubmit subjobs with ERROR_V (123849, 123950, 124051)
- 17.11.08, 13:13: many (not all) jobs on lxpanda.jinr.ru finish with ERROR_SV (see e.g. run 102/124455)
- 17.11.08, 13:30: introduced Requirements=member(other.GridPartitions,"Production"); in jdl to prevent kvit14.KVI.nl from running jobs. Partition Production includes: PANDA::Bucharest::panda01, PANDA::KVI::kvip81, PANDA::Juelich::ikp663, PANDA::Juelich::ce642, PANDA::GSI::grid8, PANDA::Ateneo::medgrid, PANDA::Dubna::pbs, PANDA::Ateneo::SGE, PANDA::Torino::PBS, PANDA::Glasgow::fc8, PANDA::Vienna::smigrid02
- 17.11.08, 14:10: resubmitting subjobs of runs 101, 201, 301 (124152, 124253, 124354) with ERROR_V and ERROR_SV
- 17.11.08, 14:46: resubmitting subjobs of runs 102 and 202 (124455, 124656) with ERROR_V and ERROR_SV
- 17.11.08, 15:00: currently 471 jobs are in status WAITING
- 17.11.08, 19:37: resubmitting subjobs of run 202 (124656) with ERROR_SV, ERROR_E, EXPIRED, ZOMBIE
- 17.11.08, 21:25: currently 205 jobs are in status WAITING
- 17.11.08, 21:30: submitting new jobs: runs 103, 203, 303 (125400, 125533, 125695)
- 18.11.08, 07:05: killed jobs 125533 and 125695 and resubmitted runs 203 and 303 (126903, 126917), now allowing kvit14.KVI.nl to run jobs again
- 18.11.08, 19:47: resubmitted error-jobs of runs 203 and 303 (126903, 126917)
- 18.11.08, 20:49: all jobs running at grid8.gsi.de are currently finishing with ERROR_SV
- 18.11.08, 22:30: the jobs running at grid8.gsi.de are still finishing with ERROR_SV. Without grid8.gsi.de only a few jobs can be finished per hour. Since there are still more than 500 subjobs in status WAITING, I postpone the submission of new jobs to tomorrow. By the way, since this afternoon subjobs ending with an error are automatically resubmitted
- 19.11.08, 10:00: the problem at grid8.gsi.de has been identified (full disk) and is being dealt with
- 19.11.08, 15:00: submitted GSI test jobs runGSI01 (128816), output redirected to PANDA::GSI::virtual
- 19.11.08, 16:50: killed WAITING subjobs of run203 (126903) and of run303 (126917), hope that GSI will now pick up subjobs of runGSI01 (128816)
- 19.11.08, 17:05: killed all remaining WAITING subjobs except for those of runGSI01 (128816)
- 19.11.08, 18:34: all subjobs of runGSI01 (128816) finished with ERROR_SV
- 19.11.08, 18:40: submitted new GSI test jobs runGSI01 (128841), output now redirected to PANDA::GSI::virtual2
- 19.11.08, 19:36: subjobs of runGSI01 (128841) also finish with ERROR_SV, killed remaining subjobs in status WAITING
- 19.11.08, 19:37: submitted new GSI test jobs runGSI02 (128929), output now redirected to PANDA::Vienna::file2
- 19.11.08, 21:00: SE in Dubna is full, but Valery gives us an additional 100GB
- 20.11.08, 07:10: finally all subjobs of runGSI02 (128929) are DONE, also runs 203 and 303 have finished, time to submit new jobs: runs 104, 204, 304 (128950, 128976, 129017)
- 20.11.08, 09:00: things work fine, need to submit more jobs: runs 105, 205, 305 (129257, 129258, 129329)
- 20.11.08, 15:20: disk in Dubna is full, we stopped the CE
- 20.11.08, 16:50: submitting more jobs: runs 106, 206, 306 (129984, 129987, 130028)
- 20.11.08, 23:15: there are currently 214 WAITING jobs; to make sure that enough jobs are available for the night I submit new runs 107, 207, 307 (130673, 130674, 130679)
- 21.11.08, 20:15: there are currently still approximately 100 jobs running and 100 jobs in status WAITING. I will let them finish but will not submit any new jobs. So far we have produced 1'643'900 events!
Evaluation
I propose we use the same evaluation scheme as for DC1. Below is the list, but please suggest or simply add other items you would like to see here.
Site readiness
From this point of view there were three categories of sites:
- unmaintained - Pavia, Frascati
- down for upgrades - ScotGrid, Glasgow NPE (delay in moving the computers to a new room)
- well maintained
And the point here is that all admins did their best to have their sites running. Lots of thanks for that.
Many sites added new resources just before DC2 and all this new hardware performed well: Juelich2, KVI2, Vienna2, Glasgow2 (FORK placeholder for a new PBS cluster).
Software installation
This time the software installation has been finalized well before the DC, and without any difficulties. There's also a new ML feature displaying a table of packages on grid:
http://mlr2.gla.ac.uk:7001/packages/list.jsp
The only issue worth mentioning was, during the installation of UrQMD on some systems, a missing symbolic link to libg2c.so in /usr/lib or /usr/lib64.
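A quick check for the missing link could look like the sketch below. It only inspects the two directories named above and reports whether libg2c.so is present. The function names are illustrative, and the assumption that a versioned library matching libg2c.so.* is the file the link should point to has to be verified on each system.

```python
import glob
import os

LIB_DIRS = ("/usr/lib", "/usr/lib64")


def find_libg2c(directories=LIB_DIRS):
    """Return the paths at which libg2c.so exists in the given directories."""
    found = []
    for directory in directories:
        candidate = os.path.join(directory, "libg2c.so")
        # os.path.exists follows symlinks, so a dangling link counts as missing.
        if os.path.exists(candidate):
            found.append(candidate)
    return found


def versioned_candidates(directories=LIB_DIRS):
    """List versioned libraries (libg2c.so.*) a link could point to (assumption)."""
    matches = []
    for directory in directories:
        matches.extend(sorted(glob.glob(os.path.join(directory, "libg2c.so.*"))))
    return matches


if __name__ == "__main__":
    present = find_libg2c()
    if present:
        print("libg2c.so found: %s" % ", ".join(present))
    else:
        print("libg2c.so missing; possible link targets: %s" % versioned_candidates())
```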
System performance
Disk storage was the problem this time. We produced a large amount of data, and the sites crunching most of the jobs ran out of disk space by Thursday: GSI had to add a new SE, Dubna provided 100GB more which were filled up within 1/2 day (we will get a new SE in Dubna in January).
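A rough estimate shows why the additional space filled up so quickly. The sketch below uses the typical figure of about 650 MB of output per 500-event subjob from the run list section and assumes 1 GB = 1000 MB; the function names are illustrative, and the numbers are order-of-magnitude estimates only.

```python
def subjobs_until_full(free_gb, mb_per_subjob=650.0):
    """Number of whole subjobs whose output fits into free_gb (1 GB = 1000 MB)."""
    return int(free_gb * 1000.0 // mb_per_subjob)


def run_output_gb(nsplit, mb_per_subjob=650.0):
    """Approximate output of a run with nsplit subjobs, summed over all sites."""
    return nsplit * mb_per_subjob / 1000.0


if __name__ == "__main__":
    print("subjobs fitting into 100 GB: %d" % subjobs_until_full(100))
    for nsplit in (100, 200, 500):
        print("run with %d subjobs: about %.0f GB" % (nsplit, run_output_gb(nsplit)))
```

With these assumptions, 100 GB holds the output of roughly 150 subjobs of 500 events each, while a single run with 200 subjobs produces about 130 GB in total across all storage elements.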
Human Error
Nothing worth mentioning this time. Looks like everybody is getting more and more professional about this, and know-how has now spread within the PANDA Grid community. Excellent!
Operation
We would all like to thank Paul Buehler, our production coordinator, for running this DC very professionally! Thanks also to Johan Messchendorp for his help in providing the scripts and JDLs to run. And lots of thanks to Tobias Stockmans for providing the customized software package used.
Data produced
As mentioned in the journal above, we produced more than 1.5M events, still to be evaluated by Tobias.
Plots and statistics
These plots can be viewed by setting the interval to Nov 17, 7:00 to Nov 22, 7:00 on the interactive graphs at http://mlr2.gla.ac.uk:7001/ (feel free to browse yourself).
- Jobs status during the DC2 week:
- Jobs DONE (history):
- DONE jobs share (pie chart):
- RUNNING jobs share (pie chart):
- RUNNING jobs (history):
- Jobs with errors (history):
- Traffic between storage elements:
- Files written to SEs (Glasgow SE missing for some reason):