Second PANDA Grid Data Challenge

The challenge is scheduled for November 17-21, 2008. Site admins are kindly asked to update their sites with the required packages by the evening of November 11.

General goals

The main goals are to test the availability of sites, installed software and storage. Testing of package compilation is not part of the data challenge itself.

Site admins are in charge of:

  • making sure all grid services on their sites are up and running
  • making sure all software packages are installed and compiled successfully
  • making sure jobs running on their sites finish and save output successfully

Please read this and other grid wikis attentively and report problems as early as possible.

Schedule

The job submission will start on November 17, at 08:00 AM Vienna time.

Timeline:

17.10.08             . agreement to have DC02 on November 17-21           PANDAGrid workshop, Sinaia
-29.10.08            . agreement on packages to use                       Paul, Johan, Tobias
29.10.08             . email announcement to all site admins with         Paul
                       instructions for DC02 preparation
-11.11.08            . evaluation of the status (packages, sites)         Paul, Dan
                     . selection of macros to run                         Paul, Tobias, Johan, Dan
                     . installation of packages on all sites              site managers
                     . preparation of jdls                                Paul, Johan, Dan
                     . check availability of all sites                    site admins
                     . definition of job submission schedule              Paul, Dan

17-20.11.08          . job execution                                      Paul

21.11.08-            . evaluation of results                              Paul, Dan, Johan, Tobias 

Links

Prerequisites

Some older alien versions have a bug which causes jobs to end with an ERROR_IB. Please update all sites to the latest version of alien (currently v2-15.61 or later).

Alien can be updated with the following procedure:

> wget http://alien.cern.ch/alien-auto-installer
> chmod +x alien-auto-installer
> ./alien-auto-installer

With this the latest version will be installed.

ATTENTION: In $ALIEN_ROOT/../ the directory alien.v2-15 will be modified. Make a copy of this directory first if you would like to preserve your current alien version.

Packages

For DC02 the following software packages need to be installed and working on all sites:

  pbarprod@mlcert::1.0
  pbarprod@panda_extern::jul08
  pbarprod@july08_patch1_jgm::v1.0
  pbarprod@july08_patch2_jgm::v1.0
  pbarprod@urqmd::root520
  pbarprod@tobias::v3.5

ATTENTION: After every alien update you have to redo the installation of mlcert. Please see the notes here.

The status of the packages can be seen at http://mlr2.gla.ac.uk:7001/packages/list.jsp

Testing Packages

The jdl /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl can be used to test the physics packages.

Usage: submit testMvdUrqmdSim.jdl [site] [runid]

  [site] : CE (e.g. PANDA::Vienna::smigrid01 or PANDA::GSI::grid8 or ...)
  [runid]: results are saved into directory /panda/user/p/pbarprod/dc02/output/Mvd/run[runid]/

You can get the list of CEs with the alien command services CE.

This script produces 10 UrQMD events (at Ppbar=4.0 GeV/c on a C12 nucleus) and then traces these events through the PANDA detector (using package pbarprod@tobias::v3.5). The job is split into 2 subjobs. Upon successful completion the following three files will have been created for each subjob in directories /panda/user/p/pbarprod/dc02/output/Mvd/run[runid]/[1,2]/

 Urqmd.root
 Mvd_GridUrqmdSim.root
 Mvd_GridUrqmdSimParam.root

The following example commands can be used to test specific sites:

submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::Glasgow::ce2         01
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::Bucharest::panda01   03
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::Vienna::smigrid01    04
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::KVI::kvip81          05
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::Juelich::ce642       06
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::Juelich::ikp663      07
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::GSI::grid8           08
submit /panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl PANDA::Frascati::ed22lf     10
etc.
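
The command lines above follow a fixed pattern, so they can be generated for any list of CEs. The following is only a sketch: the helper name build_test_commands and the consecutive numbering of run ids are assumptions for illustration and are not part of the DC02 tooling. The JDL path and the argument order are taken from the usage line above.

```python
# Hypothetical helper (not part of the DC02 scripts): builds alien "submit"
# command lines for the test JDL, one per computing element (CE).
TEST_JDL = "/panda/user/p/pbarprod/dc02/jdl/testMvdUrqmdSim.jdl"


def build_test_commands(sites, first_runid=1):
    """Return one submit line per CE, using consecutive two-digit run ids."""
    commands = []
    for offset, site in enumerate(sites):
        runid = first_runid + offset  # assumption: run ids are simply counted up
        commands.append(f"submit {TEST_JDL} {site} {runid:02d}")
    return commands


if __name__ == "__main__":
    for line in build_test_commands(
        ["PANDA::Vienna::smigrid01", "PANDA::GSI::grid8"], first_runid=4
    ):
        print(line)
```

The printed lines are meant to be pasted into the alien shell; the actual list of CEs should be taken from the output of services CE.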

To see a list of sites that match your JDL use jobListMatch JOBID, replacing JOBID with the job's id (after submission).

Run list

  • The jdl used to run the jobs is /panda/user/p/pbarprod/dc02/jdl/runMvdUrqmdSim.jdl
      Usage: submit runMvdUrqmdSim.jdl [runid] [nsplit] [nr of events] [momentum in GeV/c] [A] [Z]
      (e.g. submit dc02/jdl/runMvdUrqmdSim.jdl 123 100 5 4.0 12 6)
  • All output goes to /panda/user/p/pbarprod/dc02/output/Mvd/run[runid]/[split]
  • some typical numbers: MvdUrQMD: 100 events in approx. 15 minutes producing 100 MB, 500 events / 2 - 3 h / 650 MB
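
The typical numbers above give a rough idea of how much storage and CPU time a run needs before it is submitted. The sketch below assumes that the quoted figures are per subjob and uses 2.5 h as the midpoint of the quoted "2 - 3 h"; the function name estimate_run is hypothetical.

```python
# Typical per-subjob figures quoted above:
# events per split -> (approx. wall time in hours, output size in MB).
# Assumption: "2 - 3 h" is represented by its midpoint, 2.5 h.
TYPICAL = {
    100: (0.25, 100),
    500: (2.5, 650),
}


def estimate_run(nsplit, events_per_split):
    """Rough total output size and CPU time for one run (all subjobs)."""
    hours, size_mb = TYPICAL[events_per_split]
    return {
        "output_mb": nsplit * size_mb,
        "cpu_hours": nsplit * hours,
    }


if __name__ == "__main__":
    estimate = estimate_run(nsplit=200, events_per_split=500)
    print(f"output: {estimate['output_mb'] / 1000:.0f} GB, "
          f"CPU time: {estimate['cpu_hours']:.0f} h")
```

Under these assumptions a single run with 200 splits of 500 events writes about 130 GB, which is worth comparing against the free space on the storage elements before submission.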

jdl             runid  nr of splits  nr of events/split  momentum [GeV/c]  submit time        MasterJob ID  DONE
MvdUrqmd pbar + Xe132           01.12.2008, 10:00
runMvdUrqmdSim  100    100           100                 2.0               17.11.2008, 08:00  123849        100
                101    100           500                 2.0               17.11.2008, 08:25  124152        99
                102    200           500                 2.0               17.11.2008, 09:05  124455        183
                103    500           500                 2.0               17.11.2008, 21:30  125400        383
                104    100           500                 2.0               20.11.2008, 07:10  128950        98
                105    200           500                 2.0               20.11.2008, 09:00  129257        123
                106    200           500                 2.0               20.11.2008, 16:50  129984        188
                107    200           500                 2.0               20.11.2008, 23:15  130673        196

                200    100           100                 6.2               17.11.2008, 08:01  123950        99
                201    100           500                 6.2               17.11.2008, 08:26  124253        98
                202    200           500                 6.2               17.11.2008, 09:05  124656        192
                203    500           500                 6.2               17.11.2008, 21:30  126903        452
                204    100           500                 6.2               20.11.2008, 07:10  128976        97
                205    200           500                 6.2               20.11.2008, 09:00  129258        12
                206    200           500                 6.2               20.11.2008, 16:50  129987        192
                207    200           500                 6.2               20.11.2008, 23:15  130674        193

                300    100           100                 15.0              17.11.2008, 08:02  124051        100
                301    100           500                 15.0              17.11.2008, 08:26  124354        93
                302    200           500                 15.0              17.11.2008, 09:06  124816        179
                303    500           500                 15.0              17.11.2008, 21:30  126917        168
                304    100           500                 15.0              20.11.2008, 07:10  129017        88
                305    200           500                 15.0              20.11.2008, 09:00  129329        34
                306    200           500                 15.0              20.11.2008, 16:50  130028        198
                307    200           500                 15.0              20.11.2008, 23:15  130679        194
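
The number of events produced can be tallied from the run list. The sketch below assumes that the DONE column counts successfully finished subjobs and that every finished subjob produced the full number of events per split; the name total_events is hypothetical, and the tuples are copied from the table above.

```python
# (runid, events per split, DONE subjobs), copied from the run list above.
RUNS = [
    (100, 100, 100), (101, 500, 99), (102, 500, 183), (103, 500, 383),
    (104, 500, 98), (105, 500, 123), (106, 500, 188), (107, 500, 196),
    (200, 100, 99), (201, 500, 98), (202, 500, 192), (203, 500, 452),
    (204, 500, 97), (205, 500, 12), (206, 500, 192), (207, 500, 193),
    (300, 100, 100), (301, 500, 93), (302, 500, 179), (303, 500, 168),
    (304, 500, 88), (305, 500, 34), (306, 500, 198), (307, 500, 194),
]


def total_events(runs):
    """Sum events over all runs, assuming each DONE subjob is complete."""
    return sum(events * done for _runid, events, done in runs)


if __name__ == "__main__":
    print(total_events(RUNS))
```

With the DONE counts as listed this gives 1'759'900 events. That is somewhat more than the figure quoted in the journal on 21.11.08, which would be consistent with jobs still running at that time.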

Accessing results

When the DC02 is finished the resulting data files will be grouped into collections. The results of each runid will be packed into a collection using

find -c dc02_run[runid]_collection . run[runid]/*/dc2_MvdUrqmd_*.zip

The collections will be placed into directory /panda/user/p/pbarprod/dc02/output/Mvd/. They can be used for further analysis and can be copied to a local directory with

get dc02_run[runid]_collection <local dir>

Journal

  • 17.11.08, 08:00: first jobs submitted: runs 100, 200, 300 (123849, 123950, 124051)
  • 17.11.08, 08:04: jobs are nicely picked up by sites and are running, kvit14.KVI.nl is producing ERROR_V
  • 17.11.08, 08:20: first jobs of run 100 DONE (in lxpanda.jinr.ru)
  • 17.11.08, 08:26: new jobs submitted: runs 101, 201, 301 (124152, 124253, 124354)
  • 17.11.08, 09:05: currently 210 subjobs are running, need to submit new jobs: runs 102, 202, 302 (124455, 124656, 124816)
  • 17.11.08, 10:19: first jobs of run 101 finished at 09:55
  • 17.11.08, 10:53: started to resubmit subjobs with ERROR_V (123849, 123950, 124051)
  • 17.11.08, 13:13: many (not all) jobs on lxpanda.jinr.ru finish with ERROR_SV (see e.g. run 102/124455)
  • 17.11.08, 13:30: introduced Requirements=member(other.GridPartitions,"Production"); in jdl to prevent kvit14.KVI.nl from running jobs. Partition Production includes: PANDA::Bucharest::panda01, PANDA::KVI::kvip81, PANDA::Juelich::ikp663, PANDA::Juelich::ce642, PANDA::GSI::grid8, PANDA::Ateneo::medgrid, PANDA::Dubna::pbs, PANDA::Ateneo::SGE, PANDA::Torino::PBS, PANDA::Glasgow::fc8, PANDA::Vienna::smigrid02
  • 17.11.08, 14:10: resubmitting subjobs of runs 101, 201, 301 (124152, 124253, 124354) with ERROR_V and ERROR_SV
  • 17.11.08, 14:46: resubmitting subjobs of runs 102 and 202 (124455, 124656) with ERROR_V and ERROR_SV
  • 17.11.08, 15:00: currently 471 jobs are in status WAITING
  • 17.11.08, 19:37: resubmitting subjobs of run 202 (124656) with ERROR_SV, ERROR_E, EXPIRED, ZOMBIE
  • 17.11.08, 21:25: currently 205 jobs are in status WAITING
  • 17.11.08, 21:30: submitting new jobs: runs 103, 203, 303 (125400, 125533, 125695)
  • 18.11.08, 07:05: killed jobs 125533 and 125695 and resubmitted runs 203 and 303 (126903, 126917), now allowing kvit14.KVI.nl to run jobs again
  • 18.11.08, 19:47: resubmitted error-jobs of runs 203 and 303 (126903, 126917)
  • 18.11.08, 20:49: all jobs running at grid8.gsi.de are currently finishing with ERROR_SV
  • 18.11.08, 22:30: the jobs running at grid8.gsi.de are still finishing with ERROR_SV; without grid8.gsi.de only a few jobs can be finished per hour. Since there are still more than 500 subjobs in status WAITING, I postpone the submission of new jobs to tomorrow. By the way, since this afternoon subjobs ending with an error are automatically resubmitted
  • 19.11.08, 10:00: the problem at grid8.gsi.de has been identified (full disk) and is under consideration
  • 19.11.08, 15:00: submitted GSI test jobs runGSI01 (128816), output redirected to PANDA::GSI::virtual
  • 19.11.08, 16:50: killed WAITING subjobs of run 203 (126903) and of run 303 (126917), hope that GSI will now pick up subjobs of runGSI01 (128816)
  • 19.11.08, 17:05: killed all remaining WAITING subjobs except for those of runGSI01 (128816)
  • 19.11.08, 18:34: all subjobs of runGSI01 (128816) finished with ERROR_SV
  • 19.11.08, 18:40: submitted new GSI test jobs runGSI01 (128841), output now redirected to PANDA::GSI::virtual2
  • 19.11.08, 19:36: subjobs of runGSI01 (128841) also finish with ERROR_SV, killed remaining subjobs in status WAITING
  • 19.11.08, 19:37: submitted new GSI test jobs runGSI02 (128929), output now redirected to PANDA::Vienna::file2
  • 19.11.08, 21:00: SE in Dubna is full, but Valery gives us an additional 100GB
  • 20.11.08, 07:10: finally all subjobs of runGSI02 (128929) are DONE, also runs 203 and 303 have finished, time to submit new jobs: runs 104, 204, 304 (128950, 128976, 129017)
  • 20.11.08, 09:00: things work fine, need to submit more jobs: runs 105, 205, 305 (129257, 129258, 129329)
  • 20.11.08, 15:20: disk in Dubna is full, we stopped the CE
  • 20.11.08, 16:50: submitting more jobs: runs 106, 206, 306 (129984, 129987, 130028)
  • 20.11.08, 23:15: there are currently 214 WAITING jobs; to make sure that enough jobs are available for the night I submit new runs 107, 207, 307 (130673, 130674, 130679)
  • 21.11.08, 20:15: there are currently still approximately 100 jobs running and 100 jobs in status WAITING, I will let them finish but will not submit any new jobs, so far we have produced 1'643'900 events!

Evaluation

I propose we use the same evaluation scheme as for DC1. Below is the list, but please suggest or simply add other items you would like to see here.

Site readiness

From this point of view there were three categories of sites:

  1. unmaintained - Pavia, Frascati
  2. down for upgrades - ScotGrid, Glasgow NPE (delay in moving the computers to a new room)
  3. well maintained

And the point here is that all admins did their best to have their sites running. Lots of thanks for that.

Many sites added new resources just before DC2 and all this new hardware performed well: Juelich2, KVI2, Vienna2, Glasgow2 (FORK placeholder for a new PBS cluster).

Software installation

This time the software installation has been finalized well before the DC, and without any difficulties. There's also a new ML feature displaying a table of packages on grid: http://mlr2.gla.ac.uk:7001/packages/list.jsp

The only issue worth mentioning was, during the installation of UrQMD on some systems, a missing symbolic link to libg2c.so in /usr/lib or /usr/lib64.

System performance

Disk storage was the problem this time. We produced a large amount of data, and the sites crunching most of the jobs ran out of disk space by Thursday: GSI had to add a new SE, and Dubna provided 100 GB more, which were filled up within half a day (we will get a new SE in Dubna in January).

Human Error

Nothing worth mentioning this time. Looks like everybody is getting more and more professional about this, and know-how has now spread within the PANDA Grid community. Excellent!

Operation

We would all like to thank Paul Buehler, our production coordinator, for running this DC very professionally! Thanks also to Johan Messchendorp for his help in providing the scripts and JDLs to run. And lots of thanks to Tobias Stockmans for providing the customized software package used.

Data produced

As mentioned in the journal above, we produced more than 1.5M events, still to be evaluated by Tobias.

Plots and statistics

These plots can be viewed by setting the interval to Nov 17, 7:00 to Nov 22, 7:00 on the interactive graphs at http://mlr2.gla.ac.uk:7001/ (feel free to browse yourself).

  • Jobs status during the DC2 week (jobs_status.png)
  • Jobs DONE (history) (jobs_done.png)
  • DONE jobs share (pie chart) (jobs_done_pie.png)
  • RUNNING jobs share (pie chart) (jobs_running_pie.png)
  • RUNNING jobs (history) (jobs_running.png)
  • Jobs with errors (history) (jobs_errors.png)
  • Traffic between storage elements (SE_traffic.png)
  • Files written to SEs (Glasgow SE missing for some reason) (SE_files_written.png)


Attachment                       Size     Date                 Who              Comment
runMvdUrqmdSim.jdl               0.8 K    14 Nov 2008 - 12:49  PaulBuehler      runMvdUrqmdSim.jdl
testMvdUrqmdSim.jdl              0.7 K    15 Nov 2008 - 20:05  PaulBuehler      jdl to test specific site
runMvdUrqmdScript.sh             1.4 K    15 Nov 2008 - 20:06  PaulBuehler      runMvdUrqmdScript.sh
runMvdUrqmdSim.C                 6.6 K    15 Nov 2008 - 20:06  PaulBuehler      runMvdUrqmdSim.C
MvdUrqmd_validation.sh           2.2 K    15 Nov 2008 - 20:07  PaulBuehler      MvdUrqmd_validation.sh
jobs_status.png                  90.1 K   26 Nov 2008 - 13:05  DanProtopopescu  Jobs status during the DC2 week
jobs_done_cumulative.png         38.5 K   26 Nov 2008 - 13:09  DanProtopopescu  Jobs DONE (cumulative)
jobs_done.png                    56.2 K   26 Nov 2008 - 13:12  DanProtopopescu  Jobs DONE (history)
jobs_done_pie.png                51.4 K   26 Nov 2008 - 13:16  DanProtopopescu  DONE jobs share (pie chart)
jobs_running_pie.png             51.9 K   26 Nov 2008 - 13:20  DanProtopopescu  RUNNING jobs share (pie chart)
jobs_running.png                 95.2 K   26 Nov 2008 - 13:21  DanProtopopescu  RUNNING jobs (history)
jobs_errors.png                  67.9 K   26 Nov 2008 - 13:23  DanProtopopescu  Jobs with errors (history)
SE_traffic.png                   108.2 K  26 Nov 2008 - 13:27  DanProtopopescu  Traffic between storage elements
SE_files_written.png             54.9 K   26 Nov 2008 - 13:30  DanProtopopescu  Files written to SEs (Glasgow SE missing for some reason)
panda_extern-jul08-post_install  3.2 K    02 Dec 2008 - 13:17  DanProtopopescu  panda_extern::jul08 alien packman post_install file
