As part of the DESC Twinkles project we have been running thousands of batch jobs, both at SLAC and at NERSC. Due to the way the DM butler works (or my limited understanding of it), all of the jobs write to the same output directory, since the next processing stage is to run coadd jobs on the combined output.
When ~100 jobs are writing to the same output directory concurrently we see disk contention problems, typically showing up as spurious crashes at SLAC or as extreme slowness at NERSC (measured as the ratio of CPU time to wall-clock time). In the past (for other experiments) we have overcome such problems by having each job write its output to local scratch space, then copying the output to its final location at the end of the job.
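For concreteness, here is a rough sketch of the scratch-then-copy pattern we have used elsewhere. The paths, the --output option, and the processing command are placeholders for illustration, not actual DM stack interfaces:

    import shutil
    import subprocess
    import tempfile
    from pathlib import Path

    # Placeholder path: the shared output directory on the global file system.
    FINAL_OUTPUT = Path("/project/twinkles/output")


    def run_job(task_cmd):
        """Run one batch job against node-local scratch, then copy the results back."""
        # Give each job its own scratch directory on node-local disk, so the
        # heavy I/O during processing never touches the shared file system.
        with tempfile.TemporaryDirectory(dir="/scratch") as scratch:
            scratch_out = Path(scratch) / "output"
            scratch_out.mkdir()

            # Placeholder for the actual processing step (a DM stack task in
            # our case), pointed at the scratch directory instead of the
            # shared one.
            subprocess.check_call(list(task_cmd) + ["--output", str(scratch_out)])

            # Single bulk copy into the shared output area at the end of the
            # job; this is the step that question (b) below asks about.
            # (dirs_exist_ok requires Python 3.8+.)
            shutil.copytree(scratch_out, FINAL_OUTPUT, dirs_exist_ok=True)

The final copy-back step is the part we are unsure the butler will tolerate.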
So several questions:
a) Are we correct in assuming that all of the output must be written to a single directory so that the next DM stage (the coadds) can read all of it in a single job? If not, what should we be doing instead?
b) Can we write the output to a temporary (scratch) location and then, at the end of the job, do a cp -r to copy it to the single output location, as in the sketch above? If not, is there some other way to achieve the same result?
c) Anything else we should know about running DM jobs at scale?
If useful, you can find a description of the actual jobs we are running here: