Last night (prior to the w_2021_25 release) I merged DM-30649, which makes a number of changes to `PipelineTask` execution and `QuantumGraph` generation, some of which are slightly backwards incompatible.

Our code now considers a `Quantum` to have been run successfully if its `<task>_metadata` dataset was successfully written, regardless of whether all other predicted outputs were also written. This opens the door to a lot of useful functionality:
- A `PipelineTask` can now succeed even if it has no work to do, which is a common occurrence in cases where the butler’s conservative spatial relationships overpredict overlaps between skymap regions and observations. A task with no work to do can simply exit without error without writing anything, and it can also raise `pipe.base.NoWorkFound`, which will be caught by the execution system with the same effect: only metadata will be written. A task can also exit early or raise `NoWorkFound` after writing only some of its predicted outputs.
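To make the `NoWorkFound` contract concrete, here is a minimal sketch of a task and a toy harness. The `NoWorkFound` class, `run_task`, and `execute_quantum` below are illustrative stand-ins, not the real `lsst.pipe.base` code; the point is only that the harness treats the exception as a quiet success and still writes the metadata dataset.

```python
class NoWorkFound(Exception):
    """Stand-in for pipe.base.NoWorkFound (illustrative only)."""


def run_task(inputs):
    """Toy task: raises NoWorkFound when no input actually overlaps."""
    overlapping = [d for d in inputs if d.get("overlaps")]
    if not overlapping:
        raise NoWorkFound("No inputs overlap this quantum's region.")
    return {"coadd": len(overlapping)}


def execute_quantum(inputs):
    """Toy harness: NoWorkFound is a success; metadata is always written."""
    written = {}
    try:
        written.update(run_task(inputs))
    except NoWorkFound as err:
        # Not an error: log and fall through to write metadata only.
        print(f"Task found no work to do: {err}")
    # The <task>_metadata dataset marks the quantum as successful,
    # whether or not any other predicted outputs were produced.
    written["task_metadata"] = {"status": "success"}
    return written
```

In this sketch, a quantum whose inputs all fail the overlap test still ends with a metadata dataset and no other outputs, which is exactly the "conditional success" state described above.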
- On the input side, `Input` and `PrerequisiteInput` connections now have a `minimum` parameter (defaults to `1`) that sets a lower bound on the number of datasets per quantum in that connection. It can be set to zero only for `PrerequisiteInput` connections (we already never make quanta with zero datasets in a regular `Input` connection), and greater than one only for connections with `multiple=True`. When a quantum does not satisfy this condition due to a missing `PrerequisiteInput` when building a `QuantumGraph`, we’ll raise `FileNotFoundError`, addressing the long-standing confusing behavior of prerequisites not actually being required. And when a quantum does not satisfy this condition due to a missing `Input` during execution (which can only happen because some upstream task did not produce one of its predicted outputs), the execution harness will skip it while still writing its metadata dataset (and logging accordingly), in effect raising `NoWorkFound` on its behalf.
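The two failure modes of the `minimum` check can be sketched as a single helper. This is not the real `pipe.base` implementation; the function name and signature are invented for illustration, but the branch structure mirrors the behavior described above.

```python
class NoWorkFound(Exception):
    """Stand-in for pipe.base.NoWorkFound (illustrative only)."""


def check_connection(name, found, minimum, during_execution):
    """Enforce a connection's `minimum` on the datasets actually found.

    At QuantumGraph generation time, a shortfall (e.g. a missing
    prerequisite) raises FileNotFoundError.  At execution time, a
    shortfall can only mean an upstream task did not produce a predicted
    output, so it is treated as NoWorkFound on the task's behalf.
    """
    if len(found) >= minimum:
        return found
    if not during_execution:
        raise FileNotFoundError(
            f"Not enough datasets for connection {name!r}: "
            f"got {len(found)}, need at least {minimum}."
        )
    raise NoWorkFound(f"Predicted upstream outputs missing for {name!r}.")
```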
- The options to skip or clobber existing outputs during execution now behave more consistently, and `--clobber-partial-outputs` has been renamed to `--clobber-outputs` to better reflect its new behavior. Both `--clobber-outputs` and `--skip-existing` apply only to datasets in the output `RUN` collection (so they’re only useful with `--extend-run`), and they can now be used in both `QuantumGraph` generation and execution.
- During execution, passing both `--skip-existing` and `--clobber-outputs` will cause successfully-run quanta (those with a `<task>_metadata` dataset in the output `RUN` collection) to be skipped, and incomplete quanta (those with other datasets, but no metadata, in the output `RUN` collection) to be run again after their existing outputs are first deleted. Passing `--skip-existing` alone makes incomplete quanta an error, and passing `--clobber-outputs` alone will clobber and re-run even successful quanta. Passing neither makes existing outputs from either successful or incomplete quanta an error.
- During `QuantumGraph` generation, passing `--skip-existing` will cause successfully-run quanta to be left out of the graph entirely, essentially skipping them in advance and allowing downstream quanta to use their existing outputs (if there are any). Passing `--clobber-outputs` during `QuantumGraph` generation just informs the algorithm that it can expect that option to be passed during execution as well, so it should not raise exceptions when it sees existing datasets that will need to be clobbered in the output collection.
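The execution-time combinations above amount to a small decision table, which can be sketched as follows. The function and its string labels are invented for illustration; "successful" means the `<task>_metadata` dataset exists in the output `RUN` collection, and "incomplete" means other outputs exist there but the metadata does not.

```python
def decide(state, skip_existing, clobber_outputs):
    """Toy decision table for a quantum with pre-existing outputs.

    state: "successful", "incomplete", or "fresh" (no existing outputs).
    Returns the action the execution harness would take.
    """
    if state == "successful":
        if skip_existing:
            return "skip"                 # already done; leave it alone
        if clobber_outputs:
            return "clobber-and-rerun"    # delete outputs, run again
        return "error"                    # existing outputs not allowed
    if state == "incomplete":
        if clobber_outputs:
            return "clobber-and-rerun"    # clean up the partial run
        return "error"                    # partial outputs not allowed
    return "run"                          # nothing exists yet; just run
```

With both flags, successful quanta are skipped and incomplete ones are cleaned up and re-run; with neither, any pre-existing outputs are an error.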
In addition to `NoWorkFound`, there are two more new exceptions that can be raised by a `PipelineTask` to invoke special behavior:
- `pipe.base.RepeatableQuantumError` can be raised to indicate a true failure that should block execution of all downstream quanta, but one that should be repeatable given the same software versions, configuration, and data, and hence should never be automatically retried by the workflow system. It should not be used for environmental issues like out-of-memory conditions, or for cases where we think downstream tasks should proceed but skip the output of this task (that’s actually best described as a `NoWorkFound` condition right now, though we plan to add support for more subtle conditional successes in the future). It is the right exception to raise for algorithmic problems that we hope to fully eliminate from the kinds of data we regularly process; we want to be forced to track down and investigate these when they occur. A failure to fit a PSF model or an astrometry matching problem are probably good examples.
- `pipe.base.InvalidQuantumError` can be raised to indicate a logic problem in the configuration or construction of the pipeline. If possible, workflow systems will kill entire submissions (not just fail downstream quanta) when this exception is encountered, as it’s the kind of thing that will probably force the user to reconfigure and re-submit. Whenever possible, checks that could raise this error should instead be performed during configuration validation (in `Config.validate` methods) or during `QuantumGraph` generation (by overriding `PipelineTaskConnections.adjustQuantum`), and in those contexts any exception can be raised; this error is only for the hopefully-rare cases where we cannot perform those checks until execution is underway.
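A toy `run` method can illustrate how the three exceptions divide up the failure space. The exception classes below are local stand-ins for the real `pipe.base` ones, and the PSF-fitting scenario (including the `psf_order` and `star_count` names) is invented purely for illustration.

```python
class NoWorkFound(Exception):
    """Stand-in: success with nothing to do; only metadata is written."""

class RepeatableQuantumError(Exception):
    """Stand-in: deterministic failure; blocks downstream, never retried."""

class InvalidQuantumError(Exception):
    """Stand-in: pipeline mis-configuration; may kill the submission."""


def run(exposures, psf_order):
    """Toy PSF-fitting task showing where each exception applies."""
    if psf_order < 0:
        # Ideally caught earlier in Config.validate or adjustQuantum;
        # if only detectable here, it is a pipeline logic problem.
        raise InvalidQuantumError("psf_order must be non-negative.")
    if not exposures:
        # Nothing overlaps this quantum: a quiet, conditional success.
        raise NoWorkFound("No overlapping exposures for this quantum.")
    usable = [e for e in exposures if e["star_count"] > psf_order]
    if not usable:
        # Same data + same config will always fail this way, so it is
        # repeatable and should not be automatically retried.
        raise RepeatableQuantumError("Too few stars to fit the PSF model.")
    return {"psf_models": len(usable)}
```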
Finally, it’s worth noting that this metadata-dataset definition of success is probably just a placeholder; eventually we will want to record more detailed status for each quantum, beyond what can be represented by the presence or absence of a dataset (in particular, we will want to save different error states, and we can’t save those in the metadata if we use the absence of metadata to indicate an error). It may be a placeholder that lasts for a long time, because the middleware team has a lot of other priorities, but if you’re interested in building anything new on top of this success definition, please come talk to us in #dm-middleware-dev on Slack; we’ll probably want to define some more stable APIs for checking quantum status, for which metadata existence is just the current implementation.