Nobody is satisfied with the current `Task` and `CmdLineTask`. But they’re a marked improvement (at least in some important respects) over what we had before, and because some recent criticisms have suggested moving back in a direction similar to what we had before, it was suggested at this week’s DMLT meeting that it would be useful to describe how we got to this point.
What Came Before
Once upon a time, the DM Stack was organized into `Stage`s, using a package called `pex_harness`. These communicated via an object called a `Clipboard`, to which each `Stage` could attach its outputs and from which it could retrieve its inputs. This was all quite friendly to a workflow management system: a full pipeline could be described as just a sequence of `Stage`s, with all execution and I/O fully abstracted. That description lived in a custom configuration format called `Policy` (which you can still find in use in a few dark corners of the stack today).
This was a bit of a pain to run outside a workflow management system (maybe it was impossible? I don’t recall), but the real problem was that it made the algorithmic code impossible to follow: the logic was split across multiple `Stage`s, and the links between those `Stage`s were captured in policy files, not in the code itself. And it turned out those `Stage`s weren’t really all that composable. The idea that they were truly distinct units of processing was in some sense a fallacy, because the algorithmic responsibilities of each `Stage` depended quite a bit on the details of what was done in the previous `Stage`s.
Beginnings of `Task`
The alternative that ultimately grew into `Task` started with a project called “pipette” that began life on the HSC side. It simply connected the same algorithmic code from the `Stage`-based pipeline with a very small number of concrete Python scripts. This made things less pleasant for workflow systems, but it made life much easier for developers. `Task` then started as a transfer of this idea back to the LSST side, with a common base class and some conventions for the callable objects that composed the pipeline. In particular, it added:
- The ability to construct a hierarchy of `Task`s that delegate to each other in a consistent way, and a configuration hierarchy that mirrors it.
- Standardized access to a logger.
- Distinct setup and execution phases, allowing some code (e.g. schema definition) to be executed once before running on any data units.
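The three features above can be illustrated with a toy sketch. This is a hypothetical minimal version of the pattern, not the real `lsst.pipe.base` API: the `Config`, `Task`, `DetectTask`, and `ProcessTask` classes here are invented for illustration.

```python
import logging

class Config:
    """Toy configuration object; the real stack uses pex_config Config classes."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

class Task:
    """Minimal sketch of the pattern: task hierarchy, mirrored config, logger,
    and a one-time setup phase separate from per-data execution."""
    def __init__(self, config, name, parent=None):
        self.config = config
        # Hierarchical names mirror the task hierarchy.
        self.name = name if parent is None else f"{parent.name}.{name}"
        self.log = logging.getLogger(self.name)
        self.setup()  # runs once, before any data units are processed

    def setup(self):
        pass  # e.g. schema definition would go here

    def run(self, data):
        raise NotImplementedError

class DetectTask(Task):
    def run(self, data):
        return [x for x in data if x > self.config.threshold]

class ProcessTask(Task):
    def __init__(self, config, name="process", parent=None):
        super().__init__(config, name, parent)
        # The subtask's config is the matching branch of the parent config,
        # so the config hierarchy mirrors the task hierarchy.
        self.detect = DetectTask(config.detect, "detect", parent=self)

    def run(self, data):
        self.log.info("running on %d values", len(data))
        return self.detect.run(data)

config = Config(detect=Config(threshold=5))
task = ProcessTask(config)
print(task.detect.name)        # hierarchical name: process.detect
print(task.run([1, 6, 9, 3]))  # [6, 9]
```

The point of the sketch is only that all three conventions (delegation, mirrored config, shared logging) fall out of ordinary Python class composition.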
At the same time, we added `CmdLineTask`, a subclass of `Task` that added some convenience code to run a `Task` from the command line, assuming its inputs and outputs could be captured in a single `Butler` dataref.
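A toy version of that wrapper idea might look like the following. This is an assumption-laden sketch, not the real `CmdLineTask` interface: the `make_cmdline_runner` helper, the `--id` flag, and the dictionary standing in for a `Butler` are all invented for illustration.

```python
import argparse

class ProcessTask:
    """Stand-in for a Task whose run() takes data loaded for one dataref."""
    def run(self, data):
        return max(data)

def make_cmdline_runner(task_cls):
    """Toy sketch of the CmdLineTask idea: wrap a Task in argument parsing
    so it can be run from the shell on a single data reference."""
    def main(argv):
        parser = argparse.ArgumentParser()
        parser.add_argument("--id", required=True,
                            help="data reference, e.g. visit=1234")
        args = parser.parse_args(argv)
        # A real CmdLineTask would use a Butler to turn the dataref into
        # inputs; here a lookup table fakes the I/O.
        fake_butler = {"visit=1": [3, 1, 4], "visit=2": [1, 5, 9]}
        return task_cls().run(fake_butler[args.id])
    return main

main = make_cmdline_runner(ProcessTask)
print(main(["--id", "visit=1"]))  # 4
```

The key property the sketch captures is that the algorithmic `Task` knows nothing about the command line or the I/O layer; the wrapper supplies both.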
But the philosophy of `Task` was very simple: we just wanted to write pipelines using regular Python language constructs, and `Task` itself is just a callable that adds some configuration and utilities.
Where `Task` Went Next
Since then, things have evolved in several ways. The most important was the introduction of `ConfigurableField`, which allows any subtask at any level of the hierarchy to be swapped out via a “retarget” option in a config file. This is an extremely powerful feature, and we’ve been using it as a key piece of our customization system without really stopping to define the interfaces of those swappable components. That has made the system quite fragile and hard to follow. In addition, in contrast to `RegistryField`, our other way of configuring plugin code, `ConfigurableField` doesn’t properly preserve configuration settings when components are swapped, which can lead to hard-to-find bugs. We’ve also evolved to a state in which low-level (non-`CmdLineTask`) `Task`s do not use the `Butler`. This makes their inputs and outputs more strictly defined, and it also makes it possible to use them in contexts where a `Butler` is not available.
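The settings-loss failure mode is easier to see in a toy model. This sketch is not the real `pex_config` implementation; `SubtaskSlot` and its `retarget` method are invented stand-ins meant only to show how a wholesale config swap can silently discard a user's overrides.

```python
class SubtaskSlot:
    """Toy stand-in for the ConfigurableField idea: a named slot holding a
    target callable plus its configuration."""
    def __init__(self, target, config):
        self.target = target
        self.config = dict(config)

    def retarget(self, new_target, new_defaults):
        # The pitfall: swapping the target replaces the config wholesale,
        # silently discarding overrides the user had already applied.
        self.target = new_target
        self.config = dict(new_defaults)

def naive_detection(data, threshold=5):
    return [x for x in data if x > threshold]

def fancy_detection(data, threshold=10, smooth=True):
    return [x for x in data if x > threshold]

slot = SubtaskSlot(naive_detection, {"threshold": 5})
slot.config["threshold"] = 2  # user override, e.g. from an earlier config file
slot.retarget(fancy_detection, {"threshold": 10, "smooth": True})
# The override is gone: threshold is silently back to a default.
print(slot.config["threshold"])  # 10
```

In a deep config hierarchy, the user may never notice that their earlier override stopped applying, which is exactly the hard-to-find-bug shape described above.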
The concrete `Task`s and `CmdLineTask` themselves have also evolved quite a bit, often in ways that highlight the lack of defined interfaces for specific concrete `Task`s: many algorithmic features have added implicit dependencies across `Task`s, especially `CmdLineTask`s, making it harder to swap out components safely. That was of course an even bigger problem with `Stage`s, but it’s clear `CmdLineTask`s have a similar problem, largely because their inputs and outputs are still too vaguely defined. It’s even true (to a lesser degree) of lower-level `Task`s, because a Python signature doesn’t fully define an interface.
We’ve also never really settled the question of how to organize a `Task.run()` method. Having multiple methods that `run()` delegates to makes it easier to replace one aspect of a component, because you can inherit from the `Task` and reimplement only one of those methods instead of all of `run()`. One can accomplish the same thing by delegating directly to a subtask in `run()` and then retargeting that subtask, and we haven’t really evaluated the pros and cons of these two approaches. The result is a lack of consistency across concrete `Task`s.
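The two competing organizations can be sketched side by side. These classes (`CalibrateTask`, `DefaultBackground`, etc.) are hypothetical illustrations of the two patterns, not code from the stack.

```python
# Pattern 1: run() delegates to overridable methods; customize by subclassing
# and reimplementing just one step (the template-method style).
class CalibrateTask:
    def run(self, data):
        data = self.subtract_background(data)
        return self.detect(data)

    def subtract_background(self, data):
        return [x - 1 for x in data]

    def detect(self, data):
        return [x for x in data if x > 0]

class NoBackgroundCalibrateTask(CalibrateTask):
    def subtract_background(self, data):  # replace one step only
        return data

# Pattern 2: run() delegates to a subtask attribute; customize by swapping
# (retargeting) the subtask via configuration instead of subclassing.
class DefaultBackground:
    def run(self, data):
        return [x - 1 for x in data]

class CalibrateTask2:
    def __init__(self, background=None):
        # In the real system this slot would be a retargetable config field.
        self.background = background if background is not None else DefaultBackground()

    def run(self, data):
        data = self.background.run(data)
        return [x for x in data if x > 0]

print(CalibrateTask().run([0, 2]))              # [1]
print(NoBackgroundCalibrateTask().run([0, 2]))  # [2]
```

Pattern 1 keeps the whole algorithm readable in one class but couples customization to inheritance; pattern 2 makes swapping configurable at runtime but scatters the algorithm across components, which is the trade-off the paragraph above says we never evaluated.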
Some `CmdLineTask`s have also become a lot less readable as we’ve struggled with code-reuse challenges when running similar algorithms on very different datasets, particularly in what’s arguably the most important task, `ProcessCcdTask`. Most of the work there is delegated to its base class, `ProcessImageTask`, so it can be shared with `ProcessCoaddTask`, and that makes it harder to follow. I actually think a lot of that ugliness will go away as the algorithms for single-frame and deep processing naturally diverge, but to the extent it doesn’t, we should look for other ways to share code.
HSC Batch/Pool Tasks
We’re currently in the process of moving over some code from HSC that includes an even higher-level collection of `Task`s, which aggregate multiple `CmdLineTask`s and run them in parallel on batch systems. This has exposed some subtleties in the differences between running these `CmdLineTask`s individually and all together, and a real deficiency in our ability to warm-start processing (we’re currently relying on config options that tell us whether to reuse data products that already exist on disk instead of regenerating them, which almost inevitably leads to confusion). The batch tasks are also difficult to read, because they’re mostly about setting up pools of processes and passing jobs to them, rather than about the algorithm flow. We might be able to fix this with some (fairly intense) Python syntactic sugar, but it might also be a sign that the code-readability advantage of using direct Python calls to link up pipeline units largely disappears once we get to parallel execution.
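The readability problem can be sketched abstractly. This toy driver (with invented `process_ccd`, `coadd`, and `run_batch` functions) runs serially via plain `map`; the point is that in the real batch tasks the `mapper` would be a process pool's scatter/gather machinery, and that plumbing, not the last two lines, dominates the code.

```python
def process_ccd(ccd_id):
    """Stand-in for one per-CCD CmdLineTask invocation."""
    return ccd_id * ccd_id

def coadd(results):
    """Stand-in for a downstream task consuming all per-CCD outputs."""
    return sum(results)

def run_batch(ccd_ids, mapper=map):
    # In the real batch tasks, `mapper` would scatter jobs to a pool of
    # processes on a batch system; the actual algorithm flow is only the
    # two lines below, and everything else is distribution plumbing.
    per_ccd = list(mapper(process_ccd, ccd_ids))
    return coadd(per_ccd)

print(run_batch(range(4)))  # 0 + 1 + 4 + 9 = 14
```

Hiding the pool behind a `mapper`-style argument is one flavor of the “syntactic sugar” mentioned above; whether it scales to real scatter/gather patterns with inter-job communication is exactly the open question.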
Bottom Line
`Task`s were a big step up from what we had before in terms of readability and reusability (and hence developer efficiency) at a low level, precisely because they used standard Python constructs for calling units of code (i.e. the function call operator) and for defining inputs and outputs (function signatures). They were almost certainly a step back in terms of being able to connect to workflow systems. They were probably a step back for introspection, but I think that was more an oversight in the concrete implementation than a flaw in the approach.
I think it’s clear that at the high level we’ll need more generic interfaces to allow the pipeline to be managed more effectively by workflow systems, and that probably needs to happen at or below the level at which we start doing parallelization. But we need to be very careful to avoid running into the same readability, hidden-dependency, and unusable-without-workflow-tools problems we had before we switched to `Task`.