A recent HipChat discussion on Star Selectors brought up some design rules for Tasks
that I’ve long assumed (and I think others have as well) but have never really formally communicated or tried to write up. As these are just my opinions (for now), please feel free to push back if you disagree with any of this.
1. `Task`s should be considered frozen after construction: no new attributes should be added, and no existing attributes should be modified. This further implies that config objects and schemas cannot be modified after `Task` construction. This prevents `Task` state from changing in the course of processing different data units, and hence helps ensure consistent, reproducible processing. I personally think this restriction should extend to caching and lazy operations (if we need a cache, perhaps we need a better place to put it than the `Task`?), but I could be convinced that we should have an exception for these.

2. It must be possible to construct all `Task`s to be used in a production before actually running any of them. Together with (1), this ensures that all schemas and configuration can be realized without having access to any data. Note that this does not require that all `Task`s actually be constructed before running any of them, but that's often a good idea anyway, if for no other reason than to verify that it is possible.

3. Operations should be performed at the highest level possible: calculations that do not require access to data should be performed in `Task` constructors (or earlier), and calculations that apply to several units of related data (such as CCDs within a visit) should be performed once at the higher level (e.g. the visit). Our `Task` framework does not currently provide all the hooks necessary to do this as much as we'd like, but we should expect that it ultimately will, and structure our code accordingly (e.g. by putting conceptually higher-level code in separate methods).
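The spirit of rules (1) and (3) can be sketched in a few lines of plain Python. This is a hypothetical toy task, not the real `lsst.pipe.base` API: everything that can be computed without data happens once in the constructor, and the per-data-unit `run()` method never mutates the task.

```python
# Hypothetical sketch (names are illustrative, not the real lsst.pipe.base API).

class SourceCountTask:
    def __init__(self, config):
        # Constructor-time work: derive everything we can without seeing data.
        self.config = config
        self.threshold = config["threshold"] * config["gain"]

    def run(self, pixels):
        # Per-data-unit work: strictly read-only with respect to self, so
        # running on many data units in any order gives reproducible results.
        return sum(1 for p in pixels if p > self.threshold)

task = SourceCountTask({"threshold": 5.0, "gain": 2.0})
print(task.run([1.0, 12.0, 30.0]))  # counts pixels above the derived threshold
```

Because `run()` only reads `self`, constructing the task up front (rule 2) and reusing it across visits or CCDs cannot change its behavior mid-production.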
These principles frequently require the schemas of input catalogs to be passed to `Task` constructors, to allow them to compute their output schemas. When this happens at the `CmdLineTask` level, `lsst.pipe.base.ButlerInitializedTaskRunner` should be used to pass a `Butler` to the `CmdLineTask` constructor, which can then access the `*_schema` for the catalog dataset it needs. See `MeasureMergedCoaddSourcesTask` in pipe_tasks' `multiBand.py` for an example. The `CmdLineTask` should also take the schema(s) it needs as arguments directly, allowing the `butler` argument to default to `None` when those are provided.
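The constructor pattern described above might look roughly like this. All names here (`MergeTask`, `FakeButler`, the `"src_schema"` dataset) are assumptions for illustration, not the real pipe_tasks API: the task accepts its input schema directly, and falls back to a butler lookup only when the schema is not provided.

```python
# Hypothetical sketch of a Task constructor taking either a schema or a butler.

class MergeTask:
    def __init__(self, schema=None, butler=None):
        if schema is None:
            if butler is None:
                raise ValueError("need either a schema or a butler")
            # In the real stack this would be butler.get("<dataset>_schema").
            schema = butler.get("src_schema")
        # Derive the output schema before any data is seen (rules 1 and 2).
        self.input_schema = schema
        self.output_schema = list(schema) + ["merged_flag"]

class FakeButler:
    """Stand-in for a real Butler, for illustration only."""
    def get(self, name):
        return ["id", "flux"]

task = MergeTask(butler=FakeButler())
print(task.output_schema)  # ['id', 'flux', 'merged_flag']
```

Interactive users can pass `schema=` directly and never touch a butler, while the production runner constructs the task butler-first; both paths produce the same frozen output schema.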
These rules can make our `Task`s harder to use outside of a production setting, and frequently the best way to avoid that is to provide a lot of flexibility in how a `Task` constructor receives information about its predecessors (such as the separate `schema` and `butler` arguments mentioned above). That flexibility comes with its own costs, and I hope that eventually an improved `Task` framework can take up some of this burden. For now, however, I think our policy must be that the burden should be borne by the `Task`s themselves, not by user code.
The fact that Python doesn't make it easy to enforce `Task` freezing after construction at the language level is also problematic; thinking they can modify config values after `Task` construction is one of the most common gotchas for new users of the stack, partly because we don't presently raise any exceptions when they do so. This, and the performance-focused, "no unnecessary operations" philosophy behind (3), make me worry a bit that I might be trying to apply C++ design principles to a language where they simply don't work as well, and that there might be more Pythonic principles we could be using instead that would achieve the same results. But I suspect the reality is that having strongly enforced rules to ensure consistent processing is just inherently unpythonic, and we'll always have to balance Pythonic-ness and rigorous consistency.
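For what it's worth, Python can enforce freezing if we choose to; one possible sketch (again hypothetical, not what the current `Task` base class does) overrides `__setattr__` so that the silent post-construction config change turns into a loud error:

```python
# One possible way to make post-construction modification raise instead of
# failing silently. This is a sketch, not the current Task implementation.

class FrozenAfterInit:
    def __init__(self, config):
        self.config = config
        # Set the flag last: assignments made before it exists are allowed.
        self._frozen = True

    def __setattr__(self, name, value):
        if getattr(self, "_frozen", False):
            raise AttributeError(
                f"cannot set {name!r}: task is frozen after construction"
            )
        object.__setattr__(self, name, value)

task = FrozenAfterInit({"threshold": 5.0})
try:
    task.config = {"threshold": 99.0}  # the classic gotcha: now it raises
except AttributeError as err:
    print(err)
```

The `getattr(self, "_frozen", False)` default is what lets the constructor's own assignments through before the flag is set.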