A recent HipChat discussion on Star Selectors brought up some design rules for Tasks
that I’ve long assumed (and I think others have as well) but have never really formally communicated or tried to write up. As these are just my opinions (for now), please feel free to push back if you disagree with any of this.
1. `Task`s should be considered frozen after construction: no new attributes should be added, and no existing attributes should be modified. This further implies that config objects and schemas cannot be modified after `Task` construction. This prevents `Task` state from changing in the course of processing different data units, and hence helps ensure consistent, reproducible processing. I personally think this restriction should extend to caching and lazy operations (if we need a cache, perhaps we need a better place to put it than the `Task`?), but I could be convinced that we should have an exception for these.

2. It must be possible to construct all `Task`s to be used in a production before actually running any of them. Together with (1), this ensures that all schemas and configuration can be realized without having access to any data. Note that this does not require that all `Task`s actually be constructed before running any of them, but that's often a good idea anyway, if for no other reason than to verify that it is possible.

3. Operations should be performed at the highest level possible: calculations that do not require access to data should be performed in `Task` constructors (or earlier), and calculations that apply to several units of related data (such as CCDs within a visit) should be performed once at the higher level (e.g. the visit). Our `Task` framework does not currently provide all the hooks necessary to do this as much as we'd like, but we should expect that it ultimately will, and structure our code accordingly (e.g. by putting conceptually higher-level code in separate methods).
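The spirit of rules (1) and (3) can be sketched in a few lines of plain Python. This is a hypothetical toy task, not the real `lsst.pipe.base` API: everything that can be computed without data happens once in the constructor, and the per-data-unit `run()` method never mutates the task.

```python
# Hypothetical sketch (names are illustrative, not the real lsst.pipe.base API).

class SourceCountTask:
    def __init__(self, config):
        # Constructor-time work: derive everything we can without seeing data.
        self.config = config
        self.threshold = config["threshold"] * config["gain"]

    def run(self, pixels):
        # Per-data-unit work: strictly read-only with respect to self, so
        # running on many data units in any order gives reproducible results.
        return sum(1 for p in pixels if p > self.threshold)

task = SourceCountTask({"threshold": 5.0, "gain": 2.0})
print(task.run([1.0, 12.0, 30.0]))  # counts pixels above the derived threshold
```

Because `run()` only reads `self`, constructing the task up front (rule 2) and reusing it across visits or CCDs cannot change its behavior mid-production.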
These principles frequently require the schemas of input catalogs to be passed to `Task` constructors, to allow them to compute their output schemas. When this happens at the `CmdLineTask` level, `lsst.pipe.base.ButlerInitializedTaskRunner` should be used to pass a `Butler` to the `CmdLineTask` constructor, which can then access the `*_schema` for the catalog dataset it needs. See `MeasureMergedCoaddSourcesTask` in pipe_tasks' `multiBand.py` for an example. The `CmdLineTask` should also take the schema(s) it needs as arguments directly, allowing the `butler` argument to default to `None` when those are provided.
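The constructor pattern described above might look roughly like this. All names here (`MergeTask`, `FakeButler`, the `"src_schema"` dataset) are assumptions for illustration, not the real pipe_tasks API: the task accepts its input schema directly, and falls back to a butler lookup only when the schema is not provided.

```python
# Hypothetical sketch of a Task constructor taking either a schema or a butler.

class MergeTask:
    def __init__(self, schema=None, butler=None):
        if schema is None:
            if butler is None:
                raise ValueError("need either a schema or a butler")
            # In the real stack this would be butler.get("<dataset>_schema").
            schema = butler.get("src_schema")
        # Derive the output schema before any data is seen (rules 1 and 2).
        self.input_schema = schema
        self.output_schema = list(schema) + ["merged_flag"]

class FakeButler:
    """Stand-in for a real Butler, for illustration only."""
    def get(self, name):
        return ["id", "flux"]

task = MergeTask(butler=FakeButler())
print(task.output_schema)  # ['id', 'flux', 'merged_flag']
```

Interactive users can pass `schema=` directly and never touch a butler, while the production runner constructs the task butler-first; both paths produce the same frozen output schema.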
These rules can make our `Task`s harder to use outside of a production setting, and frequently the best way to avoid that is to provide a lot of flexibility in how a `Task` constructor receives information about its predecessors (such as the separate `schema` and `butler` arguments mentioned above). That flexibility comes with its own costs, and I hope that eventually an improved `Task` framework can take up some of this burden. For now, however, I think our policy must be that the burden should be borne by the `Task`s themselves, not by user code.
The fact that Python doesn't make it easy to enforce `Task` freezing after construction at the language level is also problematic; thinking they can modify config values after `Task` construction is one of the most common gotchas for new users of the stack, partly because we don't presently raise any exceptions when they do so. This, and the performance-focused, "no unnecessary operations" philosophy behind (3), make me worry a bit that I might be trying to apply C++ design principles to a language where they simply don't work as well, and that there might be more Pythonic principles we could be using instead that would achieve the same results. But I suspect the reality is that having strongly enforced rules to ensure consistent processing is just inherently unpythonic, and we'll always have to balance Pythonic-ness and rigorous consistency.
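For what it's worth, Python can enforce freezing if we choose to; one possible sketch (again hypothetical, not what the current `Task` base class does) overrides `__setattr__` so that the silent post-construction config change turns into a loud error:

```python
# One possible way to make post-construction modification raise instead of
# failing silently. This is a sketch, not the current Task implementation.

class FrozenAfterInit:
    def __init__(self, config):
        self.config = config
        # Set the flag last: assignments made before it exists are allowed.
        self._frozen = True

    def __setattr__(self, name, value):
        if getattr(self, "_frozen", False):
            raise AttributeError(
                f"cannot set {name!r}: task is frozen after construction"
            )
        object.__setattr__(self, name, value)

task = FrozenAfterInit({"threshold": 5.0})
try:
    task.config = {"threshold": 99.0}  # the classic gotcha: now it raises
except AttributeError as err:
    print(err)
```

The `getattr(self, "_frozen", False)` default is what lets the constructor's own assignments through before the flag is set.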