I’m curious about the plans for memory management in the new butler, and in particular about when objects will and will not be cached. Apparently the preferred way for one SuperTask to call another (where the second is a sub-task) is for the first to use butler.put for its outputs and the second to use butler.get, instead of passing the data directly. Such data must be cached, both for performance and because the hand-off must work even if the first task’s outputs are not persisted. However, overly aggressive caching will lead to memory problems. For example, a task that produces coadd temporary exposures may write many exposures with one call to run_quantum. If these are cached, we will quickly run out of memory. Another example is the coaddition task, which reads many exposures that must not be cached.
In case this issue has not been thoroughly thought through already, I present some ideas…
A very simple solution is to have the butler never cache data, and instead have the parent task pass the data directly, just like calling any other sub-task. This requires the parent task to pass the input data, and also to pass a butler and data IDs so the sub-task can persist its outputs. It makes the code slightly messier, but it is quite direct and obvious. It also preserves the paradigm that there is only one way to call a sub-task. For a pre-supertask example see the new ProcessCcdTask, which calls several command-line tasks.
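
To make the direct-passing idea concrete, here is a minimal sketch. Everything in it is hypothetical: ToyButler, ProcessTask, CalibrateTask, the dataset type names and the string data ID are stand-ins invented for illustration, not the real butler or SuperTask API. The only point is that the parent hands the sub-task its input object plus a butler and data ID, and the butler itself holds no cache.

```python
class ToyButler:
    """Hypothetical non-caching butler; NOT the real LSST butler API."""

    def __init__(self):
        # Stand-in for the on-disk repository: (dataset_type, data_id) -> object
        self.store = {}

    def put(self, obj, dataset_type, data_id):
        self.store[(dataset_type, data_id)] = obj

    def get(self, dataset_type, data_id):
        return self.store[(dataset_type, data_id)]


class CalibrateTask:
    """Hypothetical sub-task: gets its input directly, persists its own output."""

    def run(self, exposure, butler, data_id):
        calibrated = [pixel * 2 for pixel in exposure]  # placeholder "processing"
        butler.put(calibrated, "calexp", data_id)
        return calibrated


class ProcessTask:
    """Hypothetical parent task that calls the sub-task like any other sub-task."""

    def __init__(self):
        self.calibrate = CalibrateTask()

    def run(self, butler, data_id):
        raw = butler.get("raw", data_id)
        # The data is passed in memory; nothing needs to be cached by the butler.
        return self.calibrate.run(raw, butler, data_id)


butler = ToyButler()
butler.put([1, 2, 3], "raw", "visit=1")
result = ProcessTask().run(butler, "visit=1")
```

The lifetime of the in-memory object is then governed by ordinary Python references in the parent task, so no caching policy is needed.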
Another solution is to have the butler only cache data products that are written by one SuperTask and read by a later one in the same run. The butler will have to be careful to identify those products correctly.
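
A rough sketch of what such selective caching might look like follows. It rests on several assumptions that are mine and not part of any existing design: the class SelectiveCacheButler is hypothetical, the butler is assumed to know up front which dataset types each task registers as inputs and outputs, and a cached object is assumed to be dropped after its first read.

```python
class SelectiveCacheButler:
    """Hypothetical butler that caches only task-to-task hand-off products."""

    def __init__(self, registered_outputs, registered_inputs):
        # Assumption: a dataset type is a hand-off product if some task in
        # this run writes it and some task in this run reads it.
        self.handoff_types = set(registered_outputs) & set(registered_inputs)
        self.cache = {}  # in-memory objects awaiting a later task
        self.disk = {}   # stand-in for the persistent repository

    def put(self, obj, dataset_type, data_id):
        key = (dataset_type, data_id)
        self.disk[key] = obj
        if dataset_type in self.handoff_types:
            self.cache[key] = obj

    def get(self, dataset_type, data_id):
        key = (dataset_type, data_id)
        if key in self.cache:
            # Assumption: each hand-off product is read once, then evicted.
            return self.cache.pop(key)
        return self.disk[key]


butler = SelectiveCacheButler(
    registered_outputs={"calexp", "coaddTempExp"},
    registered_inputs={"raw", "calexp"},
)
butler.put("calexp pixels", "calexp", "visit=1")
butler.put("warp pixels", "coaddTempExp", "visit=1")
cached_keys = set(butler.cache)  # only the calexp is held in memory
```

Note that a rule based only on dataset type is coarse: if the coaddition task ran in the same process, the temporary exposures would match the rule and be cached as well, which is exactly the kind of case the butler would have to identify correctly.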
I prefer the simpler solution. It may lose one of the advantages of the new SuperTask architecture, but from my perspective it retains a primary advantage: SuperTasks register their inputs and outputs with the butler, providing a nice way to see which products are read and written by which tasks.