DRP database loading and postprocessing

I’m now starting to flesh out the sections of LDM-151 on DRP postprocessing (catalog-only calculations that happen after all the image processing is complete), and I’ve found that I don’t have a great picture in my head of how this ought to work in the long run (I’m definitely not talking about the next six months here). I’d be curious whether others (perhaps @ktl, @fritzm, and @natepease in particular) already have some ideas.

Here are the things that need to happen:

  • We’ll start producing Sources, DIASources, and DIAObjects relatively early in the processing (one tract of sky at a time is probably the easiest way to think about it). We could immediately load these into temporary tables in the final database, transform them from raw to calibrated units there, and then drop the temporary tables. Or we could do the transformation before ingesting them anywhere. Or we could ingest them into the as-yet-vague internal DRP database, do the transformation there, and then load the final database from it. There are probably other permutations, but they all seem about equally difficult. I think the same questions and possible answers apply to ForcedSource as well, even though we’ll produce that later in the processing. (A sketch of what I mean by the unit transformation appears after this list.)

  • We’ll produce Objects more incrementally (at least from this perspective), and it’s likely we’ll want to de-duplicate Objects between tracts before we finish the pixel-level processing with MultiFit and Forced Photometry. Given that we’ll want to query the still-incomplete Object table during the MultiFit and Forced Photometry pipelines, I think it makes sense to load Object into the internal DRP database after de-duplication; at that point the number of rows in Object will be fixed, and we’ll just be adding or updating columns. Those last few image processing pipelines would add new fields, and we’d again face the same question about when to transform to calibrated quantities. We could also stuff small summaries of the Object table (i.e. just a few fields) into the internal database, and leave the rest in some other filesystem or filesystem-like persistence until we’re ready to load the final database; we don’t need everything in the internal database in order to drive MultiFit and Forced Photometry. (A sketch of the de-duplication I have in mind also appears after this list.)

  • After we’re done with all of the image processing, there will be some catalog-only processing steps needed to fill out the Object catalog (i.e. to add the last few fields). These will essentially involve a series of full-table scans on Object and ForcedSource (ordered by Object), accompanied by some Python/C++ per-record and aggregation code (obviously, we’d express as much of this as possible in SQL, and that might include all of the aggregation, but I’m not sure I can commit to that; the last sketch below shows the pattern I mean). I think we’ll probably want to do these in a pre-release version of the final database, unless the internal DRP database is expected to be similarly large. Is there some sort of framework we’re already planning to provide for running Level 3 non-SQL user code at scale on the database tables? I think we might want to hook into something like that for these steps. And clearly that framework could also be a good way to implement the transformations from raw to calibrated units, if we decide to defer that until after the final DB ingest. Of course, if Level 3 access to the final database is also typically mediated by the Butler, maybe this framework is really no different from the orchestration layer that runs the image processing pipelines.
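
To be concrete about what I mean by the raw-to-calibrated transformation (wherever it ends up running), here’s a minimal sketch. The column names, the structured-array representation, and the single zero point are placeholders, not the real schema or calibration model:

```python
# Hypothetical sketch: transform "raw" per-tract Source measurements into
# calibrated units before (or after) loading them into a database.
# Column names and the zero-point handling are placeholders.
import numpy as np

def transform_sources(raw_sources, zero_point_mag):
    """Convert instrumental fluxes to calibrated magnitudes and errors.

    raw_sources: structured array with 'flux' and 'flux_err' in instrumental units.
    zero_point_mag: photometric zero point of the calibrated image.
    """
    calibrated = np.empty(len(raw_sources), dtype=[("mag", "f8"), ("mag_err", "f8")])
    calibrated["mag"] = np.nan
    calibrated["mag_err"] = np.nan
    flux = raw_sources["flux"]
    good = flux > 0  # non-positive fluxes keep NaN magnitudes
    calibrated["mag"][good] = zero_point_mag - 2.5 * np.log10(flux[good])
    calibrated["mag_err"][good] = (2.5 / np.log(10)
                                   * raw_sources["flux_err"][good] / flux[good])
    return calibrated

# A few fake raw sources with a made-up zero point.
raw = np.array([(1200.0, 30.0), (-5.0, 30.0), (800.0, 25.0)],
               dtype=[("flux", "f8"), ("flux_err", "f8")])
print(transform_sources(raw, zero_point_mag=27.0))
```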
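
Similarly, the inter-tract de-duplication I’m imagining looks roughly like this: keep only the copy of each Object whose position falls inside its tract’s inner (non-overlapping) region. The rectangular region class and the assumption that cross-tract copies have already been matched to a common object_id are simplifications for illustration only:

```python
# Hypothetical sketch of inter-tract Object de-duplication: an Object detected in
# two overlapping tracts keeps only the row whose position falls in its tract's
# inner (non-overlapping) region.  The box-shaped region is a stand-in for
# whatever the skymap actually provides.
class InnerRegion:
    """Toy rectangular sky region; the real test would use the skymap's inner polygons."""
    def __init__(self, ra_min, ra_max, dec_min, dec_max):
        self.bounds = (ra_min, ra_max, dec_min, dec_max)

    def contains(self, ra, dec):
        ra_min, ra_max, dec_min, dec_max = self.bounds
        return ra_min <= ra < ra_max and dec_min <= dec < dec_max

def deduplicate_objects(object_rows, inner_regions):
    """Keep one primary row per object_id: the one inside its tract's inner region."""
    primary = {}
    for row in object_rows:
        if inner_regions[row["tract"]].contains(row["ra"], row["dec"]):
            primary[row["object_id"]] = row
    return list(primary.values())

# Two copies of the same object from adjacent, overlapping tracts.
rows = [
    {"object_id": 42, "tract": 1, "ra": 10.02, "dec": 0.5},
    {"object_id": 42, "tract": 2, "ra": 10.02, "dec": 0.5},
]
regions = {1: InnerRegion(0.0, 10.0, -5.0, 5.0), 2: InnerRegion(10.0, 20.0, -5.0, 5.0)}
print(deduplicate_objects(rows, regions))  # only the tract-2 copy survives
```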
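
And the catalog-only postprocessing pattern is a full scan of ForcedSource ordered by Object with per-Object aggregation, either pushed down to SQL where possible or done in streaming Python/C++ code. Here is a sketch of both flavors, with made-up column names and sqlite standing in for whatever database this actually runs against:

```python
# Hypothetical sketch of the catalog-only postprocessing pattern: a full scan of
# ForcedSource ordered by objectId, with per-Object aggregation.  Table and column
# names are placeholders.
import sqlite3
from itertools import groupby

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE ForcedSource (objectId INTEGER, mjd REAL, psFlux REAL);
    INSERT INTO ForcedSource VALUES
        (1, 59000.1, 10.0), (1, 59001.1, 12.0), (1, 59002.1, 11.0),
        (2, 59000.2, 5.0),  (2, 59001.2, 5.5);
""")

# Flavor 1: aggregation that can be pushed down to SQL as a plain GROUP BY.
print(db.execute("""
    SELECT objectId, COUNT(*) AS nEpochs, AVG(psFlux) AS meanFlux
    FROM ForcedSource GROUP BY objectId
""").fetchall())

# Flavor 2: aggregation that needs arbitrary Python/C++ code; stream the table
# ordered by objectId and group on the fly, so only one Object's sources are
# in memory at a time.
cursor = db.execute("SELECT objectId, mjd, psFlux FROM ForcedSource ORDER BY objectId, mjd")
for object_id, group in groupby(cursor, key=lambda r: r[0]):
    fluxes = [r[2] for r in group]
    # Stand-in for a non-SQL statistic (e.g. some variability measure).
    peak_to_peak = max(fluxes) - min(fluxes)
    print(object_id, peak_to_peak)
```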

Are any of my guesses about how we might do this way off base, or unpleasant surprises to those planning Data Access? Are there any areas where I’ve thrown out a bunch of options but you already have plans for how we should proceed?

I’m busy with the Camera Workshop and so haven’t read through all of this or thought much about it, but here are some quick principles that may help guide the discussion:

  • The “final” database (i.e. the Query Access Database implemented on Qserv) should be thought of as write-once. While it can be incrementally loaded, it is not optimized for updates. In particular, all columns should be available when it is loaded.
  • My understanding is that the transformation from “raw” units to “scientific, calibrated” units is supposed to be done by the framework that @jdswinbank wrote and to occur prior to ingestion into any database.
  • I expect that the internal Data Release Database should be used to perform SQL and spatial queries that we should try not to re-implement in Python or C++ code. It can also be used for UPDATE statements. But my thinking about the implementation of this database is that it uses “conventional” technology and so may not be particularly efficient for the kinds of queries that Qserv can support. In particular, it may not be able to hold all Objects (let alone ForcedSources) in a single database instance.
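
To make the write-once vs. updatable distinction concrete, here is a toy illustration only (placeholder schema, with sqlite standing in for the internal Data Release Database; the real loading mechanism for Qserv is obviously very different):

```python
# Illustrative sketch: columns computed late in DRP are added with ALTER/UPDATE in
# the internal database, while the Query Access Database load happens once, after
# every column is present.  Schema and values are made up.
import sqlite3

internal = sqlite3.connect(":memory:")
internal.executescript("""
    CREATE TABLE Object (objectId INTEGER PRIMARY KEY, ra REAL, dec REAL);
    INSERT INTO Object VALUES (1, 10.0, 0.5), (2, 10.1, 0.6);
""")

# Incremental updates are fine here: add a column filled in by a later pipeline.
internal.execute("ALTER TABLE Object ADD COLUMN multifitFlux REAL")
internal.executemany("UPDATE Object SET multifitFlux = ? WHERE objectId = ?",
                     [(11.0, 1), (5.2, 2)])

# Only the finished rows, with all columns present, go to the one-shot Qserv load
# (represented here by a plain SELECT; the real loader lives elsewhere).
print(internal.execute("SELECT * FROM Object").fetchall())
```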

How are the results of recalibrations to be applied?