Attendees: @ctslater, @mtpatter, @KSK, Felipe M., @jalt, Rahul Biswas (partial), @connolly (partial)
Introduction to thoughts on L1 prompt processing system (Felipe):
James Parsons is doing the work to support the prompt processing system
Production prompt processing will require the following steps:
- Launch jobs via HTCondor
- Utilities developed to support production processing will figure out what the pipeline needs (KSK: I don’t think there is a concrete plan for how this is done) and cache it to local disk
- Spawn the processing job on local resources, running on the locally cached data via a butler pointed at the local repository
- Collect data products (KSK: I’m not clear whether this is the job of the prompt pipelines or of an afterburner)
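The cache-then-process flow above could be sketched schematically as below. Everything here is a hypothetical placeholder (the function names, the file layout, the lambda standing in for a pipeline); no actual project code or butler API is used — just plain file copies to illustrate the shape of the steps.

```python
import shutil
import tempfile
from pathlib import Path

def pre_cache(needed_files, backbone_root, local_repo):
    """Hypothetical stand-in for the production caching utilities:
    copy the inputs a pipeline needs from the data backbone to local disk."""
    local_repo = Path(local_repo)
    local_repo.mkdir(parents=True, exist_ok=True)
    for rel in needed_files:
        src = Path(backbone_root) / rel
        dst = local_repo / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
    return local_repo

def run_job(local_repo, pipeline):
    """Run a pipeline callable against the locally cached repository."""
    return pipeline(local_repo)

# Mock "backbone" with one raw file, then cache it locally and process.
backbone = Path(tempfile.mkdtemp())
(backbone / "raw").mkdir()
(backbone / "raw" / "visit-001.fits").write_text("fake pixels")

repo = pre_cache(["raw/visit-001.fits"], backbone, tempfile.mkdtemp())
result = run_job(repo, lambda r: sorted(p.name for p in r.rglob("*.fits")))
print(result)  # ['visit-001.fits']
```

In production the copy step would be replaced by whatever robust transfer tools the production team provides, and the lambda by the actual prompt pipeline holding a butler on the local repo.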
KSK: How do we do this in development? That is, I don’t want to have to do a pre-caching step if I’m just testing things out or running on a single node.
Felipe: We will absolutely attempt to make the development environment as close to the production environment as possible. This will mean needing a butler that knows how to get data from the data backbone (KSK: we went back and forth on the butler reading data from the data backbone, but I think this is where we ended up. Felipe, correct me if I’m wrong).
Introduction to the L1 Prompt pipelines (Simon):
One of the aspects that came out of the interface discussions is that there are three distinct kinds of information that flow out of the pipelines
- First-class data products — i.e. those defined in the DPDD
- Internal data products — Useful internally to science pipelines: e.g. backgrounds, sources. These are typically considered ephemeral.
- Data that needs to be broadcast outside science pipelines — logs, QA/QC information, possibly information fed back to OCS, etc.
A concrete example of the third type of data is performance statistics: e.g. timing, memory, network, I/O, and CPU usage. We have been handling this with decorators on methods, but if we need to continue to do that, we’ll need a) more decorators and b) to come up with a standard for how pipeline developers should be decorating their code. The other option is to have external instrumentation to measure performance.
** Can we measure performance at the level we want from outside the tasks or do they need to be instrumented from within? (Felipe)
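The decorator approach mentioned above could look something like the following minimal sketch. This is not the project's actual instrumentation; the decorator name, the per-instance `metadata` dict, and `ExampleTask` are all assumptions made up for illustration — only wall-clock timing is shown, though memory/I-O counters would follow the same pattern.

```python
import functools
import time

def time_method(func):
    """Sketch of a performance-instrumentation decorator: record the
    wall-clock time of a task method in the instance's metadata dict."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return func(self, *args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            # 'metadata' here is a hypothetical stand-in for task metadata.
            self.metadata.setdefault(func.__name__ + "_seconds", []).append(elapsed)
    return wrapper

class ExampleTask:
    """Hypothetical task with a decorated entry point."""
    def __init__(self):
        self.metadata = {}

    @time_method
    def run(self, n):
        return sum(range(n))

task = ExampleTask()
total = task.run(1000)
print(total)  # 499500
print(sorted(task.metadata))  # ['run_seconds']
```

The standardization question is exactly what this sketch glosses over: which methods get decorated, what the metadata keys are named, and where the collected numbers are persisted.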
The third type of data is not completely enumerated, and there is no process for enumerating it; complete enumeration may not even be possible. This suggests that we need two things (or possibly just one of them):
- An ability to be flexible about what quantities are persisted by the prompt pipelines: e.g. metadata to logs, intermediate data products that could be used in further analysis
- An analysis pipeline that would operate in parallel to produce other outputs useful outside science pipelines. Code to be written by the AP team. These would be values too costly to compute in the prompt pipelines (the pipelines could miss deadlines).
** Figure out how we plan this part (@gpdf?; @mjuric?)
Similarly, there may be some information we need from the OCS.
- Information required by processing. This information is attached to an exposure; the preferred path for moving it into the processing seems to be to attach it to the exposure metadata.
- Information not necessarily associated with a visit: e.g. slew, dome open, dome closed, the visits scheduled for the next 2 hrs. This needs to be made available, but I don’t know that there is a spec for this.
** Write tickets to request OCS information (@KSK; @connolly? ).
Finally, we note that the solar system ephemeris calculations needed for source association can be done in parallel and may benefit from doing so (i.e. we may have lots of predictions to make in a little time).
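Because each object's prediction is independent of the others, the ephemeris work parallelizes trivially, e.g. with a worker pool. The sketch below uses a toy `predict_position` stand-in (a deterministic fake, not a real orbit propagation) purely to show the fan-out shape.

```python
from concurrent.futures import ThreadPoolExecutor

def predict_position(obj_id, t):
    """Toy stand-in for a real ephemeris calculation (orbit propagation);
    returns a deterministic fake (ra, dec) in degrees for testing."""
    return obj_id, ((obj_id * 7 + t) % 360, (obj_id * 13 + t) % 180 - 90)

def predict_all(object_ids, t, max_workers=4):
    # Predictions are independent per object, so map them across workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda o: predict_position(o, t), object_ids))

positions = predict_all(range(10), t=42)
print(len(positions))  # 10
```

For a CPU-bound real calculation a process pool (or a batch-level split across nodes) would be the more likely choice; the thread pool here just keeps the example self-contained.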
Data I/O abstraction:
As alluded to earlier, there was a lot of back and forth about how data will be retrieved from the data backbone.
The AP team had assumed strongly that we would be able to use the butler to get/put things to and from the data backbone, but Felipe noted that the idea in production was to use other tools to grab the necessary data, pre-cache it to local POSIX storage, and hand the pipeline tasks a butler instantiated on that local repository.
Felipe noted that separate tools for moving data to/from the data backbone are needed for robustness/scalability/reliability reasons. Simon and Colin pointed out that the Butler is an API meant as an abstraction over any underlying storage technology, and that any robust tools could be plugged into the Butler as a backend. There was some pushback to this notion, since the tool developers would also have to implement the Butler parts any time the underlying technology changed.
The AP team thought it would be very useful to have the butler be able to talk to the data backbone for development. Felipe said that would be a worthy goal.
Some of the confusion seems to be related to a conflation of the Butler as an API and the backends. Part of this is related to the fact that the only implemented backend currently is a POSIX filesystem. It was unclear whether that was because we only ever expect to have a POSIX backend.
** Need to follow up with @ktl about having a project policy that there will be an I/O abstraction API with access to the “data backbone” (@KSK )
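The API-vs-backend distinction argued for above can be made concrete with a toy front end over a pluggable storage interface. All names below (`Storage`, `PosixStorage`, `ToyButler`) are invented for illustration and are not the real Butler classes; the POSIX backend is modelled as an in-memory dict for brevity.

```python
import abc

class Storage(abc.ABC):
    """Hypothetical backend interface: the butler-level API stays fixed
    while the storage technology behind it can vary."""
    @abc.abstractmethod
    def read(self, key): ...
    @abc.abstractmethod
    def write(self, key, value): ...

class PosixStorage(Storage):
    """Stand-in for the POSIX backend (a dict instead of a filesystem)."""
    def __init__(self):
        self._files = {}
    def read(self, key):
        return self._files[key]
    def write(self, key, value):
        self._files[key] = value

class ToyButler:
    """Butler-like front end: get/put delegate to whichever backend is plugged in."""
    def __init__(self, storage: Storage):
        self._storage = storage
    def get(self, dataset_type, data_id):
        return self._storage.read((dataset_type, data_id))
    def put(self, obj, dataset_type, data_id):
        self._storage.write((dataset_type, data_id), obj)

butler = ToyButler(PosixStorage())
butler.put({"pixels": [1, 2, 3]}, "calexp", "visit=42")
print(butler.get("calexp", "visit=42"))
```

A data-backbone backend would be a second `Storage` subclass wrapping the production transfer tools, which is the "plug robust tools in behind the Butler" position Simon and Colin argued; the pushback in the room was about who maintains that subclass when the underlying technology changes.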
Alert distribution (with Jason Alt):
Jason: What is the interface between Prompt Processing and Alert Distribution?
Simon: I’m hoping this turns out to be a “butler.put” that talks directly to the Kafka server.
Felipe: Talking from the worker nodes to an external service may be a problem.
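Simon's hoped-for shape — a put-style call whose backend is the Kafka server — might look like the sketch below. `StubProducer` stands in for a real Kafka producer so the example is self-contained; `AlertWriter` and the `alerts` topic name are assumptions, not an agreed interface.

```python
import json

class StubProducer:
    """Stand-in for a real Kafka producer; records what would be sent
    instead of talking to a broker."""
    def __init__(self):
        self.sent = []
    def send(self, topic, value):
        self.sent.append((topic, value))

class AlertWriter:
    """Sketch of a 'butler.put'-style front end whose backend is a Kafka topic."""
    def __init__(self, producer, topic="alerts"):
        self._producer = producer
        self._topic = topic
    def put(self, alert):
        # Serialize the alert packet and hand it to the producer.
        self._producer.send(self._topic, json.dumps(alert).encode("utf-8"))

producer = StubProducer()
writer = AlertWriter(producer)
writer.put({"alertId": 1, "diaSourceId": 99})
print(producer.sent[0][0])  # alerts
```

Felipe's concern applies unchanged here: in production the producer inside `AlertWriter` would need a network path from the worker nodes to the external Kafka cluster.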
Maria would like to have medium/large flavors with large attached ephemeral storage (1TB) available in Nebula.
**Jason Alt will ping @daues to see if we can get flavors of instances with attached ephemeral storage. (@jalt )
Maria’s estimate for the Kafka cluster is that we need something like 3 × 8-core nodes with 1 TB fast attached storage.
Jason: Don’t skimp on hardware since it is less expensive to provide more compute than the time and effort it takes to try to fit things in less space.
Prototype workflow (Felipe):
Currently coded up in Python, calling the command-line tasks in sequence: ingest and processCcd.py. Currently using DECam data. No attempt at scientific validity yet, so some blocks are mocked; the goal is just an end-to-end pipeline to figure out where pieces are missing.
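A sequential driver of that kind could be as simple as the sketch below. The two "tasks" here are trivial Python one-liners standing in for the real ingest and processCcd.py invocations, which would be subprocess calls to the actual stack binaries.

```python
import subprocess
import sys

def run_step(name, argv):
    """Run one command-line task, failing fast on a nonzero exit,
    the way a simple sequential driver would."""
    return subprocess.run(argv, check=True, capture_output=True, text=True)

# Hypothetical stand-ins for the real task invocations (ingest, processCcd.py).
steps = [
    ("ingest", [sys.executable, "-c", "print('ingested')"]),
    ("processCcd", [sys.executable, "-c", "print('processed')"]),
]
outputs = [run_step(name, argv).stdout.strip() for name, argv in steps]
print(outputs)  # ['ingested', 'processed']
```

The `check=True` fail-fast behavior is the main design point: an end-to-end prototype surfaces the first missing piece immediately rather than silently running later steps against bad inputs.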
Prototype datasets
Felipe is currently using DECam.
Simon: We should try to use a dataset where we can get calibration products since that is some of the most complicated I/O we do in the prompt pipelines.
Colin: I think we can get calibs for the HiTS data.
** Get HiTS calibration products (@ctslater?)
** Make sure obs_decam is up to the task of using the calibration products we can get. (@mrawls? )