Proposed date: JTM
Connection: In person
Suggested audience: @gpdf @fritzm @ktl NCSA? who else cares?
I’d like to start a discussion about provenance. In particular I am interested in:
- Describing SQuaRE's requirements for job-level provenance (explanation below)
- Establishing whether DM has already planned this type of capability (eg. through the SLAC provenance work package)
- If not, discussing potential solutions/a way forward.
In our verification workflows (eg. those implemented by SQuaSH) we calculate metrics (not just KPMs, but arbitrary characterisations of performance). However, in consuming those metrics (eg. in creating regression plots, excursion alerts, etc.) we group them by certain common meaningful characteristics. For example, was this AM1 metric calculated with the testdata_hsc dataset or the testdata_cfht dataset? Was it r band only, or all bands? Ultimately we want to compare like-with-like, so we need a definition of “like”.
In our context, establishing “like” is done with what we are calling the “provenance” of the metric. In the supertask/activator paradigm, for those familiar with that, the “provenance” is generally things the activator knows as opposed to things the supertask knows - what data (butler repo, whatever) was I run on? On what OS? What was the configuration passed in? Etc.
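To make that concrete, here is a rough sketch of the kind of activator-level record I have in mind. The function and field names are entirely made up for illustration, not a proposed schema:

```python
# Hypothetical sketch of a job-level provenance record captured by the
# activator (the thing that knows what data, OS, and config a job ran with).
# Field names are illustrative only, not a proposed schema.
import platform


def capture_job_provenance(dataset, bands, config):
    """Collect the activator-level facts needed to group metrics later."""
    return {
        "dataset": dataset,         # eg. which butler repo / test dataset
        "bands": bands,             # which filters the job processed
        "os": platform.platform(),  # execution environment
        "config": config,           # configuration passed in to the supertask
    }


record = capture_job_provenance(
    dataset="testdata_cfht",
    bands=["r"],
    config={"doApplyUberCal": False},  # placeholder config, for illustration
)
```

The point is just that all of these facts live with the job, not with any individual supertask.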
Typically in astronomy we think of provenance as the ancestry of, say, a data file (what individual exposures contributed to this mosaic? What version of the mosaic software produced it?). But in our case, the provenance of the metric is really the provenance of the “job” (whether a Jenkins job now or a production workflow system job ultimately). Not only can each job potentially generate metrics with different provenance, but even in cases where the provenance is the same, we track metrics over time by job. This is the essence of a regression plot: here’s the number from this job, here’s the number from yesterday’s job, here’s the number from the day before yesterday’s job, and so on.
Moreover, unlike situations where provenance information is largely forensic (and so does not necessarily justify tooling to access it), for this job-related provenance we need an API to interact with it as part of the normal operation of the verification system. Hey, I just calculated this metric, do I group it with the “AM1 from cfht data” bucket, or the “AM1 from hsc data” bucket? Oh, its provenance is testdata_cfht, so “AM1 from cfht data” it is.
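In code, the grouping step could look something like the sketch below. Again, the names (`bucket_key`, the tuple layout) are hypothetical, purely to show the shape of the “which bucket does this measurement belong in?” question:

```python
# Illustrative only: group metric measurements into like-with-like buckets
# keyed on provenance facts. Names here are made up for the example.
from collections import defaultdict


def bucket_key(metric_name, provenance):
    # The distinguishing provenance facts for "AM1 from cfht data" vs
    # "AM1 from hsc data" are the dataset and the band selection.
    return (metric_name, provenance["dataset"], tuple(provenance["bands"]))


buckets = defaultdict(list)

# Fake measurements: (metric name, value, provenance) triples.
measurements = [
    ("AM1", 12.3, {"dataset": "testdata_cfht", "bands": ["r"]}),
    ("AM1", 11.9, {"dataset": "testdata_hsc", "bands": ["r"]}),
    ("AM1", 12.1, {"dataset": "testdata_cfht", "bands": ["r"]}),
]

for name, value, prov in measurements:
    buckets[bucket_key(name, prov)].append(value)

# The two cfht measurements land in one bucket, the hsc one in another,
# so a regression plot drawn per bucket compares like with like.
```

The API question is really about who owns `bucket_key`: the verification tooling needs to be able to ask the provenance system for exactly these facts at metric-ingest time.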
As with other aspects of SQuaSH, we initially assumed that we would be writing a cheap shim and throwing it away later for a planned production system. Today @fritzm and I had a very useful discussion (thanks!) in which we reviewed some of the current plan for the provenance capability planned for development at SLAC (eg. see Jacek’s provenance prototype). It is relatively clear that if Fritz were to act on that design literally (which, I hasten to add, he has not necessarily committed to), it would not fulfil my job-based provenance use cases; as it stands it is very catalogue-row oriented. Also, no general access layer (eg. dax_) had been envisaged.
What I’d like to understand better:
- My immediate need here (since we are shimming what we need anyway to avoid being blocked) is to understand whether I’m developing a throwaway temporary system or whether I [royal I] should be thinking about a production grade system because no other work package will provide the functionality I need.
- My own feeling is that the planned provenance system has enough information in it that a second, additional job-based provenance system would involve too much duplication; so perhaps a way forward is for Fritz to incorporate some of my provenance representation needs, and I could provide the dax_ type API to it? Interested in Architecture’s guidance. I am obviously concerned about having to do unplanned/unfunded work, so overall I’d rather we found a way to address this through an already planned package.
- Timing is also an issue: per Fritz, the provenance work is scheduled very late in Construction, whereas the primary delivery date for verification/QC/etc tooling is ComCam/commissioning. I’d potentially be going into my “operations” phase ahead of that.
- From experience, I would not be surprised if NCSA’s workflow work (whose details I am ignorant of) results in job-level provenance requirements too, so I am keen to avoid duplication there. Are there any other stakeholders lurking?
I welcome comments here, and perhaps interested parties can also meet at JTM if there is sufficient interest. If the answer is “you and Fritz sort it out, we don’t care” that’s fine too, but I would be surprised.