Proposal for Butler DM-4168 "Data repository selection based on version"

I’ve created a proposal for how to implement retrieving repositories from the Butler based on repository version (it would also work with other repository metadata). If you’re interested in such a thing and want to weigh in, please take a look.

The feature is described in LDM-463, section 2.3.6 “Configuration” (note that this links to a ticket-branch version of LDM-463).

The pull request for the implementation (in daf_persistence) is here. You can see a more complete use of the feature in the reposInButler.py test case in daf_persistence.

Please let me know if you have any questions, comments, concerns. Votes in favor gladly accepted too.

thanks,
nate

In particular I need input from @KSK and @rowen, as the feature request came from UW.

It is very hard to understand the use case from the link to LDM-463. Can you describe the problem you are solving?

Also, more generally and less importantly, those examples are not self-contained; in particular you don’t seem to define dp. I’m guessing import lsst.daf.persistence as dp.

The problem is at least somewhat described in DM-4168. In short, the feature request is to be able to retrieve a repository based on given criteria (version, validity date, etc.).

This lays the foundation for that by using the butler to get repositories (actually config descriptors that can be used to instantiate a repository) based on version. The example is perhaps simplistic, but I hope it illustrates how the concept can be expanded in a Butlerish way.
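Roughly, the usage looks like this (a simplified sketch: the dataset type, dataId key, and paths here are illustrative stand-ins rather than the exact interface; the reposInButler.py test shows the complete usage):

    # Sketch only: the 'repoCfg' dataset type, the 'version' dataId key,
    # and the paths are illustrative stand-ins, not the real interface.
    import lsst.daf.persistence as dp

    # Point a butler at a repository whose datasets are repository
    # configs rather than images or catalogs.
    metaButler = dp.Butler(root='/datasets/repoOfRepos')

    # Ask for the config descriptor of the repository matching a version...
    repoCfg = metaButler.get('repoCfg', dataId={'version': 25})

    # ...then instantiate a second butler on the repository it describes.
    dataButler = dp.Butler(root=repoCfg.root)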

Thanks for the point about the missing import statement. I’ve added it.

It would help to know what use case you had in mind when proposing this. However, if it is in the context of supporting time-varying data then I don’t think it will suffice.

A simple use case for time-varying data is as follows: to reprocess data from 5 years ago we want the flat fields, bias frames, etc. that were current when the image was taken. A wrinkle is that there may be other parameters whose values improve with time and knowledge, and we may want to use current (or at least more recent) values for those.

If your proposal is intended to address this use case, and if I am reading it correctly, we would be expected to put each version/date range of every time-varying data product into a different repository. That does not sound like a good fit for calibration data, as we may have updates for different bits of related data (data that belongs in the same repository) at different times. It could probably be done by fragmenting the repos, but it sounds messy and unnatural to me. I think this sort of thing is much better solved by keeping the related data in one place and using a database to sort out which items are wanted.

A more difficult use case that I am concerned about is obtaining suitable color term correction coefficients. This adds at least one additional dimension, because color terms are defined by:

  • the camera and filter used to take the science image
  • the camera and filter of the reference data relative to which the corrections apply

and both of these can have a time-varying element. For example a filter may degrade with time or be swapped out (either in the science camera or the reference camera). I think we will always use the latest reference data when we want the best results (even when reprocessing older science data), but we may also want to try to reproduce old results exactly, in which case we must process it using the older reference data. Again, this sounds like a problem best solved with a database, though it may also benefit from multiple repositories.

My idea was that the updates would go into a new repository, and it would have the older repository as its parent. With the cfg mechanism it should be easier to maintain than setting _parent symlinks; the repository relationship is definable by code and therefore more malleable.

As I understand it the multiple-repository scheme is one we want to move towards, so that integrity of older data is preserved, and newer puts can ‘mask’ instead of ‘overwrite’ data in a repository. In effect older repositories should always (or almost always) be read-only.
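As a self-contained toy model of that masking behavior (this is not the daf_persistence API, just an illustration of the idea): gets fall through to the parent chain, while puts land only in the child, so the parent stays read-only.

    # Toy model, not the daf_persistence API: a cfg-like object whose
    # parent relationship is defined in code rather than a _parent symlink.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class RepoCfg:
        root: str
        parent: Optional["RepoCfg"] = None
        datasets: dict = field(default_factory=dict)

        def get(self, key):
            # Reads search this repo first, then fall through to the parent chain.
            if key in self.datasets:
                return self.datasets[key]
            if self.parent is not None:
                return self.parent.get(key)
            raise KeyError(key)

        def put(self, key, value):
            # Writes land here only; the parent repository is never modified.
            self.datasets[key] = value

    old = RepoCfg(root="cals")
    old.put(("flat", "2020-01-01"), "original flat")

    new = RepoCfg(root="cals_update", parent=old)
    new.put(("flat", "2020-01-01"), "improved flat")  # masks, does not overwrite

    assert new.get(("flat", "2020-01-01")) == "improved flat"
    assert old.get(("flat", "2020-01-01")) == "original flat"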

If you want to keep all the time-varying data products in a single repository, I don’t think there’s anything even now that prevents you from doing that(?).

I chatted with @rowen about this and I think we are hoping @RHL or @jbosch can provide more details about it. I can’t tell how this is different from choosing a repository based on version (or some other temporal criterion such as creation date): newest vs. a specific version.

For the use case I was trying to be very generic. Time-varying data is one example, but really, selecting on any searchable criterion you want to apply to a repository should be doable.

I’ll have to think about this. I can think of a number of applications for which repository versioning might be very useful:

  • calibration data
  • reference catalogs
  • managed reruns (including potentially rerun chains)
  • different externally-calibrated datasets, as in RFC-95

But I’ll have to think a lot more deeply about how to use it for any of these. At the very least we’d have to layer some code on top for any of them, and it’s quite possible that there is a mismatch between the interface we’re imagining for these features and the lower-level implementation @npease has provided here. (Obviously it doesn’t have to do all of the above to be useful, but given that no one seems to remember what this is for anymore, it’s probably worth thinking about all of them.)

And looking to the future, we should probably try to learn from this somehow and get better about tracing requirements to designs to implementations, so we don’t forget what a particular feature is for again. I’m sure at least one failure mode here was people like @jbosch not paying attention to the designs @npease and others were generating at an early enough stage.

Thanks for offering to take some of the blame, @jbosch; I’m also guilty of needing to figure out how to share ideas earlier (I think this means learning to use the RFC mechanism). This proposal does not actually represent a lot of development-hours: it takes advantage of work that was done for https://jira.lsstcorp.org/browse/DM-4683 (Butler support for multiple repositories) and attempts to establish a meta-pattern of finding Butlerish things (i.e. repositories in storage) via the Butler. That said, maybe we’re not as far ahead of ourselves here as you might fear. However, I agree with your point above.

Maybe the right next thing to do is for me to create an RFD and we can do some brainstorming during the Tuesday time slot?

An RFD sounds good, though note that Tuesday next week is already taken (and I should probably take some time to digest your design first; I’m afraid it’s unlikely I’d get that done by Tuesday anyway).

And I should add that I’m glad that you’re thinking about meta-patterns and general solutions here, as we do have a lot of feature requests that we certainly haven’t thought hard about how to unify. It’ll just take a bit more time to connect the dots here.

For the case of data good over a date range it would help to present an example of the layout you have in mind. Here is a proposed simple use case; I don’t claim it is accurate, but it should demonstrate what we need to know: suppose we take flats once a month starting 2020-01-01 and darks once a year starting 2020-01-02. The following demonstrates one way of keeping related data together, which I think would be desirable. The bottom-level directories contain one image per CCD, and might well be symlinks:

cals/
    flats/
        2020-01-01/  # contains one image per CCD
        2020-02-01/
        ...
    darks/
        2020-01-02/

I worry that if you use separate repos for each of these then it could look a lot messier. For instance I don’t think this would be easy to navigate since the user can’t tell which products are contained in which repo:

cals/
cals_2020-01-01/  # no darks taken on this date
    _parent -> cals
    flats/
cals_2020-01-02/  # no flats taken on this date
    _parent -> cals
    darks/

I have two other questions:

  1. How do you intend to resolve date ranges?
  2. How do you intend users to specify dates? Something like a date field in the data ID dict?

Thanks @rowen, this is a great example use case and questions. I’ll work up an explanation.

My idea would be that dates are handled by the existing calibration date range mechanism (lookup in a registry that has validity intervals); versions (e.g. improvements processed with new software) are new repositories and handled by Nate’s new meta-repository search. If some products (e.g. flats vs. darks) are updated (not with new date ranges but with new versions) at different frequencies than others, then we may need to go with Nate’s suggestion of combining multiple repositories (one for each type of product).

I agree; you can continue to use the existing date range search mechanism if that works for you. You don’t have to put calibrations in separate repositories and find them by repo-search if you don’t want to.

@ktl, can you point me at an example of where date range lookup is performed in a registry? (I’ll keep looking through the test code I know about but nothing jumped out at me.)

@ktl I can’t determine what you’re referring to when you say my ‘suggestion of combining multiple repositories’. Can you spell that out a little more for me? Thanks!

obs_cfht has this here.
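In rough, self-contained form (table and column names below are made up for illustration, not copied from the obs_cfht calibration registry), that kind of validity-interval lookup is just:

    # Illustrative only: schema is invented, not taken from obs_cfht.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flat"
                 " (ccd INT, calibDate TEXT, validStart TEXT, validEnd TEXT)")
    conn.execute("INSERT INTO flat"
                 " VALUES (12, '2020-01-01', '2019-12-15', '2020-01-15')")

    # Given a science exposure's observation date, select the flat whose
    # validity interval contains that date.
    obsDate = "2020-01-03"
    row = conn.execute(
        "SELECT calibDate FROM flat"
        " WHERE ccd = ? AND validStart <= ? AND ? <= validEnd",
        (12, obsDate, obsDate)).fetchone()
    print(row)  # ('2020-01-01',)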

We would add versioning as another level (the repository-of-repository-configs level) on top of this.

As an example, you (or an automated process) could create a “version 25 calibrations” repository composed of the following parents: “version 25 flats”, “version 11 biases”, “version A3 darks”.
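In toy form (illustrative names only, not the actual butler interface), that composition behaves like this:

    # Toy model: a composite repository that holds no data itself and
    # only names its per-product parents, searched in order.
    class Repo:
        def __init__(self, name, parents=(), datasets=None):
            self.name = name
            self.parents = list(parents)
            self.datasets = dict(datasets or {})

        def get(self, key):
            # Try local data first, then each parent in order.
            if key in self.datasets:
                return self.datasets[key]
            for parent in self.parents:
                try:
                    return parent.get(key)
                except KeyError:
                    pass
            raise KeyError(key)

    flats_v25 = Repo("flats_v25", datasets={"flat": "flat, version 25"})
    biases_v11 = Repo("biases_v11", datasets={"bias": "bias, version 11"})
    darks_vA3 = Repo("darks_vA3", datasets={"dark": "dark, version A3"})

    calibs_v25 = Repo("calibrations_v25",
                      parents=[flats_v25, biases_v11, darks_vA3])
    assert calibs_v25.get("bias") == "bias, version 11"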