@parejkoj, @ctslater and I were discussing subsectioning repositories today. By that I mean creating, from a full repository, a new repository that contains only the data for some of the datasets in the original.
An example is that I have a large repo that contains calexps, coadds, and all the source files. It would be nice to be able to create another repository with just the coadds so I can copy it over to another location for further analysis.
A better example is that we ran 1200 visits for the Twinkles project. To do anything useful with the catalogs, you also need the calexps, because you need the Calib and Wcs. Since all this was done on remote clusters (SLAC and NERSC), you have to do any analysis there.
It seems like this is just a tooling issue and that the tool would be fairly simple. I could imagine something like this:
$> selectDatasets.py input_repo --id --datasets 'deepCoadd' 'calexp' --output output_repo
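As a rough sketch of how that command line could be parsed, here is a version using Python's standard argparse. The script does not exist yet, so the option semantics are assumptions: in particular, I'm assuming a bare --id means "all data IDs" and that selectors would be given as key=value pairs. parse_data_id is a hypothetical helper.

```python
import argparse


def build_parser():
    """Build a parser for the proposed (hypothetical) selectDatasets.py."""
    parser = argparse.ArgumentParser(
        prog="selectDatasets.py",
        description="Copy selected dataset types into a new repository.",
    )
    parser.add_argument("input_repo", help="path to the full repository")
    # Assumption: a bare --id (no values) selects every data ID.
    parser.add_argument("--id", nargs="*", default=None, metavar="KEY=VALUE",
                        help="data ID selectors; bare --id selects everything")
    parser.add_argument("--datasets", nargs="+", required=True,
                        help="dataset types to copy, e.g. deepCoadd calexp")
    parser.add_argument("--output", required=True,
                        help="path to the new, smaller repository")
    return parser


def parse_data_id(pairs):
    """Turn ["visit=1", "ccd=2"] into {"visit": "1", "ccd": "2"}."""
    return dict(pair.split("=", 1) for pair in pairs)


# Parse the example command from the post, passed as an explicit argv list.
args = build_parser().parse_args(
    ["input_repo", "--id", "--datasets", "deepCoadd", "calexp",
     "--output", "output_repo"]
)
```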
In terms of implementation, I don't think this is much more than a butler.get on the input repo followed by a butler.put on the output repo. The place where this gets tricky is for source catalogs because you want other info to go with the catalogs: e.g. Calib objects and calexp_md. The butler doesn't have the dependency information, so we might need a bit of additional information, "recipes", for subsectioning on some dataset types, but I don't think it's that many. My only worry is that I don't know how the butler handles butler.put('calexp_md').
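To make the get/put idea and the "recipes" concrete, here is a minimal sketch. It assumes only that both butlers accept get(datasetType, dataId) and put(obj, datasetType, dataId). The RECIPES table, the "src" dataset type name, and the DictButler stand-in are all hypothetical and exist only so the example runs without the stack. Whether the real butler accepts a put of calexp_md is still the open question, and this sketch does not answer it.

```python
# Hypothetical recipes: extra dataset types to copy alongside a requested one.
RECIPES = {"src": ["calexp_md"]}


def expand_dataset_types(requested, recipes=RECIPES):
    """Add each requested type's recipe dependencies, keeping order, no repeats."""
    expanded = []
    for dataset_type in requested:
        for name in [dataset_type] + recipes.get(dataset_type, []):
            if name not in expanded:
                expanded.append(name)
    return expanded


def copy_datasets(src_butler, dst_butler, dataset_types, data_ids):
    """Copy every (dataset type, data ID) pair from one butler to another."""
    copied = []
    for dataset_type in expand_dataset_types(dataset_types):
        for data_id in data_ids:
            obj = src_butler.get(dataset_type, data_id)
            dst_butler.put(obj, dataset_type, data_id)
            copied.append((dataset_type, data_id))
    return copied


class DictButler:
    """In-memory stand-in for a butler (NOT the real class), for illustration."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(dataset_type, data_id):
        return (dataset_type, tuple(sorted(data_id.items())))

    def get(self, dataset_type, data_id):
        return self._store[self._key(dataset_type, data_id)]

    def put(self, obj, dataset_type, data_id):
        self._store[self._key(dataset_type, data_id)] = obj


# Usage: asking for "src" also brings along "calexp_md" via the recipe.
src = DictButler()
src.put("catalog for visit 1", "src", {"visit": 1})
src.put("metadata for visit 1", "calexp_md", {"visit": 1})
dst = DictButler()
copied = copy_datasets(src, dst, ["src"], [{"visit": 1}])
```

Keeping the dependency knowledge in a plain table means the copy loop stays generic, and only the handful of dataset types that need companions get an entry.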
Possible near-term solutions
- For catalogs, we could prioritize the ingestion scripts. That would probably suffice, especially if we could easily put things in a SQLite database that we could hand out.
- Figure out how to persist items on their own that are currently persisted with very heavy things like exposures.
- Make butler repositories remotely accessible.
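For the first option, a hand-out SQLite file could be produced with nothing but the Python standard library. This is only a sketch: the table name, the columns, and the sample rows are made up for illustration and are not the schema the ingestion scripts use.

```python
import os
import sqlite3
import tempfile


def write_catalog(db_path, rows):
    """Write (id, ra_deg, dec_deg, flux) tuples to a SQLite file.

    The schema here is hypothetical, chosen only to illustrate the idea.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS source ("
                "id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, flux REAL)"
            )
            conn.executemany("INSERT INTO source VALUES (?, ?, ?, ?)", rows)
    finally:
        conn.close()


# Usage: write two made-up sources, then read the row count back.
db_path = os.path.join(tempfile.mkdtemp(), "catalog.sqlite")
write_catalog(db_path, [(1, 10.5, -3.2, 123.4), (2, 10.6, -3.1, 98.7)])

conn = sqlite3.connect(db_path)
count = conn.execute("SELECT COUNT(*) FROM source").fetchone()[0]
conn.close()
```

The result is a single file that can be copied anywhere and queried without the stack installed.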
I think the only option that solves all the issues mentioned in the thread is the second one. I agree that the third option makes things much easier, but it still means we need to have an internet connection and we need to deal with authentication if we want to give the data away.