I am building a small prototype to uncover pitfalls related to capturing provenance, and here is the first one worth discussing: the same task will run in parallel on many machines. The thinking so far is that we will track the provenance of a group of "things" we process (say, a group of exposures), with each group processed by exactly one node. Then we could keep just the mapping group → node. This approach reduces the volume of provenance we need to track. The issue: we still need the mapping from individual "things" to a group. My current thinking:
For exposures:
- keep the mapping as-is, e.g., a list of exposureId → groupId pairs. This is the most robust option. It yields a non-negligible volume of provenance (~35 GB, assuming ~28 million CCD visits in DR1, up to ~1/2 billion CCD visits in DR11, across 11 data releases), but that is not too bad!
- an alternative: determine the mapping programmatically, e.g., use some hashing scheme to map exposureIds to nodeIds. That may yield an uneven distribution (overloading or underloading some nodes), and, even more importantly, it complicates things for the processing algorithms, which may want to control which exposures are processed together and/or where.
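To make the hashing alternative concrete, here is a minimal sketch of what a programmatic exposureId → nodeId mapping might look like, and how one could check the resulting load distribution. The node count and the use of CRC32 are my assumptions, not a proposal:

```python
import zlib
from collections import Counter

NUM_NODES = 100  # assumed cluster size, for illustration only

def node_for_exposure(exposure_id: int) -> int:
    # Deterministic hash of the id: no per-exposure mapping table needed,
    # but the assignment is fixed by the hash, so the processing
    # algorithms lose control over co-location of exposures.
    return zlib.crc32(str(exposure_id).encode()) % NUM_NODES

# Simulate a batch of exposure ids and inspect the per-node load spread.
counts = Counter(node_for_exposure(i) for i in range(1_000_000))
print("lightest node:", min(counts.values()),
      "heaviest node:", max(counts.values()))
```

Even with a well-behaved hash the spread is only statistically even; a real workload with structured exposure ids could skew it further, which is the overloading/underloading concern above.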
For individual objects/sources/forcedSources, the granularity has to be per-group because of the sheer number of objects/sources. The grouping could be determined based on:
- the exposure a given object/source came from, or
- the location on the sky, or
- something else, but I feel someone with knowledge of the apps algorithms could determine that better than I can…
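As an illustration of the sky-location option, here is a hedged sketch that buckets sources into coarse RA/Dec cells, so provenance would be tracked per cell rather than per source. The 1-degree cell size and the function name are hypothetical choices for the example, not a recommendation:

```python
# Coarse sky grid: 1-degree cells (resolution is an assumption; a real
# scheme would more likely use HEALPix or HTM indexing).
RA_BINS, DEC_BINS = 360, 180

def group_for_position(ra_deg: float, dec_deg: float) -> int:
    # Map RA [0, 360) and Dec [-90, 90) onto integer cell coordinates,
    # then flatten to a single groupId.
    ra_cell = int(ra_deg) % RA_BINS
    dec_cell = int(dec_deg + 90.0)  # shift Dec into [0, 180)
    return dec_cell * RA_BINS + ra_cell

# Sources that are close on the sky land in the same group:
g1 = group_for_position(10.2, -45.7)
g2 = group_for_position(10.8, -45.1)
print(g1 == g2)  # same 1-degree cell
```

The trade-off versus grouping by exposure is that a sky-based grouping is stable across data releases, but a single exposure then spans many groups, which may or may not suit the processing algorithms.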
Anyway, I feel this needs some thinking/discussion!