Simplifying and limiting deletion in Gen3 butler repos

jbosch · March 8, 2021, 3:58pm

See the bottom of this post for a glossary that may help Gen3 non-experts understand it.

The problems:

Right now the Gen3 butler has two high-level methods, pruneCollection and pruneDatasets, which try to cover all operations that look or smell like dataset deletion, including all of those handled by the also-public remove, removeCollection, and disassociate methods of Registry. The Butler methods (and their command-line counterparts) have been tough to maintain, test, and use, and I think it’s fundamentally because they try to do too much: we’ve ended up trying to support many operations that no one may ever use, just because they’re the logical combination of various options/arguments we need for other reasons (e.g. unstoring the datasets in a TAGGED collection).

I also think many of those operations need to be removed now to avoid complicating our ownership model in future data repositories that have a real concept of user or group ownership of datasets; if one can modify a dataset via a reference to it from some non-RUN collection, we’ll need many more kinds of ACLs.

In addition, right now we have one particularly important pain point, captured on DM-28857: it’s currently hard to delete the collection structure produced by pipetask (and as of DM-28960, BPS), which involves a CHAINED collection that references both output RUN collections and input collections of many types. One can’t delete the RUN collections first, because that trips a foreign key violation as long as they are referenced by the CHAINED collection, and if one deletes the CHAINED collection first, the easiest way to find those RUN collections also goes away (but note that one doesn’t want to delete the input collections, and butler has no way to tell the difference using the CHAINED collection, so it’s not that easy).
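The ordering problem can be reproduced with a toy schema. The sketch below uses Python's sqlite3 module with hypothetical `collection` and `collection_chain` tables and made-up collection names (the real Registry schema differs). It only illustrates why deleting a RUN first fails, and why deleting the CHAINED collection first loses the list of its children.

```python
import sqlite3

# Hypothetical two-table schema for illustration; the real Registry schema differs.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.execute("CREATE TABLE collection (name TEXT PRIMARY KEY, type TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE collection_chain (
        parent TEXT NOT NULL REFERENCES collection (name),
        position INTEGER NOT NULL,
        child TEXT NOT NULL REFERENCES collection (name),
        PRIMARY KEY (parent, position)
    )
    """
)
conn.executemany(
    "INSERT INTO collection VALUES (?, ?)",
    [
        ("u/someone/out", "CHAINED"),
        ("u/someone/out/run1", "RUN"),  # output of the processing run
        ("inputs/raw", "RUN"),          # input that must survive
    ],
)
conn.executemany(
    "INSERT INTO collection_chain VALUES (?, ?, ?)",
    [("u/someone/out", 0, "u/someone/out/run1"), ("u/someone/out", 1, "inputs/raw")],
)


def try_delete(name):
    """Attempt to delete a collection row; return True on success."""
    try:
        conn.execute("DELETE FROM collection WHERE name = ?", (name,))
        return True
    except sqlite3.IntegrityError:
        return False


def children_of(chain):
    """Return the ordered child names recorded for a chain."""
    rows = conn.execute(
        "SELECT child FROM collection_chain WHERE parent = ? ORDER BY position",
        (chain,),
    )
    return [row[0] for row in rows]


# Deleting the RUN first is rejected while the chain still references it.
run_deleted_first = try_delete("u/someone/out/run1")

# Deleting the chain works, but its membership rows have to go with it...
children_before = children_of("u/someone/out")
conn.execute("DELETE FROM collection_chain WHERE parent = ?", ("u/someone/out",))
chain_deleted = try_delete("u/someone/out")

# ...after which nothing records which RUNs belonged to it.
children_after = children_of("u/someone/out")
```

Note that `children_before` mixes the output RUN with the input collection, so even a saved copy of the chain does not say which children are safe to delete.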

Finally, these methods are designed to encourage only unstoring datasets (while leaving their Registry description), to preserve provenance, but this is premature and annoying to users: they want to fully delete things, because there isn’t actually any provenance to preserve, and I think we need to provide a better way to “hide” collections before we make it too hard to fully delete them. That seems doable via an extra flag column in the collections table, but only with a schema change. Since adding provenance also will require a schema change, we can do those at the same time (later).

The near-term proposal:

We add a new method, Butler.removeRuns, which fully removes one or more RUN-type collections and all of the datasets within them (I’ve started this on DM-29106).
We remove the Butler.pruneCollection method, leaving Butler.removeRuns as the recommended way to delete RUN-type collections and Registry.removeCollection as the way to remove all other kinds of collections (which would no longer involve any kind of dataset deletion, because the references to datasets from those collections don’t imply any ownership that should allow one to do that).
We also remove the Butler.pruneDatasets method, leaving us with no high-level way (for now) to fully delete individual datasets. I don’t think we have a use case for this right now, and I’d like to give us a chance to think about the future ownership model and actual use cases before reintroducing something like it (and I expect it will be replaced by multiple simpler methods for different kinds of deletion, as I am proposing we do now for collections).
We change the deletion logic for collections to allow child collections to be deleted while they are referenced by CHAINED collections, by first replacing them there with a special sentinel “[deleted]” collection, which can be used as a way to notify the owner of the CHAINED collection that this occurred.
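To make the sentinel idea concrete, here is a minimal in-memory sketch. `ToyRegistry` and its `remove_runs` method are hypothetical stand-ins written for this illustration, not the proposed Butler.removeRuns implementation; only the “[deleted]” name comes from the proposal.

```python
DELETED = "[deleted]"  # sentinel collection name from the proposal


class ToyRegistry:
    """In-memory stand-in for the collection tables (hypothetical)."""

    def __init__(self):
        self.runs = {}    # RUN name -> set of dataset IDs it owns
        self.chains = {}  # CHAINED name -> ordered list of child names

    def remove_runs(self, names):
        """Fully remove RUN collections and their datasets."""
        names = set(names)
        missing = names - set(self.runs)
        if missing:
            raise KeyError(f"not RUN collections: {sorted(missing)}")
        # Step 1: swap references in every chain for the sentinel, so no
        # chain is left pointing at a collection that no longer exists.
        for children in self.chains.values():
            children[:] = [DELETED if c in names else c for c in children]
        # Step 2: drop the runs together with the datasets they own.
        for name in names:
            del self.runs[name]


reg = ToyRegistry()
reg.runs = {"u/someone/out/run1": {1, 2}, "u/someone/out/run2": {3}}
reg.chains = {
    "u/someone/out": ["u/someone/out/run2", "u/someone/out/run1", "HSC/defaults"]
}
reg.remove_runs(["u/someone/out/run1"])
```

After the call, the chain keeps its length and ordering, with the sentinel sitting where the removed run used to be, which is what lets the chain's owner see that something was deleted.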

On the command-line side of things,

butler prune-collection and butler prune-datasets would go away;
butler remove-runs and butler remove-collections would be added (the former would delete RUN collections and always delete datasets; the latter would delete non-RUN collections and never delete datasets);
we add a pipetask purge command, which deletes all output-RUN collections and the output CHAINED collection matching the usual pattern;
we add a pipetask cleanup command, which deletes all output-RUN collections that are not referenced by the named CHAINED collection but do match its name pattern (i.e. those left behind by --replace-run without --prune-replaced).

The last two commands belong on pipetask, not butler, because it’s pipetask that defines the naming convention they rely upon to know what to delete. They would not necessarily be able to work on processing runs where --output-run was used to customize the RUN names.
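A rough sketch of the selection logic such commands would need follows. It assumes, purely for illustration, that output RUNs are named `<chain>/<timestamp>` with a timestamp like `20210308T155800Z`; the function name and the collection names are hypothetical.

```python
import re


def split_output_runs(chain_name, chain_children, all_runs):
    """Classify RUN collections that match the chain's name pattern.

    Returns (referenced, orphaned): runs still listed in the chain, and
    runs that match the pattern but are no longer listed in it.
    """
    # Assumed naming convention: "<chain>/<YYYYMMDD>T<HHMMSS>Z".
    pattern = re.compile(re.escape(chain_name) + r"/\d{8}T\d{6}Z")
    matching = [run for run in all_runs if pattern.fullmatch(run)]
    children = set(chain_children)
    referenced = [run for run in matching if run in children]
    orphaned = [run for run in matching if run not in children]
    return referenced, orphaned


all_runs = [
    "u/someone/out/20210301T120000Z",  # left behind by an earlier --replace-run
    "u/someone/out/20210308T155800Z",  # current output run
    "u/someone/out/custom",            # named via --output-run; pattern misses it
    "HSC/raw/all",                     # input, never touched
]
chain_children = ["u/someone/out/20210308T155800Z", "HSC/defaults"]

referenced, orphaned = split_output_runs("u/someone/out", chain_children, all_runs)
```

Under these assumptions a cleanup would act on `orphaned` only, while a purge would act on both lists plus the chain itself. The custom-named run is picked up by neither, which is the --output-run limitation noted above.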

Glossary:

unstore: delete files and Datastore records without (necessarily) deleting Registry records.

forget: delete Datastore records without deleting files or (necessarily) deleting Registry records.

RUN: a kind of butler collection that datasets intrinsically belong to

TAGGED: a kind of butler collection that only references datasets

CHAINED: a kind of butler collection that references other collections

disassociate: remove a reference to a dataset from a TAGGED collection

ktl · March 8, 2021, 4:21pm

That last glossary entry begs the question: what happens when you remove a dataset (with its RUN collection) that has been referenced from a TAGGED collection? The sentinel “[deleted]” collection can’t be used here, as it’s only a dataset.

jbosch · March 8, 2021, 4:26pm

Right now it just silently disappears from the TAGGED collection, because we already have an ON DELETE CASCADE clause on that foreign key. I think we can do the sentinel trick with a special dataset there, too, which would be a bit more user-friendly, but it’s trickier to do and (I think) a lower priority, both because the TAGGED collections are much more rare and because a silent deletion is better than blocking the RUN from being deleted.
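For readers less familiar with ON DELETE CASCADE, the behavior can be seen in a small sqlite3 sketch. The `dataset` and `dataset_tag` tables here are hypothetical simplifications, not the actual Registry tables.

```python
import sqlite3

# Hypothetical simplified tables; the real Registry schema differs.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce FKs
conn.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, run TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE dataset_tag (
        collection TEXT NOT NULL,
        dataset_id INTEGER NOT NULL REFERENCES dataset (id) ON DELETE CASCADE,
        PRIMARY KEY (collection, dataset_id)
    )
    """
)
conn.executemany("INSERT INTO dataset VALUES (?, ?)", [(1, "run1"), (2, "run2")])
conn.executemany(
    "INSERT INTO dataset_tag VALUES (?, ?)",
    [("my/tagged", 1), ("my/tagged", 2)],
)

# Removing every dataset in run1 silently removes its TAGGED membership too.
conn.execute("DELETE FROM dataset WHERE run = ?", ("run1",))
remaining = [
    row[0]
    for row in conn.execute("SELECT dataset_id FROM dataset_tag ORDER BY dataset_id")
]
```

No error is raised and nothing tells the owner of `my/tagged` that an entry vanished, which is the trade-off described above.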

roodman · March 8, 2021, 9:18pm

That glossary is useful to see. Can I ask that you add definitions of a COLLECTION and a DATASET to it?

ktl · March 8, 2021, 9:21pm

DMTN-167 does a pretty good job of that, and it in turn points to the Butler documentation.