One of the tasks on my radar for Gen3 Butler development is a refactor of how we specialize various operations for different databases in the `Registry` hierarchy. I already know how I want to do a lot of that, but one broad open question is how best to handle bulk operations across the Python/SQL boundary.
As an example, we have a Python API (`selectDimensions`) that runs a big query and returns an iterator over data structures that contain (among other things) references to datasets. We have another Python API (`associate`) that takes a sequence of references to datasets and adds them to a collection. That means I could combine these with Python code like this:
```python
def extractDatasets(rows):
    # Flatten query rows into a single stream of dataset references.
    for row in rows:
        for dataset in row.datasetRefs.values():
            yield dataset

result = registry.selectDimensions(...)
registry.associate("MyCollection", extractDatasets(result))
```
That all looks quite natural in Python, and the fact that we’re using Python iterators rather than collections means it should scale fine in terms of memory usage even for very large queries.
Under the hood, that actually involves iterating over the results, constructing a bunch of Python objects to represent them (many of which are never used in this example), and then iterating and transforming them again in various ways, just to run one `INSERT` statement for each dataset. (I also think I know how to get that done in one or a few `INSERT`s instead by batching rows up in the last step; let's not worry about that distinction for now.)
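For reference, the batching version would look something like this minimal sketch, assuming SQLAlchemy Core; `connection`, `table`, and the assumption that each row is a dict of column values are all hypothetical stand-ins for the real Registry internals:

```python
import itertools

def batchedInsert(connection, table, rows, batchSize=1000):
    """Insert rows in batches, issuing one executemany-style INSERT
    per batch instead of one statement per row.

    `rows` is assumed to be an iterable of dicts of column values.
    """
    rows = iter(rows)
    while True:
        batch = list(itertools.islice(rows, batchSize))
        if not batch:
            break
        # Passing a list of dicts triggers SQLAlchemy's executemany path.
        connection.execute(table.insert(), batch)
```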
It strikes me that it would potentially be much more efficient if we could do this lazily and keep all of the actual manipulations and iteration in the database, culminating in one `INSERT ... SELECT` operation. I can roughly imagine what it would take to do that:
- Have the `selectDimensions` method return a special iterator that just remembers the query and doesn't actually execute it until you try to iterate over it.
- Move the logic in the `extractDatasets` transformation function into a method on that special iterator class, along with any other transformations we think might be useful, so we can also make them return similarly special iterators based on modified queries.
- Have `associate` check whether what it's given is a special iterator or a regular Python iterable, and give it a special implementation for the former case that does an `INSERT ... SELECT` (a rough sketch of all three pieces follows this list).
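To make that concrete, here is a minimal sketch of what those three pieces might look like, written against SQLAlchemy 1.3-style Core; `LazyQueryIterator`, its `datasets` method, and all table and column names are hypothetical illustrations, not the actual Registry API:

```python
import sqlalchemy

class LazyQueryIterator:
    """Remembers a SELECT and defers execution until iteration."""

    def __init__(self, connection, select):
        self._connection = connection
        self._select = select

    def __iter__(self):
        # The query only runs when someone actually iterates.
        return iter(self._connection.execute(self._select))

    def datasets(self):
        # Transformations return a new lazy iterator over a modified
        # query; this plays the role of extractDatasets() above.
        return LazyQueryIterator(
            self._connection,
            self._select.with_only_columns([self._select.columns["dataset_id"]]),
        )


def associate(connection, membersTable, collection, refs):
    if isinstance(refs, LazyQueryIterator):
        # Nothing ever leaves the database: one INSERT ... SELECT.
        connection.execute(
            membersTable.insert().from_select(
                ["collection", "dataset_id"],
                sqlalchemy.select([
                    sqlalchemy.literal(collection).label("collection"),
                    refs._select.columns["dataset_id"],
                ]),
            )
        )
    else:
        # Regular Python iterable: fall back to row-at-a-time
        # (or batched) inserts.
        connection.execute(
            membersTable.insert(),
            [{"collection": collection, "dataset_id": ref} for ref in refs],
        )
```

With something like this in place, the original example would execute as a single statement in the database, with no result rows ever crossing into Python.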
That’s a lot of extra complexity in the implementation, but I could imagine the speedup being important, both when we have genuinely large-scale queries in Oracle registries and when time spent holding file locks needs to be minimized at medium scale in SQLite. I’m not at the point where I’d want to implement this kind of lazy evaluation on a bunch of registry methods now, but I am strongly considering trying now to make our APIs consistent with adding that kind of implementation in the future.
So, for those of you with more database development experience than me, a few questions:
- Is there prior art for designing these kinds of APIs that I should read up on?
- The above case is a simple one, in that `associate` only needs to insert into one table. Some of our bulk operations need to insert into or delete from multiple tables, and for those to work we'd need to either repeat the passed-through `SELECT` or use temporary tables (both options are sketched after this list). I assume which is best would require context-dependent tests, but any thoughts on what to do by default?
- Any other questions I should be asking about this kind of thing?
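For concreteness, here is a toy but runnable illustration of the two multi-table strategies using the stdlib `sqlite3` module; the schema and all names are invented purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, kind TEXT);
    CREATE TABLE collection_members (collection TEXT, dataset_id INTEGER);
    CREATE TABLE dataset_tags (tag TEXT, dataset_id INTEGER);
    INSERT INTO dataset VALUES (1, 'calexp'), (2, 'raw'), (3, 'calexp');
""")

# Option 1: repeat the passed-through SELECT for each target table.
conn.execute("""
    INSERT INTO collection_members
    SELECT 'MyCollection', dataset_id FROM dataset WHERE kind = 'calexp'
""")
conn.execute("""
    INSERT INTO dataset_tags
    SELECT 'MyTag', dataset_id FROM dataset WHERE kind = 'calexp'
""")

# Option 2: materialize the SELECT once into a temporary table,
# then insert from it into each target table.
conn.execute("""
    CREATE TEMPORARY TABLE selected AS
    SELECT dataset_id FROM dataset WHERE kind = 'calexp'
""")
conn.execute(
    "INSERT INTO collection_members SELECT 'MyCollection2', dataset_id FROM selected")
conn.execute(
    "INSERT INTO dataset_tags SELECT 'MyTag2', dataset_id FROM selected")
```

The obvious tradeoff is that the temporary table runs the underlying query only once, at the cost of extra DDL and storage; as noted above, which wins probably depends on context.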