Projects involving irregularly shaped data

jpivarski · October 12, 2020, 9:41pm

Hi!

I’m a developer of Awkward Array, a Python package for manipulating large, irregular datasets: JSON-like data (variable-length lists, nested records, missing values, mixed data types) with a NumPy-like interface and performance (slices, implicit-loop functions, reducers, etc. on contiguous numerical buffers).
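To make "irregular data on contiguous numerical buffers" concrete, here is a minimal standard-library sketch of one common way to lay out variable-length lists: a flat `content` buffer plus an `offsets` buffer that marks where each list starts and stops. The helper names `to_columnar` and `sum_per_list` are hypothetical and are not Awkward Array's API; the sketch only illustrates the kind of layout such a library can operate on.

```python
from array import array
from itertools import accumulate


def to_columnar(lists):
    """Flatten variable-length lists into (offsets, content) buffers.

    Hypothetical helper for illustration only, not Awkward Array's API.
    """
    # offsets[i]:offsets[i + 1] delimits list i inside the flat content buffer.
    offsets = array("q", accumulate((len(x) for x in lists), initial=0))
    # All values live in a single contiguous buffer of doubles.
    content = array("d", (v for x in lists for v in x))
    return offsets, content


def sum_per_list(offsets, content):
    """Reducer: one sum per original list, read from the flat buffer."""
    return [
        sum(content[offsets[i]:offsets[i + 1]])
        for i in range(len(offsets) - 1)
    ]


# Three lists of different lengths, including an empty one.
offsets, content = to_columnar([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
sums = sum_per_list(offsets, content)
```

Because the nesting lives entirely in `offsets`, an operation on every value touches only the flat `content` buffer, which is what makes NumPy-style (and eventually GPU) execution possible.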

The project was originally developed for LHC data analysis, but the concept is generic and we’re looking for use-cases in other fields. I know that sky images are rectangular, but there might be analysis steps at a later stage of processing that require large arrays with variable-length lists or other data structures, such as lists of candidate objects and their associations.

We’re also trying to fully integrate Awkward Array into the scientific Python ecosystem: it already interoperates well with pyarrow (Apache Arrow) and Numba, and we’re looking at Dask, Zarr, RAPIDS (cuDF), and maybe Xarray. Scientific use-cases are the primary drivers of integrations like these, which is why we’d like to hear about yours.

We’re also working on adding a GPU backend, so that the same operations that work on CPU-bound arrays of irregular data work transparently on GPUs (replacing sequential algorithms with parallel ones under the hood). If you have an application that uses GPUs or might use GPUs in the future, that would also be interesting.

(Apologies if you’re seeing this twice because you’re on the Astropy mailing list!)

Thanks!
– Jim

ashleyvillar · October 12, 2020, 10:17pm

This sounds super cool, Jim!!

One dataset that immediately comes to mind is the Open Supernova Catalog (sne.space). It is a mix of metadata, irregular time series data, spectra, and (long) model printouts, all currently stored as JSON. Here is an example object: https://sne.space/sne/LSQ12dlf/ . Click the green “download all data” button to get an idea of the JSON for one object of the ~70k in the database.

jpivarski · October 12, 2020, 10:55pm

Thanks! That does look like a candidate use-case. I looked at the photometry and spectra, and it seems to be about 3.2 MB per supernova × 70,000 supernovae ≈ 220 GB, which is easily large enough that these NumPy-like calculations would be important. I noticed that the numbers were all in quoted strings. That would complicate the data-reading, which is exactly the sort of thing that I wanted to learn from working with use-cases (it breaks an assumption about how I thought JSON-loading would work).
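As a rough illustration of the quoted-number issue, here is a minimal sketch using only Python's standard `json` module, with an `object_hook` that converts numeric-looking strings to floats while decoding. The sample record, its field names (`time`, `magnitude`, `band`), and its values are made up for illustration and are not taken from the catalog; `parse_number` and `numeric_hook` are hypothetical helpers.

```python
import json


def parse_number(value):
    """Return a float if `value` is a string holding a number, else `value` unchanged."""
    if isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return value  # a genuine string, such as a band name
    return value


def numeric_hook(obj):
    """json.loads calls this for every decoded JSON object (innermost first)."""
    return {
        key: [parse_number(v) for v in val] if isinstance(val, list) else parse_number(val)
        for key, val in obj.items()
    }


# Made-up record in which the numbers arrive as quoted strings.
text = """
{"name": "SN-example",
 "photometry": [{"time": "56120.5", "magnitude": "19.2", "band": "V"}]}
"""

record = json.loads(text, object_hook=numeric_hook)
first_point = record["photometry"][0]
```

One caveat with this blanket approach: any string that happens to look like a number (an identifier such as "1", for instance) gets converted too, so a real reader would probably need a per-field schema rather than a guess per value.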

Do you know of any specific projects that are using this now and might want to try applying Awkward Array to their analysis code, with help from the authors? The best applications would be ones that need to use a lot of supernovae and can’t simplify to a flat table early in the analysis. We’d want to replace nested for-loops over these lists with NumPy-like slices and array-at-a-time functions.

Thanks again!