Obtain only image metadata with butler.get

stevenstetzler · October 12, 2021, 12:55am

Hi, I am wondering if there is a way to obtain just the metadata (or FITS headers) from a calexp dataset type using the butler. Right now, I am scraping through just the metadata of many exposures, something like:

refs = registry.queryDatasets("calexp", collections=collection)
for ref in refs:
    calexp = butler.get(ref, collections=ref.run)
    metadata = calexp.getMetadata()
    del calexp # so I don't run out of memory

This is going quite slowly as the butler is loading each exposure into memory. I see there is a parameters option in the butler.get docstring:

Signature:
butler.get(
    datasetRefOrType: 'Union[DatasetRef, DatasetType, str]',
    dataId: 'Optional[DataId]' = None,
    *,
    parameters: 'Optional[Dict[str, Any]]' = None,
    collections: 'Any' = None,
    **kwds: 'Any',
) -> 'Any'
Docstring:
Retrieve a stored dataset.

Parameters
----------
datasetRefOrType : `DatasetRef`, `DatasetType`, or `str`
    When `DatasetRef` the `dataId` should be `None`.
    Otherwise the `DatasetType` or name thereof.
dataId : `dict` or `DataCoordinate`
    A `dict` of `Dimension` link name, value pairs that label the
    `DatasetRef` within a Collection. When `None`, a `DatasetRef`
    should be provided as the first argument.
parameters : `dict`
    Additional StorageClass-defined options to control reading,
    typically used to efficiently read only a subset of the dataset.

Can I use this parameters option to select just the metadata? I’ve tried a test:

butler.get(ref, collections=ref.run, parameters={"test": "123"})

and get an error:

KeyError: "Parameter 'test' not understood by StorageClass ExposureF"

and if I try a test with a parameter I find by examining ref.datasetType.storageClass:

StorageClassExposureF('ExposureF', pytype='lsst.afw.image.ExposureF', delegate='lsst.obs.base.exposureAssembler.ExposureAssembler', parameters=frozenset({'origin', 'bbox'}),

butler.get(ref, collections=ref.run, parameters={"origin": '123'})

I get an error

/home/admin/lsst/lsst_22_0_0/stack/miniconda3-py38_4.9.2-0.4.3/Linux64/obs_base/22.0.0+6225f1ba97/python/lsst/obs/base/formatters/fitsExposure.py in readFull(self, parameters)
    243         self._reader = self._readerClass(fileDescriptor.location.path)
--> 244         return self._reader.read(**parameters)
    245 

TypeError: read(): incompatible function arguments. The following argument types are supported:
    1. (self: lsst.afw.image.ExposureFitsReader, bbox: lsst.geom.Box2I = Box2I(minimum=Point2I(0, 0), dimensions=Extent2I(0, 0)), origin: lsst.afw.image.image.ImageOrigin = <ImageOrigin.PARENT: 0>, conformMasks: bool = False, allowUnsafe: bool = False, dtype: object = None) -> object

Invoked with: <lsst.afw.image.ExposureFitsReader object at 0x7febb2bfc8b0>; kwargs: origin='123'

which tells me parameters controls the constructor of the ExposureFitsReader… if that’s true, is there a constructor that loads just the metadata?

This is all the snooping I’ve done so far.

Thanks.

parejkoj · October 12, 2021, 2:06am

What metadata are you looking for in particular? Information about the exposure can be gotten with butler.get("calexp.visitInfo"), returning a VisitInfo object, which has a number of properties (e.g. date, exposureTime, observatory, weather, boresightRaDec). You can similarly get the wcs, photoCalib, bbox, and filterLabel objects, which together describe most of the information relevant to the exposure.
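A minimal sketch of fetching several of these components for one dataset reference. The helper name read_components is hypothetical. It assumes butler is an lsst.daf.butler.Butler and ref is a DatasetRef such as those returned by registry.queryDatasets, and it passes the data ID explicitly because butler.get needs one when given a dataset type name:

```python
def read_components(butler, ref, components=("visitInfo", "wcs")):
    """Return {component name: object} for one dataset reference.

    Hypothetical helper: `butler` is assumed to be an lsst.daf.butler.Butler
    and `ref` a DatasetRef, e.g. one from registry.queryDatasets("calexp", ...).
    """
    parent = ref.datasetType.name  # "calexp" for the refs in this thread
    return {
        # Each component is requested as "<parent>.<component>".
        name: butler.get(f"{parent}.{name}", dataId=ref.dataId, collections=ref.run)
        for name in components
    }
```

Each component is read with its own butler.get call, so the cost grows with the number of components requested.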

I don’t know that we have an explicit listing in the docs of all of the exposure components that can be gotten this way: we definitely should, if we don’t.

timj · October 12, 2021, 4:15am

The parameters can be used to obtain a subset of the image. You can define a bounding box and get a cutout.
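As a rough sketch of the cutout case, assuming the caller already has a DatasetRef and an lsst.geom.Box2I (read_cutout is a hypothetical helper name, not part of the butler API):

```python
def read_cutout(butler, ref, bbox):
    """Read only the pixels inside `bbox` from the dataset `ref` points to.

    `bbox` is assumed to be an lsst.geom.Box2I built by the caller, e.g.
        lsst.geom.Box2I(lsst.geom.Point2I(0, 0), lsst.geom.Extent2I(100, 100))
    """
    # "bbox" is one of the two parameters the ExposureF storage class lists
    # (the other is "origin", which expects an lsst.afw.image.ImageOrigin).
    return butler.get(ref, collections=ref.run, parameters={"bbox": bbox})
```

The result is still an exposure, just a smaller one, so this reduces the pixels read but does not return the header on its own.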

To read only the metadata, ask for a component of the dataset rather than using parameters. For example, to list all the components associated with a calexp in the ci_hsc_gen3 output repository:

$ butler query-dataset-types --components $CI_HSC_GEN3/DATA calexp.*
          name          
------------------------
        calexp.apCorrMap
             calexp.bbox
      calexp.coaddInputs
         calexp.detector
       calexp.dimensions
           calexp.filter
      calexp.filterLabel
            calexp.image
             calexp.mask
         calexp.metadata
       calexp.photoCalib
              calexp.psf
     calexp.summaryStats
calexp.transmissionCurve
     calexp.validPolygon
         calexp.variance
        calexp.visitInfo
              calexp.wcs
              calexp.xy0

Here the calexp.metadata is the FITS header.
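The loop from the original question could then be sketched as follows. This assumes the same registry, butler and collection objects used there, and iter_metadata is a hypothetical helper name:

```python
def iter_metadata(butler, registry, collection, dataset_type="calexp"):
    """Yield (ref, metadata) pairs without building full exposures.

    Hypothetical helper: `butler` and `registry` are assumed to be the same
    objects used in the question, and `collection` the collection searched.
    """
    refs = registry.queryDatasets(dataset_type, collections=collection)
    for ref in refs:
        # Ask for the "metadata" component rather than the whole dataset.
        metadata = butler.get(
            f"{dataset_type}.metadata", dataId=ref.dataId, collections=ref.run
        )
        yield ref, metadata
```

Used as `for ref, metadata in iter_metadata(butler, registry, collection): ...`, this never holds a full exposure in memory, so the explicit del is no longer needed.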

stevenstetzler · October 13, 2021, 10:50pm

Thanks Tim and John, calexp.metadata (along with calexp.visitInfo and calexp.wcs) works great for what I need. I wasn’t aware this functionality was available.

Timing a butler.get with a calexp data ref takes about 1.5 to 2 seconds, and a butler.get with calexp.visitInfo (or calexp.wcs) takes around 700ms. I thought it would be much faster than that, but that should give me at least a 2x speedup. Thanks!

timj · October 14, 2021, 4:49pm

Where are you doing this test? On the IDF it has to download the whole file to read a little bit of it because cfitsio can’t access a file directly on a Google bucket. There is local file caching on IDF so if you butler.get the visitInfo and then the wcs it won’t download the file twice.

stevenstetzler · October 15, 2021, 6:47pm

I was doing this on an internal, cloud-deployed JupyterHub, similar to the IDF. Files are stored in a bucket. I tested this on (spinning) disks and the timing is 900ms for calexp and 33.8ms for calexp.visitInfo, so it looks like the latency is from the object download, as you suggest.
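For anyone wanting to repeat the comparison, a rough timing sketch follows. time_get is a hypothetical helper, and it assumes butler.get is called with a dataset type name plus the data ID and run taken from an existing ref:

```python
import time


def time_get(butler, dataset_type, ref, repeats=3):
    """Return the wall-clock seconds taken by each of `repeats` butler.get calls.

    Hypothetical helper: `dataset_type` is a name such as "calexp" or
    "calexp.visitInfo", and `ref` supplies the data ID and run.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        butler.get(dataset_type, dataId=ref.dataId, collections=ref.run)
        timings.append(time.perf_counter() - start)
    return timings
```

Comparing time_get(butler, "calexp", ref) with time_get(butler, "calexp.visitInfo", ref) gives the two numbers. Where local file caching is active, the first call may include the download and later calls may not, so it is worth looking at the individual timings and not only their mean.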

timj · October 16, 2021, 8:05am

If you turn on the timer.lsst.daf.butler logger at DEBUG level it will report the relevant times for downloading versus file reading.
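One way to do that with the standard Python logging module is sketched below. The logger name comes from the post above. The function name, handler and format string are illustrative assumptions, and an environment with its own logging setup may need something different:

```python
import logging
import sys


def enable_butler_timing(stream=sys.stderr):
    """Send DEBUG messages from the butler timing logger to `stream`."""
    logger = logging.getLogger("timer.lsst.daf.butler")
    logger.setLevel(logging.DEBUG)
    # Attach a dedicated handler so the messages appear even when the
    # root logger is configured at a higher level.
    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.DEBUG)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```

Call enable_butler_timing() once before the butler.get calls. If the root logger also has handlers attached, each message may be printed twice because records still propagate upward.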