dkist's Issues

WCSes using a 2D Lookup table for pointing do not support inverse transform

When constructing a VaryingCelestialTransform2D model we currently use Tabular2D, which does not support an inverse transform even when the lookup table is single valued.

The approach to fixing this is probably to make a subclass of Tabular2D which does return an inverse when all the points in the table are single valued. This should probably be upstreamed to Astropy, or else we will have to maintain it forever.

See also DCS-2680
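A minimal demonstration of the limitation (the table values are made up, but strictly single valued):

import numpy as np
from astropy.modeling.models import Tabular2D

points = (np.arange(3), np.arange(3))
lookup_table = np.arange(9, dtype=float).reshape(3, 3)

model = Tabular2D(points, lookup_table)
print(model(1, 2))  # the forward transform works

try:
    model.inverse
except NotImplementedError as err:
    # No inverse is defined for the 2D tabular case, even though this
    # particular table could be inverted.
    print(f"No inverse available: {err}")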

High-level Stokes coordinates cannot be converted to pixel coordinates

To reproduce the error do: ds.crop([..., 'I'])

The fundamental issue here is that the StokesProfile class in gWCS isn't sufficient to handle the requirements of being an APE 14 high level object. I have opened astropy/astropy#13790 to track adding a more fully featured stokes coordinate object.

ds.download errors when transferring whole dataset?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-5c4b9fbfdba9> in <module>
----> 1 ds.download(destination_endpoint="CU Boulder Research Computing")

~/tmp/dkist/dkist/dataset/dataset.py in download(self, path, destination_endpoint, progress)
    234         local_destination = destination_path.relative_to("/").expanduser()
    235         old_ac = self._array_container
--> 236         self._array_container = DaskFITSArrayContainer.from_external_array_references(
    237             old_ac.external_array_references,
    238             loader=old_ac._loader,

~/tmp/dkist/dkist/io/array_containers.py in from_external_array_references(cls, ears, **kwargs)
     71         `asdf.ExternalArrayReference` objects.
     72         """
---> 73         shape = ears[0].shape
     74         dtype = ears[0].dtype
     75         target = ears[0].target

AttributeError: 'list' object has no attribute 'shape'

Deal with the result limit on the dataset endpoint

The dataset endpoint only returns a maximum of 1000 results, and we don't support paging the results. We should at least tell the user when the result set has been truncated.

Currently the dataset search API tells you how many results there are and sorts them by most recently created inventory record first.
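A minimal sketch of the kind of warning we could emit; DATASET_MAX_RESULTS and warn_if_truncated are illustrative names, not part of the current client:

import warnings

DATASET_MAX_RESULTS = 1000  # the server-side limit described above

def warn_if_truncated(results):
    """Warn when a search probably hit the server's result limit."""
    if len(results) >= DATASET_MAX_RESULTS:
        warnings.warn(
            f"The search returned {DATASET_MAX_RESULTS} results, the maximum the "
            "dataset endpoint will return; the result set is probably truncated. "
            "Narrow the search to see all matching datasets."
        )
    return results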

Ensure that the FITS loader can work over HTTP etc.

We don't currently use anything like dask.read_bytes to load the files referenced by the asdf files. We should ensure that we have a well-abstracted loader that can handle any prefix, not just file-system-local paths.
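A sketch of what a prefix-agnostic loader could look like if it were built on fsspec (an assumption; the current loader does not use it): the same call handles local paths and http/https or s3 URLs.

import fsspec

def open_referenced_file(uri):
    """Open a file referenced by the asdf, whatever its prefix."""
    # Works for plain local paths as well as "file://", "http(s)://", "s3://", ...
    return fsspec.open(uri, "rb")

# Usage: read the first FITS header block from a (hypothetical) remote file.
with open_referenced_file("https://example.org/VBI_L1_frame_0001.fits") as fh:
    header_block = fh.read(2880)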

ASDF file download fails when downloading a dataset

Attempting to download a dataset successfully downloads the FITS frames, but the ASDF file fails to download. The Globus event log includes an error similar to:

"550-GlobusError: v=1 c=PATH_NOT_FOUND%0D%0A550-GridFTP-Path: \"/data/87dfaf/AWVLB/87dfaf/AWVLB/VBI_L1_20531113T145215_AWVLB.asdf\"%0D%0A550-globus_gridftp_server_s3_base: S3 Error accessing \"/data/87dfaf/AWVLB/87dfaf/AWVLB/VBI_L1_20531113T145215_AWVLB.asdf\": HttpErrorNotFound: HttpErrorNotFound: %0D%0A550 End.%0D%0A",

I believe the issue is that the values for base_path and self.meta['asdfObjectKey'] both include the proposalId and datasetId at this point:

file_list.append(base_path / self.meta['asdfObjectKey'])
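An illustration of the suspected duplication, using the values from the error above: if both parts already contain the proposal and dataset IDs, joining them nests the IDs twice.

from pathlib import Path

base_path = Path("/data/87dfaf/AWVLB")
asdf_object_key = "87dfaf/AWVLB/VBI_L1_20531113T145215_AWVLB.asdf"

print(base_path / asdf_object_key)
# /data/87dfaf/AWVLB/87dfaf/AWVLB/VBI_L1_20531113T145215_AWVLB.asdf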

Encode the FITS file array more efficiently

For my CRISP test dataset of ~9700 files the resulting asdf file (without the FITS headers) is 1.4 MB of YAML for just the ExternalArrayReference objects and the gwcs. This makes opening the asdf file slow because of the YAML parser.

There is a lot of duplicated information in the list of ExternalArrayReference objects because all the arrays have the same type, shape and target. The best solution is probably an ExternalArrayArray object which only encodes them once.
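A sketch of the kind of compact record such an object could serialise (the names, shape and file URIs here are illustrative only): the shared values are stored once, followed by the list of file URIs.

# Roughly 9700 ExternalArrayReference entries collapse into one record like:
compact_record = {
    "fileuris": ["frame_00000.fits", "frame_00001.fits"],  # ... ~9700 entries
    "target": 1,             # HDU index, identical for every file
    "datatype": "float64",   # dtype, identical for every file
    "shape": [4096, 4096],   # array shape, identical for every file (made up here)
}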

Improve / change the dataset loading API

A couple of issues with the current API:

  • Only supports one path at a time, meaning you can't pass a set of asdf files downloaded with Fido.fetch and get multiple datasets back (see the sketch after this list).
  • You use a class method on Dataset to get an instance of TiledDataset, which is weird.
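A sketch of what a friendlier entry point might look like; load_datasets and Dataset.from_asdf are assumed names for illustration, not necessarily the existing API.

from dkist import Dataset

def load_datasets(paths):
    """Load one Dataset per asdf file, accepting a single path or many."""
    if isinstance(paths, (str, bytes)) or not hasattr(paths, "__iter__"):
        paths = [paths]
    # Dataset.from_asdf is assumed here; the current loading API differs.
    return [Dataset.from_asdf(p) for p in paths]

# e.g. datasets = load_datasets(Fido.fetch(results))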

Consider saving DKIST specific objects in .meta

Currently the Dataset class exposes two key pieces of functionality over the NDCube base class. These are:

  • The .download() method
  • Access to a table of all FITS headers

To make these work, two extra attributes are used: _header_table and _array_container. If we added these to the .meta dict and implemented download() as a function rather than a method, download and access to the header table would still work if the user converted the Dataset object to a more specific subclass of NDCube, such as sunraster.
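A minimal sketch (an assumption, not an agreed design) of how that could look: the private attributes live in .meta under illustrative keys, and download becomes a free function that works on any NDCube carrying them.

def download(cube, path="~/dkist_data", destination_endpoint=None, progress=True):
    """Download the FITS files backing ``cube`` via Globus."""
    # The key names are illustrative; the real ones would need to be agreed.
    array_container = cube.meta["array_container"]
    header_table = cube.meta["headers"]
    # ... hand the file list in array_container to the transfer machinery ...
    return array_container, header_table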

ASDF schema suggestions

Hi, I noticed a couple of issues in the ASDF schemas in this repository:

The dataset-0.1.0 schema specifies allowAdditionalProperties: true:
https://github.com/DKISTDC/dkist/blob/master/dkist/io/asdf/schemas/dkist.nso.edu/dkist/dataset-0.1.0.yaml#L38
but that isn't a recognized schema property. Probably it should be additionalProperties: true?

The $ref property is set to a tag URI in many cases:
https://github.com/DKISTDC/dkist/blob/master/dkist/io/asdf/schemas/dkist.nso.edu/dkist/dataset-0.1.0.yaml#L15
It should be a schema URI since $ref is an instruction to descend into the content of the referenced schema. This works currently due to an accident in the asdf Python implementation, but may not in the future. If the intention is to validate the tag of the child object, then the tag property is a good option:

tag: "tag:dkist.nso.edu:dkist/array_container-0.2.0"

Prepend `/~/` to paths in the globus downloader

Should we do this all the time or just for local endpoints?


On second thoughts, it looks like this isn't needed for the actual download, just for when we process the path:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-28-ddbacff306ce> in <module>
----> 1 small_ds.download(path='data')

~/tmp/dkist/dkist/dataset/dataset.py in download(self, path, destination_endpoint, progress)
    232         # TODO: This is a hack to change the base dir of the dataset.
    233         # The real solution to this is to use the database.
--> 234         local_destination = destination_path.relative_to("/").expanduser()
    235         old_ac = self._array_container
    236         self._array_container = DaskFITSArrayContainer.from_external_array_references(

/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/pathlib.py in relative_to(self, *other)
    905         if (root or drv) if n == 0 else cf(abs_parts[:n]) != cf(to_abs_parts):
    906             formatted = self._format_parsed_parts(to_drv, to_root, to_parts)
--> 907             raise ValueError("{!r} does not start with {!r}"
    908                              .format(str(self), str(formatted)))
    909         return self._from_parsed_parts('', root if n == 1 else '',

ValueError: 'data/8ef09f_large/BQWJB' does not start with '/'

This makes me wonder what happens if the user's home directory on Globus isn't the home directory of the local user.
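A sketch (illustrative only, not the implemented fix) of a destination-path computation that avoids the relative_to("/") failure above by only prepending /~/ when the user supplies a relative path:

from pathlib import Path

def globus_destination_path(path):
    """Map a user-supplied path onto a path on the Globus endpoint."""
    destination = Path(path).expanduser()
    if destination.is_absolute():
        # Absolute paths are taken as rooted at the endpoint's filesystem.
        return destination
    # Relative paths are interpreted relative to the endpoint's home
    # directory, which Globus addresses as /~/ .
    return Path("/~/") / destination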

Add support for new parameters in dataset search

The new query parameters and return results are:

  • hasSpectralAxis
  • hasTemporalAxis
  • averageDatasetSpectralSamplingMin
  • averageDatasetSpectralSamplingMax
  • averageDatasetSpatialSamplingMin
  • averageDatasetSpatialSamplingMax
  • averageDatasetTemporalSamplingMin
  • averageDatasetTemporalSamplingMax

The hasSpectralAxis and hasTemporalAxis parameters should probably be folded into a.Physobs, but I am not sure how. The sampling ones are a little more interesting; I think we will need more DKIST-specific attrs for them.

DCS-1000
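A sketch of what DKIST-specific attrs for the new parameters could look like, built on sunpy's attr machinery; the class names and units are guesses, not the implemented ones.

import astropy.units as u
from sunpy.net import attr

class HasSpectralAxis(attr.SimpleAttr):
    """Search for datasets that do (or do not) have a spectral axis."""

class SpectralSampling(attr.Range):
    """Average dataset spectral sampling, expressed as a (min, max) range."""

# This would map onto the averageDatasetSpectralSamplingMin/Max parameters.
spectral = SpectralSampling(0.01 * u.nm, 0.05 * u.nm)
has_spectral = HasSpectralAxis(True)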

The Dask array generated when reading an asdf doesn't always get the endianness correct

The issue is illustrated by the following:

In [11]: arr = dkist.Dataset.from_directory("./uncompressed").data

In [13]: arr.dtype
Out[13]: dtype('float64')

In [14]: arr.dtype.str
Out[14]: '<f8'

In [15]: arr.compute().dtype
Out[15]: dtype('>f8')

vs.

In [3]: arr = dkist.Dataset.from_directory("./compressed/").data

In [4]: arr.dtype
Out[4]: dtype('float64')

In [5]: arr.dtype.str
Out[5]: '<f8'

In [6]: arr.compute().dtype
Out[6]: dtype('float64')

In [7]: arr.compute().dtype.str
Out[7]: '<f8'

This stems from the apparent fact that when reading a compressed array with astropy.io.fits it gets byteswapped to native, but when reading an uncompressed array it doesn't. Presumably the compressed byteswapping happens at the decompression stage with minimal overhead (and saving of overhead later) but it's still annoying that it's different between the two code paths.

We don't actually save the endianness information to the asdf, as dkist-inventory uses the astropy BITPIX2DTYPE mapping, which maps only to numpy dtype names, not dtypes with a byte order.

We could change this to encode the byte order as well, but the byte order in FITS is always big endian, so to be accurate we should always save the dtype to the asdf as big-endian (>), even though, when a compressed file is loaded with astropy on most modern systems, it gets swapped to little-endian.

This puts the burden on the part of the user tools that constructs the meta array for the Dask array to tell dask what byte order to expect when the file is loaded. Alternatively, we could byteswap the array to native when loading the actual data, either here or here.

The issue with doing it at dask array construction time is that there is currently no obvious way to know whether the FITS files are compressed (only which HDU they are in)[1]. The issue with doing it at array read time is that you would load the whole chunk into memory if you did it where the file is opened, and you might do it repeatedly, or only for part of the chunk, if you did it in the __getitem__ method of the loader.

Maybe the best, but not the easiest, solution is to put a flag into the asdf file which tells you whether the data in the target HDU is compressed; then we could dispatch on that to determine the expected byte order.
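One possible mitigation (not a decided fix) would be to normalise whatever the loader returns to native byte order before dask sees it, so the compressed and uncompressed paths agree:

def to_native_byteorder(array):
    """Return ``array`` in native byte order, copying only when needed."""
    return array.astype(array.dtype.newbyteorder("="), copy=False)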

Globus transfers have a max file limit

Currently we list every file individually in the transfer request. This works nicely if you aren't transferring all the files, but we will hit the limit eventually.

The main thing we should do here is stop listing all the files when we are downloading the whole dataset, and transfer the whole directory instead. This particularly applies to transfer_complete_datasets.
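A sketch (using globus_sdk; the label, paths and endpoint arguments are placeholders) of submitting one recursive item for the dataset directory instead of one item per file:

import globus_sdk

def transfer_whole_dataset(transfer_client, src_endpoint, dst_endpoint,
                           dataset_dir, destination_dir):
    """Submit a single recursive transfer item for a whole dataset directory."""
    tdata = globus_sdk.TransferData(
        transfer_client, src_endpoint, dst_endpoint, label="Complete DKIST dataset"
    )
    # One recursive item replaces thousands of per-file items.
    tdata.add_item(dataset_dir, destination_dir, recursive=True)
    return transfer_client.submit_transfer(tdata)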

Consider adding a fitsio backed data loader

I discovered today that Astropy's FITS reader can't handle decompressing individual tiles inside a compressed HDU. This means that when people use the user tools, for any read operation on a file the whole file has to be loaded into memory and decompressed. This is memory and time inefficient. See: astropy/astropy#3895 for more details on the lack of support from Astropy.

An alternative (that I left the door open for) is to use a different FITS library to load the data inside our Dask arrays. The best candidate for this would be the fitsio package; some quick benchmarks indicate that fitsio handles decompressing individual tiles and even decompresses the whole image faster than astropy:

In [44]: %timeit fits.getdata("VBI_L1_00656282_2018_05_11T14_25_05_466665_I.fits", hdu=1)
183 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [45]: %timeit hdu[1][:, :]
131 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [46]: %timeit f.read("VBI_L1_00656282_2018_05_11T14_25_05_466665_I.fits")
131 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [48]: %timeit hdu[1][0, 0]
21.6 µs ± 191 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [49]: %timeit hdu[1][0, :]
22.3 µs ± 78.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [50]: %timeit hdu[1][:10, :]
235 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The advantages of using fitsio include:

  • Potentially significant performance enhancements.
  • Potentially significant reduction in memory usage.

There are some disadvantages:

  • The fitsio Python package is GPL v2 licensed; importing it has potential licensing implications.
  • There are no binary wheels of fitsio published on PyPI, meaning people installing this package from PyPI would need a working toolchain to build fitsio.
  • The API is lower level, although this wouldn't be exposed to the user.

I think a good approach would be to optionally use fitsio if it is already installed. That way we aren't mandating its import, so we could probably argue our corner on the licensing issues, and we also aren't requiring people to compile it. Making it optional like this would require hooking in some configuration options, because the loader is chosen inside the asdf converter, where we can't pass user arguments.
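A sketch (one possible shape for the optional dependency, not a decided design) of falling back to astropy.io.fits when fitsio is not installed:

try:
    import fitsio
except ImportError:
    fitsio = None

def open_hdu(path, hdu=1):
    """Return an object that supports sliced reads of the named HDU."""
    if fitsio is not None:
        # fitsio reads (and decompresses) only the tiles a slice touches.
        return fitsio.FITS(path)[hdu]
    # astropy decompresses the whole HDU into memory on access.
    from astropy.io import fits
    return fits.open(path)[hdu].data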

Link the header table to the files / FileManager

When you slice the Dataset, it should (automatically?) be possible to view the corresponding subset of the headers table.
It should also be possible to look in the headers table and work out which index of the FileManager a row corresponds to, etc.
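A sketch of the kind of usage this would enable (hypothetical; neither attribute behaves this way today, and the column name is made up):

sub = ds[0, :, 100:200]     # slice the Dataset as usual
sub.headers                 # only the header rows for the files backing ``sub``
sub.headers["FILE_INDEX"]   # hypothetical column linking each row back to its
                            # index in the FileManager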

Fix the ImageAnimatorDataset

Probably related to sunpy/sunpy#2855, but the tests for the image animator need fixing and their xfails removing.

# TODO:
# I am not sure what's going on here, it looks like the sunpy imageanimator
# thing is just completely wonky. Probably best to test this with the
# actual SST dataset and try and make some real plots to see what's
# actually broken.
a = ImageAnimatorDataset(dataset_3d, image_axes=image_axes, unit_x_axis=units[0], unit_y_axis=units[1])
return plt.gcf()

Support globus SDK 3.0

Looks like there are some lovely breaking changes in the 3.0 release. No idea how hard it will be to fix.
