dkist's Issues

WCSes using a 2D Lookup table for pointing do not support inverse transform

When constructing a VaryingCelestialTransform2D model we currently use Tabular2D, which does not support an inverse transform even when the lookup table is single valued.

The approach to fixing this is probably to make a subclass of Tabular2D which does return an inverse when all the points in the table are single valued. This should probably be upstreamed to Astropy, or else we will have to maintain it forever.

See also DCS-2680
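A minimal demonstration of the limitation (the table values are made up, but strictly single valued):

import numpy as np
from astropy.modeling.models import Tabular2D

points = (np.arange(3), np.arange(3))
lookup_table = np.arange(9, dtype=float).reshape(3, 3)

model = Tabular2D(points, lookup_table)
print(model(1, 2))  # the forward transform works

try:
    model.inverse
except NotImplementedError as err:
    # No inverse is defined for the 2D tabular case, even though this
    # particular table could be inverted.
    print(f"No inverse available: {err}")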

High-level Stokes coordinates cannot be converted to pixel coordinates

To reproduce the error do: ds.crop([..., 'I'])

The fundamental issue here is that the StokesProfile class in gWCS isn't sufficient to handle the requirements of being an APE 14 high level object. I have opened astropy/astropy#13790 to track adding a more fully featured stokes coordinate object.

ds.download errors when transferring whole dataset?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-5c4b9fbfdba9> in <module>
----> 1 ds.download(destination_endpoint="CU Boulder Research Computing")

~/tmp/dkist/dkist/dataset/dataset.py in download(self, path, destination_endpoint, progress)
    234         local_destination = destination_path.relative_to("/").expanduser()
    235         old_ac = self._array_container
--> 236         self._array_container = DaskFITSArrayContainer.from_external_array_references(
    237             old_ac.external_array_references,
    238             loader=old_ac._loader,

~/tmp/dkist/dkist/io/array_containers.py in from_external_array_references(cls, ears, **kwargs)
     71         `asdf.ExternalArrayReference` objects.
     72         """
---> 73         shape = ears[0].shape
     74         dtype = ears[0].dtype
     75         target = ears[0].target

AttributeError: 'list' object has no attribute 'shape'

Deal with the result limit on the dataset endpoint

The dataset endpoint only returns a maximum of 1000 results, and we don't support paging the results. We should at least tell the user when the result set has been truncated.

Currently the dataset search API tells you how many results there are and sorts them by most recently created inventory record first.
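A minimal sketch of the kind of warning we could emit; DATASET_MAX_RESULTS and warn_if_truncated are illustrative names, not part of the current client:

import warnings

DATASET_MAX_RESULTS = 1000  # the server-side limit described above

def warn_if_truncated(results):
    """Warn when a search probably hit the server's result limit."""
    if len(results) >= DATASET_MAX_RESULTS:
        warnings.warn(
            f"The search returned {DATASET_MAX_RESULTS} results, the maximum the "
            "dataset endpoint will return; the result set is probably truncated. "
            "Narrow the search to see all matching datasets."
        )
    return results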

Ensure that the FITS loader can work over HTTP etc.

We don't currently use anything like dask.read_bytes to load the files referenced by the asdf files. We should ensure that we have a well-abstracted loader that can handle any prefix, not just file-system-local paths.
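A sketch of what a prefix-agnostic loader could look like if it were built on fsspec (an assumption; the current loader does not use it): the same call handles local paths and http/https or s3 URLs.

import fsspec

def open_referenced_file(uri):
    """Open a file referenced by the asdf, whatever its prefix."""
    # Works for plain local paths as well as "file://", "http(s)://", "s3://", ...
    return fsspec.open(uri, "rb")

# Usage: read the first FITS header block from a (hypothetical) remote file.
with open_referenced_file("https://example.org/VBI_L1_frame_0001.fits") as fh:
    header_block = fh.read(2880)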

ASDF file download fails when downloading a dataset

Attempting to download a dataset successfully downloads the FITS frames, but the ASDF file fails to download. The Globus event log includes an error similar to:

"550-GlobusError: v=1 c=PATH_NOT_FOUND%0D%0A550-GridFTP-Path: \"/data/87dfaf/AWVLB/87dfaf/AWVLB/VBI_L1_20531113T145215_AWVLB.asdf\"%0D%0A550-globus_gridftp_server_s3_base: S3 Error accessing \"/data/87dfaf/AWVLB/87dfaf/AWVLB/VBI_L1_20531113T145215_AWVLB.asdf\": HttpErrorNotFound: HttpErrorNotFound: %0D%0A550 End.%0D%0A",

I believe the issue is that the values for base_path and self.meta['asdfObjectKey'] both include the proposalId and datasetId at this point:

file_list.append(base_path / self.meta['asdfObjectKey'])
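An illustration of the suspected duplication, using the values from the error above: if both parts already contain the proposal and dataset IDs, joining them nests the IDs twice.

from pathlib import Path

base_path = Path("/data/87dfaf/AWVLB")
asdf_object_key = "87dfaf/AWVLB/VBI_L1_20531113T145215_AWVLB.asdf"

print(base_path / asdf_object_key)
# /data/87dfaf/AWVLB/87dfaf/AWVLB/VBI_L1_20531113T145215_AWVLB.asdf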

Encode the FITS file array more efficiently

For my CRISP test dataset of ~9700 files the resulting asdf file (without the FITS headers) is 1.4 MB of YAML for just the ExternalArrayReference objects and the gwcs. This makes opening the asdf file slow because of the YAML parser.

There is a lot of duplicated information in the list of ExternalArrayReference objects because all the arrays have the same type, shape and target. The best solution is probably an ExternalArrayArray object which only encodes them once.
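A sketch of the kind of compact record such an object could serialise (the names, shape and file URIs here are illustrative only): the shared values are stored once, followed by the list of file URIs.

# Roughly 9700 ExternalArrayReference entries collapse into one record like:
compact_record = {
    "fileuris": ["frame_00000.fits", "frame_00001.fits"],  # ... ~9700 entries
    "target": 1,             # HDU index, identical for every file
    "datatype": "float64",   # dtype, identical for every file
    "shape": [4096, 4096],   # array shape, identical for every file (made up here)
}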

Improve / change the dataset loading API

A couple of issues with the current API:

  • Only supports one path at a time, meaning you can't pass a set of asdf files downloaded with Fido.fetch and get multiple datasets back (see the sketch after this list).
  • You use a class method on Dataset to get an instance of TiledDataset, which is weird.
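A sketch of what a friendlier entry point might look like; load_datasets and Dataset.from_asdf are assumed names for illustration, not necessarily the existing API.

from dkist import Dataset

def load_datasets(paths):
    """Load one Dataset per asdf file, accepting a single path or many."""
    if isinstance(paths, (str, bytes)) or not hasattr(paths, "__iter__"):
        paths = [paths]
    # Dataset.from_asdf is assumed here; the current loading API differs.
    return [Dataset.from_asdf(p) for p in paths]

# e.g. datasets = load_datasets(Fido.fetch(results))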

Consider saving DKIST specific objects in .meta

Currently the Dataset class exposes two key pieces of functionality over the NDCube base class. These are:

  • The .download() method
  • Access to a table of all FITS headers

To make these work, two extra attributes are used: _header_table and _array_container. If we added these to the .meta dict and implemented download() as a function rather than a method, download and access to the header table would still work if the user converted the Dataset object to a more specific subclass of NDCube, such as sunraster.
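A minimal sketch (an assumption, not an agreed design) of how that could look: the private attributes live in .meta under illustrative keys, and download becomes a free function that works on any NDCube carrying them.

def download(cube, path="~/dkist_data", destination_endpoint=None, progress=True):
    """Download the FITS files backing ``cube`` via Globus."""
    # The key names are illustrative; the real ones would need to be agreed.
    array_container = cube.meta["array_container"]
    header_table = cube.meta["headers"]
    # ... hand the file list in array_container to the transfer machinery ...
    return array_container, header_table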

ASDF schema suggestions

Hi, I noticed a couple of issues in the ASDF schemas in this repository:

The dataset-0.1.0 schema specifies allowAdditionalProperties: true:
https://github.com/DKISTDC/dkist/blob/master/dkist/io/asdf/schemas/dkist.nso.edu/dkist/dataset-0.1.0.yaml#L38
but that isn't a recognized schema property. Probably it should be additionalProperties: true?

The $ref property is set to a tag URI in many cases:
https://github.com/DKISTDC/dkist/blob/master/dkist/io/asdf/schemas/dkist.nso.edu/dkist/dataset-0.1.0.yaml#L15
It should be a schema URI since $ref is an instruction to descend into the content of the referenced schema. This works currently due to an accident in the asdf Python implementation, but may not in the future. If the intention is to validate the tag of the child object, then the tag property is a good option:

tag: "tag:dkist.nso.edu:dkist/array_container-0.2.0"

Prepend `/~/` to paths in the globus downloader

Should we do this all the time or just for local endpoints?


On second thoughts, it looks like this isn't needed for the actual download, just for when we process the path:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-28-ddbacff306ce> in <module>
----> 1 small_ds.download(path='data')

~/tmp/dkist/dkist/dataset/dataset.py in download(self, path, destination_endpoint, progress)
    232         # TODO: This is a hack to change the base dir of the dataset.
    233         # The real solution to this is to use the database.
--> 234         local_destination = destination_path.relative_to("/").expanduser()
    235         old_ac = self._array_container
    236         self._array_container = DaskFITSArrayContainer.from_external_array_references(

/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/pathlib.py in relative_to(self, *other)
    905         if (root or drv) if n == 0 else cf(abs_parts[:n]) != cf(to_abs_parts):
    906             formatted = self._format_parsed_parts(to_drv, to_root, to_parts)
--> 907             raise ValueError("{!r} does not start with {!r}"
    908                              .format(str(self), str(formatted)))
    909         return self._from_parsed_parts('', root if n == 1 else '',

ValueError: 'data/8ef09f_large/BQWJB' does not start with '/'

This makes me wonder what happens if the user's home directory on Globus isn't the home directory of the local user.
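A sketch (illustrative only, not the implemented fix) of a destination-path computation that avoids the relative_to("/") failure above by only prepending /~/ when the user supplies a relative path:

from pathlib import Path

def globus_destination_path(path):
    """Map a user-supplied path onto a path on the Globus endpoint."""
    destination = Path(path).expanduser()
    if destination.is_absolute():
        # Absolute paths are taken as rooted at the endpoint's filesystem.
        return destination
    # Relative paths are interpreted relative to the endpoint's home
    # directory, which Globus addresses as /~/ .
    return Path("/~/") / destination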

Add support for new parameters in dataset search

The new query parameters and return results are:

  • hasSpectralAxis
  • hasTemporalAxis
  • averageDatasetSpectralSamplingMin
  • averageDatasetSpectralSamplingMax
  • averageDatasetSpatialSamplingMin
  • averageDatasetSpatialSamplingMax
  • averageDatasetTemporalSamplingMin
  • averageDatasetTemporalSamplingMax

The hasSpectralAxis and hasTemporalAxis parameters should probably be folded into a.Physobs, but I am not sure how. The sampling ones are a little more interesting; I think we will need more DKIST-specific attrs for them.

DCS-1000
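A sketch of what DKIST-specific attrs for the new parameters could look like, built on sunpy's attr machinery; the class names and units are guesses, not the implemented ones.

import astropy.units as u
from sunpy.net import attr

class HasSpectralAxis(attr.SimpleAttr):
    """Search for datasets that do (or do not) have a spectral axis."""

class SpectralSampling(attr.Range):
    """Average dataset spectral sampling, expressed as a (min, max) range."""

# This would map onto the averageDatasetSpectralSamplingMin/Max parameters.
spectral = SpectralSampling(0.01 * u.nm, 0.05 * u.nm)
has_spectral = HasSpectralAxis(True)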

The Dask array generated when reading an asdf doesn't always get the endianness correct

The issue is illustrated by the following:

In [11]: arr = dkist.Dataset.from_directory("./uncompressed").data

In [13]: arr.dtype
Out[13]: dtype('float64')

In [14]: arr.dtype.str
Out[14]: '<f8'

In [15]: arr.compute().dtype
Out[15]: dtype('>f8')

vs.

In [3]: arr = dkist.Dataset.from_directory("./compressed/").data

In [4]: arr.dtype
Out[4]: dtype('float64')

In [5]: arr.dtype.str
Out[5]: '<f8'

In [6]: arr.compute().dtype
Out[6]: dtype('float64')

In [7]: arr.compute().dtype.str
Out[7]: '<f8'

This stems from the apparent fact that when reading a compressed array with astropy.io.fits it gets byteswapped to native, but when reading an uncompressed array it doesn't. Presumably the compressed byteswapping happens at the decompression stage with minimal overhead (and saving of overhead later) but it's still annoying that it's different between the two code paths.

We don't actually save the endianness information to the asdf, as dkist-inventory uses the astropy BITPIX2DTYPE mapping, which maps only to numpy dtype names, not dtypes with a byte order.

We could change this to encode the byte order as well, but the byte order in FITS is always big endian, so to be accurate we should always save the dtype to the asdf as big-endian (>), even though, when a compressed file is loaded with astropy on most modern systems, it gets swapped to little-endian.

This puts the burden on the part of the user tools that constructs the meta array for the Dask array to tell dask what byte order to expect when the file is loaded. Alternatively, we could byteswap the array to native when loading the actual data, either here or here.

The issue with doing it at dask array construction time is that there is currently no obvious way to know whether the FITS files are compressed (only which HDU they are in)[1]. The issue with doing it at array read time is that you would load the whole chunk into memory if you did it where the file is opened, and you might do it repeatedly, or only for part of the chunk, if you did it in the __getitem__ method of the loader.

Maybe the best, but not the easiest, solution is to put a flag into the asdf file which tells you whether the data in the target HDU is compressed; then we could dispatch on that to determine the expected byte order.
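One possible mitigation (not a decided fix) would be to normalise whatever the loader returns to native byte order before dask sees it, so the compressed and uncompressed paths agree:

def to_native_byteorder(array):
    """Return ``array`` in native byte order, copying only when needed."""
    return array.astype(array.dtype.newbyteorder("="), copy=False)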

Globus transfers have a max file limit

Currently we list every file individually in the transfer request. This works nicely if you aren't transferring all the files, but we will hit the limit eventually.

The main thing we should do here is stop listing all the files when we are downloading the whole dataset, and transfer the whole directory instead. This particularly applies to transfer_complete_datasets.
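A sketch (using globus_sdk; the label, paths and endpoint arguments are placeholders) of submitting one recursive item for the dataset directory instead of one item per file:

import globus_sdk

def transfer_whole_dataset(transfer_client, src_endpoint, dst_endpoint,
                           dataset_dir, destination_dir):
    """Submit a single recursive transfer item for a whole dataset directory."""
    tdata = globus_sdk.TransferData(
        transfer_client, src_endpoint, dst_endpoint, label="Complete DKIST dataset"
    )
    # One recursive item replaces thousands of per-file items.
    tdata.add_item(dataset_dir, destination_dir, recursive=True)
    return transfer_client.submit_transfer(tdata)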

Consider adding a fitsio backed data loader

I discovered today that Astropy's FITS reader can't handle decompressing individual tiles inside a compressed HDU. This means that when people use the user tools, for any read operation on a file the whole file has to be loaded into memory and decompressed. This is memory and time inefficient. See: astropy/astropy#3895 for more details on the lack of support from Astropy.

An alternative (that I left the door open for) is to use a different FITS library to load the data inside our Dask arrays. The best candidate for this would be the fitsio package; some quick benchmarks indicate that fitsio handles decompressing individual tiles and even decompresses the whole image faster than astropy:

In [44]: %timeit fits.getdata("VBI_L1_00656282_2018_05_11T14_25_05_466665_I.fits", hdu=1)
183 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [45]: %timeit hdu[1][:, :]
131 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [46]: %timeit f.read("VBI_L1_00656282_2018_05_11T14_25_05_466665_I.fits")
131 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [48]: %timeit hdu[1][0, 0]
21.6 µs ± 191 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [49]: %timeit hdu[1][0, :]
22.3 µs ± 78.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [50]: %timeit hdu[1][:10, :]
235 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The advantages of using fitsio include:

  • Potentially significant performance enhancements.
  • Potentially significant reduction in memory usage.

There are some disadvantages:

  • The fitsio Python package is GPL v2 licensed; importing it has potential licensing implications.
  • There are no binary wheels of fitsio published on PyPI, meaning people installing this package from PyPI would need a working toolchain to build fitsio.
  • The API is lower level, although this wouldn't be exposed to the user.

I think a good approach would be to optionally use fitsio if it is already installed. That way we aren't mandating its import, so we could probably argue our corner on the licensing issues, and we also aren't requiring people to compile it. Making it optional like this would require hooking in some configuration options, because the loader is chosen inside the asdf converter, where we can't pass user arguments.
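A sketch (one possible shape for the optional dependency, not a decided design) of falling back to astropy.io.fits when fitsio is not installed:

try:
    import fitsio
except ImportError:
    fitsio = None

def open_hdu(path, hdu=1):
    """Return an object that supports sliced reads of the named HDU."""
    if fitsio is not None:
        # fitsio reads (and decompresses) only the tiles a slice touches.
        return fitsio.FITS(path)[hdu]
    # astropy decompresses the whole HDU into memory on access.
    from astropy.io import fits
    return fits.open(path)[hdu].data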

Link the header table to the files / FileManager

When you slice the Dataset, it should (automatically?) be possible to view the corresponding subset of the headers table.
It should also be possible to look in the headers table and work out which index of the FileManager a row corresponds to, etc.
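A sketch of the kind of usage this would enable (hypothetical; neither attribute behaves this way today, and the column name is made up):

sub = ds[0, :, 100:200]     # slice the Dataset as usual
sub.headers                 # only the header rows for the files backing ``sub``
sub.headers["FILE_INDEX"]   # hypothetical column linking each row back to its
                            # index in the FileManager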

Fix the ImageAnimatorDataset

Probably related to sunpy/sunpy#2855, but the tests for the image animator need fixing and their xfails removing.

# TODO:
# I am not sure what's going on here, it looks like the sunpy imageanimator
# thing is just completely wonky. Probably best to test this with the
# actual SST dataset and try and make some real plots to see what's
# actually broken.
a = ImageAnimatorDataset(dataset_3d, image_axes=image_axes, unit_x_axis=units[0], unit_y_axis=units[1])
return plt.gcf()

Support globus SDK 3.0

Looks like there are some lovely breaking changes in the 3.0 release. No idea how hard it will be to fix.
