creare-com / podpac Goto Github PK

View Code? Open in Web Editor NEW

43.0 8.0 6.0 10.65 MB

Pipeline for Observational Data Processing Analysis and Collaboration

Home Page: https://podpac.org

License: Apache License 2.0

Python 99.96% Dockerfile 0.04%

geospatial-analysis geospatial-processing geospatial-rasters geospatial-data collaboration analysis pipeline

podpac's Introduction

PODPAC

Pipeline for Observation Data Processing Analysis and Collaboration

Data wrangling and processing of geospatial data should be seamless so that earth scientists can focus on science.

The purpose of PODPAC is to facilitate:

Access of data products
Subsetting of data products
Projecting and interpolating data products
Combining/compositing data products
Analysis of data products
Sharing of algorithms and data products
Use of cloud computing architectures (AWS) for processing

Installation

For installation instructions, see install.md.

Documentation

The official PODPAC documentation is available here: https://podpac.org

For usage examples, see the podpac-examples repository.

Contributing

You can find more information on contributing to PODPAC on the Contributing page

Stability / Maturity

PODPAC is in a beta phase of development. As such:

All development will adhere to a semantic versioning system
Major revisions may contain backwards incompatible changes, but these changes will be documented in the CHANGELOG
We are working to improve documentation, please contact us or create an issue if documentation is out of date or incorrect
We look forward to receive feedback on usability and compatibility

Acknowledgments

This material is based upon work supported by NASA under Contract No 80NSSC18C0061.

References

For PODPAC references, see the References page

podpac's People

Contributors

Stargazers

Watchers

Forkers

ccuadrado rmeyers4 rabindranp bayotte ccpfoye cfoye-creare

podpac's Issues

Coordinate Creation Spec

Summary

Create a document that serves as the coordinate creation spec. This issue is concurrent with coordinate class naming (#53). Ideally, this issue precedes coordinate documentation (#4) and testing (#3) -- it would be nice to update the tests to match the spec, and then update the code (and docstrings) to pass the tests. This issue also precedes the user guide (#45).

Background

Coordinate creation is a challenging issue because it needs to be sufficiently general in order to cover enough use cases while remaining both concise as well as sufficiently simple/concrete/usable for a new user. We have overhauled coordinate creation a few times in the past, and it's time to really settle on a spec.

Current Status

Coordinate creation involves two main components:

creation of the coordinate values of a single dimension (currently the ~Coord classes)
assembling one or more of these 1d coordinates together with relational information (currently the Coordinate class)

develop

The Coordinate class initialization accepts either keyword arguments or a stacked dictionary, which define the coordinates values for each dimension (key) with either ~Coord objects or shortcuts that are mapped to ~Coord objects. Keys with an underscore encode relational information (stacking). All of the coordinate creation options are available in one place.

# a few options
Coordinate({'lat_lon':(Coord([1, 2, 3, 4]), Coord([10, 20, 30, 40])), time:Coord('2018-01-01')})
Coordinate({'lat_lon':([1, 2, 3, 4], [10, 20, 30, 40]), time:'2018-01-01'})
Coordinate(lat_lon=([1, 2, 3, 4], [10, 20, 30, 40]), time='2018-01-01')

PR #44

The Coordinate class initialization requires a stacked dictionary with coordinate values for each dimension (key) defined by ~Coord objects. Keys with an underscore encode relational information (stacking). The stacked dictionary argument matches the Coordinate.stacked_dict property.

A coordinate helper function accepts keyword arguments which define the coordinate values for each dimension (key) with shortcuts that are mapped to ~Coord objects.

Coordinate({'lat_lon':(Coord([1, 2, 3, 4]), Coord([10, 20, 30, 40])), time:Coord('2018-01-01')})
podpac.coordinate('lat_lon'=([1, 2, 3, 4], [10, 20, 30, 40]), time='2018-01-01')

@mlshapiro

@mlshapiro has suggested replacing the keyword arguments with a list of values and second argument for dimension labels. This is similar to xarray DataArray creation: the main data and its structure are provided as a positional argument, with labeling meta-data passed as a second keyword argument. The values would accept either ~Coord objects or shortcuts that are mapped to ~Coord objects.

Coordinate([([1, 2, 3, 4], [10, 20, 30, 40]), '2018-01-01'], dims=[('lat', 'lon'), 'time'])

Additional considerations

We discussed the pros and cons of putting all of the options into Coordinate class initialization vs. using factory functions, with general support for a spec with (potentially multiple) factory functions for common use-cases. The Coordinate initialization could remain strict, explicit, and general while factory functions would be easy-to-use, concise, and extensible.

Proposal: Parse sub-attributes in Pipelines

We want to be able to reference

node sub-attributes specifically in Algorithm inputs and Compositor sources
sub-attributes generally in attrs

e.g.

{
    "nodes": {
        "downscaled_sm_algorithm": {
            ...
       },
       "reprojected": {
            "node": "core.data.type.ReprojectedSource"
            "inputs": {"source": "downscaled_sm_algorithm.twi"}
            "attrs": {"reprojected_coordinates": "downscaled_sm_algorithm.solmst.native_coordinates"}
       }
    }
}

This is just a proposal for now, and needs to be thought through more carefully and discussed.

Interpolate lat_lon to lon_lat and vice versa

Description
After the Coordinate refactor, we cannot request lat_lon data from a DataSource with lon_lat native coordinates (and vice versa).

Steps to Reproduce

class B(NumpyArray):
    source = 2 * np.arange(100.)

    def get_native_coordinates(self):
        return Coordinate(lat_lon=[(-20, 20), (0, 5), 100])

b = B()
b.execute(Coordinate(lon_lat=[(-20, 20), (0, 5), 100]))

Expected Behavior
The data should be retrieved.

Observed Behavior
Currently a NotImplementedError is raised. Without it, an IndexError is raised in DataSource.interpolate_point_data.

DataSource example in pipeline docs is wrong

Description
The example in https://github.com/creare-com/podpac/blob/develop/doc/source/user/pipelines.md#sample-1 is actually for SMAP, which is a compositor node.

Expected Behavior
We should have an example to a different type of node, which is actually a datasource. I propose the WCS node.

Coordinates docs

Document classes and functions for the coordinates module.

Increase Node coverage > 90%

Target is 0.1.0 release.

Fix lat_lon order in Coordinate

Its used to be ((lat_start, lon_start), (lat_end, lon_end), Number of points)
Now it's ((lat_start, lat_end), (lon_start, lon_end), Number of points)

Style has no member node

line 64/65 of node.py we reference self.node, but Style has no member node

    def __init__(self, node=None, *args, **kwargs):
        if node:
            self.name = self.node.__class.__name__
            self.units = self.node.units

BUG: Generalize/fix convolution nodes

Though I haven't tested it, I'm pretty sure this line
podpac/core/algorithm/signal.py#L43

will fail if the input doesn't have lat/lon coordinates.

Similarly, this line will fail
podpac/core/algorithm/signal.py#L63

if the input doesn't have time coordinates

matplotlib.cm has no member virdis

Line 741 in node.py references cm.virdis. Matplotlib has this member on the pyplot module, but not cm.

Add transpose to coordinates and smart indexing of UnitsDataArrays

Sometimes it would be preferable to transpose coordinates before they're applied to a node. For example, in WCS's native_coordinates method, the native coordinates are constructed based on the (somewhat arbitrary) WCS coordinates and the evaluated coordinates but use the order of the WCS coordinates, and it would be nice to transpose to order of evaluated coordinates.

OUT_DIR is not a member of settings

Line 636 of podpac/core/node.py references settings.OUT_DIR. There is no member in settings named OUT_DIR.

FEATURE: Create a webpage for the project

Should have auto-generated reference docs from docstrings as well as more user-friendly documentation for users, developers, and advocates

Increase Data coverage to > 90%

Target is 0.1.0 release.

CoordinateGroup

@mpu I think you know this, but there is an initial implementation with very basic tests in the repository. Here's a summary.

Both Coordinate and CoordinateGroup inherit from BaseCoordinate, which mainly just defines a common API (currently stack, unstack, and intersect).

The CoordinateGroup wraps a list of Coordinate objects in a private attribute. Maybe you would prefer that this be a numpy array, not sure.

Implemented

len(group)
for c in group
indexing group[1]
slicing group[1:4]
multi-indexing group[1, 'lat'] to get the underlying Coord objects
groupA + groupB
groupA += groupB
groupA.append(coordinate)
group.dims gives a set of unstacked dims, shared by all coordinates in the group but not necessarily in the same order or stacking.
group.stack, group.unstack, and group.intersect just map to each coordinate in the list and return a CoordinateGroup.

Not implemented

groupA + coordinate - I didn't know if it was better to fail or to append.
add_unique
iterchunks
latlon_bounds_str

Next Steps

initial feedback
how to execute a node given a CoordinateGroup and what the output should be

Custom pipeline output types

Description
Users should be able to define custom Output types and use them in JSON pipeline definitions.

Describe the solution you'd like
Similar to the plugin support that is implemented for custom Nodes in JSON pipeline definitions.

Sample Code

TODO

Additional context
This will be implemented alongside or after #36.

podpac update - how are times defined/handled?

I have a question on how the podpac update to allow requesting a single time was done, because I’ve noticed changes in some of my functions from yesterday. For example, it now allows me to request times that had no data as of yesterday, so for example, this code returns data in all 5 times, rather than just T06 and T18:

lat = -34.848
lon = 146.172

step = 0.01 # 201x201 pixels in 2degree by 2degree square area
lat_range = np.arange(round(lat*4)/4-1, round(lat*4)/4+1+step,step)
lon_range = np.arange(round(lon*4)/4-1, round(lon*4)/4+1+step,step)

smap_coords = podpac.Coordinate(time=('2016-03-17','2016-03-18','6,h'), lat=lat_range, lon=lon_range,
                       order=['time', 'lat', 'lon'])
o_smp1 = smap.execute(smap_coords)

What is very curious and concerning is that if I request two days instead of just one, as below, the data for the first day doesn’t match up.

smap_coords = podpac.Coordinate(time=('2016-03-17','2016-03-19','6,h'), lat=lat_range, lon=lon_range,
                       order=['time', 'lat', 'lon'])
o_smp2 = smap.execute(smap_coords)

When requesting a specific time using:

smap_coords = podpac.Coordinate(time=podpac.Coord('2016-03-17T18', delta='2,h'), lat=lat_range, lon=lon_range,
                       order=['time', 'lat', 'lon'])
o_smp = smap.execute(smap_coords)

I can also get data in time windows when there didn’t used to be data such as:

smap_coords = podpac.Coordinate(time=podpac.Coord('2016-03-17T12', delta='2,h'), lat=lat_range, lon=lon_range,
                       order=['time', 'lat', 'lon'])
o_smp = smap.execute(smap_coords)

(this gives the same data as 18h for this call, when I was expecting nan's).

What is the default time window in which it looks for data? Is the default using interpolation between nearest neighbor points, or just taking the closest value? If I want the exact value in the AM or PM files, what function call should I use?

Thanks,
Rachel

node and node.attribute references in pipeline definitions

Description
We need to be able to define node attributes in pipeline definitions by referring to another node or another node's attribute (without clashing with attributes that are strings, dicts, etc)

Describe the solution you'd like
Reference the node or attribute as a string: "my_attr": "node_name" or "my_attr": "node_name.attr1.attr2". These references need to be in a special location in the node definition so that the pipeline parser knows to interpret these as node and attribute references rather than as a string.

(We have also discussed dictionaries and tuples in the past, which remain a fine option.)

Tasks

Define new spec.
Update tests and docs to match new spec.
Update the pipeline parser to use the spec (probably after #36)

Nodes with default inputs breaks pipeline

Problem

test2.json

{
  "nodes": {
    "RedistDefault": {
      "plugin": "GeoWATCH",
      "node": "RedistDefault"
    }
  },
  "outputs": [
    {
      "mode": "image", "format": "png", "vmin": 0, "vmax": 1,
      "nodes": [  "RedistDefault"  ]
    }
  ]
}

import GeoWatch
import podpac
pipeline = podpac.Pipeline('geowatch/GeoWATCH/algorithm/test/test2.json')
coords = podpac.Coordinate(lat=(45, 44, 128), lon=(-100, -99, 128), time='2017-12-01T12:00:00', order=('lat', 'lon', 'time')
)
pipeline.execute(coords)

NodeException: Cannot determine shape if evaluated_coordinates and native_coordinates are both None.

> /home/ubuntu/podpac/podpac/core/algorithm/algorithm.py(30)execute()
     29                 if coords is None:
---> 30                     coords = convert_xarray_to_podpac(node.output.coords)
     31                 else:

ipdb> ll
     17     def execute(self, coordinates, params=None, output=None):
     18         self.evaluated_coordinates = coordinates
     19         self.params = params
     20         self.output = output
     21 
     22         coords = None
     23         for name in self.trait_names():
     24             node = getattr(self, name)
     25             if isinstance(node, Node):
     26                 if self.implicit_pipeline_evaluation:
     27                     node.execute(coordinates, params)
     28                 # accumulate coordniates
     29                 if coords is None:
---> 30                     coords = convert_xarray_to_podpac(node.output.coords)
     31                 else:
     32                     coords = coords.add_unique(
     33                         convert_xarray_to_podpac(node.output.coords))

The above fails because slope was not evaluated before because implicit_pipeline_evaluation is False and the pipeline didn't evaluate the node before needing the output from the node

Solution

We either need to:

Fix this bug in Pipeline by having the Pipeline node search through any Algorithm or Compositor nodes and collecting any
inputs / traits of type Node and adding them to pipeline definition so that they will be executed.
OR: rethink how we build pipelines/algorithms like this.

Increase Algorithm Coverage to >90%

Target is 0.1.0 release.

Pipeline refactor

Description
Pipelines should be usable as inputs into Nodes (and by other pipelines) and should only allow a single output. Note also that Pipelines will no longer attempt to manage execution.

Describe the solution you'd like
The Pipeline class should be converted into a Node. It will still accept a JSON pipeline definition as input and execute that pipeline in its execute method. The spec will be updated to support only a single output node and type in each pipeline definition.

This replaces the PipelineNode class.

No data when requesting stacked coordinates from a DataSource with stacked native coordinates.

Description
A DataSource with stacked native_coordinates returns NaN values when executed with stacked coordinates.

Steps to Reproduce

class MyNode(NumpyArray):
    source = 2 * np.arange(100.)

    def get_native_coordinates(self):
        return Coordinate(lat_lon=[(-20, 20), (0, 5), 100])

n = MyNode()
coords = n.native_coordinates
out = b.execute(bcoords)

Expected Behavior
out should contain all of the data from MyNode.source.

Observed Behavior
out is all NaN.

Tag 0.0.0 release

To master, "stable" for now

Compositor Shortcut Native Coordinates Implicit Assumption

Right now the Compositor can short-cut create the native_coordinates (for SMAP) so that we don't have to read the file to get the native coordinates.

This assumes the order of native coordinates is source_coordinate + shared_coordinates as opposed to shared_coordinates + source_coordinates

fill in autodoc classes

autodoc has been stubbed in for the future, but docs need to be filled in. This issue serves as a reminder of classes/modules that still need documenting. We can remove at the 0.1.0 release, or when the majority of autodoc classes have been filled in.

To find unfilled autodocs, search the podpac module for Summary, TYPE, Description

podpac.Algorithm.definition.fget undefined in type.py

line 865 of type.py podpac.Algorithm.definition has no member fget

        d = podpac.Algorithm.definition.fget(self)
        d['attrs'] = OrderedDict()

Rasterio exception using integer array as DataSource source

Description
Defining a DataSource node with a integer array as the source results in a rasterio exception when evaluating the node. The no-data value during interpolation using rasterio is np.nan which is invalid for integer inputs.

Steps to Reproduce

class A(NumpyArray):
    source = np.arange(10000).reshape(100, 100)

    def get_native_coordinates(self):
        return Coordinate(lat=(-25, 25, 100), lon=(-25, 25, 100), order=['lat', 'lon'])

a = A()
coords = Coordinate(lat=(0, 10, 80), lon=(0, 10, 80), order=['lat', 'lon'])
out = a.execute(coords)

Observed Behavior

Traceback (most recent call last):
  File "jxm.py", line 37, in <module>
    out = a.execute(coords)
  File "/home/jmilloy/Pipeline/podpac/podpac/core/data/data.py", line 108, in execute
    self.interpolate_data(data_subset, coords_subset, coordinates)
  File "/home/jmilloy/Pipeline/podpac/podpac/core/data/data.py", line 270, in interpolate_data
    return self.rasterio_interpolation(data_src, coords_src, data_dst, coords_dst)
  File "/home/jmilloy/Pipeline/podpac/podpac/core/data/data.py", line 393, in rasterio_interpolation
    resampling=getattr(Resampling, self.interpolation)
  File "/usr/local/lib/python2.7/dist-packages/rasterio-1.0a9-py2.7-linux-x86_64.egg/rasterio/env.py", line 275, in wrapper
    return f(*args, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/rasterio-1.0a9-py2.7-linux-x86_64.egg/rasterio/warp.py", line 286, in reproject
    init_dest_nodata, **kwargs)
  File "rasterio/_warp.pyx", line 277, in rasterio._warp._reproject (rasterio/_warp.cpp:4742)
ValueError: src_nodata must be in valid range for source dtype

Expected Behavior
A traitlets.TraitsError or ValueError exception is raised during node initialization.

Add Documentation for new developers

slc is undefined in type.py

Line 329, 330 slc is undefined in RasterIO source in type.py

        data.data.ravel()[:] = self.dataset.read(
            self.band, window=((slc[0].start, slc[0].stop),
                               (slc[1].start, slc[1].stop)),
            out_shape=tuple(coordinates.shape)
            ).ravel()

Custom nodes may silently output incomplete/incorrect pipeline definitions

The definition property of a node is used to output a human-readable and human-editable pipeline definition. It is defined for the main node types DataSource, Compositor, and Algorithm.

In some node subclasses, the default (parent) definition is sufficient and does not need to be reimplemented, but subclasses that use additional attributes will need to overwrite or extend the definition. For example, the definition property is extended in some datalib.smap nodes.

Nodes that do not reimplement the definition may silently output pipeline definitions that are incorrect due to missing attributes. This applies to existing podpac nodes that do not yet reimplement the property, nodes that are be added to podpac in the future, and custom nodes created by podpac users.

It is not yet clear what a solution to this would look like.

reduce is undefined in stats.py

Line 175 and 254 in stats.py: reduce is undefined.

One example

        if self.chunk_size and self.chunk_size < reduce(mul, coordinates.shape, 1):
            result = self.reduce_chunked(self.iteroutputs())
        else:
            if self.implicit_pipeline_evaluation:
                self.input_node.execute(coordinates, params)
            result = self.reduce(self.input_node.output)

Coordinate Naming Convention

Description

The current Coordinate naming convention of Coord and Coordinate is a little bit confusing. Proposing a new naming convention when we are refactoring the Coordinate classes.

Support integer DataSource.

Currently a DataSource source must be floats for the interpolation to work. NaN values are used for missing data.

It would be nice to support integer sources with a custom (integer) no-data value.

Interpolation from point data sets

Description
When executing a node that has lat_lon coordinates as a source, the output is all 'nans'

Describe the solution you'd like
Get the actual data as output

Sample Code

>>> import podpac
>>> import podpac.core.data.type
>>> import numpy as np
>>> n = podpac.core.data.type.NumpyArray(source=np.arange(10), native_coordinates=podpac.Coordinate(lat_lon=((0., 1.), (1., 2.), 10)))
>>> n.execute(n.native_coordinates)
<xarray.UnitsDataArray (lat_lon: 10)>
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
Coordinates:
  * lat_lon  (lat_lon) [('lat', '<f8'), ('lon', '<f8')] (0., 1.) ...
Attributes:
    layer_style:  <podpac.core.node.Style object at 0x000000001C05A128>
    params:       None

Public API when importing PODPAC

Description
What should be the root level imports when you import PODPAC?

Describe the solution you'd like
We could have a flat structure where all the nodes are available at import. That's currently the approach:

import podpac
dir(podpac)
['Algorithm',
 'Arithmetic',
 'CACHE_DIR',
 'Compositor',
 'Convolution',
 'Coord',
 'Coordinate',
 'Count',
 'DataSource',
 'ExpandCoordinates',
 'Kurtosis',
 'Max',
 'Mean',
 'Median',
 'Min',
 'MonotonicCoord',
 'Node',
 'OrderedCompositor',
 'Pipeline',
 'PipelineError',
 'PipelineNode',
 'SinCoords',
 'Skew',
 'SpatialConvolution',
 'StandardDeviation',
 'Style',
 'Sum',
 'TimeConvolution',
 'UniformCoord',
 'Units',
 'UnitsDataArray',
 'UnitsNode',
 'Variance',
 'coord_linspace',
 'core',
 'settings']

Or we could have a more categorical or hierarchical approach.

FEATURE: Enable 'native' specification of coordinates

e.g.

coords = Coordinate(time=("2000-01-01", None, None))

Where None signifies whatever the 'native' coordinate is for the node.

This will cause clashing which might be:

Desired?
Dealt with by using the native coordinates of the output node

Coordinate User Guide

This would be in addition to improved docs in the module itself.

Increase Utils/Units coverage > 90%

Target is 0.1.0 release.

Refactor Interpolation

Right now interpolation is a large if-else statement as part of DataSource.
The logic for determining the data that's included in the get_data request is fully determined by the Coordinates.intersect method -- this should not be the case.
- Nearest-neighbor interpolation needs considerably fewer points that bi=linear interpolation
- How are out-of-extents cases handles
- point vs. segment coordinates now matter a lot more
What do to about different CRS?
Interpolation should be its own module
- Interpolants should be easy to add
- There should be an order of priority for interpolators
- Each interpolator should know what coordinates it can interpolate to/from
- Each interpolator should know how to select appropriate coordinates from the datasource
- Multiple interpolators may be required for each request:
  - Time could use NN interpolation
  - lat/lon could use bilinear with a specified CRS/Projection
- The order of these multiple interpolators matters from an optimization perpsective
  - Consider the size of the dataset before/after interpolation
  - Consider the cost of the interpolation operation

AttributeError: module 'podpac' has no attribute 'GridCompositor'

Trying to import airmoss, I see this error.

----> 1 from podpac.datalib import airmoss

~/computing/repositories/geospatial/podpac/podpac/datalib/airmoss.py in <module>()
     88 
     89 
---> 90 class AirMOSS_Site(podpac.GridCompositor):
     91     """Summary
     92 

AttributeError: module 'podpac' has no attribute 'GridCompositor'

Rethink how params are used and how they should behave

Right now, any params are completely overwritten by an execute call.
Presumably a developer could have default params in the dictionary, and users only want to overwrite those in some calls
- i.e. subsequent calls should use the defaults
- Right now, subsequent calls will use the last provided params
params are not traited right now -- they have to be in the dictionary

I think ideally:

Default parameters should not be over-written at the class level
Developers should be able to specify traited parameters
- This could be picked up by the pipeline generation to avoid the silently -incorrect pipeline definitions

(time, lat_lon) to (time, lat_lon) interpolation not working

Description
(time, lat_lon) to (time, lat_lon) interpolation not working
Steps to Reproduce

import podpac
import podpac.core.data.type
n = podpac.core.data.type.NumpyArray(source=np.random.rand(64, 150), 
    native_coordinates=podpac.Coordinate(time=np.arange(64), lat_lon=((0, 1), (0, 1), 150),
    order=['time', 'lat_lon']))
n.execute(n.native_coordinates)

Expected Behavior
It should really just return the input data

Observed Behavior

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-12-016938f25bfd> in <module>()
----> 1 n.execute(n.native_coordinates)

C:\Repository\podpac\podpac\core\data\data.pyc in execute(self, coordinates, params, output)
     88
     89         # Need to ask the interpolator what coordinates I need
---> 90         res = self.get_data_subset(self.evaluated_coordinates)
     91         if isinstance(res, UnitsDataArray):
     92             if self.output is None:

C:\Repository\podpac\podpac\core\data\data.pyc in get_data_subset(self, coordinates)
    172             coords_subset_slc = new_coords_slc
    173
--> 174         data = self.get_data(coords_subset, coords_subset_slc)
    175
    176         return data, coords_subset

C:\Repository\podpac\podpac\core\data\type.pyc in get_data(self, coordinates, coordinates_slice)
    100         s = coordinates_slice
    101         d = self.initialize_coord_array(coordinates, 'data',
--> 102                                         fillval=self.source[s])
    103         return d
    104

IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (64,) (150,)

Additional Notes
I'm using xarray version 0.10.0

coordinate testing

This involves migrating some old unit tests to pytests, writing some new unit tests, and removing some old unit tests.

Refactor Caching

Description
Presently there is no clear/consistent way to cache objects to RAM/disk. In fact, core.node.Node has multiple save/load/write functions that should be unified. I suspect part of this was used for the initial caching functionality, and another part was used for the Pipeline implementation. This needs to be cleaned up.

Describe the solution you'd like
Define a single consistent interface for caching, saving, and loading files.

Increase Compositor Coverage to >90%

Target is 0.1.0 release.

Question: how does interpolation in time handle nan's?

If one asks podpac for data at a time that is intermediate between two of the native source time points, how does the interpolation handle nan's in one of the adjacent points? Does it return nan? Does it give the nearest value present and not interpolate?

Thanks,
Rachel

NumpyArray -> NDArray

Description

Is there a reason to specify Numpy in the array DataSource type? Could a naive user implement a list of data (perhaps from a JSON document or an excel spreadsheet?)

Describe the solution you'd like

Rename NumpyArray to NDArray or something more general

doc improvements

General issue to support doc improvements leading to 0.1.0

support viewing multiple versions of documentation
- View multiple versions of podpac docs on the built documentation website via a dropdown selection. This is a built in feature of read the docs. Ideally this would be a sphinx based solution
include code coverage in documentation

Exception requesting stacked coordinates from a DataSource with unstacked native coordinates

Steps to Reproduce

class MyNode(NumpyArray):
    source = np.arange(10000.).reshape(100, 100)

    def get_native_coordinates(self):
        return Coordinate(lat=(-25, 25, 100), lon=(-25, 25, 100), order=['lat', 'lon'])

n = MyNode()
coords = Coordinate(lat_lon=[(-20, 20), (0, 5), 100])
out = n.execute(coords)

Expected Behavior
out should contain data interpolated from MyNode.source.

Observed Behavior

Traceback (most recent call last):
  File "jxm.py", line 29, in <module>
    a_out = n.execute(coords)
  File "/home/jmilloy/Pipeline/podpac/podpac/core/data/data.py", line 108, in execute
    self.interpolate_data(data_subset, coords_subset, coordinates)
  File "/home/jmilloy/Pipeline/podpac/podpac/core/data/data.py", line 292, in interpolate_data
    grid=False)
  File "/home/jmilloy/Pipeline/podpac/podpac/core/data/data.py", line 472, in interpolate_irregular_grid
    x, y = coords_i_dst['lon']
TypeError: list indices must be integers, not unicode

Additional Notes
See also #28

Remove Node.output attribute

Problem

Nodes that are executed more than once with expanded or reduced coordinates will have unpredictable output at any given time. In the following simple example, input1.output and input2.output will have different sizes becasue b.execute will re-execute a with expanded coordinates:

a = MyDataSource()
b = Convolution(source=a)
a.execute(coords)
b.execute(coords)
d = a.output - b.output

Solution

Remove the following attributes so that Nodes are reusable and "stateless":

evaluated
evaluated_coordinates
output

Users (and node methods internally) can store the output returned from execute when they wish to re-use that data.

a = MyDataSource()
b = Convolution(source=a)
aout = a.execute(coords)
bout = b.execute(coords)
d = aout - bout

Debugging and Caching

See #33.

Podpac should avoid re-executing the same node with the same coordinates and params. It is also useful to be able to inspect various node's outputs after execution while developing a new Node. Previously, the output attribute naively provided both of these features without properly handling the potentially multiple outputs. The caching should replace these features; a simple version would simply store each of a node's outputs in a dictionary with the coordinates and params as the key.

Notes

This is a breaking change and should be implemented before the first official release.

creare-com / podpac Goto Github PK

podpac's Introduction

PODPAC

Installation

Documentation

Contributing

Stability / Maturity

Acknowledgments

References

podpac's People

Contributors

Stargazers

Watchers

Forkers

podpac's Issues

Summary

Background

Current Status

develop

PR #44

Additional considerations

Problem

Solution

Problem

Solution

Debugging and Caching

Notes

Recommend Projects

Recommend Topics

Recommend Org