continuumio / elm Goto Github PK

Phase I & part of Phase II of NASA SBIR - Parallel Machine Learning on Satellite Data

Home Page: http://ensemble-learning-models.readthedocs.io

Python 99.24% Batchfile 0.03% Shell 0.73%

elm's Introduction

NASA SBIR Phase I & II - Open Source Parallel Image Analysis and Machine Learning Pipeline

This repository archives the prototype ELM software developed in NASA SBIR Phase I and II from 2016 to January 2018. Current versions of this code are available at EarthML.pyviz.org; the code here is primarily of historical interest.

Using the Code

To use this code:

Install:

Create the development environment:

conda env create

Activate the environment:

source activate elm-env

(older versions of the code may have elm in place of elm-env above. The environment name was changed to avoid conflict with elm package on anaconda.org. The elm-env is uploaded to the nasasbir org on anaconda.org.)

Install the source:

python setup.py develop

Clone the elm-data repo using Git LFS so that more tests can be run:

brew install git-lfs # or apt-get, yum, etc
git lfs install
git clone https://github.com/ContinuumIO/elm-data
git remote add origin https://github.com/ContinuumIO/elm-data

Add the following to your .bashrc or environment, changing the paths depending on where you have cloned elm-data:

export DASK_EXECUTOR=SERIAL
export ELM_EXAMPLE_DATA_PATH=/Users/psteinberg/Documents/elm-data

elm's People

Contributors

Stargazers

Watchers

elm's Issues

Package simplification

The following changes I think will make the code layout more intuitive:

Move the contents of elm/preproc to within elm/sample_util
Get rid of elm/preproc as a subpackage in setup.py, elsewhere if needed
Move most or all of the contents of elm/pipeline/sample_pipeline.py to elm/sample_util/sample_pipeline.py (a new file)
Search and replace for changes required by above items, ensuring to check for functions that are not directly imported by provided as callable strings like "elm.pipeline.sample_pipe:some_func"

document single-band tif caveat

document tif single band caveat:

import rasterio as rio
r = rio.open('../elm-data/tif/L8/015/033/LC80150332013207LGN00/LC80150332013207LGN00_B1.TIF')
out = np.empty((1,200,300),dtype=np.uint16)
r.read(out=out).shape
(1, 200, 300)

note that it depends on the tif being 3-D with 1 band in the first dimension. this can be detected with r.count() above. Is that a pretty common format for geo tiff's to have 1 band per file? We may have to caveat this tif reader for now as reading 1-band tif files

Example like "Classifying reanalysis surface temperature..." (Loikith et al)

Make sure "batch_size" is used in initialization of partial_fit models

Make elm.readers wrapper that guesses correct reader funcs

Instead of calling load_netcdf_array, load_hdf4_array etc, just call elm.readers.load_array or elm.readers.load_meta which will guess using file extensions which reader to call, if a reader was not given directly, finally defaulting to netcdf with a warning that file type could not be detected.

Copy the workflow in several hurricane track papers

My condensed notes from reading two papers recommended by Daniel Duffy:

Loikith et al:
    clustering on temperature anomaly
        binned the anomaly
        for b in anomaly_bin:
            logprob = log10 of (count of bin b divided by sum observations)
        do clustering on logprob

Michael Bowen paper
    used International Best Track Archive for Climate Stewardship (tropical storms for 170 yrs - tracks at 6 hr increments)

    MERRA2 - "M2T1NXSLV is a time-averaged (at hourly increments) single-level diagnostic dataset which comprises satellite sensor readings for 47 atmospheric variables. The spatial resolution is 0.5 latitude by 0.625 longitude, and readings are present for the entire Earth. The dataset is just under 6TB in size
    DBSCAN on distance matrix" (netcdf)

    bounding box query on that for west pacific

    Piecewise Aggregate Approximation (transform / untransform)

    then compute a distance measurement between storms
        used a Euclidean distance modified by the treatment of breakpoints from PAA
        summing these for all variables is the distance between storms

Make sure the workflows in these two papers are supported.

Create a cluster configuration files

Install:

OpenGrads
numpy stack
netcdf4
gdal

Example like "Hurricane Teleconnection Locator: HURTLoc" (Bowen et al)

Some ideas on the config for a hurricane track classifier like "Hurricane Teleconnection Locator: HURTLoc" by Michael Bowen.

Data source (configure for the MERRA-2 training X data):

"reader": the default NetCDF reader
sample_args_generator: return a series of lon, lat, time subgrid args which are marked with hurricane class: 1/0 or scale 1 to 5
sample_from_args_func: take lon lat boxes and time windows- open netcdf files for each window / box - return average of each band for that query for each X-hour period, such as 6-hour chunks like the Y data.

sample_pipeline pseudo-code:

get_y: get the Y data (track data) for the lon, lat boxes and time windows and shape it into 1 column of the same shape as X's row count. Classes 0 / 1 or storm scale integer for in a grid of (lon, lat, time).
{modify_coords: some_module:normalize} in sample_pipeline: Determine Gaussian breakpoints of hourly data for each variable (band) within each storm lon, lat box and time window, such as breakpoints among about 150 to 200 hourly data points for a given lon, lat point. Define 5 histogram classes in this normalization, setting the mean to 0 within the ca. 150-200 hour period for each storm y, x point. (setting the alphabet size 5). Return a matrix that gives for each 6-hour chunk of time for each spatial point the anomaly class among 0,1,2,3,4 (alphabet size 5). For each band (weather variable), a grid (lon, lat, time [6 hour chunk]) of integers 0 through 4.
{modify_coords: some_module:reshape_3d_2d} sample_pipeline action: Now the ElmStore has many bands (weather variables) and a (lon, lat, time) structure for each band. Flatten the time dimension into the band to create separate data arrays. Example: If the bands were:

temp
precip
humid

And the time points were:

Then make an ElmStore with "bands" (data_vars) as:

temp-0
temp-1
temp-2
temp-3
precip-0
...
humid-2
humid-3

And then for each band, there is a (y, x) grid

call {"flatten": "C"} in the sample_pipeline: With flatten we will take the "temp-0", "temp-1",.."humid-2" data_vars and turn them into column headers in a single matrix. Returns ElmStore with one DataArray called "flat". The "flat" DataArray will have column headers "temp-0", etc as the "band" attribute and the rows will be all the y, x points flattened into a single dimension called "space" (ravel order "C")
Other steps here such as scikit-learn preprocessing, feature selection, PCA (not mentioned in paper)

train:

model_init_class: sklearn.cluster:DBSCAN

Make sure our conda package has all data files

See this test failure:
https://anaconda.org/psteinberg/iamlp/builds/12/1#L50359

It indicates that when we install as a conda package, not all data files are installed and found later. When I install from conda env create && python setup.py develop I do not have the missing files problem. To see the conda installation code that runs into the error, check .travis.yml and .binstar.yml which conda build the package and install from local.

Fix the data files metadata in conda.recipe.

Provide means of getting `Y` data to classifier/regressor

Currently the pipeline does not yet have a way to specify a Y data set (true observations for training) but this could be accomplished somewhere in the sampler logic.

Consider general time series funcs for data with shape (time, x, y)

selecting / evolving ensemble members - base_model_selector using deap

I have used Distributed Evolutionary Algorithms in Python (deap) for hyperparameterization in the past. It should be possible to use it for a general selector over models that make up an ensemble - to choose ones to keep or drop out or alter parameters based on each other. This could go in elm/model_selectors/ as a base selector class or function.

The nice thing about evolutionary algorithms for hyperparameterization is that they are relatively free of assumptions about the type of problem being solved, making them less sensitive to non-convexity, and they are more efficient than a grid search over the parameter space. Grid search over the parameter space is not feasible in large scale analysis.

Image data selection tools

The idea is to provide tools for selecting input data images by:

Filtering on filenames
Filtering spatially with numba functions
Filtering on time
Filtering on metadata at the file level or subdataset band level, such as removing images which have some tag related to night time (the function call to check metadata keys/values is in place, but we have more work to do on collecting more metadata like cloud cover or NaN fields (handled in issue #37),
Filtering on data with numba functions
- ~~[ ] spectral filter on data (e.g. remove clouds from input data). See the placeholder for this code here~~ (EDIT: I removed spectral filter here because it would be duplicative - the functionality is accomplished to some extent by #45 which adds support for transform as a sample pipeline step.)

elm/config/defaults/defaults.yaml simplification or removal

The elm.config.ConfigParser loads defaults.yaml into every ConfigParser instance. This results in excess dictionary checking/copying and less clear ElmConfigErrors related to config parsing and excessively verbose printing when the whole ConfigParser instance is echoed.

document setting NA from metadata

Document how to assign NaNs in a sample using metadata like valid_range: [0, 254]

Make sampling more efficient for small ensembles or ensemble of 1

Currently the predict runs a data_source's sample_args_generator, which may or may not be the same as that used by prior train steps. In the case that the data source is the same between train and predict and/or one known sample is going to be taken, the sample should be passed directly, rather than, as it is currently, doing the sample twice.

Verification / improvement of prediction output

Improve on elm/pipeline/predict.py and elm/pipeline/serialize.py. Currently the tests only test that the pickles or netcdf outputs show up. One improvement may be GeoTIFF output or a preparation for datashader.

further test model_selectors/kmeans.py

kmeans_model_averaging needs testing.

Investigate datashader for visualizing elm output

See Brendan's datashader example with LANDSAT . Try to adapt some of this example to providing a visual means of inspecting classification predictions.

Better name than "iamlp"

From flowdock:

I did some brainstorming on package names. I was thinking words of interest were "pipeline", "ensemble", "machine learning", "parallel", or "classification". Of these I thought that "Ensemble Learning Models" was a nice name because it abbreviates to elm, which is kind of related to the topic of land cover (e.g. a tree icon for the package later). Also from elm.pipeline import * is easier to remember and more easily pronounced than from iampl.pipeline import *. I think "ensemble" is a word with a lot of good recognition in the satellite and weather data communities as a robust approach to deterministic weather models or data analysis, and it usually implies parallelism in that context. I think we should make a decision on the package name soon before the iamlp name becomes too engrained.

partial fit methods using dask

Currently the parallelization is over the ensemble level. We should provide an alternate workflow of parallelization over partial_fit. It may be hard to parallelize partial_fit and ensemble at the same time.

Environment vars and CLI args for dask scheduler type

Currently there is a decorator in iamlp/settings.py called delayed which is dask.delayed decorator by default but becomes a do-nothing decorator with the env var USE_SERIAL=1. USE_SERIAL=1 can be helpful for debugging small problems.

There needs to be an argument on all CLI's to control the type the dask scheduler or to specify serial evaluation without dask. Make a base parser for this in iamlp/cli.py
Consider changing current use of env var SERIAL_EVAL (int) to IAMLP_DASK_SCHEDULER as a type of scheduler. See this list pf schedulers to support:
- http://dask.pydata.org/en/latest/scheduler-overview.html#scheduler-overview

Peter's note: I checked the boxes regarding environment support for different dask schedulers. The dask code is not worked out currently, but the environment variables are passed.

Find most efficient way to start/stop cluster

Use EBS Volumes
Anaconda cluster tools?
get passwords from troy

Provide more control over sample flattening

Currently all samples go from a shape of ( band, y, x) to (space, band). We should make {flatten: True} be a step in a sample_pipeline explicitly rather than calling it automatically. Some methods may actually want to flatten it to (lat points, lon points * bands) and breaking flatten out as an explicit step will allow more control

Documentation

Convenient CLI

See the scripts directory and example kmeans script

Note that CLI is mostly factored out into cli.py so that argument groups can be better reused.

Provide:

Argument names that are mostly independent of which classification method is being used
Means of loading a configuration from yaml rather than, or in addition to, command line arguments. (Config yaml file gets loaded, then CLI args can override parts of the config if needed)
Test harness for CLI endpoints.
Where feasible allow the config to be overridden by env variables

Handle within-pixel locations more accurately

File readers such as hdf4, tif, etc need to be able to take specifications that determine which pixel corner or pixel center is the coordinate for the pixel. Provide x_offset, y_offset as floats between 0 and 1 to indicate offset from a corner.
Spatial filter functions may need to accept arguments x_offset, y_offset

Scaling / normalization / log transform inputs as `sample_pipeline` step

Several of the example machine learning projects mentioned by our contact as GSFC use normalization steps before learning. Here are some of the use cases we should support:

z-score taking mean and std dev from config or from function in config that returns the right mean/ std dev for a given pixel
log: log base e
log10
anomaly_probs:
- caclulate time-mean for each space grid cell
- calculate the deviation from time-mean for each space grid cell
- bin / round the deviation into increments of X
- calculate log10 probability as log10(np.bincount(i)/ total_num_obs) where i is the binned anomaly and total_num_obs is the # of obs for a given space grid cell
further notes on anomaly_probs:
- Allow anomaly_probs as described above, but within time windows, such as calendar months
- Instead of finding the anomaly based on the time-mean for each space grid cell, compare to some other zonal stat, such as the mean of a polygon in which a pixel falls.
other methods from sklearn for scaling such as StandardScaler, MinMaxScaler, etc.

add "transform" optional step to pipeline

Currently in the config's pipeline section (a list of dictionaries indicating training or prediction actions), the actions must have the keys:

train, or
predict

We should add a 3rd one for transform to cover the cases of models that do not have a predict method but rather a transform and/or fit_transform. Included in this work would be ensuring the transform actions:

can be specified in config, perhaps in the sample_pipeline section of a pipeline step
can serialize / deserialize the outputs of transform for transforms like PCA

Image sample operations in `sample_pipeline`

See sample_pipeline in the defaults.yaml default config. We need to make it support the following operations:

feature_selection: (done mostly)
resampling: ~~(to-be-developed using numba or scipy.interpolate.interp2D)~~ EDIT July 11: It makes sense to use rasterio resampling from disk.
aggregations: ~~(to-be-developed using numba)~~ EDIT July 11: It makes sense to use rasterio resampling from disk.
masks: (to-be-developed using numba)

See also README_config.md with notes on the config that will need to be updated for new options.

Interim Demonstration Report

We need to deliver an Interim Demonstration Report. From the contract:

At the end of month three (3) of contract performance the Contractor shall submit an interim technical demonstration report of all work accomplished. The report shall be in narrative form, be brief, and informal. This report shall include:

(i) A quantitative description of work performed during the period;

(ii) An indication of any current problems, which may impede performance or impact program  schedule or cost, and proposed corrective action;

(iii) A discussion of the work to be performed during the next reporting period;

(iv) A description of any changes to the planned use of subcontractors since contract               award; and

(v) Estimated percentage of physical completion of the contract.

refactor documentation approach for non-config usage of elm

Use the api_example.py, api_example_evo.py (flowdock) in explaining how to work with elm, then make the config file a secondary focus rather than primary as it is now.

Well-thought out example data analysis pipelines

~~Use / improve the iamlp.acquire subpackage to acquire and cache an interesting data set that can be used for unit-testing, integration testing, and demonstration.~~

~~Improve the iamlp-download-ladsweb CLI tool:~~

~~[ ] Output a list of available data sources (list the allData directory on ladsweb up to two levels, e.g. allData/3001/NPP_DSRF1KD_L2GD~~

Script out a full analyst workflow for demonstration, then take parts of it as unit tests:
~~List available data sets to download from ladsweb~~
~~Selecting a ladsweb ftp data set for download~~
Defining a large regional selection
Defining a spectral pre-filter for machine learning, e.g. exclude clouds
Defining custom columns like NDVI
Try several classifiers
Output classifier predictions to images for datashader.
Make sure to inspect image metadata to exclude images where needed, e.g. night images

Data sources to download to support example problems

Update this list as we identify data sets to download:

MERRA2 - 2-D temperature global ~~3 hourly~~ daily summary at 0.5 x 0.625 (lat x lon). This will be used for an example config like Loikith et al (2013): Classifying reanalysis surface temperature probability density functions (PDFs) over North America with cluster analysis.
- Summary: M2SDNXSLV: MERRA-2 statD_2d_slv_Nx: 2d,Daily,Aggregated Statistics,Single-Level,Assimilation,Single-Level Diagnostics V5.12.4
- Data access: http
- wget ideas to help
- Time period: Download the July and August data for all years
M2IMNXASM: MERRA-2 instM_2d_asm_Nx: 2d,Monthly mean,Single-Level,Assimilation,Single-Level Diagnostics V5.12.4
- Download the July and August data for each of years 1980 to present (approximately 3 GB in total)
International Best Track Archive for Climate Stewardship (IBTrACS). From the Bowen et al paper, here's what we are looking for:
- "IBTrACS is a historical, best effort accounting of all the tropical storms for about the past 170 years. The eye of the storms latitude and longitude coordinates (track) is provided at six hour increments from the time that the tropical storm was first detected up until it dissipated. In total, there are 11,967 tropical storms recorded in the dataset. 1,144 of these storms occurred in the West Pacific storm basin between the years of 1980 and 2016. Of these, only the storms occurring during the months of August and September were used, bringing the total used in analysis to just under 400." Where the "West Pacific" is defined by "The West Pacific storm basin is defined by latitude and longitude bounding coordinates 100 West, by 0 South, by 180 East, by 60 North"
- Ensure we have coverage on this data source for some part of 1980 to present.
  Entry point (both "All" and "WMO" are represented): https://www.ncdc.noaa.gov/ibtracs/index.php?name=ibtracs-data-access
- Create metadata mapping for hurricanes -> data files using ftp://eclipse.ncdc.noaa.gov/pub/ibtracs/documents/Tropical_Storm_Serial_Number.pdf
MERRA2 M2T1NXSLV. Follow the Subset button on the page to download a spatial subset for West Pacific: 100 West, by 0 South, by 180 East, by 60 North. Some calcs on download size with subsetting where lat_frac is the fraction of latitudes included, lon_frac is fraction of world longitudes included, months is 1 for just August, mb_per_file is 400 MB per hourly file, yrs is set to 36 (1980 to present), dpm is days per month, hours is 24 because there is one file per hour:

Data URL (Note: data is truncated to 2012 - change STARTYR=2011 to see the remaining URLs. Also, if this URL stops working, follow screenshots below to reproduce):
http://goldsmr4.gesdisc.eosdis.nasa.gov/cgi-bin/OTF/HTTP_DpFileDownloadMERRA2.pl?DATASET=MERRA_DP&FCP_DIR=/ftp/private/tmp/&APPLICATION=SUBSET_MERRA2&FILTER=SUBSET_MERRA2&SUB_LONMIN=85&SUB_LONMAX=180&SUB_LATMAX=75&SUB_LATMIN=-15&OUTPUT_FORMAT=nc4&LOOKUPID_List=M2T1NXSLV&STARTYR=1980&STARTMON=01&STARTDAY=01&ENDYR=2016&ENDMON=12&ENDDAY=31&variable=t250&variable=t500&variable=t850&variable=u250&variable=u500&variable=u850&variable=v250&variable=v500&variable=v850&

yrs=36;hours=24;dpm=31;months=1;mb_per_file=400;lat_frac=.333;lon_frac=.5;print(yrs*hours*dpm*months*mb_per_file*lat_frac*lon_frac/1024/1024,'TB')

1.7011779785156251 TB

Spacial subset changed to 85 West, -15 South, 180 East, 75 North to match paper's methodologies section
Screenshots (no time-subsetting nor aggregations are required):
- Look at the IBTrACS data (see bullet above) and then decide if we should do a different subset of years for the M2T1NXSLV dataset

Handle valid range / no-data markers

Many of the data files we work with have metadata containing information that a given int/float is used as a NaN marker and/or that the data have a valid range (e.g. the bottom end of the numeric range is a NaN marker). We should allow users to specify some metadata lookup to propagate those NaNs through the matrices in the sample. With our standardization of metadata for each of the file types, it should be possible to allow this user-option in a file-type-independent way.

Revise loading of files logic to support hdf4, hdf5, netcdf4, geotiff

See this TODO note on support for other file types.

In elm.readers change this module's name to hdf4_tools.py and make corresponding _tools.py files for the other file types.

Consider fitting multiple models per sample of data

Currently a model may receive 1 or more samples for training (one sample if the model does not have a partial_fit or possibly >1 if the model has a partial_fit and batches_per_gen is > 1). It may be worthwhile to consider fitting multiple models to a given sample, which would better suit situations where constructing the input sample to machine learning has a significant time cost relative to the time it takes to fit an input data set.

Need to provide a "skip model" or "skip sample" functionality

In the process of semi-random sample formation from file iterators, it happens that one may select a file unsuitable for learning. There are currently ways of filtering files on metadata and file name, but if all or some of the sample_pipeline has to be run in order to determine whether to skip the sample, then we have no system of dealing with this.

We should provide a way in run_sample_pipeline to end the sample pipeline early on a statistical filter (such removing an image of all-night conditions with a zero-variance filter).
Also a fit in fit.py should be able to return an unfitted model and info about why fitting could not happen.

More test coverage for existing pipeline features

This is a reminder for more test coverage on features already added. Specifically I am thinking of:

tests of ensemble and partial fit training methods
tests of elm/readers
tests of elm/samplers.py

Custom Columns (Features) in Machine Learning

Allow user to define new columns:

Assumptions:

The input data can be a ~~pandas dataframe~~xarray.Dataset with user given column names like "band_1", "band_2" ,etc
The user should be able to define new features like NDVI (generic band_ratio function)
Then the user can optionally use a newly defined feature in machine learning or as a new column to subset the dataframe. We will have to have a config entry called something like ml_features to indicate which dataframe columns go into machine learning (defaulting to all columns).
This user-defined-features-related-code should also handle the case of adding all polynomial features to a matrix or a subset of them.

Note: This work should see how the config's feature_selection is handled. It allows the user to specify choices, the columns which may go into a feature_selection and keep_columns, which are columns which must be passed through, regardless of the feature_selection, such as a column of meta information or a band that should definitely be in the learning input. The two previous code links are from feature_selection.py which takes input from one of the config's feature_selection dicts, e.g. this default.

Feature selection needs to remember its choices

We support the use of feature_selection in the sample_pipeline. When we serialize the trained model from such a sample_pipeline we need to serialize also the decisions about which columns were kept, else, as it is currently, the predictor will re-run the feature_selection, perhaps arriving at a dif't choice about which columns to keep.

Move defaults to standardized location

See notes on #5 regarding loading from config.

Move the defaults from the top of cli.py to a yaml file in the new iamlp/config dir.
Create a defaults.py module that loads the yaml file (it may be in the egg).
Make cli.py import from defaults.py
Provide tests of config loader

Ensemble methods using dask

See elm.pipeline.ensemble

Provide an ensemble method using dask or serial evaluation for common classifiers
Provide model averaging tools for various classifiers to select among models of an ensemble, e.g. this example for kmeans
Provide ~~CLI~~ config for running ensemble methods with partial_fit on each ensemble member.

Support relative path in ELM_TRAIN_PATH environment var

absolute works but not relative path

Data readers for common formats

Support file globs of:

HDF4-EOS version common in VIIRS, MODIS data. See elm/readers
HDF5
NetCDF (no example yet - find a good example)
Numpy npz binaries with band/ feature names taken from the npz file keys (this is helpful for making synthetic data to test)
GeoTiff

With each file type, look for common metatdata patterns in the ladsweb ftp data and other products. For example, this function depends on having a given type of spatial bounds metadata. Metadata tags that would be useful include:

Maximum value flag
NaN flag
Night flag
Quality flag
Cloud cover flag

Add small example data to repo

Add small example data (hdf4, tiff, netcdf, ...) with explanatory info to the repo so that basic image pipeline pre-processing functions can be started and accelerated.

https://github.com/ContinuumIO/nasasbir/blob/master/README_features.md

expand documentation on `data_sources` part of config

It needs fully worked out examples.

Plan for spectral embedding and nonlinear dim. reduction

Develop a plan to implement spectral embedding and nonlinear dimensionality reduction for large data using dask / numba. Focus areas: pinv, eigenvalues, connectivity graphs. Consider ensemble methods that are embarassingly parallel as an easier alternative?

More tests asserting malformed configurations fail

We added some testing of bad configuration yaml files towards the beginning of the project. We have created many new config options since then.

Later in Phase I of the project, we should

Revisit the tests in elm/config/tests and make sure bad configs are rejected.
Add to the validation functions some checks for parameters that are ignored (currently we only validate what we need in the algorithms and we do not reject a config based on a parameter that is never going to be used).

Make nightly test run elm-examples/example_scripts

Make the elm-run-all-tests also run each example_scripts with 0 return code.

continuumio / elm Goto Github PK

elm's Introduction

NASA SBIR Phase I & II - Open Source Parallel Image Analysis and Machine Learning Pipeline

Using the Code

Install:

Read more docs

elm's People

Contributors

Stargazers

Watchers

Forkers

elm's Issues

Recommend Projects

Recommend Topics

Recommend Org