
Comments (7)

zain commented on August 23, 2024

I like the API idea but we'd have to figure out a standardized way to structure the training data (for input into the data-analysis Dockerfile). I think you're still mucking with pickle/JSON/etc. as the output of create_training_data.py, right?

Am I correct in my understanding that the structure of the training data is almost as important as the Tensorflow model itself?

from deeposm.

silberman commented on August 23, 2024

Working backwards: ultimately the TensorFlow models want 2 or 4 big ndarrays: train_input_features and train_labels, and in research mode you'll probably also want a test_input_features and test_labels that ideally don't share too much with the training set.

Different models and experiments will want differently shaped versions of those. Say you want to try training on the red band + infrared band, with your labels being the current one-hot "does a road go through the middle 3x3 square of this tile". Then the input tensor that tf or keras or tflearn wants would be shaped something like 30000 x 64 x 64 x 2, and the output 30000 x 2. (Or maybe you want to try predicting whether each individual pixel is a road or not, so the output needs to be 30000 x 64 x 64 x 1. It would be nice if that were easy to try.) An experiment can then be abstracted to those 2-4 ndarrays plus some arbitrary TensorFlow model with appropriate placeholder shapes at the beginning and end.
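Concretely, the shape bookkeeping above can be sketched with plain numpy (a minimal illustration; the array names follow this thread, the tile count is scaled down from ~30000, and the contents are just zeros):

```python
import numpy as np

# Red + infrared experiment: 64x64 tiles, 2 spectral bands.
# n_tiles would be ~30000 in practice; scaled down here for illustration.
n_tiles, tile_size, n_bands = 10, 64, 2

# What tf/keras/tflearn ultimately consume: big ndarrays shaped like these.
train_input_features = np.zeros((n_tiles, tile_size, tile_size, n_bands),
                                dtype=np.float32)

# One-hot "does a road go through the middle 3x3 square" label: [road, no_road].
train_labels = np.zeros((n_tiles, 2), dtype=np.float32)

# Per-pixel alternative: one road/not-road value for every pixel of every tile.
per_pixel_labels = np.zeros((n_tiles, tile_size, tile_size, 1), dtype=np.float32)
```

Swapping experiments then only changes the trailing axes (bands in, label shape out), not the overall layout.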

So a 2-step, 2-Dockerfile workflow could look something like:
/bin/create_training_data.py -numtiles 35000
Wrote 35000 tiles worth of features and labels to /data/experiment17

Then the first step of running an experiment on the Tensorflow end could look like:

data_set = load_data("/data/experiment17",
                     input_features=[NAIP_RED, NAIP_INFRARED],
                     labels=[MIDDLE_CONTAINS_WAY_3by3])

model = make_some_model()
model.fit(data_set.train_input, data_set.train_labels, validation=0.1)
model.evaluate(data_set.test_input, data_set.test_labels)

A few different ways I can see that cache directory working:

  • 35,000 little 64x64x4 .npy files
  • 1 big 35000x64x64x4 .npy file
  • Each RGBI feature could be saved individually, so 35000 files like tile_1_NAIPRED.npy, tile_1_NAIPGREEN.npy, etc
  • All of each feature saved together, so for now we'd have 4 big 35000x64x64x1 .npy, one for each of RGBI.

Then we also have to cache the various labels. These could be per-tile files like tile_1_middle3by3_label.npy (a tiny 2x1 .npy each), or all put together in one all_middle3by3_label.npy.

Simplest, I think, would be keeping one giant 35000x64x64x4 .npy file for all the input layers we have now, plus a few 35000xWhatever files, one for each of the various label methods we have.

Upside of having tons of little files is that it would scale better if we wanted 100s of thousands of these things, since you only really ever need ~128 at a time, or whatever is one batch, so we could avoid putting a 500000x64x64x4 array in memory. But there's probably a better solution before that point, like a database or some tensorflow built-in feeding methods.
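A minimal sketch of the "one giant .npy" option, including the memory-mapped loading that sidesteps the scaling concern above (sizes reduced from 35000 tiles to 100 for illustration; paths and names hypothetical):

```python
import os
import tempfile
import numpy as np

# Scaled-down stand-in for the full RGBI feature stack and one label method.
n_tiles = 100  # would be 35000 (or 500000) in practice
features = np.random.rand(n_tiles, 64, 64, 4).astype(np.float32)  # RGBI tiles
labels = np.zeros((n_tiles, 2), dtype=np.float32)  # middle-3x3 one-hot labels

out_dir = tempfile.mkdtemp()
np.save(os.path.join(out_dir, "all_features.npy"), features)
np.save(os.path.join(out_dir, "all_middle3by3_label.npy"), labels)

# Loading side: mmap_mode="r" lets you slice out one ~batch-sized chunk
# without pulling the whole array into memory.
mapped = np.load(os.path.join(out_dir, "all_features.npy"), mmap_mode="r")
batch = np.asarray(mapped[0:32])  # only this slice is actually read from disk
```

With the memory map, even a 500000x64x64x4 file stays on disk and only the current batch is materialized, though a database or TensorFlow's built-in feeding may still be the better long-term answer.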


silberman commented on August 23, 2024

Btw, Amazon's dsstne that was open-sourced yesterday uses NetCDF as their ndarray serialization format. I don't think we should use either, though at some point that library may be part of the best way to train huge models on AWS. (Right now it doesn't support convolutions, is optimized for sparse data, and emphasizes "speed and scale over experimental flexibility", which is not what we want; but maybe something built on top of it will be.)

https://github.com/amznlabs/amazon-dsstne


zain commented on August 23, 2024

@silberman: Why do you think we shouldn't use NetCDF4?


andrewljohnson commented on August 23, 2024

@silberman how about move your comments to this issue: #23. (Move/delete from here?)

I think this is a separate ticket.


silberman commented on August 23, 2024

@zain Oh, I've just never used NetCDF4 before, but looking into it a bit now, it actually looks great for this.

So instead of "/data/experiment17", we'd have a NetCDF group and a data-loading function that knows how to take an experiment name and turn it into the 4 numpy arrays TensorFlow wants. (We may also want to divide into train/test groups before serializing, depending on the experiment.)

So far most experimentation has been on the labelling side, i.e. upstream of this cache. I'd like to speed up that whole pipeline by moving the raster and OSM data into a PostGIS database; at that point the normal experiment workflow might skip this serialization step, though it would still be useful for testing, reproducibility, or eventually sending data through dsstne.

If it worked like that, nearly all the action could happen in the data analysis container. The first step would just be responsible for downloading and inserting a bunch of data into a database, and could be run once. Then the analysis environment should be able to create, save, and load datasets, as long as it can access the database.

Re: the API, I think one going in the opposite direction could be cool: a website interface for constructing net architectures (producing JSON like: https://github.com/amznlabs/amazon-dsstne/blob/master/docs/getting_started/userguide.md#neural-network-layer-definition-language), and we report how accurate it was, plus other TensorBoard output. Writing the labeller function is a harder interface to come up with, so we might have to offer the feature sets as already-made, like they do at http://playground.tensorflow.org/


andrewljohnson commented on August 23, 2024

merging with other infrastructure issues

