
Comments (7)

zain commented on August 23, 2024

I like the API idea but we'd have to figure out a standardized way to structure the training data (for input into the data-analysis Dockerfile). I think you're still mucking with pickle/JSON/etc. as the output of create_training_data.py, right?

Am I correct in my understanding that the structure of the training data is almost as important as the Tensorflow model itself?

from deeposm.

silberman commented on August 23, 2024

Working backwards: ultimately the TensorFlow models want 2 or 4 big ndarrays: train_input_features and train_labels, and in research mode you'll probably also want a test_input_features and test_labels that ideally don't share too much with the training set.

Different models and experiments will want differently shaped versions of those. Say you want to try training on the red band + infrared band, with your labels being the current one-hot "does a road go through the middle 3x3 square of this tile". Then the input tensor that tf or keras or tflearn wants would be shaped something like 30000 x 64 x 64 x 2, and the output 30000 x 2. (Or maybe you want to try predicting whether each individual pixel is a road or not, so the output needs to be 30000 x 64 x 64 x 1. It would be nice if that were easy to try.) An experiment can then be abstracted to those 2-4 ndarrays plus some arbitrary TensorFlow model with appropriate placeholder shapes at the beginning and end.
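Concretely, the shape bookkeeping above can be sketched with plain numpy (a minimal illustration; the array names follow this thread, the tile count is scaled down from ~30000, and the contents are just zeros):

```python
import numpy as np

# Red + infrared experiment: 64x64 tiles, 2 spectral bands.
# n_tiles would be ~30000 in practice; scaled down here for illustration.
n_tiles, tile_size, n_bands = 10, 64, 2

# What tf/keras/tflearn ultimately consume: big ndarrays shaped like these.
train_input_features = np.zeros((n_tiles, tile_size, tile_size, n_bands),
                                dtype=np.float32)

# One-hot "does a road go through the middle 3x3 square" label: [road, no_road].
train_labels = np.zeros((n_tiles, 2), dtype=np.float32)

# Per-pixel alternative: one road/not-road value for every pixel of every tile.
per_pixel_labels = np.zeros((n_tiles, tile_size, tile_size, 1), dtype=np.float32)
```

Swapping experiments then only changes the trailing axes (bands in, label shape out), not the overall layout.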

So a 2-step, 2-Dockerfile workflow could look something like:
/bin/create_training_data.py -numtiles 35000
Wrote 35000 tiles worth of features and labels to /data/experiment17

Then the first step of running an experiment on the Tensorflow end could look like:

data_set = load_data("/data/experiment17",
                     input_features=[NAIP_RED, NAIP_INFRARED],
                     labels=[MIDDLE_CONTAINS_WAY_3by3])

model = make_some_model()
model.fit(data_set.train_input, data_set.train_labels, validation=0.1)
model.evaluate(data_set.test_input, data_set.test_labels)

A few different ways I can see that cache directory working:

  • 35,000 little 64x64x4 .npy files
  • 1 big 35000x64x64x4 .npy file
  • Each RGBI feature could be saved individually, so 35000 files like tile_1_NAIPRED.npy, tile_1_NAIPGREEN.npy, etc
  • All of each feature saved together, so for now we'd have 4 big 35000x64x64x1 .npy, one for each of RGBI.

Then we also have to cache the various labels. These could be per-tile files like tile_1_middle3by3_label.npy (a tiny 2x1 .npy each), or all put together in one all_middle3by3_label.npy.

Simplest, I think, would be keeping one giant 35000x64x64x4 .npy file for all the input layers we have now, plus a few 35000xWhatever files, one for each of the various label methods we have.

Upside of having tons of little files is that it would scale better if we wanted 100s of thousands of these things, since you only really ever need ~128 at a time, or whatever is one batch, so we could avoid putting a 500000x64x64x4 array in memory. But there's probably a better solution before that point, like a database or some tensorflow built-in feeding methods.
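A minimal sketch of the "one giant .npy" option, including the memory-mapped loading that sidesteps the scaling concern above (sizes reduced from 35000 tiles to 100 for illustration; paths and names hypothetical):

```python
import os
import tempfile
import numpy as np

# Scaled-down stand-in for the full RGBI feature stack and one label method.
n_tiles = 100  # would be 35000 (or 500000) in practice
features = np.random.rand(n_tiles, 64, 64, 4).astype(np.float32)  # RGBI tiles
labels = np.zeros((n_tiles, 2), dtype=np.float32)  # middle-3x3 one-hot labels

out_dir = tempfile.mkdtemp()
np.save(os.path.join(out_dir, "all_features.npy"), features)
np.save(os.path.join(out_dir, "all_middle3by3_label.npy"), labels)

# Loading side: mmap_mode="r" lets you slice out one ~batch-sized chunk
# without pulling the whole array into memory.
mapped = np.load(os.path.join(out_dir, "all_features.npy"), mmap_mode="r")
batch = np.asarray(mapped[0:32])  # only this slice is actually read from disk
```

With the memory map, even a 500000x64x64x4 file stays on disk and only the current batch is materialized, though a database or TensorFlow's built-in feeding may still be the better long-term answer.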


silberman commented on August 23, 2024

Btw, Amazon's dsstne that was open-sourced yesterday uses NetCDF as their ndarray serialization format. I don't think we should use either, though at some point that library may be part of the best way to train huge models on AWS. (Right now it doesn't support convolutions, is optimized for sparse data, and emphasizes "speed and scale over experimental flexibility", which is not what we want; but maybe something built on top of it will be.)

https://github.com/amznlabs/amazon-dsstne


zain commented on August 23, 2024

@silberman: Why do you think we shouldn't use NetCDF4?


andrewljohnson commented on August 23, 2024

@silberman how about move your comments to this issue: #23. (Move/delete from here?)

I think this is a separate ticket.


silberman commented on August 23, 2024

@zain Oh, I've just never used NetCDF4 before, but looking into it a bit now, it actually looks great for this.

So instead of "/data/experiment17", we'd have a NetCDF group and a data-loading function that knows how to take an experiment name and turn it into the 4 numpy arrays TensorFlow wants. (We may also want to divide into train/test groups before serializing, depending on the experiment.)

So far most experimentation has been on the labelling side, i.e. upstream of this cache. I'd like to speed up that whole pipeline by moving the raster and OSM data into a PostGIS database; at that point the normal experiment workflow might skip this serialization step, though it would still be useful for testing, reproducibility, or eventually sending data through dsstne.

If it worked like that, nearly all the action could happen in the data analysis container. The first step would just be responsible for downloading and inserting a bunch of data into a database, and could be run once. Then the analysis environment should be able to create, save, and load datasets, as long as it can access the database.

Re: the API, I think one going in the opposite direction could be cool: a website interface for constructing net architectures (producing JSON like: https://github.com/amznlabs/amazon-dsstne/blob/master/docs/getting_started/userguide.md#neural-network-layer-definition-language), and we report how accurate it was, plus other TensorBoard output. Writing the labeller function is a harder interface to come up with, so we might have to offer the feature sets as already-made, like they do at http://playground.tensorflow.org/


andrewljohnson commented on August 23, 2024

merging with other infrastructure issues

