sea_ice_type_cnn_training's Introduction

Description

This repository provides facilities for processing ASIP v2 data (webpage). A manual for this dataset is provided here.

DISCLAIMER: This code project is released with automated tests covering the data-engineering aspects of the code, not the data-science aspects. It is meant to guide and help researchers and students get started on sea ice modelling with convolutional neural networks.

The order of execution of the different parts of the code is as follows:

  1. Execute the data building
  2. Execute the tensorflow training
  3. Execute the inference (apply) code
  4. Plotting the result of inference

Requirement

Run the following command in your environment to install the requirements:

pip install -r requirements.txt

Users of Microsoft VS Code can easily open a remote development container with the help of the .devcontainer folder and Dockerfile.

Usage

This code uses Python's argparse for reading input from the command line. Each of the three scripts (train_model.py, apply_model.py, and build_dataset.py) lists its own arguments when -h is placed after its name. For example, python train_model.py -h shows the arguments that belong to the training activities.

Execute the data building

Given the absolute path of a folder that contains all of the uncompressed .nc files of ASIP data, the data-building part of the code builds the dataset from those files and makes it ready for further machine learning training.

For training purposes, this can be done with the following command:

python build_dataset.py /absolute/path/to/the/folder/of/input_files

Only the unmasked locations of the data are selected, so that completely clean data is fed to the ML training.

HINT: The folder containing the input files must hold all of the .nc files directly, without any subfolders.

This command will create a folder named output in the folder that contains build_dataset.py and write all the output files into it.

As an example, to build data from the /fold1 folder and store it in the /fold2 folder with the nersc noise calculation, using both a window size and a stride of 400, the command below is used:

python build_dataset.py /fold1 -o /fold2 -n nersc_ -w 400 -s 400
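As a rough illustration of how such a command line is parsed, the interface could be defined with argparse along the following lines. This is a hypothetical sketch, not the repository's actual parser; option names and defaults are taken from the argument table below.

```python
import argparse

def make_parser():
    # Hypothetical sketch of the build_dataset.py parser; names and defaults
    # follow the argument table in this README, but the real script may differ.
    parser = argparse.ArgumentParser()
    parser.add_argument("input_dir")  # positional: folder with the .nc files
    parser.add_argument("-o", "--output_dir")
    parser.add_argument("-n", "--noise_method", default="nersc_")
    parser.add_argument("-w", "--window_size", type=int, default=700)
    parser.add_argument("-s", "--stride", type=int, default=700)
    parser.add_argument("-r", "--aspect_ratio", type=int, default=50)
    return parser

# parse the example command above
args = make_parser().parse_args("/fold1 -o /fold2 -n nersc_ -w 400 -s 400".split())
```

Unspecified options (here, --aspect_ratio) fall back to their defaults, which is why the example command only overrides the window size and stride.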

The table below shows how the arguments work:

| Argument short form | Argument long form | Default value | Description |
|---|---|---|---|
| (positional) | (positional) | [no default] | The first and only positional argument: path to the directory with the input netCDF files needed for data building or for applying the trained model (inference) |
| -o | --output_dir | [no default] | Path to the directory for the output of data building (.npz files) |
| -n | --noise_method | 'nersc_' | Noise-correction method used for the error calculation. Leave as the empty string '' for the ESA noise corrections or as 'nersc_' for the Nansen Center noise correction |
| -w | --window_size | 700 | Window size (of SAR and ice chart data) for the batching calculation; must be divisible by the aspect ratio (the ratio between the cell sizes of the primary and secondary inputs of the network). This is the size of the image samples used in the ML training step |
| -s | --stride | 700 | Stride (of SAR and ice chart data) for the batching calculation; must be divisible by the aspect ratio. This determines the overlap between image samples in the ML training step |
| -r | --aspect_ratio | 50 | Ratio between the cell sizes of the primary and secondary inputs of the ML model; stride and window_size must be divisible by it |
| -swa | --rm_swath | 0 | Threshold value compared with the netCDF attribute aoi_upperleft_sample to border the calculation |
| -d | --distance_threshold | 0 | Threshold for distance from land in the mask calculation |
| -a | --step_resolution_sar | 1 | Step for resizing the SAR data (the default value leads to no resizing) |
| -b | --step_resolution_output | 1 | Step for resizing the ice chart data (the default value leads to no resizing) |
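The divisibility constraints between window_size, stride, and aspect_ratio can be checked up front. A minimal sketch (the function name check_batching_args is hypothetical, not part of the repository):

```python
def check_batching_args(window_size, stride, aspect_ratio):
    # window_size and stride must both be divisible by the aspect ratio,
    # i.e. the ratio between the primary and secondary cell sizes.
    if window_size % aspect_ratio != 0:
        raise ValueError("window_size must be divisible by aspect_ratio")
    if stride % aspect_ratio != 0:
        raise ValueError("stride must be divisible by aspect_ratio")
    return True

check_batching_args(400, 400, 50)  # the example command above: valid
```

With the defaults (700, 700, 50) the check also passes; a window size of 420 with aspect ratio 50 would raise.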

Execute the tensorflow training

After building the data, you can train the TensorFlow model with the .npz files produced by the data building calculation. To do this, run the script train_model.py, setting the path of the output folder from the previous (data building) calculation with '-o' in the arguments.

It is strongly recommended to read the link below before using this part of the code, because everything for the file-based config (including the classes and scripts) is developed based on the explanation on this web page: https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly

If you want to run the training only with scenes that belong to a specific season of the year (spring, summer, etc.), set the beginning_day_of_year and ending_day_of_year variables on the command line to use only the files belonging to that period of the year. These two numbers are the start and end day, counted from the beginning of the year, between which data is read.
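Since the scene date is encoded at the start of each file name (e.g. 20190404T201246_...), the day-of-year filter can be sketched like this. The helper names are hypothetical; the repository's actual implementation may differ:

```python
import datetime

def day_of_year(filename):
    # The scenedate is the leading timestamp of the file name,
    # e.g. "20190404T201246_S1A_...".
    date = datetime.datetime.strptime(filename[:8], "%Y%m%d")
    return date.timetuple().tm_yday

def in_period(filename, beginning_day_of_year, ending_day_of_year):
    # Keep only files whose day count from 1 January lies in the window.
    return beginning_day_of_year <= day_of_year(filename) <= ending_day_of_year

# 4 April 2019 is day 94, so this scene is kept when training on days 60-152:
in_period("20190404T201246_S1A_AMSR2_Icechart-Greenland-SouthEast.nc", 60, 152)
```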

Training the TensorFlow model from the npz files of the /fold2 folder can be done with this command:

python train_model.py -o /fold2 -bs 4 -p 0.8 -see -sft

In the above example the npz files are read from the /fold2 folder.

The table below shows how the arguments work:

| Argument short form | Argument long form | Default value | Description |
|---|---|---|---|
| -o | --output_dir | [no default] | Path to the directory with the output files (.npz files from data building) |
| -see | --shuffle_on_epoch_end | False (if absent from the arguments) | Flag for shuffling the training subset of IDs at the end of every epoch during training |
| -sft | --shuffle_for_training | False (if absent from the arguments) | Flag for shuffling the list of IDs before dividing it into the 'training' and 'validation' subsets |
| -bd | --beginning_day_of_year | 0 | Minimum threshold compared with the scenedate of the files, to consider only a subset of files based on their day count from the first of January of the same year |
| -ed | --ending_day_of_year | 365 | Maximum threshold compared with the scenedate of the files, to consider only a subset of files based on their day count from the first of January of the same year |
| -p | --percentage_of_training | [no default] | Fraction of IDs to use as training data (between 0 and 1); the remaining '1 - percentage_of_training' fraction is used as validation data |
| -bs | --batch_size | [no default] | Batch size for the data generator |
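The interplay of percentage_of_training and shuffle_for_training can be sketched as follows. This is a hypothetical illustration (the function name and the seed parameter are not from the repository):

```python
import random

def split_ids(ids, percentage_of_training, shuffle_for_training=False, seed=0):
    # Split the list of sample IDs into training and validation subsets;
    # the first `percentage_of_training` fraction goes to training.
    ids = list(ids)
    if shuffle_for_training:
        # optionally shuffle before splitting (the -sft flag)
        random.Random(seed).shuffle(ids)
    cut = int(len(ids) * percentage_of_training)
    return ids[:cut], ids[cut:]

# with -p 0.8 and 10 samples: 8 training IDs, 2 validation IDs
train_ids, valid_ids = split_ids(range(10), 0.8)
```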

Execute the inference code

The output of the ML network consists of image patches, not the whole image at its original size. To see the result of the network as a whole image (i.e. a scene) after training, apply_model.py can be used. Just like the previous training example, we can run the command below on the command line:

python apply_model.py /fold1 -n nersc_ -w 400 -s 400 -bs 4

Hint: If you:

  • use resizing when building the data, or
  • use values of stride and window size such that there is an overlapping area during building and training,

and then train the network, this apply_model.py code (and the consequent plotting) is not applicable. This inference code is only for cases where resizing and overlapping are not used.

This mode is executed in an in-memory manner. In this case, only the .nc files of /fold1 are taken into consideration for applying the trained model. The trained model is selected automatically by TensorFlow as the last trained model (configurable via the checkpoint file; more information about the saving mechanism and checkpoint file is here).

Hint: it is important to use values of window_size, stride, and batch size identical to those of the data building calculation. Otherwise, applying the model is meaningless.

A folder named reconstructs_folder will be created at the same level as output_dir, and the reconstructed files will be saved inside it.
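When the stride equals the window size (no overlap), reconstructing a scene from its patches amounts to tiling them back onto the original grid. The sketch below is a simplified illustration with made-up helper names, not the repository's actual reconstruction code:

```python
import numpy as np

def reconstruct_scene(patches, rows, cols, window_size):
    # Place non-overlapping patches back on the original scene grid;
    # only valid when stride == window_size, as noted above.
    scene = np.zeros((rows * window_size, cols * window_size))
    for idx, patch in enumerate(patches):
        r, c = divmod(idx, cols)  # patches are assumed stored row by row
        scene[r * window_size:(r + 1) * window_size,
              c * window_size:(c + 1) * window_size] = patch
    return scene

# six 2x2 patches arranged in a 2x3 grid give a 4x6 scene
patches = [np.full((2, 2), i) for i in range(6)]
scene = reconstruct_scene(patches, rows=2, cols=3, window_size=2)
```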


The table below shows how the arguments work:

| Argument short form | Argument long form | Default value | Description |
|---|---|---|---|
| (positional) | (positional) | [no default] | The first and only positional argument: path to the directory with the input netCDF files needed for data building or for applying the trained model (inference) |
| -n | --noise_method | 'nersc_' | Noise-correction method used for the error calculation. Leave as the empty string '' for the ESA noise corrections or as 'nersc_' for the Nansen Center noise correction |
| -w | --window_size | 700 | Window size (of SAR and ice chart data) for the batching calculation; must be divisible by the aspect ratio (the ratio between the cell sizes of the primary and secondary inputs of the network). This is the size of the image samples used in the ML training step |
| -s | --stride | 700 | Stride (of SAR and ice chart data) for the batching calculation; must be divisible by the aspect ratio. This determines the overlap between image samples in the ML training step |
| -r | --aspect_ratio | 50 | Ratio between the cell sizes of the primary and secondary inputs of the ML model; stride and window_size must be divisible by it |
| -swa | --rm_swath | 0 | Threshold value compared with the netCDF attribute aoi_upperleft_sample to border the calculation |
| -d | --distance_threshold | 0 | Threshold for distance from land in the mask calculation |
| -a | --step_resolution_sar | 1 | Step for resizing the SAR data (the default value leads to no resizing) |
| -b | --step_resolution_output | 1 | Step for resizing the ice chart data (the default value leads to no resizing) |
| -bs | --batch_size | [no default] | Batch size for the data generator |

Plotting the result of inference

For plotting, you can run a separate Python script called show.py. Make sure the dependencies for this script are in place: you have to install scipy and numpy in your environment before running show.py. This script can also be substituted with an interactive Jupyter notebook. Plotting can be done with this command:

python show.py

sea_ice_type_cnn_training's People

Contributors

akorosov, alissa13777, azamifard

sea_ice_type_cnn_training's Issues

Nan values in .npz files

When running the training on the .npz files, I got a loss function equal to nan. When debugging, I found that there are NaN values in the .npz files (in the matrices for 'nersc_sar_primary' and 'nersc_sar_secondary').
I have run python train_model.py -o /fold2 -bs 4 -p 0.8 -see -sft on the .npz files extracted from the following .nc files:

  • 20180410T084537_S1B_AMSR2_Icechart-Greenland-SouthEast.nc
  • 20190404T201246_S1A_AMSR2_Icechart-Greenland-SouthEast.nc
  • 20190423T200433_S1A_AMSR2_Icechart-Greenland-SouthEast.nc
  • 20190509T081206_S1A_AMSR2_Icechart-Greenland-CentralEast.nc
  • 20190509T081306_S1A_AMSR2_Icechart-Greenland-CentralEast.nc
  • 20190519T194808_S1A_AMSR2_Icechart-Greenland-SouthEast.nc
  • 20190519T194908_S1A_AMSR2_Icechart-Greenland-SouthEast.nc
  • 20190523T200352_S1B_AMSR2_Icechart-Greenland-SouthEast.nc

To find the NaN values, I ran a Python script that dumps the different values of each file, and then searched with CTRL+F.
In Method 2 of the Python script (https://github.com/Alissa13777/Internship_NERSC_CNN_IceTypes), one can see in the shell which file contains NaN values.
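Instead of searching by hand, the affected arrays can be located by scanning each .npz for NaNs. A small standalone sketch (the key names follow this issue; the helper name is hypothetical):

```python
import numpy as np

def find_nan_keys(npz_path):
    # Return the names of arrays in a .npz file that contain NaN values.
    with np.load(npz_path) as data:
        return [key for key in data.files if np.isnan(data[key]).any()]

# demo with a synthetic file using the key names from this issue
np.savez("demo.npz",
         nersc_sar_primary=np.array([1.0, np.nan]),
         nersc_sar_secondary=np.array([1.0, 2.0]))
print(find_nan_keys("demo.npz"))  # ['nersc_sar_primary']
```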

Refactor self.util into classes

Another tree of classes can be developed:

class Batches:
    def calculate_variable_ML


class SarBatches(Batches):
    def padding

class OutputBatches(SarBatches):

class Amsr2Batches(Batches):

Attributes and methods from self.util can be moved to the new classes.

Inside main()

these classes can be instantiated and used in a loop after archive_.calculate_batches_for_masks() or even earlier:

for cls in [SarBatches, OutputBatches, Amsr2Batches]:
    o = cls(archive_)
    o.process()
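The proposed class tree and loop can be sketched end to end as below. The class names follow this issue; process as the common entry point is an assumption (the issue only names calculate_variable_ML and padding):

```python
class Batches:
    # Base class holding the shared batching logic and the archive reference.
    def __init__(self, archive):
        self.archive = archive

    def process(self):
        # Placeholder for calculate_variable_ML and related steps.
        return f"{type(self).__name__} processed {self.archive}"

class SarBatches(Batches):
    pass  # would add SAR-specific behaviour such as padding

class OutputBatches(SarBatches):
    pass  # ice chart output batches, reusing the SAR logic

class Amsr2Batches(Batches):
    pass  # AMSR2 batches, directly on the base class

# the loop proposed above, over a stand-in archive object
results = [cls("archive_").process()
           for cls in [SarBatches, OutputBatches, Amsr2Batches]]
```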

Develop the generator code

Questions about generators:

In a simple case with one input (sar) and one output (CT) of the same size, do we need just one generator?
In the same case, do we need just one generator for both training and validation data?

In a more complex case, a typical workflow:

  • one input from sar at the input layer
  • another input from amsr2 at an intermediate layer
  • output (CT) at the output layer

How many generators do we need?

Parameters for generator:

  • which bands to use
  • which files to use (list of files or mask)
  • at which layers to add the data
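One possible answer to the multi-input question is a single generator that yields both inputs per batch, parameterised by the list of files/IDs so the same code serves training and validation. A minimal sketch with a pluggable load_sample function (all names here are hypothetical, not from the repository):

```python
import numpy as np

def two_input_generator(ids, batch_size, load_sample):
    # Yield batches of ((sar, amsr2), ct); `load_sample` returns the three
    # arrays for one sample ID. Passing different ID lists gives separate
    # training and validation streams from the same generator code.
    for start in range(0, len(ids), batch_size):
        batch = [load_sample(i) for i in ids[start:start + batch_size]]
        sar, amsr2, ct = (np.stack(arrays) for arrays in zip(*batch))
        yield (sar, amsr2), ct

def fake_sample(i):
    # stand-in for reading one sample: full-resolution sar/ct, coarser amsr2
    return np.zeros((4, 4)), np.zeros((2, 2)), np.zeros((4, 4))

batches = list(two_input_generator(list(range(5)), 2, fake_sample))  # 3 batches
```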

To keep track of processed files

Write out which files were processed and which files are unprocessable.
If processing has to be restarted (due to a crash), process only the files not on this list.
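A simple way to persist this is a small JSON log; the file name and layout below are hypothetical:

```python
import json
import os

LOG_PATH = "processed_files.json"  # hypothetical log location

def load_log(path=LOG_PATH):
    # Read the log of processed/unprocessable files, if it exists.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"processed": [], "unprocessable": []}

def record(log, filename, ok, path=LOG_PATH):
    # Append the file to the right list and write the log back to disk.
    log["processed" if ok else "unprocessable"].append(filename)
    with open(path, "w") as f:
        json.dump(log, f)

def pending(all_files, log):
    # After a restart, only files not yet in the log need processing.
    done = set(log["processed"]) | set(log["unprocessable"])
    return [f for f in all_files if f not in done]
```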

data_builded should not be called before inference

It is quite disadvantageous that the data_builded script has to be called before inference: it then works only with data from the ASIP (note the P instead of D) dataset. However, we will also use the model for other data from Sentinel-1 and AMSR2. Therefore the model should be applicable to any input with such data.

One way to apply a model without a builder script is to make a new generator which accepts an in-memory object as input, instead of a list of NPZ files. The Archive class already has the functionality to read everything into memory (and also to write NPZ files, which is relevant only for the training dataset). Now a new DataGenerator should be developed that takes an Archive object as input.
Then Archive is no longer a proper class, as it mixes operations on an archive (multiple files) and on a single dataset. So it should be split in two (e.g. Archive and Dataset), and the new DataGenerator should take only a Dataset object.

Later (another issue), in order to adapt the generator to other input data, we will develop a class that can read Sentinel-1 and AMSR2 from two different files, collocate them on the same grid, create an object with the same interface as the Dataset above and use it either for building another training dataset or for inference.

Filtering dataset for nan values

Filter the dataset for NaN values that are not land (put a condition on land with distance_map) for sar and amsr2 images in build_dataset.py (after line 33). fil must be considered as a dictionary. (Filtering should be able to function independently from the functions in archive and mask.)
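Such a filter could be sketched as follows. This is a hypothetical illustration: it assumes distance_map holds, per pixel, the distance from land, so pixels at or below distance_threshold count as land and their NaNs are tolerated:

```python
import numpy as np

def nan_not_land_mask(image, distance_map, distance_threshold=0):
    # True where a pixel is NaN but NOT explained by land proximity,
    # i.e. the NaN values that should be filtered out of the dataset.
    nan_mask = np.isnan(image)
    land_mask = distance_map <= distance_threshold
    return nan_mask & ~land_mask

# tiny example: two NaN pixels, one on land (distance 0), one at sea
image = np.array([[np.nan, 1.0],
                  [np.nan, 2.0]])
distance_map = np.array([[0, 5],
                         [5, 5]])
bad = nan_not_land_mask(image, distance_map)  # only [1, 0] is a bad NaN
```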
