TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data

Home Page: https://www.osgeo.org/projects/torchgeo/

License: MIT License

Python 99.22% Shell 0.78%

pytorch torchvision datasets models transforms remote-sensing deep-learning earth-observation computer-vision geospatial

torchgeo's Introduction

TorchGeo is a PyTorch domain library, similar to torchvision, providing datasets, samplers, transforms, and pre-trained models specific to geospatial data.

The goal of this library is to make it simple:

for machine learning experts to work with geospatial data, and
for remote sensing experts to explore machine learning solutions.

Installation

The recommended way to install TorchGeo is with pip:

$ pip install torchgeo

For conda and spack installation instructions, see the documentation.

Documentation

You can find the documentation for TorchGeo on ReadTheDocs. This includes API documentation, contributing instructions, and several tutorials. For more details, check out our paper, podcast episode, tutorial, and blog post.

Example Usage

The following sections give basic examples of what you can do with TorchGeo.

First we'll import various classes and functions used in the following sections:

from lightning.pytorch import Trainer
from torch.utils.data import DataLoader

from torchgeo.datamodules import InriaAerialImageLabelingDataModule
from torchgeo.datasets import CDL, Landsat7, Landsat8, VHR10, stack_samples
from torchgeo.samplers import RandomGeoSampler
from torchgeo.trainers import SemanticSegmentationTask

Geospatial datasets and samplers

Many remote sensing applications involve working with geospatial datasets—datasets with geographic metadata. These datasets can be challenging to work with due to the sheer variety of data. Geospatial imagery is often multispectral with a different number of spectral bands and spatial resolution for every satellite. In addition, each file may be in a different coordinate reference system (CRS), requiring the data to be reprojected into a matching CRS.

In this example, we show how easy it is to work with geospatial data and to sample small image patches from a combination of Landsat and Cropland Data Layer (CDL) data using TorchGeo. First, we assume that the user has Landsat 7 and 8 imagery downloaded. Since Landsat 8 has more spectral bands than Landsat 7, we'll only use the bands that both satellites have in common. We'll create a single dataset including all images from both Landsat 7 and 8 data by taking the union between these two datasets.

landsat7 = Landsat7(root="...", bands=["B1", ..., "B7"])
landsat8 = Landsat8(root="...", bands=["B2", ..., "B8"])
landsat = landsat7 | landsat8

Next, we take the intersection between this dataset and the CDL dataset. We want to take the intersection instead of the union to ensure that we only sample from regions that have both Landsat and CDL data. Note that we can automatically download and checksum CDL data. Also note that each of these datasets may contain files in different coordinate reference systems (CRS) or resolutions, but TorchGeo automatically ensures that a matching CRS and resolution is used.

cdl = CDL(root="...", download=True, checksum=True)
dataset = landsat & cdl

This dataset can now be used with a PyTorch data loader. Unlike benchmark datasets, geospatial datasets often include very large images. For example, the CDL dataset consists of a single image covering the entire continental United States. In order to sample from these datasets using geospatial coordinates, TorchGeo defines a number of samplers. In this example, we'll use a random sampler that returns 256 x 256 pixel images and 10,000 samples per epoch. We also use a custom collation function to combine each sample dictionary into a mini-batch of samples.

sampler = RandomGeoSampler(dataset, size=256, length=10000)
dataloader = DataLoader(dataset, batch_size=128, sampler=sampler, collate_fn=stack_samples)

This data loader can now be used in your normal training/evaluation pipeline.

for batch in dataloader:
    image = batch["image"]
    mask = batch["mask"]

    # train a model, or make predictions using a pre-trained model

Many applications involve intelligently composing datasets based on geospatial metadata like this. For example, users may want to:

Combine datasets for multiple image sources and treat them as equivalent (e.g., Landsat 7 and 8)
Combine datasets for disparate geospatial locations (e.g., Chesapeake NY and PA)

These combinations require that all queries are present in at least one dataset, and can be created using a UnionDataset. Similarly, users may want to:

Combine image and target labels and sample from both simultaneously (e.g., Landsat and CDL)
Combine datasets for multiple image sources for multimodal learning or data fusion (e.g., Landsat and Sentinel)

These combinations require that all queries are present in both datasets, and can be created using an IntersectionDataset. TorchGeo automatically composes these datasets for you when you use the intersection (&) and union (|) operators.
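The two validity rules above can be sketched with plain bounding boxes. This is an illustrative model only, not TorchGeo's implementation: the `Box` class and the two `valid_for_*` helpers are hypothetical names, and all extents are assumed to already share one CRS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned extent in a shared CRS (hypothetical helper, not a TorchGeo class)."""

    minx: float
    miny: float
    maxx: float
    maxy: float

    def __and__(self, other: "Box") -> Optional["Box"]:
        """Return the overlap of two extents, or None if they are disjoint."""
        minx, miny = max(self.minx, other.minx), max(self.miny, other.miny)
        maxx, maxy = min(self.maxx, other.maxx), min(self.maxy, other.maxy)
        if minx >= maxx or miny >= maxy:
            return None
        return Box(minx, miny, maxx, maxy)

    def contains(self, query: "Box") -> bool:
        """True if the query extent lies entirely inside this extent."""
        return (
            self.minx <= query.minx
            and self.miny <= query.miny
            and self.maxx >= query.maxx
            and self.maxy >= query.maxy
        )


def valid_for_union(query: Box, extents: List[Box]) -> bool:
    # Union rule: the query must be present in at least one dataset.
    return any(extent.contains(query) for extent in extents)


def valid_for_intersection(query: Box, extents: List[Box]) -> bool:
    # Intersection rule: the query must be present in every dataset.
    return all(extent.contains(query) for extent in extents)
```

A query that falls only inside the first extent is valid for the union but not for the intersection, which is why image/label pairs such as Landsat and CDL use `&`.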

Benchmark datasets

TorchGeo includes a number of benchmark datasets—datasets that include both input images and target labels. This includes datasets for tasks like image classification, regression, semantic segmentation, object detection, instance segmentation, change detection, and more.

If you've used torchvision before, these datasets should seem very familiar. In this example, we'll create a dataset for the Northwestern Polytechnical University (NWPU) very-high-resolution ten-class (VHR-10) geospatial object detection dataset. This dataset can be automatically downloaded, checksummed, and extracted, just like with torchvision.

from torch.utils.data import DataLoader

from torchgeo.datamodules.utils import collate_fn_detection
from torchgeo.datasets import VHR10

# Initialize the dataset
dataset = VHR10(root="...", download=True, checksum=True)

# Initialize the dataloader with the custom collate function
dataloader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,
    collate_fn=collate_fn_detection,
)

# Training loop
for batch in dataloader:
    images = batch["image"]  # list of images
    boxes = batch["boxes"]  # list of boxes
    labels = batch["labels"]  # list of labels
    masks = batch["masks"]  # list of masks

    # train a model, or make predictions using a pre-trained model

All TorchGeo datasets are compatible with PyTorch data loaders, making them easy to integrate into existing training workflows. The only difference between a benchmark dataset in TorchGeo and a similar dataset in torchvision is that each dataset returns a dictionary with keys for each PyTorch Tensor.
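Because every sample is a dictionary, a collate function only has to decide how to combine values key by key. Object detection samples usually contain a different number of boxes per image, so they cannot be stacked into one tensor and are kept as lists instead. The sketch below shows that idea with plain Python objects; `collate_as_lists` is a hypothetical name and is not the implementation of `collate_fn_detection`.

```python
from typing import Any, Dict, List


def collate_as_lists(samples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Group per-sample dictionaries into one dictionary of lists, key by key."""
    batch: Dict[str, List[Any]] = {}
    for sample in samples:
        for key, value in sample.items():
            # Values stay in sample order; nothing is stacked or padded.
            batch.setdefault(key, []).append(value)
    return batch


# Two samples with a different number of boxes each.
samples = [
    {"image": "img0", "boxes": [[0, 0, 4, 4]], "labels": [1]},
    {"image": "img1", "boxes": [[1, 1, 2, 2], [3, 3, 5, 5]], "labels": [2, 7]},
]
batch = collate_as_lists(samples)
print(batch["labels"])
```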

Pre-trained Weights

Pre-trained weights have proven to be tremendously beneficial for transfer learning tasks in computer vision. Practitioners usually utilize models pre-trained on the ImageNet dataset, containing RGB images. However, remote sensing data often goes beyond RGB with additional multispectral channels that can vary across sensors. TorchGeo is the first library to support models pre-trained on different multispectral sensors, and adopts torchvision's multi-weight API. A summary of currently available weights can be seen in the docs. To create a timm ResNet-18 model with weights that have been pre-trained on Sentinel-2 imagery, you can do the following:

import timm
from torchgeo.models import ResNet18_Weights

weights = ResNet18_Weights.SENTINEL2_ALL_MOCO
model = timm.create_model("resnet18", in_chans=weights.meta["in_chans"], num_classes=10)
model.load_state_dict(weights.get_state_dict(progress=True), strict=False)

These weights can also directly be used in TorchGeo Lightning modules that are shown in the following section via the weights argument. For a notebook example, see this tutorial.

Reproducibility with Lightning

In order to facilitate direct comparisons between results published in the literature and further reduce the boilerplate code needed to run experiments with datasets in TorchGeo, we have created Lightning datamodules with well-defined train-val-test splits and trainers for various tasks like classification, regression, and semantic segmentation. These datamodules show how to incorporate augmentations from the kornia library, include preprocessing transforms (with pre-calculated channel statistics), and let users easily experiment with hyperparameters related to the data itself (as opposed to the modeling process). Training a semantic segmentation model on the Inria Aerial Image Labeling dataset is as easy as a few imports and four lines of code.

datamodule = InriaAerialImageLabelingDataModule(root="...", batch_size=64, num_workers=6)
task = SemanticSegmentationTask(
    model="unet",
    backbone="resnet50",
    weights=True,
    in_channels=3,
    num_classes=2,
    loss="ce",
    ignore_index=None,
    lr=0.1,
    patience=6,
)
trainer = Trainer(default_root_dir="...")

trainer.fit(model=task, datamodule=datamodule)

TorchGeo also supports command-line interface training using LightningCLI. It can be invoked in two ways:

# If torchgeo has been installed
torchgeo
# If torchgeo has been installed, or if it has been cloned to the current directory
python3 -m torchgeo

It supports command-line configuration or YAML/JSON config files. Valid options can be found from the help messages:

# See valid stages
torchgeo --help
# See valid trainer options
torchgeo fit --help
# See valid model options
torchgeo fit --model.help ClassificationTask
# See valid data options
torchgeo fit --data.help EuroSAT100DataModule

Using the following config file:

trainer:
  max_epochs: 20
model:
  class_path: ClassificationTask
  init_args:
    model: "resnet18"
    in_channels: 13
    num_classes: 10
data:
  class_path: EuroSAT100DataModule
  init_args:
    batch_size: 8
  dict_kwargs:
    download: true

we can see the script in action:

# Train and validate a model
torchgeo fit --config config.yaml
# Validate-only
torchgeo validate --config config.yaml
# Calculate and report test accuracy
torchgeo test --config config.yaml --trainer.ckpt_path=...

It can also be imported and used in a Python script if you need to extend it to add new features:

from torchgeo.main import main

main(["fit", "--config", "config.yaml"])

See the Lightning documentation for more details.

Citation

If you use this software in your work, please cite our paper:

@inproceedings{Stewart_TorchGeo_Deep_Learning_2022,
    address = {Seattle, Washington},
    author = {Stewart, Adam J. and Robinson, Caleb and Corley, Isaac A. and Ortiz, Anthony and Lavista Ferres, Juan M. and Banerjee, Arindam},
    booktitle = {Proceedings of the 30th International Conference on Advances in Geographic Information Systems},
    doi = {10.1145/3557915.3560953},
    month = nov,
    pages = {1--12},
    publisher = {Association for Computing Machinery},
    series = {SIGSPATIAL '22},
    title = {{TorchGeo}: Deep Learning With Geospatial Data},
    url = {https://dl.acm.org/doi/10.1145/3557915.3560953},
    year = {2022}
}

Contributing

This project welcomes contributions and suggestions. If you would like to submit a pull request, see our Contribution Guide for more information.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

torchgeo's Issues

0.1.0 release and publication

The following list enumerates all tasks that need to be completed before:

making the repo public,
tagging a 0.1.0 release,
publishing on PyPI/conda-forge/Spack, and
publishing a paper on TorchGeo.

This list may change over time as we reevaluate the remaining tasks that need to be done.

Documentation

Decide on a logo
Add tutorial (#38, #93)
Add installation instructions (#153)
Add contributing instructions
Document release procedure (https://github.com/microsoft/torchgeo/wiki/Releasing-Instructions)

Continuous Integration

Set up codecov, add badge (must be public first?)
Set up RTD, add badge (must be public first?)
Determine minimum supported dependency versions (#32, #80)
Achieve 100% test coverage

Publication

Decide on a publication venue (preprint on arXiv, paper in ICLR)
Benchmark different ways of performing the reprojection, etc. (#81)
Compare no-pretrain/imagenet-pretrain/torchgeo-pretrain
Find publications that discuss and compare various geospatial datasets
Create an Overleaf document with ICLR template

Datasets

Implement a few more GeoDatasets
- Chesapeake (#18)
- NAIP (#57)
- MSFT building footprints (#69)
Refactor indexing and getitem logic from datasets to GeoDataset base class (#73)

Models

Set up infrastructure for hosting pre-trained model weights

Samplers

Implement GridGeoSampler (#78)
More intelligent sampling (#82, #84)

Trainers

Factor out task specific configuration logic into the different task classes
Pass the "task level" arguments to the DataModules as well as the Tasks. Batch size/seed/num workers should probably go in the "task level" arguments.
Add trainer for GeoDataset
Refactor trainers to reduce code duplication (#205)

Transforms

Experiment with different ways of doing augmentation pipelines
Use Kornia, provide wrapper
Add geospatial-specific transforms (#127)

DOTA Dataset

https://captain-whu.github.io/DOTA/index.html

DOTA is a large-scale dataset for object detection in aerial images. It can be used to develop and evaluate object detectors in aerial images. The images are collected from different sensors and platforms. Image sizes range from 800 × 800 to 20,000 × 20,000 pixels, and the images contain objects exhibiting a wide variety of scales, orientations, and shapes.

Benchmarking of GeoDataset for a paper result

Datasets

We want to test several popular image sources, as well as both raster and vector labels.

NAIP + Chesapeake
Landsat + CDL
Sentinel + Canadian Building Footprints

There is also a question of which file formats to test. For example, sampling from GeoJSON can take 3 min per getitem, whereas ESRI Shapefile only takes 1 sec per getitem (#69 (comment)).

Experiments

Data location: local vs. blob storage
Sampling strategy: random vs. 2-step random tile/chip vs. non-random grid sampling
Warping strategy: on-the-fly vs. pre-processing step
I/O strategy: load/warp entire file vs. load/warp/cache entire file vs. load/warp single window

For the warping strategy, we should test the following possibilities:

Already in correct CRS/resolution
Need to change CRS
Need to change resolution
Need to change both CRS and resolution

What is the upfront cost of these pre-processing steps?

Example notebook: https://gist.github.com/calebrob6/d9bc5609ff638d601e2c35a1ab0a2dec

CV4A Kenya Crop Type Competition Dataset

Dataset of Sentinel-2 imagery and crop types used in one of the competitions in the CV4A workshop at ICLR 2020.

Links:

Dataset page on MLHub: https://registry.mlhub.earth/10.34911/rdnt.dw605x/
Workshop page: https://www.cv4gc.org/cv4a2020/
Competition page: https://zindi.africa/competitions/iclr-workshop-challenge-2-radiant-earth-computer-vision-for-crop-recognition/data
Baseline paper: https://arxiv.org/abs/2004.03023

Chesapeake datasets don't extract properly

The ChesapeakeMD dataset fails when attempting to extract the downloaded zip file as zipfile.ZipFile doesn't support the deflate64 compression type that _MD_STATEWIDE.zip uses.

Tests fail if GPU 0 is busy

The default config files use the GPU.

Open Buildings Dataset

The dataset contains 516M building detections, across an area of 19.4M km² (64% of the African continent).

For each building in this dataset we include the polygon describing its footprint on the ground, a confidence score indicating how sure we are that this is a building, and a Plus Code corresponding to the centre of the building. There is no information about the type of building, its street address, or any details other than its geometry.

Add glossary

We should add some kind of glossary to document common terms that may be unfamiliar to either:

Deep learning people who don't know remote sensing
Remote sensing people who don't know deep learning

Examples:

tiles/swaths vs. chips/patches
classification vs. object detection vs. semantic segmentation vs. instance segmentation

See https://sublime-and-sphinx-guide.readthedocs.io/en/latest/glossary.html

Add utility to compute bounding box of resulting prediction

Rationale

For GeoDatasets, at sampling time, we should know the CRS, bounding box, and resolution of the image that gets sampled. As the image is passed through transforms like Resample or Warp, we should be able to recompute the new CRS/bbox/res pretty easily. However, depending on the padding and stride used in convolutional and pooling layers, the bounding box and resolution may change significantly as the image is passed through the model. In order to save or stitch together our predictions, we'll need to be able to compute the new bbox/res.

Description

We have two possible options:

Build a utility that takes in a nn.Module (the neural network) and computes the resulting bbox/res.
Modify PyTorch's builtin modules (at least the convolution and pooling ones) to take the bbox/res as input and modify them directly.

In the short-term, we will likely go with 1 since it involves the least work. In the long-term, we may end up going with 2 since we'll want to be able to design networks that can take advantage of this kind of geospatial information.
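A minimal sketch of the first option is shown below, under simplifying assumptions: instead of inspecting an nn.Module it takes a hand-written list of (kernel, stride, padding) tuples, it handles one axis at a time, and it ignores dilation, ceil-mode pooling, and transposed convolutions. The function name `propagate_axis` is hypothetical. Along one axis, a layer with kernel k, stride s, and padding p moves the origin by res * ((k - s) / 2 - p), multiplies the resolution by s, and changes the size to floor((n + 2p - k) / s) + 1.

```python
from typing import Iterable, Tuple


def propagate_axis(
    origin: float,
    res: float,
    size: int,
    layers: Iterable[Tuple[int, int, int]],
) -> Tuple[float, float, int]:
    """Track origin, resolution, and pixel count along one axis.

    origin: coordinate of the outer edge of the first pixel
    res:    ground units per pixel
    size:   number of pixels along the axis
    layers: (kernel, stride, padding) per convolution or pooling layer
    """
    for kernel, stride, padding in layers:
        # Shift of the first output pixel's edge relative to the input edge.
        origin += res * ((kernel - stride) / 2 - padding)
        # Standard output-size formula for convolution and pooling.
        size = (size + 2 * padding - kernel) // stride + 1
        # Each output pixel now covers `stride` input pixels.
        res *= stride
    return origin, res, size


# A 3x3 convolution without padding followed by 2x2 pooling with stride 2.
origin, res, size = propagate_axis(0.0, 10.0, 256, [(3, 1, 0), (2, 2, 0)])
print(origin, res, size, origin + res * size)
```

The last printed value is the far edge of the prediction along that axis; applying the same function to the other axis gives the full bounding box.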

COWC Dataset

https://gdo152.llnl.gov/cowc/

The Cars Overhead With Context (COWC) data set is a large set of annotated cars from overhead. It is useful for training a device such as a deep neural network to learn to detect and/or count cars.

Jupyter Notebook visualization support

We should add integration with IPython:

By defining a _repr_png_ method that displays an image using the appropriate RGB channels.

It seems like this is defined at the class level, so we would have to either register this for ALL torch.Tensor objects (doesn't allow customization for which bands to use) or do something else at the Dataset level.
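One way to keep band selection configurable is a small wrapper object rather than a method on every tensor. The sketch below is an assumption-laden illustration: `RGBPreview` and `encode_png` are hypothetical names, bands are plain nested lists of 8-bit values, and the PNG is written with only the standard library so that the example stays self-contained. IPython calls `_repr_png_` on an object and expects PNG bytes back.

```python
import struct
import zlib
from typing import List, Sequence, Tuple

Band = List[List[int]]  # band[row][col], values in 0..255


def _chunk(tag: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, tag, payload, CRC."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def encode_png(rows: Sequence[Sequence[Tuple[int, int, int]]]) -> bytes:
    """Encode rows of (r, g, b) pixels as an 8-bit truecolor PNG."""
    height, width = len(rows), len(rows[0])
    # Each scanline starts with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(c for px in row for c in px) for row in rows)
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )


class RGBPreview:
    """Wrap a multispectral image and choose which bands to show as RGB."""

    def __init__(self, bands: List[Band], rgb: Tuple[int, int, int] = (0, 1, 2)):
        self.bands = bands
        self.rgb = rgb

    def _repr_png_(self) -> bytes:
        r, g, b = (self.bands[i] for i in self.rgb)
        rows = [list(zip(rr, gr, br)) for rr, gr, br in zip(r, g, b)]
        return encode_png(rows)
```

In a notebook, evaluating `RGBPreview(bands, rgb=(3, 2, 1))` as the last expression of a cell would then display the image using the chosen bands.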

Hide deprecation warnings

When using pytest, deprecation warnings for both torchgeo and all of its dependencies are displayed. The version of tensorboard I'm using raises hundreds of deprecation warnings:

.spack-env/view/lib/python3.8/site-packages/tensorboard/compat/proto/tensor_shape_pb2.py:18
  /home/t-astewart/torchgeo/.spack-env/view/lib/python3.8/site-packages/tensorboard/compat/proto/tensor_shape_pb2.py:18: DeprecationWarning: Call to deprecated create function FileDescriptor(). Note: Create unlinked descriptors is going to go away. Please use get/find descriptors from generated code or query the descriptor_pool.
    DESCRIPTOR = _descriptor.FileDescriptor(
...
.spack-env/view/lib/python3.8/site-packages/tensorboard/util/tensor_util.py:114
  /home/t-astewart/torchgeo/.spack-env/view/lib/python3.8/site-packages/tensorboard/util/tensor_util.py:114: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    np.bool: SlowAppendBoolArrayToTensorProto,

The first set of warnings about FileDescriptor/FieldDescriptor seems to have been fixed in master, as these are no longer present in the source code.

The second set of warnings was fixed in tensorflow/tensorboard#5138.

We should silence these specific warnings since they will be taken care of when a new tensorboard release comes out.
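As a rough illustration, the two messages quoted above can be matched with the standard library's warnings filters. The helper name below is hypothetical, and the patterns are copied from the start of each message. Under pytest, the equivalent patterns would normally go in its `filterwarnings` configuration option rather than in code, since pytest manages warning filters around each test.

```python
import warnings


def silence_known_tensorboard_warnings() -> None:
    """Ignore only the two deprecation messages quoted in this issue."""
    # The message argument is a regex matched against the start of the text.
    warnings.filterwarnings(
        "ignore",
        message=r"Call to deprecated create function",
        category=DeprecationWarning,
    )
    warnings.filterwarnings(
        "ignore",
        message=r"`np\.bool` is a deprecated alias",
        category=DeprecationWarning,
    )
```

Any other DeprecationWarning is still reported, so new deprecations in torchgeo or its dependencies remain visible.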

Segmentation models

https://github.com/qubvel/segmentation_models.pytorch

It may be useful to add wrappers around some of these models with pre-trained weights for satellite imagery tasks.

Sentinel Dataset

https://www.sentinel-hub.com/

Add datasets for various generations of Sentinel satellites.

Improvements to VHR-10 documentation

Need better docs describing the format of the dataset.

Remove cartopy dependency

We don't really need it for plotting, and it adds a lot of complexity to get it installed in CI.

Pipeline to generate pre-trained weights

We should set up some kind of Azure pipeline that can automatically generate and deploy new pre-trained weights whenever a new model/dataset is added or whenever we create a new release.

Required dependencies, lazy imports

Our number of dependencies is rapidly increasing. We should think about which of these dependencies are required (install_requires) vs. optional (extras_require). Here is a proposal:

Required: dependencies that are needed to do just about anything with TorchGeo
Optional: dependencies that are only needed for optional functionality, or for a small number of datasets

This gets a bit tricky, and is currently at odds with our dependency list. For example, rasterio is only used in RasterDataset, and fiona is only used in VectorDataset. While almost half of our datasets are RasterDataset, only 1 is currently VectorDataset. Also, matplotlib is only needed to plot example samples.

Here is another proposal:

Required: any dependency that is used in regular usage
Optional: dependencies used only in rare cases (single dataset)

This may be a better default. Most users will likely run pip install torchgeo, which will install only the things in install_requires. We don't want a useless installation to be the default, and extras_require is off by default. Alternatively, we could do a better job of documenting the recommended way to install TorchGeo and specify that you may want pip install torchgeo[datasets,train] or something like that.

Another thing to consider is how to handle these optional imports. We can't put them at the module level (in the case of datasets) so we use lazy imports instead. We also may want to create a wrapper like importorraise (akin to pytest's importorskip) that prints a more useful error message upon ImportError. We could go even further and use lazy imports for almost all imports, not just optional ones. This will greatly speed up importing torchgeo.
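A possible shape for such a wrapper is sketched below. The function name `import_or_raise`, its `extra` parameter, and the `torchgeo[datasets]` extra in the message are all hypothetical placeholders for whatever names are eventually chosen.

```python
import importlib
from types import ModuleType


def import_or_raise(name: str, extra: str = "datasets") -> ModuleType:
    """Import a module lazily, or raise an ImportError that says how to fix it."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        # The extra name is an assumption; adjust to the real extras_require keys.
        raise ImportError(
            f"{name} is required for this feature. "
            f"Try `pip install torchgeo[{extra}]`."
        ) from err


def load_vector_file(path: str):
    # The optional dependency is only imported when the feature is used.
    fiona = import_or_raise("fiona")
    return fiona.open(path)
```

Calling `import_or_raise` inside the function body, rather than at module level, keeps `import torchgeo` working when the optional package is missing.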

Redesign of ZipDataset

In my mind, there are several different reasons that someone might want to combine two GeoDatasets:

Combine image and target labels to sample both simultaneously (Landsat8 + CDL)
Combine datasets for multiple sensors (Landsat8 + Landsat7 or Landsat8 + Sentinel)
Combine datasets for disparate geospatial locations (ChesapeakeDE + ChesapeakeMD)

Right now, ZipDataset is designed to exclusively handle case 1. Case 2 doesn't work because the "image" key gets replaced instead of merged or concatenated. Case 3 doesn't work because we explicitly check for overlap between datasets.

We need to think about whether it is possible to support all possible use cases, whether there are any additional use cases, and how to implement this support. Ideally, these could all be wrapped into ZipDataset so that addition handles everything. Hopefully we don't need to add an additional ABC for MergeDataset or something like that.

Jupyter Notebook tutorials

We need to figure out how to render Jupyter Notebooks in our documentation so that we can provide easy-to-use tutorials for new users. This should work similarly to https://pytorch.org/tutorials/.

Ideally I would like to be able to test these tutorials so that they stay up-to-date.

Logging

We should replace all of our print statements and verbose parameters with Python's logging library. This will allow for more uniform access to messages. I'm thinking of the following levels:

logging.INFO: all operations that could be slow (download, checksum, decompression, extraction, indexing)
logging.DEBUG: all file access

More intelligent sampling

Here are some ideas:

All GeoSamplers should take a GeoDataset index as input
Randomly choose a file, then randomly sample from within bounds of that file (solves sampling out of bounds problem)
Add new sampler (RandomBatchGeoSampler) that subclasses BatchSampler and returns a batch of random patches from a single tile
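The second idea, choosing a file first and then sampling inside its bounds, can be sketched as follows. This is not the RandomGeoSampler implementation: `sample_patch` is a hypothetical name, tiles are plain (minx, miny, maxx, maxy) tuples, and files are chosen uniformly rather than weighted by area.

```python
import random
from typing import List, Tuple

Bounds = Tuple[float, float, float, float]  # minx, miny, maxx, maxy


def sample_patch(tiles: List[Bounds], size: float, rng: random.Random) -> Bounds:
    """Pick a random tile, then a random square patch fully inside it."""
    # Only tiles large enough to hold a whole patch are candidates.
    candidates = [
        t for t in tiles if t[2] - t[0] >= size and t[3] - t[1] >= size
    ]
    if not candidates:
        raise ValueError("no tile is large enough for the requested patch size")
    minx, miny, maxx, maxy = rng.choice(candidates)
    # Sample the lower-left corner so that the patch stays inside the tile.
    x = rng.uniform(minx, maxx - size)
    y = rng.uniform(miny, maxy - size)
    return (x, y, x + size, y + size)


rng = random.Random(0)
tiles = [(0.0, 0.0, 100.0, 100.0), (500.0, 500.0, 600.0, 650.0)]
print(sample_patch(tiles, 25.0, rng))
```

Because the patch is drawn from within a single tile, it can never fall in the empty space between tiles, which is the out-of-bounds problem mentioned above.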

When using ZipDataset with random samplers, the index should come from whichever dataset is tile-based. When using ZipDataset with grid samplers, the index should come from whichever dataset is not tile-based. Not yet sure how to handle something like Landsat + Sentinel, but we can figure that out another day.

Class hierarchy:

Sampler
- GeoSampler
  - RandomGeoSampler
  - GridGeoSampler
- BatchGeoSampler
  - RandomBatchGeoSampler

Make sure to document the difference between samplers and batch samplers and when to use which. Should store samplers and batch samplers in different files and combine in __init__ like we do with datasets. Add utils.py for things like _to_tuple.

Question: if I'm using an LRU cache and BatchSampler and multiple workers, if something isn't yet in the cache, will PyTorch spawn multiple workers all trying to warp the entire tile? It may actually be faster to use a single worker in this case.

Maxar imagery dataset

https://www.maxar.com/products/satellite-imagery

It seems like the imagery isn't free/open-source, but they do have samples we could use to write a data loader: https://resources.maxar.com/product-samples

Add `split` arg to VHR-10 Dataset

The VHR-10 Dataset consists of both "positive" images that contain objects of interest, as well as "negative" images that only contain background data. Currently, our dataset only handles positive images. We could add a split argument that allows users to select between "positive", "negative", and "both" image sets.

The problem is that this greatly increases the complexity of the code in the data loader because the annotations file doesn't list annotations/filenames for negative images. For this reason, even torchvision's COCO dataset doesn't contain unlabeled images.

Resolution/resampling

Many datasets like Sentinel2 have a different resolution per band. Currently we don't handle this and things crash when you try to concatenate bands with different shape. There are a few options for how to handle this:

Automatically resample to the highest resolution
Allow user to specify resolution to resample to
Store bands with different resolution in different dict keys

Option 3 makes it hard to automatically detect the "image" keys during data augmentation, but offers the greatest flexibility for modeling. Options 1 and 2 aren't mutually exclusive, and are probably the easiest to implement in the short term.
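Option 1 could look roughly like the sketch below, which upsamples every band to the finest resolution with nearest-neighbor resampling. It assumes bands are nested lists, that coarser resolutions are integer multiples of the finest one, and that all bands cover the same extent. The function names are hypothetical, and a real implementation would more likely rely on rasterio or torch interpolation.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]


def upsample_nearest(band: Grid, factor: int) -> Grid:
    """Repeat every pixel `factor` times along both axes."""
    return [
        [value for value in row for _ in range(factor)]
        for row in band
        for _ in range(factor)
    ]


def resample_to_finest(bands: Dict[str, Tuple[int, Grid]]) -> Tuple[int, Dict[str, Grid]]:
    """Bring all bands to the finest resolution so they can be concatenated.

    bands maps a band name to (resolution in meters, pixel grid).
    """
    finest = min(res for res, _ in bands.values())
    resampled: Dict[str, Grid] = {}
    for name, (res, grid) in bands.items():
        factor, remainder = divmod(res, finest)
        if remainder:
            raise ValueError(f"{name}: {res} is not a multiple of {finest}")
        resampled[name] = upsample_nearest(grid, factor)
    return finest, resampled


# A 10 m band and a 20 m band covering the same 40 m x 40 m area.
bands = {
    "B02": (10, [[1.0] * 4 for _ in range(4)]),
    "B05": (20, [[5.0, 6.0], [7.0, 8.0]]),
}
res, out = resample_to_finest(bands)
print(res, len(out["B05"]), len(out["B05"][0]))
```

Option 2 would differ only in taking the target resolution as an argument instead of computing the minimum.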

We also need to add transforms for resampling.

LandCover.ai Dataset

https://landcover.ai/

The LandCover.ai (Land Cover from Aerial Imagery) dataset is a dataset for automatic mapping of buildings, woodlands, water and roads from aerial images.

Dataset downloading expected behavior

What is the expected behavior when a dataset is not downloaded, and someone passes download=True, checksum=False?

I would expect that we download the dataset if it doesn't exist, but not verify that the download is correct; however, I think the LandcoverAI dataset (at least) will just return that the dataset exists.
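The expected behavior described above could be written down as a small decision function. This is one possible reading, not the current behavior of any TorchGeo dataset: `ensure_file`, its `fetch` callback, and the `expected_md5` parameter are hypothetical, and MD5 is assumed as the checksum algorithm.

```python
import hashlib
import os
from typing import Callable, Optional


def ensure_file(
    path: str,
    fetch: Callable[[str], None],
    download: bool = False,
    checksum: bool = False,
    expected_md5: Optional[str] = None,
) -> None:
    """Make sure `path` exists, downloading and verifying only when asked."""
    if not os.path.exists(path):
        if not download:
            raise RuntimeError("Dataset not found; pass download=True to fetch it")
        # Missing and download=True: fetch regardless of the checksum flag.
        fetch(path)
    if checksum:
        # Verification is a separate, opt-in step.
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        if digest != expected_md5:
            raise RuntimeError("Dataset found, but its checksum does not match")
```

With download=True and checksum=False, a missing file is fetched and accepted without verification, and an existing file is left untouched.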

Determine minimum supported dependency versions

Before releasing, we should determine the minimum supported version of each dependency. We should also consider a test with this version just to make sure it doesn't change.

NAIP Dataset

https://www.fsa.usda.gov/programs-and-services/aerial-photography/imagery-programs/naip-imagery/

The National Agriculture Imagery Program (NAIP) acquires aerial imagery during the agricultural growing seasons in the continental U.S. A primary goal of the NAIP program is to make digital ortho photography available to governmental agencies and the public within a year of acquisition.

NAIP is administered by the USDA's Farm Service Agency (FSA) through the Aerial Photography Field Office in Salt Lake City. This "leaf-on" imagery is used as a base layer for GIS programs in FSA's County Service Centers, and is used to maintain the Common Land Unit (CLU) boundaries.

Simplify dataset docs

For transforms/trainers/models, the rST file simply tells Sphinx to automatically generate all documentation. For datasets, we instead hard-code the order and section titles, meaning that the file needs to be updated every time a new dataset is added. This isn't that much work, but I keep forgetting to do it. We should see if there's an easy way to get some of the same structure we have now and still autogenerate the entire documentation page.

Things to investigate:

Can we change the order of __all__ to change the order in which each dataset gets documented? If not, we'll just have to stop caring about the order.
Can we put each section title in the module-level docstring so that datasets.rst never needs to be updated?

Chesapeake7 doesn't have an embedded color table

All the tiffs in the Chesapeake series have embedded color tables except for the Chesapeake7 tiff. This causes the cmap = src.colormap(1) line to throw a ValueError.

Deciding on "masks" or "mask" in `getitem`

The __getitem__ methods in Chesapeake and CDL datasets return a key "masks" while SEN12MS and LandcoverAI return "mask". Should we choose one or is there some reason we need to differentiate?

BigEarthNet dataset

http://bigearth.net/

EO dataset with over 100 citations
Used in immediately related work https://arxiv.org/pdf/1911.06721.pdf
It is included in the TensorFlow datasets (https://www.tensorflow.org/datasets/catalog/bigearthnet)

Add stitching utilities

We should add a torchgeo.utils package that provides utilities for common operations, including stitching together patches to create a prediction on an entire tile. This is equivalent to torchvision.utils. See https://arxiv.org/pdf/1805.12219.pdf for a survey of common stitching techniques that we should support. This includes clipping (best default) and averaging. May also want to add weighted averaging. Let's see what libraries exist for this. If rasterio can do this for us for free, that would be awesome.
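To make the averaging strategy concrete, here is a minimal pure-Python sketch. It assumes single-channel patches stored as nested lists with known pixel offsets, and `stitch_average` is a hypothetical name. Clipping and weighted averaging would follow the same accumulate-then-normalize pattern with different weights.

```python
from typing import List, Tuple

Grid = List[List[float]]


def stitch_average(
    patches: List[Tuple[int, int, Grid]], height: int, width: int
) -> Grid:
    """Combine overlapping patches into one tile by averaging the overlaps.

    Each patch is (row offset, column offset, pixel values).
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for row0, col0, patch in patches:
        for i, patch_row in enumerate(patch):
            for j, value in enumerate(patch_row):
                total[row0 + i][col0 + j] += value
                count[row0 + i][col0 + j] += 1
    # Pixels not covered by any patch are marked as NaN.
    return [
        [t / c if c else float("nan") for t, c in zip(total_row, count_row)]
        for total_row, count_row in zip(total, count)
    ]


# Two 1 x 3 patches that overlap in a single column.
tile = stitch_average([(0, 0, [[1.0, 1.0, 1.0]]), (0, 2, [[3.0, 3.0, 3.0]])], 1, 5)
print(tile)
```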

Download/checksum/decompression/extraction utilities

All of our datasets share various utilities for download/checksum/decompression/extraction of datasets available online. For now, I've been trying to use the utilities provided in torchvision.datasets.utils as much as possible, but these have many limitations. Specifically, the decompression/extraction logic doesn't handle many of our dataset formats (bz2, rar, etc). I was trying to submit PRs to add these features to torchvision, but they don't seem interested in many of them. Even if they do get merged, they will require a dependency on torchvision@master to actually use. We may want to write our own utilities instead of using torchvision's, at least for decompression/extraction. I'm a little afraid of writing my own utilities for downloading because they are complicated (especially for Google Drive) and would require internet access to test (slow).

For more info, see:

Chesapeake Land Cover Dataset

This dataset contains high-resolution aerial imagery from the USDA NAIP program [1], high-resolution land cover labels from the Chesapeake Conservancy [2], low-resolution land cover labels from the USGS NLCD 2011 dataset [3], low-resolution multi-spectral imagery from Landsat 8 [4], and high-resolution building footprint masks from Microsoft Bing [5], formatted to accelerate machine learning research into land cover mapping.

OpenStreetMap dataset

We should add a GeoDataset for OpenStreetMap: https://www.openstreetmap.org/

There are a couple of approaches that we could take:

Query the API every time we call __getitem__ (slow)
Download the data in a single large rectangle and load this (fast, more complex)
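
The second approach could be sketched roughly as follows. This is only an illustration of the lookup pattern: the Feature and CachedFeatureStore names are hypothetical, features are reduced to points, and the one-time download step is assumed to have happened already.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


@dataclass
class Feature:
    """A simplified point feature (real OSM data has richer geometry)."""

    name: str
    x: float
    y: float


class CachedFeatureStore:
    """Holds features for one large rectangle and answers bounding box queries."""

    def __init__(self, features: Sequence[Feature], bounds: BBox) -> None:
        # Features are loaded once up front, so no API call is made per query
        self.features = list(features)
        self.bounds = bounds

    def __getitem__(self, query: BBox) -> List[Feature]:
        minx, miny, maxx, maxy = query
        bminx, bminy, bmaxx, bmaxy = self.bounds
        # Reject queries that do not intersect the cached rectangle at all
        if maxx < bminx or minx > bmaxx or maxy < bminy or miny > bmaxy:
            raise IndexError(f"query {query} is outside bounds {self.bounds}")
        return [
            f for f in self.features
            if minx <= f.x <= maxx and miny <= f.y <= maxy
        ]
```

The linear scan keeps the sketch short. For a large rectangle, a spatial index would be the natural replacement.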

Convert many NonGeoDatasets to GeoDatasets

During the GeoDataset refactor (#37), all existing GeoDatasets were moved to VisionDataset. Now that the GeoDataset API has settled down, we should attempt to convert many of these VisionDatasets that have geospatial information back to GeoDataset.

We may need a STACDataset base class that subclasses GeoDataset and describes how to pull geospatial information from STAC JSON files.

Consolidate GeoDataset tests

Before RasterDataset and VectorDataset were added, each dataset class had to be tested separately. Now that most of the logic has been consolidated in RasterDataset and VectorDataset, we should move those tests to tests/datasets/test_geo.py. This will make it much easier to add new datasets without having to add extensive tests.

Denver Land Cover Dataset

https://drcog.org/services-and-resources/data-maps-and-modeling/regional-land-use-land-cover-project

A pilot land use land cover endeavor was undertaken by DRCOG, the Babbitt Center for Land and Water Policy, and the Conservation Innovation Center in 2019. During this pilot, 1,000 square miles of the Denver region were classified at 1-meter resolution using high-resolution imagery acquired as part of the 2018 Denver Regional Aerial Photography Project. Eight classes were identified: structures, impervious surface, water, grassland/prairie, tree canopy, irrigated lands/turf, cropland and barren/rock.

So2Sat

So2Sat LCZ42: A Benchmark Dataset for Global Local Climate Zones Classification

Links:

Drop `base_folder` from datasets

Like torchvision, all of our datasets have a base_folder attribute that specifies the subdirectory of root in which downloaded data is stored. However, this also prevents users from using data that is already on their system in a different directory structure. Most users will only need 1 or 2 datasets at a time, so instead of specifying data as the root and having files end up in data/sentinel2, users can just specify data/sentinel2 themselves. This gives more flexibility and simplifies the dataset code.
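
The difference can be illustrated with two toy classes. Both class names are hypothetical and stand in for any dataset; only the path handling is shown.

```python
import os


class WithBaseFolder:
    """Current behavior: the dataset appends its own subdirectory to root."""

    base_folder = "sentinel2"

    def __init__(self, root: str = "data") -> None:
        self.directory = os.path.join(root, self.base_folder)


class WithoutBaseFolder:
    """Proposed behavior: root is used exactly as the user gave it."""

    def __init__(self, root: str = "data") -> None:
        self.directory = root
```

With the second class, a user whose files already live in some other location can pass that path directly instead of having to recreate a data/sentinel2 layout.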

Image Models

https://github.com/rwightman/pytorch-image-models

It may be useful to add wrappers around some of these models with pre-trained weights for satellite imagery tasks.

Augmentation library options

The following libraries provide APIs for performing augmentations:

albumentations
- Dependencies on opencv-python-headless, scikit-image, numpy, scipy, PyYAML
- Looks to be CPU only
- Looks to support >3 channels where possible
kornia
- Only dependency is pytorch
- GPU enabled
- Supports >3 channels where possible
imgaug
- TODO

Citation

Once we release a paper on TorchGeo, we can use the following citation format file to directly add citation instructions to the repo:

This does not yet support citing external papers, but that feature is coming soon:

Add unit testing and coverage

Use pytest and codecov to get 100% coverage and prevent bugs.

SEN12MS Dataset

The availability of curated large-scale training data is a crucial factor for the development of well-generalizing deep learning methods for the extraction of geoinformation from multi-sensor remote sensing imagery. While quite some datasets have already been published by the community, most of them suffer from rather strong limitations, e.g. regarding spatial coverage, diversity or simply number of available samples. Exploiting the freely available data acquired by the Sentinel satellites of the Copernicus program implemented by the European Space Agency, as well as the cloud computing facilities of Google Earth Engine, we provide a dataset consisting of 180,662 triplets of dual-pol synthetic aperture radar (SAR) image patches, multi-spectral Sentinel-2 image patches, and MODIS land cover maps. With all patches being fully georeferenced at a 10 m ground sampling distance and covering all inhabited continents during all meteorological seasons, we expect the dataset to support the community in developing sophisticated deep learning-based approaches for common tasks such as scene classification or semantic segmentation for land cover mapping.

https://arxiv.org/pdf/1906.07789.pdf
https://github.com/schmitt-muc/SEN12MS
https://mediatum.ub.tum.de/1474000 (510 GB download)

Microsoft Canadian Building Footprints Dataset

https://github.com/Microsoft/CanadianBuildingFootprints

This dataset contains 11,842,186 computer generated building footprints in all Canadian provinces and territories. This data is freely available for download and use.

Landsat Dataset

https://landsat.gsfc.nasa.gov/

Add data loaders for various generations of Landsat data.

Add documentation

Use sphinx/napoleon to generate documentation and upload it to readthedocs.io. This will be a good chance to reserve the torchgeo domain. Only problem is that I don't think readthedocs has free private docs, so this may have to wait until it is public.
