nasaharvest / cropharvest
Open source remote sensing dataset with benchmarks
License: Creative Commons Attribution Share Alike 4.0 International
This repository should contain the original code from geomaml for:
I am having some trouble with the 'run inference' section of the demo. I am getting the error: AttributeError: module 'xarray' has no attribute 'open_rasterio'
I think it has something to do with xarray's open_rasterio being replaced by rioxarray, but nothing I try works.
If anyone has a workaround to fix this it would be very helpful :)
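A version-tolerant workaround is sketched below: fall back to rioxarray (installed separately via pip) when xarray no longer provides open_rasterio. The function name open_rasterio_compat is illustrative, not part of either library.

```python
def open_rasterio_compat(path):
    """Open a GeoTIFF as a DataArray across xarray versions.

    xarray.open_rasterio was removed from recent xarray releases;
    rioxarray.open_rasterio is its replacement.
    """
    import xarray

    if hasattr(xarray, "open_rasterio"):  # older xarray still has it
        return xarray.open_rasterio(path)

    import rioxarray  # pip install rioxarray

    return rioxarray.open_rasterio(path)
```

Another option is to patch the attribute back before running the demo (xarray.open_rasterio = rioxarray.open_rasterio) so that library code which still calls the old name keeps working.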
This looks like a terrific ML resource with a powerful API, but the documentation is a bit lean, especially for EO newcomers. The map in README.md suggests terrific image coverage of Europe and North America, but the example code is limited to Togo, with benchmarks for Kenya and Brazil.
Can we use cropharvest to feed data for Europe or North America to ML models? I am guessing we need to supplement the features download with features from the geographies we want to run ML on, but how to do that with cropharvest is not obvious.
Forgive me if the dataset is intended only for Kenya/Brazil/Togo and I have misunderstood. As EO professionals you will be familiar with the sentinelsat library, whose documentation is brilliant for EO newcomers but does not produce ML-ready products. Could you produce something as explanatory, but with an ML-ready output?
Several packages in the cropharvest pip package (e.g. rasterio, geopandas) require extensive steps to install on Windows.
See:
Additionally, previously installed packages, and the order in which packages are installed manually, all affect whether the cropharvest pip package will run on Windows.
In short, it makes sense to release the cropharvest package on conda, which should take care of these installation issues.
Bird's-Eye (pula.io): ~4k field polygons with planting date, crop type, and yield(!) in Kenya and Zambia.
https://drive.google.com/drive/folders/1nEhHxWzsZxqozO2LZa-uUl6DoKNVYZVZ
Hello @gabrieltseng and CropHarvest team,
Actually I am a newbie with the package. What I did is export a tif file using the export_for_labels method:

private_labels = pd.DataFrame({
    "lon": [min_lon, max_lon],
    "lat": [min_lat, max_lat],
    "start_date": [date(2021, 1, 1), date(2021, 1, 1)],
    "end_date": [date(2022, 1, 1), date(2022, 1, 1)],
})
google_cloud_exporter.export_for_labels(labels=private_labels)

Up to this point everything is good: I exported the tif file successfully and downloaded it. After that I tried to test my exported tif file in the model:
preds = Inference(model=modelp, normalizing_dict=None).run(local_path = f)
but I get this error:
preds = Inference(model=modelp, normalizing_dict=None).run(local_path = f)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/cropharvest/inference.py", line 89, in run
    x_np, flat_lat, flat_lon = Engineer.process_test_file(local_path, start_date)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/cropharvest/engineer.py", line 405, in process_test_file
    da, slope = Engineer.load_tif(path_to_file, start_date=start_date)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/cropharvest/engineer.py", line 236, in load_tif
    time_specific_da["band"] = range(bands_per_timestep + len(STATIC_BANDS))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/xarray/core/dataarray.py", line 754, in __setitem__
    self.coords[key] = value
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/xarray/core/coordinates.py", line 41, in __setitem__
    self.update({key: value})
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/xarray/core/coordinates.py", line 166, in update
    self._update_coords(coords, indexes)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/xarray/core/coordinates.py", line 342, in _update_coords
    dims = calculate_dimensions(coords_plus_data)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/xarray/core/dataset.py", line 208, in calculate_dimensions
    raise ValueError(
ValueError: conflicting sizes for dimension 'band': length 4 on <this-array> and length 19 on {'band': 'band', 'y': 'y', 'x': 'x'}
Issue: Zenodo is returning a 503 error; this prevents our test suite from running (due to the integration test).
Desired behaviour: If all other tests pass and the error is due to the Zenodo 503 error, tests should still pass
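One way to implement the desired behaviour, sketched here with a hypothetical helper (is_transient_server_error is not an existing function in the repo; pytest is assumed for the skip mechanism):

```python
def is_transient_server_error(exc) -> bool:
    """True for HTTP 5xx errors, which indicate a server-side outage
    (such as Zenodo's 503) rather than a bug in the code under test."""
    code = getattr(exc, "code", None) or getattr(exc, "status", None)
    return code is not None and 500 <= int(code) < 600

# In the integration test, something along these lines (pytest assumed):
#     try:
#         download_from_zenodo(...)
#     except Exception as e:
#         if is_transient_server_error(e):
#             pytest.skip(f"Zenodo unavailable: {e}")
#         raise
```

This keeps genuine failures (4xx errors, assertion failures) fatal while letting the suite pass during an outage.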
We want to improve the integration between cropharvest and crop-mask.
On the CropHarvest side this consists of: tif files should follow a <location>_<date> naming convention instead of being coupled to the labels.geojson.
This has the advantage of not requiring tif files to be downloaded before updating the dataset, which should make it easier to contribute new datasets.
cc @ivanzvonkov
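The proposed convention might be sketched like this (tif_filename is a hypothetical helper for illustration, not the package's API):

```python
from datetime import date


def tif_filename(location: str, start_date: date) -> str:
    # Hypothetical <location>_<date> naming scheme, decoupled from labels.geojson
    return f"{location}_{start_date.isoformat()}.tif"


tif_filename("togo", date(2021, 1, 1))  # 'togo_2021-01-01.tif'
```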
Li et al., 2022 provides 238,471 training samples for crop/non-crop in Syria over 2002-2019 via Google Drive: https://drive.google.com/drive/folders/1HLI3YAfCXcaccFJPQuhmgSlvfYuy4cBK
Description from the paper:
We used a supervised classification approach to map annual productive cropland in Syria. We collected training data manually by visual interpretation using growing-season and non-growing-season Landsat images, time series of NDVI from Landsat and MODIS, and high-resolution images on Google Earth.
In the process of selecting training samples, we considered two critical factors to ensure the representativeness and accuracy of the training: (1) the training samples need to be selected from multiple years, covering the dry year (2007), the wet year (2002) and the year with moderate precipitation (2012) to ensure the diversity of the samples; (2) high-resolution images on Google Earth are adequate to ensure the accuracy of visual interpretation of various land-cover types. In the end, 238,471 training pixels covering the years 2002, 2007, 2012, 2017, 2018 and 2019 were selected on the Google Earth Engine platform.
When loading the data into R via the sf::st_read() function, I get a bunch of degenerate edges. I think it's mostly limited to an Ethiopian dataset that is loaded as POLYGON features by that function. The problem seems to be duplicated coordinates in about 173 features. In R this prevents me, for instance, from calculating centroids, and I assume other software would run into some sort of issue here too, so it may be worth resolving regardless.
And thanks for the nice dataset btw!
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] CoordinateCleaner_2.0-20 sp_1.4-5 foreign_0.8-81 lubridate_1.8.0
[5] sf_1.0-3 handlr_0.3.0 readxl_1.3.1 forcats_0.5.1
[9] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.0.2
[13] tidyr_1.1.4 tibble_3.1.5 ggplot2_3.3.5 tidyverse_1.3.1
[17] checkmate_2.0.0 bibtex_0.4.2.3 geometr_0.2.11 tabshiftr_0.3.1
[21] luckiTools_0.0.1
loaded via a namespace (and not attached):
[1] fs_1.5.0 oai_0.3.2 bit64_4.0.5 httr_1.4.2 rgbif_3.6.0
[6] tools_4.1.2 backports_1.3.0 utf8_1.2.2 rgdal_1.5-27 R6_2.5.1
[11] KernSmooth_2.23-20 rgeos_0.5-9 lazyeval_0.2.2 DBI_1.1.1 colorspace_2.0-2
[16] raster_3.5-2 withr_2.4.2 tidyselect_1.1.1 bit_4.0.4 curl_4.3.2
[21] compiler_4.1.2 cli_3.1.0 rvest_1.0.2 xml2_1.3.2 triebeard_0.3.0
[26] scales_1.1.1 classInt_0.4-3 proxy_0.4-26 R.utils_2.11.0 pkgconfig_2.0.3
[31] dbplyr_2.1.1 fastmap_1.1.0 rlang_0.4.12 rstudioapi_0.13 httpcode_0.3.0
[36] RSQLite_2.2.8 generics_0.1.1 jsonlite_1.7.2 vroom_1.5.5 R.oo_1.24.0
[41] magrittr_2.0.1 s2_1.0.7 geosphere_1.5-14 Rcpp_1.0.7 munsell_0.5.0
[46] fansi_0.5.0 lifecycle_1.0.1 R.methodsS3_1.8.1 terra_1.4-11 whisker_0.4
[51] stringi_1.7.5 plyr_1.8.6 grid_4.1.2 blob_1.2.2 parallel_4.1.2
[56] crayon_1.4.2 deldir_1.0-6 lattice_0.20-45 conditionz_0.1.0 haven_2.4.3
[61] hms_1.1.1 wellknown_0.7.4 pillar_1.6.4 uuid_1.0-3 gdalUtils_2.0.3.2
[66] codetools_0.2-18 wk_0.5.0 crul_1.2.0 reprex_2.0.1 glue_1.4.2
[71] data.table_1.14.2 modelr_0.1.8 vctrs_0.3.8 tzdb_0.2.0 urltools_1.7.3
[76] foreach_1.5.1 testthat_3.1.0 cellranger_1.1.0 gtable_0.3.0 NLMR_1.1
[81] assertthat_0.2.1 cachem_1.0.6 mime_0.12 broom_0.7.10 rnaturalearth_0.1.0
[86] e1071_1.7-9 class_7.3-19 iterators_1.0.13 memoise_2.0.0 units_0.7-2
[91] ellipsis_0.3.2
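For anyone hitting the same problem, the duplicated vertices can often be stripped directly. The sketch below assumes plain (lon, lat) coordinate tuples; in geopandas, geometry.buffer(0) or shapely's make_valid are common alternatives for repairing such geometries.

```python
def drop_repeated_vertices(coords):
    """Remove consecutive duplicate coordinates from a polygon ring.

    Degenerate (zero-length) edges come from repeated vertices; stripping
    them is often enough to make centroid calculations work again.
    """
    cleaned = [coords[0]]
    for pt in coords[1:]:
        if pt != cleaned[-1]:
            cleaned.append(pt)
    if cleaned[0] != cleaned[-1]:  # keep the ring closed
        cleaned.append(cleaned[0])
    return cleaned
```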
Hi, I am trying to use the Kenya maize dataset. However, I found that the test data contain large numbers of -1 values for y, while the training data contain only binary values, 0 or 1. What do the -1 values mean?
evaluation_datasets = CropHarvest.create_benchmark_datasets(DATA_DIR)
kenya_dataset = evaluation_datasets[0]
X, y = kenya_dataset.as_array(flatten_x=True)
instances = []
for _, test_instance in kenya_dataset.test_data(flatten_x=True):
instances.append(test_instance)
print(numpy.count_nonzero(instances[1][0:-1].y == -1)) #7517 <--- ?
print(numpy.count_nonzero(instances[1][0:-1].y == 0)) #76
print(numpy.count_nonzero(instances[1][0:-1].y == 1)) #284
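If -1 marks pixels without a label (an assumption worth confirming with the maintainers), those entries can be masked out before computing metrics, e.g.:

```python
import numpy as np

# Assumption: -1 = unlabeled, 0 = non-crop, 1 = crop
y = np.array([1, 0, -1, 1, -1, 0])
preds = np.array([0.9, 0.2, 0.5, 0.7, 0.1, 0.4])

labeled = y != -1  # keep only genuinely labeled pixels
accuracy = np.mean((preds[labeled] > 0.5) == y[labeled])
```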
These are crop-type labels collected for the 2022 season
helmets_crop_type_mapping_2022_10_05_15_22_17_579394.csv
Consider adding data from WorldCereal reference data repository:
https://worldcereal.rdm.geo-wiki.org/
However, we should be careful about choosing where the data come from within the repository: for example, most of the US data (if not all) is sampled from the USDA CDL, which is itself a classification, not ground truth.
Hello, I am trying to use the dataset, which I downloaded from Zenodo. However, I found there is no explanation of the data format, such as the meaning of the filenames in "features" and of the dictionary keys in "labels.geojson"; I can only guess the meaning from the code. How can I get an official explanation of the dataset, including the filenames? Can you help me?
From their paper:
Our 409-tile test dataset, including expert consensus annotations and corresponding Dynamic World estimated probabilities and class labels for each 5100 m × 5100 m tile, is archived in Zenodo at https://doi.org/10.5281/zenodo.4766508 (ref. 36). The training dataset has been archived in PANGAEA in a separate repository: https://doi.org/10.1594/PANGAEA.933475 (ref. 37). The training and test data collected for Dynamic World are also available as an Earth Engine ImageCollection and can be accessed with:
ee.ImageCollection('projects/wri-datalab/dynamic_world/v1/DW_LABELS')
See https://github.com/nasaharvest/crop-mask/blob/master/src/ETL/processor.py#L44 for the vectorized function to use
I want to be able to:
All data should be stored within the package, so that there is no need for file management on behalf of the user.
Motivated by this conversation:
#21 (comment)
Ideally, have an API similar to this to call train vs. test datasets:
task = Task(...) # togo
train = CropHarvest(train=True, task=task)
test = CropHarvest(train=False, task=task)
X_train, Y_train = x_y_from_dataset(train)
X_test, Y_test = x_y_from_dataset(test)
Hello! I wanted to know whether I'm doing something wrong when accessing the Brazil evaluation data. In the CropHarvest paper, Table 2 says the Brazil benchmark has 682,559 samples (174,026 positives and 508,533 negatives). However, when I accessed the data I only got 537,454 (the 174,026 positives, but only 363,428 negatives).
The code I used is the following:
I'd appreciate any comments on this, thanks!
Hi!
I'm trying to use the Inference class for a region I exported with a bounding box and the EarthEngineExporter class. For some of the tiff files I get the following error:
RuntimeError: fillna on the test instance returned None; does the test instance contain NaN only bands?
Checking the DataArray returned from Engineer.load_tif, I do see that some bands have only NaN values for all pixels at every timestep. Here are the pixel NaN counts per timestep per channel of the tiff (array of shape 12x19). The last 4 bands are always NaN; if I'm not mistaken, these are the ERA5 and SRTM bands.
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56350, 56350, 56350, 56350],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56350, 56350, 56350, 56350],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56350, 56350, 56350, 56350],
       [0, 0, 56350, 56350, 56350, 56350, 56350, 56350, 56350, 56350, 56350, 56350, 56350, 56350, 56350, 56350, 56350, 56350, 56350],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56350, 56350, 56350, 56350],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56350, 56350, 56350, 56350],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56350, 56350, 56350, 56350],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56350, 56350, 56350, 56350],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56350, 56350, 56350, 56350],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56350, 56350, 56350, 56350],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56350, 56350, 56350, 56350],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56350, 56350, 56350, 56350]])
Do you know how good the coverage of the ERA5 and SRTM datasets is over land? Maybe this happens at sea, but I'm not sure. Otherwise I was thinking of somehow imputing sensible values for these bands in my region, but I guess that would require new functionality in the package, for example passing optional fill_values per band as a dict to the constructor of the Inference class.
Please let me know your thoughts and if you have any suggestions of what to do in this case.
Thank you very much in advance!
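The per-band fill_values idea above might look something like the following sketch (fill_nan_bands is hypothetical, not part of the package; band indices are just illustrative):

```python
import numpy as np


def fill_nan_bands(x, fill_values):
    """x: array of shape (timesteps, bands); fill_values: {band_index: value}."""
    x = x.copy()
    for band, value in fill_values.items():
        band_data = x[:, band]
        band_data[np.isnan(band_data)] = value  # in-place fill on the copy
    return x


x = np.ones((12, 19))
x[:, 15:] = np.nan  # e.g. the four all-NaN ERA5/SRTM bands
x = fill_nan_bands(x, {b: 0.0 for b in range(15, 19)})
```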
As a pip package user I want to have a Dataset object that inherits from torch.utils.data.Dataset
So that I can plug it into a torch.utils.data.DataLoader
and load remote sensing data easily into my model.
Might be relevant: https://pytorch.org/vision/stable/datasets.html#
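A minimal sketch of what such a wrapper could look like. torch.utils.data.DataLoader accepts any map-style object with __len__ and __getitem__, so torch itself is omitted here; in practice the class would subclass torch.utils.data.Dataset, which requires exactly these two methods.

```python
import numpy as np


class ArrayDataset:
    """Map-style dataset sketch over (X, y) arrays."""

    def __init__(self, x, y):
        assert len(x) == len(y), "features and labels must align"
        self.x, self.y = x, y

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


# e.g. wrap arrays obtained from dataset.as_array():
ds = ArrayDataset(np.zeros((10, 216)), np.zeros(10))
# loader = torch.utils.data.DataLoader(ds, batch_size=4)  # with torch installed
```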
Hello, this is my first year participating in this competition.
I couldn't follow the instructions in 'demo.ipynb'.
Are you planning to publish the complete instructions soon?
Hello, I am also participating in this competition for the first time this year, so I would like to ask about the competition dataset. The dataset is downloaded in four parts (labels.geojson / eo_data / features / test_features), and on GitHub the engineer file is provided to generate features and test_features from the *.tif files in eo_data. However, I can only generate features, and in test_features only the togo-eval.h5 file can be generated; the other four files cannot. Is something missing from the data generation?
- labels should be made an input to the export_for_labels() function
- labels should be checked for "start_date" and "end_date", which are used if they are found
- dest_bucket parameter where the user can specify the destination GCP bucket
- f"min_lat={min_lat}_min_lon={min_lon}_max_lat={max_lat}_max_lon={max_lon}_dates={start_date}_{end_date}_all" (where the all indicates all bands are being exported, not just Sentinel 2)
- check_gcp option which would check if the tif about to be exported already exists on Google Cloud (like checkpoint, but for cloud storage)
- check_ee option which would check if the tif about to be exported is currently in the Earth Engine queue
- credentials should be made an input to the region exporter
Hello @gabrieltseng and CropHarvest team,
I'm facing difficulties visualizing the eo_data in QGIS. Despite setting the RGB (B4, B3, B2) channels, I'm encountering patches that don't align with the Google satellite image. In addition, some images show padding around their borders, and I wonder why. The contrast enhancement setting in QGIS is set to Stretch to MinMax. The examples are shown in the pictures below.
I'm also curious why the images are cropped to 17 x 17 (in some cases 18 x 17, and 25 x 17 for the France data). This makes them relatively small and more challenging to recognize the contents. What is the strategy or benefit of cropping the train/val/test data in this way?
The other minor issue is the example code in this issue. While running
>>> my_dataset = CropHarvest("data", Task(bounding_box=get_country_bbox("France")[1], normalize=True))
I got an error: ValueError: Assigning CRS to a GeoDataFrame without a geometry column is not supported. Supply geometry using the 'geometry=' keyword argument, or by providing a DataFrame with column name 'geometry'.
This seems to be an issue in geopandas.read_file().
If anyone has suggestions or insights on addressing these issues, I would greatly appreciate it! Thank you in advance for your help.
Similar to the export end date, but make the associated observation date explicit
Hi!
I noticed Inference.run() assumes there is a start_date in the filename it processes. However, when exporting a region from a bounding box with the EarthEngineExporter, the start_date is not added to the export filename.
Associated issue: #101
The exporter can be called using the following:
EarthEngineExporter().export_for_labels()
instead of having to initialize the exporter first.
Hello! I am very new to this package and was hoping you could clarify if I am understanding things correctly.
I am trying to get a training and test dataset for togo.
Training Data for Togo:
Using the code below I was able to get training data with the correct dimensions based on demo.ipynb. My understanding is that the dimensions of this data are [number of corresponding rows in labels.geojson x (12 months * 18 bands)], i.e. 1290 rows for Togo in the geojson file and 216 values for each band/time. However, when I open the geojson file and search for Togo datasets, the list comes up short, with only 1276 entries. Can you confirm whether I am misunderstanding the setup?
Additionally, does the is_crop column mean that for each row the location had crops for the entire year (12 months of regressor data available)?
Test Data for Togo:
Also going off the demo notebook, to get the test data I used this code.
Can I confirm that the values are pulled from 'data/test_features/togo-eval.h5', with dimensions of (number of test locations) x (12 months * 18 bands)?
I really appreciate any help!
How is the features folder different from the eo_data folder? Has any technique been used to derive the features folder from eo_data?
Hello! I'm relatively new to Python and to EO analysis. I found CropHarvest an interesting package to work with. However, I'm having some issues understanding where things are.
I know labels.geojson is where the datasets are listed, and by subsetting with ['lem-brazil'] I can access data from that specific source and plot the points where each sample is located. I would like to know how to plot the associated (NDVI) satellite image for each data sample, like Figure 3 of the CropHarvest paper (CropHarvest: a global satellite dataset for crop type classification). Is that possible, or did I misunderstand and it is not possible to work with the satellite time series images? Also, is the meteorological data, for example, available through the same function? My guess is that everything is in cropharvest.datasets, but I'm struggling to find it.
Thank you very much.
This makes for easier analysis using GIS software
pip install cropharvest in a Windows-based (Windows 10) miniconda prompt fails on the GDAL dependency.
Below is the error message from the terminal:
Collecting rasterio>=1.2.6
Using cached rasterio-1.2.10.tar.gz (2.3 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [2 lines of output]
INFO:root:Building on Windows requires extra options to setup.py to locate needed GDAL files. More information is available in the README.
ERROR: A GDAL API version must be specified. Provide a path to gdal-config using a GDAL_CONFIG environment variable or use a GDAL_VERSION environment variable.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.
note: This error originates from a subprocess, and is likely not a problem with pip.
The rasterio docs call for providing the file location of the GDAL binary.
I'm resorting to working in the Windows Subsystem for Linux, but not all potential users may have this option?
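One commonly used workaround (a sketch, not officially supported instructions) is to install the binary geospatial dependencies from conda-forge first, then pip-install cropharvest into the same environment:

```shell
# Prebuilt GDAL/rasterio/geopandas come from conda-forge, avoiding the
# source build that fails on Windows; cropharvest is then layered on top.
conda create -n cropharvest-env python=3.9
conda activate cropharvest-env
conda install -c conda-forge rasterio geopandas
pip install cropharvest
```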