weed-ai's Introduction

Welcome to Weed-AI

Weed-AI is an open-source, searchable weed image platform designed to facilitate the research and development of machine learning algorithms for weed recognition in cropping systems. It brings together existing datasets, enables users to contribute their own data, and assembles custom datasets for straightforward download.

See our Weed Explorer at https://weed-ai.sydney.edu.au

Train YOLOv5 with Weed-AI Datasets

Follow the Google Colab notebook to train your own YOLOv5 models with Weed-AI datasets.

Background

Large numbers of high quality, annotated weed images are essential for the development of weed recognition algorithms that are accurate and reliable in complex biological systems. Accurate weed recognition enables the use of site-specific weed control (SSWC) in agricultural systems, eliminating the need for wasteful whole field treatments. This approach substantially reduces weed control inputs and creates opportunities for the introduction of alternative weed control technologies that were not previously feasible for use as indiscriminate whole field treatments. SSWC relies on accurate detection (is a weed present) and identification (what is the species/further information on morphology) of weeds in agricultural and environmental systems (crop, pastures, rangelands and non-crop areas, etc.). Camera-based weed recognition using deep learning algorithms has emerged as a frontrunner for in-crop site-specific control with an improved ability to handle variation.

Training and development of algorithms require significant quantities of high-quality, annotated images. Weed-AI is addressing this challenge by enabling the easy access and contribution of weed image data on an open source platform with search, dynamic filter and preview functions for custom dataset download capability.

Data supported

To support the largest number of use cases and the unique demands of SSWC technology development, we have developed a standard for storing weed images and their annotations. Our standard - WeedCOCO - is an extension of Microsoft's Common Objects in Context format (MS COCO). WeedCOCO incorporates additional whole-dataset contextual information that describes the agricultural context as well as details of how the images were captured. This "AgContext" includes:

  • Crop type

  • Crop growth stage (text and BBCH)

  • Soil colour

  • Surface coverage

  • Weather description

  • Location

  • Camera metadata (camera model, collection height, angle, lens, focal length, field of view)

  • Lighting

The format may also be applicable to related agricultural purposes. As with MS COCO, the format supports classification, bounding box and segmentation labels indicating the presence of a specific or unspecified type of weed, expressed at species or other taxonomic level. Reporting these details will help ensure consistency in published datasets for ease of comparison and use in further research and development.
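
To make the shape of this metadata concrete, a hypothetical AgContext record might look like the following. Field names and values here are illustrative examples only, not the normative WeedCOCO schema:

# A hypothetical AgContext record; field names are illustrative.
agcontext = {
    "id": 0,
    "crop_type": "wheat",
    "bbch_growth_range": {"min": 12, "max": 20},  # text stages map to BBCH codes
    "soil_colour": "dark_brown",
    "surface_cover": "none",
    "weather_description": "sunny, light wind",
    "location_lat": -30.5,
    "location_long": 145.8,
    "camera_make": "Canon EOS 60D",
    "camera_height": 1000,           # assumed unit: mm above the ground
    "camera_angle": 90,              # degrees from horizontal (i.e. nadir)
    "camera_lens_focallength": 50,   # mm
    "camera_fov": 40,                # degrees, or "variable"
    "lighting": "natural",
}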

Weed-AI Data Flow

Acknowledgements

This project has been funded by the Grains Research and Development Corporation. The platform was developed by the Sydney Informatics Hub, a core research facility of the University of Sydney, as part of a research collaboration with the Australian Centre for Field Robotics and the Precision Weed Control Group at the University of Sydney.

We make use of data from EPPO to validate and cross-reference plant species information, in accordance with the EPPO Codes Open Data Licence.

Citation Guidelines

General

If you found Weed-AI useful in your research or project, please cite the database as:

Weed-AI: A repository of Weed Images in Crops. Precision Weed Control Group and Sydney Informatics Hub, the University of Sydney. https://weed-ai.sydney.edu.au/, accessed YYYY-MM-DD.

An academic citation is TBA.

Specific Datasets

Each set of imagery used within the database should also be cited with the correct database Digital Object Identifier (DOI) and relevant papers.

weed-ai's Issues

validate multiple json files as valid Weed COCO

Currently the validation workflow only really applies to specific files. This is probably fine if we just want to use it to validate a schema. But in the future, would we want to validate any number of JSON files, and perhaps recursively? Using glob, as you had suggested, to get all relevant files for validation might be a good idea in that case (see the sketch below). There is a glob in the GitHub Actions toolkit, but I don't know if we will need to make our own version of the JSON validation action to make use of it.
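
A minimal sketch of how that could look, assuming a jsonschema-based validator and a schema file named weedcoco_schema.json (both names hypothetical):

import json
from glob import glob

import jsonschema  # pip install jsonschema

with open("weedcoco_schema.json") as f:  # hypothetical schema path
    SCHEMA = json.load(f)

def validate_tree(root):
    """Validate every JSON file under `root`, recursively."""
    errors = {}
    for path in glob(f"{root}/**/*.json", recursive=True):
        with open(path) as f:
            try:
                jsonschema.validate(json.load(f), SCHEMA)
            except jsonschema.ValidationError as exc:
                errors[path] = exc.message
    return errors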

Ingestion Pipeline

Given a dataset in WeedCOCO, we need an ETL pipeline to ingest and register it safely into a repository, and index it in Elastic.

Ingestion process needs to:

  • Validate that the input is of the required format
  • Hash each image with SHA512 and store the hash in the image blob
  • Add a unique identifier to each category (e.g. based on its role and
    species)
  • Group together all annotations for an image and together assign them
    a unique identifier
    • This unique "Annotated Image ID" should hash together:
      • The image being annotated
      • The masks or bounding boxes (in some deterministic order,
        and with a deterministic encoding of masks)
      • The corresponding categories' globally unique IDs
  • Assign each annotated image a set of task labels for what they can
    be used for, e.g. classification, bounding box, segmentation,
    instance segmentation
  • Assign the collection an arbitrary globally unique ID
  • Check that if any of the images are already known to the database,
    they use agcontexts with global IDs already known to the database
  • If any of the images are not already known to the database, they
    must either reference an agcontext with a known global ID (that is
    consistent with what's in the database) or a new ID must be
    generated for each new agcontext.
  • Downsize the image for display in search results and store on
    website static file server
    • also prepare masks for display in search results
  • Store the full-size image on dataset static file server
  • Store the prepared dataset in the repository
  • Generate and upload Elastic Search docs for each modified
    "Annotated Image ID" (reincluding any collection information
    from other datasets containing that Annotated Image ID)
    • Under this model, we could store everything in the repository
      in one giant Weed-COCO, at the risk of O(n) latency for many
      operations. Alternatively, we could "explode" the Weed-COCO
      such that core entities in the repository are Annotated
      Images, Collections, AgContexts, etc.
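
A sketch of the deterministic "Annotated Image ID" hashing described above; the helper name and payload layout are assumptions, not settled design:

import hashlib
import json

def annotated_image_id(image_sha512, annotations, category_ids):
    """Hash an image together with its annotations, deterministically.

    image_sha512: hex digest of the raw image bytes
    annotations: list of mask/bbox dicts for this image
    category_ids: globally unique IDs of the corresponding categories
    """
    payload = {
        "image": image_sha512,
        # sort_keys plus sorting the encoded annotations gives a
        # deterministic order and a deterministic encoding
        "annotations": sorted(json.dumps(a, sort_keys=True) for a in annotations),
        "categories": sorted(category_ids),
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha512(blob).hexdigest()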

Frontend: second draft

  • Improve UI
  • Change faceting/filtering behaviour to be "and" rather than "or"
  • Change URL based on search params
  • Add mapping widget

Anything else @jnothman ?

Handle folder and path consistently in importers

Should "file_name" in WeedCOCO be a relative path, or be a URL? It certainly can't be an on-disk absolute path. Or do we just always assume that the files are passed with the collection being imported, and then the ingestion pipeline can rename the file? In that case, should we keep an "original_file_name" field?

Harvest image attributes from EXIF

The CameraTraps code for extracting info from image EXIF (e.g. width) doesn't seem to work. Their code looks very similar to the top Stack Overflow answer for "python image exif info". I've tried tweaking it to get it to work, but the issues relate to some general weirdness surrounding file paths in this converter.
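
As an alternative to the CameraTraps code, a minimal Pillow-based sketch for pulling EXIF attributes (assumes Pillow >= 6 for Image.getexif):

from PIL import ExifTags, Image  # pip install Pillow

def read_image_attributes(path):
    """Return human-readable EXIF tags plus the image dimensions."""
    with Image.open(path) as img:
        exif = img.getexif()
        tags = {ExifTags.TAGS.get(tag_id, tag_id): value
                for tag_id, value in exif.items()}
        # Width/height come from Pillow itself, so they are available
        # even when the EXIF block is missing or stripped.
        tags["width"], tags["height"] = img.size
    return tags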

Create validation script for valid interchange data

We should create a library/script that determines whether a dataset is reasonably valid and sufficient for import into our database. JSONSchema can be used to define structural requirements. Integrity constraints, such as requiring that *_id and id elements correspond, need to be checked separately.
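
For the integrity side, a sketch of the referential checks that JSONSchema cannot express (field names follow the MS COCO layout):

def check_referential_integrity(coco):
    """Check that every *_id matches the id of an existing entity."""
    image_ids = {img["id"] for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    problems = []
    for ann in coco["annotations"]:
        if ann["image_id"] not in image_ids:
            problems.append(f"annotation {ann['id']}: unknown image_id")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']}: unknown category_id")
    return problems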

Custom dataset export through frontend

I don't know how to classify this issue, but I realise that we don't really need a concatenation tool at the ingestion level. Instead we can use a "shopping cart"-like feature to assemble a set of images in ES. There are heaps of options for shopping-cart-like tools; however, we will then want an ES -> WeedCOCO annotation exporter function somewhere.

Ease contribution by pulling Collection and AgContext out of the Coco

The idea would be that a new dataset for ingestion consists of:

  • one or more COCO files. The things that make them WeedId compatible are that:
    • the categories have sufficient specification to be mapped to one of our central category definitions (e.g. they have role and some recognised species identifier).
    • the info blob must specify an AgContext name that all the images in that COCO belong to
    • the info blob may also specify that it is a specific named subset (e.g. "train" or "test")
  • one file conforming to the schema.org/Dataset metadata specification. There are a few tools that can help to generate this file.
  • one or more AgContext files, each with a distinct name that is used to reference the agcontext within COCO infos.
  • image files

So something like:

|- dataset.json
|- agcontexts/
| |- illuminated.json
| |- daylight.json
|- cocos/
| |- coco-illuminated-train.json
| |- coco-illuminated-test.json
| |- coco-daylight-train.json
| |- coco-daylight-test.json
|- images/
| |- img01.jpeg
| |- img02.jpeg
| |- ...

This allows the contributor to use more pre-existing tools, and avoids convoluted constructs like collection_memberships.
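
For example, the info blob of cocos/coco-illuminated-train.json might read as follows (key names are illustrative, not settled):

info = {
    "description": "coco-illuminated-train",
    "agcontext_name": "illuminated",  # refers to agcontexts/illuminated.json
    "subset": "train",                # optional named subset
}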

Converter for dataset with images and segmentation mask

One dataset on RDS is a bunch of images and a binary segmentation mask. Add an agcontext json file and a collection json file. Given paths to all these, we need to be able to convert to Weed-COCO, with a category definition for the masked area.
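
A sketch of the per-image conversion step, assuming pycocotools for the RLE encoding (function and argument names are illustrative):

import numpy as np
from PIL import Image
from pycocotools import mask as mask_utils  # pip install pycocotools

def mask_to_annotation(mask_path, image_id, category_id, annotation_id):
    """Convert one binary mask image into a COCO-style annotation."""
    binary = (np.asarray(Image.open(mask_path).convert("L")) > 0).astype(np.uint8)
    rle = mask_utils.encode(np.asfortranarray(binary))
    area = float(mask_utils.area(rle))
    bbox = mask_utils.toBbox(rle).tolist()  # [x, y, width, height]
    rle["counts"] = rle["counts"].decode("ascii")  # make it JSON-serialisable
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,  # the category defined for the masked area
        "segmentation": rle,
        "area": area,
        "bbox": bbox,
        "iscrowd": 0,
    }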

Add HTTP Basic authentication

There are several tutorials for HTTP Basic auth on React apps.
Passwords should not be stored in a git-tracked file. Instead they could be stored with Docker Secrets.
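
On the server side, credentials could then be read from Docker Secrets at startup rather than from anything git-tracked; a minimal sketch (the secret names are hypothetical):

from pathlib import Path

# Docker mounts each secret as a file under /run/secrets/.
BASIC_AUTH_USER = Path("/run/secrets/basic_auth_user").read_text().strip()
BASIC_AUTH_PASSWORD = Path("/run/secrets/basic_auth_password").read_text().strip()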

Search results summary

When searching, users should see a paginated list of images with masks shown, as well as some basic summary details like crop and weed species.

Represent the notion of fixed dataset splits

Datasets including deepweeds (Olsen et al) and cwfid (Haug et al) are distributed with predefined train/test splits. We should consider representing this information in our interchange format.

Adopt extensions to agcontext from Asher

@asherbender proposed being able to record the following attributes as context of data collection:

#-------------------------------------------------------------------------------
#                                 institution
#-------------------------------------------------------------------------------

# Description: Name of organisation/institution that collected the data.
# Requirement: Mandatory.
# Type:        string
institution_name:

# Description: Name of project that data was collected under.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#institution_project:

# Description: Name of organisation/institution that funded the data collection.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#institution_funding:

# Description: Name of individuals responsible for collecting the data.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:     anonymous
#institution_authors:

# Description: Description of organisation/institution/individuals that
#              collected the data.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#institution_notes:

#-------------------------------------------------------------------------------
#                                    system
#-------------------------------------------------------------------------------

# Description: Name of data collection system.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#system_name:

# Description: Description of data collection system.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#system_description:

# Description: Operational space where data collection system operates.
# Requirement: Mandatory.
# Type:        enum
# Options:     ['marine', 'terrestrial', 'aerial', 'space']
system_theatre:

# Description: Mobility of data collection system. Static data collection
#              systems remain stationary and collect data in only one location.
#              Mobile data collection systems are capable of locomotion and can
#              collect data from multiple locations.
# Requirement: Mandatory.
# Type:        enum
# Options:     ['static', 'mobile']
system_mobility:

# Description: Method of controlling the data collection platform. Handheld
#              control are systems that are physically carried by a human
#              operator (e.g. digital camera). Direct control are systems that
#              are controlled by a human operator on or within the platform
#              (e.g. driving a tractor). Remote control are systems that are
#              controlled by a human operator outside the platform (e.g.
#              manually piloting a drone). Autonomous control are systems that
#              operate without human assistance (e.g. autonomous vehicle).
# Requirement: Mandatory.
# Type:        enum
# Options:     ['handheld', 'direct', 'remote', 'autonomous']
system_control:

#-------------------------------------------------------------------------------
#                                   location
#-------------------------------------------------------------------------------

# Description: Country where data was collected.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#location_country:

# Description: State/province where data was collected.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#location_state:

# Description: City where data was collected.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#location_city:

# Description: Latitude in decimal degrees where data was collected.
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Decimal degrees (DD)
#location_latitude:

# Description: Longitude in decimal degrees where data was collected.
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Decimal degrees (DD)
#location_longitude:

# Description: Spatial reference system of latitude/longitude expressed as an
#              EPSG code (e.g. WGS 84 is 4326).
# Requirement: Optional (uncomment and edit).
# Type:        int
# Default:     0
#location_epsg:

#-------------------------------------------------------------------------------
#                                   weather
#-------------------------------------------------------------------------------

# Description: Average temperature during data collection in degrees Celsius.
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Degrees Celsius (C)
#weather_temperature:

# Description: Average relative humidity during data collection as a percentage.
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Percentage (%)
#weather_humidity:

# Description: Average barometric pressure during data collection in
#              hectopascals (hPa).
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Hectopascal (hPa)
#weather_pressure:

# Description: Subjective cloud cover during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        enum
# Options:     ['unknown', 'none', 'light', 'intermediate', 'heavy']
# Default:     unknown
#weather_cloud_cover:

# Description: Subjective atmospheric clarity during data collection. Note that
#              this condition is distinct from cloud cover. Atmospheric clarity
#              covers conditions such as pollution or haze due to bushfires.
# Requirement: Optional (uncomment and edit).
# Type:        enum
# Options:     ['unknown', 'clear', 'intermediate haze', 'heavy haze']
# Default:     unknown
#weather_atmospheric_clarity:

# Description: Average rainfall during data collection in millimetres (mm).
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Millimetres (mm)
#weather_rainfall:

# Description: Subjective rain conditions during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        enum
# Options:     ['unknown', 'none', 'light', 'intermediate', 'heavy']
# Default:     unknown
#weather_rain_qualitative:

# Description: Average wind speed during data collection in metres per second
#              (m/s).
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Metres per second (m/s)
#weather_wind_speed:

# Description: Average wind direction during data collection in degrees from
#              North.
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Degrees from north
#weather_wind_direction:

# Description: Subjective wind conditions during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        enum
# Options:     ['unknown', 'none', 'light', 'intermediate', 'heavy']
# Default:     unknown
#weather_wind_qualitative:

# Description: True if dew was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        bool
# Default:     False
#weather_dew:

# Description: True if frost was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        bool
# Default:     False
#weather_frost:

# Description: True if mist or fog was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        bool
# Default:     False
#weather_mist:

# Description: True if snow was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        bool
# Default:     False
#weather_snow:

# Description: Description of weather conditions during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#weather_notes:

Add a tool to concatenate datasets

Not sure how useful this is in practice, but it might be a good exercise to demonstrate the structure of the objects and to demonstrate pointer swizzling/unswizzling (see the sketch below).
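
A sketch of the re-keying such a tool would need, using explicit old-to-new ID maps so that cross-references (the "pointers") survive concatenation:

def concatenate(cocos):
    """Concatenate COCO-style dicts, re-keying ids so references stay valid."""
    out = {"images": [], "annotations": [], "categories": []}
    for coco in cocos:
        image_map, category_map = {}, {}
        for img in coco["images"]:
            image_map[img["id"]] = len(out["images"])
            out["images"].append({**img, "id": image_map[img["id"]]})
        for cat in coco["categories"]:
            # A real tool should deduplicate equivalent categories here.
            category_map[cat["id"]] = len(out["categories"])
            out["categories"].append({**cat, "id": category_map[cat["id"]]})
        for ann in coco["annotations"]:
            out["annotations"].append({
                **ann,
                "id": len(out["annotations"]),
                "image_id": image_map[ann["image_id"]],
                "category_id": category_map[ann["category_id"]],
            })
    return out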

Should we use queryFormat: "or" for multiselect categories?

If an image has multiple values for some field, we might want "and"; otherwise it's useless. Users will certainly want to be able to express "or" across values for a field. So if we don't use queryFormat: "or", we will need to provide some kind of advanced search interface to combine queries by disjunction. On the other hand, queryFormat: "and" behaves more intuitively.

Image view

Shows an image and its masks (still low-res), licence info, the agcontexts it belongs to, and the datasets it appears in.

Convert a third dataset

Develop a Python tool for converting another dataset's annotations into WeedCOCO. Ideally we will use one of Guy's or Asher's datasets.

react-scripts not found

Issues with react-scripts have resurfaced. ReactiveSearch doesn't work on master currently; this is because react-scripts is not found when running docker-compose up.

Currently trialling various fixes in the Dockerfile and/or docker-compose.

Attaching to search_kibana_1, elasticsearch, search_reactivesearch_1, dejavu
reactivesearch_1  | 
reactivesearch_1  | > [email protected] start /code
reactivesearch_1  | > react-scripts start
reactivesearch_1  | 
reactivesearch_1  | sh: 1: react-scripts: not found
reactivesearch_1  | npm ERR! code ELIFECYCLE
reactivesearch_1  | npm ERR! syscall spawn
reactivesearch_1  | npm ERR! file sh
reactivesearch_1  | npm ERR! errno ENOENT
reactivesearch_1  | npm ERR! [email protected] start: `react-scripts start`
reactivesearch_1  | npm ERR! spawn ENOENT
reactivesearch_1  | npm ERR! 
reactivesearch_1  | npm ERR! Failed at the [email protected] start script.
reactivesearch_1  | npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
reactivesearch_1  | 
reactivesearch_1  | npm ERR! A complete log of this run can be found in:
reactivesearch_1  | npm ERR!     /root/.npm/_logs/2020-09-04T03_59_05_412Z-debug.log
search_reactivesearch_1 exited with code 1

Engineering for extendable datasets

I've made some changes to try to support a hypothetical future where users can filter, search, and combine multiple datasets. Basically, for dataset-level info, I have added dataset identifiers into images, and made the "agadata" object a set of "agdataset" objects. The naming should probably be reviewed to make sure it makes sense, because I know this can get confusing.

Create list of possible categories

COCO datasets have classifications that are broken down into supercategories and the categories within them. For example, "weed" might be a supercategory and "rye grass" a category of weed.

Categories are set at the dataset level; for now it might be good to assemble a list of possible categories. This should be reviewed by Guy.
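
In COCO terms, the list might start out something like this (entries are illustrative, pending Guy's review):

# Illustrative category list in COCO style; naming convention assumed.
categories = [
    {"id": 0, "supercategory": "weed", "name": "weed: lolium rigidum"},
    {"id": 1, "supercategory": "weed", "name": "weed: rapistrum rugosum"},
    {"id": 2, "supercategory": "crop", "name": "crop: triticum aestivum"},
]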

Support search with different variant growth stage descriptors

Per the matrix sent by Guy & Michael, there are many ways to refer to the same stage of crop growth.

We have two choices about how to handle this:

  1. Store the most specific label for an agcontext in the ES index, then provide a frontend widget that allows aliases and converts them into the indexed form for querying.
  2. Expand the label given in the agcontext when the data is indexed, and index the list of alternative growth stage labels in each image. Then these become labels that will be auto-suggested for category search.
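
Option 2 amounts to a simple alias expansion at indexing time; a sketch with made-up entries standing in for Guy & Michael's matrix:

# Hypothetical alias table; the real entries would come from the matrix.
GROWTH_STAGE_ALIASES = {
    "bbch_13": ["3 leaves unfolded", "seedling", "three leaf"],
    "bbch_21": ["beginning of tillering", "tillering"],
}

def expand_growth_stage(label):
    """Return the indexed label plus all of its known aliases."""
    return [label] + GROWTH_STAGE_ALIASES.get(label, [])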

Search & faceting widgets

A core feature we need is the ability to search and facet data. Any changes to the query, whether through searching or faceting, should be reflected in the page's URL, and URLs should be usable to recover queries.

Will need an explicit schema on ElasticSearch; or change "variable" and "na" into NULL

ES seems to have done some type sniffing, which is a problem for a field like camera_fov: it gets detected as a numeric type, and then a later dataset might say "variable" and break ingestion. The other option might be to store "variable" as null in the index, thus avoiding the problem (currently implemented in #49).

https://github.com/kristianmandrup/json-schema-to-es-mapping might help. I've only found it, not looked at it.
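
Alternatively, a sketch of creating the index with an explicit mapping via the official Python client, so camera_fov is indexed as a keyword and later string values can't break it (the index name and field path are assumptions):

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="weedid",  # hypothetical index name
    body={
        "mappings": {
            "properties": {
                "agcontext": {
                    "properties": {
                        # keyword accepts "40", "variable" and "na" alike
                        "camera_fov": {"type": "keyword"},
                    }
                }
            }
        }
    },
)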

Don't show images/masks of test-set samples in UI

I don't know if other ML websites do this, but I think it would be nice to be able to switch off showing held-out samples in search results, whether by removing them from the results entirely or by hiding their images.

It might be worth thinking with other colleagues (or collaborators from Zhiyong's team) about how they would want train/test distinctions represented in a repository.
