weed-ai's Introduction

Welcome to Weed-AI

Weed-AI is an open-source, searchable weed image platform designed to facilitate the research and development of machine learning algorithms for weed recognition in cropping systems. It brings together existing datasets, enables users to contribute their own data, and assembles custom datasets for straightforward download.

See our Weed Explorer at https://weed-ai.sydney.edu.au

Train YOLOv5 with Weed-AI Datasets

Follow the Google Colab notebook to train your own YOLOv5 models with Weed-AI datasets.

Background

Large numbers of high quality, annotated weed images are essential for the development of weed recognition algorithms that are accurate and reliable in complex biological systems. Accurate weed recognition enables the use of site-specific weed control (SSWC) in agricultural systems, eliminating the need for wasteful whole field treatments. This approach substantially reduces weed control inputs and creates opportunities for the introduction of alternative weed control technologies that were not previously feasible for use as indiscriminate whole field treatments. SSWC relies on accurate detection (is a weed present) and identification (what is the species/further information on morphology) of weeds in agricultural and environmental systems (crop, pastures, rangelands and non-crop areas, etc.). Camera-based weed recognition using deep learning algorithms has emerged as a frontrunner for in-crop site-specific control with an improved ability to handle variation.

Training and development of algorithms require significant quantities of high-quality, annotated images. Weed-AI is addressing this challenge by enabling the easy access and contribution of weed image data on an open source platform with search, dynamic filter and preview functions for custom dataset download capability.

Data supported

To support the largest number of use cases and the unique demands of SSWC technology development, we have developed a standard for storing weed images and their annotations. Our standard - WeedCOCO - is an extension of Microsoft's Common Objects in Context format (MS COCO). WeedCOCO incorporates additional whole-dataset contextual information that describes the agricultural context as well as details of how the images were captured. This "AgContext" includes:

  • Crop type

  • Crop growth stage (text and BBCH)

  • Soil colour

  • Surface coverage

  • Weather description

  • Location

  • Camera metadata (camera model, collection height, angle, lens, focal length, field of view)

  • Lighting

The format may also be applicable to related agricultural purposes. As with MS COCO, the format supports classification, bounding box and segmentation labels indicating the presence of a specific or unspecified type of weed, expressed at species or other taxonomic level. Reporting these details will help ensure consistency in published datasets for ease of comparison and use in further research and development.
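
To make the shape of this metadata concrete, a hypothetical AgContext record might look like the following. Field names and values here are illustrative examples only, not the normative WeedCOCO schema:

# A hypothetical AgContext record; field names are illustrative.
agcontext = {
    "id": 0,
    "crop_type": "wheat",
    "bbch_growth_range": {"min": 12, "max": 20},  # text stages map to BBCH codes
    "soil_colour": "dark_brown",
    "surface_cover": "none",
    "weather_description": "sunny, light wind",
    "location_lat": -30.5,
    "location_long": 145.8,
    "camera_make": "Canon EOS 60D",
    "camera_height": 1000,           # assumed unit: mm above the ground
    "camera_angle": 90,              # degrees from horizontal (i.e. nadir)
    "camera_lens_focallength": 50,   # mm
    "camera_fov": 40,                # degrees, or "variable"
    "lighting": "natural",
}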

Weed-AI Data Flow

Acknowledgements

This project has been funded by the Grains Research and Development Corporation. The platform was developed by the Sydney Informatics Hub, a core research facility of the University of Sydney, as part of a research collaboration with the Australian Centre for Field Robotics and the Precision Weed Control Group at the University of Sydney.

We make use of data from EPPO to validate and cross-reference plant species information, in accordance with the EPPO Codes Open Data Licence.

Citation Guidelines

General

If you found Weed-AI useful in your research or project, please cite the database as:

Weed-AI: A repository of Weed Images in Crops. Precision Weed Control Group and Sydney Informatics Hub, the University of Sydney. https://weed-ai.sydney.edu.au/, accessed YYYY-MM-DD.

An academic citation is TBA.

Specific Datasets

Each set of imagery used within the database should also be cited with the correct database Digital Object Identifier (DOI) and relevant papers.

weed-ai's Issues

validate multiple json files as valid Weed COCO

Currently the validation workflow only really applies to specific files. This is probably fine if we just want to use it to validate a schema. But in the future, would we want to validate any number of JSON files, and perhaps recursively? Using glob, as you had suggested, to get all relevant files for validation might be a good idea in that case (see the sketch below). There is a glob in the GitHub Actions toolkit, but I don't know if we will need to make our own version of the JSON validation action to make use of it.
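
A minimal sketch of how that could look, assuming a jsonschema-based validator and a schema file named weedcoco_schema.json (both names hypothetical):

import json
from glob import glob

import jsonschema  # pip install jsonschema

with open("weedcoco_schema.json") as f:  # hypothetical schema path
    SCHEMA = json.load(f)

def validate_tree(root):
    """Validate every JSON file under `root`, recursively."""
    errors = {}
    for path in glob(f"{root}/**/*.json", recursive=True):
        with open(path) as f:
            try:
                jsonschema.validate(json.load(f), SCHEMA)
            except jsonschema.ValidationError as exc:
                errors[path] = exc.message
    return errors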

Ingestion Pipeline

Given a dataset in WeedCOCO, we need an ETL pipeline to ingest and register it safely into a repository, and index it in Elastic.

Ingestion process needs to:

  • Validate that the input is of the required format
  • Hash each image with SHA512 and store the hash in the image blob
  • Add a unique identifier to each category (e.g. based on its role and
    species)
  • Group together all annotations for an image and together assign them
    a unique identifier
    • This unique "Annotated Image ID" should hash together:
      • The image being annotated
      • The masks or bounding boxes (in some deterministic order,
        and with a deterministic encoding of masks)
      • The corresponding categories' globally unique IDs
  • Assign each annotated image a set of task labels for what they can
    be used for, e.g. classification, bounding box, segmentation,
    instance segmentation
  • Assign the collection an arbitrary globally unique ID
  • Check that if any of the images are already known to the database,
    they use agcontexts with global IDs already known to the database
  • If any of the images are not already known to the database, they
    must either reference an agcontext with a known global ID (that is
    consistent with what's in the database) or a new ID must be
    generated for each new agcontext.
  • Downsize the image for display in search results and store on
    website static file server
    • also prepare masks for display in search results
  • Store the full-size image on dataset static file server
  • Store the prepared dataset in the repository
  • Generate and upload Elastic Search docs for each modified
    "Annotated Image ID" (reincluding any collection information
    from other datasets containing that Annotated Image ID)
    • Under this model, we could store everything in the repository
      in one giant Weed-COCO, at the risk of O(n) latency for many
      operations. Alternatively, we could "explode" the Weed-COCO
      such that core entities in the repository are Annotated
      Images, Collections, AgContexts, etc.
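
A sketch of the deterministic "Annotated Image ID" hashing described above; the helper name and payload layout are assumptions, not settled design:

import hashlib
import json

def annotated_image_id(image_sha512, annotations, category_ids):
    """Hash an image together with its annotations, deterministically.

    image_sha512: hex digest of the raw image bytes
    annotations: list of mask/bbox dicts for this image
    category_ids: globally unique IDs of the corresponding categories
    """
    payload = {
        "image": image_sha512,
        # sort_keys plus sorting the encoded annotations gives a
        # deterministic order and a deterministic encoding
        "annotations": sorted(json.dumps(a, sort_keys=True) for a in annotations),
        "categories": sorted(category_ids),
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha512(blob).hexdigest()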

Frontend: second draft

  • Improve UI
  • Change faceting/filtering behaviour to be "and" rather than "or"
  • Change URL based on search params
  • Add mapping widget

Anything else @jnothman ?

Handle folder and path consistently in importers

Should "file_name" in WeedCOCO be a relative path, or be a URL? It certainly can't be an on-disk absolute path. Or do we just always assume that the files are passed with the collection being imported, and then the ingestion pipeline can rename the file? In that case, should we keep an "original_file_name" field?

Harvest image attributes from EXIF

The CameraTraps code for extracting info from image EXIF (e.g. width) doesn't seem to work. Their code looks very similar to the top Stack Overflow answer for "python image exif info". I've tried tweaking it to get it to work, but the issues relate to some general weirdness surrounding file paths in this converter.
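
As an alternative to the CameraTraps code, a minimal Pillow-based sketch for pulling EXIF attributes (assumes Pillow >= 6 for Image.getexif):

from PIL import ExifTags, Image  # pip install Pillow

def read_image_attributes(path):
    """Return human-readable EXIF tags plus the image dimensions."""
    with Image.open(path) as img:
        exif = img.getexif()
        tags = {ExifTags.TAGS.get(tag_id, tag_id): value
                for tag_id, value in exif.items()}
        # Width/height come from Pillow itself, so they are available
        # even when the EXIF block is missing or stripped.
        tags["width"], tags["height"] = img.size
    return tags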

Create validation script for valid interchange data

We should create a library/script that determines whether a dataset is reasonably valid and sufficient for import into our database. JSONSchema can be used to define structural requirements. Integrity constraints, such as requiring that *_id and id elements correspond, need to be checked separately.
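
For the integrity side, a sketch of the referential checks that JSONSchema cannot express (field names follow the MS COCO layout):

def check_referential_integrity(coco):
    """Check that every *_id matches the id of an existing entity."""
    image_ids = {img["id"] for img in coco["images"]}
    category_ids = {cat["id"] for cat in coco["categories"]}
    problems = []
    for ann in coco["annotations"]:
        if ann["image_id"] not in image_ids:
            problems.append(f"annotation {ann['id']}: unknown image_id")
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']}: unknown category_id")
    return problems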

Custom dataset export through frontend

I don't know how to classify this issue, but I realise that we don't really need a concatenation tool at the ingestion level. Instead we can use a "shopping cart"-like feature to assemble a set of images in ES. There are heaps of options for shopping-cart-like tools; however, we will then want an ES -> WeedCOCO annotation exporter function somewhere.

Ease contribution by pulling Collection and AgContext out of the Coco

The idea would be that a new dataset for ingestion consists of:

  • one or more COCO files. The things that make them WeedId compatible are that:
    • the categories have sufficient specification to be mapped to one of our central category definitions (e.g. they have role and some recognised species identifier).
    • the info blob must specify an AgContext name that all the images in that COCO belong to
    • the info blob may also specify that it is a specific named subset (e.g. "train" or "test")
  • one file conforming to the schema.org/Dataset metadata specification. There are a few tools that can help to generate this file.
  • one or more AgContext files, each with a distinct name that is used to reference the agcontext within COCO infos.
  • image files

So something like:

|- dataset.json
|- agcontexts/
| |- illuminated.json
| |- daylight.json
|- cocos/
| |- coco-illuminated-train.json
| |- coco-illuminated-test.json
| |- coco-daylight-train.json
| |- coco-daylight-test.json
|- images/
| |- img01.jpeg
| |- img02.jpeg
| |- ...

This allows the contributor to use more pre-existing tools, and avoids convoluted constructs like collection_memberships.
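
For example, the info blob of cocos/coco-illuminated-train.json might read as follows (key names are illustrative, not settled):

info = {
    "description": "coco-illuminated-train",
    "agcontext_name": "illuminated",  # refers to agcontexts/illuminated.json
    "subset": "train",                # optional named subset
}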

Converter for dataset with images and segmentation mask

One dataset on RDS is a bunch of images and a binary segmentation mask. Add an agcontext json file and a collection json file. Given paths to all these, we need to be able to convert to Weed-COCO, with a category definition for the masked area.
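
A sketch of the per-image conversion step, assuming pycocotools for the RLE encoding (function and argument names are illustrative):

import numpy as np
from PIL import Image
from pycocotools import mask as mask_utils  # pip install pycocotools

def mask_to_annotation(mask_path, image_id, category_id, annotation_id):
    """Convert one binary mask image into a COCO-style annotation."""
    binary = (np.asarray(Image.open(mask_path).convert("L")) > 0).astype(np.uint8)
    rle = mask_utils.encode(np.asfortranarray(binary))
    area = float(mask_utils.area(rle))
    bbox = mask_utils.toBbox(rle).tolist()  # [x, y, width, height]
    rle["counts"] = rle["counts"].decode("ascii")  # make it JSON-serialisable
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,  # the category defined for the masked area
        "segmentation": rle,
        "area": area,
        "bbox": bbox,
        "iscrowd": 0,
    }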

Add HTTP Basic authentication

There are several tutorials for HTTP Basic auth on React apps.
Passwords should not be stored in a git-tracked file. Instead they could be stored with Docker Secrets.
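
On the server side, credentials could then be read from Docker Secrets at startup rather than from anything git-tracked; a minimal sketch (the secret names are hypothetical):

from pathlib import Path

# Docker mounts each secret as a file under /run/secrets/.
BASIC_AUTH_USER = Path("/run/secrets/basic_auth_user").read_text().strip()
BASIC_AUTH_PASSWORD = Path("/run/secrets/basic_auth_password").read_text().strip()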

Search results summary

When searching, users should see a paginated list of images with masks shown, as well as some basic summary details like crop and weed species.

Represent the notion of fixed dataset splits

Datasets including deepweeds (Olsen et al) and cwfid (Haug et al) are distributed with predefined train/test splits. We should consider representing this information in our interchange format.

Adopt extensions to agcontext from Asher

@asherbender proposed being able to record the following attributes as context of data collection:

#-------------------------------------------------------------------------------
#                                 institution
#-------------------------------------------------------------------------------

# Description: Name of organisation/institution that collected the data.
# Requirement: Mandatory.
# Type:        string
institution_name:

# Description: Name of project that data was collected under.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#institution_project:

# Description: Name of organisation/institution that funded the data collection.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#institution_funding:

# Description: Name of individuals responsible for collecting the data.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:     anonymous
#institution_authors:

# Description: Description of organisation/institution/individuals that
#              collected the data.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#institution_notes:

#-------------------------------------------------------------------------------
#                                    system
#-------------------------------------------------------------------------------

# Description: Name of data collection system.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#system_name:

# Description: Description of data collection system.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#system_description:

# Description: Operational space where data collection system operates.
# Requirement: Mandatory.
# Type:        enum
# Options:     ['marine', 'terrestrial', 'aerial', 'space']
system_theatre:

# Description: Mobility of data collection system. Static data collection
#              systems remain stationary and collect data in only one location.
#              Mobile data collection systems are capable of locomotion and can
#              collect data from multiple locations.
# Requirement: Mandatory.
# Type:        enum
# Options:     ['static', 'mobile']
system_mobility:

# Description: Method of controlling the data collection platform. Handheld
#              control are systems that are physically carried by a human
#              operator (e.g. digital camera). Direct control are systems that
#              are controlled by a human operator on or within the platform
#              (e.g. driving a tractor). Remote control are systems that are
#              controlled by a human operator outside the platform (e.g.
#              manually piloting a drone). Autonomous control are systems that
#              operate without human assistance (e.g. autonomous vehicle).
# Requirement: Mandatory.
# Type:        enum
# Options:     ['handheld', 'direct', 'remote', 'autonomous']
system_control:

#-------------------------------------------------------------------------------
#                                   location
#-------------------------------------------------------------------------------

# Description: Country where data was collected.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#location_country:

# Description: State/province where data was collected.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#location_state:

# Description: City where data was collected.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#location_city:

# Description: Latitude in decimal degrees where data was collected.
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Decimal degrees (DD)
#location_latitude:

# Description: Longitude in decimal degrees where data was collected.
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Decimal degrees (DD)
#location_longitude:

# Description: Spatial reference system of latitude/longitude expressed as an
#              EPSG code (e.g. WGS 84 is 4326).
# Requirement: Optional (uncomment and edit).
# Type:        int
# Default:     0
#location_epsg:

#-------------------------------------------------------------------------------
#                                   weather
#-------------------------------------------------------------------------------

# Description: Average temperature during data collection in degrees Celsius.
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Degrees Celsius (C)
#weather_temperature:

# Description: Average relative humidity during data collection as a percentage.
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Percentage (%)
#weather_humidity:

# Description: Average barometric pressure during data collection in
#              hectopascals (hPa).
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Hectopascal (hPa)
#weather_pressure:

# Description: Subjective cloud cover during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        enum
# Options:     ['unknown', 'none', 'light', 'intermediate', 'heavy']
# Default:     unknown
#weather_cloud_cover:

# Description: Subjective atmospheric clarity during data collection. Note that
#              this condition is distinct from cloud cover. Atmospheric clarity
#              covers conditions such as pollution or haze due to bushfires.
# Requirement: Optional (uncomment and edit).
# Type:        enum
# Options:     ['unknown', 'clear', 'intermediate haze', 'heavy haze']
# Default:     unknown
#weather_atmospheric_clarity:

# Description: Average rainfall during data collection in millimetres (mm).
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Millimetres (mm)
#weather_rainfall:

# Description: Subjective rain conditions during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        enum
# Options:     ['unknown', 'none', 'light', 'intermediate', 'heavy']
# Default:     unknown
#weather_rain_qualitative:

# Description: Average wind speed during data collection in metres per second
#              (m/s).
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Metres per second (m/s)
#weather_wind_speed:

# Description: Average wind direction during data collection in degrees from
#              North.
# Requirement: Optional (uncomment and edit).
# Type:        float
# Default:     nan
# Unit:        Degrees from north
#weather_wind_direction:

# Description: Subjective wind conditions during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        enum
# Options:     ['unknown', 'none', 'light', 'intermediate', 'heavy']
# Default:     unknown
#weather_wind_qualitative:

# Description: True if dew was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        bool
# Default:     False
#weather_dew:

# Description: True if frost was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        bool
# Default:     False
#weather_frost:

# Description: True if mist or fog was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        bool
# Default:     False
#weather_mist:

# Description: True if snow was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        bool
# Default:     False
#weather_snow:

# Description: Description of weather conditions during data collection.
# Requirement: Optional (uncomment and edit).
# Type:        string
# Default:
#weather_notes:

Add a tool to concatenate datasets

Not sure how useful this is in practice, but it might be a good exercise to demonstrate the structure of the objects and to demonstrate pointer swizzling/unswizzling (see the sketch below).
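
A sketch of the re-keying such a tool would need, using explicit old-to-new ID maps so that cross-references (the "pointers") survive concatenation:

def concatenate(cocos):
    """Concatenate COCO-style dicts, re-keying ids so references stay valid."""
    out = {"images": [], "annotations": [], "categories": []}
    for coco in cocos:
        image_map, category_map = {}, {}
        for img in coco["images"]:
            image_map[img["id"]] = len(out["images"])
            out["images"].append({**img, "id": image_map[img["id"]]})
        for cat in coco["categories"]:
            # A real tool should deduplicate equivalent categories here.
            category_map[cat["id"]] = len(out["categories"])
            out["categories"].append({**cat, "id": category_map[cat["id"]]})
        for ann in coco["annotations"]:
            out["annotations"].append({
                **ann,
                "id": len(out["annotations"]),
                "image_id": image_map[ann["image_id"]],
                "category_id": category_map[ann["category_id"]],
            })
    return out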

Should we use queryFormat: "or" for multiselect categories?

If an image has multiple values for some field, we might want "and"; otherwise it's useless. Users will certainly want to be able to express "or" across values for a field. So if we don't use queryFormat: "or", we will need to provide some kind of advanced search interface to combine queries by disjunction. On the other hand, queryFormat: "and" behaves more intuitively.

Image view

Shows an image and its masks (still low-res), licence info, the agcontexts it belongs to, and the datasets it appears in.

Convert a third dataset

Develop a Python tool for converting another dataset's annotations into WeedCOCO. Ideally we will use one of Guy's or Asher's datasets.

react-scripts not found

Issues with react-scripts have resurfaced. ReactiveSearch doesn't work on master currently; this is because react-scripts is not found when running docker-compose up.

Currently trialling various fixes in the Dockerfile and/or docker-compose.

Attaching to search_kibana_1, elasticsearch, search_reactivesearch_1, dejavu
reactivesearch_1  | 
reactivesearch_1  | > [email protected] start /code
reactivesearch_1  | > react-scripts start
reactivesearch_1  | 
reactivesearch_1  | sh: 1: react-scripts: not found
reactivesearch_1  | npm ERR! code ELIFECYCLE
reactivesearch_1  | npm ERR! syscall spawn
reactivesearch_1  | npm ERR! file sh
reactivesearch_1  | npm ERR! errno ENOENT
reactivesearch_1  | npm ERR! [email protected] start: `react-scripts start`
reactivesearch_1  | npm ERR! spawn ENOENT
reactivesearch_1  | npm ERR! 
reactivesearch_1  | npm ERR! Failed at the [email protected] start script.
reactivesearch_1  | npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
reactivesearch_1  | 
reactivesearch_1  | npm ERR! A complete log of this run can be found in:
reactivesearch_1  | npm ERR!     /root/.npm/_logs/2020-09-04T03_59_05_412Z-debug.log
search_reactivesearch_1 exited with code 1

Engineering for extendable datasets

I've made some changes to try to support a hypothetical future where users can filter, search, and combine multiple datasets. Basically, for dataset-level info, I have added dataset identifiers into images, and made the "agadata" object a set of "agdataset" objects. The naming should probably be reviewed to make sure it makes sense, because I know this can get confusing.

Create list of possible categories

COCO datasets have classifications that are broken down into supercategories and the categories within them. For example, "weed" might be a supercategory and "rye grass" a category of weed.

Categories are set at the dataset level; for now it might be good to assemble a list of possible categories. This should be reviewed by Guy.
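
In COCO terms, the list might start out something like this (entries are illustrative, pending Guy's review):

# Illustrative category list in COCO style; naming convention assumed.
categories = [
    {"id": 0, "supercategory": "weed", "name": "weed: lolium rigidum"},
    {"id": 1, "supercategory": "weed", "name": "weed: rapistrum rugosum"},
    {"id": 2, "supercategory": "crop", "name": "crop: triticum aestivum"},
]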

Support search with different variant growth stage descriptors

Per the matrix sent by Guy & Michael, there are many ways to refer to the same stage of crop growth.

We have two choices about how to handle this:

  1. Store the most specific label for an agcontext in the ES index, then provide a frontend widget that allows aliases and converts them into the indexed form for querying.
  2. Expand the label given in the agcontext when the data is indexed, and index the list of alternative growth stage labels in each image. Then these become labels that will be auto-suggested for category search.
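
Option 2 amounts to a simple alias expansion at indexing time; a sketch with made-up entries standing in for Guy & Michael's matrix:

# Hypothetical alias table; the real entries would come from the matrix.
GROWTH_STAGE_ALIASES = {
    "bbch_13": ["3 leaves unfolded", "seedling", "three leaf"],
    "bbch_21": ["beginning of tillering", "tillering"],
}

def expand_growth_stage(label):
    """Return the indexed label plus all of its known aliases."""
    return [label] + GROWTH_STAGE_ALIASES.get(label, [])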

Search & faceting widgets

A core feature we need is the ability to search and facet data. Any changes to the query, whether through searching or faceting, should be reflected in the page's URL, and URLs should be usable to recover queries.

Will need an explicit schema on ElasticSearch; or change "variable" and "na" into NULL

ES seems to have done some type sniffing, which is a problem for a field like camera_fov: it gets detected as a numeric type, and then a later dataset might say "variable" and break ingestion. The other option might be to store "variable" as null in the index, thus avoiding the problem (currently implemented in #49).

https://github.com/kristianmandrup/json-schema-to-es-mapping might help. I've only found it, not looked at it.
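
Alternatively, a sketch of creating the index with an explicit mapping via the official Python client, so camera_fov is indexed as a keyword and later string values can't break it (the index name and field path are assumptions):

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="weedid",  # hypothetical index name
    body={
        "mappings": {
            "properties": {
                "agcontext": {
                    "properties": {
                        # keyword accepts "40", "variable" and "na" alike
                        "camera_fov": {"type": "keyword"},
                    }
                }
            }
        }
    },
)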

Don't show images/masks of test-set samples in UI

I don't know if other ML websites do this, but I think it would be nice to be able to switch off showing held-out samples in search results, whether by removing them from the results entirely or by hiding their images.

It might be worth thinking with other colleagues (or collaborators from Zhiyong's team) about how they would want train/test distinctions represented in a repository.
