weed-ai / weed-ai
A repository to support the development of a repository and interchange format for weed identification annotation
Home Page: https://weed-ai.sydney.edu.au/
License: MIT License
I think at the same time we should make their ingestion code .py rather than .ipynb.
One dataset on RDS is a bunch of images and a binary segmentation mask. Add an agcontext json file and a collection json file. Given paths to all these, we need to be able to convert to Weed-COCO, with a category definition for the masked area.
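A minimal sketch of that conversion is below; the assumption that each mask shares its image's filename, the single category name, and the output keys are illustrative rather than the project's actual converter (proper RLE mask encoding is left out).

```python
# Minimal sketch: images + binary masks + AgContext JSON + collection JSON ->
# a Weed-COCO-style blob. Assumes each mask has the same filename as its image
# and records only a bounding box and pixel area (real masks would be
# RLE-encoded, e.g. with pycocotools).
import json
from pathlib import Path

import numpy as np
from PIL import Image


def masks_to_weedcoco(image_dir, mask_dir, agcontext_path, collection_path,
                      category_name="weed"):
    agcontext = json.loads(Path(agcontext_path).read_text())
    collection = json.loads(Path(collection_path).read_text())
    images, annotations = [], []
    for image_id, img_path in enumerate(sorted(Path(image_dir).glob("*"))):
        width, height = Image.open(img_path).size
        images.append({"id": image_id, "file_name": img_path.name,
                       "width": width, "height": height, "agcontext_id": 0})
        mask = np.array(Image.open(Path(mask_dir) / img_path.name).convert("L")) > 0
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue  # no masked area in this image
        x0, y0 = int(xs.min()), int(ys.min())
        annotations.append({"id": len(annotations), "image_id": image_id,
                            "category_id": 0,
                            "bbox": [x0, y0, int(xs.max()) - x0 + 1, int(ys.max()) - y0 + 1],
                            "area": int(mask.sum())})
    return {"images": images, "annotations": annotations,
            "categories": [{"id": 0, "name": category_name}],
            "agcontexts": [dict(agcontext, id=0)],
            "collections": [collection]}
```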
ES seems to have done some type sniffing, which is a problem for something like camera_fov, since it is detected as a numeric type, and then a later dataset might say "variable" and it breaks. The other option might be to store "variable" as null in the index, thus avoiding the problem (currently implemented in #49).
https://github.com/kristianmandrup/json-schema-to-es-mapping might help. I've only found it, not looked at it.
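In the meantime, one way to sidestep the sniffing is to declare an explicit mapping before indexing anything, so camera_fov is always treated as a keyword. A rough sketch with the elasticsearch Python client; the index name and field path are assumptions:

```python
# Sketch: create the index with an explicit mapping so camera_fov is stored as a
# keyword (string) and never sniffed as numeric. The index name "weedcoco" and
# the field path under "agcontext" are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(
    index="weedcoco",
    body={
        "mappings": {
            "properties": {
                "agcontext": {
                    "properties": {
                        # accepts both "57" and "variable" without breaking
                        "camera_fov": {"type": "keyword"}
                    }
                }
            }
        }
    },
)
```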
schema AgContext.json is invalid
error: schema is invalid: data.properties['location_datum'].enum should be array
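For reference, the validator is objecting because enum must be an array of allowed values rather than a bare scalar. A small illustration with the jsonschema package (the specific EPSG codes are examples only):

```python
# "enum" must be an array of allowed values, not a scalar. The values below are
# illustrative only.
import jsonschema

bad = {"type": "object", "properties": {"location_datum": {"enum": 4326}}}
good = {"type": "object", "properties": {"location_datum": {"enum": [4326, 4258]}}}

try:
    jsonschema.Draft7Validator.check_schema(bad)
except jsonschema.SchemaError as exc:
    print("invalid schema:", exc.message)  # reports that the enum is not an array

jsonschema.Draft7Validator.check_schema(good)        # passes
jsonschema.validate({"location_datum": 4326}, good)  # instance validates too
```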
Given a dataset in Weed-COCO, we need an ETL pipeline to ingest and register it safely into a repository, and index it in Elasticsearch.
The ingestion process needs to:
Should "file_name" in WeedCOCO be a relative path, or be a URL? It certainly can't be an on-disk absolute path. Or do we just always assume that the files are passed with the collection being imported, and then the ingestion pipeline can rename the file? In that case, should we keep an "original_file_name" field?
Not trying to build a generic, longstanding tool here... just hacking pieces together for demo.
Without masks for now. Masks later.
Ideally any .py files should pass flake8. Any .json files should parse as json.
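The JSON half of that check is easy to script for CI; a sketch (the repository layout is an assumption):

```python
# CI-style check that every .json file in the repository at least parses.
# Walking from the repo root is an assumption about layout.
import json
import sys
from pathlib import Path

failures = []
for path in Path(".").rglob("*.json"):
    try:
        json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        failures.append(f"{path}: {exc}")

if failures:
    print("\n".join(failures))
    sys.exit(1)
```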
I don't know if other ML web sites do this, but I think it's a nice idea to be able to switch off showing held-out samples as search results, whether to remove them from the search results or to hide their images.
It might be worth thinking with other colleagues (or collaborators from Zhiyong's team) about how they would want train/test distinctions represented in a repository.
Currently the validation workflow only really applies to specific files. This probably is fine if we just want to use it to validate a schema. But in the future, would we want to validate any number of JSON files, and perhaps recursively? Using glob, as you had suggested, to get all relevant files for validation might be a good idea in that case. There is a glob in the github actions toolkit, but I don't know if we will need to make our own version of the JSON validation action to make use of this.
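If we end up rolling our own rather than extending the GitHub Action, the glob-based version is only a few lines of Python; the schema location and glob pattern here are assumptions:

```python
# Validate every matching JSON file against a schema, recursively. The schema
# path "schema/AgContext.json" and the "agcontexts" directory are assumptions.
import json
from pathlib import Path

import jsonschema

schema = json.loads(Path("schema/AgContext.json").read_text())
for path in sorted(Path("agcontexts").rglob("*.json")):
    jsonschema.validate(json.loads(path.read_text()), schema)
    print(f"{path}: OK")
```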
The idea would be that a new dataset for ingestion consists of a dataset.json, one or more AgContext JSON files, one or more COCO JSON files (with categories named by a role and some recognised species identifier), and the image files themselves. Each COCO's info blob must specify an AgContext name that all the images in that COCO belong to; the info blob may also specify that it is a specific named subset (e.g. "train" or "test"). So something like:
|- dataset.json
|- agcontexts/
| |- illuminated.json
| |- daylight.json
|- cocos/
| |- coco-illuminated-train.json
| |- coco-illuminated-test.json
| |- coco-daylight-train.json
| |- coco-daylight-test.json
|- images/
| |- img01.jpeg
| |- img02.jpeg
| |- ...
This allows the contributor to use more pre-existing tools, and avoids convoluted things like collection_memberships.
Category names should take the format {role}: {species}. This avoids having two categories with the same name but different roles, and makes some data entry and processing simpler.
COCO datasets have classifications that are broken down into super categories and categories within them. For example, "weed" might be a super category and "rye grass" a category of weed.
Categories are set at the dataset level; for now it might be good to assemble a list of possible categories. This should be reviewed by Guy.
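Putting the naming convention and the supercategory structure together, category entries might look like the following; the ids and the pairing of species to roles are illustrative only:

```python
# Illustrative COCO-style category entries using the "{role}: {species}" naming
# convention, with the role doubling as the supercategory. Ids and species are
# examples only.
categories = [
    {"id": 0, "name": "weed: rye grass", "supercategory": "weed"},
    {"id": 1, "name": "crop: avena sativa", "supercategory": "crop"},
]
```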
Rather, store a test sample for each (even with fake images), and convert to JSON in CI.
When searching, users should see a paginated list of images with masks shown, as well as some basic summary details like crop and weed species.
@asherbender proposed being able to record the following attributes as context of data collection:
#-------------------------------------------------------------------------------
# institution
#-------------------------------------------------------------------------------
# Description: Name of organisation/institution that collected the data.
# Requirement: Mandatory.
# Type: string
institution_name:
# Description: Name of project that data was collected under.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#institution_project:
# Description: Name of organisation/institution that funded the data collection.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#institution_funding:
# Description: Name of individuals responsible for collecting the data.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default: anonymous
#institution_authors:
# Description: Description of organisation/institution/individuals that
# collected the data.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#institution_notes:
#-------------------------------------------------------------------------------
# system
#-------------------------------------------------------------------------------
# Description: Name of data collection system.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#system_name:
# Description: Description of data collection system.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#system_description:
# Description: Operational space where data collection system operates.
# Requirement: Mandatory.
# Type: enum
# Options: ['marine', 'terrestrial', 'aerial', 'space']
system_theatre:
# Description: Mobility of data collection system. Static data collection
# systems remain stationary and collect data in only one location.
# Mobile data collection systems are capable of locomotion and can
# collect data from multiple locations.
# Requirement: Mandatory.
# Type: enum
# Options: ['static', 'mobile']
system_mobility:
# Description: Method of controlling the data collection platform. Handheld
# control are systems that are physically carried by a human
# operator (e.g. digital camera). Direct control are systems that
# are controlled by a human operator on or within the platform
# (e.g. driving a tractor). Remote control are systems that are
# controlled by a human operator outside the platform (e.g.
# manually piloting a drone). Autonomous control are systems that
# operate without human assistance (e.g. autonomous vehicle).
# Requirement: Mandatory.
# Type: enum
# Options: ['handheld', 'direct', 'remote', 'autonomous']
system_control:
#-------------------------------------------------------------------------------
# location
#-------------------------------------------------------------------------------
# Description: Country where data was collected.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#location_country:
# Description: State/province where data was collected.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#location_state:
# Description: City where data was collected.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#location_city:
# Description: Latitude in decimal degrees where data was collected.
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Decimal degrees (DD)
#location_latitude:
# Description: Longitude in decimal degrees where data was collected.
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Decimal degrees (DD)
#location_longitude:
# Description: Spatial reference system of latitude/longitude expressed as an
# EPSG code (e.g. WGS 84 is 4326).
# Requirement: Optional (uncomment and edit).
# Type: int
# Default: 0
#location_epsg:
#-------------------------------------------------------------------------------
# weather
#-------------------------------------------------------------------------------
# Description: Average temperature during data collection in degrees Celsius.
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Degrees Celsius (C)
#weather_temperature:
# Description: Average relative humidity during data collection as a percentage.
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Percentage (%)
#weather_humidity:
# Description: Average barometric pressure during data collection in
# hectopascals (hPa).
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Hectopascal (hPa)
#weather_pressure:
# Description: Subjective cloud cover during data collection.
# Requirement: Optional (uncomment and edit).
# Type: enum
# Options: ['unknown', 'none', 'light', 'intermediate', 'heavy']
# Default: unknown
#weather_cloud_cover:
# Description: Subjective atmospheric clarity during data collection. Note that
# this condition is distinct from cloud cover. Atmospheric clarity
# covers conditions such as pollution or haze due to bushfires.
# Requirement: Optional (uncomment and edit).
# Type: enum
# Options: ['unknown', 'clear', 'intermediate haze', 'heavy haze']
# Default: unknown
#weather_atmospheric_clarity:
# Description: Average rainfall during data collection in millimetres (mm).
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Millimetres (mm)
#weather_rainfall:
# Description: Subjective rain conditions during data collection.
# Requirement: Optional (uncomment and edit).
# Type: enum
# Options: ['unknown', 'none', 'light', 'intermediate', 'heavy']
# Default: unknown
#weather_rain_qualitative:
# Description: Average wind speed during data collection in metres per second
# (m/s).
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Metres per second (m/s)
#weather_wind_speed:
# Description: Average wind direction during data collection in degrees from
# North.
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Degrees from north
#weather_wind_direction:
# Description: Subjective wind conditions during data collection.
# Requirement: Optional (uncomment and edit).
# Type: enum
# Options: ['unknown', 'none', 'light', 'intermediate', 'heavy']
# Default: unknown
#weather_wind_qualitative:
# Description: True if dew was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type: bool
# Default: False
#weather_dew:
# Description: True if frost was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type: bool
# Default: False
#weather_frost:
# Description: True if mist or fog was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type: bool
# Default: False
#weather_mist:
# Description: True if snow was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type: bool
# Default: False
#weather_snow:
# Description: Description of weather conditions during data collection.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#weather_notes:
Currently restricted to grains... so we need at least a free-text alternative for "other", e.g. an other_crop_type field.
pattern: "^gs[0-9][0-9]$|^na$|^(?!.*gs00)"
What is ^(?!.*gs00) meant to do? As far as I can tell, it should match almost any string.
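A quick experiment backs this up: because a negative lookahead consumes nothing, the third alternative succeeds on any string that simply doesn't contain "gs00".

```python
# Demonstrates why "^(?!.*gs00)" makes the pattern match almost anything: the
# negative lookahead consumes no characters, so that alternative matches any
# string not containing "gs00".
import re

pattern = re.compile(r"^gs[0-9][0-9]$|^na$|^(?!.*gs00)")

for text in ["gs12", "na", "banana", "", "gs00", "gs001", "before gs00 after"]:
    print(repr(text), bool(pattern.match(text)))
# Everything matches except "gs001" and "before gs00 after"; "gs00" itself still
# matches via the first alternative.
```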
Add schema/Category.yaml and link it into schema/main.yaml.
Create a document describing some of the changes we have made from COCO, the limitations of COCO, etc.
After #17
It could also constrain the allowed location_datum, and for now it could be required to be 4326.
If an image has multiple values for some field, we might want "and", otherwise it's useless. Users will certainly want to be able to express "or" across values for a field. So if we don't use queryFormat: "or", we will need to provide some kind of advanced search interface to combine queries by disjunction. On the other hand, queryFormat: "and" behaves more intuitively.
Options include: nir_file_name as well as file_name in COCO.

A core feature we need is the ability to search and facet data. Any changes in the query, either through searching or faceting, should affect the page's URL. URLs should be usable for recovering queries.
The CameraTraps code for extracting info from image EXIF (e.g. width) doesn't seem to work. Their code looks very similar to the top stackoverflow answer re "python image exif info". I've tried to mess with it some to get it to work, but the issues relate to some of the general weirdness surrounding file paths in this convertor.
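For what it's worth, if width/height is all we need, Pillow reads it from the image header without going through EXIF at all; a sketch (the images directory and extension are assumptions):

```python
# Read image dimensions directly with Pillow, avoiding EXIF parsing entirely.
# The "images" directory and .jpeg extension are assumptions.
from pathlib import Path

from PIL import Image


def image_dimensions(path):
    with Image.open(path) as im:
        return im.width, im.height


for img_path in sorted(Path("images").glob("*.jpeg")):
    width, height = image_dimensions(img_path)
    print(img_path.name, width, height)
```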
DCMI is the standard for describing creative works and their provenance, e.g. datasets. It would be appropriate to ascribe images and/or annotations to originating datasets using this standard.
Add schema yamls for all parts of the weedcoco blob.
MSCOCO can support multiple computer vision tasks, and has unique annotations for each. To what extent are these annotations interoperable?
Issues with react-scripts have resurfaced. Reactivesearch doesn't work with master currently; this is because react-scripts are not found when running docker-compose up.
Currently trialing various attempts to fix this in the Dockerfile and/or docker-compose.
Attaching to search_kibana_1, elasticsearch, search_reactivesearch_1, dejavu
reactivesearch_1 |
reactivesearch_1 | > [email protected] start /code
reactivesearch_1 | > react-scripts start
reactivesearch_1 |
reactivesearch_1 | sh: 1: react-scripts: not found
reactivesearch_1 | npm ERR! code ELIFECYCLE
reactivesearch_1 | npm ERR! syscall spawn
reactivesearch_1 | npm ERR! file sh
reactivesearch_1 | npm ERR! errno ENOENT
reactivesearch_1 | npm ERR! [email protected] start: `react-scripts start`
reactivesearch_1 | npm ERR! spawn ENOENT
reactivesearch_1 | npm ERR!
reactivesearch_1 | npm ERR! Failed at the [email protected] start script.
reactivesearch_1 | npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
reactivesearch_1 |
reactivesearch_1 | npm ERR! A complete log of this run can be found in:
reactivesearch_1 | npm ERR! /root/.npm/_logs/2020-09-04T03_59_05_412Z-debug.log
search_reactivesearch_1 exited with code 1
Relatively self-explanatory; users need to be able to export
Datasets including deepweeds (Olsen et al) and cwfid (Haug et al) are distributed with predefined train/test splits. We should consider representing this information in our interchange format.
shows an image & masks (still lo-res), licence info, agcontexts it's in, datasets it's in.
Currently we are hard coding most aspects of category objects, however in the future it may be useful to try to automatically fill in information. For example, if we have the scientific name of a plant species, we can then easily query the EPPO API: https://data.eppo.int/documentation/rest#collapse1
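A hypothetical sketch of such a lookup is below; the endpoint path, query parameters and response handling are assumptions and should be checked against the linked EPPO documentation before use.

```python
# Hypothetical lookup of a scientific name against the EPPO REST API. The
# endpoint path and parameters are assumptions; verify them against
# https://data.eppo.int/documentation/rest (an authtoken is required).
import requests


def eppo_search(scientific_name, authtoken):
    resp = requests.get(
        "https://data.eppo.int/api/rest/1.0/tools/search",
        params={"kw": scientific_name, "authtoken": authtoken},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```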
There are several tutorials for HTTP Basic auth on React apps.
Passwords should not be stored in a git-tracked file. Instead they could be stored with Docker Secrets.
Anything else @jnothman ?
Create and/or adapt a convertor to turn a bounding box dataset into COCO weeds JSON.
E.g. expand "crop: avena sativa" to "crop: grasses" and "crop".
I don't know how to classify this issue, but I realise that we don't really need a concatenation tool at the ingestion level. Instead we can use a "shopping cart"-like feature to assemble a set of images in ES. There are heaps of options for shopping-cart-like tools; however, we will then want to include an ES -> WeedCOCO annotation exporter function somewhere.
So we should be producing for each image one or more "annotation_types" labels. We should be able to infer these from the annotations associated with each image.
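A minimal sketch of that inference is below; the label strings ("segmentation", "bounding box", "image label") are placeholders for whatever vocabulary we settle on.

```python
# Infer per-image annotation_types labels from the annotations attached to each
# image. The label strings here are placeholders, not a fixed vocabulary.
from collections import defaultdict


def infer_annotation_types(annotations):
    types_by_image = defaultdict(set)
    for ann in annotations:
        if ann.get("segmentation"):
            types_by_image[ann["image_id"]].add("segmentation")
        elif ann.get("bbox"):
            types_by_image[ann["image_id"]].add("bounding box")
        else:
            types_by_image[ann["image_id"]].add("image label")
    return {image_id: sorted(labels) for image_id, labels in types_by_image.items()}
```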
I've made some changes to try to support a hypothetical future where users can filter, search, and combine multiple datasets. This basically means that for dataset-level info, I have added identifiers for datasets in images, and made it so the "agadata" object is actually a set of "agdataset" objects. Naming should probably be reviewed to make sure it makes sense, because I know this can get confusing.
Per the matrix sent by Guy & Michael, there are many ways to refer to the same stage of crop growth.
We have two choices about how to handle this:
We should create a library/script that determines whether a dataset is reasonably valid and sufficient for import into our database. JSONSchema can be used to define structural requirements. Integrity constraints, such that *_id and id elements correspond, need to be checked separately.
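For the integrity side, a sketch of the kind of check meant here; it only covers the plain COCO keys, and extending it to agcontext_id and friends would follow the same pattern:

```python
# Referential-integrity sketch: every *_id in an annotation must point at an
# existing id in the corresponding section. Only plain COCO keys are covered;
# Weed-COCO-specific keys would be checked the same way.
def check_integrity(coco):
    errors = []
    image_ids = {img["id"] for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    for ann in coco.get("annotations", []):
        if ann["image_id"] not in image_ids:
            errors.append(f"annotation {ann['id']}: unknown image_id {ann['image_id']}")
        if ann["category_id"] not in category_ids:
            errors.append(f"annotation {ann['id']}: unknown category_id {ann['category_id']}")
    return errors
```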
Datasets on RDS are a bunch of images and VOC XML files, plus an agcontext json file and a collection json file. Given paths to all these, we need to be able to convert to Weed-COCO.
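A rough sketch of the bounding-box part of that conversion, using the standard Pascal VOC XML layout; the category id assignment and the omission of AgContext/collection wiring are simplifications:

```python
# Rough sketch of converting Pascal VOC XML annotations into COCO-style images,
# annotations and categories. Category id assignment is simplified and the
# AgContext/collection JSON would still need to be merged in afterwards.
import xml.etree.ElementTree as ET
from pathlib import Path


def voc_to_coco(xml_dir):
    images, annotations, category_ids = [], [], {}
    for image_id, xml_path in enumerate(sorted(Path(xml_dir).glob("*.xml"))):
        root = ET.parse(xml_path).getroot()
        size = root.find("size")
        images.append({"id": image_id, "file_name": root.findtext("filename"),
                       "width": int(size.findtext("width")),
                       "height": int(size.findtext("height"))})
        for obj in root.findall("object"):
            name = obj.findtext("name")
            category_id = category_ids.setdefault(name, len(category_ids))
            box = obj.find("bndbox")
            xmin, ymin = float(box.findtext("xmin")), float(box.findtext("ymin"))
            xmax, ymax = float(box.findtext("xmax")), float(box.findtext("ymax"))
            annotations.append({"id": len(annotations), "image_id": image_id,
                                "category_id": category_id,
                                "bbox": [xmin, ymin, xmax - xmin, ymax - ymin],
                                "area": (xmax - xmin) * (ymax - ymin)})
    categories = [{"id": cid, "name": name} for name, cid in category_ids.items()]
    return {"images": images, "annotations": annotations, "categories": categories}
```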
Not sure how useful this is in practice, but might be a good exercise to demonstrate the structure of the objects, and demonstrate pointer swizzling/unswizzling.
shows agcontexts, links to search results
Develop a Python tool for converting another dataset's annotations into weedcoco. Ideally we will use one of Guy's or Asher's datasets.