weed-ai / weed-ai
A repository to support the development of a repository and interchange format for weed identification annotation
Home Page: https://weed-ai.sydney.edu.au/
License: MIT License
I think at the same time we should make their ingestion code .py rather than .ipynb.
One dataset on RDS is a bunch of images and a binary segmentation mask. Add an agcontext json file and a collection json file. Given paths to all these, we need to be able to convert to Weed-COCO, with a category definition for the masked area.
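A minimal sketch of that conversion is below; the assumption that each mask shares its image's filename, the single category name, and the output keys are illustrative rather than the project's actual converter (proper RLE mask encoding is left out).

```python
# Minimal sketch: images + binary masks + AgContext JSON + collection JSON ->
# a Weed-COCO-style blob. Assumes each mask has the same filename as its image
# and records only a bounding box and pixel area (real masks would be
# RLE-encoded, e.g. with pycocotools).
import json
from pathlib import Path

import numpy as np
from PIL import Image


def masks_to_weedcoco(image_dir, mask_dir, agcontext_path, collection_path,
                      category_name="weed"):
    agcontext = json.loads(Path(agcontext_path).read_text())
    collection = json.loads(Path(collection_path).read_text())
    images, annotations = [], []
    for image_id, img_path in enumerate(sorted(Path(image_dir).glob("*"))):
        width, height = Image.open(img_path).size
        images.append({"id": image_id, "file_name": img_path.name,
                       "width": width, "height": height, "agcontext_id": 0})
        mask = np.array(Image.open(Path(mask_dir) / img_path.name).convert("L")) > 0
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue  # no masked area in this image
        x0, y0 = int(xs.min()), int(ys.min())
        annotations.append({"id": len(annotations), "image_id": image_id,
                            "category_id": 0,
                            "bbox": [x0, y0, int(xs.max()) - x0 + 1, int(ys.max()) - y0 + 1],
                            "area": int(mask.sum())})
    return {"images": images, "annotations": annotations,
            "categories": [{"id": 0, "name": category_name}],
            "agcontexts": [dict(agcontext, id=0)],
            "collections": [collection]}
```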
ES seems to have done some type sniffing, which is a problem for something like camera_fov, since it is detected as a numeric type, and then a later dataset might say "variable" and it breaks. The other option might be to store "variable" as null in the index, thus avoiding the problem (currently implemented in #49).
https://github.com/kristianmandrup/json-schema-to-es-mapping might help. I've only found it, not looked at it.
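In the meantime, one way to sidestep the sniffing is to declare an explicit mapping before indexing anything, so camera_fov is always treated as a keyword. A rough sketch with the elasticsearch Python client; the index name and field path are assumptions:

```python
# Sketch: create the index with an explicit mapping so camera_fov is stored as a
# keyword (string) and never sniffed as numeric. The index name "weedcoco" and
# the field path under "agcontext" are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(
    index="weedcoco",
    body={
        "mappings": {
            "properties": {
                "agcontext": {
                    "properties": {
                        # accepts both "57" and "variable" without breaking
                        "camera_fov": {"type": "keyword"}
                    }
                }
            }
        }
    },
)
```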
schema AgContext.json is invalid
error: schema is invalid: data.properties['location_datum'].enum should be array
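For reference, the validator is objecting because enum must be an array of allowed values rather than a bare scalar. A small illustration with the jsonschema package (the specific EPSG codes are examples only):

```python
# "enum" must be an array of allowed values, not a scalar. The values below are
# illustrative only.
import jsonschema

bad = {"type": "object", "properties": {"location_datum": {"enum": 4326}}}
good = {"type": "object", "properties": {"location_datum": {"enum": [4326, 4258]}}}

try:
    jsonschema.Draft7Validator.check_schema(bad)
except jsonschema.SchemaError as exc:
    print("invalid schema:", exc.message)  # reports that the enum is not an array

jsonschema.Draft7Validator.check_schema(good)        # passes
jsonschema.validate({"location_datum": 4326}, good)  # instance validates too
```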
Given a dataset in Weed-COCO, we need an ETL pipeline to ingest and register it safely into a repository, and index it in Elasticsearch.
The ingestion process needs to:
Should "file_name" in WeedCOCO be a relative path, or be a URL? It certainly can't be an on-disk absolute path. Or do we just always assume that the files are passed with the collection being imported, and then the ingestion pipeline can rename the file? In that case, should we keep an "original_file_name" field?
Not trying to build a generic, longstanding tool here... just hacking pieces together for demo.
Without masks for now. Masks later.
Ideally any .py files should pass flake8. Any .json files should parse as json.
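The JSON half of that check is easy to script for CI; a sketch (the repository layout is an assumption):

```python
# CI-style check that every .json file in the repository at least parses.
# Walking from the repo root is an assumption about layout.
import json
import sys
from pathlib import Path

failures = []
for path in Path(".").rglob("*.json"):
    try:
        json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        failures.append(f"{path}: {exc}")

if failures:
    print("\n".join(failures))
    sys.exit(1)
```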
I don't know if other ML web sites do this, but I think it's a nice idea to be able to switch off showing held-out samples as search results, whether to remove them from the search results or to hide their images.
It might be worth thinking with other colleagues (or collaborators from Zhiyong's team) about how they would want train/test distinctions represented in a repository.
Currently the validation workflow only really applies to specific files. This probably is fine if we just want to use it to validate a schema. But in the future, would we want to validate any number of JSON files, and perhaps recursively? Using glob, as you had suggested, to get all relevant files for validation might be a good idea in that case. There is a glob in the github actions toolkit, but I don't know if we will need to make our own version of the JSON validation action to make use of this.
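If we end up rolling our own rather than extending the GitHub Action, the glob-based version is only a few lines of Python; the schema location and glob pattern here are assumptions:

```python
# Validate every matching JSON file against a schema, recursively. The schema
# path "schema/AgContext.json" and the "agcontexts" directory are assumptions.
import json
from pathlib import Path

import jsonschema

schema = json.loads(Path("schema/AgContext.json").read_text())
for path in sorted(Path("agcontexts").rglob("*.json")):
    jsonschema.validate(json.loads(path.read_text()), schema)
    print(f"{path}: OK")
```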
The idea would be that a new dataset for ingestion consists of a dataset.json, one or more AgContext JSON files, one or more COCO JSON files (with categories named by a role and some recognised species identifier), and the image files themselves. Each COCO's info blob must specify an AgContext name that all the images in that COCO belong to; the info blob may also specify that it is a specific named subset (e.g. "train" or "test"). So something like:
|- dataset.json
|- agcontexts/
| |- illuminated.json
| |- daylight.json
|- cocos/
| |- coco-illuminated-train.json
| |- coco-illuminated-test.json
| |- coco-daylight-train.json
| |- coco-daylight-test.json
|- images/
| |- img01.jpeg
| |- img02.jpeg
| |- ...
This allows the contributor to use more pre-existing tools, and avoids convoluted things like collection_memberships.
Category names should take the format {role}: {species}. This avoids having two categories with the same name but different roles, and makes some data entry and processing simpler.
COCO datasets have classifications that are broken down into super categories and categories within them. For example, "weed" might be a super category and "rye grass" a category of weed.
Categories are set at the dataset level; for now it might be good to assemble a list of possible categories. This should be reviewed by Guy.
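Putting the naming convention and the supercategory structure together, category entries might look like the following; the ids and the pairing of species to roles are illustrative only:

```python
# Illustrative COCO-style category entries using the "{role}: {species}" naming
# convention, with the role doubling as the supercategory. Ids and species are
# examples only.
categories = [
    {"id": 0, "name": "weed: rye grass", "supercategory": "weed"},
    {"id": 1, "name": "crop: avena sativa", "supercategory": "crop"},
]
```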
Rather, store a test sample for each (even with fake images), and convert to JSON in CI.
When searching, users should see a paginated list of images with masks shown, as well as some basic summary details like crop and weed species.
@asherbender proposed being able to record the following attributes as context of data collection:
#-------------------------------------------------------------------------------
# institution
#-------------------------------------------------------------------------------
# Description: Name of organisation/institution that collected the data.
# Requirement: Mandatory.
# Type: string
institution_name:
# Description: Name of project that data was collected under.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#institution_project:
# Description: Name of organisation/institution that funded the data collection.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#institution_funding:
# Description: Name of individuals responsible for collecting the data.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default: anonymous
#institution_authors:
# Description: Description of organisation/institution/individuals that
# collected the data.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#institution_notes:
#-------------------------------------------------------------------------------
# system
#-------------------------------------------------------------------------------
# Description: Name of data collection system.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#system_name:
# Description: Description of data collection system.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#system_description:
# Description: Operational space where data collection system operates.
# Requirement: Mandatory.
# Type: enum
# Options: ['marine', 'terrestrial', 'aerial', 'space']
system_theatre:
# Description: Mobility of data collection system. Static data collection
# systems remain stationary and collect data in only one location.
# Mobile data collection systems are capable of locomotion and can
# collect data from multiple locations.
# Requirement: Mandatory.
# Type: enum
# Options: ['static', 'mobile']
system_mobility:
# Description: Method of controlling the data collection platform. Handheld
# control are systems that are physically carried by a human
# operator (e.g. digital camera). Direct control are systems that
# are controlled by a human operator on or within the platform
# (e.g. driving a tractor). Remote control are systems that are
# controlled by a human operator outside the platform (e.g.
# manually piloting a drone). Autonomous control are systems that
# operate without human assistance (e.g. autonomous vehicle).
# Requirement: Mandatory.
# Type: enum
# Options: ['handheld', 'direct', 'remote', 'autonomous']
system_control:
#-------------------------------------------------------------------------------
# location
#-------------------------------------------------------------------------------
# Description: Country where data was collected.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#location_country:
# Description: State/province where data was collected.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#location_state:
# Description: City where data was collected.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#location_city:
# Description: Latitude in decimal degrees where data was collected.
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Decimal degrees (DD)
#location_latitude:
# Description: Longitude in decimal degrees where data was collected.
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Decimal degrees (DD)
#location_longitude:
# Description: Spatial reference system of latitude/longitude expressed as an
# EPSG code (e.g. WGS 84 is 4326).
# Requirement: Optional (uncomment and edit).
# Type: int
# Default: 0
#location_epsg:
#-------------------------------------------------------------------------------
# weather
#-------------------------------------------------------------------------------
# Description: Average temperature during data collection in degrees Celsius.
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Degrees Celsius (C)
#weather_temperature:
# Description: Average relative humidity during data collection as a percentage.
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Percentage (%)
#weather_humidity:
# Description: Average barometric pressure during data collection in
# hectopascals (hPa).
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Hectopascal (hPa)
#weather_pressure:
# Description: Subjective cloud cover during data collection.
# Requirement: Optional (uncomment and edit).
# Type: enum
# Options: ['unknown', 'none', 'light', 'intermediate', 'heavy']
# Default: unknown
#weather_cloud_cover:
# Description: Subjective atmospheric clarity during data collection. Note that
# this condition is distinct from cloud cover. Atmospheric clarity
# covers conditions such as pollution or haze due to bushfires.
# Requirement: Optional (uncomment and edit).
# Type: enum
# Options: ['unknown', 'clear', 'intermediate haze', 'heavy haze']
# Default: unknown
#weather_atmospheric_clarity:
# Description: Average rainfall during data collection in millimetres (mm).
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Millimetres (mm)
#weather_rainfall:
# Description: Subjective rain conditions during data collection.
# Requirement: Optional (uncomment and edit).
# Type: enum
# Options: ['unknown', 'none', 'light', 'intermediate', 'heavy']
# Default: unknown
#weather_rain_qualitative:
# Description: Average wind speed during data collection in metres per second
# (m/s).
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Metres per second (m/s)
#weather_wind_speed:
# Description: Average wind direction during data collection in degrees from
# North.
# Requirement: Optional (uncomment and edit).
# Type: float
# Default: nan
# Unit: Degrees from north
#weather_wind_direction:
# Description: Subjective wind conditions during data collection.
# Requirement: Optional (uncomment and edit).
# Type: enum
# Options: ['unknown', 'none', 'light', 'intermediate', 'heavy']
# Default: unknown
#weather_wind_qualitative:
# Description: True if dew was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type: bool
# Default: False
#weather_dew:
# Description: True if frost was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type: bool
# Default: False
#weather_frost:
# Description: True if mist or fog was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type: bool
# Default: False
#weather_mist:
# Description: True if snow was present during data collection.
# Requirement: Optional (uncomment and edit).
# Type: bool
# Default: False
#weather_snow:
# Description: Description of weather conditions during data collection.
# Requirement: Optional (uncomment and edit).
# Type: string
# Default:
#weather_notes:
Currently restricted to grains... so we need at least a free-text alternative for "other", e.g. an other_crop_type field.
pattern: "^gs[0-9][0-9]$|^na$|^(?!.*gs00)"
What is ^(?!.*gs00) meant to do? As far as I can tell, it should match almost any string.
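A quick experiment backs this up: because a negative lookahead consumes nothing, the third alternative succeeds on any string that simply doesn't contain "gs00".

```python
# Demonstrates why "^(?!.*gs00)" makes the pattern match almost anything: the
# negative lookahead consumes no characters, so that alternative matches any
# string not containing "gs00".
import re

pattern = re.compile(r"^gs[0-9][0-9]$|^na$|^(?!.*gs00)")

for text in ["gs12", "na", "banana", "", "gs00", "gs001", "before gs00 after"]:
    print(repr(text), bool(pattern.match(text)))
# Everything matches except "gs001" and "before gs00 after"; "gs00" itself still
# matches via the first alternative.
```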
Add schema/Category.yaml and link it into schema/main.yaml.
Create a document describing some of the changes we have made from COCO, the limitations of COCO, etc.
After #17
It could also constrain the allowed location_datum, and for now it could be required to be 4326.
If an image has multiple values for some field, we might want "and", otherwise it's useless. Users will certainly want to be able to express "or" across values for a field. So if we don't use queryFormat: "or", we will need to provide some kind of advanced search interface to combine queries by disjunction. On the other hand, queryFormat: "and" behaves more intuitively.
Options include: nir_file_name as well as file_name in COCO.

A core feature we need is the ability to search and facet data. Any changes in the query, either through searching or faceting, should affect the page's URL. URLs should be usable for recovering queries.
The CameraTraps code for extracting info from image EXIF (e.g. width) doesn't seem to work. Their code looks very similar to the top stackoverflow answer re "python image exif info". I've tried to mess with it some to get it to work, but the issues relate to some of the general weirdness surrounding file paths in this convertor.
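For what it's worth, if width/height is all we need, Pillow reads it from the image header without going through EXIF at all; a sketch (the images directory and extension are assumptions):

```python
# Read image dimensions directly with Pillow, avoiding EXIF parsing entirely.
# The "images" directory and .jpeg extension are assumptions.
from pathlib import Path

from PIL import Image


def image_dimensions(path):
    with Image.open(path) as im:
        return im.width, im.height


for img_path in sorted(Path("images").glob("*.jpeg")):
    width, height = image_dimensions(img_path)
    print(img_path.name, width, height)
```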
DCMI is the standard for describing creative works and their provenance, e.g. datasets. It would be appropriate to ascribe images and/or annotations to originating datasets using this standard.
Add schema yamls for all parts of the weedcoco blob.
MSCOCO can support multiple computer vision tasks, and has unique annotations for each. To what extent are these annotations interoperable?
Issues with react-scripts have resurfaced. Reactivesearch doesn't work with master currently; this is because react-scripts are not found when running docker-compose up.
Currently trialing various attempts to fix this in the Dockerfile and/or docker-compose.
Attaching to search_kibana_1, elasticsearch, search_reactivesearch_1, dejavu
reactivesearch_1 |
reactivesearch_1 | > [email protected] start /code
reactivesearch_1 | > react-scripts start
reactivesearch_1 |
reactivesearch_1 | sh: 1: react-scripts: not found
reactivesearch_1 | npm ERR! code ELIFECYCLE
reactivesearch_1 | npm ERR! syscall spawn
reactivesearch_1 | npm ERR! file sh
reactivesearch_1 | npm ERR! errno ENOENT
reactivesearch_1 | npm ERR! [email protected] start: `react-scripts start`
reactivesearch_1 | npm ERR! spawn ENOENT
reactivesearch_1 | npm ERR!
reactivesearch_1 | npm ERR! Failed at the [email protected] start script.
reactivesearch_1 | npm ERR! This is probably not a problem with npm. There is likely additional logging output above.
reactivesearch_1 |
reactivesearch_1 | npm ERR! A complete log of this run can be found in:
reactivesearch_1 | npm ERR! /root/.npm/_logs/2020-09-04T03_59_05_412Z-debug.log
search_reactivesearch_1 exited with code 1
Relatively self-explanatory; users need to be able to export
Datasets including deepweeds (Olsen et al) and cwfid (Haug et al) are distributed with predefined train/test splits. We should consider representing this information in our interchange format.
shows an image & masks (still lo-res), licence info, agcontexts it's in, datasets it's in.
Currently we are hard coding most aspects of category objects, however in the future it may be useful to try to automatically fill in information. For example, if we have the scientific name of a plant species, we can then easily query the EPPO API: https://data.eppo.int/documentation/rest#collapse1
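A hypothetical sketch of such a lookup is below; the endpoint path, query parameters and response handling are assumptions and should be checked against the linked EPPO documentation before use.

```python
# Hypothetical lookup of a scientific name against the EPPO REST API. The
# endpoint path and parameters are assumptions; verify them against
# https://data.eppo.int/documentation/rest (an authtoken is required).
import requests


def eppo_search(scientific_name, authtoken):
    resp = requests.get(
        "https://data.eppo.int/api/rest/1.0/tools/search",
        params={"kw": scientific_name, "authtoken": authtoken},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```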
There are several tutorials for HTTP Basic auth on React apps.
Passwords should not be stored in a git-tracked file. Instead they could be stored with Docker Secrets.
Anything else @jnothman ?
Create and/or adapt a convertor to turn a bounding box dataset into COCO weeds JSON.
E.g. expand "crop: avena sativa" to "crop: grasses" and "crop".
I don't know how to classify this issue, but I realise that we don't really need a concatenation tool at the ingestion level. Instead we can use a "shopping cart"-like feature to assemble a set of images in ES. There are heaps of options for shopping-cart-like tools; however, we will then want to include an ES -> WeedCOCO annotation exporter function somewhere.
So we should be producing for each image one or more "annotation_types" labels. We should be able to infer these from the annotations associated with each image.
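A minimal sketch of that inference is below; the label strings ("segmentation", "bounding box", "image label") are placeholders for whatever vocabulary we settle on.

```python
# Infer per-image annotation_types labels from the annotations attached to each
# image. The label strings here are placeholders, not a fixed vocabulary.
from collections import defaultdict


def infer_annotation_types(annotations):
    types_by_image = defaultdict(set)
    for ann in annotations:
        if ann.get("segmentation"):
            types_by_image[ann["image_id"]].add("segmentation")
        elif ann.get("bbox"):
            types_by_image[ann["image_id"]].add("bounding box")
        else:
            types_by_image[ann["image_id"]].add("image label")
    return {image_id: sorted(labels) for image_id, labels in types_by_image.items()}
```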
I've made some changes to try to support a hypothetical future where users can filter, search, and combine multiple datasets. This basically means that for dataset-level info, I have added identifiers for datasets in images, and made it so the "agadata" object is actually a set of "agdataset" objects. Naming should probably be reviewed to make sure it makes sense, because I know this can get confusing.
Per the matrix sent by Guy & Michael, there are many ways to refer to the same stage of crop growth.
We have two choices about how to handle this:
We should create a library/script that determines whether a dataset is reasonably valid and sufficient for import into our database. JSONSchema can be used to define structural requirements. Integrity constraints, such that *_id and id elements correspond, need to be checked separately.
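For the integrity side, a sketch of the kind of check meant here; it only covers the plain COCO keys, and extending it to agcontext_id and friends would follow the same pattern:

```python
# Referential-integrity sketch: every *_id in an annotation must point at an
# existing id in the corresponding section. Only plain COCO keys are covered;
# Weed-COCO-specific keys would be checked the same way.
def check_integrity(coco):
    errors = []
    image_ids = {img["id"] for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    for ann in coco.get("annotations", []):
        if ann["image_id"] not in image_ids:
            errors.append(f"annotation {ann['id']}: unknown image_id {ann['image_id']}")
        if ann["category_id"] not in category_ids:
            errors.append(f"annotation {ann['id']}: unknown category_id {ann['category_id']}")
    return errors
```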
Datasets on RDS are a bunch of images and VOC XML files, plus an agcontext json file and a collection json file. Given paths to all these, we need to be able to convert to Weed-COCO.
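A rough sketch of the bounding-box part of that conversion, using the standard Pascal VOC XML layout; the category id assignment and the omission of AgContext/collection wiring are simplifications:

```python
# Rough sketch of converting Pascal VOC XML annotations into COCO-style images,
# annotations and categories. Category id assignment is simplified and the
# AgContext/collection JSON would still need to be merged in afterwards.
import xml.etree.ElementTree as ET
from pathlib import Path


def voc_to_coco(xml_dir):
    images, annotations, category_ids = [], [], {}
    for image_id, xml_path in enumerate(sorted(Path(xml_dir).glob("*.xml"))):
        root = ET.parse(xml_path).getroot()
        size = root.find("size")
        images.append({"id": image_id, "file_name": root.findtext("filename"),
                       "width": int(size.findtext("width")),
                       "height": int(size.findtext("height"))})
        for obj in root.findall("object"):
            name = obj.findtext("name")
            category_id = category_ids.setdefault(name, len(category_ids))
            box = obj.find("bndbox")
            xmin, ymin = float(box.findtext("xmin")), float(box.findtext("ymin"))
            xmax, ymax = float(box.findtext("xmax")), float(box.findtext("ymax"))
            annotations.append({"id": len(annotations), "image_id": image_id,
                                "category_id": category_id,
                                "bbox": [xmin, ymin, xmax - xmin, ymax - ymin],
                                "area": (xmax - xmin) * (ymax - ymin)})
    categories = [{"id": cid, "name": name} for name, cid in category_ids.items()]
    return {"images": images, "annotations": annotations, "categories": categories}
```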
Not sure how useful this is in practice, but might be a good exercise to demonstrate the structure of the objects, and demonstrate pointer swizzling/unswizzling.
shows agcontexts, links to search results
Develop a Python tool for converting another dataset's annotations into weedcoco. Ideally we will use one of Guy's or Asher's datasets.