traiter_efloras's Introduction

The eFloras Traits Project Python application

All right, what's this all about then?

Challenge: Extract trait information from plant treatments. That is, if I'm given treatment text like: (Reformatted to emphasize targeted traits.)

Treatment

I should be able to extract: (Colors correspond to the text above.)

Treatment

Terms

Essentially, we are finding relevant terms in the text (NER) and then linking them (Entity Linking). There are 5 types of terms:

  1. The traits themselves: These are things like color, size, shape, woodiness, etc. They are either a measurement, count, or a member of a controlled vocabulary.
  2. Plant parts: Things like leaves, branches, roots, seeds, etc. These have traits. So they must be linked to them.
  3. Plant subparts: Things like hairs, pores, margins, veins, etc. Leaves can have hairs and so can seeds. They also have traits and will be linked to them, but they must also be linked to a part to have any meaning.
  4. Sex: Plants exhibit sexual dimorphism, so we need to note which part/subpart/trait notation is associated with which sex.
  5. Other text: Things like conjunctions, punctuation, etc. Although they are not recorded, they are often important for parsing and linking of terms.
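
The term types above and their links can be pictured as a small data model. This is only a sketch: the names `TermType`, `Term`, `LinkedTrait`, and `find` are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TermType(Enum):
    TRAIT = "trait"
    PART = "part"
    SUBPART = "subpart"
    SEX = "sex"
    OTHER = "other"  # used while parsing, never recorded


@dataclass
class Term:
    text: str
    kind: TermType
    start: int  # character offset where the term begins
    end: int    # character offset just past the term


@dataclass
class LinkedTrait:
    """A trait linked to its part, and optionally to a subpart and a sex."""
    trait: Term
    part: Term
    subpart: Optional[Term] = None
    sex: Optional[Term] = None


def find(text: str, word: str, kind: TermType) -> Term:
    """Hypothetical helper: locate the first occurrence of a word in the text."""
    start = text.index(word)
    return Term(word, kind, start, start + len(word))


text = "Leaves: hairs white."
linked = LinkedTrait(
    trait=find(text, "white", TermType.TRAIT),
    part=find(text, "Leaves", TermType.PART),
    subpart=find(text, "hairs", TermType.SUBPART),
)
print(linked.part.text, linked.subpart.text, linked.trait.text)
```

Here the color trait is linked to a subpart (hairs), which in turn only has meaning because it is linked to a part (leaves).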

Multiple methods for parsing

  1. Rule-based parsing. Most machine learning models require a substantial training dataset. I use this method to bootstrap the training data. If machine learning methods fail, I can fall back to this.
  2. Machine learning models. (In progress)

Rule-based parsing strategy

  1. I label terms using spaCy's phrase and rule-based matchers.
  2. Then I match terms using rule-based matchers repeatedly until I have built up a recognizable trait such as color, size, or count.
  3. Finally, I associate traits with plant parts.

For example, given the text: Petiole 1-2 cm.:

  • I recognize vocabulary terms like:
    • Petiole is a plant part
    • 1 is a number
    • - is a dash
    • 2 is a number
    • cm is a unit notation
  • Then I group tokens. For instance:
    • 1-2 cm is a range with units which becomes a size trait.
  • Finally, I associate the size with the plant part Petiole by using a tree-based parser. spaCy will build a labeled sentence dependency tree. I look for patterns in the tree to link traits with plant parts.
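
The three steps in this example can be sketched in plain Python. This is a simplified stand-in, not the project's code: it uses regular expressions instead of spaCy matchers, and it links a trait to the nearest preceding plant part instead of walking a dependency tree. The vocabularies and function names are hypothetical.

```python
import re

PARTS = {"petiole", "leaf", "leaves"}  # hypothetical mini-vocabulary
UNITS = {"mm", "cm", "m"}
DASHES = {"-", "–"}

NUMBER = r"\d+(?:\.\d+)?"
TOKEN = re.compile(NUMBER + r"|[A-Za-z]+|[-–]|\S")


def label(text):
    """Step 1: tag each token with a vocabulary label."""
    labeled = []
    for tok in TOKEN.findall(text):
        if tok.lower() in PARTS:
            tag = "part"
        elif tok.lower() in UNITS:
            tag = "units"
        elif re.fullmatch(NUMBER, tok):
            tag = "number"
        elif tok in DASHES:
            tag = "dash"
        else:
            tag = "other"
        labeled.append((tok, tag))
    return labeled


def group(labeled):
    """Step 2: merge 'number dash number units' into one size trait."""
    out, i = [], 0
    while i < len(labeled):
        window = labeled[i:i + 4]
        if [tag for _, tag in window] == ["number", "dash", "number", "units"]:
            low, _, high, units = (tok for tok, _ in window)
            size = {"low": float(low), "high": float(high), "units": units}
            out.append((size, "size"))
            i += 4
        else:
            out.append(labeled[i])
            i += 1
    return out


def link(grouped):
    """Step 3 (simplified): attach each size to the nearest preceding part."""
    traits, part = [], None
    for value, tag in grouped:
        if tag == "part":
            part = value.lower()
        elif tag == "size" and part is not None:
            traits.append({"part": part, "trait": "size", **value})
    return traits


result = link(group(label("Petiole 1-2 cm.")))
print(result)
```

The nearest-preceding-part rule is only good enough for short phrases like this one; longer sentences are where the dependency tree earns its keep.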

There are, of course, complications and subtleties not outlined above, but you should get the gist of what is going on here.

Install

You will need to have Python 3.11 (or later) installed. You can install the requirements into your Python environment like so:

git clone https://github.com/rafelafrance/traiter_efloras.git
cd traiter_efloras
virtualenv -p python3.11 .venv   # optional
source .venv/bin/activate        # optional
make install

Run

./extract.py ... TODO ...

Tests

Having a test suite is absolutely critical. My strategy is that every new trait gets its own test set. Any time there is a parser error, I add the text that caused the error to the test suite and then correct the parser, i.e. I use the standard red/green testing methodology.
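
As an illustration of that workflow, here is a minimal sketch of a per-trait test module. The `parse_size` function is a hypothetical stand-in for a real trait parser, and the en dash case is an invented example of a test added after a parser error.

```python
import re
import unittest


def parse_size(text):
    """Hypothetical stand-in parser: return a size range, or None if absent."""
    match = re.search(
        r"(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)\s*(mm|cm|m)\b", text
    )
    if not match:
        return None
    low, high, units = match.groups()
    return {"low": float(low), "high": float(high), "units": units}


class TestSize(unittest.TestCase):
    def test_simple_range(self):
        self.assertEqual(
            parse_size("Petiole 1-2 cm."),
            {"low": 1.0, "high": 2.0, "units": "cm"},
        )

    def test_en_dash_range(self):
        # Red: this input once failed (invented example). Green: parser fixed.
        self.assertEqual(
            parse_size("Petiole 1–2 cm."),
            {"low": 1.0, "high": 2.0, "units": "cm"},
        )

    def test_no_size(self):
        self.assertIsNone(parse_size("Petiole glabrous."))
```

Each failing input stays in the suite as a regression test, so a later rule change cannot quietly break it again.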

You can run the tests like so:

cd /my/path/to/traiter_efloras
python -m unittest discover

traiter_efloras's Issues

Extract families

  • Anisophylleaceae? (not found)
  • Apodanthaceae
  • Barbeyaceae? (not found)
  • Begoniaceae
  • Betulaceae
  • Cannabaceae
  • Casuarinaceae
  • Coriariaceae
  • Corynocarpaceae? (not found)
  • Cucurbitaceae
  • Datiscaceae
  • Dirachmaceae? (not found)
  • Elaeagnaceae
  • Fabaceae (=Leguminosae)
  • Fagaceae
  • Juglandaceae
  • Moraceae
  • Myricaceae
  • Nothofagaceae? (not found)
  • Polygalaceae
  • Quillajaceae? (not found)
  • Rhamnaceae
  • Rosaceae
  • Saxifragaceae
  • Surianaceae? (not found)
  • Tetramelaceae
  • Ticodendraceae? (not found)
  • Ulmaceae
  • Urticaceae

Parse "hairy"

Hairy can be:

  • hairs
  • surface description
  • prefix
  • suffix
  • infix

Systematically handle hyphenated words

Right now I'm putting literal hyphens into the phrase patterns. This is unsustainable. One possible strategy:

  1. Remove the hyphenated terms from the vocabulary
  2. Hyphenate term patterns using the hyphenate Python library
  3. Add new term entries for the term with the hyphen included
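
A rough sketch of steps 2 and 3, under stated assumptions: the `SYLLABLES` table is a hand-written, hypothetical stand-in for whatever a hyphenation library would return, and the function names are invented for illustration.

```python
# Hypothetical syllable table standing in for a real hyphenation library.
SYLLABLES = {
    "trilobate": ["tri", "lo", "bate"],
    "subglobose": ["sub", "glo", "bose"],
}


def hyphenated_variants(syllables):
    """Return the word with one hyphen placed at each syllable boundary."""
    return [
        "".join(syllables[:i]) + "-" + "".join(syllables[i:])
        for i in range(1, len(syllables))
    ]


def expand_vocabulary(terms):
    """Map every spelling, plain or hyphenated, back to its base term."""
    vocab = {}
    for term in terms:
        vocab[term] = term
        # Terms missing from the table get no hyphenated variants.
        for variant in hyphenated_variants(SYLLABLES.get(term, [term])):
            vocab[variant] = term
    return vocab


vocab = expand_vocabulary(["trilobate", "subglobose"])
print(sorted(vocab))
```

With this shape the vocabulary files hold only the unhyphenated terms, and the hyphenated entries are generated when the vocabulary is loaded.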

Fix trait bounds

The start and end indices of traits are not being updated when I make compound traits. For instance, the units are no longer in the bounds of a length trait.

This may be because I'm using the trait update pipeline.
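
One way to keep the bounds right, sketched under assumptions: whenever traits are merged into a compound trait, recompute the bounds from all of the pieces. The `Span` class and `merge` function are hypothetical and do not reflect the project's pipeline code.

```python
from dataclasses import dataclass


@dataclass
class Span:
    label: str
    start: int  # character offset where the span begins
    end: int    # character offset just past the span


def merge(label, *spans):
    """Build a compound span whose bounds cover every piece."""
    return Span(
        label,
        min(span.start for span in spans),
        max(span.end for span in spans),
    )


def locate(text, piece, label):
    """Hypothetical helper: span of the first occurrence of a substring."""
    start = text.index(piece)
    return Span(label, start, start + len(piece))


text = "Petiole 1-2 cm."
number_range = locate(text, "1-2", "range")
units = locate(text, "cm", "units")
size = merge("size", number_range, units)
print(text[size.start:size.end])  # 1-2 cm
```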

Reorganize terms

Change some parts into categories of their own. Like inflorescence or flower parts.

Simplify rules

The trait rules are too complex, and the trait linking rules are even worse.

Add new traits

  • Lobes
    • Any of the following can have lobes: leaves, sepals, petals, and female floral structures (stigma, ovary). But leaves are by far the most common.
    • Capture size and shape information similar to leaves.
    • Also capture the number of lobes, e.g. 3-lobed, trilobate
    • Unlobed and entire both mean 0 lobes, in which case they cannot have shape and size
    • Watch out to separately capture shape about the leaf margin (not technically lobes, see next item)
  • Margin (of blade or lobes or petals or sepals)
    • This is one that should get its own terminology as that will help you
    • Words: crenate, dentate, undulate, sinuate, scalloped, serrate.
  • Stamens
    • Substructures: anther, filament (they can have their own size and shape)
    • Use existing shape vocabularies
    • Also capture number
  • Ovary
    • Substructures: style, stigma, carpel
    • Special position terms:
      • Superior = hypogynous
      • Inferior = epigynous
      • Perigynous
    • Capture number also, substructures can have different numbers
  • Seeds
    • Just capture length and width
  • Fruit
    • Color, size, shape
    • Pepo is a special fruit term, other special terms: capsule, berry, legume, pod, silique, follicle – record the name (special type of fruit) but otherwise treat as synonyms for fruit
  • General plant descriptors (usually beginning of treatment)
    • Perennial/annual/biennial
    • Deciduous/evergreen
  • Plant height (note – will typically be missing in Cucurbitaceae because they are vines)
    • HINT: if a measurement is given for structure “plants” or “stems” – at the beginning of a treatment – parse as height. See Ecballium elaterium in Cucurbitaceae as an example.
    • Shrub/herb/vine (synonyms: climbing, trailing)/liana/tree/caulescent/acaulescent/cespitose/caespitose/epiphytic/lithophytic/epiphyte/lithophyte/parasitic/parasite/woody/herbaceous/monocarpic/frutescent
  • General plant descriptors (not always beginning of treatment)
    • Monoecious/dioecious/androdioecious/gynodioecious/hermaphroditic/bisexual/perfect/protandrous/protogynous/unisexual
    • Symmetry descriptors
      • Zygomorphic = irregular
      • Actinomorphic = regular
    • Just find and report these terms in the entire paragraph – they apply to nothing else so no parsing needed
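
Two of the simpler items in this list can be sketched directly: lobe counts, and descriptors that are reported for the whole paragraph without any linking. This is an illustration only. The prefix table and descriptor vocabulary are hypothetical and incomplete, and treating "regular" and "irregular" as symmetry terms wherever they appear is a simplifying assumption.

```python
import re

PREFIX_COUNTS = {"uni": 1, "bi": 2, "tri": 3, "quadri": 4}  # hypothetical
ZERO_LOBE_WORDS = {"unlobed", "entire"}

# Maps each spelling to the term that gets reported (synonyms collapse).
DESCRIPTORS = {
    "monoecious": "monoecious",
    "dioecious": "dioecious",
    "zygomorphic": "zygomorphic",
    "irregular": "zygomorphic",
    "actinomorphic": "actinomorphic",
    "regular": "actinomorphic",
}


def lobe_count(word):
    """Return the number of lobes a word implies, or None if unknown."""
    low = word.lower()
    if low in ZERO_LOBE_WORDS:
        return 0
    match = re.fullmatch(r"(\d+)-lobed", low)
    if match:
        return int(match.group(1))
    match = re.fullmatch(r"([a-z]+?)lob(?:ate|ed)", low)
    if match:
        return PREFIX_COUNTS.get(match.group(1))
    return None


def paragraph_descriptors(paragraph):
    """Report each descriptor once, in order of first appearance."""
    found = []
    for word in re.findall(r"[a-z]+", paragraph.lower()):
        canonical = DESCRIPTORS.get(word)
        if canonical and canonical not in found:
            found.append(canonical)
    return found


print(lobe_count("3-lobed"), lobe_count("trilobate"), lobe_count("entire"))
print(paragraph_descriptors("Plants dioecious. Flowers irregular."))
```

A zero count is a signal to skip lobe shape and size entirely, since unlobed and entire structures cannot have either.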
