gateplugin-stringannotation's Introduction

!!!!!!!OLD gateplugin-StringAnnotation OLD!!!!!!!

IMPORTANT: this version of the plugin will not be actively maintained any more. The new, mavenized version of the plugin for GATE version 8.5 and later can be found here: https://github.com/GateNLP/gateplugin-StringAnnotation


A plugin for the GATE language technology framework that provides PRs for string annotation.

This is an updated and modified version (not backwards compatible any more) of what can be found here: https://code.google.com/p/gateplugin-stringannotation/

For more information please consult the Wiki: https://github.com/johann-petrak/gateplugin-stringannotation/wiki

To download see: https://github.com/johann-petrak/gateplugin-stringannotation/releases

Feedback: please report bugs or feature requests on the issue tracker: https://github.com/johann-petrak/gateplugin-stringannotation/issues

gateplugin-stringannotation's Issues

Investigate collapsing multiple matches into one annotation

This could be done by using priorities (e.g. only the first entry from a list, only the first list), by adding features that contain information about all matches in a list, or both.
It would be interesting to combine this with case-variation matching when case-insensitive matching is performed.
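One possible shape for this, as a rough sketch only: group all matches that cover the same span, keep the first one as the primary match (the "only first entry" priority), and retain the full list so it can be written into a feature. The class and field names below are hypothetical and are not part of the plugin.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MatchCollapser {
    /** One gazetteer hit (hypothetical structure, not the plugin's own). */
    static final class Match {
        final int start, end;
        final String listName, entry;
        Match(int start, int end, String listName, String entry) {
            this.start = start; this.end = end;
            this.listName = listName; this.entry = entry;
        }
    }

    /** All hits for one span: the first hit is the primary, the rest are kept as extra information. */
    static final class Collapsed {
        final Match primary;
        final List<Match> all = new ArrayList<>();
        Collapsed(Match primary) { this.primary = primary; }
    }

    /** Groups matches by (start, end), preserving the order in which spans were first seen. */
    static List<Collapsed> collapse(List<Match> matches) {
        Map<String, Collapsed> bySpan = new LinkedHashMap<>();
        for (Match m : matches) {
            String key = m.start + ":" + m.end;
            Collapsed c = bySpan.get(key);
            if (c == null) {
                c = new Collapsed(m);   // first match for this span wins
                bySpan.put(key, c);
            }
            c.all.add(m);
        }
        return new ArrayList<>(bySpan.values());
    }
}
```

Each Collapsed object would then become a single annotation, with the contents of `all` serialized into one or more features.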

Remove dependency on virtual corpus

Re-implement the indirect regular expression annotator so it uses the same mechanism as the ExtendedGazetteer instead of the VirtualDocument approach and remove the dependency on VirtualDocument.

Mavenize

Mavenize and move the mavenized version with a new version number to live under the GateNLP organization.

Better handling of list-specific features

If the config file is a YAML file, allow specifying alternate features and values which are the same for all entries in that list, replacing majorType, minorType etc.

Make it easier to use for non-exact matches

Try to at least support matches where certain characters, e.g. hyphens, can be treated as equivalent to embedded white space.
This could maybe get implemented as part of our own trie implementation, but see the issue about using jaspell for a possible alternative.
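A minimal sketch of one way to approach this outside the trie: map every "white-space-equivalent" character to a single space, one character for one character, in both the gazetteer entries and the text before lookup. Because the mapping is one-to-one, document offsets stay valid. The class name and the `equivalentChars` parameter are made up for illustration.

```java
class SeparatorNormalizer {
    /**
     * Replaces every white space character and every character listed in
     * equivalentChars with a plain space. The result has the same length as
     * the input, so match offsets can be used on the original text unchanged.
     */
    static String normalize(String text, String equivalentChars) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean separator = Character.isWhitespace(c) || equivalentChars.indexOf(c) >= 0;
            sb.append(separator ? ' ' : c);
        }
        return sb.toString();
    }
}
```

With this, `normalize("state-of-the-art", "-")` and `normalize("state of the art", "-")` produce the same lookup key. Collapsing runs of separators would need an additional offset map and is not covered here.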

Provide an alternative to writing a file in order to run

Writing a new file to run this plugin causes the system to break in a multithreaded production environment. Furthermore, the way that the plugin resources are loaded means that there always has to be a physical file to point to for loading the plugin, which doesn't work for using resources bundled inside of jars.

Add an alternative run strategy that loads from a resource and does not write this unnecessary file.
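As an illustration of the general idea only (this is not how the plugin currently loads anything): reading through a URL or InputStream instead of a java.io.File works the same for `file:` and `jar:` URLs and needs no temporary file. The class and method names are hypothetical, and the one-entry-per-line format is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ResourceListLoader {
    /** Reads non-blank lines from a stream, assuming UTF-8 and one entry per line. */
    static List<String> readEntries(InputStream in) {
        List<String> entries = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    entries.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    /** Works for any URL with a stream handler, including resources inside a jar. */
    static List<String> readEntries(URL url) {
        try (InputStream in = url.openStream()) {
            return readEntries(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A bundled resource could then be passed in as, for example, `SomeClass.class.getResource("/lists/cities.lst")`, where the path is a placeholder.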

Rethink PR parameters

Rethink which of the runtime parameters of the ExtendedGazetteer PR should be kept and which could go into a later YAML config file. Also, align parameters between the gazetteer and regexp annotator PRs (same name, similar function), especially after re-implementing the indirect regexp annotator using the virtual chunk annotation approach.

Add parameter/setting to handle Latin characters

Add a parameter that will handle text/gazetteers with UTF-8 characters which have mainly one-byte encodings so that the UTF-16 characters used by Java will first be converted to bytes. This should make it possible to reduce the memory requirements for the trie to nearly half of what we need now.
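To make the memory argument concrete, here is a small sketch that uses ISO-8859-1 as a stand-in for "characters that fit in one byte". That choice is an assumption, not something decided in this issue. Strings that cannot be represented are reported so the caller can fall back to the existing char-based storage. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class CompactKeys {
    /**
     * Returns the string as one byte per character if every character is
     * representable in ISO-8859-1, or null if it is not.
     */
    static byte[] toLatin1OrNull(String s) {
        // A fresh encoder per call; CharsetEncoder instances are not thread-safe.
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(s)) {
            return null;
        }
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

A Java char takes two bytes, so a key stored this way needs half the space for its characters, which is where the "nearly half" estimate would come from.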

Support TSV files as list files

In addition to the .lst format, support .tsv files where the first column is the entry and the rest of the row has tab-separated feature values. Each TSV file has to start with a header line that contains the feature names for each column. Also, all rows must have the same number of fields.
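A minimal sketch of a parser for the format described above. It assumes that the header's first column names the entry column and is otherwise ignored, and that an empty field is kept as an empty string; both points are assumptions. The class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TsvListParser {
    /** One gazetteer entry with its feature values keyed by header name. */
    static final class Entry {
        final String text;
        final Map<String, String> features;
        Entry(String text, Map<String, String> features) {
            this.text = text;
            this.features = features;
        }
    }

    /** Parses the lines of a TSV list file; the first line must be the header. */
    static List<Entry> parse(List<String> lines) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("missing header line");
        }
        // limit -1 keeps trailing empty fields, so "a\tb\t" has three fields
        String[] header = lines.get(0).split("\t", -1);
        List<Entry> entries = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {
            String[] fields = lines.get(i).split("\t", -1);
            if (fields.length != header.length) {
                throw new IllegalArgumentException("line " + (i + 1) + ": expected "
                        + header.length + " fields but found " + fields.length);
            }
            Map<String, String> features = new LinkedHashMap<>();
            for (int col = 1; col < header.length; col++) {
                features.put(header[col], fields[col]);
            }
            entries.add(new Entry(fields[0], features));
        }
        return entries;
    }
}
```

Rejecting rows with a different field count at load time keeps a malformed list from silently producing annotations with shifted features.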

Distinguish between missing features and empty features

Maybe: if a feature is present in the gazetteer but empty, make sure it is added with an empty string value instead of not being added?

If we ever support TSV, should an empty field result in an empty-String feature or a missing feature? What if the feature is normally numeric?

Investigate if jaspell could be useful

In general, it would be interesting to see how well jaspell compresses data in comparison and how fast its lookups are.
It may also be possible to support some form of fuzzy matching with jaspell, at least when the processing mode is limited to words only.
If we support fuzzy matching, the following issues arise:
= Allow various ways to prefer one or n out of many fuzzy matches. This could be based purely on string similarity measures or also include e.g. frequency information (e.g. a specific feature in the gazetteer list).
= There may be border cases where fuzzy matches are better than longer or shorter matches, so this is related to how we treat the matching boundaries.
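To illustrate the first point, here is a sketch that ranks candidates by plain Levenshtein edit distance and breaks ties by a frequency value. This does not use jaspell. The similarity measure, the `maxDistance` cut-off and all names are assumptions chosen only to show the ranking idea.

```java
import java.util.Map;

class FuzzyRanker {
    /** Standard Levenshtein distance using two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Picks the candidate with the smallest distance to the word; ties go to
     * the higher frequency. Returns null if nothing is within maxDistance.
     */
    static String best(String word, Map<String, Integer> candidateFrequencies, int maxDistance) {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        int bestFreq = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> e : candidateFrequencies.entrySet()) {
            int d = distance(word, e.getKey());
            if (d > maxDistance) {
                continue;
            }
            if (d < bestDist || (d == bestDist && e.getValue() > bestFreq)) {
                best = e.getKey();
                bestDist = d;
                bestFreq = e.getValue();
            }
        }
        return best;
    }
}
```

Returning the n best instead of one would only require sorting the candidates by the same two keys.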

Add features that indicate offsets of feature-based matches

When matching against features, at the moment the begin offset of a new annotation will always be the begin offset of the annotation matched, even if the match occurs at an offset > 0 in the feature string. We should at least add features that indicate which offsets inside the actual match we got, and maybe also the actual string we matched against and the part we matched. That way the match annotations can get adapted by postprocessing, if necessary.
Example: we match the root but we match within the word and want to identify any prefix.
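A small sketch of the information such features could carry, using a plain substring search in place of the real matching code. The class, the field names and the idea of reporting the prefix as its own value are all hypothetical.

```java
class FeatureMatchOffsets {
    /** Where inside the feature string the match was found. */
    static final class Result {
        final int start;       // offset of the match inside the feature string
        final int end;         // exclusive end offset inside the feature string
        final String prefix;   // part of the feature string before the match
        final String matched;  // the part that was actually matched
        Result(int start, int end, String prefix, String matched) {
            this.start = start; this.end = end;
            this.prefix = prefix; this.matched = matched;
        }
    }

    /** Returns the first occurrence of entry inside featureValue, or null if there is none. */
    static Result locate(String featureValue, String entry) {
        int start = featureValue.indexOf(entry);
        if (start < 0) {
            return null;
        }
        int end = start + entry.length();
        return new Result(start, end, featureValue.substring(0, start),
                featureValue.substring(start, end));
    }
}
```

For the feature value "unhappiness" and the entry "happi", this gives start 2, end 7 and the prefix "un", which is the kind of information a postprocessing step would need to adjust the annotation.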

Rename classes

Use better class names: the regexp annotator is not simple any more, use com.jpetrak.gate.stringannotation.regexp.RegexpAnnotator

Rename the extended gazetteer to com.jpetrak.gate.stringannotation.extendedgazetteer.ExtendedGazetteer etc.

Rethink case sensitivity

Consider: always store the actual case (if case normalization is wanted, it must be a preparation step for the list file). Then, if case-insensitive matching is required (then: a runtime parameter!!), use a parallel matching algorithm: for each character position, match the lower-case and upper-case version for all active matches in parallel. In theory, the number of active matches could double at each position, but this will not actually happen: the number of active matches is bounded by the maximum number of differently capitalized prefixes of a potential full match (or set of matches if we want to find all possible matches of any length).
Then, either create several annotations, one for each case variation we matched, or create just one annotation based on a preference setting (e.g. first, best case match ...?)
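The parallel matching idea can be sketched with a toy trie that stores entries in their original case and, when asked to ignore case, follows both the lower-case and the upper-case child for every active state. This is an illustration under simplifying assumptions (a HashMap-based trie, simple per-char case mapping), not the plugin's trie. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CaseParallelTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String entry; // non-null if a stored entry ends here, in its original case
    }

    private final Node root = new Node();

    /** Stores the entry exactly as written, without case normalization. */
    void add(String entry) {
        Node n = root;
        for (int i = 0; i < entry.length(); i++) {
            n = n.children.computeIfAbsent(entry.charAt(i), k -> new Node());
        }
        n.entry = entry;
    }

    /** Returns all stored entries (of any length) that match the text starting at offset from. */
    List<String> matchesAt(String text, int from, boolean ignoreCase) {
        List<String> found = new ArrayList<>();
        List<Node> active = new ArrayList<>();
        active.add(root);
        for (int i = from; i < text.length() && !active.isEmpty(); i++) {
            char c = text.charAt(i);
            char lower = Character.toLowerCase(c);
            char upper = Character.toUpperCase(c);
            List<Node> next = new ArrayList<>();
            for (Node n : active) {
                if (ignoreCase) {
                    // follow both case variants in parallel
                    addIfPresent(n.children.get(lower), next);
                    if (upper != lower) {
                        addIfPresent(n.children.get(upper), next);
                    }
                } else {
                    addIfPresent(n.children.get(c), next);
                }
            }
            for (Node n : next) {
                if (n.entry != null) {
                    found.add(n.entry);
                }
            }
            active = next;
        }
        return found;
    }

    private static void addIfPresent(Node n, List<Node> out) {
        if (n != null) {
            out.add(n);
        }
    }
}
```

The active list only grows where the list file really contains differently capitalized prefixes, which matches the bound described above. The returned entries keep their original case, so the caller can create one annotation per variation or pick one by preference.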
