johann-petrak / gateplugin-stringannotation

!!! OLD/OUTDATED, use https://github.com/GateNLP/gateplugin-StringAnnotation

License: GNU Lesser General Public License v2.1

Shell 0.55% Java 98.80% Perl 0.65%

gateplugin-stringannotation's Introduction

!!!!!!!OLD gateplugin-StringAnnotation OLD!!!!!!!

IMPORTANT: this version of the plugin will not be actively maintained any more. The new, mavenized version of the plugin for GATE version 8.5 and later can be found here: https://github.com/GateNLP/gateplugin-StringAnnotation

A plugin for the GATE language technology framework that provides PRs for string annotation.

This is an updated and modified version (not backwards compatible any more) of what can be found here: https://code.google.com/p/gateplugin-stringannotation/

For more information please consult the Wiki: https://github.com/johann-petrak/gateplugin-stringannotation/wiki

To download see: https://github.com/johann-petrak/gateplugin-stringannotation/releases

Feedback: please report bugs or feature requests in the issue tracker: https://github.com/johann-petrak/gateplugin-stringannotation/issues


gateplugin-stringannotation's Issues

Add GUI option to re-build the cache

Add a menu entry to the GUI right click menu which will re-initialize the PR after deleting the .gazbin file, if it exists.

Investigate collapsing multiple matches into one annotation

This could be done by either dealing with priorities (e.g. only first entry from a list, only first list) and/or by adding features that contain information for all matches in a list.
It would be interesting if we could combine this with case variation matching if case-insensitive matching is performed.

Remove dependency on virtual corpus

Re-implement the indirect regular expression annotator so it uses the same mechanism as the ExtendedGazetteer instead of the VirtualDocument approach and remove the dependency on VirtualDocument.

Mavenize

Mavenize and move the mavenized version with a new version number to live under the GateNLP organization.

Better handling of list-specific features

If the config file is a YAML file, allow specifying alternate features and values that are the same for all entries in that list, replacing majorType, minorType, etc.

Make it easier to use for non-exact matches

Try to at least support matches where certain characters can be treated equal to embedded white space, e.g. hyphens.
This could maybe get implemented as part of our own trie implementation, but see the issue about using jaspell for a possible alternative.
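
One simple way to approximate this without touching the trie is to normalize both the gazetteer entries and the document text before matching. The sketch below is only an illustration under assumptions: the class name, the method name and the set of separator characters are hypothetical and not part of the plugin. Each separator is replaced by exactly one space, so the string length and all character offsets stay unchanged.

```java
class SeparatorNormalizer {
  // Characters treated as equal to a space. This set is an assumption
  // made for illustration: ASCII hyphen, Unicode hyphen, non-breaking
  // hyphen and en dash.
  private static final String SEPARATORS = "-\u2010\u2011\u2013";

  // Replaces every whitespace or separator character by a single space.
  // The mapping is one-to-one, so offsets in the result are the same as
  // offsets in the original text.
  static String normalize(String text) {
    StringBuilder sb = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (Character.isWhitespace(c) || SEPARATORS.indexOf(c) >= 0) {
        sb.append(' ');
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }
}
```

With this, the entry "anti inflammatory" and the text "anti-inflammatory" normalize to the same string. Runs of separators are deliberately not collapsed, because collapsing would shift offsets and require a mapping back to the original document.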

Provide an alternative to writing a file in order to run

Writing a new file to run this plugin causes the system to break in a multithreaded production environment. Furthermore, the way that the plugin resources are loaded means that there always has to be a physical file to point to for loading the plugin, which doesn't work for using resources bundled inside of jars.

Make an alternate run strategy that loads from resource and doesn't write this unnecessary file.

Investigate approximate string matching

There are various possible approaches, several of them based on tries.
See also http://dbgroup.cs.tsinghua.edu.cn/dd/projects/taste/index.html

Read from URLs instead of files where possible

Change all code where we assume we get a file to read from to reading from a URL where possible. This is to allow for reading resources from inside a JAR (or other URL locations).
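
As a minimal sketch of the idea, the helper below reads a list from any URL using only standard JDK calls. The class and method names are hypothetical. A `file:` URL, a `jar:` URL obtained from `Class.getResource`, or an `http:` URL would all go through the same code path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class UrlListReader {
  // Reads all lines from the given URL as UTF-8 text. Nothing here
  // assumes that the URL points to a file on disk.
  static List<String> readLines(URL url) throws IOException {
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    }
    return lines;
  }
}
```

Existing callers that hold a `java.io.File` could pass `file.toURI().toURL()`, so switching the internal API to URLs would not need to break file-based configurations.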

Make sure version changes in the backend library are detected when loading gazbin files

We should throw an (easy to understand) exception when the format of the gazbin serialized file has changed and we want to load it. An exception is probably already thrown, but we should double-check and also make the text more understandable than the original "version mismatch".
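
One possible approach, sketched here under assumptions, is to write an explicit header in front of the serialized data and check it before loading. The magic number, the version constant and the class name are hypothetical, and the actual gazbin format may detect changes differently (for example through Java serialization itself).

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class GazbinHeader {
  // Hypothetical values, chosen only for this illustration.
  static final int MAGIC = 0x47415A42;   // the ASCII bytes "GAZB"
  static final int FORMAT_VERSION = 3;

  // Writes the header; the serialized trie would follow it.
  static void writeHeader(DataOutputStream out) throws IOException {
    out.writeInt(MAGIC);
    out.writeInt(FORMAT_VERSION);
  }

  // Reads and verifies the header, failing with a message that tells the
  // user what happened and what to do about it.
  static void checkHeader(DataInputStream in, String source) throws IOException {
    int magic = in.readInt();
    if (magic != MAGIC) {
      throw new IOException(source + " is not a gazetteer cache file");
    }
    int version = in.readInt();
    if (version != FORMAT_VERSION) {
      throw new IOException("Cache file " + source
          + " was written with format version " + version
          + " but this plugin version expects format version " + FORMAT_VERSION
          + "; delete the file so that it gets rebuilt");
    }
  }
}
```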

Rethink PR parameters

Rethink which of the runtime parameters of the ExtendedGazetteer PR should be kept and which could go into a later YAML config file. Also, align parameters between the gazetteer and regexp annotator PRs (same name, similar function), especially after re-implementing the indirect regexp annotator using the virtual chunk annotation approach.

Add parameter/setting to handle latin characters

Add a parameter that will handle text/gazetteers with UTF-8 characters which have mainly one-byte encodings so that the UTF-16 characters used by Java will first be converted to bytes. This should make it possible to reduce the memory requirements for the trie to nearly half of what we need now.
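
The core of such a parameter could look like the sketch below. It assumes ISO-8859-1 as the one-byte encoding, which the issue does not name, and the class and method names are hypothetical. A string is converted to one byte per character only if every character is representable; otherwise the caller would fall back to the normal char-based storage.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class CompactKeys {
  // Returns the string as one byte per character if all characters fit
  // into ISO-8859-1 (an assumption for this sketch), or null if at least
  // one character cannot be represented that way.
  static byte[] toLatin1OrNull(String s) {
    CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
    if (!encoder.canEncode(s)) {
      return null;
    }
    return s.getBytes(StandardCharsets.ISO_8859_1);
  }
}
```

Since a Java `char` takes two bytes and the encoded form takes one, this is where the expected saving of nearly half would come from for the character data itself.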

Support TSV files as list files

In addition to the .lst format support .tsv where the first column is the entry and the rest of the row has tab-separated feature values. Each tsv file has to start with a header line that contains the feature names for each column. Also all rows must have the same number of fields.
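
The rules above can be sketched as a small parser. The class names are hypothetical, and the sketch does no escaping or quoting; it only enforces the header line and the equal number of fields per row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TsvListParser {
  // One gazetteer entry: the string to match plus its feature values.
  static final class Entry {
    final String text;
    final Map<String, String> features;
    Entry(String text, Map<String, String> features) {
      this.text = text;
      this.features = features;
    }
  }

  // Parses the lines of a TSV list file. The first line is the header;
  // its first column names the entry column, the rest are feature names.
  static List<Entry> parse(List<String> lines) {
    if (lines.isEmpty()) {
      throw new IllegalArgumentException("TSV list file needs a header line");
    }
    // Limit -1 keeps trailing empty fields, so "a\t\t" has three fields.
    String[] header = lines.get(0).split("\t", -1);
    List<Entry> entries = new ArrayList<>();
    for (int i = 1; i < lines.size(); i++) {
      String[] fields = lines.get(i).split("\t", -1);
      if (fields.length != header.length) {
        throw new IllegalArgumentException("Line " + (i + 1) + " has "
            + fields.length + " fields, expected " + header.length);
      }
      Map<String, String> features = new LinkedHashMap<>();
      for (int col = 1; col < header.length; col++) {
        features.put(header[col], fields[col]);
      }
      entries.add(new Entry(fields[0], features));
    }
    return entries;
  }
}
```

In this sketch an empty field becomes a feature with an empty string value; whether it should be left out instead is a separate decision.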

Distinguish between missing features and empty features

Maybe: if a feature is present in the gazetteer but empty, make sure it is added with an empty string value instead of not being added?

If we ever support TSV, should an empty field result in an empty-String feature or a missing feature? What if the feature is normally numeric?

Add benchmark information to the Wiki

Redo what we did here, only better:
https://code.google.com/archive/p/gateplugin-stringannotation/wikis/ExtendedGazetteer2_Benchmarks.wiki

Investigate if jaspell could be useful

It would be interesting in general to see how well jaspell compresses data in comparison and how fast it is for looking up information.
However it may be possible to also support some form of fuzzy matching with jaspell, at least when the processing mode is limited to words only.
If we support fuzzy matching the following issues arise:
- allow various ways to prefer one or n out of many fuzzy matches. This could be based on purely string-based similarity measures or also include e.g. frequency information (e.g. a specific feature in the gazetteer list).
- there may be border cases where fuzzy matches may be better than longer or shorter matches, so this is related to how we treat the matching boundaries.

Add features that indicate offsets of feature-based matches

When matching against features, at the moment the begin offset of a new annotation will always be the begin offset of the annotation matched, even if the match occurs at an offset > 0 in the feature string. We should at least add features that indicate which offsets inside the actual match we got, and maybe also the actual string we matched against and the part we matched. That way the match annotations can get adapted by postprocessing, if necessary.
Example: we match the root but we match within the word and want to identify any prefix.
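
A minimal sketch of what such features could contain is shown below. The feature names (`matchFrom`, `matchTo`, `matchedAgainst`, `matchedPart`, `prefix`) and the class name are hypothetical and only serve as an illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MatchOffsetFeatures {
  // Describes a match found inside a feature value: from is the offset of
  // the first matched character, to is the offset after the last one.
  static Map<String, Object> describe(String featureValue, int from, int to) {
    if (from < 0 || to > featureValue.length() || from > to) {
      throw new IndexOutOfBoundsException("Invalid match range " + from + ".." + to);
    }
    Map<String, Object> features = new LinkedHashMap<>();
    features.put("matchFrom", from);
    features.put("matchTo", to);
    features.put("matchedAgainst", featureValue);
    features.put("matchedPart", featureValue.substring(from, to));
    // Everything before the match, e.g. a prefix in front of a root.
    features.put("prefix", featureValue.substring(0, from));
    return features;
  }
}
```

For the feature value "unhappy" and a match at offsets 2 to 7, `matchedPart` is "happy" and `prefix` is "un", which a postprocessing step could use to adjust the annotation.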

Rename classes

Use better class names: the regexp annotator is not simple any more, use com.jpetrak.gate.stringannotation.regexp.RegexpAnnotator

Rename the extended gazetteer to com.jpetrak.gate.stringannotation.extendedgazetteer.ExtendedGazetteer etc.

Rethink case sensitivity

Consider: always store the actual case (if case-normalization is wanted, must be a preparation step for the list file). Then, if case-insensitive matching is required (then: a runtime parameter!!) use a parallel matching algorithm: for each character position, match the lower-case and upper-case version for all active matches in parallel. In theory, we could double the number of active matches at each position, but this will actually not happen and the number of active matches will be bounded by the maximum number of differently capitalized prefixes of a potential full match (or set of matches if we want to find all possible matches of any length).
Then, actually create several annotations for each case-variation we matched, or just one annotation based on a preference setting (e.g. first, best case match ...?)
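
The parallel matching idea can be sketched with a plain map-based trie. This is not the plugin's trie implementation; the class and method names are hypothetical. Entries are stored in their actual case, and at each text position every active node is advanced along both the lower-case and the upper-case variant of the current character.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CaseParallelTrie {
  private static final class Node {
    final Map<Character, Node> children = new HashMap<>();
    String entry; // non-null if an entry ends here, kept in its actual case
  }

  private final Node root = new Node();

  // Adds an entry exactly as given, without any case normalization.
  void add(String entry) {
    Node node = root;
    for (int i = 0; i < entry.length(); i++) {
      node = node.children.computeIfAbsent(entry.charAt(i), k -> new Node());
    }
    node.entry = entry;
  }

  // Returns all entries, of any length, that match the text starting at
  // offset when case is ignored. Uses per-char case mapping only, so
  // special cases such as title-case letters are not handled.
  List<String> matchIgnoreCase(String text, int offset) {
    List<String> found = new ArrayList<>();
    List<Node> active = new ArrayList<>();
    active.add(root);
    for (int i = offset; i < text.length() && !active.isEmpty(); i++) {
      char lower = Character.toLowerCase(text.charAt(i));
      char upper = Character.toUpperCase(text.charAt(i));
      List<Node> next = new ArrayList<>();
      for (Node node : active) {
        Node viaLower = node.children.get(lower);
        if (viaLower != null) next.add(viaLower);
        if (upper != lower) {
          Node viaUpper = node.children.get(upper);
          if (viaUpper != null) next.add(viaUpper);
        }
      }
      for (Node node : next) {
        if (node.entry != null) found.add(node.entry);
      }
      active = next;
    }
    return found;
  }
}
```

With the entries "US", "us" and "Usher", matching the text "USHER said" at offset 0 returns all three entries in their stored case. The active list only grows where differently capitalized prefixes actually exist in the trie, and the caller can then create one annotation per returned entry or select one according to a preference setting.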