johann-petrak / gateplugin-stringannotation
!!! OLD/OUTDATED, use https://github.com/GateNLP/gateplugin-StringAnnotation
License: GNU Lesser General Public License v2.1
There are various possible approaches, several of them based on tries.
See also http://dbgroup.cs.tsinghua.edu.cn/dd/projects/taste/index.html
This could be done by using priorities (e.g. only the first entry from a list, only the first list) and/or by adding features that contain information for all matches in a list.
It would be interesting if we could combine this with case variation matching if case-insensitive matching is performed.
When matching against features, at the moment the begin offset of a new annotation will always be the begin offset of the annotation matched, even if the match occurs at an offset > 0 in the feature string. We should at least add features that indicate at which offsets inside the feature string the match occurred, and maybe also the actual string we matched against and the part we matched. That way the match annotations can be adapted by postprocessing, if necessary.
Example: we match against the root, but the match occurs within the word and we want to identify any prefix.
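A minimal sketch of the extra match information proposed above. The class name, method name and all feature names (`matchedAgainst`, `matchedPart`, `prefixBeforeMatch`, `matchStartInFeature`, `matchEndInFeature`) are assumptions made for illustration and are not part of the plugin:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects information about where inside a feature
// string a match occurred, so that postprocessing can adjust annotations.
class FeatureMatchInfoSketch {
    // matchStart is inclusive, matchEnd is exclusive, both relative to featureValue.
    static Map<String, Object> describeMatch(String featureValue, int matchStart, int matchEnd) {
        Map<String, Object> fm = new LinkedHashMap<>();
        fm.put("matchedAgainst", featureValue);                             // full string we matched against
        fm.put("matchedPart", featureValue.substring(matchStart, matchEnd)); // the part that matched
        fm.put("prefixBeforeMatch", featureValue.substring(0, matchStart)); // anything before the match
        fm.put("matchStartInFeature", matchStart);
        fm.put("matchEndInFeature", matchEnd);
        return fm;
    }
}
```

For a feature value `unhappy` with a match on `happy` at offsets 2 to 7, the map would carry `un` as the prefix, which a later processing step could use to shift or split the annotation.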
Try to at least support matches where certain characters, e.g. hyphens, are treated as equivalent to embedded white space.
This could maybe get implemented as part of our own trie implementation, but see the issue about using jaspell for a possible alternative.
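One simple way to approximate this without touching the trie is to normalize both the gazetteer entries and the document text before matching. The sketch below is only an illustration under that assumption. The class and method names are hypothetical, and a real implementation inside the trie could avoid the extra copy of the text:

```java
// Hypothetical sketch: map every white space character and every character
// from a configurable set of "equivalents" to a single space.
class WhitespaceEquivalenceSketch {
    static String normalize(String text, String equivalents) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean treatAsSpace = Character.isWhitespace(c) || equivalents.indexOf(c) >= 0;
            // One character in, one character out: document offsets stay valid.
            sb.append(treatAsSpace ? ' ' : c);
        }
        return sb.toString();
    }
}
```

With `equivalents` set to `"-"`, the text `New-York` is normalized to `New York`, so it would match a list entry `New York`. Because the length of the string is unchanged, match offsets can be used directly on the original document.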
Add a menu entry to the GUI right click menu which will re-initialize the PR after deleting the .gazbin file, if it exists.
Use better class names: the regexp annotator is not simple any more, use com.jpetrak.gate.stringannotation.regexp.RegexpAnnotator
Rename the extended gazetteer to com.jpetrak.gate.stringannotation.extendedgazetteer.ExtendedGazetteer etc.
In addition to the .lst format, support .tsv, where the first column is the entry and the rest of the row holds tab-separated feature values. Each TSV file has to start with a header line that contains the feature names for each column. Also, all rows must have the same number of fields.
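The rules above (header line required, first column is the entry, all rows have the same number of fields) could be sketched roughly as follows. The class, the `Entry` type and the error messages are hypothetical. Keeping empty fields as empty strings is just one possible choice here, not a decision made by the plugin:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a strict TSV gazetteer list reader.
class TsvGazetteerSketch {
    static final class Entry {
        final String text;                    // first column: the gazetteer entry
        final Map<String, String> features;   // remaining columns, keyed by header name
        Entry(String text, Map<String, String> features) {
            this.text = text;
            this.features = features;
        }
    }

    static List<Entry> parse(Reader in) {
        try (BufferedReader br = new BufferedReader(in)) {
            String headerLine = br.readLine();
            if (headerLine == null) {
                throw new IllegalArgumentException("TSV file is empty, a header line is required");
            }
            // Limit -1 keeps trailing empty fields instead of silently dropping them.
            String[] header = headerLine.split("\t", -1);
            List<Entry> entries = new ArrayList<>();
            String line;
            int lineNo = 1;
            while ((line = br.readLine()) != null) {
                lineNo++;
                String[] fields = line.split("\t", -1);
                if (fields.length != header.length) {
                    throw new IllegalArgumentException("Line " + lineNo + " has " + fields.length
                            + " fields, but the header has " + header.length);
                }
                Map<String, String> fm = new LinkedHashMap<>();
                for (int i = 1; i < fields.length; i++) {
                    fm.put(header[i], fields[i]);
                }
                entries.add(new Entry(fields[0], fm));
            }
            return entries;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Rejecting rows with a wrong field count early, with the line number in the message, should make malformed list files much easier to fix than a failure later during matching.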
If the config file is a yaml file, allow specifying alternate features and values which are the same for all entries in that list, replacing majorType, minorType etc.
Mavenize and move the mavenized version with a new version number to live under the GateNLP organization.
We should throw an (easy to understand) exception when the format of the gazbin serialized file has changed and we want to load it. An exception is probably already thrown, but we should double-check and also make the text more understandable than the original "version mismatch".
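As an illustration of what a more understandable message could look like, here is a small sketch. The exception class, the version constant and its value, and the way the version is obtained are all assumptions. The actual gazbin format and its version handling are not described here:

```java
// Hypothetical sketch of a descriptive error for an outdated .gazbin cache file.
class GazbinVersionSketch {
    // Assumed value, for illustration only.
    static final int EXPECTED_FORMAT_VERSION = 2;

    static class IncompatibleGazbinException extends RuntimeException {
        IncompatibleGazbinException(String message) {
            super(message);
        }
    }

    static void checkVersion(String gazbinLocation, int foundVersion) {
        if (foundVersion != EXPECTED_FORMAT_VERSION) {
            throw new IncompatibleGazbinException(
                "The cached gazetteer file " + gazbinLocation
                + " was written with format version " + foundVersion
                + ", but this version of the plugin expects format version "
                + EXPECTED_FORMAT_VERSION
                + ". Delete the .gazbin file and re-initialize the PR to rebuild it from the list files.");
        }
    }
}
```

The message names the file, both versions and the remedy, so a user does not need to know what a serialization version is in order to recover.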
Change all code where we assume we get a file to read from to reading from a URL where possible. This is to allow for reading resources from inside a JAR (or other URL locations).
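A minimal sketch of the URL-based approach, using only standard JDK classes. The class and method names are hypothetical, and UTF-8 is assumed as the encoding:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: read a list file from any URL instead of a java.io.File.
class UrlResourceSketch {
    static List<String> readLines(URL url) {
        // url.openStream() works for file: URLs as well as jar: URLs,
        // e.g. those returned by Class.getResource().
        try (InputStream is = url.openStream();
             BufferedReader br = new BufferedReader(
                     new InputStreamReader(is, StandardCharsets.UTF_8))) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Code that previously took a `File` can still be served by converting it with `file.toURI().toURL()`, so existing configurations would not have to change.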
Writing a new file to run this plugin causes the system to break in a multithreaded production environment. Furthermore, the way that the plugin resources are loaded means that there always has to be a physical file to point to for loading the plugin, which doesn't work for using resources bundled inside of jars.
Make an alternate run strategy that loads from resource and doesn't write this unnecessary file.
Re-implement the indirect regular expression annotator so it uses the same mechanism as the ExtendedGazetteer instead of the VirtualDocument approach and remove the dependency on VirtualDocument.
Add a parameter that will handle text/gazetteers with UTF-8 characters which have mainly one-byte encodings so that the UTF-16 characters used by Java will first be converted to bytes. This should make it possible to reduce the memory requirements for the trie to nearly half of what we need now.
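The potential saving can be illustrated with a few lines of Java. This sketch only shows the conversion of entries to UTF-8 byte keys and the raw size comparison. The class and method names are hypothetical, and the real memory use of a trie depends on its node layout, not only on key size:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: use UTF-8 bytes instead of UTF-16 chars as trie keys.
class Utf8KeySketch {
    // Key as it would be stored in a byte-based trie.
    static byte[] toTrieKey(String entry) {
        return entry.getBytes(StandardCharsets.UTF_8);
    }

    // Recover the original string from a stored key.
    static String fromTrieKey(byte[] key) {
        return new String(key, StandardCharsets.UTF_8);
    }

    // Raw size of the same entry when stored as Java chars (2 bytes each).
    static int utf16Size(String entry) {
        return entry.length() * 2;
    }
}
```

For `Wien` the key shrinks from 8 bytes to 4, and for `Z\u00fcrich` (with u-umlaut) from 12 bytes to 7. One consequence to keep in mind: positions in the byte key no longer correspond to character offsets in the document, so the matcher would have to track both.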
Consider: always store the actual case (if case-normalization is wanted, must be a preparation step for the list file). Then, if case-insensitive matching is required (then: a runtime parameter!!) use a parallel matching algorithm: for each character position, match the lower-case and upper-case version for all active matches in parallel. In theory, we could double the number of active matches at each position, but this will actually not happen and the number of active matches will be bounded by the maximum number of differently capitalized prefixes of a potential full match (or set of matches if we want to find all possible matches of any length).
Then, actually create several annotations for each case-variation we matched, or just one annotation based on a preference setting (e.g. first, best case match ...?)
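The parallel matching idea can be sketched with a plain map-based trie. This is not the plugin's trie implementation, and all names are hypothetical. It uses simple per-character case mapping via `Character.toLowerCase` and `Character.toUpperCase`, so locale-specific or multi-character case mappings are not handled:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the trie stores entries in their actual case;
// case-insensitive matching follows all case variants in parallel.
class CaseParallelTrieSketch {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String entry; // non-null if a list entry ends here (original case)
    }

    private final Node root = new Node();

    void add(String entry) {
        Node n = root;
        for (char c : entry.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.entry = entry;
    }

    // Returns all entries, of any length, that match the text starting at
    // offset 'from' when case is ignored.
    List<String> matchIgnoringCase(String text, int from) {
        List<String> found = new ArrayList<>();
        List<Node> active = new ArrayList<>();
        active.add(root);
        for (int i = from; i < text.length() && !active.isEmpty(); i++) {
            char c = text.charAt(i);
            // The set removes duplicates, e.g. for digits or punctuation.
            Set<Character> variants = new LinkedHashSet<>();
            variants.add(c);
            variants.add(Character.toLowerCase(c));
            variants.add(Character.toUpperCase(c));
            List<Node> next = new ArrayList<>();
            for (Node n : active) {
                for (char v : variants) {
                    Node child = n.children.get(v);
                    if (child != null) {
                        next.add(child);
                        if (child.entry != null) {
                            found.add(child.entry);
                        }
                    }
                }
            }
            active = next; // only prefixes that exist in the trie stay active
        }
        return found;
    }
}
```

With the entries `US`, `Us` and `us` in the trie, matching the text `us army` at offset 0 returns all three case variants. A preference setting as suggested above would then pick one of them or create one annotation per variant. The list of active nodes grows only for capitalizations that actually exist as prefixes in the trie, which is the bound described above.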
Redo what we did here:
https://code.google.com/archive/p/gateplugin-stringannotation/wikis/ExtendedGazetteer2_Benchmarks.wiki
only better
It would be interesting in general to see how well jaspell compresses data in comparison and how fast it is for looking up information.
However it may be possible to also support some form of fuzzy matching with jaspell, at least when the processing mode is limited to words only.
If we support fuzzy matching the following issues arise:
= allow various ways to prefer one or n out of many fuzzy matches. This could be based on purely string-based similarity measures or also include e.g. frequency information (e.g. a specific feature in the gazetteer list).
= there may be border cases where fuzzy matches may be better than longer or shorter matches, so this is related to how we treat the matching boundaries.
Maybe: if a feature is present in the gazetteer but empty, make sure it is added with an empty string value instead of not being added?
If we ever support TSV, should an empty field result in an empty-String feature or a missing feature? What if the feature is normally numeric?
Rethink which of the runtime parameters of the ExtendedGazetteer PR should be kept and which could go into a later yaml config file. Also, align parameters between the gazetteer and regexp annotator PRs (same name, similar function, especially after re-implementing the indirect regexp annotator using the virtual chunk annotation approach).