Comments (9)

GoogleCodeExporter commented on June 19, 2024
A patch for this issue.

Original comment by [email protected] on 13 Jun 2011 at 6:41

from dkpro-core-asl.

GoogleCodeExporter commented on June 19, 2024
I have had a look at the patch and it seems to work for German or English, but 
I am not quite sure what the side effects will be for other languages.
Additionally, the opinions about what should be considered as punctuation might 
differ.

I would therefore rather implement this functionality as a subsequent
filtering step that removes all annotations matching some pattern, with both
the annotation type and the pattern as parameters.
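
A minimal sketch of such a filtering step, in plain Java rather than against the UIMA API. The `Span` and `PatternFilter` classes are hypothetical stand-ins invented for illustration; a real component would operate on CAS annotations and read the type and pattern from configuration parameters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for an annotation: a type name plus the text it covers. */
class Span {
    final String type;
    final String text;

    Span(String type, String text) {
        this.type = type;
        this.text = text;
    }
}

/** Hypothetical filter: drops spans of one type whose whole text matches a pattern. */
class PatternFilter {
    static List<Span> filter(List<Span> spans, String type, Pattern pattern) {
        List<Span> kept = new ArrayList<>();
        for (Span s : spans) {
            // Only spans of the configured type are candidates for removal.
            boolean remove = s.type.equals(type) && pattern.matcher(s.text).matches();
            if (!remove) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```

Calling `PatternFilter.filter(spans, "Token", Pattern.compile("\\p{Punct}+"))` would drop tokens made up only of ASCII punctuation and leave other types alone. What counts as punctuation stays a decision of whoever supplies the pattern, which is the point of making it a parameter.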

Original comment by [email protected] on 29 Sep 2011 at 10:09


GoogleCodeExporter commented on June 19, 2024
I find it a bit strange that the patch looks at the last character of a token 
to decide if a token should be removed or not. 

In terms of alternatives, we have a TokenFilter in the tokit module that 
currently filters out tokens that are too long. That, however, does not handle 
attached POS or Lemma annotations. 
There is also the StopWordRemover which is dictionary-based and is configurable 
with respect to types.

Maybe it would be useful to merge all of that into a single AnnotationFilter
which can do regexes or dictionaries - or maybe even dictionaries of regexes?
;) For the sake of speed, separate parameters for max length and min length
might also be useful.
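
A rough sketch of why separate length parameters help with speed: the length comparison is cheap and can run before any regex is evaluated. The class name `LengthAndPatternFilter` and its fields are assumptions made for illustration, not an existing DKPro component.

```java
import java.util.regex.Pattern;

/** Hypothetical combined filter: cheap length bounds first, then an optional regex. */
class LengthAndPatternFilter {
    private final int minLength;
    private final int maxLength;
    private final Pattern pattern; // may be null, in which case only length is checked

    LengthAndPatternFilter(int minLength, int maxLength, Pattern pattern) {
        this.minLength = minLength;
        this.maxLength = maxLength;
        this.pattern = pattern;
    }

    /** Returns true if a token with this text should be removed. */
    boolean shouldRemove(String text) {
        int n = text.length();
        // Integer comparisons run first, so out-of-range tokens never reach the regex.
        if (n < minLength || n > maxLength) {
            return true;
        }
        return pattern != null && pattern.matcher(text).matches();
    }
}
```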

Original comment by richard.eckart on 29 Sep 2011 at 10:19


GoogleCodeExporter commented on June 19, 2024
> I find it a bit strange that the patch looks at the last character of a token
> to decide if a token should be removed or not.

Indeed.

> Maybe it would be useful to merge all of that into a single AnnotationFilter
> which can do regexes or dictionaries - or maybe even dictionaries of regexes?
> ;) For sake of speed separate parameters for max length and min length might
> also be useful

Sounds good. Should we aim for the whole thing (lists of dictionaries of
regexes ;) or better start small?

Original comment by [email protected] on 29 Sep 2011 at 10:31


GoogleCodeExporter commented on June 19, 2024
Changing from dictionary-based to dictionary-of-regexes should take little
more than a parameter (PARAM_REGEX = true) and switching from equals() to
matches().
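
As an illustration of how small that switch could be, here is a sketch in plain Java. `DictionaryMatcher` and its `regex` flag are hypothetical names standing in for the component and the PARAM_REGEX setting; only `Set.contains` (which relies on `equals()`) and `String.matches` are real library calls.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical dictionary lookup that can treat its entries as literals or as regexes. */
class DictionaryMatcher {
    private final Set<String> entries;
    private final boolean regex; // stands in for PARAM_REGEX

    DictionaryMatcher(Set<String> entries, boolean regex) {
        this.entries = entries;
        this.regex = regex;
    }

    boolean contains(String text) {
        if (!regex) {
            // Literal mode: set lookup, i.e. equals() on the entries.
            return entries.contains(text);
        }
        // Regex mode: every entry is a pattern that must match the whole text.
        for (String entry : entries) {
            if (text.matches(entry)) {
                return true;
            }
        }
        return false;
    }
}
```

Note that `String.matches` compiles the pattern on every call and regex mode scans all entries, so it is slower than the hashed literal lookup. A real implementation would probably precompile the entries into `Pattern` objects once.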

What's bugging me more is the question of how to deal with POS, Lemma, Stem and
so on. Traditionally these are co-located with the Token, and there was no link
in Token to refer to them. That is awkward for programming and bad for
performance, so the Token recently gained explicit fields that refer to POS and
Lemma; I am not sure about Stem. It would be possible to override the
removeFromIndexes() method in Token to automatically also remove the associated
POS, Lemma and Stem - but this would only work for JCas. The other alternative
is the mechanism used in the StopWordRemover, which works but is much slower
and needs more configuration effort. Maybe a combination of both would be good:
CAS-based AEs could use the configurable method (which could become a
convenience method in uimaFIT CASUtil), and JCas-based AEs could call
Token.removeFromIndexes(), with Lemma, POS and Stem removed automatically in
cascade.
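
The cascading idea could look roughly like the sketch below. It deliberately does not use the UIMA or DKPro classes: `SimpleIndex`, `SimpleAnnotation` and `SimpleToken` are hypothetical stand-ins, and `removeFrom` only mimics what an overridden `removeFromIndexes()` might do.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an annotation index; not the UIMA API. */
class SimpleIndex {
    final List<SimpleAnnotation> entries = new ArrayList<>();
}

/** Hypothetical base annotation that can remove itself from an index. */
class SimpleAnnotation {
    final String label;

    SimpleAnnotation(String label) {
        this.label = label;
    }

    void removeFrom(SimpleIndex index) {
        index.entries.remove(this);
    }
}

/** Hypothetical token holding explicit references to its POS and lemma. */
class SimpleToken extends SimpleAnnotation {
    final SimpleAnnotation pos;   // may be null
    final SimpleAnnotation lemma; // may be null

    SimpleToken(String label, SimpleAnnotation pos, SimpleAnnotation lemma) {
        super(label);
        this.pos = pos;
        this.lemma = lemma;
    }

    @Override
    void removeFrom(SimpleIndex index) {
        // Cascade: remove the attached annotations before the token itself.
        if (pos != null) {
            pos.removeFrom(index);
        }
        if (lemma != null) {
            lemma.removeFrom(index);
        }
        super.removeFrom(index);
    }
}
```

The cascade depends on the token having explicit references to its POS and lemma. Annotations that are only co-located would still need the slower, configurable lookup described for the StopWordRemover.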

What do you think?

Original comment by richard.eckart on 29 Sep 2011 at 11:03


GoogleCodeExporter commented on June 19, 2024
As far as I can see, that would make the component quite dependent on our type
system, right?
This is not a problem in general, but I would prefer a type-system-agnostic
component, and maybe additionally a more specialized one.

Original comment by [email protected] on 29 Sep 2011 at 12:25


GoogleCodeExporter commented on June 19, 2024
What do you mean by "more specialized"?

Original comment by richard.eckart on 29 Sep 2011 at 6:50


GoogleCodeExporter commented on June 19, 2024

Original comment by richard.eckart on 8 Feb 2012 at 10:51

  • Added labels: Milestone-1.4.0


GoogleCodeExporter commented on June 19, 2024
Looks like this issue has been superseded by issue 14 (rename and enhance 
tokenfilter). We won't change the BreakIteratorSegmenter because that would 
imply we also need to change all other tokenizers that we have and may have in 
the future.

Original comment by richard.eckart on 7 Jun 2012 at 3:14

  • Changed state: WontFix
  • Added labels: Type-Enhancement
  • Removed labels: Type-Defect, Milestone-1.4.0

