BreakIteratorSegmenter: New parameter for punctuation marks (aschaeffer/dkpro-core-asl)

Comments (9)

GoogleCodeExporter commented on June 19, 2024

A patch for this issue.

Original comment by [email protected] on 13 Jun 2011 at 6:41

Attachments:

BreakIteratorSegmenter-patch-issue-16.txt

from dkpro-core-asl.

GoogleCodeExporter commented on June 19, 2024

I have had a look at the patch. It seems to work for German and English, but 
I am not sure what side effects it will have for other languages.
Additionally, opinions about what should be considered punctuation might 
differ.

Thus, I would rather suggest implementing this functionality as a subsequent 
filtering step that removes all annotations of a given type (one parameter) 
that match some pattern (another parameter).
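
A minimal sketch of such a filtering step, assuming a toy data model rather than the real UIMA API. The class PatternFilterSketch, the Span class and the filter() method are hypothetical names used only for illustration; a real component would iterate over CAS annotations instead of a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PatternFilterSketch {
    /** Hypothetical stand-in for an annotation: a type name plus the covered text. */
    static class Span {
        final String type;
        final String text;

        Span(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    /**
     * Returns the spans that survive filtering. A span is dropped only if it
     * has the given type AND its whole text matches the pattern.
     */
    static List<Span> filter(List<Span> spans, String type, Pattern pattern) {
        List<Span> kept = new ArrayList<>();
        for (Span s : spans) {
            boolean remove = s.type.equals(type) && pattern.matcher(s.text).matches();
            if (!remove) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(
                new Span("Token", "Hello"),
                new Span("Token", ","),
                new Span("Token", "world"),
                new Span("Token", "!"),
                new Span("Sentence", "!"));
        // \p{Punct}+ covers ASCII punctuation only; because the pattern is a
        // parameter, each user can decide what counts as punctuation.
        for (Span s : filter(spans, "Token", Pattern.compile("\\p{Punct}+"))) {
            System.out.println(s.type + ": " + s.text);
        }
    }
}
```

Because both the type and the pattern are parameters, the tokenizer itself stays untouched and language-specific punctuation rules live in configuration.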

Original comment by [email protected] on 29 Sep 2011 at 10:09

GoogleCodeExporter commented on June 19, 2024

I find it a bit strange that the patch looks at the last character of a token 
to decide if a token should be removed or not. 

In terms of alternatives, we have a TokenFilter in the tokit module that 
currently filters out tokens that are too long. That, however, does not handle 
attached POS or Lemma annotations. 
There is also the StopWordRemover which is dictionary-based and is configurable 
with respect to types.

Maybe it would be useful to merge all of that into a single AnnotationFilter 
that can do regexes or dictionaries - or maybe even dictionaries of regexes? 
;) For the sake of speed, separate parameters for max length and min length 
might also be useful.

Original comment by richard.eckart on 29 Sep 2011 at 10:19

GoogleCodeExporter commented on June 19, 2024

> I find it a bit strange that the patch looks at the last character of a token
> to decide if a token should be removed or not.

Indeed.

> Maybe it would be useful to merge all of that into a single AnnotationFilter
> which can do regexes or dictionaries - or maybe even dictionaries of regexes?
> ;) For the sake of speed, separate parameters for max length and min length
> might also be useful

Sounds good. Should we aim for the whole thing (lists of dictionaries of 
regexes ;) or better start small?

Original comment by [email protected] on 29 Sep 2011 at 10:31

GoogleCodeExporter commented on June 19, 2024

Changing from the dictionary-based approach to a dictionary of regexes should 
take little more than a parameter (PARAM_REGEX = true) and switching from 
equals() to matches().
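
A minimal sketch of that switch, assuming a plain list of dictionary entries. The class DictionaryMatcherSketch and its contains() method are hypothetical and not part of the StopWordRemover; the boolean constructor argument plays the role of PARAM_REGEX. Precompiling the patterns is an addition here to avoid recompiling each entry per token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class DictionaryMatcherSketch {
    private final List<String> entries;
    private final List<Pattern> patterns = new ArrayList<>(); // filled in regex mode only
    private final boolean regex; // plays the role of PARAM_REGEX

    DictionaryMatcherSketch(List<String> entries, boolean regex) {
        this.entries = entries;
        this.regex = regex;
        if (regex) {
            for (String entry : entries) {
                patterns.add(Pattern.compile(entry));
            }
        }
    }

    /** True if the token is covered by the dictionary in the configured mode. */
    boolean contains(String token) {
        if (regex) {
            // Regex mode: every entry is a pattern that must match the whole token.
            for (Pattern p : patterns) {
                if (p.matcher(token).matches()) {
                    return true;
                }
            }
            return false;
        }
        // Literal mode: plain equals() comparison against every entry.
        for (String entry : entries) {
            if (entry.equals(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList("the", "[.,!?]+");
        DictionaryMatcherSketch literal = new DictionaryMatcherSketch(entries, false);
        DictionaryMatcherSketch asRegex = new DictionaryMatcherSketch(entries, true);
        System.out.println(literal.contains("!!"));
        System.out.println(asRegex.contains("!!"));
    }
}
```

Note that in regex mode every entry is interpreted as a pattern, so literal entries containing metacharacters would need escaping, for example with Pattern.quote().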

What's bugging me more is the question of how to deal with POS, Lemma, Stem and 
so on. Traditionally these are co-located and there was no link in Token to 
refer to them. But this is awkward for programming and bad for performance. So, 
as of recently, the Token has explicit fields to refer to POS and Lemma; I am 
not sure about Stem. It would be possible to override the removeFromIndexes() 
method in Token to automatically also remove the associated POS, Lemma and 
Stem - but this would only work for JCas. The other alternative is the 
mechanism used in the StopWordRemover - it works but is much slower and needs 
more configuration effort. Maybe a combination of both would be good, so that 
CAS-based AEs could use the configurable method (which could become a 
convenience method in uimaFIT CasUtil) and JCas-based AEs could call 
Token.removeFromIndexes() and have Lemma, POS and Stem removed automatically 
in cascade.

What do you think?
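
The cascading variant can be sketched with a toy model, assuming a flat list in place of the CAS indexes. The classes Index, Anno and Token below are hypothetical stand-ins and do not mirror the real UIMA or DKPro types; in particular, the real removeFromIndexes() does not take an index argument.

```java
import java.util.ArrayList;
import java.util.List;

class CascadingRemovalSketch {
    /** Hypothetical stand-in for the CAS indexes: a flat list of annotations. */
    static class Index {
        final List<Anno> items = new ArrayList<>();

        void add(Anno a) {
            items.add(a);
        }

        void remove(Anno a) {
            items.remove(a);
        }
    }

    /** Hypothetical base annotation with a type name and a value. */
    static class Anno {
        final String type;
        final String value;

        Anno(String type, String value) {
            this.type = type;
            this.value = value;
        }

        void removeFromIndexes(Index index) {
            index.remove(this);
        }
    }

    /** Token with explicit links to its POS and Lemma (either may be null). */
    static class Token extends Anno {
        Anno pos;
        Anno lemma;

        Token(String text) {
            super("Token", text);
        }

        @Override
        void removeFromIndexes(Index index) {
            // Cascade: remove the linked annotations before the token itself.
            if (pos != null) {
                pos.removeFromIndexes(index);
            }
            if (lemma != null) {
                lemma.removeFromIndexes(index);
            }
            super.removeFromIndexes(index);
        }
    }

    public static void main(String[] args) {
        Index index = new Index();
        Token period = new Token(".");
        period.pos = new Anno("POS", "PUNC");
        period.lemma = new Anno("Lemma", ".");
        index.add(period);
        index.add(period.pos);
        index.add(period.lemma);

        period.removeFromIndexes(index); // also removes the POS and Lemma
        System.out.println(index.items.size());
    }
}
```

The caller only removes the token; no separate configuration of dependent types is needed, which is the trade-off against the slower but type-agnostic configurable approach.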

Original comment by richard.eckart on 29 Sep 2011 at 11:03

GoogleCodeExporter commented on June 19, 2024

As far as I can see, it would make the component quite dependent on our type 
system, right?
This is not a problem in general, but I would prefer a type-system-agnostic 
component, and maybe additionally a more specialized one.

Original comment by [email protected] on 29 Sep 2011 at 12:25

GoogleCodeExporter commented on June 19, 2024

What do you mean by "more specialized"?

Original comment by richard.eckart on 29 Sep 2011 at 6:50

GoogleCodeExporter commented on June 19, 2024

Original comment by richard.eckart on 8 Feb 2012 at 10:51

Added labels: Milestone-1.4.0

GoogleCodeExporter commented on June 19, 2024

Looks like this issue has been superseded by issue 14 (rename and enhance 
tokenfilter). We won't change the BreakIteratorSegmenter because that would 
imply we also need to change all other tokenizers that we have and may have in 
the future.

Original comment by richard.eckart on 7 Jun 2012 at 3:14

Changed state: WontFix
Added labels: Type-Enhancement
Removed labels: Type-Defect, Milestone-1.4.0
