Comments (9)
A patch for this issue.
Original comment by [email protected]
on 13 Jun 2011 at 6:41
from dkpro-core-asl.
I have had a look at the patch and it seems to work for German or English, but
I am not quite sure what the side effects will be for other languages.
Additionally, the opinions about what should be considered as punctuation might
differ.
Thus, I would rather suggest implementing this functionality as a subsequent
filtering step that removes all annotations of a given type (the type would be
a parameter) whose text matches some pattern (another parameter).
Original comment by [email protected]
on 29 Sep 2011 at 10:09
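The filtering step suggested in the comment above could be sketched roughly as follows. This is a minimal, type-system-agnostic illustration in plain Java, not code from the patch or from DKPro Core: `SimpleAnnotation` and `PatternAnnotationFilter` are hypothetical names, and the annotation type is modelled as a plain string rather than a UIMA type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for an annotation; not a UIMA or DKPro class. */
final class SimpleAnnotation {
    final String type; // annotation type name, e.g. "Token"
    final String text; // covered text

    SimpleAnnotation(String type, String text) {
        this.type = type;
        this.text = text;
    }
}

/** Removes annotations of one type whose covered text matches a pattern. */
final class PatternAnnotationFilter {
    private final String typeName; // parameter 1: which annotation type to filter
    private final Pattern pattern; // parameter 2: what counts as removable

    PatternAnnotationFilter(String typeName, String regex) {
        this.typeName = typeName;
        this.pattern = Pattern.compile(regex);
    }

    /** Returns a new list without the matching annotations; the input is not modified. */
    List<SimpleAnnotation> apply(List<SimpleAnnotation> annotations) {
        List<SimpleAnnotation> kept = new ArrayList<>();
        for (SimpleAnnotation a : annotations) {
            // Only annotations of the configured type are candidates for removal.
            boolean remove = a.type.equals(typeName) && pattern.matcher(a.text).matches();
            if (!remove) {
                kept.add(a);
            }
        }
        return kept;
    }
}
```

With `new PatternAnnotationFilter("Token", "\\p{Punct}+")`, tokens consisting only of ASCII punctuation are dropped while other annotation types are left alone. Because the pattern is a parameter, a different notion of punctuation (for example the Unicode category `\\p{P}`) can be configured per language without touching the segmenter.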
I find it a bit strange that the patch looks at the last character of a token
to decide if a token should be removed or not.
In terms of alternatives, we have a TokenFilter in the tokit module that
currently filters out tokens that are too long. That, however, does not handle
attached POS or Lemma annotations.
There is also the StopWordRemover which is dictionary-based and is configurable
with respect to types.
Maybe it would be useful to merge all of that into a single AnnotationFilter
which can do regexes or dictionaries - or maybe even dictionaries of regexes?
;) For the sake of speed, separate parameters for max length and min length
might also be useful.
Original comment by richard.eckart
on 29 Sep 2011 at 10:19
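A merged filter along the lines of the comment above might look like the sketch below. It is only an illustration under stated assumptions: `TokenTextFilter`, its constructor parameters, and the `useRegex` flag are hypothetical and do not correspond to the existing TokenFilter or StopWordRemover components. Length is measured with `String.length()`, that is, in UTF-16 code units.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical combined filter: length bounds plus a dictionary of literals or regexes. */
final class TokenTextFilter {
    private final int minLength;
    private final int maxLength;
    private final boolean useRegex;
    private final Set<String> literals;   // used when useRegex is false
    private final List<Pattern> patterns; // used when useRegex is true

    TokenTextFilter(int minLength, int maxLength, List<String> dictionary, boolean useRegex) {
        this.minLength = minLength;
        this.maxLength = maxLength;
        this.useRegex = useRegex;
        this.literals = new HashSet<>(dictionary);
        this.patterns = new ArrayList<>();
        if (useRegex) {
            // Compile each dictionary entry once instead of on every lookup.
            for (String entry : dictionary) {
                patterns.add(Pattern.compile(entry));
            }
        }
    }

    /** Returns true if a token with this text should be removed. */
    boolean shouldRemove(String tokenText) {
        int length = tokenText.length();
        // The cheap length checks run first, so most tokens never reach the regexes.
        if (length < minLength || length > maxLength) {
            return true;
        }
        if (useRegex) {
            for (Pattern p : patterns) {
                if (p.matcher(tokenText).matches()) {
                    return true;
                }
            }
            return false;
        }
        return literals.contains(tokenText);
    }
}
```

Keeping the length bounds as separate integer parameters avoids expressing them as regexes such as `.{6,}`, which is the speed argument made in the comment.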
> I find it a bit strange that the patch looks at the last character of a token
> to decide if a token should be removed or not.
Indeed.
> Maybe it would be useful to merge all of that into a single AnnotationFilter
> which can do regexes or dictionaries - or maybe even dictionaries of regexes?
> ;) For sake of speed separate parameters for max length and min length might
> also be useful
Sounds good. Do we aim at the whole thing (lists of dictionaries of regexes ;)
or better start small?
Original comment by [email protected]
on 29 Sep 2011 at 10:31
Changing from the dictionary-based approach to dictionary-of-regexes should
take little more than adding a parameter (PARAM_REGEX = true) and switching
from equals() to matches().
What's bugging me more is the question of how to deal with POS, Lemma, Stem and
so on. Traditionally these are co-located with the Token, and there was no link
in Token to refer to them. But this is awkward for programming and bad for
performance. So recently the Token got explicit fields to refer to POS and
Lemma; I am not sure about Stem. It would be possible to override the
removeFromIndexes() method in Token to automatically also remove the associated
POS, Lemma, and Stem, but this would only work for JCas. The other alternative
is the mechanism used in the StopWordRemover: it works, but it is much slower
and needs more configuration effort. Maybe a combination of both would be good,
so that CAS-based AEs could use the configurable method (which could become a
convenience method in uimaFIT CasUtil) and JCas-based AEs could call
Token.removeFromIndexes(), with Lemma, POS and Stem removed automatically in a
cascade.
What do you think?
Original comment by richard.eckart
on 29 Sep 2011 at 11:03
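The cascading removal described in the comment above can be illustrated with a small plain-Java sketch. It deliberately does not use the UIMA or DKPro classes: `SketchIndex`, `SketchAnnotation`, and `SketchToken` are hypothetical stand-ins, and the index is passed to `removeFromIndexes` explicitly only to keep the example self-contained.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for an annotation index; not the UIMA CAS API. */
final class SketchIndex {
    private final Set<SketchAnnotation> members = new LinkedHashSet<>();

    void add(SketchAnnotation a) { members.add(a); }
    void remove(SketchAnnotation a) { members.remove(a); }
    boolean contains(SketchAnnotation a) { return members.contains(a); }
    int size() { return members.size(); }
}

/** Hypothetical base annotation that only knows how to remove itself. */
class SketchAnnotation {
    final String label;

    SketchAnnotation(String label) { this.label = label; }

    void removeFromIndexes(SketchIndex index) {
        index.remove(this);
    }
}

/** Hypothetical token with explicit links to its POS, Lemma, and Stem. */
final class SketchToken extends SketchAnnotation {
    SketchAnnotation pos;   // may be null if no tagger has run
    SketchAnnotation lemma; // may be null
    SketchAnnotation stem;  // may be null

    SketchToken(String text) { super(text); }

    @Override
    void removeFromIndexes(SketchIndex index) {
        // Cascade: remove the linked annotations first, then the token itself.
        if (pos != null) pos.removeFromIndexes(index);
        if (lemma != null) lemma.removeFromIndexes(index);
        if (stem != null) stem.removeFromIndexes(index);
        super.removeFromIndexes(index);
    }
}
```

The cascade only works for callers that hold the concrete token class, which mirrors the JCas-only limitation noted in the comment; code that works on the generic level would still need a configurable mechanism like the one in the StopWordRemover.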
As far as I can see, it would make the component quite dependent on our type
system, right?
This is not a problem in general, but I would prefer a type system agnostic
component, and maybe additionally a more specialized one.
Original comment by [email protected]
on 29 Sep 2011 at 12:25
What do you mean by "more specialized"?
Original comment by richard.eckart
on 29 Sep 2011 at 6:50
Original comment by richard.eckart
on 8 Feb 2012 at 10:51
- Added labels: Milestone-1.4.0
Looks like this issue has been superseded by issue 14 (rename and enhance
tokenfilter). We won't change the BreakIteratorSegmenter because that would
imply we also need to change all other tokenizers that we have and may have in
the future.
Original comment by richard.eckart
on 7 Jun 2012 at 3:14
- Changed state: WontFix
- Added labels: Type-Enhancement
- Removed labels: Type-Defect, Milestone-1.4.0