Comments (4)
Decision to be made: for languages written without spaces, the SpaceAfter attribute of the last token of each sentence (e.g. punctuation) cannot be determined in the _get_tokens function.
Two options:
- Use a flag that is passed to the eval/tokenize function by hand: space_after_end_of_sentence = True by default for languages such as EN/DE/RO, and set to False for ZH/JA etc. I suggest we use this method and let users set this optional flag if they need the correct SpaceAfter attribute for the last token.
- Determine it automatically from the eval/tokenize input string, based on whether the string contains spaces. For each tokenize call, before doing anything else, count the whitespace characters; if the ratio whitespace/(total number of characters) < 0.02 (for example), or if the input string contains no spaces at all, set space_after_end_of_sentence = False, on the assumption that the language does not use whitespace (ZH/JA/etc.). This adds some overhead to each tokenize call, and it can also fail by wrongly assuming a no-space language when one particular sentence simply has no spaces (e.g. the sentence "(n.1948-d.2005)." <- yes, that happens).
I implemented option 1, but we can switch to option 2 easily.
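For comparison, here is a minimal sketch of how the two options could fit together. It is not the actual nlp-cube code: the function name `resolve_space_after_end_of_sentence` and the `min_ratio` parameter are hypothetical, and only the flag name and the 0.02 example threshold come from the description above. An explicit flag (option 1) wins when given; otherwise the whitespace ratio (option 2) decides.

```python
from typing import Optional


def resolve_space_after_end_of_sentence(
    text: str,
    space_after_end_of_sentence: Optional[bool] = None,
    min_ratio: float = 0.02,  # example threshold from the discussion, not a tuned value
) -> bool:
    """Decide whether the last token of a sentence should get SpaceAfter.

    Hypothetical helper, not part of nlp-cube.
    """
    # Option 1: the caller set the flag by hand, so trust it.
    if space_after_end_of_sentence is not None:
        return space_after_end_of_sentence

    # Nothing to measure: fall back to the default used for EN/DE/RO etc.
    if not text:
        return True

    # Option 2: guess from the share of whitespace characters in the input.
    whitespace_count = sum(1 for ch in text if ch.isspace())
    return whitespace_count / len(text) >= min_ratio


if __name__ == "__main__":
    print(resolve_space_after_end_of_sentence("This is a test."))
    print(resolve_space_after_end_of_sentence("我喜欢自然语言处理。"))
    # The failure case mentioned above: no spaces, but not a no-space language.
    print(resolve_space_after_end_of_sentence("(n.1948-d.2005)."))
    # Passing the flag by hand overrides the guess.
    print(resolve_space_after_end_of_sentence("(n.1948-d.2005).", space_after_end_of_sentence=True))
```

The sketch also shows the weakness of option 2: a short input without spaces is classified as a no-space language unless the caller passes the flag explicitly.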
from nlp-cube.
Can we just send the SpaceAfter information from the SentenceSplitter?
from nlp-cube.
Added automatic detection of SpaceAfter for the last token of a sentence.
from nlp-cube.
Thank you!
from nlp-cube.