Comments (4)
Hi @gtbot2007, and thanks so much for raising this issue. Could I get a bit more detail on what's preventing uncommon letters from being used? Is this a limitation you've found in the guidelines that could be better clarified, or are you getting a bug or error message when trying to input sentences that contain words like "yupekosi", "Pingo" and "kalamARR"?
from common-voice.
I believe at least the last one happens because of the default validation rules defined below, which try to handle abbreviations:
For other cases you need to create language-specific validation rules, like the ones here:
https://github.com/common-voice/common-voice/tree/main/server/src/core/sentences/validation/languages
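To illustrate why a word like "kalamARR" could trip an abbreviation check, here is a minimal sketch. It is not copied from the Common Voice source: the pattern and the function name `looksLikeAbbreviation` are hypothetical, and it assumes an abbreviation is approximated as two or more consecutive uppercase letters.

```typescript
// Hypothetical rule, NOT the actual Common Voice default.
// Assumption: two or more consecutive uppercase ASCII letters
// are treated as an abbreviation.
const ABBREVIATION_PATTERN = /[A-Z]{2,}/;

function looksLikeAbbreviation(sentence: string): boolean {
  return ABBREVIATION_PATTERN.test(sentence);
}

console.log(looksLikeAbbreviation("kalamARR")); // true: "ARR" is three uppercase letters in a row
console.log(looksLikeAbbreviation("Pingo"));    // false: only one uppercase letter
```

Under that assumption, "Pingo" and "yupekosi" would pass this kind of check, so their rejection would have to come from a different rule.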
Since these letters are used in so few words, and those words are also “non-standard”, they are considered foreign letters. That is technically true, but maybe there should be an exception to allow the words “Pingo”, “yupekosi”, “kalamARR”, “yutu” and maybe even “yu” and “y”.
Sorry for the noise, I posted before checking. There IS actually a tok validation file here:
https://github.com/common-voice/common-voice/blob/main/server/src/core/sentences/validation/languages/tok.ts
It seems each of the samples you gave is hitting one of the regex rules there. Either those words are not wanted, or the regex rules might need tweaks.
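As a rough sketch of how a language-specific rule list could reject these samples: the `ValidationRule` shape, the `tokRules` array and the `firstViolation` helper below are hypothetical and are not the contents of `tok.ts`. The only fact assumed about the language is that Toki Pona is normally written with the 14 letters a, e, i, j, k, l, m, n, o, p, s, t, u, w.

```typescript
// Hypothetical rule shape; the real tok.ts may be organized differently.
interface ValidationRule {
  regex: RegExp;
  error: string;
}

// Assumption: any character that is not one of the 14 Toki Pona letters,
// whitespace, or basic punctuation counts as a foreign letter.
const tokRules: ValidationRule[] = [
  {
    regex: /[^aeijklmnopstuw\s.,:!?'"-]/i,
    error: "Sentence contains a letter outside the Toki Pona alphabet",
  },
];

// Returns the error of the first rule the sentence hits, or null if it passes.
function firstViolation(sentence: string, rules: ValidationRule[]): string | null {
  for (const rule of rules) {
    if (rule.regex.test(sentence)) return rule.error;
  }
  return null;
}

for (const sample of ["toki pona li pona", "yupekosi", "Pingo", "kalamARR"]) {
  console.log(sample, "->", firstViolation(sample, tokRules) ?? "ok");
}
```

With a rule like this, "yupekosi" fails on "y", "Pingo" on "g", and "kalamARR" on "R". Allowing those words would mean either widening the character class or adding an explicit allowlist that is checked before the rules run.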