ftyers / commonvoice-utils
Linguistic processing for Common Voice
License: GNU Affero General Public License v3.0
Would it be possible to add optional "--exclude-xxx fn" flags to exclude recordings during cv export?
--exclude-voices voices.txt // e.g. to measure the effect of a single person recording too much
--exclude-sentences sentences.txt // e.g. to exclude reported sentences
--exclude-gender [male|female|other|empty] // e.g. to train with male voices and test with female voices
etc.
That would make experiments on bias effects much easier.
PS: The right place to implement these would be CorporaCreator, but as you know it is not actively maintained.
Something similar could be implemented for the OPUS corpora.
Bülent
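A minimal sketch of what such exclusion filters could do, assuming the export reads Common Voice TSV rows with client_id, sentence and gender columns. The function name filter_rows and its parameters are hypothetical and not part of commonvoice-utils; an empty gender field is represented by the empty string.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Set

# Tiny in-memory stand-in for a Common Voice TSV file (made-up rows).
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tgender\n"
    "a\t1.mp3\tmerhaba\tmale\n"
    "b\t2.mp3\tgünaydın\tfemale\n"
    "c\t3.mp3\tmerhaba\t\n"
)


def filter_rows(
    rows: Iterable[Dict[str, str]],
    exclude_voices: Set[str] = frozenset(),
    exclude_sentences: Set[str] = frozenset(),
    exclude_genders: Set[str] = frozenset(),
) -> Iterator[Dict[str, str]]:
    """Yield only the rows that match none of the exclusion sets."""
    for row in rows:
        if row["client_id"] in exclude_voices:
            continue
        if row["sentence"] in exclude_sentences:
            continue
        if row["gender"] in exclude_genders:
            continue
        yield row


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
kept = list(filter_rows(rows, exclude_voices={"a"}, exclude_genders={""}))
print([r["path"] for r in kept])
```

The exclusion sets would be loaded from the files passed on the command line, one entry per line.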
Looks like you forgot to commit it.
~$ python3 -m pip install git+https://github.com/ftyers/commonvoice-utils.git
Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/ftyers/commonvoice-utils.git
Cloning https://github.com/ftyers/commonvoice-utils.git to /tmp/pip-req-build-4ptidukg
Running command git clone -q https://github.com/ftyers/commonvoice-utils.git /tmp/pip-req-build-4ptidukg
Building wheels for collected packages: commonvoice-utils
Building wheel for commonvoice-utils (setup.py) ... done
Created wheel for commonvoice-utils: filename=commonvoice_utils-0.2.7-py3-none-any.whl size=142813 sha256=541d42fa2c786d602f4ca04e6f1ad8848a57ded5376f69a19629b1b602577fc7
Stored in directory: /tmp/pip-ephem-wheel-cache-x_86ocb3/wheels/56/67/73/4bf2d8a681334251a44405673d52e767f646121bbd89c8b7fa
Successfully built commonvoice-utils
Installing collected packages: commonvoice-utils
Successfully installed commonvoice-utils-0.2.7
~$ python3 -c "from cvutils import Alphabet"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/selimcan/.local/lib/python3.8/site-packages/cvutils/__init__.py", line 10, in <module>
from transliterator import Transliterator
ModuleNotFoundError: No module named 'transliterator'
I don't think the required package is https://pypi.org/project/transliterator/, is it?
The apostrophe is needed to write Ukrainian, as in ім'я ("name"):
https://en.wiktionary.org/wiki/%D1%96%D0%BC%27%D1%8F#Ukrainian
h/t @robinhad
Currently, commonvoice-utils returns a list of paths if you want to check whether, e.g., segmentation is supported for a language, so you need to scan the list to get a result. For example:
import os

import cvutils as cvu

cv: cvu.CV = cvu.CV()
lc = 'tr'
supported: bool = False
validator: cvu.Validator = cvu.Validator(lc)
tokeniser: cvu.Tokeniser = cvu.Tokeniser(lc)
# Scan the returned paths; the parent directory name is compared with the locale code
for val in cv.validators():
    if lc == os.path.split(os.path.split(val)[0])[1]:
        supported = True
It would be very helpful to have utility functions like cvu.hasValidator(lc) or cvu.hasTokenizer(lc) which do the same out of the box.
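As a rough sketch of what such a helper might look like, the path scan can be wrapped in one function. The name locale_in_paths is hypothetical and not part of cvutils, and the example paths are made up; the only assumption is that each path has the locale code as its parent directory name.

```python
import os
from typing import Iterable


def locale_in_paths(paths: Iterable[str], lc: str) -> bool:
    """Return True if any path's parent directory is named after the locale."""
    return any(os.path.basename(os.path.dirname(p)) == lc for p in paths)


# Made-up paths shaped like .../data/<locale>/validate.tsv
example_paths = [
    "/pkg/cvutils/data/tr/validate.tsv",
    "/pkg/cvutils/data/uk/validate.tsv",
]
print(locale_in_paths(example_paths, "tr"))
print(locale_in_paths(example_paths, "ko"))
```

With the real library, the list returned by cv.validators() would be passed in place of example_paths.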
I may have messed this up: I had some git misfortune, so I deleted the repo and re-forked/re-cloned it.
As a result, the following commit does not seem to have been merged:
https://github.com/HarikalarKutusu/commonvoice-utils/blob/c471900a7591737bd3e451476623e6b414729256/cvutils/data/tr/abbr.tsv
If you cannot access it, I can create another PR for it. Please advise.
Hi. I tried using the g2p tool to phonemise Hindi words, but there were some encoding issues.
from cvutils import Phonemiser
p = Phonemiser('hi')
p.phonemise('अवकाशग्रहण')
At first, the error message was:
UnicodeDecodeError Traceback (most recent call last)
C:\Users\MAGICD~1\AppData\Local\Temp/ipykernel_10860/951158175.py in
1 from cvutils import Phonemiser
----> 2 p = Phonemiser('hi')
3 p.phonemise('अवकाशग्रहण')
~\Anaconda3\lib\site-packages\cvutils\phonemiser.py in __init__(self, lang)
22 print('[Phonemiser] Function not implemented', file=sys.stderr)
23 try:
---> 24 self.validator = Validator(self.lang)
25 except FileNotFoundError:
26 pass
~\Anaconda3\lib\site-packages\cvutils\validator.py in __init__(self, lang)
13 self.nfkd = False
14 try:
---> 15 self.load_data()
16 except FileNotFoundError:
17 print('[Validator] Function not implemented', file=sys.stderr)
~\Anaconda3\lib\site-packages\cvutils\validator.py in load_data(self)
24 self.lower = False
25 data_dir = os.path.abspath(os.path.dirname(__file__)) + '/data/'
---> 26 for line in open(data_dir + self.lang + '/validate.tsv').readlines():
27 if line[0] == '#':
28 continue
UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 50: illegal multibyte sequence
Then I set encoding='utf-8' on line 26 of ~\Anaconda3\lib\site-packages\cvutils\validator.py, but it didn't work. It still failed with:
UnicodeDecodeError Traceback (most recent call last)
...
~\Anaconda3\lib\site-packages\cvutils\validator.py in load_data(self)
24 self.lower = False
25 data_dir = os.path.abspath(os.path.dirname(__file__)) + '/data/'
---> 26 for line in open(data_dir + self.lang + '/validate.tsv',encoding='utf-8').readlines():
27 if line[0] == '#':
28 continue
UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 50: illegal multibyte sequence
Did I do anything wrong? Is there any other way to solve the encoding issue? Thanks!
Do changes in abbr.tsv affect the wiki and OPUS downloaders?
Should I re-create my LMs after this goes live?
Although Korean is not fully enabled on Common Voice yet, it is only 1500 sentences short. Once it is added, we can start using the alphabet/normalization support provided by covo.
Perhaps something like PASS to basically return whatever was input and REPL for removing punctuation.
Another option would be something like CB for checking the Unicode block.
At the moment the alphabet for Chatino includes sequences of numerals for the tone characters. This is an artefact of the original dataset used to generate the data. The official orthography uses superscript uppercase letters.
It should be possible to use Unicode superscript letters and implement the conversion within covo, but first we need a mapping from sequence of numerals → superscript uppercase letter.
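Once such a mapping exists, the conversion itself could be a longest-match substitution. The sketch below is only an illustration: TONE_MAP is a placeholder with invented pairs (the real correspondence is exactly what is still missing), the function name convert_tones is hypothetical, and the test word is not real Chatino.

```python
import re

# PLACEHOLDER mapping for illustration only; the actual numeral-to-letter
# correspondence for Chatino tones is not known here.
TONE_MAP = {
    "1": "\u1d2c",   # MODIFIER LETTER CAPITAL A
    "12": "\u1d2e",  # MODIFIER LETTER CAPITAL B
    "2": "\u1d30",   # MODIFIER LETTER CAPITAL D
}

# Try longer numeral sequences first so "12" is not read as "1" followed by "2".
_PATTERN = re.compile(
    "|".join(sorted((re.escape(k) for k in TONE_MAP), key=len, reverse=True))
)


def convert_tones(text: str) -> str:
    """Replace each known numeral sequence with its superscript letter."""
    return _PATTERN.sub(lambda m: TONE_MAP[m.group(0)], text)


print(convert_tones("ka12 ka1"))
```

Sorting the keys by length matters because tone sequences can share prefixes.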
I'm thinking it would be useful to add digits and yes/no from https://github.com/JRMeyer/common-voice-stats#single-digit-numbers--yes--no
Thoughts?
Hi, thanks for the practical toolkit for CV data preprocessing!
I recently used this toolkit to validate data in different languages, but found that the Validator failed to initialize. After checking the code, I found that initializing Validator requires data/$lang/validate.tsv to be present.
So my questions are: 1) Will the missing data be added soon? and 2) How can I prepare the data/$lang/validate.tsv file from scratch?
Thanks in advance!
At the moment it downloads the file into memory and then syncs to disk. This is not great for big files.
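One possible approach is to copy the response to disk in fixed-size chunks so that only one chunk is held in memory at a time. This is a sketch using only the standard library; the function name download_to_disk and the default chunk size are illustrative and not taken from the existing downloader.

```python
import shutil
import urllib.request


def download_to_disk(url: str, dest: str, chunk_size: int = 1024 * 1024) -> None:
    """Stream the response body to dest, chunk_size bytes at a time."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj reads and writes in a loop, so the whole file
        # is never loaded into memory at once.
        shutil.copyfileobj(response, out, chunk_size)
```

It would be called as download_to_disk(url, "corpus.txt.gz"), with the URL and file name supplied by the caller.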
I would like to be able to count syllables and segment words by syllables, e.g. in the words:
caltlamachtiloyan → cal·tla·mach·til·oy·an
camioneta → ca·mio·ne·ta
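For the counting half only, a very rough first approximation is to count maximal runs of vowel letters. This is a naive sketch and not a real syllabifier: it assumes the five vowels a, e, i, o, u, treats every vowel run as one nucleus, and does not place the boundaries, which needs language-specific rules. The name count_syllables is hypothetical.

```python
import re

# Assumption: a syllable nucleus is any maximal run of these vowel letters.
VOWEL_RUN = re.compile(r"[aeiou]+")


def count_syllables(word: str) -> int:
    """Count vowel runs as a stand-in for the number of syllables."""
    return len(VOWEL_RUN.findall(word.lower()))


for word in ("caltlamachtiloyan", "camioneta"):
    print(word, count_syllables(word))
```

It happens to agree with the two examples above (6 and 4), but words with vowels in hiatus would be undercounted.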
These look like valid Hausa characters, but covo validate ha will either remove them or fail on them:
’
ā
ă
If you are using the code directly in Python, you still get "Function not implemented" errors. I do check for the existing functionality before calling it, but that is not possible when using the Phonemiser class, which in turn calls the Validator.
Here is what I get when analyzing 144 corpora in parallel with a progress bar:
=== Text-Corpora Compilation Process for cv-tbox-dataset-compiler ===
Processing text-corpora for 144 locales in 12 processes with chunk_size 10...
0%| | 0/144 [00:00<?, ?it/s][Validator] Function not implemented
[Validator] Function not implemented
[Validator] Function not implemented
1%|█▏ | 1/144 [00:10<25:29, 10.69s/it][Validator] Function not implemented
8%|█████████████▎ | 11/144 [00:15<02:30, 1.13s/it][Validator] Function not implemented
[Validator] Function not implemented
15%|█████████████████████████▍ | 21/144 [00:28<02:34, 1.25s/it][Validator] Function not implemented
22%|█████████████████████████████████████▍ | 31/144 [00:30<01:25, 1.33it/s][Validator] Function not implemented
31%|██████████████████████████████████████████████████████▍ | 45/144 [03:18<08:47, 5.33s/it][Validator] Function not implemented
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 144/144 [12:34<00:00, 5.24s/it]
Finished compiling text-corpus for 144 locales in 754.62 avg=5.24 sec/locale
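Until the library stops printing these messages, a caller-side workaround might be to redirect stderr around the constructor call. The sketch below uses a stand-in function, noisy_init, instead of the real Phonemiser so that it runs on its own; it assumes the message is written with print(..., file=sys.stderr), as the traceback in the Hindi issue suggests.

```python
import contextlib
import io
import sys


def noisy_init() -> str:
    """Stand-in for a constructor that prints a warning to stderr."""
    print("[Validator] Function not implemented", file=sys.stderr)
    return "ok"


captured = io.StringIO()
# Everything written to sys.stderr inside the block goes to `captured`.
with contextlib.redirect_stderr(captured):
    result = noisy_init()

print(result)
```

redirect_stderr swaps sys.stderr for the whole process, so with multiprocessing each worker would need to apply it around its own calls.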
The alphabet at https://github.com/ftyers/commonvoice-utils/blob/main/cvutils/data/uk/alphabet.txt
should be абвгґдеєжзиіїйклмнопрстуфхцчшщьюя-ʼ.
ы is a Russian letter.
{"Armenian": "hy-AM"},
{"Uigur": "ug"}
Either something like a Thai segmenter or maybe SentencePiece.
Somehow the pip version is 0.2.30 while the GitHub version is still at 0.2.29.
Maybe bump it in the next release?
This happened when I tried to pip install it from the Windows Anaconda command prompt (Windows 10 US English version, Python 3.9 and 3.8 tested).
c:\Users\xxxx> pip install git+https://github.com/ftyers/commonvoice-utils.git
Collecting git+https://github.com/ftyers/commonvoice-utils.git
Cloning https://github.com/ftyers/commonvoice-utils.git to c:\temp1\pip-req-build-06f01lti
Running command git clone -q https://github.com/ftyers/commonvoice-utils.git 'C:\TEMP1\pip-req-build-06f01lti'
Resolved https://github.com/ftyers/commonvoice-utils.git to commit c738e7f8031cd2e1ca83fdbd6dd3e8a5db1ad583
ERROR: Command errored out with exit status 1:
command: 'D:\Anaconda\Anaconda3\envs\p39_c112\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\TEMP1\\pip-req-build-06f01lti\\setup.py'"'"'; __file__='"'"'C:\\TEMP1\\pip-req-build-06f01lti\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\TEMP1\pip-pip-egg-info-mvngstdl'
cwd: C:\TEMP1\pip-req-build-06f01lti\
Complete output (9 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\TEMP1\pip-req-build-06f01lti\setup.py", line 8, in <module>
README = (HERE / "README.md").read_text()
File "D:\Anaconda\Anaconda3\envs\p39_c112\lib\pathlib.py", line 1267, in read_text
return f.read()
File "D:\Anaconda\Anaconda3\envs\p39_c112\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2697: character maps to <undefined>
----------------------------------------
WARNING: Discarding git+https://github.com/ftyers/commonvoice-utils.git. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
It seems setup.py needs to read the README with UTF-8 encoding, but I'm a noob here...
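A sketch of the likely fix, assuming the README is stored as UTF-8: pass the encoding explicitly to read_text so the platform default (cp1252 in the traceback above) is not used. The temporary directory here only stands in for the project directory so that the snippet runs on its own.

```python
import tempfile
from pathlib import Path

# Stand-in for the project directory; setup.py itself would keep its own HERE.
HERE = Path(tempfile.mkdtemp())
(HERE / "README.md").write_text("Common Voice: ім'я, अवकाशग्रहण", encoding="utf-8")

# Explicit encoding makes the result independent of the system locale.
README = (HERE / "README.md").read_text(encoding="utf-8")
print(len(README))
```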
The argument structure of covo is position-based. Therefore it only works on systems where covo can be called directly as:
covo [arguments]
On Windows, if it is not in the path, you need to invoke it through the Python executable, like:
python3 covo [arguments]
This changes the argument indexes and covo fails.
If an argument parser were added, this issue would go away. Currently, I need to open a VM (or use WSL) just to get an alphabet, for example.
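A minimal sketch of what an argparse-based entry point could look like. The argument names command and locale are illustrative assumptions, not the actual covo interface; the point is only that argparse works on the arguments after the script name instead of on hard-coded indexes.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with two illustrative positional arguments."""
    parser = argparse.ArgumentParser(prog="covo")
    parser.add_argument("command", help="sub-command name (illustrative)")
    parser.add_argument("locale", help="locale code, e.g. 'tr'")
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # With argv=None, argparse reads sys.argv[1:] itself.
    return build_parser().parse_args(argv)


args = parse(["alphabet", "tr"])
print(args.command, args.locale)
```

Sub-parsers per command could be added later without changing how callers invoke the tool.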
These languages are on https://open.bible/, but aren't in covo... yet :)
Akuapem Twi
Asante Twi
Chichewa
Ewe
Kikuyu
Lingala
Luo
Sorani Kurdi