ftyers / commonvoice-utils



Linguistic processing for Common Voice

License: GNU Affero General Public License v3.0

Python 100.00%

common-voice-languages grapheme alphabet language phoneme asr g2p

commonvoice-utils's Issues

Adding feature to exclude group of information during export

Is it possible to implement optional "--exclude-xxx fn" flags to exclude recordings during cv export?

--exclude-voices voices.txt            // E.g. to measure the effect of a single person recording too much
--exclude-sentences sentences.txt             // E.g. to exclude reported sentences
--exclude-gender [male|female|other|empty]             // E.g. to train with male voices and test with female voices
etc

That would make experiments on bias effects much easier.
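
As a rough illustration of the requested behaviour, the sketch below filters rows of a Common Voice style TSV by voice, sentence and gender. It is not part of covo or CorporaCreator: the function name `filter_rows` and the sample data are hypothetical, and the column names (`client_id`, `sentence`, `gender`) are assumed to match the Common Voice release files.

```python
import csv
import io


def filter_rows(rows, exclude_voices=(), exclude_sentences=(), exclude_genders=()):
    """Return the rows that match none of the exclusion sets.

    Hypothetical helper: an empty gender field is matched by the string "".
    """
    voices = set(exclude_voices)
    sentences = set(exclude_sentences)
    genders = set(exclude_genders)
    kept = []
    for row in rows:
        if row["client_id"] in voices:
            continue
        if row["sentence"] in sentences:
            continue
        if row.get("gender", "") in genders:
            continue
        kept.append(row)
    return kept


# Made-up sample in the assumed column layout
SAMPLE = "\n".join([
    "client_id\tpath\tsentence\tgender",
    "a1\t1.mp3\tMerhaba\tmale",
    "b2\t2.mp3\tGünaydın\tfemale",
    "c3\t3.mp3\tTeşekkürler\t",
    "a1\t4.mp3\tİyi akşamlar\tmale",
]) + "\n"

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
without_a1 = filter_rows(rows, exclude_voices={"a1"})
only_labelled = filter_rows(rows, exclude_genders={""})
```

The exclusion lists would be read from the files passed to the proposed flags, one entry per line.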

PS: The correct place to implement these would be CorporaCreator, but as you know it is not actively maintained.

Something similar could be implemented for OPUS corpora.

Bülent

Transliterator module missing

It looks like you forgot to commit it.

~$ python3 -m  pip install git+https://github.com/ftyers/commonvoice-utils.git
Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/ftyers/commonvoice-utils.git
  Cloning https://github.com/ftyers/commonvoice-utils.git to /tmp/pip-req-build-4ptidukg
  Running command git clone -q https://github.com/ftyers/commonvoice-utils.git /tmp/pip-req-build-4ptidukg
Building wheels for collected packages: commonvoice-utils
  Building wheel for commonvoice-utils (setup.py) ... done
  Created wheel for commonvoice-utils: filename=commonvoice_utils-0.2.7-py3-none-any.whl size=142813 sha256=541d42fa2c786d602f4ca04e6f1ad8848a57ded5376f69a19629b1b602577fc7
  Stored in directory: /tmp/pip-ephem-wheel-cache-x_86ocb3/wheels/56/67/73/4bf2d8a681334251a44405673d52e767f646121bbd89c8b7fa
Successfully built commonvoice-utils
Installing collected packages: commonvoice-utils
Successfully installed commonvoice-utils-0.2.7
~$ python3 -c "from cvutils import Alphabet"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/selimcan/.local/lib/python3.8/site-packages/cvutils/__init__.py", line 10, in <module>
    from transliterator import Transliterator
ModuleNotFoundError: No module named 'transliterator'

I don't think the required module is this package: https://pypi.org/project/transliterator/

Ukrainian needs apostrophe

The apostrophe is needed to write Ukrainian, as in ім'я ("name").

https://en.wiktionary.org/wiki/%D1%96%D0%BC%27%D1%8F#Ukrainian

h/t @robinhad

Feature Request: Please add "hasValidator" etc

Currently, if you want to check whether a feature such as segmentation is supported for a language, commonvoice-utils returns a list of paths, so you need to scan the list to get a result. For example:

import os

import cvutils as cvu

cv: cvu.CV = cvu.CV()

lc = 'tr'
supported: bool = False

validator: cvu.Validator = cvu.Validator(lc)
tokeniser: cvu.Tokeniser = cvu.Tokeniser(lc)

# cv.validators() returns file paths; the locale is the name of the parent directory
for val in cv.validators():
    if lc == os.path.split(os.path.split(val)[0])[1]:
        supported = True

It would be very helpful to have utility functions like cvu.hasValidator(lc) or cvu.hasTokenizer(lc) that do the same out of the box.
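
A minimal sketch of what such a helper could look like, assuming only what the snippet above shows: the library hands back a list of file paths whose parent directory is the locale code. The name `has_resource` and the example paths are hypothetical, not part of cvutils.

```python
import os


def has_resource(paths, locale):
    """Return True if any path sits in a directory named after the locale."""
    return any(os.path.basename(os.path.dirname(p)) == locale for p in paths)


# Hypothetical paths in the layout the loop above relies on
example_paths = [
    "/opt/cvutils/data/tr/validate.tsv",
    "/opt/cvutils/data/cy/validate.tsv",
]

tr_supported = has_resource(example_paths, "tr")
ko_supported = has_resource(example_paths, "ko")
```

With the real library, the list returned by cv.validators() would be passed in place of `example_paths`.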

Turkish abbr not merged

I might have messed this up: I had some trouble with git, so I deleted my fork and then re-forked and re-cloned the repo.

As a result, the following commit does not seem to have been merged:
https://github.com/HarikalarKutusu/commonvoice-utils/blob/c471900a7591737bd3e451476623e6b414729256/cvutils/data/tr/abbr.tsv

If you cannot access it, I can create another PR for it. Please advise.

hindi encoding issue

Hi. I tried using the g2p tool to phonemise Hindi words, but ran into some encoding issues.

from cvutils import Phonemiser
p = Phonemiser('hi')
p.phonemise('अवकाशग्रहण')

At first, the error message was:

UnicodeDecodeError Traceback (most recent call last)
C:\Users\MAGICD~1\AppData\Local\Temp/ipykernel_10860/951158175.py in <module>
1 from cvutils import Phonemiser
----> 2 p = Phonemiser('hi')
3 p.phonemise('अवकाशग्रहण')

~\Anaconda3\lib\site-packages\cvutils\phonemiser.py in __init__(self, lang)
22 print('[Phonemiser] Function not implemented', file=sys.stderr)
23 try:
---> 24 self.validator = Validator(self.lang)
25 except FileNotFoundError:
26 pass

~\Anaconda3\lib\site-packages\cvutils\validator.py in __init__(self, lang)
13 self.nfkd = False
14 try:
---> 15 self.load_data()
16 except FileNotFoundError:
17 print('[Validator] Function not implemented', file=sys.stderr)

~\Anaconda3\lib\site-packages\cvutils\validator.py in load_data(self)
24 self.lower = False
25 data_dir = os.path.abspath(os.path.dirname(__file__)) + '/data/'
---> 26 for line in open(data_dir + self.lang + '/validate.tsv').readlines():
27 if line[0] == '#':
28 continue

UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 50: illegal multibyte sequence

Then I set encoding='utf-8' on line 26 of '~\Anaconda3\lib\site-packages\cvutils\validator.py', but it didn't work. I still got:

UnicodeDecodeError Traceback (most recent call last)
...
~\Anaconda3\lib\site-packages\cvutils\validator.py in load_data(self)
24 self.lower = False
25 data_dir = os.path.abspath(os.path.dirname(__file__)) + '/data/'
---> 26 for line in open(data_dir + self.lang + '/validate.tsv',encoding='utf-8').readlines():
27 if line[0] == '#':
28 continue

UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 50: illegal multibyte sequence

Did I do anything wrong? Is there another way to solve the encoding issue? Thanks!
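
For reference, a minimal sketch of reading a UTF-8 TSV with an explicit encoding, so the result does not depend on the system locale (GBK in the traceback above). The helper name `read_tsv_rows` and the sample file contents are made up for the example and do not reflect the real validate.tsv format. One possible explanation for the unchanged error, which the thread does not confirm, is that the running Jupyter kernel still held the previously imported module: tracebacks show source text read from disk, so an edited line can appear even when the old code is what ran. Restarting the kernel after editing would rule that out.

```python
import os
import tempfile


def read_tsv_rows(path):
    """Read a tab-separated file as UTF-8, skipping lines that start with '#'."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            rows.append(line.rstrip("\n").split("\t"))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "validate.tsv")
    # Made-up content, written as UTF-8 regardless of the platform default
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("# comment line\nअ\tआ\n")
    rows = read_tsv_rows(sample)
```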

Question: Effect of changes in abbr.tsv

Do changes in abbr.tsv affect the wiki and OPUS downloaders?
Should I re-create my LMs after this goes live?

Please add Korean support

Although Korean is not fully enabled on Common Voice yet, it only lacks 1500 sentences. If added, we can start using alphabet/normalization support provided by covo.

Add a method of checking CJK

Perhaps something like PASS to basically return whatever was input and REPL for removing punctuation.

Another option would be something like CB for check Unicode Block.
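
To make the "check Unicode block" idea concrete, here is a small sketch that tests code points against a few block ranges. The names `unicode_block` and `all_in_blocks` are hypothetical, the block list is deliberately incomplete, and nothing here reflects how covo's rule format actually works.

```python
# A few Unicode blocks relevant to CJK text (not exhaustive)
CJK_BLOCKS = {
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
    "Hangul Syllables": (0xAC00, 0xD7AF),
}


def unicode_block(ch):
    """Return the name of the listed block containing ch, or None."""
    code_point = ord(ch)
    for name, (first, last) in CJK_BLOCKS.items():
        if first <= code_point <= last:
            return name
    return None


def all_in_blocks(text):
    """True if every non-space character falls inside one of the listed blocks."""
    return all(unicode_block(ch) is not None for ch in text if not ch.isspace())
```

A rule such as the proposed CB could then accept a sentence only when `all_in_blocks` holds after punctuation has been removed.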

Chatino tones are written with superscript letters

At the moment the alphabet for Chatino includes sequences of numerals for the tone characters. This is an artefact of the original dataset used to generate the data. The official orthography uses superscript uppercase letters.

It should be possible to use Unicode superscript letters and implement the conversion within covo, but first we need a mapping from sequence of numerals → superscript uppercase letter.

Add single digits and yes / no?

thinking it would be useful to add digits and yes / no from https://github.com/JRMeyer/common-voice-stats#single-digit-numbers--yes--no

thoughts?

on the Validator

Hi, thanks for the practical toolkit for CV data preprocessing!

I recently used this toolkit to validate data for several languages, but found that the Validator failed to initialize. After checking the code, I found that initializing the Validator requires data/$lang/validate.tsv to be present.

So my questions are: 1) Will the missing data be added soon? and 2) How do I prepare the data/$lang/validate.tsv file from scratch?

Thanks in advance!

Get checkpoint functionality should stream file to disk not memory

At the moment it downloads the whole file into memory and then writes it to disk. This is not great for big files.
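
A minimal sketch of the streaming approach using only the standard library: the response is copied to disk in fixed-size chunks, so memory use stays bounded by the chunk size. The function names are hypothetical and this is not the project's current code.

```python
import io
import os
import shutil
import tempfile
import urllib.request


def stream_to_disk(source, dest_path, chunk_size=1024 * 1024):
    """Copy a binary file-like object to dest_path one chunk at a time."""
    with open(dest_path, "wb") as out:
        shutil.copyfileobj(source, out, chunk_size)


def download(url, dest_path):
    """Stream the body of url to dest_path without holding it all in memory."""
    with urllib.request.urlopen(url) as response:
        stream_to_disk(response, dest_path)


# Offline demonstration with an in-memory source standing in for a response
payload = b"checkpoint-bytes" * 1000
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "model.ckpt")
    stream_to_disk(io.BytesIO(payload), target, chunk_size=4096)
    with open(target, "rb") as handle:
        written = handle.read()
```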

Japanese support

Phon
Valid
Alphabet
Segment

Feature request: Add syllabification support

I would like to be able to count syllables and segment words by syllables, e.g. in the word:

caltlamachtiloyan → cal·tla·mach·til·oy·an
camioneta → ca·mio·ne·ta

missing Hausa characters

These look like valid Hausa characters, but covo validate ha will either remove them or fail on them:

’
ā
ă

[FR] Make "Function not implemented" errors only valid for "covo"

If you use the code directly from Python, you still get "Function not implemented" messages. I check for existing functionality before calling it, but that is not possible when using the Phonemiser class, because it calls the Validator internally.

Here is what I get when analyzing 144 corpora in parallel with a progress bar:

=== Text-Corpora Compilation Process for cv-tbox-dataset-compiler ===
Processing text-corpora for 144 locales in 12 processes with chunk_size 10...

  0%|                                                                                                                                                                                       | 0/144 [00:00<?, ?it/s][Validator] Function not implemented
[Validator] Function not implemented
[Validator] Function not implemented
  1%|█▏                                                                                                                                                                             | 1/144 [00:10<25:29, 10.69s/it][Validator] Function not implemented
  8%|█████████████▎                                                                                                                                                                | 11/144 [00:15<02:30,  1.13s/it][Validator] Function not implemented
[Validator] Function not implemented
 15%|█████████████████████████▍                                                                                                                                                    | 21/144 [00:28<02:34,  1.25s/it][Validator] Function not implemented
 22%|█████████████████████████████████████▍                                                                                                                                        | 31/144 [00:30<01:25,  1.33it/s][Validator] Function not implemented
 31%|██████████████████████████████████████████████████████▍                                                                                                                       | 45/144 [03:18<08:47,  5.33s/it][Validator] Function not implemented
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 144/144 [12:34<00:00,  5.24s/it]
Finished compiling text-corpus for 144 locales in 754.62 avg=5.24 sec/locale
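
Until something like this is changed in the library, one caller-side workaround is to discard stderr output around the noisy call. The sketch below is an assumption-laden illustration: `call_quietly` is a hypothetical helper and `noisy_constructor` is a stand-in for a class that prints the notice. Since progress bars are often drawn on stderr as well, the wrapper should enclose only the constructor call, not the whole loop.

```python
import contextlib
import io
import sys


def call_quietly(func, *args, **kwargs):
    """Run func while discarding anything it writes to sys.stderr."""
    with contextlib.redirect_stderr(io.StringIO()):
        return func(*args, **kwargs)


def noisy_constructor(lang):
    # Stand-in for a constructor that prints a notice, as in the log above
    print("[Validator] Function not implemented", file=sys.stderr)
    return lang


# Capture the outer stderr to show that nothing leaks through
outer = io.StringIO()
with contextlib.redirect_stderr(outer):
    result = call_quietly(noisy_constructor, "tr")
leaked = outer.getvalue()
```

`contextlib.redirect_stderr` swaps `sys.stderr` for the whole process, so it suits worker processes better than threads.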

Incorrect alphabet for Ukrainian

The alphabet at https://github.com/ftyers/commonvoice-utils/blob/main/cvutils/data/uk/alphabet.txt
should be абвгґдеєжзиіїйклмнопрстуфхцчшщьюя-ʼ.
ы is a Russian letter, not a Ukrainian one.
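
A quick way to check text against the proposed alphabet is sketched below. `out_of_alphabet` is a hypothetical helper, not a cvutils function. Note that the proposed alphabet uses the modifier letter apostrophe ʼ (U+02BC), so an ASCII apostrophe is reported as out of alphabet.

```python
# Alphabet string as proposed in the issue above
UK_ALPHABET = set("абвгґдеєжзиіїйклмнопрстуфхцчшщьюя-ʼ")


def out_of_alphabet(text, alphabet=UK_ALPHABET):
    """Return the sorted, lower-cased characters of text not in the alphabet."""
    return sorted(
        {ch for ch in text.lower() if not ch.isspace() and ch not in alphabet}
    )


russian_letter = out_of_alphabet("сыр")
modifier_apostrophe = out_of_alphabet("ім\u02bcя")
ascii_apostrophe = out_of_alphabet("ім'я")
```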

Armenian and Uigur were recently added and are not yet in the table

{"Armenian": "hy-AM"},
{"Uigur": "ug"}

Come up with a sensible method of processing Thai

Either something like a Thai segmenter or maybe SentencePiece.

pip-github version mismatch

Somehow the pip version is 0.2.30 while the GitHub version is still at 0.2.29.

Maybe bump it in the next release?

Issue with encoding during setup in Windows

This happened when I tried to pip install it in the Windows Anaconda cmd (Windows 10 US English version; tested with Python 3.9 and 3.8).

c:\Users\xxxx> pip install git+https://github.com/ftyers/commonvoice-utils.git
Collecting git+https://github.com/ftyers/commonvoice-utils.git
  Cloning https://github.com/ftyers/commonvoice-utils.git to c:\temp1\pip-req-build-06f01lti
  Running command git clone -q https://github.com/ftyers/commonvoice-utils.git 'C:\TEMP1\pip-req-build-06f01lti'
  Resolved https://github.com/ftyers/commonvoice-utils.git to commit c738e7f8031cd2e1ca83fdbd6dd3e8a5db1ad583
    ERROR: Command errored out with exit status 1:
     command: 'D:\Anaconda\Anaconda3\envs\p39_c112\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\TEMP1\\pip-req-build-06f01lti\\setup.py'"'"'; __file__='"'"'C:\\TEMP1\\pip-req-build-06f01lti\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\TEMP1\pip-pip-egg-info-mvngstdl'
         cwd: C:\TEMP1\pip-req-build-06f01lti\
    Complete output (9 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\TEMP1\pip-req-build-06f01lti\setup.py", line 8, in <module>
        README = (HERE / "README.md").read_text()
      File "D:\Anaconda\Anaconda3\envs\p39_c112\lib\pathlib.py", line 1267, in read_text
        return f.read()
      File "D:\Anaconda\Anaconda3\envs\p39_c112\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2697: character maps to <undefined>
    ----------------------------------------
WARNING: Discarding git+https://github.com/ftyers/commonvoice-utils.git. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

It seems setup.py needs to read the file as UTF-8, but I'm a noob here...
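
The traceback points at `read_text()` being called without an encoding, which makes Python fall back to the platform default (cp1252 here). A minimal sketch of the likely fix is to pass the encoding explicitly. The helper name `read_readme` and the sample README text are made up for the example, and the real setup.py may differ in other respects.

```python
import pathlib
import tempfile


def read_readme(directory):
    """Read README.md as UTF-8 regardless of the platform's default encoding."""
    return (pathlib.Path(directory) / "README.md").read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    # Made-up README containing non-ASCII text
    (pathlib.Path(tmp) / "README.md").write_text(
        "Languages: Türkçe, Cymraeg, Українська\n", encoding="utf-8"
    )
    readme = read_readme(tmp)
```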

Import KPS' Irish to IPA

https://github.com/kscanne/filiocht

covo is not portable

The argument structure of covo is position-based. Therefore it only works on systems where covo is called directly as:

covo [arguments]

On Windows, if covo is not on the path, you need to invoke it through the Python executable, like:

python3 covo [arguments]

This changes the argument indexes and covo fails.

If an argument parser were added, this would no longer be an issue. Currently, I need to open a VM (or use WSL) just to get an alphabet, for example.
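
A minimal sketch of what an argparse-based entry point could look like. The argument names `command` and `locale` are assumptions made for the example and are not covo's actual interface. The relevant property is that the parser interprets arguments by their declared roles instead of by hard-coded indexes.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical 'covo <command> <locale>' interface."""
    parser = argparse.ArgumentParser(
        prog="covo", description="Sketch of an argparse-based entry point"
    )
    parser.add_argument("command", help="operation to run, e.g. alphabet or validate")
    parser.add_argument("locale", help="language code, e.g. tr")
    return parser


# Explicit list for demonstration; parse_args() with no argument reads sys.argv[1:]
args = build_parser().parse_args(["alphabet", "tr"])
```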

Feature Request: Add Validators for languages in Biblica data

these languages are on https://open.bible/, but aren't in covo... yet:)

Akuapem Twi
Asante Twi
Chichewa
Ewe
Kikuyu
Lingala
Luo
Sorani Kurdi
