commonvoice-utils's People

Contributors

ccoreilly, ftyers, harikalarkutusu, jrmeyer, kudanai, lucarinelli, stefangrotz, wenjie-p, zuazo

commonvoice-utils's Issues

Issue with encoding during setup in Windows

This happens when I try to pip install it from the Anaconda command prompt on Windows (Windows 10, US English version; tested with Python 3.9 and 3.8).

c:\Users\xxxx> pip install git+https://github.com/ftyers/commonvoice-utils.git
Collecting git+https://github.com/ftyers/commonvoice-utils.git
  Cloning https://github.com/ftyers/commonvoice-utils.git to c:\temp1\pip-req-build-06f01lti
  Running command git clone -q https://github.com/ftyers/commonvoice-utils.git 'C:\TEMP1\pip-req-build-06f01lti'
  Resolved https://github.com/ftyers/commonvoice-utils.git to commit c738e7f8031cd2e1ca83fdbd6dd3e8a5db1ad583
    ERROR: Command errored out with exit status 1:
     command: 'D:\Anaconda\Anaconda3\envs\p39_c112\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\TEMP1\\pip-req-build-06f01lti\\setup.py'"'"'; __file__='"'"'C:\\TEMP1\\pip-req-build-06f01lti\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\TEMP1\pip-pip-egg-info-mvngstdl'
         cwd: C:\TEMP1\pip-req-build-06f01lti\
    Complete output (9 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\TEMP1\pip-req-build-06f01lti\setup.py", line 8, in <module>
        README = (HERE / "README.md").read_text()
      File "D:\Anaconda\Anaconda3\envs\p39_c112\lib\pathlib.py", line 1267, in read_text
        return f.read()
      File "D:\Anaconda\Anaconda3\envs\p39_c112\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2697: character maps to <undefined>
    ----------------------------------------
WARNING: Discarding git+https://github.com/ftyers/commonvoice-utils.git. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

It seems setup.py needs to read README.md with the encoding set to UTF-8, but I'm new to this...
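The traceback shows `read_text()` being called without an encoding, so Python falls back to the locale code page (cp1252 here). A minimal sketch of the likely fix, assuming README.md is stored as UTF-8; the temporary file below is a stand-in, not the project's actual README:

```python
import tempfile
from pathlib import Path

# Stand-in for the project's README.md (assumption: the real file is UTF-8).
here = Path(tempfile.mkdtemp())
readme_path = here / "README.md"
readme_path.write_bytes("Hausa: ā ă".encode("utf-8"))

# Passing the encoding explicitly makes the result independent of the
# platform's default code page.
readme = readme_path.read_text(encoding="utf-8")
```

In setup.py this would amount to changing the `read_text()` call on line 8 to `read_text(encoding="utf-8")`.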

[FR] Make "Function not implemented" errors only valid for "covo"

When using the code directly from Python, you still get "Function not implemented" errors. I normally check whether a feature exists before calling it, but that is not possible here: I am using the Phonemiser class, which in turn calls the Validator.

Here is what I get when analyzing 144 corpora in parallel with a progress bar:

=== Text-Corpora Compilation Process for cv-tbox-dataset-compiler ===
Processing text-corpora for 144 locales in 12 processes with chunk_size 10...

  0%|                                                                                                                                                                                       | 0/144 [00:00<?, ?it/s][Validator] Function not implemented
[Validator] Function not implemented
[Validator] Function not implemented
  1%|█▏                                                                                                                                                                             | 1/144 [00:10<25:29, 10.69s/it][Validator] Function not implemented
  8%|█████████████▎                                                                                                                                                                | 11/144 [00:15<02:30,  1.13s/it][Validator] Function not implemented
[Validator] Function not implemented
 15%|█████████████████████████▍                                                                                                                                                    | 21/144 [00:28<02:34,  1.25s/it][Validator] Function not implemented
 22%|█████████████████████████████████████▍                                                                                                                                        | 31/144 [00:30<01:25,  1.33it/s][Validator] Function not implemented
 31%|██████████████████████████████████████████████████████▍                                                                                                                       | 45/144 [03:18<08:47,  5.33s/it][Validator] Function not implemented
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 144/144 [12:34<00:00,  5.24s/it]
Finished compiling text-corpus for 144 locales in 754.62 avg=5.24 sec/locale
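Until the library changes, a caller can silence the message on its own side. A minimal sketch, assuming the message is written with `print(..., file=sys.stderr)`; `noisy_constructor` is a hypothetical stand-in for building a Validator or Phonemiser, and `build_quietly` is an illustrative helper, not part of cvutils:

```python
import contextlib
import io
import sys


def noisy_constructor(lang):
    # Hypothetical stand-in for constructing a Validator for an unsupported locale.
    print("[Validator] Function not implemented", file=sys.stderr)
    return None


def build_quietly(factory, lang):
    """Call factory(lang) while capturing anything it writes to stderr."""
    captured = io.StringIO()
    with contextlib.redirect_stderr(captured):
        result = factory(lang)
    return result, captured.getvalue()


result, messages = build_quietly(noisy_constructor, "xx")
```

`redirect_stderr` swaps `sys.stderr` for the whole process, so it suits per-process workers better than threaded code.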

on the Validator

Hi, thanks for the practical toolkit for CV data preprocessing!

I recently used this toolkit to validate data in different languages, but found that the Validator failed to initialize. After checking the code, I found that initializing the Validator requires data/$lang/validate.tsv to be present.

So my questions are: 1) Will the missing data be added soon? and 2) How can I prepare the data/$lang/validate.tsv file from scratch?

Thanks in advance!

covo is not portable

The argument structure of covo is position-based, so it only works on systems where covo can be called directly as:

covo [arguments]

On Windows, if it is not on the PATH, you need to invoke it through the Python executable, like:

python3 covo [arguments]

This changes the argument indexes and covo fails.

If an argument parser were added, this issue would go away. Currently, I need to open a VM (or use WSL) just to get an alphabet, for example.
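A minimal sketch of what such a parser could look like, using the standard library's argparse. The subcommand names and arguments are illustrative assumptions, not covo's actual interface:

```python
import argparse


def build_parser():
    # Command names here are illustrative; they are not covo's real interface.
    parser = argparse.ArgumentParser(prog="covo")
    sub = parser.add_subparsers(dest="command", required=True)

    alphabet = sub.add_parser("alphabet", help="print the alphabet for a locale")
    alphabet.add_argument("locale")

    validate = sub.add_parser("validate", help="validate text for a locale")
    validate.add_argument("locale")
    return parser


# argparse reads sys.argv[1:] by default; an explicit list is passed here
# only to keep the example self-contained.
args = build_parser().parse_args(["alphabet", "tr"])
```

Because the parser looks arguments up by name rather than by a hard-coded index, the script no longer depends on how it was launched.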

missing Hausa characters

These look like valid Hausa characters, but covo validate ha will either remove them or fail on them:


ā
ă

Transliterator module missing

Looks like you forgot to commit it.

~$ python3 -m  pip install git+https://github.com/ftyers/commonvoice-utils.git
Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/ftyers/commonvoice-utils.git
  Cloning https://github.com/ftyers/commonvoice-utils.git to /tmp/pip-req-build-4ptidukg
  Running command git clone -q https://github.com/ftyers/commonvoice-utils.git /tmp/pip-req-build-4ptidukg
Building wheels for collected packages: commonvoice-utils
  Building wheel for commonvoice-utils (setup.py) ... done
  Created wheel for commonvoice-utils: filename=commonvoice_utils-0.2.7-py3-none-any.whl size=142813 sha256=541d42fa2c786d602f4ca04e6f1ad8848a57ded5376f69a19629b1b602577fc7
  Stored in directory: /tmp/pip-ephem-wheel-cache-x_86ocb3/wheels/56/67/73/4bf2d8a681334251a44405673d52e767f646121bbd89c8b7fa
Successfully built commonvoice-utils
Installing collected packages: commonvoice-utils
Successfully installed commonvoice-utils-0.2.7
~$ python3 -c "from cvutils import Alphabet"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/selimcan/.local/lib/python3.8/site-packages/cvutils/__init__.py", line 10, in <module>
    from transliterator import Transliterator
ModuleNotFoundError: No module named 'transliterator'

I don't think https://pypi.org/project/transliterator/ is the package that's required, is it?

Feature Request: Please add "hasValidator" etc

Currently, to check whether a language supports e.g. segmentation, commonvoice-utils only returns a list of paths, so you have to scan the list yourself. For example:

import os

import cvutils as cvu

cv: cvu.CV = cvu.CV()

lc = 'tr'
supported: bool = False

validator: cvu.Validator = cvu.Validator(lc)
tokeniser: cvu.Tokeniser = cvu.Tokeniser(lc)

for val in cv.validators():
    if lc == os.path.split(os.path.split(val)[0])[1]:
        supported = True

It would be very helpful to have utility functions like cvu.hasValidator(lc) or cvu.hasTokenizer(lc) that do the same out of the box.
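A minimal sketch of such a helper, built on the same path comparison as the loop above. `has_locale` is a hypothetical name, and the example paths are made up to match the `<...>/<locale>/<file>` shape that loop assumes:

```python
import os


def has_locale(paths, locale):
    """Return True if any path has `locale` as its parent directory name."""
    return any(os.path.basename(os.path.dirname(p)) == locale for p in paths)


# Made-up paths in the shape the loop above expects.
example_paths = ["/x/data/tr/validate.tsv", "/x/data/ha/validate.tsv"]
supported = has_locale(example_paths, "tr")
```

With the library installed, the call would presumably be `has_locale(cv.validators(), lc)`.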

Add a method of checking CJK

Perhaps something like PASS to return the input unchanged and REPL to remove punctuation.

Another option would be something like CB, for "check Unicode block".
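A minimal sketch of the Unicode block idea, assuming a check against a hand-picked set of code point ranges. The function names and the choice of blocks are illustrative, and the list is far from complete:

```python
# Inclusive code point ranges for a few CJK-related Unicode blocks.
CJK_BLOCKS = {
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
    "Hangul Syllables": (0xAC00, 0xD7AF),
}


def unicode_block(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for name, (lo, hi) in CJK_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def all_in_blocks(text):
    """True if every non-space character falls in one of the listed blocks."""
    return all(unicode_block(ch) is not None for ch in text if not ch.isspace())
```

A validator rule could then accept a sentence when `all_in_blocks(sentence)` holds, after punctuation has been handled separately.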

Adding feature to exclude group of information during export

Is it possible to implement optional "--exclude-xxx fn" flags to exclude recordings during cv export?

--exclude-voices voices.txt            // E.g. to measure the effect of a single person recording too much
--exclude-sentences sentences.txt             // E.g. to exclude reported sentences
--exclude-gender [male|female|other|empty]             // E.g. to train with male voices and test with female voices
etc

That would very much ease any experiments on biasing effects.
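A minimal sketch of the filtering these flags would perform, assuming each recording is a row with `client_id`, `sentence`, and `gender` columns. The function name, its parameters, and the sample data are hypothetical:

```python
import csv
import io


def filter_rows(rows, exclude_voices=(), exclude_sentences=(), exclude_genders=()):
    """Yield only the rows that match none of the exclusion sets."""
    for row in rows:
        if row["client_id"] in exclude_voices:
            continue
        if row["sentence"] in exclude_sentences:
            continue
        if row["gender"] in exclude_genders:
            continue
        yield row


# Tiny made-up sample in TSV form; real files have more columns.
sample = (
    "client_id\tpath\tsentence\tgender\n"
    "v1\ta.mp3\tMerhaba\tmale\n"
    "v2\tb.mp3\tSelam\tfemale\n"
    "v1\tc.mp3\tNaber\tmale\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
kept = list(filter_rows(rows, exclude_voices={"v1"}))
```

The exclusion files named in the flags would be read into sets, one entry per line, before calling the filter.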

PS: The correct place to implement these would be CorporaCreator but it is not actively maintained as you know.

Something similar could be implemented for OPUS corpora.

Bülent

hindi encoding issue

Hi. I tried using the g2p tool to phonemise Hindi words, but ran into some encoding issues.

from cvutils import Phonemiser
p = Phonemiser('hi')
p.phonemise('अवकाशग्रहण')

At first, the error message was like:

UnicodeDecodeError Traceback (most recent call last)
C:\Users\MAGICD~1\AppData\Local\Temp/ipykernel_10860/951158175.py in <module>
1 from cvutils import Phonemiser
----> 2 p = Phonemiser('hi')
3 p.phonemise('अवकाशग्रहण')

~\Anaconda3\lib\site-packages\cvutils\phonemiser.py in __init__(self, lang)
22 print('[Phonemiser] Function not implemented', file=sys.stderr)
23 try:
---> 24 self.validator = Validator(self.lang)
25 except FileNotFoundError:
26 pass

~\Anaconda3\lib\site-packages\cvutils\validator.py in __init__(self, lang)
13 self.nfkd = False
14 try:
---> 15 self.load_data()
16 except FileNotFoundError:
17 print('[Validator] Function not implemented', file=sys.stderr)

~\Anaconda3\lib\site-packages\cvutils\validator.py in load_data(self)
24 self.lower = False
25 data_dir = os.path.abspath(os.path.dirname(__file__)) + '/data/'
---> 26 for line in open(data_dir + self.lang + '/validate.tsv').readlines():
27 if line[0] == '#':
28 continue

UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 50: illegal multibyte sequence

Then I set encoding='utf-8' on line 26 of ~\Anaconda3\lib\site-packages\cvutils\validator.py, but it didn't work. The error was still:

UnicodeDecodeError Traceback (most recent call last)
...
~\Anaconda3\lib\site-packages\cvutils\validator.py in load_data(self)
24 self.lower = False
25 data_dir = os.path.abspath(os.path.dirname(__file__)) + '/data/'
---> 26 for line in open(data_dir + self.lang + '/validate.tsv',encoding='utf-8').readlines():
27 if line[0] == '#':
28 continue

UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 50: illegal multibyte sequence

Did I do anything wrong? Is there another way to solve the encoding issue? Thanks!

Please add Korean support

Although Korean is not fully enabled on Common Voice yet, it only lacks 1500 sentences. If added, we can start using alphabet/normalization support provided by covo.
