cldf-datasets / doreco

CLDF dataset derived from DoReCo's core corpus

Home Page: https://doreco.info/

Python 44.49% TeX 55.51%

doreco's Introduction

CLDF Datasets

doreco's Issues

Add ExampleTable

Copying from #8:

adding an ExampleTable, i.e. aggregating the DoReCo data on sentence level into - ideally glossed - IGT sentences

@xrotwang Whether each file is glossed is also recorded in metadata.csv.
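
A minimal sketch of the aggregation step, assuming word-level rows carry an utterance ID plus a form and a gloss. The input column names (`u_id`, `form`, `gloss`) and the placeholder rows are hypothetical, not DoReCo's actual schema; `Analyzed_Word` and `Gloss` follow the default ExampleTable column names.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical word-level rows, already ordered by utterance and word position.
words = [
    {"u_id": "u1", "form": "w1", "gloss": "G1"},
    {"u_id": "u1", "form": "w2", "gloss": "G2"},
    {"u_id": "u2", "form": "w3", "gloss": "G3"},
]


def aggregate_examples(rows):
    """Group consecutive word rows into one example per utterance."""
    examples = []
    for u_id, group in groupby(rows, key=itemgetter("u_id")):
        group = list(group)
        examples.append({
            "ID": u_id,
            "Analyzed_Word": [w["form"] for w in group],
            "Gloss": [w["gloss"] for w in group],
        })
    return examples


examples = aggregate_examples(words)
```

`groupby` only merges adjacent rows, so the input has to be sorted by utterance first. Files without glosses could leave `Gloss` empty, using the flag from metadata.csv.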

Add MediaTable

Copying from #8:

to add a MediaTable linking the lexical data to the audio files.

@xrotwang Quite lost on how we would proceed to implement this. Is this something you will take charge of?

X-Sampa to CLTS

For the transcription, all phones are currently in X-Sampa and need to be transferred to CLTS.
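
A rough sketch of the first half of that conversion, X-SAMPA to IPA, using greedy longest-match lookup. The table below is a tiny hand-picked subset and the function name is hypothetical; a real conversion would need a complete table, and the resulting IPA strings would then still have to be looked up in CLTS.

```python
# Tiny illustrative subset of X-SAMPA symbols; not a complete table.
XSAMPA_TO_IPA = {
    "S": "ʃ", "Z": "ʒ", "N": "ŋ", "@": "ə", "E": "ɛ",
    ":": "ː", "_h": "ʰ", "_j": "ʲ",
}


def xsampa_to_ipa(segment, table=XSAMPA_TO_IPA):
    """Convert one X-SAMPA segment, trying the longest symbols first."""
    longest = max(len(key) for key in table)
    out, i = [], 0
    while i < len(segment):
        for size in range(longest, 0, -1):
            chunk = segment[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            # Symbols missing from the table are passed through unchanged.
            out.append(segment[i])
            i += 1
    return "".join(out)
```

Longest-match matters for multi-character symbols such as `_h`, which would otherwise be read as two separate characters.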

Download link not working

https://sharedocs.huma-num.fr/wl/?id=6OkBYGXrPkLEuHchF4kOXpsJf7MOKcLv&fmode=download

Sadly, the download links do not seem to be very stable. The link above currently no longer works.

Map sources to languages

The BibTeX key is not part of any of the parsed tables, so we may need to link sources and languages ourselves. The easiest solution would be to add a column to the etc/languages file and map them manually.
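
A small sketch of how such a column could be read back, assuming a tab-separated languages file with an added `Source` column. The column name, the language IDs, and the keys are placeholders.

```python
import csv
import io

# Hypothetical excerpt of etc/languages with an added "Source" column.
LANGUAGES_TSV = (
    "ID\tGlottocode\tSource\n"
    "lang1\tabcd1234\tkey_one\n"
    "lang2\tefgh5678\tkey_two\n"
    "lang3\tijkl9012\t\n"
)


def sources_by_language(handle):
    """Map each language ID to its BibTeX key, skipping empty cells."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["ID"]: row["Source"] for row in reader if row["Source"]}


mapping = sources_by_language(io.StringIO(LANGUAGES_TSV))
```

With a real file, `open(path, encoding="utf-8", newline="")` would replace the `io.StringIO` wrapper.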

Create languages.tsv

What information can or should we put in the languages.tsv file?

From the DoReCo mainpage, we have the following options:

Creators (citation of the individual corpus)
License of the individual corpus
Information about glosses (none/all/some)
Stats: tokens, speakers, texts

I would like to add at least the citation key for the individual corpora and the information about glossing. This could make it easier to filter for specific studies, and the citation key (hopefully) ensures that people who use the corpus cite the individual corpus creators.

SQL Tutorial: Unrecognized Features during CLDF conversion

Following the new Usage tutorial, I run into the following error when running `makecldf`:

(doreco) blum@lingn45 doreco % cldfbench makecldf cldfbench_doreco.py --glottolog-version=v4.7
/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
INFO    running _cmd_makecldf on doreco ...
Path to clts data: /Users/blum/Library/Application Support/cldf/clts
Traceback (most recent call last):
  File "/Users/blum/Projects/venv/doreco/bin/cldfbench", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/cldfbench/__main__.py", line 89, in main
    return args.main(args) or 0
           ^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/cldfbench/commands/makecldf.py", line 32, in run
    with_dataset(args, 'makecldf')
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/cldfbench/cli_util.py", line 161, in with_dataset
    res = func(*arg, args)
          ^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/cldfbench/dataset.py", line 206, in _cmd_makecldf
    self.cmd_makecldf(args)
  File "/Users/blum/Projects/doreco/cldfbench_doreco.py", line 170, in cmd_makecldf
    bipa = clts.bipa[row['IPA']] if row['IPA'] else None
           ^^^^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/clldutils/misc.py", line 241, in __get__
    result = instance.__dict__[self.__name__] = self.fget(instance)
                                                ^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/pyclts/api.py", line 23, in bipa
    return self.transcriptionsystem('bipa')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/pyclts/api.py", line 80, in transcriptionsystem
    if key in self.transcriptionsystem_dict:
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/clldutils/misc.py", line 241, in __get__
    result = instance.__dict__[self.__name__] = self.fget(instance)
                                                ^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/pyclts/api.py", line 77, in transcriptionsystem_dict
    return {ts.id: ts for ts in self.iter_transcriptionsystem()}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/pyclts/api.py", line 77, in <dictcomp>
    return {ts.id: ts for ts in self.iter_transcriptionsystem()}
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/pyclts/api.py", line 69, in iter_transcriptionsystem
    yield TranscriptionSystem(
          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/pyclts/transcriptionsystem.py", line 77, in __init__
    raise ValueError(
ValueError: Unrecognized features (duration: ultra-long, line 129))

I am using a fresh venv with the most recent CLTS. @xrotwang Can you spot what I am doing wrong?

Add raw data without upload to Github

I will add a folder within raw that includes the data and a script that parses the relevant csv-files from the subfolders. The raw data will not be uploaded to Github so that we don't bloat the repository (as discussed with xrotwang).

This will only be done for the languages that do not have an ND tag.
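
A sketch of the file collection part of such a script, assuming one subfolder per language directly below the raw data folder. The function name and the way ND languages are passed in are hypothetical.

```python
from pathlib import Path


def collect_csv_files(raw_dir, nd_languages):
    """Return all CSV files below raw_dir, skipping ND-licensed languages."""
    raw_dir = Path(raw_dir)
    files = []
    for path in sorted(raw_dir.rglob("*.csv")):
        # The first path component below raw_dir is taken as the language folder.
        language = path.relative_to(raw_dir).parts[0]
        if language not in nd_languages:
            files.append(path)
    return files
```

Sorting the paths keeps the processing order the same across runs and machines.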

Idempotency and sorting of contributions

It seems as if the contributions table isn't stable with respect to its sorting after a fresh run of makecldf (without ND data).
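
One common way to get idempotent output is to sort rows by a unique key right before writing them. Whether unordered iteration is the actual cause here is an assumption; the rows below are placeholders.

```python
import random


def sorted_rows(rows, key="ID"):
    """Return rows in a deterministic order, independent of input order."""
    return sorted(rows, key=lambda row: row[key])


# Hypothetical contribution rows; only the ordering matters here.
contributions = [{"ID": "c2"}, {"ID": "c1"}, {"ID": "c3"}]
shuffled = random.sample(contributions, k=len(contributions))
same_order = sorted_rows(contributions) == sorted_rows(shuffled)
```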

Run query

@xrotwang Not working with the database makes me lose all knowledge, it seems. In a PR, you described that you run the query with the following command:

time cldfbench doreco.query --format tsv init_query.sql > res.tsv

However, I only get a cldfbench error message that the command is invalid. How do I have to adapt the command so that the query runs correctly?

Cite constituent corpora

Individual corpora aggregated in DoReCo should be cited in

README.md
.zenodo.json
CLDF metadata

Parameters that exist twice

The following parameters exist twice in ParameterTable:

a (id: 1, 339)
dz (id: 53, 88)
nʲ (id: 126, 311)
ʉː (id: 130, 237)

For example, dz is used 5027 times with id 53 and 1446 times with id 88. Depending on how these values are used for counts or frequencies, this might affect the overall results of parameter counts.
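
A quick way to list such duplicates is to group parameter IDs by name. The sketch below uses two of the pairs reported above plus one hypothetical non-duplicated row; the column names `ID` and `Name` are assumed.

```python
from collections import defaultdict

parameters = [
    {"ID": "1", "Name": "a"},
    {"ID": "339", "Name": "a"},
    {"ID": "53", "Name": "dz"},
    {"ID": "88", "Name": "dz"},
    {"ID": "2", "Name": "b"},  # hypothetical row without a duplicate
]


def duplicated_names(rows):
    """Return names that occur under more than one ID."""
    ids_by_name = defaultdict(list)
    for row in rows:
        ids_by_name[row["Name"]].append(row["ID"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


duplicates = duplicated_names(parameters)
```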

Check uniqueness of IDs

Many of the IDs (filenames, speakers) have one or both of the following problems:

a) Their IDs are not unique
b) They are referenced with different names in different tables (e.g. filenames with or without prefixes)

I need to go through the data and make sure that the IDs are unique and referenced with identical names.

@xrotwang Did I understand correctly that cldf.add_foreign_key adds a lookup from column A of table 1 to column B of table 2?

So this code maps the Language of ValueTable against ID of LanguageTable, making the information from this file available for retrieval when loading the CLDF metadata?

        cldf.add_foreign_key('ValueTable', 'Language', 'LanguageTable', 'ID')
        cldf.add_foreign_key('ValueTable', 'Filename', 'metadata.csv', 'Filename')
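
Both problems, and the lookup that a foreign key relies on, can be checked in plain Python before the constraint is declared. This is a sketch with hypothetical helper names and placeholder filenames.

```python
from collections import Counter


def non_unique(values):
    """Return the values that occur more than once."""
    return sorted(v for v, n in Counter(values).items() if n > 1)


def dangling_references(child_values, parent_ids):
    """Return referenced values that have no match in the parent column."""
    return sorted(set(child_values) - set(parent_ids))


# Placeholder data: one repeated ID and one reference lacking its prefix.
metadata_filenames = ["prefix_file1", "prefix_file2", "prefix_file2"]
value_filenames = ["prefix_file1", "file2"]

repeated = non_unique(metadata_filenames)
dangling = dangling_references(value_filenames, metadata_filenames)
```

A foreign key only resolves when every value in the referencing column appears, spelled identically, in the referenced column, so an empty `dangling` list is the precondition for the lookup to work.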

Errors and warning during SQLite database creation

@xrotwang While running through all the steps, I receive the following warnings and errors. Could you help me fix them?

Running cldfbench makecldf:

(doreco) blum@lingn45 doreco % cldfbench makecldf cldfbench_doreco.py --glottolog-version v4.8
/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/clldutils/clilib.py:291: UserWarning: ImportError loading entry point doreco
  warnings.warn('ImportError loading entry point {0.name}'.format(ep))
WARNING Error importing doreco: No module named 'util'
INFO    running _cmd_makecldf on doreco ...
Path to clts data: ../../cldf_resources/clts
/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/csvw/utils.py:23: UserWarning: Invalid value for property: str
  warnings.warn('Invalid value for property: {}'.format(s))

Workaround: Ignore. Is it Mac-specific that the util module does not get loaded?
Also: We need to change the requirements for one package: clldutils==3.20.0 (instead of 3.19.0)

Running the SQLite conversion:

(doreco) blum@lingn45 doreco % cldf createdb cldf/Generic-metadata.json doreco.sqlite
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.11/site-packages/csvw/metadata.py", line 1357, in iterdicts
    res[col.header] = col.read(v, strict=strict)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/csvw/metadata.py", line 787, in read
    return datatype.read(v)
           ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/csvw/metadata.py", line 631, in read
    return self.validate(self.parse(v))
                         ^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/csvw/metadata.py", line 600, in parse
    return self.basetype.to_python(v, **self.derived_description)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/csvw/datatypes.py", line 105, in to_python
    string.value_error(v)
  File "/opt/homebrew/lib/python3.11/site-packages/csvw/datatypes.py", line 63, in value_error
    raise ValueError('invalid lexical value for {}: {}'.format(cls.name, v))
ValueError: invalid lexical value for string: CC BY-NC-ND

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/bin/cldf", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/pycldf/__main__.py", line 30, in main
    return args.main(args) or 0
           ^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/pycldf/commands/createdb.py", line 17, in run
    db.write_from_tg()
  File "/opt/homebrew/lib/python3.11/site-packages/pycldf/db.py", line 247, in write_from_tg
    items = {
            ^
  File "/opt/homebrew/lib/python3.11/site-packages/pycldf/db.py", line 248, in <dictcomp>
    tname: list(t.iterdicts())
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/csvw/metadata.py", line 1364, in iterdicts
    log_or_raise(
  File "/opt/homebrew/lib/python3.11/site-packages/csvw/metadata.py", line 192, in log_or_raise
    raise exception_cls(msg)
ValueError: cldf/contributions.csv:8:8 AnnotationLicense: invalid lexical value for string: CC BY-NC-ND

Workaround: Add "CC BY-NC-ND" manually to "cldf/Generic-metadata.json". But how can we do this automatically, during the CLDF conversion?
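
The traceback suggests that the AnnotationLicense column is declared as a string restricted by a pattern that does not include "CC BY-NC-ND". Assuming that reading is right, one option is to derive the pattern from the license values actually present in the data when the column is defined. The helper name is hypothetical, and the two shorter license names are only illustrative.

```python
import re


def license_datatype(licenses):
    """Build a CSVW string datatype whose format accepts exactly these values."""
    pattern = "|".join(re.escape(lic) for lic in sorted(set(licenses)))
    return {"base": "string", "format": pattern}


# Only "CC BY-NC-ND" is taken from the error message; the rest are examples.
observed = ["CC BY", "CC BY-NC", "CC BY-NC-ND", "CC BY"]
datatype = license_datatype(observed)
```

The resulting dictionary could then be passed as the column's `datatype` in the conversion script instead of a hard-coded list.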

Create custom command for data preprocessing

Run create_raw.py through mk-file

Load custom tables in Python: A metadata issue?

I am currently trying to import the CLDF dataset in Python with the following code:

from pycldf import Dataset
doreco = Dataset.from_metadata('doreco_cldf/cldf/StructureDataset-metadata.json')

for x in doreco.components:
    print(x)

However, the output only contains three components: ValueTable, LanguageTable, and ContributionTable. How do I have to adapt the cldfbench code so that I can also access the other, custom tables? @xrotwang Could you point me to the relevant code?
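
In CLDF metadata, a table counts as a component only if it declares `dc:conformsTo`. Custom tables lack that property, which is a plausible reason they are missing from `doreco.components`. The sketch below shows the distinction on a hypothetical metadata excerpt using only the standard library; the table list is made up.

```python
import json

# Hypothetical metadata excerpt: one component table and two custom tables.
METADATA = json.loads("""
{
  "tables": [
    {"url": "values.csv",
     "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#ValueTable"},
    {"url": "words.csv"},
    {"url": "utterances.csv"}
  ]
}
""")


def split_tables(metadata):
    """Separate component tables from custom tables by dc:conformsTo."""
    components, custom = [], []
    for table in metadata["tables"]:
        target = components if table.get("dc:conformsTo") else custom
        target.append(table["url"])
    return components, custom


components, custom = split_tables(METADATA)
```

In pycldf, iterating over `doreco.tables` rather than `doreco.components` should list every table, though that is worth checking against the installed version.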

SQLite Tutorial: Storing the query and running termgraph

In the last steps of the SQLite tutorial, I am trying to reproduce the termgraph, but I am running into an error.

I store the output like this:

sqlite> .output sr_by_lang.sql
sqlite> SELECT
   ...>     w.cldf_languagereference,
   ...>     AVG(u.speech_rate) AS sr
   ...> FROM
   ...>     utterance_initials AS ui,
   ...>     'words.csv' AS w,
   ...>     utterances AS u
   ...> WHERE 
   ...>     u.u_id = ui.u_id AND ui.wd_id = w.cldf_id 
   ...> GROUP BY w.cldf_languagereference 
   ...> ORDER BY sr;
sqlite> .output
sqlite> .quit

But when I run the command as indicated, I run into the following problem:

(doreco) blum@lingn45 doreco % sqlite3 -csv doreco.sqlite < sr_by_lang.sql | termgraph
Parse error near line 1: near "kama1351": syntax error
  kama1351|8.50360541282399 nngg1234|9.4337050500625 lowe1385|9.49056660765644 t
  ^--- error here

Traceback (most recent call last):
  File "/Users/blum/Projects/venv/doreco/bin/termgraph", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/termgraph/termgraph.py", line 133, in main
    _, labels, data, colors = read_data(args)
                              ^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/termgraph/termgraph.py", line 712, in read_data
    colors = check_data(labels, data, args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/doreco/lib/python3.11/site-packages/termgraph/termgraph.py", line 569, in check_data
    len_categories = len(data[0])
                         ~~~~^^^
IndexError: list index out of range

@xrotwang Is there something wrong with the way I store the query output? How should it be done correctly?
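
In the sqlite3 shell, `.output FILE` redirects query results to that file, so sr_by_lang.sql ends up holding result rows instead of SQL. That is consistent with the parse error near "kama1351". The file fed to `sqlite3` has to contain the SELECT statement itself. The sketch below runs the same query from a string with Python's `sqlite3` module against an in-memory stand-in database; the table contents are hypothetical.

```python
import csv
import io
import sqlite3

QUERY = """
SELECT w.cldf_languagereference, AVG(u.speech_rate) AS sr
FROM utterance_initials AS ui, "words.csv" AS w, utterances AS u
WHERE u.u_id = ui.u_id AND ui.wd_id = w.cldf_id
GROUP BY w.cldf_languagereference
ORDER BY sr;
"""


def run_query(conn, query=QUERY):
    """Run the query and return the result rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(conn.execute(query))
    return buffer.getvalue()


# Hypothetical stand-in for doreco.sqlite with just the columns the query uses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "words.csv" (cldf_id TEXT, cldf_languagereference TEXT);
CREATE TABLE utterances (u_id TEXT, speech_rate REAL);
CREATE TABLE utterance_initials (u_id TEXT, wd_id TEXT);
INSERT INTO "words.csv" VALUES ('w1', 'lang_a'), ('w2', 'lang_b');
INSERT INTO utterances VALUES ('u1', 2.0), ('u2', 1.0);
INSERT INTO utterance_initials VALUES ('u1', 'w1'), ('u2', 'w2');
""")
csv_text = run_query(conn)
```

On the command line, the equivalent is to save the SELECT statement in sr_by_lang.sql with a text editor and then run the pipeline shown above unchanged.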

Metadata: Core vs. extended, speakers

@xrotwang The core set contains all the data that is time-aligned, while the extended set also includes the files that have morphological segmentation but no time alignment. We have two options:

We only include the data with time-alignment
We include all data and include a look-up in the language-specific metadata files to tag the files that are part of the extended set.

There are also several other metadata fields, such as speaker age, date of recording, and sound quality. Where do we store this information? Do we create a new document in the cldf folder?

mk-file to download additional data

I am wondering what the best way might be to add the annotations that are restricted by the ND license. @xrotwang Is there an easy way to create a mk-file that downloads the respective files and converts them into CLDF once the cldfbench workflow is done? This will probably be more relevant for my study than for this CLDF dataset, as we cannot publish this data as CLDF.