cldf-datasets / gata

License: Creative Commons Attribution 4.0 International

Python 4.49% TeX 92.85% R 2.66%

gata's Introduction

CLDF dataset for the Grammars Across Time Analyzed (GATA) dataset

How to cite

If you use these data, please cite

the original source:

Blum, Frederic, Carlos Barrientos, Adriano Ingunza, Damian E. Blasi and Roberto Zariquiey (2023): Grammars Across Time Analyzed (GATA): a dataset of 52 languages. Scientific Data 10, 835 (2023). https://doi.org/10.1038/s41597-023-02659-1

the derived dataset, using the DOI of the particular released version you were using

Description

This dataset is licensed under a CC-BY license

CLDF Datasets

The following CLDF datasets are available in cldf:

CLDF StructureDataset at cldf/StructureDataset-metadata.json

gata's Issues

Year column is sometimes not a year

Some of the values in the Year column of values.csv are not years:

ID=1043: 'via lexicon'
ID=1757: 'ngarinyin1970'
ID=2162: 'yupik1990'
ID=2244: ''
ID=2247: 'rembarrnga1975'
ID=2503: 'soo1971'
ID=2560: 'wa2012'
ID=2882: 'tlingit1917'
ID=2928: ''
ID=2932: ''
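
A check along these lines could reproduce the list above. This is only a sketch: it assumes that a valid year is exactly four digits and that the columns are named `ID` and `Year`; the first sample row (`example-ok`) is a made-up placeholder, not a row from the dataset.

```python
import csv
import io
import re

# Assumption: a well-formed Year value consists of exactly four digits.
YEAR = re.compile(r"\d{4}")


def non_year_rows(rows):
    """Return (ID, Year) pairs whose Year value is not a four-digit year."""
    return [(r["ID"], r["Year"]) for r in rows if not YEAR.fullmatch(r["Year"] or "")]


# In-memory sample. For the real file one would instead pass
# csv.DictReader(open("cldf/values.csv", newline="", encoding="utf-8")).
sample = io.StringIO(
    "ID,Year\n"
    "example-ok,1963\n"  # hypothetical well-formed row
    "1043,via lexicon\n"
    "1757,ngarinyin1970\n"
    "2244,\n"
)
bad = non_year_rows(csv.DictReader(sample))
for row_id, year in bad:
    print(f"ID={row_id}: {year!r}")
```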

Issues with Table

Korana: The work by Maingard is based on data from 1879. This is correct in the data, but not in Table 2 of the data.
Nganasan: Reference 1 has the wrong year in the table; it should be 1854. Again, this is coded correctly in the data, but not in the table.
Nyangi: Reference 2 has no year in the table, but the year is correct in the data.

I had those cases marked in the production check, but they did not get changed.

List doculects in LanguageTable rather than Glottolog languages?

Just came across this dataset. Looks cool.
One suggestion, though: Since the whole point of the dataset seems to be highlighting language change over time, wouldn't it be more transparent to list doculects as rows in LanguageTable, rather than Glottolog languages? This would get across the point that languages at different points in time might need to be treated as different more thoroughly - while aggregation on Glottolog language level would still be possible via the glottocode property.
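
To make the suggestion concrete, here is a rough sketch of how doculect rows might be derived from language and year pairs. The column names `Glottocode` and `Year`, the ID scheme, and the glottocode `abcd1234` are all assumptions made for illustration; none of them is taken from the dataset.

```python
def doculect_rows(values):
    """Build one row per distinct (glottocode, year) pair, in first-seen order."""
    seen = {}
    for v in values:
        key = (v["Glottocode"], v["Year"])
        if key not in seen:
            seen[key] = {
                "ID": f"{v['Glottocode']}-{v['Year']}",  # hypothetical ID scheme
                "Glottocode": v["Glottocode"],  # keeps aggregation by language possible
                "Year": v["Year"],
            }
    return list(seen.values())


# Placeholder input: the same (hypothetical) language described at two points in time.
values = [
    {"Glottocode": "abcd1234", "Year": "1900"},
    {"Glottocode": "abcd1234", "Year": "2000"},
    {"Glottocode": "abcd1234", "Year": "1900"},
]
rows = doculect_rows(values)
print([r["ID"] for r in rows])
```

Each doculect keeps its glottocode, so grouping by language remains a single lookup.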

Inconsistent column names in parameters

In parameters.csv, the column names are spelled inconsistently, e.g. "Variable_type" vs. "Category_Esp" vs. "Description_esp". The Python code used different terms, so I had to modify them. I suggest unifying the spelling (capitalization etc.).
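
One possible convention would be all-lowercase names with underscores. The helper below is a sketch of that idea, not the convention the maintainers settled on; `normalize` is a hypothetical name.

```python
import re


def normalize(name):
    """Lower-case a column name and collapse other characters into single underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


columns = ["Variable_type", "Category_Esp", "Description_esp"]
renamed = {c: normalize(c) for c in columns}
print(renamed)
```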

move `languages.tsv` to `etc` folder and change ending to `csv`

Push-Access

Would it be possible to give me push-access for this repository? Then I could adapt some of the changes that we still need to make before publishing the database.

Link from README to CLDF metadata is wrong

The link from README.md to
https://github.com/cldf-datasets/gata/blob/main/cldf/Generic-metadata.json
is broken.
I am not sure why this happened; maybe you just need to recreate the README.

Handling References: parse sources

In the spreadsheet in raw/pooled_all.csv, you list sources in a colloquial style, not as their bibtex keys. They should be parsed in such a way that we have:

BibtexKey[pagerange]

So something like:

Muller (1963:555)

should be converted to:

Muller1963[555]

But note that this requires that the bibtex-keys are AuthorYear.
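
The conversion could be sketched as follows. It assumes single-author citations of the form `Author (Year:Pages)` and BibTeX keys of the form AuthorYear, as noted above; anything that does not match is returned as `None` so it can be reviewed by hand. The function name `to_cldf_source` is made up for this example.

```python
import re

# Assumption: one author name made of letters, a four-digit year, optional pages.
CITATION = re.compile(
    r"^\s*(?P<author>[^\W\d_]+)\s*\((?P<year>\d{4})(?::\s*(?P<pages>[^)]+?))?\s*\)\s*$"
)


def to_cldf_source(citation):
    """Turn 'Muller (1963:555)' into 'Muller1963[555]'; return None if unparseable."""
    match = CITATION.match(citation)
    if match is None:
        return None
    key = match.group("author") + match.group("year")
    pages = match.group("pages")
    return f"{key}[{pages}]" if pages else key


print(to_cldf_source("Muller (1963:555)"))
```

Citations with several authors or other formats would need additional patterns.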

ValueTable includes R-isms "NA" as a value.

The ValueTable (values.csv) has multiple values coded as "NA" which is due to R being used somewhere in the pipeline. It would be better to code these as "" or tag these values as null in the StructureDataset definition (where ? and <empty string> are already defined).
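
The first option, recoding "NA" as an empty string, might look like the sketch below when applied while the rows are being written. The name `clean_value` and the choice to treat only the exact string "NA" as a null marker are assumptions.

```python
# Assumption: only the literal string "NA" (as written by R) marks a missing value.
NULL_MARKERS = {"NA"}


def clean_value(value):
    """Map R-style missing-value markers to the empty string; keep everything else."""
    return "" if value.strip() in NULL_MARKERS else value


cleaned = [clean_value(v) for v in ["1", "NA", "?", "", "NAN"]]
print(cleaned)
```

The other option mentioned above, declaring "NA" as a null marker in the metadata, would leave values.csv itself untouched.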

Can you give me admin rights?

I need admin rights to transfer the directory :)

Run CLDF conversion and check

@MuffinLinwist, what is the status of the CLDF conversion? Are there any errors now?

Transfer repository to `cldf-datasets`

To have us work further on this, please transfer to https://github.com/cldf-datasets, we then keep working from there.

Change name to lower case

The repository names we use are typically lower case, which is easier to handle. I suggest changing the name to gata.

Codes

It will be better if we add codes to the dataset.

Create release for submission

@LinguList The journal asked us to provide the data in one of their featured repositories. The most reasonable options would be Zenodo and OSF. As many other CLDF datasets are also linked to Zenodo, I guess that this would be the most reasonable way. Does anything speak against creating a release 0.1 with the current form of the dataset? We would like to not delay the submission any further. There are some open issues (#13 for some utility plots, #10 for codes that could be added for a Version 1.0, and #5, which is too much work to implement right now for little reward), but we think that they should not prevent us from creating this release for submission.

If you are okay with that, I would go ahead and create a release 0.1 and link it to a Zenodo page. Alternatively, we would put everything on an OSF repository for the submission process.

Problems with values.csv

@LinguList, I made some modifications to the dataset but still run into three problems:

It seems that the link with parameters.csv is not working, because values.csv doesn't show this column. The same happens with languages.csv.
There is a problem with the Source column (I get cells with something like this as output: a;r;a;p;a;h;o;1;9;6;3). I don't know why this occurs.
I tried deleting the Code_ID column. Every time I run cldfbench makecldf, however, it appears again.
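
A likely explanation for the Source output, offered here as an assumption and not a confirmed diagnosis, is that the column is list-valued with ";" as separator and received a plain string instead of a list. Python then iterates over the string character by character. The helper `join_sources` below only imitates that joining step.

```python
def join_sources(sources, sep=";"):
    """Join the items of a list-valued field, as a CSV writer for such a field would."""
    return sep.join(sources)


# A bare string is iterated character by character:
as_string = join_sources("arapaho1963")
# Wrapping the key in a list gives the intended result:
as_list = join_sources(["arapaho1963"])

print(as_string)
print(as_list)
```

If this is the cause, passing the source keys as a list when writing each value row should resolve it.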