gata's Introduction

CLDF dataset for the Grammars Across Time Analyzed (GATA) dataset

How to cite

If you use these data please cite

Description

This dataset is licensed under a CC-BY license

Language sample of GATA

CLDF Datasets

The following CLDF datasets are available in cldf:

gata's People

Contributors

muffinlinwist, fredericblum

gata's Issues

Year column is sometimes not a year

Some of the values in the Year column of values.csv are not years:

ID=1043: 'via lexicon'
ID=1757: 'ngarinyin1970'
ID=2162: 'yupik1990'
ID=2244: ''
ID=2247: 'rembarrnga1975'
ID=2503: 'soo1971'
ID=2560: 'wa2012'
ID=2882: 'tlingit1917'
ID=2928: ''
ID=2932: ''
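A check along these lines could list such rows automatically. This is a minimal sketch, assuming that a valid Year is exactly four digits and that the file has ID and Year columns as in the listing above; the helper name non_year_rows and the row with ID 1042 are made up for illustration.

```python
import csv
import io
import re

# Assumption: a valid year is exactly four digits (e.g. "1963").
YEAR = re.compile(r"\d{4}")


def non_year_rows(handle):
    """Return (ID, Year) pairs for rows whose Year is not a four-digit year."""
    reader = csv.DictReader(handle)
    return [
        (row["ID"], row["Year"])
        for row in reader
        if not YEAR.fullmatch(row["Year"] or "")
    ]


# In-memory sample; row 1042 is a hypothetical valid row for contrast.
sample = io.StringIO(
    "ID,Year\n"
    "1042,1963\n"
    "1043,via lexicon\n"
    "1757,ngarinyin1970\n"
    "2244,\n"
)
print(non_year_rows(sample))
# [('1043', 'via lexicon'), ('1757', 'ngarinyin1970'), ('2244', '')]
```

For the real file, the same function could be called on `open("cldf/values.csv", encoding="utf-8", newline="")`, assuming that is where values.csv lives.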

Issues with Table

  • Korana: Maingard's work is based on data from 1879. This is coded correctly in the data, but not in Table 2.
  • Reference 1 for Nganasan has the wrong year in the Table; it should be 1854. Again, this is coded correctly in the data, but not in the Table.
  • Reference 2 for Nyangi has no year in the Table, but the year is correct in the data.

I had those cases marked in the production check, but they did not get changed.

List doculects in LanguageTable rather than Glottolog languages?

Just came across this dataset. Looks cool.
One suggestion, though: since the whole point of the dataset seems to be highlighting language change over time, wouldn't it be more transparent to list doculects as rows in LanguageTable, rather than Glottolog languages? This would make it clearer that a language at different points in time may need to be treated as distinct varieties, while aggregation at the Glottolog language level would still be possible via the glottocode property.

Inconsistent column names in parameters

In parameters.csv, the column names are inconsistent, e.g. "Variable_type" vs. "Category_Esp" vs. "Description_esp". The Python code used different names, so I had to modify them. I suggest unifying the spelling (capitalization etc.).
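One way to enforce a single spelling is to normalize the header names in one place. The sketch below assumes the target convention is "each underscore-separated part starts with a capital letter"; that convention and the helper name normalize_column are illustrative choices, not something the repository prescribes.

```python
def normalize_column(name):
    """Capitalize the first letter of every underscore-separated part.

    The rest of each part is left untouched, so "ID" stays "ID".
    """
    return "_".join(part[:1].upper() + part[1:] for part in name.split("_"))


for column in ["Variable_type", "Category_Esp", "Description_esp"]:
    print(column, "->", normalize_column(column))
# Variable_type -> Variable_Type
# Category_Esp -> Category_Esp
# Description_esp -> Description_Esp
```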

Push-Access

Would it be possible to give me push access to this repository? Then I could make some of the changes that we still need before publishing the database.

Handling References: parse sources

In the spreadsheet in raw/pooled_all.csv, you list sources in a colloquial style, not as their bibtex keys. They should be parsed in such a way that we have:

BibtexKey[pagerange]

So something like:

Muller (1963:555)

should be converted to:

Muller1963[555]

But note that this requires that the bibtex-keys are AuthorYear.
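The conversion could be sketched as follows. This is only an illustration, assuming single-token author names and four-digit years (optionally followed by a letter such as "1963a"); the function name to_cldf_source is hypothetical, and anything that does not match the pattern is returned as None so it can be checked by hand.

```python
import re

# Assumption: citations look like "Author (Year:Pages)" or "Author (Year)".
CITATION = re.compile(
    r"^\s*(?P<author>[^\s()]+)\s*"
    r"\((?P<year>\d{4}[a-z]?)(?::(?P<pages>[^)]+))?\)\s*$"
)


def to_cldf_source(citation):
    """Turn 'Muller (1963:555)' into 'Muller1963[555]'; None if unparseable."""
    match = CITATION.match(citation)
    if match is None:
        return None
    key = match.group("author") + match.group("year")
    pages = match.group("pages")
    return f"{key}[{pages.strip()}]" if pages else key


print(to_cldf_source("Muller (1963:555)"))  # Muller1963[555]
print(to_cldf_source("Muller (1963)"))      # Muller1963
print(to_cldf_source("via lexicon"))        # None
```

Multi-author citations are deliberately not handled here, since the matching BibTeX key would depend on how the keys in the bibliography are actually built.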

ValueTable includes R-isms "NA" as a value.

The ValueTable (values.csv) has multiple values coded as "NA", which is due to R being used somewhere in the pipeline. It would be better to code these as "" or to declare "NA" as a null value in the StructureDataset definition (where ? and <empty string> are already defined).
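The first option, recoding "NA" as an empty string before the values are written, might look like this. It is a sketch under the assumption that rows are handled as dictionaries with a Value column; the helper name blank_out_na and the sample rows are made up.

```python
# Assumption: only the literal string "NA" is an R artifact to be blanked.
NULL_MARKERS = {"NA"}


def blank_out_na(rows, column="Value"):
    """Yield copies of rows in which an 'NA' cell in `column` becomes ''."""
    for row in rows:
        if row[column] in NULL_MARKERS:
            row = {**row, column: ""}
        yield row


# Hypothetical rows for illustration only.
rows = [{"ID": "1", "Value": "NA"}, {"ID": "2", "Value": "1"}]
print(list(blank_out_na(rows)))
# [{'ID': '1', 'Value': ''}, {'ID': '2', 'Value': '1'}]
```

The input rows are not modified in place, so the raw data can still be inspected after the cleaning step.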

Change name to lower case

The repository names we use are typically lower case, since this is easier to handle. So I suggest changing the name to gata.

Codes

It would be better to add codes to the dataset.

Create release for submission

@LinguList The journal asked us to provide the data in one of their featured repositories. The most reasonable options would be Zenodo and OSF. Since many other CLDF datasets are also linked to Zenodo, I think Zenodo would be the better choice. Is there any reason not to create a release 0.1 from the dataset in its current form? We would like to avoid delaying the submission any further. There are some open issues (#13 for some utility plots, #10 for codes that could be added for a version 1.0, and #5, which is too much work to implement right now for little reward), but we think they should not prevent us from creating this release for submission.

If you are okay with that, I would go ahead and create a release 0.1 and link it to a Zenodo page. Alternatively, we would put everything on an OSF repository for the submission process.

Problems with values.csv

@LinguList, I made some modifications to the dataset but still run into three problems:

  1. It seems that the link with parameters.csv is not working because values.csv doesn't show this column. The same happens with languages.csv.
  2. There is a problem with the Source column (I get cells with something like this as an output: a;r;a;p;a;h;o;1;9;6;3). I don't know why this occurs.
  3. I tried deleting the Code_ID column. Every time I run cldfbench makecldf, however, it appears again.
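A possible explanation for problem 2, offered as an assumption rather than a diagnosis of this repository's code: if Source is a list-valued column whose items are joined with ";", then passing a plain string where a list is expected makes the writer iterate over the string character by character. The function serialize_sources below is a hypothetical stand-in for that serialization step.

```python
def serialize_sources(sources, separator=";"):
    """Join the items of a list-valued cell, as a CSV writer would."""
    return separator.join(sources)


# A bare string is iterated character by character:
print(serialize_sources("arapaho1963"))    # a;r;a;p;a;h;o;1;9;6;3

# Wrapping the key in a list gives the intended cell:
print(serialize_sources(["arapaho1963"]))  # arapaho1963
```

If this is indeed the cause, wrapping each source key in a list before writing the value row should resolve it.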
