
inventory-study's Introduction

CLDF Datasets

inventory-study's People

Contributors

lingulist, simongreenhill, tresoldi, xrotwang

inventory-study's Issues

JIPA data

The LAPSYD data is integrated now, but the JIPA data still needs to be checked.

Plotting the correlations

@SimonGreenhill, if you have time to look into the correlation plots (or other visualizations) in your preferred programming language, that would be nice. For example, a heatmap of the world-wide differences between datasets in sound inventory size could be interesting (if one can color by size). But I'd leave this to you and, for now, keep the code I wrote as preliminary data exploration that is not publication-final.

Similarity calculations

# Phoible / JIPA
               Correlation    P-Value    Deltas    StrictSim    ApproxSim    Sample
-----------  -------------  ---------  --------  -----------  -----------  --------
Sounds              0.8416     0.0000    4.9353       0.7051       0.8527        85
Consonants          0.7841     0.0000    2.7353       0.7426       0.8665        85
Vowels              0.8658     0.0000    1.7471       0.7772       0.8833        85
Consonantal         0.7859     0.0000    2.7353       0.0000       0.0000        85
Vocalic             0.8456     0.0000    2.9118       0.0000       0.0000        85
Ratio               0.8241     0.0000    0.6892       0.0000       0.0000        85
# Phoible / LAPSYD
               Correlation    P-Value    Deltas    StrictSim    ApproxSim    Sample
-----------  -------------  ---------  --------  -----------  -----------  --------
Sounds              0.8238     0.0000    6.4353       0.6247       0.7857       116
Consonants          0.8703     0.0000    4.2198       0.6495       0.8045       116
Vowels              0.7378     0.0000    2.0172       0.7060       0.8135       116
Consonantal         0.8710     0.0000    4.2457       0.0000       0.0000       116
Vocalic             0.6949     0.0000    2.8491       0.0000       0.0000       116
Ratio               0.6863     0.0000    0.8379       0.0000       0.0000       116
# Phoible / UPSID
               Correlation    P-Value    Deltas    StrictSim    ApproxSim    Sample
-----------  -------------  ---------  --------  -----------  -----------  --------
Sounds              0.5930     0.0000   11.0185       0.4753       0.6889        54
Consonants          0.6245     0.0000    5.6852       0.5589       0.7427        54
Vowels              0.6953     0.0000    3.9444       0.3897       0.6212        54
Consonantal         0.6245     0.0000    5.6852       0.0000       0.0000        54
Vocalic             0.6884     0.0000    4.9815       0.0000       0.0000        54
Ratio               0.6715     0.0000    1.2868       0.0000       0.0000        54
# JIPA / LAPSYD
               Correlation    P-Value    Deltas    StrictSim    ApproxSim    Sample
-----------  -------------  ---------  --------  -----------  -----------  --------
Sounds              0.8063     0.0000    6.9907       0.6359       0.7978        54
Consonants          0.8014     0.0000    5.5000       0.6423       0.8005        54
Vowels              0.8631     0.0000    1.1852       0.7320       0.8443        54
Consonantal         0.8013     0.0000    5.5556       0.0000       0.0000        54
Vocalic             0.8692     0.0000    1.8056       0.0000       0.0000        54
Ratio               0.7437     0.0000    0.9517       0.0000       0.0000        54
# JIPA / UPSID
               Correlation    P-Value    Deltas    StrictSim    ApproxSim    Sample
-----------  -------------  ---------  --------  -----------  -----------  --------
Sounds              0.4820     0.0025   11.8919       0.4252       0.6809        37
Consonants          0.6471     0.0000    8.2703       0.5036       0.7235        37
Vowels              0.5917     0.0001    3.3784       0.3491       0.6234        37
Consonantal         0.6471     0.0000    8.2703       0.0000       0.0000        37
Vocalic             0.6378     0.0000    4.5946       0.0000       0.0000        37
Ratio               0.5049     0.0014    1.7684       0.0000       0.0000        37
# LAPSYD / UPSID
               Correlation    P-Value    Deltas    StrictSim    ApproxSim    Sample
-----------  -------------  ---------  --------  -----------  -----------  --------
Sounds              0.8385     0.0000    3.6892       0.6702       0.8282       296
Consonants          0.8927     0.0000    1.6959       0.7726       0.8720       296
Vowels              0.6977     0.0000    2.1047       0.5114       0.7331       296
Consonantal         0.8930     0.0000    1.6824       0.0000       0.0000       296
Vocalic             0.7181     0.0000    2.3108       0.0000       0.0000       296
Ratio               0.7583     0.0000    0.7534       0.0000       0.0000       296

Finalize, refactor, and officially publish

This should be done some time in November, so that it can be combined with the CLTS update. If we want to go for a strict, downscaled IPA by then, we would first implement it here in an experimental manner and discuss its inclusion in pyclts or clts later.

Reasons for Differences in Datasets

Major reasons for differences are (as far as I can tell now):

  1. different language varieties chosen (this one is on us, since we then select glottocodes that are too broad; compare the dialect problem in Chinese varieties)
  2. different interpretations of the same source by the scholars who wrote down the data (what is spurious, how to interpret something)
  3. different sources used by the authors

Anything more? We cannot quantify these but want to show them in our results...

Application for Checking Individual Pairings in the Data

@cormacanderson, this is quite important, as I made an app where you can check individual differences:

https://digling.org/phonobank/

If you paste in a glottocode and then press OK, you will see a
comparison of the data, including the sources (if they are available).
Sorry, finding a glottocode is trial and error for now: you need to
know whether it is in the data.

It also offers the comparison of raw entries with the comparison of
strict CLTS-mapped entries.
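
To make the raw vs. mapped comparison concrete, here is a minimal sketch of the kind of pairwise check involved. It is not the code behind the app: `compare_inventories` and `nfd` are hypothetical names, the two inventories are invented, and Unicode NFD normalization only stands in for the strict CLTS mapping to show why mapping changes the result.

```python
import unicodedata


def compare_inventories(left, right, normalize=None):
    """Split two sound inventories into shared and dataset-specific sounds.

    `normalize` is an optional callable applied to every sound first.
    It is a stand-in for a CLTS-style mapping (an assumption of this sketch).
    """
    if normalize is None:
        normalize = lambda sound: sound  # raw comparison: compare strings as-is
    left_set = {normalize(sound) for sound in left}
    right_set = {normalize(sound) for sound in right}
    return {
        "shared": sorted(left_set & right_set),
        "only_left": sorted(left_set - right_set),
        "only_right": sorted(right_set - left_set),
    }


def nfd(sound):
    """Decompose precomposed characters so equivalent encodings compare equal."""
    return unicodedata.normalize("NFD", sound)


# Invented inventories: the nasalized vowel is precomposed in A, decomposed in B.
inventory_a = ["p", "t", "k", "a", "\u00e3"]
inventory_b = ["p", "t", "a", "a\u0303", "i"]

raw = compare_inventories(inventory_a, inventory_b)
mapped = compare_inventories(inventory_a, inventory_b, normalize=nfd)
```

In the raw comparison the nasalized vowel counts as a difference on both sides only because it is encoded differently; after normalization only `k` and `i` remain as real differences.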

I think this is A) something that one could use to visualize a couple of
examples in the paper, and B) a useful enough tool for Cormac to pull
out some interesting examples.

BTW: Italian is not in the 70-item version, since it has triphthongs
in this version and was therefore excluded. But one can keep it
as an anecdote.

Compute the Deltas

@cormacanderson, I have now computed the differences for different aspects (consonantal = consonants + clusters, vocalic = vowels + diphthongs):

# Phoible / JIPA
               Correlation    P-Value    Deltas    Sample
-----------  -------------  ---------  --------  --------
Sounds              0.8185     0.0000    5.1364        66
Consonants          0.7464     0.0000    3.0833        66
Vowels              0.8745     0.0000    1.7121        66
Consonantal         0.7481     0.0000    3.0833        66
Vocalic             0.8568     0.0000    2.8939        66
Ratio               0.8023     0.0000    0.7777        66

# Phoible / LAPSYD
               Correlation    P-Value    Deltas    Sample
-----------  -------------  ---------  --------  --------
Sounds              0.7996     0.0000    5.3514        74
Consonants          0.8821     0.0000    2.9662        74
Vowels              0.7198     0.0000    2.1554        74
Consonantal         0.8841     0.0000    2.9662        74
Vocalic             0.6816     0.0000    3.0135        74
Ratio               0.6532     0.0000    0.6540        74

# Phoible / UPSID
               Correlation    P-Value    Deltas    Sample
-----------  -------------  ---------  --------  --------
Sounds              0.5104     0.0040    9.0833        30
Consonants          0.6322     0.0002    4.9000        30
Vowels              0.7287     0.0000    2.6500        30
Consonantal         0.6322     0.0002    4.9000        30
Vocalic             0.7103     0.0000    3.7167        30
Ratio               0.7432     0.0000    1.1224        30

# JIPA / LAPSYD
               Correlation    P-Value    Deltas    Sample
-----------  -------------  ---------  --------  --------
Sounds              0.8298     0.0000    6.6707        41
Consonants          0.8301     0.0000    4.8780        41
Vowels              0.8159     0.0000    1.3659        41
Consonantal         0.8301     0.0000    4.8780        41
Vocalic             0.8297     0.0000    2.0854        41
Ratio               0.6515     0.0000    0.7726        41

# JIPA / UPSID
               Correlation    P-Value    Deltas    Sample
-----------  -------------  ---------  --------  --------
Sounds              0.3594     0.1004   12.0455        22
Consonants          0.6357     0.0015    8.0909        22
Vowels              0.6102     0.0026    3.3636        22
Consonantal         0.6357     0.0015    8.0909        22
Vocalic             0.6214     0.0020    4.9545        22
Ratio               0.5082     0.0157    1.8084        22

# LAPSYD / UPSID
               Correlation    P-Value    Deltas    Sample
-----------  -------------  ---------  --------  --------
Sounds              0.8368     0.0000    3.2199       141
Consonants          0.9101     0.0000    1.2553       141
Vowels              0.6933     0.0000    1.9716       141
Consonantal         0.9102     0.0000    1.2553       141
Vocalic             0.7220     0.0000    2.1348       141
Ratio               0.6875     0.0000    0.6477       141
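
For readers who want to reproduce numbers of this kind, the following is a rough sketch and not the script that produced the tables. It assumes that each dataset maps a glottocode to counts per sound class, that "Deltas" is the mean absolute difference in size over the shared glottocodes, and that the correlation is Pearson's r; the thread confirms none of these. The glottocodes and counts are invented, and the P-Value column is left out because it would need something like SciPy.

```python
from math import sqrt

# Aspect definitions as given above: consonantal = consonants + clusters,
# vocalic = vowels + diphthongs.
ASPECTS = {
    "Consonants": ("consonants",),
    "Vowels": ("vowels",),
    "Consonantal": ("consonants", "clusters"),
    "Vocalic": ("vowels", "diphthongs"),
}


def pearson(xs, ys):
    """Pearson correlation; assumes both lists vary (non-zero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def compare(dataset_a, dataset_b):
    """Compare inventory sizes for the glottocodes present in both datasets."""
    shared = sorted(set(dataset_a) & set(dataset_b))
    results = {}
    for aspect, keys in ASPECTS.items():
        xs = [sum(dataset_a[g][k] for k in keys) for g in shared]
        ys = [sum(dataset_b[g][k] for k in keys) for g in shared]
        # Assumed definition of "Deltas": mean absolute size difference.
        delta = sum(abs(x - y) for x, y in zip(xs, ys)) / len(shared)
        results[aspect] = {
            "correlation": pearson(xs, ys),
            "delta": delta,
            "sample": len(shared),
        }
    return results


# Invented data; the glottocodes are placeholders, not real entries.
dataset_a = {
    "aaaa0001": {"consonants": 20, "clusters": 2, "vowels": 5, "diphthongs": 1},
    "aaaa0002": {"consonants": 30, "clusters": 0, "vowels": 7, "diphthongs": 3},
    "aaaa0003": {"consonants": 15, "clusters": 1, "vowels": 3, "diphthongs": 0},
    "aaaa0004": {"consonants": 18, "clusters": 0, "vowels": 6, "diphthongs": 0},
}
dataset_b = {
    "aaaa0001": {"consonants": 22, "clusters": 0, "vowels": 5, "diphthongs": 0},
    "aaaa0002": {"consonants": 28, "clusters": 0, "vowels": 9, "diphthongs": 2},
    "aaaa0003": {"consonants": 15, "clusters": 0, "vowels": 3, "diphthongs": 0},
}

results = compare(dataset_a, dataset_b)
```

Only glottocodes present in both datasets enter the comparison, which is why the Sample column differs from pairing to pairing.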
