cldf-datasets / inventory-study

Study on Sound Inventories coded as CLDF Datasets

License: MIT License

Python 0.79% JavaScript 99.15% HTML 0.01% CSS 0.01% R 0.04%

inventory-study's Introduction

CLDF Datasets

inventory-study's Issues

Add an Appendix with Tutorial Sessions for the Paper

The paper should have an appendix in which we explain how to use our software, etc.

New paper in JOLE that should be cited

https://academic.oup.com/jole/advance-article-abstract/doi/10.1093/jole/lzac003/6565849?redirectedFrom=fulltext

JIPA data

The LAPSYD data is integrated now, but the JIPA data still needs to be checked.

@SimonGreenhill, if you have time to look into the correlation plots (or other visualizations) in your preferred programming language, that would be nice. For example, a heatmap showing the differences in sound inventory sizes between datasets world-wide might be interesting (if one can color the sizes). But I'd leave this to you and, for now, keep the code I wrote as preliminary data exploration that is not final for publication.

Similarity calculations

# Phoible / JIPA
               Correlation    P-Value    Deltas    StrictSim    ApproxSim    Sample
-----------  -------------  ---------  --------  -----------  -----------  --------
Sounds              0.8416     0.0000    4.9353       0.7051       0.8527        85
Consonants          0.7841     0.0000    2.7353       0.7426       0.8665        85
Vowels              0.8658     0.0000    1.7471       0.7772       0.8833        85
Consonantal         0.7859     0.0000    2.7353       0.0000       0.0000        85
Vocalic             0.8456     0.0000    2.9118       0.0000       0.0000        85
Ratio               0.8241     0.0000    0.6892       0.0000       0.0000        85
# Phoible / LAPSYD
               Correlation    P-Value    Deltas    StrictSim    ApproxSim    Sample
-----------  -------------  ---------  --------  -----------  -----------  --------
Sounds              0.8238     0.0000    6.4353       0.6247       0.7857       116
Consonants          0.8703     0.0000    4.2198       0.6495       0.8045       116
Vowels              0.7378     0.0000    2.0172       0.7060       0.8135       116
Consonantal         0.8710     0.0000    4.2457       0.0000       0.0000       116
Vocalic             0.6949     0.0000    2.8491       0.0000       0.0000       116
Ratio               0.6863     0.0000    0.8379       0.0000       0.0000       116
# Phoible / UPSID
               Correlation    P-Value    Deltas    StrictSim    ApproxSim    Sample
-----------  -------------  ---------  --------  -----------  -----------  --------
Sounds              0.5930     0.0000   11.0185       0.4753       0.6889        54
Consonants          0.6245     0.0000    5.6852       0.5589       0.7427        54
Vowels              0.6953     0.0000    3.9444       0.3897       0.6212        54
Consonantal         0.6245     0.0000    5.6852       0.0000       0.0000        54
Vocalic             0.6884     0.0000    4.9815       0.0000       0.0000        54
Ratio               0.6715     0.0000    1.2868       0.0000       0.0000        54
# JIPA / LAPSYD
               Correlation    P-Value    Deltas    StrictSim    ApproxSim    Sample
-----------  -------------  ---------  --------  -----------  -----------  --------
Sounds              0.8063     0.0000    6.9907       0.6359       0.7978        54
Consonants          0.8014     0.0000    5.5000       0.6423       0.8005        54
Vowels              0.8631     0.0000    1.1852       0.7320       0.8443        54
Consonantal         0.8013     0.0000    5.5556       0.0000       0.0000        54
Vocalic             0.8692     0.0000    1.8056       0.0000       0.0000        54
Ratio               0.7437     0.0000    0.9517       0.0000       0.0000        54
# JIPA / UPSID
               Correlation    P-Value    Deltas    StrictSim    ApproxSim    Sample
-----------  -------------  ---------  --------  -----------  -----------  --------
Sounds              0.4820     0.0025   11.8919       0.4252       0.6809        37
Consonants          0.6471     0.0000    8.2703       0.5036       0.7235        37
Vowels              0.5917     0.0001    3.3784       0.3491       0.6234        37
Consonantal         0.6471     0.0000    8.2703       0.0000       0.0000        37
Vocalic             0.6378     0.0000    4.5946       0.0000       0.0000        37
Ratio               0.5049     0.0014    1.7684       0.0000       0.0000        37
# LAPSYD / UPSID
               Correlation    P-Value    Deltas    StrictSim    ApproxSim    Sample
-----------  -------------  ---------  --------  -----------  -----------  --------
Sounds              0.8385     0.0000    3.6892       0.6702       0.8282       296
Consonants          0.8927     0.0000    1.6959       0.7726       0.8720       296
Vowels              0.6977     0.0000    2.1047       0.5114       0.7331       296
Consonantal         0.8930     0.0000    1.6824       0.0000       0.0000       296
Vocalic             0.7181     0.0000    2.3108       0.0000       0.0000       296
Ratio               0.7583     0.0000    0.7534       0.0000       0.0000       296

Finalize, refactor, and officially publish

This should be done sometime in November, so that it can be combined with the CLTS update. If we want to go for a strict, downscaled IPA by then, we would implement it here first in an experimental manner and later discuss its inclusion in pyclts or clts.

Reasons for Differences in Datasets

The major reasons for the differences are, as far as I can tell at this point:

different language varieties chosen (a problem on our side, since we then select glottocodes that are too broad; compare the dialect problem in Chinese varieties)
different interpretations of the same source by scholars when writing down the data (what is spurious, how to interpret something)
different sources used by authors

Anything more? We cannot quantify these, but we want to show them in our results.

Weighted Jaccard Similarity and Distance

https://en.wikipedia.org/wiki/Jaccard_index#Weighted_Jaccard_similarity_and_distance

This essentially means that the distance we describe is not a weighted Jaccard distance, as far as I can tell at this point. The paper should probably be revised accordingly. Unfortunately, I have not yet found out what this kind of comparison is called.
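For reference, a minimal sketch of the weighted Jaccard similarity as defined on the linked Wikipedia page (sum of element-wise minima divided by sum of element-wise maxima). The function names and the example weights are hypothetical and are not taken from the study's code; returning 1.0 for two all-zero vectors is a convention chosen here.

```python
def weighted_jaccard(x, y):
    """Weighted Jaccard similarity of two non-negative weight vectors.

    x and y are dicts mapping items (e.g. sounds) to weights.
    Missing items count as weight 0.
    """
    keys = set(x) | set(y)
    numerator = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    denominator = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    # Convention (assumption): two all-zero vectors are fully similar.
    return 1.0 if denominator == 0 else numerator / denominator


def weighted_jaccard_distance(x, y):
    """Weighted Jaccard distance, i.e. one minus the similarity."""
    return 1.0 - weighted_jaccard(x, y)


# Hypothetical weights, for illustration only.
a = {"p": 1.0, "t": 1.0, "k": 0.5}
b = {"p": 1.0, "t": 0.5, "s": 1.0}
print(weighted_jaccard(a, b))
```

With binary weights (every present item weighted 1) this reduces to the ordinary Jaccard index, which makes it easy to check whether a given measure is a weighted Jaccard or something else.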

Recompute with Kendall's tau

Suggestion by @SimonGreenhill; this can in fact be done quickly in the Python script:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html
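To illustrate what the statistic measures, here is a dependency-free sketch of Kendall's tau-a (concordant minus discordant pairs, divided by the number of pairs). It is only an illustration: `kendall_tau_a` and the example sizes are hypothetical, and in the actual script one would presumably call `scipy.stats.kendalltau`, which by default computes the tie-corrected tau-b and also returns a p-value.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a for two equally long sequences (no tie correction)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Pairs tied in x or y count as neither concordant nor discordant.
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical inventory sizes for the same five varieties in two datasets.
sizes_a = [20, 25, 31, 44, 52]
sizes_b = [22, 24, 35, 40, 60]
print(kendall_tau_a(sizes_a, sizes_b))
```

Because tau depends only on the ranking of the inventory sizes, it is less sensitive to a few very large inventories than a correlation computed on the raw counts.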

Application for Checking Individual Pairings in the Data

@cormacanderson, this is quite important, as I made an app where you can check individual differences:

https://digling.org/phonobank/

If you paste in a glottocode (sorry, this is trial and error, you need
to know if it is in the data) and then press OK, you will see a
comparison of the data, down to the sources (if they are available).

It also offers both a comparison of the raw entries and a comparison of
the strictly CLTS-mapped entries.

I think this is A) something that one could use to visualize a couple of
examples in the paper, and B) a useful enough tool for Cormac to pull
out some interesting examples.

BTW: Italian is not in the 70-item version, since in that
version it has triphthongs, so it was excluded. But one can keep it
as an anecdote.
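The kind of pairwise check the app offers can be sketched as a plain set comparison, done once on the raw entries and once after normalization. Everything here is hypothetical: `compare`, the toy inventories, and especially `TOY_MAPPING`, which merely stands in for a real CLTS mapping and is not taken from CLTS.

```python
def compare(inv_a, inv_b):
    """Split two sound inventories into shared and dataset-specific sounds."""
    a, b = set(inv_a), set(inv_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Toy normalization table (assumption, NOT the CLTS mapping):
# ASCII "g" to IPA "ɡ", and the ligature "ʧ" to the digraph "tʃ".
TOY_MAPPING = {"g": "ɡ", "ʧ": "tʃ"}


def normalize(inventory):
    """Replace each sound by its mapped form, if the table has one."""
    return [TOY_MAPPING.get(sound, sound) for sound in inventory]


# Hypothetical raw inventories for one variety in two datasets.
raw_a = ["p", "t", "k", "g", "ʧ"]
raw_b = ["p", "t", "k", "ɡ", "tʃ", "s"]

raw_result = compare(raw_a, raw_b)
mapped_result = compare(normalize(raw_a), normalize(raw_b))
print(raw_result)
print(mapped_result)
```

In this toy case the raw comparison reports differences that are purely notational, while the mapped comparison leaves only the one substantive difference.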

Compute the Deltas

@cormacanderson, I have now computed the differences for the different aspects (consonantal = consonants + clusters, vocalic = vowels + diphthongs):

# Phoible / JIPA
               Correlation    P-Value    Deltas    Sample
-----------  -------------  ---------  --------  --------
Sounds              0.8185     0.0000    5.1364        66
Consonants          0.7464     0.0000    3.0833        66
Vowels              0.8745     0.0000    1.7121        66
Consonantal         0.7481     0.0000    3.0833        66
Vocalic             0.8568     0.0000    2.8939        66
Ratio               0.8023     0.0000    0.7777        66

# Phoible / LAPSYD
               Correlation    P-Value    Deltas    Sample
-----------  -------------  ---------  --------  --------
Sounds              0.7996     0.0000    5.3514        74
Consonants          0.8821     0.0000    2.9662        74
Vowels              0.7198     0.0000    2.1554        74
Consonantal         0.8841     0.0000    2.9662        74
Vocalic             0.6816     0.0000    3.0135        74
Ratio               0.6532     0.0000    0.6540        74

# Phoible / UPSID
               Correlation    P-Value    Deltas    Sample
-----------  -------------  ---------  --------  --------
Sounds              0.5104     0.0040    9.0833        30
Consonants          0.6322     0.0002    4.9000        30
Vowels              0.7287     0.0000    2.6500        30
Consonantal         0.6322     0.0002    4.9000        30
Vocalic             0.7103     0.0000    3.7167        30
Ratio               0.7432     0.0000    1.1224        30

# JIPA / LAPSYD
               Correlation    P-Value    Deltas    Sample
-----------  -------------  ---------  --------  --------
Sounds              0.8298     0.0000    6.6707        41
Consonants          0.8301     0.0000    4.8780        41
Vowels              0.8159     0.0000    1.3659        41
Consonantal         0.8301     0.0000    4.8780        41
Vocalic             0.8297     0.0000    2.0854        41
Ratio               0.6515     0.0000    0.7726        41

# JIPA / UPSID
               Correlation    P-Value    Deltas    Sample
-----------  -------------  ---------  --------  --------
Sounds              0.3594     0.1004   12.0455        22
Consonants          0.6357     0.0015    8.0909        22
Vowels              0.6102     0.0026    3.3636        22
Consonantal         0.6357     0.0015    8.0909        22
Vocalic             0.6214     0.0020    4.9545        22
Ratio               0.5082     0.0157    1.8084        22

# LAPSYD / UPSID
               Correlation    P-Value    Deltas    Sample
-----------  -------------  ---------  --------  --------
Sounds              0.8368     0.0000    3.2199       141
Consonants          0.9101     0.0000    1.2553       141
Vowels              0.6933     0.0000    1.9716       141
Consonantal         0.9102     0.0000    1.2553       141
Vocalic             0.7220     0.0000    2.1348       141
Ratio               0.6875     0.0000    0.6477       141
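A minimal sketch of how such figures could be derived from per-class counts. The two aggregate categories follow the definitions given above; treating "Deltas" as the mean absolute difference in counts and "Correlation" as Pearson's r are assumptions, since neither is spelled out here, and all names and numbers are hypothetical.

```python
from math import sqrt


def aggregate(counts):
    """Add the derived categories to a dict of per-class sound counts."""
    out = dict(counts)
    out["consonantal"] = counts["consonants"] + counts["clusters"]
    out["vocalic"] = counts["vowels"] + counts["diphthongs"]
    return out


def mean_abs_delta(xs, ys):
    """Mean absolute difference between paired counts (assumed 'Deltas')."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient (assumed 'Correlation')."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical counts for the same three varieties in two datasets.
dataset_a = [
    {"consonants": 20, "clusters": 2, "vowels": 5, "diphthongs": 1},
    {"consonants": 25, "clusters": 0, "vowels": 7, "diphthongs": 2},
    {"consonants": 30, "clusters": 4, "vowels": 10, "diphthongs": 0},
]
dataset_b = [
    {"consonants": 21, "clusters": 0, "vowels": 5, "diphthongs": 0},
    {"consonants": 25, "clusters": 2, "vowels": 8, "diphthongs": 2},
    {"consonants": 28, "clusters": 3, "vowels": 9, "diphthongs": 1},
]

cons_a = [aggregate(c)["consonantal"] for c in dataset_a]
cons_b = [aggregate(c)["consonantal"] for c in dataset_b]
print(mean_abs_delta(cons_a, cons_b), pearson(cons_a, cons_b))
```

Keeping the aggregation separate from the statistics makes it straightforward to swap in another measure, such as a rank correlation, without touching the counting step.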

Lingtyp discussion on sound inventory datasets

The Lingtyp discussion is quite instructive and should be cited in the study, as it reflects major reservations and also misconceptions about inventory databases.

See specifically Moran's answer.