concepticon / concepticon-data

The curation repository for the data behind Concepticon.

linguistics concepts cross-linguistic-data

concepticon-data's Introduction

Concepticon data curation

The data underlying the Concepticon is maintained in this repository. Released versions of this data are distributed as CLDF datasets, uploaded to Zenodo from the concepticon-cldf repository.

Here, you can find

previous and current releases,
issues we are trying to handle,
the current unreleased form of the data, as well as
errata that have been corrected

Concepticon Data

For an overview of the status of all currently linked conceptlists, see here.
For information on how you can contribute to the project or profit from the data sources we offer, see here.

Data Structure

conceptlists/ folder contains conceptlists with links to IDs in concepticon.tsv. The lists are named after the first person who proposed them, the year of the reference publication from which we extracted them, and the number of concepts. These three parts are separated by dashes. Where two lists would otherwise have an identical name, we append alphabetical letters to distinguish them. Files need to have the column "GLOSS" (some still have "ENGLISH" instead, but this needs to be changed). In addition, most (if not all) files have a "NUMBER" field indicating the number in the reference, which is also important for ordering the entries as given in the original source. Additional columns are more or less free to the user, but we tried to be consistent.

Some concept lists are based on sources that may change, and thus require a mechanism for re-creation. In this case, there will be a directory named after the list, containing the relevant curation scripts.

Concept lists may contain information about relations between concepts. If so, such relations must be stored as content of columns named LINKED|SOURCE|TARGET_CONCEPTS. The values for these columns must be
- lists of edge objects, where
- the concept described in the same row is assumed to be one node of the edge,
- the second node is specified via a property ID the value of which must be a concept identifier in the list,
- serialized as JSON.
Edges in the graph described in LINKED_CONCEPTS are considered undirected, whereas edges in SOURCE|TARGET_CONCEPTS are considered directed, with the concepts specified in the edge objects identifying the SOURCE or TARGET, respectively, of the edge.
conceptlists.tsv contains metadata about the lists in conceptlists/.
references/references.bib is the BibTeX file with links to all concept lists (the BibTeX key is identical to the name of the conceptlist file, without the file ending). The file also contains links to the references in which the conceptlists were published (stored in the "crossref" field).
sources/ contains pdf-files of each conceptlist (only the list-parts, not the full publications for copyright reasons), naming is the same as for the conceptlists, but with the ending ".pdf" instead of ".tsv".
concepticon.tsv the backbone concept list. All concepts from individual concept lists are linked to entries in this file.
app/ contains data for running the JavaScript-based Concepticon lookup tool.
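
The relation columns described above can be read with the standard library alone. The following is a minimal sketch, not the project's own tooling: the sample rows, the list name Example-2000-3, and the assumption that each row carries its identifier in a column called ID are invented for illustration. Only the column names LINKED_CONCEPTS, SOURCE_CONCEPTS and TARGET_CONCEPTS and the ID property of the edge objects come from the description.

```python
import csv
import io
import json

# Hypothetical three-row concept list; identifiers and glosses are invented.
SAMPLE = (
    "ID\tGLOSS\tLINKED_CONCEPTS\tTARGET_CONCEPTS\n"
    'Example-2000-3-1\thand\t[{"ID": "Example-2000-3-2"}]\t\n'
    'Example-2000-3-2\tarm\t\t[{"ID": "Example-2000-3-3"}]\n'
    "Example-2000-3-3\tbody\t\t\n"
)


def read_edges(tsv_text):
    """Return (undirected, directed) edge sets from a concept list."""
    undirected, directed = set(), set()
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    known_ids = {row["ID"] for row in rows}
    for row in rows:
        for column in ("LINKED_CONCEPTS", "SOURCE_CONCEPTS", "TARGET_CONCEPTS"):
            cell = row.get(column) or ""  # the column may be absent or empty
            if not cell.strip():
                continue
            for edge in json.loads(cell):
                other = edge["ID"]
                if other not in known_ids:
                    raise ValueError(f"{row['ID']}: unknown concept {other!r}")
                if column == "LINKED_CONCEPTS":
                    # Undirected: the order of the two nodes does not matter.
                    undirected.add(frozenset((row["ID"], other)))
                elif column == "SOURCE_CONCEPTS":
                    # The edge object names the source, the row is the target.
                    directed.add((other, row["ID"]))
                else:
                    # The edge object names the target, the row is the source.
                    directed.add((row["ID"], other))
    return undirected, directed


undirected, directed = read_edges(SAMPLE)
print(undirected)
print(directed)
```

Storing undirected edges as frozensets means that an edge declared from both ends is counted only once.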

Norms, Ratings and Relations associated with words and concepts

Before release 3.0, this repository contained metadata linked to Concepticon concept sets. With release 3.0, this data moved to a separate (though related) project - NoRaRe. For the curation and publication workflow of NoRaRe data see https://github.com/concepticon

Update policy

We try to release concepticon-data (as well as the CLDF dataset and the concepticon web app) regularly, at least once a year. Generally, new releases should only become more comprehensive, i.e. all data ever released should also be part of the newest release. Occasionally, though, we may have to correct an erratum, which may result in some data being removed, or in changes to the identifiers of objects. So whenever a link to the web app breaks or a script using the concepticon-data API throws an error, you should consult the list of errata to see whether an error correction may be the reason for this behaviour.

pyconcepticon

pyconcepticon provides a Python package to programmatically access Concepticon data.

concepticon-data's Issues

Add links to wikidata

I didn't know of wikidata before, but now that I checked it, it seems it could give us valuable information in many respects, not for all concept sets, of course, but probably for quite a few. Here's, for example, the wikidata for hand:

https://www.wikidata.org/wiki/Q33767

They also seem to have a good API, so one may be able to search more or less automatically for good matches...

new lists by Anvita Abbi to be scanned and added

There are, apparently, proposed regional Swadesh lists for Indian languages in the book by A. Abbi (I looked at the TOC, and there seem to be some 2 or 3 new lists of different sizes):

Abbi, Anvita (2001). A Manual of Linguistic Field Work and Structures of Indian
Languages. München: Lincom Europa.

Is there a copy of that book in Marburg, @cysouw, and if so, could you send a student to scan it (or the relevant parts)?

Testing routine for wrongly assigned links based on distribution of concept labels

There's a simple test we can make and which enables us to find obviously wrong links:

assemble all concept labels in a dictionary
count and store the different concepticon-ids which are assigned to the same label
list the labels which are potentially problematic

The point is the following: when mapping, one may overlook a very good concept set and link to another one instead, as I probably did with "no, not", which I inconsistently linked to either "NO" or "NO OR NOT". So here, the above-stated procedure would show which lists have the same concept labels, but link them differently.

Note, however, that this procedure cannot be fully automated. In some cases, I linked words based on the information of concept labels in other languages (compare "dull" linked to "dull (of knife)" and "dull (stupid)"), because in some lists I have Chinese translations and therefore know better which concept was intended.

So the routine can either be integrated into the testing itself, in which case it should only throw warnings, not errors, or it can be used separately, as a tool for those who link a new list to the concepticon.
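
The three steps listed above could be sketched as follows. This is an illustration under stated assumptions, not existing project code: the input triples and the list names List-A and List-B are invented, the third field stands in for a Concepticon ID, and lowercasing the labels before comparison is a design choice of this sketch.

```python
from collections import defaultdict

# Hypothetical (list name, concept label, concept set) triples; real input
# would be read from the files in conceptlists/.
MAPPINGS = [
    ("List-A", "no, not", "NO"),
    ("List-B", "no, not", "NO OR NOT"),
    ("List-A", "hand", "HAND"),
    ("List-B", "Hand", "HAND"),
]


def conflicting_labels(mappings):
    """Map each label linked to more than one concept set to its targets."""
    # Step 1: assemble all concept labels in a dictionary.
    targets = defaultdict(lambda: defaultdict(list))
    for list_name, label, concept_set in mappings:
        # Step 2: store the different concept sets assigned to the same label.
        targets[label.strip().lower()][concept_set].append(list_name)
    # Step 3: keep only the labels that are potentially problematic.
    return {
        label: {cs: sorted(lists) for cs, lists in by_set.items()}
        for label, by_set in targets.items()
        if len(by_set) > 1
    }


# Report as warnings only, since some divergent links are intentional.
for label, by_set in sorted(conflicting_labels(MAPPINGS).items()):
    print(f"warning: {label!r} is linked to {sorted(by_set)}")
```

Because cases like "dull" are deliberate, the result is better treated as a review list than as a test failure.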

LOGOS children dictionary with flashcards

The Logos Children dictionary has translations for some 1000 concepts into some 60 languages. It may be interesting to link these concepts, especially since they offer images for each of the concepts, which may be useful to have for certain purposes.

Verify that concepts meaning "thin" are now correctly handled

In our online-alpha-version, the words for "thin" all mean different things (as can be seen from the Chinese labels). The concepticon major list has already been adjusted in this sense, offering enough concepts here, but it needs to be confirmed that the links are actually applied to all the data now.

unify column names in concept lists

Concept lists should only have a GLOSS column if English is not among the list's source languages (or if an English gloss used in the source was corrected, in which case the column ENGLISH would list the gloss as found in the source, and GLOSS the corrected version). Otherwise, the column GLOSS should be renamed to ENGLISH.

Test for conceptlist TAGS coming from a controlled list

The TAGS specified for concept lists in conceptlists.tsv should come from a controlled list as specified in the README.md.
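
Such a test might look like the sketch below. The set of allowed tags, the sample rows, and the assumption that TAGS holds comma-separated values are all hypothetical. The real controlled list is whatever README.md specifies.

```python
import csv
import io

# Hypothetical controlled vocabulary, standing in for the list in README.md.
ALLOWED_TAGS = {"basic", "areal", "historical"}

# Hypothetical excerpt of conceptlists.tsv with comma-separated TAGS.
SAMPLE = (
    "ID\tTAGS\n"
    "Example-2000-100\tbasic,areal\n"
    "Example-2001-50\tbasic,typo-tag\n"
    "Example-2002-10\t\n"
)


def invalid_tags(tsv_text, allowed):
    """Return {list ID: sorted unknown tags} for rows with tags outside `allowed`."""
    problems = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        tags = {t.strip() for t in (row["TAGS"] or "").split(",") if t.strip()}
        unknown = sorted(tags - allowed)
        if unknown:
            problems[row["ID"]] = unknown
    return problems


print(invalid_tags(SAMPLE, ALLOWED_TAGS))
```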

Concept list of more than 1000 concepts by Pelkey-2011

In the SIL study on Phula languages (Sino-Tibetan family), two lists are given, one master-list of 1100 and more concepts (difficult to map), and one list of etyma, for which according to the author a 660 item masterlist from another source was taken (apparently based on Bradley 1979). How well this list can be added depends especially on how well the concepts can be extracted from the PDF and how nicely they are described.

lee and hasegawa (japonic)

don't know yet how many concepts, but supplementary is available (pdf, unfortunately, so will have to type it off)

dixon's list has wrong numbers

In the original Dixon list, the number 14 is missing in the source itself (apparently an entry for ear (1) which is simply not there). Furthermore, our current entry "12" is entry 11a in Dixon's list. This should be quickly updated after IDs have been changed.

ULD-1600 "Universal language dictionary" resource with 1600 concepts in different languages

this is an interesting resource I was just pointed to by @kaleissin (from the CALS project):

http://www.uld3.org/uld27/index.html

They have a strange licence, though, but I think we should be allowed to provide links to their concepts, and since they provide an email of the copyright holder, we could directly contact him.

Handling identical or almost identical lists across multiple publications

Following up from the discussion in the PR #58, I thought it was useful to turn this into an issue.

It seems that with the lists by Shirō Hattori, we have an excellent example of identical concept lists across different publications. There is still the question of how to handle this. A simple solution would be to list several references and indicate in the note column of conceptlists.tsv which additional data point stems from which list (suppose a list that had only English, but added Japanese in a later, otherwise identical edition). This is feasible, since there are not too many lists where this needs to be done. Another possibility would be to add yet another column to conceptlists.tsv, called USEDBY or something similar, indicating whether this very list was used in further publications. The list by Dyen et al. 1992/1997 would be a use case for this, since it is exactly identical with Swadesh-1952-200, except that they use uppercase where Swadesh used normal orthography. If we otherwise keep following a policy by which we say that, theoretically, no two lists are the same (and one can argue about this), this would mean that Dyen 1992 should also be added, and Hattori's lists should be split into three or four.

We basically have a small problem of ontology and epistemology already now in the concepticon, since it is clear that we cannot just trust the papers if they say they used the list by Swadesh, for example, since they often use new concept labels, but claim they are based on some list. So it might be the most coherent way to add all lists we can get, but this may then turn out to be redundant, so the "ALSO-USED-By" (or whatever better label) may be a compromise solution.

But I'm by no means completely convinced by either of the solutions mentioned above...

Missing concept lists

Currently we have metadata on 58 concept lists in conceptlists.tsv but only 50 concept lists in conceptlists/. The missing ones are

Kassian-2010-110
Gabelentz-1891-120
Snider-2004-1700
Zorc-1974-100
Shevoroshkin-1991-23
Swadesh-1960-100
Marsden-1782-50
Pallas-1785-441

We should either add these - maybe even as empty stubs - or remove them from conceptlists.tsv, I think. In case we add empty concept list files for these, we might need a flag signaling the process status in conceptlists.tsv, to prevent them from being imported in the concepticon app.
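
A consistency check for this situation could be sketched as follows. It assumes that conceptlists.tsv identifies each list in a column called ID and that every list is stored as a .tsv file named after that identifier. The AUTHOR column and the file Orphan-2000-1.tsv are invented for the demonstration.

```python
import csv
import io
import tempfile
from pathlib import Path


def missing_and_orphaned(metadata_tsv, conceptlists_dir):
    """Compare list IDs in the metadata with the *.tsv files in a directory.

    Returns (IDs without a file, files without metadata), both sorted.
    """
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    ids = {row["ID"] for row in reader}
    files = {path.stem for path in Path(conceptlists_dir).glob("*.tsv")}
    return sorted(ids - files), sorted(files - ids)


# Demonstration in a temporary directory with made-up content.
METADATA = "ID\tAUTHOR\nSwadesh-1952-200\tSwadesh\nZorc-1974-100\tZorc\n"
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Swadesh-1952-200", "Orphan-2000-1"):
        (Path(tmp) / f"{name}.tsv").write_text("ID\tGLOSS\n", encoding="utf-8")
    missing, orphaned = missing_and_orphaned(METADATA, tmp)

print("in metadata, no file:", missing)
print("file, no metadata:", orphaned)
```

Reporting both directions also catches files that were added without a metadata row.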

Provide mapping example using the python script on the website or to be linked on github

When launching the concepticon in version 1.0, there should be one testing example showing how users can map their own lists. This requires a little information and some reference to LingPy, since the code for comparing glosses is implemented in the meaning module of LingPy, but it seems worthwhile to have a brief description and a link on the main page, so that anyone who wants to can use the resource to quickly link a list.

Boston naming test and derivatives

The neurologists have also their "Swadesh", the "boston-naming-test" (60 items):

http://aphasiology.pitt.edu/archive/00000069/

And they have their Dolgopolsky-short-version (15 items):

http://psychsocgerontology.oxfordjournals.org/content/57/2/P187/T1.expansion.html

Provide mapping to babelnet, where possible

BabelNet is a very nice resource that defines its own synsets and maps them to OmegaWiki, Princeton WordNet, MultiWordNet, and Wikipedia. We should try to map our concepts to it, where possible. I am currently preparing an automatic pre-mapping using their API. It is probable that we will have unmappable items, but if we could provide coverage of about 80% (which seems realistic), it would be very nice. To test BabelNet, check out the Babelfy website, where one can just insert words and see which synsets they infer for them.

If the mapping can be provided to, say, 80% of our concepts, we can delete the omegawiki-links, since they should be included in babelnet (and it is unlikely we are able to add more than the ones we already have, given that the api of omegawiki is working so slowly...).

Conceptlist from Grollemund et al. 2015

In "Bantu expansion shows that habitat alters the route and pace of human dispersals", Grollemund et al. use wordlists for ~420 Bantu languages containing words for 100 concepts, sampled from 159 concepts in

Hombert J-M, Van der Veen L, Medjo Mve P (2011) ALGAB, Atlas Linguistique du GABon (Laboratoire Dynamique du Langage, Lyon).
Available at http://www.ddl.ish-lyon.cnrs.fr/equipes/index.asp?Langue=FR&Equipe=8&Page=Action&ActionNum=48
Accessed November 10, 2014.

Sutton and Walsh -- 1987

I don't have access to this book, but it could be useful to include it, since it seems to offer new concept lists for Australian region:

Sutton, P. and M. Walsh (1987). Wordlist for Australian Languages. Canberra:
Australian Institute of Aboriginal Studies.

Add full author names in bibliography

Re-edit the bibliography in order to guarantee nice full names of all authors.

wordlist by the texas/austin pie-project

http://www.utexas.edu/cola/centers/lrc/iedocctr/ie-ling/ie-sem/index.html

They provide a semantic field coding which may be interesting when mapped to the concepticon.

Sidwell-2015-20X

Paul Sidwell just published (or soon publishes) his data on Austro-Asiatic officially, so we can quote it and map it. I have the list, also the draft and the quotation (should re-confirm it is the same as in the data-file). It is interesting, since it gives a South-East-Asian perspective on stable words, similar to the lists by Matisoff and others.

Wordlist by Schryver-2015 of 92 items (Kongo languages)

The paper is currently in press, and the link is here:

http://poj.peeters-leuven.be/content.php?url=article&id=3122579&journal_code=AL

They still need to upload the supplementary material, but they announce that this will be available in the paper, so I expect this to be similar in structure to the Grollemund-list.

Refine kinship terms and add parabank list

We should refine the kinship terms by following what they have in ParaBank. The question is how to handle the relations, but maybe we can just defer this to the parabank-list to be added.

Lists for cross-linguistic naming tests

Interestingly, historical linguists now have another potential collaboration partner: neurologists. Neurologists use "naming tests" to assess the degree of aphasia and the like, and they have also realized that it might be interesting to look at Swadesh lists:

http://www.sciencedirect.com/science/article/pii/S088761770700011X

The list contains forty items, but this time chosen for practical criteria important in neurology.

Data is already converted to table form. All that's needed is to link it (and this will be quick). The resource also has a nice collection of forty photographs (apparently for aphasia situations, where the doctor asks the patients to tell what they see), which we can link in the URL.

Badges for the Concept-Lists

In order to allow for quality control and the like, it would be great if we could create badges for all concept lists processed. This would answer questions like:

are there mergers inside the list, that is, are two or more concepts linked to the same concept sets?
[maybe silly, but might be interesting] how often do the concepts in the very list occur in other lists on average, that is, how "unique" is the list regarding its inventory
[also not necessary, but good for proof-checking] how large is the Levenshtein distance on average between the concept labels in the list and the concept labels of the concepticon
which lists are most similar in terms of overlaps in concepts to the given list

I think this may be some interesting information we could assemble automatically whenever parsing the concepticon-data, and it would be worthwhile to show the information. It may be enough to write a script that computes the values and adds them automatically to the file conceptlists.tsv, since this is probably where the information would be displayed afterwards...
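
A script computing such values might start from the sketch below. The three lists and their links are invented. Measuring overlap with the Jaccard index and comparing uppercased labels to the concept set glosses are assumptions of this sketch, not decisions taken in the project.

```python
from collections import Counter

# Hypothetical lists: each maps a concept label to a concept set gloss.
LISTS = {
    "List-A": {"hand": "HAND", "arm": "ARM", "arm/hand": "HAND"},
    "List-B": {"hand": "HAND", "foot": "FOOT"},
    "List-C": {"foot": "FOOT", "leg": "LEG"},
}


def mergers(concept_list):
    """Concept sets that more than one label of the list is linked to."""
    counts = Counter(concept_list.values())
    return sorted(cs for cs, n in counts.items() if n > 1)


def jaccard(a, b):
    """Overlap of two lists, measured on their sets of concept sets."""
    a, b = set(a.values()), set(b.values())
    return len(a & b) / len(a | b) if a | b else 0.0


def most_similar(name, lists):
    """Names of the other lists, ranked by decreasing overlap with `name`."""
    scores = [(jaccard(lists[name], other), key)
              for key, other in lists.items() if key != name]
    return [key for score, key in sorted(scores, key=lambda s: (-s[0], s[1]))]


def levenshtein(s, t):
    """Edit distance by the standard dynamic-programming recurrence."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (cs != ct),  # substitution or match
            ))
        previous = current
    return previous[-1]


def mean_label_distance(concept_list):
    """Average edit distance between uppercased labels and their glosses."""
    if not concept_list:
        return 0.0
    total = sum(levenshtein(label.upper(), gloss)
                for label, gloss in concept_list.items())
    return total / len(concept_list)


for name, concept_list in LISTS.items():
    print(name, mergers(concept_list), most_similar(name, LISTS),
          round(mean_label_distance(concept_list), 2))
```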

Remove obsolete OMEGAWIKI column from concept lists

Since mapping from individual concepts to the concepticon is now done by CONCEPTICON_ID, the OMEGAWIKI column is no longer needed in concept lists and should be removed.
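
Dropping the column from a file could be done along these lines. This is a sketch on an in-memory string with invented values. Applying it to the actual files in conceptlists/ would additionally need file reading and writing, and a check that the output matches the repository's TSV conventions.

```python
import csv
import io

# Hypothetical concept list excerpt; all values are invented.
SAMPLE = (
    "ID\tOMEGAWIKI\tCONCEPTICON_ID\tENGLISH\n"
    "Example-2000-2-1\t111\t222\thand\n"
    "Example-2000-2-2\t\t333\tfoot\n"
)


def drop_column(tsv_text, column):
    """Return the TSV text without `column`; other columns keep their order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = [name for name in reader.fieldnames if name != column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(reader)  # values of the dropped column are ignored
    return out.getvalue()


print(drop_column(SAMPLE, "OMEGAWIKI"))
```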

Snider-2004-1700

list by Bengtson and Ruhlen

This list has just been typed off by me. It contains 27 items with GLOBAL etymologies, so they say they represent some proto-world. Surprisingly, there are two words for "leg". Anyway, the concepts are a bit strange, containing many semantic-shift meanings (breast-milk), but this is interesting in the context of comparing it with other areal lists, where they merge meanings, like meat/animal, and the like.

Tests the Concepticon

When working on the concepticon, more things come to mind, which we need tests for, and I suppose we summarize them here, to not forget about this.

tests for concepticon.tsv should check for the number of categories, like person/thing, but also for the rough semantic field (animals, the body), etc., since spelling differences may create errors here, and the categories and fields are actually fixed

Of course, more tests are surely needed, but this is what I can think about at the moment.

wordlists provided by the comparalex project

http://comparalex.org/index.php

They offer quite a few specifically african wordlists, and seem to even have an underlying mapping of the concepts (but less clear to me). Anyway, it's worth a look to get more lists for the concepticon.

new potential list by mcelhanon

http://www.jstor.org/stable/3622923?origin=crossref&seq=1#page_scan_tab_contents

This is the jstor link. I don't have access to it from my current location, but will get it afterwards...

Ardila's 2007 wordlist on cross-linguistic naming tests

This is a list proposed for neurology, based on Swadesh lists (humanities feeds science), in which the author proposes to use the Swadesh lists to test for aphasia, strokes, etc. The list has already been typed off but needs to be mapped.

Word ratings data

http://crr.ugent.be/programs-data/word-ratings

This may be interesting to be mapped in parts.

It raises a general question of how to map: do we take a sample of all words we can get that we have concept sets for, or do we take a subsample, like IDS?

Bowern's Data on Hunter-Gatherers

This is also interesting in the light of Gram-Bank, they have lots of lexical words, and three lists, one of flora and fauna, one of basic vocabulary, and one of culture vocabulary:

https://huntergatherer.la.utexas.edu/lexical

At least basic vocabulary should be added.

kitchen-et-al-2012

95 concepts on semitic, list already copied from supplements

Mann -- 2004

The wordlist for mainland South-East Asia by Mann (2004) is described here, and contains some 400 items:

http://www.sil.org/resources/publications/entry/60549

The author also provides a comparison with other lists, mainly those by Matisoff etc.

Typo: rain

Hello from Tübingen!

First of all, thank you all for building and maintaining this project!

Here are a couple of typos that I found in the concept names:

RAIN (PRECIPATION) --> RAIN (PRECIPITATION) // ID: 658
FRONT TOOTH (INEISOR_ --> FRONT TOOTH (INCISOR) // ID: 442

The latter has been fixed here in the repo, but apparently the fix has not reached the web server.

Kind regards,
Pavel

BLESS-Data on semantic associations

The BLESS dataset offers some interesting semantic associations (hyperonymy, etc.) for 200 concrete words.

The data may be interesting for the concepticon, since it offers additional accounts on semantic relations between words/concepts.

Uralex Data should best be added before next release

Given how closely we work with Uralex in lexibank and the like, we should try to have an official uralex version for the next release (this is quick to do, but we need the good list with authorisation of the uralex people).

match concepticon general with STEDT taxonomy?

this will be tedious, but they have this nice historically informed taxonomy in STEDT:

http://stedt.berkeley.edu/~stedt-cgi/rootcanal.pl/chapters#2.0

maybe having the concepticon mapped to it would be nice.

Problem is the size, and again the question of what to do, if something is just too big, so that we can only take a small part for the concepticon? Is that still the same kind of "concept list", or is it something else? I am thinking of major linkings to

wordnet (we have an indirect one, but no "real")
STEDT
some association data (age of acquisition or whatever)
wikipedia (most definitions for plants and animals were now directly taken from there)

But I have the feeling that we should not call these things "concept lists", and we won't be able to get full coverage for the whole bunch of about 2500 concept sets we have at the moment.

Generally there are two possibilities:

(a) decide upon a subset of a certain number of concepts and link this subset as a concept list
(b) open a new category for meta-data which was assembled directly for the concept sets

I think that (b) would be better, also for users to find what they are looking for...

Link relevant concept sets to eol.org and/or gbif.org

To allow specialized lexical collections like Tsammalex to be linked to concepticon, it would be useful to provide EOL and GBIF taxon identifiers for relevant concept sets.

A wordlist of words ideal for compounds "Dublex"

Again a hint from the CALS project:

https://web.archive.org/web/20051222053121/http://langmaker.com/dublexcompoundinterest.htm

This is a list of words which are, as far as I understand at the moment, ideal to create compounds, so out of just 400 "roots", one can create 4500 new words. I still did not figure out the nature of the roots, if they're not too abstract, it might be interesting to link them.

Urum basic lexicon

http://urum.lili.uni-bielefeld.de/download/docs/uum-lexicon.pdf

This list draws from WOLD, adds 90 more concepts, and provides alternative categories. It is long, and it is a PDF, so there is no way to quickly extract a linking to the concepticon. The semantic categories would be interesting, though, but this is probably a long-term rather than a short-term list-to-map.

Concepticon data (original stuff by Good et al.) doesn't provide links to the concepticon concepts

I have already converted the Good-data containing IDS, WOLD, and one further mapping (Usher-Whitehouse) to CSV. However, we have a problem with the URLs there, since they do create an error on the website:

http://purl.org/linguistics/lego/concept/11

Should we still leave those urls in the file, or just discard them? The concepticon IDs, which are referenced in the data, seem to be OK, as far as I checked.

Missing russian source concepts

The following three concept lists list Russian as source language, but do not have a corresponding column RUSSIAN:

Starostin-1991-110
Jachontov-1991-100
Jachontov-1991-65
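
A general check for this kind of gap could be sketched as follows. The two dictionaries stand in for data that would be read from conceptlists.tsv and from the header rows of the concept list files. Their contents, including the lowercase language names, are invented apart from the list names and the missing RUSSIAN column reported above.

```python
# Hypothetical metadata: list name -> source languages.
SOURCE_LANGUAGES = {
    "Starostin-1991-110": ["english", "russian"],
    "Swadesh-1952-200": ["english"],
}

# Hypothetical header rows of the corresponding concept list files.
HEADERS = {
    "Starostin-1991-110": ["ID", "NUMBER", "ENGLISH", "CONCEPTICON_ID"],
    "Swadesh-1952-200": ["ID", "NUMBER", "ENGLISH", "CONCEPTICON_ID"],
}


def missing_language_columns(source_languages, headers):
    """Return {list name: uppercased source languages that lack a column}."""
    missing = {}
    for name, languages in source_languages.items():
        columns = set(headers.get(name, []))
        absent = [lang.upper() for lang in languages if lang.upper() not in columns]
        if absent:
            missing[name] = absent
    return missing


print(missing_language_columns(SOURCE_LANGUAGES, HEADERS))
```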

Pallas-1785-441

Metadata is already present in conceptlists.tsv.

MRC Psycholinguistic Database might give interesting subsets

The MRC Psycholinguistic Database contains tons of metadata which may be interesting for conceptual studies. It is only difficult to link it and it should be limited to a subset of the data (maybe a concept list itself that offers the meta-data for a specific purpose).

This is no issue of concrete hurry, but I consider it useful to collect conceptual metadata that could be partially linked at some point...

Buck -- 1949

This list has been digitized by many people, but all versions differ from the original. I have now digitized the full list myself (with a little help from OCR). It still needs re-editing, but afterwards, it should be mapped as completely as possible to the Concepticon, since this list is historically quite important: it was proposed independently of Swadesh, and it was used in a couple of projects thereafter.

We could then also link this list to all available conlang lists, which are also interesting, since they made their own mapping which may be interesting to be compared with those made by the concepticon.

The lists can all be found here and they offer them for download in CSV.

Note that the Buck-1949 in the conlang archive is not identical with Buck's original. I made some quick tests and found that they do not retain the original wording and sometimes add certain words to the labels. If we quote Buck, it should be as close as possible to the original.

Change all lists by Hattori Shirō: current version confuses first name and family name

It's embarrassing for one who studied Chinese, but I apparently confused family name and first name in all lists involving Shirō Hattori, so they should all be corrected, with Shiro-1973-200 then being called Hattori-1973-200.

Blust-1981-200

Robert Blust was so kind as to give me the first alternative Swadesh list he created (presented in a talk from 1981). This list is thus already digital, but still needs to be linked (since it is much more precise than the list of the ABVD).