concepticon-data's Introduction

Concepticon data curation

The data underlying the Concepticon is maintained in this repository. Released versions of this data are distributed as CLDF datasets, uploaded to Zenodo from the concepticon-cldf repository.

Here, you can find

Concepticon Data

  • For an overview of the status of all currently linked conceptlists, see here.
  • For information on how you can contribute to the project or profit from the data sources we offer, see here.

Data Structure

  • conceptlists/ folder contains conceptlists with links to IDs in concepticon.tsv. The lists are named after the first person who proposed them, the year of the reference publication from which we extracted them, and the number of concepts. These three parts are separated by dashes. In cases where two lists would have an identical name, we add alphabetical letters to the list names to distinguish them. Files need to have the column "GLOSS" (some still have "ENGLISH" instead, but this needs to be changed). Additionally, most (if not all) files have a "NUMBER" field indicating the number in the reference, which is also important for ordering the entries as given in the original source. Additional columns are more or less free to the user, but we tried to be consistent.

    Some concept lists are based on sources that may change and thus require a mechanism for re-creation. In this case, there will be a directory named after the list, containing the relevant curation scripts.

    Concept lists may contain information about relations between concepts. If so, such relations must be stored as content of columns named LINKED|SOURCE|TARGET_CONCEPTS. The values for these columns must be

    • lists of edge objects, where
    • the concept described in the same row is assumed to be one node of the edge,
    • the second node is specified via a property ID the value of which must be a concept identifier in the list,
    • serialized as JSON.

    Edges in the graph described in LINKED_CONCEPTS are considered undirected, whereas edges in SOURCE|TARGET_CONCEPTS are considered directed, with the concepts specified in the edge objects identifying the SOURCE or TARGET, respectively, of the edge.

  • conceptlists.tsv contains metadata about the lists in conceptlists/.

  • references/references.bib is the BibTeX file showing links to all concept lists (BibTeX key identical to the name of the conceptlist file, without the file ending). The file further contains links to the references in which the conceptlists were published (references stored in the "crossref" field).

  • sources/ contains PDF files of each conceptlist (only the list parts, not the full publications, for copyright reasons). Naming is the same as for the conceptlists, but with the ending ".pdf" instead of ".tsv".

  • concepticon.tsv is the backbone concept list. All concepts from individual concept lists are linked to entries in this file.

  • app/ contains data for running the JavaScript-based Concepticon lookup tool.
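
The edge format described above for LINKED|SOURCE|TARGET_CONCEPTS can be illustrated with a minimal sketch. It assumes that each row carries its own concept identifier in a column named ID (an assumption made for this example), and the list name Example-2000-3 is hypothetical. The function parse_edges is not part of any existing tool.

```python
import json


def parse_edges(row, column):
    """Return the edges described by one relation column of a concept list row.

    Directed edges are returned as (source, target) pairs; undirected edges
    (LINKED_CONCEPTS) are returned as pairs in sorted order.
    """
    raw = row.get(column) or "[]"
    edges = []
    for edge in json.loads(raw):
        other = edge["ID"]  # the second node of the edge
        if column == "SOURCE_CONCEPTS":
            # The referenced concept is the source, this row is the target.
            edges.append((other, row["ID"]))
        elif column == "TARGET_CONCEPTS":
            # This row is the source, the referenced concept is the target.
            edges.append((row["ID"], other))
        else:
            # LINKED_CONCEPTS: direction carries no meaning.
            edges.append(tuple(sorted((row["ID"], other))))
    return edges


# Hypothetical row, as it might be read with csv.DictReader.
row = {
    "ID": "Example-2000-3-1",
    "GLOSS": "hand",
    "TARGET_CONCEPTS": '[{"ID": "Example-2000-3-2"}]',
}
print(parse_edges(row, "TARGET_CONCEPTS"))
```

Rows without a value in the given column simply yield an empty list of edges.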

Norms, Ratings and Relations associated with words and concepts

Before release 3.0, this repository contained metadata linked to Concepticon concept sets. With release 3.0, this data moved to a separate (though related) project - NoRaRe. For the curation and publication workflow of NoRaRe data see https://github.com/concepticon

Update policy

We try to release concepticon-data (as well as the CLDF dataset and the concepticon web app) regularly, at least once a year. Generally, new releases should only become more comprehensive, i.e. all data ever released should also be part of the newest release. Occasionally, though, we may have to correct an erratum, which may result in some data being removed or in changes to identifiers of objects. So whenever a link to the web app breaks or a script using the concepticon-data API throws an error, you should consult the list of errata to see whether an error correction may be the reason for this behaviour.

pyconcepticon

pyconcepticon provides a Python package to programmatically access Concepticon data.

concepticon-data's People

Contributors

anaphory, annikatjuka, blag, carolinhu, chrzyki, cysouw, evoling, fredericblum, ilchec, kristina-pianykh, laiyunfan, lannin, lingulist, macyl, marthuis, martino-vic, mathildavz, mottaam, muffinlinwist, natalia-morozova, phylostar, schweikhard, simongreenhill, stasreichert, tresoldi, wu-urbanek, xrotwang

concepticon-data's Issues

Ardila's 2007 wordlist on cross-linguistic naming tests

This is a list proposed for neurology, based on Swadesh lists (humanities feeds science), in which the author proposes to use the Swadesh lists to test for aphasia, strokes, etc. The list has already been typed off but needs to be mapped.

lee and hasegawa (japonic)

don't know yet how many concepts, but supplementary is available (pdf, unfortunately, so will have to type it off)

MRC Psycholinguistic Database might give interesting subsets

The MRC Psycholinguistic Database contains tons of metadata which may be interesting for conceptual studies. It is only difficult to link it and it should be limited to a subset of the data (maybe a concept list itself that offers the meta-data for a specific purpose).

This is no issue of concrete hurry, but I consider it useful to collect conceptual metadata that could be partially linked at some point...

new lists by Anvita Abbi to be scanned and added

There are, apparently, proposed regional Swadesh lists for Indian languages in the book by A. Abbi (I looked at the TOC, and there seem to be some 2 or 3 new lists of different size):

Abbi, Anvita (2001). A Manual of Linguistic Field Work and Structures of Indian
Languages. München: Lincom Europa.

Is there a copy of that book in Marburg, @cysouw, and if so, could you send a student to scan it (or the relevant parts)?

dixon's list has wrong numbers

In the original Dixon list, the number 14 is missing in the source itself (apparently an entry for ear (1) which is simply not there). Furthermore, our current entry "12" is entry 11a in Dixon's list. This should be quickly updated after IDs have been changed.

Sidwell-2015-20X

Paul Sidwell just published (or soon publishes) his data on Austro-Asiatic officially, so we can quote it and map it. I have the list, also the draft and the quotation (should re-confirm it is the same as in the data-file). It is interesting, since it gives a South-East-Asian perspective on stable words, similar to the lists by Matisoff and others.

Tests the Concepticon

When working on the concepticon, more things come to mind, which we need tests for, and I suppose we summarize them here, to not forget about this.

  • tests for concepticon.tsv should check for the number of categories, like person/thing, but also for the rough semantic field (animals, the body), etc., since spelling differences may create errors here, and the categories and fields are actually fixed

Of course, more tests are surely needed, but this is what I can think about at the moment.
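
A test along these lines could be sketched as follows. The column name ONTOLOGICAL_CATEGORY and the set of allowed values are assumptions made for this illustration; the actual fixed vocabulary would have to be taken from the project itself.

```python
import csv
import io

# Hypothetical fixed vocabulary, to be replaced by the project's real one.
ALLOWED_CATEGORIES = {
    "Person/Thing", "Action/Process", "Property", "Classifier", "Number", "Other",
}


def unexpected_values(rows, column, allowed):
    """Return the sorted values found in `column` that are not in `allowed`."""
    return sorted({row[column] for row in rows if row[column] not in allowed})


# Tiny stand-in for concepticon.tsv with one misspelled category.
sample = (
    "ID\tGLOSS\tONTOLOGICAL_CATEGORY\n"
    "1\tHAND\tPerson/Thing\n"
    "2\tDOG\tPerson/thing\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
print(unexpected_values(rows, "ONTOLOGICAL_CATEGORY", ALLOWED_CATEGORIES))
```

The same helper could be reused for the semantic fields by passing a different column name and vocabulary.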

Lists for cross-linguistic naming tests

Interestingly, historical linguists now have another potential collaboration partner: neurologists. Neurologists use "naming tests" to assess the degree of aphasia and the like, and they have also realized that it might be interesting to look at Swadesh lists:

http://www.sciencedirect.com/science/article/pii/S088761770700011X

The list contains forty items, but this time chosen for practical criteria important in neurology.

Data is already converted to table form. All that's needed is to link it (and this will be quick). The resource also has a nice collection of forty photographs (apparently for aphasia situations, where the doctor asks the patients to tell what they see), which we can link in the URL.

Provide mapping example using the python script on the website or to be linked on github

When launching the concepticon in version 1.0, there should be one testing example showing how users can map their own lists. This requires a little information and some reference to LingPy, since the code for comparing glosses is implemented in the meaning module of LingPy, but it seems worthwhile to have a brief description and a link on the main page, so that whoever wants to can use the resource to quickly link a list.

match concepticon general with STEDT taxonomy?

this will be tedious, but they have this nice historically informed taxonomy in STEDT:

maybe having the concepticon mapped to it would be nice.

Problem is the size, and again the question of what to do, if something is just too big, so that we can only take a small part for the concepticon? Is that still the same kind of "concept list", or is it something else? I am thinking of major linkings to

  • wordnet (we have an indirect one, but no "real")
  • STEDT
  • some association data (age of acquisition or whatever)
  • wikipedia (most definitions for plants and animals were now directly taken from there)

But I have the feeling that we should not call these things "concept lists", and we won't be able to get full coverage for the whole bunch of about 2500 concept sets we have at the moment.

Generally, there are two possibilities:

  • (a) decide upon a subset of a certain number of concepts and link this subset as a concept list
  • (b) open a new category for meta-data which was assembled directly for the concept sets

I think that (b) would be better, also for users to find what they are looking for...

list by Bengtson and Ruhlen

This list has just been typed off by me. It contains 27 items with GLOBAL etymologies, so they say they represent some proto-world. Surprisingly, there are two words for "leg". Anyway, the concepts are a bit strange, containing many semantic-shift meanings (breast-milk), but this is interesting in the context of comparing it with other areal lists, where they merge meanings, like meat/animal, and the like.

Conceptlist from Grollemund et al. 2015

In "Bantu expansion shows that habitat alters the route and pace of human dispersals", Grollemund et al. use wordlists for ~420 Bantu languages containing words for 100 concepts, sampled from 159 concepts in

Hombert J-M, Van der Veen L, Medjo Mve P (2011) ALGAB, Atlas Linguistique du GABon (Laboratoire Dynamique du Langage, Lyon).
Available at http://www.ddl.ish-lyon.cnrs.fr/equipes/index.asp?Langue=FR&Equipe=8&Page=Action&ActionNum=48
Accessed November 10, 2014.

Badges for the Concept-Lists

In order to allow for quality control and the like, it would be great if we could create badges for all concept lists processed. This would answer questions like:

  1. are there mergers inside the list, that is, are two or more concepts linked to the same concept sets?
  2. [maybe silly, but might be interesting] how often do the concepts in the very list occur in other lists on average, that is, how "unique" is the list regarding its inventory
  3. [also not necessary, but good for proof-checking] how large is the Levenshtein distance on average between the concept labels in the list and the concept labels of the concepticon
  4. which lists are most similar in terms of overlaps in concepts to the given list

I think this may be some interesting information we could assemble automatically whenever parsing the concepticon-data, and it would be worthwhile to show the information. It may be enough to write a script that computes the values and adds them automatically to the file conceptlists.tsv, since it is probably there that the information would be displayed afterwards...
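
As a rough sketch of how the first and the last of these values could be computed: the example below assumes each concept list has already been reduced to the sequence of concept set IDs it links to, and it uses the Jaccard index as one possible overlap measure (a choice made here, not something fixed by the project). The list names and IDs are invented.

```python
from collections import Counter


def mergers(concept_ids):
    """Concept set IDs that more than one concept in a list is linked to."""
    counts = Counter(cid for cid in concept_ids if cid)
    return sorted(cid for cid, n in counts.items() if n > 1)


def overlap(ids_a, ids_b):
    """Jaccard index between the concept sets covered by two lists."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def most_similar(name, lists):
    """Names of all other lists, ordered from most to least overlapping."""
    scored = [
        (overlap(lists[name], ids), other)
        for other, ids in lists.items()
        if other != name
    ]
    return [other for score, other in sorted(scored, key=lambda x: (-x[0], x[1]))]


# Invented data: list name -> linked concept set IDs, in list order.
lists = {
    "A-2000-4": ["1", "2", "2", "3"],
    "B-2001-3": ["1", "2", "3"],
    "C-2002-2": ["3", "9"],
}
print(mergers(lists["A-2000-4"]))
print(most_similar("A-2000-4", lists))
```

Unlinked concepts (empty IDs) are ignored when counting mergers, so they do not show up as a spurious merger.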

Missing concept lists

Currently we have metadata on 58 concept lists in conceptlists.tsv but only 50 concept lists in conceptlists/. The missing ones are

  • Kassian-2010-110
  • Gabelentz-1891-120
  • Snider-2004-1700
  • Zorc-1974-100
  • Shevoroshkin-1991-23
  • Swadesh-1960-100
  • Marsden-1782-50
  • Pallas-1785-441

We should either add these - maybe even as empty stubs - or remove them from conceptlists.tsv, I think. In case we add empty concept list files for these, we might need a flag signaling the process status in conceptlists.tsv, to prevent them from being imported in the concepticon app.
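
A check of this kind could be sketched as below. It assumes that conceptlists.tsv identifies each list in a column named ID and that list files are named after that identifier with the ending ".tsv"; the function name is made up for this example.

```python
import csv
from pathlib import Path


def missing_lists(repo):
    """Return list names described in conceptlists.tsv without a file in conceptlists/."""
    repo = Path(repo)
    with open(repo / "conceptlists.tsv", encoding="utf-8", newline="") as f:
        described = {row["ID"] for row in csv.DictReader(f, delimiter="\t")}
    # File names without the ".tsv" ending are the list names.
    present = {path.stem for path in (repo / "conceptlists").glob("*.tsv")}
    return sorted(described - present)
```

Called with the path to a checkout of the repository, the function returns the names of the lists that have metadata but no data file.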

BLESS-Data on semantic associations

The bless dataset offers some interesting semantic associations (hyperonymy, etc.) for 200 concrete words.

The data may be interesting for the concepticon, since it offers additional accounts on semantic relations between words/concepts.

Uralex Data should best be added before next release

Given how closely we work with Uralex in lexibank and the like, we should try to have an official uralex version for the next release (this is quick to do, but we need the good list with authorisation of the uralex people).

Testing routine for wrongly assigned links based on distribution of concept labels

There's a simple test we can make and which enables us to find obviously wrong links:

  • assemble all concept labels in a dictionary
  • count and store the different concepticon-ids which are assigned to the same label
  • list the labels which are potentially problematic

The point is the following: when mapping, one may overlook a very good concept set and link instead to another one, as I probably did with "no, not", which I inconsistently linked to either "NO" or "NO OR NOT". So here, the above-stated procedure would show which lists have the same concept labels, but link them differently.

Note, however, that this procedure cannot be fully automated. In some cases, I linked words based on the information of concept labels in other languages (compare "dull" linked to "dull (of knife)" and "dull (stupid)"), because in some lists, I have Chinese translations and therefore know better which concept was intended.

So the routine can either be integrated in the testing itself, in which case it should only throw warnings, not errors, or it can be used separately, as a tool for those who link a new list to the concepticon.
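
The three steps listed above can be sketched as follows. The column names GLOSS and CONCEPTICON_ID, the lower-casing of labels, and the IDs in the sample rows are assumptions made for this illustration; in line with the remarks above, the result is reported as warnings only.

```python
import warnings
from collections import defaultdict


def inconsistent_labels(rows):
    """Map each concept label to the concept set IDs it is linked to.

    Only labels linked to more than one concept set are returned.
    """
    links = defaultdict(set)
    for row in rows:
        label = row["GLOSS"].strip().lower()
        if row["CONCEPTICON_ID"]:
            links[label].add(row["CONCEPTICON_ID"])
    return {label: sorted(ids) for label, ids in links.items() if len(ids) > 1}


# Invented rows from different concept lists; the IDs are placeholders.
rows = [
    {"GLOSS": "no, not", "CONCEPTICON_ID": "101"},
    {"GLOSS": "No, not", "CONCEPTICON_ID": "102"},
    {"GLOSS": "hand", "CONCEPTICON_ID": "103"},
    {"GLOSS": "hand", "CONCEPTICON_ID": "103"},
]
for label, ids in inconsistent_labels(rows).items():
    # Warn instead of failing: some divergent links are deliberate.
    warnings.warn(f"label {label!r} is linked to several concept sets: {ids}")
```

Labels such as "dull" would still be reported by this sketch, so the output needs to be reviewed by hand rather than treated as a list of errors.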

Concepticon data (original stuff by Good et al.) doesn't provide links to the concepticon concepts

I have already converted the Good-data containing IDS, WOLD, and one further mapping (Usher-Whitehouse) to CSV. However, we have a problem with the URLs there, since they do create an error on the website:

Should we still leave those urls in the file, or just discard them? The concepticon IDs, which are referenced in the data, seem to be OK, as far as I checked.

Blust-1981-200

Robert Blust was so nice to give me the first alternative Swadesh list he created (presented in a talk from 1981). This list is thus already digital, but still needs to be linked (since it is much more precise than the list of the ABVD).

Provide mapping to babelnet, where possible

Babelnet is a very nice resource that defines its own synsets and maps them to omegawiki, princeton wordnet, MultiWordNet, and wikipedia. We should try to map our concepts to it, where possible. I am currently preparing an automatic pre-mapping using their API. It is probable that we will have unmappable items, but if we could provide some coverage of about 80% (which seems realistic), it would be very nice. To test babelnet, check out the babelfy website, where one can just insert words and see which synsets they infer for them.

If the mapping can be provided to, say, 80% of our concepts, we can delete the omegawiki-links, since they should be included in babelnet (and it is unlikely we are able to add more than the ones we already have, given that the api of omegawiki is working so slowly...).

Typo: rain

Hello from Tübingen!

First of all, thank you all for building and maintaining this project!

Here are a couple of typos that I found in the concept names:

  • RAIN (PRECIPATION) --> RAIN (PRECIPITATION) // ID: 658
  • FRONT TOOTH (INEISOR_ --> FRONT TOOTH (INCISOR) // ID: 442

The latter has been fixed here in the repo, but apparently the fix has not reached the web server.

Kind regards,
Pavel

unify column names in concept lists

Concept lists should only have a GLOSS column if English is not among the list's source languages (or if an English gloss used in the source was corrected, in which case the column ENGLISH would list the gloss as found in the source, and GLOSS the corrected version). Otherwise, the column GLOSS should be renamed to ENGLISH.

LOGOS children dictionary with flashcards

The Logos Children dictionary has translations for some 1000 concepts into some 60 languages. It may be interesting to link these concepts, especially since they offer images for each of the concepts, which may be useful to have for certain purposes.

Sutton and Walsh -- 1987

I don't have access to this book, but it could be useful to include it, since it seems to offer new concept lists for the Australian region:

Sutton, P. and M. Walsh (1987). Wordlist for Australian Languages. Canberra:
Australian Institute of Aboriginal Studies.

Concept list of more than 1000 concepts by Pelkey-2011

In the SIL study on Phula languages (Sino-Tibetan family), two lists are given, one master-list of 1100 and more concepts (difficult to map), and one list of etyma, for which according to the author a 660 item masterlist from another source was taken (apparently based on Bradley 1979). How well this list can be added depends especially on how well the concepts can be extracted from the PDF and how nicely they are described.

Verify that concepts meaning "thin" are now correctly handled

In our online-alpha-version, the words for "thin" all mean different things (as can be seen from the Chinese labels). The concepticon major list has already been adjusted in this sense, offering enough concepts here, but it needs to be confirmed that the links are actually applied to all the data now.

Missing russian source concepts

The following three concept lists list Russian as source language, but do not have a corresponding column RUSSIAN:

  • Starostin-1991-110
  • Jachontov-1991-100
  • Jachontov-1991-65
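
Such gaps could be detected automatically with a small helper like the one below. It assumes that the source languages of a list are available as a comma-separated string (for instance from a metadata column, whose name is not fixed here) and that the corresponding columns in the list file are the upper-cased language names.

```python
def missing_language_columns(source_languages, header):
    """Return upper-cased source languages that have no column in `header`."""
    expected = {
        lang.strip().upper() for lang in source_languages.split(",") if lang.strip()
    }
    return sorted(expected - {column.upper() for column in header})


# Hypothetical metadata value and file header for one concept list.
print(missing_language_columns(
    "english, russian", ["ID", "NUMBER", "ENGLISH", "CONCEPTICON_ID"]
))
```

An empty result means that every declared source language has its own column.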

Handling identical or almost identical lists across multiple publications

Following up from the discussion in the PR #58, I thought it was useful to turn this into an issue.

It seems that with the lists by Shirō Hattori, we have an excellent example of identical concept lists across different publications. There is still the question of how to handle it. A simple solution would be to list several references, and indicate in the note-column of conceptlists.tsv which additional data-point stems from which list (suppose a list which had only English, but has added Japanese in a later identical edition). This is feasible, since there are not too many lists where this needs to be done. Another possibility would be to add yet another column to conceptlists.tsv, which might be called USEDBY or something similar, indicating whether this very list was used in further publications. The list by Dyen et al. 1992/1997 would be a use case for this, since it is identical with the list Swadesh-1952-200, except that they use uppercase where Swadesh used normal orthography. If we otherwise keep following a policy by which we say that, theoretically, no two lists are the same (and one can argue about this), this would mean that Dyen 1992 should also be added, and Hattori's lists should be split into three or four.

We basically have a small problem of ontology and epistemology already now in the concepticon, since it is clear that we cannot just trust the papers if they say they used the list by Swadesh, for example, since they often use new concept labels, but claim they are based on some list. So it might be the most coherent way to add all lists we can get, but this may then turn out to be redundant, so the "ALSO-USED-By" (or whatever better label) may be a compromise solution.

But I'm by no means completely convinced by either of the solutions mentioned above...

Refine kinship terms and add parabank list

We should refine the kinship terms by following what they have in ParaBank. The question is how to handle the relations, but maybe we can just put this to the parabank-list to be added.

Buck -- 1949

This list has been digitized by many people, but they all differ from the original. I have now digitized the full list myself (with a little help of OCR). It still needs re-editing, but afterwards, it should be mapped as completely as possible to the Concepticon, since this list is historically quite important: it was proposed independently from Swadesh, and it was used in a couple of projects thereafter.

We could then also link this list to all available conlang lists, which are also interesting, since they made their own mapping which may be interesting to be compared with those made by the concepticon.

The lists can all be found here and they offer them for download in CSV.

Note that the Buck-1949 in the conlang archive is not identical with Buck's original. I made some quick tests and found that they do not retain the original wording and sometimes add certain words to the labels. If we quote Buck, it should be as close as possible to the original.

Add links to wikidata

I didn't know of wikidata before, but now that I checked it, it seems it could give us valuable information in many respects, not for all concept sets, of course, but probably for quite a few. Here's, for example, the wikidata for hand:

They also seem to have a good API, so one may be able to search more or less automatically for good matches...
