
wordkit

This is the repository of the wordkit package, a Python 3 package for featurizing words into orthographic and phonological vectors.

Overview

wordkit is a package for working with words. The package contains a variety of functions that allow you to:

  • Extract words from lexical databases in a structured format.
  • Normalize phonological strings across languages and databases.
  • Featurize words for use in computational psycholinguistic models using the following features (a minimal sketch follows this list):
    • Open ngrams
    • Character ngrams
    • Holographic features
    • Consonant-Vowel (CV) mapping (patpho)
    • Onset-Nucleus-Coda (ONC) mapping
  • Find synonyms, homographs, and homophones across languages.
  • Fuse lexical databases, even across languages.
  • Sample from (subsets of) corpora by frequency of occurrence.

and much more.
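
As a taste of the featurization functionality, here is a minimal sketch of character-ngram featurization. It assumes, as in the larger example further down, that transformers accept a list of records containing the named field.

from wordkit.features import NGramTransformer

# Character bigrams over orthography.
words = [{"orthography": "wind"}, {"orthography": "kind"}]
n = NGramTransformer(n=2, field="orthography")
X = n.fit_transform(words)
X.shape  # (2, n.vec_len): one bigram count vector per word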

Installation

wordkit is on pip:

pip install wordkit

Examples

See the examples for some ways in which you can use wordkit. All examples assume you have wordkit installed (see above).

More

If, after working through the examples, you want to dive deeper into wordkit, check out the following documentation.

wordkit is a modular system and contains two broad families of components: readers and transformers. The subpackages are documented in separate README.md files; feel free to click through for descriptions of their contents.

In general, a wordkit pipeline consists of one or more readers, which extract structured information from corpora. This information is then passed to one or more transformers, which are given either pre-defined features or a feature extractor.
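
A minimal two-step sketch of this reader-to-transformer flow (the file name and field tuple follow the full example further down; the single-field tuple is an assumption):

from wordkit.corpora import celex_english
from wordkit.features import LinearTransformer, fourteen

# Reader: extract structured records from a corpus file.
words = celex_english("epw.cd", fields=("orthography",))
# Transformer with pre-defined features.
o = LinearTransformer(fourteen, field="orthography")
X = o.fit_transform(words)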

Paper

A paper that describes wordkit was accepted at LREC 2018. If you use wordkit in your research, please cite the following paper:

@InProceedings{TULKENS18.249,
  author = {Tulkens, Stéphan and Sandra, Dominiek and Daelemans, Walter},
  title = {WordKit: a Python Package for Orthographic and Phonological Featurization},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {may},
  date = {7-12},
  location = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {979-10-95546-00-9},
  language = {english}
  }

Additionally, if you use any of the corpus readers or feature transformers in wordkit, you MUST cite the accompanying corpora and feature schemes. All of these references can be found in the docstrings of the applicable classes.

Example

This example shows one big wordkit pipeline.

import pandas as pd

from wordkit.corpora import celex_english, celex_dutch
from wordkit.features import LinearTransformer, NGramTransformer, fourteen
from string import ascii_lowercase

# The fields we want to extract from our corpora.
fields = ('orthography', 'frequency', 'phonology', 'syllables')

# Link to the English wordform file (epw.cd)
english = celex_english("epw.cd",
                        fields=fields)
# Link to the Dutch wordform file (dpw.cd)
dutch = celex_dutch("dpw.cd",
                    fields=fields)

# Merge both corpora and give the result a fresh index.
words = pd.concat([english, dutch], sort=False).reset_index(drop=True)

# We filter both corpora to only contain monosyllables and words
# with only alphabetical characters
words = words[[len(x) == 1 for x in words["syllables"]]]
words = words[[not set(x) - set(ascii_lowercase)
              for x in words["orthography"]]]

# words.iloc[0] =>
# orthography                      a
# phonology                   (e, ɪ)
# syllables                ((e, ɪ),)
# frequency                   844672
# log_frequency              5.92669
# frequency_per_million        21363
# zipf_score                 4.32966
# length                           1

# You can also query specific words
wind = words[words['orthography'] == "wind"]

# This gives
# wind =>
#        orthography        phonology  ... zipf_score  length
# 146523        wind  (w, a, ɪ, n, d)  ...   0.015757       4
# 146524        wind     (w, ɪ, n, d)  ...   1.683096       4
# 313527        wind     (w, ɪ, n, t)  ...   2.042675       4

# Now, let's transform into features
# Orthography is a linear transformer with the fourteen segment feature set.
o = LinearTransformer(fourteen, field='orthography')
# For phonology we use ngrams.
p = NGramTransformer(n=3, field='phonology')

X_o = o.fit_transform(words)
X_p = p.fit_transform(words)

# Get the feature vector length for each featurizer
o.vec_len # 126
p.vec_len # 5415
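
If you want a single feature vector per word, the two matrices can be stacked with plain numpy (this is not a wordkit API, just the obvious post-processing step):

import numpy as np

# One concatenated vector per word: shape (n_words, 126 + 5415).
X = np.hstack([X_o, X_p])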

Corpora

wordkit currently offers readers for the following corpora. Note that, while we provide predefined fields for all of these corpora, any other field present in the data can be retrieved as well. The Lexicon Projects, for example, also contain lexicality information, accuracy information, and so on. These can be retrieved by passing the appropriate column names in the fields argument.
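
For example, a hedged sketch of retrieving an extra raw column from SUBTLEX-US. SUBTLCD is a contextual-diversity column in that file; that it can be requested under its original column name is an assumption.

from wordkit.corpora import subtlexus

# Ask for a raw column on top of the predefined fields.
words = subtlexus("SUBTLEXusfrequencyabove1.xls",
                  fields=("orthography", "frequency", "SUBTLCD"))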

BPAL

Download

You have to extract the nwphono.txt file from the .exe file. The corpus is not available for download in a more convenient format.

Publication

Fields:     Orthography, Phonology, Frequency
Languages:  Spanish

Celex

Currently not freely available.

Fields:     Orthography, Phonology, Syllables, Frequency
Languages:  Dutch, German, English

WARNING: the Celex frequency norms are no longer thought to be accurate. Please use the SUBTLEX frequencies instead. You can combine the Celex corpus with SUBTLEX frequency norms using a pandas merge, as sketched below. If you use CELEX frequency norms at a psycholinguistic conference, you will get yelled at.
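
A minimal sketch of such a merge, reusing the file names that appear elsewhere in this README:

import pandas as pd
from wordkit.corpora import celex_english, subtlexus

celex = celex_english("epw.cd", fields=("orthography", "phonology", "syllables"))
subtlex = subtlexus("SUBTLEXusfrequencyabove1.xls", fields=("orthography", "frequency"))

# Keep the Celex orthography and phonology, but take frequencies from SUBTLEX.
words = pd.merge(celex, subtlex, on="orthography")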

CMUDICT

Download

We can read the cmudict.dict file from the above repository.

Fields:     Orthography, Syllables  
Languages:  American English

Deri

Download

Download the pron_data.tar.gz file, and unzip it. We use the gold_data_train file.

Publication

Fields:     Orthography, Phonology  
Languages:  lots

WARNING: we manually checked the Dutch, Spanish and German phonologies in this corpus, and a lot of them seem to be incorrectly transcribed or extracted. Only use this corpus if you don't have another resource for your language.

Lexique

Download

Download the zip file; we use the lexique382.txt file.

Publication

Note that this is the publication for Lexique version 2. Lexique 3 does not seem to have an associated publication in English.

Fields:     Orthography, Phonology, Frequency, Syllables  
Languages:  French

NOTE: the currently implemented reader is for version 3.82 (the most recent version as of May 2018) of Lexique.

SUBTLEX

Check the link below for the various SUBTLEX corpora and their associated publications. We support all of the formats from the link below.

Link

Fields:     Orthography, Frequency  
Languages:  Dutch, American English, Greek,
            British English, Polish, Chinese,
            Spanish

Wordnet

We support all the tab-separated formats of the Open Multilingual WordNet. If you use any of these WordNets, please cite the appropriate source, as well as the official WordNet reference.

Link

Fields: Orthography, Semantics
Languages: lots

Lexicon projects

We support all lexicon projects. These contain RT data with which you can validate models.

Link

Fields: Orthography, rt
Languages: Dutch, British English, American English, French

Experiments

The code for replicating the experiments in the wordkit paper can be found here.

Requirements

  • ipapy
  • numpy
  • pandas
  • reach (for the semantics)
  • nltk (for wordnet-related semantics)

Contributors

Stéphan Tulkens

License

GPL v3


wordkit's Issues

Separate corpus readers in more meaningful way

The corpus readers are currently grouped in a topic-ish fashion, e.g., lexicon projects are grouped together. But not all lexicon projects share the same structure, which creates all kinds of weird shenanigans when we try to load them. It also restricts us to arbitrary conventions about what information the user can extract from a certain corpus.
For example, we currently say that lexicon projects don't contain frequency fields, but some of them do, which is confusing.

It is probably better to just create a separate class for all these corpora, e.g., elp for English Lexicon Project, dlp for the Dutch Lexicon Project, etc.

This would cause us to have many more smaller classes, but all of these classes would be more useful than they are now.

Rewrite feature extraction from functions to objects

The feature extraction functions are currently all just functions. Making them objects would significantly reduce overhead and open up more options for future expansion.

Currently, users have to do something like:

all_phonemes = get_characters(data, field='phonology')
features = extract_one_hot_phonemes(all_phonemes)
o = ONCTransformer(features)
X = o.fit_transform(data)

This isn't too bad, but can be simplified by making the extraction process above atomic:

all_phonemes = extract_one_hot_phonemes(data, field='phonology')

But this would require adding the same couple of lines of code to all extraction functions.

So, what I propose is:

o = ONCTransformer(OneHotPhonemeExtractor(), field='phonology')
X = o.fit_transform(data)

This folds the extraction of the relevant phonemes into the transformer itself, and allows us to use inheritance for e.g. type checking and chaining.
I think it also puts less of a burden on the user, who no longer has to separately keep track of the features.

Installing on macOS

I believe there is a small issue with dependencies: I ran into an error involving the ipapy module, which went along the lines of
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcd in position 1331: ordinal not in range(128)

I checked the ipapy repo, and it seems the issue was fixed in a recent pull request. Is wordkit using an older version of ipapy?

Add frequency normalization

Currently we treat frequencies from different sources as being on the same scale, even though they might not be. That is, if we combine two databases of frequency norms, one counted over a 10M-word corpus and the other over a 1M-word corpus, we will overestimate the frequencies from the 10M corpus by a factor of 10.

Some solutions:

  • Normalize the frequencies we get from corpora by the most frequent word.
  • Manually look up the number of words on which the frequency norms were based, and use this to normalize.

The first option is straightforward; the second is a bit more difficult, especially because we might not know the corpus size for every source.
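
A sketch of one simple variant: rescale each source to frequency per million by its own total count, approximating the corpus size by the sum of the counts (an assumption; out-of-vocabulary tokens are ignored). english and dutch are frames as in the README example.

import pandas as pd

def per_million(df):
    """Rescale raw counts to frequency per million tokens."""
    total = df["frequency"].sum()
    out = df.copy()
    out["frequency"] = out["frequency"] * 1_000_000 / total
    return out

# Now both sources are on a comparable scale before concatenation.
words = pd.concat([per_million(english), per_million(dutch)], sort=False)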

prepare for pip release

This would be a nice pip package, but only if it is kind of stable.
Currently, there's too much going on (I think) to put it on pip.

Add more corpora

We currently offer a nice selection of corpora, but there are still corpora we don't offer.

Adding Lexique would be especially nice, because it would give us a Celex-like database for French (e.g., including syllables and frequency counts).

Add typing

Add typing, improve naming where necessary

Confusion caused by the word 'Transformer'

The word 'Transformer' is used extensively throughout the module. This terminology causes a lot of confusion, because the same term is used for the Transformer model introduced in 'Attention Is All You Need', which has been popular in NLP in recent years, even though that is not what the term refers to here.

Write new tutorials

The API of wordkit has changed somewhat.
Focus on the wordstore and the merging of corpora.

Add custom FeatureUnion

We currently use the sklearn.pipeline.FeatureUnion to combine different featurizers and corpora. This works great! But we want to replace it to add:

  1. Merging different sources (e.g. merging the frequency table for a word from one data source with perceptual characteristics for the same word from another corpus).
  2. Adding weights to transformers (e.g. assigning a weight of .5 to a phonology transformer to reduce the weight of phonology in any distance calculations).

The first point definitely makes a lot of sense and will be added ASAP, but I'm not sure about the second one.
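
For reference, the current FeatureUnion-based combination looks roughly like this (a sketch; words is a frame as in the README example):

from sklearn.pipeline import FeatureUnion
from wordkit.features import LinearTransformer, NGramTransformer, fourteen

featurizer = FeatureUnion([
    ("orthography", LinearTransformer(fourteen, field="orthography")),
    ("phonology", NGramTransformer(n=3, field="phonology")),
])
# Concatenates the per-transformer vectors; source merging and weighting
# are the two features this issue proposes on top of this behavior.
X = featurizer.fit_transform(words)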

Problem with subtlexus reader

When using the subtlexus (reader) function to load in the following corpus: SUBTLEXusfrequencyabove1.xls (obtained from here: http://crr.ugent.be/programs-data/subtitle-frequencies), I get the following error:

subtlexus('SUBTLEXusfrequencyabove1.xls', fields='orthography')


TypeError                                 Traceback (most recent call last)
in <module>
----> 1 subtlexus('SUBTLEXusfrequencyabove1.xls', fields='orthography')

~\anaconda3\lib\site-packages\wordkit\corpora\corpora\subtlex.py in subtlexus(path, fields)
     46 def subtlexus(path,
     47               fields=("orthography", "frequency")):
---> 48     return subtlex(path, fields, "eng-us")
     49
     50

~\anaconda3\lib\site-packages\wordkit\corpora\corpora\subtlex.py in subtlex(path, fields, language)
     31     skiprows = 0
     32
---> 33     return reader(path,
     34                   fields,
     35                   LANG2FIELD[language],

~\anaconda3\lib\site-packages\wordkit\corpora\base\reader.py in reader(path, fields, field_ids, language, preprocessors, opener, **kwargs)
     68         field_ids = {}
     69
---> 70     df = opener(path, **kwargs)
     71     # Columns in dataset
     72     colnames = set(df.columns)

~\anaconda3\lib\site-packages\wordkit\corpora\base\reader.py in _open(path, **kwargs)
     27     extension = os.path.splitext(path)[-1]
     28     if extension in {".xls", ".xlsx"}:
---> 29         df = pd.read_excel(path,
     30                            na_values=nans,
     31                            keep_default_na=False,

~\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    294             )
    295             warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296             return func(*args, **kwargs)
    297
    298         return wrapper

TypeError: read_excel() got an unexpected keyword argument 'sep'

Phonology: add support for featurization of arbitrary diacritics

Attaching multiple IPA diacritics is currently supported for both vowels and consonants (excluding tone diacritics and click-related things). We currently can't featurize these unless we assume one-hot encoding or pre-defined features.

Ideally, we would use the IPA characteristics of each phoneme to add features, but this requires that we know in advance which types of diacritics we are going to encounter (this is not a huge problem). More problematic is that this kind of featurization leads to really big feature spaces.
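
As a starting point for deriving features from IPA characteristics, a sketch using ipapy (already a wordkit dependency), assuming its UNICODE_TO_IPA lookup table as documented in the ipapy README:

from ipapy import UNICODE_TO_IPA

# Look up an IPA character and inspect its articulatory description.
char = UNICODE_TO_IPA["ɪ"]
print(char.name)  # e.g. "near-close near-front unrounded vowel"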
