jmsv / ety-python

A Python module to discover the etymology of words

Home Page: http://ety-python.rtfd.io

License: MIT License

Makefile 1.93% Python 98.07%

python etymology origins english language words hacktoberfest

ety-python's Introduction

Intro

@jmsv and @parker57 started a side project to analyse etymologies of text written by various historical authors, expecting there to already be a library for retrieving etymological data. On discovering that this wasn't the case, ety was created!

There isn't a single source of truth for etymologies; words' origins can be heavily disputed. This package's source data, Gerard de Melo's Etymological Wordnet, is mostly mined from Wiktionary. Since this is a collaboratively edited dictionary, its data could be seen as the closest we can get to a public consensus.

Install

pip

pip install ety

Usage

Module

>>> import ety

>>> ety.origins("potato")
[Word(batata, language=Taino)]

>>> ety.origins("drink", recursive=True)
[Word(drync, language=Old English (ca. 450-1100)), Word(drinken, language=Middle English (1100-1500)), Word(drincan, language=Old English (ca. 450-1100))]

>>> print(ety.tree("aerodynamically"))
aerodynamically (English)
├── -ally (English)
└── aerodynamic (English)
    ├── aero- (English)
    │   └── ἀήρ (Ancient Greek (to 1453))
    └── dynamic (English)
        └── dynamique (French)
            └── δυναμικός (Ancient Greek (to 1453))
                └── δύναμις (Ancient Greek (to 1453))
                    └── δύναμαι (Ancient Greek (to 1453))

CLI

After installing, a command-line tool is also available. ety -h outputs the following help text describing arguments:

usage: ety [-h] [-r] [-t] words [words ...]

positional arguments:
  words            the search word(s)

optional arguments:
  -h, --help       show this help message and exit
  -r, --recursive  search origins recursively
  -t, --tree       display etymology tree

Examples

$ ety drink
drink   # List direct origins
 • drync (Old English (ca. 450-1100))
 • drinken (Middle English (1100-1500))

$ ety drink -r   # Recursive search
drink
 • drync (Old English (ca. 450-1100))
 • drinken (Middle English (1100-1500))
 • drincan (Old English (ca. 450-1100))

$ ety drink -t   # Etymology tree
drink (English)
├── drinken (Middle English (1100-1500))
│   └── drincan (Old English (ca. 450-1100))
└── drync (Old English (ca. 450-1100))

Development

In a virtual environment (Pipenv is recommended):

python setup.py install

ety-python's People

Contributors

Stargazers

Watchers

Forkers

parker57 ailuke alxwrd nthh marcanuy michael2012z paulvickery hugovk rockanandu zengjatzau katieyounglove benparkinson copperdong glennhefley malleshi-9025 triliteralverb chenjh19

ety-python's Issues

Make origin search case-insensitive

>>> ety.origins('Potato')
Traceback (most recent call last):
  File "/home/james/gitr/ety-python/ety/__init__.py", line 30, in origins
    origin = origins_dict[word]
KeyError: 'Potato'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/james/gitr/ety-python/ety/__init__.py", line 33, in origins
    raise ValueError(error)
ValueError: No etymology data available for word: Potato

How this exception is handled should be tidied up to avoid "During handling of the above exception..."
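One way to avoid the chained traceback is to test membership before indexing, so that no KeyError is raised in the first place. The following is only a sketch: origins_dict here is a toy stand-in, and the real lookup table's shape is an assumption. It works on both Python 2 and 3.

```python
# Toy stand-in for the module's lookup table; the real structure is assumed.
origins_dict = {"potato": ["batata"]}


def origins(word):
    """Return the origins of `word`, or raise one clean ValueError if missing."""
    # Checking membership first means no KeyError is ever raised, so
    # Python 3 has no earlier exception to chain onto the ValueError.
    if word not in origins_dict:
        raise ValueError("No etymology data available for word: %s" % word)
    return origins_dict[word]
```

With this shape, ety.origins('Potato') would still fail, but with a single traceback.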

Random word method

Get a random word

No input necessary - could do random word containing string input
Output: random word
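A rough sketch of how this could look, assuming the words are available as a plain list. WORDS is toy data, and the contains parameter is a hypothetical name for the optional string filter:

```python
import random

# Toy word list; in the package this would come from the loaded dataset.
WORDS = ["potato", "drink", "software", "aerodynamic", "banana"]


def random_word(contains=None, words=WORDS):
    """Pick a random word, optionally only from words containing `contains`."""
    candidates = [w for w in words if contains is None or contains in w]
    if not candidates:
        raise ValueError("No words contain: %s" % contains)
    return random.choice(candidates)
```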

Cross-check etymologies dataset languages with ISO-639-3 codes

Issues #54 and #55 were caused by language codes used by the dataset that didn't exist in the ISO-639-3 JSON.

Need to build a script to search for cases where ety users could run into similar problems.
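Such a script could boil down to a set difference between the codes used in the dataset and the codes known to the ISO JSON. This sketch assumes dataset rows carry a_lang and b_lang keys and ISO entries carry an iso6393 key, as in the JSON shown in other issues. The sample rows are toy data.

```python
def missing_language_codes(rows, iso_entries):
    """Return dataset language codes that have no entry in the ISO list."""
    known = {entry["iso6393"] for entry in iso_entries}
    used = set()
    for row in rows:
        used.add(row["a_lang"])
        used.add(row["b_lang"])
    return sorted(used - known)


# Toy data for illustration only.
sample_rows = [
    {"a_lang": "eng", "a_word": "potato", "b_lang": "tnq", "b_word": "batata"},
    {"a_lang": "eng", "a_word": "avocado", "b_lang": "nah", "b_word": "ahuacatl"},
]
sample_iso = [
    {"name": "English", "iso6393": "eng"},
    {"name": "Taino", "iso6393": "tnq"},
]
missing = missing_language_codes(sample_rows, sample_iso)
```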

Replace string type checking hack with six

At the moment, checking a value is a string looks like this:

isinstance(word, ("".__class__, u"".__class__))

This is probably a dodgy way of doing it and since this package maintains compatibility with Python 2 and 3, it makes sense to start using six.

This would simplify string checking to:

isinstance(word, six.string_types)

From https://pythonhosted.org/six/#six.string_types:

six.string_types
Possible types for text data. This is basestring() in Python 2 and str in Python 3.

Add Word object methods such as len

Where __len__ is len(Word.word)
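As a minimal sketch, with a stripped-down stand-in for the Word class (the real constructor signature is an assumption):

```python
class Word(object):
    """Stripped-down stand-in for ety's Word class."""

    def __init__(self, word, language="eng"):
        self.word = word
        self.language = language

    def __len__(self):
        # len(Word(...)) delegates to the length of the underlying string.
        return len(self.word)
```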

Python 2.7 support broken

str / unicode Python 2/3 differences have broken Python 2 support

See https://travis-ci.org/jmsv/ety-python/jobs/390283260

Unclear command line output with multiple words

Currently, the output is a bit cramped for more than one word:

$ python -m ety cheese aerodynamically beer -r
chese (Middle English (1100-1500))
cese (Old English (ca. 450-1100))
-ally (English)
aerodynamic (English)
aero- (English)
dynamic (English)
ἀήρ (Ancient Greek (to 1453))
dynamique (French)
δυναμικός (Ancient Greek (to 1453))
δύναμις (Ancient Greek (to 1453))
δύναμαι (Ancient Greek (to 1453))
beere (Middle English (1100-1500))
bere (Middle English (1100-1500))
beor (Old English (ca. 450-1100))
bera (Old English (ca. 450-1100))
bēr (Old English (ca. 450-1100))

For the most part, it's ok, but could maybe add whitespace, or a horizontal rule (----).
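One possible layout is to print each input word as a heading, list its origins beneath it, and separate words with a blank line. The function below is a sketch with a hypothetical name and input shape (a list of (word, [(origin, language), ...]) pairs), not the CLI's actual code:

```python
def format_results(results):
    """Render one block per input word, separated by blank lines."""
    blocks = []
    for word, origins in results:
        lines = [word]
        for origin, language in origins:
            lines.append(" * %s (%s)" % (origin, language))
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```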

Invalid CSV format

All rows in CSVs should have the same number of columns
Opening in Excel and saving it should do the trick
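A small check like the following could flag offending rows before or after the fix. It is a sketch using the standard csv module. The column names in the sample are made up for illustration.

```python
import csv
import io


def inconsistent_rows(csv_file):
    """Return (row_number, column_count) for rows whose width differs from the header."""
    reader = csv.reader(csv_file)
    header = next(reader, None)
    if header is None:
        return []  # empty file: nothing to check
    expected = len(header)
    # Row numbers start at 2 because the header is row 1.
    return [(number, len(row))
            for number, row in enumerate(reader, start=2)
            if len(row) != expected]


# Toy CSV with one short row.
sample = io.StringIO(u"word,origin,lang\npotato,batata,tnq\ndrink,drync\n")
bad_rows = inconsistent_rows(sample)
```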

Switch to using etymwn data

Etymological Wordnet, maintained by Gerard de Melo:

http://www1.icsi.berkeley.edu/~demelo/etymwn

This would require API changes and thus a major version bump (https://semver.org/#summary)

Shouldn't be possible to reassign Word.word property

Word.origins is cached for an instance of a Word object - if word property is changed, origins is out of date
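A common way to do this is a read-only property backed by a private attribute. The class below is a simplified stand-in for Word, with an assumed constructor:

```python
class Word(object):
    """Simplified stand-in for ety's Word with a read-only `word`."""

    def __init__(self, word, language="eng"):
        self._word = word
        self._language = language

    @property
    def word(self):
        # No setter is defined, so `instance.word = ...` raises AttributeError.
        return self._word
```

Because the value cannot change after construction, a cached origins list would stay valid for the lifetime of the instance.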

Support analysis of large amounts of text

Read from files and analyse origin distribution of books etc

tree should check word exists

$ ety asdfghjkl -t
asdfghjkl (English)

Always outputs '(English)', since it's the default language. The word should be looked up in the data and not displayed if missing

Consider using rel:is_derived_from relationships

The current data is based on TSV rows with rel:etymology. The rel:is_derived_from relationship seems to be functionally similar.

Using this extra data makes the source data much larger and makes importing ety much slower. Import could be made slightly quicker by generating separate files for the different languages and loading them only as required.

Circular origin reference

There's currently at least one circular reference in etymwn-relety.json.

ety.origins("software", recursive=True)

will eventually fail with a recursion error because software -> soft, -ware and -ware -> software.
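One way to guard against cycles is to track the words already visited and skip any origin seen before. This is a sketch over a toy table containing only the loop described above. The function name and table shape are assumptions, not the package's API.

```python
# Toy table reproducing the cycle: software -> -ware -> software.
ORIGINS = {
    "software": ["soft", "-ware"],
    "-ware": ["software"],
}


def recursive_origins(word, table=ORIGINS, _seen=None):
    """Collect origins depth-first, visiting each word at most once."""
    seen = {word} if _seen is None else _seen
    found = []
    for origin in table.get(word, []):
        if origin in seen:
            continue  # already visited: following it again would loop forever
        seen.add(origin)
        found.append(origin)
        found.extend(recursive_origins(origin, table, seen))
    return found
```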

Any word with generic Nahuatl language will raise exception

Language code for Nahuatl is missing.

print(ety.tree('avocado'))

Will raise an exception:
"Language with iso code 'nah' unknown"

ISO 639 specifies it:
https://en.wikipedia.org/wiki/Template:ISO_639_name_nah

Recommend adding it to data/iso-639-3.json

btw thank you! awesome work!

test_origins_recursion can fail

$ python tests.py
..F.
======================================================================
FAIL: test_origins_recursion (__main__.TestEty)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 13, in test_origins_recursion
    self.assertGreater(len(o), 0)
AssertionError: 0 not greater than 0

Since ety.random_word() is used for this test, it normally passes but can (and did) fail

case sensitivity

I'm not sure about making the module case-insensitive as in #39. It might be a good idea to have a case-insensitive option, or to look for a lowercase version of an English word if the version presented yields no results. There are, however, a few words I can think of that have distinct etymologies but would be identical if rendered in lowercase.

Here are a few examples,

wasp, annoyingly, isn't in the data, but WASP (an acronym for White Anglo-Saxon Protestant) is.

Turkey refers to a country. We get the word turkey from Turkey, but it refers to the large guinea fowl that I guess Turkish people sold.

wed is a verb, "to marry"; Wed is an abbreviation for the third day of the week, from Woden.

$ ety turkey Turkey wed Wed don Don median Median wasp WASP -t
No origins found for word: wasp
turkey (English)
└── Turkey (English)
    └── Turquie (French)

Turkey (English)
└── Turquie (French)

wed (English)
└── weddian (Old English (ca. 450-1100))

Wed (English)
└── Wednesday (English)
    ├── Wednesdai (Middle English (1100-1500))
    └── day (English)
        └── day (Middle English (1100-1500))
            └── dæg (Old English (ca. 450-1100))

don (English)
└── dominus (Latin)
    └── domus (Latin)

Don (English)
└── Donald (English)

median (English)
└── median (Middle French (ca. 1400-1600))
    └── medianus (Latin)
        ├── -anus (Latin)
        └── medius (Latin)

Median (English)
├── -ian (English)
│   └── -anus (Latin)
└── Mede (English)
    └── Medus (Latin)
        └── Μῆδος (Ancient Greek (to 1453))

WASP (English)
└── Anglo-Saxon (English)

There are over a thousand such examples in the English language, and it's important to remember that most of the words in the data aren't English. I don't know how important case sensitivity is for the other 255 languages.

Personally, for now I would leave it case-sensitive. Users can always lowercase the words they pass into our functions if they want to.
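The opt-in fallback idea could look roughly like this: exact match first, lowercase only when explicitly requested and only when the exact form is missing. The table is toy data drawn from the examples above, and fallback_to_lower is a hypothetical parameter name.

```python
# Toy table built from the examples above.
ORIGINS = {
    "turkey": ["Turkey"],
    "Turkey": ["Turquie"],
    "WASP": ["Anglo-Saxon"],
    "potato": ["batata"],
}


def lookup(word, table=ORIGINS, fallback_to_lower=False):
    """Exact-case lookup, with an optional lowercase retry on a miss."""
    if word in table:
        return table[word]
    if fallback_to_lower and word.lower() in table:
        return table[word.lower()]
    return []
```

Since the exact form always wins, Turkey and turkey keep their distinct etymologies even with the fallback enabled.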

Add emoji flags to languages

As mentioned on #25

Command line interface could have an -e flag for displaying relevant emojis alongside languages and maybe words

Very low priority feature, but might be fun to implement and use

CLI option to disable ANSI formatting

PR #29 added bold formatting on input words.

This is cool 😎 but not always desired, for example when piping to a file.

Flag to disable ANSI could be -p (--plaintext), e.g.

parser.add_argument("-p", "--plaintext",
                    help="output plaintext, disabling ANSI formatting",
                    action="store_true")

ANSI codes could be replaced with a library such as kennethreitz/crayons, which supports disabling colours etc. without too much extra code:

if args.plaintext:
    crayons.disable()

Add tests

Tests, linting and CI should be added

Reverse search method

Get words by origin

input: language/language family, e.g. Middle English (matching should be case-insensitive)
output: all words with that root
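A sketch of a reverse search, assuming the data can be iterated as (word, origin, origin language name) tuples. It matches the language name as a case-insensitive substring so that "middle english" finds "Middle English (1100-1500)". That matching rule is a design assumption rather than anything settled here.

```python
# Toy rows: (word, origin word, origin language name).
ROWS = [
    ("drink", "drinken", "Middle English (1100-1500)"),
    ("cheese", "chese", "Middle English (1100-1500)"),
    ("potato", "batata", "Taino"),
]


def words_from(language_name, rows=ROWS):
    """Return words with at least one origin in a matching language."""
    wanted = language_name.strip().lower()
    return sorted({word for word, _origin, language in rows
                   if wanted in language.lower()})
```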

Better test coverage needed

More tests should be written to cover more internal methods etc

Write proper docs

New, more detailed docs should be written

http://ety-python.rtfd.io

Any word with Wintu language will raise exception

print(ety.tree('Wintun'))

Will throw an exception:
"Language with iso code 'wit' unknown"

'wit' seems to be deprecated in ISO 639-3 in favor of 'wnw', but alas, it still exists in the data.

https://iso639-3.sil.org/code/wit
https://iso639-3.sil.org/code/wnw
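One option is a small alias table that maps retired codes onto their replacements before the lookup. The sketch below assumes 'wit' should resolve to 'wnw', as suggested above. LANGUAGES and RETIRED_CODES are hypothetical names with toy contents.

```python
# Toy language table and alias map; the wit -> wnw mapping is an assumption.
LANGUAGES = {"eng": "English", "wnw": "Wintu"}
RETIRED_CODES = {"wit": "wnw"}


def language_name(code):
    """Resolve a language code to a name, following retired-code aliases."""
    code = RETIRED_CODES.get(code, code)
    if code not in LANGUAGES:
        raise ValueError("Language with iso code '%s' unknown" % code)
    return LANGUAGES[code]
```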

Restructure data for performance

At the moment, the json dataset is structured as follows:

[
  {
    "a_lang": "eng",
    "a_word": "potato",
    "b_lang": "tnq",
    "b_word": "batata"
  },
  { ...

This is loaded as a Python list of dicts and filtered using:

row = list(filter(
    lambda entry: entry['a_word'] == self.word and entry[
        'a_lang'] == self.language.iso, etymwn_data))

If the data was restructured so words acted as dict keys, referencing words would be much faster since dicts are an implementation of hash tables.

Data could instead be structured by language then by word, as follows:

{
    "lang":{
        "word":[
            {
                "origin-word":"origin-lang"
            }
        ]
    }
}

for example,

{
    "eng":{
        "airport":[
            {"air":"eng"},
            {"port":"eng"}
        ],
        "banana":[
            {"banaana":"wol"}
        ]
    },
    "lat":{
        "fructus":[
            {"fruor":"lat"}
        ]
    }
}

Origin words are individual dicts to prevent key collisions.
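The conversion from the flat row list to the nested layout could be done once, offline, with something like this sketch. The key names follow the JSON shown above. The function name is hypothetical.

```python
def restructure(rows):
    """Convert flat rows into {lang: {word: [{origin_word: origin_lang}]}}."""
    nested = {}
    for row in rows:
        words = nested.setdefault(row["a_lang"], {})
        origins = words.setdefault(row["a_word"], [])
        origins.append({row["b_word"]: row["b_lang"]})
    return nested


# Toy rows in the current flat format.
flat = [
    {"a_lang": "eng", "a_word": "airport", "b_lang": "eng", "b_word": "air"},
    {"a_lang": "eng", "a_word": "airport", "b_lang": "eng", "b_word": "port"},
    {"a_lang": "eng", "a_word": "potato", "b_lang": "tnq", "b_word": "batata"},
]
nested = restructure(flat)
```

A lookup then becomes nested["eng"]["potato"], two hash-table accesses instead of a scan over every row.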

Word origin census method(s)

A way to get a list of all languages from which English derives words and a way to tally the frequency of those languages.

e.g.

Old French: 21,073
Middle English: 14,501
Latin: 8,999
etc.
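A tally like this could be built with collections.Counter. The sketch assumes the data is available in the nested {lang: {word: [{origin_word: origin_lang}]}} layout proposed in the restructuring issue. The counts below come from toy data only and say nothing about the real dataset.

```python
from collections import Counter


def origin_census(nested, language="eng"):
    """Count how often each origin language appears for words of `language`."""
    counts = Counter()
    for origins in nested.get(language, {}).values():
        for origin in origins:
            for origin_language in origin.values():
                counts[origin_language] += 1
    return counts


# Toy data in the assumed nested layout.
data = {
    "eng": {
        "airport": [{"air": "eng"}, {"port": "eng"}],
        "banana": [{"banaana": "wol"}],
    },
    "lat": {"fructus": [{"fruor": "lat"}]},
}
census = origin_census(data)
```

census.most_common() then gives the languages ordered by frequency.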

Recognising proto-languages

Gerard de Melo's readme states:

Words are given with ISO 639-3 codes
(additionally, there are some ISO 639-2 codes prefixed with "p_" to indicate proto-languages).

From Wikipedia for ISO 639-3:

It provides an enumeration of languages as complete as possible, including living and extinct, ancient and constructed, major and minor, written and unwritten. However, it does not include reconstructed languages such as Proto-Indo-European.

The etymwn-relety.json contains the following proto-language references:

124 instances of 'p_sla', from ISO 639-2 Proto-Slavic
13 instances of 'p_gem', from ISO 639-2 Proto-Germanic
6 instances of 'p_ine', from ISO 639-2 Proto-Indo-European
3 instances of 'p_gmw', not in ISO 639-2 but seems to be Proto-West-Germanic (it only points to the word "iuwiz")

It's probably best to just add the relevant JSON and document accordingly. For instance, 'p_sla' could be

  {
    "name": "Proto-Slavic",
    "type": "extinct",
    "scope": "individual",
    "iso6393": "p_sla",
    "iso6392B": null,
    "iso6392T": null,
    "iso6391": null
  }

no idea what to put for scope tbh
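An alternative to adding JSON entries would be to recognise the "p_" prefix in code. This is only a sketch: the helper names are hypothetical, and the name table just restates the readings listed above.

```python
# Names as interpreted in the list above; 'gmw' is a guess per that list.
PROTO_NAMES = {
    "sla": "Proto-Slavic",
    "gem": "Proto-Germanic",
    "ine": "Proto-Indo-European",
    "gmw": "Proto-West-Germanic",
}


def split_proto(code):
    """Return (is_proto, base_code) for a dataset language code."""
    if code.startswith("p_"):
        return True, code[2:]
    return False, code


def proto_name(code):
    """Return the proto-language name for a 'p_' code, or None otherwise."""
    is_proto, base = split_proto(code)
    return PROTO_NAMES.get(base) if is_proto else None
```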