
ety-python's Introduction

ety


Intro

@jmsv and @parker57 started a side project to analyse etymologies of text written by various historical authors, expecting there to already be a library for retrieving etymological data. On discovering that this wasn't the case, ety was created!

There isn't a single source of truth for etymologies; words' origins can be heavily disputed. This package's source data, Gerard de Melo's Etymological Wordnet, is mostly mined from Wiktionary. Since this is a collaboratively edited dictionary, its data could be seen as the closest we can get to a public consensus.

Install

pip install ety

Usage

Module

>>> import ety

>>> ety.origins("potato")
[Word(batata, language=Taino)]

>>> ety.origins("drink", recursive=True)
[Word(drync, language=Old English (ca. 450-1100)), Word(drinken, language=Middle English (1100-1500)), Word(drincan, language=Old English (ca. 450-1100))]

>>> print(ety.tree("aerodynamically"))
aerodynamically (English)
├── -ally (English)
└── aerodynamic (English)
    ├── aero- (English)
    │   └── ἀήρ (Ancient Greek (to 1453))
    └── dynamic (English)
        └── dynamique (French)
            └── δυναμικός (Ancient Greek (to 1453))
                └── δύναμις (Ancient Greek (to 1453))
                    └── δύναμαι (Ancient Greek (to 1453))

CLI

After installing, a command-line tool is also available. ety -h outputs the following help text describing arguments:

usage: ety [-h] [-r] [-t] words [words ...]

positional arguments:
  words            the search word(s)

optional arguments:
  -h, --help       show this help message and exit
  -r, --recursive  search origins recursively
  -t, --tree       display etymology tree

Examples

$ ety drink   # List direct origins
drink
 • drync (Old English (ca. 450-1100))
 • drinken (Middle English (1100-1500))

$ ety drink -r   # Recursive search
drink
 • drync (Old English (ca. 450-1100))
 • drinken (Middle English (1100-1500))
 • drincan (Old English (ca. 450-1100))

$ ety drink -t   # Etymology tree
drink (English)
├── drinken (Middle English (1100-1500))
│   └── drincan (Old English (ca. 450-1100))
└── drync (Old English (ca. 450-1100))

Development

In a virtual environment - Pipenv is recommended:

python setup.py install

ety-python's People

Contributors

ailuke, alxwrd, hugovk, jmsv, parker57, paulvickerytr

ety-python's Issues

Make origin search case-insensitive

>>> ety.origins('Potato')
Traceback (most recent call last):
  File "/home/james/gitr/ety-python/ety/__init__.py", line 30, in origins
    origin = origins_dict[word]
KeyError: 'Potato'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/james/gitr/ety-python/ety/__init__.py", line 33, in origins
    raise ValueError(error)
ValueError: No etymology data available for word: Potato

How this exception is handled should be tidied up to avoid "During handling of the above exception..."
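One way to avoid the chained traceback is to test membership before indexing, so no KeyError is raised in the first place. The following is a minimal sketch only: `origins_dict` is a toy stand-in for the package's lookup table and `lookup_origin` is a hypothetical name, not the library's actual function.

```python
# Toy stand-in for the package's internal lookup table (assumption).
origins_dict = {"potato": ["batata"]}


def lookup_origin(word):
    """Return origins for `word`, raising a single, unchained ValueError."""
    # Checking membership first means no KeyError is ever raised, so the
    # traceback has no "During handling of the above exception..." section.
    if word not in origins_dict:
        raise ValueError("No etymology data available for word: %s" % word)
    return origins_dict[word]
```

On Python 3 only, `raise ValueError(error) from None` inside the existing `except KeyError` block would suppress the chaining as well, but that syntax is not valid on Python 2.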

Random word method

Get a random word

No input necessary - could do random word containing string input
Output: random word
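A rough sketch of how this could look, assuming the words are available as a plain list. `WORDS`, the `containing` parameter and the `rng` parameter are all illustrative assumptions, not part of the package.

```python
import random

# Hypothetical word list; the real package would draw from its own dataset.
WORDS = ["potato", "drink", "aerodynamically", "software"]


def random_word(containing=None, words=WORDS, rng=random):
    """Pick a random word, optionally only from words containing a substring."""
    candidates = [w for w in words if containing is None or containing in w]
    if not candidates:
        raise ValueError("No words contain: %s" % containing)
    return rng.choice(candidates)
```

Passing a seeded `random.Random` instance as `rng` would make the choice reproducible, which may help in tests.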

Replace string type checking hack with six

At the moment, checking a value is a string looks like this:

isinstance(word, ("".__class__, u"".__class__))

This is probably a dodgy way of doing it and since this package maintains compatibility with Python 2 and 3, it makes sense to start using six.

This would simplify string checking to:

isinstance(word, six.string_types)

From https://pythonhosted.org/six/#six.string_types:

six.string_types
Possible types for text data. This is basestring() in Python 2 and str in Python 3.
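If adding a dependency turns out to be undesirable, the same idea can be approximated with the standard library alone. This is a sketch of that alternative, not six's actual code; `string_types` and `is_string` are names chosen here for illustration.

```python
try:
    string_types = (basestring,)  # Python 2: covers both str and unicode
except NameError:
    string_types = (str,)  # Python 3: basestring no longer exists


def is_string(value):
    """Return True if `value` is a text string on the running Python version."""
    return isinstance(value, string_types)
```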

Unclear command line output with multiple words

Currently, the output is a bit cramped for more than one word:

$ python -m ety cheese aerodynamically beer -r
chese (Middle English (1100-1500))
cese (Old English (ca. 450-1100))
-ally (English)
aerodynamic (English)
aero- (English)
dynamic (English)
ἀήρ (Ancient Greek (to 1453))
dynamique (French)
δυναμικός (Ancient Greek (to 1453))
δύναμις (Ancient Greek (to 1453))
δύναμαι (Ancient Greek (to 1453))
beere (Middle English (1100-1500))
bere (Middle English (1100-1500))
beor (Old English (ca. 450-1100))
bera (Old English (ca. 450-1100))
bēr (Old English (ca. 450-1100))

For the most part, it's ok, but could maybe add whitespace, or a horizontal rule (----).
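As an illustration of the spacing idea, here is a small sketch that groups origins under each input word and separates the groups with either a blank line or a rule. The function name and the `(word, origins)` input shape are assumptions made for this example.

```python
def format_results(results, separator=""):
    """Join per-word origin listings, one group per input word.

    `results` is a list of (word, [(origin, language_name), ...]) pairs.
    An empty separator yields a blank line between groups.
    """
    blocks = []
    for word, origins in results:
        lines = [word] + [" • %s (%s)" % (o, lang) for o, lang in origins]
        blocks.append("\n".join(lines))
    return ("\n%s\n" % separator).join(blocks)
```

Calling `format_results(results, separator="----")` would give the horizontal-rule variant.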

Invalid CSV format

All rows in CSVs should have the same number of columns
Opening in Excel and saving it should do the trick
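A check like the following could also catch the problem automatically instead of relying on a manual round trip through Excel. It is only a sketch using the standard `csv` module; the function name is hypothetical and the first row is assumed to define the expected column count.

```python
import csv
import io


def inconsistent_rows(csv_text):
    """Return (row_number, column_count) for rows that differ from the first row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    expected = len(rows[0])
    return [(i, len(row)) for i, row in enumerate(rows, start=1)
            if len(row) != expected]
```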

tree should check word exists

$ ety asdfghjkl -t
asdfghjkl (English)

Always outputs '(English)', since it's the default language. The word should be looked up in the data and not displayed if missing
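The lookup could happen before the root node is rendered. A minimal sketch, assuming data keyed by `(word, language code)`; `TREE_DATA` and `tree_root` are made-up names and the real dataset layout may differ.

```python
# Toy data keyed by (word, language code); the real layout may differ.
TREE_DATA = {("drink", "eng"): [("drinken", "enm"), ("drync", "ang")]}


def tree_root(word, language="eng"):
    """Return the root label for a tree, or raise if the word is unknown."""
    if (word, language) not in TREE_DATA:
        raise ValueError("No origins found for word: %s" % word)
    return "%s (%s)" % (word, language)
```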

Consider using rel:is_derived_from relationships

The current data is based on TSV rows with rel:etymology. The rel:is_derived_from relationship seems to be functionally similar.

Using this extra data makes the source data much larger, and ety import is much slower. It could be made slightly quicker by generating separate files for the different languages and loading them only as required

Circular origin reference

There's currently at least one circular reference in etymwn-relety.json.

ety.origins("software", recursive=True)

will eventually fail with a recursion error because software -> soft, -ware and -ware -> software.
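One possible guard is to track which words have already been visited and skip them. The sketch below reproduces the cycle with toy data; `CYCLIC_DATA` and `origins_recursive` are illustrative names, and the real implementation works on Word objects rather than bare strings.

```python
# Toy data reproducing the cycle described above (assumed shape).
CYCLIC_DATA = {
    "software": ["soft", "-ware"],
    "-ware": ["software"],
}


def origins_recursive(word, seen=None):
    """Collect origins depth-first, skipping words that were already visited."""
    seen = {word} if seen is None else seen
    result = []
    for origin in CYCLIC_DATA.get(word, []):
        if origin in seen:
            continue  # cycle detected: do not descend again
        seen.add(origin)
        result.append(origin)
        result.extend(origins_recursive(origin, seen))
    return result
```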

test_origins_recursion can fail

$ python tests.py
..F.
======================================================================
FAIL: test_origins_recursion (__main__.TestEty)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 13, in test_origins_recursion
    self.assertGreater(len(o), 0)
AssertionError: 0 not greater than 0

Since ety.random_word() is used for this test, it normally passes but can (and did) fail

case sensitivity

I'm not sure about making the module case insensitive like in #39. It might be a good idea to have a case-insensitive option, or to look for a lowercase version of an English word if the version presented yields no results. There are, however, a few words I can think of that have distinct etymologies but would be identical if rendered in lowercase.

Here are a few examples,

wasp annoyingly isn't in the data but WASP (acronym for white anglo-saxon protestant) is.

Turkey refers to a country; we get the word turkey from Turkey, but it names the large guinea fowl that, I guess, Turkish people sold.

wed is a verb, "to marry", Wed is an abbreviation for the third day of the week from Woden

$ ety turkey Turkey wed Wed don Don median Median wasp WASP -t
No origins found for word: wasp
turkey (English)
└── Turkey (English)
    └── Turquie (French)

Turkey (English)
└── Turquie (French)

wed (English)
└── weddian (Old English (ca. 450-1100))

Wed (English)
└── Wednesday (English)
    ├── Wednesdai (Middle English (1100-1500))
    └── day (English)
        └── day (Middle English (1100-1500))
            └── dæg (Old English (ca. 450-1100))

don (English)
└── dominus (Latin)
    └── domus (Latin)

Don (English)
└── Donald (English)

median (English)
└── median (Middle French (ca. 1400-1600))
    └── medianus (Latin)
        ├── -anus (Latin)
        └── medius (Latin)

Median (English)
├── -ian (English)
│   └── -anus (Latin)
└── Mede (English)
    └── Medus (Latin)
        └── Μῆδος (Ancient Greek (to 1453))

WASP (English)
└── Anglo-Saxon (English)

There are over a thousand such examples in the English language, and it's important to remember that most of the words in the data aren't English. I don't know how important case sensitivity is for the other 255 languages.

Personally, for now I would leave it case sensitive. Users can always lowercase the words they pass into our functions if they want to anyway.
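For reference, the opt-in lowercase fallback mentioned earlier could be sketched like this. The data, the function name and the `fallback` parameter are assumptions for illustration only; exact matches always win, so Turkey and turkey stay distinct.

```python
# Toy data with case-sensitive keys.
CASE_DATA = {
    "potato": ["batata"],
    "Turkey": ["Turquie"],
    "turkey": ["Turkey"],
}


def origins_with_fallback(word, fallback=False):
    """Look up `word` exactly; optionally retry in lowercase if nothing is found."""
    exact = CASE_DATA.get(word)
    if exact is not None or not fallback:
        return exact or []
    return CASE_DATA.get(word.lower(), [])
```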

Add emoji flags to languages

As mentioned on #25

Command line interface could have an -e flag for displaying relevant emojis alongside languages and maybe words

Very low priority feature, but might be fun to implement and use

CLI option to disable ANSI formatting

PR #29 added bold formatting on input words:

[Image: screenshot from 2018-06-13 09-21-30]

This is cool 😎 but not always desired, for example when piping to a file:

[Image: screenshot from 2018-06-13 09-53-26]

Flag to disable ANSI could be -p (--plaintext), e.g.

parser.add_argument("-p", "--plaintext",
                    help="output plaintext, disabling ANSI formatting",
                    action="store_true")

ANSI codes could be replaced with a library such as kennethreitz/crayons, which supports disabling colours etc. without too much extra code:

if args.plaintext:
    crayons.disable()

Add tests

Tests, linting and CI should be added

Reverse search method

Get words by origin

input: language or language family, e.g. Middle English (matched case-insensitively)
output: all words with that root
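A rough sketch of such a reverse lookup, assuming the data can be reduced to `(word, origin language name)` pairs. That input shape and the function name are assumptions; language-family matching is not covered here.

```python
def words_by_origin_language(language, pairs):
    """Return sorted words whose origin language matches, ignoring case."""
    wanted = language.strip().lower()
    return sorted({word for word, lang in pairs if lang.lower() == wanted})
```

For example, with the pairs `[("drink", "Middle English"), ("beer", "Middle English"), ("potato", "Taino")]`, querying `"middle english"` would select `beer` and `drink`.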

Restructure data for performance

At the moment, the json dataset is structured as follows:

[
  {
    "a_lang": "eng",
    "a_word": "potato",
    "b_lang": "tnq",
    "b_word": "batata"
  },
  { ...

This is loaded as a Python list of dicts and filtered using:

row = list(filter(
    lambda entry: (entry['a_word'] == self.word
                   and entry['a_lang'] == self.language.iso),
    etymwn_data))

If the data were restructured so that words acted as dict keys, looking up a word would be much faster: dicts are implemented as hash tables, so a key lookup avoids scanning the whole list.

Data could instead be structured by language then by word, as follows:

{
    "lang":{
        "word":[
            {
                "origin-word":"origin-lang"
            }
        ]
    }
}

for example,

{
    "eng":{
        "airport":[
            {"air":"eng"},
            {"port":"eng"}
        ],
        "banana":[
            {"banaana":"wol"}
        ]
    },
    "lat":{
        "fructus":[
            {"fruor":"lat"}
        ]
    }
}

Origin words are individual dicts to prevent key collisions.
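A one-off conversion from the current flat rows to the proposed nesting might look like this. It is a sketch that assumes the row keys shown earlier (`a_lang`, `a_word`, `b_lang`, `b_word`); the function name is made up.

```python
def restructure(rows):
    """Convert flat rows into {lang: {word: [{origin_word: origin_lang}, ...]}}."""
    nested = {}
    for row in rows:
        words = nested.setdefault(row["a_lang"], {})
        # Each origin is its own single-item dict, so two origins that share
        # a spelling but differ in language cannot overwrite each other.
        words.setdefault(row["a_word"], []).append({row["b_word"]: row["b_lang"]})
    return nested
```

After conversion, finding a word's origins is two dict lookups, e.g. `nested["eng"]["airport"]`, rather than a filter over every row.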

Word origin census method(s)

A way to get a list of all languages from which English derives words and a way to tally the frequency of those languages.

e.g.

  • Old French: 21,073
  • Middle English: 14,501
  • Latin: 8,999
  • etc.
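The tally itself could be sketched with `collections.Counter`, assuming flat rows with `a_lang` and `b_lang` keys as in the current JSON dataset. The function name is hypothetical, and the counts are by ISO code rather than by language name.

```python
from collections import Counter


def origin_language_census(rows, language="eng"):
    """Count how often each origin language appears for words of `language`."""
    return Counter(row["b_lang"] for row in rows if row["a_lang"] == language)
```

`Counter.most_common()` then gives the languages ordered by frequency.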

Recognising proto-languages

Gerard de Melo's readme states:

Words are given with ISO 639-3 codes
(additionally, there are some ISO 639-2 codes prefixed with "p_" to indicate proto-languages).

From Wikipedia for ISO 639-3:

It provides an enumeration of languages as complete as possible, including living and extinct, ancient and constructed, major and minor, written and unwritten. However, it does not include reconstructed languages such as Proto-Indo-European.

The etymwn-relety.json file contains the following proto-language references:

  • 124 instances of 'p_sla', from ISO 639-2 Proto-Slavic
  • 13 instances of 'p_gem', from ISO 639-2 Proto-Germanic
  • 6 instances of 'p_ine', from ISO 639-2 Proto-Indo-European
  • 3 instances of 'p_gmw', not in ISO 639-2 but seems to be Proto-West-Germanic (it only points to the word "iuwiz")

It's probably best to just add the relevant JSON and document accordingly. For instance, 'p_sla' could be

  {
    "name": "Proto-Slavic",
    "type": "extinct",
    "scope": "individual",
    "iso6393": "p_sla",
    "iso6392B": null,
    "iso6392T": null,
    "iso6391": null
  }

no idea what to put for scope tbh
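An alternative to hand-writing every JSON entry would be to recognise the "p_" prefix in code. The sketch below assumes only the codes listed above; `PROTO_NAMES` and `proto_language_name` are illustrative names, and 'p_gmw' is left to the generic fallback because its meaning is uncertain.

```python
# Names taken from the list above; 'gmw' is deliberately omitted.
PROTO_NAMES = {
    "sla": "Proto-Slavic",
    "gem": "Proto-Germanic",
    "ine": "Proto-Indo-European",
}


def proto_language_name(code):
    """Return a display name for a 'p_'-prefixed code, or None for other codes."""
    if not code.startswith("p_"):
        return None
    base = code[2:]
    return PROTO_NAMES.get(base, "Proto-language (%s)" % base)
```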
