Lamon, The Latin POS Tagger & Lemmatizer

Lamon (LAtin MOrphological tools, pronounced /leɪmən/) is a simple POS tagger & lemmatizer for Latin written in C++, and Lamonpy is its Python package. With Lamonpy you can easily obtain the lemma and tag of each word in a given text.

Getting Started

You can install Lamonpy easily using pip. (https://pypi.org/project/lamonpy/)

$ pip install --upgrade pip
$ pip install lamonpy

The supported OS and Python versions are:

  • Linux (x86-64) with Python >= 3.5
  • macOS >= 10.13 with Python >= 3.5
  • Windows 7 or later (x86, x86-64) with Python >= 3.5
  • Other OS with Python >= 3.5: Compilation from source code required (with a C++11-compatible compiler)

Here is a simple example using Lamonpy to analyze Latin texts.

from lamonpy import Lamon
lamon = Lamon()
score, tagged = lamon.tag('In principio creavit Deus caelum et terram.')[0]
print(tagged)
# `tagged` is a list of tuples `(start_pos, end_pos, lemma, tag)`
# [(0, 2, 'in', 'r--------'),
#  (3, 12, 'principium', 'n-s---nb-'),
#  (13, 20, 'creo', 'v3sria---'),
#  (21, 25, 'deus', 'n-s---mn-'),
#  (26, 32, 'caelus', 'n-s---ma-'),
#  (33, 35, 'et', 'c--------'),
#  (36, 42, 'terra', 'n-s---fa-'),
#  (42, 43, '.', '---------')]
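
The offsets in each tuple index into the original string, so the surface form of every token can be recovered by slicing. Below is a minimal sketch that assumes `end_pos` is exclusive (as the sample output suggests) and uses the output above copied into a list, so it runs without Lamonpy installed. The helper name `with_surface` is hypothetical and not part of Lamonpy.

```python
text = 'In principio creavit Deus caelum et terram.'

# Sample output copied from the example above (what `tagged` contained there).
tagged = [
    (0, 2, 'in', 'r--------'),
    (3, 12, 'principium', 'n-s---nb-'),
    (13, 20, 'creo', 'v3sria---'),
    (21, 25, 'deus', 'n-s---mn-'),
    (26, 32, 'caelus', 'n-s---ma-'),
    (33, 35, 'et', 'c--------'),
    (36, 42, 'terra', 'n-s---fa-'),
    (42, 43, '.', '---------'),
]


def with_surface(text, tagged):
    """Return (surface, lemma, tag) triples by slicing `text` with each token's offsets."""
    return [(text[start:end], lemma, tag) for start, end, lemma, tag in tagged]


for surface, lemma, tag in with_surface(text, tagged):
    # Left-align the columns so the three fields line up when printed.
    print('{:<10} {:<12} {}'.format(surface, lemma, tag))
```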

Tagging Model and Its Accuracy

Lamon's tagging model is based on a BiLSTM network trained on the Perseus Latin Dependency Treebanks (4,000 sentences) and self-trained on raw Latin corpora (440,000 sentences) collected by Latina Vivense.

Since there is no standard benchmark for evaluating Latin taggers, we built our own test set of 900 sentences, named vivens. The evaluation results are shown below:

                 vivens (900 sents)      Perseus (4000 sents)
                 lemma   tag    both     lemma   tag    both
Lamon            94.6    83.0   81.1     89.4    80.2   76.6
Lamon (large)    94.2    83.3   81.3     89.7    81.9   78.3
Lamon (uv.)      94.4    82.6   80.7     87.7    77.9   73.8
Backoff          88.1    -      -        92.4    -      -
123 POS          -       58.1   54.8     -       83.8   79.6
CRF POS          -       69.1   63.4     -       77.3   72.9

Since Lamon and all of cltk's taggers are trained on the Perseus dataset, the Perseus scores say little about the actual accuracy of each model. Rather, the gap between their Perseus and vivens scores shows that 123 POS and CRF POS are overfitting to the Perseus dataset.

Because the vivens dataset is small, the results of this evaluation may be inaccurate. We plan to acquire a larger evaluation dataset and publish it to enable more accurate evaluation.
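
To make the three score columns concrete: one plausible reading, assumed here rather than stated in this README, is that "lemma" and "tag" are per-token accuracies and "both" counts tokens whose lemma and tag are correct together. The sketch below computes the three numbers under that assumption. The function `accuracy` and the toy gold/predicted data are hypothetical and are not the actual evaluation script.

```python
def accuracy(gold, pred):
    """Return (lemma, tag, both) accuracy in percent for aligned (lemma, tag) pairs."""
    if len(gold) != len(pred) or not gold:
        raise ValueError('gold and pred must be non-empty and the same length')
    n = len(gold)
    lemma_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    tag_ok = sum(g[1] == p[1] for g, p in zip(gold, pred))
    # A token counts for "both" only if lemma and tag match at the same time.
    both_ok = sum(g == p for g, p in zip(gold, pred))
    return tuple(100.0 * count / n for count in (lemma_ok, tag_ok, both_ok))


# Toy data for illustration only: one wrong tag and one wrong lemma.
gold = [('in', 'r--------'), ('principium', 'n-s---nb-'),
        ('creo', 'v3sria---'), ('deus', 'n-s---mn-')]
pred = [('in', 'r--------'), ('principium', 'n-s---nd-'),
        ('creo', 'v3sria---'), ('dea', 'n-s---mn-')]

print(accuracy(gold, pred))  # (75.0, 75.0, 50.0)
```

Under this reading, "both" can never exceed the smaller of the other two scores, which is consistent with every row of the table.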

Tagset

Lamon supports three tagsets.

1. perseus

1:  part of speech

n   noun
v   verb
a   adjective
d   adverb
c   conjunction
r   adposition
p   pronoun
m   numeral
i   interjection
e   exclamation
u   punctuation

2:  person

1   first person
2   second person
3   third person

3:  number

s   singular
p   plural

4:  tense

p   present
i   imperfect
r   perfect
l   pluperfect
t   future perfect
f   future

5:  mood

i   indicative
s   subjunctive
n   infinitive
m   imperative
p   participle
d   gerund
g   gerundive

6:  voice

a   active
p   passive
d   deponent

7:  gender

m   masculine
f   feminine
n   neuter

8:  case

n   nominative
g   genitive
d   dative
a   accusative
v   vocative
b   ablative
l   locative

9:  degree

p   positive
c   comparative
s   superlative
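
The nine positions above can be decoded mechanically. The following is a minimal sketch, assuming that `-` marks a position that does not apply (as in the sample output `'v3sria---'`). The names `PERSEUS_FIELDS` and `decode_perseus` are hypothetical helpers and not part of Lamonpy.

```python
# Position-by-position lookup tables, transcribed from the perseus tagset above.
PERSEUS_FIELDS = [
    ('part of speech', {'n': 'noun', 'v': 'verb', 'a': 'adjective', 'd': 'adverb',
                        'c': 'conjunction', 'r': 'adposition', 'p': 'pronoun',
                        'm': 'numeral', 'i': 'interjection', 'e': 'exclamation',
                        'u': 'punctuation'}),
    ('person', {'1': 'first person', '2': 'second person', '3': 'third person'}),
    ('number', {'s': 'singular', 'p': 'plural'}),
    ('tense', {'p': 'present', 'i': 'imperfect', 'r': 'perfect',
               'l': 'pluperfect', 't': 'future perfect', 'f': 'future'}),
    ('mood', {'i': 'indicative', 's': 'subjunctive', 'n': 'infinitive',
              'm': 'imperative', 'p': 'participle', 'd': 'gerund', 'g': 'gerundive'}),
    ('voice', {'a': 'active', 'p': 'passive', 'd': 'deponent'}),
    ('gender', {'m': 'masculine', 'f': 'feminine', 'n': 'neuter'}),
    ('case', {'n': 'nominative', 'g': 'genitive', 'd': 'dative', 'a': 'accusative',
              'v': 'vocative', 'b': 'ablative', 'l': 'locative'}),
    ('degree', {'p': 'positive', 'c': 'comparative', 's': 'superlative'}),
]


def decode_perseus(tag):
    """Map a 9-character perseus tag to {field: value}, skipping '-' positions."""
    if len(tag) != len(PERSEUS_FIELDS):
        raise ValueError('expected a 9-character tag, got {!r}'.format(tag))
    decoded = {}
    for char, (field, table) in zip(tag, PERSEUS_FIELDS):
        if char != '-':
            # Unlisted characters are reported rather than silently dropped.
            decoded[field] = table.get(char, 'unknown ({})'.format(char))
    return decoded


print(decode_perseus('v3sria---'))
print(decode_perseus('n-s---nb-'))
```

Note that in the earlier sample output the full stop is tagged `'---------'`, which this sketch decodes to an empty dictionary.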

2. vivens

# Moods
D: indicative
S: subjunctive
I: imperative
T: infinitive
L: participle

# Tenses
0M: present
0E: perfect
RM: imperfect
RE: pluperfect
FM: future
FE: future perfect

# Voices
A: active
P: passive

# Participle (combination of mood, tense & voice)
L0A: present participle
LRP: past participle
LFA: future active participle
LFP: gerundive

# Persons
1: first
2: second
3: third

# Genders
m: masculine
f: feminine
n: neuter

# Numbers
s: singular
p: plural

# Cases
o: nominative
g: genitive
d: dative
a: accusative
b: ablative
v: vocative
x: adverbial

# Degrees
(positive isn't marked explicitly.)
c: comparative
u: superlative

# etc
r: preposition
j: conjunction

3. raw

...

Demo

https://latina.bab2min.pe.kr/xe/lTagger (Korean)

Larger Models

Due to the package size limit of PyPI, the distributed wheel package contains only the base model. We provide larger models via Google Drive links.

You can use these models by passing the model paths to Lamon.__init__ as arguments.

from lamonpy import Lamon
lamon = Lamon(dict_path='dict.large.bin', tagger_path='tagger.large.bin')
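
Since the larger models are downloaded separately, it can help to check that both files are present before constructing the tagger. The sketch below builds the two keyword arguments shown above from a directory. The helper `large_model_paths` and the directory name `models` are hypothetical, and the file names are taken from the example above.

```python
import os


def large_model_paths(model_dir):
    """Return keyword arguments for Lamon() pointing at the large model files in `model_dir`."""
    paths = {
        'dict_path': os.path.join(model_dir, 'dict.large.bin'),
        'tagger_path': os.path.join(model_dir, 'tagger.large.bin'),
    }
    missing = [path for path in paths.values() if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError('model file(s) not found: ' + ', '.join(missing))
    return paths


# Usage (requires lamonpy and the downloaded model files):
# from lamonpy import Lamon
# lamon = Lamon(**large_model_paths('models'))
```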

License

Lamonpy is licensed under the terms of the MIT License, meaning you can use it for any reasonable purpose and remain in complete ownership of all the documentation you produce.

History

  • 0.2.0 (2020-10-16)
    • A [NUM] token for Roman numerals was added.
    • The accuracy was slightly increased by introducing a joint lemma-tag layer.
  • 0.1.0 (2020-09-26)
    • The first version of lamonpy

Citation

@software{bab2min_2020_4091536,
  author       = {bab2min},
  title        = {bab2min/lamonpy: 0.2.0},
  month        = oct,
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.2.0},
  doi          = {10.5281/zenodo.4091536},
  url          = {https://doi.org/10.5281/zenodo.4091536}
}

lamonpy's Issues

Error when installing using pip

When I run pip install lamonpy on my computer it results in an error message about missing Eigen/Dense. I got it working by cloning the repo, creating an include folder, and pasting the Eigen folder into it before running pip install . in the directory.

> pip install lamonpy                                                               
Collecting lamonpy
  Using cached lamonpy-0.2.0.tar.gz (43.6 MB)
Requirement already satisfied: py-cpuinfo in c:\users\maybells\documents\coding\python\lemma review\venv\lib\site-packages (from lamonpy) (8.0.0)
Requirement already satisfied: numpy>=1.10.0 in c:\users\maybells\documents\coding\python\lemma review\venv\lib\site-packages (from lamonpy) (1.21.0)
Building wheels for collected packages: lamonpy
  Building wheel for lamonpy (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'c:\users\maybells\documents\coding\python\lemma review\venv\scripts\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\Maybells\\AppData\\Local\\Temp\\pip-install-mbtz8_r5\\lamonpy_487628ef84174edfb5d98b288fb7927d\\setup.py'"'"'; __file__='"'"'C:\\Users\\Maybells\\AppData\\Local\\Temp\\pip-install-mbtz8_r5\\lamonpy_487628ef84174edfb5d98b288fb7927d\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\Maybells\AppData\Local\Temp\pip-wheel-2dhnbr7c'
       cwd: C:\Users\Maybells\AppData\Local\Temp\pip-install-mbtz8_r5\lamonpy_487628ef84174edfb5d98b288fb7927d\
  Complete output (29 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-3.9
  creating build\lib.win-amd64-3.9\lamonpy
  copying lamonpy\__init__.py -> build\lib.win-amd64-3.9\lamonpy
  running egg_info
  writing lamonpy.egg-info\PKG-INFO
  writing dependency_links to lamonpy.egg-info\dependency_links.txt
  writing requirements to lamonpy.egg-info\requires.txt
  writing top-level names to lamonpy.egg-info\top_level.txt
  adding license file 'LICENSE' (matched pattern 'LICEN[CS]E*')
  reading manifest file 'lamonpy.egg-info\SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  warning: no previously-included files matching '*.py[cod]' found anywhere in distribution
  writing manifest file 'lamonpy.egg-info\SOURCES.txt'
  copying lamonpy\dict.bin -> build\lib.win-amd64-3.9\lamonpy
  copying lamonpy\tagger.bin -> build\lib.win-amd64-3.9\lamonpy
  running build_ext
  building '_lamonpy' extension
  creating build\temp.win-amd64-3.9
  creating build\temp.win-amd64-3.9\Release
  creating build\temp.win-amd64-3.9\Release\src
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -DMODULE_NAME=PyInit__lamonpy -Iinclude -Ic:\users\maybells\documents\coding\python\lemma review\venv\lib\site-packages\numpy\core\include -Ic:\users\maybells\documents\coding\python\lemma review\venv\include -IC:\Users\Maybells\AppData\Local\Programs\Python\Python39\include -IC:\Users\Maybells\AppData\Local\Programs\Python\Python39\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include 
-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /EHsc /Tpsrc/Lemmatizer.cpp /Fobuild\temp.win-amd64-3.9\Release\src/Lemmatizer.obj /O2 /MT /Gy
  cl : Command line warning D9025 : overriding '/MD' with '/MT'
  Lemmatizer.cpp
  C:\Users\Maybells\AppData\Local\Temp\pip-install-mbtz8_r5\lamonpy_487628ef84174edfb5d98b288fb7927d\src\layers.hpp(6): fatal error C1083: Cannot open include file: 'Eigen/Dense': No such file or directory
  error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.29.30133\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
  ----------------------------------------
  ERROR: Failed building wheel for lamonpy
  Running setup.py clean for lamonpy
Failed to build lamonpy
Installing collected packages: lamonpy
    Running setup.py install for lamonpy ... error
    ERROR: Command errored out with exit status 1:
     command: 'c:\users\maybells\documents\coding\python\lemma review\venv\scripts\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\Maybells\\AppData\\Local\\Temp\\pip-install-mbtz8_r5\\lamonpy_487628ef84174edfb5d98b288fb7927d\\setup.py'"'"'; __file__='"'"'C:\\Users\\Maybells\\AppData\\Local\\Temp\\pip-install-mbtz8_r5\\lamonpy_487628ef84174edfb5d98b288fb7927d\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\Maybells\AppData\Local\Temp\pip-record-tdtj1wju\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\maybells\documents\coding\python\lemma review\venv\include\site\python3.9\lamonpy'
         cwd: C:\Users\Maybells\AppData\Local\Temp\pip-install-mbtz8_r5\lamonpy_487628ef84174edfb5d98b288fb7927d\
    Complete output (29 lines):
    running install
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.9
    creating build\lib.win-amd64-3.9\lamonpy
    copying lamonpy\__init__.py -> build\lib.win-amd64-3.9\lamonpy
    running egg_info
    writing lamonpy.egg-info\PKG-INFO
    writing dependency_links to lamonpy.egg-info\dependency_links.txt
    writing requirements to lamonpy.egg-info\requires.txt
    writing top-level names to lamonpy.egg-info\top_level.txt
    adding license file 'LICENSE' (matched pattern 'LICEN[CS]E*')
    reading manifest file 'lamonpy.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    warning: no previously-included files matching '*.py[cod]' found anywhere in distribution
    writing manifest file 'lamonpy.egg-info\SOURCES.txt'
    copying lamonpy\dict.bin -> build\lib.win-amd64-3.9\lamonpy
    copying lamonpy\tagger.bin -> build\lib.win-amd64-3.9\lamonpy
    running build_ext
    building '_lamonpy' extension
    creating build\temp.win-amd64-3.9
    creating build\temp.win-amd64-3.9\Release
    creating build\temp.win-amd64-3.9\Release\src
    C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG 
/MD -DMODULE_NAME=PyInit__lamonpy -Iinclude -Ic:\users\maybells\documents\coding\python\lemma review\venv\lib\site-packages\numpy\core\include -Ic:\users\maybells\documents\coding\python\lemma review\venv\include -IC:\Users\Maybells\AppData\Local\Programs\Python\Python39\include -IC:\Users\Maybells\AppData\Local\Programs\Python\Python39\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /EHsc /Tpsrc/Lemmatizer.cpp /Fobuild\temp.win-amd64-3.9\Release\src/Lemmatizer.obj /O2 /MT /Gy        
    cl : Command line warning D9025 : overriding '/MD' with '/MT'
    Lemmatizer.cpp
    C:\Users\Maybells\AppData\Local\Temp\pip-install-mbtz8_r5\lamonpy_487628ef84174edfb5d98b288fb7927d\src\layers.hpp(6): fatal error C1083: Cannot open include file: 'Eigen/Dense': No such file or directory
    error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.29.30133\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
    ----------------------------------------
ERROR: Command errored out with exit status 1: 'c:\users\maybells\documents\coding\python\lemma review\venv\scripts\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\Maybells\\AppData\\Local\\Temp\\pip-install-mbtz8_r5\\lamonpy_487628ef84174edfb5d98b288fb7927d\\setup.py'"'"'; __file__='"'"'C:\\Users\\Maybells\\AppData\\Local\\Temp\\pip-install-mbtz8_r5\\lamonpy_487628ef84174edfb5d98b288fb7927d\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\Maybells\AppData\Local\Temp\pip-record-tdtj1wju\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\maybells\documents\coding\python\lemma review\venv\include\site\python3.9\lamonpy' Check the logs for full command output.
