
zeyrek's Introduction

Zeyrek: Morphological Analyzer and Lemmatizer


Zeyrek is a partial port of the Zemberek library to Python for lemmatizing and analyzing Turkish language words. It is in alpha stage, and the API will probably change.

Basic Usage

To use Zeyrek, first create an instance of MorphAnalyzer class:

import zeyrek
analyzer = zeyrek.MorphAnalyzer()

Then, you can call its analyze method on words or texts to get all possible analyses:

for parse in analyzer.analyze('benim')[0]:
    print(parse)
Parse(word='benim', lemma='ben', pos='Noun', morphemes=['Noun', 'A3sg', 'P1sg'], formatted='[ben:Noun] ben:Noun+A3sg+im:P1sg')
Parse(word='benim', lemma='ben', pos='Pron', morphemes=['Pron', 'A1sg', 'Gen'], formatted='[ben:Pron,Pers] ben:Pron+A1sg+im:Gen')
Parse(word='benim', lemma='ben', pos='Verb', morphemes=['Noun', 'A3sg', 'Zero', 'Verb', 'Pres', 'A1sg'], formatted='[ben:Noun] ben:Noun+A3sg|Zero→Verb+Pres+im:A1sg')
Parse(word='benim', lemma='ben', pos='Verb', morphemes=['Pron', 'A1sg', 'Zero', 'Verb', 'Pres', 'A1sg'], formatted='[ben:Pron,Pers] ben:Pron+A1sg|Zero→Verb+Pres+im:A1sg')

If you only need the base forms of words (lemmas), you can call lemmatize. It returns a list of tuples, each holding the word itself and a list of its possible lemmas:

print(analyzer.lemmatize('benim'))
[('benim', ['ben'])]

Credits

This package is a Python port of part of the Zemberek package by Ahmet A. Akın.

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

zeyrek's People

Contributors

abhi-kr-2100, dependabot[bot], obulat, sourcery-ai-bot


zeyrek's Issues

a method to get the "last lemma"

analyzer.lemmatize() simply returns all the possible lemmas for a word. What would be really useful is a method that simply returns the first lemma reading from the tail, i.e. the last lemma reading from the head.
for example:

analyzer.lemmatize("çekilişle")
[('çekilişle', ['çekiliş', 'çekmek', 'çekilmek'])]
When we remove the inflectional suffix '-le' we still preserve the meaning, but after removing '-iş' we get a different meaning (çekilmek). So an example use case would be:
analyzer.get_last_lemma("çekilişle")
"çekiliş"

LookupError on analyzer.lemmatize

Hello, I used to lemmatize Turkish text with Zeyrek, but somehow the code is not working anymore. I checked whether the example in the documentation works, and I receive this error.

(Screenshot of the error traceback, dated 2023-03-06, attached.)
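
If the LookupError comes from NLTK's tokenizer data not being installed (a common cause when a library tokenizes text with NLTK; this is an assumption, since the traceback is only available as a screenshot), downloading the resource the error names may resolve it:

import nltk

# Assumed fix: fetch the tokenizer models NLTK looks up at runtime.
nltk.download('punkt')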

MorphAnalyzer gets corrupted after a while.

MorphAnalyzer returns only 'Unk' after many iterations of analyzer.analyze(word). As a workaround I'm re-initializing it after it returns 'Unk' for known words. Any clues on why this happens?
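
A minimal sketch of the workaround described above, assuming a known-good word such as 'ev' can serve as a health check (the probe word and the retry logic are illustrative, not part of Zeyrek):

import zeyrek

analyzer = zeyrek.MorphAnalyzer()
KNOWN_WORD = "ev"  # any word the analyzer should always recognize

def analyze_with_recovery(word):
    """Re-create the analyzer if it starts labelling a known word as 'Unk'."""
    global analyzer
    probe = analyzer.analyze(KNOWN_WORD)[0]
    if probe and probe[0].lemma == 'Unk':
        analyzer = zeyrek.MorphAnalyzer()  # replace the corrupted instance
    return analyzer.analyze(word)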

Too many log messages cause an interrupt

Hi,
I don't know if there is a way to turn them off, but I have a problem with this.

Actually, I don't have that many sentences: I work with 250 lines of data, and every line contains only 1 or 2 sentences. But after I run my loop, Jupyter is interrupted because there are so many logs. Is there any way to disable the logs?

clean_sentences = []
for sent in sentences:
    clean_sent = []
    for word in sent:
        if word != "":
            lemword = analyzer.lemmatize(word)
            clean_sent.append(lemword[0][1][0])
    clean_sentences.append(clean_sent)

As I said, the log messages cause an interrupt :/
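
A small sketch of one way to silence the messages, assuming they are emitted through Python's standard logging module (which the tracebacks elsewhere in these issues suggest):

import logging

# Drop every record below WARNING, which should hide the INFO/DEBUG chatter.
logging.disable(logging.INFO)

# Alternatively, raise the threshold on the root logger only:
logging.getLogger().setLevel(logging.WARNING)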

Parse Object

Sorry for a very simple question, but I need to ask.

The analyzer and lemmatizer functions return lists whose items have fields such as lemma, pos, morphemes, etc.

After I run the analyzer, how can I access the lemma property?
Right now those functions return lists of lists.

I am not very proficient in Python; apologies for bothering you.

Thank you for the help.
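
A short sketch of how to reach the lemma field: analyze() returns one list of Parse results per word, and each Parse appears to be a namedtuple, so its fields can be read as attributes (the sample text is arbitrary):

import zeyrek

analyzer = zeyrek.MorphAnalyzer()

for word_results in analyzer.analyze("benim evim"):  # one list per word
    for parse in word_results:                       # one Parse per analysis
        print(parse.word, parse.lemma, parse.pos)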

`MorphAnalyzer.analyze` cannot parse numbers

Here's an example run along with INFO logs demonstrating the problem:

In [1]: import zeyrek

In [2]: ma = zeyrek.MorphAnalyzer()

In [3]: ma.analyze("3'te okuldaydım.")
APPENDING RESULT: <(okul_Noun)(-)(okul:noun_S + a3sg_S + pnon_S + da:loc_ST + nounZeroDeriv_S + nVerb_S + ydı:nPast_S + m:nA1sg_ST)>
APPENDING RESULT: <(._Punc)(-)(.:puncRoot_ST)>
Out[3]:
[[Parse(word='3te', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk')],
 [],
 [Parse(word='okuldaydım', lemma='okul', pos='Verb', morphemes=['Noun', 'A3sg', 'Loc', 'Zero', 'Verb', 'Past', 'A1sg'], formatted='[okul:Noun] okul:Noun+A3sg+da:Loc|Zero→Verb+ydı:Past+m:A1sg')],
 [Parse(word='.', lemma='.', pos='Punc', morphemes=['Punc'], formatted='[.:Punc] .:Punc')]]

empty result added after unknown word

While analyzing a sentence, when an unknown word is encountered, an empty parse array appears after it. For example:

text = "Mahjong oynamayı biliyor musun?"
analyzer = zeyrek.MorphAnalyzer()
analyzer.analyze(text)

returns the following results:

[Parse(word='Mahjong', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk')]
[]
[Parse(word='oynamayı', lemma='oynamak', pos='Noun', morphemes=['Verb', 'Inf2', 'Noun', 'A3sg', 'Acc'], formatted='[oynamak:Verb] oyna:Verb|ma:Inf2→Noun+A3sg+yı:Acc')]
[Parse(word='biliyor', lemma='bilmek', pos='Verb', morphemes=['Verb', 'Prog1', 'A3sg'], formatted='[bilmek:Verb] bil:Verb+iyor:Prog1+A3sg'), Parse(word='biliyor', lemma='bilemek', pos='Verb', morphemes=['Verb', 'Prog1', 'A3sg'], formatted='[bilemek:Verb] bil:Verb+iyor:Prog1+A3sg')]
[Parse(word='musun', lemma='mu', pos='Ques', morphemes=['Ques', 'Pres', 'A2sg'], formatted='[mu:Ques] mu:Ques+Pres+sun:A2sg'), Parse(word='musun', lemma='Mu', pos='Verb', morphemes=['Noun', 'A3sg', 'Zero', 'Verb', 'Pres', 'A2sg'], formatted='[Mu:Noun,Abbrv] mu:Noun+A3sg|Zero→Verb+Pres+sun:A2sg')]
[Parse(word='?', lemma='?', pos='Punc', morphemes=['Punc'], formatted='[?:Punc] ?:Punc')]
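
Until this is fixed, a trivial sketch of a user-side workaround is to drop the empty lists from the output (this hides the symptom rather than fixing the underlying bug):

non_empty = [word_results for word_results in analyzer.analyze(text) if word_results]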

AttributeError: 'frozenset' object has no attribute 'add'

Hi there, I've been using Zeyrek to lemmatize a set of 250,000 Turkish tweets. It starts to lemmatize, but after 10 minutes or so I get this error.


AttributeError Traceback (most recent call last)
in

~\AppData\Roaming\Python\Python39\site-packages\zeyrek\morphology.py in lemmatize(self, text)
137 words = _tokenize_text(text)
138 for word in words:
--> 139 analysis = self._parse(word)
140 if len(analysis) == 0:
141 word_lemmas = [word]

~\AppData\Roaming\Python\Python39\site-packages\zeyrek\morphology.py in _parse(self, word)
94 """ Parses a word and returns SingleAnalysis result. """
95 normalized_word = _normalize(word)
---> 96 return self.analyzer.analyze(normalized_word)
97
98 def _analyze_text(self, text, verbose=False):

~\AppData\Roaming\Python\Python39\site-packages\zeyrek\rulebasedanalyzer.py in analyze(self, word)
29 paths.append(SearchPath.initial(candidate, tail))
30 # search graph.
---> 31 result_paths = self.search(paths)
32
33 # generate results from successful paths.

~\AppData\Roaming\Python\Python39\site-packages\zeyrek\rulebasedanalyzer.py in search(self, current_paths)
59 continue
60 # Creates new paths with outgoing and matching transitions.
---> 61 new_paths = self.advance(path)
62 logging.debug(f"\n--\nNew paths are: ")
63 for p in new_paths:

~\AppData\Roaming\Python\Python39\site-packages\zeyrek\rulebasedanalyzer.py in advance(self, path)
123 last_token = transition.last_template_token
124 if last_token.type_ == 'LAST_VOICED':
--> 125 attributes.add(PhoneticAttribute.ExpectsConsonant)
126 elif last_token.type_ == 'LAST_NOT_VOICED':
127 attributes.add(PhoneticAttribute.ExpectsVowel)

AttributeError: 'frozenset' object has no attribute 'add'
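
The traceback suggests that advance() in rulebasedanalyzer.py tries to mutate a frozenset. A hedged, untested sketch of the kind of change this points to, copying into a mutable set before the .add() calls (whether this matches the library's intended logic is an assumption):

# Hypothetical patch near line 125 of zeyrek/rulebasedanalyzer.py
attributes = set(attributes)  # frozenset has no .add(); work on a mutable copy
if last_token.type_ == 'LAST_VOICED':
    attributes.add(PhoneticAttribute.ExpectsConsonant)
elif last_token.type_ == 'LAST_NOT_VOICED':
    attributes.add(PhoneticAttribute.ExpectsVowel)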

It does not recognize izafet compounds (terkipler)!

Hello,
The -i, -u, -ül, etc. attachments used in old (Ottoman-era) words are marked as Unknown.
Is it possible to make the analyzer ignore these suffixes?

r = "tevahhuş Rezzâk Rezzâk-ı Zülcelâle bakiye-i ömrümü ahz-ı mal Mün'im-i Hakikîye şükrü, senâyı zâhirî esbaba"
data = analyzer.analyze(r)
result = [i for x in data for i in x ]
print(result)
[Parse(word='tevahhuş', lemma='tevahhuş', pos='Noun', morphemes=['Noun', 'A3sg'], formatted='[tevahhuş:Noun] tevahhuş:Noun+A3sg'),
 Parse(word='Rezzâk', lemma='Rezzak', pos='Noun', morphemes=['Noun', 'A3sg'], formatted='[Rezzak:Noun,Prop] rezzak:Noun+A3sg'),
 Parse(word='Rezzâk-ı', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk'),
 Parse(word='Zülcelâle', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk'),
 Parse(word='bakiye-i', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk'),
 Parse(word='ömrümü', lemma='ömür', pos='Noun', morphemes=['Noun', 'A3sg', 'P1sg', 'Acc'], formatted='[ömür:Noun] ömr:Noun+A3sg+üm:P1sg+ü:Acc'),
 Parse(word='ahz-ı', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk'),
 Parse(word='mal', lemma='mal', pos='Noun', morphemes=['Noun', 'A3sg'], formatted='[mal:Noun] mal:Noun+A3sg'),
 Parse(word='Münim-i', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk'),
 Parse(word='Hakikîye', lemma='hakikî', pos='Noun', morphemes=['Noun', 'A3sg', 'Dat'], formatted='[hakikî:Noun] hakiki:Noun+A3sg+ye:Dat'),
 Parse(word='şükrü', lemma='şükür', pos='Noun', morphemes=['Noun', 'A3sg', 'Acc'], formatted='[şükür:Noun] şükr:Noun+A3sg+ü:Acc'),
 Parse(word='şükrü', lemma='şükür', pos='Noun', morphemes=['Noun', 'A3sg', 'P3sg'], formatted='[şükür:Noun] şükr:Noun+A3sg+ü:P3sg'),
 Parse(word='şükrü', lemma='Şükrü', pos='Noun', morphemes=['Noun', 'A3sg'], formatted='[Şükrü:Noun,Prop] şükrü:Noun+A3sg'),
 Parse(word=',', lemma=',', pos='Punc', morphemes=['Punc'], formatted='[,:Punc] ,:Punc'),
 Parse(word='senâyı', lemma='Sena', pos='Noun', morphemes=['Noun', 'A3sg', 'Acc'], formatted='[Sena:Noun,Prop] sena:Noun+A3sg+yı:Acc'),
 Parse(word='senâyı', lemma='sena', pos='Noun', morphemes=['Noun', 'A3sg', 'Acc'], formatted='[sena:Noun] sena:Noun+A3sg+yı:Acc'),
 Parse(word='senâyı', lemma='Senay', pos='Noun', morphemes=['Noun', 'A3sg', 'Acc'], formatted='[Senay:Noun,Prop] senay:Noun+A3sg+ı:Acc'),
 Parse(word='senâyı', lemma='Senay', pos='Noun', morphemes=['Noun', 'A3sg', 'P3sg'], formatted='[Senay:Noun,Prop] senay:Noun+A3sg+ı:P3sg'),
 Parse(word='zâhirî', lemma='zahirî', pos='Adj', morphemes=['Adj'], formatted='[zahirî:Adj] zahiri:Adj'),
 Parse(word='zâhirî', lemma='Zahir', pos='Noun', morphemes=['Noun', 'A3sg', 'Acc'], formatted='[Zahir:Noun,Prop] zahir:Noun+A3sg+i:Acc'),
 Parse(word='zâhirî', lemma='Zahir', pos='Noun', morphemes=['Noun', 'A3sg', 'P3sg'], formatted='[Zahir:Noun,Prop] zahir:Noun+A3sg+i:P3sg'),
 Parse(word='zâhirî', lemma='zahir', pos='Noun', morphemes=['Noun', 'A3sg', 'Acc'], formatted='[zahir:Noun] zahir:Noun+A3sg+i:Acc'),
 Parse(word='zâhirî', lemma='zahir', pos='Noun', morphemes=['Noun', 'A3sg', 'P3sg'], formatted='[zahir:Noun] zahir:Noun+A3sg+i:P3sg'),
 Parse(word='zâhirî', lemma='zahir', pos='Noun', morphemes=['Adj', 'Zero', 'Noun', 'A3sg', 'Acc'], formatted='[zahir:Adj] zahir:Adj|Zero→Noun+A3sg+i:Acc'),
 Parse(word='zâhirî', lemma='zahir', pos='Noun', morphemes=['Adj', 'Zero', 'Noun', 'A3sg', 'P3sg'], formatted='[zahir:Adj] zahir:Adj|Zero→Noun+A3sg+i:P3sg'),
 Parse(word='esbaba', lemma='esbap', pos='Noun', morphemes=['Noun', 'A3sg', 'Dat'], formatted='[esbap:Noun] esbab:Noun+A3sg+a:Dat')]
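
One possible workaround, sketched below, is to strip the hyphenated izafet marker before analysis. The regex and the assumption that simply discarding the suffix is acceptable for your text are mine; the suffix is dropped, not analyzed:

import re
import zeyrek

analyzer = zeyrek.MorphAnalyzer()

def strip_izafet(text):
    """Remove hyphenated izafet endings such as 'Rezzâk-ı' -> 'Rezzâk'."""
    return re.sub(r"-(ı|i|u|ü|yı|yi)\b", "", text)

r = "tevahhuş Rezzâk Rezzâk-ı Zülcelâle bakiye-i ömrümü ahz-ı mal"
print(analyzer.analyze(strip_izafet(r)))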

Stemming unknown proper nouns

Hey Olga, good work with zeyrek.

I have a small improvement suggestion. Zeyrek is capable of providing the stem of known proper nouns where inflections are attached with an apostrophe. Example:
"istanbul'daki" -> "İstanbul"

but it merges the inflection into the stem in the case of an unknown proper noun, without parsing the inflections. Example:
"melik'in" -> "melikin"

So my suggestion is that it should return the part before the apostrophe. I'm not sure whether it should also parse the inflection after the apostrophe. I might be missing some other case involving apostrophes, but here I am pointing at unknown proper nouns and their inflections.
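
A minimal sketch of that suggestion as user-side pre-processing, simply cutting at the apostrophe (the helper name is illustrative, and the inflection after the apostrophe is discarded rather than parsed):

def stem_before_apostrophe(word):
    """Return the part before the apostrophe, as suggested for unknown proper nouns."""
    stem, _, _suffix = word.partition("'")
    return stem

print(stem_before_apostrophe("melik'in"))  # -> 'melik'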
