
zeyrek's Introduction

Zeyrek: Morphological Analyzer and Lemmatizer


Zeyrek is a partial port of the Zemberek library to Python for lemmatizing and analyzing Turkish language words. It is in alpha stage, and the API will probably change.

Basic Usage

To use Zeyrek, first create an instance of MorphAnalyzer class:

import zeyrek
analyzer = zeyrek.MorphAnalyzer()

Then, you can call its analyze method on words or texts to get all possible analyses:

for parse in analyzer.analyze('benim')[0]:
    print(parse)
Parse(word='benim', lemma='ben', pos='Noun', morphemes=['Noun', 'A3sg', 'P1sg'], formatted='[ben:Noun] ben:Noun+A3sg+im:P1sg')
Parse(word='benim', lemma='ben', pos='Pron', morphemes=['Pron', 'A1sg', 'Gen'], formatted='[ben:Pron,Pers] ben:Pron+A1sg+im:Gen')
Parse(word='benim', lemma='ben', pos='Verb', morphemes=['Noun', 'A3sg', 'Zero', 'Verb', 'Pres', 'A1sg'], formatted='[ben:Noun] ben:Noun+A3sg|Zero→Verb+Pres+im:A1sg')
Parse(word='benim', lemma='ben', pos='Verb', morphemes=['Pron', 'A1sg', 'Zero', 'Verb', 'Pres', 'A1sg'], formatted='[ben:Pron,Pers] ben:Pron+A1sg|Zero→Verb+Pres+im:A1sg')

If you only need the base forms of words (lemmas), you can call lemmatize. It returns a list of tuples, each holding the word itself and a list of its possible lemmas:

print(analyzer.lemmatize('benim'))
[('benim', ['ben'])]

Credits

This package is a Python port of part of the Zemberek package by Ahmet A. Akın.

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

zeyrek's People

Contributors

abhi-kr-2100, dependabot[bot], obulat, sourcery-ai-bot


zeyrek's Issues

a method to get the "last lemma"

analyzer.lemmatize() simply returns all the possible lemmas for a word. What would be really useful is a method that simply returns the first lemma reading from the tail, i.e. the last lemma reading from the head.
for example:

analyzer.lemmatize("çekilişle")
[('çekilişle', ['çekiliş', 'çekmek', 'çekilmek'])]
When we remove the inflectional suffix '-le' we still preserve the meaning, but after removing '-iş' we get a different meaning (çekilmek). So an example use case would be:
analyzer.get_last_lemma("çekilişle")
"çekiliş"

LookupError on analyzer.lemmatize

Hello, I used to lemmatize Turkish text with Zeyrek, but somehow the code is not working anymore. I checked whether the example in the documentation works, and I receive this error.

(Screenshot of the error traceback, dated 2023-03-06, attached.)
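
If the LookupError comes from NLTK's tokenizer data not being installed (a common cause when a library tokenizes text with NLTK; this is an assumption, since the traceback is only available as a screenshot), downloading the resource the error names may resolve it:

import nltk

# Assumed fix: fetch the tokenizer models NLTK looks up at runtime.
nltk.download('punkt')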

MorphAnalyzer gets corrupted after a while.

MorphAnalyzer returns only 'Unk' after many iterations of analyzer.analyze(word). As a workaround I'm re-initializing it after it returns 'Unk' for known words. Any clues on why this happens?
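
A minimal sketch of the workaround described above, assuming a known-good word such as 'ev' can serve as a health check (the probe word and the retry logic are illustrative, not part of Zeyrek):

import zeyrek

analyzer = zeyrek.MorphAnalyzer()
KNOWN_WORD = "ev"  # any word the analyzer should always recognize

def analyze_with_recovery(word):
    """Re-create the analyzer if it starts labelling a known word as 'Unk'."""
    global analyzer
    probe = analyzer.analyze(KNOWN_WORD)[0]
    if probe and probe[0].lemma == 'Unk':
        analyzer = zeyrek.MorphAnalyzer()  # replace the corrupted instance
    return analyzer.analyze(word)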

Too many log messages cause an interrupt

Hi,
I don't know if there is a way to turn them off, but I have a problem with this.

Actually, I don't have that many sentences: I work with 250 lines of data, and every line contains only 1 or 2 sentences. But after I run my loop, Jupyter is interrupted because there are so many logs. Is there any way to disable the logs?

clean_sentences = []
for sent in sentences:
    clean_sent = []
    for word in sent:
        if word != "":
            lemword = analyzer.lemmatize(word)
            clean_sent.append(lemword[0][1][0])
    clean_sentences.append(clean_sent)

As I said, the log messages cause an interrupt :/
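
A small sketch of one way to silence the messages, assuming they are emitted through Python's standard logging module (which the tracebacks elsewhere in these issues suggest):

import logging

# Drop every record below WARNING, which should hide the INFO/DEBUG chatter.
logging.disable(logging.INFO)

# Alternatively, raise the threshold on the root logger only:
logging.getLogger().setLevel(logging.WARNING)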

Parse Object

Sorry for a very simple question, but I need to ask.

The analyzer and lemmatizer functions return lists whose items have fields such as lemma, pos, morphemes, etc.

After I run the analyzer, how can I access the lemma property?
Right now those functions return lists of lists.

I am not very proficient in Python; apologies for bothering you.

Thank you for the help.
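
A short sketch of how to reach the lemma field: analyze() returns one list of Parse results per word, and each Parse appears to be a namedtuple, so its fields can be read as attributes (the sample text is arbitrary):

import zeyrek

analyzer = zeyrek.MorphAnalyzer()

for word_results in analyzer.analyze("benim evim"):  # one list per word
    for parse in word_results:                       # one Parse per analysis
        print(parse.word, parse.lemma, parse.pos)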

`MorphAnalyzer.analyze` cannot parse numbers

Here's an example run along with INFO logs demonstrating the problem:

In [1]: import zeyrek

In [2]: ma = zeyrek.MorphAnalyzer()

In [3]: ma.analyze("3'te okuldaydım.")
APPENDING RESULT: <(okul_Noun)(-)(okul:noun_S + a3sg_S + pnon_S + da:loc_ST + nounZeroDeriv_S + nVerb_S + ydı:nPast_S + m:nA1sg_ST)>
APPENDING RESULT: <(._Punc)(-)(.:puncRoot_ST)>
Out[3]:
[[Parse(word='3te', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk')],
 [],
 [Parse(word='okuldaydım', lemma='okul', pos='Verb', morphemes=['Noun', 'A3sg', 'Loc', 'Zero', 'Verb', 'Past', 'A1sg'], formatted='[okul:Noun] okul:Noun+A3sg+da:Loc|Zero→Verb+ydı:Past+m:A1sg')],
 [Parse(word='.', lemma='.', pos='Punc', morphemes=['Punc'], formatted='[.:Punc] .:Punc')]]

empty result added after unknown word

While analyzing a sentence, when an unknown word is encountered, an empty parse array appears after it. For example:

text = "Mahjong oynamayı biliyor musun?"
analyzer = zeyrek.MorphAnalyzer()
analyzer.analyze(text)

returns the following results:

[Parse(word='Mahjong', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk')]
[]
[Parse(word='oynamayı', lemma='oynamak', pos='Noun', morphemes=['Verb', 'Inf2', 'Noun', 'A3sg', 'Acc'], formatted='[oynamak:Verb] oyna:Verb|ma:Inf2→Noun+A3sg+yı:Acc')]
[Parse(word='biliyor', lemma='bilmek', pos='Verb', morphemes=['Verb', 'Prog1', 'A3sg'], formatted='[bilmek:Verb] bil:Verb+iyor:Prog1+A3sg'), Parse(word='biliyor', lemma='bilemek', pos='Verb', morphemes=['Verb', 'Prog1', 'A3sg'], formatted='[bilemek:Verb] bil:Verb+iyor:Prog1+A3sg')]
[Parse(word='musun', lemma='mu', pos='Ques', morphemes=['Ques', 'Pres', 'A2sg'], formatted='[mu:Ques] mu:Ques+Pres+sun:A2sg'), Parse(word='musun', lemma='Mu', pos='Verb', morphemes=['Noun', 'A3sg', 'Zero', 'Verb', 'Pres', 'A2sg'], formatted='[Mu:Noun,Abbrv] mu:Noun+A3sg|Zero→Verb+Pres+sun:A2sg')]
[Parse(word='?', lemma='?', pos='Punc', morphemes=['Punc'], formatted='[?:Punc] ?:Punc')]
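
Until this is fixed, a trivial sketch of a user-side workaround is to drop the empty lists from the output (this hides the symptom rather than fixing the underlying bug):

non_empty = [word_results for word_results in analyzer.analyze(text) if word_results]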

AttributeError: 'frozenset' object has no attribute 'add'

Hi there, I've been using Zeyrek to lemmatize a set of 250,000 Turkish tweets. It starts to lemmatize, but after 10 minutes or so I get this error.


AttributeError Traceback (most recent call last)
in

~\AppData\Roaming\Python\Python39\site-packages\zeyrek\morphology.py in lemmatize(self, text)
137 words = _tokenize_text(text)
138 for word in words:
--> 139 analysis = self._parse(word)
140 if len(analysis) == 0:
141 word_lemmas = [word]

~\AppData\Roaming\Python\Python39\site-packages\zeyrek\morphology.py in _parse(self, word)
94 """ Parses a word and returns SingleAnalysis result. """
95 normalized_word = _normalize(word)
---> 96 return self.analyzer.analyze(normalized_word)
97
98 def _analyze_text(self, text, verbose=False):

~\AppData\Roaming\Python\Python39\site-packages\zeyrek\rulebasedanalyzer.py in analyze(self, word)
29 paths.append(SearchPath.initial(candidate, tail))
30 # search graph.
---> 31 result_paths = self.search(paths)
32
33 # generate results from successful paths.

~\AppData\Roaming\Python\Python39\site-packages\zeyrek\rulebasedanalyzer.py in search(self, current_paths)
59 continue
60 # Creates new paths with outgoing and matching transitions.
---> 61 new_paths = self.advance(path)
62 logging.debug(f"\n--\nNew paths are: ")
63 for p in new_paths:

~\AppData\Roaming\Python\Python39\site-packages\zeyrek\rulebasedanalyzer.py in advance(self, path)
123 last_token = transition.last_template_token
124 if last_token.type_ == 'LAST_VOICED':
--> 125 attributes.add(PhoneticAttribute.ExpectsConsonant)
126 elif last_token.type_ == 'LAST_NOT_VOICED':
127 attributes.add(PhoneticAttribute.ExpectsVowel)

AttributeError: 'frozenset' object has no attribute 'add'
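
The traceback suggests that advance() in rulebasedanalyzer.py tries to mutate a frozenset. A hedged, untested sketch of the kind of change this points to, copying into a mutable set before the .add() calls (whether this matches the library's intended logic is an assumption):

# Hypothetical patch near line 125 of zeyrek/rulebasedanalyzer.py
attributes = set(attributes)  # frozenset has no .add(); work on a mutable copy
if last_token.type_ == 'LAST_VOICED':
    attributes.add(PhoneticAttribute.ExpectsConsonant)
elif last_token.type_ == 'LAST_NOT_VOICED':
    attributes.add(PhoneticAttribute.ExpectsVowel)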

It does not recognize izafet compounds (terkipler)!

Hello,
The -i, -u, -ül, etc. attachments used in old (Ottoman-era) words are marked as Unknown.
Is it possible to make the analyzer ignore these suffixes?

r = "tevahhuş Rezzâk Rezzâk-ı Zülcelâle bakiye-i ömrümü ahz-ı mal Mün'im-i Hakikîye şükrü, senâyı zâhirî esbaba"
data = analyzer.analyze(r)
result = [i for x in data for i in x ]
print(result)
[Parse(word='tevahhuş', lemma='tevahhuş', pos='Noun', morphemes=['Noun', 'A3sg'], formatted='[tevahhuş:Noun] tevahhuş:Noun+A3sg'),
 Parse(word='Rezzâk', lemma='Rezzak', pos='Noun', morphemes=['Noun', 'A3sg'], formatted='[Rezzak:Noun,Prop] rezzak:Noun+A3sg'),
 Parse(word='Rezzâk-ı', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk'),
 Parse(word='Zülcelâle', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk'),
 Parse(word='bakiye-i', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk'),
 Parse(word='ömrümü', lemma='ömür', pos='Noun', morphemes=['Noun', 'A3sg', 'P1sg', 'Acc'], formatted='[ömür:Noun] ömr:Noun+A3sg+üm:P1sg+ü:Acc'),
 Parse(word='ahz-ı', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk'),
 Parse(word='mal', lemma='mal', pos='Noun', morphemes=['Noun', 'A3sg'], formatted='[mal:Noun] mal:Noun+A3sg'),
 Parse(word='Münim-i', lemma='Unk', pos='Unk', morphemes='Unk', formatted='Unk'),
 Parse(word='Hakikîye', lemma='hakikî', pos='Noun', morphemes=['Noun', 'A3sg', 'Dat'], formatted='[hakikî:Noun] hakiki:Noun+A3sg+ye:Dat'),
 Parse(word='şükrü', lemma='şükür', pos='Noun', morphemes=['Noun', 'A3sg', 'Acc'], formatted='[şükür:Noun] şükr:Noun+A3sg+ü:Acc'),
 Parse(word='şükrü', lemma='şükür', pos='Noun', morphemes=['Noun', 'A3sg', 'P3sg'], formatted='[şükür:Noun] şükr:Noun+A3sg+ü:P3sg'),
 Parse(word='şükrü', lemma='Şükrü', pos='Noun', morphemes=['Noun', 'A3sg'], formatted='[Şükrü:Noun,Prop] şükrü:Noun+A3sg'),
 Parse(word=',', lemma=',', pos='Punc', morphemes=['Punc'], formatted='[,:Punc] ,:Punc'),
 Parse(word='senâyı', lemma='Sena', pos='Noun', morphemes=['Noun', 'A3sg', 'Acc'], formatted='[Sena:Noun,Prop] sena:Noun+A3sg+yı:Acc'),
 Parse(word='senâyı', lemma='sena', pos='Noun', morphemes=['Noun', 'A3sg', 'Acc'], formatted='[sena:Noun] sena:Noun+A3sg+yı:Acc'),
 Parse(word='senâyı', lemma='Senay', pos='Noun', morphemes=['Noun', 'A3sg', 'Acc'], formatted='[Senay:Noun,Prop] senay:Noun+A3sg+ı:Acc'),
 Parse(word='senâyı', lemma='Senay', pos='Noun', morphemes=['Noun', 'A3sg', 'P3sg'], formatted='[Senay:Noun,Prop] senay:Noun+A3sg+ı:P3sg'),
 Parse(word='zâhirî', lemma='zahirî', pos='Adj', morphemes=['Adj'], formatted='[zahirî:Adj] zahiri:Adj'),
 Parse(word='zâhirî', lemma='Zahir', pos='Noun', morphemes=['Noun', 'A3sg', 'Acc'], formatted='[Zahir:Noun,Prop] zahir:Noun+A3sg+i:Acc'),
 Parse(word='zâhirî', lemma='Zahir', pos='Noun', morphemes=['Noun', 'A3sg', 'P3sg'], formatted='[Zahir:Noun,Prop] zahir:Noun+A3sg+i:P3sg'),
 Parse(word='zâhirî', lemma='zahir', pos='Noun', morphemes=['Noun', 'A3sg', 'Acc'], formatted='[zahir:Noun] zahir:Noun+A3sg+i:Acc'),
 Parse(word='zâhirî', lemma='zahir', pos='Noun', morphemes=['Noun', 'A3sg', 'P3sg'], formatted='[zahir:Noun] zahir:Noun+A3sg+i:P3sg'),
 Parse(word='zâhirî', lemma='zahir', pos='Noun', morphemes=['Adj', 'Zero', 'Noun', 'A3sg', 'Acc'], formatted='[zahir:Adj] zahir:Adj|Zero→Noun+A3sg+i:Acc'),
 Parse(word='zâhirî', lemma='zahir', pos='Noun', morphemes=['Adj', 'Zero', 'Noun', 'A3sg', 'P3sg'], formatted='[zahir:Adj] zahir:Adj|Zero→Noun+A3sg+i:P3sg'),
 Parse(word='esbaba', lemma='esbap', pos='Noun', morphemes=['Noun', 'A3sg', 'Dat'], formatted='[esbap:Noun] esbab:Noun+A3sg+a:Dat')]
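
One possible workaround, sketched below, is to strip the hyphenated izafet marker before analysis. The regex and the assumption that simply discarding the suffix is acceptable for your text are mine; the suffix is dropped, not analyzed:

import re
import zeyrek

analyzer = zeyrek.MorphAnalyzer()

def strip_izafet(text):
    """Remove hyphenated izafet endings such as 'Rezzâk-ı' -> 'Rezzâk'."""
    return re.sub(r"-(ı|i|u|ü|yı|yi)\b", "", text)

r = "tevahhuş Rezzâk Rezzâk-ı Zülcelâle bakiye-i ömrümü ahz-ı mal"
print(analyzer.analyze(strip_izafet(r)))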

Stemming unknown proper nouns

Hey Olga, good work with zeyrek.

I have a small improvement suggestion. Zeyrek is capable of providing the stem of known proper nouns where inflections are attached with an apostrophe. Example:
"istanbul'daki" -> "İstanbul"

but it merges the inflection into the stem in the case of an unknown proper noun, without parsing the inflections. Example:
"melik'in" -> "melikin"

So my suggestion is that it should return the part before the apostrophe. I'm not sure whether it should also parse the inflection after the apostrophe. I might be missing some other case involving apostrophes, but here I am pointing at unknown proper nouns and their inflections.
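
A minimal sketch of that suggestion as user-side pre-processing, simply cutting at the apostrophe (the helper name is illustrative, and the inflection after the apostrophe is discarded rather than parsed):

def stem_before_apostrophe(word):
    """Return the part before the apostrophe, as suggested for unknown proper nouns."""
    stem, _, _suffix = word.partition("'")
    return stem

print(stem_before_apostrophe("melik'in"))  # -> 'melik'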
