sinaahmadi / klpt

The Kurdish Language Processing Toolkit

Home Page: https://sinaahmadi.github.io/klpt/

License: Other

Python 100.00%

kurdish-language-processing natural-language-processing kurdish toolkit kurdish-tokenization kurdish-stemming kurdish-oss less-resource-languages language-technology

klpt's People

jsajadi kyumarss ftyers alhm02 alanhilal choxos osaaso3 jagaryousef mhmd-azeez realameerhameed

klpt's Issues

treebank for Kurmanji

Note that there is a treebank for Kurmanji at https://github.com/UniversalDependencies/UD_Kurmanji-MG. It can be used for training a part-of-speech tagger and a dependency parser.

Integrate the Configuration module in other modules

The Configuration module needs to be properly integrated into every function in all other modules. This is not yet the case for the Stem module, for instance.

klpt/klpt/configuration.py

Line 18 in d540c52

class Configuration:

tokenization with word ending with "iy" instead of "îy"

In Kurmanji, when a word ending in "î" is inflected with a form starting with "î", the resulting "îî" becomes "iy" rather than "îy". The tokenization module should account for this alternation so that the correct word forms are looked up in the dictionary.

dîplomasiyê / dîplomasîyê
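
One way a lookup step could account for this alternation is to generate both spellings before querying the dictionary. This is only a minimal sketch: the function name `lookup_variants` and the vowel list are assumptions made for illustration, not part of klpt.

import re

# Kurmanji Latin vowels (assumed inventory for this sketch).
VOWELS = "aeêiîouû"
# "iy" directly followed by a vowel, i.e. a possible reduced "îy".
_IY_BEFORE_VOWEL = re.compile(rf"iy(?=[{VOWELS}])")

def lookup_variants(word):
    """Return the word itself plus variants with 'iy' restored to 'îy'."""
    variants = [word]
    for match in _IY_BEFORE_VOWEL.finditer(word):
        restored = word[:match.start()] + "îy" + word[match.end():]
        if restored not in variants:
            variants.append(restored)
    return variants

print(lookup_variants("dîplomasiyê"))  # ['dîplomasiyê', 'dîplomasîyê']

A dictionary lookup could then try each variant in turn and keep the first one that is found.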

morphological generation for Sorani

morphology for kurmanji

The apertium project has a morphological analyser for Kurmanji:
https://github.com/apertium/apertium-kmr

You could include it to get morphological analysis for Kurmanji. :)

It recognises 342870 forms and you can get a full form list using the lt-expand tool:

$ lt-expand apertium-kmr.kmr.dix | wc -l
342870

Excessive tokenization to be fixed

Some affixes do not need to be treated as separate tokens on their own. This is particularly the case for articles, such as "eke" and "êk", in both Sorani and Kurmanji.

No clear usage instruction

Hi Kaka @sinaahmadi, very well done for the great effort you have been making to fill this gap. One tiny piece of feedback: I have noticed that the instructions for running this tool in the README file are not up to date. E.g., pip3 needs to be used instead of pip, since you are already using python3.*.

Also, the usage instructions are not quite clear. It would be great if you could add some complete examples.

nashwan@Nashwan:~/Desktop/klpt/klpt$ import klpt dialect sorani
-bash: import: command not found

Stop words?

Hi Sina,

Thanks for the great package. We're implementing usage of it for analysis of scraped social media data from Iraq and other Kurdish speaking areas to enable local peace builders to better understand online popularization and how conflicts are being played out in digital public spheres to aid their peace building initiative design. https://gitlab.com/howtobuildup/phoenix/

I wanted to ask about the Sorani and Kurmanji stop words found at https://github.com/sinaahmadi/klpt/blob/master/klpt/data/stopwords.json
To confirm: they're not currently used within the package's functionality, right?
Re:

klpt/klpt/__init__.py

Line 19 in 9c517f8

# def remove_stopwords(self, text):

Would you see any issue with us using the stopwords.json directly from the package ourselves, after the preprocessing stage and before the tokenisation stage?

Thanks a lot for any of your time spent on considering this 🙂
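
For illustration, using such a file directly could look roughly like the sketch below. The JSON layout (dialect, then script, then a list of words), the sample words and the helper names are assumptions; the real stopwords.json may be organised differently.

import json

# Hypothetical excerpt; the actual file layout and contents may differ.
SAMPLE_JSON = '{"Kurmanji": {"Latin": ["û", "ji", "bo", "li", "ku"]}}'

def load_stopwords(raw_json, dialect, script):
    """Parse the JSON text and return the stop words for one dialect and script."""
    data = json.loads(raw_json)
    return set(data[dialect][script])

def remove_stopwords(words, stopwords):
    """Keep only the words that are not in the stop word set."""
    return [w for w in words if w.lower() not in stopwords]

stopwords = load_stopwords(SAMPLE_JSON, "Kurmanji", "Latin")
print(remove_stopwords(["ji", "bo", "fortê", "xwe", "avêtin"], stopwords))

In practice the JSON text would be read from the file shipped with the package instead of an inline string.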

in_separator in mwe_tokenize doesn't work properly

The argument in_separator seems to be overwritten at the end of the mwe_tokenize function. The argument is not accessible through word_tokenizer either.
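
The expected behaviour can be sketched with a stand-alone function that keeps the caller's separator until the end. This is not klpt's code: `join_mwes`, its greedy matching strategy and the sample lexicon are hypothetical and only show the argument being respected.

def join_mwes(tokens, mwe_lexicon, in_separator="-"):
    """Greedily merge known multi-word expressions, joining parts with in_separator."""
    merged = []
    longest = max((len(m) for m in mwe_lexicon), default=1)
    i = 0
    while i < len(tokens):
        # Try the longest possible expression first, down to two tokens.
        for size in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in mwe_lexicon:
                # in_separator is used as given and never reassigned.
                merged.append(in_separator.join(candidate))
                i += size
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged

lexicon = {("fortê", "xwe", "avêtin")}  # illustrative entry
print(join_mwes(["ji", "bo", "fortê", "xwe", "avêtin"], lexicon, in_separator="_"))

A wrapper such as a word tokenizer would need to accept the same keyword and pass it through for callers to be able to set it.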

Some words aren't analysed, although they are in Apertium

Output of Python analyser:

('dixwî', [[]])

Output of Apertium:

$ echo dixwî | apertium -d ~/source/apertium/languages/apertium-kmr/ kmr-morph
^dixwî/xwarin<vblex><tv><pri><p2><sg>$

I will look into this. Feel free to assign it to me.
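
To compare the two outputs automatically, the Apertium line can be parsed into the same kind of structure. The sketch below handles the basic stream format only (no escaped characters); `parse_stream` is a hypothetical helper, not part of klpt or Apertium.

import re

# A lexical unit looks like ^surface/reading1/reading2$
_LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")
_TAG = re.compile(r"<([^>]+)>")

def parse_stream(line):
    """Return a list of (surface, [(lemma, [tags])]) pairs found in the line."""
    units = []
    for surface, rest in _LEXICAL_UNIT.findall(line):
        analyses = []
        for reading in rest.split("/")[1:]:
            if reading.startswith("*"):
                continue  # a leading '*' marks an unknown word
            lemma = reading.split("<", 1)[0]
            analyses.append((lemma, _TAG.findall(reading)))
        units.append((surface, analyses))
    return units

print(parse_stream("^dixwî/xwarin<vblex><tv><pri><p2><sg>$"))

Surface forms whose analysis list is empty on the Python side but not on the Apertium side are the cases reported here.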

cyhunspell installation fails

Whatever I do, the cyhunspell installation fails.
I wonder if you have found a way to make it work.

Error when trying example

When trying one of the transliteration examples in the main project page,

from klpt.transliterate import Transliterate
transliterate = Transliterate("Kurmanji", "Latin", target_script="Arabic")
transliterate.transliterate("rojhilata navîn")

I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\klpt\transliterate.py", line 162, in transliterate
    tokens_dict = self.to_pieces(token)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\klpt\transliterate.py", line 131, in to_pieces
    tokens_dict[char_index-i] = tokens_dict[char_index-i] + token[char_index]
KeyError: 1

I am using Python 3.6.3 under Windows 10.

Thanks for your support!

Giuliano

morphological generation for Kurmanji

For Kurmanji we can just use the function apply() in the opposite direction. I'll take a look at it.