sinaahmadi / klpt
The Kurdish Language Processing Toolkit
Home Page: https://sinaahmadi.github.io/klpt/
License: Other
Note that there is a treebank for Kurmanji at https://github.com/UniversalDependencies/UD_Kurmanji-MG, which can be used for training a part-of-speech tagger and a dependency parser.
The Configuration module needs to be properly integrated into every function in all other modules. This is not the case for the Stem module, for instance.
Line 18 in d540c52
In Kurmanji, when a word ending in "î" is inflected with a form starting with "î", the resulting "îî" alternates to "iy" rather than "îy". This should be handled in the tokenization module to make sure that the correct word forms are looked up in the dictionary.
dîplomasiyê / dîplomasîyê
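One way to handle this at lookup time is to generate both spellings and try each against the dictionary. The following is only a minimal sketch and not part of klpt: the function name iy_variants and the vowel list are assumptions made for illustration.

```python
import re

# Kurmanji vowels in the Latin script (assumed to be sufficient for this rule).
VOWELS = "aeêiîouû"

def iy_variants(word):
    """Return the word plus variants with 'iy' and 'îy' swapped before a vowel."""
    variants = {word}
    # "iy" + vowel -> "îy" + vowel (restore the underlying long vowel)
    variants.add(re.sub(rf"iy(?=[{VOWELS}])", "îy", word))
    # "îy" + vowel -> "iy" + vowel (apply the alternation)
    variants.add(re.sub(rf"îy(?=[{VOWELS}])", "iy", word))
    return variants

candidates = iy_variants("dîplomasiyê")
```

Each candidate can then be checked against the lexicon, so that either spelling resolves to the same entry.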
The Apertium project has a morphological analyser for Kurmanji:
https://github.com/apertium/apertium-kmr
You could include it to get morphological analysis for Kurmanji. :)
It recognises 342870 forms, and you can get a full form list using the lt-expand tool:
$ lt-expand apertium-kmr.kmr.dix | wc -l
342870
Some of the affixes are not required to be considered separate tokens on their own. This is particularly the case of articles, such as "eke" and "êk" in both Sorani and Kurmanji.
Hi Kaka @sinaahmadi, very well done for the great effort you have been making to fill this gap. One tiny piece of feedback: I have noticed that the instructions for running this tool in the README file are not up to date, e.g. pip3 needs to be used instead of pip since you are already using python3.*.
Also, the usage instructions are not quite clear; it would be great if you could add some complete examples.
nashwan@Nashwan:~/Desktop/klpt/klpt$ import klpt dialect sorani
-bash: import: command not found
Hi Sina,
Thanks for the great package. We're implementing usage of it for analysis of scraped social media data from Iraq and other Kurdish-speaking areas, to enable local peace builders to better understand online popularization and how conflicts are being played out in digital public spheres, and so aid their peace-building initiative design. https://gitlab.com/howtobuildup/phoenix/
I wanted to ask about the Sorani and Kurmanji stop words found at https://github.com/sinaahmadi/klpt/blob/master/klpt/data/stopwords.json
To confirm: they're not currently used within the package's functionality, right?
Re:
Line 19 in 9c517f8
Would you see any issue with us using the stopwords.json directly from the package ourselves, at the post-preprocessing and pre-tokenisation stages?
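For illustration, using the list directly could look roughly like the sketch below. The JSON layout (dialect, then script, then a list of words), the sample words, and the helper names load_stopwords and remove_stopwords are all assumptions; the real stopwords.json may be organised differently.

```python
import json

# Illustrative stand-in for klpt/data/stopwords.json. The layout
# (dialect -> script -> list of words) is an assumption, not the confirmed format.
SAMPLE_JSON = '{"Kurmanji": {"Latin": ["û", "ji", "li", "bi"]}}'

def load_stopwords(raw_json, dialect, script):
    """Return the stop words for one dialect and script as a set."""
    data = json.loads(raw_json)
    return set(data[dialect][script])

def remove_stopwords(text, stopwords):
    """Drop stop words; a whitespace split stands in for real tokenisation."""
    return [tok for tok in text.split() if tok.lower() not in stopwords]

stopwords = load_stopwords(SAMPLE_JSON, "Kurmanji", "Latin")
filtered = remove_stopwords("ez ji bajêr û gund têm", stopwords)
```

In practice the file contents would be read from the installed package rather than from an inline string.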
Thanks a lot for any of your time spent on considering this 🙂
The argument in_separator seems to be overwritten at the end of the mwe_tokenize function. The argument is not accessible through word_tokenizer either.
Output of Python analyser:
('dixwî', [[]])
Output of Apertium:
$ echo dixwî | apertium -d ~/source/apertium/languages/apertium-kmr/ kmr-morph
^dixwî/xwarin<vblex><tv><pri><p2><sg>$
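To compare the two outputs programmatically, the Apertium line above can be split into surface form, lemma and tags. This is a simplified sketch of reading the Apertium stream format: parse_apertium is a hypothetical helper, and escaped characters and multi-part lemmas are not handled.

```python
import re

def parse_apertium(line):
    """Parse '^surface/lemma<tag>...$' units into (surface, [(lemma, tags)])."""
    results = []
    for unit in re.findall(r"\^(.*?)\$", line):
        surface, *analyses = unit.split("/")
        parsed = []
        for analysis in analyses:
            if analysis.startswith("*"):
                continue  # '*' marks an unknown word, so there is no analysis
            lemma = analysis.split("<", 1)[0]
            tags = re.findall(r"<([^<>]+)>", analysis)
            parsed.append((lemma, tags))
        results.append((surface, parsed))
    return results

result = parse_apertium("^dixwî/xwarin<vblex><tv><pri><p2><sg>$")
```

Here result holds one entry for "dixwî" with the lemma "xwarin" and its five tags, which is the analysis missing from the Python output.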
I will look into this. Feel free to assign it to me.
Whatever I do, cyhunspell fails.
I wonder if you found a way to work it out.
When trying one of the transliteration examples in the main project page,
from klpt.transliterate import Transliterate
transliterate = Transliterate("Kurmanji", "Latin", target_script="Arabic")
transliterate.transliterate("rojhilata navîn")
I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\klpt\transliterate.py", line 162, in transliterate
    tokens_dict = self.to_pieces(token)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\klpt\transliterate.py", line 131, in to_pieces
    tokens_dict[char_index-i] = tokens_dict[char_index-i] + token[char_index]
KeyError: 1
I am using Python 3.6.3 under Windows 10.
Thanks for support!
Giuliano
For Kurmanji we can just use the function apply() in the opposite direction. I'll take a look at it.