
klpt's People

Contributors

ftyers, mhmd-azeez, sinaahmadi

klpt's Issues

Tokenization of words ending in "iy" instead of "îy"

In Kurmanji, when a word ending in "î" is inflected with a form starting with "î", the resulting "îî" alternates to "iy" rather than "îy". The tokenization module should handle this alternation so that the correct word forms are looked up in the dictionary.

dîplomasiyê / dîplomasîyê
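
A minimal sketch of how a dictionary lookup could tolerate this alternation, assuming the lexicon stores only one of the two spellings. The names `candidate_forms` and `lookup`, the vowel list, and the one-word lexicon are hypothetical and not part of klpt.

```python
import re

# Kurmanji vowels in the Latin script; the swap is only tried when the
# glide "y" is followed by one of them (an assumption of this sketch).
VOWELS = "aeêiîouû"

def candidate_forms(word):
    """Return the word plus variants with "iy" and "îy" swapped before a vowel."""
    forms = [word]
    for src, dst in (("iy", "îy"), ("îy", "iy")):
        variant = re.sub(f"{src}(?=[{VOWELS}])", dst, word)
        if variant != word and variant not in forms:
            forms.append(variant)
    return forms

def lookup(word, lexicon):
    """Return the first candidate spelling found in the lexicon, or None."""
    for form in candidate_forms(word):
        if form in lexicon:
            return form
    return None

# Toy lexicon that only knows the "îy" spelling.
lexicon = {"dîplomasîyê"}
print(lookup("dîplomasiyê", lexicon))
```

Trying both spellings at lookup time keeps the dictionary itself unchanged; whether the tokenizer should instead normalise to one spelling is a separate design choice.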

Morphology for Kurmanji

The apertium project has a morphological analyser for Kurmanji:
https://github.com/apertium/apertium-kmr

You could include it to get morphological analysis for Kurmanji. :)

It recognises 342870 forms and you can get a full form list using the lt-expand tool:

$ lt-expand apertium-kmr.kmr.dix  | wc -l
342870
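
A rough sketch of how the expanded form list could be turned into a lookup table, assuming each line of `lt-expand` output has the shape `surface:analysis`, with `:>:` or `:<:` as the separator for direction-restricted entries. The function name `parse_expanded` and the sample line are illustrative only.

```python
from collections import defaultdict

def parse_expanded(lines):
    """Map each surface form to the list of analyses listed for it."""
    forms = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Check the direction-restricted separators before the bare colon.
        for sep in (":>:", ":<:", ":"):
            if sep in line:
                surface, analysis = line.split(sep, 1)
                forms[surface].append(analysis)
                break
    return dict(forms)

# Illustrative input; real input would be the lines printed by lt-expand.
sample = ["dixwî:xwarin<vblex><tv><pri><p2><sg>"]
table = parse_expanded(sample)
print(table["dixwî"])
```

In practice the lines could be read from a file or from standard input after piping the `lt-expand` output into the script.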

Excessive tokenization to be fixed

Some affixes do not need to be treated as separate tokens. This is particularly the case for articles, such as "eke" and "êk", in both Sorani and Kurmanji.

No clear usage instruction

Hi Kaka @sinaahmadi, very well done for the great effort you have been making to fill this gap. One tiny piece of feedback: I have noticed that the instructions for running this tool in the README file are not up to date, e.g. pip3 needs to be used instead of pip since you are already using python3.*.

Also, the usage instructions are not quite clear. It would be great if you could add some complete examples.

nashwan@Nashwan:~/Desktop/klpt/klpt$ import klpt dialect sorani
-bash: import: command not found
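
The error above comes from typing a Python statement at the bash prompt: `import` has to be run inside the Python interpreter (for example after starting `python3`) or from a script. A minimal sketch of a complete script, assuming the package was installed with `pip3 install klpt` and using the `Transliterate` class as it appears in the project's transliteration example; `transliterate_demo` is a hypothetical helper name.

```python
import importlib.util

def transliterate_demo(text):
    """Transliterate Kurmanji Latin-script text, or return None if klpt is missing."""
    if importlib.util.find_spec("klpt") is None:
        print("klpt is not installed; try: pip3 install klpt")
        return None
    # Imported here so the script still runs when klpt is absent.
    from klpt.transliterate import Transliterate
    transliterate = Transliterate("Kurmanji", "Latin", target_script="Arabic")
    return transliterate.transliterate(text)

if __name__ == "__main__":
    print(transliterate_demo("rojhilata navîn"))
```

Saved as, say, `demo.py`, this would be run with `python3 demo.py`.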

Stop words?

Hi Sina,

Thanks for the great package. We're using it to analyse scraped social media data from Iraq and other Kurdish-speaking areas, so that local peace builders can better understand online popularization and how conflicts are being played out in digital public spheres, and use that to design their peace building initiatives. https://gitlab.com/howtobuildup/phoenix/

I wanted to ask about the Sorani and Kurmanji stop words found at https://github.com/sinaahmadi/klpt/blob/master/klpt/data/stopwords.json
To confirm: they're not currently used within the package's functionality, right?
Re:

# def remove_stopwords(self, text):

Would you see any issue with us using stopwords.json directly from the package ourselves, after the preprocessing stage and before tokenisation?

Thanks a lot for any of your time spent on considering this 🙂
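
A minimal sketch of the kind of filtering asked about here, assuming the JSON file nests its lists as dialect, then script, then a list of words (this layout is an assumption, not checked against the file). The helpers `load_stopwords` and `remove_stopwords` are hypothetical, and the example stop word set is illustrative rather than taken from stopwords.json.

```python
import json

def load_stopwords(path, dialect, script):
    """Load one stop word list; the dialect -> script -> list nesting is assumed."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return set(data[dialect][script])

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stopwords]

# Illustrative set of common Kurmanji function words (not read from the file).
stopwords = {"û", "li", "ji", "bi"}
tokens = "ez û tu".split()  # naive whitespace split, for illustration only
print(remove_stopwords(tokens, stopwords))
```

Using a set keeps each membership test constant-time, which matters when filtering large volumes of scraped text.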

Some words aren't analysed, although they are in Apertium

Output of Python analyser:

('dixwî', [[]])

Output of Apertium:

$ echo dixwî | apertium -d ~/source/apertium/languages/apertium-kmr/ kmr-morph
^dixwî/xwarin<vblex><tv><pri><p2><sg>$

I will look into this. Feel free to assign it to me.
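
To compare the two analysers over a word list, the Apertium output can be parsed into the same shape as the Python analyser's result. A small sketch, assuming the stream format shown above (`^surface/analysis1/analysis2$`, with unknown words marked by a leading `*` in the analysis) and ignoring escaped characters; `parse_stream` and `is_unknown` are hypothetical helper names.

```python
import re

# One lexical unit: ^surface/analysis/analysis...$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)((?:/[^/$]*)*)\$")

def parse_stream(output):
    """Return a list of (surface, [analyses]) pairs from Apertium stream output."""
    units = []
    for surface, rest in LEXICAL_UNIT.findall(output):
        analyses = [a for a in rest.split("/") if a]
        units.append((surface, analyses))
    return units

def is_unknown(analyses):
    """True when there is no analysis or every analysis is marked with '*'."""
    return not analyses or all(a.startswith("*") for a in analyses)

units = parse_stream("^dixwî/xwarin<vblex><tv><pri><p2><sg>$")
for surface, analyses in units:
    print(surface, analyses, is_unknown(analyses))
```

Words for which Apertium returns an analysis while the Python analyser returns an empty list are the candidates to investigate.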

Error when trying example

When trying one of the transliteration examples on the main project page,

from klpt.transliterate import Transliterate
transliterate = Transliterate("Kurmanji", "Latin", target_script="Arabic")
transliterate.transliterate("rojhilata navîn")

I get the following error:

Traceback (most recent call last):
  File "", line 1, in
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\klpt\transliterate.py", line 162, in transliterate
    tokens_dict = self.to_pieces(token)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\klpt\transliterate.py", line 131, in to_pieces
    tokens_dict[char_index-i] = tokens_dict[char_index-i] + token[char_index]
KeyError: 1

I am using Python 3.6.3 under Windows 10.

Thanks for support!

Giuliano
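
The last frame of the traceback reads a dictionary entry (`tokens_dict[char_index-i]`) before that key exists, which is what raises `KeyError: 1`. A generic illustration of that failure pattern with a plain dictionary follows. It is not the library's code, and whether defaulting to an empty string is the right fix inside `to_pieces` is not known from the report; `append_piece` and `append_piece_safe` are hypothetical.

```python
def append_piece(pieces, index, char):
    """Mirror the failing pattern: reading a missing key raises KeyError."""
    pieces[index] = pieces[index] + char

def append_piece_safe(pieces, index, char):
    """Same update, but dict.get supplies "" when the key is absent."""
    pieces[index] = pieces.get(index, "") + char

pieces = {}
try:
    append_piece(pieces, 1, "a")
except KeyError as error:
    print("KeyError:", error)

append_piece_safe(pieces, 1, "a")
print(pieces)
```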
