turkish-pos-tagger's Introduction

Part-of-Speech (POS) Tagger for Turkish

Build and Run

  • Python and NLTK are required to build and run this project.
  • Python (works with 2.7.11)
  • NLTK (works with 3.2.5)
  • The module is named pos_tagger; for a given sentence, tag(sentence) returns a list of (word, tag) pairs.
  • The system is trained on the development file provided in the CENG463 course, which contains 5110 sentences. The dataset originally belongs to the Turkish UD treebank.
  • For training:
python training_tagger.py
  • Example:
>>> from pos_tagger import tag
>>> tag('Bunu başından beri biliyordum zaten .')
[('Bunu', 'Pron'), ('başından', 'Noun_Abl'), ('beri', 'Postp'), ('biliyordum', 'Verb'), ('zaten', 'Adv'), ('.', 'Punc')]
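
Note that the example sentence passes the final period as a separate, space-delimited token. A reported output further down this page shows 'istiyorum.' kept as a single token, which suggests that the tagger splits on whitespace only. The following is a minimal sketch of a pre-processing helper under that assumption; separate_final_punct is a hypothetical name and is not part of the project.

```python
import re


def separate_final_punct(sentence):
    """Put a space before sentence-final punctuation: 'geldi.' -> 'geldi .'

    Only the end of the sentence is touched, so tokens such as the
    ordinal '5.' in the middle of a sentence are left unchanged.
    """
    return re.sub(r'\s*([.!?]+)\s*$', r' \1', sentence.strip()).strip()


print(separate_final_punct('Ali koştu.'))
```

The result could then be passed to tag(). Proper tokenization of Turkish text needs more than this, so treat it only as a starting point.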

Build and Run Using Docker

docker build -t tagger .
docker run -it tagger python

Once the Python shell opens:

>>> from pos_tagger import tag
>>> tag('Bunu başından beri biliyordum zaten .')
[('Bunu', 'Pron'), ('başından', 'Noun_Abl'), ('beri', 'Postp'), ('biliyordum', 'Verb'), ('zaten', 'Adv'), ('.', 'Punc')]

Implementation Idea

  • This part-of-speech tagger application implements a transformation-based POS system. In this approach, the tagger uses rules to specify which tags are possible for words, and supervised learning to examine possible transformations, improvements and re-tagging.
  • Using NLTK functions, the tagged corpus provided in development.sdx is read for training and validation. This set is then randomly divided into a training part (85%) and a development part (15%).
  • As the transformation-based tagger, NLTK's Brill tagger is used with a maximum of 300 rules and a minimum score of 3. The Brill tagger needs a general tagging method as its first stage, and a trigram tagger is used for that purpose. The back-off stages of this trigram tagger are provided on the next page. Since sufficient information could not be found about rule templates for the Brill tagger, the default templates given in the demo code are used directly.
  • The tagger is trained with k-fold cross validation and its performance is tracked, as explained in the next section. The flow of operations is shown as a diagram on the following page.
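
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's training_tagger.py. The exact back-off chain (trigram, bigram, unigram, default), the 'Noun_Nom' default tag, the brill24() template set and the toy corpus are all assumptions made for this example; only the 85/15 split, max_rules=300 and min_score=3 come from the description.

```python
import random

from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer


def split_corpus(tagged_sents, train_ratio=0.85, seed=0):
    """Shuffle the tagged sentences and split them into train/dev parts."""
    sents = list(tagged_sents)
    random.Random(seed).shuffle(sents)
    cut = int(len(sents) * train_ratio)
    return sents[:cut], sents[cut:]


def train_brill(train_sents, max_rules=300, min_score=3):
    """Train a Brill tagger on top of a trigram tagger with back-off."""
    # Assumed back-off chain: trigram -> bigram -> unigram -> default tag.
    default = DefaultTagger('Noun_Nom')
    unigram = UnigramTagger(train_sents, backoff=default)
    bigram = BigramTagger(train_sents, backoff=unigram)
    trigram = TrigramTagger(train_sents, backoff=bigram)
    # brill24() is one of NLTK's built-in template sets; the templates
    # actually used by the project may differ.
    trainer = BrillTaggerTrainer(trigram, brill24(), trace=0)
    return trainer.train(train_sents, max_rules=max_rules, min_score=min_score)


# Toy corpus for illustration only; the real input is development.sdx.
toy_sents = [
    [('Ali', 'Noun_Nom'), ('koş', 'Verb'), ('!', 'Punc')],
    [('Ali', 'Noun_Nom'), ('geldi', 'Verb'), ('.', 'Punc')],
] * 10

train_sents, dev_sents = split_corpus(toy_sents)
tagger = train_brill(train_sents)
print(tagger.tag(['Ali', 'koş', '!']))
```

On a corpus this small the trainer is unlikely to learn any transformation rules, so the output is effectively that of the baseline tagger. The point is only to show how the pieces fit together.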

Results

  • The accuracy of this method is tracked at each fold in order to avoid overfitting. Since the tagger will be used on unseen sentences, we should avoid producing a model that overfits our development set.
  • For 10-fold validation, the accuracy of the model is plotted below:
    [Figure: model accuracy per fold for 10-fold validation]
  • Considering the plot above, the tagger is evaluated at most 6 times by folding the training and evaluation sets. The final tagger is expected to reach about 95% accuracy on the development set, as can be seen from the figure. This final tagger is saved to the my_tagger.yaml file by the training_tagger module and is loaded when the pos_tagger module is imported.
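
To make the fold-by-fold tracking concrete, here is a minimal pure-Python sketch of k-fold splitting and token-level accuracy. The function names and the interleaved fold assignment are assumptions for illustration; the source does not say how the project forms its folds.

```python
def k_fold_splits(items, k=10):
    """Yield (train, held_out) pairs, using each fold once for evaluation."""
    items = list(items)
    for i in range(k):
        held_out = items[i::k]  # every k-th item, starting at offset i
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, held_out


def accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag.

    tag_fn takes a list of words and returns a list of (word, tag) pairs.
    """
    correct = 0
    total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        for (_, gold), (_, predicted) in zip(sent, tag_fn(words)):
            correct += int(gold == predicted)
            total += 1
    return float(correct) / total if total else 0.0


folds = list(k_fold_splits(range(20), k=10))
print(len(folds), [len(held_out) for _, held_out in folds])
```

In a training loop, one would train a tagger on each train part and call accuracy(tagger.tag, held_out), then watch how the values change from fold to fold.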

Future Work

  • The main problems with this model are the following:
    • Given the relatively small development set, there is a high probability of overfitting. The accuracy of this model on unseen data can therefore vary widely.
    • Although the tagger is implemented, no good source of information about Brill rule templates could be found. Better rule templates may therefore exist, and unnecessary ones could be eliminated.
    • Since Turkish is an agglutinative language, rule-based methods could be used as the back-off or base stage of the Brill tagger, because adding another affix can mislead the tagger, as in the following:

Function Call          Tags
tag('Ali koş !')       [('Ali', 'Noun_Nom'), ('koş', 'Verb'), ('!', 'Punc')]
tag('Ali koştu .')     [('Ali', 'Noun_Nom'), ('koştu', 'Noun_Nom'), ('.', 'Punc')]
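
As a rough illustration of the rule-based back-off idea, the sketch below guesses a tag from a word's suffix. The rule list is hypothetical and deliberately tiny: it covers only the past-tense endings (-dı/-di/-du/-dü/-tı/-ti/-tu/-tü) and the present continuous -yor with a few personal endings. It is not part of the project.

```python
import re

# Hypothetical suffix rules: (compiled pattern, tag). The past-tense
# endings also occur on nouns (e.g. "kutu"), so these rules over-generate
# and would need refinement before real use.
SUFFIX_RULES = [
    (re.compile(r'.+[dt][ıiuü]$'), 'Verb'),
    (re.compile(r'.+yor(um|sun|uz|sunuz|lar)?$'), 'Verb'),
]


def guess_tag(word, default='Noun_Nom'):
    """Return the tag of the first matching suffix rule, else the default."""
    for pattern, tag in SUFFIX_RULES:
        if pattern.search(word):
            return tag
    return default


for word in ['Ali', 'koştu', 'istiyorum']:
    print(word, guess_tag(word))
```

With NLTK, similar patterns could be given to nltk.tag.RegexpTagger and placed in the back-off chain below the unigram tagger, so that unknown words receive a suffix-based guess instead of a single default tag.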

Conclusion

  • To conclude, when the model is evaluated on random parts of the development set, the accuracy is as follows:

# of Trials   Minimum   Maximum   Average   Std. Dev.
49            91%       96%       95%       1%

  • As mentioned in the Results part, this level of accuracy was expected on random parts of the development set. With the improvements mentioned above, the model could reach a higher accuracy, especially when tagging unseen sentences.

turkish-pos-tagger's People

Contributors

onuryilmaz, rorimac, furkanakkurt1335

turkish-pos-tagger's Issues

yaml.load error

I wanted to test your work, but I get the following error while reading the yaml file.
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\Semiha\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Users\Semiha\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "E:/Python Scripts/turkish-pos-tagger/example.py", line 8, in
from pos_tagger import tr_tag
File "pos_tagger.py", line 22, in
myTagger=yaml.load(f)
File "C:\Users\Semiha\Anaconda2\lib\site-packages\yaml__init__.py", line 71, in load
return loader.get_single_data()
File "C:\Users\Semiha\Anaconda2\lib\site-packages\yaml\constructor.py", line 39, in get_single_data
return self.construct_document(node)
File "C:\Users\Semiha\Anaconda2\lib\site-packages\yaml\constructor.py", line 43, in construct_document
data = self.construct_object(node)
File "C:\Users\Semiha\Anaconda2\lib\site-packages\yaml\constructor.py", line 88, in construct_object
data = constructor(self, node)
File "C:\Users\Semiha\Anaconda2\lib\site-packages\yaml\constructor.py", line 414, in construct_undefined
node.start_mark)
yaml.constructor.ConstructorError: could not determine a constructor for the tag '!nltk.BrillTagger'
in "", line 1, column 1:
!nltk.BrillTagger
^

I changed load to load_all, but this time I get an error at the "return_list.append(myTagger.tag(token))" line:

Traceback (most recent call last):
File "", line 1, in
File "C:\Users\Semiha\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Users\Semiha\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "E:/Python Scripts/turkish-pos-tagger/example.py", line 9, in
print tr_tag('Bunu başından beri biliyordum zaten .')
File "pos_tagger.py", line 31, in tr_tag
return_list.append(myTagger.tag(token))
AttributeError: 'generator' object has no attribute 'tag'

It says there is no attribute named tag. How can I solve this?

Best regards.

Constructor error

Importing the pos_tagger module throws an error when it tries to access the pre-trained data in the yaml file:

``---------------------------------------------------------------------------
ConstructorError Traceback (most recent call last)
in ()
----> 1 import pos_tagger

/home/kuhn/repos/github/turkish-pos-tagger/pos_tagger.py in ()
20 # Open the file where tagger is saved
21 f=open('my_tagger.yaml')
---> 22 myTagger=yaml.load(f)
23
24 # Tagger function

/home/kuhn/Applications/anaconda2/lib/python2.7/site-packages/yaml/init.pyc in load(stream, Loader)
69 loader = Loader(stream)
70 try:
---> 71 return loader.get_single_data()
72 finally:
73 loader.dispose()

/home/kuhn/Applications/anaconda2/lib/python2.7/site-packages/yaml/constructor.pyc in get_single_data(self)
37 node = self.get_single_node()
38 if node is not None:
---> 39 return self.construct_document(node)
40 return None
41

/home/kuhn/Applications/anaconda2/lib/python2.7/site-packages/yaml/constructor.pyc in construct_document(self, node)
41
42 def construct_document(self, node):
---> 43 data = self.construct_object(node)
44 while self.state_generators:
45 state_generators = self.state_generators

/home/kuhn/Applications/anaconda2/lib/python2.7/site-packages/yaml/constructor.pyc in construct_object(self, node, deep)
86 constructor = self.class.construct_mapping
87 if tag_suffix is None:
---> 88 data = constructor(self, node)
89 else:
90 data = constructor(self, tag_suffix, node)

/home/kuhn/Applications/anaconda2/lib/python2.7/site-packages/yaml/constructor.pyc in construct_undefined(self, node)
412 raise ConstructorError(None, None,
413 "could not determine a constructor for the tag %r" % node.tag.encode('utf-8'),
--> 414 node.start_mark)
415
416 SafeConstructor.add_constructor(

ConstructorError: could not determine a constructor for the tag '!nltk.BrillTagger'
in "my_tagger.yaml", line 1, column 1
``

Docker Build Error

I encountered the following error when following the installation guide in the README file:

$ sudo docker build -t tagger .
Sending build context to Docker daemon  3.316MB
Step 1/6 : FROM python:2.7.11
 ---> a047e3d0ae2b
Step 2/6 : COPY . /turkish-pos-tagger
 ---> Using cache
 ---> 9ad831634ab3
Step 3/6 : WORKDIR /turkish-pos-tagger
 ---> Using cache
 ---> 2512070eff47
Step 4/6 : RUN pip install pyyaml
 ---> Using cache
 ---> 81bf848eed24
Step 5/6 : RUN pip install -U nltk
 ---> Running in 962d7405c371
Collecting nltk
  Downloading https://files.pythonhosted.org/packages/92/75/ce35194d8e3022203cca0d2f896dbb88689f9b3fce8e9f9cff942913519d/nltk-3.5.zip (1.4MB)
Collecting click (from nltk)
  Downloading https://files.pythonhosted.org/packages/d2/3d/fa76db83bf75c4f8d338c2fd15c8d33fdd7ad23a9b5e57eb6c5de26b430e/click-7.1.2-py2.py3-none-any.whl (82kB)
Collecting joblib (from nltk)
  Downloading https://files.pythonhosted.org/packages/ef/e9/80bdaef3848e8aa5e518f516bdfb79cc5d81c57ead939c541c20eecd7633/joblib-0.17.0.tar.gz (1.7MB)
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-OUiUtO/joblib/setup.py", line 6, in <module>
        import joblib
      File "/tmp/pip-build-OUiUtO/joblib/joblib/__init__.py", line 113, in <module>
        from .memory import Memory, MemorizedResult, register_store_backend
      File "/tmp/pip-build-OUiUtO/joblib/joblib/memory.py", line 274
        raise new_exc from exc
                         ^
    SyntaxError: invalid syntax
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-OUiUtO/joblib/
You are using pip version 8.1.2, however version 20.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The command '/bin/sh -c pip install -U nltk' returned a non-zero code: 1

How to update the code to python 3

It seems like this code was designed for an older version of python. How am I supposed to "translate" this into python 3+?

Traceback (most recent call last):
File "/home/mica/Downloads/turkish-pos-tagger-master/test.py", line 1, in
from pos_tagger import tag
File "/home/mica/Downloads/turkish-pos-tagger-master/pos_tagger.py", line 22, in
myTagger=yaml.load(f)
File "/usr/lib/python3/dist-packages/yaml/init.py", line 72, in load
return loader.get_single_data()
File "/usr/lib/python3/dist-packages/yaml/constructor.py", line 37, in get_single_data
return self.construct_document(node)
File "/usr/lib/python3/dist-packages/yaml/constructor.py", line 41, in construct_document
data = self.construct_object(node)
File "/usr/lib/python3/dist-packages/yaml/constructor.py", line 86, in construct_object
data = constructor(self, node)
File "/usr/lib/python3/dist-packages/yaml/constructor.py", line 414, in construct_undefined
node.start_mark)
yaml.constructor.ConstructorError: could not determine a constructor for the tag '!nltk.BrillTagger'
in "my_tagger.yaml", line 1, column 1

I get this when running the POS tagger.

Wrong Tagging for words

I have tried the program. However, when I tried a simple sentence such as this:

tag("Sezonunun 5. Bölümü hakkında konuşmak istiyorum.")
[('Sezonunun', 'Noun_Nom'), ('5.', 'Noun_Nom'), ('B\xc3\xb6l\xc3\xbcm\xc3\xbc', 'Noun_Nom'), ('hakk\xc4\xb1nda', 'Noun_Nom'), ('konu\xc5\x9fmak', 'Noun_Nom'), ('istiyorum.', 'Noun_Nom')]

The program recognizes the verbs as nouns. How come?

Do I have to train the program somehow? How am I supposed to do that?

Also, the program only seems to work with the Dockerfile, so I can only run it from a terminal. How can I use it from other Python files?

Is the dataset still available?

Hello there,
The question is in the title. I tried to find the dataset you used to train your model but couldn't find it. May I ask if it is still available? If it is, could you give me a link to download it?

Edit: OK, my bad, I should have looked a bit more.
