turkish-deasciifier's Introduction

turkish-deasciifier: Turkish deasciifier

This is a deasciifier Python library and command line utility for Turkish that solves the problem of diacritics restoration (also known as diacritics reconstruction). It takes a Turkish string containing only ASCII characters (that is, without proper diacritics) and replaces the relevant characters with their corresponding Turkish letters.

The web-based, online version of this system is available at:

http://turkceyap.appspot.com/

Keep in mind that diacritics restoration (deasciification) for Turkish doesn't work 100% of the time; it is an active research topic! Still, this library is good enough for many practical purposes, and it has served many people and projects over the last 10 years.

This system is based on the turkish-mode for GNU Emacs by Prof. Deniz Yüret.

Table of Contents

  1. Installation
  2. Example Python Library Usage
  3. Example CLI (Command Line Interface) Usage
  4. Other Programming Languages and Systems
  5. Advanced Research

Installation

Python 3

For now, the recommended way to install is to use pip and install directly from the project's GitHub repository:

pip install git+https://github.com/emres/turkish-deasciifier.git

Python 2

Keep in mind that switching to Python 3 is strongly recommended! If you insist on using Python 2.x, you can install using the following command:

pip install Turkish-Deasciifier

Example Python Library Usage

Python 3

from turkish.deasciifier import Deasciifier

my_ascii_turkish_txt = "Opusmegi cagristiran catirtilar."
deasciifier = Deasciifier(my_ascii_turkish_txt)
my_deasciified_turkish_txt = deasciifier.convert_to_turkish()
print(my_deasciified_turkish_txt)

Python 2

Keep in mind that switching to Python 3 is strongly recommended! If you insist on using Python 2.x, you can use the library in the following manner:

from turkish.deasciifier import Deasciifier

my_ascii_turkish_txt = "Opusmegi cagristiran catirtilar."
deasciifier = Deasciifier(my_ascii_turkish_txt.decode("utf-8"))
my_deasciified_turkish_txt = deasciifier.convert_to_turkish()
print my_deasciified_turkish_txt.encode("utf-8")

Example CLI (Command Line Interface) Usage

Python 3

Example tested in a Bash shell:

$ echo "Opusmegi cagristiran catirtilar." | turkish-deasciify
$ cat somefile.txt | turkish-deasciify

Python 2

Keep in mind that switching to Python 3 is strongly recommended!

Example tested in a Bash shell:

$ echo "Opusmegi cagristiran catirtilar." | turkish-deasciify-python2
$ cat somefile.txt | turkish-deasciify-python2

Other Programming Languages and Systems

Advanced Research

For recent advanced scientific research articles, please see the following:

turkish-deasciifier's People

Contributors

emres, faraday, roktas

turkish-deasciifier's Issues

Applying to rows of a CSV file

Hello Mr. Emre,

My data is in CSV format. I wrote the following code on Google Colab:

from turkish.deasciifier import Deasciifier

import csv 

duzelt = []

with open('/GDrive/My Drive/API-satir/merge1k.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        my_ascii_turkish_txt = (row)
        deasciifier = Deasciifier(my_ascii_turkish_txt)
        my_deasciified_turkish_txt = deasciifier.convert_to_turkish()
        duzelt.append(my_deasciified_turkish_txt)
        print(my_deasciified_turkish_txt) 

However, when I run it I get the following error.

def set_char_at(self, mystr, pos, c):
    return mystr[0:pos] + c + mystr[pos+1:]
def convert_to_turkish(self):

TypeError: can only concatenate list (not "str") to list

How can I get around this problem? I would be very grateful for your help.
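
The TypeError in this report is consistent with a whole row being passed to the library: csv.reader yields each row as a list of strings, while the README shows Deasciifier taking a single string. A minimal sketch of one way around it is to convert cell by cell. The names deasciify_rows and library_convert are hypothetical helpers, not part of the library, and the UTF-8 encoding in the usage comment is an assumption about the file.

```python
import csv

def deasciify_rows(csv_file, convert):
    """Yield each CSV row with `convert` applied to every cell."""
    for row in csv.reader(csv_file, delimiter=","):
        # csv.reader gives a list of strings, so convert one cell at a time
        yield [convert(cell) for cell in row]

def library_convert(text):
    """Wrap the README's usage pattern (assumes the package is installed)."""
    # Imported here so the sketch loads even without the package.
    from turkish.deasciifier import Deasciifier
    return Deasciifier(text).convert_to_turkish()

# Possible usage with the file from the report (encoding is an assumption):
# with open("/GDrive/My Drive/API-satir/merge1k.csv", newline="", encoding="utf-8") as f:
#     duzelt = list(deasciify_rows(f, library_convert))
```

Passing the converter as an argument keeps the row handling separate from the library call, so the loop can be checked with any str-to-str function.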

A few additions

Hello,

I would like to make a few small additions:

daha atık davranmaya
alana sigacak şekilde
perçeption (an English word, but it looked odd in this form)

I assume you do not want to update the array by hand; I just wanted this to be on record.
Anyone who wants to use these can make their own changes.

Nice work, thank you.

pip install

Hello.

When I run pip install git+https://github.com/emres/turkish-deasciifier.git, I get the following error:

ERROR: Complete output from command python setup.py egg_info:
    ERROR: Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\user\AppData\Local\Temp\pip-req-build-nl97thly\setup.py", line 62
        except OSError, e:
                      ^
    SyntaxError: invalid syntax

I think there is a problem in the setup file.
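
The caret in the traceback points at except OSError, e:, which is Python 2 syntax; Python 3 accepts only the "as" form. What line 62 of setup.py actually guards is not shown in the report, so the sketch below uses a hypothetical remove_if_present helper purely to show the Python 3 spelling.

```python
import os

def remove_if_present(path):
    """Try to delete `path`; return True on success, False on an OS error."""
    try:
        os.remove(path)
    except OSError as e:  # Python 2 spelling was: except OSError, e:
        print("could not remove %s: %s" % (path, e))
        return False
    return True
```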

encode-decode error

$ echo "Opusmegi cagristiran catirtilar." | turkish-deasciify
Traceback (most recent call last):
  File "/usr/bin/turkish-deasciify", line 26, in <module>
    d.deasciify()
  File "/usr/bin/turkish-deasciify", line 22, in deasciify
    sys.stdout.write(result.encode("utf-8"))
TypeError: write() argument must be str, not bytes

Deleting .decode("utf-8") and .encode("utf-8") in /usr/bin/turkish-deasciify solves the issue.
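
The traceback reflects a Python 3 rule: a text stream such as sys.stdout accepts str, not bytes, so the Python 2 habit of calling .encode("utf-8") before writing fails. A minimal sketch of a filter written under that rule follows. The name deasciify_stream is hypothetical, and convert stands in for whatever conversion the installed script performs.

```python
def deasciify_stream(in_stream, out_stream, convert):
    """Read all text from in_stream, convert it, and write str to out_stream."""
    text = in_stream.read()          # already str when the stream is in text mode
    out_stream.write(convert(text))  # write str directly, with no .encode() call

# Possible usage as a stdin-to-stdout filter (assumes the package is installed):
# import sys
# from turkish.deasciifier import Deasciifier
# deasciify_stream(sys.stdin, sys.stdout,
#                  lambda s: Deasciifier(s).convert_to_turkish())
```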

Slow performance

Hardware

MacBookPro13,3
Quad-Core Intel Core i7, 2.7 GHz
Memory: 16 GB

Benchmark results

Word Count   Character Count   Time (seconds)
10000        82236             5.3
20000        176226            23.1
40000        376746            94.3
80000        804532            438.6
100000       1025479           819.4

Summary

At this rate, converting a 1000-page book would take about 3 hours on average.
Converting a large, old ASCII website SQL database would take weeks.

So a progress bar and optimization are needed. Fast text-processing libraries could be used.
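
In the benchmark figures, each doubling of the input multiplies the running time by roughly four (5.3 s to 23.1 s, then 23.1 s to 94.3 s), which is what quadratic growth in input length would look like. If that reading is right, feeding the converter short pieces instead of one large string should keep the total closer to linear. A minimal sketch, assuming line-sized pieces are acceptable: deasciify_by_line and on_progress are hypothetical names, convert is any str-to-str function, and results near line boundaries may differ from a whole-text run if the converter looks at context across lines.

```python
def deasciify_by_line(text, convert, on_progress=None):
    """Convert `text` one line at a time, optionally reporting progress."""
    # keepends=True preserves the original line endings in the output
    lines = text.splitlines(keepends=True)
    converted = []
    for done, line in enumerate(lines, start=1):
        converted.append(convert(line))
        if on_progress is not None:
            on_progress(done, len(lines))  # e.g. update a progress bar
    return "".join(converted)
```

The on_progress callback is one simple way to hook in the progress bar requested in the summary without tying the sketch to a particular display library.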

Problematic Words

I compiled some of the problematic words; they could be fixed if they were defined in the turkish_pattern_table variable. The possible usages need to be taught.
Problematic words

  • Acar - Açar
  • Asık - Aşık
  • Oldu - Öldü
  • Sık - Sik - Şık
  • Tas - Taş
  • Su - Şu
  • Surat - Sürat
  • Koy - Köy
  • Turunçgiller
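
Each pair in the word list collapses to the same ASCII spelling, which is why a converter cannot pick the right form without context. The sketch below makes this visible by mapping Turkish letters back to ASCII and grouping some of the listed words. The names to_ascii and group_by_ascii are illustrative, not library functions.

```python
from collections import defaultdict

# Turkish-specific letters and their ASCII fallbacks, lower and upper case.
_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")

def to_ascii(word):
    """Replace Turkish-specific letters with their ASCII counterparts."""
    return word.translate(_TO_ASCII)

def group_by_ascii(words):
    """Group words that share the same ASCII spelling."""
    groups = defaultdict(list)
    for word in words:
        groups[to_ascii(word)].append(word)
    return dict(groups)

words = ["Acar", "Açar", "Asık", "Aşık", "Oldu", "Öldü", "Sık", "Sik", "Şık"]
for ascii_form, variants in group_by_ascii(words).items():
    print(ascii_form, "<-", ", ".join(variants))
```

Any group with more than one member is a case where the ASCII input alone does not determine the Turkish word.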

Let's use them in sentences

ASCII input                       Incorrect deasciifier output
COK SIKSINIZ                      ÇOK SIKSINIZ
ASIK VEYSEL ASIK SURATLI MIYDI?   AŞIK VEYSEL AŞIK SÜRATLİ MİYDİ?
AL KIRDIN SIKTIN BIRAKTIN!        AL KIRDİN SIKTIN BIRAKTIN!
YEMEGI TASA KOY GETIR             YEMEĞİ TAŞA KÖY GETİR
TURUNCGILLER                      TURUNÇĞİLLER
COK ACAR BIRI                     ÇOK AÇAR BİRİ
