linuxscout / pyarabic

pyarabic

License: GNU General Public License v3.0

Python 97.30% Makefile 0.11% TeX 2.59%

nlp-library arabic-language text-processing

pyarabic's Introduction

PyArabic

PyArabic is an Arabic-specific library for Python. It provides basic functions for manipulating Arabic letters and text, such as detecting Arabic letters, identifying letter groups and their characteristics, and removing diacritics.

It is a programming library for the Arabic language written in Python. It offers functions for handling letters and text, for example identifying the type of a letter, removing harakat, and comparing tashkeel.

Developer: Taha Zerrouki, http://tahadz.com, taha dot zerrouki at gmail dot com

Features	value
Authors	Taha Zerrouki: http://tahadz.com, taha dot zerrouki at gmail dot com
Release	0.6.12
License	GPLv3
Tracker	linuxscout/pyarabic/Issues
Website	https://pypi.python.org/pypi/pyarabic
Doc	Package documentation
Source	Github
Download	pypi.python.org
Feedback	Comments
Accounts	@Twitter @Sourceforge

Citation

Zerrouki, T., (2023). PyArabic: A Python package for Arabic text. Journal of Open Source Software, 8(84), 4886, https://doi.org/10.21105/joss.04886

T. Zerrouki, Pyarabic, An Arabic language library for Python, https://pypi.python.org/pypi/pyarabic/, 2010

Or, in BibTeX format:

```bibtex
@article{Zerrouki2023,
	title        = {PyArabic: A Python package for Arabic text},
	author       = {Taha Zerrouki},
	year         = 2023,
	journal      = {Journal of Open Source Software},
	publisher    = {The Open Journal},
	volume       = 8,
	number       = 84,
	pages        = 4886,
	doi          = {10.21105/joss.04886},
	url          = {https://doi.org/10.21105/joss.04886}
}

@misc{zerrouki2012pyarabic,
  title={pyarabic, An Arabic language library for Python},
  author={Zerrouki, Taha},
  url={https://pypi.python.org/pypi/pyarabic},
  year={2010}
}
```

Features

Arabic letter classification
Text tokenization into words or sentences
Strip harakat (all harakat, all except shadda, shadda only, tatweel, last haraka)
Separate and join letters and harakat
Reduce tashkeel
Measure tashkeel similarity between two words (harakat, partially or fully vocalized, similarity with a template)
Letter normalization (unify ligatures such as lam-alef, and hamza forms)
Numbers to words
Extract numerical phrases from text
Pre-vocalization of numerical phrases
Unshaping (reversing) Arabic text for systems that do not support letter joining

Applications

Arabic text processing

Installation

pip install pyarabic

Usage

import pyarabic.araby as araby
import pyarabic.number as number

Package Documentation

https://pyarabic.readthedocs.io/

Files

file/directory: description
araby.py: Arabic routines.
named.py: named entity recognition.
unshape.py: unshaping Arabic text.

Description

PyArabic, a Python library for Arabic, gathers the features and functions that a programmer needs when working with Arabic text. It is inspired by the Arabic PHP library by our friend Khaled Al-Sham'aa, which aims to provide an open-source implementation of many Arabic text functions for use in web publishing.

Defining an Arabic string

The best way to handle Arabic text in Python is to use Unicode, which Python supports natively, with no need for external libraries or special functions. This was perhaps the main reason I chose Python: it is enough to prefix a string with the letter u to let Python spare you the trouble of thinking about text encoding, and it handles the text transparently. In Python 3 every str is already Unicode, so the u prefix is optional.

Defining an Arabic string in Unicode

text = u'الإسلام ديننا'

Choosing the encoding of the source file

#!/usr/bin/env python
# -*- coding: utf-8 -*-

Printing Arabic text to the output

print(text)  # Python 3; on Python 2 use: print text.encode('utf8')

The pyarabic library groups its many functions into modules:

araby.py: constants such as the letters, their names, and their groups, plus general functions such as stripping harakat, stripping tatweel, comparing tashkeel between words, and adjusting punctuation.
number.py (numbers module): functions for converting numbers to words and words to numbers, detecting number words in text, and vocalizing them.
named.py (named entities module): functions for detecting names and named entities in text.

The general-purpose module araby

It can be imported with:

import pyarabic.araby as araby

We use the alias araby from here on.

General constants in the araby module

These include the Arabic letters, their various groups, and some patterns used later by various functions.

1- The basic Arabic characters, with Latin names for use in code

The Arabic character constants cover all Arabic letters, a subset of Unicode:

COMMA            = u'\u060C'
SEMICOLON        = u'\u061B'
QUESTION         = u'\u061F'
HAMZA            = u'\u0621'
ALEF_MADDA       = u'\u0622'
ALEF_HAMZA_ABOVE = u'\u0623'

More in the araby.py file.

The Arabic character set includes the basic letters, the harakat, the digits, punctuation marks, and some special characters such as the dagger alef and the small yeh, as well as the lam-alef ligatures in their various forms.

Character groups:

Characters can be divided into groups and classes that are used later in the various functions.

Name	Description	Members
Letters	Arabic letters without harakat	LETTERS = u'ابتةثجحخدذرزسشصضطظعغفقكلمنهويءآأؤإئ'
Tashkeel	Harakat, shadda included	TASHKEEL = (FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA, SUKUN, SHADDA)
Harakat	Harakat, shadda excluded	HARAKAT = (FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA, SUKUN)
Short harakat	Short harakat without tanwin	SHORTHARAKAT = (FATHA, DAMMA, KASRA, SUKUN)
Tanwin	Tanwin marks	TANWIN = (FATHATAN, DAMMATAN, KASRATAN)
Ligatures	Lam-alef in its various forms	LIGUATURES = (u'ﻻ', u'ﻷ', u'ﻹ', u'ﻵ')
Hamzat	Hamza in its various forms	HAMZAT = (u'ء', u'ؤ', u'ئ', u'ٔ', u'ٕ', u'إ', u'أ')
Alefat	Alef in its various forms	ALEFAT = (u'ا', u'آ', u'أ', u'إ', u'ٱ', u'ى', u'ٰ')
Weak letters	Yeh, waw, and alef	WEAK = (u'ا', u'و', u'ي', u'ى')
Yeh-like	Letters drawn like yeh: small yeh, alef maksura, and hamza on yeh	YEHLIKE = (u'ي', u'ئ', u'ى', u'ۦ')
Waw-like	Letters drawn like waw	WAWLIKE = (u'و', u'ؤ', u'ۥ')
Teh-like	Teh marbuta and open teh	TEHLIKE = (u'ت', u'ة')
Small letters	Small alef, yeh, and waw	SMALL = (u'ٰ', u'ۥ', u'ۦ')
Moon letters	The moon letters	MOON = (u'ء', u'آ', u'أ', u'إ', u'ا', u'ب', u'ج', u'ح', u'خ', ...
Sun letters	The sun letters	SUN = (u'ت', u'ث', u'د', u'ذ', u'ر', u'ز', u'س', u'ش', u'ص', u...
Alphabetic order	Assigns each Arabic letter an ordinal number: alef is 1, beh is 2, and hamza is 29.	AlphabeticOrder = {u'ء': 29, u'آ': 29, u'أ': 29, u'ؤ': 29, u'إ...
Letter names	Gives each letter its Arabic name	NAMES = {u'ء': u'همزة', u'آ': u'ألف ممدودة', u'أ': u'همزة على ...

Functions

Main functions

Description	Function
Strip all harakat, including shadda	strip_tashkeel(text)
Strip all harakat except shadda	strip_harakat(text)
Strip the last haraka	strip_lastharaka(text)
Strip tatweel	strip_tatweel(text)
Normalize the different hamza forms	normalize_hamza(text)
Split text into words	tokenize(text)
Split text into sentences	sentence_tokenize(text)

See the functions and examples in the features file:

features.md
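
To make the behaviour described in the table above concrete, here is a minimal standard-library sketch of the three stripping operations. It is not PyArabic's implementation: the `*_sketch` function names are hypothetical, and the character ranges are an assumption taken from the Unicode Arabic block (FATHATAN U+064B through SUKUN U+0652, with SHADDA at U+0651).

```python
import re

TATWEEL = "\u0640"
# FATHATAN (U+064B) through SUKUN (U+0652); SHADDA (U+0651) lies inside this range.
TASHKEEL_PATTERN = re.compile("[\u064B-\u0652]")
# The same range with SHADDA left out.
HARAKAT_PATTERN = re.compile("[\u064B-\u0650\u0652]")


def strip_tashkeel_sketch(text):
    """Remove every haraka, including shadda."""
    return TASHKEEL_PATTERN.sub("", text)


def strip_harakat_sketch(text):
    """Remove every haraka but keep shadda."""
    return HARAKAT_PATTERN.sub("", text)


def strip_tatweel_sketch(text):
    """Remove the tatweel (kashida) elongation character."""
    return text.replace(TATWEEL, "")


# A fully vocalized word: meem+damma, hah+fatha, meem+shadda+fatha, dal.
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
print(strip_tashkeel_sketch(vocalized))
print(strip_harakat_sketch(vocalized))
```

With the library installed, the corresponding calls would be the `araby` functions named in the table.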


pyarabic's Issues

Convert Arabic glyphs into standard letters

Following up on the previous issue (issue 57), we propose adding a new function to unshape this kind of text.

Salam,
I tested the given words with pyarabic as follows.
The words contain encoded glyphs (presentation forms) rather than standard letters, so they must be converted to ordinary letters.

To convert a glyph-based word into a string of letters, you can use the code below.
NB: the second unshaping_word call is used only to reverse the resulting word.

>>> from pyarabic.unshape import unshaping_word
>>> word = "ﻣﺴﺎﻣﻌﻬﻢ"
>>> unshaping_word(unshaping_word(word))
'مسامعهم'
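
As an independent cross-check that needs no PyArabic function, Unicode compatibility normalization (NFKC) also maps Arabic presentation-form glyphs back to their base letters. The sketch below is an assumption-labelled alternative, not what `unshaping_word` does internally; `glyphs_to_letters` is a hypothetical name. Note that NFKC rewrites other compatibility characters in the text as well.

```python
import unicodedata


def glyphs_to_letters(word):
    """Map presentation-form glyphs to standard letters via NFKC normalization."""
    return unicodedata.normalize("NFKC", word)


# The glyph-encoded word from this issue, written with its code points.
glyph_word = "\uFEE3\uFEB4\uFE8E\uFEE3\uFECC\uFEEC\uFEE2"
plain_word = glyphs_to_letters(glyph_word)

# Every character should now be in the basic Arabic block (U+0600..U+06FF).
print([hex(ord(c)) for c in plain_word])
```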

The test used to detect the problem

>>> import pyarabic.araby as ar
>>> lst = ["اﻟﻤﺴﺌﻮﻟﻴﺔ","ﻣﺴﺎﻣﻌﻬﻢ","ﻓﻜﻠﻨﺎ","ﻣﺒﺎدراﺗﻨﺎ","ﻓﻬﻢ","اﻟﻤﻨﻈﻮﻣﺔ"]
>>> for i in lst:
...     print(i, ar.is_arabicword(i))
...
اﻟﻤﺴﺌﻮﻟﻴﺔ False
ﻣﺴﺎﻣﻌﻬﻢ False
ﻓﻜﻠﻨﺎ False
ﻣﺒﺎدراﺗﻨﺎ False
ﻓﻬﻢ False
اﻟﻤﻨﻈﻮﻣﺔ False

>>> for i in lst:
...     print("%s" % i, ar.is_arabicword(i))
...
اﻟﻤﺴﺌﻮﻟﻴﺔ False
ﻣﺴﺎﻣﻌﻬﻢ False
ﻓﻜﻠﻨﺎ False
ﻣﺒﺎدراﺗﻨﺎ False
ﻓﻬﻢ False
اﻟﻤﻨﻈﻮﻣﺔ False
>>> for i in lst:
...     for c in i:
...         print(c, ord(c), ar.name(c))
...
ا 1575 ألف
ﻟ 65247
ﻤ 65252
ﺴ 65204
ﺌ 65164
ﻮ 65262
ﻟ 65247
ﻴ 65268
ﺔ 65172
ﻣ 65251
ﺴ 65204
ﺎ 65166
ﻣ 65251
ﻌ 65228
ﻬ 65260
ﻢ 65250
ﻓ 65235
ﻜ 65244
ﻠ 65248
ﻨ 65256
ﺎ 65166
ﻣ 65251
ﺒ 65170
ﺎ 65166
د 1583 دال
ر 1585 راء
ا 1575 ألف
ﺗ 65175
ﻨ 65256
ﺎ 65166
ﻓ 65235
ﻬ 65260
ﻢ 65250
ا 1575 ألف
ﻟ 65247
ﻤ 65252
ﻨ 65256
ﻈ 65224
ﻮ 65262
ﻣ 65251
ﺔ 65172

Documentation

I am happy to see a documentation website for the library

I suggest using sphinx-rtd theme which supports RTL languages

https://sphinx-rtd-theme.readthedocs.io/en/stable/
https://github.com/readthedocs/sphinx_rtd_theme#contributing-or-modifying-the-theme

thank you

araby.is_arabicword returns False for some Arabic words

is_arabicword returns False when the following words are passed to it:
"اﻟﻤﺴﺌﻮﻟﻴﺔ","ﻣﺴﺎﻣﻌﻬﻢ","ﻓﻜﻠﻨﺎ","ﻣﺒﺎدراﺗﻨﺎ","ﻓﻬﻢ","اﻟﻤﻨﻈﻮﻣﺔ"

normalize_ligature does not return the right format

I'm trying the example below, but I get the same result as the input text:

from pyarabic.araby import normalize_ligature
text = u"لانها لالء الاسلام"
normalize_ligature(text)

I'm getting the output لانها لالء الاسلام instead of "لانها لالئ الاسلام".

Thanks for your help, and for a very helpful library.

pip installation is missing stack.py

Hi,

When installing from pip, `from pyarabic import araby` results in this error:
  File "<stdin>", line 1, in <module>
  File "/home/naruto/Desktop/herok-app/venv/local/lib/python2.7/site-packages/pyarabic/araby.py", line 28, in <module>
    from stack import *
ImportError: No module named stack
The site-packages/pyarabic folder contains only these files:
araby.py araby.pyc __init__.py __init__.pyc

Version 0.6.8 has not been released to Pypi yet

Hello. I noticed that the version number has been updated to 0.6.8 in the code, but this version has not been released on Pypi.

https://pypi.org/project/PyArabic/

New features in Python 3 that could be useful in this codebase

I thought I would share with you a couple of new Python 3 features that could be useful for this codebase. I know that this codebase currently supports Python 2, but Python 2 is no longer supported by the core developers, so this may be the time to drop support for Python 2 and start using these newer features.

This is just a friendly message, I am not expecting you to have the same opinion as me or to prioritise this. Treat this as a conversational message, not as a bug report. Feel free to close this if you intend to keep on supporting Python 2.

`\N{name}` in Python string literals:

Python 3.3 supports this new feature, see release notes:

Added support for Unicode name aliases and named sequences. Both unicodedata.lookup() and '\N{...}' now resolve name aliases, and unicodedata.lookup() resolves named sequences too.

For example:

>>> "\u0649" == "\N{arabic letter alef maksura}"
True

We can tidy up the codebase to avoid using the hard to read \u0649 escape sequences.

`\N{name}` in Python regexes:

Python 3.8 supports these new features, see release notes:

Added support of \N{name} escapes in regular expressions:

>>> notice = 'Copyright © 2019'
>>> copyright_year_pattern = re.compile(r'\N{copyright sign}\s*(\d{4})')
>>> int(copyright_year_pattern.search(notice).group(1))
2019

This would allow us to improve code like this:

ALEF_MAKSURA = u"\u0649"
m = re.match(r"^(.)*[%s]" % ALEF_MAKSURA, word)

to look like this instead:

m = re.match(r"^(.)*[\N{arabic letter alef maksura}]", word)

This is especially useful when using raw strings literals (string literals that are prefixed with r), as it is not possible otherwise to use \N{...} in a raw string for the intended effect. Raw string literals are often used in regexes.
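
A minimal runnable check of the regex form, assuming Python 3.8 or later; the sample words and the anchored pattern are illustrative choices, not code from the PyArabic codebase.

```python
import re

word = "\u0625\u0644\u0649"        # ends with ALEF MAKSURA
other = "\u0625\u0644\u0627"       # ends with a plain ALEF

# In a raw string the \N{...} escape reaches the re module untouched,
# and re resolves the character name itself (Python 3.8+).
pattern = re.compile(r"^.*\N{ARABIC LETTER ALEF MAKSURA}$")

print(bool(pattern.match(word)))
print(bool(pattern.match(other)))
```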

Correct keyboard layout swap errors

Example Output 1 (a):
Before - English Keyboard:

Hpf lk hgkhs hglj'vtdkK Hpf hg`dk dldg,k f;gdjil Ygn ,p]hkdm hgHl,v tb drt,k ljv]]dk fdk krdqdk>

After:

أحب من الناس المتطرفين، أحب الذين يميلون بكليتهم إلى وحدانية الأمور فلا يقفون مترددين بين نقيضين.

Example Output 2:
Before:
ِىغ هىفثممهلثىف بخخم ؤشى ةشنث فاهىلس لاهللثق ةخقث ؤخةحمثء شىي ةخقث رهخمثىفز ÷ف فشنثس ش فخعؤا خب لثىهعس شىي ش مخف خب ؤخعقشلث فخ ةخرث هى فاث خححخسهفث يهقثؤفهخىز

After:
Any intelligent fool can make things ghigger more complex and more violent. It takes a touch of genius and a lot of courage to move in the opposite direction.

Albert Einstein

References

Ar-PHP
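
The first Before/After pair amounts to a per-key substitution between two keyboard layouts. Below is a minimal sketch of the English-to-Arabic direction. It assumes the common Arabic (101) layout over US QWERTY; the shifted keys are only those that can be inferred from the example above, and `fix_layout_sketch` is a hypothetical name, not a PyArabic or Ar-PHP function.

```python
# Unshifted US QWERTY keys and the Arabic letters assumed to sit on the same keys.
LATIN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvnm,./`"
ARABIC_KEYS = "ضصثقفغعهخحجدشسيبلاتنمكطئءؤرىةوزظذ"

EN_TO_AR = dict(zip(LATIN_KEYS, ARABIC_KEYS))
EN_TO_AR["b"] = "لا"  # the b key types the two-letter lam-alef sequence
# Shifted keys inferred only from the example; a real table needs the full layout.
EN_TO_AR.update({"H": "أ", "Y": "إ", "K": "،", ">": "."})

TABLE = str.maketrans(EN_TO_AR)


def fix_layout_sketch(text):
    """Re-type text that was entered with the English layout instead of the Arabic one."""
    return text.translate(TABLE)


print(fix_layout_sketch("Hpf lk hgkhs"))
```

The reverse direction (Arabic typed where English was intended) would use the inverted table, with the lam-alef pair handled before single letters.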

module 'pyarabic.araby' has no attribute 'sentence_tokenize'

Hello dear

I'm trying to use pyarabic to tokenize sentences, but I get the error below:

AttributeError: module 'pyarabic.araby' has no attribute 'sentence_tokenize'

Normalization of Number words

Most number words found in Arabic text are written in a normalized form, since Arabic speakers use words like "الف" instead of "ألف". Because the module already includes various normalization methods, it would be saner to match the normalized version of the word rather than the original form.

Hi

normalize_searchtext import errors + typos

from pyarabic.normalize import normalize_searchtext
normalize_searchtext("بحث")

This throws import errors, as well as errors caused by typos.

tokenize sentences

Add tokenize sentences

New features: normalizing digits

Normalize the different digit styles to a single chosen form:

Arabic western numeral: '0123456789'
Arabic eastern digit: '٠١٢٣٤٥٦٧٨٩'
Arabic eastern digit variant: '۰۱۲۳۴۵۶۷۸۹'

Functions
Normalize digits to Arabic western (the form used in the Maghreb)
- Arabic eastern digit: '٠١٢٣٤٥٦٧٨٩' ==> '0123456789'
- Arabic eastern digit variant: '۰۱۲۳۴۵۶۷۸۹' ==> '0123456789'
Normalize digits to Arabic eastern (the form used in the Mashriq)
- Arabic western numeral: '0123456789' ==> '٠١٢٣٤٥٦٧٨٩'
- Arabic eastern digit variant: '۰۱۲۳۴۵۶۷۸۹' ==> '٠١٢٣٤٥٦٧٨٩'
Normalize digits to Arabic eastern variant (the eastern form used in the Indian subcontinent and Iran)
- Arabic western numeral: '0123456789' ==> '۰۱۲۳۴۵۶۷۸۹'
- Arabic eastern digit: '٠١٢٣٤٥٦٧٨٩' ==> '۰۱۲۳۴۵۶۷۸۹'
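
The three mappings listed above can be sketched with `str.translate`. This is one possible shape for the feature, not the function that was eventually added to the library; the name `normalize_digits_sketch` and the `target` values are hypothetical.

```python
WESTERN = "0123456789"
EASTERN = "".join(chr(0x0660 + i) for i in range(10))          # U+0660..U+0669
EASTERN_VARIANT = "".join(chr(0x06F0 + i) for i in range(10))  # U+06F0..U+06F9

STYLES = {
    "western": WESTERN,
    "eastern": EASTERN,
    "eastern_variant": EASTERN_VARIANT,
}


def normalize_digits_sketch(text, target="western"):
    """Rewrite every digit in text using the digit style named by target."""
    out = STYLES[target]
    table = {}
    for name, digits in STYLES.items():
        if name != target:
            # Map each digit of the other two styles onto the target style.
            table.update(str.maketrans(digits, out))
    return text.translate(table)


print(normalize_digits_sketch("\u0661\u0662\u06F3"))
print(normalize_digits_sketch("2024", target="eastern"))
```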

initial and middle dotless noon is not working

As-salamo Alaikom Taha @linuxscout ,

I am testing the removal of dots above or below Arabic letters, and everything seems to be working fine, except for initial and middle noon (حرف النون).

Any idea if there is a fix for this?

Here is a screen shot.

Thanks

Adding support for stop words?

Great work Taha. Have you considered adding stop words removal support? There's a good list here that you could start with. https://github.com/mohataher/arabic-stop-words

is_arabicstring

It came with this error:
NameError: name 'is_arabicstring' is not defined
only is_arabicrange is working well

Add a tokenize feature that preserves word positions

Import Error - ModuleNotFoundError: No module named 'six'

Salam Dr.Taha, 👋

There is an ImportError when importing module number.py 😕

To reproduce

>>> import pyarabic.number as number

Error Report

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "\venv\lib\site-packages\pyarabic\number.py", line 27, in <module>
    from six import text_type as unicode
ModuleNotFoundError: No module named 'six'

My System Version

Python 3.8.6

Temporary fix
Installing package six using pip install six

Suggestions
I suggest adding six "current version" into the dependencies of pyarabic within install_requires
I'm relatively new to Python "couples of months" and this suggestion is based on my humble experience with Python. 🥺
This is also my first issue submission into GitHub so please bear with me! 😅

Update strip_tashkeel and strip_diacritics to remove the alef after tanween al fateh

The strip_tashkeel and strip_diacritics functions are very helpful when preprocessing text that will be used for searches. With these functions, one can search for a word like رحيم without tashkeel. However, one challenge is that this will not match a word that ends with tanween al fateh, because after the tashkeel is removed the word still has a different structure: رحيما.

I suggest adding another optional flag (to support previous versions) that will also remove the alef if it comes after tanween al fateh. See https://en.rattibha.com/thread/1266046390439903234 for details

Thank you for the amazing library!
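
One way the suggested optional flag could behave is sketched below with the standard library only. The function name and the `drop_tanween_alef` parameter are hypothetical, and the pattern assumes the alef carrying tanween al fateh is the last letter of the word, written either as fathatan + alef or alef + fathatan.

```python
import re

# FATHATAN (U+064B) through SUKUN (U+0652).
TASHKEEL = re.compile("[\u064B-\u0652]")
# Word-final alef with tanween al fateh, in either typing order.
TANWEEN_ALEF = re.compile("(?:\u064B\u0627|\u0627\u064B)(?!\\w)")


def strip_tashkeel_for_search(text, drop_tanween_alef=False):
    """Strip tashkeel; optionally drop the alef that supports a final fathatan."""
    if drop_tanween_alef:
        text = TANWEEN_ALEF.sub("", text)
    return TASHKEEL.sub("", text)


word = "\u0631\u062d\u064a\u0645\u064b\u0627"  # the example word with fathatan + alef
print(strip_tashkeel_for_search(word))
print(strip_tashkeel_for_search(word, drop_tanween_alef=True))
```

Keeping the flag off by default preserves the current output for existing callers, which is the backward-compatibility concern raised in the suggestion.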

normalize_alef converts YEHLIKE to Alef

Hi, I am an Urdu speaker. I am trying to reduce Arabic words to a simple form. I found the behavior shown below. Please correct me if I am wrong; maybe it is intended or correct. If that is the case, how can I keep the final YEH-like letter (small yeh, etc.) from being changed to alef?

words = ['إِلَّا','إِلَىٰ','بِى','بِٱلْهُدَىٰ','بِٱلَّذِىٓ','بِلِقَآئِ','بُغِىَ','ٱلْمَأْوَىٰ']
for w in words:
    simple=w
    print(f'word : {simple}')
    
    simple=araby.strip_diacritics(simple)
    print(f'strip_diacritics : {simple}')
    
    simple=araby.normalize_alef(simple)
    print(f'normalize_alef : {simple}')
    
    simple=araby.normalize_hamza(simple)
    print(f'normalize_hamza : {simple}')
    
    print('_'*10)

outputs:

word : إِلَّا
strip_diacritics : إلا
normalize_alef : الا
normalize_hamza : الا
__________
word : إِلَىٰ
strip_diacritics : إلى
normalize_alef : الا
normalize_hamza : الا
__________
word : بِى
strip_diacritics : بى
normalize_alef : با
normalize_hamza : با
__________
word : بِٱلْهُدَىٰ
strip_diacritics : بٱلهدى
normalize_alef : بالهدا
normalize_hamza : بالهدا
__________
word : بِٱلَّذِىٓ
strip_diacritics : بٱلذى
normalize_alef : بالذا
normalize_hamza : بالذا
__________
word : بِلِقَآئِ
strip_diacritics : بلقائ
normalize_alef : بلقائ
normalize_hamza : بلقاء
__________
word : بُغِىَ
strip_diacritics : بغى
normalize_alef : بغا
normalize_hamza : بغا
__________
word : ٱلْمَأْوَىٰ
strip_diacritics : ٱلمأوى
normalize_alef : الماوا
normalize_hamza : الماوا
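
If the goal is to unify the alef forms while leaving alef maksura (ى) untouched, a small translation table is enough. This is a workaround sketch under that assumption, not a change to `normalize_alef`; the function name is hypothetical.

```python
# Alef forms to unify; ALEF MAKSURA (U+0649) is deliberately not listed.
ALEF_VARIANTS = {
    "\u0622": "\u0627",  # ALEF WITH MADDA ABOVE
    "\u0623": "\u0627",  # ALEF WITH HAMZA ABOVE
    "\u0625": "\u0627",  # ALEF WITH HAMZA BELOW
    "\u0671": "\u0627",  # ALEF WASLA
}
ALEF_TABLE = str.maketrans(ALEF_VARIANTS)


def normalize_alef_keep_maksura(text):
    """Replace hamza, madda, and wasla alef forms with bare alef; keep alef maksura."""
    return text.translate(ALEF_TABLE)


sample = "\u0625\u0644\u0649"  # alef with hamza below, lam, alef maksura
print(normalize_alef_keep_maksura(sample))
```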

Documentation site is broken

The README.md file points to this site for documentation:

https://pythonhosted.org/PyArabic/

Unfortunately, it currently is broken. It does not display properly on Chrome or Firefox. If I look in the console, I see these errors:

Load denied by X-Frame-Options: https://pythonhosted.org/PyArabic/toc.html does not permit framing.

I'm guessing pythonhosted.org recently turned on the X-Frame-Options: deny header, making it impossible to use iframes the way the current documentation uses iframes.

Sentence Tokenization

@linuxscout is there any Sentence Tokenizer function ?

Package documentation?

Hi @linuxscout

The package documentation links to https://pythonhosted.org/PyArabic/ but the link appears to be broken. Or is https://pypi.org/project/PyArabic/ the main documentation for the package?

Could you clarify this point, please?

Kamran

Does pyarabic support Python 3.x

Hi, I am wondering if the package supports python 3.x

prefix and suffix

Hi
is there any function that could provide prefix and suffix tokenization?
such as:
Input: ولن نبالغ إذا قلنا: إن 'هاتف' أو 'كمبيوتر المكتب' في زمننا هذا ضروري
Output: و+ لن نبالغ إذا قل +نا : إن ' هاتف ' أو ' كمبيوتر ال+ مكتب ' في زمن +نا هذا ضروري

Clean Arabic Text (quranic marks, esthetic symbols)

Clean and normalize Arabic text by removing various marks, such as:

Quranic marks available in Unicode
- Input: -يُنَزِّلُ ٱلْمَلَٰٓئِكَةَ بِٱلرُّوحِ مِنْ أَمْرِهِۦ عَلَىٰ مَن يَشَآءُ مِنْ عِبَادِهِۦٓ أَنْ أَنذِرُوٓاْ أَنَّهُۥ لَآ إِلَٰهَ إِلَّآ أَنَا۠ فَٱتَّقُونِ‎
- Output: - يُنَزِّلُ الْمَلَائِكَةَ بِالرُّوحِ مِنْ أَمْرِهِ عَلَىٰ مَن يَشَاءُ مِنْ عِبَادِهِ أَنْ أَنذِرُوا أَنَّهُ لَا إِلَٰهَ إِلَّا أَنَا فَاتَّقُونِ
Decorative marks
- Input: الہلہغہة الہعہربيہة+ال͠ل͠غ͠ة ال͠ع͠رب͠ي͠ة+الہٰلہٰغة الہٰعربٰٰيٰة+ال̲ل̲غ̲ة ال̲ع̲ر̲ب̲ي̲ة
- Output: اللغة العربية
Emoji
- Input:
- Output:

References:

Soundex for Arabic text

Provide options in the tokenize function

This looks promising for Arabic tokenization. Not an issue, but It'll be great to provide options in the tokenizer - ex. remove tashkeel and filter non-Arabic words in a mixed text.

Add tags to distinguish versions.

number_to_text method is missing, isn't it?

The documentation mentions that the library has a number to text (21 -> واحد وعشرون) method, but I can't find it. Would you please point to it? And give an example of how to use it?

New Feature: Arabic Text Standardize

Arabic Text Standardize:

Standardize Arabic text according to the typographic rules followed in magazines and newspapers, such as spacing before and after punctuation marks, brackets, and units.

Example Output:
Original:

هذا نص عربي ، و فيه علامات ترقيم بحاجة إلى ضبط و معايرة !و كذلك نصوص( بين أقواس )أو حتى مؤطرة"بإشارات إقتباس "أو- علامات إعتراض -الخ......
لذا ستكون هذه المكتبة أداة و وسيلة لمعالجة مثل هكذا حالات، بما فيها الواحدات 1 Kg أو مثلا MB 16 وسواها حتى النسب المؤية مثل 20% أو %50 وهكذا ...
Standard:

هذا نص عربي، وفيه علامات ترقيم بحاجة إلى ضبط ومعايرة! وكذلك نصوص (بين أقواس) أو حتى مؤطرة "بإشارات إقتباس" أو -علامات إعتراض- الخ...
لذا ستكون هذه المكتبة أداة و وسيلة لمعالجة مثل هكذا حالات، بما فيها الواحدات 1 Kg أو مثلا 16 MB وسواها حتى النسب المؤية مثل %20 أو %50 وهكذا...

Ar-php: text standardize

Python 3 support ?

Is there any plan for Python 3 support ? Any alternative ?

Add a new function to correct punctuation on joined tokens

I want to add a function that corrects punctuation spacing in joined text.

example :

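import re

def fix_punct(text):
    # Match a run of punctuation marks together with any surrounding whitespace.
    fix_spaces = re.compile(r'\s*([?؟!.,،]+(?:\s+[?؟!.,،]+)*)\s*', re.UNICODE)
    # Drop spaces inside the run, attach it to the previous word, and follow it with one space.
    text = fix_spaces.sub(lambda x: "{} ".format(x.group(1).replace(" ", "")), text)
    return text.strip()

fixed = fix_punct(u"كل فرد في الأمة مجند لمعركة المصير : الفلاح في حقله ، والعامل في مصنعه ، والطالب في معهده ، والموظف في ديوانه ")
print(fixed)

Output: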
كل فرد في الأمة مجند لمعركة المصير : الفلاح في حقله، والعامل في مصنعه، والطالب في معهده، والموظف في ديوانه...

Issue checking for a valid Arabic word

Expected Behaviour

An Arabic word doesn't contain spaces, digits and punctuation

Current Behaviour

araby.is_arabicword('؛') gives True.
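
A stricter check matching the expected behaviour can be sketched with `unicodedata`: accept a word only if every character is a letter or combining mark whose Unicode name starts with ARABIC. This is an illustration of the rule stated above, not the library's implementation, and `is_arabic_word_sketch` is a hypothetical name.

```python
import unicodedata


def is_arabic_word_sketch(word):
    """True if word is non-empty and made only of Arabic letters and marks."""
    if not word:
        return False
    for char in word:
        # General categories starting with L are letters, with M are combining marks.
        is_letter_or_mark = unicodedata.category(char)[0] in ("L", "M")
        is_arabic = unicodedata.name(char, "").startswith("ARABIC")
        if not (is_letter_or_mark and is_arabic):
            return False
    return True


print(is_arabic_word_sketch("\u0645\u062d\u0645\u062f"))  # four Arabic letters
print(is_arabic_word_sketch("\u061b"))                    # ARABIC SEMICOLON
```

Spaces, digits, and punctuation fall outside the L and M categories, so they are rejected without listing them explicitly.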

Add help functions to get constants names

Add functions to list letters and their codes and names

Tokenize words

The tokenizer does not separate the character و from a word when it is a conjunction rather than part of the original word, as in this example:

>>> from pyarabic.araby import tokenize, is_arabicrange, strip_tashkeel
>>> text = u"ِاسمٌ الكلبِ في اللغةِ الإنجليزية Dog واسمُ الحمارِ Donky"
>>> tokenize(text, conditions=is_arabicrange, morphs=strip_tashkeel)
        ['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'واسم', 'الحمار']

Function for "is_arabic"

Hi,
Is there any function that checks if a word is Arabic or not?
e.g.
is_arabic('محمد') => True
is_arabic('’Muhammed') => False

Thanks.

Which open source license is this under?

setup.py mentions that this is licensed under GPL. However, which version of the GPL? And where is the full text of the license?

I recommend including a file named LICENSE.txt with the full text of GPLv3. Here are instructions: https://choosealicense.com

Add docs for new features

spellit
encode_tashkeel
normalized_digit
fix error on project description on pypi with features.md

text2number is not converting 400 , 104 properly

أربعة مئة = 400 -> giving output as 104
104 = مائة وأربعة -> giving output as 100 و 4

Python 3

Peace be upon you, and thank you for this library. As you know, support for Python 2 will end next year. Is there a version compatible with Python 3?
May God reward you.

0.6.3 breaks on Python 3

I updated to 0.6.3 and my application fails with this Traceback:

Traceback (most recent call last):
  File "C:\Users\tahoar\slatedesktop\share\plugins\ar\tokenizer.py", line 23, in <module>
    from pyarabic import araby
  File "C:\Program Files\Python36\lib\site-packages\pyarabic\araby.py", line 219
    HARAKAT_PATTERN = re.compile(ur"[" + u"".join(HARAKAT) + u"]", re.UNICODE)
    ^
SyntaxError: invalid syntax

I have done considerable work on your code to update all modules for Python 2.7 and 3.x compatibility, and I'd like to contribute those changes to trunk. Please review this attachment and run it through your tests.

Thanks,
Tom

pyarabic.zip

Split Letter

Is there any builtin function to split letter/character?

Installing on windows using pip

I just tried installing using pip, but it gave me an encoding error in the setup.py file.
To fix this, I added encoding='utf-8'.

def readme():
    with open('README.md',encoding='utf-8') as f:
        return f.read()

Ordinal Number

Convert ordinal number into number
"الثالث" => 3

Do all kinds of Arabic text normalization work in PyArabic?

From what I have read, there is only one kind of text normalization: hamza normalization.

What about other letters, like what is in the ar-php library?

For example:
Original Text: آسِفـــةٌ لا تَنَبُّؤْ

Normalized Text: اسفه لا تنبء

Is there anything similar to that? It would be very helpful..

Thanks @linuxscout for all your work!

Strip arabic extended harakat

Add a function to remove extended diacritics such as the small alef.