urduhack / urduhack

An NLP library for the Urdu language. It comes with many batteries-included features to help you process Urdu data in the easiest way possible.

Home Page: https://urduhack.readthedocs.io/en/stable/

License: MIT License

Python 100.00%

urdu urdu-language urdu-nlp machine-learning deep-learning urdu-text-processsing deeplearning python tensorflow urdu-hack

urduhack's Introduction

Urduhack: A Python NLP library for Urdu language

Urduhack is an NLP library for the Urdu language. It comes with many batteries-included features to help you process Urdu data in the easiest way possible.

You can reach the core contributor, Mr Ikram Ali, at https://github.com/akkefa

Our Goal

Academic users: Easier experimentation to prove their hypotheses without coding from scratch.
NLP beginners: Learn how to build an NLP project with production-level code quality.
NLP developers: Build a production-level application within minutes.

🔥 Features Support

🛠 Installation

Urduhack officially supports Python 3.6–3.7, and runs great on PyPy.

Install with the CPU version of TensorFlow:

$ pip install urduhack[tf]

Install with the GPU version of TensorFlow:

$ pip install urduhack[tf-gpu]

Usage

import urduhack

# Downloading models
urduhack.download()

nlp = urduhack.Pipeline()
text = ""
doc = nlp(text)

for sentence in doc.sentences:
    print(sentence.text)
    for word in sentence.words:
        print(f"{word.text}\t{word.pos}")

    for token in sentence.tokens:
        print(f"{token.text}\t{token.ner}")

🔗 Documentation

Fantastic documentation is available at https://urduhack.readthedocs.io/

Documentation
Installation	How to install Urduhack and download models
Quickstart	New to Urduhack? Here's everything you need to know!
API Reference	The detailed reference for Urduhack's API.
Contribute	How to contribute to the code base.

👍 Contributors

Special thanks to everyone who contributed to getting Urduhack to its current state.

Backers

Thank you to all our backers! 🙏 [Become a backer]

📝 Copyright and license

Code released under the MIT License.

urduhack's Issues

Drop Stop words

Question: how to replace repeated ۔۔۔ with a single ۔

Dear all,
I am processing text that contains repeated end-of-sentence marks, for example:
text=" اس کو تو یاد ہی نہیں رہا۔۔۔۔۔۔۔۔"
How can I get the desired text using an Urduhack function?
desired text="اس کو تو یاد ہی نہیں رہا۔"
Thanks!
Currently I am using the following code:

from urduhack.normalization import normalize
text="اس کو تو یاد ہی نہیں رہا۔۔۔۔۔۔۔۔" 
normalized_text = normalize(text)
print(normalized_text)

output is:
اس کو تو یاد ہی نہیں رہا۔۔۔۔۔۔۔۔
desired output:
اس کو تو یاد ہی نہیں رہا۔
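One possible workaround, independent of Urduhack, is to collapse the repeated marks with a regular expression before or after calling normalize. This is a minimal sketch using only the standard library; the helper name collapse_full_stops is hypothetical and not part of Urduhack.

```python
import re

URDU_FULL_STOP = "\u06D4"  # ۔ ARABIC FULL STOP


def collapse_full_stops(text):
    """Replace every run of two or more ۔ with a single ۔."""
    return re.sub(f"{URDU_FULL_STOP}{{2,}}", URDU_FULL_STOP, text)


sentence = "اس کو تو یاد ہی نہیں رہا"
text = sentence + URDU_FULL_STOP * 8
collapsed = collapse_full_stops(text)
print(collapsed)
```

A single ۔ is left untouched, so the helper can be applied to text that is already clean.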

Support for Shahmukhi and Sindhi scripts

Feature description

Would it be possible to extend the standard Urdu script support to cover the additional characters of Shahmukhi (Punjabi) and Sindhi as well?
It would be great if the library's modules, such as normalization, worked for all three languages (Urdu, Punjabi, Sindhi).

Thanks!

Documentation: Converting the Unicode values of the initial, medial, and final forms of a character to the isolated form

The conversion itself works very well, but the following text could be added somewhere in the documentation for the sake of clarity.

In Urdu and Arabic, a single letter has several different shapes:
'ﻑ', 'ﻒ', 'ﻓ', 'ﻔ'

Isolated form
Initial form
Medial form
Final form
In Unicode, the isolated, initial, medial, and final forms of a letter each have their own separate code point.

When typing on a keyboard, usually only the isolated form of a letter is entered, and the layout engine supplies the required shape from the font according to the letter's context.
This process of changing shapes is the responsibility of the reshaping engine.

On systems that lack layout and reshaping support for right-to-left languages, a common workaround is to give the layout engine the Unicode values of the context-specific forms instead of the isolated form, which makes correct display possible.
However, text stored this way is difficult to search.
Therefore, during normalization the initial, medial, and final forms of a letter need to be replaced with the isolated form.
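The folding described above can be illustrated with Python's standard unicodedata module. This is only a sketch of the idea and is not taken from Urduhack's own normalization code: NFKC compatibility normalization maps the contextual presentation forms of ف to the base letter U+0641. The helper name to_base_letters is hypothetical.

```python
import unicodedata

# Presentation forms of the letter ف (feh), Arabic Presentation Forms-B block
FEH_FORMS = {
    "isolated": "\uFED1",
    "final": "\uFED2",
    "initial": "\uFED3",
    "medial": "\uFED4",
}


def to_base_letters(text):
    """Fold compatibility characters, including contextual forms, to base letters."""
    return unicodedata.normalize("NFKC", text)


folded = {name: to_base_letters(ch) for name, ch in FEH_FORMS.items()}
for name, ch in folded.items():
    print(name, f"U+{ord(ch):04X}")
```

NFKC also rewrites other compatibility characters (ligatures, full-width digits, and so on), so it may change more than the contextual forms.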

Error in Characters Normalization code sample

In
urduhack/docs/handbook/tutorial/normalization.rst
under Characters Normalization, a code sample is given in which text and normalized_text have identical Unicode values, so the sample does not show any change between the two.

text:مجھ کو جو توڑا گیا تھا
unicode:
%u0645%u062C%u06BE%20%u06A9%u0648%20%u062C%u0648%20%u062A%u0648%u0691%u0627%20%u06AF%u06CC%u0627%20%u062A%u06BE%u0627

normalized_text:
مجھ کو جو توڑا گیا تھا
unicode:
%u0645%u062C%u06BE%20%u06A9%u0648%20%u062C%u0648%20%u062A%u0648%u0691%u0627%20%u06AF%u06CC%u0627%20%u062A%u06BE%u0627

Tool used for the text-to-Unicode conversion:
https://www.online-toolz.com/tools/text-unicode-entities-convertor.php
Tool used for the diff:
https://www.diffchecker.com/diff
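The same check can be done locally instead of with online tools. The sketch below prints the code points of a string and reports where two strings first differ; both helper names are hypothetical, and the sample strings hold only the first word (مجھ) from the dump above.

```python
def codepoints(text):
    """Return the code point of every character as a U+XXXX string."""
    return [f"U+{ord(ch):04X}" for ch in text]


def first_difference(a, b):
    """Return the index of the first differing character, or None if equal."""
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return index
    return None if len(a) == len(b) else min(len(a), len(b))


before = "\u0645\u062C\u06BE"  # مجھ, as in the dump above
after = "\u0645\u062C\u06BE"
print(codepoints(before))
print(first_difference(before, after))
```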

Normalize ﷲ to الله

Sometimes, a single unicode character ﷲ is used to denote اللہ.
Please normalize this as part of the Arabic->Urdu conversion.
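Until this is handled by the library, a plain string replacement can expand the ligature. This is a minimal sketch: the helper name is hypothetical, and the target spelling is an assumption (alef, lam, lam, followed by the Urdu ہ, U+06C1, as written in the body of this request).

```python
ALLAH_LIGATURE = "\uFDF2"  # ﷲ ARABIC LIGATURE ALLAH ISOLATED FORM
# Assumed target spelling: alef, lam, lam, heh goal (the Urdu ہ)
ALLAH_EXPANDED = "\u0627\u0644\u0644\u06C1"


def expand_allah_ligature(text):
    """Replace the single-character ligature with the spelled-out word."""
    return text.replace(ALLAH_LIGATURE, ALLAH_EXPANDED)


expanded = expand_allah_ligature(ALLAH_LIGATURE + " اکبر")
print(expanded)
```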

Addition of 'وہاں' to URDU_CONJUNCTIONS

Test text:
جہاں کہیں آزادی ہوتی ہے وہاں اور روشیں جنم لے لیتی ہیں۔جہاں بلوے کی آگ ہے وہاں امریکی ایندھن موجود ہے۔جہاں زلزلہ آتا ہے وہاں خواتین کی عزت کو خطرہ ہوتا ہے۔جس بیرک میں انہیں رکھا گیا ہے وہاں کیا حالات ہیں؟

Urduhack sentence tokenization output:
['جہاں کہیں آزادی ہوتی ہے۔',
'وہاں اور روشیں جنم لے لیتی ہیں۔',
'جہاں بلوے کی آگ ہے۔',
'وہاں امریکی ایندھن موجود ہے۔',
'جہاں زلزلہ آتا ہے۔',
'وہاں خواتین کی عزت کو خطرہ ہوتا ہے۔',
'جس بیرک میں انہیں رکھا گیا ہے۔',
'وہاں کیا حالات ہیں؟']

Solution:
Add 'وہاں' to URDU_CONJUNCTIONS.

Support for extracting comma-separated sentences from text

Consider the following sentence:

چند دن پہلے ایک دوست سے ملاقات ہوئی، وہ رنجیدہ دکھائی دئیے، معلوم ہوا کہ ان کے ایک محلے دار کا نوجوان لڑکا انتقال کر گیا ہے، وہ کینسر کا مریض تھا۔
Source of the text:
http://www.inzaar.pk/hum-kahan-ja-rahy-hain-by-muhammad-aamir-khakwani/

Can the following sentences be extracted from it? Urduhack currently returns it as a single sentence.

چند دن پہلے ایک دوست سے ملاقات ہوئی
وہ رنجیدہ دکھائی دئیے
معلوم ہوا کہ ان کے ایک محلے دار کا نوجوان لڑکا انتقال کر گیا ہے
وہ کینسر کا مریض تھا

from urduhack.normalization import normalize
from urduhack.preprocess import normalize_whitespace
from urduhack.tokenization import sentence_tokenizer

text='چند دن پہلے ایک دوست سے ملاقات ہوئی، وہ رنجیدہ دکھائی دئیے، معلوم ہوا کہ ان کے ایک محلے دار کا نوجوان لڑکا انتقال کر گیا ہے، وہ کینسر کا مریض تھا۔'
normalized_text = normalize_whitespace(normalize(text))
sentences=sentence_tokenizer(normalized_text)
print(sentences)
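As a stopgap, the clauses can be separated with a regular expression on the Urdu comma (U+060C) and full stop (U+06D4). This is a naive sketch, not an Urduhack feature, and the helper name split_clauses is hypothetical; it splits on every comma, including commas inside lists, so it will over-split some sentences.

```python
import re

URDU_COMMA = "\u060C"      # ،
URDU_FULL_STOP = "\u06D4"  # ۔

expected = [
    "چند دن پہلے ایک دوست سے ملاقات ہوئی",
    "وہ رنجیدہ دکھائی دئیے",
    "معلوم ہوا کہ ان کے ایک محلے دار کا نوجوان لڑکا انتقال کر گیا ہے",
    "وہ کینسر کا مریض تھا",
]
# Rebuild the sentence from the issue with explicit separators
text = (URDU_COMMA + " ").join(expected) + URDU_FULL_STOP


def split_clauses(text):
    """Split on Urdu commas and full stops, dropping empty pieces."""
    parts = re.split(f"[{URDU_COMMA}{URDU_FULL_STOP}]", text)
    return [part.strip() for part in parts if part.strip()]


clauses = split_clauses(text)
for clause in clauses:
    print(clause)
```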

Generate IMDB dataset into URDU

'Charmap' codec can't decode byte 0x81 in position 21: character maps to <undefined>

Two screenshots illustrating the error were attached to this issue (images not available). Any help in this regard would be appreciated.

Your Environment

Operating System: Windows 10 Education (64-bit)
Python Version Used: 3.6
Urduhack Version Used: 0.2.7
Environment Information: Jupyter IDE on HP Laptop intel(R) Core(TM) i5-3230M CPU @ 2.60GHz 2.60 GHz with 4 GB RAM.
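A 'charmap' decode error on Windows usually means a file was opened with the platform's default code page instead of UTF-8. Assuming that is the cause here (the screenshots are not available, so this is a guess), passing an explicit encoding when opening the file avoids it. The sketch below writes and reads back a small temporary file as a stand-in for the dataset file.

```python
import os
import tempfile

sample = "یہ فلم بہت اچھی تھی۔\n"

# Write a small UTF-8 file to stand in for the dataset file
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as handle:
    handle.write(sample)
    path = handle.name

# encoding="utf-8" avoids falling back to the platform default
# (often cp1252 on Windows, which cannot decode byte 0x81)
with open(path, encoding="utf-8") as handle:
    loaded = handle.read()

os.remove(path)
print(loaded == sample)
```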

Request: IsNormalized()

Dear developers,
Please add an IsNormalized(text) function that takes text as input and returns True if the text is already normalized and False otherwise.
Thanks
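One way such a check could work, assuming the normalizer is idempotent, is to compare the text with its own normalized form. In this sketch is_normalized is a hypothetical helper, and Unicode NFC from the standard library stands in for Urduhack's normalize so that the example runs on its own.

```python
import unicodedata


def is_normalized(text, normalizer):
    """Return True when applying `normalizer` leaves `text` unchanged."""
    return normalizer(text) == text


def nfc(text):
    # Stand-in normalizer for the demo; not Urduhack's normalize()
    return unicodedata.normalize("NFC", text)


composed = "\u0622"          # آ as a single code point
decomposed = "\u0627\u0653"  # ا followed by combining madda above
print(is_normalized(composed, nfc))
print(is_normalized(decomposed, nfc))
```

With Urduhack installed, the same helper could presumably be called as is_normalized(text, normalize), at the cost of running the full normalization once per check.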

word_tokenizer: Index out of bound error

I am tokenizing about 5,000 lines of Urdu data. It works well for 500-800 lines, but when I run it over the whole file it raises an index-out-of-bounds error and the output file contains only 150 tokenized lines.
P.S. I have already tokenized this file successfully with the Indic and NLTK tokenizers.
Error:
Traceback (most recent call last):
  File "urduhack-tok.py", line 29, in <module>
    tokLine=word_tokenizer(sentence)
  File "/home/buraq/.virtualenvs/urduhack/lib/python3.6/site-packages/urduhack/tokenization/tokenizer.py", line 35, in word_tokenizer
    return predict(sentence, MODEL_PATH, VOCAB_PATH)
  File "/home/buraq/.virtualenvs/urduhack/lib/python3.6/site-packages/urduhack/tokenization/keras_tokenizer.py", line 119, in predict
    inp_, _ = preprocess_sentences(sentences, max_len, char2idx)
  File "/home/buraq/.virtualenvs/urduhack/lib/python3.6/site-packages/urduhack/tokenization/keras_tokenizer.py", line 55, in preprocess_sentences
    input_[i, char_index] = char2idx[letter]
IndexError: index 256 is out of bounds for axi
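The index 256 in the error suggests, though the truncated traceback does not confirm it, that a line is longer than a fixed input length of 256 characters. Under that assumption, one workaround is to split long lines into shorter chunks at whitespace before passing them to word_tokenizer. The helper below is a hypothetical sketch, and max_len=256 is an assumption taken from the error message.

```python
def chunk_by_length(line, max_len=256):
    """Split a line at whitespace into chunks of at most max_len characters.

    A single word longer than max_len is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in line.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


long_line = "ab " * 200
chunks = chunk_by_length(long_line, max_len=20)
print(len(chunks), max(len(chunk) for chunk in chunks))
```

Each chunk can then be tokenized separately and the token lists concatenated.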

The command urduhack download does not seem to work on Windows

Dear all, I am not a developer and I am using GitHub for the first time. I am working on a research problem in the Urdu language for which I need a library like NLTK, and while searching I came across the Urduhack library. When I tried to test it, everything seemed to work except the command urduhack download, and an error message was raised (shown in a screenshot attached to this issue; image not available).

The error message says to run the command urduhack download in the terminal. After running the command (second screenshot; image not available), the error still occurs.

In the Anaconda environment, the command urduhack download does not appear to execute at all: nothing is printed as a result of the command, neither a success message nor a failure message. I therefore think there is a problem with this command on Windows.

How to reproduce the behaviour

urduhack download

Your Environment

Operating System: Windows 10 Education (64-bit)
Python Version Used: 3.6
Urduhack Version Used: 0.2.7
Environment Information: I am using Spyder IDE version 4.1.1 on HP Laptop intel(R) Core(TM) i5-3230M CPU @ 2.60GHz 2.60 GHz with 4 GB RAM.

Download IMDB dataset

Train sentiment model on available datasets.

OSError: SavedModel file does not exist at: C:\Users\user\.urduhack\models\tokenizer\word\word_tokenizer.h5\{saved_model.pbtxt|saved_model.pb}

How to reproduce the behavior

import urduhack
urduhack.download()
from urduhack.tokenization import word_tokenizer
a = "احسن فاروقی"
word_tokenizer(a)

Your Environment

Operating System: Windows 10
Python Version Used: 3.8.3
Urduhack Version Used: 1.1.1
Environment Information: Jupyter notebook
Tensorflow version: 2.5.0
Keras Version: 2.5.0

Error Generated:

OSError Traceback (most recent call last)
in
1 a = "احسن فاروقی"
----> 2 word_tokenizer(a)

~\AppData\Roaming\Python\Python38\site-packages\urduhack\tokenization\tokenizer.py in word_tokenizer(sentence, max_len)
62 if _WORD_TOKENIZER_MODEL is None:
63 _is_model_exist(WORD_TOKENIZER_MODEL_PATH, WORD_TOKENIZER_VOCAB_PATH)
---> 64 _WORD_TOKENIZER_MODEL, _CHAR2IDX, _IDX2CHAR = load_model(WORD_TOKENIZER_MODEL_PATH, WORD_TOKENIZER_VOCAB_PATH)
65
66 inp, _ = _preprocess_sentence(sentence, _CHAR2IDX, max_len=max_len)

~\AppData\Roaming\Python\Python38\site-packages\urduhack\tokenization\keras_tokenizer.py in load_model(model_path, vocab_path)
96 """
97
---> 98 model = tf.keras.models.load_model(model_path)
99
100 char2idx_, idx2char_ = _load_vocab(vocab_path)

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\keras\saving\save.py in load_model(filepath, custom_objects, compile, options)
204 filepath = path_to_string(filepath)
205 if isinstance(filepath, str):
--> 206 return saved_model_load.load(filepath, compile, options)
207
208 raise IOError(

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\keras\saving\saved_model\load.py in load(path, compile, options)
119 # Look for metadata file or parse the SavedModel
120 metadata = saved_metadata_pb2.SavedMetadata()
--> 121 meta_graph_def = loader_impl.parse_saved_model(path).meta_graphs[0]
122 object_graph_def = meta_graph_def.object_graph_def
123 path_to_metadata_pb = os.path.join(path, constants.SAVED_METADATA_PATH)

~\AppData\Roaming\Python\Python38\site-packages\tensorflow\python\saved_model\loader_impl.py in parse_saved_model(export_dir)
111 raise IOError("Cannot parse file %s: %s." % (path_to_pbtxt, str(e)))
112 else:
--> 113 raise IOError(
114 "SavedModel file does not exist at: %s%s{%s|%s}" %
115 (export_dir, os.path.sep, constants.SAVED_MODEL_FILENAME_PBTXT,

OSError: SavedModel file does not exist at: C:\Users\user/.urduhack/models/tokenizer/word/word_tokenizer.h5{saved_model.pbtxt|saved_model.pb}
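Because the path ends in .h5 but TensorFlow fell back to the SavedModel loader, the file may be missing or may not be a valid HDF5 file (for example, an incomplete download); this is a guess and is not confirmed by the report. The sketch below checks both conditions using only the standard library. The helper name is hypothetical, and checking the signature at offset 0 is a heuristic.

```python
from pathlib import Path

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # first 8 bytes of a typical HDF5 file


def describe_model_file(path):
    """Classify a model path as 'missing', 'not-hdf5', or 'hdf5'."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    with path.open("rb") as handle:
        header = handle.read(len(HDF5_SIGNATURE))
    return "hdf5" if header == HDF5_SIGNATURE else "not-hdf5"


# Location taken from the error message above
model_path = (
    Path.home() / ".urduhack" / "models" / "tokenizer" / "word" / "word_tokenizer.h5"
)
print(describe_model_file(model_path))
```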

AttributeError: module 'urduhack' has no attribute 'pipeline'

I followed the documentation and simply copy-pasted the example, but the module raises an unexpected error.

urduhack.pipeline()
AttributeError: module 'urduhack' has no attribute 'pipeline'
So basically it says that urduhack does not have any attribute pipeline, whereas it is mentioned in the documentation.

Data fetching for news resources

Fetch data from following news sites.

Arynews
Dawn News
Urdu point

Collect at least 100 MB of data from each source, so the total dataset size will be about 300 MB.

Send the file to @akkefa to verify the results.

Normalization: Replace Arabic ة with Urdu ۃ

Please replace the Arabic ة with the Urdu ۃ during normalization.
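Until the library handles this mapping, it can be applied as a single character translation. This is a minimal sketch and the helper name is hypothetical; it maps U+0629 (Arabic teh marbuta) to U+06C3 (teh marbuta goal).

```python
ARABIC_TEH_MARBUTA = "\u0629"     # ة
URDU_TEH_MARBUTA_GOAL = "\u06C3"  # ۃ

_TABLE = {ord(ARABIC_TEH_MARBUTA): URDU_TEH_MARBUTA_GOAL}


def replace_teh_marbuta(text):
    """Map every Arabic ة to the Urdu ۃ."""
    return text.translate(_TABLE)


converted = replace_teh_marbuta("سور" + ARABIC_TEH_MARBUTA)
print(converted)
```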

Pos Tag error

I run the following code to generate word tokenization for my urdu text corpus:

 
import urduhack
nlp = urduhack.Pipeline()
urduhack.download()
doc = nlp(text)
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text)

I get the following error:

/usr/local/lib/python3.7/dist-packages/urduhack/pipeline/parsers/pos_tagger.py in parse(self, document)
20 tags = predict_tags(sentence.text)
21
---> 22 assert len(tags) == len(sentence.words), " Error in post tags"
23 for tag, word in zip(tags, sentence.words):
24 if tag[0] == word.text:

AssertionError: Error in post tags

Here's my Urdu Text corpus:

یوں توآئے دن کوئی نا کوئی واقعہ رونما ہوتا ہے ۔
کبھی لڑکیوں کو گولی مار کر زندہ درگور کیا جاتا ہے تو کسی کو جلا کر مار دیا جاتا ہے ۔
پھر قانون اور اسلام کے نام پر حاصل ہونے والے پاکستان پر ایک نہیں پانچ انگلیاں اٹھتی ہیں ۔
صفحے پر صفحے کالے کیے جاتے ہیں ۔
جس کا جتنا بس چلتا ہے وہ اپنی نفرت کا اظہار بھی اسی طرح کرتا ہے ۔
میری طرح کئی ایسے ہوں گے جو اپنی تسلی کے لیے لکھتے ہوں گے کہ شاید لکھ کر ہم نے شہیدوں میں نام لکھوا لیا ہے ۔
ابھی سیلاب کے بارے میں سوچا ہی جا رہا تھا کہ ایک ایسی وڈیو نے سوچنے پر مجبور کر دیا کہ کیا ان کو انصاف مل سکے گا ۔
یا کچھ وقت گزر جانے کے بعد یہ بھی ماضی کا حصہ بن جاہیں گے ۔
اسلام آباد میں طلبا اور سول سوسائٹی کی تنظیموں نے احتجاجی مظاہرے کئے ۔
سابق جج کاظم علی ملک پر مشتمل تحقیقاتی ٹریبونل نے چھیاسٹھ افراد کے بیانات قلمبند کئے ۔
ٹرریبونل کے سامنے سابق ڈی پی او سیالکوٹ وقار چوہان ، سابق ایس پی انوسٹی گیشن افضل ورک ، علامہ اقبال میڈیکل کالج کے آفیسرز ، ڈسٹرکٹ ایمرجنسی اور ریسکیو ڈبل ون ڈبل ٹو کے اسٹیشن انچارج اور دس اہلکاروں کے علاوہ موضع بٹر سے تعلق رکھنے والے آٹھ مردوں اور چالیس خواتین نے اپنے بیانات قلمبند کرائے ۔
دونوں بھائیوں کے قتل کے الزام میں پولیس اہلکاروں سمیت سولہ افراد کو گرفتار کرلیا گیا ہے ۔
نیشنل پریس کلب اسلام آباد کے باہر پاکستان انٹرنیشنل ہیومن رائٹس اور انجمن طلباء اسلام کے کارکنوں نے سیالکوٹ کے پرتشدد واقعے کے خلاف احتجاجی مظاہرہ کیا ۔
مظاہرین نے حکومت سے مطالبہ کیا کہ واقعہ میں ملوث ریسکیو ڈبل ون ڈبل ٹو اور موقع پر موجود پولیس اہلکاروں کے خلاف بھی قتل کا مقدمہ درج کیا جائے اور انہیں قرار واقعی سزا دی جائے ۔
موقع پر موجود ہر تماشائی کو بھی سزا ملنی چاہیے ۔
آپ کے خیال میں یہ سب کچھ ممکن ہو گا یا پھر وہی ہم اور ہمارا قانون ۔
جس سے نا پہلے کبھی انصاف کی توقع تھی اور نا ہی اب ہے ۔
ایسے میں ہم کیا کر سکتے ہیں ۔
سوچنا ہے ضرور ۔

Removing punctuation makes the string messy

How to reproduce the behavior

Removing punctuation adds extra spaces to the string, which makes it messy, for example:

from urduhack.preprocess import remove_punctuation
text = "وسائل کی*!? کوئی کمی نہیں ﮨﮯ"
removed = remove_punctuation(text)
print(removed)
وسائل کی    کوئی کمی نہیں ﮨﮯ

To remove the extra spaces, I need an additional call to normalize_whitespace, which makes this a very slow operation for NLP.
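A possible workaround is to do both steps in one helper of your own. The sketch below is not Urduhack's remove_punctuation: it treats every character in a Unicode punctuation category (P*) as punctuation, which may differ from the library's list, and the function name is hypothetical.

```python
import unicodedata


def strip_punctuation(text):
    """Drop Unicode punctuation and collapse the resulting whitespace."""
    spaced = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # split() with no arguments discards runs of whitespace
    return " ".join(spaced.split())


cleaned = strip_punctuation("وسائل کی*!? کوئی کمی نہیں ﮨﮯ")
print(cleaned)
```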

Your Environment

Operating System: macOS
Python Version Used: 3.7
Urduhack Version Used: latest
Environment Information: latest pip

this does not support romanization

Feature description

We found that this tool does not support romanization.

Some Important words in Stopwords

How to reproduce the behaviour

Just check the stop words file.

The word آخری ("last") is also included. Consider this sentence:
کل میرا آخری پیپر ہے۔ ("Tomorrow is my last exam.")
If آخری is removed from it, the whole meaning changes. Many other such words are also included in the stop words.

Normalization and word tokenization have issues.

I tried to run the normalization example given in the documentation, and the results do not match.

normalize("پی ایس ایل میں 69 مقامی اور کرس گیل، ڈیرن سیمی، کیون پیٹرسن اور شین واٹسن سمیت29 غیر ملکی کھلاڑی شامل ہیں۔")
'پی ایس ایل میں 69 مقامی اور کرس گیل، ڈیرن سیمی، کیون پیٹرسن اور شین واٹسن سمیت29 غیر ملکی کھلاڑی شامل ہیں۔'

The token سمیت29 is not normalized.
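Assuming the expected result is a space between the word and the number (the report does not spell this out), the step can be approximated with a regular expression. This sketch is not Urduhack's implementation; it covers only ASCII digits and the basic Arabic block U+0600 to U+06FF.

```python
import re

# Zero-width boundary between an Arabic-script character and an ASCII digit,
# in either order
_BOUNDARY = re.compile(
    r"(?<=[\u0600-\u06FF])(?=[0-9])|(?<=[0-9])(?=[\u0600-\u06FF])"
)


def space_digits(text):
    """Insert a space wherever Urdu text touches an ASCII number."""
    return _BOUNDARY.sub(" ", text)


word = "\u0633\u0645\u06CC\u062A"  # سمیت
spaced = space_digits(word + "29")
print(spaced)
```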

Similarly, I used the word tokenizer and the results are not very good.

word_tokenizer("پی سی بی چیئرمین کے مطابق نوجوان کھلاڑیوں کو انٹرنیشنل کھلاڑیوں کے ساتھ کھیلنے سے فائدہ ہوگا۔")
['پی', 'سی', 'بی', 'چیئر', 'مین', 'کے', 'مطابق', 'نوجو', 'ان', 'کھلاڑیوں', 'کو', 'انٹرنیشنل', 'کھلاڑیوں', 'کے', 'ساتھ', 'کھیلنے', 'سے', 'فائدہ', 'ہو', 'گا۔']

چیئرمین and نوجوان are each broken into multiple tokens.

Your Environment

Operating System: ubuntu 20
Python Version Used: 3.8
Urduhack Version Used: latest
Environment Information:

absl-py==0.12.0
astunparse==1.6.3
attrs==21.2.0
beautifulsoup4==4.9.3
bs4==0.0.1
cachetools==4.2.2
certifi==2021.5.30
charset-normalizer==2.0.4
clang==5.0
click==7.1.2
dill==0.3.4
flatbuffers==1.12
future==0.18.2
gast==0.4.0
google-auth==1.34.0
google-auth-oauthlib==0.4.5
google-pasta==0.2.0
googleapis-common-protos==1.53.0
grpcio==1.39.0
h5py==3.1.0
idna==3.2
keras==2.6.0
Keras-Preprocessing==1.1.2
Markdown==3.3.4
numpy==1.19.5
oauthlib==3.1.1
opt-einsum==3.3.0
promise==2.3
protobuf==3.17.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
regex==2021.8.3
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
six==1.15.0
soupsieve==2.2.1
tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.6.0
tensorflow-addons==0.13.0
tensorflow-datasets==3.2.1
tensorflow-estimator==2.6.0
tensorflow-gpu==2.6.0
tensorflow-metadata==1.2.0
termcolor==1.1.0
tf2crf==0.1.32
tqdm==4.62.1
typeguard==2.12.1
typing-extensions==3.7.4.3
urduhack==1.1.1
urllib3==1.26.6
Werkzeug==2.0.1
wrapt==1.12.1

Tokenization issues.

How to reproduce the behavior

Why do we have to download a tokenization model before we can tokenize sentences?
Punctuation is not being tokenized properly.

In NLP, punctuation marks are treated as separate tokens when text is tokenized, for example:

import spacy

nlp = spacy.blank("en")
sentence = nlp("This is a sentence with some ?!, punctuations.")
for word in sentence:
    print(word)

This
is
a
sentence
with
some
?
!
,
punctuations
.

import nltk
sentence = """This is a sentence with some ?!, punctuations."""
tokens = nltk.word_tokenize(sentence)
print(tokens)
['This', 'is', 'a', 'sentence', 'with', 'some', '?', '!', ',', 'punctuations', '.']

Urduhack does not handle punctuation.

from urduhack.tokenization import sentence_tokenizer, word_tokenizer
text = """عراق اور شام نے اعلان کیا ہےدونوں ممالک جلد اپنے اپنے سفیروں کو واپس 'بغداد' اور دمشق بھیج، دیں گے’۔!؟"""
words = word_tokenizer(text)
print(words)
['عراق', 'اور', 'شام', 'نے', 'اعلان', 'کیا', 'ہے', 'دونوں', 'ممالک', 'جلد', 'اپنے', 'اپنے', 'سفیروں', 'کو', 'واپس', 'بغداد', 'اور', 'دمشق', 'بھی', 'ج،', 'دیں', 'گے۔!؟']

Some words are split: بھیج becomes two tokens, بھی and ج،.

گے is merged with the punctuation that follows it.
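For comparison, a rule-based split that keeps each punctuation mark as its own token can be written with one regular expression. This is a sketch only, not how Urduhack tokenizes: it separates punctuation but does nothing to fix missing spaces between Urdu words, and combining marks (diacritics) are not matched by \w, so they would be emitted as separate tokens.

```python
import re


def split_words_and_punctuation(text):
    """Return words and individual punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = split_words_and_punctuation(
    "This is a sentence with some ?!, punctuations."
)
print(tokens)
```

On the English sentence used above, this yields the same token list as the NLTK example.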

Your Environment

Operating System: macOS
Python Version Used: 3.7
Urduhack Version Used: latest
Environment Information:

BERT ?!

BERT Usage

Salaam team,
Great work first of all 🥇
Can we consider adding the BERT model to the list of models ?

Tensorflow and Tensorflow-gpu are both being installed

Does installation require both TensorFlow and TensorFlow-GPU?

pip install urduhack[tf-gpu]

It seems both are required to use urduhack.

Your Environment

Operating System: Ubuntu 18
Python Version Used: 3.6
Urduhack Version Used: latest
Environment Information:

absl-py==0.13.0
astunparse==1.6.3
beautifulsoup4==4.9.3
bs4==0.0.1
cachetools==4.2.2
certifi==2021.5.30
charset-normalizer==2.0.4
dataclasses==0.8
gast==0.3.3
google-auth==1.34.0
google-auth-oauthlib==0.4.5
google-pasta==0.2.0
grpcio==1.39.0
h5py==2.10.0
idna==3.2
importlib-metadata==4.6.4
Keras-Preprocessing==1.1.2
Markdown==3.3.4
numpy==1.18.5
oauthlib==3.1.1
opt-einsum==3.3.0
Pillow==8.3.1
protobuf==3.17.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
scipy==1.4.1
six==1.16.0
soupsieve==2.2.1
tensorboard==2.6.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow-estimator==2.3.0
tensorflow-gpu==2.3.0
termcolor==1.1.0
typing-extensions==3.10.0.0
urllib3==1.26.6
Werkzeug==2.0.1
wrapt==1.12.1
zipp==3.5.0

pip install broken

How to reproduce the behaviour

Installing from the wheel (from the PyPI page) is broken.
[Every step below was tried in two different conda environments, with Python 3.5.5 and 3.6 separately, with and without TensorFlow installed.]
$ pip install urduhack-0.3.1-py3-none-any.whl
Processing ./urduhack-0.3.1-py3-none-any.whl
ERROR: Package 'urduhack' requires a different Python: 3.5.5 not in '>= 3.6'
Installing from pip directly also fails:
$ pip install urduhack
ERROR: Could not find a version that satisfies the requirement urduhack (from versions: none)
ERROR: No matching distribution found for urduhack
Cloning this repository and then installing from inside the repo fails as well:
$ pip install -e .
Obtaining ~/urduhack
ERROR: Package 'urduhack' requires a different Python: 3.5.5 not in '>= 3.6'

Your Environment

Operating System: ubuntu 18.04
Python Version Used: tried both 3.5.5 and 3.6
Urduhack Version Used:0.3.1
Environment Information: conda

ValueError: Layer #3 (named "crf_2" in the current model) was found to correspond to layer crf in the save file. However the new layer crf_2 expects 5 weights, but the saved weights have 1 elements.

Download all datasets from s3.

Download all datasets from s3 storage.

Preprocessing module for replacing values with tag

Feature description

I've gone through the preprocessing module, and I'm confused about why there are several separate functions that each replace a value with a *TAG* token.

For example, replacing a URL with *URL*:

from urduhack.preprocess import replace_urls
text = "20 www.gmail.com  فیصد"
replace_urls(text)
'20 *URL*  فیصد'

There are multiple functions that replace matched text with a *TAG*:

replace_urls
replace_emails
replace_numbers
replace_phone_numbers
replace_currency_symbols
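One way the functions listed above could share a single code path is a generic helper that takes the pattern and the tag as arguments. This is a hypothetical sketch, not Urduhack's implementation; the name replace_with_tag and the simplified URL pattern are both assumptions made for illustration.

```python
import re

# Simplified illustrative pattern; real URL matching is more involved
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def replace_with_tag(text, pattern, tag):
    """Replace every match of `pattern` in `text` with the literal `tag`."""
    # A function replacement keeps `tag` literal (no backslash processing)
    return pattern.sub(lambda _match: tag, text)


tagged = replace_with_tag("20 www.gmail.com  فیصد", URL_PATTERN, "*URL*")
print(tagged)
```

Each specific function would then reduce to a call with its own pattern and tag.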

Addition in URDU_NEWLINE_WORDS

Related file: urduhack/tokenization/eos.py
The word کیجئے, which is present in URDU_NEWLINE_WORDS, is also spelled
کیجیے
so it is proposed that this spelling be added to URDU_NEWLINE_WORDS as well.

Collect top 50 positive and negative words

Issue in Installation

When I try to download Urduhack, I receive a lot of errors.
My Python version is 2.8

Kaggle Task

Addition to URDU_CONJUNCTIONS

Test text:
بات یہاں تک پہنچنی تھی تو اتنی بڑھک نہ مارتے۔ کتنے لوگ پنجاب میں ہوں گے جو کہیں گے کہ پنجاب حکومت ٹھیک ہاتھوں میں ہے اور اِس کی کارکردگی تسلی بخش ہے؟ خاک میں کیا صورتیں ہوں گی کہ پنہاں ہو گئیں۔

Urduhack sentence tokenization output:
بات یہاں تک پہنچنی تھی۔
تو اتنی بڑھک نہ مارتے۔
کتنے لوگ پنجاب میں ہوں۔
گے جو کہیں گے کہ پنجاب حکومت ٹھیک ہاتھوں میں ہے اور اس کی کارکردگی تسلی بخش ہے؟
خاک میں کیا صورتیں ہوں۔
گی کہ پنہاں ہو گئیں۔

This problem is also related to lines 7 and 8 of the following file:
urduhack/urduhack/tokenization/eos.py

URDU_CONJUNCTIONS = ['جنہیں','جس','جن','جو','اور', 'اگر', 'اگرچہ', 'لیکن', 'مگر', 'پر', 'یا', 'تاہم', 'کہ', 'کر']
URDU_NEWLINE_WORDS = ['کیجئے', 'گئیں', 'تھیں', 'ہوں', 'خریدا', 'گے', 'ہونگے', 'گا', 'چاہیے', 'ہوئیں', 'گی','تھا', 'تھی', 'تھے', 'ہیں', 'ہے',]
Solution to this problem:
Add
'تو', 'گے', 'گی'
to URDU_CONJUNCTIONS.

I am facing this issue of "word2idx.json" not found.

How to reproduce the problem

I installed Urduhack with

pip install Urduhack[tf]

as described in the Urduhack docs, and then ran the following code:

import urduhack

# Downloading models
urduhack.download()

nlp = urduhack.Pipeline()
text = """
گزشتہ ایک روز کے دوران کورونا کے سبب 118 اموات ہوئیں جس کے بعد اموات کا مجموعہ 3 ہزار 93 ہوگیا ہے۔
سب سے زیادہ اموات بھی پنجاب میں ہوئی ہیں جہاں ایک ہزار 202 افراد جان کی بازی ہار چکے ہیں۔
سندھ میں 916، خیبر پختونخوا میں 755، اسلام آباد میں 94، گلگت بلتستان میں 18، بلوچستان میں 93 اور آزاد کشمیر میں 15 افراد کورونا وائرس سے جاں بحق ہو چکے ہیں۔
"""
doc = nlp(text)

for sentence in doc.sentences:
    print(sentence.text)
    for word in sentence.words:
        print(f"{word.text}\t{word.pos}")

    for token in sentence.tokens:
        print(f"{token.text}\t{token.ner}")

This was the error that I got.

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Administrator.DESKTOP-2V0A5JF/.urduhack/models/tagger/pos/word2idx.json'

Your Environment

Operating System: Windows 10
Python Version Used: 3.8
Urduhack Version Used: 1.1.1
Environment Information: anaconda base environment

Normalization: Removal of tatweel/kashida character

The tatweel (kashida) character is used to stretch a character within a word.
It should be removed from the text during normalization.

Tatweel character: ـ
Unicode value: U+0640

Sample code:
from urduhack.normalization import normalize
text = "کــــــاتب حضــرات"
normalized_text = normalize(text)
#Normalized text
print(normalized_text)
desired output:
کاتب حضرات
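Until normalize handles this, the character can be stripped with a plain string replacement. This is a minimal sketch and the helper name remove_tatweel is hypothetical.

```python
TATWEEL = "\u0640"  # ـ ARABIC TATWEEL


def remove_tatweel(text):
    """Delete every tatweel (kashida) character from the text."""
    return text.replace(TATWEEL, "")


# Rebuild the sample from the issue with explicit tatweel characters
stretched = "ک" + TATWEEL * 6 + "اتب حض" + TATWEEL * 2 + "رات"
plain = remove_tatweel(stretched)
print(plain)
```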

Tf vs Tf-gpu installation issue.

How to reproduce the behaviour

Installing Urduhack with pip removes the GPU version of TensorFlow. We need to fix this.

Your Environment

Operating System: ubuntu 18
Python Version Used: 3.6
Urduhack Version Used: 0.3.2
Environment Information:
