
urdu's Introduction

Summary Dataset

This is a summarization dataset. You can use it to train an abstractive summarization model. It contains 3 files (train, test, and val) in JSONL format.

Every line is a JSON object with these keys:

id
url
title
summary
text

You can easily read the data with pandas:

import pandas as pd
test = pd.read_json("summary/urdu_test.jsonl", lines=True)
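If pandas is not available, the same file can be read line by line with the standard library. The sketch below is only an illustration: the function name `read_jsonl` is hypothetical, and UTF-8 encoding is an assumption that the README does not state for the summary files. It also checks that each record carries the five keys listed above.

```python
import json

# Keys listed in this README for every line of the summary files.
EXPECTED_KEYS = {"id", "url", "title", "summary", "text"}


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = EXPECTED_KEYS - record.keys()
            if missing:
                raise ValueError(
                    f"line {line_number}: missing keys {sorted(missing)}"
                )
            yield record
```

For example, `records = list(read_jsonl("summary/urdu_test.jsonl"))` loads the test split into a list of dicts.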

POS dataset

Urdu dataset for POS training. This is a small dataset that can be used to train a part-of-speech tagger for the Urdu language. The structure of the dataset is simple, with one word and its tag per line:

word TAG
word TAG
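A minimal sketch of parsing this layout, assuming the word and the tag are separated by whitespace (a space or a tab) as the format above suggests. The function name `parse_tagged_lines` is hypothetical.

```python
def parse_tagged_lines(lines):
    """Turn 'word TAG' lines into (word, tag) pairs, skipping blank lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace so the tag is the final field.
        # A line with a single field raises ValueError here.
        word, tag = line.rsplit(None, 1)
        pairs.append((word, tag))
    return pairs


# Illustrative input only; the tags here are placeholders, not real data.
sample = ["کتاب TAG_A", "", "ہے TAG_B"]
pairs = parse_tagged_lines(sample)
```

To parse a file, pass the open file handle as `lines`, since iterating over a handle yields one line at a time.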

The tagset used to build the dataset is taken from Sajjad's Tagset.

NER Datasets

The following datasets can be used for NER tasks.

UNER Dataset

Happy to announce that the UNER (Urdu Named Entity Recognition) dataset is available for NLP apps. The following NER tags are used in the dataset:

PERSON
LOCATION
ORGANIZATION
DATE
NUMBER
DESIGNATION
TIME

To read more about the dataset, see the paper Urdu NER. The NER dataset is encoded in UTF-16.
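Because the file is UTF-16, opening it with the default encoding usually fails or produces garbage. The sketch below shows one way to read it and, optionally, re-encode it as UTF-8 for tools that expect that. The function names are hypothetical.

```python
def read_utf16_lines(path):
    """Return the lines of a UTF-16 text file without trailing newlines."""
    with open(path, encoding="utf-16") as handle:
        return [line.rstrip("\n") for line in handle]


def convert_to_utf8(source_path, target_path):
    """Copy a UTF-16 text file to a new UTF-8 encoded file."""
    with open(source_path, encoding="utf-16") as source, \
            open(target_path, "w", encoding="utf-8") as target:
        for line in source:
            target.write(line)
```

For example, `convert_to_utf8("ner/uner.txt", "ner/uner_utf8.txt")` writes a UTF-8 copy; the output file name is only an illustration.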

MK-PUCIT Dataset

The latest dataset for Urdu NER is available. For more information, see the paper MK-PUCIT.

Entities used in the dataset are:

Other
Organization
Person
Location

The MK-PUCIT author also provided a Dropbox link to download the data: Dropbox

IJNLP 2008 dataset

The IJNLP dataset has the following NER tags.

O
LOCATION
PERSON
TIME
ORGANIZATION
NUMBER
DESIGNATION

Jahangir dataset

The Jahangir dataset has the following NER tags.

O
PERSON
LOCATION
ORGANIZATION
DATE
TIME

Datasets for Sentiment Analysis

IMDB Urdu Movie Review Dataset

This dataset is taken from IMDB Urdu. It was translated using Google Translate. It has only two labels:

positive
negative

Roman Dataset

This dataset can be used for sentiment analysis of Roman Urdu. It has 3 classes:

Neutral
Positive
Negative

For more information about this dataset, check out the link Roman Urdu Dataset.

Products & Services dataset

This dataset was collected from different sources, such as social media and the web, and covers various products and services for sentiment analysis. It contains 3 classes:

pos
neg
neu
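The sentiment datasets above spell the same classes differently (`positive`/`negative`, `Neutral`/`Positive`/`Negative`, `pos`/`neg`/`neu`). If you want to combine them, one option is to map every spelling onto a single label set first. This is only a sketch: the canonical names and the function `normalize_label` are hypothetical choices, and only spellings whose meaning is clear from this README are included.

```python
# Spellings taken from the dataset descriptions above, in lower case.
LABEL_MAP = {
    "positive": "positive",
    "pos": "positive",
    "negative": "negative",
    "neg": "negative",
    "neutral": "neutral",
    "neu": "neutral",
}


def normalize_label(label):
    """Map a dataset-specific sentiment label onto one canonical spelling."""
    key = label.strip().lower()
    if key not in LABEL_MAP:
        # Fail loudly rather than guess what an unknown label means.
        raise ValueError(f"unknown sentiment label: {label!r}")
    return LABEL_MAP[key]
```

Raising on unknown labels keeps ambiguous codes from being silently merged into the wrong class.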

Daraz Products dataset

This dataset consists of reviews taken from Daraz. You can use it for sentiment analysis as well as spam or ham classification. It contains the following columns:

Product_ID
Date
Rating
Spam(1) and Not Spam(0)
Reviews
Sentiment
Features

The dataset is taken from Kaggle (daraz).
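As a rough illustration of the spam/ham use case, the sketch below keeps only the reviews flagged as not spam. It assumes the flag column is named exactly `Spam(1) and Not Spam(0)` as listed above and stores the text values `0` and `1`; check the actual file before relying on this. The sample rows are made up for the example.

```python
import csv
import io

SPAM_COLUMN = "Spam(1) and Not Spam(0)"


def drop_spam(rows):
    """Keep only rows whose spam flag is 0 (not spam)."""
    return [row for row in rows if row[SPAM_COLUMN].strip() == "0"]


# Made-up rows that follow the column list in this README.
sample = io.StringIO(
    "Product_ID,Date,Rating,Spam(1) and Not Spam(0),Reviews,Sentiment,Features\n"
    "p1,2020-01-01,5,0,review one,positive,quality\n"
    "p2,2020-01-02,1,1,review two,negative,price\n"
)
rows = list(csv.DictReader(sample))
clean = drop_spam(rows)
```

With a real file, replace `sample` with an open file handle, for example `open(path, newline="", encoding="utf-8")`.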

Urdu Dataset

Here is a small dataset for sentiment analysis. It has the following class labels:

P
N
O

Link to the paper: Paper
GitHub link to the data: Urdu Corpus V1

News Datasets

Urdu News Dataset 1M

This dataset (news/urdu-news-dataset-1M.tar.xz) is taken from Urdu News Dataset 1M. It has 4 classes and can be used for classification and other NLP tasks. I have removed unnecessary columns.

Business & Economics
Entertainment
Science & Technology
Sports
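The news datasets ship as compressed tar archives (`.tar.xz` and `.tar.gz`). A minimal sketch for unpacking them from Python with the standard library follows. The function name `extract_archive` is hypothetical, and it should only be used on archives you trust, since extraction writes whatever paths the archive contains.

```python
import tarfile


def extract_archive(archive_path, destination):
    """Extract a compressed tar archive and return the member names."""
    # "r:*" lets tarfile detect the compression (xz, gz, bz2) by itself.
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
        archive.extractall(destination)
    return names
```

For example, `extract_archive("news/urdu-news-dataset-1M.tar.xz", "news/")` unpacks the 1M news dataset next to the archive.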

Real-Fake News

This dataset (news/real_fake_news.tar.gz) is used for classifying real and fake news; see Fake News Dataset. It contains news from the following domains:

Technology 
Education 
Business
Sports
Politics
Entertainment

News Headlines

The headlines dataset (news/headlines.csv.tar.gz) is taken from Urdu News Headlines. The original dataset is in Excel format; I've converted it to CSV for experiments. It can be used for clustering and classification.

RAW corpus and models

COUNTER (COrpus of Urdu News TExt Reuse) Dataset

This dataset is collected from journalism and can be used for Urdu NLP research. For more information, see the COUNTER resource.

QA datasets

I have added two QA datasets for anyone who wants to build a QA-based chatbot.

QA (Ahadis): qa_ahadis.csv contains QA pairs for Ahadis.

QA (general knowledge): qa_gk.csv contains general knowledge QA pairs.

Urdu model for spaCy

An Urdu model for spaCy is available now. You can use it to build NLP apps easily. Install the package in your working environment:

pip install ur_model-0.0.0.tar.gz

You can use it with the following code:

import spacy
nlp = spacy.load("ur_model")
doc = nlp("میں خوش ہوں کے اردو ماڈل دستیاب ہے۔ ")

NLP Tutorials for Urdu

Check out my articles on Urdu NLP tasks. They are available on UrduNLP.

Some Helpful Tips

Download Single file from GitHub

If you only want the raw file (text or code), use curl with the file's raw URL. A blob URL returns the GitHub HTML page instead of the file itself.

curl -LJO https://github.com/mirfan899/Urdu/raw/master/ner/uner.txt
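If you start from a `blob` link copied from the browser, note that it serves the GitHub HTML page, while the file contents are served from `raw.githubusercontent.com`. The sketch below rewrites one into the other. The function name `to_raw_url` is hypothetical, and it assumes the branch name contains no `/`.

```python
from urllib.parse import urlsplit


def to_raw_url(blob_url):
    """Convert a github.com blob URL into its raw.githubusercontent.com form."""
    parts = urlsplit(blob_url)
    segments = parts.path.strip("/").split("/")
    # Expected path shape: <user>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    user, repo, _, branch, *path = segments
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, branch, *path])


raw_url = to_raw_url("https://github.com/mirfan899/Urdu/blob/master/ner/uner.txt")
```

The resulting URL can be passed to curl or to `urllib.request.urlretrieve` to download the file itself.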

Concatenate files

cd data
cat */*.txt > file_name.txt
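On systems without `cat` (for example, plain Windows shells), the same concatenation can be sketched in Python. The function name `concatenate_text_files` is hypothetical. It copies bytes unchanged, as `cat` does, and processes files in sorted path order, which may differ slightly from the shell's glob ordering.

```python
from pathlib import Path


def concatenate_text_files(root, output_path):
    """Append every */*.txt file under root to output_path, in sorted order."""
    root = Path(root)
    output_path = Path(output_path)
    # Collect sources first so the output file can never be one of its inputs.
    sources = sorted(
        path for path in root.glob("*/*.txt")
        if path.resolve() != output_path.resolve()
    )
    with open(output_path, "wb") as out:
        for source in sources:
            out.write(source.read_bytes())
    return sources
```

For example, `concatenate_text_files("data", "data/file_name.txt")` mirrors the two shell commands above.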

MK-PUCIT

Concatenate the MK-PUCIT files into a single file:

cat */*.txt > file_name.txt

The original dataset has a bug: it contains both Others and Other, which are the same entity. If you use the dataset from the Dropbox link, clean it with the following code.

import pandas as pd

# Each line is "word<TAB>tag"; names must be an ordered list, not a set.
data = pd.read_csv("ner/mk-pucit.txt", sep="\t", names=["word", "tag"])
# Merge the duplicate label "Others" into "Other".
data["tag"] = data["tag"].replace({"Others": "Other"})
# Save as csv or txt, as needed, by changing the extension.
data.to_csv("ner/mk-pucit.txt", index=False, header=False, sep="\t")

Now the csv/txt file has the format

word tag

Note

If you have a dataset (link) and want to contribute, feel free to create a PR.


urdu's Issues

Urdu Trained model with spacy

I have successfully downloaded the trained Urdu model from here (https://github.com/mirfan899/Urdu/tree/bc114033c8a58b3fa60538c90c45844c4e73a95f/spacy) and installed it with pip.
However, when I run the following snippet of code:
import spacy
nlp = spacy.load("ur_model")
doc = nlp("میں خوش ہوں کے اردو ماڈل دستیاب ہے۔ ")

I get the following error. Please let me know how to fix it.
PS D:\Jamia Usmania\Project> & "C:/Users/Dr. Atif Khan/AppData/Local/Programs/Python/Python39/python.exe" "d:/Jamia Usmania/Project/test22.py"
2021-07-26 16:07:07.486218: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-07-26 16:07:07.486523: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "d:\Jamia Usmania\Project\test22.py", line 3, in <module>
nlp = spacy.load("ur_model")
File "C:\Users\Dr. Atif Khan\AppData\Local\Programs\Python\Python39\lib\site-packages\spacy\__init__.py", line 51, in load
return util.load_model(
File "C:\Users\Dr. Atif Khan\AppData\Local\Programs\Python\Python39\lib\site-packages\spacy\util.py", line 324, in load_model
return load_model_from_package(name, **kwargs)
File "C:\Users\Dr. Atif Khan\AppData\Local\Programs\Python\Python39\lib\site-packages\spacy\util.py", line 357, in load_model_from_package
return cls.load(vocab=vocab, disable=disable, exclude=exclude, config=config)
File "C:\Users\Dr. Atif Khan\AppData\Local\Programs\Python\Python39\lib\site-packages\ur_model\__init__.py", line 12, in load
return load_model_from_init_py(__file__, **overrides)
File "C:\Users\Dr. Atif Khan\AppData\Local\Programs\Python\Python39\lib\site-packages\spacy\util.py", line 517, in load_model_from_init_py
return load_model_from_path(
File "C:\Users\Dr. Atif Khan\AppData\Local\Programs\Python\Python39\lib\site-packages\spacy\util.py", line 391, in load_model_from_path
config = load_config(config_path, overrides=overrides)
File "C:\Users\Dr. Atif Khan\AppData\Local\Programs\Python\Python39\lib\site-packages\spacy\util.py", line 548, in load_config
raise IOError(Errors.E053.format(path=config_path, name="config.cfg"))
OSError: [E053] Could not read config.cfg from C:\Users\Dr. Atif Khan\AppData\Local\Programs\Python\Python39\lib\site-packages\ur_model\ur_model-0.0.0\config.cfg
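One possible explanation, offered as an assumption rather than a confirmed diagnosis: spaCy v3 loads pipelines from a `config.cfg` file, while packages built with spaCy v2 do not contain one, so a v2 model installed next to spaCy v3 can fail in this way. The sketch below checks the installed model directory for `config.cfg` and reads the `spacy_version` field of `meta.json` if it is present. The function name `inspect_model_dir` is hypothetical.

```python
import json
from pathlib import Path


def inspect_model_dir(model_dir):
    """Report whether config.cfg exists and which spaCy version meta.json names."""
    model_dir = Path(model_dir)
    meta_path = model_dir / "meta.json"
    spacy_version = None
    if meta_path.is_file():
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        # meta.json records the spaCy version range the model was built for.
        spacy_version = meta.get("spacy_version")
    return {
        "has_config": (model_dir / "config.cfg").is_file(),
        "spacy_version": spacy_version,
    }
```

Calling it on the directory named in the error message (the folder that should contain `config.cfg`) shows whether the file is really missing and which spaCy version the model expects. If the versions do not match, installing a spaCy release in the reported range is one thing to try.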
