Roman Urdu Dataset

This is an extensive compilation of Roman Urdu datasets (Urdu written in Latin/Roman script), along with other helpful Roman Urdu NLP resources.

Motivation

One of the biggest hurdles when working with NLP in Urdu is the lack of a comprehensive dataset in the domain, unlike other languages. The resources in this repository are the culmination of an effort to fill that gap. The data is organized to help other researchers and hobbyists and to promote research. Application code for Roman Urdu NLP can be found here


Description

Dataset

The dataset consists of sentences gathered from reviews on various e-commerce websites, comments on public Facebook pages, and Twitter accounts. Each row ideally consists of a single sentence with a corresponding sentiment attached to it: Negative, Positive, or Neutral. There are more than 20,000 sentences, all manually tagged.
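
As a minimal sketch of how such rows might be tallied, the snippet below counts sentences per sentiment label. It assumes a two-column, comma-separated layout (sentence, then label), which this README does not confirm; the sample rows are made-up placeholders, and `count_sentiments` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# The three labels named in the description above.
VALID_LABELS = {"Positive", "Negative", "Neutral"}

def count_sentiments(rows):
    """Count rows per sentiment label, skipping rows that do not fit the ideal shape."""
    counts = Counter()
    for row in rows:
        if len(row) < 2:
            continue  # no label column on this row
        label = row[1].strip().capitalize()  # tolerate case differences
        if label in VALID_LABELS:
            counts[label] += 1
    return counts

# Made-up placeholder rows for illustration only; not taken from the dataset.
# ASSUMPTION: column order is (sentence, label) and the delimiter is a comma.
sample = io.StringIO(
    "sentence one,Positive\n"
    "sentence two,negative\n"
    "sentence three,Neutral\n"
    "broken row\n"
)
counts = count_sentiments(csv.reader(sample))
print(counts)
```

Because each row only "ideally" holds one sentence and one label, the sketch skips rows it cannot interpret instead of raising an error. Replace the in-memory sample with the real file once its actual layout is known.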

Dictionary

It contains the English meanings of Roman Urdu words.

Conversion

It consists of words mapped between languages in the form:

<English> : <Urdu> : <Roman-Urdu>
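
A line in this form could be split into its three fields roughly as follows. This is a sketch under assumptions: the separator is taken to be a colon with optional surrounding spaces, no field contains a colon itself, and `parse_conversion_line` is a hypothetical helper, not part of this repository. The sample line is illustrative.

```python
def parse_conversion_line(line):
    """Split '<English> : <Urdu> : <Roman-Urdu>' into a dict, or return None if malformed."""
    parts = [p.strip() for p in line.split(":")]
    # Expect exactly three non-empty fields.
    if len(parts) != 3 or not all(parts):
        return None
    english, urdu, roman_urdu = parts
    return {"english": english, "urdu": urdu, "roman_urdu": roman_urdu}

# Illustrative line written for this example, not copied from the file.
entry = parse_conversion_line("water : پانی : pani")
print(entry)
```

Returning `None` for malformed lines lets a caller filter them out while reading a whole file, for example with a list comprehension that keeps only non-`None` results.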

Negative-and-Positive-Words

It contains negative and positive words for sentiment analysis, in the form:

<English> : <Roman-Urdu> : <POS-tag>
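
One way such word lists could feed a simple sentiment baseline is a lexicon count, sketched below. It assumes the positive and negative words are available as separate collections of lines, that the Roman Urdu word is the second colon-separated field, and that whitespace tokenization is good enough; none of this is confirmed by the README. The helper names, sample entries, and the `ADJ` tag string are placeholders.

```python
def load_words(lines):
    """Collect the Roman Urdu field from '<English> : <Roman-Urdu> : <POS-tag>' lines."""
    words = set()
    for line in lines:
        parts = [p.strip() for p in line.split(":")]
        if len(parts) == 3 and parts[1]:
            words.add(parts[1].lower())
    return words

def lexicon_score(sentence, positive, negative):
    """Return (number of positive tokens) minus (number of negative tokens)."""
    tokens = sentence.lower().split()
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

# Illustrative entries only; the POS tag strings are placeholders.
positive = load_words(["good : acha : ADJ"])
negative = load_words(["bad : bura : ADJ"])
score = lexicon_score("yeh phone bohat acha hai", positive, negative)
print(score)
```

A positive score suggests positive sentiment and a negative score the opposite. Roman Urdu spelling varies widely (for example "acha" versus "achha"), so exact matching like this will miss many words and is only a starting point.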

Urdu-Names

This contains a list of common Urdu (Pakistani) names.


Credits

The main dataset compilation for Roman Urdu NLP was done as part of a research effort by Zareen Sharf. More info here: https://archive.ics.uci.edu/ml/datasets/Roman+Urdu+Data+Set

The dataset consists of data gathered from reviews on various e-commerce websites, comments on public Facebook pages, and Twitter accounts. Other data has been gathered from Wikipedia and various online resources.

Contributing

To encourage more people to work actively in this domain, if you build on these resources, please try to contribute to, extend, or optimize the current datasets.

If you know of a data source that is not listed here, let me know. To do so, you can open a pull request with the dataset added to the relevant file, or just open an issue mentioning the dataset and its link.

Usage

You are free to use this dataset for your own purposes. Please credit the original author, and I would appreciate it if you dropped me a message :)

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE.md file for details.
