BRBT-dataset-pickles

These are the pickled files, ready to be fed into a neural network model for sentiment analysis. They are based on ongoing thesis work by me and several co-authors. For more detail on this unpublished work, please check here (you can also use it as the citation if you use these pickle files).

###Data Statistics:

Total number of posts: 9337 (number of rows in sheet: 9338)
Bangla posts: 6698 (number of rows in sheet: 6699)
Romanized Bangla posts: 2639 (number of rows in sheet: 2640)

####Data Sources

Data were collected from various microblog sites such as Facebook, Twitter, and YouTube, as well as from online news portals and product review pages. The breakdown by source is:

From Facebook: 4621
From Twitter: 2610
From YouTube: 801
From online news portals: 1255
From product review pages: 50
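
As a quick consistency check, the counts listed above can be summed and turned into per-source shares. This is only a sketch built from the numbers in this README; the variable names are illustrative and are not part of the pickle files.

```python
# Counts copied from the statistics in this README.
source_counts = {
    "Facebook": 4621,
    "Twitter": 2610,
    "YouTube": 801,
    "Online news portals": 1255,
    "Product review pages": 50,
}
script_counts = {"Bangla": 6698, "Romanized Bangla": 2639}
TOTAL_POSTS = 9337

# Both breakdowns should add up to the stated total number of posts.
assert sum(source_counts.values()) == TOTAL_POSTS
assert sum(script_counts.values()) == TOTAL_POSTS

# Share of the dataset contributed by each source.
source_share = {name: count / TOTAL_POSTS for name, count in source_counts.items()}
for name, share in source_share.items():
    print(f"{name}: {share:.1%}")
```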

####Post collection processing

Removal of emoticons:- Emoticons and hashtags were removed so that annotators saw unbiased, text-only content when assigning one of three labels: positive, negative, or ambiguous.
Removal of proper nouns:- Proper nouns were replaced with tags to make them ambiguous. All text samples were collected from publicly available sources and do not reflect the opinions of the authors. (The original text samples have been preserved but are not publicly available because of technical issues. They may be obtained by emailing the authors directly and signing the required consent form.)
Manual validation (by native speakers):- Collected data samples were manually annotated into one of three categories: positive (1), negative (0), and ambiguous (A). Each text sample was annotated independently by two different native Bangla speakers. Each annotator validated the data without knowing the decisions made by the other.
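
The emoticon and hashtag removal described above might look roughly like the sketch below. The exact procedure used for this dataset is not documented here, so the regular expressions and the `strip_markers` helper are assumptions for illustration only. Proper-noun tagging is not covered.

```python
import re

# Assumption: a hashtag is "#" followed by any run of non-whitespace characters.
# \S is used instead of \w because Python's \w does not match Bangla vowel signs.
HASHTAG = re.compile(r"#\S+")
# Assumption: a small, illustrative set of ASCII emoticons such as :) ;-( :D
ASCII_EMOTICON = re.compile(r"(?<!\w)[:;]-?[)(DPp]+")
# Assumption: a rough range covering common emoji and pictographic symbols.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def strip_markers(text: str) -> str:
    """Remove hashtags, ASCII emoticons, and emoji, then collapse whitespace."""
    for pattern in (HASHTAG, ASCII_EMOTICON, EMOJI):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(strip_markers("khub bhalo laglo :) #best"))
```

The patterns are deliberately narrow; a real pipeline would likely need a longer emoticon list and manual spot checks.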

###Accessing the dataset

These pickled files are made from the BRBT (Bangla and Romanized Bangla) Sentiment Analysis (SA) dataset of 9337 Bangla and Romanized Bangla text samples. The dataset is unique because it also covers the previously ignored Romanized Bangla, which is Bangla written in the English alphabet. Because it is easy to type on any standard QWERTY keyboard and simple to use English as the base language for posts, Romanized Bangla is gaining popularity not only in personal messages and microblogs but also in government-sanctioned mass messages and announcements. The dataset itself (an xlsx file) is NOT publicly available at the moment, but it may be made available by contacting the authors personally via email.
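
A minimal sketch of loading a pickle file in Python follows. The internal layout of the files in this repository is not described here, so the `load_pickle` helper and the demo dictionary are hypothetical; the demo writes and reads a temporary file only to show the round trip. Only unpickle files you trust, since unpickling can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path, encoding="ASCII"):
    """Load one pickled object from disk.

    If a file was written by Python 2, passing encoding="latin1" may be
    needed; whether that applies to these files is an assumption to verify.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle, encoding=encoding)


# Hypothetical stand-in data; the real files may be structured differently.
demo = {"texts": ["khub bhalo", "ekdom baje"], "labels": [1, 0]}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pkl"
    demo_path.write_bytes(pickle.dumps(demo))
    loaded = load_pickle(demo_path)

print(type(loaded).__name__, len(loaded["texts"]))
```

After loading a real file, inspecting `type(loaded)` and, for containers, the first few items is a reasonable first step before feeding anything to a model.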
