ibrahimsharaf / doc2vec

:notebook: Long(er) text representation and classification using Doc2Vec embeddings

License: MIT License

Python 100.00%

nlp-machine-learning gensim scikit-learn doc2vec sentiment-analysis text-classification

doc2vec's Introduction

Doc2Vec Text Classification

A text classification model that uses gensim Doc2Vec to generate paragraph embeddings and scikit-learn Logistic Regression for classification.

Dataset

25,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of each review is binary (1 for positive, 0 for negative).

This source dataset was collected in association with the following publication:

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). "Learning Word Vectors for Sentiment Analysis." The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

Usage

Install the required tools

pip install -r requirements.txt
Run the script

python text_classifier.py

References

Kaggle – Bag of Words Meets Bags of Popcorn (https://www.kaggle.com/c/word2vec-nlp-tutorial)
Gensim – Deep learning with paragraph2vec (https://radimrehurek.com/gensim/models/doc2vec.html)
Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents (https://arxiv.org/pdf/1405.4053v2.pdf)

doc2vec's Issues

print precision and recall

First of all, I would like to thank you for sharing this nice project with us.
I get accuracy and F1 score when running the code.
How can I get precision and recall?
Thanks
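
For reference, precision and recall for the binary labels used here (1 for positive, 0 for negative) can be computed directly from the true and predicted labels. The sketch below is a minimal, dependency-free illustration; the function name `precision_recall` and the sample labels are made up for this example and are not part of the repository. Since the project already depends on scikit-learn, its `precision_score` and `recall_score` functions in `sklearn.metrics` would likely be the shorter route inside the script.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class label."""
    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    # False negatives: predicted negative but actually positive.
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative labels only, not results from the IMDB dataset.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print("precision=%.3f recall=%.3f" % (precision, recall))
```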

Convert script to OOP format

Convert the model code to an OOP structure, for better readability and style.

Predict sentiment of input text

The script currently only supports training and testing on an input dataset. We need to add a new function that supports prediction for a new example.

Save the trained models (doc2vec, random forest/logistic regression)
Load them when the user wants to predict the sentiment of a specific input text.

Depends on #14

Fixed value 183891

Hi, it is not clear to me why the getVecs function contains this line:
if(vecs_type == 'Test'): index = index + 183891
Is there any reason to put that value?

Add CLI to the model

Add CLI support for the following commands:

Pass a dataset to the model for training
Pass a dataset to the model for testing given a trained model path
Pass a dataset to the model which splits it for training then testing
Pass a single sentence to the model for prediction given a trained model path

Depends on #14
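
The four commands listed above could map onto `argparse` subcommands. The sketch below is one possible layout, not the project's actual interface: the subcommand names (`train`, `test`, `train-test`, `predict`) and the options `--model-path` and `--test-size` are assumptions chosen for illustration.

```python
import argparse


def build_parser():
    """Build a CLI with one subcommand per use case (all names hypothetical)."""
    parser = argparse.ArgumentParser(
        prog="text_classifier.py",
        description="Doc2Vec text classification (illustrative CLI sketch)",
    )
    sub = parser.add_subparsers(dest="command", required=True)

    # Train on a dataset.
    train = sub.add_parser("train", help="train a model on a dataset")
    train.add_argument("dataset", help="path to the training dataset")

    # Test a dataset against a previously trained model.
    test = sub.add_parser("test", help="test a trained model on a dataset")
    test.add_argument("dataset", help="path to the test dataset")
    test.add_argument("--model-path", required=True, help="path to the trained model")

    # Split one dataset, then train and test.
    both = sub.add_parser("train-test", help="split a dataset, then train and test")
    both.add_argument("dataset", help="path to the dataset to split")
    both.add_argument("--test-size", type=float, default=0.2,
                      help="fraction of the dataset held out for testing")

    # Predict the class of a single sentence.
    predict = sub.add_parser("predict", help="predict the class of one sentence")
    predict.add_argument("sentence", help="text to classify")
    predict.add_argument("--model-path", required=True, help="path to the trained model")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this layout, a prediction call would look like `python text_classifier.py predict "some review text" --model-path <path>`, where the path is whatever location the saved model ends up in.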

ImportError: cannot import name ABC

[root@ip-172-31-39-31 doc2vec]# python text_classifier.py
Traceback (most recent call last):
File "text_classifier.py", line 1, in <module>
from models.doc2vec_model import doc2VecModel
File "/home/ec2-user/doc2vec/models/doc2vec_model.py", line 1, in <module>
from .model import Model
File "/home/ec2-user/doc2vec/models/model.py", line 1, in <module>
from abc import ABC, abstractmethod
ImportError: cannot import name ABC

Python installed:

[root@ip-172-31-39-31 doc2vec]# python -v
# installing zipimport hook
import zipimport # builtin
# installed zipimport hook
# /usr/lib64/python2.7/site.pyc matches /usr/lib64/python2.7/site.py
import site # precompiled from /usr/lib64/python2.7/site.pyc
# /usr/lib64/python2.7/os.pyc matches /usr/lib64/python2.7/os.py
import os # precompiled from /usr/lib64/python2.7/os.pyc
import errno # builtin
import posix # builtin
# /usr/lib64/python2.7/posixpath.pyc matches /usr/lib64/python2.7/posixpath.py
import posixpath # precompiled from /usr/lib64/python2.7/posixpath.pyc
# /usr/lib64/python2.7/stat.pyc matches /usr/lib64/python2.7/stat.py
import stat # precompiled from /usr/lib64/python2.7/stat.pyc
# /usr/lib64/python2.7/genericpath.pyc matches /usr/lib64/python2.7/genericpath.py
import genericpath # precompiled from /usr/lib64/python2.7/genericpath.pyc
# /usr/lib64/python2.7/warnings.pyc matches /usr/lib64/python2.7/warnings.py
import warnings # precompiled from /usr/lib64/python2.7/warnings.pyc
# /usr/lib64/python2.7/linecache.pyc matches /usr/lib64/python2.7/linecache.py
import linecache # precompiled from /usr/lib64/python2.7/linecache.pyc
# /usr/lib64/python2.7/types.pyc matches /usr/lib64/python2.7/types.py
import types # precompiled from /usr/lib64/python2.7/types.pyc
# /usr/lib64/python2.7/UserDict.pyc matches /usr/lib64/python2.7/UserDict.py
import UserDict # precompiled from /usr/lib64/python2.7/UserDict.pyc
# /usr/lib64/python2.7/_abcoll.pyc matches /usr/lib64/python2.7/_abcoll.py
import _abcoll # precompiled from /usr/lib64/python2.7/_abcoll.pyc
# /usr/lib64/python2.7/abc.pyc matches /usr/lib64/python2.7/abc.py
import abc # precompiled from /usr/lib64/python2.7/abc.pyc
# /usr/lib64/python2.7/_weakrefset.pyc matches /usr/lib64/python2.7/_weakrefset.py
import _weakrefset # precompiled from /usr/lib64/python2.7/_weakrefset.pyc
import _weakref # builtin
# /usr/lib64/python2.7/copy_reg.pyc matches /usr/lib64/python2.7/copy_reg.py
import copy_reg # precompiled from /usr/lib64/python2.7/copy_reg.pyc
# /usr/lib64/python2.7/traceback.pyc matches /usr/lib64/python2.7/traceback.py
import traceback # precompiled from /usr/lib64/python2.7/traceback.pyc
# /usr/lib64/python2.7/sysconfig.pyc matches /usr/lib64/python2.7/sysconfig.py
import sysconfig # precompiled from /usr/lib64/python2.7/sysconfig.pyc
# /usr/lib64/python2.7/re.pyc matches /usr/lib64/python2.7/re.py
import re # precompiled from /usr/lib64/python2.7/re.pyc
# /usr/lib64/python2.7/sre_compile.pyc matches /usr/lib64/python2.7/sre_compile.py
import sre_compile # precompiled from /usr/lib64/python2.7/sre_compile.pyc
import _sre # builtin
# /usr/lib64/python2.7/sre_parse.pyc matches /usr/lib64/python2.7/sre_parse.py
import sre_parse # precompiled from /usr/lib64/python2.7/sre_parse.pyc
# /usr/lib64/python2.7/sre_constants.pyc matches /usr/lib64/python2.7/sre_constants.py
import sre_constants # precompiled from /usr/lib64/python2.7/sre_constants.pyc
dlopen("/usr/lib64/python2.7/lib-dynload/_localemodule.so", 2);
import _locale # dynamically loaded from /usr/lib64/python2.7/lib-dynload/_localemodule.so
# /usr/lib64/python2.7/_sysconfigdata.pyc matches /usr/lib64/python2.7/_sysconfigdata.py
import _sysconfigdata # precompiled from /usr/lib64/python2.7/_sysconfigdata.pyc
import encodings # directory /usr/lib64/python2.7/encodings
# /usr/lib64/python2.7/encodings/__init__.pyc matches /usr/lib64/python2.7/encodings/__init__.py
import encodings # precompiled from /usr/lib64/python2.7/encodings/__init__.pyc
# /usr/lib64/python2.7/codecs.pyc matches /usr/lib64/python2.7/codecs.py
import codecs # precompiled from /usr/lib64/python2.7/codecs.pyc
import _codecs # builtin
# /usr/lib64/python2.7/encodings/aliases.pyc matches /usr/lib64/python2.7/encodings/aliases.py
import encodings.aliases # precompiled from /usr/lib64/python2.7/encodings/aliases.pyc
# /usr/lib64/python2.7/encodings/utf_8.pyc matches /usr/lib64/python2.7/encodings/utf_8.py
import encodings.utf_8 # precompiled from /usr/lib64/python2.7/encodings/utf_8.pyc
Python 2.7.16 (default, Jul 19 2019, 22:59:28)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
dlopen("/usr/lib64/python2.7/lib-dynload/readline.so", 2);
import readline # dynamically loaded from /usr/lib64/python2.7/lib-dynload/readline.so
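
The error is consistent with the interpreter shown in the log: `abc.ABC` was added in Python 3.4, so `from abc import ABC` fails on Python 2.7.16. Running the script with a Python 3 interpreter is probably the simplest fix. If Python 2.7 had to be supported, a compatibility shim along the following lines is one option. This is only a sketch: the `train` method and the `DummyModel` subclass are hypothetical and do not reflect the real contents of `models/model.py`.

```python
import abc

try:
    from abc import ABC  # available on Python 3.4+
except ImportError:
    # Python 2.7 fallback: build an equivalent base class from ABCMeta.
    ABC = abc.ABCMeta("ABC", (object,), {})


class Model(ABC):
    """Abstract base class; the method name below is illustrative only."""

    @abc.abstractmethod
    def train(self):
        raise NotImplementedError


class DummyModel(Model):
    """Hypothetical concrete subclass used to show the shim works."""

    def train(self):
        return "trained"


if __name__ == "__main__":
    print(DummyModel().train())
```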

Integrate PyUP bot

Use PyUP bot for automated updating of model dependencies.

Depends on #15

Use more datasets

Add support for training the model on other text classification datasets, which will make it suitable for more use cases.

Add Jupyter notebook

Add a Jupyter notebook with the following contents (more ideas are welcome!) to better understand what is going on under the hood:

Exploratory analysis of the input dataset(s).
Visualization of the trained Doc2Vec vectors (t-SNE, UMAP)

CI with Travis

Depends on #13
Trigger automated builds on Travis after every commit, to make sure that the model doesn't break.

Add Unit testing

Add unit tests that trigger model training and inference, then verify that performance does not drop below a certain threshold (80%?)

Depends on #14
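
A threshold check of this kind could be written with the standard `unittest` module. The sketch below only shows the shape of such a test: `evaluate_model` is a hypothetical placeholder that returns accuracy on hard-coded illustrative labels, where a real test would train the model and score it on held-out data. The 0.80 threshold is the value tentatively suggested above.

```python
import unittest

MIN_ACCURACY = 0.80  # tentative threshold from the issue text


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))


class TestModelPerformance(unittest.TestCase):
    def evaluate_model(self):
        # Placeholder: a real test would train the model and predict on a
        # held-out split. These labels are illustrative, not real results.
        y_true = [1, 0, 1, 1, 0]
        y_pred = [1, 0, 1, 1, 1]
        return accuracy(y_true, y_pred)

    def test_accuracy_above_threshold(self):
        # Fails the build if performance drops below the threshold.
        self.assertGreaterEqual(self.evaluate_model(), MIN_ACCURACY)


if __name__ == "__main__":
    unittest.main()
```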

Softwares needed

Hi Ibrahim,

I have been using Scala/Spark for text analysis, and I am in the process of switching over to Python because many new algorithms are available in Python. I am trying your doc2vec example mentioned here and I am able to understand the code. But when I run "python model.py", I keep getting basic errors like 'Import error: module not found'. I understand that I am missing some Python modules. It would be very helpful if you could update the 'Usage' section with the packages needed to run this code. For example, does this code need python3 or python? I really appreciate your response. Thanks for your help and time.