Giter Club home page Giter Club logo

doc2vec's Introduction

Doc2Vec Text Classification Build Status

Text classification model which uses gensim Doc2Vec for generating paragraph embeddings and scikit-learn Logistic Regression for classification.

Dataset

25,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary (1 for postive, 0 for negative).

This source dataset was collected in association with the following publication:

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). "Learning Word Vectors for Sentiment Analysis." The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

Usage

  • Install the required tools

    pip install -r requirements.txt

  • Run the script

    python text_classifier.py

References

doc2vec's People

Contributors

avinashsai avatar ayatallah avatar banyous avatar choas avatar ibrahimsharaf avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

doc2vec's Issues

print precision and recall

First of all I would like to thank you for sharing with us this nice project,
I am getting accuracy and F1 score when running the code,
How I can get precision an recall?
thanks

Predict sentiment of input text

The script now only supports training and testing given an input dataset, we need to add a new function to support prediction given a new example.

  • Save the trained models (doc2vec, random forest/logisitc regression)
  • Load them when the user wants to predict the sentiment of a specific input text.

Depends on #14

Fixed value 183891

Hi, it is not clear for me why in the getVecs function,
if(vecs_type == 'Test'): index = index + 183891
Is there any reason to put that value?

Add CLI to the model

Add CLI support for the following commands:

  • Pass a dataset to the model for training
  • Pass a dataset to the model for testing given a trained model path
  • Pass a dataset to the model which splits it for training then testing
  • Pass a single sentence to the model for prediction given a trained model path

Depends on #14

ImportError: cannot import name ABC

[root@ip-172-31-39-31 doc2vec]# python text_classifier.py
Traceback (most recent call last):
File "text_classifier.py", line 1, in
from models.doc2vec_model import doc2VecModel
File "/home/ec2-user/doc2vec/models/doc2vec_model.py", line 1, in
from .model import Model
File "/home/ec2-user/doc2vec/models/model.py", line 1, in
from abc import ABC, abstractmethod
ImportError: cannot import name ABC

Python installed:

[root@ip-172-31-39-31 doc2vec]# python -v
# installing zipimport hook
import zipimport # builtin
# installed zipimport hook
# /usr/lib64/python2.7/site.pyc matches /usr/lib64/python2.7/site.py
import site # precompiled from /usr/lib64/python2.7/site.pyc
# /usr/lib64/python2.7/os.pyc matches /usr/lib64/python2.7/os.py
import os # precompiled from /usr/lib64/python2.7/os.pyc
import errno # builtin
import posix # builtin
# /usr/lib64/python2.7/posixpath.pyc matches /usr/lib64/python2.7/posixpath.py
import posixpath # precompiled from /usr/lib64/python2.7/posixpath.pyc
# /usr/lib64/python2.7/stat.pyc matches /usr/lib64/python2.7/stat.py
import stat # precompiled from /usr/lib64/python2.7/stat.pyc
# /usr/lib64/python2.7/genericpath.pyc matches /usr/lib64/python2.7/genericpath.py
import genericpath # precompiled from /usr/lib64/python2.7/genericpath.pyc
# /usr/lib64/python2.7/warnings.pyc matches /usr/lib64/python2.7/warnings.py
import warnings # precompiled from /usr/lib64/python2.7/warnings.pyc
# /usr/lib64/python2.7/linecache.pyc matches /usr/lib64/python2.7/linecache.py
import linecache # precompiled from /usr/lib64/python2.7/linecache.pyc
# /usr/lib64/python2.7/types.pyc matches /usr/lib64/python2.7/types.py
import types # precompiled from /usr/lib64/python2.7/types.pyc
# /usr/lib64/python2.7/UserDict.pyc matches /usr/lib64/python2.7/UserDict.py
import UserDict # precompiled from /usr/lib64/python2.7/UserDict.pyc
# /usr/lib64/python2.7/_abcoll.pyc matches /usr/lib64/python2.7/_abcoll.py
import _abcoll # precompiled from /usr/lib64/python2.7/_abcoll.pyc
# /usr/lib64/python2.7/abc.pyc matches /usr/lib64/python2.7/abc.py
import abc # precompiled from /usr/lib64/python2.7/abc.pyc
# /usr/lib64/python2.7/_weakrefset.pyc matches /usr/lib64/python2.7/_weakrefset.py
import _weakrefset # precompiled from /usr/lib64/python2.7/_weakrefset.pyc
import _weakref # builtin
# /usr/lib64/python2.7/copy_reg.pyc matches /usr/lib64/python2.7/copy_reg.py
import copy_reg # precompiled from /usr/lib64/python2.7/copy_reg.pyc
# /usr/lib64/python2.7/traceback.pyc matches /usr/lib64/python2.7/traceback.py
import traceback # precompiled from /usr/lib64/python2.7/traceback.pyc
# /usr/lib64/python2.7/sysconfig.pyc matches /usr/lib64/python2.7/sysconfig.py
import sysconfig # precompiled from /usr/lib64/python2.7/sysconfig.pyc
# /usr/lib64/python2.7/re.pyc matches /usr/lib64/python2.7/re.py
import re # precompiled from /usr/lib64/python2.7/re.pyc
# /usr/lib64/python2.7/sre_compile.pyc matches /usr/lib64/python2.7/sre_compile.py
import sre_compile # precompiled from /usr/lib64/python2.7/sre_compile.pyc
import _sre # builtin
# /usr/lib64/python2.7/sre_parse.pyc matches /usr/lib64/python2.7/sre_parse.py
import sre_parse # precompiled from /usr/lib64/python2.7/sre_parse.pyc
# /usr/lib64/python2.7/sre_constants.pyc matches /usr/lib64/python2.7/sre_constants.py
import sre_constants # precompiled from /usr/lib64/python2.7/sre_constants.pyc
dlopen("/usr/lib64/python2.7/lib-dynload/_localemodule.so", 2);
import _locale # dynamically loaded from /usr/lib64/python2.7/lib-dynload/_localemodule.so
# /usr/lib64/python2.7/_sysconfigdata.pyc matches /usr/lib64/python2.7/_sysconfigdata.py
import _sysconfigdata # precompiled from /usr/lib64/python2.7/_sysconfigdata.pyc
import encodings # directory /usr/lib64/python2.7/encodings
# /usr/lib64/python2.7/encodings/__init__.pyc matches /usr/lib64/python2.7/encodings/__init__.py
import encodings # precompiled from /usr/lib64/python2.7/encodings/__init__.pyc
# /usr/lib64/python2.7/codecs.pyc matches /usr/lib64/python2.7/codecs.py
import codecs # precompiled from /usr/lib64/python2.7/codecs.pyc
import _codecs # builtin
# /usr/lib64/python2.7/encodings/aliases.pyc matches /usr/lib64/python2.7/encodings/aliases.py
import encodings.aliases # precompiled from /usr/lib64/python2.7/encodings/aliases.pyc
# /usr/lib64/python2.7/encodings/utf_8.pyc matches /usr/lib64/python2.7/encodings/utf_8.py
import encodings.utf_8 # precompiled from /usr/lib64/python2.7/encodings/utf_8.pyc
Python 2.7.16 (default, Jul 19 2019, 22:59:28)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
dlopen("/usr/lib64/python2.7/lib-dynload/readline.so", 2);
import readline # dynamically loaded from /usr/lib64/python2.7/lib-dynload/readline.so

Use more datasets

Add support to train the model on other text classification datasets, which will make it suitable for more use cases.

Add Jupyter notebook

Add a Jupyter notebook with the following modifications (more ideas are welcomed!) to better understand what is going on under the hood:

  • Exploratory analysis of the input dataset(s).
  • Visualization of the trained Doc2Vec vectors (t-SNE, UMAP)

CI with Travis

Depends on #13
Trigger automated builds on Travis after every commit, to make sure that the model doesn't break.

Add Unit testing

Add unit tests to trigger the model training and inference then make sure the performance doesn't worsen below a certain threshold (80%?)

Depends on #14

Softwares needed

Hi Ibrahim,

I have been using Scala/Spark for text analysis. I am in the process of switching over to Python as many new algorithms are available in Python. I am trying your doc2vec example mentioned here and able to understand the code. But When I run "python model.py", I was keep on getting basic errors like 'Import error: module not found' etc. I understood that I am missing some of the python modules. It would be very beneficial if you can update the code with 'Usage' as in the packages needed to run this code. For example, does this code needs python3/python? I really appreciate your response. Thanks for your help and time.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.