
twitter-sentiment-cnn's Introduction

Twitter sentiment classification by Daniele Grattarola

This is a TensorFlow implementation of a convolutional neural network (CNN) to perform sentiment classification on tweets.

This code is meant to have educational value: you can train the model yourself and play with different configurations. It was not developed to be deployed as-is (although it has been used in professional contexts). The dataset used for training is taken from here. (Someone reported that the link to the dataset appears to be dead sometimes, so dataset_downloader.py might not work. I successfully ran the script on January 20, 2018, but please report it to me if you have any problems.)

NOTE: this script is for Python 2.7 only

Setup

You'll need TensorFlow >= 1.1.0 and its dependencies installed for the script to work (see here).

Once you've installed and configured TensorFlow, download the source files and cd into the folder:

$ git clone https://gitlab.com/danielegrattarola/twitter-sentiment-cnn.git
$ cd twitter-sentiment-cnn

Before you can use the script, some setup is needed. Download the dataset from the link above by running:

$ python dataset_downloader.py

Read the dataset from the CSV into two files (.pos and .neg) with:

$ python csv_parser.py

And generate a CSV with the vocabulary (and its inverse mapping) with:

$ python vocab_builder.py

The files will be created in the twitter-sentiment-dataset/ folder. Finally, create an output/ folder that will contain all session checkpoints needed to restore the trained models:

$ mkdir output

Now everything is set up and you're ready to start training the model.

Usage

The simplest way to run the script is:

$ python twitter-sentiment-cnn.py

which will load the dataset into memory, create the computation graph, and quit. Try running the script like this to see that everything is set up correctly. To run a training session on the full dataset (and save the result so that the network can be reused later, or trained further), run:

$ python twitter-sentiment-cnn.py --train --save

After training, we can test the network as follows:

$ python twitter-sentiment-cnn.py --load path/to/ckpt/folder/ --custom_input 'I love neural networks!'

which will eventually output:

...
Processing custom input: I love neural networks!
Custom input evaluation: POS
Actual output: [ 0.19249919  0.80750078]
...
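
The two values in the output are, presumably, the network's scores for the NEG and POS classes respectively (they sum to 1 because of the softmax output layer); the higher one determines the predicted label.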

By running:

$ python twitter-sentiment-cnn.py -h

the script will output a list of all customizable flags and parameters. The parameters are listed below; a combined usage example follows the list:

  • train: train the network;
  • save: save session checkpoints;
  • save_protobuf: save model as binary protobuf;
  • evaluate_batch: evaluate the network on a held-out batch from the dataset and print the results (for debugging/educational purposes);
  • load: restore a model from the given path;
  • custom_input: evaluate the model on the given string;
  • filter_sizes: comma-separated filter sizes for the convolutional layers (default: '3,4,5');
  • dataset_fraction: fraction of the dataset to load in memory, to reduce memory usage (default: 1.0, i.e. the whole dataset);
  • embedding_size: size of the word embeddings (default: 128);
  • num_filters: number of filters per filter size (default: 128);
  • batch_size: batch size (default: 128);
  • epochs: number of training epochs (default: 3);
  • valid_freq: how many times per epoch to perform validation testing (default: 1);
  • checkpoint_freq: how many times per epoch to save the model (default: 1);
  • test_data_ratio: fraction of the dataset to use for validation (default: 0.1);
  • device: device to use for running the model (can be either 'cpu' or 'gpu').
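
For instance, a hypothetical run combining some of these flags, to train for one epoch on a quarter of the dataset with smaller filters:

$ python twitter-sentiment-cnn.py --train --save --epochs 1 --dataset_fraction 0.25 --filter_sizes 2,3,4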

Pre-trained model

User @Horkyze kindly trained the model for three epochs on the full dataset and shared the summary folder for quick deployment. The folder is available on Mega; to load the model, simply unpack the zip file and use the --load flag as follows:

# Current directory: twitter-sentiment-cnn/
$ unzip path/to/run20180201-231509.zip
$ python twitter-sentiment-cnn.py --load path/to/run20180201-231509/ --custom_input "I love neural networks!"

Running this command should give you something like:

======================= START! ========================
	data_helpers: loading positive examples...
	data_helpers: [OK]
	data_helpers: loading negative examples...
	data_helpers: [OK]
	data_helpers: cleaning strings...
	data_helpers: [OK]
	data_helpers: generating labels...
	data_helpers: [OK]
	data_helpers: concatenating labels...
	data_helpers: [OK]
	data_helpers: padding strings...
	data_helpers: [OK]
	data_helpers: building vocabulary...
	data_helpers: [OK]
	data_helpers: building processed datasets...
	data_helpers: [OK]

Flags:
	batch_size = 128
	checkpoint_freq = 1
	custom_input = I love neural networks!
	dataset_fraction = 0.001
	device = cpu
	embedding_size = 128
	epochs = 3
	evaluate_batch = False
	filter_sizes = 3,4,5
	load = output/run20180201-231509/
	num_filters = 128
	save = False
	save_protobuf = False
	test_data_ratio = 0.1
	train = False
	valid_freq = 1

Dataset:
	Train set size = 1421
	Test set size = 157
	Vocabulary size = 274562
	Input layer size = 36
	Number of classes = 2

Output folder: /home/phait/dev/twitter-sentiment-cnn/output/run20180208-112402
Data processing OK, loading network...
Evaluating custom input: I love neural networks!
Custom input evaluation: POS
Actual output: [0.04109644 0.95890355]

NOTE: loading this model won't work if you change anything in the default network architecture, so don't set the --filter_sizes flag.

According to the log.log file provided by @Horkyze, the model had a final validation accuracy of 0.80976, and a validation loss of 53.3314.

I sincerely thank @Horkyze for providing the computational power and sharing the model with me.

Model description

The network implemented in this script is a single-layer CNN structured as follows (a minimal code sketch follows the list):

  • Embedding layer: takes the tweets (as strings) as input and maps each word to an n-dimensional space, so that each word is represented as a dense vector (see word2vec).
  • Convolution layers: a set of parallel 1D convolutional layers with the given filter sizes and 128 output channels. A filter's size is the number of embedded words that the filter covers.
  • Pooling layers: a set of pooling layers, one associated with each of the convolutional layers.
  • Concat layer: concatenates the output of the different pooling layers into a single tensor.
  • Dropout layer: performs neuron dropout (some neurons are randomly ignored during training, to reduce overfitting).
  • Output layer: fully connected layer with a softmax activation function to perform classification.
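
To make the structure concrete, here is a minimal sketch of the same architecture written with tf.keras. This is not the repository's actual TF1 code; the layer names and the 0.5 dropout rate are assumptions, while the sizes come from the default flags (embedding_size=128, filter_sizes=3,4,5, num_filters=128, two classes):

import tensorflow as tf

def build_model(vocab_size, sequence_length, embedding_size=128,
                filter_sizes=(3, 4, 5), num_filters=128, num_classes=2):
    inputs = tf.keras.Input(shape=(sequence_length,), dtype="int32")
    # Embedding layer: maps each word index to a dense embedding_size-dim vector.
    x = tf.keras.layers.Embedding(vocab_size, embedding_size)(inputs)
    pooled = []
    for size in filter_sizes:
        # Parallel 1D convolutions; each filter spans `size` embedded words.
        conv = tf.keras.layers.Conv1D(num_filters, size, activation="relu")(x)
        # Max-over-time pooling: keep the strongest response per filter.
        pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))
    # Concat layer: join the pooled features from all filter sizes.
    merged = tf.keras.layers.Concatenate()(pooled)
    # Dropout layer: randomly ignore neurons during training.
    merged = tf.keras.layers.Dropout(0.5)(merged)
    # Output layer: fully connected with softmax over the two classes.
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(merged)
    return tf.keras.Model(inputs, outputs)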

The script will automatically log the session with TensorBoard. To visualize the computation graph and training metrics, run:

$ tensorboard --logdir output/path/to/summaries/

and then navigate to localhost:6006 in your browser (you'll see the computation graph in the Graphs section).

twitter-sentiment-cnn's People

Contributors

danielegrattarola, tzolov


twitter-sentiment-cnn's Issues

ValueError: not enough values to unpack (expected 2, got 0)

Traceback (most recent call last):
  File "twitter-sentiment-cnn.py", line 147, in <module>
    x, y, vocabulary, vocabulary_inv = load_data(FLAGS.reduced_dataset)
  File "C:\Users\gnana\Desktop\twitter-sentiment-cnn\data_helpers.py", line 166, in load_data
    vocabulary, vocabulary_inv = build_vocab()
  File "C:\Users\gnana\Desktop\twitter-sentiment-cnn\data_helpers.py", line 119, in build_vocab
    vocabulary = {x:i for x, i in voc}
  File "C:\Users\gnana\Desktop\twitter-sentiment-cnn\data_helpers.py", line 119, in <dictcomp>
    vocabulary = {x:i for x, i in voc}
ValueError: not enough values to unpack (expected 2, got 0)

Can't multiply sequence by non-int of type 'float'

Running it in Python 3, I get the following error message:

python3 vocab_builder.py
vocab_builder: loading...
	data_helpers: loading positive examples...
	data_helpers: [OK]
	data_helpers: loading negative examples...
	data_helpers: [OK]
Traceback (most recent call last):
  File "vocab_builder.py", line 25, in <module>
    sentences, labels = load_data_and_labels(1)  # 1 is passed so that load_data_and_labels() will parse the whole dataset
  File "/Users/subasishdas1/Desktop/RCodeData/twitter-sentiment-cnn/data_helpers.py", line 57, in load_data_and_labels
    positive_examples = sample_list(positive_examples, reduced_dataset)
  File "/Users/subasishdas1/Desktop/RCodeData/twitter-sentiment-cnn/data_helpers.py", line 39, in sample_list
    return random.sample(list, len(list)/dividend)
  File "//anaconda/lib/python3.5/random.py", line 316, in sample
    result = [None] * k
TypeError: can't multiply sequence by non-int of type 'float'

NameError: global name 'Counter' is not defined

word_counts = Counter(itertools.chain(*sentences))

macOS Sierra, Python 2.7. I ran sudo -H pip install --upgrade tensorflow and sudo -H pip install tqdm.

I followed the instructions and everything worked until I ran python vocab_builder.py.
Script output:

$ python vocab_builder.py 
vocab_builder: loading...
	data_helpers: loading positive examples...
	data_helpers: [OK]
	data_helpers: loading negative examples...
	data_helpers: [OK]
	data_helpers: cleaning strings...
	data_helpers: [OK]
	data_helpers: generating labels...
	data_helpers: [OK]
	data_helpers: concatenating labels...
	data_helpers: [OK]
vocab_builder: padding...
vocab_builder: building vocabularies...
Traceback (most recent call last):
  File "vocab_builder.py", line 29, in <module>
    vocabulary, vocabulary_inv = build_vocab(sentences_padded)
  File "vocab_builder.py", line 15, in build_vocab
    word_counts = Counter(itertools.chain(*sentences))
NameError: global name 'Counter' is not defined
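
A likely fix (an assumption based on the traceback, not a confirmed patch): Counter is used in build_vocab() but apparently never imported, so adding the import at the top of vocab_builder.py should resolve the NameError:

# Hypothetical one-line fix at the top of vocab_builder.py:
from collections import Counter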

Error while running python vocab_builder.py

vocab_builder: loading...
data_helpers: loading positive examples...
data_helpers: [OK]
data_helpers: loading negative examples...
data_helpers: [OK]
data_helpers: cleaning strings...
data_helpers: [OK]
data_helpers: generating labels...
data_helpers: [OK]
data_helpers: concatenating labels...
data_helpers: [OK]
vocab_builder: padding...
Killed

IndexError: too many indices for array

I tried to execute --train --save under Win10, Python 3.5, TensorFlow 1.9.0, but I understand this setup is not supported. I wonder if you can help me nevertheless; thank you.

======================= START! ========================
data_helpers: loading positive examples...
data_helpers: [OK]
data_helpers: loading negative examples...
data_helpers: [OK]
data_helpers: cleaning strings...
data_helpers: [OK]
data_helpers: generating labels...
data_helpers: [OK]
data_helpers: concatenating labels...
data_helpers: [OK]
data_helpers: padding strings...
data_helpers: [OK]
data_helpers: building vocabulary...
data_helpers: [OK]
data_helpers: building processed datasets...
data_helpers: [OK]

Flags:

Dataset:
Train set size = 1420766
Test set size = 157862
Vocabulary size = 274562
Input layer size = 117
Number of classes = 2

Output folder: C:\Users\Ale\git\twitter-sentiment-cnn\output\run20180720-004328
2018-07-20 00:44:52.372521: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Data processing OK, creating network...
Traceback (most recent call last):
  File "twitter-sentiment-cnn.py", line 292, in <module>
    test_batches = list(batch_iter(zip(x_test, y_test), FLAGS.batch_size, 1))
  File "C:\Users\Ale\git\twitter-sentiment-cnn\data_helpers.py", line 185, in batch_iter
    shuffled_data = data[shuffle_indices]
IndexError: too many indices for array

Script gets killed

I'm running a docker tensorflow/tensorflow container on my Mac. Inside the container I followed the instructions and got to the point where I call:

python twitter-sentiment-cnn.py --train --save

However the script gets killed, with the following output:

root@19a3c9721bd0:/opt/twitter-cnn# python twitter-sentiment-cnn.py --train --save
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
======================= START! ========================
	data_helpers: loading positive examples...
	data_helpers: [OK]
	data_helpers: loading negative examples...
	data_helpers: [OK]
	data_helpers: cleaning strings...
	data_helpers: [OK]
	data_helpers: generating labels...
	data_helpers: [OK]
	data_helpers: concatenating labels...
	data_helpers: [OK]
	data_helpers: padding strings...
Killed
root@19a3c9721bd0:/opt/twitter-cnn#

Any idea what this means?
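
A plausible explanation (an educated guess, not a confirmed diagnosis): "Killed" usually means the operating system's OOM killer terminated the process because padding the full dataset exhausted the container's memory. Giving the container more memory, or loading only a fraction of the dataset, should help, e.g.:

$ python twitter-sentiment-cnn.py --train --save --dataset_fraction 0.25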

TypeError: Expected int32, got list containing Tensors of type '_Message' instead.

First of all, thank you for all the effort put into this project!
I'm trying to get it to work on an Ubuntu 18.04 machine (Python 2.7.15 & TensorFlow 1.9), but I am getting the following error message:

Traceback (most recent call last):
  File "twitter-sentiment-cnn.py", line 195, in <module>
    h_pool = tf.concat(3, pooled_outputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1110, in concat
    dtype=dtypes.int32).get_shape().assert_is_compatible_with(
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1011, in convert_to_tensor
    as_ref=False)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1107, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 217, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/constant_op.py", line 196, in constant
    value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py", line 436, in make_tensor_proto
    _AssertCompatible(values, dtype)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_util.py", line 347, in _AssertCompatible
    (dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected int32, got list containing Tensors of type '_Message' instead.

Thank you in advance!
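
A likely cause (inferred from the traceback; the fix is an assumption, not a maintainer-confirmed patch): the argument order of tf.concat changed in TensorFlow 1.0 from tf.concat(axis, values) to tf.concat(values, axis), so the old-style call on line 195 fails on newer versions. Swapping the arguments should fix it:

# Before (pre-1.0 argument order):
h_pool = tf.concat(3, pooled_outputs)
# After (TensorFlow >= 1.0 argument order):
h_pool = tf.concat(pooled_outputs, 3)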

Trained model

Hey again, I have the fully trained model at my disposal (it took whole night to train). It has 406 MB (uncompressed).
I've uploaded it to mega as zip. If you want I can create a PR to master or other branch so other people have easy access to the complete solution without the need of training the network.

Anyway here is the link https://mega.nz/#!xVg0ARYK!oVyBZatotQGOD_FFSzZl5gTS1Z49048vjFEbyzftcFY

Issue

Hi, I got the following error:

Traceback (most recent call last):
  File "twitter-sentiment-cnn.py", line 195, in <module>
    h_pool = tf.concat(3, pooled_outputs)
  File "/Users/vinhtngu/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1061, in concat
    dtype=dtypes.int32).get_shape(
  File "/Users/vinhtngu/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 611, in convert_to_tensor
    as_ref=False)
  File "/Users/vinhtngu/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 676, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/Users/vinhtngu/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 121, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/Users/vinhtngu/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 102, in constant
    tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/Users/vinhtngu/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.py", line 376, in make_tensor_proto
    _AssertCompatible(values, dtype)
  File "/Users/vinhtngu/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.py", line 302, in _AssertCompatible
    (dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected int32, got list containing Tensors of type '_Message' instead.

Do you know how I can fix this? Thank you very much.
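
(This appears to be the same tf.concat argument-order issue described in the previous report; the same one-line swap should apply.)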

shuffled_data = data[shuffle_indices] IndexError: too many indices for array

======================= START! ========================
data_helpers: loading positive examples...
data_helpers: [OK]
data_helpers: loading negative examples...
data_helpers: [OK]
data_helpers: cleaning strings...
data_helpers: [OK]
data_helpers: generating labels...
data_helpers: [OK]
data_helpers: concatenating labels...
data_helpers: [OK]
data_helpers: padding strings...
data_helpers: [OK]
data_helpers: building vocabulary...
data_helpers: [OK]
data_helpers: building processed datasets...
data_helpers: [OK]

Flags:
batch_size = 100
checkpoint_freq = 1
custom_input =
device = cpu
embedding_size = 128
epochs = 3
evaluate_batch = False
filter_sizes = 3,4,5
load = None
num_filters = 128
reduced_dataset = 1
save = True
save_protobuf = False
test_data_ratio = 10
train = True
valid_freq = 1

Dataset:
Train set size = 1420766
Test set size = 157862
Vocabulary size = 274562
Input layer size = 117
Number of classes = 2

Output folder: /home/bizviz/Desktop/twitter-sentiment-cnn/output/run20170626-185836
2017-06-26 20:00:41.322072: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-26 20:00:42.065691: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-26 20:00:42.559938: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-06-26 20:00:43.504129: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-26 20:00:43.704418: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Data processing OK, creating network...
Traceback (most recent call last):
  File "twitter-sentiment-cnn.py", line 299, in <module>
    test_batches = list(batch_iter(zip(x_test, y_test), FLAGS.batch_size, 1))
  File "/home/bizviz/Desktop/twitter-sentiment-cnn/data_helpers.py", line 183, in batch_iter
    shuffled_data = data[shuffle_indices]
IndexError: too many indices for array

I am getting this error and am not quite sure how to solve it. Can you please help me with it?
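
A likely cause (an assumption based on the traceback, since the script officially targets Python 2): under Python 3, zip() returns an iterator rather than a list, so converting it to a NumPy array inside batch_iter() yields a 0-dimensional array, and indexing it with shuffle_indices then fails. Materializing the zip before passing it in should fix it:

# Hypothetical fix for Python 3 (twitter-sentiment-cnn.py, around line 299):
test_batches = list(batch_iter(list(zip(x_test, y_test)), FLAGS.batch_size, 1))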

can't multiply sequence by non-int of type 'float'

The function below throws the exception "can't multiply sequence by non-int of type 'float'"; any idea how to fix it?

def sample_list(list, dividend):
    """
    Returns 1/dividend-th of the given list, randomly sampled.
    """
    return random.sample(list, len(list)/dividend)

FYI: I am using Python 3.5.

Thanks in advance
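
A likely fix (an assumption; the repository targets Python 2, where / between two ints is integer division): in Python 3, len(list)/dividend returns a float, but random.sample() requires an integer sample size. Floor division restores the Python 2 behavior (the parameter is also renamed here to avoid shadowing the list built-in):

import random

def sample_list(items, dividend):
    """
    Returns 1/dividend-th of the given list, randomly sampled.
    """
    # Floor division keeps the sample size an int under Python 3.
    return random.sample(items, len(items) // dividend)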

ValueError in twitter-sentiment-cnn.py

ValueError: Negative dimension size caused by subtracting 3 from 1 for 'conv-maxpool-3/conv' (op: 'Conv2D') with input shapes: [?,1,128,1], [3,128,1,128]
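
A note on what this error usually means (an inference from the shapes in the message, not a confirmed diagnosis): the input tensor has height 1 (shape [?, 1, 128, 1]) while the smallest filter spans 3 words, so a 'VALID' convolution would produce a negative output size (1 - 3 + 1 = -1). In practice this happens when the padded input sentences end up shorter than the smallest filter size, e.g. when the dataset was not loaded correctly.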

Accuracy

Hey, sorry to bother you again: what accuracy do you get when running this? I am getting an accuracy of 53% and would like to increase it. Any suggestions?
