sunny-side-up

Lab41's foray into Sentiment Analysis with Deep Learning. In addition to checking out the source code, visit the Wiki for Learning Resources and possible Conferences to attend.

Try them, try them, and you may! Try them and you may, I say.

Table of Contents

  • Blog Overviews
  • Docker Environments
  • Binary Classification with Word Vectors
  • Binary Classification via Deep Learning

Blog Overviews

  • Can Word Vectors Help Predict Whether Your Chinese Tweet Gets Censored? (March 2016)
  • One More Reason Not To Be Scared of Deep Learning (March 2016)
  • Some Tips for Debugging in Deep Learning (January 2016)
  • Faster On-Ramp to Deep Learning With Jupyter-driven Docker Containers (November 2015)
  • A Tour of Sentiment Analysis Techniques: Getting a Baseline for Sunny Side Up (November 2015)
  • Learning About Deep Learning! (September 2015)

Docker Environments

  • lab41/itorch-[cpu|cuda]: iTorch IPython kernel for Torch scientific computing GPU framework
  • lab41/keras-[cpu|cuda|cuda-jupyter]: Keras neural network library (CPU or GPU backend from command line or within Jupyter notebook)
  • lab41/neon-[cuda|cuda7.5]: neon Deep Learning framework (with CUDA backend) by Nervana
  • lab41/pylearn2: pylearn2 machine learning research library
  • lab41/sentiment-ml: build word vectors (Word2Vec from gensim; GloVe from glove-python), tokenize Chinese text (jieba and pypinyin), and tokenize Arabic text (NLTK and Stanford Parser)
  • lab41/mechanical-turk: convert CSV of Arabic tweets to individual PNG images for each Tweet (to avoid machine-translation of text) and auto-submit/score Arabic sentiment survey via AWS Mechanical Turk

Binary Classification with Word Vectors

Execution

python -m benchmarks.baseline_classifiers

Word Vector Models

| model | filename | filesize | vocabulary | details |
|-------|----------|----------|------------|---------|
| Sentiment140 | sentiment140_800000.bin | 153M | 83,586 | gensim Word2Vec(size=200, window=5, min_count=10) |
| Open Weiboscope | openweibo_fullset_hanzi_CLEAN_vocab31357747.bin | 56G | 31,357,746 | jieba-tokenized Hanzi Word2Vec(size=200, window=5, min_count=1) |
| Open Weiboscope | openweibo_fullset_min10_hanzi_vocab2548911.bin | 4.6G | 2,548,911 | jieba-tokenized Hanzi Word2Vec(size=200, window=5, min_count=10) |
| Arabic Tweets | arabic_tweets_min10vocab_vocab1520226.bin | 1.2G | 1,520,226 | Stanford Parser-tokenized Word2Vec(size=200, window=5, min_count=10) |
| Arabic Tweets | arabic_tweets_NLTK_min10vocab_vocab981429.bin | 759M | 981,429 | NLTK-tokenized Word2Vec(size=200, window=5, min_count=10) |
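
All of the models above are gensim Word2Vec binaries trained on tokenized posts (jieba for Chinese, NLTK or the Stanford Parser for Arabic). A minimal sketch of training and loading such a model with the listed hyperparameters, using the gensim API of the era this repo targets (newer gensim moves these methods onto model.wv); the tokenized_tweets iterable and the query word are illustrative assumptions, not code from this repo:

from gensim.models import Word2Vec
import jieba

# Chinese posts are tokenized with jieba before training, e.g.:
tokens = jieba.lcut(u'今天天气真好')   # -> list of Hanzi tokens

# train on an iterable of tokenized documents (each a list of tokens)
model = Word2Vec(tokenized_tweets, size=200, window=5, min_count=10, workers=4)
model.save_word2vec_format('sentiment140_800000.bin', binary=True)

# load a pre-built binary and query it
model = Word2Vec.load_word2vec_format('sentiment140_800000.bin', binary=True)
print(model.most_similar('happy', topn=5))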

Training and Testing Data

train/test set filename filesize details
Sentiment140 sentiment140_800000_samples_[test/train].bin 183M 80/20 split of 1.6M emoticon-labeled Tweets
Open Weiboscope openweibo_hanzi_censored_27622_samples_[test/train].bin 25M 80/20 split of 55,244 censored posts
Open Weiboscope openweibo_800000_min1vocab_samples_[test/train].bin 564M 80/20 split of 1.6M deleted posts
Arabic Twitter arabic_twitter_1067972_samples_[test/train].bin 912M 80/20 split of 2,135,944 emoticon-and-emoji labeled Tweets
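
Each corpus above is stored as an 80/20 train/test split of labeled samples. A minimal sketch of producing such a split with scikit-learn (the texts and labels variables, their encoding, and the random seed are illustrative assumptions rather than the repo's actual pipeline):

from sklearn.model_selection import train_test_split

# texts: list of raw posts; labels: 1 for positive/censored, 0 otherwise (assumed encoding)
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=41)

print(len(texts_train), len(texts_test))  # roughly an 80/20 split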

Binary Classification via Deep Learning

CNN (Convolutional Neural Network)

Character-by-character processing, from Zhang and LeCun's Text Understanding from Scratch:

# imports for the legacy Keras API used in this example
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.optimizers import SGD

# set parameters for the final fully connected layers
fully_connected = [1024, 1024, 1]

model = Sequential()

#Input = alphabet size (67) x 1014 characters
model.add(Convolution2D(256,67,7,input_shape=(1,67,1014)))
model.add(MaxPooling2D(pool_size=(1,3)))

#Input = 336 x 256
model.add(Convolution2D(256,1,7))
model.add(MaxPooling2D(pool_size=(1,3)))

#Input = 110 x 256
model.add(Convolution2D(256,1,3))

#Input = 108 x 256
model.add(Convolution2D(256,1,3))

#Input = 106 x 256
model.add(Convolution2D(256,1,3))

#Input = 104 X 256
model.add(Convolution2D(256,1,3))
model.add(MaxPooling2D(pool_size=(1,3)))

model.add(Flatten())

#Fully Connected Layers

#Input is 8704 Output is 1024
model.add(Dense(fully_connected[0]))
model.add(Dropout(0.5))
model.add(Activation('relu'))

#Input is 1024 Output is 1024
model.add(Dense(fully_connected[1]))
model.add(Dropout(0.5))
model.add(Activation('relu'))

#Input is 1024 Output is 1
model.add(Dense(fully_connected[2]))
model.add(Activation('sigmoid'))

#Stochastic gradient parameters as set by paper
sgd = SGD(lr=0.01, decay=1e-5, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd, class_mode="binary")
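
The network above expects each document as a 1 x 67 x 1014 tensor: a one-hot column for each of up to 1,014 characters over a 67-symbol alphabet. A minimal sketch of that character quantization step (the alphabet string and encode_chars helper are hypothetical illustrations, not code from this repo):

import numpy as np

# hypothetical 67-symbol alphabet: 26 letters, 10 digits, 31 punctuation marks
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~+=<>()[]{}"
char_index = {c: i for i, c in enumerate(alphabet)}
frame_length = 1014

def encode_chars(text):
    # one-hot encode up to frame_length characters; unknown characters stay all-zero
    frame = np.zeros((1, len(alphabet), frame_length), dtype=np.float32)
    for pos, char in enumerate(text.lower()[:frame_length]):
        if char in char_index:
            frame[0, char_index[char], pos] = 1.0
    return frame

X = np.array([encode_chars(doc) for doc in ["what a great phone!", "terrible battery life"]])
# X.shape == (2, 1, 67, 1014), matching the input_shape of the first Convolution2D layer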

LSTM (Long Short-Term Memory)

# imports for the legacy Keras API used in this example
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten, Reshape
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers.embeddings import Embedding

# initialize the neural net and reshape the data
model = Sequential()
model.add(Embedding(max_features, embedding_size)) # embed into dense 3D float tensor (samples, maxlen, embedding_size)
model.add(Reshape(1, maxlen, embedding_size)) # reshape into 4D tensor (samples, 1, maxlen, embedding_size)

# convolution stack
model.add(Convolution2D(nb_feature_maps, nb_classes, filter_size_row, filter_size_col, border_mode='full')) # reshaped to 32 x maxlen x 256 (32 x 100 x 256)
model.add(Activation('relu'))

# convolution stack with regularization
model.add(Convolution2D(nb_feature_maps, nb_feature_maps, filter_size_row, filter_size_col, border_mode='full')) # reshaped to 32 x maxlen x 256 (32 x 100 x 256)
model.add(Activation('relu'))
model.add(MaxPooling2D(poolsize=(2, 2))) # reshaped to 32 x maxlen/2 x 256/2 (32 x 50 x 128)
model.add(Dropout(0.25))

# convolution stack with regularization
model.add(Convolution2D(nb_feature_maps, nb_feature_maps, filter_size_row, filter_size_col)) # reshaped to 32 x 50 x 128
model.add(Activation('relu'))
model.add(MaxPooling2D(poolsize=(2, 2))) # reshaped to 32 x maxlen/2/2 x 256/2/2 (32 x 25 x 64)
model.add(Dropout(0.25))

# fully-connected layer
model.add(Flatten())
model.add(Dense(nb_feature_maps * (maxlen/2/2) * (embedding_size/2/2), fully_connected_size))
model.add(Activation("relu"))
model.add(Dropout(0.50))

# output classifier
model.add(Dense(fully_connected_size, 1))
model.add(Activation("sigmoid"))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary")
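
The snippet above leaves its hyperparameters (max_features, maxlen, embedding_size, and so on) undefined. A minimal sketch of values consistent with the shape comments above, plus padding integer-encoded tweets to a fixed length for the Embedding layer; the values not implied by those comments and the tokenized_* / y_* variables are illustrative assumptions, not settings from this repo:

from keras.preprocessing.sequence import pad_sequences

# illustrative hyperparameters (assumed where not implied by the shape comments above)
max_features = 20000          # vocabulary size seen by the Embedding layer
maxlen = 100                  # tokens per tweet after padding/truncation
embedding_size = 256          # dimensionality of each word embedding
nb_feature_maps = 32          # convolution filters per layer
nb_classes = 1                # passed as the second positional argument of the first Convolution2D call
filter_size_row, filter_size_col = 3, 3
fully_connected_size = 256

# tweets as lists of word indices, padded/truncated to a fixed length
X_train = pad_sequences(tokenized_train_tweets, maxlen=maxlen)
X_test = pad_sequences(tokenized_test_tweets, maxlen=maxlen)

model.fit(X_train, y_train, batch_size=32, nb_epoch=3, validation_data=(X_test, y_test))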

sunny-side-up's People

Contributors

aganeshlab41, bradh41, kfoss, paulrodrigues, pcallier, shulmanbrent, ymt123

sunny-side-up's Issues

Making the git extension a generic Jupyter extension?

Whilst looking for a Jupyter notebook extension that would link notebook checkpoints to git commits and allow checkpoint recovery from git pushes, I came across your bespoke Commit-and-Push to GitHub from Jupyter Notebooks extension.

(In passing, I also found a GitCheckpoints package, though I haven't had a chance to try it/see if it works with Jupyter.)

I seem to recall from some of the Jupyter Google group discussions that elaborate checkpointing schemes were unlikely to become part of the core offering, so extensions seem to be the way to go.

FWIW, SageMathCloud also has some interesting backup and history replay features, so you can replay the history of a notebook. The granularity there seems finer than what the git-commit route is likely to offer, but it does demonstrate one way of replaying histories.

Another piece of the jigsaw when it comes to making use of commits is having an environment for comparing diffs. With GitHub supporting nbviewer displays of notebooks, I wonder whether they are also contributing to work on an nbdiff tool (nbdiff itself seems to have stalled, for example). The csiro-scientific-computing folk who did the GitCheckpoints routine also seem to have explored diffs: NotebookDiff.

Doc2Vec error in Sentiment140_W2V_Pipeline.py

at line
model.train(np.random.permutation(labeled_sent))

in code
for epoch in xrange(epoch_num):
    logging.info("Epoch %s..." % epoch)
    # Temporarily sets logging level to show only if its at least WARNING
    # This prevents model.train from overloading the log
    logging.getLogger().setLevel(logging.WARN)
    # Numpy random permutation method shuffles data in place
    # Shuffling improves the accuracy of the model
    model.train(np.random.permutation(labeled_sent))
    logging.getLogger().setLevel(logging.INFO)

The error message is:

File "C:\Anaconda\Lib\threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "C:\Anaconda\Lib\threading.py", line 801, in __bootstrap_inner
self.run()
File "C:\Anaconda\Lib\threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "C:\Anaconda\Lib\site-packages\gensim\models\word2vec.py", line 701, in worker_loop
if not worker_one_job(job, init):
File "C:\Anaconda\Lib\site-packages\gensim\models\word2vec.py", line 692, in worker_one_job
tally, raw_tally = self._do_train_job(items, alpha, inits)
File "C:\Anaconda\Lib\site-packages\gensim\models\doc2vec.py", line 638, in _do_train_job
indexed_doctags = self.docvecs.indexed_doctags(doc.tags)

AttributeError: 'numpy.ndarray' object has no attribute 'tags'

My check results:
doc.tags
Traceback (most recent call last):
Debug Probe, prompt 14, line 1
AttributeError: 'numpy.ndarray' object has no attribute 'tags'

Thanks. I am using a Windows PC.

By the way, is this a word2vec or a doc2vec method?

error running tufs_cnn.py

I use a Windows PC and am trying to use this file:

https://github.com/Lab41/sunny-side-up/blob/master/src/examples/tufs_cnn.py

error is
File "C:\Sander\my_code\sunny-side-up-master_mar7\src\examples\tufs_cnn_changed.py", line 105, in
h5_path='amazon_split.hd5', overwrite_previous=False,shuffle=True)
File "C:\Sander\my_code\sunny-side-up-master_mar7\src\datasets\batch_data.py", line 403, in split_data
for new_data, new_labels in batch_iterator:

TypeError: 'NoneType' object is not iterable

Please help; the project looks very interesting!

Cannot open zipped data from sunny-side-up-master\src\datasets\downloads\

I use a Windows computer and tried to run
sunny-side-up-master\src\datasets\sentiment140.py
but an error happened, and I cannot open the downloaded file manually either:
File "C:\Anaconda\Lib\zipfile.py", line 811, in _RealGetContents
raise BadZipfile, "File is not a zip file"

zipfile.BadZipfile: File is not a zip file

The code in question, from sunny-side-up-master\src\datasets\sentiment140.py:

# backwards-compatibility
def load_data(file_path="./.downloads/sentiment140.csv", feat_extractor=None, verbose=False, return_iter=True, rng_seed=None):
    loader = Sentiment140(file_path)
    return loader.load_data(feat_extractor=feat_extractor, verbose=verbose, return_iter=return_iter, rng_seed=rng_seed)

Using 'Commit & Push' extension in JupyterHub + Dockerspawner

Hi Lab41 team! I came across this plugin while looking for a GitHub sync solution for my JupyterHub (Dockerspawner) implementation. It looks like exactly what I was looking for. I would call myself a beginner in JavaScript/AJAX, so could you let me know the steps to include this plugin?
I also saw an existing issue raised by someone using JupyterHub,
#88.

I will work on implementing this extension in my setup, so I would be glad to help out in the referenced issue too :)

Any help to configure this extension would be appreciated.

Thanks
Mrinmoy

Running into issues w/ the git extension

Hiya.

I'll start this off by saying we're not using Docker for this project and we are using Jupyterhub so it's very possible those differences are the root of all this. Oh, and I do have gitpython installed w/ pip. That said...

So I'm trying to implement the use of git-commit-push on a jupyterhub server and running into some issues I'm hoping you can comment on.

I was running into an issue on line 24 of git-commit-push.js where, upon writing a commit message and hitting the "commit and push" button, I'm met with the following message in the dev console:
github-commit-push.js?v=20160425184637:24
Uncaught TypeError: Cannot read property '1' of null

So taking a look at line 24 of the .js file, I see:
var filepath = window.location.pathname.match(re)[1];

And the regex above that in line 23 reads:
var re = /^\/notebooks(.*?)$/;

Sure enough, running window.location.pathname.match(/^\/notebooks(.*?)$/) returns "null."

Since running window.location.pathname returns the proper path to the notebook ("/user/[username]/notebooks/[filename].ipynb"), I assumed something must be up with the regex. Again, sure enough, using
window.location.pathname.match(/\/notebooks\/(.*)$/)[1]
returns "[filename].ipynb" as intended.

Unfortunately, solving that resulted in the following error on a new attempt to commit:
github-commit-push.js?v=20160426122528:70
PUT https://[server]/git/commit 405 (Method Not Allowed)
and this is where I'm stumped. I'm not sure what this is trying to do with that URL on my Jupyter[Hub] server, or why. My guess is that this is simply tied to your use of Docker and fails on my end because we're not using it, so there are differences in where the AJAX calls are meant to be sent.

I know next to nothing about javascript or ajax, let alone your docker environments, so I'm hoping you can chime in with your thoughts on what could be going on.

Sorry this is so long-winded. Ha. Cheers!

Error in Sentiment140_W2V_Pipeline.py: label is the integer 1 when used as a str

File "C:\Sander\my_code\sunny-side-up-master_mar7\src\Baseline\Word2Vec\Sentiment140_W2V_Pipeline.py", line 215, in
main(sys.argv[1:])
File "C:\Sander\my_code\sunny-side-up-master_mar7\src\Baseline\Word2Vec\Sentiment140_W2V_Pipeline.py", line 188, in main
model = train_d2v_model(all_data, epoch_num=10)
File "C:\Sander\my_code\sunny-side-up-master_mar7\src\Baseline\Word2Vec\Sentiment140_W2V_Pipeline.py", line 50, in train_d2v_model
ls = LabeledSentence(preprocess_tweet(sentence).split(), [label + '_%d' % neg_count])

TypeError: unsupported operand type(s) for +: 'int' and 'str'

The problem seems to be that label is the integer 1 at the point where it is used as a string (e.g. in if label == 'pos':). For example,
['qqq' + '_%d' % neg_count]
evaluates to
['qqq_0']
and works, because 'qqq' is a string.
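
A minimal, self-contained sketch of the workaround this report implies (formatting instead of concatenating, so an integer label no longer raises a TypeError; an illustrative assumption, not a confirmed fix from the repo):

# values from the report: the label arrives as the integer 1
label, neg_count = 1, 0

# label + '_%d' % neg_count        # raises TypeError: unsupported operand type(s) for +: 'int' and 'str'
tag = '%s_%d' % (label, neg_count)  # works for both int and str labels
print(tag)                          # -> '1_0'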

ImportError: No module named corpus_cython

File "/Users/i/Documents/twitter/sunny-side-up/glove/corpus.py", line 11, in
from corpus_cython import construct_cooccurrence_matrix
ImportError: No module named corpus_cython

Is there any missing file?

Thanks

Update: I think the problem is caused by Cython. Mac OS X 10.11 does not have a suitable gcc version to compile the Cython files.
