sunny-side-up

Lab41's foray into Sentiment Analysis with Deep Learning. In addition to checking out the source code, visit the Wiki for Learning Resources and possible Conferences to attend.

Try them, try them, and you may! Try them and you may, I say.

Table of Contents

  • Blog Overviews
  • Docker Environments
  • Binary Classification with Word Vectors
  • Binary Classification via Deep Learning

Blog Overviews

  • Can Word Vectors Help Predict Whether Your Chinese Tweet Gets Censored? (March 2016)
  • One More Reason Not To Be Scared of Deep Learning (March 2016)
  • Some Tips for Debugging in Deep Learning (January 2016)
  • Faster On-Ramp to Deep Learning With Jupyter-driven Docker Containers (November 2015)
  • A Tour of Sentiment Analysis Techniques: Getting a Baseline for Sunny Side Up (November 2015)
  • Learning About Deep Learning! (September 2015)

Docker Environments

  • lab41/itorch-[cpu|cuda]: iTorch IPython kernel for Torch scientific computing GPU framework
  • lab41/keras-[cpu|cuda|cuda-jupyter]: Keras neural network library (CPU or GPU backend from command line or within Jupyter notebook)
  • lab41/neon-[cuda|cuda7.5]: neon Deep Learning framework (with CUDA backend) by Nervana
  • lab41/pylearn2: pylearn2 machine learning research library
  • lab41/sentiment-ml: build word vectors (Word2Vec from gensim; GloVe from glove-python), tokenize Chinese text (jieba and pypinyin), and tokenize Arabic text (NLTK and Stanford Parser)
  • lab41/mechanical-turk: convert CSV of Arabic tweets to individual PNG images for each Tweet (to avoid machine-translation of text) and auto-submit/score Arabic sentiment survey via AWS Mechanical Turk

Binary Classification with Word Vectors

Execution

python -m benchmarks.baseline_classifiers

Word Vector Models

| model | filename | filesize | vocabulary | details |
|-------|----------|----------|------------|---------|
| Sentiment140 | sentiment140_800000.bin | 153M | 83,586 | gensim Word2Vec(size=200, window=5, min_count=10) |
| Open Weiboscope | openweibo_fullset_hanzi_CLEAN_vocab31357747.bin | 56G | 31,357,746 | jieba-tokenized Hanzi Word2Vec(size=200, window=5, min_count=1) |
| Open Weiboscope | openweibo_fullset_min10_hanzi_vocab2548911.bin | 4.6G | 2,548,911 | jieba-tokenized Hanzi Word2Vec(size=200, window=5, min_count=10) |
| Arabic Tweets | arabic_tweets_min10vocab_vocab1520226.bin | 1.2G | 1,520,226 | Stanford Parser-tokenized Word2Vec(size=200, window=5, min_count=10) |
| Arabic Tweets | arabic_tweets_NLTK_min10vocab_vocab981429.bin | 759M | 981,429 | NLTK-tokenized Word2Vec(size=200, window=5, min_count=10) |
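
All of the models above are gensim Word2Vec binaries trained on tokenized posts (jieba for Chinese, NLTK or the Stanford Parser for Arabic). A minimal sketch of training and loading such a model with the listed hyperparameters, using the gensim API of the era this repo targets (newer gensim moves these methods onto model.wv); the tokenized_tweets iterable and the query word are illustrative assumptions, not code from this repo:

from gensim.models import Word2Vec
import jieba

# Chinese posts are tokenized with jieba before training, e.g.:
tokens = jieba.lcut(u'今天天气真好')   # -> list of Hanzi tokens

# train on an iterable of tokenized documents (each a list of tokens)
model = Word2Vec(tokenized_tweets, size=200, window=5, min_count=10, workers=4)
model.save_word2vec_format('sentiment140_800000.bin', binary=True)

# load a pre-built binary and query it
model = Word2Vec.load_word2vec_format('sentiment140_800000.bin', binary=True)
print(model.most_similar('happy', topn=5))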

Training and Testing Data

train/test set filename filesize details
Sentiment140 sentiment140_800000_samples_[test/train].bin 183M 80/20 split of 1.6M emoticon-labeled Tweets
Open Weiboscope openweibo_hanzi_censored_27622_samples_[test/train].bin 25M 80/20 split of 55,244 censored posts
Open Weiboscope openweibo_800000_min1vocab_samples_[test/train].bin 564M 80/20 split of 1.6M deleted posts
Arabic Twitter arabic_twitter_1067972_samples_[test/train].bin 912M 80/20 split of 2,135,944 emoticon-and-emoji labeled Tweets
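
Each corpus above is stored as an 80/20 train/test split of labeled samples. A minimal sketch of producing such a split with scikit-learn (the texts and labels variables, their encoding, and the random seed are illustrative assumptions rather than the repo's actual pipeline):

from sklearn.model_selection import train_test_split

# texts: list of raw posts; labels: 1 for positive/censored, 0 otherwise (assumed encoding)
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=41)

print(len(texts_train), len(texts_test))  # roughly an 80/20 split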

Binary Classification via Deep Learning

CNN (Convolutional Neural Network)

Character-by-character processing, from Zhang and LeCun's Text Understanding from Scratch:

# imports for the legacy Keras API used in this example
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.optimizers import SGD

# set parameters for the final fully connected layers
fully_connected = [1024, 1024, 1]

model = Sequential()

#Input = alphabet size (67) x 1014 characters
model.add(Convolution2D(256,67,7,input_shape=(1,67,1014)))
model.add(MaxPooling2D(pool_size=(1,3)))

#Input = 336 x 256
model.add(Convolution2D(256,1,7))
model.add(MaxPooling2D(pool_size=(1,3)))

#Input = 110 x 256
model.add(Convolution2D(256,1,3))

#Input = 108 x 256
model.add(Convolution2D(256,1,3))

#Input = 106 x 256
model.add(Convolution2D(256,1,3))

#Input = 104 X 256
model.add(Convolution2D(256,1,3))
model.add(MaxPooling2D(pool_size=(1,3)))

model.add(Flatten())

#Fully Connected Layers

#Input is 8704 Output is 1024
model.add(Dense(fully_connected[0]))
model.add(Dropout(0.5))
model.add(Activation('relu'))

#Input is 1024 Output is 1024
model.add(Dense(fully_connected[1]))
model.add(Dropout(0.5))
model.add(Activation('relu'))

#Input is 1024 Output is 1
model.add(Dense(fully_connected[2]))
model.add(Activation('sigmoid'))

#Stochastic gradient parameters as set by paper
sgd = SGD(lr=0.01, decay=1e-5, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd, class_mode="binary")
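
The network above expects each document as a 1 x 67 x 1014 tensor: a one-hot column for each of up to 1,014 characters over a 67-symbol alphabet. A minimal sketch of that character quantization step (the alphabet string and encode_chars helper are hypothetical illustrations, not code from this repo):

import numpy as np

# hypothetical 67-symbol alphabet: 26 letters, 10 digits, 31 punctuation marks
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~+=<>()[]{}"
char_index = {c: i for i, c in enumerate(alphabet)}
frame_length = 1014

def encode_chars(text):
    # one-hot encode up to frame_length characters; unknown characters stay all-zero
    frame = np.zeros((1, len(alphabet), frame_length), dtype=np.float32)
    for pos, char in enumerate(text.lower()[:frame_length]):
        if char in char_index:
            frame[0, char_index[char], pos] = 1.0
    return frame

X = np.array([encode_chars(doc) for doc in ["what a great phone!", "terrible battery life"]])
# X.shape == (2, 1, 67, 1014), matching the input_shape of the first Convolution2D layer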

LSTM (Long Short-Term Memory)

# imports for the legacy Keras API used in this example
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten, Reshape
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers.embeddings import Embedding

# initialize the neural net and reshape the data
model = Sequential()
model.add(Embedding(max_features, embedding_size)) # embed into dense 3D float tensor (samples, maxlen, embedding_size)
model.add(Reshape(1, maxlen, embedding_size)) # reshape into 4D tensor (samples, 1, maxlen, embedding_size)

# convolution stack
model.add(Convolution2D(nb_feature_maps, nb_classes, filter_size_row, filter_size_col, border_mode='full')) # reshaped to 32 x maxlen x 256 (32 x 100 x 256)
model.add(Activation('relu'))

# convolution stack with regularization
model.add(Convolution2D(nb_feature_maps, nb_feature_maps, filter_size_row, filter_size_col, border_mode='full')) # reshaped to 32 x maxlen x 256 (32 x 100 x 256)
model.add(Activation('relu'))
model.add(MaxPooling2D(poolsize=(2, 2))) # reshaped to 32 x maxlen/2 x 256/2 (32 x 50 x 128)
model.add(Dropout(0.25))

# convolution stack with regularization
model.add(Convolution2D(nb_feature_maps, nb_feature_maps, filter_size_row, filter_size_col)) # reshaped to 32 x 50 x 128
model.add(Activation('relu'))
model.add(MaxPooling2D(poolsize=(2, 2))) # reshaped to 32 x maxlen/2/2 x 256/2/2 (32 x 25 x 64)
model.add(Dropout(0.25))

# fully-connected layer
model.add(Flatten())
model.add(Dense(nb_feature_maps * (maxlen/2/2) * (embedding_size/2/2), fully_connected_size))
model.add(Activation("relu"))
model.add(Dropout(0.50))

# output classifier
model.add(Dense(fully_connected_size, 1))
model.add(Activation("sigmoid"))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary")
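
The snippet above leaves its hyperparameters (max_features, maxlen, embedding_size, and so on) undefined. A minimal sketch of values consistent with the shape comments above, plus padding integer-encoded tweets to a fixed length for the Embedding layer; the values not implied by those comments and the tokenized_* / y_* variables are illustrative assumptions, not settings from this repo:

from keras.preprocessing.sequence import pad_sequences

# illustrative hyperparameters (assumed where not implied by the shape comments above)
max_features = 20000          # vocabulary size seen by the Embedding layer
maxlen = 100                  # tokens per tweet after padding/truncation
embedding_size = 256          # dimensionality of each word embedding
nb_feature_maps = 32          # convolution filters per layer
nb_classes = 1                # passed as the second positional argument of the first Convolution2D call
filter_size_row, filter_size_col = 3, 3
fully_connected_size = 256

# tweets as lists of word indices, padded/truncated to a fixed length
X_train = pad_sequences(tokenized_train_tweets, maxlen=maxlen)
X_test = pad_sequences(tokenized_test_tweets, maxlen=maxlen)

model.fit(X_train, y_train, batch_size=32, nb_epoch=3, validation_data=(X_test, y_test))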

sunny-side-up's People

Contributors

aganeshlab41, bradh41, kfoss, paulrodrigues, pcallier, shulmanbrent, ymt123

sunny-side-up's Issues

Making the git extension a generic Jupyter extension?

Whilst looking for a Jupyter notebook extension that would link notebook checkpoints to git commits and allow checkpoint recovery from git pushes, I came across your bespoke Commit-and-Push to GitHub from Jupyter Notebooks extension.

(In passing, I also found a GitCheckpoints package, though I haven't had a chance to try it/see if it works with Jupyter.)

I seem to recall from some of the Jupyter Google group discussions that elaborate checkpointing schemes were unlikely to become part of the core offering, so extensions seem to be the way to go.

FWIW, SageMathCloud also has some interesting backup and history replay features, so you can replay the history of a notebook. The granularity there seems finer than what the git-commit route is likely to offer, but it does demonstrate one way of replaying histories.

Another piece of the jigsaw when it comes to making use of commits is having an environment for comparing diffs. With GitHub supporting nbviewer displays of notebooks, I wonder whether they are also contributing to work on an nbdiff tool (nbdiff itself seems to have stalled, for example). The csiro-scientific-computing folk who did the GitCheckpoints routine also seem to have explored diffs: NotebookDiff.

Doc2Vec error in Sentiment140_W2V_Pipeline.py

at line
model.train(np.random.permutation(labeled_sent))

in code
for epoch in xrange(epoch_num):
    logging.info("Epoch %s..." % epoch)
    # Temporarily sets logging level to show only if its at least WARNING
    # This prevents model.train from overloading the log
    logging.getLogger().setLevel(logging.WARN)
    # Numpy random permutation method shuffles data in place
    # Shuffling improves the accuracy of the model
    model.train(np.random.permutation(labeled_sent))
    logging.getLogger().setLevel(logging.INFO)

The error message is:

File "C:\Anaconda\Lib\threading.py", line 774, in __bootstrap
self.__bootstrap_inner()
File "C:\Anaconda\Lib\threading.py", line 801, in __bootstrap_inner
self.run()
File "C:\Anaconda\Lib\threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "C:\Anaconda\Lib\site-packages\gensim\models\word2vec.py", line 701, in worker_loop
if not worker_one_job(job, init):
File "C:\Anaconda\Lib\site-packages\gensim\models\word2vec.py", line 692, in worker_one_job
tally, raw_tally = self._do_train_job(items, alpha, inits)
File "C:\Anaconda\Lib\site-packages\gensim\models\doc2vec.py", line 638, in _do_train_job
indexed_doctags = self.docvecs.indexed_doctags(doc.tags)

AttributeError: 'numpy.ndarray' object has no attribute 'tags'

My check results:
doc.tags
Traceback (most recent call last):
Debug Probe, prompt 14, line 1
AttributeError: 'numpy.ndarray' object has no attribute 'tags'

Thanks. I am using a Windows PC.

By the way, is this a word2vec or a doc2vec method?

error running tufs_cnn.py

I use a Windows PC and am trying to use this file:

https://github.com/Lab41/sunny-side-up/blob/master/src/examples/tufs_cnn.py

error is
File "C:\Sander\my_code\sunny-side-up-master_mar7\src\examples\tufs_cnn_changed.py", line 105, in
h5_path='amazon_split.hd5', overwrite_previous=False,shuffle=True)
File "C:\Sander\my_code\sunny-side-up-master_mar7\src\datasets\batch_data.py", line 403, in split_data
for new_data, new_labels in batch_iterator:

TypeError: 'NoneType' object is not iterable

Please help; the project looks very interesting!

Cannot open zipped data from sunny-side-up-master\src\datasets\downloads\

I use a Windows computer and tried to run
sunny-side-up-master\src\datasets\sentiment140.py
but an error happened, and I cannot open the downloaded file manually either:
File "C:\Anaconda\Lib\zipfile.py", line 811, in _RealGetContents
raise BadZipfile, "File is not a zip file"

zipfile.BadZipfile: File is not a zip file

The code in question, from sunny-side-up-master\src\datasets\sentiment140.py:

# backwards-compatibility
def load_data(file_path="./.downloads/sentiment140.csv", feat_extractor=None, verbose=False, return_iter=True, rng_seed=None):
    loader = Sentiment140(file_path)
    return loader.load_data(feat_extractor=feat_extractor, verbose=verbose, return_iter=return_iter, rng_seed=rng_seed)

Using 'Commit & Push' extension in JupyterHub + Dockerspawner

Hi Lab41 team! I came across this plugin while looking for a GitHub sync solution for my JupyterHub (Dockerspawner) implementation. It looks like exactly what I was looking for. I would call myself a beginner in JavaScript/AJAX, so could you let me know the steps to include this plugin?
I also saw an existing issue raised by someone using JupyterHub,
#88.

I will work on implementing this extension in my setup, so I would be glad to help out in the referenced issue too :)

Any help to configure this extension would be appreciated.

Thanks
Mrinmoy

Running into issues w/ the git extension

Hiya.

I'll start this off by saying we're not using Docker for this project and we are using Jupyterhub so it's very possible those differences are the root of all this. Oh, and I do have gitpython installed w/ pip. That said...

So I'm trying to implement the use of git-commit-push on a jupyterhub server and running into some issues I'm hoping you can comment on.

I was running into an issue on line 24 of git-commit-push.js where, upon writing a commit message and hitting the "commit and push" button, I'm met with the following message in the dev console:
github-commit-push.js?v=20160425184637:24
Uncaught TypeError: Cannot read property '1' of null

So taking a look at line 24 of the .js file, I see:
var filepath = window.location.pathname.match(re)[1];

And the regex above that in line 23 reads:
var re = /^\/notebooks(.*?)$/;

Sure enough, running window.location.pathname.match(/^\/notebooks(.*?)$/) returns "null."

Since running window.location.pathname returns the proper path to the notebook ("/user/[username]/notebooks/[filename].ipynb"), I assumed something must be up with the regex. Again, sure enough, using
window.location.pathname.match(/\/notebooks\/(.*)$/)[1]
returns "[filename].ipynb" as intended.

Unfortunately, solving that resulted in the following error on a new attempt to commit:
github-commit-push.js?v=20160426122528:70
PUT https://[server]/git/commit 405 (Method Not Allowed)
and this is where I'm stumped. I'm not sure what this is trying to do with that URL on my Jupyter[Hub] server, or why. My guess is that this is simply tied to your use of Docker and fails on my end because we're not using it, so there are differences in where the AJAX calls are meant to be sent.

I know next to nothing about javascript or ajax, let alone your docker environments, so I'm hoping you can chime in with your thoughts on what could be going on.

Sorry this is so long-winded. Ha. Cheers!

Error in Sentiment140_W2V_Pipeline.py: label is the integer 1 when used as a str

File "C:\Sander\my_code\sunny-side-up-master_mar7\src\Baseline\Word2Vec\Sentiment140_W2V_Pipeline.py", line 215, in
main(sys.argv[1:])
File "C:\Sander\my_code\sunny-side-up-master_mar7\src\Baseline\Word2Vec\Sentiment140_W2V_Pipeline.py", line 188, in main
model = train_d2v_model(all_data, epoch_num=10)
File "C:\Sander\my_code\sunny-side-up-master_mar7\src\Baseline\Word2Vec\Sentiment140_W2V_Pipeline.py", line 50, in train_d2v_model
ls = LabeledSentence(preprocess_tweet(sentence).split(), [label + '_%d' % neg_count])

TypeError: unsupported operand type(s) for +: 'int' and 'str'

The problem seems to be that label is the integer 1 at the point where it is used as a string (e.g. in if label == 'pos':). For example,
['qqq' + '_%d' % neg_count]
evaluates to
['qqq_0']
and works, because 'qqq' is a string.
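
A minimal, self-contained sketch of the workaround this report implies (formatting instead of concatenating, so an integer label no longer raises a TypeError; an illustrative assumption, not a confirmed fix from the repo):

# values from the report: the label arrives as the integer 1
label, neg_count = 1, 0

# label + '_%d' % neg_count        # raises TypeError: unsupported operand type(s) for +: 'int' and 'str'
tag = '%s_%d' % (label, neg_count)  # works for both int and str labels
print(tag)                          # -> '1_0'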

ImportError: No module named corpus_cython

File "/Users/i/Documents/twitter/sunny-side-up/glove/corpus.py", line 11, in
from corpus_cython import construct_cooccurrence_matrix
ImportError: No module named corpus_cython

Is there any missing file?

Thanks

Update: I think the problem is caused by Cython. Mac OS X 10.11 does not have a suitable gcc version to compile the Cython files.
