
Lda2vec-Tensorflow

A Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum.

Note

This is very much a research algorithm. It doesn't always work well, and it takes a long time to train. As the author notes in the paper, standard LDA will usually work better.

Note that you should run this algorithm for at least 100 epochs before expecting to see any results. The algorithm is meant to run for a very long time.

Usage

Installation

Warning: you may have to install dependencies manually before you can use the package. The requirements can be found here. If you install them in a clean environment, you should be good to go. Help with this issue is welcome.

Clone the repo and run python setup.py install to install the package as is, or run python setup.py develop if you want to make your own edits.

You can also just pip install lda2vec (last updated 3/13/19).

Pretrained Embeddings

This repo can load a wide variety of pretrained embedding files (see nlppipe.py for more info). The examples all use GloVe embeddings, which you can download from here.

Preprocessing

Preprocessing is all done through the nlppipe.py file using spaCy. Feel free to use your own preprocessing if you like.

At the most basic level, if you would like to get your data processed for lda2vec, you can do the following:

import pandas as pd
from lda2vec.nlppipe import Preprocessor

# Data directory
data_dir ="data"
# Where to save preprocessed data
clean_data_dir = "data/clean_data"
# Name of input file. Should be inside of data_dir
input_file = "20_newsgroups.txt"
# Should we load pretrained embeddings from file
load_embeds = True

# Read in data file
df = pd.read_csv(data_dir+"/"+input_file, sep="\t")

# Initialize a preprocessor
P = Preprocessor(df, "texts", max_features=30000, maxlen=10000, min_count=30)

# Run the preprocessing on your dataframe
P.preprocess()

# Load embeddings from file if we choose to do so
if load_embeds:
    # Load embedding matrix from file path - change path to where you saved them
    embedding_matrix = P.load_glove("PATH/TO/GLOVE/glove.6B.300d.txt")
else:
    embedding_matrix = None

# Save data to data_dir
P.save_data(clean_data_dir, embedding_matrix=embedding_matrix)

When you run the twenty newsgroups preprocessing example, it will create a directory tree that looks like this:

├── my_project
│   ├── data
│   │   ├── 20_newsgroups.txt
│   │   └── clean_data
│   │       ├── doc_lengths.npy
│   │       ├── embedding_matrix.npy
│   │       ├── freqs.npy
│   │       ├── idx_to_word.pickle
│   │       ├── skipgrams.txt
│   │       └── word_to_idx.pickle
│   ├── load_20newsgroups.py
│   └── run_20newsgroups.py

Using the Model

To run the model, pass the same data_path to the load_preprocessed_data function and then use that data to instantiate and train the model.

from lda2vec import utils, model

# Path to preprocessed data
data_path = "data/clean_data"
# Whether or not to load saved embeddings file
load_embeds = True

# Load data from files
(idx_to_word, word_to_idx, freqs, pivot_ids,
 target_ids, doc_ids, embed_matrix) = utils.load_preprocessed_data(data_path, load_embed_matrix=load_embeds)

# Number of unique documents
num_docs = doc_ids.max() + 1
# Number of unique words in vocabulary (int)
vocab_size = len(freqs)
# Embed layer dimension size
# If not loading embeds, change 128 to whatever size you want.
embed_size = embed_matrix.shape[1] if load_embeds else 128
# Number of topics to cluster into
num_topics = 20
# Amount of iterations over entire dataset
num_epochs = 200
# Batch size - Increase/decrease depending on memory usage
batch_size = 4096
# Epoch that we want to "switch on" LDA loss
switch_loss_epoch = 0
# Pretrained embeddings value
pretrained_embeddings = embed_matrix if load_embeds else None
# If True, save logdir, otherwise don't
save_graph = True


# Initialize the model
m = model(num_docs,
          vocab_size,
          num_topics,
          embedding_size=embed_size,
          pretrained_embeddings=pretrained_embeddings,
          freqs=freqs,
          batch_size=batch_size,
          save_graph_def=save_graph)

# Train the model
m.train(pivot_ids,
        target_ids,
        doc_ids,
        len(pivot_ids),
        num_epochs,
        idx_to_word=idx_to_word,
        switch_loss_epoch=switch_loss_epoch)

Visualizing the Results

We can now visualize the results of our model using pyLDAvis:

utils.generate_ldavis_data(data_path, m, idx_to_word, freqs, vocab_size)

This will launch pyLDAvis in your browser, allowing you to explore the learned topics interactively.



lda2vec-tensorflow's Issues

InvalidArgumentError (see above for traceback): indices[104] = 3541 is not in [0, 3541)

I'm trying to run 'example_run.py' with Anaconda python 3.6, tensorflow 1.4.0

InvalidArgumentError (see above for traceback): indices[104] = 3541 is not in [0, 3541)
[[Node: nce_loss/negative_sampling/nce_loss/embedding_lookup_1 = Gather[Tindices=DT_INT64, Tparams=DT_FLOAT, _class=["loc:@nce_biases"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](nce_biases/read, nce_loss/negative_sampling/nce_loss/concat)]]

Unable to install LDA2Vec in Windows

I'm trying to install Lda2vec in a virtualenv by downloading Lda2vec-Tensorflow.zip and installing it through pip install setup.py, but the procedure fails with the error shown in the attached screenshot.

Windows 10 64-bit
Python 3.6.2

Alter nlppipe.py file to allow Keras Tokenizer to handle tokenization

It seems that by using spaCy, there are a lot of weird things you have to work around (mostly referring to the hashes that are created). Instead of doing tokenization through spaCy, perhaps we could use spaCy to alter the input text in the way we need, write the text back out as a list of altered strings, and then use Keras's Tokenizer to handle the rest of the work, as sketched below.

This may reduce the amount of time it takes to run the tokenizer.
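
A rough sketch of what that could look like, using spaCy only to clean/lemmatize the raw text and letting Keras's Tokenizer build the vocabulary (the model name, column handling, and helper names here are illustrative assumptions, not the repo's actual API):

import spacy
from keras.preprocessing.text import Tokenizer

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def clean_texts(texts):
    # Use spaCy only to normalize each document (lemmas, no stopwords/punctuation),
    # then hand the plain strings back as a list.
    cleaned = []
    for doc in nlp.pipe(texts, batch_size=1000):
        cleaned.append(" ".join(t.lemma_ for t in doc if not t.is_stop and not t.is_punct))
    return cleaned

texts = clean_texts(["first raw document", "second raw document"])

# Keras handles tokenization and the word <-> index mappings.
tokenizer = Tokenizer(num_words=30000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_to_idx = tokenizer.word_index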

InvalidArgumentError (see above for traceback): indices[478] = 5451 is not in [0, 5451)

I'm getting this error on Epoch 1 of run_20newsgroups.py:

InvalidArgumentError (see above for traceback): indices[478] = 5451 is not in [0, 5451)
[[node word_embed_lookup (defined at /Users/davidlaxer/Lda2vec-Tensorflow/lda2vec/Lda2vec.py:152) = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@Optimizer/train/update_word_embedding/AssignSub"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](word_embedding/read, _arg_x_pivot_idxs_0_1, word_embed_lookup/axis)]]

It seems that the 'word_embed_lookup ' tensor contains an embedding reference beyond the length of the embedding_matrix. Any ideas where this 'off by one' issue could be?

# Word embedding lookup
word_context = tf.nn.embedding_lookup(self.w_embed.embedding, x, name='word_embed_lookup')

similar to:
#5
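
One way to narrow this down (a diagnostic sketch only, using the arrays returned by utils.load_preprocessed_data as in the README) is to check whether any skipgram index ever reaches vocab_size, since the word embedding and nce_biases tensors only have vocab_size rows:

import numpy as np

# vocab_size = len(freqs), as in the README example.
print("max pivot id :", np.max(pivot_ids), "  vocab_size:", vocab_size)
print("max target id:", np.max(target_ids), "  vocab_size:", vocab_size)

# If either max equals vocab_size, the word <-> index mapping is off by one
# (e.g. ids start at 1 somewhere) relative to the embedding matrix.
assert np.max(pivot_ids) < vocab_size
assert np.max(target_ids) < vocab_size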

Restore not working properly after updates

When we simplified the model, we deleted some of the variables in handles within the model to make the code less confusing. However, after reviewing, we can see that the word embedding variables and the doc mixture variables are not saved and reloaded properly.

The main reason for this is that they are stored within the w_embed and doc_mixture class instances, which are not savable in the config file. So we need to extract these variables and add them to the config manually to allow restore to work properly.

Question about get_k_closest

Hello nateraw, nice implementation of lda2vec!

I am investigating the use of lda2vec to get the most similar phrases in a text. Initially I have a dataframe with M rows, where M is the number of phrases.

I train lda2vec for some epochs and then use the function get_k_closest(idx of the phrase I want to check, in_type='doc', vs_type='doc', k=2). Would this be the idea?

The thing is, I am not sure whether the indices of the docs are the same as the ones in my original dataframe.

Thanks!

ValueError: too many values to unpack (expected 19)

I get this error when using the larger glove file:

embedding_matrix = P.load_glove("glove.840B.300d.txt")

restore=True
pretrained_embeddings=embedding_matrix

m = model(num_docs,
vocab_size,
num_topics=num_topics,
embedding_size=embed_size,
restore=True,
logdir="/data/logdir_190318_1739_190320_0734",
pretrained_embeddings=embedding_matrix,
freqs=freqs)


ValueError Traceback (most recent call last)
in ()
25 logdir="/data/logdir_190318_1739_190320_0734",
26 pretrained_embeddings=embedding_matrix,
---> 27 freqs=freqs)
28
29 m.train(pivot_ids,target_ids,doc_ids, len(pivot_ids), num_epochs, idx_to_word=idx_to_word, switch_loss_epoch=5)

~/Lda2vec-Tensorflow/lda2vec/Lda2vec.py in init(self, num_unique_documents, vocab_size, num_topics, freqs, save_graph_def, embedding_size, num_sampled, learning_rate, lmbda, alpha, power, batch_size, logdir, restore, fixed_words, factors_in, pretrained_embeddings)
109 self.fraction, self.loss_lda, self.loss, self.loss_avgs_op,
110 self.optimizer, self.merged, embedding, nce_weights, nce_biases,
--> 111 doc_embedding, topic_embedding) = handles
112
113 self.w_embed = W.Word_Embedding(self.embedding_size, self.vocab_size, self.num_sampled,

ValueError: too many values to unpack (expected 19)

Issue installing Lda2vec

Running python setup.py build seems to work.
Then python setup.py install throws the error:

byte-compiling build/bdist.linux-x86_64/egg/lda2vec/Lda2vec.py to Lda2vec.pyc
  File "build/bdist.linux-x86_64/egg/lda2vec/Lda2vec.py", line 84
    self.x, self.y, self.docs, self.additional_features, self.step, self.switch_loss, self.pivot, self.doc, self.context, self.loss_word2vec, self.fraction, self.loss_lda, self.loss, self.loss_avgs_op, self.optimizer, self.doc_embedding, self.topic_embedding, self.word_embedding, self.nce_weights, self.nce_biases, self.merged, *kg = handles
                                                                                                                                                                                                                                                                                                                                         ^
SyntaxError: invalid syntax

Thanks for the help!

Reproducible working example in new version of Lda2Vec

I've made tons of changes over the last few weeks. This has caused things to break, and my working example no longer works 😢. So a new reproducible example needs to be made. This is highly related to #8, where you can see that we ended up with a working example. However, with the new changes, we should be able to remake this reliably, straight from running the run_20newsgroups.py file.

TypeError: __init__() got an unexpected keyword argument 'load_embeds'

Now that I am trying to run the actual lda2vec process, I get the following error:

freqs=freqs) # Python list of shape (vocab_size,). Frequencies of each token, same order as embed matrix mappings.
TypeError: init() got an unexpected keyword argument 'load_embeds'

Any help is appreciated. Thanks!

about Sense2VecComponent

Hi, I downloaded the code and installed all required packages, but there is still a problem with "from sense2vec import Sense2VecComponent": unresolved reference "Sense2VecComponent". Any idea how to solve it? Thanks.

Foreign words not in spacy model or GloVe

Here's an example of training on my 'stories.txt' file, which comprises ~5,100 news stories captured from numerous news sites:

EPOCH: 15
LOSS 626.1764 w2v 6.0411406 lda 620.13525
---------Closest 10 words to given indexes----------
Topic 0 : mosharrof, mehazabien, assamese, shuvro, azerbaijani, newscasts, basque, l, mehzbin, galician
Topic 1 : newscasts, disqus, studentu, mehazabien, mosharrof, ticker, assamese, moldovan, mehzbin, azerbaijani
Topic 2 : newscasts, galician, azerbaijani, mehazabien, assamese, basque, flemish, moldovan, mosharrof, shuvro
Topic 3 : mosharrof, mehazabien, shuvro, newscasts, azerbaijani, mehzbin, assamese, l, allen, mim
Topic 4 : mosharrof, mehazabien, marathi, moldovan, assamese, galician, azerbaijani, faroese, shuvro, maltese
Topic 5 : institutionalizing, mehazabien, mosharrof, mehzbin, mim, melanie, shuvro, azerbaijani, safa, malay
Topic 6 : newscasts, mehazabien, mosharrof, shuvro, azerbaijani, assamese, studentu, sonny, basque, mehzbin
Topic 7 : mehazabien, mosharrof, marathi, shuvro, assamese, mehzbin, malay, basque, azerbaijani, oriya
Topic 8 : mehazabien, mosharrof, shuvro, newscasts, mim, mehzbin, allen, l, azerbaijani, assamese
Topic 9 : mehazabien, mosharrof, shuvro, mehzbin, l, kabir, mim, jovan, toya, allen
Topic 10 : mehazabien, mosharrof, oceania, assamese, messenger, newscasts, studentu, moldovan, shuvro, mehzbin
Topic 11 : mosharrof, mehazabien, azerbaijani, shuvro, assamese, newscasts, l, basque, safa, allen
Topic 12 : mehazabien, mosharrof, shuvro, mehzbin, assamese, newscasts, azerbaijani, l, mim, allen
Topic 13 : mehazabien, mosharrof, shuvro, mehzbin, l, azerbaijani, assamese, newscasts, mim, allen
Topic 14 : mehazabien, mosharrof, assamese, azerbaijani, shuvro, newscasts, l, basque, malay, mehzbin
Topic 15 : mehazabien, mosharrof, eish, shuvro, newscasts, azerbaijani, assamese, studentu, chery, mehzbin
Topic 16 : mehazabien, mosharrof, lithuanian, kurmanji, azerbaijani, kyrgyz, moldovan, assamese, burmese, belarusian
Topic 17 : newscasts, mehazabien, mosharrof, bookmark, allen, l, shuvro, disqus, talkup, azerbaijani
Topic 18 : mehazabien, mosharrof, newscasts, shuvro, azerbaijani, galician, assamese, bienenretter, aktuell, malay
Topic 19 : assamese, azerbaijani, mosharrof, allen, moldovan, mehzbin, shuvro, lithuanian, marathi, mehazabien

Many of these words are not in:
glove.6B.300d.txt
glove.840B.300d.txt

Or in: en_core_web_lg

Should we filter out foreign words?
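
One option is to drop tokens that never appear in the GloVe vocabulary before running the preprocessor. A rough sketch (this filtering step is not part of the repo; the file path and column name follow the README example):

# Build the set of words GloVe actually knows about.
glove_vocab = set()
with open("PATH/TO/GLOVE/glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        glove_vocab.add(line.split(" ", 1)[0])

# Drop out-of-GloVe tokens from the raw texts before preprocessing.
df["texts"] = df["texts"].apply(
    lambda t: " ".join(w for w in str(t).split() if w.lower() in glove_vocab))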

OOM with GPU computing

Great package!

I tried to speed up my computation on a cloud-based P2 (K80 GPU) instance using an ancient tensorflow-gpu package (1.2.1). I think it doesn't have the run option report_tensor_allocations_upon_oom=True, so I set my batch size to 10.

I found it still went OOM when trying to build a graph of (10967, 20). I attached the screenshot here. Maybe the batch size is not actually used in some place in the code? Just curious.

I haven't tried this on my own GPU yet, so I'm not sure whether the same error occurs on a more recent TF package with the correct run option setting.

(Update: I just ran the code on an EC2 instance with TF-GPU 1.09, and there was no error. I guess the issue is incompatibility with the old TF package. However, fixing that OOM may require some changes in the code.)


see topics extracted by lda2vec

I know pyLDAvis is one way to visualize lda2vec results. Is there a way, or a location, where all the topics (or the words related to a topic along with their weights) are stored? Thanks!
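
The topic vectors live in the trained model's topic embedding, in the same space as the word embedding, so you can pull out the top words and their similarity scores yourself. A rough sketch (the attribute names below come from older versions of the code and may differ in yours):

import numpy as np

# Pull the learned matrices out of the TensorFlow session.
topic_matrix = m.sesh.run(m.mixture.topic_embedding)   # (num_topics, embed_size)
word_matrix = m.sesh.run(m.w_embed.embedding)          # (vocab_size, embed_size)

# Cosine similarity between every topic and every word.
topic_norm = topic_matrix / np.linalg.norm(topic_matrix, axis=1, keepdims=True)
word_norm = word_matrix / np.linalg.norm(word_matrix, axis=1, keepdims=True)
sims = topic_norm.dot(word_norm.T)                     # (num_topics, vocab_size)

for k in range(sims.shape[0]):
    top = np.argsort(-sims[k])[:10]
    print("Topic", k, ":", [(idx_to_word[i], round(float(sims[k, i]), 3)) for i in top])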

doc_lengths values not purged when purging documents that are too short

doc_lengths is used when visualizing topics with pyLDAvis. Currently, you'll get an error when trying to visualize the topics, saying that its length does not equal num_docs.

This stems from the fact that we purge documents in nlppipe.py if they are too short to create skipgrams. Document lengths corresponding to purged documents are never purged, so you are left with the original length of the input texts instead of the actual number of documents we processed.
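
Until the preprocessor is fixed, a possible workaround before calling generate_ldavis_data is to keep only the lengths of documents that survived (a sketch; it assumes the ids in doc_ids still index into the saved doc_lengths array):

import numpy as np

doc_lengths = np.load("data/clean_data/doc_lengths.npy")

# Keep only the lengths of documents that actually made it into the skipgram data.
kept_ids = np.unique(doc_ids)
doc_lengths = doc_lengths[kept_ids]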

Dependency on sense2vec

I couldn't get sense2vec to build on my system. I haven't dug into the code to see what exactly it is contributing, but how important is it to the process of lda2vec? After the changes I made to the imports list, I commented out the sense2vec nlp appends, and I was able to get the code to import.

What exactly is the function of sense2vec? (I have barely started working on the lda2vec code; I'm not even close to digging into sense2vec!) Is this an important dependency, or is it something we could replace with something else?

ValueError: not enough values to unpack (expected 7, got 6)

In run_20newsgroups.py:

When load_embeds=False, utils.load_preprocessed_data() gets an exception unpacking the results:

load_embeds = False

# Load data from files
(idx_to_word, word_to_idx, freqs, pivot_ids,
 target_ids, doc_ids, embed_matrix) = utils.load_preprocessed_data(data_path, load_embed_matrix=load_embeds)

ValueError Traceback (most recent call last)
in ()
8 # Load data from files
9 (idx_to_word, word_to_idx, freqs, pivot_ids,
---> 10 target_ids, doc_ids, embed_matrix) = utils.load_preprocessed_data(data_path, load_embed_matrix=load_embeds)
11
12 # Number of unique documents

ValueError: not enough values to unpack (expected 7, got 6)

Data pre-processing

Hi,

Does lowercasing change the modeling results? I see that the sample data 20_newsgroups.txt has not been pre-processed. In my case I'd like to do some pre-processing before feeding the data into Lda2vec. Thank you!
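
Lowercasing is easy to bolt onto the README's preprocessing example before the Preprocessor sees the data; a minimal sketch (assuming the text column is called "texts"):

import pandas as pd
from lda2vec.nlppipe import Preprocessor

df = pd.read_csv("data/20_newsgroups.txt", sep="\t")

# Any custom cleanup (here just lowercasing) goes in before preprocessing.
df["texts"] = df["texts"].str.lower()

P = Preprocessor(df, "texts", max_features=30000, maxlen=10000, min_count=30)
P.preprocess()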

predict

Hi, I'm trying to get the predict function in Lda2vec.py to work, but I keep getting an error for doc_ids.

I tried fixing it this way:
def predict(self, pivot_words, doc_ids_nd_arr, temp_batch_size):
    doc_ids_list = doc_ids_nd_arr.tolist()
    len_doc_ids = len(doc_ids_list)
    doc_ids_new_nd_arr = doc_ids_nd_arr.reshape(len_doc_ids, 1)
    print(doc_ids_new_nd_arr.shape)
    doc_ids = tf.placeholder(tf.int32, shape=(len_doc_ids, 1), name="doc_ids")

    context = self.sesh.run([self.context], feed_dict={self.x: pivot_words, doc_ids: doc_ids_new_nd_arr})

The error I'm getting is:
Caused by op 'doc_ids', defined at:
File "run.py", line 20, in
freqs=freqs)
File "/root/Lda2vec-Tensorflow/tests/twenty_newsgroups/lda2vec/Lda2vec.py", line 80, in init
handles = self._build_graph()
File "/root/Lda2vec-Tensorflow/tests/twenty_newsgroups/lda2vec/Lda2vec.py", line 124, in _build_graph
docs = tf.placeholder(tf.int32, shape=[None], name='doc_ids')
File "/usr/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1746, in placeholder
return gen_array_ops._placeholder(dtype=dtype, shape=shape, name=name)
File "/usr/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3051, in _placeholder
"Placeholder", dtype=dtype, shape=shape, name=name)
File "/usr/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
op_def=op_def)
File "/usr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1650, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'doc_ids' with dtype int32 and shape [?]
[[Node: doc_ids = Placeholderdtype=DT_INT32, shape=[?], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[Node: context_vector/_113 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_22_context_vector", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

But it did not work. Can you guide me on this? Thanks.
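
The error suggests the graph's own doc_ids placeholder (built in _build_graph) is never fed, because a brand-new placeholder was created instead. A hedged sketch of an alternative, assuming the model keeps references to those placeholders (e.g. self.x and self.docs, as in older versions of Lda2vec.py):

import numpy as np

def predict(self, pivot_words, doc_ids):
    # Feed the placeholders that already exist in the graph instead of defining new ones.
    # Both inputs are 1-D integer arrays of the same length.
    feed_dict = {self.x: np.asarray(pivot_words, dtype=np.int32),
                 self.docs: np.asarray(doc_ids, dtype=np.int32)}
    return self.sesh.run(self.context, feed_dict=feed_dict)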

General Questions

Hi, I've been trying to reimplement lda2vec as well. I can't seem to get your repo running (some dependency problems with sense2vec), and have a couple of questions that I hope you can answer:

  1. do you get significant speedups when using a GPU? I'm getting slowdowns: I think it's because the model is small, and transferring data between the CPU and GPU takes more time than the time savings when running computations on the GPU.

  2. How long does it generally take for one epoch for your 20newsgroups test case? I'm getting 18m training examples (word pairs + document id), and 1 epoch takes several hours, which is pretty terrible.

  3. Are there any questions you have about lda2vec that you think is worth discussing?

Upgrade repo to Tensorflow v1.12

Recently, I was finally able to get newer versions of Tensorflow running on my old hardware (Non-AXV compatible). So, we should make sure everything written is compatible with newer versions of Tensorflow. This will mitigate issues discussed in #27.

where should I put the pre-trained glove vector?

Thanks for sharing the code, which is very clear and extremely helpful for me!!
Maybe my question is a little dumb, but where should I place the pre-trained GloVe vectors?

Actually, I tried to install pre-trained vectors from the spaCy website (https://spacy.io/models/); with a little modification of the code it can run now. But many warnings pop up, saying "Warning! Document 8967 broke, likely due to spaCy merge issues. More info at their github, issues #1547 and #1474".

So I guess my modification has some problem. Can you please tell me how you handle this?

Thanks a lot!!

error generated by EmbedMixture class: Caused by op 'doc_proportions'

@nateraw I'm trying to use your code. I can confirm it is clearer than all the versions I tried before. But I get the following exception. I don't know whether it is related to my use of a CPU instead of a GPU, or whether a GPU is mandatory.

raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[214] = 499 is not in [0, 499)
        [[Node: doc_proportions = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@Doc_Embedding"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Doc_Embedding/read, _arg_doc_ids_0_0)]]

Caused by op 'doc_proportions', defined at:
 File "/project/6008168/tamouze/Python_directory/Lda2vec-Tensorflow/example_run.py", line 33, in <module>
   restore=False)
 File "/project/6008168/tamouze/Python_directory/Lda2vec-Tensorflow/lda2vec.py", line 54, in __init__
   handles = self._build_graph()
 File "/project/6008168/tamouze/Python_directory/Lda2vec-Tensorflow/lda2vec.py", line 105, in _build_graph
   doc = self.mixture(doc_ids=docs)
 File "/project/6008168/tamouze/Python_directory/Lda2vec-Tensorflow/embedding_mixture.py", line 30, in __call__
   proportions = self.proportions(doc_ids, softmax=True)
 File "/project/6008168/tamouze/Python_directory/Lda2vec-Tensorflow/embedding_mixture.py", line 43, in proportions
   w = tf.nn.embedding_lookup(self.Doc_Embedding, doc_ids, name="doc_proportions")
 File "/project/6008168/tamouze/Python_directory/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/embedding_ops.py", line 327, in embedding_lookup
   transform_fn=None)
 File "/project/6008168/tamouze/Python_directory/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/embedding_ops.py", line 151, in _embedding_lookup_and_transform
   result = _clip(_gather(params[0], ids, name=name), ids, max_norm)
 File "/project/6008168/tamouze/Python_directory/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/embedding_ops.py", line 55, in _gather
   return array_ops.gather(params, ids, name=name)
 File "/project/6008168/tamouze/Python_directory/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 2667, in gather
   params, indices, validate_indices=validate_indices, name=name)
 File "/project/6008168/tamouze/Python_directory/tensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1777, in gather
   validate_indices=validate_indices, name=name)
 File "/project/6008168/tamouze/Python_directory/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
   op_def=op_def)
 File "/project/6008168/tamouze/Python_directory/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
   op_def=op_def)
 File "/project/6008168/tamouze/Python_directory/tensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
   self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): indices[214] = 499 is not in [0, 499)
        [[Node: doc_proportions = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@Doc_Embedding"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Doc_Embedding/read, _arg_doc_ids_0_0)]]


Can you please check?

run_20newsgroups.py gets OSError: File logdir_180711_0406/model.ckpt.meta does not exist

Hi, I just started trying out this lib. Are the two test files load_20newsgroups.py and run_20newsgroups.py meant to run right off the bat? I can get load_20newsgroups.py to run through, but run_20newsgroups.py returns this error:

OSError: File logdir_180711_0406/model.ckpt.meta does not exist.

It would be helpful if you could add a few lines suggesting an end-to-end run-through. Then users would have a better direction for digging in.

Rewrite nlppipe as series of functions as opposed to class

The class seems kind of unnecessary. It would be nice if it were all functions that you could call via a single entry point, i.e.

import nlppipe

texts = ["some document", "another doc", "last document"]  # etc.
nlppipe.process(texts)

On top of that, it would be helpful if this could just be called directly in run_20newsgroups.py, checking whether preprocessing has already been completed (with the same specifications/parameters passed to the process function); see the sketch below.
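
A sketch of what that could look like (the process wrapper and the already-done check are assumptions built on top of the current Preprocessor, not existing code):

import os
from lda2vec.nlppipe import Preprocessor

def process(df, text_col, out_dir, embedding_path=None, **kwargs):
    # Run preprocessing only if it hasn't already been done for out_dir.
    if os.path.exists(os.path.join(out_dir, "skipgrams.txt")):
        print("Preprocessed data already found in", out_dir, "- skipping.")
        return
    P = Preprocessor(df, text_col, **kwargs)
    P.preprocess()
    embedding_matrix = P.load_glove(embedding_path) if embedding_path else None
    P.save_data(out_dir, embedding_matrix=embedding_matrix)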

spacy-related error

I am getting the following error after running the preprocessing code provided:

  File ".../nlppipe.py", line 41, in __init__
    self.nlp = spacy.load(nlp, disable = ['ner', 'tagger', 'parser'])
  File ".../spacy/__init__.py", line 21, in load
    return util.load_model(name, overrides)
  File ".../spacy/util.py", line 119, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Any help would be appreciated.
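
E050 usually just means the en_core_web_lg model has not been downloaded into the active environment; a quick way to fetch it from Python using spaCy's own downloader (it is a large download):

import spacy

# Download the large English model that nlppipe.py loads, then verify it loads.
spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")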

ValueError: Variable topic_embedding already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

I get this error when pretrained_embeddings=None

m = model(num_docs,
          vocab_size,
          num_topics=num_topics,
          #embedding_size=embed_size,
          restore=False,
          #logdir="/data/",
          pretrained_embeddings=None,
          freqs=freqs)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-2552ac324704> in <module>()
     25           #logdir="/data/",
     26           pretrained_embeddings=None,
---> 27           freqs=freqs)
     28 
     29 m.train(pivot_ids,target_ids,doc_ids, len(pivot_ids), num_epochs, idx_to_word=idx_to_word,  switch_loss_epoch=5)

~/Lda2vec-Tensorflow/lda2vec/Lda2vec.py in __init__(self, num_unique_documents, vocab_size, num_topics, freqs, save_graph_def, embedding_size, num_sampled, learning_rate, lmbda, alpha, power, batch_size, logdir, restore, fixed_words, factors_in, pretrained_embeddings)
     76                                             power=self.power)
     77             # Initialize the Topic-Document Mixture
---> 78             self.mixture = M.EmbedMixture(self.num_unique_documents, self.num_topics, self.embedding_size)
     79 
     80 

~/Lda2vec-Tensorflow/lda2vec/embedding_mixture.py in __init__(self, n_documents, n_topics, n_dim, temperature, W_in, factors_in, name)
     27         self.topic_embedding = tf.get_variable('topic_embedding', shape=[n_topics, n_dim],
     28                                                dtype=tf.float32,
---> 29                                                initializer=tf.orthogonal_initializer(gain=scalar)) if factors_in is None else factors_in
     30 
     31 

~/anaconda/envs/ai/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in get_variable(name, shape, dtype, initializer, regularizer, trainable, collections, caching_device, partitioner, validate_shape, use_resource, custom_getter, constraint, synchronization, aggregation)
   1485       constraint=constraint,
   1486       synchronization=synchronization,
-> 1487       aggregation=aggregation)
   1488 
   1489 

~/anaconda/envs/ai/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in get_variable(self, var_store, name, shape, dtype, initializer, regularizer, reuse, trainable, collections, caching_device, partitioner, validate_shape, use_resource, custom_getter, constraint, synchronization, aggregation)
   1235           constraint=constraint,
   1236           synchronization=synchronization,
-> 1237           aggregation=aggregation)
   1238 
   1239   def _get_partitioned_variable(self,

~/anaconda/envs/ai/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in get_variable(self, name, shape, dtype, initializer, regularizer, reuse, trainable, collections, caching_device, partitioner, validate_shape, use_resource, custom_getter, constraint, synchronization, aggregation)
    538           constraint=constraint,
    539           synchronization=synchronization,
--> 540           aggregation=aggregation)
    541 
    542   def _get_partitioned_variable(self,

~/anaconda/envs/ai/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in _true_getter(name, shape, dtype, initializer, regularizer, reuse, trainable, collections, caching_device, partitioner, validate_shape, use_resource, constraint, synchronization, aggregation)
    490           constraint=constraint,
    491           synchronization=synchronization,
--> 492           aggregation=aggregation)
    493 
    494     # Set trainable value based on synchronization value.

~/anaconda/envs/ai/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in _get_single_variable(self, name, shape, dtype, initializer, regularizer, partition_info, reuse, trainable, collections, caching_device, validate_shape, use_resource, constraint, synchronization, aggregation)
    859                          "reuse=tf.AUTO_REUSE in VarScope? "
    860                          "Originally defined at:\n\n%s" % (
--> 861                              name, "".join(traceback.format_list(tb))))
    862       found_var = self._vars[name]
    863       if not shape.is_compatible_with(found_var.get_shape()):

ValueError: Variable topic_embedding already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

  File "/home/ubuntu/Lda2vec-Tensorflow/lda2vec/embedding_mixture.py", line 29, in __init__
    initializer=tf.orthogonal_initializer(gain=scalar)) if factors_in is None else factors_in
  File "/home/ubuntu/Lda2vec-Tensorflow/lda2vec/Lda2vec.py", line 78, in __init__
    self.mixture = M.EmbedMixture(self.num_unique_documents, self.num_topics, self.embedding_size)
  File "<ipython-input-8-6f2c3ffe8774>", line 27, in <module>
    freqs=freqs

same result for all topics generated

To my understanding, the topic matrix is generated in the same space as the word vectors, and we use the topic matrix to find the most similar words by cosine similarity. The words we find can then act as a representation of the topic.

But the results I got show that all the generated topic vectors are almost the same. I cannot figure out why. Here are some of the results:

print(topic[4])
[-0.9622485 0.8895183 0.8651555 -0.9276399 -0.9396336 0.93779755
-0.9743131 0.94305694 -0.92948157 -1.0672562 0.946625 -0.99164987
0.8959647 0.95344895 0.9274684 -0.97949797 0.97142816 0.947076
-1.0015502 0.96531034 0.8757545 0.94082266 0.954677 -0.97633624
0.87975 0.9366757 -0.93371624 -0.85707355 0.98357856 -0.93866247
0.9577415 0.94209754 -0.97033393 -0.9504832 0.9234292 0.9165397
-0.9694142 0.91393214 0.9972066 -0.9942078 -0.9907095 -0.9176958
0.93074447 -0.8706515 -0.92425114 -1.0101646 0.95657563 -1.0012354
0.95422584 -0.764645 0.9863512 -0.99371105 0.9823682 0.64269054
-0.9487983 -0.56981754 -1.0187954 0.9872439 0.67288846 0.92767256
-0.95255184 0.7126149 -0.92712885 0.9122812 0.8112471 -1.0150576
0.80759007 0.9772657 0.974533 -0.89622474 0.96457 -0.94705147
0.9997022 -0.9722624 0.9418657 0.9430709 -0.96311724 -0.97360986
-0.8987086 0.9817178 0.8594237 -0.93254995 -0.87266266 0.98293287
-0.6322944 -0.9245911 0.95225286 -1.0082532 -0.9219543 -0.9784668
-0.9714366 -0.9701755 0.9802913 -0.94296515 -0.89987594 -0.9654876
-0.92532563 -0.9081519 0.7952786 0.9535129 ]
print(topic[5])
[-0.9545226 0.8836221 0.8790286 -0.9436467 -0.9647125 0.95075834
-0.9890084 0.9377537 -0.94952726 -1.0689101 0.980626 -0.9908181
0.8998709 0.94127303 0.9263142 -0.96562505 0.99156046 0.95024383
-1.0077744 0.99384195 0.8860567 0.92229956 0.9736233 -0.96262467
0.89396423 0.9315409 -0.9482396 -0.85639435 0.9852119 -0.9602194
0.95691586 0.94624454 -0.98274666 -0.9827932 0.9232413 0.93340456
-0.97113854 0.93778706 1.0019037 -0.9843718 -1.0034899 -0.92478126
0.95473534 -0.8701034 -0.9313964 -0.9949094 0.97523534 -1.0191345
0.9864202 -0.73943955 1.0138143 -0.9930289 0.9773597 0.6448753
-0.94340485 -0.55352324 -1.004822 0.99961305 0.68788236 0.9397265
-0.9823522 0.75456184 -0.9445327 0.9221488 0.8499458 -1.0050296
0.8211724 0.9643316 0.98302233 -0.8961856 0.9766408 -0.9336542
1.0224456 -0.982251 0.9577986 0.97083366 -0.94915915 -0.9802646
-0.9033424 0.97875696 0.8598247 -0.91498125 -0.8607036 0.98732114
-0.643369 -0.93571526 0.96445656 -1.0014955 -0.94695365 -0.9552077
-0.98248726 -0.99457294 0.9754661 -0.9417462 -0.87800306 -0.9567253
-0.94087964 -0.9052637 0.78514445 0.94565785]

Issue in pyLDAvis run

I am getting an error while running the run_20newsgroups.py file. The problem seems to come from the prepare function of pyLDAvis, which does not seem to be getting the correct input variables. Could anyone please help with this?

File "run_20newsgroups.py", line 35, in
utils.generate_ldavis_data(data_path, m, idx_to_word, freqs, vocab_size)
File "/home/amit/intent/lib/python3.5/site-packages/lda2vec-0.15.0-py3.5.egg/lda2vec/utils.py", line 174, in generate_ldavis_data
File "/home/amit/intent/lib/python3.5/site-packages/pyLDAvis/_prepare.py", line 374, in prepare
_input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
File "/home/amit/intent/lib/python3.5/site-packages/pyLDAvis/_prepare.py", line 65, in _input_validate
raise ValidationError('\n' + '\n'.join([' * ' + s for s in res]))
pyLDAvis._prepare.ValidationError:

  • Length of doc_lengths not equal to the number of rows in doc_topic_dists; both should be equal to the number of documents in the data.

factors_in parameter never used

The factors_in parameter in the Lda2vec model class __init__ function is never actually passed to the __init__ method of embedding_mixture.EmbedMixture().
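
The fix is presumably a one-line change where the mixture is built in Lda2vec.py; a sketch only, with the keyword name taken from EmbedMixture's __init__ signature quoted earlier:

# Inside Lda2vec.__init__, pass the parameter through instead of dropping it.
self.mixture = M.EmbedMixture(self.num_unique_documents, self.num_topics,
                              self.embedding_size, factors_in=factors_in)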

generate document

Please send me your private email. I want to send you private code about generating topics for a test document.

Did you get the chance to implement it?

Default to save every epoch if save_every=None

Instead of saving after a certain number of steps, save every epoch by default.

Also, perhaps we aren't interested in saving stepwise anyway; maybe change save_every to be a number of epochs. In that case, the save_every default would be 1.

visualization of results

Hi nateraw, I find this extremely helpful! I'm looking to visualize the results. Do you have any examples or instructions for visualizing the topic mix and doing prediction for a new document? Thanks!

Add Dockerfile

There are some strange nuances between the different versions of TensorFlow. Since my hardware setup restricts me to using TensorFlow 1.5, the setup for this repo is kind of strange. By adding a Dockerfile, you should be able to mimic my environment exactly. The only issue is that Mac is not supported by nvidia-docker. Still basically the same headache, but at least the environment would be the same with the Dockerfile on Linux and Windows.

File "ops.pyx", line 111, in thinc.neural.ops.Ops.flatten IndexError: list index out of range

I'm trying to run 'run_20newsgroups.py' from the latest update. I'm getting this error:
File "ops.pyx", line 111, in thinc.neural.ops.Ops.flatten
IndexError: list index out of range

I am running Anaconda Python 3.6 in an Anaconda environment named 'spacy'.
I changed the call to NlpPipeline as follows:

SP = NlpPipeline(path_to_file, 50, merge=True, num_threads=8, context=False, vectors="google_news_model")
    # SP = NlpPipeline(path_to_file, 50, merge=True, num_threads=8, context=True, usecols=["texts"], vectors="google_news_model")
(spacy) David-Laxers-MacBook-Pro:Lda2vec-Tensorflow davidlaxer$ python run_20newsgroups.py 
/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
made texts
about to enter that pipe
Traceback (most recent call last):
  File "run_20newsgroups.py", line 18, in <module>
    SP = NlpPipeline(path_to_file, 50, merge=True, num_threads=8, context=False, vectors="google_news_model")
  File "/Users/davidlaxer/Lda2vec-Tensorflow/lda2vec/nlppipe.py", line 101, in __init__
    self.tokenize()
  File "/Users/davidlaxer/Lda2vec-Tensorflow/lda2vec/nlppipe.py", line 176, in tokenize
    for row, doc in enumerate(self.nlp.pipe(self.texts, n_threads=self.num_threads, batch_size=1000)):
  File "/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/site-packages/spacy/language.py", line 554, in pipe
    for doc in docs:
  File "nn_parser.pyx", line 369, in pipe
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
  File "nn_parser.pyx", line 376, in pipe
  File "nn_parser.pyx", line 403, in spacy.syntax.nn_parser.Parser.parse_batch
  File "nn_parser.pyx", line 724, in spacy.syntax.nn_parser.Parser.get_batch_model
  File "/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/site-packages/thinc/api.py", line 61, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/site-packages/thinc/api.py", line 292, in begin_update
    X, bp_layer = layer.begin_update(layer.ops.flatten(seqs_in, pad=pad),
  File "ops.pyx", line 111, in thinc.neural.ops.Ops.flatten
IndexError: list index out of range
(spacy) David-Laxers-MacBook-Pro:Lda2vec-Tensorflow davidlaxer$ 

With the original code:

(spacy) David-Laxers-MacBook-Pro:Lda2vec-Tensorflow davidlaxer$ python run_20newsgroups.py 
/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Traceback (most recent call last):
  File "/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/site-packages/pandas/indexes/base.py", line 2134, in get_loc
    return self._engine.get_loc(key)
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'texts'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_20newsgroups.py", line 19, in <module>
    SP = NlpPipeline(path_to_file, 50, merge=True, num_threads=8, context=True, usecols=["texts"], vectors="google_news_model")
  File "/Users/davidlaxer/Lda2vec-Tensorflow/lda2vec/nlppipe.py", line 101, in __init__
    self.tokenize()
  File "/Users/davidlaxer/Lda2vec-Tensorflow/lda2vec/nlppipe.py", line 149, in tokenize
    self.texts = df[text_col_name].values.astype(str).tolist()
  File "/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/site-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)
  File "/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/site-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)
  File "/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/site-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)
  File "/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/site-packages/pandas/core/internals.py", line 3543, in get
    loc = self.items.get_loc(item)
  File "/Users/davidlaxer/anaconda/envs/spacy/lib/python3.6/site-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'texts'
(spacy) David-Laxers-MacBook-Pro:Lda2vec-Tensorflow davidlaxer$ 

Loss is not decreasing

Dear Mr. Nateraw,

I am a graduate student from South Korea who is interested in NLP.

First of all, thank you for your TensorFlow lda2vec code.

I have tested your code with the suggested data and followed your installation instructions:

  • data
    -. glove.6B.zip
    -. 20_newsgroups.txt
  • parameters
    -. no. of topics = 20
    -. no. of epochs = 200
    -. batch_size = 500
    -. switch_loss_epoch = 0

After that, I got the result below: the loss started at 3381 and finished at 3381 (see the attached epoch 1 and epoch 200 screenshots).

Questions

  • Does your trial also have the same loss?
  • If there is something I did wrong, please let me know.

Thank you in advance.

PS - The code does not work with Python 3.7.1 and tensorflow-gpu 1.13.1, but Python 3.6.8 with tensorflow-gpu 1.12.0 works fine.

Working Example

I've been working on this code base for quite a while, but I still have yet to see a working example. I've played with calculating the loss function differently, all sorts of hyperparameters, and different ways of preprocessing the data, but I still haven't seen this code, or the original author's, actually work.

So, if anybody wants to contribute an example that is reproducible, please let me know! Let me know if I can help explain what's going on in any of the files. Thank you.
