
topic-noise-models-source's People

Contributors

rchurch4


topic-noise-models-source's Issues

CalledProcessError

Hello Team,

Thank you for this repo and the Python package.

I am using the Python package for topic modelling on Twitter data and have set up my code based on your example on Medium, as follows:

from gdtm.models import TND
# Set these paths to the path where you saved the Mallet implementation of each model, plus bin/mallet
tnd_path = 'C:/Users/sadick/Downloads/topic-noise-models-source-main.zip/mallet-tnd/bin/mallet'

# We pass in the paths to the java code along with the data set and whatever parameters we want to set
model = TND(dataset=dataset, mallet_path=tnd_path, k=30, beta1=25, top_words=20)

topics = model.get_topics()
noise = model.get_noise_distribution()

When I run the code, I get the traceback below:

CalledProcessError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_20604/2788543094.py in <module>
5
6 # We pass in the paths to the java code along with the data set and whatever parameters we want to set
----> 7 model = TND(dataset=dataset, mallet_path=tnd_path, k=30, beta1=25, top_words=20)
8
9 topics = model.get_topics()
~\.conda\envs\general\lib\site-packages\gdtm\models\tnd.py in __init__(self, dataset, k, alpha, beta0, beta1, noise_words_max, iterations, top_words, topic_word_distribution, noise_distribution, corpus, dictionary, mallet_path, random_seed, run, workers)
76 self._prepare_data()
77 if self.noise_distribution is None:
---> 78 self._compute_tnd()
79
80 def _prepare_data(self):
~\.conda\envs\general\lib\site-packages\gdtm\models\tnd.py in _compute_tnd(self)
96
97 """
---> 98 model = TNDMallet(self.mallet_path, self.corpus, num_topics=self.k, id2word=self.dictionary,
99 workers=self.workers,
100 alpha=self.alpha, beta=self.beta0, skew=self.beta1,
~\.conda\envs\general\lib\site-packages\gdtm\wrappers\tnd.py in __init__(self, mallet_path, corpus, num_topics, alpha, beta, id2word, workers, prefix, optimize_interval, iterations, topic_threshold, random_seed, noise_words_max, skew, is_parent)
81 self.skew = skew
82 if corpus is not None and not is_parent:
---> 83 self.train(corpus)
84
85
~\.conda\envs\general\lib\site-packages\gdtm\wrappers\tnd.py in train(self, corpus)
104
105 """
--> 106 self.convert_input(corpus, infer=False)
107 cmd = self.mallet_path + ' train-topics --input %s --num-topics %s --alpha %s --optimize-interval %s '
108 '--num-threads %s --output-state %s --output-doc-topics %s --output-topic-keys %s '
~\.conda\envs\general\lib\site-packages\gdtm\wrappers\base_wrapper.py in convert_input(self, corpus, infer, serialize_corpus)
215 cmd = cmd % (self.fcorpustxt(), self.fcorpusmallet())
216 logger.info("converting temporary corpus to MALLET format with %s", cmd)
--> 217 check_output(args=cmd, shell=True)
218
219 def __getitem__(self, bow, iterations=100):
~\.conda\envs\general\lib\site-packages\gensim\utils.py in check_output(stdout, *popenargs, **kwargs)
1889 error = subprocess.CalledProcessError(retcode, cmd)
1890 error.output = output
-> 1891 raise error
1892 return output
1893 except KeyboardInterrupt:
CalledProcessError: Command 'C:/Users/sadick/Downloads/topic-noise-models-source-main.zip/mallet-tnd/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\sadick\AppData\Local\Temp\750b80_corpus.txt --output C:\Users\sadick\AppData\Local\Temp\750b80_corpus.mallet' returned non-zero exit status 1.

I would be grateful if you could take a look and provide some guidance on this issue.

Regards
Sadick
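
A possible first check for this report: the `tnd_path` above passes through `topic-noise-models-source-main.zip`, which looks like an archive that has not been extracted, so the launcher may not exist as a real file on disk. The sketch below is only a diagnostic aid and is not part of gdtm; the function name `diagnose_mallet_path` is hypothetical, and a folder could legitimately be named `something.zip`, so the first message is a hint rather than a verdict.

```python
import os


def diagnose_mallet_path(mallet_path):
    """Return a list of likely problems with a MALLET launcher path.

    Hypothetical helper for illustration; gdtm does not provide it.
    """
    problems = []

    # Split on both separators so Windows and POSIX style paths are handled.
    parts = mallet_path.replace("\\", "/").split("/")

    # A parent component ending in .zip usually means the archive was never
    # extracted, so the operating system cannot execute files "inside" it.
    if any(part.lower().endswith(".zip") for part in parts[:-1]):
        problems.append("path passes through a .zip archive; extract it first")

    # The launcher must exist as a regular file for the shell to run it.
    if not os.path.isfile(mallet_path):
        problems.append("launcher file does not exist at this path")

    return problems


if __name__ == "__main__":
    path = ("C:/Users/sadick/Downloads/"
            "topic-noise-models-source-main.zip/mallet-tnd/bin/mallet")
    for problem in diagnose_mallet_path(path):
        print(problem)
```

If the list is empty, the path itself is probably fine and the cause of the non-zero exit status lies elsewhere.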

Standalone use of Mallet TND

First of all, thank you for your Topic Noise Models!
I am currently trying to train TND on Windows so that I can combine the noise distribution with existing topic models. However, I had difficulties with the Python package, probably because of Windows, so I tried to use the Java program directly. To do this, I downloaded the mallet-tnd project from GitHub and imported it into my IDE, and I can now train TND.

I have some questions about the parameters.

First, which parameter of mallet-tnd corresponds to beta_1 (which determines whether a word is more of a topic word or a noise word)? I have seen that there is a constructor that takes the parameter "skew", which, if I understand it correctly, is related to beta_1. However, I am not sure how high I should set "skew". My data is very noisy (chats from messenger services).

Second, could you please explain how I get the noise distribution? Your implementation of "ParallelTopicModel" has the variable "noiseDistribution" (an int[] array). When you call "get_noise_distribution()" in Python, which variable exactly are you accessing? Is it the output of the function "printTopNoise"?

I would be very grateful for an answer! Many thanks in advance and best regards!
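
On the beta_1 question, the traceback in the CalledProcessError report on this page gives a hint: gdtm's `TND` class calls the wrapper with `num_topics=self.k`, `alpha=self.alpha`, `beta=self.beta0`, `skew=self.beta1`. That suggests `skew` is the wrapper-level name for beta_1. Whether the Java constructor's "skew" is the same quantity is an assumption here, not something this page confirms. The mapping can be written down as a small sketch; the names `TND_TO_WRAPPER` and `to_wrapper_kwargs` are hypothetical and only restate what the traceback shows.

```python
# Parameter names read off the gdtm traceback quoted on this page:
#   TNDMallet(..., num_topics=self.k, workers=self.workers,
#             alpha=self.alpha, beta=self.beta0, skew=self.beta1, ...)
TND_TO_WRAPPER = {
    "k": "num_topics",
    "workers": "workers",
    "alpha": "alpha",
    "beta0": "beta",
    "beta1": "skew",
}


def to_wrapper_kwargs(**tnd_params):
    """Translate TND-style parameter names into wrapper-style names.

    Hypothetical helper for illustration; it covers only the names that
    are visible in the traceback and rejects anything else.
    """
    unknown = sorted(set(tnd_params) - set(TND_TO_WRAPPER))
    if unknown:
        raise KeyError(f"no known wrapper name for: {unknown}")
    return {TND_TO_WRAPPER[name]: value for name, value in tnd_params.items()}


if __name__ == "__main__":
    print(to_wrapper_kwargs(k=30, beta1=25))
```

This does not answer how large "skew" should be for very noisy chat data; that still needs guidance from the maintainers.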

FileNotFoundError: [Errno 2] No such file or directory: 'D:\\Temp\\e5be94_state.mallet.gz'

Hello Team,
Thank you for this repo and the Python package.
I am using the Python package for topic modelling on some simple test data and have set up my code based on your example on Medium, as follows:

from gensim import corpora
from gdtm.wrappers import TNDMallet

# Set the path to the path where you saved the Mallet implementation of the model, plus bin/mallet
tnd_path = 'D:/mallet-tnd/bin/mallet'
dataset = [['model', 'study', 'patient', 'university', 'student', 'result', 'method', 'patient'],
           ['statistics', 'data', 'patient', 'study', 'analysis', 'method'],
           ['test', 'using', 'regression', 'statistical', 'research', 'article', 'group', 'score', 'factor']]

# Format the data set for consumption by the wrapper (this is done automatically in class-based models)
dictionary = corpora.Dictionary(dataset)
corpus = [dictionary.doc2bow(doc) for doc in dataset]

# Pass in the path to the java code along with the data set and parameters
model = TNDMallet(tnd_path, corpus, num_topics=1, id2word=dictionary,
                  skew=25, noise_words_max=1, iterations=1000)
topics = model.get_topics()
noise = model.load_noise_dist()

When I run the code, I get the traceback below:

Traceback (most recent call last):
  File "C:/Users/雷神/Desktop/毕业论文改/主题模型/GDTM.py", line 12, in <module>
    model = TNDMallet(tnd_path, corpus, num_topics=1, id2word=dictionary,
  File "D:\Anaconda3\lib\site-packages\gdtm\wrappers\tnd.py", line 83, in __init__
    self.train(corpus)
  File "D:\Anaconda3\lib\site-packages\gdtm\wrappers\tnd.py", line 121, in train
    self.word_topics = self.load_word_topics()
  File "D:\Anaconda3\lib\site-packages\gdtm\wrappers\base_wrapper.py", line 271, in load_word_topics
    with utils.open(self.fstate(), 'rb') as fin:
  File "D:\Anaconda3\lib\site-packages\smart_open\smart_open_lib.py", line 235, in open
    binary = _open_binary_stream(uri, binary_mode, transport_params)
  File "D:\Anaconda3\lib\site-packages\smart_open\smart_open_lib.py", line 398, in _open_binary_stream
    fobj = submodule.open_uri(uri, mode, transport_params)
  File "D:\Anaconda3\lib\site-packages\smart_open\local_file.py", line 34, in open_uri
    fobj = io.open(parsed_uri['uri_path'], mode)
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\Temp\\e5be94_state.mallet.gz'

Process finished with exit code 1

I would be grateful if you could take a look and provide some guidance on this issue.

Regards
Yuehong
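
One way to narrow this down: the error is raised when the wrapper tries to read the state file, which suggests the MALLET training command finished without writing it. Running that command by hand and capturing its error output may show the real cause. The sketch below is a generic diagnostic that uses only the Python standard library; `run_and_check` is a hypothetical name and not part of gdtm, and the command string to pass in is whatever the wrapper builds on your machine.

```python
import os
import subprocess


def run_and_check(cmd, expected_outputs=()):
    """Run a shell command and report its exit status, output and missing files.

    Hypothetical helper for illustration. `expected_outputs` lists files the
    command is supposed to create, such as the *_state.mallet.gz file named
    in the FileNotFoundError.
    """
    # capture_output collects stdout and stderr instead of discarding them,
    # so a Java or launcher error message becomes visible.
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

    # Record which of the expected files were not produced.
    missing = [path for path in expected_outputs if not os.path.exists(path)]

    return {
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
        "missing_outputs": missing,
    }


if __name__ == "__main__":
    # Stand-in command that fails on purpose; replace it with the real one.
    report = run_and_check("echo boom 1>&2 && exit 3", ["no_such_state.mallet.gz"])
    print(report["returncode"], report["stderr"].strip(), report["missing_outputs"])
```

A non-zero return code together with a message in `stderr` would point at the launcher or the Java setup rather than at the Python code.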
