gu-datalab / topic-noise-models-source
Mallet implementations of Topic-Noise Models
License: Apache License 2.0
Hello Team,
Thank you for this repo and the python package.
I am using the Python package for topic modelling on Twitter data and have my code set up based on your example on Medium, as follows:
from gdtm.models import TND
# Set these paths to the path where you saved the Mallet implementation of each model, plus bin/mallet
tnd_path = 'C:/Users/sadick/Downloads/topic-noise-models-source-main.zip/mallet-tnd/bin/mallet'
# We pass in the paths to the java code along with the data set and whatever parameters we want to set
model = TND(dataset=dataset, mallet_path=tnd_path, k=30, beta1=25, top_words=20)
topics = model.get_topics()
noise = model.get_noise_distribution()
When I run the code, I get the traceback below:
CalledProcessError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_20604/2788543094.py in <module>
5
6 # We pass in the paths to the java code along with the data set and whatever parameters we want to set
----> 7 model = TND(dataset=dataset, mallet_path=tnd_path, k=30, beta1=25, top_words=20)
8
9 topics = model.get_topics()
~\.conda\envs\general\lib\site-packages\gdtm\models\tnd.py in __init__(self, dataset, k, alpha, beta0, beta1, noise_words_max, iterations, top_words, topic_word_distribution, noise_distribution, corpus, dictionary, mallet_path, random_seed, run, workers)
76 self._prepare_data()
77 if self.noise_distribution is None:
---> 78 self._compute_tnd()
79
80 def _prepare_data(self):
~\.conda\envs\general\lib\site-packages\gdtm\models\tnd.py in _compute_tnd(self)
96
97 """
---> 98 model = TNDMallet(self.mallet_path, self.corpus, num_topics=self.k, id2word=self.dictionary,
99 workers=self.workers,
100 alpha=self.alpha, beta=self.beta0, skew=self.beta1,
~\.conda\envs\general\lib\site-packages\gdtm\wrappers\tnd.py in __init__(self, mallet_path, corpus, num_topics, alpha, beta, id2word, workers, prefix, optimize_interval, iterations, topic_threshold, random_seed, noise_words_max, skew, is_parent)
81 self.skew = skew
82 if corpus is not None and not is_parent:
---> 83 self.train(corpus)
84
85
~\.conda\envs\general\lib\site-packages\gdtm\wrappers\tnd.py in train(self, corpus)
104
105 """
--> 106 self.convert_input(corpus, infer=False)
107 cmd = self.mallet_path + ' train-topics --input %s --num-topics %s --alpha %s --optimize-interval %s '
108 '--num-threads %s --output-state %s --output-doc-topics %s --output-topic-keys %s '
~\.conda\envs\general\lib\site-packages\gdtm\wrappers\base_wrapper.py in convert_input(self, corpus, infer, serialize_corpus)
215 cmd = cmd % (self.fcorpustxt(), self.fcorpusmallet())
216 logger.info("converting temporary corpus to MALLET format with %s", cmd)
--> 217 check_output(args=cmd, shell=True)
218
219 def __getitem__(self, bow, iterations=100):
~\.conda\envs\general\lib\site-packages\gensim\utils.py in check_output(stdout, *popenargs, **kwargs)
1889 error = subprocess.CalledProcessError(retcode, cmd)
1890 error.output = output
-> 1891 raise error
1892 return output
1893 except KeyboardInterrupt:
CalledProcessError: Command 'C:/Users/sadick/Downloads/topic-noise-models-source-main.zip/mallet-tnd/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\sadick\AppData\Local\Temp\750b80_corpus.txt --output C:\Users\sadick\AppData\Local\Temp\750b80_corpus.mallet' returned non-zero exit status 1.
I would be grateful if you could have a look and provide some guidance on this issue.
Regards
Sadick
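A note on the failing command in the traceback above: the mallet_path points inside the downloaded .zip archive, so the shell cannot execute the Mallet script. Below is a minimal sketch of the same call, assuming the archive has first been extracted to an ordinary folder (the extracted location shown is hypothetical):

from gdtm.models import TND

# Hypothetical path after extracting topic-noise-models-source-main.zip;
# mallet_path must point at the bin/mallet script on disk, not inside the archive.
tnd_path = 'C:/Users/sadick/topic-noise-models-source-main/mallet-tnd/bin/mallet'

# Same call as in the snippet above: k topics, beta1 skew, top_words per topic.
model = TND(dataset=dataset, mallet_path=tnd_path, k=30, beta1=25, top_words=20)

topics = model.get_topics()
noise = model.get_noise_distribution()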
First of all, thank you for your Topic Noise Models!
I am currently trying to train TND on Windows (so that I can combine the noise distribution with existing topic models). However, I had difficulties with the Python package, probably due to Windows, so I tried to use the Java program directly. To do this, I downloaded the mallet-tnd project from GitHub and imported it into my IDE, and I can now train TND. However, I have questions about the parameters.
First, which parameter of mallet-tnd corresponds to beta_1 (the one that determines whether a word is more of a topic word or a noise word)? I have seen that there is a constructor that takes a "skew" parameter which, if I understand it correctly, is related to beta_1, but I am not sure how high to set "skew". My data is very noisy (chats from messenger services).
Furthermore, could you please explain how I get the noise distribution? Your implementation of ParallelTopicModel has the variable "noiseDistribution" (an int[] array). When get_noise_distribution() is called in Python, which variable exactly is being accessed: the output of the function printTopNoise?
I would be very grateful for an answer! Many thanks in advance and best regards!
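On the beta_1 question, the wrapper call visible in the first traceback (alpha=self.alpha, beta=self.beta0, skew=self.beta1) indicates that the Python beta1 parameter is forwarded to mallet-tnd as skew, so skew is the beta_1 knob. A minimal sketch using the lower-level TNDMallet wrapper directly, with hypothetical paths and toy data (which file the Java side writes for load_noise_dist() is for the authors to confirm):

from gensim import corpora
from gdtm.wrappers import TNDMallet

# Hypothetical location of the extracted mallet-tnd script.
tnd_path = '/path/to/mallet-tnd/bin/mallet'

# Toy corpus in the same gensim format used elsewhere in this thread.
docs = [['model', 'study', 'patient'], ['statistics', 'data', 'analysis']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# skew corresponds to beta_1; noise_words_max and iterations as in the examples above.
model = TNDMallet(tnd_path, corpus, num_topics=2, id2word=dictionary,
                  skew=25, noise_words_max=50, iterations=1000)

# The wrapper loads the noise distribution that the Java code writes to disk.
noise = model.load_noise_dist()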
Hello Team,
Thank you for this repo and the python package.
I am using the Python package for topic modelling on simple test data and have my code set up based on your example on Medium, as follows:
from gensim import corpora
from gdtm.wrappers import TNDMallet
tnd_path = 'D:/mallet-tnd/bin/mallet'
dataset = [['model', 'study', 'patient', 'university', 'student', 'result', 'method', 'patient'],
           ['statistics', 'data', 'patient', 'study', 'analysis', 'method'],
           ['test', 'using', 'regression', 'statistical', 'research', 'article', 'group', 'score', 'factor']]
dictionary = corpora.Dictionary(dataset)
corpus = [dictionary.doc2bow(doc) for doc in dataset]
model = TNDMallet(tnd_path, corpus, num_topics=1, id2word=dictionary,
                  skew=25, noise_words_max=1, iterations=1000)
topics = model.get_topics()
noise = model.load_noise_dist()
When I run the code, I get the traceback below:
Traceback (most recent call last):
File "C:/Users/雷神/Desktop/毕业论文改/主题模型/GDTM.py", line 12, in
model = TNDMallet(tnd_path, corpus, num_topics=1, id2word=dictionary,
File "D:\Anaconda3\lib\site-packages\gdtm\wrappers\tnd.py", line 83, in init
self.train(corpus)
File "D:\Anaconda3\lib\site-packages\gdtm\wrappers\tnd.py", line 121, in train
self.word_topics = self.load_word_topics()
File "D:\Anaconda3\lib\site-packages\gdtm\wrappers\base_wrapper.py", line 271, in load_word_topics
with utils.open(self.fstate(), 'rb') as fin:
File "D:\Anaconda3\lib\site-packages\smart_open\smart_open_lib.py", line 235, in open
binary = _open_binary_stream(uri, binary_mode, transport_params)
File "D:\Anaconda3\lib\site-packages\smart_open\smart_open_lib.py", line 398, in _open_binary_stream
fobj = submodule.open_uri(uri, mode, transport_params)
File "D:\Anaconda3\lib\site-packages\smart_open\local_file.py", line 34, in open_uri
fobj = io.open(parsed_uri['uri_path'], mode)
FileNotFoundError: [Errno 2] No such file or directory: 'D:\Temp\e5be94_state.mallet.gz'
Process finished with exit code 1
I would be grateful if you could have a look and provide some guidance on this issue.
Regards
Yuehong
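A note on the traceback above: e5be94_state.mallet.gz is the --output-state file that Mallet's train-topics step writes (see the command string in the first traceback), so a FileNotFoundError here usually means the underlying Mallet call failed before producing any output, for example because the script at tnd_path cannot be executed or Java is not on the PATH. A minimal diagnostic sketch, not part of gdtm, just to confirm the script runs at all:

import subprocess

# Hypothetical sanity check: invoke the Mallet script from the snippet above with
# no arguments; it should print its command list if Java and the script are usable.
tnd_path = 'D:/mallet-tnd/bin/mallet'
result = subprocess.run(tnd_path, shell=True, capture_output=True, text=True)
print(result.returncode)
print(result.stdout or result.stderr)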