gu-datalab / topic-noise-models-source
Mallet implementations of Topic-Noise Models
License: Apache License 2.0
Hello Team,
Thank you for this repo and the python package.
I am using the Python package for topic modelling on Twitter data and have my code set up based on your example on Medium, as follows:
from gdtm.models import TND
# Set these paths to the path where you saved the Mallet implementation of each model, plus bin/mallet
tnd_path = 'C:/Users/sadick/Downloads/topic-noise-models-source-main.zip/mallet-tnd/bin/mallet'
# We pass in the paths to the java code along with the data set and whatever parameters we want to set
model = TND(dataset=dataset, mallet_path=tnd_path, k=30, beta1=25, top_words=20)
topics = model.get_topics()
noise = model.get_noise_distribution()
When I run the code, I get the traceback below:
CalledProcessError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_20604/2788543094.py in <module>
5
6 # We pass in the paths to the java code along with the data set and whatever parameters we want to set
----> 7 model = TND(dataset=dataset, mallet_path=tnd_path, k=30, beta1=25, top_words=20)
8
9 topics = model.get_topics()
~\.conda\envs\general\lib\site-packages\gdtm\models\tnd.py in __init__(self, dataset, k, alpha, beta0, beta1, noise_words_max, iterations, top_words, topic_word_distribution, noise_distribution, corpus, dictionary, mallet_path, random_seed, run, workers)
76 self._prepare_data()
77 if self.noise_distribution is None:
---> 78 self._compute_tnd()
79
80 def _prepare_data(self):
~\.conda\envs\general\lib\site-packages\gdtm\models\tnd.py in _compute_tnd(self)
96
97 """
---> 98 model = TNDMallet(self.mallet_path, self.corpus, num_topics=self.k, id2word=self.dictionary,
99 workers=self.workers,
100 alpha=self.alpha, beta=self.beta0, skew=self.beta1,
~\.conda\envs\general\lib\site-packages\gdtm\wrappers\tnd.py in __init__(self, mallet_path, corpus, num_topics, alpha, beta, id2word, workers, prefix, optimize_interval, iterations, topic_threshold, random_seed, noise_words_max, skew, is_parent)
81 self.skew = skew
82 if corpus is not None and not is_parent:
---> 83 self.train(corpus)
84
85
~\.conda\envs\general\lib\site-packages\gdtm\wrappers\tnd.py in train(self, corpus)
104
105 """
--> 106 self.convert_input(corpus, infer=False)
107 cmd = self.mallet_path + ' train-topics --input %s --num-topics %s --alpha %s --optimize-interval %s '
108 '--num-threads %s --output-state %s --output-doc-topics %s --output-topic-keys %s '
~\.conda\envs\general\lib\site-packages\gdtm\wrappers\base_wrapper.py in convert_input(self, corpus, infer, serialize_corpus)
215 cmd = cmd % (self.fcorpustxt(), self.fcorpusmallet())
216 logger.info("converting temporary corpus to MALLET format with %s", cmd)
--> 217 check_output(args=cmd, shell=True)
218
219 def __getitem__(self, bow, iterations=100):
~\.conda\envs\general\lib\site-packages\gensim\utils.py in check_output(stdout, *popenargs, **kwargs)
1889 error = subprocess.CalledProcessError(retcode, cmd)
1890 error.output = output
-> 1891 raise error
1892 return output
1893 except KeyboardInterrupt:
CalledProcessError: Command 'C:/Users/sadick/Downloads/topic-noise-models-source-main.zip/mallet-tnd/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\sadick\AppData\Local\Temp\750b80_corpus.txt --output C:\Users\sadick\AppData\Local\Temp\750b80_corpus.mallet' returned non-zero exit status 1.
I would be grateful if you could have a look and provide some guidance on this issue.
Regards
Sadick
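A note on the failing command in the traceback above: the mallet_path points inside the downloaded .zip archive, so the shell cannot execute the Mallet script. Below is a minimal sketch of the same call, assuming the archive has first been extracted to an ordinary folder (the extracted location shown is hypothetical):

from gdtm.models import TND

# Hypothetical path after extracting topic-noise-models-source-main.zip;
# mallet_path must point at the bin/mallet script on disk, not inside the archive.
tnd_path = 'C:/Users/sadick/topic-noise-models-source-main/mallet-tnd/bin/mallet'

# Same call as in the snippet above: k topics, beta1 skew, top_words per topic.
model = TND(dataset=dataset, mallet_path=tnd_path, k=30, beta1=25, top_words=20)

topics = model.get_topics()
noise = model.get_noise_distribution()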
First of all, thank you for your Topic Noise Models!
I am currently trying to train TND on Windows (so that I can combine the noise distribution with existing topic models). However, I had difficulties with the Python package, probably due to Windows, so I tried to use the Java program directly. To do this, I downloaded the mallet-tnd project from GitHub and imported it into my IDE, and I can now train TND. However, I have questions about the parameters.
First, which parameter of mallet-tnd corresponds to beta_1 (the one that determines whether a word is more of a topic word or a noise word)? I have seen that there is a constructor that takes a "skew" parameter which, if I understand it correctly, is related to beta_1, but I am not sure how high to set "skew". My data is very noisy (chats from messenger services).
Furthermore, could you please explain how I get the noise distribution? Your implementation of ParallelTopicModel has the variable "noiseDistribution" (an int[] array). When get_noise_distribution() is called in Python, which variable exactly is being accessed: the output of the function printTopNoise?
I would be very grateful for an answer! Many thanks in advance and best regards!
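On the beta_1 question, the wrapper call visible in the first traceback (alpha=self.alpha, beta=self.beta0, skew=self.beta1) indicates that the Python beta1 parameter is forwarded to mallet-tnd as skew, so skew is the beta_1 knob. A minimal sketch using the lower-level TNDMallet wrapper directly, with hypothetical paths and toy data (which file the Java side writes for load_noise_dist() is for the authors to confirm):

from gensim import corpora
from gdtm.wrappers import TNDMallet

# Hypothetical location of the extracted mallet-tnd script.
tnd_path = '/path/to/mallet-tnd/bin/mallet'

# Toy corpus in the same gensim format used elsewhere in this thread.
docs = [['model', 'study', 'patient'], ['statistics', 'data', 'analysis']]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# skew corresponds to beta_1; noise_words_max and iterations as in the examples above.
model = TNDMallet(tnd_path, corpus, num_topics=2, id2word=dictionary,
                  skew=25, noise_words_max=50, iterations=1000)

# The wrapper loads the noise distribution that the Java code writes to disk.
noise = model.load_noise_dist()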
Hello Team,
Thank you for this repo and the python package.
I am using the Python package for topic modelling on simple test data and have my code set up based on your example on Medium, as follows:
from gensim import corpora
from gdtm.wrappers import TNDMallet
tnd_path = 'D:/mallet-tnd/bin/mallet'
dataset = [['model', 'study', 'patient', 'university', 'student', 'result', 'method', 'patient'],
           ['statistics', 'data', 'patient', 'study', 'analysis', 'method'],
           ['test', 'using', 'regression', 'statistical', 'research', 'article', 'group', 'score', 'factor']]
dictionary = corpora.Dictionary(dataset)
corpus = [dictionary.doc2bow(doc) for doc in dataset]
model = TNDMallet(tnd_path, corpus, num_topics=1, id2word=dictionary,
                  skew=25, noise_words_max=1, iterations=1000)
topics = model.get_topics()
noise = model.load_noise_dist()
When I run the code, I get the traceback below:
Traceback (most recent call last):
File "C:/Users/雷神/Desktop/毕业论文改/主题模型/GDTM.py", line 12, in
model = TNDMallet(tnd_path, corpus, num_topics=1, id2word=dictionary,
File "D:\Anaconda3\lib\site-packages\gdtm\wrappers\tnd.py", line 83, in init
self.train(corpus)
File "D:\Anaconda3\lib\site-packages\gdtm\wrappers\tnd.py", line 121, in train
self.word_topics = self.load_word_topics()
File "D:\Anaconda3\lib\site-packages\gdtm\wrappers\base_wrapper.py", line 271, in load_word_topics
with utils.open(self.fstate(), 'rb') as fin:
File "D:\Anaconda3\lib\site-packages\smart_open\smart_open_lib.py", line 235, in open
binary = _open_binary_stream(uri, binary_mode, transport_params)
File "D:\Anaconda3\lib\site-packages\smart_open\smart_open_lib.py", line 398, in _open_binary_stream
fobj = submodule.open_uri(uri, mode, transport_params)
File "D:\Anaconda3\lib\site-packages\smart_open\local_file.py", line 34, in open_uri
fobj = io.open(parsed_uri['uri_path'], mode)
FileNotFoundError: [Errno 2] No such file or directory: 'D:\Temp\e5be94_state.mallet.gz'
Process finished with exit code 1
I would be grateful if you could have a look and provide some guidance on this issue.
Regards
Yuehong
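A note on the traceback above: e5be94_state.mallet.gz is the --output-state file that Mallet's train-topics step writes (see the command string in the first traceback), so a FileNotFoundError here usually means the underlying Mallet call failed before producing any output, for example because the script at tnd_path cannot be executed or Java is not on the PATH. A minimal diagnostic sketch, not part of gdtm, just to confirm the script runs at all:

import subprocess

# Hypothetical sanity check: invoke the Mallet script from the snippet above with
# no arguments; it should print its command list if Java and the script are usable.
tnd_path = 'D:/mallet-tnd/bin/mallet'
result = subprocess.run(tnd_path, shell=True, capture_output=True, text=True)
print(result.returncode)
print(result.stdout or result.stderr)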