mind-lab / octis
OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
License: MIT License
Hi, this is more of a question than anything else. I've seen that for ETM model training, we must pass an embeddings path corresponding to a "pickled" file. However, I need to run ETM with rather large embeddings. Is there any plan to implement a gensim.models.KeyedVectors-based (or similar) embeddings input for this model? I've implemented something like that for an ETM package of mine, but yours has everything I need for model optimization. Would a PR on this matter be accepted?
Anyway, cheers for the nice work, this package is really great!
Gave a look at here.
Hi,
Currently, I am comparing different topic modeling algorithms using the OCTIS package. For LSI, I noticed that the OCTIS paper states that (Hofmann, 1999) is implemented, while the Github page refers to (Landauer et al. 1998). Could you specify on which work your implementation is based?
I tried:
python octis\dashboard\server.py
and I get this error:
import octis.dashboard.experimentManager as expManager
AttributeError: module 'octis' has no attribute 'dashboard'
ETM training fails.
dataset = Dataset()
dataset.load_custom_dataset_from_folder(DATASET_PATH)
model = ETM(num_topics=TOPIC_SIZE)
model_output = model.train_model(dataset)
save_model_output(model_output, MODEL_OUTPUT_PATH)
save_model_output(model, MODEL_PATH)
The following error message was displayed.
model: ETM(
(t_drop): Dropout(p=0.5, inplace=False)
(theta_act): ReLU()
(rho): Linear(in_features=300, out_features=221413, bias=False)
(alphas): Linear(in_features=300, out_features=25, bias=False)
(q_theta): Sequential(
(0): Linear(in_features=221413, out_features=800, bias=True)
(1): ReLU()
(2): Linear(in_features=800, out_features=800, bias=True)
(3): ReLU()
)
(mu_q_theta): Linear(in_features=800, out_features=25, bias=True)
(logsigma_q_theta): Linear(in_features=800, out_features=25, bias=True)
)
Traceback (most recent call last):
File "train.py", line 62, in <module>
model_output = model.train_model(dataset)
File "/usr/local/lib/python3.8/dist-packages/octis/models/ETM.py", line 57, in train_model
continue_training = self._train_epoch(epoch)
File "/usr/local/lib/python3.8/dist-packages/octis/models/ETM.py", line 149, in _train_epoch
data_batch = data.get_batch(self.train_tokens, self.train_counts, ind, len(self.vocab.keys()),
File "/usr/local/lib/python3.8/dist-packages/octis/models/ETM_model/data.py", line 17, in get_batch
doc = [doc.squeeze()]
AttributeError: 'list' object has no attribute 'squeeze'
I read that CTM uses both the preprocessed text for the BoW and the full text for the BERT embedding. How can I create a Dataset like this for the CTM model? Does saving an OCTIS dataset do this automatically?
Many thanks
When using NeuralLDA as the model in OCTIS, it simply sets the model_type of AVITM to LDA. Does it simply do a standard LDA or something else? What is the corresponding model in the Srivastava and Sutton 2017 paper?
I am trying to load a custom dataset without splitting into train, validation, test.
I first processed the content of both my files with "split = False" and obtained a NoneType object.
preprocessor = Preprocessing(vocabulary=None, max_features=None,
remove_punctuation=True, punctuation=string.punctuation,
lemmatize=True, stopword_list='french',
min_chars=1, min_words_docs=0,
language="french",
split = False)
dataset = preprocessor.preprocess_dataset(documents_path = dataPath / 'fake_docs.txt', labels_path = dataPath / 'fake_metadata.txt')
print(type(dataset))
None
To compare, I also did it with "split = True" but got a valid Dataset object this time.
preprocessor = Preprocessing(vocabulary=None, max_features=None,
remove_punctuation=True, punctuation=string.punctuation,
lemmatize=True, stopword_list='french',
min_chars=1, min_words_docs=0,
language="french",
split = True)
dataset = preprocessor.preprocess_dataset(documents_path = dataPath / 'fake_docs.txt', labels_path = dataPath / 'fake_metadata.txt')
print(type(dataset))
<octis.dataset.dataset.Dataset object at 0x7f58ad457400>
I looked into the octis.preprocessing.preprocessing.py module and found that, at lines 209 and 213, in the "else" branch after "if self.split:", there is no return statement to output the Dataset.
if self.split:
    if len(final_labels) > 0:
        train, test, y_train, y_test = train_test_split(
            range(len(final_docs)), final_labels, test_size=0.15, random_state=1, stratify=final_labels)
        train, validation = train_test_split(train, test_size=3 / 17, random_state=1, stratify=y_train)
        partitioned_labels = [final_labels[doc] for doc in train + validation + test]
        partitioned_corpus = [final_docs[doc] for doc in train + validation + test]
        document_indexes = [document_indexes[doc] for doc in train + validation + test]
        metadata["last-training-doc"] = len(train)
        metadata["last-validation-doc"] = len(validation) + len(train)
        if self.save_original_indexes:
            return Dataset(partitioned_corpus, vocabulary=vocabulary, metadata=metadata, labels=partitioned_labels,
                           document_indexes=document_indexes)
        else:
            return Dataset(partitioned_corpus, vocabulary=vocabulary, metadata=metadata, labels=partitioned_labels)
    else:
        train, test = train_test_split(range(len(final_docs)), test_size=0.15, random_state=1)
        train, validation = train_test_split(train, test_size=3 / 17, random_state=1)
        metadata["last-training-doc"] = len(train)
        metadata["last-validation-doc"] = len(validation) + len(train)
        partitioned_corpus = [final_docs[doc] for doc in train + validation + test]
        document_indexes = [document_indexes[doc] for doc in train + validation + test]
        if self.save_original_indexes:
            return Dataset(partitioned_corpus, vocabulary=vocabulary, metadata=metadata, labels=final_labels,
                           document_indexes=document_indexes)
        else:
            return Dataset(partitioned_corpus, vocabulary=vocabulary, metadata=metadata, labels=final_labels,
                           document_indexes=document_indexes)
else:
    if self.save_original_indexes:
        Dataset(final_docs, vocabulary=vocabulary, metadata=metadata, labels=final_labels,
                document_indexes=document_indexes)
    else:
        Dataset(final_docs, vocabulary=vocabulary, metadata=metadata, labels=final_labels)
Adding the return statements at both lines 209 and 213 in the octis.preprocessing.preprocessing.py module resolved the problem.
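The effect of the missing return can be reproduced without OCTIS. The sketch below is only an illustration: FakeDataset and both builder functions are hypothetical stand-ins, and only the control flow mirrors the snippet quoted above.

```python
class FakeDataset:
    """Hypothetical stand-in for octis.dataset.dataset.Dataset."""

    def __init__(self, corpus):
        self.corpus = corpus


def build_without_return(docs, split):
    # Mirrors the reported behaviour: the non-split branch builds the
    # object but never returns it, so the caller receives None.
    if split:
        return FakeDataset(docs)
    else:
        FakeDataset(docs)


def build_with_return(docs, split):
    # Same logic with the missing return added.
    if split:
        return FakeDataset(docs)
    else:
        return FakeDataset(docs)


docs = [["bonjour", "monde"], ["salut"]]
broken = build_without_return(docs, split=False)
fixed = build_with_return(docs, split=False)
print(type(broken).__name__)
print(type(fixed).__name__)
```

A function that falls off the end of a branch returns None implicitly, which matches the NoneType reported for split = False.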
Also, thank you for your incredible work, the library is really nice to use !
I get the following error when trying to do hyperparameter optimization on num_topics:
/usr/local/lib/python3.7/dist-packages/octis/models/contextualized_topic_models/models/ctm.py in __init__(self, input_size, bert_input_size, inference_type, num_topics, model_type, hidden_sizes, activation, dropout, learn_priors, batch_size, lr, momentum, solver, num_epochs, num_samples, reduce_on_plateau, topic_prior_mean, topic_prior_variance, num_data_loader_workers)
45 "input_size must by type int > 0."
46 assert isinstance(num_topics, int) and input_size > 0,
---> 47 "num_topics must by type int > 0."
48 assert model_type in ['LDA', 'prodLDA'],
49 "model must be 'LDA' or 'prodLDA'."
AssertionError: num_topics must by type int > 0.
ctm_model = CTM(model_type='prodLDA', bert_model="stsb-roberta-base-v2", inference_type="combined")
search_space = {"num_topics": Integer(5, 10, 15, 20),
"num_layers": Categorical({1, 2, 3}),
"num_neurons": Categorical({100, 200})
}
optimization_runs=20
model_runs=1
optimizer=Optimizer()
optimization_result = optimizer.optimize(
ctm_model, dataset, npmi, search_space, number_of_call=optimization_runs,
model_runs=model_runs, save_models=True,
extra_metrics=None, # to keep track of other metrics
save_path='results/test_ctm/')
optimization_result.save_to_csv("results_ctm.csv")
Happy to assist with fixing this issue.
Hello!
I used the CTM model with parameter use_partitions=False
and got
Exception: Wait! BoW and Contextual Embeddings have different sizes! You might want to check if the BoW preparation method has removed some documents.
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")
model = CTM(num_topics=25, num_epochs=100, inference_type='combined', use_partitions=False)
output = model.train_model(dataset)
npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_npmi')
npmi_score = npmi.score(output)
@silviatti Hello again,
I have another silly question for you.
The original ETM package allows the user to use pre-trained word embeddings (the file name is 'skipgram_emb_300d.txt')
How do I tell the model to look for the pretrained embedding file instead?
Is it something like:
model = ETM(num_topics=25, embeddings_path='...\path\to\embeddings\skipgram_emb_300d.txt') ?
Thank you again for your assistance :)
Luke
I am trying to fetch the DBPedia_IT dataset. I expected it to download without errors, but a UnicodeEncodeError was raised.
from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset('DBPedia_IT')
Traceback (most recent call last):
Input In [42] in <module>
dataset.fetch_dataset('DBPedia_IT')
File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\dataset.py:392 in fetch_dataset
cache = download_dataset(dataset_name, target_dir=dataset_home, cache_path=cache_path)
File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\downloader.py:77 in download_dataset
f.write(corpus.text)
File ~\Anaconda3\envs\SentenceTransformers\lib\encodings\cp1252.py:19 in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined>
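The traceback suggests the downloaded text is written with the platform default encoding (cp1252 in this Windows environment), which cannot represent every character. A minimal sketch of the usual remedy is to pass encoding="utf-8" to open explicitly. The helper write_text_utf8 and the sample string are hypothetical and are not the OCTIS downloader code.

```python
import os
import tempfile


def write_text_utf8(path, text):
    # An explicit encoding avoids depending on the locale default
    # (cp1252 in the traceback above).
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Sample text with a character (U+2192) that cp1252 cannot encode.
sample = "citt\u00e0 \u2192 perch\u00e9"

try:
    sample.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "corpus.txt")
    write_text_utf8(target, sample)
    with open(target, encoding="utf-8") as f:
        round_trip = f.read()
```

Reading the file back later needs the same explicit encoding, otherwise the locale default is applied again on the read side.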
Tried to import NMF model into Colab
from octis.models.NMF import NMF
ImportError Traceback (most recent call last)
<ipython-input-17-96f0cef9c8fa> in <module>()
----> 1 from octis.models.NMF import NMF
/usr/local/lib/python3.7/dist-packages/octis/models/NMF.py in <module>()
1 from octis.models.model import AbstractModel
2 import numpy as np
----> 3 from gensim.models import nmf
4 import gensim.corpora as corpora
5 import octis.configuration.citations as citations
ImportError: cannot import name 'nmf' from 'gensim.models' (/usr/local/lib/python3.7/dist-packages/gensim/models/__init__.py)
Probably related to #20 and #35.
If your preprocessed corpus contains any single-word document, ETM training fails. This should not happen, as the Preprocessing class has 0 as the default value for the parameters min_words_docs and min_df, which define, respectively, the minimum number of words a document must have to be kept and the minimum document frequency for words in the corpus.
I've implemented a test case illustrating the scenario. The test fails. The test code can be found here, and the error stacktrace as shown on Github actions can be seen here.
Below, the aforementioned stacktrace (on my local machine):
Current call: 0
model: ETM(
(t_drop): Dropout(p=0.5, inplace=False)
(theta_act): ReLU()
(alphas): Linear(in_features=300, out_features=16, bias=False)
(q_theta): Sequential(
(0): Linear(in_features=872, out_features=800, bias=True)
(1): ReLU()
(2): Linear(in_features=800, out_features=800, bias=True)
(3): ReLU()
)
(mu_q_theta): Linear(in_features=800, out_features=16, bias=True)
(logsigma_q_theta): Linear(in_features=800, out_features=16, bias=True)
)
Traceback (most recent call last):
File "octis_test/unified_training.py", line 55, in <module>
model_runs=5, plot_best_seen=True) # number of runs of the topic model
File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/optimization/optimizer.py", line 160, in optimize
results = self._optimization_loop(opt)
File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/optimization/optimizer.py", line 285, in _optimization_loop
f_val = self._objective_function(next_x)
File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/optimization/optimizer.py", line 217, in _objective_function
self.topk)
File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/models/ETM.py", line 60, in train_model
continue_training = self._train_epoch(epoch)
File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/models/ETM.py", line 126, in _train_epoch
self.hyperparameters['embedding_size'], self.device)
File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/models/ETM_model/data.py", line 17, in get_batch
doc = [doc.squeeze()]
AttributeError: 'list' object has no attribute 'squeeze'
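Until the model handles this case, one possible workaround (an assumption, not something confirmed by the maintainers in this thread) is to drop documents with fewer than two tokens before building the dataset. The helper drop_short_documents is hypothetical, and the corpus is assumed to be a list of token lists.

```python
def drop_short_documents(corpus, labels=None, min_tokens=2):
    """Keep only documents with at least `min_tokens` tokens.

    Labels, when given, are filtered in step with the corpus.
    """
    keep = [i for i, doc in enumerate(corpus) if len(doc) >= min_tokens]
    kept_corpus = [corpus[i] for i in keep]
    kept_labels = [labels[i] for i in keep] if labels is not None else None
    return kept_corpus, kept_labels


corpus = [["topic", "model"], ["single"], ["word", "embedding", "space"]]
labels = ["a", "b", "c"]
filtered_corpus, filtered_labels = drop_short_documents(corpus, labels)
```

Note that a later vocabulary filter could still shrink a document to one token, so the check may need to run after all other preprocessing steps.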
I tried to run the ETM model through OCTIS, but got an attribute error.
I've attached my corpus (corpus.csv; for some reason git won't let me attach an actual .tsv file) and vocabulary (vocabulary.txt) for your convenience.
Here's what I did
I complied with the format of the dataset as a .tsv and vocabulary as a .txt file with one stem per row.
I was able to load the dataset with no errors.
To run the model I did the following:
from octis.models.ETM import ETM
model_etm = ETM(num_topics=40)
output_fomc = model_etm.train_model(dataset)
model: ETM(
(t_drop): Dropout(p=0.5, inplace=False)
(theta_act): ReLU()
(rho): Linear(in_features=300, out_features=9659, bias=False)
(alphas): Linear(in_features=300, out_features=40, bias=False)
(q_theta): Sequential(
(0): Linear(in_features=9659, out_features=800, bias=True)
(1): ReLU()
(2): Linear(in_features=800, out_features=800, bias=True)
(3): ReLU()
)
(mu_q_theta): Linear(in_features=800, out_features=40, bias=True)
(logsigma_q_theta): Linear(in_features=800, out_features=40, bias=True)
)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-159-a49502c35827> in <module>
----> 1 output_fomc = model_etm.train_model(dataset)
~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM.py in train_model(self, dataset, hyperparameters, top_words)
54
55 for epoch in range(0, self.hyperparameters['num_epochs']):
---> 56 continue_training = self._train_epoch(epoch)
57 if not continue_training:
58 break
~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM.py in _train_epoch(self, epoch)
120 self.model.zero_grad()
121 data_batch = data.get_batch(self.train_tokens, self.train_counts, ind, len(self.vocab.keys()),
--> 122 self.hyperparameters['embedding_size'], self.device)
123 sums = data_batch.sum(1).unsqueeze(1)
124 if self.hyperparameters['bow_norm']:
~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM_model\data.py in get_batch(tokens, counts, ind, vocab_size, emsize, device)
15 #L = count.shape[1]
16 if len(doc) == 1:
---> 17 doc = [doc.squeeze()]
18 count = [count.squeeze()]
19 else:
AttributeError: 'list' object has no attribute 'squeeze'
[vocabulary.txt](https://github.com/MIND-Lab/OCTIS/files/7423616/vocabulary.txt)
[corpus.csv](https://github.com/MIND-Lab/OCTIS/files/7423618/corpus.csv)
Dear @silviatti ,
Related to issue #1
I would like to resubmit the feature request to integrate the DETM into your OCTIS suite. It appears the person who originally proposed issue #1 back in April 2021 has lost interest, or no longer wants to pursue this.
Would it be possible for you to complete the integration?
With kindest regards
Luke
Great work!
But as for evaluating the topic models, how about adding the perplexity metric which is a common approach to evaluate the unsupervised language/topic models?
CTM training fails.
dataset = Dataset()
dataset.load_custom_dataset_from_folder(DATASET_PATH)
model = CTM(num_topics=TOPIC_SIZE)
model_output = model.train_model(dataset)
save_model_output(model_output, MODEL_OUTPUT_PATH)
save_model_output(model, MODEL_PATH)
The following error message was displayed.
Batches: 84%|████████████████████████████████████████████████████████████████████████████████████████▌ | 21790/26093 [59:43<11:47, 6.08it/s]
Traceback (most recent call last):
File "train.py", line 62, in <module>
model = ProdLDA(num_topics=TOPIC_SIZE)
File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 95, in train_model
x_train, x_test, x_valid, input_size = self.preprocess(
File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 175, in preprocess
b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 208, in load_bert_data
bert_ouput = bert_embeddings_from_list(texts, bert_model)
File "/usr/local/lib/python3.8/dist-packages/octis/models/contextualized_topic_models/utils/data_preparation.py", line 35, in bert_embeddings_from_list
return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))
File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/SentenceTransformer.py", line 160, in encode
out_features = self.forward(features)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/models/Transformer.py", line 51, in forward
output_states = self.auto_model(**trans_features, return_dict=False)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 991, in forward
encoder_outputs = self.encoder(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 582, in forward
layer_outputs = layer_module(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 470, in forward
self_attention_outputs = self.attention(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 401, in forward
self_outputs = self.self(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 305, in forward
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
Preprocessing currently takes a long time for large datasets. One way to improve the speed is to use Spacy pipes, particularly for lemmatization. Preprocessing is a very useful class, that can do a lot with just simple argument configuration.
for doc in spacy_nlp.pipe(documents, batch_size=32, n_process=3, disable=["parser", "ner"]):
    # Lemmatize each token and convert to lower case if the token is not a pronoun
    tokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in doc ]
    # Remove stop words and punctuation
    tokens = [ word for word in tokens if word not in stop_words and word not in punctuations ]
    processed_documents.append(tokens)
I'm happy to contribute code to make this change.
I am not sure why when I try to run the optimize function I get this error "num_samples should be a positive integer value, but got num_samples=0"
from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("mydata")
model = CTM(num_topics=10,
num_epochs=30,
inference_type='zeroshot',
bert_model="distiluse-base-multilingual-cased")
npmi = Coherence(texts=dataset.get_corpus())
search_space = {"num_layers": Categorical({1, 2, 3}),
"num_neurons": Categorical({100, 200, 300}),
"activation": Categorical({'relu', 'softplus'}),
"dropout": Real(0.0, 0.95)
}
optimization_runs=30
model_runs=1
optimizer=Optimizer()
optimization_result = optimizer.optimize(
model, dataset, npmi, search_space, number_of_call=optimization_runs,
model_runs=model_runs, save_models=True,
extra_metrics=None, # to keep track of other metrics
plot_best_seen=True, plot_model=True, plot_name="B0_plot",
save_path='results2/test_ctm//')
I can't find where to set the "num_samples" variable.
I am trying to evaluate topic model algorithms with a provided dataset, without success.
I am trying to run the following code:
from octis.evaluation_metrics.classification_metrics import AccuracyScore
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA
dataset = Dataset(corpus=X, labels=y)
model = LDA(num_topics=5, alpha=0.1)
acc = AccuracyScore(dataset)
output = model.train_model(dataset)
Here X is my text data and y holds the topics (multilabel) for each text. The last line raises this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-29-99a4fc73752b> in <module>
1 acc = AccuracyScore(dataset)
----> 2 output = model.train_model(dataset)
~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/models/LDA.py in train_model(self, dataset, hyperparams, top_words)
164
165 if self.use_partitions:
--> 166 train_corpus, test_corpus = dataset.get_partitioned_corpus(use_validation=False)
167 else:
168 train_corpus = dataset.get_corpus()
~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/dataset/dataset.py in get_partitioned_corpus(self, use_validation)
41 # Partitioned Corpus getter
42 def get_partitioned_corpus(self, use_validation=True):
---> 43 last_training_doc = self.__metadata["last-training-doc"]
44 # gestire l'eccezione se last_validation_doc non è definito, restituire
45 # il validation vuoto
TypeError: 'NoneType' object is not subscriptable
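The traceback shows get_partitioned_corpus reading metadata["last-training-doc"], while the dataset here was created without any metadata. The preprocessing code quoted elsewhere on this page fills two keys, "last-training-doc" and "last-validation-doc". A minimal sketch of how those boundaries relate to an ordered corpus follows; build_partition_metadata and split_corpus are hypothetical helpers, and whether passing such a dict as Dataset(..., metadata=...) is sufficient is an assumption.

```python
def build_partition_metadata(n_train, n_validation):
    # Documents are assumed to be ordered train, then validation, then test.
    return {
        "last-training-doc": n_train,
        "last-validation-doc": n_train + n_validation,
    }


def split_corpus(corpus, metadata):
    # Illustrates how the two boundaries partition an ordered corpus.
    last_train = metadata["last-training-doc"]
    last_val = metadata["last-validation-doc"]
    return corpus[:last_train], corpus[last_train:last_val], corpus[last_val:]


corpus = [["doc", str(i)] for i in range(10)]
metadata = build_partition_metadata(n_train=7, n_validation=2)
train, validation, test = split_corpus(corpus, metadata)
```

If no partitions are needed, another issue on this page calls model.partitioning(False) on an LDA model before training, which may sidestep the metadata lookup.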
Setting max_features in Preprocessing throws the following error:
TypeError: __init__() got an unexpected keyword argument 'df_max_freq'
Example code
import string
from octis.preprocessing.preprocessing import Preprocessing
# Initialize preprocessing
preprocessor = Preprocessing(vocabulary=None, max_features=None,
remove_punctuation=True, punctuation=string.punctuation,
lemmatize=True, stopword_list='english',
min_chars=1, min_words_docs=0, max_df=0.9, min_df=0.1) #
# preprocess
dataset = preprocessor.preprocess_dataset(documents_path=r'dataset.csv')
I'm happy to help fix the issue.
OCTIS is an excellent library - keep up the great work.
According to the readme, input in the partition column for a custom dataset should be of the type 'training', 'validation', 'test', which I can't get to yield a partition:
Make sure that the dataset is in the following format:
The partition can be "training", "test" or "validation". An example of dataset can be found here: sample_dataset_.
However, it seems the right format is 'train', 'val', 'test', which does work for me - just passing this on to make the ReadMe clearer.
def load_custom_dataset_from_folder(self, path):
    """
    Loads all the dataset from a folder
    Parameters
    ----------
    path : path of the folder to read
    """
    self.dataset_path = path
    try:
        if exists(self.dataset_path + "/metadata.json"):
            self._load_metadata(self.dataset_path + "/metadata.json")
        else:
            self.__metadata = dict()
        df = pd.read_csv(self.dataset_path + "/corpus.tsv", sep='\t', header=None)
        if len(df.keys()) > 1:
            df[1] = df[1].replace("train", "a_train")
            df[1] = df[1].replace("val", "b_val")
            df = df.sort_values(1).reset_index(drop=True)
            self.__metadata['last-training-doc'] = len(df[df[1] == 'a_train'])
            self.__metadata['last-validation-doc'] = len(df[df[1] == 'b_val']) + len(df[df[1] == 'a_train'])
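Going by the loader code above, the second column appears to be matched against the exact values "train" and "val", with everything else (such as "test") sorting last. A sketch of writing a corpus.tsv in that shape with the standard csv module is shown below; the file name and tab separator come from the snippet, while the rows and the helper write_corpus_tsv are made up for illustration.

```python
import csv
import os
import tempfile


def write_corpus_tsv(folder, rows):
    """Write (text, partition) rows as a headerless, tab-separated corpus.tsv."""
    path = os.path.join(folder, "corpus.tsv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerows(rows)
    return path


rows = [
    ("first training document", "train"),
    ("second training document", "train"),
    ("a validation document", "val"),
    ("a test document", "test"),
]

with tempfile.TemporaryDirectory() as tmp:
    corpus_path = write_corpus_tsv(tmp, rows)
    with open(corpus_path, encoding="utf-8", newline="") as f:
        loaded = [tuple(r) for r in csv.reader(f, delimiter="\t")]

partitions = [partition for _, partition in loaded]
```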
Hi all,
thank you for this amazing library! I'll definitely consider using it for my master’s thesis. I have a question regarding coherence scores and non-English texts, and I hope it is okay to ask this here.
I saw that you're using the gensim coherence pipeline, which is based on Röder et al. 2015. It is not clear to me from this paper whether they used only English Wikipedia or a multilingual Wikipedia as the reference corpus for calculating the coherence measures. So my question is whether the gensim coherence pipeline is suitable for evaluating non-English texts (e.g. German), or whether it would be better to use other approaches like TC-W2V with a custom corpus.
Regards
Luca
Currently, CTM returns only topic_significance_background score. Is it possible to get topic_significance_uniform score (per topic)?
Thanks,
-Atakan
I am trying to optimise an LDA model with custom data. My evaluation metric is npmi, but I am also using topic_diversity as an extra metric during optimization.
Code:
# Create Model
model = LDA(num_topics=20, alpha=0.1)
model.partitioning(False)
# Initialize metric
npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_npmi')
# Initialize metric
topic_diversity = TopicDiversity(topk=10)
optimization_runs=30 # number of optimization iterations
model_runs=5 # number of runs of the topic model
# Define the search space. To see which hyperparameters to optimize, see the topic model's initialization signature
search_space = {"alpha": Real(low=0.001, high=5.0),
"eta": Real(low=0.001, high=5.0),
'num_topics': Integer(low=1, high=10, prior='uniform')}
# Initialize an optimizer object and start the optimization.
optimizer=Optimizer()
optResult=optimizer.optimize(model, dataset,
search_space= search_space,
save_path=output_path, # path to store the results
metric= npmi,
number_of_call=optimization_runs,
model_runs=model_runs,
extra_metrics=[topic_diversity])
# save the results of the optimization in a csv file
optResult.save_to_csv("results.csv")
Error traceback :
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~/envs/topic_modeling/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in save_to_csv(self, name_file)
150 try:
--> 151 df[metric.info()["name"] + '(not optimized)'] = [np.median(
152 self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
~/envs/topic_modeling/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in <listcomp>(.0)
151 df[metric.info()["name"] + '(not optimized)'] = [np.median(
--> 152 self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
153 except:
AttributeError: 'OptimizerEvaluation' object has no attribute 'dict_model_runs'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
/tmp/ipykernel_6863/802420513.py in <module>
10
11 #save the results of th optimization in a csv file
---> 12 optResult.save_to_csv("results.csv")
~/envs/topic_modeling/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in save_to_csv(self, name_file)
152 self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
153 except:
--> 154 df[metric.__class__.__name__ + '(not optimized)'] = [np.median(
155 self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
156
~/envs/topic_modeling/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in <listcomp>(.0)
153 except:
154 df[metric.__class__.__name__ + '(not optimized)'] = [np.median(
--> 155 self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
156
157 if not name_file.endswith(".csv"):
AttributeError: 'OptimizerEvaluation' object has no attribute 'dict_model_runs'
Code modification here
Old
try:
    df[metric.info()["name"] + '(not optimized)'] = [np.median(
        self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
except:
    df[metric.__class__.__name__ + '(not optimized)'] = [np.median(
        self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
New
try:
    df[metric.info()["name"] + '(not optimized)'] = [np.median(
        self.info['dict_model_runs'][metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
except:
    df[metric.__class__.__name__ + '(not optimized)'] = [np.median(
        self.info['dict_model_runs'][metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
If you are doing Bayesian optimization, it is probably a good idea to give the control of prior probability distribution to the user. Any plans to add that to your API?
Hi, thanks for sharing this amazing work!
I think the current citations for NeuralLDA and prodLDA are for the repo only.
These models are from the paper Autoencoding Variational Inference for Topic Models (Srivastava and Sutton 2017). Please consider citing the paper as well. Thanks!
I am trying to preprocess a custom corpus, and when I try to remove stop words, here is what I get.
preprocessor = Preprocessing(vocabulary=None, max_features=None,
remove_punctuation=True, punctuation=string.punctuation,
lemmatize=True, stopword_list='english',
min_chars=1, min_words_docs=0)
# preprocess
dataset = preprocessor.preprocess_dataset(documents_path=corpus_path)
And I am getting the following error.
ndexes)
101 else:
102 if 'english' in stopword_list:
--> 103 with open('octis/preprocessing/stopwords/english.txt') as fr:
104 stopwords = [line.strip() for line in fr.readlines()]
105 assert stopword_list == language
FileNotFoundError: [Errno 2] No such file or directory: 'octis/preprocessing/stopwords/english.txt'
More context: I am using a Jupyter notebook.
A possible solution: it may be useful to use pathlib to handle this type of path.
If I fix it locally, I can raise a PR soon.
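A relative path such as 'octis/preprocessing/stopwords/english.txt' is resolved against the current working directory, which in a notebook is rarely the package root. A sketch of the pathlib approach suggested above builds the path from an explicit base directory (inside a package this would typically be Path(__file__).resolve().parent). The function load_stopwords and the directory layout in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def load_stopwords(base_dir, language):
    # Build the path from an explicit base directory instead of relying
    # on the current working directory.
    path = Path(base_dir) / "stopwords" / f"{language}.txt"
    with path.open(encoding="utf-8") as fr:
        return [line.strip() for line in fr if line.strip()]


# Demo with a temporary directory standing in for the package folder.
with tempfile.TemporaryDirectory() as tmp:
    stop_dir = Path(tmp) / "stopwords"
    stop_dir.mkdir()
    (stop_dir / "english.txt").write_text("the\nand\nof\n", encoding="utf-8")
    stopwords = load_stopwords(tmp, "english")
```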
I'm not familiar with GitHub, so I apologize if this comes across as rude.
Thank you for publishing such great work.
It seems that the constructors of WordEmbeddingsPairwiseSimilarity and WordEmbeddingsCentroidSimilarity do not define self.binary.
Therefore, I could not use any pretrained word embeddings other than the default.
In [2]: from octis.evaluation_metrics import similarity_metrics
In [3]: dummy_kv_path = "/workdir/dummy_kv.txt"
In [4]: similarity_metrics.WordEmbeddingsPairwiseSimilarity(word2vec_path=dummy_kv_path)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-c02dac4f77ab> in <module>
----> 1 similarity_metrics.WordEmbeddingsPairwiseSimilarity(word2vec_path=dummy_kv_path)
~/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/octis/evaluation_metrics/similarity_metrics.py in __init__(self, word2vec_path, topk)
71 self.wv = api.load('word2vec-google-news-300')
72 else:
---> 73 self.wv = KeyedVectors.load_word2vec_format( word2vec_path, binary=self.binary)
74
75 self.topk = topk
AttributeError: 'WordEmbeddingsPairwiseSimilarity' object has no attribute 'binary'
In [5]: similarity_metrics.WordEmbeddingsCentroidSimilarity(word2vec_path=dummy_kv_path)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-1f1c772b67de> in <module>
----> 1 similarity_metrics.WordEmbeddingsCentroidSimilarity(word2vec_path=dummy_kv_path)
~/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/octis/evaluation_metrics/similarity_metrics.py in __init__(self, word2vec_path, topk)
115 self.wv = api.load('word2vec-google-news-300')
116 else:
--> 117 self.wv = KeyedVectors.load_word2vec_format(word2vec_path, binary=self.binary)
118 self.topk = topk
119
AttributeError: 'WordEmbeddingsCentroidSimilarity' object has no attribute 'binary'
I am working on topic modeling for noisy short texts, trying to get topic significance scores per topic.
for t in output:  # 'output' is the model itself
    significance_uniform_score = topic_signif_uniform.score(t)
    print("Topic Significance Uniform Score: " + str(significance_uniform_score))
I get the following error message:
TypeError Traceback (most recent call last)
in
1 # Retrieve metrics score
2
----> 3 for t in output[:]:
4
5 #topic_diversity_score = topic_diversity.score(t)
TypeError: unhashable type: 'slice'
Is it possible to get topic significance score per topic?
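The traceback is consistent with `output` being a dictionary, which cannot be sliced with `output[:]`. One way to get a value per topic is to iterate over the rows of the topic-word matrix and score each row separately. The sketch below is not OCTIS's implementation: it assumes the model output stores the matrix under a `"topic-word-matrix"` key, uses toy numbers, and computes a plain KL divergence from the uniform distribution with a hypothetical helper `kl_from_uniform`.

```python
import math


def kl_from_uniform(topic_word_probs):
    """KL(topic || uniform) for one topic's word distribution."""
    vocab_size = len(topic_word_probs)
    uniform = 1.0 / vocab_size
    # Terms with p == 0 contribute nothing to the sum.
    return sum(p * math.log(p / uniform) for p in topic_word_probs if p > 0)


# Toy output; the "topic-word-matrix" key is an assumption about the model output.
output = {"topic-word-matrix": [[0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]]}

per_topic_scores = [kl_from_uniform(row) for row in output["topic-word-matrix"]]
for topic_id, score in enumerate(per_topic_scores):
    print(f"Topic {topic_id}: {score:.4f}")
```

A topic that is exactly uniform scores zero, and more peaked topics score higher.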
I used OCTIS metrics to evaluate my own implementation of the model. As a result, I found that WECoherencePairwise and WECoherenceCentroid are negatively correlated, although I would expect these two metrics to be positively correlated.
Each point in the figure represents the result of an experiment under different conditions.
The WECoherenceCentroid calculation uses distance - 1, but 1 - distance would be correct (alternatively, use sklearn.metrics.pairwise.cosine_similarity).
# https://github.com/MIND-Lab/OCTIS/blob/master/octis/evaluation_metrics/coherence_metrics.py#L171
distance = spatial.distance.cosine(self._wv.__getitem__(w1), self._wv.__getitem__(w2))
topic_coherence += distance - 1
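To make the sign issue concrete: `scipy.spatial.distance.cosine` returns a distance, so the similarity is `1 - distance`, while `distance - 1` gives the same magnitude with the opposite sign. The sketch below reimplements the cosine distance in plain Python (the helper `cosine_distance` is written for this illustration) so it runs without SciPy.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, i.e. 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


same = cosine_distance([1.0, 2.0], [1.0, 2.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

# Similarity: high for identical directions, low for opposite ones.
similarity_same = 1.0 - same
similarity_opposite = 1.0 - opposite

# The flipped expression ranks identical vectors as the worst case.
flipped_same = same - 1.0
```

With the flipped sign, the most similar word pairs receive the lowest scores, which would be consistent with the negative correlation described above.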
When performing a hyperparameter search with the CTM model and use_partitions=False, I get the error: AttributeError: 'CTM' object has no attribute 'vocab'.
I believe moving line 94 in CTM.py [self.vocab = dataset.get_vocabulary()] before the self.use_partitions if-statement on line 87 would solve the problem.
Current call: 0
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-63-7718f92a8020> in <module>
1 optimizer=Optimizer()
----> 2 optimization_result = optimizer.optimize(
3 model, dataset, npmi, search_space, number_of_call=optimization_runs,
4 model_runs=model_runs, save_models=True,
5 extra_metrics=None, # to keep track of other metrics
~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
158
159 # Perform Bayesian Optimization
--> 160 results = self._optimization_loop(opt)
161
162 return results
~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
283 else:
284 next_x = opt.ask()
--> 285 f_val = self._objective_function(next_x)
286
287 # Update the opt using (next_x,f_val)
~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _objective_function(self, hyperparameter_values)
214
215 # Prepare model
--> 216 model_output = self.model.train_model(self.dataset, params,
217 self.topk)
218 # Score of the model
~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in train_model(self, dataset, hyperparameters, top_words)
89 data_corpus = [' '.join(i) for i in dataset.get_corpus()]
90 self.X_train, input_size = self.preprocess(
---> 91 self.vocab, train=data_corpus, bert_train_path=self.hyperparameters['bert_path'] + "_train.pkl",
92 bert_model=self.hyperparameters["bert_model"])
93
AttributeError: 'CTM' object has no attribute 'vocab'
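The proposed reordering can be sketched as follows. `ModelSketch` is a hypothetical stand-in, not the actual CTM class; it only shows the attribute being assigned before the branch so that both paths can read it.

```python
class ModelSketch:
    """Hypothetical stand-in for the train_model branching described above."""

    def __init__(self, use_partitions):
        self.use_partitions = use_partitions

    def train_model(self, vocabulary, corpus):
        # Assign the vocabulary before branching so both paths can use it.
        self.vocab = vocabulary
        if self.use_partitions:
            mode = "partitioned"
        else:
            mode = "full corpus"
        return mode, len(self.vocab), len(corpus)


result = ModelSketch(use_partitions=False).train_model(
    ["topic", "model"], ["topic model"]
)
print(result)
```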
Despite looking at some demos, I am still not capable of making pull requests, but wanted to let you know nonetheless.
Best,
Thyge
If I add a GPU with CUDA support, will OCTIS be faster?
I am a PhD candidate and I need to evaluate the performance of three different topic modeling algorithms: LDA, LSI, and BERTopic (LDA and LSI were trained using the Gensim package).
Which relevance metrics should I use apart from the coherence score? I would like to include in my paper a table or graph that evaluates the models in terms of accuracy (coherence score) and topic relevance (should I use the topic diversity metric?).
Thank you
Hi @lffloyd and @silviatti
I tried to run the ETM with pre-trained embeddings after the recent upgrade, and it returned this error.
TypeError: load_word2vec_format() got an unexpected keyword argument 'no_header'.
Please advise if I made an error on my end.
My commands and traceback are provided below.
Thank you so much!
Luke
model = ETM(num_topics=40, num_epochs=1, use_partitions=False, train_embeddings=False,
embeddings_type='word2vec', embeddings_path=r'my/path/to/embedding/skipgram_emb_300d.txt', binary_embeddings=False, headerless_embeddings=True)
output= model.train_model(dataset)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
d:\01MRes_ubuntu\OCTIS\fomcNoPartitionsPreTrained\EtmRunModelPreTrained300.py in <module>
----> 26 output_fomc_etm = model.train_model(dataset)
~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM.py in train_model(self, dataset, hyperparameters, top_words)
74 if hyperparameters is None:
75 hyperparameters = {}
---> 76 self.set_model(dataset, hyperparameters)
77 self.top_word = top_words
78 self.early_stopping = EarlyStopping(patience=5, verbose=True)
~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM.py in set_model(self, dataset, hyperparameters)
119
120 self.set_default_hyperparameters(hyperparameters)
--> 121 self.load_embeddings()
122 ## define model and optimizer
123 self.model = etm.ETM(num_topics=self.hyperparameters['num_topics'], vocab_size=len(self.vocab.keys()),
~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\base_etm.py in load_embeddings(self)
52 self.hyperparameters['embeddings_type'],
53 self.hyperparameters['binary_embeddings'],
---> 54 self.hyperparameters['headerless_embeddings'])
55 embeddings = np.zeros((len(self.vocab.keys()), self.hyperparameters['embedding_size']))
56 for i, word in enumerate(self.vocab.values()):
~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\base_etm.py in _load_word_vectors(self, embeddings_path, embeddings_type, binary_embeddings, headerless_embeddings)
85 embeddings_path,
86 binary=binary_embeddings,
---> 87 no_header=headerless_embeddings)
88
89 vectors = {}
TypeError: load_word2vec_format() got an unexpected keyword argument 'no_header'
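The error suggests that the installed gensim's `load_word2vec_format` predates the `no_header` argument, which I believe was introduced in gensim 4.0. A hedged sketch of a defensive call follows: it inspects the callee's signature and only forwards keyword arguments it accepts. `call_with_supported_kwargs` and `old_loader` are hypothetical names used for illustration.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(func).parameters
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_any:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return func(*args, **kwargs)


def old_loader(path, binary=False):
    # Stand-in for a loader that predates the `no_header` argument.
    return (path, binary)


loaded = call_with_supported_kwargs(
    old_loader, "emb.txt", binary=False, no_header=True
)
print(loaded)
```

Dropping the argument only avoids the TypeError. A headerless embeddings file would still need either a newer gensim or a header line added to the file.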
Nice work here...
I haven't played with the code yet.
I was just wondering whether this tool can work with dynamic topic model approaches like this one.
If yes, how can we integrate it?
I've asked this at #29, but decided to open a new issue because this is a more specific scenario. So, here it is:
Hi @silviatti. So, if I understand correctly, currently there's no way to load the unprocessed corpus documents on OCTIS' CTM while using its optimizer, in a manner similar to the one done on standalone CTM's README?
Originally posted by @lffloyd in #29 (comment)
I gave a look at the docs.
Tried to run the server and create an experiment
127.0.0.1 - - [27/Apr/2021 20:06:58] "POST /selectPath HTTP/1.1" 200 -
{'partitioning': False, 'path': 'D:/OctisResults', 'dataset': '20NewsGroup', 'model': {'name': 'LDA', 'parameters': {'alpha': 0.1, 'eta': 0.1, 'iterations': 50, 'passes': 1}}, 'optimization': {'iterations': 5, 'model_runs': 3, 'surrogate_model': 'GP', 'n_random_starts': 3, 'acquisition_function': 'LCB', 'search_spaces': {'num_topics': {'low': 2, 'high': 20}}}, 'optimize_metrics': [{'name': 'Coherence', 'parameters': {'measure': 'c_npmi', 'texts': 'use dataset texts', 'topk': 10}}], 'track_metrics': [{'name': 'Coherence', 'parameters': {'measure': 'c_npmi', 'texts': 'use dataset texts', 'topk': 10}}]}
127.0.0.1 - - [27/Apr/2021 20:07:20] "POST /startExperiment HTTP/1.1" 200 -
starting OctExpnewsgroup
Process Process-2:1:
Traceback (most recent call last):
File "D:\octisExp\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "D:\octisExp\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "D:\octisExp\lib\site-packages\octis\dashboard\queueManager.py", line 260, in _execute_and_update
startExperiment(toRun[running[0]])
File "D:\octisExp\lib\site-packages\octis\dashboard\experimentManager.py", line 136, in startExperiment
model_class = importModel(parameters["model"]["name"])
File "D:\octisExp\lib\site-packages\octis\dashboard\experimentManager.py", line 64, in importModel
model = importClass(model_name, model_name, module_path)
File "D:\octisExp\lib\site-packages\octis\dashboard\experimentManager.py", line 46, in importClass
spec.loader.exec_module(module)
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "D:\octisExp\lib\site-packages\octis\models\LDA.py", line 5, in <module>
import octis.configuration.citations as citations
AttributeError: module 'octis' has no attribute 'configuration'
Hi Octis team,
When I run your tutorial on my local server (Jupyter notebook), I get an exception. I get the same exception when training a single model (no hyperparameter search) on custom data.
I have attempted to locate the problem, but when I reproduce the individual steps, everything runs fine. Otherwise I would be happy to make a pull request, but I am not sure what is going on here...
One odd observation: CTM.load_bert_data(bert_train_path, train, bert_model) runs before CTMDataset(x_train.toarray(), b_train, idx2token) in preprocess (see below), and bert_embeddings_from_list from /models/contextualized_topic_models/utils/data_preparation.py defaults to show_progress_bar=True, yet the exception is thrown before any progress bar appears.
def preprocess(vocab, train, bert_model, test=None, validation=None,
               bert_train_path=None, bert_test_path=None, bert_val_path=None):
    vocab2id = {w: i for i, w in enumerate(vocab)}
    vec = CountVectorizer(
        vocabulary=vocab2id, token_pattern=r'(?u)\b\w+\b')
    entire_dataset = train.copy()
    if test is not None:
        entire_dataset.extend(test)
    if validation is not None:
        entire_dataset.extend(validation)
    vec.fit(entire_dataset)
    idx2token = {v: k for (k, v) in vec.vocabulary_.items()}
    x_train = vec.transform(train)
    b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
    train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
    input_size = len(idx2token.keys())
Tutorial code that raises the exception:
from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical, Integer
from octis.evaluation_metrics.coherence_metrics import Coherence
dataset = Dataset()
dataset.fetch_dataset("M10")
model = CTM(num_topics=10, num_epochs=30, inference_type='zeroshot', bert_model="bert-base-nli-mean-tokens")
npmi = Coherence(texts=dataset.get_corpus())
search_space = {"num_layers": Categorical({1, 2, 3}),
"num_neurons": Categorical({100, 200, 300}),
"activation": Categorical({'sigmoid', 'relu', 'softplus'}),
"dropout": Real(0.0, 0.95)
}
optimization_runs=30
model_runs=1
optimizer=Optimizer()
optimization_result = optimizer.optimize(
model, dataset, npmi, search_space, number_of_call=optimization_runs,
model_runs=model_runs, save_models=True,
extra_metrics=None, # to keep track of other metrics
save_path='results/test_ctm//')
Current call: 0
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-46-7718f92a8020> in <module>
1 optimizer=Optimizer()
----> 2 optimization_result = optimizer.optimize(
3 model, dataset, npmi, search_space, number_of_call=optimization_runs,
4 model_runs=model_runs, save_models=True,
5 extra_metrics=None, # to keep track of other metrics
~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
158
159 # Perform Bayesian Optimization
--> 160 results = self._optimization_loop(opt)
161
162 return results
~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
283 else:
284 next_x = opt.ask()
--> 285 f_val = self._objective_function(next_x)
286
287 # Update the opt using (next_x,f_val)
~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _objective_function(self, hyperparameter_values)
214
215 # Prepare model
--> 216 model_output = self.model.train_model(self.dataset, params,
217 self.topk)
218 # Score of the model
~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in train_model(self, dataset, hyperparameters, top_words)
80 self.vocab = dataset.get_vocabulary()
81 self.X_train, self.X_test, self.X_valid, input_size = \
---> 82 self.preprocess(self.vocab, data_corpus_train, test=data_corpus_test,
83 validation=data_corpus_validation,
84 bert_train_path=self.hyperparameters['bert_path'] + "_train.pkl",
~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in preprocess(vocab, train, bert_model, test, validation, bert_train_path, bert_test_path, bert_val_path)
178 b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
179
--> 180 train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
181 input_size = len(idx2token.keys())
182
~/anaconda3/lib/python3.8/site-packages/octis/models/contextualized_topic_models/datasets/dataset.py in __init__(self, X, X_bert, idx2token)
15 """
16 if X.shape[0] != len(X_bert):
---> 17 raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
18 "You might want to check if the BoW preparation method has removed some documents. ")
19
Exception: Wait! BoW and Contextual Embeddings have different sizes! You might want to check if the BoW preparation method has removed some documents.
My reproduction, which works fine:
def preprocess(vocab, train, bert_model, test=None, validation=None,
               bert_train_path=None, bert_test_path=None, bert_val_path=None):
    vocab2id = {w: i for i, w in enumerate(vocab)}
    vec = CountVectorizer(
        vocabulary=vocab2id, token_pattern=r'(?u)\b\w+\b')
    entire_dataset = train.copy()
    if test is not None:
        entire_dataset.extend(test)
    if validation is not None:
        entire_dataset.extend(validation)
    vec.fit(entire_dataset)
    idx2token = {v: k for (k, v) in vec.vocabulary_.items()}
    x_train = vec.transform(train)
    b_train = bert_embeddings_from_list(train, bert_model)
    train_data = CTMDataset(x_train.toarray(), b_train, idx2token)
    input_size = len(idx2token.keys())
    if test is not None and validation is not None:
        x_test = vec.transform(test)
        b_test = bert_embeddings_from_list(test, bert_model)
        test_data = CTMDataset(x_test.toarray(), b_test, idx2token)
        x_valid = vec.transform(validation)
        b_val = bert_embeddings_from_list(validation, bert_model)
        valid_data = CTMDataset(x_valid.toarray(), b_val, idx2token)
        return train_data, test_data, valid_data, input_size
    if test is None and validation is not None:
        x_valid = vec.transform(validation)
        b_val = bert_embeddings_from_list(validation, bert_model)
        valid_data = CTMDataset(x_valid.toarray(), b_val, idx2token)
        return train_data, valid_data, input_size
    if test is not None and validation is None:
        x_test = vec.transform(test)
        b_test = bert_embeddings_from_list(test, bert_model)
        test_data = CTMDataset(x_test.toarray(), b_test, idx2token)
        return train_data, test_data, input_size
    if test is None and validation is None:
        return train_data, input_size
def bert_embeddings_from_list(texts, sbert_model_to_load="bert-base-nli-mean-tokens", batch_size=100):
    """
    Creates SBERT Embeddings from a list
    """
    model = SentenceTransformer(sbert_model_to_load)
    return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))
import torch
from torch.utils.data import Dataset
import scipy.sparse
class CTMDataset(Dataset):
    """Class to load BOW dataset."""
    def __init__(self, X, X_bert, idx2token):
        """
        Args
            X : array-like, shape=(n_samples, n_features)
                Document word matrix.
        """
        if X.shape[0] != len(X_bert):
            raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
                            "You might want to check if the BoW preparation method has removed some documents. ")
        self.X = X
        self.X_bert = X_bert
        self.idx2token = idx2token
    def __len__(self):
        """Return length of dataset."""
        return self.X.shape[0]
    def __getitem__(self, i):
        """Return sample from dataset at index i."""
        if type(self.X[i]) == scipy.sparse.csr.csr_matrix:
            X = torch.FloatTensor(self.X[i].todense())
            X_bert = torch.FloatTensor(self.X_bert[i])
        else:
            X = torch.FloatTensor(self.X[i])
            X_bert = torch.FloatTensor(self.X_bert[i])
        return {'X': X, 'X_bert': X_bert}
from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical, Integer
from octis.evaluation_metrics.coherence_metrics import Coherence
dataset = Dataset()
dataset.fetch_dataset("M10")
train, validation, test = dataset.get_partitioned_corpus(use_validation=True)
data_corpus_train = [' '.join(i) for i in train]
data_corpus_test = [' '.join(i) for i in test]
data_corpus_validation = [' '.join(i) for i in validation]
vocab = dataset.get_vocabulary()
X_train, X_test, X_valid, input_size = \
preprocess(vocab, data_corpus_train, test=data_corpus_test,
validation=data_corpus_validation,
bert_train_path=""+"_train.pkl",
bert_test_path=""+"_test.pkl",
bert_val_path=""+"_val.pkl",
bert_model='bert-base-nli-mean-tokens')
Batches: 100%
59/59 [00:08<00:00, 7.10it/s]
Batches: 100%
13/13 [00:01<00:00, 6.62it/s]
Batches: 100%
13/13 [00:00<00:00, 28.11it/s]
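One possible explanation, which is an assumption and not confirmed in this thread, is that `CTM.load_bert_data` found a previously saved embeddings file at `bert_train_path` and loaded it instead of encoding the current corpus. That would fit the observation that the exception appears before any progress bar, and the reproduction above sidesteps it by calling `bert_embeddings_from_list` directly. A sketch of a cache guard along those lines follows; `load_or_compute_embeddings` and `toy_encode` are hypothetical names.

```python
import os
import pickle
import tempfile


def load_or_compute_embeddings(cache_path, texts, encode):
    """Reuse cached embeddings only if they match the corpus size."""
    if cache_path and os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cached = pickle.load(f)
        if len(cached) == len(texts):
            return cached
        # Stale cache (e.g. from another dataset): fall through and recompute.
    embeddings = encode(texts)
    if cache_path:
        with open(cache_path, "wb") as f:
            pickle.dump(embeddings, f)
    return embeddings


def toy_encode(texts):
    # Hypothetical encoder: one short vector per document.
    return [[float(len(t))] for t in texts]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "_train.pkl")
    first = load_or_compute_embeddings(path, ["a", "bb", "ccc"], toy_encode)
    # A different corpus with the same cache path triggers a recompute.
    second = load_or_compute_embeddings(path, ["dddd", "ee"], toy_encode)
```

A length check cannot detect a stale cache that happens to have the same number of documents, so deleting old `.pkl` files or using a dataset-specific `bert_path` is the safer habit.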
Dear Silvia,
Love your work!
I have a silly question. How do you train a model using the entire corpus (ignoring partitions)?
Thank you for your help.
Luke
I tried:
from octis.models.ProdLDA import ProdLDA
and get the following error:
d:\octisexp\lib\site-packages\octis\models\pytorchavitm\__init__.py in <module>
1 """Init package"""
2
----> 3 from octis.models.pytorchavitm.avitm.avitm_model import AVITM_model
ModuleNotFoundError: No module named 'octis.models.pytorchavitm.avitm'
I have created a virtual environment and installed octis using : pip install octis
OCTIS version: 1.10.0
Python version :3.9.7
Operating System:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
In this line (https://github.com/MIND-Lab/OCTIS/blob/master/octis/evaluation_metrics/coherence_metrics.py#L180), topic[0] contains a word, so if that word is not included in self._wv, an error is raised.
Since Gensim's KeyedVectors class has a vector_size attribute, I think this code should be rewritten to create a zero vector based on vector_size.
#t = [0] * len(self._wv.__getitem__(topic[0]))
t = np.zeros(self._wv.vector_size)
File "/root/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/octis-1.10.0-py3.9.egg/octis/evaluation_metrics/coherence_metrics.py", line 180, in score
t = [0] * len(self._wv.__getitem__(topic[0]))
File "/root/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 395, in __getitem__
return self.get_vector(key_or_keys)
File "/root/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 438, in get_vector
index = self.get_index(key)
File "/root/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 412, in get_index
raise KeyError(f"Key '{key}' not present")
KeyError: "Key 'elsevi' not present"
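The idea can be sketched without gensim as follows. `ToyVectors` is a hypothetical stand-in that exposes only `vector_size`, membership tests and item lookup. Skipping out-of-vocabulary words while accumulating is an extra assumption beyond the change proposed above, which only concerns how the zero vector is created.

```python
class ToyVectors:
    """Hypothetical stand-in exposing the parts of KeyedVectors used here."""

    def __init__(self, vectors):
        self._vectors = vectors
        self.vector_size = len(next(iter(vectors.values())))

    def __contains__(self, word):
        return word in self._vectors

    def __getitem__(self, word):
        return self._vectors[word]


def topic_centroid(wv, topic):
    # Size the accumulator from vector_size, not from the first topic word.
    total = [0.0] * wv.vector_size
    known = [w for w in topic if w in wv]
    for word in known:
        total = [t + x for t, x in zip(total, wv[word])]
    count = max(len(known), 1)  # avoid dividing by zero if no word is known
    return [t / count for t in total]


wv = ToyVectors({"topic": [1.0, 0.0], "model": [0.0, 1.0]})
centroid = topic_centroid(wv, ["elsevi", "topic", "model"])
print(centroid)
```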
Hi Octis Team,
Thanks for making this available!
When providing a custom dataset for an LDA hyperparameter search, I get: KeyError: 'info'
This is not the case when I run a single model (no hypersearch), nor when I fetch the M10 dataset and use this.
If I manually add an info entry with a name for the dataset to the metadata attribute of the custom dataset, the hyperparameter search works fine.
Perhaps the required metadata could be auto-filled when providing custom data?
Best,
Thyge
Code and traceback:
# Load modules
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Categorical
# Load custom dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder(str(_dp))
# Initiate model
model = LDA(alpha=0.5, eta=0.5)
# Define search space
search_space = {"num_topics": Categorical({15, 20, 25, 30})}
# Set number of runs
optimization_runs=15
model_runs=1
# Define evaluation metric
npmi = Coherence(texts=dataset.get_corpus())
# Hypersearch
optimizer=Optimizer()
optimization_result = optimizer.optimize(
model, dataset, npmi, search_space, number_of_call=optimization_runs,
model_runs=model_runs, save_models=True,
extra_metrics=None, # to keep track of other metrics
save_path=str(_models / 'Octis' / 'LDA'))
Current call: 0
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-225-087bc04aa55a> in <module>
1 # Hypersearch
2 optimizer=Optimizer()
----> 3 optimization_result = optimizer.optimize(
4 model, dataset, npmi, search_space, number_of_call=optimization_runs,
5 model_runs=model_runs, save_models=True,
~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
158
159 # Perform Bayesian Optimization
--> 160 results = self._optimization_loop(opt)
161
162 return results
~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
299
300 # Create an object related to the BO optimization
--> 301 results = OptimizerEvaluation(self, BO_results=res)
302
303 # Save the object
~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in __init__(self, optimizer, BO_results)
45 # Info about optimization
46 self.info = dict()
---> 47 dataset_info = optimizer.dataset.get_metadata()["info"]
48 if dataset_info is not None:
49 self.info.update({"dataset_name": dataset_info["name"]})
KeyError: 'info'
Adding this after loading custom data fixes the problem:
# Load existing metadata
meta_dict = dataset.get_metadata()
# Add name to dict
meta_dict['info'] = {'name':'dataset_name'}
# Update metadata
dataset._Dataset__metadata = meta_dict
# Verify info is updated
dataset.get_info()
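On the reading side, the KeyError could also be avoided by looking up the entry defensively. The helper below is a hypothetical sketch, not the OCTIS code; the default name `"custom_dataset"` is an arbitrary placeholder.

```python
def dataset_name_from_metadata(metadata, default="custom_dataset"):
    """Return the dataset name, tolerating metadata without an 'info' entry."""
    # .get() returns None for a missing key; `or {}` also covers info=None.
    info = metadata.get("info") or {}
    return info.get("name", default)


custom_metadata = {"total_documents": 100}  # no 'info' entry, as with custom data
named_metadata = {"info": {"name": "M10"}}

print(dataset_name_from_metadata(custom_metadata))
print(dataset_name_from_metadata(named_metadata))
```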
Trying to evaluate a model using the WordEmbeddingsInvertedRBOCentroid() method I get an attribute error "'KeyedVectors' object has no attribute 'wv'"
from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("custom_dataset")
from octis.models.LDA import LDA
model_LDA_15 = LDA(num_topics=15)
model_LDA_15_output = model_LDA_15.train_model(dataset)
from octis.evaluation_metrics.diversity_metrics import WordEmbeddingsInvertedRBOCentroid
rbo_centroid_metric = WordEmbeddingsInvertedRBOCentroid()
topic_rbo_centroid_score = rbo_centroid_metric.score(model_LDA_15_output)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-45-eb5075095fc9> in <module>
1 from octis.evaluation_metrics.diversity_metrics import WordEmbeddingsInvertedRBOCentroid
2 rbo_centroid_metric = WordEmbeddingsInvertedRBOCentroid()
----> 3 topic_rbo_centroid__score = rbo_centroid_metric.score(model_LDA_15_output)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/diversity_metrics.py in score(self, model_output)
174 indexed_list1 = [word2index[word] for word in list1]
175 indexed_list2 = [word2index[word] for word in list2]
--> 176 rbo_val = weirbo_centroid(
177 indexed_list1[:self.topk], indexed_list2[:self.topk], p=self.weight, index2word=index2word,
178 word2vec=self.wv, norm=self.norm)[2]
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in word_embeddings_rbo(list1, list2, p, index2word, word2vec, norm)
145 args = (list1, list2, p, index2word, word2vec, norm)
146
--> 147 return RBO(rbo_min(*args), rbo_res(*args), rbo_ext(*args))
148
149
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in rbo_min(list1, list2, p, index2word, word2vec, norm, depth)
79 """
80 depth = min(len(list1), len(list2)) if depth is None else depth
---> 81 x_k = overlap(list1, list2, depth, index2word, word2vec, norm)
82 log_term = x_k * math.log(1 - p)
83 sum_term = sum(
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in overlap(list1, list2, depth, index2word, word2vec, norm)
59 # NOTE: comment the preceding and uncomment the following line if you want
60 # to stick to the algorithm as defined by the paper
---> 61 ov = embeddings_overlap(list1, list2, depth, index2word, word2vec, norm=norm)[0]
62 # print("overlap", ov)
63 return ov
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in embeddings_overlap(list1, list2, depth, index2word, word2vec, norm)
41 word_list2 = [index2word[index] for index in list2]
42
---> 43 centroid_1 = np.mean([word2vec.wv[w] for w in word_list1[:depth]], axis=0)
44 centroid_2 = np.mean([word2vec.wv[w] for w in word_list2[:depth]], axis=0)
45 cos_sim = 1 - distance.cosine(centroid_1, centroid_2)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in <listcomp>(.0)
41 word_list2 = [index2word[index] for index in list2]
42
---> 43 centroid_1 = np.mean([word2vec.wv[w] for w in word_list1[:depth]], axis=0)
44 centroid_2 = np.mean([word2vec.wv[w] for w in word_list2[:depth]], axis=0)
45 cos_sim = 1 - distance.cosine(centroid_1, centroid_2)
AttributeError: 'KeyedVectors' object has no attribute 'wv'
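The error suggests that the object passed as `word2vec` is already a KeyedVectors instance, which holds the vectors itself, while the code expects a full model that wraps them in a `.wv` attribute. A small sketch of a helper that accepts either form follows; `resolve_vectors`, `BareVectors` and `FullModel` are hypothetical names, and the stand-in classes replace gensim so the example is self-contained.

```python
def resolve_vectors(model_or_vectors):
    """Return the vector store whether given a full model or bare vectors."""
    # Fall back to the object itself when it has no .wv attribute.
    return getattr(model_or_vectors, "wv", model_or_vectors)


class BareVectors(dict):
    """Hypothetical stand-in for a KeyedVectors-like object."""


class FullModel:
    """Hypothetical stand-in for a model that wraps its vectors in .wv."""

    def __init__(self, vectors):
        self.wv = vectors


vectors = BareVectors({"topic": [1.0, 0.0]})
print(resolve_vectors(vectors)["topic"])
print(resolve_vectors(FullModel(vectors))["topic"])
```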
The current preprocessing step is already a good starting point, but it could be improved by adding an option to define a custom preprocessing pipeline that handles cases the current preprocessing does not yet cover.
I am trying to fetch the Italian Europarl_IT dataset to train topic models on. However, this does not work.
from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset('Europarl_IT')
Traceback (most recent call last):
Input In [40] in <module>
dataset.fetch_dataset('Europarl_IT')
File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\dataset.py:392 in fetch_dataset
cache = download_dataset(dataset_name, target_dir=dataset_home, cache_path=cache_path)
File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\downloader.py:96 in download_dataset
metadata["info"]["name"] = dataset_name
KeyError: 'info'
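The failing assignment assumes that the downloaded metadata already contains an `info` entry. A minimal sketch of a more tolerant write, using a hypothetical helper `set_dataset_name`, creates the entry when it is missing:

```python
def set_dataset_name(metadata, dataset_name):
    """Record the dataset name, creating the 'info' entry when it is missing."""
    metadata.setdefault("info", {})["name"] = dataset_name
    return metadata


# Toy metadata without an 'info' entry; the other key is illustrative only.
metadata = set_dataset_name({"total_documents": 10}, "Europarl_IT")
print(metadata)
```

Whether the missing entry points to a problem in the dataset's published metadata file itself is not something this thread establishes.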