
mind-lab / octis


OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)

License: MIT License

Python 58.39% Makefile 0.26% JavaScript 10.77% CSS 1.02% HTML 19.04% Jupyter Notebook 10.53%
topic-modeling latent-dirichlet-allocation latent-semantic-analysis evaluation-metrics natural-language-processing non-negative-matrix-factorization neural-topic-models bayesian-optimization hyperparameter-optimization hyperparameter-tuning

octis's People

Contributors

adaminsky, anonymous-submission000, arijitgupta42, brunog89, cerqueiramatheus, davidepietrasanta, dependabot[bot], dopc, espoirmur, gregorywu, lorenzofamiglini, pietrotrope, silviatti, stepgazaille, vinid


octis's Issues

ETM - Possibility of using KeyedVectors input for pre-trained W2V embeddings

  • OCTIS version: 1.9.0
  • Python version: 3.7.6
  • Operating System: Ubuntu 20.04 LTS

Description

Hi, this is more of a question than anything else. I've seen that for ETM model training, we must pass an embeddings path pointing to a "pickled" file. However, I need to run ETM with rather large embeddings. Is there any intent to implement a gensim.models.KeyedVectors-based (or similar) embeddings input for this model? I've implemented something like that in an ETM package of mine, but yours has everything I need to run model optimization. Would a PR on this matter be accepted?

Anyway, cheers for the nice work, this package is really great!
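
For reference, a minimal sketch of the kind of loading this would enable, assuming a gensim KeyedVectors file saved with .save() (the path is hypothetical):

from gensim.models import KeyedVectors

# Hypothetical: memory-map large pre-trained vectors instead of
# unpickling the whole embedding matrix into RAM at once.
kv = KeyedVectors.load('path/to/embeddings.kv', mmap='r')
vector = kv['word']  # vectors are paged in lazily from disk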

What I Did

I took a look here.

Is LSI or LSA implemented?

Hi,

Currently, I am comparing different topic modeling algorithms using the OCTIS package. For LSI, I noticed that the OCTIS paper states that (Hofmann, 1999) is implemented, while the Github page refers to (Landauer et al. 1998). Could you specify on which work your implementation is based?

Not able to run dashboard

  • Python version: 3.6
  • Operating System: Windows

Description

I tried:
python octis\dashboard\server.py

and I get this error:

import octis.dashboard.experimentManager as expManager
AttributeError: module 'octis' has no attribute 'dashboard'

ETM training fails.

  • OCTIS version: 1.8.0
  • Python version: 3.8.10
  • Operating System: Ubuntu 20.04.02

Description

ETM training fails.

What I Did

from octis.dataset.dataset import Dataset
from octis.models.ETM import ETM
from octis.models.model import save_model_output

dataset = Dataset()
dataset.load_custom_dataset_from_folder(DATASET_PATH)
model = ETM(num_topics=TOPIC_SIZE)
model_output = model.train_model(dataset)
save_model_output(model_output, MODEL_OUTPUT_PATH)
save_model_output(model, MODEL_PATH)

The following error message was displayed.

model: ETM(
  (t_drop): Dropout(p=0.5, inplace=False)
  (theta_act): ReLU()
  (rho): Linear(in_features=300, out_features=221413, bias=False)
  (alphas): Linear(in_features=300, out_features=25, bias=False)
  (q_theta): Sequential(
    (0): Linear(in_features=221413, out_features=800, bias=True)
    (1): ReLU()
    (2): Linear(in_features=800, out_features=800, bias=True)
    (3): ReLU()
  )
  (mu_q_theta): Linear(in_features=800, out_features=25, bias=True)
  (logsigma_q_theta): Linear(in_features=800, out_features=25, bias=True)
)
Traceback (most recent call last):
  File "train.py", line 62, in <module>
    model_output = model.train_model(dataset)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/ETM.py", line 57, in train_model
    continue_training = self._train_epoch(epoch)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/ETM.py", line 149, in _train_epoch
    data_batch = data.get_batch(self.train_tokens, self.train_counts, ind, len(self.vocab.keys()),
  File "/usr/local/lib/python3.8/dist-packages/octis/models/ETM_model/data.py", line 17, in get_batch
    doc = [doc.squeeze()]
AttributeError: 'list' object has no attribute 'squeeze'

How to preprocess text for CTM?

I read that CTM uses both the preprocessed text for the BoW and the full text for the BERT embeddings. How can I create a Dataset like this for the CTM model? Does saving an OCTIS dataset automatically do this?

Many thanks
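
Editorial note: judging from the CTM.train_model code quoted in a later issue here, OCTIS builds both the BoW and the SBERT input from the same preprocessed corpus, so no separate unpreprocessed text is stored in the Dataset:

# Quoted from octis/models/CTM.py (train_model): the SBERT embeddings
# are computed from the joined preprocessed tokens, not from raw text.
data_corpus = [' '.join(i) for i in dataset.get_corpus()]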

Is Neural LDA the same as LDA?

When using NeuralLDA as the model in OCTIS, it simply sets the model_type of AVITM to LDA. Does that amount to standard LDA or something else? Which model in the Srivastava and Sutton (2017) paper does it correspond to?

"Select Path" button bug in "Create Experiments" tab

  • Python version: 3.9
  • Operating System: MacOS

Description

In the dashboard, I was trying to create an experiment. Everything was fine until I clicked "Select Path" to choose where to save the experiments.

I got the error:

RuntimeError: main thread is not in main loop


Preprocessing with "split = False" returns NoneType object

  • OCTIS version: 1.6.0
  • Python version: 3.9.5
  • Operating System: Ubuntu 16.04

Description

I am trying to load a custom dataset without splitting into train, validation, test.

What I Did

I first processed the content of both of my files with split=False and got a NoneType object.

preprocessor = Preprocessing(vocabulary=None, max_features=None,
                             remove_punctuation=True, punctuation=string.punctuation,
                             lemmatize=True, stopword_list='french',
                             min_chars=1, min_words_docs=0,
                             language="french",
                             split = False)
    
dataset = preprocessor.preprocess_dataset(documents_path = dataPath / 'fake_docs.txt', labels_path = dataPath / 'fake_metadata.txt')
    
print(type(dataset))

None

To compare, I also did it with "split = True" but got a valid Dataset object this time.

preprocessor = Preprocessing(vocabulary=None, max_features=None,
                             remove_punctuation=True, punctuation=string.punctuation,
                             lemmatize=True, stopword_list='french',
                             min_chars=1, min_words_docs=0,
                             language="french",
                             split = True)
    
dataset = preprocessor.preprocess_dataset(documents_path = dataPath / 'fake_docs.txt', labels_path = dataPath / 'fake_metadata.txt')
    
print(type(dataset))

<octis.dataset.dataset.Dataset object at 0x7f58ad457400>

Might be source of the problem

I looked into the octis.preprocessing.preprocessing module and found that, at lines 209 and 213 (the else branch after if split:), there is no return statement to output the Dataset.

        if self.split:
            if len(final_labels) > 0:
                train, test, y_train, y_test = train_test_split(
                    range(len(final_docs)), final_labels, test_size=0.15, random_state=1, stratify=final_labels)

                train, validation = train_test_split(train, test_size=3 / 17, random_state=1, stratify=y_train)
                partitioned_labels = [final_labels[doc] for doc in train + validation + test]
                partitioned_corpus = [final_docs[doc] for doc in train + validation + test]
                document_indexes = [document_indexes[doc] for doc in train + validation + test]
                metadata["last-training-doc"] = len(train)
                metadata["last-validation-doc"] = len(validation) + len(train)
                if self.save_original_indexes:
                    return Dataset(partitioned_corpus, vocabulary=vocabulary, metadata=metadata, labels=partitioned_labels,
                                   document_indexes=document_indexes)
                else:
                    return Dataset(partitioned_corpus, vocabulary=vocabulary, metadata=metadata, labels=partitioned_labels)
            else:
                train, test = train_test_split(range(len(final_docs)), test_size=0.15, random_state=1)
                train, validation = train_test_split(train, test_size=3 / 17, random_state=1)

                metadata["last-training-doc"] = len(train)
                metadata["last-validation-doc"] = len(validation) + len(train)
                partitioned_corpus = [final_docs[doc] for doc in train + validation + test]
                document_indexes = [document_indexes[doc] for doc in train + validation + test]
                if self.save_original_indexes:
                    return Dataset(partitioned_corpus, vocabulary=vocabulary, metadata=metadata, labels=final_labels,
                                   document_indexes=document_indexes)
                else:
                    return Dataset(partitioned_corpus, vocabulary=vocabulary, metadata=metadata, labels=final_labels,
                                   document_indexes=document_indexes)
        else:
            if self.save_original_indexes:
                Dataset(final_docs, vocabulary=vocabulary, metadata=metadata, labels=final_labels,
                        document_indexes=document_indexes)
            else:

                Dataset(final_docs, vocabulary=vocabulary, metadata=metadata, labels=final_labels)

Adding the return statements at both lines 209 and 213 in the octis.preprocessing.preprocessing.py module resolved the problem.
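
For clarity, a sketch of the fixed else branch with the missing return statements added:

        else:
            if self.save_original_indexes:
                return Dataset(final_docs, vocabulary=vocabulary, metadata=metadata, labels=final_labels,
                               document_indexes=document_indexes)
            else:
                return Dataset(final_docs, vocabulary=vocabulary, metadata=metadata, labels=final_labels)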

Also, thank you for your incredible work, the library is really nice to use!

Hyperparameter Tuning of num_topics Errors

I get the following error when trying to do hyperparameter optimization on num_topics:

/usr/local/lib/python3.7/dist-packages/octis/models/contextualized_topic_models/models/ctm.py in __init__(self, input_size, bert_input_size, inference_type, num_topics, model_type, hidden_sizes, activation, dropout, learn_priors, batch_size, lr, momentum, solver, num_epochs, num_samples, reduce_on_plateau, topic_prior_mean, topic_prior_variance, num_data_loader_workers)
45 "input_size must by type int > 0."
46 assert isinstance(num_topics, int) and input_size > 0,
---> 47 "num_topics must by type int > 0."
48 assert model_type in ['LDA', 'prodLDA'],
49 "model must be 'LDA' or 'prodLDA'."

AssertionError: num_topics must by type int > 0.

What I Did

ctm_model = CTM(model_type='prodLDA', bert_model="stsb-roberta-base-v2", inference_type="combined")

search_space = {"num_topics": Integer(5, 10, 15, 20),
                "num_layers": Categorical({1, 2, 3}), 
                "num_neurons": Categorical({100, 200})
}

optimization_runs=20
model_runs=1

optimizer=Optimizer()
optimization_result = optimizer.optimize(
    ctm_model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    save_path='results/test_ctm/')

optimization_result.save_to_csv("results_ctm.csv")
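
Editorial note: two hedged observations on this report. First, the quoted assertion itself tests input_size rather than num_topics, so any non-int num_topics (scikit-optimize typically yields numpy integers, for which isinstance(x, int) is False) surfaces with this message. Second, skopt's Integer defines a range as Integer(low, high), so Integer(5, 10, 15, 20) is not a valid search dimension. A sketch of a corrected search space:

from skopt.space.space import Categorical, Integer

search_space = {"num_topics": Integer(5, 20),          # low and high bounds, not a value list
                "num_layers": Categorical({1, 2, 3}),
                "num_neurons": Categorical({100, 200})}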

Happy to assist with fixing this issue.

use_partitions error

  • OCTIS version: 1.10.2
  • Python version: 3.6.5
  • Operating System: Linux

Description

Hello!
I used the CTM model with parameter use_partitions=False and got
Exception: Wait! BoW and Contextual Embeddings have different sizes! You might want to check if the BoW preparation method has removed some documents.

What I Did

from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM
from octis.evaluation_metrics.coherence_metrics import Coherence

dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")

model = CTM(num_topics=25, num_epochs=100, inference_type='combined', use_partitions=False)
output = model.train_model(dataset)
npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_npmi')
npmi_score = npmi.score(output)

ETM - How to use pretrained word embeddings

@silviatti Hello again,

I have another silly question for you.

The original ETM package allows the user to use pre-trained word embeddings (the file name is 'skipgram_emb_300d.txt')

How do I tell the model to look for the pretrained embedding file instead?

Is it something like:

model = ETM(num_topics=25, embeddings_path='...\path\to\embeddings\skipgram_emb_300d.txt')?

Thank you again for your assistance :)

Luke
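
Editorial note: a later issue in this list ("TypeError: load_word2vec_format() got an unexpected keyword argument 'no_header'") passes a pre-trained embeddings file with parameters like these (a sketch; parameter names are taken from that issue and assume OCTIS >= 1.10):

model = ETM(num_topics=25, train_embeddings=False,
            embeddings_type='word2vec',
            embeddings_path=r'path/to/skipgram_emb_300d.txt',
            binary_embeddings=False)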

UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined> - when fetching the DBPedia_IT dataset

  • OCTIS version: 1.10.3
  • Python version: 3.9
  • Operating System: Windows

Description

I am trying to fetch the DBPedia_IT dataset. I expected the download to complete, but a UnicodeEncodeError was raised.

What I Did

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset('DBPedia_IT')

Traceback (most recent call last):

  Input In [42] in <module>
    dataset.fetch_dataset('DBPedia_IT')

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\dataset.py:392 in fetch_dataset
    cache = download_dataset(dataset_name, target_dir=dataset_home, cache_path=cache_path)

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\downloader.py:77 in download_dataset
    f.write(corpus.text)

  File ~\Anaconda3\envs\SentenceTransformers\lib\encodings\cp1252.py:19 in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined>

NMF Model Cannot Be Imported Into Colab

  • OCTIS version: 1.8.0

Description

Tried to import NMF model into Colab

What I Did

from octis.models.NMF import NMF
ImportError                               Traceback (most recent call last)
<ipython-input-17-96f0cef9c8fa> in <module>()
----> 1 from octis.models.NMF import NMF

/usr/local/lib/python3.7/dist-packages/octis/models/NMF.py in <module>()
      1 from octis.models.model import AbstractModel
      2 import numpy as np
----> 3 from gensim.models import nmf
      4 import gensim.corpora as corpora
      5 import octis.configuration.citations as citations

ImportError: cannot import name 'nmf' from 'gensim.models' (/usr/local/lib/python3.7/dist-packages/gensim/models/__init__.py)

ETM - Single-word document causes ETM training to fail

  • OCTIS version: 1.9.0
  • Python version: 3.7.6
  • Operating System: Ubuntu 20.04 LTS

Description

Probably related to #20 and #35.
If your preprocessed corpus contains any single-word document, ETM training fails. This should not happen, as the Preprocessing class has 0 as the default value for the parameters min_words_docs and min_df, which define, respectively, the minimum number of words a document must have to be kept and the minimum document frequency for words in the corpus.
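
As a workaround until the defaults are fixed, filtering short documents at preprocessing time should avoid the crash; a sketch using the min_words_docs parameter named above (documents_path is a placeholder):

from octis.preprocessing.preprocessing import Preprocessing

# Keep only documents with at least 2 tokens, so ETM's batching
# never receives a single-word document.
preprocessor = Preprocessing(min_words_docs=2)
dataset = preprocessor.preprocess_dataset(documents_path='corpus.txt')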

What I Did

I've implemented a test case illustrating the scenario. The test fails. The test code can be found here, and the error stacktrace as shown on GitHub Actions can be seen here.

Below, the aforementioned stacktrace (on my local machine):

Current call:  0
model: ETM(
  (t_drop): Dropout(p=0.5, inplace=False)
  (theta_act): ReLU()
  (alphas): Linear(in_features=300, out_features=16, bias=False)
  (q_theta): Sequential(
    (0): Linear(in_features=872, out_features=800, bias=True)
    (1): ReLU()
    (2): Linear(in_features=800, out_features=800, bias=True)
    (3): ReLU()
  )
  (mu_q_theta): Linear(in_features=800, out_features=16, bias=True)
  (logsigma_q_theta): Linear(in_features=800, out_features=16, bias=True)
)
Traceback (most recent call last):
  File "octis_test/unified_training.py", line 55, in <module>
    model_runs=5, plot_best_seen=True) # number of runs of the topic model
  File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/optimization/optimizer.py", line 160, in optimize
    results = self._optimization_loop(opt)
  File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/optimization/optimizer.py", line 285, in _optimization_loop
    f_val = self._objective_function(next_x)
  File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/optimization/optimizer.py", line 217, in _objective_function
    self.topk)
  File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/models/ETM.py", line 60, in train_model
    continue_training = self._train_epoch(epoch)
  File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/models/ETM.py", line 126, in _train_epoch
    self.hyperparameters['embedding_size'], self.device)
  File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/models/ETM_model/data.py", line 17, in get_batch
    doc = [doc.squeeze()]
AttributeError: 'list' object has no attribute 'squeeze'

ETM model: AttributeError: 'list' object has no attribute 'squeeze'

  • OCTIS version: 1.9.0
  • Python version: 3.6.13
  • Operating System: Windows 10

Description

I tried to run the ETM model through OCTIS, but got an attribute error.
I've attached my corpus (corpus.csv; for some reason git won't let me attach an actual .tsv file) and vocabulary (vocabulary.txt) for your convenience.

What I Did

Here's what I did

I complied with the format of the dataset as a .tsv and vocabulary as a .txt file with one stem per row.

I was able to load the dataset with no errors.

To run the model I did the following:

from octis.models.ETM import ETM
model_etm = ETM(num_topics=40)
output_fomc = model_etm.train_model(dataset)

model: ETM(
  (t_drop): Dropout(p=0.5, inplace=False)
  (theta_act): ReLU()
  (rho): Linear(in_features=300, out_features=9659, bias=False)
  (alphas): Linear(in_features=300, out_features=40, bias=False)
  (q_theta): Sequential(
    (0): Linear(in_features=9659, out_features=800, bias=True)
    (1): ReLU()
    (2): Linear(in_features=800, out_features=800, bias=True)
    (3): ReLU()
  )
  (mu_q_theta): Linear(in_features=800, out_features=40, bias=True)
  (logsigma_q_theta): Linear(in_features=800, out_features=40, bias=True)
)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-159-a49502c35827> in <module>
----> 1 output_fomc = model_etm.train_model(dataset)

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM.py in train_model(self, dataset, hyperparameters, top_words)
     54 
     55         for epoch in range(0, self.hyperparameters['num_epochs']):
---> 56             continue_training = self._train_epoch(epoch)
     57             if not continue_training:
     58                 break

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM.py in _train_epoch(self, epoch)
    120             self.model.zero_grad()
    121             data_batch = data.get_batch(self.train_tokens, self.train_counts, ind, len(self.vocab.keys()),
--> 122                                         self.hyperparameters['embedding_size'], self.device)
    123             sums = data_batch.sum(1).unsqueeze(1)
    124             if self.hyperparameters['bow_norm']:

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM_model\data.py in get_batch(tokens, counts, ind, vocab_size, emsize, device)
     15         #L = count.shape[1]
     16         if len(doc) == 1:
---> 17             doc = [doc.squeeze()]
     18             count = [count.squeeze()]
     19         else:

AttributeError: 'list' object has no attribute 'squeeze'


[vocabulary.txt](https://github.com/MIND-Lab/OCTIS/files/7423616/vocabulary.txt)
[corpus.csv](https://github.com/MIND-Lab/OCTIS/files/7423618/corpus.csv)


Integrating D-ETM

Dear @silviatti ,

Related to issue #1

I would like to resubmit the feature request to integrate DETM into your OCTIS suite. It appears the person who originally proposed issue #1 back in April 2021 has lost interest or no longer wants to pursue it.

Would it be possible for you to complete the integration?

With kindest regards
Luke

Adding the perplexity evaluation metric

Great work!

But as for evaluating topic models, how about adding the perplexity metric, which is a common way to evaluate unsupervised language/topic models?
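
For context, a minimal sketch of what such a metric could wrap for gensim-backed models (LdaModel.log_perplexity returns a per-word variational bound; exponentiating it is the usual convention):

import numpy as np
from gensim.models import LdaModel

def perplexity(lda: LdaModel, corpus) -> float:
    # corpus is a gensim bag-of-words corpus; lower perplexity is better.
    return float(np.exp2(-lda.log_perplexity(corpus)))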

CTM training fails.

  • OCTIS version: 1.8.0
  • Python version: 3.8.10
  • Operating System: Ubuntu 20.04.02

Description

CTM training fails.

What I Did

from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM
from octis.models.model import save_model_output

dataset = Dataset()
dataset.load_custom_dataset_from_folder(DATASET_PATH)
model = CTM(num_topics=TOPIC_SIZE)
model_output = model.train_model(dataset)
save_model_output(model_output, MODEL_OUTPUT_PATH)
save_model_output(model, MODEL_PATH)

The following error message was displayed.

Batches:  84%|████████████████████████████████████████████████████████████████████████████████████████▌                 | 21790/26093 [59:43<11:47,  6.08it/s]
Traceback (most recent call last):
  File "train.py", line 62, in <module>
    model = ProdLDA(num_topics=TOPIC_SIZE)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 95, in train_model
    x_train, x_test, x_valid, input_size = self.preprocess(
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 175, in preprocess
    b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 208, in load_bert_data
    bert_ouput = bert_embeddings_from_list(texts, bert_model)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/contextualized_topic_models/utils/data_preparation.py", line 35, in bert_embeddings_from_list
    return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))
  File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/SentenceTransformer.py", line 160, in encode
    out_features = self.forward(features)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/models/Transformer.py", line 51, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 991, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 582, in forward
    layer_outputs = layer_module(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 470, in forward
    self_attention_outputs = self.attention(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 401, in forward
    self_outputs = self.self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 305, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

Improve Preprocessing Speed

Preprocessing currently takes a long time for large datasets. One way to improve the speed is to use spaCy pipes, particularly for lemmatization. Preprocessing is a very useful class that can do a lot with just simple argument configuration.

import spacy

# documents, stop_words and punctuations are assumed to be defined elsewhere.
spacy_nlp = spacy.load("en_core_web_sm")
processed_documents = []
for doc in spacy_nlp.pipe(documents, batch_size=32, n_process=3, disable=["parser", "ner"]):
    # Lemmatize each token and convert to lower case if the token is not a pronoun
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in doc]

    # Remove stop words and punctuation
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations]
    processed_documents.append(tokens)

I'm happy to contribute code to make this change.

num_samples should be a positive integer value, but got num_samples=0

  • OCTIS version:
  • Python version:
  • Operating System:

Description

I am not sure why, when I try to run the optimize function, I get this error: "num_samples should be a positive integer value, but got num_samples=0"

What I Did

from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical
dataset = Dataset()
dataset.load_custom_dataset_from_folder("mydata")

model = CTM(num_topics=10,
            num_epochs=30,
            inference_type='zeroshot', 
            bert_model="distiluse-base-multilingual-cased")

npmi = Coherence(texts=dataset.get_corpus())

search_space = {"num_layers": Categorical({1, 2, 3}), 
                "num_neurons": Categorical({100, 200, 300}),
                "activation": Categorical({'relu', 'softplus'}), 
                "dropout": Real(0.0, 0.95)
                }
optimization_runs=30
model_runs=1

optimizer=Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    plot_best_seen=True, plot_model=True, plot_name="B0_plot", 
    save_path='results2/test_ctm//')

I can't find where to set this num_samples variable.
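
Editorial note: num_samples=0 is the error PyTorch's DataLoader raises when it is handed an empty split, so it is worth checking whether the custom dataset actually contains validation and test partitions (a sketch using the getter that appears elsewhere in these issues; the return shape is assumed):

# With use_partitions=True (the default), CTM needs non-empty splits.
train, validation, test = dataset.get_partitioned_corpus(use_validation=True)
print(len(train), len(validation), len(test))  # a 0 here would explain num_samples=0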

How do I load a dataset? How to do multi-label classification with OCTIS?

  • OCTIS version: any
  • Python version: any
  • Operating System: any

Description

I am trying to evaluate topic model algorithms with a provided dataset, without success.

What I Did

I am trying to run the following code:

from octis.evaluation_metrics.classification_metrics import AccuracyScore
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA


dataset = Dataset(corpus=X, labels=y)
model = LDA(num_topics=5, alpha=0.1)

acc = AccuracyScore(dataset)
output = model.train_model(dataset)

Where X is my text data and y is the topics (multilabel) for each text. The last line returns this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-29-99a4fc73752b> in <module>
      1 acc = AccuracyScore(dataset)
----> 2 output = model.train_model(dataset)

~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/models/LDA.py in train_model(self, dataset, hyperparams, top_words)
    164 
    165         if self.use_partitions:
--> 166             train_corpus, test_corpus = dataset.get_partitioned_corpus(use_validation=False)
    167         else:
    168             train_corpus = dataset.get_corpus()

~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/dataset/dataset.py in get_partitioned_corpus(self, use_validation)
     41     # Partitioned Corpus getter
     42     def get_partitioned_corpus(self, use_validation=True):
---> 43         last_training_doc = self.__metadata["last-training-doc"]
     44         # handle the exception if last_validation_doc is not defined, return
     45         # the empty validation set

TypeError: 'NoneType' object is not subscriptable
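
Editorial note: a Dataset built directly from in-memory lists has no partition metadata, and LDA defaults to using partitions. Two hedged workarounds, based on calls that appear elsewhere in these issues:

# Option 1: tell the model not to use partitions
# (the same model.partitioning(False) call appears in a later issue here).
model = LDA(num_topics=5, alpha=0.1)
model.partitioning(False)

# Option 2: save corpus.tsv and vocabulary.txt to a folder and load them
# with load_custom_dataset_from_folder, which fills in the metadata.
dataset = Dataset()
dataset.load_custom_dataset_from_folder("mydata")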

Unexpected keyword argument 'df_max_freq' when max_features is used in Preprocessing

Setting max_features in Preprocessing throws the following error:
TypeError: __init__() got an unexpected keyword argument 'df_max_freq'

Example code

import string
from octis.preprocessing.preprocessing import Preprocessing

# Initialize preprocessing
preprocessor = Preprocessing(vocabulary=None, max_features=None,
                             remove_punctuation=True, punctuation=string.punctuation,
                             lemmatize=True, stopword_list='english',
                             min_chars=1, min_words_docs=0, max_df=0.9, min_df=0.1)
# preprocess
dataset = preprocessor.preprocess_dataset(documents_path=r'dataset.csv')

I'm happy to help fix the issue.

OCTIS is an excellent library - keep up the great work.

Update format for partition input in ReadMe

  • OCTIS version: 1.2.0
  • Python version: Python 3.8.3
  • Operating System: Linux

According to the README, input in the partition column for a custom dataset should be of the type 'training', 'validation', 'test', which I can't get to yield a partition:


Make sure that the dataset is in the following format:

  • corpus file: a .tsv file (tab-separated) that contains up to three columns, i.e. the document, the partition, and the label associated with the document (optional).
  • vocabulary: a .txt file where each line represents a word of the vocabulary

The partition can be "training", "test" or "validation". An example dataset can be found here: sample_dataset.


However, it seems the right format is 'train', 'val', 'test', which does work for me - just passing this on to make the README clearer.
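
For illustration, a corpus.tsv in the format that worked (tab-separated; the contents are placeholders):

this is the first document	train
another training document	train
a validation document	val
a held-out test document	test

The loader below shows why: it matches the literal strings "train" and "val".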

    def load_custom_dataset_from_folder(self, path):
        """
        Loads all the dataset from a folder
        Parameters
        ----------
        path : path of the folder to read
        """
        self.dataset_path = path
        try:
            if exists(self.dataset_path + "/metadata.json"):
                self._load_metadata(self.dataset_path + "/metadata.json")
            else:
                self.__metadata = dict()
            df = pd.read_csv(self.dataset_path + "/corpus.tsv", sep='\t', header=None)
            if len(df.keys()) > 1:
                df[1] = df[1].replace("train", "a_train")
                df[1] = df[1].replace("val", "b_val")
                df = df.sort_values(1).reset_index(drop=True)

                self.__metadata['last-training-doc'] = len(df[df[1] == 'a_train'])
                self.__metadata['last-validation-doc'] = len(df[df[1] == 'b_val']) + len(df[df[1] == 'a_train'])

Gensim Coherence Pipeline only for English texts?

Hi all,

thank you for this amazing library! I'll definitely consider using it for my master’s thesis. I have a question regarding coherence scores and non-English texts, and I hope it is okay to ask this here.

I saw that you're using the gensim coherence pipeline, which is based on Röder et al. (2015). In this paper, it is not clear to me whether they used only English Wikipedia or a multilingual Wikipedia as the reference corpus for calculating the coherence measures. So my question is whether the gensim coherence pipeline is suitable for evaluating non-English texts (e.g. German), or whether it would be better to use other approaches like TC-W2V with a custom corpus.

Regards
Luca
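
Editorial note: gensim's coherence pipeline computes its co-occurrence statistics from whatever reference texts you pass in; Röder et al. used Wikipedia as the reference corpus in their experiments, but the implementation is not tied to English. A sketch with a German corpus (german_corpus is a placeholder for your own tokenized documents):

from octis.evaluation_metrics.coherence_metrics import Coherence

# Coherence statistics are estimated over the supplied texts,
# so a German corpus serves as its own reference corpus.
npmi = Coherence(texts=german_corpus, topk=10, measure='c_npmi')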

Can CTM return topic_significance_uniform score?

  • OCTIS version:
  • Python version:
  • Operating System:

Description

Currently, CTM returns only topic_significance_background score. Is it possible to get topic_significance_uniform score (per topic)?

Thanks,
-Atakan


'OptimizerEvaluation' object has no attribute 'dict_model_runs' error while using 'extra_metrics' in optimization

  • OCTIS version: 1.10.0
  • Python version: 3.8.10
  • Operating System: Ubuntu 20.04.3

Description

I am trying to optimize an LDA model with custom data. My evaluation metric is npmi, but I am also using topic_diversity as an extra metric during optimization.

What I Did

Code:

# Create Model
model = LDA(num_topics=20, alpha=0.1)
model.partitioning(False)

# Initialize metric
npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_npmi')

# Initialize metric
topic_diversity = TopicDiversity(topk=10)

optimization_runs=30 # number of optimization iterations
model_runs=5 # number of runs of the topic model

# Define the search space. To see which hyperparameters to optimize, see the topic model's initialization signature
search_space = {"alpha": Real(low=0.001, high=5.0), 
                "eta": Real(low=0.001, high=5.0), 
                'num_topics': Integer(low=1, high=10, prior='uniform')}

# Initialize an optimizer object and start the optimization.
optimizer=Optimizer()
optResult=optimizer.optimize(model, dataset, 
                             search_space= search_space, 
                             save_path=output_path, # path to store the results
                             metric= npmi,
                             number_of_call=optimization_runs,
                             model_runs=model_runs, 
                             extra_metrics=[topic_diversity])

# save the results of the optimization in a csv file
optResult.save_to_csv("results.csv")

Error traceback :

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/envs/topic_modeling/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in save_to_csv(self, name_file)
    150             try:
--> 151                 df[metric.info()["name"] + '(not optimized)'] = [np.median(
    152                     self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]

~/envs/topic_modeling/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in <listcomp>(.0)
    151                 df[metric.info()["name"] + '(not optimized)'] = [np.median(
--> 152                     self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
    153             except:

AttributeError: 'OptimizerEvaluation' object has no attribute 'dict_model_runs'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_6863/802420513.py in <module>
     10 
     11 #save the results of th optimization in a csv file
---> 12 optResult.save_to_csv("results.csv")

~/envs/topic_modeling/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in save_to_csv(self, name_file)
    152                     self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
    153             except:
--> 154                 df[metric.__class__.__name__ + '(not optimized)'] = [np.median(
    155                     self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
    156 

~/envs/topic_modeling/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in <listcomp>(.0)
    153             except:
    154                 df[metric.__class__.__name__ + '(not optimized)'] = [np.median(
--> 155                     self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
    156 
    157         if not name_file.endswith(".csv"):

AttributeError: 'OptimizerEvaluation' object has no attribute 'dict_model_runs'

Possible solution

Code modification here

Old

try:
    df[metric.info()["name"] + '(not optimized)'] = [np.median(
        self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
except:
    df[metric.__class__.__name__ + '(not optimized)'] = [np.median(
        self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]

New

try:
  df[metric.info()["name"] + '(not optimized)'] = [np.median(
      self.info['dict_model_runs'][metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
except:
  df[metric.__class__.__name__ + '(not optimized)'] = [np.median(
      self.info['dict_model_runs'][metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]

Control of prior distribution

If you are doing Bayesian optimization, it is probably a good idea to give the user control of the prior probability distribution. Any plans to add that to your API?
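
Editorial note: scikit-optimize already exposes a prior argument on its space classes, and OCTIS search spaces pass these objects through (one appears in a later issue here), so some control is available today:

from skopt.space.space import Real, Integer

search_space = {"alpha": Real(0.001, 5.0, prior='log-uniform'),
                "num_topics": Integer(1, 10, prior='uniform')}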

Citation for NeuralLDA and ProdLDA

Hi, thanks for sharing this amazing work!
I think the current citations for NeuralLDA and ProdLDA point to the repo only.
These models are from the Autoencoding Variational Inference for Topic Models (Srivastava and Sutton, 2017) paper. Please consider citing the paper as well. Thanks!

Make Stop Word list independent from the installation directory

  • OCTIS version: '1.5.0'
  • Python version: 3.7
  • Operating System: Mac

Description

I am trying to preprocess a custom corpus, and when I try to remove stop words, here is what I get.

What I Did

preprocessor = Preprocessing(vocabulary=None, max_features=None, 
                             remove_punctuation=True, punctuation=string.punctuation,
                             lemmatize=True, stopword_list='english',
                             min_chars=1, min_words_docs=0)
# preprocess
dataset = preprocessor.preprocess_dataset(documents_path=corpus_path)

And I am getting the following error.

    101             else:
    102                 if 'english' in stopword_list:
--> 103                     with open('octis/preprocessing/stopwords/english.txt') as fr:
    104                         stopwords = [line.strip() for line in fr.readlines()]
    105                         assert stopword_list == language

FileNotFoundError: [Errno 2] No such file or directory: 'octis/preprocessing/stopwords/english.txt'

More context: I am using a Jupyter notebook.

A possible solution: it may be useful to use pathlib to handle these kinds of paths.

If I fix it locally I can raise a PR soon.
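
A sketch of the pathlib-based fix suggested above: resolve the stopword file relative to the installed module instead of the current working directory (variable names assumed):

from pathlib import Path

# Inside octis/preprocessing/preprocessing.py, resolve the file
# relative to the package, not the notebook's working directory.
stopwords_path = Path(__file__).parent / "stopwords" / "english.txt"
with open(stopwords_path) as fr:
    stopwords = [line.strip() for line in fr]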

AttributeError: 'WordEmbeddingsCentroidSimilarity' object has no attribute 'binary'

  • OCTIS version: 1.9.0
  • Python version: 3.9.7
  • Operating System:
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=20.04
    DISTRIB_CODENAME=focal
    DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"

Description

I'm not familiar with GitHub, so forgive me if this comes across as rude.
Thank you for publishing such great work.

It seems that the constructors of WordEmbeddingsPairwiseSimilarity and WordEmbeddingsCentroidSimilarity do not set self.binary.
Therefore, I could not use any pretrained word embeddings other than the default.

What I Did

In [2]: from octis.evaluation_metrics import similarity_metrics

In [3]: dummy_kv_path = "/workdir/dummy_kv.txt"

In [4]: similarity_metrics.WordEmbeddingsPairwiseSimilarity(word2vec_path=dummy_kv_path)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-c02dac4f77ab> in <module>
----> 1 similarity_metrics.WordEmbeddingsPairwiseSimilarity(word2vec_path=dummy_kv_path)

~/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/octis/evaluation_metrics/similarity_metrics.py in __init__(self, word2vec_path, topk)
     71             self.wv = api.load('word2vec-google-news-300')
     72         else:
---> 73             self.wv = KeyedVectors.load_word2vec_format( word2vec_path, binary=self.binary)
     74 
     75         self.topk = topk

AttributeError: 'WordEmbeddingsPairwiseSimilarity' object has no attribute 'binary'

In [5]: similarity_metrics.WordEmbeddingsCentroidSimilarity(word2vec_path=dummy_kv_path)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-1f1c772b67de> in <module>
----> 1 similarity_metrics.WordEmbeddingsCentroidSimilarity(word2vec_path=dummy_kv_path)

~/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/octis/evaluation_metrics/similarity_metrics.py in __init__(self, word2vec_path, topk)
    115             self.wv = api.load('word2vec-google-news-300')
    116         else:
--> 117             self.wv = KeyedVectors.load_word2vec_format(word2vec_path, binary=self.binary)
    118         self.topk = topk
    119 

AttributeError: 'WordEmbeddingsCentroidSimilarity' object has no attribute 'binary'
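
A sketch of the likely fix: accept and store binary before the vectors are loaded (the added parameter name is an assumption; the base class is omitted here):

from gensim.models import KeyedVectors
import gensim.downloader as api

class WordEmbeddingsPairwiseSimilarity:
    def __init__(self, word2vec_path=None, binary=False, topk=10):
        self.binary = binary  # assign before it is used below
        if word2vec_path is None:
            self.wv = api.load('word2vec-google-news-300')
        else:
            self.wv = KeyedVectors.load_word2vec_format(word2vec_path, binary=self.binary)
        self.topk = topk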

Significance score per topics

  • OCTIS version:
  • Python version:
  • Operating System:

Description

I am working on topic modeling for noisy short texts, trying to get topic significance scores per topic.

What I Did

for t in output: #'output' is the model itself 
    
    significance_uniform_score = topic_signif_uniform.score(t)
    print("Topic Significance Uniform Score: "+str(significance_uniform_score))

I get the following error message:


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      1 # Retrieve metrics score
      2
----> 3 for t in output[:]:
      4
      5 #topic_diversity_score = topic_diversity.score(t)

TypeError: unhashable type: 'slice'

Is it possible to get topic significance score per topic?

WECoherencePairwise and WECoherenceCentroid are negatively correlated.

  • OCTIS version: 1.9.0
  • Python version :3.9.7
  • Operating System:
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=20.04
    DISTRIB_CODENAME=focal
    DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"

Description

I used OCTIS metrics to evaluate my own implementation of a model. As a result, I found that WECoherencePairwise and WECoherenceCentroid have a negative correlation, although I think these two metrics should be positively correlated.
Each point in the figure represents the result of an experiment under different conditions.

WECoherenceCentroid's calculation uses distance - 1, but 1 - distance would be correct (or use sklearn.metrics.pairwise.cosine_similarity):

# https://github.com/MIND-Lab/OCTIS/blob/master/octis/evaluation_metrics/coherence_metrics.py#L171 
distance = spatial.distance.cosine(self._wv.__getitem__(w1), self._wv.__getitem__(w2))
topic_coherence += distance - 1
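
A sketch of the corrected accumulation, per the suggestion above:

# Cosine similarity in [-1, 1], rather than distance - 1.
similarity = 1 - spatial.distance.cosine(self._wv.__getitem__(w1), self._wv.__getitem__(w2))
topic_coherence += similarity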

What I Did

[scatter plot of WECoherencePairwise vs. WECoherenceCentroid scores]

No self.vocab when not using partitions for the CTM-class

  • OCTIS version: 1.2.0
  • Python version: 3.8.3
  • Operating System: Linux

Description

When performing hyperparameter search with the CTM model and use_partitions=False, I get the error: AttributeError: 'CTM' object has no attribute 'vocab'.

I believe moving line 94 in CTM.py [self.vocab = dataset.get_vocabulary()] prior to the self.use_partitions if-statement in line 87 would solve the problem.

Current call:  0
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-63-7718f92a8020> in <module>
      1 optimizer=Optimizer()
----> 2 optimization_result = optimizer.optimize(
      3     model, dataset, npmi, search_space, number_of_call=optimization_runs,
      4     model_runs=model_runs, save_models=True,
      5     extra_metrics=None, # to keep track of other metrics

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
    158 
    159         # Perform Bayesian Optimization
--> 160         results = self._optimization_loop(opt)
    161 
    162         return results

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
    283             else:
    284                 next_x = opt.ask()
--> 285                 f_val = self._objective_function(next_x)
    286 
    287             # Update the opt using (next_x,f_val)

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _objective_function(self, hyperparameter_values)
    214 
    215             # Prepare model
--> 216             model_output = self.model.train_model(self.dataset, params,
    217                                                   self.topk)
    218             # Score of the model

~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in train_model(self, dataset, hyperparameters, top_words)
     89             data_corpus = [' '.join(i) for i in dataset.get_corpus()]
     90             self.X_train, input_size = self.preprocess(
---> 91                 self.vocab, train=data_corpus, bert_train_path=self.hyperparameters['bert_path'] + "_train.pkl",
     92                 bert_model=self.hyperparameters["bert_model"])
     93 

AttributeError: 'CTM' object has no attribute 'vocab'

Despite looking at some demos, I am still not able to make pull requests, but I wanted to let you know nonetheless.

Best,
Thyge

Question: OCTIS supports GPU?

  • OCTIS version: 1.8.0
  • Python version: 3.8.10
  • Operating System: Ubuntu 20.04.02

Description

If I add a GPU with CUDA support, will OCTIS be faster?
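
Editorial note: judging by the tracebacks elsewhere in these issues, the neural models (ETM, CTM/ProdLDA) run on PyTorch and sentence-transformers, both of which use CUDA when available, so a GPU should speed up training and embedding computation. A quick check:

import torch

print(torch.cuda.is_available())  # True means the PyTorch-backed models can use the GPU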


Evaluate 3 different topic modeling algorithms

  • OCTIS version:
  • Python version: 3.7
  • Operating System: linux

Description

I am a PhD candidate and I need to evaluate the performance of three different topic modeling algorithms: LDA, LSI and BERTopic (LDA and LSI were trained using the Gensim package).
What relevance metrics should I use apart from the coherence score? I would like to include in my paper a table or graph that shows an evaluation in terms of model accuracy (coherence score) and topic relevance (should I use the topic diversity metric?).
Thank you
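
For reference, a minimal sketch of scoring one model output on both axes mentioned above, using the OCTIS metrics that appear throughout these issues (dataset and output are assumed to be already built):

from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_npmi')
diversity = TopicDiversity(topk=10)

print("npmi:", npmi.score(output))
print("diversity:", diversity.score(output))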


TypeError: load_word2vec_format() got an unexpected keyword argument 'no_header'

  • OCTIS version: 1.10.0
  • Python version: 3.6.13
  • Operating System: Windows 10

Description

Hi @lffloyd and @silviatti

I tried to run the ETM with pre-trained embeddings after the recent upgrade, and it returned this error.

TypeError: load_word2vec_format() got an unexpected keyword argument 'no_header'.

Please advise if I made an error on my end.

My commands and traceback are provided below.

Thank you so much!
Luke

What I Did

model = ETM(num_topics=40, num_epochs=1, use_partitions=False, train_embeddings=False,
            embeddings_type='word2vec', embeddings_path=r'my/path/to/embedding/skipgram_emb_300d.txt', binary_embeddings=False, headerless_embeddings=True)

output= model.train_model(dataset)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
d:\01MRes_ubuntu\OCTIS\fomcNoPartitionsPreTrained\EtmRunModelPreTrained300.py in <module>
----> 26 output_fomc_etm = model.train_model(dataset)

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM.py in train_model(self, dataset, hyperparameters, top_words)
     74         if hyperparameters is None:
     75             hyperparameters = {}
---> 76         self.set_model(dataset, hyperparameters)
     77         self.top_word = top_words
     78         self.early_stopping = EarlyStopping(patience=5, verbose=True)

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM.py in set_model(self, dataset, hyperparameters)
    119 
    120         self.set_default_hyperparameters(hyperparameters)
--> 121         self.load_embeddings()
    122         ## define model and optimizer
    123         self.model = etm.ETM(num_topics=self.hyperparameters['num_topics'], vocab_size=len(self.vocab.keys()),

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\base_etm.py in load_embeddings(self)
     52                                         self.hyperparameters['embeddings_type'],
     53                                         self.hyperparameters['binary_embeddings'],
---> 54                                         self.hyperparameters['headerless_embeddings'])
     55         embeddings = np.zeros((len(self.vocab.keys()), self.hyperparameters['embedding_size']))
     56         for i, word in enumerate(self.vocab.values()):

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\base_etm.py in _load_word_vectors(self, embeddings_path, embeddings_type, binary_embeddings, headerless_embeddings)
     85                 embeddings_path,
     86                 binary=binary_embeddings,
---> 87                 no_header=headerless_embeddings)
     88 
     89         vectors = {}

TypeError: load_word2vec_format() got an unexpected keyword argument 'no_header'
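
Editorial note: the no_header keyword was added to gensim's load_word2vec_format in gensim 4.0, so older gensim installs (common in Python 3.6 environments, which gensim 4 does not support) raise exactly this TypeError. A quick check:

import gensim

print(gensim.__version__)  # needs >= 4.0 for load_word2vec_format(..., no_header=...)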

Loading unprocessed corpus documents with CTM and Optimizer

  • OCTIS version: 1.10.0
  • Python version: 3.7.6
  • Operating System: Ubuntu 20.04 LTS

Description

I've asked this at #29, but decided to open a new issue because this is a more specific scenario. So, here it is:

Hi @silviatti. So, if I understand correctly, currently there's no way to load the unprocessed corpus documents on OCTIS' CTM while using its optimizer, in a manner similar to the one done on standalone CTM's README?

Originally posted by @lffloyd in #29 (comment)

What I Did

I took a look at the docs.

AttributeError: module 'octis' has no attribute 'configuration'

  • OCTIS version: 1.3.0
  • Python version: 3.6
  • Operating System: Windows

Description

Tried to run the server and create an experiment

What I Did

127.0.0.1 - - [27/Apr/2021 20:06:58] "POST /selectPath HTTP/1.1" 200 -
{'partitioning': False, 'path': 'D:/OctisResults', 'dataset': '20NewsGroup', 'model': {'name': 'LDA', 'parameters': {'alpha': 0.1, 'eta': 0.1, 'iterations': 50, 'passes': 1}}, 'optimization': {'iterations': 5, 'model_runs': 3, 'surrogate_model': 'GP', 'n_random_starts': 3, 'acquisition_function': 'LCB', 'search_spaces': {'num_topics': {'low': 2, 'high': 20}}}, 'optimize_metrics': [{'name': 'Coherence', 'parameters': {'measure': 'c_npmi', 'texts': 'use dataset texts', 'topk': 10}}], 'track_metrics': [{'name': 'Coherence', 'parameters': {'measure': 'c_npmi', 'texts': 'use dataset texts', 'topk': 10}}]}
127.0.0.1 - - [27/Apr/2021 20:07:20] "POST /startExperiment HTTP/1.1" 200 -
starting OctExpnewsgroup
Process Process-2:1:
Traceback (most recent call last):
  File "D:\octisExp\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "D:\octisExp\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "D:\octisExp\lib\site-packages\octis\dashboard\queueManager.py", line 260, in _execute_and_update
    startExperiment(toRun[running[0]])
  File "D:\octisExp\lib\site-packages\octis\dashboard\experimentManager.py", line 136, in startExperiment
    model_class = importModel(parameters["model"]["name"])
  File "D:\octisExp\lib\site-packages\octis\dashboard\experimentManager.py", line 64, in importModel
    model = importClass(model_name, model_name, module_path)
  File "D:\octisExp\lib\site-packages\octis\dashboard\experimentManager.py", line 46, in importClass
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "D:\octisExp\lib\site-packages\octis\models\LDA.py", line 5, in <module>
    import octis.configuration.citations as citations
AttributeError: module 'octis' has no attribute 'configuration'

Exception raised when running tutorial 'How to optimize the hyperparameters of a neural topic model (CTM on M10)'

  • OCTIS version: 1.2.0
  • Python version: 3.8.3
  • Operating System: Linux

Hi Octis team,

When I run your tutorial on my local server (Jupyter notebook), I get an exception. I get the same exception when training a single model (no hyperparameter search) on custom data.

I have attempted to locate the problem, but when I reproduce the individual steps, everything runs fine - otherwise I would be happy to make a pull request, but I am not sure what is going on here...

One odd observation: although CTM.load_bert_data(bert_train_path, train, bert_model) runs before CTMDataset(x_train.toarray(), b_train, idx2token) in preprocess (see below), and bert_embeddings_from_list from models/contextualized_topic_models/utils/data_preparation.py defaults to show_progress_bar=True, the exception is thrown before any progress bar appears.

    def preprocess(vocab, train, bert_model, test=None, validation=None,
                   bert_train_path=None, bert_test_path=None, bert_val_path=None):
        vocab2id = {w: i for i, w in enumerate(vocab)}
        vec = CountVectorizer(
            vocabulary=vocab2id, token_pattern=r'(?u)\b\w+\b')
        entire_dataset = train.copy()
        if test is not None:
            entire_dataset.extend(test)
        if validation is not None:
            entire_dataset.extend(validation)

        vec.fit(entire_dataset)
        idx2token = {v: k for (k, v) in vec.vocabulary_.items()}

        x_train = vec.transform(train)
        b_train = CTM.load_bert_data(bert_train_path, train, bert_model)

        train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
        input_size = len(idx2token.keys())

Tutorial code that yields the exception

from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical, Integer
from octis.evaluation_metrics.coherence_metrics import Coherence

dataset = Dataset()
dataset.fetch_dataset("M10")

model = CTM(num_topics=10, num_epochs=30, inference_type='zeroshot', bert_model="bert-base-nli-mean-tokens")

npmi = Coherence(texts=dataset.get_corpus())

search_space = {"num_layers": Categorical({1, 2, 3}), 
                "num_neurons": Categorical({100, 200, 300}),
                "activation": Categorical({'sigmoid', 'relu', 'softplus'}), 
                "dropout": Real(0.0, 0.95)
}

optimization_runs=30
model_runs=1

optimizer=Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    save_path='results/test_ctm//')

Current call:  0
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-46-7718f92a8020> in <module>
      1 optimizer=Optimizer()
----> 2 optimization_result = optimizer.optimize(
      3     model, dataset, npmi, search_space, number_of_call=optimization_runs,
      4     model_runs=model_runs, save_models=True,
      5     extra_metrics=None, # to keep track of other metrics

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
    158 
    159         # Perform Bayesian Optimization
--> 160         results = self._optimization_loop(opt)
    161 
    162         return results

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
    283             else:
    284                 next_x = opt.ask()
--> 285                 f_val = self._objective_function(next_x)
    286 
    287             # Update the opt using (next_x,f_val)

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _objective_function(self, hyperparameter_values)
    214 
    215             # Prepare model
--> 216             model_output = self.model.train_model(self.dataset, params,
    217                                                   self.topk)
    218             # Score of the model

~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in train_model(self, dataset, hyperparameters, top_words)
     80             self.vocab = dataset.get_vocabulary()
     81             self.X_train, self.X_test, self.X_valid, input_size = \
---> 82                 self.preprocess(self.vocab, data_corpus_train, test=data_corpus_test,
     83                                 validation=data_corpus_validation,
     84                                 bert_train_path=self.hyperparameters['bert_path'] + "_train.pkl",

~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in preprocess(vocab, train, bert_model, test, validation, bert_train_path, bert_test_path, bert_val_path)
    178         b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
    179 
--> 180         train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
    181         input_size = len(idx2token.keys())
    182 

~/anaconda3/lib/python3.8/site-packages/octis/models/contextualized_topic_models/datasets/dataset.py in __init__(self, X, X_bert, idx2token)
     15         """
     16         if X.shape[0] != len(X_bert):
---> 17             raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
     18                             "You might want to check if the BoW preparation method has removed some documents. ")
     19 

Exception: Wait! BoW and Contextual Embeddings have different sizes! You might want to check if the BoW preparation method has removed some documents.
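
A likely cause (my assumption, not confirmed in the source): load_bert_data reuses a cached embeddings pickle when one already exists at bert_path, so SBERT embeddings computed for a different corpus can end up paired with the current BoW. Deleting the cached pickles before rerunning forces a recomputation; the 'bert_path' hyperparameter key below is the one visible in the traceback above:

import os

# remove stale cached SBERT embeddings so CTM recomputes them for this corpus
for suffix in ("_train.pkl", "_test.pkl", "_val.pkl"):
    path = model.hyperparameters["bert_path"] + suffix
    if os.path.exists(path):
        os.remove(path)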

My reproduction, which works fine (presumably because it always recomputes the SBERT embeddings instead of loading a cached pickle):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer


def preprocess(vocab, train, bert_model, test=None, validation=None,
               bert_train_path=None, bert_test_path=None, bert_val_path=None):
    vocab2id = {w: i for i, w in enumerate(vocab)}
    vec = CountVectorizer(
        vocabulary=vocab2id, token_pattern=r'(?u)\b\w+\b')
    entire_dataset = train.copy()
    if test is not None:
        entire_dataset.extend(test)
    if validation is not None:
        entire_dataset.extend(validation)

    vec.fit(entire_dataset)
    idx2token = {v: k for (k, v) in vec.vocabulary_.items()}

    x_train = vec.transform(train)
    b_train = bert_embeddings_from_list(train, bert_model)

    train_data = CTMDataset(x_train.toarray(), b_train, idx2token)
    input_size = len(idx2token.keys())

    if test is not None and validation is not None:
        x_test = vec.transform(test)
        b_test = bert_embeddings_from_list(test, bert_model)
        test_data = CTMDataset(x_test.toarray(), b_test, idx2token)

        x_valid = vec.transform(validation)
        b_val = bert_embeddings_from_list(validation, bert_model)
        valid_data = CTMDataset(x_valid.toarray(), b_val, idx2token)
        return train_data, test_data, valid_data, input_size
    if test is None and validation is not None:
        x_valid = vec.transform(validation)
        b_val = bert_embeddings_from_list(validation, bert_model)
        valid_data = CTMDataset(x_valid.toarray(), b_val, idx2token)
        return train_data, valid_data, input_size
    if test is not None and validation is None:
        x_test = vec.transform(test)
        b_test = bert_embeddings_from_list(test, bert_model)
        test_data = CTMDataset(x_test.toarray(), b_test, idx2token)
        return train_data, test_data, input_size
    if test is None and validation is None:
        return train_data, input_size

def bert_embeddings_from_list(texts, sbert_model_to_load="bert-base-nli-mean-tokens", batch_size=100):
    """
    Creates SBERT Embeddings from a list
    """
    model = SentenceTransformer(sbert_model_to_load)
    return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))

import torch
from torch.utils.data import Dataset
import scipy.sparse


class CTMDataset(Dataset):

    """Class to load BOW dataset."""

    def __init__(self, X, X_bert, idx2token):
        """
        Args
            X : array-like, shape=(n_samples, n_features)
                Document word matrix.
        """
        if X.shape[0] != len(X_bert):
            raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
                            "You might want to check if the BoW preparation method has removed some documents. ")

        self.X = X
        self.X_bert = X_bert
        self.idx2token = idx2token

    def __len__(self):
        """Return length of dataset."""
        return self.X.shape[0]

    def __getitem__(self, i):
        """Return sample from dataset at index i."""
        if type(self.X[i]) == scipy.sparse.csr.csr_matrix:
            X = torch.FloatTensor(self.X[i].todense())
            X_bert = torch.FloatTensor(self.X_bert[i])
        else:
            X = torch.FloatTensor(self.X[i])
            X_bert = torch.FloatTensor(self.X_bert[i])

        return {'X': X, 'X_bert': X_bert}

from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical, Integer
from octis.evaluation_metrics.coherence_metrics import Coherence

dataset = Dataset()
dataset.fetch_dataset("M10")

train, validation, test = dataset.get_partitioned_corpus(use_validation=True)

data_corpus_train = [' '.join(i) for i in train]
data_corpus_test = [' '.join(i) for i in test]
data_corpus_validation = [' '.join(i) for i in validation]

vocab = dataset.get_vocabulary()
X_train, X_test, X_valid, input_size = \
    preprocess(vocab, data_corpus_train, test=data_corpus_test,
                validation=data_corpus_validation,
                bert_train_path=""+"_train.pkl",
                bert_test_path=""+"_test.pkl",
                bert_val_path=""+"_val.pkl",
                bert_model='bert-base-nli-mean-tokens')

Batches: 100% 59/59 [00:08<00:00, 7.10it/s]
Batches: 100% 13/13 [00:01<00:00, 6.62it/s]
Batches: 100% 13/13 [00:00<00:00, 28.11it/s]

Not able to run ProdLDA

  • Python version: 3.6
  • Operating System: Windows

Description

I tried:
from octis.models.ProdLDA import ProdLDA

and got the following error:
d:\octisexp\lib\site-packages\octis\models\pytorchavitm\__init__.py in
      1 """Init package"""
      2
----> 3 from octis.models.pytorchavitm.avitm.avitm_model import AVITM_model

ModuleNotFoundError: No module named 'octis.models.pytorchavitm.avitm'

I have created a virtual environment and installed OCTIS using: pip install octis
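
A quick diagnostic (my own sketch, not an official fix) is to check what the installed distribution actually contains; find_spec locates the package without importing it, which matters here because importing pytorchavitm re-triggers the error:

import importlib.util
import pathlib

# locate the installed pytorchavitm package without executing its __init__
spec = importlib.util.find_spec("octis.models.pytorchavitm")
# an 'avitm' folder should appear in this listing; if it is missing, the wheel is incomplete
print(sorted(p.name for p in pathlib.Path(spec.origin).parent.iterdir()))

If avitm is missing, reinstalling with pip install --upgrade --force-reinstall octis, or installing straight from the GitHub repository, should pull a build that includes it.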

Bug OOV words in WECoherenceCentroid

  • OCTIS version: 1.10.0
  • Python version: 3.9.7
  • Operating System: Ubuntu 20.04.2 LTS (focal)

Description

In this line (https://github.com/MIND-Lab/OCTIS/blob/master/octis/evaluation_metrics/coherence_metrics.py#L180), topic[0] is a word; if that word is not included in self._wv, indexing it raises a KeyError.

Since Gensim's KeyedVectors class exposes a vector_size attribute, this code should instead build the zero vector from vector_size:

#t = [0] * len(self._wv.__getitem__(topic[0]))
t = np.zeros(self._wv.vector_size)

Examples of error messages

  File "/root/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/octis-1.10.0-py3.9.egg/octis/evaluation_metrics/coherence_metrics.py", line 180, in score
    t = [0] * len(self._wv.__getitem__(topic[0]))
  File "/root/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 395, in __getitem__
    return self.get_vector(key_or_keys)
  File "/root/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 438, in get_vector
    index = self.get_index(key)
  File "/root/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 412, in get_index
    raise KeyError(f"Key '{key}' not present")
KeyError: "Key 'elsevi' not present"
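
A standalone sketch of the fix (my own wrapper around the proposed change; it assumes wv is a gensim 4 KeyedVectors):

import numpy as np
from gensim.models import KeyedVectors

def topic_centroid(topic, wv: KeyedVectors) -> np.ndarray:
    """Sum the embeddings of a topic's words, skipping OOV words."""
    t = np.zeros(wv.vector_size)  # zero vector of the right size, no indexing of topic[0]
    for word in topic:
        if word in wv:            # KeyedVectors supports membership tests in gensim 4
            t += wv[word]         # OOV words such as 'elsevi' simply contribute nothing
    return t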

Missing info for custom data yields key error in hypersearch

  • OCTIS version: 1.2.0
  • Python version: 3.8.3
  • Operating System: Linux

Hi Octis Team,

Thanks for making this available!

When providing a custom dataset for an LDA hyperparameter search, I get: KeyError: 'info'

This is not the case when I run a single model (no hypersearch), nor when I fetch and use the M10 dataset.

If I manually add an info entry with a name for the dataset to the metadata attribute of the custom dataset, the hyperparameter search works fine.

Perhaps the required metadata could be auto-filled when providing custom data?

Best,
Thyge

Code and traceback:

# Load modules
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Categorical

# Load custom dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder(str(_dp))

# Initiate model
model = LDA(alpha=0.5, eta=0.5)  

# Define search space
search_space = {"num_topics": Categorical({15, 20, 25, 30})}

# Set number of runs
optimization_runs=15
model_runs=1 

# Define evaluation metric
npmi = Coherence(texts=dataset.get_corpus())

# Hypersearch
optimizer=Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    save_path=str(_models / 'Octis' / 'LDA'))


Current call:  0
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-225-087bc04aa55a> in <module>
      1 # Hypersearch
      2 optimizer=Optimizer()
----> 3 optimization_result = optimizer.optimize(
      4     model, dataset, npmi, search_space, number_of_call=optimization_runs,
      5     model_runs=model_runs, save_models=True,

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
    158 
    159         # Perform Bayesian Optimization
--> 160         results = self._optimization_loop(opt)
    161 
    162         return results

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
    299 
    300             # Create an object related to the BO optimization
--> 301             results = OptimizerEvaluation(self, BO_results=res)
    302 
    303             # Save the object

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in __init__(self, optimizer, BO_results)
     45         # Info about optimization
     46         self.info = dict()
---> 47         dataset_info = optimizer.dataset.get_metadata()["info"]
     48         if dataset_info is not None:
     49             self.info.update({"dataset_name": dataset_info["name"]})

KeyError: 'info'

Adding this after loading custom data fixes the problem:

# Load existing metadata
meta_dict = dataset.get_metadata()

# Add name to dict
meta_dict['info'] = {'name':'dataset_name'}

# Update metadata
dataset._Dataset__metadata = meta_dict

# Verify info is updated
dataset.get_info()
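
The same workaround, wrapped as a small helper (my own convenience function; it assumes get_metadata returns a dict, possibly without an 'info' key, for a fresh custom dataset):

def ensure_dataset_info(dataset, name="custom_dataset"):
    """Fill in the 'info' metadata entry that the optimizer expects."""
    meta = dataset.get_metadata() or {}
    meta.setdefault("info", {"name": name})
    dataset._Dataset__metadata = meta  # same private attribute as in the workaround above
    return dataset

# usage: call once right after load_custom_dataset_from_folder
ensure_dataset_info(dataset, name="my_corpus")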

AttributeError: 'KeyedVectors' object has no attribute 'wv'

  • OCTIS version: 1.8.3
  • Python version: 3.8.5
  • Operating System: macOS

Description

Trying to evaluate a model with the WordEmbeddingsInvertedRBOCentroid() metric, I get an AttributeError: "'KeyedVectors' object has no attribute 'wv'"

What I Did

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("custom_dataset")

from octis.models.LDA import LDA
model_LDA_15 = LDA(num_topics=15) 
model_LDA_15_output = model_LDA_15.train_model(dataset)

from octis.evaluation_metrics.diversity_metrics import WordEmbeddingsInvertedRBOCentroid
rbo_centroid_metric = WordEmbeddingsInvertedRBOCentroid()
topic_rbo_centroid_score = rbo_centroid_metric.score(model_LDA_15_output)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-45-eb5075095fc9> in <module>
      1 from octis.evaluation_metrics.diversity_metrics import WordEmbeddingsInvertedRBOCentroid
      2 rbo_centroid_metric = WordEmbeddingsInvertedRBOCentroid()
----> 3 topic_rbo_centroid__score = rbo_centroid_metric.score(model_LDA_15_output)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/diversity_metrics.py in score(self, model_output)
    174                 indexed_list1 = [word2index[word] for word in list1]
    175                 indexed_list2 = [word2index[word] for word in list2]
--> 176                 rbo_val = weirbo_centroid(
    177                     indexed_list1[:self.topk], indexed_list2[:self.topk], p=self.weight, index2word=index2word,
    178                     word2vec=self.wv, norm=self.norm)[2]

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in word_embeddings_rbo(list1, list2, p, index2word, word2vec, norm)
    145     args = (list1, list2, p, index2word, word2vec, norm)
    146 
--> 147     return RBO(rbo_min(*args), rbo_res(*args), rbo_ext(*args))
    148 
    149 

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in rbo_min(list1, list2, p, index2word, word2vec, norm, depth)
     79     """
     80     depth = min(len(list1), len(list2)) if depth is None else depth
---> 81     x_k = overlap(list1, list2, depth, index2word, word2vec, norm)
     82     log_term = x_k * math.log(1 - p)
     83     sum_term = sum(

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in overlap(list1, list2, depth, index2word, word2vec, norm)
     59     # NOTE: comment the preceding and uncomment the following line if you want
     60     # to stick to the algorithm as defined by the paper
---> 61     ov = embeddings_overlap(list1, list2, depth, index2word, word2vec, norm=norm)[0]
     62     # print("overlap", ov)
     63     return ov

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in embeddings_overlap(list1, list2, depth, index2word, word2vec, norm)
     41     word_list2 = [index2word[index] for index in list2]
     42 
---> 43     centroid_1 = np.mean([word2vec.wv[w] for w in word_list1[:depth]], axis=0)
     44     centroid_2 = np.mean([word2vec.wv[w] for w in word_list2[:depth]], axis=0)
     45     cos_sim = 1 - distance.cosine(centroid_1, centroid_2)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in <listcomp>(.0)
     41     word_list2 = [index2word[index] for index in list2]
     42 
---> 43     centroid_1 = np.mean([word2vec.wv[w] for w in word_list1[:depth]], axis=0)
     44     centroid_2 = np.mean([word2vec.wv[w] for w in word_list2[:depth]], axis=0)
     45     cos_sim = 1 - distance.cosine(centroid_1, centroid_2)

AttributeError: 'KeyedVectors' object has no attribute 'wv'
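
Gensim 4 removed the .wv attribute from KeyedVectors (it only exists on full Word2Vec models), but word_embeddings_rbo_centroid.py still indexes word2vec.wv[w]. Until that line is changed to index word2vec directly, a thin shim around the loaded vectors works as a stopgap (a hack of mine, not an official API; it assumes the metric only touches .wv inside the centroid computation):

class KeyedVectorsShim:
    """Restores the .wv indirection that word_embeddings_rbo_centroid expects."""
    def __init__(self, keyed_vectors):
        self.wv = keyed_vectors  # old-style access: shim.wv[word]

rbo_centroid_metric = WordEmbeddingsInvertedRBOCentroid()
rbo_centroid_metric.wv = KeyedVectorsShim(rbo_centroid_metric.wv)  # wrap before scoring
topic_rbo_centroid_score = rbo_centroid_metric.score(model_LDA_15_output)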

KeyError: 'info' - when fetching the Europarl_IT dataset

  • OCTIS version: 1.10.3
  • Python version: 3.9
  • Operating System: Windows

Description

I am trying to fetch the Italian Europarl_IT dataset to train topic models on, but fetching fails with a KeyError.

What I Did


from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset('Europarl_IT')

Traceback (most recent call last):

  Input In [40] in <module>
    dataset.fetch_dataset('Europarl_IT')

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\dataset.py:392 in fetch_dataset
    cache = download_dataset(dataset_name, target_dir=dataset_home, cache_path=cache_path)

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\downloader.py:96 in download_dataset
    metadata["info"]["name"] = dataset_name

KeyError: 'info'
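
Until the downloader handles the missing 'info' entry, one possible workaround (untested on my side; it assumes the repository ships Europarl_IT under preprocessed_datasets/ in the standard corpus.tsv/vocabulary.txt layout) is to clone the OCTIS repository and load the dataset as a custom one:

# git clone https://github.com/MIND-Lab/OCTIS.git
from octis.dataset.dataset import Dataset

dataset = Dataset()
dataset.load_custom_dataset_from_folder("OCTIS/preprocessed_datasets/Europarl_IT")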
