mind-lab / octis

OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)

License: MIT License

Python 58.39% Makefile 0.26% JavaScript 10.77% CSS 1.02% HTML 19.04% Jupyter Notebook 10.53%

topic-modeling latent-dirichlet-allocation latent-semantic-analysis evaluation-metrics natural-language-processing non-negative-matrix-factorization neural-topic-models bayesian-optimization hyperparameter-optimization hyperparameter-tuning

octis's Issues

Not able to run dashboard

Python version: 3.6
Operating System: Windows

Description

I tried:
python octis\dashboard\server.py

and I get this error:

import octis.dashboard.experimentManager as expManager
AttributeError: module 'octis' has no attribute 'dashboard'

Significance score per topics

OCTIS version:
Python version:
Operating System:

Description

I am working on topic modeling for noisy short texts, trying to get topic significance scores per topic.

What I Did

for t in output: #'output' is the model itself 
    
    significance_uniform_score = topic_signif_uniform.score(t)
    print("Topic Significance Uniform Score: "+str(significance_uniform_score))

I get the following error message:

TypeError Traceback (most recent call last)
in
1 # Retrieve metrics score
2
----> 3 for t in output[:]:
4
5 #topic_diversity_score = topic_diversity.score(t)

TypeError: unhashable type: 'slice'

Is it possible to get topic significance score per topic?
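
The `TypeError` above is what a plain Python dict raises when it is sliced, which suggests that `output` is a dict rather than a list of topics. Below is a minimal sketch of iterating per topic instead. It assumes the model output stores its topic word lists under a `"topics"` key; that key name and the `score_topic` helper are assumptions for illustration and are not confirmed in this thread.

```python
# Hypothetical model output: a dict, not a list (the key name "topics" is an assumption).
output = {
    "topics": [
        ["game", "team", "season"],
        ["space", "nasa", "orbit"],
    ]
}

# Slicing a dict fails: TypeError on older Pythons, KeyError on 3.12+
# (where slice objects became hashable).
try:
    output[:]
    slicing_failed = False
except (TypeError, KeyError):
    slicing_failed = True


def score_topic(words):
    """Stand-in for a real per-topic metric: fraction of unique words."""
    return len(set(words)) / len(words)


# Iterate over the topic lists inside the dict, one score per topic.
per_topic_scores = [score_topic(words) for words in output["topics"]]
print(per_topic_scores)
```

Whether a given OCTIS metric can score a single topic this way depends on the metric; the sketch only shows how to walk the topics without slicing the dict.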

Currently, I am comparing different topic modeling algorithms using the OCTIS package. For LSI, I noticed that the OCTIS paper states that (Hofmann, 1999) is implemented, while the GitHub page refers to (Landauer et al., 1998). Could you specify which work your implementation is based on?

Not able to run ProdLDA

Python version: 3.6
Operating System: Windows

Description

I tried:
from octis.models.ProdLDA import ProdLDA

and get the following error:
d:\octisexp\lib\site-packages\octis\models\pytorchavitm\__init__.py in <module>
1 """Init package"""
2
----> 3 from octis.models.pytorchavitm.avitm.avitm_model import AVITM_model

ModuleNotFoundError: No module named 'octis.models.pytorchavitm.avitm'

I created a virtual environment and installed octis using: pip install octis

AttributeError: 'WordEmbeddingsCentroidSimilarity' object has no attribute 'binary'

OCTIS version: 1.9.0
Python version: 3.9.7
Operating System:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"

Description

I'm not familiar with GitHub, so I apologize if I come across as rude.
Thank you for publishing such great work.

It seems that the constructors of WordEmbeddingsPairwiseSimilarity and WordEmbeddingsCentroidSimilarity do not set self.binary.
Therefore, I could not use any pretrained word embeddings other than the default.

What I Did

In [2]: from octis.evaluation_metrics import similarity_metrics

In [3]: dummy_kv_path = "/workdir/dummy_kv.txt"

In [4]: similarity_metrics.WordEmbeddingsPairwiseSimilarity(word2vec_path=dummy_kv_path)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-c02dac4f77ab> in <module>
----> 1 similarity_metrics.WordEmbeddingsPairwiseSimilarity(word2vec_path=dummy_kv_path)

~/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/octis/evaluation_metrics/similarity_metrics.py in __init__(self, word2vec_path, topk)
     71             self.wv = api.load('word2vec-google-news-300')
     72         else:
---> 73             self.wv = KeyedVectors.load_word2vec_format( word2vec_path, binary=self.binary)
     74 
     75         self.topk = topk

AttributeError: 'WordEmbeddingsPairwiseSimilarity' object has no attribute 'binary'

In [5]: similarity_metrics.WordEmbeddingsCentroidSimilarity(word2vec_path=dummy_kv_path)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-1f1c772b67de> in <module>
----> 1 similarity_metrics.WordEmbeddingsCentroidSimilarity(word2vec_path=dummy_kv_path)

~/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/octis/evaluation_metrics/similarity_metrics.py in __init__(self, word2vec_path, topk)
    115             self.wv = api.load('word2vec-google-news-300')
    116         else:
--> 117             self.wv = KeyedVectors.load_word2vec_format(word2vec_path, binary=self.binary)
    118         self.topk = topk
    119 

AttributeError: 'WordEmbeddingsCentroidSimilarity' object has no attribute 'binary'
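
The tracebacks show `self.binary` being read in `__init__` without ever being set. Below is a minimal sketch of one way to fix that pattern: accept `binary` as a constructor argument and store it before the loader runs. The class name, the `binary=False` default, and the injected `load_vectors` callable (standing in for `KeyedVectors.load_word2vec_format`) are assumptions for illustration, not the library's actual code.

```python
class EmbeddingSimilaritySketch:
    """Illustrative stand-in for the metric constructors in the traceback."""

    def __init__(self, word2vec_path=None, binary=False, topk=10, load_vectors=None):
        # Store the flag first so the loading branch below can read it.
        self.binary = binary
        self.topk = topk
        if word2vec_path is None:
            self.wv = None  # the real code loads a default model here
        else:
            # load_vectors is a hypothetical hook; in the real class this call
            # is KeyedVectors.load_word2vec_format(word2vec_path, binary=...).
            self.wv = load_vectors(word2vec_path, binary=self.binary)


def fake_loader(path, binary):
    """Toy loader that records its arguments instead of reading a file."""
    return {"path": path, "binary": binary}


metric = EmbeddingSimilaritySketch(
    word2vec_path="/workdir/dummy_kv.txt", binary=False, load_vectors=fake_loader
)
print(metric.wv)
```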

Feature request: add a preprocessing option to choose a custom text preprocessor

The current preprocessing step is a good starting point, but it could be improved by letting users define a custom preprocessing pipeline for cases the built-in steps do not yet handle.

Citation for NeuralLDA and ProdLDA

Hi, thanks for sharing this amazing work!
I think the current citations for NeuralLDA and ProdLDA are for the repo only.
These models are from the paper Autoencoding Variational Inference for Topic Models (Srivastava and Sutton, 2017). Please consider citing the paper as well. Thanks!

Update format for partition input in ReadMe

OCTIS version: 1.2.0
Python version: Python 3.8.3
Operating System: Linux

According to the README, values in the partition column of a custom dataset should be 'training', 'validation', or 'test', but I can't get these to yield a partition:

Make sure that the dataset is in the following format:

corpus file: a .tsv file (tab-separated) that contains up to three columns, i.e. the document, the partition, and the label associated with the document (optional).
vocabulary: a .txt file where each line represents a word of the vocabulary

The partition can be "training", "test" or "validation". An example of dataset can be found here: sample_dataset_.

However, it seems the right format is 'train', 'val', 'test', which does work for me - just passing this on to make the ReadMe clearer.

    def load_custom_dataset_from_folder(self, path):
        """
        Loads all the dataset from a folder
        Parameters
        ----------
        path : path of the folder to read
        """
        self.dataset_path = path
        try:
            if exists(self.dataset_path + "/metadata.json"):
                self._load_metadata(self.dataset_path + "/metadata.json")
            else:
                self.__metadata = dict()
            df = pd.read_csv(self.dataset_path + "/corpus.tsv", sep='\t', header=None)
            if len(df.keys()) > 1:
                df[1] = df[1].replace("train", "a_train")
                df[1] = df[1].replace("val", "b_val")
                df = df.sort_values(1).reset_index(drop=True)

                self.__metadata['last-training-doc'] = len(df[df[1] == 'a_train'])
                self.__metadata['last-validation-doc'] = len(df[df[1] == 'b_val']) + len(df[df[1] == 'a_train'])
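
The excerpt relabels only the exact values `train` and `val` before sorting, so other spellings never reach the counters. A small pure-Python sketch of that logic follows; the function name is hypothetical, and it mimics the pandas steps rather than calling them.

```python
from collections import Counter


def partition_boundaries(labels):
    """Mimic the exact-match relabelling and counting from the excerpt above."""
    relabel = {"train": "a_train", "val": "b_val"}
    # Exact-value replacement: "training" is NOT rewritten to "a_train".
    keys = [relabel.get(label, label) for label in labels]
    counts = Counter(keys)
    last_training_doc = counts["a_train"]
    last_validation_doc = counts["a_train"] + counts["b_val"]
    return last_training_doc, last_validation_doc


print(partition_boundaries(["train", "test", "val", "train"]))
print(partition_boundaries(["training", "test", "validation"]))
```

With the README spellings both boundaries stay at zero, which matches the report that no partition is produced.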

use_partitions error

OCTIS version: 1.10.2
Python version: 3.6.5
Operating System: Linux

Description

Hello!
I used the CTM model with parameter use_partitions=False and got
Exception: Wait! BoW and Contextual Embeddings have different sizes! You might want to check if the BoW preparation method has removed some documents.

What I Did

dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")

model = CTM(num_topics=25, num_epochs=100, inference_type='combined', use_partitions=False)
output = model.train_model(dataset)
npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_npmi')
npmi_score = npmi.score(output)

Exception raised when running tutorial 'How to optimize the hyperparameters of a neural topic model (CTM on M10)'

OCTIS version: 1.2.0
Python version: 3.8.3
Operating System: Linux

Hi Octis team,

When I run your tutorial on my local server (jupyter notebook) I get an exception. I get the same exception when training a single model (no hypersearch) on custom data.

I have attempted to locate the problem, but when I reproduce the individual steps, everything runs fine. Otherwise I would be happy to make a pull request, but I am not sure what is going on here...

One odd observation: while CTM.load_bert_data(bert_train_path, train, bert_model) runs prior to the CTMDataset(x_train.toarray(), b_train, idx2token) in preprocess (see below), and bert_embeddings_from_list from /models/contextualized_topic_models/utils/data_preparation.py/ defaults to 'show_progress_bar=True', the exception is thrown before any progress bar.

    def preprocess(vocab, train, bert_model, test=None, validation=None,
                   bert_train_path=None, bert_test_path=None, bert_val_path=None):
        vocab2id = {w: i for i, w in enumerate(vocab)}
        vec = CountVectorizer(
            vocabulary=vocab2id, token_pattern=r'(?u)\b\w+\b')
        entire_dataset = train.copy()
        if test is not None:
            entire_dataset.extend(test)
        if validation is not None:
            entire_dataset.extend(validation)

        vec.fit(entire_dataset)
        idx2token = {v: k for (k, v) in vec.vocabulary_.items()}

        x_train = vec.transform(train)
        b_train = CTM.load_bert_data(bert_train_path, train, bert_model)

        train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
        input_size = len(idx2token.keys())

Tutorial that yields the exception

from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical, Integer
from octis.evaluation_metrics.coherence_metrics import Coherence

dataset = Dataset()
dataset.fetch_dataset("M10")

model = CTM(num_topics=10, num_epochs=30, inference_type='zeroshot', bert_model="bert-base-nli-mean-tokens")

npmi = Coherence(texts=dataset.get_corpus())

search_space = {"num_layers": Categorical({1, 2, 3}), 
                "num_neurons": Categorical({100, 200, 300}),
                "activation": Categorical({'sigmoid', 'relu', 'softplus'}), 
                "dropout": Real(0.0, 0.95)
}

optimization_runs=30
model_runs=1

optimizer=Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    save_path='results/test_ctm//')

Current call:  0
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-46-7718f92a8020> in <module>
      1 optimizer=Optimizer()
----> 2 optimization_result = optimizer.optimize(
      3     model, dataset, npmi, search_space, number_of_call=optimization_runs,
      4     model_runs=model_runs, save_models=True,
      5     extra_metrics=None, # to keep track of other metrics

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
    158 
    159         # Perform Bayesian Optimization
--> 160         results = self._optimization_loop(opt)
    161 
    162         return results

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
    283             else:
    284                 next_x = opt.ask()
--> 285                 f_val = self._objective_function(next_x)
    286 
    287             # Update the opt using (next_x,f_val)

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _objective_function(self, hyperparameter_values)
    214 
    215             # Prepare model
--> 216             model_output = self.model.train_model(self.dataset, params,
    217                                                   self.topk)
    218             # Score of the model

~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in train_model(self, dataset, hyperparameters, top_words)
     80             self.vocab = dataset.get_vocabulary()
     81             self.X_train, self.X_test, self.X_valid, input_size = \
---> 82                 self.preprocess(self.vocab, data_corpus_train, test=data_corpus_test,
     83                                 validation=data_corpus_validation,
     84                                 bert_train_path=self.hyperparameters['bert_path'] + "_train.pkl",

~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in preprocess(vocab, train, bert_model, test, validation, bert_train_path, bert_test_path, bert_val_path)
    178         b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
    179 
--> 180         train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
    181         input_size = len(idx2token.keys())
    182 

~/anaconda3/lib/python3.8/site-packages/octis/models/contextualized_topic_models/datasets/dataset.py in __init__(self, X, X_bert, idx2token)
     15         """
     16         if X.shape[0] != len(X_bert):
---> 17             raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
     18                             "You might want to check if the BoW preparation method has removed some documents. ")
     19 

Exception: Wait! BoW and Contextual Embeddings have different sizes! You might want to check if the BoW preparation method has removed some documents.

My reproduction, which works fine:

def preprocess(vocab, train, bert_model, test=None, validation=None,
               bert_train_path=None, bert_test_path=None, bert_val_path=None):
    vocab2id = {w: i for i, w in enumerate(vocab)}
    vec = CountVectorizer(
        vocabulary=vocab2id, token_pattern=r'(?u)\b\w+\b')
    entire_dataset = train.copy()
    if test is not None:
        entire_dataset.extend(test)
    if validation is not None:
        entire_dataset.extend(validation)

    vec.fit(entire_dataset)
    idx2token = {v: k for (k, v) in vec.vocabulary_.items()}

    x_train = vec.transform(train)
    b_train = bert_embeddings_from_list(train, bert_model)

    train_data = CTMDataset(x_train.toarray(), b_train, idx2token)
    input_size = len(idx2token.keys())

    if test is not None and validation is not None:
        x_test = vec.transform(test)
        b_test = bert_embeddings_from_list(test, bert_model)
        test_data = CTMDataset(x_test.toarray(), b_test, idx2token)

        x_valid = vec.transform(validation)
        b_val = bert_embeddings_from_list(validation, bert_model)
        valid_data = CTMDataset(x_valid.toarray(), b_val, idx2token)
        return train_data, test_data, valid_data, input_size
    if test is None and validation is not None:
        x_valid = vec.transform(validation)
        b_val = bert_embeddings_from_list(validation, bert_model)
        valid_data = CTMDataset(x_valid.toarray(), b_val, idx2token)
        return train_data, valid_data, input_size
    if test is not None and validation is None:
        x_test = vec.transform(test)
        b_test = bert_embeddings_from_list(test, bert_model)
        test_data = CTMDataset(x_test.toarray(), b_test, idx2token)
        return train_data, test_data, input_size
    if test is None and validation is None:
        return train_data, input_size

def bert_embeddings_from_list(texts, sbert_model_to_load="bert-base-nli-mean-tokens", batch_size=100):
    """
    Creates SBERT Embeddings from a list
    """
    model = SentenceTransformer(sbert_model_to_load)
    return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))

import torch
from torch.utils.data import Dataset
import scipy.sparse


class CTMDataset(Dataset):

    """Class to load BOW dataset."""

    def __init__(self, X, X_bert, idx2token):
        """
        Args
            X : array-like, shape=(n_samples, n_features)
                Document word matrix.
        """
        if X.shape[0] != len(X_bert):
            raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
                            "You might want to check if the BoW preparation method has removed some documents. ")

        self.X = X
        self.X_bert = X_bert
        self.idx2token = idx2token

    def __len__(self):
        """Return length of dataset."""
        return self.X.shape[0]

    def __getitem__(self, i):
        """Return sample from dataset at index i."""
        if type(self.X[i]) == scipy.sparse.csr.csr_matrix:
            X = torch.FloatTensor(self.X[i].todense())
            X_bert = torch.FloatTensor(self.X_bert[i])
        else:
            X = torch.FloatTensor(self.X[i])
            X_bert = torch.FloatTensor(self.X_bert[i])

        return {'X': X, 'X_bert': X_bert}

from octis.models.CTM import CTM
from octis.dataset.dataset import Dataset
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical, Integer
from octis.evaluation_metrics.coherence_metrics import Coherence

dataset = Dataset()
dataset.fetch_dataset("M10")

train, validation, test = dataset.get_partitioned_corpus(use_validation=True)

data_corpus_train = [' '.join(i) for i in train]
data_corpus_test = [' '.join(i) for i in test]
data_corpus_validation = [' '.join(i) for i in validation]

vocab = dataset.get_vocabulary()
X_train, X_test, X_valid, input_size = \
    preprocess(vocab, data_corpus_train, test=data_corpus_test,
                validation=data_corpus_validation,
                bert_train_path=""+"_train.pkl",
                bert_test_path=""+"_test.pkl",
                bert_val_path=""+"_val.pkl",
                bert_model='bert-base-nli-mean-tokens')

Batches: 100%
59/59 [00:08<00:00, 7.10it/s]

Batches: 100%
13/13 [00:01<00:00, 6.62it/s]

Batches: 100%
13/13 [00:00<00:00, 28.11it/s]
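
One difference between the failing call path and the reproduction is that the reproduction calls `bert_embeddings_from_list` directly, while the library goes through `CTM.load_bert_data` with a `.pkl` path. If that function reuses embeddings cached on disk (an assumption, not confirmed in this thread), a stale file from another dataset would produce this kind of size mismatch without showing a progress bar. Below is a hedged sketch of a length check around such a cache; `load_or_compute` and its arguments are hypothetical.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, texts, compute):
    """Reuse cached embeddings only if they match the number of texts."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            cached = pickle.load(fh)
        if len(cached) == len(texts):
            return cached
        # Stale cache (different corpus size): fall through and recompute.
    embeddings = compute(texts)
    with open(cache_path, "wb") as fh:
        pickle.dump(embeddings, fh)
    return embeddings


def toy_embed(texts):
    """Toy embedding: one 1-d vector per document (its character length)."""
    return [[float(len(t))] for t in texts]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "_train.pkl")
    first = load_or_compute(path, ["a b", "c d e"], toy_embed)        # computed
    second = load_or_compute(path, ["a b", "c d e", "f"], toy_embed)  # recomputed
    print(len(first), len(second))
```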

Adding the perplexity evaluation metric

Great work!

But for evaluating topic models, how about adding the perplexity metric, which is a common way to evaluate unsupervised language/topic models?

NMF Model Cannot Be Imported Into Colab

OCTIS version: 1.8.0

Description

Tried to import NMF model into Colab

What I Did

from octis.models.NMF import NMF

ImportError                               Traceback (most recent call last)
<ipython-input-17-96f0cef9c8fa> in <module>()
----> 1 from octis.models.NMF import NMF

/usr/local/lib/python3.7/dist-packages/octis/models/NMF.py in <module>()
      1 from octis.models.model import AbstractModel
      2 import numpy as np
----> 3 from gensim.models import nmf
      4 import gensim.corpora as corpora
      5 import octis.configuration.citations as citations

ImportError: cannot import name 'nmf' from 'gensim.models' (/usr/local/lib/python3.7/dist-packages/gensim/models/__init__.py)
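
The `ImportError` means the installed gensim has no `gensim.models.nmf` submodule, which usually points to a gensim release that predates it. Upgrading gensim in the Colab runtime is the likely remedy, though the thread does not confirm the installed version. A small sketch for checking whether a submodule can be located before relying on it follows; `submodule_available` is a hypothetical helper.

```python
import importlib.util


def submodule_available(dotted_name):
    """Return True if the module can be located (parent packages get imported)."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # Raised when a parent package is missing or is not a package.
        return False


# Standard-library names are used here so the check runs anywhere;
# in the situation above the name to test would be "gensim.models.nmf".
print(submodule_available("json.decoder"))
print(submodule_available("json.no_such_submodule"))
```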

AttributeError: module 'octis' has no attribute 'configuration'

OCTIS version: 1.3.0
Python version: 3.6
Operating System: Windows

Description

Tried to run the server and create an experiment

What I Did

127.0.0.1 - - [27/Apr/2021 20:06:58] "POST /selectPath HTTP/1.1" 200 -
{'partitioning': False, 'path': 'D:/OctisResults', 'dataset': '20NewsGroup', 'model': {'name': 'LDA', 'parameters': {'alpha': 0.1, 'eta': 0.1, 'iterations': 50, 'passes': 1}}, 'optimization': {'iterations': 5, 'model_runs': 3, 'surrogate_model': 'GP', 'n_random_starts': 3, 'acquisition_function': 'LCB', 'search_spaces': {'num_topics': {'low': 2, 'high': 20}}}, 'optimize_metrics': [{'name': 'Coherence', 'parameters': {'measure': 'c_npmi', 'texts': 'use dataset texts', 'topk': 10}}], 'track_metrics': [{'name': 'Coherence', 'parameters': {'measure': 'c_npmi', 'texts': 'use dataset texts', 'topk': 10}}]}
127.0.0.1 - - [27/Apr/2021 20:07:20] "POST /startExperiment HTTP/1.1" 200 -
starting OctExpnewsgroup
Process Process-2:1:
Traceback (most recent call last):
File "D:\octisExp\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "D:\octisExp\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "D:\octisExp\lib\site-packages\octis\dashboard\queueManager.py", line 260, in _execute_and_update
startExperiment(toRun[running[0]])
File "D:\octisExp\lib\site-packages\octis\dashboard\experimentManager.py", line 136, in startExperiment
model_class = importModel(parameters["model"]["name"])
File "D:\octisExp\lib\site-packages\octis\dashboard\experimentManager.py", line 64, in importModel
model = importClass(model_name, model_name, module_path)
File "D:\octisExp\lib\site-packages\octis\dashboard\experimentManager.py", line 46, in importClass
spec.loader.exec_module(module)
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "D:\octisExp\lib\site-packages\octis\models\LDA.py", line 5, in <module>
import octis.configuration.citations as citations
AttributeError: module 'octis' has no attribute 'configuration'

Gensim Coherence Pipeline only for English texts?

Hi all,

thank you for this amazing library! I'll definitely consider using it for my master’s thesis. I have a question regarding coherence scores and non-English texts, and I hope it is okay to ask this here.

I saw that you're using the gensim coherence pipeline, which is based on Röder et al. (2015). It is not clear to me from this paper whether they used only English Wikipedia or a multilingual Wikipedia as the reference corpus for calculating the coherence measures. So my question is whether the gensim coherence pipeline is suitable for evaluating non-English texts (e.g., German), or whether it would be better to use other approaches such as TC-W2V with a custom corpus.

Regards
Luca
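
For context on the language question: NPMI-style coherence is computed from word co-occurrence counts in whatever reference texts are supplied, so nothing in the formula itself is English-specific. Below is a simplified sketch using document-level co-occurrence. gensim's pipeline uses its own segmentation and windowing, so this illustrates the idea and is not a reimplementation; the toy German corpus and the function name are made up.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over word pairs, using document-level co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w_i, w_j in combinations(topic_words, 2):
        p_ij = prob(w_i, w_j)
        if p_ij == 0.0:
            scores.append(-1.0)  # never co-occur: lower limit of NPMI
        elif p_ij == 1.0:
            scores.append(1.0)   # co-occur in every document: upper limit
        else:
            pmi = math.log(p_ij / (prob(w_i) * prob(w_j)))
            scores.append(pmi / -math.log(p_ij))
    return sum(scores) / len(scores)


corpus = [
    ["haus", "garten"],
    ["haus", "garten"],
    ["auto", "strasse"],
    ["auto", "haus"],
]
print(npmi_coherence(["haus", "garten"], corpus))
print(npmi_coherence(["garten", "strasse"], corpus))
```

The quality of the score then depends on how well the reference corpus and its tokenization fit the target language, not on the formula.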

Integrating D-ETM

Dear @silviatti ,

Related to issue #1

I would like to resubmit the feature request to integrate the DETM into your OCTIS suite. It appears the person who originally proposed issue #1 back in April 2021 has lost interest, or no longer wants to pursue this.

Would it be possible for you to complete the integration?

With kindest regards
Luke

Make Stop Word list independent from the installation directory

OCTIS version: '1.5.0'
Python version: 3.7
Operating System: Mac

Description

I am trying to preprocess a custom corpus, and when I try to remove stop words, here is what I get.

What I Did

preprocessor = Preprocessing(vocabulary=None, max_features=None, 
                             remove_punctuation=True, punctuation=string.punctuation,
                             lemmatize=True, stopword_list='english',
                             min_chars=1, min_words_docs=0)
# preprocess
dataset = preprocessor.preprocess_dataset(documents_path=corpus_path)

And I am getting the following error.

ndexes)
    101             else:
    102                 if 'english' in stopword_list:
--> 103                     with open('octis/preprocessing/stopwords/english.txt') as fr:
    104                         stopwords = [line.strip() for line in fr.readlines()]
    105                         assert stopword_list == language

FileNotFoundError: [Errno 2] No such file or directory: 'octis/preprocessing/stopwords/english.txt'

More context: I am using a Jupyter notebook.

A possible solution: it may be useful to use pathlib to handle this type of path.

If I fix it locally, I can raise a PR soon.
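
The relative path in the traceback is resolved against the current working directory, which is why it breaks inside a notebook. Below is a minimal sketch of the pathlib approach suggested above, anchoring the lookup to the package directory instead. The function names are hypothetical, and the `stopwords/<language>.txt` layout is taken from the path in the traceback.

```python
from pathlib import Path
import tempfile


def stopwords_file(package_dir, language):
    """Build the stop word file path relative to the package directory."""
    return Path(package_dir) / "stopwords" / f"{language}.txt"


def load_stopwords(package_dir, language):
    """Read one stop word per line, independent of the working directory."""
    with open(stopwords_file(package_dir, language), encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


# Demo with a temporary directory standing in for the installed package.
# Inside a real module, package_dir would be Path(__file__).resolve().parent.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "stopwords").mkdir()
    (Path(tmp) / "stopwords" / "english.txt").write_text("the\nand\n", encoding="utf-8")
    words = load_stopwords(tmp, "english")
    print(words)
```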

CTM training fails.

OCTIS version: 1.8.0
Python version: 3.8.10
Operating System: Ubuntu 20.04.02

Description

CTM training fails.

What I Did

dataset = Dataset()
dataset.load_custom_dataset_from_folder(DATASET_PATH)
model = CTM(num_topics=TOPIC_SIZE)
model_output = model.train_model(dataset)
save_model_output(model_output, MODEL_OUTPUT_PATH)
save_model_output(model, MODEL_PATH)

The following error message was displayed.

Batches:  84%|████████████████████████████████████████████████████████████████████████████████████████▌                 | 21790/26093 [59:43<11:47,  6.08it/s]
Traceback (most recent call last):
  File "train.py", line 62, in <module>
    model = ProdLDA(num_topics=TOPIC_SIZE)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 95, in train_model
    x_train, x_test, x_valid, input_size = self.preprocess(
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 175, in preprocess
    b_train = CTM.load_bert_data(bert_train_path, train, bert_model)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/CTM.py", line 208, in load_bert_data
    bert_ouput = bert_embeddings_from_list(texts, bert_model)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/contextualized_topic_models/utils/data_preparation.py", line 35, in bert_embeddings_from_list
    return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))
  File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/SentenceTransformer.py", line 160, in encode
    out_features = self.forward(features)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sentence_transformers/models/Transformer.py", line 51, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 991, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 582, in forward
    layer_outputs = layer_module(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 470, in forward
    self_attention_outputs = self.attention(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 401, in forward
    self_outputs = self.self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py", line 305, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

KeyError: 'info' - when fetching the Europarl_IT dataset

OCTIS version: 1.10.3
Python version: 3.9
Operating System: Windows

Description

I am trying to fetch the Italian Europarl_IT dataset to train topic models on. However, this does not work.

What I Did


from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset('Europarl_IT')

Traceback (most recent call last):

  Input In [40] in <module>
    dataset.fetch_dataset('Europarl_IT')

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\dataset.py:392 in fetch_dataset
    cache = download_dataset(dataset_name, target_dir=dataset_home, cache_path=cache_path)

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\downloader.py:96 in download_dataset
    metadata["info"]["name"] = dataset_name

KeyError: 'info'
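
The failing line assumes the downloaded metadata already has an `info` entry. Below is a small sketch of a defensive version using `dict.setdefault`. Whether a missing `info` block should be tolerated or treated as a broken dataset is a decision for the maintainers, and `set_dataset_name` is a hypothetical helper.

```python
def set_dataset_name(metadata, dataset_name):
    """Record the dataset name, creating the 'info' section if it is absent."""
    metadata.setdefault("info", {})["name"] = dataset_name
    return metadata


print(set_dataset_name({}, "Europarl_IT"))
print(set_dataset_name({"info": {"language": "it"}}, "Europarl_IT"))
```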

ETM - Possibility of using KeyedVectors input for pre-trained W2V embeddings

OCTIS version: 1.9.0
Python version: 3.7.6
Operating System: Ubuntu 20.04 LTS

Description

Hi, this is more of a question than anything else. I've seen that for ETM model training, we must pass an embeddings path corresponding to a pickled file. However, I need to run ETM with rather large embeddings. Is there any intent to implement a gensim.models.KeyedVectors-based (or similar) embeddings input for this model? I've implemented something like that for an ETM package of mine, but yours has everything I need to run model optimization. Would a PR on this matter be accepted?

Anyway, cheers for the nice work, this package is really great!

What I Did

Took a look here.

The hyperparameter clip is initialized twice

OCTIS/octis/models/DETM.py

Line 39 in a51540d

self.hyperparameters['clip'] = int(clip)

In line 35, the hyperparameter clip is already initialized.

How do I load a dataset? How to do multi-label classification with OCTIS?

OCTIS version: any
Python version: any
Operating System: any

Description

I am trying to evaluate topic model algorithms with a provided dataset, without success.

What I Did

I am trying to run the following code:

from octis.evaluation_metrics.classification_metrics import AccuracyScore
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA


dataset = Dataset(corpus=X, labels=y)
model = LDA(num_topics=5, alpha=0.1)

acc = AccuracyScore(dataset)
output = model.train_model(dataset)

Where X is my text data and y is the topics (multilabel) for the given text. The last line returns this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-29-99a4fc73752b> in <module>
      1 acc = AccuracyScore(dataset)
----> 2 output = model.train_model(dataset)

~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/models/LDA.py in train_model(self, dataset, hyperparams, top_words)
    164 
    165         if self.use_partitions:
--> 166             train_corpus, test_corpus = dataset.get_partitioned_corpus(use_validation=False)
    167         else:
    168             train_corpus = dataset.get_corpus()

~/Projects/nlp/topic-modeling/venv/lib/python3.8/site-packages/octis/dataset/dataset.py in get_partitioned_corpus(self, use_validation)
     41     # Partitioned Corpus getter
     42     def get_partitioned_corpus(self, use_validation=True):
---> 43         last_training_doc = self.__metadata["last-training-doc"]
     44         # handle the exception if last_validation_doc is not defined, returning
     45         # an empty validation set

TypeError: 'NoneType' object is not subscriptable
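
The traceback suggests the Dataset was built without metadata, so the partition lookup indexes into None. Below is a minimal sketch of a defensive accessor. TinyDataset is a hypothetical stand-in, not part of OCTIS; only the "last-training-doc" key is taken from the traceback above.

```python
class TinyDataset:
    """Hypothetical stand-in for a dataset holding a corpus and optional metadata."""

    def __init__(self, corpus, metadata=None):
        self.corpus = corpus
        self.metadata = metadata

    def get_partitioned_corpus(self):
        # Without metadata there is no recorded split point, so fall back
        # to treating the whole corpus as training data.
        if not self.metadata or "last-training-doc" not in self.metadata:
            return self.corpus, []
        last_training_doc = self.metadata["last-training-doc"]
        return self.corpus[:last_training_doc], self.corpus[last_training_doc:]


docs = [["topic", "model"], ["word", "embedding"], ["neural", "network"]]
train_all, test_none = TinyDataset(docs).get_partitioned_corpus()
train, test = TinyDataset(docs, {"last-training-doc": 2}).get_partitioned_corpus()
```

Other reports on this page turn partitioning off with model.partitioning(False) or use_partitions=False. Judging from the traceback, that takes the get_corpus() branch and skips this lookup.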

Feature request : Adding D-ETM and other dynamic topic model approaches

Nice work here...

I haven't yet played with the code.

I was wondering whether this tool can work with dynamic topic model approaches like this one.

If yes, how can we integrate it?

Evaluate 3 different topic modeling algorithms

OCTIS version:
Python version: 3.7
Operating System: linux

Description

I am a PhD candidate and I need to evaluate the performance of three different topic modeling algorithms: LDA, LSI, and BERTopic (LDA and LSI were trained using the Gensim package).
Which relevance metrics should I use apart from the coherence score? I would like to include in my paper a table or graph that evaluates the models in terms of accuracy (coherence score) and relevance of topics (should I use the topic diversity metric?).
Thank you


Unexpected keyword argument 'df_max_freq' when max_features is used in Preprocessing

Setting max_features in Preprocessing throws the following error:
TypeError: __init__() got an unexpected keyword argument 'df_max_freq'

Example code

import string
from octis.preprocessing.preprocessing import Preprocessing

# Initialize preprocessing
preprocessor = Preprocessing(vocabulary=None, max_features=None,
                             remove_punctuation=True, punctuation=string.punctuation,
                             lemmatize=True, stopword_list='english',
                             min_chars=1, min_words_docs=0, max_df=0.9, min_df=0.1)
# preprocess
dataset = preprocessor.preprocess_dataset(documents_path=r'dataset.csv')

I'm happy to help fix the issue.

OCTIS is an excellent library - keep up the great work.

A question about implementation of KLDivergence

Hi and thanks for the great work!

OCTIS/octis/evaluation_metrics/diversity_metrics.py

Line 221 in 2f22350

kl_div += _LOR(beta[i], beta[j])

This line calls the function _LOR, but I think it should call the function _KL instead.

Am I right?
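
For reference, here is a minimal sketch of the textbook Kullback-Leibler divergence between two discrete distributions, averaged over all ordered pairs of topics. The function names and the smoothing constant eps are assumptions for illustration; this is not the OCTIS implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            # eps guards against division by zero when q_i is 0.
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


def mean_pairwise_kl(beta):
    """Average KL divergence over all ordered pairs of distinct topics."""
    n = len(beta)
    if n < 2:
        return 0.0
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(kl_divergence(beta[i], beta[j]) for i, j in pairs) / len(pairs)


beta = [[0.5, 0.5], [0.9, 0.1]]
score = mean_pairwise_kl(beta)
```

KL divergence is asymmetric, which is why the sketch visits both (i, j) and (j, i).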

how to train a model using the whole corpus (ignoring partitions)

Dear Silvia,

Love your work!

I have a silly question. How do you train a model using the entire corpus (ignoring partitions)?

Thank you for your help.

Luke

No self.vocab when not using partitions for the CTM-class

OCTIS version: 1.2.0
Python version: 3.8.3
Operating System: Linux

Description

When performing a hyperparameter search with the CTM model and use_partitions=False, I get the error: AttributeError: 'CTM' object has no attribute 'vocab'.

I believe moving line 94 in CTM.py [self.vocab = dataset.get_vocabulary()] prior to the self.use_partitions if-statement in line 87 would solve the problem.

Current call:  0
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-63-7718f92a8020> in <module>
      1 optimizer=Optimizer()
----> 2 optimization_result = optimizer.optimize(
      3     model, dataset, npmi, search_space, number_of_call=optimization_runs,
      4     model_runs=model_runs, save_models=True,
      5     extra_metrics=None, # to keep track of other metrics

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
    158 
    159         # Perform Bayesian Optimization
--> 160         results = self._optimization_loop(opt)
    161 
    162         return results

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
    283             else:
    284                 next_x = opt.ask()
--> 285                 f_val = self._objective_function(next_x)
    286 
    287             # Update the opt using (next_x,f_val)

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _objective_function(self, hyperparameter_values)
    214 
    215             # Prepare model
--> 216             model_output = self.model.train_model(self.dataset, params,
    217                                                   self.topk)
    218             # Score of the model

~/anaconda3/lib/python3.8/site-packages/octis/models/CTM.py in train_model(self, dataset, hyperparameters, top_words)
     89             data_corpus = [' '.join(i) for i in dataset.get_corpus()]
     90             self.X_train, input_size = self.preprocess(
---> 91                 self.vocab, train=data_corpus, bert_train_path=self.hyperparameters['bert_path'] + "_train.pkl",
     92                 bert_model=self.hyperparameters["bert_model"])
     93 

AttributeError: 'CTM' object has no attribute 'vocab'

Despite looking at some demos, I am still not capable of making pull requests, but wanted to let you know nonetheless.

Best,
Thyge

Loading unprocessed corpus documents with CTM and Optimizer

OCTIS version: 1.10.0
Python version: 3.7.6
Operating System: Ubuntu 20.04 LTS

Description

I've asked this at #29, but decided to open a new issue because this is a more specific scenario. So, here it is:

Hi @silviatti. So, if I understand correctly, there is currently no way to load the unprocessed corpus documents into OCTIS' CTM while using its optimizer, similar to what the standalone CTM README does?

Originally posted by @lffloyd in #29 (comment)

What I Did

I had a look at the docs.

Hyperparameter Tuning of num_topics Errors

I get the following error when trying to do hyperparameter optimization on num_topics:

/usr/local/lib/python3.7/dist-packages/octis/models/contextualized_topic_models/models/ctm.py in __init__(self, input_size, bert_input_size, inference_type, num_topics, model_type, hidden_sizes, activation, dropout, learn_priors, batch_size, lr, momentum, solver, num_epochs, num_samples, reduce_on_plateau, topic_prior_mean, topic_prior_variance, num_data_loader_workers)
     45 "input_size must by type int > 0."
     46 assert isinstance(num_topics, int) and input_size > 0,
---> 47 "num_topics must by type int > 0."
     48 assert model_type in ['LDA', 'prodLDA'],
     49 "model must be 'LDA' or 'prodLDA'."

AssertionError: num_topics must by type int > 0.

What I Did

ctm_model = CTM(model_type='prodLDA', bert_model="stsb-roberta-base-v2", inference_type="combined")

search_space = {"num_topics": Integer(5, 10, 15, 20),
                "num_layers": Categorical({1, 2, 3}), 
                "num_neurons": Categorical({100, 200})
}

optimization_runs=20
model_runs=1

optimizer=Optimizer()
optimization_result = optimizer.optimize(
    ctm_model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    save_path='results/test_ctm/')

optimization_result.save_to_csv("results_ctm.csv")

Happy to assist with fixing this issue.
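
One plausible cause, offered as an assumption rather than a diagnosis: the optimizer may pass num_topics as a NumPy integer. That is not an instance of the built-in int, so the isinstance check in the assertion fails. Below is a minimal sketch of coercing integer-like values before such a check. FakeNumpyInt is a hypothetical stand-in so the example runs without NumPy.

```python
import operator


class FakeNumpyInt:
    """Hypothetical integer-like value that is not a built-in int."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        # Integer-like types expose their exact value through __index__.
        return self._value


def as_builtin_int(value):
    """Convert an integer-like value to int; floats and strings raise TypeError."""
    return operator.index(value)


sampled = FakeNumpyInt(15)
num_topics = as_builtin_int(sampled)
passes_check = isinstance(num_topics, int) and num_topics > 0
```

Separately, skopt's Integer takes a lower and an upper bound as its first two arguments, so Integer(5, 10, 15, 20) does not describe the set {5, 10, 15, 20}. A Categorical, as used for the other two dimensions, would express that set.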

Control of prior distribution

If you are doing Bayesian optimization, it is probably a good idea to give the control of prior probability distribution to the user. Any plans to add that to your API?

Missing info for custom data yields key error in hypersearch

OCTIS version: 1.2.0
Python version: 3.8.3
Operating System: Linux

Hi Octis Team,

Thanks for making this available!

When providing a custom dataset for an LDA hyperparameter search, I get: KeyError: 'info'

This is not the case when I run a single model (no hypersearch), nor when I fetch the M10 dataset and use this.

If I manually add an info entry with a name for the dataset to the metadata attribute of the custom dataset, the hyperparameter search works fine.

Perhaps the required metadata could be auto-filled when providing custom data?

Best,
Thyge

Code and traceback:

# Load modules
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Categorical

# Load custom dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder(str(_dp))

# Initiate model
model = LDA(alpha=0.5, eta=0.5)  

# Define search space
search_space = {"num_topics": Categorical({15, 20, 25, 30})}

# Set number of runs
optimization_runs=15
model_runs=1 

# Define evaluation metric
npmi = Coherence(texts=dataset.get_corpus())

# Hypersearch
optimizer=Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    save_path=str(_models / 'Octis' / 'LDA'))


Current call:  0
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-225-087bc04aa55a> in <module>
      1 # Hypersearch
      2 optimizer=Optimizer()
----> 3 optimization_result = optimizer.optimize(
      4     model, dataset, npmi, search_space, number_of_call=optimization_runs,
      5     model_runs=model_runs, save_models=True,

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in optimize(self, model, dataset, metric, search_space, extra_metrics, number_of_call, n_random_starts, initial_point_generator, optimization_type, model_runs, surrogate_model, kernel, acq_func, random_state, x0, y0, save_models, save_step, save_name, save_path, early_stop, early_step, plot_best_seen, plot_model, plot_name, log_scale_plot, topk)
    158 
    159         # Perform Bayesian Optimization
--> 160         results = self._optimization_loop(opt)
    161 
    162         return results

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer.py in _optimization_loop(self, opt)
    299 
    300             # Create an object related to the BO optimization
--> 301             results = OptimizerEvaluation(self, BO_results=res)
    302 
    303             # Save the object

~/anaconda3/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in __init__(self, optimizer, BO_results)
     45         # Info about optimization
     46         self.info = dict()
---> 47         dataset_info = optimizer.dataset.get_metadata()["info"]
     48         if dataset_info is not None:
     49             self.info.update({"dataset_name": dataset_info["name"]})

KeyError: 'info'

Adding this after loading custom data fixes the problem:

# Load existing metadata
meta_dict = dataset.get_metadata()

# Add name to dict
meta_dict['info'] = {'name':'dataset_name'}

# Update metadata
dataset._Dataset__metadata = meta_dict

# Verify info is updated
dataset.get_info()
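
The auto-fill suggested above could look roughly like the following sketch. It assumes the metadata is a plain dict. The helper name, the default dataset name and the "total_documents" key are hypothetical; only the "info" and "name" keys come from the traceback.

```python
def ensure_dataset_info(metadata, default_name="custom_dataset"):
    """Return a copy of metadata that is guaranteed to carry an 'info' entry."""
    filled = dict(metadata or {})
    # setdefault leaves an existing 'info' entry untouched.
    filled.setdefault("info", {"name": default_name})
    return filled


custom = ensure_dataset_info({"total_documents": 100})
fetched = ensure_dataset_info({"info": {"name": "M10"}})
```

Working on a copy avoids writing to the name-mangled private attribute from outside the class.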

'OptimizerEvaluation' object has no attribute 'dict_model_runs' error while using 'extra_metrics' in optimization

OCTIS version: 1.10.0
Python version: 3.8.10
Operating System: Ubuntu 20.04.3

Description

I am trying to optimise an LDA model with custom data. My evaluation metric is npmi, but I am also using topic_diversity as an extra metric during optimization.

What I Did

Code:

# Create Model
model = LDA(num_topics=20, alpha=0.1)
model.partitioning(False)

# Initialize metric
npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_npmi')

# Initialize metric
topic_diversity = TopicDiversity(topk=10)

optimization_runs=30 # number of optimization iterations
model_runs=5 # number of runs of the topic model

# Define the search space. To see which hyperparameters to optimize, see the topic model's initialization signature
search_space = {"alpha": Real(low=0.001, high=5.0), 
                "eta": Real(low=0.001, high=5.0), 
                'num_topics': Integer(low=1, high=10, prior='uniform')}

# Initialize an optimizer object and start the optimization.
optimizer=Optimizer()
optResult=optimizer.optimize(model, dataset, 
                             search_space= search_space, 
                             save_path=output_path, # path to store the results
                             metric= npmi,
                             number_of_call=optimization_runs,
                             model_runs=model_runs, 
                             extra_metrics=[topic_diversity])

# save the results of the optimization in a csv file
optResult.save_to_csv("results.csv")

Error traceback :

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/envs/topic_modeling/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in save_to_csv(self, name_file)
    150             try:
--> 151                 df[metric.info()["name"] + '(not optimized)'] = [np.median(
    152                     self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]

~/envs/topic_modeling/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in <listcomp>(.0)
    151                 df[metric.info()["name"] + '(not optimized)'] = [np.median(
--> 152                     self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
    153             except:

AttributeError: 'OptimizerEvaluation' object has no attribute 'dict_model_runs'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_6863/802420513.py in <module>
     10 
     11 #save the results of th optimization in a csv file
---> 12 optResult.save_to_csv("results.csv")

~/envs/topic_modeling/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in save_to_csv(self, name_file)
    152                     self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
    153             except:
--> 154                 df[metric.__class__.__name__ + '(not optimized)'] = [np.median(
    155                     self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
    156 

~/envs/topic_modeling/lib/python3.8/site-packages/octis/optimization/optimizer_evaluation.py in <listcomp>(.0)
    153             except:
    154                 df[metric.__class__.__name__ + '(not optimized)'] = [np.median(
--> 155                     self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
    156 
    157         if not name_file.endswith(".csv"):

AttributeError: 'OptimizerEvaluation' object has no attribute 'dict_model_runs'

Possible solution

Code modification here

Old

try:
    df[metric.info()["name"] + '(not optimized)'] = [np.median(
        self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
except:
    df[metric.__class__.__name__ + '(not optimized)'] = [np.median(
        self.dict_model_runs[metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]

New

try:
  df[metric.info()["name"] + '(not optimized)'] = [np.median(
      self.info['dict_model_runs'][metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
except:
  df[metric.__class__.__name__ + '(not optimized)'] = [np.median(
      self.info['dict_model_runs'][metric.__class__.__name__]['iteration_' + str(i)]) for i in range(n_row)]
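
The proposed change reads the per-iteration runs from self.info['dict_model_runs']. Below is a minimal sketch of that lookup as a standalone helper, assuming the nested layout metric name, then 'iteration_<i>', then a list of scores, as implied by the snippet above. The helper name and the sample numbers are made up for illustration.

```python
from statistics import median


def extra_metric_column(info, metric_name, n_rows):
    """Median score of an extra metric for each optimization iteration."""
    runs = info["dict_model_runs"][metric_name]
    return [median(runs["iteration_" + str(i)]) for i in range(n_rows)]


info = {
    "dict_model_runs": {
        "TopicDiversity": {
            "iteration_0": [0.5, 0.7, 0.6],
            "iteration_1": [0.25, 0.75],
        }
    }
}
column = extra_metric_column(info, "TopicDiversity", n_rows=2)
```

Putting the lookup in one helper would also remove the duplicated list comprehension in the try and except branches.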

Improve Preprocessing Speed

Preprocessing is a very useful class that can do a lot with simple argument configuration, but it currently takes a long time on large datasets. One way to improve the speed is to use spaCy pipes, particularly for lemmatization.

for doc in spacy_nlp.pipe(documents, batch_size=32, n_process=3, disable=["parser", "ner"]):
    # Lemmatize each token and convert to lower case if the token is not a pronoun
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in doc]

    # Remove stop words and punctuation
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations]
    processed_documents.append(tokens)

I'm happy to contribute code to make this change.

How to preprocess text for CTM?

I read that CTM uses both the preprocessed text for the BoW and the full text for the BERT embeddings. How can I create such a Dataset for the CTM model? Does saving an OCTIS dataset do this automatically?

Many thanks

Preprocessing with "split = False" returns NoneType object

OCTIS version: 1.6.0
Python version: 3.9.5
Operating System: Ubuntu 16.04

Description

I am trying to load a custom dataset without splitting into train, validation, test.

What I Did

I first processed the content of both my files with "split = False" and obtained a NoneType object.

preprocessor = Preprocessing(vocabulary=None, max_features=None,
                             remove_punctuation=True, punctuation=string.punctuation,
                             lemmatize=True, stopword_list='french',
                             min_chars=1, min_words_docs=0,
                             language="french",
                             split = False)
    
dataset = preprocessor.preprocess_dataset(documents_path = dataPath / 'fake_docs.txt', labels_path = dataPath / 'fake_metadata.txt')
    
print(type(dataset))

None

To compare, I also did it with "split = True" but got a valid Dataset object this time.

preprocessor = Preprocessing(vocabulary=None, max_features=None,
                             remove_punctuation=True, punctuation=string.punctuation,
                             lemmatize=True, stopword_list='french',
                             min_chars=1, min_words_docs=0,
                             language="french",
                             split = True)
    
dataset = preprocessor.preprocess_dataset(documents_path = dataPath / 'fake_docs.txt', labels_path = dataPath / 'fake_metadata.txt')
    
print(type(dataset))

<octis.dataset.dataset.Dataset object at 0x7f58ad457400>

Might be source of the problem

I looked into the octis.preprocessing.preprocessing.py module and found that, at lines 209 and 213 in the "else" branch after "if self.split:", there is no return statement to output the Dataset.

        if self.split:
            if len(final_labels) > 0:
                train, test, y_train, y_test = train_test_split(
                    range(len(final_docs)), final_labels, test_size=0.15, random_state=1, stratify=final_labels)

                train, validation = train_test_split(train, test_size=3 / 17, random_state=1, stratify=y_train)
                partitioned_labels = [final_labels[doc] for doc in train + validation + test]
                partitioned_corpus = [final_docs[doc] for doc in train + validation + test]
                document_indexes = [document_indexes[doc] for doc in train + validation + test]
                metadata["last-training-doc"] = len(train)
                metadata["last-validation-doc"] = len(validation) + len(train)
                if self.save_original_indexes:
                    return Dataset(partitioned_corpus, vocabulary=vocabulary, metadata=metadata, labels=partitioned_labels,
                                   document_indexes=document_indexes)
                else:
                    return Dataset(partitioned_corpus, vocabulary=vocabulary, metadata=metadata, labels=partitioned_labels)
            else:
                train, test = train_test_split(range(len(final_docs)), test_size=0.15, random_state=1)
                train, validation = train_test_split(train, test_size=3 / 17, random_state=1)

                metadata["last-training-doc"] = len(train)
                metadata["last-validation-doc"] = len(validation) + len(train)
                partitioned_corpus = [final_docs[doc] for doc in train + validation + test]
                document_indexes = [document_indexes[doc] for doc in train + validation + test]
                if self.save_original_indexes:
                    return Dataset(partitioned_corpus, vocabulary=vocabulary, metadata=metadata, labels=final_labels,
                                   document_indexes=document_indexes)
                else:
                    return Dataset(partitioned_corpus, vocabulary=vocabulary, metadata=metadata, labels=final_labels,
                                   document_indexes=document_indexes)
        else:
            if self.save_original_indexes:
                Dataset(final_docs, vocabulary=vocabulary, metadata=metadata, labels=final_labels,
                        document_indexes=document_indexes)
            else:

                Dataset(final_docs, vocabulary=vocabulary, metadata=metadata, labels=final_labels)

Adding the return statements at both lines 209 and 213 in the octis.preprocessing.preprocessing.py module resolved the problem.
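
The effect of the missing return can be reproduced in isolation. The sketch below uses hypothetical function names and a plain dict in place of the Dataset object. It only illustrates that a branch which builds an object without returning it makes the function return None.

```python
def build_dataset_buggy(docs, split):
    if split:
        return {"corpus": docs, "partitioned": True}
    else:
        # Bug: the object is created but never returned, so the caller gets None.
        {"corpus": docs, "partitioned": False}


def build_dataset_fixed(docs, split):
    if split:
        return {"corpus": docs, "partitioned": True}
    # Fix: return the unpartitioned dataset as well.
    return {"corpus": docs, "partitioned": False}


docs = [["bonjour", "monde"]]
buggy_result = build_dataset_buggy(docs, split=False)
fixed_result = build_dataset_fixed(docs, split=False)
```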

Also, thank you for your incredible work, the library is really nice to use!

num_samples should be a positive integer value, but got num_samples=0

OCTIS version:
Python version:
Operating System:

Description

When I try to run the optimize function, I get the error "num_samples should be a positive integer value, but got num_samples=0", and I am not sure why.

What I Did

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("mydata")

model = CTM(num_topics=10,
            num_epochs=30,
            inference_type='zeroshot', 
            bert_model="distiluse-base-multilingual-cased")

npmi = Coherence(texts=dataset.get_corpus())

search_space = {"num_layers": Categorical({1, 2, 3}), 
                "num_neurons": Categorical({100, 200, 300}),
                "activation": Categorical({'relu', 'softplus'}), 
                "dropout": Real(0.0, 0.95)
                }
optimization_runs=30
model_runs=1

optimizer=Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    plot_best_seen=True, plot_model=True, plot_name="B0_plot", 
    save_path='results2/test_ctm//')

I can't find where to set the variable "num_samples".

Question: Does OCTIS support GPU?

OCTIS version: 1.8.0
Python version: 3.8.10
Operating System: Ubuntu 20.04.02

Description

If I add a GPU with CUDA support, will OCTIS be faster?


TypeError: load_word2vec_format() got an unexpected keyword argument 'no_header'

OCTIS version: 1.10.0
Python version: 3.6.13
Operating System: Windows 10

Description

Hi @lffloyd and @silviatti

I tried to run the ETM with pre-trained embeddings after the recent upgrade, and it returned this error.

TypeError: load_word2vec_format() got an unexpected keyword argument 'no_header'.

Please advise if I made an error on my end.

My commands and traceback are provided below.

Thank you so much!
Luke

What I Did

model = ETM(num_topics=40, num_epochs=1, use_partitions=False, train_embeddings=False,
            embeddings_type='word2vec', embeddings_path=r'my/path/to/embedding/skipgram_emb_300d.txt', binary_embeddings=False, headerless_embeddings=True)

output= model.train_model(dataset)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
d:\01MRes_ubuntu\OCTIS\fomcNoPartitionsPreTrained\EtmRunModelPreTrained300.py in <module>
----> 26 output_fomc_etm = model.train_model(dataset)

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM.py in train_model(self, dataset, hyperparameters, top_words)
     74         if hyperparameters is None:
     75             hyperparameters = {}
---> 76         self.set_model(dataset, hyperparameters)
     77         self.top_word = top_words
     78         self.early_stopping = EarlyStopping(patience=5, verbose=True)

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM.py in set_model(self, dataset, hyperparameters)
    119 
    120         self.set_default_hyperparameters(hyperparameters)
--> 121         self.load_embeddings()
    122         ## define model and optimizer
    123         self.model = etm.ETM(num_topics=self.hyperparameters['num_topics'], vocab_size=len(self.vocab.keys()),

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\base_etm.py in load_embeddings(self)
     52                                         self.hyperparameters['embeddings_type'],
     53                                         self.hyperparameters['binary_embeddings'],
---> 54                                         self.hyperparameters['headerless_embeddings'])
     55         embeddings = np.zeros((len(self.vocab.keys()), self.hyperparameters['embedding_size']))
     56         for i, word in enumerate(self.vocab.values()):

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\base_etm.py in _load_word_vectors(self, embeddings_path, embeddings_type, binary_embeddings, headerless_embeddings)
     85                 embeddings_path,
     86                 binary=binary_embeddings,
---> 87                 no_header=headerless_embeddings)
     88 
     89         vectors = {}

TypeError: load_word2vec_format() got an unexpected keyword argument 'no_header'
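
The error suggests that the installed Gensim release does not know the no_header keyword. This is an assumption, since the report does not state the Gensim version. One workaround that avoids the keyword is to write a copy of the embeddings with the standard word2vec text header: a first line holding the vocabulary size and the vector dimension. The function name and the sample file contents below are hypothetical.

```python
import os
import tempfile


def add_word2vec_header(src_path, dst_path):
    """Copy a headerless text embedding file, prepending '<vocab_size> <dim>'."""
    with open(src_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    if not lines:
        raise ValueError("embedding file is empty")
    # Each line is: word v1 v2 ... vN, so the dimension is the field count minus one.
    dim = len(lines[0].split()) - 1
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(lines)} {dim}\n")
        for line in lines:
            dst.write(line + "\n")
    return len(lines), dim


# Small demonstration on a temporary two-word embedding file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "headerless.txt")
    dst = os.path.join(tmp, "with_header.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("rate 0.1 0.2 0.3\ninflation 0.4 0.5 0.6\n")
    vocab_size, dim = add_word2vec_header(src, dst)
    with open(dst, encoding="utf-8") as f:
        first_line = f.readline().strip()
```

The converted file should then load as a regular word2vec text file, that is, with headerless_embeddings=False.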

AttributeError: 'KeyedVectors' object has no attribute 'wv'

OCTIS version: 1.8.3
Python version: 3.8.5
Operating System: macOS

Description

When trying to evaluate a model with the WordEmbeddingsInvertedRBOCentroid() metric, I get the attribute error "'KeyedVectors' object has no attribute 'wv'".

What I Did

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("custom_dataset")

from octis.models.LDA import LDA
model_LDA_15 = LDA(num_topics=15) 
model_LDA_15_output = model_LDA_15.train_model(dataset)

from octis.evaluation_metrics.diversity_metrics import WordEmbeddingsInvertedRBOCentroid
rbo_centroid_metric = WordEmbeddingsInvertedRBOCentroid()
topic_rbo_centroid_score = rbo_centroid_metric.score(model_LDA_15_output)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-45-eb5075095fc9> in <module>
      1 from octis.evaluation_metrics.diversity_metrics import WordEmbeddingsInvertedRBOCentroid
      2 rbo_centroid_metric = WordEmbeddingsInvertedRBOCentroid()
----> 3 topic_rbo_centroid__score = rbo_centroid_metric.score(model_LDA_15_output)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/diversity_metrics.py in score(self, model_output)
    174                 indexed_list1 = [word2index[word] for word in list1]
    175                 indexed_list2 = [word2index[word] for word in list2]
--> 176                 rbo_val = weirbo_centroid(
    177                     indexed_list1[:self.topk], indexed_list2[:self.topk], p=self.weight, index2word=index2word,
    178                     word2vec=self.wv, norm=self.norm)[2]

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in word_embeddings_rbo(list1, list2, p, index2word, word2vec, norm)
    145     args = (list1, list2, p, index2word, word2vec, norm)
    146 
--> 147     return RBO(rbo_min(*args), rbo_res(*args), rbo_ext(*args))
    148 
    149 

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in rbo_min(list1, list2, p, index2word, word2vec, norm, depth)
     79     """
     80     depth = min(len(list1), len(list2)) if depth is None else depth
---> 81     x_k = overlap(list1, list2, depth, index2word, word2vec, norm)
     82     log_term = x_k * math.log(1 - p)
     83     sum_term = sum(

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in overlap(list1, list2, depth, index2word, word2vec, norm)
     59     # NOTE: comment the preceding and uncomment the following line if you want
     60     # to stick to the algorithm as defined by the paper
---> 61     ov = embeddings_overlap(list1, list2, depth, index2word, word2vec, norm=norm)[0]
     62     # print("overlap", ov)
     63     return ov

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in embeddings_overlap(list1, list2, depth, index2word, word2vec, norm)
     41     word_list2 = [index2word[index] for index in list2]
     42 
---> 43     centroid_1 = np.mean([word2vec.wv[w] for w in word_list1[:depth]], axis=0)
     44     centroid_2 = np.mean([word2vec.wv[w] for w in word_list2[:depth]], axis=0)
     45     cos_sim = 1 - distance.cosine(centroid_1, centroid_2)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/octis/evaluation_metrics/word_embeddings_rbo_centroid.py in <listcomp>(.0)
     41     word_list2 = [index2word[index] for index in list2]
     42 
---> 43     centroid_1 = np.mean([word2vec.wv[w] for w in word_list1[:depth]], axis=0)
     44     centroid_2 = np.mean([word2vec.wv[w] for w in word_list2[:depth]], axis=0)
     45     cos_sim = 1 - distance.cosine(centroid_1, centroid_2)

AttributeError: 'KeyedVectors' object has no attribute 'wv'
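
This error is consistent with a Gensim release in which KeyedVectors objects are indexed directly and no longer expose a .wv attribute, while full Word2Vec models still keep their vectors under .wv. Below is a minimal sketch of an accessor that tolerates both. The two Fake classes are hypothetical stand-ins so the example runs without Gensim.

```python
class FakeKeyedVectors:
    """Hypothetical stand-in: vectors are indexed directly, no .wv attribute."""

    def __init__(self, vectors):
        self._vectors = vectors

    def __getitem__(self, word):
        return self._vectors[word]


class FakeWord2Vec:
    """Hypothetical stand-in: a full model that keeps its vectors under .wv."""

    def __init__(self, vectors):
        self.wv = FakeKeyedVectors(vectors)


def resolve_vectors(embeddings):
    """Return the object that maps words to vectors, with or without .wv."""
    return getattr(embeddings, "wv", embeddings)


vectors = {"topic": [1.0, 0.0], "model": [0.0, 1.0]}
from_keyed = resolve_vectors(FakeKeyedVectors(vectors))["topic"]
from_model = resolve_vectors(FakeWord2Vec(vectors))["model"]
```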

Bug OOV words in WECoherenceCentroid

OCTIS version: 1.10.0
Python version: 3.9.7
Operating System:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"

Description

In this line (https://github.com/MIND-Lab/OCTIS/blob/master/octis/evaluation_metrics/coherence_metrics.py#L180), topic[0] contains a word, so if that word is not included in self._wv, it will cause an error.

Since Gensim's KeyedVectors class has a vector_size variable, I think this code should be rewritten to create a zero vector with reference to vector_size.

#t = [0] * len(self._wv.__getitem__(topic[0]))
t = np.zeros(self._wv.vector_size)

Examples of error messages

  File "/root/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/octis-1.10.0-py3.9.egg/octis/evaluation_metrics/coherence_metrics.py", line 180, in score
    t = [0] * len(self._wv.__getitem__(topic[0]))
  File "/root/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 395, in __getitem__
    return self.get_vector(key_or_keys)
  File "/root/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 438, in get_vector
    index = self.get_index(key)
  File "/root/.cache/pypoetry/virtualenvs/sktopic-L2WRRFYm-py3.9/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 412, in get_index
    raise KeyError(f"Key '{key}' not present")
KeyError: "Key 'elsevi' not present"
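
Below is a minimal sketch of the idea proposed above: size the zero vector from the embedding dimension instead of looking up the first topic word. Skipping out-of-vocabulary words when averaging is an additional assumption of this sketch, and a plain dict stands in for the KeyedVectors object.

```python
def topic_centroid(words, vectors, vector_size):
    """Average the vectors of in-vocabulary words; zero vector if none are known."""
    centroid = [0.0] * vector_size
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return centroid
    for vec in known:
        for i, value in enumerate(vec):
            centroid[i] += value
    return [total / len(known) for total in centroid]


vectors = {"journal": [1.0, 3.0], "paper": [3.0, 5.0]}
mixed = topic_centroid(["elsevi", "journal", "paper"], vectors, vector_size=2)
all_oov = topic_centroid(["elsevi"], vectors, vector_size=2)
```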

ETM model: AttributeError: 'list' object has no attribute 'squeeze'

OCTIS version: 1.9.0
Python version: 3.6.13
Operating System: Windows 10

Description

I tried to run the ETM model through OCTIS, but got an attribute error.
I've attached my corpus (corpus.csv; for some reason GitHub won't let me attach an actual .tsv file) and vocabulary (vocabulary.txt) for your convenience.

What I Did

Here's what I did

I complied with the format of the dataset as a .tsv and vocabulary as a .txt file with one stem per row.

I was able to load the dataset with no errors.

To run the model I did the following:

from octis.models.ETM import ETM
model_etm = ETM(num_topics=40)
output_fomc = model_etm.train_model(dataset)

model: ETM(
  (t_drop): Dropout(p=0.5, inplace=False)
  (theta_act): ReLU()
  (rho): Linear(in_features=300, out_features=9659, bias=False)
  (alphas): Linear(in_features=300, out_features=40, bias=False)
  (q_theta): Sequential(
    (0): Linear(in_features=9659, out_features=800, bias=True)
    (1): ReLU()
    (2): Linear(in_features=800, out_features=800, bias=True)
    (3): ReLU()
  )
  (mu_q_theta): Linear(in_features=800, out_features=40, bias=True)
  (logsigma_q_theta): Linear(in_features=800, out_features=40, bias=True)
)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-159-a49502c35827> in <module>
----> 1 output_fomc = model_etm.train_model(dataset)

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM.py in train_model(self, dataset, hyperparameters, top_words)
     54 
     55         for epoch in range(0, self.hyperparameters['num_epochs']):
---> 56             continue_training = self._train_epoch(epoch)
     57             if not continue_training:
     58                 break

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM.py in _train_epoch(self, epoch)
    120             self.model.zero_grad()
    121             data_batch = data.get_batch(self.train_tokens, self.train_counts, ind, len(self.vocab.keys()),
--> 122                                         self.hyperparameters['embedding_size'], self.device)
    123             sums = data_batch.sum(1).unsqueeze(1)
    124             if self.hyperparameters['bow_norm']:

~\anaconda3\envs\lda_env36\lib\site-packages\octis\models\ETM_model\data.py in get_batch(tokens, counts, ind, vocab_size, emsize, device)
     15         #L = count.shape[1]
     16         if len(doc) == 1:
---> 17             doc = [doc.squeeze()]
     18             count = [count.squeeze()]
     19         else:

AttributeError: 'list' object has no attribute 'squeeze'


[vocabulary.txt](https://github.com/MIND-Lab/OCTIS/files/7423616/vocabulary.txt)
[corpus.csv](https://github.com/MIND-Lab/OCTIS/files/7423618/corpus.csv)


Addition of DTM and DETM

Hello,

Is it possible to add the following models to OCTIS?

Dynamic Topic Model git here
DETM git here

Thank you
Luke

ETM - How to use pretrained word embeddings

@silviatti Hello again,

I have another silly question for you.

The original ETM package allows the user to use pre-trained word embeddings (the file name is 'skipgram_emb_300d.txt')

How do I tell the model to look for the pretrained embedding file instead?

Is it something like:

model = ETM(num_topics=25, embeddings_path='...\path\to\embeddings\skipgram_emb_300d.txt') ?

Thank you again for your assistance :)

Luke

Can CTM return topic_significance_uniform score?

OCTIS version:
Python version:
Operating System:

Description

Currently, CTM returns only topic_significance_background score. Is it possible to get topic_significance_uniform score (per topic)?

Thanks,
-Atakan

What I Did

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.

Is Neural LDA the same as LDA

When using NeuralLDA as the model in OCTIS, it simply sets the model_type of AVITM to LDA. Does it simply do a standard LDA or something else? What is the corresponding model in the Srivastava and Sutton 2017 paper?

UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined> - when fetching the DBPedia_IT dataset

OCTIS version: 1.10.3
Python version: 3.9
Operating System: Windows

Description

I am trying to fetch the DBPedia_IT dataset. I expected the download to complete without errors, but a UnicodeEncodeError was raised.

What I Did

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset('DBPedia_IT')

Traceback (most recent call last):

  Input In [42] in <module>
    dataset.fetch_dataset('DBPedia_IT')

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\dataset.py:392 in fetch_dataset
    cache = download_dataset(dataset_name, target_dir=dataset_home, cache_path=cache_path)

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\downloader.py:77 in download_dataset
    f.write(corpus.text)

  File ~\Anaconda3\envs\SentenceTransformers\lib\encodings\cp1252.py:19 in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined>
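
The traceback ends in cp1252.py, which suggests the target file was opened with the Windows default encoding rather than UTF-8. A minimal sketch of the usual remedy follows, assuming a hypothetical helper save_text (not part of OCTIS) that passes an explicit encoding when writing the downloaded text.

```python
import os
import tempfile


def save_text(path, text):
    """Write text to path as UTF-8, independent of the platform default."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# U+2192 (rightwards arrow) has no cp1252 representation, so writing it
# through a cp1252-encoded file handle would raise UnicodeEncodeError.
sample = "Citt\u00e0 \u2192 perch\u00e9"
target = os.path.join(tempfile.mkdtemp(), "corpus.txt")
save_text(target, sample)

# Read back with the same encoding to confirm the round trip.
with open(target, encoding="utf-8") as f:
    restored = f.read()
```

Until the downloader itself is changed, starting Python in UTF-8 mode (for example with the PYTHONUTF8=1 environment variable, available since Python 3.7) may also avoid the error, since it changes the default encoding used by open().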

ETM training fails.

OCTIS version: 1.8.0
Python version: 3.8.10
Operating System: Ubuntu 20.04.02

Description

ETM training fails.

What I Did

dataset = Dataset()
dataset.load_custom_dataset_from_folder(DATASET_PATH)
model = ETM(num_topics=TOPIC_SIZE)
model_output = model.train_model(dataset)
save_model_output(model_output, MODEL_OUTPUT_PATH)
save_model_output(model, MODEL_PATH)

The following error message was displayed.

model: ETM(
  (t_drop): Dropout(p=0.5, inplace=False)
  (theta_act): ReLU()
  (rho): Linear(in_features=300, out_features=221413, bias=False)
  (alphas): Linear(in_features=300, out_features=25, bias=False)
  (q_theta): Sequential(
    (0): Linear(in_features=221413, out_features=800, bias=True)
    (1): ReLU()
    (2): Linear(in_features=800, out_features=800, bias=True)
    (3): ReLU()
  )
  (mu_q_theta): Linear(in_features=800, out_features=25, bias=True)
  (logsigma_q_theta): Linear(in_features=800, out_features=25, bias=True)
)
Traceback (most recent call last):
  File "train.py", line 62, in <module>
    model_output = model.train_model(dataset)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/ETM.py", line 57, in train_model
    continue_training = self._train_epoch(epoch)
  File "/usr/local/lib/python3.8/dist-packages/octis/models/ETM.py", line 149, in _train_epoch
    data_batch = data.get_batch(self.train_tokens, self.train_counts, ind, len(self.vocab.keys()),
  File "/usr/local/lib/python3.8/dist-packages/octis/models/ETM_model/data.py", line 17, in get_batch
    doc = [doc.squeeze()]
AttributeError: 'list' object has no attribute 'squeeze'

ETM - Single-word document causes ETM training to fail

OCTIS version: 1.9.0
Python version: 3.7.6
Operating System: Ubuntu 20.04 LTS

Description

Probably related to #20 and #35.
If your preprocessed corpus contains any single-word document, ETM training fails. This should not happen, as the Preprocessing class has 0 as the default value for the parameters min_words_docs and min_df, which define, respectively, the minimum number of words a document must have to be kept and the minimum document frequency for words in the corpus.

What I Did

I've implemented a test case illustrating the scenario. The test fails. The test code can be found here, and the error stacktrace as shown on Github actions can be seen here.

Below, the aforementioned stacktrace (on my local machine):

Current call:  0
model: ETM(
  (t_drop): Dropout(p=0.5, inplace=False)
  (theta_act): ReLU()
  (alphas): Linear(in_features=300, out_features=16, bias=False)
  (q_theta): Sequential(
    (0): Linear(in_features=872, out_features=800, bias=True)
    (1): ReLU()
    (2): Linear(in_features=800, out_features=800, bias=True)
    (3): ReLU()
  )
  (mu_q_theta): Linear(in_features=800, out_features=16, bias=True)
  (logsigma_q_theta): Linear(in_features=800, out_features=16, bias=True)
)
Traceback (most recent call last):
  File "octis_test/unified_training.py", line 55, in <module>
    model_runs=5, plot_best_seen=True) # number of runs of the topic model
  File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/optimization/optimizer.py", line 160, in optimize
    results = self._optimization_loop(opt)
  File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/optimization/optimizer.py", line 285, in _optimization_loop
    f_val = self._objective_function(next_x)
  File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/optimization/optimizer.py", line 217, in _objective_function
    self.topk)
  File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/models/ETM.py", line 60, in train_model
    continue_training = self._train_epoch(epoch)
  File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/models/ETM.py", line 126, in _train_epoch
    self.hyperparameters['embedding_size'], self.device)
  File "/home/luizmatos/anaconda3/lib/python3.7/site-packages/octis/models/ETM_model/data.py", line 17, in get_batch
    doc = [doc.squeeze()]
AttributeError: 'list' object has no attribute 'squeeze'
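
Until the batching code handles this case, one possible workaround is to drop very short documents before training. The sketch below assumes documents are plain lists of tokens. drop_short_documents is a hypothetical helper, not an OCTIS function. Raising min_words_docs in the Preprocessing class may achieve the same effect, though that is not verified here.

```python
def drop_short_documents(documents, min_words=2):
    """Keep only documents (lists of tokens) with at least min_words tokens."""
    return [doc for doc in documents if len(doc) >= min_words]


corpus = [
    ["topic", "model"],
    ["single"],  # a one-word document of the kind described in this issue
    [],          # empty after preprocessing
    ["neural", "topic", "model"],
]
filtered = drop_short_documents(corpus)
print(len(corpus), "->", len(filtered))
```

If the dataset carries labels or a train/validation/test partition, the same filter would have to be applied to those in step so that indices stay aligned.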

WECoherencePairwise and WECoherenceCentroid are negatively correlated.

OCTIS version: 1.9.0
Python version: 3.9.7
Operating System:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"

Description

I used OCTIS metrics to evaluate my own implementation of the model and found that WECoherencePairwise and WECoherenceCentroid are negatively correlated. I would expect these two metrics to be positively correlated.
Each point in the figure represents the result of an experiment under different conditions.

The WECoherenceCentroid calculation uses distance - 1, but 1 - distance would be more correct (or use sklearn.metrics.pairwise.cosine_similarity).

# https://github.com/MIND-Lab/OCTIS/blob/master/octis/evaluation_metrics/coherence_metrics.py#L171 
distance = spatial.distance.cosine(self._wv.__getitem__(w1), self._wv.__getitem__(w2))
topic_coherence += distance - 1
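
The sign issue can be checked with a small pure-Python sketch. SciPy's spatial.distance.cosine returns 1 minus the cosine similarity, so distance - 1 is the negated similarity, while 1 - distance recovers the similarity itself. The helper functions and example vectors below are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Same definition as scipy.spatial.distance.cosine: 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.5]  # nearly parallel to u, so similarity is close to 1

similarity = cosine_similarity(u, v)
distance = cosine_distance(u, v)

reported = distance - 1   # expression quoted above: equals -similarity
suggested = 1 - distance  # proposed fix: equals +similarity
print(similarity, reported, suggested)
```

Because the quoted expression accumulates the negated similarity, more coherent topics get lower scores, which would be consistent with the negative correlation described above.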

What I Did

"Select Path" button bug in "Create Experiments" tab

Python version: 3.9
Operating System: MacOS

Description

In the dashboard, I was trying to create an experiment. Everything was OK, but I got an error when I clicked "Select Path" to choose where to save the experiments.

I got the error:

RuntimeError: main thread is not in main loop
