sebischair / lbl2vec

Lbl2Vec learns jointly embedded label, document and word vectors to retrieve documents with predefined topics from an unlabeled document corpus.

Home Page: https://wwwmatthes.in.tum.de/pages/naimi84squl1/Lbl2Vec-An-Embedding-based-Approach-for-Unsupervised-Document-Retrieval-on-Predefined-Topics

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

natural-language-processing word-embeddings document-embeddings label-embeddings unsupervised-classification nlp machine-learning text-classification python unsupervised-document-retrieval

lbl2vec's Issues

Localization possible?

Does Lbl2Vec work with languages other than English? That is, does it train the Doc2Vec model correctly on non-English text?

I saved the trained model using pickle (and also using the Lbl2Vec save feature), but when I try to load it in a custom module, I get the following error: "No module named lbl2vec.sav"

The file is located in a directory alongside other models, and those load successfully in the same way. Does anyone have any idea why this is happening?
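
The source does not show the loading code, so the cause of this error is not known. As a point of comparison only, here is a minimal sketch of saving and loading a pickled object by file path with the standard library, where the ".sav" file is opened as data and never imported. The stand-in dictionary and the helper names `save_model` and `load_model` are hypothetical and are not part of Lbl2Vec.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    # Write the pickled bytes to an explicit file path.
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    # Read the bytes back from the same path; the file is treated as data,
    # not as a Python module, so no import of "lbl2vec.sav" takes place.
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical stand-in for a trained model object.
stand_in_model = {"labels": ["sports", "politics"]}

model_path = Path(tempfile.mkdtemp()) / "lbl2vec.sav"
save_model(stand_in_model, model_path)
restored = load_model(model_path)
print(restored["labels"])
```

If loading by path works in isolation but fails inside the custom module, the difference between the two call sites (for example, how the path string is passed) is a reasonable place to look.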

Is paragraph classification possible?

Hello and thanks for sharing this. A question: can Lbl2Vec perform well when the "documents" are paragraph-sized, for example 3-5 sentences? Would we need to replace the Doc2Vec model that Lbl2Vec currently uses with Sent2Vec or some other equivalent? Your thoughts?

Thanks!

Lbl2TransformerVec - predict_model_docs() when clean_outliers=True creates Dimension out of range

Calling model.predict_model_docs() on a model created with clean_outliers=True produces an "IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)" error. With clean_outliers=False, predict_model_docs() runs without issues. Is clean_outliers creating a dimension mismatch?

model = lbl2vec.Lbl2TransformerVec(transformer_model=transformer_model_loop, label_names=labels, keywords_list=keys,
                               documents=df['name'].apply(str.lower), device=torch.device('cuda'), similarity_threshold=.5, clean_outliers=True)
model.fit()
torch.set_default_tensor_type('torch.cuda.FloatTensor')

# Raises the IndexError shown below when clean_outliers=True
model_out = model.predict_model_docs()

Error: IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Lbl2TransformerVec - INFO - Calculate document<->label similarities

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-6-c5d23d521a48> in <module>

---> 15     model_out = model.predict_model_docs()


~/.local/lib/python3.8/site-packages/lbl2vec/lbl2transformervec.py in predict_model_docs(self, doc_idxs)
    333         self.logger.info('Calculate document<->label similarities')
    334         # calculate document vector <-> label vector similarities
--> 335         labeled_docs = self._get_document_label_similarities(labeled_docs=labeled_docs, doc_key_column=doc_key_column,
    336                                                              most_similar_label_column=most_similar_label_column,
    337                                                              highest_similarity_score_column=highest_similarity_score_column)

~/.local/lib/python3.8/site-packages/lbl2vec/lbl2transformervec.py in _get_document_label_similarities(self, labeled_docs, doc_key_column, most_similar_label_column, highest_similarity_score_column)
    532         label_similarities = []
    533         for label_vector in list(self.labels['label_vector_from_docs']):
--> 534             similarities = top_similar_vectors(key_vector=label_vector, candidate_vectors=list(labeled_docs['doc_vec']))
    535             similarities.sort(key=lambda x: x[1])
    536             similarities = [elem[0] for elem in similarities]

~/.local/lib/python3.8/site-packages/lbl2vec/utils.py in top_similar_vectors(key_vector, candidate_vectors)
    178           A descending sorted of tuples of (cos_similarity, list_idx) by cosine similarities for each candidate vector in the list
    179      '''
--> 180     cos_scores = util.cos_sim(key_vector, np.asarray(candidate_vectors))[0]
    181     top_results = torch.topk(cos_scores, k=len(candidate_vectors))
    182     top_cos_scores = top_results[0].detach().cpu().numpy()

~/.local/lib/python3.8/site-packages/sentence_transformers/util.py in cos_sim(a, b)
     45         b = b.unsqueeze(0)
     46 
---> 47     a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
     48     b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
     49     return torch.mm(a_norm, b_norm.transpose(0, 1))

~/.local/lib/python3.8/site-packages/torch/nn/functional.py in normalize(input, p, dim, eps, out)
   4630         return handle_torch_function(normalize, (input, out), input, p=p, dim=dim, eps=eps, out=out)
   4631     if out is None:
-> 4632         denom = input.norm(p, dim, keepdim=True).clamp_min(eps).expand_as(input)
   4633         return input / denom
   4634     else:

~/.local/lib/python3.8/site-packages/torch/_tensor.py in norm(self, p, dim, keepdim, dtype)
    636                 Tensor.norm, (self,), self, p=p, dim=dim, keepdim=keepdim, dtype=dtype
    637             )
--> 638         return torch.norm(self, p, dim, keepdim, dtype=dtype)
    639 
    640     def solve(self, other):

~/.local/lib/python3.8/site-packages/torch/functional.py in norm(input, p, dim, keepdim, out, dtype)
   1527         if out is None:
   1528             if dtype is None:
-> 1529                 return _VF.norm(input, p, _dim, keepdim=keepdim)  # type: ignore[attr-defined]
   1530             else:
   1531                 return _VF.norm(input, p, _dim, keepdim=keepdim, dtype=dtype)  # type: ignore[attr-defined]

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
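
The message "expected to be in range of [-1, 0], but got 1" means the tensor passed to normalize(..., dim=1) had fewer than two dimensions. One possible cause, not confirmed by the source, is that a label vector or document vector is not a proper 1-D embedding after outlier cleaning (for example a scalar NaN). The following is a minimal diagnostic sketch under that assumption. The helper `find_invalid_vectors` and the sample data are hypothetical and not part of Lbl2Vec.

```python
import numpy as np


def find_invalid_vectors(vectors, expected_dim):
    """Return indices of entries that are not clean 1-D vectors of the expected length."""
    invalid = []
    for idx, vector in enumerate(vectors):
        array = np.asarray(vector, dtype=float)
        # A scalar such as NaN has ndim == 0, so it cannot be normalized along dim=1.
        if array.ndim != 1 or array.shape[0] != expected_dim or np.isnan(array).any():
            invalid.append(idx)
    return invalid


# Hypothetical data: the second "vector" is a bare NaN instead of an embedding.
label_vectors = [np.ones(4), float("nan"), np.zeros(4)]
print(find_invalid_vectors(label_vectors, expected_dim=4))
```

Running a check like this on list(model.labels['label_vector_from_docs']) before calling predict_model_docs() could show whether any label ended up without a usable vector.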

Multilingual?

Does this model work for languages other than English?
If yes, could you please specify which ones?

Lbl2TransformerVec(Lbl2Vec).predict_model_docs() stalls / lack of GPU utilization

On larger label sets (>1000 labels), Lbl2TransformerVec(Lbl2Vec).predict_model_docs() appears to stall at the "calculate document vector <-> label vector similarities" step, perhaps due to a memory issue. Tracing it, the cause may be the "utils.top_similar_vectors" function below, which converts the Torch tensors to numpy and is called from an apply function inside predict_model_docs(). Would it be possible to refactor it so that the tensors stay on the GPU and are converted to numpy outside this function, to improve performance?

The issue only seems to appear with label counts >1000.

utils.py

def top_similar_vectors(key_vector: np.array, candidate_vectors: List[np.array]) -> List[tuple]:
    '''
    Calculates the cosines similarities of a given key vector to a list of candidate vectors.
    Parameters
    ----------
    key_vector : `np.array`_
            The key embedding vector

    candidate_vectors : List[`np.array`_]
            A list of candidate embedding vectors
    Returns
    -------
    top_results : List[tuples]
         A descending sorted of tuples of (cos_similarity, list_idx) by cosine similarities for each candidate vector in the list
    '''

    cos_scores = util.cos_sim(key_vector, np.asarray(candidate_vectors))[0]
    top_results = torch.topk(cos_scores, k=len(candidate_vectors))
    ## Return the tensors then convert to numpy

    ## Consider refactoring implementation to leave tensors in GPU instead of move to CPU at this point
    top_cos_scores = top_results[0].detach().cpu().numpy()
    top_indices = top_results[1].detach().cpu().numpy()

    return list(zip(top_cos_scores, top_indices))
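
One way the refactoring asked for above might look is sketched here, assuming that only the most similar label per document is needed. It computes every document<->label cosine similarity in a single matrix product on whatever device the tensors live on, and moves results to the CPU once at the end. The function `most_similar_labels` and the toy vectors are hypothetical. This is not the library's implementation, and it does not reproduce the sorted list of tuples that top_similar_vectors returns.

```python
import torch


def most_similar_labels(doc_vectors: torch.Tensor, label_vectors: torch.Tensor):
    """Return (best_scores, best_label_idx) per document, computed on the tensors' device."""
    docs = torch.nn.functional.normalize(doc_vectors, p=2, dim=1)
    labels = torch.nn.functional.normalize(label_vectors, p=2, dim=1)
    # One matrix product yields the full (num_docs x num_labels) cosine matrix.
    similarities = torch.mm(docs, labels.transpose(0, 1))
    # Highest similarity and its label index for every document.
    best_scores, best_label_idx = similarities.max(dim=1)
    return best_scores, best_label_idx


# Falls back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
doc_vectors = torch.tensor([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]], device=device)
label_vectors = torch.tensor([[3.0, 0.0], [0.0, 1.0]], device=device)

scores, label_idx = most_similar_labels(doc_vectors, label_vectors)
# Transfer to the CPU once, after all similarities are computed.
label_idx_cpu = label_idx.cpu().tolist()
print(label_idx_cpu)
```

With more than 1000 labels the full similarity matrix may not fit in GPU memory for a large corpus, in which case the documents could be processed in chunks with the same function.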

ValueError: cannot compute mean with no input

Does this model support German keywords? I get this error when trying to fit the model with German keywords. Can you please suggest a fix?

multiclass multilabel classification

Hi team,

I have a couple of questions about multiclass multilabel classification.

Do I need to create a keywords list for each class?
Does setting a threshold mean the model can be used for multilabel classification? That is, is any class above the threshold a match, so that one document can have multiple labels?

Thanks,
Ling
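
The thresholding idea in the second question can be sketched as follows, assuming a similarity score per label is already available for each document. The helper `labels_above_threshold` and the example scores are hypothetical. Whether Lbl2Vec's similarity_threshold parameter behaves this way is not confirmed by the source.

```python
def labels_above_threshold(similarities, threshold):
    """Return every label whose similarity score meets the threshold, best first."""
    matches = [(label, score) for label, score in similarities.items() if score >= threshold]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in matches]


# Hypothetical per-label similarity scores for a single document.
doc_similarities = {"sports": 0.62, "politics": 0.55, "science": 0.18}
print(labels_above_threshold(doc_similarities, threshold=0.5))
```

A document with no score above the threshold simply receives an empty list, which is one way to represent "no matching class".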

Is it possible to use 2 words as keywords?

Is it possible to use keywords that are composed of 2 words each? For example, 'movie theater' would be a useful keyword if I wanted to find documents about movie theaters, but the individual words 'movie' and 'theater' would identify a different subset of documents than the one I'm really after.
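
One common preprocessing approach, sketched here without any claim that Lbl2Vec supports it directly, is to merge each two-word phrase into a single token before training, so that the merged token can be used as a keyword. The helper `merge_phrases` and the underscore convention are assumptions made for illustration.

```python
def merge_phrases(tokens, phrases):
    """Join consecutive tokens that form a known two-word phrase into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            # Emit "movie_theater" instead of "movie", "theater".
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase list and document.
phrases = {("movie", "theater")}
tokens = "the new movie theater opened downtown".split()
print(merge_phrases(tokens, phrases))
```

The same merging would have to be applied to every document in the corpus, and the keyword would then be written as 'movie_theater'.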

pip install doesn't work

Hello
I'm trying to install the package but I get an error.

pip install lbl2vec

Collecting lbl2vec
ERROR: Could not find a version that satisfies the requirement lbl2vec (from versions: none)
ERROR: No matching distribution found for lbl2vec

I searched a bit on Google and couldn't find a solution.

Python 3.7.4
pip 19.2.3

ValueError: cannot compute similarity with no input

Hi Team,

I am getting the following error while running model fit:

2022-04-08 14:19:04,344 - Lbl2Vec - INFO - Train document and word embeddings
2022-04-08 14:19:09,992 - Lbl2Vec - INFO - Train label embeddings

ValueError Traceback (most recent call last)
in

~/SageMaker/lbl2vec/lbl2vec.py in fit(self)
248 # get doc keys and similarity scores of documents that are similar to
249 # the description keywords
--> 250 self.labels[['doc_keys', 'doc_similarity_scores']] = self.labels['description_keywords'].apply(lambda row: self._get_similar_documents(
251 self.doc2vec_model, row, num_docs=self.num_docs, similarity_threshold=self.similarity_threshold, min_num_docs=self.min_num_docs))
252

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
4211 else:
4212 values = self.astype(object)._values
-> 4213 mapped = lib.map_infer(values, f, convert=convert_dtype)
4214
4215 if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

~/SageMaker/lbl2vec/lbl2vec.py in (row)
249 # the description keywords
250 self.labels[['doc_keys', 'doc_similarity_scores']] = self.labels['description_keywords'].apply(lambda row: self._get_similar_documents(
--> 251 self.doc2vec_model, row, num_docs=self.num_docs, similarity_threshold=self.similarity_threshold, min_num_docs=self.min_num_docs))
252
253 # validate that documents to calculate label embeddings from are found

~/SageMaker/lbl2vec/lbl2vec.py in _get_similar_documents(self, doc2vec_model, keywords, num_docs, similarity_threshold, min_num_docs)
625 for word in cleaned_keywords_list]
626 similar_docs = doc2vec_model.dv.most_similar(
--> 627 positive=keywordword_vectors, topn=num_docs)
628 except KeyError as error:
629 error.args = (

~/anaconda3/envs/python3/lib/python3.6/site-packages/gensim/models/keyedvectors.py in most_similar(self, positive, negative, topn, clip_start, clip_end, restrict_vocab, indexer)
775 all_keys.add(self.get_index(key))
776 if not mean:
--> 777 raise ValueError("cannot compute similarity with no input")
778 mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
779

ValueError: cannot compute similarity with no input
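
The traceback shows gensim raising this error when most_similar() receives no usable input vectors. One plausible explanation, not confirmed by the source, is that none of a label's keywords occur in the trained vocabulary, for example because of casing or rare words. Below is a minimal sketch of a pre-fit check under that assumption. The helper `split_keywords_by_vocabulary`, its `min_count` parameter, and the sample data are hypothetical and not part of Lbl2Vec or gensim.

```python
from collections import Counter


def split_keywords_by_vocabulary(keywords, tokenized_docs, min_count=1):
    """Split keywords into those seen at least min_count times in the corpus and the rest."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    known = [kw for kw in keywords if counts[kw] >= min_count]
    unknown = [kw for kw in keywords if counts[kw] < min_count]
    return known, unknown


# Hypothetical lowercased corpus; the keyword "Goal" differs only in casing.
tokenized_docs = [["football", "goal", "match"], ["election", "party"]]
keywords = ["football", "Goal", "election"]

known, unknown = split_keywords_by_vocabulary(keywords, tokenized_docs)
print(known, unknown)
```

A label whose keywords all land in the unknown list would be a candidate for the empty-input case reported above.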
