sebischair / lbl2vec

Lbl2Vec learns jointly embedded label, document and word vectors to retrieve documents with predefined topics from an unlabeled document corpus.

Home Page: https://wwwmatthes.in.tum.de/pages/naimi84squl1/Lbl2Vec-An-Embedding-based-Approach-for-Unsupervised-Document-Retrieval-on-Predefined-Topics

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
Topics: natural-language-processing, word-embeddings, document-embeddings, label-embeddings, unsupervised-classification, nlp, machine-learning, text-classification, python, unsupervised-document-retrieval

lbl2vec's People

Contributors: timschopf

lbl2vec's Issues

Localization possible?

Does Lbl2Vec work with languages other than English? That is, does it train the Doc2Vec model correctly when used on text in other languages?

Is paragraph classification possible?

Hello and thanks for sharing this. A question: can Lbl2Vec perform well when the "documents" are paragraph-sized, for example 3-5 sentences? Would we need to replace the Doc2Vec model that Lbl2Vec currently uses with Sent2Vec or some equivalent? Your thoughts?

Thanks!

Lbl2TransformerVec - predict_model_docs() raises "Dimension out of range" when clean_outliers=True

When the model is created with clean_outliers=True, model.predict_model_docs() raises "IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)". With clean_outliers=False, predict_model_docs() runs without issues. It appears that the outlier cleaning is creating a dimension mismatch?

import torch
import lbl2vec

model = lbl2vec.Lbl2TransformerVec(transformer_model=transformer_model_loop, label_names=labels, keywords_list=keys,
                               documents=df['name'].apply(str.lower), device=torch.device('cuda'), similarity_threshold=.5, clean_outliers=True)
model.fit()
torch.set_default_tensor_type('torch.cuda.FloatTensor')

## Fails when the model was created with clean_outliers=True
model_out = model.predict_model_docs()

Error: IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Lbl2TransformerVec - INFO - Calculate document<->label similarities

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-6-c5d23d521a48> in <module>

---> 15     model_out = model.predict_model_docs()


~/.local/lib/python3.8/site-packages/lbl2vec/lbl2transformervec.py in predict_model_docs(self, doc_idxs)
    333         self.logger.info('Calculate document<->label similarities')
    334         # calculate document vector <-> label vector similarities
--> 335         labeled_docs = self._get_document_label_similarities(labeled_docs=labeled_docs, doc_key_column=doc_key_column,
    336                                                              most_similar_label_column=most_similar_label_column,
    337                                                              highest_similarity_score_column=highest_similarity_score_column)

~/.local/lib/python3.8/site-packages/lbl2vec/lbl2transformervec.py in _get_document_label_similarities(self, labeled_docs, doc_key_column, most_similar_label_column, highest_similarity_score_column)
    532         label_similarities = []
    533         for label_vector in list(self.labels['label_vector_from_docs']):
--> 534             similarities = top_similar_vectors(key_vector=label_vector, candidate_vectors=list(labeled_docs['doc_vec']))
    535             similarities.sort(key=lambda x: x[1])
    536             similarities = [elem[0] for elem in similarities]

~/.local/lib/python3.8/site-packages/lbl2vec/utils.py in top_similar_vectors(key_vector, candidate_vectors)
    178           A descending sorted of tuples of (cos_similarity, list_idx) by cosine similarities for each candidate vector in the list
    179      '''
--> 180     cos_scores = util.cos_sim(key_vector, np.asarray(candidate_vectors))[0]
    181     top_results = torch.topk(cos_scores, k=len(candidate_vectors))
    182     top_cos_scores = top_results[0].detach().cpu().numpy()

~/.local/lib/python3.8/site-packages/sentence_transformers/util.py in cos_sim(a, b)
     45         b = b.unsqueeze(0)
     46 
---> 47     a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
     48     b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
     49     return torch.mm(a_norm, b_norm.transpose(0, 1))

~/.local/lib/python3.8/site-packages/torch/nn/functional.py in normalize(input, p, dim, eps, out)
   4630         return handle_torch_function(normalize, (input, out), input, p=p, dim=dim, eps=eps, out=out)
   4631     if out is None:
-> 4632         denom = input.norm(p, dim, keepdim=True).clamp_min(eps).expand_as(input)
   4633         return input / denom
   4634     else:

~/.local/lib/python3.8/site-packages/torch/_tensor.py in norm(self, p, dim, keepdim, dtype)
    636                 Tensor.norm, (self,), self, p=p, dim=dim, keepdim=keepdim, dtype=dtype
    637             )
--> 638         return torch.norm(self, p, dim, keepdim, dtype=dtype)
    639 
    640     def solve(self, other):

~/.local/lib/python3.8/site-packages/torch/functional.py in norm(input, p, dim, keepdim, out, dtype)
   1527         if out is None:
   1528             if dtype is None:
-> 1529                 return _VF.norm(input, p, _dim, keepdim=keepdim)  # type: ignore[attr-defined]
   1530             else:
   1531                 return _VF.norm(input, p, _dim, keepdim=keepdim, dtype=dtype)  # type: ignore[attr-defined]

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Multilingual?

Does this model work for languages other than English?
If yes, could you please specify which ones?

Lbl2TransformerVec(Lbl2Vec).predict_model_docs() stalls / lack of GPU utilization

It appears that on larger label datasets (>1000 labels), Lbl2TransformerVec(Lbl2Vec).predict_model_docs() stalls at the "calculate document vector <-> label vector similarities" step, perhaps due to a memory issue. Tracing the issue, the cause may be the "utils.top_similar_vectors" function below, which converts the Torch tensors to NumPy arrays and is called from an apply function within predict_model_docs(). Would there be a way to refactor it so that the Torch tensors stay on the GPU and are converted to NumPy outside of this function, to improve performance?

The issue only seems to appear with label counts >1000.

utils.py

from typing import List

import numpy as np
import torch
from sentence_transformers import util


def top_similar_vectors(key_vector: np.array, candidate_vectors: List[np.array]) -> List[tuple]:
    '''
    Calculates the cosine similarities of a given key vector to a list of candidate vectors.

    Parameters
    ----------
    key_vector : `np.array`_
            The key embedding vector

    candidate_vectors : List[`np.array`_]
            A list of candidate embedding vectors

    Returns
    -------
    top_results : List[tuples]
         A descending sorted list of tuples of (cos_similarity, list_idx) by cosine similarities for each candidate vector in the list
    '''
    # Similarity of the key vector to every candidate, then a full descending sort via topk
    cos_scores = util.cos_sim(key_vector, np.asarray(candidate_vectors))[0]
    top_results = torch.topk(cos_scores, k=len(candidate_vectors))
    ## Return the tensors then convert to numpy

    ## Consider refactoring implementation to leave tensors in GPU instead of move to CPU at this point
    top_cos_scores = top_results[0].detach().cpu().numpy()
    top_indices = top_results[1].detach().cpu().numpy()

    return list(zip(top_cos_scores, top_indices))

Multiclass multilabel classification

Hi team,

I have a couple of questions about multiclass multilabel classification.

  1. Do I need to create a keyword list for each class?
  2. Does setting a threshold mean the model can be used for multilabel classification? That is, any class above the threshold is a match, so one document can have multiple labels?

Thanks,
Ling
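
The thresholding idea in question 2 can be sketched in a few lines. This is only an illustration of the idea, not Lbl2Vec's own API: it assumes you already have one similarity score per label for a document. The `assign_labels` helper, the `doc_scores` dictionary, and the threshold value are all hypothetical.

```python
from typing import Dict, List


def assign_labels(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return every label whose similarity score reaches the threshold.

    Hypothetical helper: `scores` maps label name -> similarity score
    for a single document.
    """
    matched = [label for label, score in scores.items() if score >= threshold]
    # Sort by descending score so the best match comes first.
    return sorted(matched, key=lambda label: scores[label], reverse=True)


# Made-up similarity scores for one document.
doc_scores = {"sports": 0.62, "politics": 0.55, "science": 0.18}
print(assign_labels(doc_scores, threshold=0.5))  # ['sports', 'politics']
```

With a high threshold the result may be empty, so a document can end up with zero, one, or several labels.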

Is it possible to use 2 words as keywords?

Is it possible to use keywords that are composed of 2 words each? For example, 'movie theater' would be a useful keyword if I wanted to find documents about movie theaters, but the individual words 'movie' and 'theater' would identify a different subset of documents than what I'm really after.
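
One common workaround, sketched here without any confirmation that Lbl2Vec supports it directly, is to merge each phrase into a single token before the documents are tokenized, and then use the merged token (for example 'movie_theater') as the keyword. The `merge_phrases` and `tokenize` helpers below are hypothetical. Whether the tokenizer in your own pipeline keeps underscores is an assumption you would need to check.

```python
import re
from typing import Iterable, List


def merge_phrases(text: str, phrases: Iterable[str]) -> str:
    """Replace each multi-word phrase with a single underscore-joined token."""
    # Handle longer phrases first so they are not split by shorter ones.
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.lower().split()
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        text = re.sub(pattern, "_".join(words), text, flags=re.IGNORECASE)
    return text


def tokenize(text: str) -> List[str]:
    """Lowercase and split into word tokens, keeping underscores intact."""
    return re.findall(r"[a-z0-9_]+", text.lower())


doc = "The new Movie Theater downtown shows a movie every night."
tokens = tokenize(merge_phrases(doc, ["movie theater"]))
print(tokens)
```

After merging, the phrase is one token, while the standalone word 'movie' later in the sentence stays a separate token.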

pip install doesn't work

Hello
I'm trying to install the package but I get an error.

pip install lbl2vec

Collecting lbl2vec
ERROR: Could not find a version that satisfies the requirement lbl2vec (from versions: none)
ERROR: No matching distribution found for lbl2vec

I searched a bit on Google and couldn't find a solution.

Python 3.7.4
pip 19.2.3

ValueError: cannot compute similarity with no input

Hi Team,

I am getting the following error while fitting the model:

2022-04-08 14:19:04,344 - Lbl2Vec - INFO - Train document and word embeddings
2022-04-08 14:19:09,992 - Lbl2Vec - INFO - Train label embeddings

ValueError Traceback (most recent call last)
in

~/SageMaker/lbl2vec/lbl2vec.py in fit(self)
248 # get doc keys and similarity scores of documents that are similar to
249 # the description keywords
--> 250 self.labels[['doc_keys', 'doc_similarity_scores']] = self.labels['description_keywords'].apply(lambda row: self._get_similar_documents(
251 self.doc2vec_model, row, num_docs=self.num_docs, similarity_threshold=self.similarity_threshold, min_num_docs=self.min_num_docs))
252

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
4211 else:
4212 values = self.astype(object)._values
-> 4213 mapped = lib.map_infer(values, f, convert=convert_dtype)
4214
4215 if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

~/SageMaker/lbl2vec/lbl2vec.py in (row)
249 # the description keywords
250 self.labels[['doc_keys', 'doc_similarity_scores']] = self.labels['description_keywords'].apply(lambda row: self._get_similar_documents(
--> 251 self.doc2vec_model, row, num_docs=self.num_docs, similarity_threshold=self.similarity_threshold, min_num_docs=self.min_num_docs))
252
253 # validate that documents to calculate label embeddings from are found

~/SageMaker/lbl2vec/lbl2vec.py in _get_similar_documents(self, doc2vec_model, keywords, num_docs, similarity_threshold, min_num_docs)
625 for word in cleaned_keywords_list]
626 similar_docs = doc2vec_model.dv.most_similar(
--> 627 positive=keywordword_vectors, topn=num_docs)
628 except KeyError as error:
629 error.args = (

~/anaconda3/envs/python3/lib/python3.6/site-packages/gensim/models/keyedvectors.py in most_similar(self, positive, negative, topn, clip_start, clip_end, restrict_vocab, indexer)
775 all_keys.add(self.get_index(key))
776 if not mean:
--> 777 raise ValueError("cannot compute similarity with no input")
778 mean = matutils.unitvec(array(mean).mean(axis=0)).astype(REAL)
779

ValueError: cannot compute similarity with no input
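
Judging from the traceback, gensim's most_similar received an empty `positive` list. One plausible cause, which is an assumption and not confirmed in this thread, is that none of a label's keywords are present in the trained vocabulary, so `cleaned_keywords_list` ends up empty. A quick pre-check along these lines may help locate the affected labels. The `labels_without_known_keywords` helper and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def labels_without_known_keywords(
    keywords_by_label: Dict[str, List[str]], vocabulary: Set[str]
) -> List[str]:
    """Return labels for which no keyword occurs in the vocabulary."""
    return [
        label
        for label, keywords in keywords_by_label.items()
        if not any(keyword in vocabulary for keyword in keywords)
    ]


# Made-up vocabulary and keywords for illustration only.
vocab = {"goal", "match", "election", "vote"}
keywords = {
    "sports": ["goal", "referee"],
    "politics": ["Election"],  # wrong case: not in the lowercase vocabulary
    "science": [],             # empty keyword list
}
print(labels_without_known_keywords(keywords, vocab))  # ['politics', 'science']
```

With gensim 4, the vocabulary of a trained Doc2Vec model should be available as the keys of its `wv.key_to_index` mapping. Keywords usually need the same preprocessing (for example lowercasing) as the training documents.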
