doc-similarity's People

Contributors

4oh4

doc-similarity's Issues

GloVe embeddings

Hi,

As you mentioned in the post "By default, a GloVe word embedding model is loaded (glove-wiki-gigaword-50), although a custom model can also be used."

Will the following commands fine-tune the GloVe embeddings on my own dataset? If not, what is the way to fine-tune embeddings on my own dataset?

from docsim import DocSim

docsim = DocSim(verbose=True)

similarities = docsim.similarity_query(query_string, documents)

Finally, I have around 500 queries for which I would like to retrieve similar documents from a corpus of 15K documents. For this purpose, I am using the following command:

similarities = docsim.similarity_query(query_string, documents)

But it is extremely slow. Is there any way to speed it up?
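
One common way to speed up repeated queries over a fixed corpus (a general sketch, not docsim's own API) is to embed every document once and cache the vectors, so each of the 500 queries costs only one embedding plus cheap dot products instead of re-processing all 15K documents. A minimal plain-Python illustration, with toy two-dimensional word vectors standing in for GloVe and all helper names hypothetical:

```python
from math import sqrt

# Toy stand-in word vectors; in practice these would come from GloVe.
word_vectors = {
    "audit": [1.0, 0.0], "control": [0.9, 0.1],
    "board": [0.2, 0.8], "game": [0.0, 1.0],
}

def doc_vector(tokens):
    """Average the vectors of known words in a tokenised document."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

documents = [["audit", "control"], ["board", "game"]]

# Embed the corpus ONCE, outside the query loop.
doc_vecs = [doc_vector(d) for d in documents]

# Each query is now one embedding plus len(corpus) dot products.
query_vec = doc_vector(["audit", "board"])
scores = [cosine(query_vec, dv) for dv in doc_vecs]
```

The key point is that the expensive per-document work is hoisted out of the query loop; only the query itself is embedded per call.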

Soft cosine similarity of 1 between a query and a document

I am calculating the similarity between a query, query2 = 'Audit and control, Board structure, Remuneration, Shareholder rights, Transparency and Performance', and a document (in my case, a company's annual report).

I am using GloVe vectors and calculating the soft cosine between them, but somehow I get a similarity score of 1 with two documents. How is that possible? I know for sure that the documents do not contain only these query words. Each document is a .txt file with cleaned text. If a document matched exactly these words, the similarity could be 1, but I know it does not.

25 1.000 2019_q4_en_eur_con_00.txt
14 1.000 2017_q3_en_eur_con_00.txt
16 0.994 2018_ar_en_eur_con_00.txt
21 0.989 2019_ar_en_eur_con_00.txt
28 0.986 2020_q2_en_eur_con_00.txt
1 0.963 2014_ar_en_eur_con_00.txt
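
A likely explanation (offered as a sketch, not a diagnosis of this exact run): soft cosine is x·S·y / (√(x·S·x)·√(y·S·y)), where S is the term-similarity matrix. When S is dense — as it tends to be when built from word embeddings, since most word pairs have some nonzero similarity — scores compress towards 1 for long documents even with no shared vocabulary, and a printed 1.000 may just be rounding. A toy plain-Python illustration with made-up similarity values, not the actual GloVe matrix:

```python
from math import sqrt

def soft_cosine(x, y, S):
    """Soft cosine: x.S.y / sqrt(x.S.x) / sqrt(y.S.y)."""
    def quad(a, b):
        return sum(a[i] * S[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))
    return quad(x, y) / (sqrt(quad(x, x)) * sqrt(quad(y, y)))

# Identity similarity matrix: soft cosine reduces to ordinary cosine.
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Dense similarity matrix: every pair of terms is considered related.
S = [[1.0, 0.9, 0.9], [0.9, 1.0, 0.9], [0.9, 0.9, 1.0]]

x = [1, 0, 0]   # query mentions only term 0
y = [0, 1, 1]   # document mentions only terms 1 and 2

print(soft_cosine(x, y, I))  # 0.0 -- no shared terms at all
print(soft_cosine(x, y, S))  # roughly 0.92 despite zero word overlap
```

So a near-1 score does not imply the document contains exactly the query words; it can simply mean the term-similarity matrix relates most of the document's vocabulary to the query's.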

Any fast way to retrieve similar documents with GloVe?

Hi,

I have 100 queries and need to extract the 10 most similar documents from a corpus of 15K documents using GloVe. When I use the following code snippet, it takes a lot of time to produce results for each query.

docsim_obj.similarity_query(query, corpus)

In this experiment, the corpus is the same for all the queries. So, is it possible to get the results faster? If yes, what changes would I need to make to the code?
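
Because the corpus is fixed across all 100 queries, the document vectors and their norms can be precomputed once; each query then reduces to a batch of dot products followed by a top-k sort. A rough sketch in plain Python, with hypothetical helper names and a toy three-document corpus of pre-embedded vectors (in practice these would be GloVe document embeddings):

```python
from math import sqrt

def normalise(v):
    """Scale a vector to unit length so cosine becomes a plain dot product."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

# Precompute ONCE: unit-length vectors for the whole (toy) corpus.
corpus_vecs = [normalise(v) for v in [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]]

def top_k(query_vec, k=2):
    """Score a query against the cached corpus and return the k best (score, index) pairs."""
    q = normalise(query_vec)
    scores = [(sum(a * b for a, b in zip(q, d)), i)
              for i, d in enumerate(corpus_vecs)]
    return sorted(scores, reverse=True)[:k]

print(top_k([1.0, 0.2]))  # best match is document 0
```

Normalising up front means no norms are recomputed per query, and the per-query cost is linear in the corpus size rather than repeating any index-building work.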

About Pre-trained Wikipedia embeddings

Hi,

Can you please confirm the specification of the default pre-trained GloVe embeddings?
Are they the Wikipedia 2014 + Gigaword 5 vectors with dimension 50?

GloVe method - lemmatization

Hi!
Questions:

  1. Why didn't you use lemmatization when preprocessing the documents? Is there a reason behind that?
  2. Why did you use this particular pre-trained GloVe model (these dimensions)?
  3. Can you validate the results somehow?

About fair comparison

Hi,

In the doc-similarity library, there are two implementations: TF-IDF and GloVe. However, regular cosine similarity is calculated for TF-IDF, whereas soft cosine similarity is measured for GloVe.

Given the difference in similarity metrics, is it possible to make a fair comparison between the TF-IDF and GloVe results?
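
One option, sketched below as a general illustration rather than anything in doc-similarity itself: report the same metric family for both pipelines. Soft cosine with the identity term-similarity matrix is exactly ordinary cosine, so soft cosine can in principle be applied to TF-IDF vectors too, putting both methods on the same scale. Toy plain-Python demonstration with made-up TF-IDF values:

```python
from math import sqrt

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def soft_cosine(x, y, S):
    def quad(a, b):
        return sum(a[i] * S[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))
    return quad(x, y) / (sqrt(quad(x, x)) * sqrt(quad(y, y)))

# Toy TF-IDF vectors over a 3-term vocabulary.
x = [0.5, 0.0, 0.3]
y = [0.1, 0.4, 0.2]

# With the identity term-similarity matrix, soft cosine IS plain cosine,
# so TF-IDF and GloVe pipelines can report the same metric family.
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
assert abs(cosine(x, y) - soft_cosine(x, y, I)) < 1e-12
```

This doesn't remove all differences between the representations, but it isolates the representation (TF-IDF vs. GloVe) from the metric, which is what a fair comparison needs.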

DocSim

Hello,

In your method 2b, can we use a BERT model to compute the similarity?
