4oh4 / doc-similarity
Ranking documents using semantic similarity in Python
License: MIT License
Hi,
As you mentioned in the post "By default, a GloVe word embedding model is loaded (glove-wiki-gigaword-50), although a custom model can also be used."
Will the following commands fine-tune the GloVe embeddings on my own dataset? If not, what is the way to fine-tune embeddings on my own dataset?
from docsim import DocSim
docsim = DocSim(verbose=True)
similarities = docsim.similarity_query(query_string, documents)
Finally, I have around 500 queries for which I would like to retrieve similar documents from a corpus of 15K documents. For this purpose, I am using the following command:
similarities = docsim.similarity_query(query_string, documents)
But it is extremely slow. Is there any way to speed it up?
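The library does not appear to expose a cached index, but most of the per-query cost of soft cosine can be hoisted out of the loop. A minimal numpy sketch (all names hypothetical, not the library's API), assuming documents as rows of a bag-of-words matrix `D` and a symmetric term-similarity matrix `S`:

```python
import numpy as np

def build_index(D, S):
    # Precompute everything that every query reuses:
    # S-weighted document rows, and the self-norms sqrt(d^T S d).
    SD = D @ S                                     # (n_docs, vocab)
    norms = np.sqrt(np.einsum("ij,ij->i", SD, D))  # per-document sqrt(d^T S d)
    return SD, norms

def query_index(q, SD, norms, S):
    # Per query, only one matrix-vector product over the corpus remains.
    qn = np.sqrt(q @ S @ q)
    return (SD @ q) / (norms * qn)
```

With `S` set to the identity this reduces to plain cosine similarity; the point is that `SD` and `norms` are computed once for all 500 queries instead of once per query.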
I am not able to import WordEmbeddingSimilarityIndex; I get "cannot import name 'WordEmbeddingSimilarityIndex'" in a Colab notebook. Which version of gensim are you using?
I am calculating the similarity between a query: query2 = 'Audit and control, Board structure, Remuneration, Shareholder rights, Transparency and Performance'
and a document (in my case, a company's annual report).
I am using GloVe vectors and calculating the soft cosine between them, but somehow I get a similarity score of 1 for two of the documents. How is that possible? I know for certain that the documents do not contain only these query words. Each document is a .txt file with cleaned text. If a document matched exactly these words, a similarity of 1 would make sense, but I know it does not.
25 1.000 2019_q4_en_eur_con_00.txt
14 1.000 2017_q3_en_eur_con_00.txt
16 0.994 2018_ar_en_eur_con_00.txt
21 0.989 2019_ar_en_eur_con_00.txt
28 0.986 2020_q2_en_eur_con_00.txt
1 0.963 2014_ar_en_eur_con_00.txt
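For intuition on how two lexically different documents can still score exactly 1: soft cosine is a^T S b / (sqrt(a^T S a) · sqrt(b^T S b)), so if the term-similarity matrix S treats two words as effectively identical, documents built from them become indistinguishable. A toy numpy sketch with a hypothetical two-word vocabulary:

```python
import numpy as np

def soft_cosine(a, b, S):
    # Soft cosine: a^T S b / (sqrt(a^T S a) * sqrt(b^T S b))
    return (a @ S @ b) / np.sqrt((a @ S @ a) * (b @ S @ b))

# Hypothetical vocabulary ["car", "automobile"]; S treats the words as identical.
S = np.array([[1.0, 1.0],
              [1.0, 1.0]])
a = np.array([1.0, 0.0])   # document containing only "car"
b = np.array([0.0, 1.0])   # document containing only "automobile"
print(soft_cosine(a, b, S))  # 1.0, despite the documents sharing no words
```

In practice, long documents averaged over dense GloVe vectors also tend to drift toward a common direction, so scores at or near 1 can occur even when S is not degenerate.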
Hi,
I have 100 queries, and for each one I have to retrieve the 10 most similar documents from a corpus of 15K documents using GloVe. The following call takes a long time for each query:
docsim_obj.similarity_query(query, corpus)
In this experiment, the corpus is the same for all queries. Is it possible to get the results faster? If so, what changes do I need to make to the code?
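One general pattern when the corpus is fixed: instead of scoring one query at a time, stack all query vectors into a matrix and score the whole corpus with a single matrix product. A hedged numpy sketch (the array names and the small random stand-in data are illustrative, not the library's API), assuming documents and queries are already embedded, e.g. as mean GloVe vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(15, 4))   # stand-in for the 15K-document corpus embeddings
Q = rng.normal(size=(3, 4))    # stand-in for the 100 query embeddings

# L2-normalise rows so the dot product is cosine similarity.
Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)

sims = Qn @ Dn.T                           # (n_queries, n_docs) similarities
top10 = np.argsort(-sims, axis=1)[:, :10]  # indices of the 10 best docs per query
```

This replaces 100 corpus scans with one BLAS-backed matrix multiplication, which is usually the single biggest speedup available here.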
Hi,
Can you please confirm the specification of the default pre-trained GloVe embeddings?
Are they the Wikipedia 2014 + Gigaword 5 vectors with dimension 50?
Hi,
In the doc-similarity library, there are two implementations: TF-IDF and GloVe. However, regular cosine similarity is computed for TF-IDF, whereas soft cosine similarity is used for GloVe.
Given this difference in similarity metric, is a fair comparison between the TF-IDF and GloVe results possible?
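One way to hold the metric fixed while varying only the representation (an assumption about what "fair" means here, not something the library provides) is to note that soft cosine with an identity term-similarity matrix reduces exactly to ordinary cosine. A minimal numpy check:

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def soft_cosine(a, b, S):
    return (a @ S @ b) / np.sqrt((a @ S @ a) * (b @ S @ b))

a = np.array([1.0, 2.0, 0.0])
b = np.array([0.0, 1.0, 1.0])
# With S = I, every cross-term vanishes and soft cosine equals plain cosine.
print(np.isclose(soft_cosine(a, b, np.eye(3)), cosine(a, b)))  # True
```

So the two methods differ only in the off-diagonal entries of S; comparing TF-IDF under soft cosine with S = I against GloVe under soft cosine with a learned S is one defensible way to make the comparison apples-to-apples.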
Hello,
Inside your method 2b, can we use a BERT model to compute the similarity?