doc-similarity's People

Contributors

4oh4

doc-similarity's Issues

GloVe embeddings

Hi,

As you mentioned in the post "By default, a GloVe word embedding model is loaded (glove-wiki-gigaword-50), although a custom model can also be used."

Will the following commands fine-tune the GloVe embeddings on my own dataset? If not, what is the way to fine-tune embeddings on my own dataset?

from docsim import DocSim

docsim = DocSim(verbose=True)

similarities = docsim.similarity_query(query_string, documents)

Finally, I have around 500 queries for which I would like to retrieve similar documents from a corpus of 15K documents. For this purpose, I am using the following command:

similarities = docsim.similarity_query(query_string, documents)

But it is extremely slow. Is there any way to speed it up?
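
One common way to speed up repeated queries over a fixed corpus (a general sketch, not docsim's own API) is to embed every document once and cache the vectors, so each of the 500 queries costs only one embedding plus cheap dot products instead of re-processing all 15K documents. A minimal plain-Python illustration, with toy two-dimensional word vectors standing in for GloVe and all helper names hypothetical:

```python
from math import sqrt

# Toy stand-in word vectors; in practice these would come from GloVe.
word_vectors = {
    "audit": [1.0, 0.0], "control": [0.9, 0.1],
    "board": [0.2, 0.8], "game": [0.0, 1.0],
}

def doc_vector(tokens):
    """Average the vectors of known words in a tokenised document."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

documents = [["audit", "control"], ["board", "game"]]

# Embed the corpus ONCE, outside the query loop.
doc_vecs = [doc_vector(d) for d in documents]

# Each query is now one embedding plus len(corpus) dot products.
query_vec = doc_vector(["audit", "board"])
scores = [cosine(query_vec, dv) for dv in doc_vecs]
```

The key point is that the expensive per-document work is hoisted out of the query loop; only the query itself is embedded per call.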

Soft cosine similarity of 1 between a query and a document

I am calculating the similarity between a query, query2 = 'Audit and control, Board structure, Remuneration, Shareholder rights, Transparency and Performance', and a document (in my case, a company's annual report).

I am using GloVe vectors and calculating the soft cosine between them, but somehow I get a similarity score of 1 with two documents. How is that possible? I know for sure that the documents do not contain only these query words. Each document is a .txt file with cleaned text. If a document matched exactly these words, the similarity could be 1, but I know it does not.

25 1.000 2019_q4_en_eur_con_00.txt
14 1.000 2017_q3_en_eur_con_00.txt
16 0.994 2018_ar_en_eur_con_00.txt
21 0.989 2019_ar_en_eur_con_00.txt
28 0.986 2020_q2_en_eur_con_00.txt
1 0.963 2014_ar_en_eur_con_00.txt
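
A likely explanation (offered as a sketch, not a diagnosis of this exact run): soft cosine is x·S·y / (√(x·S·x)·√(y·S·y)), where S is the term-similarity matrix. When S is dense — as it tends to be when built from word embeddings, since most word pairs have some nonzero similarity — scores compress towards 1 for long documents even with no shared vocabulary, and a printed 1.000 may just be rounding. A toy plain-Python illustration with made-up similarity values, not the actual GloVe matrix:

```python
from math import sqrt

def soft_cosine(x, y, S):
    """Soft cosine: x.S.y / sqrt(x.S.x) / sqrt(y.S.y)."""
    def quad(a, b):
        return sum(a[i] * S[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))
    return quad(x, y) / (sqrt(quad(x, x)) * sqrt(quad(y, y)))

# Identity similarity matrix: soft cosine reduces to ordinary cosine.
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Dense similarity matrix: every pair of terms is considered related.
S = [[1.0, 0.9, 0.9], [0.9, 1.0, 0.9], [0.9, 0.9, 1.0]]

x = [1, 0, 0]   # query mentions only term 0
y = [0, 1, 1]   # document mentions only terms 1 and 2

print(soft_cosine(x, y, I))  # 0.0 -- no shared terms at all
print(soft_cosine(x, y, S))  # roughly 0.92 despite zero word overlap
```

So a near-1 score does not imply the document contains exactly the query words; it can simply mean the term-similarity matrix relates most of the document's vocabulary to the query's.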

Any fast way to retrieve similar documents with GloVe?

Hi,

I have 100 queries and need to extract the 10 most similar documents from a corpus of 15K documents using GloVe. When I use the following code snippet, it takes a lot of time to produce results for each query.

docsim_obj.similarity_query(query, corpus)

In this experiment, the corpus is the same for all the queries. So, is it possible to get the results faster? If yes, what changes would I need to make to the code?
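
Because the corpus is fixed across all 100 queries, the document vectors and their norms can be precomputed once; each query then reduces to a batch of dot products followed by a top-k sort. A rough sketch in plain Python, with hypothetical helper names and a toy three-document corpus of pre-embedded vectors (in practice these would be GloVe document embeddings):

```python
from math import sqrt

def normalise(v):
    """Scale a vector to unit length so cosine becomes a plain dot product."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

# Precompute ONCE: unit-length vectors for the whole (toy) corpus.
corpus_vecs = [normalise(v) for v in [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]]

def top_k(query_vec, k=2):
    """Score a query against the cached corpus and return the k best (score, index) pairs."""
    q = normalise(query_vec)
    scores = [(sum(a * b for a, b in zip(q, d)), i)
              for i, d in enumerate(corpus_vecs)]
    return sorted(scores, reverse=True)[:k]

print(top_k([1.0, 0.2]))  # best match is document 0
```

Normalising up front means no norms are recomputed per query, and the per-query cost is linear in the corpus size rather than repeating any index-building work.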

About Pre-trained Wikipedia embeddings

Hi,

Can you please confirm the specification of the default pre-trained GloVe embeddings?
Are they the Wikipedia 2014 + Gigaword 5 vectors with dimension 50?

GloVe method - lemmatization

Hi!
Questions:

  1. Why didn't you use lemmatization when preprocessing the documents? Is there a reason behind that?
  2. Why did you use this particular pre-trained GloVe model (these dimensions)?
  3. Can you validate the results somehow?

About fair comparison

Hi,

In the doc-similarity library, there are two implementations: TF-IDF and GloVe. However, regular cosine similarity is calculated for TF-IDF, whereas soft cosine similarity is measured for GloVe.

Given the difference in similarity metrics, is it possible to make a fair comparison between the TF-IDF and GloVe results?
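
One option, sketched below as a general illustration rather than anything in doc-similarity itself: report the same metric family for both pipelines. Soft cosine with the identity term-similarity matrix is exactly ordinary cosine, so soft cosine can in principle be applied to TF-IDF vectors too, putting both methods on the same scale. Toy plain-Python demonstration with made-up TF-IDF values:

```python
from math import sqrt

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def soft_cosine(x, y, S):
    def quad(a, b):
        return sum(a[i] * S[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))
    return quad(x, y) / (sqrt(quad(x, x)) * sqrt(quad(y, y)))

# Toy TF-IDF vectors over a 3-term vocabulary.
x = [0.5, 0.0, 0.3]
y = [0.1, 0.4, 0.2]

# With the identity term-similarity matrix, soft cosine IS plain cosine,
# so TF-IDF and GloVe pipelines can report the same metric family.
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
assert abs(cosine(x, y) - soft_cosine(x, y, I)) < 1e-12
```

This doesn't remove all differences between the representations, but it isolates the representation (TF-IDF vs. GloVe) from the metric, which is what a fair comparison needs.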

DocSim

Hello,

In your method 2b, can we use a BERT model to compute the similarity?
