Giter Club home page Giter Club logo

shash's Introduction

Detection of near-duplicate documents

A sample implementation of Charikar's hash for identification of similar documents.

https://github.com/vilda/shash

How to

The calculation of the hash is performend in the following way:

  • Document is splitted into tokens (words for example) or super-tokens (word tuples)
  • Each token is represented by its hash value; a traditonal hash function is used
  • Weights are associated with tokens
  • A vector V of integers is initialized to 0, length of the vector corresponds to the desired hash size in bits
  • In a cycle for all token's hash values (h), vector V is updated:
    • ith element is decreased by token's weight if the ith bit of the hash h is 0, otherwise
    • ith element is increased by token's weight if the ith bit of the hash h is 1
  • Finally, signs of elements of V corresponds to the bits of the final fingerprint

References

  • Charikar: Similarity Estimation Techniques from Rounding Algorithms, in Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, ACM Press, 2002
  • Manku, Jain, Sarma: Detecting Near-Duplicates for Web Crawling. in Proceedings of the 16th international conference on World Wide Web, ACM Press, 2007
  • Henzinger: Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms, in Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, 2006

Viliam Holub [email protected]

shash's People

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.