
Comments (6)

karussell commented on August 19, 2024

Do you have an example? Do you mean exact duplicates or near duplicates?

I thought I had reduced the number of duplicates with the spam filter ...

For reference only: duplicate filtering could be done at query time (result grouping) or at index time (the way our spam filter works).

from jetwick.

karussell commented on August 19, 2024

we need two new fields: duplicate_hash_s and duplicates_i

We check duplicate_hash_s before indexing and set duplicates_i accordingly. When querying, we can decrease the duplicate count filter for more aggressive duplicate removal.
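A minimal sketch of how the two fields could interact, assuming that duplicates_i holds the number of earlier documents with the same duplicate_hash_s. The `TinyIndex` class, its in-memory storage, and the whitespace-normalizing stand-in hash are hypothetical and only illustrate the idea; they are not jetwick's or Solr's actual code.

```python
import hashlib
from collections import defaultdict


class TinyIndex:
    """Hypothetical in-memory index that counts duplicates at index time."""

    def __init__(self):
        self.docs = []
        self._seen = defaultdict(int)  # duplicate_hash_s -> documents seen so far

    def add(self, text):
        # Stand-in hash (assumption): MD5 of the lowercased, whitespace-normalized text.
        normalized = " ".join(text.lower().split())
        h = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        # duplicates_i = how many earlier documents share this hash (assumed meaning).
        doc = {"text": text, "duplicate_hash_s": h, "duplicates_i": self._seen[h]}
        self._seen[h] += 1
        self.docs.append(doc)
        return doc

    def search(self, max_duplicates):
        # A lower max_duplicates removes duplicates more aggressively.
        return [d for d in self.docs if d["duplicates_i"] <= max_duplicates]
```

With this reading, `search(0)` keeps only the first document of each hash, while larger values let more repeats through.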

the hash can be calculated as stated in the TermCreateCommand:

using a technique from TextProfileSignature:

  1. Create a list of tokens and their frequencies, separated by spaces, in order of decreasing frequency (plus sorted, because I think the frequency is one for nearly all tokens).
  2. This list is then submitted to an MD5 hash calculation.
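The two steps above could be sketched roughly as follows. This is only an illustration of the described idea, not the real TextProfileSignature or TermCreateCommand code: the function name `duplicate_hash`, the `\w+` tokenizer, and the `token:frequency` layout of the list are assumptions.

```python
import hashlib
import re
from collections import Counter


def duplicate_hash(text: str) -> str:
    """Return an MD5 hex digest of the text's token-frequency profile."""
    # Assumed tokenizer: lowercase the text and split on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Step 1: decreasing frequency, then alphabetical so that ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    profile = " ".join(f"{token}:{freq}" for token, freq in ordered)
    # Step 2: submit the list to an MD5 hash calculation.
    return hashlib.md5(profile.encode("utf-8")).hexdigest()
```

Because the profile ignores token order, case, and punctuation, `duplicate_hash("Solr rocks solr")` and `duplicate_hash("solr SOLR rocks!")` give the same value, which could then be stored in duplicate_hash_s.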


pannous commented on August 19, 2024

Great idea.
When viewing a specific user's tweets, we could disable the duplicate filter.
(Otherwise we could just mark them grey, collapse them, or hide them.)


karussell commented on August 19, 2024

http://wiki.apache.org/solr/Deduplication


karussell commented on August 19, 2024

Just implemented and deployed this idea. Please test the "Duplicates without" link.

PS: tweets are not reindexed, so it will take a week until all tweets have this possibility.


karussell commented on August 19, 2024

OK, works reasonably well after adjusting the Jaccard index. Please file a new issue for bugs.
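For readers unfamiliar with the term: the Jaccard index of two token sets is the size of their intersection divided by the size of their union. A small sketch follows, where the tokenizer, the function names, and the 0.8 threshold are assumptions for illustration and not the values used in jetwick.

```python
import re


def jaccard_index(a: str, b: str) -> float:
    """Jaccard index of the token sets of two texts: |A & B| / |A | B|."""
    set_a = set(re.findall(r"\w+", a.lower()))
    set_b = set(re.findall(r"\w+", b.lower()))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical (assumption)
    return len(set_a & set_b) / len(set_a | set_b)


def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    # The threshold is a hypothetical tuning knob; a higher value is stricter.
    return jaccard_index(a, b) >= threshold
```

Raising the threshold flags fewer tweets as near duplicates, lowering it flags more, which is the kind of adjustment the comment refers to.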

