NLPDedup

Remove duplicates and near-duplicates from text corpora, no matter the scale.



Developers:

Installation

The package is available on PyPI, so you can install it with your favourite package manager, for instance pip install nlp_dedup or poetry add nlp_dedup.

Quick Start

If the corpus is stored as corpus.txt (both txt and jsonl files are supported), the following deduplicates the corpus and stores the deduplicated corpus in the folder deduplicated:

$ dedup corpus.txt deduplicated

By default, documents are compared on blocks of 13 consecutive words, and two documents are considered near-duplicates if they have more than 80% of these blocks in common. Both settings can be changed to suit your needs. See $ dedup --help for more information on all the settings.
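To make the default criterion concrete, here is a minimal plain-Python sketch of the idea: split each document into blocks of 13 consecutive words and compare the two sets of blocks. It is an illustration only and not the package's implementation. The function names are hypothetical, and measuring "blocks in common" as the Jaccard overlap of the two sets is an assumption.

```python
from typing import Set, Tuple


def word_blocks(text: str, size: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of all blocks of `size` consecutive words in `text`."""
    words = text.split()
    if not words:
        return set()
    if len(words) < size:
        # Short documents yield a single block with all of their words.
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def block_overlap(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard overlap: shared blocks divided by all distinct blocks."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(
    doc_a: str, doc_b: str, size: int = 13, threshold: float = 0.8
) -> bool:
    """True if the two documents share more than `threshold` of their blocks."""
    overlap = block_overlap(word_blocks(doc_a, size), word_blocks(doc_b, size))
    return overlap > threshold
```

Comparing every pair of documents this way scales quadratically with corpus size, so this sketch is only meant to show what the two settings control, not how to deduplicate a large corpus.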

Deduplication can also be done directly from Python:

>>> from nlp_dedup import Deduper
>>> deduper = Deduper()
>>> corpus = ["Test", "Another test", "Test"]
>>> deduper.deduplicate(corpus=corpus)

Here corpus does not have to be a list. It can be any iterable or generator of strings, which is useful if the corpus is too big to be stored in memory. Dictionaries are also supported instead of strings, in which case the text entry in each dictionary will be used (change this with the text_column argument when calling deduplicate).
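As a sketch of the streaming case, the generator below reads a JSONL file one record at a time and hands the resulting dictionaries to deduplicate. The helper names iter_jsonl and deduplicate_jsonl, and the "body" key, are hypothetical and chosen for illustration. Only the corpus and text_column arguments come from the description above, and any other defaults (such as where the output is written) are left untouched.

```python
import json
from typing import Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict]:
    """Yield one dictionary per non-empty line, without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def deduplicate_jsonl(path: str, text_column: str = "body") -> None:
    """Stream records from `path` into the deduper (hypothetical wrapper)."""
    # Imported here so that iter_jsonl can be used without nlp_dedup installed.
    from nlp_dedup import Deduper

    deduper = Deduper()
    # text_column names the dictionary key that holds the document text.
    deduper.deduplicate(corpus=iter_jsonl(path), text_column=text_column)
```

Because iter_jsonl is a generator, only one record is held in memory at a time on the reading side.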

See more in the documentation.

Contributors

saattrupdan
