A collection of data deduplication scripts.
Ready-to-use or easy-to-modify scripts for each deduplication method:
- MinHash + MinHashLSH, including a Spark implementation suitable for large-scale (>100M) datasets
- SimHash (64- and 128-bit)
- SuffixArray Substring
- Bloom Filter
- Exact Hash
These scripts build on the following projects and papers:
- Datasketch (MIT)
- simhash-py and simhash-cpp (MIT)
- Deduplicating Training Data Makes Language Models Better (Apache 2.0)
- BigScience (Apache 2.0)
- BigCode (Apache 2.0)
- Gaoya (MIT)
MODIFY `spark.py` FOR YOUR OWN PROJECT AND DATASET FIRST!
```bash
export CLUSTER_NAME=chenghao-temp
export PROJECT_ID=xx

gcloud dataproc clusters create $CLUSTER_NAME \
    --enable-component-gateway \
    --region us-central1 \
    --zone us-central1-a \
    --master-machine-type c2d-standard-16 \
    --master-boot-disk-size 500 \
    --num-workers 10 \
    --worker-machine-type c2d-standard-16 \
    --worker-boot-disk-size 500 \
    --image-version 2.0-debian10 \
    --project $PROJECT_ID
```
```bash
gcloud dataproc jobs submit pyspark --cluster ${CLUSTER_NAME} \
    --region us-central1 \
    --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    --driver-log-levels root=WARN \
    --properties="spark.executor.memory"="50g","spark.driver.memory"="8g","spark.executor.cores"="14" \
    spark.py
```
For reference, the script finished deduplicating 42 million rows in less than 40 minutes with the above settings (160 cores, 640 GB of memory in total), while the Python version would take around 10 hours on an 80-core machine with 1.8 TB of memory.
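Under the hood, the Spark version relies on the classic MinHash + LSH banding trick: documents that agree on any band of their MinHash signatures become duplicate candidates and are grouped together. Below is a minimal, self-contained sketch of that idea; the constants, hash scheme, toy corpus, and function names are illustrative and do not mirror the actual `spark.py`.

```python
import hashlib
import random
import re

from pyspark.sql import SparkSession

NUM_PERM, BANDS = 64, 16              # 16 bands x 4 rows per band
ROWS = NUM_PERM // BANDS
PRIME = (1 << 61) - 1
rng = random.Random(42)
AB = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(NUM_PERM)]

def stable_hash(token: str) -> int:
    # Deterministic 32-bit token hash (Python's built-in hash() is salted
    # per process, which would break consistency across Spark workers).
    return int.from_bytes(hashlib.md5(token.encode()).digest()[:4], "little")

def min_hashes(text: str) -> list[int]:
    # Universal-hash MinHash signature over lowercase word tokens.
    tokens = {stable_hash(t) for t in re.findall(r"\w+", text.lower())} or {0}
    return [min((a * t + b) % PRIME for t in tokens) for a, b in AB]

def band_keys(record):
    # Documents sharing any (band index, band slice) key become candidates.
    doc_id, text = record
    sig = min_hashes(text)
    for b in range(BANDS):
        yield ((b, tuple(sig[b * ROWS:(b + 1) * ROWS])), doc_id)

spark = SparkSession.builder.appName("lsh-sketch").getOrCreate()
docs = spark.sparkContext.parallelize([
    (0, "the quick brown fox jumps over the lazy dog"),
    (1, "the quick brown fox jumped over the lazy dog"),
    (2, "completely unrelated text about spark clusters"),
])
candidates = (
    docs.flatMap(band_keys)
        .groupByKey()
        .map(lambda kv: tuple(sorted(kv[1])))
        .filter(lambda ids: len(ids) > 1)
        .distinct()
)
print(candidates.collect())  # likely [(0, 1)]
```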
In the following section, we deduplicate one dataset: the `gl` subset of `oscar-corpus/OSCAR-2201`.
```text
# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets"

# output
INFO Loading                       : 2.75 seconds
INFO Preprocessing                 : 4.78 seconds
INFO SuffixArray                   : 98.29 seconds
INFO SelfSimilar                   : 4.24 seconds
INFO Restore                       : 0.25 seconds
INFO Deduplicate                   : 6.23 seconds
INFO Saving                        : 8.91 seconds
INFO Total                         : 125.45 seconds
INFO Before                        : 180332342 bytes (88803)
INFO After                         : 97646271 bytes (40404)
```
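The suffix-array step delegates the heavy lifting to Google's deduplicate-text-datasets code (hence `--google_repo_path`, pointing at a local clone of the repo behind the "Deduplicating Training Data Makes Language Models Better" paper acknowledged above). For intuition only, here is a naive, quadratic-time Python sketch of how a suffix array exposes repeated substrings; it is far too slow for real corpora, and every name in it is illustrative.

```python
def duplicate_substrings(text: str, min_len: int = 10) -> set[str]:
    # Suffix array: indices of all suffixes, sorted lexicographically.
    # (Naive O(n^2 log n) construction; real tools build this in linear time.)
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    dups = set()
    for prev, cur in zip(sa, sa[1:]):
        # Longest common prefix of two lexicographically adjacent suffixes;
        # any repeated substring shows up as a long LCP somewhere in the array.
        lcp = 0
        while (prev + lcp < len(text) and cur + lcp < len(text)
               and text[prev + lcp] == text[cur + lcp]):
            lcp += 1
        if lcp >= min_len:
            dups.add(text[cur:cur + lcp])
    return dups

doc = "hello world, hello world, something else entirely"
# Prints repeats such as 'hello world, ' plus overlapping shorter repeats.
print(duplicate_substrings(doc, min_len=8))
```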
```text
# input
python -m text_dedup.minhash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/minhash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 10000

# output
INFO Loading                       : 2.62 seconds
INFO MinHashing                    : 0.08 seconds
INFO Clustering                    : 2.20 seconds
INFO Filtering                     : 0.53 seconds
INFO Saving                        : 9.86 seconds
INFO Total                         : 15.29 seconds
INFO Data Number (before)          : 88803
INFO Data Number (after)           : 44124 (49.69%)
INFO Duplicate Number              : 44679 (50.31%)
INFO 🤗 Happy Deduplicating 🤗
```
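The MinHash method is where the datasketch acknowledgement above comes in. As a standalone illustration of that library's `MinHash` + `MinHashLSH` API (the toy corpus and the 0.7 threshold are mine, chosen only for the example):

```python
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    # One MinHash signature per document, built from word tokens.
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "an entirely different document about bloom filters",
}
sigs = {key: signature(text) for key, text in docs.items()}

# Jaccard threshold above which two signatures should share an LSH bucket.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
for key, m in sigs.items():
    lsh.insert(key, m)

print(lsh.query(sigs["a"]))  # likely ['a', 'b'] (order may vary)
```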
```text
# input
python -m text_dedup.simhash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/simhash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 10000

# output
INFO Loading                       : 2.60 seconds
INFO SimHashing                    : 0.04 seconds
INFO Indexing                      : 28.88 seconds
INFO Filtering                     : 0.88 seconds
INFO Saving                        : 10.41 seconds
INFO Total                         : 42.80 seconds
INFO Data Number (before)          : 88803
INFO Data Number (after)           : 46163 (51.98%)
INFO Duplicate Number              : 42640 (48.02%)
INFO 🤗 Happy Deduplicating 🤗
```
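SimHash compresses each document into a short fingerprint (64 or 128 bits, per the method list above) whose Hamming distance tracks document similarity, which is why indexing for fast neighbor lookup dominates the timing above. The sketch below shows only the core 64-bit fingerprint; the word tokenization and MD5-based token hash are assumptions for illustration, not the script's actual feature extraction.

```python
import hashlib
import re

def simhash64(text: str) -> int:
    # Each token casts a +1/-1 vote per bit position of its 64-bit hash;
    # the sign of each position's total becomes that bit of the fingerprint.
    votes = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(64):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

a = simhash64("the quick brown fox jumps over the lazy dog")
b = simhash64("the quick brown fox jumped over the lazy dog")
c = simhash64("a completely different sentence about suffix arrays")
# The near-duplicate pair typically shows the much smaller distance.
print(hamming(a, b), hamming(a, c))
```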
```text
# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 1000

# output
INFO Loading                       : 2.95s
INFO Processing                    : 3.79s
INFO Filtering                     : 0.10s
INFO Saving                        : 2.89s
INFO Total                         : 9.72s
INFO Before                        : 88803
INFO After                         : 47049
```
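Exact hashing is the simplest method: hash each document's text and keep only the first occurrence of each digest. A minimal sketch (MD5 here is an illustrative choice, not necessarily the digest the script uses):

```python
import hashlib

def dedup_exact(texts: list[str]) -> list[str]:
    # Keep the first occurrence of every distinct (hashed) text; storing
    # fixed-size digests instead of full texts keeps memory bounded.
    seen, kept = set(), []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

print(dedup_exact(["same text", "same text", "different text"]))
# ['same text', 'different text']
```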
```text
# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text" \
    --batch_size 1000

# output
INFO Loading                       : 2.72s
INFO Processing                    : 4.84s
INFO Filtering                     : 0.10s
INFO Saving                        : 2.88s
INFO Total                         : 10.54s
INFO Before                        : 88803
INFO After                         : 47045
```
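The Bloom-filter variant trades exactness for constant memory: membership tests can return false positives at a configurable rate (the `--error_rate` flag above), so a tiny fraction of unique documents may be dropped as duplicates, which is why its "After" count is slightly below exact hashing's. Below is a hand-rolled, textbook sketch of the idea; the script itself presumably uses an existing Bloom-filter library, and every name here is illustrative.

```python
import hashlib
import math

class BloomFilter:
    # Textbook Bloom filter sized from expected item count n and error rate p.
    def __init__(self, n: int, p: float):
        self.m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # total bits
        self.k = max(1, round(self.m / n * math.log(2)))         # hash count
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one digest via double hashing.
        d = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))

bf = BloomFilter(n=100_000, p=1e-5)
kept = []
for text in ["same text", "same text", "different text"]:
    if text not in bf:  # may very rarely be a false positive
        bf.add(text)
        kept.append(text)
print(kept)  # ['same text', 'different text']
```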
A benchmark of the different methods can be found in `benchmarks/wiki40.ipynb`. A notebook evaluating MinHash on `pinecone/core-2020-05-10-deduplication` can be found in `benchmarks/pinecone.ipynb`.
For quick reference, here are the results:
| Method | Precision | Recall | F1 | Time |
|---|---|---|---|---|
| MinHash | 0.9464 | 0.9446 | 0.9455 | 24s |
| SimHash* | 0.9011 | 0.6959 | 0.7853 | 210s |
| SimHash (Gyawali et al., LREC 2020) | 0.697 | 0.247 | 0.3647 | - |
| Exact Title (my implementation) | 0.8302 | 0.5521 | 0.6632 | - |
| Exact Title (Gyawali et al., LREC 2020) | 0.830 | 0.50 | 0.624 | - |

\*Best SimHash result from `benchmarks/hyperparameter.ipynb`.
- TODO
  - Memory benchmark for streaming processing
  - Inter-dataset deduplication
  - Rewrite the suffix array step in Python
  - A collection of deduplication methods used in papers/datasets/projects: SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, Optimal Densification for Fast and Accurate Minwise Hashing, Fast Similarity Sketching
Early versions of the code used an object-oriented design for hashing and indexing, which proved difficult because the different methods share little to no abstraction. Composing them into something useful required a lot of wrapper code, which actually increased the overhead of using this library. Additionally, deduplication is often a one-time step in a data preprocessing pipeline, so there isn't really a need for inline access.
Because the Google repo is licensed under Apache 2.0, I have to switch from MIT. Until that part of the code is completely re-implemented, Apache 2.0 will be the license I use.