naf-dpm's Introduction

NAF-DPM: Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement
Giordano Cicchetti, Danilo Comminiello

This is the official repository for the paper NAF-DPM: A Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement. NAF-DPM is a novel generative framework based on DPMs that solves document enhancement tasks by treating them as a conditional image-to-image translation problem. It can be used for tasks such as document deblurring, denoising and binarization. The paper is currently under review at IEEE Transactions on Pattern Analysis and Machine Intelligence.


Highlights

Abstract: Real-world documents often suffer from various forms of degradation, which can lead to lower accuracy in optical character recognition (OCR) systems. Therefore, a crucial preprocessing step is essential to eliminate noise while preserving text and key features of the documents. In this paper, we propose a novel generative framework based on a diffusion probabilistic model (DPM) designed to restore the original quality of degraded documents. While DPMs are recognized for their high-quality generated images, they are also known for their large inference time. To mitigate this problem, we address both the architecture design and sampling strategy. To this end, we provide the DPM with an efficient nonlinear activation-free (NAF) network, which also proves to be very effective in image restoration tasks. For the sampling strategy, we employ a fast solver of ordinary differential equations, which is able to converge in 20 iterations at most. The combination of a small number of parameters (only 9.4M), an efficient network and a fast sampling strategy allows us to achieve competitive inference time with respect to existing methods. To better preserve text characters, we introduce an additional differentiable module based on convolutional recurrent neural networks (CRNN) simulating the behavior of an OCR system during the training phase. This module is used to compute an additional loss function that better guides the diffusion model to restore the original quality of the text in document images. Experiments conducted on diverse datasets showcase the superiority of our approach, achieving state-of-the-art performance in terms of both pixel-level metrics (PSNR and SSIM) and perceptual similarity metrics (LPIPS and DISTS). Furthermore, the results demonstrate a notable error reduction by OCR systems when transcribing real-world document images enhanced by our framework.


Results

NAF-DPM in comparison with existing methods

Results reported below show the performance of different methods on the Blurry Document Images OCR Text Dataset (https://www.fit.vutbr.cz/~ihradis/CNN-Deblur/) for the PSNR, SSIM, LPIPS, DISTS and CER metrics.

Name            PSNR    SSIM   LPIPS   DISTS   CER
Hradis          30.629  0.987  0.0135  0.0258  5.44
DE-GAN          28.803  0.985  0.0144  0.0237  6.87
DocDiff         29.787  0.989  0.0094  0.0339  2.78
NAF-DPM (ours)  34.377  0.994  0.0046  0.0228  1.55
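
For readers unfamiliar with the metrics in the table, here is a minimal sketch of how PSNR and CER are commonly computed. It is not the evaluation code of this repository: the function names `psnr`, `edit_distance` and `cer` are illustrative, images are assumed to be flat sequences of pixel values in [0, 255], and CER is assumed to be the Levenshtein distance divided by the reference length, expressed as a percentage.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel sequences."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(reference_text, ocr_text):
    """Character error rate in percent: edit distance over reference length."""
    if not reference_text:
        raise ValueError("reference text must be non-empty")
    return 100.0 * edit_distance(reference_text, ocr_text) / len(reference_text)
```

Higher PSNR means the restored image is closer to the ground truth, while lower CER means the OCR transcription of the restored image contains fewer character errors.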

Results reported below show the performance of different methods on DIBCO 2019 for the PSNR, F-Measure and Pf-Measure (pseudo F-Measure) metrics.

Name             PSNR   F-Measure  Pf-Measure
Otsu             9.08   47.83      45.59
Sauvola          13.72  51.73      55.15
Competition_Top  14.48  72.88      72.15
DE-GAN           12.29  55.98      53.44
D^2BFormer       15.05  67.63      66.69
DocDiff          15.14  73.38      75.12
NAF-DPM (ours)   15.39  74.61      76.25
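
As a reminder of what the F-Measure column measures, the sketch below computes it as the harmonic mean of precision and recall over text pixels. It is an illustration rather than the DIBCO evaluation tool: the function name `f_measure` is hypothetical, and the masks are assumed to be flat sequences in which 1 marks a text pixel and 0 marks background. The pseudo F-Measure additionally weights pixels and is not reproduced here.

```python
def f_measure(ground_truth, prediction):
    """F-Measure (in percent) for binary masks where 1 marks a text pixel."""
    if len(ground_truth) != len(prediction):
        raise ValueError("masks must have the same number of pixels")
    pairs = list(zip(ground_truth, prediction))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # text found
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # background marked as text
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # text missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)
```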

Installation

This codebase is tested on Ubuntu 22.04 LTS with Python 3.12.1. Follow the steps below to create the environment and install the dependencies.

  • Setup conda environment (recommended).
# Create a conda environment
conda create -y -n NAFDPM python=3.12.1

# Activate the environment
conda activate NAFDPM

# Install torch (requires version >= 1.8.1) and torchvision
# Please refer to https://pytorch.org/ if you need a different cuda version
pip3 install torch torchvision torchaudio
  • Clone NAFDPM code repository and install requirements
# Clone NAFDPM code base
git clone https://github.com/Giordano-Cicchetti/Diffusion-Document-Enhancement.git

cd Diffusion-Document-Enhancement/
# Install requirements

pip install -r requirements.txt
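
After installation, it can be useful to confirm that the installed torch version meets the `>= 1.8.1` requirement mentioned above. The following is a minimal sketch using only the Python standard library; the helper names (`parse_version`, `torch_is_recent_enough`, `installed_torch_version`) are hypothetical and not part of this repository.

```python
import re
from importlib import metadata

MIN_TORCH = (1, 8, 1)  # minimum version stated in the installation notes


def parse_version(text):
    """Turn '2.2.1+cu121' into (2, 2, 1); local/build suffixes are ignored."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def torch_is_recent_enough(version_text, minimum=MIN_TORCH):
    """Compare a version string against the minimum, component by component."""
    return parse_version(version_text) >= minimum


def installed_torch_version():
    """Return the installed torch version string, or None if torch is missing."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_torch_version()
    if version is None:
        print("torch is not installed in this environment")
    elif torch_is_recent_enough(version):
        print(f"torch {version} satisfies >= 1.8.1")
    else:
        print(f"torch {version} is older than 1.8.1")
```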

Data Preparation

BMVC Blurry Document Images Text Dataset

  • Create folder dataset_deblurring
  • Download the training data from the official website and extract it into the created folder.
  • Use the script utils/prepare_deblurring_dataset.py to split the data into training and validation sets. If necessary, change the path variables inside utils/prepare_deblurring_dataset.py.

The directory structure should look like this:

dataset_deblurring/
   |–– train_origin/ # Contains 30000 original images
   |–– train_blur/ # Contains 30000 blurry images
   |–– test_origin/ # Contains 10000 original images
   |–– test_blur/ # Contains 10000 blurry images
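
The layout above can be checked before training with a short script. This is a sketch under assumptions: the function name `check_deblurring_layout` is hypothetical, every regular file in a folder is counted as an image, and the only consistency rule enforced is that each `*_origin` folder holds as many files as its `*_blur` counterpart.

```python
from pathlib import Path

EXPECTED_FOLDERS = ("train_origin", "train_blur", "test_origin", "test_blur")


def check_deblurring_layout(root):
    """Return {folder: file_count}; raise if a folder is missing or a pair differs in size."""
    root = Path(root)
    counts = {}
    for name in EXPECTED_FOLDERS:
        folder = root / name
        if not folder.is_dir():
            raise FileNotFoundError(f"missing folder: {folder}")
        counts[name] = sum(1 for entry in folder.iterdir() if entry.is_file())
    for split in ("train", "test"):
        if counts[f"{split}_origin"] != counts[f"{split}_blur"]:
            raise ValueError(f"{split}: origin and blur folders differ in size")
    return counts
```

Calling `check_deblurring_layout("dataset_deblurring")` on a correctly prepared dataset should report 30000 files in each training folder and 10000 in each test folder.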

DIBCO

  • We gathered the DIBCO, H-DIBCO, Bickley Diary dataset, Persian Heritage Image Binarization Dataset (PHIDB), the Synchromedia Multispectral dataset (S-MS) and Palm Leaf dataset (PALM) and organized them in one folder. You can download it from this link. After downloading, extract the folder named DIBCOSETS and place it in your desired data path, that is, /YOUR_DATA_PATH/DIBCOSETS/
  • Specify the data path, split size, validation set and testing set to prepare your data. In this example, running utils/process_dibco.py sets the split size to 256 × 256, leaves the validation set empty (0) and uses the 2018 set for testing.
python utils/process_dibco.py --data_path /YOUR_DATA_PATH/ --split_size 256 --testing_dataset 2018 --validation_dataset 0

Credits for the utils/process_dibco.py file: https://github.com/dali92002/DocEnTR/tree/main
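
To illustrate what a split size of 256 means, the sketch below lists the 256 × 256 patch boxes that cover an image. It is not the logic of utils/process_dibco.py: the function name `patch_boxes` is hypothetical, and shifting the last row or column of patches back inside the image (instead of padding) is an assumption made to keep the example short.

```python
def patch_boxes(height, width, size=256):
    """Return (top, left, bottom, right) boxes of size x size covering the image.

    Assumes the image is at least `size` pixels in each dimension.
    """
    if height < size or width < size:
        raise ValueError("image is smaller than the patch size")

    def starts(length):
        positions = list(range(0, length - size + 1, size))
        if positions[-1] + size < length:    # uncovered remainder at the border
            positions.append(length - size)  # shift the last patch back inside
        return positions

    return [(top, left, top + size, left + size)
            for top in starts(height)
            for left in starts(width)]
```

With this scheme, border patches may overlap their neighbours whenever the image size is not a multiple of the split size.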

Model Zoo

All models are available and can be downloaded through this link

Training and Evaluation

This codebase contains two submodules, Deblurring and Binarization. Each submodule has its own folder, and each folder contains a configuration file (i.e. Binarization/conf.yml and Deblurring/conf.yml).

For both training and inference, you only need to modify the configuration parameters in the corresponding conf.yml and run:

BINARIZATION

python main.py --config Binarization/conf.yml

DEBLURRING

python main.py --config Deblurring/conf.yml

Set MODE=1 for training, MODE=0 for inference and MODE=2 for finetuning (deblurring only). The parameters in conf.yml are annotated in detail, so you can modify them as needed. Make sure to set the paths to the train/test datasets, the log folders and the pretrained models (if needed).
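
The MODE values described above can be summarized as a small lookup. This is an illustrative sketch only; `describe_run` is a hypothetical helper and does not reflect how main.py actually reads conf.yml.

```python
MODES = {0: "inference", 1: "training", 2: "finetuning"}
TASKS = ("binarization", "deblurring")


def describe_run(mode, task):
    """Map a MODE value and a task name to the kind of run it selects."""
    task = task.lower()
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if mode not in MODES:
        raise ValueError(f"unknown MODE: {mode!r}")
    if mode == 2 and task != "deblurring":
        raise ValueError("MODE=2 (finetuning) is only available for deblurring")
    return f"{task}: {MODES[mode]}"
```

For example, `describe_run(1, "Binarization")` returns `"binarization: training"`, while `describe_run(2, "Binarization")` raises a `ValueError`.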

FINETUNING

  • First, use a commercial OCR system to extract text and bounding boxes from the BMVC dataset images. You can use the script utils/extractOCR.py; change the path variables inside it first.
  • Pretrain the CRNN module for 20 epochs using the Deblurring/CRNN/trainCRNN.py script.
  • Set MODE=2 and set the paths to the pretrained CRNN model and the new dataset in Deblurring/conf.yml
  • Finetune NAFDPM for a sufficient number of iterations (>100k) using python main.py --config Deblurring/conf.yml

Acknowledgement

If you find our work useful and want to use NAF-DPM as the baseline for your project, please give us a star and cite our paper. Thank you! 🤞😘

@misc{cicchetti2024nafdpm,
      title={NAF-DPM: A Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement}, 
      author={Giordano Cicchetti and Danilo Comminiello},
      year={2024},
      eprint={2404.05669},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

For any doubts or questions, please send an email to [email protected] with the subject "NAFDPM", or open an issue. We will reply as soon as possible.

naf-dpm's Issues

Question about Differentiable OCR-Guided Finetuning in the paper

This paper is really excellent! But I have a small question.
This is the first time I have seen CTC loss used to make a model recover more accurate characters. However, in the paper only Figure 6 and Table 4 refer to it briefly. Does Table 4 report only the CER metric because this loss may affect the model's PSNR results?
Because the input of the model is a 128×128 patch instead of the whole image, I think some words or letters may be cut off, which may lead to false recognition and affect the training.

when residual is too little

Hello, the work on residuals is really interesting, but I found that when the coarse result generated in the first stage is already good, the residuals become very small. Then, during training, after about 1000 iterations, the model's predictions become all zeros. How can this problem be solved? Could reducing the number of timesteps help?

Question about production

Thank you for your work! I have some questions to ask.

  1. Does this model support the "shadow removing" task?

  2. For the "Binarization" task, I find that both the DPM and ViT models (like DocEnTr) take a long time to run prediction (more than 20 s on a T4 GPU for a 2000 × 2000 image). For the "Binarization" task, what kind of model do you think is more suitable for use in a production environment?
