NAF-DPM: Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement

Giordano Cicchetti, Danilo Comminiello

This is the official repository for the paper NAF-DPM: A Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement. NAF-DPM is a novel generative framework based on diffusion probabilistic models (DPMs) that solves document enhancement tasks by treating them as conditional image-to-image translation problems. It can be used for tasks such as document deblurring, denoising, and binarization. The paper is currently under review at IEEE Transactions on Pattern Analysis and Machine Intelligence.

Highlights

Abstract: Real-world documents often suffer from various forms of degradation, which can lead to lower accuracy in optical character recognition (OCR) systems. Therefore, a crucial preprocessing step is essential to eliminate noise while preserving text and key features of the documents. In this paper, we propose a novel generative framework based on a diffusion probabilistic model (DPM) designed to restore the original quality of degraded documents. While DPMs are recognized for their high-quality generated images, they are also known for their large inference time. To mitigate this problem, we address both the architecture design and sampling strategy. To this end, we provide the DPM with an efficient nonlinear activation-free (NAF) network, which also proves to be very effective in image restoration tasks. For the sampling strategy, we employ a fast solver of ordinary differential equations, which is able to converge in 20 iterations at most. The combination of a small number of parameters (only 9.4M), an efficient network and a fast sampling strategy allows us to achieve competitive inference time with respect to existing methods. To better preserve text characters, we introduce an additional differentiable module based on convolutional recurrent neural networks (CRNN) simulating the behavior of an OCR system during the training phase. This module is used to compute an additional loss function that better guides the diffusion model to restore the original quality of the text in document images. Experiments conducted on diverse datasets showcase the superiority of our approach, achieving state-of-the-art performance in terms of both pixel-level metrics (PSNR and SSIM) and perceptual similarity metrics (LPIPS and DISTS). Furthermore, the results demonstrate a notable error reduction by OCR systems when transcribing real-world document images enhanced by our framework.

Results

NAF-DPM in comparison with existing methods

The table below reports the performance of different methods on the Blurry Document Images OCR Text Dataset (https://www.fit.vutbr.cz/~ihradis/CNN-Deblur/) in terms of PSNR, SSIM, LPIPS, DISTS and CER.

Name	PSNR	SSIM	LPIPS	DISTS	CER
Hradis	30.629	0.987	0.0135	0.0258	5.44
DE-GAN	28.803	0.985	0.0144	0.0237	6.87
DocDiff	29.787	0.989	0.0094	0.0339	2.78
NAF-DPM (ours)	34.377	0.994	0.0046	0.0228	1.55
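For readers unfamiliar with two of the metrics in the table, the sketch below shows how PSNR and CER are conventionally defined. It is a minimal pure-Python illustration, not the evaluation code used in this repository. The function names `psnr`, `edit_distance` and `cer` are hypothetical, images are assumed to be flat sequences of 8-bit pixel values, and returning CER as a percentage is an assumption about how the table reports it.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel sequences."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ground_truth, prediction):
    """Character error rate in percent (assumed unit): edit distance / reference length."""
    if not ground_truth:
        raise ValueError("ground truth must be non-empty")
    return 100.0 * edit_distance(ground_truth, prediction) / len(ground_truth)
```

Higher PSNR is better, while lower CER is better, which matches the direction of the numbers in the table.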

The table below reports the performance of different methods on DIBCO2019 in terms of PSNR, F-Measure and Pf-Measure.

Name	PSNR	F-Measure	Pf-Measure
[Otsu]	9.08	47.83	45.59
[Sauvola]	13.72	51.73	55.15
[Competition_Top]	14.48	72.88	72.15
[DE-GAN]	12.29	55.98	53.44
[D^2BFormer]	15.05	67.63	66.69
[DocDiff]	15.14	73.38	75.12
NAF-DPM (ours)	15.39	74.61	76.25
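As a reminder of what the F-Measure column measures, here is a minimal sketch of the standard definition for binarization: the harmonic mean of precision and recall over text pixels. The function name `f_measure` is hypothetical, masks are assumed to be flat 0/1 sequences with 1 marking text, and this is not the repository's evaluation code. The Pf-Measure is defined differently in the DIBCO evaluation protocol and is not reproduced here.

```python
def f_measure(ground_truth, prediction):
    """F-measure in percent for binary masks given as flat 0/1 sequences (1 = text)."""
    if len(ground_truth) != len(prediction):
        raise ValueError("masks must have the same number of pixels")
    pairs = list(zip(ground_truth, prediction))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # text found as text
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # background marked as text
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # text that was missed
    if tp == 0:
        return 0.0  # no correctly detected text pixel
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)
```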

Installation

This codebase is tested on Ubuntu 22.04 LTS with Python 3.12.1. Follow the steps below to create the environment and install the dependencies.

Set up a conda environment (recommended).

# Create a conda environment
conda create -y -n NAFDPM python=3.12.1

# Activate the environment
conda activate NAFDPM

# Install torch (requires version >= 1.8.1) and torchvision
# Please refer to https://pytorch.org/ if you need a different cuda version
pip3 install torch torchvision torchaudio

Clone the NAFDPM code repository and install the requirements.

# Clone NAFDPM code base
git clone https://github.com/Giordano-Cicchetti/Diffusion-Document-Enhancement.git

cd Diffusion-Document-Enhancement/

# Install requirements
pip install -r requirements.txt

Data Preparation

BMVC Blurry Document Images Text Dataset

Create a folder named dataset_deblurring.
Download the training data from the official website and extract it into the created folder.
Use the script utils/prepare_deblurring_dataset.py to divide the data into training and validation sets. If necessary, change the path variables inside utils/prepare_deblurring_dataset.py.

The directory structure should look like this:

dataset_deblurring/
   |–– train_origin/ # Contains 30000 original images
   |–– train_blur/ # Contains 30000 blurry images
   |–– test_origin/ # Contains 10000 original images
   |–– test_blur/ # Contains 10000 blurry images
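The split into 30000 and 10000 images can be pictured with the sketch below. It is only an illustration of a reproducible split over file names: the helper `split_filenames`, the seed and the file-name pattern are hypothetical, and utils/prepare_deblurring_dataset.py may select images differently.

```python
import random


def split_filenames(names, n_train, seed=0):
    """Shuffle file names reproducibly and split them into (train, held_out)."""
    if not 0 <= n_train <= len(names):
        raise ValueError("n_train must be between 0 and len(names)")
    shuffled = sorted(names)  # sort first so the result does not depend on listing order
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file names; the real dataset may use a different naming scheme.
names = [f"{i:07d}.png" for i in range(40000)]
train, held_out = split_filenames(names, n_train=30000)
print(len(train), len(held_out))
```

Applying the same split to the blurry and the original folders keeps each blurry image paired with its sharp counterpart.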

DIBCO

We gathered the DIBCO, H-DIBCO, Bickley Diary dataset, Persian Heritage Image Binarization Dataset (PHIDB), the Synchromedia Multispectral dataset (S-MS) and Palm Leaf dataset (PALM) and organized them in one folder. You can download it from this link. After downloading, extract the folder named DIBCOSETS and place it in your desired data path, i.e. /YOUR_DATA_PATH/DIBCOSETS/
Specify the data path, the split size, and the validation and testing sets to prepare your data. In this example, we run utils/process_dibco.py with a split size of 256 x 256, no validation set (--validation_dataset 0) and 2018 as the testing set.

python utils/process_dibco.py --data_path /YOUR_DATA_PATH/ --split_size 256 --testing_dataset 2018 --validation_dataset 0
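To give an idea of what a split size of 256 means, the sketch below enumerates the top-left corners of non-overlapping 256 x 256 tiles that cover an image. It is an assumption-laden illustration: the helper `patch_origins` is hypothetical, tiles at the right and bottom borders are assumed to be padded, and utils/process_dibco.py may handle borders or overlap differently.

```python
import math


def patch_origins(height, width, size=256):
    """Top-left (row, col) corners of non-overlapping size x size tiles covering the image."""
    if height <= 0 or width <= 0 or size <= 0:
        raise ValueError("height, width and size must be positive")
    rows = math.ceil(height / size)  # tiles needed vertically
    cols = math.ceil(width / size)   # tiles needed horizontally
    return [(r * size, c * size) for r in range(rows) for c in range(cols)]


print(patch_origins(512, 512))  # [(0, 0), (0, 256), (256, 0), (256, 256)]
```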

Credits for the utils/process_dibco.py file: https://github.com/dali92002/DocEnTR/tree/main

Model Zoo

All models are available and can be downloaded through this link

Training and Evaluation

This codebase contains two submodules, Deblurring and Binarization, each in its own folder. Each folder contains a configuration file (Binarization/conf.yml and Deblurring/conf.yml).

For both training and inference, you only need to modify the configuration parameters in the corresponding conf.yml and run:

BINARIZATION

python main.py --config Binarization/conf.yml

DEBLURRING

python main.py --config Deblurring/conf.yml

MODE=1 is for training, MODE=0 is for inference, and MODE=2 is for finetuning (deblurring only). The parameters in conf.yml are annotated in detail, so you can modify them as needed. Make sure to set the paths to the train/test datasets, the log folders and the pretrained models (if needed).
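The valid combinations of submodule and MODE can be summarized with the small sketch below. The helper `describe_run` is hypothetical and only encodes the rules stated above; it does not reflect how main.py actually reads conf.yml.

```python
MODES = {0: "inference", 1: "training", 2: "finetuning"}


def describe_run(task, mode):
    """Describe a (task, MODE) pair, raising ValueError for unsupported combinations."""
    if task not in ("Binarization", "Deblurring"):
        raise ValueError(f"unknown task: {task!r}")
    if mode not in MODES:
        raise ValueError(f"MODE must be 0, 1 or 2, got {mode!r}")
    if mode == 2 and task != "Deblurring":
        raise ValueError("MODE=2 (finetuning) is only available for Deblurring")
    return f"{MODES[mode]} with {task}/conf.yml"


print(describe_run("Deblurring", 2))  # finetuning with Deblurring/conf.yml
```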

FINETUNING

First, use a commercial OCR system to extract text and bounding boxes from the BMVC dataset images. You can use the script utils/extractOCR.py; change the path variables inside it.
Pretrain the CRNN module for 20 epochs using the Deblurring/CRNN/trainCRNN.py script.
Set MODE=2 and set the paths to the pretrained CRNN model and the new dataset in Deblurring/conf.yml.
Finetune NAFDPM for a sufficient number of iterations (>100k) using python main.py --config Deblurring/conf.yml

Acknowledgement

If you find our work useful and want to use NAF-DPM as the baseline for your project, please give us a star and cite our paper. Thank you! 🤞😘

@misc{cicchetti2024nafdpm,
      title={NAF-DPM: A Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement}, 
      author={Giordano Cicchetti and Danilo Comminiello},
      year={2024},
      eprint={2404.05669},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

For any doubts or questions, please send an email to [email protected] with the subject "NAFDPM", or open an issue. We will reply as soon as possible.
