seq2seq-ocr's Introduction

Seq2Seq Model to Correct OCR Errors

Abstract (pasted from writeup):

Documents transcribed with OCR software often have errors that may make the text unsuitable for analysis. If they occur infrequently enough, it is easy to correct these errors using a combination of spell-checking tools and contextual natural language processing (NLP) models (e.g., Google BERT). However, when it comes to extremely noisy text, these strategies are often less successful.

The goal of this project was to develop a deep learning model that corrects common OCR errors. The model itself has an LSTM seq2seq architecture, commonly used in machine translation tasks. It is fed a noisy string of characters and outputs a predicted word. Of the several strategies we tried, the most effective was to train the model on a historical English corpus with forced errors: common letter substitutions observed in OCR'ed text, injected into the training data at a frequency proportional to their observed occurrence. Applying the model to London Times articles from 1820-1939, we increased the proportion of recognizable English words by an average of 5-10%. These results illustrate how NLP models can be trained to correct specific errors in noisy data, even when the noise hinders context-dependent tools.
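The forced-error idea described above can be sketched as follows. The substitution table and probabilities here are illustrative placeholders, not the values the project derives from Ted Underwood's rulesets:

```python
import random

# Illustrative OCR confusion table: each character maps to candidate
# substitutions and their probabilities. The real project derives these
# frequencies from Ted Underwood's OCR rulesets; the values below are
# made up for demonstration.
SUB_PROBS = {
    "e": [("c", 0.02), ("o", 0.01)],
    "l": [("1", 0.03), ("i", 0.02)],
    "s": [("f", 0.04)],  # long-s/f confusion, common in historical print
}

def add_ocr_noise(word, rng=random):
    """Return `word` with OCR-style substitution errors forced in,
    each applied at a rate proportional to its listed probability."""
    out = []
    for ch in word:
        r = rng.random()
        cumulative = 0.0
        repl = ch
        for sub, p in SUB_PROBS.get(ch, []):
            cumulative += p
            if r < cumulative:
                repl = sub
                break
        out.append(repl)
    return "".join(out)

# Training pairs are (noisy input, clean target):
pair = (add_ocr_noise("settle"), "settle")
```

Training on (noisy, clean) pairs built this way teaches the seq2seq model to invert the specific error distribution of the OCR'ed corpus, which is what lets it work where context-dependent tools fail.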

Instructions for Use:

I would suggest looking at the example-notebooks directory for several examples of how to use the seq2seq model. The Basic Usage notebook gives a broad overview of correcting text with the model and is the best place to start.

File Overview:

The model is stored in the s2s directory and can be accessed via the Seq2SeqOCR class defined in seq2seqocr.py.

The training data is stored in training-sets. The training script is train_model.py. Other files (source data, lexicons, error probabilities) are in source-data.

The process_letter_sub.py script prepares OCR error probabilities from Ted Underwood's data. Those probabilities are used in prepare_training.py to force errors into the training data. The process_lexicons.py script saves several hashsets to disk that are used in preprocessing. Finally, general_util.py provides general string functions shared between files.
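The lexicon-hashset idea can be sketched as below: store known English words as a Python set for constant-time membership tests, persist it to disk, and use it to measure what fraction of a document's tokens are recognizable English (the metric behind the 5-10% improvement reported above). The file name and word list are illustrative, not the repository's actual lexicons:

```python
import pickle

# Illustrative lexicon; the project builds its hashsets from real
# word lists (COHA, Hansard, Google's frequent-words list).
lexicon = {"the", "quick", "brown", "fox"}

# Persist the set to disk so preprocessing can reload it cheaply.
with open("lexicon.pkl", "wb") as f:
    pickle.dump(lexicon, f)

with open("lexicon.pkl", "rb") as f:
    lexicon = pickle.load(f)

def recognizable_fraction(tokens, lexicon):
    """Fraction of tokens that appear in the lexicon."""
    if not tokens:
        return 0.0
    return sum(t.lower() in lexicon for t in tokens) / len(tokens)

tokens = "tbe quick brown fox".split()  # "tbe" is an OCR error
# recognizable_fraction(tokens, lexicon) -> 0.75
```

A set gives O(1) average-case lookups, which matters when checking every token of a large newspaper corpus against a lexicon of tens of thousands of words.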

Citations:

seq2seq model inspired by Keras's sample program.
Noise function inspired by Spell Checker.
OCR Error data from Ted Underwood's OCR rulesets.
Training data of historical English text was taken from the following sources: COHA, Hansard, and Google's 10000 frequent English words.

Note: I manually processed Google's document to remove modern words such as "honda" and "programmer", short abbreviations such as "uk" and "ca", and inappropriate words.

seq2seq-ocr's People

Contributors

mattyding
