Scribe OCR

Scribe OCR is a free and open-source web application for recognizing text, proofreading OCR data, and creating fully-digitized documents. Live site at scribeocr.com.

Scribe OCR includes the Tesseract OCR engine for recognizing text. It can also be used for proofreading existing OCR data from Tesseract or Abbyy.

Running

Scribe OCR can be used through the public site at scribeocr.com. The entire program runs in your browser; no data is sent to a remote server.

There is currently no standalone desktop application, so running locally requires serving the files over a local HTTP server. To set up a local copy, run the following commands (requires npm):

git clone --recursive https://github.com/scribeocr/scribeocr.git
cd scribeocr
npm i
npx http-server

The npx http-server command prints the local addresses where Scribe OCR is being served. Open one of those addresses in your browser to use the site.

Please "thumbs up" this Git Issue if you would prefer a desktop application, and we can consider adding one.

Documentation

Documentation for users is available at docs.scribeocr.com, and is managed in this repo. If you review the documentation and think something important is unclear or missing, feel free to open a Git Issue in that repo.

Proofreading Overview

Efficient proofreading is a major focus of Scribe OCR. Using the proofreading interface, users can easily spot and correct errors, bringing their OCR data from 98% accuracy to 100% accuracy.

To allow for efficient proofreading, Scribe OCR precisely prints editable OCR text over source images. To replicate the document as closely as possible, Scribe OCR generates a custom font for each document, optimized using the provided OCR data. This improves the alignment between the original scan and the overlay text and, by making errors more obvious, can significantly decrease the time spent proofreading. For example, the images below show the same text, with and without Font Optimization enabled.

To show how Scribe OCR can be used to digitize documents, three versions of a scanned book page found at Archive.org are shown below. The first panel shows the original image. The second shows Scribe OCR’s Proofreading Mode, which precisely layers colored OCR text over the source image. Most errors stand out in two ways: they overlap poorly with the underlying image, and they are colored red, which indicates that the OCR engine flagged them as low-confidence. The third panel shows Ebook Mode, which contains only the (now corrected) text layer.
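
The idea of coloring low-confidence words can be illustrated with a short sketch. This is not Scribe OCR's actual code: the OcrWord shape, the colorForWord helper, the threshold of 85, and the "black" default color are all assumptions made for illustration. It only assumes that each recognized word carries a confidence score from 0 to 100, as Tesseract reports.

```typescript
// Hypothetical word shape; Scribe OCR's internal data model may differ.
interface OcrWord {
  text: string;
  confidence: number; // 0-100, higher means the engine is more certain
}

// Assumed cutoff for illustration only; not taken from Scribe OCR.
const LOW_CONFIDENCE_THRESHOLD = 85;

// Pick an overlay color: red for words the engine flagged as uncertain,
// and an assumed default ("black") for everything else.
function colorForWord(
  word: OcrWord,
  threshold: number = LOW_CONFIDENCE_THRESHOLD,
): string {
  return word.confidence < threshold ? "red" : "black";
}

// Example input: one confident word and one likely misrecognition.
const words: OcrWord[] = [
  { text: "Scribe", confidence: 96 },
  { text: "0CR", confidence: 41 },
];

// Attach a display color to each word for the proofreading overlay.
const colored = words.map((w) => ({ text: w.text, color: colorForWord(w) }));
```

A proofreader would then review the red words first, since those are the ones the engine itself was least sure about.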

Display Mode Comparison

Most OCR output formats either compromise on faithfully representing the original document (e.g. text or markdown that omits formatting) or produce enormous files by printing invisible text over the original scanned images. In contrast, the third panel above (Ebook Mode) faithfully represents the original scan while maintaining a small file size. (Exporting PDFs with the traditional invisible text-over-image approach is also supported for users interested only in proofreading.)
