Giter Club home page Giter Club logo

pdfocr's Introduction

pdfOcr

Use Tesseract to make a PDF searchable.

Installation

Install Tesseract v3.0.5. This library shells out to it.

Then install this package. Maven-style:

<dependency>
  <groupId>org.overviewproject</groupId>
  <artifactId>pdfocr_2.12</artifactId>
  <version>0.0.10</version>
</dependency>

Sbt-style:

dependencies += "org.overviewproject" %% "pdfocr" % "0.0.10"

Usage

You've got to use Scala. Code something like this:

import java.nio.file.Path
import java.util.Locale
import org.overviewproject.pdfocr.{PdfOcr,PdfOcrProgress,PdfOcrResult}
import org.overviewproject.pdfocr.exceptions._
import scala.concurrent.Future

val pdfOcr = new PdfOcr()                          // default settings: finds tesseract in your $PATH
val inPdf = new Path("/path/to/needs-ocr.pdf")     // exists
val outPdf = new Path("/path/to/ocr-finished.pdf") // doesn't exist; will be deleted if it does
val process = PdfOcr.makePdfSearchable(inPdf, outPdf, Seq(Locale("en")))

process.progress // Future[PdfOcrProgress]
  .map { progress =>
    // It's a Future because we don't know how many pages there are until
    // we begin parsing the PDF, which takes time.

    progress.value       // 0.0 ... 1.0
    progress.currentPage // 1 .. nPages
    progress.nPages      // n
  }

process.result // Future[PdfTextResult]
  .map { result =>
    // do something with outPdf now...

    // Also, since the data is handy and would otherwise take a long time
    // to compute, PdfOcr returns the text, in pages.
    val text = result.pages.map(_.text).mkString("\n")
  }
  .recover {
    // outPdf is guaranteed not to exist

    case TesseractMissingException => throw
    case TesseractLanguageMissingException => throw
    case EncryptedPdfException => throw
    case InvalidPdfException => throw
    // Other errors may happen -- PDFBox bugs, Tesseract bugs,
    // out-of-memory.... You shouldn't catch those.
  }

// Or if you got impatient, you could:
process.cancel // Future[Unit]

How PdfOcr behaves

  • PdfOcr processes one page at a time.
  • PdfOcr sends Tesseract any page that's missing fonts or missing 100 characters of text.
  • PdfOcr's progress reports are page-by-page. If one page needs OCR and nine don't, the progress report will be unintuitive.
  • PdfOcr communicates with Tesseract via stdin and stdout.
  • For any method that will block on I/O, PdfOcr returns a Future. In other words: blocking methods are asynchronous.
  • PdfOcr does heavy computations (especially in PdfPage) which are slow. These are non-blocking and synchronous.

Developing

First, Install sbt.

After that,

  1. Run sbt ~test to run unit tests in the background.
  2. Edit files in src/test until a test fails.
  3. Edit files in src/main until the test passes.
  4. Return to step 2.
  5. Commit to a git branch, push it to GitHub, and submit a pull request.

Publishing

We use [sbt-sonatype](https://github.com/xerial/sbt-sonatype for more details) for all this.

Setup: using the sbt-sonatype instructions, ensure you've done these things:

  • Created an account at https://oss.sonatype.org and get access to this project.
  • Created ~/.sbt/1.0/sonatype.sbt with your credentials.

Then, every new version:

  1. sbt publishSigned to deploy to staging
  2. sbt sonatypeRelease to close and promote it

If the version ends in -SNAPSHOT, you won't be able to release it.

License

This software is Copyright 2011-2018 The Associated Press, and distributed under the terms of the GNU Affero General Public License. See the LICENSE file for details.

pdfocr's People

Contributors

zanio avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.