
pdfocr's Introduction

pdfocr

pdfocr adds an OCR text layer to scanned PDF files, allowing them to be searched. It currently depends on Ruby 1.8.7 or above, and uses ocropus, cuneiform, or tesseract for performing OCR.

Using

To use, run:

pdfocr -i input.pdf -o output.pdf

For more details, see the manpage.

Dependencies

pdfocr requires tesseract and hocr2pdf. These can be provided by installing the packages tesseract-ocr, tesseract-ocr-eng (or other languages you need), and exactimage from your distribution.

Credits

pdfocr was written by Geza Kovacs

pdfocr is hosted at http://github.com/gkovacs/pdfocr

Christian Pietsch added tesseract support.

pdfocr's People

Contributors

0xace, cristiklein, duesenklipper, fritz-hh, gkovacs, imanzuk, jtrees, llfourn, oliviercailloux, orymate, snowboard975, thomasx


pdfocr's Issues

elsif block in line 390 is never executed

Here is the code:

if usetesseract
        puts "renaming merged-new.pdf to merged.pdf"
        sh "mv #{tmp+'/0000000000000-merged-new.pdf'} #{tmp+'/merged.pdf'}"
elsif
        puts "Merging together PDF files"
        sh "pdftk #{tmp+'/'+'*-new.pdf'} cat output #{tmp+'/merged.pdf'}"
end

Is there a condition missing here or should the elsif be an else?

In any case, the condition being evaluated is actually the return value of puts, which is always nil, so the block is never executed.
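The behaviour can be reproduced in isolation. This is a minimal sketch, not the pdfocr source: the method names and the returned symbols are hypothetical stand-ins for the two `sh` calls, and it assumes the intended fix is a plain `else`.

```ruby
# Hypothetical stand-in for the quoted block. A bare `elsif` takes the next
# expression (the puts call) as its condition.
def merge_step_with_elsif(usetesseract)
  if usetesseract
    :rename
  elsif
    puts "Merging together PDF files" # evaluated as the condition; puts returns nil
    :merge                            # never reached
  end
end

# Same logic with `else`, which is presumably what was intended.
def merge_step_with_else(usetesseract)
  if usetesseract
    :rename
  else
    puts "Merging together PDF files"
    :merge
  end
end

p merge_step_with_elsif(false) # message is printed, but the result is nil
p merge_step_with_else(false)  # message is printed and the branch runs: :merge
```

Note that the message is still printed in the `elsif` version, which makes the log look as if the merge had run.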

compress pdf file

Hi, my input file for pdfocr is ~9M and the output file is about 390M. I tried to use pdftk to compress it, but the compression rate is less than 0.1%. Is it possible to compress the PDF file? Thank you.

Resulting PDFs are not searchable in OS X Preview.app

I'm wondering if anyone has tested the resulting PDFs under OS X?
I'm using the PPA version of pdfocr on Ubuntu 14.04 (version reported is 0.1.4)

The behaviour is strange - selection shows that there's definitely some text layer, and it's properly aligned with the image, but whatever I highlight, the selection comes back as a series of spaces. Searching doesn't find anything.

The text is definitely correct in the pdf (judging by the output of pdftotext), and Preview.app definitely supports this feature - here's a PDF where it works properly:
S&W_7-25.pdf

BTW, is it normal to see size increases from 167k -> 430k (single page)? Seems a bit excessive for a mere text layer, but what do I know...

requirements

It would be nice to have a list of requirements, so that all dependencies can be installed at once.

Installation on amazon centos

We can install it on Ubuntu with the following commands:
sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr

But I am not able to install it on Amazon CentOS. Please let me know how to install it there.

This package does not work for me...

Somehow, it does not seem to work with my directory layout (Ubuntu 10.04).
This seems to be a tesseract-related issue (Cuneiform seems to work)...

pdfocr -i beleg0059.pdf -o b59.pdf
Input file is /home/samba-shares/family/scans/beleg0059.pdf
Output file is /home/samba-shares/family/scans/b59.pdf
Using working dir /tmp/d20131230-26500-1fddng
Getting info from PDF file

Warning: no info dictionary found
NumberOfPages: 4

Converting 4 pages

Extracting page 1
Converting page 1 to ppm
Running OCR on page 1
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/1.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `1.hocr.html': No such file or directory

Error while running OCR on page 1

Extracting page 2
Converting page 2 to ppm
Running OCR on page 2
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/2.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `2.hocr.html': No such file or directory

Error while running OCR on page 2

Extracting page 3
Converting page 3 to ppm
Running OCR on page 3
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/3.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `3.hocr.html': No such file or directory

Error while running OCR on page 3

Extracting page 4
Converting page 4 to ppm
Running OCR on page 4
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/4.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `4.hocr.html': No such file or directory
Error while running OCR on page 4
Merging together PDF files
/tmp/d20131230-26500-1fddng/-new.pdf not found as file or resource.
Error: Failed to open PDF file:
/tmp/d20131230-26500-1fddng/
-new.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Updating PDF info for /home/samba-shares/family/scans/b59.pdf
/tmp/d20131230-26500-1fddng/merged.pdf not found as file or resource.
Error: Failed to open PDF file:
/tmp/d20131230-26500-1fddng/merged.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Cleaning up temporary files

Linux Compatibility

I am using Fedora and spent 30 minutes wading through libraries trying to get pdftk to run. Being short on time, I gave up. While searching, I found a pdftk gem for Ruby. What about using it instead of the pdftk binary? That would improve compatibility with other distributions, because Ruby is well supported on most of them.

Thank you

Remove temp files during executing

Thanks for the nice program.
On a large PDF the collected ppm files can become huge.
When not using the -k option, I think i.ppm can be removed once i-new.pdf is created.
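A minimal sketch of that cleanup, assuming the file naming from the issue (`i.ppm` and `i-new.pdf` in the working directory). The helper name and the `keep_tmp` flag, which stands in for the -k option, are hypothetical and not taken from the pdfocr source.

```ruby
require "fileutils"

# Hypothetical helper: delete the intermediate image for one page once its
# OCRed PDF exists, unless the user asked to keep temporary files.
# Returns true if cleanup was performed, false if it was skipped.
def remove_page_image(tmp, basefn, keep_tmp)
  ppm = File.join(tmp, "#{basefn}.ppm")
  new_pdf = File.join(tmp, "#{basefn}-new.pdf")
  # Only clean up when the page was actually converted.
  return false if keep_tmp || !File.exist?(new_pdf)
  FileUtils.rm_f(ppm)
  true
end
```

Calling this right after each page is processed would keep at most one ppm file on disk at a time in a serial run.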

Parallel execution?

This seems like a relatively easy thing to parallelize, as currently it only works in serial.

I am envisioning (with a queue):

  1. Parallel PDF => image extraction
  2. Parallel OCR per image
  3. Parallel (merge-sort like) merging of PDF

It shouldn't take too much work from what I see in code, but could be nice. I can give this a try if I have time.
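Steps 1 and 2 could be sketched with a queue and a small pool of worker threads. This is an illustration under assumptions, not pdfocr code: the method name and worker count are hypothetical, and the block stands in for the per-page extraction and OCR commands. Threads should be sufficient if the heavy work happens in external processes, as it does when shelling out to an OCR tool. The merge in step 3 is not covered.

```ruby
require "thread"

# Run the given block once per page on a pool of worker threads.
# Results are returned in page order, regardless of completion order.
def process_pages_in_parallel(page_count, workers = 4)
  queue = Queue.new
  (1..page_count).each { |page| queue << page }
  results = Array.new(page_count)

  threads = Array.new(workers) do
    Thread.new do
      loop do
        page = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break           # no pages left, so this worker is done
        end
        results[page - 1] = yield(page) # each slot is written by one worker only
      end
    end
  end
  threads.each(&:join)
  results
end

# Usage: the block would run the conversion and OCR commands for one page.
pages = process_pages_in_parallel(4, 2) { |page| "#{page}-new.pdf" }
```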

Need to decrease pdf size

When I convert a PDF file of 500 KB, the output PDF is larger than 1 MB. It also takes more time.

Support for black and white, and grayscale pdf files

Currently, pdfocr converts b/w and grayscale PDFs to ppm format in color and runs tesseract on them. As a result, the output file of pdfocr is about 10 to 100 times bigger than the input file for b/w PDF files.
But there is a way to reduce the file size.

Line 331 on the newest version of pdfocr says,
sh "pdftoppm -r 300 #{shell_escape(basefn)}.pdf >#{shell_escape(basefn)}.ppm"

If this line is replaced with below for b/w format,
sh "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pbmraw -r300 -sOutputFile=#{shell_escape(basefn)}.ppm #{shell_escape(basefn)}.pdf"

or if the line is replaced with below for grayscale format,
sh "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r300 -sOutputFile=#{shell_escape(basefn)}.ppm #{shell_escape(basefn)}.pdf"

Then the ppm file is b/w or grayscale, and the output file of pdfocr is much smaller than it is now.
The problem is that this always converts the ppm file to b/w or grayscale.
So it would be nice to have an additional option such as -gray or -mono in pdfocr that selects the ppm conversion command.

*ps: pdftoppm also supports the -mono and -gray options, but its -mono option reduces the image quality for some reason, so I used the gs command instead.
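One way such an option could select the conversion command is sketched below. The method name and the `:mono`, `:gray` and `:color` symbols are hypothetical. The command strings are the ones quoted in this issue, and `Shellwords.escape` from the standard library is used here in place of pdfocr's own `shell_escape` helper.

```ruby
require "shellwords"

# Hypothetical mapping from a color-mode option to the page conversion command.
# :color is the existing pdftoppm line; :mono and :gray are the Ghostscript
# commands proposed in this issue.
def page_to_image_command(basefn, mode = :color)
  base = Shellwords.escape(basefn)
  case mode
  when :mono
    "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pbmraw -r300 -sOutputFile=#{base}.ppm #{base}.pdf"
  when :gray
    "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r300 -sOutputFile=#{base}.ppm #{base}.pdf"
  when :color
    "pdftoppm -r 300 #{base}.pdf >#{base}.ppm"
  else
    raise ArgumentError, "unknown color mode: #{mode.inspect}"
  end
end
```

The option parser would then only need to set the mode, and the existing `sh` call could take the returned string.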

Tesseract 3.02 does not return the list of languages, and pdfocr.rb stops with an error.

--list-langs is no longer a tesseract parameter; tesseract returns the usage message instead.
To run pdfocr.rb I had to comment out the following lines.

From line 253 to 276:

if checklang
  langlist = []
  if usecuneiform
    begin
      langlist = `cuneiform -l`.split("\n")[-1].split(":")[-1].delete(".").split(" ")
    rescue
      puts "Unable to list supported languages from cuneiform"
    end
  end
  if usetesseract
    begin
      langlist = `tesseract --list-langs 2>&1`.split("\n")[1..-1]
    rescue
      puts "Unable to list supported languages from tesseract"
    end
  end
  if langlist and not langlist.empty?()
    if not langlist.include?(language)
      puts "Language #{language} is not supported or not installed. Please choose from"
      puts langlist.join(' ')
      exit
    end
  end
end

--keep -k not doing anything

Hi,

can't verify that -k does anything.

Thanks for the help.

(BTW: everything else works pretty good, thanks!)

Add support for files with ".PDF", not just ".pdf" extension

.PDF and .pdf are both valid extensions for PDFs, and the difference in extension does not change the file's content.

Many systems and programs create or export PDFs with a ".PDF" extension rather than ".pdf", but pdfocr fails for PDFs with a ".PDF" extension. Both forms should be supported.

In the source, line 167 reads
if outfile[-3..-1] != "pdf"
and then goes on to print an error message. This line should be changed to also check for "PDF", or to compare the strings in a case-insensitive way.
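A case-insensitive check could look like the sketch below. The helper name is hypothetical; `File.extname` and `String#downcase` are standard Ruby.

```ruby
# Hypothetical helper: true when the path ends in .pdf, .PDF, .Pdf and so on.
def pdf_extension?(path)
  File.extname(path).downcase == ".pdf"
end

# The check on the output file would then read:
#   if not pdf_extension?(outfile)
puts pdf_extension?("scan.PDF")
puts pdf_extension?("scan.txt")
```

Unlike the `outfile[-3..-1]` comparison, this also requires the dot, so a file simply named "pdf" is not accepted.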

line 85 typo?

Trying to run the script results in a syntax error right out of the box. Changing the ~ on line 85 to a - fixed it.

pdftk error: Unexpected Exception in open_reader()

For some PDF files, pdftk throws this error:

Error: Unexpected Exception in open_reader()
Unhandled Java Exception:

This bug has been reported on pdftk launchpad: https://bugs.launchpad.net/ubuntu/+source/pdftk/+bug/774052

It seems the bug hasn't been fixed, and because of it pdfocr.rb also fails on many occasions. However, I have a temporary workaround:

Sometimes, pdftk completely fails to read certain types of PDFs. However, if we read those PDFs with some other tool and recreate them, pdftk reads the newly created PDF just fine. E.g. we can use ghostscript to recreate the PDF like this:

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=newfile.pdf myfile.pdf

Now pdftk will read the newly created PDF file just fine.
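A rough sketch of how this workaround might be wired in as a fallback. Both helper names are hypothetical, and `reader` is any callable that returns true on success (for example, a lambda wrapping a pdftk call). The Ghostscript arguments are the ones from the command above.

```ruby
# Hypothetical helper: the Ghostscript command from above as an argument list,
# so file names with spaces need no shell quoting.
def ghostscript_rewrite_command(src, dst)
  ["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite", "-sOutputFile=#{dst}", src]
end

# Try the reader on the original file first. If that fails, recreate the PDF
# with Ghostscript and try once more. Returns the path that could be read,
# or nil if both attempts failed.
def read_with_rewrite_fallback(src, rewritten, reader, runner = lambda { |cmd| system(*cmd) })
  return src if reader.call(src)
  return nil unless runner.call(ghostscript_rewrite_command(src, rewritten))
  reader.call(rewritten) ? rewritten : nil
end
```

The `runner` parameter exists only so the Ghostscript call can be replaced in tests; by default it runs the command with `system`.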

If someone is willing to apply this solution, then it'd be really good. Otherwise I will make the changes myself and send a pull request.

PS:
A sample file which fails to be read is given here: https://www.jstage.jst.go.jp/article/jsmec/45/3/45_3_730/_pdf

Preserve image data (/filesize) from original PDF

Thank you for this nice tool! I think it has one issue which might make it not ideal for an important use-case: PDFs from document scanners usually consist of exactly one image per page. It would be great if pdfocr could cater for those by preserving the original image data and simply adding the OCR text layer.

Currently, pdfocr converts the original pages to images using pdftoppm, thus creating very large image files and still gradually worsening the quality of the output pdf. For the use-case described above it would be nicer to use "pdfimages -all" to extract the original page image data and send that through tesseract (more or less) directly.

I have implemented a prototype of this as a bash script here: http://cern.ch/fsiegert/tmp/pdfocr.sh
It's definitely not complete and probably doesn't handle all types of documents that can come from different scanners yet (I have only tested it using a document from one scanner I had available). But I thought I'd contact you and ask whether you could imagine adding something similar as an option to pdfocr.rb (I'm not fluent in Ruby, but I could try to provide a patch/pull request if there is interest).

Cheers,
Frank

Cropped pages after run of pdfocr

Hello

There seems to be a problem with the final step in the pdfocr script. Running pdfocr produces a heavily cropped pdf file. Most of each page is missing.

Actual Result:
Cropped pdf file

Expected Result:
Pdf file in original dimensions

Description:
I'm running the command in a script like so:
pdfocr -i $FILENAME.tmp.pdf -l deu -w . -k -o $FILENAME.pdf

Turning the -k option on shows me the "merged.pdf" file in the working directory ("pdfocr") which is still perfectly fine, size, OCRed text, and all. But the final pdf is heavily cropped.

Comparing the pdf metadata of the final file and "merged.pdf" with "pdftk merged.pdf dump_data" shows the differences in dimensions.

Commenting out line 374 in "pdfocr.rb" prevents the final file from being created and the metadata from being updated, so up to this point everything seems to work properly. The line is:

sh "pdftk", tmp+'/merged.pdf', "update_info", tmp+'/pdfinfo.txt', "output", outfile

Unfortunately, I don't 'speak' Ruby, so I don't know what I'd be doing if I were to edit the pdfocr script. I'm using a workaround now by simply deleting the final file and moving "merged.pdf".

My System:
Ubuntu 20.10, pdfocr 0.1.4, ruby 2.7.1p83, pdftk 3.1.1

If there's any further information I can provide, please let me know.

Do not depend on pdftk

pdftk has been for some time out of distros such as Fedora because of a licensing concern in iText, which is a dependency (through GCJ, which is also absent on Fedora 21+, so it is quite difficult to compile here).

In short, pdftk may be linked to a non-free version of iText and it cannot be included as Free Software.

Can its use be replaced by another similar tool?
