gkovacs / pdfocr

Adds text to PDF files using the cuneiform OCR software

License: MIT License

Ruby 88.71% Roff 11.29%

pdfocr's Introduction

pdfocr

pdfocr adds an OCR text layer to scanned PDF files, allowing them to be searched. It currently depends on Ruby 1.8.7 or above, and uses ocropus, cuneiform, or tesseract for performing OCR.

Using

To use, run:

pdfocr -i input.pdf -o output.pdf

For more details, see the manpage.

Dependencies

pdfocr requires tesseract and hocr2pdf. These can be provided by installing the packages tesseract-ocr, tesseract-ocr-eng (or other languages you need), and exactimage from your distribution.

Credits

pdfocr was written by Geza Kovacs

pdfocr is hosted at http://github.com/gkovacs/pdfocr

Christian Pietsch added tesseract support.

pdfocr's Issues

elsif block in line 390 is never executed

Here is the code:

if usetesseract
        puts "renaming merged-new.pdf to merged.pdf"
        sh "mv #{tmp+'/0000000000000-merged-new.pdf'} #{tmp+'/merged.pdf'}"
elsif
        puts "Merging together PDF files"
        sh "pdftk #{tmp+'/'+'*-new.pdf'} cat output #{tmp+'/merged.pdf'}"
end

Is there a condition missing here or should the elsif be an else?

In any case, the condition being evaluated is actually the return value of puts, which is always nil, so the block will never be executed.
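If the intent was a plain two-way branch, one possible correction is to replace the bare elsif with else. The sketch below assumes that intent; `merge_command` is a hypothetical helper that returns the shell command as a string instead of passing it to the script's sh, so the branch taken is easy to inspect.

```ruby
# Hypothetical helper (not part of pdfocr): returns the command the
# script would run for the merge step.
def merge_command(usetesseract, tmp)
  if usetesseract
    # tesseract branch: rename the already merged file
    "mv #{tmp}/0000000000000-merged-new.pdf #{tmp}/merged.pdf"
  else
    # every other case: merge the per-page PDFs with pdftk
    "pdftk #{tmp}/*-new.pdf cat output #{tmp}/merged.pdf"
  end
end

puts merge_command(false, "/tmp/work")
```

With else in place, the pdftk branch runs whenever usetesseract is false, instead of depending on the return value of puts.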

compress pdf file

Hi, my input file for pdfocr is ~9M and the output file is about 390M. I tried to use pdftk to compress it, but the compression rate is less than 0.1%. Is it possible to compress the PDF file? Thank you.

Resulting PDFs are not searchable in OS X Preview.app

I'm wondering if anyone has tested the resulting PDFs under OS X?
I'm using the PPA version of pdfocr on Ubuntu 14.04 (version reported is 0.1.4)

The behaviour is strange - selection shows that there's definitely some text layer, and it's properly aligned with the image, but whatever I highlight, the selection comes back as a series of spaces. Searching doesn't find anything.

The text is definitely correct in the pdf (judging by the output of pdftotext), and Preview.app definitely supports this feature - here's a PDF where it works properly:
S&W_7-25.pdf

BTW, is it normal to see size increases from 167k -> 430k (single page)? Seems a bit excessive for a mere text layer, but what do I know...

[bug] [regression] latest merge breaks cuneiform, ocroscript and old tesseract versions

4d274c9, merged in b53060b, breaks pdfocr for all OCR engines except for very new tesseract versions.

The support for cuneiform+ocroscript is broken, because they generate .hocr files, which won't be merged to PDF anymore.
tesseract 3.02.01-6 as in debian stable can't yet create pdf directly.

So IMHO this patch needs a thorough rework with more cases: if tesseract && recent_version: new code; else old code

requirements

It would be nice to have a list of requirements, so that all dependencies can be installed at once.

Installation on amazon centos

We can install it on Ubuntu with the following commands:
sudo add-apt-repository ppa:gezakovacs/pdfocr
sudo apt-get update
sudo apt-get install pdfocr

But I am not able to install it on Amazon CentOS. Please let me know how to install it there.

This package does not work for me...

Somehow, it does not seem to work with my directory layout (Ubuntu 10.04).
This seems to be a tesseract-related issue (Cuneiform seems to work)...

pdfocr -i beleg0059.pdf -o b59.pdf
Input file is /home/samba-shares/family/scans/beleg0059.pdf
Output file is /home/samba-shares/family/scans/b59.pdf
Using working dir /tmp/d20131230-26500-1fddng
Getting info from PDF file

Warning: no info dictionary found
NumberOfPages: 4

Converting 4 pages

Extracting page 1
Converting page 1 to ppm
Running OCR on page 1
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/1.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `1.hocr.html': No such file or directory

Error while running OCR on page 1

Extracting page 2
Converting page 2 to ppm
Running OCR on page 2
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/2.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `2.hocr.html': No such file or directory

Error while running OCR on page 2

Extracting page 3
Converting page 3 to ppm
Running OCR on page 3
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/3.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `3.hocr.html': No such file or directory

Error while running OCR on page 3

Extracting page 4
Converting page 4 to ppm
Running OCR on page 4
read_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/4.hocrread_variables_file:Can't open /usr/share/tesseract-ocr/tessdata/configs/hocrerror: Could not find variable 'P6'
mv: cannot stat `4.hocr.html': No such file or directory
Error while running OCR on page 4
Merging together PDF files
/tmp/d20131230-26500-1fddng/-new.pdf not found as file or resource.
Error: Failed to open PDF file:
/tmp/d20131230-26500-1fddng/-new.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Updating PDF info for /home/samba-shares/family/scans/b59.pdf
/tmp/d20131230-26500-1fddng/merged.pdf not found as file or resource.
Error: Failed to open PDF file:
/tmp/d20131230-26500-1fddng/merged.pdf
Errors encountered. No output created.
Done. Input errors, so no output created.
Cleaning up temporary files

Linux Compatibility

I am using Fedora, and I spent 30 minutes wading through libraries trying to get pdftk to run. Given my limited time, I gave up. While searching, I found a pdftk gem for Ruby. What about using it instead of the pdftk binary? It would improve compatibility with other distributions, because Ruby is well supported on most of them.

Thank you

Remove temp files during executing

Thanks for the nice program.
On a large PDF the collected ppm files can become huge.
When not using the -k option, I think i.ppm can be removed once i-new.pdf is created.
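A minimal sketch of this suggestion, assuming the per-page files are named i.ppm and i-new.pdf inside the working directory as described above. `cleanup_page_image` and its parameters are hypothetical names, not part of pdfocr.

```ruby
require "fileutils"

# Hypothetical helper: delete the intermediate image for one page once its
# OCRed PDF exists, unless the user asked to keep temporary files (-k).
# Returns true if the image was removed, false otherwise.
def cleanup_page_image(tmp, page, keep_files)
  ppm = File.join(tmp, "#{page}.ppm")
  new_pdf = File.join(tmp, "#{page}-new.pdf")
  # Only clean up when the page result is really there.
  return false if keep_files || !File.exist?(new_pdf)
  FileUtils.rm_f(ppm)
  true
end
```

Calling this right after each page is converted would keep at most one ppm file on disk at a time, rather than one per page.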

Add a comparison with https://github.com/jbarlow83/OCRmyPDF to documentation

It would be good to know the advantages that pdfocr has over OCRmyPDF

Parallel execution?

This seems like a relatively easy thing to parallelize, as currently it only works in serial.

I am envisioning (with a queue):

Parallel PDF => image extraction
Parallel OCR per image
Parallel (merge-sort like) merging of PDF

It shouldn't take too much work from what I see in code, but could be nice. I can give this a try if I have time.
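The queue idea could be sketched roughly as follows. This assumes each page can be processed independently of the others; `process_pages_in_parallel` is a hypothetical helper, and the block passed to it stands in for the per-page extraction and OCR steps. The final merge step is not covered here.

```ruby
require "thread"

# Run the given block once per page using a fixed number of worker threads.
# Results come back in page order, regardless of which page finishes first.
def process_pages_in_parallel(pages, workers = 4)
  queue = Queue.new
  pages.each_with_index { |page, i| queue << [i, page] }
  results = Array.new(pages.size)
  threads = Array.new(workers) do
    Thread.new do
      loop do
        begin
          # Non-blocking pop raises ThreadError once the queue is empty.
          i, page = queue.pop(true)
        rescue ThreadError
          break
        end
        results[i] = yield(page)
      end
    end
  end
  threads.each(&:join)
  results
end

# Usage: the block here is a placeholder for the real per-page work.
p process_pages_in_parallel([1, 2, 3]) { |page| page * 2 }
```

Because the heavy work happens in external programs, threads that mostly wait on those child processes are likely enough here; no separate Ruby processes should be needed.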

Need to decrease pdf size

When I convert a PDF file of 500 KB, the output PDF file is larger than 1 MB. It also takes more time.

Support for black and white, and grayscale pdf files

Currently, pdfocr converts b/w and grayscale PDFs to ppm format in color and runs tesseract on them. As a result, the output file of pdfocr is about 10 to 100 times bigger than the input file for b/w PDF files.
But there is a way to reduce the file size.

Line 331 on the newest version of pdfocr says,
sh "pdftoppm -r 300 #{shell_escape(basefn)}.pdf >#{shell_escape(basefn)}.ppm"

If this line is replaced with the following for b/w input,
sh "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pbmraw -r300 -sOutputFile=#{shell_escape(basefn)}.ppm #{shell_escape(basefn)}.pdf"

or with the following for grayscale input,
sh "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r300 -sOutputFile=#{shell_escape(basefn)}.ppm #{shell_escape(basefn)}.pdf"

then the ppm file is in b/w or grayscale, and the output file of pdfocr is much smaller than it is now.
The problem is that this always produces a b/w or grayscale ppm file, even for color input.
So it would be nice to have an additional option such as -gray or -mono in pdfocr that selects the ppm conversion command.

PS: pdftoppm also supports the -mono and -gray options, but its -mono option reduces the image quality for some reason, so I used the gs command instead.
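A rough sketch of how such an option could select the conversion command. `ppm_command` and the :mono / :gray / :color symbols are hypothetical, and Shellwords.escape stands in for the script's own shell_escape; the command strings themselves are the ones quoted above.

```ruby
require "shellwords"

# Hypothetical helper: pick the page-to-image command for a color mode.
def ppm_command(basefn, mode = :color)
  pdf = Shellwords.escape("#{basefn}.pdf")
  ppm = Shellwords.escape("#{basefn}.ppm")
  case mode
  when :mono
    # 1-bit output via ghostscript
    "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pbmraw -r300 -sOutputFile=#{ppm} #{pdf}"
  when :gray
    # grayscale output via ghostscript
    "gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r300 -sOutputFile=#{ppm} #{pdf}"
  else
    # current behaviour: color output via pdftoppm
    "pdftoppm -r 300 #{pdf} >#{ppm}"
  end
end

puts ppm_command("0001", :gray)
```

Keeping color as the default means existing invocations would behave as before.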

Please add a PPA for ubuntu 14.04

When I run
sudo apt-get update

I get the error below.

W: http://ppa.launchpad.net/gezakovacs/pdfocr/ubuntu/dists/trusty/main/binary-i386/Packages Failed to download the file. 404 Not Found

Please add a suitable PPA for ubuntu 14.04.
Thank you!

Tesseract 3.02 does not return a list of languages, and pdfocr.rb stops with an error

--list-langs is not a parameter of this tesseract version, which returns the usage message instead.
To run pdfocr.rb I had to comment out the following lines.

From line 253 to 276:

if checklang
  langlist = []
  if usecuneiform
    begin
      langlist = `cuneiform -l`.split("\n")[-1].split(":")[-1].delete(".").split(" ")
    rescue
      puts "Unable to list supported languages from cuneiform"
    end
  end
  if usetesseract
    begin
      langlist = `tesseract --list-langs 2>&1`.split("\n")[1..-1]
    rescue
      puts "Unable to list supported languages from tesseract"
    end
  end
  if langlist and not langlist.empty?()
    if not langlist.include?(language)
      puts "Language #{language} is not supported or not installed. Please choose from"
      puts langlist.join(' ')
      exit
    end
  end
end
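Instead of commenting the check out, the script could treat an unusable answer as "language list unknown" and carry on. The sketch below assumes that a supporting tesseract prints one header line followed by one language code per line (which is what the [1..-1] slice above implies), and that an unsupported version prints text containing the word "usage". `parse_tesseract_langs` is a hypothetical helper.

```ruby
# Hypothetical helper: turn the text printed by `tesseract --list-langs`
# into a list of language codes. Returns [] when the text looks like a
# usage message, so the caller can skip the language check.
def parse_tesseract_langs(output)
  lines = output.split("\n").map(&:strip).reject(&:empty?)
  return [] if lines.any? { |line| line =~ /usage/i }
  # Assumed format: one header line, then one language code per line.
  (lines[1..-1] || []).select { |line| line =~ /\A[A-Za-z0-9_\-\/]+\z/ }
end

p parse_tesseract_langs("List of available languages (2):\neng\ndeu\n")
```

Since the existing code only rejects a language when langlist is non-empty, an empty result would let the run continue.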

--keep -k not doing anything

Hi,

can't verify that -k does anything.

Thanks for the help.

(BTW: everything else works pretty good, thanks!)

Add support for files with ".PDF", not just ".pdf" extension

.PDF and .pdf are both valid extensions for PDF's, and the difference of extension does not drastically affect the file's content.

Many systems / programs will create / export PDF's with a ".PDF" extension rather than ".pdf", but pdfocr fails for PDF's with ".PDF" extension. Both formats should be supported.

In the source, line 167 reads
if outfile[-3..-1] != "pdf"
and then goes on to print an error message. This line should be changed to also check for "PDF", or to compare the strings in a case-insensitive way.
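A minimal sketch of the case-insensitive comparison. `pdf_extension?` is a hypothetical helper name; it uses File.extname, so unlike the last-three-characters test it also requires the dot.

```ruby
# Hypothetical helper: true for "scan.pdf", "SCAN.PDF", "scan.Pdf", ...
def pdf_extension?(path)
  File.extname(path).downcase == ".pdf"
end

puts pdf_extension?("SCAN.PDF")
```

The check on line 167 could then read `if !pdf_extension?(outfile)`.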

Add an option to clean up the pages

Tools like
http://scantailor.org/
or
http://code.google.com/p/ocrfeeder/
can clean up separate pages. Maybe this can be integrated in the process?

line 85 typo?

Trying to run the script results in a syntax error right out of the box. Changing the ~ on line 85 to a - fixed it.

pdftk error: Unexpected Exception in open_reader()

For some PDF files, pdftk throws this error:

Error: Unexpected Exception in open_reader()
Unhandled Java Exception:

This bug has been reported on pdftk launchpad: https://bugs.launchpad.net/ubuntu/+source/pdftk/+bug/774052

It seems like the bug hasn't been fixed. Due to this bug, pdfocr.rb also fails on many occasions. However, there is a temporary solution that I have. The solution is something like this:

Sometimes, pdftk completely fails to read certain types of PDFs. However, if we read those PDFs using some other tool and then recreate them, then pdftk will read the newly created PDF just fine. E.g. we can use ghostscript to recreate pdf like this:

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=newfile.pdf myfile.pdf

Now pdftk will read the newly created PDF file just fine.
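In pdfocr.rb this could be a fallback that runs only when pdftk fails on the input. The sketch below just builds the argument list for the gs command shown above; `gs_rewrite_args` is a hypothetical helper. Passing the arguments as an array to Kernel#system avoids shell quoting issues with unusual file names.

```ruby
# Hypothetical helper: argument list for re-creating a PDF with ghostscript.
def gs_rewrite_args(input, output)
  ["gs", "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite",
   "-sOutputFile=#{output}", input]
end

# Possible use (requires ghostscript to be installed):
#   system(*gs_rewrite_args("myfile.pdf", "newfile.pdf"))
p gs_rewrite_args("myfile.pdf", "newfile.pdf")
```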

If someone is willing to apply this solution, then it'd be really good. Otherwise I will make the changes myself and send a pull request.

PS:
A sample file which fails to be read is given here: https://www.jstage.jst.go.jp/article/jsmec/45/3/45_3_730/_pdf

Returning the exit status would be great

It would be great if the script would return an exit status != 0 when it stops because of an error.
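One way to do this, sketched under the assumption that the script counts the pages on which OCR failed. `exit_status_for` and `failed_pages` are hypothetical names.

```ruby
# Hypothetical helper: map the number of failed pages to an exit status.
def exit_status_for(failed_pages)
  failed_pages.zero? ? 0 : 1
end

# At the very end of the script one could then write:
#   exit exit_status_for(failed_pages)
puts exit_status_for(0)
```

Shell scripts and cron jobs calling pdfocr could then test `$?` rather than parsing the output.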

Preserve image data (/filesize) from original PDF

Thank you for this nice tool! I think it has one issue which might make it not ideal for an important use-case: PDFs from document scanners usually consist of exactly one image per page. It would be great if pdfocr could cater for those by preserving the original image data and simply adding the OCR text layer.

Currently, pdfocr converts the original pages to images using pdftoppm, thus creating very large image files and still gradually worsening the quality of the output pdf. For the use-case described above it would be nicer to use "pdfimages -all" to extract the original page image data and send that through tesseract (more or less) directly.

I have implemented a prototype of this as a bash script here: http://cern.ch/fsiegert/tmp/pdfocr.sh
It's definitely not complete and probably doesn't handle all types of documents that can come from different scanners yet (I have only tested it using a document from one scanner I had available). But I thought I'd contact you and ask whether you could imagine adding something similar as an option to pdfocr.rb (I'm not fluent in Ruby, but I could try to provide a patch/pull request if there is interest).

Cheers,
Frank

Deleted thread

I deleted this issue because I opened a new one

Cropped pages after run of pdfocr

Hello

There seems to be a problem with the final step in the pdfocr script. Running pdfocr produces a heavily cropped pdf file. Most of each page is missing.

Actual Result:
Cropped pdf file

Expected Result:
Pdf file in original dimensions

Description:
I'm running the command in a script like so:
pdfocr -i $FILENAME.tmp.pdf -l deu -w . -k -o $FILENAME.pdf

Turning the -k option on shows me the "merged.pdf" file in the working directory ("pdfocr") which is still perfectly fine, size, OCRed text, and all. But the final pdf is heavily cropped.

Comparing the pdf metadata of the final file and "merged.pdf" with "pdftk merged.pdf dump_data" shows the differences in dimensions.

Commenting out line 374 in "pdfocr.rb" prevents the final file from being created and the metadata from being updated, so up to this point everything seems to work properly. The line is:

sh "pdftk", tmp+'/merged.pdf', "update_info", tmp+'/pdfinfo.txt', "output", outfile

Unfortunately, I don't 'speak' Ruby, so I don't know what I'd be doing if I were to edit the pdfocr script. I'm using a workaround now by simply deleting the final file and moving "merged.pdf".

My System:
Ubuntu 20.10, pdfocr 0.1.4, ruby 2.7.1p83, pdftk 3.1.1

If there's any further information I can provide, please let me know.

Add support for files with ".PDF"

Do not depend on pdftk

pdftk has been absent from distros such as Fedora for some time because of a licensing concern in iText, which is a dependency (through GCJ, which is also absent on Fedora 21+, so it is quite difficult to compile here).

In short, pdftk may be linked to a non-free version of iText and it cannot be included as Free Software.

Can its use be replaced by another similar tool?