yob / pdf-reader
The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
License: MIT License
Hi! Been very happy with pdf-reader.
Noticed something odd: when calling .text on each page in the following pdf: http://www.tokyo-wako.com/site/menus/arcadia_dining.pdf I get the following error:
can't convert Symbol into Integer
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page_text_receiver.rb:205:in `[]'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page_text_receiver.rb:205:in `invoke_xobject'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/form_xobject.rb:62:in `block in callback'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/form_xobject.rb:61:in `each'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/form_xobject.rb:61:in `callback'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/form_xobject.rb:73:in `content_stream'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/form_xobject.rb:42:in `walk'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page_text_receiver.rb:212:in `invoke_xobject'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page.rb:144:in `block in callback'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page.rb:143:in `each'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page.rb:143:in `callback'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page.rb:130:in `content_stream'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page.rb:95:in `walk'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page.rb:65:in `text'
PDF seems to work fine with regular pdf readers. Personally I don't need it enough to free up time for digging in the pdf-reader source but maybe it's of interest.
I am generating this PDF via PrinceXML(http://princexml.com): http://dl.dropbox.com/u/599002/DocRaptor/test.pdf
It has no user or owner password set. It is encrypted, however. Based on a brief readthrough of the PDF spec, the trailer seemed correct. It opens without password prompting in Preview and Skim.
The exception trace is:
PDF::Reader::EncryptedPDFError: Invalid password ()
vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader/standard_security_handler.rb:182:in `build_standard_key'
vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader/standard_security_handler.rb:66:in `initialize'
vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader/object_hash.rb:275:in `new'
vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader/object_hash.rb:275:in `build_security_handler'
vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader/object_hash.rb:48:in `initialize'
vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader.rb:116:in `new'
vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader.rb:116:in `initialize'
vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader.rb:159:in `new'
vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader.rb:159:in `open'
Thoughts?
The specs run fine with ruby 1.8.7, but with jruby they seem to hang:
(in /var/tmp/portage/dev-ruby/pdf-reader-0.8.6/work/jruby/yob-pdf-reader-a05658c)
/usr/share/jruby/lib/ruby/1.8/pathname.rb:263 warning: `*' interpreted as argument prefix
...................................................
Process is quite busy so I would guess a loop is encountered somewhere?
Hi!
I noted the minor following issue in the 1.0.0.beta1 version: the script examples/extract_fonts.rb is missing a shebang on the first line.
Best regards,
Cédric
I have a shallow understanding of PDF internals, but I think that the bottom of a page is Y=0, am I right? In that case, move_to_start_of_next_line should subtract from, not add to, the current position.
# ./lib/pdf/reader/page_text_receiver.rb
def move_to_start_of_next_line # T*
  move_text_position(0, -state[:text_leading])
end
I was having some problems with a PDF and after this change everything worked as expected.
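For reference, the PDF specification describes T* as having the same effect as `0 -Tl Td`, where Tl is the text leading, so each new line lowers the y coordinate. The following is a minimal sketch of that arithmetic only. `TextCursor` and its fields are hypothetical names used for illustration and are not part of pdf-reader, and the sketch ignores the text matrix entirely.

```ruby
# Hypothetical stand-in for a text state; not a pdf-reader class.
TextCursor = Struct.new(:x, :y, :leading) do
  # Td (simplified): offset the text position by (tx, ty).
  def move_text_position(tx, ty)
    self.x += tx
    self.y += ty
  end

  # T*: same effect as "0 -Tl Td", so y decreases by the leading.
  def move_to_start_of_next_line
    move_text_position(0, -leading)
  end
end

cursor = TextCursor.new(72.0, 700.0, 14.0)
2.times { cursor.move_to_start_of_next_line }
puts cursor.y # two lines further down the page than 700.0
```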
I'm using TextReceiver to parse a document where all of the characters are rotated 90 degrees.
There were two big problems:
1) page_state.font_size always returned zero
2) character displacement was calculated incorrectly
The fix to the first problem was this:
PDF::Reader::PageState.class_eval do
  def font_size
    return state[:text_font_size]
  end
end
The old function multiplied it by transformation_matrix.a (which equals zero in a 90 degree rotation matrix). Although it now returns the proper font size, I'm pretty sure it breaks PageLayout. (I believe that this might also explain line 330 of page_state.rb, but my pdf file has ctm.a == 1 so I'm not sure)
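To illustrate why multiplying by the `a` component fails here, this small sketch builds the components of a 90 degree rotation and applies them to a font size. The `Math.hypot` variant is only an assumption about one possible alternative; it is not the fix proposed above and not what pdf-reader does.

```ruby
# A rotation by theta has matrix components a = cos, b = sin, c = -sin, d = cos.
theta = Math::PI / 2
a = Math.cos(theta).round(10) # rounds the tiny floating point residue to 0.0
b = Math.sin(theta).round(10)

font_size = 12.0

# Scaling by a alone collapses to zero when the text is rotated 90 degrees.
scaled_by_a = font_size * a

# Assumption for illustration: scale by the length of the (a, b) vector instead.
scaled_by_length = font_size * Math.hypot(a, b)

puts scaled_by_a
puts scaled_by_length
```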
The fix to the character displacement problem was to flip the direction of the matrix multiplication. Based on page 252 of the PDF spec, I think it goes (translation matrix) x T_m.
PDF::Reader::TransformationMatrix.class_eval do
  def horizontal_displacement_multiply!(e2)
    newe = (e2 * @a) + @e
    newf = (e2 * @b) + @f
    @e, @f = newe, newf
  end
end
I might take a closer look this weekend.
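The multiplication order described above can be checked with plain arrays. This is a sketch under stated assumptions: `matmul` is a hypothetical helper, the matrices are written as rows [a b 0], [c d 0], [e f 1], and the numbers are made up for a 90 degree rotation. It is not pdf-reader code.

```ruby
# Multiply two 3x3 matrices given as arrays of rows.
def matmul(m, n)
  m.map { |row| n.transpose.map { |col| row.zip(col).sum { |x, y| x * y } } }
end

# Text matrix for a 90 degree rotation with origin (10, 20).
tm = [[0, 1, 0], [-1, 0, 0], [10, 20, 1]]

tx = 5 # horizontal displacement in text space
translation = [[1, 0, 0], [0, 1, 0], [tx, 0, 1]]

# (translation matrix) x T_m: e' = tx * a + e and f' = tx * b + f.
result = matmul(translation, tm)
e, f = result[2][0], result[2][1]

# The opposite order moves along x, ignoring the rotation.
reversed = matmul(tm, translation)
wrong = [reversed[2][0], reversed[2][1]]

puts [e, f].inspect
puts wrong.inspect
```

With rotated text the displacement should move the position along the page's y axis, which only the first ordering does.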
AcroForm, Annots, etc
Using the following code:
receiver = PDF::Reader::TextReceiver.new($stdout)
PDF::Reader.string(pdf_text, receiver)
With pdf_text using: Hello World.pdf
Causes: can't convert Fixnum into String (TypeError)
See full trace.
Quick solution (in text_receiver.rb:189):
def show_text_with_positioning (params)
  prev_adjustment = @state.last[:tj_adjustment]
  params.each do |p|
    case p
    when Float, Fixnum # Added Fixnum
      @state.last[:tj_adjustment] = p
    else
      show_text(p)
    end
  end
  @state.last[:tj_adjustment] = prev_adjustment
end
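On current Rubies the same idea is usually written with `Numeric`, since `Fixnum` was deprecated in Ruby 2.4 and removed in 3.2. The sketch below shows only the dispatch; `split_tj_operands` is a hypothetical helper for illustration, not a pdf-reader method.

```ruby
# Split the operand array of a TJ operator into text and adjustments.
# Numeric matches Integer and Float alike, so an integer adjustment
# such as -12 is never passed along as if it were a string.
def split_tj_operands(params)
  text = []
  adjustments = []
  params.each do |p|
    case p
    when Numeric then adjustments << p
    else text << p
    end
  end
  [text, adjustments]
end

text, adjustments = split_tj_operands(["Billing ", -12, "Period", 2.5])
puts text.inspect
puts adjustments.inspect
```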
Hi
I keep getting this error for this line:
reader = PDF::Reader.new(@pdf_name)
Doesn't make any sense since that's what's in the examples.
Is this Rails 3 compatible?
Hi!
Would you mind including the test suite (spec/) in the gem/tarball you distribute? This would allow GNU/Linux distributions like Debian to run the tests during package building, to check that everything works with the Ruby versions and other libraries they ship.
Thank you for your understanding.
Best wishes,
Cédric
I got this error when I tried to read this pdf file.
irb(main):001:0> require 'pdf-reader'
=> true
irb(main):002:0> p = PDF::Reader.new '/Users/andrew/Downloads/project_euler-poster.pdf'
=> #<PDF::Reader:0x007fde2b056ae8 @cache=<PDF::Reader::ObjectCache size: 0>, @objects=<PDF::Reader::ObjectHash size: 159>>
irb(main):003:0> p.pages.map(&:text)
NoMethodError: undefined method `unpack' for -314:Fixnum
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/font.rb:78:in `unpack'
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page_text_receiver.rb:95:in `internal_show_text'
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page_text_receiver.rb:62:in `block in show_text_with_positioning'
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page_text_receiver.rb:61:in `each'
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page_text_receiver.rb:61:in `each_slice'
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page_text_receiver.rb:61:in `each'
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page_text_receiver.rb:61:in `show_text_with_positioning'
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page.rb:150:in `block in callback'
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page.rb:149:in `each'
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page.rb:149:in `callback'
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page.rb:136:in `content_stream'
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page.rb:101:in `walk'
from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page.rb:71:in `text'
from (irb):3:in `map'
from (irb):3
from /Users/andrew/.rbenv/versions/1.9.3-p0/bin/irb:12:in `<main>'
Hi!
I saw you fixed issue #24, regarding the problematic license header, in 1.0.0.beta.
The new version requires ruby-rc4, which has no license stating its conditions for distribution, so it is not fit to enter a Linux distribution. Would it be possible to release a 0.10.1 with the fix for #24, so there is a version without license issues that works with prawn 1.0?
Best regards
gems/pdf-reader-0.8.6/lib/pdf/reader/xref.rb:66:in `load': XRef streams are not supported in PDF::Reader yet (PDF::Reader::UnsupportedFeatureError)
from /Users/alterscape/.rvm/gems/ruby-1.9.2-p0@acm_cloud/gems/pdf-reader-0.8.6/lib/pdf/reader.rb:131:in `parse'
from /Users/alterscape/.rvm/gems/ruby-1.9.2-p0@acm_cloud/gems/pdf-reader-0.8.6/lib/pdf/reader.rb:76:in `block in file'
from /Users/alterscape/.rvm/gems/ruby-1.9.2-p0@acm_cloud/gems/pdf-reader-0.8.6/lib/pdf/reader.rb:75:in `open'
from /Users/alterscape/.rvm/gems/ruby-1.9.2-p0@acm_cloud/gems/pdf-reader-0.8.6/lib/pdf/reader.rb:75:in `file'
from extract_text.rb:36:in
This error is raised when I try to run the PDF at http://sigmm.utdallas.edu/archive/MM/mm96.pdf through the text extraction example at https://github.com/yob/pdf-reader/blob/master/examples/text.rb. I plan to look at the text extraction code myself and see if it's an easy fix/workaround; if so will add a gist.
Hi!
The test suite is run during the build of pdf-reader Debian package, by requiring all *_spec files. When doing this, I get the following error:
Failures:
1) Spec suite PDFs should be intact
Failure/Error: Unable to find matching line from backtrace
NameError:
uninitialized constant YAML
# ./spec/integrity_spec.rb:18
# debian/ruby-tests.rb:2
Finished in 61.99 seconds
1084 examples, 1 failure, 4 pending
It seems that there is a require 'yaml' statement missing in spec/integrity_spec.rb (or spec_helper.rb).
Best wishes,
Cédric
I tried extracting out PNG images in PDF files that we create here using the example code extract_images.rb but it failed as it didn't know how to handle FlateDecode streams. I just added a new Png handler that is exactly the same as the JPEG but the image files that are created cannot be opened. Is there something else that I need to do ? I also tried Inflating (decompressing) the FlateDecode stream data then writing it out to files but this did not work either though my understanding of PDF file format is not great.
Any help appreciated,
Cheers.
P.S: Tried posting this to the mailing list but Google Groups is giving me an error for the past two days now.
One of the great things about pdf reader is that extracting the pdf text from an existing string (eg web response body) is ultra simple:
result = StringIO.new
PDF::Reader.string(response.body, PDF::Reader::TextReceiver.new(result))
result.string
I've seen several cases (eg here http://agilesoftwaretesting.com/?p=166 and here http://blog.liangzan.net/index.php/2009/12/11/testing-pdfs-with-celerity-culerity-and-cucumber) where people have copied the text example implementation of PageTextReceiver from examples/text.rb, and are saving temporary files to disk because they haven't checked out the rest of the API.
I was thinking examples/text.rb could be simplified to show how to use PDF::Reader.string() and PDF::Reader::TextReceiver.
Hi!
The license of this file seems to be in conflict with the freedom of modifying the source (since if I want to keep a reference from where this document comes from, I cannot modify it by the first paragraph of that file).
The easiest way to get around this, for people distributing your software, would be to remove this header and make the file a derivative (with exactly the same content). This is allowed according to the second paragraph of the license. The new file would then be distributed under the same license as the rest of the files.
Or you could use this file:
http://sourceforge.net/projects/aglfn.adobe/files/glyphlist.txt/download
which has exactly the same content, but another (less restrictive) license.
There is another change in 1.3.0 that affected our test suite.
It looks like Page#text now returns a single whitespace between words, even for strings that were intentionally created with two (or more) whitespaces between words.
For example, some date strings have double whitespaces due to the format mask (%l - Hour of the day, 12-hour clock, blank-padded ( 1..12)). Since Page#text does not return more than a single whitespace between words, the test breaks.
Is it the desired behavior to limit what Page#text returns to a single whitespace between words, even though the original string (and the rendered one) has more?
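To make the date example concrete, here is a small sketch of how `%l` produces a double space and what a single-space extraction would turn it into. The date and format string are made up for illustration; `squeeze` only imitates the collapsing and is not what pdf-reader runs internally.

```ruby
time = Time.utc(2012, 6, 1, 9, 5)

# %l pads single-digit hours with a blank, so two spaces follow the comma.
padded = time.strftime("%b %d, %l:%M %p")

# Collapsing runs of spaces imitates an extraction that keeps only one.
collapsed = padded.squeeze(" ")

puts padded.inspect
puts collapsed.inspect
```

A test that compares against the padded string would fail against the collapsed one; comparing squeezed strings on both sides is one possible workaround.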
Hi there
Love your library, thanks.
The examples in your Readme.rdoc don't work, so I had to use the examples in /examples/ to get things running. Thought it might be worth mentioning that the .rdoc needs an update.
Matt
OK, I have a PDF file from the Ghent Workgroup test suite, and when I parse it with pdf-reader the output shows the unknown-character symbol in the text. All the characters are there in the PDF file when I open it in a viewer.
Here is the PDF file: http://www.davidakachaos.nl/pdf_files/GWG_Testfile_018a_v4_2008_04_01_NewspaperAds.pdf
I'm trying to debug what's going wrong, but I'm going in circles! Can you please help me? Or nudge me in the right direction?
This is how I saw the wrong chars in PDF Reader
require 'pdf-reader'
filename = 'GWG_Testfile_018a_v4_2008_04_01_NewspaperAds.pdf'
reader = PDF::Reader.new(filename)
reader.pages.first.text
The above code shows text from the PDF with unknown characters. When opening the PDF file in a viewer, everything seems fine...
FYI, A 14 page pdf took 27 seconds to extract the text.
[12] pry(main)> Benchmark.realtime do
[12] pry(main)* pdf_file.pages.map{ |p| p.text }
[12] pry(main)* end
=> 27.4952540397644
Intel 2.2 GHz i7, 8GB RAM
First of all, thanks for this library! I am seeing the following problem with a PDF that I am trying to open:
gems/pdf-reader-0.10.0/lib/pdf/reader/pages_strategy.rb:354:in `content_stream': Unknown font F1 (PDF::Reader::MalformedPDFError)
I would submit the PDF so you can test it, but it contains a lot of my personal information (financial data, SSN, etc).
Let me know what other information I can give you. The file is an ADP generated paystub (iPay)
Thanks,
Benny
I'm getting this error when trying to read a PDF after processing it with ghostscript.
Here's the original: http://dl.dropbox.com/u/16582920/20110315_stellenanzeige_praktikant.pdf
And the ghostscript version: http://dl.dropbox.com/u/16582920/out.pdf
I'm running the following command:
gs -q -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite -dCompatibilityLevel="1.7" -sOutputFile="out.pdf" -f "in.pdf"
And here's a snippet of the output from pdf-reader:
bundle exec pdf_callbacks out.pdf
....
restore_graphics_state => []
save_graphics_state => []
concatenate_matrix => [1074.6, 0, 0, 32.4, 4163.4, 4262]
begin_inline_image => []
begin_inline_image_data => [:CS, :RGB, :W, 199, :H, 6, :BPC, 8, :F, :Fl, :DP, {:Predictor=>15, :Columns=>199, :Colors=>3}]
end_inline_image => [0]
restore_graphics_state => []
save_graphics_state => []
concatenate_matrix => [1069.2, 0, 0, 37.8, 4168.8, 4224.2]
invoke_xobject => [:R21]
restore_graphics_state => []
save_graphics_state => []
concatenate_matrix => [1063.8, 0, 0, 32.4, 4174.2, 4191.8]
begin_inline_image => []
begin_inline_image_data => [:CS, :RGB, :W, 197, :H, 6, :BPC, 8, :F, :Fl, :DP, {:Predictor=>15, :Columns=>197, :Colors=>3}]
~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:99:in `block in dictionary': Dictionary key (0) is not a name (PDF::Reader::MalformedPDFError)
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:96:in `loop'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:96:in `dictionary'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:51:in `parse_token'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:124:in `block in array'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:123:in `loop'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:123:in `array'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:52:in `parse_token'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:124:in `block in array'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:123:in `loop'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:123:in `array'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:52:in `parse_token'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/page.rb:176:in `content_stream'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/page.rb:150:in `walk'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/bin/pdf_callbacks:22:in `block in <top (required)>'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/bin/pdf_callbacks:16:in `each'
from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/bin/pdf_callbacks:16:in `<top (required)>'
from ~/.rvm/gems/ruby-1.9.2-p290/bin/pdf_callbacks:19:in `load'
from ~/.rvm/gems/ruby-1.9.2-p290/bin/pdf_callbacks:19:in `<main>'
I'd like to know whether the behavior change for Page#text was intentional or not?
It looks like the latest version, when calling Page#text, includes a lot more whitespaces (looks like those inserted for formatting).
I've created two gists showing the different outputs from 1.2.0 to 1.3.0
https://gist.github.com/5ea2e953a890db919136 - 1.3.0
https://gist.github.com/1b83e585469c04f1d0ac - 1.2.0
both refer to the same page calling #text, even though some data is different.
Also, I can see that the current spec only tests a basic PDF for #text. Maybe this is why we didn't catch this change earlier.
Can you confirm whether this change was intentional or not?
Hi,
I'm encountering a 'stack level too deep' error while trying to work with this PDF file: http://dl.dropbox.com/u/6646130/stack-level-too-deep.pdf
I'm using:
reader = PDF::Reader.new("stack-level-too-deep.pdf")
puts reader.pdf_version
puts reader.info
puts reader.page_count
puts reader.metadata
text = reader.pages.map{ |page| page.text.strip }.join(' ')
The PDF is:
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 2560
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 709
virtual memory (kbytes, -v) unlimited
I doubled the stack size hoping that might help but it didn't:
$ ulimit -s 16384
Any thoughts on how to resolve this?
Thanks,
Matt
When a page has multiple columns, the TextReceiver reads each horizontal line across the full page width as one line, instead of reading one entire column and then the next.
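The difference between the two reading orders can be sketched with positioned text runs. Everything here is hypothetical: the runs, their coordinates, and the assumption that the column boundary (`gutter_x`) is known in advance. It illustrates the ordering problem only and does not reflect how TextReceiver is implemented.

```ruby
# Hypothetical text runs: [x, y, text]. y grows upward, as in PDF user space.
runs = [
  [50, 700, "Left one"], [300, 700, "Right one"],
  [50, 686, "Left two"], [300, 686, "Right two"]
]

# Line-by-line order: top to bottom, then left to right.
by_line = runs.sort_by { |x, y, _| [-y, x] }.map(&:last)

# Column order: assign each run to a column first, then go top to bottom.
gutter_x = 200 # assumption: the column boundary is known in advance
by_column = runs.sort_by { |x, y, _| [x < gutter_x ? 0 : 1, -y, x] }.map(&:last)

puts by_line.inspect
puts by_column.inspect
```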
Hi
I use pdf-reader to extract the text from pdf-documents, and it works great, except for one problem. In one of the text files there is one line of text that gets messed up. A line is split in two right in the middle of a word, and then the parts are displayed on separate lines in reverse order.
Example of what I mean:
Text in PDF:
This is the first line
This is the second line
This is the third line
What page.text yields:
This is the first line
econd line
This is the s
This is the third line
I have tested 4 PDF so far, and it only happens with one of them. I tried an online converter and it did not occur there. If you wanna look into it I could send you the PDF where it happens.
This error appears:
C:/Documents and Settings/Wxyz/My documents/NetBeansProjects/Books/lib/main.rb:47: uninitialized constant PDF (NameError)
in the line:
receiver = PageTextReceiver.new
==> pdf = PDF::Reader.file("isla.pdf", receiver)
puts receiver.content.inspect
And I don't know why, please help me.
I'm using NetBeans 6.8, Ruby 1.8.7, and pdf-reader 0.8.3.
GemFile
gem 'pdf-reader', :git => "git://github.com/yob/pdf-reader.git"
I make a simple test in my app :
describe PDF do
  describe PDF::Reader do
    it "should parse target pdf" do
      reader = ::PDF::Reader.new("./spec/resources/target.pdf")
      pages = reader.pages
      puts "pages : #{pages.count}"
      pages.each do |page|
        t = page.text
      end
    end
  end
end
I get the following error :
Seems there is something broken
I am trying to extract the images from a PDF, following the Extract Image module from the examples section.
The PDF I am trying to parse has an image with filter type FlateDecode. In that module the code falls into the else branch, because the filter doesn't match any of the filters handled explicitly in the if conditions.
It then gives me "unsupport color depth". When I inspected the hash of the image I am trying to parse, I got this:
{:Length=>24407, :Type=>:XObject, :Subtype=>:Image, :Width=>4810, :Height=>1302, :ImageMask=>true, :BitsPerComponent=>1, :Filter=>:FlateDecode}
As you can see, there is no ColorSpace key in this hash. Is there a way to save this kind of image?
Hoping for a quick and favourable reply.
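Some background that may help here: a FlateDecode image stream holds zlib-compressed raw samples, not a PNG file, so writing the bytes to disk (compressed or inflated) does not produce something an image viewer can open without a container around it. An image mask carries no ColorSpace entry and uses one 1-bit sample per pixel. The sketch below computes how many bytes the inflated data should have for the dictionary above and round-trips stand-in data through Zlib. `expected_image_bytes` is a hypothetical helper, the blank sample data is made up, and no predictor is assumed.

```ruby
require "zlib"

# Each image row is padded to a whole number of bytes.
def expected_image_bytes(width, height, bits_per_component, components)
  bytes_per_row = (width * bits_per_component * components + 7) / 8
  bytes_per_row * height
end

# Width, height and bit depth from the dictionary above; one component for a mask.
expected = expected_image_bytes(4810, 1302, 1, 1)

# Stand-in for the real stream: deflate blank samples, then inflate them
# the way FlateDecode data is decoded.
compressed = Zlib::Deflate.deflate("\x00".b * expected)
raw = Zlib::Inflate.inflate(compressed)

puts expected
puts raw.bytesize == expected
```

If the inflated size of a real stream matches this figure, the samples could then be wrapped in an image container by a separate step.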
require 'rubygems'
require 'pdf/reader'
filename = File.expand_path(File.dirname(__FILE__)) + "/var/www/scaled_scores/output.pdf"
PDF::Reader.open(filename) do |reader|
  reader.pages.each do |page|
    puts page.text
  end
end
This code is not working in irb, even though I have installed pdf-reader. The error I am getting is below:
ArgumentError: input must be an IO-like object or a filename
from /home/eossysd7/.rvm/gems/ruby-1.9.3-p194@veritas/gems/pdf-reader-1.2.0/lib/pdf/reader/object_hash.rb:337:in `extract_io_from'
from /home/eossysd7/.rvm/gems/ruby-1.9.3-p194@veritas/gems/pdf-reader-1.2.0/lib/pdf/reader/object_hash.rb:43:in `initialize'
from /home/eossysd7/.rvm/gems/ruby-1.9.3-p194@veritas/gems/pdf-reader-1.2.0/lib/pdf/reader.rb:115:in `new'
from /home/eossysd7/.rvm/gems/ruby-1.9.3-p194@veritas/gems/pdf-reader-1.2.0/lib/pdf/reader.rb:115:in `initialize'
from /home/eossysd7/.rvm/gems/ruby-1.9.3-p194@veritas/gems/pdf-reader-1.2.0/lib/pdf/reader.rb:158:in `new'
from /home/eossysd7/.rvm/gems/ruby-1.9.3-p194@veritas/gems/pdf-reader-1.2.0/lib/pdf/reader.rb:158:in `open'
from (irb):16
from /home/eossysd7/.rvm/rubies/ruby-1.9.3-p194/bin/irb:16:in `
I look forward to your assistance.
Regards
Nishanth
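One thing worth checking, assuming the snippet uses `__FILE__`: inside irb that constant evaluates to "(irb)", so the computed path is the current working directory followed by /var/www/..., which may not exist. The sketch below reproduces only the path arithmetic with the standard library. That a missing file is what triggers the ArgumentError is an assumption based on the error message, not something confirmed in this thread.

```ruby
# Inside irb, __FILE__ is "(irb)", so its dirname is ".".
base = File.expand_path(File.dirname("(irb)"))
filename = base + "/var/www/scaled_scores/output.pdf"

puts filename             # the working directory followed by /var/www/...
puts File.file?(filename) # worth checking before handing the path to a reader

# If the file really lives at the absolute path, no prefix is needed.
absolute = "/var/www/scaled_scores/output.pdf"
puts absolute
```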
Hello
I find myself needing to parse a rather big PDF file, and I stumbled upon your library while searching for a way to solve this task in Ruby.
However, I'm getting a NoMethodError when iterating over the pages in the PDF (it happens around the 12th page).
I'm using the 1.0.0.beta1 version, running on Ruby 1.9.2-p180 on Mac OS X 10.6.
PDF file:
ftp://medical.nema.org/medical/dicom/2011/11_06pu.pdf
require 'pdf/reader'
f = File.open("11_06pu.pdf")
reader = PDF::Reader.new(f)
reader.pages.each do |page|
  puts page.text.length
end
/Users/chris/.rvm/gems/ruby-1.9.2-p180/gems/pdf-reader-1.0.0.beta1/lib/pdf/reader/page_text_receiver.rb:261:in `clone_state':
undefined method `dump' for Psych:Module (NoMethodError)
Hello,
When using a PDF file of size 0 Bytes, we got the following error:
undefined method `match' for nil:NilClass
The error is raised from object_hash.rb at line 326.
Here is the stack trace:
undefined method `match' for nil:NilClass
/srv/fundlook_testing/shared/bundle/ruby/1.8/gems/pdf-reader-1.0.0/lib/pdf/reader/object_hash.rb:326:in `read_version'
/srv/fundlook_testing/shared/bundle/ruby/1.8/gems/pdf-reader-1.0.0/lib/pdf/reader/object_hash.rb:44:in `initialize'
/srv/fundlook_testing/shared/bundle/ruby/1.8/gems/pdf-reader-1.0.0/lib/pdf/reader.rb:116:in `new'
/srv/fundlook_testing/shared/bundle/ruby/1.8/gems/pdf-reader-1.0.0/lib/pdf/reader.rb:116:in `initialize'
....
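Until the library reports empty input more clearly, a caller can guard against it before constructing a reader. This is a sketch using only the standard library; `readable_pdf_candidate?` is a hypothetical helper name and the temporary file is a stand-in for a real upload.

```ruby
require "tempfile"

# Return true when the path points at a non-empty regular file.
def readable_pdf_candidate?(path)
  File.file?(path) && !File.zero?(path)
end

results = []
Tempfile.create("empty") do |file|
  results << readable_pdf_candidate?(file.path) # nothing written yet
  file.write("%PDF-1.4\n")
  file.flush
  results << readable_pdf_candidate?(file.path) # now there is a header to read
end

puts results.inspect
```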
Hi,
the tag for release 0.8.3 is missing (and in Gentoo I'm relying on git tagged downloads otherwise I'm unable to run tests).
Thanks,
Diego
Using the code detailed in:
http://github.com/yob/pdf-reader/blob/master/examples/text.rb
This file:
http://dl.dropbox.com/u/175905/test1.pdf
Will hang when calling PDF::Reader.file. The file is unencrypted, has no password, and is PDF v1.6. Stack trace when aborting the hang with CTRL+C:
^C/usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/reference.rb:35:in `match': Interrupt
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/reference.rb:35:in `from_buffer'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/parser.rb:46:in `parse_token'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:357:in `content_stream'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:314:in `walk_pages'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:312:in `each'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:312:in `walk_pages'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:297:in `walk_pages'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:297:in `each'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:297:in `walk_pages'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:284:in `document'
from /Users/colin/Projects/third-party/pdf-reader/lib/pdf/reader.rb:136:in `parse'
from /Users/colin/Projects/third-party/pdf-reader/lib/pdf/reader.rb:76:in `file'
from /Users/colin/Projects/third-party/pdf-reader/lib/pdf/reader.rb:75:in `open'
from /Users/colin/Projects/third-party/pdf-reader/lib/pdf/reader.rb:75:in `file'
Use .gitattributes to solve the issue of specs failing on windows because of CRLF conversion.
I received the following error while reading a PDF using the built-in TextReceiver. Manually responding to callbacks in a similar fashion to the built-in receiver, such as @instance_variable << string, does not produce the same error, yet the error comes up on that line.
pdf-reader-0.8.5/lib/pdf/reader/text_receiver.rb:148:in `show_text': 4294965045 out of char range (RangeError)
If you generate a very simple pdf from MS Office2010 (PDF Version 1.5) you will be able to parse it.
But all the lines extracted with something like
will have an extra space at the end of each line, even on the last one.
It seems to be a whitespace character (I have written a unit test for that).
It is not a big issue, but I think it is a sign of a more subtle problem.
When parsing a PDF file, the correct content is:
1.乳突(mastoid process)是位在下列那一骨頭上?
A.蝶骨(sphenoid bone)
B.枕骨(occipital bone)
C.顳骨(temporal bone)
D.顴骨(zygomatic bone)
but I got this:
\n \nmastoid process \n乳突()是位在下列那一骨頭上?", "sphenoid bone\n蝶骨()\nA.\noccipital bone\n枕骨()\nB.\n \n \ntemporal bone\n顳骨()\nC.\nzygomatic bone\n顴骨()\nD.\n \n
How to resolve this problem?
thanks.
It does not read Encrypted PDF files.
Trying to parse a PDF this morning, I ran into a space encoding issue described in the google group.
I don't know anything about PDF internals but took a stab at trying to fix it. It ended up working great for the particular PDF I needed to parse. I also added in a delimiter b/c I needed to delimit tabular data, but please ignore that part since it's separate from space encoding.
My changes: https://github.com/huned/pdf-reader/commit/7d63e68721828e25bf00442331b276a704eedb86. I'm not sending a pull request because I know this isn't the correct fix. 14 specs fail, and I'm hardcoding stuff.
So I'm wondering: is it worth trying to graduate this fix to a real solution? The three things I need to do are:
But I don't really know anything about PDFs. I'm happy to hack on it some more with some guidance if this seems like a viable path. Thoughts?
Hi and first of all thanks for this awesome gem!
I have a PDF file taken from http://www.irs.gov/pub/irs-pdf/fw4.pdf and it contains fields to be filled. I fill some fields with, for example, Blah-blah
and save a copy of PDF file. However, when I try to read its pages text, filled text is not shown.
reader = PDF::Reader.new('copy.pdf')
reader.page(1).text.include?('Blah-blah')
#=> false
I have a few .pdf files that cause pdf-reader to get stuck in a buffering loop. In one case, it appeared that the parser encountered a double nil, and I was able to get around it by adding "@io.pos=@pos+1" in buffer.rb's prepare_literal_token. The parser then makes it about twice as far into the file before hanging again (100% CPU), but I can't find the cause this time. I would be happy to send you the .pdf file and welcome any pointers. Thanks!
The source of the issue we saw that was classified as: #17 (comment) has an embedded image that does not have any space between the end of the image data and the EI token. I pulled out the section of the PDF that was causing the failure and added the spec here: rstawarz@9601d33
Reading the PDF spec it seems the embedded image definition follows that of the stream objects (section 7.3.8) which is beautifully written as "There should be an end-of-line marker after the data and before endstream;"... the operative word being 'should be'.
I was going to change the buffer parser directly but you have specs in there that specify that an 'EI' should be allowed inside the image stream. Without implementing some sort of look ahead, it seems the two are mutually exclusive. Any thoughts?
I used the sample text.rb program, fed it a small PDF (RubyMine_ReferenceCard.pdf) and got mostly good text output. However, there were many places in the output where an extra character (the number '2') was inserted. For example:
Alt + Shift + N Navigate to Rails 2model/view/controlle2r etc.Ctrl + F FindCtrl + Space Basic code completion2 (the name of any cl2ass, method
Alt + F2 Preview Rails View2 in browserF3 Find next or variable)
The only valid '2' is the one in "Alt + F2". The missing carriage returns ("etc.Ctrl", "FindCtrl", "browserF3") are not an issue; this is a three-column document.
Is this just the way of PDFs or is there some problem here?
Thanks for a great gem!
We're trying out pdf-reader for an internal tool, but there are some issues with the way vertical spaces are handled. We have a PDF that renders like this:
But is parsed without any whitespace by pdf-reader:
Billing PeriodDaysReading-Reading=DifferencexBilling Factor=Total Therms
We can't post the raw PDF file for legal reasons, but here is a snippet of the raw content that the gem generates around the particular issue in question:
BT 24.00 527.04 Td /F0201 7.00 Tf [(Billing )-12(Period)] TJ ET
BT 109.92 527.04 Td /F0201 7.00 Tf [(Days)] TJ ET
BT 136.56 534.96 Td /F0201 7.00 Tf [(Current)] TJ ET
BT 136.56 527.04 Td /F0201 7.00 Tf [(Reading)] TJ ET
BT 166.08 527.04 Td /F0701 7.00 Tf (-) Tj ET
BT 175.92 534.96 Td /F0201 7.00 Tf [(Previous)] TJ ET
BT 175.92 527.04 Td /F0201 7.00 Tf [(Reading)] TJ ET
BT 208.80 527.04 Td /F0701 7.00 Tf (=) Tj ET
BT 219.84 527.04 Td /F0201 7.00 Tf [(Difference)] TJ ET
BT 257.52 527.04 Td /F0701 7.00 Tf (x) Tj ET
BT 267.36 527.04 Td /F0201 7.00 Tf [(Billing )-12(Factor)] TJ ET
BT 321.60 527.04 Td /F0701 7.00 Tf (=) Tj ET
BT 348.00 527.04 Td /F0201 7.00 Tf [(Total )-12(Therms)] TJ ET
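To illustrate what we expected, here is a minimal sketch that rebuilds lines from the positions in the snippet above. It is not how pdf-reader works internally: the `Run` struct and `lines_from_runs` are hypothetical names, and joining neighbouring runs with a single space is a simplifying assumption (a proper layout would need glyph widths to size the gaps).

```ruby
# Each run is the text shown by one BT ... ET block, with the x/y from its Td.
Run = Struct.new(:x, :y, :text)

runs = [
  Run.new(24.00, 527.04, "Billing Period"),
  Run.new(109.92, 527.04, "Days"),
  Run.new(136.56, 534.96, "Current"),
  Run.new(136.56, 527.04, "Reading"),
  Run.new(166.08, 527.04, "-"),
  Run.new(175.92, 534.96, "Previous"),
  Run.new(175.92, 527.04, "Reading"),
  Run.new(208.80, 527.04, "="),
  Run.new(219.84, 527.04, "Difference"),
  Run.new(257.52, 527.04, "x"),
  Run.new(267.36, 527.04, "Billing Factor"),
  Run.new(321.60, 527.04, "="),
  Run.new(348.00, 527.04, "Total Therms")
]

# Group runs that share a baseline, order lines top to bottom (larger y is
# higher on the page), order runs left to right, and separate them by a space.
def lines_from_runs(runs)
  runs.group_by(&:y)
      .sort_by { |y, _| -y }
      .map { |_, row| row.sort_by(&:x).map(&:text).join(" ") }
end

puts lines_from_runs(runs)
# Current Previous
# Billing Period Days Reading - Reading = Difference x Billing Factor = Total Therms
```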
I get this error on some pages when I try to access the text of a page:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page_text_receiver.rb:194:in `[]': can't convert Symbol into Integer (TypeError)
The rest of the context:
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page_text_receiver.rb:194:in `invoke_xobject'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/form_xobject.rb:61:in `block in callback'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/form_xobject.rb:60:in `each'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/form_xobject.rb:60:in `callback'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/form_xobject.rb:72:in `content_stream'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/form_xobject.rb:47:in `walk'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page_text_receiver.rb:200:in `invoke_xobject'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page.rb:192:in `block in callback'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page.rb:191:in `each'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page.rb:191:in `callback'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page.rb:178:in `content_stream'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page.rb:150:in `walk'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page.rb:120:in `text'
As you can see this is from the pdf-reader-0.12.0.alpha.
Thanks.
Morten
Hi, Yob, I have a strange issue. When I use the gem in a Rails environment, everything works.
But if I try to use your gem in plain Ruby or in irb, I get the following:
ruby-1.9.2-p290 :006 > a = File.absolute_path('./somefile.pdf')
=> "/Users/username/projects/present_work/pdfparser/somefile.pdf"
ruby-1.9.2-p290 :007 > File.exist? a
=> true
ruby-1.9.2-p290 :008 > reader = PDF::Reader.new(a)
=> #<PDF::Reader:0x00000100919960>
ruby-1.9.2-p290 :009 > reader.pages
NoMethodError: undefined method `pages' for #<PDF::Reader:0x00000100919960>
  from (irb):9
  from /Users/username/.rvm/rubies/ruby-1.9.2-p290/bin/irb:16:in
Please answer, I really need this =(
Hi,
I encountered an issue parsing PDF files, and found a work-around that others might find useful.
The PDF I am parsing is basically tabular data (converted from a spreadsheet). There are multiple cells and columns. The pdf-reader should be returning the text on a line-by-line basis. What I found was that on occasion, some cells in my 'table' were being read out of order. For example:
--------------------------------------------
aaaaa | bbbbbb | ccccc | dddddd
eeeee | ffffff | ggggg | hhhhhh
--------------------------------------------
xxxxxx | yyyyyy | zzzzz | zzzzzzz
Hopefully you get the idea. What should be returned is a text stream like:
aaaaa bbbbb ccccc ddddd eeeee fffffffff ggggg ....
and so on
What I was getting (occasionally) was:
aaaaa bbbbb ddddd ccccc eeeee fffffff ggggg
So you can see the data was returned out-of-order for part of the table. I tracked this down to the following location:
pdf/reader/page_text_receiver.rb (function show_text)
def show_text(string) # Tj
  raise PDF::Reader::MalformedPDFError, "current font is invalid" if @state.current_font.nil?
  newx, newy = @state.trm_transform(0,0)
  @content[newy] ||= ""
  @content[newy] << @state.current_font.to_utf8(string)
end
The problem is the way the newy variable is calculated. pdf-reader appears to store all the text it parses in a Hash, keyed by the y-coordinate at which the text occurred. In my situation, the y-coordinate of each block of text had a tiny amount of variation, enough to result in:
YCOORD  | TEXT
---------------------------
303.91    aaaaaa
303.91    bbbbbb
303.92    cccccccc
303.91    ddddddd
350.001   eeeeee
350.001   fffffffff
350.001   ggggg
Look at the y-coord for 'cccccccc'. Even though the difference is tiny (.92 vs .91), it is enough for the 'cccccccc' text to be inserted into its own 'row' in the @content Hash, and subsequently be returned out of order when we read the text with pdf.page(x).text. I don't know why this 'error' in the y-coord is there, but it occurs in the PDFs I am parsing.
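To make the effect concrete, here is a small standalone sketch that mimics the keyed-by-y behaviour with the values from the table above. It is not the pdf-reader code itself; `collect_rows` is a hypothetical name and the trailing space after each piece of text is only there to keep the output readable.

```ruby
# (y, text) pairs in the order the text is drawn.
runs = [
  [303.91,  "aaaaaa"],
  [303.91,  "bbbbbb"],
  [303.92,  "cccccccc"],
  [303.91,  "ddddddd"],
  [350.001, "eeeeee"],
  [350.001, "fffffffff"],
  [350.001, "ggggg"]
]

# Append each piece of text to the row stored under its exact y value.
def collect_rows(runs)
  content = {}
  runs.each do |y, text|
    content[y] ||= +""
    content[y] << text << " "
  end
  content
end

rows = collect_rows(runs)
rows.each { |y, text| puts "#{y}: #{text.strip}" }
# 303.91: aaaaaa bbbbbb ddddddd
# 303.92: cccccccc
# 350.001: eeeeee fffffffff ggggg
```

The 303.92 key ends up as a separate row, so 'cccccccc' is no longer between 'bbbbbb' and 'ddddddd'.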
The solution was to round the y-coordinate to a whole number before inserting it into the @content hash. Just copy and paste the code below into your Ruby program (no need to patch the original code):
module PDF
  class Reader
    class PageTextReceiver
      def show_text(string) # Tj
        raise PDF::Reader::MalformedPDFError, "current font is invalid" if @state.current_font.nil?
        newx, newy = @state.trm_transform(0,0)
        newy = newy.round(0)
        @content[newy] ||= ""
        @content[newy] << @state.current_font.to_utf8(string)
      end
    end
  end
end
Hopefully this is useful for somebody else! Rounding to the nearest whole number worked for me. I don't know enough about the pdf coordinate system to know if that is a generic solution though...
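One caveat with rounding is that two values on either side of a .5 boundary (say 303.49 and 303.51) still land in different rows. A possible alternative, sketched here as an assumption rather than a tested fix, is to attach each piece of text to an existing row whose y is within a small tolerance. `group_by_baseline` and the default tolerance of 0.5 are hypothetical and would need tuning against the font sizes in a real document.

```ruby
# Group (y, text) pairs into rows, treating y values within `tolerance`
# of a row's first y as the same baseline. Rows keep first-seen order.
def group_by_baseline(runs, tolerance = 0.5)
  rows = []
  runs.each do |y, text|
    row = rows.find { |r| (r[:y] - y).abs <= tolerance }
    if row
      row[:texts] << text
    else
      rows << { y: y, texts: [text] }
    end
  end
  rows
end

table = [
  [303.91,  "aaaaaa"],
  [303.91,  "bbbbbb"],
  [303.92,  "cccccccc"],
  [303.91,  "ddddddd"],
  [350.001, "eeeeee"],
  [350.001, "fffffffff"],
  [350.001, "ggggg"]
]

group_by_baseline(table).each { |row| puts row[:texts].join(" ") }
# aaaaaa bbbbbb cccccccc ddddddd
# eeeeee fffffffff ggggg
```

With this grouping, 303.49 and 303.51 also share a row, whereas rounding sends them to 303 and 304.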