yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.

License: MIT License

Ruby 99.59% Shell 0.24% HTML 0.17%

pdf-reader's Issues

can't convert Symbol into Integer

Hi! Been very happy with pdf-reader.

Noticed something odd: when calling .text on each page in the following pdf: http://www.tokyo-wako.com/site/menus/arcadia_dining.pdf I get the following error:

can't convert Symbol into Integer
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page_text_receiver.rb:205:in `[]'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page_text_receiver.rb:205:in `invoke_xobject'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/form_xobject.rb:62:in `block in callback'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/form_xobject.rb:61:in `each'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/form_xobject.rb:61:in `callback'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/form_xobject.rb:73:in `content_stream'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/form_xobject.rb:42:in `walk'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page_text_receiver.rb:212:in `invoke_xobject'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page.rb:144:in `block in callback'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page.rb:143:in `each'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page.rb:143:in `callback'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page.rb:130:in `content_stream'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page.rb:95:in `walk'
/Users/lucas/.rvm/gems/ruby-1.9.2-p180/bundler/gems/pdf-reader-8a3dfe7a1f1d/lib/pdf/reader/page.rb:65:in `text'

PDF seems to work fine with regular pdf readers. Personally I don't need it enough to free up time for digging in the pdf-reader source but maybe it's of interest.
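
The top two frames show `[]` raising inside invoke_xobject. That message is what Ruby raises when an Array (or String) is indexed with a Symbol, so one plausible reading is that a value expected to be a Hash was something else. This is an assumption, since the trace does not show the value. A minimal sketch of a defensive lookup, where `lookup_xobject` is a hypothetical helper and not part of pdf-reader:

```ruby
# Hypothetical helper, not part of pdf-reader: look up a named XObject
# in a resources value that may not be the Hash we expect.
def lookup_xobject(resources, name)
  xobjects = resources.is_a?(Hash) ? resources[:XObject] : nil
  xobjects.is_a?(Hash) ? xobjects[name] : nil
end

# Indexing an Array with a Symbol raises a TypeError like the one reported.
begin
  [][:XObject]
rescue TypeError => e
  puts "TypeError: #{e.message}"
end

p lookup_xobject({ :XObject => { :Im1 => "stream" } }, :Im1)
p lookup_xobject([], :Im1)
```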

Some encrypted files don't open

I am generating this PDF via PrinceXML(http://princexml.com): http://dl.dropbox.com/u/599002/DocRaptor/test.pdf

It has no user or owner password set. It is encrypted, however. Based on a brief readthrough of the PDF spec, the trailer seemed correct. It opens without password prompting in Preview and Skim.

The exception trace is:

PDF::Reader::EncryptedPDFError: Invalid password ()
    vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader/standard_security_handler.rb:182:in `build_standard_key'
    vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader/standard_security_handler.rb:66:in `initialize'
    vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader/object_hash.rb:275:in `new'
    vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader/object_hash.rb:275:in `build_security_handler'
    vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader/object_hash.rb:48:in `initialize'
    vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader.rb:116:in `new'
    vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader.rb:116:in `initialize'
    vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader.rb:159:in `new'
    vendor/bundle/ruby/1.8/gems/pdf-reader-1.0.0.rc1/lib/pdf/reader.rb:159:in `open'

Thoughts?

pdf-reader 0.8.6 specs hang with jruby

The specs run fine with ruby 1.8.7, but with jruby they seem to hang:

(in /var/tmp/portage/dev-ruby/pdf-reader-0.8.6/work/jruby/yob-pdf-reader-a05658c)
/usr/share/jruby/lib/ruby/1.8/pathname.rb:263 warning: `*' interpreted as argument prefix
...................................................

Process is quite busy so I would guess a loop is encountered somewhere?

missing shebang in examples/extract_fonts.rb

Hi!
I noticed the following minor issue in version 1.0.0.beta1: the script examples/extract_fonts.rb is missing a shebang on the first line.
Best regards,

Cédric

Problem with move_to_start_of_next_line

I have a shallow understanding of PDF internals, but I think that the bottom of a page is Y=0, am I right? In that case, move_to_start_of_next_line should subtract from the current position, not add to it.

# ./lib/pdf/reader/page_text_receiver.rb
def move_to_start_of_next_line # T*
  move_text_position(0, -state[:text_leading])
end

I was having some problems with a PDF and after this change everything worked as expected.
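
For context, default PDF user space has its origin at the bottom-left with y increasing upward, and as I read the spec, T* behaves like `0 -leading Td`. A minimal sketch of that arithmetic, using a hypothetical `TextCursor` struct that stands in for the receiver's state (it is not pdf-reader's real class):

```ruby
# Hypothetical stand-in for the receiver's text state; not pdf-reader's class.
TextCursor = Struct.new(:x, :y, :leading) do
  # Td: translate the text position by (tx, ty).
  def move_text_position(tx, ty)
    self.x += tx
    self.y += ty
  end

  # T*: move to the start of the next line, i.e. "0 -leading Td".
  def move_to_start_of_next_line
    move_text_position(0, -leading)
  end
end

cursor = TextCursor.new(72, 720, 14)
cursor.move_to_start_of_next_line
puts cursor.y # the baseline moves down the page, toward y = 0
```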

Sideways characters (rotated 90 degrees)

I'm using TextReceiver to parse a document where all of the characters are rotated 90 degrees.

There were two big problems:
1) page_state.font_size always returned zero
2) character displacement was calculated incorrectly

The fix to the first problem was this:

PDF::Reader::PageState.class_eval do
  def font_size
    return state[:text_font_size]
  end
end

The old function multiplied it by transformation_matrix.a (which equals zero in a 90 degree rotation matrix). Although it now returns the proper font size, I'm pretty sure it breaks PageLayout. (I believe that this might also explain line 330 of page_state.rb, but my pdf file has ctm.a == 1 so I'm not sure)

The fix to the character displacement problem was to flip the direction of the matrix multiplication. Based on page 252 of the PDF spec, I think it goes (translation matrix) x T_m

PDF::Reader::TransformationMatrix.class_eval do
  def horizontal_displacement_multiply!(e2)
    newe = (e2 * @a) + @e
    newf = (e2 * @b) + @f
    @e, @f = newe, newf
  end
end

I might take a closer look this weekend.
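
To make the multiplication order concrete, here is a small worked sketch. It assumes the usual PDF convention of writing a matrix as [a, b, c, d, e, f]; the `multiply` helper and the sample numbers are illustrative and not taken from pdf-reader.

```ruby
# Multiply two PDF-style matrices given as [a, b, c, d, e, f],
# each standing for [[a, b, 0], [c, d, 0], [e, f, 1]].
def multiply(m, n)
  a1, b1, c1, d1, e1, f1 = m
  a2, b2, c2, d2, e2, f2 = n
  [a1 * a2 + b1 * c2,      a1 * b2 + b1 * d2,
   c1 * a2 + d1 * c2,      c1 * b2 + d1 * d2,
   e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2]
end

rotated_tm = [0, 1, -1, 0, 100, 200] # text matrix rotated 90 degrees
shift      = [1, 0, 0, 1, 10, 0]     # horizontal displacement tx = 10

# (translation matrix) x T_m: e' = tx * a + e, f' = tx * b + f
p multiply(shift, rotated_tm)
```

With a 90 degree rotation, a is 0 and b is 1, so a horizontal displacement of 10 in text space moves the position along the page's y axis (f goes from 200 to 210) and leaves e untouched.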

PDF::Reader class should provide access to document level attributes

AcroForm, Annots, etc

TextReceiver can't convert Fixnum into String (TypeError) in show_text_with_positioning

Using the following code:

receiver = PDF::Reader::TextReceiver.new($stdout)
PDF::Reader.string(pdf_text, receiver)

With pdf_text using: Hello World.pdf

Causes: can't convert Fixnum into String (TypeError)
See full trace.

Quick Solution (in text_receiver.rb:189):

def show_text_with_positioning (params)
  prev_adjustment = @state.last[:tj_adjustment]

  params.each do |p|
    case p
    when Float, Fixnum # Added Fixnum
      @state.last[:tj_adjustment] = p
    else
      show_text(p)
    end
  end

  @state.last[:tj_adjustment]  = prev_adjustment
end
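
A variation on the same idea, as a sketch only: matching on Numeric covers both integers and floats, and it also avoids naming Fixnum, which newer Ruby versions no longer define (Integer replaced it). The `split_tj_operands` helper below is hypothetical and only illustrates the case statement, not the receiver's state handling.

```ruby
# Hypothetical helper: split a TJ operand array into text and adjustments.
def split_tj_operands(params)
  text = []
  adjustments = []
  params.each do |p|
    case p
    when Numeric then adjustments << p # Integer and Float both match
    else              text << p
    end
  end
  [text, adjustments]
end

text, adjustments = split_tj_operands(["Hel", -120, "lo", 15.5, " World"])
puts text.join
p adjustments
```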

wrong number of arguments (1 for 0)

I keep getting this error for this line:

reader = PDF::Reader.new(@pdf_name)

Doesn't make any sense since that's what's in the examples.
Is this Rails 3 compatible?

spec/ not shipped in gem?

Hi!

Would you mind including the test suite (spec/) in the gem/tarball you distribute? This would allow GNU/Linux distributions like Debian to run the tests during package building, to check that everything works with the Ruby versions and other libraries they ship.

Thank you for your understanding.

Best wishes,

Cédric

NoMethodError: undefined method `unpack' for -314:Fixnum

I got this error when I tried to read this pdf file.

irb(main):001:0> require 'pdf-reader'
=> true
irb(main):002:0> p = PDF::Reader.new '/Users/andrew/Downloads/project_euler-poster.pdf'
=> #<PDF::Reader:0x007fde2b056ae8 @cache=<PDF::Reader::ObjectCache size: 0>, @objects=<PDF::Reader::ObjectHash size: 159>>
irb(main):003:0> p.pages.map(&:text)
NoMethodError: undefined method `unpack' for -314:Fixnum
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/font.rb:78:in `unpack'
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page_text_receiver.rb:95:in `internal_show_text'
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page_text_receiver.rb:62:in `block in show_text_with_positioning'
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page_text_receiver.rb:61:in `each'
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page_text_receiver.rb:61:in `each_slice'
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page_text_receiver.rb:61:in `each'
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page_text_receiver.rb:61:in `show_text_with_positioning'
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page.rb:150:in `block in callback'
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page.rb:149:in `each'
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page.rb:149:in `callback'
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page.rb:136:in `content_stream'
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page.rb:101:in `walk'
    from /Users/andrew/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/page.rb:71:in `text'
    from (irb):3:in `map'
    from (irb):3
    from /Users/andrew/.rbenv/versions/1.9.3-p0/bin/irb:12:in `<main>'

possible release of 0.10.1 with fix of #24?

Hi!
I saw that you fixed issue #24, about the file header with the problematic license, in 1.0.0.beta.
The new version requires ruby-rc4, which has no license stating its conditions for distribution, so it is not fit to enter a Linux distribution. Would it be possible to release a 0.10.1 with the fix for #24, so that there is a version without license issues to work with prawn 1.0?

Best regards

XRef streams are not supported in PDF::Reader yet

gems/pdf-reader-0.8.6/lib/pdf/reader/xref.rb:66:in `load': XRef streams are not supported in PDF::Reader yet (PDF::Reader::UnsupportedFeatureError)
from /Users/alterscape/.rvm/gems/ruby-1.9.2-p0@acm_cloud/gems/pdf-reader-0.8.6/lib/pdf/reader.rb:131:in `parse'
from /Users/alterscape/.rvm/gems/ruby-1.9.2-p0@acm_cloud/gems/pdf-reader-0.8.6/lib/pdf/reader.rb:76:in `block in file'
from /Users/alterscape/.rvm/gems/ruby-1.9.2-p0@acm_cloud/gems/pdf-reader-0.8.6/lib/pdf/reader.rb:75:in `open'
from /Users/alterscape/.rvm/gems/ruby-1.9.2-p0@acm_cloud/gems/pdf-reader-0.8.6/lib/pdf/reader.rb:75:in `file'
from extract_text.rb:36:in

This error is raised when I try to run the PDF at http://sigmm.utdallas.edu/archive/MM/mm96.pdf through the text extraction example at https://github.com/yob/pdf-reader/blob/master/examples/text.rb. I plan to look at the text extraction code myself and see if it's an easy fix/workaround; if so will add a gist.

missing requirement on yaml for integrity_spec.rb

Hi!

The test suite is run during the build of pdf-reader Debian package, by requiring all *_spec files. When doing this, I get the following error:

Failures:

  1) Spec suite PDFs should be intact
     Failure/Error: Unable to find matching line from backtrace
     NameError:
       uninitialized constant YAML
     # ./spec/integrity_spec.rb:18
     # debian/ruby-tests.rb:2

Finished in 61.99 seconds
1084 examples, 1 failure, 4 pending

It seems that there is a require 'yaml' statement missing in spec/integrity_spec.rb (or spec_helper.rb).

Best wishes,

Cédric

Extracting PNG image from PDF file

I tried extracting PNG images from PDF files that we create here, using the example code extract_images.rb, but it failed because it didn't know how to handle FlateDecode streams. I added a new Png handler that is exactly the same as the JPEG one, but the image files it creates cannot be opened. Is there something else that I need to do? I also tried inflating (decompressing) the FlateDecode stream data and writing it out to files, but this did not work either. My understanding of the PDF file format is not great.

Any help appreciated,

Cheers.

P.S: Tried posting this to the mailing list but Google Groups is giving me an error for the past two days now.
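
Some background that may explain the unreadable files: a FlateDecode image stream normally holds zlib-compressed raw samples rather than a complete PNG file, so the inflated bytes still need a PNG container around them. The sketch below wraps raw pixels in one. It assumes 8-bit RGB samples with no predictor, and the helper names `png_chunk` and `raw_rgb_to_png` are hypothetical.

```ruby
require 'zlib'

# Build one PNG chunk: length, type, data, CRC over type + data.
def png_chunk(type, data)
  [data.bytesize].pack('N') + type + data + [Zlib.crc32(type + data)].pack('N')
end

# Wrap raw 8-bit RGB samples (already inflated) in a minimal PNG file.
def raw_rgb_to_png(pixels, width, height)
  stride = width * 3
  # Each PNG scanline starts with a filter-type byte; 0 means "no filter".
  scanlines = (0...height).map { |row|
    "\x00".b + pixels.byteslice(row * stride, stride)
  }.join
  # IHDR: width, height, bit depth 8, colour type 2 (RGB), then three zero flags.
  ihdr = [width, height, 8, 2, 0, 0, 0].pack('NNCCCCC')
  "\x89PNG\r\n\x1a\n".b +
    png_chunk('IHDR', ihdr) +
    png_chunk('IDAT', Zlib::Deflate.deflate(scanlines)) +
    png_chunk('IEND', '')
end

red_square = "\xFF\x00\x00".b * 4 # four red pixels, 2 x 2
png = raw_rgb_to_png(red_square, 2, 2)
puts png.bytesize
```

Images that use a predictor, a different colour space or another bit depth would need more handling than this sketch provides.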

Add simple text example using PDF::Reader.string() and PDF::Reader::TextReceiver

One of the great things about pdf reader is that extracting the pdf text from an existing string (eg web response body) is ultra simple:

result = StringIO.new
PDF::Reader.string(response.body, PDF::Reader::TextReceiver.new(result))
result.string

I've seen several cases (eg here http://agilesoftwaretesting.com/?p=166 and here http://blog.liangzan.net/index.php/2009/12/11/testing-pdfs-with-celerity-culerity-and-cucumber) where people have copied the text example implementation of PageTextReceiver from examples/text.rb, and are saving temporary files to disk because they haven't checked out the rest of the API.

I was thinking examples/text.rb could be simplified to show how to use PDF::Reader.string() and PDF::Reader::TextReceiver.

please consider removing the header of /lib/pdf/reader/glyphlist.txt

Hi!
The license of this file seems to conflict with the freedom to modify the source: according to the first paragraph of that file, I cannot modify it if I want to keep a reference to where the document comes from.

The easiest way to work around this for people distributing your software would be to remove this header and make the file a derivative (with exactly the same content). This is allowed according to the second paragraph of the license. The new file would then be distributed under the same license as the rest of the files.

Or you could use this file:
http://sourceforge.net/projects/aglfn.adobe/files/glyphlist.txt/download
which has exactly the same content, but another (less restrictive) license.

Page#text does not return extra whitespaces between words

There is another change in 1.3.0 that affected our test suite.

It looks like Page#text now returns a single whitespace between words, even for strings that were intentionally created with two or more whitespaces between words.

For example, some date strings have double whitespaces due to the format mask (%l - Hour of the day, 12-hour clock, blank-padded ( 1..12)). Since Page#text does not return more than a single whitespace between words, the test breaks.

Is it desired behavior for Page#text to collapse runs of whitespace into a single whitespace, even though the original string (and the rendered one) has more than one whitespace between words?
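
To illustrate where the double whitespace comes from, here is a small sketch. The date and format string are made up for the example. Squeezing the expected string is one possible workaround for a test suite, not something the library prescribes.

```ruby
# %e and %l are blank-padded, so single-digit values gain a leading space.
stamp = Time.new(2012, 1, 5, 9, 30).strftime("%B %e, %l:%M %p")
p stamp            # => "January  5,  9:30 AM"

# Possible test-side workaround: collapse runs of spaces before comparing.
normalized = stamp.squeeze(" ")
p normalized       # => "January 5, 9:30 AM"
```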

Readme.rdoc examples don't work

Hi there

Love your library, thanks.

The examples in your Readme.rdoc don't work, so I had to use the examples in /examples/ to get things running. Thought it might be worth mentioning that the .rdoc needs an update.

Matt

Invalid Unknown characters in PDF?

OK, I have a PDF file from the Ghent Workgroup test suite, and when I parse it with pdf-reader it shows the Unknown Character in the text. All the characters are there in the PDF file when I open it in a viewer.

Here is the PDF file: http://www.davidakachaos.nl/pdf_files/GWG_Testfile_018a_v4_2008_04_01_NewspaperAds.pdf

I'm trying to debug what's going wrong, but I'm going in circles! Can you please help me? Or nudge me in the right direction?

This is how I saw the wrong chars in PDF Reader

require 'pdf-reader'
filename = 'GWG_Testfile_018a_v4_2008_04_01_NewspaperAds.pdf'
reader = PDF::Reader.new(filename)
reader.pages.first.text

The above code shows text from the PDF with unknown characters. When opening the PDF file in a viewer, everything seems fine...

slow parsing

FYI, a 14-page PDF took 27 seconds to extract the text.

[12] pry(main)> Benchmark.realtime do
[12] pry(main)* pdf_file.pages.map{ |p| p.text }
[12] pry(main)* end
=> 27.4952540397644

Intel 2.2 GHz i7, 8GB RAM

Unable to parse PDF

First of all, thanks for this library! I am seeing the following problem with a PDF that I am trying to open:

gems/pdf-reader-0.10.0/lib/pdf/reader/pages_strategy.rb:354:in `content_stream': Unknown font F1 (PDF::Reader::MalformedPDFError)

I would submit the PDF so you can test it; however, it contains a lot of my personal information (financial data, SSN, etc).

Let me know what other information I can give you. The file is an ADP-generated paystub (iPay).

Thanks,
Benny

PDF::Reader::MalformedPDFError: Dictionary key (0) is not a name

I'm getting this error when trying to read a PDF after processing it with ghostscript.

Here's the original: http://dl.dropbox.com/u/16582920/20110315_stellenanzeige_praktikant.pdf

And the ghostscript version: http://dl.dropbox.com/u/16582920/out.pdf

I'm running the following command:

gs -q -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite -dCompatibilityLevel="1.7" -sOutputFile="out.pdf" -f "in.pdf"

And here's a snippet of the output from pdf-reader:

bundle exec pdf_callbacks out.pdf
....
restore_graphics_state => []
save_graphics_state => []
concatenate_matrix => [1074.6, 0, 0, 32.4, 4163.4, 4262]
begin_inline_image => []
begin_inline_image_data => [:CS, :RGB, :W, 199, :H, 6, :BPC, 8, :F, :Fl, :DP, {:Predictor=>15, :Columns=>199, :Colors=>3}]
end_inline_image => [0]
restore_graphics_state => []
save_graphics_state => []
concatenate_matrix => [1069.2, 0, 0, 37.8, 4168.8, 4224.2]
invoke_xobject => [:R21]
restore_graphics_state => []
save_graphics_state => []
concatenate_matrix => [1063.8, 0, 0, 32.4, 4174.2, 4191.8]
begin_inline_image => []
begin_inline_image_data => [:CS, :RGB, :W, 197, :H, 6, :BPC, 8, :F, :Fl, :DP, {:Predictor=>15, :Columns=>197, :Colors=>3}]
~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:99:in `block in dictionary': Dictionary key (0) is not a name (PDF::Reader::MalformedPDFError)
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:96:in `loop'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:96:in `dictionary'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:51:in `parse_token'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:124:in `block in array'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:123:in `loop'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:123:in `array'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:52:in `parse_token'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:124:in `block in array'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:123:in `loop'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:123:in `array'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/parser.rb:52:in `parse_token'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/page.rb:176:in `content_stream'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/lib/pdf/reader/page.rb:150:in `walk'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/bin/pdf_callbacks:22:in `block in <top (required)>'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/bin/pdf_callbacks:16:in `each'
    from ~/.rvm/gems/ruby-1.9.2-p290/bundler/gems/pdf-reader-758fe91140c0/bin/pdf_callbacks:16:in `<top (required)>'
    from ~/.rvm/gems/ruby-1.9.2-p290/bin/pdf_callbacks:19:in `load'
    from ~/.rvm/gems/ruby-1.9.2-p290/bin/pdf_callbacks:19:in `<main>'
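
As a reading aid for the callback output above: the begin_inline_image_data operands are a flat list of key/value pairs using the abbreviated inline-image key names from the PDF spec. The sketch below pairs them up and expands a subset of the abbreviations. `INLINE_IMAGE_KEYS` and `inline_image_params` are hypothetical names, and nothing here addresses the parser error itself.

```ruby
# Abbreviated inline-image keys and their full names (a subset).
INLINE_IMAGE_KEYS = {
  :CS => :ColorSpace, :W => :Width, :H => :Height,
  :BPC => :BitsPerComponent, :F => :Filter, :DP => :DecodeParms
}.freeze

# Turn the flat operand list into a Hash with expanded key names.
def inline_image_params(flat)
  flat.each_slice(2).map { |key, value|
    [INLINE_IMAGE_KEYS.fetch(key, key), value]
  }.to_h
end

operands = [:CS, :RGB, :W, 199, :H, 6, :BPC, 8, :F, :Fl,
            :DP, { :Predictor => 15, :Columns => 199, :Colors => 3 }]
params = inline_image_params(operands)
p params[:Width]
p params[:DecodeParms]
```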

The Page#text behavior changed

I'd like to know whether the behavior change for Page#text was intentional or not?

It looks like the latest version, when calling Page#text, includes a lot more whitespaces (looks like those inserted for formatting).

I've created two gists showing the different outputs from 1.2.0 to 1.3.0

https://gist.github.com/5ea2e953a890db919136 - 1.3.0
https://gist.github.com/1b83e585469c04f1d0ac - 1.2.0

both refer to the same page calling #text, even though some data is different.

Also, I can see that the current spec only tests #text against a basic PDF. Maybe this is the reason we haven't caught this change earlier.

Can you confirm whether this change was intentional or not?

stack level too deep Ruby 1.9.3-p125

Hi,

I'm encountering a 'stack level too deep' error while trying to work with this PDF file: http://dl.dropbox.com/u/6646130/stack-level-too-deep.pdf

I'm using:

OS X Lion 10.7.3
ruby 1.9.3-p125
gem 'pdf-reader', '1.1.0'

reader = PDF::Reader.new("stack-level-too-deep.pdf")
puts reader.pdf_version
puts reader.info
puts reader.page_count
puts reader.metadata
text   = reader.pages.map{ |page| page.text.strip }.join(' ')

The PDF is:

reader.pdf_version: 1.7
reader.page_count: 64

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 2560
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 709
virtual memory          (kbytes, -v) unlimited

I doubled the stack size hoping that might help but it didn't:

$ ulimit -s 16384

Any thoughts on how to resolve this?

Thanks,
Matt

Multiple columns on a page

When evaluating multiple columns on a page the TextReceiver will read the entire horizontal line as one instead of reading the entire column then reading the next column.

Line of text gets split and reversed

I use pdf-reader to extract the text from pdf-documents, and it works great, except for one problem. In one of the text files there is one line of text that gets messed up. A line is split in two right in the middle of a word, and then the parts are displayed on separate lines in reverse order.
Example of what I mean:

Text in PDF

This is the first line
This is the second line
This is the third line

What page.text yields

This is the first line
econd line
This is the s
This is the third line

I have tested 4 PDF so far, and it only happens with one of them. I tried an online converter and it did not occur there. If you wanna look into it I could send you the PDF where it happens.

Uninitialized constant pdf

This error appears:
C:/Documents and Settings/Wxyz/My documents/NetBeansProjects/Books/lib/main.rb:47: uninitialized constant PDF (NameError)
in the line:
receiver = PageTextReceiver.new
==> pdf = PDF::Reader.file("isla.pdf", receiver)
puts receiver.content.inspect
I don't know why, please help me.
I'm using NetBeans 6.8, Ruby 1.8.7 and pdf-reader 0.8.3.

page.text : wrong number of arguments (1 for 6)

Gemfile

gem 'pdf-reader', :git => "git://github.com/yob/pdf-reader.git"

I wrote a simple test in my app:

describe PDF do
  describe PDF::Reader do
    it "should parse target pdf" do
      reader = ::PDF::Reader.new("./spec/resources/target.pdf")
      pages = reader.pages
      puts "pages : #{pages.count}"
      pages.each do |page|
        t = page.text
      end
    end
  end
end

I get the following error :

PDF PDF::Reader should parse target pdf
Failure/Error: t = page.text
ArgumentError:
wrong number of arguments (1 for 6)
./spec/models/pdf_spec.rb:10

./spec/models/pdf_spec.rb:9:in `each'

./spec/models/pdf_spec.rb:9

Seems there is something broken.

Extract Images from PDF

I'm trying to extract the images from a PDF, using the extract images example from the examples section.

The PDF I'm trying to parse has an image with filter type FlateDecode. In that example's code it falls into the else branch, because the filter doesn't match any of the filters handled explicitly in the if conditions.

It then gives me "unsupport color depth". When I inspected the hash of the image I'm trying to parse, I got this:

{:Length=>24407, :Type=>:XObject, :Subtype=>:Image, :Width=>4810, :Height=>1302, :ImageMask=>true, :BitsPerComponent=>1, :Filter=>:FlateDecode}

As seen, there is no ColorSpace key in this hash. Is there a way to save this kind of image?

Hoping for a quick and favourable reply.
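
For what it's worth, the dictionary above describes an image mask: per the PDF spec, when ImageMask is true the image has one bit per sample and no ColorSpace entry. A rough sketch of how the inflated data could be sized and sanity-checked, assuming no predictor is in use; `mask_row_bytes` and `inflate_mask` are hypothetical helpers, not pdf-reader API.

```ruby
require 'zlib'

# Bytes per row for a 1-bit image: each row is padded to a whole byte.
def mask_row_bytes(width)
  (width + 7) / 8
end

# Hypothetical helper: inflate a FlateDecode mask stream and check its size.
def inflate_mask(compressed, width, height)
  raw = Zlib::Inflate.inflate(compressed)
  expected = mask_row_bytes(width) * height
  unless raw.bytesize == expected
    raise ArgumentError, "expected #{expected} bytes, got #{raw.bytesize}"
  end
  raw
end

# For the 4810 x 1302 mask in the report:
puts mask_row_bytes(4810)        # 602 bytes per row
puts mask_row_bytes(4810) * 1302 # 783804 bytes in total
```

Data of that shape could then be written out as a 1-bit image in whatever format is convenient.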

Extract all text from a single PDF

require 'rubygems'
require 'pdf/reader'

filename = File.expand_path(File.dirname(__FILE__)) + "/var/www/scaled_scores/output.pdf"

PDF::Reader.open(filename) do |reader|
  reader.pages.each do |page|
    puts page.text
  end
end

This code is not working in irb, although I have installed pdf-reader. The error I am getting is below:

ArgumentError: input must be an IO-like object or a filename
from /home/eossysd7/.rvm/gems/ruby-1.9.3-p194@veritas/gems/pdf-reader-1.2.0/lib/pdf/reader/object_hash.rb:337:in `extract_io_from'
from /home/eossysd7/.rvm/gems/ruby-1.9.3-p194@veritas/gems/pdf-reader-1.2.0/lib/pdf/reader/object_hash.rb:43:in `initialize'
from /home/eossysd7/.rvm/gems/ruby-1.9.3-p194@veritas/gems/pdf-reader-1.2.0/lib/pdf/reader.rb:115:in `new'
from /home/eossysd7/.rvm/gems/ruby-1.9.3-p194@veritas/gems/pdf-reader-1.2.0/lib/pdf/reader.rb:115:in `initialize'
from /home/eossysd7/.rvm/gems/ruby-1.9.3-p194@veritas/gems/pdf-reader-1.2.0/lib/pdf/reader.rb:158:in `new'
from /home/eossysd7/.rvm/gems/ruby-1.9.3-p194@veritas/gems/pdf-reader-1.2.0/lib/pdf/reader.rb:158:in `open'
from (irb):16
from /home/eossysd7/.rvm/rubies/ruby-1.9.3-p194/bin/irb:16:in `

I look forward to your assistance.

Regards
Nishanth
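
One thing worth checking here, offered as a guess rather than a diagnosis: the filename is built by appending an absolute-looking path to the current script directory, so the result may point at a file that does not exist. The sketch below only shows what the concatenation produces; the variable names are illustrative.

```ruby
# Appending a path that begins with "/" to a directory does not yield
# "/var/www/..."; it nests that path under the script directory.
script_dir = File.expand_path(File.dirname(__FILE__))
joined     = script_dir + "/var/www/scaled_scores/output.pdf"
absolute   = "/var/www/scaled_scores/output.pdf"

puts joined
puts absolute
puts File.file?(joined) # worth checking before handing the path to a reader
```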

Undefined method 'dump' for Psych:Module

Hello
I find myself needing to parse a rather big PDF file, and I stumbled upon your library while searching for a way to solve this task in Ruby.

However, I'm getting a NoMethodError when iterating over the pages in the PDF (it happens around the 12th page).
I'm using version 1.0.0.beta1, running on Ruby 1.9.2-p180 on Mac OS X 10.6.

PDF file:
ftp://medical.nema.org/medical/dicom/2011/11_06pu.pdf

require 'pdf/reader'
f = File.open("11_06pu.pdf")
reader = PDF::Reader.new(f)
reader.pages.each do |page|
  puts page.text.length
end

/Users/chris/.rvm/gems/ruby-1.9.2-p180/gems/pdf-reader-1.0.0.beta1/lib/pdf/reader/page_text_receiver.rb:261:in `clone_state':
undefined method `dump' for Psych:Module (NoMethodError)

undefined method `match' for nil:NilClass when passing file of 0 Bytes

Hello,

When using a PDF file of size 0 Bytes, we got the following error:

undefined method `match' for nil:NilClass

The error is raised from object_hash.rb at line 326.

Here is the stack trace:

undefined method `match' for nil:NilClass
/srv/fundlook_testing/shared/bundle/ruby/1.8/gems/pdf-reader-1.0.0/lib/pdf/reader/object_hash.rb:326:in `read_version'
/srv/fundlook_testing/shared/bundle/ruby/1.8/gems/pdf-reader-1.0.0/lib/pdf/reader/object_hash.rb:44:in `initialize'
/srv/fundlook_testing/shared/bundle/ruby/1.8/gems/pdf-reader-1.0.0/lib/pdf/reader.rb:116:in `new'
/srv/fundlook_testing/shared/bundle/ruby/1.8/gems/pdf-reader-1.0.0/lib/pdf/reader.rb:116:in `initialize'
....
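
Until the library reports a clearer error for this case, a caller could screen out empty files first. A minimal sketch, assuming a simple check of the file size and the leading "%PDF-" marker is good enough for the caller's purposes; `looks_like_pdf?` is a hypothetical helper, not part of pdf-reader.

```ruby
require 'tempfile'

# Hypothetical pre-check: true when the file is non-empty and starts with "%PDF-".
def looks_like_pdf?(path)
  return false unless File.file?(path) && File.size(path) > 0
  File.open(path, "rb") { |io| io.read(5) } == "%PDF-"
end

# A zero-byte file is rejected before any parsing is attempted.
Tempfile.create("empty") do |f|
  puts looks_like_pdf?(f.path)
end
```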

Missing tag for 0.8.3

Hi,

the tag for release 0.8.3 is missing (and in Gentoo I'm relying on git tagged downloads otherwise I'm unable to run tests).

Thanks,
Diego

Supplied PDF Causes PDF::Reader.file to Freeze

Using the code detailed in:

http://github.com/yob/pdf-reader/blob/master/examples/text.rb

This file:

http://dl.dropbox.com/u/175905/test1.pdf

Will hang when calling PDF::Reader.file. The file is unencrypted, has no password and is PDF v1.6. Stack trace when aborting the hang with CTRL+C:

^C/usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/reference.rb:35:in `match': Interrupt
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/reference.rb:35:in `from_buffer'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/parser.rb:46:in `parse_token'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:357:in `content_stream'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:314:in `walk_pages'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:312:in `each'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:312:in `walk_pages'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:297:in `walk_pages'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:297:in `each'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:297:in `walk_pages'
from /usr/local/lib/ruby/gems/1.8/gems/pdf-reader-0.8.1/lib/pdf/reader/content.rb:284:in `document'
from /Users/colin/Projects/third-party/pdf-reader/lib/pdf/reader.rb:136:in `parse'
from /Users/colin/Projects/third-party/pdf-reader/lib/pdf/reader.rb:76:in `file'
from /Users/colin/Projects/third-party/pdf-reader/lib/pdf/reader.rb:75:in `open'
from /Users/colin/Projects/third-party/pdf-reader/lib/pdf/reader.rb:75:in `file'

Use .gitattributes

Use .gitattributes to solve the issue of specs failing on windows because of CRLF conversion.

"Out of range" character error

I received the following error while reading a PDF using the built-in TextReceiver. Manually responding to callbacks in a similar fashion to the built-in receiver, such as with @instance_variable << string, does not produce the same error, yet the error comes up on that line.

pdf-reader-0.8.5/lib/pdf/reader/text_receiver.rb:148:in `show_text': 4294965045 out of char range (RangeError)
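
A note on the number in that message: 4294965045 equals 2**32 - 2251, which is how -2251 looks when read as an unsigned 32-bit value. Whether a negative number is really reaching show_text is an assumption, since the report does not include the PDF. The sketch below shows the arithmetic and a defensive conversion; `safe_chr` is a hypothetical helper, not pdf-reader code.

```ruby
MAX_CODEPOINT = 0x10FFFF # highest valid Unicode code point

# Hypothetical helper: convert a code point to a UTF-8 string, falling back
# to a replacement when the value is outside the Unicode range.
def safe_chr(codepoint, replacement = "?")
  return replacement unless codepoint.between?(0, MAX_CODEPOINT)
  [codepoint].pack("U")
rescue RangeError
  replacement
end

puts 2**32 - 2251         # 4294965045, the value from the error message
puts safe_chr(65)         # "A"
puts safe_chr(4294965045) # falls back to "?"
```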

Trailing white space parsing an Office 2010 pdf

If you generate a very simple pdf from MS Office2010 (PDF Version 1.5) you will be able to parse it.
But all the lines extracted with something like

will have an extra space at the end of each line, even on the last one.
It seems to be a whitespace character (I have written a unit test for that).

It is not a big issue, but I think it is a sign of a more subtle problem.

Problem when Chinese mixing with English

When parsing a PDF file, I expected this content:

1.乳突(mastoid process)是位在下列那一骨頭上?
A.蝶骨(sphenoid bone)
B.枕骨(occipital bone)
C.顳骨(temporal bone)
D.顴骨(zygomatic bone)

but got this:

\n \nmastoid process \n乳突（）是位在下列那一骨頭上？", "sphenoid bone\n蝶骨（）\nA.\noccipital bone\n枕骨（）\nB.\n \n \ntemporal bone\n顳骨（）\nC.\nzygomatic bone\n顴骨（）\nD.\n \n

How to resolve this problem?

thanks.

Encrypted PDFs

It does not read Encrypted PDF files.

interpret coordinate-encoded cursor movement as space characters

Trying to parse a PDF this morning, I ran into a space encoding issue described in the google group.

I don't know anything about PDF internals but took a stab at trying to fix it. It ended up working great for the particular PDF I needed to parse. I also added in a delimiter b/c I needed to delimit tabular data, but please ignore that part since it's separate from space encoding.

My changes: https://github.com/huned/pdf-reader/commit/7d63e68721828e25bf00442331b276a704eedb86. I'm not sending a pull request because I know this isn't the correct fix. 14 specs fail, and I'm hardcoding stuff.

So I'm wondering: is it worth trying to graduate this fix to a real solution? The three things I need to do are:

understand the correct way to detect whether or not I should convert coordinates to spaces
robustly interpret cursor distances as one or more spaces
make sure it doesn't break any specs

But I don't really know anything about PDFs. I'm happy to hack on it some more with some guidance if this seems like a viable path. Thoughts?
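
The second item in the list above, turning cursor distances into spaces, can be sketched roughly as follows. This is not the linked commit and not pdf-reader's code: `Run`, `join_runs` and the `space_width` parameter are hypothetical, and the sketch assumes the position and rendered width of each text run on a line are already known, which in practice needs font metrics.

```ruby
# Hypothetical structure: one text run on a single line of the page.
Run = Struct.new(:x, :width, :text)

# Joins runs left to right, converting horizontal gaps into spaces.
# space_width is the assumed width of one space in the same units as x.
def join_runs(runs, space_width)
  out = +""
  prev_end = nil
  runs.sort_by(&:x).each do |run|
    if prev_end
      gap = run.x - prev_end
      count = (gap.to_f / space_width).round
      out << " " * count if count > 0 # overlapping runs add nothing
    end
    out << run.text
    prev_end = run.x + run.width
  end
  out
end
```

Rounding the gap to a whole number of spaces keeps tabular columns roughly aligned; a fixed single space would be simpler but loses column widths.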

Reading text from PDF with fields

Hi and first of all thanks for this awesome gem!

I have a PDF file taken from http://www.irs.gov/pub/irs-pdf/fw4.pdf and it contains fields to be filled. I fill some fields with, for example, Blah-blah and save a copy of PDF file. However, when I try to read its pages text, filled text is not shown.

reader = PDF::Reader.new('copy.pdf')
reader.page(1).text.include?('Blah-blah')
#=> false

hangs on moderately complex .pdf

I have a few .pdf files that cause pdf-reader to get stuck in a buffering loop. In one case, it appeared that the parser encountered a double nil, and I was able to get around it by adding "@io.pos=@pos+1" in buffer.rb's prepare_literal_token. The parser then makes it about twice as far into the file before hanging again (100% CPU), but I can't find the cause this time. I would be happy to send you the .pdf file and welcome any pointers. Thanks!

Embedded images do not require a space before the EI token

The source of the issue we saw that was classified as: #17 (comment) has an embedded image that does not have any space between the end of the image data and the EI token. I pulled out the section of the PDF that was causing the failure and added the spec here: rstawarz@9601d33

Reading the PDF spec it seems the embedded image definition follows that of the stream objects (section 7.3.8) which is beautifully written as "There should be an end-of-line marker after the data and before endstream;"... the operative word being 'should be'.

I was going to change the buffer parser directly but you have specs in there that specify that an 'EI' should be allowed inside the image stream. Without implementing some sort of look ahead, it seems the two are mutually exclusive. Any thoughts?
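
One shape the look-ahead could take is sketched below. It is a hypothetical helper, not pdf-reader's buffer code: it accepts an "EI" as the terminator only when whitespace or the end of input follows it, and does not require whitespace in front of it. The whitespace set is the six characters the PDF specification treats as whitespace.

```ruby
# The six PDF whitespace bytes: NUL, tab, LF, form feed, CR, space.
PDF_WHITESPACE = "\x00\t\n\f\r ".b.freeze

# Hypothetical helper: returns the byte offset of the terminating "EI",
# or nil when no candidate is found.
def inline_image_end(data, start = 0)
  bytes = data.b # work on raw bytes so binary image data is safe to scan
  pos = start
  while (idx = bytes.index("EI", pos))
    after = bytes[idx + 2]
    return idx if after.nil? || PDF_WHITESPACE.include?(after)
    pos = idx + 1 # "EI" inside the image data, keep scanning
  end
  nil
end
```

This is still a heuristic: image data that happens to contain "EI" followed by a whitespace byte would end the image early, so a stricter version might also check that a valid operator follows.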

add specs with a sample PDF that has encrypted streams but plain text metadata

Many extra '2' characters in text output

I used the sample text.rb program, fed it a small PDF (RubyMine_ReferenceCard.pdf) and got mostly good text output. However, there were many places in the output where an extra character (the number '2') was inserted. For example:

Alt + Shift + N Navigate to Rails 2model/view/controlle2r etc.Ctrl + F FindCtrl + Space Basic code completion2 (the name of any cl2ass, method  
Alt + F2 Preview Rails View2 in browserF3 Find next or variable)

The only valid '2' is the one in "Alt + F2". The missing carriage returns ("etc.Ctrl", "FindCtrl", "browserF3") are not an issue; this is a three-column document.

Is this just the way of PDFs or is there some problem here?

Thanks for a great gem!

Handling of vertical spaces

We're trying out pdf-reader for an internal tool, but there are some issues with the way vertical spaces are handled. We have a PDF that renders like this:

But is parsed without any whitespace by pdf-reader:

Billing PeriodDaysReading-Reading=DifferencexBilling Factor=Total Therms

We can't post the raw PDF file for legal reasons, but here is a snippet of the raw content that the gem generates around the particular issue in question:

BT 24.00 527.04 Td /F0201 7.00 Tf [(Billing )-12(Period)] TJ ET
BT 109.92 527.04 Td /F0201 7.00 Tf [(Days)] TJ ET
BT 136.56 534.96 Td /F0201 7.00 Tf [(Current)] TJ ET
BT 136.56 527.04 Td /F0201 7.00 Tf [(Reading)] TJ ET
BT 166.08 527.04 Td /F0701 7.00 Tf (-) Tj ET
BT 175.92 534.96 Td /F0201 7.00 Tf [(Previous)] TJ ET
BT 175.92 527.04 Td /F0201 7.00 Tf [(Reading)] TJ ET
BT 208.80 527.04 Td /F0701 7.00 Tf (=) Tj ET
BT 219.84 527.04 Td /F0201 7.00 Tf [(Difference)] TJ ET
BT 257.52 527.04 Td /F0701 7.00 Tf (x) Tj ET
BT 267.36 527.04 Td /F0201 7.00 Tf [(Billing )-12(Factor)] TJ ET
BT 321.60 527.04 Td /F0701 7.00 Tf (=) Tj ET
BT 348.00 527.04 Td /F0201 7.00 Tf [(Total )-12(Therms)] TJ ET
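
The snippet carries enough position data to rebuild the layout. As a rough sketch only, the hypothetical `layout_text` below parses lines of exactly this shape with a regular expression (it is not a content-stream parser and not how pdf-reader works), groups the runs by their y-coordinate, orders each row by x, and joins the runs with a single space. Kerning numbers such as -12 inside TJ arrays are ignored.

```ruby
# Matches one "BT x y Td /Font size Tf <text> TJ ET" line as shown above.
TEXT_RUN = /BT\s+(-?[\d.]+)\s+(-?[\d.]+)\s+Td\s+\S+\s+[\d.]+\s+Tf\s+(.+?)\s+T[Jj]\s+ET/

# Hypothetical helper: returns one string per row, top of the page first.
def layout_text(content)
  runs = content.each_line.map do |line|
    m = TEXT_RUN.match(line)
    next unless m
    # Concatenate every (...) string; escaped parentheses are not handled.
    text = m[3].scan(/\(([^)]*)\)/).flatten.join
    { x: m[1].to_f, y: m[2].to_f, text: text.strip }
  end.compact

  runs.group_by { |r| r[:y] }
      .sort_by { |y, _| -y } # larger y is higher on the page
      .map { |_, row| row.sort_by { |r| r[:x] }.map { |r| r[:text] }.join(" ") }
end
```

Grouping on exact y values works for this sample because the coordinates repeat exactly; a tolerance would be needed for less regular PDFs.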

page_text_receiver.rb - can't convert Symbol into Integer (Type error)

I get this error on some pages when I try to access the text of a page:

C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page_text_receiver.rb:194:in `[]': can't convert Symbol into Integer (TypeError)

The rest of the context:

from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page_text_receiver.rb:194:in `invoke_xobject'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/form_xobject.rb:61:in `block in callback'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/form_xobject.rb:60:in `each'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/form_xobject.rb:60:in `callback'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/form_xobject.rb:72:in `content_stream'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/form_xobject.rb:47:in `walk'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page_text_receiver.rb:200:in `invoke_xobject'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page.rb:192:in `block in callback'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page.rb:191:in `each'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page.rb:191:in `callback'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page.rb:178:in `content_stream'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page.rb:150:in `walk'
from C:/Ruby192/lib/ruby/gems/1.9.1/gems/pdf-reader-0.12.0.alpha/lib/pdf/reader/page.rb:120:in `text'

As you can see this is from the pdf-reader-0.12.0.alpha.

Thanks.

Morten

strange issue only in non-rails env

Hi Yob, I have a strange issue. When I use the gem in a Rails environment, everything works.
But if I try to use it in plain Ruby or in irb, I get the following:

ruby-1.9.2-p290 :006 > a = File.absolute_path('./somefile.pdf')
=> "/Users/username/projects/present_work/pdfparser/somefile.pdf"
ruby-1.9.2-p290 :007 > File.exist? a
=> true
ruby-1.9.2-p290 :008 > reader = PDF::Reader.new(a)
=> #<PDF::Reader:0x00000100919960>
ruby-1.9.2-p290 :009 > reader.pages
NoMethodError: undefined method `pages' for #<PDF::Reader:0x00000100919960>
from (irb):9
from /Users/username/.rvm/rubies/ruby-1.9.2-p290/bin/irb:16:in

Please answer, I really need this =(
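
One possible explanation, which is an assumption and not confirmed anywhere in this report, is that irb loads a different installed version of the gem than the Rails app gets through Bundler. A quick way to check is to list every installed version using the standard RubyGems API; `installed_versions` is a hypothetical helper name.

```ruby
require "rubygems"

# Lists every installed version of a gem, newest first.
# Returns an empty array when the gem is not installed.
def installed_versions(gem_name)
  Gem::Specification.find_all_by_name(gem_name)
                    .map(&:version)
                    .sort
                    .reverse
                    .map(&:to_s)
end

puts installed_versions("pdf-reader").inspect
```

If more than one version shows up, comparing the list with the version pinned in the Rails app's Gemfile.lock would show whether the two environments differ.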

lines sometimes read out of order

Hi,

I encountered an issue parsing PDF files, and found a work-around that others might find useful.

The PDF I am parsing is basically tabular data (converted from a spreadsheet). There are multiple cells and columns. The pdf-reader should be returning the text on a line-by-line basis. What I found was that on occasion, some cells in my 'table' were being read out of order. For example:

--------------------------------------------
 aaaaa   |  bbbbbb  |  ccccc |  dddddd
 eeeee   |  ffffff  |  ggggg |  hhhhhh
--------------------------------------------
 xxxxxx  |  yyyyyy  |  zzzzz |  zzzzzzz

Hopefully you get the idea. What should be returned is a text stream like:

aaaaa  bbbbb ccccc ddddd
eeeee  fffffffff  ggggg ....

and so on

What I was getting (occasionally) was:

aaaaa  bbbbb
ddddd
ccccc
eeeee  fffffff ggggg

So you can see the data was returned out-of-order for part of the table. I tracked this down to the following location:

pdf/reader/page_text_receiver.rb (function show_text)

    def show_text(string) # Tj
        
        raise PDF::Reader::MalformedPDFError, "current font is invalid" if @state.current_font.nil?
        newx, newy = @state.trm_transform(0,0)
        @content[newy] ||= ""
        @content[newy]  << @state.current_font.to_utf8(string)
   end

The problem is the way the newy variable is calculated. pdf-reader appears to store all the text it parses in a Hash, keyed by the y-coordinate where the text occurred. In my situation, the y-coordinate for each block of text had a tiny amount of variation, enough to result in:

YCOORD  |     TEXT
---------------------------
303.91       aaaaaa
303.91       bbbbbb
303.92       cccccccc
303.91       ddddddd
350.001      eeeeee
350.001      fffffffff
350.001      ggggg

Look at the y-coord for 'cccccccc'. Even though the difference is tiny (.92 vs .91), it is enough for the 'cccccccc' text to be inserted into its own 'row' in the @content Hash, and subsequently be returned out of order when we read the text with pdf.page(x).text. I don't know why this 'error' in the y-coord is there, but it occurs in the PDFs I am parsing.

The solution was to round the y-coordinate to a whole number before inserting it into the @content hash. Just copy and paste the code below into your Ruby program (no need to patch the original code).

module PDF
  class Reader
    class PageTextReceiver

      def show_text(string) # Tj
        raise PDF::Reader::MalformedPDFError, "current font is invalid" if @state.current_font.nil?
        newx, newy = @state.trm_transform(0,0)

        # Round the y-coordinate so runs that differ by a fraction of a point share one row.
        newy = newy.round(0)
        @content[newy] ||= ""
        @content[newy] << @state.current_font.to_utf8(string)
      end

    end
  end
end

Hopefully this is useful for somebody else! Rounding to the nearest whole number worked for me. I don't know enough about the pdf coordinate system to know if that is a generic solution though...
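
On the question of whether rounding generalises: rounding splits values that straddle a half, so 303.49 and 303.51 would still land in different rows. A tolerance-based grouping avoids that. The sketch below is standalone and hypothetical (`group_rows` is not part of pdf-reader, and the 0.5 default is an arbitrary assumption); it takes [y, text] pairs in stream order and keeps rows in the order they first appear.

```ruby
# Hypothetical helper: groups [y, text] pairs into rows when their
# y-coordinates are within `tolerance` of the row's first y value.
def group_rows(runs, tolerance = 0.5)
  rows = []
  runs.each do |y, text|
    row = rows.find { |r| (r[:y] - y).abs <= tolerance }
    if row
      row[:texts] << text # same visual row, keep stream order
    else
      rows << { y: y, texts: [text] }
    end
  end
  rows.map { |r| r[:texts].join(" ") }
end
```

With the coordinates from the table above, the 303.92 run joins the 303.91 row rather than starting its own. A sensible tolerance probably depends on the font size, which this sketch does not consider.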


