pdf-rs / pdf

Rust library to read, manipulate and write PDF files.

License: MIT License

Rust 100.00%

pdf pdf-files rust

pdf's Issues

Standard fonts

Apparently you have copied files that are proprietary to Adobe, including the aptly named AdobePiStd.otf. However, I suspect that the entire font directory was taken from one of our installations.

Please remove these ASAP!

UnexpectedPrimitive { expected: "Reference", found: "Dictionary" }

When I tried to open the following PDF, I got the following error:
https://arxiv.org/pdf/1801.09321.pdf

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Try { file: "/Users/k-akabe/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.7.2/src/file.rs", line: 277, column: 23, source: FromPrimitive { typ: "RcRef < Catalog >", field: "root", source: Try { file: "/Users/k-akabe/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.7.2/src/file.rs", line: 94, column: 19, source: FromPrimitive { typ: "Option < RcRef < NameDictionary > >", field: "names", source: UnexpectedPrimitive { expected: "Reference", found: "Dictionary" } } } } }', src/main.rs:4:62

// main.rs
use pdf::file::File as PdfFile;

fn main() {
    PdfFile::open("/Users/k-akabe/Downloads/1801.09321.pdf").unwrap();
}

# Cargo.toml
[dependencies]
pdf = "0.7.2"

This PDF can be opened without any problem in common viewers such as Firefox.

"UnexpectedPrimitive { expected: "Stream", found: "Array" }" error on git master

I am trying to open a PDF and am getting the following error:

Error: Try { file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/390fec1/pdf/src/file.rs", line: 94, column: 19, source: Try { file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/390fec1/pdf/src/object/types.rs", line: 22, column: 42, source: FromPrimitive { typ: "Option < Content >", field: "contents", source: Try { file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/390fec1/pdf/src/content.rs", line: 469, column: 28, source: UnexpectedPrimitive { expected: "Stream", found: "Array" } } } } }

I am using git master (390fec1).
Any suggestions? It's a public PDF that I downloaded from the internet; however, the fact that I'm looking at it is private. I can provide the PDF or a link privately if that would help.

EDITED: Sorry, I made an error when posting the original issue. I was reading and writing to the same file by mistake. I've updated this post with the correct error; the original error is below if anyone is interested.

Error: Try { file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/390fec1/pdf/src/file.rs", line: 194, column: 24, source: Other { msg: "file header is missing" } }

Redundant cloning due to Primitive::as_*

This line in parser/parse_xref.rs

let trailer = parse_with_lexer(lexer)?.as_dictionary(NO_RESOLVE)?.clone();

seems like a redundant clone, as parse_with_lexer returns a Result<Primitive> and we could just move it to the trailer variable. as_dictionary returns a reference because the resolve function requires it, but we don't use resolution (dereferencing) here.
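
A minimal sketch of the idea, assuming simplified stand-ins for the crate's types: the `Primitive` enum, `parse_trailer` and the consuming `into_dictionary` accessor below are hypothetical and only illustrate how moving the value avoids the clone.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for the crate's Primitive type.
#[derive(Debug, Clone, PartialEq)]
enum Primitive {
    Integer(i32),
    Dictionary(HashMap<String, Primitive>),
}

impl Primitive {
    /// Borrowing accessor: a caller that needs ownership has to clone.
    fn as_dictionary(&self) -> Result<&HashMap<String, Primitive>, String> {
        match self {
            Primitive::Dictionary(d) => Ok(d),
            other => Err(format!("expected Dictionary, found {:?}", other)),
        }
    }

    /// Consuming accessor: moves the dictionary out, so no clone is needed.
    fn into_dictionary(self) -> Result<HashMap<String, Primitive>, String> {
        match self {
            Primitive::Dictionary(d) => Ok(d),
            other => Err(format!("expected Dictionary, found {:?}", other)),
        }
    }
}

// Hypothetical parser result standing in for parse_with_lexer(lexer).
fn parse_trailer() -> Result<Primitive, String> {
    let mut dict = HashMap::new();
    dict.insert("Size".to_string(), Primitive::Integer(3));
    Ok(Primitive::Dictionary(dict))
}

fn main() -> Result<(), String> {
    // With the borrowing accessor, the temporary must be cloned to outlive the statement.
    let cloned = parse_trailer()?.as_dictionary()?.clone();
    // With the consuming accessor, the parsed value is simply moved.
    let moved = parse_trailer()?.into_dictionary()?;
    println!("{}", cloned == moved);
    Ok(())
}
```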

Test lexer / remove unused functionality

We generally need tests for the lexer module.

next_word in parser/lexer/mod.rs:

/// Used by next, peek and back - returns substring and new position
/// If forward, places pointer at the next non-whitespace character.
/// If backward, places pointer at the start of the current word.

Needs a test to validate this.

~~And I don't think that the "previous lexeme"/back functionality is used anyway.~~ Anyway - where back() is used, we can probably use peek() instead.
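
A rough sketch of what such a test could look like. The `next_word` below is a hypothetical, forward-only stand-in written for this example (the real function in parser/lexer/mod.rs also handles the backward direction and may differ in signature); it only illustrates checking the "pointer lands on the next non-whitespace character" contract.

```rust
fn is_ws(b: &u8) -> bool {
    b.is_ascii_whitespace()
}

/// Hypothetical forward-only stand-in for the lexer's `next_word`.
/// Returns the next whitespace-delimited word at or after `pos`, plus the
/// position of the first non-whitespace byte that follows it.
fn next_word(buf: &[u8], pos: usize) -> Option<(&[u8], usize)> {
    // Skip leading whitespace; None if nothing but whitespace remains.
    let start = pos + buf.get(pos..)?.iter().position(|b| !is_ws(b))?;
    // The word runs until the next whitespace byte or the end of the input.
    let end = buf[start..]
        .iter()
        .position(is_ws)
        .map_or(buf.len(), |n| start + n);
    // Place the pointer at the next non-whitespace byte (or at the end).
    let next = buf[end..]
        .iter()
        .position(|b| !is_ws(b))
        .map_or(buf.len(), |n| end + n);
    Some((&buf[start..end], next))
}

fn main() {
    let buf = b"  obj   <<  /Type";
    let mut pos = 0;
    // Walk the buffer word by word.
    while let Some((word, next)) = next_word(buf, pos) {
        println!("{:?} -> {}", String::from_utf8_lossy(word), next);
        pos = next;
    }
}
```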

Can't recognize type

pdf 646415a
rustc 1.17.0-nightly (b1e31766d 2017-03-03)

self.root: Dictionary({"Pages": Reference(ObjectId { obj_nr: 1, gen_nr: 0 }), "Type": Name("Catalog"), "PageMode": Name("UseOutlines"), "Outlines": Reference(ObjectId { obj_nr: 31, gen_nr: 0 })})
error: Document: error getting object from Reader.
caused by: Can't recognize type. Pos: 1262
	First lexeme: true
	Rest:
/CS /DeviceRGB
   >>
   /Resources 2 0 R
>>
endobj

	End rest

backtrace: stack backtrace:
   0:     0x5566f2fd9181 - backtrace::backtrace::libunwind::trace
                        at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.0/src/backtrace/libunwind.rs:53
                         - backtrace::backtrace::trace<closure>
                        at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.0/src/backtrace/mod.rs:42
   1:     0x5566f2fda113 - backtrace::capture::{{impl}}::new
                        at /home/bbigras/dev/rust/rs-imprime/target/debug/build/backtrace-17fa5380aeaf9437/out/capture.rs:79
   2:     0x5566f2fd0c2c - error_chain::make_backtrace
                        at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.9.0/src/lib.rs:413
   3:     0x5566f2fd0ce7 - error_chain::{{impl}}::default
                        at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.9.0/src/lib.rs:494
   4:     0x5566f2f65c10 - pdf::err::{{impl}}::from_kind
                        at /home/bbigras/dev/rust/rs-imprime/<error_chain_processed macros>:49
   5:     0x5566f2f660ee - pdf::err::{{impl}}::from
                        at /home/bbigras/pdf/src/err.rs:7
   6:     0x5566f2f30709 - core::convert::{{impl}}::into<collections::string::String,pdf::err::Error>
                        at /checkout/src/libcore/convert.rs:279
   7:     0x5566f2f56685 - pdf::file::parse_object::{{impl}}::parse_object_internal
                        at /home/bbigras/pdf/src/file/parse_object.rs:166
   8:     0x5566f2f53383 - pdf::file::parse_object::{{impl}}::parse_object_internal
                        at /home/bbigras/pdf/src/file/parse_object.rs:55
   9:     0x5566f2f53383 - pdf::file::parse_object::{{impl}}::parse_object_internal
                        at /home/bbigras/pdf/src/file/parse_object.rs:55
  10:     0x5566f2f52c33 - pdf::file::parse_object::{{impl}}::parse_object
                        at /home/bbigras/pdf/src/file/parse_object.rs:33
  11:     0x5566f2f57deb - pdf::file::parse_object::{{impl}}::parse_indirect_object
                        at /home/bbigras/pdf/src/file/parse_object.rs:186
  12:     0x5566f2f60bd6 - pdf::file::{{impl}}::read_indirect_object
                        at /home/bbigras/pdf/src/file/mod.rs:92
  13:     0x5566f2f642d8 - pdf::file::{{impl}}::next
                        at /home/bbigras/pdf/src/file/mod.rs:269
  14:     0x5566f2f4a820 - pdf::doc::{{impl}}::from_path
                        at /home/bbigras/pdf/src/doc/mod.rs:25
  15:     0x5566f2f14222 - rs_imprime::get_nb_pages
                        at /home/bbigras/dev/rust/rs-imprime/src/main.rs:146
  16:     0x5566f2f14b14 - rs_imprime::main
                        at /home/bbigras/dev/rust/rs-imprime/src/main.rs:257
  17:     0x5566f2ff5025 - std::panicking::try::do_call<fn(),()>
                        at /checkout/src/libstd/panicking.rs:454
  18:     0x5566f2ffc27a - panic_unwind::__rust_maybe_catch_panic
                        at /checkout/src/libpanic_unwind/lib.rs:98
  19:     0x5566f2ff5ad6 - std::panicking::try<(),fn()>
                        at /checkout/src/libstd/panicking.rs:433
                         - std::panic::catch_unwind<fn(),()>
                        at /checkout/src/libstd/panic.rs:361
                         - std::rt::lang_start
                        at /checkout/src/libstd/rt.rs:57
  20:     0x5566f2f14c32 - main
  21:     0x7fd4f48ad510 - __libc_start_main
  22:     0x5566f2f0c719 - _start
  23:                0x0 - <unknown>

Release 0.7

We need to release a new version to crates.io with all the improvements that have been implemented since the last release.

Any blocking issues?

release tuple crate
release encoding crate

Cannot convert from primitive to type Catalog

This PDF causes the following error when trying to load the pdf via:
let _file = pdf::file::File::<Vec<u8>>::open("test.pdf")?;

Error: Error(Msg("Key Root: cannot convert from primitive to type Catalog"), State { next_error: Some(Error(Msg("Key Pages: cannot convert from primitive to type PageTree"), State { next_error: Some(Error(Msg("Key Kids: cannot convert from primitive to type Vec < PagesNode >"), State { next_error: Some(Error(Msg("Key Resources: cannot convert from primitive to type Option < Resources >"), State { next_error: Some(Error(Msg("Key XObject: cannot convert from primitive to type Option < BTreeMap < String , XObject > >"), State { next_error: Some(Error(EntryNotFound { key: "Type" }, State { next_error: None, backtrace: Some(stack backtrace:
   0: backtrace::backtrace::trace_unsynchronized::hee1893eb5da7e7a9 (0x10e8a070c)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.26/src/backtrace/mod.rs:66
   1: backtrace::backtrace::trace::hca6b1efc3d61f6e3 (0x10e8a0692)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.26/src/backtrace/mod.rs:53
   2: backtrace::capture::Backtrace::create::hd47ce747c1184538 (0x10e898896)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.26/src/capture.rs:163
   3: backtrace::capture::Backtrace::new::h2c9cb6b3d23b2b00 (0x10e8987d4)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.26/src/capture.rs:126
   4: error_chain::make_backtrace::h2587ae8cf04d98e7 (0x10e74a3f1)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.11.0/src/lib.rs:616
   5: <error_chain::State as core::default::Default>::default::h6aba5dc07456c848 (0x10e74a45f)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.11.0/src/lib.rs:710
   6: pdf::err::Error::from_kind::hcf862d3fafe4c047 (0x10e6d1bb0)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/<::error_chain::error_chain::impl_error_chain_processed macros>:53
   7: <pdf::err::Error as core::convert::From<pdf::err::ErrorKind>>::from::hb3577d866bd2cdae (0x10e6d4977)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/<::error_chain::error_chain::impl_error_chain_processed macros>:98
   8: <pdf::object::types::XObject as pdf::object::Object>::from_primitive::h7b8c25308f2940a5 (0x10e62f63e)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/types.rs:210
   9: <alloc::collections::btree::map::BTreeMap<alloc::string::String, V> as pdf::object::Object>::from_primitive::h7ecb996880b2bfbf (0x10e66abea)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/mod.rs:236
  10: <alloc::collections::btree::map::BTreeMap<alloc::string::String, V> as pdf::object::Object>::from_primitive::h7ecb996880b2bfbf (0x10e66af4a)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/mod.rs:240
  11: pdf::primitive::<impl pdf::object::Object for core::option::Option<T>>::from_primitive::h732f52252c607dd5 (0x10e6911e6)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/primitive.rs:326
  12: <pdf::object::types::Resources as pdf::object::Object>::from_primitive::hd45eab241d4dec61 (0x10e63b059)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/types.rs:135
  13: pdf::primitive::<impl pdf::object::Object for core::option::Option<T>>::from_primitive::hd977042959c5f6cf (0x10e692483)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/primitive.rs:326
  14: <pdf::object::types::Page as pdf::object::Object>::from_primitive::h335ba1f03f02c092 (0x10e638093)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/types.rs:89
  15: <pdf::object::types::PagesNode as pdf::object::Object>::from_primitive::h24811c69acca4443 (0x10e62e1f1)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/types.rs:25
  16: <alloc::vec::Vec<T> as pdf::object::Object>::from_primitive::{{closure}}::hedb9a02a32628cda (0x10e6c860c)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/mod.rs:195
  17: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once::h149d4c1aea1a0351 (0x10e6bebb0)
             at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/ops/function.rs:279
  18: <core::option::Option<T>>::map::h4a7723266d769797 (0x10e689b9d)
             at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/option.rs:414
  19: <core::iter::Map<I, F> as core::iter::iterator::Iterator>::next::h51bfd912e8d5e7ef (0x10e6dc36e)
             at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/iter/mod.rs:1428
  20: <<core::result::Result<V, E> as core::iter::traits::FromIterator<core::result::Result<A, E>>>::from_iter::Adapter<Iter, E> as core::iter::iterator::Iterator>::next::h5b9a342e2ed9f8a2 (0x10e6e2e58)
             at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/result.rs:1220
  21: <&mut I as core::iter::iterator::Iterator>::next::h58363be6b4480122 (0x10e6eb68e)
             at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/iter/iterator.rs:2644
  22: <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T, I>>::from_iter::hf2babf0635f53d93 (0x10e6b89bb)
             at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/liballoc/vec.rs:1813
  23: <alloc::vec::Vec<T> as core::iter::traits::FromIterator<T>>::from_iter::hf1c3f6cdad026dbf (0x10e6b8f23)
             at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/liballoc/vec.rs:1725
  24: <core::result::Result<V, E> as core::iter::traits::FromIterator<core::result::Result<A, E>>>::from_iter::hf5fae6ddabd204a1 (0x10e6e25d0)
             at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/result.rs:1237
  25: core::iter::iterator::Iterator::collect::h3e0eb7698cf25103 (0x10e6db147)
             at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/iter/iterator.rs:1468
  26: <alloc::vec::Vec<T> as pdf::object::Object>::from_primitive::hd6bcee64ade5112c (0x10e6bbdec)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/mod.rs:193
  27: <pdf::object::types::PageTree as pdf::object::Object>::from_primitive::h962ee7a4c4cc6534 (0x10e635e23)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/types.rs:69
  28: <pdf::object::types::Catalog as pdf::object::Object>::from_primitive::h76a04b5e422d45fc (0x10e633969)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/types.rs:34
  29: <pdf::file::Trailer as pdf::object::Object>::from_primitive::h1df45626e88e653a (0x10e6cd14b)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/file.rs:252
  30: <pdf::file::File<B>>::open::h40b1658bb78a8ab7 (0x10e6211da)
             at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/file.rs:142

Decoding Issues Flate

Hi there,
I noticed some issues while decoding BW images created with gscan2pdf.
The image is flate encoded. My guess is that unfilter with predictors isn't working properly.
Example PDF and rendered page is attached.
bw.pdf

PageTree.count = 0 when it should be 1

@compenguy has the following problem:

https://github.com/compenguy/pdftext (run cargo test): document has 1 page, but the pdf lib says it has 0, because the PageTree has count = 0.

inspect-prim shows that the /Count value of the page tree is in fact 1. (just run cargo run)

I will take a look again later.

Errors in running the example

While compiling the given example I got the below errors and warnings:

   Compiling rust v0.1.0 (/Users/hasan/PycharmProjects/rust)
error[E0432]: unresolved import `pdf::error`
 --> src/main.rs:9:10
  |
9 | use pdf::error::PdfError;
  |          ^^^^^ could not find `error` in `pdf`

error[E0599]: no method named `as_ref` found for type `&pdf::object::types::Page` in the current scope
  --> src/main.rs:21:30
   |
21 |         let resources = page.as_ref().unwrap().resources(&file).unwrap();
   |                              ^^^^^^
   |
   = note: the method `as_ref` exists but the following trait bounds were not satisfied:
           `&pdf::object::types::Page : std::convert::AsRef<_>`

error[E0599]: no method named `data` found for type `&std::rc::Rc<_>` in the current scope
  --> src/main.rs:31:22
   |
31 |         f.write(&img.data().unwrap()).unwrap();
   |                      ^^^^
   |
   = note: img is a function, perhaps you wish to call it

error: aborting due to 3 previous errors

Some errors have detailed explanations: E0432, E0599.
For more information about an error, try `rustc --explain E0432`.
error: Could not compile `rust`.

Use enum variants for errors

In the code, we often describe errors with &'static str, e.g. bail!("Read past boundary of given contents.");. I find this convenient when the code is unstable or under rewrite, as we have yet to decide on the enum variants of ErrorKind in err.rs.

At some point, we should go through the code, see which enum variants we need in ErrorKind, and use those instead of str, so that applications can pattern-match on errors.
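
To illustrate the kind of matching this would enable, here is a small sketch. The enum name `PdfError`, its variants, and the helper `read_byte` are hypothetical and chosen only for this example; the variants actually needed in ErrorKind have yet to be decided.

```rust
use std::fmt;

// Hypothetical variants, for illustration only.
#[derive(Debug, PartialEq)]
enum PdfError {
    ContentReadPastBoundary,
    UnexpectedPrimitive { expected: &'static str, found: &'static str },
}

impl fmt::Display for PdfError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PdfError::ContentReadPastBoundary => {
                write!(f, "Read past boundary of given contents.")
            }
            PdfError::UnexpectedPrimitive { expected, found } => {
                write!(f, "expected {}, found {}", expected, found)
            }
        }
    }
}

impl std::error::Error for PdfError {}

/// Returns a typed error instead of bailing with a string.
fn read_byte(contents: &[u8], pos: usize) -> Result<u8, PdfError> {
    contents
        .get(pos)
        .copied()
        .ok_or(PdfError::ContentReadPastBoundary)
}

fn main() {
    // The application can now react to a specific failure.
    match read_byte(&[1u8, 2], 5) {
        Ok(b) => println!("byte: {}", b),
        Err(PdfError::ContentReadPastBoundary) => println!("truncated contents"),
        Err(other) => println!("other error: {}", other),
    }
}
```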

Improve Metadata example

TODO

print strings properly (by calling .as_str() on them)
inspect XML metadata (see Metadata Streams in the PDF specification)

Add an example that extracts all texts from a PDF

I am looking for an alternative to pdftotext. Such an example would also be tremendously useful as a quick start.

Can't recognize type #2

pdf 646415a
rustc 1.17.0-nightly (b1e31766d 2017-03-03)

error: Error reading root.
caused by: Can't recognize type. Pos: 7242
	First lexeme: null
	Rest:
null 0]
/Lang(fr-CA)
>>
endobj

13 0 obj
<</Creato

	End rest

backtrace: stack backtrace:
   0:     0x55ff93495181 - backtrace::backtrace::libunwind::trace
                        at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.0/src/backtrace/libunwind.rs:53
                         - backtrace::backtrace::trace<closure>
                        at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.0/src/backtrace/mod.rs:42
   1:     0x55ff93496113 - backtrace::capture::{{impl}}::new
                        at /home/bbigras/dev/rust/rs-imprime/target/debug/build/backtrace-17fa5380aeaf9437/out/capture.rs:79
   2:     0x55ff9348cc2c - error_chain::make_backtrace
                        at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.9.0/src/lib.rs:413
   3:     0x55ff9348cce7 - error_chain::{{impl}}::default
                        at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.9.0/src/lib.rs:494
   4:     0x55ff93421c10 - pdf::err::{{impl}}::from_kind
                        at /home/bbigras/dev/rust/rs-imprime/<error_chain_processed macros>:49
   5:     0x55ff934220ee - pdf::err::{{impl}}::from
                        at /home/bbigras/pdf/src/err.rs:7
   6:     0x55ff933ec709 - core::convert::{{impl}}::into<collections::string::String,pdf::err::Error>
                        at /checkout/src/libcore/convert.rs:279
   7:     0x55ff93412685 - pdf::file::parse_object::{{impl}}::parse_object_internal
                        at /home/bbigras/pdf/src/file/parse_object.rs:166
   8:     0x55ff934116ee - pdf::file::parse_object::{{impl}}::parse_object_internal
                        at /home/bbigras/pdf/src/file/parse_object.rs:134
   9:     0x55ff9340f383 - pdf::file::parse_object::{{impl}}::parse_object_internal
                        at /home/bbigras/pdf/src/file/parse_object.rs:55
  10:     0x55ff9340ec33 - pdf::file::parse_object::{{impl}}::parse_object
                        at /home/bbigras/pdf/src/file/parse_object.rs:33
  11:     0x55ff93413deb - pdf::file::parse_object::{{impl}}::parse_indirect_object
                        at /home/bbigras/pdf/src/file/parse_object.rs:186
  12:     0x55ff9341cbd6 - pdf::file::{{impl}}::read_indirect_object
                        at /home/bbigras/pdf/src/file/mod.rs:92
  13:     0x55ff9341c7f0 - pdf::file::{{impl}}::dereference
                        at /home/bbigras/pdf/src/file/mod.rs:77
  14:     0x55ff9341f728 - pdf::file::{{impl}}::read_root
                        at /home/bbigras/pdf/src/file/mod.rs:247
  15:     0x55ff9341baae - pdf::file::{{impl}}::new
                        at /home/bbigras/pdf/src/file/mod.rs:55
  16:     0x55ff9341b19b - pdf::file::{{impl}}::from_path
                        at /home/bbigras/pdf/src/file/mod.rs:36
  17:     0x55ff934065a2 - pdf::doc::{{impl}}::from_path
                        at /home/bbigras/pdf/src/doc/mod.rs:24
  18:     0x55ff933d0222 - rs_imprime::get_nb_pages
                        at /home/bbigras/dev/rust/rs-imprime/src/main.rs:146
  19:     0x55ff933d0b14 - rs_imprime::main
                        at /home/bbigras/dev/rust/rs-imprime/src/main.rs:257
  20:     0x55ff934b1025 - std::panicking::try::do_call<fn(),()>
                        at /checkout/src/libstd/panicking.rs:454
  21:     0x55ff934b827a - panic_unwind::__rust_maybe_catch_panic
                        at /checkout/src/libpanic_unwind/lib.rs:98
  22:     0x55ff934b1ad6 - std::panicking::try<(),fn()>
                        at /checkout/src/libstd/panicking.rs:433
                         - std::panic::catch_unwind<fn(),()>
                        at /checkout/src/libstd/panic.rs:361
                         - std::rt::lang_start
                        at /checkout/src/libstd/rt.rs:57
  23:     0x55ff933d0c32 - main
  24:     0x7f306b1ab510 - __libc_start_main
  25:     0x55ff933c8719 - _start
  26:                0x0 - <unknown>

How to edit metadata ?

Hi,

I am trying to open an existing PDF, insert some metadata and then save it.
But the file.trailer.info_dict values don't seem to be used when I call the .save_to method.
Is this feasible? Or planned?

found a function stream with type Some(Integer(0)) error on git master

I am trying to open a PDF and am getting the following error:

Error: Try {
  file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/0dcefd4/pdf/src/file.rs",
  line: 94,
  column: 19,
  source: Try {
    file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/0dcefd4/pdf/src/object/types.rs",
    line: 22,
    column: 42,
    source: FromPrimitive {
      typ: "Option < MaybeRef < Resources > >",
      field: "resources",
      source: Try {
        file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/0dcefd4/pdf/src/file.rs",
        line: 94,
        column: 19,
        source: FromPrimitive { 
          typ: "HashMap < String, ColorSpace >",
          field: "color_spaces",
          source: Try {
            file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/0dcefd4/pdf/src/object/color.rs",
            line: 73,
            column: 28,
            source: Other {
              msg: "found a function stream with type Some(Integer(0))"
            }
          }
        }
      }
    }
  }
}

I am using git master (0dcefd4).
Any suggestions? It's a public PDF that I downloaded from the internet; however, the fact that I'm looking at it is private. I can provide the PDF or a link privately if that would help.
My code currently doesn't read the colours at all, it's just extracting text.

Is there any way to save a page as an image?

Is there any way to save a page as an image, or to save it to a buffer and load the image using https://github.com/image-rs/image ?

Parse PDFs with broken xref section

Hi,

I got a panic when loading some PDFs. They are displayed fine in the viewer. I have attached an example PDF that reproduces the problem. Is there anything I can do about this kind of PDF?

thanks.

invalid.pdf

Loading PDF gets stuck at `READ XREF AND TABLE`

When loading this PDF via this code, the program gets stuck:
let _file = pdf::file::File::<Vec<u8>>::open("test1.pdf")?;

download_fonts.sh doesn't actually populate all the fonts

I struggled to find how to compile the read example while enabling standard-fonts.
But while compiling the example directly it could not find the following 3 fonts:

Arial-BoldMT.otf
Arial-ItalicMT.otf
Arial-MT.ttf

Contents of fonts after executing download_fonts.sh:

fonts/:
AdobePiStd.otf              CourierStd-Oblique.otf  MinionPro-Bold.otf     MyriadPro-BoldIt.otf  MyriadPro-Regular.otf  ZX______.PFB
CourierStd-BoldOblique.otf  CourierStd.otf          MinionPro-It.otf       MyriadPro-Bold.otf    PFM                    ZY______.PFB
CourierStd-Bold.otf         MinionPro-BoldIt.otf    MinionPro-Regular.otf  MyriadPro-It.otf      SY______.PFB

Possible solutions:

document the limitation of download_fonts.sh
patch the font downloader to get those ones too

I would prefer documenting since the project targets advanced users.

Demo website is down

Document how to run examples

$ cargo run -p read

Open a password protected PDF

How can I open a password-protected PDF?
Is there some way to pass the password?

 === 
Error: Reading Xref stream
  caused by: Error parsing from string - word: >
 === 

 === 
Exiting...

Error opening these PDFs

pdf error.zip

Switch from error-chain to failure

Flamegraph suggests error-chain is a bottleneck. Maybe we rely too much on propagating errors though.

trying to extract text, not all strings are present - looks like those with non-latin characters are gone

I'm unable to provide an example PDF because it contains sensitive data, though :(

Build failure on nightly 5a2465e2b: ‘Derive error’

On rustc 1.24.0-nightly (5a2465e2b 2017-12-06).

error: proc-macro derive panicked
  --> /home/sanmai/.cargo/bin/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.2/src/types.rs:51:10
   |
51 | #[derive(Object, FromDict, Debug)]
   |          ^^^^^^
   |
   = help: message: Derive error - Supported derive attributes: `key="Key"`, `default="some code"`.

Horribly slow

Iterating over all 23,000 named destinations takes 245 s!
Doing the same with PyPDF2 running on PyPy takes 8 s.
For a Rust library I would expect this to take less than 1 s.

Text extraction

Current status: works sometimes.

TODO:

needs to work reliably
attempt to extract blocks of text

Tolerate erroneous xref stream

I opened a PDF file created with xelatex, did some annotation in Drawboard PDF and saved it (xelatex_drawboard.pdf). A test crashes with this new file because of this peculiar cross-reference stream:
/W = [1, 2, 0]
Data = [2, 1, 183, 248, 2, 0, 1, 88, 2, 0, 3, 245, 2, 0, 0, 21, 2, 0, 2, 7]

From the reference: "A value of zero for an element in the W array indicates that the corresponding field is not present in the stream, and the default value is used, if there is one."

Each entry in the data should have 1 + 2 + 0 = 3 bytes, but clearly this is not true: the 20 bytes form 5 entries of 4 bytes each. Every PDF reader I use has no problem viewing it, so I think this library should try to tolerate it as well. But how should it deal with this? (Assuming it's not just my understanding that is wrong.)
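
The mismatch can be checked mechanically. The sketch below is a hypothetical helper (`decode_xref_entries` is not part of the crate) that splits the data by the /W widths and rejects input whose length is not a multiple of the entry width. It stores 0 for a zero-width field instead of modelling the specification's default values. Decoding with widths [1, 2, 1] is shown only to illustrate how the bytes would split at 4 bytes per entry; it is not a claim about what the producing tool intended.

```rust
/// Split xref stream data into entries according to the /W field widths.
/// Simplification: a zero-width field is reported as 0 rather than the
/// default value from the PDF specification.
fn decode_xref_entries(widths: [usize; 3], data: &[u8]) -> Result<Vec<[u64; 3]>, String> {
    let entry_len: usize = widths.iter().sum();
    if entry_len == 0 {
        return Err("all field widths are zero".to_string());
    }
    if data.len() % entry_len != 0 {
        return Err(format!(
            "{} bytes is not a multiple of the entry width {}",
            data.len(),
            entry_len
        ));
    }
    let mut entries = Vec::new();
    for chunk in data.chunks(entry_len) {
        let mut fields = [0u64; 3];
        let mut offset = 0;
        for (field, &w) in fields.iter_mut().zip(widths.iter()) {
            // Each field is a big-endian integer of `w` bytes.
            for &b in &chunk[offset..offset + w] {
                *field = (*field << 8) | u64::from(b);
            }
            offset += w;
        }
        entries.push(fields);
    }
    Ok(entries)
}

fn main() {
    let data: [u8; 20] = [
        2, 1, 183, 248, 2, 0, 1, 88, 2, 0, 3, 245, 2, 0, 0, 21, 2, 0, 2, 7,
    ];
    // The declared widths do not fit the data.
    match decode_xref_entries([1, 2, 0], &data) {
        Ok(entries) => println!("{:?}", entries),
        Err(e) => println!("rejected: {}", e),
    }
    // Hypothetical 4-byte split, for comparison only.
    println!("{:?}", decode_xref_entries([1, 2, 1], &data));
}
```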

Unable to update https://gitlab.com/sebk/tuple/

why?
Unable to update https://gitlab.com/sebk/tuple/

impl Backend for Mmap is unsound

You can't just call Mmap methods behind unsafe code and call it a day. If someone modifies the file, even from another program, things will go very badly. One should be able to read a PDF document from an Mmap value only from unsafe code.

I would suggest killing Backend::open and letting people initialise their Mmap or Vec<u8> inputs themselves.
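
A sketch of what that could look like, under the assumption that a backend only needs to lend out its bytes. The trait shape, `File::from_backend` and the simplified header check are hypothetical and not the crate's actual API; the point is only that the caller constructs the input (for example with std::fs::read, or an Mmap created inside their own unsafe block) and hands it over.

```rust
// Hypothetical trait: a backend is anything that can lend out its bytes.
trait Backend {
    fn bytes(&self) -> &[u8];
}

impl Backend for Vec<u8> {
    fn bytes(&self) -> &[u8] {
        self.as_slice()
    }
}

// Hypothetical file type that no longer opens anything by itself.
struct File<B: Backend> {
    backend: B,
}

impl<B: Backend> File<B> {
    /// The caller is responsible for how `backend` was created.
    fn from_backend(backend: B) -> Result<Self, String> {
        // Simplified check: only looks for the header at offset 0.
        if backend.bytes().starts_with(b"%PDF-") {
            Ok(File { backend })
        } else {
            Err("file header is missing".to_string())
        }
    }

    fn len(&self) -> usize {
        self.backend.bytes().len()
    }
}

fn main() {
    // In real use the bytes might come from std::fs::read(path).
    let data: Vec<u8> = b"%PDF-1.7\n".to_vec();
    match File::from_backend(data) {
        Ok(file) => println!("loaded {} bytes", file.len()),
        Err(e) => println!("error: {}", e),
    }
}
```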

Typed operations

I am writing a tool that extracts data out of some domain-specific pdfs and am finding it quite cumbersome to deal with the untyped operations.

I am writing an enum version of pdf::content::Operation. Is that something you'd be interested in having in-tree or should I create a separate repo and crate for it?
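
As a rough illustration of the idea, the sketch below converts an untyped operator/operand pair into an enum. `Operand` is a hypothetical simplification of the crate's Primitive type, and the `Op` variants cover only a handful of operators picked for the example.

```rust
// Hypothetical, simplified operand type.
#[derive(Debug, PartialEq)]
enum Operand {
    Number(f32),
    Text(String),
}

// Hypothetical typed view of a few content stream operators.
#[derive(Debug, PartialEq)]
enum Op {
    BeginText,
    EndText,
    MoveTo { x: f32, y: f32 },
    ShowText(String),
    // Anything not recognised is kept in its untyped form.
    Unknown { operator: String, operands: Vec<Operand> },
}

fn typed(operator: &str, operands: Vec<Operand>) -> Op {
    // Try the known operators first, checking operand count and types.
    let known = match (operator, operands.as_slice()) {
        ("BT", []) => Some(Op::BeginText),
        ("ET", []) => Some(Op::EndText),
        ("m", [Operand::Number(x), Operand::Number(y)]) => Some(Op::MoveTo { x: *x, y: *y }),
        ("Tj", [Operand::Text(s)]) => Some(Op::ShowText(s.clone())),
        _ => None,
    };
    match known {
        Some(op) => op,
        None => Op::Unknown {
            operator: operator.to_string(),
            operands,
        },
    }
}

fn main() {
    let ops = vec![
        typed("BT", vec![]),
        typed("Tj", vec![Operand::Text("Hello".to_string())]),
        typed("ET", vec![]),
    ];
    // Consumers can match on variants instead of comparing operator strings.
    for op in &ops {
        if let Op::ShowText(s) = op {
            println!("{}", s);
        }
    }
}
```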

Use of unstable features

Is it on the roadmap to make pdf compatible with stable again? FYI, it looks like #![feature(custom_attribute)] is unused. Here's how the other unstable features are currently being used.

termination_trait_lib: This is used in PdfError, to print error information while the process status code for the error is being fetched. This could be replaced with some boilerplate at each relevant main() function, to print out the same error information if an error bubbles all the way up.
try_trait: This is used to implement a conversion from NoneError to PdfError. This conversion is only used once, in font.rs, and it could be replaced with .ok_or(PdfError::Other { msg: ... }).
core_intrinsics: This is used to implement AnyObject and for error handling. I'm not familiar with this, but it's possible std::any::Any or the typename crates could fill in the gap here.
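
A sketch of the first two replacements on stable Rust. The `PdfError` definition, `first_width` and `run` are hypothetical stand-ins written for this example; only `Option::ok_or` and the explicit `main` boilerplate are the point.

```rust
// Hypothetical, reduced error type.
#[derive(Debug, PartialEq)]
enum PdfError {
    Other { msg: String },
}

/// Instead of `?` on an Option (which needed try_trait), convert explicitly.
fn first_width(widths: &[f32]) -> Result<f32, PdfError> {
    widths.first().copied().ok_or(PdfError::Other {
        msg: "font has no widths".to_string(),
    })
}

fn run() -> Result<(), PdfError> {
    let w = first_width(&[0.5, 1.0])?;
    println!("first width: {}", w);
    Ok(())
}

// Instead of implementing Termination for the error type, each main()
// prints the error and sets the exit code itself.
fn main() {
    if let Err(e) = run() {
        eprintln!("error: {:?}", e);
        std::process::exit(1);
    }
}
```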

Unable to update https://gitlab.com/sebk/tuple/

When running "cargo run -p view":
Unable to update https://gitlab.com/sebk/tuple/

Executing text example failed

While executing the text example:

use std::env::args;

use pdf::file::File;
use pdf::content::*;
use pdf::primitive::Primitive;

fn add_primitive(p: &Primitive, out: &mut String) {
    // println!("p: {:?}", p);
    match p {
        &Primitive::String(ref s) => if let Ok(text) = s.as_str() {
            out.push_str(text);
        }
        &Primitive::Array(ref a) => for p in a.iter() {
            add_primitive(p, out);
        }
        _ => ()
    }
}

fn main() {
    let path = args().nth(1).expect("no file given");
    println!("read: {}", path);
    let file = File::<Vec<u8>>::open(&path).unwrap();
    
    let mut out = String::new();
    for page in file.pages() {
        for Operation { ref operator, ref operands } in &page.unwrap().contents.as_ref().unwrap().operations {
            // println!("{} {:?}", operator, operands);
            match operator.as_str() {
                "Tj" | "TJ" | "BT" => operands.iter().for_each(|p| add_primitive(p, &mut out)),
                _ => {}
            }
        }
    }
    println!("{}", out);
}

With Cargo.toml:

[dependencies]
pdf = { git="https://github.com/pdf-rs/pdf" }

I got the below error:

PS C:\Users\hasan.DESKTOP-HU2FQ29\PycharmProjects\rust> cargo run HasanResume.pdf
    Finished dev [unoptimized + debuginfo] target(s) in 0.14s
     Running `target\debug\rust.exe HasanResume.pdf`
read: HasanResume.pdf
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FromPrimitive { typ: "Option < Rc < Resources > >", field: "resources", source: FromPrimitive { typ: "BTreeMap < String, Rc < Font > >", field: "fonts", source: FromPrimitive { typ: "Vec < Rc < Font > >", field: "descendant_fonts", source: FromPrimitive { typ: "FontDescriptor", field: "font_descriptor", source: FromPrimitive { typ: "Rect", field: "font_bbox", source: UnexpectedPrimitive { expected: "Array", found: "Reference" } 
} } } } }', src\libcore\result.rs:1165:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
error: process didn't exit successfully: `target\debug\rust.exe HasanResume.pdf` (exit code: 101)

How can I get a page and save to another file?

Like this.

from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_input = PdfFileReader(open("test.pdf", 'rb'))

pdf_output = PdfFileWriter()
page = pdf_input.getPage(2)
pdf_output.addPage(page)

pdf_output.write(open("./splitted.pdf", 'wb'))

Use err::Result for Object::serialize?

Reason: we currently panic and unwrap in the serialize functions because, without a Result return type, we can't use the ? operator.
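
A minimal sketch of what a fallible serialize could look like. The Serialize trait and the Rect struct are hypothetical stand-ins, not the crate's actual definitions, and std::io::Result is used in place of err::Result only to keep the example self-contained.

```rust
use std::io::{self, Write};

// Hypothetical stand-in for the crate's Object trait; only the
// signature of `serialize` matters for this sketch.
trait Serialize {
    fn serialize<W: Write>(&self, out: &mut W) -> io::Result<()>;
}

// Hypothetical example type, written out as a PDF array.
struct Rect {
    left: f32,
    bottom: f32,
    right: f32,
    top: f32,
}

impl Serialize for Rect {
    fn serialize<W: Write>(&self, out: &mut W) -> io::Result<()> {
        // The write can fail; `?` forwards the error to the caller
        // instead of panicking via unwrap().
        write!(out, "[{} {} {} {}]", self.left, self.bottom, self.right, self.top)?;
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut buf = Vec::new();
    let rect = Rect { left: 0.0, bottom: 0.0, right: 612.0, top: 792.0 };
    rect.serialize(&mut buf)?;
    println!("{}", String::from_utf8_lossy(&buf));
    Ok(())
}
```

With this shape, nested objects can call serialize on their fields and propagate failures with ? all the way up to the writer.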

[Question] Possibility of sanitizing PDFs

Hi, is it possible to strip potentially malicious elements (such as JavaScript or OpenAction entries) from a PDF document using this tool?

Stream encoding - when to decode

In most cases, a stream has to be decoded before the PDF library can use it (for example, an xref stream). Right now this is done during parsing (parser/mod.rs). In other cases, the encoded data doesn't need to be decoded before the application consumes it. Example: a JPEG stream with the DCT filter.
We need a policy for when to keep the data as it is and when to decode it.
The simplest idea, and what I will do now:

Don't decode in parser/mod.rs.
The derivation will assume that the info field has a filter field; all streams have this anyway.
In the derivation of stream-based Objects, add an option for whether or not to decode the stream in from_primitive().
Hence, it's up to each Object whether it needs the encoded or the decoded data. I'm not sure yet whether this is the best way. (Another idea is to decode in parser/mod.rs depending on which filters are applied, e.g. not in the case of the DCT filter. That would still require keeping track of which filters are still applied to the data.)
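
The per-object policy described above could be sketched roughly as follows. Every name here (Filter, RawStream, decode_except, decode_one) is hypothetical and simplified, and the Flate branch only passes the bytes through instead of inflating them. The point is the bookkeeping: the stream remembers which filters are still applied, and each consumer decides where decoding stops.

```rust
// Hypothetical, simplified types; not the crate's actual definitions.
#[derive(Debug, Clone, PartialEq)]
enum Filter {
    Flate,
    Dct,
}

struct RawStream {
    // Filters still applied to `data`, in decoding order.
    filters: Vec<Filter>,
    data: Vec<u8>,
}

#[derive(Debug)]
struct UnsupportedFilter(Filter);

// Placeholder decoder: a real one would inflate Flate data.
fn decode_one(filter: &Filter, data: Vec<u8>) -> Result<Vec<u8>, UnsupportedFilter> {
    match filter {
        Filter::Flate => Ok(data),
        Filter::Dct => Err(UnsupportedFilter(filter.clone())),
    }
}

impl RawStream {
    // Remove filters from the front until one listed in `keep` is reached.
    fn decode_except(mut self, keep: &[Filter]) -> Result<RawStream, UnsupportedFilter> {
        while let Some(filter) = self.filters.first().cloned() {
            if keep.contains(&filter) {
                break;
            }
            let encoded = std::mem::take(&mut self.data);
            self.data = decode_one(&filter, encoded)?;
            self.filters.remove(0);
        }
        Ok(self)
    }
}

fn main() {
    // An image object keeps the DCT layer so the application gets JPEG bytes.
    let image = RawStream { filters: vec![Filter::Flate, Filter::Dct], data: vec![1, 2, 3] };
    let image = image.decode_except(&[Filter::Dct]).unwrap();
    println!("filters still applied: {:?}", image.filters);

    // An xref stream asks for fully decoded data.
    let xref = RawStream { filters: vec![Filter::Flate], data: vec![4, 5] };
    let xref = xref.decode_except(&[]).unwrap();
    println!("xref bytes: {:?}, filters left: {:?}", xref.data, xref.filters);
}
```

Keeping the remaining filter list next to the data also covers the alternative mentioned in parentheses, since the parser could call the same method with a fixed keep list.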

Get the PDF writer working again.

Needed for a sane implementation of grafeia/grafeia#2

Read pdf failed

Reading a PDF with the following code produces an error.

#[cfg(test)]
mod pdf_test {
    use glob::glob;
    use pdf::file::File;

    macro_rules! file_path {
        ( $sub_dir:expr ) => { concat!("./src/test/common/", $sub_dir) }
    }

    macro_rules! run {
        ($e:expr) => (
            match $e {
                Ok(v) => v,
                Err(e) => {
                    e.trace();
                    panic!("{}", e);
                }
            }
        )
    }

    #[test]
    pub fn read_pages() {
        for entry in glob(file_path!("original.pdf")).expect("Failed to read glob pattern") {
            match entry {
                Ok(path) => {
                    println!("\n == Now testing `{}` ==", path.to_str().unwrap());
    
                    let path = path.to_str().unwrap();
                    let file = run!(File::<Vec<u8>>::open(path));
                    for i in 0 .. file.num_pages() {
                        println!("Read page {}", i);
                        let _ = file.get_page(i);
                    }
                }
                Err(e) => println!("{:?}", e)
            }
        }
    }
}

 == Now testing `src\test\common\original.pdf` ==
0: Try at C:\Users\11989\.cargo\registry\src\github.com-1ecc6299db9ec823\pdf-0.7.2\src\file.rs:277:23
1: Can't parse field root of struct RcRef < Catalog >.
2: Try at C:\Users\11989\.cargo\registry\src\github.com-1ecc6299db9ec823\pdf-0.7.2\src\file.rs:94:19
3: Can't parse field names of struct Option < RcRef < NameDictionary > >.
4: Expected primitive Reference, found primive Dictionary instead.
thread 'test::pdf_test::pdf_test::read_pages' panicked at 'Try at C:\Users\11989\.cargo\registry\src\github.com-1ecc6299db9ec823\pdf-0.7.2\src\file.rs:277:23', src\test\pdf_test.rs:33:32
stack backtrace:
   0: std::panicking::begin_panic_handler
             at /rustc/657bc01888e6297257655585f9c475a0801db6d2\/library\std\src\panicking.rs:515
   1: std::panicking::begin_panic_fmt
             at /rustc/657bc01888e6297257655585f9c475a0801db6d2\/library\std\src\panicking.rs:457
   2: document_manager::test::pdf_test::pdf_test::read_pages
             at .\src\test\pdf_test.rs:33
   3: document_manager::test::pdf_test::pdf_test::read_pages::{{closure}}
             at .\src\test\pdf_test.rs:23
   4: core::ops::function::FnOnce::call_once<closure-0,tuple<>>
             at /rustc/657bc01888e6297257655585f9c475a0801db6d2\library\core\src\ops\function.rs:227
   5: core::ops::function::FnOnce::call_once
             at /rustc/657bc01888e6297257655585f9c475a0801db6d2\library\core\src\ops\function.rs:227
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test test::pdf_test::pdf_test::read_pages ... FAILED

failures:

failures:
    test::pdf_test::pdf_test::read_pages

After rewriting this file with Python's PyPDF2, I can read it successfully. However, the rewritten file is smaller than the original.

from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_input = PdfFileReader(open("talkiin.pdf", 'rb'))
page_count = pdf_input.getNumPages()

pdf_output = PdfFileWriter()
for i in range(page_count):
    page = pdf_input.getPage(i)
    pdf_output.addPage(page)

pdf_output.write(open("./splitted.pdf", 'wb'))

I'm sorry I can't provide this document.

dump_data() issues

I tried parsing a corrupt PDF file, and the program panicked with the following error. The unwrap() was in dump_data().

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: NotFound, error: PathError { path: "/tmp/pdf/oTRQva", err: Os { code: 2, kind: NotFound, message: "No such file or directory" } } }', src/libcore/result.rs:999:5

After I created a /tmp/pdf directory, the panic went away. However, since I didn't have logging on, it created a file in /tmp/pdf without notifying me of its presence. I'd recommend two changes to this function:

Instead of specifying a subdirectory, use the OS's default directory, and provide a file name prefix like pdf-
Only write files out when an environment variable is set, and log a message with instructions on setting the environment variable to dump the buffer if the environment variable is not set
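
A rough sketch of the two suggestions combined, using only the standard library. The environment variable name PDF_DUMP_DATA, the helper write_dump, and the file naming scheme are assumptions made for illustration, and eprintln! stands in for whatever logging the crate uses.

```rust
use std::env;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical variable name; the crate defines no such variable.
const DUMP_VAR: &str = "PDF_DUMP_DATA";

// Write `data` to a new file in `dir`, named with a "pdf-" prefix.
fn write_dump(dir: &Path, data: &[u8]) -> io::Result<PathBuf> {
    // Process id plus a timestamp keeps names reasonably unique.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let path = dir.join(format!("pdf-{}-{}.bin", std::process::id(), nanos));
    fs::write(&path, data)?;
    Ok(path)
}

// Dump only when the variable is set; otherwise explain how to enable it.
fn dump_data(data: &[u8]) -> io::Result<Option<PathBuf>> {
    if env::var_os(DUMP_VAR).is_none() {
        eprintln!(
            "set {}=1 to dump the {}-byte buffer to a temporary file",
            DUMP_VAR,
            data.len()
        );
        return Ok(None);
    }
    // temp_dir() is the OS default, so no subdirectory has to exist.
    let path = write_dump(&env::temp_dir(), data)?;
    eprintln!("dumped {} bytes to {}", data.len(), path.display());
    Ok(Some(path))
}

fn main() -> io::Result<()> {
    dump_data(b"%PDF-1.4 corrupt")?;
    Ok(())
}
```

Returning io::Result instead of unwrapping also means a missing or read-only directory surfaces as an error rather than a panic.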

Error on LineCap Integer

I've tried to use the "text" example with a PDF I have, but I got the following error while accessing the first page:

Error: Try { file: "pdf\\src\\file.rs", line: 96, column: 19, source: Try { file: "pdf\\src\\object\\types.rs", line: 25, column: 36, source: FromPrimitive { typ: "Option < MaybeRef < Resources > >", field: "resources", source: FromPrimitive { typ: "HashMap < String, GraphicsStateParameters >", field: "graphics_states", source: FromPrimitive { typ: "Option < LineCap >", field: "line_cap", source: UnexpectedPrimitive { expected: "Name", found: "Integer" } } } } } }

The "inspect_prim" tool is working fine with it, so I've used some debug logging to find the resource causing the issue, and got the following data:

I'm a newbie regarding the PDF standard, so I can't really tell whether this is a bug in the library or an issue with my PDF file. I've checked the PDF with an online validator tool, though, and it seems to be compliant with the PDF 1.4 standard.
Could you help me pinpoint the issue here? Why is the decoder expecting a Name? LineCap is an enum, so the Integer should work fine, right?
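
For context, the PDF specification defines the LC entry of a graphics state parameter dictionary as an integer: 0 for butt caps, 1 for round caps, and 2 for projecting square caps. A conversion that accepts the integer could look roughly like the sketch below. The Primitive and ParseError types are simplified, hypothetical stand-ins and not the crate's real ones.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum LineCap {
    Butt,
    Round,
    Square,
}

// Simplified stand-in for the crate's primitive type.
#[derive(Debug)]
enum Primitive {
    Integer(i32),
    Name(String),
}

// Simplified stand-in for the crate's error type.
#[derive(Debug, PartialEq)]
enum ParseError {
    UnexpectedPrimitive { expected: &'static str, found: &'static str },
    OutOfRange(i32),
}

impl LineCap {
    // Map the integer line cap style onto the enum.
    fn from_primitive(p: &Primitive) -> Result<LineCap, ParseError> {
        match p {
            Primitive::Integer(0) => Ok(LineCap::Butt),
            Primitive::Integer(1) => Ok(LineCap::Round),
            Primitive::Integer(2) => Ok(LineCap::Square),
            Primitive::Integer(n) => Err(ParseError::OutOfRange(*n)),
            Primitive::Name(_) => Err(ParseError::UnexpectedPrimitive {
                expected: "Integer",
                found: "Name",
            }),
        }
    }
}

fn main() {
    println!("{:?}", LineCap::from_primitive(&Primitive::Integer(1)));
    println!("{:?}", LineCap::from_primitive(&Primitive::Name("Round".to_string())));
}
```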

Object trait: add functionality like from_dict?

Since I removed from_dict and left only from_primitive, I should consider adding provided methods to the Object trait that use from_primitive to implement from_dict and from_stream. These would have to fail (return Err) for any primitive that is not a dict or a stream, though. Maybe instead consider adding a trait FromDict: Object that only provides from_dict().
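
The second option might be sketched like this. Primitive, Dict, PdfError, and PageCount are simplified, hypothetical types; only the relationship between the two traits is the point of the example.

```rust
use std::collections::HashMap;

type Dict = HashMap<String, Primitive>;

// Simplified stand-in for the crate's primitive type.
#[derive(Debug, Clone)]
enum Primitive {
    Integer(i32),
    Dictionary(Dict),
}

// Simplified stand-in for the crate's error type.
#[derive(Debug)]
enum PdfError {
    UnexpectedPrimitive { expected: &'static str, found: &'static str },
    MissingField(&'static str),
}

trait Object: Sized {
    fn from_primitive(p: Primitive) -> Result<Self, PdfError>;
}

// Opt-in extension: from_dict is provided on top of from_primitive.
trait FromDict: Object {
    fn from_dict(dict: Dict) -> Result<Self, PdfError> {
        Self::from_primitive(Primitive::Dictionary(dict))
    }
}

// Hypothetical dictionary-based object used for the demonstration.
struct PageCount(i32);

impl Object for PageCount {
    fn from_primitive(p: Primitive) -> Result<Self, PdfError> {
        match p {
            Primitive::Dictionary(d) => match d.get("Count") {
                Some(Primitive::Integer(n)) => Ok(PageCount(*n)),
                _ => Err(PdfError::MissingField("Count")),
            },
            // Anything that is not a dictionary is rejected.
            Primitive::Integer(_) => Err(PdfError::UnexpectedPrimitive {
                expected: "Dictionary",
                found: "Integer",
            }),
        }
    }
}

// Only dictionary-based types opt in.
impl FromDict for PageCount {}

fn main() {
    let mut dict = Dict::new();
    dict.insert("Count".to_string(), Primitive::Integer(3));
    match PageCount::from_dict(dict) {
        Ok(PageCount(n)) => println!("page count: {}", n),
        Err(e) => println!("error: {:?}", e),
    }
}
```

Keeping from_dict in a separate trait means types that are never built from a dictionary don't carry a method that can only fail.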

Error when using crate because of error in dependency

Hi, I'm getting an error when trying to use this crate.

❯ cargo run
   Compiling pdf_derive v0.1.20
error[E0433]: failed to resolve: could not find `export` in `syn`
  --> /Users/x/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf_derive-0.1.20/src/lib.rs:99:23
   |
99 | type SynStream = syn::export::TokenStream2;
   |                       ^^^^^^ could not find `export` in `syn`

error: aborting due to previous error

For more information about this error, try `rustc --explain E0433`.
error: could not compile `pdf_derive`

This seems to be a suggested fix: frondeus/test-case#60 (comment)