pdf-rs / pdf
Rust library to read, manipulate and write PDF files.
License: MIT License
Apparently you have copied files that are proprietary to Adobe, including the aptly named AdobePiStd.otf. However, I suspect that the entire font directory was taken from one of our installations.
Please remove these ASAP!
When I tried to open the following PDF, I got the following error:
https://arxiv.org/pdf/1801.09321.pdf
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Try { file: "/Users/k-akabe/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.7.2/src/file.rs", line: 277, column: 23, source: FromPrimitive { typ: "RcRef < Catalog >", field: "root", source: Try { file: "/Users/k-akabe/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.7.2/src/file.rs", line: 94, column: 19, source: FromPrimitive { typ: "Option < RcRef < NameDictionary > >", field: "names", source: UnexpectedPrimitive { expected: "Reference", found: "Dictionary" } } } } }', src/main.rs:4:62
// main.rs
use pdf::file::File as PdfFile;
fn main() {
    PdfFile::open("/Users/k-akabe/Downloads/1801.09321.pdf").unwrap();
}
# Cargo.toml
[dependencies]
pdf = "0.7.2"
This PDF can be opened without any problem in common viewers such as Firefox.
I am trying to open a PDF and am getting the following error:
Error: Try { file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/390fec1/pdf/src/file.rs", line: 94, column: 19, source: Try { file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/390fec1/pdf/src/object/types.rs", line: 22, column: 42, source: FromPrimitive { typ: "Option < Content >", field: "contents", source: Try { file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/390fec1/pdf/src/content.rs", line: 469, column: 28, source: UnexpectedPrimitive { expected: "Stream", found: "Array" } } } } }
I am using git master (390fec1).
Any suggestions? It's a public PDF that I've downloaded off the internet, however, the fact that I'm looking at it is private. I can provide the PDF or a link privately if that would help.
EDITED: Sorry, I made an error when posting the original issue. I was reading and writing to the same file by mistake. I've updated this post with the correct error, the original error is below if anyone is interested.
Error: Try { file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/390fec1/pdf/src/file.rs", line: 194, column: 24, source: Other { msg: "file header is missing" } }
This line in parser/parse_xref.rs
let trailer = parse_with_lexer(lexer)?.as_dictionary(NO_RESOLVE)?.clone();
seems to contain a redundant clone: parse_with_lexer returns a Result<Primitive>, so we could just move the value into the trailer variable. as_dictionary returns a reference because the resolve function requires it, but we don't use resolution (dereferencing) here.
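The difference between cloning and moving can be shown with a small standalone sketch. The `Primitive` enum, `parse_trailer` and `into_dictionary` below are hypothetical stand-ins and not the library's actual definitions; they only illustrate that an owned value can be moved out instead of being borrowed and cloned.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the library's primitive type.
#[derive(Debug, Clone, PartialEq)]
enum Primitive {
    Integer(i32),
    Dictionary(HashMap<String, Primitive>),
}

impl Primitive {
    // Borrowing accessor: the caller must clone to get an owned dictionary.
    fn as_dictionary(&self) -> Result<&HashMap<String, Primitive>, String> {
        match self {
            Primitive::Dictionary(d) => Ok(d),
            other => Err(format!("expected Dictionary, found {:?}", other)),
        }
    }

    // Consuming accessor: moves the dictionary out, so no clone is needed.
    fn into_dictionary(self) -> Result<HashMap<String, Primitive>, String> {
        match self {
            Primitive::Dictionary(d) => Ok(d),
            other => Err(format!("expected Dictionary, found {:?}", other)),
        }
    }
}

// Stand-in for a parser that returns an owned primitive.
fn parse_trailer() -> Result<Primitive, String> {
    let mut dict = HashMap::new();
    dict.insert("Size".to_string(), Primitive::Integer(22));
    Ok(Primitive::Dictionary(dict))
}

fn main() -> Result<(), String> {
    // Borrow the temporary, then clone the dictionary out of it.
    let cloned = parse_trailer()?.as_dictionary()?.clone();
    // Move the dictionary out of the owned value directly.
    let moved = parse_trailer()?.into_dictionary()?;
    assert_eq!(cloned, moved);
    println!("{:?}", moved);
    Ok(())
}
```

Whether a consuming accessor fits the real code depends on how the resolve parameter is used elsewhere, which this sketch does not model.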
Generally need tests for the lexer module.
next_word in parser/lexer/mod.rs:
/// Used by next, peek and back - returns substring and new position
/// If forward, places pointer at the next non-whitespace character.
/// If backward, places pointer at the start of the current word.
Needs a test to validate this.
I also don't think the "previous lexeme"/back functionality is really needed: where back() is used, we can probably use peek() instead.
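As a rough illustration of the kind of test meant here, the sketch below uses a minimal whitespace-delimited lexer. The `Lexer` type is a hypothetical stand-in written for this example, not the code in parser/lexer/mod.rs; it only covers the forward direction and shows how peek() can stand in for a next()/back() pair.

```rust
// Minimal whitespace-delimited lexer; a hypothetical stand-in used only
// to illustrate testing next_word-style logic and peek() versus next().
struct Lexer<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Lexer<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Lexer { buf, pos: 0 }
    }

    // Returns the next word and the position just past it, without
    // modifying the lexer state.
    fn next_word(&self) -> Option<(&'a [u8], usize)> {
        let mut start = self.pos;
        while start < self.buf.len() && self.buf[start].is_ascii_whitespace() {
            start += 1;
        }
        if start == self.buf.len() {
            return None;
        }
        let mut end = start;
        while end < self.buf.len() && !self.buf[end].is_ascii_whitespace() {
            end += 1;
        }
        Some((&self.buf[start..end], end))
    }

    // Consumes and returns the next word.
    fn next(&mut self) -> Option<&'a [u8]> {
        let (word, new_pos) = self.next_word()?;
        self.pos = new_pos;
        Some(word)
    }

    // Returns the next word without consuming it.
    fn peek(&self) -> Option<&'a [u8]> {
        self.next_word().map(|(word, _)| word)
    }
}

fn main() {
    let mut lexer = Lexer::new(b"12 0 obj");
    // peek() does not advance, so the following next() sees the same word.
    assert_eq!(lexer.peek(), Some(&b"12"[..]));
    assert_eq!(lexer.next(), Some(&b"12"[..]));
    assert_eq!(lexer.next(), Some(&b"0"[..]));
    assert_eq!(lexer.peek(), Some(&b"obj"[..]));
    assert_eq!(lexer.next(), Some(&b"obj"[..]));
    assert_eq!(lexer.next(), None);
}
```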
pdf 646415a
rustc 1.17.0-nightly (b1e31766d 2017-03-03)
self.root: Dictionary({"Pages": Reference(ObjectId { obj_nr: 1, gen_nr: 0 }), "Type": Name("Catalog"), "PageMode": Name("UseOutlines"), "Outlines": Reference(ObjectId { obj_nr: 31, gen_nr: 0 })})
error: Document: error getting object from Reader.
caused by: Can't recognize type. Pos: 1262
First lexeme: true
Rest:
/CS /DeviceRGB
>>
/Resources 2 0 R
>>
endobj
End rest
backtrace: stack backtrace:
0: 0x5566f2fd9181 - backtrace::backtrace::libunwind::trace
at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.0/src/backtrace/libunwind.rs:53
- backtrace::backtrace::trace<closure>
at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.0/src/backtrace/mod.rs:42
1: 0x5566f2fda113 - backtrace::capture::{{impl}}::new
at /home/bbigras/dev/rust/rs-imprime/target/debug/build/backtrace-17fa5380aeaf9437/out/capture.rs:79
2: 0x5566f2fd0c2c - error_chain::make_backtrace
at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.9.0/src/lib.rs:413
3: 0x5566f2fd0ce7 - error_chain::{{impl}}::default
at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.9.0/src/lib.rs:494
4: 0x5566f2f65c10 - pdf::err::{{impl}}::from_kind
at /home/bbigras/dev/rust/rs-imprime/<error_chain_processed macros>:49
5: 0x5566f2f660ee - pdf::err::{{impl}}::from
at /home/bbigras/pdf/src/err.rs:7
6: 0x5566f2f30709 - core::convert::{{impl}}::into<collections::string::String,pdf::err::Error>
at /checkout/src/libcore/convert.rs:279
7: 0x5566f2f56685 - pdf::file::parse_object::{{impl}}::parse_object_internal
at /home/bbigras/pdf/src/file/parse_object.rs:166
8: 0x5566f2f53383 - pdf::file::parse_object::{{impl}}::parse_object_internal
at /home/bbigras/pdf/src/file/parse_object.rs:55
9: 0x5566f2f53383 - pdf::file::parse_object::{{impl}}::parse_object_internal
at /home/bbigras/pdf/src/file/parse_object.rs:55
10: 0x5566f2f52c33 - pdf::file::parse_object::{{impl}}::parse_object
at /home/bbigras/pdf/src/file/parse_object.rs:33
11: 0x5566f2f57deb - pdf::file::parse_object::{{impl}}::parse_indirect_object
at /home/bbigras/pdf/src/file/parse_object.rs:186
12: 0x5566f2f60bd6 - pdf::file::{{impl}}::read_indirect_object
at /home/bbigras/pdf/src/file/mod.rs:92
13: 0x5566f2f642d8 - pdf::file::{{impl}}::next
at /home/bbigras/pdf/src/file/mod.rs:269
14: 0x5566f2f4a820 - pdf::doc::{{impl}}::from_path
at /home/bbigras/pdf/src/doc/mod.rs:25
15: 0x5566f2f14222 - rs_imprime::get_nb_pages
at /home/bbigras/dev/rust/rs-imprime/src/main.rs:146
16: 0x5566f2f14b14 - rs_imprime::main
at /home/bbigras/dev/rust/rs-imprime/src/main.rs:257
17: 0x5566f2ff5025 - std::panicking::try::do_call<fn(),()>
at /checkout/src/libstd/panicking.rs:454
18: 0x5566f2ffc27a - panic_unwind::__rust_maybe_catch_panic
at /checkout/src/libpanic_unwind/lib.rs:98
19: 0x5566f2ff5ad6 - std::panicking::try<(),fn()>
at /checkout/src/libstd/panicking.rs:433
- std::panic::catch_unwind<fn(),()>
at /checkout/src/libstd/panic.rs:361
- std::rt::lang_start
at /checkout/src/libstd/rt.rs:57
20: 0x5566f2f14c32 - main
21: 0x7fd4f48ad510 - __libc_start_main
22: 0x5566f2f0c719 - _start
23: 0x0 - <unknown>
We need to release a new version to crates.io with all the improvements that have been implemented since.
Any blocking issues?
This PDF causes the following error when trying to load the pdf via:
let _file = pdf::file::File::<Vec<u8>>::open("test.pdf")?;
Error: Error(Msg("Key Root: cannot convert from primitive to type Catalog"), State { next_error: Some(Error(Msg("Key Pages: cannot convert from primitive to type PageTree"), State { next_error: Some(Error(Msg("Key Kids: cannot convert from primitive to type Vec < PagesNode >"), State { next_error: Some(Error(Msg("Key Resources: cannot convert from primitive to type Option < Resources >"), State { next_error: Some(Error(Msg("Key XObject: cannot convert from primitive to type Option < BTreeMap < String , XObject > >"), State { next_error: Some(Error(EntryNotFound { key: "Type" }, State { next_error: None, backtrace: Some(stack backtrace:
0: backtrace::backtrace::trace_unsynchronized::hee1893eb5da7e7a9 (0x10e8a070c)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.26/src/backtrace/mod.rs:66
1: backtrace::backtrace::trace::hca6b1efc3d61f6e3 (0x10e8a0692)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.26/src/backtrace/mod.rs:53
2: backtrace::capture::Backtrace::create::hd47ce747c1184538 (0x10e898896)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.26/src/capture.rs:163
3: backtrace::capture::Backtrace::new::h2c9cb6b3d23b2b00 (0x10e8987d4)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.26/src/capture.rs:126
4: error_chain::make_backtrace::h2587ae8cf04d98e7 (0x10e74a3f1)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.11.0/src/lib.rs:616
5: <error_chain::State as core::default::Default>::default::h6aba5dc07456c848 (0x10e74a45f)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.11.0/src/lib.rs:710
6: pdf::err::Error::from_kind::hcf862d3fafe4c047 (0x10e6d1bb0)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/<::error_chain::error_chain::impl_error_chain_processed macros>:53
7: <pdf::err::Error as core::convert::From<pdf::err::ErrorKind>>::from::hb3577d866bd2cdae (0x10e6d4977)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/<::error_chain::error_chain::impl_error_chain_processed macros>:98
8: <pdf::object::types::XObject as pdf::object::Object>::from_primitive::h7b8c25308f2940a5 (0x10e62f63e)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/types.rs:210
9: <alloc::collections::btree::map::BTreeMap<alloc::string::String, V> as pdf::object::Object>::from_primitive::h7ecb996880b2bfbf (0x10e66abea)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/mod.rs:236
10: <alloc::collections::btree::map::BTreeMap<alloc::string::String, V> as pdf::object::Object>::from_primitive::h7ecb996880b2bfbf (0x10e66af4a)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/mod.rs:240
11: pdf::primitive::<impl pdf::object::Object for core::option::Option<T>>::from_primitive::h732f52252c607dd5 (0x10e6911e6)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/primitive.rs:326
12: <pdf::object::types::Resources as pdf::object::Object>::from_primitive::hd45eab241d4dec61 (0x10e63b059)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/types.rs:135
13: pdf::primitive::<impl pdf::object::Object for core::option::Option<T>>::from_primitive::hd977042959c5f6cf (0x10e692483)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/primitive.rs:326
14: <pdf::object::types::Page as pdf::object::Object>::from_primitive::h335ba1f03f02c092 (0x10e638093)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/types.rs:89
15: <pdf::object::types::PagesNode as pdf::object::Object>::from_primitive::h24811c69acca4443 (0x10e62e1f1)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/types.rs:25
16: <alloc::vec::Vec<T> as pdf::object::Object>::from_primitive::{{closure}}::hedb9a02a32628cda (0x10e6c860c)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/mod.rs:195
17: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once::h149d4c1aea1a0351 (0x10e6bebb0)
at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/ops/function.rs:279
18: <core::option::Option<T>>::map::h4a7723266d769797 (0x10e689b9d)
at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/option.rs:414
19: <core::iter::Map<I, F> as core::iter::iterator::Iterator>::next::h51bfd912e8d5e7ef (0x10e6dc36e)
at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/iter/mod.rs:1428
20: <<core::result::Result<V, E> as core::iter::traits::FromIterator<core::result::Result<A, E>>>::from_iter::Adapter<Iter, E> as core::iter::iterator::Iterator>::next::h5b9a342e2ed9f8a2 (0x10e6e2e58)
at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/result.rs:1220
21: <&mut I as core::iter::iterator::Iterator>::next::h58363be6b4480122 (0x10e6eb68e)
at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/iter/iterator.rs:2644
22: <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T, I>>::from_iter::hf2babf0635f53d93 (0x10e6b89bb)
at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/liballoc/vec.rs:1813
23: <alloc::vec::Vec<T> as core::iter::traits::FromIterator<T>>::from_iter::hf1c3f6cdad026dbf (0x10e6b8f23)
at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/liballoc/vec.rs:1725
24: <core::result::Result<V, E> as core::iter::traits::FromIterator<core::result::Result<A, E>>>::from_iter::hf5fae6ddabd204a1 (0x10e6e25d0)
at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/result.rs:1237
25: core::iter::iterator::Iterator::collect::h3e0eb7698cf25103 (0x10e6db147)
at /rustc/2aa4c46cfdd726e97360c2734835aa3515e8c858/src/libcore/iter/iterator.rs:1468
26: <alloc::vec::Vec<T> as pdf::object::Object>::from_primitive::hd6bcee64ade5112c (0x10e6bbdec)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/mod.rs:193
27: <pdf::object::types::PageTree as pdf::object::Object>::from_primitive::h962ee7a4c4cc6534 (0x10e635e23)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/types.rs:69
28: <pdf::object::types::Catalog as pdf::object::Object>::from_primitive::h76a04b5e422d45fc (0x10e633969)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/object/types.rs:34
29: <pdf::file::Trailer as pdf::object::Object>::from_primitive::h1df45626e88e653a (0x10e6cd14b)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/file.rs:252
30: <pdf::file::File<B>>::open::h40b1658bb78a8ab7 (0x10e6211da)
at /Users/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.3/src/file.rs:142
Hi there,
I noticed some issues while decoding BW images created with gscan2pdf.
The image is flate encoded. My guess is that unfilter with predictors isn't working properly.
An example PDF and the rendered page are attached.
bw.pdf
@compenguy has the following problem:
https://github.com/compenguy/pdftext (run cargo test): the document has 1 page, but the pdf lib says it has 0, because the PageTree has count = 0.
inspect-prim shows that the /Count value of the page tree is in fact 1 (just run cargo run).
I will take a look again later.
While compiling the given example I got the below errors and warnings:
Compiling rust v0.1.0 (/Users/hasan/PycharmProjects/rust)
error[E0432]: unresolved import `pdf::error`
--> src/main.rs:9:10
|
9 | use pdf::error::PdfError;
| ^^^^^ could not find `error` in `pdf`
error[E0599]: no method named `as_ref` found for type `&pdf::object::types::Page` in the current scope
--> src/main.rs:21:30
|
21 | let resources = page.as_ref().unwrap().resources(&file).unwrap();
| ^^^^^^
|
= note: the method `as_ref` exists but the following trait bounds were not satisfied:
`&pdf::object::types::Page : std::convert::AsRef<_>`
error[E0599]: no method named `data` found for type `&std::rc::Rc<_>` in the current scope
--> src/main.rs:31:22
|
31 | f.write(&img.data().unwrap()).unwrap();
| ^^^^
|
= note: img is a function, perhaps you wish to call it
error: aborting due to 3 previous errors
Some errors have detailed explanations: E0432, E0599.
For more information about an error, try `rustc --explain E0432`.
error: Could not compile `rust`.
In the code, we often describe errors with a &'static str, e.g. bail!("Read past boundary of given contents."). I find this convenient when the code is unstable or under rewrite, as we have yet to decide on the enum variants of ErrorKind in err.rs.
At some point, we should go through the code, see what enum variants we need in ErrorKind, and use those instead of str, so that applications can pattern match on errors.
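To make the benefit concrete, here is a minimal sketch of what enum variants buy the caller. The variant names and the `read_byte` helper are assumptions made up for this example; the real set of variants would come out of auditing the existing string messages.

```rust
use std::fmt;

// Hypothetical error variants, for illustration only.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum ErrorKind {
    // Reading past the end of the input buffer.
    ContentReadPastBoundary,
    // A dictionary lacked a required key.
    EntryNotFound { key: &'static str },
    // Fallback for messages that have no dedicated variant yet.
    Other(&'static str),
}

impl fmt::Display for ErrorKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ErrorKind::ContentReadPastBoundary => {
                write!(f, "read past boundary of given contents")
            }
            ErrorKind::EntryNotFound { key } => write!(f, "entry not found: {}", key),
            ErrorKind::Other(msg) => write!(f, "{}", msg),
        }
    }
}

// Example function that reports a typed error instead of a string.
fn read_byte(data: &[u8], pos: usize) -> Result<u8, ErrorKind> {
    data.get(pos)
        .copied()
        .ok_or(ErrorKind::ContentReadPastBoundary)
}

fn main() {
    // The application can react to one specific failure and treat the
    // rest generically, which is not possible with plain strings.
    match read_byte(b"abc", 10) {
        Ok(byte) => println!("read {}", byte),
        Err(ErrorKind::ContentReadPastBoundary) => println!("input too short"),
        Err(other) => println!("fatal: {}", other),
    }
}
```

Keeping a catch-all variant such as `Other` lets the migration happen gradually, one message at a time.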
TODO
.as_str() on them)
I am looking for an alternative to pdftotext; such an example would be tremendously useful, also as a quick start.
pdf 646415a
rustc 1.17.0-nightly (b1e31766d 2017-03-03)
error: Error reading root.
caused by: Can't recognize type. Pos: 7242
First lexeme: null
Rest:
null 0]
/Lang(fr-CA)
>>
endobj
13 0 obj
<</Creato
End rest
backtrace: stack backtrace:
0: 0x55ff93495181 - backtrace::backtrace::libunwind::trace
at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.0/src/backtrace/libunwind.rs:53
- backtrace::backtrace::trace<closure>
at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.0/src/backtrace/mod.rs:42
1: 0x55ff93496113 - backtrace::capture::{{impl}}::new
at /home/bbigras/dev/rust/rs-imprime/target/debug/build/backtrace-17fa5380aeaf9437/out/capture.rs:79
2: 0x55ff9348cc2c - error_chain::make_backtrace
at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.9.0/src/lib.rs:413
3: 0x55ff9348cce7 - error_chain::{{impl}}::default
at /home/bbigras/.cargo/registry/src/github.com-1ecc6299db9ec823/error-chain-0.9.0/src/lib.rs:494
4: 0x55ff93421c10 - pdf::err::{{impl}}::from_kind
at /home/bbigras/dev/rust/rs-imprime/<error_chain_processed macros>:49
5: 0x55ff934220ee - pdf::err::{{impl}}::from
at /home/bbigras/pdf/src/err.rs:7
6: 0x55ff933ec709 - core::convert::{{impl}}::into<collections::string::String,pdf::err::Error>
at /checkout/src/libcore/convert.rs:279
7: 0x55ff93412685 - pdf::file::parse_object::{{impl}}::parse_object_internal
at /home/bbigras/pdf/src/file/parse_object.rs:166
8: 0x55ff934116ee - pdf::file::parse_object::{{impl}}::parse_object_internal
at /home/bbigras/pdf/src/file/parse_object.rs:134
9: 0x55ff9340f383 - pdf::file::parse_object::{{impl}}::parse_object_internal
at /home/bbigras/pdf/src/file/parse_object.rs:55
10: 0x55ff9340ec33 - pdf::file::parse_object::{{impl}}::parse_object
at /home/bbigras/pdf/src/file/parse_object.rs:33
11: 0x55ff93413deb - pdf::file::parse_object::{{impl}}::parse_indirect_object
at /home/bbigras/pdf/src/file/parse_object.rs:186
12: 0x55ff9341cbd6 - pdf::file::{{impl}}::read_indirect_object
at /home/bbigras/pdf/src/file/mod.rs:92
13: 0x55ff9341c7f0 - pdf::file::{{impl}}::dereference
at /home/bbigras/pdf/src/file/mod.rs:77
14: 0x55ff9341f728 - pdf::file::{{impl}}::read_root
at /home/bbigras/pdf/src/file/mod.rs:247
15: 0x55ff9341baae - pdf::file::{{impl}}::new
at /home/bbigras/pdf/src/file/mod.rs:55
16: 0x55ff9341b19b - pdf::file::{{impl}}::from_path
at /home/bbigras/pdf/src/file/mod.rs:36
17: 0x55ff934065a2 - pdf::doc::{{impl}}::from_path
at /home/bbigras/pdf/src/doc/mod.rs:24
18: 0x55ff933d0222 - rs_imprime::get_nb_pages
at /home/bbigras/dev/rust/rs-imprime/src/main.rs:146
19: 0x55ff933d0b14 - rs_imprime::main
at /home/bbigras/dev/rust/rs-imprime/src/main.rs:257
20: 0x55ff934b1025 - std::panicking::try::do_call<fn(),()>
at /checkout/src/libstd/panicking.rs:454
21: 0x55ff934b827a - panic_unwind::__rust_maybe_catch_panic
at /checkout/src/libpanic_unwind/lib.rs:98
22: 0x55ff934b1ad6 - std::panicking::try<(),fn()>
at /checkout/src/libstd/panicking.rs:433
- std::panic::catch_unwind<fn(),()>
at /checkout/src/libstd/panic.rs:361
- std::rt::lang_start
at /checkout/src/libstd/rt.rs:57
23: 0x55ff933d0c32 - main
24: 0x7f306b1ab510 - __libc_start_main
25: 0x55ff933c8719 - _start
26: 0x0 - <unknown>
Hi,
I am trying to open an existing PDF, insert some metadata and then save it.
But the file.trailer.info_dict values don't seem to be used when I call the .save_to method.
Is this feasible, or planned?
I am trying to open a PDF and am getting the following error:
Error: Try {
file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/0dcefd4/pdf/src/file.rs",
line: 94,
column: 19,
source: Try {
file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/0dcefd4/pdf/src/object/types.rs",
line: 22,
column: 42,
source: FromPrimitive {
typ: "Option < MaybeRef < Resources > >",
field: "resources",
source: Try {
file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/0dcefd4/pdf/src/file.rs",
line: 94,
column: 19,
source: FromPrimitive {
typ: "HashMap < String, ColorSpace >",
field: "color_spaces",
source: Try {
file: "/Users/markcatley/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/0dcefd4/pdf/src/object/color.rs",
line: 73,
column: 28,
source: Other {
msg: "found a function stream with type Some(Integer(0))"
}
}
}
}
}
}
}
I am using git master (0dcefd4).
Any suggestions? It's a public PDF that I've downloaded off the internet, however, the fact that I'm looking at it is private. I can provide the PDF or a link privately if that would help.
My code currently doesn't read the colours at all, it's just extracting text.
Is there any way to save a page as an image, or to save it to a buffer and load the image using https://github.com/image-rs/image?
Hi,
I got a panic when loading some PDFs. They display fine in a viewer. I have attached an example PDF that reproduces the problem. Is there anything I can do about this kind of PDF?
Thanks.
When loading this PDF via this code, the program gets stuck:
let _file = pdf::file::File::<Vec<u8>>::open("test1.pdf")?;
I struggled to find out how to compile the read example while enabling standard-fonts.
But while compiling the example directly it could not find the following 3 fonts:
fonts after executing download_fonts.sh:
fonts/:
AdobePiStd.otf CourierStd-Oblique.otf MinionPro-Bold.otf MyriadPro-BoldIt.otf MyriadPro-Regular.otf ZX______.PFB
CourierStd-BoldOblique.otf CourierStd.otf MinionPro-It.otf MyriadPro-Bold.otf PFM ZY______.PFB
CourierStd-Bold.otf MinionPro-BoldIt.otf MinionPro-Regular.otf MyriadPro-It.otf SY______.PFB
download_fonts.sh
I would prefer documenting since the project targets advanced users.
$ cargo run -p read
How can I open a password-protected PDF?
Is there some way to pass the password?
===
Error: Reading Xref stream
caused by: Error parsing from string - word: >
===
===
Exiting...
Flamegraph suggests error-chain is a bottleneck. Maybe we rely too much on propagating errors, though.
I'm unable to provide an example PDF because it contains sensitive data :(
On rustc 1.24.0-nightly (5a2465e2b 2017-12-06).
error: proc-macro derive panicked
--> /home/sanmai/.cargo/bin/registry/src/github.com-1ecc6299db9ec823/pdf-0.6.2/src/types.rs:51:10
|
51 | #[derive(Object, FromDict, Debug)]
| ^^^^^^
|
= help: message: Derive error - Supported derive attributes: `key="Key"`, `default="some code"`.
Iterating over all named destinations takes 245 s for 23000 items!
Doing the same with PyPDF2 under PyPy takes 8 s.
For a Rust library I would expect this to take less than 1 s.
Current status: works sometimes.
TODO:
I opened a PDF file created with xelatex, did some annotation in Drawboard PDF and saved it (xelatex_drawboard.pdf). A test crashes with this new file because of this peculiar cross-reference stream:
/W = [1, 2, 0]
Data = [2, 1, 183, 248, 2, 0, 1, 88, 2, 0, 3, 245, 2, 0, 0, 21, 2, 0, 2, 7]
From the reference: "A value of zero for an element in the W array indicates that the corresponding field is not present in the stream, and the default value is used, if there is one."
Each entry in the data should have 1 + 2 + 0 = 3 bytes, but clearly this is not true: there are 5 entries with 4 bytes each. Every PDF reader I use has no problem viewing it, so I think this library should try to tolerate it as well. But how should it deal with this (given that it's not just my understanding that is wrong)?
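The mismatch can be checked mechanically. The sketch below is not the library's xref parser; `decode_xref_rows` is a hypothetical helper that splits the data into rows according to /W, treats a zero width as an absent field, and reports how many bytes are left over. Trying the declared widths and then a third width of 1 (an assumption based on the "5 entries with 4 bytes" observation) shows which layout divides the data evenly.

```rust
// Decodes the rows of a cross-reference stream given the /W field widths.
// Returns the decoded rows plus the number of trailing bytes that do not
// form a complete row. A width of zero means the field is absent (None).
fn decode_xref_rows(w: [usize; 3], data: &[u8]) -> (Vec<[Option<u64>; 3]>, usize) {
    let row_len: usize = w.iter().sum();
    if row_len == 0 {
        return (Vec::new(), data.len());
    }
    let mut rows = Vec::new();
    for row in data.chunks_exact(row_len) {
        let mut fields = [None; 3];
        let mut offset = 0;
        for (i, &width) in w.iter().enumerate() {
            if width > 0 {
                // Fields are big-endian unsigned integers.
                let value = row[offset..offset + width]
                    .iter()
                    .fold(0u64, |acc, &b| (acc << 8) | u64::from(b));
                fields[i] = Some(value);
                offset += width;
            }
        }
        rows.push(fields);
    }
    (rows, data.len() % row_len)
}

fn main() {
    let data: [u8; 20] = [
        2, 1, 183, 248, 2, 0, 1, 88, 2, 0, 3, 245, 2, 0, 0, 21, 2, 0, 2, 7,
    ];

    // Widths as declared in the file.
    let (rows, leftover) = decode_xref_rows([1, 2, 0], &data);
    println!("W=[1,2,0]: {} rows, {} leftover bytes", rows.len(), leftover);

    // Widths that would match the "5 entries with 4 bytes" observation.
    let (rows, leftover) = decode_xref_rows([1, 2, 1], &data);
    println!("W=[1,2,1]: {} rows, {} leftover bytes", rows.len(), leftover);
    println!("first row: {:?}", rows[0]);
}
```

A non-zero leftover count is a cheap signal that the declared widths and the data disagree; what a tolerant reader should do after detecting that is the open question raised above.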
why?
Unable to update https://gitlab.com/sebk/tuple/
You can't just call Mmap methods behind unsafe code and call it a day. If someone modifies the file, even from another program, things will go very badly. One should be able to read a PDF document from an Mmap value only from unsafe code.
I would suggest killing Backend::open and letting people initialise their Mmap or Vec<u8> inputs themselves.
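A minimal sketch of that suggestion follows. The `Document` type and `from_data` constructor are hypothetical names for this example, not the library's API: the library side accepts any caller-owned byte container, and the caller decides how the bytes are obtained.

```rust
// Hypothetical document type that never opens files itself: the caller
// supplies the bytes (a Vec<u8>, or a memory map they created themselves
// and take responsibility for).
struct Document<B: AsRef<[u8]>> {
    data: B,
}

impl<B: AsRef<[u8]>> Document<B> {
    // Accepts any byte container and checks for the "%PDF-" header.
    fn from_data(data: B) -> Result<Self, String> {
        if data.as_ref().starts_with(b"%PDF-") {
            Ok(Document { data })
        } else {
            Err("file header is missing".to_string())
        }
    }

    // Number of bytes held by the underlying container.
    fn len(&self) -> usize {
        self.data.as_ref().len()
    }
}

fn main() {
    // An in-memory buffer; a caller could equally pass the result of
    // std::fs::read.
    let bytes: Vec<u8> = b"%PDF-1.4\n%%EOF\n".to_vec();
    match Document::from_data(bytes) {
        Ok(doc) => println!("document holds {} bytes", doc.len()),
        Err(e) => println!("error: {}", e),
    }
}
```

With this shape, any unsafe block needed to create a memory map lives in the caller's code, next to the knowledge of whether the file can change underneath it.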
I am writing a tool that extracts data out of some domain-specific PDFs and am finding it quite cumbersome to deal with the untyped operations.
I am writing an enum version of pdf::content::Operation. Is that something you'd be interested in having in-tree, or should I create a separate repo and crate for it?
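For readers unfamiliar with the idea, a typed layer over untyped operations might look roughly like the sketch below. All types here (`Operand`, `Operation`, `Op`) are simplified, hypothetical stand-ins rather than the library's definitions; only the operator names BT, ET, m and Tj come from the PDF content stream syntax.

```rust
// Hypothetical untyped operation: an operator string plus operands.
#[derive(Debug, Clone, PartialEq)]
enum Operand {
    Number(f32),
    Text(String),
}

struct Operation {
    operator: String,
    operands: Vec<Operand>,
}

// Hypothetical typed counterpart covering a few operators.
#[derive(Debug, PartialEq)]
enum Op {
    BeginText,
    EndText,
    MoveTo { x: f32, y: f32 },
    ShowText(String),
    Unknown(String),
}

// Converts an untyped operation, validating operand count and types.
fn typed(op: &Operation) -> Result<Op, String> {
    match (op.operator.as_str(), op.operands.as_slice()) {
        ("BT", []) => Ok(Op::BeginText),
        ("ET", []) => Ok(Op::EndText),
        ("m", [Operand::Number(x), Operand::Number(y)]) => Ok(Op::MoveTo { x: *x, y: *y }),
        ("Tj", [Operand::Text(s)]) => Ok(Op::ShowText(s.clone())),
        // Known operator, but the operands did not match its signature.
        ("BT", _) | ("ET", _) | ("m", _) | ("Tj", _) => {
            Err(format!("wrong operands for {}", op.operator))
        }
        // Operators this sketch does not model are passed through by name.
        (other, _) => Ok(Op::Unknown(other.to_string())),
    }
}

fn main() {
    let op = Operation {
        operator: "Tj".to_string(),
        operands: vec![Operand::Text("Hello".to_string())],
    };
    println!("{:?}", typed(&op));
}
```

The consumer then matches on `Op` variants and no longer has to re-check operand counts and types at every use site.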
Is it on the roadmap to make pdf compatible with stable again? FYI, it looks like #![feature(custom_attribute)] is unused. Here's how the other unstable features are currently being used.
termination_trait_lib: This is used in PdfError, to print error information while the process status code for the error is being fetched. This could be replaced with some boilerplate at each relevant main() function, to print out the same error information if an error bubbles all the way up.
try_trait: This is used to implement a conversion from NoneError to PdfError. This conversion is only used once, in font.rs, and it could be replaced with .ok_or(PdfError::Other { msg: ... }).
core_intrinsics: This is used to implement AnyObject and for error handling. I'm not familiar with this, but it's possible std::any::Any or the typename crate could fill in the gap here.
When running "cargo run -p view":
Unable to update https://gitlab.com/sebk/tuple/
While executing the text example:
use std::env::args;
use pdf::file::File;
use pdf::content::*;
use pdf::primitive::Primitive;
fn add_primitive(p: &Primitive, out: &mut String) {
    // println!("p: {:?}", p);
    match p {
        &Primitive::String(ref s) => if let Ok(text) = s.as_str() {
            out.push_str(text);
        }
        &Primitive::Array(ref a) => for p in a.iter() {
            add_primitive(p, out);
        }
        _ => ()
    }
}
fn main() {
    let path = args().nth(1).expect("no file given");
    println!("read: {}", path);
    let file = File::<Vec<u8>>::open(&path).unwrap();
    let mut out = String::new();
    for page in file.pages() {
        for Operation { ref operator, ref operands } in &page.unwrap().contents.as_ref().unwrap().operations {
            // println!("{} {:?}", operator, operands);
            match operator.as_str() {
                "Tj" | "TJ" | "BT" => operands.iter().for_each(|p| add_primitive(p, &mut out)),
                _ => {}
            }
        }
    }
    println!("{}", out);
}
With Cargo.toml:
[dependencies]
pdf = { git="https://github.com/pdf-rs/pdf" }
I got the below error:
PS C:\Users\hasan.DESKTOP-HU2FQ29\PycharmProjects\rust> cargo run HasanResume.pdf
Finished dev [unoptimized + debuginfo] target(s) in 0.14s
Running `target\debug\rust.exe HasanResume.pdf`
read: HasanResume.pdf
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FromPrimitive { typ: "Option < Rc < Resources > >", field: "resources", source: FromPrimitive { typ: "BTreeMap < String, Rc < Font > >", field: "fonts", source: FromPrimitive { typ: "Vec < Rc < Font > >", field: "descendant_fonts", source: FromPrimitive { typ: "FontDescriptor", field: "font_descriptor", source: FromPrimitive { typ: "Rect", field: "font_bbox", source: UnexpectedPrimitive { expected: "Array", found: "Reference" }
} } } } }', src\libcore\result.rs:1165:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
error: process didn't exit successfully: `target\debug\rust.exe HasanResume.pdf` (exit code: 101)
Like this.
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf_input = PdfFileReader(open("test.pdf", 'rb'))
pdf_output = PdfFileWriter()
page = pdf_input.getPage(2)
pdf_output.addPage(page)
pdf_output.write(open("./splitted.pdf", 'wb'))
Reason: we panic and unwrap in serialize functions because we can't use the ? operator.
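One possible direction, sketched under the assumption that the serialize functions could be changed to return a Result: once the signature returns io::Result, write failures propagate with ? and nothing needs to panic. The `Primitive` type below is a simplified, hypothetical stand-in for this example.

```rust
use std::io::{self, Write};

// Hypothetical primitive type; not the library's definition.
enum Primitive {
    Integer(i32),
    Name(String),
    Array(Vec<Primitive>),
}

impl Primitive {
    // Returns io::Result so that write failures propagate with `?`
    // instead of panicking.
    fn serialize<W: Write>(&self, out: &mut W) -> io::Result<()> {
        match self {
            Primitive::Integer(n) => write!(out, "{}", n),
            Primitive::Name(name) => write!(out, "/{}", name),
            Primitive::Array(items) => {
                write!(out, "[")?;
                for (i, item) in items.iter().enumerate() {
                    if i > 0 {
                        write!(out, " ")?;
                    }
                    item.serialize(out)?;
                }
                write!(out, "]")
            }
        }
    }
}

fn main() -> io::Result<()> {
    let value = Primitive::Array(vec![
        Primitive::Name("DeviceRGB".to_string()),
        Primitive::Integer(3),
    ]);
    let mut buf = Vec::new();
    value.serialize(&mut buf)?;
    println!("{}", String::from_utf8_lossy(&buf));
    Ok(())
}
```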
Hi, is it possible to strip possibly malicious elements (such as JavaScript or OpenAction tags) off a PDF document using this tool?
In most cases, a stream ought to be decoded in order to be used by the PDF library (for example, an xref stream). As it is right now, this is done during parsing (parser/mod.rs). In other cases, the encoded data doesn't need to be decoded before it is consumed by the application. Example: DCT filter for JPEG stream.
Need a policy for when to keep it as it is, and when to decode.
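One shape such a policy could take is sketched below: the stream keeps its raw data together with the list of filters still applied, and decoding stops at the first filter the application is expected to handle itself. The `Stream` and `Filter` types are hypothetical, and only ASCIIHexDecode is implemented so that the sketch has no dependencies.

```rust
// Hypothetical filter set for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Filter {
    AsciiHex,
    Dct,
}

struct Stream {
    data: Vec<u8>,
    // Filters still applied to `data`, in decoding order.
    filters: Vec<Filter>,
}

// Decodes ASCIIHexDecode data: hex digits, optional whitespace, '>' as
// the end-of-data marker.
fn decode_ascii_hex(data: &[u8]) -> Result<Vec<u8>, String> {
    let mut digits = Vec::new();
    for &b in data {
        if b == b'>' {
            break;
        }
        if b.is_ascii_whitespace() {
            continue;
        }
        let d = (b as char)
            .to_digit(16)
            .ok_or_else(|| format!("bad hex digit {:?}", b as char))?;
        digits.push(d as u8);
    }
    // An odd number of digits is padded with a trailing zero.
    if digits.len() % 2 == 1 {
        digits.push(0);
    }
    Ok(digits.chunks(2).map(|p| p[0] * 16 + p[1]).collect())
}

impl Stream {
    // Applies filters until one is reached that is left to the
    // application (here: DCT, i.e. JPEG data).
    fn decode_for_library(&mut self) -> Result<(), String> {
        while let Some(&filter) = self.filters.first() {
            match filter {
                Filter::AsciiHex => {
                    self.data = decode_ascii_hex(&self.data)?;
                    self.filters.remove(0);
                }
                Filter::Dct => break,
            }
        }
        Ok(())
    }
}

fn main() -> Result<(), String> {
    let mut stream = Stream {
        data: b"FF D8 FF>".to_vec(),
        filters: vec![Filter::AsciiHex, Filter::Dct],
    };
    stream.decode_for_library()?;
    println!("data: {:?}, remaining filters: {:?}", stream.data, stream.filters);
    Ok(())
}
```

Because the remaining filters travel with the data, a consumer can always tell which encodings are still in place.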
Simplest idea, and what I will do now:
parser/mod.rs.
.info field has a filter field ... all streams have this anyway.
from_primitive().
parser/mod.rs depending on what filters are on it (e.g. not in the case of DCT filter). It would still require data about which filters are still applied to the data.))
Needed for a sane implementation of grafeia/grafeia#2
I tried to read a PDF with this code and got an error.
#[cfg(test)]
mod pdf_test {
    use glob::glob;
    use pdf::file::File;
    macro_rules! file_path {
        ( $sub_dir:expr ) => { concat!("./src/test/common/", $sub_dir) }
    }
    macro_rules! run {
        ($e:expr) => (
            match $e {
                Ok(v) => v,
                Err(e) => {
                    e.trace();
                    panic!("{}", e);
                }
            }
        )
    }
    #[test]
    pub fn read_pages() {
        for entry in glob(file_path!("original.pdf")).expect("Failed to read glob pattern") {
            match entry {
                Ok(path) => {
                    println!("\n == Now testing `{}` ==", path.to_str().unwrap());
                    let path = path.to_str().unwrap();
                    let file = run!(File::<Vec<u8>>::open(path));
                    for i in 0 .. file.num_pages() {
                        println!("Read page {}", i);
                        let _ = file.get_page(i);
                    }
                }
                Err(e) => println!("{:?}", e)
            }
        }
    }
}
== Now testing `src\test\common\original.pdf` ==
0: Try at C:\Users\11989\.cargo\registry\src\github.com-1ecc6299db9ec823\pdf-0.7.2\src\file.rs:277:23
1: Can't parse field root of struct RcRef < Catalog >.
2: Try at C:\Users\11989\.cargo\registry\src\github.com-1ecc6299db9ec823\pdf-0.7.2\src\file.rs:94:19
3: Can't parse field names of struct Option < RcRef < NameDictionary > >.
4: Expected primitive Reference, found primive Dictionary instead.
thread 'test::pdf_test::pdf_test::read_pages' panicked at 'Try at C:\Users\11989\.cargo\registry\src\github.com-1ecc6299db9ec823\pdf-0.7.2\src\file.rs:277:23', src\test\pdf_test.rs:33:32
stack backtrace:
0: std::panicking::begin_panic_handler
at /rustc/657bc01888e6297257655585f9c475a0801db6d2\/library\std\src\panicking.rs:515
1: std::panicking::begin_panic_fmt
at /rustc/657bc01888e6297257655585f9c475a0801db6d2\/library\std\src\panicking.rs:457
2: document_manager::test::pdf_test::pdf_test::read_pages
at .\src\test\pdf_test.rs:33
3: document_manager::test::pdf_test::pdf_test::read_pages::{{closure}}
at .\src\test\pdf_test.rs:23
4: core::ops::function::FnOnce::call_once<closure-0,tuple<>>
at /rustc/657bc01888e6297257655585f9c475a0801db6d2\library\core\src\ops\function.rs:227
5: core::ops::function::FnOnce::call_once
at /rustc/657bc01888e6297257655585f9c475a0801db6d2\library\core\src\ops\function.rs:227
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test test::pdf_test::pdf_test::read_pages ... FAILED
failures:
failures:
test::pdf_test::pdf_test::read_pages
After I rewrote this file with Python's PyPDF2, it could be read successfully. However, the rewritten file is smaller.
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf_input = PdfFileReader(open("talkiin.pdf", 'rb'))
page_count = pdf_input.getNumPages()
pdf_output = PdfFileWriter()
for i in range(page_count):
    page = pdf_input.getPage(i)
    pdf_output.addPage(page)
pdf_output.write(open("./splitted.pdf".format(i), 'wb'))
I'm sorry I can't provide this document.
I tried parsing a corrupt PDF file, and the program panicked with the following error. The unwrap() was in dump_data().
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: NotFound, error: PathError { path: "/tmp/pdf/oTRQva", err: Os { code: 2, kind: NotFound, message: "No such file or directory" } } }', src/libcore/result.rs:999:5
After I created a /tmp/pdf
directory, the panic went away. However, since I didn't have logging on, it created a file in /tmp/pdf
without notifying me of its presence. I'd recommend two changes to this function:
pdf-
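Independent of the specific recommendations, the panic described above could be avoided along the following lines. This `dump_data` is a hypothetical rewrite with assumed parameters, not the library's function: it creates the directory if it is missing, returns errors instead of unwrapping, and hands the path back so the caller can report where the data went.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical debug-dump helper: creates the target directory if
// needed and reports errors to the caller instead of unwrapping.
fn dump_data(dir: &Path, name: &str, data: &[u8]) -> io::Result<PathBuf> {
    fs::create_dir_all(dir)?;
    let path = dir.join(name);
    fs::write(&path, data)?;
    Ok(path)
}

fn main() -> io::Result<()> {
    // The directory and file names are placeholders for this sketch.
    let dir = std::env::temp_dir().join("pdf-dump-sketch");
    let path = dump_data(&dir, "stream.bin", b"corrupt stream bytes")?;
    // Tell the user where the data went rather than writing silently.
    eprintln!("dumped stream data to {}", path.display());
    Ok(())
}
```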
I've tried to use the "text" example with a PDF I have, but I got the following error while accessing the first page:
Error: Try { file: "pdf\\src\\file.rs", line: 96, column: 19, source: Try { file: "pdf\\src\\object\\types.rs", line: 25, column: 36, source: FromPrimitive { typ: "Option < MaybeRef < Resources > >", field: "resources", source: FromPrimitive { typ: "HashMap < String, GraphicsStateParameters >", field: "graphics_states", source: FromPrimitive { typ: "Option < LineCap >", field: "line_cap", source: UnexpectedPrimitive { expected: "Name", found: "Integer" } } } } } }
The "inspect_prim" tool is working fine with it, so I've used some debug logging to find the resource causing the issue, and got the following data:
I'm a newbie regarding the PDF standard, so I can't really tell if this is a bug in the library or an issue with my PDF file. I've checked the PDF with an online validator tool, though, and it seems to be compliant with the PDF 1.4 standard.
Could you help me pinpoint the issue here? Why is the decoder expecting a Name? LineCap is an enum, so the Integer should work fine, right?
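For reference, the PDF specification defines the line cap style as an integer (0 for butt, 1 for round, 2 for projecting square caps), so an integer is what a decoder should expect here. The sketch below shows one way to map it; the `Primitive` and `LineCap` types are simplified, hypothetical stand-ins, not the library's definitions.

```rust
// Line cap styles, numbered as in the PDF specification.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LineCap {
    Butt = 0,
    Round = 1,
    Square = 2,
}

// Hypothetical primitive type with just the variants needed here.
#[derive(Debug)]
enum Primitive {
    Integer(i32),
    Name(String),
}

impl LineCap {
    // Accepts the integer encoding and rejects everything else.
    fn from_primitive(p: &Primitive) -> Result<LineCap, String> {
        match p {
            Primitive::Integer(0) => Ok(LineCap::Butt),
            Primitive::Integer(1) => Ok(LineCap::Round),
            Primitive::Integer(2) => Ok(LineCap::Square),
            Primitive::Integer(n) => Err(format!("invalid line cap style {}", n)),
            other => Err(format!("expected Integer, found {:?}", other)),
        }
    }
}

fn main() {
    println!("{:?}", LineCap::from_primitive(&Primitive::Integer(1)));
    println!(
        "{:?}",
        LineCap::from_primitive(&Primitive::Name("Round".to_string()))
    );
}
```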
Since I removed from_dict and left only from_primitive, I should consider adding provided functionality in the Object trait that uses from_primitive to implement from_dict and from_stream. This would have to fail (return Err) for any primitive that is not a dict or stream, though. Maybe rather consider adding FromDict: Object that only provides from_dict().
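The first option can be sketched with provided trait methods. Everything below is a simplified assumption for illustration (the `Primitive` variants, the String error type, the `PageTree` example); it follows one reading of the idea, in which the provided methods wrap their input and defer to from_primitive, and the implementor's from_primitive is where unsupported primitives are rejected.

```rust
use std::collections::HashMap;

// Hypothetical, simplified primitive type.
#[derive(Debug, Clone, PartialEq)]
enum Primitive {
    Integer(i32),
    Dictionary(HashMap<String, Primitive>),
    Stream {
        info: HashMap<String, Primitive>,
        data: Vec<u8>,
    },
}

trait Object: Sized {
    // The only method implementors must provide.
    fn from_primitive(p: Primitive) -> Result<Self, String>;

    // Provided: wraps the dictionary and defers to from_primitive.
    fn from_dict(dict: HashMap<String, Primitive>) -> Result<Self, String> {
        Self::from_primitive(Primitive::Dictionary(dict))
    }

    // Provided: wraps the stream and defers to from_primitive.
    fn from_stream(info: HashMap<String, Primitive>, data: Vec<u8>) -> Result<Self, String> {
        Self::from_primitive(Primitive::Stream { info, data })
    }
}

// Example implementor that only accepts dictionaries.
#[derive(Debug, PartialEq)]
struct PageTree {
    count: i32,
}

impl Object for PageTree {
    fn from_primitive(p: Primitive) -> Result<Self, String> {
        match p {
            Primitive::Dictionary(dict) => match dict.get("Count") {
                Some(Primitive::Integer(n)) => Ok(PageTree { count: *n }),
                _ => Err("missing or invalid Count".to_string()),
            },
            other => Err(format!("expected Dictionary, found {:?}", other)),
        }
    }
}

fn main() {
    let mut dict = HashMap::new();
    dict.insert("Count".to_string(), Primitive::Integer(1));
    println!("{:?}", PageTree::from_dict(dict));
    println!("{:?}", PageTree::from_stream(HashMap::new(), vec![1, 2, 3]));
}
```

The alternative mentioned above, a separate FromDict: Object trait, would move from_dict out of Object so that only dictionary-backed types expose it.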
Hi, I'm getting an error when trying to use this crate.
❯ cargo run
Compiling pdf_derive v0.1.20
error[E0433]: failed to resolve: could not find `export` in `syn`
--> /Users/x/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf_derive-0.1.20/src/lib.rs:99:23
|
99 | type SynStream = syn::export::TokenStream2;
| ^^^^^^ could not find `export` in `syn`
error: aborting due to previous error
For more information about this error, try `rustc --explain E0433`.
error: could not compile `pdf_derive`
This seems to be a suggested fix: frondeus/test-case#60 (comment)
PDFs have embedded streams. These can be images or fonts. It would be great if there was an iterator over the embedded files, so that images can be extracted from the PDF.
Via experimentation, I learned that some text that wasn't rendering in the correct location was off by a factor equal to its font size. By adjusting this line to omit the font_size multiplier, I am able to get the text to render as expected.
Due to the proprietary nature of the file I'm parsing, I'm unable to supply an example of this in action, and I'm unclear how to synthesize a file that demonstrates this behavior.
We only have two very simple examples. But the library can do much more now!