Comments (4)
I've been experimenting a bit more, and I now have some code that extracts the images properly; with a slight change (see the commented-out line) the output comes out wrong. Maybe it can help you.
One obvious problem with this is that it bypasses the extension capabilities of PDFWriter, so while it fixes my problem, it kind of breaks library behaviour, so it might not be an acceptable solution for others.
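#include <fstream>
#include <string>
#include "PDFParser.h"
#include "PDFStreamInput.h"
#include "PDFDictionary.h"
#include "PDFInteger.h"
#include "IByteReaderWithPosition.h"

// Copies the still-encoded bytes of a stream object to a file.
void exportStream(const std::string &fname, PDFParser *parser, PDFStreamInput *str)
{
    std::ofstream outfile(fname.c_str(), std::ofstream::binary);
    PDFDictionary *dict = str->QueryStreamDictionary();
    // QueryDictionaryObject also resolves a Length given as an indirect reference.
    PDFObject *length_obj = parser->QueryDictionaryObject(dict, "Length");
    long long length = 0;
    if (length_obj && length_obj->GetType() == PDFObject::ePDFObjectInteger)
        length = static_cast<PDFInteger *>(length_obj)->GetValue();
    // Queried objects are ref-counted, so release them when done.
    if (length_obj)
        length_obj->Release();
    dict->Release();
    IByteReaderWithPosition *pstream = parser->GetParserStream();
    LongFilePositionType old_pos = pstream->GetCurrentPosition();
    // With the following 2 lines, JPG extraction works.
    IByteReader *stream = pstream;
    pstream->SetPosition(str->GetStreamContentStart());
    // With the next one, only some bytes are extracted
    // IByteReader *stream = parser->StartInputStreamReader(str);
    while (stream->NotEnded() && length > 0)
    {
        IOBasicTypes::Byte buf[4096];
        // Never read past the end of the stream content.
        IOBasicTypes::LongBufferSizeType want = sizeof(buf);
        if (length < (long long)sizeof(buf))
            want = (IOBasicTypes::LongBufferSizeType)length;
        IOBasicTypes::LongBufferSizeType count = stream->Read(buf, want);
        if (count == 0)
            break; // nothing more to read; avoid spinning forever
        length -= count;
        outfile.write((const char *)buf, count);
    }
    // Choose one depending on the path chosen above
    pstream->SetPosition(old_pos);
    // delete stream;
}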
from pdf-writer.
I just tested my code with a PDF consisting of a single page with a small 2x2 PNG. Using the StartInputStreamReader() method I got back a file containing "((", which is actually the PNG data unfiltered by FlateDecode. I got exactly that "((" in an uncompressed PDF when I ran
qpdf --stream-data=uncompress testpng.pdf testpng_uncompressed.pdf
So it seems my problem is that, in order to use StartInputStreamReader(), I would need to write the data back out through the same filter.
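One quick way to confirm this kind of result is to look at the first bytes of the extracted file. The following is only a minimal sketch, assuming the extracted bytes are already in memory; `sniffImageFormat` is a hypothetical helper and not part of PDFWriter. It checks for the 8-byte PNG signature and the JPEG start-of-image marker, so decoded sample data such as "((" is reported as "unknown".

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (not a PDFWriter API): returns "png", "jpeg" or
// "unknown" based on the leading signature bytes of a buffer.
std::string sniffImageFormat(const unsigned char *data, std::size_t size)
{
    // Every PNG file starts with this fixed 8-byte signature.
    static const unsigned char pngSig[8] = {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    if (size >= 8 && std::memcmp(data, pngSig, 8) == 0)
        return "png";
    // JPEG files start with the SOI marker FF D8, followed by another marker (FF ..).
    if (size >= 3 && data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF)
        return "jpeg";
    return "unknown";
}
```

If the result is "unknown", the bytes are most likely raw samples that still need to be wrapped in an image file format before a viewer can open them.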
OTOH, if the stream is an embedded image, why is it run through filters at all? It should stay untouched, no? Why is it useful for the library to read an "uncompressed" JPEG/PNG image?
Btw, thanks for the great work :)
PS: I now see that only JPG streams give proper images back, and that there is a TIFF2PDF conversion with no way to do the opposite. Is it easy to reverse the process, and if so, can you do it? Thanks!
Hi @raimundomartins,
Sounds like you had quite a bit of fun. Image extraction, something I have never done before, might be a bit involved. PDF files do not include "PNG"s or "JPG"s as such. They include image data in the formats that PDF knows. As it happens, JPG is one of them (called "DCT" there), but that's just a coincidence.
What you did is a good way of getting the encoded bytes. For JPG images it will give you the images as is, since they are normally embedded into the PDF unchanged. You can detect this by looking at the image's decoder (filter) array: if it only has 'DCTDecode', then this is a JPG image that you can take as is.
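The check described above can be sketched roughly as follows. This is an illustration only: `canCopyAsJpeg` is a hypothetical helper, and it assumes the filter names have already been read from the stream dictionary's Filter entry (which may be a single name or an array of names) into a vector, without the leading slash.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not a PDFWriter API). `filters` is assumed to hold
// the stream's filter names in order, e.g. {"DCTDecode"}.
bool canCopyAsJpeg(const std::vector<std::string> &filters)
{
    // Exactly one filter, and it is DCTDecode: the stream bytes are already
    // JPEG data and can be written to a .jpg file without decoding.
    return filters.size() == 1 && filters[0] == "DCTDecode";
}
```

Any other combination, such as FlateDecode alone or FlateDecode followed by DCTDecode, means the raw stream bytes are not a standalone JPEG and need at least one decoding step first.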
As you figured out yourself, using StartReadingFromStream decodes the streams per their decoder dictionary. It is very helpful if you wish to read PDF page content streams, which are normally Flate-encoded. You can also use it to decompress JPGs, as it can decode DCT (the JPG filter), and it can also decode ASCII85. You can also extend it with new decoders by adding an extender to the parser (see it used here where the parser determines which decoder to use). This is very useful if you want the raw image data, so that you can later encode it in any format you want.
If you are looking to build a generic image extractor, this is probably the way to go: get the decoded bytes and encode them back to whatever output image format your extractor wants to emit.
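As a rough illustration of the "decode, then re-encode" route, the sketch below wraps already-decoded samples in a binary PPM (P6) file, chosen here only because the format needs no external library. It assumes 8-bit RGB samples, three bytes per pixel, with the width and height taken from the image dictionary; `encodePpm` is a hypothetical helper, and real images may use other color spaces or bit depths.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: wraps decoded 8-bit RGB samples in a binary PPM (P6)
// container. Returns an empty string if the sample count does not match
// width * height * 3.
std::string encodePpm(std::size_t width, std::size_t height,
                      const std::vector<unsigned char> &rgb)
{
    if (width == 0 || height == 0 || rgb.size() != width * height * 3)
        return std::string();
    std::ostringstream out;
    // Header: magic number, dimensions, maximum sample value.
    out << "P6\n" << width << ' ' << height << "\n255\n";
    // Pixel data follows the header unchanged.
    out.write(reinterpret_cast<const char *>(rgb.data()),
              static_cast<std::streamsize>(rgb.size()));
    return out.str();
}
```

The returned string can be written to a file opened in binary mode. For PNG or JPEG output, the same decoded samples would instead be handed to an encoder library.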
There may be shortcuts, such as DCT images, which you can extract directly to JPGs without decoding them. There might be other shortcuts too; this would take a bit of research.
But in general, it sounds to me like you should definitely decode and then encode back to the target image format (you can use libjpeg or libpng for this).
Or, what might be better, find an existing PDF image extraction solution, if you don't care to build one yourself.
No, I have image extraction software (it's "common" in the Unix world), but I need other information about the PDF to process it, so I might as well do it in my application directly (since it was readily available).
Anyway, this issue was solved long ago (it was kind of a non-issue, even).
Thanks!