Hi, I tried to use the following code to extract all the streams in the pdf file. Some

Image Extraction giving invalid files about pdf-writer HOT 4 CLOSED

galkahana commented on July 20, 2024

Image Extraction giving invalid files

from pdf-writer.

Comments (4)

raimundomartins commented on July 20, 2024

I've been trying some extra things, and I now have code that extracts the images properly; with a slight twist (the commented-out reader line) the output is wrong. Maybe it can help you.
One obvious problem is that it bypasses the extension capabilities of PDFWriter. So while it fixes my problem, it kind of breaks library behaviour, and it might not be an acceptable solution for others.

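#include <fstream>
#include <string>

#include "PDFParser.h"
#include "PDFStreamInput.h"
#include "PDFDictionary.h"
#include "PDFInteger.h"
#include "PDFObjectCastPtr.h"
#include "RefCountPtr.h"

// Writes the raw (still encoded) bytes of a stream object to a file.
void exportStream(const std::string &fname, PDFParser *parser, PDFStreamInput *str)
{
    std::ofstream outfile(fname.c_str(), std::ofstream::binary);

    // Query* calls return owned references; the smart pointers release them.
    RefCountPtr<PDFDictionary> dict(str->QueryStreamDictionary());
    // QueryDictionaryObject also resolves Length when it is an indirect reference.
    PDFObjectCastPtr<PDFInteger> length_obj(parser->QueryDictionaryObject(dict.GetPtr(), "Length"));
    if(!length_obj)
        return;
    long long length = length_obj->GetValue();

    IByteReaderWithPosition *pstream = parser->GetParserStream();
    LongFilePositionType old_pos = pstream->GetCurrentPosition();

    // With the following 2 lines, JPG extraction works.
    IByteReader *stream = pstream;
    pstream->SetPosition(str->GetStreamContentStart());
    // With the next one, only some bytes are extracted.
    //IByteReader *stream = parser->StartInputStreamReader(str);
    while(stream->NotEnded() && length > 0)
    {
        IOBasicTypes::Byte buf[4096];
        const long long buf_size = sizeof(buf);
        IOBasicTypes::LongBufferSizeType count = stream->Read(buf, (IOBasicTypes::LongBufferSizeType)(length > buf_size ? buf_size : length));
        if(count == 0)
            break; // avoid spinning if the reader returns no data
        length -= count;
        outfile.write((const char *)buf, count);
    }

    // Choose one depending on the path chosen above.
    pstream->SetPosition(old_pos);
    //delete stream;
}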

raimundomartins commented on July 20, 2024

I just tested my code with a PDF consisting of a single page containing a small 2x2 PNG. Using the StartInputStreamReader() method I got back a "((" file, which is actually the PNG data with the FlateDecode filter already undone. I got exactly those "((" bytes in an uncompressed PDF when I ran
qpdf --stream-data=uncompress testpng.pdf testpng_uncompressed.pdf
so it seems my problem is that, in order to use StartInputStreamReader(), I would need to write the data back out with the same filter.
OTOH, if the stream is an embedded image, why is it run through filters anyway? It should stay untouched, no? Why is it useful for the library to read an "uncompressed" JPEG/PNG image?

Btw, thanks for the great work :)

PS: I now see that only JPG streams give proper images back, and that there is a TIFF2PDF conversion with no way to do the opposite. Is it easy to reverse the process, and if so, can you do it? Thanks!

galkahana commented on July 20, 2024

Hi @raimundomartins,
Sounds like you had quite a bit of fun. Image extraction, something that I have never done before, might be a bit involved. PDF files do not include "PNG"s or "JPG"s or such. They include image data in the formats that PDF knows. As it happens, JPG is one of them (called "DCT" there), but that's just a coincidence.

What you did is quite good for getting the encoded bytes. For JPG images it will give you the images as is, since they are normally embedded in the PDF as is. You can detect this by looking at the image stream's filter (decoder) array: if it contains only 'DCTDecode', then this is a JPG image that you can take as is.
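
The DCTDecode check described above could be sketched roughly as follows. This is a standalone illustration, not PDFWriter code: `canSaveAsJpeg` and `looksLikeJpeg` are hypothetical helpers, and the filter names are assumed to have already been read from the stream dictionary's Filter entry into a vector of strings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: true when the only filter on the image stream is
// DCTDecode, i.e. the raw stream bytes are already JPEG data.
// `filters` is assumed to hold the names from the stream's Filter entry
// (a single name or an array), without the leading slash.
bool canSaveAsJpeg(const std::vector<std::string> &filters)
{
    return filters.size() == 1 && filters[0] == "DCTDecode";
}

// Hypothetical sanity check on the extracted bytes: JPEG data starts with
// the SOI marker 0xFF 0xD8.
bool looksLikeJpeg(const unsigned char *data, std::size_t size)
{
    return size >= 2 && data[0] == 0xFF && data[1] == 0xD8;
}
```

Running the second check on the extracted file is a cheap way to catch the "invalid files" case early: a stream that also went through FlateDecode or ASCII85Decode will normally not start with the JPEG marker.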

As you figured out yourself, using StartReadingFromStream decodes the stream per the filters listed in its dictionary. It is very helpful if you wish to read PDF page content streams, which are normally encoded with Flate. You can also use it to decompress JPGs, as it can decode DCT (the JPG filter), and it can also decode ASCII85. You can also extend it with new decoders by adding an extender to the parser (see it used here where the parser determines which decoder to use). This is very useful if you want the raw image data, so that you can later encode it in any format that you want.

If you are looking to build a generic image extractor, this is probably the way to go: get the decoded bytes and encode them into whatever output image format your extractor wants to emit.
You can have shortcuts for DCT images, which you can extract directly to JPGs without decoding them. There might be other shortcuts; this might take a bit of research.

But in general, it sounds to me like you should definitely decode and then encode back to the target image format (you can use libjpeg or libpng for this).
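
As a minimal sketch of the "decode, then encode" route without any image library, the decoded bytes can be wrapped in a binary PPM file, which most image viewers and converters can read. `writePpm` is a hypothetical helper, and it assumes the decoded data is 8 bits per component, 3 components per pixel (DeviceRGB), with no row padding; the width and height would come from the image dictionary. Other colour spaces or bit depths would need conversion first.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: writes decoded pixel data as a binary PPM (P6) file.
// Assumes 8-bit RGB samples, rows stored top to bottom with no padding.
// Returns false on inconsistent input or on an I/O error.
bool writePpm(const std::string &path, std::size_t width, std::size_t height,
              const std::vector<unsigned char> &rgb)
{
    if(width == 0 || height == 0 || rgb.size() != width * height * 3)
        return false;

    std::ofstream out(path.c_str(), std::ofstream::binary);
    if(!out)
        return false;

    // PPM header: magic number, dimensions, maximum component value.
    out << "P6\n" << width << " " << height << "\n255\n";
    out.write(reinterpret_cast<const char *>(rgb.data()),
              static_cast<std::streamsize>(rgb.size()));
    return out.good();
}
```

For the 2x2 test image mentioned earlier in the thread, the decoded buffer would be expected to hold 12 bytes under these assumptions, and the call would be `writePpm("out.ppm", 2, 2, pixels)`.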

Or, what might be better, is to go find an existing PDF image extraction solution, if you don't care to build one yourself.

raimundomartins commented on July 20, 2024

No, I have image extraction software (it's "common" in the Unix world), but I need other information about the PDF to process it, so I might as well do it in my application directly (since it was readily available).

Anyways, this issue was solved long ago (kind of a non-issue even).
Thanks!
