
Comments (4)

raimundomartins commented on July 20, 2024

I've been trying some extra stuff, and I've got some code that extracts the images properly, while a slight variation of it gives wrong results; maybe it can help you.
One obvious problem with this is that it bypasses the extension capabilities of PDFWriter, so while it fixes my problem, it kind of breaks library behaviour, so it might not be an acceptable solution to others.

void exportStream(std::string fname, PDFParser *parser, PDFStreamInput *str)
{
    std::ofstream outfile(fname.c_str(), std::ofstream::binary);

    PDFDictionary *dict = str->QueryStreamDictionary();
    PDFObject *length_obj = dict->QueryDirectObject("Length");
    long long length = static_cast<PDFInteger *>(length_obj)->GetValue();

    IByteReaderWithPosition *pstream = parser->GetParserStream();
    LongFilePositionType old_pos = pstream->GetCurrentPosition();

    //With the following 2 lines, JPG extraction works.
    IByteReader *stream = pstream;
    pstream->SetPosition(str->GetStreamContentStart());
    //With the next one, only some bytes are extracted
    //IByteReader *stream = parser->StartInputStreamReader(str);
    while(stream->NotEnded() && length > 0)
    {
        IOBasicTypes::Byte buf[4096];
        IOBasicTypes::LongBufferSizeType count = stream->Read(buf, length > sizeof(buf) ? sizeof(buf) : length);
        length -= count;
        outfile.write((const char *)buf, count);
    }

    //Choose one, depending on the path chosen above
    pstream->SetPosition(old_pos);
    //delete stream;
}

from pdf-writer.

raimundomartins commented on July 20, 2024

I just tested my code with a PDF consisting of one page with a small 2x2 PNG. Using the StartInputStreamReader() method I got back a "((" file, which is actually the PNG data with the FlateDecode filter undone. I got exactly those "((" bytes in an uncompressed PDF when I ran
qpdf --stream-data=uncompress testpng.pdf testpng_uncompressed.pdf
so it seems my problem is that, in order to use StartInputStreamReader(), I would need to write the output back with the same filter.
OTOH, if the stream is an embedded image, why is it run through filters anyway? It should stay untouched, no? Why is it useful for the library to read an "uncompressed" JPEG/PNG image?

Btw, thanks for the great work :)

PS: I now see that only JPG streams give proper images back, and that there is a TIFF2PDF conversion with no way to do the opposite. Is it easy to reverse the process, and if so can you do it? Thanks!


galkahana commented on July 20, 2024

Hi @raimundomartins,
Sounds like you had quite a bit of fun. Image extraction, something I have never done before, might be a bit involved. PDF files do not include "PNG"s or "JPG"s or such. They include image data in the formats that PDF knows. As it happens, JPG is one of them (called "DCT" there)... but that's just a coincidence.

What you did is quite good for getting the encoded bytes. In the case of JPG images it will give you the images as is, since they are normally placed into the PDF as is. You can detect this by looking at the image's decoder (filter) array: if it only has 'DCTDecode', then this is a JPG image that you can take as is.
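
A minimal sketch of that check, assuming the filter names have already been read out of the stream dictionary's Filter entry (which may be a single name or an array of names) into a vector of strings. `isPassThroughJpeg` is a hypothetical helper, not part of PDFWriter, and it accepts the name with or without a leading slash:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true when the only filter on the stream is DCTDecode,
// i.e. the encoded bytes can be written out unchanged as a .jpg file.
// Assumes `filters` holds the names from the stream's Filter entry, in order.
bool isPassThroughJpeg(const std::vector<std::string>& filters)
{
    if (filters.size() != 1)
        return false;  // no filter at all, or a chain such as ASCII85 + DCT
    const std::string& name = filters[0];
    return name == "DCTDecode" || name == "/DCTDecode";
}
```

A stream with a chain such as ASCII85Decode followed by DCTDecode is rejected on purpose: its stored bytes are not a plain JPEG until the first filter has been undone.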

As you figured out yourself, using StartReadingFromStream decodes the streams per their decoder dictionary. It is very helpful if you wish to read PDF page content streams, which are normally encoded with Flate. You can also use it to decompress JPGs, as it can decode DCT (the JPG filter), and it can also decode ASCII85. You can also extend it with new decoders by adding an extender to the parser (see it used here, where the parser determines which decoder to use). This is very useful if you want the raw image data, to encode it later in any format that you want.
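
When going through the decoding path, one quick sanity check is to compare the number of decoded bytes with what the image dictionary implies. The sketch below assumes a plain sampled image (for example Flate-encoded, no masks), with the values taken from the Width, Height and BitsPerComponent entries and a component count of 1 for DeviceGray or Indexed, 3 for DeviceRGB and 4 for DeviceCMYK. In PDF each row of samples is padded to a whole number of bytes. `expectedImageBytes` is a hypothetical helper name:

```cpp
#include <cstdint>

// Hypothetical helper: number of bytes a fully decoded image stream should
// contain, given the values from its image dictionary.
std::uint64_t expectedImageBytes(std::uint64_t width, std::uint64_t height,
                                 unsigned components, unsigned bitsPerComponent)
{
    std::uint64_t bitsPerRow = width * components * bitsPerComponent;
    std::uint64_t bytesPerRow = (bitsPerRow + 7) / 8;  // rows are byte-aligned
    return bytesPerRow * height;
}
```

For a 2x2 DeviceRGB image at 8 bits per component this gives 12 bytes, so a decoded stream of a very different size suggests the wrong reader path or a colour space other than the one assumed.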

If you are looking to build a generic image extractor, this is probably the way to go: get the decoded bytes and encode them back to whatever output image format your extractor wants to emit.
There may be shortcuts for DCT images, which you can extract directly to JPGs without decoding them. There might be other shortcuts too. This might take a bit of research.

But in general it sounds to me like you should DEFO decode and then encode back to the target image format (you can use libjpeg or libpng for this).

Or, what might be better, is to go find an existing PDF image extraction solution, if you don't care to build one yourself.


raimundomartins commented on July 20, 2024

No, I have image extraction software (they're "common" in unix world), but I need other information about the pdf to process it, so I might as well do it in my application directly (since it was readily available)

Anyways, this issue was solved long ago (kind of a non-issue even).
Thanks!

