Comments (22)

piksel commented on July 18, 2024

It seems like the only thing we can do is to add a way to ignore the CRC (in the library, that is). It should be a useful option to have in any case...

from sharpziplib.

piksel commented on July 18, 2024

A zip file is not simply a deflate-compressed file, but an archiving format with file tables etc.
The individual files inside the archive may be deflated, but you need to read the file metadata to find out. I think what you are looking for is ZipInputStream instead of InflaterInputStream.

lutz commented on July 18, 2024

The code should not be interpreted as "load a zip archive"; the example is simplified. data4.bin was extracted and contains deflate-compressed data. The rest (the archive and so on) works, but not for the data in this file.

piksel commented on July 18, 2024

Okay, I see. How was the file created? If there are no headers or metadata about the deflate stream, it can be hard to debug why the file cannot be read, and it might be related to some unsupported feature in our deflate implementation.

piksel commented on July 18, 2024

I tried reading from your data file, a single byte per read, and it seems like the deflate stream just ends after reading 208671 byte(s):

❯ dotnet run
Read 208671 byte(s) before exception: ICSharpCode.SharpZipLib.SharpZipBaseException: Unexpected EOF
   at ICSharpCode.SharpZipLib.Zip.Compression.Streams.InflaterInputStream.Fill()
   at ICSharpCode.SharpZipLib.Zip.Compression.Streams.InflaterInputStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.Stream.ReadByte()
   at Program.<Main>$(String[] args) in /tmp/szl-deflate/Program.cs:line 21

I also tried running it through zlib's example program, zpipe, and it gives the same result:

./zpipe -d < data.bin | wc -c
zpipe: invalid or incomplete deflate data
208671

lutz commented on July 18, 2024

Thank you for testing. The exact same exception is thrown here. I don't know which byte is the problem. The file is the deflate data of a PDF page content stream.

If I take the decompressed data and convert the byte[] to a UTF-8 string, the correct data comes back. So it seems that the single byte that triggers the error is the problem.

lutz commented on July 18, 2024

Is there a way to fix it here?

piksel commented on July 18, 2024

If zlib gives the exact same result, it's the data (input file) that is the problem.
The end looks a bit suspicious; perhaps you can just try removing the end of the file, one byte at a time?

lutz commented on July 18, 2024

Removing bytes one at a time seems like performance overkill. Is it possible to get a more specific exception that says at which position the problem occurs? A naive approach:

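            // Requires: using System; using System.IO;
            //           using ICSharpCode.SharpZipLib;
            //           using ICSharpCode.SharpZipLib.Zip.Compression.Streams;

            using (var input = new FileStream(@"data.bin", FileMode.Open, FileAccess.Read))
            using (var output = new MemoryStream(65536))
            using (var inflater = new InflaterInputStream(input))
            {
                try
                {
                    // Read one byte at a time so that everything inflated before
                    // the failure is kept; the contents of a larger buffer may be
                    // lost when Read throws part-way through.
                    int value;
                    while ((value = inflater.ReadByte()) >= 0)
                    {
                        output.WriteByte((byte)value);
                    }
                }
                catch (SharpZipBaseException e) when (e.Message.Equals("Unexpected EOF", StringComparison.OrdinalIgnoreCase))
                {
                    // Report how far inflation got instead of retrying the read:
                    // shrinking the buffer does not rewind the inflater.
                    Console.WriteLine($"Inflated {output.Length} byte(s) before: {e.Message}");
                }

                // Whatever was inflated before the failure is still usable here.
                var strg = System.Text.Encoding.UTF8.GetString(output.ToArray());
            }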

piksel commented on July 18, 2024

Yes, something like that is what I meant, but not for the final solution, just to find out what parts of the file shouldn't be passed to INFLATE. I assume it would be the same for all files in this format. Perhaps there is an additional CRC or something appended to the end? Or perhaps multiple streams are appended together in the original file and so the last deflate-record has its "isLastRecord" bit set to false?

lutz commented on July 18, 2024

The PDF specification allows the content to be a single deflated stream or an array of streams. But as I understand it, the concatenation into one single content stream happens after deflating. So in this case the data is produced as a self-contained container, which is then deflated. What is a CRC?

lutz commented on July 18, 2024

Is there any idea why other libraries in the PDF world, with their own implementations of deflate, can work with this data? I don't know why it ends at this point, because there is more data after it.

piksel commented on July 18, 2024

This project is focused on zip and tar.gz/bz2, so I have no insight into PDF, sorry. Plain DEFLATE is not that common in files, I would probably take a look at the producer of those files to see if it either includes too much or too little data. You could also try debugging your program and stepping back in the stack trace to see why more data is required (you would need to have a basic understanding of how the DEFLATE format works though).

asyncritus commented on July 18, 2024

I am getting the same error trying to inflate the data contained in this file: flatedata.zip (unzip the attachment first). It is also a portion of the content stream of a PDF. The data is definitely valid because the PDF from which it was extracted opens fine in Acrobat Reader, and I can also get it to decompress correctly using System.IO.Compression.DeflateStream (after skipping over the first 2 bytes, since DeflateStream expects RFC 1951 data vs. the RFC 1950 data that InflaterInputStream expects).
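
For anyone who wants to try the same workaround, here is a minimal sketch of it. It assumes .NET 6 or later (ZLibStream is only used to build sample input), a zlib header without a preset dictionary (so exactly 2 bytes to skip), and a hypothetical helper name, InflateIgnoringTrailer. It is not part of SharpZipLib.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Inflate a zlib (RFC 1950) stream as raw DEFLATE (RFC 1951),
// which means the Adler-32 trailer is never read or verified.
static byte[] InflateIgnoringTrailer(byte[] zlibData)
{
    // Skip the 2-byte zlib header (CMF, FLG). Assumes FDICT is not set.
    using var input = new MemoryStream(zlibData, 2, zlibData.Length - 2);
    using var raw = new DeflateStream(input, CompressionMode.Decompress);
    using var output = new MemoryStream();
    raw.CopyTo(output);
    return output.ToArray();
}

// Build a small zlib stream to stand in for a PDF content stream.
byte[] original = Encoding.ASCII.GetBytes("BT /F1 12 Tf (Hello) Tj ET");
byte[] zlibData;
using (var buffer = new MemoryStream())
{
    using (var z = new ZLibStream(buffer, CompressionLevel.Optimal))
    {
        z.Write(original, 0, original.Length);
    }
    zlibData = buffer.ToArray();
}

// Simulate the damaged files by dropping the last checksum byte.
byte[] truncated = zlibData[..^1];
byte[] restored = InflateIgnoringTrailer(truncated);
Console.WriteLine(Encoding.ASCII.GetString(restored));
```

The trade-off is that corruption in the compressed data itself is no longer detected by the checksum.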

asyncritus commented on July 18, 2024

It looks like this is actually a bug in Adobe's PDF generation engine. It is leaving off the last byte of the Adler-32 checksum if the last byte is 0x00. In the case of the file I provided, the computed checksum is 0x60F7D300, but the last 4 bytes of the data in the encoded stream are 0x00, 0x60, 0xF7, and 0xD3. In the case of the file @lutz provided, the computed checksum is 0x79DFAE00, but the last 4 bytes of data in the encoded stream are 0x00, 0x79, 0xDF, and 0xAE. I have confirmed that adding a byte with value 0x00 to the end of each of these files causes them to process correctly.

It would seem that Acrobat Reader must be ignoring the header and checksum fields and is just processing the raw DEFLATE data.
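
To make the comparison above reproducible, here is a small sketch of the Adler-32 calculation from RFC 1950 and of the byte order in which the trailer is stored (most significant byte first). The function names Adler32 and TrailerBytes are made up for illustration; the checksum values for the two files come from the numbers quoted above and are not recomputed here.

```csharp
using System;
using System.Text;

// Adler-32 as defined in RFC 1950: two running sums modulo 65521.
static uint Adler32(ReadOnlySpan<byte> data)
{
    const uint Mod = 65521;
    uint a = 1, b = 0;
    foreach (byte value in data)
    {
        a = (a + value) % Mod;
        b = (b + a) % Mod;
    }
    return (b << 16) | a;
}

// The four trailer bytes in the order a zlib stream stores them.
static byte[] TrailerBytes(uint checksum) => new[]
{
    (byte)((checksum >> 24) & 0xFF),
    (byte)((checksum >> 16) & 0xFF),
    (byte)((checksum >> 8) & 0xFF),
    (byte)(checksum & 0xFF),
};

// Well-known test vector for Adler-32.
uint sum = Adler32(Encoding.ASCII.GetBytes("Wikipedia"));
Console.WriteLine($"0x{sum:X8}"); // 0x11E60398

// Expected complete trailer for the first file discussed above.
Console.WriteLine(BitConverter.ToString(TrailerBytes(0x60F7D300))); // 60-F7-D3-00
```

Comparing the expected trailer 60-F7-D3-00 with the bytes actually found (60, F7, D3) shows that only the final 0x00 is missing.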

piksel commented on July 18, 2024

@asyncritus great detective work!

It could also be the case that the way they are reading/writing the checksum allows for truncating trailing null bytes. In the case of SharpZipLib it should be fairly easy to try to fill any missing bytes in the CRC with 0 bytes if it reaches EOF...
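
The zero-fill idea can be sketched on a plain byte array, independent of how SharpZipLib reads the trailer internally. ReadPaddedChecksum is a hypothetical helper name, and the sketch assumes the caller already knows which bytes after the DEFLATE data belong to the trailer.

```csharp
using System;

// Rebuild a 4-byte big-endian checksum from a possibly short trailer,
// treating every missing byte as 0x00.
static uint ReadPaddedChecksum(ReadOnlySpan<byte> trailer)
{
    if (trailer.Length > 4)
    {
        throw new ArgumentException("A trailer has at most 4 bytes.", nameof(trailer));
    }

    uint value = 0;
    for (int i = 0; i < 4; i++)
    {
        // Past the end of the data, substitute a zero byte.
        byte next = i < trailer.Length ? trailer[i] : (byte)0;
        value = (value << 8) | next;
    }
    return value;
}

// The three trailer bytes found in the first file discussed above.
uint checksum = ReadPaddedChecksum(new byte[] { 0x60, 0xF7, 0xD3 });
Console.WriteLine($"0x{checksum:X8}"); // 0x60F7D300
```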

piksel commented on July 18, 2024

...or perhaps it's the tool that extracts out the PDF streams that strips the trailing null bytes? How did you produce the file?

asyncritus commented on July 18, 2024

I opened the PDF file in a hex editor and stripped out everything directly before and directly after the binary stream data. Here you can see where the 0x00 at the end is missing:

[image: end of data]

Of course this is done programmatically by our PDF parsing software where the problem first manifested itself.

lutz commented on July 18, 2024

@asyncritus Great work. Your result matches what I suspected about the Adobe PDF engine.

lutz commented on July 18, 2024

@piksel We don't trim this information when we read. It seems that the Adobe PDF engine started doing this with a specific update. We were able to identify that the behaviour changed with Adobe's InDesign 18.5 update (Windows and Mac). Before that update it works; after it, it does not.

asyncritus commented on July 18, 2024

After some further investigation with more examples, I've found that it is not just leaving off trailing 0x00 bytes, but as soon as it encounters a 0x00 byte in the checksum, it stops writing data. For example, in one situation the checksum is 0x001E9C82, and none of those bytes are present. In another case, the checksum is 0x6C00878A, and only 0x6C was present.
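
A small model of that observed behaviour, useful for checking whether a short trailer is consistent with a computed checksum. TruncatedTrailer is a hypothetical name, and this only mirrors what I have seen in the sample files; it is not based on any knowledge of how the generator is implemented.

```csharp
using System;

// Model of the observed output: the big-endian Adler-32 bytes,
// cut off at (and excluding) the first 0x00 byte.
static byte[] TruncatedTrailer(uint checksum)
{
    byte[] full =
    {
        (byte)((checksum >> 24) & 0xFF),
        (byte)((checksum >> 16) & 0xFF),
        (byte)((checksum >> 8) & 0xFF),
        (byte)(checksum & 0xFF),
    };

    int length = Array.IndexOf(full, (byte)0);
    if (length < 0)
    {
        length = full.Length; // no zero byte: trailer is written in full
    }
    return full[..length];
}

// The checksums mentioned in this thread.
foreach (uint checksum in new uint[] { 0x60F7D300, 0x79DFAE00, 0x001E9C82, 0x6C00878A })
{
    byte[] written = TruncatedTrailer(checksum);
    Console.WriteLine($"0x{checksum:X8} -> {written.Length} byte(s): {BitConverter.ToString(written)}");
}
```

Because the trailer can be anywhere from 0 to 4 bytes long, padding with zeros alone cannot recover it in every case (0x6C00878A would be read back as 0x6C000000), so skipping checksum verification looks like the more robust option for these files.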

Our customer that is having these issues is using InDesign 19.0. We are trying to obtain the original InDesign documents so that we can test with an earlier version.

@lutz Have you contacted Adobe about this issue?

lutz commented on July 18, 2024

We could reproduce the behaviour back to version 18.5. One of our customers was able to check multiple InDesign versions, and 18.5 seems to be the first one affected. Version 17 should definitely work.

We have no contact with Adobe. The problem is that most PDF viewers we checked work with these files (Adobe Acrobat/Reader, PDF-XChange, SumatraPDF, browsers, and so on). It could be that most of them behave identically: they ignore the checksum and interpret the raw data.

So we do not have a strong enough argument.

The PDF specification is clear that deflate should be used, and the deflate specification is strict about its format (checksum and so on).

It is not the first time that Adobe, as the inventor of the PDF format, has interpreted PDF files loosely rather than strictly.
