Comments (22)
It seems like the only thing we can do is to add a way to ignore the CRC (in the library, that is). It should be a useful option to have in any case...
from sharpziplib.
A zip file is not simply a deflate-compressed file, but an archiving format with file tables etc.
The individual files inside the archive may be deflated, but you need to read the file metadata to find out. I think what you are looking for is ZipInputStream instead of InflaterInputStream.
from sharpziplib.
The code should not be read as "load a zip archive". The example is simplified. The data4.bin file has already been extracted and contains deflate-compressed data. The rest (archive and so on) works, but not for the data in this file.
from sharpziplib.
Okay, I see. How was the file created? If there are no headers or meta data about the deflate stream, it can be hard to debug why the file cannot be read, and it might be related to some unsupported feature in our deflate implementation.
from sharpziplib.
I tried reading from your data file, a single byte per read, and it seems like the deflate stream just ends after reading 208671 byte(s):
❯ dotnet run
Read 208671 byte(s) before exception: ICSharpCode.SharpZipLib.SharpZipBaseException: Unexpected EOF
at ICSharpCode.SharpZipLib.Zip.Compression.Streams.InflaterInputStream.Fill()
at ICSharpCode.SharpZipLib.Zip.Compression.Streams.InflaterInputStream.Read(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.Stream.ReadByte()
at Program.<Main>$(String[] args) in /tmp/szl-deflate/Program.cs:line 21
I also tried running it through zlib's example program zpipe, and it gives the same result:
./zpipe -d < data.bin | wc -c
zpipe: invalid or incomplete deflate data
208671
from sharpziplib.
Thank you for testing. The exact same exception is thrown here. I don't know which byte is the problem. The file is the deflate data of a PDF page content stream.
If I take the data that was decompressed before the exception and convert the byte[] to a UTF-8 string, the correct data comes back. So it seems that a single byte, the one that triggers the error, is the problem.
from sharpziplib.
Is there a way to fix it here?
from sharpziplib.
If zlib gives the exact same result, it's the data (input file) that is the problem.
The end looks a bit suspicious, perhaps you can just try removing the end of the file, one byte at a time?
from sharpziplib.
Removing bytes one at a time seems like performance overkill. Is it possible to get a more specific exception that reports the position where the problem occurs? A naive way:
// Read in chunks; on "Unexpected EOF", shrink the chunk size and retry
// so that as much data as possible is recovered before the stream ends.
int length = 4096;
using (var input = new FileStream(@"data.bin", FileMode.Open))
using (var output = new MemoryStream(65536))
{
    using (var _inflater = new InflaterInputStream(input))
    {
        while (true)
        {
            var data = new byte[length];
            try
            {
                var size = _inflater.Read(data, 0, length);
                if (size <= 0)
                {
                    break; // nothing more could be read
                }
                output.Write(data, 0, size);
            }
            catch (ICSharpCode.SharpZipLib.SharpZipBaseException e) when (e.Message.Equals("Unexpected EOF", StringComparison.OrdinalIgnoreCase))
            {
                length -= 1; // retry with a smaller read; any other exception propagates
            }
        }
    }
    var strg = System.Text.Encoding.UTF8.GetString(output.ToArray());
}
from sharpziplib.
Yes, something like that is what I meant, but not for the final solution, just to find out what parts of the file shouldn't be passed to INFLATE. I assume it would be the same for all files in this format. Perhaps there is an additional CRC or something appended to the end? Or perhaps multiple streams are appended together in the original file and so the last deflate-record has its "isLastRecord" bit set to false?
from sharpziplib.
The PDF specification allows the content to be a single deflated stream or an array of streams. But as I understand it, the concatenation into one single content file happens after deflating. So in this case the data is produced as a closed container which is deflated. What is a CRC?
from sharpziplib.
Is there any idea whether other libraries in the PDF world, with their own implementations of deflate, can work with this data? I don't know why it ends at this point, because there is more data after it.
from sharpziplib.
This project is focused on zip and tar.gz/bz2, so I have no insight into PDF, sorry. Plain DEFLATE is not that common in files, I would probably take a look at the producer of those files to see if it either includes too much or too little data. You could also try debugging your program and stepping back in the stack trace to see why more data is required (you would need to have a basic understanding of how the DEFLATE format works though).
from sharpziplib.
I am getting the same error trying to inflate the data contained in this file: flatedata.zip (unzip the attachment first). It is also a portion of the content stream of a PDF. The data is definitely valid because the PDF from which it was extracted opens fine in Acrobat Reader, and I can also get it to decompress correctly using System.IO.Compression.DeflateStream (after skipping over the first 2 bytes, since DeflateStream expects RFC 1951 data whereas InflaterInputStream expects RFC 1950 data).
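The DeflateStream workaround described above can be sketched roughly as follows. This is a minimal illustration rather than code from the thread: the class and method names (`RawInflate`, `InflateSkippingZlibHeader`) are made up for the example, and it assumes the buffer starts with a plain 2-byte zlib header without a preset dictionary.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class RawInflate
{
    // Inflates an RFC 1950 (zlib) buffer by skipping its 2-byte header and
    // decoding the remainder as raw RFC 1951 DEFLATE data. The Adler-32
    // trailer is never read, so a missing or truncated trailer is ignored.
    public static byte[] InflateSkippingZlibHeader(byte[] zlibData)
    {
        if (zlibData == null || zlibData.Length < 2)
            throw new InvalidDataException("Too short to contain a zlib header.");

        // RFC 1950: the low nibble of CMF is 8 for DEFLATE, and
        // CMF * 256 + FLG must be a multiple of 31.
        bool headerOk = (zlibData[0] & 0x0F) == 8
            && ((zlibData[0] << 8) | zlibData[1]) % 31 == 0;
        if (!headerOk)
            throw new InvalidDataException("Not a zlib header.");

        // FDICT (bit 5 of FLG) means a 4-byte dictionary id follows the
        // header. This sketch does not handle that case.
        if ((zlibData[1] & 0x20) != 0)
            throw new NotSupportedException("Preset dictionaries are not handled.");

        using var input = new MemoryStream(zlibData, 2, zlibData.Length - 2);
        using var deflate = new DeflateStream(input, CompressionMode.Decompress);
        using var output = new MemoryStream();
        deflate.CopyTo(output);
        return output.ToArray();
    }
}
```

Note that skipping the trailer also skips integrity checking, so corrupted data would go undetected with this approach.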
from sharpziplib.
It looks like this is actually a bug in Adobe's PDF generation engine. It is leaving off the last byte of the Adler-32 checksum if the last byte is 0x00. In the case of the file I provided, the computed checksum is 0x60F7D300, but the last 4 bytes of the data in the encoded stream are 0x00, 0x60, 0xF7, and 0xD3. In the case of the file @lutz provided, the computed checksum is 0x79DFAE00, but the last 4 bytes of data in the encoded stream are 0x00, 0x79, 0xDF, and 0xAE. I have confirmed that adding a byte with value 0x00 to the end of each of these files causes them to process correctly.
It would seem that Acrobat Reader must be ignoring the header and checksum fields and is just processing the raw DEFLATE data.
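For reference, the checksum comparison described above can be reproduced with a small Adler-32 routine. The sketch below follows the algorithm as defined in RFC 1950; the class and method names (`Adler32Check`, `Compute`, `ToTrailer`) are illustrative and are not SharpZipLib APIs.

```csharp
using System;

public static class Adler32Check
{
    // Adler-32 as defined in RFC 1950: two running sums modulo 65521,
    // combined as (b << 16) | a.
    public static uint Compute(byte[] data)
    {
        const uint Mod = 65521;
        uint a = 1, b = 0;
        foreach (byte value in data)
        {
            a = (a + value) % Mod;
            b = (b + a) % Mod;
        }
        return (b << 16) | a;
    }

    // A zlib stream ends with the checksum in big-endian byte order,
    // most significant byte first.
    public static byte[] ToTrailer(uint checksum)
    {
        return new[]
        {
            (byte)(checksum >> 24),
            (byte)(checksum >> 16),
            (byte)(checksum >> 8),
            (byte)checksum,
        };
    }
}
```

With the first value quoted above, `ToTrailer(0x60F7D300)` yields 0x60, 0xF7, 0xD3, 0x00, so a stream that ends in 0x60, 0xF7, 0xD3 is one byte short of a complete trailer.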
from sharpziplib.
@asyncritus great detective work!
It could also be the case that the way they are reading/writing the checksum allows for truncating trailing null bytes. In the case of SharpZipLib it should be fairly easy to try to fill any missing bytes in the CRC with 0 bytes if it reaches EOF...
from sharpziplib.
...or perhaps it's the tool that extracts out the PDF streams that strips the trailing null bytes? How did you produce the file?
from sharpziplib.
I opened the PDF file in a hex editor and stripped out everything directly before and directly after the binary stream data. Here you can see where the 0x00 at the end is missing:
Of course this is done programmatically by our PDF parsing software where the problem first manifested itself.
from sharpziplib.
@asyncritus Great work. Your result matches what I suspected about the Adobe PDF engine.
from sharpziplib.
@piksel We don't trim this information when we read. It seems that the Adobe PDF engine has been doing this since a specific update. We were able to identify that the behaviour changed with Adobe's InDesign 18.5 update (Windows and Mac). Before that it works, and after it does not.
from sharpziplib.
After some further investigation with more examples, I've found that it is not just leaving off trailing 0x00 bytes, but as soon as it encounters a 0x00 byte in the checksum, it stops writing data. For example, in one situation the checksum is 0x001E9C82, and none of those bytes are present. In another case, the checksum is 0x6C00878A, and only 0x6C was present.
Our customer that is having these issues is using InDesign 19.0. We are trying to obtain the original InDesign documents so that we can test with an earlier version.
@lutz Have you contacted Adobe about this issue?
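The pattern reported in this comment (writing stops at the first 0x00 byte of the checksum) can be modelled as below. This is only a sketch of the behaviour observed in this thread, not documented Adobe behaviour, and the names `TruncatedTrailer` and `ExpectedBytes` are hypothetical.

```csharp
using System;

public static class TruncatedTrailer
{
    // Returns the trailer bytes that would be present if the writer emits
    // the big-endian checksum but stops at the first 0x00 byte, which is
    // the behaviour observed in the affected files.
    public static byte[] ExpectedBytes(uint checksum)
    {
        byte[] full =
        {
            (byte)(checksum >> 24),
            (byte)(checksum >> 16),
            (byte)(checksum >> 8),
            (byte)checksum,
        };

        // Index of the first zero byte, or 4 if the checksum has none.
        int count = Array.IndexOf(full, (byte)0);
        if (count < 0)
        {
            count = full.Length;
        }

        byte[] present = new byte[count];
        Array.Copy(full, present, count);
        return present;
    }
}
```

For the checksums quoted above, `ExpectedBytes(0x001E9C82)` returns an empty array and `ExpectedBytes(0x6C00878A)` returns the single byte 0x6C, which matches what was found in those files.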
from sharpziplib.
We could reproduce the behaviour down to version 18.5. One of our customers was able to check multiple InDesign versions, and v18.5 seems to be the first one affected. v17 should definitely work.
We have no contact with Adobe. The problem is that most PDF viewers we checked work with the files (Adobe Acrobat/Reader, PDF-XChange, SumatraPDF, browsers and so on). It could be that most of them behave identically, ignoring the checksum and interpreting the raw data.
So we do not have a strong enough argument.
The PDF specification is clear enough in saying that deflate should be used, and the deflate spec is strict about its format (checksum and so on).
It is not the first time that Adobe, as the inventor of the PDF format, has interpreted PDF files loosely rather than strictly.
from sharpziplib.