bearcove / rc-zip
ZIP format implementation in Rust, sans-io
License: Apache License 2.0
The `zip` crate has it, so, for completeness.
I was wondering why some parsers returned a Failure/Error rather than Incomplete - this is why!
Since we're doing our own buffering, especially when reading central directory headers, we want the streaming variant.
I haven't actually tested your code to see if it already supports this, but since I just wrote something about Apple's special zip format (https://iosdev.space/@tempelorg/111993220533529890), I thought I'd look for hashtags on zip64 and found this.
So here's the deal: In order to allow >4GB files compressed without using zip64, Apple simply streams the entire file while leaving the sizes in the header set to zero. And the entry in the directory contains the size mod 2^32, of course. So you cannot predict the resulting size, but Apple's unzip can still decompress the files because it blindly decompresses the stream until its end marker.
It would be nice if you supported that, too, and spread the word, as Apple has sadly not made this very public.
I'm happy to provide more info and sample files.
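The "sizes left at zero, stream until the end marker" behavior described above is signalled by general-purpose bit 3 of the local file header. A minimal sketch of that check (this is not rc-zip's actual API, just an illustration of the header layout):

```rust
/// Inspect the fixed 30-byte ZIP local file header and report whether
/// the entry's sizes must be discovered by streaming. Archivers like
/// Apple's leave the size fields at zero and set general-purpose bit 3
/// ("sizes follow in a data descriptor"), so the only way to learn the
/// real size is to decompress until the compressed stream ends.
fn sizes_deferred(header: &[u8; 30]) -> bool {
    // Local file header signature: "PK\x03\x04".
    assert_eq!(&header[0..4], b"PK\x03\x04");
    // General-purpose bit flag sits at byte offset 6, little-endian.
    let flags = u16::from_le_bytes([header[6], header[7]]);
    flags & 0x0008 != 0 // bit 3: sizes live in a trailing data descriptor
}
```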
We probably don't need the whole thing.
Just curious to see how the two libraries compare nowadays.
With the GHA backend, to speed up CI.
Was added 5 years ago by yours truly. Was it ever needed? Who knows. I'm pretty sure EOF works normally on Windows right now though
When building applications that use rc-zip with a newish Rust compiler, we get the warning:

```
warning: the following packages contain code that will be rejected by a future version of Rust: nom v5.1.2
```
The nom fix is kind of trivial; however, I'm pretty sure nom 5 is considered to be well past its end-of-life.
It's written in Rust 2015: https://lib.rs/crates/circular and miri reports UB: sozu-proxy/circular#9
README.md line 74 contains this half sentence:

> Due to the nature of the zip format, `ArchiveReader` needs its own

It's like a cliffhanger :D
#13 found a dot-dot directory traversal vulnerability in the jean example program. The mitigation there of stripping `../` from paths is incomplete, and there are still other ways to escape the target directory.
I'm looking at commit d43c169.
The mitigation for #13 treats file names as strings, and removes all instances of the substring `../` (or `..\` on Windows).
https://github.com/rust-compress/rc-zip/blob/d43c16991b30dce0e25862854b16883eb3dd80f0/samples/jean/src/main.rs#L224-L228
First, `../` should be stripped even on Windows. `/` is the zip file directory separator, regardless of the separator of the underlying filesystem (see appnote.txt 4.4.17).
Second, a file name may become a directory traversal after being transformed. For example, `..././escape.txt` becomes `../escape.txt`.
```shell
rm -f replace.zip
mkdir ...
touch .../escape.txt
zip -0 -X replace.zip ..././escape.txt
rm -r ...
cargo run -- unzip replace.zip
stat ../escape.txt
```
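A more robust approach than substring stripping is to walk the entry name component by component and reject, rather than rewrite, anything suspicious. A sketch (hypothetical helper, not jean's actual code), which also drops a leading `/` the way UnZip does:

```rust
use std::path::PathBuf;

/// Build a safe relative path from a zip entry name, which uses '/'
/// as its separator per appnote.txt 4.4.17. Rejecting ".." components
/// outright (instead of stripping the substring "../") means a name
/// like "..././escape.txt" stays the literal directory "..." and can
/// no longer collapse into a traversal.
fn sanitize_entry_name(name: &str) -> Option<PathBuf> {
    let mut out = PathBuf::new();
    for part in name.split('/') {
        match part {
            "" | "." => continue,  // empty (leading '/', "a//b") and current-dir parts
            ".." => return None,   // reject parent-dir components outright
            // Reject Windows separators and drive/UNC prefixes too.
            p if p.contains('\\') || p.contains(':') => return None,
            p => out.push(p),
        }
    }
    if out.as_os_str().is_empty() { None } else { Some(out) }
}
```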
One entry may be a directory symbolic link pointing outside the target directory, and a later entry may write through the symlink.
This is like CVE-2002-1216 in GNU Tar.
```shell
rm -f symlink.zip
ln -s ../ path
zip -0 -X --symlinks symlink.zip path
rm path
mkdir path
touch path/escape.txt
zip -0 -X -D symlink.zip path/escape.txt
rm -r path
cargo run -- unzip symlink.zip
stat ../escape.txt
```
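One way to mitigate this class of attack, beyond refusing to extract symlink entries at all, is to check that no already-extracted ancestor of a destination path is a symlink before writing through it. A sketch (hypothetical helper, not in rc-zip):

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Before writing `root/rel`, report whether any already-extracted
/// ancestor directory of the destination is a symlink, which would let
/// a later entry escape the target directory (the CVE-2002-1216 pattern).
fn has_symlink_ancestor(root: &Path, rel: &Path) -> bool {
    let mut cur: PathBuf = root.to_path_buf();
    let parts: Vec<_> = rel.iter().collect();
    // Check every component except the final file name itself.
    for comp in &parts[..parts.len().saturating_sub(1)] {
        cur.push(comp);
        match fs::symlink_metadata(&cur) {
            Ok(md) if md.file_type().is_symlink() => return true,
            Ok(_) => {}
            Err(_) => return false, // not created yet: nothing to traverse through
        }
    }
    false
}
```

This only closes the write-through-symlink hole; a full fix would also handle the final component being an existing symlink.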
File names starting with `/` are treated as absolute and can write outside the target directory. On Windows, prefixes like `C:\` and UNC paths like `\\ComputerName\` may also work.
This is like CVE-2001-1269 in Info-ZIP UnZip.
```python
import zipfile

with zipfile.ZipFile("absolute.zip", mode="w") as z:
    z.writestr("/tmp/escape.txt", "")
```

```shell
python3 absolute.py
cargo run -- unzip absolute.zip
stat /tmp/escape.txt
```
UnZip strips the absolute prefix in this case:

```
Archive:  absolute.zip
warning:  stripped absolute path spec from /tmp/escape.txt
 extracting: tmp/escape.txt
```
```
$ unzip Seal.zip
Archive:  Seal.zip
warning [Seal.zip]:  1292 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  inflating: Doc_0/Document.xml
  inflating: Doc_0/PublicRes_0.xml
  inflating: Doc_0/Pages/Page_0/Content.xml
  inflating: OFD.xml

$ ./target/debug/jean unzip Seal.zip
The application panicked (crashed).
Message:  called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: Format(InvalidLocalHeader) }
Location: src/libcore/result.rs:1165
```
Creating an `EntryReader` requires passing a function that takes an offset and returns an instance of `Read`: https://github.com/rust-compress/rc-zip/blob/69b884f85085c5c99a846703f83ddc48e0d086ac/src/format/archive.rs#L218
This happens because a `StoredEntry` has no access to the original data source.
While the internal state machine should be able to work without owning the IO source, the public API is nicer if it wraps it. But that would also be dependent on the IO source. Some would support seeking, others would not.
So, ideally, the public API would allow writing this:
```rust
let file = File::open(matches.value_of("file").unwrap())?;
let reader = file.read_zip()?;
info(&reader);
for entry in reader.entries() {
    println!("Extracting {}", entry.name());
    let mut contents = Vec::<u8>::new();
    entry.read_to_end(&mut contents)?;
}
```
But it would also support an IO source that would do HTTP range requests to get the entries.
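The shape being wished for here is an extension trait over the IO source. A compilable sketch of that shape only (all names hypothetical; the body is a stub, not a real parser):

```rust
use std::io;

// Hypothetical: a parsed archive handle with its entry list.
struct Archive {
    entry_names: Vec<String>,
}

impl Archive {
    fn entries(&self) -> impl Iterator<Item = &str> {
        self.entry_names.iter().map(String::as_str)
    }
}

// The extension trait: any IO source can learn to `read_zip()`.
trait ReadZip {
    fn read_zip(&self) -> io::Result<Archive>;
}

// Implementable for a byte slice here, but equally for std::fs::File
// (seekable reads) or for a type issuing HTTP range requests.
impl ReadZip for &[u8] {
    fn read_zip(&self) -> io::Result<Archive> {
        // Real code would locate and parse the central directory in `self`.
        Ok(Archive { entry_names: vec!["hello.txt".to_string()] })
    }
}
```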
In lapin, we had the async crate that exposed the state machine. Its use was low-level, just supporting buffers in and out:
https://github.com/sozu-proxy/lapin/blob/0.16.0/async/examples/connection.rs
But the futures-based API wrapped the state machine and the IO source (which could be a `TcpStream`, or an SSL stream defined with openssl or rustls) inside a tokio transport:
https://github.com/sozu-proxy/lapin/blob/0.16.0/futures/src/transport.rs#L45-L137
https://crates.io/crates/no-panic
Codebase has a bunch of expects/unwraps for now, until we figure out all the error handling.
Right now we gratuitously call `.cursor_at` on every iteration of the wants_read/process loop, but what if the underlying resource is an HTTP endpoint? Then we wouldn't want to re-establish connections on every buffer read.
libflate is just a wrapper.
It was removed in previous versions, but I think we could re-add a version based only on tokio I/O traits.
This wouldn't accommodate things like io-uring, but there would probably be commonalities between the sync & async interfaces, and it might clean up the overall package.
The latest `cargo-llvm-cov` release has support for Rust nightly's experimental branch coverage, which just dropped:
There's an experimental flag to ignore unused generics/unused functions:
https://release-plz.ieni.dev/docs/github/quickstart
To make it easier to release this.
I looked at your work. It seems more convenient to me than the zip crate (work on which has stopped). But I have a few comments.
The StoredEntry structure does not implement the Clone/Copy traits. If it did, it would be possible to save the received StoredEntry somewhere and return to work with the archive later.
There are no examples.
Missing description of the get_reader parameter in the declaration:

```rust
pub fn reader<'a, F, R>(&'a self, get_reader: F) -> EntryReader<'a, R>
where
    R: Read,
    F: Fn(u64) -> R,
```

I think this function gets an offset for the Read, changes position, and returns the Read. But it's not so simple.
3.1. get_reader returns an object, i.e. transfers ownership somewhere inside the library. Are you saying I have to clone my Read object before giving it away? std::fs::File does not implement Clone.
3.2. Let's look at another example. It almost works, but not quite:

```rust
fn open1(archname: &dyn AsRef<Path>) {
    let mut f = File::open(archname).unwrap();
    let arch = parse_zip(&mut f);
    let entry = &arch.entries()[1];
    let rdr = entry.reader(|offset| {
        f.seek(std::io::SeekFrom::Start(offset)).unwrap();
        f
    });
}
```
Here f cannot be used, because the closure type is Fn, although it would have to be FnMut. `move |offset| { ... }` also does not work.
Adding fuzz targets for all the functions that take byte slices is an easy way to assert that they don't panic unexpectedly. Even if it's unlikely to find buffer overflows or use after free in Rust code, in my experience there are at least one or two integer overflows that a fuzzer can find :)
cf. https://rust-fuzz.github.io/book/introduction.html for how to fuzz Rust libraries
Instead of forcing errors to fit into `io::Error`, it would be easier to have a dedicated error enum that can hold an `io::Error`. As an example, see https://github.com/wyyerd/pulsar-rs/blob/master/client/src/error.rs
It can be written manually, or through a proc macro with the failure crate.
Error when compiling to `wasm32-unknown-unknown`:

```
   Compiling rc-zip v0.0.1
error[E0277]: the trait bound `std::fs::File: ReadAt` is not satisfied
  --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/rc-zip-0.0.1/src/reader/read_zip.rs:64:36
   |
64 |         ReadAt::read_zip_with_size(self, size)
   |                                    ^^^^ the trait `ReadAt` is not implemented for `std::fs::File`
   |
   = note: required for the cast to the object type `dyn ReadAt`
```
Hi,
I'm doing an "off the clock"/weekend project that I am hoping will become commercial (but it's far from it right now) and I want to use rc-zip because it works really, really well for the zip files I'm using. However, it depends on chardet, which is LGPL (probably because the original chardet is LGPL), which means I can't use it in a commercial project (given how Rust links libraries).
Is there any way to make rc-zip not use chardet? For example, using https://crates.io/crates/chardetng (not sure if it covers the right things).
To put my money where my mouth is I'm more than happy to donate 100USD for this effort via PayPal (or donate somewhere if you prefer). I know it's not much (especially to change a dependency, create a new version and upload to crates.io) but given that my project is miles away from generating any revenue at all it's all I can afford right now really :) I guess it's more a token of appreciation.
I know the right way is to submit a PR but frankly my rust skills are probably too limited to do this right.
Regards,
Niklas
I made the wrapper for an article originally, but I think I want it to catch regressions in CI too.
This was recommended by @Geal a long time ago and shocker, he was right.
Hi,
Sorry, this is a bit of a newbie question - I'm trying to follow the read-file examples but I can't seem to get them to work. I'm trying to open a zip file, loop through the entries, and pass the buffer of the file I want to quick-xml (without writing the file to disk). Opening and looping is super simple, but I fail at passing the buffer.
I tried:
```rust
let entry_reader = c
    .entry
    .sync_reader(|offset| positioned_io::Cursor::new_pos(&zipfile, offset));
```

But it says that the sync_reader method doesn't exist.
I tried the other example:
```rust
rc_zip::reader::sync::EntryReader::new(sl.entry, |offset| {
    positioned_io::Cursor::new_pos(&zipfile, offset)
})
.read_to_string(&mut target)
.unwrap();
```

But it says that the reader is a private method.
The closest I come is:
```rust
let entry_reader = c.entry.reader(|offset| positioned_io::Cursor::new_pos(&zipfile, offset));
```

Which I think gives me a proper entry_reader, but I can't seem to figure out how to turn this into a buffer so quick-xml can read it. Or figure out how to read it at all.
This is all probably explained in the crate's documentation, but I'm still at the point where the automatically generated documentation is a bit Greek to me :)
This is how I normally read a buffer in quick-xml:

```rust
let xml_file = File::open(filepath)?;
let buffer = BufReader::new(xml_file);
let mut reader = Reader::from_reader(buffer);
let mut buf = Vec::new();
```
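Since the value returned by `entry.reader(...)` implements `Read`, it can be treated exactly like the plain-file case: wrap it in a `BufReader` and hand that to quick-xml's `Reader::from_reader`, which accepts any `BufRead`. A std-only sketch, with `io::Cursor` standing in for the entry reader so it runs without the crates involved:

```rust
use std::io::{BufReader, Cursor, Read};

/// Buffer any Read and drain it into a String, the same way the
/// BufReader in the plain-file example above feeds quick-xml.
fn read_entry_to_string(entry_reader: impl Read) -> String {
    let mut buffered = BufReader::new(entry_reader);
    let mut xml = String::new();
    buffered.read_to_string(&mut xml).unwrap();
    xml
}
```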
Any help will be greatly appreciated. Also, to put my money where my mouth is I will send anyone who helps me a beer tip via PayPal
The 128K buffer is probably okay, although, overkill? But more importantly, especially for evil tests that do 1-byte reads, let's have it read based on its internal buffer size rather than the passed buffer.
This needs more logic to return data from the internal buffer, rather than spawn a blocking task on subsequent reads, as long as we have data in the internal buffer. It turns it into a sort of `BufReader`, but... that's a good place to do it.
Secondly, the `Box::pin` is probably wholly unnecessary; it could just be a `JoinHandle`, I think.
(with optional validation at the end).
The use case for this is streaming decompression given only a `Read`.
```
~/bearcove/rc-zip-samples
❯ jean file nystatehealthcosts2016-2021.zip
Version made by: {MsDos v4.5}, required: {MsDos v4.5}
Encoding: utf-8, Methods: {Unsupported(9)}
4.00 GiB (128.77% compression) (1 files, 0 dirs, 0 symlinks)
```
We don't support Method 9 right now but according to 7-zip it's Deflate64:
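For reference, these are the compression method IDs from APPNOTE.TXT section 4.4.5 that come up in these issues, as a minimal lookup sketch (not rc-zip's own table):

```rust
/// Compression method IDs per APPNOTE.TXT section 4.4.5.
/// Method 9 is the Deflate64 mentioned above.
fn method_name(id: u16) -> &'static str {
    match id {
        0 => "store",
        8 => "deflate",
        9 => "deflate64",
        12 => "bzip2",
        14 => "lzma",
        93 => "zstandard",
        95 => "xz",
        98 => "ppmd",
        _ => "unsupported",
    }
}
```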
This is needed for ZStandard support, to avoid unnecessary buffering.
Also rename it 'ZipSlice' or something?
rc-zip-sync is probably affected too.
Hi, this is a question and not really an issue. I'm looking at using rc-zip with in an async-await environment. However, ArchiveReader::read() looks to be the only way to feed the reader file data and it will block during the read. Is there an alternate way to feed it data that I may have read using async read directly? Is an async version of read() necessary?
Thanks for clarifying.
```
~/bearcove/rc-zip-samples
❯ jean file nystatehealthcosts2016-2021.zip
Version made by: {MsDos v4.5}, required: {MsDos v4.5}
Encoding: utf-8, Methods: {Unsupported(9)}
4.00 GiB (128.77% compression) (1 files, 0 dirs, 0 symlinks)
```
The sample file is publicly available data but to avoid potential S3 costs for the people who own the bucket, I won't link to it here.
I see it mentioned nowhere, and I don't see any tests/examples/docs about payload? I was looking for a good 10 minutes...
Maybe it could be documented, if it isn't already.
Hi!
As far as I can see, the library is not maintained anymore. @fasterthanlime, do you still maintain this library? The current state of the library is a little bit unclear.
Thanks in advance!