snzip's Introduction

Snzip, a compression/decompression tool based on snappy.

What is snzip?

Snzip is a command-line tool that uses snappy. It supports several file formats: framing-format, old framing-format, hadoop-snappy format, raw format, and three obsolete formats that were used by snzip, snappy-java and snappy-in-java before the official framing-format was defined. The default format is framing-format.

Notable Changes

The default format was changed to framing-format in 1.0.0. Pass --with-default-format=snzip to configure to keep the obsolete snzip format as the default, as before.

Installation

Install from a tar-ball

Download snzip-1.0.5.tar.gz from https://github.com/kubo/snzip/releases, uncompress and untar it, and run configure.

tar xvfz snzip-1.0.5.tar.gz
cd snzip-1.0.5
./configure
make
make install

If snappy is not installed under /usr or /usr/local, specify its location with --with-snappy as follows.

# install snzip
tar xvfz snzip-1.0.5.tar.gz
cd snzip-1.0.5
./configure --with-snappy=/xxx/yyy/
make
make install

When both dynamic and static snappy libraries are available, the dynamic one is used by default, and the compiled snzip depends on libsnappy.so. When --with-static-snappy is passed to configure, the static one is used, and the compiled snzip includes the snappy library.

Note: --with-static-snappy isn't available on some platforms.

You can use --with-default-format to change the default compression format.

./configure --with-default-format=snzip

Install as an RPM package

We don't provide RPM packages. Download snzip-1.0.5.tar.gz from https://github.com/kubo/snzip/releases, create an RPM package as follows, and install it.

# The rpm package will be created under $HOME/rpmbuild/RPMS.
rpmbuild -tb snzip-1.0.5.tar.gz 

Install from the latest source

To build from the source code in the GitHub repository:

git clone https://github.com/kubo/snzip.git
cd snzip
./autogen.sh
./configure
make
make install

Install a Windows package

Download snzip-1.0.5-win32.zip or snzip-1.0.5-win64.zip from https://github.com/kubo/snzip/releases and copy snzip.exe and snunzip.exe to a directory in the PATH environment variable.

Usage

To compress file.tar:

snzip file.tar

The compressed file is named file.tar.sz, and the original file is deleted. File attributes such as timestamp, mode and permissions are preserved as far as possible.

The compressed file's format is framing-format. Add the option -t snappy-java or -t snappy-in-java to use one of those formats instead.

snzip -t snappy-java file.tar

or

snzip -t snappy-in-java file.tar

To compress file.tar and write to standard output:

snzip -c file.tar > file.tar.sz

or

cat file.tar | snzip > file.tar.sz

Add the option -t [format-name] to use a format other than framing-format.

To create a new tar file and compress it:

tar cf - files-to-be-archived | snzip > archive.tar.sz

To uncompress file.tar.sz:

snzip -d file.tar.sz

or

snunzip file.tar.sz

The uncompressed file is named file.tar, and the original file is deleted. File attributes such as timestamp, mode and permissions are preserved as far as possible.

If the program name includes un, as in snunzip, it acts as if -d were set.

The file format is determined automatically from the file header. However, detection doesn't work for some file formats, such as raw and Apple iWork .iwa.

To uncompress file.tar.sz and write to standard output:

snzip -dc file.tar.sz > file.tar
snunzip -c file.tar.sz > file.tar
snzcat file.tar.sz > file.tar
cat file.tar.sz | snzcat > file.tar

If the program name includes cat, as in snzcat, it acts as if -dc were set.

To uncompress a tar file and extract it:

snzip -dc archive.tar.sz | tar xf -

Raw format

Raw format is the native format of snappy. Unlike the other formats, it has a few limitations: (1) The total data length must be known before compression starts. (2) Automatic file format detection doesn't work on uncompression. (3) Raw format support is enabled only when snzip is compiled against snappy 1.1.3 or later.

To compress file.tar as raw format:

snzip -t raw file.tar

or

snzip -t raw < file.tar > file.tar.raw

In these examples, snzip has a file descriptor that directly opens file.tar, so it can get the length of the data to be compressed. However, the following command doesn't work.

cat file.tar | snzip -t raw > file.tar.raw

It uses a pipe, so snzip cannot get the total length before compression. In this case the total length must be specified with the -s option.

cat file.tar | snzip -t raw -s "size of file.tar" > file.tar.raw

To uncompress file.tar.sz compressed as raw format:

snzip -t raw -d file.tar.sz

or

snunzip -t raw file.tar.sz

You need to set the -t raw option to tell snzip the format of the file to be uncompressed.

Hadoop-snappy format

Hadoop-snappy format is one of the compression formats used in Hadoop. It uses its own framing format as follows:

  • A compressed file consists of one or more blocks.
  • A block consists of uncompressed length (big endian 4 byte integer) and one or more subblocks.
  • A subblock consists of compressed length (big endian 4 byte integer) and raw compressed data.
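The layout above can be walked without a snappy library. The following is a minimal Python sketch, not snzip's own code. The function names are hypothetical. It also assumes that each subblock's raw snappy data begins with its uncompressed length as a base-128 varint (as the raw snappy format does), and uses that to decide when a block is complete.

```python
import struct


def read_varint32(buf, pos):
    """Decode a little-endian base-128 varint; return (value, next_pos)."""
    value = 0
    for shift in range(0, 35, 7):
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
    raise ValueError("varint longer than 5 bytes")


def iter_hadoop_snappy_blocks(data):
    """Yield (uncompressed_length, [raw snappy subblocks]) per block.

    Hypothetical helper: it only splits the container and does not
    decompress anything.
    """
    pos = 0
    while pos < len(data):
        # Block header: uncompressed length, big-endian 4-byte integer.
        (block_len,) = struct.unpack_from(">I", data, pos)
        pos += 4
        remaining = block_len
        subblocks = []
        while remaining > 0:
            # Subblock header: compressed length, big-endian 4-byte integer.
            (comp_len,) = struct.unpack_from(">I", data, pos)
            pos += 4
            chunk = data[pos:pos + comp_len]
            if comp_len == 0 or len(chunk) != comp_len:
                raise ValueError("empty or truncated subblock")
            pos += comp_len
            # Assumption: raw snappy data starts with its uncompressed length.
            raw_len, _ = read_varint32(chunk, 0)
            if raw_len == 0 or raw_len > remaining:
                raise ValueError("subblock length inconsistent with block")
            remaining -= raw_len
            subblocks.append(chunk)
        yield block_len, subblocks
```

Each yielded subblock could then be handed to any raw snappy decompressor; that step is left out so the sketch has no third-party dependency.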

To compress a file as hadoop-snappy format:

snzip -t hadoop-snappy file_name

The default block size used by snzip for hadoop-snappy format is 256k, the same as the default value of the io.compression.codec.snappy.buffersize parameter. If the block size used by snzip is larger than that parameter, Hadoop fails with an InternalError "Could not decompress data. Buffer length is too small" while reading a file compressed by snzip. If you get this error, change the block size with the -b option as follows.

# if io.compression.codec.snappy.buffersize is 32768
snzip -t hadoop-snappy -b 32768 file_name_to_be_compressed

To uncompress a file compressed as hadoop-snappy format:

snzip -d compressed_file.snappy

The file format is guessed from the first 8 bytes of the file.

Apple iWork .iwa format

Apple iWork .iwa format is a file format used by Apple iWork. The format was demystified here. Basically the .iwa format consists of a Protobuf stream compressed by Snappy.

Snzip uncompresses .iwa files to Protobuf streams and compresses Protobuf streams to .iwa files. You need to set -t iwa on both compression and uncompression to specify the file format.

SNZ File format

Note: This is an obsolete format. The default format was changed to framing-format.

The first three bytes are magic characters 'SNZ'.

The fourth byte is the file format version. It is 0x01.

The fifth byte is the base-2 logarithm of the block size. The input data is divided into fixed-length blocks, and each block is compressed by snappy. When the byte is 16 (the default), the block size is 2 to the 16th power, that is, 64 kilobytes.

The rest is a sequence of pairs, each a compressed data length followed by a compressed data block. The compressed data length is encoded in the same way as snappy::Varint::Encode32() encodes it. A length of zero marks the end of data.

Anything after the end of data is ignored for now, but it may in future be read as the next compressed file, as gzip does.

Note that the uncompressed length of each compressed data block must be less than or equal to the block size specified by the fifth byte.
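As an illustration of the SNZ layout described above, here is a minimal Python sketch that splits an SNZ file into its header fields and compressed blocks. It is not taken from snzip: the names parse_snz and read_varint32 are hypothetical, and the sketch assumes that snappy::Varint::Encode32() produces the usual little-endian base-128 varint.

```python
def read_varint32(buf, pos):
    """Decode a little-endian base-128 varint; return (value, next_pos)."""
    value = 0
    for shift in range(0, 35, 7):
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
    raise ValueError("varint longer than 5 bytes")


def parse_snz(data):
    """Return (version, block_size, [compressed blocks]) for SNZ data.

    Hypothetical helper: blocks are returned still compressed.
    """
    if data[:3] != b"SNZ":
        raise ValueError("missing SNZ magic")
    if len(data) < 5:
        raise ValueError("truncated header")
    version = data[3]
    if version != 0x01:
        raise ValueError("unsupported version 0x%02x" % version)
    # Fifth byte: power of two of the block size (16 means 65536 bytes).
    block_size = 1 << data[4]
    pos = 5
    blocks = []
    while True:
        comp_len, pos = read_varint32(data, pos)
        if comp_len == 0:
            break  # a zero length marks the end of data
        block = data[pos:pos + comp_len]
        if len(block) != comp_len:
            raise ValueError("truncated block")
        blocks.append(block)
        pos += comp_len
    # Anything after the end marker is ignored, as the text describes.
    return version, block_size, blocks
```

A reader that decompresses the blocks would additionally check that each block's uncompressed length does not exceed block_size.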

License

2-clause BSD-style license.

snzip's People

Contributors

kubo, masahif2, xianglei, zavyrylin

snzip's Issues

snappy-hadoop - attempt to uncompress gives `already has snappy suffix`

I would like some guidance, please.
I created test.txt and ran:
snzip -t hadoop-snappy test.txt
It compressed fine, creating test.txt.snappy.
I then attempted to uncompress it by running
snzip test.txt.snappy
and it gave me this error:
test.txt.snappy already has snappy suffix
Any advice would be appreciated.

Thank you

Support for concatenated snappy-in-java files

When dealing with some legacy format files, I noticed that snzip fails to read snappy-in-java format files that are concatenated together. The issue is that when it encounters the second file, it reads the 's' (0x73) from the header and aborts since it's not a recognized ID.

The simple workaround is to skip the next 6 bytes (nappy\0), similar to how the framing2 format implicitly skips the header (it reads 0xff 0x06 0x00 0x00 as a chunk of length 6, then skips those 6 bytes (sNaPpY) with the fseek).

Before I send a real PR I wanted to get some feedback. My quick and dirty workaround does not validate that the second header is actually a valid snappy header. However, framing2 doesn't do this either (it relies on the implicit skipping defined by the header format itself).

Creating test file:

$ echo 'hello' | ./snzip -t snappy-in-java > one.snappy
$ echo 'world' | ./snzip -t snappy-in-java > two.snappy
$ cat one.snappy two.snappy > three.snappy

Original version:

$ ./snzip -d -c three.snappy
hello
Unknown compressed flag 0x73

Patched:

$ ./snzip -d -c three.snappy
hello
world

Thoughts/preferences on patch approach?

Hacky version diff:

diff --git a/snappy-in-java-format.c b/snappy-in-java-format.c
index 0f95e1a..2b2579a 100644
--- a/snappy-in-java-format.c
+++ b/snappy-in-java-format.c
@@ -195,6 +195,16 @@ static int snappy_in_java_uncompress(FILE *infp, FILE *outfp, int skip_magic)
     case UNCOMPRESSED_FLAG:
       /* pass */
       break;
+       case 's':
+         /* s== 0x73 Possible concatenated block.
+          * Note that other framing formats like frame2 see 0xff and just skip
+          * the rest of the header due to the header being: 0xff 0x06 0x00 0x00 snappy
+          * (it reads the 3-byte chunk header length resulting in a block length of
+          * 6 bytes, and skips 6 bytes which happens to be == snappy)
+          */
+         /* Likely concatenated snappy file.  We read first byte, skip rest */
+         fseek(infp, SNAPPY_IN_JAVA_MAGIC_LEN - 1, SEEK_CUR); /* TODO strict check? */
+         continue;
     default:
       print_error("Unknown compressed flag 0x%02x\n", compressed_flag);
       goto cleanup;

Memory allocate failure in work_buffer_resize

When snzip tries to read a malformed archive, it fails to allocate memory.
Output:

==12351==WARNING: AddressSanitizer failed to allocate 0xffffffffc8617364 bytes
==12351==AddressSanitizer's allocator is terminating the process instead of returning 0
==12351==If you don't like this behavior set allocator_may_return_null=1
==12351==AddressSanitizer CHECK failed: /var/tmp/portage/sys-devel/llvm-3.8.1-r2/work/llvm-3.8.1.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_allocator.cc:147 "((0)) != (0)" (0x0, 0x0)
    #0 0x4ca7ed in AsanCheckFailed /var/tmp/portage/sys-devel/llvm-3.8.1-r2/work/llvm-3.8.1.src/projects/compiler-rt/lib/asan/asan_rtl.cc:67                                                                                                                                   
    #1 0x4d1323 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /var/tmp/portage/sys-devel/llvm-3.8.1-r2/work/llvm-3.8.1.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_common.cc:159                              
    #2 0x4cf076 in __sanitizer::ReportAllocatorCannotReturnNull() /var/tmp/portage/sys-devel/llvm-3.8.1-r2/work/llvm-3.8.1.src/projects/compiler-rt/lib/sanitizer_common/sanitizer_allocator.cc:147                                                                            
    #3 0x424896 in __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<105553116266496ul, 4398046511104ul, 0ul, __sanitizer::SizeClassMap<17ul, 128ul, 16ul>, __asan::AsanMapUnmapCallback>, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator64<105553116266496ul, 4398046511104ul, 0ul, __sanitizer::SizeClassMap<17ul, 128ul, 16ul>, __asan::AsanMapUnmapCallback> >, __sanitizer::LargeMmapAllocator<__asan::AsanMapUnmapCallback> >::ReturnNullOrDie() /var/tmp/portage/sys-devel/llvm-3.8.1-r2/work/llvm-3.8.1.src/projects/compiler-rt/lib/asan/../sanitizer_common/sanitizer_allocator.h:1317                                                                                                                                                                                                   
    #4 0x424896 in __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) /var/tmp/portage/sys-devel/llvm-3.8.1-r2/work/llvm-3.8.1.src/projects/compiler-rt/lib/asan/asan_allocator.cc:359                       
    #5 0x4205bd in __asan::Allocator::Reallocate(void*, unsigned long, __sanitizer::BufferedStackTrace*) /var/tmp/portage/sys-devel/llvm-3.8.1-r2/work/llvm-3.8.1.src/projects/compiler-rt/lib/asan/asan_allocator.cc:539                                                      
    #6 0x4205bd in __asan::asan_realloc(void*, unsigned long, __sanitizer::BufferedStackTrace*) /var/tmp/portage/sys-devel/llvm-3.8.1-r2/work/llvm-3.8.1.src/projects/compiler-rt/lib/asan/asan_allocator.cc:732                                                               
    #7 0x4c1231 in realloc /var/tmp/portage/sys-devel/llvm-3.8.1-r2/work/llvm-3.8.1.src/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:79                                                                                                                                  
    #8 0x4fe72c in work_buffer_resize /tmp/portage/app-arch/snzip-1.0.3/work/snzip-1.0.3/snzip.c:584:13                                                                                                                                                                        
    #9 0x51667b in snappy_java_uncompress /tmp/portage/app-arch/snzip-1.0.3/work/snzip-1.0.3/snappy-java-format.c:193:7                                                                                                                                                        
    #10 0x4f68ea in main /tmp/portage/app-arch/snzip-1.0.3/work/snzip-1.0.3/snzip.c:401:11                                                                                                                                                                                     
    #11 0x7fcbabbd261f in __libc_start_main /var/tmp/portage/sys-libs/glibc-2.22-r4/work/glibc-2.22/csu/libc-start.c:289                                                                                                                                                       
    #12 0x419988 in _init (/usr/bin/snzip+0x419988)

Attaching the testcase which causes the failure:
10.crashes.zip

Some supplements

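I ran into some errors while installing according to the README. After looking things up and verifying them, I am leaving the correct installation procedure here:

snzip.tar.gz
First, here is the package I compiled. It can be placed in any directory and used as is; the dependencies are bundled with it.
tar -zxvf snzip.tar.gz
cd /{Your path}/snzip/bin/
. ./precommand.sh
After that, snzip is ready to use.

1. Download the snappy and snzip packages
git clone https://github.com/kubo/snzip.git
wget https://github.com/google/snappy/releases/download/1.1.3/snappy-1.1.3.tar.gz

2. Prepare the environment
yum install -y gcc gcc-c++ autoconf automake libtool cmake openssl-devel
3. Install snappy
tar -zxvf snappy-1.1.3.tar.gz
cd snappy-1.1.3
./configure --prefix=/root/software/snappy
make
make install
4. Install snzip
./autogen.sh

The following error occurs:
(screenshot: 1528371205 1)

autoconf needs to be upgraded.
Check the current autoconf version:

rpm -qf /usr/bin/autoconf

Uninstall the current autoconf, then download and install a newer version:

rpm -e --nodeps autoconf-2.63
wget ftp://ftp.gnu.org/gnu/autoconf/autoconf-2.69.tar.gz
tar -zxvf autoconf-2.69.tar.gz
cd autoconf-2.69
./configure --prefix=/usr/
make && make install
Check the version:
/usr/bin/autoconf -V

Continue installing snzip:

./autogen.sh
./configure --with-snappy=/root/software/snappy --prefix=/root/software/snzip
make && make install
Installation is complete.

At this point, under the software directory:

Testing shows that snzip depends only on the lib directory under snappy.

When we copy the software directory to another machine, we need to run

export LD_LIBRARY_PATH=/{Your path}/snappy/lib:$LD_LIBRARY_PATH
Otherwise the following error is reported:
error while loading shared libraries: libsnappy.so.1: cannot open shared object file: No such file or directory

Here is the package I compiled myself. After extracting it, just run . ./precommand.sh under bin and snzip is ready to use, with no further installation.
snzip.tar.gz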

raw format

what about a raw format option?
maybe with a maximum size limit

Hadoop is unable to decompress

Hello!
So, with snzip -t hadoop-snappy <file_to_compress> I can compress, and decompress with snzip -d <snappy_file>, just fine.
I moved the file to a Hadoop cluster and ran:
hadoop fs -text <snappy_file>
and got the following error. I'm not sure where to go from here and would appreciate your advice.

17/01/18 14:17:47 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:123)
	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
	at java.io.InputStream.read(InputStream.java:101)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
	at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:106)
	at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:101)
	at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
	at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
	at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
	at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
	at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:118)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:372)

I was able to run hadoop fs -text <much-bigger-snappy> on a much bigger file with no problem, so the memory error is misleading. Please let me know if there is anything I can provide.

tar usage

Could you add

tar -I snzip -cf archive.tar.sz files-to-be-archived

Question about matches in snzip tool

Hi,
Do you support window size for match offset > 64k when packet is greater?
what are the parameters I should insert to do that
I run snzip tool version 1.0.4
modes I run: framing2 and framing
Thanks;

I wonder how snzip handles IO.

We are compressing the data using snzip in a specific time zone. To explain more, we are compressing /dev/sdb's data into /dev/sdc.

If you look at the picture below, you can see the Read IO(r/s) and Write Sector Size (wKB/s) indicators of the device in /dev/sdb.
By the way, while /dev/sdb's data is being compressed to /dev/sdc, there is almost no disk I/O for /dev/sdc. If you are compressing data to /dev/sdb -> /dev/sdc, shouldn't w/s of /dev/sdc increase?

(Screenshot 2024-06-13 9:56:58 AM)

Periodically some Write IO occurs in /dev/sdc.

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdc              0.00   14.00      0.00   6812.00     0.00     0.00   0.00   0.00    0.00    9.07   0.13     0.00   486.57   1.93   2.70
sdc1             0.00   14.00      0.00   6812.00     0.00     0.00   0.00   0.00    0.00    9.07   0.13     0.00   486.57   1.93   2.70

My question is: if you are compressing to /dev/sdb -> /dev/sdc, /dev/sdb should be continuously outputting Read IO, /dev/sdc should be continuously outputting Write IO, but /dev/sdc should not be doing periodic Write IO. Why is that?

Building snzip library

I want to use 'snzip' not only as a command-line utility but also as a C library.

Are there any configure options to build a library now?

Fails to compile on OSX

`crc32.c:47:20: error: endian.h: No such file or directory`

I don't think endian.h is portable, but not sure.

Tar usage

Thank you for this project!

Could you add some examples?

tar -I /usr/local/bin/snzip -cvf snappy.tar.sz folder-to-tar/
tar --use-compress-program /usr/local/bin/snzip -xvf snappy.tar.sz
# bashrc
alias sntar='tar -I /usr/local/bin/snzip'

Unknown file format name raw

Hi!

I'm working on an Apple M2. I have snappy 1.1.10 and snzip 1.0.5 installed via Homebrew.
When I try to run a command using raw format I get an error. For example:

λ ~ snzip -t raw
Unknown file format name raw

The README says "The raw format support is enabled only when snzip is compiled for snappy 1.1.3 or upper.", so I tried to compile from source, but it still doesn't work.

λ ~/Downloads/snzip-1.0.5 ./configure --with-snappy=/opt/homebrew/Cellar/snappy/1.1.10
...
checking snappy::Uncompress(snappy::Source*, snappy::Sink*)... no
configure: WARNING: raw format is not supported with this snappy version.
...

Any ideas?

Streaming writes in framing2-format not supported

Hi,

Currently snzip lacks the ability to decompress files written with snappy.NewBufferedWriter() streaming writer. Upon reading the header of the next snappy frame it throws an error:

framing2-format.c:227: Unsupported identifier 0x73.

Other tools like python-snappy have been verified to work with this format correctly.
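For reference, here is a minimal Python sketch of how a reader can walk framing-format chunks and tolerate a repeated stream identifier. It is not snzip's code and says nothing about the cause of the error above. It assumes the chunk layout of the snappy framing format description as I understand it: a 1-byte type, a 3-byte little-endian length, and a stream identifier chunk of type 0xff containing sNaPpY that may reappear mid-stream. The function names are hypothetical.

```python
STREAM_ID = b"\xff\x06\x00\x00sNaPpY"


def iter_chunks(data):
    """Yield (chunk_type, body) for each chunk in a framing-format stream."""
    pos = 0
    while pos < len(data):
        if len(data) - pos < 4:
            raise ValueError("truncated chunk header")
        chunk_type = data[pos]
        # Chunk length: 3-byte little-endian integer after the type byte.
        length = int.from_bytes(data[pos + 1:pos + 4], "little")
        pos += 4
        body = data[pos:pos + length]
        if len(body) != length:
            raise ValueError("truncated chunk body")
        pos += length
        yield chunk_type, body


def classify(chunk_type, body):
    """Name a chunk; a repeated stream identifier is accepted, not an error."""
    if chunk_type == 0xFF:
        if body != b"sNaPpY":
            raise ValueError("bad stream identifier")
        return "stream identifier"
    if chunk_type == 0x00:
        return "compressed"
    if chunk_type == 0x01:
        return "uncompressed"
    if chunk_type == 0xFE:
        return "padding"
    if 0x80 <= chunk_type <= 0xFD:
        return "reserved skippable"
    raise ValueError("reserved unskippable chunk 0x%02x" % chunk_type)
```

With this structure, concatenated streams (or a writer that emits the identifier again) parse as an ordinary sequence of chunks, and a decoder simply ignores every stream identifier after validating it.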

support official framing format

This snzip utility has been fantastic (I've been using it for a while), but I would love to see it implement the standard framing format agreed on in issue #34 of the snappy project.

Since that format also describes using the ".sz" instead of ".snz" and it uses a different file header, it should be possible to keep backwards compatibility for those of us that are already using snzip.

Invalid CRC for framing modes

snzip/crc32.h

Line 15 in 809c6f2

unsigned int crc = ~calculate_crc32c(~0, (const unsigned char *)buf, len);

It would appear that while calculating the masked CRC in the function above, you perform a bitwise NOT operator on the result before masking. I'm unsure what formats require this, if any (possibly snappy-in-java), but this creates a mismatched CRC for most other public implementations of framing2. Specifically:

I was hesitant to make a PR until I verified which formats might require it, but it should be an easy fix (I'd be happy to do it). Possibly two different functions, one which inverts and one which does not.
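To make the discussion concrete, here is a small Python sketch of a standard CRC-32C followed by the masking step that the snappy framing format describes (rotate right by 15 bits, then add 0xa282ead8). It is an independent illustration, not the code in crc32.h. The split into crc32c_update (raw register update, no inversions) and crc32c (initial and final inversion) is my own, and whether the quoted C line matches it depends on what calculate_crc32c does internally, which the excerpt does not show.

```python
def _make_table():
    """Build the byte-wise table for the reflected CRC-32C polynomial."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c_update(state, data):
    """Raw register update: no initial or final inversion is applied."""
    for byte in data:
        state = _TABLE[(state ^ byte) & 0xFF] ^ (state >> 8)
    return state


def crc32c(data):
    """Standard CRC-32C: start from all ones, invert the result once."""
    return crc32c_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF


def mask_crc(crc):
    """Mask a CRC: rotate right by 15 bits, then add a constant (mod 2**32)."""
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def unmask_crc(masked):
    """Inverse of mask_crc."""
    rot = (masked - 0xA282EAD8) & 0xFFFFFFFF
    return ((rot << 15) | (rot >> 17)) & 0xFFFFFFFF
```

The well-known check value for CRC-32C is 0xE3069283 for the ASCII input 123456789. An implementation that inverts the result twice, or not at all, would produce a different value and therefore a different masked checksum.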
