
genozip's Introduction


Genozip



Genozip is a lossless compressor for FASTQ, BAM/CRAM, VCF and many other genomic files - see https://genozip.com

Genozip is also available via Conda and as binary downloads - see installation options.

Building from source: run make -j (build requirements: gcc 8.5 or above, and nasm).

New: Genozip 15 - with Deep™ - losslessly co-compressing BAM and FASTQ files:

v15 deep benchmark

Genozip is a commercial product, but we make it free for certain academic research use. See eligibility and other licensing options, or contact [email protected]

IMPORTANT: Genozip is a commercial product, NOT AN OPEN SOURCE product - we provide our source code to assure users that they will always have access to the code needed to decompress their files. HOWEVER, reverse engineering, code modifications, derivative works or inclusion of the code or parts thereof into other software packages is strictly forbidden by the license.

Attributions for 3rd party source components: attributions.

THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS, COPYRIGHT HOLDERS OR DISTRIBUTORS OF THIS SOFTWARE BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

genozip's People

Contributors

awensaunders, divonlan, gavrie, yassines


genozip's Issues

Feature suggestion: Interleaved output for FASTQ

Many aligners (e.g. BWA) can accept paired-end reads on standard input if they are "interleaved" - that is, R1, then R2, then R1, then R2, etc.

Currently, I think that to run something like BWA on genozip-compressed FASTQ, I'd need to genounzip both of the read FASTQs to files, then run BWA on them.

However, if genocat supported an interleaved FASTQ output mode, it would be possible to stream directly from genozip-compressed FASTQ into BWA.

As a bonus, it would be even more useful if you could support a sharding factor that causes genocat to output only every 1-in-N read pairs (this could be useful for BAM or VCF as well). That would allow us to run N copies of BWA from the same compressed FASTQ, and then merge the BAMs afterward.

NB: I tried running genozip in paired mode and then using genocat on the result, but it didn't output the reads in interleaved mode.

Any chance of implementing interleaved mode? It would mean we could run BWA directly from genozipped FASTQ for paired-end reads!
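For reference, the interleaving requested here is just alternating 4-line records from the two mates. A minimal Python sketch (not genozip code, purely to illustrate what an interleaved stream would contain):

```python
def read_fastq_records(lines):
    """Yield 4-line FASTQ records (header, seq, plus, qual) from an iterable of lines."""
    it = iter(lines)
    while True:
        rec = [next(it, None) for _ in range(4)]
        if rec[0] is None:
            return
        yield rec

def interleave(r1_lines, r2_lines):
    """Emit records alternately: R1[0], R2[0], R1[1], R2[1], ..."""
    for rec1, rec2 in zip(read_fastq_records(r1_lines), read_fastq_records(r2_lines)):
        yield from rec1
        yield from rec2
```

An aligner consuming this stream sees each pair back-to-back, which is exactly what BWA's interleaved ("smart pairing") input expects.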

segfault genozip-12.0.8

Hi @divonlan, I'm getting segmentation faults after installing genozip-12.0.8 (from both conda and github). They occur after installation on the first invocation with a file (i.e. before registration - without a file, the usual 'see manual... etc.' message is output):

This seems specific to 12.0.8 - at least, after rolling back to 12.0.5 I had no issues.

Compression of fastq.gz

Hello,

I compressed a pair of fastq.gz files using
genozip ./test_fastq/test2.R1.fastq.gz ./test_fastq/test2.R2.fastq.gz --pair --reference GRCh37.ref.genozip -o ./fastq_compress/test.genozip
However, when I uncompress test.genozip, the output comes out in plain fastq format even though I used the '-z' option:
genounzip test.genozip --reference /public/user/zj2020/GRch37/GRCh37.ref.genozip

Is there any problem? Thanks

Running `genozip` in parallel

I've been testing genozip on a group of paired RNA-seq gzipped FASTQ files and it's working really well. However, I've noticed that when using GNU parallel on more than 2 samples, it errors out.

I'm using a list of sample names and running them with parallel. I've tested 2 parallel jobs (-j2), which works fine, but -j4 errors. I'm running genozip from its own conda environment on a fairly old version of Ubuntu (12.04, I think).

(genozip) ...:~/genozip$ cat test_list1.tsv
/.../storage/raw_fastq/.../RNAseq/.../.../sample1
/.../storage/raw_fastq/.../RNAseq/.../.../sample2
/.../storage/raw_fastq/.../RNAseq/.../.../sample3

(genozip) ...:~/genozip$ parallel -j4 -a test_list1.tsv genozip --md5 {}_1.fastq.gz {}_2.fastq.gz --pair -E ./GRCh37.ref.genozip -o {/}.grch37.genozip
genozip ./GRCh37.ref.genozip : Reading and caching reference hash table...
Error in file_put_data:1375: failed to rename ./GRCh37.ref.genozip.gcache.tmp to ./GRCh37.ref.genozip.gcache: No such file or directory
If this is unexpected, please contact [email protected].
genozip ./GRCh37.ref.genozip : Done

Error in file_put_data:1375: failed to rename ./GRCh37.ref.genozip.gcache.tmp to ./GRCh37.ref.genozip.gcache: No such file or directory
If this is unexpected, please contact [email protected].
genozip ./GRCh37.ref.genozip : Done
genozip ADII-0679-201388_1.fastq.gz : 0%

I'm assuming this is because the cache cannot be accessed by more than two processes at once?
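The "failed to rename .gcache.tmp" error is consistent with several processes writing the cache through the same temp-file name and racing on the rename. A common fix for this pattern (a sketch only, not genozip's actual code) is a uniquely-named temp file per process followed by an atomic rename:

```python
import os
import tempfile

def put_cache_atomically(cache_path, data):
    """Write a cache file via a uniquely-named temp file, then rename it into place.

    Each process gets its own temp name from mkstemp, so concurrent writers
    never delete each other's .tmp. On POSIX, rename over an existing file is
    atomic, so the last writer simply replaces the cache with identical content
    and no reader ever sees a half-written file.
    """
    dirname = os.path.dirname(cache_path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.rename(tmp, cache_path)  # atomic; cannot fail with "No such file" on its own tmp
    finally:
        if os.path.exists(tmp):  # only if the rename did not happen
            os.unlink(tmp)
```

Until the race is fixed, a practical workaround may be to run one job first so the cache exists before launching the rest in parallel.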

Genozip multi-stage docker

If I build genozip from source, are there any other files required for the programs to work, other than the binaries themselves?

binding multiple files defined in file

Hi @divonlan.

Currently genozip can bind all files in the current directory using the * wildcard. It would be useful to be able to specify the paths of the files to bind - either piped from the command line or stored in a file, cf. tar:

tar -cvf my_bams.tar -T my_bams.txt

This would be a handy feature when binding files located within myriad subdirectories.
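In the meantime, the list file can be expanded into a genozip command line outside genozip itself. A minimal sketch (the function name is hypothetical; it mirrors tar's -T semantics of one path per line, ignoring blank lines):

```python
def bind_command(list_file_lines, output):
    """Build a genozip argv from a tar -T style list of paths, one per line."""
    paths = [ln.strip() for ln in list_file_lines if ln.strip()]
    return ["genozip", *paths, "-o", output]

# Typical use: pass the result to subprocess.run(), or equivalently on a shell:
#   xargs -a my_bams.txt genozip -o my_bams.genozip
# (the xargs form assumes genozip accepts -o before the input files,
# which its other documented invocations suggest it does)
```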

genozip --register aborted (core dumped)

genozip: Error accessing the Internet: % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:49 --:--:-- 0
Aborted (core dumped)

Invalid header error - Unexpected compressed_offset value when trying to decrypt genozip

Hi,
I ran into a problem while trying to decrypt a vcf file encrypted and compressed with genozip. The error I keep getting is:
Error: invalid header - expecting compressed_offset to be 64 but found 48. section_type=SEC_DICT

This happens only when the file is password-encrypted, and not because the password is wrong - otherwise, everything works great. I really wish I could fix this issue and start using this software regularly.

Thanks in advance,
Guy

genozip --make-reference not working on .fna file

genozip complains and fails when provided a reference file with a suffix other than '.fa' or '.fasta'

>genozip --make-reference $GENOME.fna

genozip: --make-reference can only be used with FASTA files

While this works:

>ln -s $GENOME.fna $GENOME.fa
>genozip --make-reference $GENOME.fa
genozip A_rabiei_me14.fa : Writing hash table (this can take several minutes)...

A better way would be to actually check the content of the file and validate that it is FASTA.
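Content sniffing along those lines is cheap: the first non-blank line of a FASTA file must start with '>'. A sketch of such a check (not genozip's implementation):

```python
import gzip

def looks_like_fasta(path):
    """Return True if the first non-blank line starts with '>', regardless of
    the file extension. Handles plain and gzip-compressed input."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in f:
            if line.strip():
                return line.startswith(">")
    return False  # empty file
```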

genocat crash when querying sequences from a compressed fasta file

The following pertains to genozip version 15.0.57 installed via conda for Linux x64.

genocat crashes when querying a specific sequence name from a genozipped FASTA file (see the code snippet at the bottom).
When genocat is executed without specifying regions, the whole file decompresses with no issues.
Any idea what went wrong here?

$ genocat -r seq1 test.fa.genozip

22-May-2024 01:24:20 IDT MAIN/1/0: Error in regions_is_site_included:404 line_in_file(1-based)=-1  stack=TOPLEVEL[0]->N/A code_version=15.0.57 file_version=15.0.57: chrom=0 is out of range: num_chroms=0 chrom_did_i=0
Call stack (piz thread):
genocat(+0x91fb1)[0x55c317548fb1]
genocat(main_exit+0x308)[0x55c3175014a8]
genocat(regions_is_site_included+0x18e)[0x55c317694e4e]
genocat(container_reconstruct+0x2850)[0x55c317518aa0]
genocat(reconstruct_one_snip+0xb7c)[0x55c31755677c]
genocat(reconstruct_from_ctx_do+0x1d8)[0x55c317555188]
genocat(+0x98eed)[0x55c31754feed]
genocat(+0x92a2c)[0x55c317549a2c]
/lib64/libpthread.so.0(+0x7ea5)[0x7fda8072fea5]
/lib64/libc.so.6(clone+0x6d)[0x7fda7ff4eb0d]

Disable checking for new version

Hi!

It would be cool if there were an option to disable the new-version check. If a new version has not yet appeared in conda but is already on genozip.com, genozip stops every time and offers to update, which is not at all convenient when there are a lot of files in the queue.

Possible bug in BAM compression?

Hi,

Really nice tool! The speed and compression improvements over, e.g. gzip, are very impressive.

I think there may be a bug in the compression of BAM files. Although the BAM file I was originally trying has millions of records, I narrowed it down to the following. If I run genozip (v11.0.2) on a SAM file containing the following line, it works fine (genozip --threads 1 -f test.sam):

NS500125:680:HNHVYBGXG:2:11209:16805:14650 256 4 145637796 1 9M1494270N67M * 0 GAGTACGGGGAAGTCATGGAGGGAGACTAGTGCCTAGTATTTGCGGTGCCTGAAAACTTTCTTAAGAAGCAGTTGT A/AAAEEEEEEEEEEEEEAE/EAEEEEEE6AEAEEEEEEEEAEEE<EAAEEEEEEEEEEEEE/EEEAEEEEAAEAE NH:i:4 HI:i:4 AS:i:69 nM:i:1 XS:A:+

However, if I convert that SAM file to a BAM file (I'm using sambamba: sambamba view -S -f bam test.sam -o test.bam), and run genozip --threads 1 -f test.bam, I get the following output:

genozip test.bam : 0%
op_len=1 too long in vb=1494270:
[1] 28905 abort (core dumped) genozip --threads 1 -f test.bam

I think that it is complaining about the length of the number in the middle of the CIGAR string (i.e. 1494270). If I remove one digit from that number, and reconvert the SAM file to BAM, then genozip works without error.
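For context, BAM packs each CIGAR operation into a uint32 with the length in the high 28 bits and the operator code in the low 4 bits, so a length of 1,494,270 (max is 268,435,455) is well within spec - the record itself is legal BAM, which points at the parser rather than the file. A sketch of the encoding:

```python
# Operator codes per the SAM/BAM specification: M=0, I=1, D=2, N=3, S=4, ...
OPS = "MIDNSHP=X"

def encode_cigar(cigar_ops):
    """[(9, 'M'), (1494270, 'N'), (67, 'M')] -> list of packed uint32 values."""
    return [(length << 4) | OPS.index(op) for length, op in cigar_ops]

def decode_cigar(packed):
    """Inverse of encode_cigar: unpack (length, op) pairs from uint32 values."""
    return [(value >> 4, OPS[value & 0xF]) for value in packed]
```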

Genozip Error in lookback_get_do:70

Hello,

I got this error while running genozip v13.0.9


Error in lookback_get_do:70: expecting lookback=329 <= lookback_len=4 for ctx=PS vb=1 line_i=14
If this is unexpected, please contact [email protected].


I'm running the program on a Slurm cluster, and the command was one of:
genozip --reference ref.genozip vcf.gz
genozip --threads 30 --vblock 2000 --reference ref.genozip vcf.gz
genozip vcf.gz

Do you have an idea what might be causing this please?

Thanks!

Changes to libbsc source files are not compatible with Apache license

Hi @divonlan, I notice that you added the following changes to the libbsc source files:

// Please see terms and conditions in the file LICENSE.txt
//
// WARNING: Genozip is propeitary, not open source software. Modifying the source code is strictly not permitted
// and subject to penalties specified in the license.

These changes are not compatible with the Apache license. Changes to libbsc source files are permitted and encouraged. By using the libbsc software you agreed that "any contributions will be under the terms of the license without any terms and conditions".

So please remove the warning message and the terms and conditions associated with usage of the libbsc source code.

FASTQ read change after genozipping

We have recently noticed an issue with our paired-end genozipped DNA FASTQ files, where upon genounzipping, some reads were changed. We used genozip v12.0.37 to execute the following command:
genozip --reference Homo_sapiens_assembly38.ref.genozip --pair file_R1.fastq file_R2.fastq --threads 8

We also used the Process class from Python's multiprocessing library, running 8 instances of genozip simultaneously.

The two files are identical in size, with only the line containing the nucleotide sequence sporadically changed. Here's an extract from one of the genounzipped files:

+
FF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF,FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF
@A00553:69:HYL2YDSXY:4:1101:6876:1000 1:N:0:TTGGACTC+CTGCTTCC
TATGCATTTCAATACTATAGGATTCACGTTAATAGAAATAACCAGATGAAATGCTTCTGGTATGTCACCTTCCCTACCCACATAAGCCAGTGTTTTTTTCTGTGAATAACAAAAACAGCAGAATTTACTTGCCTATCCGTAAGAAGTTACC

And the respective original file (notice the difference is solely in the nucleotides):

+
FF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF,FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF
@A00553:69:HYL2YDSXY:4:1101:6876:1000 1:N:0:TTGGACTC+CTGCTTCC
TCAGATCACAATGTATACAAATTTTTTTCCTGCTAGTTTTCTTTCACATTACTGCAATCTATCTCTTTTAAAAAAAGTATATAGTGCAGCTATTTCAGCCAGGCACGGTGGTTCATGCCTGTAATCCCAGCACTTTGGGAGGCAGAGGCGG

We could not reproduce this result, and we could find no errors in the log files. Do you have any suggestions for a potential cause or remedy, please?
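When hunting sporadic corruption like this, a record-level diff helps pin down exactly which record and which of its four lines changed (and whether it is always the sequence line, as observed here). A minimal sketch:

```python
def first_mismatch(fastq_a_lines, fastq_b_lines):
    """Compare two FASTQ files read as lists of lines.

    Returns (record_index, line_in_record, line_a, line_b) for the first
    differing line - line_in_record is 0=header, 1=sequence, 2=plus, 3=quality -
    or None if the files are identical.
    """
    for i, (a, b) in enumerate(zip(fastq_a_lines, fastq_b_lines)):
        if a != b:
            return (i // 4, i % 4, a, b)
    return None
```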

Exceeding memory allocation when genozipping file from URL

Hi,

I'm trying to genozip a vcf file directly from URL and I'm running out of memory.
This is my command:

 genozip -@ 2 -B 1024 --force ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr22.recalibrated_variants.vcf.gz

And this is the error I'm getting from the queue scheduling manager (PBSPro):

mem 52479500kb exceeded limit 25165824kb

As you can see, I tried limiting the vblock size (-B), but it doesn't have any impact.

Thanks for developing this tool, I would love to include this in my future pipelines (if I could get it running properly).

Ido

Segmentation fault when decompressing a compressed BAM file

Hello!

First of all, thank you very much for the nice tool!

I wanted to use genozip to compress some BAM files. However, when trying to decompress them I get a Segmentation fault.

I used the following commands for compression:

# 1st try
genozip test.bam --threads 5 --best=NO_REF --noisy -i BAM -o test.bam.genozip
# genozip test.bam : Done (1 minute 52 seconds, BAM compression ratio: 2.6)

# 2nd try
genozip test.bam --threads 5 --noisy -i BAM -o test.bam.genozip
# genozip test.bam : Done (1 minute 1 second, BAM compression ratio: 2.1)

For decompression, I used

genounzip test.bam.genozip --noisy --no-PG --output test.2.bam

The tool was installed via conda: version 13.0.11, h7f98852_0.
The commands were executed in an interactive session on our HPC cluster, with 5 CPUs and about 130 GB of memory available.
The BAM file used for testing is about 1 GB.

Thank you in advance!

Buffer overflow on long reads.

On a pacbio CLR long-read data set downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20131209_na12878_pacbio/si/NA12878.pacbio.bwa-sw.20140202.bam I get this error:

$ ~/lustre/genozip/genozip -@12 -f -e hs37d5.ref.genozip NA12878.pacbio.bwa-sw.20140202.bam
genozip NA12878.pacbio.bwa-sw.20140202.bam : 3% (39 minutes 29 seconds)*** buffer overflow detected ***: /nfs/users/nfs_j/jkb/lustre/genozip/genozip terminated
Aborted (core dumped)

This was built from the master branch, with git describe --tags claiming genozip-12.0.33-4-g308a84dd. The OS is Ubuntu 18.04.5 LTS (Bionic). I can't recall how I built it, but most likely it'd have just been "make" so using the system gcc with whatever optimisation options are listed in the Makefile.

Failed to decompress the description line of a FASTA-format file

Hi,

Here is an example FASTA sequence file to reproduce the error.

>g1 1|-6|0|5|0|204
A
>g2 0.66|0|0|6|0|202
A

After compressing the above sequence file using genozip and decompressing it using genounzip, the resulting sequence file became

>g1 1|-6|0|5|0|204
A
>g2 0.60|0|0|6|0|202
A

And genounzip threw an error:

genounzip seq.fasta.genozip : 
genounzip: Adler32 of reconstructed vblock=1,component=1 (122686285 ) differs from original file (128977747 ).
Note: genounzip is unable to check the Adler32 subsequent vblocks once a vblock is bad
Bad reconstructed vblock has been dumped to: seq.fasta.genozip.vblock-1.start-0.len-44.bad
To see the same data in the original file:
   cat seq.fasta | head -c 44 | tail -c 44 > seq.fasta.genozip.vblock-1.start-0.len-44.good
genounzip: File integrity error: Adler32 of decompressed file seq.fasta is 122686285 , but Adler32 of the original FASTA file was 128977747 
Done (0 seconds)

Leiting
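The checksums in that error can be reproduced independently with zlib's Adler32 (the algorithm the message names), which presumably makes it easy to confirm which copy matches the original. A sketch:

```python
import zlib

def adler32_of_file(path, chunk=1 << 16):
    """Compute the Adler32 of a file, streamed in chunks."""
    value = 1  # Adler32 starts from 1, not 0
    with open(path, "rb") as f:
        while block := f.read(chunk):
            value = zlib.adler32(block, value)
    return value
```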

"Make reference" error in the docs

Just doing some testing on the make-reference option

FASTA-specific options (ignored for other file types):

--make-reference  Compresses a FASTA file to be used as a reference in --reference or --REFERENCE.
Example: genozip --make-reference hs37d5.fa.gz

Example: cat *.fa | genozip --input fasta --make-reference --output myref.ref.genozip

That example gives a seg fault because no input stream is specified. I think you just missed the - after --make-reference:

Example: cat *.fa | genozip --input fasta --make-reference - --output myref.ref.genozip

Failure to decode reads overhanging reference end.

From the Illumina Platinum Genomes file for NA12878, a snippet:

@SQ	SN:chrM	LN:16571
@RG	ID:NA12878	SM:NA12878
HSQ1004:134:C0D8DACXX:1:1107:20540:135446	101	chrM	16472	37	101M	=	227	-16144	GGGGGTAGCTAAAGTGAACTGTATCCGACATCTGGTTCCTACTTCAGGGTCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGG	@C@FFDDFHHFFHIHIIHHEI9FBFHGGGGGGIIJG<FHIHHIIIIJJJIIGJIHGGJIJJJFGHHHHEDFFDDEABDDDDDDCDEEDDDDDACCCB2?B<	RG:Z:NA12878	XT:A:U	NM:i:2	XN:i:1	SM:i:37	AM:i:37	X0:i:1	X1:i:0	XM:i:2	XO:i:0	XG:i:0	MD:Z:49C50C0

This extends from 16472 to 16572 inclusive due to the 101M cigar, but the reference ends at 16571.

The file encodes fine without warnings, but decoding then fails. It would be preferable for the encode to fail or (better) for the decode to succeed, even if that means editing the CIGAR to 100M1S.
