gatb / leon
Leon - FASTA and FASTQ read compressor
Home Page: https://gatb.inria.fr/software/leon
License: GNU Affero General Public License v3.0
Leon compression crashes when the input has a 1-character sequence name. To reproduce:
Input file "1.fasta":
>1
AGCGCGTCTGGCGTGTATAT
GGCTGCTGTGCATTGTGTTC
Input file "2.fasta":
>12
AGCGCGTCTGGCGTGTATAT
GGCTGCTGTGCATTGTGTTC
(The only difference between these inputs is the extra character in the sequence name.)
Compression commands:
leon -file '1.fasta' -c -kmer-size 2
leon -file '2.fasta' -c -kmer-size 2
The first command crashes with this console output:
Input format: Fasta
[DSK: nb solid kmers found : 3 ] 100 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: 333.3 % mem: [ 82, 82, 82] MB
[Compressing headers ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 seczsh: segmentation fault (core dumped) leon -file '1.fasta' -c -kmer-size 2
The second command completes without problems.
OS: Ubuntu 18.04.1 LTS
I found this problem while working on Sequence Compression Benchmark.
An --output arg is required to specify the output path. Currently, all data are generated locally.
Leon uses the current directory for storing temporary files.
For example, leon -file 'data/1.fasta' -c -kmer-size 2
will create a temporary file 1.h5 in the current directory. (Normally it is deleted after compression, but it is left dangling if compression crashes.)
Using the current directory as temporary space is problematic.
A slightly better solution would be to use the output directory for temporary file storage (since we at least know that we have write access and, most probably, some free space there).
An even better way would be to use the directory configured in the TMPDIR environment variable (ideally with a command-line option to specify another directory).
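The lookup order suggested above (command-line option, then TMPDIR, then a system default) can be sketched as follows. Leon itself is C++; this is an illustrative Python sketch, and resolve_tmp_dir / cli_tmp_dir are hypothetical names, not Leon's actual API:

```python
import os
import tempfile

def resolve_tmp_dir(cli_tmp_dir=None):
    """Pick a directory for temporary files.

    Priority: explicit command-line option > TMPDIR environment
    variable > system default temp directory.
    """
    if cli_tmp_dir:
        return cli_tmp_dir
    return os.environ.get("TMPDIR") or tempfile.gettempdir()
```

With this scheme, a crash would leave stray files in a dedicated temp directory rather than in the user's working directory.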
I would like to suggest the following syntax for Leon's arguments:
-f, --file
-v, --verbose
-t, --threads
-h, --help
The rule is: one dash for a short argument (-h) and two dashes for a long argument (--help).
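The proposed short/long convention is what standard option parsers produce by default. A minimal sketch using the flag names from the suggestion above (Python's argparse, for illustration only; Leon is C++ and would use its own parser):

```python
import argparse

# One dash for short options, two dashes for long options.
parser = argparse.ArgumentParser(prog="leon", add_help=False)
parser.add_argument("-f", "--file")
parser.add_argument("-v", "--verbose", action="store_true")
parser.add_argument("-t", "--threads", type=int, default=1)
parser.add_argument("-h", "--help", action="help")

# Short and long forms can be mixed freely on the command line.
args = parser.parse_args(["-f", "reads.fasta", "--threads", "4", "-v"])
```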
Leon has an option "-seq-only" that allows ignoring read names during compression. When decompressing a Leon archive produced with this option, Leon automatically names the decompressed sequences using numbers starting from 0. These numbers run up to 49,999, at which point the counter resets to 0. Additionally, an empty sequence is generated at the 50,000-read boundary. Fragment of the decompressed output:
> 49998
ACACAACTATAATAGGGAAA
> 49999
TTGATTGTTTTGTTTTTGTG
> 50000
> 0
TTCGGATAGTGTGTTCATTA
> 1
TCTCTTTCTTTGGTGATTGA
> 2
CGTCGAGTTGTTTAATTAAA
The two main problems with this counting are:
An empty sequence (with name " 50000") is generated during decompression, although the original file had no such sequence.
Sequence names are not unique. When the data is sufficiently large, the decompressed file will contain multiple sequences with each name.
To fix this problem I suggest removing the artificial upper bound of 50,000 reads.
Also, considering the possibility of huge data, I recommend making sure that the counter cannot overflow (e.g., by using an arbitrary-precision number).
In addition, I would suggest avoiding the space between ">" and the name, and starting the count at 1 (instead of 0). This would make the output a bit friendlier to downstream tools and to interpretation. But these points are less important and can be considered a preference.
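The suggested naming scheme (no space after ">", 1-based, unbounded counter) can be sketched in a few lines. This is an illustrative Python helper, not Leon code; Python integers are arbitrary-precision, so the counter cannot overflow:

```python
def name_reads(seqs):
    """Yield FASTA lines with the suggested naming: '>' directly
    followed by a 1-based counter, with no upper bound or reset."""
    for i, seq in enumerate(seqs, start=1):
        yield f">{i}"
        yield seq

records = list(name_reads(["ACACAACTATAATAGGGAAA", "TTGATTGTTTTGTTTTTGTG"]))
```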
If a filename contains multiple '.' characters, decoding will fail because System::file().getBaseName returns only the first part of the filename.
Example: leon -file C2448.21.fastq.leon -d -lossless
Start decompression
Input filename: C2448.21.fastq.leon
Qual filename: ./C2448.fastq.qual
Output format: Fastq
Kmer size: 31
Input File was compressed with leon version 1.0.0
Block count: 3
[Decompressing all streams] 100 % elapsed: 0 min 0 sec estimated remaining: 0 min 0 sec
Output filename: ./C2448.fastq.d
Time: 0.79s
Speed: 0.00 mo/s
--> It extracts only the first three sequences, without qualities, but does not report any error!
--> Check whether the .qual file is there when -lossless is given!
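The base-name problem above could be avoided by stripping only the final ".leon" extension, keeping inner dots intact. A minimal Python sketch (decompressed_names is a hypothetical helper for illustration, not the actual System::file().getBaseName fix):

```python
def decompressed_names(leon_name):
    """Derive output filenames by removing only the trailing '.leon'
    suffix, so inner dots in the original name are preserved."""
    if leon_name.endswith(".leon"):
        stem = leon_name[:-len(".leon")]
    else:
        stem = leon_name
    return stem + ".qual", stem + ".d"

qual, out = decompressed_names("C2448.21.fastq.leon")
```

Under this scheme the qual file for C2448.21.fastq.leon would be C2448.21.fastq.qual rather than the truncated C2448.fastq.qual seen in the log above.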
Version: e7f6ad2
OS: Ubuntu 16.04, 64 Bit
Trying to compress a FASTA file results in a segfault. Same behaviour with debug build.
The crash occurs in DnaEncoder::smoothQuals(). The problem appears to lie in the read length: if the read length is less than the k-mer length of 31 (?), the crash occurs.
I generated a pair of FASTA files for you. One file works; the other contains one character less and makes Leon crash.
leon_fine.fasta.gz
leon_crash.fasta.gz
I have an Apport crash report from running a debug build and could give you the core dump as well. On the other hand, I am pretty sure that this can be reproduced easily.
Leon compression crashes on some data. Example data:
leon-repro-1.fa.gz (784 MB archive, inside is a 2.8 GB file).
Command to reproduce (after decompressing the gzipped data):
leon -seq-only -file leon-repro-1.fa -c -kmer-size 3
This command crashes with the following console output:
Input format: Fasta
[DSK: Pass 1/1, Step 2: counting kmers ] 70.5 % elapsed: 0 min 28 sec remaining: 0 min 12 sec cpu: 472.6 % mem: [ 66, 66, 66] MB EXCEPTION: Pool allocation failed for 3012690144 bytes (kmers alloc). Current usage is 16 and capacity is 2097152000
Also, after the crash Leon leaves 85 temporary files in the current directory, totaling 21 GB.
I noticed that the Leon paper mentions using Leon on 733 GB of data. Therefore I assumed that a comparatively small data size of 2.8 GB should be no problem.
I have just tried leon by compressing hg19.fa.
leon -file hg19.fa -c
I get an error:
[DSK: nb solid kmers found : 35010472 ] 100 % elapsed: 3 min 37 sec remaining: 0 min 0 sec cpu: 185.8 % mem: [ 38, 3626, 4201] MB
[Compressing headers ] 100 % elapsed: 0 min 17 sec remaining: 0 min 0 sec
End header compression
Headers size: 1193
Headers compressed size: 604
Compression rate: 0.5063
Abundance threshold: 4 (auto) (nb solid kmers: 22647208)
[fill bloom filter ] 100 % elapsed: 0 min 6 sec remaining: 0 min 0 sec
[Compressing dna ] 100 % elapsed: 0 min 11 sec remaining: 0 min 0 sec
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
[1] 30874 abort (core dumped) leon -file hg19.fa -c
I was not able to extract the Leon file, which is probably corrupted. It is only 604 bytes.
I will test on a stronger computer to see if it works better.