gatb / leon
Leon - FASTA and FASTQ read compressor
Home Page: https://gatb.inria.fr/software/leon
License: GNU Affero General Public License v3.0
Leon compression crashes when the input has a 1-character sequence name. To reproduce:
Input file "1.fasta":
>1
AGCGCGTCTGGCGTGTATAT
GGCTGCTGTGCATTGTGTTC
Input file "2.fasta":
>12
AGCGCGTCTGGCGTGTATAT
GGCTGCTGTGCATTGTGTTC
(The only difference between these inputs is the extra character in the sequence name.)
Compression commands:
leon -file '1.fasta' -c -kmer-size 2
leon -file '2.fasta' -c -kmer-size 2
The first command crashes with this console output:
Input format: Fasta
[DSK: nb solid kmers found : 3 ] 100 % elapsed: 0 min 0 sec remaining: 0 min 0 sec cpu: 333.3 % mem: [ 82, 82, 82] MB
[Compressing headers ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 seczsh: segmentation fault (core dumped) leon -file '1.fasta' -c -kmer-size 2
The second command completes without problems.
OS: Ubuntu 18.04.1 LTS
I found this problem while working on Sequence Compression Benchmark.
An --output arg is required to specify the output path. Currently, all data are generated locally.
Leon uses the current directory for storing temporary files.
For example, leon -file 'data/1.fasta' -c -kmer-size 2
will create a temporary file 1.h5 in the current directory. (Normally it is deleted after compression, but it is left dangling if compression crashes.)
Using the current directory as temporary space is problematic.
A slightly better solution would be to use the output directory for temporary file storage (since we at least know that we have write access and, most probably, some free space there).
An even better way would be to use the directory configured in the TMPDIR environment variable (ideally with a command-line option to specify another directory).
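The lookup order suggested above (command-line option, then TMPDIR, then a system default) can be sketched as follows. Leon itself is C++; this is an illustrative Python sketch, and resolve_tmp_dir / cli_tmp_dir are hypothetical names, not Leon's actual API:

```python
import os
import tempfile

def resolve_tmp_dir(cli_tmp_dir=None):
    """Pick a directory for temporary files.

    Priority: explicit command-line option > TMPDIR environment
    variable > system default temp directory.
    """
    if cli_tmp_dir:
        return cli_tmp_dir
    return os.environ.get("TMPDIR") or tempfile.gettempdir()
```

With this scheme, a crash would leave stray files in a dedicated temp directory rather than in the user's working directory.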
I would like to suggest the following syntax for Leon's arguments:
-f, --file
-v, --verbose
-t, --threads
-h, --help
The rule is: one dash for a short argument (-h) and two dashes for a long argument (--help).
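The proposed short/long convention is what standard option parsers produce by default. A minimal sketch using the flag names from the suggestion above (Python's argparse, for illustration only; Leon is C++ and would use its own parser):

```python
import argparse

# One dash for short options, two dashes for long options.
parser = argparse.ArgumentParser(prog="leon", add_help=False)
parser.add_argument("-f", "--file")
parser.add_argument("-v", "--verbose", action="store_true")
parser.add_argument("-t", "--threads", type=int, default=1)
parser.add_argument("-h", "--help", action="help")

# Short and long forms can be mixed freely on the command line.
args = parser.parse_args(["-f", "reads.fasta", "--threads", "4", "-v"])
```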
Leon has an option "-seq-only" that allows ignoring read names during compression. When decompressing a Leon archive produced with this option, Leon automatically names the decompressed sequences using numbers starting from 0. These numbers run up to 49,999, at which point the counter resets to 0. Additionally, an empty sequence is generated at the 50,000-read boundary. Fragment of the decompressed output:
> 49998
ACACAACTATAATAGGGAAA
> 49999
TTGATTGTTTTGTTTTTGTG
> 50000
> 0
TTCGGATAGTGTGTTCATTA
> 1
TCTCTTTCTTTGGTGATTGA
> 2
CGTCGAGTTGTTTAATTAAA
The two main problems with this counting are:
An empty sequence (with name " 50000") is generated during decompression, although the original file had no such sequence.
Sequence names are not unique. When the data is sufficiently large, the decompressed file will contain multiple sequences with each name.
To fix this problem I suggest removing the artificial upper bound of 50,000 reads.
Also, considering the possibility of huge data, I recommend making sure that the counter cannot overflow (e.g., by using an arbitrary-precision number).
In addition, I would suggest avoiding the space between ">" and the name, and starting the count at 1 (instead of 0). This would make the output a bit friendlier to downstream tools and to interpretation. But these points are less important and can be considered a preference.
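The suggested naming scheme (no space after ">", 1-based, unbounded counter) can be sketched in a few lines. This is an illustrative Python helper, not Leon code; Python integers are arbitrary-precision, so the counter cannot overflow:

```python
def name_reads(seqs):
    """Yield FASTA lines with the suggested naming: '>' directly
    followed by a 1-based counter, with no upper bound or reset."""
    for i, seq in enumerate(seqs, start=1):
        yield f">{i}"
        yield seq

records = list(name_reads(["ACACAACTATAATAGGGAAA", "TTGATTGTTTTGTTTTTGTG"]))
```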
If a filename contains multiple '.' characters, decoding will fail because System::file().getBaseName returns only the first part of the filename.
Example: leon -file C2448.21.fastq.leon -d -lossless
Start decompression
Input filename: C2448.21.fastq.leon
Qual filename: ./C2448.fastq.qual
Output format: Fastq
Kmer size: 31
Input File was compressed with leon version 1.0.0
Block count: 3
[Decompressing all streams] 100 % elapsed: 0 min 0 sec estimated remaining: 0 min 0 sec
Output filename: ./C2448.fastq.d
Time: 0.79s
Speed: 0.00 mo/s
--> It extracts only the first three sequences, without qualities, but does not report any error!
--> Check whether the .qual file is there when -lossless is given!
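The base-name problem above could be avoided by stripping only the final ".leon" extension, keeping inner dots intact. A minimal Python sketch (decompressed_names is a hypothetical helper for illustration, not the actual System::file().getBaseName fix):

```python
def decompressed_names(leon_name):
    """Derive output filenames by removing only the trailing '.leon'
    suffix, so inner dots in the original name are preserved."""
    if leon_name.endswith(".leon"):
        stem = leon_name[:-len(".leon")]
    else:
        stem = leon_name
    return stem + ".qual", stem + ".d"

qual, out = decompressed_names("C2448.21.fastq.leon")
```

Under this scheme the qual file for C2448.21.fastq.leon would be C2448.21.fastq.qual rather than the truncated C2448.fastq.qual seen in the log above.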
Version: e7f6ad2
OS: Ubuntu 16.04, 64 Bit
Trying to compress a FASTA file results in a segfault. Same behaviour with debug build.
The crash occurs in DnaEncoder::smoothQuals(). The problem appears to lie in the read length: if the read length is less than the k-mer length of 31 (?), the crash occurs.
I generated a pair of FASTA files for you. One file works; the other contains one character less and makes Leon crash.
leon_fine.fasta.gz
leon_crash.fasta.gz
I have an Apport crash report from running a debug build and could give you the core dump as well. On the other hand, I am pretty sure that this can be reproduced easily.
Leon compression crashes on some data. Example data:
leon-repro-1.fa.gz (784 MB archive, inside is a 2.8 GB file).
Command to reproduce (after decompressing the gzipped data):
leon -seq-only -file leon-repro-1.fa -c -kmer-size 3
This command crashes with the following console output:
Input format: Fasta
[DSK: Pass 1/1, Step 2: counting kmers ] 70.5 % elapsed: 0 min 28 sec remaining: 0 min 12 sec cpu: 472.6 % mem: [ 66, 66, 66] MB EXCEPTION: Pool allocation failed for 3012690144 bytes (kmers alloc). Current usage is 16 and capacity is 2097152000
Also, after the crash Leon leaves 85 temporary files in the current directory, totaling 21 GB.
I noticed that the Leon paper mentions using Leon on 733 GB of data. Therefore I assumed that a comparatively small data size of 2.8 GB should be no problem.
I have just tried leon by compressing hg19.fa.
leon -file hg19.fa -c
I get an error:
[DSK: nb solid kmers found : 35010472 ] 100 % elapsed: 3 min 37 sec remaining: 0 min 0 sec cpu: 185.8 % mem: [ 38, 3626, 4201] MB
[Compressing headers ] 100 % elapsed: 0 min 17 sec remaining: 0 min 0 sec
End header compression
Headers size: 1193
Headers compressed size: 604
Compression rate: 0.5063
Abundance threshold: 4 (auto) (nb solid kmers: 22647208)
[fill bloom filter ] 100 % elapsed: 0 min 6 sec remaining: 0 min 0 sec
[Compressing dna ] 100 % elapsed: 0 min 11 sec remaining: 0 min 0 sec
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
[1] 30874 abort (core dumped) leon -file hg19.fa -c
I was not able to extract the Leon file, which is probably corrupted. It is only 604 bytes.
I will test on a stronger computer to see if it works better.