leon's People

Contributors

cdeltel, clemaitre, gatouresearch, genscale-admin, rizkg

leon's Issues

Crash when compressing sequence with 1-character name

Leon compression crashes when the input has a 1-character sequence name. To reproduce:

Input file "1.fasta":

>1
AGCGCGTCTGGCGTGTATAT
GGCTGCTGTGCATTGTGTTC

Input file "2.fasta":

>12
AGCGCGTCTGGCGTGTATAT
GGCTGCTGTGCATTGTGTTC

(The only difference between these inputs is the extra character in the sequence name.)

Compression commands:
leon -file '1.fasta' -c -kmer-size 2
leon -file '2.fasta' -c -kmer-size 2

The first command crashes with this console output:

        Input format: Fasta
[DSK: nb solid kmers found : 3           ]  100  %   elapsed:   0 min 0  sec   remaining:   0 min 0  sec   cpu: 333.3 %   mem: [  82,   82,   82] MB
[Compressing headers                     ]  0    %   elapsed:   0 min 0  sec   remaining:   0 min 0  seczsh: segmentation fault (core dumped)  leon -file '1.fasta' -c -kmer-size 2

The second command completes without problems.

OS: Ubuntu 18.04.1 LTS

I found this problem while working on Sequence Compression Benchmark.

output arg is required

An --output arg is required to specify the output path. Currently, all output is written to the current directory.

Leon uses current directory for temporary files

Leon uses the current directory for storing temporary files.

For example, leon -file 'data/1.fasta' -c -kmer-size 2 will create a temporary file 1.h5 in the current directory. (Normally it is deleted after compression, but it is left dangling if compression crashes.)

Using current directory as temporary space is problematic because:

  1. It is completely unexpected by the user.
  2. The name may clash with other files already stored there.
  3. The current directory might have insufficient space.
  4. The process may have no write access to the current directory.
  5. When running concurrent leon tasks, their temporary files may clash with each other.

A slightly better solution would be to use the output directory for temporary files, since we at least know that we have write access and, most probably, some free space there.

An even better way would be to use the directory configured in the TMPDIR environment variable, ideally with a command-line option to specify another directory.
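The lookup order suggested above can be sketched in C++. pickTempDir is a hypothetical helper, not part of Leon, and a POSIX environment is assumed:

```cpp
#include <cstdlib>
#include <string>

// Pick a directory for temporary files:
//   1. an explicit command-line override, if given;
//   2. the TMPDIR environment variable, if set;
//   3. /tmp as a last resort -- never the current working directory.
std::string pickTempDir(const char* cliOverride) {
    if (cliOverride && *cliOverride) return cliOverride;
    const char* env = std::getenv("TMPDIR");
    if (env && *env) return env;
    return "/tmp";
}
```

Falling back to /tmp keeps the behaviour predictable when neither the option nor TMPDIR is set.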

argument naming

I would like to suggest the following syntax for leon arguments:

  -f, --file
  -v, --verbose
  -t, --threads
  -h, --help

The rule is: one dash for the short arg (-h) and two dashes for the long arg (--help).
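This convention is exactly what POSIX/GNU getopt_long implements; a minimal sketch with an abbreviated option set (Args and parseArgs are hypothetical helpers, not Leon's actual parser):

```cpp
#include <getopt.h>
#include <string>

// Suggested convention: one dash for the short form (-h),
// two dashes for the long form (--help).
struct Args {
    std::string file;
    int threads = 1;
    bool verbose = false;
    bool help = false;
};

Args parseArgs(int argc, char** argv) {
    static const struct option longOpts[] = {
        {"file",    required_argument, nullptr, 'f'},
        {"threads", required_argument, nullptr, 't'},
        {"verbose", no_argument,       nullptr, 'v'},
        {"help",    no_argument,       nullptr, 'h'},
        {nullptr, 0, nullptr, 0}
    };
    Args a;
    int c;
    optind = 1;  // reset getopt's global cursor so the parser is reusable
    while ((c = getopt_long(argc, argv, "f:t:vh", longOpts, nullptr)) != -1) {
        switch (c) {
            case 'f': a.file = optarg; break;
            case 't': a.threads = std::stoi(optarg); break;
            case 'v': a.verbose = true; break;
            case 'h': a.help = true; break;
        }
    }
    return a;
}
```

getopt_long accepts both -f FILE and --file FILE out of the box, so the short and long forms stay in sync automatically.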

Broken read numbering when decompressing archive without headers

Leon has an option "-seq-only" that allows ignoring read names during compression. When decompressing a Leon archive produced with this option, Leon automatically names the decompressed sequences with numbers starting from 0. These numbers run up to 50,000, at which point the counter resets to 0. Additionally, an empty sequence is generated at the 50,000-read boundary. A fragment of the decompressed output:

> 49998
ACACAACTATAATAGGGAAA
> 49999
TTGATTGTTTTGTTTTTGTG
> 50000
> 0
TTCGGATAGTGTGTTCATTA
> 1
TCTCTTTCTTTGGTGATTGA
> 2
CGTCGAGTTGTTTAATTAAA

The two main problems with this counting are:

  1. An empty sequence (with name " 50000") is generated during decompression, although the original file had no such sequence.

  2. Sequence names are not unique. When the data is sufficiently large, the decompressed file will have multiple sequences with each name.

To fix this problem I suggest removing the artificial upper bound of 50,000 reads.

Also, considering the possibility of huge data, I recommend making sure that the counter can't overflow (e.g., by using an arbitrary-precision number).

In addition, I would suggest avoiding the space between ">" and the name, and starting the count at 1 (instead of 0). This would make the output a bit friendlier to downstream tools and to human interpretation. But these points are less important and can be considered a preference.
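The naming scheme requested above (one monotonic counter, 1-based, no reset, no space after ">") can be sketched as follows. ReadNamer is a hypothetical helper, and uint64_t stands in for the arbitrary-precision counter, since 2^64 reads is far beyond any realistic dataset:

```cpp
#include <cstdint>
#include <string>

// Generate FASTA headers for a decompressed -seq-only archive:
// a single monotonically increasing counter, starting at 1,
// with no artificial 50,000 reset and no empty record at the boundary.
class ReadNamer {
    uint64_t next_ = 1;  // 1-based; uint64_t cannot realistically overflow here
public:
    std::string header() { return ">" + std::to_string(next_++); }
};
```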

Problem with System::file().getBaseName

If a filename contains multiple '.' characters, decoding fails because System::file().getBaseName returns only the part of the filename before the first '.'.

Example: leon -file C2448.21.fastq.leon -d -lossless

Start decompression
Input filename: C2448.21.fastq.leon
Qual filename: ./C2448.fastq.qual
Output format: Fastq
Kmer size: 31

Input File was compressed with leon version 1.0.0
Block count: 3

[Decompressing all streams] 100 % elapsed: 0 min 0 sec estimated remaining: 0 min 0 sec

Output filename: ./C2448.fastq.d
Time: 0.79s
Speed: 0.00 mo/s

--> It only extracts the first three sequences, without qualities, but does not report any error!
--> Check whether the .qual file is present when -lossless is given!
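One way to fix the base-name handling is to strip only the known ".leon" suffix instead of truncating at the first '.'. A sketch (stripLeonSuffix is a hypothetical helper, not Leon's API):

```cpp
#include <string>

// Remove only the trailing ".leon" extension, keeping any other
// dots in the filename intact (unlike truncating at the first '.').
std::string stripLeonSuffix(const std::string& name) {
    const std::string ext = ".leon";
    if (name.size() > ext.size() &&
        name.compare(name.size() - ext.size(), ext.size(), ext) == 0) {
        return name.substr(0, name.size() - ext.size());
    }
    return name;  // no .leon suffix: return the name unchanged
}
```

With this, C2448.21.fastq.leon decodes to the base name C2448.21.fastq, so the derived .qual and output paths keep their inner dots.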

Segfault in FASTA Compression with Sequence Length < k-mer Length

Version: e7f6ad2
OS: Ubuntu 16.04, 64 Bit

Trying to compress a FASTA file results in a segfault. The same behaviour occurs with a debug build.

The crash occurs in DnaEncoder::smoothQuals(). The problem appears to lie in the read length: if the read length is less than the k-mer length of 31 (?), the crash occurs.

I generated a pair of FASTA files for you. One file works; the other contains one character less and makes leon crash.

leon_fine.fasta.gz
leon_crash.fasta.gz

I have an Apport crash report of running a debug build and could give you the core dump as well. On the other hand, I am pretty sure that this can be reproduced easily.
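If the crash really comes from iterating k-mers over a read shorter than k, a likely culprit is the window count: with unsigned sizes, length - k underflows to a huge value. A guard sketch (kmerWindows is illustrative only, not Leon code, and the underflow explanation is an assumption):

```cpp
#include <cstddef>
#include <string>

// Count the k-mer windows a smoothing pass would visit.
// A read shorter than k has zero windows; without the guard,
// seq.size() - k + 1 wraps around for unsigned sizes and a loop
// over the windows runs far past the end of the read.
std::size_t kmerWindows(const std::string& seq, std::size_t k) {
    if (seq.size() < k) return 0;  // guard against unsigned underflow
    return seq.size() - k + 1;
}
```

Checking seq.size() < k before computing seq.size() - k + 1 is the whole fix: the smoothing pass simply skips reads that cannot hold a single k-mer.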

Crash on 2.8 GB data: "EXCEPTION: Pool allocation failed"

Leon compression crashes on some data. Example data:

leon-repro-1.fa.gz (784 MB archive, inside is a 2.8 GB file).

Command to reproduce (after decompressing the gzipped data):

leon -seq-only -file leon-repro-1.fa -c -kmer-size 3

This command crashes with the following console output:

        Input format: Fasta
[DSK: Pass 1/1, Step 2: counting kmers   ]  70.5 %   elapsed:   0 min 28 sec   remaining:   0 min 12 sec   cpu: 472.6 %   mem: [  66,   66,   66] MB EXCEPTION: Pool allocation failed for 3012690144 bytes (kmers alloc). Current usage is 16 and capacity is 2097152000

Also, after the crash Leon leaves 85 temporary files in the current directory, totaling 21 GB.

I noticed that the Leon paper mentions running Leon on 733 GB of data. Therefore I assumed that the comparatively small data size of 2.8 GB should be no problem.

try to compress hg19.fa

I have just tried leon by compressing hg19.fa.

 leon -file hg19.fa -c  

I got an error.

[DSK: nb solid kmers found : 35010472    ]  100  %   elapsed:   3 min 37 sec   remaining:   0 min 0  sec   cpu: 185.8 %   mem: [  38, 3626, 4201] MB
[Compressing headers                     ]  100  %   elapsed:   0 min 17 sec   remaining:   0 min 0  sec
        End header compression
                Headers size: 1193
                Headers compressed size: 604
                Compression rate: 0.5063
Abundance threshold: 4 (auto)    (nb solid kmers: 22647208)
[fill bloom filter                       ]  100  %   elapsed:   0 min 6  sec   remaining:   0 min 0  sec
[Compressing dna                         ]  100  %   elapsed:   0 min 11 sec   remaining:   0 min 0  sec
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
[1]    30874 abort (core dumped)  leon -file hg19.fa -c

I was not able to extract the leon file, which is probably corrupted. It's only 604 bytes.
I will test on a stronger computer to see if it works better.
