naf's Introduction

Nucleotide Archival Format (NAF)

NAF is a binary file format for biological sequence data. It is based on zstd and features strong compression and fast decompression. It can store DNA, RNA, protein or text sequences, with or without qualities. It supports FASTA- and FASTQ-formatted sequences, ambiguous IUPAC codes and masked sequences, and has no limit on sequence length or number of sequences. It supports Unix pipes, which allows easy integration into pipelines. See the NAF homepage for details.

Example benchmark: SILVA 132 LSURef database (610 MB). [Benchmark figure not shown.]
From the Sequence Compression Benchmark project; visit it for details and more benchmarks.

Format specification

The NAF specification is in the public domain: NAFv2.pdf

Encoder and decoder

NAF encoder and decoder are called "ennaf" and "unnaf". After compressing your data with ennaf, you suddenly have enough space. However, if you decompress it back with unnaf, your space is again un-enough.

Installing

Installing with bioconda

To install NAF with bioconda:

conda install naf

See package page for details: naf at bioconda.

Building from source

Prerequisites: git, gcc, make, diff, perl (diff and perl are only used for the test suite). E.g., to install on Ubuntu: sudo apt install git gcc make diffutils perl. On Mac OS you may have to install Xcode Command Line Tools.

Building and installing:

git clone --recurse-submodules https://github.com/KirillKryukov/naf.git
cd naf && make && make test && sudo make install

To install in an alternative location, add "prefix=DIR" to the "make install" command. E.g., sudo make prefix=/usr/local/bio install

For a staged install, add "DESTDIR=DIR". E.g., make DESTDIR=/tmp/stage install

On Windows it can be installed using Cygwin, and it should also be possible with WSL. In Cygwin, drop sudo: cd naf && make && make test && make install
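The "prefix=DIR" option described above also makes a per-user install possible without sudo. The following is a minimal sketch, not something taken from the NAF documentation: the helper name naf_user_install and the default location $HOME/.local are hypothetical, and it assumes the Makefile follows the usual convention of placing executables under DIR/bin.

```shell
# Hypothetical helper (not part of NAF): build in an already cloned source
# directory and install under a user-writable prefix, so sudo is not needed.
# The body runs in a subshell "( ... )" so that cd does not affect the caller.
naf_user_install() (
    src_dir=$1
    dest=${2:-$HOME/.local}    # illustrative default location
    cd "$src_dir" || exit 1
    make && make prefix="$dest" install
)

# Example call (commented out so that sourcing this file has no side effects):
# naf_user_install ./naf "$HOME/.local"
# export PATH="$HOME/.local/bin:$PATH"   # assumes binaries land in DIR/bin
```

If ennaf is not found after such an install, check where "make install" actually placed the executables before adjusting PATH.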

Building from latest unreleased source

For testing purposes only:

git clone --recurse-submodules --branch develop https://github.com/KirillKryukov/naf.git
cd naf && make && make test && sudo make install

Compressing

ennaf file.fa -o file.naf

See ennaf -h and Compression Manual for detailed usage.
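Because ennaf can read from a pipe, data that is already compressed with another tool can be converted without an intermediate uncompressed file. Below is a minimal sketch under stated assumptions: the helper name gz_to_naf is hypothetical, the input is assumed to be gzip-compressed FASTA, and the --temp-dir option is used in the same way as in the issue report further down this page.

```shell
# Hypothetical helper (not part of NAF): convert a gzip-compressed FASTA file
# to NAF without writing the uncompressed FASTA to disk.
# The body runs in a subshell "( ... )" to keep its variables private.
gz_to_naf() (
    in_gz=$1
    out_naf=$2
    tmp_dir=$(mktemp -d) || exit 1
    # ennaf reads the sequences from standard input when no input file is given.
    gzip -dc "$in_gz" | ennaf --temp-dir "$tmp_dir" -o "$out_naf"
    status=$?                 # exit status of ennaf, the last pipeline stage
    rm -rf "$tmp_dir"
    exit "$status"
)

# Example call (commented out so that sourcing this file has no side effects):
# gz_to_naf genome.fa.gz genome.naf
```

Note that in a plain POSIX shell the exit status reflects only ennaf, so a failure of gzip alone would not be reported by this sketch.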

Decompressing

unnaf file.naf -o file.fa

See unnaf -h and Decompression Manual.
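Since unnaf writes to standard output when no -o is given, its output can be fed directly into other tools. The sketch below shows two such uses: counting sequences and checking that a file survives a compress/decompress round trip. Both helper names are hypothetical. The round-trip check assumes the archive holds FASTA data and that the input uses consistent line wrapping; a byte-for-byte match may not hold for irregularly formatted input.

```shell
# Hypothetical helpers (not part of NAF).

# Count sequences in a NAF archive by streaming the decompressed FASTA
# into grep; nothing is written to disk.
naf_count_seqs() {
    unnaf "$1" | grep -c '^>'
}

# Compress a FASTA file into a scratch directory, decompress it again and
# compare the result with the original. Returns 0 when they are identical.
# The body runs in a subshell "( ... )" to keep its variables private.
naf_roundtrip() (
    fasta=$1
    work=$(mktemp -d) || exit 1
    ennaf -o "$work/check.naf" --temp-dir "$work" "$fasta" &&
        unnaf "$work/check.naf" | cmp -s - "$fasta"
    status=$?
    rm -rf "$work"
    exit "$status"
)

# Example calls (commented out so that sourcing this file has no side effects):
# naf_count_seqs file.naf
# naf_roundtrip file.fa && echo "round trip OK"
```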

Compressing multiple files

Working with multiple files is possible using Multi-Multi-FASTA as an intermediate format. Example commands:

Compressing:
mumu.pl --dir 'Helicobacter' 'Helicobacter pylori*' | ennaf -22 --text -o Hp.nafnaf

Decompressing and unpacking:
unnaf Hp.nafnaf | mumu.pl --unpack --dir 'Helicobacter'

The filename of a single NAF-compressed file normally ends with ".naf". To avoid ambiguity, ".nafnaf" is the recommended suffix for multi-file NAF archives.
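When a single combined archive is not needed, each file can instead be compressed into its own ".naf" file with an ordinary shell loop. This is a minimal sketch rather than a feature of NAF itself: the helper name naf_compress_each and the ".fa" extension are assumptions, and it relies only on the "ennaf file.fa -o file.naf" form shown earlier.

```shell
# Hypothetical helper (not part of NAF): compress every *.fa file in a
# directory into a separate .naf file next to it. Originals are kept.
# The body runs in a subshell "( ... )" to keep its variables private.
naf_compress_each() (
    dir=$1
    for fasta in "$dir"/*.fa; do
        [ -e "$fasta" ] || continue      # the glob matched nothing
        ennaf "$fasta" -o "${fasta%.fa}.naf" || exit 1
    done
)

# Example call (commented out so that sourcing this file has no side effects):
# naf_compress_each ./Helicobacter
```

Unlike the ".nafnaf" approach, this produces one archive per input, so similar sequences in different files are not compressed together.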

Citation

If you use NAF, please cite:

For compressor benchmark, please cite:

  • Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi (2020) "Sequence Compression Benchmark (SCB) database — A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences" GigaScience, 9(7), giaa072, doi: 10.1093/gigascience/giaa072.

naf's Issues

Function to compress / decompress sequence only

Is there, or could there be, a function to compress / decompress sequence only? I have a format I'm using as a replacement for FASTA now, with zstd as the sequence compression (sequence IDs are not compressed). If you could expose a function to do compression/decompression or point me to the right place, I could include naf as an option there.

Thanks

Performance on large files - avoid spilling to disk

Looking through the source code and specification document, I've noticed that both compression and decompression spill to disk for large files. This is particularly problematic in the decompression scenario due to the high temporary disk usage.

Have you considered extending the file format to support multiple blocks? For example:

Header = format descriptor, format version, sequence type, flags, name separator, line length

DataBlock = Number of sequences, IDs, Comments, Lengths, Mask, Sequence, Quality

And the overall structure:

Header, Title, [DataBlock]+

Then you could stream NAF files with no disk usage and a fixed memory overhead. There is a slight compression penalty to having multiple data blocks, but it will be trivially low for large data blocks. Both BAM and CRAM use variants of this blocked compression approach.

bioconda installation

Hi Kirill,

would it be possible to add NAF to bioconda?
(I guess it would be highly used by the community after a while, for example in pipelines)

Best regards,
Diogo

Docker image in kubernetes cluster issue

I am trying to use the latest Docker image quay.io/biocontainers/naf:1.3.0--hec16e2b_3 to compress and decompress within a Nextflow pipeline. I am running the pipeline in a Kubernetes cluster and I keep getting the error "Illegal instruction (core dumped)", after which the process terminates with an error exit status (132).

The commands I am running are:

compress

mkdir mytemp
ennaf -o GSM461177_1.trimmed.naf --temp-dir mytemp GSM461177_1.fastqsanger

I've also tried building my own image, which works locally but throws the same error when run in the cluster.
Could you assist?

Thanks

decompress error

command:

unnaf --fastq fastq.gz.naf |gzip - >fastq.gz.naf.gz
unnaf error: can't allocate 35115481985 bytes

file size:
fastq.gz.naf 43G

Use conan to download zstd

It would be nice to use conan and potentially CMake to package naf. There is already a conan package for zstd. This will make it possible to easily port the package to different operating systems (such as OSX and Windows). I might help with a PR if appreciated.

multiple input files

The ability to compress and decompress more than one file at a time would be very useful,
e.g. being able to compress all files with the suffix .fasta:
$ naf ./*.fasta > .
An option to keep or remove the original FASTA file after compression could also be helpful (defaulting to keep), along with a similar flag to keep or remove the .naf file after decompression.
