PQSDC

Made with C++, OpenMP, and MPI.


A Lossless Parallel Quality Scores Data Compressor For Large Scale Genomic Sequencing Data.

Contents: About The PQSDC | Copy Our Project | Usage | Examples | Our Experimental Configuration | Dataset Acquisition | Acknowledgements

About The PQSDC

PQSDC is an experimental open-source compressor for quality-score data. It uses parallel sequence partitioning and a four-level run-length prediction model to increase the compression ratio while keeping memory and time consumption low. Compression can also be accelerated on multi-core CPU clusters, which significantly reduces time overhead.
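
To make the term "run-length" concrete, here is a minimal sketch of plain run-length encoding applied to quality strings. It is not PQSDC's four-level prediction model, whose details are not described in this README. The function name rle_encode and the "symbol followed by count" output format are assumptions chosen purely for illustration.

```shell
# Illustrative only: plain run-length encoding of each input line.
# NOT the PQSDC four-level prediction model; rle_encode is a hypothetical name.
rle_encode() {
  awk '{
    out = ""
    n = length($0)
    i = 1
    while (i <= n) {
      c = substr($0, i, 1)          # current quality symbol
      run = 1
      # extend the run while the next symbol is the same
      while (i + run <= n && substr($0, i + run, 1) == c) run++
      out = out c run " "           # emit "<symbol><count> "
      i += run
    }
    sub(/ $/, "", out)              # drop the trailing space
    print out
  }'
}

# Example: printf 'FFFF::F\n' | rle_encode
```

Long runs of identical quality symbols collapse into a single pair, which is why run-length structure is a natural target for quality-score compression.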

We are about to release a new version of PQSDC2, so stay tuned.

Copy Our Project

Firstly, clone our tools from GitHub:

git clone https://github.com/fahaihi/PQSDC.git

Secondly, change to the pqsdc_v2 directory:

cd PQSDC/pqsdc_v2

Thirdly, run the following command:

bash install.sh
# Warning: requires GNU Make > 3.82.

Finally, configure the environment variables with the following commands:

export PATH=$PATH:`pwd`/
export PQSDC_V2_PATH="`pwd`/"
source ~/.bashrc
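
A quick way to confirm that the PATH change took effect is to ask the shell where it finds the binaries. The helper name check_on_path below is hypothetical and not part of PQSDC; the command names pqsdc_v2 and pqsdc_tools are the ones used elsewhere in this README.

```shell
# Hypothetical helper: report whether a command can be found on PATH.
check_on_path() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found on PATH"
    return 1
  fi
}

# Example: check_on_path pqsdc_v2 && check_on_path pqsdc_tools
```

Note that export only affects the current shell session. Appending the two export lines, with an absolute path in place of `pwd`, to ~/.bashrc is one way to make them persistent.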

Usage

    Basic Usage: pqsdc_v2 [command option]
       -c [qualities file] [threads]                      *compression mode.
       -d [pqsdc generate directory] [threads]            *decompression mode.
       -h                                                 *print this message.
    Advanced Usage: pqsdc_tools [command option]
       -fileinfo [input-fastq-file]                       *print basic statistical information.
       -dirinfo [input-dir-name]                          *print basic statistical information.
       -verify [source-fastq-file] <mode> [verify-file]   *verify decompression.
          <mode> = reads
          <mode> = qualities
       -filesplite [input-fastq-file] mode <mode>         *split a FastQ file according to <mode>.
          <mode> = ids
          <mode> = reads
          <mode> = describes
          <mode> = qualities
          <mode> = all
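
For readers who want to see what a qualities-only file looks like without running pqsdc_tools, the sketch below pulls the quality line out of each FASTQ record with awk. It assumes the common four-line FASTQ layout (identifier, read, separator, qualities). It is not the implementation behind -filesplite, and the helper name extract_qualities is hypothetical.

```shell
# Hypothetical helper: print the quality line of every 4-line FASTQ record.
# Assumes strictly four lines per record (no wrapped sequences).
extract_qualities() {
  awk 'NR % 4 == 0' "$1"
}

# Example: extract_qualities test.fastq > my.qualities
```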

Notes:

(1) For compatibility with personal computers, the current open-source version only includes parallel compression on a single CPU node with multiple cores.

(2) This open-source version only supports fixed-length sequences.

(3) The BIOCONDA version will be updated soon.
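
Because of note (2), it can be worth checking an input before compressing it. The following sketch tests whether every line of a qualities file has the same length as the first line. The helper name check_fixed_length is hypothetical, and the check reflects only the fixed-length constraint stated above, not any validation PQSDC itself performs.

```shell
# Hypothetical pre-check: succeed only if all lines have the same length.
check_fixed_length() {
  awk '
    NR == 1 { len = length($0) }        # remember the first line length
    length($0) != len {
      printf "line %d has length %d, expected %d\n", NR, length($0), len
      bad = 1
      exit                              # jump to END on first mismatch
    }
    END { exit bad + 0 }                # exit status 1 if a mismatch was seen
  ' "$1"
}

# Example: check_fixed_length test.qualities && pqsdc_v2 -c test.qualities 8
```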

Examples

A validation dataset is provided at PQSDC/data/test.qualities.

1. Using 8 CPU cores for compression.

cd ${PQSDC_V2_PATH}
cd ..
cd data
pqsdc_v2 -c test.qualities 8

results:

compression mode.
fileName : test.qualities
threads  : 8
savepath : test.qualities.partition/result.pqsdc_v2
----------------------------------------------------------------------
1 reads partition, generate test.qualities.partition directory.
2 parallel run-length encoding prediction mapping.
3 cascade zpaq compressor.
4 pacing files into test.qualities.partition/result.pqsdc_v2.
5 removing redundant files.
over!
----------------------------------------------------------------------

2. Using 8 CPU cores for decompression.

pqsdc_v2 -d test.qualities.partition 8

results:

running pqsdc algorithm at Sat Jun 17 15:31:22 CST 2023
de-compression mode
fileName : test.qualities.partition
threads  : 8
savepath : test.qualities.partition.partition.pqsdc_v2
----------------------------------------------------------------------
1 unpacking test.qualities.partition/result.pqsdc_v2.
2 unsing zpaq decompression files.
3 parallel run-length encoding prediction mapping.
4 merge partitions to restore the original file
over
----------------------------------------------------------------------

3. Verify that the decompression was successful.

pqsdc_tools -verify test.fastq qualities test.qualities.pqsdc_de_v2

results:

lossless recover all qualities.
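
Independently of pqsdc_tools, a byte-for-byte comparison gives a similar assurance for a qualities file. The sketch below assumes the restored file is a plain-text qualities file. The helper name same_content and its two arguments are hypothetical; the actual path of the restored file should be taken from the savepath line printed by the decompression step.

```shell
# Hypothetical helper: compare an original file with its restored copy.
same_content() {
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    echo "different"
    return 1
  fi
}

# Example: same_content test.qualities <restored-file>
```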

Our Experimental Configuration

Our experiment was conducted on the SUGON-7000A supercomputer system at the Nanning Branch of the National Supercomputing Center, using a queue of CPU/GPU heterogeneous computing nodes. The compute nodes used in the experiment were configured as follows:

2 × Intel Xeon Gold 6230 CPUs (2.1 GHz, 40 cores in total),

2 × NVIDIA Tesla T4 GPUs (16 GB CUDA memory, 2560 CUDA cores),

512 GB DDR4 memory, and

8 × 900 GB external storage.

Dataset Acquisition

We evaluated PQSDC experimentally on real, publicly available sequencing datasets from the NCBI database. Download the datasets with the following command:

nohup bash data_download.sh > data_download.log &

The datasets are downloaded and extracted using SRA-Tools: https://github.com/ncbi/sra-tools

Acknowledgements

  • Thanks to @HPC-GXU for the computing device support.
  • Thanks to @NCBI for all available datasets.

Additional Information

Source-Version: V1.2023.05.18.

Latest-Version: V2.1.2023.06.17. V2.1.2023.12.09.

Authors: NBJL-BioGrop.

Contact us: https://nbjl.nankai.edu.cn OR [email protected]
