PQSDC

A Lossless Parallel Quality Scores Data Compressor For Large Scale Genomic Sequencing Data.

About The PQSDC • Copy Our Project • Usage • Examples • Our Experimental Configuration • Dataset Acquisition • Acknowledgements

About The PQSDC

PQSDC is an experimental open-source compressor for quality scores data. It combines parallel sequence partitioning with a four-level run-length prediction model to improve the compression ratio while keeping memory and time consumption low. Compression can be accelerated further on multi-core CPU clusters, which significantly reduces the time overhead.
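Quality strings often contain long runs of the same score, which is why run-length modelling is a natural starting point. The sketch below shows plain run-length encoding of a single quality line in shell. It only illustrates the basic idea and is not PQSDC's four-level prediction model; the function name `rle_encode` and the `<count>x<char>` notation are made up for this example.

```shell
#!/bin/sh
# Illustrative only: plain run-length encoding of one quality string.
# rle_encode and the "<count>x<char>" notation are hypothetical and are
# not taken from the PQSDC source.
rle_encode() {
    printf '%s\n' "$1" | awk '{
        out = ""
        n = length($0)
        i = 1
        while (i <= n) {
            ch = substr($0, i, 1)
            run = 1
            # Extend the run while the next character repeats.
            while (i + run <= n && substr($0, i + run, 1) == ch) { run++ }
            if (out != "") { out = out " " }
            out = out run "x" ch
            i += run
        }
        print out
    }'
}

rle_encode 'IIIIIFFF##'   # prints: 5xI 3xF 2x#
```

The longer the runs, the fewer `<count>x<char>` pairs are needed, which is the redundancy a run-length based predictor can exploit.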

A new version, PQSDC2, will be released soon, so stay tuned.

Copy Our Project

Firstly, clone our tools from GitHub:

git clone https://github.com/fahaihi/PQSDC.git

Secondly, change to the PQSDC directory:

cd PQSDC/pqsdc_v2

Thirdly, run the following command:

bash install.sh
# Warning: GNU Make > 3.82 is required.

Finally, configure the environment variables with the following commands:

export PATH="$PATH:$(pwd)/"
export PQSDC_V2_PATH="$(pwd)/"
source ~/.bashrc
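
Before moving on, it can be worth confirming that the shell actually resolves the binary. The following sketch uses the standard `command -v` builtin. `check_tool` is a hypothetical helper name, and the snippet assumes only that the executable is called `pqsdc_v2`, as it is invoked elsewhere in this README.

```shell
#!/bin/sh
# check_tool is a hypothetical helper, not part of PQSDC.
# It succeeds when the given command name resolves on the current PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH" >&2
        return 1
    fi
}

check_tool pqsdc_v2 || echo "Re-run the export commands from inside the pqsdc_v2 directory."
```

Note that `export` only affects the current shell session. To make the settings persistent, the same two `export` lines would need to be added to a startup file such as `~/.bashrc`.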

Usage

    Basic Usage: pqsdc_v2 [command option]
       -c [qualities file] [threads]                      *compression mode.
       -d [pqsdc generate directory] [threads]            *decompression mode.
       -h                                                 *print this message.
    Advanced Usage: pqsdc_tools [command option]
       -fileinfo [input-fastq-file]                       *print basic statistic information.
       -dirinfo [input-dir-name]                          *print basic statistic information.
       -verify [source-fastq-file] <mode> [verify-file]   *verify decompression.
          <mode> = reads
          <mode> = qualities
       -filesplite [input-fastq-file] mode <mode>         *split a FastQ file according to <mode>.
          <mode> = ids
          <mode> = reads
          <mode> = describes
          <mode> = qualities
          <mode> = all
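
For readers unfamiliar with the FASTQ layout, the split modes can be pictured with a small shell sketch. It assumes the common four-line record layout (identifier, bases, separator line, qualities) and assumes that the mode names map onto those lines in that order. `fastq_field` is a hypothetical helper and is not how `pqsdc_tools` is implemented.

```shell
#!/bin/sh
# fastq_field is a hypothetical helper, not part of PQSDC.
# Assumption: four lines per record, and the mode names correspond to
# line 1 (ids), line 2 (reads), line 3 (describes), line 4 (qualities).
# Usage: fastq_field <fastq-file> <mode>
fastq_field() {
    case "$2" in
        ids)       k=1 ;;
        reads)     k=2 ;;
        describes) k=3 ;;
        qualities) k=0 ;;
        *) echo "unknown mode: $2" >&2; return 2 ;;
    esac
    # Print only the lines whose position inside the record matches k.
    awk -v k="$k" 'NR % 4 == k' "$1"
}
```

For example, `fastq_field sample.fastq qualities > sample.qualities` would write one quality string per line, where `sample.fastq` is a placeholder file name.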

Notes:

(1) To remain compatible with ordinary personal computers, the current version open-sources only the parallel compression method for a single CPU node with multiple cores.

(2) This open-source version supports only fixed-length sequences.

(3) The BIOCONDA version will be updated soon.
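
Because of note (2), it may help to check an input file before compressing it. The following is a minimal sketch, assuming the qualities file holds one quality string per line; `is_fixed_length` is a hypothetical helper, not a PQSDC command.

```shell
#!/bin/sh
# is_fixed_length is a hypothetical helper, not part of PQSDC.
# Exit status 0: every line has the same length; 1: lengths differ.
is_fixed_length() {
    awk 'NR == 1 { n = length($0) }
         length($0) != n { bad = 1; exit }
         END { exit bad + 0 }' "$1"
}

# Possible use before compression (file name from the Examples section):
# is_fixed_length test.qualities && pqsdc_v2 -c test.qualities 8
```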

Examples

A validation dataset is provided at PQSDC/data/test.qualities.

1. Using 8 CPU cores for compression.

cd ${PQSDC_V2_PATH}
cd ..
cd data
pqsdc_v2 -c test.qualities 8

results:

compression mode.
fileName : test.qualities
threads  : 8
savepath : test.qualities.partition/result.pqsdc_v2
----------------------------------------------------------------------
1 reads partition, generate test.qualities.partition directory.
2 parallel run-length encoding prediction mapping.
3 cascade zpaq compressor.
4 pacing files into test.qualities.partition/result.pqsdc_v2.
5 removing redundant files.
over!
----------------------------------------------------------------------
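
Step 1 of the log above, partitioning, is what allows the later steps to run in parallel. The sketch below splits a file into contiguous blocks of lines, one output file per block. How PQSDC actually partitions reads is not described in this README, so the contiguous-block scheme, the `partition_lines` name, and the `part_N` file names are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical illustration of line-wise partitioning; not PQSDC code.
# Usage: partition_lines <file> <parts> <outdir>
partition_lines() {
    total=$(wc -l < "$1")
    # Lines per block, rounded up so that at most <parts> blocks are written.
    per=$(( (total + $2 - 1) / $2 ))
    [ "$per" -gt 0 ] || per=1
    mkdir -p "$3"
    awk -v per="$per" -v dir="$3" '{
        f = dir "/part_" int((NR - 1) / per)
        print > f
    }' "$1"
}
```

Each resulting block could then be handed to a separate worker, and concatenating the blocks in index order restores the original line order.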

2. Using 8 CPU cores for decompression.

pqsdc_v2 -d test.qualities.partition 8

results:

running pqsdc algorithm at Sat Jun 17 15:31:22 CST 2023
de-compression mode
fileName : test.qualities.partition
threads  : 8
savepath : test.qualities.partition.partition.pqsdc_v2
----------------------------------------------------------------------
1 unpacking test.qualities.partition/result.pqsdc_v2.
2 unsing zpaq decompression files.
3 parallel run-length encoding prediction mapping.
4 merge partitions to restore the original file
over
----------------------------------------------------------------------

3. Verify that the decompression was successful.

pqsdc_tools -verify test.fastq qualities test.qualities.pqsdc_de_v2

results:

lossless recover all qualities.
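
The same kind of check can be approximated with standard tools, which is handy when `pqsdc_tools` is not available. This sketch assumes a FASTQ file with four lines per record and a restored file with one quality string per line. `verify_qualities` is a hypothetical helper and does not reproduce the tool's own verification logic.

```shell
#!/bin/sh
# verify_qualities is a hypothetical helper, not part of PQSDC.
# Usage: verify_qualities <source-fastq> <restored-qualities>
verify_qualities() {
    # Every fourth line of a four-line FASTQ record is the quality string;
    # cmp -s compares that stream byte for byte with the restored file.
    if awk 'NR % 4 == 0' "$1" | cmp -s - "$2"; then
        echo "qualities match"
    else
        echo "qualities differ" >&2
        return 1
    fi
}
```

A byte-for-byte comparison is the strictest test of losslessness: any difference in a single score or line break makes `cmp` report a mismatch.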

Our Experimental Configuration

Our experiments were conducted on the SUGON-7000A supercomputer system at the Nanning Branch of the National Supercomputing Center, using a queue of CPU/GPU heterogeneous computing nodes. Each compute node used in the experiments was configured as follows:

2 × Intel Xeon Gold 6230 CPU (2.1 GHz, 40 cores in total),

2 × NVIDIA Tesla T4 GPU (16 GB CUDA memory, 2560 CUDA cores),

512 GB DDR4 memory, and

8 × 900 GB external storage.

Dataset Acquisition

We evaluated PQSDC experimentally on real, publicly available sequencing datasets from the NCBI database. Download the datasets with the following command:

nohup bash data_download.sh > data_download.log &

The datasets are downloaded and extracted with SRA-Tools: https://github.com/ncbi/sra-tools

Acknowledgements

Thanks to @HPC-GXU for the computing device support.
Thanks to @NCBI for all available datasets.

Additional Information

Source-Version: V1.2023.05.18.

Latest-Version: V2.1.2023.06.17. V2.1.2023.12.09.

Authors: NBJL-BioGrop.

Contact us: https://nbjl.nankai.edu.cn OR [email protected]
