A Lossless Parallel Quality Scores Data Compressor For Large Scale Genomic Sequencing Data.
About The PQSDC • Copy Our Project • Usage • Example • Our Experimental Configuration • Dataset Acquisition • Acknowledgements
PQSDC is an experimental open-source quality scores data compressor, which utilizes parallel sequence partitioning and a four-level run-length prediction model to increase compression ratio while minimizing memory and time consumption. Furthermore, the compression process can be accelerated through the use of multi-core CPU clusters, resulting in a significant reduction of time overhead.
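Quality strings often contain long runs of the same symbol, which is what makes run-length modelling attractive. The sketch below shows plain run-length encoding of a single quality line. It is only an illustration of the general idea, not PQSDC's four-level prediction model, and `rle_line` is a hypothetical helper name.

```shell
# Hypothetical helper: run-length encode one quality string.
# Each run is printed as <symbol><count>, with runs separated by spaces.
# Illustration only: the format is ambiguous when the symbol itself is a digit.
rle_line() {
    printf '%s\n' "$1" | awk '{
        out = ""
        n = length($0)
        i = 1
        while (i <= n) {
            c = substr($0, i, 1)
            run = 1
            # Extend the run while the next symbol equals the current one.
            while (i + run <= n && substr($0, i + run, 1) == c) run++
            out = out c run " "
            i += run
        }
        sub(/ $/, "", out)
        print out
    }'
}
```

For example, `rle_line 'IIIIFF##I'` prints `I4 F2 #2 I1`.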
We are about to release a new version of our PQSDC2, so stay tuned.
Firstly, clone our tools from GitHub:
git clone https://github.com/fahaihi/PQSDC.git
Secondly, change to the PQSDC directory:
cd PQSDC/pqsdc_v2
Thirdly, run the following command:
bash install.sh
# Warning: requires GNU Make > 3.82.
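The installed Make version can be checked before running install.sh. The following is a minimal sketch, assuming GNU `sort` with the `-V` option is available; `version_gt` is a hypothetical helper, not part of PQSDC.

```shell
# Hypothetical helper: succeed only if version $1 is strictly greater than $2.
# Relies on GNU sort -V (version sort).
version_gt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# The first line of `make --version` looks like "GNU Make 4.3".
if command -v make >/dev/null 2>&1; then
    make_version=$(make --version | head -n 1 | awk '{print $3}')
    if version_gt "$make_version" "3.82"; then
        echo "GNU Make $make_version is new enough."
    else
        echo "GNU Make $make_version is too old; a version above 3.82 is needed." >&2
    fi
else
    echo "make was not found in PATH." >&2
fi
```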
Finally, configure the environment variables with the following commands:
export PATH=$PATH:`pwd`/
export PQSDC_V2_PATH="`pwd`/"
source ~/.bashrc
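The two exports above apply only to the current shell session; `source ~/.bashrc` restores them in later sessions only if the same lines are also stored in that file. One way to store them is sketched below. `persist_pqsdc_env` is a hypothetical helper, and the rc file defaults to `~/.bashrc` as an assumption about the login shell.

```shell
# Hypothetical helper: append the PQSDC exports to an rc file once.
#   $1 = install directory (for example the output of `pwd` in PQSDC/pqsdc_v2)
#   $2 = rc file to update (assumed default: ~/.bashrc)
persist_pqsdc_env() {
    install_dir=${1%/}
    rc_file=${2:-"$HOME/.bashrc"}
    # Do nothing if the variable is already configured in the rc file.
    if grep -q 'PQSDC_V2_PATH=' "$rc_file" 2>/dev/null; then
        return 0
    fi
    {
        # $PATH stays literal here so it is expanded when the rc file is read.
        printf 'export PATH="$PATH:%s/"\n' "$install_dir"
        printf 'export PQSDC_V2_PATH="%s/"\n' "$install_dir"
    } >> "$rc_file"
}
```

Run from inside PQSDC/pqsdc_v2, this would be called as `persist_pqsdc_env "$(pwd)"`, followed by `source ~/.bashrc`.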
Basic Usage: pqsdc_v2 [command option]
-c [qualities file] [threads] *compression mode.
-d [pqsdc generate directory] [threads] *decompression mode.
-h *print this message.
Advanced Usage: pqsdc_tools [command option]
-fileinfo [input-fastq-file] *print basic statistic information.
-dirinfo [input-dir-name] *print basic statistic information.
-verify [source-fastq-file] <mode> [verify-file] *verify decompression.
<mode> = reads
<mode> = qualities
-filesplite [input-fastq-file] mode <mode> *split a FastQ file according to <mode>.
<mode> = ids
<mode> = reads
<mode> = describes
<mode> = qualities
<mode> = all
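To make clear what a qualities file contains, the sketch below pulls the quality lines out of a FASTQ file with `awk`. It assumes the common four-line FASTQ record layout (identifier, read, description, quality) and is a stand-in for illustration, not the implementation behind `pqsdc_tools`; `extract_qualities` is a hypothetical name.

```shell
# Hypothetical helper: print only the quality line of each FASTQ record.
# Assumes strict four-line records, so qualities are lines 4, 8, 12, ...
extract_qualities() {
    awk 'NR % 4 == 0' "$1"
}

# Example (file names are placeholders):
#   extract_qualities sample.fastq > sample.qualities
```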
Notes:
(1) To remain compatible with ordinary personal computers, the current open-source version only includes parallel compression on a single CPU node with multiple cores.
(2) This open-source version only supports fixed-length sequences.
(3) The BIOCONDA version will be updated soon.
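Because only fixed-length sequences are supported, it can help to check an input file before compressing it. A minimal sketch follows, assuming one quality string per line; `is_fixed_length` is a hypothetical helper, not a PQSDC command.

```shell
# Hypothetical helper: succeed if every line of the file has the same length
# as the first line (an empty file also passes).
is_fixed_length() {
    awk 'NR == 1 { n = length($0) }
         length($0) != n { bad = 1; exit }
         END { exit bad }' "$1"
}

# Example:
#   is_fixed_length test.qualities && pqsdc_v2 -c test.qualities 8
```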
We provide a validation dataset at PQSDC/data/test.qualities:
cd ${PQSDC_V2_PATH}
cd ..
cd data
pqsdc_v2 -c test.qualities 8
results:
compression mode.
fileName : test.qualities
threads : 8
savepath : test.qualities.partition/result.pqsdc_v2
----------------------------------------------------------------------
1 reads partition, generate test.qualities.partition directory.
2 parallel run-length encoding prediction mapping.
3 cascade zpaq compressor.
4 pacing files into test.qualities.partition/result.pqsdc_v2.
5 removing redundant files.
over!
----------------------------------------------------------------------
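Once compression has finished, the achieved compression ratio can be estimated from the two file sizes. The helper below is a sketch that defines the ratio as original bytes divided by compressed bytes; `compression_ratio` is a hypothetical name and is not part of the PQSDC tools.

```shell
# Hypothetical helper: print original size / compressed size to two decimals.
#   $1 = original file, $2 = compressed file
compression_ratio() {
    orig_bytes=$(wc -c < "$1")
    comp_bytes=$(wc -c < "$2")
    awk -v o="$orig_bytes" -v c="$comp_bytes" 'BEGIN {
        if (c == 0) { print "n/a"; exit }
        printf "%.2f\n", o / c
    }'
}

# Example, using the paths reported above:
#   compression_ratio test.qualities test.qualities.partition/result.pqsdc_v2
```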
pqsdc_v2 -d test.qualities.partition 8
results:
running pqsdc algorithm at Sat Jun 17 15:31:22 CST 2023
de-compression mode
fileName : test.qualities.partition
threads : 8
savepath : test.qualities.partition.partition.pqsdc_v2
----------------------------------------------------------------------
1 unpacking test.qualities.partition/result.pqsdc_v2.
2 unsing zpaq decompression files.
3 parallel run-length encoding prediction mapping.
4 merge partitions to restore the original file
over
----------------------------------------------------------------------
pqsdc_tools -verify test.fastq qualities test.qualities.pqsdc_de_v2
results:
lossless recover all qualities.
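As an independent cross-check, the restored qualities file can also be compared byte for byte with the original using the standard `cmp` utility. This is a sketch only: `verify_lossless` is a hypothetical helper, and both file names are passed explicitly rather than assumed.

```shell
# Hypothetical helper: compare two files byte for byte.
#   $1 = original qualities file, $2 = restored qualities file
verify_lossless() {
    if cmp -s "$1" "$2"; then
        echo "lossless: $1 and $2 are identical"
    else
        echo "MISMATCH: $1 and $2 differ" >&2
        return 1
    fi
}

# Example:
#   verify_lossless test.qualities <restored-qualities-file>
```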
Our experiment was conducted on the SUGON-7000A supercomputer system at the Nanning Branch of the National Supercomputing Center, using a queue of CPU/GPU heterogeneous computing nodes. The compute nodes used in the experiment were configured as follows:
2 × Intel Xeon Gold 6230 CPU (2.1 GHz, total 40 cores),
2 × NVIDIA Tesla-T4 GPU (16 GB CUDA memory, 2560 CUDA cores),
512 GB DDR4 memory, and
8 × 900 GB external storage.
We evaluated PQSDC on real, publicly available sequencing datasets from the NCBI database. Download the datasets with the following command:
nohup bash data_download.sh > data_download.log &
Datasets are downloaded and extracted using SRA-Tools: https://github.com/ncbi/sra-tools.
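For readers unfamiliar with SRA-Tools, fetching a single run typically takes two commands, `prefetch` and `fasterq-dump`. The sketch below wraps them in a hypothetical `fetch_run` helper with a hypothetical `DRY_RUN` switch; the accession is a placeholder argument, and the options used by data_download.sh itself may differ.

```shell
# Hypothetical helper: download one SRA run and convert it to FASTQ.
#   $1 = run accession (placeholder, supplied by the caller)
#   $2 = output directory (assumed default: current directory)
# Set DRY_RUN=1 to print the commands instead of running them.
fetch_run() {
    acc=$1
    out_dir=${2:-.}
    run=${DRY_RUN:+echo}
    $run prefetch "$acc" &&
    $run fasterq-dump "$acc" --split-files -O "$out_dir"
}

# Example:
#   DRY_RUN=1 fetch_run <accession> data
```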
Source-Version: V1.2023.05.18.
Latest-Version: V2.1.2023.06.17. V2.1.2023.12.09.
Authors: NBJL-BioGrop.
Contact us: https://nbjl.nankai.edu.cn OR [email protected]