stat_fastq

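stat_fastq is a tool that computes a range of metrics for FASTQ files. You can customize the output format and choose which metrics are reported. It accepts multiple FASTQ files as input and supports multithreaded execution.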

Installation:

git clone https://github.com/thecgs/stat_fastq.git
bash ./stat_fastq/INSTALL.sh

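The stat_fastq file is a precompiled binary. If you need to rebuild it, compile it with g++ stat_fastq.cpp -o stat_fastq. Note that the compiled binary must be in the same folder as the main program, stat_fastq.py.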

Usage:

$ python stat_fastq.py -h

Usage:
    example 1:
    /nfs1/public2/User/chenguisen/01.biosoftware/stat_fastq/stat_fastq.py <fastq.fofn> [option]
    example 2:
    /nfs1/public2/User/chenguisen/01.biosoftware/stat_fastq/stat_fastq.py <file.fq.gz> [option]
    example 3:
    /nfs1/public2/User/chenguisen/01.biosoftware/stat_fastq/stat_fastq.py <file1.fq.gz> [file2.fq.gz] [file3.fq.gz] ... [option]

Option:
    -o   --output         Save output file of tsv format [default: Close]
    -t   --Transposition  Transposition Stdout format [default: Vertical]
    -d   --Distribution   Open output "Reads length distribution" file [default: Close]
    -g   --ggplot         Drawing Reads length distribution plot [default: Close]
    -n   --Reads_Num      Close ouput Reads of Number [default: Open]
    -b   --Reads_Base     Close ouput Reads of Bese Number [default: Open]
    -10  --Q10            Close ouput Q10 [default: Open]
    -20  --Q20            Close ouput Q20 [default: Open]
    -30  --Q30            Close ouput Q30 [default: Open]
    -40  --Q40            Close ouput Q40 [default: Open]
    -50  --Q50            Close ouput Q50 [default: Open]
    -qmi --Min_qual       Close ouput Min quality value [default: Open]
    -qma --Max_qual       Close ouput Max quality value [default: Open]
    -AT  --AT_Bases       Close ouput AT Beses of Number [default: Open]
    -GC  --GC_Bases       Close ouput GC Beses of Number [default: Open]
    -A   --A_Bases        Close ouput A Beses of Number [default: Open]
    -T   --T_Bases        Close ouput T Beses of Number [default: Open]
    -G   --G_Bases        Close ouput G Beses of Number [default: Open]
    -C   --C_Bases        Close ouput C Beses of Number [default: Open]
    -N   --N_Bases        Close ouput N Beses of Number [default: Open]
    -lmi --Min_len        Close ouput Read length Min value [default: Open]
    -lma --Max_len        Close ouput Read length Max value [default: Open]
    -lme --Mean_len       Close ouput Read length Mean value [default: Open]
    -p   --Phread_Type    Close ouput Phread Type [default: Open]
    -h   --help           Show the help message and exit
    -v   --version        Show the version message

Note: The [option] can be anywhere
Datetime: 2022/10/12; Author: Guisen Chen; Email: [email protected]; Cite: https://github.com/thecgs/stat_fastq

Example 1:

$ cat fastq.fofn
/data/2022/10-12/WT_1.fq.gz
/data/2022/10-12/M3_1.fq.gz
/data/2022/10-12/M3_2.fq.gz
/data/2022/10-12/WT_2.fq.gz

$ python stat_fastq.py fastq.fofn
                                WT_1                  M3_1                  M3_2                  WT_2
Reads_Num                   82561175              77216259              96803945             106000544
Reads_Base(nt)           12384176250           11582438850           14520591750           15900081600
Q10(%)                       99.997%              99.9993%              99.9993%              99.9995%
Q20(%)                      83.4571%              76.3126%              75.9509%              77.9955%
Q30(%)                      72.6262%              64.9859%              64.3737%              65.6465%
Q40(%)                      36.4314%              52.8567%              52.1012%              52.9124%
Min_qual                           2                     2                     2                     2
Max_qual                          41                    41                    41                    41
AT_Bases(%)     5105473290(41.2258%)  5676494935(49.0095%)  6982315472(48.0856%)  7493666735(47.1297%)
GC_Bases(%)     7278331828(58.7712%)  5905861065(50.9898%)  7538172361(51.9137%)  8406330226(52.8697%)
A_Bases(%)      2945803511(23.7868%)  3237282594(27.9499%)  4052007243(27.9052%)    4561738335(28.69%)
T_Bases(%)      2159669779(17.4389%)  2439212341(21.0596%)  2930308229(20.1804%)  2931928400(18.4397%)
G_Bases(%)      4815207048(38.8819%)  3452302746(29.8064%)  4462377776(30.7314%)  4906221582(30.8566%)
C_Bases(%)      2463124780(38.8819%)  2453558319(29.8064%)  3075794585(30.7314%)  3500108644(30.8566%)
N_Bases(%)       371132(0.00299682%)   82850(0.000715307%)  103917(0.000715653%)   84639(0.000532318%)
Min_len                          150                   150                   150                   150
Max_len                          150                   150                   150                   150
Mean_len                         150                   150                   150                   150
Phread_Type                       33                    33                    33                    33
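
As a rough illustration of what the metrics in this table mean, the following pure-Python sketch computes a few of them for one gzipped FASTQ file. It is not the code used by stat_fastq. The function name fastq_metrics, the fixed Phred offset of 33, and the definition of QN as the percentage of bases with quality greater than or equal to N are all assumptions made for this example.

```python
import gzip
from collections import Counter
from itertools import islice


def fastq_metrics(path, phred_offset=33, thresholds=(10, 20, 30, 40)):
    """Return a dict of simple metrics for one gzipped FASTQ file.

    Hypothetical helper for illustration only; assumes 4-line records
    and the given Phred offset (33 here, matching Phread_Type above).
    """
    reads = 0
    total_bases = 0
    min_len = None
    max_len = 0
    bases = Counter()   # occurrences of each base letter
    quals = Counter()   # occurrences of each Phred score
    with gzip.open(path, "rt") as handle:
        while True:
            record = [line.rstrip("\n") for line in islice(handle, 4)]
            if len(record) < 4:
                break
            _name, seq, _plus, qual = record
            reads += 1
            total_bases += len(seq)
            min_len = len(seq) if min_len is None else min(min_len, len(seq))
            max_len = max(max_len, len(seq))
            bases.update(seq.upper())
            quals.update(ord(ch) - phred_offset for ch in qual)

    stats = {
        "Reads_Num": reads,
        "Reads_Base": total_bases,
        "Min_len": min_len or 0,
        "Max_len": max_len,
        "Mean_len": total_bases / reads if reads else 0.0,
        "Min_qual": min(quals) if quals else None,
        "Max_qual": max(quals) if quals else None,
        "AT_Bases": bases["A"] + bases["T"],
        "GC_Bases": bases["G"] + bases["C"],
        "N_Bases": bases["N"],
    }
    # Assumed definition: QN = percentage of bases with Phred score >= N.
    for t in thresholds:
        passed = sum(n for q, n in quals.items() if q >= t)
        stats[f"Q{t}"] = 100.0 * passed / total_bases if total_bases else 0.0
    return stats
```

Calling fastq_metrics("WT_1.fq.gz") would return a dictionary keyed by the same metric names as the rows above. The values may differ from stat_fastq's output if the tool defines a metric differently.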

Example 2:

$ python stat_fastq.py fastq.fofn -t -A -T -G -C -10 -40
      Reads_Num Reads_Base(nt)    Q20(%)    Q30(%) Min_qual Max_qual           AT_Bases(%)           GC_Bases(%)            N_Bases(%) Min_len Max_len Mean_len Phread_Type
WT_1   82561175    12384176250  83.4571%  72.6262%        2       41  5105473290(41.2258%)  7278331828(58.7712%)   371132(0.00299682%)     150     150      150          33
M3_1   77216259    11582438850  76.3126%  64.9859%        2       41  5676494935(49.0095%)  5905861065(50.9898%)   82850(0.000715307%)     150     150      150          33
M3_2   96803945    14520591750  75.9509%  64.3737%        2       41  6982315472(48.0856%)  7538172361(51.9137%)  103917(0.000715653%)     150     150      150          33
WT_2  106000544    15900081600  77.9955%  65.6465%        2       41  7493666735(47.1297%)  8406330226(52.8697%)   84639(0.000532318%)     150     150      150          33

Early versions of this project were written entirely in Python. Initially, the code for iterating over a FASTQ file looked like this:

import sys
import gzip
from itertools import islice

file = sys.argv[1]

def read_fastq(fastq: str):
    """Yield (name, seq, plus, qual) for each record of a gzipped FASTQ file."""
    # Text mode ('rt') decodes each line, so no per-line .decode() is needed.
    with gzip.open(fastq, 'rt') as f:
        while True:
            # Take the next four lines; islice yields fewer at end of file.
            record = [line.strip() for line in islice(f, 4)]
            if len(record) < 4:
                break
            name, seq, plus, qual = record
            yield (name, seq, plus, qual)

for l in read_fastq(file):
    print(l)

Later I switched to the third-party library pyfastx, which was much faster. The code looked like this:

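import sys
import pyfastx

fastq = sys.argv[1]
# With build_index=False, iteration yields (name, seq, qual) tuples.
fq = pyfastx.Fastq(fastq, build_index=False)
for name, seq, quals in fq:
    print(name, seq, quals)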

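Later still, I found a C++ version ( https://github.com/haiwufan/fastq_stat ) and modified its source code to add some metrics, which made the tool much faster. The resulting binary is wrapped by a Python script.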
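
The general pattern of wrapping a compiled program in Python and running it on several files in parallel can be sketched as follows. This is not the actual stat_fastq.py. The helper names run_one and run_many are hypothetical, and the sketch assumes that the external program takes a single file path as its last argument and writes its results to standard output, which may not match the real binary's interface.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(command, fastq):
    """Run an external command on one file and return its stdout as text.

    `command` is the argv prefix, e.g. ["./stat_fastq"] (assumed interface).
    """
    result = subprocess.run(
        [*command, fastq],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


def run_many(command, fastq_files, threads=4):
    """Run the command on every file, using up to `threads` worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so outputs line up with fastq_files.
        outputs = pool.map(lambda fq: run_one(command, fq), fastq_files)
        return dict(zip(fastq_files, outputs))
```

Threads are sufficient here because each worker spends its time waiting on a child process, so the Python global interpreter lock is not a bottleneck.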
