Giter Club home page Giter Club logo

Comments (2)

tseemann avatar tseemann commented on July 21, 2024

@deprekate (I like that name!)

Why would you want to add an integer to the original sequence id ?
The original sequence ID is already unique?

from seqtk.

deprekate avatar deprekate commented on July 21, 2024

Exactly. Ideally one of the properties of a fasta/q files is that all ids are unique. However often times people unknowingly create improper fasta/q files.

One of the main ways they do this is when they combine paired end data into a single file, without renaming the read ids of each end with the proper syntax:
>read_id_1.1 and >read_id_1.2 or >read_id_1/1 and >read_id_1/2

NCBI is partly to blame for this as their SRA tool fastq-dump, does not output the ids correctly; with the suffix.

$ ./fastq-dump --split-spot --clip --maxSpotId 1 ERR712664
Read 1 spots for ERR712664
Written 1 spots for ERR712664

$ cat ERR712664.fastq 
@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
TCTTCGCTGTTACCTCTGGAAAGAATCGCCTCGCGAAAACGCCGCCCATTTTCACGCGTTAATCCGCCCTGCTCAACAAACCACTGGTAACCATCATCGGCCAACATTTGCGTCCACAGATAAGCGTAATAACCTGCGGCATATCCGCCACCAAAAATATGGGCGAAATAACTGCTGCGATAGCGTGGCGGTATAGCAGGAAGATCCATATTTTCCGCCACCAGCGCCCGCAATTCGCAATCATCGACAT
+ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
A1>A11>A1>D3AABBE11111000BA0AAAE000////B/AA/AAE/BFGGF2@/>/>>//BFF/>EEE0FFF11B1/0?F/FF10B12BF1FF1FFB/<B//?/?FG2?/?@@/?101?111<->A.<11>FF.<--<-;0CGF.CA?AG.....90;00.:----:00;C09FB0--99-:-;--9--9-9//9//-----99B/9/:BBBB9-9;-;A--9-99@-9--9;:----:9/99//;-;
@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
TTACACGAAGCAATGCTGGTTGTCGTTGTTTTCGAATTGCGGGCGCTGGTGGCGGACCATATGGCTCTTCCTGCTATACCGCCACGCTCTCGCAGCCGTTATTTCGCCCCTATTTTTGTTGCCGCATATGCCGCAGGTTATTACGCTTCTCTGTTGTCGCCCATGTTGGCCGCTGCTGGTTACCTGTGGTTTGTTGTGCCGGCCGGATTTACGCGTGATCCTTGGCGGCGTTTTCGCGTGGCGTTTCTTT
+ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
11111@1111111A1A311010AA00A00AABA000AA1B///A/AA//B0/B//////@1B21/0>BF>EE11B1>2>2/<</<//</?<////<//??/??FD0<<<-<.<111=-00.0<---..00<:-----.:0::09..9;.9900900/9.....//000./9-----;//99/////9/-999------//--------9//9------///;////--------/-----------///9

You will notice that in this fastq file there are two reads with the same id (@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841)


You can dump each read end into two different files:

$ ./fastq-dump --split-files --clip --maxSpotId 1 ERR712664
Read 1 spots for ERR712664
Written 1 spots for ERR712664

$ cat ERR712664_1.fastq 
@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
TCTTCGCTGTTACCTCTGGAAAGAATCGCCTCGCGAAAACGCCGCCCATTTTCACGCGTTAATCCGCCCTGCTCAACAAACCACTGGTAACCATCATCGGCCAACATTTGCGTCCACAGATAAGCGTAATAACCTGCGGCATATCCGCCACCAAAAATATGGGCGAAATAACTGCTGCGATAGCGTGGCGGTATAGCAGGAAGATCCATATTTTCCGCCACCAGCGCCCGCAATTCGCAATCATCGACAT
+ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
A1>A11>A1>D3AABBE11111000BA0AAAE000////B/AA/AAE/BFGGF2@/>/>>//BFF/>EEE0FFF11B1/0?F/FF10B12BF1FF1FFB/<B//?/?FG2?/?@@/?101?111<->A.<11>FF.<--<-;0CGF.CA?AG.....90;00.:----:00;C09FB0--99-:-;--9--9-9//9//-----99B/9/:BBBB9-9;-;A--9-99@-9--9;:----:9/99//;-;

$ cat ERR712664_2.fastq 
@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
TTACACGAAGCAATGCTGGTTGTCGTTGTTTTCGAATTGCGGGCGCTGGTGGCGGACCATATGGCTCTTCCTGCTATACCGCCACGCTCTCGCAGCCGTTATTTCGCCCCTATTTTTGTTGCCGCATATGCCGCAGGTTATTACGCTTCTCTGTTGTCGCCCATGTTGGCCGCTGCTGGTTACCTGTGGTTTGTTGTGCCGGCCGGATTTACGCGTGATCCTTGGCGGCGTTTTCGCGTGGCGTTTCTTT
+ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
11111@1111111A1A311010AA00A00AABA000AA1B///A/AA//B0/B//////@1B21/0>BF>EE11B1>2>2/<</<//</?<////<//??/??FD0<<<-<.<111=-00.0<---..00<:-----.:0::09..9;.9900900/9.....//000./9-----;//99/////9/-999------//--------9//9------///;////--------/-----------///9

And now you need to rename each read and then combine the files, ie:

perl -pi -e 's/ /.1 /' file_1.fq
perl -pi -e 's/ /.2 /' file_2.fq
cat file_1 file_2 > file.fq

But people don't always follow these steps.


What I have to do now is after running seqtk seq -a in.fq.gz > out.fa, I then have to parse the out.fa file again and add an incrementing integer before the first space (actually I never enabled this, i just output an error message and make them fix their data).

It would be nice if seqtk could do this adding a rolling integer. Seqtk does allow for a complete rename with an incrementing integer, but then if I want to map a read at the end of my pipeline I need to keep track of which RENAME value points to which original read id.

(also thanks ya my handle is pretty nifty, a play on words/myname)

from seqtk.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.