Comments (2)
@deprekate (I like that name!)
Why would you want to add an integer to the original sequence id ?
The original sequence ID is already unique?
from seqtk.
Exactly. Ideally one of the properties of a fasta/q files is that all ids are unique. However often times people unknowingly create improper fasta/q files.
One of the main ways they do this is when they combine paired end data into a single file, without renaming the read ids of each end with the proper syntax:
>read_id_1.1 and >read_id_1.2
or >read_id_1/1 and >read_id_1/2
NCBI is partly to blame for this as their SRA tool fastq-dump, does not output the ids correctly; with the suffix.
$ ./fastq-dump --split-spot --clip --maxSpotId 1 ERR712664
Read 1 spots for ERR712664
Written 1 spots for ERR712664
$ cat ERR712664.fastq
@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
TCTTCGCTGTTACCTCTGGAAAGAATCGCCTCGCGAAAACGCCGCCCATTTTCACGCGTTAATCCGCCCTGCTCAACAAACCACTGGTAACCATCATCGGCCAACATTTGCGTCCACAGATAAGCGTAATAACCTGCGGCATATCCGCCACCAAAAATATGGGCGAAATAACTGCTGCGATAGCGTGGCGGTATAGCAGGAAGATCCATATTTTCCGCCACCAGCGCCCGCAATTCGCAATCATCGACAT
+ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
A1>A11>A1>D3AABBE11111000BA0AAAE000////B/AA/AAE/BFGGF2@/>/>>//BFF/>EEE0FFF11B1/0?F/FF10B12BF1FF1FFB/<B//?/?FG2?/?@@/?101?111<->A.<11>FF.<--<-;0CGF.CA?AG.....90;00.:----:00;C09FB0--99-:-;--9--9-9//9//-----99B/9/:BBBB9-9;-;A--9-99@-9--9;:----:9/99//;-;
@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
TTACACGAAGCAATGCTGGTTGTCGTTGTTTTCGAATTGCGGGCGCTGGTGGCGGACCATATGGCTCTTCCTGCTATACCGCCACGCTCTCGCAGCCGTTATTTCGCCCCTATTTTTGTTGCCGCATATGCCGCAGGTTATTACGCTTCTCTGTTGTCGCCCATGTTGGCCGCTGCTGGTTACCTGTGGTTTGTTGTGCCGGCCGGATTTACGCGTGATCCTTGGCGGCGTTTTCGCGTGGCGTTTCTTT
+ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
11111@1111111A1A311010AA00A00AABA000AA1B///A/AA//B0/B//////@1B21/0>BF>EE11B1>2>2/<</<//</?<////<//??/??FD0<<<-<.<111=-00.0<---..00<:-----.:0::09..9;.9900900/9.....//000./9-----;//99/////9/-999------//--------9//9------///;////--------/-----------///9
You will notice that in this fastq file there are two reads with the same id (@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841
)
You can dump each read end into two different files:
$ ./fastq-dump --split-files --clip --maxSpotId 1 ERR712664
Read 1 spots for ERR712664
Written 1 spots for ERR712664
$ cat ERR712664_1.fastq
@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
TCTTCGCTGTTACCTCTGGAAAGAATCGCCTCGCGAAAACGCCGCCCATTTTCACGCGTTAATCCGCCCTGCTCAACAAACCACTGGTAACCATCATCGGCCAACATTTGCGTCCACAGATAAGCGTAATAACCTGCGGCATATCCGCCACCAAAAATATGGGCGAAATAACTGCTGCGATAGCGTGGCGGTATAGCAGGAAGATCCATATTTTCCGCCACCAGCGCCCGCAATTCGCAATCATCGACAT
+ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
A1>A11>A1>D3AABBE11111000BA0AAAE000////B/AA/AAE/BFGGF2@/>/>>//BFF/>EEE0FFF11B1/0?F/FF10B12BF1FF1FFB/<B//?/?FG2?/?@@/?101?111<->A.<11>FF.<--<-;0CGF.CA?AG.....90;00.:----:00;C09FB0--99-:-;--9--9-9//9//-----99B/9/:BBBB9-9;-;A--9-99@-9--9;:----:9/99//;-;
$ cat ERR712664_2.fastq
@ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
TTACACGAAGCAATGCTGGTTGTCGTTGTTTTCGAATTGCGGGCGCTGGTGGCGGACCATATGGCTCTTCCTGCTATACCGCCACGCTCTCGCAGCCGTTATTTCGCCCCTATTTTTGTTGCCGCATATGCCGCAGGTTATTACGCTTCTCTGTTGTCGCCCATGTTGGCCGCTGCTGGTTACCTGTGGTTTGTTGTGCCGGCCGGATTTACGCGTGATCCTTGGCGGCGTTTTCGCGTGGCGTTTCTTT
+ERR712664.1 M02284:53:000000000-A6PLM:1:1101:12764:1841 length=250
11111@1111111A1A311010AA00A00AABA000AA1B///A/AA//B0/B//////@1B21/0>BF>EE11B1>2>2/<</<//</?<////<//??/??FD0<<<-<.<111=-00.0<---..00<:-----.:0::09..9;.9900900/9.....//000./9-----;//99/////9/-999------//--------9//9------///;////--------/-----------///9
And now you need to rename each read and then combine the files, ie:
perl -pi -e 's/ /.1 /' file_1.fq
perl -pi -e 's/ /.2 /' file_2.fq
cat file_1 file_2 > file.fq
But people don't always follow these steps.
What I have to do now is after running seqtk seq -a in.fq.gz > out.fa
, I then have to parse the out.fa file again and add an incrementing integer before the first space (actually I never enabled this, i just output an error message and make them fix their data).
It would be nice if seqtk could do this adding a rolling integer. Seqtk does allow for a complete rename with an incrementing integer, but then if I want to map a read at the end of my pipeline I need to keep track of which RENAME value points to which original read id.
(also thanks ya my handle is pretty nifty, a play on words/myname)
from seqtk.
Related Issues (20)
- seqtk sample: with out without replacement? HOT 1
- `seqtk seq` segfaults on 10G scaffolds HOT 4
- seqtk sample not working as expected HOT 2
- seqtk sample can't properly output fastq.gz HOT 1
- ERROR: the 2nd file has fewer records HOT 1
- The output file size of seqtk subseq is zero HOT 1
- Question: DNA string compressing HOT 1
- seqtk produces different number of reads for paired end files HOT 1
- Problem with seqtk sample HOT 1
- output file contains only one amino acid HOT 1
- seqtk telo -m works partially with 8mer wasp telomere HOT 5
- seqtk comp count CpG
- Seqtk to count sequences same SeqID
- `seqtk hpc input.fq` ignores the quality and converts to .fa
- Is the "sample" feature subsampling without replacement? HOT 1
- DNS Resolution Warning with Singularity Container
- buggy behavior with seqtk subseq command HOT 1
- converting fasta to fastq HOT 2
- Quality scores HOT 1
- CutN fails when sequence begins with Ns
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from seqtk.