aakechin / cutprimers
cutPrimers is a tool for trimming primer sequences from amplicon-based NGS reads
License: GNU General Public License v3.0
Dear users of cutPrimers! We will be glad to hear any comments (positive or negative) about using cutPrimers. If you have any problems, please let us know and we will try to answer as soon as possible; we should be able to reply within 24 hours. Thank you for using cutPrimers to remove primer sequences from your NGS reads!
Hi,
I have adapter-trimmed FASTQ files (R1 and R2 reads). My forward primer sequence is "TGTGCCAGCMGCCGCGGTAA" and my reverse primer sequence is "TGGACTACHVGGGTWTCTAAT". With this information, how do I use cutPrimers?
I assume the FASTA files below need to be prepared from the primer sequences for the 5' and 3' ends of the R1 and R2 reads. How do I do this?
--primersFileR1_5, -pr15 - fasta-file with sequences of primers on the 5'-end of R1 reads
--primersFileR2_5, -pr25 - fasta-file with sequences of primers on the 5'-end of R2 reads. Do not use this parameter if you have single-end reads
--primersFileR1_3, -pr13 - fasta-file with sequences of primers on the 3'-end of R1 reads. It is not required, but if it is provided, -pr23 is also required
--primersFileR2_3, -pr23 - fasta-file with sequences of primers on the 3'-end of R2 reads
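For a single primer pair like this one, the four FASTA files can be built from the two primer sequences. A minimal sketch, assuming the standard amplicon layout (forward primer at the 5' end of R1, reverse primer at the 5' end of R2, and the reverse complement of the opposite primer at each 3' end if reads read through the insert); the file names mirror the cutPrimers options, and the record name is a placeholder:

```python
# Sketch: build the four primer FASTA files for one 16S primer pair.
# IUPAC-aware complement table, so degenerate bases (M, H, V, W, ...) survive.
COMP = str.maketrans("ACGTMRWSYKVHDBN", "TGCAKYWSRMBDHVN")

def revcomp(seq):
    """Reverse-complement a primer, including IUPAC degenerate codes."""
    return seq.translate(COMP)[::-1]

fwd = "TGTGCCAGCMGCCGCGGTAA"   # forward primer (5' of R1)
rev = "TGGACTACHVGGGTWTCTAAT"  # reverse primer (5' of R2)

files = {
    "primers_R1_5.fa": fwd,           # -pr15
    "primers_R2_5.fa": rev,           # -pr25
    "primers_R1_3.fa": revcomp(rev),  # -pr13 (read-through into reverse primer)
    "primers_R2_3.fa": revcomp(fwd),  # -pr23 (read-through into forward primer)
}
for name, seq in files.items():
    with open(name, "w") as fh:
        fh.write(">16S_primer\n%s\n" % seq)
```

The 3'-end files only matter when the insert is shorter than the read length; for long inserts the -pr13/-pr23 options can be omitted.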
Any help is appreciated and i am curious to compare your tool with our sequencing provider in-house primerclipping results!
Best Regards,
Bala
Hi,
I have a primer bed file:
Chr Amplicon_Start Insert_Start Insert_Stop Amplicon_Stop
chr17 41275996 41276024 41276122 41276149
chr17 41267705 41267733 41267856 41267884
How could I get the primer fasta files?
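One way to get there is to derive the primer intervals from the table and extract the sequences from the reference with `samtools faidx`. A minimal sketch, assuming 1-based inclusive coordinates in which the forward primer runs from Amplicon_Start to the base before Insert_Start, and the reverse primer from the base after Insert_Stop to Amplicon_Stop (the reverse primer sequence is then the reverse complement of that region); check this against your BED file's conventions:

```python
# Sketch: turn the amplicon table into region strings for
# `samtools faidx reference.fa <region>`.
rows = [
    ("chr17", 41275996, 41276024, 41276122, 41276149),
    ("chr17", 41267705, 41267733, 41267856, 41267884),
]

regions = []
for chrom, amp_start, ins_start, ins_stop, amp_stop in rows:
    fwd_region = "%s:%d-%d" % (chrom, amp_start, ins_start - 1)
    rev_region = "%s:%d-%d" % (chrom, ins_stop + 1, amp_stop)
    regions.append((fwd_region, rev_region))
    print("forward primer region:", fwd_region)
    print("reverse primer region (reverse-complement this):", rev_region)
```

With the regions in hand, `samtools faidx` (with `-i` for the reverse primer to get the reverse complement) produces FASTA records you can paste into the cutPrimers primer files.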
When I run the shell from the readme file,
```
python3 cutPrimers.py -r1 example/1_S1_L001_R1_001.fastq.gz -r2 example/1_S1_L001_R2_001.fastq.gz -pr15 example/primers_R1_5.fa -pr25 example/primers_R2_5.fa -pr13 example/primers_R1_3.fa -pr23 example/primers_R2_3.fa -tr1 example/1_r1_trimmed.fastq.gz -tr2 example/1_r2_trimmed.fastq.gz -utr1 example/1_r1_untrimmed.fastq.gz -utr2 example/1_r2_untrimmed.fastq.gz -t 2
```
As a result I get four files with the following sizes: 4.2 Mb, 2.3 Mb, 3.6 Mb and 2.3 Mb for files with trimmed R1 reads, untrimmed R1 reads, trimmed R2 reads and untrimmed R2 reads, respectively.
But I find that some primer sequences are still present in the untrimmed reads files. Why?
```
$ less primers_R1_5.fa | head
>R1
AGAGTGGGTGTTGGACAGTGT
$ less 1_S1_L001_R1_001.untrim.fastq.gz | grep AGAGTGGGTGTTGGACAGTGT
GATTAGAGCCTAGTCCAGGAGAATGAATTGACACTAATCTCTGCTTGTGTTCTCTGTCTCCAGCAATTGGGCAGATGTGTGAGGCACCTGTGGTGACCCGAGAGTGGGTGTTGGACAGTGTAGCACTCTACCAGTGCCAGGAAATCACCGA
AGAGTGGGTGTTGGACAGTGTAGCACTCTACCAGTGCCAGGAGCTGGACACCTACCTGATACCCCAGATCCCCCACAGCCACTACTGACTGCAGCCAGCCACAGGTACAGAGCCACAGGACCCCAAGAATGAGCTTACAAAGTATCACCGA
AGAGTGGGTGTTGGACAGTGTAGCACTCTACCAGTGCCAGGAGCTGGACACCTACCTGATACCCCAGATCCCCCACAGCCACTACTGACTGCAGCCAGCCACAGGTACAGAGCCACAGGACCCCAAGAATGAGCTTACAAAGGATCACCGA
GATTAGAGCCTAGTCCGGGAGAATGAATTGACACTAATCTCTGCTTGTGTTCTCTGTCTCCAGCAATTGGGCAGATGTGTGAGGCACCTGTGGTGACCCGAGAGTGGGTGTTGGACAGTGTAGCACTCTACCAGTGCCAGGAAATCACCG
GATTAGAGCCTAGTCCAGGAGAATGAATTGACACTAATCTCTGCTTGTGTTCTCTGTCTCCAGCAATTGGGCAGATGTGTGAGGCACCTGTGGTGACCCGAGAGTGGGTGTTGGACAGTGTAGCACTCTACCAGTGCCAGGAAATCACCGA
AGAGTGGGTGTTGGACAGTGTAGCACTCTACCAGTGCCAGGAGCTGGACACCTACCTGATACCCCAGATCCCCCACAGCCACTACTGACTGCAGCCAGCCACAGGTACAGAGCCACAGGACCCCAAGAATGAATCACCGACTGCCCATAGG
AGAGTGGGTGTTGGACAGTGTAGCACTCTACCAGTGCCAGGAGCTGGACACCTACCTGATACCCCAGATCCCCCACAGCCACTACTGACTGCAGCCAGCCACAGGTACAGAGCCACAGGACCCCAAGAATGAGCTTACAAATATCACCGA
AGAGTGGGTGTTGGACAGTGTGTGGCTGTGTGGGTCAGTGTATGGCTGTGTGGGTTGGTGAGTGGTTGTGTGGGTTGCTGTGTGTGCGTGTGGGGTGCCTGTTTTGGGGAAAAATAGCTTTTCACATCTGCAATCACCGACTGCCCATAGG
GATTAGAGCCTAGTCCAGGAGAATGAATTGACACTAATCTCTGCTTGTGTTCTCCGTCTCCAGCAATTGGGCAGATGTGTGAGGCACCTGTGGTGACCCGAGAGTGGGTGTTGGACAGTGTAGCACTCTACCAGTGCCAGGACATCACCGA
```
When I use cutPrimers on amplicon FASTQ files, about eighty percent of the reads are trimmed, but twenty percent end up in the untrimmed output even though they contain the primer sequence. How can I solve this problem?
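One way to investigate is to check where the primer actually starts in the untrimmed reads. A minimal diagnostic sketch (two read sequences taken from the grep output above; the dict keys are placeholders): cutPrimers only searches for a primer near the read end, within the primer-location buffer, so a primer found deep inside a read is expected to stay untrimmed, and a read whose primer is at the 5' end most likely landed in the untrimmed file because its mate's primer was not found, since read pairs are kept together.

```python
# Sketch: report the primer's start position in reads from the untrimmed file.
primer = "AGAGTGGGTGTTGGACAGTGT"

reads = {
    # primer at the very start of the read
    "read_a": "AGAGTGGGTGTTGGACAGTGTAGCACTCTACCAGTGCCAGGAGCTGGACACCTACCTGATACCCCAGATCCCCCACAGCCACTACTGACTGCAGCCAGCCACAGGTACAGAGCCACAGGACCCCAAGAATGAGCTTACAAAGTATCACCGA",
    # primer ~100 bases into the read, far beyond any 5'-end search window
    "read_b": "GATTAGAGCCTAGTCCAGGAGAATGAATTGACACTAATCTCTGCTTGTGTTCTCTGTCTCCAGCAATTGGGCAGATGTGTGAGGCACCTGTGGTGACCCGAGAGTGGGTGTTGGACAGTGTAGCACTCTACCAGTGCCAGGAAATCACCGA",
}

offsets = {name: seq.find(primer) for name, seq in reads.items()}
for name, off in offsets.items():
    print(name, "primer starts at position", off)
```

If most of your untrimmed reads look like read_b, they are likely read-through or off-target products rather than reads cutPrimers failed on.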
Any help is appreciated.
Best Regards,
Amy
Hi,
Here is what I do. I have 2 paired-end read files (740,134 reads each) with staggered degenerate primers:
Forward primers | Reverse primers |
---|---|
CCTACGGGNGGCWGCAG | GACTACHVGGGTATCTAATCC |
TCCTACGGGNGGCWGCAG | TGACTACHVGGGTATCTAATCC |
ACCCTACGGGNGGCWGCAG | ACGACTACHVGGGTATCTAATCC |
CTACCTACGGGNGGCWGCAG | CTAGACTACHVGGGTATCTAATCC |
Staggered meaning that 1, 2 or 3 bases are added to the main primer (e.g. CCTACGGGNGGCWGCAG) to increase diversity in Illumina sequencing. I would need to cut these out from the fastq. So I prepared 2 fasta files with 4 primer sequences each and ran cutprimers.py:
```
python3 cutPrimers.py \
-r1 $FWD_read \
-r2 /$REV_read \
-pr15 forward_primers.fa \
-pr25 reverse_primers.fa \
-tr1 trim.pair1.fastq.gz \
-tr2 trim.pair2.fastq.gz \
-utr1 untrimmed1.fastq.gz \
-utr2 untrimmed2.fastq.gz \
--error-number 10 \
-stat trim.statistics.log \
--primer3-absent \
--primer-location-buffer 30 \
--threads 1
```
A note on the parameters:
--error-number 10
to account for the 2 ambiguous nucleotides (N and W) plus the few bases added at the beginning of the primer. I tried lower values, but a few good primers remained in the untrimmed output; 10 does the job fine
--primer-location-buffer 30
to limit the region searched (in the hope of saving time)
--threads 1
because the parameters I chose are really RAM-demanding when I run in parallel.
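For reference, the two primer FASTA files can be generated programmatically from the base primers and the stagger prefixes shown in the table. A minimal sketch (the record names and the prefix tuple are assumptions read off the table):

```python
# Sketch: write one FASTA record per stagger variant of a base primer.
def write_staggered(path, base, prefixes=("", "T", "AC", "CTA")):
    """Prepend each stagger prefix to the base primer and write FASTA."""
    with open(path, "w") as fh:
        for i, p in enumerate(prefixes, start=1):
            fh.write(">primer_%d\n%s%s\n" % (i, p, base))

write_staggered("forward_primers.fa", "CCTACGGGNGGCWGCAG")
write_staggered("reverse_primers.fa", "GACTACHVGGGTATCTAATCC")
```

This keeps the two files in the same primer order, which matters because cutPrimers pairs the nth record of -pr15 with the nth record of -pr25.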
Now I can't quite understand the trim.statistics.log file. It shows stats for 4 primers only (when I was expecting 8), and the read counts don't match my samples:
Primer | Total_number_of_reads | Number_without_any_errors | Number_with_sequencing_errors | Number_with_synthesis_errors |
---|---|---|---|---|
3F | 599 | 0 | 1198 | 0 |
3R | 599 | 0 | 0 | 0 |
4F | 719215 | 0 | 1438430 | 0 |
4R | 719215 | 0 | 0 | 0 |
It seems to do the job well, although it takes quite a lot of time and RAM (on a server it uses 15 GB for a little more than 2 hours for one pair of read files). Do you think my parameters are suited to what I want to do? Thanks!
Hello,
This program is fantastic in my daily use. However, I recently found that FASTQ files trimmed by cutPrimers.py do not run correctly through the samtools mpileup program. My analysis pipeline is as follows: trim reads with cutPrimers.py, align with bwa mem, convert SAM to BAM with samtools view, sort the BAM with samtools sort, and run samtools mpileup -a to output statistics for all positions. But the final output file is missing most positions that should be well covered. At the beginning I thought the problem was in samtools mpileup, but when I skip the cutPrimers step, all positions are reported correctly in the mpileup output. So I am writing to ask for your help with this serious problem.
best regards,
Yuanwu
I have one forward PCR primer (F1) and two different reverse primers (let's say R1 and R2). I use MiSeq paired-end sequencing. I'm guessing that primersFileR1_5 and primersFileR2_5 have to contain the same number of sequences, so in the primersFileR1_5 file I repeated F1 twice. The weird thing is that the order of primer sequences in primersFileR2_5 influences the outcome. If I list R1 first and then R2 in the FASTA file, it works. But if I list R2 first and R1 second, it doesn't trim, even though in this file there is no R2 match and there should be an R1 match.
I tried the 1.2 release and the current version pulled from GitHub, and they behave the same way. I think this is a bug; could you take a look at it?
I'm attaching a test dataset (with 4 read pairs) in a tar.gz file. There are four files in the directory: test_R[12].fq are the reads, and pri_R[12]_5.fa are the primer files. I used the command below. You will see that with the current pri_R2_5.fa, it doesn't remove any primers, but if you switch the order of R1 and R2 in this file, it works.
```
python3 cutPrimers.py -r1 test_R1.fq -r2 test_R2.fq -pr15 pri_R1_5.fa -pr25 pri_R2_5.fa -tr1 out.tr1 -tr2 out.tr2 -utr1 out.utr1 -utr2 out.utr2
```
Thank you,
Naoki
```python
import hashlib

def makeHashes(seq, k):
    # k is the length of parts
    subSeqs = []  # note: assigned but never used
    h = []
    lens = set()
    for i in range(len(seq) - k + 1):
        # hash each k-length window of seq
        h.append(hashlib.md5(seq[i:i+k].encode('utf-8')).hexdigest())
        lens.add(k)
    return (h, lens)
```
I am new to Python. Why add k to the set lens in each iteration, since it is a fixed value?
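For what it's worth, adding the same value to a set repeatedly is a no-op, so lens can only ever end up as {k} (or empty when the loop never runs). A self-contained sketch of the same k-mer hashing with that observation folded in (the function name is mine, not from the source):

```python
import hashlib

def make_hashes(seq, k):
    """MD5-hash every k-length window of seq; lens can only ever be {k}."""
    h = [hashlib.md5(seq[i:i+k].encode("utf-8")).hexdigest()
         for i in range(len(seq) - k + 1)]
    # equivalent to calling lens.add(k) once per iteration of the loop
    return h, ({k} if h else set())

h, lens = make_hashes("ACGTACGT", 4)
print(len(h), lens)  # 5 window hashes; lens == {4}
```

So the per-iteration lens.add(k) is redundant; a single add after the loop (or the conditional above) gives identical results.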