Giter Club home page Giter Club logo

pughlab / consensuscruncher Goto Github PK

View Code? Open in Web Editor NEW
21.0 6.0 10.0 52.85 MB

ConsensusCruncher is a tool that suppresses errors in next-generation sequencing data by using unique molecular identifiers (UMIs) to amalgamate reads derived from the same DNA template into a consensus sequence.

Home Page: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz474/5498633

License: Other

Shell 0.97% Python 4.81% Jupyter Notebook 94.23%
consensus-sequences fastq-files error-suppression molecular-barcodes

consensuscruncher's People

Contributors

arnavaz avatar jinfengzou-uhn avatar ninatwang avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

consensuscruncher's Issues

Overlapping barcodes are not clearly well-defined.

Is your feature request related to a problem? Please describe.
When you use a list of barcodes, you may end up with this error:

ValueError: There are overlapping barcodes in the list (difficult to determine which barcode is correct).

The error itself is not very clear. You need to find the corresponding python script and go inside the code to understand what it means exactly, in particular:

https://github.com/pughlab/ConsensusCruncher/blob/master/ConsensusCruncher/extract_barcodes.py

You will find the corresponding error here:

elif check_overlap(blist):raise ValueError("There are overlapping barcodes in the list (difficult to determine which barcode is correct).")

The function check_overlap(blist) gives a better understanding of the signification of the error:

def check_overlap(blist):
"""(list) -> bool
Return boolean indicating whether or not there's overlapping barcodes within the list.
check_overlap(['AACT', 'AGCT'])
False
check_overlap(['AACTCT', 'AACT'])
True
"""
overlap = False
for barcode in blist:
if sum([barcode in b for b in blist]) > 1:
overlap = True
return overlap

In other words, overlapping barcodes do not mean exact duplicates! It's a bit more subtile than this.
AACTCT and AACT will be considered as overlapping barcodes because AACT is a substring of AACTCT.

Describe the solution you'd like
Explicit better what overlapping barcodes mean and if there are any, remove them automatically within the code before running Consensus Cruncher.

Describe alternatives you've considered
To fix this error, you need to verify that all your barcodes are "unique" in the sense that they are not a substring of another barcode.
So I wrote my own script to remove all those "overlapping barcodes" before running Consensus Cruncher.

Out of memory error

ConsensusCruncher silently runs out of memory mid-bam file and makes SSCS and DCS bam files that only contain reads mapping to the alphanumerically-first chromosomes. Could a solution be to only read individual chromosomes from the bam for ConsensusCrunch'ing and then clear the memory before moving on to the next chromosome?

Prefer `os.subprocess.call()` to `os.system()`

Describe the bug

os.system() presents security vulnerabilities. The Python docs indicate that using the subprocess module is preferred.

To Reproduce

See 5 instances of os.system() in ConsensusCruncher.py

Additional considerations

Existing os.subprocess.call invocations in ConsensusCruncher.py create a string and then split it on spaces. This could be dangerous if any of the strings being passed in to the .format() call contain spaces. This could be mitigated by converting lines like this:

call("{} index {}/{}.sorted.bam".format(args.samtools, bam_dir, filename).split(' '))

To:

call([args.samtools, "index", "{}/{}.sorted.bam".format(bam_dir, filename)])

name 'overlap' is not defined in extract_barcodes.py

Describe the bug

name 'overlap' in line 79 is not defined in extract_barcodes.py and a DeprecationWarning for 'U' mode in line 156 and 157

To Reproduce

Steps to reproduce the behavior:

  1. Go to 'ConsensusCruncher/extract_barcodes.py'
  2. Use commands 'ConsensusCruncher/extract_barcodes.py --read1 --read2 --outfile --blist'
  3. Use inputs available at '...'
  4. See error

Expected behavior

Extract UMI barcode

Screenshots

If applicable, add screenshots to help explain your problem.

System (please complete the following information):

  • OS: Linux
  • Software versions: Feb 14, 2020

Additional context

Add any other context about the problem here.

Relevant inputs

Please make sure to provide relevant input files

Bam header RG conflict

Describe the bug

The header in the bam files that are created by bwa aligner have two instances of @rg written in the bam header and GATK tools have issues with that. Example:
@rg ID:1 SM:SE17-1161_N_cfDNA-010-R1.fastq.gz PL:Illumina
@pg ID:bwa PN:bwa VN:0.7.15-r1140 CL:/mnt/work1/software/bwa/0.7.15/bwa mem -M -t4 -R @rg ID:1 SM:SE17-1161_N_cfDNA-010-R1.fastq.gz

To Reproduce

Steps to reproduce the behavior:

  1. Go to '...'
  2. Use commands '....'
  3. Use inputs available at '...'
  4. See error

Expected behavior

A clear and concise description of what you expected to happen.

Screenshots

If applicable, add screenshots to help explain your problem.

System (please complete the following information):

  • OS: [e.g. iOS, Linux, etc]
  • Software versions

Additional context

Add any other context about the problem here.

Relevant inputs

Please make sure to provide relevant input files

The results output is confusing.

Is your feature request related to a problem? Please describe.
In the results output, you get a "bamfiles" and a "consensus" folder.
If you stop at the description of the input parameters from the GitHub documentation page, you would think that the "bamfiles" folder is the output of the program itself.

-o OUTPUT, --output OUTPUT
Output directory, where barcode extracted FASTQ and BAM files will be placed in subdirectories 'fastq_tag' and 'bamfiles' respectively (dir will be created if they do not exist). [MANDATORY]

However, be aware that the "bamfiles" folder contain the bam files from the alignments by bwa sorted and indexed with samtools and NOT the one from consensus cruncher. The bam files resulting from consensus cruncher are in the "consensus" folder (there are more of them, up to 5). This is better explained after in the documentation so it's important to read the full documentation before using the program even though I think this part should be edited (in general users tend to go fast and stop reading at the description of the parameters).

Describe the solution you'd like

  1. The output folders need to be definitively renamed to avoid any confusion for the users:
    "bamfiles" to "bamfiles from bwa-samtools"
    "consensus" to "bamfiles from consensus-cruncher"

  2. And the GitHub documentation page should be edited too regarding the description of the parameters as this part will lead almost anyone to confusion.

barcode overlap checking needs to limit overlap from start

extract_barcode.py

for barcode in blist:
if sum([barcode in b for b in blist]) > 1:
overlap = True

"startswith" is better than "in": if sum([b.startswith(barcode) b for b in blist]) > 1
For example, ACG and TACG are not overlapped.

support variable spacer sequences

Is your feature request related to a problem? Please describe.
I'm analysing libraries that have 5bp UMIs followed by a random 2bp spacer that doesn't have to end in T, whereas I think ConsensusCruncher only supports UMI's with known/fixed spacer sequences that must end in T.

Describe the solution you'd like
I'd like to be able to provide bpattern=NNNNNSS, so that NNNNN is read as the UMI, and SS is skipped as the random spacer. The blist would be a file with all of the 5bp UMI sequences in it (as currently supported).

Describe alternatives you've considered
fgbio supports a read structure of 5M2S+T (5bp UMI matches, 2 bp spacer, remaining bp template) to do this job, but fgbio doesn't rescue singleton reads.

IndexError: string index out of range

Hi

I was running ConsensusCeuncher to collapse UMIs. It seems to output some results:

├── sscs
│   ├── sscs.sorted.bam (.bai)                     
│   ├── singleton.sorted.bam (.bai)                
├── sscs_SC
|   ├── sscs.sc.sorted.bam (.bai)              
├── dcs
│   ├── dcs.sorted.bam (.bai)
├── dcs_SC
│   ├── dcs.sc.sorted.bam(.bai)
│   ├── all.unique.dcs.sorted.bam(.bai)
├── read_families.txt                       Family size and frequency
├── stats.txt                               Consensus sequence formation metrics
├── tag_fam_size.png                        Distribution of reads across family size

However, when I checked on the log, I found an error:

# === DCS ===
SSCS - Total reads: 26020276
SSCS - Unmapped reads: 0
SSCS - Secondary/Supplementary reads: 0
DCS reads: 89306
SSCS singletons: 25841664 

[bam_sort_core] merging from 6 files and 1 in-memory blocks...
Traceback (most recent call last):
  File "/ConsensusCruncher/singleton_correction.py", line 320, in <module>
    main()
  File "/ConsensusCruncher/singleton_correction.py", line 268, in main
    corrected_read = strand_correction(tag, duplex, query_name, singleton_dict)
  File "/ConsensusCruncher/singleton_correction.py", line 101, in strand_correction
    dcs = duplex_consensus(read, complement_read)
  File "/ConsensusCruncher/singleton_correction.py", line 71, in duplex_consensus
    if read1.query_sequence[i] == read2.query_sequence[i] and \
IndexError: string index out of range
[bam_sort_core] merging from 6 files and 1 in-memory blocks...
# === DCS - Singleton Correction ===
SSCS SC - Total reads: 26023245
SSCS SC - Unmapped reads: 0

It looks like DCS was not properly performed? For my experiments, I might just need to use sscs.sc.sorted.bam. Are these final bam files still safe to use? The following is my commands. Thanks!
python3 ConsensusCruncher.py fastq2bam --fastq1 sample_R1.fastq --fastq2 sample_R2.fastq -o out_dir -b bwa -g /picard/2.10.9/picard.jar -r hg38.fasta -s samtools -l umilist.txt

python3 ConsensusCruncher.py consensus -i sample.sorted.bam -o out_dir -s samtools -b cytoBand.txt -g hg38 --cleanup True --scorrect True

Clarify link between genome and bedfile

Is your feature request related to a problem? Please describe.
If the user chooses genome == 'hg38', the bedfile is automatically set to the included hg38_cytoBand.txt file which will silently overwrite any args.bedfile value the user has chosen.

Describe the solution you'd like
Only overwrite args.bedfile if this value is not already specified by the user.

Describe alternatives you've considered
Update the documentation to make very clear that if hg38 is selected, it will use the inbuilt bedfile.

Updating instillation requirements

Hey Nina,

I just started using your program, and was wondering if you could update the requirements.txt to explicitly state matplotlib as a dependency? This may help future users.

Thanks :)

Add picard command in the ConsensusCruncher.py

Is your feature request related to a problem? Please describe.

Removing the readgroup arguments in the bwa aligner command and add Picard AddReplaceReadgroups after the bam file is created.

Describe the solution you'd like
java -jar picard.jar AddOrReplaceReadGroups RGID=4
RGLB=lib1
RGPL=ILLUMINA
RGPU=unit1
RGSM=20

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Allow this library to be installed via pip

Describe the solution you'd like
We'd like to be able to install this library via pip, and have it run ConsensusCruncher.py from the command line.

it would be great if the setup.py file could make use of entry points to allow the helper modules to be run independently from the command line.

Thanks!

Import modules instead of shelling out to them

Is your feature request related to a problem? Please describe.
There are a few instances where commands are created by something like the following:

extractb_cmd = "{}/ConsensusCruncher/extract_barcodes.py --read1 {} --read2 {} --outfile {} --bpattern {} "
        "--blist {}".format(code_dir, args.fastq1, args.fastq2, outfile, args.bpattern, args.blist)
os.system(extractb_cmd)

Describe the solution you'd like
It would be clearer to import the module and use the functions within it, something like:

from extract_barcodes import main_function

main_function(read1=args.fastq1, read2=args.fast2, outfile=outfile, bpattern=args.bpattern, blist=args.blist)

Error on LargeMid_56_L005_R{1,2}.fastq

Hi all,
I'm testing ConsensusCruncher on the given test data: ConsensusCruncher/test/LargeMid_56_L005_R{1,2}.fastq.
But some errors occurred when it came to the consensus step, and below is the output of consensus:

/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/SSCS_maker.py --infile LargeMid/bamfiles/LargeMid_56_L005_R1.fastq.sorted.bam --outfile LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.sscs.bam --cutoff 0.7 --bedfile /data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
# === SSCS ===
Uncollapsed - Total reads: 19259
Uncollapsed - Unmapped reads: 15
Uncollapsed - Secondary/Supplementary reads: 23
SSCS reads: 0
Singletons: 19216
Bad spacers: 0

# QC: Total uncollapsed reads should be equivalent to mapped reads in bam file.
Total uncollapsed reads: 19259
Total mapped reads in bam file: 19564
QC: check dictionaries to see if there are any remaining reads
=== pair_dict remaining ===
HWI-D00331:196:C900FANXX:5:1101:13274:2385|CG.TG
read remaining:
HWI-D00331:196:C900FANXX:5:1101:13274:2385|CG.TG	161	7	114597515	60	123M	27	128565	123	CATATGCCATGCACATAAAATGTTATTTATATATTTATTGGTTAAATGAATTAACATTTAAATATTGGCATCGTAAGTGAATAAGTATTCAGTATCTTTGTAATCAATGGGTAACTCATGCTT	array('B', [15, 25, 34, 33, 37, 16, 37, 35, 37, 38, 38, 38, 38, 38, 38, 38, 34, 37, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 34, 37, 38, 38, 38, 38, 38, 38, 38, 37, 33, 37, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 38, 38, 36, 36, 38, 38, 37, 35, 38, 38, 38, 36, 38, 38, 31, 35, 38, 38, 38, 38, 38, 38, 38, 36, 38, 36, 35, 35, 36, 38, 38, 33, 37, 38, 38, 38, 38, 38, 35, 38, 34, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38])	[('NM', 0), ('MD', '123'), ('AS', 123), ('XS', 20), ('RG', '1')]
mate:
HWI-D00331:196:C900FANXX:5:1101:13274:2385|CG.TG	81	27	128565	0	123M	7	114597515	123	TTGGGAGTTGGCCTTACTGGGTATCTGTAAGAACAGGGAAAAGGACACGCACCTGGCCTGTGGTGGTTACTTCTTTCTGAATCGTGTCAGAGAACTTGGCTGCTCTGGAAGAGCCAGTTTTGT	array('B', [35, 38, 38, 38, 38, 38, 38, 28, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 35, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 33, 34, 34])	[('NM', 0), ('MD', '123'), ('AS', 123), ('XS', 123), ('RG', '1'), ('XA', 'chr16,-70923438,123M,0;')]
HWI-D00331:196:C900FANXX:5:1101:15070:3473|TC.GT
read remaining:
HWI-D00331:196:C900FANXX:5:1101:15070:3473|TC.GT	129	4	49985033	60	123M	27	359773	123	GAAGTCCAATTTATCTTCTTTTTTTAAAATCTGTGCCTCATCTGCAAATATTACCAATCGAAAGTCATGAAATTTTTCCCCTAAGATTTTATAGTTTTAGCGCTTACGTTTGGGTCTTTGATC	array('B', [33, 33, 38, 38, 38, 37, 38, 38, 38, 38, 38, 37, 34, 38, 36, 37, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 38, 37, 36, 38, 38, 36, 38, 38, 37, 36, 37, 38, 38, 38, 38, 38, 38, 38, 38, 34, 38, 38, 38, 38, 38, 38, 36, 31, 31, 29, 33, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 33, 35, 38, 38, 38, 38, 38, 38, 38, 38, 38, 15])	[('NM', 0), ('MD', '123'), ('AS', 123), ('XS', 22), ('RG', '1')]
mate:
HWI-D00331:196:C900FANXX:5:1101:15070:3473|TC.GT	65	27	359773	2	123M	4	49985033	123	GACACCACACTTCATGCTCTGGGTGCCTGGTAACCTGAGTTTACCACTTGGAGGAGGTCACTACCTAAAATGTCGCAGTAAATGGTCTGTTGATAGAGCTTGGCTTCTAGTGGGTTAAAGTAC	array('B', [34, 34, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 37, 38, 37, 37, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 35, 38, 38, 38, 38, 38, 36, 34])	[('NM', 0), ('MD', '123'), ('AS', 123), ('XS', 121), ('RG', '1'), ('XA', 'chr16,+71149535,123M,1;')]
HWI-D00331:196:C900FANXX:5:1101:18338:2852|GT.CC
read remaining:
HWI-D00331:196:C900FANXX:5:1101:18338:2852|GT.CC	97	6	140573779	0	123M	60	145190	123	TGGACGCCAACGACAACTCGCCCTTCGTGCTGTACCCGCTGCAGAACGGCTCCGCGCCCTGCACCGAGCTGGTGCCCCGGGCGGCCGAGCCGGGCTACCTGGTGACCAAGGTGGTGGCGGTGG	array('B', [33, 33, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 26, 35, 38, 38, 36, 38, 38, 31, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 28, 38, 38, 38, 31, 35, 38, 13, 35, 35, 38, 38, 38, 38, 38, 38])	[('NM', 0), ('MD', '123'), ('AS', 123), ('XS', 123), ('RG', '1')]
mate:
HWI-D00331:196:C900FANXX:5:1101:18338:2852|GT.CC	145	60	145190	3	21S80M22S	6	140573779	80	TGGCCAAGCACAGGCTAGTGTTGGGTGATCAATGCAGAAATATGTCACAATGCCCCCTTAGGCAGAGCCTAGACAAAAGCCCCATCACCTGGATGATCAGTACAGGGTTATGTCAAAAAGTTA	array('B', [34, 38, 38, 35, 36, 38, 38, 38, 36, 37, 38, 38, 38, 38, 37, 38, 38, 37, 37, 38, 38, 38, 38, 38, 38, 35, 29, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 31, 33, 37, 34, 38, 38, 38, 38, 38, 38, 35, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 35, 38, 38, 38, 38, 37, 29, 35, 37, 34, 35, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 34, 38, 38, 38, 38, 38, 37, 38, 38, 37, 38, 38, 38, 38, 38, 32, 33])	[('NM', 1), ('MD', '10G69'), ('AS', 75), ('XS', 72), ('RG', '1'), ('XA', 'chrUn_gl000216,-154629,24S77M22S,1;chrUn_gl000225,+92868,2S89M32S,4;chrUn_gl000225,+117160,22S77M24S,2;chrUn_gl000225,-21045,23S92M8S,6;')]
HWI-D00331:196:C900FANXX:5:1101:2357:3051|TG.CA
read remaining:
HWI-D00331:196:C900FANXX:5:1101:2357:3051|TG.CA	113	20	40392124	0	123M	64	130694	123	CTGAGAGCCGAGCAGCCCAGGGAGCAGGTGTCCGCACAGAGCTCGTAGTGACTGTTCTGAGGGCATTCCATGGCTGCAAGGAGGGGGTGCCGATCAGAGCCCTGGGGAGGGAGGGGCTGCAAG	array('B', [38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 35, 38, 33, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 38, 34, 38, 33, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 34])	[('NM', 0), ('MD', '123'), ('AS', 123), ('XS', 122), ('RG', '1'), ('XA', 'chr19,-40376440,123M,1;')]
mate:
HWI-D00331:196:C900FANXX:5:1101:2357:3051|TG.CA	177	64	130694	8	123M	20	40392124	123	ACACGTCACCCATAAGTGTGTGTTCCCGTGAGGAGAGATTTCTAAGAAATGGCACTGTACACTGAACGCAGTGGCTCACGTCTGTCATCCCGAGGTCAGGAGTTCGAGACCAGCCCGGCCAAC	array('B', [2, 34, 31, 38, 38, 38, 38, 38, 38, 38, 28, 38, 37, 37, 36, 33, 38, 38, 36, 38, 38, 38, 38, 27, 38, 38, 38, 38, 38, 38, 38, 38, 38, 36, 38, 37, 29, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 37, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 38, 38, 34, 37, 35, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 37, 31, 37, 38, 34, 37, 38, 38, 31, 34, 37, 37, 29, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 33, 31])	[('NM', 2), ('MD', '35T16T70'), ('AS', 113), ('XS', 108), ('RG', '1'), ('XA', 'chrUn_gl000220,-126244,123M,3;')]
HWI-D00331:196:C900FANXX:5:1101:1040:2069|TC.AC
read remaining:
HWI-D00331:196:C900FANXX:5:1101:1040:2069|TC.AC	129	20	41630129	17	36S18M1D28M41S	64	129912	46	CTCCCCCCCCCCCCCCCCCCCCTTTCCCCCCTTTTTTTCTTTTTTTTTTTTTCTTTCCCCCCCCCCTTTTTTTTTTTTTTTTCTTTATTACGTAAGAAATATTGGAGTTGGATGAAATTTTTG	array('B', [33, 33, 16, 31, 29, 30, 36, 36, 38, 38, 33, 38, 29, 14, 27, 27, 34, 36, 38, 38, 29, 14, 14, 15, 15, 15, 15, 15, 15, 14, 14, 14, 15, 15, 15, 15, 14, 14, 15, 15, 15, 24, 15, 15, 14, 14, 14, 14, 14, 14, 14, 14, 15, 24, 15, 15, 15, 15, 15, 25, 25, 34, 29, 25, 34, 14, 14, 15, 15, 26, 31, 38, 38, 38, 33, 38, 31, 32, 26, 32, 38, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])	[('NM', 2), ('MD', '15C2^C28'), ('AS', 34), ('XS', 28), ('RG', '1'), ('XA', 'chr14,+55382840,28S28M67S,0;chr2,+61459994,54S28M41S,0;chr11,+76344684,58S28M37S,0;chr3,-59180857,41S28M54S,0;')]
mate:
HWI-D00331:196:C900FANXX:5:1101:1040:2069|TC.AC	65	64	129912	60	123M	20	41630129	123	TGAACACCCCCGTCACAAGTTTACCTATGTCACAGTCTTGCTCATGTATGCTTGAACGACNAATAAAAGTTCGGGGGGGNGAAGAGAGGAGAGAGAGAGAGAGACGGGGAGAGAGGGGGGAGG	array('B', [33, 33, 38, 38, 38, 38, 38, 38, 38, 37, 36, 38, 38, 38, 34, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 34, 38, 38, 38, 36, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 2, 28, 15, 27, 37, 38, 38, 38, 38, 38, 38, 38, 37, 38, 38, 38, 38, 38, 38, 2, 25, 14, 25, 34, 31, 36, 33, 35, 34, 24, 34, 36, 38, 33, 38, 38, 38, 36, 38, 27, 35, 38, 38, 38, 27, 38, 38, 31, 38, 23, 34, 34, 38, 13, 25, 36, 38, 38, 26, 35, 13, 23, 34])	[('NM', 3), ('MD', '60A18A42A0'), ('AS', 118), ('XS', 77), ('RG', '1')]
=== read_dict remaining ===
=== csn_pair_dict remaining ===
Traceback (most recent call last):
  File "/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/SSCS_maker.py", line 391, in <module>
    main()
  File "/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/SSCS_maker.py", line 372, in main
    plt.bar(list(tags_per_fam), read_fraction)
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/pyplot.py", line 2639, in bar
    ax = gca()
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/pyplot.py", line 935, in gca
    return gcf().gca(**kwargs)
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/pyplot.py", line 585, in gcf
    return figure()
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/pyplot.py", line 534, in figure
    **kwargs)
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/backends/backend_qt4agg.py", line 46, in new_figure_manager
    return new_figure_manager_given_figure(num, thisFig)
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/backends/backend_qt4agg.py", line 53, in new_figure_manager_given_figure
    canvas = FigureCanvasQTAgg(figure)
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/backends/backend_qt4agg.py", line 76, in __init__
    FigureCanvasQT.__init__(self, figure)
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/backends/backend_qt4.py", line 63, in __init__
    _create_qApp()
  File "/data4/qiumin/soft/ConsensusCruncher/dep/lib/python3.5/site-packages/matplotlib/backends/backend_qt5.py", line 136, in _create_qApp
    raise RuntimeError('Invalid DISPLAY variable')
RuntimeError: Invalid DISPLAY variable
/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/DCS_maker.py --infile LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.sscs.sorted.bam --outfile LargeMid/LargeMid_56_L005_R1.fastq.sorted/dcs/LargeMid_56_L005_R1.fastq.sorted.dcs.bam --bedfile /data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
# === DCS ===
SSCS - Total reads: 0
SSCS - Unmapped reads: 0
SSCS - Secondary/Supplementary reads: 0
DCS reads: 0
SSCS singletons: 0 

/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/singleton_correction.py --singleton LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.singleton.sorted.bam --bedfile /data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
# === Singleton Correction ===
Total singletons: 19216
Singleton Correction by SSCS: 0
% Singleton Correction by SSCS: 0.0
Singleton Correction by Singletons: 4
% Singleton Correction by Singletons : 0.020815986677768527
Uncorrected Singletons: 19212 

0.009413317839304606
/data4/qiumin/soft/ConsensusCruncher/dep/bin/samtools merge LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.sscs.sc.bam LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.sscs.sorted.bam LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.sscs.correction.sorted.bam LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.singleton.correction.sorted.bam
/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/DCS_maker.py --infile LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.sscs.sc.sorted.bam --outfile LargeMid/LargeMid_56_L005_R1.fastq.sorted/dcs_sc/LargeMid_56_L005_R1.fastq.sorted.dcs.sc.bam --bedfile /data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
# === DCS - Singleton Correction ===
SSCS SC - Total reads: 4
SSCS SC - Unmapped reads: 0
SSCS SC - Secondary/Supplementary reads: 0
DCS SC reads: 2
SSCS SC singletons: 0 

LargeMid/LargeMid_56_L005_R1.fastq.sorted/dcs_sc/LargeMid_56_L005_R1.fastq.sorted.all.unique.dcs.bam
Traceback (most recent call last):
  File "/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher.py", line 473, in <module>
    args.func(args)
  File "/data4/qiumin/soft/ConsensusCruncher/ConsensusCruncher.py", line 293, in consensus
    '{}/{}_tag_fam_size.png'.format(sample_dir, identifier))
FileNotFoundError: [Errno 2] No such file or directory: 'LargeMid/LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted_tag_fam_size.png' -> 'LargeMid/LargeMid_56_L005_R1.fastq.sorted/LargeMid_56_L005_R1.fastq.sorted_tag_fam_size.png'`

There seem to be 2 problems:

  1. the program trying to use qt to display something and it failed because I don't have one on my node.
  2. Total uncollapsed reads should be equivalent to mapped reads in bam file, but it does not match

So, are these errors acceptable? And I found that in these 2 files, only 2% of the total reads were corrected by singletons method, but according to your paper, the corrected reads shold be more than 30%. Am I making something wrong?

Thanks,
Silen

Error in consensus mode

Hi,

We ran ConsensusCruncher.py fastq2bam on the LargeMid_56 fastq files provided in the test directory. The barcode-extracted fastq files and subsequently the BWA aligned BAMfile were generated without any errors.

ConsensusCruncher.py consensus was run with the following command:
python3 ~/testing/duplex/ConsensusCruncher/ConsensusCruncher.py consensus -i ~/testing/duplex/results/largemid/output/bamfiles/LargeMid_56_L005_R1.fastq.sorted.bam -o ~/testing/duplex/results/largemid/output/consensus_output/ -s ~/programs/samtools-1.3.1/samtools --cleanup True --cutoff 0.7 -b ~/testing/duplex/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt

This gives the following error:

# === SSCS ===
Uncollapsed - Total reads: 19579
Uncollapsed - Unmapped reads: 0
Uncollapsed - Secondary/Supplementary reads: 0
SSCS reads: 0
Singletons: 0
Bad spacers: 19579

# QC: Total uncollapsed reads should be equivalent to mapped reads in bam file.
Total uncollapsed reads: 19579
Total mapped reads in bam file: 19562
QC: check dictionaries to see if there are any remaining reads
=== pair_dict remaining ===
=== read_dict remaining ===
=== csn_pair_dict remaining ===
Traceback (most recent call last):
  File "/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/SSCS_maker.py", line 391, in <module>
    main()
  File "/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/SSCS_maker.py", line 374, in main
    plt.xlim([0, math.ceil(lst_tags_per_fam[-1][0]/10) * 10])
IndexError: list index out of range
# === DCS ===
SSCS - Total reads: 0
SSCS - Unmapped reads: 0
SSCS - Secondary/Supplementary reads: 0
DCS reads: 0
SSCS singletons: 0 

Traceback (most recent call last):
  File "/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/singleton_correction.py", line 320, in <module>
    main()
  File "/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/singleton_correction.py", line 291, in main
    sscs_correction_frac = (sscs_dup_correction/singleton_counter) * 100
ZeroDivisionError: division by zero
# === DCS - Singleton Correction ===
SSCS SC - Total reads: 0
SSCS SC - Unmapped reads: 0
SSCS SC - Secondary/Supplementary reads: 0
DCS SC reads: 0
SSCS SC singletons: 0 

/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/SSCS_maker.py --infile /scratch/hematopath/testing/duplex/results/largemid/output/bamfiles/LargeMid_56_L005_R1.fastq.sorted.bam --outfile /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.sscs.bam --cutoff 0.7 --bedfile /scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt --bdelim None
/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/DCS_maker.py --infile /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.sscs.sorted.bam --outfile /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/dcs/LargeMid_56_L005_R1.fastq.sorted.dcs.bam --bedfile /scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/singleton_correction.py --singleton /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.singleton.sorted.bam --bedfile /scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
/scratch/hematopath/programs/samtools-1.7/samtools merge /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.sscs.sc.bam /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted.sscs.sorted.bam /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.sscs.correction.sorted.bam /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.singleton.correction.sorted.bam
/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/DCS_maker.py --infile /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs_sc/LargeMid_56_L005_R1.fastq.sorted.sscs.sc.sorted.bam --outfile /scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/dcs_sc/LargeMid_56_L005_R1.fastq.sorted.dcs.sc.bam --bedfile /scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher/hg19_cytoBand.txt
/scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/dcs_sc/LargeMid_56_L005_R1.fastq.sorted.all.unique.dcs.bam
Traceback (most recent call last):
  File "/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher.py", line 484, in <module>
    args.func(args)
  File "/scratch/hematopath/testing/duplex/ConsensusCruncher/ConsensusCruncher.py", line 301, in consensus
    '{}/{}_tag_fam_size.png'.format(sample_dir, identifier))
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/sscs/LargeMid_56_L005_R1.fastq.sorted_tag_fam_size.png' -> '/scratch/hematopath/testing/duplex/results/largemid/output/consensus_output//LargeMid_56_L005_R1.fastq.sorted/LargeMid_56_L005_R1.fastq.sorted_tag_fam_size.png'

System
OS: RHEL(7.3)
Other software versions:
Python 3.7.4
BWA 0.7.15
Samtools 1.3.1

We are unable to figure out what could be causing these errors and would be extremely grateful if you can help us with this.
Thanks

The resulting bam file has improper sample name

When using gatk GetSampleName on the resulting bam files, the sample names got are actually fastq file name instead of sample name. These will cause problem for downstram analysis when the incorrect sample name passed down to for example vcf header.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.