timoast / sinto
Tools for single-cell data processing
Home Page: https://timoast.github.io/sinto/
License: MIT License
Hi Tim! Great idea, this tool... really useful! I want to subset by barcode a BAM file from 10x 3' scRNA-seq that has already been subset with samtools to a specific gene locus. sinto filterbarcodes seems to run smoothly with no errors, but my new BAM file is empty:
` sinto filterbarcodes -b BAM_ACE2/1B1_ACE2.bam
-c BC_groups/1B1_ACE2.txt -o BAM_ACE2_BC/1B1_ACE2_BC.bam
--barcodetag "CB"
Function run_filterbarcodes called with the following arguments:
bam BAM_ACE2/1B1_ACE2.bam
cells BC_groups/1B1_ACE2.txt
output BAM_ACE2_BC/1B1_ACE2_BC.bam
trim_suffix False
sam False
nproc 1
barcode_regex None
barcodetag CB
func <function run_filterbarcodes at 0x7fb2498cb8c8>
Function completed in 0.0 m 0.12 s
`
My barcodes file is a tab-delimited txt file with no quotes generated in R like this:
AGCATACTCAATCACG-1 Goblet
AGTCTTTTCATCGCTC-1 Basal
AGTGAGGTCCACGAAT-1 Basal
ATTCTACAGATTACCC-1 Secretory
CAAGGCCAGATCCCAT-1 Ciliated
CATCAGAAGGCTAGAC-1 Goblet
CATTATCCACCTCGGA-1 Secretory
CATTCGCAGCTAGCCC-1 Basal
CCACGGACACAGATTC-1 Secretory
CCTACACAGACGCACA-1 Basal
AAGGTTCCATACTACG-1 Basal
AAGTCTGCATACTCTT-1 Goblet
CGTCCATTCAAGGTAA-1 Ciliated
GATCAGTGTTCACGGC-1 Basal
GATGAAACATTCGACA-1 Ionocytes
GGACATTAGGTCATCT-1 Basal
GGGTTGCCACGACGAA-1 Secretory
ACGATACCACAGAGGT-1 Basal
TCATTACTCGCCGTGA-1 Goblet
TGAGCATTCTGCTGTC-1 Basal
ACGGCCACAGCGATCC-1 Goblet
TTGACTTCAGACTCGC-1 Basal
TTTGCGCCATGAACCT-1 Ciliated
ACTTTCACACTGTTAG-1 Goblet
I checked whether the barcodes are actually found in the BAM file, and that seems to be the case, at least for the couple I tested:
samtools view BAM_ACE2/1B1_ACE2.bam | grep -i "AGCATACTCAATCACG-1" A00198:47:H5L7HDMXX:2:1442:24578:30373 256 X 15601544 0 37M2087N53M * GGCGCGATCTCGGCTCACTGCAAGCTCTGCCTCCCGGGTTCACGCCATTCTCCTGCCTCGGCCTCCCGAGTAGCTGGGACTACAGGCGCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:7 HI:i:7 AS:i:85 nM:i:1 RE:A:I BC:Z:CCTGTGCG QT:Z:FFFFFFFF CR:Z:AGCATACTCAATCACG CY:Z:FFFFFFFFFFFFFFFF CB:Z:AGCATACTCAATCACG-1 UR:Z:CCACTTAGTT UY:Z:FFFFFFFFFF UB:Z:CCACTTAGTT RG:Z:1B1:MissingLibrary:1:H5L7HDMXX:2
Also, according to the documentation I should get a BAM file for each cell group, but my output is only one empty BAM file.
Much appreciated
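A quick way to repeat the manual grep check for every barcode at once is sketched below. It works on plain SAM text (for example the output of samtools view) and on the lines of the barcode file. The helper names `cb_tags` and `read_barcodes` are made up for this illustration and are not part of sinto, and the SAM line is shortened from the one above.

```python
import re

def cb_tags(sam_lines):
    """Collect CB tag values from SAM text lines."""
    found = set()
    for line in sam_lines:
        match = re.search(r"(?:^|\s)CB:Z:(\S+)", line)
        if match:
            found.add(match.group(1))
    return found

def read_barcodes(lines):
    """Map barcode -> group from 'barcode<whitespace>group' lines."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping

# Shortened example records (hypothetical read name, real barcodes from the post)
sam = ["read1 256 X 15601544 0 37M2087N53M * CR:Z:AGCATACTCAATCACG CB:Z:AGCATACTCAATCACG-1"]
cells = ["AGCATACTCAATCACG-1\tGoblet", "AGTCTTTTCATCGCTC-1\tBasal"]

present = cb_tags(sam)
barcodes = read_barcodes(cells)
matched = sorted(b for b in barcodes if b in present)
missing = sorted(b for b in barcodes if b not in present)
print("matched:", matched)
print("missing:", missing)
```

If every barcode shows up as matched and the output is still empty, the barcode file and the CB tags are probably not the cause.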
Hi @timoast ,
I have spotted a few cases where negative positions are listed in the fragments file, especially on chrM, I think due to the higher chance of finding reads overlapping the start/end of this chromosome. I tried to make a minimal example here:
Using the reads in this BAM file:
VH00445:3:AAAJTTYM5:1:1305:65210:2912 1123 chrM 1 33 32S21M = 1 45 ATCATACTCTATTACGCAATAAACATTAACAAGTTAATGTAGCTTAATAACAA -CC;CCCCC-CCCCCCC-CC-C;;CC;;;C;CC;CCCCCCCCCC;CCC-CCCC NM:i:0 MD:Z:21 AS:i:21 XS:i:22 XA:Z:chr16,-28715534,8S22M23S,0; MQ:i:60 MC:Z:8S45M ms:i:1642 CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:2409:29309:43198 99 chrM 1 33 32S21M = 1 45 ATCATACTCTATTACGCAATAAACATTAACAAGTTAATGTAGCTTAATAACAA CCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCC;CCCC NM:i:0 MD:Z:21 AS:i:21 XS:i:22 XA:Z:chr16,-28715534,8S22M23S,0; MQ:i:60 MC:Z:8S45M ms:i:1794 CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:1305:65210:2912 1171 chrM 1 60 8S45M = 1 -45 ATTAACAAGTTAATGTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTTAGA ;CCCCCCCCC-CCCCCCCCC-CCCC;CCC;C-CCCCCCCCCCCC-CCCCCCCC NM:i:0 MD:Z:45 AS:i:45 XS:i:21 MQ:i:33 MC:Z:32S21M ms:i:1560 CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:2409:29309:43198 147 chrM 1 60 8S45M = 1 -45 ATTAACAAGTTAATGTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTTAGA CCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NM:i:0 MD:Z:45 AS:i:45 XS:i:21 MQ:i:33 MC:Z:32S21M ms:i:1786 CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:1410:62900:52436 1187 chrM 11796 60 52M = 11833 90 ATCCTAATTTCAATATCAAACCTAATTAAACACATCAACTTCCCACTGTACA CCCCCCCCCCCCCCCCCC;-CCCCCCCC;CCC-CCCC-CCCCCCCCCCCCCC NM:i:0 MD:Z:52 AS:i:52 XS:i:19 MQ:i:60 MC:Z:53M ms:i:1684 CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
VH00445:3:AAAJTTYM5:1:1410:62900:52436 1107 chrM 11833 60 53M = 11796 -90 ACTTCCCACTGTACACCACCACATCAATCAAATTCTCCTTCATTATTAGCCTC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CCC;CCC-CCCCCC;CCC-CCC NM:i:0 MD:Z:53 AS:i:53 XS:i:20 MQ:i:60 MC:Z:52M ms:i:1650 CR:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT CB:Z:ACTAGGCTTCGTATTGAGCCGAACAGTAGT
I run sinto fragments and get:
chrM -28 40 ACTAGGCTTCGTATTGAGCCGAACAGTAGT 2
This read aligns to chrM position 1 and is soft-clipped by 32 bases (cigar: 32S21M). Looking at the code, I see there is correction for soft-clipping:
Lines 330 to 338 in b57d735
bamtools bamtobed seems to confirm this (though without the Tn5 offset).
Looking at another soft-clipped read, this time in the middle of chr1:
bam:
VH00445:3:AAAJTTYM5:1:1609:77708:31726 99 chr1 3012667 60 8S44M = 3012708 94 CCGTATTTCTGATCAGTTCTGAGACAAGTTTTCACTTTATCTATGAAGCCCA CC-;CCC-;CC-CCCCCCCCC-CCCC;CC-CC;;C-CCCCCCCCCCC;-CCC NM:i:0 MD:Z:44 AS:i:44 XS:i:19 MQ:i:60 MC:Z:53M ms:i:1608 CR:Z:ATTGAACCACATTCGGTCAGGTCACTCAAT CB:Z:ATTGAACCACATTCGGTCAGGTCACTCAAT
VH00445:3:AAAJTTYM5:1:1609:77708:31726 147 chr1 3012708 60 53M = 3012667 -94 CCACTAGGGTGCAGTCCTGTGCTGAACAAGTAACAATGGCCTGAGTGTGACAA CC-CCCCCCCCCCCCCC;CCCC-CC;CCCCCCCC--CCCC;CCCC-CCCCCCC NM:i:0 MD:Z:53 AS:i:53 XS:i:20 MQ:i:60 MC:Z:8S44M ms:i:1482 CR:Z:ATTGAACCACATTCGGTCAGGTCACTCAAT CB:Z:ATTGAACCACATTCGGTCAGGTCACTCAAT
fragments:
chr1 3012662 3012755 ATTGAACCACATTCGGTCAGGTCACTCAAT 1
where it seems like the start position should be 3012666 + 4 = 3012670. Am I missing something about the soft-clipping correction?
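The coordinates reported above are consistent with simple arithmetic: convert the 1-based SAM position to 0-based, extend the forward read by its leading soft clip, then apply the +4/-5 shift. The sketch below is a reconstruction from the numbers in this report, not sinto's actual code. The function name `fragment_coords` and the `extend_softclips` switch are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference

def parse_cigar(cigar):
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def fragment_coords(fwd_pos, fwd_cigar, rev_pos, rev_cigar,
                    shift_plus=4, shift_minus=-5, extend_softclips=True):
    """Return (start, end) in 0-based half-open coordinates.

    fwd_pos and rev_pos are 1-based SAM POS values. Hard clips are
    ignored in this sketch.
    """
    fwd_ops = parse_cigar(fwd_cigar)
    rev_ops = parse_cigar(rev_cigar)
    start = fwd_pos - 1
    end = rev_pos - 1 + sum(n for n, op in rev_ops if op in REF_CONSUMING)
    if extend_softclips:
        if fwd_ops[0][1] == "S":   # leading soft clip on the forward read
            start -= fwd_ops[0][0]
        if rev_ops[-1][1] == "S":  # trailing soft clip on the reverse read
            end += rev_ops[-1][0]
    return start + shift_plus, end + shift_minus

chrM = fragment_coords(1, "32S21M", 1, "8S45M")
chr1 = fragment_coords(3012667, "8S44M", 3012708, "53M")
chr1_no_clip = fragment_coords(3012667, "8S44M", 3012708, "53M",
                               extend_softclips=False)
print(chrM)          # (-28, 40)
print(chr1)          # (3012662, 3012755)
print(chr1_no_clip)  # (3012670, 3012755)
```

With the soft-clip extension switched off, the chr1 start becomes 3012670, the value expected in the question. With it on, a read soft-clipped at position 1 necessarily produces a negative start unless the result is clamped at 0.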
Two questions:
Hi,
I have a use case where I need to append the CB tag to each read's read group ID (in addition to setting the read group's SM tag to the cell barcode). I have some working code for this and I could generate a PR against this repo. Are you interested in adding that functionality to this tool?
Hi @timoast,
I installed sinto with conda (conda install sinto) in a new env. It looks like samtools was not one of the dependencies and was not installed automatically, so I had to install it manually to make it work. Would you consider mentioning it in your installation guide or making samtools a dependency?
Hi Tim,
I am trying to figure out how to merge BED files resulting from multiple sinto runs (so as to be able to run it on a split BAM file and parallelize it). For this, I am trying to understand how sinto collapses reads. In the user guide, you say "Within a cell barcode, collapse fragments that share a start or end coordinate on the same chromosome." Could you explain why this is done? I would have thought that, for a given cell barcode, one would want to collapse only those fragments that have the same start AND end positions. Why do you say "start or end"?
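To make the quoted rule concrete, here is one possible reading of it as code: within each (chromosome, barcode) pair, fragments sharing a start are merged into the most frequent one, and the same is then done for fragments sharing an end. This is only an illustration of the sentence from the user guide. The function `collapse`, the tie-breaking, and the order of the two passes are assumptions and may differ from what sinto does.

```python
from collections import Counter, defaultdict

def collapse(fragments):
    """fragments: iterable of (chrom, start, end, barcode) tuples."""

    def merge_on(counts, key_index):
        groups = defaultdict(list)
        for frag, n in counts.items():
            chrom, _start, _end, barcode = frag
            groups[(chrom, barcode, frag[key_index])].append((frag, n))
        merged = Counter()
        for members in groups.values():
            # Keep the most frequent fragment; ties go to the smallest tuple
            best = min(members, key=lambda m: (-m[1], m[0]))[0]
            merged[best] += sum(n for _, n in members)
        return merged

    counts = Counter(fragments)
    counts = merge_on(counts, 1)  # fragments sharing a start coordinate
    counts = merge_on(counts, 2)  # fragments sharing an end coordinate
    return counts

frags = [
    ("chr1", 100, 200, "A"),
    ("chr1", 100, 200, "A"),
    ("chr1", 100, 210, "A"),  # same start as the fragments above
    ("chr1", 90, 210, "A"),
    ("chr1", 100, 200, "B"),  # different barcode, never merged with "A"
]
result = collapse(frags)
for frag, n in sorted(result.items()):
    print(frag, n)
```

In this toy input the "A" fragment 100-210 is absorbed by 100-200 because they share a start, while the same coordinates under barcode "B" stay separate.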
Hi, Dr. Tim,
Hope this email finds you well!
I am using the tool sinto to create a scATAC-seq fragments file from the BAM file.
However, my output file is empty. Below is the script I use to run sinto. Could you please tell me whether I made a mistake and how to fix it?
#!/bin/bash
#SBATCH --partition=Orion
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --mem=64GB
#SBATCH --ntasks-per-node=1
cd /scratch/qmei/wqq/mouse-TF/
export PATH=/users/qmei/anaconda3/bin/:$PATH
sinto fragments -b Cerebellum_62216.bam -f Cerebellum_62216fragments.bed
Hey there,
I am running sinto filterbarcodes to split a BAM file.
Cmd: sinto filterbarcodes -b $.bam -c $cells -p 16
The error message is shown below:
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File ".../lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File ".../lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File ".../lib/python3.9/site-packages/sinto/filterbarcodes.py", line 21, in _iterate_reads
for r in inputBam.fetch(i[0], i[1], i[2]):
File "pysam/libcalignmentfile.pyx", line 2086, in pysam.libcalignmentfile.IteratorRowRegion.next
OSError: truncated file
"""
Thank you very much for the help!
Bests,
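One thing that can be checked for a "truncated file" error is whether the BAM file ends with the 28-byte BGZF end-of-file block defined in the SAM/BAM specification. samtools quickcheck performs this kind of test. The sketch below does the same by hand. `has_bgzf_eof` is a name made up for this example, and the demo uses small dummy files rather than real BAM data. A present marker does not prove the file is intact, and the error may have another cause.

```python
import os
import tempfile

# Empty BGZF block that terminates every complete BAM file (28 bytes)
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff0600 42430200 1b000300 00000000 00000000"
)

def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF end-of-file marker."""
    size = os.path.getsize(path)
    if size < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(size - len(BGZF_EOF))
        return handle.read() == BGZF_EOF

# Demo on dummy files: one with the marker, one cut short
with tempfile.TemporaryDirectory() as tmp:
    complete = os.path.join(tmp, "complete.bin")
    truncated = os.path.join(tmp, "truncated.bin")
    with open(complete, "wb") as handle:
        handle.write(b"payload" + BGZF_EOF)
    with open(truncated, "wb") as handle:
        handle.write(b"payload")
    complete_ok = has_bgzf_eof(complete)
    truncated_ok = has_bgzf_eof(truncated)

print(complete_ok, truncated_ok)
```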
Hi,
I installed sinto successfully with python2 etc. But I seem to have an error with python:
File "/home/mfaxel/lib/python2.7/site-packages/sinto-0.7.2.2-py2.7.egg/sinto/tagtorg.py", line 9
return "\t".join(f"{k}:{v}" for k, v in line.items())
^
SyntaxError: invalid syntax
Is there a workaround, or has anyone already had this problem?
Thanks in advance!
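The line flagged in the traceback uses an f-string, a syntax added in Python 3.6, so a Python 2.7 interpreter cannot parse it. The snippet below shows the same expression written with str.format, which produces the same text. The example dictionary is invented. This only illustrates the syntax difference and is not a suggestion that the rest of the package would run on Python 2.

```python
line = {"ID": "cell1", "SM": "cell1"}  # hypothetical read-group fields

# Python 3.6+ only: f-string, as in the traceback
with_fstring = "\t".join(f"{k}:{v}" for k, v in line.items())

# Same result without f-string syntax
with_format = "\t".join("{}:{}".format(k, v) for k, v in line.items())

print(with_fstring == with_format)
```

Running sinto in a Python 3 environment is probably the simpler route.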
The shift is currently hardcoded as +4/-5. Can this be made configurable? Also it seems +4/-4 is the right shift to apply.
Hi Tim,
I have been successfully running sinto filterbarcodes for a batch of BAM files, but now it's throwing an error with another batch. I had a look at the BAM files and they seem OK, and the barcode list is formatted the same way. It's also possible that these barcodes won't match anything in the BAM file, but that doesn't throw any errors.
Here is the head of one BAM file and the barcode list I'm trying to subset it to:
A00445:16:H7YL5DMXX:1:1439:23601:3098 16 2 46895278 255 1S38M487913N59M * GTCACTGCAACCTCCACCTTCCAGGTTCAAGCAATTCTCCTGGGAGGCGGAGCTTGCAGTGAGCCGAGATTGCACCACTGCACTCCAGCCTGGGTGAC FFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFF:F:FF:FFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:92 nM:i:0 RE:A:I BC:Z:CCTTTGTC QT:Z:FFFFFFFF CR:Z:TTCTCAAAGATGTGGC CY:Z:FFFFFFFFFFFFFFFF CB:Z:TTCTCAAAGATGTGGC-1 UR:Z:TTGGGGTTGG UY:Z::FFFFFFFFF UB:Z:TTGGGGTTGG RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:1
A00445:16:H7YL5DMXX:1:1236:9136:10614 0 2 47109310 255 66M257308N32M * ACTTTGGGAGGCTGATGTGGGTAGATCACCTGAACTCAGGAGTTCAACACCAGCCTGGCCAACAAGAAACCCCATCTCTACTAAAAATACAAAAAATT :,FFFFFFF,FFF:F,FFFFFFFFFFFFFFFF:F,FFFFFFF:FFFF,F,F:FFF,FFFFFFFF,:FFFFF,FFFFFFFFFFFFF:FFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:84 nM:i:5 RE:A:I BC:Z:AGCACACT QT:Z:,:FFFFF, CR:Z:GTCAGGGAGGTGATAT CY:Z:F::F,FFFFFFFFF,F CB:Z:GTCACGGAGGTGATAT-1 UR:Z:ATCTGGGAAG UY:Z:FFFFFF:F,F UB:Z:ATCTGGGAAG RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:1
A00445:16:H7YL5DMXX:2:1121:8395:11741 16 2 47151916 255 7S41M257273N50M * CGCCGGCTTTGTTTTTTTTTTTTTTTTTGTATTTTTAGTAGAGACAGGGTTTCACCATGTTGGCCAGCCTGGTCTTGAACTCCTGACCTCAAGTGATC :,F:FF,:,,,:FFFF,FFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFF,FFFFFFFFFFFFFFF,,FFFFFFFFFF:FFFFFFFFFF:FFF::FFF NH:i:1 HI:i:1 AS:i:83 nM:i:2 RE:A:I BC:Z:AGCACACT QT:Z:FFFFFFFF CR:Z:ACGGGTCAGCGTGTCC CY:Z:FFFFFFFFFFFFFFFF CB:Z:ACGGGTCAGCGTGTCC-1 UR:Z:TGGCGTCTGA UY:Z:FF:F,:FFFF UB:Z:TGGCGTCTGA RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:2
A00445:16:H7YL5DMXX:1:1369:20292:30185 0 2 47151961 255 51M200703N47M * CAATGTGTTAGCCAGGATGGTCTAGATCTCCTGACCTTGTGATCCGCCCGCCCCTGCCTCCCAAAGTGCTGGGATTACAGGTGTGAGCCACCGTGCCC FF,FFFFFFFFFFFFFFF:F:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFF:FFFFFF:FFFFFFFFFFFFFFFFFF,FFFFFFFFF NH:i:1 HI:i:1 AS:i:88 nM:i:3 RE:A:I BC:Z:AGCAAACT QT:Z:FFFFFFFF CR:Z:TGCACGCTCGTGGGAA CY:Z:FF,FFF:FFFF,FFF: CB:Z:TGGACGCTCGTGGGAA-1 UR:Z:TGTTCCATTT UY:Z:FF,:FFFFFF UB:Z:TGTTCCATTT RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:1
A00445:16:H7YL5DMXX:2:2451:8847:2973 16 2 47168859 255 62M191908N36M * CCTCTGCCTCCCAGGTTCAAGTGATTCTCCTGACTCAGCCTCTAGAGTCGCTGGGATTACAGGCACACGCCACCATGCCAGGCTAATTTTTATATTTT ::FFFFF:FFFFFF,FFFFF:FFFFFFF::FF,FFFF:FFFFFFFFFF,FFFF:FFFFFFFF:FFFFFFF,FFF,FFFF,FFFFFFFFFF:FF,:FFF NH:i:1 HI:i:1 AS:i:84 nM:i:5 RE:A:I BC:Z:TAGGATGA QT:Z::F,F,FFF CR:Z:TGGTTCCAGTCCCGGA CY:Z:F:FFFF,:,:FFFFFF CB:Z:TGGTTCCAGTACCGGA-1 UR:Z:ACGTAGAACC UY:Z:F:FFFFF,:: UB:Z:ACGTAGAACC RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:2
A00445:16:H7YL5DMXX:2:2444:12192:33129 16 2 47179880 255 45M184241N53M * CACACCATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCGTGCCACCACACCCAGCTAATTTTGTATTTTTAGTAGAGACGGGGTTTC FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:90 nM:i:2 RE:A:I BC:Z:GTACGCGG QT:Z:FFFFFFFF CR:Z:AGGCCGTAGCAACGGT CY:Z:FFFFFFFFFFFFFFFF CB:Z:AGGCCGTAGCAACGGT-1 UR:Z:TCGCTTTACT UY:Z:FFFFFFF,FF UB:Z:TCGCTTTACT RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:2
A00445:16:H7YL5DMXX:1:2449:20229:6840 0 2 47212619 255 22M237731N76M * AAACAAAACAAAAAAAAAAAACACCGGGCGTGGTGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCGAGGCAGGCAGATCACAAGGTCAGGAGAT F,FFFF:FFFFFFFFFF::FF:FFFFFFFF:FFF,FFFFFFF:F:FFFF,FFFFFFF,FFFFFF:FFFFFFFFFFFF,FF,,FFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:88 nM:i:3 RE:A:I BC:Z:AGCACACT QT:Z:FFFFFFFF CR:Z:TGCCCTAAGCGTCAAG CY:Z:FFFFFFFFFFFFFFFF CB:Z:TGCCCTAAGCGTCAAG-1 UR:Z:GTCACAAATT UY:Z:FFFFFFFFFF UB:Z:GTCACAAATT RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:1
A00445:16:H7YL5DMXX:1:2324:16776:36276 16 2 47251321 255 1S24M202853N73M * GATCCTCCCTCCTCAGCCTCCCAAAGTGGGCGGATCACGAGGTCAAGACATCAAGACCATCCTGACCAACATGGCGAAACCCCGTCTGTACTAAAAAT FFFFFFFFFFFFFFFFFFFFFFF:FF,FF,FFFFFFFFFFFFFFFFFF:,FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF NH:i:1 HI:i:1 AS:i:77 nM:i:8 RE:A:I BC:Z:CCTTTGTC QT:Z:FFFFFF:F CR:Z:TGCCCATTCAGAGACG CY:Z:FFFFF,FFFFFFFFFF CB:Z:TGCCCATTCAGAGACG-1 UR:Z:GAGTGTGCGA UY:Z:FFFFFFFFFF UB:Z:GAGTGTGCGA RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:1
A00445:16:H7YL5DMXX:1:2173:5032:7733 0 2 47253083 255 4S70M108784N24M * GGAATTCAAGACCAGCCTGGCCATCATGGTGTAACCCCATCTCTACTAAAAATACTAAAAATTAGCTAGGTGTGGTGGTTCATGCCTGTAATCCCAGC FFF:FFFFFFFFFFFFFFFFFFF,FFFFFFF,FFF::FFFFFFFFFFFFFFFF:F:F,FFFF,FFFFFFFFFFF:FFFF,F:FFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:84 nM:i:3 RE:A:I BC:Z:GTCCGCGG QT:Z:FF,:FFFF CR:Z:GCGGGTTTCTATCCCC CY:Z:FFFFFFFFFFFFFFFF CB:Z:GCGGGTTTCTATCCCG-1 UR:Z:ACCTTAGGGG UY:Z:,FFFFFFFFF UB:Z:ACCTTAGGGG RG:Z:RPL_CST:MissingLibrary:1:H7YL5DMXX:1
A00445:16:H7YL5DMXX:2:2344:4227:21245 0 2 47261720 255 42M278937N53M3S * GGCTCACACATGTAATCCCAGCACTTTGGGAAGCGAAGGCAGGCGGATTGCTTGAGGCCAGGAGTTTGGGACCAGCCTGGGTCACATAGCCAGACCCT F,,FFFFFFFFFFFFFFFFFFFFFFF,F,FF:FFF:FFFFFFF,FFFFFFFF:FFFFFFFFFFFFFF:FFFFFF,FF,FFFF,FFF:FFFFFFFFFFF NH:i:1 HI:i:1 AS:i:72 nM:i:9 RE:A:I BC:Z:AGCACACT QT:Z::FF,FFFF CR:Z:GTGAAGGGTTGTCTTT CY:Z:FF:FFF
And the barcode tab-delimited file:
AAACCTGAGACTACAA-1 Myeloid
AAACCTGCACCTGGTG-1 Myeloid
AACCATGCATCACGAT-1 T.NK.cells
AACGTTGGTGTTGGGA-1 Myeloid
AACTGGTGTTACGGAG-1 T.NK.cells
AAGACCTTCCAGAGGA-1 T.NK.cells
AAGGAGCTCTGATACG-1 T.NK.cells
AAGGTTCAGGTTACCT-1 T.NK.cells
AAGTCTGGTATCAGTC-1 Myeloid
AAGTCTGTCTATGTGG-1 T.NK.cells
And here is the command and the error:
sinto filterbarcodes -b file.bam \
-c celltypes.txt \
--barcodetag "CB"
Function run_filterbarcodes called with the following arguments:
bam file.bam
cells celltypes.txt
trim_suffix False
nproc 1
barcode_regex None
barcodetag CB
func <function run_filterbarcodes at 0x2b5b48a5b6a8>
Traceback (most recent call last):
File "/broad/hptmp/bgiotti/signac/bin/sinto", line 263, in <module>
options.func(options)
File "/broad/hptmp/bgiotti/signac/lib/python3.6/site-packages/sinto/utils.py", line 21, in wrapper
func(args)
File "/broad/hptmp/bgiotti/signac/lib/python3.6/site-packages/sinto/cli.py", line 14, in run_filterbarcodes
cellbarcode=options.barcodetag,
File "/broad/hptmp/bgiotti/signac/lib/python3.6/site-packages/sinto/filterbarcodes.py", line 92, in filterbarcodes
unique_classes = list(set(chain.from_iterable(cb.values())))
TypeError: unhashable type: 'list'
Thanks a lot for your support!
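The failing line flattens the values of the barcode dictionary and puts them into a set. That works when each value is a flat list of group names, and it raises exactly this TypeError when a value is a list of lists. The snippet reproduces the error type only. The dictionaries are invented, and whether nested lists are what actually happened in this run is an assumption.

```python
from itertools import chain

flat = {"AAACCTGAGACTACAA-1": ["Myeloid"], "AACCATGCATCACGAT-1": ["T.NK.cells"]}
nested = {"AAACCTGAGACTACAA-1": [["Myeloid"]]}  # one level too deep

# Flat values: every item reaching set() is a string
classes = sorted(set(chain.from_iterable(flat.values())))

# Nested values: the items reaching set() are lists, which are unhashable
try:
    set(chain.from_iterable(nested.values()))
    message = None
except TypeError as err:
    message = str(err)

print(classes)
print(message)
```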
Is sinto able to split a single-cell RNA-seq BAM file into multiple files, one for each filtered barcode (available in the cellranger output)?
Hi timoast,
Thank you for this very nice tool. Is it possible to add an option to the "filterbarcodes" function that allows using a different read tag? I have data from the BioRad platform and the tag is "DB".
Best,
Jason
Hi,
I have a BAM file that contains this entry, for example:
$ samtools view possorted_bam.hornet.final.bam|grep "A01040:79:H2F2YDRXY:2:2165:10782:19977"
A01040:79:H2F2YDRXY:2:2165:10782:19977 163 chr8 120623305 60 50M = 120623620 365 ATGGGAATGACATTGTATCTTGTGATGTGCTATTTATTAGAAATCAAAAA FF,F,FFFFFFFFFFFFF,FFFFFFF:F:FFFF,FFFFFFFFFFFFFFF: NM:i:0 MD:Z:50 AS:i:50 XS:i:19 CR:Z:TCAGTTTGTGATCAGG CY:Z::FF:FFFFF:::FFFF CB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA QT:Z::::F,FFF RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C
A01040:79:H2F2YDRXY:2:2165:10782:19977 83 chr8 120623620 60 50M = 120623305 -365 ATCGCTGAGAATCTGAACAAATTAAGGGTGTGGGGGTTGGGGGAGGCAGC :F:F,F:,:FFFF,,FF,FFFFFFF:F:F:FF,:FFFFFFFF,FF:FFFF NM:i:1 MD:Z:13A36 AS:i:45 XS:i:23 CR:Z:TCAGTTTGTGATCAGG CY:Z::FF:FFFFF:::FFFF CB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA QT:Z::::F,FFF RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C
Then I ran sinto filterbarcodes and got this in the output:
samtools view PBMC002.bam|grep "A01040:79:H2F2YDRXY:2:2165:10782:19977"
A01040:79:H2F2YDRXY:2:2165:10782:19977 163 chr8 120623305 60 50M = 120623620 365 ATGGGAATGACATTGTATCTTGTGATGTGCTATTTATTAGAAATCAAAAA FF,F,FFFFFFFFFFFFF,FFFFFFF:F:FFFF,FFFFFFFFFFFFFFF: NM:i:0 MD:Z:50 AS:i:50 XS:i:19 CR:Z:TCAGTTTGTGATCAGG CY:Z::FF:FFFFF:::FFFFCB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA QT:Z::::F,FFF RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C-3A2DA946
A01040:79:H2F2YDRXY:2:2165:10782:19977 163 chr8 120623305 60 50M = 120623620 365 ATGGGAATGACATTGTATCTTGTGATGTGCTATTTATTAGAAATCAAAAA FF,F,FFFFFFFFFFFFF,FFFFFFF:F:FFFF,FFFFFFFFFFFFFFF: NM:i:0 MD:Z:50 AS:i:50 XS:i:19 CR:Z:TCAGTTTGTGATCAGG CY:Z::FF:FFFFF:::FFFFCB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA QT:Z::::F,FFF RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C-12D1C06B
A01040:79:H2F2YDRXY:2:2165:10782:19977 83 chr8 120623620 60 50M = 120623305 -365 ATCGCTGAGAATCTGAACAAATTAAGGGTGTGGGGGTTGGGGGAGGCAGC :F:F,F:,:FFFF,,FF,FFFFFFF:F:F:FF,:FFFFFFFF,FF:FFFF NM:i:1 MD:Z:13A36 AS:i:45 XS:i:23 CR:Z:TCAGTTTGTGATCAGG CY:Z::FF:FFFFF:::FFFF CB:Z:TCAGTTTGTGATCAGG-1 BC:Z:GCTCGTCA QT:Z::::F,FFF RG:Z:PBMC-1-2-3-4:MissingLibrary:1:H2F2YDRXY:2-4836788C-12D1C06B
There's one BAM line that's duplicated, but with a different RG tag. I'm wondering why this happens?
I'm worried that this will create bias when counting the BAM reads for downstream analysis.
Thanks!
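To see how widespread such duplicated records are, the SAM text can be grouped by read name, flag, chromosome, and position, collecting the RG tag of each copy. The sketch below does this on shortened, made-up records. The helper name `duplicated_records` is hypothetical.

```python
from collections import defaultdict

def duplicated_records(sam_lines):
    """Return {(qname, flag, chrom, pos): [RG values]} for repeated records."""
    seen = defaultdict(list)
    for line in sam_lines:
        fields = line.split()
        key = (fields[0], int(fields[1]), fields[2], int(fields[3]))
        # Optional tags start after the 11 mandatory SAM columns
        rg = next((f[5:] for f in fields[11:] if f.startswith("RG:Z:")), None)
        seen[key].append(rg)
    return {key: rgs for key, rgs in seen.items() if len(rgs) > 1}

# Shortened records modelled on the output above (sequences are dummies)
sam = [
    "q1 163 chr8 120623305 60 50M = 120623620 365 ACGT FFFF RG:Z:lib-3A2DA946",
    "q1 163 chr8 120623305 60 50M = 120623620 365 ACGT FFFF RG:Z:lib-12D1C06B",
    "q1 83 chr8 120623620 60 50M = 120623305 -365 ACGT FFFF RG:Z:lib-12D1C06B",
]
dups = duplicated_records(sam)
for key, rgs in dups.items():
    print(key, rgs)
```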
Hi, please look at the following comment from a closed issue. Opening a new issue here since I haven't heard back from anyone (presumably because commenting on a closed issue doesn't automatically reopen it).
Thanks.
" As a follow up, looking at the code it seems to me that you use 20 as the threshold for this. i.e. if one end is the same, we allow the other end to be up to 20 bases away for it to still be considered a duplicate. Is that correct?
However, even in that case, I'm confused because I see multiple cases where the end is the same, the start is <20 bases away, but these are still not counted separately (i.e., they are considered duplicates) by sinto. e.g. with the following 4 reads:
A00261:525:HK77VDSX3:1:1133:17969:2613 99 chrM 9947 60 150M = 10023 226 GGTTTGACTATTTCTGTATGTCTCCATCTATTGATGAGGGTCTTACTCTTTTAGTATAAATAGTACCGTTAACTTCCAATTAACTAGTTTTGACAACATTCAAAAAAGAGTAATAAACTTCGCCTTAATTTTAATAATCAACACCCTCCT FFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:150 AS:i:150 XS:i:34 CR:Z:ACAGGCTCAGGAGGGT CY:Z:FFFFFFFFFFFFFFFF CB:Z:AAAGCAAGTGGAAACG-1 BC:Z:TCGAATTG QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK77VDSX3:1
A00261:525:HK77VDSX3:1:1133:17969:2613 147 chrM 10023 60 150M = 9947 -226 CAATTAACTAGTTTTGACAACATTCAAAAAAGAGTAATAAACTTCGCCTTAATTTTAATAATCAACACCCTCCTAGCCTTACTACTAATAATTATTACATTTTGACTACCACAACTCAACGGCTACATAGAAAAATCCACCCCTTACGAG :FFFFFFFFFFFFFFFF:FFFFFF:FFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:150 AS:i:150 XS:i:0 CR:Z:ACAGGCTCAGGAGGGT CY:Z:FFFFFFFFFFFFFFFF CB:Z:AAAGCAAGTGGAAACG-1 BC:Z:TCGAATTG QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK77VDSX3:1
A00261:525:HK77VDSX3:1:1370:20518:3302 99 chrM 10092 60 81M = 10092 81 CTCCTAGCCTTACTACTAATAATTATTACATTTTGACTACCACAACTCAACGGCTACATAGAAAAATCCACCCCTTACGAG FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:81 AS:i:81 XS:i:0 CR:Z:ACAGGCTCAGGAGGGT CY:Z:FFFFFF,FFFFFFFFF CB:Z:AAAGCAAGTGGAAACG-1 BC:Z:CGAGTGAT QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK77VDSX3:1 TR:Z:CTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAGTGATATCTCGTATGCCGTCTTCTGCTTGAAA TQ:Z:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF
A00261:525:HK77VDSX3:1:1370:20518:3302 147 chrM 10092 60 81M = 10092 -81 CTCCTAGCCTTACTACTAATAATTATTACATTTTGACTACCACAACTCAACGGCTACATAGAAAAATCCACCCCTTACGAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:81 AS:i:81 XS:i:0 CR:Z:ACAGGCTCAGGAGGGT CY:Z:FFFFFF,FFFFFFFFF CB:Z:AAAGCAAGTGGAAACG-1 BC:Z:CGAGTGAT QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK77VDSX3:1 TR:Z:CTGTCTCTTATACACATCTGACGCTGCCGACGACAGACGCGACCCTCCTGAGCCTGTGTGTAGATCTCG TQ:Z:::FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
I would have expected the following two start,end pairs to be considered separate fragments:
9950 10167
10095 10167
but sinto actually only counts the second fragment here (i.e. 10095 10167), and ignores the first. What am I missing?
Thanks
"
Originally posted by @rtyags in #48 (comment)
Thank you for making this tool. I was searching for something like this and couldn't get the other things I found to work. It was difficult to find yours. I'm trying to run sinto filterbarcodes to create pseudobulk data. I sorted and indexed my BAM files with samtools, then ran sinto filterbarcodes and got the following error:
Traceback (most recent call last):
File "/home/tasakis/anaconda2/envs/SingleCells/bin/sinto", line 216, in <module>
options.func(options)
File "/home/tasakis/anaconda2/envs/SingleCells/lib/python3.6/site-packages/sinto/utils.py", line 21, in wrapper
func(args)
File "/home/tasakis/anaconda2/envs/SingleCells/lib/python3.6/site-packages/sinto/cli.py", line 14, in run_filterbarcodes
cellbarcode=options.barcodetag
File "/home/tasakis/anaconda2/envs/SingleCells/lib/python3.6/site-packages/sinto/filterbarcodes.py", line 91, in filterbarcodes
cb = utils.read_cell_barcode_file(cells)
File "/home/tasakis/anaconda2/envs/SingleCells/lib/python3.6/site-packages/sinto/utils.py", line 198, in read_cell_barcode_file
groups = line[1].split(",")
IndexError: list index out of range
Could you suggest what would lead to this error and how I might fix it?
Thank you for your help!
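The traceback ends at `line[1].split(",")`, so at least one line of the barcode file has no second field after splitting. A small check like the one below can list the offending lines. It assumes tab-separated columns, and the function name `check_barcode_file` is invented for this example.

```python
def check_barcode_file(lines):
    """Return (line_number, text) for lines lacking a barcode or a group."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        fields = text.split("\t")
        if len(fields) < 2 or not fields[0] or not fields[1]:
            problems.append((number, text))
    return problems

# Hypothetical file content: line 2 has no group, line 3 is blank
lines = ["AAACCTGAGACTACAA-1\tMyeloid\n", "AAGTCTGGTATCAGTC-1\n", "\n"]
problems = check_barcode_file(lines)
print(problems)
```

Typical causes would be a file with barcodes only, a different delimiter, or a trailing blank line.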
Hi developers,
When extracting fragments from BAM files, sinto collapses fragments that share the exact same chromosome, start, and end coordinates across all cell barcodes, according to the documentation. Can you please justify this? Is there a reason why different cells can't harbor exactly the same fragments?
Good morning! I am using the filterbarcodes function to subset a BAM file and generate a pseudobulk BAM file for further processing. While doing so I came across two things:
Is the -o output parameter required? When I pass it, the following error occurs:
"sinto: error: unrecognized arguments: -o ./ "
If the -o parameter is left out, the program runs, generates two subfiles with the correct A_xx titles (about 2-3 GB in size), and exits with the following error, leaving the two subfiles in place:
"File "/risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/sinto/filterbarcodes.py", line 55, in mergeAll
raise Exception("samtools merge failed, temp files not deleted")
Exception: samtools merge failed, temp files not deleted"
What typically leads to this error?
Thanks for your help!
Hi,
This is a very useful tool.
I am wondering what would be the best way to normalize the data for visualization on the UCSC Genome Browser. I saw that CoveragePlot in Signac considers the total number of reads per cluster and the total number of cells per cluster. I would like to know the best way to get a scaling factor that can then be used to normalize the bedGraph file.
genomeCoverageBed -ibam cluster_bam.bam -bg | awk -v OFS="\t" '{ $4=$4*scaling_factor; print}' > cluster_bam.bg
Then use 'bedGraphToBigWig' to make a bigWig.
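One common convention is to scale coverage to reads per million, i.e. a factor of 1,000,000 divided by the total number of reads in the cluster BAM file. The sketch below applies such a factor to bedGraph lines. It is not Signac's formula. `scale_bedgraph` and the read total are made up for illustration. In the awk one-liner above, scaling_factor would also need to be passed in as an awk variable (for example with another -v option).

```python
def scale_bedgraph(lines, total_reads, per=1_000_000):
    """Yield bedGraph lines with the value column scaled to reads per `per`."""
    factor = per / total_reads
    for line in lines:
        chrom, start, end, value = line.split()[:4]
        yield f"{chrom}\t{start}\t{end}\t{float(value) * factor:.4f}"

# Hypothetical cluster with 2,000,000 reads: the factor is 0.5
bedgraph = ["chr1\t100\t200\t4", "chr1\t200\t300\t10"]
scaled = list(scale_bedgraph(bedgraph, total_reads=2_000_000))
for row in scaled:
    print(row)
```

Dividing additionally by the number of cells in the cluster would give a per-cell signal, closer in spirit to the description of CoveragePlot given above.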
I am splitting a BAM file with sinto filterbarcodes -b $BAM -c $CELLS -p 16. This is a snippet of my input $BAM:
@HD VN:1.6 SO:coordinate
@SQ SN:chr_1 LN:356613585
GGGATTGGATCTATCT:NS500645:228:HGT2VAFX2:1:11311:15433:7613 1187 chr_1 9139 60 50M = 9211 122 ATATACTCTATTAGCTCCTTTCTTTTTTCCTGGAAAGTAGGACATATTAT AAAAAEEAAEAAE6EA6EEEAEEEEEEEEEEEEEEEEEEE/EEEEEEA6< NM:i:0 MD:Z:50 AS:i:50 XS:i:22 MQ:i:60 MC:Z:50M ms:i:1716 CB:Z:25m_PFA#GGGATTGGATCTATCT
GGGATTGGATCTATCT:NS500645:228:HGT2VAFX2:2:11306:22648:13604 1187 chr_1 9139 60 50M = 9211 122 ATATACTCTATTAGCTCCTTTCTTTTATCCTGGAAAGTAGGACATATTAT AAAAAEEEE<EAEEEEEEEEEAEEEE/E/EEE/EEEE6//EEEAEEEEAE NM:i:1 MD:Z:26T23 AS:i:45 XS:i:0 MQ:i:60 MC:Z:50M ms:i:1742 CB:Z:25m_PFA#GGGATTGGATCTATCT
The temporary files being created are OK, i.e. barcodes are read and reads are split correctly; however, after merging, all outputs are empty.
Is this a samtools merge or reheader issue that I can't figure out, or is it something sinto-related?
Thank you in advance,
Anamaria
Hi Tim, when loading your package in python I ran into an error:
>>> import sinto
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/user/conda/envs/sinto/lib/python3.7/site-packages/sinto/__init__.py", line 1, in <module>
import importlib.metadata
ModuleNotFoundError: No module named 'importlib.metadata'
Then I noticed that in my installation the package is called envs/sinto/lib/python3.7/site-packages/importlib_metadata (with an underscore instead of a dot). So changing importlib.metadata to importlib_metadata in conda/pkgs/sinto-0.8.1-pyhfa5458b_0/site-packages/sinto/__init__.py resolves the error. Maybe this is version-specific naming?
Thanks!
Tilo
My conda environment:
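On the import error above: importlib.metadata joined the standard library in Python 3.8, while importlib_metadata (with an underscore) is the separately installed backport, which fits the Python 3.7 path in the traceback. A common pattern that works on both is a guarded import, sketched below. The alias `importlib_metadata` is just a local name chosen for this example.

```python
try:
    # Standard library on Python 3.8 and later
    import importlib.metadata as importlib_metadata
except ImportError:
    # Third-party backport, needed on Python 3.7 and earlier
    import importlib_metadata

# Either module exposes version(), used to look up an installed package version
version_lookup = importlib_metadata.version
print(callable(version_lookup))
```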
Hi Tim,
So I created a new column for barcode info (in the SAM file), then converted the SAM to a BAM file and indexed it. I still get no fragment file.
The head:
7001113:989:HTKVHBCX2:2:1105:15600:5528:77:15:82:15:CGGTATTTGG 0 chr1 3000049 1 23M * 0 0 TCTTTGAAGGTCTGGTAGAACTC DDDDDIIIIIIIIIIIIIIIIII AS:i:0 CB:Z:77158215
7001113:991:HVWNKBCX2:1:2110:2683:18182:87:93:72:36:CAGGTATGGC 0 chr1 3000132 39 53M * 0 0 GACTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTTTTAGTCCADDDDCHIIHGFHIIIIIIIIIIHHHEHIIIIIIIDDGHHHDHHIIIIIIIIHI AS:i:0 CB:Z:87937236
7001113:990:HTKL3BCX2:2:2111:14944:4116:87:93:72:36:CAGGTATGGC 0 chr1 3000134 38 52M * 0 0 CTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTTTTAGTCCATGDDDDDIIIIIIIIIIIIIIIIIIIGIIIIIIIDFHHHHHIIIIIIIIIHIII AS:i:0 CB:Z:87937236
7001113:991:HVWNKBCX2:2:2105:19568:57493:53:23:77:35:CACTATTTTG 0 chr1 3000159 0 52M * 0 0 GGGGGGGCATGGGACTTTTAGTCCATGAATCTGATCCTGATTTAGCTTTGGTDDDDDIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIHEHIIHGIIIIIH AS:i:-22 CB:Z:53237735
7001113:993:HVWMKBCX2:2:1211:1360:3780:53:23:77:35:CACTATTTTG 0 chr1 3000353 37 53M * 0 0 GTTAATTATAGTACAGTCCCTATGCCCTCTAGTTAGTCTGGCTAAGGGTTTADDDDDIIHIIIIIIIIIIIHIIIIIIIIIIIIHIGIHIIHHIIIHHIIIHIII AS:i:0 CB:Z:53237735
7001113:991:HVWNKBCX2:2:2211:8406:75832:21:06:91:12:TCATCTTTGT 16 chr1 3000464 1 52M * 0 0 TCTTTTTGTTTCCACTTGGTTGATTTCAGCTCTGAGTTTGATTATTTCCTGCIHIHIIIIHHFF@EHIHGCHF<1IHFEEHEIHIIHIIHHHFIIGIHIDDDDD AS:i:0 CB:Z:21069112
7001113:989:HTKVHBCX2:2:1211:2292:73091:90:80:26:12:CTGTACGGCT 0 chr1 3000559 42 53M * 0 0 CTTCTAGATTTGCTGTCAGGCTGCTAGTGTATACTCTAGTTTCCTTTTGGAGDDCDDIIIIIIIIHIIIIIIIIIIIIIHIIIIIIIIHIIIIIIIIIIIIHIII AS:i:0 CB:Z:90802612
7001113:991:HVWNKBCX2:1:2205:17457:98814:90:80:26:12:CTGTACGGCT 0 chr1 3000633 30 53M * 0 0 CTCTTAGGACTGCCTCATTGTGCCCCATATGTTTGGCTATGTTGTGGATTTADDDDDIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 CB:Z:90802612
7001113:989:HTKVHBCX2:1:2216:4512:6199:90:80:26:12:CTGTACGGCT 0 chr1 3000747 32 52M * 0 0 ATTAAGTAGAGTATTGTTCAGTTTCCAGGTGAATGTTGGCTTTCTATTATTTDDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIH AS:i:0 CB:Z:90802612
7001113:989:HTKVHBCX2:2:2209:10226:10613:58:87:87:11:TCGAATTTGT 0 chr1 3000919 32 51M * 0 0 ATTTGGTACTGAGAAGAAGGTATATATCCTTTTGTCTTATGATAAAATGTT DDDDDIIIIIIIHIIIIIHIIIHIIIIIIIIIIIIIIGIIIIIIIIIIIII AS:i:0 CB:Z:58878711
the sinto code:
sinto fragments -b merge_CB.bam -f fragment
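One detail visible in the records above is the FLAG column: every read has flag 0 or 16, which in the SAM specification means the read is not marked as paired. Building a fragment from a read pair needs paired records, so this may be worth checking. Whether it explains the empty output here is an assumption. The small decoder below (`describe_flag` is a made-up name) shows how the flag bits read.

```python
def describe_flag(flag):
    """Decode a few SAM FLAG bits (values from the SAM specification)."""
    return {
        "paired": bool(flag & 0x1),
        "proper_pair": bool(flag & 0x2),
        "reverse": bool(flag & 0x10),
        "duplicate": bool(flag & 0x400),
    }

# 0 and 16 appear in the records above; 99 and 147 are a typical proper pair
decoded = {flag: describe_flag(flag) for flag in (0, 16, 99, 147)}
for flag, bits in decoded.items():
    print(flag, bits)
```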
Hello,
I have some scATAC data from which I am trying to generate pseudobulk files using text files of cell barcodes. The fastq files were aligned with bowtie2 and then converted into .bams and sorted using samtools.
I am encountering the following error:
sinto filterbarcodes -b f1.sorted.bam -c fibroblast_cells.txt --outdir f1_CFs.bam -p 1
Function run_filterbarcodes called with the following arguments:
bam f1.sorted.bam
cells fibroblast_cells.txt
trim_suffix False
nproc 1
barcode_regex None
barcodetag CB
outdir f1_CFs.bam
sam False
func <function run_filterbarcodes at 0x110ca9550>
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/alexwhitehead/miniconda3/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/alexwhitehead/miniconda3/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/Users/alexwhitehead/miniconda3/lib/python3.9/site-packages/sinto/filterbarcodes.py", line 25, in _iterate_reads
newhead = dict((k, header[k]) for k in ("HD", "SQ", "RG"))
File "/Users/alexwhitehead/miniconda3/lib/python3.9/site-packages/sinto/filterbarcodes.py", line 25, in <genexpr>
newhead = dict((k, header[k]) for k in ("HD", "SQ", "RG"))
KeyError: 'RG'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/alexwhitehead/miniconda3/bin/sinto", line 8, in <module>
sys.exit(main())
File "/Users/alexwhitehead/miniconda3/lib/python3.9/site-packages/sinto/arguments.py", line 472, in main
options.func(options)
File "/Users/alexwhitehead/miniconda3/lib/python3.9/site-packages/sinto/utils.py", line 23, in wrapper
func(args)
File "/Users/alexwhitehead/miniconda3/lib/python3.9/site-packages/sinto/cli.py", line 17, in run_filterbarcodes
filterbarcodes.filterbarcodes(
File "/Users/alexwhitehead/miniconda3/lib/python3.9/site-packages/sinto/filterbarcodes.py", line 111, in filterbarcodes
idents = p.map_async(
File "/Users/alexwhitehead/miniconda3/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
KeyError: 'RG'
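The traceback fails on `header[k]` with `k` equal to "RG", which means the header dictionary has no RG entry. The sketch below contrasts the strict lookup from the traceback with a tolerant variant that copies only the sections that exist. The header dictionary is a shortened stand-in, `subset_header` is a hypothetical helper, and this is not a patch to sinto.

```python
def subset_header(header, keys=("HD", "SQ", "RG")):
    """Copy only the header sections that are present."""
    return {k: header[k] for k in keys if k in header}

# Shortened stand-in for a BAM header without any @RG line
header = {
    "HD": {"VN": "1.0", "SO": "coordinate"},
    "SQ": [{"SN": "chr1", "LN": 248956422}],
}

# Strict lookup, as in the traceback: fails when "RG" is absent
try:
    dict((k, header[k]) for k in ("HD", "SQ", "RG"))
    strict_error = None
except KeyError as err:
    strict_error = err.args[0]

tolerant = subset_header(header)
print(strict_error, sorted(tolerant))
```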
I am unsure if this is caused by missing the read group portion of the header - when I ran
samtools view -H f1.bam
I got the following output:
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chr10 LN:133797422
@SQ SN:chr11 LN:135086622
@SQ SN:chr12 LN:133275309
@SQ SN:chr13 LN:114364328
@SQ SN:chr14 LN:107043718
@SQ SN:chr15 LN:101991189
@SQ SN:chr16 LN:90338345
@SQ SN:chr17 LN:83257441
@SQ SN:chr18 LN:80373285
@SQ SN:chr19 LN:58617616
@SQ SN:chr20 LN:64444167
@SQ SN:chr21 LN:46709983
@SQ SN:chr22 LN:50818468
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@SQ SN:chrM LN:16569
@SQ SN:chr1_KI270706v1_random LN:175055
@SQ SN:chr1_KI270707v1_random LN:32032
@SQ SN:chr1_KI270708v1_random LN:127682
@SQ SN:chr1_KI270709v1_random LN:66860
@SQ SN:chr1_KI270710v1_random LN:40176
@SQ SN:chr1_KI270711v1_random LN:42210
@SQ SN:chr1_KI270712v1_random LN:176043
@SQ SN:chr1_KI270713v1_random LN:40745
@SQ SN:chr1_KI270714v1_random LN:41717
@SQ SN:chr2_KI270715v1_random LN:161471
@SQ SN:chr2_KI270716v1_random LN:153799
@SQ SN:chr3_GL000221v1_random LN:155397
@SQ SN:chr4_GL000008v2_random LN:209709
@SQ SN:chr5_GL000208v1_random LN:92689
@SQ SN:chr9_KI270717v1_random LN:40062
@SQ SN:chr9_KI270718v1_random LN:38054
@SQ SN:chr9_KI270719v1_random LN:176845
@SQ SN:chr9_KI270720v1_random LN:39050
@SQ SN:chr11_KI270721v1_random LN:100316
@SQ SN:chr14_GL000009v2_random LN:201709
@SQ SN:chr14_GL000225v1_random LN:211173
@SQ SN:chr14_KI270722v1_random LN:194050
@SQ SN:chr14_GL000194v1_random LN:191469
@SQ SN:chr14_KI270723v1_random LN:38115
@SQ SN:chr14_KI270724v1_random LN:39555
@SQ SN:chr14_KI270725v1_random LN:172810
@SQ SN:chr14_KI270726v1_random LN:43739
@SQ SN:chr15_KI270727v1_random LN:448248
@SQ SN:chr16_KI270728v1_random LN:1872759
@SQ SN:chr17_GL000205v2_random LN:185591
@SQ SN:chr17_KI270729v1_random LN:280839
@SQ SN:chr17_KI270730v1_random LN:112551
@SQ SN:chr22_KI270731v1_random LN:150754
@SQ SN:chr22_KI270732v1_random LN:41543
@SQ SN:chr22_KI270733v1_random LN:179772
@SQ SN:chr22_KI270734v1_random LN:165050
@SQ SN:chr22_KI270735v1_random LN:42811
@SQ SN:chr22_KI270736v1_random LN:181920
@SQ SN:chr22_KI270737v1_random LN:103838
@SQ SN:chr22_KI270738v1_random LN:99375
@SQ SN:chr22_KI270739v1_random LN:73985
@SQ SN:chrY_KI270740v1_random LN:37240
@SQ SN:chrUn_KI270302v1 LN:2274
@SQ SN:chrUn_KI270304v1 LN:2165
@SQ SN:chrUn_KI270303v1 LN:1942
@SQ SN:chrUn_KI270305v1 LN:1472
@SQ SN:chrUn_KI270322v1 LN:21476
@SQ SN:chrUn_KI270320v1 LN:4416
@SQ SN:chrUn_KI270310v1 LN:1201
@SQ SN:chrUn_KI270316v1 LN:1444
@SQ SN:chrUn_KI270315v1 LN:2276
@SQ SN:chrUn_KI270312v1 LN:998
@SQ SN:chrUn_KI270311v1 LN:12399
@SQ SN:chrUn_KI270317v1 LN:37690
@SQ SN:chrUn_KI270412v1 LN:1179
@SQ SN:chrUn_KI270411v1 LN:2646
@SQ SN:chrUn_KI270414v1 LN:2489
@SQ SN:chrUn_KI270419v1 LN:1029
@SQ SN:chrUn_KI270418v1 LN:2145
@SQ SN:chrUn_KI270420v1 LN:2321
@SQ SN:chrUn_KI270424v1 LN:2140
@SQ SN:chrUn_KI270417v1 LN:2043
@SQ SN:chrUn_KI270422v1 LN:1445
@SQ SN:chrUn_KI270423v1 LN:981
@SQ SN:chrUn_KI270425v1 LN:1884
@SQ SN:chrUn_KI270429v1 LN:1361
@SQ SN:chrUn_KI270442v1 LN:392061
@SQ SN:chrUn_KI270466v1 LN:1233
@SQ SN:chrUn_KI270465v1 LN:1774
@SQ SN:chrUn_KI270467v1 LN:3920
@SQ SN:chrUn_KI270435v1 LN:92983
@SQ SN:chrUn_KI270438v1 LN:112505
@SQ SN:chrUn_KI270468v1 LN:4055
@SQ SN:chrUn_KI270510v1 LN:2415
@SQ SN:chrUn_KI270509v1 LN:2318
@SQ SN:chrUn_KI270518v1 LN:2186
@SQ SN:chrUn_KI270508v1 LN:1951
@SQ SN:chrUn_KI270516v1 LN:1300
@SQ SN:chrUn_KI270512v1 LN:22689
@SQ SN:chrUn_KI270519v1 LN:138126
@SQ SN:chrUn_KI270522v1 LN:5674
@SQ SN:chrUn_KI270511v1 LN:8127
@SQ SN:chrUn_KI270515v1 LN:6361
@SQ SN:chrUn_KI270507v1 LN:5353
@SQ SN:chrUn_KI270517v1 LN:3253
@SQ SN:chrUn_KI270529v1 LN:1899
@SQ SN:chrUn_KI270528v1 LN:2983
@SQ SN:chrUn_KI270530v1 LN:2168
@SQ SN:chrUn_KI270539v1 LN:993
@SQ SN:chrUn_KI270538v1 LN:91309
@SQ SN:chrUn_KI270544v1 LN:1202
@SQ SN:chrUn_KI270548v1 LN:1599
@SQ SN:chrUn_KI270583v1 LN:1400
@SQ SN:chrUn_KI270587v1 LN:2969
@SQ SN:chrUn_KI270580v1 LN:1553
@SQ SN:chrUn_KI270581v1 LN:7046
@SQ SN:chrUn_KI270579v1 LN:31033
@SQ SN:chrUn_KI270589v1 LN:44474
@SQ SN:chrUn_KI270590v1 LN:4685
@SQ SN:chrUn_KI270584v1 LN:4513
@SQ SN:chrUn_KI270582v1 LN:6504
@SQ SN:chrUn_KI270588v1 LN:6158
@SQ SN:chrUn_KI270593v1 LN:3041
@SQ SN:chrUn_KI270591v1 LN:5796
@SQ SN:chrUn_KI270330v1 LN:1652
@SQ SN:chrUn_KI270329v1 LN:1040
@SQ SN:chrUn_KI270334v1 LN:1368
@SQ SN:chrUn_KI270333v1 LN:2699
@SQ SN:chrUn_KI270335v1 LN:1048
@SQ SN:chrUn_KI270338v1 LN:1428
@SQ SN:chrUn_KI270340v1 LN:1428
@SQ SN:chrUn_KI270336v1 LN:1026
@SQ SN:chrUn_KI270337v1 LN:1121
@SQ SN:chrUn_KI270363v1 LN:1803
@SQ SN:chrUn_KI270364v1 LN:2855
@SQ SN:chrUn_KI270362v1 LN:3530
@SQ SN:chrUn_KI270366v1 LN:8320
@SQ SN:chrUn_KI270378v1 LN:1048
@SQ SN:chrUn_KI270379v1 LN:1045
@SQ SN:chrUn_KI270389v1 LN:1298
@SQ SN:chrUn_KI270390v1 LN:2387
@SQ SN:chrUn_KI270387v1 LN:1537
@SQ SN:chrUn_KI270395v1 LN:1143
@SQ SN:chrUn_KI270396v1 LN:1880
@SQ SN:chrUn_KI270388v1 LN:1216
@SQ SN:chrUn_KI270394v1 LN:970
@SQ SN:chrUn_KI270386v1 LN:1788
@SQ SN:chrUn_KI270391v1 LN:1484
@SQ SN:chrUn_KI270383v1 LN:1750
@SQ SN:chrUn_KI270393v1 LN:1308
@SQ SN:chrUn_KI270384v1 LN:1658
@SQ SN:chrUn_KI270392v1 LN:971
@SQ SN:chrUn_KI270381v1 LN:1930
@SQ SN:chrUn_KI270385v1 LN:990
@SQ SN:chrUn_KI270382v1 LN:4215
@SQ SN:chrUn_KI270376v1 LN:1136
@SQ SN:chrUn_KI270374v1 LN:2656
@SQ SN:chrUn_KI270372v1 LN:1650
@SQ SN:chrUn_KI270373v1 LN:1451
@SQ SN:chrUn_KI270375v1 LN:2378
@SQ SN:chrUn_KI270371v1 LN:2805
@SQ SN:chrUn_KI270448v1 LN:7992
@SQ SN:chrUn_KI270521v1 LN:7642
@SQ SN:chrUn_GL000195v1 LN:182896
@SQ SN:chrUn_GL000219v1 LN:179198
@SQ SN:chrUn_GL000220v1 LN:161802
@SQ SN:chrUn_GL000224v1 LN:179693
@SQ SN:chrUn_KI270741v1 LN:157432
@SQ SN:chrUn_GL000226v1 LN:15008
@SQ SN:chrUn_GL000213v1 LN:164239
@SQ SN:chrUn_KI270743v1 LN:210658
@SQ SN:chrUn_KI270744v1 LN:168472
@SQ SN:chrUn_KI270745v1 LN:41891
@SQ SN:chrUn_KI270746v1 LN:66486
@SQ SN:chrUn_KI270747v1 LN:198735
@SQ SN:chrUn_KI270748v1 LN:93321
@SQ SN:chrUn_KI270749v1 LN:158759
@SQ SN:chrUn_KI270750v1 LN:148850
@SQ SN:chrUn_KI270751v1 LN:150742
@SQ SN:chrUn_KI270752v1 LN:27745
@SQ SN:chrUn_KI270753v1 LN:62944
@SQ SN:chrUn_KI270754v1 LN:40191
@SQ SN:chrUn_KI270755v1 LN:36723
@SQ SN:chrUn_KI270756v1 LN:79590
@SQ SN:chrUn_KI270757v1 LN:71251
@SQ SN:chrUn_GL000214v1 LN:137718
@SQ SN:chrUn_KI270742v1 LN:186739
@SQ SN:chrUn_GL000216v2 LN:176608
@SQ SN:chrUn_GL000218v1 LN:161147
@SQ SN:chrEBV LN:171823
@PG ID:bowtie2 PN:bowtie2 VN:2.4.4 CL:"/Users/alexwhitehead/miniconda3/bin/bowtie2-align-s --wrapper basic-0 -p 8 -X2000 --local -x /Users/alexwhitehead/Applications/bowtie2_index/GRCh38_noalt_as -1 /Users/alexwhitehead/AW_ATAC/fetal_heart_sc_fastq/fastq/SRR11692126_S1_L001_R1_001.fastq.gz -2 /Users/alexwhitehead/AW_ATAC/fetal_heart_sc_fastq/fastq/SRR11692126_S1_L001_R2_001.fastq.gz"
@PG ID:samtools PN:samtools PP:bowtie2 VN:1.15.1 CL:samtools view -H f1.sorted.bam
Do I need to edit the header somehow or is this due to another issue?
Thanks,
Alex
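The header above contains @HD, @SQ and @PG records but no @RG record, which would be consistent with a lookup of an 'RG' key failing. A minimal sketch for checking which record types a header contains, working on the text printed by samtools view -H. The function name header_record_counts is hypothetical and not part of sinto, and the example header is a shortened copy of the one above.

```python
from collections import Counter

def header_record_counts(header_text):
    """Count SAM header record types (HD, SQ, RG, PG, ...) in header text."""
    counts = Counter()
    for line in header_text.splitlines():
        if line.startswith("@"):
            # The record type is the two characters after '@'
            counts[line[1:3]] += 1
    return counts

# Shortened copy of the header shown above
example = "\n".join([
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@PG\tID:bowtie2\tPN:bowtie2\tVN:2.4.4",
])
counts = header_record_counts(example)
print("RG" in counts)  # False: no read group record in this header
```

If no @RG line is present, adding one (for example with samtools addreplacerg) may be worth trying, though whether that resolves this particular error is an assumption.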
Hi, I'm trying to run sinto filterbarcodes -b .bam -p 1 -c .csv and I get this error:
[E::hts_open_format] Failed to open file "8.tmp > 8.bam" : No such file or directory
samtools reheader: fail to open file '8.tmp > 8.bam': No such file or directory
File "/ihome/rlafyatis/rib35/.local/lib/python3.7/site-packages/sinto/filterbarcodes.py", line 58, in mergeAll
raise Exception("samtools merge failed, temp files not deleted")
Exception: samtools merge failed, temp files not deleted
I'm not sure why I'm getting this error and would appreciate any help.
Thank you!
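The file name in the error, "8.tmp > 8.bam", suggests that a shell redirection reached samtools as part of an argument rather than being interpreted by a shell. Whether that is what happens inside sinto is an assumption. The general pattern is sketched below, using the Python interpreter as a stand-in for samtools so that the example runs anywhere; run_to_file and the file names are hypothetical.

```python
import os
import subprocess
import sys
import tempfile

def run_to_file(cmd, out_path):
    """Run cmd (a list of arguments) and write its stdout to out_path.

    Without shell=True, a '>' inside the argument list is passed to the
    program as a literal argument, so redirection is done here through
    the stdout parameter instead.
    """
    with open(out_path, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)

# Stand-in command; with samtools this might be
# ["samtools", "reheader", "header.sam", "8.tmp"] (hypothetical file names).
cmd = [sys.executable, "-c", "print('reheadered')"]
out_path = os.path.join(tempfile.mkdtemp(), "8.bam")
run_to_file(cmd, out_path)
with open(out_path) as fh:
    result = fh.read().strip()
print(result)
```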
Hello,
I tried running sinto on non-10x ATAC-seq data, and the fragment file is empty. I was wondering if you could help me with that?
Here is the output of samtools view merged.bam | head:
(signac_env) -bash-4.2$ samtools view AdultCTX_DNA_merge.bam | head
7001113:989:HTKVHBCX2:2:1105:15600:5528:77:15:82:15:CGGTATTTGG 0 chr1 3000049 1 23M * 0 0 TCTTTGAAGGTCTGGTAGAACTC DDDDDIIIIIIIIIIIIIIIIII AS:i:0
7001113:991:HVWNKBCX2:1:2110:2683:18182:87:93:72:36:CAGGTATGGC 0 chr1 3000132 39 53M * 0 0 GACTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTTTTAGTCCAT DDDDCHIIHGFHIIIIIIIIIIHHHEHIIIIIIIDDGHHHDHHIIIIIIIIHI AS:i:0
7001113:990:HTKL3BCX2:2:2111:14944:4116:87:93:72:36:CAGGTATGGC 0 chr1 3000134 38 52M * 0 0 CTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTTTTAGTCCATG DDDDDIIIIIIIIIIIIIIIIIIIGIIIIIIIDFHHHHHIIIIIIIIIHIII AS:i:0
7001113:991:HVWNKBCX2:2:2105:19568:57493:53:23:77:35:CACTATTTTG 0 chr1 3000159 0 52M * 0 0 GGGGGGGCATGGGACTTTTAGTCCATGAATCTGATCCTGATTTAGCTTTGGT DDDDDIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIHEHIIHGIIIIIH AS:i:-22
7001113:993:HVWMKBCX2:2:1211:1360:3780:53:23:77:35:CACTATTTTG 0 chr1 3000353 37 53M * 0 0 GTTAATTATAGTACAGTCCCTATGCCCTCTAGTTAGTCTGGCTAAGGGTTTAT DDDDDIIHIIIIIIIIIIIHIIIIIIIIIIIIHIGIHIIHHIIIHHIIIHIII AS:i:0
7001113:991:HVWNKBCX2:2:2211:8406:75832:21:06:91:12:TCATCTTTGT 16 chr1 3000464 1 52M * 0 0 TCTTTTTGTTTCCACTTGGTTGATTTCAGCTCTGAGTTTGATTATTTCCTGC IHIHIIIIHHFF@EHIHGCHF<1IHFEEHEIHIIHIIHHHFIIGIHIDDDDD AS:i:0
7001113:989:HTKVHBCX2:2:1211:2292:73091:90:80:26:12:CTGTACGGCT 0 chr1 3000559 42 53M * 0 0 CTTCTAGATTTGCTGTCAGGCTGCTAGTGTATACTCTAGTTTCCTTTTGGAGG DDCDDIIIIIIIIHIIIIIIIIIIIIIHIIIIIIIIHIIIIIIIIIIIIHIII AS:i:0
7001113:991:HVWNKBCX2:1:2205:17457:98814:90:80:26:12:CTGTACGGCT 0 chr1 3000633 30 53M * 0 0 CTCTTAGGACTGCCTCATTGTGCCCCATATGTTTGGCTATGTTGTGGATTTAT DDDDDIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIII AS:i:0
7001113:989:HTKVHBCX2:1:2216:4512:6199:90:80:26:12:CTGTACGGCT 0 chr1 3000747 32 52M * 0 0 ATTAAGTAGAGTATTGTTCAGTTTCCAGGTGAATGTTGGCTTTCTATTATTT DDDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIH AS:i:0
7001113:989:HTKVHBCX2:2:2209:10226:10613:58:87:87:11:TCGAATTTGT 0 chr1 3000919 32 51M * 0 0 ATTTGGTACTGAGAAGAAGGTATATATCCTTTTGTCTTATGATAAAATGTT DDDDDIIIIIIIHIIIIIHIIIHIIIIIIIIIIIIIIGIIIIIIIIIIIII AS:i:0
and this is the code I ran:
sinto fragments -b merged.bam -f fragment
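None of the records above carry a CB tag, and the read names end in what looks like a barcode. A minimal sketch, assuming the barcode is the last colon-separated field of the read name, of how a candidate pattern for --barcode_regex could be tried out in Python before running sinto. How sinto applies the pattern internally is not assumed here. The flags shown (0 and 16) also indicate unpaired alignments, which may matter if fragments are built from read pairs.

```python
import re

read_name = "7001113:989:HTKVHBCX2:2:1105:15600:5528:77:15:82:15:CGGTATTTGG"

# Candidate pattern: everything after the last ':' in the read name
pattern = re.compile(r"[^:]*$")

match = pattern.search(read_name)
barcode = match.group() if match else None
print(barcode)
```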
Hi Tim, @timoast
Thank you for working on this useful tool.
I'm trying to make sure that my barcode file is formatted properly and is able to be read in.
Is it necessary to have a groups column in the cells file?
I am currently using a text file with only one barcode per line.
mramos@super ~/data/10x/pbmc_3k $ cat test_atac_barcodes.txt | head -3
TTAGTCAGTCCTCCCA-1
CATATAGAGTCAAGAC-1
GATATGATCTAATCCG-1
I get a "list index out of range" error.
import os
sinto_dir = os.path.expanduser("~/gh/sinto/sinto")
os.chdir(sinto_dir)
utilspy = os.path.join(sinto_dir, "utils.py")
exec(open(utilspy).read())
read_cell_barcode_file(os.path.expanduser("~/data/10x/pbmc_3k/test_atac_barcodes.txt"))
>>> read_cell_barcode_file(os.path.expanduser("~/data/10x/pbmc_3k/test_atac_barcodes.txt"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<string>", line 204, in read_cell_barcode_file
IndexError: list index out of range
The documentation reads:
File or comma-separated list of cell barcodes. Can be gzip compressed
Thank you for your help!
Best,
Marcel
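The IndexError is consistent with a reader that splits each line and then indexes a second column that a one-column file does not have. A minimal sketch of a more tolerant reader, which is not sinto's implementation: read_barcodes and default_group are hypothetical names, and lines without a group column are assigned a default group.

```python
import io

def read_barcodes(handle, default_group="all"):
    """Map each barcode to its group; one-column lines get default_group."""
    groups = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        barcode = fields[0]
        # A second column is optional in this sketch
        groups[barcode] = fields[1] if len(fields) > 1 else default_group
    return groups

one_column = io.StringIO("TTAGTCAGTCCTCCCA-1\nCATATAGAGTCAAGAC-1\n")
two_column = io.StringIO("TTAGTCAGTCCTCCCA-1\tA\nCATATAGAGTCAAGAC-1\tB\n")
print(read_barcodes(one_column))
print(read_barcodes(two_column))
```

With the current behaviour, adding a second column that holds a single shared group label may sidestep the error.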
Hi Tim,
Thought I'd point out an issue that I noticed. I can't figure out exactly what is happening. Basically, when I run filterbarcodes with -p > 1, certain header entries are duplicated once for every process. Every @PG entry is duplicated, but with a unique string appended to the ID.
@PG ID:minimap2-1FF947E PN:minimap2 VN:2.7-r654 CL:minimap2 -ax splice -t 10 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no genome.fa sc_bams/HP_104_Normal_soup/tmp.fq
@PG ID:minimap2-2E680064-4AC70EAD PN:minimap2 VN:2.7-r654 CL:minimap2 -ax splice -t 10 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no genome.fa sc_bams/HP_104_Normal_soup/tmp.fq
In addition, a separate read group is produced for each process, with a unique string appended to the ID. This BAM should have only one read group (the top one is correct), but it now has 10.
@RG ID:HP_104_Normal LB:1 PL:ILLUMINA SM:HP_104_Normal PU:1
@RG ID:HP_104_Normal-401FEFD5 LB:1 PL:ILLUMINA SM:HP_104_Normal PU:1
@RG ID:HP_104_Normal-75ACD5C2 LB:1 PL:ILLUMINA SM:HP_104_Normal PU:1
@RG ID:HP_104_Normal-17EC0C41 LB:1 PL:ILLUMINA SM:HP_104_Normal PU:1
@RG ID:HP_104_Normal-58171F5E LB:1 PL:ILLUMINA SM:HP_104_Normal PU:1
@RG ID:HP_104_Normal-2AA79EC2 LB:1 PL:ILLUMINA SM:HP_104_Normal PU:1
@RG ID:HP_104_Normal-738A0D0F LB:1 PL:ILLUMINA SM:HP_104_Normal PU:1
@RG ID:HP_104_Normal-4AEA10E1 LB:1 PL:ILLUMINA SM:HP_104_Normal PU:1
@RG ID:HP_104_Normal-57A6356B LB:1 PL:ILLUMINA SM:HP_104_Normal PU:1
@RG ID:HP_104_Normal-5E8828B4 LB:1 PL:ILLUMINA SM:HP_104_Normal PU:1
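The suffixed IDs look like the result of merging per-process files whose @RG and @PG IDs collide. samtools merge documents the -c and -p options for combining colliding @RG and @PG headers instead of renaming them. As a post-hoc cleanup, a minimal sketch follows, assuming every unwanted suffix is one or more groups of a hyphen plus 7 to 8 uppercase hexadecimal characters, as in the listings above. collapse_suffixed_ids is a hypothetical helper, and reads tagged with the suffixed IDs would need matching treatment.

```python
import re

# Assumption: unwanted suffixes are one or more "-XXXXXXXX" groups of 7-8 hex characters
SUFFIX = re.compile(r"(?:-[0-9A-F]{7,8})+$")

def collapse_suffixed_ids(header_lines):
    """Strip suffixes from @RG/@PG IDs and drop records that become duplicates."""
    seen = set()
    kept = []
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        is_rg_or_pg = fields[0] in ("@RG", "@PG")
        if is_rg_or_pg:
            fields = [
                "ID:" + SUFFIX.sub("", f[3:]) if f.startswith("ID:") else f
                for f in fields
            ]
        cleaned = "\t".join(fields)
        if is_rg_or_pg and cleaned in seen:
            continue  # identical record already kept
        seen.add(cleaned)
        kept.append(cleaned)
    return kept

header = [
    "@RG\tID:HP_104_Normal\tLB:1\tPL:ILLUMINA\tSM:HP_104_Normal\tPU:1",
    "@RG\tID:HP_104_Normal-401FEFD5\tLB:1\tPL:ILLUMINA\tSM:HP_104_Normal\tPU:1",
    "@RG\tID:HP_104_Normal-75ACD5C2\tLB:1\tPL:ILLUMINA\tSM:HP_104_Normal\tPU:1",
]
cleaned_header = collapse_suffixed_ids(header)
print(len(cleaned_header))
```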
Hi @timoast,
I am wondering if there is a way to make the program, specifically the fragments function, compatible with SR BAM files.
Thank you very much!
Hi,
Does sinto have functionality to move or copy a cell barcode prepended to the read name into a tag field in BAM files?
Thanks!
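A minimal sketch of the operation being asked about, working on a single SAM text line rather than through sinto. It assumes the barcode is the first colon-separated field of the read name and copies it into a CB:Z: tag. name_barcode_to_tag and the example record are hypothetical.

```python
def name_barcode_to_tag(sam_line, tag="CB", sep=":"):
    """Copy a barcode prepended to the read name into an optional tag field."""
    fields = sam_line.rstrip("\n").split("\t")
    barcode = fields[0].split(sep, 1)[0]
    # SAM optional fields have the form TAG:TYPE:VALUE; Z marks a string value
    fields.append(f"{tag}:Z:{barcode}")
    return "\t".join(fields)

# Hypothetical record with the barcode prepended to the read name
record = "ACGTACGTACGTACGT:read1\t0\tchr1\t100\t30\t4M\t*\t0\t0\tACGT\tFFFF"
tagged = name_barcode_to_tag(record)
print(tagged.split("\t")[-1])
```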
Hi Tim,
Thanks for developing this tool! It's very handy. Just one question: if I want to merge the BAM files from multiple samples after running sinto filterbarcodes, how can I avoid the issue arising from the same cell barcodes being used in different BAM files? Can I add a prefix (e.g. sample ID) to the cell barcode name when I run sinto filterbarcodes?
Many thanks!
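One way to keep barcodes from different samples distinct is to rewrite the barcode tag with a sample prefix before merging. A minimal sketch on SAM text lines, assuming barcodes are stored in a CB:Z: tag. prefix_barcode and the sample ID are hypothetical, and nothing here is taken from sinto.

```python
def prefix_barcode(sam_line, sample_id, tag="CB"):
    """Prepend sample_id to the value of the barcode tag in one SAM record."""
    marker = f"{tag}:Z:"
    fields = sam_line.rstrip("\n").split("\t")
    # Mandatory SAM columns are the first 11; optional tags follow
    for i in range(11, len(fields)):
        if fields[i].startswith(marker):
            fields[i] = f"{marker}{sample_id}_{fields[i][len(marker):]}"
    return "\t".join(fields)

record = "read1\t0\tchr1\t100\t30\t4M\t*\t0\t0\tACGT\tFFFF\tCB:Z:AGCATACTCAATCACG-1"
prefixed = prefix_barcode(record, "sampleA")
print(prefixed.split("\t")[-1])
```

The barcode list passed to later steps would need the same prefix so that the two stay in sync.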
Hi,
Thank you for developing this tool.
I am running this tool on real-time CellRanger output where the BAM file is 35 GB, and I am getting this error: Failed to open file "sam_files/T10/AGGTCATTCCTAAGTG-1_7G1BJ1" : Too many open files.
Do you have any fix for this?
Regards,
Nitin N.
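The error indicates that the process hit its limit on open file descriptors, which can happen when many temporary files are open at once, for example one per barcode. A minimal sketch, for Unix-like systems only, of inspecting the soft limit and raising it toward the hard limit with the standard resource module; the shell equivalent is ulimit -n. raise_open_file_limit is a hypothetical helper, and whether a higher limit is enough for an input of this size is an assumption.

```python
import resource

def raise_open_file_limit(target):
    """Raise the soft limit on open files toward target, capped at the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough; leave it unchanged
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target

new_soft = raise_open_file_limit(4096)
print(new_soft)
```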
Hi! I love your tool and I've been using it quite extensively. I have found an error with sinto addtags on a transcriptome-aligned BAM file. I believe the nature of the error is that it treats every transcript as a chromosome, and there are many thousands compared to a 'normal' number of chromosomes. Is there a better way to handle breaking chromosomes without recursion?
sinto addtags -b Enrichment_PCR_A1-1A_t.bam -m readname -o Enrichment_PCR_A1-1A_t.tagged.bam -p 88 -f sinto_tagfile.txt
Function run_addtags called with the following arguments:
bam Enrichment_PCR_A1-1A_t.bam
tagfile sinto_tagfile.txt
output Enrichment_PCR_A1-1A_t.tagged.bam
trim_suffix False
sam False
nproc 88
mode readname
func <function run_addtags at 0x7f4837808d30>
Traceback (most recent call last):
File ".../lib/python3.9/site-packages/sinto/utils.py", line 75, in find_chromosome_break
return find_chromosome_break(position, chromosomes, current_chrom + 1)
File ".../lib/python3.9/site-packages/sinto/utils.py", line 75, in find_chromosome_break
return find_chromosome_break(position, chromosomes, current_chrom + 1)
File ".../lib/python3.9/site-packages/sinto/utils.py", line 75, in find_chromosome_break
return find_chromosome_break(position, chromosomes, current_chrom + 1)
[Previous line repeated 996 more times]
File ".../lib/python3.9/site-packages/sinto/utils.py", line 71, in find_chromosome_break
if position <= chromosomes[current_chrom]:
RecursionError: maximum recursion depth exceeded in comparison
Attempting to run with 1 processor resulted in the same error.
bam file:
d3ce747b-2355-4385-9d2e-5d1f308a5e8b 256 MSTRG.1.1 1 0 64S85M1I10M1D46M3D78M1I83M2D4M3D30M3D12M613N22M1I144M11D5M3I20M1I29M3D62M1D4M2D3M1D111M1D13M1I23M63S * 0 0 * * NM:i:65 ms:i:513 AS:i:489 nn:i:0 ts:A:+ tp:A:S cm:i:124 s1:i:589 de:f:0.0537 MD:Z:41C37C1G13^T18A1G0A24^CCA0C41A0C9G24G3G2G75^CG4^GAG30^CAC113G4T33T25^ACCCCCCAACA7G19A26^GGG2A13G12C8A23^A4^CG3^C16G52T19A21^C6A29 rl:i:0
4909bb1f-6734-4e67-86e2-2f6dce0feff9 0 MSTRG.1.1 1 0 65S100M1D44M1D63M1D9M5D11M1D45M2I42M1I27M1I4M2D12M1D11M1D30M2D11M5D25M1D1M3I11M2D53M64S * 0 0 TATTGTACTTCGTTCAGTTACGTATTGCTGGGGAAGCAGTGGTACAACGCAGGTACATGGGGCCTCCGGAAGTGCGGATCCCAGCGGCAGTCGTGTAGCTGAGCAGGCCTGGGGCTTGGTTCTATGTCCCTGTGGCTATGTTTCCAGTGTCCTCTGGGTGTTTCCAGAGCAACAAGAAACGAATAAATCTCTGCCCCGCAGCGCCTCCACCCAGAGACCCGGACCAAATTCACACAGGACAATCTGTGCCGTGCCCAGCGCAAGCGCCTGGATCGGCCAACGGACCTTGTGGTTCCATCCTTCGCGACACCTCCGAGGACCTGGGACTCCAGTGTGAAGCGCCGTGAACCTGGCCTTCGGGCGCCGCTGTGAGGAACTGGAGGGACGCGCGGCACAAGCTGCAGCACCACCCTGCAAGGTGGGGCACCTGAACCCCGAGACGGCTCCGGCGCCTGCCCACCACTCTGCCCTCTCATCACCTGGGCCCTCAACCCTTTTCTCCTTATATTCCCACTTGTCGAGGGACCCCAGAAGCAAGTGTCACCTCTCCATAAACCTATGTAAAACCAGGCAAAAAAAAAAAAAAAAAAAAACATTTGTAGATCTTCACGGAGCGGAAGAGATCGGAAAGAGCA $#%'07:8==<?CA@=?AB?7;<===%2432))/.%%%;=;31-*+*../.)775=@?B@@A==@::?=:;6??46:=<4454.11'$$%4*150,,&--0.$%%,9:;9;=;74578:<<;:3128?>578880-'-%(4><+<469:>:4<::76.4/::<823'9;;<57871A=;11142:79:878859865,//,.+/...3*+-*+.333955.460/0+(24499955;?=>?>@:<;8=1379@@@>;:/.00287854221/&)')10.&#$('%$$%'*'''-((1>=9:?D;67<;B>?A<;>=99;9:;@ABE@:BB9;7/(+)&*45999-,6:5;=::?>>>896:77=>)%%7&>@8>DCEC388/29</3=@@C@==CFC5><522-*+3379:44-,+'056&-212277697400/.&',2448<=>969609861-.,)-046;:61/0132)6332.-+(')).2-+/+-%*&+6<>;<:0'(&&'&)56==8<7'/145.1>?<<>DFAE728<9>?@=<?>>?2><>AAIF?;>DB?B?ABEB;=<<B=?BG@B@D@A@C??<=7978;/59898,445:66<=96356<?8:>G;7/58>949;<1.0,'& NM:i:46 ms:i:293 AS:i:293 nn:i:0 ts:A:+ tp:A:P cm:i:68 s1:i:350 s2:i:350 de:f:0.0621 MD:Z:23G17C58^A44^C18G22A0C9G10^C9^CTGTG2T0G2G4^A0C1G115^CA12^C11^A30^CC11^CCTGG14A1C8^C1G8A1^CC32G20 rl:i:25
35e30db5-064c-4fb7-b116-81d590761199 256 MSTRG.1.1 1 0 76S26M1I22M4I2M2D2M1I6M1I35M1D22M2D6M3D16M2D9M7D6M6I87M1D26M1D91M1D8M3I21M1D38M1D31M1D40M63S * 0 0 * * NM:i:74 ms:i:172 AS:i:172 nn:i:0 ts:A:+ tp:A:S cm:i:42 s1:i:255 de:f:0.1035 MD:Z:23G17C8^TT8G5G28^T15G2A3^AT0A5^CTG2C1G11^CC9^CGGACCA7A16A0C9G25G2G2G18G0A1A0C2^C26^C20G0C0C27A0C0A3T0G2G19G10^C15G13^A24C0T2G9^C31^C24G15 rl:i:0
27cb0203-4357-4c90-a809-0a29640741b6 256 MSTRG.1.1 1 0 67S132M2I68M2D79M3D80M250S * 0 0 * * NM:i:15 ms:i:294 AS:i:294 nn:i:0 ts:A:+ tp:A:S cm:i:86 s1:i:322 de:f:0.0304 MD:Z:41C91G53A0C9G1^GC25G2G50^CGC1G78 rl:i:0
260f47db-336a-49ab-9fb3-93310a1377a1 256 MSTRG.1.1 1 0 62S64M2I37M5D9M4I16M1I2M3D9M3D23M1I11M1D15M1D10M1I3M1I3M2I22M1D27M1D34M1D8M5D65M2D14M5D10M1D33M1D25M2D4M1D32M95S * 0 0 * * NM:i:74 ms:i:154 AS:i:154 nn:i:0 ts:A:+ tp:A:S cm:i:58 s1:i:273 de:f:0.1024 MD:Z:41C22C1G34^AGAGC5G3C17^GCA7C1^CCC34^G4A0C4A0G0C0G1^G3G21G2G2G6^T0G10G7C2G0G3^T0G33^G0G7^TGTGA65^GA14^CCTGC6C0T2^G0C2C12G16^C25^GA4^C24G7 rl:i:22
462e1896-1f91-42b5-901c-db9f4fb92c16 256 MSTRG.1.1 1 0 97S90M1D51M1D28M1I2M2I78M3I7M1I17M2D83M613N100M5D62M1I16M3I8M1D7M2I5M1D28M2D48M2I2M1D6M3D8M1I5M3I43M1D70M82S * 0 0 * * NM:i:59 ms:i:497 AS:i:473 nn:i:0 ts:A:+ tp:A:S cm:i:166 s1:i:629 de:f:0.0536 MD:Z:90^G51^C16A0C1A89A5G16^GT27G152T0G1^AGCTG0C3C0A73G6^A4G0G6^A28^GA50^T3C2^GGA13G42^C6T0A10C0A1C0A47 rl:i:0
tagfile
d129905b-f46c-4287-be0b-78bcfbc33d41 CB ATCTTGACCTGCAACG
0b337571-132f-4f0a-b4dd-b5df16d9654b CB ACGTTATTGGTCACTC
4fa9ab30-90ea-4041-a30a-18368e288a08 CB CAACGTGGTGGAGTCT
b6105613-6ccb-466b-9896-12bba3f9b999 CB AGCGACCAACGATATT
77ee2019-2acd-4cc9-8345-51f88617466d CB GTTACCTACAACTTGC
Thank you! In the meantime, I may hack together some bash script to operate on the sam file.
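The traceback shows find_chromosome_break calling itself once per reference sequence, so a BAM file with thousands of transcripts as references exceeds Python's default recursion limit. A minimal iterative sketch of the same search, assuming chromosomes is a sorted list of cumulative end positions, as the comparison position <= chromosomes[current_chrom] suggests. The name find_break_index and its return convention are hypothetical; the rest of sinto's function is not shown in the traceback and is not reproduced here.

```python
from bisect import bisect_left

def find_break_index(position, chromosomes):
    """Index of the first entry in chromosomes that is >= position.

    chromosomes is assumed to be a non-decreasing list of cumulative lengths,
    so a binary search replaces the one-call-per-chromosome recursion.
    """
    index = bisect_left(chromosomes, position)
    if index == len(chromosomes):
        raise ValueError("position lies beyond the last chromosome")
    return index

# 5000 references of length 1000 each: cumulative ends 1000, 2000, ...
cumulative = [1000 * (i + 1) for i in range(5000)]
print(find_break_index(4_500_500, cumulative))
```

A binary search also keeps the cost logarithmic in the number of references, which matters when every transcript counts as one.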
Similar to, but different from, issue #15.
This is my original single-cell BAM file:
@RG ID:CH4-LN_2_L001 SM:CH4-LN_2 LB:CH4-LN_2 PL:illumina
@RG ID:CH4-LN_2_L002 SM:CH4-LN_2 LB:CH4-LN_2 PL:illumina
@RG ID:CH4-LN_2_L003 SM:CH4-LN_2 LB:CH4-LN_2 PL:illumina
@RG ID:CH4-LN_2_L004 SM:CH4-LN_2 LB:CH4-LN_2 PL:illumina
@RG ID:CH4-LN_2_L005 SM:CH4-LN_2 LB:CH4-LN_2 PL:illumina
@RG ID:CH4-LN_2_L006 SM:CH4-LN_2 LB:CH4-LN_2 PL:illumina
@RG ID:CH4-LN_2_L007 SM:CH4-LN_2 LB:CH4-LN_2 PL:illumina
@PG ID:STAR PN:STAR VN:2.7.4a CL:STAR --runThreadN 8 --genomeDir /data/wangzw/dropseqMetadata_b37/STAR --readFilesIn unaligned_mc_tagged_polyA_filtered.fastq --outFileNamePrefix star --outReadsUnmapped Fastx --twopassMode Basic
@PG ID:0 PN:TagReadWithGeneFunction CL:TagReadWithGeneFunction INPUT=merged.bam OUTPUT=star_gene_exon_tagged.bam ANNOTATIONS_FILE=/data/wangzw/dropseqMetadata_b37/b37.refFlat GENE_NAME_TAG=gn GENE_STRAND_TAG=gs GENE_FUNCTION_TAG=gf READ_FUNCTION_TAG=XF USE_STRAND_INFO=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false VN:2.3.0(34e6572_1555443285)
ST-E00522:612:H2WHKCCX2:7:1220:14093:56985 16 1 10166 0 49M44N32M * 0 0 CCCTAACCCTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAAGCCCTAACCCTAACCCTAACCCTAACCCTAACCC JJJJJJJJFJJFJJJFAJJJJAJJJJJJ7JJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFAJJJJJFJJJJFFFAA XC:Z:CCTTGTCGACTC MD:Z:47C33 XF:Z:INTERGENIC PG:Z:STAR RG:Z:CH4-LN_2_L003 NH:i:5 NM:i:1 XM:Z:TTGGCCTC ZP:i:82 UQ:i:41 AS:i:79
ST-E00522:612:H2WHKCCX2:7:2216:25540:11839 16 1 11743 1 150M * 0 0 TGACGATTTTGCTGCATGGCCGGTGTTGAGAATGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCT <7<-A-JJA7FFAAA))7)7A-7AFF7JA<F7JJ<JJJJFFA-<AJJJFFJJFFF--777JA<FA-7JF7AJJFJJJJJJJFJJFJJ<J7J<JJJJFJFJF<<FJJFJ<JJJJJJJJFJFFFJJJJJFA7-JJAFF<JJJFFJJJAAAAA XC:Z:GAGACGAGGCCC MD:Z:3T146 XF:Z:INTERGENIC PG:Z:STAR RG:Z:CH4-LN_2_L003 NH:i:3 NM:i:1 XM:Z:ACTGGCCA UQ:i:12 AS:i:146 gf:Z:CODING,INTERGENIC gn:Z:DDX11L1,DDX11L1 gs:Z:+,+
ST-E00522:612:H2WHKCCX2:6:1124:17797:16041 16 1 11762 1 150M * 0 0 CCGGTGTTGAAAATGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA XC:Z:CTTCTTCCGCTT MD:Z:10G139 XF:Z:INTERGENIC PG:Z:STAR RG:Z:CH4-LN_2_L007 NH:i:3 NM:i:1 XM:Z:ATGGAGGA UQ:i:41 AS:i:146 gf:Z:CODING,INTERGENIC gn:Z:DDX11L1,DDX11L1 gs:Z:+,+
It has 7 read groups, CH4-LN_2_L001 to CH4-LN_2_L007.
I want to extract the reads tagged with 100 cell barcodes, and here is my command:
sinto filterbarcodes -b star_gene_exon_tagged.bam -c xcForTest.txt --barcodetag XC -p 16
xcForTest.txt has 100 cell barcodes, like this:
ACCGTCAGCGAT subset
GTTCAGAATAGC subset
GCAACACGAGTG subset
GCTTCACCCTTA subset
TCGATCCACGAG subset
CACGCCAATTAG subset
CGACCGGGAAAA subset
CAAGCATATGCA subset
CTCATGTTGTAG subset
TCCTCCGACCCA subset
...
When I checked the output BAM file, subset.bam, with this command:
samtools view -h ./subset.bam | grep "CH4-LN_2_L007-" | less
I found some unexpected records like:
ST-E00522:612:H2WHKCCX2:6:2112:30259:10521 16 1 197113101 255 25M * 0 0 GCTCTTCTGCATTTCCTAGTAATAT JJJJJJJJJJJJJJJJJJJJFFFAA XC:Z:CTAACAAGTTCT MD:Z:25 XF:Z:CODING NH:i:1 NM:i:0 XM:Z:TGAGCGGG ZP:i:26 UQ:i:0 AS:i:24 gf:Z:CODING gn:Z:ASPM gs:Z:- RG:Z:CH4-LN_2_L007-364D2CCB PG:Z:STAR-7303B068
ST-E00522:612:H2WHKCCX2:6:2103:22922:9027 16 1 197113183 255 48M2040N102M * 0 0 TTCTTTAATTACTCTCCACTTAACAGAAATAACAATTTTCTCTTTAGGCTGCAACACGAAACAGCGCTGCGACACACTGAAGCCCAGGTCCGCGGCCGGGAAGTGGGAGATCTTCACTTCTGCCACCTCCTCGTTAGGGTTGTCTAGGGC <FF7---7-7<A<FJFF<-JJJJJAJFFAFFAFJFJAF7JJFFA7-7-AFJJFJAJAJJJ<JFJJFFAJJFFA<FJJFFA7AA77--AAA7A-7AJ-JFJJA7-<AJJJJJJJAJJJJJJJJJJ7JJJJJJJJJJJJJJJJJJJJFFFAA XC:Z:CTAACAAGTTCT MD:Z:6G1G1G0G1G4G131 XF:Z:CODING NH:i:1 NM:i:6 XM:Z:TGAGCGGG UQ:i:132 AS:i:137 gf:Z:CODING gn:Z:ASPM gs:Z:- RG:Z:CH4-LN_2_L007-364D2CCB PG:Z:STAR-7303B068
ST-E00522:612:H2WHKCCX2:6:2208:13829:28611 16 1 197122849 255 150M * 0 0 AAGTAAAACAAAGAACTAGTTCAATATACAGTACACTTCCTACTCTTCACAGAGAACTGAAATTTTCTATAAAGACATTTATACTTAGGAAACATCAGACAACCAAAGTATGTATAAAACTCACAAGATATTTTACACACAGTTCACAAT AA--A<-F<F<<7<F7-77<<JFF<F7<-A<FFF77-FFF<-AFAAFJJJJJJFJJFJJJAAFJFJJFJFJJJJFJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA XC:Z:GTGATGTCGGCT MD:Z:6C143 XF:Z:UTR NH:i:1 NM:i:1 XM:Z:GCGTGAGG UQ:i:12 AS:i:146 gf:Z:UTR gn:Z:ZBTB41 gs:Z:- RG:Z:CH4-LN_2_L007-364D2CCB PG:Z:STAR-7303B068
ST-E00522:612:H2WHKCCX2:6:2105:16122:62505 16 1 197168944 255 150M * 0 0 CTTTTTATAACAAAAATGTCTACTACAGAATTTGCACTGATGATTATTTGATAGTCTTCCAGTTAATTCATTTAGTGTTTCTTCTGGTGATGACTTTTCACTTAGCTCTGAATGAAAAGGGGCAACATTTTCGTTATTTAACAACTTCAC FA<-<7JJJJJJJJJJJJFAJJJJFAJJJJJFJAAFJJJJJJJJFFJJJJJJJJJJFJJJJJJJFJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJAJFJJJJAF7JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFAFAA XC:Z:GAGTCGAGCGAA MD:Z:100G49 XF:Z:CODING NH:i:1 NM:i:1 XM:Z:AGGGGCGG UQ:i:37 AS:i:146 gf:Z:CODING gn:Z:ZBTB41 gs:Z:- RG:Z:CH4-LN_2_L007-364D2CCB PG:Z:STAR-7303B068
A suffix such as '-364D2CCB' was appended to the RG tag (and '-7303B068' to the PG tag), which affects my downstream processing.
Is this a bug?
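In the records above the suffixes sit on the RG and PG tags, which resembles the renaming samtools merge applies when header IDs collide. A minimal sketch for stripping them from a record's optional fields, assuming each suffix is a hyphen plus 7 to 8 uppercase hexadecimal characters appended to the original ID. strip_id_suffix is a hypothetical helper, and the record below is shortened from the output above.

```python
import re

SUFFIX = re.compile(r"-[0-9A-F]{7,8}$")  # assumed form of the appended suffix

def strip_id_suffix(sam_line, tags=("RG", "PG")):
    """Remove the assumed suffix from the RG:Z: and PG:Z: values of one record."""
    fields = sam_line.rstrip("\n").split("\t")
    for i in range(11, len(fields)):  # optional fields start at column 12
        tag, _, rest = fields[i].partition(":")
        if tag in tags and rest.startswith("Z:"):
            fields[i] = f"{tag}:Z:{SUFFIX.sub('', rest[2:])}"
    return "\t".join(fields)

record = "\t".join([
    "read1", "16", "1", "197113101", "255", "25M", "*", "0", "0",
    "GCTCTTCTGCATTTCCTAGTAATAT", "JJJJJJJJJJJJJJJJJJJJFFFAA",
    "XC:Z:CTAACAAGTTCT", "RG:Z:CH4-LN_2_L007-364D2CCB", "PG:Z:STAR-7303B068",
])
cleaned = strip_id_suffix(record)
print(cleaned.split("\t")[-2:])
```

The header's @RG and @PG lines would need the same cleanup so that the tag values still refer to declared IDs.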
Hi Tim, @timoast
Thanks for helping me get the --cells input figured out.
I've tested this with 10 barcodes and the function mentioned in #44 works well with this format.
Now, running the full filterbarcodes command, I get an error that I cannot understand.
I am using the ATAC data from 10X.
This is my script:
export sinto=$HOME/.local/bin/sinto
export BAM_FILE="$HOME/data/10x/pbmc_3k/pbmc_granulocyte_sorted_3k_atac_possorted_bam.bam"
export TEST_BCODES="$HOME/data/10x/pbmc_3k/test_atac_barcodes.tsv"
cd ~/data/10x/pbmc_3k/sinto_filterbarcodes
$sinto filterbarcodes \
--bam $BAM_FILE \
--cells $TEST_BCODES \
--nproc 10 \
--barcodetag "CB"
My barcodes .tsv file:
mramos@supermicro ~/data/10x/pbmc_3k/sinto_filterbarcodes $ cat $TEST_BCODES | head -3
"TTAGTCAGTCCTCCCA-1" "A"
"CATATAGAGTCAAGAC-1" "B"
"GATATGATCTAATCCG-1" "C"
The output of the command:
Function run_filterbarcodes called with the following arguments:
bam ./data/10x/pbmc_3k/pbmc_granulocyte_sorted_3k_atac_possorted_bam.bam
cells ./data/10x/pbmc_3k/test_atac_barcodes.tsv
trim_suffix False
nproc 10
barcode_regex None
barcodetag CB
func <function run_filterbarcodes at 0x7fdd4fb03670>
[E::hts_open_format] Failed to open file "TTAGTCAGTCCTCCCA-1_L99RMI" : No such file or directory
samtools merge: fail to open "TTAGTCAGTCCTCCCA-1_L99RMI": No such file or directory
[E::hts_open_format] Failed to open file "TTAGTCAGTCCTCCCA-1.tmp" : No such file or directory
samtools reheader: fail to open file 'TTAGTCAGTCCTCCCA-1.tmp': No such file or directory
Traceback (most recent call last):
File "/.local/bin/sinto", line 8, in <module>
sys.exit(main())
File "/.local/lib/python3.8/site-packages/sinto/arguments.py", line 457, in main
options.func(options)
File "/.local/lib/python3.8/site-packages/sinto/utils.py", line 23, in wrapper
func(args)
File "/.local/lib/python3.8/site-packages/sinto/cli.py", line 17, in run_filterbarcodes
filterbarcodes.filterbarcodes(
File "/.local/lib/python3.8/site-packages/sinto/filterbarcodes.py", line 119, in filterbarcodes
mergeAll(idents=idents, classes=unique_classes, nproc=nproc, header = headerfile, remove=True)
File "/.local/lib/python3.8/site-packages/sinto/filterbarcodes.py", line 58, in mergeAll
raise Exception("samtools merge failed, temp files not deleted")
Exception: samtools merge failed, temp files not deleted
Thank you for taking a look!
Best regards,
Marcel
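The barcode file shown above has every field wrapped in double quotes, which is the default when writing tables from R. Whether the quotes cause the failure here is an assumption, but a plain tab-delimited file without quotes is easy to produce and removes one variable. A minimal sketch for normalising such lines; normalise_barcode_lines is a hypothetical helper.

```python
def normalise_barcode_lines(lines):
    """Strip double quotes and rewrite each line as tab-separated fields."""
    cleaned = []
    for line in lines:
        # Split on any whitespace, then drop surrounding quotes from each field
        fields = [field.strip('"') for field in line.split()]
        if fields:
            cleaned.append("\t".join(fields))
    return cleaned

raw = ['"TTAGTCAGTCCTCCCA-1" "A"\n', '"CATATAGAGTCAAGAC-1" "B"\n']
normalised = normalise_barcode_lines(raw)
print(normalised)
```

In R, write.table(..., quote = FALSE, sep = "\t") avoids the quotes at the source.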
Hi,
I am trying to use the sinto barcode command to add the cell barcodes stored in R2.fastq to the R1 and R3 scATAC-seq reads. Here is the command I am using. The command gives the right output, but it uses ~30G of RAM per sample, which means I will have to allocate a lot of memory when running samples in parallel. It would be helpful if you could suggest any potential solutions. Thanks!
$ sinto barcode --barcode_fastq R2.fastq --read1 R1.fastq --read2 R3.fastq -b 16
The size of certain sample files:
13G R1.barcoded.fastq
20G R3.barcoded.fastq
15G R1.fastq
7.4G R2.fastq
15G R3.fastq
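Memory use for this kind of task does not have to grow with file size if the barcode and read files are consumed record by record in lockstep. A minimal sketch of that streaming approach, which is not sinto's implementation: the function names are hypothetical, the first `bases` bases of the barcode read are assumed to be the cell barcode (as -b 16 suggests), and the barcode is prepended to the read name with a colon.

```python
import io
from itertools import islice

def fastq_records(handle):
    """Yield (name, sequence, plus, quality) tuples, one record at a time."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        yield tuple(record)

def add_barcodes(barcode_handle, read_handle, out_handle, bases=16):
    """Prepend the first `bases` bases of each barcode read to the paired read name."""
    pairs = zip(fastq_records(barcode_handle), fastq_records(read_handle))
    for (_, bc_seq, _, _), (name, seq, plus, qual) in pairs:
        barcode = bc_seq[:bases]
        # name starts with '@'; keep the '@' and insert the barcode after it
        out_handle.write(f"@{barcode}:{name[1:]}\n{seq}\n{plus}\n{qual}\n")

barcodes = io.StringIO("@r1\nACGTACGTACGTACGTTT\n+\nFFFFFFFFFFFFFFFFFF\n")
reads = io.StringIO("@r1\nGATTACA\n+\nFFFFFFF\n")
out = io.StringIO()
add_barcodes(barcodes, reads, out)
print(out.getvalue())
```

Only one record per input file is held in memory at a time, so the footprint stays roughly constant regardless of FASTQ size.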
Hi,
Could you help me with speeding up my sinto runs? How does it scale with data size? I have noticed that it runs for a very long time when the input is a large BAM file. Part of this could be that at no stage does it seem to use multiple processors, even though I have provided a high number with the -p option. Is it possible that something was missed during installation, so that parallelization is not available to sinto on my system?
Thanks
Hi all,
I'm trying to run sinto on our local cluster. We have sinto v0.7.3.1 available with python 3.8.6. GCC v10.2.0 and OpenMPI v4.0.5 are loaded in the background. I use the following code to generate my fragments file:
sinto fragments -p 8 \
-b /dir/file.bam \
-f /dir/file.bed \
--barcode_regex "[^:]*" \
--use_chrom "*"
This generates the following output (with errors):
Function run_fragments called with the following arguments:
bam /dir/file.bam
fragments /dir/file.bed
min_mapq 30
nproc 8
barcodetag CB
cells None
barcode_regex [^:]*
use_chrom *
max_distance 5000
min_distance 10
chunksize 500000
func <function run_fragments at 0x2b3f2e8843a0>
Traceback (most recent call last):
File "/opt/ebsofts/sinto/0.7.3.1-foss-2020b-Python-3.8.6/bin/sinto", line 8, in <module>
sys.exit(main())
File "/opt/ebsofts/sinto/0.7.3.1-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/sinto/arguments.py", line 346, in main
options.func(options)
File "/opt/ebsofts/sinto/0.7.3.1-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/sinto/utils.py", line 21, in wrapper
func(args)
File "/opt/ebsofts/sinto/0.7.3.1-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/sinto/cli.py", line 45, in run_fragments
fragments.fragments(
File "/opt/ebsofts/sinto/0.7.3.1-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/sinto/fragments.py", line 470, in fragments
chrom = utils.get_chromosomes(bam, keep_contigs=chromosomes)
File "/opt/ebsofts/sinto/0.7.3.1-foss-2020b-Python-3.8.6/lib/python3.8/site-packages/sinto/utils.py", line 134, in get_chromosomes
pattern = re.compile(keep_contigs)
File "/opt/ebsofts/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/re.py", line 252, in compile
return _compile(pattern, flags)
File "/opt/ebsofts/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/re.py", line 304, in _compile
p = sre_compile.compile(pattern, flags)
File "/opt/ebsofts/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/opt/ebsofts/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/sre_parse.py", line 948, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
File "/opt/ebsofts/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "/opt/ebsofts/Python/3.8.6-GCCcore-10.2.0/lib/python3.8/sre_parse.py", line 668, in _parse
raise source.error("nothing to repeat",
re.error: nothing to repeat at position 0
srun: error: node245: task 0: Exited with exit code 1
We suspect this may be caused by the Python version, but we are not sure. Could there be another reason why these errors are produced?
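The final frames of the traceback show the value of --use_chrom being passed to re.compile, so it is interpreted as a regular expression rather than a shell-style wildcard. In regular expression syntax a bare * has nothing to repeat, and this does not depend on the Python version; .* matches any contig name. A minimal sketch reproducing the error with the standard library alone (the helper name compiles is hypothetical):

```python
import re

def compiles(pattern):
    """Return True if pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error as err:
        print(f"{pattern!r}: {err}")
        return False

print(compiles("*"))    # a bare quantifier is not a valid regex
print(compiles(".*"))   # valid: any character, repeated zero or more times
print(bool(re.compile(".*").match("chr1_KI270706v1_random")))
```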
Hi, I'm new to using this tool and not sure if it worked correctly.
I have a sorted BAM file of ATAC-seq data that's ~100 GB, and 12 clusters that I'm looking to subset the data into. The filterbarcodes function ran with no errors, but I am having trouble understanding the output. The 12 BAM files were all tiny in comparison to the original BAM file (each about 2 kilobytes), whereas I expected each to be roughly 1/12 the size of the original.
Also, each bam file (ex. 0.bam) seems to be accompanied by another binary file named something like 0_7U58WO (also around 2 kb), that I can't figure out how to read or what it is.
I'm not sure how to make sense of this because there were no errors indicated while running the program. Anything that could shed light on these unexpected results would be helpful.
Hi Tim,
Thank you so much for working on this tool! We have been using the filterbarcodes function to demultiplex BAM files (to rerun the read alignment step, starting from the output BAM from CellRanger). For our particular application, we are interested in unmapped reads, but unfortunately the filterbarcodes function does not seem to carry over the unmapped reads after demultiplexing (i.e., the input BAM file has unmapped reads, but the resulting demultiplexed BAMs have 0 unmapped reads). Is it possible to tweak the function to have an option to keep unmapped reads? We would greatly appreciate it!
Thanks again,
Joyce
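In the SAM format a read is marked unmapped by bit 0x4 of the FLAG field. Unmapped reads without a mapped mate typically have no reference name or position, so code that walks a BAM file chromosome by chromosome would not visit them; whether that is why filterbarcodes drops them is an assumption. A minimal sketch for counting unmapped records in SAM text, useful for comparing input and output; count_unmapped is a hypothetical helper and the records are made up.

```python
UNMAPPED = 0x4  # SAM FLAG bit: segment unmapped

def count_unmapped(sam_lines):
    """Count alignment records whose FLAG has the unmapped bit set."""
    total = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        flag = int(line.split("\t")[1])
        if flag & UNMAPPED:
            total += 1
    return total

records = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\tchr1\t100\t30\t4M\t*\t0\t0\tACGT\tFFFF",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",
    "r3\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",
]
print(count_unmapped(records))
```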
Hi - I'm new to sinto and I'm unable to produce a fragments file. My cell barcodes are in the read names of my BAM file, between the first and second underscore, and my chromosomes have numeric names, with no "chr" at the beginning. So, I need to know what the regex patterns would look like for the --barcode_regex and --use_chrom options, and how to stop --barcodetag from using its default of "CB". Here are the first 10 rows of my BAM file:
E00558:642:HFL3TCCX2:8:2106:29812:51834_ATTGAATTACAGCCGTCTTACACTGA_ATGCCATTCT 163 10 3100324 0 87M = 3100324 87 CATTTACACAATGGAATACTACTCAGCTATTAAAAAATGAATTTATGAAATTCCTAGGCAAATGGATGGACCTGGAGGGTATCATCC JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:87 MC:Z:87M AS:i:87 XS:i:87 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2106:29812:51834_ATTGAATTACAGCCGTCTTACACTGA_ATGCCATTCT 83 10 3100324 0 87M = 3100324 -87 CATTTACACAATGGAATACTACTCAGCTATTAAAAAATGAATTTATGAAATTCCTAGGCAAATGGATGGACCTGGAGGGTATCATCC JJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA NM:i:0 MD:Z:87 MC:Z:87M AS:i:87 XS:i:87 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2206:16802:5212_TGAAGAAACCGTTTGTTTACACAACA_GCNATCCATC 99 10 3100767 0 62M = 3100767 62 ATGCCGGGGCCTAGCAAACACAGAAGTGGATGATCACAGTCAGCTATTGGATGGGTCACACG AAFFFJJJJJJJJJJJ-JJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFJJJJJJ NM:i:0 MD:Z:62 MC:Z:43S62M AS:i:62 XS:i:62 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2206:16802:5212_TGAAGAAACCGTTTGTTTACACAACA_GCNATCCATC 147 10 3100767 0 43S62M = 3100767 -62 AGTTTCTTCATCGTCGGCAGCGTCAGATGTGTATGAGATACAGATGCCGGGGCCTAGCAGACACAGAATTGGATGATCACAGTCAGCTATTGGATGGGTCACACG <FFFJ<-JFFFJJJJJJF77F-JJJF<-J-JAJF-JF7AA-7F-JJJJJ7JF-FAA7-<-F<-<A-<F-J7JJJJJJJ7FJF-F7JJJJJJJF-JAAJAFF-JF- NM:i:2 MD:Z:16A8G36 MC:Z:62M AS:i:52 XS:i:57 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2118:27225:71717_TGAAGAAAGCAGTAGATGAGAGTTAT_ATATGCTCGC 163 10 3104258 60 117M = 3104509 401 AGTGTGTAGCTTATTAGTGGGGTGTTTGGCAGCATACATGAGGTTTTAGATTAAATCCCCCTGTTACAAAATAAGTAAAAGAGCATATCAGACACACCCCCCCATAGGAAAGAACAA JJ7FJJFJJJJJJJJJJFF<FFJJJFJJJJJJJJJJJFFJJJJAJFJJJJJJJJJJFAAJF7AF-<AJJJJFAJJJJJJJJJJ-AF<FJJJJJFF<<-7777A-7FF7-AJJJJ-AA NM:i:1 MD:Z:95C21 MC:Z:150M AS:i:112 XS:i:93 RG:Z:BPA1 XA:Z:10,-7456812,20M2I92M3S,5;10,+22240801,94M2D23M,7;10,+22431027,94M3D23M,8;
E00558:642:HFL3TCCX2:8:2120:9628:66127_TGAAGAAAGCAGTAGATGAGAGTTAT_ATATGCTCNC 163 10 3104273 0 15S68M33S = 3104509 386 AGTGTGTAGGTGATGAGTGGGGGGTTTGTCAGAATACATGAGGATTTAGATGAAATCACCCGGATACAAAAGAAGTAAAAGAGAATAAAAGACGGCACAGAGCATATAATAAAACA AA<-FFJA7-7-FAA-----FJ---7-<--A<-77F<<--A-----7FA-<7-<77-<-A-----7-<7---7----7--A-A<------<----77--A--7--7-77-<-7-7< NM:i:9 MD:Z:7T5G3C10T7T5C3T1T7T11 MC:Z:150M AS:i:23 XS:i:23 RG:Z:BPA1 XA:Z:8,-16209092,29S21M66S,0;1,+129147041,65S20M31S,0;14,-14404651,24S19M73S,0;10,-113473301,33S19M64S,0;15,+60550338,70S19M27S,0;
E00558:642:HFL3TCCX2:8:2118:27225:71717_TGAAGAAAGCAGTAGATGAGAGTTAT_ATATGCTCGC 83 10 3104509 60 150M = 3104258 -401 TCAGTAGGCAGACAGGAATAACCAAGGCCAGAAGATAATCTCTTTCCAATGGGCATAGAACCCTTCACTCTGCAGGCTGAGATGTGTTGCCATTATGAAGGAGATAAAAGTTTCAGGGGATCTTGTGTTGTTAGCCTCAATGGAAAGAAC FJJJFJFJJJJJJJFJJJJJJFJJJJJJJJJJJJJJJJJJJJAFFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA NM:i:0 MD:Z:150 MC:Z:117M AS:i:150 XS:i:102 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2120:9628:66127_TGAAGAAAGCAGTAGATGAGAGTTAT_ATATGCTCNC 83 10 3104509 60 150M = 3104273 -386 TCAGTAGGCAGACAGGAATAACCAAGGCCAGAAGATAATCTCTTTCCAATGGTCATAGAACCCTTCACTCTGCAGGCTGAGATGTGTTGCCATTATGAAGGAGATAAAAGTTTCAGGGGATCTTGTGTTGTTAGCCTCAATGGAAAGAAC A7JJFJAF<-F7-F<JJJJJF7-A777--FJJJFJJAA7FFJFFFJJJJFA7-JFJJJFAA-7AAF-FJJJJFJFFJJJAJFJJJJF<FJJJJJJFJJFAJJAJJJJJJJJJJJJJFJJJ<JJJJJJJFFJJJJJA-FJJJJJJFFFFAA NM:i:1 MD:Z:52G97 MC:Z:15S68M33S AS:i:145 XS:i:97 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2201:25083:16006_TGAAGAAAGCAGTAGAGCACTTGGCG_TATGCNTTAC 99 10 3104650 60 150M = 3105087 553 GGAAAGAACATGTTCATGTTGACACAAGCACTGGCAACTGGACTCAATTGGATCCTAGATTGAAGAAGAGTATAGAAATAGGGAAGGAAGACAGGACTCGATCTTCCTTCTTAGAGAAGACTACAGAGGGTGACTGCAAGACCTGGCGTG AAFFFJJJFJJJFJJJJJJJJJJJJJJJFJFJJJJJJJJJJJJJAJJJJFJ<FJFJFFJFJJJAFAFJ7JFFAJJ7F<7FAJJFJAAJJJJJJJJJJJJFFJFJAJJJ<AAJJFJAJJJJJJJJJFAAAAJ7FFA<AFJJF7FJ77<--< NM:i:1 MD:Z:147T2 MC:Z:116M AS:i:147 XS:i:95 RG:Z:BPA1
E00558:642:HFL3TCCX2:8:2201:25083:16006_TGAAGAAAGCAGTAGAGCACTTGGCG_TATGCNTTAC 147 10 3105087 60 116M = 3104650 -553 GTGCGGAAGAGGAGGCACACAACATGTAAGAACCAGAGGGGATTGAGGACACCAAGGATTTCTCCTCTTAAGTCAACACGATCCACACACATATGAACTCACAGGTACTGGAGTAG 7FF7JJJFJFAF7-F-7F-AFFJJJJAAFFJFJJJA<JFAFFJJJJJ<FJJJJJJJJJJFA-A-FFFJJJFAFFFJJJJJJJJJJFFJJJJJJJFF7JJJJJJJJJJAAFJJJJJJ NM:i:0 MD:Z:116 MC:Z:150M AS:i:116 XS:i:103 RG:Z:BPA1 XA:Z:10,+7455967,116M,3;10,+3205176,112M4S,4;
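The two patterns asked about can be tried out in plain Python before passing them to sinto. This is a sketch only: both patterns are hypothetical suggestions, and it assumes that the whole regex match (rather than a capture group) is taken as the barcode, which should be checked against the sinto documentation.

```python
import re

# Hypothetical patterns for illustration; not taken from sinto's documentation.
BARCODE_PATTERN = r"(?<=_)[^_]+(?=_)"  # text between the first and second underscore
CHROM_PATTERN = r"^\d+$"               # purely numeric chromosome names such as "10"

# Read name copied from the first record above.
read_name = (
    "E00558:642:HFL3TCCX2:8:2106:29812:51834"
    "_ATTGAATTACAGCCGTCTTACACTGA_ATGCCATTCT"
)

# The lookbehind and lookahead keep the underscores out of the match itself.
barcode = re.search(BARCODE_PATTERN, read_name).group()
print(barcode)

# Check which chromosome names the second pattern would accept.
for chrom in ["10", "X", "chr1"]:
    print(chrom, bool(re.search(CHROM_PATTERN, chrom)))
```

Note that `^\d+$` would exclude non-numeric names such as X, Y or MT; the pattern would need to be widened if those chromosomes should be kept.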
Hi @timoast,
Thanks for adding nametotag, this is very helpful for my dataset (linked-reads sharing the same molecular barcode). I currently have mapped reads in a bam file with the barcode embedded in the read name, in the same format as issue #32, where the barcode is not at the beginning of the read name.
example (barcode = CATTTGGCCTCGAATCGCGTCGGTGCGGTAACACTC)
A00564:478:HG5NJDSX3:1:2556:3992:34867_CATTTGGCCTCGAATCGCGTCGGTGCGGTAACACTC_GAACGACTACCACAG
You provided the regex to use:
--barcode_regex "(?<=_)(.*)(?=_)"
which works for me with sinto fragments, however I get an error with sinto nametotag:
Traceback (most recent call last):
  File "/programs/sinto-0.8.0/bin/sinto", line 8, in <module>
    sys.exit(main())
  File "/programs/sinto-0.8.0/lib/python3.9/site-packages/sinto/arguments.py", line 457, in main
    options.func(options)
  File "/programs/sinto-0.8.0/lib/python3.9/site-packages/sinto/utils.py", line 23, in wrapper
    func(args)
  File "/programs/sinto-0.8.0/lib/python3.9/site-packages/sinto/cli.py", line 109, in run_nametotag
    tagtoname.move(
  File "/programs/sinto-0.8.0/lib/python3.9/site-packages/sinto/tagtoname.py", line 51, in move
    cell_barcode = re_match.group()
AttributeError: 'NoneType' object has no attribute 'group'
I get a similar error for filterbarcodes using the same regex and input bam file.
I'm running sinto v0.8.0.
Thanks for your help!
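The `'NoneType' object has no attribute 'group'` error means the regex call returned no match for at least one read name. One possible cause, offered here as a hypothesis and not confirmed against the sinto source, is that one subcommand applies the pattern with `re.search` while another uses `re.match`: a pattern that starts with a lookbehind for an underscore can never match at position 0. The sketch below assumes the regex contains literal underscores inside the lookarounds.

```python
import re

# Read name copied from the example above.
read_name = (
    "A00564:478:HG5NJDSX3:1:2556:3992:34867"
    "_CATTTGGCCTCGAATCGCGTCGGTGCGGTAACACTC_GAACGACTACCACAG"
)

# Assumed form of the regex, with literal underscores in the lookarounds.
pattern = re.compile(r"(?<=_)(.*)(?=_)")

anchored = pattern.match(read_name)   # only tries position 0, where no "_" precedes
anywhere = pattern.search(read_name)  # scans forward until the lookbehind is satisfied

print(anchored)          # no match object is returned here
print(anywhere.group())  # the text between the two underscores


def extract_barcode(name: str) -> str:
    """Hypothetical helper that fails with a readable message instead of AttributeError."""
    found = pattern.search(name)
    if found is None:
        raise ValueError(f"barcode pattern did not match read name: {name}")
    return found.group()
```

If this is the cause, a pattern that consumes the prefix explicitly would behave the same under both calls, but that would change what `.group()` returns, so it is worth checking how each subcommand uses the match.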
Hi developers,
I understand that the chunk_bam() function splits the genome into multiple intervals for multiprocessing.
Basically, each parallel task calls pysam's fetch() to retrieve all the reads that map to the supplied interval. My concern is this: if a read overlaps more than one interval (and is therefore fetched more than once, by different parallel jobs), will that read be double counted?
Please let me know if this is a valid concern or not based on your experience. Really appreciate it!
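The concern can be illustrated without pysam. A region query that returns every read overlapping the interval will return a boundary-spanning read once per interval; a common remedy is to let each interval "own" only the reads whose start coordinate falls inside it. The sketch below is a toy model of that idea with made-up coordinates; it does not describe what sinto actually does.

```python
from typing import List, Tuple

Interval = Tuple[int, int]   # half-open [start, end)
Read = Tuple[str, int, int]  # (name, start, end), half-open


def fetch_overlapping(reads: List[Read], interval: Interval) -> List[Read]:
    """Mimic a region query: return every read that overlaps the interval."""
    lo, hi = interval
    return [r for r in reads if r[1] < hi and r[2] > lo]


def fetch_owned(reads: List[Read], interval: Interval) -> List[Read]:
    """Keep only overlapping reads whose start lies inside the interval."""
    lo, hi = interval
    return [r for r in fetch_overlapping(reads, interval) if lo <= r[1] < hi]


reads = [("r1", 90, 110), ("r2", 10, 50), ("r3", 150, 180)]
intervals = [(0, 100), (100, 200)]

naive_total = sum(len(fetch_overlapping(reads, iv)) for iv in intervals)
owned_total = sum(len(fetch_owned(reads, iv)) for iv in intervals)
print(naive_total, owned_total)
```

Here `r1` spans the boundary at 100, so the naive total counts it twice while the ownership rule counts each read exactly once.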
Hello Tim,
I am trying to use sinto filterbarcodes for making bam files from my scATAC clusters, and despite running without any error I am getting empty bam files. I saw that this issue had been brought up previously, and updating the sinto version solved the problem. However, I am using the latest 0.7.2.2 version of sinto, so that is probably not the issue. In that post you had also asked the user to ensure that the "cells" file is indeed tab-delimited and I verified that is true for me.
I am running the following code:
sinto filterbarcodes -b fragments_10X_sorted.bam -c cells.csv --barcodetag "CB"
This is how the head of my bam file looks:
GACCTTCGTTATGCAC-2 0 chr1 10158 255 151M * 0 0 * *
TCAAGGTAGTGAACCG-3 0 chr1 10229 255 98M * 0 0 * *
ATTGTCTTCGAAGCCC-4 0 chr1 10335 255 219M * 0 0 * *
GTAGTACCAAGAAACT-4 0 chr1 10793 255 142M * 0 0 * *
And this is the head of my cells file:
AAACGAAAGAACGACC-4 11
AAACGAAAGACCTATC-2 0
AAACGAAAGAGGAATG-3 2
AAACGAAAGCCTATAC-3 11
AAACGAAAGCGTCAAG-4 2
AAACGAAAGCTAGCAG-1 2
Please let me know if I am missing something/what is the potential cause of the issue.
Thanks
Debbie
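In the BAM excerpt above, the barcode appears in the read-name column and no optional tags are visible after the eleven mandatory fields, so it may be worth confirming that the records actually carry a CB tag before filtering on it. A minimal sketch for SAM text (for example from `samtools view`); the helper name is hypothetical and the second record is invented for comparison.

```python
from typing import Optional


def get_tag(sam_line: str, tag: str) -> Optional[str]:
    """Return the value of an optional SAM tag, or None if the record lacks it."""
    fields = sam_line.rstrip("\n").split("\t")
    for field in fields[11:]:  # optional fields start at column 12
        name, _type, value = field.split(":", 2)
        if name == tag:
            return value
    return None


# First record from the excerpt above, rebuilt with tab separators.
no_tag = "GACCTTCGTTATGCAC-2\t0\tchr1\t10158\t255\t151M\t*\t0\t0\t*\t*"
# Invented record showing what a CB-tagged read would look like.
with_tag = no_tag + "\tCB:Z:GACCTTCGTTATGCAC-2"

print(get_tag(no_tag, "CB"))
print(get_tag(with_tag, "CB"))
```

If the tag is missing from every record, a tag-based filter has nothing to match against, which would be consistent with an empty output.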
For citing use of Sinto, would you like us to include the link to the GitHub page and the version used, as you prefer for Signac, or do you have a specific way you would like this program cited?
Thanks for your time!
Hi Tim, thanks for the package!
Here's a simple toy bam file on which I was running sinto fragments. The MAPQs, fragment lengths etc. seem fine, but there's no output when I run:
sinto fragments -b lol.bam -f lol.frag
lol.bam.zip
This file has 5 read pairs. I see outputs when I add more read pairs.
I was tracing through the code and it's most likely an edge case here:
Lines 218 to 224 in 9ac3e8c
The fragment_dict looks good per chromosome but somehow doesn't make it to complete. Not entirely sure of the logic but can look into it if you'd like. Thanks again.
Hello, does filterbarcodes keep all reads when I give it a barcode list, i.e. both mapped and unmapped reads?
Hi, thanks for making this tool!
I've come across this issue and I'm not sure if this is the expected behavior or not. I'm using Sinto 0.7.1 to create a fragments file from a Cell Ranger bam file. In the output, I get many fragments with the same start/end position (around 6000 in total). For example:
chr5 49658161 49658162 CGCACAGCACCTATTT-1 1
chr5 49658161 49658164 GATTGACCACGTTGTA-1 2
chr5 49658161 49658168 TGTGTCCGTATTGTCG-1 1
chr5 49658162 49658162 CTCTACGCAAAGGTCG-1 1 # <--- this fragment
chr5 49658162 49658168 CCGTACTCACACACAT-1 2
chr5 49658162 49658173 GTGGATTCAGCAACAG-1 1
chr5 49658166 49658432 CTGAATGAGGACTAGC-1 2
chr5 49658168 49658168 CACCTTGAGCCTGTAT-1 3
When comparing to the Cell Ranger fragments file from the same bam, I don't see any of these. From Cell Ranger, the minimum fragment size seems to be 10, so maybe it has been filtered. Should I filter the Sinto fragments as well?
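If filtering turns out to be appropriate, it can be done after the fact on the fragments file. The sketch below drops fragments shorter than a minimum length; the default of 10 simply mirrors the minimum size observed in the Cell Ranger output above and is an assumption, not a documented cutoff. The function name is hypothetical.

```python
from typing import Iterable, Iterator


def filter_fragments(lines: Iterable[str], min_length: int = 10) -> Iterator[str]:
    """Yield fragment lines (chrom, start, end, barcode, count) with end - start >= min_length."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        _chrom, start, end = line.split()[:3]
        if int(end) - int(start) >= min_length:
            yield line


# Fragments copied from the example above.
fragments = [
    "chr5\t49658161\t49658162\tCGCACAGCACCTATTT-1\t1",
    "chr5\t49658161\t49658164\tGATTGACCACGTTGTA-1\t2",
    "chr5\t49658161\t49658168\tTGTGTCCGTATTGTCG-1\t1",
    "chr5\t49658162\t49658162\tCTCTACGCAAAGGTCG-1\t1",
    "chr5\t49658162\t49658168\tCCGTACTCACACACAT-1\t2",
    "chr5\t49658162\t49658173\tGTGGATTCAGCAACAG-1\t1",
    "chr5\t49658166\t49658432\tCTGAATGAGGACTAGC-1\t2",
    "chr5\t49658168\t49658168\tCACCTTGAGCCTGTAT-1\t3",
]

kept = list(filter_fragments(fragments))
for line in kept:
    print(line)
```

With these eight example rows, only the fragments of length 11 and 266 survive a minimum length of 10.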
I need to temporarily remove read groups from my BAM file in order to run BQSR in a read-group unaware mode. I thought I might just rename the RG read tag to something else while I run BQSR and then rename the read tag back to RG. I've looked around a little, and I can't find a tool to do it, so I'm writing a script for it. Would that be something you would accept a PR for?
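The renaming step described above can be sketched on SAM text as follows. This is an illustration of the idea only, not an existing sinto feature: the function name and the placeholder tag `ZR` are hypothetical (the SAM specification reserves tags beginning with X, Y or Z for end users), and the `@RG` header lines would need separate handling.

```python
def rename_tag(sam_line: str, old: str, new: str) -> str:
    """Rename an optional tag in one SAM record, leaving its type and value untouched."""
    fields = sam_line.rstrip("\n").split("\t")
    for i in range(11, len(fields)):  # optional fields start at column 12
        if fields[i].startswith(old + ":"):
            fields[i] = new + fields[i][len(old):]
    return "\t".join(fields)


# Made-up record carrying a read group tag.
record = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tNM:i:0\tRG:Z:BPA1"

hidden = rename_tag(record, "RG", "ZR")    # hide the read group before BQSR
restored = rename_tag(hidden, "ZR", "RG")  # put it back afterwards
print(hidden)
print(restored == record)
```

Because only the two-character tag name changes, renaming back restores the original record exactly.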
Dear Tim,
I am trying to understand how sinto ends up filtering and selecting unique fragments per cell. My input bam file has the following reads assigned to my cell of interest:
A00261:518:HK73GDSX3:1:1515:27118:35211 147 chr1 9997 51 110S40M = 10010 -27 CCACAGCCGCGGCAAAGCCACATCACTTTCACCTCCACCAACACACAAAATCAAACAATCACTAACGCTAACTGTCTGACTCACTCTGCCTCACTATACCTAAACCTATACCGATAACCCTAACCCTAACCCTAACCCTAACCCTAACCC :,,,,,,,,,,,,,,F,,,F,:,,,,:,,,,,,,,F,:,,,:,,,,,,,,:,F,,,,:,,,,F,,,,,,,,,,,,F,,,,F,,,,,,,,,F:,:,,F,,,,:,:,,,F,:,:,:F:,F:,:F,FFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:40 AS:i:40 XS:i:37 XA:Z:chr6,-147869,113S37M,0;chr7,-10002,114S36M,0;chr1,-180752,114S36M,0;chr15,+101981123,36M114S,0; CR:Z:AGATTCAAGGTTGTAA CY:Z:FFFFFFFFFFFFFFFF CB:Z:CTGAATATCCTGGTCT-1 BC:Z:TTATTGGT QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:2125:4083:16673 147 chr1 10002 0 43S107M = 10010 -99 CCTCTTTCTCCTGCAGCGTCATATGTTTAGTATAGCCCTCCCAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ,,,,:,,,,,,,,,,,F::,,,,,,,,:,,,,,,,,,,,:,,,,:::,:F,:F,,:,:F:FF,F::FF,F:FFF::,,FFF:F,FFF:F,FFF:FFFFFF:FFFF:F,FFF:::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:107 AS:i:107 XS:i:108 CR:Z:AGATTCAAGGTTGTAA CY:Z:FFFFFFFF::FFFFFF CB:Z:CTGAATATCCTGGTCT-1 BC:Z:AACGGTCA QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1216:26946:37012 147 chr1 10003 0 98S52M = 10045 -10 CACCCCAACTCTAATGCCTCGGCGTCCACCTAGTCCTACTCATATTCATTGTGGTTACGGGTTTGTCTTCGGTATCGTAAGATGTGTATATTACACTTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ,,,,,,,,,,,,,F,,,,,,,,,,,,,,,,,F,:,,::,,,,FF:,,,F:,:,,F,F,:,F:,,F:,:::,F:,:,,,,,,,FFF,F,:,F,,,,,F,,F,FFFFF,FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:52 AS:i:52 XS:i:53 CR:Z:AGATTCAAGGTTGTAA CY:Z:FFFFFFFFFFFFFFFF CB:Z:CTGAATATCCTGGTCT-1 BC:Z:CCGAACTC QT:Z:FFFF:FFF RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1515:27118:35211 99 chr1 10010 60 62M2I35M3D28M23S = 9997 27 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCTAACCCTCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACCCTAACCCTAACCCTAACCCTAACCCGGGGCGTTACGCTCCCTCTAACC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FF:FF,FFF,,,FF,,F:F,,:F::,F:,::FFF,,:F,::::,:FF,:FFFF:F,FFF:,FFFF:F:,FFF,FFF:,::FF,,:FF,,,,,,,,,,,,,,,,,,,,,,, NM:i:6 MD:Z:45T51^CCA28 AS:i:103 XS:i:91 XA:Z:chr7,+10001,14S48M2I35M1D28M23S,4;chr7,+10035,45M3D19M4D35M1D28M23S,10;chr1,+180749,53M2D9M2I35M1D20M31S,6; CR:Z:AGATTCAAGGTTGTAA CY:Z:FFFFFFFFFFFFFFFF CB:Z:CTGAATATCCTGGTCT-1 BC:Z:TTATTGGT QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:2125:4083:16673 99 chr1 10010 0 101M49S = 10002 99 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAAACCAAAACCCACTCACTTATAAACATCTACGAACCAACCAGACAAAGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FF:FFFFF:FFFFF,FF:FF:FF::F,FF,F:,FF,FF,FF,:,,:,,,,,:,F,F,FFF,,,,,,F,:F,F,,F,FFFF,,, NM:i:0 MD:Z:101 AS:i:101 XS:i:100 CR:Z:AGATTCAAGGTTGTAA CY:Z:FFFFFFFF::FFFFFF CB:Z:CTGAATATCCTGGTCT-1 BC:Z:AACGGTCA QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1664:20157:30859 147 chr1 10027 0 74S76M = 10033 -70 ACCACCGAGATCTACACATATTCATGGTTGTAACGCGTCTGTTGTAGGCAGCGTCATATGTGTATATTATACTGACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ,,F,F,FF,FF:FF,F,:FF,FF,,,,F:FFF:F:,,::F::,,F,F,,F,:FF,,,FF,F,FFF,::F,,,,,F:::FFFF,,FFF,::F::,,FF:FFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:76 AS:i:76 XS:i:77 CR:Z:AGATTCAAGGTTGTAA CY:Z:FFFFFFFFFFFFFFFF CB:Z:CTGAATATCCTGGTCT-1 BC:Z:CCGAACTC QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1431:25192:2347 99 chr1 10028 0 83M67S = 10034 81 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCAAAACCAAACACTAACCCACAACCAGACGCTCCAACTAACCCTAAGCCTAAGCCTGCAAGTAAGCCTCG FFFFFFFFFFFF:FFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFF:FFF:FFFFFFF:FFFFF:,,,F,F:,,:,,F:F,FFF::F::,,,F,,,F,,FF,F,,:,:,,:,:,,,:,:,,,,F,,,,,:,:,,F,:,, NM:i:1 MD:Z:75T7 AS:i:78 XS:i:77 CR:Z:AGATTCAAGGTTGTAA CY:Z:FFFFFFFFFFFFFFFF CB:Z:CTGAATATCCTGGTCT-1 BC:Z:GGTCCAAG QT:Z:FFFFF:FF RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:2414:19244:7044 1123 chr1 10028 0 83M67S = 10028 81 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAGACGAAAAAAAACAACTAACACAACCCCACACAAAACACAATACCCTATCCCGAGCGCTGCGACTAA FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFF:FF,FF,FF:FFF::FFF:FFFFF:FF:FF:F::FFFFFF:,,F,,::,,:,F,,:F:,,,,,,,,,F::,:F:,:,,,F,F,,,:,F:,,,:,,,,,::,,,,:,,, NM:i:0 MD:Z:83 AS:i:83 XS:i:81 CR:Z:AGATTCAAGGTTGTAA CY:Z:FFFFFFFFFFFFFFFF CB:Z:CTGAATATCCTGGTCT-1 BC:Z:AACGGTCA QT:Z:F:FFFFFF RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1318:11731:7592 99 chr1 10028 0 83M67S = 10400 414 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCATAAACCAAAACATCAACATAACCCTAACACTACCCCAATCCCTACCCCTAACGCTCAGCGTAG F,FFFFFFFFFFFFFFFF,,F,FF:FFF:FFFF:F:F:FF:F,,,FFF::,,:F,::::F,F:FFF,,F,:F:FF:FFF,F,F,:F,,,,,F,,:F,,,,,F,,,,FF,,,:::F,F:F,,,,,,,,,,F,,F,,,,,,,,F,,,,,,,, NM:i:0 MD:Z:83 AS:i:83 XS:i:83 CR:Z:AGATTCAAGGTTGTAA CY:Z:FFFFF,F:::FFFFFF CB:Z:CTGAATATCCTGGTCT-1 BC:Z:TTATTGGT QT:Z::FFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:2414:19244:7044 1171 chr1 10028 0 69S81M = 10028 -81 AGAGAGCAACACTCATACTATGTTGTAACGGATCTGTATTAGTAAGAGTCAGATGTAGCTAAGACACATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC ,,::F,,,FFFF,,F,,,,F,::,,,F,,,:,:F,,F,,F,F:FF,,,F,FF:,,,,,,,,F,:,,,F,FFFFFFF,F:FFF:FFFFFFFFFFFF:FF,FF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:81 AS:i:81 XS:i:81 CR:Z:AGATTCAAGGTTGTAA CY:Z:FFFFFFFFFFFFFFFF CB:Z:CTGAATATCCTGGTCT-1 BC:Z:AACGGTCA QT:Z:F:FFFFFF RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1664:20157:30859 99 chr1 10033 0 78M72S = 10027 70 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAGACAATTAAAAACAACTCACAGCCCACGATACCCGAACTCATCGCGTATGGCGTGGGCTGCGGGTAACCGGG FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFF:FF,FFF:F:FF,:FFFF:FF:FFFFF:FF,FF,,,,F,,:,F,F,F,:,F:,,,F,,,:F:,:,F,FF,FFF,F,,F,,,F,:,,F,,,,,,,,,,,,,,,,,,,, NM:i:0 MD:Z:78 AS:i:78 XS:i:76 CR:Z:AGATTCAAGGTTGTAA CY:Z:FFFFFFFFFFFFFFFF CB:Z:CTGAATATCCTGGTCT-1 BC:Z:CCGAACTC QT:Z:FFFFFFFF RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1
A00261:518:HK73GDSX3:1:1431:25192:2347 147 chr1 10034 0 75S75M = 10028 -81 TACCACTTAGATATACACTTATACTACGTTTTAGCGTTTCTGTATTCGTAAGCGTAAGATTATAAATAAACATATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC :F,F,,,:F:,:,F:,,,F,::,,FF,:,,F,:,:,,::,F,F:,,:,,,F,,,F:F,:,,,:,F,F,,,,,,F,FF:FFF::FFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF NM:i:0 MD:Z:75 AS:i:75 XS:i:75 CR:Z:AGATTCAAGGTTGTAA CY:Z:FFFFFFFFFFFFFFFF CB:Z:CTGAATATCCTGGTCT-1 BC:Z:GGTCCAAG QT:Z:FFFFF:FF RG:Z:Sample_output:MissingLibrary:1:HK73GDSX3:1`
However, the sinto output has just 1 line corresponding to this cell, and that is:
chr1 10013 10031 CTGAATATCCTGGTCT-1 1
I understand that many of these reads will get removed due to mapping quality. Still, I don't really understand what leads to the positions 10013 and 10031. Is this due to +4/-5 shifting? Even so, I don't see how these numbers are arrived at. Could you please help me understand this?
Thanks
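The reported coordinates are consistent with a +4/-5 shift applied to the one read pair above whose mates both have a nonzero MAPQ (read 1515:27118:35211, MAPQ 60 and 51; every other listed read has MAPQ 0). The sketch below reproduces the arithmetic from POS and CIGAR alone. It is a reconstruction from the numbers in this post, not a statement of how sinto is implemented; the helper name is hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_length(cigar: str) -> int:
    """Number of reference bases covered by an alignment with this CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


# Values copied from the two records of read 1515:27118:35211 above.
fwd_pos = 10010                       # 1-based POS of the forward mate (FLAG 99)
rev_pos, rev_cigar = 9997, "110S40M"  # reverse mate (FLAG 147)

fwd_start0 = fwd_pos - 1                              # 0-based start of forward mate
rev_end0 = (rev_pos - 1) + reference_length(rev_cigar)  # 0-based exclusive end of reverse mate

# Apply the +4 / -5 shift to the fragment start and end.
fragment = (fwd_start0 + 4, rev_end0 - 5)
print(fragment)
```

The soft-clipped bases (110S) do not consume reference, so the reverse mate covers only 40 reference bases, which is why the fragment end lands so close to its start.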