The bam data processed by bamtofastq is different from the original FASTQ file. about bamtofastq HOT 14 CLOSED

10xgenomics commented on August 23, 2024

The bam data processed by bamtofastq is different from the original FASTQ file.

from bamtofastq.

Comments (14)

pmarks commented on August 23, 2024 2

@teng-gao there is a separate output folder for each flowcell and lane of original FASTQs. You need to make sure that the R1/R2 pair within each directory stay properly paired together. Going to close this issue, since we were able to validate that the fastqs were properly paired by manual inspection.

from bamtofastq.

davemcg commented on August 23, 2024 1

Anyone from 10X can comment on this? This recently just tripped us up...according to Illumina the "001" ending is uninformative....what does it mean to 10x?

from bamtofastq.

davemcg commented on August 23, 2024 1

I've confirmed (see pachterlab/kb_python#104) that when you run with the --reads-per-fastq flag set with a crazy big number you get the proper results.

Currently I believe there's a bug in bamtofastq (if it is spitting out N files per bam...they should generate a "proper" result if you feed them all to kallisto/etc) somehow. It seems that the barcodes are getting scrambled somehow.

In the short-term the developers should remove the --reads-per-fastq=N flag and simply output one triplet of fastq files.

from bamtofastq.

sbooeshaghi commented on August 23, 2024

What options are you running bamtofastq with? Could it be that the number of reads per fastq option is "splitting" the fastqs into 001... etc?

From src/main.rs:

bamtofastq/src/main.rs

Line 74 in cfac1de

--reads-per-fastq=N Number of reads per FASTQ chunk [default: 50000000]

If so then I think the solution would be to concatenate the 001... fastqs in order based on Read and Lane number.

(edit) alternatively you could set N really large, as suggested by @davemcg via twitter.

from bamtofastq.

davemcg commented on August 23, 2024

We've been cating the files up to now to produce a R1 and R1 fastq pair. ....that should maintain the order. Even so...doesn't the scrambled kallisto results suggest some barcode shenanigans are happening?

These are all supposed to be the same sample....as long as the for/rev isn't getting mixed, then the order the files get cat'ed shouldn't matter?

from bamtofastq.

pmarks commented on August 23, 2024

The final group of numbers in the fastq filename is the 'chunk' number. You get a new chunk for each group of N reads, where you can set N with --reads-per-fastq=N. Key point: you need to make sure that (You can make the old version Illumina bcl2fastq tool do this chunking, which is why we do it).

@davemcg You need to make sure that you cat together the R1, R2 in the same order as defined by the chunk number. The order is arbitrary, but it needs to be the same for both files. Does that make sense?

On the original question @Rui-Jing: the different folders correspond to different lanes in the original flowcell, which get packed into the BAM file with distnict BAM read groups. bamtofastq outputs the reads in different folders for each read group. You should pass both of the FASTQ paths to Cellranger, and it will use all the reads.

from bamtofastq.

davemcg commented on August 23, 2024

If you follow the issue I placed with the kallisto developers, it REALLY APPEARS that the barcodes (at least in my example) are getting jumbled. I understand that kallisto working isn't a priority of yours, but I'd appreciate an explanation (full code chunks are given) of why running bamtofastq on a 10x bam (https://sra-download.ncbi.nlm.nih.gov/traces/sra43/SRZ/011680/SRR11680523/MouseACS6.bam), then running the fastq file pairs (NOT CONCATENATED) results in "mixing" with kalllisto.

pachterlab/kb_python#104 (comment)

from bamtofastq.

pmarks commented on August 23, 2024

@davemcg probably the best sanity check is to see if read header (the line that start with @) is matched line-by-line between the corresponding R1 and R2 files. 'Corresponding' R1 and R2 files have the same filename, except for the R1 or R2 portion.

For example bamtofastq_s1_L2_R1_003.fastq and bamtofastq_s1_L2_R2_003.fastq always need to 'stay together' throughout the analysis -- so if you're passing files into Kallisto, you need to make sure they match up this way, or you will indeed see scrambling. (Your workaround of dumping everything into one file is great, because it means you don't need to worry about 'matching up' the FASTQs). Does that make sense?

If there's a bug in bamtofastq it will most likely show up as mismatched read headers in corresponding FASTQ files. Do you observe that?

from bamtofastq.

davemcg commented on August 23, 2024

Command run:

kb count -t 12 -i index.idx -g t2g.txt -x 10xv2 -o kb_standard --workflow standard --filter bustools \
fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R2_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R2_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R2_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R2_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_002.fastq.gz

kb count takes the input as matched pairs. All space separated. So f1_r1_001.fq.gz f1_r2_001.fq.gz f1_r1_002.fq.gz f1_r2_002.fq.gz is how you would put in two pairs of files.

However, this invocation resulted in scrambled data (fwiw this also happens in alevin, a pseudoaligner from a different group).

When I ran the "001" and "002" pairs independently in kb count...the results are fine.

This suggests that the pairs are being set up properly. Otherwise kb count should return 0 (or near 0) pseudoaligned results.

Here the first header 10 line from the last pair of fastq files:

$ zcat fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_002.fastq.gz  | head -n 100 | grep "^@" | head -n 10
@E00173:630:H7MJCCCXY:3:1220:16407:27029 1:N:0:0
@E00173:630:H7MJCCCXY:3:1220:16457:27714 1:N:0:0
@E00173:630:H7MJCCCXY:3:2215:6796:8236 1:N:0:0
@E00173:630:H7MJCCCXY:3:2122:16143:13597 1:N:0:0
@E00173:630:H7MJCCCXY:3:2202:5112:31265 1:N:0:0
@E00173:630:H7MJCCCXY:3:1204:27316:34043 1:N:0:0
@E00173:630:H7MJCCCXY:3:1219:14773:17694 1:N:0:0
@E00173:630:H7MJCCCXY:3:2108:5152:11998 1:N:0:0
@E00173:630:H7MJCCCXY:3:1123:8288:1713 1:N:0:0
@E00173:630:H7MJCCCXY:3:1124:26666:2540 1:N:0:0
$ zcat fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_002.fastq.gz  | head -n 100 | grep "^@" | head -n 10
@E00173:630:H7MJCCCXY:3:1220:16407:27029 3:N:0:0
@E00173:630:H7MJCCCXY:3:1220:16457:27714 3:N:0:0
@E00173:630:H7MJCCCXY:3:2215:6796:8236 3:N:0:0
@E00173:630:H7MJCCCXY:3:2122:16143:13597 3:N:0:0
@E00173:630:H7MJCCCXY:3:2202:5112:31265 3:N:0:0
@E00173:630:H7MJCCCXY:3:1204:27316:34043 3:N:0:0
@E00173:630:H7MJCCCXY:3:1219:14773:17694 3:N:0:0
@E00173:630:H7MJCCCXY:3:2108:5152:11998 3:N:0:0
@E00173:630:H7MJCCCXY:3:1123:8288:1713 3:N:0:0
@E00173:630:H7MJCCCXY:3:1124:26666:2540 3:N:0:0

(thanks for your help - I am very open to the fact I could have borked this....but I haven't out why yet - the only thing that seems to fix my issue is to prevent bamtofastq from creating multiple files)

from bamtofastq.

teng-gao commented on August 23, 2024

Hi I still don't understand why there are two folders? I've been recently experiencing the same issue

from bamtofastq.

howtofindme commented on August 23, 2024

Hi,
Thanks for developing this powerful tool. But when I processed the BAM file from the CellRanger pipeline with bamtofastq, the results didn't match the original FASTQ sequencing data. The code is " ./

hi,have you figured it ou? which folder do you use to do cellranger count?

from bamtofastq.

howtofindme commented on August 23, 2024

@teng-gao there is a separate output folder for each flowcell and lane of original FASTQs. You need to make sure that the R1/R2 pair within each directory stay properly paired together. Going to close this issue, since we were able to validate that the fastqs were properly paired by manual inspection.

hi, I have the sampe problems, which folder do you use to do cellranger count?

from bamtofastq.

pmarks commented on August 23, 2024

@howtofindme if you get multiple output folders from bamtofastq, you should use all of them to re-run cellranger. You can specify a comma-separated list of paths to the --fastqs argument (see docs here). They are usually in different paths if the original fastq data originated from multiple flowcell runs.

from bamtofastq.

howtofindme commented on August 23, 2024

@howtofindme if you get multiple output folders from bamtofastq, you should use all of them to re-run cellranger. You can specify a comma-separated list of paths to the --fastqs argument (see docs here). They are usually in different paths if the original fastq data originated from multiple flowcell runs.

thank very much! I benefit a lot @pmarks

from bamtofastq.

The bam data processed by bamtofastq is different from the original FASTQ file. about bamtofastq HOT 14 CLOSED

Comments (14)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent