Comments (14)

pmarks commented on August 23, 2024

@teng-gao there is a separate output folder for each flowcell and lane of original FASTQs. You need to make sure that the R1/R2 pair within each directory stay properly paired together. Going to close this issue, since we were able to validate that the fastqs were properly paired by manual inspection.

from bamtofastq.

davemcg commented on August 23, 2024

Can anyone from 10x comment on this? This recently tripped us up... according to Illumina the "001" ending is uninformative... what does it mean to 10x?

davemcg commented on August 23, 2024

I've confirmed (see pachterlab/kb_python#104) that when you run with the --reads-per-fastq flag set with a crazy big number you get the proper results.

I currently believe there's a bug in bamtofastq: if it is spitting out N files per bam, they should still generate a "proper" result when you feed them all to kallisto/etc. Instead, it seems that the barcodes are getting scrambled somehow.

In the short-term the developers should remove the --reads-per-fastq=N flag and simply output one triplet of fastq files.

sbooeshaghi commented on August 23, 2024

What options are you running bamtofastq with? Could it be that the number of reads per fastq option is "splitting" the fastqs into 001... etc?

From src/main.rs:

--reads-per-fastq=N Number of reads per FASTQ chunk [default: 50000000]

If so then I think the solution would be to concatenate the 001... fastqs in order based on Read and Lane number.

(edit) alternatively you could set N really large, as suggested by @davemcg via twitter.
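
The concatenation suggested above can be sketched roughly as follows. This is only an illustration, not part of bamtofastq: the regular expression is an assumption based on the file names quoted in this thread (for example bamtofastq_S1_L001_R1_001.fastq.gz), and the helper names chunk_key, ordered_chunks and concatenate are hypothetical. Sorting by lane and then chunk is one possible order; the point is that R1 and R2 end up in the same order.

```python
import re
import shutil
from pathlib import Path

# Assumed naming pattern, inferred from file names quoted in this thread,
# e.g. bamtofastq_S1_L001_R1_001.fastq.gz
CHUNK_NAME = re.compile(
    r"^(?P<prefix>.+)_L(?P<lane>\d+)_(?P<read>[RI]\d)_(?P<chunk>\d+)\.fastq(\.gz)?$"
)


def chunk_key(path):
    """Return (read, lane, chunk) parsed from a chunk file name."""
    match = CHUNK_NAME.match(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected FASTQ name: {path}")
    return match["read"], int(match["lane"]), int(match["chunk"])


def ordered_chunks(paths):
    """Group files by read (R1, R2, ...) and sort each group by lane, then chunk."""
    groups = {}
    for path in paths:
        read, lane, chunk = chunk_key(path)
        groups.setdefault(read, []).append((lane, chunk, str(path)))
    return {read: [p for _, _, p in sorted(items)] for read, items in groups.items()}


def concatenate(paths, out_path):
    """Byte-wise concatenation; concatenated gzip members form a valid gzip stream."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

With the groups in hand, something like concatenate(groups["R1"], "all_R1.fastq.gz") followed by the same call for groups["R2"] would keep both reads in the same chunk order (the output names here are placeholders).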

davemcg commented on August 23, 2024

We've been cat-ing the files up to now to produce an R1 and R2 fastq pair... that should maintain the order. Even so... don't the scrambled kallisto results suggest some barcode shenanigans are happening?

These are all supposed to be the same sample....as long as the for/rev isn't getting mixed, then the order the files get cat'ed shouldn't matter?

pmarks commented on August 23, 2024

The final group of numbers in the fastq filename is the 'chunk' number. You get a new chunk for each group of N reads, where you can set N with --reads-per-fastq=N. (You can make older versions of the Illumina bcl2fastq tool do this chunking, which is why we do it.) Key point: you need to make sure that the R1 and R2 files with the same chunk number stay together.

@davemcg You need to make sure that you cat together the R1, R2 in the same order as defined by the chunk number. The order is arbitrary, but it needs to be the same for both files. Does that make sense?

On the original question @Rui-Jing: the different folders correspond to different lanes in the original flowcell, which get packed into the BAM file with distinct BAM read groups. bamtofastq outputs the reads in different folders for each read group. You should pass both of the FASTQ paths to Cellranger, and it will use all the reads.

davemcg commented on August 23, 2024

If you follow the issue I placed with the kallisto developers, it REALLY APPEARS that the barcodes (at least in my example) are getting jumbled. I understand that kallisto working isn't a priority of yours, but I'd appreciate an explanation (full code chunks are given) of why running bamtofastq on a 10x bam (https://sra-download.ncbi.nlm.nih.gov/traces/sra43/SRZ/011680/SRR11680523/MouseACS6.bam) and then running the fastq file pairs (NOT CONCATENATED) results in "mixing" with kallisto.

pachterlab/kb_python#104 (comment)

pmarks commented on August 23, 2024

@davemcg probably the best sanity check is to see if the read header (the line that starts with @) matches line-by-line between the corresponding R1 and R2 files. 'Corresponding' R1 and R2 files have the same filename, except for the R1 or R2 portion.

For example bamtofastq_s1_L2_R1_003.fastq and bamtofastq_s1_L2_R2_003.fastq always need to 'stay together' throughout the analysis -- so if you're passing files into Kallisto, you need to make sure they match up this way, or you will indeed see scrambling. (Your workaround of dumping everything into one file is great, because it means you don't need to worry about 'matching up' the FASTQs). Does that make sense?

If there's a bug in bamtofastq it will most likely show up as mismatched read headers in corresponding FASTQ files. Do you observe that?
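
The header check described above could be scripted along these lines. This is a minimal sketch, assuming standard 4-line FASTQ records and that the read ID is the header text before the first space; the function names read_ids and first_mismatch are made up for illustration. Reading every fourth line avoids a pitfall of grepping for "^@": quality lines can also begin with @.

```python
import gzip
from itertools import zip_longest


def read_ids(path):
    """Yield the read ID (header up to the first space, without '@') of each record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # Assumes each FASTQ record spans exactly 4 lines; line 0 is the header.
            if line_number % 4 == 0:
                yield line.rstrip("\n").split(" ", 1)[0][1:]


def first_mismatch(r1_path, r2_path):
    """Return (record_index, r1_id, r2_id) for the first disagreement, else None.

    A missing record in the shorter file shows up as an ID of None.
    """
    pairs = zip_longest(read_ids(r1_path), read_ids(r2_path))
    for index, (r1_id, r2_id) in enumerate(pairs):
        if r1_id != r2_id:
            return index, r1_id, r2_id
    return None
```

Calling first_mismatch on a corresponding R1/R2 pair and getting None back would indicate the headers line up record by record.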

davemcg commented on August 23, 2024

Command run:

kb count -t 12 -i index.idx -g t2g.txt -x 10xv2 -o kb_standard --workflow standard --filter bustools \
  fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R2_001.fastq.gz \
  fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R2_001.fastq.gz \
  fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_001.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_001.fastq.gz \
  fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L001_R2_002.fastq.gz \
  fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L002_R2_002.fastq.gz \
  fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_002.fastq.gz fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_002.fastq.gz

kb count takes the input as matched pairs. All space separated. So f1_r1_001.fq.gz f1_r2_001.fq.gz f1_r1_002.fq.gz f1_r2_002.fq.gz is how you would put in two pairs of files.
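
For reference, the "R1 R2 R1 R2 ..." argument order described above can be generated rather than typed by hand. This is a rough sketch: interleave_pairs is a hypothetical helper, and it assumes each R2 file name differs from its R1 partner only in the _R1_ / _R2_ portion, as in the file names shown in this thread.

```python
import os


def interleave_pairs(r1_files):
    """Return [r1, r2, r1, r2, ...], deriving each R2 path from its R1 path."""
    args = []
    for r1 in sorted(r1_files):
        folder, name = os.path.split(r1)
        if name.count("_R1_") != 1:
            raise ValueError(f"cannot derive the R2 name for: {r1}")
        # Only the file name is rewritten, so an "_R1_" in a folder name is left alone.
        r2 = os.path.join(folder, name.replace("_R1_", "_R2_"))
        args += [r1, r2]
    return args
```

The resulting list can be appended to the kb count arguments, for example via subprocess.run(["kb", "count", ...] + interleave_pairs(r1_files)), where the elided options are the ones from the command above.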

However, this invocation resulted in scrambled data (fwiw this also happens in alevin, a pseudoaligner from a different group).

When I ran the "001" and "002" pairs independently in kb count...the results are fine.

This suggests that the pairs are being set up properly. Otherwise kb count should return 0 (or near 0) pseudoaligned results.

Here are the first 10 header lines from the last pair of fastq files:

$ zcat fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R1_002.fastq.gz  | head -n 100 | grep "^@" | head -n 10
@E00173:630:H7MJCCCXY:3:1220:16407:27029 1:N:0:0
@E00173:630:H7MJCCCXY:3:1220:16457:27714 1:N:0:0
@E00173:630:H7MJCCCXY:3:2215:6796:8236 1:N:0:0
@E00173:630:H7MJCCCXY:3:2122:16143:13597 1:N:0:0
@E00173:630:H7MJCCCXY:3:2202:5112:31265 1:N:0:0
@E00173:630:H7MJCCCXY:3:1204:27316:34043 1:N:0:0
@E00173:630:H7MJCCCXY:3:1219:14773:17694 1:N:0:0
@E00173:630:H7MJCCCXY:3:2108:5152:11998 1:N:0:0
@E00173:630:H7MJCCCXY:3:1123:8288:1713 1:N:0:0
@E00173:630:H7MJCCCXY:3:1124:26666:2540 1:N:0:0
$ zcat fastq_MouseACS6/P18Chx10creACS10_MissingLibrary_1_H7MJCCCXY/bamtofastq_S1_L003_R2_002.fastq.gz  | head -n 100 | grep "^@" | head -n 10
@E00173:630:H7MJCCCXY:3:1220:16407:27029 3:N:0:0
@E00173:630:H7MJCCCXY:3:1220:16457:27714 3:N:0:0
@E00173:630:H7MJCCCXY:3:2215:6796:8236 3:N:0:0
@E00173:630:H7MJCCCXY:3:2122:16143:13597 3:N:0:0
@E00173:630:H7MJCCCXY:3:2202:5112:31265 3:N:0:0
@E00173:630:H7MJCCCXY:3:1204:27316:34043 3:N:0:0
@E00173:630:H7MJCCCXY:3:1219:14773:17694 3:N:0:0
@E00173:630:H7MJCCCXY:3:2108:5152:11998 3:N:0:0
@E00173:630:H7MJCCCXY:3:1123:8288:1713 3:N:0:0
@E00173:630:H7MJCCCXY:3:1124:26666:2540 3:N:0:0

(Thanks for your help. I am very open to the possibility that I borked this, but I haven't figured out why yet. The only thing that seems to fix my issue is preventing bamtofastq from creating multiple files.)

teng-gao commented on August 23, 2024

Hi, I still don't understand why there are two folders. I've recently been experiencing the same issue.

howtofindme commented on August 23, 2024

Hi,
Thanks for developing this powerful tool. But when I processed the BAM file from the CellRanger pipeline with bamtofastq, the results didn't match the original FASTQ sequencing data. The code is " ./

Hi, have you figured it out? Which folder do you use for cellranger count?

howtofindme commented on August 23, 2024

@teng-gao there is a separate output folder for each flowcell and lane of original FASTQs. You need to make sure that the R1/R2 pair within each directory stay properly paired together. Going to close this issue, since we were able to validate that the fastqs were properly paired by manual inspection.

Hi, I have the same problem. Which folder do you use for cellranger count?

pmarks commented on August 23, 2024

@howtofindme if you get multiple output folders from bamtofastq, you should use all of them to re-run cellranger. You can specify a comma-separated list of paths to the --fastqs argument (see docs here). They are usually in different paths if the original fastq data originated from multiple flowcell runs.
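
Building the comma-separated value for --fastqs might look like the sketch below. It assumes the layout described in this thread (one subfolder per flowcell/read group directly under the bamtofastq output directory); fastqs_argument is a hypothetical helper name, not part of bamtofastq or Cell Ranger.

```python
from pathlib import Path


def fastqs_argument(bamtofastq_out):
    """Join every subfolder that contains FASTQ files into a comma-separated string."""
    root = Path(bamtofastq_out)
    folders = sorted(
        d for d in root.iterdir()
        if d.is_dir() and any(d.glob("*.fastq*"))  # skip folders without FASTQs
    )
    return ",".join(str(d) for d in folders)
```

The returned string is what would be passed as the value of --fastqs when re-running cellranger count.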

howtofindme commented on August 23, 2024

@howtofindme if you get multiple output folders from bamtofastq, you should use all of them to re-run cellranger. You can specify a comma-separated list of paths to the --fastqs argument (see docs here). They are usually in different paths if the original fastq data originated from multiple flowcell runs.

Thank you very much! This helped a lot, @pmarks.
