Giter Club home page Giter Club logo

Comments (13)

arshajii avatar arshajii commented on August 23, 2024

Hi,
Could you show your original FASTQ and preprocess/sort commands? The cause of the problem itself seems to be the line break in the quality score string of the last read in 2.txt.

But these FASTQs are somewhat strange irrespective of that; it looks like a base is missing from both mates, and the second mates have the barcode on the read sequence (maybe first and second mates were swapped by mistake?).

from ema.

mcieslik-mctp avatar mcieslik-mctp commented on August 23, 2024

Sure. Thanks for the prompt response. This is the script I have been running, sorry if it is an error on my part:

#!/usr/bin/env bash
FASTQ1S=(SI_18785_S1_L001_R1_001.fastq.gz SI_18785_S1_L002_R1_001.fastq.gz)
FASTQ2S=(SI_18785_S1_L001_R2_001.fastq.gz SI_18785_S1_L002_R2_001.fastq.gz)

./ema count \
       -w 4M-with-alts-february-2016.txt \
       -1 <(zcat "${FASTQ1S[@]}") \
       -o counts_file

mkfifo --mode=0666 fq1
mkfifo --mode=0666 fq2
zcat "${FASTQ1S[@]}" > fq1 & 
zcat "${FASTQ2S[@]}" > fq2 & 
./ema preproc \
       -w 4M-with-alts-february-2016.txt \
       -n 20 \
       -c counts_file \
       -1 fq1 \
       -2 fq2
rm fq1 fq2

## sort
find bucket0* -iname '*.fastq' | sort | parallel -n2 ./ema sort -1 {1} -2 {2}

from ema.

mcieslik-mctp avatar mcieslik-mctp commented on August 23, 2024

Raw files obtained from:

zcat fastq/HYKKLBCXY/outs/fastq_path/HYKKLBCXY/SI_18785/SI_18785_S1_L001_R2_001.fastq.gz | head -n 1024 > 2_raw.txt
zcat fastq/HYKKLBCXY/outs/fastq_path/HYKKLBCXY/SI_18785/SI_18785_S1_L001_R1_001.fastq.gz | head -n 1024 > 1_raw.txt

the gzipped fastq's themselves where produced by long-ranger

2_raw.txt
1_raw.txt

from ema.

mcieslik-mctp avatar mcieslik-mctp commented on August 23, 2024

Ok I see the problem....

zcat "${FASTQ1S[@]}" > fq1 & 
zcat "${FASTQ1S[@]}" > fq2 & 

will fix and let you know

from ema.

mcieslik-mctp avatar mcieslik-mctp commented on August 23, 2024

So I fixed my bug but the problem remains:

 -> ./ema align -R $'@RG\tID:foo\tSM:bar' -r /mctp/users/mcieslik/proj/study/10x-pipeline/refs/refdata-GRCh38-2.1.0/fasta/genome.fa -1 bucket000/fq1.preproc.fastq -2 bucket000/fq2.preproc.fastq -o test.sam
BWA initialization...
[M::bwa_idx_load_from_disk] read 0 ALT contigs
Processing reads...
ema: src/samrecord.c:101: rc: Assertion `0' failed.

it looks like a base is missing from both mates

Perhaps a naive question, how would you know that?
I tried to fix ema code by changing read_len to read_len - 1, and it does not try to rc() the null byte anymore.

	if (rec != NULL && rec->rev) {
		for (int i = read_len - 1; i >= 0; i--) {
			fputc(rc(fq->read[i]), out);
		}

		fputc('\t', out);

		for (int i = read_len - 1; i >= 0; i--) {
			fputc(fq->qual[i], out);
		}
	} else {
		for (int i = 0; i < read_len; i++) {
			fputc(fq->read[i], out);
		}

		fputc('\t', out);

		for (int i = 0; i < read_len; i++) {
			fputc(fq->qual[i], out);
		}
	}

Offending processed files:
fq1.preproc.txt
fq2.preproc.txt

from ema.

mcieslik-mctp avatar mcieslik-mctp commented on August 23, 2024

I went through your code and I see that read-length is hard-coded:

include/align.h
#define READ_LEN   151

Which is surprising
read1: 150, index: 8, read2: 150

HiSeq 2500 instruments are limited to ~126 base-pairs but still can be used to sequence 10x libraries...

from ema.

arshajii avatar arshajii commented on August 23, 2024

Yeah, the read length really should be a user-defined parameter. I'll implement that, but in the meantime could you try changing that READ_LEN to 150 and see if that works?

from ema.

arshajii avatar arshajii commented on August 23, 2024

I just added a -l option for specifying read length (for ema preproc/sort/align). In your case it'd be -l 150; let me know if that works.

from ema.

mcieslik-mctp avatar mcieslik-mctp commented on August 23, 2024

Thanks so much for the quick fix. I verified that setting READ_LEN to 150 and recompiling fixes the issue. Now I am testing your latest version with parameterized read length.
Some comments:

  • I think the assumption that all reads in a FASTQ file are of the same length is too strict. Often sequencing cores will provide FASTQ files that have low-quality bases at the ends of reads removed.
  • the ema_wrapper script needs to forward the -l setting

from ema.

arshajii avatar arshajii commented on August 23, 2024

Thanks for your comments; I just updated the script to accept a -l option. As for the first point, I definitely agree. It shouldn't be too hard to handle the lengths on a per-read basis as long as there is some upper bound, which is likely what I'll end up doing.

from ema.

arshajii avatar arshajii commented on August 23, 2024

Closing this issue for now, but let me know if you run into any more problems.

from ema.

mcieslik-mctp avatar mcieslik-mctp commented on August 23, 2024

Thanks, all seems fine now.

from ema.

arshajii avatar arshajii commented on August 23, 2024

FYI the -l option has now been removed; read lengths are inferred automatically (meaning it will work when reads come in varying lengths).

from ema.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.