Comments (4)

arshajii commented on August 23, 2024

Hi @pontushojer,
Sorry for the delay in responding. This is quite strange, and looks like a bug that only comes up when processing a large set of reads.

A couple of quick questions:

  • Are you running a pre-built version of EMA (e.g. installed from brew/conda)? Or are you using a version built from source?
  • How big is the dataset that causes this issue? For debugging purposes, it would be very helpful if we could produce some subset of it that still causes this error. (Or if it's not too big, any chance you could share it so we can use it in debugging?)

from ema.

pontushojer commented on August 23, 2024

Hi @arshajii,

No worries!

  • Are you running a pre-built version of EMA (e.g. installed from brew/conda)? Or are you using a version built from source?

I am running a pre-built version from conda, version 0.6.2 build h8b12597_1.

  • How big is the dataset that causes this issue? For debugging purposes, it would be very helpful if we could produce some subset of it that still causes this error. (Or if it's not too big, any chance you could share it so we can use it in debugging?)

The datasets are about 400-500 M read pairs each. So far I have had issues with about three of my datasets.

So far I have been unable to generate a smaller dataset that replicates the issue. If I extract the read pairs for the barcodes surrounding the entry that causes the error in the full dataset, the run completes without error. I will keep trying to generate a subset since, as you say, it would help narrow this down.

I can check about sending a full dataset...

pontushojer commented on August 23, 2024

@arshajii I have now managed to generate a smaller subset that reproduces the issue.

Running the following:

ema align -1 <(pigz -cd failing.fastq.gz) -R '@RG\tID:1\tSM:20\tPU:unit1\tPL:ILLUMINA' -r genome.fa -t 4 -p 10x 2> mapping.log | samtools sort - -@ 4 -o out.bam -O BAM -l 0 2> sorting.log

outputs this to the sorting.log:

[E::sam_parse1] CIGAR and query sequence are of different length
samtools sort: truncated file. Aborting

If I skip the pipe to samtools sort and look at the unsorted output from ema, I find two faulty entries:

A00187:292:H7G2JDSXY:3:2674:9516:18662:TTTTTTTGTAAGGAACTGAA	73	chrX	147384013	60	8006656M	*	0	0	ATAAAATTAAAAAAAAAAAAAAAAAAAAAAAAATATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA	F:FF,F:F,FFF,:FFFFF:FFFFFFF,FFFFFFF,FFFFFFFFFFFFFFF,FFFFFF,,FFFFFF	NM:i:0	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:1	MI:i:1620035	XF:i:0	RG:Z:1
...
A00187:292:H7G2JDSXY:3:1271:24126:15405:AAAAAAAAAAAAATAAAAAA	73	chr3	26999354	22	46139657M691D46137351c691D	*	0	0	CCCCCTCATTGTCCTTGTCTATTACATTTTTATTTTTATATTATAATAGCTTATGGTATGTAAT	FF:F::FF:F,:FF:F:FF,FFFFFFF:,FF,FFFF:FFFFFFFFFFFF,,FFFF:FF::FF,F	NM:i:4	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:1	MI:i:1620051	XF:i:0	RG:Z:1

As you can see, the CIGAR strings are 8006656M and 46139657M691D46137351c691D respectively, and both are wrong: the first claims far more aligned bases than the read contains, and the second contains an invalid operator (c). Also, the input barcodes (found in the read names) are TTTTTTTGTAAGGAACTGAA and AAAAAAAAAAAAATAAAAAA, but the tagged barcode is BX:Z:AAAAAAAAAAAAAAAA-1 for both.
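
For anyone who wants to screen unsorted SAM output for records like these, here is a minimal sketch in Python. It only applies the check that samtools complains about (the number of query bases implied by the CIGAR must equal the length of SEQ) plus a syntax check on the CIGAR operators. The function names cigar_problem and check_sam are made up for this illustration and are not part of EMA or samtools.

```python
import re

# CIGAR operators defined by the SAM specification.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators that consume bases of the query sequence (SEQ).
QUERY_CONSUMING = set("MIS=X")


def cigar_problem(cigar, seq):
    """Describe what is wrong with a CIGAR string, or return None if it looks fine."""
    if cigar == "*" or seq == "*":
        return None  # nothing to compare against
    ops = CIGAR_RE.findall(cigar)
    # Rebuilding the string from the parsed operations catches stray
    # characters, such as a lowercase 'c' in the middle of the CIGAR.
    if "".join(n + op for n, op in ops) != cigar:
        return "malformed CIGAR"
    query_len = sum(int(n) for n, op in ops if op in QUERY_CONSUMING)
    if query_len != len(seq):
        return f"CIGAR implies {query_len} query bases, SEQ has {len(seq)}"
    return None


def check_sam(lines):
    """Yield (read name, CIGAR, problem) for each suspicious alignment line."""
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            yield fields[0], "", "fewer than 11 mandatory fields"
            continue
        problem = cigar_problem(fields[5], fields[9])
        if problem:
            yield fields[0], fields[5], problem
```

Passing an open SAM file (or sys.stdin) to check_sam and printing what it yields should list the offending read names, which can then be used to pull the corresponding reads out of the FASTQ.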

Hope this helps to locate the issue!

Subset: failing.fastq.gz

pontushojer commented on August 23, 2024

Hi @arshajii.

I was wondering if you have had the opportunity to look into this issue since I posted the subset?
