Giter Club home page Giter Club logo

ema's Introduction

EMA: An aligner for barcoded short-read sequencing data

Build Status License Mentioned in Awesome 10x Genomics

EMA uses a latent variable model to align barcoded short-reads (such as those produced by 10x Genomics' sequencing platform). More information is available in our paper. The full experimental setup is available here.

Install

With brew 🍺

brew install brewsci/bio/ema

With conda 🐍

conda install -c bioconda ema

From source 🛠️

git clone --recursive https://github.com/arshajii/ema
cd ema
make

(The --recursive flag is needed because EMA uses BWA's C API.)

Usage

usage: ema <count|preproc|align|help> [options]

count: perform preliminary barcode count (takes interleaved FASTQ via stdin)
  -w <whitelist path>: specify barcode whitelist [required]
  -o <output prefix>: specify output prefix [required]
  -p: using haplotag barcodes

preproc: preprocess barcoded FASTQ files (takes interleaved FASTQ via stdin)
  -w <whitelist path>: specify whitelist [required]
  -n <num buckets>: number of barcode buckets to make [500]
  -h: apply Hamming-2 correction [off]
  -o: <output directory> specify output directory [required]
  -b: output BX:Z-formatted FASTQs [off]
  -p: using haplotag barcodes
  -t <threads>: set number of threads [1]
  all other arguments: list of all output prefixes generated by count stage

align: choose best alignments based on barcodes
  -1 <FASTQ1 path>: first (preprocessed and sorted) FASTQ file [none]
  -2 <FASTQ2 path>: second (preprocessed and sorted) FASTQ file [none]
  -s <EMA-FASTQ path>: specify special FASTQ path [none]
  -x: multi-input mode; takes input files after flags and spawns a thread for each [off]
  -r <FASTA path>: indexed reference [required]
  -o <SAM file>: output SAM file [stdout]
  -R <RG string>: full read group string (e.g. '@RG\tID:foo\tSM:bar') [none]
  -d: apply fragment read density optimization [off]
  -p <platform>: sequencing platform (one of '10x', 'tru', 'cpt', 'haplotag', 'dbs', 'tellseq') [10x]
  -i <index>: index to follow 'BX' tag in SAM output [1]
  -t <threads>: set number of threads [1]
  all other arguments (only for -x): list of all preprocessed inputs

help: print this help message

See the Other Sequencing Platforms below for more information about the implementation details of different linked-read sequencing technologies.

Input formats

EMA has several input modes:

  • -s <input>: Input file is a single preprocessed "special" FASTQ generated by the preprocessing steps below.
  • -x: Input files are listed after flags (as in ema align -a -b -c <input 1> <input 2> ... <input N>). Each of these inputs are processed and all results are written to the SAM file specified with -o.
  • -1 <first mate>/-2 <second mate>: Input files are standard FASTQs. For interleaved FASTQs, -2 can be omitted. The only restrictions in this input mode are that read identifiers must end in :<barcode sequence> and that the FASTQs must be sorted by barcode. For 10x data, the above two modes are preferred.

Parallelism

Multithreading can be enabled with -t <num threads>. The actual threading mode is dependent on how the input is being read, however:

  • -s, -1/-2: Multiple threads are spawned to work on the single input file (or pair of input files).
  • -x: Threads work on the input files individually.

(Note that, because of this, it never makes sense to spawn more threads than there are input files when using -x.)

End-to-end workflow (10x)

In this guide, we use the following additional tools:

We also use a 10x barcode whitelist, which can be found here.

Preprocessing

Preprocessing 10x data entails several steps, the first of which is counting barcodes (-j specifies the number of jobs to be spawned by parallel):

cd /path/to/gzipped_fastqs/
parallel -j40 --bar 'pigz -c -d {} | \
  ema count -w /path/to/whitelist.txt -o {/.} 2>{/.}.log' ::: *RA*.gz

Make sure that the FASTQs are interleaved and only contain the actual reads in the files above (as opposed to sample indices, typically with I1 in their filenames rather than RA). This will produce *.ema-ncnt and *.ema-fcnt files, containing the count data.

If you do not have interleaved files, you can interleave them as follows:

parallel -j40 --bar 'paste <(pigz -c -d {} | paste - - - -) <(pigz -c -d {= s:_R1_:_R2_: =} | paste - - - -) | tr "\t" "\n" |\
  ema count -w /path/to/whitelist.txt -o {/.} 2>{/.}.log' ::: *_R1_*.gz

where s:_R1_:_R2_: is the regex that casts first-end filenames into the second-end filenames (make sure to adjust this if your naming scheme is different).

Now we can do the actual preprocessing, which splits the input into barcode bins (500 by default; specified with -n). This preprocessing can be parallelized via -t, which specifies how many threads to use:

pigz -c -d *RA*.gz | ema preproc -w /path/to/whitelist.txt -n 500 -t 40 -o output_dir *.ema-ncnt 2>&1 | tee preproc.log

or if you do not have interleaved files:

paste <(pigz -c -d *_R1_*.gz | paste - - - -) <(pigz -c -d *_R2_*.gz | paste - - - -) | tr "\t" "\n" |\
  ema preproc -w /path/to/whitelist.txt -n 500 -t 40 -o output_dir *.ema-ncnt 2>&1 | tee preproc.log

Mapping

First we map each barcode bin with EMA. Here, we'll do this using a combination of GNU Parallel and EMA's internal multithreading, which we found to be optimal due to the runtime/memory trade-off. In the following, for instance, we use 10 jobs each with 4 threads (for 40 total threads). We also pipe EMA's SAM output (stdout by default) to samtools sort, which produces a sorted BAM:

parallel --bar -j10 "ema align -t 4 -d -r /path/to/ref.fa -s {} |\
  samtools sort -@ 4 -O bam -l 0 -m 4G -o {}.bam -" ::: output_dir/ema-bin-???

Lastly, we map the no-barcode bin with BWA:

bwa mem -p -t 40 -M -R "@RG\tID:rg1\tSM:sample1" /path/to/ref.fa output_dir/ema-nobc |\
  samtools sort -@ 4 -O bam -l 0 -m 4G -o output_dir/ema-nobc.bam

Note that @RG\tID:rg1\tSM:sample1 is EMA's default read group. If you specify another for EMA, be sure to specify the same for BWA as well (both tools take the full read group string via -R).

Postprocessing

EMA performs duplicate marking automatically. We mark duplicates on BWA's output with sambamba markdup:

sambamba markdup -t 40 -p -l 0 output_dir/ema-nobc.bam output_dir/ema-nobc-dupsmarked.bam
rm output_dir/ema-nobc.bam

Now we merge all BAMs into a single BAM (might require modifying ulimits, as in ulimit -n 10000):

sambamba merge -t 40 -p ema_final.bam output_dir/*.bam

Now you should have a single, sorted, duplicate-marked BAM ema_final.bam.

Other sequencing platforms

EMA can also be run using data from other linked-read or sequencing platforms than 10x Genomics. Other platforms are selected using the flag -p <platfrom>. Available platforms and their flags specifications are:

For preprocessing with subcommands count and preproc, only 10x Genomics and Haplotagging reads are enabled a the moment.

Haplotagging

The haplotagging method for generating long reads was presented in Meier et al. 2021 PNAS. The platform uses a 16 bp barcode. If using haplotagging data, where barcodes are coded in the read headers as BX:Z:AxxCxxBxxDxx, you do not need to provide a barcode whitelist for the count or proproc steps.

TELL-Seq

The TELL-seq linked-read platform is commercially available from Universal Sequencing and was presented in Chen et al. 2020 GenomeRes. The platform uses a 18 bp semi-degenerate barcode. The FASTQs can for example be preprocess using the Universal Sequencing TELL-Read pipeline to generate barcodes tagged FASTQs as below where the barcode is added after the read name as below

@A00741:47:HCM53DRXX:1:1101:18159:7326:TTATTTAATCTTAGTCGT 1:N:0:1
TTATTTAATCTTAGTCGTCCTGGCTAATTTTTTTGTATTTTTATTAGATACGGGATTTCTCCATGTTGGCTTGGCGGGTCTCAAACTCTTGACCTTAGGTGATCTGCCTGCCTCAGCCTCCCAAAGTGCTGGGATTACCGGCGTGAGCCACCGCACCCAGCCTA
+
,FFFFFFFFF:FFF::FF,FFFFFFFFFF,FFF:F,F::,FFFF,F,F,FF,,FFFFFFFFFF,,FF:F:FF,F:F,,FFFFFFFF:FFFFFFFF:F,:FFFFFF:FFF:FF,FFFFFFFFFFFF:,::F,FFFF:FFFF,FF,FFFFFF:,,FFFFFFFFFFF

Note that these FASTQs need to be sorted by barcode before using ema align.

EMA also supports TELL-seq data provided in the longranger basic format, e.g. BX tagged FASTQs as below

@A00741:47:HCM53DRXX:1:1101:18159:7326 BX:Z:TTATTTAATCTTAGTCGT
TTATTTAATCTTAGTCGTCCTGGCTAATTTTTTTGTATTTTTATTAGATACGGGATTTCTCCATGTTGGCTTGGCGGGTCTCAAACTCTTGACCTTAGGTGATCTGCCTGCCTCAGCCTCCCAAAGTGCTGGGATTACCGGCGTGAGCCACCGCACCCAGCCTA
+
,FFFFFFFFF:FFF::FF,FFFFFFFFFF,FFF:F,F::,FFFF,F,F,FF,,FFFFFFFFFF,,FF:F:FF,F:F,,FFFFFFFF:FFFFFFFF:F,:FFFFFF:FFF:FF,FFFFFFFFFFFF:,::F,FFFF:FFFF,FF,FFFFFF:,,FFFFFFFFFFF

DBS

EMA can run using linked reads generated using the method presented in Redin et al. 2019 SciRep, commonly referred to as Droplet Barcode Sequencing (DBS). For running ema align with DBS linked-read the FASTQs must have the 20 base barcode present in the read header, similar to output from longranger basic. Here is an example FASTQ entry with the barcode CTTGGTCATTCATACAGTCC.

@A00621:130:HN5HWDSXX:4:1103:8639:28573 BX:Z:CTTGGTCATTCATACAGTCC
CAGTGGGAGCCCTGACCTTGTTTTTCTGTAAGTAGACGGTCCATCTAGGGGTGATGGGAGAAAGTGACAGATCATCAGGCATTGGATTCTCCTAAGGAGGGTGCAATGTAGATCCCTCGCGTGCAGAACTCAATGTAGGGTTCATGCTCCC
+
F,FF,FFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFF:FFFFFFFFFF:FFFFF,FFFFFFFFFFFFFFFF:FF::FFF,FFFFFFF,FFFFFFFFFF,FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFF:F:FFFFFFFF

CPT-seq/TruSeq SLR

Instructions for preprocessing and running EMA on data from CPT-seq and TruSeq Synthetic Long Reads can be found here.

Output

EMA outputs a standard SAM file with several additional tags:

  • XG: alignment probability
  • MI: cloud identifier (compatible with Long Ranger)
  • XA: alternate high-probability alignments

ema's People

Contributors

arshajii avatar edharry avatar inumanag avatar pdimens avatar pontushojer avatar rchikhi avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

ema's Issues

Error during preproc

Hi,

I downloaded WGS 10x data from https://support.10xgenomics.com/genome-exome/datasets/2.1.4/NA12878_WGS_v2, subset it with seqtk sample to 10k reads:

seqtk sample -s100 NA12878_WGS_v2_S1_L001_R1_001.fastq 10000 > subset10k_L001_R1_001.fastq
seqtk sample -s100 NA12878_WGS_v2_S1_L001_R2_001.fastq 10000 > subset10k_L001_R2_001.fastq

And taken barcodes from the tenkit package from 10x: https://github.com/10XGenomics/supernova/blob/master/tenkit/lib/python/tenkit/barcodes/4M-with-alts-february-2016.txt

Then I tried to run the pipeline. The count command ran fine (though on first attempt it crashed on compressed input before I figured .fastq.gz is not supported):

cat subset10k_L001_R*_001.fastq | ema count -1 - -w 4M-with-alts-february-2016.txt -o counts_file

However the following preproc command dies with the following error:

cat subset10k_L001_R*_001.fastq | ema preproc -1 - -w 4M-with-alts-february-2016.txt -c counts_file -n 2
> ema: src/preprocess.c:389: preprocess_fastqs: Assertion `__extension__ ({ size_t __s1_len, __s2_len; (__builtin_constant_p (id1) && __builtin_constant_p (id2) && (__s1_len = strlen (id1), __s2_len = strlen (id2), (!((size_t)(const void *)((id1) + 1) - (size_t)(const void *)(id1) == 1) || __s1_len >= 4) && (!((size_t)(const void *)((id2) + 1) - (size_t)(const void *)(id2) == 1) || __s2_len >= 4)) ? __builtin_strcmp (id1, id2) : (__builtin_constant_p (id1) && ((size_t)(const void *)((id1) + 1) - (size_t)(const void *)(id1) == 1) && (__s1_len = strlen (id1), __s1_len < 4) ? (__builtin_constant_p (id2) && ((size_t)(const void *)((id2) + 1) - (size_t)(const void *)(id2) == 1) ? __builtin_strcmp (id1, id2) : (__extension__ ({ const unsigned char *__s2 = (const unsigned char *) (const char *) (id2); register int __result = (((const unsigned char *) (const char *) (id1))[0] - __s2[0]); if (__s1_len > 0 && __result == 0) { __result = (((const unsigned char *) (const char *) (id1))[1] - __s2[1]); if (__s1_len > 1 && __result == 0) { __result = (((const unsigned char *) (const char *) (id1))[2] - __s2[2]); if (__s1_len > 2 && __result == 0) __result = (((const unsigned char *) (const char *) (id1))[3] - __s2[3]); } } __result; }))) : (__builtin_constant_p (id2) && ((size_t)(const void *)((id2) + 1) - (size_t)(const void *)(id2) == 1) && (__s2_len = strlen (id2), __s2_len < 4) ? (__builtin_constant_p (id1) && ((size_t)(const void *)((id1) + 1) - (size_t)(const void *)(id1) == 1) ? __builtin_strcmp (id1, id2) : (__extension__ ({ const unsigned char *__s1 = (const unsigned char *) (const char *) (id1); register int __result = __s1[0] - ((const unsigned char *) (const char *) (id2))[0]; if (__s2_len > 0 && __result == 0) { __result = (__s1[1] - ((const unsigned char *) (const char *) (id2))[1]); if (__s2_len > 1 && __result == 0) { __result = (__s1[2] - ((const unsigned char *) (const char *) (id2))[2]); if (__s2_len > 2 && __result == 0) __result = (__s1[3] - ((const unsigned char *) (const char *) (id2))[3]); } } __result; }))) : __builtin_strcmp (id1, id2)))); }) == 0' failed.
[1]    1448 broken pipe  cat subset10k_L001_R*_001.fastq |
       1449 abort        ema preproc -1 - -w 4M-with-alts-february-2016.txt -c counts_file -n 2

Does it have something to do with subsetting the input?

Attaching a tarball with the inputs.

NA12878_WGS_10x_subset10k.gz

Mapping Longranger basic FASTQ format with BX:Z tag

Does EMA support mapping reads in the Longranger basic FASTQ format, where the corrected barcode is in the BX:Z tag of the FASTQ header?

@E00316:145:HMVG3CCXX:5:1214:6786:3014 BX:Z:AAACACCAGACAATAC-1

Does EMA support mapping reads whose barcodes have not yet been separated from read 1? That is, the uncorrected barcode is the first 16 nucleotides of read 1?

Read length requirements, and error handling

Running ema count on an output from the sequencer, it returns with non-zero without any message, leaving empty *.ema-fcnt and *.ema-ncnt files. Figured that it failed when running into this read:

@A00130:39:H5GWHDMXX:1:1101:8919:3192 1:N:0:TATGATTC
GTCGTAACATC
+
FFFFFFFFFFF

Does it have to do with the read length? Are there are minimal requirements? And can it be handled with the script - with error message of built-in filtering?

reference lenght constraint

Hi, we are trying to align some 10x reads to a PacBio reference that consist of 4000 scaffolds. We would like to know if EMA has any constraint regarding the size of the reference as LongRanger has (1000 scaffolds). Thanks in advance.

Reads mapping to N's in the beginning of a chromosome with 1000genomes BWA indices

There is an odd behaviour we encountered for EMA aligning against the hg38 reference genome. For some reason many reads appear to map to the N stretch in a beginning of a chromosome. Below is an IGV screenshot of the region chr5:1-15000, from top to bottom - EMA GRCh37, EMA hg38, Longranger GRCh37, and Longranger hg38:

screen shot 2018-07-20 at 20 20 00

Reads are colored by the BX tag (bleak colors) and mismatches (bright colors). As you can see, in the rightmost part (it's the telomeric repeat region) EMA reads map to hg38 with tons of mismatches, basically most of the bases are wrong and the average MAPQ is as low as 3. Bust the most odd thing for us is that so many reads are mapped to the leftmost part: the first 10k of chr5 on both reference genomes are just stretches of N.

All 4 BAMs are generated from the same read set. GRCh37 is a non-decoy version (does not have the hs37d5 contig). The EMA version is 0.6.1.

We are wondering why such mapping can be happening? If that can of help, we sliced the chr5:1-15000 regions from all 4 BAM files and uploaded in a tarball: https://drive.google.com/file/d/1o2mrzkS0qt2L98-3cY5y5RFalpWwsmVF/view?usp=sharing

Happy to share other data if that can be helpful!

cc @brainstorm and @ohofmann

Vlad

Incorrect template length output

Looking over the output of some test [haplotag] data, column 9 (TLEN) is showing 0 for many alignments, even if the alignment positions are present.

Example record:

A00470:481:HNYFWDRX2:1:2239:18150:28354_A61C72B93D96_GAACGACTACCACAG	113	2R	1555	54	150M	=	1555	0	AAAGAATAAATTGACTAAGTTATGTCATTTTAAGCGGTCAAAATGGGTGAATTTCCGATTTCAAGTACCAGGCGAACAGAAGACACCTTCTAGAGATTCTGTTCAACTGGTAAGCAAAAACAGTAAATTGCCTAAGTTTTACATTATAAG	FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF,FFFFFFFFFF,FFF:F:FF:FF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:2	BX:Z:A61C72B93D96	XG:f:1	MI:i:66006	XF:i:1	RG:Z:sample1

TLEN = 0
alignment start = 1554
alignemnt end = 1704

EMA split-read handling

Hello,

I use haplotagging data with the EdHarry ema fork and have been trying to use the LEVIATHAN variant caller, which relies on split read information to call variants. My team noticed the ema bam files dont have any flags for supplementary (split-read) alignments (2048 or 0x800), whereas mapping with bwa did.

In the README, you propose mapping the noBC fastq using bwa mem -M, which looks like it changes the supplementary mapping flag to a secondary flag instead, so maybe the -M (or equivalent) is baked into ema? Is there a way for us to retrain this split-read information for use in LEVIATHAN?

Replace XC with MI ?

If the cloud identifier XC is compatible with MI:i, could MI:i be used instead, for compatibility with Lariat?

10x bamtofastq with EMA BAM

Hi,

A side feature request popped up while exploring the hg38 alignments peculiarities.

I tired to use this tool to convert a portion of EMA BAM file back to barcoded FASTQ: https://support.10xgenomics.com/docs/bamtofastq But the tool reported it as "Unrecognized 10x BAM file.". I guess it's likely that the tool is too tied to Lariat specificities. But it would be fantastic if EMA BAMs were compatible with this tool. Not sure if it's feasible without the modifications of the tool.

Or maybe there are alternatives that you could suggest?

Vlad

make -j fails

make -j fails with the following error, but make -j1 succeeds.

g++  -c -std=c++14 -O3 -march=native -pthread  -c -o cpp/count.o cpp/count.cc
…
g++ -c -MMD -MP -I. -g -O3 -DGITVER=\"0.6.0\" count.cc -o count.o
In file included from /usr/include/c++/5/unordered_map:35:0,
                 from count.cc:3:
/usr/include/c++/5/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.

For the complete build log, see https://circleci.com/gh/brewsci/homebrew-bio/1998

ema align errors

I am running ema on haplotagged sequences and the preprocessing seems to be done correctly. But while running
parallel --bar -j10 "ema align -t 4 -d -r $ref_genome -p 10x -s {} | samtools sort -@ 4 -O bam -l 0 -m 4G -o {}.bam -" ::: ${stem}_outdir/ema-bin-???
I am getting the several errors -

2% 13:487=0s x4159_C12_R1_001_outdir/ema-bin-022
BWA initialization...
[M::bwa_idx_load_from_disk] read 0 ALT contigs
Processing reads...
ema: src/util.c:142: safe_malloc: Assertion n == 0 || p != ((void *)0)' failed. ema: src/util.c:142: safe_malloc: Assertion n == 0 || p != ((void *)0)' failed.
samtools sort: failed to read header from "-"

And also sometimes -

[bwa_idx_load_from_disk] Failed to allocate 125163479 bytes at bwa.c line 281: Cannot allocate memory

Can it used for read of length 500bp ?

Noticing that the default read length limit is 200, I changed the max_read_length to be 1000 to force the program to work.
But I'm not sure if it will still work well? Will it still producing reliable result?

conda installation: "illegal hardware instruction" error

Hi,

I installed EMA with conda install -c bioconda ema on a CentOS release 6.9 system. However, ema count command is giving me an illegal hardware instruction error:

$ cat test_wl.txt
CGACACGGTTTGGGCC
$ cat test.fq
@A00130:39:H5GWHDMXX:1:1101:1090:1000 1:N:0:TATGATTC
GAGATTCAGAAGTGCCCAGGGATTATTGTAAGGATGAAATGATATCAGGCATGAAATTCTGTCTGGATGCCAGGTGCGACTATTTTGGTCCTCACAGCCACCTGTTGCCCAGATAGACATTCCTATCCTGGGATGACAGTCAACCATAGTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:FFFFFFFF::F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F,FFF:FFFFF:FFFFFFF,FFF,FFFFF:FFFFFFFFFFFFF,
@A00130:39:H5GWHDMXX:1:1101:1090:1000 2:N:0:TATGATTC
GTAGTAGCCACTTATGCCTAGTGTTCCATTATTGGAAGCTAAGCACGTGGGAGTTATTTATATCCTGCTGCTCAAGGTCATCGCCAAGGACTGATTGTAAAAATTCAAAAAATTGTAACCTCAGGCATAAATGGGTTAAAGGTGTGTGACT
+
:FFFFFFFFFFFFF:F,:FFFFFFFFFFFFFFFFFFFFF:,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFF:FFFFFFF:FF,FF:FF:FFFFFFFFFFFFFFFFFFFFFFFFF
$
$ ema count -w test_wl.txt test.fq -o test
[1]    19491 illegal hardware instruction  ema count -w test_wl.txt < test.fq -o test

I doesn't happen with a manual installation from source.

Platform Flag

Hello,

I was wondering how the alignment process differs based on the -p flag for platforms - i.e. ['10x', 'tru', 'cpt']?

If these change some internal parameters I was wondering if support for another platform - stLFR could possibly be added?

Best,
Ellis

computational bottlenecks

I have run the EMA workflow with great results. Even beyond the smart aligner I really like that you provided third-party implementations to critical steps in 10x data analysis i.e. barcode correction and barcode-aware duplicate marking. While the software certainly works well in it's current state, I wonder whether you would consider addressing some of the most critical computational bottlenecks.

  1. The FASTQ preprocessing step is currently a single-threaded bottleneck in the pipeline execution taking ~50% of the overall wall-clock run-time on our hardware.
  2. Parallelization of the sorting / alignment / duplicate marking steps requires at least 8GB of memory per-core, making it impractical on 64+ core servers.
  3. Use of Picard.

I think 1. could be possibly alleviated using the following strategy.

fastq.gz -> gunzip | split -l 40e6 > chunked_fastq
chunked_fastq > fix_barcode > barcoded_chunked_fastq
barcoded_fastq(s) >  split_by_barcode > buckets

The first and last step's would remain single-threaded, but at least barcode fixing would be parallelized.
For 3. I would suggest using sambamba to mark the duplicates, the software is multi-threaded so using it with ~4 cores per bucket (and process fewer buckets in parallel) would lower memory requirements.

Thanks!

`ema align` command

Hi,

Not clear how to run ema align from the preproc-generated directory (called output_dir in the documentation). The example doesn't mention this directory, and not clear where parallel is taking the input parameters for -s {} - nothing is passed with ::::

parallel --bar -j10 "ema align -t 4 -d -r /path/to/ref.fa -s {} | samtools sort -@ 4 -O bam -l 0 -m 4G -o {}.bam -"

ema align output SAM parsing error

I have been running ema (version 0.6.2) on reads in the longranger basic FASTQ format (BX:Z in header). I pipe the output directly to samtools sort. My command looks something like this.

ema align -1 <(pigz -cd in.1.fastq.gz) -2 <(pigz -cd in.2.fastq.gz) -r genome.fa -t 20 -p 10x 2> mapping.log | samtools sort - -@ 20 -o mapping.bam 2> sorting.log

Mostly this have been working fine but every other run have failed because samtools sort gets a parsing error

Either this one

[E::sam_parse1] unrecognized CIGAR operator

Or this one

[E::sam_parse1] CIGAR and query sequence are of different length

Both are related to the CIGAR so I think there is some formatting error here that creeps in every now and then. I looked SAM entries that caused the error in the one of the runs and found some strange things. Below are four lines, of which the third (marked in bold) is causing the error.

ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG	83	chr1	146535682	3	66M	=	146535535	-213	GATTCTGGTGACCTCACTGTAGGGAGACAGGAGCCTCTGGCAAGTTACCAAACTGCCAGCCTATTC	JFJJJFJJJJFFFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFJJJFFJ	NM:i:0	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:0.5	MI:i:-1470287527	XF:i:0	RG:Z:4	XA:Z:chr1,+148197632,66M,0;
ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG	163	chr1	146535535	3	151M	=	146535682	213	TTTTATTCCCATATTCAATTTTCTTCTTTTCACTTGGAGAAAACCTCTTAGGCTAAAGACCATTGGAAGTGGCCTCCATGTTTAACTCACTGGCCCCTAATTTCTGCTGGACCTTCATTTCAATCGACTGTGTTTCAGGAGGGAAGCGATT	AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJ	NM:i:0	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:0.5	MI:i:-1470287527	XF:i:0	RG:Z:4	XA:Z:chr1,-148197694,151M,0;
ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG	89	chr2	158084867	48	4214864^@	*	0	0	GATTCTGGTGACCTCACTGTAGGGAGACAGGAGCCTCTGGCAAGTTACCAAACTGCCAGCCTATTC	JFJJJFJJJJFFFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJFJJJFFJ	NM:i:4	BX:Z:AAAAAAAAAAAAAAAA-1	XG:f:1	MI:i:-1470287525	XF:i:0	RG:Z:4
ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG	165	*	0	0	*	chr2	158084867	0	TTTTATTCCCATATTCAATTTTCTTCTTTTCACTTGGAGAAAACCTCTTAGGCTAAAGACCATTGGAAGTGGCCTCCATGTTTAACTCACTGGCCCCTAATTTCTGCTGGACCTTCATTTCAATCGACTGTGTTTCAGGAGGGAAGCGATT	AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJ	BX:Z:AAAAAAAAAAAAAAAA-1	RG:Z:4

As you can see this SAM entry is just plain strange. It aslo contains a ^@ character for some reason. Interestingly the SEQ and QUAL strings for this entry does not below to the read with the QNAME ST-E00266:342:HYW32CCXY:1:1211:12439:69625:TTTGTTCCCTAAGTAACACG instead it belongs to the first entry in my example named ST-E00266:342:HYW32CCXY:1:1212:14001:52854:AAAAAAAAAAAAAAAAAATG. This is also the read-pair just before the one causing the error in my FASTQ.

I have tried to replicate the error on a smaller subset of my data but have so far been unsuccessful. For example if i take the FASTQ entries corresponding to the failed SAM entries shown above I don't get any error. So somehow this only happens when running the full dataset.

Do you have any idea what could be causing this?

Input and workflow unclear

Dear author,

I would like to use Emerald to align my 10X genomics reads to a reference.

After reading the README and the issues page I am confused about the following things:

  1. What input it needed?
    After 10X sequencing I obtained and R1 and R2 file. The R1 file has the uncorrected 16 bp barcode+ 7 bp spacer.

However, I did run Longranger basic on these. This produces an interleaved fastq file with corrected barcodes. These barcodes have been added to the header, if they are valid, using the BX:Z: tag.

I used sed on this interleaved fastq file, as you advised in one of the issue threads: zcat barcoded.fastq.gz | sed '1~4 s/ BX:Z:(.*)-1/:\1/' | ema count -w ~/bin/longranger-2.2.2/longranger-cs/2.2.2/tenkit/lib/python/tenkit/barcodes/4M-with-alts-february-2016.txt -o my_sample_count 2> Ema_count.log

The log for this shows the following:

:: Loading 10X took 3.2 s
:: Dumped block 1
:: Dumped block 2
:: Dumped block 3
:: Dumped block 4
:: Dumped block 5
:: Dumped block 6
:: Dumped block 7
:: Dumped block 8
:: Dumped block 9
:: Dumped block 10
:: Dumped block 11
:: Dumped block 12
:: Dumped block 13
:: Dumped block 14
:: Dumped block 15
:: Counting took 2247.3 s
:: Reads with OK barcode: 289,882 out of 306,670,720
:: Ignored 0 reads
:: Dumped block 16
:: Printing took 9.7 s
:: Processed 306,670,720 reads (198,128 MB uncompressed) in 2,260 s

That seems not OK as around 95% of the reads has a valid barcode.

This brings me to the following point..

  1. What steps are needed to align the reads?

Are both ema count and preprocess needed if you already have a barcode corrected interleaved fastq file?
If so, how should they be run?

  1. What does RA.gz refer to in the README?

Any help would be much appreciated.

Regards,

Ronald

8 bp whitelist

The 10X dataset that I would like to use with EMA appears to have an 8 bp index instead of 16. Do you know where I can find an 8 bp whitelist?

Thanks,
Michael

Option for preprocess and map to write and read gzipped files?

I've been trying to develop an EMA processing pipeline using Cromwell and WDL with a Google Compute Engine backend. This involves switching a lot of the multithreaded parallelism you illustrate in your pipeline in the README to instance-based parallelism, which requires localizing intermediate files such as the ema-bin-* files created by ema preprocess to different GCE instances. Since these files are essentially uncompressed fastqs, they can be quite large and therefore can take a while to transfer between instances. It would be nice if ema preprocess could (at least optionally) write the bin files gzipped, and ema map could read gzipped input bin files.

`ema align` silently fails with `[noeol]` FASTA files

Dear ema team,

Thanks a lot for the tool! After some trial and error I noticed that when I use FASTA references from a particular provider that does not put EOL at the end of the file, ema align quits with the following error:

ema: src/main.c:52: chrom_index: Assertion `0' failed.

I understand that this is not the most common issue, but maybe it's worth adding a check or a note to the documentation? It was not really easy to figure out the cause of the error for now.

What's the meaning of density probs?

I find the definition of profiles for different sequencing technologies in Techs.h:

static PlatformProfile profiles[] = {
	{.name = "10x",
	 .extract_bc = extract_bc_10x,
	 .many_clouds = 0,
	 .dist_thresh = 50000,
	 .error_rate = 0.001,
	 .n_density_probs = 4,
	 .density_probs = {0.6, 0.05, 0.2, 0.01}},

	{.name = "tru",
	 .extract_bc = extract_bc_truseq,
	 .many_clouds = 1,
	 .dist_thresh = 15000,
	 .error_rate = 0.001,
	 .n_density_probs = 4,
	 .density_probs = {0.6, 0.05, 0.2, 0.01}},

	{.name = "cpt",
	 .extract_bc = extract_bc_cptseq,
	 .many_clouds = 1,
	 .dist_thresh = 3500,
	 .error_rate = 0.01,
	 .n_density_probs = 9,
	 .density_probs = {0.6, 0.01, 0.15, 0.001, 0.05, 0.001, 0.02, 0.001, 0.01}},

	{.name = NULL},
};

As I want to create profiles for the other sequencing technologies, could you interpret the meaning of density_probs?

Interpreting read counts

Hi,

I'm testing EMA on a set of 10x reads that are stored both in two separate fastQ files (R1 and R2). The number of reads in R1 is 9,000,743 and in R2 is, of course, the same, so in total we have 9,000,743 "read pairs" (or 18,001,486 total reads).

When I run count on the fastQ files, I get the following log:

$ parallel --will-cite -j 2 'paste <(gzip -cd {} | paste - - - -) <(gzip -cd {= s:_R1_:_R2_: =} | paste - - - -) | tr "\t" "\n" | ema count -w 4M-with-alts-february-2016.txt -o {/.}' ::: *_R1_*.gz >> count.log
:: Loading 10X took 2.0 s
:: Counting took 494.5 s
:: Reads with OK barcode: 8,953,372 out of 9,000,743
:: Ignored 0 reads
:: Dumped block 1
:: Printing took 0.5 s
:: Processed 9,000,743 reads (6,350 MB uncompressed) in 497 s

For what I understand, only R1 has the barcode at the start its sequence.

I think I'm just confused with the read counts logged by EMA. Would the above log message means that EMA correctly counted the ~9M barcodes from R1? or that it only processed ~9M of reads (half of the total read count)?

Many thanks,
Pedro

failed assetion

hey, I tried running ema but get:

ema: src/samrecord.c:101: rc: Assertion `0' failed.

During alignment of anything -example offending fastqs (produced using ema preprocess, sort)

1.txt
2.txt

Compilation error

Dear ema developers,
after having a core dump error with the ema preproc pipeline, using the conda installed version, I followed the suggestions given in Issue #27 and tried to build from source.
However, make instruction gave this error:

make[1]: Leaving directory '/home/simone/software/ema/bwa'
g++ -lstdc++ -march=native -O3 -flto -fopenmp -lpthread obj/split.o obj/align.o obj/barcodes.o obj/bwabridge.o obj/techs.o obj/samrecord.o obj/main.o obj/util.o obj/samdict.o cpp/main.o cpp/count.o cpp/format.o cpp/correct.o -Lbwa -lbwa -lm -lz -lpthread -o ema
lto1: fatal error: bytecode stream in file ‘obj/split.o’ generated with LTO version 3.0 instead of the expected 6.0
compilation terminated.
lto-wrapper: fatal error: g++ returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
Makefile:41: recipe for target 'ema' failed
make: *** [ema] Error 1

How do you think I could solve this? Are there some requirements on the compiler version?
Thanks,
Simone

How to use "-b" flag?

Dear developers,

I have read all discussion/issues related to "BX:Z" and tried to generate new fastq files by adding the -b flag to my "ema preproc" run:

-b: output BX:Z-formatted FASTQs [off]

However, as far as I can tell it does not actually print new fastq files anywhere in the file system. I get my 500 bins and downstream alignment works just fine but despite inspecting the source code, I simply can't figure out how the "-b" flag works.

Does it take any extra parameters or is it simply a boolean on/off?
Is it supposed to print to STDOUT or a new file?

Grateful for feedback on this!
Cheers!

Core dump in ema count

Hi,

installed ema via conda on an ubuntu server ...

$ cat test.ilv.fq | \ 
    ema count - \
    -w ../3_whitelist_barcodes/4M-with-alts-february-2016.txt \
    -o counts_file
Illegal instruction (core dumped)

Is this because of the fastq file? Or am running it the wrong way?

$ head -8 test.ilv.fq
@D00562:316:HKTGFBCX2:1:1105:1064:2069 1:N:0:AAGCGCTG
AGATGGGTCTGGGACCGCATGAGCTGTGGTAAACTAAGCGCGGCGTTGAGTTTCGTCACTGAACTTGCATGCTTGCGTGTATCATTGCGTGCTTCGAAACTTAACGTCGCGNTTAGTTTACCACAGCTAATAAGTGAAAATAAGTTAANT
+
DADDDHGIIIIIIFHIHIIIIGHHIIIIHIIICHHHCHIIIIHIIHHHGHHHIHHIDHHIHGIHGHHGIGIIIIIGDEHHIIIIIIIIGIHGHHHIIIIIIHEHIIIIIII#<<EHHHIHIIHHIIIIIIIIIIIIHIIHHHHIHEHH#<
@D00562:316:HKTGFBCX2:1:1105:1064:2069 2:N:0:AAGCGCTG
AAAGTATTCACAGTGTGAATACTTTATTGGGTTTAGGTGGAAAGTTACCCTCGGCCCCCCAGGGACCTGCCCTGACCAATTTCCTTGTTGGAAGTTAATTTTAATTGTAAACTAATTATATCGGCATTAGATATTCTTATTGCTTTTACA
+
@@D@DHIIHIIIIIGHHHHIIIIIIHHHIIIHIIIIIIIIIIIIHHIHHHHIIHHIIIIHIIIIIIHIHHIHIHHIIIIIIIIHHIIHHHIHHHHHIIIII<FHIIIIIIIIHHIIIIIIIIIHIIIIIIIIIIIIIHIIIHIIIIIIII

cheers,

dom

New preprocessing release

First, wanted to thank you for making a 10x alignment tool that outperforms Lariat, and for publishing all the computation scripts you used in a paper in a notebook. We are in the University of Melbourne Center for Cancer Research evaluating applications of 10x for WGS somatic variant calling, and your notebook serves a fantastic reference for evaluation on our internal cancer samples.

Currently for our runs we rely on LongRanger for alignment. However, we are having quite a bit of technical issues running it on cluster, and would be happy to evaluate EMA instead and consider integrating it into https://github.com/chapmanb/bcbio-nextgen. However, at the moment we weren't able to get EMA to the end, except for small tests: the preprocessing step has been taking quite a substantial amount of time so far (20 hours for a NA12878 WGS sample on 30 threads, however it's using only one core anyway as far as I can see). I'm not sure if things are going normal or got stuck. In any case, I understand that you have a more efficient preprocessing method for internal use. I wondering if you have any estimates by any chance when you could release it?

Vlad

Read group in ema_wrapper.sh

This line does not work for me:

bwa mem -t "$t" -M -R "${R//$'\t'/'\t'}" "$r" bucket_no_bc/*1.no_bc.fastq bucket_no_bc/*2.no_bc.fastq > "bucket_no_bc/$OUTSAM"

It gets translated into the following one and BWA fails with an error:

+ bwa mem -t 2 -M -R '@RG'\''\t'\''ID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0'\''\t'\''SM:NA12878_WGS_v2' /data/cephfs/punim0010/extras/10x/refdata-GRCh37/fasta/genome.fa bucket_no_bc/subset_L001_R1.no_bc.fastq bucket_no_bc/subset_L001_R2.no_bc.fastq
[E::bwa_set_rg] no ID within the read group line

However, I managed to fix it for me by removing the single quotes in the last bit of the replacement statement:

bwa mem -t "$t" -M -R "${R//$'\t'/\t}" "$r" bucket_no_bc/*1.no_bc.fastq bucket_no_bc/*2.no_bc.fastq > "bucket_no_bc/$OUTSAM"

(see "${R//$'\t'/\t}" instead of "${R//$'\t'/'\t'}")

I'm wondering if that might be bash version specific or something? Also, is it possible to improve the usage of the pipeline by removing the requirements to pass the read group as an argument explicitly?

Trimming 1bp of R2

As far as I understand, you trim 16+7 based from the first read only, and leave the mate read alone:

In summary, in the barcode extraction stage, we remove the 16bp barcode from the first mate
of each read pair, and trim an additional 7bp to account for potential ligation artifacts resulting
from the barcode ligation process during sequencing (the second mate shares the same barcode
as the first mate).

The Longranger authors actually recommend additionally trimming the first base of R2: https://community.10xgenomics.com/t5/Genome-Exome-Forum/Best-practices-for-trimming-adapters-when-variant-calling/td-p/470

In terms of trimming, we recommend trimming the first 16+7bp of R1, and the first 1bp of R2. R1 contains the 16bp 10x barcode + 7bp of low accuracy sequence from an N-mer oligo. The first bp of R2 empirically has about a 5x higher mismatch rate. Given the stats you're showing, I don't expect the trimming to have a huge influence -- my guess is that you'll get the biggest win from filtering poor variants.

Have you guys considered that by any chance? Wondering if it would improve speed or quality at all.

Not all input reads are included in `ema align` output

I have noticed that in some instances not all reads that are use as input are included in the output from ema align. I have included some test data (see data.zip) for such a case below. The command used and the output can be seem below.

command

ema align -p 10x -s ema-bin-000.no_map -r ref.fasta

output

BWA initialization...
[M::bwa_idx_load_from_disk] read 0 ALT contigs
@HD	VN:1.3	SO:unsorted
@SQ	SN:chrA	LN:50000
@SQ	SN:chrB	LN:40000
@SQ	SN:chrC	LN:9000
@SQ	SN:chrD	LN:1000
@RG	ID:rg1	SM:sample1
@PG	ID:ema	PN:ema	VN:0.6.2	CL:./../ema align -p 10x -s ema-bin-000.no_map -r ref.fasta
Processing reads...

As you can see no alignment are in the output.

I did some digging (with my non-existing C-knowledge) and found that this [for-loop] (https://github.com/arshajii/ema/blob/1d3b65d6504f0698ed7613efc4e688d011b8346c/src/align.c#LL347C3-L349C3) in src/align.c does not append any records for the file in question. ema should output these reads as unmapped if this was the case.

Below is the contents of the file used in the example.

File: ema-bin-000.no_map

AGGCTCGAGTCAAGTA @ST-E00273:177:HMTTCCCXX:1:1218:29467:32953 TAGACTGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCATCTCAAAAAAATAAAAAATAAAAATAAAAAATTTCATTAAGAAAATAAAAAAATAAACTGAAATCTTATAATAAAACCTGG KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKA,,,FK,K,,,,A,,,,,AKKAKF<KA,,,,,F<,<A,F,A<A,,A7,<,<,A,,<,, TATTTTTTATTTTTTTGAGATGGAGTCTCGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCAGTCTTCTTGAGATACTTGACTCGCGCCTAGATCGGAAGAGCGTCGTGTATGGAATGAGTGGAGATATCGGTGGTCGTCGTTCCGTGATA AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKK,<,<,,,,A<K,,A,7,,,,,(,A,<FKA,,FKAKKAAF,(A,A,,,7F,,,<AFK,77,A,,,,,,,,A,,,,,,,,,,,,<,<

data.zip

Picard error in ema_wrapper.sh

Getting an error from Picard in ema_wrapper.sh:

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing SAM header. Problem parsing @PG key:value pair ID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0 clashes with ID:ema. Line:
@PG     ID:ema  PN:ema  VN:0.1.0        CL:/home/vlad/bin/ema align -1 bucket000/subset_L001_R1.preproc.fastq -2 bucket000/subset_L001_R2.preproc.fastq -r /data/cephfs/punim0010/extras/10x/refdata-GRCh37/fasta/genome.fa -o bucket000/ema_out.sam -R @RG     ID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0    SM:NA12878_WGS_v2; File /data/cephfs/punim0010/extras/vlad/synced/umccr/10x/ema/
10k/bucket000/ema_out.sorted.bam; Line number 87

The @RG line in bucket000/ema_out.sorted.bam is indeed having two ID tags, the first one ID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0
is the one that I provided with the -R option to ema_wrapper.sh:

EMAPATH=/home/vlad/bin/ema \
PICARDPATH=/data/cephfs/punim0010/extras/10x/miniconda/envs/10x/share/picard-2.17.11-0/picard.jar \
/data/cephfs/punim0010/extras/vlad/synced/umccr/10x/ema/ema_wrapper.sh \
-r /data/cephfs/punim0010/extras/10x/refdata-GRCh37/fasta/genome.fa \
-R $'@RG\tID:NA12878_WGS_v2:LibraryNotSpecified:1:unknown_fc:0\tSM:NA12878_WGS_v2' \
-t 2

And another one is ID:ema, meaning it was added by the pipeline as well?

I'm wondering if it's me doing something wrong?

Internal duplicate marking

Duplicates are being marked with Picard right now. This should be easy enough to do within EMA (like Lariat does), just by looking for reads with the same barcode that map to the same place. This would remove the Picard dependency as well.

Barcodes not encoded/decoded correctly for haplotag reads.

During testing of #51 I noticed that haplotag barcodes were not the same when comparing input to output of ema align.

For example this is the read A00814:267:HTMH3DRXX:2:1101:3197:1016 in the FASTQ input and SAM output

FASTQ -> BX:Z:A80C01B18D66

@A00814:267:HTMH3DRXX:2:1101:3197:1016 BX:Z:A80C01B18D66	RX:Z:GAAACGAAGGCAT+CACAGACGTCACA	QX:Z:FFFFFFFFFFFFF+FFFFFFFFFFFFF

SAM -> BX:Z:A114C114B114D114

A00814:267:HTMH3DRXX:2:1101:3197:1016	137	chr10	99608106	0	8S19M6S	*	0	0	GGTCCACTATTTTCCCTGACACATAGCGTGCCA	FF,FFFFFF:F:F,FF,:F,F,FFF::::FF:F	NM:i:0	BX:Z:A114C114B114D114	XG:f:0.5	MI:i:0	XF:i:0	RG:Z:rg1	XA:Z:chr17,-77835071,10S19M4S,0;

Looking at the barcode counts before an after mapping this is even clearer.

This is the top10 barcodes by count in my test FASTQ

      4 A60C01B36D05
      4 A48C01B86D83
      4 A39C01B62D05
      4 A35C01B41D83
      4 A22C01B15D55
      3 A96C01B63D50
      3 A93C01B35D61
      3 A93C01B18D28
      3 A90C01B94D74
      3 A90C01B33D44

and this is the top10 barcodes by count in the ema align output.

    514 A114C114B114D114
    450 A127C127B125D112
    122 A127C127B127D112
     82 A114C114B114D87
     66 A127C127B126D84
     64 A114C114B114D54
     38 A110C114B114D114
     24 A87C114B114D114
     20 A114C87B114D114
     18 A127C100B114D114

Most of these does not even exist in the input FASTQ. Also it seems that multiple input barcodes are assigned to the same output barcode. This would potentially be detrimental for ema's functionality as it would the try to map reads from different barcodes to the same "molecule".

This is probably interesting for @pdimens.

ema align core dumped

Hi,
I tried to use the ema align command to have barcoded reads aligned to a reference.
However I got this error:

BWA initialization...
[M::bwa_idx_load_from_disk] read 0 ALT contigs
Processing reads...
ema: src/techs.c:8: extract_bc_10x: Assertion `bc_str != NULL' failed.
Aborted (core dumped)

Since ema preproc didn't produce any "special fastq" file, I am not specifying the -s parameter.
I read another user (Issue #15 ) had this error and solved it with:
cat reads.fastq | paste - - - - - - - - | grep "BX:Z:" | tr '\t' '\n' > reads_noBXZ.fastq
However, the files I got are empty.
Do you know how could I solve this?
Simone

10X Long Ranger with EMA BAM

Hello,

We have barcoded short-read samples that we aligned with EMA and we are interested in using the resulting BAM file for the 10x Long Ranger pipeline. Specifically, we would like to provide the EMA BAM file as input for Long Ranger's structural variant calling and other downstream analyses. For single variants we already used GATK's HaplotypeCaller (v4.0.8.1) on the BAM file as described in your experimental pipeline. It seems that this pipeline only accepts fastq files for input and we would like to know if it would be possible to instead use results from EMA.

ema-bin-??? and possible preproc failure

I am not sure if ema-bin-??? refers to something I should be replacing. When I try to run the ema align step I get the error:

error: file /home/metzgerm/EMA/ema-bin-??? could not be opened

The preproc does not make any files or folders with "ema-bin" in the name. Should it? The preproc log is short and unclear:

:: Bucketing 1 inputs into 500 files with 40 threads
:: Loading known counts file counts_file.ema-ncnt ...
:: Loading known counts ... done in 4.9 s
:: Loading full counts file counts_file.ema-fcnt ...
::: Loading dump à? aê‘1 of size � @ 14,913,081 ( �Bec� 910 MB) done in 2.052 s

I have run:
cat ~/EMA/some_clam_S3_L004_interleaved.fastq | ema count -w ~/EMA/4M-with-alts-february-2016.txt -o counts_file 2>&1 | tee count.log
cat ~/EMA/some_clam_S3_L004_interleaved.fastq | ema preproc -w ~/EMA/4M-with-alts-february-2016.txt -n 500 -t 40 -o ~/EMA *.ema-ncnt 2>&1 | tee preproc.log
#parallel --bar -j10 "ema align -t 4 -d -r ~/arrow/Mar.3.3.3.consensus.fasta -s {} | samtools sort -@ 4 -O bam -l 0 -m 4G -o {}.bam -" ::: ~/EMA/ema-bin-???

Thanks for your help.

Best,
Michael

BWA incompatible alignment SAM flags

I was trying to use the ema-aligned BAM file for downstream structural-variant discovery. I noticed that extraction of split/discordant reads (using https://github.com/hall-lab/extract_sv_reads) resulted in no split/discordant reads being discovered. At least part of the reason is the non-standard use of alignment flags by ema. For example, whereas BWA may align a read with flag 113:

    read paired (0x1)
    read reverse strand (0x10)
    mate reverse strand (0x20)
    first in pair (0x40)

EMA will align the same read pair with flag 115

    read paired (0x1)
    read mapped in proper pair (0x2)
    read reverse strand (0x10)
    mate reverse strand (0x20)
    first in pair (0x40)

For Illumina chemistry the rev-rev orientation is not a proper pair. In many cases the "proper" reads are aligned to different chromosomes...

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.