bwa-mem2 / bwa-mem2 Goto Github PK
View Code? Open in Web Editor NEWThe next version of bwa-mem
License: Other
The next version of bwa-mem
License: Other
Setting:
Note:
Log (Segmentation Fault):
⟫ bwa-mem2 mem -t 36 genome/hs38DH.fa /home/dnanexus/in/reads_fastqgzs/0/ERR2990063_1.fastq.gz /home/dnanexus/in/reads2_fastqgzs/0/ERR2990063_2.fastq.gz -K 100000000 -Y -R '@RG\tID:ERR2990063_1\tPL:ILLUMINA\tPU:ERR2990063_1\tLB:ERR2990063\tSM:ERR2990063' > /dev/null
-----------------------------
Executing in AVX512 mode!!
-----------------------------
Ref file: genome/hs38DH.fa
Entering FMI_search
reference seq len = 6434693835
... [Deleted for brievity]
[0000] 3. Calling kt_for - worker_bwt
[0000] read_chunk: 100000000, work_chunk_size: 7276690, nseq: 48190
[0000][ M::kt_pipeline] read 48190 sequences (7276690 bp)...
[0000] 3. Calling kt_for - worker_aln
Inferring insert size distribution from data, l_pac: 3217346917, n: 662252
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (3, 279551, 4, 4)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (240, 282, 341)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (38, 543)
[M::mem_pestat] mean and std.dev: (294.94, 74.25)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 644)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[0000] 10. Calling kt_for - worker_sam
[0000][ M::mem_process_seqs] Processed 662252 reads in 104.068 CPU sec, 3.060 real sec
[0000] 2. Calling mem_process_seqs.., task: 46
[0000] 3. Calling kt_for - worker_bwt
[0000] 3. Calling kt_for - worker_aln
Segmentation fault
139
Below are the files we used:
Reference genome (hs38DH.fa as a tar file): https://dl.dnanex.us/F/D/jG73Z8595GxZYXpKJbqGZzJyX15Q3V0YQG96BxJ7/hs38DH.bwa-index.avx512.v2.tar.gz
ERR2990063_1.fastq.gz: https://dl.dnanex.us/F/D/7Z6ygpJ3xVzf7bG04gx4JVPZ2y4G842g52fJJ9B9/ERR2990063_1.fastq.gz
ERR2990063_2.fastq.gz: https://dl.dnanex.us/F/D/8545B424zVg53Y3BKFyzXxXbJzJV1f3X6KP8V84g/ERR2990063_2.fastq.gz
STDOUT/STDERR of the run:
https://dl.dnanex.us/F/D/pv3VqJyv1842gkk0G45Q0qv362PqvP12J20G61ZX/ERR2990063.log
I used a fastq sequence of varying length to align with the reference genome and obtained segment errors.
Setting:
Command:
⟫ bwa-mem2 mem -t 8 genome/hs38DH.fa /home/dnanexus/in/reads_fastqgzs/0/ERR3242672_1.fastq.gz /home/dnanexus/in/reads2_fastqgzs/0/ERR3242672_2.fastq.gz -K 100000000 -Y -R '@RG\tID:ERR3242672_1\tPL:ILLUMINA\tPU:ERR3242672_1\tLB:ERR3242672\tSM:ER^C242672' > /dev/null
Error message:
...
[M::mem_pestat] (25, 50, 75) percentile: (827, 2626, 3427)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 8627)
[M::mem_pestat] mean and std.dev: (2383.79, 1436.55)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 11227)
[M::mem_pestat] skip orientation RR as there are not enough pairs
[0000] 10. Calling kt_for - worker_sam
*** Error in `bwa-mem2': free(): invalid next size (normal): 0x00007f3a1d66fb50 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f48411137e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f484111c37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f484112053c]
bwa-mem2[0x44cdf8]
bwa-mem2[0x445869]
bwa-mem2[0x42658c]
bwa-mem2[0x46a631]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f4841f286ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f48411a341d]
======= Memory map: ========
00400000-0049d000 r-xp 00000000 00:2a 75543 /bwa-mem2/bwa-mem2.avx512bw
0069c000-006b0000 rw-p 0009c000 00:2a 75543 /bwa-mem2/bwa-mem2.avx512bw
006b0000-006d1000 rw-p 00000000 00:00 0
025dd000-03a99000 rw-p 00000000 00:00 0 [heap]
7f38f4000000-7f38f8000000 rw-p 00000000 00:00 0
7f38f8000000-7f38fbffd000 rw-p 00000000 00:00 0
7f38fbffd000-7f38fc000000 ---p 00000000 00:00 0
7f390c000000-7f3910000000 rw-p 00000000 00:00 0
...
Below are the files we used:
Reference genome (hs38DH.fa as a tar file, built after 6743183): https://dl.dnanex.us/F/D/19XXQfV51qBYpP4jgY46YzpX7ZB6GQYv78Zg6B4x/hs38DH.bwa-index.avx512.v2.tar.gz
ERR3242672_1.fastq.gz (12.03 GiB): https://dl.dnanex.us/F/D/FXyYZk9kqk8Y0qVGPV1VkBpY0byfQyB1Q2zf7v84/ERR3242672_1.fastq.gz
ERR3242672_2.fastq.gz (13.77 GiB): https://dl.dnanex.us/F/D/8y5qbZbkbVYyJJZXqB1PPg1YVZ5JFK5J7QjYx82j/ERR3242672_2.fastq.gz
Hi - I know the team isn't currently focused on adding features, but it would be great to add support for uBam input. This is increasingly the recommended workflow (e.g. by GATK/Picard team) since uBams can store metadata e.g. UMIs that are not easily stored or parsed in fastqs, and is increasingly supported by other popular aligners e.g. Novoalign and Bowtie 2.
Hello,
I'm not that knowledgeable about AVX compilation processes so I was wondering what parameters do I change in the Makefile to make it compatible with AVX2 AMD CPUs?
Using the precompiled binaries gives me an error about using compatible AVX CPUs, and the makefile fails to get AVX2 to build.
Best,
Chang
We have attempted to use bwa-mem2
to process a large dataset.
bwa mem
./bin/bwa-mem2 mem -p ref/genome.fa data/mini.xi.gz > /dev/null
Taking 1 of the cases the error messages change depending on number of CPUs:
double free or corruption (!prev)
free(): invalid next size (normal)
As it was working with AVX2, I ran tests on each of the binaries in isolation:
The host used supports avx512bw.
casm3-os0000001: grep avx512bw /proc/cpuinfo | uniq -c
60 flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat pku ospke
With help from @mp15 we have a dataset that reliably reproduces this error with just 212 read pairs (151 length).
@mp15 has traced the problem with "Address Sanitiser" :
Just a brief update as to where I have got to. It turns out that whilst Valgrind doesn't work on AVX-512 instructions, Address Sanitiser does and you can breakpoint the process at __asan_on_error@plt to see what is about to start writing over your stack/heap. I have managed to trace the issue we are seeing to the following line:
src/kswv.cpp:326 mySeq1SoA[k * SIMD_WIDTH8 + j] = (seq1[k] == AMBIG_ ? AMBR:seq1[k]);
Which as far as I can tell is part of a AVX-512 banded smith-waterman implementation and I believe part of the mate pair rescue feature?
The array it is indexing into is a subunit of:
#define MAX_SEQ_LEN_REF_SAM 2048 #define SIMD_WIDTH8 64 uint8_t *seq1SoA = NULL; seq1SoA = (uint8_t *)_mm_malloc(MAX_SEQ_LEN_REF_SAM * SIMD_WIDTH8 * numThreads * sizeof(uint8_t), 64);
via:
//#pragma omp parallel num_threads(numThreads) { int32_t i; // uint16_t tid = omp_get_thread_num(); uint16_t tid = 0; uint8_t *mySeq1SoA = seq1SoA + tid * MAX_SEQ_LEN_REF_SAM * SIMD_WIDTH8;
Apparently the thread number here will always be 0 and I assume numThreads will always be 1? In anycase that makes the maximum array size: 2048641=131072. The line in question is part of a loop:
for(k = 0; k < sp.len1; k++) { // mySeq1SoA[k * SIMD_WIDTH8 + j] = (seq1[k] == AMBIG_?0xFF:seq1[k]); mySeq1SoA[k * SIMD_WIDTH8 + j] = (seq1[k] == AMBIG_ ? AMBR:seq1[k]); }
And when it errors, k is 2048 and j is 0, with sp.len1 equal to 20577 and past the end of the array, therefore it is likely to be sp.len1 at fault. I am now trying to figure out why this is happening but we may need to pass this to Intel and ask them to figure it out.
Unfortunately, I am unable to attach the data directly to the issue due to DAA requirements.
To apply for access please see here, I believe it requires you to be a member of an academic
institution though:
https://www.sanger.ac.uk/legal/DAA/MasterController
Please let me know if there is any further information I can provide.
P.S.
The following have been observed on other datasets, I am still confirming if these are specific to avx512bw, so far all 174 are shown to work on sse41 and 81 on avx2 (incomplete on remainder):
bwa-mem2: malloc.c:2927: __libc_malloc: Assertion `!victim || chunk_is_mmapped (mem2chunk (victim)) || ar_ptr == arena_for_chunk (mem2chunk (victim))' failed.
bwa-mem2: malloc.c:3567: _int_malloc: Assertion `(fwd->size & NON_MAIN_ARENA) == 0' failed.
bwa-mem2: malloc.c:3722: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
Setting:
Note:
Command:
bwa-mem2 mem -t 36 genome/hs38DH.fa /home/dnanexus/in/reads_fastqgzs/0/150420_HG005_29-40_R1.fastq.gz /home/dnanexus/in/reads2_fastqgzs/0/150420_HG005_29-40_R2.fastq.gz -K 100000000 -Y -R '@RG\tID:150420_HG005_29-40_1\tPL:ILLUMINA\tPU:150420_HG005_29-40_1\tLB:150420_HG005_29-40\tSM:150420_HG005_29-40'
Error message:
*** Error in `bwa-mem2': free(): invalid next size (normal): 0x00007f4344068230 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f527cb417e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f527cb4a37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f527cb4e53c]
bwa-mem2[0x44a38d]
bwa-mem2[0x42acba]
bwa-mem2[0x40a501]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f527d9566ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f527cbd141d]
We were mapping downsampled HG005 WGS FASTQ as input. We realized that there are specific 12 consecutive read pairs triggering the error. When further removing the read pairs from the top or bottom, mapping succeed without any error.
Below are the files we used:
Reference genome (hs38DH.fa as a tar file): https://dl.dnanex.us/F/D/jG73Z8595GxZYXpKJbqGZzJyX15Q3V0YQG96BxJ7/hs38DH.bwa-index.avx512.v2.tar.gz
150420_HG005_29-40_R1.fastq.gz: https://dl.dnanex.us/F/D/J8J0fp049fG3Yzfyxf0KG7k03YJ6J1P1q4vyz27y/150420_HG005_29-40_R1.fastq.gz
150420_HG005_29-40_R2.fastq.gz: https://dl.dnanex.us/F/D/vbPkzZkjyq6JZQf1j2Jgjjb1xV248fvx10jv254K/150420_HG005_29-40_R2.fastq.gz
STDOUT/STDERR of the run:
https://dl.dnanex.us/F/D/8KzbQy2j5zZ10V6bvK3bjf2Pg9Gqf1bg5qbzbbX7/150420_HG005_29-40.log
Ubuntu 14.04.5 LTS
Dual Intel Xeon E5-2620 v4 with 512GB RAM
Mapping NA12878 to HG38
Running in AVX2 mode
bwa-mem2 mem \
-t 32 \
-o ${OUT} \
-R "@RG\tID:bench\tLB:bench\tSM:bench\tPU:bench\tPL:ILLUMINA" \
Homo_sapiens_assembly38.fasta \
ERR194147_1.fastq.gz \
ERR194147_2.fastq.gz
Attempting the above with the precompiled binaries fails fairly quickly with:
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RR
[0000] 10. Calling kt_for - worker_sam
[0000] read_chunk: 320000000, work_chunk_size: 320000118, nseq: 3168318
[0000][ M::kt_pipeline] read 3168318 sequences (320000118 bp)...
[0000][ M::mem_process_seqs] Processed 3168318 reads in 650.924 CPU sec, 22.033 real sec
[0000] 2. Calling mem_process_seqs.., task: 20
[0000] 3. Calling kt_for - worker_bwt
ERROR! seedBuf count exceeds size. count = 256000, tmp.m = 1, size = 256000
Reserved memory allocated for storing seeds has falling short!!!
Please increase the value of macros: AVG_SEEDS_PER_READ & AVG_AUX_SEEDS_PER_READ, in src/macro.h, re-compile and re-run.
Thu Jul 25 11:33:23 EDT 2019
However, if I compile myself (GCC, I don't have a license for icpc) with all the current commits it gets much farther, but still fails with this:
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RR
[0000] 10. Calling kt_for - worker_sam
[0000] read_chunk: 320000000, work_chunk_size: 320000118, nseq: 3168318
[0000][ M::kt_pipeline] read 3168318 sequences (320000118 bp)...
[0000][ M::mem_process_seqs] Processed 3168318 reads in 566.920 CPU sec, 19.277 real sec
[0000] 2. Calling mem_process_seqs.., task: 398
[0000] 3. Calling kt_for - worker_bwt
[0000] read_chunk: 320000000, work_chunk_size: 320000118, nseq: 3168318
[0000][ M::kt_pipeline] read 3168318 sequences (320000118 bp)...
[0000] 3. Calling kt_for - worker_aln
bwa-mem2: src/bwamem.cpp:2170: void mem_chain2aln_across_reads_V2(const mem_opt_t*, const bntseq_t*, const uint8_t*, bseq1_t*, int, mem_chain_v*, mem_alnreg_v*, mem_cache*, int64_t, int64_t, int64_t, int): Assertion `numPairsRight < 512 * 500' failed.
Thu Jul 25 19:40:16 EDT 2019
Hi there,
It appears that there are some small differences in alignment when invoking bwa-mem2 vs bwa version 0.7.17-r1188.
To demonstrate, I obtained a pair of NA12878 FASTQ files from 1000 Genomes:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/sequence_read/ERR001713_1.filt.fastq.gz
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA12878/sequence_read/ERR001713_2.filt.fastq.gz
I then invoked both the bwa and bwa-mem2 as follows:
$ bwa mem -K 10000000 -t 8 hs37d5.fa ERR001713_1.filt.fastq.gz ERR001713_2.filt.fastq.gz | samtools view -u - > bwa.bam
$ bwa-mem2 mem -K 10000000 -t 8 hs37d5.fa ERR001713_1.filt.fastq.gz ERR001713_2.filt.fastq.gz | samtools view -u - > bwa-mem2.bam
Now, when looking at the "samtools stats" output for each BAM file, there are some subtle differences:
$ diff <(samtools stats bwa.bam | grep '^SN') <(samtools stats bwa-mem2.bam | grep '^SN')
20c20
< SN bases mapped (cigar): 284761502 # more accurate
---
> SN bases mapped (cigar): 284761501 # more accurate
23,24c23,24
< SN mismatches: 1854434 # from NM fields
< SN error rate: 6.512235e-03 # mismatches / bases mapped (cigar)
---
> SN mismatches: 1854433 # from NM fields
> SN error rate: 6.512232e-03 # mismatches / bases mapped (cigar)
And here's one example of a slightly different alignment:
$ samtools view bwa.bam | grep '^ERR001713.5425457'
ERR001713.5425457 83 6 83350583 0 7S29M = 83350484 -128 ACTAGAAAAATCCAATAGCATATCAAAAAGATAATC (4&;-7>>>>4;,>>9>1<&>;<-7><>>><;<>;< NM:i:1 MD:Z:28A0 MC:Z:36M AS:i:28 XS:i:29
ERR001713.5425457 163 6 83350484 0 36M = 83350583 128 TACCAAAACCAGGAATGGACATAACAAAAAAAGAAA <;><;<>>><>7.<89.<49;9>,87;>><5.8&5+ NM:i:1 MD:Z:15A20 MC:Z:7S29M AS:i:31 XS:i:31
$ samtools view bwa-mem2.bam | grep '^ERR001713.5425457'
ERR001713.5425457 83 13 102092425 0 8S28M = 102092325 -128 ACTAGAAAAATCCAATAGCATATCAAAAAGATAATC (4&;-7>>>>4;,>>9>1<&>;<-7><>>><;<>;< NM:i:0 MD:Z:28 AS:i:28 XS:i:29
ERR001713.5425457 163 13 102092325 0 36M = 102092425 128 TACCAAAACCAGGAATGGACATAACAAAAAAAGAAA <;><;<>>><>7.<89.<49;9>,87;>><5.8&5+ NM:i:1 MD:Z:15A20 AS:i:31 XS:i:31
That is, in the first read of "ERR001713.5425457", the soft clipping differs by one basepair.
Are these subtle differences in alignments between bwa and bwa-mem2 to be expected?
Hello!
Thank you for this new version! We have started testing it and we have found that in the alignment produced by bwa-mem2
CIGAR tags like MC:Z:101M
are missing comparing to the alignment produced by the original version.
Is this behaviour expected?
Is there a timeline for when this will be considered production ready?
A ~50% savings in time / money / energy is pretty significant for such a widely used tool.
Hi
I want to use bwa-mem2 in wheat genome (17G) .
bwa-mem2 was compiled from source, but the program had an segmentation fault when building index.
bwa-mem2 index -p CS_parts_bwamem2 161010_Chinese_Spring_v1.0_pseudomolecules_parts.fasta
[bwa_index] Pack FASTA... 85.80 sec
init ticks = 763897684114
ref seq len = 29094523130
binary seq ticks = 497034160160
build index ticks = 21694227971367
ref_seq_len = 29094523130
count = 0, 7837288578, 14547261565, 21257234552, 29094523130
BWT[12766107114] = 4
CP_SHIFT = 5, CP_MASK = 31
sizeof CP_OCC = 64
Segmentation fault
When segementation fault, I got 5 output files:
28G CS_parts_bwamem2.0123
12M CS_parts_bwamem2.amb
2.1K CS_parts_bwamem2.ann
55G CS_parts_bwamem2.bwt.8bit.32
3.4G CS_parts_bwamem2.pac
My system info : Linux debian 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64 GNU/Linux ; 512G memory
By the way, bwa-mem works well. The genome file : ftp://ftp.ensemblgenomes.org/pub/plants/release-43/fasta/triticum_aestivum/dna/
Hi,
I get an error when make the code,
g++ -c -g -O3 -fpermissive -DENABLE_PREFETCH -DPAIRED_END=1 -DMAINY=0 -DSAIS=1 -DDEB=0 -DRDT=0 -DMAXI=0 -DNEW=1 -DSORT_PAIRS=0 src/fastmap.cpp -o src/fastmap.o
In file included from src/bandedSWA.h:42:0,
from src/bwamem.h:39,
from src/fastmap.h:45,
from src/fastmap.cpp:39:
/usr/lib/gcc/x86_64-redhat-linux/4.8.5/include/smmintrin.h:31:3: error: #error "SSE4.1 instruction set not enabled"
^
make: *** [src/fastmap.o] Error 1
What is the problem?
Thank you!
Hi
First I thank you for for the job this is great keep going ;)
I write because I'm blocked with the index creation and the alignment part.
I checkout the source files and make with CXX=icpc multi
Then I start to build a reference form hg19.fa so i did
bwa-mem2 index hg19.fa
it generates the following files
hg19.amb
hg19.fa
hg19.fa.ann
hg19.fa.fai
hg19.fasta.map
hg19.rpac
hg19.ann
hg19.fa.0123
hg19.fa.bwt
hg19.fa.pac
hg19.pac
hg19.rsa
hg19.bwt
hg19.fa.amb
hg19.fa.bwt.8bit.32
hg19.fa.sa
hg19.rbwt
hg19.sa
with bwa-mem2.avx2
and finally I start the alignment with
bwa-mem -t 8 $REF $FASTQ1 $FASTQ2 > $SAM
and it used the bwa-mem2.avx2 version
but when I look the sam file all my reads are aligned in chr1 at position 1 and when I compare with the original bwa this is wrong...I think it's a matter of how the index is created but I'm not sure... Have you an idea when it comes from please?
I also tried the precompiled lib but no help...
I run on Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Bests
frederic
Hi,
I get a segmentation fault when building an index with commit 97a2db0, which I don't get with commit a9fd42d
Here's the assembly I was indexing, which now gives me a segmentation fault on commit 97a2db0
wget https://www.dropbox.com/s/jpozyb4avje01p8/abyss.fa.gz
gunzip abyss.fa.gz
./bwa-mem2 index -p abyss ../../abyss.fa
[bwa_index] Pack FASTA... Segmentation fault (core dumped)
The CPU architecture is Intel SSE4.1
I've been running some tests on a set of 30x WGS samples. I have been trying to run the sample on a c5d.9xlarge on AWS which has ~69 GB free RAM when I launch (~2 GB per core). So far, all of my attempts have failed to complete the mapping. The log doesn't contain any OOM errors but I have made the following observations:
I then ran Intel's Inspector tool on the executable to look for potential memory leaks. The tool identified a few potential issues:
I tried the naive thing of just checking if c->seed is not NULL and if it's not, calling
free(c->seed)
just before we set c->seed to the new auxSeedBuf buffer, right around line 398, but it looks like that's not quite right since I get an error message about free being called against an invalid pointer.
In case it is helpful, the results of the Intel Inspector run can be found in the tar file here:
https://dl.dnanex.us/F/D/Qbk7Jz49kB51828xbYgB9KQGXVx4gK71VGvzBvqv/bwa_intel_inspector_project.tar.gz
Hi Bwa-mem2 team,
first of all thank you guys for creating this project, it looks really promissing and exciting. Halving the runtime of a well known tools is no small task!
I ran some experiments on Bwa-mem2 vs bwa mem, and initially it looked really promissing with the diff between 2 files only beeing the expected difference in the header of the sam file. This was done on only a single human chromosome with reads known to come from this chromosome.
However when i alligned reads from a different file against the full human reference genome there was an apperent problem where every read was mapped to chr1:1, while the same reads mapped to the same reference genome with the original dont map there at all.
to reproduce:
Input file:
SRR7890883_1k.fastq.gz
code:
#Unzip input file
gunzip SRR7890883_1k.fastq.gz
# get latest release
curl -L https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.0pre1/bwa-mem2-2.0pre1_x64-linux.tar.bz2 | tar jxf -
#update path to add bwa-mem2
PATH="$PATH:$(pwd)/bwa-mem2-2.0pre1_x64-linux/"
# get reference genome
gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta .
# index file
bwa-mem2 index Homo_sapiens_assembly38.fasta
# align 1000 reads
bwa-mem2 mem -t 16 -o rerun2.sam Homo_sapiens_assembly38.fasta SRR7890883_1k.fastq
Please note:
ran on gcp (n1-highmem-16 (16 vCPUs, 104 GB memory) )
release: VN:2.0pre1
exit code: 1 (appears to be an issue in the release according to #2 )
SRR7890883_1k.fastq contains the first 1000 reads from another file
Output:
samtools view rerun2.sam | head
SRR7890883.2 0 chr1 1 60 51M199S * 0 0 GGGATGGCGGGCGCCGGGAGCGAAGCCCGGTTCGCCGGGCTGTCGCTGGTGNAGCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNNNNNNNNNNNNNNNNNGAGNTGANCCGAGCCCGCCTTAAGCCCAGATCGGCNNNNNNNNNNNNNNNTCTGNANNNNNNNNNNNNNNNNNNNNNNNNNCTNCAGCAGCTAGTTGAGCTGCACCAGCGACAGCCCGGCGAACCGGGCTTCGNNNNNN AAFFFKKKKKKFKKAFFAAFKAKKKKKKKAKFKKK7FAAFKKKKAKKKFKK#KKKK##################################F####################A7,#<A7#<FKKA(AAFFFKKKKKKKKKK,7,FKK###############KKFF#,#########################FF#KKKK7FKK,A,AFKKKKKAAKKFFKFKK<FKK,,,AA777AFFF(7A,,###### NM:i:32 MD:Z:0C2G0A0A0C0A1C0T1T2T0T4T0C0T1T0G0G1T0G0A0T1G0G0T1A2C0C0T1A0A0A1C1 AS:i:51 XS:i:0 SA:Z:chr1,1,+,210S34M6S,60,21;
SRR7890883.2 2048 chr1 1 60 210H34M6H * 0 0 CTGCACCAGCGACAGCCCGGCGAACCGGGCTTCG KKKAAKKFFKFKK<FKK,,,AA777AFFF(7A,, NM:i:21 MD:Z:1G1G1A4T0C0T0C0C0T0T0G0A2T0C0T0G0T3T0G0A0T1 AS:i:34 XS:i:23 SA:Z:chr1,1,+,51M199S,60,32;
SRR7890883.3 0 chr1 1 60 196S47M7S * 0 0 GCCATTCATATACTAAGCAACAAAACATCAGCAGGATGCGGAAGGTCCCGANAGTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCAANTGTNCATACAACAAAATAAGTCAGGTTAATTNNNNNNNNNNNNNNNTCCANNNNNNNNNNNNNNNNNNNNNNNNNNNTTNNTGAGATACAGGTTGATGGACGGATGGCTACATGGATGTGATGGAGAAGNNNNNN <<AFAFFKF<F,FKFAAFKKFKKKFFKKKAKK7AK7FKFKKKA,F,AFAAF#,<<K#######################################################KK,#<AK#FKKKA,AAFFAKKKKKKA<<FK7FFKK###############F,AF###########################FK##F7A77AFKKAAFAFKKKKFFAAFKK,,A,A7FFFFKFKK,,<A,A,,<###### NM:i:33 MD:Z:0C1G2A0C0A0G0C0T0C1C0C1T2G0C0T0C2T0G0G0G0T0G0A0T1G0G0T0G0A0C1C0C0T3 AS:i:47 XS:i:0
SA:Z:chr1,1,+,12S39M199S,60,31;
SRR7890883.3 2048 chr1 1 60 12H39M199H * 0 0 CTAAGCAACAAAACATCAGCAGGATGCGGAAGGTCCCGA FKFAAFKKFKKKFFKKKAKK7AK7FKFKKKA,F,AFAAF NM:i:31 MD:Z:1G0G0G0A0A0C1G0C0T0C0T1C1T0G0A0G0C0T0C0T0G0T0G2T0G0A0T0G0G0G0T2 AS:i:39 XS:i:28 SA:Z:chr1,1,+,196S47M7S,60,33;
SRR7890883.4 0 chr1 1 60 51M199S * 0 0 CACTAAAGTGTTCATGACTTGTTCTACATAAAACTAATTCAACCTGTATGANAGGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTCNACCNGTTTTGAGCAATGCAGCAAATCTAGCCNNNNNNNNNNNNNNNTCTANNNNNNNNNNNNNNNNNNNNNNNNNNNTANNCAAACAGGTAGAAAACATGGCCTGAAGGATTGTCTCTGCCACCACCTCNNNNNN AAFFFKFKKKKKKKKKAKKKF,FAAFKKKKKAFFKKKFKFAKF,7,F<AAK#KKK,#######################################################,,,#F<F#,,,7<,AAFFFFFKAFKKFKKKFKFAK###############,,<7###########################,7##,7A,<AKKKKFKFK<FK<FK,7AA,AAFKAKKK7FKAFKAAKAKKKFA###### NM:i:42 MD:Z:1G0G0G2C0A0G0C1C0T0C0C0T0T0G0A0G0C1C0T0G0T0G0G0G0T0G1T0G0G0G0T0G0A1T0C1T0G0A0A1G0C0G0 AS:i:51 XS:i:0
SA:Z:chr1,1,+,197S47M6S,60,37;
SRR7890883.4 2048 chr1 1 60 197H47M6H * 0 0 AAACAGGTAGAAAACATGGCCTGAAGGATTGTCTCTGCCACCACCTC 7A,<AKKKKFKFK<FK<FK,7AA,AAFKAKKK7FKAFKAAKAKKKFA NM:i:37 MD:Z:0C0G0G0G1A0C0A0G0C0T0C0T0C1T2A0G2C0T0G0T1G0G2A0T0G0G0G0T0G0A0C0T1C0T0G0A0A0 AS:i:47 XS:i:0 SA:Z:chr1,1,+,51M199S,60,42;
SRR7890883.5 0 chr1 1 60 51M199S * 0 0 GGTGACTGTCACAAAGCAATGCCTGTTGCCAAACCTGTGGCTCCTCCTTCTNCTTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNGNGTCNGATCGGGACCTCAACTGCAGGAAAGGNNNNNNNNNNNNNNNNGGACNNNNNNNNNNNNNNNNNNNNNNNNNNNAGNNGGAGGAGCCACAGGTTTGGCAACAGGCATTGCTTTGTGACAGTCACCANNNNNN ,A<A<F<FKKKKKKKKKKKKKKKKKFKKKKKKKKKF7KFAKKKKKKKKFKK#K<FK#######################################################7#,#FKK#K,F<AAAA<FAA7FAKKKKKKKF<AK################KFFK###########################KK##KKFKKFFFKKKKKKKKKKKKKKKKKKAKKKKKKKKKFKF7<AAAAKAF###### NM:i:37 MD:Z:0C1G2A0C0A0G1T1T0C0C0T0T0G1G0C0T4G1G0T0G1T0G0G0G0T0G0A0C0T0C1T0G0A0A0A0G1G0 AS:i:51 XS:i:0 SA:Z:chr1,1,+,196S48M6S,60,31;
SRR7890883.5 2048 chr1 1 60 196H48M6H * 0 0 GGAGGAGCCACAGGTTTGGCAACAGGCATTGCTTTGTGACAGTCACCA KKFKKFFFKKKKKKKKKKKKKKKKKKAKKKKKKKKKFKF7<AAAAKAF NM:i:31 MD:Z:0C1G1A1C0A0G0C0T0C0T0C0C3A0G0C0T1T1T0G0G0G2A1G0G5T0C0C0T0G0A0A1 AS:i:48 XS:i:0 SA:Z:chr1,1,+,51M199S,60,37;
SRR7890883.6 0 chr1 1 60 51M199S * 0 0 GCACAACTGACGACTGAGAGGACATGAACACACACGTGAATAAGTGCATGGNTTCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANCNCNCNTCACATACAAATAGGAAACCGTCTTGNNNNNNNNNNNNNNNNAGCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNCTCAGTCGTCAGTTGTGCAGATCGGAAGAGCGTCGTGTAGGGATAGAGNNNNNN AAAFFKKFKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK#FFKK#######################################################K#F#<#,#<FFF<AAAFFFKKKKKKKKKKKKKKK################AAFK############################F##KFKKKKKKKKKFKKKKKFFKFKKKKKFFAFFFKFFKFFAAAA<,<F,A###### NM:i:34 MD:Z:0C0G0G0G3A1C0T0C0T1C0T0T3C0T1T0G0T0G0G0G0T0G1T0G0G4C1C0C0T0G0A0A1G0C1 AS:i:51 XS:i:0
Output from same sequences aligned with bwa mem (original)
AS:i:54 XS:i:22 SA:Z:chr6,50729317,+,192S52M6S,60,2;
SRR7890883.1 2048 chr6 50729317 60 192H52M6H * 00 GANTTTTCAGTTTAGCAAAATAATGCTGGTAGGTTGGCGTGTCTTTGAAACT A,#F,,F<AA,7F,FAFF,A<7<A,FFK<,,A<AK,A,<FFF<<7AKKF7<F NM:i:2 MD:Z:2C26A22 AS:i:45 XS:i:0 SA:Z:chr6,50729331,-,194S56M,60,1;
SRR7890883.2 16 chr12 122896010 60 194S56M * 0 0NNNNNNCGAAGCCCGGTTCGCCGGGCTGTCGCTGGTGCAGCTCAACTAGCTGCTGNAGNNNNNNNNNNNNNNNNNNNNNNNNNTNCAGANNNNNNNNNNNNNNNGCCGATCTGGGCTTAAGGCGGGCTCGGNTCANCTCNNNNNNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGCTNCACCAGCGACAGCCCGGCGAACCGGGCTTCGCTCCCGGCGCCCGCCATCCC ######,,A7(FFFA777AA,,,KKF<KKFKFFKKAAKKKKKFA,A,KKF7KKKK#FF#########################,#FFKK###############KKF,7,KKKKKKKKKKFFFAA(AKKF<#7A<#,7A####################F##################################KKKK#KKFKKKAKKKKFAAF7KKKFKAKKKKKKKAKFAAFFAKKFKKKKKKFFFAA NM:i:1 MD:Z:4G51
AS:i:54 XS:i:0 SA:Z:chr12,122895994,+,192S52M6S,60,2;
SRR7890883.2 2048 chr12 122895994 60 192H52M6H * 00 CTNCAGCAGCTAGTTGAGCTGCACCAGCGACAGCCCGGCGAACCGGGCTTCG FF#KKKK7FKK,A,AFKKKKKAAKKFFKFKK<FKK,,,AA777AFFF(7A,, NM:i:2 MD:Z:2C8C40 AS:i:45 XS:i:0 SA:Z:chr12,122896010,-,194S56M,60,1;
SRR7890883.3 16 chr16 74377111 0 194S56M * 0 0NNNNNNCTTCTCCATCACATCCATGTAGCCATCCGTCCATCAACCTGTATCTCANNAANNNNNNNNNNNNNNNNNNNNNNNNNNNTGGANNNNNNNNNNNNNNNAATTAACCTGACTTATTTTGTTGTATGNACANTTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTACTNTCGGGACCTTCCGCATCCTGCTGATGTTTTGTTGCTTAGTATATGAATGGC ######<,,A,A<,,KKFKFFFF7A,A,,KKFAAFFKKKKFAFAAKKFA77A7F##KF###########################FA,F###############KKFF7KF<<AKKKKKKAFFAA,AKKKF#KA<#,KK#######################################################K<<,#FAAFA,F,AKKKFKF7KA7KKAKKKFFKKKFKKFAAFKF,F<FKFFAFA<< NM:i:2 MD:Z:4A39G11 AS:i:49 XS:i:49 SA:Z:chr16_KI270853v1_alt,541677,+,196S46M8S,0,0;
SRR7890883.3 2048 chr16_KI270853v1_alt 541677 0 196H46M8H *0 0 TGAGATACAGGTTGATGGACGGATGGCTACATGGATGTGATGGAGA F7A77AFKKAAFAFKKKKFFAAFKK,,A,A7FFFFKFKK,,<A,A, NM:i:0 MD:Z:46 AS:i:46 XS:i:46 SA:Z:chr16,74377111,-,194S56M,0,2;
SRR7890883.4 16 chr12 70521176 60 194S56M * 0 0NNNNNNGAGGTGGTGGCAGAGACAATCCTTCAGGCCATGTTTTCTACCTGTTTGNNTANNNNNNNNNNNNNNNNNNNNNNNNNNNTAGANNNNNNNNNNNNNNNGGCTAGATTTGCTGCATTGCTCAAAACNGGTNGAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTCCTNTCATACAGGTTGAATTAGTTTTATGTAGAACAAGTCATGAACACTTTAGTG ######AFKKKAKAAKFAKF7KKKAKFAA,AA7,KF<KF<KFKFKKKKA<,A7,##7,###########################7<,,###############KAFKFKKKFKKFAKFFFFFAA,<7,,,#F<F#,,,#######################################################,KKK#KAA<F,7,FKAFKFKKKFFAKKKKKFAAF,FKKKAKKKKKKKKKFKFFFAA NM:i:1 MD:Z:4G51
AS:i:54 XS:i:0 SA:Z:chr12,70521109,+,197S47M6S,60,0;
SRR7890883.4 2048 chr12 70521109 60 197H47M6H * 00 AAACAGGTAGAAAACATGGCCTGAAGGATTGTCTCTGCCACCACCTC 7A,<AKKKKFKFK<FK<FK,7AA,AAFKAKKK7FKAFKAAKAKKKFA NM:i:0 MD:Z:47 AS:i:47 XS:i:0 SA:Z:chr12,70521176,-,194S56M,60,1;
SRR7890883.5 0 chr12 2963243 60 56M194S * 0 0 GGTGACTGTCACAAAGCAATGCCTGTTGCCAAACCTGTGGCTCCTCCTTCTNCTTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNGNGTCNGATCGGGACCTCAACTGCAGGAAAGGNNNNNNNNNNNNNNNNGGACNNNNNNNNNNNNNNNNNNNNNNNNNNNAGNNGGAGGAGCCACAGGTTTGGCAACAGGCATTGCTTTGTGACAGTCACCANNNNNN ,A<A<F<FKKKKKKKKKKKKKKKKKFKKKKKKKKKF7KFAKKKKKKKKFKK#K<FK#######################################################7#,#FKK#K,F<AAAA<FAA7FAKKKKKKKF<AK################KFFK###########################KK##KKFKKFFFKKKKKKKKKKKKKKKKKKAKKKKKKKKKFKF7<AAAAKAF###### NM:i:1 MD:Z:51C4
AS:i:54 XS:i:0 SA:Z:chr12,2963243,-,7S47M196S,60,0;
SRR7890883.5 2064 chr12 2963243 60 7H47M196H * 0 0GGTGACTGTCACAAAGCAATGCCTGTTGCCAAACCTGTGGCTCCTCC AKAAAA<7FKFKKKKKKKKKAKKKKKKKKKKKKKKKKKKFFFKKFKK NM:i:0 MD:Z:47 AS:i:47 XS:i:0 SA:Z:chr12,2963243,+,56M194S,60,1;```
Move FMI related parts from bwtbuild to FMI_Search
Hi,
We compiled the latest commit of bwa-mem2 with g++ (our icpc does not work somehow):
make -j20 multi ARCH_FLAGS=-mavx2
and used it in bwa-meth pipeline which pipes C-to-T converted fastq on the fly into bwa.
Note the index has been built with the same bwa-mem2 on a c-to-T converted genome "v37_decoy_plus_phage.fasta.bwameth.c2t"
We got an assertion failure which (by reading the readme) might due to memory allocation problem.
The error log is:
Ref file: /gpfs/genomedb/b37/b37_bwamem2/v37_decoy_plus_phage.fasta.bwameth.c2t
Entering FMI_search
reference seq len = 12550012029
count
0, 1
1, 4851497105
2, 6275006015
3, 7698514925
4, 12550012029
Reading other elements of the index from files /gpfs/genomedb/b37/b37_bwamem2/v37_decoy_plus_phage.fasta.bwameth.c2t
prefix: /gpfs/genomedb/b37/b37_bwamem2/v37_decoy_plus_phage.fasta.bwameth.c2t
[M::bwa_idx_load_ele] read 0 ALT contigs
Done reading Index!!
Reading reference genome..
Binary seq file = /gpfs/genomedb/b37/b37_bwamem2/v37_decoy_plus_phage.fasta.bwameth.c2t.0123
Reference genome size: 12550012028 bp
Done readng reference genome !!
readLen: 97
[0000] 1: Calling process()
No. of pipeline threads: 2
[0000] read_chunk: 160000000, work_chunk_size: 160000216, nseq: 1357552
[0000][ M::kt_pipeline] read 1357552 sequences (160000216 bp)...
[0000] 2. Calling mem_process_seqs.., task: 0
[M::kt_pipeline] 0 single-end sequences; 1357552 paired-end sequences.....
[0000] 3. Calling kt_for - worker_bwt
bwa-mem2: src/bwamem.cpp:662: smem_struct *mem_collect_smem(const mem_opt_t *, const bseq1_t *, int, smem_struct *, int *, short *, unsigned char *, int *, long &): Assertion `query_pos_ar[pos] < len' failed.
///
The readme file states that one could increase the values of corresponding macros but we would like to know exactly where and how to tweak...
Thanks a lot in ahead.
Best,
Yi
Hi,
I tried with pre-compiled and compiled (hg19 and hg38). The same output:
bwa-mem2/bwa-mem2 index Homo_sapiens_assembly19.fasta
[bwa_index] Pack FASTA... 13.35 sec
init ticks = 204490443602
ref seq len = 6203953124
binary seq ticks = 117021996235
build index ticks = 6870572410362
ref_seq_len = 6203953124
count = 0, 1811627992, 3101976562, 4392325132, 6203953124
Segmentation fault (core dumped)
Is this being worked on? Where can I get the index files for the reference genome suitable for bwa-mem2.
Thanks.
I just gave bwa-mem2 (binaries; building from source doesn't change the effect) a quick try, using paired reads in two separate files, and I got the following error:
[0000] 1: Calling process()
Threads used (compute): 1
Projected #read in a task: 79375
------------------------------------------
Memory pre-allocation for chaining: 2544445000
Memory pre-allocation for BSW: 239617024
Memory pre-allocation for BWT: 51611776
------------------------------------------
No. of pipeline threads: 2
bwa-mem2: src/fastmap.cpp:253: ktp_data_t *kt_pipeline(void *, int, void *, mem_opt_t *, worker_t &): Assertion `ret->n_seqs <= nreads' failed.
Aborted (core dumped)
If I run the same command with only the left or right set of reads there isn't an issue. I think an important point here is the way the headers of the fastq encode the left and right read. The read set that I'm working that causes this error arises from mapping followed by subsetting of the bam to retain only mapped reads, and then output with samtools fastq with the -N flag (so left and right are distinguished by the /1 and /2 suffix). If I use the original reads that retain the @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>
format then bwa-mem2 seems to work fine. Note that both read sets work with bwa-mem so I assume the error is just arising as a result of the pairing information? Please let me know if additional information would be useful.
nohup ./bwa-mem2
the sam file cannot be recognized when I use samtools to convert to bam file
[W::sam_read1] Parse error at line 1 [main_samview] truncated file.
I'm trying out the pre-compiled bwa-mem2 binaries on an AWS r5d.12xlarge (384 GB of RAM, 900 GB storage, 48 cores) and keep exiting with a non-zero exit code while it is writing out the .2bit.64 file. No error message is presented, so I'm not sure exactly what the underlying problem is. It is reproducible and deterministic though. Running on the same reference genome two times in row will result in the same failure with the output files in the exact same state, as determined by running md5sum on all of the output files upon the program exiting:
> md5sum hs37d5.fa.*
4a509f50e9ca3027fdb211af51e02bd2 hs37d5.fa.0123
c5287824ea836aed6e34dd2400f83643 hs37d5.fa.amb
b9445ac9206ed3710dce7c6bffcf54a6 hs37d5.fa.ann
c2bb6afb52e2efff51288d33f2b5bcd2 hs37d5.fa.bwt.2bit.64
191da2e42cf9d6f9b7553a9d3cbbdbef hs37d5.fa.bwt.8bit.32
cbc647939a64ce1e5600f17e499db4f1 hs37d5.fa.pac
My command is simply
./bwa-mem2 index hs37d5.fa
Do you have any suggestions on what I might try?
Dear authors,
I just used the latest commit (Dec 12, 2019), as gcc compiled version, for our GATK pipeline and it finally runs stable and produced comparable results to our bwa-mem pipeline. I have tried the previous versions with very issues posted here. Would it be possible to offer the latest commit as ICPC compiled binaries to optimize the speed?
Cheers,
Thomas
Hi,
I am using the precompiled version 2.0pre2 and built the index using bwa-mem2 index
for Grch38. I am now trying to align but get the error:
-----------------------------
Executing in AVX512 mode!!
-----------------------------
Ref file: /reference/GRCh38_full_analysis_set_plus_decoy_hla.fa
Entering FMI_search
bwa-mem2: src/FMI_search.cpp:56: FMI_search::FMI_search(char *): Assertion `reference_seq_len > 0' failed.
The command looks like:
docker run \
-v /media/maxh/fast_storage/1x_reads:/data \
-v /media/maxh/slow_storage/human_data/reference_genome/mem2_index:/reference \
cannabinoid/bwa-mem2:0.1 bwa-mem2 mem \
/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa \
/data/ERR3239279_R1_1x.fastq.gz \
/data/ERR3239279_R2_1x.fastq.gz
It is detecting the indices because if I run with a malformed reference path it gives a different error.
I am going to try and rebuild the indices, will update when this is complete.
src/bwtbuild.cpp: In function ‘int build_index(const char*)’:
src/bwtbuild.cpp:710:37: error: expected ‘)’ before ‘PRIu64’
710 | fprintf(stderr, "init ticks = %" PRIu64 "\n", __rdtsc() - startTick);
| ~ ^~~~~~~
| )
src/bwtbuild.cpp:742:43: error: expected ‘)’ before ‘PRIu64’
742 | fprintf(stderr, "binary seq ticks = %" PRIu64 "\n", __rdtsc() - startTick);
| ~ ^~~~~~~
| )
src/bwtbuild.cpp:752:44: error: expected ‘)’ before ‘PRIu64’
752 | fprintf(stderr, "build index ticks = %" PRIu64 "\n", __rdtsc() - startTick);
| ~ ^~~~~~~
| )
make: *** [Makefile:74: src/bwtbuild.o] Error 1
Add the __STDC_FORMAT_MACROS
before inttypes.h
in src/bwtbuild.cpp
fixed the issue:
#define __STDC_FORMAT_MACROS 1
#include <inttypes.h>
bwa-mem2 reproducibly returns the status code 1, which causes scripts and Makefiles to assume failure.
To reproduce:
wget http://ftp.ensembl.org/pub/current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
bwa-mem2 index Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
echo $?
# 1
Same when running on a gunzip
-ped FASTA file instead.
Or, using human (sorry):
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
bwa-mem2 index hs37d5.fa.gz
echo $?
# 1
The same is true when invoking bwa-mem2 mem
.
bwa-mem2 version
# 2.0pre1
uname -srvio
# Linux 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019 x86_64 GNU/Linux
lsb_release -a
# No LSB modules are available.
# Distributor ID: Ubuntu
# Description: Ubuntu 18.04.2 LTS
# Release: 18.04
# Codename: bionic
Hi, if I just run make
on my amd3900x, there are some warnings and an error message (log shown at the end). make arch=avx2
ran smoothly.
Warning and error messages:
src/bwamem.cpp: In function ‘void mem_chain2aln_across_reads_V2(const mem_opt_t*, const bntseq_t*, const uint8_t*, bseq1_t*, int, mem_chain_v*, mem_alnreg_v*, mem_cache*, int64_t, int64_t, int64_t, int)’:
src/bwamem.cpp:2016:37: warning: invalid conversion from ‘int’ to ‘_mm_hint’ [-fpermissive]
2016 | _mm_prefetch((const char*) query, 0);
| ^
| |
| int
In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/emmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/pmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/tmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/smmintrin.h:32,
from src/bandedSWA.h:42,
from src/bwamem.h:39,
from src/bwamem.cpp:31:
/usr/lib/gcc/x86_64-redhat-linux/9/include/xmmintrin.h:52:46: note: initializing argument 2 of ‘void _mm_prefetch(const void*, _mm_hint)’
52 | _mm_prefetch (const void __P, enum _mm_hint __I)
| ~~~~~~~~~~~~~~^~~
src/bwamem.cpp:2034:51: warning: invalid conversion from ‘int’ to ‘_mm_hint’ [-fpermissive]
2034 | _mm_prefetch((const char) (srtg + spos + 64), 0);
| ^
| |
| int
In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/emmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/pmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/tmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/smmintrin.h:32,
from src/bandedSWA.h:42,
from src/bwamem.h:39,
from src/bwamem.cpp:31:
/usr/lib/gcc/x86_64-redhat-linux/9/include/xmmintrin.h:52:46: note: initializing argument 2 of ‘void _mm_prefetch(const void*, _mm_hint)’
52 | _mm_prefetch (const void __P, enum _mm_hint __I)
| ~~~~~~~~~~~~~~^~~
src/bwamem.cpp:2035:40: warning: invalid conversion from ‘int’ to ‘_mm_hint’ [-fpermissive]
2035 | _mm_prefetch((const char) (lim_g), 0);
| ^
| |
| int
In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/emmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/pmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/tmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/smmintrin.h:32,
from src/bandedSWA.h:42,
from src/bwamem.h:39,
from src/bwamem.cpp:31:
/usr/lib/gcc/x86_64-redhat-linux/9/include/xmmintrin.h:52:46: note: initializing argument 2 of ‘void _mm_prefetch(const void*, _mm_hint)’
52 | _mm_prefetch (const void __P, enum _mm_hint __I)
| ~~~~~~~~~~~~~~^~~
src/bwamem.cpp:2072:37: warning: invalid conversion from ‘int’ to ‘_mm_hint’ [-fpermissive]
2072 | _mm_prefetch((const char) rseq, 0);
| ^
| |
| int
In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/emmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/pmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/tmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/smmintrin.h:32,
from src/bandedSWA.h:42,
from src/bwamem.h:39,
from src/bwamem.cpp:31:
/usr/lib/gcc/x86_64-redhat-linux/9/include/xmmintrin.h:52:46: note: initializing argument 2 of ‘void _mm_prefetch(const void*, _mm_hint)’
52 | _mm_prefetch (const void __P, enum _mm_hint __I)
| ~~~~~~~~~~~~~~^~~
g++ -c -g -O3 -fpermissive -msse4.1 -DENABLE_PREFETCH -DPAIRED_END=1 -DMAINY=0 -DSAIS=1 -DDEB=0 -DRDT=0 -DMAXI=0 -DNEW=1 -DSORT_PAIRS=0 src/profiling.cpp -o src/profiling.o
g++ -c -g -O3 -fpermissive -msse4.1 -DENABLE_PREFETCH -DPAIRED_END=1 -DMAINY=0 -DSAIS=1 -DDEB=0 -DRDT=0 -DMAXI=0 -DNEW=1 -DSORT_PAIRS=0 src/bandedSWA.cpp -o src/bandedSWA.o
src/bandedSWA.cpp: In member function ‘void BandedPairWiseSW::smithWatermanBatchWrapper16(SeqPair, uint8_t*, uint8_t*, int32_t, uint16_t, int8_t)’:
src/bandedSWA.cpp:3755:63: warning: invalid conversion from ‘int’ to ‘_mm_hint’ [-fpermissive]
3755 | _mm_prefetch((const char*) seqBufRef + (int64_t)spf.idr, 0);
| ^
| |
| int
In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/emmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/pmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/tmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/smmintrin.h:32,
from src/bandedSWA.h:42,
from src/bandedSWA.cpp:30:
/usr/lib/gcc/x86_64-redhat-linux/9/include/xmmintrin.h:52:46: note: initializing argument 2 of ‘void _mm_prefetch(const void*, _mm_hint)’
52 | _mm_prefetch (const void __P, enum _mm_hint __I)
| ~~~~~~~~~~~~~~^~~
src/bandedSWA.cpp:3756:68: warning: invalid conversion from ‘int’ to ‘_mm_hint’ [-fpermissive]
3756 | _mm_prefetch((const char) seqBufRef + (int64_t)spf.idr + 64, 0);
| ^
| |
| int
In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/emmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/pmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/tmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/smmintrin.h:32,
from src/bandedSWA.h:42,
from src/bandedSWA.cpp:30:
/usr/lib/gcc/x86_64-redhat-linux/9/include/xmmintrin.h:52:46: note: initializing argument 2 of ‘void _mm_prefetch(const void*, _mm_hint)’
52 | _mm_prefetch (const void __P, enum _mm_hint __I)
| ~~~~~~~~~~~~~~^~~
src/bandedSWA.cpp:3799:63: warning: invalid conversion from ‘int’ to ‘_mm_hint’ [-fpermissive]
3799 | _mm_prefetch((const char) seqBufQer + (int64_t)spf.idq, 0);
| ^
| |
| int
In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/emmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/pmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/tmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/smmintrin.h:32,
from src/bandedSWA.h:42,
from src/bandedSWA.cpp:30:
/usr/lib/gcc/x86_64-redhat-linux/9/include/xmmintrin.h:52:46: note: initializing argument 2 of ‘void _mm_prefetch(const void*, _mm_hint)’
52 | _mm_prefetch (const void __P, enum _mm_hint __I)
| ~~~~~~~~~~~~~~^~~
src/bandedSWA.cpp:3800:68: warning: invalid conversion from ‘int’ to ‘_mm_hint’ [-fpermissive]
3800 | _mm_prefetch((const char) seqBufQer + (int64_t)spf.idq + 64, 0);
| ^
| |
| int
In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/emmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/pmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/tmmintrin.h:31,
from /usr/lib/gcc/x86_64-redhat-linux/9/include/smmintrin.h:32,
from src/bandedSWA.h:42,
from src/bandedSWA.cpp:30:
/usr/lib/gcc/x86_64-redhat-linux/9/include/xmmintrin.h:52:46: note: initializing argument 2 of ‘void _mm_prefetch(const void*, _mm_hint)’
52 | _mm_prefetch (const void *__P, enum _mm_hint __I)
| ~~~~~~~~~~~~~~^~~
g++ -c -g -O3 -fpermissive -msse4.1 -DENABLE_PREFETCH -DPAIRED_END=1 -DMAINY=0 -DSAIS=1 -DDEB=0 -DRDT=0 -DMAXI=0 -DNEW=1 -DSORT_PAIRS=0 src/FMI_search.cpp -o src/FMI_search.o
src/FMI_search.cpp: In destructor ‘FMI_search::~FMI_search()’:
src/FMI_search.cpp:129:14: error: ‘c_bcast_array’ was not declared in this scope
129 | _mm_free(c_bcast_array);
| ^~~~~~~~~~~~~
make: *** [Makefile:75: src/FMI_search.o] Error 1
I wanted to test speed improvements of BWA-MEM2 over BWA-MEM and run the same WGS 30x sample on AWS instance c5.9xlarge (36CPUS, 72GB):
bwa mem -R '@RG\tID:1\tLB:FDA_GARVAN\tPL:Illumina-HiSeq-XTen\tSM:NA12878' -t 36 human_g1k_v37_decoy.fasta NA12878-Garvan-Vial1_R1.fastq.gz NA12878-Garvan-Vial1_R2.fastq.gz > NA12878-Garvan-Vial1_R.sam
bwa-mem2 mem -R '@RG\tID:1\tLB:FDA_GARVAN\tPL:Illumina-HiSeq-XTen\tSM:NA12878' -t 36 human_g1k_v37_decoy.fasta NA12878-Garvan-Vial1_R1.fastq.gz NA12878-Garvan-Vial1_R2.fastq.gz > NA12878-Garvan-Vial1_R.sam
BWA-MEM finished in 4 hours, 12 minutes and BWA-MEM2 in 3 hours, 40 minutes.
Is that expected?
BWA-MEM2 log is attached..
Hi,
I was wondering how one might go about using bwa-mem2 as a library in the same way you can in the original bwa;
https://github.com/lh3/bwa/blob/master/example.c
or is there now another way to interface our own applications with bwa-mem2?
I've had some success creating an archive of objects and linking to a basic program however I'm struggling to work out how to load an index and create a trivial example similar to the example in bwa.
$ ar -csr libbwa.a utils.o kthread.o kstring.o ksw.o bwa.o bwamem.o bwamem_pair.o bwamem_extra.o bwtbuild.o bwtindex.o bntseq.o
$ cat makefile
CC = g++ -std=gnu++14
LIB = -lpthread -lz -lm -llzma -lbz2 -lrt
INC = -I./bwa-mem2/src
LD = bwa-mem2/src/libbwa.a
all : hello
hello : hello.cpp
$(CC) -g -O3 -fpermissive $^ $(LD) $(INC) $(LIB) -o $@
clean:
rm hello
A trivial example that builds the index, however I can't then access functions such as bwa_idx_load_bwt
because they are not being compiled
Line 348 in b1e436c
$ cat hello.cpp
#include <stdio.h>
#include <zlib.h>
#include <string.h>
#include <errno.h>
#include <assert.h>
#include "bwa.h"
#include "bwamem.h"
#include "kseq.h" // for the FASTA/Q parser
KSEQ_DECLARE(gzFile)
int main(int argc, char *argv[])
{
//bwaidx_t *idx;
gzFile fp;
kseq_t *ks;
mem_opt_t *opt;
if (argc < 2) {
fprintf(stderr, "Usage: hello <idx.base>\n");
return 1;
}
bwa_idx_build(argv[1], argv[1]);
free(opt);
kseq_destroy(ks);
gzclose(fp);
return 0;
}
I realise that I'm probably attempting to access the bwa-mem2 API incorrectly to build the index, I'm just not clear how this should be done. Any help would be greatly appreciated.
Setting:
Command:
⟫ bwa-mem2 mem -t 8 genome/hs38DH.fa /home/dnanexus/in/reads_fastqgzs/0/ERR3242459_1.fastq.gz /home/dnanexus/in/reads2_fastqgzs/0/ERR3242459_2.fastq.gz -K 100000000 -Y -R '@RG\tID:ERR3242459_1\tPL:ILLUMINA\tPU:ERR3242459_1\tLB:ERR3242459\tSM:ERR3242459' > /dev/null
Error message:
...
[M::mem_pestat] low and high boundaries for proper pairs: (1, 9676)
[M::mem_pestat] skip orientation RR as there are not enough pairs
[0000] 10. Calling kt_for - worker_sam
*** Error in `bwa-mem2': malloc(): memory corruption: 0x00007fa911bc31b0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fb835a857e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8213e)[0x7fb835a9013e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7fb835a92184]
bwa-mem2[0x4720d0]
bwa-mem2[0x44d083]
bwa-mem2[0x4458b5]
bwa-mem2[0x42658c]
bwa-mem2[0x46a631]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fb83689a6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fb835b1541d]
======= Memory map: ========
00400000-0049d000 r-xp 00000000 00:2a 75544 /bwa-mem2/bwa-mem2.avx512bw
0069c000-006b0000 rw-p 0009c000 00:2a 75544 /bwa-mem2/bwa-mem2.avx512bw
006b0000-006d1000 rw-p 00000000 00:00 0
01f8c000-04448000 rw-p 00000000 00:00 0 [heap]
7fa878000000-7fa87bffc000 rw-p 00000000 00:00 0
7fa87bffc000-7fa87c000000 ---p 00000000 00:00 0
...
Below are the files we used:
Reference genome (hs38DH.fa as a tar file, built after 6743183): https://dl.dnanex.us/F/D/19XXQfV51qBYpP4jgY46YzpX7ZB6GQYv78Zg6B4x/hs38DH.bwa-index.avx512.v2.tar.gz
ERR3242459_1.fastq.gz (11.34 GiB): https://dl.dnanex.us/F/D/bBp5FZBgk9q8BpvFk1KXk06XGgbXGpfq9GQ1jBxY/ERR3242459_1.fastq.gz
ERR3242459_2.fastq.gz (11.92 GiB): https://dl.dnanex.us/F/D/8Bjk0vVV9k6f7v7Bx4X25JYZ2P3493P348yv7Xkv/ERR3242459_2.fastq.gz
I'm attempting to align a single lane of Novaseq-sequenced COLO829 paired-end using the pre-built binaries (https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.0pre1/bwa-mem2-2.0pre1_x64-linux.tar.bz2
per the README). I indexed the reference genome using bwa-mem2 index
with no arguments and am running under a GCP instance with 408GB RAM. I've tried with various values for the -t
parameter with the same results.
$ bwa-mem2 mem -o colo.bam -t 8 reference_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta R1_001.fastq R2_001.fastq
-----------------------------
Executing in AVX512 mode!!
-----------------------------
Ref file: reference_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta
Entering FMI_search
reference seq len = 6191387963
count
0, 1
1, 1808064624
2, 3095693982
3, 4383323340
4, 6191387963
Reading other elements of the index from files reference_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta
prefix: reference_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta
[M::bwa_idx_load_ele] read 0 ALT contigs
Done reading Index!!
Reading reference genome..
Binary seq file = reference_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta.0123
Reference genome size: 6191387962 bp
Done readng reference genome !!
readLen: 151
[0000] 1: Calling process()
Threads used (compute): 8
Projected #read in a task: 529811
------------------------------------------
Memory pre-allocation for chaining: 16983621416
Memory pre-allocation for BSW: 1916936192
Memory pre-allocation for BWT: 494814208
------------------------------------------
No. of pipeline threads: 2
[0000] read_chunk: 80000000, work_chunk_size: 80000102, nseq: 529802
[0000][ M::kt_pipeline] read 529802 sequences (80000102 bp)...
[0000] 2. Calling mem_process_seqs.., task: 0
[0000] 3. Calling kt_for - worker_bwt
bwa-mem2: src/bwamem.cpp:394: int test_and_merge(const mem_opt_t *, long, mem_chain_t *, const abc *, int, abc *, long *, long): Assertion `((*auxSeedBufCount) + c->m) <= auxSeedBufSize' failed.
Hi I tried to align reads to a genome with 2 small fragments but when I start bwa-mem2 I get:
Reference genome size: 526 bp
Done readng reference genome !!
readLen: 69
[0000] 1: Calling process()
No. of pipeline threads: 2
bwa-mem2: src/fastmap.cpp:253: ktp_data_t *kt_pipeline(void *, int, void *, mem_opt_t *, worker_t &): Assertion `ret->n_seqs <= nreads' failed.
e.g.
~> bwa-mem2 index
Usage: bwa index [-p prefix] <in.fasta>
Hi,
First of all, thank you for creating bwa-mem2. It is a very cool tool.
I met a problem when I tried to execute in AVX2 mode.
make arch=avxw
./bwa-mem2 mem hg38.fa XXX.fastq
Executing in AVX2 mode!
Ref file:.....
Entering FMI_search
reference seq_len =
segmentation fault (core dumped)
I'm specifying read group information at the command line and am noticing that in the SAM header output, there is no new line inserted between my read group information line and the following @pg line. Looking at the code, I see that there appears to be a #if statement that would lead to a path of code that would include the newline and another path that results in no newline:
Lines 543 to 553 in 30a633d
Was there a reason for omitting the newline in the second path?
I have a one question on splitted reference genome.
I firstly mapped PE-WG sequence files to whole genome. Its output is about 80GB.
and then i splitted reference genome into only chr22 (the shortest chromosome) and mapped files to splitted genome.
But Its output is about 80GB at the same above.
Do i need to process something?
Thanks.
Is is possible to add the -5
option (for split alignment, take the alignment with the smallest coordinate as primary) from the original bwa mem?
Hello,
I am encountering this issue
./bwa-mem2-2.0pre1_x64-linux/bwa-mem2 index coli.fa
./bwa-mem2-2.0pre1_x64-linux/bwa-mem2 mem 2 -p -t 8 coli.fa paired.fq
-----------------------------
Executing in AVX2 mode!!
-----------------------------
Ref file: 2
Entering FMI_search
ERROR! Unable to open the file: 2.bwt.8bit.32
The 2.bwt.8bit.32 file is there and not empty.
ls coli.fa*
coli.fa coli.fa.0123 coli.fa.amb coli.fa.ann coli.fa.bwt.2bit.64 coli.fa.bwt.8bit.32 coli.fa.pac
I am running Ubuntu 18.04 and my processor is
Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
thank you
Hi,
I get the Error message:
readLen: 120
[0000] 1: Calling process()
No. of pipeline threads: 2
[0000] read_chunk: 320000000, work_chunk_size: 320000191, nseq: 2320440
[0000][ M::kt_pipeline] read 2320440 sequences (320000191 bp)...
[0000] 2. Calling mem_process_seqs.., task: 0
[0000] 3. Calling kt_for - worker_bwt
bwa-mem2: src/bwamem.cpp:662: SMEM* mem_collect_smem(const mem_opt_t*, const bseq1_t*, int, SMEM*, int32_t*, int16_t*, uint8_t*, int32_t*, int64_t&): Assertion `query_pos_ar[pos] < len' failed.
nothing to output but the core file ~130G (core dumped). How can I fix this error?
So, this is similar to #10, but a different error while mapping. Hopefully I have provided enough information below to make the error reproducible. I mapped the same reads to a different reference without bwa-mem2
stopping/failing during mapping as it does below.
# Get reads
wget https://www.dropbox.com/s/vfn16yck7xohcnq/chicago-lib_001_R1.fq.gz
wget https://www.dropbox.com/s/1lwgr3sfbbyokcq/chicago-lib_002_R1.fq.gz
wget https://www.dropbox.com/s/e6lmcak6clhackc/chicago-lib_003_R1.fq.gz
wget https://www.dropbox.com/s/0o6jvk66uh6oe1x/hi-c-lib_003_R1.fq.gz
wget https://www.dropbox.com/s/h8oa2lryrpufk88/hi-c-lib_002_R1.fq.gz
wget https://www.dropbox.com/s/5c62ujxx5puad1q/hi-c-lib_001_R1.fq.gz
# Concatenate reads
cat *.fq.gz > chicago-hi-c_1.fastq.gz
# Get reference sequence
wget https://www.dropbox.com/s/le19d5bpfhahplh/dromedary.flye-corr.fa.gz
# Unzip reference
gunzip dromedary.flye-corr.fa.gz
# Index reference
bwa-mem2 index -p test dromedary.flye-corr.fa
# map reads
bwa-mem2 mem -t 75 test fastq/chicago-hi-c_1.fastq.gz 2> test.log |sambamba view -q -t 75 -S -f bam /dev/stdin > test_1.bam &
# standard error
-----------------------------
Executing in SSE4.1 mode!!
-----------------------------
Ref file: test
Entering FMI_search
reference seq len = 4025262461
count
0, 1
1, 1178846653
2, 2012631231
3, 2846415809
4, 4025262461
Reading other elements of the index from files test
prefix: test
[M::bwa_idx_load_ele] read 0 ALT contigs
Done reading Index!!
Reading reference genome..
Binary seq file = test.0123
Reference genome size: 4025262460 bp
Done readng reference genome !!
[0000] 1: Calling process()
Threads used (compute): 75
Info: projected #read in a task: 4966897
------------------------------------------
Memory pre-allocation for chaining: 10450.3513 MB
Memory pre-allocation for BSW: 17971.2768 MB
Memory pre-allocation for BWT: 5798.5632 MB
------------------------------------------
No. of pipeline threads: 2
[0000] read_chunk: 750000000, work_chunk_size: 750000088, nseq: 4966888
[0000][ M::kt_pipeline] read 4966888 sequences (750000088 bp)...
[0000] 2. Calling mem_process_seqs.., task: 0
[0000] 3. Calling kt_for - worker_bwt
[0000] 3. Calling kt_for - worker_aln
[0000] 10. Calling kt_for - worker_sam
[0000][ M::mem_process_seqs] Processed 4966888 reads in 2968.276 CPU sec, 166.577 real sec
[0000] read_chunk: 750000000, work_chunk_size: 750000088, nseq: 4966888
[0000][ M::kt_pipeline] read 4966888 sequences (750000088 bp)...
[0000] 2. Calling mem_process_seqs.., task: 1
[0000] 3. Calling kt_for - worker_bwt
[0000] 3. Calling kt_for - worker_aln
[0000] 10. Calling kt_for - worker_sam
[0000][ M::mem_process_seqs] Processed 4966888 reads in 2904.072 CPU sec, 87.171 real sec
[0000] read_chunk: 750000000, work_chunk_size: 750000088, nseq: 4966888
[0000][ M::kt_pipeline] read 4966888 sequences (750000088 bp)...
[0000] 2. Calling mem_process_seqs.., task: 2
[0000] 3. Calling kt_for - worker_bwt
[0000] 3. Calling kt_for - worker_aln
[0000] read_chunk: 750000000, work_chunk_size: 750000088, nseq: 4966888
[0000][ M::kt_pipeline] read 4966888 sequences (750000088 bp)...
[0000] 10. Calling kt_for - worker_sam
[0000][ M::mem_process_seqs] Processed 4966888 reads in 3518.648 CPU sec, 65.854 real sec
[0000] 2. Calling mem_process_seqs.., task: 3
[0000] 3. Calling kt_for - worker_bwt
[0000] 3. Calling kt_for - worker_aln
[0000] 10. Calling kt_for - worker_sam
[0000][ M::mem_process_seqs] Processed 4966888 reads in 3089.436 CPU sec, 74.955 real sec
[0000] read_chunk: 750000000, work_chunk_size: 750000088, nseq: 4966888
[0000][ M::kt_pipeline] read 4966888 sequences (750000088 bp)...
[0000] 2. Calling mem_process_seqs.., task: 4
[0000] 3. Calling kt_for - worker_bwt
[0000] 3. Calling kt_for - worker_aln
[0000] 10. Calling kt_for - worker_sam
[0000][ M::mem_process_seqs] Processed 4966888 reads in 3153.928 CPU sec, 105.077 real sec
[0000] read_chunk: 750000000, work_chunk_size: 750000088, nseq: 4966888
[0000][ M::kt_pipeline] read 4966888 sequences (750000088 bp)...
[0000] 2. Calling mem_process_seqs.., task: 5
[0000] 3. Calling kt_for - worker_bwt
[0000] 3. Calling kt_for - worker_aln
[0000] 10. Calling kt_for - worker_sam
[0000][ M::mem_process_seqs] Processed 4966888 reads in 3100.744 CPU sec, 64.002 real sec
[0000] read_chunk: 750000000, work_chunk_size: 750000088, nseq: 4966888
[0000][ M::kt_pipeline] read 4966888 sequences (750000088 bp)...
[0000] 2. Calling mem_process_seqs.., task: 6
[0000] 3. Calling kt_for - worker_bwt
[0000] 3. Calling kt_for - worker_aln
bwa-mem2: src/bwamem.cpp:2170: void mem_chain2aln_across_reads_V2(const mem_opt_t*, const bntseq_t*, const uint8_t*, bseq1_t*, int, mem_chain_v*, mem_alnreg_v*, mem_cache*, int64_t, int64_t, int64_t, int): Assertion `numPairsRight < BATCH_SIZE * SEEDS_PER_READ' failed.
Would be nice to have as to not require pipes to samtools in order to generate the file.
Hi,
Please can you indicate which version of icpc was used to build the v2.0pre2 binaries. We are seeing ~10% slowdown when we compile with 19.0.4.235.
Thanks,
Keiran
Should give non-zero exit codes if there is a problem with input data:
~> /opt/wtsi-cgp/bin/bwa-mem2 mem -T 30 -Y -p -t 6 /.../genome.fa /.../tmpMap_PD23336b/split/1/148698_i.fq_000000.gz > /dev/null
-----------------------------
Executing in AVX512 mode!!
-----------------------------
Ref file: /.../genome.fa
Entering FMI_search
ERROR! Unable to open the file: /.../genome.fa.bwt.8bit.32
~> echo $?
0
Hello,
I guess a simple question but what would be the difference between running
bwa-mem2 mem
VS
bwa-mem2.avx2 mem
On a proc supporting AVX2 instructions? Is the first executable a kind of wrapper that would identify the correct architecture?
Thank you
Hi,
I am migrating from the original bwa to bwa-mem2 and I am trying unsuccessfully to generate the indexes for the reference genome GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz, available at:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
The output is:
$ bwa-mem2 index GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna
[bwa_index] Pack FASTA... 9.78 sec
init ticks = 177097768626
ref seq len = 6211430126
binary seq ticks = 121988005759
Allocation of 46.28 GB for suffix_array failed.
Current Allocation = 52.06 GB
The output is the same in different machines indexing this reference genome.
Indexing the same genome with original bwa does not report any error.
Previous bwa mem version supported long reads through present settings (-x). Do you plan to introduce it here as well?
First, thank you for your work on this important program. I am looking forward to the improvements in BWA MEM2.
Running BWA MEM2 with the pre-built binaries results in the following for me:
bwa-mem2: src/bwamem.cpp:662: smem_struct *mem_collect_smem(const mem_opt_t *, const bseq1_t *, int, smem_struct *, int *, short *, unsigned char *, int *, long &): Assertion `query_pos_ar[pos] < len' failed.
The machine is Intel Skylake n1-standard-32 on GCP (32 core, 120GB RAM).
Reference genome: GRCh38.no_alt_analysis_set.fa.gz
Index generated: ./bwa-mem2 index GRCh38.no_alt_analysis_set.fa.gz
Input files - NA12878-I30 NovaSeq WGS from BaseSpace (the first 10,000 reads of each pair can reproduce the issue). These files are attached.
Command and error:
time ./bwa-mem2 mem ../references/GRCh38.no_alt_analysis_set.fa.gz NA12878-I30.10000r.R1.fastq.gz NA12878-I30.10000r.R2.fastq.gz -t 32 -R '@RG\tID:NA12878-I30\tLB:NA12878-I30\tPL:ILLUMINA\tPU:NONE\tSM:NA12878' | samtools sort -O BAM -o NA12878.bam -
Entering FMI_search
reference seq len = 6199845083
count
0, 1
1, 1817861264
2, 3099922542
3, 4381983820
4, 6199845083
Reading other elements of the index from files ../references/GRCh38.no_alt_analysis_set.fa.gz
prefix: ../references/GRCh38.no_alt_analysis_set.fa.gz
[M::bwa_idx_load_ele] read 0 ALT contigs
Done reading Index!!
Reading reference genome..
Binary seq file = ../references/GRCh38.no_alt_analysis_set.fa.gz.0123
Reference genome size: 6199845082 bp
Done readng reference genome !!
readLen: 150
[0000] 1: Calling process()
Threads used (compute): 32
Projected #read in a task: 2133343
------------------------------------------
Memory pre-allocation for chaining: 68386443208
Memory pre-allocation for BSW: 7667744768
Memory pre-allocation for BWT: 1966149632
------------------------------------------
No. of pipeline threads: 2
[0000] read_chunk: 320000000, work_chunk_size: 3008555, nseq: 20000
[0000][ M::kt_pipeline] read 20000 sequences (3008555 bp)...
[0000] 2. Calling mem_process_seqs.., task: 0
[0000] 3. Calling kt_for - worker_bwt
[0000] read_chunk: 320000000, work_chunk_size: 0, nseq: 0
bwa-mem2: src/bwamem.cpp:662: smem_struct *mem_collect_smem(const mem_opt_t *, const bseq1_t *, int, smem_struct *, int *, short *, unsigned char *, int *, long &): Assertion `query_pos_ar[pos] < len' failed.
[NA12878-I30.10000r.R2.fastq.gz](https://github.com/bwa-mem2/bwa-mem2/files/3205601/NA12878-I30.10000r.R2.fastq.gz)
[NA12878-I30.10000r.R1.fastq.gz](https://github.com/bwa-mem2/bwa-mem2/files/3205602/NA12878-I30.10000r.R1.fastq.gz)
pre-information is below
computer spec = Ubuntu, 80core, 760GB ram
bwa version = 2.0pre1
testfiles = 1.7G, 1.7G (paired)
Ref genome = GRCh38
cmd doesn't work.
bwa-mem2 mem -t 75 ref/grch38.fa ../test1.fastq.gz ../test2.fastq.gz > out.sam
-----------------------------
Executing in AVX2 mode!!
-----------------------------
Ref file: ref/grch38.fa
Entering FMI_search
reference seq len = 6176572803
count
0, 1
1, 1811087986
2, 3088286402
3, 4365484818
4, 6176572803
Reading other elements of the index from files ref/grch38.fa
prefix: ref/grch38.fa
[M::bwa_idx_load_ele] read 0 ALT contigs
Done reading Index!!
Reading reference genome..
Binary seq file = ref/grch38.fa.0123
Reference genome size: 6176572802 bp
Done readng reference genome !!
readLen: 151
[0000] 1: Calling process()
Threads used (compute): 75
Projected #read in a task: 4966897
------------------------------------------
Memory pre-allocation for chaining: 159218850232
Memory pre-allocation for BSW: 17971276800
Memory pre-allocation for BWT: 4638883200
------------------------------------------
No. of pipeline threads: 2
[0000] read_chunk: 750000000, work_chunk_size: 750000088, nseq: 4966888
[0000][ M::kt_pipeline] read 4966888 sequences (750000088 bp)...
[0000] 2. Calling mem_process_seqs.., task: 0
[0000] 3. Calling kt_for - worker_bwt
Segmentation fault (core dumped)
but it did work well
bwa-mem2 mem -t 40 ref/grch38.fa ../test1.fastq.gz ../test2.fastq.gz > out.sam (20G)
also 'bwa-mem' works too. but file size with made from 'bwa-mem' is 26G. Is it just difference from algorithm? is it negligible?
Thanks
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.