cmu-safari / blend

BLEND is a mechanism that can efficiently find fuzzy seed matches between sequences to significantly improve the performance and accuracy while reducing the memory space usage of two important applications: 1) finding overlapping reads and 2) read mapping. Described by Firtina et al. (published in NARGAB https://doi.org/10.1093/nargab/lqad004)

License: Other

Makefile 0.61% C 65.82% Shell 14.64% Python 5.33% Perl 0.20% JavaScript 13.07% Gnuplot 0.26% Dockerfile 0.06%

bioinformatics blend de-novo-assembly genome-analysis genome-assembly minimizers read-mapping strobemers fuzzy-seeds read-overlapping

blend's Issues

undesired 'mm_map_frag rechain' in sam file

Dear Blend development team,
I was interested in testing BLEND for short read mapping. The mapping of paired-end Illumina reads against a tomato genome worked perfectly, but the output SAM file contained 252 lines with "mm_map_frag rechain" after the PG line:

@SQ     SN:17-PSC-SL_TK14181.1.0_Chr11  LN:53848686
@SQ     SN:17-PSC-SL_TK14181.1.0_Chr12  LN:68218429
@RG     ID:var1    SM:var1     LB:Solution     PL:illumina     PU:none
@PG     ID:blend        PN:blend        VN:1.0  CL:blend -ax sr -t 4 -R @RG\tID:var1\tSM:var1\tLB:Solution\tPL:illumina\tPU:none slycopersicum.fasta.ind reads_1.fastq.gz reads_2.fastq.gz
mm_map_frag rechain
mm_map_frag rechain
...

These lines seem to be problematic for further processing with samtools:

samtools flagstat tmp.sam
[W::sam_read1_sam] Parse error at line 16
samtools flagstat: error reading from "tmp.sam"

Best regards,

Thomas
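
One possible workaround, until the stray output is fixed upstream, is to filter the file before passing it to samtools. The sketch below assumes that each unwanted line consists of exactly the text "mm_map_frag rechain", as in the excerpt above. The function name strip_stray_lines is made up for illustration and is not part of BLEND or samtools.

```python
import io


def strip_stray_lines(src, dst, stray="mm_map_frag rechain"):
    """Copy SAM text from src to dst, skipping lines equal to `stray`.

    Returns the number of lines that were skipped.
    """
    dropped = 0
    for line in src:
        if line.rstrip("\r\n") == stray:
            dropped += 1
            continue
        dst.write(line)
    return dropped


# Small in-memory demonstration with a shortened, made-up SAM fragment.
demo_in = io.StringIO(
    "@PG\tID:blend\tPN:blend\n"
    "mm_map_frag rechain\n"
    "mm_map_frag rechain\n"
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)
demo_out = io.StringIO()
print(strip_stray_lines(demo_in, demo_out), "stray line(s) removed")
```

For a real file, src and dst can be ordinary file objects, for example open("tmp.sam") and open("clean.sam", "w").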

Questions on running Blend on laptop

Hi Blend team, I am trying to run BLEND on my laptop to map the recently released ONT duplex reads. I cut a single fastq.gz into small chunks (each containing 20000 reads) and ran the command below. But after generating some .tmp files, the process was killed (I believe it exceeded the maximum memory, ~9GB here).

blend -ax map-ont -t 6 --secondary=no -I 50M -a --split-prefix hg002 hg38.fa ont_small_chunk.fq.gz

I am not quite sure about "-I 50M"; I am just assuming that blend will map reads to one part of the whole index at a time to save memory. Am I right? Do you have any advice for running blend on a platform with constrained resources? Or maybe it should not be run this way. Thanks a lot!

Original fastq is here:
https://human-pangenomics.s3.amazonaws.com/submissions/0CB931D5-AE0C-4187-8BD8-B3A9C9BFDADE--UCSC_HG002_R1041_Duplex_Dorado/Dorado_v0.1.1/stereo_duplex/11_15_22_R1041_Duplex_HG002_1_Dorado_v0.1.1_400bps_sup_stereo_duplex_pass.fastq.gz
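
The chunking step described above (20000 reads per chunk) can be sketched as follows. This is only an illustration of one way to do it, assuming the FASTQ uses the usual four lines per record with no line wrapping. The function name split_fastq and the output naming scheme are hypothetical.

```python
import gzip
from itertools import islice


def split_fastq(path, prefix, reads_per_chunk=20000):
    """Split a FASTQ file into gzipped chunks of `reads_per_chunk` reads.

    Assumes four lines per record. Returns the list of chunk paths written.
    """
    opener = gzip.open if path.endswith(".gz") else open
    paths = []
    with opener(path, "rt") as fh:
        while True:
            # Read the next block of records (4 lines each).
            lines = list(islice(fh, 4 * reads_per_chunk))
            if not lines:
                break
            out = f"{prefix}{len(paths):04d}.fq.gz"
            with gzip.open(out, "wt") as oh:
                oh.writelines(lines)
            paths.append(out)
    return paths


# Example call (file names are placeholders):
# split_fastq("reads.fastq.gz", "ont_small_chunk_")
```

Note that chunking the reads limits only the memory needed for the reads themselves; the memory needed for the reference index is a separate matter.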

Conda package not working in Mac M1

Hello!
I was interested in using the program. I installed it step by step on a Mac M1:

mamba create -n blend-bio
mamba install -c bioconda blend-bio
blend -h

and the output was:

50665 illegal hardware instruction blend -h

I also tried using an environment with x86 architecture, but it produced the same error.

I hope you can help me.

Thanks!

A test of BLEND on two real datasets of PacBio CLR and Nanopore reads

In the paper https://arxiv.org/abs/2112.08687 BLEND was tested on only one non-HiFi read dataset. That was a simulated read dataset for one of the smallest eukaryotic genomes, that of Saccharomyces cerevisiae.

To test how well BLEND performs on real (non-simulated) datasets of genomes which have more typical sizes, I used it to assemble genomes from these two sets of reads:

Caenorhabditis elegans, PacBio CLR reads used in the article https://www.sciencedirect.com/science/article/pii/S2589004220305770 . For polishing I also used Illumina reads from that article. The nematode genome size is approximately 100 Mbp.
Arabidopsis thaliana, Nanopore reads https://www.ncbi.nlm.nih.gov/sra/?term=ERR5530736 . For polishing I also used Illumina reads https://www.ncbi.nlm.nih.gov/sra/?term=ERR2173372 . The size of arabidopsis' genome is approximately 120 Mbp.

I searched for overlaps, then assembled the genomes with Miniasm using default parameters, then polished the assemblies using long reads with Racon, and then polished the assemblies using both long and short reads with HyPo. The assemblies were compared with references using QUAST.

The search for overlaps was performed with BLEND 1.0 and, for comparison, with Minimap2 2.22, using 22 threads on an Intel Xeon X5670.

For the nematode, results are as follows:

Metric	Minimap2	BLEND
Time to find overlaps	10m	3h 37m
Maximum RAM consumption	20G	44G
N50	2,056,511	1,915,190
NGA50	589,675	563,498
Misassemblies	740	707
Genome fraction	99.692%	99.683%
Total length	109,516,352	108,958,103

So, the assemblies of the nematode genome made with Minimap2 and with BLEND are similar. However, BLEND took about 22x longer to find overlaps (3h 37m versus 10m) and used about 2.2x as much RAM (44G versus 20G).
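
The ratios behind this comparison can be checked with a few lines of arithmetic. The figures are taken directly from the table above; the helper name to_minutes is only illustrative.

```python
def to_minutes(hours=0, minutes=0):
    """Convert a wall-clock duration to minutes."""
    return 60 * hours + minutes


minimap2_min = to_minutes(minutes=10)        # 10m
blend_min = to_minutes(hours=3, minutes=37)  # 3h 37m = 217 minutes

time_ratio = blend_min / minimap2_min        # 217 / 10
ram_ratio = 44 / 20                          # 44G versus 20G

print(f"time: {time_ratio:.1f}x, RAM: {ram_ratio:.1f}x")
# prints "time: 21.7x, RAM: 2.2x"
```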

For Arabidopsis, Minimap2 found overlaps in 30 minutes using 29G of RAM. I terminated BLEND because it had not finished after 24 hours. At the moment I terminated it, BLEND was using 300G of RAM.

So, it seems that on non-HiFi datasets for genomes larger than that of Saccharomyces cerevisiae, BLEND is slower than Minimap2 and uses more RAM. This may be because BLEND doesn't deal efficiently with repetitive seeds.

Some questions about the article

Could you please answer some questions about the article (https://arxiv.org/pdf/2112.08687.pdf):

For HiFi reads you used Minimap2 with the ava-pb preset, which is intended for PacBio CLR reads and not PacBio HiFi reads (Table S1). Why didn't you try Minimap2 with other parameters? For example, you could have increased the window size and the minimizer size. I suppose this would make Minimap2 faster and decrease its RAM consumption, thus reducing the difference between BLEND and Minimap2 on HiFi reads.
Why did you use N50 and not NGA50 (Table 2)? N50 may be inflated due to misassemblies that result in improper sequence junctions.
Why did you measure k-mer completeness and average identity using unpolished assemblies (Table 3)? Miniasm assemblies require polishing, because the accuracy of its contigs is the same as the accuracy of the reads used for the assembly. The higher accuracy of BLEND in Table 3 means that contigs made with BLEND are composed of slightly more accurate reads than contigs made with Minimap2, but the difference in accuracy may disappear after polishing.
Taking into account that you used only one non-HiFi long read dataset and BLEND performed worse on it than Minimap2 (N50 in Table 2), is it correct to say that BLEND is probably fit only for HiFi long reads, and not for PacBio CLR or Nanopore reads?

With best wishes,
Mikhail Schelkunov
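
To illustrate the concern about N50 raised in the second question, here is a minimal sketch of how N50 is commonly computed and how a single misjoin can inflate it. The contig lengths and the genome size are made-up toy numbers, not values from the paper. The sketch shows N50 and NG50 only; NGA50, as reported by tools such as QUAST, additionally requires aligning contigs to a reference and breaking them at misassemblies, which is not reproduced here.

```python
def nx50(lengths, total=None):
    """Return the largest length L such that contigs of length >= L
    together cover at least half of `total`.

    With total=None the assembly size is used (N50); passing the genome
    size instead gives NG50. Returns 0 if half of `total` is never reached.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


contigs = [80, 70, 50, 40, 30, 20]   # toy contig lengths
misjoined = [150, 50, 40, 30, 20]    # same contigs, with 80 and 70 wrongly joined

print(nx50(contigs))                 # 70
print(nx50(contigs, total=400))      # 50 (NG50 for a toy genome size of 400)
print(nx50(misjoined))               # 150: N50 rises although the assembly got worse
```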
