chhylp123 / hifiasm Goto Github PK

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads

License: MIT License

C++ 92.66% C 7.07% Makefile 0.05% Roff 0.22%

bioinformatics denovo-assembly genomics hifi-read pacbio

hifiasm's Introduction

Getting Started

# Install hifiasm (requiring g++ and zlib)
git clone https://github.com/chhylp123/hifiasm
cd hifiasm && make

# Run on test data (use -f0 for small datasets)
wget https://github.com/chhylp123/hifiasm/releases/download/v0.7/chr11-2M.fa.gz
./hifiasm -o test -t4 -f0 chr11-2M.fa.gz 2> test.log
awk '/^S/{print ">"$2;print $3}' test.bp.p_ctg.gfa > test.p_ctg.fa  # get primary contigs in FASTA

# Assemble inbred/homozygous genomes (-l0 disables duplication purging)
hifiasm -o CHM13.asm -t32 -l0 CHM13-HiFi.fa.gz 2> CHM13.asm.log
# Assemble heterozygous genomes with built-in duplication purging
hifiasm -o HG002.asm -t32 HG002-file1.fq.gz HG002-file2.fq.gz

# Hi-C phasing with paired-end short reads in two FASTQ files
hifiasm -o HG002.asm --h1 read1.fq.gz --h2 read2.fq.gz HG002-HiFi.fq.gz

# Trio binning assembly (requiring https://github.com/lh3/yak)
yak count -b37 -t16 -o pat.yak <(cat pat_1.fq.gz pat_2.fq.gz) <(cat pat_1.fq.gz pat_2.fq.gz)
yak count -b37 -t16 -o mat.yak <(cat mat_1.fq.gz mat_2.fq.gz) <(cat mat_1.fq.gz mat_2.fq.gz)
hifiasm -o HG002.asm -t32 -1 pat.yak -2 mat.yak HG002-HiFi.fa.gz

# Single-sample telomere-to-telomere assembly with HiFi, ultralong and Hi-C reads
hifiasm -o HG002.asm --h1 read1.fq.gz --h2 read2.fq.gz --ul ul.fq.gz HG002-HiFi.fq.gz

See tutorial for more details.

Getting Started
Introduction
Why Hifiasm?
Usage
Results
Getting Help
Limitations
Citing Hifiasm

Introduction

Hifiasm is a fast haplotype-resolved de novo assembler initially designed for PacBio HiFi reads. Its latest release could support the telomere-to-telomere assembly by utilizing ultralong Oxford Nanopore reads. Hifiasm produces arguably the best single-sample telomere-to-telomere assemblies combing HiFi, ultralong and Hi-C reads, and it is one of the best haplotype-resolved assemblers for the trio-binning assembly given parental short reads. For a human genome, hifiasm can produce the telomere-to-telomere assembly in one day.

Why Hifiasm?

Hifiasm delivers high-quality telomere-to-telomere assemblies. It tends to generate longer contigs and resolve more segmental duplications than other assemblers.
Given Hi-C reads or short reads from the parents, hifiasm can produce overall the best haplotype-resolved assembly so far. It is the assembler of choice by the Human Pangenome Project for the first batch of samples.
Hifiasm can purge duplications between haplotigs without relying on third-party tools such as purge_dups. Hifiasm does not need polishing tools like pilon or racon, either. This simplifies the assembly pipeline and saves running time.
Hifiasm is fast. It can assemble a human genome in half a day and assemble a ~30Gb redwood genome in three days. No genome is too large for hifiasm.
Hifiasm is trivial to install and easy to use. It does not required Python, R or C++11 compilers, and can be compiled into a single executable. The default setting works well with a variety of genomes.

Usage

Assembling HiFi reads without additional data types

A typical hifiasm command line looks like:

hifiasm -o NA12878.asm -t 32 NA12878.fq.gz

where NA12878.fq.gz provides the input reads, -t sets the number of CPUs in use and -o specifies the prefix of output files. For this example, the primary contigs are written to NA12878.asm.bp.p_ctg.gfa. Since v0.15, hifiasm also produces two sets of partially phased contigs at NA12878.asm.bp.hap?.p_ctg.gfa. This pair of files can be thought to represent the two haplotypes in a diploid genome, though with occasional switch errors. The frequency of switches is determined by the heterozygosity of the input sample.

At the first run, hifiasm saves corrected reads and overlaps to disk as NA12878.asm.*.bin. It reuses the saved results to avoid the time-consuming all-vs-all overlap calculation next time. You may specify -i to ignore precomputed overlaps and redo overlapping from raw reads. You can also dump error corrected reads in FASTA and read overlaps in PAF with

hifiasm -o NA12878.asm -t 32 --write-paf --write-ec /dev/null

Hifiasm purges haplotig duplications by default. For inbred or homozygous genomes, you may disable purging with option -l0. Old HiFi reads may contain short adapter sequences at the ends of reads. You can specify -z20 to trim both ends of reads by 20bp. For small genomes, use -f0 to disable the initial bloom filter which takes 16GB memory at the beginning. For genomes much larger than human, applying -f38 or even -f39 is preferred to save memory on k-mer counting.

Hi-C integration

Hifiasm can generate a pair of haplotype-resolved assemblies with paired-end Hi-C reads:

hifiasm -o NA12878.asm -t32 --h1 read1.fq.gz --h2 read2.fq.gz HiFi-reads.fq.gz

In this mode, each contig is supposed to be a haplotig, which by definition comes from one parental haplotype only. Hifiasm often puts all contigs from the same parental chromosome in one assembly. It has cleanly separated chrX and chrY for a human male dataset. Nonetheless, phasing across centromeres is challenging. Hifiasm is often able to phase entire chromosomes but it may fail in rare cases. Also, contigs from different parental chromosomes are randomly mixed as it is just not possible to phase across chromosomes with Hi-C.

Hifiasm does not perform scaffolding for now. You need to run a standalone scaffolder such as SALSA or 3D-DNA to scaffold phased haplotigs.

Trio binning

When parental short reads are available, hifiasm can also generate a pair of haplotype-resolved assemblies with trio binning. To perform such assembly, you need to count k-mers first with yak first and then do assembly:

yak count -k31 -b37 -t16 -o pat.yak paternal.fq.gz
yak count -k31 -b37 -t16 -o mat.yak maternal.fq.gz
hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak NA12878.fq.gz

Here NA12878.asm.dip.hap1.p_ctg.gfa and NA12878.asm.dip.hap2.p_ctg.gfa give the two haplotype assemblies. In the binning mode, hifiasm does not purge haplotig duplicates by default. Because hifiasm reuses saved overlaps, you can generate both primary/alternate assemblies and trio binning assemblies with

hifiasm -o NA12878.asm -t 32 NA12878.fq.gz 2> NA12878.asm.pri.log
hifiasm -o NA12878.asm -t 32 -1 pat.yak -2 mat.yak /dev/null 2> NA12878.asm.trio.log

The second command line will run much faster than the first.

Ultra-long ONT integration

Hifiasm could integrate ultra-long ONT reads to produce the telomere-to-telomere assembly:

hifiasm -o NA12878.asm -t32 --ul ul.fq.gz HiFi-reads.fq.gz

For the single-sample telomere-to-telomere assembly with Hi-C reads:

hifiasm -o NA12878.asm -t32 --ul ul.fq.gz --h1 read1.fq.gz --h2 read2.fq.gz HiFi-reads.fq.gz

For the trio-binning telomere-to-telomere assembly；

hifiasm -o NA12878.asm -t32 --ul ul.fq.gz -1 pat.yak -2 mat.yak HiFi-reads.fq.gz

Output files

Hifiasm generates different types of assemblies based on the input data. It also writes error corrected reads to the prefix.ec.bin binary file and writes overlaps to prefix.ovlp.source.bin and prefix.ovlp.reverse.bin. For more details, please see the complete documentation.

Results

The following table shows the statistics of several hifiasm primary assemblies assembled with v0.12:

_Dataset	_Size	_Cov.	_{Asm options}	_{CPU time}	_{Wall time}	_RAM	_N50
_{Mouse (C57/BL6J)}	_2.6Gb	_×25	_{-t48 -l0}	_172.9h	_4.8h	_76G	_21.1Mb
_{Maize (B73)}	_2.2Gb	_×22	_{-t48 -l0}	_203.2h	_5.1h	_68G	_36.7Mb
_Strawberry	_0.8Gb	_×36	_{-t48 -D10}	_152.7h	_3.7h	_91G	_17.8Mb
_Frog	_9.5Gb	_×29	_-t48	_2834.3h	_69.0h	_463G	_9.3Mb
_Redwood	_35.6Gb	_×28	_-t80	_3890.3h	_65.5h	_699G	_5.4Mb
_{Human (CHM13)}	_3.1Gb	_×32	_{-t48 -l0}	_310.7h	_8.2h	_114G	_88.9Mb
_{Human (HG00733)}	_3.1Gb	_×33	_-t48	_269.1h	_6.9h	_135G	_69.9Mb
_{Human (HG002)}	_3.1Gb	_×36	_-t48	_305.4h	_7.7h	_137G	_98.7Mb

Hifiasm can assemble a 3.1Gb human genome in several hours or a ~30Gb hexaploid redwood genome in a few days on a single machine. For trio binning assembly:

_Dataset	_Cov.	_{CPU time}	_{Elapsed time}	_RAM	_N50
_{HG00733, [father], [mother]}	_×33	_269.1h	_6.9h	_135G	_{35.1Mb (paternal), 34.9Mb (maternal)}
_{HG002, [father], [mother]}	_×36	_305.4h	_7.7h	_137G	_{41.0Mb (paternal), 40.8Mb (maternal)}

Human assemblies above can be acquired from Zenodo and non-human ones are available here.

Getting Help

For detailed description of options, please see tutorial or man ./hifiasm.1. The -h option of hifiasm also provides brief description of options. If you have further questions, please raise an issue at the issue page.

Limitations

Purging haplotig duplications may introduce misassemblies.

Citating Hifiasm

If you use hifiasm in your work, please cite:

Cheng, H., Concepcion, G.T., Feng, X., Zhang, H., Li H. (2021) Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods, 18:170-175. https://doi.org/10.1038/s41592-020-01056-5

Cheng, H., Jarvis, E.D., Fedrigo, O., Koepfli, K.P., Urban, L., Gemmell, N.J., Li, H. (2022) Haplotype-resolved assembly of diploid genomes without parental data. Nature Biotechnology, 40:1332–1335. https://doi.org/10.1038/s41587-022-01261-x

hifiasm's People

Contributors

Stargazers

Watchers

hifiasm's Issues

Memory consumption

Hi,
Thanks for the great software.
When I run hifiasm (0.7/0.8) to assemble one huge genome, more than 40Gb with 1.4Tb CCS data on a 4Tb node.
Suppose 4Tb node is enough given 35.6 Gb redwood genome only needs 699Gb memory.
Unfortunately, I failed to complete the jobs after the index step, the error is 'Segmentation fault (core dumped)', any ideas to help me out.
Thanks.

high switching error in one contig

Hi,
I used hifiasm for trio-binning assembly and it works very well, giving reference-size haplotypes.
However there is one contig showing higher number of switching error, exemplified from the blob plot below.
And I was able to identify the contig with yak trioeval.
The output from yak is "S h1tg000081l 13640 5824 13621 19 18 5805"
What is the best way to correct this contig? And how should I do it?

Thanks,
Zhen

Generating HiFi reads from a "butterfly"?

Pardon the ignorance, but how did you generate HiFi reads from a butterfly? We were told by our PacBio provider that we need 15 ug DNA (and we are working with Dipteran fly, so that was not possible from a single individual and thus we could only use the low-input DNA protocol with continuous long reads). We were also told that we couldn't use the PacBio Sequel I and could only generate HiFi reads from the PacBio Sequel II, but that doesn't seem to be the case according to PacBio's website.

Edit...so it does seem possible to use the low DNA input workflow for HiFi read generation
https://youtu.be/PJXDhusSexY

Empty output - zero overlaps

Hi,

I'm running into some trouble running hifiasm on CCS reads. Specifically, whenever I run the program it proceeds for a while, never outputing any kind of error message (not even segmentation fault). When it finishes, it creates all the output files, but they're all empty except for the .bin files. I have a species with a big genome (estimated size above >7 Gb), I have 260 Gbp of data (average subread length 15kb), so should be over 30x coverage.

The command I used was:

./hifiasm -o assembly.asm -t 40 -f38 -z20 subreads.fq.gz

I tried to vary the -f option and excluding the -z option, but results remain unchanged. I ran this both on a computer with 1 Tb of RAM and on a different server with slurm, allocating 200 Gb to the process, results were the same on both machines. I ran the program with the test data and everything goes well, so I assume everything is ok with the installation. The output mentions zero overlaps, is this a sign something went wrong with the sequencing, or is this a problem with the options I'm using?

Thanks for the help!

SAMPLE OF OUTPUT

[M::ha_analyze_count] lowest: count[86] = 1600
[M::ha_analyze_count] highest: count[87] = 1611
[M::ha_hist_line] 2: ****************************************************************************************************> 40070115
[M::ha_hist_line] 3: ****************************************************************************************************> 9280334
[M::ha_hist_line] 4: ****************************************************************************************************> 3692809
[M::ha_hist_line] 5: ****************************************************************************************************> 1934045
[M::ha_hist_line] 6: ****************************************************************************************************> 1185357
[M::ha_hist_line] 7: ****************************************************************************************************> 796019
[M::ha_hist_line] 8: ****************************************************************************************************> 570160
[M::ha_hist_line] 9: ****************************************************************************************************> 427354

[... lines ommitted ...]

[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: count[44] = 376
[M::ha_pt_gen] peak_hom: 44; peak_het: 40
[M::ha_pt_gen::1307.63518.51] ==> indexed 7590996 positions
[M::ha_assemble::1355.419[email protected]] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 0
[M::ha_print_ovlp_stat] # strong overlaps: 0
[M::ha_print_ovlp_stat] # weak overlaps: 0
[M::ha_print_ovlp_stat] # exact overlaps: 0
[M::ha_print_ovlp_stat] # inexact overlaps: 0
[M::ha_print_ovlp_stat] # overlaps without large indels: 0
[M::ha_print_ovlp_stat] # reverse overlaps: 0
Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] purge duplication coverage threshold: 55
[M::purge_dups] purge duplication coverage threshold: 55
[M::adjust_utg_by_primary] primary contig coverage range: [35, infinity]
Writing primary contig GFA to disk...
Writing alternate contig GFA to disk...
[M::main] Version: 0.12-r304
[M::main] CMD: hifiasm/hifiasm -o assembly.asm -t 40 -f38 subreads.fq.gz
[M::main] Real time: 1384.462 sec; CPU: 26092.516 sec; Peak RSS: 41.007 GB

Recommended settings for PacBio program ccs for converting subreads to HiFi reads for downstream usage with hifiasm

Hi,

Do you have recommended settings for the PacBio program ccs (https://github.com/PacificBiosciences/ccs) for converting raw subreads to HiFi reads for downstream usage with hifiasm?

Best,
Jean Elbers

get corrected reads in fasta format

In earlier versions hifiasm stored the corrected reads in a fasta file, now it seems it stores them in a binary file. Is there any way to retrieve them back into a fasta format? For downstream analysis, its a very nice feuture to have these corrected reads.

rerun on bin files

Hi,

How can I rerun Hifiasm with different parameters based on the generated bin files?

Thanks,
André

Separate haplotype-resloved untig in GFA

Hi, chhylp123

For species with high heterozygous rate (i.e., Butterfly), different haplotypes can be fully separated. It is important to remove small bubbles from the haplotype-resolved unitig graph. The reason is that some small bubbles are caused by somatic mutations or noise in data, which are not the real haplotype information. In this case, haplotype-resolved processed unitig graph without small bubbles should be better.

Haplotype-resolved untig fasta(490M) seems exactly twice of my GenomeScope estimated haploid size (260M, 1.8% het), but the result just the untig number, how can I get parental and maternal haplotype? Or the hifiasm just output the haplotype-resolved untig, but need other information (such as HiC or Trio) to separate two haplotype for diploid.

Zhigui Bao

Higher QV in a_ctg than p_ctg

Thank you for the nice and very efficient assembler.

I assembled a highly heterozygous genome and hifiasm did very well separating primary and haplotigs (p_ctg size is only a bit larger than the expected genome size and a_ctg length is 90% of p_ctg size).

I ran Merqury for QC from short-read kmers of the same sample. I always have higher consensus accuracy QV in a_ctg (QV=~49) than p_ctg (QV=~37) independently of the parameters I use in hifiasm.
Is it expected?
Are there some hifiasm options that could improve QV of p_ctg?

Thanks

Hifiasm increases scaffold size for --high-het (or default) compared to -l0

Just checking to make sure this makes sense, but hifiasm has a bigger N50 for --high-het (or default purge_dups settings [identical output with --high-het or defaults]) than using -l0 for the p_ctgs that are produced by either --high-het or -l0. This seems like it should be the opposite? All other settings are defaults hifiasm -p prefix --high-het or -l0 hifi.reads.fasta.gz with hifiasm 0.13-r307.

Can I use canu trimmed reads as input for hifiasm?

Hi,

Canu (https://github.com/marbl/canu) assemble a genome through three steps: correct, trim and assemble. The first step will generate a file called correctedReads.fasta and the second step will generate a file called trimmedReads.fasta. I am wondering, it is OK or accepatable to use the canu-generated trimmed reads as input for hifiasm?

I am asking this question because I want to assemble a genome using both the raw PacBio CLR reads and the PacBio HiFi reads. Canu/HiCanu did not work very well for this situation (a lower N50 length than two independent assemblies), so I would like to switch to hifiasm (the resulting assembly has a better N50 length than two independent assemblies).

Leiting

How to get coverage depth information in the GFA?

Hello,
Thanks for this very useful tool. One lacking feature (or perhaps I could just not find it?) is that there is no coverage depth information available in the GFAs produced. Is there a way to access this information (or to request for it to be added in a future update of hifiasm)?

Any recommended parameter-set to adjust the low coverage hifi data(15~20X)

Hi,
Thanks for developing this cool tool.
Any recommended parameter-set to adjust for the large genome assembly with low coverage hifi data(15~20X)? like -k ？

Thanks!

GFA `A` lines

I am wondering about the gfa output - is there any documentation about the lines beginning with A ? I assume that is giving the locations of the reads on the unitigs/contigs. I would really like to be able to extract out the supporting reads for any location on the contig (for example, to find how many reads support a variant between assemblies). It would also be nice to have the cigar to know exactly how the read lines up on the contig. It would save me a step and reduce the ambiguity that could result from having to align the reads back to the assembly after the fact. Would it be easy to add in the cigar?

Lower contiguity with longer insert size

Hi Haoyu/Heng Li,

Thanks for the fantastic software for assembling HiFi reads! I've been testing it on human genome extensively and one of the things that I was trying to understand is the tradeoff between insert size and accuracy. I assembled the GIAB HG002 data using both the 15kb and the 20kb libraries (Version 0.5, default parameters) at approximately the same coverage and obtained the following stats:

15kb 2 cells, ~ 17 fold coverage (Average Q = 33.5, Mean insert size = 12.9 kb), N = 793 contigs, NG50 = 37 Mbp, NGA50 = 16.9 Mbp
20kb 2 cells, ~ 16 fold coverage (Average Q = 31.1, Mean insert size = 18.5 kb), N = 770 contigs, NG50 = 21 Mbp, NGA50 = 12.9 Mbp

Here's the p_utg gfa graph for 15kb:

And for 20kb:

It seems like at the unitig stage the 20kb dataset has quite a lot fewer (and shorter) nodes and edges than the 15kb dataset, and once it's collapsed to contigs the 15kb is more contiguous. The 20kb graph also appears to have more "spikes" on the contigs (likely from the lower accuracy?). Do you think there's any parameter (k-mer?) that we can tune when we're using lower accuracy reads with higher insert size?

Comparison of hifiasm with simulated CLR corrected reads and simulated HiFi reads

Hi,

I just wanted to show a comparison using simulated PacBio CLR reads corrected with Illumina reads versus simulated HiFi reads for a Drosophila melanogaster genome assembly (haploid assembly "converted" to diploid with ~2% heterozygosity).

# Use Drosophila melongaster PacBio assembly
cd /genetics/elbers/test/fly2
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/401/745/GCA_003401745.1_ASM340174v1/GCA_003401745.1_ASM340174v1_genomic.fna.gz




# Use BCFtools 1.10.2 and SAMtools 1.10
conda activate bcftools1.10.2




# Use Seqtk to convert soft-masked bases to upper-case bases, also compress with bgzip
# https://github.com/lh3/seqtk
/genetics/elbers/bin/seqtk/seqtk seq -U \
GCA_003401745.1_ASM340174v1_genomic.fna.gz | \
bgzip -@75 > GCA_003401745.1_ASM340174v1_genomic.fna_upper.fasta.gz




# Convert to diploid with approximately 2% heterozygosity rate, max indels=20bp
# mutate.sh part of BBTools/BBMap https://sourceforge.net/projects/bbmap/
/genetics/elbers/bbmap-38.86/mutate.sh \
in=GCA_003401745.1_ASM340174v1_genomic.fna_upper.fasta.gz \
ow=t \
vcf=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.vcf.gz \
out=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
ziplevel=6 \
ploidy=2 \
subrate=0.0192 \
indelrate=0.001 \
maxindel=20 \
nohomopolymers=t \
hetrate=1 2> GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.log.txt

# genome size
bgzip -@75 -cd GCA_003401745.1_ASM340174v1_genomic.fna_upper.fasta.gz | \
grep -v ">"|wc -m
# 140687135

# 2% heterozygosity is how many bases
calc 0.02\*140687135
# 0.02*140687135 = 2813742.700000

# number of mutations added
bgzip -@75 -cd GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.vcf.gz| \
grep -v "^#"|wc -l
# 2830139

# ~2% het rate




# get only 1 haplotype from the "diploid" reference
bgzip -@75 -dc GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz|\
/genetics/elbers/bin/seqtk/seqtk seq -L0|paste - - |grep -P "haplo_0\t"| \
tr '\t' '\n' |\
/genetics/elbers/bin/seqtk/seqtk seq -L60 |\
bgzip -@75 \
> GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype0.fasta.gz




# make reference for randomreads.sh
# randomreads.sh part of BBTools/BBMap https://sourceforge.net/projects/bbmap/
/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
seed=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
pacbio=t pbmin=0.13 pbmax=0.17 \
reads=100 paired=f \
gaussianlength=t \
minlength=1000 midlength=20000 maxlength=100000 \
out=/dev/null




# make 60x haploid coverage for Illumina reads
/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
coverage=30 paired=t maxinsert=550 mininsert=450 \
out1=illumina1.fastq.gz out2=illumina2.fastq.gz > random_reads_illumina.log 2>&1




# interleave the paired-end reads
# reformat.sh part of BBTools/BBMap https://sourceforge.net/projects/bbmap/
/genetics/elbers/bbmap-38.86/reformat.sh \
in=illumina1.fastq.gz in2=illumina2.fastq.gz out=illumina.int.fastq 2>/dev/null




# use KmerGenie 1.7051 to get an idea of k-mer with that produces longest N50
# http://kmergenie.bx.psu.edu/
mkdir -p /genetics/elbers/test/fly2/kmergenie-illumina-raw-reads

cd /genetics/elbers/test/fly2/kmergenie-illumina-raw-reads
/genetics/elbers/kmergenie-1.7051/kmergenie ../illumina.int.fastq \
> kmergenie-illumina-raw-reads.log 2>&1
rm ../illumina.int.fastq

k=`grep "^best k:" \
kmergenie-illumina-raw-reads.log | grep -Po "\d+"` 
echo "best k=${k}"




# make 30x haploid coverage for PacBio CLR reads
# error rate from 13 - 15 % minimum 1000bp midlength 20000bp maximum 30000bp
cd /genetics/elbers/test/fly2

/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
ow=t seed=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
pacbio=t pbmin=0.13 pbmax=0.15 \
coverage=15 paired=f \
gaussianlength=t \
minlength=1000 midlength=20000 maxlength=30000 \
out=pacbio.fastq.gz > random_reads_pacbio.log 2>&1



# make 30x haploid coverage for PacBio reads for Hifi reads
# error rate from 1 - 0.1 % minimum 9000bp midlength 10000bp max 12000bp
/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
ow=t seed=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
pacbio=t pbmin=0.001 pbmax=0.01 \
coverage=15 paired=f \
gaussianlength=t \
minlength=9000 midlength=10000 maxlength=12000 \
out=hifi.fastq.gz > random_reads_pacbio_hifi.log 2>&1




# de novo assemble Illumina reads 
# with ABySS 2.2.4 using k-mer value from kmer genie
# partition command allows parallel (mpi) assembly with ABySS
# https://github.com/bcgsc/abyss/
cd /genetics/elbers/test/fly2
mkdir -p abyss

# break up Illumina read pairs into 35 paired files (70 files in total)
# to speed up assembly with ABySS mpi mode
# partition.sh part of BBTools/BBMap https://sourceforge.net/projects/bbmap/
/genetics/elbers/bbmap-38.79/partition.sh threads=75 -Xmx350g -Xms350g \
in1=illumina1.fastq.gz \
in2=illumina2.fastq.gz \
out=abyss/illumina-raw-%-R1.fasta out2=abyss/illumina-raw-%-R2.fasta \
ways=35 2>/dev/null

cd /genetics/elbers/test/fly2/abyss

# note I don't know of a simpler command for ABySS to do the following with
# 75 parallel jobs

/genetics/elbers/bin/abyss-2.2.4/build/bin/abyss-pe k=${k} \
name=illumina-raw graph=gfa1 \
lib="pe1 pe2 pe3 pe4 pe5 pe6 pe7 pe8 pe9 pe10 pe11 pe12 pe13 pe14 pe15 \
pe16 pe17 pe18 pe19 pe20 pe21 pe22 pe23 pe24 pe25 pe26 pe27 pe28 pe29 \
pe30 pe31 pe32 pe33 pe34 pe35" \
np=75 j=75 \
pe1="illumina-raw-0-R1.fasta illumina-raw-0-R2.fasta" \
pe2="illumina-raw-1-R1.fasta illumina-raw-1-R2.fasta" \
pe3="illumina-raw-2-R1.fasta illumina-raw-2-R2.fasta" \
pe4="illumina-raw-3-R1.fasta illumina-raw-3-R2.fasta" \
pe5="illumina-raw-4-R1.fasta illumina-raw-4-R2.fasta" \
pe6="illumina-raw-5-R1.fasta illumina-raw-5-R2.fasta" \
pe7="illumina-raw-6-R1.fasta illumina-raw-6-R2.fasta" \
pe8="illumina-raw-7-R1.fasta illumina-raw-7-R2.fasta" \
pe9="illumina-raw-8-R1.fasta illumina-raw-8-R2.fasta" \
pe10="illumina-raw-9-R1.fasta illumina-raw-9-R2.fasta" \
pe11="illumina-raw-10-R1.fasta illumina-raw-10-R2.fasta" \
pe12="illumina-raw-11-R1.fasta illumina-raw-11-R2.fasta" \
pe13="illumina-raw-12-R1.fasta illumina-raw-12-R2.fasta" \
pe14="illumina-raw-13-R1.fasta illumina-raw-13-R2.fasta" \
pe15="illumina-raw-14-R1.fasta illumina-raw-14-R2.fasta" \
pe16="illumina-raw-15-R1.fasta illumina-raw-15-R2.fasta" \
pe17="illumina-raw-16-R1.fasta illumina-raw-16-R2.fasta" \
pe18="illumina-raw-17-R1.fasta illumina-raw-17-R2.fasta" \
pe19="illumina-raw-18-R1.fasta illumina-raw-18-R2.fasta" \
pe20="illumina-raw-19-R1.fasta illumina-raw-19-R2.fasta" \
pe21="illumina-raw-20-R1.fasta illumina-raw-20-R2.fasta" \
pe22="illumina-raw-21-R1.fasta illumina-raw-21-R2.fasta" \
pe23="illumina-raw-22-R1.fasta illumina-raw-22-R2.fasta" \
pe24="illumina-raw-23-R1.fasta illumina-raw-23-R2.fasta" \
pe25="illumina-raw-24-R1.fasta illumina-raw-24-R2.fasta" \
pe26="illumina-raw-25-R1.fasta illumina-raw-25-R2.fasta" \
pe27="illumina-raw-26-R1.fasta illumina-raw-26-R2.fasta" \
pe28="illumina-raw-27-R1.fasta illumina-raw-27-R2.fasta" \
pe29="illumina-raw-28-R1.fasta illumina-raw-28-R2.fasta" \
pe30="illumina-raw-29-R1.fasta illumina-raw-29-R2.fasta" \
pe31="illumina-raw-30-R1.fasta illumina-raw-30-R2.fasta" \
pe32="illumina-raw-31-R1.fasta illumina-raw-31-R2.fasta" \
pe33="illumina-raw-32-R1.fasta illumina-raw-32-R2.fasta" \
pe34="illumina-raw-33-R1.fasta illumina-raw-33-R2.fasta" \
pe35="illumina-raw-34-R1.fasta illumina-raw-34-R2.fasta" \
v=-v unitigs > illumina-raw-abyss.log 2>&1
rm -f illumina-raw-?-R?.fasta illumina-raw-??-R?.fasta




# Fill in the "S" lines of the gfa file using gfakluge 1.1.2
# https://github.com/edawson/gfakluge
/genetics/elbers/gfakluge-1.1.2/gfak fillseq -f \
illumina-raw-1.fa illumina-raw-1.gfa1 > illumina-raw-1.gfa




# Error-correct the raw PacBio reads with GraphAligner 1.0.12
# https://github.com/maickrau/GraphAligner
# and the graph you made in the previous steps
conda deactivate
conda activate graphaligner
GraphAligner -t 75 -g illumina-raw-1.gfa \
-f ../pacbio.fastq.gz \
--corrected-out pacbio.raw.abyss.notclipped.fasta \
--preset dbg > pacbio.raw.abyss.notclipped.fasta.log 2>&1
conda deactivate




# Convert lower-case bases to upper-case bases
conda activate bcftools1.10.2
/genetics/elbers/bin/seqtk/seqtk seq -U \
pacbio.raw.abyss.notclipped.fasta|bgzip -@75 \
> pacbio.raw.abyss.notclipped.fasta.gz
conda deactivate




# de novo assemble with hifiasm 0.12-r304
# https://github.com/chhylp123/hifiasm
cd /genetics/elbers/test/fly2/hifiasm-hifi

# hifi reads
/genetics/elbers/bin/hifiasm/hifiasm -o hifiasm-hifi2 -t75 \
../hifi.fastq.gz --high-het > hifiasm2.log 2>&1

awk '/^S/{print ">"$2"\n"$3}' \
hifiasm-hifi2.p_ctg.gfa | fold > hifiasm-hifi2.p_ctg.fasta

# corrected reads
/genetics/elbers/bin/hifiasm/hifiasm -o hifiasm-cor -t75 \
../abyss/pacbio.raw.abyss.notclipped.fasta.gz --high-het > hifiasm3.log 2>&1

awk '/^S/{print ">"$2"\n"$3}' hifiasm-cor.p_ctg.gfa | fold \
> hifiasm-cor.p_ctg.fasta




# Run Quast 5.1.0rc1 on haplotype0 of reference and hifiasm results
# https://github.com/ablab/quast
# Remember that I converted GCA_003401745.1 to diploid with ~2% het rate
cd /genetics/elbers/test/fly2

mkdir -p quast4
cd /genetics/elbers/test/fly2/quast4
conda activate quast5.0.2
/genetics/elbers/quast-5.1.0rc1/quast.py --fragmented --fast --threads 75 \
-r ../GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype0.fasta.gz \
--eukaryote -o ./ ../hifiasm-hifi/hifiasm-hifi2.p_ctg.fasta ../hifiasm-hifi/hifiasm-cor.p_ctg.fasta \
> quast.log 2>&1
conda deactivate

column -ts $'\t' report.tsv


Assembly                     hifiasm_hifi2.p_ctg  hifiasm_cor.p_ctg
# contigs (>= 0 bp)          54                   297
# contigs (>= 1000 bp)       54                   297
# contigs (>= 5000 bp)       54                   297
# contigs (>= 10000 bp)      54                   297
# contigs (>= 25000 bp)      51                   297
# contigs (>= 50000 bp)      38                   267
Total length (>= 0 bp)       140894018            126334800
Total length (>= 1000 bp)    140894018            126334800
Total length (>= 5000 bp)    140894018            126334800
Total length (>= 10000 bp)   140894018            126334800
Total length (>= 25000 bp)   140857808            126334800
Total length (>= 50000 bp)   140365172            125113430
# contigs                    54                   297
Largest contig               30137170             3306885
Total length                 140894018            126334800
Reference length             140694204            140694204
Reference GC (%)             42.05                42.05
N50                          23562678             821324
NG50                         23562678             741584
N90                          2232689              180295
NG90                         2232689              -
L50                          3                    43
LG50                         3                    52
L90                          6                    156
LG90                         6                    -
# misassemblies              35                   33
# misassembled contigs       16                   31
Misassembled contigs length  82394091             24116595
# local misassemblies        1                    172
# scaffold gap ext. mis.     0                    0
# scaffold gap loc. mis.     0                    0
# unaligned mis. contigs     0                    0
# unaligned contigs          0 + 0 part           0 + 15 part
Unaligned length             0                    22898
Genome fraction (%)          97.102               89.348
Duplication ratio            1.029                1.004
# N's per 100 kbp            0.00                 0.00
# mismatches per 100 kbp     1378.10              1018.66
# indels per 100 kbp         69.40                94.81
Largest alignment            27942815             3306870
Total aligned length         140637770            126237320
NA50                         22378608             761401
NGA50                        22378608             684993
NA90                         1358953              171380
NGA90                        1358953              -
LA50                         3                    48
LGA50                        3                    58
LA90                         9                    174
LGA90                        9                    -




# Analyze quality scores of reads and assemblies using
# YAK version r55 - yet another k-mer analyzer
# https://github.com/lh3/yak
cd /genetics/elbers/test/fly2

/genetics/elbers/yak/yak count -b37 -t75 -o raw.yak \
<(bgzip -@75 -cd illumina1.fastq.gz) \
<(bgzip -@75 -cd illumina2.fastq.gz) > sr.raw.yak.log 2>&1

cd /genetics/elbers/test/fly2
for ref in `ls -1 abyss/pacbio.raw.abyss.notclipped.fasta.gz hifi.fastq.gz hifiasm-hifi/hifiasm-hifi2.p_ctg.fasta hifiasm-hifi/hifiasm-cor.p_ctg.fasta`;do
/genetics/elbers/yak/yak qv -t75 /genetics/elbers/test/fly2/raw.yak ${ref} 2>/dev/null |tail -n +1034 >> test2
done

for ref in `ls -1 abyss/pacbio.raw.abyss.notclipped.fasta.gz hifi.fastq.gz hifiasm-hifi/hifiasm-hifi2.p_ctg.fasta hifiasm-hifi/hifiasm-cor.p_ctg.fasta`;do
echo $ref >> test3
done

paste test3 <(cut -f 2-3 test2) |sort -k2,2 | \
sed '1i Sequences_or_assembly\traw_quality_value\tadjusted_quality_value' |column -tx

Sequences_or_assembly                       raw_quality_value  adjusted_quality_value
hifi.fastq.gz                               22.816             22.820
abyss/pacbio.raw.abyss.notclipped.fasta.gz  33.499             33.522
hifiasm-hifi/hifiasm-cor.p_ctg.fasta        36.636             36.687
hifiasm-hifi/hifiasm-hifi2.p_ctg.fasta      49.417             50.800

Update added program links 10:30 CET 1 Sep 2020

the assembled result need to be redundant?

Hi,

I am trying to assemble a heterotetraploid genome. Add -l0 to the assembly parameter, and combine the information of a_ctg and p_ctg files as a result. If I want to distinguish between haplotypes, do I need to de-redundate the genome?

Thanks,
Yulin Bai

Typo in README.md?

Shouldn't

awk '/^S/{print ">"$1;print $2}' test.p_ctg.gfa > test.p_ctg.fa  # get primary contigs in FASTA

awk '/^S/{print ">"$2;print $3}' test.p_ctg.gfa > test.p_ctg.fa  # get primary contigs in FASTA

In the README.md for commit 55f95c0?

misassembly introduced in hifiasm v0.5

I am following hifiasm for our plant genome assembly, and noticed that with my code update from hifiasm v03 to v05 the assembly n50 increased but in my genome 2 appearant misassemblies got my attention.
I highlight one in the images below. We have a tetraploid plant of ~ 400MB haploid size, and we have >100x hifi coverage of the haploid genome, meaning ~25x per haplotype.
I already assembled this genome with hiCANU, and noticed also there that with changing the bogart (assemble) parameters too much, I easily introduce misassemblies. As far as I know, I can't alter the settings for hifiasm.
In this image you see the alignment of the contigs to a related public reference, with the assembly error:

If I look at the same positions (zoom in) but for the unitigs; I see that the unitigs do not contain this error:

Are their any ways to make the contigging more stringent? (version 0.3 did not have this error yet)
I have to say that I am not 100% sure this is not a true biological case, I am confindent that this is an assembly error, also from what I've seen with hiCanu. If you need more info, let me know

partial contigs with half-reduced coverage ccs reads is misassembled? No, by contrast, it has assembled very long and novel tandem repeat sequeces

Hi chhylp123,
I have sequenced a diploid genome (repeat content >70%) with 25X coverage HiFi reads. Luckly, I got a wonderful contigs with N50 of 44 Mb by hifiasm 0.12.

Then I anchored contigs to chromosomes by Allmaps with ~1500 high quality genetic markers. Finally, I obtained 10 pesudomolecules. However, I found there was 20-MB region in chr7 which is not supported by genetic markers and synteny with other homologus species.

Furthermore, I mapped ccs reads to the final assembly, and I found that 20-Mb region with half-reduced coverage reads.

Meanwhile, I also mapped the RNA-seq reads to the genome, and no reads covered this region. So, I think this 20-Mb region maybe misassembled.

However, this 20-Mb region was located in a single contig (108 Mb) which were constructed by sereval utgs (the length of both terminal utgs (utg000064l and utg000017l) are 29 Mb and 47 Mb, separately ), and there is no obvious evidence support to break this contig.

Therefore, I am wondering whether there are other probabilities for this assembly? And have you ever met that some assembly regions covered by half depth reads before? May be high heterozygosity for 20-Mb?

Thanks!

Dong An

adding HiC data

Dear Haoyu,

It has been a very nice experience using hifiasm. Thanks for your exercellent wrok.

May I ask for your suggestion on how to combine HiC data with hifiasm assemblies in general, for both phasing and scaffolding?

Many thanks in advance.

Kind regards,

Weihong

cannot find output files

Hi
I am running HIfiasm on a dataset of D.melanogaster HiFi reads that was downloaded from NCBI.
I am using the command line;
time hifiasm -o -t 36
It has the intention of creating a folder in home directory and placing the generated files in it. However, after hrs of working I get
'writing reads to disk ....
Segmentation fault (core dumped)'

There is no new folder created and I cannot find the specified output files.
What am I doing wrong? I am rather new to assembling these genomes

How high is “high heterozygosity”?

Curious if there is a recommendation on what level of heterozygosity to use the “--high-het“ option?

hifiasm output: 731.928 MB single contig

Hi,
I am working with highly heterozygous genome. My genome size is ~400MB with 9 chromosome.
Hifiasm gives me single contig of 731.928 MB.

This is the stats:
Main genome scaffold total: 1
Main genome contig total: 1
Main genome scaffold sequence total: 731.928 MB
Main genome contig sequence total: 731.927 MB 0.000% gap
Main genome scaffold N/L50: 1/731.928 MB
Main genome contig N/L50: 1/731.927 MB
Main genome scaffold N/L90: 1/731.928 MB
Main genome contig N/L90: 1/731.927 MB
Max scaffold length: 731.928 MB
Max contig length: 731.927 MB
Number of scaffolds > 50 KB: 1
% main genome in scaffolds > 50 KB: 100.00%

It does not look right.
Any help, would be great.

Insufficient memory？

Hello, thank you for your software, I had a problem, my fasta sequence has 200G, I used version=0.5, no log file is generated and then the program automatically kills, What's the reason for this? Is it a lack of memory?

Highly heterozygous genome phased by hifiasm, wondered added value of trio-binning?

hifiasm.log.txt
Hi,

Thanks for the amazing tool, I'm a bit new to genome assembling, and have just tried hifiasm using public SRA data. Using default settings I've got seemingly very nice assemblies. The total size of the p_ctg is very reasonable (693M) and the N50 is over 30M, the species is apple, as we know it's highly heterozygous, my alternative assembly (a_ctg) is around 589M, which I think is also quite reasonable. You can see two obvious peaks from the kmer distribution (Please help to have a quick look at the attached working log to see if everything is okay, thanks!).

My first question is does p_ctg represent one full haplotype? As I understand p_ctg + a_ctg + collapsed homozygous regions = 2c ? How can I generate another haplotype consisting of a_ctg and the collapsed homozygous regions?
My second question is how much improvements can be made to the assembly if short illumina reads for the parents also available? (I think it's already quite good, maybe some other added-value of using trio-binning?) Although sometimes it's hard for us to know the exact parents for some of the species being sequenced (especially the wild ones).

Thank you so much.

check resulting assembly from haplotype binning is better than without

Hello there,
Thanks for developing such wonderful tool. There is an idea that I want to test.
I would like to know if the resulting haplotype-specific assembly from hifiasm using '-3 -4' is better than a diploid hifiasm assembly. For instance, for a region of interest, I have already information (say from Hi-C) that allows me to separate the HiFi reads into two haplotypes.
so I ran hifiasm using the following command:
hifiasm -o haplotype_binning_-3-4out -t 8 -3 haplotype1_read_list -4 haplotype2_read_list comb.fa

I wanted to know if this assembly is better than an assembly from default hifiasm, so I ran hifiasm with the following command:
hifiasm -o comb.hifiasm -t8 comb.fa

It produces the output files as shown in the attachment.
I wonder what are the files that I should look at for a direct comparison, say check N50?
I am very curious to know essentially what hifiasm did, specifically, how differently the '-3 -4' mode behave differently from the default diploid assembly.

Hope my questions are clear :-).
thanks,
zhenzhen

many fragment sequence in the p_ctg

Hi Haoyu/Heng Li,
Thanks for the fantastic software for assembling HiFi reads! Recently I am following hifiasm(0.8-r279) for our plant genome assembly(~350M), and the p_ctg is as follows:
Total_length 460M, Max_length 51M, N50_length 33M, Total_number 2646.
The contiguity is higher than any other assemblers I have used , however I also notice that 2542 contigs length are less than 50kb, about 96% of total contigs number.
I wonder why there are so many fragment contigs and can I remove these contigs from p_ctg?
Thank you.

Manually specify where het and hom k-mer peaks are - hifiasm's guess incorrect.

Hi there, trying out my first assemblies with hifiasm with an animal whose genome is around 170 Mb, and is over 4% heterozygous. We know this from >100x coverage k-mer spectra from Illumina WGS reads.

The k-mer spectrum from the PacBio HiFi data tells the same story. I made a k-51 spectrum since I saw that is what hifiasm uses by default. The data are around 44x coverage, and you can see the peak from the het k-mers around 20 and the homozygous k-mers around 40.

The k-mer spectrum from hifiasm matches the above spectrum:

[M::ha_ft_gen::499.056*[email protected]] ==> filtered out 3811358 k-mers occurring 95 or more times
[M::ha_opt_update_cov] updated max_n_chain to 100
[M::ha_pt_gen::698.676*5.57] ==> counted 18097002 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[6] = 152684
[M::ha_analyze_count] highest: count[19] = 451630
[M::ha_hist_line]     1: ****************************************************************************************************> 7830671
[M::ha_hist_line]     2: ****************************************************************************************************> 770193
[M::ha_hist_line]     3: ******************************************************************** 309013
[M::ha_hist_line]     4: ****************************************** 191939
[M::ha_hist_line]     5: ********************************** 154466
[M::ha_hist_line]     6: ********************************** 152684
[M::ha_hist_line]     7: *********************************** 156256
[M::ha_hist_line]     8: ************************************** 170569
[M::ha_hist_line]     9: ****************************************** 191057
[M::ha_hist_line]    10: ************************************************ 215175
[M::ha_hist_line]    11: ******************************************************* 247482
[M::ha_hist_line]    12: *************************************************************** 283626
[M::ha_hist_line]    13: *********************************************************************** 319430
[M::ha_hist_line]    14: ****************************************************************************** 354206
[M::ha_hist_line]    15: ************************************************************************************** 388878
[M::ha_hist_line]    16: ******************************************************************************************** 417222
[M::ha_hist_line]    17: ************************************************************************************************** 441975
[M::ha_hist_line]    18: **************************************************************************************************** 450105
[M::ha_hist_line]    19: **************************************************************************************************** 451630
[M::ha_hist_line]    20: *************************************************************************************************** 445987
[M::ha_hist_line]    21: ********************************************************************************************** 426016
[M::ha_hist_line]    22: **************************************************************************************** 397584
[M::ha_hist_line]    23: ******************************************************************************** 360297
[M::ha_hist_line]    24: ********************************************************************** 315880
[M::ha_hist_line]    25: ************************************************************* 276191
[M::ha_hist_line]    26: **************************************************** 234698
[M::ha_hist_line]    27: ******************************************** 196798
[M::ha_hist_line]    28: ************************************ 163054
[M::ha_hist_line]    29: ***************************** 132947
[M::ha_hist_line]    30: ************************* 111858
[M::ha_hist_line]    31: ********************* 93468
[M::ha_hist_line]    32: ****************** 81804
[M::ha_hist_line]    33: **************** 73618
[M::ha_hist_line]    34: *************** 68704
[M::ha_hist_line]    35: ************** 64746
[M::ha_hist_line]    36: ************** 63374
[M::ha_hist_line]    37: ************** 62412
[M::ha_hist_line]    38: ************** 62932
[M::ha_hist_line]    39: ************** 62554
[M::ha_hist_line]    40: ************** 61114
[M::ha_hist_line]    41: ************* 58898
[M::ha_hist_line]    42: ************* 57009
[M::ha_hist_line]    43: ************ 54254
[M::ha_hist_line]    44: *********** 51216
[M::ha_hist_line]    45: *********** 48483
[M::ha_hist_line]    46: ********** 45364
[M::ha_hist_line]    47: ********* 41100
[M::ha_hist_line]    48: ******** 37727
[M::ha_hist_line]    49: ******** 34807
[M::ha_hist_line]    50: ******* 31219
[M::ha_hist_line]    51: ****** 27505
[M::ha_hist_line]    52: ***** 24763
[M::ha_hist_line]    53: ***** 22128
[M::ha_hist_line]    54: **** 20128
[M::ha_hist_line]    55: **** 17958
[M::ha_hist_line]    56: **** 15938
[M::ha_hist_line]    57: *** 14744
[M::ha_hist_line]    58: *** 13672
[M::ha_hist_line]    59: *** 12817
[M::ha_hist_line]    60: *** 12142
[M::ha_hist_line]    61: *** 11357
[M::ha_hist_line]    62: ** 10574
[M::ha_hist_line]    63: ** 10200
[M::ha_hist_line]    64: ** 9739
[M::ha_hist_line]    65: ** 9206
[M::ha_hist_line]    66: ** 8849

However, hifiasm gets the position of the homozygous peak incorrect, and calls it at 19, while it should be around 40. The results of running this assembly with --high-het was that the primary assembly was twice as big as it should have been (400Mb instead of 200Mb). So, both haplotypes are ending up in the same primary assembly.

Main genome scaffold total: 753
Main genome contig total:   753
Main genome scaffold sequence total: 422.8 MB
Main genome contig sequence total:   422.8 MB (->  0.0% gap)
Main genome scaffold N/L50: 95/1.4 MB
Main genome contig N/L50:   95/1.4 MB
Number of scaffolds > 50 KB: 558
% main genome in scaffolds > 50 KB: 98.5%

 Minimum    Number    Number     Total        Total     Scaffold
Scaffold      of        of      Scaffold      Contig     Contig
 Length   Scaffolds  Contigs     Length       Length    Coverage
--------  ---------  -------  -----------  -----------  --------
    All       753        753  422,765,292  422,765,292   100.00%
   1 kb       753        753  422,765,292  422,765,292   100.00%
 2.5 kb       753        753  422,765,292  422,765,292   100.00%
   5 kb       753        753  422,765,292  422,765,292   100.00%
  10 kb       753        753  422,765,292  422,765,292   100.00%
  25 kb       727        727  422,203,962  422,203,962   100.00%
  50 kb       558        558  416,220,132  416,220,132   100.00%
 100 kb       458        458  409,207,212  409,207,212   100.00%
 250 kb       361        361  393,262,727  393,262,727   100.00%
 500 kb       274        274  361,301,464  361,301,464   100.00%
   1 mb       140        140  263,063,929  263,063,929   100.00%
 2.5 mb        25         25   83,986,696   83,986,696   100.00%
   5 mb         2          2   10,227,160   10,227,160   100.00%

The secondary assembly is too small:

Main genome scaffold total: 1011
Main genome contig total:   1011
Main genome scaffold sequence total: 25.6 MB
Main genome contig sequence total:   25.6 MB (->  0.0% gap)
Main genome scaffold N/L50: 270/26.2 KB
Main genome contig N/L50:   270/26.2 KB
Number of scaffolds > 50 KB: 49
% main genome in scaffolds > 50 KB: 21.7%

 Minimum    Number    Number     Total        Total     Scaffold
Scaffold      of        of      Scaffold      Contig     Contig
 Length   Scaffolds  Contigs     Length       Length    Coverage
--------  ---------  -------  -----------  -----------  --------
    All     1,011      1,011   25,632,451   25,632,451   100.00%
   1 kb     1,011      1,011   25,632,451   25,632,451   100.00%
 2.5 kb     1,011      1,011   25,632,451   25,632,451   100.00%
   5 kb     1,011      1,011   25,632,451   25,632,451   100.00%
  10 kb     1,008      1,008   25,607,535   25,607,535   100.00%
  25 kb       298        298   13,556,140   13,556,140   100.00%
  50 kb        49         49    5,556,899    5,556,899   100.00%
 100 kb        16         16    3,360,515    3,360,515   100.00%
 250 kb         4          4    1,596,952    1,596,952   100.00%
 500 kb         1          1      798,418      798,418   100.00%
   1 mb         0          0            0            0     0.00%

Is there anything I can do to specify where hifiasm should expect the het and homozygous peaks? I think this would be helpful to others working with highly heterozygous species. Thanks so much!

duo haplotyping

We would like to assemble and haplotype a tree genome for which we only have the maternal genome available. The father could be any surrounding tree.
How could we generate the yak paternal file using genome and maternal kmer contents?

Misjoins in the assembly

Hi,

I have used hifiasm to try assemble a few mammals and fish and have run into problems with mis-joins in the assembly. These issues are not detected until we map hic data and attempt to scaffold the contigs and often results in us having to make manual cuts in the contigs. Below is an example hic map of 3 contigs scaffolded into one chromosome. I have highlighted the beginning and end of the middle contig with blue arrows, which appears to have an issue at the start.

It is also quite easy to see which scaffold this contig belongs to by looking at the hic interactions with another scaffold:

I have highlighted the contig of interest here in the p_ctg gfa, which is around 16Mb in length and this foreign region makes up the last ~600kb.

For completenesss, here is the p_utg gfa, however I am not sure how to trace back to our contig of interest as the graph is very difficult to interpret.

I mapped our ccs reads back to the assembly and found our break point, which shows a drop in coverage and a potential separation of the reads into the two haplotypes, although this may also be an issue with the underlying sequence in this region and the mapping.

I ran hifiasm v0.13 without any additional arguments and perfomed mapping of the CCS using pbmm2 with the CCS preset. We had around 36X coverage of this genome.

It would be great to hear if you have detected any similar issues and if we can tune the parameters to avoid such issues?

Many thanks,

Tom

Alternate assembly seqences in Primary assembly seqence position

Hello,
For the hifisam software to assembly the genome, we can get the *.a_ctg.gfa and *.p_utg.noseq.gfa file, but I want to know the
position of the *.a_ctg.gfa sequences in *.p_utg.noseq.gfa seqences, how to get it?

Best wishes

Usage question: trio options -3/-4

Hi,
I am using hifiasm v0.11 in trio mode and supply lists of hap1/hap2 read names via the -3 and -4 command line options. Do I understand correctly that the read lists should only contain names of haplotype-specific reads? In other words, those reads that are contained in the input FASTQ but that are not part of these two read name files are treated by hifiasm as not haplotype-specific and will be used in the assembly process for both haplotypes. Thanks for confirming or clarifying how to properly use the -3 / -4 options.
Best,
Peter

adding extra data to hifiasm

Hi,

Thanks for the great software - it's fantastic to have an assembly tool which we can run in a modest amount of time.

I have a question (rather than an issue).

We are working on a diploid plant species (genome size 950Mbp) which is relatively heterozygous (1.5% at the base level).
We want to see just how far we can push hifi data to generate an assembly without any additional technology.

I currently have around 89Gbp of data (~88x coverage) data with Hifi data.
This gives us a pretty good assembly N50 - 37,230,366
To estimate if adding data would improve our assembly further, I downsampled the data which gave us the following N50 values

90% - 32,307,688
80% - 27,353,633
70% - 21,585,766

This would appear to indicate that adding more data would improve the N50.
I do also see an increase in the number of total contigs (which is not unexpected) as the coverage increases.

Unfortunately I do not have a good reference for this species so I am relying on comparing it to a closely related species.
I can see in some cases that a single contig spans a complete chromosome (on the closely related species).

You mentioned in a previous post that hifiasm works best at ~40x coverage and if coverage is too high - 'high coverage may introduce weird things'. Do you think that hifiasm could actually continue to improve our assembly if we were to add more data, or do you think it may start to do some of these 'weird things'.
Or would it be a better strategy to do more stringent filtering on the input data to maintain the same coverage. We currently are using the default Q20 cutoff.

error when generating fasta with gfa2fa

Any idea what this error could be about? Or if it would impact the output fasta?

Thanks,
KF

gfatools gfa2fa asm.p_ctg.gfa > p.contigs.fasta

[W::gfa_fix_utg] one or multiple A-lines on 'utg000549l' are incorrect
[W::gfa_fix_utg] one or multiple A-lines on 'utg001025l' are incorrect
[W::gfa_fix_utg] one or multiple A-lines on 'utg001466l' are incorrect
[W::gfa_fix_utg] one or multiple A-lines on 'utg003357l' are incorrect

How to choose k-mer size for trio-binning assembly

Hi,
I used the HiFiasm trio mode to phased my genome into maternal and paternal assembly. The first step is using yak to generate trio index. When I used yak count -b37 -t16 -k21 and yak count -b37 -t16 -k19 the genome size was different. -k21 will generate larger genome size than -k19 with the same parameters using hifiasm to assemble. So It's there any formula like Canu-trio can suggest the best k-mer size for user?
Thanks!

Best way to cite the assembler?

I'm in the process of contributing to a paper which used HiFiASM, do you have a preprint or something with a DOI that I can use to cite the program?

Is racon necessary for the hifiasm downstream？

Hi, @chhylp123

Since the hifiasm have 3 rounds correct for HiFi reads，the quality of default assembly from hifiasm is high enough. Do I need to mapping the HiFi fastq to the assembly to do polishing using racon?

Thanks.

Zhigui Bao

filtering based on coverage?

Is there an easy way to extract the coverage for each unitig/contig? I have a 5 Gb plant genome that ballooned to a 21 Gb assembly from soil contamination. Hoping I don't have to align the reads to the assembly to get the coverage and find the ~30x contigs. I am assuming the gfa files only list the reads from the tiling path. Is that right? I am assuming that -n only applies to tiling path reads??

Thanks,
KF

bad reads in tiling path

It appears that much of the error in hifiasm assemblies is associated with unsupported reads integrated into the tiling path. They are found by aligning the original HiFi reads or corrected reads to the assembly and looking for clusters of SNPs/INDELs. How could this be addressed within hifiasm? How does this read survive correction?

Thanks,
KF

generate the corrected fastq reads

Hello,
I wonder if there is a way to generate the corrected Hifi fastq reads into .fq format. Right now they are in binary format, right?
thanks,
Zhenzhen

How can I get reads which used for trio assembly from HiFiasm

Hi,
Recently, I have used Hifiasm to generate haplotype-resolved assemblies with trio binning. But I want to know which the reads name which used for assembly two haplotype-resolved genome separately. So I can get these two reads set from raw pacbio hifi reads then feed to Hicanu to compare the resultes.

Can you give me any suggestion or parameter?

Metagenomics assembly with Hifiasm?

Hi,

I trying to assemble metagenomics hifi reads. Is Hifiasm suited for metagenomics assembly? and if so do you have any recommended settings for that purpose?

Best,
Jean

memory needed

Dear hifiasm team
I'm wondering how much memory the package needs for a read dataset of 30x coverage from a single human chromosome?

when align the CCS reads to hifiasm result by using minimap2, Approximate per-base sequence divergence is very high

Hi,
Recently I use hifiasm (0.8-r279) to assemble Redwood（data from ./ https://sra-download.ncbi.nlm.nih.gov/traces/sra34/SRR/010948/）with asm options -t 80. The result is 35G, and N50 5.4M.
However,when I map the CCS reads to the assemble genome by using minimap2( minimap2-2.17 (r941) , --secondary=no -x asm20), I find that almost evey ccs reads can align to several contigs，and dv are in the range of 0.1~0.2，similar with ONT alignment.
I think this phenomenon is abnormal，and actural I also use minimap2 to map other ccs reads to there corresponding contigs(also assembled with hifiasm), the dv value are all about 0.005. So I want to know is there really something wrong happend? and can you give me some advice to avoid these phenomena ? Thank you!

premature kmer cutoff

I have a couple of datasets like the below where it looks like there is an initial, possibly erroneous peak at kmer frequency 3, which then leads to hifiasm dropping most of the kmers for the assembly. I've tried the -D10, but no real change in the result.

[M::ha_analyze_count] lowest: count[2] = 6069143
[M::ha_analyze_count] highest: count[3] = 7235226
[M::ha_hist_line]     2: ************************************************************************************ 6069143
[M::ha_hist_line]     3: **************************************************************************************************** 7235226
[M::ha_hist_line]     4: ********************************************************************************************** 6793566
[M::ha_hist_line]     5: ******************************************************************************** 5819678
[M::ha_hist_line]     6: **************************************************************** 4658271
[M::ha_hist_line]     7: ************************************************ 3500309
[M::ha_hist_line]     8: ************************************ 2612406
[M::ha_hist_line]     9: **************************** 2006019
[M::ha_hist_line]    10: ************************ 1740065
[M::ha_hist_line]    11: ************************ 1751566
[M::ha_hist_line]    12: ************************** 1881720
[M::ha_hist_line]    13: **************************** 2059802
[M::ha_hist_line]    14: ********************************* 2414617
[M::ha_hist_line]    15: *************************************** 2817786
[M::ha_hist_line]    16: ********************************************* 3247131
[M::ha_hist_line]    17: *************************************************** 3675817
[M::ha_hist_line]    18: ******************************************************* 4007673
[M::ha_hist_line]    19: *********************************************************** 4263849
[M::ha_hist_line]    20: ************************************************************* 4401881
[M::ha_hist_line]    21: ************************************************************* 4444681
[M::ha_hist_line]    22: ************************************************************* 4392425
[M::ha_hist_line]    23: ********************************************************** 4207993
[M::ha_hist_line]    24: ******************************************************* 3960917
[M::ha_hist_line]    25: ************************************************** 3600819
[M::ha_hist_line]    26: ********************************************* 3219978
[M::ha_hist_line]    27: *************************************** 2835019
[M::ha_hist_line]    28: *********************************** 2558213
[M::ha_hist_line]    29: ******************************* 2256288
[M::ha_hist_line]    30: **************************** 2060429
[M::ha_hist_line]    31: ************************** 1901848
[M::ha_hist_line]    32: ************************** 1851258
[M::ha_hist_line]    33: ************************** 1850593
[M::ha_hist_line]    34: ************************** 1892168
[M::ha_hist_line]    35: **************************** 1993563
[M::ha_hist_line]    36: ***************************** 2109533
[M::ha_hist_line]    37: ****************************** 2201966
[M::ha_hist_line]    38: ******************************** 2342791
[M::ha_hist_line]    39: ********************************** 2440881
[M::ha_hist_line]    40: *********************************** 2547906
[M::ha_hist_line]    41: ************************************ 2628198
[M::ha_hist_line]    42: ************************************* 2673573
[M::ha_hist_line]    43: ************************************** 2717951
[M::ha_hist_line]    44: ************************************* 2669472
[M::ha_hist_line]    45: ************************************ 2597238
[M::ha_hist_line]    46: *********************************** 2496652
[M::ha_hist_line]    47: ********************************* 2375755
[M::ha_hist_line]    48: ****************************** 2193939
[M::ha_hist_line]    49: **************************** 2037768
[M::ha_hist_line]    50: ************************** 1886916
[M::ha_hist_line]    51: ************************ 1701060
[M::ha_hist_line]    52: ********************* 1501367
[M::ha_hist_line]    53: ****************** 1323015
[M::ha_hist_line]    54: **************** 1158475
[M::ha_hist_line]    55: ************** 985960
[M::ha_hist_line]    56: ************ 848042
[M::ha_hist_line]    57: ********** 720656
[M::ha_hist_line]    58: ******** 613567
[M::ha_hist_line]    59: ******* 522236
[M::ha_hist_line]    60: ****** 421345
[M::ha_hist_line]    61: ***** 345199
[M::ha_hist_line]    62: **** 282627
[M::ha_hist_line]    63: *** 234878
[M::ha_hist_line]    64: *** 196966
[M::ha_hist_line]    65: ** 167464
[M::ha_hist_line]    66: ** 140883
[M::ha_hist_line]    67: ** 124071
[M::ha_hist_line]    68: ** 109375
[M::ha_hist_line]    69: * 93330
[M::ha_hist_line]    70: * 83896
[M::ha_hist_line]    71: * 78482
[M::ha_hist_line]    72: * 70598
[M::ha_hist_line]    73: * 65065
[M::ha_hist_line]    74: * 60794
[M::ha_hist_line]    75: * 58992
[M::ha_hist_line]    76: * 57068
[M::ha_hist_line]    77: * 56897
[M::ha_hist_line]    78: * 54293
[M::ha_hist_line]    79: * 54742
[M::ha_hist_line]    80: * 53652
[M::ha_hist_line]    81: * 52653
[M::ha_hist_line]    82: * 53446
[M::ha_hist_line]    83: * 52509
[M::ha_hist_line]    84: * 53028
[M::ha_hist_line]    85: * 52292
[M::ha_hist_line]    86: * 53718
[M::ha_hist_line]    87: * 53726
[M::ha_hist_line]    88: * 53380
[M::ha_hist_line]    89: * 54470
[M::ha_hist_line]    90: * 55626
[M::ha_hist_line]    91: * 54319
[M::ha_hist_line]    92: * 54240
[M::ha_hist_line]    93: * 56616
[M::ha_hist_line]    94: * 56854
[M::ha_hist_line]    95: * 55967
[M::ha_hist_line]    96: * 57542
[M::ha_hist_line]    97: * 58223
[M::ha_hist_line]    98: * 57932
[M::ha_hist_line]    99: * 59190
[M::ha_hist_line]   100: * 60010
[M::ha_hist_line]   101: * 58622
[M::ha_hist_line]   102: * 58082
[M::ha_hist_line]   103: * 58887
[M::ha_hist_line]   104: * 57855
[M::ha_hist_line]   105: * 53057
[M::ha_hist_line]   106: * 52710
[M::ha_hist_line]   107: * 53964
[M::ha_hist_line]   108: * 54917
[M::ha_hist_line]   109: * 54081
[M::ha_hist_line]   110: * 53832
[M::ha_hist_line]   111: * 51629
[M::ha_hist_line]   112: * 47683
[M::ha_hist_line]   113: * 45744
[M::ha_hist_line]   114: * 40777
[M::ha_hist_line]   115: * 39379
[M::ha_hist_line]   116: * 37060
[M::ha_hist_line]   117: * 37442
[M::ha_hist_line]  rest: **************************** 2031545
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_ft_gen] peak_hom: 3; peak_het: -1
[M::ha_ft_gen::527.803*[email protected]] ==> filtered out 117538870 k-mers occurring 15 or more times
[M::ha_opt_update_cov] updated max_n_chain to 100
[M::ha_pt_gen::649.205*6.83] ==> counted 14181699 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[9] = 131849
[M::ha_analyze_count] highest: count[14] = 260899
[M::ha_hist_line]     1: ****************************************************************************************************> 10770336
[M::ha_hist_line]     2: ****************************************************************************************************> 711398
[M::ha_hist_line]     3: ****************************************************************************************************> 477310
[M::ha_hist_line]     4: ****************************************************************************************************> 357784
[M::ha_hist_line]     5: ****************************************************************************************************> 279986
[M::ha_hist_line]     6: *************************************************************************************** 226170
[M::ha_hist_line]     7: ******************************************************************** 178593
[M::ha_hist_line]     8: ********************************************************* 148284
[M::ha_hist_line]     9: *************************************************** 131849
[M::ha_hist_line]    10: *************************************************** 132070
[M::ha_hist_line]    11: ******************************************************** 146009
[M::ha_hist_line]    12: *************************************************************** 165522
[M::ha_hist_line]    13: *************************************************************************** 195489
[M::ha_hist_line]    14: **************************************************************************************************** 260899
[M::ha_hist_line]  rest:  0
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 14; peak_het: -1
[M::ha_pt_gen::752.412*8.78] ==> indexed 21772882 positions
[M::ha_assemble::949.210*[email protected]] ==> corrected reads for round 1
[M::ha_assemble] # bases: 8221544631; # corrected bases: 6132722; # recorrected bases: 18957
[M::ha_assemble] size of buffer: 0.512GB
[M::ha_pt_gen::1052.394*13.53] ==> counted 11170165 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[10] = 121171
[M::ha_analyze_count] highest: count[14] = 221771
[M::ha_hist_line]     1: ****************************************************************************************************> 7972140
[M::ha_hist_line]     2: ****************************************************************************************************> 562090
[M::ha_hist_line]     3: ****************************************************************************************************> 418851
[M::ha_hist_line]     4: ****************************************************************************************************> 345446
[M::ha_hist_line]     5: ****************************************************************************************************> 274879
[M::ha_hist_line]     6: ****************************************************************************************************> 223168
[M::ha_hist_line]     7: ******************************************************************************* 175033
[M::ha_hist_line]     8: ***************************************************************** 144146
[M::ha_hist_line]     9: ********************************************************* 126427
[M::ha_hist_line]    10: ******************************************************* 121171
[M::ha_hist_line]    11: ************************************************************ 133867
[M::ha_hist_line]    12: ******************************************************************** 150913
[M::ha_hist_line]    13: ***************************************************************************** 171504
[M::ha_hist_line]    14: **************************************************************************************************** 221771
[M::ha_hist_line]    15: ******************************************** 97971
[M::ha_hist_line]    16: *********** 24641
[M::ha_hist_line]    17: ** 4875
[M::ha_hist_line]  rest: * 1272
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 14; peak_het: -1
[M::ha_pt_gen::1155.817*14.20] ==> indexed 21791689 positions
[M::ha_assemble::1324.871*[email protected]] ==> corrected reads for round 2
[M::ha_assemble] # bases: 8219204883; # corrected bases: 755893; # recorrected bases: 27755
[M::ha_assemble] size of buffer: 0.492GB
[M::ha_pt_gen::1425.545*16.32] ==> counted 10883111 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[10] = 120900
[M::ha_analyze_count] highest: count[14] = 219799
[M::ha_hist_line]     1: ****************************************************************************************************> 7720292
[M::ha_hist_line]     2: ****************************************************************************************************> 538136
[M::ha_hist_line]     3: ****************************************************************************************************> 408123
[M::ha_hist_line]     4: ****************************************************************************************************> 341917
[M::ha_hist_line]     5: ****************************************************************************************************> 275581
[M::ha_hist_line]     6: ****************************************************************************************************> 223669
[M::ha_hist_line]     7: ******************************************************************************* 174658
[M::ha_hist_line]     8: ****************************************************************** 144019
[M::ha_hist_line]     9: ********************************************************* 125414
[M::ha_hist_line]    10: ******************************************************* 120900
[M::ha_hist_line]    11: ************************************************************ 132795
[M::ha_hist_line]    12: ******************************************************************** 149423
[M::ha_hist_line]    13: ***************************************************************************** 169641
[M::ha_hist_line]    14: **************************************************************************************************** 219799
[M::ha_hist_line]    15: *********************************************** 102332
[M::ha_hist_line]    16: ************* 27982
[M::ha_hist_line]    17: *** 6462
[M::ha_hist_line]    18: * 1426
[M::ha_hist_line]  rest:  542
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 14; peak_het: -1
[M::ha_pt_gen::1528.858*16.65] ==> indexed 21765787 positions
[M::ha_assemble::1695.380*[email protected]] ==> corrected reads for round 3
[M::ha_assemble] # bases: 8219055948; # corrected bases: 258593; # recorrected bases: 27049
[M::ha_assemble] size of buffer: 0.480GB
[M::ha_pt_gen::1798.606*17.92] ==> counted 10811427 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[10] = 120732
[M::ha_analyze_count] highest: count[14] = 219552
[M::ha_hist_line]     1: ****************************************************************************************************> 7658951
[M::ha_hist_line]     2: ****************************************************************************************************> 531565
[M::ha_hist_line]     3: ****************************************************************************************************> 404901
[M::ha_hist_line]     4: ****************************************************************************************************> 340623
[M::ha_hist_line]     5: ****************************************************************************************************> 275702
[M::ha_hist_line]     6: ****************************************************************************************************> 223799
[M::ha_hist_line]     7: ******************************************************************************* 174452
[M::ha_hist_line]     8: ****************************************************************** 144051
[M::ha_hist_line]     9: ********************************************************* 125201
[M::ha_hist_line]    10: ******************************************************* 120732
[M::ha_hist_line]    11: ************************************************************ 132550
[M::ha_hist_line]    12: ******************************************************************** 149277
[M::ha_hist_line]    13: ***************************************************************************** 168969
[M::ha_hist_line]    14: **************************************************************************************************** 219552
[M::ha_hist_line]    15: *********************************************** 103278
[M::ha_hist_line]    16: ************* 28638
[M::ha_hist_line]    17: *** 6901
[M::ha_hist_line]    18: * 1597
[M::ha_hist_line]  rest:  688
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 14; peak_het: -1
[M::ha_pt_gen::1905.213*18.06] ==> indexed 21755852 positions
[M::ha_assemble::1995.330*[email protected]] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 3361840
[M::ha_print_ovlp_stat] # strong overlaps: 224248
[M::ha_print_ovlp_stat] # weak overlaps: 3137592
[M::ha_print_ovlp_stat] # exact overlaps: 2502940
[M::ha_print_ovlp_stat] # inexact overlaps: 858900
[M::ha_print_ovlp_stat] # overlaps without large indels: 3346936
[M::ha_print_ovlp_stat] # reverse overlaps: 339323
Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] purge duplication coverage threshold: 12
[M::purge_dups] purge duplication coverage threshold: 12
[M::adjust_utg_by_primary] primary contig coverage range: [7, infinity]
Writing primary contig GFA to disk...
Writing alternate contig GFA to disk...
[M::main] Version: 0.12-r304
[M::main] CMD: /software/vertres/bin-external/hifiasm-0.12/hifiasm -t28 -o wrPisGeom1 /lustre/scratch116/tol/projects/darwin/data/annelids/Piscicola_geometra/working/wrPisGeom1.hicanu.20201015/wrPisGeom1.trimmedReads.fasta.gz
[M::main] Real time: 2020.912 sec; CPU: 36935.352 sec; Peak RSS: 18.649 GB

High duplication from BUSCO using x15 cov

Hi,

Very nice tool as easy to run and quick which can be a major headache sometimes. I posted below and just realised I put both smrt cells fasta and fastq in so duplicated the data. I'll try that again and hopefully this helps.

Results look significantly better than HiCanu. I have 10,000 contigs. My problem is the longest contig I have is 3.5Mbp and the BUSCO has high duplication which corresponds to the genome size being 80% too large too. I think I only have x15 coverage, I have requested more data and might be able to push it to 20x but will have to wait. Is there a parameter that I could tweak to get this duplication down rather than use additional tool to do that. It is early days, and I need to play around more and read more but in case I am missing anything obvious.
C:99.9%[S:19.7%,D:80.2%],F:0.1%,M:-0.0%,n:1658

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      24233 sequences, total length 2033386837 bp (including 439 repeats of total length 7159438 bp).
--   bubbles:      6428 sequences, total length 192094085 bp.
--   unassembled:  94184 sequences, total length 1000046846 bp.
--
-- Contig sizes based on genome size 1.3gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10      968422           104   130308813
--     20      660064           270   260179731
--     30      498917           500   390143923
--     40      405200           791   520051788
--     50      328828          1148   650274462
--     60      275039          1581   780008382
--     70      228439          2100   910187095
--     80      188891          2723  1040077841
--     90      154826          3482  1170033072
--    100      126343          4413  1300104055
--    110      100700          5565  1430064823
--    120       75672          7052  1560023035
--    130       53416          9090  1690051034
--    140       31413         12252  1820020140
--    150       17660         17857  1950017255
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      24233 sequences, total length 3030398075 bp (including 439 repeats of total length 10703713 bp).
--   bubbles:      6428 sequences, total length 286330754 bp.
--   unassembled:  94184 sequences, total length 1482274716 bp.
--
-- Contig sizes based on genome size 1.3gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1636484            63   131210376
--     20     1228508           155   260856458
--     30      973050           274   390584764
--     40      800740           422   520250313
--     50      677931           599   650558423
--     60      598845           802   780360444
--     70      520751          1034   910162938
--     80      454585          1302  1040427748
--     90      406316          1604  1170167952
--    100      359785          1944  1300091512
--    110      317659          2330  1430305495
--    120      277794          2766  1560173378
--    130      244161          3264  1690067561
--    140      213205          3835  1820172826
--    150      185735          4490  1950080750
--    160      158965          5246  2080102254
--    170      133649          6135  2210046641
--    180      109905          7205  2340006684
--    190       87483          8527  2470045576
--    200       65031         10236  2600029303
--    210       44371         12664  2730012789
--    220       30249         16250  2860018091
--    230       18978         21591  2990006743
--

Difference between "prefix.hap1.p_ctg.gfa" and "prefix.hap1.a_ctg.gfa" ?

Hi,
I want to know the difference between "prefix.hap1.p_ctg.gfa" and "prefix.hap1.a_ctg.gfa", is there any difference between them? If I do not need to distinguish haplotypes, can I consider "prefix.hap1.p_ctg.gfa" as the final result after assembly? Thanks a lot！

hifiasm with -l0 still apparently merging different segmental duplications

Hi, thanks a lot for the great assembler.

I am trying to assemble a homozygous plant genome with 1.6gb with 40X CCS reads. Interestingly, this plant seems to have accumulated large segmental duplications that share a high similarity. In some cases over 95%. We have tried hifiasm with and without the purging option (-l0). The assembly looks much better without the purging step (N50 20Mb) in comparison with the purging step (N50 12 Mb). However, it seems that even without purging some regions after mapping seems to contain collapsed segmental duplications as checked by mapping the reads. Some regions seem to have double coverage.

Any parameter suggestion to improve the assembly?

Thanks!
André

Low contig N50 for a fish genome with 40x HiFi reads

Hi, @chhylp123

hifiasm works great on all my plant genome assembly, it produce more contiguous and complete genome in less resource usage.

But I recently assembly a fish genome with hifiasm, it have low contig N50.

Here is the all the information I have. Do you have any suggestions for tuning?

genomscope estimation from the survey data: 789M, 0.517% het, repeat 49.3%
HiFi data: 33.6G HiFi data from two SMRT cell, ccs N50 20,226, Q30 93.96
Assembly stat：

The kmer histo from the HiFi seems different from the genomescope.

Stat	hifiasm(0.11)	wtdbg2(2.4)	flye(2.8)
genomesize	919821446	870750720	1.15E+09
contigNumber	2313	4435	16179
contigN50	1152427	823296	324539
contigL50	214	280	831

hifiasm -o fish -t 28 fish.hifi.fastq.gz

BUSCO stat on fish.p_ctg.fa: C:93.3%[S:88.6%,D:4.7%],F:2.0%,M:4.7%,n:4584

I have tried blast 50000 seqs from the HiFi reads, it mostly mapped to the fish sequences in NR database.

Typically, fish genome have more simple repeat (~8%) than plant(<2%).

Here is the log of hifiasm
fish.hifiasm.log

Here is the raw unitig graph (fish.p_utg.noseq.gfa)
fish.p_utg.noseq.gfa.zip

Thanks,

Zhigui Bao

chhylp123 / hifiasm Goto Github PK

hifiasm's Introduction

Getting Started

Table of Contents

Introduction

Why Hifiasm?

Usage

Assembling HiFi reads without additional data types

Hi-C integration

Trio binning

Ultra-long ONT integration

Output files

Results

Getting Help

Limitations

Citating Hifiasm

hifiasm's People

Contributors

Stargazers

Watchers

Forkers

hifiasm's Issues

SAMPLE OF OUTPUT

Recommend Projects

Recommend Topics

Recommend Org