maggi-chen / inspector

A tool for evaluating long-read de novo assembly results

License: MIT License

Python 98.93% Dockerfile 1.07%

inspector's Issues

Typo in inspector-correct.py

Hello, using inspector-correct.py with --datatype nano-corr leads to an error that data type is invalid, due to a space in front of nano-corr in this line:

if inscor_args.datatype not in ['pacbio-raw','pacbio-hifi', 'pacbio-corr', 'nano-raw',' nano-corr']:
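
The check fails only because the last entry is written as ' nano-corr', with a leading space, so the exact string 'nano-corr' never matches. A minimal sketch of a more forgiving check follows; `VALID_DATATYPES` and `validate_datatype` are hypothetical names for illustration and are not taken from inspector-correct.py.

```python
# Hypothetical helper for illustration; not the actual inspector-correct.py code.
VALID_DATATYPES = ("pacbio-raw", "pacbio-hifi", "pacbio-corr", "nano-raw", "nano-corr")


def validate_datatype(value):
    """Return the cleaned data type, or raise ValueError if it is not supported."""
    cleaned = value.strip()  # tolerate accidental whitespace around the value
    if cleaned not in VALID_DATATYPES:
        raise ValueError(
            "Invalid data type %r; choose from: %s" % (value, ", ".join(VALID_DATATYPES))
        )
    return cleaned
```

Keeping the allowed values in a single constant makes a stray space easier to spot, and stripping the input also tolerates accidental whitespace on the command line.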

Hi, I am having some trouble with inspector-correct.
Since the last update of the scripts, 4 days ago, it successfully produces the contig_corrected.fasta file. But looking at the log of the correction process, every time a structural error needs to be re-assembled with flye, an error like this occurs:

Base error correction for  ctg001250  finished. Time cost:  0.00589680671692
usage: flye (--pacbio-raw | --pacbio-corr | --nano-raw |
             --nano-corr | --subassemblies) file1 [file_2 ...]
             --genome-size size --out-dir dir_path [--threads int]
             [--iterations int] [--min-overlap int] [--resume]
             [--debug] [--version] [--help]
usage: flye (--pacbio-raw | --pacbio-corr | --nano-raw |
             --nano-corr | --subassemblies) file1 [file_2 ...]
             --genome-size size --out-dir dir_path [--threads int]
             [--iterations int] [--min-overlap int] [--resume]
             [--debug] [--version] [--help]
flye: error: argument -g/--genome-size is required
flye: error: argument -g/--genome-size is required
usage: flye (--pacbio-raw | --pacbio-corr | --nano-raw |
             --nano-corr | --subassemblies) file1 [file_2 ...]
             --genome-size size --out-dir dir_path [--threads int]
             [--iterations int] [--min-overlap int] [--resume]
             [--debug] [--version] [--help]
flye: error: argument -g/--genome-size is required
FLYETIME for  ctg001250__925207__925554__347__exp 0.044429063797
FLYETIME for  ctg001250__931956__933136__1180__exp 0.0446717739105
FLYETIME for  ctg001250__813565__813971__406__exp 0.0445201396942
Inspector Assembly Fail  ctg001250__925207__925554__347__exp
Inspector Assembly Fail  ctg001250__931956__933136__1180__exp
Inspector Assembly Fail  ctg001250__813565__813971__406__exp

It is hard to know if the end result will have that contig corrected, or if it failed in doing that.
Is there any way to avoid that "genome size required" error that flye is producing?
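
The usage message in the log comes from a Flye version that treats `--genome-size` as mandatory, so any call without it exits immediately. As far as I know, more recent Flye releases made this option optional, so upgrading Flye may be the simplest fix. Otherwise the call would need to pass a size explicitly. Below is a rough sketch of building such a command using only the flags shown in the usage message above; `build_flye_command` and its defaults are hypothetical, not Inspector's actual code.

```python
# Read-type flags copied from the flye usage message in the log above.
FLYE_READ_TYPES = ("pacbio-raw", "pacbio-corr", "nano-raw", "nano-corr", "subassemblies")


def build_flye_command(reads, out_dir, read_type="pacbio-raw", threads=4, genome_size=None):
    """Build a flye argument list; add --genome-size only when a value is given."""
    if read_type not in FLYE_READ_TYPES:
        raise ValueError("Unsupported flye read type: %r" % read_type)
    cmd = ["flye", "--" + read_type, reads, "--out-dir", out_dir, "--threads", str(threads)]
    if genome_size is not None:
        # Required by the flye version that printed the usage message above.
        cmd += ["--genome-size", str(genome_size)]
    return cmd
```

For example, `build_flye_command("reads.fa", "flye_out", genome_size="1m")` returns a list that can be passed to `subprocess.run`; the size value here is only a placeholder.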

Can Inspector be used in homological polyploidy?

Hi, thank you for developing such good software. Your article says, "we have developed Inspector to comprehensively evaluate assembly quality and identify assembly errors in haploid and diploid genomes." Can I use Inspector to find structural assembly errors in a homological polyploid genome?

cat: test_out/ae_merge_workspace/inv_merged_*: No such file or directory

I'd like to use Inspector to perform assembly correction of two haplotype-resolved HiFi assemblies of a human individual, so I first ran the test dataset to validate the installation. In the log file, I found a warning:
cat: test_out/ae_merge_workspace/inv_merged_*: No such file or directory
However, in the directory test_out/ae_merge_workspace/, there was a file named inv_merged_ctg1. So do I need to worry about this warning? Also, only 2 structural errors and 298 small-scale assembly errors were present in the summary_statistics file. Is the difference in results caused by different versions of Inspector's dependencies?

Any help would be greatly appreciated,
Jingwei Yue!

####the log file:####
import pandas.util.testing as tm
[M::mm_idx_gen::0.048*1.05] collected minimizers
[M::mm_idx_gen::0.079*3.31] sorted minimizers
[M::main::0.079*3.31] loaded/built the index for 2 target sequence(s)
[M::mm_mapopt_update::0.091*3.00] mid_occ = 70
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 2
[M::mm_idx_stat::0.100*2.83] distinct minimizers: 242594 (96.40% are singletons); average occurrences: 1.138; average spacing: 5.328; total length: 1470648
[M::worker_pipeline::14.646*7.00] mapped 6532 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -a -Q -N 1 -I 10G -t 8 test_out/valid_contig.fa testdata/read_test.fastq.gz
[M::main] Real time: 14.660 sec; CPU: 102.577 sec; Peak RSS: 1.226 GB
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
Collect info from ctg2
cat: test_out/ae_merge_workspace/inv_merged_*: No such file or directory
[mpileup] 1 samples in 1 input files
Set max per-file depth to 8000
[mpileup] 1 samples in 1 input files
Set max per-file depth to 8000
end n100

####files in the directory of test_out/ae_merge_workspace/
deletion-merged
del_merged_ctg1
insertion-merged
ins_merged_ctg1
inversion-merged

#######Dependencies for Inspector:
python 3.7
minimap2 2.2
samtools 1.6
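
The warning is what `cat` prints when a shell glob matches no files. The directory listing posted above shows del_merged_ctg1 and ins_merged_ctg1 but no inv_merged_* file, so one possible reading is simply that no inversion candidates were written for the test data. A minimal sketch of doing the same merge in Python, where an empty match is not an error; `concat_files` is a hypothetical helper, not part of Inspector:

```python
import glob
import shutil


def concat_files(pattern, dest):
    """Concatenate every file matching pattern into dest; return the number merged."""
    paths = sorted(glob.glob(pattern))
    with open(dest, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    # With zero matches, dest is created empty and 0 is returned.
    return len(paths)
```

Because the file names never pass through a shell command line, this approach is also not subject to the shell's argument-length limit when there are very many contigs.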

Inspector error

Hi Maggi,

I have been using Inspector to check a genome assembled with Flye. However, it keeps giving me the same error, even after I changed some default parameters. Here is the error. How should it be fixed? Thanks.

/home/xxxxx/Inspector/inspector.py -c genome.fasta -r genome.fastq.gz -d nanopore -t 20
.
.
.
.
.
.
Collect info from scaffold_96561_np12_RagTag_polished
Collect info from scaffold_97192_np12_RagTag_polished
Collect info from scaffold_97806_np12_RagTag_polished
Collect info from scaffold_97933_np12_RagTag_polished
Collect info from scaffold_98921_np12_RagTag_polished
sh: 1: cat: Argument list too long
sh: 1: cat: Argument list too long
sh: 1: cat: Argument list too long
Traceback (most recent call last):
  File "/home/xxxxxxx/Inspector/inspector.py", line 131, in <module>
    cov=denovo_static.mapping_info_ctg(denovo_args.outpath,chromosomes_large,chromosomes_small,totalcontiglen,totalcontiglen_large)
  File "/home/xxxxxx/Inspector/denovo_static.py", line 138, in mapping_info_ctg
    splrate=round(10000*float(splitread)/mapped)/100.0
ZeroDivisionError: float division by zero

Best,
Mergi
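
The traceback shows the split-read rate being divided by `mapped`, so the crash means that count was zero at that point. The "Argument list too long" messages just before it suggest, though this is only a guess from the log, that the per-contig files were never merged and there was nothing to count. A small sketch of guarding that calculation; `split_read_rate` is a hypothetical helper that mirrors the expression in the traceback:

```python
def split_read_rate(split_reads, mapped_reads):
    """Split-read rate in percent (two decimals), or None when no reads are mapped."""
    if mapped_reads == 0:
        # Nothing mapped: report "not available" instead of dividing by zero.
        return None
    return round(10000 * float(split_reads) / mapped_reads) / 100.0
```

Returning `None` only hides the crash; a zero mapped-read count still means the alignment step needs checking.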

Inspector error while collecting info from contigs

Hi,

We've been using inspector without trouble for a few months now. When running on one of our genomes, we're getting the following error (previous few lines included):

Collect info from ctg005350
Collect info from ctg005510
Collect info from ctg005670
Collect info from ctg005850
Collect info from ctg006080
Collect info from ctg006310
Collect info from ctg006510
Traceback (most recent call last):
  File "/work/soghigian_lab/apps/conda/envs/ins/Inspector/inspector.py", line 149, in <module>
    debreak_cluster.genotype(cov,denovo_args.outpath)
  File "/work/soghigian_lab/apps/conda/envs/ins/Inspector/debreak_merge_clustering.py", line 314, in genotype
    leftcov=samfile.count(chrom,max(start-100,0),start,read_callback='all')
  File "pysam/calignmentfile.pyx", line 1081, in pysam.calignmentfile.AlignmentFile.count (pysam/calignmentfile.c:12699)
TypeError: count() got an unexpected keyword argument 'read_callback'

Any idea how to resolve this?
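
The `read_callback` keyword of `AlignmentFile.count()` is not available in older pysam releases, and the `calignmentfile.pyx` path in the traceback points to an older build, so upgrading pysam in the conda environment is probably the cleanest fix. If upgrading is not possible, a fallback could look like the sketch below; `count_reads` is a hypothetical wrapper, not Inspector code.

```python
def count_reads(samfile, chrom, start, stop):
    """Count reads overlapping [start, stop) on chrom, with or without read_callback."""
    try:
        # Newer pysam: 'all' skips unmapped, secondary, QC-fail and duplicate reads.
        return samfile.count(chrom, start, stop, read_callback="all")
    except TypeError:
        # Older pysam: count() does not accept read_callback, so no such
        # filtering is applied and the numbers may differ slightly.
        return samfile.count(chrom, start, stop)
```

Note that the two branches are not strictly equivalent, so coverage-based genotyping could shift a little on the fallback path.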

Structural Error increased after the inspector polishing

Hi inspector Team,
I used the inspector to perform assembly correction of two haplotype-resolved HiFi assemblies of a human individual. I can see the assembly QV is better, but one of assemblies shows more structural errors after correction. In addition, I also found the local assembly would fail in inspector correction. Is it possible for me to address these issues? Thanks!

commands:
inspector.py -c asm1.fa -r ccsreads.1.fastq ccsreads.2.fastq -o asm1_out/ --datatype hifi
inspector.py -c asm2.fa -r ccsreads.1.fastq ccsreads.2.fastq -o asm2_out/ --datatype hifi

inspector-correct.py -i asm1_out/ --datatype pacbio-hifi -o asm1_corrected/
inspector-correct.py -i asm2_out/ --datatype pacbio-hifi -o asm2_corrected/

asm2 before correction:
Structural error 49
Expansion 29
Collapse 20
Haplotype switch 0
Inversion 0

Small-scale assembly error /per Mbp 14.790265745042309
Total small-scale assembly error 44683
Base substitution 26180
Small-scale expansion 10097
Small-scale collapse 8406

QV 46.40314095738277

asm2 after correction:
Structural error 58
Expansion 32
Collapse 26
Haplotype switch 0
Inversion 0

Small-scale assembly error /per Mbp 1.3369275599703823
Total small-scale assembly error 4039
Base substitution 3227
Small-scale expansion 349
Small-scale collapse 463

QV 50.03862621625856

what happened to p-value?

Hi Maggie,
This is a very interesting tool. I am thoroughly testing it in comparison to other polishing approaches I have been using, so I should have some good feedback soon. I am particularly interested to see how it handles small N-gaps introduced by BioNano hybrid scaffolding that can easily be spanned by HiFi reads.

What happened to the p-value parameter? I see it in the documentation, but not in v1.0.2. This could be very helpful to increase the quality of polishing.

Also, v1.0.2 still shows v1.0.1 as the version.

Thanks,
Kevin

Inspector-correct not writing properly corrected structural errors into contig_corrected.fa

I am seeing a new type of error during correction.

It seems the jobs are sent to Flye properly and assembly starts, but Inspector then immediately looks for the result and reports that it doesn't exist. Meanwhile, top shows the Flye jobs still running, although Inspector claims they have already finished. So I guess those results end up in draft_assembly.fasta but not in the final contig_corrected.fa.

When we run Inspector on the Inspector-corrected assembly, the small errors are corrected very nicely, but the structural errors are not. This may be because they never get written into the final corrected FASTA.

This is how the error looks in one of these cases:

[2021-11-23 21:32:05] INFO: Extending reads
[2021-11-23 22:19:49] INFO: Overlap-based coverage: 10
[2021-11-23 22:19:49] INFO: Median overlap divergence: 0.109189
0% 90% 100%
[2021-11-23 22:35:36] INFO: Assembled 2 disjointigs
[2021-11-23 22:35:36] INFO: Generating sequence
0% 10% 20% 30% 50% 60% 70% 80% 100%
[2021-11-23 22:35:36] ERROR: Caught unhandled exception: Can't open /space/no_backup/merce/filter/final_assemblies/Inspector_MS_nextdenovo/assemble_workspace/flye_out_ctg003150__46462__46463__133__col/00-assembly/draft_assembly.fasta
[2021-11-23 22:35:36] ERROR: flye-modules(+0x3ab73) [0x562558de4b73]
[2021-11-23 22:35:36] ERROR: /space/Software/final_env/bin/../lib/libstdc++.so.6(+0xacf6f) [0x7f0ee0e07f6f]
[2021-11-23 22:35:36] ERROR: /space/Software/final_env/bin/../lib/libstdc++.so.6(+0xacfb1) [0x7f0ee0e07fb1]
[2021-11-23 22:35:36] ERROR: /space/Software/final_env/bin/../lib/libstdc++.so.6(__cxa_rethrow+0) [0x7f0ee0e0819a]
[2021-11-23 22:35:36] ERROR: flye-modules(+0xc5e4) [0x562558db65e4]
[2021-11-23 22:35:36] ERROR: flye-modules(+0x373e5) [0x562558de13e5]
[2021-11-23 22:35:36] ERROR: flye-modules(+0x179d2) [0x562558dc19d2]
[2021-11-23 22:35:36] ERROR: /lib64/libc.so.6(__libc_start_main+0xed) [0x7f0ee03b934d]
[2021-11-23 22:35:36] ERROR: flye-modules(+0x17b79) [0x562558dc1b79]
[2021-11-23 22:35:36] ERROR: Command '['flye-modules', 'assemble', '--reads', 'Inspector_MS_nextdenovo/assemble_workspace/read_ass_ctg003150__46462__46463__133__col.fa', '--out-asm', '/space/no_backup/merce/filter/final_assemblies/Inspector_MS_nextdenovo/assemble_workspace/flye_out_ctg003150__46462__46463__133__col/00-assembly/draft_assembly.fasta', '--config', '/space/Software/final_env/lib/python2.7/site-packages/flye/config/bin_cfg/asm_raw_reads.cfg', '--log', '/space/no_backup/merce/filter/final_assemblies/Inspector_MS_nextdenovo/assemble_workspace/flye_out_ctg003150__46462__46463__133__col/flye.log', '--threads', '4', '--min-ovlp', '5000']' returned non-zero exit status -6
[2021-11-23 22:35:36] ERROR: Pipeline aborted

Inspector doesn't seem to fix the small-scale errors

Hi~
I used the command "inspector.py -c ctg.fa -r ccs1.fq ccs2.fq -o inspector_out/ --datatype hifi" to evaluate the contig-level draft assembly. But I found that the output "valid_contig.fa" was exactly the same as the input "ctg.fa". I carefully compared the contig sizes and the base sequences listed in "small_scale_error.bed", and I'm sure "valid_contig.fa" contains no changes. The "summary_statistics" file did report errors, but I wonder why they were not fixed in "valid_contig.fa".

minimap2 -I option

Hi,

Because of my genome size, I need to set the minimap2 -I parameter to a larger number. Is it possible to simply provide Inspector with a BAM file rather than a FASTQ file? It would also make things much easier when working on a cluster, where I can distribute the mapping jobs.

Thanks!
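
One way to experiment with this is to run the mapping step separately with your own settings. The sketch below rebuilds the minimap2 call that appears in an Inspector log elsewhere on this page (`minimap2 -a -Q -N 1 -I 10G -t 8 ...`) with the `-I` index batch size, and optionally the `-K` mini-batch size, exposed as parameters. `build_minimap2_command` is a hypothetical helper, and whether Inspector accepts the resulting BAM is not something this sketch can confirm.

```python
def build_minimap2_command(contigs, reads, threads=8, index_batch="10G", minibatch=None):
    """Build a minimap2 argument list with configurable -I and optional -K."""
    cmd = ["minimap2", "-a", "-Q", "-N", "1", "-I", index_batch, "-t", str(threads)]
    if minibatch is not None:
        # -K sets how many query bases are loaded per mini-batch.
        cmd += ["-K", minibatch]
    cmd.append(contigs)
    cmd.extend(reads)
    return cmd
```

For example, `build_minimap2_command("contigs.fa", ["reads.fq.gz"], index_batch="50G")` raises the index batch size; the "50G" value is only an illustration and should be chosen to fit the genome and the available memory.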

Question: how many times should I run inspector?

I am currently assembling a teleost genome using HiFi reads and I decided to use Inspector to correct my primary assembly. I observed marked improvements in the assembly after just a single round, but the corrected assembly was not free from errors (see table below). I was wondering if there is any benefit to running inspector in multiple rounds similar to how older long read genomes needed several rounds of polishing? Is there a risk of over-polishing or introducing errors?

Statics of contigs:	initial	1 round
Number of contigs	625	625
Number of contigs > 10000 bp	625	625
Number of contigs >1000000 bp	242	242
Total length	2272598809	2272481657
Total length of contigs > 10000 bp	2272598809	2272481657
Total length of contigs >1000000bp	2170694264	2170577325
Longest contig	68564173	68545954
Second longest contig length	68389185	68371778
N50	24159452	24149395
N50 of contigs >1Mbp	24159452	24149395


Read to Contig alignment:
Mapping rate /%	99.99	99.99
Split-read rate /%	9.62	9.62
Depth	45.576	45.5783
Mapping rate in large contigs /%	95.97	95.97
Split-read rate in large contigs /%	9.65	9.66
Depth in large conigs	45.7891	45.7926


Structural error	501	209
Expansion	363	139
Collapse	79	31
Haplotype switch	48	27
Inversion	11	12


Small-scale assembly error /per Mbp	32.7136491	1.937969438
Total small-scale assembly error	74345	4404
Base substitution	55461	2875
Small-scale expansion	11403	691
Small-scale collapse	7481	838

QV	35.4455019	38.49163495

There is no mixed type in the -r parameter

In your inspector.py there is no mixed data type. Why?
parser.add_argument('-d','--datatype',type=str,default='clr',help='Input read type. (clr, hifi, nanopore) [clr]')

Is there a way to change minimap2 parameters to save memory?

Hi Maggi,

thank you for the fantastic program! I was wondering if there was a way to change some of the minimap2 parameters within the Inspector.py code? I have some files with >200,000 reads to align to thousands of contigs and I would like to be able to change the -K parameter to 50M to reduce memory consumption. Currently, all my files <180,000 reads complete but over 180K they fail to have reads mapped to contigs because the job gets killed. Any advice?

Jamie

phased haplotigs

I'd like to use Inspector on phased haplotigs (from hifiasm; i.e., hap1.fa, hap2.fa) from a highly heterozygous plant species (the phasing looks good). I'm assuming I would run Inspector against each of the hap.fa assemblies. Are there any caveats I should understand in doing this? That is, since only one haplotig assembly is given but all of the HiFi reads are given (from both haplotypes), do I need to worry about Inspector erroneously identifying heterozygosity as an error? Hopefully this makes sense. Maybe what I should do is combine both haplotypes into a single assembly and run it as a single assembly?

Any help would be greatly appreciated,

Jeff Maughan

Can you please migrate to python3?

python2 is no longer supported. Could you please migrate to python3? Thank you.

Consistency in formatting of summary_statistics

Hello,

I'm trying out Inspector, and noticed that there are some inconsistencies with the formatting of the file summary_statistics. For example, most header/value pairs are tab-separated, but not in all cases:

For example in this script writing to the summary_statistics file:
https://github.com/Maggi-Chen/Inspector/blob/master/denovo_static.py
lines 230-233 write a tab, but lines 451-452 do not.

Having consistent formatting of header\tvalue would be helpful for parsing the output file on the command-line.

Thanks for developing this tool!
Lauren
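
Until the formatting is made consistent, a tolerant parser can cope with both separators. This is a minimal sketch that assumes each data line is a header followed by one numeric value, separated either by a tab or by spaces; `parse_summary_line` is a hypothetical helper, and the sample lines used with it are taken from summary_statistics excerpts posted in other issues on this page.

```python
def parse_summary_line(line):
    """Return (header, value) for a 'header<sep>number' line, else None."""
    stripped = line.strip()
    if "\t" in stripped:
        header, _, value = stripped.partition("\t")
    else:
        # No tab: treat the last whitespace-separated token as the value.
        parts = stripped.rsplit(None, 1)
        if len(parts) != 2:
            return None
        header, value = parts
    try:
        return header.strip(), float(value)
    except ValueError:
        # Section titles and blank lines carry no numeric value.
        return None
```

Splitting on the last whitespace keeps headers that themselves contain spaces, such as "Small-scale assembly error /per Mbp", in one piece.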

An error occurred: ZeroDivisionError: float division by zero

Dear teacher, thank you for your work.
I'm using Inspector for evaluation and correction, and an error occurred: ZeroDivisionError: float division by zero.
The log is as follows.
Inspector.log
Inspector starting... 23/06/2022 11:01:15
Start Assembly evaluation with contigs: ['../../../../10.resoult/genome_assemblyed/pbipa.fasta']
TIME: Before read mapping 1.2830026149749756
TIME: Read Alignment: 178.04207849502563
nohup.txt

I run normally on other assemblies. Can you give me a solution?

In addition, I have two questions.
First, what QV value indicates a good assembly?
Second, I would like to know how Inspector differs from NextPolish for correction, and whether I still need to run other correction software after Inspector.
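
On the first question, it may help to translate QV into an error rate. Assuming the QV reported by Inspector follows the usual Phred convention, QV = -10 * log10(error rate), the conversion can be sketched as follows; both function names are hypothetical.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error probability implied by a Phred-scaled QV."""
    return 10 ** (-qv / 10.0)


def error_rate_to_qv(error_rate):
    """Phred-scaled QV implied by a per-base error probability (must be > 0)."""
    return -10.0 * math.log10(error_rate)
```

Under that assumption, QV 30 corresponds to about one error per 1,000 bases and QV 40 to about one per 10,000, so every 10 points is a tenfold drop in the error rate. What counts as good enough depends on the project.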

bgzip format for contig fasta

Hi, all

Does Inspector support bgzip-compressed contig FASTA files? When I ran Inspector on bgzipped FASTA files, it failed without returning any error message.

Best regards,
Xinchang
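
I can't say whether Inspector itself reads compressed contigs, so decompressing the FASTA before the run is the safe workaround. For scripts that need to handle both cases, note that bgzip output is a valid multi-member gzip stream, which Python's `gzip` module can read. A minimal sketch, with `open_fasta` as a hypothetical helper:

```python
import gzip


def open_fasta(path):
    """Open a FASTA file as text, transparently handling gzip/bgzip compression."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic bytes, also used by bgzip
        return gzip.open(path, "rt")
    return open(path, "r")
```

Checking the magic bytes rather than the file extension also covers compressed files that were renamed to plain `.fa` or `.fasta`.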

Using nanopore ultra-long reads to correct structural error

Hi dear developer,

I tried to use nanopore ultra-long reads to correct structural errors in my HiFi assembly. inspector.py did detect 443 structural errors in structural_error.bed, but inspector-correct immediately reported

Inspector Assembly Fail chr01__2182326__2182423__97__exp
...

as Flye did not seem to work. I wonder if this is because the reads are usually longer than 50 kb, and sometimes 100 kb.

As a test, I assembled read_ass_chr01__100958203__100958204__153__col.fa with Flye using default parameters. This produced a 23 kb contig, although the longest read in the FASTA file is 108,640 bp. The run took nearly an hour for the 15 reads in the file, which have a mean length of 90,461 bp.

Could you please give me some suggestions to correct the structural error in my hifi assembly? Thanks a lot!

samtools index failed to create index

Hi,

I wanted to use Inspector to inspect an assembly that's been scaffolded to chromosomes, but our chromosomes are greater than 512 Mb (i.e., the limit for samtools index to create a bai). Is it possible for the pipeline to use csi instead? Or is there a workaround, like using inspector --skip_read_alignment?
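
For reference, `samtools index -c` writes a CSI index, which does not have the 512 Mbp (2^29 bp) per-sequence limit of BAI. A small sketch of choosing the index type from the contig lengths; `build_index_command` is a hypothetical helper, and whether the rest of the Inspector pipeline picks up a .csi index is an open question here.

```python
BAI_MAX_CONTIG_LENGTH = 2 ** 29  # 536,870,912 bp, the 512 Mbp BAI limit


def build_index_command(bam_path, contig_lengths):
    """Build a samtools index call, switching to CSI (-c) for very long contigs."""
    cmd = ["samtools", "index"]
    if max(contig_lengths, default=0) > BAI_MAX_CONTIG_LENGTH:
        cmd.append("-c")  # CSI index supports longer reference sequences
    cmd.append(bam_path)
    return cmd
```

The contig lengths could come from a .fai file or from the BAM header.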

Error running Inspector

Hi,

We are trying to run Inspector, and we installed using the following commands:

virtualenv --python=/usr/bin/python2.7 inspector_env
pip install pysam
pip install statsmodels --no-use-pep517

And we get this error involving numpy:

(inspector_env) ubuntu@ip-172-31-23-69:~/software/Inspector$ ./inspector.py -c /home/ubuntu/data/cuttlefish/assembly/run1/Assembly.fasta -r /home/ubuntu/data/cuttlefish/reads/cuttlefish_Guppy_5.0.7_sup.fastq /home/ubuntu/data/cuttlefish/reads/09_28_21_R941_CF5mg_Guppy_5.0.11_prom_sup.fastq -o inspector_out/ --datatype nanopore -t 40
Traceback (most recent call last):
 File "./inspector.py", line 70, in <module>
  import denovo_baseerror
 File "/home/ubuntu/software/Inspector/denovo_baseerror.py", line 3, in <module>
  import statsmodels.stats.proportion
 File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/stats/__init__.py", line 1, in <module>
  from statsmodels.tools._testing import PytestTester
 File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/tools/__init__.py", line 1, in <module>
  from .tools import add_constant, categorical
 File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/tools/tools.py", line 8, in <module>
  from statsmodels.compat.python import lzip, lmap
 File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/compat/__init__.py", line 1, in <module>
  from statsmodels.tools._testing import PytestTester
 File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/tools/_testing.py", line 11, in <module>
  from statsmodels.compat.pandas import assert_equal
 File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/compat/pandas.py", line 4, in <module>
  import numpy as np
 File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/compat/numpy.py", line 46, in <module>
  NP_LT_114 = LooseVersion(np.__version__) < LooseVersion('1.14')
AttributeError: 'module' object has no attribute '__version__'

Can you help with this?

Thanks

QV score is lower after polishing

Hi,
I have an assembly built with Flye, and then I polish it with pepper. Using inspector to evaluate the QV score, I have a lower QV after polishing. Have you had this issue before and do you know why? Thanks for your help.
Hien

Run Inspector with Illumina data

Hi!

I want to compare the results of one run of Inspector and one run of merqury using Illumina data. Which datatype should I specify if I use Illumina data?

Thanks a lot!
Marc

minimap2: command not found

Hi Maggi,
I was running Inspector on the assembled genome on a cluster node with 128 GB of RAM and 48 threads. I installed all the dependencies and ran Inspector. However, after running for a while, it crashed a couple of times with the same error each time.

"sh: minimap2: command not found
samtools sort: failed to read header from "-"
mv: cannot stat ‘/STORAGE/DATA/xxx/xxxx/flye_assembly/read_to_contig_1.bam’: No such file or directory
[E::hts_open_format] Failed to open file "//STORAGE/DATA/xxx/xxxx/Ins_flye_assembly/read_to_contig.bam" : No such file or directory
samtools index: failed to open "/STORAGE/DATA/xxx/xxxx/Ins_flye_assembly/read_to_contig.bam": No such file or directory
cat: /STORAGE/DATA/xxx/xxxx/Ins_flye_assembly/debreak_workspace/read_to_contig_*debreak.temp: No such file or directory
Traceback (most recent call last):
  File "/home/apps/user_apps/xxx/xxx_apps/Inspector/inspector.py", line 131, in <module>
    cov=denovo_static.mapping_info_ctg(denovo_args.outpath,chromosomes_large,chromosomes_small,totalcontiglen,totalcontiglen_large)
  File "/home/apps/user_apps/piwczyn/mergi_apps/Inspector/denovo_static.py", line 123, in mapping_info_ctg
    unmapped=int(pysam.AlignmentFile(outpath+'read_to_contig.bam','rb').unmapped)
  File "pysam/calignmentfile.pyx", line 318, in pysam.calignmentfile.AlignmentFile.__cinit__ (pysam/calignmentfile.c:4730)
  File "pysam/calignmentfile.pyx", line 534, in pysam.calignmentfile.AlignmentFile._open (pysam/calignmentfile.c:7261)
IOError: file //STORAGE/DATA/xxx/xxxx/Ins_flye_assembly/read_to_contig.bam not found "

How should it get fixed?

Thanks ,
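
The first line of the log, "sh: minimap2: command not found", is the root cause: every later message follows from the BAM file never being created. On a cluster this usually means minimap2 is not on the PATH of the job's shell, for example because a module or conda environment was not loaded inside the job script. A quick pre-flight check could look like this sketch; `missing_tools` and its default tool list are hypothetical.

```python
import shutil


def missing_tools(tools=("minimap2", "samtools", "flye")):
    """Return the names of required executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Running `missing_tools()` at the top of the job script, in the same environment as Inspector, shows whether the compute node sees the same tools as the login node.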