maggi-chen / inspector
A tool for evaluating long-read de novo assembly results
License: MIT License
Hello, running inspector-correct.py with --datatype nano-corr fails with an "invalid data type" error, because of a stray space in front of nano-corr in this line:
if inscor_args.datatype not in ['pacbio-raw','pacbio-hifi', 'pacbio-corr', 'nano-raw',' nano-corr']:
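One way to guard against this kind of typo is to let argparse validate the value against a single list of choices. The sketch below is not Inspector's actual code: the parser and the function name `build_parser` are assumptions for illustration, and only the five data type names come from the line quoted above.

```python
import argparse

# Data types from the quoted check, with the stray space removed.
VALID_DATATYPES = ['pacbio-raw', 'pacbio-hifi', 'pacbio-corr', 'nano-raw', 'nano-corr']


def build_parser():
    """Hypothetical parser: argparse rejects any value not in VALID_DATATYPES."""
    parser = argparse.ArgumentParser(description='Illustrative datatype check')
    parser.add_argument('--datatype', type=str, required=True,
                        choices=VALID_DATATYPES, help='Input read type')
    return parser
```

With `choices`, an invalid value stops the program at argument parsing with a message that lists the accepted values, so a separate membership test is no longer needed.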
Hi, I am having some trouble with inspector-correct.
After the latest update of the scripts, 4 days ago, it now successfully produces the contig_corrected.fasta file. But in the log of the correction process, every time a structural error needs to be re-assembled with flye, an error like this occurs:
Base error correction for ctg001250 finished. Time cost: 0.00589680671692
usage: flye (--pacbio-raw | --pacbio-corr | --nano-raw |
--nano-corr | --subassemblies) file1 [file_2 ...]
--genome-size size --out-dir dir_path [--threads int]
[--iterations int] [--min-overlap int] [--resume]
[--debug] [--version] [--help]
usage: flye (--pacbio-raw | --pacbio-corr | --nano-raw |
--nano-corr | --subassemblies) file1 [file_2 ...]
--genome-size size --out-dir dir_path [--threads int]
[--iterations int] [--min-overlap int] [--resume]
[--debug] [--version] [--help]
flye: error: argument -g/--genome-size is required
flye: error: argument -g/--genome-size is required
usage: flye (--pacbio-raw | --pacbio-corr | --nano-raw |
--nano-corr | --subassemblies) file1 [file_2 ...]
--genome-size size --out-dir dir_path [--threads int]
[--iterations int] [--min-overlap int] [--resume]
[--debug] [--version] [--help]
flye: error: argument -g/--genome-size is required
FLYETIME for ctg001250__925207__925554__347__exp 0.044429063797
FLYETIME for ctg001250__931956__933136__1180__exp 0.0446717739105
FLYETIME for ctg001250__813565__813971__406__exp 0.0445201396942
Inspector Assembly Fail ctg001250__925207__925554__347__exp
Inspector Assembly Fail ctg001250__931956__933136__1180__exp
Inspector Assembly Fail ctg001250__813565__813971__406__exp
It is hard to know whether that contig ends up corrected in the final result, or whether the correction failed.
Is there any way to avoid the "genome size required" error that flye produces?
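The usage message above comes from a flye release that treats --genome-size as mandatory. A minimal sketch of a command builder that always supplies it follows. The function and parameter names are hypothetical and this is not how Inspector calls flye; only the flags shown in the usage message are used. As far as I know, newer Flye releases made the genome size optional, so upgrading Flye may be another route.

```python
# Read-type flags taken from the flye usage message in the log.
READ_FLAGS = ('--pacbio-raw', '--pacbio-corr', '--nano-raw', '--nano-corr')


def build_flye_command(read_flag, reads_path, out_dir, genome_size, threads=4):
    """Return an argument list for a flye release that requires --genome-size.

    All parameter names are assumptions made for this illustration.
    """
    if read_flag not in READ_FLAGS:
        raise ValueError('unsupported read flag: %r' % (read_flag,))
    if not genome_size:
        raise ValueError('this flye release requires --genome-size')
    return ['flye', read_flag, reads_path,
            '--genome-size', str(genome_size),
            '--out-dir', out_dir,
            '--threads', str(threads)]
```

The returned list can be passed to `subprocess.call` without going through a shell.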
Hi, thank you for developing such good software. Your article states that "we have developed Inspector to comprehensively evaluate assembly quality and identify assembly errors in haploid and diploid genomes". Can I use Inspector to find structural assembly errors in a homologous polyploid genome?
I'd like to use Inspector to perform assembly correction of two haplotype-resolved HiFi assemblies of a human individual, so I first ran the test dataset to validate that Inspector was installed successfully. In the log file, I found a warning:
cat: test_out/ae_merge_workspace/inv_merged_*: No such file or directory
However, in the directory test_out/ae_merge_workspace/ there was a file named inv_merged_ctg1. So do I need to worry about this warning? Also, only 2 structural errors and 298 small-scale assembly errors were reported in the summary_statistics file. Is the difference in results caused by different versions of Inspector's dependencies?
Any help would be greatly appreciated,
Jingwei Yue!
####the log file:####
import pandas.util.testing as tm
[M::mm_idx_gen::0.048*1.05] collected minimizers
[M::mm_idx_gen::0.079*3.31] sorted minimizers
[M::main::0.079*3.31] loaded/built the index for 2 target sequence(s)
[M::mm_mapopt_update::0.091*3.00] mid_occ = 70
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 2
[M::mm_idx_stat::0.100*2.83] distinct minimizers: 242594 (96.40% are singletons); average occurrences: 1.138; average spacing: 5.328; total length: 1470648
[M::worker_pipeline::14.646*7.00] mapped 6532 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -a -Q -N 1 -I 10G -t 8 test_out/valid_contig.fa testdata/read_test.fastq.gz
[M::main] Real time: 14.660 sec; CPU: 102.577 sec; Peak RSS: 1.226 GB
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
Collect info from ctg2
cat: test_out/ae_merge_workspace/inv_merged_*: No such file or directory
[mpileup] 1 samples in 1 input files
Set max per-file depth to 8000
[mpileup] 1 samples in 1 input files
Set max per-file depth to 8000
end n100
####files in the directory of test_out/ae_merge_workspace/
deletion-merged
del_merged_ctg1
insertion-merged
ins_merged_ctg1
inversion-merged
#######Dependencies for Inspector:
python 3.7
minimap2 2.2
samtools 1.6
Hi Maggi,
I have been using Inspector to check a genome assembled with Flye. However, it keeps giving me the same error, even after I changed some default parameters. Here is the error. How can it be fixed? Thanks.
/home/xxxxx/Inspector/inspector.py -c genome.fasta -r genome.fastq.gz -d nanopore -t 20
.
.
.
.
.
.
Collect info from scaffold_96561_np12_RagTag_polished
Collect info from scaffold_97192_np12_RagTag_polished
Collect info from scaffold_97806_np12_RagTag_polished
Collect info from scaffold_97933_np12_RagTag_polished
Collect info from scaffold_98921_np12_RagTag_polished
sh: 1: cat: Argument list too long
sh: 1: cat: Argument list too long
sh: 1: cat: Argument list too long
Traceback (most recent call last):
File "/home/xxxxxxx/Inspector/inspector.py", line 131, in <module>
cov=denovo_static.mapping_info_ctg(denovo_args.outpath,chromosomes_large,chromosomes_small,totalcontiglen,totalcontiglen_large)
File "/home/xxxxxx/Inspector/denovo_static.py", line 138, in mapping_info_ctg
splrate=round(10000*float(splitread)/mapped)/100.0
ZeroDivisionError: float division by zero
Best,
Mergi
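The "Argument list too long" message appears when a shell wildcard expands to more file names than the system allows on one command line, which is plausible with tens of thousands of scaffolds. If the merged file then ends up empty, no mapped reads are counted and the division fails. Below is a minimal sketch of both ideas, assuming nothing about Inspector's internals: the helper names are hypothetical, and only the rounding expression is taken from the traceback.

```python
import glob
import shutil


def concatenate_files(pattern, destination):
    """Concatenate every file matching pattern into destination; return the file count.

    The wildcard is expanded inside Python, so no shell argument limit applies.
    The destination must not match the pattern itself.
    """
    paths = sorted(glob.glob(pattern))
    with open(destination, 'wb') as out_handle:
        for path in paths:
            with open(path, 'rb') as in_handle:
                shutil.copyfileobj(in_handle, out_handle)
    return len(paths)


def split_read_rate(split_reads, mapped_reads):
    """Split-read percentage rounded to two decimals, or None when nothing mapped."""
    if mapped_reads == 0:
        return None
    return round(10000 * float(split_reads) / mapped_reads) / 100.0
```

Returning None instead of dividing makes the "no reads mapped" case visible to the caller, which can then print a clear message rather than a traceback.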
Hi,
We've been using inspector without trouble for a few months now. When running on one of our genomes, we're getting the following error (previous few lines included):
Collect info from ctg005350
Collect info from ctg005510
Collect info from ctg005670
Collect info from ctg005850
Collect info from ctg006080
Collect info from ctg006310
Collect info from ctg006510
Traceback (most recent call last):
File "/work/soghigian_lab/apps/conda/envs/ins/Inspector/inspector.py", line 149, in <module>
debreak_cluster.genotype(cov,denovo_args.outpath)
File "/work/soghigian_lab/apps/conda/envs/ins/Inspector/debreak_merge_clustering.py", line 314, in genotype
leftcov=samfile.count(chrom,max(start-100,0),start,read_callback='all')
File "pysam/calignmentfile.pyx", line 1081, in pysam.calignmentfile.AlignmentFile.count (pysam/calignmentfile.c:12699)
TypeError: count() got an unexpected keyword argument 'read_callback'
Any idea how to resolve this?
Hi inspector Team,
I used Inspector to perform assembly correction of two haplotype-resolved HiFi assemblies of a human individual. The assembly QV improved, but one of the assemblies shows more structural errors after correction. In addition, I found that local assembly can fail during Inspector correction. Is it possible to address these issues? Thanks!
commands:
inspector.py -c asm1.fa -r ccsreads.1.fastq ccsreads.2.fastq -o asm1_out/ --datatype hifi
inspector.py -c asm2.fa -r ccsreads.1.fastq ccsreads.2.fastq -o asm2_out/ --datatype hifi
inspector-correct.py -i asm1_out/ --datatype pacbio-hifi -o asm1_corrected/
inspector-correct.py -i asm2_out/ --datatype pacbio-hifi -o asm2_corrected/
asm2 before correction:
Structural error 49
Expansion 29
Collapse 20
Haplotype switch 0
Inversion 0
Small-scale assembly error /per Mbp 14.790265745042309
Total small-scale assembly error 44683
Base substitution 26180
Small-scale expansion 10097
Small-scale collapse 8406
QV 46.40314095738277
asm2 after correction:
Structural error 58
Expansion 32
Collapse 26
Haplotype switch 0
Inversion 0
Small-scale assembly error /per Mbp 1.3369275599703823
Total small-scale assembly error 4039
Base substitution 3227
Small-scale expansion 349
Small-scale collapse 463
QV 50.03862621625856
Hi Maggie,
This is a very interesting tool. I am thoroughly testing it in comparison to other polishing approaches I have been using, so I should have some good feedback soon. I am particularly interested to see how it handles small N-gaps introduced by BioNano hybrid scaffolding that can easily be spanned by HiFi reads.
What happened to the p-value parameter? I see it in the documentation, but not in v1.0.2. This could be very helpful to increase the quality of polishing.
Also, v1.0.2 still shows v1.0.1 as the version.
Thanks,
Kevin
I am finding a new type of error during the correction.
It seems that the jobs are sent to flye properly and start being assembled, but immediately afterwards Inspector looks for the results and claims that they don't exist. Meanwhile, if I run top, I see the Flye jobs still running, while Inspector claims it has already finished. So I guess those results will be saved in draft_assembly.fasta but not in the final contig_corrected.fa.
When evaluating the Inspector-corrected assembly with Inspector, we see that the small errors are nicely corrected, but the structural errors are not. This might be because they never get written into the final corrected fasta.
This is how the error looks in one of these cases:
[2021-11-23 21:32:05] INFO: Extending reads
[2021-11-23 22:19:49] INFO: Overlap-based coverage: 10
[2021-11-23 22:19:49] INFO: Median overlap divergence: 0.109189
0% 90% 100%
[2021-11-23 22:35:36] INFO: Assembled 2 disjointigs
[2021-11-23 22:35:36] INFO: Generating sequence
0% 10% 20% 30% 50% 60% 70% 80% 100%
[2021-11-23 22:35:36] ERROR: Caught unhandled exception: Can't open /space/no_backup/merce/filter/final_assemblies/Inspector_MS_nextdenovo/assemble_workspace/flye_out_ctg003150__46462__46463__133__col/00-assembly/draft_assembly.fasta
[2021-11-23 22:35:36] ERROR: flye-modules(+0x3ab73) [0x562558de4b73]
[2021-11-23 22:35:36] ERROR: /space/Software/final_env/bin/../lib/libstdc++.so.6(+0xacf6f) [0x7f0ee0e07f6f]
[2021-11-23 22:35:36] ERROR: /space/Software/final_env/bin/../lib/libstdc++.so.6(+0xacfb1) [0x7f0ee0e07fb1]
[2021-11-23 22:35:36] ERROR: /space/Software/final_env/bin/../lib/libstdc++.so.6(__cxa_rethrow+0) [0x7f0ee0e0819a]
[2021-11-23 22:35:36] ERROR: flye-modules(+0xc5e4) [0x562558db65e4]
[2021-11-23 22:35:36] ERROR: flye-modules(+0x373e5) [0x562558de13e5]
[2021-11-23 22:35:36] ERROR: flye-modules(+0x179d2) [0x562558dc19d2]
[2021-11-23 22:35:36] ERROR: /lib64/libc.so.6(__libc_start_main+0xed) [0x7f0ee03b934d]
[2021-11-23 22:35:36] ERROR: flye-modules(+0x17b79) [0x562558dc1b79]
[2021-11-23 22:35:36] ERROR: Command '['flye-modules', 'assemble', '--reads', 'Inspector_MS_nextdenovo/assemble_workspace/read_ass_ctg003150__46462__46463__133__col.fa', '--out-asm', '/space/no_backup/merce/filter/final_assemblies/Inspector_MS_nextdenovo/assemble_workspace/flye_out_ctg003150__46462__46463__133__col/00-assembly/draft_assembly.fasta', '--config', '/space/Software/final_env/lib/python2.7/site-packages/flye/config/bin_cfg/asm_raw_reads.cfg', '--log', '/space/no_backup/merce/filter/final_assemblies/Inspector_MS_nextdenovo/assemble_workspace/flye_out_ctg003150__46462__46463__133__col/flye.log', '--threads', '4', '--min-ovlp', '5000']' returned non-zero exit status -6
[2021-11-23 22:35:36] ERROR: Pipeline aborted
Hi~
I used the command "inspector.py -c ctg.fa -r ccs1.fq ccs2.fq -o inspector_out/ --datatype hifi" to evaluate the contig-level draft assembly, but I found that the output "valid_contig.fa" was exactly the same as the input "ctg.fa". I carefully compared the contig sizes and the base sequences in "small_scale_error.bed", and I'm sure "valid_contig.fa" is unchanged. The "summary_statistics" file did report errors, but I wonder why they were not fixed in "valid_contig.fa".
Hi,
Because of my genome size, I need to set the minimap2 -I parameter to a larger value. Is it possible to simply provide Inspector with the BAM file rather than a fastq file? That would also make things much easier when working on a cluster, where I can distribute the mapping jobs.
Thanks!
I am currently assembling a teleost genome using HiFi reads and I decided to use Inspector to correct my primary assembly. I observed marked improvements in the assembly after just a single round, but the corrected assembly was not free from errors (see table below). I was wondering if there is any benefit to running inspector in multiple rounds similar to how older long read genomes needed several rounds of polishing? Is there a risk of over-polishing or introducing errors?
| Statics of contigs: | initial | 1 round |
|---|---|---|
| Number of contigs | 625 | 625 |
| Number of contigs > 10000 bp | 625 | 625 |
| Number of contigs >1000000 bp | 242 | 242 |
| Total length | 2272598809 | 2272481657 |
| Total length of contigs > 10000 bp | 2272598809 | 2272481657 |
| Total length of contigs >1000000bp | 2170694264 | 2170577325 |
| Longest contig | 68564173 | 68545954 |
| Second longest contig length | 68389185 | 68371778 |
| N50 | 24159452 | 24149395 |
| N50 of contigs >1Mbp | 24159452 | 24149395 |
| Read to Contig alignment: | | |
| Mapping rate /% | 99.99 | 99.99 |
| Split-read rate /% | 9.62 | 9.62 |
| Depth | 45.576 | 45.5783 |
| Mapping rate in large contigs /% | 95.97 | 95.97 |
| Split-read rate in large contigs /% | 9.65 | 9.66 |
| Depth in large conigs | 45.7891 | 45.7926 |
| Structural error | 501 | 209 |
| Expansion | 363 | 139 |
| Collapse | 79 | 31 |
| Haplotype switch | 48 | 27 |
| Inversion | 11 | 12 |
| Small-scale assembly error /per Mbp | 32.7136491 | 1.937969438 |
| Total small-scale assembly error | 74345 | 4404 |
| Base substitution | 55461 | 2875 |
| Small-scale expansion | 11403 | 691 |
| Small-scale collapse | 7481 | 838 |
| QV | 35.4455019 | 38.49163495 |
Your inspector.py does not offer a mixed data type. Why?
parser.add_argument('-d','--datatype',type=str,default='clr',help='Input read type. (clr, hifi, nanopore) [clr]')
Hi Maggi,
Thank you for the fantastic program! I was wondering whether there is a way to change some of the minimap2 parameters within the Inspector.py code. I have some files with >200,000 reads to align to thousands of contigs, and I would like to change the -K parameter to 50M to reduce memory consumption. Currently, all my files with <180,000 reads complete, but files with more than 180K reads fail to have reads mapped to contigs because the job gets killed. Any advice?
Jamie
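For anyone editing the script locally, here is a sketch of how the minimap2 call could be made configurable. The fixed options mirror the minimap2 command that appears in a log elsewhere on this page (`-a -Q -N 1 -I 10G -t 8`); the function and its parameters are hypothetical and are not part of Inspector. `-I` (index batch size) and `-K` (mini-batch size) are standard minimap2 options.

```python
def build_minimap2_command(contigs, reads, threads=8, index_batch='10G', minibatch=None):
    """Return a minimap2 argument list with adjustable -I and optional -K.

    contigs: path to the contig fasta; reads: list of read files.
    minibatch: value for -K such as '50M'; None leaves minimap2's default in place.
    """
    cmd = ['minimap2', '-a', '-Q', '-N', '1',
           '-I', index_batch, '-t', str(threads)]
    if minibatch is not None:
        cmd += ['-K', minibatch]
    cmd.append(contigs)
    cmd.extend(reads)
    return cmd
```

For example, `build_minimap2_command('valid_contig.fa', ['reads.fastq.gz'], minibatch='50M')` adds `-K 50M`, while passing `index_batch='100G'` raises the `-I` value for large genomes.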
I'd like to use inspector on phased haplotigs (from hifiasm; i.e., hap1.fa, hap2.fa) from a highly heterozygous plant species (the phasing looks good). I'm assuming I would run inspector against each of the hap.fa assemblies. Are there any caveats that I should understand in doing this? That is, since only one haplotig assembly is given but all of the HiFi reads are given (from both haplotypes), do I need to worry about inspector erroneously identifying heterozygosity as an error? Hopefully this makes sense. Maybe what I should do is combine both haplotypes into a single assembly and run it as a single assembly?
Any help would be greatly appreciated,
Jeff Maughan
python2 is no longer supported. Could you please migrate to python3? Thank you.
Hello,
I'm trying out Inspector, and noticed some inconsistencies in the formatting of the summary_statistics file. Most header/value pairs are tab-separated, but not all of them.
For example, in this script, which writes to the summary_statistics file:
https://github.com/Maggi-Chen/Inspector/blob/master/denovo_static.py
lines 230-233 write a tab, but lines 451-452 do not.
Having consistent header\tvalue formatting would be helpful for parsing the output file on the command line.
Thanks for developing this tool!
Lauren
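Until the formatting is made consistent, a tolerant parser is one possible workaround. The sketch below is an assumption about how such a parser could look, not part of Inspector: it splits on a tab when one is present and otherwise treats the last whitespace-separated token as the value. The function name is hypothetical.

```python
def parse_summary_line(line):
    """Split one summary_statistics line into (label, value).

    value is a float, or None for section headers and lines without a number.
    """
    line = line.rstrip('\n')
    if '\t' in line:
        label, value = line.split('\t', 1)
    else:
        parts = line.rsplit(None, 1)
        if len(parts) < 2:
            return line.strip(), None
        label, value = parts
    try:
        return label.strip(), float(value.strip())
    except ValueError:
        # The last token is not numeric, so the whole line is a label.
        return line.strip(), None
```

A line such as `QV 46.40314095738277` then parses the same way as `Structural error<TAB>49`, and a section header such as `Statics of contigs:` comes back with a value of None.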
Dear teacher, thank you for your work.
I'm using Inspector for evaluation and correction. An error occurred: ZeroDivisionError: float division by zero.
The log is as follows.
Inspector.log
Inspector starting... 23/06/2022 11:01:15
Start Assembly evaluation with contigs: ['../../../../10.resoult/genome_assemblyed/pbipa.fasta']
TIME: Before read mapping 1.2830026149749756
TIME: Read Alignment: 178.04207849502563
nohup.txt
Other assemblies run normally. Can you suggest a solution?
In addition, I have two questions.
First, what QV value indicates a good genome assembly?
Second, what is the difference between Inspector and NextPolish for correction, and do I need to run additional correction software after Inspector?
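On the first question, it may help to read QV on the usual Phred scale, where QV = -10 * log10(error rate). The sketch below only converts between the two; it assumes that convention and does not reproduce how Inspector counts erroneous bases.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error rate implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10.0)


def error_rate_to_qv(error_rate):
    """Phred-scaled quality value for a per-base error rate in (0, 1]."""
    if not 0 < error_rate <= 1:
        raise ValueError('error rate must be in (0, 1]')
    return -10.0 * math.log10(error_rate)
```

On this scale, QV 30 corresponds to about one error per 1,000 bases, QV 40 to one per 10,000 and QV 50 to one per 100,000.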
Hi, all
Does Inspector support bgzip-compressed contig fasta files? When I ran Inspector on bgzipped fasta files, it failed without returning any error message.
Best regards,
Xinchang
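A bgzip file is a series of gzip blocks, so Python's gzip module can normally read it. The following sketch shows one way to open a fasta file transparently, whether compressed or not. It is an illustration under that assumption, written for Python 3, with hypothetical helper names; it is not Inspector's own file handling.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of any gzip (and bgzip) file


def open_fasta(path):
    """Open a fasta file for text reading, decompressing if it looks gzipped."""
    with open(path, 'rb') as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, 'rt')
    return open(path, 'r')


def count_sequences(path):
    """Count fasta records by counting header lines."""
    with open_fasta(path) as handle:
        return sum(1 for line in handle if line.startswith('>'))
```

Checking the magic bytes rather than the file extension also covers compressed files that were renamed to `.fa` or `.fasta`.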
Hi dear developer,
I tried to use nanopore ultra-long reads to correct structural errors in my HiFi assembly. The inspector.py module did detect 443 structural errors in structural_error.bed; however, inspector-cor directly reported
Inspector Assembly Fail chr01__2182326__2182423__97__exp
...
as flye did not seem to be working. I wonder whether this is because the reads are usually longer than 50 kb and sometimes longer than 100 kb.
I tested assembling read_ass_chr01__100958203__100958204__153__col.fa with flye using default parameters, which produced a 23 kb contig, while the longest read in the fasta file was 108640 bp. The program took nearly an hour to run on the 15 reads in the fasta, which have a mean length of 90461 bp.
Could you please give me some suggestions for correcting the structural errors in my HiFi assembly? Thanks a lot!
Hi,
I wanted to use Inspector to inspect an assembly that's been scaffolded to chromosomes, but our chromosomes are greater than 512 Mb (i.e., the limit for samtools index to create a bai). Is it possible for the pipeline to use csi instead? Or is there a workaround, like using inspector --skip_read_alignment?
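For context, the BAI format cannot address positions beyond 2^29 - 1 bp, and `samtools index -c` writes a CSI index instead. The sketch below chooses between the two based on contig lengths. The function name is hypothetical, and whether the rest of the pipeline can read a CSI index is an open question that this sketch does not answer.

```python
BAI_MAX_CONTIG_LENGTH = 2 ** 29 - 1  # 536,870,911 bp, the BAI coordinate limit


def build_index_command(bam_path, contig_lengths):
    """Return a samtools index command, using CSI (-c) when any contig exceeds the BAI limit."""
    cmd = ['samtools', 'index']
    if max(contig_lengths, default=0) > BAI_MAX_CONTIG_LENGTH:
        cmd.append('-c')
    cmd.append(bam_path)
    return cmd
```

For a 600 Mbp chromosome this yields `samtools index -c read_to_contig.bam`; for shorter contigs it falls back to the default BAI index.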
Hi,
We are trying to run Inspector, and we installed using the following commands:
virtualenv --python=/usr/bin/python2.7 inspector_env
pip install pysam
pip install statsmodels --no-use-pep517
And we get this error involving numpy:
(inspector_env) ubuntu@ip-172-31-23-69:~/software/Inspector$ ./inspector.py -c /home/ubuntu/data/cuttlefish/assembly/run1/Assembly.fasta -r /home/ubuntu/data/cuttlefish/reads/cuttlefish_Guppy_5.0.7_sup.fastq /home/ubuntu/data/cuttlefish/reads/09_28_21_R941_CF5mg_Guppy_5.0.11_prom_sup.fastq -o inspector_out/ --datatype nanopore -t 40
Traceback (most recent call last):
File "./inspector.py", line 70, in <module>
import denovo_baseerror
File "/home/ubuntu/software/Inspector/denovo_baseerror.py", line 3, in <module>
import statsmodels.stats.proportion
File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/stats/__init__.py", line 1, in <module>
from statsmodels.tools._testing import PytestTester
File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/tools/__init__.py", line 1, in <module>
from .tools import add_constant, categorical
File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/tools/tools.py", line 8, in <module>
from statsmodels.compat.python import lzip, lmap
File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/compat/__init__.py", line 1, in <module>
from statsmodels.tools._testing import PytestTester
File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/tools/_testing.py", line 11, in <module>
from statsmodels.compat.pandas import assert_equal
File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/compat/pandas.py", line 4, in <module>
import numpy as np
File "/home/ubuntu/software/Inspector/inspector_env/lib/python2.7/site-packages/statsmodels/compat/numpy.py", line 46, in <module>
NP_LT_114 = LooseVersion(np.__version__) < LooseVersion('1.14')
AttributeError: 'module' object has no attribute '__version__'
Can you help with this?
Thanks
Hi,
I have an assembly built with Flye, and then I polish it with pepper. Using inspector to evaluate the QV score, I have a lower QV after polishing. Have you had this issue before and do you know why? Thanks for your help.
Hien
Hi!
I want to compare the results of one run of Inspector and one run of merqury using Illumina data. Which datatype should I specify if I use Illumina data?
Thanks a lot!
Marc
Hi Maggi,
I was running Inspector on an assembled genome on a cluster node with 128 GB of RAM and 48 threads. I installed all the dependencies and ran Inspector. However, after running for a while it crashed a couple of times, giving me the same error each time.
"sh: minimap2: command not found
samtools sort: failed to read header from "-"
mv: cannot stat ‘/STORAGE/DATA/xxx/xxxx/flye_assembly/read_to_contig_1.bam’: No such file or directory
[E::hts_open_format] Failed to open file "//STORAGE/DATA/xxx/xxxx/Ins_flye_assembly/read_to_contig.bam" : No such file or directory
samtools index: failed to open "/STORAGE/DATA/xxx/xxxx/Ins_flye_assembly/read_to_contig.bam": No such file or directory
cat: /STORAGE/DATA/xxx/xxxx/Ins_flye_assembly/debreak_workspace/read_to_contig_*debreak.temp: No such file or directory
Traceback (most recent call last):
File "/home/apps/user_apps/xxx/xxx_apps/Inspector/inspector.py", line 131, in <module>
cov=denovo_static.mapping_info_ctg(denovo_args.outpath,chromosomes_large,chromosomes_small,totalcontiglen,totalcontiglen_large)
File "/home/apps/user_apps/piwczyn/mergi_apps/Inspector/denovo_static.py", line 123, in mapping_info_ctg
unmapped=int(pysam.AlignmentFile(outpath+'read_to_contig.bam','rb').unmapped)
File "pysam/calignmentfile.pyx", line 318, in pysam.calignmentfile.AlignmentFile.cinit (pysam/calignmentfile.c:4730)
File "pysam/calignmentfile.pyx", line 534, in pysam.calignmentfile.AlignmentFile._open (pysam/calignmentfile.c:7261)
IOError: file //STORAGE/DATA/xxx/xxxx/Ins_flye_assembly/read_to_contig.bam not found "
How can this be fixed?
Thanks,
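The first line of the log ("minimap2: command not found") suggests that minimap2 is not on the PATH of the job that ran Inspector, and the later errors follow from the missing BAM file. A small pre-flight check along these lines can confirm that before a long run. It is a sketch with a hypothetical function name and is not part of Inspector.

```python
import shutil


def find_missing_tools(tools=('minimap2', 'samtools')):
    """Return the names of required executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Running `find_missing_tools()` inside the same cluster job script, after any module or conda activation, shows whether the environment seen by the job differs from the login shell.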