Comments (30)
from quast.
Can you confirm that 10 or more Ns in the assembly is considered a scaffold gap, for the purpose of classifying scaffold gap size mis.
?
from quast.
Well, technically yes, 10 or more Ns may be considered a scaffold gap. This is not the only condition of course. Lets assume there are two aligned fragments in a scaffold, A1 and A2. To consider a scaffold gap misassembly between them, we check:
- there is a misassembly between A1 and A2 (using regular misassembly definition), the size of inconsistency between reference and scaffold positions of A1 and A2 is less than
scaffolds_gap_threshold
(10 kbp, by default) - both A1 and A2 align on the same strand of the same chromosome (i.e., the misassembly is NOT an inversion or a translocation)
- there is a sequence of nucleotides between A1 and A2 in the scaffold, the sequence should have length at least 10 (here it is!) and at least 95% of the nucleotides should be Ns.
from quast.
and at least 95% of the nucleotides should be Ns
Ahah! That explains the behaviour that I'm seeing. I have a scaffold gap of 100 Ns (arbitrary number of Ns, because the size of the gap is not estimated). There's a single SNV 10 bp from the end of A1, which prevents that 10 bp from aligning to the reference. So the unaligned sequence is 100 bp composed of 10 nucleotides and 100 Ns. This is being categorized as a misassembly
rather than a scaffold gap size mis.
because 100/110 = 91% < 95%.
With the default ---min-identity=95
up to 20 nucleotides at the end of an alignment can be unaligned, due to a single SNV. If 20 nucleotides are unaligned from both sides, that's 40 unaligned nucleotides. To exceed the 95% N threshold, the gap would have to be larger than 800 Ns. Shorter gaps will be considered a misassembly
.
from quast.
Well, we can add some absolute thresholds in addition to 95% to exclude penalising of such small scaffold gaps. Do you think this is reasonable?
Actually, we expected larger stretches of Ns from scaffolders when they are not sure about the gap size (like 1000 or 2000 Ns).
from quast.
It looks to me like this code:
should match this code:
and check that the number of unaligned nucleotides is larger than
qconfig.unaligned_part_size
.from quast.
In my opinion, I think the ratio check can be dropped. If there's at least 10 Ns, and the number of unaligned non-N nucleotides, is less than a fixed threshold, it should be considered a scaffold gap.
from quast.
Here's a case where QUAST is incorrectly identifying a scaffold gap as a relocation misassembly, but it's not due to failing the Ns ration test.
A gist of the sequence (42 kbp):
https://gist.github.com/sjackman/80d7021b76a2d82b2f1e1e50c1649282
The minimap2 alignment:
10686884 42259 14361 42254 - CM000669.2 159345973 4313215 4341083 23453 27909 60 tp:A:P cm:i:2193 s1:i:23442 s2:i:542
10686884 42259 16 4311 - CM000669.2 159345973 4341196 4345492 3787 4309 60 tp:A:P cm:i:367 s1:i:3781 s2:i:77
The QUAST report:
10686884_32279_0_470812__..._10458231
Extensive misassembly (relocation, inconsistency = -9937) between 4319 1 and 42259 14331
4341189 4345508 4319 1 CM000669.2_Homo_sapiens_chromosome_7__GRCh38_reference_primary_assembly 10686884_32279_0_470812__..._10458231 99.12 True
relocation, inconsistency = -9937
4313211 4341114 42259 14331 CM000669.2_Homo_sapiens_chromosome_7__GRCh38_reference_primary_assembly 10686884_32279_0_470812__..._10458231 99.56 True
CONTIG 10686884_32279_0_470812__..._10458231 42259 misassembled
The unaligned sequence of the contig between 10686884:4320-14330 is composed of 11 nucleotides and 10,000 Ns. So it passes the 95% Ns ration test. Any idea why it's not being classified as a scaffold gap?
from quast.
I believe this line of code has a bug with Python 2:
It's doing an integer division so the result is an integer, either 0 or 1. This test only passes if the sequence in the gap is composed entirely 100% of Ns. This bug wouldn't affect Python 3, where this division would yield a float.
May I suggest removing the ratio test, and replacing it with:
return gap_in_contig.count('N') >= qconfig.Ns_break_threshold
from quast.
I've opened PR #47 to address this issue.
from quast.
Thanks for debugging this issue! Yes, it was a division bug, worked correctly in Python 3 only.
The proper fix is to add
from __future__ import division
in the beginning of the module but since we decide to drop the ratio test, it is not needed here (but probably needed in other places, e.g. in your example from quast/quast_libs/ca_utils/analyze_contigs.py
).
As for the suggested fix -- I do not fully agree with it. E.g., if you have a 10 000 bp long gap (inconsistency with reference) in the contig and there is just 10 Ns there (even not necessary consecutive, just a single N per 1 kbp) this will be treated as "scaffold gap size mis" according to your fix. And I am not sure that this is desired behaviour :)
I suggest to change to something like this:
gap_len = len(gap_in_contig)
if gap_len < qconfig.Ns_break_threshold:
return False`
side_len = min(qconfig.MAX_INDEL_LENGTH, (gap_len - qconfig.Ns_break_threshold) // 2)
return gap_in_contig[X:-X] == len() * 'N')
That is, we ask for at least 10 Ns ("Ns_break_threshold") in the gap and also allow up to 50 ("MAX_INDEL_LENGTH") non-Ns at both sides of the gap. At the same time, all nucleotides inside the gap should be strictly Ns, which is a typical pattern for scaffold gaps.
Do you agree?
from quast.
If there were a 10 kbp gap, I'd expect the 10 kbp of sequence to align somewhere else in the reference, and so it would be classified as a misassembly (either a translocation or relocation). If there's sequence that does not align to the reference at all, it's more likely a reference issue then an assembly issue, either sequence found in this individual and missing from the reference, or a >5% diverged mobile genetic element.
MAX_INDEL_LENGTH (50 bp) is an appropriate threshold in the middle of an alignment, but it's too small at the ends of alignments. For example, requiring 95% alignment identity, a 10 bp indel and a single SNV (normal individual variation) at the end of contig can cause 50 bp to go unaligned, and so incorrectly be classified as a misassembly rather than a scaffold gap.
I feel that an alignment gap should be classified as a scaffold gap as long as the gap contains more than 10 consecutive Ns, and the alignment blocks to the left and right of the alignment gap are collinear.
from quast.
If there is an unaligned segment in the middle of a contig, I believe it's detected by this function process_unaligned_part
, which increments possible_misassemblies
.
quast/quast_libs/ca_utils/analyze_contigs.py
Lines 26 to 27 in 6712d35
Where does that unaligned part get counted in the metrics? My guess is either
# unaligned mis. contigs
or unaligned contigs (partial)
. Which one would it be? Does it also get counted as # misassembled contigs
and/or # misassemblies
?from quast.
even not necessary consecutive, just a single N per 1 kbp
Good catch. I've fixed up my PR #47:
def is_gap_filled_ns(contig_seq, align1, align2):
gap_in_contig = contig_seq[align1.end(): align2.start() - 1]
return 'N' * qconfig.Ns_break_threshold in gap_in_contig
from quast.
PAF uses 0-based coordinates. Does QUAST use 1-based coordinates internally?
https://github.com/lh3/miniasm/blob/master/PAF.md
from quast.
Will answer to all questions here:
If there were a 10 kbp gap, I'd expect the 10 kbp of sequence to align somewhere else in the reference, and so it would be classified as a misassembly (either a translocation or relocation). If there's sequence that does not align to the reference at all, it's more likely a reference issue then an assembly issue, either sequence found in this individual and missing from the reference, or a >5% diverged mobile genetic element.
I feel that an alignment gap should be classified as a scaffold gap as long as the gap contains more than 10 consecutive Ns, and the alignment blocks to the left and right of the alignment gap are collinear.
This totally make sense but note that this code is also used in MetaQUAST which deals with some complex cases like absence of some reference genomes, presence of very close reference genomes, and so on. I am afraid mostly about this use case. However, probably your suggestion make sense in the metagenomic case, too. I applied your PR locally and started the test on a medium-size metagenomic dataset and will compare results after/before.
Good catch. I've fixed up my PR #47:
Now I like it more :) Lets wait until my metagenomic test is finished and we probably merge this PR.
Does QUAST use 1-based coordinates internally?
Yes, we use 1-based coords since QUAST was originally based on Nucmer/Mummer which reports 1-based alignments. After recent switch to minimap2 we decided not to change internal representations (it is very painful!)
Where does that unaligned part get counted in the metrics? My guess is either # unaligned mis. contigs or unaligned contigs (partial). Which one would it be? Does it also get counted as # misassembled contigs and/or # misassemblies?
Most of the answers are in our manual (just search for the corresponding metric names). Briefly speaking: possible misassembly
is a separate event which is reported in MetaQUAST mode only. However, it is always happen in partially unaligned contigs (but the opposite is not always true -- not all partially unaligned contigs have possible misassemblies). As for the # misassemblies
-- no, they include only relocations/translocations/inversions and do NOT include local misassemblies, possible misassemblies, unaligned mis. contigs, scaffold gap size misassemblies, structural variations, etc.
from quast.
My metagenomic test of the PR passed correctly, so I merged it. Thanks, @sjackman !
By the way, are you still interested in the topic of this issue (reporting number of relocations which could be possibly counted as scaffold gap errors)?
from quast.
Yes, I'd be interested to see # misassembles
and possibly its three subcategories broken down by whether they're involved in a scaffold gap or not. It's helpful to know whether the error was introduced by the unitigger or the scaffolder. I found that the size of each gap size estimation error is in contigs_report_*.stdout
, when the gap size error is between [extensive misassembly size, scaffold gap size] ([7k, 10k] by default). It'd be helpful to print a similar message in .stdout
when the inconsistency is <7k or >10k.
from quast.
from quast.
@sidorov-si ABySS-Samtobreak reports scaffolding errors. It takes a SAM file of aligned scaftigs and produces a SAM file of non-colinear aligned scaftigs. It may be useful to you. It's built by default when you compile ABySS when you have ghc
in your PATH
.
sudo apt-get install ghc
# or
brew install ghc
https://github.com/bcgsc/abyss/tree/master/Misc
Cut your scaffolds into scaftigs using seqtk cutN -n1
and then align them with either bwa mem -x intractg
or minimap2 -x asm10
.
I'm looking forward to similar functionality from QUAST! =)
from quast.
IGV can also view PAF files, though it doesn't currently visualize the alignment in the CIGAR strings, only the alignment blocks.
from quast.
Ok, this feature looks useful for you and probably for the whole community, too :)
We will add it soon!
Few notes:
Yes, I'd be interested to see # misassembles and possibly its three subcategories broken down by whether they're involved in a scaffold gap or not.
Hmm, only relocations could be potentially a scaffold gap (in case of increased --scaffold-gap-max-size
), while inversions and translocations could not. At least using our definition of scaffold gaps. Am I wrong?
I found that the size of each gap size estimation error is in contigs_report_*.stdout, when the gap size error is between [extensive misassembly size, scaffold gap size] ([7k, 10k] by default). It'd be helpful to print a similar message in .stdout when the inconsistency is <7k or >10k.
Are you talking about messages like "inconsistency = ..."? Actually we report them already for <7k and >10k. For local misassemblies (<7k by default in --large
mode and <1k by default, otherwise) they look like
Gap between these two alignments (local misassembly). Inconsistency = -190
For relocations they look like:
Extensive misassembly (relocation, inconsistency = 1886198) between these two alignments
(note the number is huge).
In case of scaffold gap misassemblies we report "gap length difference".
We do NOT report inconsistency in case of inversions and translocations because it cannot be properly defined in this case.
from quast.
Yes, I'd be interested to see # misassembles and possibly its three subcategories broken down by whether they're involved in a scaffold gap or not.
Hmm, only relocations could be potentially a scaffold gap (in case of increased --scaffold-gap-max-size), while inversions and translocations could not. At least using our definition of scaffold gaps. Am I wrong?
Say in the assembly there's a scaffold composed of two contigs A and B with a gap of 10 or more Ns between them. If A and B map for then 10,000 bp apart, it's a scaffolding relocation misassembly. If they map to different chromosomes, it's a scaffolding translocation misassembly, and if they map to the same chromosome with different orientations, it's also a scaffold orientation misassembly. You could perhaps define a scaffold inversion misassembly as three colinear contigs A B' C, where is B is inverted with respect to A and C.
from quast.
This message is reported when there's a scaffold gap inconsistency in the range of [7k, 10k].
Incorrectly estimated size of scaffold gap between these two alignments: gap length difference = 8509
For relocations they look like:
Extensive misassembly (relocation, inconsistency = 1886198) between these two alignments
This message doesn't specify whether there's a scaffold gap of more than 10 Ns on the query between the two alignment blocks. It could be either a contig or a scaffolding error.
We do NOT report inconsistency in case of inversions and translocations because it cannot be properly defined in this case.
Whether the alignment break involves a scaffold gap of Ns or not can be reported.
from quast.
Ok, got it.
Note that the currently implemented scaffold gap size misassemblies
are excluded from # misassemblies
, since we consider them as not severe errors (like local misassemblies
).
Suggested novel metrics (scaffold relocations, translocations, inversions
) will BE counted in # misassemblies
. In fact it is an extended classification of them in addition to regular relocations, translocation, inversions. I hope this handful of novel metrics will not be confusing in the main report :)
... if they map to the same chromosome with different orientations, it's also a scaffold orientation misassembly. You could perhaps define a scaffold inversion misassembly as three colinear contigs A B' C, where is B is inverted with respect to A and C.
Do you mind if we call your scaffold orientation misassembly
as scaffold inversion
for consistency with our regular inversions
? And if we also do not report your scaffold inversion misassembly
, since all our current misassemblies are "breakpoint-like", i.e. defined as an even between two adjacent blocks.
from quast.
Yep, that all makes sense me to me.
I hope this handful of novel metrics will not be confusing in the main report :)
I don't believe this breakdown needs to be in the main report, since relocations, translocations, inversions
aren't already reported in the main report. I'd suggest they go in misassemblies_report.txt
. Perhaps something like…
# misassemblies 1200
sequence 600
# relocations 300
# translocations 200
# inversions 100
scaffold 600
# relocations 300
# translocations 200
# inversions 100
from quast.
Note that relocations, translocations, inversions
aren't in report.txt
but they are present in report.html
(after pressing "extended report" button). However, your suggested report structure looks pretty nice to me, I think we can report them in a similar fashion in HTML report, too.
from quast.
Great! contig
may be a better name than sequence
for the subsection.
from quast.
We are finalizing v.5.0 release and this feature was added there in the following format.
- The detailed log file (
contigs_report_*.stdout
) now includes additional labels "scaffold gap is present" for misassemblies containing the gap but not marked as "scaffold gap mis size" misassembly (e.g. because of too large inconsistency).
Example:
Real Alignment 1: 3533883 3536114 | 2199 1 | 2232 2199 | 95.18 | gi_49175990_ref_NC_000913.2__Escherichia_coli_str._K_12_substr._MG1655__complete_genome Contig_390
Extensive misassembly (relocation, scaffold gap is present, inconsistency = 10184) between these two alignments
Real Alignment 2: 3523090 3523261 | 2808 2637 | 172 172 | 100.0 | gi_49175990_ref_NC_000913.2__Escherichia_coli_str._K_12_substr._MG1655__complete_genome Contig_390
- The detailed misassemblies report (
.../contigs_reports/misassemblies_report.txt
) now counts contig and scaffold misassemblies separately.
Example (for metaquast run, so interspecies translocations are present):
Assembly meta_contigs_1 meta_contigs_2
# misassemblies 3 4
# contig misassemblies 2 4
# c. relocations 1 1
# c. translocations 0 0
# c. inversions 0 0
# c. interspecies translocations 1 3
# scaffold misassemblies 1 0
# s. relocations 0 0
# s. translocations 0 0
# s. inversions 0 0
# s. interspecies translocations 1 0
# misassembled contigs 2 2
....
- The rest reports remain unchanged. E.g. HTML report does not distinguish between contig and scaffold misassemblies. Probably we will add this separation there too but let's test this feature with the text report only.
- Not directly related to this issue but you may be also interested in this: we also added "scaffold gap size local misassemblies" which are a combination of "scaffold gap size misassemblies" and "local misassemblies". Previously, if there was a scaffold gap size misassembly but inconsistency is small (less than extensive misassembly threshold, e.g. 1 kbp by default) we reported a local misassembly. Now we report "scaffold gap loc. mis." in such cases and these misassemblies are excluded from the number of local misassemblies.
from quast.
Nice! Thanks! I'm excited for the upcoming 5.0 release of QUAST.
from quast.
Related Issues (20)
- Error while creating Krona plots HOT 1
- ERROR! File not found (contigs) running on biocontainers/quast:5.2.0 HOT 4
- MetaQUAST 5.2.0 reference genome "not in list" error when "Summarizing results..."
- metaquast coloring error for > 14 samples
- OSError: [Errno 22]
- Error creating Krona plots in metaquast.py HOT 8
- Version of installed minimap2 differs from its version in the QUAST package (2.24) HOT 1
- Augustus busco not running with a conda install
- ERROR! Skipping S7KLEB_S14_L001_R2_001.fastq.00.0_0.cor.fastq.gz because it contains non-ACGTN characters. HOT 1
- [Errno 30] Read-only file system: '/usr/local/lib/python3.10/dist-packages/quast_libs/bwa/make.log' HOT 2
- Issues with the installation
- Issues about 'quast-download-busco' HOT 2
- minimap2 cannot works correctly in the "Running Contig Analyzer" step
- Error occured while running Busco HOT 2
- No space left on device HOT 5
- Bam files in output HOT 1
- What size of genome fraction can be considered low?
- Issues running quast.
- invalid literal for int() with base 10: 'START_A'
- mapped reads count in reads_report
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from quast.