Giter Club home page Giter Club logo

Comments (30)

sidorov-si avatar sidorov-si commented on July 23, 2024 1

from quast.

sjackman avatar sjackman commented on July 23, 2024

Can you confirm that 10 or more Ns in the assembly is considered a scaffold gap, for the purpose of classifying scaffold gap size mis.?

from quast.

alexeigurevich avatar alexeigurevich commented on July 23, 2024

Well, technically yes, 10 or more Ns may be considered a scaffold gap. This is not the only condition of course. Lets assume there are two aligned fragments in a scaffold, A1 and A2. To consider a scaffold gap misassembly between them, we check:

  • there is a misassembly between A1 and A2 (using regular misassembly definition), the size of inconsistency between reference and scaffold positions of A1 and A2 is less than scaffolds_gap_threshold (10 kbp, by default)
  • both A1 and A2 align on the same strand of the same chromosome (i.e., the misassembly is NOT an inversion or a translocation)
  • there is a sequence of nucleotides between A1 and A2 in the scaffold, the sequence should have length at least 10 (here it is!) and at least 95% of the nucleotides should be Ns.

from quast.

sjackman avatar sjackman commented on July 23, 2024

and at least 95% of the nucleotides should be Ns

Ahah! That explains the behaviour that I'm seeing. I have a scaffold gap of 100 Ns (arbitrary number of Ns, because the size of the gap is not estimated). There's a single SNV 10 bp from the end of A1, which prevents that 10 bp from aligning to the reference. So the unaligned sequence is 100 bp composed of 10 nucleotides and 100 Ns. This is being categorized as a misassembly rather than a scaffold gap size mis. because 100/110 = 91% < 95%.

With the default ---min-identity=95 up to 20 nucleotides at the end of an alignment can be unaligned, due to a single SNV. If 20 nucleotides are unaligned from both sides, that's 40 unaligned nucleotides. To exceed the 95% N threshold, the gap would have to be larger than 800 Ns. Shorter gaps will be considered a misassembly.

from quast.

alexeigurevich avatar alexeigurevich commented on July 23, 2024

Well, we can add some absolute thresholds in addition to 95% to exclude penalising of such small scaffold gaps. Do you think this is reasonable?
Actually, we expected larger stretches of Ns from scaffolders when they are not sure about the gap size (like 1000 or 2000 Ns).

from quast.

sjackman avatar sjackman commented on July 23, 2024

It looks to me like this code:

return gap_in_contig.count('N') / len(gap_in_contig) >= qconfig.gap_filled_ns_threshold

should match this code:
if count_ns / float(unaligned_len) < qconfig.gap_filled_ns_threshold and unaligned_len - count_ns >= qconfig.unaligned_part_size:

and check that the number of unaligned nucleotides is larger than qconfig.unaligned_part_size.

from quast.

sjackman avatar sjackman commented on July 23, 2024

In my opinion, I think the ratio check can be dropped. If there's at least 10 Ns, and the number of unaligned non-N nucleotides, is less than a fixed threshold, it should be considered a scaffold gap.

from quast.

sjackman avatar sjackman commented on July 23, 2024

Here's a case where QUAST is incorrectly identifying a scaffold gap as a relocation misassembly, but it's not due to failing the Ns ration test.

A gist of the sequence (42 kbp):
https://gist.github.com/sjackman/80d7021b76a2d82b2f1e1e50c1649282

The minimap2 alignment:

10686884	42259	14361	42254	-	CM000669.2	159345973	4313215	4341083	23453	27909	60	tp:A:P	cm:i:2193	s1:i:23442	s2:i:542
10686884	42259	16	4311	-	CM000669.2	159345973	4341196	4345492	3787	4309	60	tp:A:P	cm:i:367	s1:i:3781	s2:i:77

The QUAST report:

10686884_32279_0_470812__..._10458231
Extensive misassembly (relocation, inconsistency = -9937) between 4319 1 and 42259 14331
4341189	4345508	4319	1	CM000669.2_Homo_sapiens_chromosome_7__GRCh38_reference_primary_assembly	10686884_32279_0_470812__..._10458231	99.12		True
relocation, inconsistency = -9937
4313211	4341114	42259	14331	CM000669.2_Homo_sapiens_chromosome_7__GRCh38_reference_primary_assembly	10686884_32279_0_470812__..._10458231	99.56		True
CONTIG	10686884_32279_0_470812__..._10458231	42259	misassembled

The unaligned sequence of the contig between 10686884:4320-14330 is composed of 11 nucleotides and 10,000 Ns. So it passes the 95% Ns ration test. Any idea why it's not being classified as a scaffold gap?

from quast.

sjackman avatar sjackman commented on July 23, 2024

I believe this line of code has a bug with Python 2:

return gap_in_contig.count('N') / len(gap_in_contig) >= qconfig.gap_filled_ns_threshold

It's doing an integer division so the result is an integer, either 0 or 1. This test only passes if the sequence in the gap is composed entirely 100% of Ns. This bug wouldn't affect Python 3, where this division would yield a float.

May I suggest removing the ratio test, and replacing it with:

return gap_in_contig.count('N') >= qconfig.Ns_break_threshold

from quast.

sjackman avatar sjackman commented on July 23, 2024

I've opened PR #47 to address this issue.

from quast.

alexeigurevich avatar alexeigurevich commented on July 23, 2024

Thanks for debugging this issue! Yes, it was a division bug, worked correctly in Python 3 only.
The proper fix is to add
from __future__ import division
in the beginning of the module but since we decide to drop the ratio test, it is not needed here (but probably needed in other places, e.g. in your example from quast/quast_libs/ca_utils/analyze_contigs.py).

As for the suggested fix -- I do not fully agree with it. E.g., if you have a 10 000 bp long gap (inconsistency with reference) in the contig and there is just 10 Ns there (even not necessary consecutive, just a single N per 1 kbp) this will be treated as "scaffold gap size mis" according to your fix. And I am not sure that this is desired behaviour :)

I suggest to change to something like this:

gap_len = len(gap_in_contig)
if gap_len < qconfig.Ns_break_threshold:
    return False`
side_len = min(qconfig.MAX_INDEL_LENGTH, (gap_len - qconfig.Ns_break_threshold) // 2)
return gap_in_contig[X:-X] == len() * 'N')

That is, we ask for at least 10 Ns ("Ns_break_threshold") in the gap and also allow up to 50 ("MAX_INDEL_LENGTH") non-Ns at both sides of the gap. At the same time, all nucleotides inside the gap should be strictly Ns, which is a typical pattern for scaffold gaps.
Do you agree?

from quast.

sjackman avatar sjackman commented on July 23, 2024

If there were a 10 kbp gap, I'd expect the 10 kbp of sequence to align somewhere else in the reference, and so it would be classified as a misassembly (either a translocation or relocation). If there's sequence that does not align to the reference at all, it's more likely a reference issue then an assembly issue, either sequence found in this individual and missing from the reference, or a >5% diverged mobile genetic element.

MAX_INDEL_LENGTH (50 bp) is an appropriate threshold in the middle of an alignment, but it's too small at the ends of alignments. For example, requiring 95% alignment identity, a 10 bp indel and a single SNV (normal individual variation) at the end of contig can cause 50 bp to go unaligned, and so incorrectly be classified as a misassembly rather than a scaffold gap.

I feel that an alignment gap should be classified as a scaffold gap as long as the gap contains more than 10 consecutive Ns, and the alignment blocks to the left and right of the alignment gap are collinear.

from quast.

sjackman avatar sjackman commented on July 23, 2024

If there is an unaligned segment in the middle of a contig, I believe it's detected by this function process_unaligned_part, which increments possible_misassemblies.

if count_ns / float(unaligned_len) < qconfig.gap_filled_ns_threshold and unaligned_len - count_ns >= qconfig.unaligned_part_size:
possible_misassemblies = 1

Where does that unaligned part get counted in the metrics? My guess is either # unaligned mis. contigs or unaligned contigs (partial). Which one would it be? Does it also get counted as # misassembled contigs and/or # misassemblies?

from quast.

sjackman avatar sjackman commented on July 23, 2024

even not necessary consecutive, just a single N per 1 kbp

Good catch. I've fixed up my PR #47:

def is_gap_filled_ns(contig_seq, align1, align2):
    gap_in_contig = contig_seq[align1.end(): align2.start() - 1]
    return 'N' * qconfig.Ns_break_threshold in gap_in_contig

from quast.

sjackman avatar sjackman commented on July 23, 2024

gap_in_contig = contig_seq[align1.end(): align2.start() - 1]

PAF uses 0-based coordinates. Does QUAST use 1-based coordinates internally?
https://github.com/lh3/miniasm/blob/master/PAF.md

from quast.

alexeigurevich avatar alexeigurevich commented on July 23, 2024

Will answer to all questions here:

If there were a 10 kbp gap, I'd expect the 10 kbp of sequence to align somewhere else in the reference, and so it would be classified as a misassembly (either a translocation or relocation). If there's sequence that does not align to the reference at all, it's more likely a reference issue then an assembly issue, either sequence found in this individual and missing from the reference, or a >5% diverged mobile genetic element.

I feel that an alignment gap should be classified as a scaffold gap as long as the gap contains more than 10 consecutive Ns, and the alignment blocks to the left and right of the alignment gap are collinear.

This totally make sense but note that this code is also used in MetaQUAST which deals with some complex cases like absence of some reference genomes, presence of very close reference genomes, and so on. I am afraid mostly about this use case. However, probably your suggestion make sense in the metagenomic case, too. I applied your PR locally and started the test on a medium-size metagenomic dataset and will compare results after/before.

Good catch. I've fixed up my PR #47:

Now I like it more :) Lets wait until my metagenomic test is finished and we probably merge this PR.

Does QUAST use 1-based coordinates internally?

Yes, we use 1-based coords since QUAST was originally based on Nucmer/Mummer which reports 1-based alignments. After recent switch to minimap2 we decided not to change internal representations (it is very painful!)

Where does that unaligned part get counted in the metrics? My guess is either # unaligned mis. contigs or unaligned contigs (partial). Which one would it be? Does it also get counted as # misassembled contigs and/or # misassemblies?

Most of the answers are in our manual (just search for the corresponding metric names). Briefly speaking: possible misassembly is a separate event which is reported in MetaQUAST mode only. However, it is always happen in partially unaligned contigs (but the opposite is not always true -- not all partially unaligned contigs have possible misassemblies). As for the # misassemblies -- no, they include only relocations/translocations/inversions and do NOT include local misassemblies, possible misassemblies, unaligned mis. contigs, scaffold gap size misassemblies, structural variations, etc.

from quast.

alexeigurevich avatar alexeigurevich commented on July 23, 2024

My metagenomic test of the PR passed correctly, so I merged it. Thanks, @sjackman !

By the way, are you still interested in the topic of this issue (reporting number of relocations which could be possibly counted as scaffold gap errors)?

from quast.

sjackman avatar sjackman commented on July 23, 2024

Yes, I'd be interested to see # misassembles and possibly its three subcategories broken down by whether they're involved in a scaffold gap or not. It's helpful to know whether the error was introduced by the unitigger or the scaffolder. I found that the size of each gap size estimation error is in contigs_report_*.stdout, when the gap size error is between [extensive misassembly size, scaffold gap size] ([7k, 10k] by default). It'd be helpful to print a similar message in .stdout when the inconsistency is <7k or >10k.

from quast.

sidorov-si avatar sidorov-si commented on July 23, 2024

from quast.

sjackman avatar sjackman commented on July 23, 2024

@sidorov-si ABySS-Samtobreak reports scaffolding errors. It takes a SAM file of aligned scaftigs and produces a SAM file of non-colinear aligned scaftigs. It may be useful to you. It's built by default when you compile ABySS when you have ghc in your PATH.

sudo apt-get install ghc
# or
brew install ghc

https://github.com/bcgsc/abyss/tree/master/Misc
Cut your scaffolds into scaftigs using seqtk cutN -n1 and then align them with either bwa mem -x intractg or minimap2 -x asm10.

I'm looking forward to similar functionality from QUAST! =)

from quast.

sjackman avatar sjackman commented on July 23, 2024

IGV can also view PAF files, though it doesn't currently visualize the alignment in the CIGAR strings, only the alignment blocks.

from quast.

alexeigurevich avatar alexeigurevich commented on July 23, 2024

Ok, this feature looks useful for you and probably for the whole community, too :)
We will add it soon!

Few notes:

Yes, I'd be interested to see # misassembles and possibly its three subcategories broken down by whether they're involved in a scaffold gap or not.

Hmm, only relocations could be potentially a scaffold gap (in case of increased --scaffold-gap-max-size), while inversions and translocations could not. At least using our definition of scaffold gaps. Am I wrong?

I found that the size of each gap size estimation error is in contigs_report_*.stdout, when the gap size error is between [extensive misassembly size, scaffold gap size] ([7k, 10k] by default). It'd be helpful to print a similar message in .stdout when the inconsistency is <7k or >10k.

Are you talking about messages like "inconsistency = ..."? Actually we report them already for <7k and >10k. For local misassemblies (<7k by default in --large mode and <1k by default, otherwise) they look like
Gap between these two alignments (local misassembly). Inconsistency = -190
For relocations they look like:
Extensive misassembly (relocation, inconsistency = 1886198) between these two alignments
(note the number is huge).
In case of scaffold gap misassemblies we report "gap length difference".
We do NOT report inconsistency in case of inversions and translocations because it cannot be properly defined in this case.

from quast.

sjackman avatar sjackman commented on July 23, 2024

Yes, I'd be interested to see # misassembles and possibly its three subcategories broken down by whether they're involved in a scaffold gap or not.

Hmm, only relocations could be potentially a scaffold gap (in case of increased --scaffold-gap-max-size), while inversions and translocations could not. At least using our definition of scaffold gaps. Am I wrong?

Say in the assembly there's a scaffold composed of two contigs A and B with a gap of 10 or more Ns between them. If A and B map for then 10,000 bp apart, it's a scaffolding relocation misassembly. If they map to different chromosomes, it's a scaffolding translocation misassembly, and if they map to the same chromosome with different orientations, it's also a scaffold orientation misassembly. You could perhaps define a scaffold inversion misassembly as three colinear contigs A B' C, where is B is inverted with respect to A and C.

from quast.

sjackman avatar sjackman commented on July 23, 2024

ca_output.stdout_f.write('\t\t\t Incorrectly estimated size of scaffold gap between these two alignments: ')

This message is reported when there's a scaffold gap inconsistency in the range of [7k, 10k].

Incorrectly estimated size of scaffold gap between these two alignments: gap length difference = 8509

For relocations they look like:
Extensive misassembly (relocation, inconsistency = 1886198) between these two alignments

This message doesn't specify whether there's a scaffold gap of more than 10 Ns on the query between the two alignment blocks. It could be either a contig or a scaffolding error.

We do NOT report inconsistency in case of inversions and translocations because it cannot be properly defined in this case.

Whether the alignment break involves a scaffold gap of Ns or not can be reported.

from quast.

alexeigurevich avatar alexeigurevich commented on July 23, 2024

Ok, got it.
Note that the currently implemented scaffold gap size misassemblies are excluded from # misassemblies, since we consider them as not severe errors (like local misassemblies).
Suggested novel metrics (scaffold relocations, translocations, inversions) will BE counted in # misassemblies. In fact it is an extended classification of them in addition to regular relocations, translocation, inversions. I hope this handful of novel metrics will not be confusing in the main report :)

... if they map to the same chromosome with different orientations, it's also a scaffold orientation misassembly. You could perhaps define a scaffold inversion misassembly as three colinear contigs A B' C, where is B is inverted with respect to A and C.

Do you mind if we call your scaffold orientation misassembly as scaffold inversion for consistency with our regular inversions? And if we also do not report your scaffold inversion misassembly, since all our current misassemblies are "breakpoint-like", i.e. defined as an even between two adjacent blocks.

from quast.

sjackman avatar sjackman commented on July 23, 2024

Yep, that all makes sense me to me.

I hope this handful of novel metrics will not be confusing in the main report :)

I don't believe this breakdown needs to be in the main report, since relocations, translocations, inversions aren't already reported in the main report. I'd suggest they go in misassemblies_report.txt. Perhaps something like…

# misassemblies              1200
  sequence                   600
    # relocations            300
    # translocations         200
    # inversions             100
  scaffold                   600
    # relocations            300
    # translocations         200
    # inversions             100

from quast.

alexeigurevich avatar alexeigurevich commented on July 23, 2024

Note that relocations, translocations, inversions aren't in report.txt but they are present in report.html (after pressing "extended report" button). However, your suggested report structure looks pretty nice to me, I think we can report them in a similar fashion in HTML report, too.

from quast.

sjackman avatar sjackman commented on July 23, 2024

Great! contig may be a better name than sequence for the subsection.

from quast.

alexeigurevich avatar alexeigurevich commented on July 23, 2024

We are finalizing v.5.0 release and this feature was added there in the following format.

  1. The detailed log file (contigs_report_*.stdout) now includes additional labels "scaffold gap is present" for misassemblies containing the gap but not marked as "scaffold gap mis size" misassembly (e.g. because of too large inconsistency).
    Example:
Real Alignment 1: 3533883 3536114 | 2199 1 | 2232 2199 | 95.18 | gi_49175990_ref_NC_000913.2__Escherichia_coli_str._K_12_substr._MG1655__complete_genome Contig_390
       Extensive misassembly (relocation, scaffold gap is present, inconsistency = 10184) between these two alignments
Real Alignment 2: 3523090 3523261 | 2808 2637 | 172 172 | 100.0 | gi_49175990_ref_NC_000913.2__Escherichia_coli_str._K_12_substr._MG1655__complete_genome Contig_390
  1. The detailed misassemblies report (.../contigs_reports/misassemblies_report.txt) now counts contig and scaffold misassemblies separately.
    Example (for metaquast run, so interspecies translocations are present):
Assembly                              meta_contigs_1  meta_contigs_2
# misassemblies                       3               4             
  # contig misassemblies              2               4             
    # c. relocations                  1               1             
    # c. translocations               0               0             
    # c. inversions                   0               0             
    # c. interspecies translocations  1               3             
  # scaffold misassemblies            1               0             
    # s. relocations                  0               0             
    # s. translocations               0               0             
    # s. inversions                   0               0             
    # s. interspecies translocations  1               0             
# misassembled contigs                2               2 
....
  1. The rest reports remain unchanged. E.g. HTML report does not distinguish between contig and scaffold misassemblies. Probably we will add this separation there too but let's test this feature with the text report only.
  2. Not directly related to this issue but you may be also interested in this: we also added "scaffold gap size local misassemblies" which are a combination of "scaffold gap size misassemblies" and "local misassemblies". Previously, if there was a scaffold gap size misassembly but inconsistency is small (less than extensive misassembly threshold, e.g. 1 kbp by default) we reported a local misassembly. Now we report "scaffold gap loc. mis." in such cases and these misassemblies are excluded from the number of local misassemblies.

from quast.

sjackman avatar sjackman commented on July 23, 2024

Nice! Thanks! I'm excited for the upcoming 5.0 release of QUAST.

from quast.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.