Comments (6)

jermp commented on August 17, 2024

I'm also CCing @jnalanko, who might be interested. Have you also noticed this?

I can confirm that different runs of Themisto may report a different number of color sets, e.g., 4,236,355 vs. 4,236,354, for the same issue.

from ggcat.

jnalanko commented on August 17, 2024

Hi, @jermp. No, I have not noticed this. Thankfully, the correctness of Themisto should not be affected either, as long as the unitigs cover all k-mers at least once and the colors of the unitigs are not wrong.

jermp commented on August 17, 2024

Yes, exactly. Both Themisto and Fulgor are correct anyway, since the number of kmers always seems to be correct and consistent. Just letting you know.

Guilucand commented on August 17, 2024

Hi @jermp, the difference in color subsets is due to a race condition (partially intentional, for performance) when updating the hash table of the colors.

For efficiency and technical reasons, a new color is first written to disk and then added to an in-memory hashmap that is used to deduplicate equal colors, without holding a global lock.
So when two kmers with the same (never seen before) color set are processed at the same time, they can get different color indexes (which in practice refer to the same set of colors), and each can write its own copy before noticing that the other kmer had the same color.
To avoid this problem, ggcat would have to take a global lock each time it writes a new color to disk and hold it until the in-memory hashmap is updated (with the index returned by the disk write), which slows down parallel insertions.
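
The write-then-update sequence and the global-lock fix described above can be sketched roughly as follows. This is only an illustration, not ggcat's actual code: `ColorStore`, `get_or_insert_racy`, `get_or_insert_locked`, and `distinct_after_parallel_inserts` are hypothetical names, and a `Vec` stands in for the on-disk color table.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the color storage (not ggcat's real types).
struct ColorStore {
    /// Stands in for the on-disk table; the position is the color index.
    disk: Mutex<Vec<Vec<u32>>>,
    /// In-memory map used to deduplicate equal color sets.
    index_of: Mutex<HashMap<Vec<u32>, usize>>,
    /// Global lock, used only by the locked variant.
    write_lock: Mutex<()>,
}

impl ColorStore {
    fn new() -> Self {
        ColorStore {
            disk: Mutex::new(Vec::new()),
            index_of: Mutex::new(HashMap::new()),
            write_lock: Mutex::new(()),
        }
    }
}

/// Lookup, write, and map update are three separate critical sections.
/// Two threads can both miss the lookup for the same new color set and
/// both write it, ending up with two indexes for one set.
fn get_or_insert_racy(store: &ColorStore, colors: &[u32]) -> usize {
    let seen = store.index_of.lock().unwrap().get(colors).copied();
    if let Some(idx) = seen {
        return idx;
    }
    let idx = {
        let mut disk = store.disk.lock().unwrap();
        disk.push(colors.to_vec());
        disk.len() - 1
    };
    store.index_of.lock().unwrap().insert(colors.to_vec(), idx);
    idx
}

/// Same steps, but serialized by one global lock held from the lookup
/// until the map is updated, so each color set is written exactly once.
fn get_or_insert_locked(store: &ColorStore, colors: &[u32]) -> usize {
    let _guard = store.write_lock.lock().unwrap();
    get_or_insert_racy(store, colors)
}

/// Inserts the same color set from several threads through the locked
/// path and returns how many copies ended up in the "disk" table.
fn distinct_after_parallel_inserts(threads: usize) -> usize {
    let store = Arc::new(ColorStore::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let store = Arc::clone(&store);
            thread::spawn(move || get_or_insert_locked(&store, &[1, 5, 9]))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let written = store.disk.lock().unwrap().len();
    written
}
```

With the locked path, `distinct_after_parallel_inserts(8)` stores a single copy of the color set. Calling `get_or_insert_racy` from several threads instead may store more than one copy, depending on timing, which matches the run-to-run variation reported above.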

In practice, this should not lead to incorrect results unless you check for color equality only by comparing color subset indexes without reading the colormap. The extra space for storing the (few) duplicate colors is very small, and the computed compacted graph always has the same set of maximal unitigs.

The difference in unitig counts you're seeing is also due to the problem above: when two adjacent kmers in the same maximal unitig have different color indexes (even if they refer to the same color set), the unitig is split into multiple unitigs.
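
As a toy illustration of that splitting effect (assuming one color index per kmer along a maximal unitig; `split_runs` is a hypothetical helper, not a ggcat function):

```rust
use std::ops::Range;

/// Splits the kmer positions of one maximal unitig into runs that share
/// the same color index. Each run would be emitted as a separate unitig.
fn split_runs(color_idx: &[usize]) -> Vec<Range<usize>> {
    let mut runs = Vec::new();
    let mut start = 0;
    for i in 1..=color_idx.len() {
        // Close the current run at the end of the input or when the index changes.
        if i == color_idx.len() || color_idx[i] != color_idx[start] {
            runs.push(start..i);
            start = i;
        }
    }
    runs
}
```

If indexes 3 and 7 were assigned to the same color set by two racing threads, `split_runs(&[3, 3, 7, 7])` yields two runs (`0..2` and `2..4`), whereas the deduplicated input `&[3, 3, 3, 3]` yields the single run `0..4`.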

Also, this problem tends to be more visible with very small datasets, where there is a high chance that two kmers processed by different threads at the exact same time share the same color.

If you need the color sets to be always distinct I can try to see if there is a way to ensure that, maybe putting this requirement under an optional feature flag if it hurts performance.

jermp commented on August 17, 2024

Hi @Guilucand, and thank you for confirming.
I do not think it is a hard requirement. Indeed, correctness is not impacted if we have some duplicated colors.
Any thoughts here, @rob-p?

rob-p commented on August 17, 2024

I agree it's not a hard requirement, but it would be very nice to have consistency between runs and to know the true number of colors. Differences are small in our example, but maybe could grow larger for very similar pangenomes and tons of threads.

How is color equality checked, currently? As map contention is very likely to be low anyway, perhaps you could try a sharded map like DashMap to avoid almost all lock contention?
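
To make the sharding idea concrete, here is a minimal sketch using only the standard library. It does not use DashMap's API; `ShardedColorMap` and `get_or_insert_with` are hypothetical names, and the `write` closure stands in for the disk write that returns the new color index.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Hypothetical sharded dedup map: each shard has its own lock, so only
/// color sets that hash to the same shard contend with each other.
struct ShardedColorMap {
    shards: Vec<Mutex<HashMap<Vec<u32>, usize>>>,
}

impl ShardedColorMap {
    fn new(shard_count: usize) -> Self {
        // At least one shard, to avoid a modulo by zero below.
        let n = shard_count.max(1);
        ShardedColorMap {
            shards: (0..n).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    /// Picks the shard from the hash of the color set.
    fn shard_for(&self, colors: &[u32]) -> &Mutex<HashMap<Vec<u32>, usize>> {
        let mut hasher = DefaultHasher::new();
        colors.hash(&mut hasher);
        &self.shards[(hasher.finish() as usize) % self.shards.len()]
    }

    /// Returns the existing index for `colors`, or calls `write` once and
    /// records its result. The shard lock is held across lookup, write,
    /// and insert, so equal color sets are never written twice.
    fn get_or_insert_with(&self, colors: &[u32], write: impl FnOnce() -> usize) -> usize {
        let mut shard = self.shard_for(colors).lock().unwrap();
        if let Some(&idx) = shard.get(colors) {
            return idx;
        }
        let idx = write();
        shard.insert(colors.to_vec(), idx);
        idx
    }
}
```

Equal color sets always hash to the same shard, so the check-then-write stays atomic for them, while unrelated color sets mostly proceed in parallel. The trade-off is that the write happens while a shard lock is held, so a slow write still blocks other color sets in that shard.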

Also, thanks for the quick response!
