Comments (7)

jkbonfield avatar jkbonfield commented on June 26, 2024

I'm not wanting hts-specs to become the arbiter of such things when there are entire (much better staffed and supported) groups that are handling this already. We also run the risk of inventing a code ourselves and then ChEBI inventing a different code, or reusing our code for another base type. Long term this wouldn't aid anyone. Rather we should just track and mirror the official ChEBI nomenclature instead. If there are short codes documented there that aren't in our spec, then I think it's fine to add them in. If there is something missing that needs adding, it should be raised with ChEBI itself to go via their channels first. You make a compelling case for 4mC so I'd hope they will consider it.

There was discussion at some point about creating local codes, where we could put a code in a header with a ChEBI ID and then refer to that code within the data. However, it adds complexity (potentially a lot when merging files), extending headers is hard to do given the state of a lot of software, and ultimately it saves very little. Rather, it may be best to simply use the header comment fields to do the reverse: document the ChEBI codes so people looking at the data can see what they are without having to hunt down the definitions.

from hts-specs.

marcus1487 avatar marcus1487 commented on June 26, 2024

As for being the arbiter of the single letter codes, I completely understand the reasoning for avoiding this, but in the absence of a pointer to an arbiter, the table in the SAM tags spec sort of becomes the de facto arbiter. In fact, looking at the 5mC ChEBI page, I don't see a specific mention of m as the single letter code. The only place I know of that specifies the single letter codes would be @michaelmhoffman's DNA mods database. I'm not sure if Michael would like to claim the role of arbiter for DNA single letter codes, or if there is another source for these codes that we could use instead.

The other issue here is not whether one can determine the modified base of interest, but how easily one can identify it. When using ChEBI codes, most users will interact with these codes in a genome browser and see integer labels for the various modifications of interest. They would then have to refer back to the SAM header or look up the ChEBI code to figure out which modified base this is. So there is certainly some added value in using single letter codes for the most common modified bases. I'm happy to make the case for 4mC wherever it would carry the most weight, but that does not seem to be ChEBI to me.

The annotation of the modified base codes used, even for single letter codes, makes sense in the SAM header comment lines. We are aiming to include this in output formats at nanopore.

marcus1487 avatar marcus1487 commented on June 26, 2024

@jkbonfield or others, do you have any thoughts on this topic? It seems that ChEBI is not quite up to the task for this specification. Can you suggest where we might submit a request to have these single letter codes updated?

jkbonfield avatar jkbonfield commented on June 26, 2024

Sorry for the slow reply. No, unfortunately I don't know who is best placed here. Our original table was taken from the Viner et al. paper (https://www.biorxiv.org/content/10.1101/043794v1), which was basically written by experts in the field. None of the SAM maintainers are qualified to be deciding on this sort of thing, and even if we were, we'd just run the risk of forking things and causing multiple nomenclatures to appear.

I, apparently wrongly, assumed that the other short codes would make their way into ChEBI as that's also referred to in the paper, but it may not have happened. I still think ChEBI feels like the natural place to go to, but all I can advise is speaking directly with them or Coby Viner to ask how to get new codes accepted. We're happy to follow the community consensus here, but it does need consensus first.

jkbonfield avatar jkbonfield commented on June 26, 2024

I think some of this could also be addressed by the genome browsers. For example, when they see a base mod 21839, they could turn it into a tooltip linking to https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:21839. We could obviously add some comments to the SAM headers, but structured comments feel error prone and they don't really solve anything, as a genome browser won't be looking there without an update, in which case pointing to ChEBI instead feels like the more natural fix.
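As a rough illustration of the tooltip idea, here is a minimal Python sketch that pulls the numeric (ChEBI) modification codes out of an MM tag value and builds the lookup URL quoted above. The function name `chebi_links` and the regular expression are assumptions made for this example (the pattern is only a loose reading of the MM tag syntax in the SAM tags spec), not part of any existing tool.

```python
import re

# URL pattern taken from the comment above.
CHEBI_URL = "https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:{}"

# One MM entry looks like "C+m,5,12,0" or "C+21839?,3" (assumed simplified pattern).
_MM_ENTRY = re.compile(r"^([ACGTUN])([-+])([a-z]+|[0-9]+)([.?]?)((?:,[0-9]+)*)$")


def chebi_links(mm_value):
    """Return {chebi_id: url} for each numeric modification code in an MM tag value."""
    links = {}
    for entry in mm_value.split(";"):
        if not entry:
            continue  # skip the empty piece after the trailing ';'
        match = _MM_ENTRY.match(entry)
        if match is None:
            continue  # ignore anything this simplified pattern does not recognise
        code = match.group(3)
        if code.isdigit():
            # Numeric codes are ChEBI IDs; letter codes such as "m" are skipped.
            links[int(code)] = CHEBI_URL.format(code)
    return links


links = chebi_links("C+m,5,12,0;C+21839,3,1;")
```

A browser could then attach `links[21839]` to the integer label it displays, without needing anything from the SAM header.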

I've also pinged @michaelmhoffman regarding whether there is a way to add new codes to the DNA mods database.

jkbonfield avatar jkbonfield commented on June 26, 2024

GA4GH File Formats isn't willing to be the maintainer for such things and the view from upstream is that it's too premature to add new short codes for these, so for now all I can recommend is adding @CO tags to annotate the SAM file for humans.
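For anyone wanting to do this, a minimal Python sketch of generating such @CO lines follows. The wording of the comment text and the helper name `co_lines` are purely illustrative assumptions; the SAM spec only defines @CO as a one-line free-text comment, so no particular layout is implied or standardised here.

```python
def co_lines(mod_labels):
    """Build @CO header lines documenting base modification codes.

    mod_labels maps a code (a ChEBI integer or a short letter code) to a
    free-text description supplied by the caller.
    """
    lines = []
    for code, label in mod_labels.items():
        text = f"base modification {code}: {label}"
        if isinstance(code, int):
            # Integer codes are ChEBI IDs, so spell the identifier out for readers.
            text += f" (CHEBI:{code})"
        # @CO is a single-line comment, so collapse tabs and newlines to spaces.
        text = " ".join(text.split())
        lines.append("@CO\t" + text)
    return lines


# The description string here is a placeholder chosen by the caller.
header_lines = co_lines({21839: "description supplied by the caller"})
```

These lines are for humans only; as noted earlier in the thread, software should not be expected to parse them.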

jkbonfield avatar jkbonfield commented on June 26, 2024

Closing this as "not planned", for now at least. We don't have the appropriate skills in GA4GH for maintaining such a database, so we'll just follow the community / upstream portals, i.e. ChEBI or DNA Mods DB. If new things appear in there, please do raise a ticket for us to add them to our specifications.
