2020-aa-abund's Issues

Initial musings on salmon component of aa quantification

Hi @taylorreiter,

Thanks for the nice write-up explaining your pipeline! As you point out, this is an off-label use of salmon, so I can't say we've tested this functionality before. However, I don't see an obvious a priori reason why the basic model salmon uses for probabilistic allocation of ambiguous fragments should be problematic in this context. That's not to say the model couldn't be tweaked to work better here, but at least I see no glaring reason why it would be "wrong".

In my mind, the biggest question mark is how the SAM/BAM files are being processed. Specifically, since salmon is designed to work with genomic / transcriptomic data, it assumes a DNA/RNA alphabet. In alignment-based mode, it therefore encodes the input sequences not with an ASCII (or unicode) encoding, but with a bit-packed encoding that saves space (and assumes the DNA/RNA alphabet). Thus, in the case of your input data, it will probably be viewing most of the sequences as consisting of unknown characters, and replacing those, internally, with pseudo-random draws from {A,C,G,T}. If I look through the logs you uploaded, I see some evidence of this:

[2020-01-17 23:02:50.265] [jointLog] [info] replaced 111,201 non-ACGT nucleotides with random nucleotides
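To make the effect concrete, here is a minimal sketch of a 2-bit packed encoding that substitutes a random nucleotide for any character outside {A,C,G,T}. This is not salmon's actual code; the function names, the seeded generator, and the example peptide are all made up for illustration.

```python
import random

# 2-bit codes for the DNA alphabet; anything else has no code.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq, rng):
    """Pack seq into an integer, 2 bits per character.

    Characters outside {A,C,G,T} are replaced by a random nucleotide.
    Returns (packed_value, number_of_replaced_characters).
    """
    packed = 0
    replaced = 0
    for ch in seq.upper():
        code = CODES.get(ch)
        if code is None:
            code = rng.randrange(4)  # unknown character: pick a random base
            replaced += 1
        packed = (packed << 2) | code
    return packed, replaced


def unpack_2bit(packed, length):
    """Decode a packed integer back into a string over {A,C,G,T}."""
    return "".join(
        BASES[(packed >> (2 * (length - 1 - i))) & 3] for i in range(length)
    )


rng = random.Random(0)  # seeded only so the illustration is repeatable
peptide = "MKTAYIAK"    # hypothetical peptide "read"
packed, n_replaced = pack_2bit(peptide, rng)
decoded = unpack_2bit(packed, len(peptide))
print(decoded, n_replaced)
```

Note that the amino acid letters A, C, G and T happen to be valid nucleotide codes too, so under a scheme like this they would survive the round trip while every other residue would be randomised.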

Now, here is where things get interesting. Normally, this would be a problem, because in alignment-based mode, salmon examines the alignments, walking the nucleotides of the "read" and of the "reference" and processing the CIGAR string from the corresponding alignment record. It does this to simultaneously score each alignment (to compute an accurate conditional probability of this reference location producing the observed string), and also to "learn" from the data the parameters of the alignment model. Specifically, salmon uses a spatially-varying Markov model to learn the probability of observing a particular tuple of (ref_base, read_base, cigar_op) given the preceding state of the alignment. All of this is done by default in salmon, and can be disabled, if need be, by passing the flag --noErrorModel.
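As a rough illustration of what "walking" an alignment means, the sketch below steps through a CIGAR string and emits one (ref_base, read_base, cigar_op) tuple per aligned position. It follows the standard SAM meaning of the CIGAR operations, but it is not salmon's implementation; the function name and example sequences are made up.

```python
import re

# One CIGAR element: a run length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def walk_alignment(read, ref, cigar, ref_start=0):
    """Yield (ref_base, read_base, op) tuples; '-' marks a gap."""
    i = 0          # position in the read
    j = ref_start  # position in the reference
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":    # consumes both read and reference
            for _ in range(n):
                yield ref[j], read[i], op
                i += 1
                j += 1
        elif op in "IS":   # consumes the read only
            for _ in range(n):
                yield "-", read[i], op
                i += 1
        elif op in "DN":   # consumes the reference only
            for _ in range(n):
                yield ref[j], "-", op
                j += 1
        # H and P consume neither sequence


tuples = list(walk_alignment("ACGTA", "ACTAG", "2M1I2M"))
print(tuples)

# A SAM record whose CIGAR field is "*" (unavailable) yields no tuples.
empty = list(walk_alignment("MKT", "MKT", "*"))
print(empty)
```

A model trained on such tuples has nothing to learn from when the CIGAR field is "*", since the walk produces no observations at all.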

However, in your case, the SAM file does not seem to contain a CIGAR string. So, there is nothing to process. Thus, the model remains uninformative, and the conditional probabilities that would normally be informed by the alignment score remain flat. This is equivalent to saying that if a read r maps to two different contigs c_1 and c_2, then the conditional probabilities of generating the alignments are equal, i.e. P( aln(r, c_1) | c_1 ) = P( aln(r, c_2) | c_2 ) = 1 / |{c_1, c_2}| = 1/2. Normally, if one is dealing with transcriptome alignments, and the aligner reports both highest-scoring and sub-optimal alignments, then computing meaningful alignment probabilities is crucial. However, if your upstream alignment tool is only reporting equally-best mappings for the peptides, then this simplification of the probabilities is totally reasonable (and what you would get if you went through the motions of scoring the probabilities anyway). So, I think this is probably working out OK :). One thing I might try, just to see if it has any effect: how does the output change, if at all, if you pass --noErrorModel to salmon?
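To show why flat conditional probabilities are harmless here, below is a toy EM for allocating ambiguous reads when every alignment of a read gets the same conditional probability. It is a deliberately simplified sketch, not salmon's inference procedure: it ignores effective lengths and every other model component, and the function name and the example read set are made up.

```python
def em_abundance(read_hits, n_contigs, n_iter=100):
    """Estimate relative contig abundances from multi-mapping reads.

    read_hits: one list of contig indices per read (the contigs it maps to).
    Each alignment of a read gets the same flat conditional probability.
    """
    theta = [1.0 / n_contigs] * n_contigs
    for _ in range(n_iter):
        counts = [0.0] * n_contigs
        for hits in read_hits:
            cond = 1.0 / len(hits)  # flat P(aln | contig), e.g. 1/2 for two hits
            weights = [theta[c] * cond for c in hits]
            total = sum(weights)
            for c, w in zip(hits, weights):
                counts[c] += w / total  # fractional assignment of this read
        n_reads = sum(counts)
        theta = [x / n_reads for x in counts]
    return theta


# Hypothetical data: 3 reads unique to contig 0, 1 unique to contig 1,
# and 2 reads that map equally well to both.
reads = [[0], [0], [0], [1], [0, 1], [0, 1]]
theta = em_abundance(reads, n_contigs=2)
print(theta)
```

Because the flat term is identical for every alignment of a given read, it cancels when the weights are normalised, and the ambiguous reads end up split according to the current abundance estimates alone.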

In closing, this is a really cool off-label use of salmon, and I don't see any a priori reason that it doesn't make sense from a basic model standpoint. I'd be happy to remain in the loop and help out however I can. I'd also be interested in feedback on whether a "protein-level" salmon option / mode might be of broad(-ish) interest to the community, and if so, what additional features and options we could add to support this.

Cheers!
Rob
