The 2020-aa-abund from taylorreiter

Initial musings on salmon component of aa quantification

Thanks for the nice write-up explaining your pipeline! As you point out, this is an off-label use of salmon, so I can't say we've tested this functionality before. However, I don't see, a priori, an obvious reason why the basic model salmon uses for probabilistic allocation of ambiguous fragments should be problematic in this context. That's not to say the model couldn't be tweaked to work better here, but at least there is no glaring reason I see why it would be "wrong".

In my mind, the biggest question mark is how the SAM/BAM files are being processed. Specifically, since salmon is designed to work with genomic / transcriptomic data, it assumes a DNA/RNA alphabet. In alignment-based mode, it therefore encodes the input sequences not using an ASCII (or unicode) encoding, but using a bit-packed encoding to save space (and assuming the DNA/RNA alphabet). Thus, in the case of your input data, it will probably be viewing most of the sequences as consisting of unknown characters, and replacing those, internally, with pseudo-random characters drawn from {A,C,G,T}. If I look through the logs you uploaded, I see some evidence of this:

[2020-01-17 23:02:50.265] [jointLog] [info] replaced 111,201 non-ACGT nucleotides with random nucleotides
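
To make the encoding point concrete, here is a minimal Python sketch of what a 2-bit packing with random replacement could look like. This is not salmon's actual code: the function names, the code table, and the use of Python's `random` module are all assumptions made for illustration.

```python
import random

# Hypothetical 2-bit code table for the DNA alphabet (not salmon's real table).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq, rng):
    """Pack a sequence into an integer using 2 bits per character.

    Any character outside {A,C,G,T} is replaced by a randomly chosen
    nucleotide. Returns (packed_value, number_of_replacements).
    """
    packed = 0
    replaced = 0
    for ch in seq.upper():
        code = CODES.get(ch)
        if code is None:
            # Unknown character: substitute a pseudo-random nucleotide.
            code = rng.randrange(4)
            replaced += 1
        packed = (packed << 2) | code
    return packed, replaced


def unpack_2bit(packed, length):
    """Decode `length` characters from a value produced by pack_2bit."""
    return "".join(
        BASES[(packed >> (2 * (length - 1 - i))) & 3] for i in range(length)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    packed, replaced = pack_2bit("MKVLA", rng)
    # Only the trailing A survives; M, K, V and L are outside the alphabet.
    print(unpack_2bit(packed, 5), replaced)
```

Note that the letters A, C, G and T are also valid amino acid codes (alanine, cysteine, glycine, threonine), so in a sketch like this a peptide keeps those residues unchanged and only the remaining letters get randomized.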

Now, here is where things get interesting. Normally, this would be a problem, because in alignment-based mode, salmon examines the alignments, walking the nucleotides of the "read" and of the "reference" and processing the CIGAR string from the corresponding alignment record. It does this both to score each alignment (to compute an accurate conditional probability of this reference location producing the observed string) and to "learn" the parameters of the alignment model from the data. Specifically, salmon uses a spatially-varying Markov model to learn the probability of observing a particular tuple of (ref_base, read_base, cigar_op) given the preceding state of the alignment. All of this is done by default in salmon, and can be disabled if needed by passing the flag --noErrorModel.
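
As a rough illustration of that walk, the Python sketch below expands a CIGAR string into (ref_base, read_base, cigar_op) states and counts transitions between consecutive states. It is a simplified, first-order stand-in and not salmon's implementation: it ignores the spatially-varying part of the model, and the names `walk_alignment` and `transition_counts` are hypothetical.

```python
import re
from collections import Counter

# Standard SAM CIGAR operations: a length followed by one operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def walk_alignment(ref, read, cigar, ref_start=0):
    """Yield one (ref_base, read_base, cigar_op) tuple per alignment column.

    A "-" marks the side that an operation does not consume.
    """
    r = ref_start  # position in the reference
    q = 0          # position in the read
    for length, op in CIGAR_RE.findall(cigar):
        for _ in range(int(length)):
            if op in "M=X":      # consumes both reference and read
                yield ref[r], read[q], op
                r += 1
                q += 1
            elif op in "IS":     # consumes the read only
                yield "-", read[q], op
                q += 1
            elif op in "DN":     # consumes the reference only
                yield ref[r], "-", op
                r += 1
            # H and P consume neither sequence, so they emit no state.


def transition_counts(states):
    """Count transitions between consecutive alignment states."""
    return Counter(zip(states, states[1:]))


if __name__ == "__main__":
    states = list(walk_alignment("ACGTAC", "ACGAC", "3M1D2M"))
    for (prev, cur), n in transition_counts(states).items():
        print(prev, "->", cur, n)
```

Normalizing such counts per preceding state would give empirical transition probabilities. If the CIGAR field is absent (`*` in SAM), the regular expression matches nothing and the walk produces no states at all.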

However, in your case, the SAM file does not seem to contain a CIGAR string. So, there is nothing to process. Thus, the model remains uninformative, and the conditional probabilities that would normally be informed by the alignment score remain flat. This is equivalent to saying that if a read r maps to two different contigs c_1 and c_2, then the conditional probabilities of generating the two alignments are equal, i.e. P( aln(r, c_1) | c_1 ) = P( aln(r, c_2) | c_2 ) = 1 / |{c_1, c_2}| = 1/2. Normally, if one is dealing with transcriptome alignments, and the aligner reports both highest-scoring and sub-optimal alignments, then computing meaningful alignment probabilities is crucial. However, if your upstream alignment tool is only reporting equally-best mappings for the peptides, then this simplification of the probabilities is totally reasonable (and what you would get if you went through the motions of scoring the probabilities anyway). So, I think this is probably working out OK :). One thing I might try, just to see if it has any effect: how does the output change, if at all, if you pass --noErrorModel to salmon?
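
To show what flat conditional probabilities mean for the allocation of ambiguous reads, here is a small Python sketch of a plain EM loop. It is only a toy under stated assumptions, not salmon's inference procedure: it ignores effective lengths and every other term a real quantifier would include, and the function name `em_abundance` and the input layout are made up for the example.

```python
def em_abundance(read_to_contigs, n_iter=100):
    """Estimate relative contig abundances with a basic EM loop.

    `read_to_contigs` is a list with one tuple of contig names per read.
    Each read's candidates all receive the same conditional probability,
    P(aln | c) = 1 / (number of candidates), mirroring the flat case.
    """
    contigs = sorted({c for cands in read_to_contigs for c in cands})
    theta = {c: 1.0 / len(contigs) for c in contigs}

    for _ in range(n_iter):
        counts = dict.fromkeys(contigs, 0.0)
        # E-step: split each read across its candidate contigs.
        for cands in read_to_contigs:
            cond_prob = 1.0 / len(cands)  # flat, identical for all candidates
            weights = {c: theta[c] * cond_prob for c in cands}
            total = sum(weights.values())
            for c, w in weights.items():
                counts[c] += w / total
        # M-step: renormalize expected counts into abundances.
        n = sum(counts.values())
        theta = {c: counts[c] / n for c in contigs}
    return theta


if __name__ == "__main__":
    reads = [("c1",), ("c1",), ("c1", "c2"), ("c2",)]
    print(em_abundance(reads))
```

Because the flat term is the same for every candidate of a read, it cancels in the normalization, and the ambiguous read is split purely according to the current abundance estimates. In this toy input the estimates settle near 2/3 for c1 and 1/3 for c2.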

In closing, this is a really cool off-label use of salmon, and I don't see any a priori reason that it doesn't make sense from a basic model standpoint. I'd be happy to remain in the loop and help out however I can. I'd also be interested in feedback on whether a "protein-level" salmon option / mode might be of broad(-ish) interest to the community, and if so, what additional features and options we could add to support it.

Cheers!
Rob
