Comments (19)
Sounds good, and just to clarify/remind, the idea I think is that most clusters over 10k sequences are spurious light chain super-clusters, so "pruning"/subsampling could presumably take the form of very simple-mindedly removing distantly related branches.
from cft.
Would the idea be that we still want to consider indels in our final step of pruning?
Yep!
would we want to try to respect multiplicity at all
Yes, I think that's a good idea. I don't know how this currently works, and/or how it would work in a tree-based setting. Chat Monday?
You know better than me! This was just my reaction.
That all sounds sensible, and I don't think I really have anything useful to contribute to the fix that should happen right now. But a potential long-term fix might incorporate the fact that almost all clusters larger than 10k are light chain superclusters incorporating many actual families. Assuming that the trees for these superfamilies usually consist of many widely separated sublineages, it might make sense to downsample by running only on the sublineage that includes the seed.
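The "only run on the sublineage that includes the seed" idea could be sketched roughly as follows. This is a toy illustration, not partis/cft code, and every name in it is hypothetical: given a tree stored as a child -> (parent, branch length) map, rank leaves by path distance to the seed leaf and keep the closest N, which effectively discards widely separated sublineages.

```python
def path_to_root(node, parents):
    """List of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while node in parents:
        node = parents[node][0]
        path.append(node)
    return path

def leaf_distance(a, b, parents):
    """Branch-length path distance between nodes a and b via their MRCA."""
    ancestors_of_a = set(path_to_root(a, parents))
    # climb from b until we hit an ancestor of a (that node is the MRCA)
    node, dist = b, 0.0
    while node not in ancestors_of_a:
        parent, length = parents[node]
        dist += length
        node = parent
    mrca = node
    # add a's distance up to the MRCA
    node = a
    while node != mrca:
        parent, length = parents[node]
        dist += length
        node = parent
    return dist

def closest_leaves_to_seed(leaves, seed, parents, n):
    """Keep the n leaves closest to the seed, i.e. (roughly) its sublineage."""
    return sorted(leaves, key=lambda leaf: leaf_distance(leaf, seed, parents))[:n]

# Two tight sublineages hanging off the root; the seed sits in the first one,
# and Y is a distant sublineage we'd like the downsampling to drop.
parents = {
    "seed": ("X", 0.1), "L1": ("X", 0.1), "L2": ("X", 0.2),
    "L3": ("Y", 0.1), "L4": ("Y", 0.1),
    "X": ("root", 0.1), "Y": ("root", 5.0),
}
print(closest_leaves_to_seed(["L1", "L2", "L3", "L4", "seed"], "seed", parents, 3))
```

On a real supercluster the tree would come from FastTree or the partis cluster-path trees discussed below, rather than a hand-built dict.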
Noted, thanks @psathyrella! Is there information about which sequences belong to these sublineages recorded somewhere in partis output to make this possible post-partis or would we need to re-partition such a "supercluster"?
Well sublineages would be a tree property, so in principle would come from cft running trees. But you can always have partis make these cluster path/FastTree mashup trees I keep mentioning.
This makes sense in the absence of the need to align (whether it's because we use indel-reversed sequences, or there are no indels, or things have already been aligned in partis) since I was under the impression that we need same length sequences to make a FastTree. Then we could do what you are talking about by removing distal lineages from the seed. In fact, by pruning to the seed lineage in https://github.com/matsengrp/cft/blob/master/bin/prune.py, do we not already do this?
Also I'm not sure what "these cluster path/FastTree mashup trees I keep mentioning" refers to.
I talked about them in the last bcr meeting. They're the approximate trees partis makes if you run --get-tree-metrics, constructed starting from the tree implicit in hierarchical agglomerative clustering, then refined with FastTree.
Ah I see, makes sense! Sorry for not remembering.
Sounds like it could be useful in subsampling large clusters, though does the issue I'm describing make sense re: aligning?
It sounded like the original motivation for --max-sequences was that large clusters over a certain threshold were overwhelming both memory use and compute time for:
- alignment
- un-pruned cluster tree building
- un-pruned cluster tree pruning
However, the --max-sequences implementation isn't as smart as we want it to be for subsampling large clusters (in cases of many seqs with equal multiplicity). So we'd really like to subsample in some smarter way, like a pruning strategy.
This requires a tree, which we can build any way we want so long as we don't need to align in CFT. But currently in CFT, in cases where we do need to align, we have to subsample pre-alignment, which means pre-tree-building, thus my (edited) question:
do we need to crash with a helpful message if someone runs CFT with > 10k sequences in a cluster that needs alignment? muscle will crash anyway in this case, so the only alternative seems to be setting --max-sequences and warning that we have automatically done so.
which maybe explains your statement: "I don't think I really have anything useful to contribute to the fix that should happen right now"
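To make the tie problem concrete, here is a minimal sketch of what subsampling by multiplicity is described as doing (the real code lives in bin/process_partis.py; the names below are purely illustrative). With many equal-multiplicity sequences, which ones survive the cutoff depends only on input order, i.e. ties are broken arbitrarily:

```python
def subsample_by_multiplicity(seqs, max_sequences):
    """seqs: list of (seq_id, multiplicity) pairs; keep the top max_sequences
    ranked by multiplicity. Python's sort is stable, so among equal
    multiplicities the input order alone decides who makes the cut."""
    ranked = sorted(seqs, key=lambda pair: pair[1], reverse=True)
    return ranked[:max_sequences]

# Five singletons plus one high-multiplicity sequence: the seed survives on
# merit, but the two surviving singletons are chosen arbitrarily.
seqs = [("seq%d" % i, 1) for i in range(5)] + [("seed", 7)]
print(subsample_by_multiplicity(seqs, 3))
```

This is exactly the "many seqs with equal multiplicity" failure mode: any permutation of the singletons gives a different, equally "valid" subsample.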
I'm not sure that it relates to whether cft aligns or not -- when partis runs FastTree it's on indel-reversed sequences.
Well, if all you need is a tree to inform subsampling strategies, it seems highly useful to have an approximate tree from partis clustering. You actually don't need to run partis, and even running FastTree to "refine" the hierarchical agglomeration tree is probably a waste of time for you, since it's only refining large multifurcations, which will only occur among very similar sequences. So you just call this method of the clusterpath class. When --get-tree-metrics is set in a partis run, it calls that with get_fasttrees=True, so you'd probably just leave it False. In order to get the cluster path trees, you also need to tell the clustering run to save the entire cluster path, which I think it doesn't do by default (since it's large), but I may have turned it on, I dunno.
Great! This seems like it would work as a method of informing subsampling strategies when we have these > 10k sequence clusters with many sequences having the same multiplicity.
I will implement this and talk with @lauradoepker about what the ideal behavior should be in terms of --max-sequences applying this downsampling by default vs. optionally, as well as think about how this fits in with downsampling by multiplicity.
So long as we are still aligning sequences in CFT in some cases, we still need to think about how we should subsample in order to be able to align (since memory limits us to aligning some max # of seqs). Options I can think of when we have too many sequences to align:
- subsample somehow before aligning sequences
- crash
Assuming we don't want to crash in this case, this is important to know:
Right now, the current subsampling setup in CFT is this:
1. read in partis output, which has some original number of sequences in the clonal family
2. if there are more than 10k sequences, take the top 10k ranked by multiplicity, with no respect to ties
3. align if we have been given --preserve-indels, otherwise use the indel-reversed sequences
4. build a FastTree on the aligned sequences from step 3
5. subsample (prune) the FastTree to a max of 100 seqs and then proceed to building ML trees etc.
So we already "subsample somehow before aligning sequences" in step 2, but we are not satisfied with how that's done, since it treats all sequences with equal multiplicity the same, when in fact some might be more worth keeping when considering the tree.
Assuming we want to rework how we do step 2, one idea is:
Take a tree (partis or FastTree) created from indel-reversed sequences (so we don't need to align) and subsample from that tree; after subsampling we can align the non-indel-reversed sequences corresponding to the subsampled set, and proceed to step 4.
However, it might make sense to consolidate the subsampling in step 2 with the subsampling (pruning) in step 5. This would look something like:
1. read in partis output, which has some original number of sequences in the clonal family
2. build a FastTree (or use the partis tree) from indel-reversed sequences (since we haven't aligned yet)
3. subsample (prune) the tree from step 2 to a max of 100 seqs
4. align (if we have been given --preserve-indels and there are indels) the non-indel-reversed sequences corresponding to the pruned set of sequences
5. build ML trees using the aligned 100 seqs from step 4
The only downside is that the tree we subsample from does not take indels into account.
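The consolidated steps above could be sketched as a function that takes the tree-building, pruning, and alignment steps as pluggable pieces. Everything here is hypothetical (the trivial stand-ins exist only to show the order of operations): the tree is built from indel-reversed sequences, and the alignment cost is paid only after pruning, only for the survivors.

```python
def consolidated_subsample(records, build_tree, prune_tree, align, max_tips=100):
    """records: dicts with 'id', 'indel_reversed', and 'input_seq' keys.
    Build a tree from indel-reversed sequences (no alignment needed), prune
    it to max_tips, then align only the surviving original sequences."""
    tree = build_tree([(r["id"], r["indel_reversed"]) for r in records])
    keep = set(prune_tree(tree, max_tips))
    survivors = [r for r in records if r["id"] in keep]
    # only now do we pay the alignment cost, and only for the pruned set
    return align([(r["id"], r["input_seq"]) for r in survivors])

# Trivial stand-ins: a "tree" that is just the id list, pruning that keeps
# the first max_tips ids, and a no-op "alignment".
records = [{"id": "s%d" % i, "indel_reversed": "ACGT", "input_seq": "AC-GT"}
           for i in range(5)]
aligned = consolidated_subsample(
    records,
    build_tree=lambda seqs: [sid for sid, _ in seqs],
    prune_tree=lambda tree, n: tree[:n],
    align=lambda seqs: seqs,
    max_tips=2)
print(aligned)
```

In the real pipeline, build_tree would be FastTree or the partis cluster-path tree, prune_tree something like bin/prune.py's seed-lineage pruning, and align a muscle call.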
Thoughts on this or alternatives @matsen @lauradoepker @mmshipley @psathyrella ?
@matsen yes we could do that. Would the idea be that we still want to consider indels in our final step of pruning?
Separately, would we want to try to respect multiplicity at all as --max-sequences has done in bin/process_partis.py or do we want to just focus on tree-based subsampling? I created this issue to figure out how to rework the --max-sequences option, since I think we either need to re-implement how it subsamples there (since it ends up being arbitrary in many cases), or just remove that option and require that subsampling be done separately.
@matsen and I discussed this today and we both feel that if we are considering re-working the subsampling/pruning methods used in CFT in any way, we should spend the time designing something that will be applicable to both current and future datasets.
I'm recording here some of what we talked about, to be followed up with @mmshipley @lauradoepker @psathyrella during our meeting next Wednesday.
subsampling/pruning methods
Currently, we have a subsampling/pruning procedure outlined in my above comment. If we are considering changing this at all, we could use (but are not limited to) some combination of the following methods:
- taking the top N sequences according to multiplicity. These are less likely to be the result of sequencing error. Doesn't cover the case of many singletons
- blast (in the seeded case) for the top N matches against the seed sequence. Doesn't cover the unseeded case
- phylogenetic pruning. Doesn't scale to large clusters
- various forms of clustering (and various methods of sampling from clusters -- e.g. just take the seed cluster or take some subsample from each cluster):
- UMAP
- vsearch
- etc
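For the seeded case above, the "top N matches against the seed" idea can be approximated without blast by ranking on a simple distance to the seed. A sketch, assuming equal-length (e.g. indel-reversed) sequences, with all names hypothetical:

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def closest_to_seed(seqs, seed_id, n):
    """seqs: list of (seq_id, sequence) pairs; keep the n sequences most
    similar to the seed (the seed itself ranks first at distance 0)."""
    seed_seq = dict(seqs)[seed_id]
    return sorted(seqs, key=lambda pair: hamming(pair[1], seed_seq))[:n]

seqs = [("seed", "ACGT"), ("near", "ACGA"), ("far", "TTTT"), ("mid", "ACTT")]
print(closest_to_seed(seqs, "seed", 2))
```

A real implementation would use blast (or vsearch) for speed on 10k+ sequences; the point is only that the selection criterion is "similarity to the seed," which, like the sublineage idea, doesn't help in the unseeded case.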
data considerations
- seeded clusters vs unseeded clusters; our methods should apply to both (how much do we care about the unseeded case?)
- what is the scale of future clusters given UMIs?
- how frequently (now and in the future) do we observe clusters large enough that our current methods don't scale well (e.g. cluster over 10k sequences full of singletons)?
As @matsen suggested today, for now we will just stop execution of CFT if we encounter a cluster above our threshold for size. We don't predict encountering such clusters in future datasets using UMIs, and if we do we would probably like to decide how to proceed on a case-specific basis.
I will just add an exception based on cluster size, and remove --max-sequences from our call to bin/process_partis.py.
Should we still consider implementing some solution for --max-sequences in the case of many 'singlets' (sequences with multiplicity = 1), since it will arbitrarily downsample in that case? Even though that option will no longer be used in CFT for now, we might consider resolving this case if we are going to leave the option in the script.
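The "stop execution above the size threshold" behavior might look something like the sketch below. The class name, function name, and threshold are illustrative, not the actual cft implementation:

```python
class ClusterTooLargeError(Exception):
    """Raised when a clonal family exceeds the size we're willing to process."""

DEFAULT_MAX_CLUSTER_SIZE = 10000  # illustrative threshold

def check_cluster_size(seq_ids, max_size=DEFAULT_MAX_CLUSTER_SIZE):
    """Crash with a helpful message up front, instead of letting muscle
    blow up on memory later in the pipeline."""
    if len(seq_ids) > max_size:
        raise ClusterTooLargeError(
            "cluster has %d sequences (limit %d); decide how to subsample "
            "this cluster on a case-specific basis" % (len(seq_ids), max_size))
    return seq_ids
```

Wiring --max-sequences to the max_size argument, as suggested in the next comment, keeps the flag useful while making the failure mode explicit.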
Why wouldn't we leave in --max-sequences and have it set the count at which the exception is thrown?
Not sure why the "singlet" case is different here.
That works too; I was assuming that the exception was something specific to the pipeline, and considering that --max-sequences might still get used to downsample by multiplicity for uses of bin/process_partis.py beyond getting called from our default pipeline.
In the above PR, I went with "add an exception based on cluster size, and remove --max-sequences from our call to bin/process_partis.py." I also added a warning to --max-sequences, given that "the exception was something specific to the pipeline and considering that --max-sequences might still get used to downsample by multiplicity for uses of bin/process_partis.py beyond getting called from our default pipeline."