Comments (19)
Sounds good, and just to clarify/remind, the idea I think is that most clusters over 10k sequences are spurious light chain super-clusters, so "pruning"/subsampling could presumably take the form of very simple-mindedly removing distantly related branches.
from cft.
Would the idea be that we still want to consider indels in our final step of pruning?
Yep!
would we want to try to respect multiplicity at all
Yes, I think that's a good idea. I don't know how this currently works, and/or how it would work in a tree-based setting. Chat Monday?
You know better than me! This was just my reaction.
That all sounds sensible, and I don't think I really have anything useful to contribute to the fix that should happen right now. But a potential long-term fix might incorporate the fact that almost all clusters larger than 10k are light chain superclusters incorporating many actual families. Assuming that the trees for these superfamilies usually consist of many widely separated sublineages, it might make sense to downsample by running only on the sublineage that includes the seed.
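The "only run on the sublineage that includes the seed" idea could be sketched roughly as follows. This is a toy illustration, not partis/cft code, and every name in it is hypothetical: given a tree stored as a child -> (parent, branch length) map, rank leaves by path distance to the seed leaf and keep the closest N, which effectively discards widely separated sublineages.

```python
def path_to_root(node, parents):
    """List of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while node in parents:
        node = parents[node][0]
        path.append(node)
    return path

def leaf_distance(a, b, parents):
    """Branch-length path distance between nodes a and b via their MRCA."""
    ancestors_of_a = set(path_to_root(a, parents))
    # climb from b until we hit an ancestor of a (that node is the MRCA)
    node, dist = b, 0.0
    while node not in ancestors_of_a:
        parent, length = parents[node]
        dist += length
        node = parent
    mrca = node
    # add a's distance up to the MRCA
    node = a
    while node != mrca:
        parent, length = parents[node]
        dist += length
        node = parent
    return dist

def closest_leaves_to_seed(leaves, seed, parents, n):
    """Keep the n leaves closest to the seed, i.e. (roughly) its sublineage."""
    return sorted(leaves, key=lambda leaf: leaf_distance(leaf, seed, parents))[:n]

# Two tight sublineages hanging off the root; the seed sits in the first one,
# and Y is a distant sublineage we'd like the downsampling to drop.
parents = {
    "seed": ("X", 0.1), "L1": ("X", 0.1), "L2": ("X", 0.2),
    "L3": ("Y", 0.1), "L4": ("Y", 0.1),
    "X": ("root", 0.1), "Y": ("root", 5.0),
}
print(closest_leaves_to_seed(["L1", "L2", "L3", "L4", "seed"], "seed", parents, 3))
```

On a real supercluster the tree would come from FastTree or the partis cluster-path trees discussed below, rather than a hand-built dict.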
Noted, thanks @psathyrella! Is there information about which sequences belong to these sublineages recorded somewhere in partis output to make this possible post-partis or would we need to re-partition such a "supercluster"?
Well sublineages would be a tree property, so in principle would come from cft running trees. But you can always have partis make these cluster path/FastTree mashup trees I keep mentioning.
This makes sense in the absence of the need to align (whether it's because we use indel-reversed sequences, or there are no indels, or things have already been aligned in partis) since I was under the impression that we need same length sequences to make a FastTree. Then we could do what you are talking about by removing distal lineages from the seed. In fact, by pruning to the seed lineage in https://github.com/matsengrp/cft/blob/master/bin/prune.py, do we not already do this?
Also I'm not sure what "these cluster path/FastTree mashup trees I keep mentioning" refers to.
I talked about them in the last bcr meeting. They're the approximate trees partis makes if you run --get-tree-metrics, constructed starting from the tree implicit in hierarchical agglomerative clustering, then refined with FastTree.
Ah I see, makes sense! Sorry for not remembering.
Sounds like it could be useful in subsampling large clusters, though does the issue I'm describing make sense re: aligning?
It sounded like the original motivation for --max-sequences was that large clusters over a certain threshold were overwhelming both memory use and compute time for:
- alignment
- un-pruned cluster tree building
- un-pruned cluster tree pruning
However, the --max-sequences implementation isn't as smart as we want it to be for subsampling large clusters (in cases of many seqs with equal multiplicity). So we'd really like to subsample in some smarter way, like a pruning strategy.
This requires a tree, which we can build any way we want so long as we don't need to align in CFT. But currently in CFT, in cases where we do need to align, we have to subsample pre-alignment, which means pre-tree-building, thus my (edited) question:
do we need to crash with a helpful message if someone runs CFT with > 10k sequences in a cluster that needs alignment? muscle will crash anyway in this case, so the only alternative seems to be setting --max-sequences and warning that we have automatically done so.
which maybe explains your statement: "I don't think I really have anything useful to contribute to the fix that should happen right now"
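To make the tie problem concrete, here is a minimal sketch of what subsampling by multiplicity is described as doing (the real code lives in bin/process_partis.py; the names below are purely illustrative). With many equal-multiplicity sequences, which ones survive the cutoff depends only on input order, i.e. ties are broken arbitrarily:

```python
def subsample_by_multiplicity(seqs, max_sequences):
    """seqs: list of (seq_id, multiplicity) pairs; keep the top max_sequences
    ranked by multiplicity. Python's sort is stable, so among equal
    multiplicities the input order alone decides who makes the cut."""
    ranked = sorted(seqs, key=lambda pair: pair[1], reverse=True)
    return ranked[:max_sequences]

# Five singletons plus one high-multiplicity sequence: the seed survives on
# merit, but the two surviving singletons are chosen arbitrarily.
seqs = [("seq%d" % i, 1) for i in range(5)] + [("seed", 7)]
print(subsample_by_multiplicity(seqs, 3))
```

This is exactly the "many seqs with equal multiplicity" failure mode: any permutation of the singletons gives a different, equally "valid" subsample.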
I'm not sure that it relates to whether cft aligns or not -- when partis runs FastTree it's on indel-reversed sequences.
Well, if all you need is a tree to inform subsampling strategies, it seems highly useful to have an approximate tree from partis clustering. You actually don't need to run partis, and even running FastTree to "refine" the hierarchical agglomeration tree is probably a waste of time for you, since it's only refining large multifurcations, which will only occur among very similar sequences. So you just call this method of the clusterpath class. When --get-tree-metrics is set in a partis run, it calls that with get_fasttrees=True, so you'd probably just leave it False. In order to get the cluster path trees, you also need to tell the clustering run to save the entire cluster path, which I think it doesn't do by default (since it's large), but I may have turned it on, I dunno.
Great! This seems like it would work as a method of informing subsampling strategies when we have these > 10k sequence clusters with many sequences having the same multiplicity.
I will implement this and talk with @lauradoepker about what the ideal behavior should be in terms of --max-sequences applying this downsampling by default vs. optionally, as well as think about how this fits in with downsampling by multiplicity.
So long as we are still aligning sequences in CFT in some cases, we still need to think about how we should subsample in order to be able to align (since memory limits us to aligning some max # of seqs). Options I can think of when we have too many sequences to align:
- subsample somehow before aligning sequences
- crash
Assuming we don't want to crash in this case, this is important to know:
Right now, the current subsampling setup in CFT is this:
1. read in partis output, which has some original number of sequences in the clonal family
2. if there are more than 10k sequences, take the top 10k ranked by multiplicity, with no respect to ties
3. align if we have been given --preserve-indels, otherwise use the indel-reversed sequences
4. build a FastTree on the aligned sequences from step 3
5. subsample (prune) the FastTree to a max of 100 seqs and then proceed to building ML trees etc.
So we already "subsample somehow before aligning sequences" in step 2, but we are not satisfied with how that's done, since it treats all sequences with equal multiplicity the same, when in fact some might be more worth keeping when considering the tree.
Assuming we want to rework how we do step 2, one idea is:
Take a tree (partis or FastTree) created from indel-reversed sequences (so we don't need to align) and subsample from that tree; after subsampling we can align the non-indel-reversed sequences corresponding to the subsampled set, and proceed to step 4.
However, it might make sense to consolidate the subsampling in step 2 with the subsampling (pruning) in step 5. This would look something like:
1. read in partis output, which has some original number of sequences in the clonal family
2. build a FastTree (or use the partis tree) from indel-reversed sequences (since we haven't aligned yet)
3. subsample (prune) the tree from step 2 to a max of 100 seqs
4. align (if we have been given --preserve-indels and there are indels) the non-indel-reversed sequences corresponding to the pruned set of sequences
5. build ML trees using the aligned 100 seqs from step 4
The only downside is that the tree we subsample from does not take indels into account.
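The consolidated steps above could be sketched as a function that takes the tree-building, pruning, and alignment steps as pluggable pieces. Everything here is hypothetical (the trivial stand-ins exist only to show the order of operations): the tree is built from indel-reversed sequences, and the alignment cost is paid only after pruning, only for the survivors.

```python
def consolidated_subsample(records, build_tree, prune_tree, align, max_tips=100):
    """records: dicts with 'id', 'indel_reversed', and 'input_seq' keys.
    Build a tree from indel-reversed sequences (no alignment needed), prune
    it to max_tips, then align only the surviving original sequences."""
    tree = build_tree([(r["id"], r["indel_reversed"]) for r in records])
    keep = set(prune_tree(tree, max_tips))
    survivors = [r for r in records if r["id"] in keep]
    # only now do we pay the alignment cost, and only for the pruned set
    return align([(r["id"], r["input_seq"]) for r in survivors])

# Trivial stand-ins: a "tree" that is just the id list, pruning that keeps
# the first max_tips ids, and a no-op "alignment".
records = [{"id": "s%d" % i, "indel_reversed": "ACGT", "input_seq": "AC-GT"}
           for i in range(5)]
aligned = consolidated_subsample(
    records,
    build_tree=lambda seqs: [sid for sid, _ in seqs],
    prune_tree=lambda tree, n: tree[:n],
    align=lambda seqs: seqs,
    max_tips=2)
print(aligned)
```

In the real pipeline, build_tree would be FastTree or the partis cluster-path tree, prune_tree something like bin/prune.py's seed-lineage pruning, and align a muscle call.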
Thoughts on this or alternatives @matsen @lauradoepker @mmshipley @psathyrella ?
@matsen yes we could do that. Would the idea be that we still want to consider indels in our final step of pruning?
Separately, would we want to try to respect multiplicity at all as --max-sequences has done in bin/process_partis.py or do we want to just focus on tree-based subsampling? I created this issue to figure out how to rework the --max-sequences option, since I think we either need to re-implement how it subsamples there (since it ends up being arbitrary in many cases), or just remove that option and require that subsampling be done separately.
@matsen and I discussed this today and we both feel that if we are considering re-working the subsampling/pruning methods used in CFT in any way, we should spend the time designing something that will be applicable to both current and future datasets.
I'm recording here some of what we talked about, to be followed up with @mmshipley @lauradoepker @psathyrella during our meeting next Wednesday.
subsampling/pruning methods
Currently, we have a subsampling/pruning procedure outlined in my above comment. If we are considering changing this at all, we could use (but are not limited to) some combination of the following methods:
- taking the top N sequences according to multiplicity. These are less likely to be the result of sequencing error. Doesn't cover the case of many singletons
- blast (in the seeded case) for the top N matches against the seed sequence. Doesn't cover the unseeded case
- phylogenetic pruning. Doesn't scale to large clusters
- various forms of clustering (and various methods of sampling from clusters -- e.g. just take the seed cluster or take some subsample from each cluster):
- UMAP
- vsearch
- etc
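For the seeded case above, the "top N matches against the seed" idea can be approximated without blast by ranking on a simple distance to the seed. A sketch, assuming equal-length (e.g. indel-reversed) sequences, with all names hypothetical:

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def closest_to_seed(seqs, seed_id, n):
    """seqs: list of (seq_id, sequence) pairs; keep the n sequences most
    similar to the seed (the seed itself ranks first at distance 0)."""
    seed_seq = dict(seqs)[seed_id]
    return sorted(seqs, key=lambda pair: hamming(pair[1], seed_seq))[:n]

seqs = [("seed", "ACGT"), ("near", "ACGA"), ("far", "TTTT"), ("mid", "ACTT")]
print(closest_to_seed(seqs, "seed", 2))
```

A real implementation would use blast (or vsearch) for speed on 10k+ sequences; the point is only that the selection criterion is "similarity to the seed," which, like the sublineage idea, doesn't help in the unseeded case.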
data considerations
- seeded clusters vs unseeded clusters; our methods should apply to both (how much do we care about the unseeded case?)
- what is the scale of future clusters given UMIs?
- how frequently (now and in the future) do we observe clusters large enough that our current methods don't scale well (e.g. cluster over 10k sequences full of singletons)?
As @matsen suggested today, for now we will just stop execution of CFT if we encounter a cluster above our threshold for size. We don't predict encountering such clusters in future datasets using UMIs, and if we do we would probably like to decide how to proceed on a case-specific basis.
I will just add an exception based on cluster size, and remove --max-sequences from our call to bin/process_partis.py.
Should we still consider implementing some solution for --max-sequences in the case of many 'singlets' (sequences with multiplicity = 1), since it will arbitrarily downsample in that case? Even though that option will no longer be used in CFT for now, we might consider resolving this case if we are going to leave the option in the script.
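The "stop execution above the size threshold" behavior might look something like the sketch below. The class name, function name, and threshold are illustrative, not the actual cft implementation:

```python
class ClusterTooLargeError(Exception):
    """Raised when a clonal family exceeds the size we're willing to process."""

DEFAULT_MAX_CLUSTER_SIZE = 10000  # illustrative threshold

def check_cluster_size(seq_ids, max_size=DEFAULT_MAX_CLUSTER_SIZE):
    """Crash with a helpful message up front, instead of letting muscle
    blow up on memory later in the pipeline."""
    if len(seq_ids) > max_size:
        raise ClusterTooLargeError(
            "cluster has %d sequences (limit %d); decide how to subsample "
            "this cluster on a case-specific basis" % (len(seq_ids), max_size))
    return seq_ids
```

Wiring --max-sequences to the max_size argument, as suggested in the next comment, keeps the flag useful while making the failure mode explicit.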
Why wouldn't we leave in --max-sequences and have it set the count at which the exception is thrown?
Not sure why the "singlet" case is different here.
That works too; I was assuming that the exception was something specific to the pipeline, and considering that --max-sequences might still get used to downsample by multiplicity for uses of bin/process_partis.py beyond getting called from our default pipeline.
In the above PR, I went with "add an exception based on cluster size, and remove --max-sequences from our call to bin/process_partis.py." I also added a warning to --max-sequences, given that "the exception was something specific to the pipeline and considering that --max-sequences might still get used to downsample by multiplicity for uses of bin/process_partis.py beyond getting called from our default pipeline."