matsengrp / cft

Clonal family tree

Python 100.00%

cft's Issues

sequence alignment for post partis fastas

One issue that's probably causing artifacts in tree reconstruction is that sequences are padded with Ns, and the amount of N padding differs across clusters. Maybe one way to deal with this is to use something like mafft to combine clusters (each of which is individually aligned), then trim off the maximal end gap/N substring length from all sequences.
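The trimming step could look roughly like the sketch below. It assumes the clusters have already been combined into one alignment (for example by mafft) and loaded into a dict of equal-length strings; the function name `trim_shared_padding` and the choice of pad characters are illustrative, not part of cft.

```python
def trim_shared_padding(aligned, pad_chars="Nn-"):
    """Trim the longest leading and trailing pad run seen in any sequence.

    `aligned` maps sequence name -> aligned sequence (all the same length).
    Every sequence loses the same number of columns at each end, so the
    result is still an alignment.
    """
    if not aligned:
        return {}
    if len({len(seq) for seq in aligned.values()}) > 1:
        raise ValueError("sequences are not aligned (lengths differ)")

    def leading_pad(seq):
        return len(seq) - len(seq.lstrip(pad_chars))

    def trailing_pad(seq):
        return len(seq) - len(seq.rstrip(pad_chars))

    left = max(leading_pad(seq) for seq in aligned.values())
    right = max(trailing_pad(seq) for seq in aligned.values())
    return {name: seq[left:len(seq) - right] for name, seq in aligned.items()}


example = {"a": "NNACGTNN", "b": "NACGTTTN", "c": "ACGTACGT"}
trimmed = trim_shared_padding(example)
```

Note that a single heavily padded sequence determines how much every other sequence loses, so it may be worth reporting `left` and `right` before trimming.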

Non-unique name '108017-1' in the alignment

One of the clusters from QA255.067-Vh/Hs-LN2-5RACE-IgG-new-cluster-annotations.csv has non-unique sequence names. This means FastTree does not produce a tree and processing this cluster fails.

$ bin/process_partis.py --annotation /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.067-Vh/Hs-LN2-5RACE-IgG-new-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QA255.067-Vh/Hs-LN2-5RACE-IgG-new.csv --cluster_base cluster --output_dir out.test  --separate --chain "h"                                                                       
writing out.test/cluster0.fa
writing out.test/cluster1.fa
writing out.test/cluster2.fa
writing out.test/cluster3.fa

$ FastTree -nt -quiet  out.test/cluster0.fa 
Non-unique name '108017-1' in the alignment

ninja edit, also happens on cluster1.fa (but not cluster 2 or 3)

$ FastTree -nt -quiet  out.test/cluster1.fa 
Non-unique name '115707-1' in the alignment
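One possible workaround, until the source of the duplicates is understood, is to disambiguate names before writing each cluster FASTA. This is only a sketch: the `_dupN` suffix and the function name are made up for illustration, and it may turn out that duplicated records should be dropped rather than renamed.

```python
from collections import Counter


def make_names_unique(records):
    """Return (name, seq) pairs in which every name occurs exactly once.

    The first occurrence keeps its name; later occurrences get a
    hypothetical `_dupN` suffix that does not clash with any existing name.
    """
    seen = Counter()
    taken = {name for name, _ in records}
    unique = []
    for name, seq in records:
        seen[name] += 1
        if seen[name] == 1:
            unique.append((name, seq))
            continue
        n = seen[name] - 1
        candidate = "{}_dup{}".format(name, n)
        while candidate in taken:
            n += 1
            candidate = "{}_dup{}".format(name, n)
        taken.add(candidate)
        unique.append((candidate, seq))
    return unique


records = [("108017-1", "ACGT"), ("108017-1", "ACGA"), ("265064-1", "ACGG")]
renamed = make_names_unique(records)
```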

How to run the workflow?

My current thinking is Toil, which apparently can use slurm as an execution engine: http://toil.readthedocs.io/en/latest/batchSystem.html?highlight=slurm

clustering summary information

reminder from meeting with erick, who presumably will want to rewrite this issue.

display ascii-art tree

This issue is a spinoff of #11, specifically from this comment by Erick discussing ascii representations of tree topologies.

This will be a placeholder to anchor branches and PRs if it works out.

Docker image

The conda documented method of exporting and recreating environments has bugs that make it useless for us.

    $ conda env create -f environment.yml
    $ conda env create -f environment.yml -n foo
    Using Anaconda Cloud api site https://api.anaconda.org
    Fetching package metadata .............
    Solving package specifications: .
    Error: Packages missing in current linux-64 channels:

The conda developers have known about these problems since July, but they persist. Although they have accepted and merged a solution, it does not actually fix the problem.

Until this gets fixed, follow the documentation in README.md for creating an environment using conda.

Visualizations

I wanted to start a thread on ideas for nice visualizations.

Perhaps a tree + alignment would be nice here, like so:

This shows the alignment restricted to the parsimony-informative sites. We could do that, or just show the CDRs or something. Or we could have the alignment scrollable showing a fixed tree.

Clearly we'd want to cut down the tree to the region around the seed sequence.

SConstruct bug fixes

OBO error on line 117
Need the -nt option on FastTree to specify nucleotide input (it's throwing a warning now because AA is the default)

naive sequence id not present in dnaml newick tree

The newick tree created by bin/dnaml2tree.py does not have any node whose name corresponds to the naive sequence and that matches the name of the naive sequence in the FASTA file.
I'm guessing that the root node is getting a number like the other internal nodes and we need to do something special to give it a real name.

$ grep naive output/QA255.006-Vh/Hs-LN2-5RACE-IgG-new/cluster2/dnaml.fa
>naive2
$ grep naive output/QA255.006-Vh/Hs-LN2-5RACE-IgG-new/cluster2/dnaml.newick 
$

Designing communication between components

So far we have been passing a FASTA around, which is fine, though it seems like we want to be passing around some richer information. Last meeting we discussed using a JSON file for this communication. We could also consider YAML.

Let's plan what this will hold here. Zooming out, we have

an investigation, which may contain samples from multiple time points #10
samples contain annotated sequences, which contain VDJ calls etc, and indels #8
samples contain clusters, and we may have multiple clusterings per sample #9
investigations may contain seed sequences, which point to clusters in each sample

I'm imagining this like so, where the investigation is the big box, the timepoints are arranged vertically, and the stars are the seed sequences:

sort tree so seed sequence appears at the top

breaking out from #11 (comment)

I do also think that some sorting, such that the seed sequence ends up on top, would be nice. I did this with nw_order, but I'll bet that there's something better in one of those python packages in which one can define a custom comparator.

The sorting will be nice because then the internal sequences of interest will be as close as possible to the top.
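A minimal sketch of that ordering, using plain nested dicts rather than any particular tree package (the `name`/`children` keys are an assumed representation, not the one cft uses). Instead of a custom comparator it uses a sort key: at every internal node, the child whose subtree contains the seed is moved to the front.

```python
def contains(node, target):
    """True if `target` is the name of `node` or of any descendant."""
    if node["name"] == target:
        return True
    return any(contains(child, target) for child in node.get("children", []))


def sort_seed_first(node, seed):
    """Reorder children in place so the seed's lineage comes first everywhere."""
    children = node.get("children", [])
    for child in children:
        sort_seed_first(child, seed)
    # False sorts before True, so the subtree holding the seed moves to the
    # front; list.sort is stable, so the other children keep their order.
    children.sort(key=lambda child: not contains(child, seed))
    return node


def leaf_order(node):
    """Leaf names from top to bottom, as a tree drawing would list them."""
    children = node.get("children", [])
    if not children:
        return [node["name"]]
    return [leaf for child in children for leaf in leaf_order(child)]


tree = {"name": "root", "children": [
    {"name": "115652-1"},
    {"name": "X", "children": [{"name": "265064-1"}, {"name": "seed"}]},
]}
sort_seed_first(tree, "seed")
```

Repeated `contains` calls make this quadratic in the worst case, which should not matter for trees of the size being displayed.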

phylip parsing error

The phylip parsing script is failing on many inputs, e.g.

xvfb-run -a bin/dnaml2tree.py --dnaml output/Hs-LN1-5RACE-IgG/QB850.001-Vh/cluster2/outfile --outdir output/Hs-LN1-5RACE-IgG/QB850.001-Vh/cluster2 --basename dnaml
Traceback (most recent call last):
  File "bin/dnaml2tree.py", line 229, in <module>                                                                                              
    main()
  File "bin/dnaml2tree.py", line 197, in main
    tree = build_tree(sequences, parents)
  File "bin/dnaml2tree.py", line 112, in build_tree
    raise RuntimeError("The tree is not properly rooted; expected a single root but there are {}.".format(orphan_nodes))
RuntimeError: The tree is not properly rooted; expected a single root but there are 4.

Alnvu for sequence alignment display?

Just an idea, while I'm thinking about it... This would make it easier to see just the changes in the alignment. It would still have the global consensus either at the top or bottom. I find this really helpful for looking at ASCII alignments.

Include patient id, seed id, etc. in metadata

It would be nice to get the patient id, seed id, timepoint, and gene inserted into the metadata earlier in the pipeline.

Ideally this information and more would be supplied by the researcher. In the meantime we can extract it from the paths and filenames that conform to the expected pattern.
There is code in the cftweb app that extracts this information from the path and filename (marked by "TEMPORARY HACK!"). This hack should move up to process_partis.py where the extracted info can be included in the metadata json file.

process_partis.py should do something reasonable if the path and filename don't conform to the expected format.

Richer node annotation

It would be nice to annotate nodes on the tree to flag non-productive sequences. If our ASR is resulting in non-productive sequences (in-frame stop codons), that's a problem.

Also, although our current data is from RNA, we might want to detect out-of-frame and stop codon sequences in the input data upstream of any tree building.
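The stop-codon half of that check is simple enough to sketch. The function below is illustrative only: it assumes an ungapped nucleotide sequence and a known reading frame, and it says nothing about the out-of-frame case, which needs the rearrangement annotation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_inframe_stop(seq, frame=0):
    """Return the 0-based index of the first in-frame stop codon, or None.

    `frame` is the offset (0, 1 or 2) of the first complete codon.
    Alignment gaps should be removed before calling this.
    """
    seq = seq.upper().replace("U", "T")
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None
```

A node could then be flagged whenever `first_inframe_stop(node_sequence)` is not None.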

dnaml2tree.py is producing non-binary trees

I think the rerooting function in bin/dnaml2tree.py is producing non-binary trees. The node marked 'X' in this diagram has three branches. This can make it difficult to further manipulate these trees with tools that only work on binary trees (e.g. much of dendropy).

Is there a way we can root these trees on the naive sequence and still have binary trees?

# output/QA255.006-Vh/Hs-LN2-5RACE-IgG-new/cluster2/dnaml.newick
                                                 /------- seed QA255
                    /----------------------------+                  
+-------------------X                            \------- 265064-1  
                    |                                               
                    \------------------------------------ 115652-1

Need pipeline from partis output to visualization

Construct a pipeline and smooth over any interface mismatches between the tools.

As a first milestone use SCons.
This is a placeholder for an early pipeline; nothing here rules out future directions.

Features:

Ideally this would go from partis output to web interface
Tie together existing tools from David and Will.
Run in parallel (on slurm for now)

More data!

Right now there is a lot of data that has been processed by partis in /fh/fast/matsen_e/processed-data/partis -- perhaps @psathyrella can tell us if any of these are annotation-only or something that wouldn't make them appropriate for building trees on clonal families.

I propose that our processed data go in /fh/fast/matsen_e/processed-data/cft using the same directory names.

I also propose adding data from http://www.nature.com/articles/ncomms11112, which according to Jason VdH is available pre-processed (yay!) in ImmPort under the identifier SDY675. We'll then have to ask Duncan to run partis on that.

Add Fasta download for ancestral reconstruction between naive and seed

As requested by @lauranoges et al.

Sequence alignment colors corresponding to the Callangram?

The colors in the alignment seemed a little flat. Do we want to use colors that correspond to this?

green: b2d28e
blue: 8fbbd9
orange: ecaf80
purple: aba7d4

lighter versions:

lt green: e8f2da
lt blue: dae9f2
lt orange: f2e3da
lt purple: dadaf2

Download fasta from table views

Should be able to download a fasta file of all cluster sequences in both the "by individual" and "by cluster" table views. Perhaps could add a link to the cluster ascii art view as well, though it may feel a little lost without other links at the moment...

Selecting alternative clusters

Occasionally the cluster given in the annotations file will not be the one we want.

We can easily get the sequences from the annotations file, but we do not have the naive sequence directly at hand. We can get this from ./bin/partis --naive-vsearch --other-args, and the .log file has the function call used to generate the data; it will just take an intermediate rerun of partis.

Consistent style between contributors

There are a number of us working on this project, so I think it's good to try to pick a style for writing code.

When writing C++ I've enjoyed using clang-format. This way nobody has to think about code formatting-- you just run it through the formatter on a regular basis and poof! Done. There is the equivalent for Python: https://github.com/google/yapf

Discuss-- or perhaps a topic for the next stand-up?

Use/learn from SONAR

Schramm, C. A., Sheng, Z., Zhang, Z., Mascola, J. R., Kwong, P. D., & Shapiro, L. (2016). SONAR: A High-Throughput Pipeline for Inferring Antibody Ontogenies from Longitudinal Sequencing of B Cell Transcripts. Frontiers in Immunology, 7. https://doi.org/10.3389/fimmu.2016.00372

General ui cleanup

remove template's dead-end navigation icons
hide toggle for build information
ability to switch back and forth between "by cluster" and "by individual" views
fix timepoint links on "by individual" view

Don't show D genes for light chain sequences

Light chains don't have D genes, and partis copes by putting in IGKDx-x*x as a placeholder.

Ideally we wouldn't show that at all for light chain sequences. Perhaps the locus will show up in the metadata somewhere?
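Until the locus is available in the metadata, one stopgap would be to infer it from the V gene name, since kappa and lambda genes start with IGK and IGL. The helper name below is made up, and this is only a sketch of the display logic.

```python
LIGHT_CHAIN_PREFIXES = ("IGK", "IGL")


def displayed_d_gene(v_gene, d_gene):
    """Return the D gene to display, or None for light chain sequences.

    The locus is guessed from the V gene name; an explicit locus field in
    the metadata would be a better source if one becomes available.
    """
    if v_gene.upper().startswith(LIGHT_CHAIN_PREFIXES):
        return None
    return d_gene
```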

By Patient navigation busted

When we click through by patient, we get here:

which is already looking fishy (see timepoint).

Clicking on this links to http://stoat:5000/Hs-LN1-5RACE-Ig/Vk/cluster0/dnaml.seedLineage.fa//clusters.html , which is not found.

process_partis.py needs to be more robust

There are two common situations in which process_partis.py falls over with a stack trace and without producing any guidance on what the real problem is or how to correct it.

Both situations can be triggered by running process_partis.py on the reference output files under test/reference-results/ in the partis repo.

The first is when a partition file does not contain a column labelled seed_unique_id.
The second is when the partition file contains seed_unique_id, but the format of the column is not what was expected.
The script should test the availability and format of the seed_unique_id before accessing the data, and either adapt to the missing data or output a useful error message.

$ process_partis.py --annotation partis/test/reference-results/partition-new-simu-cluster-annotations.csv --partition partis/test/reference-results/partition-new-simu.csv
Traceback (most recent call last):
  File "/home/cwarth/src/matsen/cft/bin/process_partis.py", line 237, in <module>
    main()
  File "/home/cwarth/src/matsen/cft/bin/process_partis.py", line 218, in main
    args.chain)
  File "/home/cwarth/src/matsen/cft/bin/process_partis.py", line 100, in process_data
    seed_ids = pd.read_csv(part_file).loc[0]['seed_unique_id'].split(':')
  File "/home/cwarth/.conda/envs/py27/lib/python2.7/site-packages/pandas/core/series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)
  File "/home/cwarth/.conda/envs/py27/lib/python2.7/site-packages/pandas/indexes/base.py", line 2183, in get_value
    raise e1
KeyError: 'seed_unique_id'

$ process_partis.py --annotation partis/test/reference-results/seed-partition-new-simu-cluster-annotations.csv --partition partis/test/reference-results/seed-partition-new-simu.csv 
Traceback (most recent call last):
  File "/home/cwarth/src/matsen/cft/bin/process_partis.py", line 237, in <module>
    main()
  File "/home/cwarth/src/matsen/cft/bin/process_partis.py", line 218, in main
    args.chain)
  File "/home/cwarth/src/matsen/cft/bin/process_partis.py", line 100, in process_data
    seed_ids = pd.read_csv(part_file).loc[0]['seed_unique_id'].split(':')
AttributeError: 'numpy.int64' object has no attribute 'split'
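A sketch of the kind of check being asked for, written with the standard csv module instead of pandas so that it stands alone. The function name is hypothetical. Reading the value as text sidesteps the numpy.int64 failure in the second traceback; in the existing pandas code the analogous steps would be testing for the column in `df.columns` and wrapping the value in `str()` before splitting.

```python
import csv
import io


def read_seed_ids(handle):
    """Return the seed ids listed in the first row of a partition file.

    Returns an empty list when there is no seed_unique_id column or the
    value is empty, and raises ValueError with a readable message when the
    file has a header but no data rows.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "seed_unique_id" not in reader.fieldnames:
        return []
    first_row = next(reader, None)
    if first_row is None:
        raise ValueError("partition file has a header but no data rows")
    raw = (first_row["seed_unique_id"] or "").strip()
    # Values are always strings here, so numeric ids split cleanly too.
    return [seed for seed in raw.split(":") if seed]


no_column = read_seed_ids(io.StringIO("partition,logprob\na;b,-1.5\n"))
numeric = read_seed_ids(io.StringIO("seed_unique_id,logprob\n12345,-1.5\n"))
```

Whether a missing column should mean "no seeds" or should stop the run with an error message is a decision for the script; the sketch takes the first option.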

Annotate sequence alignment to indicate rearrangement parameters

It would be useful to show V-N-D-N-J as separate font colors, and SHM edits too. Here is some code that does something close to this, and a screen shot.

CDR3 bounds seem wrong sometimes

In this alignment, the CDR3 is in bold, and V/N1/D/N2/J genes are green/blue/orange/blue/purple, respectively.

It looks like the CDR3 is ending in the N2 region (albeit still on an F codon), rather than on the F codon near the start of the J gene. @psathyrella, is this expected from partis/python/utils.py?

Move bin/demo.sh process_partis.py loop into SConstruct

May as well move to nestly while we're at it...

working on web interface

This issue is a placeholder for a variety of interface tweaks that I want to implement.
This will include work on the top level table and on the individual cluster pages.
I would like to try out the msa package for displaying the alignment for a cluster.

How will we represent/use the uncertainty in clusterings reported by partis?

Note that the clusterings will overlap a lot-- the only modification between clusterings is a single merge.

So we could spit out one tree-building-bundle for each of the unique clusters that's in one of the clusterings.
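Collecting those unique clusters is a small set operation. The sketch below assumes each clustering is a list of clusters and each cluster a list of sequence ids; how partis actually serialises them is not shown here.

```python
def unique_clusters(clusterings):
    """Return each distinct cluster once, in order of first appearance."""
    seen = set()
    ordered = []
    for clustering in clusterings:
        for cluster in clustering:
            key = frozenset(cluster)  # membership only; order is ignored
            if key not in seen:
                seen.add(key)
                ordered.append(key)
    return ordered


clusterings = [
    [["a", "b"], ["c"], ["d"]],
    [["a", "b", "c"], ["d"]],  # one merge relative to the previous clustering
]
bundles = unique_clusters(clusterings)
```

Because consecutive clusterings differ by a single merge, each additional clustering contributes only one new cluster, so the number of bundles grows slowly.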

Relevant for #1 .

process_partis.py crashes while processing QB850.001-Vh/Hs-LN1-5RACE-IgG-new-cluster-annotations.csv

Attempting to process partis output file in /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-Vh/

process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-Vh/Hs-LN1-5RACE-IgG-new-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-Vh/Hs-LN1-5RACE-IgG-new.csv --cluster_base cluster --output_dir postpartis_out/QB850.001-Vh/Hs-LN1-5RACE-IgG-new --separate --chain h

Some additional debugging statements added.

+ process_partis.py --annotations /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-Vh/Hs-LN1-5RACE-IgG-new-cluster-annotations.csv --partition /fh/fast/matsen_e/dshaw/cft/data/seeds/QB850.001-Vh/Hs-LN1-5RACE-IgG-new.csv --cluster_base cluster --output_dir postpartis_out/QB850.001-Vh/Hs-LN1-5RACE-IgG-new --separate --chain h
glfo = <type 'dict'>
region = v
line[v_gene] = IGHV1-18*01
glfo['seqs'].keys() = ['j', 'd', 'v']
region = d
line[d_gene] = IGHD1-7*01
glfo['seqs'].keys() = ['j', 'd', 'v']
region = j
line[j_gene] = IGHJ6*03
glfo['seqs'].keys() = ['j', 'd', 'v']
region = v
line[v_gene] = IGHV1-18*01
glfo['seqs'].keys() = ['j', 'd', 'v']
region = d
line[d_gene] = IGHD5-12*01
glfo['seqs'].keys() = ['j', 'd', 'v']
region = j
line[j_gene] = IGHJ4*01
glfo['seqs'].keys() = ['j', 'd', 'v']
region = v
line[v_gene] = IGHV1-18*1m
glfo['seqs'].keys() = ['j', 'd', 'v']
Traceback (most recent call last):
  File "/shared/silo_researcher/Matsen_F/MatsenGrp/working/cwarth/cft/bin/process_partis.py", line 293, in <module>
    main()
  File "/shared/silo_researcher/Matsen_F/MatsenGrp/working/cwarth/cft/bin/process_partis.py", line 274, in main
    args.chain)
  File "/shared/silo_researcher/Matsen_F/MatsenGrp/working/cwarth/cft/bin/process_partis.py", line 145, in process_data
    calculate_bounds(cluster.to_dict(), glfo)
  File "/shared/silo_researcher/Matsen_F/MatsenGrp/working/cwarth/cft/bin/process_partis.py", line 101, in calculate_bounds
    utils.add_implicit_info(glfo, line)
  File "/home/cwarth/src/matsen/partis/python/utils.py", line 831, in add_implicit_info
    uneroded_gl_seq = glfo['seqs'][region][line[region + '_gene']]
KeyError: 'IGHV1-18*1m'

How to deploy cft web server?

Placeholder for information on how to deploy the cft web server.

contact scicomp for information on standard practice

Remove long root on ete SVG

This appears to just be an artifact of the rerooting, and not relevant.

Partis annotation information in metadata for use in alignment coloring

To color alignments, we need the annotation information about V/N(/D/N/)J boundaries for a given cluster, as well as the SHM mutation positions for each sequence in the cluster, to live in metadata.json. Currently the color annotation done in branch 33-alncolor uses hardcoded values as a mockup. When the annotation data are available, that code can be easily updated to access it.

What format should we use for representing mutations on trees?

@dawahs Perhaps you could paste in a tiny example of a BEAST stoch mapping file?

Reconstructing indels

Sequences in the annotations .csv have indels removed, but if a sequence is queried we may want the sequence with the indels reconstructed.

Do we want this as a single .fa file included in output directory?

Integration test data

Using some simulated data from partis, let's set up continuous integration testing with Travis CI or Wercker.

ASCII art tree and ete SVG trees should have the same leaf order

If possible...

sample metadata; Keemei?

It's clear now that the current hack of extracting sample metadata from sequence files is not a great solution.

I'm going to suggest https://github.com/biocore/Keemei as a way to have validate-able tabular metadata. @lauranoges, would you be willing to keep your metadata in Google Sheets? It would look like this (more or less, but obviously with different data in it):

and would keep everyone loving each other instead of griping about how the metadata isn't formatted correctly.

minor cftweb cleanups

There are a handful of places in the cft code that reference the igdbweb project from which cftweb was copied.

In particular if the app crashes without debugging it will send email to [email protected] which goes to me. I don't know how I aliased [email protected] to my login.

Folks should know about the --debug flag to cftweb - it might be useful.

Also the install requirements are not current, because cftweb does not require the frozen-flask package.

There are other places in the cftweb code that igdbweb is mentioned; those should probably be changed to cftweb.

Better documentation/error messaging when parsing input fails

With the whole parsing of the log file fiasco yesterday, I think I'll try to comb through process_partis.py and add in some checks and clear messaging/documentation in case we hit a snag like this again.

parse partis cluster files and output per-clonal-family files

Input: FASTA file, partis clustering file
Output: per-clonal-family FASTA file, and an inferred naive sequence

See https://matsengrp.slack.com/archives/cft/p1476141257000013 for notes from Duncan about parsing his files.

how to index samples?

@lauranoges : @cswarth is in the process of setting up a web front-end to browse results.

My guess is that you aren't going to want to be presented with a list of all clusters from all data sets. Thus, how would you like them broken up? By individual then by timepoint?

Cutting down a tree from seeds to root

We may want to cut down the set of sequences from an entire clonal family to a smaller set that seem related to the evolution of the given seed sequence. I'm thinking about functionality to do so.

Because there's a unique root to seed path, our tree is going to look like this given a single seed sequence:

|\
|\^
|\^
|\^
^ ^

The only question then is how much to take from each subtree ^. We could say that we want the k closest leaves from each subtree, or we could say that we want all sequences within some distance d from the attachment point to the main trunk. Or some combination of those.

Multiple seeds shouldn't be much harder-- we just have a branched "trunk".

For output, it would be great to get the names of the sequences and the tree induced by the selected subset of sequences. An induced tree is just the tree where we trim out any branches that aren't leading to a leaf in a selected set.

Seems like some useful subroutines could be:

given a rooted tree, get a vector of distances to leaves
given a root and a subset of leaves, get the subtree induced by those leaves
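Both subroutines can be sketched on a plain nested-dict tree. The `name`/`length`/`children` keys are an assumed representation chosen for the example, and the "k closest leaves" rule is just one of the options discussed above.

```python
def leaf_distances(node, dist=0.0):
    """Map leaf name -> path length from `node` down to that leaf."""
    children = node.get("children", [])
    if not children:
        return {node["name"]: dist}
    distances = {}
    for child in children:
        distances.update(leaf_distances(child, dist + child.get("length", 0.0)))
    return distances


def k_closest_leaves(node, k):
    """Names of the k leaves nearest to `node` (ties broken by name)."""
    ranked = sorted(leaf_distances(node).items(), key=lambda kv: (kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


def induced_subtree(node, keep):
    """Copy of the tree with every branch not leading to a leaf in `keep` removed.

    Returns None when no selected leaf lies below `node`.
    """
    children = node.get("children", [])
    if not children:
        return dict(node) if node["name"] in keep else None
    kept = [sub for sub in (induced_subtree(c, keep) for c in children) if sub is not None]
    if not kept:
        return None
    pruned = dict(node)
    pruned["children"] = kept
    return pruned


tree = {"name": "naive", "children": [
    {"name": "A", "length": 1.0},
    {"name": "X", "length": 0.5, "children": [
        {"name": "B", "length": 0.25},
        {"name": "C", "length": 2.0},
    ]},
]}
selected = k_closest_leaves(tree, 2)
small = induced_subtree(tree, set(selected))
```

The pruned tree keeps internal nodes that are left with a single child, so any reconstructed sequences attached to them survive; collapsing those nodes would be a separate step.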

Time points

This is an important point: many of these data sets have multiple time points, and we should do something smart with that information.

The first step would be to

take the union of all of the clusters across various timepoints that have a given sequence, but preserving and transmitting the timepoint information for downstream steps
build trees on the unioned thing
color nodes by timepoint in the clustered tree
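The first of those steps might be sketched as follows, assuming the clusters are available as lists of sequence ids keyed by timepoint (the timepoint labels and function name are placeholders). Each member of the union carries the timepoints at which it was seen, which is what the coloring step would need.

```python
def union_clusters_for(seq_id, clusters_by_timepoint):
    """Merge every cluster that contains `seq_id` across all timepoints.

    Returns {member id: sorted list of timepoints where that member was
    found in a cluster together with `seq_id`}.
    """
    merged = {}
    for timepoint, clusters in clusters_by_timepoint.items():
        for cluster in clusters:
            if seq_id not in cluster:
                continue
            for member in cluster:
                merged.setdefault(member, set()).add(timepoint)
    return {member: sorted(timepoints) for member, timepoints in merged.items()}


clusters_by_timepoint = {
    "t1": [["seed", "a"], ["b"]],
    "t2": [["seed", "a", "c"], ["d"]],
}
unioned = union_clusters_for("seed", clusters_by_timepoint)
```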

Add toggle for svg tree from ete3

@lauranoges et al. really liked these, so we'd like a button/toggle on the tree/alignment view that would let us show that instead of the asciiart.

This should share as much of the template as possible with the asciiart viz, and should be routed to by a url query param with something like ?viz_mode=ete3_svg.
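The query-parameter handling could be as small as the sketch below. It uses Python 3's urllib.parse so that it runs on its own; inside a Flask view the same lookup would normally go through `request.args.get`. The mode name "asciiart" and the fallback behaviour are assumptions.

```python
from urllib.parse import parse_qs, urlparse

VIZ_MODES = {"asciiart", "ete3_svg"}
DEFAULT_VIZ_MODE = "asciiart"


def viz_mode_from_url(url):
    """Pick the visualization mode from a URL, falling back to the default."""
    query = parse_qs(urlparse(url).query)
    mode = query.get("viz_mode", [DEFAULT_VIZ_MODE])[0]
    return mode if mode in VIZ_MODES else DEFAULT_VIZ_MODE
```

The shared template can then branch on the returned mode in one place and render everything else identically.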

Ancestral sequence reconstruction

David and I just created trees based on Duncan's analysis of the sequences I gained recently. It would be really great if there was a way to reconstruct the sequences at internal nodes of the trees, i.e. ancestral sequences that we didn't directly observe. Any thoughts? 🤔