Henk has warned that it can be hard to separate outbreak samples from other environmen

Add script to pull out clade-specific indels and supernodes about outbryk HOT 17 OPEN

iqbal-lab commented on August 27, 2024

Add script to pull out clade-specific indels and supernodes

from outbryk.

Comments (17)

iqbal-lab commented on August 27, 2024

How to do the second bit (b) ?

make a filelist, listing all the cleaned sample graph binaries in a cluster, FLIST

cat FLIST
sample1.ctx
sample2.ctx
etc

Then list this in a file
ls FLIST > COLLIST

cortex_var_31_c1 --colour_list COLLIST --mem_height 21 --mem_width 100 --sample_id cluster_name --output_supernodes contigs.fa --max_var_len 100000

This gives us a fasta file contigs.fa
Go through this and discard any contig shorter than 500bp, and calculate the longest read - say this length is X.

Then make a filelist for every sample
ls sample1.ctx > sample1.ctx.list
...
etc
and list these in a file COL2
ls sample*ctx.list > COL2

Then, say there are 21 samples in our clyster
cortex_var_31_c21 --colour_list COL2 --mem_height 21 --mem_width 100 --pan_genome_matrix contigs.fa --max_read_length X

This will produce a matrix - for each contig (row), it will say what % of the kmers in that contig are in each sample (column)

Then we just look for contigs present only in our cluster (set of columns)

from outbryk.

iqbal-lab commented on August 27, 2024

Rachel -can I give this to you to try out on Henk's listeria dataset please? Could even test on Jen;s paper I suppose, but I think we have plenty to be going with

from outbryk.

rmcolq commented on August 27, 2024

OK. :)

from outbryk.

rmcolq commented on August 27, 2024

A) Are you suggesting that after making a tree using phyml based on snps in samples, we take these clusters and try to further divide them by indels? Or that we build a tree to separate clusters based just on indels?
Or were you thinking of building a tree using both snps and indels (currently using snp-sites to get phylip file to input to phyml. Snp-sites does snps only.)

Or should I just choose one of these that works?

from outbryk.

iqbal-lab commented on August 27, 2024

So this all hinges on how you interpret a tree. Usually, people want to interpret a tree in an evolutionary context, and they want branch lengths to be proportional to time. On that basis, they use just SNPs, making the approximation that there is a fixed mutation rate for SNPs. Indels occur by different mechanisms to SNPs and the rate will be different, so they don't include them. On that basis, i don't really want to make a global tree based on both SNPs and indels.
However when looking at clusters, these should differ by very few SNPs (~4), and honestly I was not completely clear in my head how I would compare them. I was initially thinking of seeing whether samples that seem identical (or nearly) can be distinguished by indels or phages. Making a tree of things works for vertically inherited indels, but not for horizontally transferring phages.

from outbryk.

iqbal-lab commented on August 27, 2024

We could easily experiment by taking the Minnesota clusters and Listeria testset clusters, and looking at (high confidence) indels in those clusters, and phages. Does this all make sense?

from outbryk.

iqbal-lab commented on August 27, 2024

But I would re-call variants within a cluster with a local reference...

from outbryk.

rmcolq commented on August 27, 2024

I'm still confused. Here is what I think you are suggesting:
Once we have a phyml constructed tree (from SNPs), we 'extract' clusters. Within clusters we want more resolution. This can be done with 1) indels 2) phages. So for each cluster:

pick a local reference (my feeling with MASH is that it isn't going to be ideal for this - would there be a problem with just picking one of the samples randomly/the first?)
re-call variants within that cluster with that reference (is this the combine_vcfs.pl script in cortex, or an actual run_calls.pl again?)
using the new vcf(?) subset just indel type variation and try to spot sub-clusters? (indels)
separately do what you suggest above with finding large contigs and making matrix with cortex. Try to spot contigs that appear in some groups in the cluster and not others. (phages?)
combine this information?

from outbryk.

iqbal-lab commented on August 27, 2024

OK, you are close. Here is my corrected version

Once we have a phyml constructed tree (from SNPs), we 'extract' clusters. Within clusters we want more resolution. This can be done with 1) indels 2) phages. So for each cluster: (all right so far)

pick a local reference (my feeling with MASH is that it isn't going to be ideal for this - would there be a problem with just picking one of the samples randomly/the first?)

A reference has to be a fully assembled reference genome, so we can't just use one of them - we'd need to assemble it fully, which I want to avoid. At worst we can use the same one we used before, but a closer one might be an improvement. However, we could choose our favourite reference (note we only have a big set of references for Salmonella at the moment) by looking at core SNPs (I know I've not explained how to do this for now)

re-call variants within that cluster with that reference (is this the combine_vcfs.pl script in cortex, or an actual run_calls.pl again?)

This really is call again from scratch using run_calls

using the new vcf subset just indel type variation and try to spot sub-clusters? (indels)

yes, well I guess first I would look at SNPs called by this method and build a tree just of these samples, and then look at indels on top

separately do what you suggest above with finding large contigs and making matrix with cortex.

yes

Try to spot contigs that appear in some groups in the cluster and not others.
yes

(phages?)
I would take Henk's list of phages, as a fasta, and maybe remove reduncancy using CD-HIT, and then use pan_genome_matrix to see which samples do/not have them

combine this information?

Well, I don't think we can automate that in advance
Z

from outbryk.

rmcolq commented on August 27, 2024

OK, thanks!

from outbryk.

iqbal-lab commented on August 27, 2024

Hold on - why won't MASH work? What have you found so far?

from outbryk.

rmcolq commented on August 27, 2024

Not much…I still have to try it Henks way with sketching the references first and ‘grooming’(?), but if you brute force it like I did yesterday morning:

for ref in /data3/projects/outbreak_challenge/salm/ref_info/Salm_refs/*.fasta; do

    ref_id=${ref%.fasta*}  # get the part before the .fasta

    ref_id=${ref_id##*/}

    echo $ref_id

    mash dist $ref ../pseudo_samples/sample_* &> mash_summary_$ref_id.txt

Done

It takes forever!

It will be quicker done properly…

I had got the impression you could only compare sample(s) against one reference at a time, but rereading Henks guidelines, they suggest otherwise. My misunderstanding.

from outbryk.

iqbal-lab commented on August 27, 2024

I believe henk's suggestion, that MASH will work best on trimmed reads, so I think it will be worthwhile uysing Trimmomatic to clean the reads first. Pretty sure if you do it right it is extremely fast

from outbryk.

hcdenbakker commented on August 27, 2024

Lecture starts in 5 min; but yes, groomed/trimmed reads is essential. Otherwise it performs pretty badly. Takes < 30 s for one search on my Mac.

Sent from my iPhone

On Sep 10, 2015, at 9:16 AM, Zamin Iqbal [email protected] wrote:

I believe henk's suggestion, that MASH will work best on trimmed reads, so I think it will be worthwhile uysing Trimmomatic to clean the reads first. Pretty sure if you do it right it is extremely fast

—
Reply to this email directly or view it on GitHub.

from outbryk.

rmcolq commented on August 27, 2024

Do we have alternative references for Listeria? Or just the one?
Do we only need to run run_calls again if we are using a new reference?
If not rebuilding the graphs, is there still a way to do the combining and genotyping and merging into one vcf?

from outbryk.

iqbal-lab commented on August 27, 2024

Yes here
https://github.com/iqbal-lab/outbryk/tree/master/species/Listeria/refdata
Yes
The joint workflow will do this

from outbryk.

hcdenbakker commented on August 27, 2024

Rachel,

I put 56 potential other reference sequences in the species/Listeria/refdata folder on the outbryk repository. You can check with Mash which one is the closest match. I found it was L2624 with a smaller number of references.

Sent from my iPhone

On Sep 11, 2015, at 8:14 AM, rmnorris [email protected] wrote:

Do we have alternative references for Listeria? Or just the one?
Do we only need to run run_calls again if we are using a new reference?
If not rebuilding the graphs, is there still a way to do the combining and genotyping and merging into one vcf?
—
Reply to this email directly or view it on GitHub.

from outbryk.

Add script to pull out clade-specific indels and supernodes about outbryk HOT 17 OPEN

Comments (17)

Related Issues (5)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent