
msweep's People

Contributors

ahonkela, tmaklin

Forkers

piketulus

msweep's Issues

Group indicators are not validated with Themisto input

mSWEEP should stop execution if the number of group indicators (supplied with the -i flag) does not match the number of reference sequences in the input. When running with input from Themisto, the program will happily keep executing even if the numbers do not match.

It might be possible to validate the indicators by supplying the Themisto index as an extra argument. This would have the added benefit that the group indicators no longer need to be supplied separately when using Themisto, because the index directory will contain them.

This change will have to wait until the format for Themisto indexes is finalized.
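
Until such validation exists, a manual sanity check is to compare the two counts directly; the file names below are hypothetical:

wc -l < ref_clu.txt           # number of group indicators
grep -c '^>' reference.fasta  # number of sequences in the FASTA the Themisto index was built from

The two numbers should be equal.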

Create a conda recipe for easy installation

(Bio)conda has become something of a standard way to easily install bioinformatics tools and pipelines. mSWEEP should be installable via conda to make the tool available to less tech-savvy users.
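
Once a recipe exists, installation would ideally reduce to a single command; the channel and package name below are assumptions:

conda install -c bioconda msweep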

Question about parameters for PopPunk

Hello,

Many thanks for your tool. I was wondering whether you have any recommendations on the PopPunk clustering used as input for mSWEEP. I have a modest number of S. epi genomes, and I am using dbscan for the model fit plus the model refine option (since S. epi is recombinogenic). Are there any parameters that can affect mSWEEP downstream, and would you have any recommendations for PopPunk clustering of S. epi?

Many thanks

Add option to read in a precomputed likelihood matrix

mSWEEP has had the capability to write out the internal likelihood matrix since #11. Since reading in the alignments and filling the likelihood matrix do not parallelize well (the process is I/O-bound), precomputing the likelihood matrix can save HPC resources in cases where the alignment files are very large. Therefore, an option should be added to read in a likelihood matrix that was saved with the --write-likelihood flag.
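
A possible two-stage workflow once such an option exists; the --read-likelihood flag and the likelihood file name are hypothetical, while the other flags appear elsewhere in these issues:

# stage 1: parse the alignments once and save the likelihood matrix
mSWEEP --themisto-1 ali_1.aln.gz --themisto-2 ali_2.aln.gz -i ref_clu.txt -o sample --write-likelihood
# stage 2: re-estimate later without re-reading the alignments
mSWEEP --read-likelihood sample_likelihoods.txt -i ref_clu.txt -o sample_rerun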

Testing mSWEEP on C. jejuni data, questions about constructing a reference set

Hi,
Thanks for providing the tool and the extensive documentation.
I am currently trying to apply the mGEMS pipeline, starting with mSWEEP, to a dataset of Campylobacter samples. Incidentally, these are also plate sweeps for which I also have isolate sequences.

Initial tests on a sample with a known mixture, using the database you provided in the paper, gave some odd results; I suspect this is because the reference genomes are too distant.
So I decided to build my own reference genome set: I grouped the isolate genomes available from PubMLST by ST and, for each ST, selected up to 5 genomes (when more were available) with different cgc_100_group values (i.e. more than 100 different alleles in cgMLST), ending up with 1715 genomes covering 1388 distinct STs. [While writing this up I noticed I made a mistake in the selection; it should actually lead to 3026 genomes with 2332 unique STs.]
I then reran the Themisto pseudoalignment and mSWEEP (v1.4) on the sample with the known mixture, and I still get incorrect abundances / group assignments (examples below).

I was hoping you could point me in the right direction on how to use the tool and help me improve my understanding of how to build a more suitable reference dataset. Mainly I am trying to understand how fine-grained the clustering should be.

Thanks in advance for any help and sorry if this post has gotten a bit too long :)

questions

Q1: I guess the main problem is finding the right level of dis/similarity between the reference genomes and the assigned groups.
In the end the grouping should serve to capture distinct strains and guide read binning in mGEMS, so for example there should not be two strains with the same CC/ST in a single sample? Should the reference genomes within a group be as different as possible (to capture as many strains belonging to that group as possible), or should they be as similar to each other as possible?

Q2: In your testing, what is the typical behavior of mSWEEP when no reference genome highly similar to a strain in the mixed sample is present in the reference set? From what I see, it seems that the abundances are then "split" across the closest genomes in the reference set.

Q3: How large can the reference FASTA be while still working reasonably with Themisto (on a server with 250 GB of memory)?

some examples from the results:

In the sample I have 3 strains: ~60% ST21, ~26% ST3218, ~14% ST5845

Mixed sample STs:

ST        aspA    glnA    gltA    glyA    pgm     tkt     uncA
ST21      2       1       5       3       2       5       19
ST3218    64      70      22      98      123     86      16
ST5845    18      70      164     97      115     86      47

Abundances with the CC database (from the paper):

                           CC                  V2
Campylobacter_jejuni_CC21      0.911332000000
Campylobacter_jejuni_CC45      0.045833100000
Campylobacter_jejuni_CC661      0.026714400000
Campylobacter_jejuni_CC353      0.013283300000
Campylobacter_jejuni_CC607      0.002822730000

CC21 is correctly matched to the most abundant strain; the two other strains present in the sample do not have a CC assigned.

Top 5 groups with the custom database:

ST         V2         V3         V4         V5
7440 0.43524000 0.43386900 0.43344000 0.43632600
7144 0.21560000 0.21720200 0.21750400 0.21470600
4875 0.19998300 0.19962500 0.19964200 0.19971100
6428 0.05697990 0.05731250 0.05721290 0.05699080
9389 0.03805540 0.03783970 0.03823250 0.03815700

ST7440 (differing only at pgm_725) and ST7144 (uncA_418) have MLST profiles similar to ST21 (the most abundant strain), but ST21 itself (5 reference genomes) is predicted at very low abundance.
ST4875 (glyA_103) and ST6428 (aspA_344) have profiles similar to the 2nd strain; the correct ST3218 (1 reference genome) is predicted at an abundance of 0.008.
ST9389 (gltA_77, pgm_205, tkt_743) has a profile similar to the 3rd strain, but no references with the correct ST5845 were included.

Using the bootstrapping option (--iters) adds an extra line to the output abundances file

If mSWEEP is run with the --iters option, the output *_abundances.txt file has an extra empty line at the end compared to the file produced without --iters. This extra line confuses postprocessing tools such as mGEMS, so the output format should be fixed to remove the trailing empty line.

I will add a fix as soon as possible.
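
In the meantime, a possible workaround is to strip the trailing blank line before handing the file to mGEMS (GNU sed; the file name is a placeholder):

sed -i '${/^$/d}' sample_abundances.txt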

Feature request: report the fraction of unassigned reads

Hi, thanks for making this great tool!

It would be nice to know the fraction of the sample that could not be assigned to any target sequence. This would give an idea of how much of the sample is noise or falls outside the pseudoalignment reference database. I know I can already estimate this number by parsing the pseudoalignment file myself, but it would be convenient if mSWEEP reported it in the comment section of the output file.
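
For now, an approximation can be computed directly from the Themisto output. Assuming the plain-text format in which each line starts with the read rank followed by the ids of the matched references, reads with no matches have a single field (the file name is a placeholder):

zcat ali_1.aln.gz | awk 'NF == 1 { unassigned++ } END { print unassigned / NR }'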

Segmentation fault / persistent memory issues while building the log-likelihood array

Hi Tommi, I am trying to run mSWEEP and keep getting a segmentation fault.

mSWEEP-v1.6.1 abundance estimation
Parsing arguments
Reading the input files
reading group indicators
read 5580 group indicators
reading pseudoalignments
read 2612421 unique alignments
Building log-likelihood array
/cm/local/apps/slurm/var/spool/job1140150/slurm_script: line 12: 215562 Segmentation fault mSWEEP -t 62 --themisto-1 230219_fullFq_fullRefset/klebs/ali_1.aln.gz --themisto-2 230219_fullFq_fullRefset/klebs/ali_2.aln.gz -o 230219_fullFq_fullRefset/klebs/msweep -i 230201_run/klebs/ref_clu.txt --write-probs --gzip-probs

I have tried reducing the number of threads and increasing the memory, but still no luck. Any suggestions?

Optimizing mSWEEP runs on large datasets

Hello mSWEEP developers,

I'm having a hard time running mSWEEP on our samples, which average 30 million read pairs each. The Themisto index contains 2137 reference genomes, so our alignments are quite large, and I'm wondering how best to optimize the mSWEEP run.

First I tried the default settings, using a full node with 48 threads, but the process had not finished after almost three days. I then tried the --min-hits, --max-iters, and --tol flags with varying success, and I'm hoping to get your opinion on which combination to use (a sketch of how the flags fit together on the command line follows the list below).

Here's what I tried:

  • The --max-iters flag didn't seem to make much of a difference, even when reduced to 100.
  • The --min-hits flag worked well to get results within a few hours, but I used an extreme value of 1,000,000, so that might have been too stringent.
  • Lastly, picking the extreme value of --tol 0.1 also got me results within a few hours, similar to those from the --min-hits run; however, I have no idea how to choose the best value.
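
For reference, these flags combine on the command line roughly as follows (paths and values are placeholders, not recommendations):

mSWEEP -t 48 --themisto-1 ali_1.aln.gz --themisto-2 ali_2.aln.gz -i ref_clu.txt -o sample --min-hits 2 --max-iters 2000 --tol 1e-2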

Do you have recommendations on which combination of flags could help me improve run time without greatly influencing the results?

Thank you in advance for your time,
Enrique

Automatically determine `--tol` to avoid convergence issues caused by numerical instability on large input

The estimation algorithm from rcgpar will sometimes get stuck oscillating around the solution on very large input because of numerical stability issues in checking the convergence of the gradient descent.

While this can be worked around by increasing the value of the --tol parameter, it would be better to adapt the parameter value to the size of the input, which determines the number of floating point operations performed.

Installation problems

Hi,

mSWEEP looks exciting, but I am having trouble installing it. I have pasted part of the stdout below:

ubuntu@harry:~/programs/mSWEEP$ mkdir build
ubuntu@harry:~/programs/mSWEEP$ cd build/
ubuntu@harry:~/programs/mSWEEP/build$ cmake ..
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ubuntu/programs/mSWEEP/build
ubuntu@harry:~/programs/mSWEEP/build$ make
Scanning dependencies of target mSWEEP
[ 10%] Building CXX object CMakeFiles/mSWEEP.dir/src/read_bitfield.cpp.o
[ 20%] Building CXX object CMakeFiles/mSWEEP.dir/src/likelihood.cpp.o
/home/ubuntu/programs/mSWEEP/src/likelihood.cpp: In function ‘double lbeta(double, double)’:
/home/ubuntu/programs/mSWEEP/src/likelihood.cpp:7:10: error: ‘lgamma’ is not a member of ‘std’
return(std::lgamma(x) + std::lgamma(y) - std::lgamma(x + y));
^

There are more similar errors after this which I haven't shown here. It looks like there is a problem with the lgamma function; is this due to a library I am missing, or something else?

Thanks,

Harry
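
A likely cause is that src/likelihood.cpp does not include <cmath>, the header that declares std::lgamma in C++11. A quick local workaround (a sketch, assuming the file layout shown in the error messages and GNU sed) is to insert the include at the top of the file and rebuild:

sed -i '1i #include <cmath>' src/likelihood.cpp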

Kallisto reference index for multi-contig assemblies

Hello again,

After a successful installation I want to build the reference index with kallisto. I suppose the right command is kallisto index -i example_kmi (the pre-processing instructions say kallisto pseudo, whereas the toy example says kallisto index).
Now, my reference sequence assemblies each consist of multiple contigs, but the README asks for one FASTA with all the reference sequences and a clustering file with one cluster assignment per sequence. This seems a bit impractical; please correct me if I'm wrong, but I assume I would have to merge all my assemblies into one very big FASTA file and provide a cluster assignment file with as many rows as there are contigs across all assemblies combined?
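
Something along these lines, I suppose (paths and file names are just placeholders):

for f in assemblies/*.fasta; do
  label=$(basename "$f" .fasta)
  cat "$f" >> all_refs.fasta
  # repeat this assembly's cluster label once per contig
  n=$(grep -c '^>' "$f")
  for i in $(seq "$n"); do echo "$label"; done >> clusters.txt
done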

Thanks, Aaron

make fails

Hi there,

When I try building mSWEEP I get the following error. Is there any way to fix that?

[ 11%] Building CXX object CMakeFiles/mSWEEP.dir/src/Sample.cpp.o
cc1plus: error: unrecognised command line option ‘-std=c++11’
make[2]: *** [CMakeFiles/mSWEEP.dir/src/Sample.cpp.o] Error 1
make[1]: *** [CMakeFiles/mSWEEP.dir/all] Error 2
make: *** [all] Error 2

Thanks, Aaron
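
This error usually means the compiler is too old to understand -std=c++11. One possible workaround, assuming a newer g++ is installed on the system (the version below is just an example), is to point CMake at it explicitly and rebuild from a clean build directory:

cmake -DCMAKE_C_COMPILER=gcc-7 -DCMAKE_CXX_COMPILER=g++-7 ..
make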

Add option to supply multiple clusterings for the same run

(Re)running mSWEEP with several clusterings of the reference sequences (e.g. hierarchically, from genus -> species -> sequence type -> lineage) is sometimes useful, but it currently requires rerunning the entire estimation.

Since loading the pseudoalignments from Themisto can take quite a while, especially for large sequencing runs, it would be useful to have an option for estimating the abundances several times with different clusterings within a single run.
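
The current workaround is one full run per clustering, re-reading the alignments each time (file names are placeholders):

for clu in species_clu.txt st_clu.txt lineage_clu.txt; do
  mSWEEP --themisto-1 ali_1.aln.gz --themisto-2 ali_2.aln.gz -i "$clu" -o "out_${clu%.txt}"
done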
