caporaso-lab / genome-sampler

Home Page: https://caporasolab.us/genome-sampler/
License: BSD 3-Clause "New" or "Revised" License
I came across both in GISAID downloaded sequence data today, and they caused imports to fail. I think this should be handled by first replacing any U or u characters in sequences with a T, and then dropping any sequences that still contain characters outside the IUPAC DNA alphabet (the ? probably implies N, but I'm not comfortable with silently making that assumption in general).
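A minimal sketch of the proposed handling, assuming plain Python strings rather than QIIME 2 artifacts (the clean_sequences name is hypothetical, not part of genome-sampler's API):

```python
# IUPAC DNA alphabet: standard bases plus degenerate codes.
IUPAC_DNA = set('ACGTRYSWKMBDHVN')

def clean_sequences(seqs):
    """Replace U/u with T, then drop sequences that still contain
    characters outside the IUPAC DNA alphabet (e.g., '?')."""
    kept = {}
    for seq_id, seq in seqs.items():
        seq = seq.upper().replace('U', 'T')
        if set(seq) <= IUPAC_DNA:
            kept[seq_id] = seq
    return kept

seqs = {'s1': 'ACGU', 's2': 'ACG?', 's3': 'acgu'}
print(clean_sequences(seqs))  # s2 is dropped because of the '?'
```

Note that sequences are only dropped after the U-to-T replacement, so RNA-alphabet sequences survive the filter.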
Include any discussion of limitations.
As discussed on Virological. This will enable us to construct an automated pipeline from context/focal sequences through a phylogenetic tree with complete data provenance.
This should greatly speed up this step and doesn't impact the results in any meaningful way (different sequence ids might show up in the final set, but that doesn't matter).
This will likely take FeatureData[Sequence] as input and generate FeatureData[Taxonomy] as output, mirroring taxonomy assignment in QIIME 2 microbiome workflows.
We are getting by with Categorical in the plugin, but this feature would be pretty handy overall.
=================================== FAILURES ===================================
__________ TestSubsampleNeighbors.test_sample_cluster_missing_locales __________

self = <genome_sampler.tests.test_sample_neighbors.TestSubsampleNeighbors testMethod=test_sample_cluster_missing_locales>

    def test_sample_cluster_missing_locales(self):
        columns = ['context_id', 'n_mismatches', 'locale']
        cluster = pd.DataFrame([['c4', 5, 'abc'],
                                ['c2', 0, float('nan')],
                                ['c99', 1, float('nan')],
                                ['c42', 2, 'abc']],
                               columns=columns)
        count_obs_c4 = 0
        count_obs_c2 = 0
        count_obs_c99 = 0
        count_obs_c42 = 0
        for _ in range(self._N_TEST_ITERATIONS):
            obs = _sample_cluster(cluster, 3, np.random.RandomState())
            self.assertEqual(len(obs), 3)
            if 'c4' in obs:
                count_obs_c4 += 1
            if 'c2' in obs:
                count_obs_c2 += 1
            if 'c99' in obs:
                count_obs_c99 += 1
            if 'c42' in obs:
                count_obs_c42 += 1
        # c4 and c42 both have locale "abc" and c99 and c2 have unknown locale,
        # so we expect to see c99 and c2 more frequently
        self.assertTrue(count_obs_c99 > count_obs_c4)
        self.assertTrue(count_obs_c99 > count_obs_c42)
>       self.assertTrue(count_obs_c2 > count_obs_c4)
E       AssertionError: False is not true
I've received questions from multiple users on how to run in parallel, so we should add a specific section to the docs on this. Here is some text copied from my replies that we can use in this section:
genome-sampler can be run in parallel to speed it up. This is done in different ways depending on whether you're running the steps individually or through Snakemake.
If you're using Snakemake, you need to edit the Snakefile and set the N_THREADS value to the number of threads you'd like genome-sampler to use.
If you're running the steps individually, you can pass the --p-n-threads option to several of the commands. For example, sample-diversity is the slowest step in the workflow. You can provide the --p-n-threads parameter to run it in parallel:
qiime genome-sampler sample-diversity \
--i-context-seqs filtered-context-seqs.qza \
--p-percent-id 0.9995 \
--o-selection diversity-selection.qza \
--p-n-threads n
When running this command, you should set n to be the number of available processors or cores on a single node of your system. For example, I work on a cluster that has nodes with 28 cores, so when I submit this job I would run:
qiime genome-sampler sample-diversity \
--i-context-seqs filtered-context-seqs.qza \
--p-percent-id 0.9995 \
--o-selection diversity-selection.qza \
--p-n-threads 28
This uses all of the resources on a single node of the cluster for me. In the future we'll be adding support for splitting workflows like this across multiple cluster nodes, but we do not have this support at this time.
This used to be done by finding the closest non-identical sequence by Hamming distance.
I'm running an analysis with yesterday's GISAID downloads and there is at least one sequence id that is present in the sequence data but not the metadata. We should add a note to the documentation on how to handle this and maybe add it to the Snakemake file as well. Here's how I'm handling this now:
qiime feature-table filter-seqs \
--i-data context-seqs.qza \
--m-metadata-file context-metadata.tsv \
--o-filtered-data context-seqs-w-metadata.qza
This could pull code from the combine_selections action so it could be reused elsewhere.
It may be useful to have one that is not connected to near-neighbor sampling.
Sequences with fewer N characters should be considered better candidates for inclusion.
Currently sample-random can only take the number of items you want to randomly sample. Evan and I ran into an instance while working on the benchmark where it would have been useful to be able to instead specify a proportion of the total number of samples that you want.
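A sketch of what a proportion-based option could look like, in plain Python; the sample_random function and its parameter names here are a hypothetical stand-in for the plugin action, not its actual signature:

```python
import math
import random

def sample_random(ids, n=None, fraction=None, seed=None):
    """Randomly sample either a fixed count (n) or a proportion
    (fraction) of the provided ids; exactly one must be given."""
    if (n is None) == (fraction is None):
        raise ValueError('Provide exactly one of n or fraction.')
    if fraction is not None:
        # Round up so a non-zero fraction always yields at least one sample.
        n = math.ceil(fraction * len(ids))
    return random.Random(seed).sample(list(ids), n)

ids = ['s%d' % i for i in range(100)]
subset = sample_random(ids, fraction=0.1, seed=42)
print(len(subset))  # 10
```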
We could port this format to q2-types as a less strict DNAFastaFormat, and DNAFastaFormat could become our strict DNA FASTA format. These could share a common base class, or we could have a function that creates these formats given ValidationSet and FASTADNAValidator.
This will be useful in tests. Update unit tests when this has been added.
Should use pyvcf pending W-L/ProblematicSites_SARS-CoV2#4
Add tests of this transformer when this is addressed - these were inadvertently left out of #56.
[Wed Aug 26 08:16:24 2020]
rule import_context_seqs:
    input: context-seqs.fasta
    output: context-seqs.qza
    jobid: 4
Traceback (most recent call last):
  File "/home/ctbrown/miniconda3/envs/genome-sampler/lib/python3.6/site-packages/qiime2/sdk/util.py", line 90, in parse_format
    format_record = pm.formats[format_str]
KeyError: 'GISAIDDNAFASTAFormat'
Any thoughts?
The current Snakemake workflow takes a long time to run. We'll need a smaller data set (shorter sequences, I think) for efficient testing.
It'd be useful to know which and how many sequences were filtered for being too long, too short, or too ambiguous.
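One way to collect those counts is to tally a reason for each dropped sequence during filtering. This is a sketch under assumed plain-string sequences and hypothetical threshold parameters, not genome-sampler's actual filter implementation:

```python
from collections import Counter

def filter_with_report(seqs, min_len, max_len, max_ambiguous_frac):
    """Filter sequences, tallying why each dropped sequence failed."""
    kept, reasons = {}, Counter()
    for seq_id, seq in seqs.items():
        # Count characters outside the unambiguous DNA alphabet.
        ambiguous = sum(1 for c in seq.upper() if c not in 'ACGT')
        if len(seq) < min_len:
            reasons['too short'] += 1
        elif len(seq) > max_len:
            reasons['too long'] += 1
        elif ambiguous / len(seq) > max_ambiguous_frac:
            reasons['too ambiguous'] += 1
        else:
            kept[seq_id] = seq
    return kept, reasons

seqs = {'a': 'ACGT' * 10, 'b': 'AC', 'c': 'N' * 40, 'd': 'A' * 500}
kept, reasons = filter_with_report(seqs, min_len=10, max_len=100,
                                   max_ambiguous_frac=0.1)
print(sorted(kept))  # ['a']
```

The reasons counter gives exactly the summary the issue asks for: how many sequences were dropped, broken down by cause.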
This was suggested by one of our reviewers at F1000.
Possibly to use packages.qiime2.org.
See https://github.com/executablebooks/quantecon-mini-example/tree/master/.github/workflows for reference.
Phase 1: build https://github.com/caporaso-lab/q2-covid-19/blob/master/docs/methods.md "as-is" (updating as we replace scripts with QIIME 2 actions).
Phase 2: integrate https://github.com/qiime2/sphinx-ext-qiime2 (once ready), so that we can get q2cli-executed outputs (just like on docs.qiime2.org).
Phase 3: replace q2cli commands with usage examples (will require new directives in sphinx-ext-qiime2).
I am trying to run genome-sampler on a 350k-genome dataset. However, I got the following error:
[Mon Feb 15 09:51:35 2021]
Error in rule view_context_seqs:
    jobid: 1
    output: results_15fev21/context-seqs.qzv
    shell:
        qiime feature-table tabulate-seqs --i-data results_15fev21/context-seqs.qza --o-visualization results_15fev21/context-seqs.qzv
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
One interesting thing is that the computer froze for a few seconds and the RAM was full. Do you believe it is a RAM issue? My computer has 64 GB of RAM.
This came up on the QIIME 2 Forum here. This requires some rethinking of how our subsamples are defined, since (as far as I know) this isn't something we could achieve with vsearch (e.g., for sample-diversity), but this has come up for me as well, so I think it's something we'll run into again.
- methods.md to tutorial.md
- Snakemake folder, update instructions for accessing that data (N_THREADS)

It would be useful to collect some basic information when new issues come in. We could look at the templates defined here as a basis:
https://github.com/qiime2/template-repo/tree/master/github_templates/templates/ISSUE_TEMPLATE
This came up in peer review.
sample-diversity should be 0.9990 by default and sample-neighbors should be 0.9999 by default
We'll need to see if vsearch has options for this. If not, we should be able to achieve this by sorting sequences based on their fraction of degenerate characters prior to clustering, so that the sequences with the smallest fraction of degenerate characters are most likely to become cluster centroids.
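The sorting fallback could look like this sketch; the function names and the degenerate-character set are illustrative (not vsearch's API), and vsearch would then receive the sequences in this order so that low-degeneracy sequences are seen first:

```python
# Degenerate IUPAC DNA characters (everything other than A, C, G, T).
DEGENERATE = set('RYSWKMBDHVN')

def degenerate_fraction(seq):
    """Fraction of characters in seq that are degenerate IUPAC codes."""
    return sum(1 for c in seq.upper() if c in DEGENERATE) / len(seq)

def sort_for_clustering(seqs):
    """Order sequences so the least degenerate come first and are
    therefore most likely to be selected as cluster centroids."""
    return sorted(seqs, key=degenerate_fraction)

print(sort_for_clustering(['ACGTNN', 'ACGTAC', 'ACGTAN']))
# ['ACGTAC', 'ACGTAN', 'ACGTNN']
```

Python's sort is stable, so sequences with equal degenerate fractions keep their original relative order.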