caporaso-lab / genome-sampler

Home Page: https://caporasolab.us/genome-sampler/
License: BSD 3-Clause "New" or "Revised" License
I came across both in GISAID downloaded sequence data today, and they caused imports to fail. I think this should be handled by first replacing any U or u characters in sequences with a T, and then dropping any sequences that still contain characters outside the IUPAC DNA alphabet (the ? probably implies N, but I'm not comfortable with silently making that assumption in general).
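A minimal sketch of the proposed handling, assuming plain Python strings rather than QIIME 2 artifacts (the clean_sequences name is hypothetical, not part of genome-sampler's API):

```python
# IUPAC DNA alphabet: standard bases plus degenerate codes.
IUPAC_DNA = set('ACGTRYSWKMBDHVN')

def clean_sequences(seqs):
    """Replace U/u with T, then drop sequences that still contain
    characters outside the IUPAC DNA alphabet (e.g., '?')."""
    kept = {}
    for seq_id, seq in seqs.items():
        seq = seq.upper().replace('U', 'T')
        if set(seq) <= IUPAC_DNA:
            kept[seq_id] = seq
    return kept

seqs = {'s1': 'ACGU', 's2': 'ACG?', 's3': 'acgu'}
print(clean_sequences(seqs))  # s2 is dropped because of the '?'
```

Note that sequences are only dropped after the U-to-T replacement, so RNA-alphabet sequences survive the filter.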
Include any discussion of limitations.
As discussed on Virological. This will enable us to construct an automated pipeline from context/focal sequences through a phylogenetic tree with complete data provenance.
This should greatly speed up this step and doesn't impact the results in any meaningful way (different sequence ids might show up in the final set, but that doesn't matter).
This will likely take FeatureData[Sequence] as input and generate FeatureData[Taxonomy] as output, mirroring taxonomy assignment in QIIME 2 microbiome workflows.
We are getting by with Categorical in the plugin, but this feature would be pretty handy overall.
=================================== FAILURES ===================================
__________ TestSubsampleNeighbors.test_sample_cluster_missing_locales __________

self = <genome_sampler.tests.test_sample_neighbors.TestSubsampleNeighbors testMethod=test_sample_cluster_missing_locales>

    def test_sample_cluster_missing_locales(self):
        columns = ['context_id', 'n_mismatches', 'locale']
        cluster = pd.DataFrame([['c4', 5, 'abc'],
                                ['c2', 0, float('nan')],
                                ['c99', 1, float('nan')],
                                ['c42', 2, 'abc']],
                               columns=columns)
        count_obs_c4 = 0
        count_obs_c2 = 0
        count_obs_c99 = 0
        count_obs_c42 = 0
        for _ in range(self._N_TEST_ITERATIONS):
            obs = _sample_cluster(cluster, 3, np.random.RandomState())
            self.assertEqual(len(obs), 3)
            if 'c4' in obs:
                count_obs_c4 += 1
            if 'c2' in obs:
                count_obs_c2 += 1
            if 'c99' in obs:
                count_obs_c99 += 1
            if 'c42' in obs:
                count_obs_c42 += 1
        # c4 and c42 both have locale "abc" and c99 and c2 have unknown locale,
        # so we expect to see c99 and c2 more frequently
        self.assertTrue(count_obs_c99 > count_obs_c4)
        self.assertTrue(count_obs_c99 > count_obs_c42)
>       self.assertTrue(count_obs_c2 > count_obs_c4)
E       AssertionError: False is not true
I've received questions from multiple users on how to run in parallel, so we should add a specific section to the docs on this. Here is some text copied from my replies that we can use in this section:
genome-sampler can be run in parallel to speed it up. This is done in different ways depending on whether you're running the steps individually or through Snakemake.
If you're using Snakemake, you need to edit the Snakefile and set the N_THREADS value to the number of threads you'd like genome-sampler to use.
If you're running the steps individually, you can pass the --p-n-threads option to several of the commands. For example, sample-diversity is the slowest step in the workflow. You can provide the --p-n-threads parameter to run it in parallel:
qiime genome-sampler sample-diversity \
--i-context-seqs filtered-context-seqs.qza \
--p-percent-id 0.9995 \
--o-selection diversity-selection.qza \
--p-n-threads n
When running this command, you should set n to be the number of available processors or cores on a single node of your system. For example, I work on a cluster that has nodes with 28 cores, so when I submit this job I would run:
qiime genome-sampler sample-diversity \
--i-context-seqs filtered-context-seqs.qza \
--p-percent-id 0.9995 \
--o-selection diversity-selection.qza \
--p-n-threads 28
This uses all of the resources on a single node of the cluster for me. In the future we'll be adding support for splitting workflows like this across multiple cluster nodes, but we do not have this support at this time.
This used to be done by finding the closest non-identical sequence by Hamming distance.
I'm running an analysis with yesterday's GISAID downloads and there is at least one sequence id that is present in the sequence data but not the metadata. We should add a note to the documentation on how to handle this and maybe add it to the Snakemake file as well. Here's how I'm handling this now:
qiime feature-table filter-seqs \
--i-data context-seqs.qza \
--m-metadata-file context-metadata.tsv \
--o-filtered-data context-seqs-w-metadata.qza
This could pull code from the combine_selections action so it could be reused elsewhere.
It may be useful to have one that is not connected to near-neighbor sampling.
Sequences with fewer N characters should be considered better candidates for inclusion.
Currently sample-random can only take the number of items you want to randomly sample. Evan and I ran into an instance while working on the benchmark where it would have been useful to be able to instead specify a proportion of the total number of samples that you want.
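A sketch of what a proportion-based option could look like, in plain Python; the sample_random function and its parameter names here are a hypothetical stand-in for the plugin action, not its actual signature:

```python
import math
import random

def sample_random(ids, n=None, fraction=None, seed=None):
    """Randomly sample either a fixed count (n) or a proportion
    (fraction) of the provided ids; exactly one must be given."""
    if (n is None) == (fraction is None):
        raise ValueError('Provide exactly one of n or fraction.')
    if fraction is not None:
        # Round up so a non-zero fraction always yields at least one sample.
        n = math.ceil(fraction * len(ids))
    return random.Random(seed).sample(list(ids), n)

ids = ['s%d' % i for i in range(100)]
subset = sample_random(ids, fraction=0.1, seed=42)
print(len(subset))  # 10
```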
We could port this format to q2-types as a less strict DNAFastaFormat, and DNAFastaFormat could become our strict DNA FASTA format. These could share a common base class, or we could have a function that creates these formats given ValidationSet and FASTADNAValidator.
This will be useful in tests. Update unit tests when this has been added.
Should use pyvcf pending W-L/ProblematicSites_SARS-CoV2#4
Add tests of this transformer when this is addressed - these were inadvertently left out of #56.
[Wed Aug 26 08:16:24 2020]
rule import_context_seqs:
    input: context-seqs.fasta
    output: context-seqs.qza
    jobid: 4
Traceback (most recent call last):
  File "/home/ctbrown/miniconda3/envs/genome-sampler/lib/python3.6/site-packages/qiime2/sdk/util.py", line 90, in parse_format
    format_record = pm.formats[format_str]
KeyError: 'GISAIDDNAFASTAFormat'
Any thoughts?
The current Snakemake workflow takes a long time to run. We'll need a smaller data set (shorter sequences, I think) for efficient testing.
It'd be useful to know which and how many sequences were filtered for being too long, too short, or too ambiguous.
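One way to collect those counts is to tally a reason for each dropped sequence during filtering. This is a sketch under assumed plain-string sequences and hypothetical threshold parameters, not genome-sampler's actual filter implementation:

```python
from collections import Counter

def filter_with_report(seqs, min_len, max_len, max_ambiguous_frac):
    """Filter sequences, tallying why each dropped sequence failed."""
    kept, reasons = {}, Counter()
    for seq_id, seq in seqs.items():
        # Count characters outside the unambiguous DNA alphabet.
        ambiguous = sum(1 for c in seq.upper() if c not in 'ACGT')
        if len(seq) < min_len:
            reasons['too short'] += 1
        elif len(seq) > max_len:
            reasons['too long'] += 1
        elif ambiguous / len(seq) > max_ambiguous_frac:
            reasons['too ambiguous'] += 1
        else:
            kept[seq_id] = seq
    return kept, reasons

seqs = {'a': 'ACGT' * 10, 'b': 'AC', 'c': 'N' * 40, 'd': 'A' * 500}
kept, reasons = filter_with_report(seqs, min_len=10, max_len=100,
                                   max_ambiguous_frac=0.1)
print(sorted(kept))  # ['a']
```

The reasons counter gives exactly the summary the issue asks for: how many sequences were dropped, broken down by cause.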
This was suggested by one of our reviewers at F1000.
Possibly to use packages.qiime2.org.
See https://github.com/executablebooks/quantecon-mini-example/tree/master/.github/workflows for reference.
Phase 1: build https://github.com/caporaso-lab/q2-covid-19/blob/master/docs/methods.md "as-is" (updating as we replace scripts with QIIME 2 actions).
Phase 2: integrate https://github.com/qiime2/sphinx-ext-qiime2 (once ready), so that we can get q2cli-executed outputs (just like on docs.qiime2.org).
Phase 3: replace q2cli commands with usage examples (will require new directives in sphinx-ext-qiime2).
I am trying to run genome-sampler on a 350k-genome dataset. However, I got the following error:
[Mon Feb 15 09:51:35 2021]
Error in rule view_context_seqs:
    jobid: 1
    output: results_15fev21/context-seqs.qzv
    shell:
        qiime feature-table tabulate-seqs --i-data results_15fev21/context-seqs.qza --o-visualization results_15fev21/context-seqs.qzv
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
One interesting thing is that the computer froze for a few seconds and the RAM was full. Do you believe it is a RAM issue? My computer has 64 GB of RAM.
This came up on the QIIME 2 Forum here. This requires some rethinking of how our subsamples are defined, since (as far as I know) this isn't something we could achieve with vsearch (e.g., for sample-diversity), but this has come up for me as well, so I think it's something we'll run into again.
- methods.md to tutorial.md
- Snakemake folder, update instructions for accessing that data (N_THREADS)

It would be useful to collect some basic information when new issues come in. We could look at the templates defined here as a basis:
https://github.com/qiime2/template-repo/tree/master/github_templates/templates/ISSUE_TEMPLATE
This came up in peer review.
sample-diversity should be 0.9990 by default and sample-neighbors should be 0.9999 by default
We'll need to see if vsearch has options for this. If not, we should be able to achieve this by sorting sequences based on their fraction of degenerate characters prior to clustering, so that the sequences with the smallest fraction of degenerate characters are most likely to become cluster centroids.
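The sorting fallback could look like this sketch; the function names and the degenerate-character set are illustrative (not vsearch's API), and vsearch would then receive the sequences in this order so that low-degeneracy sequences are seen first:

```python
# Degenerate IUPAC DNA characters (everything other than A, C, G, T).
DEGENERATE = set('RYSWKMBDHVN')

def degenerate_fraction(seq):
    """Fraction of characters in seq that are degenerate IUPAC codes."""
    return sum(1 for c in seq.upper() if c in DEGENERATE) / len(seq)

def sort_for_clustering(seqs):
    """Order sequences so the least degenerate come first and are
    therefore most likely to be selected as cluster centroids."""
    return sorted(seqs, key=degenerate_fraction)

print(sort_for_clustering(['ACGTNN', 'ACGTAC', 'ACGTAN']))
# ['ACGTAC', 'ACGTAN', 'ACGTNN']
```

Python's sort is stable, so sequences with equal degenerate fractions keep their original relative order.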