Comments (4)
We have been thinking about how to simplify making multiple runs with a single command, for example for different flu segments. And perhaps even combining results from different segments and processing them further. But we haven't figured anything out quite yet. Is this your use case?

In your proposal, how would you access the downloaded datasets afterwards? There is no way of knowing what's being downloaded, to where, and how many. And you'd still need to copy-paste `nextclade run` 8 times. Also, datasets come and go: today it might be 8, tomorrow there's 32.
> Note the `*` to stop accidentally downloading all the datasets.

I did not understand how `*` should stop downloading all datasets. Usually in paths, `*` denotes a so-called wildcard, that is, any path under that particular path. Does nextclade currently download all datasets if you omit the `*`? (If so, it's a bug!) Can you clarify?
> There are multiple ha datasets which are downloaded - this is okay in my opinion but not ideal. In such cases maybe download the flu_h3n2_ha_broad?

How do we pick `flu_h3n2_ha_broad` among the others? And what if it's not `flu/*`, but `sc2/*`? Nextclade features should make sense for all viruses (even the ones that haven't been added yet).
While wildcard downloads are not supported yet, in the meantime there are a couple of other approaches you might consider:
- If you don't use the dataset files outside of `nextclade run`, you can avoid a separate dataset download entirely. The `nextclade run` command accepts a `--dataset-name` argument, which makes it download the dataset in memory (without writing to disk) and run with it immediately. This way you don't need `dataset get` calls at all:

  ```
  nextclade run --dataset-name="nextstrain/flu/h3n2/pa" --output-dir="results/" my.fasta.gz
  ```
- If you insist on having the dataset files on disk (perhaps you use them in your processing after nextclade), you can use a loop to avoid repetition. Here are a few examples in bash (but you can also set up Snakemake or another workflow framework to run multiple things in some sort of a loop, according to a set of parameters):

  ```
  $ for v in pa mp; do nextclade dataset get --name="nextstrain/flu/h3n2/$v" --output-dir="outputdir/$v"; done
  ```
- Instead of hardcoding dataset names, you can use `dataset list` with the `--search` argument to find datasets using sub-string matching. This is probably the closest thing to the wildcard `*` syntax you've requested:

  ```
  $ nextclade dataset list --only-names --search=flu/h3
  nextstrain/flu/h3n2/ha/CY163680
  nextstrain/flu/h3n2/ha/EPI1857216
  nextstrain/flu/h3n2/na/EPI1857215
  nextstrain/flu/h3n2/pb1
  nextstrain/flu/h3n2/np
  nextstrain/flu/h3n2/ns
  nextstrain/flu/h3n2/mp
  nextstrain/flu/h3n2/pa
  nextstrain/flu/h3n2/pb2
  ```
Then you can feed this list into a loop instead of a hardcoded list:

```
for v in $(nextclade dataset list --only-names --search=flu/h3); do nextclade dataset get --name="$v" --output-dir="outputdir/$v"; done
```
Nothing stops you from plugging your entire processing into this loop - this way you always know what is being downloaded and where:

```
for v in $(nextclade dataset list --only-names --search=flu/h3); do
  nextclade dataset get --name="$v" --output-dir="outputdir/$v"
  nextclade run --input-dataset="outputdir/$v" --output-dir="results/$v" "my_$v.fasta.gz"
  my_script.py --virus="$v" --nextclade-tsv="results/$v/nextclade.tsv"
done
```
You can use GNU Parallel to process different datasets concurrently (workflow frameworks also often have a way to do this automatically):

```
function run_one() {
  v=$1
  nextclade dataset get --name="$v" --output-dir="outputdir/$v"
  nextclade run --input-dataset="outputdir/$v" --output-dir="results/$v" "my_$v.fasta.gz"
  my_script.py --virus="$v" --nextclade-tsv="results/$v/nextclade.tsv"
}
export -f run_one
parallel --jobs=4 run_one ::: $(nextclade dataset list --only-names --search=flu/h3)
```
These approaches allow you to avoid code duplication, and you always know what's being downloaded and where - exact names and paths. Loops of course complicate things quite a bit; that's a downside. And, as a sanity check, you likely want to verify that you've got all the datasets you wanted, to avoid omissions.
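That sanity check can be sketched as a small shell function. This is only an illustration: the `outputdir/` layout is taken from the loop examples above, and the function itself (`check_datasets`) is a hypothetical helper, not part of nextclade:

```shell
# check_datasets: report which of the expected dataset directories are
# missing under a download root. A sketch, assuming the "outputdir/$v"
# layout used in the loop examples above.
check_datasets() {
  root=$1; shift
  missing=""
  for v in "$@"; do
    [ -d "$root/$v" ] || missing="$missing $v"
  done
  # unquoted echo collapses the leading space; prints nothing if complete
  echo $missing
}
```

After the download loop you could run, e.g., `missing=$(check_datasets outputdir pa mp ns)` and treat a non-empty result as an error.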
If you need more control, you can filter the dataset list further by piping it into `grep` or into a script:

```
$ nextclade dataset list --only-names --search=flu/ | grep -E '(mp|pa)' | sort
nextstrain/flu/h1n1pdm/mp
nextstrain/flu/h1n1pdm/pa
nextstrain/flu/h3n2/mp
nextstrain/flu/h3n2/pa

$ nextclade dataset list --only-names --search=flu/h3 | my_filter.py
```
and then feed the resulting list into the loop.
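As a minimal sketch of that last step - the two dataset names below stand in for whatever your filter emits, and the `echo` is a placeholder for the real `nextclade dataset get` call:

```shell
# Feed a filtered list of dataset names into the download loop.
# In real use, `filtered` would come from something like:
#   nextclade dataset list --only-names --search=flu/ | grep -E '(mp|pa)'
filtered="nextstrain/flu/h1n1pdm/mp
nextstrain/flu/h3n2/pa"

for v in $filtered; do
  # placeholder; replace with:
  # nextclade dataset get --name="$v" --output-dir="outputdir/$v"
  echo "would download: $v"
done
```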
For even more control, you can also add the `--json` flag to the `dataset list` command. This will print your search results in JSON format, which you can then feed into `jq` or into a script. This way you can also implement your own search/filtering - by dumping the JSON of all datasets, choosing a subset, and then downloading only that subset. Contrived example:

```
$ nextclade dataset list --json --search=flu/ | jq -r '.[] | select(.attributes.segment == "pa" and .attributes["reference name"] == "A/NewYork/392/2004") | .path'
nextstrain/flu/h3n2/pa
```
from nextclade.
@ammaraziz, what might be useful for you is `nextclade sort`. We are currently refining some of the matching parameters, but what you can do, for example, is:

```
nextclade sort all_my_rsv_sequences.fasta --output-dir split_by_dataset --output-results-tsv table_with_matches.tsv
```

This will split your input sequences into files corresponding to datasets (and their prefixes).
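One way to chain this into per-dataset runs is to derive the dataset name back from each output path. This is a sketch under the assumption that `sort` mirrors the dataset path under the output directory (check your actual output layout first); `dataset_path_of` is a hypothetical helper, not a nextclade command:

```shell
# dataset_path_of: recover the dataset name from a sorted-output file path,
# assuming a layout like split_by_dataset/<dataset path>/sequences.fasta.
dataset_path_of() {
  p=${1#split_by_dataset/}   # drop the output-dir prefix
  echo "${p%/*}"             # drop the trailing file name
}

# Then each split file could be run against its matching dataset, e.g.:
# for f in $(find split_by_dataset -name '*.fasta'); do
#   nextclade run --dataset-name="$(dataset_path_of "$f")" --output-dir="results/" "$f"
# done
```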
> We have been thinking about how to simplify making multiple runs with a single command, for example for different flu segments. And perhaps even to combine results for different segments and process them further. But we haven't figured anything out quite yet. Is this your use case?
Yes, it's very close to the use case!
> In your proposal, how would you access the downloaded datasets afterwards? There is no way of knowing what's being downloaded, to where, and how many. And you'd still need to copy-paste `nextclade run` 8 times. Also, datasets come and go: today it might be 8, tomorrow there's 32.
I had not considered this; it puts a big red ! on my request. But as you hinted above, the ability to make multiple runs with a single command falls within this idea.
> I did not understand how `*` should stop downloading all datasets. Usually in paths, `*` denotes a so-called wildcard, that is, any path under that particular path. Does nextclade currently download all datasets if you omit the `*`? (If so, it's a bug!) Can you clarify?
To stop accidentally entering `nextstrain/flu/`, which would download all datasets for all species of flu (but as you said, this doesn't generalise to other species supported by nextclade).
> While wildcard downloads are not supported yet, in the meantime there are a couple of other approaches you might consider:
>
> ....
Thank you for the code and the explanation - this achieves my task (or what instigated this feature request). You've perfectly represented what I was trying to do in the code, that is, not hard-coding the flu datasets.
Going back to this:
> We have been thinking about how to simplify making multiple runs with a single command, for example for different flu segments.
This feature request would be part of this bigger picture of making multiple runs with a single command. Therefore, I'm closing this issue as, in hindsight, it should have been a discussion.
Thanks again!
Hi Richard,
That's actually the use case which triggered this request.
Thanks again :)