nextstrain / nextclade_data

Datasets for https://github.com/nextstrain/nextclade

Home Page: https://clades.nextstrain.org

Shell 7.15% JavaScript 5.45% Python 87.40%

nextclade_data's Introduction

This repository is archived and contains the content used to build the documentation and splash page found in nextstrain.org. This content can now be found here.

License and copyright

Source code to Nextstrain is made available under the terms of the GNU Affero General Public License (AGPL). Nextstrain is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.


nextclade_data's Issues

No datasets found having attributes name: flu_h1n1pdm_na

Current Behavior

The dataset is not downloaded when running:

nextclade dataset get --name flu_h1n1pdm_na --output-dir nextclade_datasets/flu_h1n1pdm_na

Expected behavior

Expected it to get the dataset

How to reproduce

Run the above command

Possible solution

(optional)

Your environment: if running Nextstrain locally

nextclade 2.11.0

Additional context

This dataset is listed when using nextclade dataset list

Difference between sars-cov-2-no-recomb and sars-cov-2

Hello,

I noticed that there is a difference between sars-cov-2-no-recomb and sars-cov-2, for example for genome sequence ON609102.1 (https://www.ncbi.nlm.nih.gov/nuccore/ON609102.1?report=fasta). With dataset sars-cov-2 it classifies as 22C and with dataset sars-cov-2-no-recomb it classifies as 21L. I can provide more sequences if needed.

This may not necessarily be a bug. Maybe it is because sars-cov-2-no-recomb is outdated. Are there plans to update this dataset?

rsv_b tree.json: G_clade always "unassigned"

Hi & congrats on the release of nextclade v3!

I'm using nextclade v3.0.0 and getting the datasets for RSV-A and RSV-B using these commands:

    nextclade dataset get --name rsv_a --output-zip rsv_a.zip
    nextclade dataset get --name rsv_b --output-zip rsv_b.zip

and then nextclade run -D rsv_a.zip ... and ... rsv_b.zip ....

The G_clade output TSV column for rsv_a has expected values like "GA2", "GA.3.0.0" etc., but the G_clade output TSV column for rsv_b is always "unassigned". The tree.json from rsv_a.zip has expected G_clade values annotated on nodes, but the tree.json from rsv_b.zip has "node_attrs": {... "G_clade": {"value": "unassigned"} for all internal nodes as far as I can tell.
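A quick way to confirm this kind of observation is to tally an attribute over every node of the tree. The sketch below assumes each node carries `node_attrs` and `children` as in the snippet quoted above; the tiny `example_tree` is a made-up stand-in, not data from rsv_b.zip.

```python
from collections import Counter

def count_node_attr(root, attr):
    """Tally node_attrs[attr]["value"] over a tree node and all its descendants."""
    counts = Counter()
    stack = [root]  # iterative walk, so deep trees do not hit the recursion limit
    while stack:
        node = stack.pop()
        value = node.get("node_attrs", {}).get(attr, {}).get("value", "<missing>")
        counts[value] += 1
        stack.extend(node.get("children", []))
    return counts

# Hypothetical stand-in for the root node of a dataset tree.
example_tree = {
    "name": "NODE_0",
    "node_attrs": {"G_clade": {"value": "unassigned"}},
    "children": [
        {"name": "A", "node_attrs": {"G_clade": {"value": "GB1"}}},
        {"name": "B", "node_attrs": {"G_clade": {"value": "unassigned"}}},
    ],
}

print(count_node_attr(example_tree, "G_clade"))
```

With a real dataset, the root node would presumably be loaded from the extracted tree.json (in Auspice v2 style files it sits under the top-level "tree" key, which is an assumption worth checking against the file at hand).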

Hopefully there's just a little typo fix that can bring back the GB's! 🙂 Thanks.

Add BA.2 and BA.3 sequences to sample sequences (if any are open)

Error downloading sars-cov-2 dataset after release 2024-02-16--04-00-32Z

Hi Nextclade team! I'm encountering some issues after the release of tag 2024-02-16--04-00-32Z. I see errors with both the v3.0.0 and v3.2.0 nextclade CLI versions.

The explicit tag 2024-02-16--04-00-32Z downloads successfully, as does omitting the tag. But the output version is "unreleased":

nextclade dataset get --name sars-cov-2 --tag "2024-02-16--04-00-32Z" --output-dir "sars-cov-2_2024-02-16--04-00-32Z"
nextclade dataset get --name sars-cov-2 --output-dir "sars-cov-2_no-tag"

"version": {
	"tag": "unreleased"
},

Any other tag raises an error.

nextclade dataset get --name sars-cov-2 --tag "2024-01-16--20-31-02Z" --output-dir "sars-cov-2_2024-01-16--20-31-02Z"
nextclade dataset get --name sars-cov-2 --tag "latest" --output-dir "sars-cov-2_latest"

Error:
   0: Dataset not found: 'sars-cov-2'.

      Did you mean:
      - nextstrain/sars-cov-2/XBB
      - nextstrain/sars-cov-2/BA.2
      - nextstrain/sars-cov-2/BA.2.86
      - nextstrain/sars-cov-2/wuhan-hu-1/orfs
      - nextstrain/sars-cov-2/wuhan-hu-1/proteins
      - community/isuvdl/mazeller/prrsv2/orf5/yimim2023
      - nextstrain/mpox/all-clades
      - nextstrain/rsv/a/EPI_ISL_412866
      - nextstrain/rsv/b/EPI_ISL_1653999
      ?

      Type `nextclade dataset list` to show available datasets.

Location:
   packages/nextclade-cli/src/cli/nextclade_dataset_get.rs:79

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

Next update for sars-cov-2 dataset?

Hi,

as the last update is 25 days old and pango-designation updated to v1.17, maybe an updated dataset for nextclade would be nice to have?

CC @corneliusroemer

Best wishes and thanks
David

20C is not stable, jumps around

Current Behavior

As identified in #48, 20C can jump around between builds due to tree builder instability with reversions. Maybe I can add some constraints to constraint.nwk to prevent this in the future.

Leading space in attributes column of latest SARS-CoV-2 gene map prevents parsing by gffutils

Current Behavior

The attributes column of the latest SARS-CoV-2 gene map has an unexpected leading space before the gene name definitions. This space causes the default gffutils parsing of the key/value pairs in the column to fail. For example, the first line of the body gets parsed as:

.	.	gene	26245	26472	.	+	.	 gene_name%3DE

Expected behavior

The first line of the body should be:

.	.	gene	26245	26472	.	+	.	gene_name=E

How to reproduce

Steps to reproduce the current behavior:

$ curl -O https://raw.githubusercontent.com/nextstrain/nextclade_data/master/data/datasets/sars-cov-2/references/MN908947/versions/2022-07-12T12%3A00%3A00Z/files/genemap.gff
$ cat -vet genemap.gff
# Gene map (genome annotation) of SARS-CoV-2 in GFF format.$
# For gene map purpses we only need some of the columns. We substitute unused values with "." as per GFF spec.$
# See GFF format reference at https://www.ensembl.org/info/website/upload/gff.html$
# seqname^Isource^Ifeature^Istart^Iend^Iscore^Istrand^Iframe^Iattribute$
.^I.^Igene^I26245^I26472^I.^I+^I.^I gene_name=E$
.^I.^Igene^I26523^I27191^I.^I+^I.^I gene_name=M$
.^I.^Igene^I28274^I29533^I.^I+^I.^I gene_name=N$
.^I.^Igene^I266^I13468^I.^I+^I.^I gene_name=ORF1a$
.^I.^Igene^I13468^I21555^I.^I+^I.^I gene_name=ORF1b$
.^I.^Igene^I25393^I26220^I.^I+^I.^I gene_name=ORF3a$
.^I.^Igene^I27202^I27387^I.^I+^I.^I gene_name=ORF6$
.^I.^Igene^I27394^I27759^I.^I+^I.^I gene_name=ORF7a$
.^I.^Igene^I27756^I27887^I.^I+^I.^I gene_name=ORF7b$
.^I.^Igene^I27894^I28259^I.^I+^I.^I gene_name=ORF8$
.^I.^Igene^I28284^I28577^I.^I+^I.^I gene_name=ORF9b$
.^I.^Igene^I21563^I25384^I.^I+^I.^I gene_name=S$

Note the space between the final tab (^I) and gene_name on each line.

In contrast, here is the corresponding output for the flu H3N2 HA gene map:

$ curl -O https://raw.githubusercontent.com/nextstrain/nextclade_data/master/data/datasets/flu_h3n2_ha/references/CY163680/versions/2022-06-08T12%3A00%3A00Z/files/genemap.gff
$ cat -vet genemap.gff
##gff-version 3$
##sequence-region CY163680.1 1 1737$
CY163680.1^Ifeature^Igene^I18^I65^I.^I+^I.^Igene_name=SigPep$
CY163680.1^Ifeature^Igene^I66^I1052^I.^I+^I.^Igene_name=HA1$
CY163680.1^Ifeature^Igene^I1053^I1715^I.^I+^I.^Igene_name=HA2$

Possible solution

Create a new version of the SARS-CoV-2 data without these leading spaces.

(While we're at it, we could also specify the sequence id in the first column and GFF version in the first line, to match other GFF files in this repo, but that could be a separate PR.)
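Until a fixed dataset version exists, the leading space can also be removed on the consumer side before handing the file to a parser. This is a minimal sketch, assuming the only problem is stray whitespace around column 9; `strip_attribute_whitespace` is a hypothetical helper name and not part of gffutils or this repo.

```python
def strip_attribute_whitespace(gff_text):
    """Strip leading/trailing spaces from column 9 of each GFF feature line."""
    fixed = []
    for line in gff_text.splitlines():
        # Comment and blank lines are passed through unchanged.
        if line.startswith("#") or not line.strip():
            fixed.append(line)
            continue
        cols = line.split("\t")
        if len(cols) == 9:
            cols[8] = cols[8].strip()
        fixed.append("\t".join(cols))
    return "\n".join(fixed) + "\n"

broken = "# comment\n.\t.\tgene\t26245\t26472\t.\t+\t.\t gene_name=E\n"
print(strip_attribute_whitespace(broken), end="")
```

The cleaned text can then be written to a new file and parsed as usual.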

Dataset question in a particular case

I apologize for not actually fitting into the bug report template.

Topic

Using the SARS-CoV-2 dataset, I see that an old 20A sequence became 20C (since dataset version 2022-11-15T12:00:00Z), and the reason is not clear to me.

Description

Assuming the key resource here is the "tree.json" file, I compared (non-exhaustively) the files from the current release and the previous one, and tried to work out retrospectively how my sequence was classified by nextclade (based on the mutations observed).

It is not clear to me as I'm not used to this topic :
From the dataset 2022-10-27T12:00:00Z(previous) the node ("name":"NODE_0000061" ... "partiallyAliased":"B.1.160" ... "clade_legacy":"20A") seems to correspond to the node ("name":"NODE_0000097" ... "partiallyAliased":"B.1.160" ... "clade_legacy":"20C") in the new released dataset (2022-11-15T12:00:00Z).

I expect that my sequence is estimated close to these two nodes (using one or other dataset).

I'm aware that the dataset evolves over time, but I don't understand why B.1.160 is now related to 20C instead of 20A as previously (and for a long time).

In the hope that my investigation is relevant and not misleading...

Best regards,

PS : Github can't process the mentioned sequence but I can provide it if needed.

Version missing from pathogen.json for datasets released on or after Jan 29

I was working with the RSV datasets, and noticed that the "version" key is missing entirely from pathogen.json on new datasets.

I originally saw this with the 1/29 versions of both RSV datasets, but while looking into this I found it's also present on nextstrain/flu/yam/ha/JN993010, but not on anything older.

I don't see anything in the changelog (either here or for the CLI) about the "version" key being removed entirely from pathogen.json. Without it, you can't identify the version of a downloaded dataset (you can only specify a particular version to download), even though the documentation still says you can.

I did see the commit about removing the version from the files in data/ to be auto-generated in data_output/ while I was trying to figure this out, but the version seems to be missing from the actual downloaded datasets from nextclade dataset get (as well as from data_output/ on the repo).
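For scripts that depend on the tag, a defensive read avoids a crash when the key is absent. This sketch assumes the `"version": {"tag": ...}` layout seen in other downloaded datasets; `dataset_tag` and the `attributes` key in the second sample are illustrative only.

```python
import json

def dataset_tag(pathogen_json_text):
    """Return the version tag from pathogen.json text, or None if it is absent."""
    data = json.loads(pathogen_json_text)
    version = data.get("version")
    if not isinstance(version, dict):
        return None
    return version.get("tag")

with_version = '{"version": {"tag": "unreleased"}}'
without_version = '{"attributes": {"name": "example"}}'  # made-up minimal content

print(dataset_tag(with_version))
print(dataset_tag(without_version))
```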

Dated nodes in SARS-COV-2 tree

Hi 👋 thanks for an amazing resource!

I would find it very useful if the node times used in the main UI could be included somewhere in the archival datasets (e.g. here)

I'm using these archival datasets over the download options provided in the main UI for two reasons: (1) it's nice to have a stable URL to use in my scripts rather than manually downloading the data from the web browser; and (2) the UI doesn't seem to provide a JSON option any more which I find far easier to use than the newick based options (although I recognise I'm in a minority there).

Would it be possible to add the dates to the archived tree please, or could you advise on any other alternatives?

Rename header for MPXV dataset

Context

I like using nextalign locally, but when I use nextalign with the MPXV dataset, and I keep the reference, the reference is listed as 'ref_in_coord' instead of a useful name.

Description

Please change the header for the reference fasta for MPXV dataset (https://github.com/nextstrain/nextclade_data/blob/master/data/datasets/MPXV/references/ancestral/versions/2022-08-09T12:00:00Z/files/reference.fasta)

Examples

The header of https://github.com/nextstrain/nextclade_data/blob/master/data/datasets/MPXV/references/ancestral/versions/2022-08-09T12:00:00Z/files/reference.fasta is '>ref_in_coord Reference sequence in coord.fasta coordinates'

Possible solution

Instead of '>ref_in_coord Reference sequence in coord.fasta coordinates', perhaps... '>AncestralMPX_ref' or an accession number.

New flu_h3n2_ha nextclade data module does not have EPI accessions

The newest released flu_h3n2_ha nextclade dataset does not have EPI accessions listed within the tree.json file. This information was really helpful for downstream purposes. Could this information be added back? It is currently present in every other flu build.

Next update for sars-cov-2 dataset?

Hi,

as the last update is 33 days old and since then pango-designation released v1.19, maybe an updated dataset for nextclade would be nice to have? 🙃

CC @corneliusroemer

Best wishes and thanks
David

Splitting up clade/WHO_name for SARS-CoV-2

Context

Currently we use a long composite clade name for SARS-CoV-2 datasets that combines Nextstrain clade, WHO name, and sometimes legacy extra names like 20H (Beta, V2).

It would be useful to break these up more cleanly - especially the historic names are only kept for backwards compatibility, not because they are meaningful anymore.

In the future we would like to cleanly use the following:
Nextstrain_clade -> 22D
WHO name -> Omicron

Pango lineage is already annotated as a Nextclade_pango column, so no change is necessary.
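For data users who want to derive the new fields from existing composite names in the meantime, a rough sketch follows. It assumes the composite always has the form `CLADE (WHO, extra...)` with the WHO name first inside the parentheses, which matches `20H (Beta, V2)` but has not been checked against every clade name; `split_clade` is a hypothetical helper.

```python
import re

# Clade token, optionally followed by a parenthesised, comma-separated list.
COMPOSITE = re.compile(r"^(?P<clade>\S+)(?:\s+\((?P<extras>[^)]*)\))?$")

def split_clade(composite):
    """Split a composite clade name into the proposed separate columns."""
    text = composite.strip()
    match = COMPOSITE.match(text)
    if match is None:
        raise ValueError(f"unrecognised clade name: {composite!r}")
    extras = [p.strip() for p in (match.group("extras") or "").split(",") if p.strip()]
    return {
        "Nextstrain_clade": match.group("clade"),
        "WHO_name": extras[0] if extras else "",  # assumption: WHO name comes first
        "Nextstrain_legacy": text,
    }

print(split_clade("20H (Beta, V2)"))
print(split_clade("19A"))
```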

For a transition period, we would like to keep the old column while already having the new columns.

The migration should work as follows:

Step 1

Add extra columns Nextstrain_clade, WHO_name
Add a column Nextstrain_legacy which will maintain the old naming scheme for backwards compatibility
clade_membership will stay as is in step 1.

These attributes should be output into the tsv/csv, but the web version should not display these columns. This may require a slight code change/extension to Nextclade by @ivan-aksamentov:

Currently, extra columns are specified in the tree.json with a dict: {name,displayName,description}, this should be extended by a showWeb or similarly named boolean attribute. If set to false, it is not shown - it defaults to true to maintain backwards compatibility.
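The intended semantics of the proposed flag can be illustrated in a few lines. This is not Nextclade code, only a sketch of the default-to-true behaviour; the attribute name `showWeb` is the one suggested above and may end up being named differently, and the `displayName` and `description` values are placeholders.

```python
def web_columns(extra_columns):
    """Return the names of the extra columns that the web UI would display."""
    # A missing showWeb is treated as True to stay backwards compatible.
    return [col["name"] for col in extra_columns if col.get("showWeb", True)]

columns = [
    {"name": "Nextstrain_clade", "displayName": "Nextstrain clade",
     "description": "placeholder", "showWeb": False},
    {"name": "WHO_name", "displayName": "WHO name",
     "description": "placeholder", "showWeb": False},
    # No showWeb key: an existing column keeps being shown.
    {"name": "Nextclade_pango", "displayName": "Pango lineage",
     "description": "placeholder"},
]

print(web_columns(columns))
```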

Data users can start using the new names Nextstrain_clade and WHO_name from now on, and those who want to keep using the historic names should start switching to Nextstrain_legacy so their software keeps working once we implement step 2.

Step 2 (due 2023-02-01)

clade_membership will stop using 20H (Beta, V2) and start using Nextstrain_clade. Web view will switch on WHO_name and will keep Nextstrain_clade and Nextstrain_legacy switched off.

Step 3 (if ever in far future)

Nextstrain_legacy is deprecated and removed.

Discussion

As soon as Step 1 is implemented, we can start using the new metadata fields in ncov-ingest and in builds as colouring etc.

As soon as step 1 is implemented, users can start migrating away from clade in the metadata and start using Nextstrain_legacy for backwards compatibility if they want to use the old names.

For data users, it is step 1 that's critical.

Step 2 is to complete the migration on the frontend. It should happen maybe after 1-2 months of step 1 being implemented, to give enough warning to make the few code changes necessary.

Implementation

For step 1 to go ahead we need the code change in Nextclade as described above (@ivan-aksamentov) and dataset changes (@corneliusroemer).

This should be possible within 2-4 weeks.

Once step 1 is complete, we can communicate the changes and provide a migration guide.

Step 2 just involves @corneliusroemer making another dataset change.

Incorrect 'workflow' link to RSV dataset on nextclade.org

Hi guys,

The RSV workflow link on nextclade.org does not exist. It should be:

https://github.com/nextstrain/nextclade_data/tree/master/data/nextstrain/rsv/a/EPI_ISL_412866

Also in the above link the readme link is similarly incorrect!

This is for both RSVA and RSVB.

ENH: Add unaliased pango column to 21L focus build

Context

Many aliases are popping up - confusing even me. Would be great to have also an unaliased column. This would allow for depth first sorting too - aliasing screws that up at the moment.

I'd keep BA unexpanded, so BA.2.75 -> BA.2.75, but BL.1 -> BA.2.75.1.1

Description

It should be easy to add with a small Python script in the workflow.

Can use pango_aliasor package.
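To make the intended behaviour concrete, here is a self-contained sketch that does not call pango_aliasor. It uses a two-entry alias table as an illustrative subset (a real workflow would load the full table), and the helper names are hypothetical. Recombinant (X...) lineages are not handled.

```python
# Illustrative subset of the Pango alias table, not the full list.
ALIASES = {
    "BA": "B.1.1.529",
    "BL": "B.1.1.529.2.75.1",
}
# Aliases that should stay compressed in the output column.
KEEP_COMPRESSED = {"BA": "B.1.1.529"}

def unalias(lineage, aliases=ALIASES):
    """Expand the leading alias of a lineage name, if it is a known alias."""
    prefix, _, rest = lineage.partition(".")
    if prefix not in aliases:
        return lineage
    return aliases[prefix] + ("." + rest if rest else "")

def partially_unalias(lineage):
    """Fully expand, then re-compress only the prefixes in KEEP_COMPRESSED."""
    full = unalias(lineage)
    for alias, expansion in KEEP_COMPRESSED.items():
        if full == expansion or full.startswith(expansion + "."):
            return alias + full[len(expansion):]
    return full

for name in ["BA.2.75", "BL.1", "B.1.160"]:
    print(name, "->", partially_unalias(name))
```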

Would make nextstrain/nextclade#984 less pressing

Latest flu_h1n1_na module is missing NA subclade "B"

Hi There,

The nextclade tree and training set for the H1N1pdm09 NA dataset appear to be missing subclade B, which circulated from ~2018-2020. The current tree available in the online version goes from A.1.1 directly into B.1.

Overall, this is a relatively minor issue but I suspect the exclusion of B in the training set is a bug. Thanks.

Some influenza datasets are missing glyco data (virus_properties.json)

Context

The new influenza datasets include virus_properties.json which defines potential glycosylation sites.

Description

The virus_properties.json is only in the "new" datasets (those datasets with a more recent reference). The virus_properties.json should be very similar or identical between references.

For example, the h1n1_ha (CY121680) dataset has no glycosylation information:
https://github.com/nextstrain/nextclade_data/blob/5d7f1e06fd1c95acd52753f07a9ce4b2872e1e56/data/datasets/flu_h1n1pdm_ha/references/CY121680/versions/2023-01-27T12:00:00Z/files/virus_properties.json

But h1n1_ha (MW626062) has:
https://github.com/nextstrain/nextclade_data/blob/5d7f1e06fd1c95acd52753f07a9ce4b2872e1e56/data/datasets/flu_h1n1pdm_ha/references/MW626062/versions/2023-01-27T12:00:00Z/files/virus_properties.json

Would it be possible to include glycosylation for the 'older' references?

In addition, the vic_ha KX058884 (B/Brisbane/60/2008) is missing data in the virus_properties.json (the vic_na dataset has info on glyco sites)

`./scripts/rebuild` fails when run with new dataset: "FileNotFoundError: [Errno 2] .../nextclade_data/data_temp/nextstrain__ebola__zaire__unreleased.zip"

Trying to build data_output locally using ./scripts/rebuild.

I pip installed repro_zipfile

But there seems to be an issue somewhere when the output zipfile doesn't exist yet:

❯ ./scripts/rebuild --input-dir ./data --output-dir ./data_output --allow-dirty && serve -l 3000 --cors ./data_output
INFO: :Adding '.dataset_order' entries to 'collection.json' for the following datasets: 'nextstrain/ebola/zaire'. Please reorder them manually as needed. This order is used when displaying datasets of the collection in the user interface.
Traceback (most recent call last):
  File "/Users/corneliusromer/code/nextclade_data/./scripts/rebuild", line 513, in <module>
    main()
  File "/Users/corneliusromer/code/nextclade_data/./scripts/rebuild", line 248, in main
    collection, release_infos_for_dataset, refs = process_one_collection(
                                                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/code/nextclade_data/./scripts/rebuild", line 339, in process_one_collection
    release_infos = prepare_dataset_release_infos(args, datasets, datasets_from_index_json, collection_dir, tag,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/code/nextclade_data/./scripts/rebuild", line 404, in prepare_dataset_release_infos
    create_dataset_package(args, dataset_new, path, tag, dataset_dir)
  File "/Users/corneliusromer/code/nextclade_data/./scripts/rebuild", line 496, in create_dataset_package
    make_zip(out_dir, zip_filename)
  File "/Users/corneliusromer/code/nextclade_data/scripts/lib/fs.py", line 80, in make_zip
    with ReproducibleZipFile(output_zip, "w") as z:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/zipfile.py", line 1286, in __init__
    self.fp = io.open(file, filemode)
              ^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/Users/corneliusromer/code/nextclade_data/data_temp/nextstrain__ebola__zaire__unreleased.zip'
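One plausible cause, not confirmed here, is that the parent directory (data_temp) does not exist when the zip is opened: opening a file in "w" mode creates the file but not its directory. The sketch below shows that guard using the standard library `zipfile.ZipFile` rather than `ReproducibleZipFile`; the body of `make_zip` is a guess at the shape of the real helper, and the paths are made up.

```python
import tempfile
import zipfile
from pathlib import Path

def make_zip(input_dir, output_zip):
    """Zip input_dir into output_zip, creating the parent directory first."""
    input_dir, output_zip = Path(input_dir), Path(output_zip)
    # Without this line, a missing parent directory raises FileNotFoundError.
    output_zip.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(output_zip, "w") as z:
        for path in sorted(input_dir.rglob("*")):
            if path.is_file():
                z.write(path, path.relative_to(input_dir).as_posix())
    return output_zip

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "dataset"
    src.mkdir()
    (src / "pathogen.json").write_text("{}")
    # data_temp does not exist yet; make_zip creates it.
    out = make_zip(src, Path(tmp) / "data_temp" / "example__unreleased.zip")
    with zipfile.ZipFile(out) as z:
        names = z.namelist()

print(names)
```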

SARS-CoV-2 Variant designated a while ago is missing ( BA.5.12 )

I believe the SARS-CoV-2 variant BA.5.12 has been missed in the last few updates. It looks like every known variant designated before the latest dataset version is included, except for BA.5.12.

Additional context

BA.5.12 was designated 2023-02-23 10:47:30+0000

[Figure: Omicron phylogenetic tree with Nextclade variant availability indicators. Red is unavailable; black is available.]

It makes sense that XBB.2.9 is not available via NextClade as it was designated on the same day as the latest dataset version, but BA.5.12's unavailability seems like a bug.

Note: If BA.5.12 is available via NextClade (and there is a bug in my code), please share with me how I can identify the correct lineages. I use the latest dataset version's tree.json file

Request for NA reference datasets for flu subtypes

Context

In addition to clade typing using the HA segment, public health labs are searching for mechanisms to screen for potential variants in the NA segment associated with tamiflu resistance. With an NA reference dataset, these types of screening mechanisms could be put in play.

Description

Availability of flu_{subtype}_na reference datasets that can be downloaded using nextclade dataset get.

Examples

To analyze an H3N2 sample against an NA reference dataset, for example, one could run
nextclade dataset get --name="flu_h3n2_na" --reference="CY163680" --tag="2022-11-23T12:00:00Z" -o h3h2_dataset and proceed with nextclade run --input-dataset=h3h2_dataset. Then users could parse the resulting aa_subs for mutations associated with tamiflu resistance.
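The parsing step could look roughly like the sketch below. It assumes the results TSV has `seqName` and `aaSubstitutions` columns with comma-separated `GENE:mutation` entries, and that the NA gene is named `NA` in such a dataset. The watchlist entries are placeholders for illustration, not a vetted resistance list.

```python
import csv
import io

# Placeholder watchlist; a real screen would use a curated, subtype-specific list.
WATCHLIST = {"NA:E119V", "NA:R292K"}

def flag_sequences(tsv_text, watchlist=WATCHLIST):
    """Map sequence name -> sorted watchlist mutations found in aaSubstitutions."""
    hits = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        subs = {s for s in (row.get("aaSubstitutions") or "").split(",") if s}
        found = sorted(subs & watchlist)
        if found:
            hits[row["seqName"]] = found
    return hits

# Made-up two-row example in place of a real results file.
example_tsv = (
    "seqName\taaSubstitutions\n"
    "sample1\tNA:E119V,NA:N200S\n"
    "sample2\tNA:V241I\n"
)

print(flag_sequences(example_tsv))
```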

ENH: Add known stops/frameshifts to monkeypox dataset

Example: lineage B.1.8 has a stop codon and is still viable

Nextclade CLI - thousands of unclassified sequences from GISAID - mostly early April 2023 submissions

Current Behavior

After I reprocess my collection of GISAID data I routinely check for samples that don't get a Nextclade classification. There are usually just a handful on scattered dates, perhaps a few hundred per month. But with the latest Nextclade release there are suddenly several thousand. They are mostly for submission dates in early April - 1st to 10th - over 6,000 in that period (out of the total of ~120K samples).

They are from many different countries (30+). China and South Korea have the most, so they look somewhat over-represented.

Here are some sample Accession IDs:
EPI_ISL_17371203, EPI_ISL_17373731, EPI_ISL_17383740

Expected behavior

Most of these samples would be assigned a lineage by the Nextclade CLI.

How to reproduce

Steps to reproduce the current behavior:

Run the nextclade cli tool with the latest release to re-classify sequences from GISAID
Examine records that don't get a lineage classification

ENH: Add BA.2/4/5 recombinants to BA.2 dataset

Adjust gene maps in flu datasets to v2

Flu datasets have their gene map attribute keys and values separated with spaces, but in the GFF3 spec the separator is =. Unless this is fixed, Nextclade v2 will silently fail to find any genes. Nextclade v1 supports both, so this change is not breaking.

SC2 and Monkeypox datasets seem to be using = already, but double-checking all the datasets won't hurt.
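The conversion itself is mechanical. A minimal sketch, assuming attributes are `;`-separated and unquoted (quoted GTF-style values are not handled); `normalize_attributes` is a hypothetical helper, not part of the repo scripts.

```python
def normalize_attributes(attributes):
    """Rewrite 'key value' pairs as 'key=value'; pairs already using '=' pass through."""
    pairs = []
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        if "=" not in item:
            # Split on the first space only, so values may contain spaces.
            key, _, value = item.partition(" ")
            item = f"{key}={value.strip()}"
        pairs.append(item)
    return ";".join(pairs)

print(normalize_attributes("gene_name HA1"))
print(normalize_attributes("gene_name=E"))
```

Applied to column 9 of every feature line, this would leave files that already use = unchanged.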

ENH: Add key RBD mutations as labelled mutations

Next update for sars-cov-2 dataset?

Hi,

as the last update is 34 days old and since then pango-designation released v1.23 two weeks ago (JN.1 👀), an updated dataset for nextclade would be nice to have? 🙃

CC @corneliusroemer

Best wishes and thanks
David

ENH: Add labeled mutations to monkeypox datasets

Automate creation of GitHub releases

In order to avoid people (me) forgetting to do so manually.

Context: https://discussion.nextstrain.org/t/no-release-for-sars-cov-2-may-update-of-nextclade-data/1396

ENH: Add parent lineages as partiallyAliased column for recombinants

Would be great to add recombinant parent info to the partiallyAliased column in SC2 datasets

ENH: G_clade GB1 for RSV-B?

Thank you for adding datasets for RSV-A and RSV-B! I am updating my scripts to use nextclade instead of scraping clade assignments from nextstrain.org. There is a GB1 branch on nextstrain.org: https://nextstrain.org/rsv/b/genome?label=clade:GB1 However, nextclade did not assign G_clade GB1 to any of the expected sequences (e.g. KU316116, MG642053) and it looks like the rsv_b tree.json has no samples with G_clade = GB1.

Any chance of adding GB1 samples and labels to rsv_b in a future release?

Next update for sars-cov-2 (and nextstrain/sars-cov-2/wuhan-hu-1/proteins) dataset?

Hi,

as the last update is ~54 days old (2024-02-16) and since then pango-designation released v1.26 (with all these "FLirT" and other sublineages of JN.1 👀), updated datasets for sars-cov-2* in nextclade would be nice to have? 🙃

CC @corneliusroemer

Best wishes and thanks
David
