nextstrain / dengue

Nextstrain build for dengue virus

Home Page: https://nextstrain.org/dengue

Python 82.70% Shell 17.30%

nextstrain pathogen

dengue's Introduction

Nextstrain repository for dengue virus

This repository contains three workflows for the analysis of dengue virus data:

ingest/ - Download data from GenBank, clean and curate it and upload it to S3
phylogenetic/ - Make phylogenetic trees for nextstrain.org
nextclade/ - Make Nextclade datasets for nextstrain/nextclade_data

Each folder contains a README.md with more information.

Documentation

Contributor documentation

dengue's Issues

Bug: Update dropped strains file to list accession instead of strain

Current Behavior

Currently, strains listed in phylogenetic/config/dropped_strains.txt are not being dropped since 8ab810f

Expected behavior

Strains listed in dropped_strains.txt should not appear in the final phylogenetic tree.

How to reproduce

Possible solution

Perhaps cherry pick a commit like:

67016d1
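As a rough illustration of the conversion this issue asks for, the sketch below maps entries in a dropped-strains list to accessions using a metadata table. The column names `strain` and `accession`, the example records, and the function name are assumptions made for illustration; they are not taken from the repository.

```python
import csv
import io


def strains_to_accessions(dropped_lines, metadata_tsv,
                          strain_col="strain", accession_col="accession"):
    """Translate strain names in a dropped-strains list into accessions.

    Column names are assumed; adjust them to the real metadata header.
    Returns (converted accessions, names with no match in the metadata).
    """
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    lookup = {row[strain_col]: row[accession_col] for row in reader}

    converted, unmatched = [], []
    for line in dropped_lines:
        # Drop trailing comments and surrounding whitespace.
        name = line.split("#", 1)[0].strip()
        if not name:
            continue
        if name in lookup:
            converted.append(lookup[name])
        else:
            unmatched.append(name)
    return converted, unmatched


# Invented example data, not real records.
metadata = "accession\tstrain\nAB000001\tDENV1/example/1\nAB000002\tDENV2/example/2\n"
dropped = ["# known problematic records", "DENV1/example/1  # contaminated", "not_in_metadata"]

converted, unmatched = strains_to_accessions(dropped, metadata)
print(converted)
print(unmatched)
```

Reporting the unmatched names separately makes it easier to spot entries that would otherwise be ignored without any warning.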

Harmonize with pathogen repo guide

Context

Part of updating the pathogen repos to match a golden path:

https://github.com/nextstrain/pathogen-repo-guide

To Dos

Rename ingest/workflow/snakemake_rules to ingest/rules
Rename ingest/rules/*.smk to match pathogen-repo-guide/rules/*
Move ingest/source-data/* files to ingest/config
Modernize ncbi-field-map in config
Add a CHANGELOG.md file
Rename "config" to "defaults"
Move nextstrain automation rules and configs to ingest/build-configs

Use a "scripts" directory instead of a "bin" directory

Context

Result of discussion and vote within the team in https://github.com/nextstrain/private/issues/117

Rename `config` to `defaults`

Context

At some point, get through the backlog of changes to align with the latest pathogen-repo-guide.

Rename any */config to */defaults

Modernize `ncbi-field-map` in config

Establish some deduplication guidelines within the phylogenetic workflow

Context

Flagged by #28 (comment) as well as prior historical discussions.

Design and implement some deduplication paths in the phylogenetic workflow.

Description

Examples

Possible solution

Preferably, leverage the existing tools in the nextstrain dockerfile, with seqkit being a probable choice.
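To make the idea of sequence-level deduplication concrete, here is a minimal pure-Python sketch that keeps the first record for each distinct sequence. It is only an illustration of the technique, not the seqkit-based approach the issue prefers, and the function names are hypothetical.

```python
import hashlib


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def deduplicate(records):
    """Keep the first record for each distinct (case-insensitive) sequence."""
    seen = set()
    kept = []
    for header, seq in records:
        digest = hashlib.sha256(seq.upper().encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append((header, seq))
    return kept


fasta = ">a\nACGT\n>b\nacgt\n>c\nACGA\n"
kept = deduplicate(read_fasta(fasta))
print([header for header, _ in kept])
```

Which duplicate to keep (first seen, most complete metadata, earliest date) is a policy choice the guidelines would still need to settle; this sketch simply keeps the first.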

Fine tune the "Dengue virus DENVx genotypes" dataset

Dengue virus DENVx genotypes

This is an update on the classification of dengue virus (DENV) genotypes. As suggested by @rneher in a Slack channel, adding an outgroup seems to have mostly resolved the cross-serotype false positives in the DENV1, DENV3, and DENV4 genotype classifications.

DENV1, DENV3, and DENV4 Genotype Classifications

The following images show the improved genotype classifications for DENV1, DENV3, and DENV4 after incorporating the outgroup:

The green color indicates within-serotype classifications, which is a positive outcome.

DENV2 Genotype Classification

However, the DENV2 genotype classification still requires further improvement:

The reconstructed "all" root was too similar to a DENV2 genotype, so it has been swapped with DENV4.

Next Steps for DENV2

The current plan is to add the reconstructed roots of DENV1, DENV3, and DENV4 as outgroups to DENV2 to further improve genotype classification accuracy across all serotypes. Other suggestions are welcome.

Add workflow for producing the Nextclade dengue dataset

Context

Add a workflow for producing the Nextclade dataset for dengue serotypes and subtypes in a nextclade folder, following the pathogen-repo-guide. This will ease dataset creation, testing, and debugging.

Description

TBD

Examples

Possible solution

TBD

Rename "subtype" to "genotype"

Context

In response to comment:

I think we should generally be consistent with the nomenclature. I see for metadata you have nextclade_subtype with entries like DENV1/II. This is canonically "DENV genotype". I suggest aiming for two columns in the metadata. One for serotype with DENV1, DENV2, etc... and one for denv_genotype with DENV1/II, etc.... This is similar to how things work for SARS-CoV-2 with a clade column as well as a lineage column. Also mpox uses clade and lineage as well as separate columns.

Description

Currently we have

ncbi_serotype because we are relying on "NCBI" annotation as the source of serotype assignment. No change to the column name here
nextclade_subtype because we are using "nextclade" for genotype assignment. Rename this to "nextclade_genotype"

Of course feel free to comment on this GitHub Issue with other suggestions.
Optionally, we could reorder the metadata columns such that ncbi_serotype and nextclade_genotype are next to each other to make this distinction more obvious to people manually looking at the metadata file.
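A minimal sketch of the rename and the optional reordering, assuming the metadata is a tab-separated file with a header row. The function name and the example row are hypothetical.

```python
import csv
import io


def rename_and_reorder(tsv_text, old="nextclade_subtype",
                       new="nextclade_genotype", anchor="ncbi_serotype"):
    """Rename one column and move it directly after the anchor column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = [new if name == old else name for name in reader.fieldnames]
    if anchor in columns and new in columns:
        columns.remove(new)
        columns.insert(columns.index(anchor) + 1, new)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if old in row:
            row[new] = row.pop(old)
        writer.writerow(row)
    return out.getvalue()


# Invented example row.
example = "accession\tncbi_serotype\tcountry\tnextclade_subtype\nX1\tdenv1\tBrazil\tDENV1/II\n"
result = rename_and_reorder(example)
print(result)
```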

Update the links of FASTA and metadata in the README

@j23414 Could you please update the links to the FASTA files and metadata in the README with the links you provided in the issue comment? I usually go to the phylogenetic/README.md file to download the latest metadata and sequences.

Set root reference in phylogenetic builds

Context

To potentially be compatible with nextstrain/nextclade#1455

Possible solution

cherry-pick e890181 into its own PR, applied to the phylogenetic workflow

The `phylogenetic` github action cannot build from staged `ingest` data during dev-branch testing

Current Behavior

The phylogenetic GitHub action (see this run) ignored the provided sequence and metadata URL configurations:

env:
    TRIAL_NAME: serogeno20240518
    SEQUENCES_URL: s3://nextstrain-data/files/workflows/dengue/trials/serogeno20240518/sequences_all.fasta.zst
    METADATA_URL: s3://nextstrain-data/files/workflows/dengue/trials/serogeno20240518/metadata_all.tsv.zst

These input fields are specified in the .github/workflows file:

dengue/.github/workflows/phylogenetic.yaml

Lines 33 to 42 in e901a30

sequences_url:
    description: |
        URL for a sequences.fasta.zst file.
        If not provided, will use default sequences_url from phylogenetic/config/config_dengue.yaml
    required: false
    type: string
metadata_url:
    description: |
        URL for a metadata.tsv.zst file.
        If not provided, will use default metadata_url from phylogenetic/config/config_dengue.yaml

However, they are not being used in the phylogenetic rule:

dengue/phylogenetic/rules/prepare_sequences.smk

Lines 24 to 25 in e901a30

    sequences_url = "https://data.nextstrain.org/files/workflows/dengue/sequences_{serotype}.fasta.zst",
    metadata_url = "https://data.nextstrain.org/files/workflows/dengue/metadata_{serotype}.tsv.zst"

Expected Behavior

The phylogenetic GitHub action should accept sequences and metadata from specified URLs, especially when testing different features on dev branches. These datasets are often generated by the ingest GitHub action and should be configurable as optional inputs during feature testing.

Possible Solution(s)

Consider implementing changes similar to the Zika repository, but with the addition of allowing for serotype expansion (all, denv1, denv2, denv3, denv4).

Hopefully, then we could provide a config similar to:

env:
    TRIAL_NAME: serogeno20240518
    SEQUENCES_URL: https://data.nextstrain.org/files/workflows/dengue/trials/serogeno20240518/sequences_{serotype}.fasta.zst
    METADATA_URL: https://data.nextstrain.org/files/workflows/dengue/trials/serogeno20240518/metadata_{serotype}.tsv.zst

The {serotype} placeholder could then be expanded across all serotypes. Otherwise, solving this issue might involve defining multiple sets of SEQUENCES_DENVX_URL and METADATA_DENVX_URL fields, which would be tedious when testing dev branches. Alternatively, consider splitting phylogenetic into separate workflows (or workflow calls from a main workflow) for each serotype (phylogenetic_denv1 to phylogenetic_denv4). Open to discussion or suggestions.
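The placeholder expansion described above can be sketched as follows. The serotype list comes from this issue; the function name is hypothetical and this is not how the workflow currently resolves its inputs.

```python
SEROTYPES = ["all", "denv1", "denv2", "denv3", "denv4"]


def expand_urls(template, serotypes=SEROTYPES):
    """Fill a '{serotype}' placeholder once per serotype."""
    if "{serotype}" not in template:
        raise ValueError("template must contain a '{serotype}' placeholder")
    return {name: template.format(serotype=name) for name in serotypes}


template = (
    "https://data.nextstrain.org/files/workflows/dengue/trials/"
    "serogeno20240518/sequences_{serotype}.fasta.zst"
)
urls = expand_urls(template)
for name, url in urls.items():
    print(name, url)
```

A single templated URL per input keeps the number of configuration fields at two, regardless of how many serotypes are built.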

ENH: Generalize taxon id to serotype map definitions to a configuration file

Context

As a potential enhancement, it may be beneficial to allow users to configure the serotype (and taxon ID) list. This suggestion is inspired by the discussions in the following comments:

This would be particularly useful if we intend to permit users to modify the list of serotypes for curation, especially if taxon IDs become more detailed (e.g., the taxonomy subtree for Dengue).

Possible solution

Open to more suggestions or feedback here, but some solutions include:

Store the list and map in a dedicated config/taxid_to_serotype_map.tsv file.
Store the list and map directly in the config/build.config, following a similar approach to the NCBI field_map configuration.
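For the first option, a dedicated TSV could be loaded roughly as shown below. The column names `ncbi_tax_id` and `serotype` are assumptions. Taxon ID 11060 appears elsewhere in this tracker for dengue virus type 2; the other three IDs are believed to be the corresponding NCBI taxon IDs and should be verified before use.

```python
import csv
import io

# Hypothetical contents of config/taxid_to_serotype_map.tsv.
EXAMPLE_MAP = (
    "ncbi_tax_id\tserotype\n"
    "11053\tdenv1\n"
    "11060\tdenv2\n"
    "11069\tdenv3\n"
    "11070\tdenv4\n"
)


def load_taxid_map(tsv_text):
    """Parse a two-column TSV into a {taxon id: serotype} dictionary."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["ncbi_tax_id"].strip(): row["serotype"].strip() for row in reader}


taxid_map = load_taxid_map(EXAMPLE_MAP)
print(taxid_map)
print(sorted(set(taxid_map.values())))  # the configurable serotype list
```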

ingest: use new augur curate commands

Follow up to nextstrain/ingest#43, nextstrain/ingest#44, and blocked on new Augur release.

TODOs

update curate rule to use new augur curate commands
update ingest/vendored to remove cruft

Add E gene builds

Context

By user request:

Is there any chance we could get a E gene build of nextstrain dengue? Much more sequences of E than full genome, especially in some parts of the world

Description

Examples

Possible steps to a solution

1. Pull out the E gene sequence from the dengue reference.gb file to use as the reference for the E gene builds.
   a. Or follow the rsv rules.
2. Add a filter_length_per_group function for "all_E", "denv1_E", "denv2_E", etc., similar to filter_sequences_per_group.
3. Add E to the dropdown under "Dataset" by appending _E and _genome (e.g. dengue_denv1_genome.json and dengue_denv1_E.json) and updating the nextstrain.org manifest file.
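One possible shape for the proposed filter_length_per_group helper is sketched below. The signature, the build-name convention (an optional `_E` or `_genome` suffix), and both thresholds are assumptions chosen for illustration; the signature of the existing filter_sequences_per_group is not shown in this issue.

```python
# Illustrative minimum lengths in nucleotides; not values from the repository.
MIN_LENGTH = {"genome": 5000, "E": 1000}


def filter_length_per_group(build_name):
    """Return the minimum sequence length for a build such as 'denv1_E'.

    Build names without a suffix (e.g. 'denv1') are treated as genome builds.
    """
    gene = build_name.rsplit("_", 1)[-1] if "_" in build_name else "genome"
    if gene not in MIN_LENGTH:
        raise KeyError(f"no length threshold configured for {gene!r}")
    return MIN_LENGTH[gene]


for name in ["all_E", "denv1_E", "denv1_genome", "denv2"]:
    print(name, filter_length_per_group(name))
```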

Dependencies

Word of caution for genome MT597439

Although the header of genome MT597439 reads "Dengue virus type 2 isolate 43257 polyprotein (POLY) gene, partial cds; and sfRNA2 lncRNA gene, partial sequence", the source feature of the record tags it with a different serotype:

FEATURES             Location/Qualifiers
     source          1..10252
                     /organism="dengue virus type 2"
                     /mol_type="genomic RNA"
                     /serotype="4"
                     /isolate="43257"
                     /isolation_source="serum"
                     /host="Homo sapiens"
                     /db_xref="taxon:11060"
                     /country="South Korea"
                     /collection_date="2010"
                     /note="genotype: 2"

The submitters appear to have made an error when submitting this genome. In their article, they correctly assign it as DENV4/II (see Fig. 1b, sample 43257 highlighted in yellow). I will ask NCBI to correct this entry, but wanted to highlight it for the record.

Add manual serotype annotations along with justifications to "annotations.tsv"

Context

In response to comment: #28 (comment)

Description

We are relying on ncbi_tax_id to split dengue records into "DENV1" - "DENV4" but some records are missing this information.

Examples

Possible solution

Incorporate any manual annotation into the "annotations.tsv" file in the form of:

DI401607	ncbi_serotype	denv1 # Based on DEFINITION line in GenBank
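A minimal sketch of how such annotation lines could be applied as overrides, assuming tab-separated `accession`, field, and value columns with an optional trailing `#` comment. The function names and the in-memory record layout are hypothetical.

```python
def parse_annotations(text):
    """Yield (accession, field, value) from tab-separated annotation lines."""
    for line in text.splitlines():
        # Everything after '#' is treated as a justification comment.
        line = line.split("#", 1)[0].rstrip()
        if not line.strip():
            continue
        accession, field, value = line.split("\t")
        yield accession.strip(), field.strip(), value.strip()


def apply_annotations(records, annotations):
    """Overwrite fields in matching records; unmatched accessions are skipped."""
    by_accession = {record["accession"]: record for record in records}
    for accession, field, value in annotations:
        if accession in by_accession:
            by_accession[accession][field] = value
    return records


annotations_text = "DI401607\tncbi_serotype\tdenv1 # Based on DEFINITION line in GenBank\n"
records = [{"accession": "DI401607", "ncbi_serotype": ""}]
apply_annotations(records, parse_annotations(annotations_text))
print(records)
```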

Restructure S3 URLs

Noticed in reviewing #71 that the current dengue files are uploaded as:

files/workflows/dengue/sequences_<type>.fasta.zst
files/workflows/dengue/metadata_<type>.tsv.zst

I think it would be more in line with the standard data files to update this to

files/workflows/dengue/<type>/sequences.fasta.zst
files/workflows/dengue/<type>/metadata.tsv.zst
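The proposed renaming is mechanical, so existing keys could be translated with a small helper like the one below. This is only a sketch of the mapping between the two layouts shown above; the function name is hypothetical and nothing here touches S3.

```python
import re

# Matches the current layout: files/workflows/dengue/<kind>_<type>.<ext>.zst
OLD_KEY = re.compile(
    r"^files/workflows/dengue/(sequences|metadata)_([^./]+)\.(fasta|tsv)\.zst$"
)


def restructure_key(key):
    """Convert an old-style key into the proposed <type>/ subdirectory layout."""
    match = OLD_KEY.match(key)
    if match is None:
        raise ValueError(f"unrecognized key layout: {key}")
    kind, serotype, extension = match.groups()
    return f"files/workflows/dengue/{serotype}/{kind}.{extension}.zst"


print(restructure_key("files/workflows/dengue/sequences_denv1.fasta.zst"))
print(restructure_key("files/workflows/dengue/metadata_all.tsv.zst"))
```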

Switch DENV2 genotypes to numeric to be consistent with DENV1, 3, and 4

Context

In response to comment: #28 (comment)

DENV2/AA --> DENV2/III
DENV2/AI --> DENV2/V 
DENV2/AM --> DENV2/I
DENV2/C --> DENV2/II
DENV2/S --> DENV2/VI
DENV4/S --> DENV4/IV

One modification is to keep the S groups, since S=Sylvatic.

Description

Transition to using numeric lineage labels, and less geography-tied naming conventions.

DENV2/AA --> DENV2/III
DENV2/AI --> DENV2/V 
DENV2/AM --> DENV2/I
DENV2/C --> DENV2/II

Before implementing this, check if this is standard in the literature or will cause any confusion.
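The relabeling in the Description can be expressed as a simple lookup. The four mappings are copied from this issue; sylvatic (S) labels are left untouched, following the modification noted in the Context. The function name is hypothetical.

```python
# Mappings taken from the Description of this issue.
GENOTYPE_RENAMES = {
    "DENV2/AA": "DENV2/III",
    "DENV2/AI": "DENV2/V",
    "DENV2/AM": "DENV2/I",
    "DENV2/C": "DENV2/II",
}


def rename_genotype(label):
    """Return the numeric label if one is defined, else the label unchanged."""
    return GENOTYPE_RENAMES.get(label, label)


for label in ["DENV2/AA", "DENV2/C", "DENV2/S", "DENV1/II"]:
    print(label, "->", rename_genotype(label))
```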

Examples

Possible solution

(Optional)

Improve serotype assignment in Dengue virus DENVx genotypes datasets

Context

Flagged by @rneher in a Slack message: the Dengue virus DENVx genotypes dataset could be further improved in its clade assignments. For example, for DENV1:

DENV2 samples that align are correctly placed onto the outgroup node and marked as unassigned. (good!)
However, DENV1 samples that don't belong to an annotated genotype are also marked as unassigned, which is arguably incorrect. (This could be improved!) An example shown below:

Description

These samples should be assigned to the DENV1 serotype without a specific genotype, rather than being marked as unassigned. To illustrate this group of samples visually, we aim to reduce the samples in the magenta region of the table:

Possible solution

To ensure accurate serotype assignment while allowing for true-negative genotype assignments, I'm currently planning the following steps:

In the dengue/all tree, identify the amino acid mutations from the dengue/all reconstructed root to the reconstructed root of each serotype.
In each dengue/denv* tree, locate the amino acid mutations from the serotype reconstructed root to the outgroup dengue/all reconstructed root, and correct the coordinates accordingly.
Add the corrected coordinates of the amino acid mutations to each of the clades_genotype_denv*.tsv files, using the serotype name (e.g., DENV1) as the identifier.

After implementing these changes:

All DENV1 samples should be assigned to the DENV1 serotype, even if they don't belong to a specific genotype.
Samples from other serotypes (e.g., DENV2) should still be correctly marked as unassigned.

Of course, open to other suggestions or guidance here.

Removal of genome containing plasmid sequence

Dear @j23414

I see that some of the sequences in all_sequences.fasta file have plasmid DNA as well and therefore are circular DNA and have length greater than 12000 bp. These include

AY243466
AY243467
AY243468
AY243469
AY376438
AY648301
AY656167
AY656168
AY656169
AY656170
AY744148

Shouldn't these either be removed or have the plasmid sequence trimmed off?
I also see some extremely small genomes of about 2 kb. Genomes that are too large or too small can distort the MSA, so shouldn't they be removed from the resource? If so, what thresholds would you recommend for filtering out the uninformative genomes? Given the graph below, I was thinking of using 12184 bp as the upper boundary and 8670 bp as the lower boundary.
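A length filter using the boundaries proposed above could look like the following sketch. The thresholds are the values suggested in this issue, not settled defaults, and the sequences are synthetic placeholders.

```python
def filter_by_length(records, min_length=8670, max_length=12184):
    """Split (name, sequence) pairs into kept and dropped names by length."""
    kept, dropped = [], []
    for name, sequence in records:
        if min_length <= len(sequence) <= max_length:
            kept.append(name)
        else:
            dropped.append(name)
    return kept, dropped


# Synthetic sequences that only illustrate the three length classes.
records = [
    ("short_fragment", "A" * 2000),
    ("typical_genome", "A" * 10700),
    ("with_plasmid", "A" * 13000),
]
kept, dropped = filter_by_length(records)
print("kept:", kept)
print("dropped:", dropped)
```

Returning the dropped names as well makes it possible to review what a given pair of thresholds would exclude before committing to them.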

phylogenetic: Stochastic augur refine error

Noting the stochastic error in the all serotype build that I've seen in the automated runs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/treetime/treetime.py", line 57, in run
    return self._run(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/treetime/treetime.py", line 333, in _run
    self.calc_rate_susceptibility(params=tt_kwargs)
  File "/usr/local/lib/python3.10/site-packages/treetime/clock_tree.py", line 866, in calc_rate_susceptibility
    raise ValueError("ClockTree.calc_rate_susceptibility: rate estimate is negative. In this case the heuristic treetime uses to account for uncertainty in the rate estimate does not work. Please specify the clock-rate and its standard deviation explicitly via CLI parameters or arguments.")
ValueError: ClockTree.calc_rate_susceptibility: rate estimate is negative. In this case the heuristic treetime uses to account for uncertainty in the rate estimate does not work. Please specify the clock-rate and its standard deviation explicitly via CLI parameters or arguments.

ERROR: ClockTree.calc_rate_susceptibility: rate estimate is negative. In this case the heuristic treetime uses to account for uncertainty in the rate estimate does not work. Please specify the clock-rate and its standard deviation explicitly via CLI parameters or arguments. 
 
ERROR in TreeTime.run: An error occurred which was not properly handled in TreeTime. If this error persists, please let us know by filing a new issue including the original command and the error above at: https://github.com/neherlab/treetime/issues 

ERROR from TreeTime: An error occurred in TreeTime (see above). This may be due to an issue with TreeTime or Augur.
Please report you are calling TreeTime via Augur.
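The error message suggests specifying the clock rate and its standard deviation explicitly. As a sketch of what that could look like, the helper below assembles an `augur refine` command with `--clock-rate` and `--clock-std-dev`. The file paths and both numeric values are placeholders, not recommended settings for dengue, and the helper itself is hypothetical.

```python
import shlex


def refine_command(tree, alignment, metadata, output_tree, output_node_data,
                   clock_rate, clock_std_dev):
    """Build an augur refine argument list with a fixed clock rate."""
    return [
        "augur", "refine",
        "--tree", tree,
        "--alignment", alignment,
        "--metadata", metadata,
        "--output-tree", output_tree,
        "--output-node-data", output_node_data,
        "--timetree",
        "--clock-rate", str(clock_rate),
        "--clock-std-dev", str(clock_std_dev),
    ]


# Placeholder paths and rates, for illustration only.
command = refine_command(
    tree="results/tree_raw.nwk",
    alignment="results/aligned.fasta",
    metadata="data/metadata.tsv",
    output_tree="results/tree.nwk",
    output_node_data="results/branch_lengths.json",
    clock_rate=0.0008,
    clock_std_dev=0.0004,
)
print(shlex.join(command))
```

Fixing the rate trades the stochastic failure for a dependence on the chosen value, so the number would need to be justified from the data or the literature.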

ENH: nextclade extensions to display multiple nomenclatures

Context

To be written

Description

TBD

Examples

Possible solution

https://github.com/nextstrain/nextclade/blob/master/docs/user/input-files/04-reference-tree.md#extensions

Split by dengue serotype (denv1-denv4)

Description

Implement strategies (or an ensemble of strategies for cross-validation) to produce pairs of sequences_{serotype}.fasta and metadata_{serotype}.tsv files.

Context

Following the merge of #13, all ingested dengue records now exist in a unified pair of sequences.fasta and metadata.tsv files.

For subsequent analysis #18, and to maintain consistency with the previous approach and ensure a seamless integration with the phylogenetic pipeline, it is necessary to separate these files based on dengue serotypes (e.g., from sequences_denv1.fasta to sequences_denv4.fasta).

Possible solution(s)

Rely on NCBI taxon id annotations for serotype segregation

Historically, for dengue, we obtained each serotype individually in this code, leading to redundant fetching and processing of each sequence. Now that we're using ncbi datasets, these numeric IDs are recorded in a virus-tax-id field, that we can use to separate the serotypes.

Notably, this method risks missing sequences in individual serotype builds when NCBI has not annotated a record with the lineage ID (~3k records). This could potentially be refined further with a nextclade all dataset.
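The taxon-ID-based split, including the records that fall through, can be sketched as below. The field name `virus-tax-id` comes from this issue; the `accession` key, the example records, and the two-entry map are assumptions, and 12637 is assumed here to be the species-level dengue taxon ID.

```python
from collections import defaultdict


def split_by_serotype(rows, taxid_map, tax_field="virus-tax-id"):
    """Group accessions by serotype; unknown taxon IDs go to 'unassigned'."""
    groups = defaultdict(list)
    for row in rows:
        tax_id = str(row.get(tax_field, ""))
        groups[taxid_map.get(tax_id, "unassigned")].append(row["accession"])
    return dict(groups)


# Partial, illustrative map and invented records.
taxid_map = {"11060": "denv2", "11070": "denv4"}
rows = [
    {"accession": "X1", "virus-tax-id": "11060"},
    {"accession": "X2", "virus-tax-id": "11070"},
    {"accession": "X3", "virus-tax-id": "12637"},  # assumed species-level ID
]
groups = split_by_serotype(rows, taxid_map)
print(groups)
```

Collecting the fall-through records in an explicit group keeps them visible, so they can later be routed to a Nextclade-based classification instead of being silently excluded.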

Create a Nextclade dataset for finer subtype classification

Originally, the plan was to leverage Nextclade assignment to categorize records into major dengue serotypes and subsequent minor subtypes #16. However, due to the diversity within dengue, the major serotype classification did not align with expectations. Therefore the idea is to rely on NCBI taxon ids for major serotypes, and nextclade datasets for within-serotype sub-classification.

An ensemble method

Ideally, employ a combination of the above methods to consistently and accurately classify records into major serotypes and minor subtypes.

Tasks to solve this issue

#20
#16
- which possibly requires #25
