dengue's Introduction

Nextstrain repository for dengue virus

This repository contains three workflows for the analysis of dengue virus data:

  • ingest/ - Download data from GenBank, clean and curate it and upload it to S3
  • phylogenetic/ - Make phylogenetic trees for nextstrain.org
  • nextclade/ - Make Nextclade datasets for nextstrain/nextclade_data

Each folder contains a README.md with more information.

Documentation

dengue's People

Contributors

genehack, huddlej, ivan-aksamentov, j23414, joverlee521, trvrb, tsibley

dengue's Issues

Bug: Update dropped strains file to list accession instead of strain

Current Behavior

Currently, strains listed in phylogenetic/config/dropped_strains.txt have not been dropped since commit 8ab810f.

Expected behavior

Strains listed in dropped_strains.txt should be excluded from the final phylogenetic tree.
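A minimal sketch of why entries listed by strain name stop matching once records are keyed by accession, as the issue title suggests. This is not the workflow's implementation: the helper names, the record fields (`accession`, `strain`), and the sample values are all hypothetical.

```python
def load_dropped(lines):
    """Parse an exclude list: one ID per line; '#' starts a comment."""
    dropped = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            dropped.add(entry)
    return dropped


def drop_records(records, dropped, id_field="accession"):
    """Keep only records whose ID field is not in the exclude list."""
    return [r for r in records if r[id_field] not in dropped]


# Hypothetical records, for illustration only.
records = [
    {"accession": "AB000001", "strain": "strain_a"},
    {"accession": "AB000002", "strain": "strain_b"},
]

by_strain = load_dropped(["strain_a  # listed by strain name"])
by_accession = load_dropped(["AB000001  # listed by accession"])

# Matching on accession: the strain-name entry silently matches nothing.
kept_when_listed_by_strain = drop_records(records, by_strain)
kept_when_listed_by_accession = drop_records(records, by_accession)
```

With the exclude list written as strain names, both records survive; once the same entry is written as an accession, the record is dropped.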

How to reproduce

Possible solution

Perhaps cherry pick a commit like:

Rename `config` to `defaults`

Context

At some point, get through the backlog of changes to align with the latest pathogen-repo-guide.

  • Rename any */config to */defaults

Fine tune the "Dengue virus DENVx genotypes" dataset

Dengue virus DENVx genotypes

This is an update on the classification of dengue virus (DENV) genotypes. As suggested by @rneher in a Slack channel, adding an outgroup seems to have mostly resolved the cross-serotype false positives in the DENV1, DENV3, and DENV4 genotype classifications.

DENV1, DENV3, and DENV4 Genotype Classifications

The following images show the improved genotype classifications for DENV1, DENV3, and DENV4 after incorporating the outgroup:

[Screenshots (2024-06-12): genotype classification results for DENV1, DENV3, and DENV4]

The green color indicates within-serotype classifications, which is a positive outcome.

DENV2 Genotype Classification

However, the DENV2 genotype classification still requires further improvement:

[Screenshot (2024-06-12): DENV2 genotype classification results]

The reconstructed "all" root was too similar to a DENV2 genotype, so it has been swapped with DENV4.

Next Steps for DENV2

The current plan is to add the reconstructed roots of DENV1, DENV3, and DENV4 as outgroups to DENV2 to further improve genotype classification accuracy across all serotypes. Other suggestions are welcome.

Rename "subtype" to "genotype"

Context

In response to comment:

I think we should generally be consistent with the nomenclature. I see for metadata you have nextclade_subtype with entries like DENV1/II. This is canonically "DENV genotype". I suggest aiming for two columns in the metadata. One for serotype with DENV1, DENV2, etc... and one for denv_genotype with DENV1/II, etc.... This is similar to how things work for SARS-CoV-2 with a clade column as well as a lineage column. Also mpox uses clade and lineage as well as separate columns.

Description

Currently we have

  • ncbi_serotype because we are relying on "NCBI" annotation as the source of serotype assignment. No change to the column name here
  • nextclade_subtype because we are using "nextclade" for genotype assignment. Rename this to "nextclade_genotype"

Of course feel free to comment on this GitHub Issue with other suggestions.
Optionally, we could reorder the metadata columns such that ncbi_serotype and nextclade_genotype are next to each other to make this distinction more obvious to people manually looking at the metadata file.
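The rename and the optional reordering could be sketched as follows. This is only an illustration with the standard library: the function name, the sample row, and the extra `accession` and `date` columns are hypothetical, and the actual workflow may handle the column differently.

```python
import csv
import io


def rename_and_reorder(tsv_text, old="nextclade_subtype",
                       new="nextclade_genotype", anchor="ncbi_serotype"):
    """Rename one metadata column and place it right after the anchor column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fields = [new if f == old else f for f in reader.fieldnames]
    # Move the renamed column so it sits next to the serotype column.
    fields.remove(new)
    fields.insert(fields.index(anchor) + 1, new)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[new] = row.pop(old)
        writer.writerow(row)
    return out.getvalue()


# Hypothetical one-row metadata table.
example = (
    "accession\tncbi_serotype\tdate\tnextclade_subtype\n"
    "X1\tDENV1\t2010\tDENV1/II\n"
)
renamed = rename_and_reorder(example)
```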

The `phylogenetic` github action cannot build from staged `ingest` data during dev-branch testing

Current Behavior

The phylogenetic GitHub action (see this run) ignored the provided sequence and metadata URL configurations:

env:
    TRIAL_NAME: serogeno20240518
    SEQUENCES_URL: s3://nextstrain-data/files/workflows/dengue/trials/serogeno20240518/sequences_all.fasta.zst
    METADATA_URL: s3://nextstrain-data/files/workflows/dengue/trials/serogeno20240518/metadata_all.tsv.zst

These input fields are specified in the .github/workflows file:

sequences_url:
  description: |
    URL for a sequences.fasta.zst file.
    If not provided, will use default sequences_url from phylogenetic/config/config_dengue.yaml
  required: false
  type: string
metadata_url:
  description: |
    URL for a metadata.tsv.zst file.
    If not provided, will use default metadata_url from phylogenetic/config/config_dengue.yaml

However, they are not being used in the phylogenetic rule:

sequences_url = "https://data.nextstrain.org/files/workflows/dengue/sequences_{serotype}.fasta.zst",
metadata_url = "https://data.nextstrain.org/files/workflows/dengue/metadata_{serotype}.tsv.zst"

Expected Behavior

The phylogenetic GitHub action should accept sequences and metadata from specified URLs, especially when testing features on dev branches. These datasets are often generated by the ingest GitHub action and should be a configurable, optional input during feature testing.

Possible Solution(s)

Consider implementing changes similar to those in the Zika repository, with the addition of serotype expansion (all, denv1, denv2, denv3, denv4).

Hopefully, then we could provide a config similar to:

env:
    TRIAL_NAME: serogeno20240518
    SEQUENCES_URL: https://data.nextstrain.org/files/workflows/dengue/trials/serogeno20240518/sequences_{serotype}.fasta.zst
    METADATA_URL: https://data.nextstrain.org/files/workflows/dengue/trials/serogeno20240518/metadata_{serotype}.tsv.zst

These templates could be expanded across all serotypes. Otherwise, solving this issue might involve defining multiple sets of SEQUENCES_DENVX_URL and METADATA_DENVX_URL fields, which would be tedious when testing dev branches. Alternatively, consider splitting phylogenetic into separate workflows (or workflow calls from a main workflow) for each serotype (phylogenetic_denv1 to phylogenetic_denv4). Open to discussion or suggestions.
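A minimal sketch of the serotype expansion, assuming the URL is passed as a template containing a `{serotype}` placeholder. The helper name `expand_urls` is hypothetical; the URLs are the trial URLs quoted above.

```python
SEROTYPES = ["all", "denv1", "denv2", "denv3", "denv4"]


def expand_urls(template, serotypes=SEROTYPES):
    """Fill the {serotype} placeholder once per serotype."""
    return {s: template.format(serotype=s) for s in serotypes}


SEQUENCES_URL = (
    "https://data.nextstrain.org/files/workflows/dengue/trials/"
    "serogeno20240518/sequences_{serotype}.fasta.zst"
)
METADATA_URL = (
    "https://data.nextstrain.org/files/workflows/dengue/trials/"
    "serogeno20240518/metadata_{serotype}.tsv.zst"
)

sequence_urls = expand_urls(SEQUENCES_URL)
metadata_urls = expand_urls(METADATA_URL)
```

A single pair of templated inputs then covers all five builds, which avoids one pair of URL fields per serotype.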

ENH: Generalize taxon id to serotype map definitions to a configuration file

Context

As a potential enhancement, it may be beneficial to allow users to configure the serotype (and taxon ID) list. This suggestion is inspired by the discussions in the following comments:

This would be particularly useful if we intend to permit users to modify the list of serotypes for curation, especially if taxon IDs become more detailed (e.g., the taxonomy subtree for Dengue).

Possible solution

Open to more suggestions or feedback here, but some solutions include:

  1. Store the list and map in a dedicated config/taxid_to_serotype_map.tsv file.
  2. Store the list and map directly in the config/build.config, following a similar approach to the NCBI field_map configuration.
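Option 1 could look roughly like this. The column names (`taxon_id`, `serotype`) and the loader are hypothetical. Only taxon ID 11060 (dengue virus type 2) appears in a GenBank record quoted on this page; the other IDs reflect my understanding of the NCBI taxonomy and should be verified before use.

```python
import csv
import io

# Hypothetical contents of config/taxid_to_serotype_map.tsv.
# Taxon IDs other than 11060 are assumptions to be checked against NCBI.
EXAMPLE_TSV = (
    "taxon_id\tserotype\n"
    "11053\tdenv1\n"
    "11060\tdenv2\n"
    "11069\tdenv3\n"
    "11070\tdenv4\n"
)


def load_taxid_map(handle):
    """Read a two-column TSV into a {taxon_id: serotype} dict."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["taxon_id"]: row["serotype"] for row in reader}


taxid_to_serotype = load_taxid_map(io.StringIO(EXAMPLE_TSV))
serotypes = sorted(set(taxid_to_serotype.values()))
```

Keeping the map in a file means more detailed taxon IDs can be added as extra rows pointing at an existing serotype, without code changes.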

Add E gene builds

Context

By user request:

Is there any chance we could get a E gene build of nextstrain dengue? Much more sequences of E than full genome, especially in some parts of the world

Possible steps to a solution

  1. Pull out E gene sequence from the dengue reference.gb file to be used as the reference for the E gene builds.
    a. Or follow rsv rules
  2. Add a filter_length_per_group function for “all_E”, “denv1_E”, “denv2_E”, etc similar to filter_sequences_per_group.
  3. Add E to the dropdown under “Dataset” by appending _E and _genome (e.g. dengue_denv1_genome.json and dengue_denv1_E.json) and updating the nextstrain.org manifest file.

Word of caution for genome MT597439

Although the header of genome MT597439 reads "Dengue virus type 2 isolate 43257 polyprotein (POLY) gene, partial cds; and sfRNA2 lncRNA gene, partial sequence", the source feature of this record tags it as serotype 4:

FEATURES             Location/Qualifiers
     source          1..10252
                     /organism="dengue virus type 2"
                     /mol_type="genomic RNA"
                     /serotype="4"
                     /isolate="43257"
                     /isolation_source="serum"
                     /host="Homo sapiens"
                     /db_xref="taxon:11060"
                     /country="South Korea"
                     /collection_date="2010"
                     /note="genotype: 2"

The submitters appear to have made an error when submitting this genome. In their article, they correctly assign it as DENV4/II (see Fig. 1b, sample 43257 highlighted in yellow). I will ask NCBI to correct this entry, but wanted to flag it here for the record.

Restructure S3 URLs

Noticed in reviewing #71 that the current dengue files are uploaded as:

  • files/workflows/dengue/sequences_<type>.fasta.zst
  • files/workflows/dengue/metadata_<type>.tsv.zst

I think it would be more in line with the standard data files to update this to

  • files/workflows/dengue/<type>/sequences.fasta.zst
  • files/workflows/dengue/<type>/metadata.tsv.zst
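The proposed key change can be sketched as a pure string mapping. The function name and regular expression are hypothetical and only cover the two file patterns listed above; anything else is returned unchanged.

```python
import re

# Matches the current layout: files/workflows/dengue/<name>_<type>.<ext>.zst
OLD_KEY = re.compile(
    r"^files/workflows/dengue/(sequences|metadata)_([^./]+)\.(fasta|tsv)\.zst$"
)


def restructure_key(key):
    """Map <name>_<type>.<ext>.zst to <type>/<name>.<ext>.zst."""
    match = OLD_KEY.match(key)
    if match is None:
        return key  # Not one of the two known patterns; leave as-is.
    name, kind, ext = match.groups()
    return f"files/workflows/dengue/{kind}/{name}.{ext}.zst"


new_sequences_key = restructure_key(
    "files/workflows/dengue/sequences_denv1.fasta.zst"
)
new_metadata_key = restructure_key(
    "files/workflows/dengue/metadata_all.tsv.zst"
)
```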

Switch DENV2 genotypes to numeric to be consistent with DENV1, 3, and 4

Context

In response to comment: #28 (comment)

DENV2/AA --> DENV2/III
DENV2/AI --> DENV2/V 
DENV2/AM --> DENV2/I
DENV2/C --> DENV2/II
DENV2/S --> DENV2/VI
DENV4/S --> DENV4/IV

One modification is to keep the S groups, since S=Sylvatic.

Description

Transition to using numeric lineage labels, and less geography-tied naming conventions.

DENV2/AA --> DENV2/III
DENV2/AI --> DENV2/V 
DENV2/AM --> DENV2/I
DENV2/C --> DENV2/II

Before implementing this, check if this is standard in the literature or will cause any confusion.
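The relabeling in the Description amounts to a small lookup table. A sketch, assuming the sylvatic (S) groups are kept as noted in the Context; the function name is hypothetical.

```python
# Mapping from the Description; S (sylvatic) labels are intentionally absent
# so that they pass through unchanged.
GENOTYPE_RENAMES = {
    "DENV2/AA": "DENV2/III",
    "DENV2/AI": "DENV2/V",
    "DENV2/AM": "DENV2/I",
    "DENV2/C": "DENV2/II",
}


def rename_genotype(label):
    """Return the numeric label if one is defined, else the original label."""
    return GENOTYPE_RENAMES.get(label, label)


relabeled = [rename_genotype(g) for g in ["DENV2/AA", "DENV2/S", "DENV1/II"]]
```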

Improve serotype assignment in Dengue virus DENVx genotypes datasets

Context

As flagged by @rneher in a Slack message, the clade assignments in the Dengue virus DENVx genotypes dataset could be further improved. For example, for DENV1:

  1. DENV2 samples that align are correctly placed onto the outgroup node and marked as unassigned. (good!)
  2. However, DENV1 samples that don't belong to an annotated genotype are also marked as unassigned, which is arguably incorrect. (This could be improved!) An example shown below:

[Image: example of DENV1 samples marked as unassigned]

Description

These samples should be assigned to the DENV1 serotype without a specific genotype, rather than being marked as unassigned. To illustrate this group of samples visually, we aim to reduce the samples in the magenta region of the table:

[Screenshot (2024-06-25): table of assignments, with the unassigned samples in the magenta region]

Possible solution

To ensure accurate serotype assignment while allowing for true-negative genotype assignments, I'm currently planning the following steps:

  1. In the dengue/all tree, identify the amino acid mutations from the dengue/all reconstructed root to the reconstructed root of each serotype.
  2. In each dengue/denv* tree, locate the amino acid mutations from the serotype reconstructed root to the outgroup dengue/all reconstructed root, and correct the coordinates accordingly.
  3. Add the corrected coordinates of the amino acid mutations to each of the clades_genotype_denv*.tsv files, using the serotype name (e.g., DENV1) as the identifier.

After implementing these changes:

  • All DENV1 samples should be assigned to the DENV1 serotype, even if they don't belong to a specific genotype.
  • Samples from other serotypes (e.g., DENV2) should still be correctly marked as unassigned.

Of course, open to other suggestions or guidance here.

Removal of genome containing plasmid sequence

Dear @j23414

I see that some of the sequences in the all_sequences.fasta file also contain plasmid DNA, so they are circular and longer than 12,000 bp. These include:

AY243466
AY243467
AY243468
AY243469
AY376438
AY648301
AY656167
AY656168
AY656169
AY656170
AY744148

Shouldn't these either be removed or have the plasmid sequence trimmed off?
I also see some extremely small genomes of about 2 kb. Genomes that are too large or too small can influence the MSA, so shouldn't they be removed from the resource? If so, what thresholds would you recommend for filtering out the uninformative genomes? Given the graph below, I was thinking of taking an upper bound of 12184 bp and a lower bound of 8670 bp.

[Figure: genome length plot]
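A minimal sketch of the length filter proposed above, using the suggested bounds of 8670 bp and 12184 bp. The FASTA reader and function names are hypothetical, and the sequences are synthetic placeholders rather than real records.

```python
import io

MIN_LENGTH = 8670   # lower bound suggested in this issue
MAX_LENGTH = 12184  # upper bound suggested in this issue


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, min_length=MIN_LENGTH, max_length=MAX_LENGTH):
    """Keep records whose sequence length lies within the inclusive bounds."""
    return [(n, s) for n, s in records if min_length <= len(s) <= max_length]


# Synthetic sequences: too short, typical, and too long (e.g. with plasmid).
fasta = (
    f">short\n{'A' * 2000}\n"
    f">typical\n{'A' * 10700}\n"
    f">with_plasmid\n{'A' * 14000}\n"
)
kept = filter_by_length(read_fasta(io.StringIO(fasta)))
kept_names = [name for name, _ in kept]
```

Filtering removes the plasmid-containing records entirely; trimming the plasmid portion instead would need the insert coordinates, which this sketch does not attempt.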

phylogenetic: Stochastic augur refine error

Noting the stochastic error in the all serotype build that I've seen in the automated runs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/treetime/treetime.py", line 57, in run
    return self._run(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/treetime/treetime.py", line 333, in _run
    self.calc_rate_susceptibility(params=tt_kwargs)
  File "/usr/local/lib/python3.10/site-packages/treetime/clock_tree.py", line 866, in calc_rate_susceptibility
    raise ValueError("ClockTree.calc_rate_susceptibility: rate estimate is negative. In this case the heuristic treetime uses to account for uncertainty in the rate estimate does not work. Please specify the clock-rate and its standard deviation explicitly via CLI parameters or arguments.")
ValueError: ClockTree.calc_rate_susceptibility: rate estimate is negative. In this case the heuristic treetime uses to account for uncertainty in the rate estimate does not work. Please specify the clock-rate and its standard deviation explicitly via CLI parameters or arguments.

ERROR: ClockTree.calc_rate_susceptibility: rate estimate is negative. In this case the heuristic treetime uses to account for uncertainty in the rate estimate does not work. Please specify the clock-rate and its standard deviation explicitly via CLI parameters or arguments. 
 
ERROR in TreeTime.run: An error occurred which was not properly handled in TreeTime. If this error persists, please let us know by filing a new issue including the original command and the error above at: https://github.com/neherlab/treetime/issues 

ERROR from TreeTime: An error occurred in TreeTime (see above). This may be due to an issue with TreeTime or Augur.
Please report you are calling TreeTime via Augur.

Split by dengue serotype (denv1-denv4)

Description

Implement strategies (or an ensemble of strategies for cross-validation) to produce pairs of sequences_{serotype}.fasta and metadata_{serotype}.tsv files.

Context

Following the merge of #13, all ingested dengue records now exist in a unified pair of sequences.fasta and metadata.tsv files.

For subsequent analysis #18, and to maintain consistency with the previous approach and ensure a seamless integration with the phylogenetic pipeline, it is necessary to separate these files based on dengue serotypes (e.g., from sequences_denv1.fasta to sequences_denv4.fasta).

Possible solution(s)

Rely on NCBI taxon id annotations for serotype segregation

Historically, for dengue, we obtained each serotype individually in this code, leading to redundant fetching and processing of each sequence. Now that we're using ncbi datasets, these numeric IDs are recorded in a virus-tax-id field, that we can use to separate the serotypes.

Notably, this method risks missing sequences in individual serotype builds when NCBI has not annotated a record with the lineage ID (~3k records). This could potentially be refined further with a Nextclade "all" dataset.
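The taxon-ID approach could be sketched as a simple grouping step. The `virus-tax-id` field name comes from the text above; the function name, the sample records, and the taxon IDs other than 11060 are assumptions to verify against NCBI. Records without a serotype-level ID fall into an `unassigned` group, which corresponds to the risk described above.

```python
from collections import defaultdict

# Assumed NCBI taxon IDs for the four serotypes; verify before use.
TAXID_TO_SEROTYPE = {
    "11053": "denv1",
    "11060": "denv2",
    "11069": "denv3",
    "11070": "denv4",
}


def split_by_serotype(records, taxid_field="virus-tax-id"):
    """Group metadata records by serotype using their taxon ID."""
    groups = defaultdict(list)
    for record in records:
        taxid = str(record.get(taxid_field, ""))
        groups[TAXID_TO_SEROTYPE.get(taxid, "unassigned")].append(record)
    return dict(groups)


# Hypothetical records; 12637 stands in for a species-level annotation.
records = [
    {"accession": "A1", "virus-tax-id": "11053"},
    {"accession": "A2", "virus-tax-id": "11060"},
    {"accession": "A3", "virus-tax-id": "12637"},
]
groups = split_by_serotype(records)
```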

Create a Nextclade dataset for finer subtype classification

Originally, the plan was to leverage Nextclade assignment to categorize records into major dengue serotypes and subsequent minor subtypes #16. However, due to the diversity within dengue, the major serotype classification did not align with expectations. Therefore the idea is to rely on NCBI taxon ids for major serotypes, and nextclade datasets for within-serotype sub-classification.

An ensemble method

Ideally, employ a combination of the above methods to consistently and accurately classify records into major serotypes and minor subtypes.
