avian-flu's Issues

ingest: Upload NCBI/Andersen lab outputs to S3

Follow up to #40

I'm planning to pull out the S3-related rules from upload_from_fauna and make them data-source agnostic so that they can be used for the fauna, ncbi, and andersen-lab ingest workflows.

I'm planning on uploading to the usual public S3 bucket, since the NCBI data is public.
Both the NCBI data and the Andersen lab data are currently specific to H5N1, so I'll use the prefix
s3://nextstrain-data/files/workflows/avian-flu/h5n1/

| data source | type | S3 URL |
| --- | --- | --- |
| NCBI | metadata | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/metadata.tsv.zst |
| | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/ha/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/mp/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/na/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/np/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/ns/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/pa/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/pb1/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/pb2/sequences.fasta.zst |
| Andersen Lab | metadata | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/metadata.tsv.zst |
| | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/ha/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/mp/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/na/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/np/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/ns/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/pa/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/pb1/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/pb2/sequences.fasta.zst |
| merged | metadata | s3://nextstrain-data/files/workflows/avian-flu/h5n1/metadata.tsv.zst |
| | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ha/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/mp/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/na/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/np/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ns/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/pa/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/pb1/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/pb2/sequences.fasta.zst |

Note: The "merged" files will be blocked by #42, but I at least wanted to get the planned S3 URLs out here in case anyone disagrees with the paths.
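For concreteness, a small Python sketch that enumerates the planned destinations above from the prefix (names like S3_DST and the hard-coded source/segment lists are illustrative, not actual workflow config):

    # Illustrative only: derive the planned S3 destinations from the prefix.
    S3_DST = "s3://nextstrain-data/files/workflows/avian-flu/h5n1"
    SEGMENTS = ["ha", "mp", "na", "np", "ns", "pa", "pb1", "pb2"]
    SOURCES = ["ncbi", "andersen-lab", None]  # None = merged files at the prefix root

    def planned_s3_urls():
        for source in SOURCES:
            prefix = f"{S3_DST}/{source}" if source else S3_DST
            yield f"{prefix}/metadata.tsv.zst"
            for segment in SEGMENTS:
                yield f"{prefix}/{segment}/sequences.fasta.zst"

    for url in planned_s3_urls():
        print(url)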

Ingest Andersen Lab metadata and sequences

Jotting down the outline of ingesting the metadata and sequences from https://github.com/andersen-lab/avian-influenza

  1. Fetch a tarball of https://github.com/andersen-lab/avian-influenza
  2. Concatenate FASTA files of the same segment
  3. Cut FASTA headers down to just the SRA run accession (see the sketch after this list)
  4. Join segment FASTAs with metadata using augur curate
  5. Curate metadata, matching Nextstrain columns to Andersen Lab metadata:
| Nextstrain | Andersen Lab |
| --- | --- |
| strain | A/<Host>/USA/<isolate>/<year> (<year> parsed from <Date>) |
| virus | avian_flu |
| isolate_id | <Run> (SRA accession) |
| date | <Date> (missing or ?-marked dates become 2024-XX-XX) |
| region | North America |
| country | USA |
| division | <US State> |
| location | <US State> |
| host | <Host> (binned into Avian, Cattle, or Nonhuman Mammal) |
| domestic_status | ? |
| subtype | h5n1 |
| originating_lab | <Center Name> |
| submitting_lab | <Center Name> |
| authors | ? |
| PMID | ? |
| gisaid_clade | ? |
| h5_clade | ? |
  6. Output metadata.tsv and sequences.fasta per segment
  7. Run metadata files through add_segment_counts.py to match the fauna-downloaded data
  8. Upload output files to S3
  9. Concatenate fauna and Andersen lab files
  10. Dedup based on strain name
  11. Upload merged data to S3
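
To make steps 3 and 5 concrete, a rough sketch of the header trimming and strain construction (the FASTA header pattern and the metadata column names are assumptions about the Andersen lab files, not verified facts):

    import re

    def trim_header_to_run_accession(header):
        """Step 3: reduce a FASTA header to the embedded SRA run accession."""
        match = re.search(r"[SED]RR\d+", header)  # assumed to appear in every header
        return match.group(0) if match else header

    def build_strain(record):
        """Step 5: A/<Host>/USA/<isolate>/<year>, with <year> parsed from <Date>."""
        date = record.get("Date") or ""
        year = date[:4] if date[:4].isdigit() else "2024"  # missing/? dates -> 2024
        return f"A/{record['Host']}/USA/{record['isolate']}/{year}"

    # Example with made-up values:
    print(trim_header_to_run_accession(">Consensus_SRR28752446_ha"))  # SRR28752446
    print(build_strain({"Host": "Cattle", "isolate": "24-010338-001",
                        "Date": "2024-04-03"}))  # A/Cattle/USA/24-010338-001/2024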

ERROR: Augur version incompatibility detected

Hi, I'm trying to run the quickstart-build for avian-flu, but I get the error below. I'm using nextstrain.cli v6.2.1 and augur v19.2.0. Any ideas on how to fix it?

    ERROR: Augur version incompatibility detected: the JSON results/branch-lengths_h5n1_ha.json was generated by {'program': 'augur', 'version': '19.2.0'}, which is incompatible with the current augur version (21.1.0). We suggest you rerun the pipeline using the current version of augur.
    [Tue Apr 18 21:08:44 2023]
    Error in rule export:
        jobid: 1
        input: results/tree_h5n1_ha.nwk, results/metadata_h5n1_ha.tsv, results/branch-lengths_h5n1_ha.json, results/traits_h5n1_ha.json, results/nt-muts_h5n1_ha.json, results/aa-muts_h5n1_ha.json, results/cleavage-site_h5n1_ha.json, results/cleavage-site-sequences_h5n1_ha.json, config/auspice_config_h5n1.json
        output: auspice/flu_avian_h5n1_ha.json
        shell:
            augur export v2 --tree results/tree_h5n1_ha.nwk --metadata results/metadata_h5n1_ha.tsv --node-data results/branch-lengths_h5n1_ha.json results/traits_h5n1_ha.json results/nt-muts_h5n1_ha.json results/aa-muts_h5n1_ha.json results/cleavage-site_h5n1_ha.json results/cleavage-site-sequences_h5n1_ha.json --auspice-config config/auspice_config_h5n1.json --include-root-sequence --output auspice/flu_avian_h5n1_ha.json
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    Shutting down, this might take some time.
    Exiting because a job execution failed. Look above for error message
    Complete log: .snakemake/log/2023-04-18T210843.145552.snakemake.log

--fill-gaps eventually adds a furin cleavage site to everyone

Hi, in augur align you are calling --fill-gaps, which simply does

    _seq = _seq.replace('-', 'N')

so deletions become Ns, which downstream ancestral reconstruction treats as missing data and fills back in; as a result, in the Nextstrain tree everyone gets a furin cleavage site (FCS).

I don't have access to filtered_H5Nx_HA.fasta or metadata-with-clade_H5Nx_HA.tsv, so I can't propose a concrete modification of the Snakemake workflow (via a Python script) that preserves the FCS indels, but if I had them I would be happy to give it a try.
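
To sketch the idea anyway (with made-up file names and FCS coordinates, since I can't test against the real data), one could keep an alignment produced without --fill-gaps and copy its real deletions back over the Ns in the cleavage-site region:

    # Sketch only: restore genuine deletions that --fill-gaps rewrote as N.
    # File names and the FCS coordinates below are placeholders.
    from Bio import SeqIO
    from Bio.Seq import Seq

    FCS_START, FCS_END = 1013, 1064  # hypothetical HA alignment coordinates

    raw = {rec.id: str(rec.seq) for rec in SeqIO.parse("aligned_raw.fasta", "fasta")}

    records = []
    for rec in SeqIO.parse("aligned_filled.fasta", "fasta"):
        seq = list(str(rec.seq))
        for i in range(FCS_START, FCS_END):
            # Put real deletions back where --fill-gaps wrote N.
            if raw[rec.id][i] == "-" and seq[i] == "N":
                seq[i] = "-"
        rec.seq = Seq("".join(seq))
        records.append(rec)

    SeqIO.write(records, "aligned_with_fcs_indels.fasta", "fasta")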

The only experiment I did was to download a few thousand H5Nx sequences from GenBank, label those with a deletion at the FCS as LPAI, remove the FCS region from the alignment, and run MEGA; the LPAI sequences came out quite clustered together.

Ingest NCBI GenBank data for H5N1 outbreak

Jotting down concrete steps for ingesting NCBI GenBank data for H5N1 outbreak based on internal team discussion and GDoc notes.

Original plan that was discussed:

  1. Fetch GenBank accessions from NCBI Virus Download (example URL) -- this provides additional filter options for collection date and genotype that are not available through the datasets command, and currently shows 2,561 records on NCBI Virus.
  2. Use datasets to download the dataset for those accessions:

    datasets download virus genome accession --inputfile accessions.txt

  3. Use Bio.Entrez.efetch to fetch the GenBank records and parse out additional fields that are not included in the dataset metadata: strain, serotype, segment (see the sketch after this list).
  4. Merge the dataset metadata and the Entrez metadata into a single NDJSON that can then go through the usual curation pipeline.
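
For (3), a sketch of the Entrez fetch (the GBSeq_*/GBQualifier_* keys follow Biopython's parsed GenBank XML, but whether strain, serotype, and segment reliably appear as source-feature qualifiers is an assumption to check against real records):

    from Bio import Entrez

    Entrez.email = "you@example.org"  # placeholder; NCBI requires a contact email

    def fetch_extra_fields(accessions):
        handle = Entrez.efetch(db="nuccore", id=",".join(accessions),
                               rettype="gb", retmode="xml")
        for record in Entrez.read(handle):
            quals = {}
            for feature in record.get("GBSeq_feature-table", []):
                if feature["GBFeature_key"] == "source":
                    for q in feature.get("GBFeature_quals", []):
                        quals[q["GBQualifier_name"]] = q.get("GBQualifier_value")
            yield {
                "accession": record["GBSeq_primary-accession"],
                "strain": quals.get("strain"),
                "serotype": quals.get("serotype"),
                "segment": quals.get("segment"),
            }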

Detours

  • I waffled a little on whether we needed (3), because I realized that the GenBank title is included in the FASTA headers of the downloaded sequences! We could potentially parse out strain, serotype, and segment from titles such as

        PP766980.1 Influenza A virus (A/Canada goose/North Carolina/W24-90A/2024(H5N1)) clone new segment 7 matrix protein 2 (M2) and matrix protein 1 (M1) genes, complete cds

    However, I'm not sure that all record titles will follow this format, and the GenBank documentation for the Definition/Title field does not make me confident either:

        Brief description of sequence; includes information such as source organism, gene name/protein name, or some description of the sequence's function (if the sequence is non-coding).

    I think I'll stick with the original plan to use Entrez, but it was a nice thought (a rough sketch of the title-parsing idea follows this list).

  • (1) is currently not possible because the download function on NCBI Virus is broken for H5N1 due to a bug (Slack thread).
    I thought we could just download all of the influenza genomes and then filter locally, but I'm hitting the invalid zip archive error when running

        datasets download virus genome taxon 11320

    I'm not hitting the error with some minimal filtering on release date and geo-location, so we can go with this:

        datasets download virus genome taxon 11320 --released-after "04/01/2024" --geo-location "North America"

    This returns 32,262 records. Grepping for H5N in genomic.fna brought this down to 3,275, which is closer to the number on NCBI Virus, though I cannot easily check whether they are the same sequences.
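
As a coda to the first detour above, roughly what the title parsing could have looked like (a sketch; the regexes are fitted to the single example title and would break on titles that deviate from that format):

    import re

    title = ("PP766980.1 Influenza A virus (A/Canada goose/North Carolina/"
             "W24-90A/2024(H5N1)) clone new segment 7 matrix protein 2 (M2) "
             "and matrix protein 1 (M1) genes, complete cds")

    strain_match = re.search(r"\((A/.+?)\((H\d+N\d+)\)\)", title)
    segment_match = re.search(r"segment (\d+)", title)

    if strain_match and segment_match:
        print(strain_match.group(1), strain_match.group(2), segment_match.group(1))
        # -> A/Canada goose/North Carolina/W24-90A/2024 H5N1 7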

Annotate H5 clades through a node data JSON file instead of modifying metadata

Context

In conversation about #22, @trvrb noted:

I don't think this is part of the scope of this PR, but it would seem cleaner to me for this clade-labeling/add-clades.py script to instead just create a node data JSON with h5_label_clade rather than messing with the metadata file.

@lmoncla and I just had some confusion from different rules (refine, traits) asking for the metadata TSV vs. the metadata-with-clade TSV, though the metadata_by_wildcards function mostly solves this issue.

@jameshadfield noted that:

The only reason I can see to not do this is if we use this data in the filtering step. But we don't.

Description

We should modify scripts/add-clades.py to create a node data JSON file as output and update the workflow to make the resulting output an input to the export rule instead of a step that modifies the metadata.
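
For reference, a minimal sketch of the node-data JSON the script could emit instead (following augur's node-data convention of a top-level "nodes" map keyed by node name; the strain name and output path are made up):

    import json

    node_data = {
        "nodes": {
            "A/duck/Example/1/2024": {"h5_label_clade": "2.3.4.4b"},
            # ...one entry per labeled node...
        }
    }

    with open("results/h5-clades_h5n1_ha.json", "w") as fh:
        json.dump(node_data, fh, indent=2)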

ingest: Join NCBI and Andersen lab data

Follow up to #28 + #40

The Andersen lab will continue to update consensus sequences from SRA runs. These eventually get uploaded to GenBank by the original submitters and become available through the NCBI data. However, there is a delay before the data appears in NCBI, so we can merge the Andersen lab data with the NCBI data to get the latest available sequences.

Both sets of data have the SRA accession, so we can use that to dedup the data.
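
A minimal merge sketch, assuming the SRA run accession lands in a shared metadata column (sra_accession is a placeholder name) and preferring the NCBI record when both sources have the same run:

    import pandas as pd

    ncbi = pd.read_csv("ncbi/metadata.tsv", sep="\t")
    andersen = pd.read_csv("andersen-lab/metadata.tsv", sep="\t")

    # NCBI rows come first, so they win the dedup when both have a run.
    merged = pd.concat([ncbi, andersen], ignore_index=True)
    merged = merged.drop_duplicates(subset="sra_accession", keep="first")
    merged.to_csv("metadata.tsv", sep="\t", index=False)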

Automate ingest via GH Actions

Follow up to #40. Blocked by #41

We can remove a lot of manual work by adding GH Action workflows to run the ingest workflows.

The GH Action for fauna ingest will most likely need to be manually triggered since we still need to upload data to fauna manually. However, the GH Action for NCBI + Andersen lab ingest can be scheduled to run automatically.

Both GH Action workflows should use the shared pathogen-repo-build workflow. This will depend on nextstrain/infra#11 for using OIDC tokens in the GH Action workflow.


I'm not as familiar with the phylogenetic workflows in this repo, so it's unclear to me how much we can automate them via GH Actions. Is there a lot of manual clean up that needs to be done for builds or is it as straightforward as running nextstrain build and then nextstrain deploy?

Corrupted sequences in Fauna

Fauna output has (what I presume is) corrupted data for certain sequences, such as:

    >A/egret/Korea/22WC603/2023|avian_flu|EPIEPI2738274|2023-03-06|japan_korea|south_korea|south_korea|south_korea|avian|?|h5n1|national_institute_of_wildlife_disease_control_and_prevention(niwdc)|national_institute_of_wildlife_disease_control_and_prevention|?|?|2.3.4.4b|?
    aegretkoreawcpbatggatgtcaatccgactttacttttcttaaaagtgccagcgcaaaatgccataagtacc...

(Notice the start of the nuc sequence includes parts of the strain name)

This data raises no errors during the parse, filter, align steps but IQ-TREE will crash with a warning:

    ERROR: Sequence A_DELIM-BZEZHVQHONIWPWOQAAAV_egret_DELIM-BZEZHVQHONIWPWOQAAAV_Korea_DELIM-BZEZHVQHONIWPWOQAAAV_22WC603_DELIM-BZEZHVQHONIWPWOQAAAV_2023 has invalid character E at site 17
    ...

I observed this while running builds on AWS to test #11, and I can reproduce it locally as well (it's somewhat stochastic, as it depends on whether the sequence makes it through filtering). I think this may be the only such strain, but it is corrupted for both pb1 and pb2.

I suggest we add it to the exclude list, unless others have more knowledge here?
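
To check whether other records are affected, a quick scan along these lines (a sketch; the input path is illustrative) would flag sequences with non-IUPAC characters before they ever reach IQ-TREE:

    from Bio import SeqIO

    VALID = set("ACGTURYSWKMBDHVN-.")  # IUPAC nucleotide codes plus gap characters

    for rec in SeqIO.parse("results/sequences_h5n1_pb1.fasta", "fasta"):
        bad = set(str(rec.seq).upper()) - VALID
        if bad:
            print(rec.id, "has invalid characters:", "".join(sorted(bad)))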
