avian-flu's Issues

ingest: Upload NCBI/Andersen lab outputs to S3

Follow up to #40

I'm planning to pull out the S3-related rules from upload_from_fauna and make them data-source agnostic so that they can be used for the fauna, ncbi, and andersen-lab ingest workflows.

I'm planning on uploading to the usual public S3 bucket, since the NCBI data is public.
Both the NCBI data and the Andersen lab data are currently specific to H5N1, so I'll use the prefix
s3://nextstrain-data/files/workflows/avian-flu/h5n1/

| data source | type | S3 URL |
| --- | --- | --- |
| NCBI | metadata | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/metadata.tsv.zst |
| | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/ha/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/mp/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/na/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/np/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/ns/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/pa/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/pb1/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/pb2/sequences.fasta.zst |
| Andersen Lab | metadata | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/metadata.tsv.zst |
| | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/ha/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/mp/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/na/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/np/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/ns/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/pa/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/pb1/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/pb2/sequences.fasta.zst |
| merged | metadata | s3://nextstrain-data/files/workflows/avian-flu/h5n1/metadata.tsv.zst |
| | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ha/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/mp/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/na/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/np/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ns/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/pa/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/pb1/sequences.fasta.zst |
| | | s3://nextstrain-data/files/workflows/avian-flu/h5n1/pb2/sequences.fasta.zst |

Note: The "merged" files will be blocked by #42, but I at least wanted to get the planned S3 URLs out here in case anyone disagrees with the paths.
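For concreteness, a small Python sketch that enumerates the planned destinations above from the prefix (names like S3_DST and the hard-coded source/segment lists are illustrative, not actual workflow config):

    # Illustrative only: derive the planned S3 destinations from the prefix.
    S3_DST = "s3://nextstrain-data/files/workflows/avian-flu/h5n1"
    SEGMENTS = ["ha", "mp", "na", "np", "ns", "pa", "pb1", "pb2"]
    SOURCES = ["ncbi", "andersen-lab", None]  # None = merged files at the prefix root

    def planned_s3_urls():
        for source in SOURCES:
            prefix = f"{S3_DST}/{source}" if source else S3_DST
            yield f"{prefix}/metadata.tsv.zst"
            for segment in SEGMENTS:
                yield f"{prefix}/{segment}/sequences.fasta.zst"

    for url in planned_s3_urls():
        print(url)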

Ingest Andersen Lab metadata and sequences

Jotting down the outline of ingesting the metadata and sequences from https://github.com/andersen-lab/avian-influenza

  1. Fetch a tarball of https://github.com/andersen-lab/avian-influenza
  2. Concatenate FASTA files of the same segment
  3. Cut FASTA headers down to just the SRA run accession (see the sketch after this list)
  4. Join segment FASTAs with metadata using augur curate
  5. Curate metadata, matching Nextstrain columns to Andersen Lab metadata:
| Nextstrain | Andersen Lab |
| --- | --- |
| strain | A/<Host>/USA/<isolate>/<year> (<year> parsed from <Date>) |
| virus | avian_flu |
| isolate_id | <Run> (SRA accession) |
| date | <Date> (missing or ?-marked dates become 2024-XX-XX) |
| region | North America |
| country | USA |
| division | <US State> |
| location | <US State> |
| host | <Host> (binned into Avian, Cattle, or Nonhuman Mammal) |
| domestic_status | ? |
| subtype | h5n1 |
| originating_lab | <Center Name> |
| submitting_lab | <Center Name> |
| authors | ? |
| PMID | ? |
| gisaid_clade | ? |
| h5_clade | ? |
  6. Output metadata.tsv and sequences.fasta per segment
  7. Run metadata files through add_segment_counts.py to match the fauna-downloaded data
  8. Upload output files to S3
  9. Concatenate fauna and Andersen lab files
  10. Dedup based on strain name
  11. Upload merged data to S3
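
To make steps 3 and 5 concrete, a rough sketch of the header trimming and strain construction (the FASTA header pattern and the metadata column names are assumptions about the Andersen lab files, not verified facts):

    import re

    def trim_header_to_run_accession(header):
        """Step 3: reduce a FASTA header to the embedded SRA run accession."""
        match = re.search(r"[SED]RR\d+", header)  # assumed to appear in every header
        return match.group(0) if match else header

    def build_strain(record):
        """Step 5: A/<Host>/USA/<isolate>/<year>, with <year> parsed from <Date>."""
        date = record.get("Date") or ""
        year = date[:4] if date[:4].isdigit() else "2024"  # missing/? dates -> 2024
        return f"A/{record['Host']}/USA/{record['isolate']}/{year}"

    # Example with made-up values:
    print(trim_header_to_run_accession(">Consensus_SRR28752446_ha"))  # SRR28752446
    print(build_strain({"Host": "Cattle", "isolate": "24-010338-001",
                        "Date": "2024-04-03"}))  # A/Cattle/USA/24-010338-001/2024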

ERROR: Augur version incompatibility detected

Hi, I'm trying to run the quickstart-build for avian-flu, but I get the error below. I'm using nextstrain.cli v6.2.1 and augur v19.2.0. Any ideas on how to fix it?

    ERROR: Augur version incompatibility detected: the JSON results/branch-lengths_h5n1_ha.json was generated by {'program': 'augur', 'version': '19.2.0'}, which is incompatible with the current augur version (21.1.0). We suggest you rerun the pipeline using the current version of augur.
    [Tue Apr 18 21:08:44 2023]
    Error in rule export:
        jobid: 1
        input: results/tree_h5n1_ha.nwk, results/metadata_h5n1_ha.tsv, results/branch-lengths_h5n1_ha.json, results/traits_h5n1_ha.json, results/nt-muts_h5n1_ha.json, results/aa-muts_h5n1_ha.json, results/cleavage-site_h5n1_ha.json, results/cleavage-site-sequences_h5n1_ha.json, config/auspice_config_h5n1.json
        output: auspice/flu_avian_h5n1_ha.json
        shell:
            augur export v2 --tree results/tree_h5n1_ha.nwk --metadata results/metadata_h5n1_ha.tsv --node-data results/branch-lengths_h5n1_ha.json results/traits_h5n1_ha.json results/nt-muts_h5n1_ha.json results/aa-muts_h5n1_ha.json results/cleavage-site_h5n1_ha.json results/cleavage-site-sequences_h5n1_ha.json --auspice-config config/auspice_config_h5n1.json --include-root-sequence --output auspice/flu_avian_h5n1_ha.json
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    Shutting down, this might take some time.
    Exiting because a job execution failed. Look above for error message
    Complete log: .snakemake/log/2023-04-18T210843.145552.snakemake.log

--fill-gaps eventually adds a furin cleavage site to everyone

Hi, in augur align you are calling --fill-gaps, which simply does

    _seq = _seq.replace('-', 'N')

so deletions become Ns, which downstream ancestral reconstruction treats as missing data and fills back in; as a result, in the Nextstrain tree everyone gets a furin cleavage site (FCS).

I don't have access to filtered_H5Nx_HA.fasta or metadata-with-clade_H5Nx_HA.tsv, so I can't propose a concrete modification of the Snakemake workflow (via a Python script) that preserves the FCS indels, but if I had them I would be happy to give it a try.
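
To sketch the idea anyway (with made-up file names and FCS coordinates, since I can't test against the real data), one could keep an alignment produced without --fill-gaps and copy its real deletions back over the Ns in the cleavage-site region:

    # Sketch only: restore genuine deletions that --fill-gaps rewrote as N.
    # File names and the FCS coordinates below are placeholders.
    from Bio import SeqIO
    from Bio.Seq import Seq

    FCS_START, FCS_END = 1013, 1064  # hypothetical HA alignment coordinates

    raw = {rec.id: str(rec.seq) for rec in SeqIO.parse("aligned_raw.fasta", "fasta")}

    records = []
    for rec in SeqIO.parse("aligned_filled.fasta", "fasta"):
        seq = list(str(rec.seq))
        for i in range(FCS_START, FCS_END):
            # Put real deletions back where --fill-gaps wrote N.
            if raw[rec.id][i] == "-" and seq[i] == "N":
                seq[i] = "-"
        rec.seq = Seq("".join(seq))
        records.append(rec)

    SeqIO.write(records, "aligned_with_fcs_indels.fasta", "fasta")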

The only experiment I did was to download a few thousand H5Nx sequences from GenBank, label those with a deletion at the FCS as LPAI, remove the FCS region from the alignment, and run MEGA; the LPAI sequences came out quite clustered together.

Ingest NCBI GenBank data for H5N1 outbreak

Jotting down concrete steps for ingesting NCBI GenBank data for H5N1 outbreak based on internal team discussion and GDoc notes.

Original plan that was discussed:

  1. Fetch GenBank accessions from NCBI Virus Download (example URL) -- this provides additional filter options for collection date and genotype that are not available through the datasets command, and currently shows 2,561 records on NCBI Virus.
  2. Use datasets to download the dataset for those accessions:

    datasets download virus genome accession --inputfile accessions.txt

  3. Use Bio.Entrez.efetch to fetch the GenBank records and parse out additional fields that are not included in the dataset metadata: strain, serotype, segment (see the sketch after this list).
  4. Merge the dataset metadata and the Entrez metadata into a single NDJSON that can then go through the usual curation pipeline.
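
For (3), a sketch of the Entrez fetch (the GBSeq_*/GBQualifier_* keys follow Biopython's parsed GenBank XML, but whether strain, serotype, and segment reliably appear as source-feature qualifiers is an assumption to check against real records):

    from Bio import Entrez

    Entrez.email = "you@example.org"  # placeholder; NCBI requires a contact email

    def fetch_extra_fields(accessions):
        handle = Entrez.efetch(db="nuccore", id=",".join(accessions),
                               rettype="gb", retmode="xml")
        for record in Entrez.read(handle):
            quals = {}
            for feature in record.get("GBSeq_feature-table", []):
                if feature["GBFeature_key"] == "source":
                    for q in feature.get("GBFeature_quals", []):
                        quals[q["GBQualifier_name"]] = q.get("GBQualifier_value")
            yield {
                "accession": record["GBSeq_primary-accession"],
                "strain": quals.get("strain"),
                "serotype": quals.get("serotype"),
                "segment": quals.get("segment"),
            }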

Detours

  • I waffled a little on whether we needed (3), because I realized that the GenBank title is included in the FASTA headers of the downloaded sequences! We could potentially parse out strain, serotype, and segment from titles such as

        PP766980.1 Influenza A virus (A/Canada goose/North Carolina/W24-90A/2024(H5N1)) clone new segment 7 matrix protein 2 (M2) and matrix protein 1 (M1) genes, complete cds

    However, I'm not sure that all record titles will follow this format, and the GenBank documentation for the Definition/Title field does not make me confident either:

        Brief description of sequence; includes information such as source organism, gene name/protein name, or some description of the sequence's function (if the sequence is non-coding).

    I think I'll stick with the original plan to use Entrez, but it was a nice thought (a rough sketch of the title-parsing idea follows this list).

  • (1) is currently not possible because the download function on NCBI Virus is broken for H5N1 due to a bug (Slack thread).
    I thought we could just download all of the influenza genomes and then filter locally, but I'm hitting the invalid zip archive error when running

        datasets download virus genome taxon 11320

    I'm not hitting the error with some minimal filtering on release date and geo-location, so we can go with this:

        datasets download virus genome taxon 11320 --released-after "04/01/2024" --geo-location "North America"

    This returns 32,262 records. Grepping for H5N in genomic.fna brought this down to 3,275, which is closer to the number on NCBI Virus, though I cannot easily check whether they are the same sequences.
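
As a coda to the first detour above, roughly what the title parsing could have looked like (a sketch; the regexes are fitted to the single example title and would break on titles that deviate from that format):

    import re

    title = ("PP766980.1 Influenza A virus (A/Canada goose/North Carolina/"
             "W24-90A/2024(H5N1)) clone new segment 7 matrix protein 2 (M2) "
             "and matrix protein 1 (M1) genes, complete cds")

    strain_match = re.search(r"\((A/.+?)\((H\d+N\d+)\)\)", title)
    segment_match = re.search(r"segment (\d+)", title)

    if strain_match and segment_match:
        print(strain_match.group(1), strain_match.group(2), segment_match.group(1))
        # -> A/Canada goose/North Carolina/W24-90A/2024 H5N1 7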

Annotate H5 clades through a node data JSON file instead of modifying metadata

Context

In conversation about #22, @trvrb noted:

I don't think this is part of the scope of this PR, but it would seem cleaner to me for this clade-labeling/add-clades.py script to instead just create a node data JSON with h5_label_clade rather than messing with the metadata file.

@lmoncla and I just had some confusion from different rules (refine, traits) asking for the metadata TSV vs. the metadata-with-clade TSV, though the metadata_by_wildcards function mostly solves this issue.

@jameshadfield noted that:

The only reason I can see to not do this is if we use this data in the filtering step. But we don't.

Description

We should modify scripts/add-clades.py to create a node data JSON file as output and update the workflow to make the resulting output an input to the export rule instead of a step that modifies the metadata.
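
For reference, a minimal sketch of the node-data JSON the script could emit instead (following augur's node-data convention of a top-level "nodes" map keyed by node name; the strain name and output path are made up):

    import json

    node_data = {
        "nodes": {
            "A/duck/Example/1/2024": {"h5_label_clade": "2.3.4.4b"},
            # ...one entry per labeled node...
        }
    }

    with open("results/h5-clades_h5n1_ha.json", "w") as fh:
        json.dump(node_data, fh, indent=2)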

ingest: Join NCBI and Andersen lab data

Follow up to #28 + #40

The Andersen lab will continue to update consensus sequences from SRA runs. These eventually get uploaded to GenBank by the original submitters and become available through the NCBI data. However, there is a delay before the data appears in NCBI, so we can merge the Andersen lab data with the NCBI data to get the latest available sequences.

Both sets of data have the SRA accession, so we can use that to dedup the data.
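
A minimal merge sketch, assuming the SRA run accession lands in a shared metadata column (sra_accession is a placeholder name) and preferring the NCBI record when both sources have the same run:

    import pandas as pd

    ncbi = pd.read_csv("ncbi/metadata.tsv", sep="\t")
    andersen = pd.read_csv("andersen-lab/metadata.tsv", sep="\t")

    # NCBI rows come first, so they win the dedup when both have a run.
    merged = pd.concat([ncbi, andersen], ignore_index=True)
    merged = merged.drop_duplicates(subset="sra_accession", keep="first")
    merged.to_csv("metadata.tsv", sep="\t", index=False)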

Automate ingest via GH Actions

Follow up to #40. Blocked by #41

We can remove a lot of manual work by adding GH Action workflows to run the ingest workflows.

The GH Action for fauna ingest will most likely need to be manually triggered since we still need to upload data to fauna manually. However, the GH Action for NCBI + Andersen lab ingest can be scheduled to run automatically.

Both GH Action workflows should use the shared pathogen-repo-build workflow. This will depend on nextstrain/infra#11 for using OIDC tokens in the GH Action workflow.


I'm not as familiar with the phylogenetic workflows in this repo, so it's unclear to me how much we can automate them via GH Actions. Is there a lot of manual clean up that needs to be done for builds or is it as straightforward as running nextstrain build and then nextstrain deploy?

Corrupted sequences in Fauna

Fauna output has (what I presume is) corrupted data for certain sequences, such as:

    >A/egret/Korea/22WC603/2023|avian_flu|EPIEPI2738274|2023-03-06|japan_korea|south_korea|south_korea|south_korea|avian|?|h5n1|national_institute_of_wildlife_disease_control_and_prevention(niwdc)|national_institute_of_wildlife_disease_control_and_prevention|?|?|2.3.4.4b|?
    aegretkoreawcpbatggatgtcaatccgactttacttttcttaaaagtgccagcgcaaaatgccataagtacc...

(Notice the start of the nuc sequence includes parts of the strain name)

This data raises no errors during the parse, filter, align steps but IQ-TREE will crash with a warning:

    ERROR: Sequence A_DELIM-BZEZHVQHONIWPWOQAAAV_egret_DELIM-BZEZHVQHONIWPWOQAAAV_Korea_DELIM-BZEZHVQHONIWPWOQAAAV_22WC603_DELIM-BZEZHVQHONIWPWOQAAAV_2023 has invalid character E at site 17
    ...

I observed this while running builds on AWS to test #11, and I can reproduce it locally as well (it's somewhat stochastic, as it depends on whether the sequence makes it through filtering). I think this may be the only such strain, but it is corrupted for both pb1 and pb2.

I suggest we add it to the exclude list, unless others have more knowledge here?
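
To check whether other records are affected, a quick scan along these lines (a sketch; the input path is illustrative) would flag sequences with non-IUPAC characters before they ever reach IQ-TREE:

    from Bio import SeqIO

    VALID = set("ACGTURYSWKMBDHVN-.")  # IUPAC nucleotide codes plus gap characters

    for rec in SeqIO.parse("results/sequences_h5n1_pb1.fasta", "fasta"):
        bad = set(str(rec.seq).upper()) - VALID
        if bad:
            print(rec.id, "has invalid characters:", "".join(sorted(bad)))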
