nextstrain / avian-flu
Nextstrain build for avian influenza viruses
Home Page: http://nextstrain.org/avian-flu
Follow up to #40
I'm planning to pull out the S3 related rules from upload_from_fauna and make them data source agnostic so that they can be used for fauna, ncbi, and andersen-lab ingest workflows.
I'm planning on uploading to the usual public S3 bucket since the NCBI data is public.
Both the NCBI data and the Andersen lab data are currently specific to H5N1, so I'm planning to use the prefix s3://nextstrain-data/files/workflows/avian-flu/h5n1/ for all of them:
data source | type | S3 URL
---|---|---
NCBI | metadata | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/metadata.tsv.zst
NCBI | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/ha/sequences.fasta.zst
NCBI | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/mp/sequences.fasta.zst
NCBI | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/na/sequences.fasta.zst
NCBI | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/np/sequences.fasta.zst
NCBI | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/ns/sequences.fasta.zst
NCBI | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/pa/sequences.fasta.zst
NCBI | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/pb1/sequences.fasta.zst
NCBI | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi/pb2/sequences.fasta.zst
Andersen Lab | metadata | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/metadata.tsv.zst
Andersen Lab | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/ha/sequences.fasta.zst
Andersen Lab | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/mp/sequences.fasta.zst
Andersen Lab | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/na/sequences.fasta.zst
Andersen Lab | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/np/sequences.fasta.zst
Andersen Lab | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/ns/sequences.fasta.zst
Andersen Lab | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/pa/sequences.fasta.zst
Andersen Lab | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/pb1/sequences.fasta.zst
Andersen Lab | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab/pb2/sequences.fasta.zst
merged | metadata | s3://nextstrain-data/files/workflows/avian-flu/h5n1/metadata.tsv.zst
merged | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ha/sequences.fasta.zst
merged | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/mp/sequences.fasta.zst
merged | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/na/sequences.fasta.zst
merged | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/np/sequences.fasta.zst
merged | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/ns/sequences.fasta.zst
merged | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/pa/sequences.fasta.zst
merged | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/pb1/sequences.fasta.zst
merged | sequences | s3://nextstrain-data/files/workflows/avian-flu/h5n1/pb2/sequences.fasta.zst
Note: The "merged" files will be blocked by #42, but I at least wanted to get the planned S3 URLs out here in case anyone disagrees with the paths.
Jotting down the outline of ingesting the metadata and sequences from https://github.com/andersen-lab/avian-influenza with `augur curate`:
Nextstrain field | Andersen Lab source
---|---
strain | A/\<Host\>/USA/\<isolate\>/\<year\> (\<year\> parsed from \<Date\>)
virus | avian_flu
isolate_id | \<Run\> (SRA accession)
date | \<Date\> (any missing or ?-marked date becomes 2024-XX-XX)
region | North America
country | USA
division | \<US State\>
location | \<US State\>
host | \<Host\> (binned into Avian, Cattle, or Nonhuman Mammal)
domestic_status | ?
subtype | h5n1
originating_lab | \<Center Name\>
submitting_lab | \<Center Name\>
authors | ?
PMID | ?
gisaid_clade | ?
h5_clade | ?
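A minimal sketch of the field mapping above applied to one record. This is not the actual curate config: the input field names follow the table (`Run`, `Date`, `Host`, `US State`, `Center Name`), but the host-binning sets and the `isolate` parameter are illustrative assumptions.

```python
# Sketch of the Nextstrain <- Andersen Lab field mapping for one record.
# The host bins and the externally supplied isolate identifier are assumptions.

def transform(record: dict, isolate: str) -> dict:
    """Map an Andersen-lab SRA record to the Nextstrain metadata columns."""
    date = record["Date"]
    # Any missing or "?"-marked date is treated as an uncertain 2024 date.
    if not date or "?" in date:
        date = "2024-XX-XX"
    year = date.split("-")[0]

    host = record["Host"]
    # Bin hosts into the three categories used in the table (assumed sets).
    if host.lower() in {"chicken", "duck", "goose", "turkey"}:
        binned_host = "Avian"
    elif host.lower() in {"cattle", "cow"}:
        binned_host = "Cattle"
    else:
        binned_host = "Nonhuman Mammal"

    return {
        "strain": f"A/{host}/USA/{isolate}/{year}",
        "virus": "avian_flu",
        "isolate_id": record["Run"],  # SRA accession
        "date": date,
        "region": "North America",
        "country": "USA",
        "division": record["US State"],
        "location": record["US State"],
        "host": binned_host,
        "subtype": "h5n1",
        "originating_lab": record["Center Name"],
        "submitting_lab": record["Center Name"],
    }
```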
Hi, I'm trying to run the quickstart-build for avian-flu, but I got this error. I'm using nextstrain.cli v6.2.1 and augur v19.2.0. Any idea how to fix it?
```
ERROR: Augur version incompatibility detected: the JSON results/branch-lengths_h5n1_ha.json was generated by {'program': 'augur', 'version': '19.2.0'}, which is incompatible with the current augur version (21.1.0). We suggest you rerun the pipeline using the current version of augur.
[Tue Apr 18 21:08:44 2023]
Error in rule export:
    jobid: 1
    input: results/tree_h5n1_ha.nwk, results/metadata_h5n1_ha.tsv, results/branch-lengths_h5n1_ha.json, results/traits_h5n1_ha.json, results/nt-muts_h5n1_ha.json, results/aa-muts_h5n1_ha.json, results/cleavage-site_h5n1_ha.json, results/cleavage-site-sequences_h5n1_ha.json, config/auspice_config_h5n1.json
    output: auspice/flu_avian_h5n1_ha.json
    shell:
        augur export v2 --tree results/tree_h5n1_ha.nwk --metadata results/metadata_h5n1_ha.tsv --node-data results/branch-lengths_h5n1_ha.json results/traits_h5n1_ha.json results/nt-muts_h5n1_ha.json results/aa-muts_h5n1_ha.json results/cleavage-site_h5n1_ha.json results/cleavage-site-sequences_h5n1_ha.json --auspice-config config/auspice_config_h5n1.json --include-root-sequence --output auspice/flu_avian_h5n1_ha.json
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-04-18T210843.145552.snakemake.log
```
Hi, in `augur align` you are calling `--fill-gaps`, which effectively does

`_seq = _seq.replace('-', 'N')`

so that in the Nextstrain tree every sequence gets an FCS. I don't have access to filtered_H5Nx_HA.fasta and metadata-with-clade_H5Nx_HA.tsv, so I can't propose a modification of the Snakemake workflow with a Python script that leaves the FCS indels in place, but if I had them I would be happy to make some attempts.

The only experiment I did was to download a few thousand H5Nx sequences from GenBank, label those with a deletion at the FCS as LPAI, remove the FCS region from the alignment, and run MEGA; the LPAI sequences clustered quite tightly together.
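One way the proposal could look as code: run the alignment without `--fill-gaps`, then fill gaps with N everywhere except the FCS region, so deletions at the cleavage site survive into the tree. This is only a sketch, and the FCS coordinates below are placeholders, not the real positions for any particular HA reference.

```python
# Sketch: fill alignment gaps with N everywhere except the FCS region,
# so LPAI deletions at the cleavage site are preserved.
# FCS_START/FCS_END are hypothetical 0-based alignment coordinates --
# the real ones depend on the HA reference used.
FCS_START, FCS_END = 1013, 1061

def fill_gaps_outside_fcs(aligned_seq: str) -> str:
    before = aligned_seq[:FCS_START].replace("-", "N")
    fcs = aligned_seq[FCS_START:FCS_END]          # keep indels here
    after = aligned_seq[FCS_END:].replace("-", "N")
    return before + fcs + after
```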
While working on #17, I noticed this repo's Snakefile hasn't yet adopted recommended practices in our Snakemake style guide. Probably not a priority right now, but would be good to adopt at some point.
Jotting down concrete steps for ingesting NCBI GenBank data for H5N1 outbreak based on internal team discussion and GDoc notes.
The plan:

1. Download the accession list from NCBI Virus, which currently shows 2,561 records.
2. Use the `datasets` command to download the dataset for those accessions: `datasets download virus genome accession --inputfile accessions.txt`
3. Use `Bio.Entrez.efetch` to fetch the GenBank records and parse out additional fields that are not included in the dataset metadata: strain, serotype, and segment.

I waffled a little on whether we needed (3), because I realized that the GenBank `Title` is included in the FASTA headers of the downloaded sequences! We could potentially parse out strain, serotype, and segment from titles such as

`PP766980.1 Influenza A virus (A/Canada goose/North Carolina/W24-90A/2024(H5N1)) clone new segment 7 matrix protein 2 (M2) and matrix protein 1 (M1) genes, complete cds`

However, I'm not sure that all record titles will follow this format. The GenBank docs for the `Definition`/`Title` field don't make me confident either:

> Brief description of sequence; includes information such as source organism, gene name/protein name, or some description of the sequence's function (if the sequence is non-coding).

I think I'll stick with the original plan to use Entrez, but it was a nice thought.
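For what it's worth, here is roughly what that title parsing would look like. This is an illustrative regex only: it works on titles in the exact format shown above, which, as noted, is not guaranteed across all records.

```python
import re

# Illustrative regex for pulling strain, serotype, and segment out of a
# GenBank title. It only handles titles matching the format shown above,
# which is not guaranteed -- hence sticking with Entrez.
TITLE_RE = re.compile(
    r"Influenza A virus \((?P<strain>A/[^()]+)\((?P<serotype>H\d+N\d+)\)\)"
    r".*?segment (?P<segment>\d+)"
)

def parse_title(title):
    """Return {'strain', 'serotype', 'segment'} or None if no match."""
    m = TITLE_RE.search(title)
    return m.groupdict() if m else None
```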
(1) is currently not possible because the download function from NCBI Virus is broken due to a bug for H5N1 (Slack thread).
I thought we could just download all of the influenza genomes then filter locally, but I'm hitting the invalid zip archive error when running
datasets download virus genome taxon 11320
I'm not hitting the error with some minimum filtering on release date and geo-location, so we can go with this:
datasets download virus genome taxon 11320 --released-after "04/01/2024" --geo-location "North America"
This returns 32,262 records. Grepping for H5N in the genomic.fna file brought this down to 3,275, which is closer to the number on NCBI Virus, though I cannot easily check whether they are the same sequences.
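For reference, that quick filter was just a grep over the FASTA; counting only header lines avoids spurious matches inside the nucleotide sequences. A self-contained sketch (the genomic.fna here is a tiny stand-in created inline for illustration):

```shell
# Stand-in for the downloaded genomic.fna, created inline for illustration.
printf '>PP766980.1 Influenza A virus (A/Canada goose/North Carolina/W24-90A/2024(H5N1)) segment 7\nATGAGT\n>XX000001.1 Influenza A virus (A/duck/Texas/2024(H3N2)) segment 4\nATGAGT\n' > genomic.fna

# Count H5N* records by scanning header lines only.
grep '^>' genomic.fna | grep -c 'H5N'
```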
In conversation about #22, @trvrb noted:
> I don't think this is part of the scope of this PR, but it would seem cleaner to me for this clade-labeling/add-clades.py script to instead just create a node data JSON with h5_label_clade rather than messing with the metadata file.
@lmoncla and I just had some confusion from different rules (refine, traits) asking for the metadata TSV vs the metadata-with-clade TSV. Though the function metadata_by_wildcards mostly solves this issue.
@jameshadfield noted that:
> The only reason I can see to not do this is if we use this data in the filtering step. But we don't.
We should modify scripts/add-clades.py to create a node data JSON file as output and update the workflow to make the resulting output an input to the export rule, instead of a step that modifies the metadata.
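Concretely, the script would emit the `{"nodes": {...}}` shape that `augur export v2 --node-data` consumes. A minimal sketch (the key name `h5_label_clade` follows the suggestion above; the helper name is ours):

```python
import json

# Sketch: write clade assignments as an augur node-data JSON instead of
# rewriting the metadata TSV. The {"nodes": {...}} structure is what
# `augur export v2 --node-data` consumes.
def write_node_data(clades: dict, path: str) -> None:
    """clades maps strain/node name -> clade label."""
    node_data = {
        "nodes": {name: {"h5_label_clade": clade} for name, clade in clades.items()}
    }
    with open(path, "w") as fh:
        json.dump(node_data, fh, indent=2)
```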
There are many clades for H3N2; who can tell me who defined them?
The Andersen lab will continue to update consensus sequences from SRA runs. These eventually get uploaded to GenBank by the original submitters and become available through the NCBI data. However, there is a delay in the data through NCBI, so we can merge the Andersen lab data with the NCBI data to get the latest available sequences.
Both sets of data have the SRA accession, so we can use that to dedup the data.
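A minimal sketch of that merge, preferring the NCBI copy of a record when the same SRA accession appears in both sources. The column name `sra_accession` and the list-of-dicts representation are assumptions for illustration:

```python
# Sketch: merge NCBI and Andersen-lab metadata, deduplicating on the SRA
# accession and preferring the NCBI record when both sources have it.
# The column name "sra_accession" is an assumption.
def merge_dedup(ncbi: list, andersen: list) -> list:
    seen = {row["sra_accession"] for row in ncbi if row.get("sra_accession")}
    merged = list(ncbi)
    merged.extend(row for row in andersen if row["sra_accession"] not in seen)
    return merged
```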
Follow up to #40. Blocked by #41
We can remove a lot of manual work by adding GH Action workflows to run these ingest workflows.
The GH Action for fauna ingest will most likely need to be manually triggered since we still need to upload data to fauna manually. However, the GH Action for NCBI + Andersen lab ingest can be scheduled to run automatically.
Both GH Action workflows should use the shared pathogen-repo-build workflow. This will depend on nextstrain/infra#11 for using OIDC tokens in the GH Action workflow.
I'm not as familiar with the phylogenetic workflows in this repo, so it's unclear to me how much we can automate them via GH Actions. Is there a lot of manual clean up that needs to be done for builds, or is it as straightforward as running `nextstrain build` and then `nextstrain deploy`?
Fauna output has (what I presume is) corrupted data for certain sequences such as:
>A/egret/Korea/22WC603/2023|avian_flu|EPIEPI2738274|2023-03-06|japan_korea|south_korea|south_korea|south_korea|avian|?|h5n1|national_institute_of_wildlife_disease_control_and_prevention(niwdc)|national_institute_of_wildlife_disease_control_and_prevention|?|?|2.3.4.4b|?
aegretkoreawcpbatggatgtcaatccgactttacttttcttaaaagtgccagcgcaaaatgccataagtacc...
(Notice the start of the nuc sequence includes parts of the strain name)
This data raises no errors during the `parse`, `filter`, and `align` steps, but IQ-TREE will crash with a warning:
ERROR: Sequence A_DELIM-BZEZHVQHONIWPWOQAAAV_egret_DELIM-BZEZHVQHONIWPWOQAAAV_Korea_DELIM-BZEZHVQHONIWPWOQAAAV_22WC603_DELIM-BZEZHVQHONIWPWOQAAAV_2023 has invalid character E at site 17
...
I observed this while running builds on AWS to test #11, and I can reproduce it locally as well (it's somewhat stochastic, as it depends on whether the sequence makes it through filtering). I think this may be the only such strain, but it is corrupted for pb1 and pb2 as well.
I suggest we add it to the exclude list, unless others have more knowledge here?
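Beyond excluding this one strain, corruption like this could be caught at ingest time rather than deep inside IQ-TREE by flagging characters outside the IUPAC nucleotide alphabet. A hedged sketch (the helper name and alphabet handling are ours, not an existing workflow step):

```python
# Sketch: flag sequence characters outside the IUPAC nucleotide alphabet
# (plus gap), so corrupted records like the one above fail fast at ingest
# instead of crashing IQ-TREE later.
IUPAC_NUC = set("ACGTURYSWKMBDHVN-")

def invalid_sites(seq: str) -> list:
    """Return 1-based (position, character) pairs for non-IUPAC characters."""
    return [(i, c) for i, c in enumerate(seq.upper(), start=1) if c not in IUPAC_NUC]
```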
Follow up to #40
With the recent addition of the community H5 Nextclade datasets in nextstrain/nextclade_data#196, it should now be possible to run Nextclade as part of ingest to assign clades to the H5 sequences.
Maybe this can replace the current manual clade labeling process with the `clade-labeling` scripts?