cdcgov / seqsender

Automated Pipeline to Generate FTP Files and Manage Submission of Sequence Data to Public Repositories

Home Page: https://cdcgov.github.io/seqsender/

License: Apache License 2.0

Python 11.60% Dockerfile 0.01% CSS 0.01% HTML 88.39%

genbank gisaid ncbi-biosamples ncbi-genbank ncbi-sra ncbi-submission bioinformatics-pipeline biosample gisaid-format gisaid-upload

seqsender's Introduction

Public Database Submission Pipeline

Beta Version: 1.2.0. This pipeline is currently in Beta testing, and issues could appear during submission. Please use it at your own risk. Feedback and suggestions are welcome!

General Disclaimer: This repository was created for use by CDC programs to collaborate on public health related projects in support of the CDC mission. GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.

Documentation

Overview

seqsender is a Python program that automates generating the necessary submission files and batch uploading them to NCBI archives (such as BioSample, SRA, and GenBank) and GISAID databases (e.g. EpiFlu, EpiCoV, EpiPox, EpiArbo). Presently, the pipeline can upload Influenza A Virus (FLU), SARS-CoV-2 (COV), Monkeypox (POX), Arbovirus (ARBO), and a wide variety of other organisms. If you’d like seqsender to support your virus, create an issue.

Contacts

Role	Contact
Creator	Dakota Howard, Reina Chau
Maintainer	Dakota Howard
Back-Up	Reina Chau, Brian Lee

Code Attributions

Dakota Howard and Reina Chau wrote the majority of the code base, with input and testing from colleagues.

Public Domain Standard Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

License Standard Notice

The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.

The source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

The source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY, without even the implied warranty of MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.

You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html

The source code forked from other open source projects will inherit its license.

Privacy Standard Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC’s privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Standard Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Standard Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.

Additional Standard Notices

Please refer to CDC’s Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.


seqsender's Issues

Use Flu Submission Org

As part of the enhancement to support flu submissions, please use a different group to identify submissions made by flu programs (as opposed to sc2 groups using CDC_OAMD).

Step by step docs with screenshots

Complete step-by-step document on the website with screenshots of how to prepare files and run SS. The users will mainly be non-CLI, so the instructions need to be overly verbose. Write them for FLU use specifically for now.

Document for setup (define the lab’s configs)
Documents for operating
i. Putting metadata into template excel
ii. SS commands and verifying successful submission.

Default to GISAID sub first and attach EPI_SEQUENCE_ID to GenBank

Set the default behavior when submitting to all repositories to be GISAID -> NCBI, in order to first capture the EPI_SEQUENCE_ID assigned by GISAID and then add it to the GenBank structured comment field, as in:

https://www.ncbi.nlm.nih.gov/nuccore/OP845736.1/

COMMENT     ##FluData-START##
            EPI_ISOLATE_ID   :: EPI_ISL_9631596
            NAME             :: A/Wisconsin/01/2022
            TYPE             :: H3
            Segment_name     :: HA
            HOST_GENDER      :: F
            PASSAGE          :: Original
            LOCATION         :: United States / Wisconsin
            COLLECT_DATE     :: 11-Jan-2022
            SPECIMEN_ID      :: 22VR005083 ORIGINAL
            SENDER_LAB       :: Wisconsin State Laboratory of Hygiene
            SEQLAB_SAMPLE_ID :: 3030725183
            EPI_SEQUENCE_ID  :: EPI1981213
            ##FluData-END##
        
FEATURES             Location/Qualifiers
     source          1..1737
                     /organism="Influenza A virus"
                     /mol_type="viral cRNA"
                     /strain="A/Wisconsin/01/2022"
                     /serotype="H3N2"
                     /host="Homo sapiens"
                     /db_xref="taxon:11320"
                     /segment="4"
                     /country="USA: Wisconsin"
                     /collection_date="11-Jan-2022"
                     /note="passage details:Original"
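A minimal sketch of how identifiers captured from GISAID could be rendered into the structured comment layout shown above. The function name `format_structured_comment` and the chosen subset of fields are illustrative assumptions, not seqsender's implementation, and the output mirrors the flat-file display rather than any NCBI submission file format.

```python
def format_structured_comment(prefix, fields):
    """Render ordered key/value pairs as a structured comment block.

    Hypothetical helper: keys are padded to the longest key so the
    '::' separators line up, as in the GenBank record above.
    """
    width = max(len(key) for key in fields)
    lines = [f"##{prefix}-START##"]
    for key, value in fields.items():
        lines.append(f"{key.ljust(width)} :: {value}")
    lines.append(f"##{prefix}-END##")
    return "\n".join(lines)


# Values copied from the example record; the EPI_* values would come
# from the GISAID submission that runs first.
flu_fields = {
    "EPI_ISOLATE_ID": "EPI_ISL_9631596",
    "NAME": "A/Wisconsin/01/2022",
    "EPI_SEQUENCE_ID": "EPI1981213",
}
comment = format_structured_comment("FluData", flu_fields)
print(comment)
```

Running GISAID first matters here only because the EPI identifiers do not exist until GISAID assigns them.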

can't check status

I'm having another issue, which I suspect may be the result of my limited experience with docker...

When I try and check the status of a submission, I get this

docker exec -it seqsender bash seqsender-kickoff check_submission_status --submission_dir ./ --submission_name sub_1 --organism FLU

Error: Submission name: sub_1 for FLU production-data is not found in the submission log file.

Error: Either a submission has not been made or an entry has been moved.

but the submission log file does exist right where I'm pointing, and contains that submission name

cat submission_log.csv
Submission_Name,Organism,Database,Submission_Position,Submission_Type,Submission_Date,Submission_Status,Submission_Directory,Config_File,Table2asn,GFF_File,Update_Date
sub_1,FLU,BIOSAMPLE,1,Production,2024-04-02,pending;submitted,/data,/data/sub_1/config.yaml,False,,2024-04-02
sub_1,FLU,SRA,1,Production,2024-04-02,pending;submitted,/data,/data/sub_1/config.yaml,False,,2024-04-02
sub_1,FLU,GENBANK,1,Production,2024-04-02,---;---,/data,/data/sub_1/config.yaml,False,,2024-04-02
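When debugging a lookup like this, it can help to filter the log by hand and see which field fails to match. The sketch below uses only the standard library; `find_submissions` is a hypothetical helper and the matching rules (name, organism, submission type) are an assumption, since seqsender's actual lookup logic is not shown in this issue.

```python
import csv
import io

# Log contents copied from the issue above.
LOG_TEXT = (
    "Submission_Name,Organism,Database,Submission_Position,Submission_Type,"
    "Submission_Date,Submission_Status,Submission_Directory,Config_File,"
    "Table2asn,GFF_File,Update_Date\n"
    "sub_1,FLU,BIOSAMPLE,1,Production,2024-04-02,pending;submitted,/data,"
    "/data/sub_1/config.yaml,False,,2024-04-02\n"
    "sub_1,FLU,SRA,1,Production,2024-04-02,pending;submitted,/data,"
    "/data/sub_1/config.yaml,False,,2024-04-02\n"
    "sub_1,FLU,GENBANK,1,Production,2024-04-02,---;---,/data,"
    "/data/sub_1/config.yaml,False,,2024-04-02\n"
)


def find_submissions(log_file, name, organism, submission_type):
    """Return log rows matching name, organism, and submission type.

    Hypothetical helper for manual debugging; comparison ignores case
    and surrounding whitespace for organism and type.
    """
    reader = csv.DictReader(log_file)
    return [
        row
        for row in reader
        if row["Submission_Name"].strip() == name
        and row["Organism"].strip().upper() == organism.upper()
        and row["Submission_Type"].strip().lower() == submission_type.lower()
    ]


matches = find_submissions(io.StringIO(LOG_TEXT), "sub_1", "FLU", "Production")
for row in matches:
    print(row["Database"], row["Submission_Status"], row["Submission_Directory"])
```

With a real file, pass `open("submission_log.csv", newline="")` instead of the in-memory sample. The `Submission_Directory` column is printed because it records the path as seen at submission time.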

FTP error: [Errno 2] No such file or directory: '/test_input/test_fastq_R1.fastq'

I am getting this error when I try to run the test submit command:
python seqsender.py submit --unique_name test_submission --config test_config.yaml --metadata /root/miniconda3/seqsender/test_input/test_metadata.tsv --fasta /root/miniconda3/seqsender/test_input/test_fasta.fasta --test

The output first says:
Processing test_submission.
Processing Files.
Creating GISAID files.
Creating Genbank files.
Creating BioSample files.
Creating SRA files.
test_submission complete.

Submission report exists pulling down.
Submitting to SRA/BioSample.

Followed by the FTP error described above. Does anything appear to be missing? Thank you

Can someone document how to obtain NCBI username and password

Is your feature request related to a problem? Please describe.
The seqsender submission configuration has fields for NCBI username and password. But NCBI accounts are created and logged into via third-party systems (Google etc.). How do we obtain an NCBI username/password pair for submission to e.g. BioSample and SRA?

Describe the solution you'd like
Please provide documentation, or a pointer to documentation, detailing how to obtain an NCBI username/password pair that would enable us to make submissions to BioSample, SRA, and GenBank.

Describe alternatives you've considered
NA

Additional context
We would need the credentials to work for submissions to Biosample and SRA.

xml submissions to NCBI do not require 'org_id'

Is your feature request related to a problem? Please describe.
xml submissions to NCBI do not require the 'org_id' field/attribute defined in *config.yaml. The SRA team confirmed that this number is for internal use and submitters only need to include their center/group name in the xml.

Describe the solution you'd like
Remove org_id from the config and associated parsing. The recommendation from SRA team is to simplify the organization block as:

Describe alternatives you've considered
Using the dummy value 12345 from the template appears to cause no adverse events.


Create Automated Test/Validation Scripts

Creating automated tests for script updates will drastically reduce testing time. Testing is tedious when it depends on databases processing files to determine whether the changes work correctly.

Environment testing:

Automatic docker deployment to GHCR
Master branch now automatically builds and deploys the latest docker image to github container repository.
Automatic docker testing on pull request
Automatic python/mamba versioning testing

Mypy Testing:

All Functions mypy testing
Automated github-action mypy testing

Pydantic Testing:

All functions pydantic testing
Automated github-action pydantic testing

Update readme with doc style info

To help users and potential collaborators, please update the readme to follow the flu doc style and explicitly call out how you would like folks to submit issues, test issues, and commit (e.g., dev branches vs main vs releases).

After your changes, a user or collaborator will be able to understand how you work on and make changes to seqsender, and how to watch for in-progress work, completed work, and new releases.

Shiny app template errors

Loving the new Shiny app for configuration! I caught a couple of issues while I was playing around with it - I'll update this if I find more:

config.yaml

excludes Submission_Position: if GenBank isn't selected, causing the workflow to fail
invalid values for Submission_Position: - uses None, First, Second, workflow expects 1, 2 or empty
(sometimes?) doesn't enable Link_Sample_Between_NCBI_Databases: even though True is selected
- might be worth switching this on by default when SRA and BioSample are selected together - SRA submission will fail otherwise unless the user registered BioSamples previously and manually assigns them in the metadata
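The mismatch in Submission_Position values described above could be bridged with a small normalization step. This is a sketch only: `normalize_submission_position` is a hypothetical helper, and the accepted values (1, 2, or empty) are taken from the issue text rather than from the seqsender source.

```python
def normalize_submission_position(value):
    """Map Shiny-style labels to the values the workflow is reported to expect.

    Returns 1, 2, or "" (empty). Raises ValueError for anything else.
    """
    labels = {"none": "", "first": 1, "second": 2}
    if value is None:
        return ""
    text = str(value).strip()
    if text == "":
        return ""
    if text.lower() in labels:
        return labels[text.lower()]
    if text in ("1", "2"):
        return int(text)
    raise ValueError(f"Unsupported Submission_Position: {value!r}")


for raw in ("None", "First", "Second", 2, ""):
    print(repr(raw), "->", repr(normalize_submission_position(raw)))
```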

SARS-CoV-2.cl.1.0,COV metadata.csv

bs-description is not included - this isn't required for submission, but the workflow fails without it

automatic biosample package validation

Biosample packages can be incorporated into seqsender using the biosample attribute xml. It lists off the requirements for every biosample package and can be automated to regularly collect the most up to date xml to also store locally on github. This will allow users who want to use seqsender to instantly use their desired package without having to adjust the main_config file to support their organisms.

GitHub action to weekly scrape the biosample attribute xml to keep the repo up to date with latest attribute.
Seqsender function to pull down latest biosample attribute from web.
Seqsender biosample update to incorporate biosample package xml in addition to required fields in main config.

User defined date specificity

Is your feature request related to a problem? Please describe.
Hard-coded date formatting at YYYY-MM-DD creates challenges for generalizing to other microbial pathogens, the majority of which must be submitted to BioSample with only YYYY or YYYY-MM to ensure privacy.

Describe the solution you'd like
Consider letting users define their own date specificity, perhaps in the *_config.yaml. That would preserve the current default requirements for SC2 and Flu. A more advanced option would be to allow setting a minimum (or maximum) specificity rather than a fixed requirement for flexibility during submission (e.g. [1] YYYY or YYYY-MM, but not YYYY-MM-DD vs [2] YYYY-MM or YYYY-MM-DD, but not YYYY).
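The minimum/maximum specificity idea could look roughly like the sketch below. The names `date_specificity` and `check_date`, and the idea of passing the bounds as arguments, are assumptions for illustration; no such option exists in the config described here.

```python
import re
from datetime import datetime

# Specificity levels ordered from least to most specific.
LEVELS = ["YYYY", "YYYY-MM", "YYYY-MM-DD"]
STRPTIME_FORMATS = {"YYYY": "%Y", "YYYY-MM": "%Y-%m", "YYYY-MM-DD": "%Y-%m-%d"}


def date_specificity(value):
    """Return 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' for a valid date string."""
    if not re.fullmatch(r"\d{4}(-\d{2}){0,2}", value):
        raise ValueError(f"Unrecognized date format: {value!r}")
    level = LEVELS[value.count("-")]
    # strptime rejects impossible calendar values such as month 13.
    datetime.strptime(value, STRPTIME_FORMATS[level])
    return level


def check_date(value, minimum="YYYY", maximum="YYYY-MM-DD"):
    """True if the date's specificity lies within [minimum, maximum]."""
    level = LEVELS.index(date_specificity(value))
    return LEVELS.index(minimum) <= level <= LEVELS.index(maximum)


# Option [1] from the request: YYYY or YYYY-MM, but not YYYY-MM-DD.
print(check_date("2022-01", maximum="YYYY-MM"))
print(check_date("2022-01-11", maximum="YYYY-MM"))
# Option [2]: YYYY-MM or YYYY-MM-DD, but not YYYY.
print(check_date("2022", minimum="YYYY-MM"))
```

Setting both bounds to the same level reproduces a fixed requirement, so the current YYYY-MM-DD behavior for SC2 and Flu would be a special case.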

Describe alternatives you've considered
Maybe this also gets covered in your solution to #43 but BioSample itself does not impose strict requirements for date specificity and it's generally up to submitters to determine what is appropriate.


Clarify GISAID CLI usage

Clarify in the documentation that users need to go to gisaid.org and download the GISAID CLI-API, where to install it so SeqSender can use it, and how to request their token.
Refactor seqsender's internal code to import the GISAID CLI Python package and use its commands rather than the raw API. This will ensure that changes made by GISAID are inherited by SeqSender more smoothly.

cc: @leebrian @rchau88 @kristinelacek

Errors in script for production submission

Hello, I am trying to do our first production submission and this is the output I am getting.

Traceback (most recent call last):
  File "seqsender.py", line 616, in <module>
    main()
  File "seqsender.py", line 591, in main
    submission_preparation.process_submission(args.unique_name, args.fasta, args.metadata, os.path.join(os.path.dirname(os.path.abspath(__file__)), "config_files", args.config))
  File "/root/miniconda3/seqsender/submission_preparation.py", line 493, in process_submission
    main_df = merge(fasta_file, metadata_file)
  File "/root/miniconda3/seqsender/submission_preparation.py", line 228, in merge
    main_df = fasta.merge(metadata, left_on = "fasta_name_orig", right_on = config_dict["general"]["fasta_sample_name_col"], how = "left")
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/frame.py", line 7963, in merge
    validate=validate,
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 87, in merge
    validate=validate,
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 652, in __init__
    ) = self._get_merge_keys()
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1005, in _get_merge_keys
    right_keys.append(right._get_label_or_level_values(rk))
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/generic.py", line 1563, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'specimen_collector_sample_id'
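The traceback shows the merge failing because the column named by config_dict["general"]["fasta_sample_name_col"] (here 'specimen_collector_sample_id') is not a column of the metadata table. A quick pre-check of the metadata header can surface this before running the pipeline. The sketch below uses only the standard library; `missing_columns` and the sample header are hypothetical, and a tab-delimited file is assumed.

```python
import csv
import io


def missing_columns(metadata_file, required, delimiter="\t"):
    """Return the required column names that are absent from the header row."""
    reader = csv.reader(metadata_file, delimiter=delimiter)
    header = [name.strip() for name in next(reader, [])]
    return [name for name in required if name not in header]


# Illustrative metadata whose sample-name column is labeled differently
# from what the config expects.
sample_metadata = io.StringIO(
    "sample_name\tcollection_date\n"
    "A/Wisconsin/01/2022\t2022-01-11\n"
)
missing = missing_columns(sample_metadata, ["specimen_collector_sample_id"])
print(missing)
```

If the list is non-empty, either the metadata header or the configured column name needs to change so the two agree. Stripping whitespace from the header also catches names that differ only by a trailing space.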

There is no submission directory at /home/user1/FLU_reporting

This may be an extremely simple question, I'm probably overlooking something to do with docker, but I'm trying a test submission via Docker, the command (and response) being

docker exec -it seqsender bash seqsender-kickoff submit
--organism FLU
-bsn
--submission_dir /home/user1/FLU_reporting/
--submission_name test_sub
--config_file config.yaml
--metadata_file metadata.csv
--fasta_file RI-M04353-2024013.fasta
--test

There is no submission directory at /home/user1/FLU_reporting

the path is correct, my submission folder "test_sub" is indeed at /home/user1/FLU_reporting/

I'm at a loss

Issue templates to update

Additional templates

New contributor
Suggest new virus to support

Update existing templates

Bug Report
Feature Request
Maintenance

Info to add to existing templates

Virus information
Instrument information
Database information

Pandera metadata validation

User metadata can be validated with pandera. This allows metadata field requirements to be defined in a schema file, so seqsender can automatically detect issues with user metadata. Pandera is a better alternative to hardcoding metadata field validation into seqsender because a schema can be created for each virus with multiple valid options for each field. The schemas can then be easily expanded to include restrictions for other viruses, or rolled back to relax restrictions.
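To illustrate the idea of schema-driven validation without depending on pandera itself, here is a dependency-free sketch in plain Python. It is a stand-in for the concept, not pandera's API; the schema layout, the field names, and `validate_rows` are all hypothetical.

```python
# Hypothetical per-organism schemas: each field may be required and may
# restrict values to an allowed set.
SCHEMAS = {
    "FLU": {
        "sample_name": {"required": True},
        "host_gender": {"allowed": {"F", "M", "Unknown"}},
    },
}


def validate_rows(rows, schema):
    """Return a list of (row_index, field, message) for every violation."""
    errors = []
    for index, row in enumerate(rows):
        for field, rule in schema.items():
            value = row.get(field, "")
            if rule.get("required") and value == "":
                errors.append((index, field, "missing required value"))
            elif value != "" and "allowed" in rule and value not in rule["allowed"]:
                errors.append((index, field, f"unexpected value {value!r}"))
    return errors


rows = [
    {"sample_name": "A/Wisconsin/01/2022", "host_gender": "F"},
    {"sample_name": "", "host_gender": "female"},
]
for error in validate_rows(rows, SCHEMAS["FLU"]):
    print(error)
```

Keeping the rules in data rather than code is what makes it cheap to add a schema for another virus or to relax a restriction later.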

Pandera metadata schema files:

Functions to add to next version

check-submissions: Allow option to update a single submission instead of updating all submissions in log.
other organism: Allow any organism to be used with the flag "other".
Other is currently added. It doesn't allow GISAID submissions, since it cannot easily be determined which epiCLI to use. The other option is a default generic submission template which allows any organism to be submitted to NCBI.
gisaid: Create GISAID submission as a toggle option to be used with any organism. This will allow automated upload for NCBI but manual submission for GISAID when a CLI option doesn't exist.
In order to support turning off GISAID submissions for other organisms, support has been added for all epiCLIs. This allows any epiCLI to be connected to seqsender and used without issue. New epiCLIs can be easily added by adding their information to the internal metadata config file.
- EpiArbo (Arbovirus)
- EpiPox (Monkeypox)
Table2asn submission validation.
Table2asn submissions are made via email, which prevents seqsender from validating that a submission is correct before submitting it. seqsender can now parse the Table2asn validation file and detect issues, which will then prevent submission and notify the user of what to correct.
User config file validation.
Config files are used to store user info and determine how seqsender processes their submissions. Previous checks only validated that the file loaded correctly as YAML. Now config files are checked against schema files, which can determine whether a user filled out their file incorrectly. If so, seqsender reports an error message directing the user to the incorrect field and telling them what to change it to.

Adding these two features to this update as they are needed to resolve issues with incorporating Enteric BioSample attributes

NCBI & GISAID account creation docs

Will be very helpful to have step-by-step instructions with screenshots in the documentation for creating an account, highlighting which fields will then be needed later in seqsender.

How to correctly add Source Modifiers for GenBank submission?

Hi,

Thanks for creating and maintaining this very useful program.

I am trying to include patient metadata for the NCBI submission part of the process. Similar to gisaid which allows gender, passage etc, NCBI allows source modifiers such as sex.

I am not 100% sure how to include such information for the NCBI part of the metadata in the config file.

genbank_src_metadata:
  column_names:
    isolate: genbank_name
    host: host
    country: location
    isolation-source: isolation_source

Let's say I wanted to include the NCBI source modifier Sex (assuming I have a column in my metadata called gender), would I add the following:

genbank_src_metadata:
  column_names:
  ....
  Sex: gender

Is that correct?
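For reference, one way to read the mapping above is "GenBank source modifier name: metadata column name", which is what the existing entries (isolate: genbank_name, country: location) suggest. The sketch below applies such a mapping to a single metadata row. `apply_column_mapping` and the sample values are hypothetical, and this does not confirm how seqsender itself handles an added Sex entry.

```python
def apply_column_mapping(row, column_names):
    """Build {source_modifier: value} from a metadata row.

    column_names maps modifier name -> metadata column name; columns
    missing from the row are skipped rather than raising.
    """
    return {
        modifier: row[column]
        for modifier, column in column_names.items()
        if column in row
    }


column_names = {
    "isolate": "genbank_name",
    "host": "host",
    "country": "location",
    "isolation-source": "isolation_source",
    "Sex": "gender",
}
metadata_row = {
    "genbank_name": "A/Wisconsin/01/2022",
    "host": "Homo sapiens",
    "location": "USA: Wisconsin",
    "gender": "F",
}
print(apply_column_mapping(metadata_row, column_names))
```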

A related question, for the structured data section eg:

COMMENT     ##Assembly-Data-START##
            Assembly Method       :: CLC Genomics
            Sequencing Technology :: PacBio Sequel II
            ##Assembly-Data-END##

How can I add more information than this?

My main aim is to match all the metadata that is required in gisaid to ncbi submission.

Thanks,

Ammar

Add Influenza submission