seqsender's Introduction

Public Database Submission Pipeline

Beta Version: 1.1.0. This pipeline is currently in Beta testing, and issues could appear during submission. Please use it at your own risk. Feedback and suggestions are welcome!

General Disclaimer: This repository was created for use by CDC programs to collaborate on public health related projects in support of the CDC mission. GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.

Overview

seqsender is a Python program that automates generating the necessary submission files and batch uploading them to NCBI archives (such as BioSample, SRA, and GenBank) and GISAID databases (e.g., EpiFlu and EpiCoV). Presently, the pipeline can upload Influenza A Virus (FLU) and SARS-CoV-2 (COV) data. Its design is flexible enough that support for other organisms can be added in future updates or on request.

Contacts

Role Contact
Creator Dakota Howard, Reina Chau
Maintainer Dakota Howard
Back-Up Reina Chau, Brian Lee

Prerequisites

  • NCBI Submissions

seqsender uses a UI-less Data Submission Protocol to bulk upload submission files (e.g., submission.xml, submission.zip, etc.) to NCBI archives. The submission files are uploaded to the NCBI server via FTP on the command line. Before attempting a submission with seqsender, the submitter needs to:

  1. Have an NCBI account. To sign up, visit the NCBI website.

  2. Create a center account for your institution/lab (required for CDC users and highly recommended for others); see the NCBI Center Account Instructions. Center accounts allow you to perform UI-less submissions as your institution/lab.

  3. Create a submission group in the NCBI Submission Portal (required for CDC users and recommended for others). A group should include all individuals who need access to UI-less submissions through the web interface with your center account. Each member of the group must also have an individual NCBI account, which can be created on the NCBI website.

  4. Refer to this page for the requirements for GenBank submissions via FTP only. This page applies only to COVID and Influenza: NCBI GenBank FTP Submissions. For further questions, contact [email protected] to discuss submission requirements.

  5. Coordinate an NCBI namespace name (spuid_namespace) that will be used with Submitter Provided Unique Identifiers (spuid) in the submission. The combination of spuid_namespace and spuid is used to report back assigned accessions as well as to cross-link objects within a submission. The values of spuid_namespace are up to the submitter, but they must be unique and well-coordinated before making a submission. For more information about these two fields, see BioSample / SRA / GenBank metadata requirements.
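To make the spuid_namespace / spuid pairing concrete, here is a minimal sketch. It assumes the submission XML carries the namespace as an attribute on an SPUID element, which should be confirmed against NCBI's own submission schema. The function names and the "EXAMPLE_LAB" namespace are hypothetical and are not part of seqsender.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def build_spuid_element(spuid: str, spuid_namespace: str) -> ET.Element:
    """Return an <SPUID> element with the namespace stored as an attribute."""
    if not spuid or not spuid_namespace:
        raise ValueError("both spuid and spuid_namespace are required")
    element = ET.Element("SPUID", {"spuid_namespace": spuid_namespace})
    element.text = spuid
    return element


def find_duplicate_spuids(spuids: list[str]) -> list[str]:
    """Return SPUIDs that occur more than once (empty list if all are unique)."""
    seen: set[str] = set()
    duplicates: list[str] = []
    for spuid in spuids:
        if spuid in seen and spuid not in duplicates:
            duplicates.append(spuid)
        seen.add(spuid)
    return duplicates


if __name__ == "__main__":
    # "EXAMPLE_LAB" and "sample_001" are made-up values for illustration only.
    element = build_spuid_element("sample_001", "EXAMPLE_LAB")
    print(ET.tostring(element, encoding="unicode"))
```

Checking for duplicates before building the XML is one simple way to catch identifiers that would otherwise collide within the same namespace.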

  • GISAID Submissions

seqsender uses GISAID’s Command Line Interface (CLI) tools to bulk upload metadata and sequence data to GISAID databases. Presently, the pipeline only supports uploads to the EpiFlu (Influenza A Virus) and EpiCoV (SARS-CoV-2) databases. Before uploading, the submitter needs to:

  1. Have a GISAID account. To sign up, visit the GISAID Platform.

  2. Request a client-ID for the EpiFlu or EpiCoV database in order to use its CLI tool. The CLI uses the client-ID along with the username and password to authenticate with the database before making a submission. To obtain a client-ID, email [email protected]. Important note: submitters who would like to upload a “test” submission first, to familiarize themselves with the submission process before making a real submission, should additionally request a test client-ID.

  3. Download the EpiFlu or EpiCoV CLI from the GISAID platform and store it in the destination of choice before performing a batch upload.

Here is a quick look of where to store the downloaded GISAID CLI package.

Requirement Files

Before performing a batch submission with seqsender, submitters must make sure the required files (such as config.yaml, metadata.csv, sequence.fasta, raw reads, etc.) are already prepared and stored in a submission directory of choice.

  1. To prep for FLU submissions, select one of the databases below to get started:

BioSample
SRA
Genbank
GISAID

  2. To prep for COV submissions, select one of the databases below to get started:

BioSample
SRA
Genbank
GISAID

Quick Start

Code Attributions

Dakota Howard and Reina Chau wrote the majority of the code base, with input and testing from colleagues.

Public Domain Standard Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

License Standard Notice

The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.

The source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

The source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY, without even the implied warranty of MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.

You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html

The source code forked from other open source projects will inherit its license.

Privacy Standard Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC’s privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Standard Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Standard Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.

Additional Standard Notices

Please refer to CDC’s Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.

seqsender's People

Contributors

dthoward96, leebrian, nbx0, rchau88

seqsender's Issues

Pandera metadata validation

User metadata can be validated using pandera validation. This will allow for metadata field requirements based on a schema file. This will allow seqsender to automatically detect issues with user metadata. Pandera is a better alternative than hardcoding metadata field validation into seqsender because a schema can be created for each virus with multiple valid options for each field. This can then be easily expanded to include restrictions for other viruses or to roll back restrictions.

Pandera metadata schema files:

  • FLU
  • COV
  • POX
  • ARBO
  • OTHER
  • No Validation
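The idea of per-organism schema files can be illustrated with a small sketch. This is a dependency-free stand-in written with the standard library, not pandera's API, and the schema contents, column names, and allowed values below are hypothetical placeholders rather than seqsender's real requirements.

```python
# Hypothetical per-organism rules: column name -> set of allowed values,
# or None when any non-empty value is accepted.
SCHEMAS = {
    "FLU": {"organism": {"Influenza A virus"}, "collection_date": None},
    "COV": {
        "organism": {"Severe acute respiratory syndrome coronavirus 2"},
        "collection_date": None,
    },
}


def validate_metadata(rows, schema):
    """Check each metadata row (a dict) against a schema and collect all errors."""
    errors = []
    for row_number, row in enumerate(rows, start=1):
        for column, allowed in schema.items():
            value = (row.get(column) or "").strip()
            if not value:
                errors.append(f"row {row_number}: '{column}' is missing or empty")
            elif allowed is not None and value not in allowed:
                errors.append(
                    f"row {row_number}: '{column}' has unexpected value '{value}'"
                )
    return errors


if __name__ == "__main__":
    rows = [
        {"organism": "Influenza A virus", "collection_date": "2022-01-11"},
        {"organism": "Influenza B virus", "collection_date": ""},
    ]
    for message in validate_metadata(rows, SCHEMAS["FLU"]):
        print(message)
```

Collecting every error instead of stopping at the first one mirrors the goal stated above: the user sees all metadata problems in a single pass. Adding or relaxing a restriction means editing a schema entry rather than the validation code.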

FTP error: [Errno 2] No such file or directory: '/test_input/test_fastq_R1.fastq'

I am getting this error when I try to run the test submit command:
python seqsender.py submit --unique_name test_submission --config test_config.yaml --metadata /root/miniconda3/seqsender/test_input/test_metadata.tsv --fasta /root/miniconda3/seqsender/test_input/test_fasta.fasta --test

The output first says:
Processing test_submission.
Processing Files.
Creating GISAID files.
Creating Genbank files.
Creating BioSample files.
Creating SRA files.
test_submission complete.

Submission report exists pulling down.
Submitting to SRA/BioSample.

Followed by the FTP error described above. Does anything appear to be missing? Thank you
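One way to surface this kind of problem before any FTP connection is opened is to check that every raw read file referenced by the metadata actually exists. The sketch below is an illustration only: the helper names are hypothetical, and resolving relative paths against the metadata file's directory is an assumption, not a description of how seqsender resolves them.

```python
from pathlib import Path


def resolve_read_path(raw_path, metadata_file):
    """Resolve a raw read path; relative paths are taken relative to the metadata file."""
    path = Path(raw_path)
    if not path.is_absolute():
        path = Path(metadata_file).parent / path
    return path


def missing_read_files(raw_paths, metadata_file):
    """Return the resolved paths (as strings) that do not point to an existing file."""
    resolved = (resolve_read_path(raw, metadata_file) for raw in raw_paths)
    return [str(path) for path in resolved if not path.is_file()]


if __name__ == "__main__":
    # Paths below are placeholders for illustration.
    missing = missing_read_files(
        ["test_fastq_R1.fastq", "test_fastq_R2.fastq"],
        "test_input/test_metadata.tsv",
    )
    for path in missing:
        print(f"raw read file not found: {path}")
```

Reporting the fully resolved path makes it easier to see whether a file is truly absent or whether a relative path was resolved against an unexpected directory.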

Clarify GISAID CLI usage

  • Clarify in the documentation that users need to go to gisaid.org to download the CLI-API, where to install it so that SeqSender can use it, and how to request their token.

  • Refactor seqsender internal code to import their CLI python pkg and use their commands rather than the raw API. This will ensure that changes made by GISAID are inherited by SeqSender more smoothly.

cc: @leebrian @rchau88 @kristinelacek

Default to GISAID sub first and attach EPI_SEQUENCE_ID to GenBank

Set up the default behavior of submitting to all repos to be GISAID -> NCBI, in order to first capture the EPI_SEQUENCE_ID assigned by GISAID and then add it to the GenBank Structured Comment field, like:

https://www.ncbi.nlm.nih.gov/nuccore/OP845736.1/

COMMENT     ##FluData-START##
            EPI_ISOLATE_ID   :: EPI_ISL_9631596
            NAME             :: A/Wisconsin/01/2022
            TYPE             :: H3
            Segment_name     :: HA
            HOST_GENDER      :: F
            PASSAGE          :: Original
            LOCATION         :: United States / Wisconsin
            COLLECT_DATE     :: 11-Jan-2022
            SPECIMEN_ID      :: 22VR005083 ORIGINAL
            SENDER_LAB       :: Wisconsin State Laboratory of Hygiene
            SEQLAB_SAMPLE_ID :: 3030725183
            EPI_SEQUENCE_ID  :: EPI1981213
            ##FluData-END##
        
FEATURES             Location/Qualifiers
     source          1..1737
                     /organism="Influenza A virus"
                     /mol_type="viral cRNA"
                     /strain="A/Wisconsin/01/2022"
                     /serotype="H3N2"
                     /host="Homo sapiens"
                     /db_xref="taxon:11320"
                     /segment="4"
                     /country="USA: Wisconsin"
                     /collection_date="11-Jan-2022"
                     /note="passage details:Original"
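The GISAID-first ordering can be sketched as two small steps: merge the accessions returned by GISAID into each sample's comment fields, then render the block. The function names and input shapes below are hypothetical, and the renderer only reproduces the display layout of the record above; it is not the file format GenBank expects at submission time.

```python
def attach_gisaid_ids(samples, gisaid_ids):
    """Return copies of the per-sample comment fields with EPI_SEQUENCE_ID added.

    samples: {sample_name: {field: value}}
    gisaid_ids: {sample_name: accession assigned by GISAID}
    Samples without a GISAID accession are returned unchanged.
    """
    merged = {}
    for name, fields in samples.items():
        updated = dict(fields)  # copy so the caller's data is not modified
        if name in gisaid_ids:
            updated["EPI_SEQUENCE_ID"] = gisaid_ids[name]
        merged[name] = updated
    return merged


def render_structured_comment(prefix, fields):
    """Render fields in the aligned 'KEY :: value' layout shown in the record."""
    width = max((len(key) for key in fields), default=0)
    indent = " " * 12
    lines = [f"COMMENT     ##{prefix}-START##"]
    lines += [f"{indent}{key.ljust(width)} :: {value}" for key, value in fields.items()]
    lines.append(f"{indent}##{prefix}-END##")
    return "\n".join(lines)


if __name__ == "__main__":
    samples = {"A/Wisconsin/01/2022": {"NAME": "A/Wisconsin/01/2022", "TYPE": "H3"}}
    merged = attach_gisaid_ids(samples, {"A/Wisconsin/01/2022": "EPI1981213"})
    print(render_structured_comment("FluData", merged["A/Wisconsin/01/2022"]))
```

Because the GenBank fields depend on values that only exist after the GISAID upload succeeds, submitting to GISAID first is what makes the merge step possible.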

Functions to add to next version

  • check-submissions: Allow option to update a single submission instead of updating all submissions in log.
  • other organism: Allow any organism to be used with the flag "other" .
    Other is currently added. It doesn't allow for GISAID submissions since it cannot be determined easily which epiCLI to use. The other option is a default generic submission template which will allow for any organism to be submitted to NCBI.
  • gisaid: Create gisaid submission as a toggle option to be used with any organism. This will allow automated upload for NCBI but manual submission for gisaid when a CLI option doesn't exist.
    In order to support turning off GISAID submissions for other organisms, support has been added for all epiCLIs. This allows any epiCLI to be connected to seqsender and used without issue. New epiCLIs can be added easily by adding their information to the internal metadata config file.
    • EpiArbo (Arbovirus)
    • EpiPox (Monkeypox)
  • Table2asn submission validation.
    Table2asn submissions are made via email, which prevents seqsender from validating that a submission is correct before submitting it. Using the Table2asn validation file, seqsender can now parse this file and detect issues, which will then prevent submission and notify the user of what to correct.
  • User config file validation.
    Config files are used to store user info and determine how seqsender processes their submissions. Current checks only validated that it loaded correctly as a yaml file. Now config files are checked against schema files which can determine if a user incorrectly filled out their submission file. If the user did incorrectly fill out their submission file it will now report an error message directing the user to the incorrect field and notifying them of what to change it to.

How to correctly add Source Modifiers for GenBank submission?

Hi,

Thanks for creating and maintaining this very useful program.

I am trying to include patient metadata for the NCBI submission part of the process. Similar to gisaid which allows gender, passage etc, NCBI allows source modifiers such as sex.

I am not 100% sure how to include such information for the NCBI part of the metadata in the config file.

genbank_src_metadata:
  column_names:
    isolate: genbank_name
    host: host
    country: location
    isolation-source: isolation_source

Let's say I wanted to include the NCBI source modifier Sex (assuming I have a column in my metadata called gender), would I add the following:

genbank_src_metadata:
  column_names:
  ....
  Sex: gender

Is that correct?
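For illustration only, here is how a modifier-to-column mapping of this shape could be applied to build a tab-delimited source modifier table. This sketch does not confirm how seqsender itself treats a key such as Sex. The function name, the "Sequence_ID" header, and the sample values are assumptions.

```python
import csv
import io


def build_source_table(rows, column_names, id_column):
    """Build a tab-delimited table whose headers are the source modifier names.

    rows: metadata rows as dicts
    column_names: {source modifier name: metadata column name}
    id_column: metadata column holding the sequence identifier
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Sequence_ID"] + list(column_names))
    for row in rows:
        values = [row.get(source, "") for source in column_names.values()]
        writer.writerow([row[id_column]] + values)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical metadata row and mapping, mirroring the config snippet above.
    rows = [{"sample": "s1", "host": "Homo sapiens", "gender": "F"}]
    mapping = {"host": "host", "Sex": "gender"}
    print(build_source_table(rows, mapping, id_column="sample"), end="")
```

In this sketch the mapping key becomes the output header and the mapping value names the metadata column to read, which is the same direction as isolate: genbank_name in the config above.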

A related question, for the structured data section eg:

COMMENT     ##Assembly-Data-START##
            Assembly Method       :: CLC Genomics
            Sequencing Technology :: PacBio Sequel II
            ##Assembly-Data-END##

How can I add more information than this?

My main aim is to match all the metadata that is required in gisaid to ncbi submission.

Thanks,

Ammar

Update readme with doc style info

To help users and potential collaborators, please update the readme to follow the flu doc style and explicitly call out how you like folks to submit issues, test issues, commit (eg, dev branches vs main vs releases).

After your changes a user or collaborator will be able to understand how you work on and make changes to seqsender and how to expect to watch for in progress work, completed work and new releases.

NCBI & GISAID account creation docs

Will be very helpful to have step-by-step instructions with screenshots in the documentation for creating an account, highlighting which fields will then be needed later in seqsender.

Create Automated Test/Validation Scripts

Automated testing for script updates will drastically reduce testing time. Manual testing is tedious because it depends on the databases processing files to determine whether the changes work correctly.

Environment testing:

  • Automatic docker deployment to GHCR
    Master branch now automatically builds and deploys the latest docker image to github container repository.
  • Automatic docker testing on pull request
  • Automatic python/mamba versioning testing

Functions to be tested:

  • create_ncbi_submission
  • create_gisaid_submission
  • create_submission_xml
  • create_submission_status_csv
  • create_authorset
  • create_fasta
  • create_genbank_files
  • create_genbank_zip
  • save_xml
  • create_submission_log
  • create_genbank_table2asn
  • read_gisaid_log
  • get_required_colnames
  • get_metadata
  • check_credentials
  • check_raw_read_files
  • process_fasta_samples
  • update_submission_status
  • update_genbank_files
  • get_token
  • get_ncbi_process_report
  • process_biosample_sra_report
  • check_submission_description
  • process_genbank_report
  • get_execution_time
  • get_config
  • start
  • args_parser
  • main
  • create_zip_template
  • authenticate
  • download_table2asn
  • submit_ncbi
  • submit_gisaid
  • sendmail

Issue templates to update

Additional templates

  • New contributor
  • Suggest new virus to support

Update existing templates

  • Bug Report
  • Feature Request
  • Maintenance

Info to add to existing templates

  • Virus information
  • Instrument information
  • Database information

Errors in script for production submission

Hello, I am trying to do our first production submission and this is the output I am getting.

Traceback (most recent call last):
  File "seqsender.py", line 616, in <module>
    main()
  File "seqsender.py", line 591, in main
    submission_preparation.process_submission(args.unique_name, args.fasta, args.metadata, os.path.join(os.path.dirname(os.path.abspath(__file__)), "config_files", args.config))
  File "/root/miniconda3/seqsender/submission_preparation.py", line 493, in process_submission
    main_df = merge(fasta_file, metadata_file)
  File "/root/miniconda3/seqsender/submission_preparation.py", line 228, in merge
    main_df = fasta.merge(metadata, left_on = "fasta_name_orig", right_on = config_dict["general"]["fasta_sample_name_col"], how = "left")
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/frame.py", line 7963, in merge
    validate=validate,
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 87, in merge
    validate=validate,
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 652, in __init__
    ) = self._get_merge_keys()
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1005, in _get_merge_keys
    right_keys.append(right._get_label_or_level_values(rk))
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/generic.py", line 1563, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'specimen_collector_sample_id'
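The traceback shows the merge failing because the column named by config_dict["general"]["fasta_sample_name_col"] is not present in the metadata. A pre-merge check along the following lines could turn this into a clearer message. It is a sketch using only the standard library; the helper name and the delimiter default are assumptions, not seqsender code.

```python
import csv


def check_sample_name_column(metadata_path, column, delimiter=","):
    """Raise a descriptive KeyError if the metadata header lacks the given column."""
    with open(metadata_path, newline="") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), [])
    if column not in header:
        raise KeyError(
            f"metadata file {metadata_path} has no column {column!r}; "
            f"columns found: {', '.join(header)}"
        )
    return header


if __name__ == "__main__":
    import sys

    # Usage: python check_column.py <metadata file> <column name>
    check_sample_name_column(sys.argv[1], sys.argv[2])
    print("column found")
```

Listing the columns that were found helps distinguish a misspelled header from a wrong delimiter, since a wrong delimiter typically produces a single long column name.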

Use Flu Submission Org

As part of the enhancement to support flu submissions, please use a different group to identify submissions made by flu programs (as opposed to sc2 groups using CDC_OAMD).

Can someone document how to obtain NCBI username and password

Is your feature request related to a problem? Please describe.
The seqsender submission configuration has fields for NCBI username and password. But NCBI accounts are created and logged into via third-party systems (Google etc.). How do we obtain an NCBI username/password pair for submission to, e.g., BioSample and SRA?

Describe the solution you'd like
Please provide documentation, or a pointer to documentation, detailing how to obtain an NCBI username/password pair that would enable us to make submissions to BioSample, SRA, and GenBank.

Describe alternatives you've considered
NA

Additional context
We would need the credentials to work for submissions to Biosample and SRA.

Step by step docs with screenshots

Complete step-by-step document on the website with screenshots of how to prepare files and run SS. The users will mainly be non-CLI, so the instructions need to be overly verbose. Write them for FLU use specifically for now.

  • Document for setup (define the lab’s configs)
  • Documents for operating
    i. Putting metadata into template excel
    ii. SS commands and verifying successful submission.
