usegalaxy-eu / ena-upload-cli

ENA upload tool - script your Open Data upload to the European Nucleotide Archive

License: MIT License

ena-upload-cli's Introduction

ENA upload tool

This command line tool (CLI) allows easy submission of experimental data and the corresponding metadata to the European Nucleotide Archive (ENA) using tabular files or one of the Excel spreadsheets that can be found on this template repo. The supported metadata includes study, sample, run and experiment information, so you can use the tool for programmatic submission of everything ENA needs without logging in to the Webin interface. This also includes client-side validation using ENA checklists and releasing the ENA objects. The tool is also available as a Galaxy tool: you can add it to your own Galaxy instance or use one of the existing Galaxy instances, such as usegalaxy.eu or usegalaxy.be.

Overview

The metadata should be provided in separate tables or files carrying similar information corresponding to the following ENA objects:

STUDY
SAMPLE
EXPERIMENT
RUN

You can set the tool to perform the following actions:

add: add an object to the archive
modify: modify an object in the archive
cancel: cancel a private object and its dependent objects
release: release a private object immediately to the public

After a successful submission, new tsv tables will be generated with the ENA accession numbers filled in along with a submission receipt.

Tool dependencies

Python 3.7+ with the following packages:
- Genshi
- lxml
- pandas
- requests
- pyyaml
- openpyxl
- jsonschema

Installation

pip install ena-upload-cli

Usage

Minimal: ena-upload-cli --action {add,modify,cancel,release} --center CENTER_NAME --secret SECRET

All supported arguments:

  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --action {add,modify,cancel,release}
                         add: add an object to the archive
                         modify: modify an object in the archive
                         cancel: cancel a private object and its dependent objects
                         release: release a private object immediately to public
  --study STUDY         table of STUDY object
  --sample SAMPLE       table of SAMPLE object
  --experiment EXPERIMENT
                        table of EXPERIMENT object
  --run RUN             table of RUN object
  --data [FILE ...]     data for submission
  --center CENTER_NAME  specific to your Webin account
  --checklist CHECKLIST
                        specify the sample checklist with following pattern: ERC0000XX, Default: ERC000011
  --xlsx XLSX           filled in excel template with metadata
  --isa_json ISA_JSON   BETA: ISA json describing the ENA objects
  --isa_assay_stream ISA_ASSAY_STREAM
                        BETA: specify the assay stream(s) that holds the ENA information, this can be a list of assay streams
  --auto_action         BETA: detect automatically which action (add or modify) to apply when the action column is not given
  --tool TOOL_NAME      specify the name of the tool this submission is done with. Default: ena-upload-cli
  --tool_version TOOL_VERSION
                        specify the version of the tool this submission is done with
  --no_data_upload      indicate if no upload should be performed and you like to submit a RUN object (e.g. if upload was done separately).
  --draft               indicate if no submission should be performed
  --secret SECRET       .secret.yml file containing the password and Webin ID of your ENA account
  -d, --dev             flag to use the dev/sandbox endpoint of ENA

Mandatory arguments: --action, --center and --secret.

ENA Webin

A Webin account can be created here if you don't have one already. The Webin ID is the full username, which looks like Webin-XXXXX. Visit Webin online to check on your submissions, or dev Webin to check on test submissions.

The .secret.yml file

To avoid exposing your credentials through the terminal history, it is recommended to make use of a .secret.yml file, containing your password and username keywords. An example is given in the root of this directory.
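
As a minimal sketch (not part of the tool), such a file could be created from Python so that a newly created file is readable by its owner only. The `username` and `password` key names follow the description above; the helper name `write_secret` is hypothetical, and the example file in the repository remains the reference for the exact layout.

```python
import json
import os


def write_secret(path, username, password):
    """Write a two-key YAML file (hypothetical helper, not part of ena-upload-cli)."""
    # json.dumps yields double-quoted strings, which are also valid YAML scalars,
    # so special characters in the password are escaped safely.
    content = (
        f"username: {json.dumps(username)}\n"
        f"password: {json.dumps(password)}\n"
    )
    # A newly created file gets mode 0o600, so other users cannot read it (POSIX).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(content)
    return content
```

Calling `write_secret(".secret.yml", "Webin-XXXXX", "your_password")` would produce a file that can then be passed to the `--secret` argument.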

ENA sample checklists

You can specify an ENA sample checklist using the --checklist parameter. By default, the ENA default sample checklist (ERC000011) is used, which covers the minimum information required for a sample. The supported checklists are listed on our template repo.

Fixed sample columns

The command line tool will automatically fetch the correct scientific name based on the taxon ID or fetch the taxon ID based on the scientific name. Both can be given and no overwrite will be done.

Mandatory: alias, title, sample_description, collection date, geographic location (country and/or sea) and either scientific_name or taxon_id (preferred)
Optional: common_name, sample_description

alias	title	taxon_id	scientific_name	common_name	sample_description	collection date	geographic location (country and/or sea)
sample_alias_4	sample_title_2	2697049	Severe acute respiratory syndrome coronavirus 2	covid-19	sample_description_1	2020-10-11	Argentina
sample_alias_5	sample_title_3	2697049	Severe acute respiratory syndrome coronavirus 2	covid-19	sample_description_2	2008-01-24	Belgium
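
Before submitting, it can help to check locally that a sample table has the expected header. The following is a rough sketch using only the Python standard library; the column names are copied from the list above rather than from the tool's source code, and `missing_sample_columns` is a hypothetical helper name.

```python
import csv
import io

# Column names as listed above (an assumption; not read from the tool itself).
MANDATORY = [
    "alias",
    "title",
    "sample_description",
    "collection date",
    "geographic location (country and/or sea)",
]
TAXON_COLUMNS = ["taxon_id", "scientific_name"]


def missing_sample_columns(tsv_text):
    """Return the mandatory column names absent from a sample table header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    missing = [name for name in MANDATORY if name not in header]
    # At least one of taxon_id / scientific_name must be present.
    if not any(name in header for name in TAXON_COLUMNS):
        missing.append("taxon_id or scientific_name")
    return missing
```

An empty list means the header contains every column the check knows about; it does not validate the cell values.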

Viral submissions

If you want to submit viral samples, you can use the ENA virus pathogen checklist by passing ERC000033 to the --checklist parameter. Check out our viral example command as a demonstration. Please consult the ENA virus pathogen checklist in our template repo to see which values are allowed in the controlled vocabulary fields.

ENA study, experiment and run tables

Please check out the template of your checklist to discover which attributes are mandatory for the study, experiment and run ENA object.

Dev instance

By default, the submission is sent to the following ENA URL: https://www.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA

Use the --dev flag if you want to do a test submission against the sandbox (dev) instance of ENA: https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA. A TEST submission will be discarded within 24 hours.
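
The choice between the two endpoints can be illustrated as follows. This is only a sketch of the behaviour described above, not the tool's actual implementation; the two URLs are the ones given in this section, while the function and constant names are hypothetical.

```python
import argparse

PRODUCTION_URL = "https://www.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA"
DEV_URL = "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA"


def submission_url(argv):
    """Pick the ENA endpoint depending on whether -d/--dev was passed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--dev", action="store_true",
                        help="use the dev/sandbox endpoint of ENA")
    args = parser.parse_args(argv)
    return DEV_URL if args.dev else PRODUCTION_URL
```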

Submitting a selection of rows to ENA

There are two ways of submitting only a selection of objects to ENA. This is handy for recurring submissions, especially when they belong to the same study.

Manual: you can add an optional status column to every table/sheet that contains the action you want to apply during this submission. If you choose to add only the first 2 samples to ENA, specify --action add in the command and put the value add in the status column of the rows you want to submit, as demonstrated below. The same holds for the actions modify, release and cancel.
Automatic (BETA): with --auto_action the tool detects whether an object (identified by its alias) is already present on ENA and fills in the specified action (--action parameter) accordingly. In practice, this means that if you choose to add objects and an object with the same alias already exists, that object will not be added. Conversely, if the command is used to modify objects, the action is applied only to objects that already exist on ENA. The detection only works for ENA objects that are published and findable on the website through the search function (both the dev and live website). If the tool does not correctly detect the presence of your ENA object, we suggest using the more robust manual approach described above.

Example with modify as seen in the example sample modify table

alias	status	title	taxon_id	sample_description
sample_alias_4	modify	sample_title_1	2697049	sample_description_1
sample_alias_5		sample_title_2	2697049	sample_description_2

IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the --action parameter, no rows will be submitted! Either leave out the column or add to every row you want to submit the correct action.
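
The row selection rule can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not the tool's own code, and `rows_to_submit` is a hypothetical name.

```python
import csv
import io


def rows_to_submit(tsv_text, action):
    """Return the table rows that the given action would apply to."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    # Without a status column the whole table is submitted.
    if "status" not in (reader.fieldnames or []):
        return rows
    # With a status column only rows carrying the requested action are kept;
    # empty cells or a different action exclude the row.
    return [row for row in rows if (row["status"] or "").strip() == action]
```

Applied to the modify example above with `action="modify"`, only the row for sample_alias_4 is selected; with `action="add"` the result is empty, which matches the warning about mismatching actions.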

Using Excel templates

We also support specific Excel templates, designed for each sample checklist. Use the --xlsx argument to pass the path to an Excel template file from this template repo that you have filled in.

The data files

Supported data

Most files uploaded to the ENA FTP server need to be compressed.

More information on how ENA wants to receive the files can be found here.

Note for data upload: uploaded files are stored on the ENA server for some time after the upload. Thus, if multiple test submissions are performed, you can skip the data upload with --no_data_upload in subsequent submissions. This also allows uploading (large) datasets separately, e.g. with Aspera. With the --no_data_upload argument, data file(s) still need to be provided with --data if a RUN object is submitted without its MD5 sums in the file_checksum column.
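
If you prefer to fill in the file_checksum column yourself, the MD5 sum of a data file can be computed with the Python standard library. This is a minimal sketch; the helper name `md5_checksum` is hypothetical and the tool may compute its checksums differently.

```python
import hashlib


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory usage flat for large sequence files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```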

Releasing and canceling a submission

If you want to release or cancel data, use release or cancel in the --action parameter on the command line. Tables that have to be released or cancelled need an accession column with the corresponding accession ids. This means that, if you have not submitted your data yet, you first have to use add to submit it and afterwards use the updated table containing the accession ids.

By default the updated tables after submission will have the action added in their status column. Don't forget to change the values to release or cancel if you want to use one of these actions (or delete the status column if your action applies for the whole table).

NOTE: Releasing a study will make all child elements like runs and experiments public.

Tool overview

inputs:

metadata tables/excelsheet/isa_json
- examples in example_table and on this template repo for excel sheets
- (optional) define actions in status column e.g. add, modify, cancel, release (when not given the whole table is submitted)
- to perform a bulk submission of all objects, the aliases must be consistent across the tables: the alias references in the experiment table link all objects together
experimental data
- examples in example_data

outputs:

a receipt.xml file in the working directory with the receipt from the ENA submission
metadata tables with updated info in the same directory of inputs:
- updated status: added, modified, canceled, released
- accession ids
- submission date
- file checksums in runs table if not given
- taxon id or scientific name in sample table if not given
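
The receipt can be inspected programmatically. The sketch below collects the ERROR and INFO messages from a receipt; these element names are the ones ENA uses inside its MESSAGES block, whereas the function name is hypothetical and the overall layout of receipt.xml should be checked against a real receipt.

```python
import xml.etree.ElementTree as ET


def receipt_messages(xml_text):
    """Return (errors, infos) found anywhere in a receipt document."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the nesting depth of MESSAGES does not matter.
    errors = [element.text or "" for element in root.iter("ERROR")]
    infos = [element.text or "" for element in root.iter("INFO")]
    return errors, infos
```

For a file on disk, read it first, for example with `open("receipt.xml", encoding="utf-8").read()`, and pass the text to the function.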

Test the tool

Add metadata and sequence data

ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --sample example_tables/ENA_template_samples.tsv --experiment example_tables/ENA_template_experiments.tsv --run example_tables/ENA_template_runs.tsv --data example_data/*gz --dev --secret .secret.yml

Add metadata only

ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --sample example_tables/ENA_template_samples.tsv --experiment example_tables/ENA_template_experiments.tsv --run example_tables/ENA_template_runs_md5sums.tsv --dev --secret .secret.yml

Add studies

ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --dev --secret .secret.yml

Modify sample metadata

ena-upload-cli --action modify --center 'your_center_name' --sample example_tables/ENA_template_samples_modify.tsv --dev --secret .secret.yml

Viral data

ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --sample example_tables/ENA_template_samples_vir.tsv --experiment example_tables/ENA_template_experiments.tsv --run example_tables/ENA_template_runs.tsv --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml

Using an Excel template

ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml --xlsx example_tables/ENA_excel_example_ERC000033.xlsx

Using an ISA JSON

ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --secret .secret.yml --isa_json tests/test_data/simple_test_case_v2.json --isa_assay_stream "Ena stream 1"

Release submission

ena-upload-cli --action release --center 'your_center_name' --study example_tables/ENA_template_studies_release.tsv --dev --secret .secret.yml

Note for Windows users: Windows, by default, does not support wildcard expansion in command-line arguments. Because of this the --data example_data/*gz argument should be substituted with one containing a list of the data files. For this example, use:
--data example_data/ENA_TEST1.R1.fastq.gz example_data/ENA_TEST2.R1.fastq.gz example_data/ENA_TEST2.R2.fastq.gz
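
When the tool is called from a script, the wildcard can also be expanded in Python, which works the same way on every platform. This is a sketch with a hypothetical helper name, `data_arguments`; it only builds the argument list and does not call the tool.

```python
import glob


def data_arguments(pattern):
    """Expand a wildcard pattern into an explicit --data argument list."""
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return ["--data", *files]
```

The result can be appended to the rest of the command, for example in the argument list handed to `subprocess.run`.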

ena-upload-cli's Issues

[Discussion] Adding support for Nextcloud file transfer to the ENA FTP server?

A feature brought up in issue #48 which is worth thinking about!

The license_file parameter is deprecated, use license_files instead.

Hi,

I think there might be a problem with how version 0.6.3 of the upload client is built, since I can't install it using pip. I have no problem installing version 0.6.2. Here is the installation log:

ena.log

I would like to take the chance to ask about uploading an analysis. I have a count table with samples as columns and genome accessions as rows and, apart from that, the unprocessed reads from deep shotgun sequencing. I know how to proceed with the reads, but how should I handle the count table? Is it possible to add this table to the study?

Also, does the sample table allow custom fields besides the fields from the ENA checklist?

If these last two enquiries do not belong here, I am happy to move them elsewhere.

Thank you.

Support FTPS connection to protect account credentials

The FTP protocol does not support secure connections between the client and the server, and account credentials are sent as plain text. This might not be an issue depending on how the traffic is routed, but it is not good practice for general use cases.

Supporting all optional fields in the run/experiment and study xml

Adding support for all official ENA checklists

Checklists from https://www.ebi.ac.uk/ena/browser/checklists

Adding flag to not do a submission but just to generate the XMLs

This would be for testing purposes

Catching errors better + more clear error messages

The tool would benefit from different verbose modes and real logging.

Adding controlled vocabulary to the sample checklists

The controlled vocabulary for the sample checklists is now checked on ENA's side; this could also be done on our side.

Using the labels from ENA in the tsv table headers instead of the underscore version

This way no mapping table is needed

Possibility to upload non-mandatory (virus) metadata

Thanks for the nice helper. However, I noticed that currently ena-upload-cli supports only the mandatory ERC000033 fields in its XML templates. I came up with a little hack for myself, with updated XML forms https://github.com/avilab/ena-upload-cli/tree/location, to get some additional metadata uploaded, e.g. age, geographic location locality+lon/lat. I appreciate that under the current implementation there is no simple fix to include any of the optional checklist fields, given that the ENA database may not accept empty fields (not sure).

One way to fix this possible issue would be, very briefly, to check imported tables against schemas and serialise dictionaries to XML. So that any combination of non-mandatory fields can be included.

I am happy to be corrected and directed to the right path if there is a way to include optional/recommended (virus) metadata fields using this app as it is.

Possible values for file_format are not clear

For the field file_format in the ENA run table.
Possible values for the field would be the ones listed in ENA_template_FILE.xml.
However, if any of the following values are used, an error message is generated.

454_native
454_native_qual
454_native_seq
fasta
helicos_native
illumina_native
illumina_native_int
illumina_native_prb
illumina_native_qseq
illumina_native_scarf
illumina_native_seq
solid_native
solid_native_csfasta
solid_native_qual
sra
tab

<MESSAGES>
          <ERROR>In run, alias:"run_1", accession:"", In filename:"1.bam", filetype:"454_native". Invalid file type "454_native".</ERROR>
          <ERROR>In run, alias:"run_10", accession:"", In filename:"10.bam", filetype:"Illumina_native". Invalid file type "Illumina_native".</ERROR>
          <ERROR>In run, alias:"run_11", accession:"", In filename:"11.bam", filetype:"Illumina_native_int". Invalid file type "Illumina_native_int".</ERROR>
          <ERROR>In run, alias:"run_12", accession:"", In filename:"12.bam", filetype:"Illumina_native_prb". Invalid file type "Illumina_native_prb".</ERROR>
          <ERROR>In run, alias:"run_13", accession:"", In filename:"13.bam", filetype:"Illumina_native_qseq". Invalid file type "Illumina_native_qseq".</ERROR>
          <ERROR>In run, alias:"run_14", accession:"", In filename:"14.bam", filetype:"Illumina_native_scarf". Invalid file type "Illumina_native_scarf".</ERROR>
          <ERROR>In run, alias:"run_15", accession:"", In filename:"15.bam", filetype:"Illumina_native_seq". Invalid file type "Illumina_native_seq".</ERROR>
          <ERROR>In run, alias:"run_16", accession:"", In filename:"16.tar", filetype:"OxfordNanopore_native". Invalid file suffix for file "16.tar". File compression is required for file type "OxfordNanopore_native". Supported compression formats are: BZIP2, GZIP with file suffixes: .bz2, .gz.</ERROR>
          <ERROR>In run, alias:"run_19", accession:"", In filename:"19.bam", filetype:"SOLiD_native". Invalid file type "SOLiD_native".</ERROR>
          <ERROR>In run, alias:"run_2", accession:"", In filename:"2.bam", filetype:"454_native_qual". Invalid file type "454_native_qual".</ERROR>
          <ERROR>In run, alias:"run_20", accession:"", In filename:"20.bam", filetype:"SOLiD_native_csfasta". Invalid file type "SOLiD_native_csfasta".</ERROR>
          <ERROR>In run, alias:"run_21", accession:"", In filename:"21.bam", filetype:"SOLiD_native_qual". Invalid file type "SOLiD_native_qual".</ERROR>
          <ERROR>In run, alias:"run_22", accession:"", In filename:"22.fastq", filetype:"sra". Invalid file type "sra".</ERROR>
          <ERROR>In run, alias:"run_24", accession:"", In filename:"24.fastq", filetype:"tab". Invalid file type "tab".</ERROR>
          <ERROR>In run, alias:"run_3", accession:"", In filename:"3.bam", filetype:"454_native_seq". Invalid file type "454_native_seq".</ERROR>
          <ERROR>In run, alias:"run_7", accession:"", In filename:"7.fastq", filetype:"fasta". Invalid file type "fasta".</ERROR>
          <ERROR>In run, alias:"run_9", accession:"", In filename:"9.bam", filetype:"Helicos_native". Invalid file type "Helicos_native".</ERROR>
          <ERROR>In run, alias:"run_1", accession:"". Invalid group of files: 1 "454_native" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_10", accession:"". Invalid group of files: 1 "Illumina_native" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_11", accession:"". Invalid group of files: 1 "Illumina_native_int" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_12", accession:"". Invalid group of files: 1 "Illumina_native_prb" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_13", accession:"". Invalid group of files: 1 "Illumina_native_qseq" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_14", accession:"". Invalid group of files: 1 "Illumina_native_scarf" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_15", accession:"". Invalid group of files: 1 "Illumina_native_seq" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_19", accession:"". Invalid group of files: 1 "SOLiD_native" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_2", accession:"". Invalid group of files: 1 "454_native_qual" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_20", accession:"". Invalid group of files: 1 "SOLiD_native_csfasta" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_21", accession:"". Invalid group of files: 1 "SOLiD_native_qual" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_22", accession:"". Invalid group of files: 1 "sra" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_24", accession:"". Invalid group of files: 1 "tab" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_3", accession:"". Invalid group of files: 1 "454_native_seq" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_7", accession:"". Invalid group of files: 1 "fasta" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <ERROR>In run, alias:"run_9", accession:"". Invalid group of files: 1 "Helicos_native" file. Supported file grouping(s) are: [ at least 1 "CompleteGenomics_native" files],[ at least 1 "fastq" files],[1 "OxfordNanopore_native" file],[ at least 1 "PacBio_HDF5" files],[1 "bam" file],[1 "cram" file],[1 "sff" file],[1 "srf" file].</ERROR>
          <INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
</MESSAGES>

ENA might be using another list of file formats to validate. Their documentation at Accepted Read Data Formats either does not list the values leading to the error, or flags them as deprecated.
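
Until the accepted values are documented, a run table can be checked locally against the file types that the receipt above names in its "Supported file grouping(s)" message. This is only a sketch: the set below is copied from that receipt, may change on ENA's side, and `rejected_file_types` is a hypothetical helper.

```python
# File types named in the "Supported file grouping(s)" part of the receipt above.
ACCEPTED_FILE_TYPES = {
    "CompleteGenomics_native",
    "fastq",
    "OxfordNanopore_native",
    "PacBio_HDF5",
    "bam",
    "cram",
    "sff",
    "srf",
}


def rejected_file_types(file_types):
    """Return the distinct values that the receipt above would flag, sorted."""
    return sorted({value for value in file_types if value not in ACCEPTED_FILE_TYPES})
```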

The read operation timed out

Hi there,
I am unable to get the following command to run on my Ubuntu VM. The tool was installed using the pip command (pip install ena-upload-cli). My Ubuntu VM already has the FTP port 21 open by default. Any thoughts?

ena-upload-cli --action add --center 'BioCommons Australia' --study ENA_template_studies.tsv --sample ENA_template_samples.tsv --experiment ENA_template_experiments.tsv --run ENA_template_runs.tsv --data *gz -d --secret .secret.yml
Check if all required columns are present in the study table.
Check if all required columns are present in the sample table.
Check if all required columns are present in the experiment table.
Check if all required columns are present in the run table.
No valid checksums found, generate now... done.

Connecting to ftp.webin2.ebi.ac.uk....
uploading /home/ubuntu/ena/ENA_TEST1.R1.fastq.gz
ERROR: The read operation timed out
ERROR: If your connection times out at this stage, it propably is because of a firewall that is in place. FTP is used in passive mode and connection will be opened to one of the ports: 40000 and 50000.
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/ena-upload-cli", line 11, in
load_entry_point('ena-upload-cli==0.6.1', 'console_scripts', 'ena-upload-cli')()
File "/home/ubuntu/.local/lib/python3.8/site-packages/ena_upload/ena_upload.py", line 925, in main
submit_data(file_paths, password, webin_id)
File "/home/ubuntu/.local/lib/python3.8/site-packages/ena_upload/ena_upload.py", line 424, in submit_data
print(ftps.storbinary(f'STOR {filename}', open(path, 'rb')))
File "/usr/lib/python3.8/ftplib.py", line 504, in storbinary
conn.unwrap()
File "/usr/lib/python3.8/ssl.py", line 1285, in unwrap
s = self._sslobj.shutdown()
socket.timeout: The read operation timed out

Many thanks,

If table is given without action values, an error is thrown

Add support for tables that do not contain add or modify in the status column

Add GitHub Action to check if new xsd templates are added to ENA

Include script used to generate tabular metadata templates

Add support for the submission of analysis objects

ENA supports the submission of other analysis spreadsheets/XMLs.

Following the analysis xsd formatting

Missing / Wrong sequencer identifiers in templates

Hi,

when I processed some of our data, the data validation failed with an error message relating to missing "Illumina" elements in the XML. I could solve this by adding the following lines to

ENA_template_PLATFORM.XML (l.28):
<INSTRUMENT_MODEL py:when="row.instrument_model.lower().strip() == 'illumina novaseq 6000'">Illumina NovaSeq 6000</INSTRUMENT_MODEL>

SRA.common.xsd (l.911):
<xs:enumeration value="Illumina NovaSeq 6000"/>

In the course of doing this, I also noticed that the NextSeq and HiSeq X platforms are listed without the Illumina prefix, e.g.

<INSTRUMENT_MODEL py:when="row.instrument_model.lower().strip() == 'illumina hiseq 4000'">Illumina HiSeq 4000</INSTRUMENT_MODEL>
<INSTRUMENT_MODEL py:when="row.instrument_model.lower().strip() == 'nextseq 550'">NextSeq 550</INSTRUMENT_MODEL>

I'm happy to provide a PR for this. However, I wonder if this is the clean way to do it, or if the files were originally fetched from ENA and should rather be fixed there.

Best

Fritjof

FTPS instead of FTP

Since v3.0.1 ENA Webin makes use of FTPS

nan is filled in when column is mandatory and cell is empty

This tricks the validation although nothing is present

Most minimal example not working

ena_upload --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --sample example_tables/ENA_template_samples.tsv --experiment example_tables/ENA_template_experiments.tsv --run example_tables/ENA_template_runs.tsv --data example_data/*gz --dev --secret .secret.yml

This throws the error:

Traceback (most recent call last):
  File "c:\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\rabuo\ena-upload-cli\Scripts\ena-upload-cli.exe\__main__.py", line 7, in <module>
  File "c:\users\rabuo\ena-upload-cli\lib\site-packages\ena_upload\ena_upload.py", line 771, in main
    schema_xmls = run_construct(
  File "c:\users\rabuo\ena-upload-cli\lib\site-packages\ena_upload\ena_upload.py", line 243, in run_construct
    schema_xmls[schema] = construct_xml(schema, stream, xsds[schema])
  File "c:\users\rabuo\ena-upload-cli\lib\site-packages\ena_upload\ena_upload.py", line 182, in construct_xml
    validate_xml(xsd, xml_string)
  File "c:\users\rabuo\ena-upload-cli\lib\site-packages\ena_upload\ena_upload.py", line 133, in validate_xml
    return xmlschema.assertValid(doc)
  File "src\lxml\etree.pyx", line 3623, in lxml.etree._Validator.assertValid
lxml.etree.DocumentInvalid: Element 'SAMPLE_ATTRIBUTES': Missing child element(s). Expected is ( SAMPLE_ATTRIBUTE )., line 10

This is because the default checklist ERC000011 has only optional fields, and if no value is given for one of them, the template will create a SAMPLE_ATTRIBUTES object without SAMPLE_ATTRIBUTE children (since they are all optional)

Solution: extra if statement for the <SAMPLE_ATTRIBUTES> object in the template that checks if the row contains an optional field

Traceback (most recent call last):
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/template/eval.py", line 301, in lookup_attr
    val = getattr(obj, key)
          ^^^^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Series' object has no attribute 'iteritems'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3653, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'iteritems'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/template/eval.py", line 307, in lookup_attr
    val = obj[key]
          ~~~^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/pandas/core/series.py", line 1007, in __getitem__
    return self._get_value(key)
           ^^^^^^^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/pandas/core/series.py", line 1116, in _get_value
    loc = self.index.get_loc(label)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3655, in get_loc
    raise KeyError(key) from err
KeyError: 'iteritems'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/bin/ena-upload-cli", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/ena_upload/ena_upload.py", line 953, in main
    schema_xmls = run_construct(
                  ^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/ena_upload/ena_upload.py", line 298, in run_construct
    schema_xmls[schema] = construct_xml(schema, stream, xsds[schema])
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/ena_upload/ena_upload.py", line 235, in construct_xml
    xml_string = stream.render(method='xml', encoding='utf-8')
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/core.py", line 184, in render
    return encode(generator, method=method, encoding=encoding, out=out)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/output.py", line 59, in encode
    return _encode(''.join(list(iterator)))
                           ^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/output.py", line 243, in __call__
    for kind, data, pos in stream:
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/output.py", line 674, in __call__
    for kind, data, pos in stream:
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/output.py", line 779, in __call__
    for kind, data, pos in chain(stream, [(None, None, None)]):
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/output.py", line 598, in __call__
    for ev in stream:
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/core.py", line 292, in _ensure
    for event in stream:
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/template/base.py", line 641, in _include
    for event in stream:
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/template/markup.py", line 326, in _match
    for event in stream:
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/template/base.py", line 581, in _flatten
    for kind, data, pos in stream:
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/template/directives.py", line 369, in __call__
    iterable = _eval_expr(self.expr, ctxt, vars)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/template/base.py", line 290, in _eval_expr
    retval = expr.evaluate(ctxt)
             ^^^^^^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/template/eval.py", line 160, in evaluate
    return eval(self.code, _globals, {'__data__': data})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/ena_upload/templates/ENA_template_runs.xml", line 14, in <Expression 'iter(run_groups.iteritems())'>
    <py:for each="alias, experiment_alias in run_groups.iteritems()">
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/template/eval.py", line 309, in lookup_attr
    val = cls.undefined(key, owner=obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/users/bi/fduarte/miniconda3/envs/ena-iupload/lib/python3.11/site-packages/genshi/template/eval.py", line 397, in undefined
    raise UndefinedError(key, owner=owner)
genshi.template.eval.UndefinedError: alias
run_alias_1a    [experiment_alias_7a]
run_alias_3c    [experiment_alias_9c]
Name: experiment_alias, dtype: object has no member named "iteritems"

I downgraded to pandas 1.5.3 and now it seems to work fine. I used

ena-upload-cli --action add --center 'CRG' --study ena_templates/example_tables/ENA_template_studies.tsv --sample ena_templates/example_tables/ENA_template_samples.tsv --experiment ena_templates/example_tables/ENA_template_experiments.tsv --run ena_templates/example_tables/ENA_template_runs.tsv --data ena_templates/example_data/*gz --dev --secret ena_templates/.secret.yml --draft --no_data_upload

for running the tool.
Thanks,
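
For context, `Series.iteritems()` was deprecated in pandas 1.5 and removed in pandas 2.0, which would explain why the downgrade helps. `Series.items()` is the equivalent call and exists in both versions. A minimal sketch, using a made-up `run_groups` series modelled on the error output above:

```python
import pandas as pd

# Illustrative data shaped like the series printed in the traceback.
run_groups = pd.Series(
    [["experiment_alias_7a"], ["experiment_alias_9c"]],
    index=["run_alias_1a", "run_alias_3c"],
    name="experiment_alias",
)

# .items() yields (index, value) pairs, as .iteritems() used to.
pairs = [(alias, experiment_alias) for alias, experiment_alias in run_groups.items()]
```

Replacing `iteritems()` with `items()` in the template expression should therefore work with both old and new pandas, though this has not been checked against the tool itself.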

Changing release date through API

Currently, the release date can only be changed manually through the website.

--no_upload parameter not working

When running the example command, the following error is thrown:

ena-upload-cli --action add --center 'VIB-UGENT' --study example_tables/ENA_template_studies.tsv --sample example_tables/ENA_template_samples.tsv --experiment example_tables/ENA_template_experiments.tsv --run example_tables/ENA_template_runs.tsv --data example_data/*gz --dev --secret .secret.yml --no_upload
No files will be uploaded, remove `--no_upload' argument to perform upload.
No valid checksums found, generate now... Traceback (most recent call last):
  File "/home/bedro/.local/bin/ena-upload-cli", line 8, in <module>
    sys.exit(main())
  File "/home/bedro/.local/lib/python3.8/site-packages/ena_upload/ena_upload.py", line 733, in main
    md5 = df['file_name'].apply(lambda x: file_md5[x]).values
  File "/home/bedro/.local/lib/python3.8/site-packages/pandas/core/series.py", line 4138, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2467, in pandas._libs.lib.map_infer
  File "/home/bedro/.local/lib/python3.8/site-packages/ena_upload/ena_upload.py", line 733, in <lambda>
    md5 = df['file_name'].apply(lambda x: file_md5[x]).values
KeyError: 'ENA_TEST2.R1.fastq.gz'

I don’t think that the success attribute of the XML receipt can be used as an indicator of a successful submission. You would still need to parse the body of the receipt to look for errors and successfully allocated accession numbers; see, for example, the corresponding implementation in the webin-cli by ENA.
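
A rough sketch of what such parsing could look like with the Python standard library. The receipt string is a hand-written illustration of the general layout (a `RECEIPT` root, object elements carrying `accession` and `alias`, and a `MESSAGES` element with `INFO` and `ERROR` children), not real ENA output, and `summarise_receipt` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

# Hand-written example receipt; values are illustrative only.
RECEIPT = """<RECEIPT receiptDate="2020-01-01T00:00:00.000Z" success="true">
  <SAMPLE accession="ERS0000001" alias="sample_alias_1" status="PRIVATE"/>
  <MESSAGES>
    <INFO>This submission is a TEST submission</INFO>
    <ERROR>Illustrative error message</ERROR>
  </MESSAGES>
  <ACTIONS>ADD</ACTIONS>
</RECEIPT>"""


def summarise_receipt(xml_text):
    """Collect error messages and alias -> accession pairs from a receipt."""
    root = ET.fromstring(xml_text)
    errors = [e.text for e in root.findall("./MESSAGES/ERROR")]
    accessions = {
        child.get("alias"): child.get("accession")
        for child in root
        if child.get("accession") is not None
    }
    return {
        "success_attribute": root.get("success"),
        "errors": errors,
        "accessions": accessions,
    }


summary = summarise_receipt(RECEIPT)
```

A caller could then treat the submission as successful only if `errors` is empty and every submitted alias appears in `accessions`, rather than trusting the `success` attribute alone.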

Empty rows give error

If rows are given without values, errors like

genshi.template.eval.UndefinedError: nan has no member named "lower"

are thrown.
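
One possible workaround on the user side, sketched with pandas: drop rows in which every cell is empty before the table reaches the template. Empty cells are most likely read as `nan` (a float), which has no `lower` method, hence the message. The table contents below are made up:

```python
import pandas as pd

# Illustrative table with one completely empty row in the middle.
table = pd.DataFrame(
    {
        "alias": ["sample_1", None, "sample_2"],
        "title": ["first", None, "second"],
    }
)

# how="all" removes only rows where every column is missing.
cleaned = table.dropna(how="all").reset_index(drop=True)
```

Rows that are only partly empty are kept by this filter, so a missing value in a single required column would still need separate handling.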

Windows compatibility issues

Some file names contain a timestamp with a colon (:) character. Windows does not accept colons in file names, which prevents installing and using the script there.
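
A minimal sketch of a colon-free timestamp, assuming the file name is built from a `datetime` object. The helper `windows_safe_timestamp` and the `receipt_...xml` naming pattern are hypothetical and not taken from the tool:

```python
from datetime import datetime


def windows_safe_timestamp(moment):
    """Format a datetime with hyphens instead of colons in the time part."""
    return moment.strftime("%Y-%m-%dT%H-%M-%S")


stamp = windows_safe_timestamp(datetime(2020, 1, 2, 3, 4, 5))
file_name = f"receipt_{stamp}.xml"
```

Using hyphens keeps the name sortable by time while staying valid on Windows, Linux and macOS.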