
geneidx's Introduction

GeneidX

GeneidX provides fast annotation of the protein-coding genes in a eukaryotic genome, taking as input the genome assembly and the taxonomic ID of the species to annotate.

This description covers our preliminary results, a schema of our method, and the minimal requirements and commands for running it.

Stay tuned for an article with detailed descriptions, and feel free to contact us if you try it and run into any problem.

Preliminary results

The results of an initial benchmarking using vertebrate genomes annotated in Ensembl show that our method is as accurate as the top ab initio gene predictors, while being 10 to 100 times faster.

Before running GeneidX

DEPENDENCIES

Make sure Docker or Singularity, as well as Nextflow, are installed on your computer.

MANDATORY PARAMETERS

Required parameters to run GeneidX

tsv

The absolute path of the input TSV file.

The TSV file can contain multiple assemblies of different species, allowing the pipeline to annotate all the species it contains in parallel. See data/assemblies.tsv for an example.

example
ID PATH TAXID
GCA_951394045.1 https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_946965045.2?download=true&gzip=true 2809013
humanAssembly /absolutePath/humanAssembly.fa.gz 9606

IMPORTANT

  • the provided paths must point to gzipped files (the pipeline's processes expect the .gz extension)
  • the path can also be a URL

column_id_value

The name of the column containing the row ID; in the example above it is ID. Its values must be unique.

column_path_value

The name of the column containing the path to the FASTA file; in the example above it is PATH.

column_taxid_value

The name of the column containing the taxid value; in the example above it is TAXID.
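
As an illustration, here is a minimal sketch of how these three parameters select fields from the TSV. This is not the pipeline's code; it assumes pandas and the column names from the example above:

# Minimal sketch of how the three column parameters map onto the input TSV.
# Assumes pandas and the column names from the example above.
import pandas as pd

column_id_value = "ID"          # must contain unique values
column_path_value = "PATH"      # gzipped FASTA path or URL
column_taxid_value = "TAXID"    # NCBI taxonomic ID

assemblies = pd.read_csv("data/assemblies.tsv", sep="\t")
assert assemblies[column_id_value].is_unique, "row IDs must be unique"

# One annotation run is launched per row.
for _, row in assemblies.iterrows():
    print(row[column_id_value], row[column_path_value], row[column_taxid_value])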

OPTIONAL PARAMETERS

use_masking (default = true)

Whether to mask the genomes or not -> boolean

store_param_files (default=false)

Whether to store the generated geneid param file in the target species folder -> boolean

assemblies_dir (default="")

Path to the directory where the downloaded assemblies will be stored; useful when annotating many assemblies -> string

uniref_identity (default= 0.9)

Refers to the level of homology of the proteins taken from UniRef. Can take values 0.5, 0.9 or 1.0

Proteins are automatically retrieved from UniRef. See here for more information. The set of proteins is made non-redundant at a stricter or looser redundancy threshold: UniRef50 uses the loosest threshold (fewer proteins, and the proteins kept are quite divergent from each other), while UniRef100 uses the strictest one (more proteins, with more redundancy among them).

proteins_lower_lim (default= 90000)

Ensure that the set of proteins downloaded from UniRef has at least this many proteins.

proteins_upper_lim (default= 130000)

Ensure that the set of proteins downloaded from UniRef has at most this many proteins.

The downloader module will keep iterating and adjusting the query parameters until the number of proteins is within this range.
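
The adjustment loop behaves roughly as in the sketch below. This assumes the UniProt REST API's uniref endpoint and its taxonomy_id and identity query fields; the actual downloader module may adjust different parameters:

# Rough sketch of iterating until the UniRef protein count is within range.
# Assumes the UniProt REST API's /uniref endpoint and its taxonomy_id and
# identity query fields; the real downloader may adjust other parameters.
import requests

def count_uniref(taxid: int, identity: float = 0.9) -> int:
    # size=1 keeps the response small; the total number of hits is
    # reported in the x-total-results response header
    r = requests.get(
        "https://rest.uniprot.org/uniref/search",
        params={"query": f"taxonomy_id:{taxid} AND identity:{identity}",
                "size": 1},
    )
    r.raise_for_status()
    return int(r.headers["x-total-results"])

def pick_taxid(lineage, lower=90000, upper=130000):
    # walk the lineage from the species towards the root and return the
    # first level whose protein count falls inside [lower, upper]; a real
    # implementation could also tighten the query (e.g. the identity
    # level) when the count overshoots the upper limit
    for taxid in lineage:
        if lower <= count_uniref(taxid) <= upper:
            return taxid
    return lineage[-1]  # fall back to the broadest level

# e.g. part of the Drosophila melanogaster lineage: species, genus, ...
print(pick_taxid([7227, 7215, 7214, 7147]))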

general_gene_params (default= "$projectDir/data/general_gene_model.param")

Template for the basic parameters of the gene structures predicted by geneid.

match_score_min (default= 300)

This is the minimal score for a DIAMOND alignment to be used in the process of automatically training the coding potential estimation matrices.

match_ORF_min (default= 100)

This is the minimal length of an ORF derived from DIAMOND protein alignments to be used in the process of automatically training the coding potential estimation matrices.

intron_margin (default= 40)

To ensure that what we use as introns is truly non-coding sequence, this is the margin between the end of each predicted exon and the position from which the sequence is assumed to be intronic.

min_intron_size (default= 20)

max_intron_size (default= 10000)

These set the lower and upper bounds on the size of the introns used for building the coding potential estimation matrices.
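
Taken together, the four training thresholds above act roughly as in the following sketch; the coordinates and helper names are hypothetical, not the pipeline's intermediate format:

# Hypothetical sketch of how the training thresholds above are applied;
# the coordinates are illustrative, not the pipeline's intermediate format.
match_score_min = 300
match_ORF_min = 100
intron_margin = 40
min_intron_size = 20
max_intron_size = 10000

def usable_orf(alignment_score: int, orf_length: int) -> bool:
    # ORFs derived from DIAMOND matches enter the training set only if
    # both the alignment score and the ORF length pass their thresholds
    return alignment_score >= match_score_min and orf_length >= match_ORF_min

def putative_intron(exon_end: int, next_exon_start: int):
    # shrink the gap between consecutive matches of the same protein by
    # intron_margin on each side, then apply the size bounds
    start, end = exon_end + intron_margin, next_exon_start - intron_margin
    if min_intron_size <= end - start <= max_intron_size:
        return start, end
    return None

print(usable_orf(450, 120))          # True -> used for training
print(putative_intron(1000, 1500))   # (1040, 1460)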

source_uniprot (default= 1)

This is a "boolean" variable indicating whether the proteins of input come from uniprot/uniref or not. This information is used for producing a first "functional annotation" of the output GFF3 since it computes the intersection with the file of protein to genome matches in such a way that for each predicted gene we provide a list of which UniRef proteins are highly homologous to it.

Running GeneidX example:

nextflow run guigolab/geneidx -profile <docker/singularity> \
                                        --tsv <PATH_TO_TSV> \
                                        --column_taxid_value <TAXID_COLUMN_NAME> \
                                        --column_path_value <PATH_COLUMN_NAME> \
                                        --column_id_value <ID_COLUMN_NAME> \
                                        --outdir <OUTPUT_directory>

Alternatively, clone the repository and run the pipeline from there (highly recommended):

git clone https://github.com/guigolab/geneidx.git
cd geneidx
git checkout dev/v2-fixes
nextflow run main.nf -profile <docker/singularity> \
                              --tsv <PATH_TO_TSV> \
                              --column_taxid_value <TAXID_COLUMN_NAME> \
                              --column_path_value <PATH_COLUMN_NAME> \
                              --column_id_value <ID_COLUMN_NAME> \
                              --outdir <OUTPUT_directory>

See the DETAILS section below for the finer specifications of each parameter.

Schema

Which steps take place as part of this pipeline? A graphical summary accompanies the repository; the specific steps are outlined below.

  1. Get the set of proteins to be used for the protein-to-genome alignments.
  2. Get the taxonomically closest Geneid parameter file to use, since the source of the parameters is not indicated by the user.
  3. Create the protein database for DIAMOND to use as a source.
  4. Align the provided genome against the database using DIAMOND's BLASTx flavour.
  5. Run the auto-training process (a toy sketch of the matrix computation follows this list):
  • Use the matches to estimate the coding regions, looking for open reading frames between stop codons.
  • Use matches from the same protein to predict the potential introns.
  • From the sequences of the two previous steps, compute the initial and transition probability matrices required for computing the coding potential of the genome to be annotated.
  6. Update the parameter file with the parameters indicated in the params.config file and the matrices automatically generated from the protein-DNA matches.
  7. Run Geneid with the new parameter file and the protein-DNA matches from the previous steps as additional evidence. This is done in parallel for each independent sequence inside the genome FASTA file, and it consists of the following steps:
  • Pre-process the matches to join those that overlap and re-score them appropriately.
  • Run Geneid using the evidence and obtain the GFF3 file.
  • Remove the files from the internal steps.
  8. Concatenate all the outputs into a single GFF3 file, sorted by coordinates.
  9. Add information from the proteins matching the predicted genes.
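
To make step 5 concrete, here is a toy sketch of deriving initial and transition probabilities from training sequences; geneid's actual coding potential matrices are higher-order and frame-aware, so this only shows the general idea:

# Toy sketch of step 5: initial and transition probability matrices from
# training sequences. geneid's real matrices are higher-order and
# frame-aware; this only illustrates the idea.
from collections import Counter

def markov_tables(seqs, k=2):
    """Return P(first k-mer) and P(next base | preceding k-mer)."""
    init, trans, context = Counter(), Counter(), Counter()
    for s in seqs:
        init[s[:k]] += 1
        for i in range(len(s) - k):
            trans[(s[i:i + k], s[i + k])] += 1
            context[s[i:i + k]] += 1
    total = sum(init.values())
    p_init = {kmer: n / total for kmer, n in init.items()}
    p_trans = {(ctx, b): n / context[ctx] for (ctx, b), n in trans.items()}
    return p_init, p_trans

# In the pipeline, one model would be trained on the putative coding
# sequences and another on the putative introns.
coding = ["ATGGCCATTGTAATG", "ATGAAACGCATTAGC"]
p_init, p_trans = markov_tables(coding)
print(p_init["AT"], p_trans[("AT", "G")])   # 1.0 0.6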

DETAILS:

  • The name of the sequences in the FASTA file cannot contain unusual characters.

  • The input genome file must be a gzip-compressed FASTA file. (.fa.gz)

  • The parameter file is always auto-trained.

  • It is recommended to clone the repository and then run the pipeline from there.

  • The output of the predictions is stored in the path indicated when running the pipeline, in the following structure {provided outdir}/species/{taxid of the species}.

  • If you are running the pipeline multiple times, it is recommended that you define a directory for downloading the docker/singularity images to avoid having to download them multiple times. See singularity.cacheDir variable in nextflow.config.

  • If you have used Geneid in the past and have manually trained a parameter file, we are open to receiving it and sharing it in our repositories, giving credit to the users who generated it; the complete list of the available parameter files can be seen here.

Contact us at [email protected].

Follow us on Twitter (@GuigoLab) for updates on the article describing our method, and contact us with any questions or suggestions.

geneidx's People

Contributors

emiliorighi, ferriolcalvet


geneidx's Issues

EmptyDataError and empty annotation file

Hi Ferriol,

I hope you're doing well. I've been trying to use the geneidx Nextflow pipeline that you developed, but I've run into a couple of issues that I wanted to bring to your attention.

  1. I encountered an error when trying to predict genes in the genome assembly GCA_000002985.3 (C. elegans). The pipeline failed with the following error message: pandas.errors.EmptyDataError: No columns to parse from file at the "pyComputeIntrons" subworkflow.

  2. I also noticed that when I tried to annotate only chromosome V, the pipeline completed successfully, but the annotation file was empty, even though we know there are thousands of genes in chromosome V.

I would really appreciate it if you could look into these issues and let me know if there's anything I can do to work around them, or if there's a fix you can implement.
Let me know if you need more information or data to help reproduce the problem.

Thanks in advance,
Pablo

Run plain geneid

Add nextflow parameter to run plain geneid after geneid parameter file selection.

geneidx tries to write in location where it should only read & taxon id not found

Hi,

I tried to run geneidx. With singularity, I failed to pull the image - but possibly that's an issue on my own server.

Error executing process > 'param_value_selection_workflow:getParamName (7240)'

Caused by:
  Failed to pull singularity image
  command: singularity pull  --name ferriolcalvet-geneidx.img.pulling.1679713549962 docker://ferriolcalvet/geneidx > /dev/null
  status : 2
  message:
    Usage: singularity [options]
    
    singularity: error: no such option: --name

Here's my output of singularity --version (installed via apt-get on Ubuntu):

Singularity 1.00 (commit: 2ebc2f3f2059b96885416167363bde2e27ece106)
Running under Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
pygame 2.1.2 (SDL 2.0.20, Python 3.10.6)
Hello from the pygame community. https://www.pygame.org/contribute.html

With docker, I execute with root permissions. The image can be pulled. The problem is that apparently the container tries to write in places where the input data sits.

Here is my command:

sudo nextflow run main.nf -profile docker --genome /nas-hs/projs/data/Drosophila_melanogaster/data/genome.fasta.masked.gz --taxid 7240 --outdir .

Here is my error message:

Error executing process > 'UncompressFASTA (genome.fasta.masked.gz)'

Caused by:
  Process `UncompressFASTA (genome.fasta.masked.gz)` terminated with an error exit status (125)

Command executed:

  if [ ! -s  genome.fasta.masked ]; then
      echo "unzipping genome genome.fasta.masked.gz"
      gunzip -c genome.fasta.masked.gz > genome.fasta.masked;
  fi

Command exit status:
  125

Command output:
  (empty)

Command error:
  docker: Error response from daemon: error while creating mount source path '/nas-hs/projs/data/Drosophila_melanogaster/data': mkdir /nas-hs: read-only file system.
  time="2023-03-25T04:07:50+01:00" level=error msg="error waiting for container: context canceled"

Work dir:
  /home/katharina/git/geneidx/work/66/4a183e6f9fa512936ae52466a8b48b

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

It is rather clear how to fix this (don't write the unpacked genome anywhere but the output directory).

Next, I copied the gzipped genome into a folder where root has writing permissions and tried again. I am using reference species Drosophila simulans with taxon id 7240. That is indeed the taxon id in NCBI Taxonomy

Here is my new call:

sudo nextflow run main.nf -profile docker --genome genome.fasta.masked.gz --taxid 7240 --outdir .

It fails to find the taxon for some obscure reason. There are most definitely D. simulans proteins at NCBI for this taxon; I have the protein set on my hard drive, too. I just don't know how to start the pipeline with a local protein set. Or maybe it finds the proteins, but something goes wrong looking for geneid parameters for this taxon?

Error messages:

N E X T F L O W  ~  version 22.10.7
Launching `main.nf` [clever_bardeen] DSL2 - revision: bb4f07340a

GeneidX
=============================================
output			: /home/katharina/git/geneidx/output
genome			: genome.fasta.masked.gz
taxon			: 7240

WARN: A process with name 'getFASTA2' is defined more than once in module script: /home/katharina/git/geneidx/subworkflows/CDS_estimates.nf -- Make sure to not define the same function as process
executor >  local (4)
[fe/ef4540] process > UncompressFA... [100%] 1 of 1, failed: 1 ✘
[d1/3e16a2] process > prot_down_wo... [100%] 1 of 1, failed: 1 ✘
[88/8e185c] process > param_select... [100%] 1 of 1, failed: 1 ✘
[56/3eda8f] process > param_value_... [100%] 1 of 1, failed: 1 ✘
Execution cancelled -- Finishing pending tasks before exit
Oops ...

Error executing process > 'param_selection_workflow:getParamName (7240)'

Caused by:
  Process `param_selection_workflow:getParamName (7240)` terminated with an error exit status (127)

Command executed:

  #!/usr/bin/env python3
  # coding: utf-8
  
  import os, sys
  import pandas as pd
  import requests
  from lxml import etree
  
  
  # Define an alternative in case everything fails
  selected_param = "Homo_sapiens.9606.param"
  
  
  # Define functions
  def choose_node(main_node, sp_name):
      for i in range(len(main_node)):
          if main_node[i].attrib["scientificName"] == sp_name:
              #print(main_node[i].attrib["rank"],
              #    main_node[i].attrib["taxId"],
              #    main_node[i].attrib["scientificName"])
              return main_node[i]
      return None
  
  
  # given a node labelled with the species, and with
  # lineage inside it returns the full path of the lineage
  def sp_to_lineage_clean ( sp_sel ):
      lineage = []
  
      if sp_sel is not None:
          lineage.append(sp_sel.attrib["taxId"])
  
      for taxon in sp_sel:
          #print(taxon.tag, taxon.attrib)
          if taxon.tag == 'lineage':
              lin_pos = 0
              for node in taxon:
                  if "rank" in node.attrib.keys():
                      lineage.append( node.attrib["taxId"] )
                  else:
                      lineage.append( node.attrib["taxId"] )
                  lin_pos += 1
      return(lineage)
  
  
  
  def get_organism(taxon_id):
      response = requests.get(f"https://www.ebi.ac.uk/ena/browser/api/xml/{taxon_id}?download=false") ##
      if response.status_code == 200:
          root = etree.fromstring(response.content)
          species = root[0].attrib
          lineage = []
          for taxon in root[0]:
              if taxon.tag == 'lineage':
                  for node in taxon:
                      lineage.append(node.attrib["taxId"])
      return lineage
  
  
  if 0:
      ###
      # We want to update the lists as new parameters may have been added
      ###
  
      # List files in directory
      list_species_taxid_params = os.listdir('Parameter_files.taxid/*.param')
      list_species_taxid = [x.split('.')[:2] for x in list_species_taxid_params]
  
  
      # Put the list into a dataframe
      data_names = pd.DataFrame(list_species_taxid, columns = ["Species_name", "taxid"])
  
  
      # Generate the dataframe with the filename and lineage information
      list_repeats_taxids = []
      for species_no_space, taxid in zip(data_names.loc[:,"Species_name"], data_names.loc[:,"taxid"]):
          species = species_no_space.replace("_", " ")
          response = requests.get(f"https://www.ebi.ac.uk/ena/browser/api/xml/textsearch?domain=taxon&query={species}")
          xml = response.content
          if xml is None or len(xml) == 0:
              continue
  
          root = etree.fromstring(xml)
      #     print(species)
          sp_sel = choose_node(root, species)
          if sp_sel is None:
              continue
      #     print(sp_sel.attrib.items())getParamName
          lineage_sp = sp_to_lineage_clean(sp_sel)
  
          param_species = f"{species_no_space}.{taxid}.param"
          list_repeats_taxids.append((species, taxid, param_species, lineage_sp))
          # print((ens_sp, species, link_species, lineage_sp))
  
  
      # Put the information into a dataframe
      data = pd.DataFrame(list_repeats_taxids, columns = ["species", "taxid", "parameter_file", "taxidlist"])
  
      data.to_csv("Parameter_files.taxid/params_df.tsv", sep = "\t", index = False)
      # print("New parameters saved")
  
  
  
  else:
      ###
      # We want to load the previously generated dataframe
      ###
      data = pd.read_csv("Parameter_files.taxid/params_df.tsv", sep = "\t")
  
      def split_n_convert(x):
          return [int(i) for i in x.replace("'", "").strip("[]").split(", ")]
      data.loc[:,"taxidlist"] = data.loc[:,"taxidlist"].apply(split_n_convert)
  
  # Following either one or the other strategy we now have N parameters to choose.
  # print(data.shape[0], "parameters available to choose")
  
  
  
  ###
  # Separate the lineages into a single taxid per row
  ###
  exploded_df = data.explode("taxidlist")
  exploded_df.columns = ["species", "taxid_sp", "parameter_file", "taxid"]
  exploded_df.loc[:,"taxid"] = exploded_df.loc[:,"taxid"].astype(int)
  
  ###
  # Get the species of interest lineage
  ###
  query = pd.DataFrame(get_organism(int(7240)))
  query.columns = ["taxid"]
  query.loc[:,"taxid"] = query.loc[:,"taxid"].astype(int)
  # print(query)
  
  ###
  # Intersect the species lineage with the dataframe of taxids for parameters
  ###
  intersected_params = query.merge(exploded_df, on = "taxid")
  # print(intersected_params.shape)
  
  ###
  # If there is an intersection, select the parameter whose taxid appears
  #   less times, less frequency implies more closeness
  ###
  if intersected_params.shape[0] > 0:
      #print(intersected_params.loc[:,"taxid"].value_counts().sort_values())
  
      taxid_closest_param = intersected_params.loc[:,"taxid"].value_counts().sort_values().index[0]
      #print(taxid_closest_param)
  
      selected_param = intersected_params[intersected_params["taxid"] == taxid_closest_param].loc[:,"parameter_file"].iloc[0]
      print("/home/katharina/git/geneidx/data/Parameter_files.taxid/", selected_param, sep = "/", end = '')

Command exit status:
  127

Command output:
  (empty)

Command error:
  docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/etc/shadow" to rootfs at "/etc/shadow": mount /etc/shadow:/etc/shadow (via /proc/self/fd/6), flags: 0x5001: no such file or directory: unknown.
  time="2023-03-25T04:11:47+01:00" level=error msg="error waiting for container: context canceled"

Work dir:
  /home/katharina/git/geneidx/work/88/8e185cac6455345234538354fbf905

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

Best wishes,

Katharina

How to run geneidx on a local workstation via docker?

Hi,
I have this humble question, newbie to this domain of container.
I have all the prerequisites already installed:

  • Docker
  • Singularity
  • Nextflow
  • and the input data

The first thing I did was google for the docker container:
docker pull ferriolcalvet/geneidx
It seems to have installed successfully.
This is the log from the terminal:
latest: Pulling from ferriolcalvet/geneidx 23858da423a6: Pulling fs layer 326f452ade5c: Pulling fs layer a42821cd14fb: Pulling fs layer 8471b75885ef: Pull complete 8ffa7aaef404: Pull complete 15132af73342: Pull complete f4aeaae2f8da: Pull complete 5d8aa591a389: Pull complete 4645a1f8dbad: Pull complete b8861be24c63: Pull complete 5067931dafb0: Pull complete 2af56fb3cf52: Pull complete 253e785a8a50: Pull complete f42d46aac9d9: Pull complete f9f8117fb535: Pull complete e15f4b46da28: Pull complete 99bd3a5629a5: Pull complete 27043e5ccd2d: Pull complete 1292524c3142: Pull complete f7b36a643fbe: Pull complete 860293db19cb: Pull complete f3dafd5c9e29: Pull complete b1325f591394: Pull complete 5065735ed845: Pull complete ab5a5ae1904e: Pull complete ad9050f7ac01: Pull complete Digest: sha256:de5f906aa7ac2fccc68235d3329eb89a06b811bc6cc7515035f1bc770127a1b5 Status: Downloaded newer image for ferriolcalvet/geneidx:latest docker.io/ferriolcalvet/geneidx:latest
Second, I used the nextflow command:
nextflow run guigolab/geneidx -with-docker --genome reference.fa.gz --taxid 562
and this is the message I got:
N E X T F L O W ~ version 22.10.0 Project guigolab/geneidx is currently stickied on revision: main -- you need to explicitly specify a revision with the option -r in order to use it

So how can I run geneidx?

Thank you in advance
Ben

Failing on sequence without introns?

Hello,
I wanted to try your pipeline on a small Wheat sequence, but I faced several issues, starting from the pyComputeIntrons process. Because intron_l is empty (I posit it's because there are no introns in this little sequence?), the pandas indexing raises a KeyError on dd_intron = dd_intron[[0,4,1,3]].
I fixed this in a dirty way by specifying columns=range(5) in the previous pd.DataFrame call.
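
For reference, the failure and the workaround reduce to this minimal sketch (the column indices are taken from the error above):

# Minimal reproduction: an empty list gives a DataFrame with no columns,
# so positional column selection raises KeyError unless the columns are
# declared up front.
import pandas as pd

intron_l = []                             # no introns found in the sequence
# pd.DataFrame(intron_l)[[0, 4, 1, 3]]    # -> KeyError
dd_intron = pd.DataFrame(intron_l, columns=range(5))
dd_intron = dd_intron[[0, 4, 1, 3]]       # works, selects empty columns
print(dd_intron.shape)                    # (0, 4)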

Then, findORF fails, because <name>_gffread.fa is empty (so not a fasta file):

RuntimeError: <name>_gffread.fa is not fasta or fastq sequence file

I am not sure what I could do to avoid this issue.

For reference, the sequence I used is available on https://framagit.org/geniomhe/genomics/structural-genomics-project-r4-2023/-/blob/main/data/region4.fasta, and I used the following command:

nextflow run geneidx/main.nf -profile docker \
    --tsv ./data/geneidx.tsv \
    --column_taxid_value TAXID \
    --column_path_value PATH \
    --column_id_value ID \
    --outdir results/geneidx

on the following TSV:

ID	PATH	TAXID
region4	data/region4.fasta.gz	4565

"command not found"

Hi, I'm using WSL2 Ubuntu for Windows 10.

Ubuntu 22.04.3 LTS
Windows 10 Version 10.0.19045.3803

I get the error below with all the methods I've tried: downloading the geneidx release from GitHub (with or without using a terminal) and then running it, or using the first method given in "Running GeneidX example" in the readme.

$ nextflow run ./main.nf -profile docker \
                                        --tsv /mnt/d/biotools-linux/geneidx/data/assemblies.tsv \
                                        --column_taxid_value GCA_905340225.1 \
                                        --column_path_value https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_905340225.1?download=true&gzip=true \
                                        --column_id_value 938171 \
                                        --outdir /mnt/d/biotools-linux/geneidx/output
[1] 543
--column_id_value: command not found
username@MSI:/mnt/d/biotools-linux/geneidx$ N E X T F L O W  ~  version 23.10.0
Launching `./main.nf` [curious_ride] DSL2 - revision: 2c57827f27

GENOMEMASKING+GENEID+BLASTx - NextflowPipeline
=============================================
output path: /mnt/d/biotools-linux/geneidx/output
tsv input path: /mnt/d/biotools-linux/geneidx/data/assemblies.tsv
id column: Assembly Accession
taxid column: GCA_905340225.1
path column: https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_905340225.1?download=true
mask genomes? false

After this the terminal is stuck (even after 3-4 hours) and there doesn't seem to be any activity until I type a random prompt, like s.

s
s: command not found

[1]+  Stopped                 nextflow run ./main.nf -profile docker --tsv /mnt/d/biotools-linux/geneidx/data/assemblies.tsv --column_taxid_value GCA_905340225.1 --column_path_value https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_905340225.1?download=true

I've tried excluding the column_id_value option, and this time it said "command not found" for outdir. I excluded both and got a similar message again.

$ nextflow run ./main.nf -profile docker \
                                        --tsv /mnt/d/biotools-linux/geneidx/data/assemblies.tsv \
                                        --column_taxid_value GCA_905340225.1 \
                                        --column_path_value https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_905340225.1?download=true&gzip=true
[1] 874
username@MSI:/mnt/d/biotools-linux/geneidx$ N E X T F L O W  ~  version 23.10.0
Launching `./main.nf` [prickly_leavitt] DSL2 - revision: 2c57827f27

GENOMEMASKING+GENEID+BLASTx - NextflowPipeline
=============================================
output path: /mnt/d/biotools-linux/geneidx/output
tsv input path: /mnt/d/biotools-linux/geneidx/data/assemblies.tsv
id column: Assembly Accession
taxid column: GCA_905340225.1
path column: https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_905340225.1?download=true
mask genomes? false

Then I have to do the same random typing to stop the process again, even though there is no activity, and I get the same "Stopped" message.

Edit: Confirmed it happens in v2.0.0 too.

Low BUSCO score on predicted geneset

Hi Ferriol!

I hope you're doing well. I recently tried out your gene prediction tool and wanted to give you some feedback. I found that the results were not as good as those obtained with more popular tools. In fact, the completeness of the gene predictions was only around 34%, compared to ~90% using other tools.
I obtained the genemark and augustus scores from the genesets generated by the braker pipeline. For the protein mode I used Swissprot as reference. I ran geneidx with default parameters. It was very fast indeed. Do you know what could improve the results?

[image: comparison of the gene predictions]

All the best,
Pablo

Protein set and parameter file in the output folder

The current output folder structure makes it complicated to understand which proteins and which parameter file were used when multiple species are run from the same input TSV. This is especially true when multiple species use the same protein evidence/parameter file. I suggest adding a symlink to the protein database, as well as the parameter file, in the species-specific folder.
