I tried to run geneidx. With Singularity, pulling the image failed, but that is possibly an issue on my own server; judging from the output below (pygame starting up), the `singularity` binary on my PATH may not even be the container runtime:
Singularity 1.00 (commit: 2ebc2f3f2059b96885416167363bde2e27ece106)
Running under Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
pygame 2.1.2 (SDL 2.0.20, Python 3.10.6)
Hello from the pygame community. https://www.pygame.org/contribute.html
With Docker, executed with root permissions, the image can be pulled. The problem is that the container apparently tries to write to the directory where the input data sits.
The fix seems straightforward: write the unpacked genome only to the output directory, never next to the input.
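The workaround I have in mind is simply to decompress into the (writable) output directory instead of next to the input; a minimal sketch with the standard library (file and directory names are just placeholders):

```python
import gzip
import shutil
from pathlib import Path

def unpack_genome(genome_gz: str, output_dir: str) -> Path:
    """Decompress a gzipped genome into output_dir, leaving the input directory untouched."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Strip the trailing .gz to get the uncompressed file name
    target = out_dir / Path(genome_gz).name.removesuffix(".gz")
    with gzip.open(genome_gz, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```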
Next, I copied the gzipped genome into a folder where root has write permission and tried again. I am using the reference species Drosophila simulans with taxon ID 7240, which is indeed its taxon ID in the NCBI Taxonomy.
The pipeline fails to find the taxon for some obscure reason. There are most definitely D. simulans proteins at NCBI for this taxon; I have the protein set on my hard drive, too. I just don't know how to start the pipeline with a local protein set. Or maybe it does find the proteins, but something goes wrong while looking up the geneid parameters for this taxon?
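For what it's worth, as far as I can tell from the failing script below, the parameter lookup boils down to: intersect the query species' lineage with the lineages of the bundled parameter files, then pick the taxid that occurs fewest times across files (rarer means more specific, hence a closer relative). A dependency-free sketch of that idea with made-up taxids (this is my paraphrase of the logic, not the pipeline's own code, which uses pandas):

```python
from collections import Counter

def pick_closest_param(query_lineage, param_lineages, fallback="Homo_sapiens.9606.param"):
    """param_lineages: mapping of parameter-file name -> list of lineage taxids.
    Returns the parameter file sharing the rarest taxid with the query lineage."""
    # Count, across all parameter files, how often each shared taxid occurs
    shared = Counter()
    for lineage in param_lineages.values():
        for taxid in lineage:
            if taxid in query_lineage:
                shared[taxid] += 1
    if not shared:
        return fallback  # no overlap at all: fall back to the default parameters
    # The rarest shared taxid marks the most specific common ancestor
    rarest = min(shared, key=shared.get)
    for name, lineage in param_lineages.items():
        if rarest in lineage:
            return name
    return fallback

# Toy example: D. simulans (7240) shares deeper taxids with the melanogaster file
params = {
    "Drosophila_melanogaster.7227.param": [33208, 6656, 7147, 7227],
    "Homo_sapiens.9606.param": [33208, 40674, 9606],
}
print(pick_closest_param({33208, 6656, 7147, 7240}, params))
# -> Drosophila_melanogaster.7227.param
```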
N E X T F L O W ~ version 22.10.7
Launching `main.nf` [clever_bardeen] DSL2 - revision: bb4f07340a
GeneidX
=============================================
output : /home/katharina/git/geneidx/output
genome : genome.fasta.masked.gz
taxon : 7240
WARN: A process with name 'getFASTA2' is defined more than once in module script: /home/katharina/git/geneidx/subworkflows/CDS_estimates.nf -- Make sure to not define the same function as process
executor > local (4)
[fe/ef4540] process > UncompressFA... [ 0%] 0 of 1
[- ] process > fix_chr_names -
[- ] process > compress_n_i... -
[d1/3e16a2] process > prot_down_wo... [ 0%] 0 of 1
[- ] process > prot_down_wo... -
[- ] process > build_protei... -
[- ] process > build_protei... -
[- ] process > alignGenome_... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[88/8e185c] process > param_select... [ 0%] 0 of 1
[- ] process > param_select... -
[56/3eda8f] process > param_value_... [ 0%] 0 of 1
[- ] process > param_value_... -
[- ] process > creatingPara... -
[- ] process > geneid_WORKF... -
[- ] process > geneid_WORKF... -
[- ] process > prep_concat -
[- ] process > concatenate_... -
[- ] process > gff3addInfo:... -
[- ] process > gff3addInfo:... -
[- ] process > gff3addInfo:... -
[- ] process > gff3addInfo:... -
[- ] process > gff34portal -
executor > local (4)
[fe/ef4540] process > UncompressFA... [100%] 1 of 1, failed: 1 ✘
[- ] process > fix_chr_names -
[- ] process > compress_n_i... -
[d1/3e16a2] process > prot_down_wo... [100%] 1 of 1, failed: 1 ✘
[- ] process > prot_down_wo... -
[- ] process > build_protei... -
[- ] process > build_protei... -
[- ] process > alignGenome_... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[- ] process > matchAssessm... -
[88/8e185c] process > param_select... [100%] 1 of 1, failed: 1 ✘
[- ] process > param_select... -
[56/3eda8f] process > param_value_... [100%] 1 of 1, failed: 1 ✘
[- ] process > param_value_... -
[- ] process > creatingPara... -
[- ] process > geneid_WORKF... -
[- ] process > geneid_WORKF... -
[- ] process > prep_concat -
[- ] process > concatenate_... -
[- ] process > gff3addInfo:... -
[- ] process > gff3addInfo:... -
[- ] process > gff3addInfo:... -
[- ] process > gff3addInfo:... -
[- ] process > gff34portal -
Execution cancelled -- Finishing pending tasks before exit
Oops ...
Error executing process > 'param_selection_workflow:getParamName (7240)'
Caused by:
Process `param_selection_workflow:getParamName (7240)` terminated with an error exit status (127)
Command executed:
#!/usr/bin/env python3
# coding: utf-8

import os, sys
import pandas as pd
import requests
from lxml import etree

# Define an alternative in case everything fails
selected_param = "Homo_sapiens.9606.param"

# Define functions
def choose_node(main_node, sp_name):
    for i in range(len(main_node)):
        if main_node[i].attrib["scientificName"] == sp_name:
            # print(main_node[i].attrib["rank"],
            #       main_node[i].attrib["taxId"],
            #       main_node[i].attrib["scientificName"])
            return main_node[i]
    return None

# given a node labelled with the species, and with
# lineage inside it returns the full path of the lineage
def sp_to_lineage_clean(sp_sel):
    lineage = []
    if sp_sel is not None:
        lineage.append(sp_sel.attrib["taxId"])
        for taxon in sp_sel:
            # print(taxon.tag, taxon.attrib)
            if taxon.tag == 'lineage':
                lin_pos = 0
                for node in taxon:
                    if "rank" in node.attrib.keys():
                        lineage.append(node.attrib["taxId"])
                    else:
                        lineage.append(node.attrib["taxId"])
                    lin_pos += 1
    return lineage

def get_organism(taxon_id):
    response = requests.get(f"https://www.ebi.ac.uk/ena/browser/api/xml/{taxon_id}?download=false")
    if response.status_code == 200:
        root = etree.fromstring(response.content)
        species = root[0].attrib
        lineage = []
        for taxon in root[0]:
            if taxon.tag == 'lineage':
                for node in taxon:
                    lineage.append(node.attrib["taxId"])
        return lineage

if 0:
    ###
    # We want to update the lists as new parameters may have been added
    ###
    # List files in directory
    list_species_taxid_params = os.listdir('Parameter_files.taxid/*.param')
    list_species_taxid = [x.split('.')[:2] for x in list_species_taxid_params]

    # Put the list into a dataframe
    data_names = pd.DataFrame(list_species_taxid, columns=["Species_name", "taxid"])

    # Generate the dataframe with the filename and lineage information
    list_repeats_taxids = []
    for species_no_space, taxid in zip(data_names.loc[:, "Species_name"], data_names.loc[:, "taxid"]):
        species = species_no_space.replace("_", " ")
        response = requests.get(f"https://www.ebi.ac.uk/ena/browser/api/xml/textsearch?domain=taxon&query={species}")
        xml = response.content
        if xml is None or len(xml) == 0:
            continue
        root = etree.fromstring(xml)
        # print(species)
        sp_sel = choose_node(root, species)
        if sp_sel is None:
            continue
        # print(sp_sel.attrib.items())
        lineage_sp = sp_to_lineage_clean(sp_sel)
        param_species = f"{species_no_space}.{taxid}.param"
        list_repeats_taxids.append((species, taxid, param_species, lineage_sp))
        # print((ens_sp, species, link_species, lineage_sp))

    # Put the information into a dataframe
    data = pd.DataFrame(list_repeats_taxids, columns=["species", "taxid", "parameter_file", "taxidlist"])
    data.to_csv("Parameter_files.taxid/params_df.tsv", sep="", index=False)
    # print("New parameters saved")
else:
    ###
    # We want to load the previously generated dataframe
    ###
    data = pd.read_csv("Parameter_files.taxid/params_df.tsv", sep=" ")
    def split_n_convert(x):
        return [int(i) for i in x.replace("'", "").strip("[]").split(", ")]
    data.loc[:, "taxidlist"] = data.loc[:, "taxidlist"].apply(split_n_convert)

# Following either one or the other strategy we now have N parameters to choose.
# print(data.shape[0], "parameters available to choose")

###
# Separate the lineages into a single taxid per row
###
exploded_df = data.explode("taxidlist")
exploded_df.columns = ["species", "taxid_sp", "parameter_file", "taxid"]
exploded_df.loc[:, "taxid"] = exploded_df.loc[:, "taxid"].astype(int)

###
# Get the species of interest lineage
###
query = pd.DataFrame(get_organism(int(7240)))
query.columns = ["taxid"]
query.loc[:, "taxid"] = query.loc[:, "taxid"].astype(int)
# print(query)

###
# Intersect the species lineage with the dataframe of taxids for parameters
###
intersected_params = query.merge(exploded_df, on="taxid")
# print(intersected_params.shape)

###
# If there is an intersection, select the parameter whose taxid appears
# less times, less frequency implies more closeness
###
if intersected_params.shape[0] > 0:
    # print(intersected_params.loc[:, "taxid"].value_counts().sort_values())
    taxid_closest_param = intersected_params.loc[:, "taxid"].value_counts().sort_values().index[0]
    # print(taxid_closest_param)
    selected_param = intersected_params[intersected_params["taxid"] == taxid_closest_param].loc[:, "parameter_file"].iloc[0]

print("/home/katharina/git/geneidx/data/Parameter_files.taxid/", selected_param, sep="/", end='')
Command exit status:
127
Command output:
(empty)
Command error:
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/etc/shadow" to rootfs at "/etc/shadow": mount /etc/shadow:/etc/shadow (via /proc/self/fd/6), flags: 0x5001: no such file or directory: unknown.
time="2023-03-25T04:11:47+01:00" level=error msg="error waiting for container: context canceled"
Work dir:
/home/katharina/git/geneidx/work/88/8e185cac6455345234538354fbf905
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
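Side note on the Docker error above: the exit status 127 seems to come from the container never starting, because a host path the run tried to bind-mount (`/etc/shadow`) could not be mounted. A trivial check for this on the host (the path list is just my guess from the error message; adjust it to whatever your `nextflow.config` Docker options actually mount):

```python
import os

def missing_mounts(paths, exists=os.path.exists):
    """Return the subset of bind-mount source paths that do not exist on the host."""
    return [p for p in paths if not exists(p)]

# Paths taken from the error message above; purely illustrative
print(missing_mounts(["/etc/shadow", "/etc/passwd"]))
```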