churchmanlab / genewalk

GeneWalk identifies relevant gene functions for a biological context using network representation learning

Home Page: https://churchman.med.harvard.edu/genewalk

License: BSD 2-Clause "Simplified" License

Python 99.51% Dockerfile 0.13% HTML 0.36%

functional-genomics machine-learning-algorithm

genewalk's Introduction

GeneWalk

GeneWalk determines for individual genes the functions that are relevant in a particular biological context and experimental condition. GeneWalk quantifies the similarity between vector representations of a gene and annotated GO terms through representation learning with random walks on a condition-specific gene regulatory network. Similarity significance is determined through comparison with node similarities from randomized networks.
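
The similarity and significance ideas described above can be illustrated with a small, self-contained sketch. It is not GeneWalk's implementation: the vectors, the null similarities, and the function names are made up for illustration, and the empirical p-value (the fraction of null similarities at least as large as the observed one) is an assumed, simplified stand-in for the comparison with randomized networks.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def empirical_p_value(observed, null_similarities):
    """Fraction of null similarities at least as large as the observed one."""
    n_at_least = sum(1 for s in null_similarities if s >= observed)
    return n_at_least / len(null_similarities)


# Hypothetical embeddings for one gene and one GO term.
gene_vec = [1.0, 0.0, 1.0]
go_vec = [1.0, 0.0, 0.0]

# Hypothetical similarities taken from randomized networks.
null_sims = [0.1, 0.2, 0.3, 0.8, 0.9]

sim = cosine_similarity(gene_vec, go_vec)
p = empirical_p_value(sim, null_sims)
print(f"similarity={sim:.3f}, empirical p-value={p:.2f}")
```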

Install GeneWalk

To install the latest release of GeneWalk (preferred):

pip install genewalk

To install the latest code from Github (typically ahead of releases):

pip install git+https://github.com/churchmanlab/genewalk.git

GeneWalk uses a number of resource files that it downloads as needed during runtime. To pre-download these resource files into the default resource folder (optional), run:

python -m genewalk.resources

Using GeneWalk

Gene list file

GeneWalk always requires as input a text file containing a list of genes of interest relevant to the biological context, for example differentially expressed genes from a sequencing experiment that compares an experimental condition with a control. GeneWalk supports gene list files containing HGNC human gene symbols, HGNC IDs, human Ensembl gene IDs, MGI mouse gene IDs, RGD rat gene IDs, or human or mouse Entrez IDs. GeneWalk internally maps these IDs to human genes.

For organisms other than human, mouse or rat, there are two options. The first is to map the genes to human orthologs yourself and then input the human ortholog list as described above. Use this strategy if you consider the organism sufficiently related to human. The second option is to provide an input gene file with custom gene IDs. These are not mapped to human genes. Use custom gene IDs for more divergent organisms, such as Drosophila, worm, yeast, plants or bacteria. In this case you must also provide a custom gene network with GO annotations as input. See the section Custom input networks for more details.

Each line in the gene input file contains a gene identifier of one of the above types.
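
A minimal sketch of producing such a file from Python is shown here. The helper name write_gene_list is hypothetical, and dropping blank entries and duplicates is a convenience choice in this sketch, not a documented GeneWalk requirement.

```python
from pathlib import Path


def write_gene_list(gene_ids, path):
    """Write one gene identifier per line, skipping blanks and duplicates.

    Returns the identifiers that were written, in their original order.
    """
    seen = set()
    cleaned = []
    for gene_id in gene_ids:
        gene_id = gene_id.strip()
        if gene_id and gene_id not in seen:
            seen.add(gene_id)
            cleaned.append(gene_id)
    Path(path).write_text("\n".join(cleaned) + "\n")
    return cleaned
```

For example, write_gene_list(["KRAS", "BRAF", "MYC"], "gene_list.txt") writes three lines that could then be passed via --genes with --id_type hgnc_symbol.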

GeneWalk command line interface

Once installed, GeneWalk can be run from the command line as genewalk, with a set of required and optional arguments. The required arguments include the project name, a path to a text file containing a list of genes, and an argument specifying the type of gene identifiers in the file.

Example

genewalk --project context1 --genes gene_list.txt --id_type hgnc_symbol

Below is the full documentation of the command line interface:

genewalk [-h] [--version] --project PROJECT --genes GENES --id_type
              {hgnc_symbol,hgnc_id,ensembl_id,mgi_id,rgd_id,entrez_human,entrez_mouse,custom}
              [--stage {all,node_vectors,null_distribution,statistics,visual}]
              [--base_folder BASE_FOLDER]
              [--network_source {pc,indra,edge_list,sif,sif_annot,sif_full}]
              [--network_file NETWORK_FILE] [--nproc NPROC]
              [--nreps_graph NREPS_GRAPH] [--nreps_null NREPS_NULL]
              [--alpha_fdr ALPHA_FDR] [--dim_rep DIM_REP] [--save_dw SAVE_DW]
              [--random_seed RANDOM_SEED]


required arguments:
  --version             Print the version of GeneWalk and exit.
  --project PROJECT     A name for the project which determines the folder
                        within the base folder in which the intermediate and
                        final results are written. Must contain only
                        characters that are valid in folder names.
  --genes GENES         Path to a text file with a list of differentially
                        expressed genes. The type of gene identifiers used in
                        the text file is provided in the id_type argument.
  --id_type {hgnc_symbol,hgnc_id,ensembl_id,mgi_id,rgd_id,entrez_human,entrez_mouse,custom}
                        The type of gene IDs provided in the text file in the
                        genes argument. Possible values are: hgnc_symbol,
                        hgnc_id, ensembl_id, mgi_id, rgd_id, entrez_human,
                        entrez_mouse, and custom. If custom, a network_source
                        of sif_annot or sif_full must be used.

optional arguments:
  --stage {all,node_vectors,null_distribution,statistics,visual}
                        The stage of processing to run. Default: all
  --base_folder BASE_FOLDER
                        The base folder used to store GeneWalk temporary and
                        result files for a given project. Default:
                        ~/genewalk
  --network_source {pc,indra,edge_list,sif,sif_annot,sif_full}
                        The source of the network to be used. Possible values
                        are: pc, indra, edge_list, sif, sif_annot, and
                        sif_full. In case of indra, edge_list, sif, sif_annot,
                        and sif_full, the network_file argument must be
                        specified. Default: pc
  --network_file NETWORK_FILE
                        If network_source is indra, this argument points to a
                        Python pickle file in which a list of INDRA Statements
                        constituting the network is contained. In case
                        network_source is edge_list, sif, sif_annot, or
                        sif_full, the network_file argument points to a text
                        file representing the network. See README section
                        Custom input networks for full description of file
                        format requirements.
  --nproc NPROC         The number of processors to use in a multiprocessing
                        environment. Default: 1
  --nreps_graph NREPS_GRAPH
                        The number of repeats to run when calculating node
                        vectors on the GeneWalk graph. Default: 3
  --nreps_null NREPS_NULL
                        The number of repeats to run when calculating node
                        vectors on the random network graphs for constructing
                        the null distribution. Default: 3
  --alpha_fdr ALPHA_FDR
                        The false discovery rate to use when outputting the
                        final statistics table. If 1 (default), all
                        similarities are output, otherwise only the ones whose
                        false discovery rate is below this parameter are
                        included. Default: 1
                        For visualization a default value of 0.1 for both global
                        and gene-specific plots is used. Lower this value to 
                        increase the stringency of the regulator gene selection 
                        procedure.
  --dim_rep DIM_REP     Dimension of vector representations (embeddings). This 
                        value should only be increased if genewalk with the 
                        default value generates no statistically significant 
                        results, for instance with very large (>2500) input 
                        gene lists. Alternatively, it can be decreased in case 
                        (nearly) all GO annotations are significant, for 
                        instance with very short gene lists. Default: 8
  --save_dw SAVE_DW     If True, the full DeepWalk object for each repeat is
                        saved in the project folder. This can be useful for
                        debugging but the files are typically very large.
                        Default: False
  --random_seed RANDOM_SEED
                        If provided, the random number generator is seeded
                        with the given value. This should only be used if the
                        goal is to deterministically reproduce a prior result
                        obtained with the same random seed.

Output files

GeneWalk automatically creates a genewalk folder in the user's home folder (or the user specified base_folder). When running GeneWalk, one of the required inputs is a project name. A sub-folder is created for the given project name where all intermediate and final results are stored. The files stored in the project folder are:

genewalk_results.csv - The main results table, a comma-separated values text file. See below for detailed description.
genes.pkl - A processed representation of the given gene list, in Python pickle (.pkl) binary file format.
multi_graph.pkl - A networkx MultiGraph representing the GeneWalk network which was assembled based on the given list of genes, an interaction network, GO annotations, and the GO ontology.
deepwalk_node_vectors_*.pkl - A set of learned node vectors for each analysis repeat for the graph.
deepwalk_node_vectors_rand_*.pkl - A set of learned node vectors for each analysis repeat for a random graph.
genewalk_rand_simdists.pkl - Distributions constructed from repeats.
deepwalk_*.pkl - A DeepWalk object for each analysis repeat on the graph (only present if save_dw argument is set to True).
deepwalk_rand_*.pkl - A DeepWalk object for each analysis repeat on a random graph (only present if save_dw argument is set to True).

Figure files

GeneWalk also automatically generates figures to visualize its results in the project/figures sub-folder:

index.html: an HTML page that includes all the figures generated, as described below.
barplots with GO annotations ranked by relevance for each input gene that GeneWalk was able to generate results for. The filenames contain the corresponding human gene symbol and input gene id: barplot_[symbol]_[gene id]_x_mlog10global_padj_y_GO.png.
regulators_x_gene_con_y_frac_rel_go(.png and .pdf): scatter plot to identify regulator genes of interest. These have a large gene connectivity and high fraction of relevant GO annotations. For more information see our publication.
genewalk_regulators.csv: list with regulator genes that are named in the regulators scatterplot.
moonlighters_x_go_con_y_frac_rel_go(.png and .pdf): scatter plot to identify moonlighting genes: genes with many GO annotations of which a low fraction are relevant. For more information see our publication.
genewalk_moonlighters.csv: list with moonlighting genes that are named in the moonlighting scatterplot.
genewalk_scatterplots.csv: data corresponding to the regulator and moonlighter scatter plots. This file can be used for further gene prioritization analyses.

GeneWalk results file description

genewalk_results.csv is the main GeneWalk output table, a comma-separated values text file with the following column headers:

hgnc_id - human gene HGNC identifier.
hgnc_symbol - human gene symbol.
go_name - GO term name.
go_id - GO term identifier.
go_domain - Ontology domain that GO term belongs to (biological process, cellular component or molecular function).
ncon_gene - number of connections to gene in GeneWalk network.
ncon_go - number of connections to GO term in GeneWalk network.
global_padj - false discovery rate (FDR) adjusted p-value of the similarity between gene and GO term, when correcting for testing over all gene-GO term pairs present in the output file. This is the key statistic that indicates how relevant the gene-GO term pair (gene function) is in the particular biological context or tested condition. Use global_padj for global analyses that consider all the GeneWalk output simultaneously, such as gene prioritization procedures. GeneWalk determines an adjusted p-value with Benjamini-Hochberg FDR correction for multiple testing of all connected GO terms for each nreps_graph repeat analysis. The value presented here is the average (mean estimate) over all p-adjust values from all nreps_graph repeat analyses.
gene_padj - FDR adjusted p-value of the similarity between gene and GO term, when correcting for multiple testing over all GO annotations of that gene. This is the key statistic when investigating the functions of one (or a few) pre-defined gene(s) of interest. gene_padj determines the statistical significance of each GO annotation (function) and can be used to sensitively rank GO annotations by their relevance to the gene of interest in the particular biological context or tested condition. When you consider all (or many) input genes simultaneously, use global_padj instead. Average over nreps_graph repeat runs as for global_padj.
pval - p-value of gene - GO term similarity, not corrected for multiple hypothesis testing. Average over nreps_graph repeat runs.
sim - gene - GO term (cosine) similarity, average over nreps_graph repeat runs.
sem_sim - standard error on sim (mean estimate).
cilow_global_padj - lower bound of 95% confidence interval on global_padj (mean estimate) from the nreps_graph repeat analyses.
ciupp_global_padj - upper bound of 95% confidence interval on global_padj.
cilow_gene_padj - lower bound of 95% confidence interval on gene_padj (mean estimate) from the nreps_graph repeat analyses.
ciupp_gene_padj - upper bound of 95% confidence interval on gene_padj.
cilow_pval - lower bound of 95% confidence interval on pval (mean estimate) from the nreps_graph repeat analyses.
ciupp_pval - upper bound of 95% confidence interval on pval.
mgi_id, rgd_id, ensembl_id, entrez_human or entrez_mouse - in case one of these gene identifier types was provided as input, the GeneWalk results table starts with an additional column containing the input gene identifiers. In the case of mouse genes, the corresponding hgnc_id and hgnc_symbol refer to the human ortholog gene used for the GeneWalk analysis.
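
As a rough illustration of how the two adjusted p-value columns might be used, the sketch below filters the results table with Python's standard csv module. The function name significant_pairs is hypothetical, and the 0.1 threshold is only borrowed from the visualization default mentioned above; choose a cutoff that suits your analysis.

```python
import csv


def significant_pairs(results_csv, gene_symbol=None, alpha=0.1):
    """Return result rows below the FDR threshold, most significant first.

    With gene_symbol set, rows for that gene are filtered on gene_padj;
    otherwise all rows are filtered on global_padj.
    """
    column = "gene_padj" if gene_symbol else "global_padj"
    selected = []
    with open(results_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            if gene_symbol is not None and row["hgnc_symbol"] != gene_symbol:
                continue
            try:
                value = float(row[column])
            except ValueError:
                continue  # skip rows where the statistic is missing
            if value < alpha:  # NaN never passes this comparison
                selected.append(row)
    return sorted(selected, key=lambda row: float(row[column]))
```

Calling significant_pairs("genewalk_results.csv") gives a global view across all genes, while significant_pairs("genewalk_results.csv", gene_symbol="KRAS") ranks the annotations of a single gene of interest.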

Run time and stages of GeneWalk algorithm

Recommended number of processors (optional argument: nproc) for a short (1-2h) run time is 4:

genewalk --project context1 --genes gene_list.txt --id_type hgnc_symbol --nproc 4

By default GeneWalk will run with 1 processor, resulting in a longer overall run time: 6-12h. Given a list of genes, GeneWalk runs four stages of analysis:

Assembling a GeneWalk network and learning node vector representations by running DeepWalk on this network, for a specified number of repeats. Typical run time: one to a few hours.
Learning random node vector representations by running DeepWalk on a set of randomized versions of the GeneWalk network, for a specified number of repeats. Typical run time: one to a few hours.
Calculating statistics of similarities between genes and GO terms, and outputting the GeneWalk results in a table. Typical run time: a few minutes.
Visualization of the GeneWalk results generated in the project/figures subfolder. Typical run time: 1-10 mins depending on the number of input genes.

GeneWalk can either be run once to complete all these stages (default), or called separately for each stage (optional argument: stage). Recommended memory availability on your operating system: 16 GB or 32 GB of RAM.

GeneWalk outputs the uncertainty (95% confidence intervals) of the similarity significance (global and gene p-adjust). Depending on the context-specific network topology, this uncertainty can be large for individual gene-function associations. If the uncertainties are large overall, set the optional arguments nreps_graph to 10 (or more) and nreps_null to 10 to increase the algorithm's precision. This comes at the cost of an increased run time.
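
Running the stages one at a time, as described above, could be scripted along the following lines. This is only a sketch: the helper stage_commands is hypothetical, it uses only the command line options documented above, and it prints the commands rather than executing them.

```python
import shlex

# Stage names as listed for the --stage argument, in processing order.
STAGES = ["node_vectors", "null_distribution", "statistics", "visual"]


def stage_commands(project, genes, id_type, nproc=4):
    """Build one genewalk command (as an argument list) per stage."""
    base = [
        "genewalk",
        "--project", project,
        "--genes", genes,
        "--id_type", id_type,
        "--nproc", str(nproc),
    ]
    return [base + ["--stage", stage] for stage in STAGES]


commands = stage_commands("context1", "gene_list.txt", "hgnc_symbol")
for command in commands:
    # Print a shell-safe version of each command for inspection.
    print(" ".join(shlex.quote(part) for part in command))
```

To actually execute the stages in order, each argument list could be passed to subprocess.run(command, check=True), which stops at the first stage that fails.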

Custom input networks

By default, GeneWalk uses the PathwayCommons resource (--network_source pc) to create a human gene network. It then automatically adds edges representing GO annotations for input genes and ontology relations between GO terms. However, there are options to run GeneWalk with a custom network as an input.

First, specify the --network_source argument as one of the alternative sources: {indra, edge_list, sif, sif_annot, sif_full}.

If the input gene list uses custom gene IDs (--id_type custom), for instance from a model organism, choose sif_annot or sif_full as the network source.

Then, include the argument --network_file with the path to the custom network input file. The network file format has to correspond to the chosen --network_source, as follows.

The sif/sif_annot/sif_full options require the network file in a simple interaction file (SIF) format. Each row of the SIF text file consists of three comma-separated entries representing source, relation type, and target. The relation type is not explicitly used by GeneWalk, and can be set to an arbitrary label.

The difference between the sif, sif_annot, and sif_full options:

sif: the input SIF can contain only human gene-gene relations. Genes have to be encoded as human HGNC gene symbols (for example KRAS). GO annotations for genes, as well as ontology relations between GO terms are added automatically by GeneWalk.
sif_annot: the input SIF has to contain both gene-gene relations, and GO annotations for genes: rows where the source is a gene, and the target is a GO term. Use GO IDs with prefix (for example GO:0000186) to encode GO terms. Genes should be encoded the same as in the gene input list and do not have to correspond to human genes. Ontology relations between GO terms are then added automatically by GeneWalk.
sif_full: the input SIF has to contain all GeneWalk network edges: gene-gene relations, GO annotations for genes, and ontology relations between GO terms. GeneWalk does not add any more edges to the network. Encode genes and GO terms in the same manner as for sif_annot.

The edge_list option is a simplified version of the sif option. It requires a network text file that contains rows with two columns each, a source and a target. In other words, it omits the relation type column from the SIF format. Further file preparation requirements are the same as for the sif option.
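
A minimal sketch of writing such network files from Python follows. The gene names geneA and geneB and the relation labels are placeholders, the helper write_network_file is hypothetical, and the two-column edge list is assumed to use the same comma delimiter as the SIF format described above.

```python
import csv


def write_network_file(edges, path, with_relation=True):
    """Write (source, relation, target) edges as comma-separated rows.

    with_relation=True produces three-column SIF rows;
    with_relation=False drops the relation column (edge list layout).
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for source, relation, target in edges:
            if with_relation:
                writer.writerow([source, relation, target])
            else:
                writer.writerow([source, target])


# Placeholder content in the sif_annot style: one gene-gene relation
# and one GO annotation (gene as source, prefixed GO ID as target).
example_edges = [
    ("geneA", "interacts_with", "geneB"),
    ("geneA", "annotated_with", "GO:0000186"),
]
```

Calling write_network_file(example_edges, "network.sif") writes two three-column rows; the resulting path would then be supplied through --network_file together with the matching --network_source.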

The indra option requires as custom network input file a Python pickle file containing a list of INDRA Statements. These statements can represent human gene-gene, as well as gene-GO relations from which network edges are derived. Human GO annotations and ontology relations between GO terms are then added automatically by GeneWalk during network construction.

Further documentation

For a tutorial and more general information see the GeneWalk website.
For further code documentation see our readthedocs page.

Citation

Robert Ietswaart, Benjamin M. Gyori, John A. Bachman, Peter K. Sorger, and L. Stirling Churchman
GeneWalk identifies relevant gene functions for a biological context using network representation learning,
Genome Biology 22, 55 (2021). https://doi.org/10.1186/s13059-021-02264-8

Funding

This work was supported by National Institutes of Health grant 5R01HG007173-07 (L.S.C.), EMBO fellowship ALTF 2016-422 (R.I.), and DARPA grants W911NF-15-1-0544 and W911NF018-1-0124 (P.K.S.).

genewalk's Issues

Code for regulator/moonlighter plots

Hi,

I was curious if you have a python or R script for reproducing the scatterplots such that individual genes of interest can be labeled and the size of the plot can be adjusted for easier visualization for publication?

Thanks,
Nick

Mouse ids are not working with genewalk?

Hello,

I have tried to run a mouse gene list (around 250 genes from my differential expression data) with mouse_entrez IDs. On the axes of the regulator and moonlighter gene plots I got "Number of GO annotations per gene" on the X axis, but "the fraction of relevant GO terms" was 0 and "Connections with other genes" was also 0.
I was wondering whether the format of my mouse_entrez IDs is incorrect, or whether there are simply no GO terms associated with these genes (from the human orthologs that GeneWalk uses). Please let me know if the format of my entrez_mouse IDs is not correct (I have the list of all my genes in GeneSymbol format before I use their Entrez IDs for GeneWalk):
The command i run for genewalk is:

genewalk --project genewalkspermlongRNAseq --genes mouse_entrez_ids_list.txt --id_type mouse_entrez --nproc 8
I have included several files with this issue

folder with the plots that i received from genewalk (the plots of Regulator genes & Moonlight genes)
my raw list of mouse_entrez genes (as a zipped file, but it's basically a .csv file)
The error list that i receive
sperm_final_downregulated_logFCnegative2-5_entrez_mouse.zip
sperm_final_downregulated_logFC_negative2.5.csv.zip

INFO: [2021-03-26 14:58:49] genewalk.cli - Creating sperm_downregulated_entrez_mouse_26032021_anara folder at /home/anara/genewalk/sperm_downregulated_entrez_mouse_26032021_anara
INFO: [2021-03-26 14:58:49] genewalk.resources - Using /home/anara/genewalk/resources as resource folder.
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not get HGNC ID for MGI ID 3608415
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID AC102224.2
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not get HGNC ID for MGI ID 2142174
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID AC124977.2
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID AC133523.2
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID AC133902.3
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not get HGNC ID for MGI ID 2141341
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID AC138299.1
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID AC140364.2
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not get HGNC ID for MGI ID 3796981
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID AC158352.1
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID AC161409.5
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID AC164544.5
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID Astx2
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID Atcayos
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID CH25-501L8.4
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID CH36-169F23.5
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not get HGNC ID for MGI ID 107303
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID Dlx1as
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID Gm10217
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID Gm17571
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID Gm22690
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID Gm26381
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID Gm26545
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not get HGNC ID for MGI ID 3646599
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID Kat6b-ps1
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID Kif22-ps
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID Lincmd1
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not find an MGI mapping for Entrez ID LINE/L1?
WARNING: [2021-03-26 14:58:49] genewalk.gene_lists - Could not get HGNC ID for MGI ID 1888480
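
Several values reported in the warnings above (for example Gm10217 or Lincmd1) look like gene symbols or clone names, whereas Entrez Gene IDs are plain integers. A small pre-flight check of the gene list, sketched below with a hypothetical helper name, can flag such entries before a run. It only tests the format of each line and does not verify that an ID exists.

```python
import re

_NUMERIC_ID = re.compile(r"[0-9]+")


def split_entrez_ids(lines):
    """Separate purely numeric entries from everything else.

    Returns (numeric_ids, suspicious_entries); blank lines are ignored.
    """
    numeric, suspicious = [], []
    for line in lines:
        value = line.strip()
        if not value:
            continue
        if _NUMERIC_ID.fullmatch(value):
            numeric.append(value)
        else:
            suspicious.append(value)
    return numeric, suspicious
```

Applied to the lines of a gene list file, a non-empty second list suggests the file contains something other than Entrez IDs.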

Install issue

Hi all,

I'm having an issue with GeneWalk install as it gets to the point of installing gensim, which continues to error out with an exit status 1 regardless of being run in a virtual environment. I'm using MacOS v11.1 and Python v3.9.

The code I'm using is as follows:

$ python3 -m venv tutorial-env
$ source tutorial-env/bin/activate
$ pip install genewalk

The error I get (alongside the program log) is:

ERROR: Command errored out with exit status 1: /Users/npokoryznski/tutorial-env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/jc/43vp9sr55j714y8tzqs948580000gp/T/pip-install-3uba9rue/gensim/setup.py'"'"'; file='"'"'/private/var/folders/jc/43vp9sr55j714y8tzqs948580000gp/T/pip-install-3uba9rue/gensim/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /private/var/folders/jc/43vp9sr55j714y8tzqs948580000gp/T/pip-record-3x8walux/install-record.txt --single-version-externally-managed --compile --install-headers /Users/npokoryznski/tutorial-env/include/site/python3.9/gensim Check the logs for full command output.

I've tried to upgrade pip and setuptools to potentially resolve the issue but neither helped. I've tried a variety of other commands to circumvent administrative barriers etc as well, but since none seemed to resolve the issue I thought I would keep it simple. I'm very novice when it comes to python, bash, etc. so I fully expect to be making a trivial error somewhere here but I can't figure out the problem. Any help is appreciated!

Importing the numpy c-extensions failed.

Hi!

Thank you for the super interesting package.
I successfully installed it on my local Windows 10 machine using Anaconda.
Now I am currently trying to run genewalk on our cluster (Ubuntu, 2.6.32-431.20.3.el6.x86_64).

genewalk --project qki --genes /home/gitpycode/Documents/genes.csv --id_type mgi_id

I already set up the whole installation multiple times using virtual environments and trying different versions of python (3.5.0 and 3.7.0) and always get the same error message:

Traceback (most recent call last):
  File "/home/gitpycode/gwalk1/lib/python3.7/site-packages/numpy/core/__init__.py", line 17, in <module>
    from . import multiarray
  File "/home/gitpycode/gwalk1/lib/python3.7/site-packages/numpy/core/multiarray.py", line 14, in <module>
    from . import overrides
  File "/home/gitpycode/gwalk1/lib/python3.7/site-packages/numpy/core/overrides.py", line 7, in <module>
    from numpy.core._multiarray_umath import (
ImportError: PyCapsule_Import could not import module "datetime"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gitpycode/gwalk1/bin/genewalk", line 5, in <module>
    from genewalk.cli import main
  File "/home/gitpycode/gwalk1/lib/python3.7/site-packages/genewalk/cli.py", line 8, in <module>
    import numpy as np
  File "/home/gitpycode/gwalk1/lib/python3.7/site-packages/numpy/__init__.py", line 142, in <module>
    from . import core
  File "/home/gitpycode/gwalk1/lib/python3.7/site-packages/numpy/core/__init__.py", line 47, in <module>
    raise ImportError(msg)
ImportError:

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy c-extensions failed.
- Try uninstalling and reinstalling numpy.
- If you have already done that, then:
  1. Check that you expected to use Python3.7 from "/home/gitpycode/gwalk1/bin/python3",
     and that you have no directories in your PATH or PYTHONPATH that can
     interfere with the Python and numpy version "1.17.3" you're trying to use.
  2. If (1) looks fine, you can open a new issue at
     https://github.com/numpy/numpy/issues.  Please include details on:
     - how you installed Python
     - how you installed numpy
     - your operating system
     - whether or not you have multiple versions of Python installed
     - if you built from source, your compiler versions and ideally a build log

- If you're working with a numpy git repository, try `git clean -xdf`
  (removes all files not under version control) and rebuild numpy.

Note: this error has many possible causes, so please don't comment on
an existing issue about this - open a new one instead.

Original error was: PyCapsule_Import could not import module "datetime"

Segmentation fault

Can somebody help me to identify the problem?
Thank you for your help!

Bioconda integration

GeneWalk fits very well into Bioconda. Do you have plans to add it to Bioconda as well, so that it can also be installed via "conda install"? That would be great, the tool looks very promising!

Installation failed.

Hi there,

I created a fresh conda environment with
conda create -n genewalk python=3.5

and installed genewalk using
pip install git+https://github.com/churchmanlab/genewalk.git

but genewalk -h gave me this error:

Traceback (most recent call last):
  File "/exports/igmm/eddie/Glioblastoma-WGS/anaconda/envs/genewalk/bin/genewalk", line 5, in <module>
    from genewalk.cli import main
  File "/exports/igmm/eddie/Glioblastoma-WGS/anaconda/envs/genewalk/lib/python3.5/site-packages/genewalk/cli.py", line 11, in <module>
    from genewalk.nx_mg_assembler import load_network
  File "/exports/igmm/eddie/Glioblastoma-WGS/anaconda/envs/genewalk/lib/python3.5/site-packages/genewalk/nx_mg_assembler.py", line 6, in <module>
    from indra.databases import go_client
  File "/exports/igmm/eddie/Glioblastoma-WGS/anaconda/envs/genewalk/lib/python3.5/site-packages/indra/databases/__init__.py", line 7, in <module>
    from .identifiers import get_identifiers_url, parse_identifiers_url, \
  File "/exports/igmm/eddie/Glioblastoma-WGS/anaconda/envs/genewalk/lib/python3.5/site-packages/indra/databases/identifiers.py", line 302
    if not db_id.startswith(f'{db_ns}{colon}'):
                                            ^
SyntaxError: invalid syntax

Could you help me troubleshoot please?
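
The line flagged in the traceback uses an f-string, a syntax that was introduced in Python 3.6, so a Python 3.5 interpreter cannot parse it. A quick way to check an interpreter before installing is sketched below; the function name supports_fstrings is made up for this example, and the check says nothing about which Python versions GeneWalk or INDRA officially support.

```python
import sys


def supports_fstrings(version_info=sys.version_info):
    """Return True if the given Python version can parse f-strings (3.6+)."""
    return tuple(version_info[:2]) >= (3, 6)


print("f-strings supported:", supports_fstrings())
```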

Ensembl IDs with dots cause problems

Hi, another Ensembl ID related issue: when the IDs contain the ".X" version notation, like ".3", the mapping fails for all of them, so the pipeline runs through but produces an empty file at the end. I think this should be improved like this:

If none of the IDs could be mapped, abort right away.
For Ensembl IDs, if an ID ends with ".X", X being any integer, remove the suffix from the ID and then do the mapping.

We can of course also remove them, but it should be stated somewhere, and the nicest of course is to do it automatically for the user :)
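
Until this is handled automatically, the version suffix can be removed before running GeneWalk. The following is a minimal sketch of the second suggestion above; the helper name is hypothetical and the identifier in the usage note is just an example.

```python
import re

# Matches a trailing ".<integer>" version suffix.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_ensembl_version(ensembl_id):
    """Remove a trailing '.X' version suffix (X an integer), if present."""
    return _VERSION_SUFFIX.sub("", ensembl_id.strip())
```

For example, strip_ensembl_version("ENSG00000133703.12") gives "ENSG00000133703", while an ID without a suffix is returned unchanged.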

TypeError: Input graph is not a networkx graph type

Are there any additional inputs required for running GeneWalk on a list of human gene IDs? I am running the following command on a file with about 80 gene names from a DE experiment.

genewalk --project test --genes /results.txt --id_type hgnc_symbol

Which returned this:

INFO: [2019-09-11 12:01:29] genewalk.nx_mg_assembler - Adding gene edges from Pathway Commons to graph.
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/networkx/convert.py", line 46, in _prep_create_using
    create_using.clear()
TypeError: clear() missing 1 required positional argument: 'self'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda3/bin/genewalk", line 11, in <module>
    sys.exit(main())
  File "/anaconda3/lib/python3.6/site-packages/genewalk/cli.py", line 151, in main
    resource_manager=rm)
  File "/anaconda3/lib/python3.6/site-packages/genewalk/nx_mg_assembler.py", line 38, in load_network
    mg = PcNxMgAssembler(genes, resource_manager=resource_manager)
  File "/anaconda3/lib/python3.6/site-packages/genewalk/nx_mg_assembler.py", line 196, in __init__
    self.add_pc_edges()
  File "/anaconda3/lib/python3.6/site-packages/genewalk/nx_mg_assembler.py", line 214, in add_pc_edges
    create_using=nx.MultiGraph)
  File "/anaconda3/lib/python3.6/site-packages/networkx/convert_matrix.py", line 313, in from_pandas_edgelist
    g = _prep_create_using(create_using)
  File "/anaconda3/lib/python3.6/site-packages/networkx/convert.py", line 48, in _prep_create_using
    raise TypeError("Input graph is not a networkx graph type")

Any insight?

ENSEMBL IDs

Hi, a quick question: would it be possible to also support Ensembl IDs as input?

network source file

Hi,
I ran GeneWalk using the following command:
genewalk --project context1 --genes /home/amit/genewalk/gene_list_DE_ER_UBT.txt --id_type hgnc_id --stage all --base_folder /home/amit/genewalk/chigozie/ --network_source /home/amit/genewalk/chigozie/resources/PathwayCommons12.All.hgnc_current.sif --nproc 6

but it gave me an error:
genewalk: error: argument --network_source: invalid choice: '/home/amit/genewalk/chigozie/resources/PathwayCommons12.All.hgnc_current.sif' (choose from 'pc', 'indra', 'edge_list', 'sif')

I looked into the command argument and found these:
--network_source {pc,indra,edge_list,sif}
The source of the network to be used.Possible values
are: pc, indra, edge_list, and sif. In case of indra,
edge_list, and sif, the network_file argument must be
specified. Default: pc
--network_file NETWORK_FILE
If network_source is indra, this argument points to a
Python pickle file in which a list of INDRA Statements
constituting the network is contained. In case
network_source is edge_list or sif, the network_file
argument points to a text file representing the
network.
Could you kindly clarify where these files come from, or whether the user has to supply them?

regards,
Amit.
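
Going by the help text quoted above, --network_source takes one of the listed keywords and the file path belongs in --network_file. The sketch below mirrors only those two options with argparse to show the split; it is not GeneWalk's actual cli code, and the rule that a file is required for every non-pc source is an assumption read off the help text.

```python
import argparse


def build_network_parser():
    """Parser with just the two network options quoted in the help text."""
    parser = argparse.ArgumentParser(prog="genewalk-sketch")
    parser.add_argument("--network_source", default="pc",
                        choices=["pc", "indra", "edge_list", "sif"])
    parser.add_argument("--network_file", default=None)
    return parser


def parse_network_args(argv):
    """Parse argv and check that a file accompanies every non-pc source."""
    parser = build_network_parser()
    args = parser.parse_args(argv)
    if args.network_source != "pc" and args.network_file is None:
        # Assumed validation, based on the wording of the help text.
        parser.error("--network_file is required when --network_source is "
                     + args.network_source)
    return args


# The keyword goes to --network_source, the path goes to --network_file.
args = parse_network_args([
    "--network_source", "sif",
    "--network_file",
    "/home/amit/genewalk/chigozie/resources/PathwayCommons12.All.hgnc_current.sif",
])
```

Passing the path directly to --network_source, as in the command above, is rejected by the choices check, which is the "invalid choice" error shown.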

Illustration of the GeneWalk network

Hi,
Thanks a lot for your work. I ran a GeneWalk analysis and would like to visualise the network generated. I think that's saved in multi_graph.pkl? I tried to draw it with networkx & pyplot, but it didn't turn out very pretty. Do you have a script?

Thanks
Yizhou

INDRA script generation

Hello, I was looking to explore creating a custom INDRA input. I was wondering if you could provide the script used to create the INDRA network from the paper.

Thanks!

Criteria for barplot output

Hi all,

Quick question - can you explain the criteria used to determine which barplots are automatically generated in the output files? The github page says "barplots with GO annotations ranked by relevance for each input gene that GeneWalk was able to generate results for," but I'm not sure if this is supposed to explain why only a subset of the bar plots get produced and on what basis they are selected.

Thanks,
Nick

Error while downloading resources - PathwayCommons11.All.hgnc.sif.gz

Hi there,

I was trying to get genewalk going on my data. However, when running genewalk like this

genewalk --project test --genes ./input.csv --id_type hgnc_symbol --nproc 4

I'm presented with the following error message(s):

INFO: [2019-10-31 12:37:46] genewalk.cli - Creating project folder at /users/lule/genewalk/test
INFO: [2019-10-31 12:37:46] genewalk.resources - Using /users/lule/genewalk/resources as resource folder.
INFO: [2019-10-31 12:37:46] genewalk.resources - Downloading http://www.pathwaycommons.org/archives/PC2/v11/PathwayCommons11.All.hgnc.sif.gz and extracting into /users/lule/genewalk/resources/PathwayCommons11.All.hgnc.sif
Traceback (most recent call last):
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/http/client.py", line 936, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/socket.py", line 724, in create_connection
    raise err
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/socket.py", line 713, in create_connection
    sock.connect(sa)
OSError: [Errno 113] No route to host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/users/lule/.local/bin/genewalk", line 11, in <module>
    sys.exit(main())
  File "/users/lule/.local/lib/python3.6/site-packages/genewalk/cli.py", line 145, in main
    rm.download_all()
  File "/users/lule/.local/lib/python3.6/site-packages/genewalk/resources.py", line 53, in download_all
    self.get_pc()
  File "/users/lule/.local/lib/python3.6/site-packages/genewalk/resources.py", line 37, in get_pc
    download_gz(fname, url_pc)
  File "/users/lule/.local/lib/python3.6/site-packages/genewalk/resources.py", line 65, in download_gz
    urllib.request.urlretrieve(url, gz_file)
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/urllib/request.py", line 526, in open
    response = self._open(req, data)
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/urllib/request.py", line 544, in _open
    '_open', req)
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/urllib/request.py", line 1346, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/software/2020/software/python/3.6.6-foss-2018b/lib/python3.6/urllib/request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 113] No route to host>

Is the PathwayCommons11.All.hgnc.sif.gz file no longer available under the URL?

Thanks,
Lukas

Id_type for mouse genes

I have a text file of mouse genes with their MGI IDs (e.g. MGI:894679). However, when I ran genewalk I received errors like "genewalk.gene_lists - Could not get HGNC ID for MGI ID", although the code kept running. Is this an issue, and if so, should I convert the gene names into HGNC IDs instead?

IndexError: list index out of range

I ran genewalk from a virtual environment (to avoid conflicting version dependencies) and received the following error (“IndexError”):

Traceback (most recent call last):
File "/software/genewalk/Python-3.9.0-genewalk-1.4.0-venv/bin/genewalk", line 5, in <module>
from genewalk.cli import main
File "/software/genewalk/Python-3.9.0-genewalk-1.4.0-venv/lib/python3.9/site-packages/genewalk/cli.py", line 12, in <module>
from genewalk.gene_lists import read_gene_list
File "/software/genewalk/Python-3.9.0-genewalk-1.4.0-venv/lib/python3.9/site-packages/genewalk/gene_lists.py", line 8, in <module>
from indra.databases import hgnc_client, uniprot_client
File "/software/genewalk/Python-3.9.0-genewalk-1.4.0-venv/lib/python3.9/site-packages/indra/databases/uniprot_client.py", line 10, in <module>
from protmapper.uniprot_client import *
File "/software/genewalk/Python-3.9.0-genewalk-1.4.0-venv/lib/python3.9/site-packages/protmapper/__init__.py", line 16, in <module>
from protmapper.api import ProtMapper, MappedSite
File "/software/genewalk/Python-3.9.0-genewalk-1.4.0-venv/lib/python3.9/site-packages/protmapper/api.py", line 712, in <module>
uniprot_client._build_hgnc_mappings()
File "/software/genewalk/Python-3.9.0-genewalk-1.4.0-venv/lib/python3.9/site-packages/protmapper/uniprot_client.py", line 1221, in _build_hgnc_mappings
uniprot_id = row[6]
IndexError: list index out of range

Any thoughts on why this error arises?

Problem with genewalk

KeyError: 'ensembl_id'

Hi, I tried running the newest version with Ensembl IDs, and after around 1 hour of running time using 20 cores this is what I get, which looks like a bug to me:

...
INFO: [2019-09-23 17:22:32] gensim.models.base_any2vec - worker thread finished; awaiting finish of 2 more threads
INFO: [2019-09-23 17:22:32] gensim.models.base_any2vec - worker thread finished; awaiting finish of 1 more threads
INFO: [2019-09-23 17:22:32] gensim.models.base_any2vec - worker thread finished; awaiting finish of 0 more threads
INFO: [2019-09-23 17:22:32] gensim.models.base_any2vec - EPOCH - 5 : training on 164576000 raw words (164576000 effective words) took 119.9s, 1372717 effective words/s
INFO: [2019-09-23 17:22:32] gensim.models.base_any2vec - training on a 822880000 raw words (822880000 effective words) took 566.7s, 1452084 effective words/s
INFO: [2019-09-23 17:22:32] genewalk.deepwalk - Generating node vectors done in 610.30s
INFO: [2019-09-23 17:22:33] genewalk.cli - Saving into /home/carnold/genewalk/cll_test/deepwalk_node_vectors_rand_3.pkl...
INFO: [2019-09-23 17:22:41] genewalk.cli - Saving into /home/carnold/genewalk/cll_test/genewalk_rand_simdists.pkl...
INFO: [2019-09-23 17:22:41] genewalk.cli - Loading /home/carnold/genewalk/cll_test/multi_graph.pkl...
INFO: [2019-09-23 17:22:41] genewalk.cli - Loading /home/carnold/genewalk/cll_test/genes.pkl...
INFO: [2019-09-23 17:22:41] genewalk.cli - Loading /home/carnold/genewalk/cll_test/deepwalk_node_vectors_1.pkl...
INFO: [2019-09-23 17:22:42] genewalk.cli - Loading /home/carnold/genewalk/cll_test/deepwalk_node_vectors_2.pkl...
INFO: [2019-09-23 17:22:42] genewalk.cli - Loading /home/carnold/genewalk/cll_test/deepwalk_node_vectors_3.pkl...
INFO: [2019-09-23 17:22:42] genewalk.cli - Loading /home/carnold/genewalk/cll_test/genewalk_rand_simdists.pkl...
Traceback (most recent call last):
File "bla/TOOLS/miniconda/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'ensembl_id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "bla/TOOLS/miniconda/bin/genewalk", line 10, in <module>
sys.exit(main())
File "bla/TOOLS/miniconda/lib/python3.7/site-packages/genewalk/cli.py", line 203, in main
base_id_type=args.id_type)
File "bla/TOOLS/miniconda/lib/python3.7/site-packages/genewalk/perform_statistics.py", line 178, in generate_output
df[base_id_type] = df[base_id_type].astype('category')
File "bla/TOOLS/miniconda/lib/python3.7/site-packages/pandas/core/frame.py", line 2980, in __getitem__
indexer = self.columns.get_loc(key)
File "bla/TOOLS/miniconda/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'ensembl_id'

Error when calling word2vec: unexpected keyword argument 'size'

GeneWalk needs updating because of GenSim 4.0.0 release

For users running into the following error:
File "/lib64/python3.6/site-packages/genewalk/deepwalk.py", line 138, in word2vec
sample=sample)
TypeError: __init__() got an unexpected keyword argument 'size'

Immediate fix: downgrade gensim to previous version before running genewalk:
pip install --upgrade gensim==3.8.3

Long term solution that I will implement very soon: make GeneWalk compatible with gensim 4.0.0.

More info on the Gensim migration: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4
In Word2Vec, the size constructor parameter is now consistently named vector_size.
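
One way to stay compatible with both gensim 3.x and 4.x is to pick the keyword name from the installed version. The following is only a sketch of that idea, not the fix that was implemented in GeneWalk; the helper name word2vec_size_kwargs is hypothetical and the dimension value in the comment is purely illustrative.

```python
def word2vec_size_kwargs(dimension, gensim_version):
    """Return the embedding-size keyword matching the gensim major version.

    gensim 3.x expects size=..., gensim 4.x expects vector_size=...
    """
    major = int(gensim_version.split(".")[0])
    key = "vector_size" if major >= 4 else "size"
    return {key: dimension}


# Intended use (not executed here, requires gensim to be installed):
#   import gensim
#   from gensim.models import Word2Vec
#   model = Word2Vec(walks, **word2vec_size_kwargs(8, gensim.__version__))
```

Keeping the version check in one small helper means the rest of the training call does not need to change between gensim releases.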

AttributeError: type object 'object' has no attribute 'dtype'

Hello, thank you for developing this great algorithm!

I installed it on my MacBook, as recommended, without error.
It went smoothly for the first set of genes.
Before the second set, I realized it was complaining about the NumPy version,
so I ran pip install numpy --upgrade.
Then I ran it for the second set of genes, and it failed after the moonlighting plot.
I reran it and it failed again.
I then reran the first gene set, and now it fails as well.

INFO: [2021-02-09 15:29:42] gensim.models.base_any2vec - EPOCH 5 - PROGRESS: at 96.25% examples, 2188362 words/s, in_qsize 8, out_qsize 0
INFO: [2021-02-09 15:29:43] gensim.models.base_any2vec - EPOCH 5 - PROGRESS: at 97.75% examples, 2190384 words/s, in_qsize 8, out_qsize 0
INFO: [2021-02-09 15:29:44] gensim.models.base_any2vec - EPOCH 5 - PROGRESS: at 99.33% examples, 2194058 words/s, in_qsize 8, out_qsize 0
INFO: [2021-02-09 15:29:44] gensim.models.base_any2vec - worker thread finished; awaiting finish of 3 more threads
INFO: [2021-02-09 15:29:44] gensim.models.base_any2vec - worker thread finished; awaiting finish of 2 more threads
INFO: [2021-02-09 15:29:44] gensim.models.base_any2vec - worker thread finished; awaiting finish of 1 more threads
INFO: [2021-02-09 15:29:44] gensim.models.base_any2vec - worker thread finished; awaiting finish of 0 more threads
INFO: [2021-02-09 15:29:44] gensim.models.base_any2vec - EPOCH - 5 : training on 155070000 raw words (155070000 effective words) took 70.6s, 2195957 effective words/s
INFO: [2021-02-09 15:29:44] gensim.models.base_any2vec - training on a 775350000 raw words (775350000 effective words) took 330.0s, 2349202 effective words/s
INFO: [2021-02-09 15:29:44] genewalk.deepwalk - Generating node vectors done in 374.79s
INFO: [2021-02-09 15:29:45] genewalk.cli - Saving into ~/genewalk/SEO_cl6/deepwalk_node_vectors_rand_3.pkl...
INFO: [2021-02-09 15:29:48] genewalk.cli - Saving into ~/genewalk/SEO_cl6/genewalk_rand_simdists.pkl...
INFO: [2021-02-09 15:29:48] genewalk.cli - Loading ~/genewalk/SEO_cl6/multi_graph.pkl...
INFO: [2021-02-09 15:29:48] genewalk.cli - Loading ~/genewalk/SEO_cl6/genes.pkl...
INFO: [2021-02-09 15:29:48] genewalk.cli - Loading ~/genewalk/SEO_cl6/deepwalk_node_vectors_1.pkl...
INFO: [2021-02-09 15:29:48] genewalk.cli - Loading ~/genewalk/SEO_cl6/deepwalk_node_vectors_2.pkl...
INFO: [2021-02-09 15:29:48] genewalk.cli - Loading ~/genewalk/SEO_cl6/deepwalk_node_vectors_3.pkl...
INFO: [2021-02-09 15:29:49] genewalk.cli - Loading ~/genewalk/SEO_cl6/genewalk_rand_simdists.pkl...
INFO: [2021-02-09 15:29:49] genewalk.cli - Saving final results into ~/genewalk/SEO_cl6/genewalk_results.csv
INFO: [2021-02-09 15:29:49] genewalk.cli - Creating figures folder at ~/genewalk/SEO_cl6/figures
INFO: [2021-02-09 15:29:49] genewalk.cli - Creating barplots folder at ~/genewalk/SEO_cl6/figures/barplots
INFO: [2021-02-09 15:29:49] genewalk.plot - Scatter plot data output to genewalk_scatterplots.csv...
INFO: [2021-02-09 15:29:51] genewalk.plot - Regulator genes plotted in regulators_x_gene_con_y_frac_rel_go...
INFO: [2021-02-09 15:29:51] genewalk.plot - Regulator genes listed in genewalk_regulators.csv...
INFO: [2021-02-09 15:29:52] genewalk.plot - Moonlighting genes plotted in moonlighters_x_go_con_y_frac_rel_go...
Traceback (most recent call last):
  File "~/miniconda3/bin/genewalk", line 11, in <module>
    sys.exit(main())
  File "~/miniconda3/lib/python3.7/site-packages/genewalk/cli.py", line 146, in main
    run_main(args)
  File "~/miniconda3/lib/python3.7/site-packages/genewalk/cli.py", line 235, in run_main
    GWp.generate_plots()
  File "~/miniconda3/lib/python3.7/site-packages/genewalk/plot.py", line 53, in generate_plots
    moonlight_html = self.scatterplot_moonlighters()
  File "~/miniconda3/lib/python3.7/site-packages/genewalk/plot.py", line 221, in scatterplot_moonlighters
    df = pd.DataFrame(sorted(moonlighters), columns=['gw_moonlighter'])
  File "~/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 453, in __init__
    mgr = init_dict({}, index, columns, dtype=dtype)
  File "~/miniconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 196, in init_dict
    nan_dtype)
  File "~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1175, in construct_1d_arraylike_from_scalar
    dtype = dtype.dtype
AttributeError: type object 'object' has no attribute 'dtype'

Error after rerunning with the gene set that previously worked:

INFO: [2021-02-09 21:21:22] genewalk.plot - Moonlighting genes plotted in moonlighters_x_go_con_y_frac_rel_go...
Traceback (most recent call last):
  File "~/miniconda3/bin/genewalk", line 11, in <module>
    sys.exit(main())
  File "~/miniconda3/lib/python3.7/site-packages/genewalk/cli.py", line 146, in main
    run_main(args)
  File "~/miniconda3/lib/python3.7/site-packages/genewalk/cli.py", line 235, in run_main
    GWp.generate_plots()
  File "~/miniconda3/lib/python3.7/site-packages/genewalk/plot.py", line 53, in generate_plots
    moonlight_html = self.scatterplot_moonlighters()
  File "~/miniconda3/lib/python3.7/site-packages/genewalk/plot.py", line 221, in scatterplot_moonlighters
    df = pd.DataFrame(sorted(moonlighters), columns=['gw_moonlighter'])
  File "~/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 453, in __init__
    mgr = init_dict({}, index, columns, dtype=dtype)
  File "~/miniconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 196, in init_dict
    nan_dtype)
  File "~/miniconda3/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1175, in construct_1d_arraylike_from_scalar
    dtype = dtype.dtype
AttributeError: type object 'object' has no attribute 'dtype'

I guess I would have to downgrade NumPy. Or do you have a version working with the latest NumPy?

Thanks,
Abel

Rat genome

Hi,
I want to analyze genes from the rat genome. Is this possible?
Kindly let me know.

regards,
Amit.

GO Terms

Hi,

Not an issue but more of a question - is it possible to restrict the GO terms utilized in Genewalk to only those of a specific category (e.g. biological process, etc.)? I'm curious if I can exclude GO terms I'm not particularly interested in (cellular component, for example) and derive more meaningful, significant GO term associations for identified regulators. I surveyed the options in genewalk --help but it didn't seem like any of the commands could be used to modify the GO terms.

Thanks,
Nick

GO annotations file comment rows have changed

In the most recent state of GO annotations there are 41 rows before the actual data starts but we hard coded skiprows=23 here: https://github.com/churchmanlab/genewalk/blob/master/genewalk/nx_mg_assembler.py#L158 for an older version. We could replace this with a solution that is adaptive and skips all rows starting with !.
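
The adaptive approach suggested above can be sketched with the standard library alone. This is an illustration under assumptions rather than the change made in nx_mg_assembler.py: the function name read_annotation_rows is hypothetical and the sample lines are made up for the example.

```python
import csv
import io


def read_annotation_rows(handle):
    """Yield tab-separated rows, skipping every line that starts with '!'."""
    data_lines = (line for line in handle if not line.startswith("!"))
    # QUOTE_NONE keeps quote characters in annotation text as literal characters.
    for row in csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if row:
            yield row


# Made-up sample: two comment lines followed by one data line.
sample = io.StringIO(
    "!gaf-version: 2.1\n"
    "!generated-by: example\n"
    "UniProtKB\tP04637\tTP53\t\tGO:0006915\n"
)
rows = list(read_annotation_rows(sample))
```

Because the filter looks at the first character of each line, it keeps working however many comment rows a future release of the file contains.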

AttributeError: module 'typing' has no attribute 'NoReturn'

Command:
genewalk --project RNAseq9 --genes cluster9genelist.txt --id_type custom --network_source sif_annot --network_file fullnetwork.txt --base_folder Genewalk --nproc 8

Error:

Traceback (most recent call last):
  File "/n/groups/churchman/Genewalk/genewalkenv/bin/genewalk", line 5, in <module>
    from genewalk.cli import main
  File "/n/groups/churchman/Genewalk/genewalkenv/lib/python3.6/site-packages/genewalk/cli.py", line 20, in <module>
    from genewalk.plot import GW_Plotter
  File "/n/groups/churchman/Genewalk/genewalkenv/lib/python3.6/site-packages/genewalk/plot.py", line 10, in <module>
    import plotly.express as px
  File "/n/groups/churchman/Genewalk/genewalkenv/lib/python3.6/site-packages/plotly/__init__.py", line 34, in <module>
    from plotly import (
  File "/n/groups/churchman/Genewalk/genewalkenv/lib/python3.6/site-packages/plotly/io/__init__.py", line 6, in <module>
    from . import orca, kaleido
  File "/n/groups/churchman/Genewalk/genewalkenv/lib/python3.6/site-packages/plotly/io/orca.py", line 1, in <module>
    from ._orca import (
  File "/n/groups/churchman/Genewalk/genewalkenv/lib/python3.6/site-packages/plotly/io/_orca.py", line 15, in <module>
    import tenacity
  File "/n/groups/churchman/Genewalk/genewalkenv/lib/python3.6/site-packages/tenacity/__init__.py", line 184, in <module>
    class RetryError(Exception):
  File "/n/groups/churchman/Genewalk/genewalkenv/lib/python3.6/site-packages/tenacity/__init__.py", line 191, in RetryError
    def reraise(self) -> t.NoReturn:
AttributeError: module 'typing' has no attribute 'NoReturn'

Genewalk$ pip freeze

certifi==2021.10.8
charset-normalizer==2.0.12
click==8.0.3
cycler==0.10.0
dataclasses==0.8
decorator==4.4.2
docopt==0.6.2
Flask==2.0.3
funcsigs==1.0.2
genewalk==1.5.3
gensim==3.8.3
goatools==1.1.12
idna==3.3
importlib-metadata==4.8.3
itsdangerous==2.0.1
Jinja2==3.0.3
kiwisolver==1.3.1
MarkupSafe==2.0.1
matplotlib==3.3.4
mock==2.0.0
networkx==2.5.1
nose==1.3.7
numpy==1.19.5
package-name==0.1
pandas==0.25.3
patsy==0.5.2
pbr==1.10.0
Pillow==8.4.0
plotly==5.6.0
pydot==1.4.2
pyparsing==2.1.10
python-dateutil==2.8.2
pytz==2021.3
requests==2.27.1
scipy==1.5.4
seaborn==0.11.2
six==1.10.0
smart-open==5.2.1
statsmodels==0.12.2
tenacity==8.0.1
typing_extensions==4.1.1
urllib3==1.26.8
virtualenv==15.1.0
Werkzeug==2.0.3
xlrd==1.2.0
XlsxWriter==3.0.2
zipp==3.6.0

compiling the resource folder - error

Hi,

Thanks again for making this and making it available.

I'm new to Python, so I apologize if my error is something basic, but I'd appreciate anyone taking a look:

I installed genewalk using
pip install genewalk

got one error:
indra 1.15.1 has requirement networkx<=2.3,>=2, but you'll have networkx 2.4 which is incompatible.

but then indra (1.15.1) installs anyway and appears in the list when I run
conda list

I then attempted to run the following command, which fails with the error below. I ran it in both a Python 3.7 and a Python 3.5 environment and got basically the same error. Any ideas?

(py35) osx2560:~ James$ genewalk --project QKI --genes ~/Downloads/QKI_forGW.csv --id_type mgi_id
INFO: [2019-10-30 08:50:15] genewalk.cli - Creating project folder at /Users/James/genewalk/QKI
INFO: [2019-10-30 08:50:15] genewalk.resources - Using /Users/James/genewalk/resources as resource folder.
INFO: [2019-10-30 08:50:15] genewalk.resources - Downloading http://snapshot.geneontology.org/ontology/go.obo into /Users/James/genewalk/resources/go.obo
Traceback (most recent call last):
  File "/Users/James/miniconda3/envs/py35/bin/genewalk", line 11, in <module>
    sys.exit(main())
  File "/Users/James/miniconda3/envs/py35/lib/python3.5/site-packages/genewalk/cli.py", line 145, in main
    rm.download_all()
  File "/Users/James/miniconda3/envs/py35/lib/python3.5/site-packages/genewalk/resources.py", line 51, in download_all
    self.get_go_obo()
  File "/Users/James/miniconda3/envs/py35/lib/python3.5/site-packages/genewalk/resources.py", line 20, in get_go_obo
    download_go(fname)
  File "/Users/James/miniconda3/envs/py35/lib/python3.5/site-packages/genewalk/resources.py", line 59, in download_go
    urllib.request.urlretrieve(url, fname)
  File "/Users/James/miniconda3/envs/py35/lib/python3.5/urllib/request.py", line 188, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Users/James/miniconda3/envs/py35/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/James/miniconda3/envs/py35/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/Users/James/miniconda3/envs/py35/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/James/miniconda3/envs/py35/lib/python3.5/urllib/request.py", line 510, in error
    return self._call_chain(*args)
  File "/Users/James/miniconda3/envs/py35/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/Users/James/miniconda3/envs/py35/lib/python3.5/urllib/request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Visualization of the results

Hi, I have a result table now, and I am wondering whether you or anyone else already has an R or Python script to visualize a GeneWalk result table in an automated fashion, similar to what you show in the publication. I can code it for myself, but why reinvent the wheel? :)

Trouble Using GeneWalk

Hello Churchman Lab Team,

I have recently tried to implement your module for an enrichment analysis on genes I got from a differential gene expression analysis. However, an error keeps on recurring and I am unsure of what the problem is.

My Python version is 3.8.6 which should be able to run GeneWalk. I also installed the module with no errors. The following lines show up when I try to run the module

$ genewalk --project PMS --genes PMSUpGenesOnly.txt --id_type hgnc_symbol
INFO: [2021-03-01 13:54:37] genewalk.cli - Creating PMS folder at /Users/sinjiafan/genewalk/PMS
INFO: [2021-03-01 13:54:37] genewalk.resources - Using /Users/sinjiafan/genewalk/resources as resource folder.
INFO: [2021-03-01 13:54:37] genewalk.resources - Downloading https://www.genenames.org/cgi-bin/download/custom?col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_prev_sym&col=gd_status&col=md_eg_id&col=md_prot_id&col=md_mgd_id&col=md_rgd_id&col=gd_pub_ensembl_id&status=Approved&status=Entry%20Withdrawn&hgnc_dbtag=on&order_by=gd_app_sym_sort&format=text&submit=submit into /Users/sinjiafan/genewalk/resources/hgnc_entries.tsv
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1350, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 950, in send
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1424, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1124)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/bin/genewalk", line 8, in <module>
    sys.exit(main())
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/genewalk/cli.py", line 157, in main
    run_main(args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/genewalk/cli.py", line 195, in run_main
    genes = read_gene_list(args.genes, args.id_type, rm)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/genewalk/gene_lists.py", line 31, in read_gene_list
    gene_mapper = GeneMapper(resource_manager)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/genewalk/gene_lists.py", line 232, in __init__
    self.hgnc_file = self.resource_manager.get_hgnc()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/genewalk/resources.py", line 78, in get_hgnc
    download_url(url, fname)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/genewalk/resources.py", line 125, in download_url
    urllib.request.urlretrieve(url, fname)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1393, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1353, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1124)>

I have attached the gene list I am trying to run as well.

PMSUpGenesOnly.txt

I hope you can help me resolve this issue. Thank you so much in advance!

Best,
Sinja (Xuanjia) Fan

No regulators identified?

Hi,

I've run 10 gene sets through genewalk from an RNA-seq experiment (various treatment conditions with up- or down-regulated genes).

While 9/10 gene sets have produced expected results, one particular gene set fails to identify any regulators, i.e. the scatterplot is empty with the exception of a few dots on the x-axis and genewalk_regulators.csv is empty. Despite this, the barplot folder is populated with 688 figures, so it's not clear to me whether this is a true reflection of the gene set I've provided or some kind of error. I've re-run this analysis on a few different occasions after re-generating the source gene list file (thinking it might have been corrupted in some way, just a wild guess). Nothing has seemed to help.

For reference, the analysis is being conducted on MacOS 11.2.1 with Python v3.8. The code I'm using for the analysis is below:

$ genewalk --project UTD24_DOWN --genes UTD24_DOWN.txt --id_type ensembl_id --nproc 4 --nreps_graph 10 --nreps_null 10

I've also attached the output log file, results, scatter plot and regulators spreadsheet.

genewalk_all.log
genewalk_results.csv.zip

genewalk_regulators.csv.zip
regulators_x_gene_con_y_frac_rel_go.pdf

ML

version argument missing

A --version argument would be good to have, so it becomes easier to quickly check the version and extract it via automated pipelines that integrate genewalk.
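
With argparse, which the usage output elsewhere on this page suggests the CLI is built on, such a flag could be added roughly as follows. This is a sketch, not the actual GeneWalk implementation: EXAMPLE_VERSION is a placeholder, and a real package would pass its own version string instead.

```python
import argparse

# Placeholder value; a real package would import its own version string.
EXAMPLE_VERSION = "0.0.0"


def build_parser():
    """Build a parser whose --version flag prints the version and exits."""
    parser = argparse.ArgumentParser(prog="genewalk")
    # The built-in 'version' action prints the formatted string and exits
    # with status 0, so pipelines can capture the output directly.
    parser.add_argument("--version", action="version",
                        version="%(prog)s " + EXAMPLE_VERSION)
    return parser
```

An automated pipeline could then call the tool with --version and parse the single line it prints.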

rdflib=4.2.2 and python 3 version conflict

Hi,
I am using Anaconda Python version 3.6. When I try to add the missing library with

conda install -n py36 rdflib=4.2

conda reports a conflict between Python 3.6 and rdflib=4.2.2,
saying that rdflib=4.2 -> python=3.4

Best,
Tõnu

ensembl_id

The documentation shows that genewalk supports ensembl_id. However, after simple installation (pip install genewalk) and a test run, it reports an error that ensembl_id is not supported.

usage: genewalk [-h] --project PROJECT --genes GENES --id_type
{hgnc_symbol,hgnc_id,mgi_id}
[--stage {all,node_vectors,null_distribution,statistics}]
[--base_folder BASE_FOLDER]
[--network_source {pc,indra,edge_list,sif}]
[--network_file NETWORK_FILE] [--nproc NPROC]
[--nreps_graph NREPS_GRAPH] [--nreps_null NREPS_NULL]
[--alpha_fdr ALPHA_FDR] [--save_dw SAVE_DW]
[--random_seed RANDOM_SEED]
