zqfang / gseapy

Gene Set Enrichment Analysis in Python

Home Page: http://gseapy.rtfd.io/

License: BSD 3-Clause "New" or "Revised" License

Python 74.93% R 0.85% Rust 24.15% Apex 0.06% Visual Basic 6.0 0.02%

enrichment-analysis gsea python3 rust

gseapy's People

oreh bioxiao alenzhao ostrokach olgabot fahsan kuod inambioinfo scottsnapperlab bkbonde ranikay costellolab cthoyt luissoares taigi mathieubo tuh8888 doanle0906 xiaoying201355 chensab2 yxngl jacobkimmel deepcolin anmy2014 huang961372045 venkoyoung olivomao sorrge weizhiting panando hmartiniano jiawu herophilus michael-kotliar xiaotaowang yiyinyiyang honphy alexwdong nyuhuyang hiplot bigtigr mauliknariya ar2plan sebiruehl1990 jamie-lyu fvalle1 hjanime martinirani kchennen brybackgmet melclic sarahbeenie yachenhu hsiaoyi0504 cwieder nathanielmki engelsdaniel zirandaozhang bbyun28 avargoksu fairliereese pirakd foghorntherapeutics genostack sunhuaiyu sshen8 mengchengyao devnambi crsky1023 jorsorokin pearcekieser mrjiang333 vcheung-fn rnaimehaom thz34 dujidan kleistikow lipingshu snijesh nargesr constantin1489 cher-han lisawanghsu wgsim kyungtaeklim zhangfuchang hao-biodecoder zhecibixuguo sarkaft wlzhdtk 136s tensixteenbio joshloecker smartgamer tealeave wook2014 abhi-glitchhg victoriagatlin sciencecomputing ailabteam

gseapy's Issues

Properly saving replot figures

Hi,

I've noticed that when running gseapy replot the output folder does not contain any output files. I believe this is because of the save command on line 77 of gsea.py.

The problem is the extra period after the first forward slash, which makes the files appear as invisible on my OS X filesystem.

I have updated my local version like so:
fig.savefig('{a}/{b}.gsea.replot.{c}'.format(a=outdir, b=enrich_term, c=format),
and now see produced files.

Small question about heatmap

Hello,

When gsea.call produces figures, one of them is a heatmap of gene set expression, and in every case (negatively or positively correlated, NES>0 or NES<0) it shows the top-ranked genes. In papers I have seen the heatmap show the top genes within the core enrichment, which sit at the left or right edge of the ranked genes from the set, depending on the sign of NES. How can I control the heatmap output in GSEApy?
Thank you

ssGSEA sample_norm_method="custom" produces error

Hello

First, thank you very much for making this tool! I was trying to use single-sample GSEA (ssgsea) with the option sample_norm_method="custom", and I couldn't see a way to provide the custom ranking. Looking through the code, I noticed lines 556-558 of gsea.py:

GSEApy/gseapy/gsea.py

Lines 556 to 558 in d2439dc

    elif self.sample_norm_method == 'custom':
        self._logger.info("Use custom rank metric for ssGSEA")
    else:

In the handling of the custom case in the method norm_samples, the return variable data is never created, so calling with this option immediately throws an error, e.g.

ss2 = gseapy.ssgsea(g.data_df, gene_sets=geneset_output_file, verbose=True, 
    sample_norm_method="custom")
UnboundLocalError: local variable 'data' referenced before assignment

Did you mean to have data = dat in this section and then when the user uses the custom option they provide the ranking?

Thank you in advance!

RuntimeError

RuntimeError:
Attempt to start a new process before the current process
has finished its bootstrapping phase.
This probably means that you are on Windows and you have
forgotten to use the proper idiom in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce a Windows executable.

I have been getting this message all day. I did exactly the same as your documentation suggested. All files were prepared just fine.
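
For readers hitting the same message, here is a minimal sketch of the idiom the error text refers to. It uses only the standard library; `square` and `main` are hypothetical stand-ins for whatever the calling script does (for example, a call into gseapy with several processes), not code from this project.

```python
import multiprocessing as mp


def square(x):
    """Toy worker; stands in for any function sent to a worker process."""
    return x * x


def main():
    # Worker processes are created only inside main(), never at import time.
    with mp.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))


if __name__ == "__main__":
    mp.freeze_support()  # only matters for frozen Windows executables
    main()
```

The point of the guard is that child processes started with the spawn method re-import the main module; without the guard, that import would try to start new processes again.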

ssGSEA FDR q value > 1

Hi,

I ran ssGSEA with GSEApy using the MSigDB hallmark gene sets and everything looks OK, except that for some of the gene sets I get FDR q values > 1.

Is this a bug?

Thanks,
Alejandro

When running GSEA on multiple gene sets, which are plotted?

Hi,

I am trying to figure out which gene sets are selected for plotting the enrichment score after running a pre-ranked GSEA for 836 gene sets. I am running it as in the "3. Prerank" example here:
http://pythonhosted.org/gseapy/gseapy_example.html

I cannot figure out what the criterion for plotting is. Also, could I somehow force the plot of a particular enrichment to be shown?
Thank you

use of tempfile.TemporaryDirectory incompatible with Python 2.7

Hi, I noticed that in this commit:
a8fe55e#diff-73d289c8bfd27b90e234684d8d793e6cL8
(line 8)
that the code now uses tempfile.TemporaryDirectory, which is not present in Python 2.7, causing the code to break if using Python 2.7.

There are a couple of workarounds for this:
https://stackoverflow.com/questions/19296146/tempfile-temporarydirectory-context-manager-in-python-2-7
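
One such workaround can be sketched with tempfile.mkdtemp and shutil.rmtree, both of which exist in Python 2.7 and 3.x. The helper name `temporary_directory` is hypothetical and is not part of gseapy.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def temporary_directory():
    """Create a temporary directory and remove it on exit (2.7 and 3.x)."""
    path = tempfile.mkdtemp()
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)


with temporary_directory() as tmpdir:
    created = os.path.isdir(tmpdir)   # directory exists inside the block
removed = not os.path.exists(tmpdir)  # and is gone afterwards
```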

A small question about the original GSEA implementation

I found that if I use the signal-to-noise method to rank the data, the outputs of GSEApy and the
Java implementation from the Broad Institute differ.

e.g.

input: p53_resample, KEGG_2016 (del 1.00 in gmt file from Enrichr) and p53.cls.

part of result

name	GSEApy	JAVAImpl
DIP2A	0.486848	0.42881304

Then I found that the Java implementation uses this method to compute the stddev:

    /**
     * @return The std aa_indev of vector
     *         // Some heuristics for adjusting variance based on data from affy chips
     *         // NOTE: problem occurs when we threshold to a value, then that artificially
     *         //   reduces the variance in the data
     *         /
     *         // First, we make the variance at least a fixed percent of the mean
     *         // If the mean is too small, then we use an absolute variance
     *         /
     *         // However, we don't want to bias our algs for affy data, e.g. we may
     *         // get data in 0..1 and in that case it is not appropriate to use
     *         // an absolute standard deviation of 10 - will kill the signal.
     *         /
     */
    public double stddev(boolean biased, boolean fixlow) {

        double stddev = Math.sqrt(_var(biased));    // @note call to _var and not var
        double mean = computeset.mean;                    // avoid recalc

        if (fixlow) {
            double minallowed = (0.20 * Math.abs(mean));

            // In the case of a zero mean, assume the mean is 1
            if (minallowed == 0) {
                minallowed = 0.20;
            }

            if (minallowed < stddev) {
                // keep orig
            } else {
                stddev = minallowed;
            }
        }

        computeset.stddev = stddev;

        return computeset.stddev;
    }

view source: https://github.com/GSEA-MSigDB/gsea-desktop/blob/2f44b81359917cacb9e0b113b2c540c56f38a5d7/src/edu/mit/broad/genome/math/Vector.java#L416-L453

Could you please explain why they use a minallowed stddev?
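
To make the effect of the floor concrete, here is a rough Python transcription of the Java logic quoted above. It is a sketch under assumptions: it always uses the sample (unbiased) standard deviation, the function names are made up for illustration, and nothing here is taken from GSEApy itself.

```python
import statistics


def adjusted_stddev(values, fixlow=True):
    """Sample standard deviation with the 20%-of-mean floor from the Java code."""
    stddev = statistics.stdev(values)
    if fixlow:
        min_allowed = 0.20 * abs(statistics.mean(values))
        if min_allowed == 0:  # zero mean: treat the mean as 1
            min_allowed = 0.20
        stddev = max(stddev, min_allowed)
    return stddev


def signal_to_noise(group_a, group_b, fixlow=True):
    """(mean_a - mean_b) / (sd_a + sd_b) using the adjusted deviations."""
    return (statistics.mean(group_a) - statistics.mean(group_b)) / (
        adjusted_stddev(group_a, fixlow) + adjusted_stddev(group_b, fixlow)
    )


flat_a, flat_b = [10, 10, 10], [8, 8, 8]
score = signal_to_noise(flat_a, flat_b)  # finite thanks to the floor
```

With fixlow=False the same call would divide by zero, so the floor keeps near-constant genes from getting extreme scores. If one implementation applies the floor and the other does not, scores for low-variance genes would differ, which may account for a discrepancy like the one shown for DIP2A.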

conda dockerfile

Hi,
I'm using the gseapy docker image provided by https://quay.io/repository/biocontainers/gseapy.
I found it thanks to https://bioconda.github.io/recipes/gseapy/README.html

I have a minor problem running the image, so I would like to improve it a bit, but I cannot find the related Dockerfile. Can you help me?

Odd memory behavior when using multiple processes

I've spent the better part of a day investigating this and I'm stumped, so I'll mention it in case you have any ideas. I'm running single-sample GSEA with processes = 8. I run it first on my data in a dataframe with 200 gene sets, then on another 200 gene sets, and so on, in a for loop (code attached). I'm not saving any data in the for loop, and yet the memory usage goes up. I've repeated this several times, and for each iteration the memory usage (%) is: 1.4, 2.7, 3.9, 5.1 (it starts at 0.1 and goes down to 3.8 after running gc once the loop finishes).

A couple of interesting points:

Increasing the size of my dataframe causes the numbers above to increase, but with the same pattern.
Re-running the calculation immediately gives this pattern of memory usage: 5.0, 5.0, 5.0, 5.0 (starts at 3.8, ends at 3.8).

The second point suggests it is not a memory leak, perhaps the GC is not properly cleaning up until the pool starts up again?

I tried running del on various objects and rewriting the code in various ways, without any visible effect on the memory usage pattern above.

The code I used to illustrate all of the above is attached as a Jupyter notebook:
07 illustrate memory leak.ipynb.gz

TypeERROR

With my genelist I am getting this error:
TypeError: sequence item 109: expected str instance, float found

gseapy prerank -r ls D*rnk -g /Users/kopardevn/Downloads/h.all.v6.0.symbols.gmt -o prerank_test_report_2
Traceback (most recent call last):
  File "/Users/kopardevn/anaconda/bin/gseapy", line 11, in <module>
    sys.exit(main())
  File "/Users/kopardevn/anaconda/lib/python3.6/site-packages/gseapy/__main__.py", line 43, in main
    args.ascending, args.figsize, args.format, args.graph, args.seed)
  File "/Users/kopardevn/anaconda/lib/python3.6/site-packages/gseapy/gsea.py", line 300, in prerank
    rdict['genes'] = ",".join(dat2.ix[ind,'gene_name'].tolist())
TypeError: sequence item 109: expected str instance, float found

h.all.v6.0.symbols.gmt has been downloaded from http://software.broadinstitute.org/gsea/downloads.jsp
and the genelist is

D.rnk.txt

gseapy.enrichr crashes on non-default gene set

Example (python 2.7)
import gseapy
l = ['SCARA3', 'LOC100044683', 'CMBL', 'CLIC6', 'IL13RA1', 'TACSTD2', 'DKKL1', 'CSF1',
'SYNPO2L', 'TINAGL1', 'PTX3', 'BGN', 'HERC1', 'EFNA1', 'CIB2', 'PMP22', 'TMEM173']

enrichr = gseapy.enrichr(gene_list=l, description='pathway', gene_sets='GO_Biological_Process_2017', outdir='test', cutoff=0.2)

Crashes

NameError Traceback (most recent call last)
in ()
3 'SYNPO2L', 'TINAGL1', 'PTX3', 'BGN', 'HERC1', 'EFNA1', 'CIB2', 'PMP22', 'TMEM173']
4
----> 5 enrichr = gseapy.enrichr(gene_list=l, description='pathway', gene_sets='GO_Biological_Process_2017', outdir='test', cutoff=0.2)
6
7 # or a txt file path.

/usr/lib/python2.7/site-packages/gseapy/enrichr.pyc in enrichr(gene_list, gene_sets, description, outdir, cutoff, format, figsize, top_term, no_plot, verbose)
150 enr = Enrichr(gene_list, gene_sets, description, outdir,
151 cutoff, format, figsize, top_term, no_plot, verbose)
--> 152 enr.run()
153
154 return enr

/usr/lib/python2.7/site-packages/gseapy/enrichr.pyc in run(self)
55 enrichr_library = DEFAULT_LIBRARY
56 else:
---> 57 enrichr_library = get_library_name()
58 if gene_set not in enrichr_library:
59 sys.stderr.write("%s is not a enrichr library name\n"%gene_set)

NameError: global name 'get_library_name' is not defined

ssgsea fails with pandas multiindex

When the input data to ssgsea is a multi-indexed pd.DataFrame, I get a "TypeError: not all arguments converted during string formatting" within the runSamples process.

This can be resolved by dropping the multi-index first, then adding it back after - but I assume there's a better solution.

The values of NES, p-value and FDR are not reproducible

I ran the same data five times and got five different values of NES, p-value and FDR. The results were as follows.
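
Permutation-based statistics vary between runs unless the random generator is seeded. The sketch below shows the general idea with the standard library only; `permutation_null` is a hypothetical toy, not GSEApy's algorithm. A traceback elsewhere on this page shows the command line passing `args.seed` into prerank, so a seed option may exist in the version in use, but that is an assumption to check against the installed release.

```python
import random


def permutation_null(scores, n_perm, seed=None):
    """Mean of the first half of `scores` under n_perm random shuffles."""
    rng = random.Random(seed)  # private generator; global state untouched
    half = len(scores) // 2
    null = []
    for _ in range(n_perm):
        shuffled = list(scores)
        rng.shuffle(shuffled)
        null.append(sum(shuffled[:half]) / half)
    return null


scores = [0.9, 0.4, -0.2, 1.3, -0.7, 0.1]
run_a = permutation_null(scores, n_perm=100, seed=42)
run_b = permutation_null(scores, n_perm=100, seed=42)  # same seed, same null
```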

No version 0.8.0 via pip install

Your package is wonderful, and I've been trying to install v0.8 through pip but it doesn't seem to be available. Any solution?

Problem running with entrez ids

Hi there, many thanks for your great package! I tried to run gsea with Entrez IDs instead of gene names (using custom gene sets from MSigDB, which offers both symbol-indexed and Entrez-ID-indexed signature sets). However, it appears that GSEApy assumes that gene IDs are always strings. Could you have a look at it?

Here is my traceback:

Traceback (most recent call last):
  File "[...]/bin/gseapy", line 11, in <module>
    sys.exit(main())
  File "[...]/lib/python3.7/site-packages/gseapy/__main__.py", line 41, in main
    gs.run()
  File "[...]/lib/python3.7/site-packages/gseapy/gsea.py", line 420, in run
    gmt=gmt, rank_metric=dat2, permutation_type=self.permutation_type)
  File "[...]/lib/python3.7/site-packages/gseapy/gsea.py", line 270, in _save_results
    rdict['genes'] = ";".join([ g.strip() for g in _genes ])
  File "[...]/lib/python3.7/site-packages/gseapy/gsea.py", line 270, in <listcomp>
    rdict['genes'] = ";".join([ g.strip() for g in _genes ])
AttributeError: 'numpy.int64' object has no attribute 'strip'
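
A possible fix, sketched here as a standalone helper, is to convert each identifier to a string before stripping and joining. `join_gene_ids` is a hypothetical name; the real change would go where `_save_results` builds `rdict['genes']`. Casting the gene column to strings before calling gseapy may also work as a stopgap, though that is untested here.

```python
def join_gene_ids(genes, sep=";"):
    """Join gene identifiers after converting each one to a stripped string."""
    return sep.join(str(g).strip() for g in genes)


# Integer Entrez IDs mixed with a padded string ID.
joined = join_gene_ids([7157, " 672", 1956])
```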

Support for custom background genes in Enrichr

Since custom GMT is supported, I think a custom background gene list is a natural next step. It could be as simple as reading unique IDs and passing them on to self._bg. It would need to test whether the string is an existing file before treating it as a Biomart dataset name. I can help if needed.

multiprocessing issue on prerank

I got the same values of es, nes, fdr ... for different gene lists when multiprocessing mode is on.

working on this now.............

Use module logger instead of overriding the root logger

Overriding the root logger (by clearing and setting it) breaks logging functionality if gseapy is used in another package. Should use logging.getLogger('gseapy') instead.

Reactome_2016

Running prerank leads to the following issue, only with Reactome_2016 and Reactome_2015:

/usr/software/conda/2.3.0/lib/python2.7/site-packages/gseapy/parser.pyc in ((line,))
125 print("Downloading and generating Enrichr library gene sets..............")
126 genesets_dict = { line.decode().split("\t")[0]: [gene.split(",")[0] for gene in line.decode().split("\t")[2:-1]]
--> 127 for line in response.iter_lines()}
128
129 else:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 65: ordinal not in range(128)
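
On Python 2, `line.decode()` with no argument uses the ASCII codec, and byte 0xc2 is the first byte of a two-byte UTF-8 sequence, so a non-ASCII character in a term name would trigger this. A possible fix is to decode explicitly as UTF-8. The sketch below assumes the library lines are UTF-8 encoded; the sample line and the helper name `parse_library_line` are made up for illustration.

```python
def parse_library_line(raw_line):
    """Decode one raw Enrichr library line and return (term, genes)."""
    fields = raw_line.decode("utf-8").rstrip("\n").split("\t")
    term = fields[0]
    # Fields look like "GENE,1.0"; keep the symbol, skip empty trailing fields.
    genes = [field.split(",")[0] for field in fields[2:] if field]
    return term, genes


# Hypothetical line whose term contains a UTF-8 non-breaking space (0xc2 0xa0).
raw = b"Signaling by\xc2\xa0EGFR\t\tEGFR,1.0\tGRB2,1.0\t"
term, genes = parse_library_line(raw)
```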

Heatmap scale bar

Hi,

Sorry for the interruption. This may seem a stupid question, but I did not find the answer in the documentation.

After I ran GSEA, I got a heatmap like the one below: [KD, KD, control, control]. I initially thought this indicated that these genes were repressed in my KD samples.

But when I checked these genes in my input files and made a scatter plot, I got confused. These genes do show a down-regulated tendency in the KD samples (X axis: KD, Y axis: control), but not as strongly as in the heatmap. Could you please explain what the heatmap values stand for? Why is the scale -1 to 1?

I used normalized read counts as input. The scatter plot shows log2(counts + 1).

Thanks for your help in advance!

y-axis in enrichment score plot is all zeros

Hello

I'm doing a trivial "hello world" example of calculating enrichment score - I chose my gene sets to exactly match the 49 most down-regulated genes. When I look at the enrichment score plot, the y-axis numbers are all zero:

The code I used to generate this:
ss2 = gseapy.ssgsea(g.data_df.iloc[:, :1], gene_sets=geneset_output_file, verbose=True, format="png", outdir="temp2")

I believe for example the top-most tick mark should be 0.5.

Any ideas about this?

Thanks in advance!

ValueError: Length of values does not match length of index

Hello! I just cannot figure out how to make it so that the expected len of the index matches the actual length of the index. Every time I run gseapy.call I get the same result that the two are not equivalent!

Thoughts on the matter?

enrichment test for two ranked gene lists

Hello,

I'm doing an enrichment test for two ranked gene lists. I couldn't find an appropriate function to do so, as the parameter gene_sets always needs to be a .gmt file.

I thought that the p-value and enrichment scores might be affected by manually creating a gmt file, as there would be only one gene set in this customised gene set file.

So I'm actually looking for a signed KS test (https://github.com/franapoli/signed-ks-test) in this package...

Any ideas? Many thanks! :)

ssGSEA: enrichment score calculation using normalized rank.expression data

bug here, need to be fixed

Documentation for V0.9.4

Do you have any documentation for V0.9.4? I am using Python 2.7. Thank you!

run speed and heatmaps in 0.9.4 vs 0.9.8

I noticed performance issues after upgrading to 0.9.8. I was wondering if that is to be expected. I think performance in 0.9.8 also suffers from lack of scaling with number of processes allowed (processes=1 seems pretty similar to processes>1), but I haven't really confirmed this.

Here is a sample code based on docs, if anyone would like to replicate and report back:

import pandas as pd
import gseapy as gp
import time
gene_exp = pd.read_table("./data/P53.txt")
for processes in range(1, 4):
    start_time = time.time()
    gs_res = gp.gsea(data=gene_exp, # or data='./P53_resampling_data.txt',
                     gene_sets='./data/temp.gmt', # enrichr library names
                     cls= './data/P53.cls', # cls=class_vector
                     permutation_type='phenotype',
                     permutation_num=100, # reduce number to speed up test
                     outdir=f'./test/{gp.__version__}/{processes}',  # do not write output to disk
                     method='signal_to_noise',
                     processes=processes,
                     format='png'
                    )
    print(f'with {processes} process(es):\t {round((time.time() - start_time), 3)} seconds')

(Use the data folder under test from the package for inputs.)
It produces

with 1 process(es):	 133.614 seconds
with 2 process(es):	 76.917 seconds
with 3 process(es):	 71.06 seconds

for v0.9.4 on my laptop. v0.9.8 version took over 20min with 1 process and I didn't continue.

With the heatmaps (hsa05330 set in this example), there are far more genes shown for 0.9.8 version (lower), compared to 0.9.4(upper). Is there a new argument to use here?

Any comment is appreciated.

Leading edge genes

Hi and thanks for the python implementation of GSEA!
Maybe I just overlooked this, but is there a way to get hold of the leading edge genes for enriched gene sets for further downstream analysis?

Thanks,

Nico

RuntimeWarning: Mean of empty slice. warnings.warn("Mean of empty slice.", RuntimeWarning)

The warning occurs because the mean of the negative or the positive part of esnull could not be calculated.
The pval is then returned as a NumPy NaN value.

But this warning will not affect the final results.

NaNs in ssGSEA NES

Hi,

I have run ssGSEA using gseapy, the ES values I get are OK, but when I look at the NES values in some cases I get a NaN. I checked the corresponding ES values and they are not 0. For example in one case I have an ES == 327.7 and the NES == NaN. And this doesn't seem to relate to the geneset, because for the same geneset other samples have an ES and a numerical NES as expected.

Is this a bug?

Happy to provide some sample data if required.
Best,
Alejandro

Support for SVG format

Will SVG be accepted as an output format?

gseapy.prerank - errors for small ranked lists

Hi,
Thanks for implementing GSEA in Python; it's now quicker to perform analyses.

Please note a bug with gseapy.prerank. With default parameters and when the ranked list is small (<= 15), the function returns

No gene sets passed through filtering condition!!!, try new parameters again!
Note: check gene name, gmt file format, or filtering size.

I resolved the problem by lowering the min_size parameter. So, as long as min_size is lower than the ranked list size, there are no errors. But this parameter should apply to the gene sets and not to the expression dataset, according to the documentation of the GSEA software.

Even with no elements in the ranked list matching with the gene sets, the function should return a warning but not an error.

Moreover, in case of errors, the gene set list passed to the function is emptied.

Best regards,
Michaël
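
The behaviour described above is what one would expect if the size filter is applied to the overlap between each gene set and the ranked list rather than to the raw gene set. The sketch below illustrates that reading. It is an assumption about the mechanism, not GSEApy's code: `filter_gene_sets` is hypothetical, min_size=15 is inferred from the report, and max_size=500 is a placeholder.

```python
def filter_gene_sets(gene_sets, ranked_genes, min_size=15, max_size=500):
    """Keep gene sets whose overlap with the ranked list is within bounds."""
    ranked = set(ranked_genes)
    kept = {}
    for term, genes in gene_sets.items():
        matched = [g for g in genes if g in ranked]
        if min_size <= len(matched) <= max_size:
            kept[term] = matched
    return kept  # the input dict is left unmodified


ranked_genes = ["g%d" % i for i in range(12)]  # a 12-gene ranked list
gene_sets = {"set_a": ranked_genes[:10], "set_b": ["g1", "x", "y"]}
default_result = filter_gene_sets(gene_sets, ranked_genes)
relaxed_result = filter_gene_sets(gene_sets, ranked_genes, min_size=5)
```

With a 12-gene ranked list no overlap can reach 15, so the default call returns nothing, while lowering min_size lets set_a through. Returning a new dict also avoids emptying the caller's gene sets.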

Use Requests library

I have a good feeling that the python2/3 switching can be made much easier with the requests library.

https://github.com/BioNinja/GSEApy/blob/033b7b8361437847bffdb4be82d621472ee12b65/gseapy/parser.py#L154-L165

This could become:

response = requests.get('http://amp.pharm.mssm.edu/Enrichr/geneSetLibrary?mode=meta')
gmt_data = response.json()

And this already takes care of python2/3 abstraction :) I'd be happy to submit a PR!

fdr values

Hi,

This is more of a question than an issue; I hope that's OK.

My question is about the ssGSEA FDR calculation:

If you have, say, 50 gene sets, and in reality all of those gene sets are actually "enriched", will the FDR calculation by default output some non-significant (q > 0.05) values, or would all the gene sets have a significant value (q < 0.05)? Again, this assumes that all the genes in those gene sets are highly expressed, for example.

Thanks

Question about ssGSEA FDR

Hello,
My question is about the ssGSEA FDR.
In my results, several FDR values are 0. Is there any way to get the actual FDR value?

Thanks

Pass a pandas dataframe to prerank() or enrichr() inside python console

It would be good practice to be able to pass a dataframe to both prerank and enrichr.

I haven't found a good way to solve this problem so far.

I need some time to improve this.

ssGSEA: Accept a Series with gene names as index and return a dataframe

Hello,
I'm very excited about a Python implementation of ssGSEA! I'd like to convert my gene expression values to pathway expression values by using apply to perform ssGSEA on every row of a pandas.DataFrame expression matrix. Right now, this looks like this:

expression.head().apply(
    lambda x: gp.ssgsea(x.reset_index(), gene_sets=gmt), 
    axis=1)

The reset_index() seems unnecessary for every row
Each run of ssgsea returns None, rather than returning the converted pathway enrichment. I'd rather not have to read a file for every single sample I have (~6,000 of them), so can ssgsea return the Series instead?

Warmest,
Olga

expected enrichment score of -0.5, got -0.496 instead

Hello

I'm doing a trivial example where I created a gene set of the 49 most up-regulated genes in my dataset and then ran ssgsea. In this trivial example I expect to get an enrichment score of -0.5, but instead I get -0.496.

The command I used:
ss2 = gseapy.ssgsea(g.data_df.iloc[:, :1], gene_sets=geneset_output_file, verbose=True, format="png", outdir="temp2")

I can provide the data and gene set (geneset_output_file) if you would like. Here is the output file temp2/gseapy.ssgsea.gene_sets.report.csv, the relevant line is example_up:

# normalize enrichment scores by random permutation procddure
# Same method to the orignial GSEA method, and it's not proper to use these values in your publication
Term	es	nes	pval	fdr	gene_set_size	matched_size	genes
example_dn	0.500053595619	8.26711866264	0.0	0.0	49	49	8869,9053,9170,80204,4775,7048,976,7376,22905,6839,79921,28969,5873,93487,6988,9653,3508,1870,55011,55746,64746,8270,2767,57178,23443,54957,211,2769,7043,9143,4998,10954,6944,7099,7165,51001,23161,8837,55825,5867,10732,808,93594,891,8985,4482,10610,8446,637
example_up	-0.496476958504	-22.8445330718	0.0	0.0	49	49	3383,7158,256364,23097,2146,23139,355,30849,10285,56654,5641,23386,8518,10276,178,147179,8573,993,9448,63874,6195,51335,25966,1445,83743,26128,392,890,8895,23,9375,56889,10450,26511,57761,51024,10245,23014,10493,1019,9928,1802,11188,9093,1633,51282,4144,51160,9709

You can see the es for example_up is -0.496476958504 instead of the expected -0.5

Thank you in advance, any ideas / thoughts greatly appreciated.

multiprocessing issue

Python version: 3.5
GSEApy version 0.8.3
Platform: macOS 10.12.5

When I follow the example in the GSEA section of the examples, I get a result like this:

	es	nes	pval	fdr	gene_set_size	matched_size	genes
Term
Cytokine-cytokine receptor interaction_Homo sapiens_hsa04060	0.229069	1.556522	0.000000	0.000000	265	18	IL10RB,VEGFC,CSF1,TNFSF12,LTBR,CXCL10,TNFRSF1A...
Ras signaling pathway_Homo sapiens_hsa04014	0.332033	1.141144	0.002004	0.002003	227	18	ETS1,GNG13,RRAS,VEGFC,GNB4,CSF1,SOS2,FGF17,PDG...
Rap1 signaling pathway_Homo sapiens_hsa04015	-0.285975	-1.621806	0.000000	0.005941	211	19	RRAS,VEGFC,CSF1,FGF17,PDGFRB,FGF4,PDGFC,SIPA1L...
MAPK signaling pathway_Homo sapiens_hsa04010	-0.392928	-1.037983	0.500000	0.010396	255	18	GADD45B,RRAS,SOS2,FGF17,PPP3CC,TNFRSF1A,PDGFRB...
HTLV-I infection_Homo sapiens_hsa05166	-0.249752	-0.899335	0.666667	0.996040	258	19	FZD2,ETS1,STAT5B,RRAS,LTBR,PPP3CC,TNFRSF1A,EGR...
PI3K-Akt signaling pathway_Homo sapiens_hsa04151	0.182245	0.668758	1.000000	1.000000	341	22	GNG13,VEGFC,GNB4,CSF1,SOS2,FGF17,THBS4,PDGFRB,...
Pathways in cancer_Homo sapiens_hsa05200	0.201838	0.573342	1.000000	1.000000	397	27	FZD2,ETS1,STAT5B,GNG13,VEGFC,GNB4,SOS2,FGF17,E...

It is strange that the P-value is either 0 or 1. After investigating the source, it seems the null distribution is broken.

Part of the values of esnull:

[-0.5277168069182326, -0.3607821392991908, -0.27070554833780985, 0.14716709004244327, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, ....]

The values after index 4 are identical, which leads to an odd null distribution.

After tracing the identical values, I found that they are caused by multiprocessing.Pool.

    if permutation_type == "phenotype":
        l2 = list(classes)
        dat2 = dat.copy()
        #multi-threading for rankings.
        rank_nulls=[]
        pool_rnkn = Pool(processes=processes) 

 
        for i in range(n):
            rs.shuffle(l2) 
            rank_nulls.append(pool_rnkn.apply_async(_rnknull, args=(dat2, method, 
                                                                  phenoPos, phenoNeg,
                                                                  l2, ascending)))
        pool_rnkn.close()
        pool_rnkn.join()

In fact, all 1000 RandomState.shuffle calls have finished before the tasks in the multiprocessing.Pool actually run, so the classes vector held in the memory of pool_rnkn keeps a reference to the last generated l2.

After copying the shuffled classes vector every time with:

    for i in range(n):
        rs.shuffle(l2)
        l3 = deepcopy(l2)
        rank_nulls.append(pool_rnkn.apply_async(_rnknull, args=(dat2, method,
                                                              phenoPos, phenoNeg,
                                                              l3, ascending)))

the module calculates the correct p-value.

I am not sure whether this issue only occurs on my system, but it seems several methods using multiprocessing.Pool may suffer from it. I hope this will be fixed soon.

Thanks~
Sheep
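
The aliasing effect described in this report can be reproduced without any process pool. The sketch below is a simplified illustration using the standard library's random.Random in place of NumPy's RandomState; it only shows why storing the same list object repeatedly loses the earlier permutations, and does not model how Pool serializes its arguments.

```python
import random

rs = random.Random(0)
classes = ["A", "A", "A", "B", "B", "B"]

l2 = list(classes)
aliased, copied = [], []
for _ in range(20):
    rs.shuffle(l2)
    aliased.append(l2)        # same list object every time
    copied.append(list(l2))   # snapshot of the current permutation

# Every aliased entry shows the final shuffle; the copies keep their own order.
distinct_aliased = len(set(map(tuple, aliased)))
distinct_copied = len(set(map(tuple, copied)))
```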

Memory Error

Hello BioNinja,
First of all thank you for your work! I'm a bio-informatics student and I'm trying to use your tool, however when I try to launch it gives me a memory error; this is a screen shot from my notebook.

The dataframe that I'm currently using has shape (9002, 4). Indeed I'm using jupyter notebook with anaconda environment and python 3.6.3. The version of gseapy is 0.9.2.

This is the code that I'm using:
gs_res = gp.gsea(data = final_df_gsea_cleaned, gene_sets = 'KEGG_2016',
cls = ['a', 'b', 'b', 'b'], outdir = './Data/gsea_report_bis')

Moreover, I receive a warning from the program because my data contain negative values (I standardized my dataframe with sklearn's StandardScaler), but running the same code with a dataframe without negative values ends in the same failure.

Is this a problem with the implementation, or am I using it wrongly? I have been trying to launch it all day, but it fails every time. If you need any more details about the error, please ask me.

Thank you!
Francesco

Error when using .enrichr with "KEGG_2016".

I am using "KEGG_2016" as gene set through gseapy.enrichr(), but keep getting the following error:

'<=' not supported between instances of 'str' and 'float'

Using "GO_Biological_Process_2018", "GO_Biological_Process_2017", "KEGG_2015" works.

Input genes with their DE already calculated

I have a set of genes with their differential expressions already calculated and I would like to know whether it is possible to input this sort of data (instead of raw expression + class vector) to the gsea main function?

Thank you and regards!
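
Precomputed scores are the kind of input a pre-ranked analysis takes: other reports on this page run `gseapy prerank -r <file>.rnk -g <file>.gmt -o <outdir>`. The sketch below writes such a ranking file with the standard library, assuming the usual two-column, tab-separated layout (gene, score). The gene names, scores and the helper `write_rnk` are hypothetical.

```python
import csv
import os
import tempfile


def write_rnk(scores, path):
    """Write {gene: score} as a two-column, tab-separated file, highest first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(ordered)
    return ordered


# Hypothetical log2 fold changes standing in for precomputed DE results.
de_scores = {"TP53": 2.1, "EGFR": -1.4, "MYC": 0.3}
rnk_path = os.path.join(tempfile.mkdtemp(), "example.rnk")
ranking = write_rnk(de_scores, rnk_path)
```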

replot function bug

Hello, when I use the gseapy replot function to replot my GSEA results, it works well with the default gene sets, for example hallmark, c2, c5 and so on. However, when I use a user-defined gene set, such as an enzyme and its substrates, I get an error and I still don't know why.
Traceback (most recent call last):
  File "/home/yukai/bin/gseapy", line 10, in <module>
    sys.exit(main())
  File "/home/yeying/miniconda3/lib/python2.7/site-packages/gseapy/__main__.py", line 30, in main
    rep.run()
  File "/home/yeying/miniconda3/lib/python2.7/site-packages/gseapy/gsea.py", line 761, in run
    rank_metric = self._load_ranking(rank_path)
  File "/home/yeying/miniconda3/lib/python2.7/site-packages/gseapy/gsea.py", line 66, in _load_ranking
    rank_metric.sort_values(by=rank_metric.columns[1], ascending=self.ascending, inplace=True)
  File "/home/yeying/miniconda3/lib/python2.7/site-packages/pandas/core/indexes/base.py", line 2084, in __getitem__
    return getitem(key)
IndexError: index 1 is out of bounds for axis 0 with size 1

[Question] Is there a way to use custom gene sets?

Is there a way to use custom gene sets? For example, what if I had the following data:

# Gene list
gene_list = [
"gene_A", 
"gene_B", 
...
"gene_xyz"
]

and then a bunch of gene sets like this:

gene_sets = {
"gene_set_1": ["gene_A", "gene_B", ...],
"gene_set_2": ["gene_B", "gene_C", ...],
...
"gene_set_100": ["gene_A", "gene_T", ...]
}

Is there still a way to run this tool to figure out which gene sets are enriched?

I usually deal with microbiome datasets with de-novo called ORFs from prodigal so there are no gene ids that would be useable. A lot of my friends in cancer labs always talk about GSEA but I don't have the types of IDs that they use. Though, I still have "gene sets" that I could use so I feel like it could still apply.
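
Since other reports on this page pass a .gmt file as gene_sets, one route is to write the dictionary out in GMT layout: one line per set with the set name, a description, and then the genes, all tab-separated. The sketch below does this with the standard library. The helper names and the "NA" description are illustrative, and the identifiers can be any strings, including de novo ORF names, as long as they match the names used in the ranked or expression data.

```python
import os
import tempfile


def write_gmt(gene_sets, path, description="NA"):
    """Write {set_name: [genes]} as GMT: name, description, genes (tabs)."""
    with open(path, "w") as handle:
        for name, genes in sorted(gene_sets.items()):
            handle.write("\t".join([name, description] + list(genes)) + "\n")


def read_gmt(path):
    """Read a GMT file back into {set_name: [genes]}."""
    gene_sets = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            gene_sets[fields[0]] = fields[2:]
    return gene_sets


custom_sets = {"gene_set_1": ["gene_A", "gene_B"], "gene_set_2": ["gene_B", "gene_C"]}
gmt_path = os.path.join(tempfile.mkdtemp(), "custom.gmt")
write_gmt(custom_sets, gmt_path)
round_trip = read_gmt(gmt_path)
```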

How could I run ssGSEA with your module?

Hi,

If I run your implementation of GSEA from the prerank step for each sample that I have, would that be the same as running ssGSEA?

Thank you

All the gene sets got Zero score at 562

Hi,

After I updated the gseapy to 0.95, although the data in gseapy.gsea.gene_set.report.csv looks good, in the generated score map, ALL the gene sets got 0 score at 562, no matter how the covered genes are distributed in the list.

Can anyone check this issue? Thanks!

ko00030.gsea.pdf
ko00190.gsea.pdf
ko00230.gsea.pdf
ko00250.gsea.pdf

TypeError when running gseapy.gsea() with P53 example

Python version : 2.7
GSEApy version : 0.9.4

Hi.
When using the P53 example provided, I am getting 'TypeError':

Thank you so much in advance!

Divide by zero problem

I am trying to use the prerank function with my own data. However, I get the following error message:

/home/anaconda2/lib/python2.7/site-packages/gseapy/algorithm.py:524: RuntimeWarning: divide by zero encountered in true_divide
  choicelist = [np.sum(esnull < es.reshape(len(es),1), axis=1)/ np.sum(esnull < 0, axis=1),
/home/anaconda2/lib/python2.7/site-packages/gseapy/algorithm.py:524: RuntimeWarning: invalid value encountered in true_divide
  choicelist = [np.sum(esnull < es.reshape(len(es),1), axis=1)/ np.sum(esnull < 0, axis=1),
/home/anaconda2/lib/python2.7/site-packages/gseapy/algorithm.py:575: RuntimeWarning: Mean of empty slice.
  meanNeg = enrNull[enrNull < 0 ].mean()
/home/anaconda2/lib/python2.7/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in true_divide
  ret = ret.dtype.type(ret / rcount)
My weights are from 0 to 1. I get the error above after I normalize the weights to be from -1 to 1. It doesn't matter what the weight values are, right? Do they have to include both positive and negative values?

Also, I do not get any plots generated even with more than 1500 genes. Any pointers on what I could try?
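
The quoted expression divides by `np.sum(esnull < 0)`, which is zero whenever none of the permuted scores for a gene set is negative, and the "Mean of empty slice" warning has the same origin. The pure-Python sketch below shows the same calculation with an explicit guard. It is an illustration, not GSEApy's code: `nominal_pvalue` is a hypothetical name and the use of inclusive comparisons is an assumption.

```python
def nominal_pvalue(es, esnull):
    """Empirical p-value of `es` against same-signed null scores; NaN if none."""
    if es < 0:
        same_sign = [x for x in esnull if x < 0]
        extreme = [x for x in same_sign if x <= es]
    else:
        same_sign = [x for x in esnull if x >= 0]
        extreme = [x for x in same_sign if x >= es]
    if not same_sign:  # nothing to compare against: undefined, not an error
        return float("nan")
    return len(extreme) / float(len(same_sign))


p_undefined = nominal_pvalue(-0.5, [0.1, 0.2, 0.3])     # no negative nulls
p_positive = nominal_pvalue(0.25, [0.1, 0.2, 0.3, -0.4])
```

Under this reading the warnings signal an undefined p-value for one sign of the null distribution rather than a crash, which matches the maintainer's note elsewhere on this page that the warning does not affect the final results.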