
gseapy's People

Contributors

136s, abhi-glitchhg, alireza-majd, cthoyt, engelsdaniel, fairliereese, falexwolf, ffroehlich, hsiaoyi0504, jacobkimmel, jfinkle1016, oreh, pearcekieser, pirakd, sorrge, theaustinator, yxngl, zqfang

gseapy's Issues

Properly saving replot figures

Hi,

I've noticed that when running gseapy replot the output folder does not contain any output files. I believe this is because of the save command on line 77 of gsea.py.

The problem is the extra period after the first forward slash, which makes the files appear as invisible on my OS X filesystem.

I have updated my local version like so:
fig.savefig('{a}/{b}.gsea.replot.{c}'.format(a=outdir, b=enrich_term, c=format),
and now see produced files.
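
A minimal sketch of the path construction described above, assuming the original format string placed a stray period directly after the slash. The helper name `replot_path` is hypothetical, and `os.path.join` is swapped in for the hand-written `/` so that the file name is built separately from the directory.

```python
import os


def replot_path(outdir, enrich_term, fmt):
    """Build the output path for one replotted term (hypothetical helper)."""
    # The file name must not start with '.', otherwise it becomes a hidden
    # file on Unix-like systems such as OS X.
    filename = "{b}.gsea.replot.{c}".format(b=enrich_term, c=fmt)
    return os.path.join(outdir, filename)


path = replot_path("out", "HALLMARK_P53_PATHWAY", "pdf")
print(path)
```

Keeping the file name in its own variable makes a leading period easy to spot in review.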

Small question about heatmap

Hello,

When gsea.call produces figures, one of them is a heatmap of gene set expression, and in all cases (negatively or positively correlated, NES > 0 or NES < 0) it shows the top-ranked genes. In papers I have seen heatmaps that present the top genes within the core enrichment, which is the left or right edge of the ranked genes from the set, depending on the sign of NES. How can I control the heatmap output in GSEApy?
Thank you

ssGSEA sample_norm_method="custom" produces error

Hello

First, thank you very much for making this tool! I was trying to use Single Sample GSEA (ssgsea) with the option sample_norm_method="custom", but I couldn't see a way to provide the custom ranking. Looking through the code, I noticed on lines 556-558

GSEApy/gseapy/gsea.py

Lines 556 to 558 in d2439dc

elif self.sample_norm_method == 'custom':
self._logger.info("Use custom rank metric for ssGSEA")
else:

of gsea.py that, in the handling of the custom case in the method norm_samples, the return variable data is not created, so an error is thrown immediately if you call it with this option, e.g.

ss2 = gseapy.ssgsea(g.data_df, gene_sets=geneset_output_file, verbose=True, 
    sample_norm_method="custom")
UnboundLocalError: local variable 'data' referenced before assignment

Did you mean to have data = dat in this section and then when the user uses the custom option they provide the ranking?
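
The failure mode and the fix suggested above can be reproduced without gseapy. The sketch below is a simplified stand-in, not the library's norm_samples: the list-based data and the 'rank' branch are assumptions made for illustration.

```python
def norm_samples(dat, method):
    """Toy stand-in for a normalization step with a 'custom' option."""
    if method == "rank":
        # Replace each value by its 1-based rank within the sample.
        order = sorted(range(len(dat)), key=lambda i: dat[i])
        data = [0] * len(dat)
        for rank, i in enumerate(order, start=1):
            data[i] = rank
    elif method == "custom":
        # Assumed fix: pass the user-supplied ranking through unchanged.
        # Without this assignment, `return data` raises UnboundLocalError.
        data = dat
    else:
        raise ValueError("unknown method: {}".format(method))
    return data


print(norm_samples([0.3, -1.2, 2.5], "rank"))    # [2, 1, 3]
print(norm_samples([0.3, -1.2, 2.5], "custom"))  # [0.3, -1.2, 2.5]
```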

Thank you in advance!

RuntimeError

RuntimeError:
Attempt to start a new process before the current process
has finished its bootstrapping phase.
This probably means that you are on Windows and you have
forgotten to use the proper idiom in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce a Windows executable.

I have been getting this message all day. I did exactly the same as your documentation suggested. All files were prepared just fine.

ssGSEA FDR q value > 1

Hi,

I ran ssGSEA with GSEApy using the MSigDB hallmark gene sets, and everything looks OK, except that for some of the gene sets I get FDR q-values > 1.

Is this a bug?

Thanks,
Alejandro

A small question about the original GSEA implementation

I found that if I use the signal-to-noise method to rank the data, the output differs between GSEApy and the Java implementation from the Broad Institute.

e.g.

input: p53_resample, KEGG_2016 (del 1.00 in gmt file from Enrichr) and p53.cls.

part of result

name GSEApy JAVAImpl
DIP2A 0.486848 0.42881304

Then I found that the Java implementation uses this method to get the stddev:

    /**
     * @return The std aa_indev of vector
     *         // Some heuristics for adjusting variance based on data from affy chips
     *         // NOTE: problem occurs when we threshold to a value, then that artificially
     *         //   reduces the variance in the data
     *         /
     *         // First, we make the variance at least a fixed percent of the mean
     *         // If the mean is too small, then we use an absolute variance
     *         /
     *         // However, we don't want to bias our algs for affy data, e.g. we may
     *         // get data in 0..1 and in that case it is not appropriate to use
     *         // an absolute standard deviation of 10 - will kill the signal.
     *         /
     */
    public double stddev(boolean biased, boolean fixlow) {

        double stddev = Math.sqrt(_var(biased));    // @note call to _var and not var
        double mean = computeset.mean;                    // avoid recalc

        if (fixlow) {
            double minallowed = (0.20 * Math.abs(mean));

            // In the case of a zero mean, assume the mean is 1
            if (minallowed == 0) {
                minallowed = 0.20;
            }

            if (minallowed < stddev) {
                // keep orig
            } else {
                stddev = minallowed;
            }
        }

        computeset.stddev = stddev;

        return computeset.stddev;
    }

view source: https://github.com/GSEA-MSigDB/gsea-desktop/blob/2f44b81359917cacb9e0b113b2c540c56f38a5d7/src/edu/mit/broad/genome/math/Vector.java#L416-L453

Could you please explain why they use a minallowed stddev?
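
A Python port of the Java method quoted above makes the effect of the floor easier to see. This is a sketch for comparison only: the function name `adjusted_stddev` is hypothetical, and whether this floor fully accounts for the DIP2A difference is an assumption that has not been verified here.

```python
import math


def adjusted_stddev(values, biased=False, fixlow=True):
    """Standard deviation with the minimum-allowed floor from the Java code."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    # biased divides by n, unbiased by n - 1
    stddev = math.sqrt(squared_dev / (n if biased else n - 1))

    if fixlow:
        # Floor at 20% of |mean|; for a zero mean, assume the mean is 1.
        minallowed = 0.20 * abs(mean)
        if minallowed == 0:
            minallowed = 0.20
        stddev = max(stddev, minallowed)
    return stddev


print(adjusted_stddev([10.0, 10.0, 10.0, 10.0]))        # floor applies: 2.0
print(adjusted_stddev([1.0, 2.0, 3.0, 4.0, 5.0]))       # above the floor, unchanged
print(adjusted_stddev([10.0, 10.0, 10.0], fixlow=False))  # 0.0 without the floor
```

Because the signal-to-noise metric divides by a sum of standard deviations, a floor like this changes the score of any gene whose spread within a class is small relative to its mean.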

Odd memory behavior when using multiple processes

I've spent the better part of a day investigating this and I'm stumped, so I'll mention it in case you have any ideas. I'm running single sample GSEA using processes = 8. I run it first on my data in a dataframe with 200 genesets, then on another 200 genesets, and so on in a for loop (code attached). I'm not saving any data in the for loop, and yet the memory usage goes up. I've repeated this several times, and for each iteration the memory usage (%) is: 1.4, 2.7, 3.9, 5.1 (starts at 0.1, goes down to 3.8 after running gc once the loop finishes).

A couple interesting points:

  • increasing the size of my dataframe causes the numbers above to increase, but same pattern
  • re-running the calculation immediately causes this pattern of memory usage: 5.0, 5.0, 5.0, 5.0 (starts at 3.8, ends at 3.8)

The second point suggests it is not a memory leak, perhaps the GC is not properly cleaning up until the pool starts up again?

I tried running del on various objects and rewriting the code in various ways, without any visible effect on the memory usage pattern above.

The code I used to illustrate all of the above is attached as a Jupyter notebook:
07 illustrate memory leak.ipynb.gz

TypeError

With my genelist I am getting this error:
TypeError: sequence item 109: expected str instance, float found

gseapy prerank -r ls D*rnk -g /Users/kopardevn/Downloads/h.all.v6.0.symbols.gmt -o prerank_test_report_2
Traceback (most recent call last):
  File "/Users/kopardevn/anaconda/bin/gseapy", line 11, in <module>
    sys.exit(main())
  File "/Users/kopardevn/anaconda/lib/python3.6/site-packages/gseapy/__main__.py", line 43, in main
    args.ascending, args.figsize, args.format, args.graph, args.seed)
  File "/Users/kopardevn/anaconda/lib/python3.6/site-packages/gseapy/gsea.py", line 300, in prerank
    rdict['genes'] = ",".join(dat2.ix[ind,'gene_name'].tolist())
TypeError: sequence item 109: expected str instance, float found

h.all.v6.0.symbols.gmt has been downloaded from http://software.broadinstitute.org/gsea/downloads.jsp
and the genelist is

D.rnk.txt
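
The message says that item 109 of the joined sequence is a float rather than a string. One common cause, assumed here rather than confirmed for D.rnk.txt, is a missing gene name that was read as NaN. A small check like the following (the helper name is hypothetical) locates such entries before joining:

```python
def find_non_string_items(items):
    """Return (position, value) pairs for entries that are not strings."""
    return [(i, x) for i, x in enumerate(items) if not isinstance(x, str)]


gene_names = ["TP53", "BRCA1", float("nan"), "MYC"]
bad = find_non_string_items(gene_names)
print(bad)  # position 2 holds a float

# Joining only the string entries avoids the TypeError.
joined = ",".join(g for g in gene_names if isinstance(g, str))
print(joined)  # TP53,BRCA1,MYC
```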

gseapy.enrichr crashes on non-default gene set

Example (python 2.7)
import gseapy
l = ['SCARA3', 'LOC100044683', 'CMBL', 'CLIC6', 'IL13RA1', 'TACSTD2', 'DKKL1', 'CSF1',
'SYNPO2L', 'TINAGL1', 'PTX3', 'BGN', 'HERC1', 'EFNA1', 'CIB2', 'PMP22', 'TMEM173']

enrichr = gseapy.enrichr(gene_list=l, description='pathway', gene_sets='GO_Biological_Process_2017', outdir='test', cutoff=0.2)

Crashes

NameError Traceback (most recent call last)
in ()
3 'SYNPO2L', 'TINAGL1', 'PTX3', 'BGN', 'HERC1', 'EFNA1', 'CIB2', 'PMP22', 'TMEM173']
4
----> 5 enrichr = gseapy.enrichr(gene_list=l, description='pathway', gene_sets='GO_Biological_Process_2017', outdir='test', cutoff=0.2)
6
7 # or a txt file path.

/usr/lib/python2.7/site-packages/gseapy/enrichr.pyc in enrichr(gene_list, gene_sets, description, outdir, cutoff, format, figsize, top_term, no_plot, verbose)
150 enr = Enrichr(gene_list, gene_sets, description, outdir,
151 cutoff, format, figsize, top_term, no_plot, verbose)
--> 152 enr.run()
153
154 return enr

/usr/lib/python2.7/site-packages/gseapy/enrichr.pyc in run(self)
55 enrichr_library = DEFAULT_LIBRARY
56 else:
---> 57 enrichr_library = get_library_name()
58 if gene_set not in enrichr_library:
59 sys.stderr.write("%s is not a enrichr library name\n"%gene_set)

NameError: global name 'get_library_name' is not defined

ssgsea fails with pandas multiindex

When the input data to ssgsea is a multi-indexed pd.DataFrame, I get a "TypeError: not all arguments converted during string formatting" within the runSamples process.

This can be resolved by dropping the multi-index first, then adding it back after - but I assume there's a better solution.

No version 0.8.0 via pip install

Your package is wonderful, and I've been trying to install v0.8 through pip but it doesn't seem to be available. Any solution?

Problem running with entrez ids

Hi there, many thanks for your great package! I tried to run gsea with Entrez IDs instead of gene names (using custom gene sets from MSigDB, which offers both symbol-indexed and Entrez-ID-indexed signature sets). However, it appears that GSEApy assumes that gene IDs are always strings. Could you have a look at it?

Here is my traceback:

Traceback (most recent call last):
  File "[...]/bin/gseapy", line 11, in <module>
    sys.exit(main())
  File "[...]/lib/python3.7/site-packages/gseapy/__main__.py", line 41, in main
    gs.run()
  File "[...]/lib/python3.7/site-packages/gseapy/gsea.py", line 420, in run
    gmt=gmt, rank_metric=dat2, permutation_type=self.permutation_type)
  File "[...]/lib/python3.7/site-packages/gseapy/gsea.py", line 270, in _save_results
    rdict['genes'] = ";".join([ g.strip() for g in _genes ])
  File "[...]/lib/python3.7/site-packages/gseapy/gsea.py", line 270, in <listcomp>
    rdict['genes'] = ";".join([ g.strip() for g in _genes ])
AttributeError: 'numpy.int64' object has no attribute 'strip'
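
A possible workaround on the caller's side, assuming the gene identifiers can safely be treated as text, is to convert them to strings before they reach the join. The sketch below mirrors the failing line from the traceback with a `str()` conversion added; the function name is hypothetical and this is not the library's fix.

```python
def join_gene_ids(genes, sep=";"):
    """Join gene identifiers after converting each one to a stripped string."""
    return sep.join(str(g).strip() for g in genes)


print(join_gene_ids([7157, 672, " 4609 "]))  # 7157;672;4609
```

The same conversion would need to be applied consistently to the expression index and to the gene sets, otherwise integer and string IDs will not match each other.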

Support for custom background genes in Enrichr

Since custom GMT files are supported, I think a custom background gene list is a natural next step. It could be as simple as reading the unique IDs and passing them on to self._bg. We would need to test whether the string is an existing file before treating it as a Biomart dataset name. I can help if needed.
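
The file-or-dataset-name check proposed above could look roughly like this. It is a sketch under stated assumptions: `resolve_background` is a hypothetical helper, the background file is assumed to hold one identifier per line, and how the result would be wired into self._bg is left open.

```python
import os
import tempfile


def resolve_background(background):
    """Return a set of gene IDs if `background` is a file, else the name itself."""
    if os.path.isfile(background):
        with open(background) as handle:
            # One identifier per line; blank lines are skipped, duplicates collapse.
            return {line.strip() for line in handle if line.strip()}
    # Not a file on disk: treat it as a dataset name for the caller to resolve.
    return background


# Demonstration with a temporary background file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("TP53\nBRCA1\n\nTP53\n")
genes = resolve_background(tmp.name)
os.remove(tmp.name)

print(sorted(genes))                                # ['BRCA1', 'TP53']
print(resolve_background("hsapiens_gene_ensembl"))  # returned unchanged
```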

Reactome_2016

Running prerank leads to the following issue, only with Reactome_2016 and Reactome_2015:

/usr/software/conda/2.3.0/lib/python2.7/site-packages/gseapy/parser.pyc in ((line,))
125 print("Downloading and generating Enrichr library gene sets..............")
126 genesets_dict = { line.decode().split("\t")[0]: [gene.split(",")[0] for gene in line.decode().split("\t")[2:-1]]
--> 127 for line in response.iter_lines()}
128
129 else:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 65: ordinal not in range(128)
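
Byte 0xc2 is the first byte of a two-byte UTF-8 sequence, so the Reactome libraries presumably contain a non-ASCII character, and on Python 2 a bare `line.decode()` uses the ASCII codec. A sketch of the same parsing with an explicit codec follows; the function name and the sample line are made up for illustration.

```python
def parse_library_line(raw):
    """Parse one tab-separated library line given as bytes."""
    fields = raw.decode("utf-8").split("\t")  # explicit codec instead of ASCII
    term = fields[0]
    # Same slicing as the quoted code: skip the description, drop the last field,
    # and keep only the gene name in front of the comma.
    genes = [gene.split(",")[0] for gene in fields[2:-1]]
    return term, genes


# U+00A0 (non-breaking space) encodes to the bytes 0xc2 0xa0 in UTF-8.
sample = u"Example\u00a0pathway\t\tTP53,1.0\tBRCA1,1.0\t".encode("utf-8")
term, genes = parse_library_line(sample)
print(genes)  # ['TP53', 'BRCA1']
```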

Heatmap scale bar

Hi,

Sorry to bother you; this may be a basic question, but I could not find the answer in the documentation.

After running GSEA, I got a heatmap like the one below: [KD, KD, control, control]. I initially thought this indicated that these genes were repressed in my KD samples.
image

But when I checked these genes in my input files and made a scatter plot, I got confused. These genes do show a down-regulated tendency in the KD samples (x axis: KD, y axis: control), but not as strongly as in the heatmap above. Could you please explain what the heatmap values stand for? Why is the scale -1 to 1?

I used normalized read counts as input. The scatter plot is log2(counts + 1)
image

Thanks for your help in advance!

y-axis in enrichment score plot is all zeros

Hello

I'm doing a trivial "hello world" example of calculating enrichment score - I chose my gene sets to exactly match the 49 most down-regulated genes. When I look at the enrichment score plot, the y-axis numbers are all zero:
example_dn ssgsea

The code I used to generate this:
ss2 = gseapy.ssgsea(g.data_df.iloc[:, :1], gene_sets=geneset_output_file, verbose=True, format="png", outdir="temp2")

I believe for example the top-most tick mark should be 0.5.

Any ideas about this?

Thanks in advance!

enrichment test for two ranked gene lists

Hello,

I'm doing an enrichment test for two ranked gene lists. I couldn't find an appropriate function for this, as the gene_sets parameter always needs to be a .gmt file.

I thought that the p-value and enrichment scores might be affected by manually creating a gmt file, as there would be only one gene set in this customised gene sets file.

So I'm actually looking for a signed KS test (https://github.com/franapoli/signed-ks-test) in this package...

Any ideas? Many thanks! :)

run speed and heatmaps in 0.9.4 vs 0.9.8

I noticed performance issues after upgrading to 0.9.8. I was wondering if that is to be expected. I think performance in 0.9.8 also suffers from lack of scaling with number of processes allowed (processes=1 seems pretty similar to processes>1), but I haven't really confirmed this.

Here is sample code based on the docs, if anyone would like to replicate and report back:

import pandas as pd
import gseapy as gp
import time
gene_exp = pd.read_table("./data/P53.txt")
for processes in range(1, 4):
    start_time = time.time()
    gs_res = gp.gsea(data=gene_exp, # or data='./P53_resampling_data.txt',
                     gene_sets='./data/temp.gmt', # enrichr library names
                     cls= './data/P53.cls', # cls=class_vector
                     permutation_type='phenotype',
                     permutation_num=100, # reduce number to speed up test
                     outdir=f'./test/{gp.__version__}/{processes}',  # one output directory per version and process count
                     method='signal_to_noise',
                     processes=processes,
                     format='png'
                    )
    print(f'with {processes} process(es):\t {round((time.time() - start_time), 3)} seconds')

(Use the data folder under test from the package for inputs.)
It produces

with 1 process(es):	 133.614 seconds
with 2 process(es):	 76.917 seconds
with 3 process(es):	 71.06 seconds

for v0.9.4 on my laptop. v0.9.8 took over 20 min with 1 process, so I didn't continue.

With the heatmaps (hsa05330 set in this example), far more genes are shown for the 0.9.8 version (lower) than for 0.9.4 (upper). Is there a new argument to use here?
image

Any comment is appreciated.

Leading edge genes

Hi and thanks for the python implementation of GSEA!
Maybe I just overlooked this, but is there a way to get hold of the leading edge genes for enriched gene sets for further downstream analysis?

Thanks,

Nico

NaNs in ssGSEA NES

Hi,

I have run ssGSEA using gseapy, the ES values I get are OK, but when I look at the NES values in some cases I get a NaN. I checked the corresponding ES values and they are not 0. For example in one case I have an ES == 327.7 and the NES == NaN. And this doesn't seem to relate to the geneset, because for the same geneset other samples have an ES and a numerical NES as expected.

Is this a bug?

Happy to provide some sample data if required.
Best,
Alejandro

gseapy.prerank - errors for small ranked lists

Hi,
Thanks for implementing GSEA in Python; it's now quicker to perform analyses.

Please note a bug with gseapy.prerank. With default parameters and when the ranked list is small (<= 15), the function returns

No gene sets passed through filtering condition!!!, try new parameters again!
Note: check gene name, gmt file format, or filtering size.

I resolved the problem by lowering the min_size parameter. So, as long as min_size is lower than the ranked list size, there are no errors. But according to the documentation of the GSEA software, this parameter should apply to the gene sets and not to the expression dataset.

Even with no elements in the ranked list matching with the gene sets, the function should return a warning but not an error.

Moreover, in case of errors, the gene set list passed to the function is emptied.
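
One way to read this behavior, assuming the size filter counts only the genes of each set that also occur in the ranked list, is sketched below. The function name and the max_size default are illustrative, and min_size=15 matches the threshold implied by the report; this is not gseapy's code.

```python
def filter_gene_sets(gene_sets, ranked_genes, min_size=15, max_size=500):
    """Keep gene sets whose overlap with the ranked list is within the size limits."""
    ranked = set(ranked_genes)
    kept = {}
    for name, genes in gene_sets.items():
        matched = [g for g in genes if g in ranked]
        if min_size <= len(matched) <= max_size:
            kept[name] = matched
    # A new dict is returned, so the caller's gene_sets is never emptied.
    return kept


ranked = ["g%d" % i for i in range(10)]            # a short ranked list
sets = {"set_a": ["g%d" % i for i in range(20)]}   # a 20-gene set

print(filter_gene_sets(sets, ranked))              # {}: only 10 genes can match
print(len(filter_gene_sets(sets, ranked, min_size=5)["set_a"]))  # 10
```

Under this assumption a ranked list shorter than min_size can never pass the filter, which matches the observation that lowering min_size resolves the problem.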

Best regards,
Michaël

fdr values

Hi,

This is more of a question than an issue; I hope that's OK.

My question is about the ssGSEA FDR calculation:

If you have, say, 50 gene sets, and assuming that in reality all of them are actually "enriched", will the FDR calculation by default output some non-significant (q > 0.05) values, or would all the gene sets get a significant value (q < 0.05)? Again, assuming for example that all the genes in those gene sets are highly expressed.

Thanks

Question about ssGSEA FDR

Hello,
My question is about the ssGSEA FDR.
In my results, there are several FDR = 0. Is there any way to get the actual FDR value?

screen shot 2018-10-14 at 22 47 43

Thanks

ssGSEA: Accept a Series with gene names as index and return a dataframe

Hello,
I'm very excited about a Python implementation of ssGSEA! I'd like to convert my gene expression values to pathway expression values by using apply to perform ssGSEA on every row of a pandas.DataFrame expression matrix. Right now, this looks like this:

expression.head().apply(
    lambda x: gp.ssgsea(x.reset_index(), gene_sets=gmt), 
    axis=1)
  1. The reset_index() seems unnecessary for every row
  2. Each run of ssgsea returns None, rather than returning the converted pathway enrichment. I'd rather not have to read a file for every single sample I have (~6,000 of them), so can ssgsea return the Series instead?

Warmest,
Olga

expected enrichment score of -0.5, got -0.496 instead

Hello

I'm doing a trivial example where I created a gene set of the 49 most up-regulated genes in my dataset and then ran ssgsea. In this trivial example I expect to get an enrichment score of -0.5, but instead I get -0.496.

The command I used:
ss2 = gseapy.ssgsea(g.data_df.iloc[:, :1], gene_sets=geneset_output_file, verbose=True, format="png", outdir="temp2")

I can provide the data and gene set (geneset_output_file) if you would like. Here is the output file temp2/gseapy.ssgsea.gene_sets.report.csv, the relevant line is example_up:

# normalize enrichment scores by random permutation procddure
# Same method to the orignial GSEA method, and it's not proper to use these values in your publication
Term	es	nes	pval	fdr	gene_set_size	matched_size	genes
example_dn	0.500053595619	8.26711866264	0.0	0.0	49	49	8869,9053,9170,80204,4775,7048,976,7376,22905,6839,79921,28969,5873,93487,6988,9653,3508,1870,55011,55746,64746,8270,2767,57178,23443,54957,211,2769,7043,9143,4998,10954,6944,7099,7165,51001,23161,8837,55825,5867,10732,808,93594,891,8985,4482,10610,8446,637
example_up	-0.496476958504	-22.8445330718	0.0	0.0	49	49	3383,7158,256364,23097,2146,23139,355,30849,10285,56654,5641,23386,8518,10276,178,147179,8573,993,9448,63874,6195,51335,25966,1445,83743,26128,392,890,8895,23,9375,56889,10450,26511,57761,51024,10245,23014,10493,1019,9928,1802,11188,9093,1633,51282,4144,51160,9709

You can see the es for example_up is -0.496476958504 instead of the expected -0.5

Thank you in advance, any ideas / thoughts greatly appreciated.

multiprocessing issue

Python version: 3.5
GSEApy version 0.8.3
Platform: macOS 10.12.5

When I follow the example from the GSEA section of the examples, I get a result like this:

es nes pval fdr gene_set_size matched_size genes
Term
Cytokine-cytokine receptor interaction_Homo sapiens_hsa04060 0.229069 1.556522 0.000000 0.000000 265 18 IL10RB,VEGFC,CSF1,TNFSF12,LTBR,CXCL10,TNFRSF1A...
Ras signaling pathway_Homo sapiens_hsa04014 0.332033 1.141144 0.002004 0.002003 227 18 ETS1,GNG13,RRAS,VEGFC,GNB4,CSF1,SOS2,FGF17,PDG...
Rap1 signaling pathway_Homo sapiens_hsa04015 -0.285975 -1.621806 0.000000 0.005941 211 19 RRAS,VEGFC,CSF1,FGF17,PDGFRB,FGF4,PDGFC,SIPA1L...
MAPK signaling pathway_Homo sapiens_hsa04010 -0.392928 -1.037983 0.500000 0.010396 255 18 GADD45B,RRAS,SOS2,FGF17,PPP3CC,TNFRSF1A,PDGFRB...
HTLV-I infection_Homo sapiens_hsa05166 -0.249752 -0.899335 0.666667 0.996040 258 19 FZD2,ETS1,STAT5B,RRAS,LTBR,PPP3CC,TNFRSF1A,EGR...
PI3K-Akt signaling pathway_Homo sapiens_hsa04151 0.182245 0.668758 1.000000 1.000000 341 22 GNG13,VEGFC,GNB4,CSF1,SOS2,FGF17,THBS4,PDGFRB,...
Pathways in cancer_Homo sapiens_hsa05200 0.201838 0.573342 1.000000 1.000000 397 27 FZD2,ETS1,STAT5B,GNG13,VEGFC,GNB4,SOS2,FGF17,E...

It is strange that the p-value is either 0 or 1. After investigating the source, it seems the null distribution is broken.

Part of the esnull values:

[-0.5277168069182326, -0.3607821392991908, -0.27070554833780985, 0.14716709004244327, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, -0.19182208447985646, ....]

The values from index 4 onward are identical, which leads to an odd null distribution.

After tracing the identical values, I found that they are caused by the multiprocessing.Pool.

    if permutation_type == "phenotype":
        l2 = list(classes)
        dat2 = dat.copy()
        #multi-threading for rankings.
        rank_nulls=[]
        pool_rnkn = Pool(processes=processes) 

 
        for i in range(n):
            rs.shuffle(l2) 
            rank_nulls.append(pool_rnkn.apply_async(_rnknull, args=(dat2, method, 
                                                                  phenoPos, phenoNeg,
                                                                  l2, ascending)))
        pool_rnkn.close()
        pool_rnkn.join()

In fact, all 1000 RandomState.shuffle calls have completed by the time the multiprocessing.Pool is closed, so the class vectors held in the memory of pool_rnkn all reference the last generated l2.

After copying the shuffled class vector on every iteration:

        for i in range(n):
            rs.shuffle(l2)
            l3 = deepcopy(l2)
            rank_nulls.append(pool_rnkn.apply_async(_rnknull, args=(dat2, method,
                                                                  phenoPos, phenoNeg,
                                                                  l3, ascending)))

the module calculates the correct p-value.

I am not sure whether this issue only occurs on my system, but it seems that several methods using multiprocessing.Pool may suffer from it. I hope this will be fixed soon.
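
The aliasing effect can be shown without a process pool. In the sketch below a plain list stands in for the pool's task queue (an assumption made to keep the example deterministic): the queued arguments are only read after the loop has finished, as can happen when submitted tasks are serialized later than they are queued.

```python
import random

rs = random.Random(0)
labels = list(range(10))

queued_by_reference = []
queued_by_copy = []
for _ in range(10):
    rs.shuffle(labels)                   # shuffles in place
    queued_by_reference.append(labels)   # the same list object every time
    queued_by_copy.append(list(labels))  # snapshot of this permutation

# Read the queued arguments only now, after all shuffles are done.
distinct_by_reference = {tuple(x) for x in queued_by_reference}
distinct_by_copy = {tuple(x) for x in queued_by_copy}

print(len(distinct_by_reference))  # 1: every entry shows the final shuffle
print(len(distinct_by_copy))       # the individual permutations are preserved
```

For a flat list of labels, `list(labels)` is enough; `deepcopy` as used above is only needed when the elements themselves are mutable.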

Thanks~
Sheep

Memory Error

Hello BioNinja,
First of all, thank you for your work! I'm a bioinformatics student and I'm trying to use your tool. However, when I try to launch it, it gives me a memory error; this is a screenshot from my notebook.

istantanea_2018-01-03_16-32-50

The dataframe that I'm currently using has shape (9002, 4). I'm using Jupyter Notebook in an Anaconda environment with Python 3.6.3. The version of gseapy is 0.9.2.

This is the code that I'm using:
gs_res = gp.gsea(data = final_df_gsea_cleaned, gene_sets = 'KEGG_2016',
cls = ['a', 'b', 'b', 'b'], outdir = './Data/gsea_report_bis')

Moreover, I receive a warning from the program because my dataframe contains negative values (I standardized it with the sklearn standard scaler), but running the same code with a dataframe without negative values ends in the same failure.

Is this a problem with the implementation or with how I'm using it? I have been trying to launch it all day, but it fails every time. If you need any more details about the error, please ask.

Thank you!
Francesco

Error when using .enrichr with "KEGG_2016".

I am using "KEGG_2016" as gene set through gseapy.enrichr(), but keep getting the following error:

'<=' not supported between instances of 'str' and 'float'

Using "GO_Biological_Process_2018", "GO_Biological_Process_2017", "KEGG_2015" works.

Input genes with their DE already calculated

I have a set of genes with their differential expressions already calculated and I would like to know whether it is possible to input this sort of data (instead of raw expression + class vector) to the gsea main function?
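
Precomputed statistics are the kind of input that a pre-ranked analysis takes (the prerank entry point discussed in other issues here). A sketch of writing such a ranking as a two-column, tab-separated file follows. The gene names, scores, and file name are made up, and the headerless gene/score layout is an assumption that should be checked against the documentation of your installed version.

```python
import os
import tempfile

# Made-up differential expression scores (e.g. log fold changes).
scores = {"TP53": 2.31, "MYC": -1.75, "BRCA1": 0.42}

# Sort from most up-regulated to most down-regulated.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

rnk_path = os.path.join(tempfile.gettempdir(), "example.rnk")
with open(rnk_path, "w") as handle:
    for gene, score in ranked:
        handle.write("{}\t{}\n".format(gene, score))

with open(rnk_path) as handle:
    rnk_lines = handle.read().splitlines()
print(rnk_lines)
```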

Thank you and regards!

replot function bug

Hello, when I use the gseapy replot function to replot my GSEA results, it works well with a default gene set, for example hallmark, c2, c5 and so on. However, when I use a user-defined gene set, such as an enzyme and its substrates, I get an error and I don't know why.
Traceback (most recent call last):
  File "/home/yukai/bin/gseapy", line 10, in <module>
    sys.exit(main())
  File "/home/yeying/miniconda3/lib/python2.7/site-packages/gseapy/__main__.py", line 30, in main
    rep.run()
  File "/home/yeying/miniconda3/lib/python2.7/site-packages/gseapy/gsea.py", line 761, in run
    rank_metric = self._load_ranking(rank_path)
  File "/home/yeying/miniconda3/lib/python2.7/site-packages/gseapy/gsea.py", line 66, in _load_ranking
    rank_metric.sort_values(by=rank_metric.columns[1], ascending=self.ascending, inplace=True)
  File "/home/yeying/miniconda3/lib/python2.7/site-packages/pandas/core/indexes/base.py", line 2084, in __getitem__
    return getitem(key)
IndexError: index 1 is out of bounds for axis 0 with size 1

[Question] Is there a way to use custom gene sets?

Is there a way to use custom gene sets? For example, what if I had the following data:

# Gene list
gene_list = [
"gene_A", 
"gene_B", 
...
"gene_xyz"
]

and then a bunch of gene sets like this:

gene_sets = {
"gene_set_1": ["gene_A", "gene_B", ...],
"gene_set_2": ["gene_B", "gene_C", ...],
...
"gene_set_100": ["gene_A", "gene_T", ...]
}

Is there still a way to run this tool to figure out which gene sets are enriched?

I usually deal with microbiome datasets with de-novo called ORFs from prodigal so there are no gene ids that would be useable. A lot of my friends in cancer labs always talk about GSEA but I don't have the types of IDs that they use. Though, I still have "gene sets" that I could use so I feel like it could still apply.
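
Gene sets held in a dictionary can be written to a GMT file, the tab-separated layout (set name, description, then one gene per column) used by the .gmt inputs mentioned throughout these issues. The identifiers do not need to be official gene symbols as long as they match the identifiers in the expression or ranking data. The helper name, file name, and set contents below are invented for illustration.

```python
import os
import tempfile


def write_gmt(gene_sets, path, description="NA"):
    """Write a {name: [genes]} dict as a GMT file, one gene set per line."""
    with open(path, "w") as handle:
        for name, genes in gene_sets.items():
            handle.write("\t".join([name, description] + list(genes)) + "\n")


gene_sets = {
    "gene_set_1": ["gene_A", "gene_B"],
    "gene_set_2": ["gene_B", "gene_C"],
}
gmt_path = os.path.join(tempfile.gettempdir(), "custom_sets.gmt")
write_gmt(gene_sets, gmt_path)

with open(gmt_path) as handle:
    gmt_lines = handle.read().splitlines()
print(gmt_lines)
```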

Divide by zero problem

I am trying to use the prerank function with my own data. However, I get the following error message:

/home/anaconda2/lib/python2.7/site-packages/gseapy/algorithm.py:524: RuntimeWarning: divide by zero encountered in true_divide
  choicelist = [np.sum(esnull < es.reshape(len(es),1), axis=1)/ np.sum(esnull < 0, axis=1),
/home/anaconda2/lib/python2.7/site-packages/gseapy/algorithm.py:524: RuntimeWarning: invalid value encountered in true_divide
  choicelist = [np.sum(esnull < es.reshape(len(es),1), axis=1)/ np.sum(esnull < 0, axis=1),
/home/anaconda2/lib/python2.7/site-packages/gseapy/algorithm.py:575: RuntimeWarning: Mean of empty slice.
  meanNeg = enrNull[enrNull < 0 ].mean()
/home/anaconda2/lib/python2.7/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in true_divide
  ret = ret.dtype.type(ret / rcount)

My weights are from 0 to 1. I get the error above after I normalize the weights to be from -1 to 1. It doesn't matter what the weight values are, right? Do they have to include both positive and negative values?
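
The warnings point at divisions by the count and the mean of negative null scores. Assuming the normalization divides each enrichment score by the mean of the null scores that share its sign (the usual GSEA convention), a null distribution without any negative values leaves nothing to divide by. The sketch below is a stand-alone illustration with a hypothetical function name, not the code in algorithm.py.

```python
def normalize_es(es, es_null):
    """Divide es by the mean magnitude of the null scores with the same sign."""
    same_sign = [x for x in es_null if (x < 0) == (es < 0)]
    if not same_sign:
        # No null score on this side of zero: the normalized score is undefined.
        return float("nan")
    mean_abs = sum(abs(x) for x in same_sign) / len(same_sign)
    if mean_abs == 0:
        return float("nan")
    return es / mean_abs


print(normalize_es(0.5, [0.2, 0.3, -0.1]))   # 2.0
print(normalize_es(-0.4, [0.2, 0.3]))        # nan: the null has no negative scores
```

If this assumption holds, rankings that yield an all-positive null would produce exactly these warnings, and NaN values in the normalized scores would follow from them.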

Also, I do not get any plots generated even with more than 1500 genes. Any pointers on what I could try?

Mislocated labels in Heatmap

For some of the heatmaps generated by gsea, the heatmap labels are placed at seemingly random coordinates.
A few examples are attached. Thanks!

ppa04812 heatmap
ppa00230 heatmap
ppa04138 heatmap
