Comments (8)
Hi @npokorzynski ,
In principle barplots are generated for all input genes. If a barplot for a gene is not generated, it is either because there are no GO annotations for that gene or it could not be included in the GeneWalk network for analysis. In the former case the gene is still listed in genewalk_results.csv, but with 0 GO annotations. The latter can happen when there is no uniprot ID for example: all those drop out genes with the reasons for drop out are listed in genewalk_all.log
.
Cheers,
Robert
from genewalk.
@ri23 Thank you for that reply. Perhaps I'm misunderstanding then: When I run genewalk, the genewalk_results.csv contains far more genes with associated GO terms (many of which are significant below FDR p 0.01) than bar plots produced. I also am observing that between conditions a variable number of bar plots are generated, so that gave me the impression that there was some criteria selecting which genes barplots would be produced for under a given condition. Perhaps you can clarify what is causing this?
from genewalk.
This is unexpected, at the moment I am not sure what is happening.
Can you clarify: do you mean some barplots are not generated in the figures/barplot
folder as a .png file, and/or they are not visible in the index.html
file?
Can you check what the confidence interval cilow_global_padj values are for genes that are present in genewalk_results.csv but without plot?
Moving forward, would you be able to share your genewalk_results.csv and genewalk_all.log files and mention a few genes that did not have a bar plot generated? My email is [email protected] if you don't want your files here on github in the public domain. Alternatively a screen shot of the above could also help to figure this out.
Cheers,
Robert
from genewalk.
@ri23 With some troubleshooting I may have revealed at least one part of the problem, though I'm not sure why this would cause issues:
It turns out that if I run the analysis with Ensembl IDs instead of HGNC symbols, I produce the expected index file and barplots folder containing all barplots. In the above, I was getting the results csv file with all the relevant genes, but only a portion of the barplots and no index.html file.
Here are the log and results files for the analysis I ran using hgnc_symbol as the ID:
This has led me to a related but distinct question - in the analysis run with ensembl_id (from the same starting gene list just with the IDs changed, which contains 1500 genes) I receive an output of 975 genes with associated GO terms (which seems high to me, e.g. 65% of my input is returned as functionally relevant?), with very few even reaching FDR p < 0.1. In fact, 0.0946 is the lowest p-value I get from any of the results. If needed, I can start a new issue for this, but I've attached the log and results files for this analysis as well.
from genewalk.
Hi @npokorzynski ,
From the log file with hgnc symbols as input, I suspect GeneWalk got killed after 3h. Could it be that you ran a job on a cluster or something with this time limit (from 14:52 to 17:52)? For the ensembl input, you ran it with parallelization (great!) so GeneWalk ran faster (~1h15min) and finished as expected. According to the log file:
genewalk.nx_mg_assembler - Number of PC originating nodes 1027
vs
genewalk.nx_mg_assembler - Number of PC originating nodes 1029
So the number of gene nodes in the network is almost identical (1027 vs 1029 gene nodes): almost all hgnc_symbols are recognized the same as the ensembl IDs. The rest of the 1500 input genes are filtered out because they are pseudogenes/ as and lncRNAs etc without Uniprot IDs as listed also in the log file.
If you want to check you can rerun the hgnc symbol input with parallelization and it should provide all the barplots and index.html within the same time frame as your ensembl ID input run.
Regarding your second question:
Your genewalk_results.csv output contains 975 genes: those are all input genes with at least 1 GO annotation (subset of 1029 in the network). So, the genes are in there irrespective of whether the GO annotations are statistically significant or not.
Out of the 975 a lot can have 1 or more significant (global_padj<0.1) GO annotation per gene. In our publication, we mention that for the QKI context, this was the case for 54% of the genes. We interpret this as a lot genes have some significant GO annotation, but not all annotations are relevant for a given gene and different annotations are relevant for different genes (unlike the typical GO enrichment results, which test for global relevance, so you get the 1 GO term for a set of genes).
Also yes the global_padj does not extend much below 0.1, this is because of the FDR multiple testing correction across all ~11k gene-GO term pairs.
In the paper we do mention there is limited stochasticity between independent replicate runs (Supplementary Fig. S2), because GeneWalk relies on random walk sampling. As a result the cosine similarities, resulting p-value and FDR adjust p-values can vary slightly between runs. This is also why we provide 95% confidence intervals on the p-values. And this explains why some of the global_padj of your hgnc symbol run are slightly lower than 0.0946, the lowest value of your second run.
If such variability due to stochasticity bothers you: you can choose to rerun GeneWalk with --nreps_graph 10
and --nreps_null 10
. This decreases the stochasticity because GeneWalk now runs the network representation learning DeepWalk part 10 times instead of 3 and the null distributions is also more populated from 10 replicate runs instead of 3.
If you are interested in further gene selection, then focus in on the regulator scatter plot in the top right corner. There are the genes with a high fraction of relevant GO annotations and many gene-gene connections. Once you find a gene of interest from there, then look at the barplot for that gene to understand what their relevant functions are.
So in conclusion all your output is as expected. I hope these clarifications help you interpret the results further.
Best,
Robert
from genewalk.
@ri23 Thank you that was incredibly helpful! Do you happen to have a code snippet to recreate the scatterplots? If not, I can close out this issue.
from genewalk.
Hi @npokorzynski
Great, yes so we save the scatterplots by default also as separate .png, .pdf and the interactive plot as .html files in the figures folder. If you want to recreate, you can call the GW_Plotter.scatterplot_regulators
function or copy over code from there: https://github.com/churchmanlab/genewalk/blob/master/genewalk/plot.py#L57 The package dependencies are listed at the top of the plot.py
file.
from genewalk.
@ri23 okay, I guess this must have also been a consequence of the incomplete run from my non-parallelized attempts - I had only previously observed the .png file as an output (the resolution of which is understandably not great). In the complete/parallelized run I get the .pdf file. In that case I'm all good, for now at least. Thanks again for all the help!
from genewalk.
Related Issues (20)
- ensembl_id HOT 1
- ML HOT 1
- rdflib=4.2.2 and python 3 version conflict HOT 2
- Rat genome HOT 1
- Installation failed. HOT 6
- Illustration of the GeneWalk network HOT 4
- network source file HOT 4
- Install issue HOT 8
- AttributeError: type object 'object' has no attribute 'dtype' HOT 2
- IndexError: list index out of range HOT 2
- No regulators identified? HOT 4
- INDRA script generation HOT 2
- Trouble Using GeneWalk HOT 2
- GO Terms HOT 1
- Error when calling word2vec: unexpected keyword argument 'size' HOT 2
- Mouse ids are not working with genewalk? HOT 1
- Code for regulator/moonlighter plots HOT 4
- GO annotations file comment rows have changed HOT 1
- AttributeError: module 'typing' has no attribute 'NoReturn HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from genewalk.