Giter Club home page Giter Club logo

Comments (8)

ri23 avatar ri23 commented on June 8, 2024

Hi @npokorzynski ,

In principle barplots are generated for all input genes. If a barplot for a gene is not generated, it is either because there are no GO annotations for that gene or it could not be included in the GeneWalk network for analysis. In the former case the gene is still listed in genewalk_results.csv, but with 0 GO annotations. The latter can happen when there is no uniprot ID for example: all those drop out genes with the reasons for drop out are listed in genewalk_all.log.

Cheers,
Robert

from genewalk.

npokorzynski avatar npokorzynski commented on June 8, 2024

@ri23 Thank you for that reply. Perhaps I'm misunderstanding then: When I run genewalk, the genewalk_results.csv contains far more genes with associated GO terms (many of which are significant below FDR p 0.01) than bar plots produced. I also am observing that between conditions a variable number of bar plots are generated, so that gave me the impression that there was some criteria selecting which genes barplots would be produced for under a given condition. Perhaps you can clarify what is causing this?

from genewalk.

ri23 avatar ri23 commented on June 8, 2024

This is unexpected, at the moment I am not sure what is happening.

Can you clarify: do you mean some barplots are not generated in the figures/barplot folder as a .png file, and/or they are not visible in the index.html file?

Can you check what the confidence interval cilow_global_padj values are for genes that are present in genewalk_results.csv but without plot?

Moving forward, would you be able to share your genewalk_results.csv and genewalk_all.log files and mention a few genes that did not have a bar plot generated? My email is [email protected] if you don't want your files here on github in the public domain. Alternatively a screen shot of the above could also help to figure this out.

Cheers,
Robert

from genewalk.

npokorzynski avatar npokorzynski commented on June 8, 2024

@ri23 With some troubleshooting I may have revealed at least one part of the problem, though I'm not sure why this would cause issues:

It turns out that if I run the analysis with Ensembl IDs instead of HGNC symbols, I produce the expected index file and barplots folder containing all barplots. In the above, I was getting the results csv file with all the relevant genes, but only a portion of the barplots and no index.html file.

Here are the log and results files for the analysis I ran using hgnc_symbol as the ID:

genewalk_all.log

genewalk_results.csv.zip

This has led me to a related but distinct question - in the analysis run with ensembl_id (from the same starting gene list just with the IDs changed, which contains 1500 genes) I receive an output of 975 genes with associated GO terms (which seems high to me, e.g. 65% of my input is returned as functionally relevant?), with very few even reaching FDR p < 0.1. In fact, 0.0946 is the lowest p-value I get from any of the results. If needed, I can start a new issue for this, but I've attached the log and results files for this analysis as well.

genewalk_all.log

genewalk_results.csv.zip

from genewalk.

ri23 avatar ri23 commented on June 8, 2024

Hi @npokorzynski ,

From the log file with hgnc symbols as input, I suspect GeneWalk got killed after 3h. Could it be that you ran a job on a cluster or something with this time limit (from 14:52 to 17:52)? For the ensembl input, you ran it with parallelization (great!) so GeneWalk ran faster (~1h15min) and finished as expected. According to the log file:
genewalk.nx_mg_assembler - Number of PC originating nodes 1027
vs
genewalk.nx_mg_assembler - Number of PC originating nodes 1029
So the number of gene nodes in the network is almost identical (1027 vs 1029 gene nodes): almost all hgnc_symbols are recognized the same as the ensembl IDs. The rest of the 1500 input genes are filtered out because they are pseudogenes/ as and lncRNAs etc without Uniprot IDs as listed also in the log file.

If you want to check you can rerun the hgnc symbol input with parallelization and it should provide all the barplots and index.html within the same time frame as your ensembl ID input run.

Regarding your second question:
Your genewalk_results.csv output contains 975 genes: those are all input genes with at least 1 GO annotation (subset of 1029 in the network). So, the genes are in there irrespective of whether the GO annotations are statistically significant or not.

Out of the 975 a lot can have 1 or more significant (global_padj<0.1) GO annotation per gene. In our publication, we mention that for the QKI context, this was the case for 54% of the genes. We interpret this as a lot genes have some significant GO annotation, but not all annotations are relevant for a given gene and different annotations are relevant for different genes (unlike the typical GO enrichment results, which test for global relevance, so you get the 1 GO term for a set of genes).

Also yes the global_padj does not extend much below 0.1, this is because of the FDR multiple testing correction across all ~11k gene-GO term pairs.

In the paper we do mention there is limited stochasticity between independent replicate runs (Supplementary Fig. S2), because GeneWalk relies on random walk sampling. As a result the cosine similarities, resulting p-value and FDR adjust p-values can vary slightly between runs. This is also why we provide 95% confidence intervals on the p-values. And this explains why some of the global_padj of your hgnc symbol run are slightly lower than 0.0946, the lowest value of your second run.

If such variability due to stochasticity bothers you: you can choose to rerun GeneWalk with --nreps_graph 10 and --nreps_null 10. This decreases the stochasticity because GeneWalk now runs the network representation learning DeepWalk part 10 times instead of 3 and the null distributions is also more populated from 10 replicate runs instead of 3.

If you are interested in further gene selection, then focus in on the regulator scatter plot in the top right corner. There are the genes with a high fraction of relevant GO annotations and many gene-gene connections. Once you find a gene of interest from there, then look at the barplot for that gene to understand what their relevant functions are.

So in conclusion all your output is as expected. I hope these clarifications help you interpret the results further.

Best,
Robert

from genewalk.

npokorzynski avatar npokorzynski commented on June 8, 2024

@ri23 Thank you that was incredibly helpful! Do you happen to have a code snippet to recreate the scatterplots? If not, I can close out this issue.

from genewalk.

ri23 avatar ri23 commented on June 8, 2024

Hi @npokorzynski
Great, yes so we save the scatterplots by default also as separate .png, .pdf and the interactive plot as .html files in the figures folder. If you want to recreate, you can call the GW_Plotter.scatterplot_regulators function or copy over code from there: https://github.com/churchmanlab/genewalk/blob/master/genewalk/plot.py#L57 The package dependencies are listed at the top of the plot.py file.

from genewalk.

npokorzynski avatar npokorzynski commented on June 8, 2024

@ri23 okay, I guess this must have also been a consequence of the incomplete run from my non-parallelized attempts - I had only previously observed the .png file as an output (the resolution of which is understandably not great). In the complete/parallelized run I get the .pdf file. In that case I'm all good, for now at least. Thanks again for all the help!

from genewalk.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.