krishnanlab / pecanpy
A fast, parallelized, memory efficient, and cache-optimized Python implementation of node2vec
License: BSD 3-Clause "New" or "Revised" License
It seems that the default mode of the command line client is PreComp. I tried it on a network with ~500K nodes / ~9M edges and, after it had been running for a while, realized I had made a poor choice based on the insights outlined in your manuscript; I should probably have chosen the SparseOTF mode instead. Since the manuscript already suggests size and density cutoffs for switching between modes, it would be helpful to output a warning after the graph has been loaded if the recommended mode does not match the selected one (especially since the mode is an optional parameter, and people like me will use the defaults unless otherwise encouraged).
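A sketch of what such a post-load check could look like. Both the `suggest_mode` helper and the cutoff values here are purely hypothetical; the real thresholds would come from the manuscript:

```python
import warnings

def suggest_mode(num_nodes, num_edges, density_cutoff=0.01):
    """Suggest a pecanpy mode from graph size and density.
    The cutoff values are illustrative placeholders, NOT the
    manuscript's actual recommendations."""
    density = num_edges / (num_nodes * (num_nodes - 1))
    if density >= density_cutoff:
        return "DenseOTF"
    if num_nodes >= 100_000:
        return "SparseOTF"
    return "PreComp"

def warn_if_mismatched(selected_mode, num_nodes, num_edges):
    """Emit a warning right after graph loading if the selected mode
    differs from the size/density-based suggestion."""
    suggested = suggest_mode(num_nodes, num_edges)
    if suggested != selected_mode:
        warnings.warn(f"Mode {selected_mode} selected, but {suggested} is "
                      f"recommended for this graph size/density.")
```

With real cutoffs filled in, calling `warn_if_mismatched` once after loading would have caught my ~500K-node case before hours of PreComp preprocessing.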
Advantages
Still, the original storage strategy (saving as txt) is useful for smaller networks, due to the ease of inspecting the embedding text file.
To toggle saving as .emd.npz, simply specify the output file with the file extension .emd.npz instead of .emd. The .emd.npz file contains two arrays: IDs (node ids, of size n) and data (embedding vectors, of size n x d).
Not a bug; you need to run preprocess_transition_probs first. Maybe one thing to do is to run it automatically if it hasn't been run yet?
Originally posted by @RemyLau in #61 (comment)
The main research output of this manuscript is obviously the code in this repository. However, to assess the veracity of the claims made in the paper about its goodness in comparison to previous libraries, the code for running the benchmarking experiments, organizing their respective results, and turning them into the figures that appear in the main manuscript and supplemental material should be made available (in a form that is readily useful). I'd go as far as to suggest putting it in a different repository entirely, to mitigate confusion and disorganization.
I appreciate that it's not easy to set up a barrage of different packages written in a variety of languages, and understand that it might take significant time and effort to prepare this code for the same scrutiny as the package itself. I hope you agree that it is necessary to assess the methodology used for benchmarking to give an accurate review.
It's non-obvious how the following alias_draw() function works.
PecanPy/src/pecanpy/node2vec.py
Lines 262 to 273 in 12c047d
I've also seen a similar Python implementation copy-pasted from StackOverflow, where nobody seemed to know how it worked either. However, because this piece of code is now part of an academic paper, it would do some good to make it instructive to a potential reader; code is, after all, the methods section and a major result of your paper.
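For readers landing here: alias_draw is the sampling half of Walker's alias method, which draws from a discrete distribution in O(1) after an O(n) table setup. A commented pure-Python sketch of the idea (not pecanpy's actual code):

```python
import random

def alias_setup(probs):
    """Build Walker's alias table: O(n) setup for O(1) discrete sampling.
    Each 'column' k keeps probability q[k] for outcome k itself and hands
    the rest of its unit mass to a single alias outcome J[k]."""
    n = len(probs)
    q = [p * n for p in probs]  # scale so the average column height is 1
    J = [0] * n
    smaller = [i for i, x in enumerate(q) if x < 1.0]
    larger = [i for i, x in enumerate(q) if x >= 1.0]
    while smaller and larger:
        small, large = smaller.pop(), larger.pop()
        J[small] = large            # top up the short column with `large`
        q[large] += q[small] - 1.0  # `large` gave away (1 - q[small]) mass
        (smaller if q[large] < 1.0 else larger).append(large)
    return J, q

def alias_draw(J, q):
    """Draw one sample: pick a column uniformly at random, then a biased
    coin flip decides between the column's own outcome and its alias."""
    k = random.randrange(len(J))
    return k if random.random() < q[k] else J[k]
```

A short comment like this above the real implementation would go a long way for readers who have never seen the alias method.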
I'd be happy to assist in writing the unit tests and making them easy to run after you've prepared the rest, if you want any help with that. I'd also help you set up Travis-CI to automatically run the tests every time you make changes to the package, if you'd like.
PecanPy/src/pecanpy/node2vec.py
Line 407 in 2777907
In the example given in https://pecanpy.readthedocs.io/en/latest/pecanpy.html#module-pecanpy.node2vec, it reads "# at this point, the random walks could be fed to w2v to generate embeddings"; however, it is not clear to me how, and with which w2v implementation, this should be done. Could you perhaps recommend a preferred word2vec package? Thank you!
After #14 is merged, the docs are updated, and #13 is closed, you can make an account at https://readthedocs.org, which will automatically build the docs every time you push and host the HTML. You can get a nice vanity URL like pecanpy.readthedocs.io, and also get another badge to add to your repo.
Since the package is pip-installable and does not rely on any exotic dependencies, you should deploy it to PyPI. If you've never done this before, follow these steps in bash from the root directory of the repository:
python3 -m pip install twine
python3 setup.py -q sdist bdist_wheel
twine upload --skip-existing dist/*
You should do this step after you name a certain version as a "release" and upload it to Zenodo (as outlined in #4), then immediately follow up with a commit to bump the version in setup.cfg from 0.0.1 to 0.0.2-dev. Later, when you're done with version 0.0.2-dev (or whatever you want to call it), you take away the -dev suffix before releasing again. This suffix is important because it lets users know that they're not on a stable version if they're installing from GitHub's master (main) branch.
Also, please mention in your manuscript that your code is available through PyPI! There are many potential users who won't be comfortable with cloning the repo, or even with the tricky git+https:// way of installing from GitHub, who would benefit from easy installation through PyPI.
Does the library automatically detect the number of cores and parallelize, or do we have to specify it? If we have to specify, how do we do that?
Dear @RemyLau and @arjunkrish,
I've been invited to review your manuscript. I'm sure this was no coincidence, as I'm always trolling bioRxiv for interesting software papers and making PRs :) Overall, I'm quite happy with the way you wrote the manuscript, especially how you structured the dichotomy between the main manuscript and supplement. I will give specific comments for the paper itself in the review you will receive directly from the journal.
You'll also see I've left quite a few issues on this repository ranging from specific code suggestions to more philosophical discussions. I've meant all in good faith - have no doubt that your paper will be accepted. However, I'm likely one of the pickiest code reviewers in the entire bioinformatics community, so we're going to make your repository the best it can possibly be before the review is done. I hope you're excited about the idea, since you're already 90% of the way there. Most methods published in the literature are utterly useless, so I already really appreciate the work you've done and how you've done it until now.
Greetings from Bonn, Germany. Have a nice weekend.
-Charlie Hoyt
Related issues:
Though much of the main functionality is exposed through a command line interface, there is no documentation on how to import pecanpy and use it in other pipelines. I will send a PR to get you started; then you will see where else you need to add/improve documentation and create some tutorials.
Currently the repository is released under a CC BY-NC-SA 4.0 License as per LICENSE.md.
This is a very restrictive license that is incompatible with other licenses and is not designed for software.
If the goal is to create an open source software package that researchers can use and improve upon, then I'd recommend an OSI-approved license at https://opensource.org/licenses.
For Python packages, the BSD Licenses are popular. I personally use the BSD-2-Clause Plus Patent License. This license is OSI-approved and rated highly for its simplicity, compatibility, and effectiveness.
Happy to answer any license questions.
As a potential user of this package, I am wary of considering it further until it has a license that ensures a sustainable future and an open existence!
Awesome project @RemyLau, thanks for sharing your work on it!
Is there a way to seed the embedding functions/classes so that they produce the same results every time? I didn't see anything to that end in https://github.com/krishnanlab/PecanPy_benchmarks.
I haven't tried it yet, but after a brief look at the code, I'd assume that seeding the numpy generator both within and outside of a numba function might do it (something like numba/numba#6002 (comment)). I'm not sure if that will work with @njit(parallel=True), though. Have you already figured out how to make that work?
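To make the idea concrete, here is a toy check of the pure-numpy part of reproducible walk generation. The function and graph are made up for illustration; whether the same holds inside an @njit function, especially with parallel=True, is exactly the open question:

```python
import numpy as np

def toy_walks(adj_list, num_walks, walk_length, seed):
    """Generate toy first-order walks reproducibly by seeding numpy up
    front. This only demonstrates the pure-numpy part; inside a numba
    @njit function the seed would have to be set again within the
    jitted code, since numba keeps its own RNG state."""
    rng = np.random.default_rng(seed)
    walks = []
    for _ in range(num_walks):
        node = int(rng.integers(len(adj_list)))
        walk = [node]
        for _ in range(walk_length - 1):
            node = int(rng.choice(adj_list[node]))  # uniform neighbor
            walk.append(node)
        walks.append(walk)
    return walks
```

Two calls with the same seed produce identical walks; with parallel=True, thread scheduling may still break this even if each thread's RNG is seeded.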
This isn't a major issue, but it irks me that the time stamps aren't aligned or left-zero-padded in
PecanPy/src/pecanpy/wrappers.py
Line 18 in 12c047d
This will fix it:
print("Took %02d:%02d:%05.2f to %s"%(hrs, mins, secs, self.name))
Results:
Took 00:00:00.07 to load graph
Took 00:00:03.37 to pre-compute transition probabilities
Took 00:00:03.00 to generate walks
Took 00:00:06.14 to train embeddings
Hi,
with version 2.0.1, this issue emerged:
TypeError: simulate_walks() missing 2 required positional arguments: 'n_ckpts' and 'pb_len'
Once I downgraded to 2.0.0, it worked as before.
This was originally done following the node2vec source code. It makes sense to specify these parameters during the embedding function call, so that when embedding with different settings of p and q, we do not need to load the same graph object over and over again. The only catch is that for PreComp, we need to make sure to rerun the preprocessing function each time we change the p and q settings.
I'll send a PR your way to make this package installable with pip and so you can upload it to PyPI as well. I'll also add some nice entrypoints so you don't have to write python -m ...
@RemyLau what are your thoughts on rejection sampling used here https://github.com/louisabraham/fastnode2vec ?
When p and q are set to 1, the second order transition computation is unnecessary, leading to extra runtime. Furthermore, when the random walks are first order, there's no need to use an OTF type approach, since the size of the transition probabilities is the same as that of the graph. Thus SparseOTF and DenseOTF can be treated as SparsePreComp and DensePreComp.
Two options to implement the above enhancement:
1. Add _SparsePreComp and _DensePreComp as hidden classes, which are used by the "front end" classes PreComp, SparseOTF, and DenseOTF when p = q = 1.
2. Check for p = q = 1 within SparseOTF and DenseOTF; if so, use a specialized get_normalized_probs (which also needs preprocessing to row-normalize the adjacency matrix), similar to the approach taken for implementing node2vec+.
Option 1 seems to be the cleaner solution.
Thinking about this, maybe node2vec+ needs to be implemented this way also.
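To illustrate why first-order walks make the OTF approach unnecessary: with p = q = 1, a single row-normalized transition matrix, the same size as the graph, is all that's needed. A hypothetical sketch, not pecanpy's implementation:

```python
import numpy as np

def first_order_walk(adj, start, length, rng):
    """First-order walk sketch: with p = q = 1 the next step depends only
    on the current node, so the row-normalized adjacency matrix can be
    precomputed once and sampled directly -- no second-order transition
    table (and hence no on-the-fly computation) is needed."""
    probs = adj / adj.sum(axis=1, keepdims=True)  # the "PreComp" part
    walk = [start]
    for _ in range(length - 1):
        current = walk[-1]
        walk.append(int(rng.choice(adj.shape[0], p=probs[current])))
    return walk
```

The precomputed `probs` here has exactly the same nonzero pattern as `adj`, which is the point of the proposed SparsePreComp/DensePreComp treatment.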
In the case of a dead end during a random walk (for a directed graph), the random walk process is simply stopped in the current implementation (L111). However, because the random walk index array is preinitialized with zeros (L91), this causes a problem: the word2vec model interprets the walk as never having halted, with the rest of the sequence being the word that corresponds to index zero.
PecanPy/src/pecanpy/node2vec.py
Lines 88 to 111 in 9c6ca6f
To fix this issue, the translator (L131) needs to be told to stop when a dead end is encountered.
PecanPy/src/pecanpy/node2vec.py
Line 131 in 9c6ca6f
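One way to sketch the fix: mark the unused tail of the preinitialized walk array with a sentinel, and have the translator stop there instead of emitting node 0. The function name and sentinel value here are illustrative, not the actual pecanpy code:

```python
import numpy as np

PAD = -1  # sentinel marking the unused tail of a preinitialized walk array

def translate_walk(walk_idx, id_list):
    """Translate a walk of node indices into node IDs, stopping at the
    sentinel that marks a dead end. Without the sentinel, the zero-filled
    tail would be misread as repeated visits to the node at index 0."""
    words = []
    for idx in walk_idx:
        if idx == PAD:
            break
        words.append(id_list[idx])
    return words
```

The walk generator would then fill halted walks with PAD instead of leaving the zero initialization in place.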
Gensim word2vec is single-precision, so it might be a good idea to also use single precision for random walk generation. More specifically, the stored word vectors c.syn0 are of REAL_t type, which, according to their definition, is single-precision (np.float32_t).
Benefits of switching to single precision include reduced memory usage for the generated walks and consistency with gensim's internal representation.
For small graphs like BioGRID, it might be reasonable to wait for 30 seconds or so in the different steps, but as the graphs start getting large and training time gets into the minutes/hours, it would be nice to communicate progress to the user.
I would normally suggest using tqdm, but after modifying the code myself, it doesn't look like it plays nicely with numba's parallelization. I found this thread, numba/numba#4267, that describes the issue in a bit more detail, and the outlook is a bit grim.
It's not a dealbreaker, but it would make your package much more user friendly.
Gensim changed the size keyword argument to vector_size in gensim ~4. This broke pecanpy, as it pins gensim >=3.8.0.
The resulting traceback looks like:
Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/bin/pecanpy", line 8, in <module>
    sys.exit(main())
  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/cli.py", line 182, in main
    main_helper(parse_args())
  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/cli.py", line 177, in main_helper
    timed_emb()
  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/wrappers.py", line 18, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/cli.py", line 171, in timed_emb
    learn_embeddings(args=args, walks=walks)
  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/cli.py", line 148, in learn_embeddings
    iter=args.iter,
TypeError: __init__() got an unexpected keyword argument 'size'
I will probably open a PR once I identify the breaking version (either to pin the version or just to fix the keyword argument usage), maybe later today, but I'm opening this in case I forget, so others can see it. The easy fix is just to pin gensim at 3.8 locally.
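A version-agnostic workaround could map the settings to the right keyword names before constructing the model. The helper below is hypothetical, not part of pecanpy:

```python
def w2v_kwargs(gensim_version, dim=128, epochs=5):
    """Map embedding settings to the keyword names expected by the
    installed gensim: version 4 renamed `size` -> `vector_size` and
    `iter` -> `epochs`. (Hypothetical helper, not pecanpy code.)"""
    major = int(gensim_version.split(".")[0])
    if major >= 4:
        return {"vector_size": dim, "epochs": epochs}
    return {"size": dim, "iter": epochs}
```

In practice you'd call it with gensim.__version__ and splat the result into the Word2Vec constructor, which lets the pin stay at gensim >=3.8.0.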
Backup source on breaking changes: https://stackoverflow.com/questions/53195906/getting-init-got-an-unexpected-keyword-argument-document-this-error-in
I have a file with nearly 70M edges, and it fails to load into memory on a machine with 32 GB of RAM.
Implement random state specification for random walk generations to enable reproducibility.
While GitHub is an excellent place to centralize your code and keep it under version control, it is not so appropriate for archiving the state of the repository. Zenodo is a proper alternative, and it will even make your life easy by connecting to GitHub. I'd recommend following this tutorial to:
I personally think Zenodo is the best tool for code because of its integration with GitHub, but other good services for archiving your research output also exist, such as Figshare.
More on the theme of "code as the methods section": I would strongly suggest applying some formatting to the code, as it currently has a mashup of inconsistent styles. This makes it difficult to read. There's also the secondary benefit of finding possible mistakes while doing the cleanup. A really good way to start enforcing a consistent code style is to run black. This tool is completely automatic and gets your code into an almost reader-ready state.
Again, I'm happy to help you understand these things and work through it.