krishnanlab / pecanpy Goto Github PK

View Code? Open in Web Editor NEW

149.0 149.0 21.0 1.21 MB

A fast, parallelized, memory efficient, and cache-optimized Python implementation of node2vec

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

pecanpy's People

Contributors

Stargazers

Watchers

pecanpy's Issues

[Peer Review] Give feedback on potentially bad choices of settings

It seems that the default setting of the command line client is PreComp. I tried putting in a network with ~500K nodes / ~ 9M edges and after it had been running for a while, I realized that I had made a poor choice based on your insights outlined in the manuscript. Instead, I should have probably chosen the SparseOTF mode. Since you already make suggestions in the manuscript on what the cutoffs for size and density are to go between each, it would be helpful to also output a warning after the graph has been loaded if the mode does not match the selected mode (especially since selecting the mode is an optional parameter, and people, like me, will try using the default parameters unless otherwise encouraged)

Test transition probability computation

Run time tests

Use float32 for DenseOTF

Load network with from_mat then run PreComp

Add option to save embedding as ``.emd.npz`` file

Advantages

Faster loading time for embedding for later usage
Compact storage

Still, the original storage strategy (saving as txt) is still useful, for smaller networks due to the ease of inspection of the embedding text file.

To toggle save as .emd.npz, simply specify the output file with the file extension .emd.npz instead of .emd. In the .emd.npz file, there should be two arrays, one for IDs (node ids, of size n), and one for data (embedding vectors, of size n x d).

Gensim word2vec can not scale to networks with million nodes efficiently, find alternative

Call ``preprocess_transition_probs`` if not processed yet

Not a bug, need to run preprocess_transition_probs first. Maybe one thing to do is to automatically run this if not yet run?

Originally posted by @RemyLau in #61 (comment)

[Peer Review] Reproduction of results of manuscript

The main research output of this manuscript is obviously the code in this repository. However, to assess the veracity of the claims made in the paper about its goodness in comparison to previous libraries, the code for running the benchmarking experiments, organizing their respective results, and turning them into the figures that appear in the main manuscript and supplemental material should be made available (in a form that is readily useful). I'd go as far as to suggest putting it in a different repository entirely, to mitigate confusion and disorganization.

I appreciate that it's not easy to set up a barrage of different packages written in a variety of languages, and understand that it might take significant time and effort to prepare this code for the same scrutiny as the package itself. I hope you agree that it is necessary to assess the methodology used for benchmarking to give an accurate review.

[Peer Review] alias_draw function is too hard to understand

It's non-obvious how the following alias_draw() function works.

PecanPy/src/pecanpy/node2vec.py

Lines 262 to 273 in 12c047d

 @jit(nopython=True, nogil=True) 

 def alias_draw(J, q): 

 """ 

  Draw sample from a non-uniform discrete distribution using alias sampling. 

  """ 

 K = J.size 

 kk = np.random.randint(K) 

 if np.random.rand() < q[kk]: 

 return kk 

 else: 

 return J[kk]

I've also copy-pasted a similar python implementation from StackOverflow, where nobody seemed to know how it worked either. However, because this piece of code is now part of an academic paper, it would do some good to make it instructive to a potential reader - code is, after all, the methods section and major result of your paper.

Improve the docstring such that it explains what this function does and why
Include a reference to where you found it and what you extra reading you did to understand it yourself
Add type annotations for the arguments for the function
Improve the names of all variables appearing in the function
Write examples to illustrate when you put certain data in, you get certain data out. You'll have to seed the numpy random number generator (ref) to make sure the results are consistent.
Unit tests (which just make it possible to automatically check that your examples really do what they say they should)

I'd be happy to assist in writing the unit tests and making them easy to run after you've prepared the rest if you want any help with that. I'd also help you set up Travis-CI to automatically run the tests every time you make changes to the package if you'd like

Read from edge list

Read from adjacency matrix

Remove dev comment

PecanPy/src/pecanpy/node2vec.py

Line 407 in 2777907

# data, indices, indptr, p, q, cur_idx, prev_idx)

What happens after simulate_walks?

In the example given in https://pecanpy.readthedocs.io/en/latest/pecanpy.html#module-pecanpy.node2vec,
it reads "# at this point, the random walks could be fed to w2v to generate embeddings"
however, I am not clear how and which w2v should be utilized?
Could you provide perhaps a preferred word2vec package?

Thank you!

[Peer Review] Automatically build docs with ReadTheDocs

After #14 is merged, the docs are updated, and #13 is closed, you can make an account at https://readthedocs.org which will automatically build the docs every time you push and host the html. You can get a nice vanity URL like pecanpy.readthedocs.io, and also get another badge to add to your repo

graph

[Peer Review] Deploy to PyPI

Since the package is pip-installable and does not rely on any exotic dependencies, you should deploy it to PyPI. If you've never done this before, follow these steps in the bash in the root directory of the repository:

python3 setup.py -q sdist bdist_wheel 
python3 -m pip install twine
twine upload --skip-existing dist/*

You should do this step after you name a certain version as a "release" and upload to Zenodo (as outlined in #4) then immediately follow-up with a commit to bump the version in setup.cfg from 0.0.1 to 0.0.2-dev. Later, when you're done with version 0.0.2-dev (or whatever you want to call it), you take away the -dev suffix before releasing again. This suffix is important because it lets users know that they're not on stable versions if they're installing from github's master (main) branch

Also, please include in your manuscript that your code is available through PyPI! There are many potential users who won't be comfortable with cloning the repo or even using the tricky git+https:// way of installing through GitHub who would benefit from easy installation through PyPI.

Reg Multi Processing

Does the library automatically detect number of cores and parellize or do we have to specify ? If we have to specify how to do that ?

[Peer Review] Code Review for "PecanPy: a fast, efficient, and parallelized Python implementation of node2vec"

Dear @RemyLau and @arjunkrish,

I've been invited to review your manuscript. I'm sure this was no coincidence, as I'm always trolling bioRxiv for interesting software papers and making PRs :) Overall, I'm quite happy with the way you wrote the manuscript, especially how you structured the dichotomy between the main manuscript and supplement. I will give specific comments for the paper itself in the review you will receive directly from the journal.

You'll also see I've left quite a few issues on this repository ranging from specific code suggestions to more philosophical discussions. I've meant all in good faith - have no doubt that your paper will be accepted. However, I'm likely one of the pickiest code reviewers in the entire bioinformatics community, so we're going to make your repository the best it can possibly be before the review is done. I hope you're excited about the idea, since you're already 90% of the way there. Most methods published in the literature are utterly useless, so I already really appreciate the work you've done and how you've done it until now.

Greetings from Bonn, Germany. Have a nice weekend.

-Charlie Hoyt

Related issues:

[Peer Review] Add documentation for using code in Python

Though many of the main functionality is exposed through a command line interface, there is no documentation of how to import pecanpy and use it in other pipelines. I will send a PR to get you started, then you will see where else you need to add/improve documentation and create some tutorials.

pecanpy

Release under an open source license

Currently the repository is released under a CC BY-NC-SA 4.0 License as per LICENSE.md.

This is a very restrictive license that is incompatible with other licenses and is not designed for software.

If the goal is to create an open source software package that researchers can use and improve upon, then I'd recommend an OSI-approved license at https://opensource.org/licenses.

For Python packages, the BSD Licenses are popular. I personally use the BSD-2-Clause Plus Patent License. This license is OSI-approved and rated highly for its simplicity, compatability, and effectiveness.

Happy to answer any license questions.

As a potential user of this package, I am wary to consider it further until it has a license to ensure a sustainable future and open exisitence!

Seed random number generators

Awesome project @RemyLau, thanks for sharing your work on it!

Is there a way to seed the embedding functions/classes so that they produce the same results every time? I didn't see anything to that end in https://github.com/krishnanlab/PecanPy_benchmarks.

I haven't tried it yet, but I'd assume after a brief look at the code that seeding the numpy generator both within and outside of a numba function might do it (something like numba/numba#6002 (comment)). I'm not sure if that will work with @njit(parallel=True) though. Have you already figured out how to make that work?

[Peer Review] Fix logging time widths

This isn't a major issue but it irks me that the time stamps aren't aligned or left-zero-padded in

PecanPy/src/pecanpy/wrappers.py

Line 18 in 12c047d

print("Took %d:%d:%.2f to %s"%(hrs, mins, secs, self.name))

This will fix it:

print("Took %02d:%02d:%05.2f to %s"%(hrs, mins, secs, self.name))

Results:

Took 00:00:00.07 to load graph
Took 00:00:03.37 to pre-compute transition probabilities
Took 00:00:03.00 to generate walks
Took 00:00:06.14 to train embeddings

Unit tests

TypeError: simulate_walks() missing 2 required positional arguments: 'n_ckpts' and 'pb_len'

Hi,

with the version 2.0.1 there emerged this issue:

TypeError: simulate_walks() missing 2 required positional arguments: 'n_ckpts' and 'pb_len'

Once I downgraded to 2.0.0, it worked as before.

Option to save CSR graph to be used later, similar idea as dense graph for ``DenseOTF``

Speciy ``p`` and ``q`` when executing ``embed`` instead of during creation

Originally done so following the node2vec source code. It makes sense to specify these parameters during the embedding function call, so that when embedding with different settings of p and q, we do not need to load the save graph object over and over again.

The only catch is that for PreComp, need to make sure each time we change the p and q settings, we need to rerun the preprocessing function.

sparse_rw

dense_rw

[Peer Review] Package and make pip installable

I'll send a PR your way to make this package installable with pip and so you can upload it to PyPI as well. I'll also add some nice entrypoints so you don't have to write python -m ...

Using Rejection Sampling

@RemyLau what are your thoughts on rejection sampling used here https://github.com/louisabraham/fastnode2vec ?

Improve runtime for first order walk

When p and q are set to 1, the second order transition computation is unnecessary, leading to extra runtime.
Furthermore, when the random walks are first order, there's no need to use OTF type approach, since the size of the transition probabilities is the same as that of the graph.
Thus SparseOTF and DenseOTF can be treated as SparsePreComp and DensePreComp.

Two options to implement the above enhancement

Implement _SparsePreComp and _DensePreComp as hidden classes, which are used by the "front end" classes PreComp, SparseOTF, and DenseOTF, when p = q = 1.
Check whether p = q = 1 within SparseOTF and DenseOTF; if so, use specialized get_normalized_probs (also need preprocessing to row-normalize the adjacency matrix), similar to the approach taken for implementing node2vec+.

Option 1 seems to be a cleaner solution.
Thinking about this, maybe node2vec+ needs to be implemented this way also.

Static typing

Dead end walks not handled correctly

In the case of dead end during random walk (for directed graph), the random walk process is simply stopped in the current implementation (L111). However, because of the way the random walk index array are preinitialized with zeros (L91), this causes problem that the word2vec model will comprehend this as the walk is never halted and is continued with the rest of the sequence all being the word that correspond to zero index.

PecanPy/src/pecanpy/node2vec.py

Lines 88 to 111 in 9c6ca6f

 def node2vec_walks(): 

 """Simulate a random walk starting from start node.""" 

 n = start_node_idx_ary.size 

 walk_idx_mat = np.zeros((n, walk_length + 1), dtype=np.uint32) 

 walk_idx_mat[:, 0] = start_node_idx_ary 

 # progress bar parameters 

 n_checkpoints = 10 

 checkpoint = n / get_num_threads() // n_checkpoints 

 progress_bar_length = 25 

 private_count = 0 

 for i in prange(n): 

 start_node_idx = walk_idx_mat[i, 0] 

 walk_idx_mat[i, 1] = move_forward(start_node_idx) 

 for j in range(2, walk_length + 1): 

 cur_idx = walk_idx_mat[i, j - 1] 

 if has_nbrs(cur_idx): 

 prev_idx = walk_idx_mat[i, j - 2] 

 walk_idx_mat[i, j] = move_forward(cur_idx, prev_idx) 

 else: 

 print("Dead end!") # TODO: need to modify walks accordingly 

 break

To fix this issue, need to inform the translator (L131) to stop when dead end encountered.

PecanPy/src/pecanpy/node2vec.py

Line 131 in 9c6ca6f

walks = [[self.IDlst[idx] for idx in walk] for walk in node2vec_walks()]

Use single precision for transition probabilities (and edge weights)

Gensim word2vec is single-precision, so it might be a good idea to also use single-precision for random walk generation. More specifically, the stored word vectors c.syn0 are REAL_t type, which, according to heir definition, is single-precision (np.float32_t).

Benefits of switching to single precision:

Reduce memory usage (from the graph) by half
Might also improve run time since reading less from memory?

Write to npz (csr or dense)

Node2vec+ on directed graphs

[Peer Review] Add progress monitoring

For small graphs like BioGRID, it might be reasonable to wait for 30 seconds or so in the different steps, but as the graphs start getting large and training time gets into the minutes/hours, it would be nice to communicate progress to the user.

I would normally suggest using tqdm, but after modifying the code myself, it doesn't look like it plays nicely with numba's parallelization. I found this thread: numba/numba#4267 that describes the issue a bit more in detail, and the outlook is a bit grim.

It's not a dealbreaker, but it would make your package much more user friendly.

Gensim update breaks pecanpy due to size vs vector_size argument

Gensim has changed the size keyword argument to vector_size in genesis ~4. This broke pecanpy, as it pins gensim >=3.8.0.

The resulting traceback looks like:

Traceback (most recent call last):\n  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/bin/pecanpy", line 8, in <module>\n    sys.exit(main())\n  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/cli.py", line 182, in main\n    main_helper(parse_args())\n  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/cli.py", line 177, in main_helper\n    timed_emb()\n  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/wrappers.py", line 18, in wrapper\n    result = func(*args, **kwargs)\n  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/cli.py", line 171, in timed_emb\n    learn_embeddings(args=args, walks=walks)\n  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/cli.py", line 148, in learn_embeddings\n    iter=args.iter,\nTypeError: __init__() got an unexpected keyword argument \'size\'\n

I will probably open a PR once I identify the breaking version (either to ping the version or just to fix the keyword argument usage), maybe later today, but opening this in case I forget so others can see. The easy fix is just to pin gensim at 3.8 locally.

Backup source on breaking changes: https://stackoverflow.com/questions/53195906/getting-init-got-an-unexpected-keyword-argument-document-this-error-in

Memory Error while reading for large files

I've a file with nearly 70M edges and it fails to load into memory on a machine with 32 GB ram

Test random walk simulation

Random state

Implement random state specification for random walk generations to enable reproducibility.

[Peer Review] Archive repository on Zenodo

While GitHub is an excellent place to centralize your code and keep it under version control, it is not so appropriate for archiving the state of the repository. Zenodo is a proper alternative, and it will even make your life easy by connecting to GitHub. I'd recommend following this tutorial to:

log in to Zenodo with your GitHub account
enable its usage on the https://github.com/krishnanlab organization (an admin of the organization will have to do this, or a non-admin can ask for approval from the admin through he same process)
create a "release" on GitHub, which automatically notifies Zenodo to make an archive
Go to the Zenodo page and grab the newly minted DOI which corresponds to the code as a resource
a. Put the badge on the GitHub page (it will allow you to copy+paste the markdown already prepared for you)
b. Put the DOI in your manuscript and mention that you're using Zenodo to do it.

I personally thing Zenodo is the best tool for code because of its integration with GitHub, but other services exist for archiving your research output that are also good, such as Figshare.

Performance tests

Write to edge list

Read from npz (csr or dense)

Read from adjacency matrix

[Peer Review] Enforce consistent code style

More on the theme of "code as the methods section" - I would strongly suggest applying some formatting to the code, as it currently has a mashup of inconsistent styles. This makes it difficult to read. There's also the secondary benefit of finding possible mistakes when doing the cleanup. A really good way to start enforcing consistent code style is to:

Apply black. This tools is completely automatic and gets your code into an almost reader-ready state
Apply flake8. I'll shamelessly link you to my own blog post, which is exactly written for people who might just first be learning about code formatters, code style checkers, and automation of their respective application.
Set up continuous integration to make sure new commits don't break the style rules

Again, I'm happy to help you understand these things and work through it.

	@jit(nopython=True, nogil=True)
	def alias_draw(J, q):
	"""
	Draw sample from a non-uniform discrete distribution using alias sampling.
	"""
	K = J.size

	kk = np.random.randint(K)
	if np.random.rand() < q[kk]:
	return kk
	else:
	return J[kk]

	def node2vec_walks():
	"""Simulate a random walk starting from start node."""
	n = start_node_idx_ary.size
	walk_idx_mat = np.zeros((n, walk_length + 1), dtype=np.uint32)
	walk_idx_mat[:, 0] = start_node_idx_ary

	# progress bar parameters
	n_checkpoints = 10
	checkpoint = n / get_num_threads() // n_checkpoints
	progress_bar_length = 25
	private_count = 0

	for i in prange(n):
	start_node_idx = walk_idx_mat[i, 0]
	walk_idx_mat[i, 1] = move_forward(start_node_idx)

	for j in range(2, walk_length + 1):
	cur_idx = walk_idx_mat[i, j - 1]
	if has_nbrs(cur_idx):
	prev_idx = walk_idx_mat[i, j - 2]
	walk_idx_mat[i, j] = move_forward(cur_idx, prev_idx)
	else:
	print("Dead end!") # TODO: need to modify walks accordingly
	break

krishnanlab / pecanpy Goto Github PK

pecanpy's People

Contributors

Stargazers

Watchers

Forkers

pecanpy's Issues

Recommend Projects

Recommend Topics

Recommend Org