Giter Club home page Giter Club logo

pecanpy's People

Contributors

arjunkrish avatar aviswaroop avatar cakiki avatar cthoyt avatar dependabot[bot] avatar justinlboyer avatar pascalsun avatar pre-commit-ci[bot] avatar remylau avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

pecanpy's Issues

[Peer Review] Give feedback on potentially bad choices of settings

It seems that the default setting of the command line client is PreComp. I tried putting in a network with ~500K nodes / ~ 9M edges and after it had been running for a while, I realized that I had made a poor choice based on your insights outlined in the manuscript. Instead, I should have probably chosen the SparseOTF mode. Since you already make suggestions in the manuscript on what the cutoffs for size and density are to go between each, it would be helpful to also output a warning after the graph has been loaded if the mode does not match the selected mode (especially since selecting the mode is an optional parameter, and people, like me, will try using the default parameters unless otherwise encouraged)

Add option to save embedding as ``.emd.npz`` file

Advantages

  • Faster loading time for embedding for later usage
  • Compact storage

Still, the original storage strategy (saving as txt) is still useful, for smaller networks due to the ease of inspection of the embedding text file.

To toggle save as .emd.npz, simply specify the output file with the file extension .emd.npz instead of .emd. In the .emd.npz file, there should be two arrays, one for IDs (node ids, of size n), and one for data (embedding vectors, of size n x d).

[Peer Review] Reproduction of results of manuscript

The main research output of this manuscript is obviously the code in this repository. However, to assess the veracity of the claims made in the paper about its goodness in comparison to previous libraries, the code for running the benchmarking experiments, organizing their respective results, and turning them into the figures that appear in the main manuscript and supplemental material should be made available (in a form that is readily useful). I'd go as far as to suggest putting it in a different repository entirely, to mitigate confusion and disorganization.

I appreciate that it's not easy to set up a barrage of different packages written in a variety of languages, and understand that it might take significant time and effort to prepare this code for the same scrutiny as the package itself. I hope you agree that it is necessary to assess the methodology used for benchmarking to give an accurate review.

[Peer Review] alias_draw function is too hard to understand

It's non-obvious how the following alias_draw() function works.

@jit(nopython=True, nogil=True)
def alias_draw(J, q):
"""
Draw sample from a non-uniform discrete distribution using alias sampling.
"""
K = J.size
kk = np.random.randint(K)
if np.random.rand() < q[kk]:
return kk
else:
return J[kk]

I've also copy-pasted a similar python implementation from StackOverflow, where nobody seemed to know how it worked either. However, because this piece of code is now part of an academic paper, it would do some good to make it instructive to a potential reader - code is, after all, the methods section and major result of your paper.

  1. Improve the docstring such that it explains what this function does and why
  2. Include a reference to where you found it and what you extra reading you did to understand it yourself
  3. Add type annotations for the arguments for the function
  4. Improve the names of all variables appearing in the function
  5. Write examples to illustrate when you put certain data in, you get certain data out. You'll have to seed the numpy random number generator (ref) to make sure the results are consistent.
  6. Unit tests (which just make it possible to automatically check that your examples really do what they say they should)

I'd be happy to assist in writing the unit tests and making them easy to run after you've prepared the rest if you want any help with that. I'd also help you set up Travis-CI to automatically run the tests every time you make changes to the package if you'd like

[Peer Review] Deploy to PyPI

Since the package is pip-installable and does not rely on any exotic dependencies, you should deploy it to PyPI. If you've never done this before, follow these steps in the bash in the root directory of the repository:

python3 setup.py -q sdist bdist_wheel 
python3 -m pip install twine
twine upload --skip-existing dist/*

You should do this step after you name a certain version as a "release" and upload to Zenodo (as outlined in #4) then immediately follow-up with a commit to bump the version in setup.cfg from 0.0.1 to 0.0.2-dev. Later, when you're done with version 0.0.2-dev (or whatever you want to call it), you take away the -dev suffix before releasing again. This suffix is important because it lets users know that they're not on stable versions if they're installing from github's master (main) branch

Also, please include in your manuscript that your code is available through PyPI! There are many potential users who won't be comfortable with cloning the repo or even using the tricky git+https:// way of installing through GitHub who would benefit from easy installation through PyPI.

Reg Multi Processing

Does the library automatically detect number of cores and parellize or do we have to specify ? If we have to specify how to do that ?

[Peer Review] Code Review for "PecanPy: a fast, efficient, and parallelized Python implementation of node2vec"

Dear @RemyLau and @arjunkrish,

I've been invited to review your manuscript. I'm sure this was no coincidence, as I'm always trolling bioRxiv for interesting software papers and making PRs :) Overall, I'm quite happy with the way you wrote the manuscript, especially how you structured the dichotomy between the main manuscript and supplement. I will give specific comments for the paper itself in the review you will receive directly from the journal.

You'll also see I've left quite a few issues on this repository ranging from specific code suggestions to more philosophical discussions. I've meant all in good faith - have no doubt that your paper will be accepted. However, I'm likely one of the pickiest code reviewers in the entire bioinformatics community, so we're going to make your repository the best it can possibly be before the review is done. I hope you're excited about the idea, since you're already 90% of the way there. Most methods published in the literature are utterly useless, so I already really appreciate the work you've done and how you've done it until now.

Greetings from Bonn, Germany. Have a nice weekend.

-Charlie Hoyt

Related issues:

  • #4 Archive repository on Zenodo
  • #5 Add progress monitoring (if possible)
  • #6 Fix logging time widths
  • #7 Improve understandability of alias_draw() function
  • #8 Enforce consistent code style
  • #9 Deploy to PyPI
  • #10 Give feedback on potentially bad choices of settings
  • #11 Enable reproduction of results of manuscript
  • #13 Improve documentation and build with Sphinx
  • #15 Host documentation with ReadTheDocs

[Peer Review] Add documentation for using code in Python

Though many of the main functionality is exposed through a command line interface, there is no documentation of how to import pecanpy and use it in other pipelines. I will send a PR to get you started, then you will see where else you need to add/improve documentation and create some tutorials.

Release under an open source license

Currently the repository is released under a CC BY-NC-SA 4.0 License as per LICENSE.md.

This is a very restrictive license that is incompatible with other licenses and is not designed for software.

If the goal is to create an open source software package that researchers can use and improve upon, then I'd recommend an OSI-approved license at https://opensource.org/licenses.

For Python packages, the BSD Licenses are popular. I personally use the BSD-2-Clause Plus Patent License. This license is OSI-approved and rated highly for its simplicity, compatability, and effectiveness.

Happy to answer any license questions.

As a potential user of this package, I am wary to consider it further until it has a license to ensure a sustainable future and open exisitence!

Seed random number generators

Awesome project @RemyLau, thanks for sharing your work on it!

Is there a way to seed the embedding functions/classes so that they produce the same results every time? I didn't see anything to that end in https://github.com/krishnanlab/PecanPy_benchmarks.

I haven't tried it yet, but I'd assume after a brief look at the code that seeding the numpy generator both within and outside of a numba function might do it (something like numba/numba#6002 (comment)). I'm not sure if that will work with @njit(parallel=True) though. Have you already figured out how to make that work?

[Peer Review] Fix logging time widths

This isn't a major issue but it irks me that the time stamps aren't aligned or left-zero-padded in

print("Took %d:%d:%.2f to %s"%(hrs, mins, secs, self.name))

This will fix it:

print("Took %02d:%02d:%05.2f to %s"%(hrs, mins, secs, self.name))

Results:

Took 00:00:00.07 to load graph
Took 00:00:03.37 to pre-compute transition probabilities
Took 00:00:03.00 to generate walks
Took 00:00:06.14 to train embeddings

Speciy ``p`` and ``q`` when executing ``embed`` instead of during creation

Originally done so following the node2vec source code. It makes sense to specify these parameters during the embedding function call, so that when embedding with different settings of p and q, we do not need to load the save graph object over and over again.

The only catch is that for PreComp, need to make sure each time we change the p and q settings, we need to rerun the preprocessing function.

[Peer Review] Package and make pip installable

I'll send a PR your way to make this package installable with pip and so you can upload it to PyPI as well. I'll also add some nice entrypoints so you don't have to write python -m ...

Improve runtime for first order walk

When p and q are set to 1, the second order transition computation is unnecessary, leading to extra runtime.
Furthermore, when the random walks are first order, there's no need to use OTF type approach, since the size of the transition probabilities is the same as that of the graph.
Thus SparseOTF and DenseOTF can be treated as SparsePreComp and DensePreComp.

Two options to implement the above enhancement

  1. Implement _SparsePreComp and _DensePreComp as hidden classes, which are used by the "front end" classes PreComp, SparseOTF, and DenseOTF, when p = q = 1.
  2. Check whether p = q = 1 within SparseOTF and DenseOTF; if so, use specialized get_normalized_probs (also need preprocessing to row-normalize the adjacency matrix), similar to the approach taken for implementing node2vec+.

Option 1 seems to be a cleaner solution.
Thinking about this, maybe node2vec+ needs to be implemented this way also.

Dead end walks not handled correctly

In the case of dead end during random walk (for directed graph), the random walk process is simply stopped in the current implementation (L111). However, because of the way the random walk index array are preinitialized with zeros (L91), this causes problem that the word2vec model will comprehend this as the walk is never halted and is continued with the rest of the sequence all being the word that correspond to zero index.

def node2vec_walks():
"""Simulate a random walk starting from start node."""
n = start_node_idx_ary.size
walk_idx_mat = np.zeros((n, walk_length + 1), dtype=np.uint32)
walk_idx_mat[:, 0] = start_node_idx_ary
# progress bar parameters
n_checkpoints = 10
checkpoint = n / get_num_threads() // n_checkpoints
progress_bar_length = 25
private_count = 0
for i in prange(n):
start_node_idx = walk_idx_mat[i, 0]
walk_idx_mat[i, 1] = move_forward(start_node_idx)
for j in range(2, walk_length + 1):
cur_idx = walk_idx_mat[i, j - 1]
if has_nbrs(cur_idx):
prev_idx = walk_idx_mat[i, j - 2]
walk_idx_mat[i, j] = move_forward(cur_idx, prev_idx)
else:
print("Dead end!") # TODO: need to modify walks accordingly
break

To fix this issue, need to inform the translator (L131) to stop when dead end encountered.

walks = [[self.IDlst[idx] for idx in walk] for walk in node2vec_walks()]

Use single precision for transition probabilities (and edge weights)

Gensim word2vec is single-precision, so it might be a good idea to also use single-precision for random walk generation. More specifically, the stored word vectors c.syn0 are REAL_t type, which, according to heir definition, is single-precision (np.float32_t).

Benefits of switching to single precision:

  • Reduce memory usage (from the graph) by half
  • Might also improve run time since reading less from memory?

[Peer Review] Add progress monitoring

For small graphs like BioGRID, it might be reasonable to wait for 30 seconds or so in the different steps, but as the graphs start getting large and training time gets into the minutes/hours, it would be nice to communicate progress to the user.

I would normally suggest using tqdm, but after modifying the code myself, it doesn't look like it plays nicely with numba's parallelization. I found this thread: numba/numba#4267 that describes the issue a bit more in detail, and the outlook is a bit grim.

It's not a dealbreaker, but it would make your package much more user friendly.

Gensim update breaks pecanpy due to size vs vector_size argument

Gensim has changed the size keyword argument to vector_size in genesis ~4. This broke pecanpy, as it pins gensim >=3.8.0.

The resulting traceback looks like:

Traceback (most recent call last):\n  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/bin/pecanpy", line 8, in <module>\n    sys.exit(main())\n  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/cli.py", line 182, in main\n    main_helper(parse_args())\n  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/cli.py", line 177, in main_helper\n    timed_emb()\n  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/wrappers.py", line 18, in wrapper\n    result = func(*args, **kwargs)\n  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/cli.py", line 171, in timed_emb\n    learn_embeddings(args=args, walks=walks)\n  File "/usr/local/Caskroom/miniconda/base/envs/pipeline/lib/python3.7/site-packages/pecanpy/cli.py", line 148, in learn_embeddings\n    iter=args.iter,\nTypeError: __init__() got an unexpected keyword argument \'size\'\n

I will probably open a PR once I identify the breaking version (either to ping the version or just to fix the keyword argument usage), maybe later today, but opening this in case I forget so others can see. The easy fix is just to pin gensim at 3.8 locally.

Backup source on breaking changes: https://stackoverflow.com/questions/53195906/getting-init-got-an-unexpected-keyword-argument-document-this-error-in

Random state

Implement random state specification for random walk generations to enable reproducibility.

[Peer Review] Archive repository on Zenodo

While GitHub is an excellent place to centralize your code and keep it under version control, it is not so appropriate for archiving the state of the repository. Zenodo is a proper alternative, and it will even make your life easy by connecting to GitHub. I'd recommend following this tutorial to:

  1. log in to Zenodo with your GitHub account
  2. enable its usage on the https://github.com/krishnanlab organization (an admin of the organization will have to do this, or a non-admin can ask for approval from the admin through he same process)
  3. create a "release" on GitHub, which automatically notifies Zenodo to make an archive
  4. Go to the Zenodo page and grab the newly minted DOI which corresponds to the code as a resource
    a. Put the badge on the GitHub page (it will allow you to copy+paste the markdown already prepared for you)
    b. Put the DOI in your manuscript and mention that you're using Zenodo to do it.

I personally thing Zenodo is the best tool for code because of its integration with GitHub, but other services exist for archiving your research output that are also good, such as Figshare.

[Peer Review] Enforce consistent code style

More on the theme of "code as the methods section" - I would strongly suggest applying some formatting to the code, as it currently has a mashup of inconsistent styles. This makes it difficult to read. There's also the secondary benefit of finding possible mistakes when doing the cleanup. A really good way to start enforcing consistent code style is to:

  1. Apply black. This tools is completely automatic and gets your code into an almost reader-ready state
  2. Apply flake8. I'll shamelessly link you to my own blog post, which is exactly written for people who might just first be learning about code formatters, code style checkers, and automation of their respective application.
  3. Set up continuous integration to make sure new commits don't break the style rules

Again, I'm happy to help you understand these things and work through it.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.