
Comments (9)

KarthikRevanuru avatar KarthikRevanuru commented on September 4, 2024

@RemyLau


KarthikRevanuru avatar KarthikRevanuru commented on September 4, 2024

I have a graph with 4.5M nodes and 80M edges. What is your rough estimate of the running time on a large graph like this?


RemyLau avatar RemyLau commented on September 4, 2024

Hi @KarthikRevanuru , from my rough estimation, it should take no more than 20GB of memory to fully load and convert your graph into the CSR format, which is used as the final graph data structure. I have run some tests with a couple of large biological networks (see bench repo). For example, SSN has roughly 72M edges with 800k nodes, and it uses ~10GB of memory throughout the execution of the program (see line 65 or line 71 in this benchmarking result table).
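As a back-of-envelope check (a sketch, not PecanPy's actual accounting; the 20GB figure above also covers edge-list parsing and conversion overhead, not just the final arrays), the raw CSR arrays themselves are fairly small. This assumes 32-bit neighbor indices, 32-bit edge weights, 64-bit row offsets, and undirected edges stored in both directions:

```python
def csr_memory_gb(num_nodes, num_edges, directed=False,
                  index_bytes=4, weight_bytes=4):
    """Rough CSR array footprint in GB (illustrative, not PecanPy's numbers)."""
    stored = num_edges if directed else 2 * num_edges  # undirected: both directions
    indptr = (num_nodes + 1) * 8       # one 64-bit offset per node
    indices = stored * index_bytes     # neighbor index per stored edge
    data = stored * weight_bytes       # edge weight per stored edge
    return (indptr + indices + data) / 1024**3

# 4.5M nodes, 80M edges, as in this thread
print(f"{csr_memory_gb(4_500_000, 80_000_000):.2f} GB")  # prints 1.23 GB
```

So the final graph structure is only a small part of the peak; the transient load/convert step dominates.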

May I ask what mode of execution you are using (i.e. did you explicitly set --mode to PreComp or DenseOTF)? In this case, since the network has a large number of nodes with very sparse connections, it is best to use SparseOTF (which is the default mode).
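The mode choice above boils down to graph density. Here is a toy heuristic (illustrative only, not PecanPy's actual dispatch logic; the threshold is made up for the example):

```python
def suggest_mode(num_nodes, num_edges, dense_threshold=0.1):
    """Toy density heuristic (not PecanPy's real logic) for picking a mode."""
    density = num_edges / (num_nodes * (num_nodes - 1))
    if density >= dense_threshold:
        return "DenseOTF"   # dense adjacency matrix pays off
    return "SparseOTF"      # sparse CSR, transition probs computed on the fly

# 4.5M nodes, 80M edges: density is on the order of 1e-9
print(suggest_mode(4_500_000, 80_000_000))  # prints SparseOTF
```

PreComp trades memory for speed by precomputing transition probabilities, which is only feasible on smaller graphs.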


KarthikRevanuru avatar KarthikRevanuru commented on September 4, 2024

Thanks for the quick reply @RemyLau
I'm using SparseOTF with 32 GB of RAM, but it throws an error. On disk my input data is 3.4 GB, stored as an edge list.
I've moved to a bigger instance and it seems to work, but it has printed only the following in the past 6 hours:

Took 00:06:06.35 to load graph
Took 00:00:00.00 to pre-compute transition probabilities

Do you have any estimate of the running time? Also, I've enabled verbose mode and it hasn't printed anything about the walks.


RemyLau avatar RemyLau commented on September 4, 2024

Hmm.. What error are you seeing @KarthikRevanuru ? It would help if you could share the error log here.

From the log message you shared, it only took 6 minutes to load the graph, which is quite reasonable for a network of this size. And since you're using SparseOTF, there's no preprocessing step, so 0 sec of preprocessing is expected. But I'm not sure why you see nothing further after 6 hours. If you look at the command-line interface, there are no other steps before loading the graph.

PecanPy/src/pecanpy/cli.py

Lines 202 to 233 in 6a0a733

def main():
    """Pipeline for representational learning for all nodes in a graph."""
    args = parse_args()
    if args.directed and args.extend:
        raise NotImplementedError("Node2vec+ not implemented for directed graph yet.")

    @Timer("load graph", True)
    def timed_read_graph():
        return read_graph(args)

    @Timer("pre-compute transition probabilities", True)
    def timed_preprocess():
        g.preprocess_transition_probs()

    @Timer("generate walks", True)
    def timed_walk():
        return g.simulate_walks(args.num_walks, args.walk_length)

    @Timer("train embeddings", True)
    def timed_emb():
        learn_embeddings(args=args, walks=walks)

    if args.workers == 0:
        args.workers = numba.config.NUMBA_DEFAULT_NUM_THREADS
    numba.set_num_threads(args.workers)

    g = timed_read_graph()
    timed_preprocess()
    walks = timed_walk()
    g = None
    timed_emb()


KarthikRevanuru avatar KarthikRevanuru commented on September 4, 2024

The error message was just "Killed"; after I moved to 64 GB of RAM it's fixed.
@RemyLau the log above was not printed after 6 hrs. It was printed within 6 min of running, but nothing has appeared in the 6 hrs since.


KarthikRevanuru avatar KarthikRevanuru commented on September 4, 2024

What's the expected time to generate walks and train embeddings?


RemyLau avatar RemyLau commented on September 4, 2024

@KarthikRevanuru It honestly depends on a lot of factors, e.g. number of processors, CPU clock, memory clock, etc. But in your case, I'd guess roughly 6 hours for the random walk generation to finish. So given these clues, I think the issue might be caused by the large number of random walks generated. Previously in my case, although the SSN network has roughly the same number of edges as yours, it has an order of magnitude fewer nodes (800k compared to 4.5M). The number of nodes does not affect the size of the sparse graph structure much, but it does affect the size of the corpus generated (i.e. the random walks).
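To see why the node count matters so much here, a rough corpus-size estimate (a sketch assuming node2vec's common defaults of 10 walks per node and walk length 80, CPython's 8-byte pointers per list slot, and ~56 bytes of per-list overhead; the node-id strings themselves are shared objects, so pointers dominate):

```python
def walk_corpus_gb(num_nodes, num_walks=10, walk_length=80,
                   ptr_bytes=8, list_overhead=56):
    """Rough size of the walks-as-lists-of-strings corpus (assumed defaults)."""
    tokens = num_nodes * num_walks * (walk_length + 1)  # start node + steps
    walks = num_nodes * num_walks
    # node-id strings are shared, so each token costs roughly one pointer;
    # each walk list adds its own object overhead on top
    return (tokens * ptr_bytes + walks * list_overhead) / 1024**3

print(f"{walk_corpus_gb(4_500_000):.1f} GB")  # prints 29.5 GB
print(f"{walk_corpus_gb(800_000):.1f} GB")    # SSN-sized node count for comparison
```

Even under these optimistic assumptions, 4.5M nodes pushes the string-list corpus alone toward the 32 GB limit, which is consistent with the "Killed" error going away at 64 GB.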

In particular, your "Killed" error message may very likely be caused by exceeding the memory limit at the following line of code in the simulate_walks function, which converts the node index sequences into lists of strings (of node IDs).

walks = [[self.IDlst[idx] for idx in walk[:walk[-1]]] for walk in node2vec_walks()]

This was done originally for convenience when calling gensim.Word2Vec. I'll try to find an alternative solution where we don't need to convert to lists of strings first but instead use the index sequences directly, to reduce memory usage in cases like this (networks with a large number of nodes). So stay posted.
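One possible direction (a sketch of the idea, not PecanPy's actual fix) is a restartable iterable that maps each index walk to node ids lazily, so the full string corpus never exists in memory at once; gensim's Word2Vec accepts any restartable iterable of token lists. `id_list` stands in for `self.IDlst`, and the last entry of each walk is assumed to store its effective length, matching the slicing in the line quoted above:

```python
class LazyWalks:
    """Restartable iterable mapping index walks to node-id tokens on the fly."""

    def __init__(self, walks, id_list):
        self.walks = walks        # integer walks; walk[-1] = effective length
        self.id_list = id_list    # index -> node id mapping

    def __iter__(self):
        # one walk's worth of strings materialized at a time
        for walk in self.walks:
            yield [self.id_list[idx] for idx in walk[:walk[-1]]]

# toy example: two padded walks over ids "a".."d"
ids = ["a", "b", "c", "d"]
walks = [[0, 1, 2, 0, 3], [3, 2, 2, 0, 2]]  # last entry = valid length
print(list(LazyWalks(walks, ids)))  # prints [['a', 'b', 'c'], ['d', 'c']]
```

Because __iter__ (not a bare generator) is used, the object can be iterated repeatedly, which gensim requires for its multiple training epochs.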

In the meantime, if possible, try to further increase your memory allocation to, say, 128GB and see if that resolves the issue.


KarthikRevanuru avatar KarthikRevanuru commented on September 4, 2024

Ok thanks !

