aditya-grover / node2vec
Home Page: http://snap.stanford.edu/node2vec/
License: MIT License
I am attempting to classify nodes in a networkx graph using node2vec and struggling. Is there a simple way to run this algorithm directly on a networkx graph?
I have a small graph of about 100MB (edge-list file size), with 65k nodes and 3.5M edges. node2vec simply does not run on this graph. I have traced the problem to the preprocess_transition_probs() function: it gradually consumes my entire system memory within 30 minutes and everything hangs. I never even reach the word2vec stage after the random walks.
I am running experiments on an i7 laptop with 16GB of RAM. DeepWalk and LINE are able to process graphs of up to 1GB; DeepWalk finishes this 100MB file in about 5 minutes (including the gensim word2vec step).
Is there a problem with using an edgelist with strings instead of integers? I've run a couple of tests using an edgelist such as:
a b 0.2
a c 0.1
b a 0.2
a d 0.3
....
and after removing nodetype=int in read_edgelist, it seems to work fine.
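For reference, a minimal sketch of reading such a string-labelled, weighted edgelist with networkx (the toy edge data and the DiGraph choice are just illustrative):

```python
import networkx as nx

# Hypothetical sketch: parse a weighted edgelist whose node IDs are
# strings. Omitting nodetype (instead of forcing nodetype=int, as the
# repo's read_edgelist call does) keeps the labels as strings.
lines = ["a b 0.2", "a c 0.1", "b a 0.2", "a d 0.3"]
G = nx.parse_edgelist(lines, data=(("weight", float),),
                      create_using=nx.DiGraph())
print(sorted(G.nodes()))      # ['a', 'b', 'c', 'd'] -- still strings
print(G["a"]["b"]["weight"])  # 0.2
```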
Message:
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Traceback (most recent call last):
File "src/main.py", line 104, in <module>
main(args)
File "src/main.py", line 100, in main
learn_embeddings(walks)
File "src/main.py", line 87, in learn_embeddings
model = Word2Vec(walks, size=args.dimensions, window=args.window_size, min_count=0, sg=1, workers=args.workers, iter=args.iter)
File "C:\Anaconda2\lib\site-packages\gensim\models\word2vec.py", line 473, in __init__
self.train(sentences)
File "C:\Anaconda2\lib\site-packages\gensim\models\word2vec.py", line 777, in train
raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model
Searching suggests this is a Word2Vec problem that occurs when min_count is too large. However, the error still occurs here even though min_count=0.
Problem found:
walks turned out to be an empty list ([]), so there was nothing to build a vocabulary from.
Has the BlogCatalog data set been preprocessed?
The BlogCatalog data set here (http://socialcomputing.asu.edu/datasets/BlogCatalog3) is different from the one used in deepwalk (https://github.com/phanein/deepwalk).
The BlogCatalog data set downloaded from the first link:
10306,39
10307,39
10308,39
10309,39
10310,39
10311,39
10311,39
10312,39
from "group-edges.txt".
The BlogCatalog data set downloaded from http://leitang.net/social_dimension.html (the one used in deepwalk):
(14,39)
(691,39)
(1250,39)
(1344,39)
(1465,39)
(1550,39)
(4709,39)
(7759,39)
For category 39 there are obvious differences. Are these two different data sets?
Which parameter controls the scale of the values in the output vectors?
I have built a network with 400 thousand nodes and more than 2 million edges; the resulting node vectors contain values larger than 100 million. Is that normal?
Hi,
Thanks for node2vec - such an interesting idea.
Could I ask you to specify some additional parameters for case study 4.1 in your paper so that I can reproduce the community result?
For the top example you set p=1, q=0.5, but I'm wondering what you specified for num_walks and walk_length in the random walk generation, as well as size, window, min_count, sg, and iter for Word2Vec.
Hope this isn't too cumbersome to reply to. Thanks again!
Hi,
Could you please release the evaluation code for link prediction? I can't reproduce the results in the paper.
I fed a graph of 60K nodes into the node2vec program. It has been running on an 8-core, 8GB cloud server for more than 2 days without producing the output .emd file. The monitoring metrics show that CPU and memory consumption are currently very low, so the program has probably stalled at some point without printing any warning or message.
last node: t
curr node: x
next node: y
In the local version, the "α=1 node" y is a node for which there is an edge y->t in the graph:
for dst_nbr in sorted(G.neighbors(dst)):
    if dst_nbr == src:
        unnormalized_probs.append(G[dst][dst_nbr]['weight']/p)
    elif G.has_edge(dst_nbr, src):
        unnormalized_probs.append(G[dst][dst_nbr]['weight'])
    else:
        unnormalized_probs.append(G[dst][dst_nbr]['weight']/q)
But in the Spark version, the "α=1 node" y is a node for which there is an edge t->y in the graph:
val neighbors_ = dstNeighbors.map { case (dstNeighborId, weight) =>
  var unnormProb = weight / q
  if (srcId == dstNeighborId) unnormProb = weight / p
  else if (srcNeighbors.exists(_._1 == dstNeighborId)) unnormProb = weight
  (dstNeighborId, unnormProb)
}
So, which one is correct?
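For comparison, the paper defines the search bias α_pq(t, x) by the shortest-path distance d(t, x) from the previous node t to the candidate x. A minimal Python sketch against a plain adjacency dict (an assumed {node: set(neighbors)} mapping, not this repo's API):

```python
# Sketch of the paper's search bias alpha_pq(t, x), keyed on the
# distance d(t, x): 1/p if x == t, 1 if x neighbors t, 1/q otherwise.
def alpha(adj, t, x, p, q):
    if x == t:            # d(t, x) == 0: stepping back to the previous node
        return 1.0 / p
    elif t in adj[x]:     # d(t, x) == 1: x and t are connected
        return 1.0
    else:                 # d(t, x) == 2
        return 1.0 / q
```

In an undirected graph the edge tests y->t and t->y coincide, so the two implementations above can only diverge on directed input.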
I can't find any copy of your node2vec slides from the KDD '16 meeting. Could you send one?
I hit this error when I tried to run the karate sample graph. I searched online, and the cause is that gensim updated their word2vec model, as shown here: https://groups.google.com/forum/#!topic/gensim/pjZo1ZunIZA
Could you help resolve this issue?
Details:
python src/main.py --input graph/karate.edgelist
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Traceback (most recent call last):
File "src/main.py", line 104, in <module>
main(args)
File "src/main.py", line 100, in main
learn_embeddings(walks)
File "src/main.py", line 88, in learn_embeddings
model.save_word2vec_format(args.output)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 1438, in save_word2vec_format
raise DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead.")
DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead.
Hi,
I am trying to load a model previously created via Spark Node2Vec. My environment is as follows:
Scala: 2.11.12
Spark: 2.4.0
Here is the code I am trying to run:
import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.sql.SparkSession
object Main {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("TestApp")
.master("local[*]")
.getOrCreate()
val sameModel = Word2VecModel.load("output/embedding.bin")
}
}
The folder output/embedding.bin contains the metadata and data folders generated by a previous execution of Node2Vec on Spark.
This results in the following exception:
Exception in thread "main" org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String
	at org.json4s.reflect.package$.fail(package.scala:95)
	at org.json4s.Extraction$.$anonfun$convert$2(Extraction.scala:704)
	at scala.Option.getOrElse(Option.scala:138)
	at org.json4s.Extraction$.convert(Extraction.scala:704)
	at org.json4s.Extraction$.$anonfun$extract$9(Extraction.scala:394)
	at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
	at org.json4s.Extraction$.extract(Extraction.scala:392)
	at org.json4s.Extraction$.extract(Extraction.scala:39)
	at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
	at org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:632)
	at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616)
	at org.apache.spark.ml.feature.Word2VecModel$Word2VecModelReader.load(Word2Vec.scala:392)
	at org.apache.spark.ml.feature.Word2VecModel$Word2VecModelReader.load(Word2Vec.scala:384)
	at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:380)
	at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:380)
	at org.apache.spark.ml.feature.Word2VecModel$.load(Word2Vec.scala:422)
	at Main.main(Main.scala)
Best,
Matteo
Hi guys,
I have a relationship graph with about 100M nodes. Can node2vec train at that scale? Could somebody share data set sizes and the corresponding memory usage?
Thanks
The code throws a NullPointerException during execution. After checking the code, the likely culprit is
node2vec.scala line 171:
if (Some(this.label2id).isDefined) {
which should be:
if (Option(this.label2id).isDefined) {
(Some(null) is always defined, so the null check never fires; Option(null) correctly yields None.)
It seems the default p=1, q=1 is not as good, and the paper doesn't specify the actual values used, only that p and q were 'set to low values'.
I have tried a network; the data set is here:
http://mlg.ucd.ie/networks/politics-uk.html
There are 419 nodes, and I used the Twitter follows and followed-by data as edges,
but I only got 418 vectors.
Does anyone know why?
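One hypothesis, sketched below: networkx's edgelist readers only create nodes that appear on at least one edge line, so an account with no follow edges in the file silently disappears before any walks are run. Counting nodes right after loading would confirm this (toy data, not the politics-uk set):

```python
import networkx as nx

# A node present in your node list but absent from every edge line
# never enters the graph, hence never gets a vector.
G = nx.parse_edgelist(["1 2", "2 3"], nodetype=int)
print(G.number_of_nodes())  # 3, even if the node list had 4 entries
```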
Hi, my graph is dynamic, so new nodes are being added to it all the time. Is there a solution for this? I found that word2vec supports online training, but how can I generate new walk sequences containing the new nodes? Can anyone help?
I found that the rawEdges join in indexingGraph produces incorrect results: the node2id used in the first join and in the second join is not the same!
I think node2id should be cached first.
Hi!
I was recently using the Spark version to analyze a large-scale network with 21M nodes and 310M edges (13GB file size). I ran the code on a cluster of 5 machines. It runs very well until saving the result to HDFS, where I get the exception "diagnostics: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory xxxxxx already exists."
Has anyone encountered this problem before?
Thank you!
Best,
Ying
It gives a segmentation fault when the number of edges is greater than or equal to 20,000. Please suggest a way to fix it.
Traceback (most recent call last):
File "src/main.py", line 104, in <module>
main(args)
File "src/main.py", line 100, in main
learn_embeddings(walks)
File "src/main.py", line 88, in learn_embeddings
model.save_word2vec_format(args.output)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1312, in save_word2vec_format
raise DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead.")
DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead.
The problem can be fixed by adding the line:
randomWalk.unpersist(blocking = false)
to:
if (randomWalkPaths != null) {
  val prevRandomWalkPaths = randomWalkPaths
  randomWalkPaths = randomWalkPaths.union(randomWalk).cache()
  randomWalkPaths.first
  prevRandomWalkPaths.unpersist(blocking = false)
  randomWalk.unpersist(blocking = false)
} else {
  randomWalkPaths = randomWalk
}
After running the following command:
python src/main.py --input graph/karate.edgelist --output emb/karate.emd
I am receiving this error:
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Traceback (most recent call last):
File "src/main.py", line 104, in <module>
main(args)
File "src/main.py", line 100, in main
learn_embeddings(walks)
File "src/main.py", line 87, in learn_embeddings
model = Word2Vec(walks, size=args.dimensions, window=args.window_size, min_count=0, sg=1, workers=args.workers, iter=args.iter)
TypeError: __init__() got an unexpected keyword argument 'iter'
Hi,
Somehow the Spark implementation crashes whenever the directed flag is set to true.
I ran it with both the karate.edgelist
example and some dummy two-edge graph.
The exception is always raised at the same location inside initTransitionProb
after the
graph has been loaded.
java.lang.NullPointerException
at Node2vec$$anonfun$initTransitionProb$2.apply(Node2vec.scala:69)
at Node2vec$$anonfun$initTransitionProb$2.apply(Node2vec.scala:68)
at org.apache.spark.graphx.impl.VertexPartitionBaseOps.map(VertexPartitionBaseOps.scala:61)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$5.apply(GraphImpl.scala:129)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$5.apply(GraphImpl.scala:129)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Cheers.
val edge2attr = graph.triplets.map { edgeTriplet =>
(s"${edgeTriplet.srcId}${edgeTriplet.dstId}", edgeTriplet.attr)
}.repartition(200).cache
(s"${prevNodeId}${currentNodeId}", (srcNodeId, pathBuffer))
}.join(edge2attr).map { case (edge, ((srcNodeId, pathBuffer), attr)) =>
In the code, the join key is generated by s"${edgeTriplet.srcId}${edgeTriplet.dstId}".
Don't we need a separator between the two elements?
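Although the snippet above is Scala, the pitfall is language-independent and easy to demonstrate in Python: without a separator, distinct (src, dst) pairs can concatenate to the same key, so the join can pair a walk step with the wrong edge attribute.

```python
# Two different edges collapse to one join key...
assert f"{1}{23}" == f"{12}{3}"       # both are "123"
# ...while any separator keeps them distinct.
assert f"{1}\t{23}" != f"{12}\t{3}"
```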
Hi,
I run node2vec on a graph created by nx.read_adjlist(), but sometimes I hit the problem below. It doesn't always show up when I adjust parameters like p, q, walk_length, num_walks.
How can I solve this?
//////////////////////////////////////////////////////
\node2vec.py", line 80, in get_alias_edge
normalized_probs = [float(u_prob)/norm_const for u_prob in unnormalized_probs]
ZeroDivisionError: float division by zero
//////////////////////////////////////////////////////
The data set is DBLP; some of its nodes have no edges.
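A minimal guard for that failure mode (a sketch, not the repo's code): the normalization divides by norm_const, which is zero when the candidate list is empty, e.g. for a node with no edges, so such nodes should be skipped or removed before preprocessing.

```python
# Hypothetical guard: return an empty distribution instead of dividing
# by a zero normalizing constant.
def normalize(unnormalized_probs):
    norm_const = sum(unnormalized_probs)
    if norm_const == 0:
        return []  # no outgoing probability mass: nothing to sample
    return [float(u) / norm_const for u in unnormalized_probs]
```

Equivalently, dropping isolated nodes up front, e.g. with G.remove_nodes_from(list(nx.isolates(G))), avoids hitting the case at all.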
Hi, I'm trying to run node2vec using the Spark implementation on a large graph (~2.8M nodes, ~41M edges, 4.1GB file), this is the command that I'm running:
./spark-submit --class com.navercorp.Main node2vec/node2vec_spark/target/node2vec-0.0.1-SNAPSHOT.jar --cmd node2vec --p 1 --q 1 --walkLength 40 --numWalks 5 --input yago_types.edgelist --output output/yago_types_p1_q1_l40_num5.emb --weighted False --directed False --indexed False
I get this error:
"2017-05-09T16:45:13.259237677Z 17/05/09 16:45:13 ERROR scheduler.TaskSchedulerImpl: Lost executor 1 on spark-worker1-97711-prod: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages"
Everything was working fine with a smaller sample, so it seems like a memory problem to me. Have you ever experienced anything similar? Any clue on what could be a proper memory allocation for such a size of a graph? At the moment, I have a master node with 2GB and six workers with 42GB.
Thank you a lot!
Enrico
When I run node2vec on an input file of 119M edges, running only the random walk, I encounter the problem below (a small file does not have this problem):
17/11/27 15:28:23 INFO YarnClientClusterScheduler: Stage 21090 was cancelled
17/11/27 15:28:23 INFO DAGScheduler: Job 422 failed: saveAsTextFile at Node2vec.scala:177, took 63.784991 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21090.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21090.0 (TID 45952, hadoop195): java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Hello,
I've been stuck on this problem for a few hours and I'm not sure how to resolve it:
(py34) coryfalconer@Opti:~/Documents/Graph/HBDSCAN_testing/Node2vec/node2vec$ python src/main.py --input graph/karate.edgelist --output emb/karate.emd
The above gives me the following errors:
/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/site-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
warnings.warn("Pattern library is not installed, lemmatization won't be available.")
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Exception in thread Thread-9:
Traceback (most recent call last):
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/threading.py", line 911, in _bootstrap_inner
self.run()
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/threading.py", line 859, in run
self._target(*self._args, **self._kwargs)
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/site-packages/gensim/models/word2vec.py", line 822, in job_producer
sentence_length = self._raw_word_count([sentence])
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/site-packages/gensim/models/word2vec.py", line 747, in _raw_word_count
return sum(len(sentence) for sentence in job)
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/site-packages/gensim/models/word2vec.py", line 747, in <genexpr>
return sum(len(sentence) for sentence in job)
TypeError: object of type 'map' has no len()
Initially I had been running the code with Python 3.6, but found that the gensim version I am using (0.13.3) does not support Python 3.6. I checked which Python versions gensim 0.13.3 supports and found 2.6, 2.7, 3.3, 3.4 and 3.5. I then created three Anaconda virtual environments with Python 3.3, 3.4 and 3.5 and tried again, but I keep hitting the same issue as above.
If you could shed any light on how to fix this, that would be greatly appreciated.
Thanks!
Cory
Hello, I am applying node2vec to a citation network I generated myself, in order to compare DeepWalk and node2vec. In my opinion node2vec should beat DeepWalk, but I have tried many values of p and q and its performance never matches DeepWalk's. When I set p and q to 1, the performance is much worse than DeepWalk, even though in that case node2vec should be equivalent to it. The results are shown below; both runs use p = q = 1 and all other parameters are identical. I don't know why. Could you please give me some advice? Thank you.
start deepwalk***************************
starting computing RPrecision
1 in top 2 is 0.500000
2 in top 4 is 0.500000
4 in top 6 is 0.666667
5 in top 8 is 0.625000
5 in top 10 is 0.500000
6 in top 20 is 0.300000
11 in top 30 is 0.366667
17 in top 40 is 0.425000
21 in top 50 is 0.420000
37 in top 100 is 0.370000
182 in top 1000 is 0.182000
end deepwalk***************************
start node2vec***************************
starting computing RPrecision
1 in top 2 is 0.500000
1 in top 4 is 0.250000
1 in top 6 is 0.166667
1 in top 8 is 0.125000
1 in top 10 is 0.100000
1 in top 20 is 0.050000
1 in top 30 is 0.033333
2 in top 40 is 0.050000
5 in top 50 is 0.100000
22 in top 100 is 0.220000
139 in top 1000 is 0.139000
end node2vec***************************
Can it support hundreds of millions of nodes and edges?
Hi,
I downloaded the whole project and compiled the node2vec_spark project with mvn, without changing the pom.xml file.
The build succeeded and the target directory was created, but I got a MethodNotFound exception when I tried to run the jar file.
Should I change the pom.xml to match my Spark and Scala versions?
Thanks.
Will it work on Python 3?
Could you provide the two files karate.edgelist and karate.emd?
Hi,
How can I use this work to embed a graph with edge weights (such as the distance between two nodes)? Could you please give some suggestions? Thanks in advance.
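For what it's worth, the reference implementation already reads a third column of the edgelist as the edge weight when run with --weighted; a sketch of the corresponding networkx load (toy data):

```python
import networkx as nx

# The third column becomes the 'weight' attribute, which the walker
# uses to bias transitions. Note: walks favor LARGER weights, so if
# your numbers are distances (smaller = closer) you likely want to
# convert them to similarities first, e.g. weight = 1.0 / distance.
G = nx.parse_edgelist(["1 2 0.5", "2 3 2.0"], nodetype=int,
                      data=(("weight", float),))
print(G[1][2]["weight"])  # 0.5
```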
The first walk step of each node is fixed after initialization in the Node2Vec.initTransitionProb module:
graph = Graph(indexedNodes, indexedEdges)
  .mapVertices[NodeAttr] { case (vertexId, clickNode) =>
    val (j, q) = GraphOps.setupAlias(clickNode.neighbors)
    val nextNodeIndex = GraphOps.drawAlias(j, q)
    clickNode.path = Array(vertexId, clickNode.neighbors(nextNodeIndex)._1)
    clickNode
  }
Therefore, the first step of the path from a given node is the same for every walker:
for (iter <- 0 until config.numWalks) {
  var prevWalk: RDD[(Long, ArrayBuffer[Long])] = null
  var randomWalk = graph.vertices.map { case (nodeId, clickNode) =>
    val pathBuffer = new ArrayBuffer[Long]()
    pathBuffer.append(clickNode.path:_*)
    (nodeId, pathBuffer)
  }.cache
This matters in particular with directed=true, where the edges are so sparse that it's hard for a walk to come back.
Hi,
Do you mind releasing the code you used for evaluation?
I am trying multilabel classification on the BlogCatalog data. However, I can only get a macro-F1 of about 0.15 with 50% training data. My parameter settings are as follows:
length of walk: 80,
walks per node: 10
context window size: 10
dimension: 128
epoch: 1
p,q : 1
1-vs-Rest logistic regression with 0.01 L2 regularization.
In this code d defaults to 128; is that the best value? In the paper, the authors discuss the sensitivity of the parameter d and point out that performance tends to stabilize once d exceeds 100, but that holds for a specific data set. How should one determine a suitable dimension for a different data set?
The code below should not use map directly: in Python 3, map returns an iterator, not a list. The correct version would be walks = [list(map(str, walk)) for walk in walks].
def learn_embeddings(walks):
'''
Learn embeddings by optimizing the Skipgram objective using SGD.
'''
walks = [map(str, walk) for walk in walks]
model = Word2Vec(walks, size=args.dimensions, window=args.window_size, min_count=0, sg=1, workers=args.workers,
iter=args.iter)
model.save_word2vec_format(args.output)
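A version-neutral sketch of the fix described above, isolated into a helper so the rest of learn_embeddings stays untouched:

```python
def to_sentences(walks):
    # Python 3's map() is a lazy iterator with no len(), which is what
    # trips gensim; materializing each walk as a list of strings fixes it.
    return [list(map(str, walk)) for walk in walks]
```

Calling walks = to_sentences(walks) just before the Word2Vec constructor is enough.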
I am working on improving the speed/parallelization of node2vec for a research project, to see if it can be used for question answering via the Wikipedia network structure (https://github.com/Pinafore/qb), and I noticed that the source code doesn't have a license.
Would it be possible for you to add a permissive license like MIT/Apache so there are no worries about that later?
Can the input data contain repeated edges?