aditya-grover / node2vec
Home Page: http://snap.stanford.edu/node2vec/
License: MIT License
I am attempting to classify nodes in a networkx graph using node2vec and struggling. Is there a simple way to run this algorithm directly on a networkx graph?
I have a small graph of about 100MB (edge-list file size), with 65k nodes and 3.5M edges. node2vec simply does not run on this graph. I have traced the problem to the preprocess_transition_probs() function: it gradually consumes my entire system memory within 30 minutes and everything hangs. I never even reach the word2vec stage after the random walks.
I am running experiments on an i7 laptop with 16GB of RAM. DeepWalk and LINE are able to process graphs of up to 1GB; DeepWalk finishes this 100MB file in about 5 minutes (including the gensim word2vec step).
Is there a problem with using an edgelist with strings instead of integers? I've run a couple of tests using an edgelist such as:
a b 0.2
a c 0.1
b a 0.2
a d 0.3
....
and after removing nodetype=int in read_edgelist, it seems to work fine.
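For reference, a minimal sketch of reading such a string-labelled, weighted edgelist with networkx (the toy edge data and the DiGraph choice are just illustrative):

```python
import networkx as nx

# Hypothetical sketch: parse a weighted edgelist whose node IDs are
# strings. Omitting nodetype (instead of forcing nodetype=int, as the
# repo's read_edgelist call does) keeps the labels as strings.
lines = ["a b 0.2", "a c 0.1", "b a 0.2", "a d 0.3"]
G = nx.parse_edgelist(lines, data=(("weight", float),),
                      create_using=nx.DiGraph())
print(sorted(G.nodes()))      # ['a', 'b', 'c', 'd'] -- still strings
print(G["a"]["b"]["weight"])  # 0.2
```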
Message:
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Traceback (most recent call last):
File "src/main.py", line 104, in <module>
main(args)
File "src/main.py", line 100, in main
learn_embeddings(walks)
File "src/main.py", line 87, in learn_embeddings
model = Word2Vec(walks, size=args.dimensions, window=args.window_size, min_count=0, sg=1, workers=args.workers, iter=args.iter)
File "C:\Anaconda2\lib\site-packages\gensim\models\word2vec.py", line 473, in __init__
self.train(sentences)
File "C:\Anaconda2\lib\site-packages\gensim\models\word2vec.py", line 777, in train
raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model
Searching suggests this is a Word2Vec problem that occurs when min_count is too large. However, the error still occurs here even though min_count=0.
Problem found:
walks turned out to be an empty list ([]), so there was nothing to build a vocabulary from.
Has the BlogCatalog data set been preprocessed?
The BlogCatalog data set here (http://socialcomputing.asu.edu/datasets/BlogCatalog3) is different from the one used in deepwalk (https://github.com/phanein/deepwalk).
The BlogCatalog data set downloaded from the first link:
10306,39
10307,39
10308,39
10309,39
10310,39
10311,39
10311,39
10312,39
from "group-edges.txt".
The BlogCatalog data set downloaded from http://leitang.net/social_dimension.html (the one used in deepwalk):
(14,39)
(691,39)
(1250,39)
(1344,39)
(1465,39)
(1550,39)
(4709,39)
(7759,39)
For category 39 there are obvious differences. Are these two different data sets?
Which parameter controls the scale of the values in the output vectors?
I have built a network with 400 thousand nodes and more than 2 million edges; the resulting node vectors contain values larger than 100 million. Is that normal?
Hi,
Thanks for node2vec - such an interesting idea.
Could I ask you to specify some additional parameters for case study 4.1 in your paper so that I can reproduce the community result?
For the top example you set p=1, q=0.5, but I'm wondering what you specified for num_walks and walk_length in the random walk generation, as well as size, window, min_count, sg, and iter for Word2Vec.
Hope this isn't too cumbersome to reply to. Thanks again!
Hi,
Could you please release the evaluation code for link prediction? I can't reproduce the results in the paper.
I fed a graph of 60K nodes into the node2vec program. It has been running on an 8-core, 8GB cloud server for more than 2 days without producing the output .emd file. The monitoring metrics show that CPU and memory consumption are currently very low, so the program has probably stalled at some point without printing any warning or message.
last node: t
curr node: x
next node: y
In the local version, the "α=1 node" y is a node for which there is an edge y->t in the graph:
for dst_nbr in sorted(G.neighbors(dst)):
    if dst_nbr == src:
        unnormalized_probs.append(G[dst][dst_nbr]['weight']/p)
    elif G.has_edge(dst_nbr, src):
        unnormalized_probs.append(G[dst][dst_nbr]['weight'])
    else:
        unnormalized_probs.append(G[dst][dst_nbr]['weight']/q)
But in the Spark version, the "α=1 node" y is a node for which there is an edge t->y in the graph:
val neighbors_ = dstNeighbors.map { case (dstNeighborId, weight) =>
  var unnormProb = weight / q
  if (srcId == dstNeighborId) unnormProb = weight / p
  else if (srcNeighbors.exists(_._1 == dstNeighborId)) unnormProb = weight
  (dstNeighborId, unnormProb)
}
So, which one is correct?
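For comparison, the paper defines the search bias α_pq(t, x) by the shortest-path distance d(t, x) from the previous node t to the candidate x. A minimal Python sketch against a plain adjacency dict (an assumed {node: set(neighbors)} mapping, not this repo's API):

```python
# Sketch of the paper's search bias alpha_pq(t, x), keyed on the
# distance d(t, x): 1/p if x == t, 1 if x neighbors t, 1/q otherwise.
def alpha(adj, t, x, p, q):
    if x == t:            # d(t, x) == 0: stepping back to the previous node
        return 1.0 / p
    elif t in adj[x]:     # d(t, x) == 1: x and t are connected
        return 1.0
    else:                 # d(t, x) == 2
        return 1.0 / q
```

In an undirected graph the edge tests y->t and t->y coincide, so the two implementations above can only diverge on directed input.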
I can't find any copy of your node2vec slides from the KDD '16 meeting. Could you send one?
I hit this error when I tried to run the karate sample graph. I searched online, and the cause is that gensim updated their word2vec model, as shown here: https://groups.google.com/forum/#!topic/gensim/pjZo1ZunIZA
Could you help resolve this issue?
Details:
python src/main.py --input graph/karate.edgelist
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Traceback (most recent call last):
File "src/main.py", line 104, in <module>
main(args)
File "src/main.py", line 100, in main
learn_embeddings(walks)
File "src/main.py", line 88, in learn_embeddings
model.save_word2vec_format(args.output)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 1438, in save_word2vec_format
raise DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead.")
DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead.
Hi,
I am trying to load a model previously created via Spark Node2Vec. My environment is as follows:
Scala: 2.11.12
Spark: 2.4.0
Here is the code I am trying to run:
import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.sql.SparkSession
object Main {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("TestApp")
.master("local[*]")
.getOrCreate()
val sameModel = Word2VecModel.load("output/embedding.bin")
}
}
The folder output/embedding.bin contains the metadata and data folders generated by a previous execution of Node2Vec on Spark.
This results in the following exception:
Exception in thread "main" org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String
	at org.json4s.reflect.package$.fail(package.scala:95)
	at org.json4s.Extraction$.$anonfun$convert$2(Extraction.scala:704)
	at scala.Option.getOrElse(Option.scala:138)
	at org.json4s.Extraction$.convert(Extraction.scala:704)
	at org.json4s.Extraction$.$anonfun$extract$9(Extraction.scala:394)
	at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
	at org.json4s.Extraction$.extract(Extraction.scala:392)
	at org.json4s.Extraction$.extract(Extraction.scala:39)
	at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
	at org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:632)
	at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616)
	at org.apache.spark.ml.feature.Word2VecModel$Word2VecModelReader.load(Word2Vec.scala:392)
	at org.apache.spark.ml.feature.Word2VecModel$Word2VecModelReader.load(Word2Vec.scala:384)
	at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:380)
	at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:380)
	at org.apache.spark.ml.feature.Word2VecModel$.load(Word2Vec.scala:422)
	at Main.main(Main.scala)
Best,
Matteo
Hi guys,
I have a relationship graph with about 100M nodes. Can node2vec train at that scale? Could somebody share data set sizes and the corresponding memory usage?
Thanks
The code throws a NullPointerException during execution. After checking the code, the likely culprit is
node2vec.scala line 171:
if (Some(this.label2id).isDefined) {
which should be:
if (Option(this.label2id).isDefined) {
(Some(null) is always defined, so the null check never fires; Option(null) correctly yields None.)
It seems the default p=1, q=1 is not as good, and the paper doesn't specify the actual values used, only that p and q were 'set to low values'.
I have tried a network; the data set is here:
http://mlg.ucd.ie/networks/politics-uk.html
There are 419 nodes, and I used the Twitter follows and followed-by data as edges,
but I only got 418 vectors.
Does anyone know why?
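One hypothesis, sketched below: networkx's edgelist readers only create nodes that appear on at least one edge line, so an account with no follow edges in the file silently disappears before any walks are run. Counting nodes right after loading would confirm this (toy data, not the politics-uk set):

```python
import networkx as nx

# A node present in your node list but absent from every edge line
# never enters the graph, hence never gets a vector.
G = nx.parse_edgelist(["1 2", "2 3"], nodetype=int)
print(G.number_of_nodes())  # 3, even if the node list had 4 entries
```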
Hi, my graph is dynamic, so new nodes are being added to it all the time. Is there a solution for this? I found that word2vec supports online training, but how can I generate new walk sequences containing the new nodes? Can anyone help?
I found that the rawEdges join in indexingGraph produces incorrect results: the node2id used in the first join and in the second join is not the same!
I think node2id should be cached first.
Hi!
I was recently using the Spark version to analyze a large-scale network with 21M nodes and 310M edges (13GB file size). I ran the code on a cluster of 5 machines. It runs very well until saving the result to HDFS, where I get the exception "diagnostics: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory xxxxxx already exists."
Has anyone encountered this problem before?
Thank you!
Best,
Ying
It gives a segmentation fault when the number of edges is greater than or equal to 20,000. Please suggest a way to fix it.
Traceback (most recent call last):
File "src/main.py", line 104, in <module>
main(args)
File "src/main.py", line 100, in main
learn_embeddings(walks)
File "src/main.py", line 88, in learn_embeddings
model.save_word2vec_format(args.output)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1312, in save_word2vec_format
raise DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead.")
DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead.
The problem can be fixed by adding the line:
randomWalk.unpersist(blocking = false)
to:
if (randomWalkPaths != null) {
  val prevRandomWalkPaths = randomWalkPaths
  randomWalkPaths = randomWalkPaths.union(randomWalk).cache()
  randomWalkPaths.first
  prevRandomWalkPaths.unpersist(blocking = false)
  randomWalk.unpersist(blocking = false)
} else {
  randomWalkPaths = randomWalk
}
After running the following command:
python src/main.py --input graph/karate.edgelist --output emb/karate.emd
I am receiving this error:
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Traceback (most recent call last):
File "src/main.py", line 104, in <module>
main(args)
File "src/main.py", line 100, in main
learn_embeddings(walks)
File "src/main.py", line 87, in learn_embeddings
model = Word2Vec(walks, size=args.dimensions, window=args.window_size, min_count=0, sg=1, workers=args.workers, iter=args.iter)
TypeError: __init__() got an unexpected keyword argument 'iter'
Hi,
Somehow the Spark implementation crashes whenever the directed flag is set to true.
I ran it with both the karate.edgelist
example and some dummy two-edge graph.
The exception is always raised at the same location inside initTransitionProb
after the
graph has been loaded.
java.lang.NullPointerException
at Node2vec$$anonfun$initTransitionProb$2.apply(Node2vec.scala:69)
at Node2vec$$anonfun$initTransitionProb$2.apply(Node2vec.scala:68)
at org.apache.spark.graphx.impl.VertexPartitionBaseOps.map(VertexPartitionBaseOps.scala:61)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$5.apply(GraphImpl.scala:129)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$5.apply(GraphImpl.scala:129)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Cheers.
val edge2attr = graph.triplets.map { edgeTriplet =>
(s"${edgeTriplet.srcId}${edgeTriplet.dstId}", edgeTriplet.attr)
}.repartition(200).cache
(s"${prevNodeId}${currentNodeId}", (srcNodeId, pathBuffer))
}.join(edge2attr).map { case (edge, ((srcNodeId, pathBuffer), attr)) =>
In the code, the join key is generated by s"${edgeTriplet.srcId}${edgeTriplet.dstId}".
Don't we need a separator between the two elements?
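Although the snippet above is Scala, the pitfall is language-independent and easy to demonstrate in Python: without a separator, distinct (src, dst) pairs can concatenate to the same key, so the join can pair a walk step with the wrong edge attribute.

```python
# Two different edges collapse to one join key...
assert f"{1}{23}" == f"{12}{3}"       # both are "123"
# ...while any separator keeps them distinct.
assert f"{1}\t{23}" != f"{12}\t{3}"
```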
Hi,
I run node2vec on a graph created by nx.read_adjlist(), but sometimes I hit the problem below. It doesn't always show up when I adjust parameters like p, q, walk_length, num_walks.
How can I solve this?
//////////////////////////////////////////////////////
\node2vec.py", line 80, in get_alias_edge
normalized_probs = [float(u_prob)/norm_const for u_prob in unnormalized_probs]
ZeroDivisionError: float division by zero
//////////////////////////////////////////////////////
The data set is DBLP; some of its nodes have no edges.
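A minimal guard for that failure mode (a sketch, not the repo's code): the normalization divides by norm_const, which is zero when the candidate list is empty, e.g. for a node with no edges, so such nodes should be skipped or removed before preprocessing.

```python
# Hypothetical guard: return an empty distribution instead of dividing
# by a zero normalizing constant.
def normalize(unnormalized_probs):
    norm_const = sum(unnormalized_probs)
    if norm_const == 0:
        return []  # no outgoing probability mass: nothing to sample
    return [float(u) / norm_const for u in unnormalized_probs]
```

Equivalently, dropping isolated nodes up front, e.g. with G.remove_nodes_from(list(nx.isolates(G))), avoids hitting the case at all.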
Hi, I'm trying to run node2vec using the Spark implementation on a large graph (~2.8M nodes, ~41M edges, 4.1GB file), this is the command that I'm running:
./spark-submit --class com.navercorp.Main node2vec/node2vec_spark/target/node2vec-0.0.1-SNAPSHOT.jar --cmd node2vec --p 1 --q 1 --walkLength 40 --numWalks 5 --input yago_types.edgelist --output output/yago_types_p1_q1_l40_num5.emb --weighted False --directed False --indexed False
I get this error:
"2017-05-09T16:45:13.259237677Z 17/05/09 16:45:13 ERROR scheduler.TaskSchedulerImpl: Lost executor 1 on spark-worker1-97711-prod: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages"
Everything was working fine with a smaller sample, so it seems like a memory problem to me. Have you ever experienced anything similar? Any clue on what could be a proper memory allocation for such a size of a graph? At the moment, I have a master node with 2GB and six workers with 42GB.
Thank you a lot!
Enrico
When I run node2vec on an input file of 119M edges, running only the random walk, I encounter the problem below (a small file does not have this problem):
17/11/27 15:28:23 INFO YarnClientClusterScheduler: Stage 21090 was cancelled
17/11/27 15:28:23 INFO DAGScheduler: Job 422 failed: saveAsTextFile at Node2vec.scala:177, took 63.784991 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21090.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21090.0 (TID 45952, hadoop195): java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Hello,
I've been stuck on this problem for a few hours and I'm not sure how to resolve it:
(py34) coryfalconer@Opti:~/Documents/Graph/HBDSCAN_testing/Node2vec/node2vec$ python src/main.py --input graph/karate.edgelist --output emb/karate.emd
The above gives me the following errors:
/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/site-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
warnings.warn("Pattern library is not installed, lemmatization won't be available.")
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Exception in thread Thread-9:
Traceback (most recent call last):
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/threading.py", line 911, in _bootstrap_inner
self.run()
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/threading.py", line 859, in run
self._target(*self._args, **self._kwargs)
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/site-packages/gensim/models/word2vec.py", line 822, in job_producer
sentence_length = self._raw_word_count([sentence])
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/site-packages/gensim/models/word2vec.py", line 747, in _raw_word_count
return sum(len(sentence) for sentence in job)
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/site-packages/gensim/models/word2vec.py", line 747, in <genexpr>
return sum(len(sentence) for sentence in job)
TypeError: object of type 'map' has no len()
Initially I had been running the code with Python 3.6, but found that the gensim version I am using (0.13.3) does not support Python 3.6. I checked which Python versions gensim 0.13.3 supports and found 2.6, 2.7, 3.3, 3.4 and 3.5. I then created three Anaconda virtual environments with Python 3.3, 3.4 and 3.5 and tried again, but I keep hitting the same issue as above.
If you could shed any light on how to fix this, that would be greatly appreciated.
Thanks!
Cory
Hello, I am applying node2vec to a citation network I generated myself, in order to compare DeepWalk and node2vec. In my opinion node2vec should beat DeepWalk, but I have tried many values of p and q and its performance never matches DeepWalk's. When I set p and q to 1, the performance is much worse than DeepWalk, even though in that case node2vec should be equivalent to it. The results are shown below; both runs use p = q = 1 and all other parameters are identical. I don't know why. Could you please give me some advice? Thank you.
start deepwalk***************************
starting computing RPrecision
1 in top 2 is 0.500000
2 in top 4 is 0.500000
4 in top 6 is 0.666667
5 in top 8 is 0.625000
5 in top 10 is 0.500000
6 in top 20 is 0.300000
11 in top 30 is 0.366667
17 in top 40 is 0.425000
21 in top 50 is 0.420000
37 in top 100 is 0.370000
182 in top 1000 is 0.182000
end deepwalk***************************
start node2vec***************************
starting computing RPrecision
1 in top 2 is 0.500000
1 in top 4 is 0.250000
1 in top 6 is 0.166667
1 in top 8 is 0.125000
1 in top 10 is 0.100000
1 in top 20 is 0.050000
1 in top 30 is 0.033333
2 in top 40 is 0.050000
5 in top 50 is 0.100000
22 in top 100 is 0.220000
139 in top 1000 is 0.139000
end node2vec***************************
Can it support hundreds of millions of nodes and edges?
Hi,
I downloaded the whole project and compiled the node2vec_spark project with mvn, without changing the pom.xml file.
The build succeeded and the target directory was created, but I got a MethodNotFound exception when I tried to run the jar file.
Should I change the pom.xml to match my Spark and Scala versions?
Thanks.
Will it work on Python 3?
Could you provide the two files karate.edgelist and karate.emd?
Hi,
How can I use this work to embed a graph with edge weights (such as the distance between two nodes)? Could you please give some suggestions? Thanks in advance.
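For what it's worth, the reference implementation already reads a third column of the edgelist as the edge weight when run with --weighted; a sketch of the corresponding networkx load (toy data):

```python
import networkx as nx

# The third column becomes the 'weight' attribute, which the walker
# uses to bias transitions. Note: walks favor LARGER weights, so if
# your numbers are distances (smaller = closer) you likely want to
# convert them to similarities first, e.g. weight = 1.0 / distance.
G = nx.parse_edgelist(["1 2 0.5", "2 3 2.0"], nodetype=int,
                      data=(("weight", float),))
print(G[1][2]["weight"])  # 0.5
```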
The first walk step of each node is fixed after initialization in the Node2Vec.initTransitionProb module:
graph = Graph(indexedNodes, indexedEdges)
  .mapVertices[NodeAttr] { case (vertexId, clickNode) =>
    val (j, q) = GraphOps.setupAlias(clickNode.neighbors)
    val nextNodeIndex = GraphOps.drawAlias(j, q)
    clickNode.path = Array(vertexId, clickNode.neighbors(nextNodeIndex)._1)
    clickNode
  }
Therefore, the first step of the path from a given node is the same for every walker:
for (iter <- 0 until config.numWalks) {
  var prevWalk: RDD[(Long, ArrayBuffer[Long])] = null
  var randomWalk = graph.vertices.map { case (nodeId, clickNode) =>
    val pathBuffer = new ArrayBuffer[Long]()
    pathBuffer.append(clickNode.path:_*)
    (nodeId, pathBuffer)
  }.cache
This matters in particular with directed=true, where the edges are so sparse that it's hard for a walk to come back.
Hi,
Do you mind releasing the code you used for evaluation?
I am trying multilabel classification on the BlogCatalog data. However, I can only get a macro-F1 of about 0.15 with 50% training data. My parameter settings are as follows:
length of walk: 80,
walks per node: 10
context window size: 10
dimension: 128
epoch: 1
p,q : 1
1-vs-Rest logistic regression with 0.01 L2 regularization.
In this code d defaults to 128; is that the best value? In the paper, the authors discuss the sensitivity of the parameter d and point out that performance tends to stabilize once d exceeds 100, but that holds for a specific data set. How should one determine a suitable dimension for a different data set?
The code below should not use map directly: in Python 3, map returns an iterator, not a list. The correct version would be walks = [list(map(str, walk)) for walk in walks].
def learn_embeddings(walks):
'''
Learn embeddings by optimizing the Skipgram objective using SGD.
'''
walks = [map(str, walk) for walk in walks]
model = Word2Vec(walks, size=args.dimensions, window=args.window_size, min_count=0, sg=1, workers=args.workers,
iter=args.iter)
model.save_word2vec_format(args.output)
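A version-neutral sketch of the fix described above, isolated into a helper so the rest of learn_embeddings stays untouched:

```python
def to_sentences(walks):
    # Python 3's map() is a lazy iterator with no len(), which is what
    # trips gensim; materializing each walk as a list of strings fixes it.
    return [list(map(str, walk)) for walk in walks]
```

Calling walks = to_sentences(walks) just before the Word2Vec constructor is enough.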
I am working on improving the speed/parallelization of node2vec for a research project, to see if it can be used for question answering via the Wikipedia network structure (https://github.com/Pinafore/qb), and I noticed that the source code doesn't have a license.
Would it be possible for you to add a permissive license like MIT/Apache so there are no worries about that later?
Can the input data contain repeated edges?