
node2vec's People

Contributors

aditya-grover


node2vec's Issues

Feeding a networkx graph in

I am attempting to classify nodes in a networkx graph using node2vec and struggling. Is there a simple way to use this algorithm on a networkx graph?
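A minimal sketch of driving the reference implementation from an existing networkx graph (an assumption on my part: it uses the Graph class and methods in this repo's src/node2vec.py, and the walker expects a 'weight' attribute on every edge):

import networkx as nx
import node2vec  # src/node2vec.py from this repository (run from src/)

nx_G = nx.karate_club_graph()
for u, v in nx_G.edges():
    nx_G[u][v]['weight'] = 1  # the transition code reads G[dst][nbr]['weight']

G = node2vec.Graph(nx_G, False, 1, 1)  # (nx_G, is_directed, p, q)
G.preprocess_transition_probs()
walks = G.simulate_walks(num_walks=10, walk_length=80)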

Consuming too much memory

I have a small graph of about 100 MB (edge-list file size): 65k nodes, 3.5M edges. node2vec simply does not run on this graph. I have tracked the problem down to the preprocess_transition_probs() function: it gradually eats my whole system memory within 30 minutes and everything hangs. I never even reach the word2vec part after the random walks.

I am running the experiments on an i7 laptop with 16 GB of RAM. DeepWalk and LINE are able to process graphs of up to 1 GB; DeepWalk runs on this 100 MB file in about 5 minutes (including the gensim word2vec step).
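A plausible explanation, judging from the preprocessing code quoted elsewhere on this page: preprocess_transition_probs() builds an alias table per directed edge (t, x) with one entry per neighbor of x, so memory grows like the sum of squared degrees rather than the edge count, which explodes on skewed graphs. A back-of-envelope sketch (the path is hypothetical; assumes networkx 2.x degree iteration):

import networkx as nx

G = nx.read_edgelist('graph.edgelist', nodetype=int)
# each undirected edge {u, v} yields alias tables of size deg(v) and deg(u),
# so the total number of stored entries is roughly sum(deg(v)^2)
entries = sum(d * d for _, d in G.degree())
print('approx. alias-table entries: %.2e' % entries)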

nodetype = int

Is there a problem with using an edgelist with strings instead of integers? I've run a couple of tests, removing nodetype=int from read_edgelist and using an edgelist such as:
a b 0.2
a c 0.1
b a 0.2
a d 0.3
....
and it seems to work fine.
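For reference, a sketch of the corresponding read call with string node IDs (a plain networkx call, not this repo's read_graph verbatim); since gensim tokenizes walks as strings anyway, string IDs should be safe once the walks are stringified:

import networkx as nx

# each line: <node1> <node2> <weight>, node IDs left as strings
G = nx.read_edgelist('graph.edgelist',
                     data=(('weight', float),),
                     create_using=nx.DiGraph())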

RuntimeError: you must first build vocabulary before training the model

Message:

Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Traceback (most recent call last):
File "src/main.py", line 104, in
main(args)
File "src/main.py", line 100, in main
learn_embeddings(walks)
File "src/main.py", line 87, in learn_embeddings
model = Word2Vec(walks, size=args.dimensions, window=args.window_size, min_count=0, sg=1, workers=args.workers, iter=args.iter)
File "C:\Anaconda2\lib\site-packages\gensim\models\word2vec.py", line 473, in init
self.train(sentences)
File "C:\Anaconda2\lib\site-packages\gensim\models\word2vec.py", line 777, in train
raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

Googling suggests this is a Word2Vec problem where min_count is too large. However, the problem still occurs here even though min_count=0.


Problem found: walks was an empty list ([]), so Word2Vec had no sentences to build a vocabulary from.
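A defensive sketch for src/main.py: validate the walks before handing them to Word2Vec, so the failure points at the real cause instead of surfacing as a vocabulary error inside gensim:

walks = [list(map(str, walk)) for walk in walks]
if not walks:
    raise ValueError('simulate_walks produced no walks - '
                     'check that the input graph loaded correctly and is non-empty')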

Has the data set BlogCatalog been processed?


The data set BlogCatalog here(http://socialcomputing.asu.edu/datasets/BlogCatalog3) is different from the one used in deepwalk(https://github.com/phanein/deepwalk).

The BlogCatalog data set downloaded there, from "group-edges.txt":
10306,39
10307,39
10308,39
10309,39
10310,39
10311,39
10311,39
10312,39

The BlogCatalog data set downloaded from http://leitang.net/social_dimension.html (the one used in deepwalk):
(14,39)
(691,39)
(1250,39)
(1344,39)
(1465,39)
(1550,39)
(4709,39)
(7759,39)

For category 39 there are obvious differences. Are these two different data sets?

About the vector values of node2vec_spark

Which parameter controls the scale of the values in the output vectors?
I have built a network with 400 thousand nodes and more than 2 million edges, and the node vectors contain values larger than 100 million. Is that normal?

Parameters for Les Miserables dataset

Hi,

Thanks for node2vec - such an interesting idea.

Could I ask you to specify some additional parameters for case study 4.1 in your paper, so that I can reproduce the community detection result?

For the top example you set p=1, q=0.5, but I'm wondering what you specified for num_walks and walk_length in the random walk generation, as well as size, window, min_count, sg and iter for Word2Vec.

Hope this isn't too cumbersome to reply to. Thanks again!
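For anyone reproducing this while waiting for an answer, a plausible starting point is the repository's default hyperparameters combined with the paper's p and q (an assumption on my part, not confirmed settings; the input path is hypothetical):

python src/main.py --input graph/lesmis.edgelist --output emb/lesmis.emd --p 1 --q 0.5 --num-walks 10 --walk-length 80 --dimensions 128 --window-size 10 --iter 1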

Takes very long to finish the program

I fed a graph of 60K nodes into the node2vec program. It has been running on an 8-core, 8 GB cloud server for more than 2 days without producing the output .emd file. I checked the monitoring metrics: CPU and memory consumption are currently very low, so the program probably stopped at some point without emitting any warning or message.

How to define "α=1 node" in directed graph?

last node: t
curr node: x
next node: y
In the local (Python) version, an "α=1 node" y is a node with an edge y->t in the graph:

for dst_nbr in sorted(G.neighbors(dst)):
    if dst_nbr == src:
        unnormalized_probs.append(G[dst][dst_nbr]['weight']/p)
    elif G.has_edge(dst_nbr, src):
        unnormalized_probs.append(G[dst][dst_nbr]['weight'])
    else:
        unnormalized_probs.append(G[dst][dst_nbr]['weight']/q)

But in the Spark version, an "α=1 node" y is a node with an edge t->y in the graph:

val neighbors_ = dstNeighbors.map { case (dstNeighborId, weight) =>
  var unnormProb = weight / q
  if (srcId == dstNeighborId) unnormProb = weight / p
  else if (srcNeighbors.exists(_._1 == dstNeighborId)) unnormProb = weight

  (dstNeighborId, unnormProb)
}

So, which one is correct?
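For what it's worth, the paper defines the search bias α_pq(t, y) through the shortest-path distance d(t, y), and in a directed graph d(t, y) = 1 exactly when the edge t->y exists, which matches the Spark reading. A sketch of that interpretation (my reading of the paper, not an authoritative ruling between the two implementations):

def alpha(G, t, y, p, q):
    # d(t, y) == 0: candidate is the previous node
    if y == t:
        return 1.0 / p
    # d(t, y) == 1: a directed edge t -> y exists
    elif G.has_edge(t, y):
        return 1.0
    # d(t, y) == 2: moving further away from t
    else:
        return 1.0 / q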

DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead.

I got this error when I tried to run the karate sample graph. I searched online, and the reason is that gensim updated its word2vec model, as described here: https://groups.google.com/forum/#!topic/gensim/pjZo1ZunIZA
Could you help solve this issue?

Details:
python src/main.py --input graph/karate.edgelist
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Traceback (most recent call last):
File "src/main.py", line 104, in
main(args)
File "src/main.py", line 100, in main
learn_embeddings(walks)
File "src/main.py", line 88, in learn_embeddings
model.save_word2vec_format(args.output)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 1438, in save_word2vec_format
raise DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead.")
DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead.
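The usual fix for newer gensim versions is to save through the model's wv attribute; a sketch of the one-line change in learn_embeddings (src/main.py):

# replace model.save_word2vec_format(args.output) with:
model.wv.save_word2vec_format(args.output)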

Node2Vec-Spark in Scala: Error on Loading Model

Hi,

I am trying to load a model previously created via Spark Node2Vec. My environment is as follows:

Scala: 2.11.12
Spark: 2.4.0

Here is the code I am trying to run:

import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("TestApp")
      .master("local[*]")
      .getOrCreate()
    val sameModel = Word2VecModel.load("output/embedding.bin")
  }
}

The folder output/embedding.bin contains the metadata and data folders generated by a previous execution of node2vec on Spark.

This results in the following exception:
Exception in thread "main" org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String
at org.json4s.reflect.package$.fail(package.scala:95)
at org.json4s.Extraction$.$anonfun$convert$2(Extraction.scala:704)
at scala.Option.getOrElse(Option.scala:138)
at org.json4s.Extraction$.convert(Extraction.scala:704)
at org.json4s.Extraction$.$anonfun$extract$9(Extraction.scala:394)
at org.json4s.Extraction$.customOrElse(Extraction.scala:606)
at org.json4s.Extraction$.extract(Extraction.scala:392)
at org.json4s.Extraction$.extract(Extraction.scala:39)
at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21)
at org.apache.spark.ml.util.DefaultParamsReader$.parseMetadata(ReadWrite.scala:632)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:616)
at org.apache.spark.ml.feature.Word2VecModel$Word2VecModelReader.load(Word2Vec.scala:392)
at org.apache.spark.ml.feature.Word2VecModel$Word2VecModelReader.load(Word2Vec.scala:384)
at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:380)
at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:380)
at org.apache.spark.ml.feature.Word2VecModel$.load(Word2Vec.scala:422)
at Main.main(Main.scala)

Best,
Matteo

A problem occurred in save method

Running the save method throws a NullPointerException during execution. After checking the code, the problem may be in node2vec.scala, line 171:

if (Some(this.label2id).isDefined) {
should be (Option(null) evaluates to None, whereas Some(null) is still defined):
if (Option(this.label2id).isDefined) {

How to deal with new nodes in the graph

Hi, the graph is dynamic, so new nodes are added to it all the time. Is there a solution to this issue? I found that word2vec supports online training here, but how can I get new walk sequences that include the new nodes? Can anyone help?
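A sketch of gensim's incremental training path (assumes a newer gensim, 3.x or later; model and new_walks are hypothetical names). The hard part, as the question notes, is still generating walks that pass through the new nodes, e.g. by re-running simulate_walks on the updated graph and keeping the walks that touch them:

new_walks = [list(map(str, walk)) for walk in new_walks]
model.build_vocab(new_walks, update=True)   # add the new node IDs to the vocabulary
model.train(new_walks, total_examples=len(new_walks), epochs=model.epochs)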


Spark implementation Exception: Output directory xxxxxx already exists.

Hi!

I was recently using the Spark version to analyze a large-scale network with 21M nodes and 310M edges (13 GB file). I ran the code on a cluster of 5 machines. It runs very well until saving the result to HDFS, where I get the exception "diagnostics: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory xxxxxx already exists."

Has anyone encountered this problem before?

Thank you!

Best,
Ying

Segmentation fault

It gives a segmentation fault when the number of edges is greater than or equal to 20,000. Please suggest a way to fix this.

problem when running python src/main.py graph/karate.edgelist emb/karate.emd

Traceback (most recent call last):
File "src/main.py", line 104, in
main(args)
File "src/main.py", line 100, in main
learn_embeddings(walks)
File "src/main.py", line 88, in learn_embeddings
model.save_word2vec_format(args.output)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1312, in save_word2vec_format
raise DeprecationWarning("Deprecated. Use model.wv.save_word2vec_format instead.")
DeprecationWarning: Deprecated. Use model.wv.save_word2vec_format instead.

The RDD "randomWalk" in node2vec cannot be unpersisted after the second iteration

If we add the line below:

randomWalk.unpersist(blocking = false)

to:

if (randomWalkPaths != null) {
  val prevRandomWalkPaths = randomWalkPaths
  randomWalkPaths = randomWalkPaths.union(randomWalk).cache()
  randomWalkPaths.first
  prevRandomWalkPaths.unpersist(blocking = false)
  randomWalk.unpersist(blocking = false)  // the added line
} else {
  randomWalkPaths = randomWalk
}

node2vec on Zachary's karate club network

After running the following command:
python src/main.py --input graph/karate.edgelist --output emb/karate.emd

I am receiving this error:
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Traceback (most recent call last):
File "src/main.py", line 104, in
main(args)
File "src/main.py", line 100, in main
learn_embeddings(walks)
File "src/main.py", line 87, in learn_embeddings
model = Word2Vec(walks, size=args.dimensions, window=args.window_size, min_count=0, sg=1, workers=args.workers, iter=args.iter)
TypeError: __init__() got an unexpected keyword argument 'iter'
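This TypeError usually means gensim 4.x is installed, where several Word2Vec keywords were renamed (size became vector_size, iter became epochs); a sketch of the updated call in learn_embeddings:

model = Word2Vec(walks, vector_size=args.dimensions, window=args.window_size,
                 min_count=0, sg=1, workers=args.workers, epochs=args.iter)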

Setting Directed Flag Causes NullPointerException

Hi,

Somehow the Spark implementation crashes whenever the directed flag is set to true. I ran it with both the karate.edgelist example and a dummy two-edge graph. The exception is always raised at the same location inside initTransitionProb, after the graph has been loaded.

java.lang.NullPointerException
at Node2vec$$anonfun$initTransitionProb$2.apply(Node2vec.scala:69)
at Node2vec$$anonfun$initTransitionProb$2.apply(Node2vec.scala:68)
at org.apache.spark.graphx.impl.VertexPartitionBaseOps.map(VertexPartitionBaseOps.scala:61)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$5.apply(GraphImpl.scala:129)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$5.apply(GraphImpl.scala:129)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

Cheers.

something wrong with random walk in node2vec_spark?

val edge2attr = graph.triplets.map { edgeTriplet =>
(s"${edgeTriplet.srcId}${edgeTriplet.dstId}", edgeTriplet.attr)
}.repartition(200).cache

(s"${prevNodeId}${currentNodeId}", (srcNodeId, pathBuffer))
}.join(edge2attr).map { case (edge, ((srcNodeId, pathBuffer), attr)) =>

In this code the join key is generated by s"${edgeTriplet.srcId}${edgeTriplet.dstId}". Don't we need a separator between the two elements? Without one, distinct edges can collide: src=1, dst=23 and src=12, dst=3 both produce the key "123".

random ZeroDivisionError: float division by zero

Hi,
I run node2vec on a graph created by nx.read_adjlist(), but sometimes I get this error. It doesn't always show up when I adjust parameters like p, q, walk_length, number_walks. How can I solve this problem?

\node2vec.py", line 80, in get_alias_edge
normalized_probs = [float(u_prob)/norm_const for u_prob in unnormalized_probs]
ZeroDivisionError: float division by zero

The dataset is DBLP; some nodes don't have any edges.
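That matches a zero norm_const in get_alias_edge: it normalizes over the neighbors of the walk's current node, so a node with no (outgoing) edges produces an empty probability list. A pre-filtering sketch (assuming G is the networkx graph before it is handed to node2vec):

# drop nodes that cannot be walked from; use G.out_degree for a DiGraph
dangling = [n for n in G.nodes() if G.degree(n) == 0]
G.remove_nodes_from(dangling)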

node2vec Spark - memory issue

Hi, I'm trying to run node2vec using the Spark implementation on a large graph (~2.8M nodes, ~41M edges, 4.1 GB file); this is the command I'm running:

./spark-submit --class com.navercorp.Main node2vec/node2vec_spark/target/node2vec-0.0.1-SNAPSHOT.jar --cmd node2vec --p 1 --q 1 --walkLength 40 --numWalks 5 --input yago_types.edgelist --output output/yago_types_p1_q1_l40_num5.emb --weighted False --directed False --indexed False

I get this error:
"2017-05-09T16:45:13.259237677Z 17/05/09 16:45:13 ERROR scheduler.TaskSchedulerImpl: Lost executor 1 on spark-worker1-97711-prod: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages"

Everything was working fine with a smaller sample, so it looks like a memory problem to me. Have you ever experienced anything similar? Any clue as to what a proper memory allocation would be for a graph of this size? At the moment I have a master node with 2 GB and six workers with 42 GB.

Thank you a lot!
Enrico

node2vec saveAsTextFile java.lang.ArrayIndexOutOfBoundsException: -1

When I run node2vec on a 119M-edge input file and only run the random walk, I encounter the problem below (running a small file does not have this issue).

The error occurs at this code in the Node2vec file:

randomPaths.filter(x => x != null && x.replaceAll("\\s", "").length > 0)
  .repartition(Property.repartition_num)
  .saveAsTextFile(s"${config.output}.${Property.pathSuffix}")

17/11/27 15:28:23 INFO YarnClientClusterScheduler: Stage 21090 was cancelled
17/11/27 15:28:23 INFO DAGScheduler: Job 422 failed: saveAsTextFile at Node2vec.scala:177, took 63.784991 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21090.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21090.0 (TID 45952, hadoop195): java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Issue with running example from Git

Hello,
I've been stuck on this problem for a few hours and I'm not sure how to resolve it:

(py34) coryfalconer@Opti:~/Documents/Graph/HBDSCAN_testing/Node2vec/node2vec$ python src/main.py --input graph/karate.edgelist --output emb/karate.emd

The above gives me the following errors:

/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/site-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
warnings.warn("Pattern library is not installed, lemmatization won't be available.")
Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10
Exception in thread Thread-9:
Traceback (most recent call last):
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/threading.py", line 911, in _bootstrap_inner
self.run()
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/threading.py", line 859, in run
self._target(*self._args, **self._kwargs)
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/site-packages/gensim/models/word2vec.py", line 822, in job_producer
sentence_length = self._raw_word_count([sentence])
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/site-packages/gensim/models/word2vec.py", line 747, in _raw_word_count
return sum(len(sentence) for sentence in job)
File "/home/coryfalconer/anaconda3/envs/py34/lib/python3.4/site-packages/gensim/models/word2vec.py", line 747, in
return sum(len(sentence) for sentence in job)
TypeError: object of type 'map' has no len()

Initially I was running the code with Python 3.6, but found that the gensim version I am using (0.13.3) does not support Python 3.6. I checked which Python versions gensim 0.13.3 supports (2.6, 2.7, 3.3, 3.4 and 3.5), created three anaconda virtual environments with Python 3.3, 3.4 and 3.5, and tried to run the code again, but I keep getting the same error.

If you could shed any light on how to fix this, that would be greatly appreciated.

Thanks!
Cory
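This is the Python 3 behavior change where map returns a lazy iterator with no len(); the fix (also spelled out in the "Problem in 'learn_embeddings' method" issue below) is to materialize the walks in src/main.py:

walks = [list(map(str, walk)) for walk in walks]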

why p=1 and q=1 performs much worse than deepwalk

Hello, I applied node2vec to a citation network I generated myself, in order to compare DeepWalk and node2vec. In my understanding node2vec should be at least as good as DeepWalk, but I have tried lots of p, q values and the performance is never as good. When I set p and q to 1, the performance is much worse than DeepWalk, even though in that case the walks should be equivalent to DeepWalk's. The results are shown below; both p and q equal 1, and all other parameters are the same. I don't know why. Could you please give me some advice? Thank you.
start deepwalk***************************
starting computing RPrecision
1 in top 2 is 0.500000
2 in top 4 is 0.500000
4 in top 6 is 0.666667
5 in top 8 is 0.625000
5 in top 10 is 0.500000
6 in top 20 is 0.300000
11 in top 30 is 0.366667
17 in top 40 is 0.425000
21 in top 50 is 0.420000
37 in top 100 is 0.370000
182 in top 1000 is 0.182000
end deepwalk***************************
start node2vec***************************
starting computing RPrecision
1 in top 2 is 0.500000
1 in top 4 is 0.250000
1 in top 6 is 0.166667
1 in top 8 is 0.125000
1 in top 10 is 0.100000
1 in top 20 is 0.050000
1 in top 30 is 0.033333
2 in top 40 is 0.050000
5 in top 50 is 0.100000
22 in top 100 is 0.220000
139 in top 1000 is 0.139000
end node2vec***************************

Fails to execute on Spark

Hi,

I downloaded the whole project and compiled the node2vec_spark module with mvn, without changing the pom.xml file.

The build succeeded and the target directory was created, but I got a MethodNotFound exception when I tried to run the jar file.

Should I change the pom.xml to match my Spark and Scala versions?

Thanks.

Could you provide the karate.edgelist and karate.emd files?

Python implementation wrong?

There is probably a bug in this implementation.
On the same dataset we compare the cumulative values of the PCA across several trials:

[image: cumulative PCA values across trials]

Graph with weight

Hi,

How can I use this work to embed a graph with weights (such as the distance between two nodes)? Could you please give some suggestions? Thanks in advance.
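For reference, the Python implementation already consumes edge weights when run with --weighted; a sketch of the corresponding load (mirroring the repo's read_graph as I understand it; the path is hypothetical). Note the walk is biased toward larger weights, so a distance should first be inverted into a similarity:

import networkx as nx

# each line: <node1> <node2> <weight>
G = nx.read_edgelist('weighted.edgelist', nodetype=int,
                     data=(('weight', float),), create_using=nx.DiGraph())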

The first walk path problem in the Spark version

The first step of each node's walk is fixed after initialization in the Node2Vec.initTransitionProb module:

graph = Graph(indexedNodes, indexedEdges)
  .mapVertices[NodeAttr] { case (vertexId, clickNode) =>
    val (j, q) = GraphOps.setupAlias(clickNode.neighbors)
    val nextNodeIndex = GraphOps.drawAlias(j, q)
    clickNode.path = Array(vertexId, clickNode.neighbors(nextNodeIndex)._1)
    clickNode
  }

As a result, the first step of each node's walk is the same across all walkers.

    for (iter <- 0 until config.numWalks) {
      var prevWalk: RDD[(Long, ArrayBuffer[Long])] = null
      var randomWalk = graph.vertices.map { case (nodeId, clickNode) =>
        val pathBuffer = new ArrayBuffer[Long]()
        pathBuffer.append(clickNode.path:_*)
        (nodeId, pathBuffer) 
      }.cache

In particular, with directed=true the edges are so sparse that it is hard for a walk to come back.

evaluation code

Hi,

Do you mind releasing the code you used for evaluation?

I am trying multilabel classification on the BlogCatalog data, but I can only get a macro-F1 of about 0.15 with 50% training data. My parameter settings are as follows:
length of walk: 80
walks per node: 10
context window size: 10
dimension: 128
epochs: 1
p, q: 1
one-vs-rest logistic regression with 0.01 L2 regularization
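For comparison, a minimal evaluation sketch under that protocol (X is the node-embedding matrix aligned with a binary label-indicator matrix Y; both are hypothetical here, and this simplification thresholds predictions rather than taking each node's top-k labels as the paper's protocol does):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, train_size=0.5, random_state=0)
# C = 1/lambda, so 0.01 L2 regularization corresponds to C = 100
clf = OneVsRestClassifier(LogisticRegression(C=100.0))
clf.fit(X_tr, Y_tr)
print('macro-F1:', f1_score(Y_te, clf.predict(X_te), average='macro'))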

Problem in 'learn_embeddings' method

The code below should not use map directly: in Python 3, map returns an iterator object, not a list. The correct version would be walks = [list(map(str, walk)) for walk in walks].

def learn_embeddings(walks):
    '''
    Learn embeddings by optimizing the Skipgram objective using SGD.
    '''
    walks = [map(str, walk) for walk in walks]
    model = Word2Vec(walks, size=args.dimensions, window=args.window_size,
                     min_count=0, sg=1, workers=args.workers, iter=args.iter)
    model.save_word2vec_format(args.output)
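A corrected sketch combining this fix with the model.wv save path from the deprecation reports above (keeping the pre-4.0 gensim keyword names used in this repo):

def learn_embeddings(walks):
    '''
    Learn embeddings by optimizing the Skipgram objective using SGD.
    '''
    walks = [list(map(str, walk)) for walk in walks]
    model = Word2Vec(walks, size=args.dimensions, window=args.window_size,
                     min_count=0, sg=1, workers=args.workers, iter=args.iter)
    model.wv.save_word2vec_format(args.output)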

MIT/Apache License

I am working on improving the speed/parallelization of node2vec for a research project, to see whether it can be used for question answering via the Wikipedia network structure (https://github.com/Pinafore/qb), and I noticed that the source code doesn't have a license.

Would it be possible for you to add a permissive license like MIT/Apache so there are no worries about that later?
