node2vec-spark's Introduction

Node2vec on Spark 2

This library is a Scala implementation of node2vec for Spark, as described in the paper:

node2vec: Scalable Feature Learning for Networks. Aditya Grover and Jure Leskovec. Knowledge Discovery and Data Mining, 2016.

The node2vec algorithm learns continuous representations for nodes in any (un)directed, (un)weighted graph. Please check the project page for more details.

Building node2vec_spark

In order to build node2vec_spark, use the following:

$ git clone https://github.com/QuanLab/node2vec-spark.git
$ cd node2vec-spark
$ sbt clean assembly

The build requires:
sbt 1.2.8 or newer
Java 8 or newer
Scala 2.11 or newer

The build produces a jar file in "node2vec-spark/target/".

Examples

This library provides two functions: randomwalk and embedding.
They are described in the papers "node2vec: Scalable Feature Learning for Networks" and "Efficient Estimation of Word Representations in Vector Space", respectively.

Random walk

Example:

./spark-submit --class vn.five9.Main \
			   ./node2vec-spark/target/node2vec-spark-assembly-0.1.jar \
			   --cmd randomwalk --p 100.0 --q 100.0 --walkLength 40 \
			   --input <input> --output <output>

Options

Invoke a command without arguments to list available arguments and their default values:

--cmd COMMAND
	Function to run: "randomwalk" or "embedding". To run both sequentially, use "node2vec". Default is "node2vec".
--input [INPUT]
	Input edgelist path. The supported input format is an edgelist: "node1_id_int node2_id_int <weight_float, optional>"
--output [OUTPUT]
	Output path for the generated random paths.
--walkLength WALK_LENGTH
	Length of walk per source. Default is 80.
--numWalks NUM_WALKS
	Number of walks per source. Default is 10.
--p P
	Return hyperparameter. Default is 1.0.
--q Q
	In-out hyperparameter. Default is 1.0.
--weighted Boolean
	Whether the graph is weighted. Default is true.
--directed Boolean
	Whether the graph is directed. Default is false.
--degree UPPER_BOUND_OF_NUMBER_OF_NEIGHBORS
	Upper bound on the number of neighbors. Default is 30.
--indexed Boolean
	Whether the nodes in the edgelist are already indexed. Default is true.
  • If "indexed" is set to false, node2vec_spark indexes the nodes in the input edgelist. For example:
    unindexed edgelist:
    node1 node2 1.0
    node2 node7 1.0

    indexed edgelist:
    1 2 1.0
    2 3 1.0

    node index:
    1 node1
    2 node2
    3 node7
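
The indexing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the library's code: the function name `index_edgelist` is hypothetical, and the choice of 1-based ids assigned in order of first appearance is an assumption read off the example (the library may assign ids differently).

```python
def index_edgelist(lines):
    """Replace string node names with integer ids.

    Assumption: ids start at 1 and are assigned in order of first
    appearance, which matches the example in the text but is not
    confirmed to be what node2vec_spark does internally.
    """
    node_to_id = {}
    indexed = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        src, dst, rest = parts[0], parts[1], parts[2:]
        for name in (src, dst):
            if name not in node_to_id:
                node_to_id[name] = len(node_to_id) + 1
        # Keep the optional weight column untouched.
        indexed.append(" ".join([str(node_to_id[src]), str(node_to_id[dst])] + rest))
    return indexed, node_to_id


edges = ["node1 node2 1.0", "node2 node7 1.0"]
indexed, mapping = index_edgelist(edges)
print(indexed)   # the edgelist with integer ids
print(mapping)   # node name -> integer id
```

Run on the two edges from the example, the sketch yields the same indexed edgelist and node index as shown above.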

Input

The supported input format is an edgelist with integer node ids:

node1_id_int 	node2_id_int 	<weight_float, optional>

or with string node names, in which case the option "indexed" must be set to false:

node1_str 	node2_str 	<weight_float, optional>
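
To make the edgelist format concrete, here is a minimal Python sketch of parsing one line. The function name `parse_edge` is hypothetical, and using 1.0 when the weight column is missing is an assumption; the text does not say what weight the library uses in that case.

```python
def parse_edge(line, indexed=True, default_weight=1.0):
    """Parse 'node1 node2 [weight]' into a (src, dst, weight) tuple.

    Assumption: a missing weight is treated as default_weight (1.0 here).
    """
    parts = line.split()
    if len(parts) not in (2, 3):
        raise ValueError("expected 2 or 3 columns, got %d: %r" % (len(parts), line))
    weight = float(parts[2]) if len(parts) == 3 else default_weight
    if indexed:
        # Integer node ids, as with --indexed true.
        return int(parts[0]), int(parts[1]), weight
    # String node names, as with --indexed false.
    return parts[0], parts[1], weight


print(parse_edge("1 2 0.5"))
print(parse_edge("node1 node2", indexed=False))
```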

Output

The output file contains (number of nodes) * numWalks random paths in the following format:

src_node_id_int 	node1_id_int 	node2_id_int 	... 	noden_id_int
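
For readers who want to see how p and q shape a walk, the following Python sketch follows the second-order biased walk from the node2vec paper on a small in-memory graph. It is not this library's Spark implementation: all function names are hypothetical, the `--degree` bound is ignored, and counting the source node as part of the walk length is an assumption.

```python
import random


def build_adjacency(edges, directed=False):
    """Build {node: {neighbor: weight}} from (src, dst, weight) tuples."""
    adj = {}
    for src, dst, w in edges:
        adj.setdefault(src, {})[dst] = w
        adj.setdefault(dst, {})
        if not directed:
            adj[dst][src] = w
    return adj


def biased_walk(adj, start, walk_length, p, q, rng):
    """One walk of at most walk_length nodes, source included (assumption)."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = sorted(adj[cur])
        if not neighbors:
            break  # dead end in a directed graph
        if len(walk) == 1:
            # First step: only edge weights matter.
            weights = [adj[cur][n] for n in neighbors]
        else:
            prev = walk[-2]
            weights = []
            for n in neighbors:
                w = adj[cur][n]
                if n == prev:
                    weights.append(w / p)   # return to the previous node
                elif n in adj[prev]:
                    weights.append(w)       # stays at distance 1 from prev
                else:
                    weights.append(w / q)   # moves outward, distance 2 from prev
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def generate_walks(adj, num_walks, walk_length, p, q, seed=0):
    """num_walks walks per source node: (number of nodes) * num_walks in total."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for node in sorted(adj):
            walks.append(biased_walk(adj, node, walk_length, p, q, rng))
    return walks


edges = [(1, 2, 1.0), (2, 3, 1.0), (3, 1, 1.0), (3, 4, 1.0)]
adj = build_adjacency(edges)
walks = generate_walks(adj, num_walks=2, walk_length=5, p=1.0, q=1.0)
for walk in walks:
    print(" ".join(str(n) for n in walk))
```

A large p makes returning to the previous node less likely, and a large q keeps the walk close to the previous node's neighborhood. With 4 nodes and 2 walks per node, the sketch prints 8 paths.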

Embedding random paths

Example:

./spark-submit --class vn.five9.Main \
			   ./node2vec-spark/target/node2vec-spark-assembly-0.1.jar \
			   --cmd embedding --dim 50 --iter 20 \
			   --input <input> --nodePath <node2id_path> --output <output>

Options

Invoke a command without arguments to list available arguments and their default values:

--cmd COMMAND
	Use "embedding". To run "randomwalk" and "embedding" sequentially, use "node2vec". Default is "node2vec".
--input [INPUT]
	Input random paths. The supported input format is: "src_node_id_int node1_id_int ... noden_id_int"
--output [OUTPUT]
	Output path for the word2vec model (.bin) and the embeddings (.emb).
--nodePath [NODE_PATH]
	Input node2index path. The supported input format: "node1_str node1_id_int"
--iter ITERATION
	Number of epochs in SGD. Default is 10.
--dim DIMENSION
	Number of dimensions. Default is 128.
--window WINDOW_SIZE
	Context size for optimization. Default is 10.
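
As a rough illustration of what the context size means, the Python sketch below lists the (center, context) pairs that a skip-gram model would see for one random path. It uses a fixed window for clarity, which is a simplification: word2vec implementations commonly sample a smaller window at random for each position. The function name `context_pairs` is hypothetical.

```python
def context_pairs(walk, window):
    """All (center, context) pairs within `window` positions of each node.

    Simplification: the window is fixed, not randomly shrunk per position.
    """
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


pairs = context_pairs([1, 2, 3, 4], window=1)
print(pairs)
```

With a window of 1, each node is paired only with its immediate predecessor and successor in the path; a larger window also pairs it with nodes further along the walk.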

Input

The input is a file of random paths in the following format:

src_node_id_int 	node1_id_int 	... 	noden_id_int

Output

The output files are embeddings and word2vec model. The embeddings file has the following format:

node1_str 	dim1 dim2 ... dimd

where dim1, ... , dimd is the d-dimensional representation learned by word2vec.
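
A downstream consumer might read the embeddings file along these lines. This Python sketch assumes the columns are separated by whitespace and that node names contain no whitespace; both are assumptions based on the format line above. The names `parse_embedding_line`, `load_embeddings` and `cosine` are hypothetical, and the vectors in the example are made up for illustration.

```python
import math


def parse_embedding_line(line):
    """Split 'node_str dim1 ... dimd' into (node_str, [floats])."""
    parts = line.split()
    return parts[0], [float(x) for x in parts[1:]]


def load_embeddings(lines):
    """Build {node_str: vector} from an iterable of lines, skipping blanks."""
    return dict(parse_embedding_line(line) for line in lines if line.strip())


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-dimensional vectors, for illustration only.
lines = ["node1 1.0 0.0", "node2 2.0 0.0", "node7 0.0 3.0"]
emb = load_embeddings(lines)
print(cosine(emb["node1"], emb["node2"]))
print(cosine(emb["node1"], emb["node7"]))
```

In practice the lines would come from the .emb output, for example by iterating over an open file handle.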

The word2vec model output is in the Spark word2vec model format. For details, see https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec

References

  1. node2vec: Scalable Feature Learning for Networks
  2. Efficient Estimation of Word Representations in Vector Space
  3. Node2Vec for Spark version 1.6
