nkahmed / pgd Goto Github PK

A Parallel Graphlet Decomposition Library for Large Graphs

License: Other

Makefile 2.14% C 0.95% C++ 96.91%

graph graph-algorithms graphlet-decomposition-algorithms parallel subgraph-count motif-counts motif graphlet-frequency-distribution graphlet-counts pgd

pgd's Introduction

Parallel Parameterized Graphlet Decomposition (PGD) Library

A fast parallel graphlet decomposition library for large graphs.

Please refer to our paper Efficient Graphlet Counting for Large Networks for detailed discussion on the algorithm.

Synopsis

In short, a parameterized high performance library for counting motifs in large sparse graphs.

Setup

First, you'll need to compile PGD.

    $ cd path/to/pgd/
    $ make

Afterwards, the following should work:

    # compute the motif counts
    ./pgd -f sample_graph.csv

Currently, PGD supports only UNIX-based systems. PGD has been tested on Ubuntu linux (10.10 tested) and Mac OSX (Lion tested) with gcc-mp-4.7 and gcc-mp-4.5.4

Please let us know if you run into any issues.

Symbol	Description	ρ	δ	d_avg	r	T	K	χ	D	B	C
g_4₁	4-clique	1.00	3	3.0	1.00	4	3	4	1	0	1
g_4₂	4-chordal-cycle	0.83	3	2.5	-0.66	2	2	3	2	1	1
g_4₃	4-tailed-triangle	0.67	3	2.0	-0.71	1	2	3	2	2	1
g_4₄	4-cycle	0.67	2	2.0	1.00	0	2	2	2	1	1
g_4₅	3-star	0.50	3	1.5	-1.00	0	1	2	2	3	1
g_4₆	4-path	0.50	2	1.5	-0.50	0	1	2	3	2	1
g_4₇	4-node-1-triangle	0.50	2	1.5	1.00	1	2	3	1	0	2
g_4₈	4-node-2-star	0.33	2	1.0	-1.00	0	1	2	2	1	2
g_4₉	4-node-2-edge	0.33	1	1.0	1.00	0	1	2	1	0	2
g_4₁₀	4-node-1-edge	0.17	1	0.5	1.00	0	1	2	1	0	3
g_4₁₁	4-node-independent	0.00	0	0.0	0.00	0	0	1	∞	0	4
g_3₁	triangle	1.00	2	2.0	1.00	1	2	3	1	0	1
g_3₂	2-star	0.67	2	1.33	-1.00	0	1	2	2	1	1
g_3₃	3-node-1-edge	0.33	1	0.67	1.00	0	1	2	1	0	2
g_3₄	3-node-independent	0.00	0	0.00	0.00	0	0	1	∞	0	3
g_2₁	edge	1.00	1	1.0	1.00	0	1	2	1	0	1
g_2₂	2-node-independent	0.00	0	0.0	0.00	0	0	1	∞	0	2

Input file format

Matrix Market Coordinate Format (symmetric)
For details see: http://math.nist.gov/MatrixMarket/formats.html#MMformat

  %%MatrixMarket matrix coordinate pattern symmetric  
  4 4 6  
  2 1  
  3 1  
  3 2  
  4 1  
  4 2  
  4 3

Note comments are denoted by %. First line represents n n m where n is the number of nodes and m is the number of edges |E|. For instance, the first line above is 4 4 6 and indicates the number of nodes is n=4 and number of edges is m=6.

A graph file with the extension .mtx is read using this strict mtx graph reader. Thus, if the graph file does not strictly follow the above mtx format, then the file extension should be changed to allow it to be read by the more flexible graph reader discussed below.

Edge list: These codes are designed to be as flexible as possible and accept many variations of edge lists. Note these codes may be slightly slower than the mtx reader. This is due to allowing flexible edge list formats. Hence, this reader must perform many checks to figure out the exact format of the input file, and performs any necessary preprocessing work that may be required.
- Delimiters: The mcpack reader accepts comma, space, or tab delimited edge lists.
- Comments: Comments are allowed and should be denoted with the first character of a newline as # or %
- Weights: If an edge list contains weights on the 3rd column, they are simply ignored. A user may specify to read the weights by setting the wt parameter or by noting the graph is in fact a temporal graph.
- Multigraph: When an edge list contains multiple edges, we simply remove the duplicate edges.
- The edge list may also contain gaps in the vertex ids (non sequential vertex ids) and start from any positive integer. Self-loops are removed.
- The edge list is assumed to be undirected. However, if a directed graph is given, it is simply treated as undirected.

Output Graphlet Quantities

The PGD family of graphlet decomposition algorithms provide three types of output 1. Global macro statistics indicating the total frequency of each motif 2. Local micro statistics representing the number of times each motif appears (for each edge or node in the graph) 3. Graphlet frequency distribution

NOTE: The total counts for each motif is outputted to the screen by default.

Macro motif counts

The macro (global) graphlet counts are printed to the screen by default. These statistics may also be saved to a file using --macro filename.macro where filename.macro is the path to save stats.

    ./pgd -f sample_graph.csv --macro sample_graph.macro

Micro motif counts

The motif counts for each edge may also be saved using the --micro filename.micro.

    ./pgd -f sample_graph.csv --micro sample_graph.micro

Graphlet frequency distribution (GFD)

To output the graphlet frequency distribution, use the --gfd filename.gfd option.

    ./pgd -f sample_graph.csv --gfd sample_graph.gfd

Advanced

Orderings

The PGD algorithms are easily adapted to use various ordering strategies. To prescribe an ordering, use the -o option with one of the following:

Ordering techniques	Description
`deg`	order by degree
`kcore`	degeneracy order
`dual_deg`	ordering defined by the sum of degrees from neighbors
`dual_kcore`	order by the sum of core numbers from neighbors
`kcore_deg`	order by the weight k(v)d(v)
`rand`	uniformly random order
`natural`	use the order given as input

Direction of ordering

Descending order is the default (largest to smallest). For ascending order (smallest to largest), simply set --s2l as follows:

./pgd -f sample_graph.csv --s2l

To reverse the order of the neighbors (for each node) use `--s2l_neigh``.

Command Line Interface (CLI)

You can execute PGD with --help to see the list of options

$ ./pgd --help

=================================================================================
Parallel Parameterized Graphlet Decomposition (PGD) Library
=================================================================================
-f, --file,--graph              : Input GRAPH file for computing the graphlets (e.g., matrix market format, simple edge list). 
-a, --algorithm                 : Algorithm for the GRAPHLET DECOMPOSITION. Default: exact, approximate, etc.
---------------------------------------------------------------------------------
-w, --workers                   : Number of WORKERS (processing units) for the algorithm to use (default = max). 
-b, --block_size                : Size of blocks assigned to workers (processing units), that is, 1, 64, 512, etc.  Default: -b 64
---------------------------------------------------------------------------------
-o, --ordering                  : Strategy used to determine the order in which the edge/node graphlets are computed.
                                  Default: -o degree (kcore, rand, natural/off, etc.).
    --s2l                       : Direction of the ordering (default: largest to smallest).
                                  Note to order from smallest to largest, set '--s2l'  
-n, --neigh_ordering            : Ordering to use for the neighbors of each node. 
                    	           Default: degree (kcore, rand, natural, etc.)
                                   Note only applicable if CSC/CSR is used (-r csc).
    --s2l_neigh                 : Order direction for neighbor/csc ordering strategy
                                  (e.g., --neigh_ordering degree --s2l_neigh, default: largest to smallest)
---------------------------------------------------------------------------------
-c, --counts,--macro            : Compute MACRO (GLOBAL) GRAPHLET FEATURES and save counts to file (e.g., --counts name.graphlets)
-m, --micro                     : Compute MICRO (LOCAL) GRAPHLET FEATURES and save EDGE-by-MOTIF feature matrix (-m name.micro_graphlets)
                                  Default: OFF. NOTE: Turn ON edge graphlet counting by specifying an output file, e.g., '-m name.micro_graphlets' 
---------------------------------------------------------------------------------
-v, --verbose                   : Output additional details to the screen. 
-?, -h, --help                  : Print out this help menu. 


REPRESENTATION: Example: ./pgd -r csc (adj, etc.)
=================================================================================
-r,   --rep                     : Graph representation [adj, csc, hybrid, auto, etc].
                                  Default: Auto select optimal. 
   'adj'    - adjacency matrix  : dense n by n matrix, O(|V|^2) storage cost
   'csc'    - comp. sparse col  : large sparse graphs, O(|V|+|E|) storage cost
   'hybrid' -  csc + adj        : small sparse and dense graphs, O(|V|^2 + |V| + |E|) storage cost
-l, --adj_limit                 : Threshold/limit for creating adj representation. Default: '-l 10000' (that is <10000 nodes).


ORDERING TECHNIQUES: Example: ./pgd -o degree (kcore, rand, etc.)
=================================================================================
'degree',   'deg'                    : O(|V|)
'kcore',                             : O(|E|)
'rand', 'random'                     : O(|V|)
'off',  'natural'                    

 Other methods for ordering include: 
'kcore_degree', 'kcore_deg'          : O(|V|)
'degree_vol',   'deg_vol'            : O(|E|)
'kcore_vol',                         : O(|E|)
'deg_kcore_vol'                      : O(|E|)
------------------------------------------------------------------
NOTE: Default ordering is kcore (degeneracy order). For natural order, use '-o off' or '-o natural'

API and Sample Codes

Exact Sample Codes

Sample codes for computing exact graphlet statistics using the family of exact graphlet decomposition algorithms from pgd library.

macro graphlet statistics
micro graphlet statistics

Let us note that if the micro-level graphlet statistics are not needed, then it is recommended to use the macro graphlet decomposition algorithms. These are highly optimized for this task and thus are significantly more space-efficient while also faster and more scalable.

Macro graphlet statistics

Compute the global frequency of the various motif patterns with just two lines:

// read graph, optimize alg/data structs, etc.
graphlet_core G("sample_graph.csv");
G.graphlet_decomposition();

Afterwards, it is easy to print or write global motif counts to a file.

G.print_graphlet_counts(); // print to screen

or SAVE to a file,

G.write_macro_stats(path);

Micro graphlet statistics

Compute the frequency of the various motif patterns with just two lines:

// read graph, optimize alg/data structs, etc.
graphlet_core G("sample_graph.csv");
G.graphlet_decomposition_micro();

Afterwards, it is easy to print or write the motif counts to a file.

G.print_micro_stats(); // print to screen

or SAVE to a file,

G.write_micro_stats(path);

GFD Sample Codes

Estimate GFD

To obtain a fast and accurate estimation of the graphlet frequency distribution, use the following:

// Approximate GFD by sampling uniformly at random 10% of the edges (vertices) to use
G.graphlet_approximation(0.10);

Afterwards, the GFD can be approximated from these counts as follows:

// Estimate graphlet distribution for connected and disconnected motifs
G.compute_GFD();

Distribution	API Call
Graphlet Freq. Distribution (GFD)	`compute_GFD()`
Connected GFD	`compute_connected_GFD()`
Disconnected GFD	`compute_disconnected_GFD()`

Exact GFD

Exact graphlet distributions may also be computed fast by simply using an exact graphlet decomposition method from those expressed by the pgd library and then using the API calls above in the table.

G.graphlet_decomposition();
G.compute_GFD();

Documentation

The documentation is generated (using doxygen) by simply typing make doc in the root directory of pgd.

make doc

This creates the ./doc directory with the documentation. To update the documentation, use make cleandoc then make doc.

Doxygen and graphvis are required and installed via homebrew (if not installed already). Currently, this works only for Mac OSX and other Unix-based systems.

Additional Info.

To generate the documentation you must have doxygen and graphviz installed. On Mac OSX these can be stalled using homebrew with the following commands:

# install doxygen and graphviz using homebrew on Mac OSX
brew install doxygen
brew install graphviz

Afterwards, the documentation is generated by simply typing make docs in the root directory of pgd. This creates the ./docs directory with the documentation.

Terms and conditions

Please feel free to use the PGD library. We only ask that you cite:

Nesreen K. Ahmed, Jennifer Neville, Ryan A. Rossi, Nick Duffield, Efficient Graphlet Counting for Large Networks, IEEE International Conference on Data Mining (ICDM), pages 10, 2015.

Also the BiBTeX for [1] is:

 @inproceedings{ahmed2015icdm,
     title={Efficient Graphlet Counting for Large Networks},
     author={Nesreen K. Ahmed and Jennifer Neville and Ryan A. Rossi and Nick Duffield},
     booktitle={ICDM},
     pages={1--10},
     year={2015}
 }

Nesreen K. Ahmed, Jennifer Neville, Ryan A. Rossi, Nick Duffield Fast Parallel Graphlet Counting for Large Networks, arXiv preprint 1506.04322, 2015.

Graph Datasets

Please check the following link for additional graph datasets: [Network Repository] (http://networkrepository.com/)

pgd's People

Contributors

Stargazers

Watchers

pgd's Issues

command was killed by system

hi, today I run this command (./pgd -f europe-airports.mtx --macro sample_graph.macro)in Ubuntu system, but soon the shell was killed, I think that I have enough memory. If you know why, please let me know, thinks.
there is some information about dataset:
europe-airports: nodes:399, edges:5995, mean_degree:30.050125, mean_path:2.323611 , mean_clustering:0.539272, degree_assortativity: -0.22479390366

Invalid Micro Counts

Running PGD with the --micro flag yields incorrect micro counts.
For example, on a graph with two tailed triangles, PGD is counting zero tailed triangles for each edge.

Graph tested

Edgelist:

1,2
1,3
2,3
2,4
3,5

PGD output

[reading generic edge list: read_edge_list func]  filename: ./double_tailed_triangle.edgelist
|V|: 5
|E|: 5
p: 0.5
d_max: 3
d_avg: 2
r = -0.5625
K = 2
************************************************************
total_2_1edge = 5
total_2_indep = 5
----------------------------------------
total_3_tris = 1
total_2_star = 4
total_3_1edge = 4
total_3_indep = 1
----------------------------------------
total_4_clique = 0
total_4_chordcycle = 0
total_4_tailed_tris = 2
total_4_cycle = 0
total_3_star = 0
total_4_path = 1
----------------------------------------
total_4_1edge = 0
total_4_2edge = 0
total_4_2star = 2
total_4_tri = 0
total_4_indep = 0
************************************************************
--------------------------------------------------------------------------------


Micro graphlet statistics
--------------------------------------------------------------------------------
src,dst,triangle,2-star,4-clique,4-chordal-cycle,4-tailed-triangle,4-cycle,3-star,4-path
3,2,1,2,0,0,0,0,0,1
1,2,1,1,0,0,0,0,0,0
1,3,1,1,0,0,0,0,0,0
5,3,0,2,0,0,0,0,1,0
4,2,0,2,0,0,0,0,1,0

graphlet decomposition time: 2.40803e-05 sec
--------------------------------------------------------------------------------
Graphlet Frequency Distribution (GFD)
--------------------------------------------------------------------------------
4-clique     	0
4-chordal-cycle	0
4-tailed-tri	0.4
4-cycle       	0
3-star       	0
4-path       	0.2
4-node-1-tri   	0
4-node-2-star  	0.4
4-node-2-edge	0
4-node-1-edge  	0
4-node-indep  	0


--------------------------------------------------------------------------------
Connected Motif Graphlet Frequency Distribution (GFD)
--------------------------------------------------------------------------------
4-clique     	0
4-chordal-cycle	0
4-tailed-tri	0.666667
4-cycle       	0
3-star       	0
4-path       	0.333333


--------------------------------------------------------------------------------
Disconnected Motif Graphlet Frequency Distribution (GFD)
--------------------------------------------------------------------------------
4-node-1-tri   	0
4-node-2-star	1
4-node-2-edge  	0
4-node-1-edge  	0
4-node-indep  	0


mean	= 0
median	= 0
max	= 0
min	= 0
range	= 0
std	= 0
var	= 0
iqr	= 0
q1	= 0
q3	= 0

Graphlet frequency distribution (GFD)

Is the "gdf" option is not available? I can't find it in the help or in the graphlet_params.h file.
Let me know. Thanks.

can we get other modify?

Hi, can we get other modify except the follow:
triangle,2-star,4-clique,4-chordal-cycle,4-tailed-triangle,4-cycle,3-star,4-path

Fixing Compile Issue on macOS 10.13 High Sierra and GCC 7.2.0

When compiling using gcc 7.2.0 on macOS 10.13 High Sierra, I encountered the following error:

graphlet_utils.cpp: In function 'bool fexists(const char*)':
graphlet_utils.cpp:42:12: error: cannot convert 'std::ifstream {aka std::basic_ifstream<char>}' to 'bool' in return
     return ifile;
            ^~~~~
make: *** [pgd] Error 1

I changed 'return ifile;' to be 'return (bool)ifile;' and was able to compile successfully.

Reference: https://stackoverflow.com/questions/38659115/make-fails-with-error-cannot-convert-stdistream-aka-stdbasic-istreamchar

The vertex ids are different between input and output

Thank you very much for your open code! But these is something wrong when I try to use these code.

I use the edge list as the input of PGD as follow:

The output of the PGD is :

% src,dst,triangle,2-star,4-clique,4-chordal-cycle,4-tailed-triangle,4-cycle,3-star,4-path
6,5,1,0,0,0,0,0,0,0
5,4,1,1,0,0,0,0,0,0
6,4,1,1,0,0,0,0,0,0
2,1,0,2,0,0,0,0,1,0
3,1,0,2,0,0,0,0,1,0
4,1,0,4,0,0,0,0,2,4

It's easy to see that the vertex ids are different between input and output
So do you have any method to solve it ?

Best Regards

Error in compilation

Hello,

Thanks for sharing your implementation

However, I cannot compile the program on my computer, I am on Ubuntu 20.04

Here is the error message :

graphlet_utils.cpp: In function ‘bool fexists(const char*)’:
graphlet_utils.cpp:42:12: error: cannot convert ‘std::ifstream’ {aka ‘std::basic_ifstream<char>’} to ‘bool’ in return
   42 |     return ifile;
      |            ^~~~~
make: *** [Makefile:47: pgd] Error 1

Thank you for your help

number of nodes

Hi there! I tried to apply the method to the CiteSeer dataset with edgelist.txt as input. The input has 3312 nodes, but the --macro returned 3264 nodes. There's no isolated node in the dataset. The number of edges didn't match either. Has anyone met this problem?

input file format csv or mtx

Thank you for making your code public. I want to use this for comparing two networks. If I take a small, well studied graph: karate club; changing the edge-list to .mtx and then giving that as input to pgd isn't working. Here is what I do:

I changed the KC edge-list file to mtx as so:

%% Karate Club Graph (graph is undirected, each edge is saved twice)
%% Nodes: 34 Edges:78*2=156
34 34 78
0 1
0 2
<snip>

Then I tried to do: /pgd -f kc.mtx

On the other hand, if I change the original edgelist file by removing the comments and replace them with a header line: src,dst it seems to work, so the whole thing on MatrixMarket input file format is confusing, but I'm sure I'm missing part of the big-picture.

Compilation Issue

When running make, I get the following output

g++ -O3 -fopenmp -ffast-math -funroll-loops -fno-strict-aliasing -fomit-frame-pointer -fexpensive-optimizations -funroll-loops -fmove-loop-invariants -fprefetch-loop-arrays -ftree-loop-optimize -ftree-vect-loop-version -ftree-vectorize -o pgd graphlet_driver.o graphlet_utils.cpp graphlet_core.cpp graphlet_rand.cpp
graphlet_driver.o: file not recognized: File format not recognized
collect2: error: ld returned 1 exit status
make: *** [pgd] Error 1

Is this an issue with my environment's setup or with the build process? I got the same error on two machines.
Thanks

How to calculate the specified motif?

The library computes 8 motifs by default,and how to calculate the specified motif?
Thanks

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

nkahmed / pgd Goto Github PK

pgd's Introduction

Parallel Parameterized Graphlet Decomposition (PGD) Library

Synopsis

Setup

Input file format

Output Graphlet Quantities

Macro motif counts

Micro motif counts

Graphlet frequency distribution (GFD)

Advanced

Orderings

Direction of ordering

Command Line Interface (CLI)

API and Sample Codes

Exact Sample Codes

Macro graphlet statistics

Micro graphlet statistics

GFD Sample Codes

Estimate GFD

Exact GFD

Documentation

Additional Info.

Terms and conditions

Graph Datasets

pgd's People

Contributors

Stargazers

Watchers

Forkers

pgd's Issues

Graph tested

PGD output

Recommend Projects

Recommend Topics

Recommend Org