Giter Club home page Giter Club logo

pygenstability's Introduction

PyPI version codecov Actions Status Code style: black Documentation status License DOI

Downloads Downloads

PyGenStability

This python package is designed for multiscale community detection with Markov Stability (MS) analysis [1, 2] and allows researchers to identify robust network partitions at different resolutions. It implements several variants of the MS cost functions that are based on graph diffusion processes to explore the network (see illustration below). Whilst primarily built for MS, the internal architecture of PyGenStability has been designed to solve for a wide range of clustering cost functions since it is based on optimising the so-called generalized Markov Stability function [3]. To maximize the generalized Markov Stability cost function, PyGenStability provides a convenient python interface for C++ implementations of Louvain [4] and Leiden [5] algorithms. We further provide specific analysis tools to process and analyse the results from multiscale community detection, and to facilitate the automatic selection of robust partitions [6]. PyGenStability is accompanied by a software paper that further details the implementation, result analysis, benchmarks and applications [7].

illustration

Documentation

A documentation of all features of the PyGenStability package is available here: https://barahona-research-group.github.io/PyGenStability/, or in pdf here.

Installation

You can install the package using pypi:

pip install pygenstability

Using a fresh python3 virtual environment, e.g. conda, may be recommended to avoid conflicts with other python packages.

By default, the package uses the Louvain algorithm [4] for optimizing generalized Markov Stability. To use the Leiden algorithm [5], install this package with:

pip install pygenstability[leiden]

To plot network partitions using networkx, install this package with:

pip install pygenstability[networkx]

To use plotly for interactive plots in the browser, install this package with:

pip install pygenstability[plotly]

To install all dependencies, run:

pip install pygenstability[all]

Installation from GitHub

You can also install the source code of this package from GitHub directly by first cloning this repo with:

git clone --recurse-submodules https://github.com/ImperialCollegeLondon/PyGenStability.git

(if the --recurse-submodules has not been used, just do git submodule update --init --recursive to fetch the submodule with M. Schaub's code).

The wrapper for the submodule uses Pybind11 https://github.com/pybind/pybind11 and, to install the package, simply run (within the PyGenStability directory):

pip install . 

using a fresh python3 virtual environment to avoid conflicts. Similar to above, you can also specify additional dependencies, e.g. to install the package with networkx run:

pip install .[networkx]

Using the code

The code is simple to run with the default settings. We can input our graph (of type scipy.csgraph), run a scan in scales with a chosen Markov Stability constructor and plot the results in a summary figure presenting different partition quality measures across scales (values of MS cost function, number of communities, etc.) with an indication of optimal scales.

import pygenstability as pgs
results = pgs.run(graph)
pgs.plot_scan(results)

Although it is enforced in the code, it is advised to set environment variables

export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1
export NUMEXPR_MAX_THREADS=1

to ensure numpy does not use multi-threadings, which may clash with the parallelisation and slow down the computation.

There are a variety of further choices that users can make that will impact the partitioning, including:

  • Constructor: Generalized Markov Stability requires the user to input a quality matrix and associated null models. We provide an object-oriented module to write user-defined constructors for these objects, with several already implemented (see pygenstability/constructors.py for some classic examples).
  • Generalized Markov Stability maximizers: To maximize the NP-hard optimal generalized Markov Stability we interface with two algorithms: (i) Louvain and (ii) Leiden.

While Louvain is defined as the default due to its familiarity within the research community, Leiden is known to produce better partitions and can be used by specifying the run function.

results = pgs.run(graph, method="leiden")

There are also additional post-processing and analysis functions, including:

  • Plotting via matplotlib and plotly (interactive).
  • Automated optimal scale selection.

Optimal scale selection [6] is performed by default with the run function but can be repeated with different parameters if needed, see pygenstability/optimal_scales.py. To reduce noise, e.g., one can increase the parameter values for block_size and window_size. The optimal network partitions can then be plotted given a NetworkX nx_graph.

results = pgs.identify_optimal_scales(results, block_size=10, window_size=5)
pgs.plot_optimal_partitions(nx_graph, results)

Constructors

We provide an object-oriented module for constructing quality matrices and null models in pygenstability/constructors.py. Various constructors are implemented for different types of graphs:

  • linearized based on linearized MS for large undirected weighted graphs [2]
  • continuous_combinatorial based on combinatorial Laplacian for undirected weighted graphs [2]
  • continuous_normalized based on random-walk normalized Laplacians for undirected weighted graphs [2]
  • signed_modularity based on signed modularity for large signed graphs [8]
  • signed_combinatorial based on signed combinatorial Laplacian for signed graphs [3]
  • directed based on random-walk Laplacian with teleportation for directed weighted graphs [2]
  • linearized_directed based on random-walk Laplacian with teleportation for large directed weighted graphs

For the computationally efficient analysis of large graphs, we recommend using the linearized, linearized_directed or signed_modularity constructors instead of continuous_combinatorial, continuous_normalized, directed or signed_combinatorial that rely on the computation of matrix exponentials.

For those of you that wish to implement their own constructor, you will need to design a function with the following properties:

  • take a scipy.csgraph graph and a float time as argument
  • return a quality_matrix (sparse scipy matrix) and a null_model (multiples of two, in a numpy array)

Graph-based data clustering

PyGenStability can also be used to perform multiscale graph-based data clustering on data that comes in the form of a sample-by-feature matrix. This approach was shown to achieve better performance than other popular clustering methods without the need of setting the number of clusters externally [9].

We provide an easy-to-use interface in our pygenstability.data_clustering.py module. Given a sample-by-feature matrix X, one can apply graph-based data clustering as follows:

clustering = pgs.DataClustering(
    graph_method="cknn",
    k=5,
    constructor="continuous_normalized")

# apply graph-based data clustering to X
results = clustering.fit(X)

# identify optimal scales and plot scan
clustering.scale_selection(kernel_size=0.2)
clustering.plot_scan()

We currently support $k$-Nearest Neighbor (kNN) and Continuous $k$-Nearest Neighbor (CkNN) [10] graph constructions (specified by graph_method) and k refers to the number of neighbours considered in the construction. See documentation for a list of all parameters. All functionalities of PyGenStability including plotting and scale selection are also available for data clustering. For example, given two-dimensional coordinates of the data points one can plot the optimal partitions directly:

# plot robust partitions
clustering.plot_robust_partitions(x_coord=x_coord,y_coord=y_coord)

Contributors

  • Alexis Arnaudon, GitHub: arnaudon <https://github.com/arnaudon>
  • Robert Peach, GitHub: peach-lucien <https://github.com/peach-lucien>
  • Dominik Schindler, GitHub: d-schindler <https://github.com/d-schindler>

We always look out for individuals that are interested in contributing to this open-source project. Even if you are just using PyGenStability and made some minor updates, we would be interested in your input.

Cite

Please cite our paper if you use this code in your own work:

@article{pygenstability,
  author = {Arnaudon, Alexis and Schindler, Dominik J. and Peach, Robert L. and Gosztolai, Adam and Hodges, Maxwell and Schaub, Michael T. and Barahona, Mauricio},  
  title = {PyGenStability: Multiscale community detection with generalized Markov Stability},
  publisher = {arXiv},
  year = {2023},
  doi = {10.48550/ARXIV.2303.05385},
  url = {https://arxiv.org/abs/2303.05385}
}

The original paper for Markov Stability can also be cited as:

@article{delvenne2010stability,
  title={Stability of graph communities across time scales},
  author={Delvenne, J-C and Yaliraki, Sophia N and Barahona, Mauricio},
  journal={Proceedings of the National Academy of Sciences},
  volume={107},
  number={29},
  pages={12755--12760},
  year={2010},
  publisher={National Acad Sciences}
}

Run example

In the example folder, a demo script with a stochastic block model can be tried with

python simple_example.py

or using the click app:

./run_simple_example.sh

Other examples can be found as jupyter-notebooks in the examples/ directory, including:

  • Example 1: Undirected SBM
  • Example 2: Multiscale SBM
  • Example 3: Directed networks
  • Example 4: Custom constructors
  • Example 5: Hypergraphs
  • Example 6: Signed networks
  • Example 7: Graph-based data clustering

Finally, we provide applications to real-world networks in the examples/real_examples/ directory, including:

  • Power grid network
  • Protein structures

Our other available packages

If you are interested in trying our other packages, see the below list:

  • GDR : Graph diffusion reclassification. A methodology for node classification using graph semi-supervised learning.
  • hcga : Highly comparative graph analysis. A graph analysis toolbox that performs massive feature extraction from a set of graphs, and applies supervised classification methods.
  • MSC : MultiScale Centrality: A scale-dependent metric of node centrality.
  • DynGDim : Dynamic Graph Dimension: Computing the relative, local and global dimension of complex networks.
  • RMST : Relaxed Minimum Spanning Tree: Computing the relaxed minimum spanning tree to sparsify networks whilst retaining dynamic structure.
  • StEP : Spatial-temporal Epidemiological Proximity: Characterising contact in disease outbreaks via a network model of spatial-temporal proximity.

References

[1] J.-C. Delvenne, S. N. Yaliraki, and M. Barahona, 'Stability of graph communities across time scales', Proceedings of the National Academy of Sciences, vol. 107, no. 29, pp. 12755–12760, Jul. 2010, doi: 10.1073/pnas.0903215107.

[2] R. Lambiotte, J.-C. Delvenne, and M. Barahona, 'Random Walks, Markov Processes and the Multiscale Modular Organization of Complex Networks', IEEE Trans. Netw. Sci. Eng., vol. 1, no. 2, pp. 76–90, Jul. 2014, doi: 10.1109/TNSE.2015.2391998.

[3] M. T. Schaub, J.-C. Delvenne, R. Lambiotte, and M. Barahona, 'Multiscale dynamical embeddings of complex networks', Phys. Rev. E, vol. 99, no. 6, Jun. 2019, doi: 10.1103/PhysRevE.99.062308.

[4] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, 'Fast unfolding of communities in large networks', J. Stat. Mech., vol. 2008, no. 10, Oct. 2008, doi: 10.1088/1742-5468/2008/10/p10008.

[5] V. A. Traag, L. Waltman, and N. J. van Eck, 'From Louvain to Leiden: guaranteeing well-connected communities', Sci Rep, vol. 9, no. 1, p. 5233, Mar. 2019, doi: 10.1038/s41598-019-41695-z.

[6] D. J. Schindler, J. Clarke, and M. Barahona, 'Multiscale Mobility Patterns and the Restriction of Human Movement', Royal Society Open Science, vol. 10, no. 10, p. 230405, Oct. 2023, doi: 10.1098/rsos.230405.

[7] A. Arnaudon, D. J. Schindler, R. L. Peach, A. Gosztolai, M. Hodges, M. T. Schaub, and M. Barahona, 'PyGenStability: Multiscale community detection with generalized Markov Stability', arXiv pre-print, Mar. 2023, doi: 10.48550/arXiv.2303.05385.

[8] S. Gomez, P. Jensen, and A. Arenas, 'Analysis of community structure in networks of correlated data'. Physical Review E, vol. 80, no. 1, p. 016114, Jul. 2009, doi: 10.1103/PhysRevE.80.016114.

[9] Z. Liu and M. Barahona, 'Graph-based data clustering via multiscale community detection', Applied Network Science, vol. 5, no. 1, p. 3, Dec. 2020, doi: 10.1007/s41109-019-0248-7.

[10] T. Berry and T. Sauer, 'Consistent manifold representation for topological data analysis', Foundations of Data Science, vol. 1, no. 1, p. 1-38, Feb. 2019, doi: 10.3934/fods.2019001.

Licence

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

pygenstability's People

Contributors

agosztolai avatar arnaudon avatar cellassembly avatar d-schindler avatar mauriciobarahona avatar michaelschaub avatar peach-lucien avatar zj-liu avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

pygenstability's Issues

Allow to bypass louvain install

On windows, it may be hard to install do to pthread.h issue, so maybe we can allow to not instal louvain, and rather use leiden.

AttributeError: no attribute 'run_louvain' upon installation

How to reproduce:

  1. Create and activate a new conda (v4.8.3) environment with Python (v3.8.3) installation.
  2. Run $ conda install pybind11 pyparsing six
  3. Clone the PyGenStability repository
  4. Run $ pip install . while in the cloned repository
  5. Copy the contents of the subfolder examples to the top level directory
  6. Run $ python simple_example.py
Traceback (most recent call last):
  File "simple_example.py", line 25, in <module>
    simple_test()
  File "simple_example.py", line 14, in simple_test
    all_results = run(graph)
  File "/home/benedikt/pygenstability/PyGenStability/pygenstability/pygenstability.py", line 119, in run
    louvain_results = run_several_louvains(
  File "/home/benedikt/pygenstability/PyGenStability/pygenstability/pygenstability.py", line 251, in run_several_louvains
    list(pool.imap_unordered(worker, range(n_runs), chunksize=chunksize))
  File "/home/benedikt/anaconda3/envs/pygen_env/lib/python3.8/multiprocessing/pool.py", line 448, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/home/benedikt/anaconda3/envs/pygen_env/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
AttributeError: module 'pygenstability.generalized_louvain' has no attribute 'run_louvain'

add tests

the code is not tested, some tests would be nice

Memory issue for large directed graphs

Hello,
I'am trying to use the package on a big network (around 170000 nodes).
However, when I call the contractor:

defining the constructor externally

directed_constructor = constructors.load_constructor('directed', adjacency, alpha=0.85)

I get the following error:
Unable to allocate 241. GiB for an array with shape (179783, 179783) and data type float6

Is there a way to avoid it?

Moreover, could i give as input a weighted adj matrix?

Thanks a lot

doc

Make a nice documentation with an associated website (using sphinx)

update cli

Make sure cli is up to date, and does not use pickle, but better format for large graphs

possible speed ups

Precompute the Laplacian and null model once at the start, not every time call of exponential

backend issue

In the line 208 of plotting.py:matplotlib.use("TkAgg")
Not everyone uses the tkagg.

Maybe before matplotlib.use("Agg"),
let b = matplotlib.get_backend()
then matplotlib.use(b), instead of matplotlib.use("TkAgg")

installing bug

in the process of install pygenstability, i have the following problems:
can you help me to solve it?
building 'pygenstability.generalized_louvain' extension
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Iextra -ID:\Install_software\Anaconda\envs\tensorflow\lib\site-packages\pybind11\include -ID:\Install_software\Anaconda\envs\tensorflow\include -ID:\Install_software\Anaconda\envs\tensorflow\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tppygenstability/generalized_louvain/generalized_louvain.cpp /Fobuild\temp.win-amd64-3.7\Release\pygenstability/generalized_louvain/generalized_louvain.obj /EHsc /bigobj -std=c++11
cl: 命令行 warning D9002 :忽略未知选项“-std=c++11”
generalized_louvain.cpp
extra\lemon/bits/lock.h(24): fatal error C1083: 无法打开包括文件: “pthread.h”: No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe' failed with exit status 2

ttprime

update the ttprime computation with tqdm + desactivating flag, etc...

modularity

The value for modularity/stability is behaving strangely, need to investigate that.

improve sankey diagram

the sankey diagram ploting function works but is rudimentary. Some work to improve it is needed

scale selection at margins of domain

The current scale selection criterion is not reliable at the margins of the domain of Markov scale t due to:

  1. Cut-off through windowing
  2. The diagonal count of low values in the NVI(t,t') necessarily leads to smaller values at the margins because the diagonal is smaller. Hence, 1-NVI(t,.) tends to produce anomalous local minima at the margins.

While 1) is perhaps acceptable, Mauricio and me discussed that we should come up with a solution for 2). I will think about this and try a couple of things.

plotting with plotly.py

Seems ok to use, and provides interactive plots so that we can click on the community structure for example.

pyyaml install

We are currently forcing pyyaml as a dependency, however, some people are experiencing a bug that PyYAML cannot be uninstalled (to then update) because its a distutils installed project.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.