pyt-team / topobenchmarkx

TopoBenchmarkX is a Python library designed to standardize benchmarking and accelerate research in Topological Deep Learning

License: MIT License

Python 64.37% Jupyter Notebook 10.72% Shell 24.78% Dockerfile 0.14%

topological-deep-learning cell-complex-neural-networks simplicial-neural-networks topological-neural-networks hypergraph-neural-networks topological-data-bechmark cell-complexes simplicial-complex topological-learning

topobenchmarkx's Introduction

A Comprehensive Benchmark Suite for Topological Deep Learning

Assess how your model compares against state-of-the-art topological neural networks.

Overview • Get Started • Tutorials • Neural Networks • Liftings • Datasets • References

📌 Overview

TopoBenchmarkX (TBX) is a modular Python library designed to standardize benchmarking and accelerate research in Topological Deep Learning (TDL). In particular, TBX makes it possible to train Topological Neural Networks (TNNs) and compare their performance across topological domains, where a topological domain is a graph, a simplicial complex, a cellular complex, or a hypergraph. For detailed information, please refer to the paper TopoBenchmarkX: A Framework for Benchmarking Topological Deep Learning.

The main pipeline trains and evaluates a wide range of state-of-the-art TNNs and Graph Neural Networks (GNNs) (see ⚙️ Neural Networks) on numerous and varied datasets and benchmark tasks (see 📚 Datasets).

Additionally, the library offers the ability to transform, i.e. lift, each dataset from one topological domain to another (see 🚀 Liftings), enabling for the first time an exhaustive inter-domain comparison of TNNs.

🧩 Get Started

Create Environment

If you do not have conda on your machine, please follow their guide to install it.

First, clone the TopoBenchmarkX repository, then create and activate a conda environment tbx with Python 3.11.3.

git clone git@github.com:pyt-team/topobenchmarkx.git
cd topobenchmarkx
conda create -n tbx python=3.11.3
conda activate tbx

Next, check the CUDA version of your machine:

/usr/local/cuda/bin/nvcc --version

and ensure that it matches the CUDA version specified in the env_setup.sh file (CUDA=cu121 by default). If it does not match, update env_setup.sh accordingly by changing both the CUDA and TORCH environment variables to compatible values as specified on this website.

Next, set up the environment with the following command.

source env_setup.sh

This command installs the TopoBenchmarkX library and its dependencies.

Run Training Pipeline

Next, train the neural networks by running the following command:

python -m topobenchmarkx

Because the pipeline is configured with Hydra, the default experiment configuration can be overridden from the command line. For instance, the model and dataset can be selected as follows:

python -m topobenchmarkx model=cell/cwn dataset=graph/MUTAG

Remark: By default, our pipeline identifies the source and destination topological domains, and applies a default lifting between them if required.

The same CLI override mechanism also applies to finer-grained settings within a config group. Please refer to the official Hydra documentation for further details.
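To illustrate how dotted `key=value` overrides map onto a nested configuration, here is a minimal pure-Python sketch. It is not Hydra's implementation: in Hydra, an override such as `model=cell/cwn` selects an option from a config group rather than storing a string. The helper `apply_overrides`, the default values, and the `optimizer.lr` key are hypothetical and only mimic the idea.

```python
from copy import deepcopy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with `a.b=value` style overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        keys = dotted_key.split(".")
        node = result
        # Walk (and create if needed) the intermediate dictionaries.
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return result


# Hypothetical defaults, chosen only for illustration.
defaults = {"model": "graph/gcn", "dataset": "graph/cora", "optimizer": {"lr": "0.01"}}
config = apply_overrides(
    defaults, ["model=cell/cwn", "dataset=graph/MUTAG", "optimizer.lr=0.001"]
)
print(config)
```

The defaults are left untouched, so the same base configuration can be reused for several runs.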

🚲 Experiments Reproducibility

To reproduce Table 1 from the TopoBenchmarkX: A Framework for Benchmarking Topological Deep Learning paper, please run the following command:

bash scripts/reproduce.sh

Remark: We have additionally provided a public W&B (Weights & Biases) project with logs for the corresponding runs (updated on June 11, 2024).

⚓ Tutorials

Explore our tutorials for further details on how to add new datasets, transforms/liftings, and benchmark tasks.

⚙️ Neural Networks

We list the neural networks trained and evaluated by TopoBenchmarkX, organized by the topological domain over which they operate: graph, simplicial complex, cellular complex or hypergraph. Many of these neural networks were originally implemented in TopoModelX.

Graphs

Model	Reference
GAT	Graph Attention Networks
GIN	How Powerful are Graph Neural Networks?
GCN	Semi-Supervised Classification with Graph Convolutional Networks

Simplicial complexes

Model	Reference
SAN	Simplicial Attention Neural Networks
SCCN	Efficient Representation Learning for Higher-Order Data with Simplicial Complexes
SCCNN	Convolutional Learning on Simplicial Complexes
SCN	Simplicial Complex Neural Networks

Cellular complexes

Model	Reference
CAN	Cell Attention Network
CCCN	Inspired by A learning algorithm for computational connected cellular network, implementation adapted from Generalized Simplicial Attention Neural Networks
CXN	Cell Complex Neural Networks
CWN	Weisfeiler and Lehman Go Cellular: CW Networks

Hypergraphs

Model	Reference
AllDeepSet	You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks
AllSetTransformer	You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks
EDGNN	Equivariant Hypergraph Diffusion Neural Operators
UniGNN	UniGNN: a Unified Framework for Graph and Hypergraph Neural Networks
UniGNN2	UniGNN: a Unified Framework for Graph and Hypergraph Neural Networks

🚀 Liftings

We list the liftings used in TopoBenchmarkX to transform datasets. Here, a lifting refers to a function that transforms a dataset defined on a topological domain (e.g., on a graph) into the same dataset but supported on a different topological domain (e.g., on a simplicial complex).

Topology Liftings

Graph2Simplicial

Name	Description	Reference
CliqueLifting	The algorithm finds the cliques in the graph and creates simplices. Given a clique the first simplex added is the one containing all the nodes of the clique, then the simplices composed of all the possible combinations with one node missing, then two nodes missing, and so on, until all the possible pairs are added. Then the method moves to the next clique.	Simplicial Complexes
KHopLifting	For each node in the graph, take the set of its neighbors, up to k distance, and the node itself. These sets are then treated as simplices. The dimension of each simplex depends on the degree of the nodes. For example, a node with d neighbors forms a d-simplex.	Neighborhood Complexes
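The clique lifting described in the table can be sketched in plain Python as follows. This is an illustrative sketch, not the library's code: it assumes maximal cliques are enumerated with a basic Bron-Kerbosch recursion and that the graph is given as a hypothetical adjacency dictionary. The function names are made up for this example.

```python
from itertools import combinations


def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            neighbors = adjacency[node]
            expand(current | {node}, candidates & neighbors, excluded & neighbors)
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


def clique_lifting(adjacency):
    """Return the simplices (as sorted node tuples) generated by all cliques."""
    simplices = {(node,) for node in adjacency}  # nodes are kept as 0-simplices
    for clique in maximal_cliques(adjacency):
        nodes = sorted(clique)
        # Full clique first, then subsets with one node missing, two missing,
        # and so on, down to pairs.
        for size in range(len(nodes), 1, -1):
            simplices.update(combinations(nodes, size))
    return simplices


# Toy graph: a triangle 0-1-2 with a pendant edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
complex_ = clique_lifting(graph)
print(sorted(complex_, key=lambda s: (len(s), s)))
```

On the toy graph, the triangle becomes a 2-simplex while the pendant edge stays a 1-simplex.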

Graph2Cell

Name	Description	Reference
CellCycleLifting	To lift a graph to a cell complex (CC) we proceed as follows. First, we identify a finite set of cycles (closed loops) within the graph. Second, each identified cycle in the graph is associated to a 2-cell, such that the boundary of the 2-cell is the cycle. The nodes and edges of the cell complex are inherited from the graph.	Appendix B
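The description above does not say how the finite set of cycles is chosen. As one possible choice, the sketch below uses the fundamental cycles of a BFS spanning forest, i.e. one cycle per non-tree edge. This is an assumption made for illustration; the function name and the edge-list input are hypothetical, and each undirected edge is expected to appear once.

```python
from collections import deque


def fundamental_cycles(edges):
    """Return one cycle (as a node list) per non-tree edge of a BFS spanning forest."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    parent = {}
    tree_edges = set()
    for root in adjacency:
        if root in parent:
            continue
        parent[root] = None
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in sorted(adjacency[u]):
                if v not in parent:
                    parent[v] = u
                    tree_edges.add(frozenset((u, v)))
                    queue.append(v)

    def path_to_root(node):
        path = [node]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        return path

    cycles = []
    for u, v in edges:
        if frozenset((u, v)) in tree_edges:
            continue
        path_u, path_v = path_to_root(u), path_to_root(v)
        # Drop the shared tail above the lowest common ancestor.
        while len(path_u) > 1 and len(path_v) > 1 and path_u[-2] == path_v[-2]:
            path_u.pop()
            path_v.pop()
        # u -> common ancestor, then back down to v; the edge (v, u) closes the loop.
        cycles.append(path_u + path_v[-2::-1])
    return cycles


# Toy graph: a triangle 0-1-2 with a pendant edge 2-3.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
two_cells = fundamental_cycles(edges)  # each cycle is the boundary of one 2-cell
print(two_cells)
```

The nodes and edges of the resulting cell complex are those of the graph; only the 2-cells are new.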

Graph2Hypergraph

Name	Description	Reference
KHopLifting	For each node in the graph, the algorithm finds the set of nodes that are at most k connections away from the initial node. This set is then used to create a hyperedge. The process is repeated for all nodes in the graph.	Section 3.4
KNearestNeighborsLifting	For each node in the graph, the method finds the k nearest nodes using the Euclidean distance between the feature vectors. The set of k nodes found is treated as a hyperedge. The process is repeated for all nodes in the graph.	Section 3.1
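As a rough illustration of the k-hop lifting to hypergraphs, the sketch below runs a breadth-first search from every node and stops expanding at distance k. It assumes the initial node belongs to its own hyperedge (it is zero connections away from itself). The function name and the adjacency-dictionary input are hypothetical.

```python
from collections import deque


def khop_hyperedges(adjacency, k):
    """Map each node to the hyperedge of nodes at most k hops away (itself included)."""
    hyperedges = {}
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if distance[u] == k:
                continue  # do not expand beyond k hops
            for v in adjacency[u]:
                if v not in distance:
                    distance[v] = distance[u] + 1
                    queue.append(v)
        hyperedges[source] = frozenset(distance)
    return hyperedges


# Toy path graph 0 - 1 - 2 - 3.
path_graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
one_hop = khop_hyperedges(path_graph, k=1)
two_hop = khop_hyperedges(path_graph, k=2)
print(one_hop)
print(two_hop)
```

Each node yields exactly one hyperedge, so the lifted hypergraph has as many hyperedges as the graph has nodes (some of which may coincide).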

Feature Liftings

Name	Description	Supported Domains
ProjectionSum	Projects r-cell features of a graph to r+1-cell structures utilizing incidence matrices (B_{r}).	Simplicial, Cell
ConcatenationLifting	Concatenate r-cell features to obtain r+1-cell features.	Simplicial
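A minimal sketch of the projection idea behind ProjectionSum: every (r+1)-cell receives the sum of the features of the r-cells on its boundary, read off the incidence matrix. Whether the library uses signed or unsigned incidence entries is not stated here, so this sketch assumes the sign is ignored. The function name and the list-of-lists data layout are hypothetical.

```python
def projection_sum(features, incidence):
    """Sum r-cell features onto (r+1)-cells.

    features[i] is the feature vector of r-cell i.
    incidence[i][j] is nonzero when r-cell i lies on the boundary of (r+1)-cell j.
    """
    n_lower, n_upper = len(incidence), len(incidence[0])
    dim = len(features[0])
    lifted = [[0.0] * dim for _ in range(n_upper)]
    for i in range(n_lower):
        for j in range(n_upper):
            if incidence[i][j] != 0:  # assumption: orientation signs are ignored
                for d in range(dim):
                    lifted[j][d] += features[i][d]
    return lifted


# Triangle with nodes 0, 1, 2 and edges (0,1), (0,2), (1,2).
node_features = [[1.0], [2.0], [4.0]]
node_edge_incidence = [
    [-1, -1, 0],  # node 0 touches edges (0,1) and (0,2)
    [1, 0, -1],   # node 1 touches edges (0,1) and (1,2)
    [0, 1, 1],    # node 2 touches edges (0,2) and (1,2)
]
edge_features = projection_sum(node_features, node_edge_incidence)
print(edge_features)
```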

📚 Datasets

Dataset	Task	Description	Reference
Cora	Classification	Cocitation dataset.	Source
Citeseer	Classification	Cocitation dataset.	Source
Pubmed	Classification	Cocitation dataset.	Source
MUTAG	Classification	Graph-level classification.	Source
PROTEINS	Classification	Graph-level classification.	Source
NCI1	Classification	Graph-level classification.	Source
NCI109	Classification	Graph-level classification.	Source
IMDB-BIN	Classification	Graph-level classification.	Source
IMDB-MUL	Classification	Graph-level classification.	Source
REDDIT	Classification	Graph-level classification.	Source
Amazon	Classification	Heterophilic dataset.	Source
Minesweeper	Classification	Heterophilic dataset.	Source
Empire	Classification	Heterophilic dataset.	Source
Tolokers	Classification	Heterophilic dataset.	Source
US-county-demos	Regression	In turn each node attribute is used as the target label.	Source
ZINC	Regression	Graph-level regression.	Source

🛠️ Development

To join the development of TopoBenchmarkX, you should install the library in dev mode.

For this, you can create an environment using either conda or docker. Both options are detailed below.

🐍 Using Conda Environment

Follow the steps in 🧩 Get Started.

🐳 Using Docker

For ease of use, TopoBenchmarkX also supports Docker. To set Docker up on your system, you can follow their guide. Once it is installed, follow the steps below:

First, clone the repository and navigate to the correct folder.

git clone git@github.com:pyt-team/topobenchmarkx.git
cd topobenchmarkx

Then, build the Docker image.

docker build -t topobenchmarkx:new .

Depending on whether you want to use GPUs, run the Docker image and mount the current directory with one of the following commands.

With GPUs

docker run -it -d --gpus all --volume $(pwd):/TopoBenchmarkX topobenchmarkx:new

With CPU

docker run -it -d --volume $(pwd):/TopoBenchmarkX topobenchmarkx:new

Happy development!

🔍 References

To learn more about TopoBenchmarkX, we invite you to read the paper:

@misc{topobenchmarkx2024,
      title={TopoBenchmarkX},
      author={PyT-Team},
      year={2024},
      eprint={TBD},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

If you find TopoBenchmarkX useful, we would appreciate it if you cited us!

🐭 Additional Details

Hierarchy of configuration files

├── configs                   <- Hydra configs
│   ├── callbacks                <- Callbacks configs
│   ├── dataset                  <- Dataset configs
│   │   ├── graph                    <- Graph dataset configs
│   │   ├── hypergraph               <- Hypergraph dataset configs
│   │   └── simplicial               <- Simplicial dataset configs
│   ├── debug                    <- Debugging configs
│   ├── evaluator                <- Evaluator configs
│   ├── experiment               <- Experiment configs
│   ├── extras                   <- Extra utilities configs
│   ├── hparams_search           <- Hyperparameter search configs
│   ├── hydra                    <- Hydra configs
│   ├── local                    <- Local configs
│   ├── logger                   <- Logger configs
│   ├── loss                     <- Loss function configs
│   ├── model                    <- Model configs
│   │   ├── cell                     <- Cell model configs
│   │   ├── graph                    <- Graph model configs
│   │   ├── hypergraph               <- Hypergraph model configs
│   │   └── simplicial               <- Simplicial model configs
│   ├── optimizer                <- Optimizer configs
│   ├── paths                    <- Project paths configs
│   ├── scheduler                <- Scheduler configs
│   ├── trainer                  <- Trainer configs
│   ├── transforms               <- Data transformation configs
│   │   ├── data_manipulations       <- Data manipulation transforms
│   │   ├── dataset_defaults         <- Default dataset transforms
│   │   ├── feature_liftings         <- Feature lifting transforms
│   │   └── liftings                 <- Lifting transforms
│   │       ├── graph2cell               <- Graph to cell lifting transforms
│   │       ├── graph2hypergraph         <- Graph to hypergraph lifting transforms
│   │       ├── graph2simplicial         <- Graph to simplicial lifting transforms
│   │       ├── graph2cell_default.yaml  <- Default graph to cell lifting config
│   │       ├── graph2hypergraph_default.yaml <- Default graph to hypergraph lifting config
│   │       ├── graph2simplicial_default.yaml <- Default graph to simplicial lifting config
│   │       ├── no_lifting.yaml           <- No lifting config
│   │       ├── custom_example.yaml       <- Custom example transform config
│   │       └── no_transform.yaml         <- No transform config
│   ├── wandb_sweep              <- Weights & Biases sweep configs
│   │
│   ├── __init__.py              <- Init file for configs module
│   └── run.yaml               <- Main config for training

More information regarding Topological Deep Learning

Topological Graph Signal Compression

Architectures of Topological Deep Learning: A Survey on Topological Neural Networks

TopoX: a suite of Python packages for machine learning on topological domains

topobenchmarkx's Issues

Ruff issue with docstrings

Some conflicts in the installation appear when running the tests

Essential issue training for topological models

When the best score is achieved, something eats all the CPU memory; hence, at some point, the process has to be stopped. Look closely at what happens when the best score is achieved. It happens only with Topological models.

Random guess: maybe PyTorch Lightning does something special when saving the model/data after the best loss is achieved.

Normalization problems

Some TopoModelX models do not normalize their outputs, which makes it impossible to stack multiple layers. For example, we currently use an external implementation of CWN, which can be found in the custom/ folder.

The same also happens with other models in the topomodelx library.

PERF203 `try`-`except` within a loop incurs performance overhead

1: topobenchmarkx/io/load/utils.py
2: topobenchmarkx/io/load/utils.py

Temporarily added "PERF203" to the Ruff ignores.

Documentation

I have found that all the wrappers in the doc have exactly the same row, see below

Please update it with the appropriate documentation

Issue with passing torch module in config

While using Hydra config instantiation, I found that it is not trivial to pass torch modules as arguments, although I guess it is possible somehow.

Example:

backbone:
  _target_: topomodelx.nn.hypergraph.allset.AllSet
  in_channels: ${data.num_features}
  hidden_channels: 64
  mlp_activation_layer: torch.nn.ReLU

How to make it possible to pass torch.nn.ReLU? Is it possible?
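One possible approach, sketched here under assumptions rather than taken from the project, is to keep the value as a dotted string in the config and resolve it to a class at instantiation time. The helper `resolve` below is hypothetical and uses only the standard library. The demo resolves a standard-library class so that it runs without torch; `"torch.nn.ReLU"` would be resolved the same way when torch is installed. Hydra's `hydra.utils.get_class` appears to serve a similar purpose, though that is not verified here.

```python
import importlib


def resolve(dotted_path: str):
    """Import `package.module.Name` and return the object called `Name`."""
    module_name, _, attribute = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Stand-in for "torch.nn.ReLU" so the example has no third-party dependency.
activation_cls = resolve("collections.OrderedDict")
print(activation_cls)
```

The backbone would then receive the class itself and can instantiate it wherever an activation layer is needed.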

Unify general transform

The classes Graph2SimplicialLifting, Graph2HypergraphLifting, and Graph2CellLifting share most of their interface. Unifying them through a class Graph2Domain would define the general logic for all of the domains.

I have partially implemented this logic for Graph2SimplicialLifting

Evaluator

General Evaluator

Add more metrics

It is essential to add a broader set of metrics before starting any experiments. Which metrics should be added for the NeurIPS experiments?

Metrics:

Classification task:
Accuracy
F1
...

Regression task:
MSE
MAE
...
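For reference, the four metrics named above can be written in a few lines of plain Python. This is only a sketch of the definitions, with a binary F1 and hypothetical function names. It is not the evaluator used by the library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """F1 score for the `positive` class: 2TP / (2TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


labels, predictions = [1, 0, 1, 1], [1, 0, 0, 1]
targets, estimates = [1.0, 2.0, 3.0], [2.0, 2.0, 5.0]
print(accuracy(labels, predictions), f1_binary(labels, predictions))
print(mse(targets, estimates), mae(targets, estimates))
```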

General Data class and additional classes for each domain

Define BaseData class logic and classes for every corresponding domain.

Memory issue with AllSetTransformer

It seems that there is a memory leak somewhere, because AllSetTransformer with 2 layers cannot be allocated.

backbone:
  _target_: topomodelx.nn.hypergraph.allset_transformer.AllSetTransformer
  in_channels: ${data.num_features}
  hidden_channels: 256
  n_layers: 1
  heads: 4
  dropout: 0.2
  mlp_num_layers: 1
  mlp_dropout: 0.2

Make the transforms run in parallel

For some datasets it is too slow

General ReadOut class

Second source dataset different from torch_geometric

We need to show that our lib can easily process datasets that come from any source; till now, we have everything from torch_geometric.

Cleaning the repos

Maybe we need to tidy up io/load

Class CWNLayer is redefined in custom_models/cell/cin.py

The class is first imported: from topomodelx.nn.cell.cwn_layer import CWNLayer but, then, it is redefined in the same file

class CWNLayer(nn.Module):
    r"""Layer of a CW Network (CWN).

Make the ReadOut general

Make the abstract ReadOut class, and create a set of different ReadOut classes that inherit from the general ReadOut class.

Node regression task

Till now we had graph regression. We need to make sure / update the pipeline to be able to do node regression

Config class

Create a config logic:

Load config
Validate config
Save config
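The three steps listed above could look roughly like the sketch below. It is a hypothetical design, not the project's: the `RunConfig` class and its fields are invented for illustration, and JSON is used instead of the project's Hydra/YAML files only to keep the example within the standard library.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class RunConfig:
    """Hypothetical run configuration; field names are illustrative only."""

    model: str
    dataset: str
    learning_rate: float = 0.01

    def validate(self) -> None:
        if not self.model or not self.dataset:
            raise ValueError("model and dataset must be non-empty")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")

    @classmethod
    def load(cls, path):
        """Load a config from a JSON file and validate it."""
        config = cls(**json.loads(Path(path).read_text()))
        config.validate()
        return config

    def save(self, path) -> None:
        """Validate the config, then write it as JSON."""
        self.validate()
        Path(path).write_text(json.dumps(asdict(self), indent=2))


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "run.json"
    RunConfig(model="cell/cwn", dataset="graph/MUTAG").save(config_path)
    restored = RunConfig.load(config_path)
print(restored)
```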

General utils

Make general utils similar to pyg

Tutorials

Define which tutorial to do and make them

Update docker file

Fix the issues with the docker file:

Very long launch time (40 min)
Jupyter always has to be installed
Two different Python paths exist

Some tricks might be taken from Graphworld repo

issue with SAN

SAN layer has to be mapped to torch.nn.ModuleList

Add to EqualGausFeatures an option to generate a unique random vector for each node. So far, the feature vector is the same for every node.

Reproducibility of experiments

Make sure that two consecutive runs with the same configuration will produce the same train/val/test outputs

General Loss class

How to do a sequence of transforms through CLI starting from connectivity modification?

How to make it through CLI

TopoModelX argument issue

It is not possible to change mlp_activation argument in AllSetTransformerLayer while initializing AllSetTransformer

Impossibility to do mini-batching with pytorch lightning

torch_geometric.data.lightning.LightningDataset does not work with Data objects that have sparse representations.
TopoModelX works only with sparse matrix multiplications.

Do you have any ideas on how to allow mini-batching?

Make sure metrics are collected correctly

General Network class

Test

Add more tests for the library

Collate function running index

I suppose there was a bug on lines 84 and 94: the offset for cell_i was always current_number_of_nodes rather than current_number_of_cells. Double-check that it should be current_number_of_cells.
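To make the intended behaviour concrete, here is a small sketch of a collate step that keeps two running offsets, one for nodes and one for cells. The data layout (a dictionary with `num_nodes`, `num_cells`, and `incidence` pairs) is hypothetical and is not the library's collate function. Only the variable names `current_number_of_nodes` and `current_number_of_cells` come from the issue.

```python
def collate_incidence(samples):
    """Concatenate (node, cell) incidence pairs from several samples into one batch."""
    batched = []
    current_number_of_nodes = 0
    current_number_of_cells = 0
    for sample in samples:
        for node_i, cell_i in sample["incidence"]:
            # Node indices shift by the node count, cell indices by the cell count.
            batched.append(
                (node_i + current_number_of_nodes, cell_i + current_number_of_cells)
            )
        current_number_of_nodes += sample["num_nodes"]
        current_number_of_cells += sample["num_cells"]
    return batched, current_number_of_nodes, current_number_of_cells


sample_a = {"num_nodes": 3, "num_cells": 1, "incidence": [(0, 0), (1, 0), (2, 0)]}
sample_b = {"num_nodes": 4, "num_cells": 2, "incidence": [(0, 0), (1, 0), (2, 1), (3, 1)]}
pairs, total_nodes, total_cells = collate_incidence([sample_a, sample_b])
print(pairs, total_nodes, total_cells)
```

If the cell index were shifted by the node count instead, the cells of the second sample would be numbered 3 and 4 rather than 1 and 2, pointing past the three cells that actually exist in the batch.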

Make the NodeDegrees class compute the in-degree and the actual degree. Currently, it computes the out-degree.
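The difference between the three quantities can be shown on a directed edge list. This sketch is independent of the NodeDegrees class, and it assumes that "actual degree" means the in-degree plus the out-degree. The function name is hypothetical.

```python
from collections import Counter


def node_degrees(edges, num_nodes):
    """Return (in_degree, out_degree, total_degree) for every node of a directed edge list."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return [
        (in_degree[node], out_degree[node], in_degree[node] + out_degree[node])
        for node in range(num_nodes)
    ]


# Directed edges 0 -> 1, 0 -> 2, 1 -> 2.
degrees = node_degrees([(0, 1), (0, 2), (1, 2)], num_nodes=3)
print(degrees)
```

Here every node has total degree 2, while the out-degree alone would report 2, 1, and 0.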

Dataset domain statistics collector

Make a class that infers statistics from the initial and lifted datasets.
Collect the statistics and add them to the table in Overleaf.

Slight inconsistency with modules

We must resolve the module's inconsistency; all modules must follow the same logic.

There are two options:

1. We define an abstract class and call its methods, then inherit from it and write the corresponding forward. With this approach, we always need to specify the name of the module being executed, and the module name, along with its import, must be added to the file containing the abstract class. (An example of this implementation is the readout module.)
2. Everyone defines separate classes and imports them through the path in the config file. (An example is the feature encoder.)

I believe we should go with strategy 1. It allows the definition of a stable pipeline while adding modularity by adding the classes with forward. The problem is that it is unclear if this approach is best suited for some modules like Evaluator.

Additional datasets

There are 3 additional datasets I have found:

1) https://www.cs.cornell.edu/~arb/data/US-county-demos/
Note: I have already created the loading pipeline for this one.
The cool part of this dataset is that it is a node regression task with 6 variables in total. In this paper, they used a couple of them as targets. Hence, we can sell one dataset as 1-6.

Datasets 2) and 3) can mostly be loaded with the same pipeline. Hence, we can also add them, but I didn't go into details.
2) https://www.cs.cornell.edu/~arb/data/CDC-climate/
3) https://www.cs.cornell.edu/~arb/data/US-county-fb/
