jajupmochi / graphkit-learn Goto Github PK

A python package for graph kernels, graph edit distances, and graph pre-image problem.

Home Page: https://graphkit-learn.readthedocs.io

License: GNU General Public License v3.0

Jupyter Notebook 95.84% Python 3.80% Shell 0.01% C++ 0.16% Cython 0.19%

graph-kernels walks paths kernel-methods graph-representations pattern-recognition machine-learning chemoinformatics pre-image graph-edit-distance

graphkit-learn's Introduction

graphkit-learn

A Python package for graph kernels, graph edit distances and graph pre-image problem.

Requirements

python>=3.6
numpy>=1.16.2
scipy>=1.1.0
matplotlib>=3.1.0
networkx>=2.2
scikit-learn>=0.20.0
tabulate>=0.8.2
tqdm>=4.26.0
control>=0.8.2 (for generalized random walk kernels only)
slycot>=0.3.3 (for generalized random walk kernels only, which requires a fortran compiler (e.g., gfortran) and BLAS/LAPACK (e.g. liblapack-dev))
Cython~=0.29.33 (for GEDLIB only)

How to use?

Install the library

Install stable version from PyPI (may not be up-to-date):

$ pip install graphkit-learn

Install latest version from GitHub:

$ git clone https://github.com/jajupmochi/graphkit-learn.git
$ cd graphkit-learn/
$ python setup.py install

Run the test

A series of tests can be run to check if the library works correctly:

$ pip install -U pip pytest codecov coverage pytest-cov
$ pytest -v --cov-config=.coveragerc --cov-report term --cov=gklearn gklearn/tests/

Check examples

A series of demos of using the library can be found on Google Colab and in the example folder.

Other demos

Check notebooks directory for more demos:

notebooks directory includes test codes of graph kernels based on linear patterns;
notebooks/tests directory includes codes that test some libraries and functions;
notebooks/utils directory includes some useful tools, such as a Gram matrix checker and a function to get properties of datasets;
notebooks/else directory includes other codes that we used for experiments.

Documentation

The docs of the library can be found here.

Main contents

1 List of graph kernels

Based on walks
- The common walk kernel [1]
  - Exponential
  - Geometric
- The marginalized kernel
  - With tottering [2]
  - Without tottering [7]
- The generalized random walk kernel [3]
  - Sylvester equation
  - Conjugate gradient
  - Fixed-point iterations
  - Spectral decomposition
Based on paths
- The shortest path kernel [4]
- The structural shortest path kernel [5]
- The path kernel up to length h [6]
  - The Tanimoto kernel
  - The MinMax kernel
Non-linear kernels
- The treelet kernel [10]
- Weisfeiler-Lehman kernel [11]
  - Subtree

A demo of computing graph kernels can be found on Google Colab and in the examples folder.

2 Graph Edit Distances

3 Graph preimage methods

A demo of generating graph preimages can be found on Google Colab and in the examples folder.

4 Interface to `GEDLIB`

GEDLIB is an easily extensible C++ library for (suboptimally) computing the graph edit distance between attributed graphs. A Python interface for GEDLIB is integrated in this library, based on gedlibpy library.

5 Computation optimization methods

Python’s multiprocessing.Pool module is applied to perform parallelization on the computations of all kernels as well as the model selection.
The Fast Computation of Shortest Path Kernel (FCSP) method [8] is implemented in the random walk kernel, the shortest path kernel, as well as the structural shortest path kernel where FCSP is applied on both vertex and edge kernels.
The trie data structure [9] is employed in the path kernel up to length h to store paths in graphs.

Issues

This library uses multiprocessing.Pool.imap_unordered function to do the parallelization, which may not be able to run correctly under Windows system. For now, Windows users may need to comment the parallel codes and uncomment the codes below them which run serially. We will consider adding a parameter to control serial or parallel computations as needed.
Some modules (such as Numpy, Scipy, sklearn) apply OpenBLAS to perform parallel computation by default, which causes conflicts with other parallelization modules such as multiprossing.Pool, highly increasing the computing time. By setting its thread to 1, OpenBLAS is forced to use a single thread/CPU, thus avoids the conflicts. For now, this procedure has to be done manually. Under Linux, type this command in terminal before running the code:

$ export OPENBLAS_NUM_THREADS=1

Or add export OPENBLAS_NUM_THREADS=1 at the end of your ~/.bashrc file, then run

$ source ~/.bashrc

to make this effective permanently.

Results

Check this paper for detailed description of graph kernels and experimental results:

Linlin Jia, Benoit Gaüzère, and Paul Honeine. Graph Kernels Based on Linear Patterns: Theoretical and Experimental Comparisons. working paper or preprint, March 2019. URL https://hal-normandie-univ.archives-ouvertes.fr/hal-02053946.

A comparison of performances of graph kernels on benchmark datasets can be found here.

How to contribute

Fork the library and open a pull request! Make your own contribute to the community!

Authors

Linlin Jia, LITIS, INSA Rouen Normandie
Benoit Gaüzère, LITIS, INSA Rouen Normandie
Paul Honeine, LITIS, Université de Rouen Normandie

Citation

If you have used graphkit-learn in your publication, please cite the the following paper:

@article{JIA2021,
	title = "graphkit-learn: A Python Library for Graph Kernels Based on Linear Patterns",
	journal = "Pattern Recognition Letters",
	year = "2021",
	issn = "0167-8655",
	doi = "https://doi.org/10.1016/j.patrec.2021.01.003",
	url = "http://www.sciencedirect.com/science/article/pii/S0167865521000131",
	author = "Linlin Jia and Benoit Gaüzère and Paul Honeine",
	keywords = "Graph Kernels, Linear Patterns, Python Implementation",
	abstract = "This paper presents graphkit-learn, the first Python library for efficient computation of graph kernels based on linear patterns, able to address various types of graphs. Graph kernels based on linear patterns are thoroughly implemented, each with specific computing methods, as well as two well-known graph kernels based on non-linear patterns for comparative analysis. Since computational complexity is an Achilles’ heel of graph kernels, we provide several strategies to address this critical issue, including parallelization, the trie data structure, and the FCSP method that we extend to other kernels and edge comparison. All proposed strategies save orders of magnitudes of computing time and memory usage. Moreover, all the graph kernels can be simply computed with a single Python statement, thus are appealing to researchers and practitioners. For the convenience of use, an advanced model selection procedure is provided for both regression and classification problems. Experiments on synthesized datasets and 11 real-world benchmark datasets show the relevance of the proposed library."
}

Acknowledgments

This research was supported by CSC (China Scholarship Council) and the French national research agency (ANR) under the grant APi (ANR-18-CE23-0014). The authors would like to thank the CRIANN (Le Centre Régional Informatique et d’Applications Numériques de Normandie) for providing computational resources.

References

[1] Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. Learning Theory and Kernel Machines, pages 129–143, 2003.

[2] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning, Washington, DC, United States, 2003.

[3] Vishwanathan, S.V.N., Schraudolph, N.N., Kondor, R., Borgwardt, K.M., 2010. Graph kernels. Journal of Machine Learning Research 11, 1201–1242.

[4] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In Proceedings of the International Conference on Data Mining, pages 74-81, 2005.

[5] Liva Ralaivola, Sanjay J Swamidass, Hiroto Saigo, and Pierre Baldi. Graph kernels for chemical informatics. Neural networks, 18(8):1093–1110, 2005.

[6] Suard F, Rakotomamonjy A, Bensrhair A. Kernel on Bag of Paths For Measuring Similarity of Shapes. InESANN 2007 Apr 25 (pp. 355-360).

[7] Mahé, P., Ueda, N., Akutsu, T., Perret, J.L., Vert, J.P., 2004. Extensions of marginalized graph kernels, in: Proc. the twenty-first international conference on Machine learning, ACM. p. 70.

[8] Lifan Xu, Wei Wang, M Alvarez, John Cavazos, and Dongping Zhang. Parallelization of shortest path graph kernels on multi-core cpus and gpus. Proceedings of the Programmability Issues for Heterogeneous Multicores (MultiProg), Vienna, Austria, 2014.

[9] Edward Fredkin. Trie memory. Communications of the ACM, 3(9):490–499, 1960.

[10] Gaüzere, B., Brun, L., Villemin, D., 2012. Two new graphs kernels in chemoinformatics. Pattern Recognition Letters 33, 2038–2047.

[11] Shervashidze, N., Schweitzer, P., Leeuwen, E.J.v., Mehlhorn, K., Borgwardt, K.M., 2011. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12, 2539–2561.

graphkit-learn's People

Contributors

Stargazers

Watchers

Forkers

lyf14020510036 mgalusza gear vishalbelsare faraazrm qianlinjun ceciliazzzzzz shuowang-ai lsz19960814 standardgalactic linhongxiang gitter-badger kazimierz-256 jiawozhong mhpthesis zsigacsaba rnaimehaom tamal-mondal bgauzere

graphkit-learn's Issues

Request for the atom types labels of NCI1 datasets

I have searched almost all the literature, but have not found the atomic type corresponding to the node label of the NCI1 dataset. We know that the NCI1 dataset is made up of molecule graphs composed of atoms in 37 categories. just like the molcules in MUTAG dataset are composed of atoms in 7 categories: {0:'C', 1:'N', 2:'O', 3:'F', 4:'I', 5:'Cl', 6:'Br'}. Sincerely ask everyone to reply !!!

In ./notebooks/utils/plot_all_graphs.py, it plots graphs in MUTAG dataset with node(atom) labels like this:

# line [19 - 40]
    dataset, y = loadDataset("../../datasets/MUTAG/MUTAG_A.txt")
    for idx in [6]: #[65]:#
        G = dataset[idx]
        ncolors= []
        for node in G.nodes:
            if G.nodes[node]['atom'] == '0':
                G.nodes[node]['atom'] = 'C'
                ncolors.append('#bd3182')
            elif G.nodes[node]['atom'] == '1':
                G.nodes[node]['atom'] = 'N'
                ncolors.append('#3182bd')
            elif G.nodes[node]['atom'] == '2':
                G.nodes[node]['atom'] = 'O'
                ncolors.append('#82bd31')
            elif G.nodes[node]['atom'] == '3':
                G.nodes[node]['atom'] = 'F'
            elif G.nodes[node]['atom'] == '4':
                G.nodes[node]['atom'] = 'I'
            elif G.nodes[node]['atom'] == '5':
                G.nodes[node]['atom'] = 'Cl'
            elif G.nodes[node]['atom'] == '6':
                G.nodes[node]['atom'] = 'Br'

function generate_median_preimages_by_class() does not work correctly sometimes

When using the function generate_median_preimages_by_class(), the results after the first class are not correct when I run the code in Ubuntu terminal using # python3 ....

The answer here says it has something to do with the Cython and conda. I do have conda installed. So I use a fresh virtual environment without conda and the results seems correct, but I still do not know exactly why.

Reproducing code example:

Here is the test.py:

import multiprocessing
import functools
from gklearn.utils.kernels import deltakernel, gaussiankernel, kernelproduct
from gklearn.preimage.utils import generate_median_preimages_by_class


def xp_median_preimage_1_1():
	"""xp 1_1: Letter-high, sspkernel.
	"""
	# set parameters.
	ds_name = 'Letter-high'
	mpg_options = {'fit_method': 'k-graphs',
				   'init_ecc': [3, 3, 1, 3, 3],
				   'ds_name': ds_name,
				   'parallel': True, # False
				   'time_limit_in_sec': 0,
				   'max_itrs': 100,
				   'max_itrs_without_update': 3,
				   'epsilon_residual': 0.01,
				   'epsilon_ec': 0.1,
				   'verbose': 2}
	mixkernel = functools.partial(kernelproduct, deltakernel, gaussiankernel)
	sub_kernels = {'symb': deltakernel, 'nsymb': gaussiankernel, 'mix': mixkernel}
	kernel_options = {'name': 'structuralspkernel',
					  'edge_weight': None,
					  'node_kernels': sub_kernels,
					  'edge_kernels': sub_kernels, 
					  'compute_method': 'naive',
					  'parallel': 'imap_unordered', 
# 						  'parallel': None, 
					  'n_jobs': multiprocessing.cpu_count(),
					  'normalize': True,
					  'verbose': 2}
	ged_options = {'method': 'IPFP',
				   'initialization_method': 'RANDOM', # 'NODE'
				   'initial_solutions': 1, # 1
				   'edit_cost': 'LETTER2',
				   'attr_distance': 'euclidean',
				   'ratio_runs_from_initial_solutions': 1,
				   'threads': multiprocessing.cpu_count(),
				   'init_option': 'EAGER_WITHOUT_SHUFFLED_COPIES'}
	mge_options = {'init_type': 'MEDOID',
				   'random_inits': 10,
				   'time_limit': 600,
				   'verbose': 2,
				   'refine': False}
	save_results = True
	
	# print settings.
	print('parameters:')
	print('dataset name:', ds_name)
	print('mpg_options:', mpg_options)
	print('kernel_options:', kernel_options)
	print('ged_options:', ged_options)
	print('mge_options:', mge_options)
	print('save_results:', save_results)
	
	# generate preimages.
	for fit_method in ['k-graphs', 'expert', 'random', 'random', 'random']:
		print('\n-------------------------------------')
		print('fit method:', fit_method, '\n')
		mpg_options['fit_method'] = fit_method
		generate_median_preimages_by_class(ds_name, mpg_options, kernel_options, ged_options, mge_options, save_results=save_results, save_medians=True, plot_medians=True, load_gm='auto', dir_save='../results/xp_median_preimage/')


if __name__ == "__main__":
	
	#### xp 1_1: Letter-high, sspkernel.
 	xp_median_preimage_1_1()

Error message:

When I run in Ubuntu terminal:

python3 test.py

The output results are not correct after the first class. However, If I remove the first class before computation, then the results of the first class in the remainder (the original second class) is correct, and the results of the new second class (the original third class) is wrong. This problem does not occur in Spyder3 (4.1.1) console IPython 7.0.1 or fresh virtualenv with only Python modules required installed.

graphkit-learn/Python version information:

Python 3.6.9
graphkit-learn 0.1
Ubuntu 18.04.4 LTS

Weisfeiler_Lehman graph kernel

您好，我刚刚入门graph, 请问一下使用：Weisfeiler_Lehman graph kernel，链接矩阵必须是对称的（无向图）吗？？？

ValueError: cost matrix is infeasible

Any workaround for this error.

Also, I want to compute GED for large matrices ?

Citing

How can I cite the library?

Is there any chance to add more node_label and edge_label?

Hi, lin. In the file 'graphfiles.py' you used the 'loadDataset' function to read the dataset and build graphs. I found that all the functions corresponding to different datasets like 'loadCT' , 'loadGXL ' , 'loadSDF' etc. usually add one node lable 'atom' and one edge label 'bond_type'. Is there any chance to add more node_label and edge_label to make the classification more accurate？
Thanks, man.

Key Error gklearn.kernels.treeletKernel

For some Graphs gklearn throws a "Key Error" when generating canonical keys. This does not happen for all graphs, I assume it is limited to this pattern. Help would be highly appreciated!

File ~\Anaconda3\lib\site-packages\gklearn\kernels\treeletKernel.py:128, in treeletkernel(sub_kernel, node_label, edge_label, parallel, n_jobs, chunksize, verbose, *args)
126 canonkeys = []
127 for g in (tqdm(Gn, desc='getting canonkeys', file=sys.stdout) if verbose else Gn):
--> 128 canonkeys.append(get_canonkeys(g, node_label, edge_label, labeled,
129 ds_attrs['is_directed']))
131 # compute kernels.
132 from itertools import combinations_with_replacement

File ~\Anaconda3\lib\site-packages\gklearn\kernels\treeletKernel.py:324, in get_canonkeys(G, node_label, edge_label, labeled, is_directed)
322 treelet = []
323 for pattern in patterns[str(i) + 'star']:
--> 324 canonlist = [tuple((G.nodes[leaf][node_label],
325 G[leaf][pattern[0]][edge_label])) for leaf in pattern[1:]]
326 canonlist.sort()
327 canonlist = list(chain.from_iterable(canonlist))

File ~\Anaconda3\lib\site-packages\gklearn\kernels\treeletKernel.py:325, in (.0)
322 treelet = []
323 for pattern in patterns[str(i) + 'star']:
324 canonlist = [tuple((G.nodes[leaf][node_label],
--> 325 G[leaf][pattern[0]][edge_label])) for leaf in pattern[1:]]
326 canonlist.sort()
327 canonlist = list(chain.from_iterable(canonlist))

File ~\Anaconda3\lib\site-packages\networkx\classes\coreviews.py:51, in AtlasView.getitem(self, key)
50 def getitem(self, key):
---> 51 return self._atlas[key]

KeyError: 1

Is the graph edit distance developmented?

How long will the edit distance be supported? Thanks a lot!

model_selection_precomputed.py文件中，关于normalization的问题

Hi，linlin同学。两个问题想请教下：
1.在model_selection_precomputed.py文件中,144行我看到您想删除和自身核函数为0的图

remove graphs whose kernels with themselves are zeros

Kmatrix 删除Kmatrix_diag中值为0的坐标处的行和列
但在151行中，normalization又取了Kmatrix_diag的值重新赋给Kmatrix，我使用自己的测试数据（因为我没能把你完整的代码跑下来）发现如果上一步中Kmatrix_diag如果有0的元素出现，这里的乘积为0，相除数组越界。请问这一个公式的目的是什么，你会遇到这个bug吗

normalization

      Kmatrix[i][j] /= np.sqrt(Kmatrix_diag[i] * Kmatrix_diag[j])
      Kmatrix[j][i] = Kmatrix[i][j]`

2.另外在几个"run_"和"test_"开头的ipynb文件中发现几个变量名错误（大小写之类，可能是留下的版本不同的代码？或者是我比较菜没看出来），所以请问您这套代码哪些是新的可以使用的，哪些是旧的不用的，我在测试中notebook目录下的ipynb文件出现的问题比较多，求大神指教，谢谢！