
fentechsolutions / causaldiscoverytoolbox

1.1K stars · 36 watchers · 200 forks · 14.22 MB

Package for causal inference in graphs and in the pairwise setting. Tools for graph structure recovery and dependency analysis are included.

Home Page: https://fentechsolutions.github.io/CausalDiscoveryToolbox/html/index.html

License: MIT License

Python 95.80% R 3.70% Shell 0.35% Dockerfile 0.09% Roff 0.06%
causal-inference graph causality causal-models algorithm machine-learning graph-structure-recovery python causal-discovery toolbox

causaldiscoverytoolbox's People

Contributors

aldro61, diviyank, eric-carlsson, goudetolivier, jiahy0825, kaydhi, koutrgor8012, kurowasan, pacowong, ritik99, tashay, thesignpainter, yknot


causaldiscoverytoolbox's Issues

GS/MMPC algorithm removes all non-related variables

Hi there,
This is my very first post on GitHub, so pardon me if this is a simple fix to my problem.

I am currently exploring the different graph inference algorithms on my own dataset. One thing I realized is that when running algorithms like GS/MMPC, uncorrelated variables are not shown (they are automatically removed) in the output of the nx.adjacency_matrix(output).todense() command.

For example, I fed 50 variables into the MMPC algorithm, of which only 35 are correlated and 15 are not. nx.adjacency_matrix(output).todense() only returns the matrix for the 35 variables, and I cannot tell which variables were uncorrelated and removed.

While CDT does provide plotting options, I prefer to use packages like Graphviz. Thus it would be helpful to obtain the matrix for all 50 input variables instead.

Is there a way for me to obtain such a matrix?
Thank you in advance!
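A possible workaround (a sketch, assuming the output is a networkx graph and data is the original DataFrame): re-add the pruned variables as isolated nodes and pass nodelist to nx.adjacency_matrix, so the matrix covers all input columns in a fixed order.

import networkx as nx
import pandas as pd

# Stand-ins for the real objects: data is the input DataFrame,
# output is the graph returned by GS/MMPC (here with column 'c' pruned).
data = pd.DataFrame(columns=['a', 'b', 'c'])
output = nx.Graph([('a', 'b')])

# Re-add the removed variables as isolated nodes, then fix the row/column
# order with nodelist so every input column gets a row and a column.
output.add_nodes_from(data.columns)
adj = nx.adjacency_matrix(output, nodelist=data.columns).todense()
print(adj)  # 3x3 matrix; the 'c' row and column are all zeros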

load_dataset in cdt.data is "random"

Hello. I am currently working on multiple pairwise algorithms, and found the following problem.
Whenever I load the TCEP data as follows,
data, labels = load_dataset('tuebingen')

the apparent order of the pairs is different.
I ran into this problem because of the following: to test threshold-dependent algorithms (such as ANM, GNN), I had a single Jupyter notebook in which I would load a single instance of data, labels, and then threshold and compute metrics on the predictions from different pre-recorded scores.

But each time I called data, labels = load_dataset('tuebingen') and then compute_metrics(preds, labels), they would all change?!
I was quite worried when the accuracy of the cdt implementations of RCC, ANM and IGCI was as low as 40% on TCEP...

Thank you for this wonderful work by the way!
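In case it helps others hitting this: the pair order can presumably be pinned with the loader's shuffle keyword (the same shuffle=False argument another issue further down this page uses):

from cdt.data import load_dataset

# shuffle=False keeps the pairs in a deterministic order across calls,
# so pre-recorded scores and freshly loaded labels stay aligned.
data, labels = load_dataset('tuebingen', shuffle=False)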

Reproducing Jarfo results on Kaggle challenge data

Hi,
I've been trying to reproduce Jarfo results from the Kaggle challenge (2013) with its final dataset from http://www.causality.inf.ethz.ch/CEdata/, i.e., with CEfinal_train_text.zip and CEfinal_test_text.zip.

It seems to output very different results compared to the original code, while the learning parameters and the features used seem to be the same.

import numpy as np
import pandas as pd
from cdt.causality.pairwise.Jarfo import Jarfo
from cdt.utils.io import read_causal_pairs
from cdt import SETTINGS

SETTINGS.GPU = False
SETTINGS.NJOBS = 1

train_data = read_causal_pairs(".../CEfinal_train_pairs.csv")
train_target = pd.read_csv(".../CEfinal_train_target.csv").iloc[:,:2].set_index("SampleID")
test_data = read_causal_pairs(".../CEfinal_test_pairs.csv")
test_target = pd.read_csv(".../CEfinal_test_target.csv").iloc[:,:2].set_index("SampleID")

j = Jarfo()
j.fit(train_data, train_target)

jp = j.predict(test_data)

acc = np.mean(jp * test_target.values > 0)
print(acc)

0.25491827465325406

I've tested it with
python 3.7.3
cdt 0.5.14

Thank you in advance for any hint.
Best
Tom

using other data sets - pandas data frame

Hi @Diviyan-Kalainathan

How do we use this package with other data sets?
Do we just create a pandas data frame and use the load_data function?

Many thanks,
Best,
Andrew
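For reference, the graph-discovery models consume a plain pandas DataFrame with one column per variable, so no dedicated loader is needed (a sketch with a hypothetical CSV path; PC is just one example model and requires the R pcalg backend):

import pandas as pd
from cdt.causality.graph import PC

df = pd.read_csv('my_dataset.csv')  # hypothetical file, one column per variable

model = PC()
graph = model.predict(df)  # networkx DiGraph over the DataFrame's columns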

All labels in TCEP are equal

Hi,
I was doing some more experiments using CGNN-based algorithms on TCEP, and I couldn't replicate the results. I also got only 60% accuracy using the hyperparameters that got me ~75% a few months ago. I couldn't find the cause of this, but when I printed the labels they were all 1. Is this a mistake?

Edit: I checked the commit warnings; apparently you permuted all pairs so that the label is always 1 and is still the right label. I'll compare with other instances of the dataset, but I'm reassured now!

[enhancement] Causal Estimation Code?

Hey. Maybe this is a dumb question, but has any thought been put into performing causal estimation in a graph in this package? It's great to have a package like this that does causal discovery, but I'd also like functionality that can generate the conditional probability estimates over the causal graph, as well as a method for generating answers to Pr(Y|do(X=x)) for x and y being continuous or discrete. I've found the ability to do this kind of general calculation to be absent from most Python packages.

I currently use the pomegranate package to build a causal Bayes net for my discrete variables (and I discretize continuous variables when I have them), then use another package to determine backdoor- or frontdoor-based variables I can condition on, and then apply the adjustment formula. It all seems glued together, however, because none of this exists in the same package, especially for graphs with all discrete variables, which I would think is a simpler case than the continuous one.

I know that 'DoWhy' exists, but it only allows for estimation of cause and effect for the direct effect case, where a treatment variable directly impacts an outcome variable.
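For the discrete case described above, the adjustment formula Pr(Y|do(X=x)) = sum_z Pr(Y|X=x, Z=z) Pr(Z=z) is only a few lines of pandas once a valid backdoor set Z is known. A minimal sketch (a hypothetical helper, not an existing CDT API):

import pandas as pd

def p_y_do_x(df, y, x, x_val, backdoor):
    """Backdoor adjustment on discrete data:
    P(Y | do(X=x_val)) = sum_z P(Y | X=x_val, Z=z) * P(Z=z)."""
    result = None
    for _, stratum in df.groupby(backdoor):
        p_z = len(stratum) / len(df)                  # P(Z=z)
        cond = stratum[stratum[x] == x_val]
        if len(cond) == 0:
            continue                                   # stratum never observed with X=x_val
        p_y = cond[y].value_counts(normalize=True)     # P(Y | X=x_val, Z=z)
        term = p_y * p_z
        result = term if result is None else result.add(term, fill_value=0)
    return result

# Toy usage: Z confounds X and Y.
df = pd.DataFrame({'Z': [0, 0, 1, 1, 0, 1],
                   'X': [0, 1, 0, 1, 1, 1],
                   'Y': [0, 1, 0, 1, 1, 0]})
print(p_y_do_x(df, 'Y', 'X', 1, ['Z']))  # P(Y | do(X=1))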

Docker image usage

Hi Diviyan,

Thank you for your project!

I am not very familiar with Docker, but I think it's worth trying.
I did the following:

$ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
divkal/cdt-py3.6    0.4.0               2af9c5c51ac1        2 weeks ago         3.34GB
$ docker run -it --entrypoint /bin/bash 2af9c5c51ac1
root@35b4b45ecfc8:/#

But it seems there is no python or conda in the image:

root@35b4b45ecfc8:/# python
bash: python: command not found
root@35b4b45ecfc8:/# conda
bash: conda: command not found

I believe it must be my fault. Could you please give some pointers?

Best,
Abel

FSGNN

Hi,

When trying to run:

Fsgnn = FSGNN(train_epochs=1000, test_epochs=500, l1=0.1, batch_size=1000)

From the example notebook of the LUCAS data (with my own data, though) I get the following error:

TypeError: __init__() got an unexpected keyword argument 'batch_size'
I wonder whether there have been some changes to the definition of the FSGNN object. Second, I would like to know how the package manages categorical data, as, from what I have noticed, the values are converted into floats at some point during object generation.

Thanks in advance,
Sergio
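A quick way to check which keywords the installed release actually accepts (plain Python introspection, nothing CDT-specific):

import inspect
from cdt.independence.graph import FSGNN

# Shows the constructor parameters of the installed version, so a removed
# or renamed keyword such as batch_size is immediately visible.
print(inspect.signature(FSGNN.__init__))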

In cdt.causality.pairwise, the "RCC" example uses "Jarfo"

Hi,
minor problem, just a confusing example I found through experiments, see below

class RCC(PairwiseModel):
    """Randomized Causation Coefficient model. 2nd approach in the Fast
    Causation challenge.
    **Description:** The Randomized causation coefficient (RCC) relies on the
    projection of the empirical distributions into a RKHS using random cosine
    embeddings, then classifies the pairs using a random forest based on those
    features.
    **Data Type:** Continuous, Categorical, Mixed
    **Assumptions:** This method needs a substantial amount of labelled causal
    pairs to train itself. Its final performance depends on the training set
    used.
    Args:
        rand_coeff (int): number of randomized coefficients
        nb_estimators (int): number of estimators
        nb_min_leaves (int): number of min samples leaves of the estimator
        max_depth (): (optional) max depth of the model
        s (float): scaling
        njobs (int): number of jobs to be run on parallel (defaults to ``cdt.SETTINGS.NJOBS``)
        verbose (bool): verbosity (defaults to ``cdt.SETTINGS.verbose``)
    .. note::
       Ref : Lopez-Paz, David and Muandet, Krikamol and Schölkopf, Bernhard and Tolstikhin, Ilya O,
       "Towards a Learning Theory of Cause-Effect Inference", ICML 2015.
    Example:
        >>> from cdt.causality.pairwise import RCC
        >>> import networkx as nx
        >>> import matplotlib.pyplot as plt
        >>> from cdt.data import load_dataset
        >>> from sklearn.model_selection import train_test_split
        >>> data, labels = load_dataset('tuebingen')
        >>> X_tr, X_te, y_tr, y_te = train_test_split(data, labels, train_size=.5)
        >>>
        >>> obj = Jarfo()
        >>> obj.fit(X_tr, y_tr)
        >>> # This example uses the predict() method
        >>> output = obj.predict(X_te)
        >>>
        >>> # This example uses the orient_graph() method. The dataset used
        >>> # can be loaded using the cdt.data module
        >>> data, graph = load_dataset('sachs')
        >>> output = obj.orient_graph(data, nx.DiGraph(graph))
        >>>
        >>> # To view the directed graph run the following command
        >>> nx.draw_networkx(output, font_size=8)
        >>> plt.show()
    """

Example

Hey man, in trying your example I have run into an error. I am not sure what to make of it, as it might be device-related. I will include the error message here and the file link.

https://github.com/snowde/firmai.github.io/blob/master/Discovery_LUCAS.ipynb


# So the question is, if you only have the data can you find the
# structure of the graph
from cdt.independence.graph import FSGNN

Fsgnn = FSGNN()

start_time = time.time()
ugraph = Fsgnn.predict(data, train_epochs=2000, test_epochs=1000, threshold=5e-4, l1=0.01)
print("--- Execution time : %4.4s seconds ---" % (time.time() - start_time))
nx.draw_networkx(ugraph, font_size=8) # The plot function allows for quick visualization of the graph.
plt.show()
# List results
pd.DataFrame(list(ugraph.edges(data='weight')))

`---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in __call__
    return self.func(*args, **kwargs)
  File "/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/cdt/independence/graph/model.py", line 43, in run_feature_selection
    return self.predict_features(df_features, df_target, **kwargs)
  File "/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/cdt/independence/graph/FSGNN.py", line 79, in predict_features
    x = th.FloatTensor(scale(df_features.as_matrix())).to(device)
AttributeError: 'torch.FloatTensor' object has no attribute 'to'
"""

The above exception was the direct cause of the following exception:

JoblibAttributeError                      Traceback (most recent call last)
/Volumes/extra/FirmAI/Causal Inference/CausalDiscoveryToolbox-master/examples/ in <module>()
      7 start_time = time.time()
----> 8 ugraph = Fsgnn.predict(data, train_epochs=2000, test_epochs=1000, threshold=5e-4, l1=0.01)
      9 print("--- Execution time : %4.4s seconds ---" % (time.time() - start_time))

~/anaconda/envs/py36/lib/python3.6/site-packages/cdt/independence/graph/model.py in predict(self, df_data, threshold, **kwargs)
     53         result_feature_selection = Parallel(n_jobs=nb_jobs)(delayed(self.run_feature_selection)
     54                                                             (df_data, node, idx, **kwargs)
---> 55                                                             for idx, node in enumerate(list_nodes))

~/anaconda/envs/py36/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
    738                 exception = exception_type(report)
    739
--> 740                 raise exception

JoblibAttributeError: JoblibAttributeError

Sub-process traceback (AttributeError, Sat Jun 30 16:19:48 2018, PID 92373, Python 3.6.3):

/Users/dereksnow/anaconda/envs/py36/lib/python3.6/site-packages/cdt/independence/graph/FSGNN.py in predict_features(...)
     77         """For one variable, predict its neighbours."""
     78         device, verbose = SETTINGS.get_default(('device', device), ('verbose', verbose))
---> 79         x = th.FloatTensor(scale(df_features.as_matrix())).to(device)

AttributeError: 'torch.FloatTensor' object has no attribute 'to'
___________________________________________________________________________`
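For context on the failing line: Tensor.to(device) only exists from PyTorch 0.4.0 onward, so the AttributeError indicates an older torch install, and DataFrame.as_matrix() was deprecated in pandas and later removed, so upgrading PyTorch (and eventually cdt) is the real fix. A minimal stand-alone equivalent of the failing line that avoids both calls (a local sketch, not an official patch):

import pandas as pd
import torch as th
from sklearn.preprocessing import scale

df_features = pd.DataFrame({'a': [0.1, 0.2, 0.3], 'b': [1.0, 0.5, 0.2]})  # stand-in data
device = 'cpu'

# .values replaces the removed DataFrame.as_matrix(); the explicit .cuda()
# branch replaces Tensor.to(device), which requires PyTorch >= 0.4.0.
x = th.FloatTensor(scale(df_features.values))
if device.startswith('cuda'):
    x = x.cuda()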

CGNN Module : unexpected argument (nb_runs,nb_max_runs, etc)

When I tried to run the example, this line
Cgnn.predict(data, graph=ugraph, nb_runs=12, nb_max_runs=20, train_epochs=1500, test_epochs=1000)
produces an unexpected-argument error. Were there any changes to the CGNN module that cause this? I tried investigating the module, but I can't seem to figure out why this happens. Is this reproducible on your machine?
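For what it's worth, later releases appear to take these hyperparameters in the constructor rather than in predict(), with the run counts folded into a single nruns argument (the same spelling used in a CGNN snippet further down this page); a sketch:

from cdt.causality.graph import CGNN

# Hyperparameters go to the constructor; predict() only takes the data
# and, optionally, a skeleton graph. data/ugraph are from the example.
cgnn = CGNN(nruns=12, train_epochs=1500, test_epochs=1000)
dgraph = cgnn.predict(data, graph=ugraph)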

Recommended training parameters for GNN, balancing accuracy and running time

Hi Diviyan,

I tried GNN on TCEP dataset, with default parameters:

def test_pairwise_GNN():
    from cdt.causality.pairwise import GNN
    from cdt.data import load_dataset
    tueb, labels = load_dataset('tuebingen')
    method = GNN
    print(method)
    m = method()
    r = m.predict_dataset(tueb)
    assert r is not None
    print(r)
    return 0

But it requires nearly half a day to test a single pair on my PC!

Could you recommend a set of parameters that run as fast as possible without much sacrifice on accuracy?

Thank you,
Abel
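Lacking an official recommendation, the knobs that dominate runtime are the number of restarts and the epoch counts; a cheaper configuration would look like the sketch below (the values are illustrative, not tuned or endorsed by the authors):

from cdt.causality.pairwise import GNN

# Fewer restarts and shorter training trade some accuracy for speed;
# tueb is the dataset loaded in the test above.
m = GNN(nruns=6, train_epochs=500, test_epochs=500)
r = m.predict_dataset(tueb)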

NCC outputs a continuous value

The NCC code is supposed to output 1 or -1, but that does not happen. I tried it on the example used in the documentation.
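If the network's raw score is what comes back, the documented ±1 labels can presumably be recovered by taking the sign of the output, with the magnitude read as confidence:

import numpy as np

scores = np.array([89.89, -12.30, 0.54])  # stand-in raw NCC outputs
labels = np.sign(scores)                  # maps to the documented {-1, +1}
print(labels)                             # [ 1. -1.  1.]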

TCEP dataset incoherent with 'official' version?

Hi,
After I opened the issue about the labels all being set to 1, I went to check the TCEP reference website to identify some pairs that got permuted, and so on.

I stumbled across something strange: in the original dataset, some of the variables are multivariate. This can be seen, for example, in pair 54 or pair 71.

However, in the current version of cdt, when checking these two pairs, one finds 1D variables.

> data.iloc[53]
A    [43.51, 41.33, 36.78, -8.82, 34.61, 40.11, 12....
B    [42.0, 75.0, 69.0, 42.0, 76.0, 72.0, 77.0, 81....
Name: pair54, dtype: object
> data.iloc[53]['A']
array([ 43.51,  41.33,  36.78,  -8.82,  34.61,  40.11,  12.52, -35.18,
        48.12,  40.24,  25.4 ,  26.19,  23.71,  13.09,  53.97,  50.83,
        17.25,   6.48,  27.44, -16.3 ,  43.86, -24.66, -15.78,   4.94,
        42.71,  12.35,  -3.38,  11.54,   3.86,  45.42,   4.36,  12.11,
        49.42, -33.45,  31.14, -11.7 ,  -4.25,  -4.33,   9.92,   5.33,
        45.8 ,  23.  ,  35.17,  50.08,  55.68,  11.58,  18.48,  -0.23,
        30.06,  13.7 ,   3.75,  15.33,  59.43,   9.  ,   6.92, -18.14,
        60.17,  48.86,   4.93, -17.54,   0.39,  13.44,  41.7 ,  52.52,
         5.54,  37.97,  12.05,  16.  ,  13.47,  14.62,   9.54,  11.86,
         6.8 ,  18.54,  14.08,  22.3 ,  47.5 ,  64.14,  28.63,  -6.19,
        35.71,  33.32,  53.34,  31.79,  41.9 ,  18.  ,  35.68,  31.94,
        51.18,  -1.28,  39.02,  37.51,  29.37,  42.87,  17.97,  56.95,
        33.89, -29.3 ,   6.31,  32.88,  54.69,  49.61,  22.18,  42.  ,
       -18.92, -13.99,   3.15,   4.17,  12.65,  35.9 ,  14.6 ,  18.07,
       -20.16,  19.42,  47.91,  42.46,  33.99, -25.97,  19.74, -22.57,
        27.71,  52.37,  12.1 , -22.28, -41.29,  12.15,  13.52,   9.06,
        59.91,  23.61,  33.68,  31.88,   8.99,  -9.47, -25.3 , -12.09,
        14.58,  52.22,  38.71,  18.45,  25.29,  47.01, -20.87,  44.45,
        55.76,  27.15,  14.  , -13.83,   0.34,  24.67,  14.7 ,  44.8 ,
         8.47,   1.29,  48.21,  46.05,  -9.43,   2.04, -25.75,  40.42,
         6.92,  13.2 ,  15.63,   5.82, -26.32,  59.33,  46.95,  33.52,
        38.57,  -6.17,  13.76,  -8.57,   6.12, -21.14,  10.66,  36.81,
        39.94,  37.95,   0.31,  50.44,  24.48,  51.5 ,  38.89,  18.34,
       -34.89,  41.32, -17.74,  10.5 ,  21.03,  15.36, -15.41, -17.82])

The same can be seen for pair 71. Is this a mistake, or just a shuffling of the data?
I made sure I set shuffle=False before testing the two pairs.
If the basic (non-shuffled) dataset is already shuffled, or has been pre-processed in some way to reduce dimensionality, could we have some explanation of how the two datasets relate to each other?

Any amount of information would help,
Thanks

NCC gives wrong prediction on TCEP?

I tested NCC with half of the TCEP pairs for training and half for testing.
When testing, I flip all pairs (so that X2 is the cause). However, NCC outputs a positive value for all pairs!
Code:

def test_NCC():
    from sklearn.model_selection import train_test_split
    from cdt.causality.pairwise import NCC
    from cdt.data import load_dataset
    tueb, labels = load_dataset('tuebingen')
    method = NCC
    print(method)
    m = method()
    X_tr, X_te, y_tr, y_te = train_test_split(tueb, labels, train_size=.5)
    m.fit(X_tr, y_tr, epochs=10000)
    r = m.predict_dataset(X_te.reindex(columns=['B', 'A']))
    print(r)

Outputs:

0       89.886803
1    42859.230469
2     2996.945312
3   351716.406250
4   218484.812500
5     1456.278320
6      354.131256
7      453.962494
8    29202.076172
9    47342.875000
10    2115.986084
11     175.141602
12    2060.776123
13   10275.829102
14    2584.913574
15    8027.451660
16     637.758789
17   49512.773438
18     794.610840
19     177.425110
20    5133.766602
21    2414.513916
22     205.962494
23     411.851135
24     186.423264
25     880.144958
26     173.254272
27      85.153000
28     758.132324
29     726.009766
30    2010.785767
31    1986.761475
32    1791.590332
33      32.296738
34    2300.482666
35   12707.833008
36   63790.007812
37    4901.006836
38     935.546875
39     232.197510
40    5229.793457
41    2120.424316
42     180.572327
43    2947.156738
44    2176.514160
45    2140.100098
46    6997.687988
47   28182.152344
48     881.467407
49    1656.368042

Mini-batch train

Hi,

I would like to suggest implementing mini-batch training. Specifically, I tried to run FSGNN on my data and got the following error:

not enough memory: you tried to allocate 160465GB. Buy new RAM! at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/TH/THGeneral.c:218

I looked into the code, but unfortunately I'm not familiar enough with GANs to modify it.

GNN never stops even when p-value < 0.01

I have a Tesla GPU with CUDA installed:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:65:00.0 Off | Off |
| N/A 83C P0 43W / 250W | 1087MiB / 16280MiB | 100% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2648 C python 1077MiB |

My question is that the GNN never stops, even when the smallest p-value has been obtained. What is wrong with my code? Please help.

import time
import pandas as pd
import networkx as nx

from cdt.independence.graph import FSGNN
Fsgnn = FSGNN(train_epochs=100, test_epochs=50, l1=0.1, batch_size=1000)

start_time = time.time()
ugraph = Fsgnn.predict(df, threshold=1e-7)
print("--- Execution time : %4.4s seconds ---" % (time.time() - start_time))
nx.draw_networkx(ugraph, font_size=8)  # The plot function allows for quick visualization of the graph.
# plt.show()
# List results
list00 = pd.DataFrame(list(ugraph.edges(data='weight')))

from cdt.causality.graph import CGNN

Cgnn = CGNN(nruns=16, train_epochs=200, test_epochs=100, batch_size=1000)
start_time = time.time()
dgraph = Cgnn.orient_undirected_graph(df, ugraph)
print("--- Execution time : %4.4s seconds ---" % (time.time() - start_time))

# Plot the output graph
nx.draw_networkx(dgraph, font_size=8)  # The plot function allows for quick visualization of the graph.
# plt.show()
# Print output results :
list22 = pd.DataFrame(list(dgraph.edges(data='weight')), columns=['Cause', 'Effect', 'Score'])
print(list22)

ANM, lack of Gamma in "Gamma HSIC"

Hi,
This might simply be a conceptual problem, or a lack of knowledge on my part.
Usually, using HSIC to compare two ANM candidates can be done by comparing the statistics directly, or by computing the related p-value.
However, to compute a p-value one needs to have some notion of the HSIC distribution under the null.
The classic paper from Gretton et al. proposes a Gamma approximation, giving specific plug-in values for the two Gamma parameters in terms of the expectation and variance of the HSIC statistic.
If I had to compute the p-value myself, I would use that approximation and evaluate the Gamma CDF parametrized by those plug-in values.
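To make this concrete, here is a minimal sketch of what I have in mind; hsic_stat, mean_h and var_h are names of my own (the statistic and estimates of its null expectation and variance), not the toolbox's:

from scipy.stats import gamma

def hsic_gamma_pvalue(hsic_stat, mean_h, var_h):
    # Method-of-moments plug-in: shape k = E[HSIC]^2 / Var[HSIC],
    # scale theta = Var[HSIC] / E[HSIC]
    k = mean_h ** 2 / var_h
    theta = var_h / mean_h
    return 1.0 - gamma.cdf(hsic_stat, a=k, scale=theta)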

I am aware there might be other ways to do this; however, your snippet in the anm method does not seem to compute p-values, but only test statistics.
I might be wrong, but the variable names as well as the description of the method suggest this.

Am I wrong? Right? If either, how so?

Thanks for any additional information on this topic,
I would ideally like to design a test which detects whenever a model satisfies an ANM with low Type I and II error.

ANM score

What does the value on the ANM score indicate?
I never get a negative value, even when I reverse the order of the arguments to anm_score.

For Eg. anm_score(x1, y1) ,anm_score(y1, x1) = (1.492447882112475, 0.6205516300704923)
and anm_score(x2, y2), anm_score(y2, x2) = (1.2622033043127454, 1.9067645693295359)

What can I infer from these values?
Does this mean that x1 causes y1 since 1.49 > 0.620 and y2 causes x2 since 1.90 > 1.26 ?

Is HSIC Lasso different from KCI ?

Hello,
I was wondering whether the independence test found in the HSIC Lasso script was the one introduced in the paper Kernel-based Conditional Independence Test and Application in Causal Discovery ?
I'm asking this because they seem related (both based on HSIC), but there's no citation in the code you provide, neither of the Gretton et al. paper nor of the one I'm referring to.

If I had to guess, I would bet on a standard independence test, not conditional independence. Is that the case?

Thanks,
Arno V.

import error when importing cdt.causality.pairwise.ANM

When importing the ANM package from cdt.causality.pairwise, I got the following error message: "ImportError: cannot import name 'GraphLasso'".
GraphLasso is a class from the sklearn.covariance package.
I checked the sklearn documentation and the class is apparently only documented for version 0.11. I tried to pip install this older version of sklearn, but got the error message "ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.", which apparently has something to do with the conversion from Python 2 to Python 3.
So my guess is that GraphLasso is no longer supported in recent versions of sklearn.
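If the removal really is the cause, a small compatibility shim might restore the import; this is an untested sketch on my part, relying on the fact that sklearn renamed the estimator to GraphicalLasso:

try:
    from sklearn.covariance import GraphLasso
except ImportError:
    # sklearn >= 0.22 removed GraphLasso in favor of GraphicalLasso
    from sklearn.covariance import GraphicalLasso as GraphLasso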

I use python 3.6.8 in a virtual environment.
My Pytorch version is 1.3.1
My sklearn version is 0.22

I haven't tried running in a docker image, because I am new to docker.
Any help would be much appreciated.

Matthias

ImportError: cannot import name '__version__' from 'sklearn' (unknown location)

While attempted to import cdt, I get an Import error.

For context:
I'm on a Mac, running python 3 and the most recent versions of sklearn and cdt.

Full Traceback:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-166-859cb185c28f> in <module>
      2 #from sklearn.gaussian_process import GaussianProcessRegressor
      3 #import networkx as nx
----> 4 import cdt

~/anaconda3/lib/python3.7/site-packages/cdt/__init__.py in <module>
     26 """
     27 
---> 28 import cdt.causality
     29 import cdt.independence
     30 import cdt.data

~/anaconda3/lib/python3.7/site-packages/cdt/causality/__init__.py in <module>
     22 .. SOFTWARE.
     23 """
---> 24 from .pairwise import __init__
     25 from .graph import __init__

~/anaconda3/lib/python3.7/site-packages/cdt/causality/pairwise/__init__.py in <module>
     22 .. SOFTWARE.
     23 """
---> 24 from .ANM import ANM
     25 from .CDS import CDS
     26 from .Jarfo import Jarfo

~/anaconda3/lib/python3.7/site-packages/cdt/causality/pairwise/ANM.py in <module>
     27 """
     28 
---> 29 from sklearn.gaussian_process import GaussianProcessRegressor
     30 from sklearn.preprocessing import scale
     31 from .model import PairwiseModel

~/anaconda3/lib/python3.7/site-packages/sklearn/gaussian_process/__init__.py in <module>
     11 """
     12 
---> 13 from .gpr import GaussianProcessRegressor
     14 from .gpc import GaussianProcessClassifier
     15 from . import kernels

~/anaconda3/lib/python3.7/site-packages/sklearn/gaussian_process/gpr.py in <module>
     12 from scipy.optimize import fmin_l_bfgs_b
     13 
---> 14 from ..base import BaseEstimator, RegressorMixin, clone
     15 from ..base import MultiOutputMixin
     16 from .kernels import RBF, ConstantKernel as C

~/anaconda3/lib/python3.7/site-packages/sklearn/base.py in <module>
     13 import numpy as np
     14 
---> 15 from . import __version__
     16 from .utils import _IS_32BIT
     17 

ImportError: cannot import name '__version__' from 'sklearn' (unknown location)

CGNN

Hey man, thanks for this, looking forward to exploring. Just a quick question regarding an error: do you know what is causing this?

[screenshot of the error]

NCC example: pytorch error

Hi,

I've been trying the NCC-example from the docs and get an error from torch:

from cdt.causality.pairwise import NCC
import networkx as nx
import matplotlib.pyplot as plt
from cdt.data import load_dataset
from sklearn.model_selection import train_test_split
data, labels = load_dataset('tuebingen')
X_tr, X_te, y_tr, y_te = train_test_split(data, labels, train_size=.5)
obj = NCC()
obj.fit(X_tr, y_tr)
Epochs: 0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "", line 1, in
File "/opt/conda/lib/python3.6/site-packages/cdt-0.5.5-py3.6.egg/cdt/causality/pairwise/NCC.py", line 183, in fit
for (batch, label), i in zip(da, t):
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 529, in __next__
batch = self.collate_fn([self.dataset[i] for i in indices])
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 68, in default_collate
return [default_collate(samples) for samples in transposed]
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 68, in
return [default_collate(samples) for samples in transposed]
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 43, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 349 and 392 in dimension 3 at /tmp/pip-req-build-l1dtn3mo/aten/src/THC/generic/THCTensorMath.cu:71

The error is almost the same from within
python 3.6.8
pytorch 1.1.0
cdt 0.5.5
or
with the nvidia-docker:0.5.5

I'm not sure if it's up to my hardware.
I get it on my notebook:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| 0 GeForce 940MX Off | 00000000:02:00.0 Off | N/A |
| N/A 40C P0 N/A / N/A | 269MiB / 2004MiB | 0% Default |

and on a workstation:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| 0 Quadro P600 Off | 00000000:18:00.0 Off | N/A |
| 34% 40C P8 N/A / N/A | 17MiB / 1999MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M40 Off | 00000000:3B:00.0 Off | Off |
| N/A 56C P8 17W / 250W | 0MiB / 12215MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M40 Off | 00000000:D8:00.0 Off | Off |
| N/A 65C P8 17W / 250W | 0MiB / 12215MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

Thank you in advance for any hint.
Best
Tom

ModuleNotFoundError: No module named 'sklearn.gaussian_process'

While attempted to import cdt, I get a ModuleNotFound error.

For context:
I'm on a Mac, running python 3.

Full Traceback:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-37-0d5686c1281b> in <module>
      1 import sklearn
      2 import networkx as nx
----> 3 import cdt

~/anaconda3/lib/python3.7/site-packages/cdt/__init__.py in <module>
     26 """
     27 
---> 28 import cdt.causality
     29 import cdt.independence
     30 import cdt.data

~/anaconda3/lib/python3.7/site-packages/cdt/causality/__init__.py in <module>
     22 .. SOFTWARE.
     23 """
---> 24 from .pairwise import __init__
     25 from .graph import __init__

~/anaconda3/lib/python3.7/site-packages/cdt/causality/pairwise/__init__.py in <module>
     22 .. SOFTWARE.
     23 """
---> 24 from .ANM import ANM
     25 from .CDS import CDS
     26 from .Jarfo import Jarfo

~/anaconda3/lib/python3.7/site-packages/cdt/causality/pairwise/ANM.py in <module>
     27 """
     28 
---> 29 from sklearn.gaussian_process import GaussianProcessRegressor
     30 from sklearn.preprocessing import scale
     31 from .model import PairwiseModel

ModuleNotFoundError: No module named 'sklearn.gaussian_process'

Error when running PC alg (some CSV file)

Hi,
As you know, I have installed most of the packages and attempted to run PC alg on sachs as well as fsgnn.
I thought I was out of trouble, as the R setup looked fine, but when calling the following Python 3.6 snippet:

from cdt.causality.graph.PC import PC

# data is a pandas DataFrame loaded beforehand
pc = PC(CItest="hsic", method_indep="rcit")
pcgraph = pc.predict(data)

The interpreter spat out the following error message:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] File b'/tmp/cdt_pc_0f4d2452-ff3c-4a84-81dd-73ab5b4d474b//result.csv' does not exist: b'/tmp/cdt_pc_0f4d2452-ff3c-4a84-81dd-73ab5b4d474b//result.csv'

Any idea of what this is about?

Regards,
A.V

NB: here is the full error printout

--------------------------------------------------------------------
FileNotFoundError                  Traceback (most recent call last)
<ipython-input-50-db1fca6dbddc> in <module>
----> 1 pcgraph = pc.predict(data)

~/progtools/python/virtualenvs/tfcuda/lib/python3.6/site-packages/cdt/causality/graph/model.py in predict(self, df_data, graph, **kwargs)
     61         """
     62         if graph is None:
---> 63             return self.create_graph_from_data(df_data, **kwargs)
     64         elif isinstance(graph, nx.DiGraph):
     65             return self.orient_directed_graph(df_data, graph, **kwargs)

~/progtools/python/virtualenvs/tfcuda/lib/python3.6/site-packages/cdt/causality/graph/PC.py in create_graph_from_data(self, data, **kwargs)
    257         self.arguments['{VERBOSE}'] = str(self.verbose).upper()
    258 
--> 259         results = self._run_pc(data, verbose=self.verbose)
    260 
    261         return nx.relabel_nodes(nx.DiGraph(results),

~/progtools/python/virtualenvs/tfcuda/lib/python3.6/site-packages/cdt/causality/graph/PC.py in _run_pc(self, data, fixedEdges, fixedGaps, verbose)
    300         except Exception as e:
    301             rmtree(run_dir)
--> 302             raise e
    303         except KeyboardInterrupt:
    304             rmtree(run_dir)

~/progtools/python/virtualenvs/tfcuda/lib/python3.6/site-packages/cdt/causality/graph/PC.py in _run_pc(self, data, fixedEdges, fixedGaps, verbose)
    296 
    297             pc_result = launch_R_script("{}/R_templates/pc.R".format(os.path.dirname(os.path.realpath(__file__))),
--> 298                                         self.arguments, output_function=retrieve_result, verbose=verbose)
    299         # Cleanup
    300         except Exception as e:

~/progtools/python/virtualenvs/tfcuda/lib/python3.6/site-packages/cdt/utils/R.py in launch_R_script(template, arguments, output_function, verbose, debug)
    198         if not debug:
    199             rmtree(base_dir)
--> 200         raise e
    201     except KeyboardInterrupt:
    202         if not debug:

~/progtools/python/virtualenvs/tfcuda/lib/python3.6/site-packages/cdt/utils/R.py in launch_R_script(template, arguments, output_function, verbose, debug)
    192                                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    193             process.wait()
--> 194             output = output_function()
    195 
    196     # Cleaning up

~/progtools/python/virtualenvs/tfcuda/lib/python3.6/site-packages/cdt/causality/graph/PC.py in retrieve_result()
    284 
    285         def retrieve_result():
--> 286             return read_csv('{}/result.csv'.format(run_dir), delimiter=',').values
    287 
    288         try:

~/progtools/python/virtualenvs/tfcuda/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    683         )
    684 
--> 685         return _read(filepath_or_buffer, kwds)
    686 
    687     parser_f.__name__ = name

~/progtools/python/virtualenvs/tfcuda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    455 
    456     # Create the parser.
--> 457     parser = TextFileReader(fp_or_buf, **kwds)
    458 
    459     if chunksize or iterator:

~/progtools/python/virtualenvs/tfcuda/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    893             self.options["has_index_names"] = kwds["has_index_names"]
    894 
--> 895         self._make_engine(self.engine)
    896 
    897     def close(self):

~/progtools/python/virtualenvs/tfcuda/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1133     def _make_engine(self, engine="c"):
   1134         if engine == "c":
-> 1135             self._engine = CParserWrapper(self.f, **self.options)
   1136         else:
   1137             if engine == "python":

~/progtools/python/virtualenvs/tfcuda/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1915         kwds["usecols"] = self.usecols
   1916 
-> 1917         self._reader = parsers.TextReader(src, **kwds)
   1918         self.unnamed_cols = self._reader.unnamed_cols
   1919 

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] File b'/tmp/cdt_pc_0f4d2452-ff3c-4a84-81dd-73ab5b4d474b//result.csv' does not exist: b'/tmp/cdt_pc_0f4d2452-ff3c-4a84-81dd-73ab5b4d474b//result.csv'

Potential Bug for GNN when computing the loss function

Hi,

I just read through the CGNN code, mainly interested in the pairwise version.

It looks like the criterion computes MMD(y, y_pred):
https://github.com/FenTechSolutions/CausalDiscoveryToolbox/blob/32200779ab9b63762be3a24a2147cff09ba2bb72/cdt/causality/pairwise/GNN.py#L111

However, in the original paper they compute MMD([x, y], [x, y_pred]):
https://github.com/GoudetOlivier/CGNN/blob/e3fcfc570e30fb8dad8bf00f619ef3c21998bb90/Code/cgnn/GNN.py#L70
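To make the difference explicit, a rough sketch (th is torch, criterion stands for the MMD loss, and I assume 2-D tensors; this is my paraphrase, not the actual code):

# What the CDT pairwise GNN seems to do:
loss = criterion(y, y_pred)

# What the original CGNN code does:
loss = criterion(th.cat([x, y], dim=1), th.cat([x, y_pred], dim=1))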

Thanks a lot for the repo and the reply. Helped me understand a lot of new things.

ValueError when running VarLiNGAM

When I run VarLiNGAM on the Finance dataset (http://www.skleinberg.org/data/FinanceCPT.tar.gz), I get a ValueError.

df_data = pd.read_csv(datafile)
model = VarLiNGAM(lag=3)
result = model.create_graph_from_data(df_data)
The error is as follows

File "", line 1, in
runfile('F:/work_python_d_2/TimeSeriesCausalDiscovery/VARLiNGAM/VARLiNGAM_Finance.py', wdir='F:/work_python_d_2/TimeSeriesCausalDiscovery/VARLiNGAM')

File "D:\_work\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)

File "D:\_work\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "F:/work_python_d_2/TimeSeriesCausalDiscovery/VARLiNGAM/VARLiNGAM_Finance.py", line 31, in <module>
G = model.create_graph_from_data(df_data)

File "D:\_work\Anaconda3\lib\site-packages\cdt\timeseries\graph\VARLiNGAM.py", line 75, in create_graph_from_data
inst, lagged = self._run_varLiNGAM(data.values, verbose=self.verbose)

File "D:\_work\Anaconda3\lib\site-packages\cdt\timeseries\graph\VARLiNGAM.py", line 109, in _run_varLiNGAM
Bhat_ = np.dot((Ident - Bo_), Mt_)

ValueError: shapes (25,25) and (75,25) not aligned: 25 (dim 1) != 75 (dim 0)

issue in tutorial

print(nx.adj_matrix(output_graph).todense()) does not work. It requires

print(nx.adjacency_matrix(output_graph).todense())

instead

CGNN results question

Hi,

So I have tried to run the experiments again for the CGNN pairwise experiments.

And I can confirm to get the same results for the Multi, Gauss, Net, Tueb datasets in terms of AUPRC (using 12 different runs to ensemble)
AUPR: 0.95 MULTI
AUPR: 0.80 GAUSS
AUPR: 0.90 NET

However, when I look at the accuracy, i.e. predicting the actual direction, I get:
0.43, 0.46 and 0.49 respectively.

I compute the accuracy with the following script:

import numpy as np
from numpy import genfromtxt
from cdt.data import load_dataset
from cdt.metrics import precision_recall

for dataset_name in ['multi', 'gauss', 'net']:
    data, labels = load_dataset(dataset_name)
    res = genfromtxt('results/res2_{}.csv'.format(dataset_name), delimiter=',', skip_header=True)
    labels = labels.to_numpy()
    acc = 0
    # a pair counts as correct when the sign of the score matches the label
    for idx, score in enumerate(res[:, 1]):
        if (score < 0 and labels[idx] == -1) or (score > 0 and labels[idx] == 1):
            acc += 1

    acc /= (res.shape[0] - 1)
    print(res.shape[0])
    print('{} ACC : {}'.format(dataset_name, acc))
    aupr, curve = precision_recall(labels[:res.shape[0]], res[:, 1])
    print('AUPR: {}'.format(aupr))

This method also gives me around 74% (unweighted) on the Tueb dataset.

So my question is whether this is expected, whether I should be computing the accuracy differently, or maybe whether the accuracy doesn't matter at all?

Thanks for the clarification in advance.

Best

Failed to import cdt on a machine without GPU.

I am using Windows 7 and I don't have a GPU.

I create a fresh conda environment as follows.

conda create --prefix venv python=3.7
conda activate ./venv
# Note that I don't have a GPU. https://pytorch.org/get-started/locally/ 
conda install pytorch torchvision cpuonly -c pytorch
conda install pip
pip install cdt

Packages:

  • pytorch: 1.2.0
  • torchversion: 0.4.0
  • cdt: 0.5.8
  • ...

Executing import cdt failed.
[screenshot of the traceback]

The following lines caused the problem. https://github.com/Diviyan-Kalainathan/CausalDiscoveryToolbox/blob/6df55f4ec0800a377cb4688b7eaedb2b1f75f3fe/cdt/utils/Settings.py#L161-L163

My workaround is executing os.environ["CUDA_VISIBLE_DEVICES"]="[]" before executing import cdt.
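In code, the workaround looks like this (a minimal sketch of what I run):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "[]"  # hide CUDA devices from cdt's GPU detection

import cdt  # now imports fine on a machine without a GPU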

problem about the SAM model

Hi, thanks for your great work. I have tried the SAM model on a 90-variable graph estimation.
However, after an hour, I found that the loss becomes NaN (it's NOT always NaN; sometimes it looks regular).

Is that a problem? Do I need to change the lr or any other parameters? (I just use the default parameters.)

Command:

obj = SAM(gpus=3, njobs=6, nruns=16, batchsize=1024)
output = obj.predict(data)

Output:
397/11000 [2:52:25<76:44:58, 26.06s/it, disc=nan, gen=nan, regul_loss=nan, tot=nan]
402/11000 [2:52:16<75:41:45, 25.71s/it, disc=6.07, gen=-0.992, regul_loss=0.49, tot=-78.9]

FSGNN strange matrix multiplication

Hello,
I played with the FSGNN example avalaible, modifying very little pieces of the code (most of it, especially the NN-related part, is the same as the original).
However after training, the whole thing crashes because of some matrix multiplication.
You can find a screen capture of the error message (inside the jupyter notebook) below.
[screenshot of the error message, taken 2019-10-01]

I think you meant to write matrix_results = matrix_results.dot(matrix_results.T) or something like that (matrix_results = matrix_results @ matrix_results.T works too, I believe); it would only make sense, as (2,11) x (11,2) x (2,11) is a valid chain of matrix multiplications. See the quick check below.
Maybe A*B is now performing an element-wise product?
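As a quick sanity check of that hypothesis (plain numpy, nothing CDT-specific):

import numpy as np

A = np.ones((2, 11))
print((A @ A.T).shape)  # (2, 2): true matrix product
print((A * A).shape)    # (2, 11): * on ndarrays is element-wise
# A * A.T raises ValueError: operands could not be broadcast together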

Regards,
Arno V.

FileNotFoundError

When I'm trying to run some examples with different parameters, I get this error:

FileNotFoundError: File b'/tmp/cdt_CAMbc4bbf1c-b80b-4e8b-9184-23ba73222cce/result.csv' does not exist

Here is the snippet of code I'm trying to run

import networkx as nx
from cdt.causality.graph import CAM
from cdt.data import load_dataset
data, graph = load_dataset("sachs")
obj = CAM(selmethod='gam')
output = obj.predict(data)

Fix examples/Discovery_LUCAS.ipynb

Cgnn.predict(data, graph=ugraph, nb_runs=16, train_epochs=1500, test_epochs=1000)

The CGNN predict function doesn't accept nb_runs, train_epochs and test_epochs anymore. It has to be called like this:

Cgnn = CGNN(nb_runs=16, train_epochs=1500, test_epochs=1000)
Cgnn.predict(data, graph=ugraph)

Issue with running RScript in Mac

The subprocess.call function does not find Rscript. You need to provide the full path (e.g. /usr/local/bin/Rscript) in R.py to resolve this.
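For illustration, this is the kind of call that works once the absolute path is used (the script name here is only a placeholder):

import subprocess

# the absolute path makes the binary findable regardless of PATH
subprocess.call(['/usr/local/bin/Rscript', '--vanilla', 'script.R'])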

Sign of the IGCI score

Hi Diviyan,

I think the sign of the IGCI score is wrong.

In the UAI 2010 paper, on page 5, at the beginning of sec 3.5, Cx2y = S(Py) - S(Px) after preprocessing. And on page 3, after Postulate 2, the paper says negative value of Cx2y indicates X cause Y.

Thus, in the predict_proba method of the IGCI class, line 105, the entropy estimator calls should be swapped to eval_entropy(x) - eval_entropy(y). Then a positive return value indicates that X causes Y.
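In other words (sketch only; eval_entropy stands for the toolbox's entropy estimator):

score = eval_entropy(x) - eval_entropy(y)  # positive => X causes Y, matching Cx2y < 0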

Running the code on the CEP dataset, the results also agree with the above.

Best,
Abel

NCC.py

It is written in the code that it outputs 1 or -1, but a sigmoid has been used, which outputs values between 0 and 1. So data with which type of labels is required? Please resolve this inconsistency.

Is Graphviz needed when using R's pcalg?

My question is related to these explanations about the pcalg package in R.
I am currently installing the required packages to run CDT's graph-related PC.py.
As the plotting is done using networkx, I figured Graphviz would not be needed.

Am I guessing correctly?
Is there anything more I should know about these R requirements (say, about the path to the packages or some environment variables)?

Thanks,
A.V

Reproducing results on Tuebingen

I ran GNN with default parameters on the Tuebingen dataset, with 10 epochs for training, 10 for testing, and nb_max_runs=5. I got an AUC of 54 (in https://arxiv.org/pdf/1711.08936.pdf it is specified that I should get a higher score). Am I doing something wrong?

Note: running 1000 epochs is infeasible, since it already takes more than 5 hours with 10 epochs.

Thanks!!

NCC code not coherent with the paper?

Hi,
This could obviously be due to a misunderstanding on my part, but I thought the NCC framework should be built from 2 MLPs, and from my understanding MLPs typically do not use convolutions.
However, the NCC code I found in CDT is the following:

        self.conv = th.nn.Sequential(th.nn.Conv1d(2, n_hiddens, kernel_size),
                                     th.nn.ReLU(),
                                     th.nn.Conv1d(n_hiddens, n_hiddens,
                                                  kernel_size),
                                     th.nn.ReLU())

In addition, the original paper recommends enforcing 1 - NCC(x1,x2) = NCC(x2,x1) (where (x1,x2) is an n-sample of pairs). They do this by having the composite output 0.5*(1 - NCC(x2,x1) + NCC(x1,x2)).
I am not sure which lines are responsible for this, if they exist; see the sketch below for what I would expect.
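A rough sketch of the composite output I am looking for (ncc stands for the network's raw forward pass; the name is mine, not the code's):

def symmetric_output(ncc, x1, x2):
    # composite output from the paper: antisymmetric by construction
    return 0.5 * ((1 - ncc(x2, x1)) + ncc(x1, x2))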

Any help would be greatly appreciated.
Regards,
Arno V.

NCC got random guess performance on TCEP

I test NCC with half of the TCEP pairs for training and half for testing, randomly re-splitting train and test 100 times.
Code:

import numpy as np
from sklearn.model_selection import train_test_split
from cdt.causality.pairwise import NCC

tueb, labels = load_tuebingen(shuffle=True)  # loader from my cdt version

def test_NCC():
    m = NCC()
    print(m)

    accs = []
    for n in range(100):
        X_tr, X_te, y_tr, y_te = train_test_split(tueb, labels, train_size=.5)
        m.fit(X_tr, y_tr, epochs=10000)
        r = m.predict_dataset(X_te)
        acc = np.mean(r.values * y_te.values > 0)
        accs.append(acc)
        print(acc, file=open('ncc_.txt', 'a'))
    print(np.mean(accs), np.std(accs), file=open('ncc_.txt', 'a'))

The average accuracy over ~60 runs is 50.03%.

A first guess is overfitting, but I am also running with epochs=500 and there seems to be no big difference (although the training accuracies look less like overfitting).

Thank you,
Abel

Higher dimension data support

Hi,

I just started to look into your work and really like it.

I started out with the LUCAS example, and wonder if you have any plans to support high-dimensional features?

For example: I have a dataset where feature 1 is an array of length L, but feature 2 is just a single number.

Thanks for the great work, and thanks for using Pytorch.

Cheers,

Weights for TCEP dataset

Hi,
Many different papers on bivariate causal discovery discuss the necessity of attaching a weight to each pair, to account for the fact that some pairs come from the same joint distribution.

I do not see this as an option currently in CDT.
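For clarity, this is the kind of weighted scoring I mean; my own sketch, not CDT code, with the weights assumed to be provided alongside the dataset:

import numpy as np

def weighted_accuracy(preds, labels, weights):
    # each pair contributes to the score in proportion to its weight
    correct = (np.sign(preds) == np.sign(labels)).astype(float)
    return np.sum(weights * correct) / np.sum(weights)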

Would this possibly be an option in later releases? :)

Thanks!

CDT GPU Util: When using several Jupyter Notebooks, some do not detect CUDA.

Hello,
I have been relying on Jupyter and CDT to experiment the past few weeks.
Several times, some unwanted behavior has manifested itself:

  • Trying one algorithm with CUDA and visualizing results, then opening a new notebook to test another algorithm, the new notebook's CDT instance did not detect the GPU (in this case, shutting down the notebook that had detected CUDA was enough to solve the problem).
  • A few days later, after making sure all other notebooks were shut down, I experimented again. For some reason, after modifying the code and pressing "Restart & Clear Output", the same notebook that could previously detect my single CUDA GPU no longer did.
  • After closing the entire Jupyter server and launching a single notebook file again, I could use CDT with CUDA successfully.

Is this behavior expected? Could the new notebook take priority over the old ones?
Here are screenshots of the same notebook before and after restarting the Jupyter server.
[screenshots: the notebook first failing to detect CUDA, then detecting it after the server restart]
