furrer-lab / abn

Bayesian network analysis in R

Home Page: https://r-bayesian-networks.org/

License: GNU General Public License v3.0

bayesian-network binomial categorical-data gaussian grouped-datasets mixed-effects multinomial multivariate poisson structure-learning

abn: Additive Bayesian Networks

The R package abn is a tool for Bayesian network analysis, a form of probabilistic graphical model. It derives a directed acyclic graph (DAG) from empirical data that describes the dependency structure between random variables. The package provides routines for structure learning and parameter estimation of additive Bayesian network models.

Installation

The abn R package can easily be installed from CRAN using:

install.packages("abn", dependencies = TRUE)

The most recent development version is available from Github and can be installed with:

devtools::install_github("furrer-lab/abn")

It is recommended to install abn within a virtual environment (e.g., using renv), which can be done with:

renv::install("bioc::graph")
renv::install("bioc::Rgraphviz")
renv::install("abn", dependencies = c("Depends", "Imports", "LinkingTo", "Suggests"))

Additional libraries

The following additional libraries are recommended to best profit from the abn features.

INLA, which is an R package used for model fitting. It is hosted separately from CRAN and is easy to install on common platforms (see instructions on the INLA website).

install.packages("INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)

Rgraphviz is used to produce plots of network graphs and is hosted on Bioconductor.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Rgraphviz", version = "3.8")

JAGS is a program for analysing Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation. Its installation is platform-dependent and is, therefore, not covered here.

Quickstart

Explore the basics of data analysis using additive Bayesian networks with the abn package through our simple example. The datasets required for these examples are included within the abn package.

For a deeper understanding, refer to the manual pages on the abn homepage, which include numerous examples. Key pages to visit are fitAbn(), buildScoreCache(), mostProbable(), and searchHillClimber(). Also, see the examples below for a quick overview of the package's capabilities.

Features

The R package abn provides routines for determining optimal additive Bayesian network models for a given data set. The core functionality is concerned with model selection - determining the most likely model of data from interdependent variables. The model selection process can incorporate expert knowledge by specifying structural constraints, such as which arcs are banned or retained.

The general workflow with abn follows a three-step process:

Determine the model search space: The function buildScoreCache() builds a cache of pre-computed scores for each possible DAG. For this, it's required to specify the data types of the variables in the data set and the structural constraints of the model (e.g. which arcs are banned or retained and the maximum number of parents per node).
Structure learning: abn offers different structure learning algorithms:
- The exact structure learning algorithm from Koivisto and Sood (2004) is implemented in C and can be called with the function mostProbable(), which finds the most probable DAG for a given data set.
- The function searchHeuristic() provides a set of heuristic search algorithms. These include the hill-climber, tabu search, and simulated annealing algorithms implemented in R.
- searchHillClimber() searches for high-scoring DAGs using a random re-start greedy hill-climber heuristic search and is implemented in C. It slightly deviates from the method initially presented by Heckerman et al. 1995 (for details consult the respective help page ?abn::searchHillClimber()).
Parameter estimation: The function fitAbn() estimates the model's parameters based on the DAG from the previous step.
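
To get a feel for how the maximum number of parents shapes the search space in the first step, the following base R sketch counts candidate parent sets per node. It assumes that the cache holds one score per node and candidate parent set, and it ignores banned or retained arcs; n_parent_sets() is a hypothetical helper written for this illustration, not part of abn.

```r
# Number of candidate parent sets for a single node among n_nodes variables
# when at most max_parents parents are allowed (no ban/retain constraints).
n_parent_sets <- function(n_nodes, max_parents) {
  k <- 0:min(max_parents, n_nodes - 1)
  sum(choose(n_nodes - 1, k))
}

per_node_limited   <- n_parent_sets(5, 2)  # 1 + 4 + 6 parent sets
per_node_unlimited <- n_parent_sets(5, 4)  # all 2^4 subsets of the other nodes
total_limited      <- 5 * per_node_limited # summed over the five nodes

print(c(per_node_limited = per_node_limited,
        per_node_unlimited = per_node_unlimited,
        total_limited = total_limited))
```

Lowering the parent limit shrinks the number of combinations that would have to be scored, which is why it is a common lever when the cache takes long to build.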

abn allows for two different model formulations, specified with the argument method:

method = "mle" fits a model under the frequentist paradigm using information-theoretic criteria to select the best model.
method = "bayes" estimates the posterior distribution of the model parameters using Laplace approximations, a method for Bayesian inference that serves as an alternative to Markov Chain Monte Carlo (MCMC). A standard Laplace approximation is implemented in the abn source code; in specific cases (see help page ?fitAbn), abn switches to the Integrated Nested Laplace Approximation from the INLA package, which must then be installed.

To generate new observations from a fitted ABN model, the function simulateAbn() simulates data based on the DAG and the estimated parameters from the previous step. simulateAbn() is available for both method = "mle" and method = "bayes" and requires the installation of the JAGS package.

Supported Data types

The abn package supports the following distributions for the variables in the network:

Gaussian distribution for continuous variables.
Binomial distribution for binary variables.
Poisson distribution for variables with count data.
Multinomial distribution for categorical variables (only available with method = "mle").

Unlike other packages, abn does not restrict the combination of parent-child distributions.
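
As a rough illustration of how column types relate to the distributions listed above, the sketch below proposes a distribution name for each column of a toy data frame. The helper suggest_dist() and the toy data are hypothetical and not part of abn; the heuristic (two-level factor, multi-level factor, non-negative whole numbers, other numeric) is an assumption, and the analyst should always review the result.

```r
# Hypothetical heuristic: propose a distribution name per column.
suggest_dist <- function(x) {
  if (is.factor(x)) {
    if (nlevels(x) == 2) "binomial" else "multinomial"
  } else if (is.numeric(x) &&
             all(x >= 0, na.rm = TRUE) &&
             all(x == round(x), na.rm = TRUE)) {
    "poisson"     # non-negative whole numbers are treated as counts
  } else if (is.numeric(x)) {
    "gaussian"
  } else {
    stop("unsupported column type")
  }
}

# Made-up toy data for illustration only
toy <- data.frame(
  weight = c(2.3, 3.1, 2.8, 3.6),
  sick   = factor(c("no", "yes", "no", "yes")),
  visits = c(0L, 2L, 1L, 4L),
  breed  = factor(c("a", "b", "c", "a"))
)

dists <- lapply(toy, suggest_dist)  # named list, one entry per column
str(dists)
```

The resulting named list has the same shape as the distribution lists used in the examples further down.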

Multilevel Models for Grouped Data Structures

The analysis of "hierarchical" or "grouped" data, in which observations are nested within higher-level units, requires statistical models with parameters that vary across groups (e.g. mixed-effect models).

abn allows controlling for one-layer clustering, where observations are grouped into a single layer of clusters that are themselves assumed to be independent, while observations within a cluster may be correlated (e.g. students nested within schools, measurements over time for each patient, etc.). The argument group.var specifies the discrete variable that defines the group structure. The model is then fitted separately for each group, and the results are combined.

For example, studying student test scores across different schools, a varying intercept model would allow for the possibility that average test scores (the intercept) might be higher in one school than another due to factors specific to each school. This can be modelled in abn by setting the argument group.var to the variable containing the school names. The model is then fitted as a varying intercept model, where the intercept is allowed to vary across schools, but the slope is assumed to be the same for all schools.

Under the frequentist paradigm (method = "mle"), abn relies on the lme4 package to fit generalised linear mixed models (GLMMs) for Binomial, Poisson, and Gaussian distributed variables. For multinomial distributed variables, abn fits a multinomial baseline category logit model with random effects using the mclogit package. Currently, only one-layer clustering is supported (e.g., for method = "mle", this corresponds to a random intercept model).
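
The school example can be made concrete for a single Gaussian node. The sketch below simulates test scores with school-specific intercepts and a common slope, then fits a random intercept model with lme4 if that package is installed. The data, variable names, and effect sizes are made up for illustration, and this is not how abn calls lme4 internally.

```r
set.seed(1)
n_schools  <- 8
n_students <- 30

school        <- factor(rep(seq_len(n_schools), each = n_students))
school_effect <- rnorm(n_schools, mean = 0, sd = 2)  # per-school intercept shift
hours         <- runif(n_schools * n_students, min = 0, max = 10)

# Same slope for all schools, intercept varies by school
score <- 50 + school_effect[as.integer(school)] + 1.5 * hours +
  rnorm(n_schools * n_students, sd = 1)
toy <- data.frame(score, hours, school)

if (requireNamespace("lme4", quietly = TRUE)) {
  m <- lme4::lmer(score ~ hours + (1 | school), data = toy)
  print(lme4::fixef(m))  # common intercept and slope
  print(lme4::ranef(m))  # estimated per-school intercept deviations
}
```

The term (1 | school) is what makes the intercept vary by school while the slope for hours stays shared.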

With a Bayesian approach (method = "bayes"), abn relies on its own implementation of the Laplace approximation and the package INLA to fit a single-level hierarchical model for Binomial, Poisson, and Gaussian distributed variables. Multinomial distributed variables in general (see Section Supported Data Types) are not yet implemented with method = "bayes".

Basic Background

Bayesian network modelling is a data analysis technique ideally suited to messy, highly correlated and complex datasets. This methodology is rather distinct from other forms of statistical modelling in that its focus is on structure discovery—determining an optimal graphical model that describes the interrelationships in the underlying processes that generated the data. It is a multivariate technique and can be used for one or many dependent variables. This is a data-driven approach, as opposed to relying only on subjective expert opinion to determine how variables of interest are interrelated (for example, structural equation modelling).

Below and on the package's website, we provide some cookbook-type examples of how to perform Bayesian network structure discovery analyses with observational data. The particular type of Bayesian network models considered here are additive Bayesian networks. These are rather different, mathematically speaking, from the standard form of Bayesian network models (for binary or categorical data) presented in the academic literature, which typically use an analytically elegant but arguably interpretation-wise opaque contingency table parametrisation. An additive Bayesian network model is simply a multidimensional regression model, e.g. directly analogous to generalised linear modelling but with all variables potentially dependent.

An example can be found in the American Journal of Epidemiology, where this approach was used to investigate risk factors for child diarrhoea. A special issue of Preventive Veterinary Medicine on graphical modelling features several articles that use abn to fit epidemiological data. Introductions to this methodology can be found in Emerging Themes in Epidemiology and in Computers in Biology and Medicine where it is compared to other approaches.

What is an additive Bayesian network?

Additive Bayesian network (ABN) models are statistical models that use the principles of Bayesian statistics and graph theory. They provide a framework for representing data with multiple variables, known as multivariate data.

ABN models are a graphical representation of (Bayesian) multivariate regression. This form of statistical analysis enables the prediction of multiple outcomes from a given set of predictors while simultaneously accounting for the relationships between these outcomes.

In other words, additive Bayesian network models extend the concept of generalised linear models (GLMs), which are typically used to predict a single outcome, to scenarios with multiple dependent variables. This makes them a powerful tool for understanding complex, multivariate datasets.
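
One way to picture this is a set of regressions, one per node, each with the node's parents as covariates. The base R sketch below simulates data from a small, hypothetical DAG (G1 -> B1, G1 -> G2, B1 -> G2) and fits one GLM per node. It only illustrates the idea under these assumptions and is not the estimation routine used by abn.

```r
set.seed(42)
n <- 500

# Simulate from the hypothetical DAG: G1 -> B1, G1 -> G2, B1 -> G2
G1 <- rnorm(n)
B1 <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * G1))
G2 <- 1 + 0.8 * G1 - 1.5 * B1 + rnorm(n, sd = 0.5)
d  <- data.frame(G1, B1, G2)

# One regression per node, with the node's parents as covariates
fits <- list(
  G1 = glm(G1 ~ 1,       family = gaussian, data = d),
  B1 = glm(B1 ~ G1,      family = binomial, data = d),
  G2 = glm(G2 ~ G1 + B1, family = gaussian, data = d)
)

# The joint log-likelihood factorises into the node-wise contributions
total_loglik <- sum(vapply(fits, function(f) as.numeric(logLik(f)), numeric(1)))

print(lapply(fits, coef))
print(total_loglik)
```

Each node is both a potential outcome and a potential covariate, which is what distinguishes this setup from a single GLM with one designated response.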

The term Bayesian network is interpreted differently across various fields.

Bayesian network models often involve binary nodes, arguably the most frequently used type of Bayesian network. These models typically use a contingency table instead of an additive parameter formulation. This approach allows for mathematical elegance and enables key metrics like model goodness of fit and marginal posterior parameters to be estimated analytically (i.e., from a formula) rather than numerically (an approximation). However, this parametrisation may not be parsimonious, and the interpretation of the model parameters is less straightforward than the usual Generalized Linear Model (GLM) type models, which are prevalent across all scientific disciplines.
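
The point about parsimony can be illustrated with a simple count. For a binary node with k binary parents, a full contingency table needs one probability per parent configuration, whereas an additive logistic model with main effects only needs an intercept plus one coefficient per parent. The sketch assumes binary parents and no interaction terms.

```r
k <- 1:6  # number of binary parents of a binary node

param_count <- data.frame(
  parents           = k,
  contingency_table = 2^k,   # one probability per parent configuration
  additive_glm      = k + 1  # intercept + one coefficient per parent
)
print(param_count)
```

The gap grows exponentially with the number of parents, which is the sense in which the contingency table parametrisation may not be parsimonious.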

While this is a crucial practical distinction, it’s a relatively low-level technical one, as the primary aspect of BN modelling is that it’s a form of graphical modelling – a model of the data’s joint probability distribution. This joint – multidimensional – aspect makes this methodology highly attractive for complex data analysis and sets it apart from more standard regression techniques, such as GLMs, GLMMs, etc., which are only one-dimensional as they assume all covariates are independent. While this assumption is entirely reasonable in a classical experimental design scenario, it’s unrealistic for many observational studies in fields like medicine, veterinary science, ecology, and biology.

Examples

Example 1: Basic usage
Example 2: Restrict model search space
Example 3: Grouped Data Structures
Example 4: Using INLA vs internal Laplace approximation

Example 1: Basic Usage

This example shows the basic workflow:

library(abn)

# Built-in toy dataset with two Gaussian variables G1 and G2, two Binomial variables B1 and B2, and one multinomial variable C
str(g2b2c_data)

# Define the distributions of the variables
dists <- list(G1 = "gaussian",
              B1 = "binomial",
              B2 = "binomial",
              C = "multinomial",
              G2 = "gaussian")


# Build the score cache
cacheMLE <- buildScoreCache(data.df = g2b2c_data,
                         data.dists = dists,
                         method = "mle",
                         max.parents = 2)

# Find the most probable DAG
dagMP <- mostProbable(score.cache = cacheMLE)

# Print the most probable DAG
print(dagMP)

# Plot the most probable DAG
plot(dagMP)

# Fit the most probable DAG
myfit <- fitAbn(object = dagMP,
                method = "mle")

# Print the fitted DAG
print(myfit)

Example 2: Restrict Model Search Space

Based on example 1, we may know that the arc G1->G2 is not possible and that the arc from C -> G2 must be present. This "expert knowledge" can be included in the model by banning the arc from G1 to G2 and retaining the arc from C to G2.

The retain and ban matrices are specified as adjacency matrices of 0 and 1 entries, where 1 indicates that the arc is banned or retained, respectively. Row and column names must match the variable names in the data set. Each row is a child node, and each column is a potential parent of that child. For example, a 1 in row G2 and column G1 of the ban matrix indicates that G1 is banned as a parent of G2.

Further, we can restrict the maximum number of parents per node to 2.

# Ban the edge G1 -> G2 (row = child G2, column = parent G1)
banmat <- matrix(0, nrow = 5, ncol = 5, dimnames = list(names(dists), names(dists)))
banmat["G2", "G1"] <- 1

# Always retain the edge C -> G2
retainmat <- matrix(0, nrow = 5, ncol = 5, dimnames = list(names(dists), names(dists)))
retainmat[5, 4] <- 1

# Limit the maximum number of parents to 2
max.par <- 2

# Build the score cache
cacheMLE_small <- buildScoreCache(data.df = g2b2c_data,
                            data.dists = dists,
                            method = "mle",
                            dag.banned = banmat,
                            dag.retained = retainmat,
                            max.parents = max.par)
print(paste("Without restrictions from example 1: ", nrow(cacheMLE$node.defn)))
print(paste("With restrictions as in example 2: ", nrow(cacheMLE_small$node.defn)))
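
Because the row-is-child, column-is-parent convention is easy to get backwards, it can help to print the arcs that a ban or retain matrix encodes before building the cache. The following base R sketch does this with named indexing; list_arcs() is a hypothetical helper, not an abn function.

```r
vars <- c("G1", "B1", "B2", "C", "G2")

banned <- matrix(0, nrow = 5, ncol = 5, dimnames = list(vars, vars))
banned["G2", "G1"] <- 1  # row = child G2, column = parent G1

# Hypothetical helper: list the arcs encoded in the matrix as "parent -> child"
list_arcs <- function(m) {
  idx <- which(m == 1, arr.ind = TRUE)
  if (nrow(idx) == 0) return(character(0))
  paste(colnames(m)[idx[, "col"]], "->", rownames(m)[idx[, "row"]])
}

print(list_arcs(banned))
```

Indexing by name rather than by position also keeps the matrix correct if the order of the variables changes.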

Example 3: Grouped Data Structures

Depending on the data structure, we may want to control for one-layer clustering, where observations are grouped into a single layer of clusters that are themselves assumed to be independent, but observations within the clusters may be correlated (e.g., students nested within schools, measurements over time for each patient, etc.).

Currently, abn supports only one-layer clustering.

# Built-in toy data set
str(g2pbcgrp)

# Define the distributions of the variables
dists <- list(G1 = "gaussian",
              P = "poisson",
              B = "binomial",
              C = "multinomial",
              G2 = "gaussian") # group is not among the list of variable distributions

# Ban arcs such that C has only B and P as parents
ban.mat <- matrix(0, nrow = 5, ncol = 5, dimnames = list(names(dists), names(dists)))
ban.mat[4, 1] <- 1
ban.mat[4, 4] <- 1
ban.mat[4, 5] <- 1

# Build the score cache
cache <- buildScoreCache(data.df = g2pbcgrp,
                         data.dists = dists,
                         group.var = "group",
                         dag.banned = ban.mat,
                         method = "mle",
                         max.parents = 2)

# Find the most probable DAG
dag <- mostProbable(score.cache = cache)

# Plot the most probable DAG
plot(dag)

# Fit the most probable DAG
fit <- fitAbn(object = dag,
              method = "mle")

# Plot the fitted DAG
plot(fit)

# Print the fitted DAG
print(fit)

Example 4: Using INLA vs internal Laplace approximation

Under a Bayesian approach, abn automatically switches to the Integrated Nested Laplace Approximation from the INLA package if the internal Laplace approximation fails to converge. However, we can also force the use of INLA by setting the argument control=list(max.mode.error=100).

The following example shows that the results are very similar. It also shows how to constrain arcs as formula objects and how to specify different parent limits for each node separately.

library(abn)

# Subset of the built-in dataset, see ?ex0.dag.data
mydat <- ex0.dag.data[,c("b1","b2","g1","g2","b3","g3")] ## take a subset of cols

# setup distribution list for each node
mydists <- list(b1="binomial", b2="binomial", g1="gaussian",
                g2="gaussian", b3="binomial", g3="gaussian")

# Structural constraints
## ban arc from b2 to b1
## always retain arc from g2 to g1
## parent limits - can be specified for each node separately
max.par <- list("b1"=2, "b2"=2, "g1"=2, "g2"=2, "b3"=2, "g3"=2)

# now build the cache of pre-computed scores according to the structural constraints
res.c <- buildScoreCache(data.df=mydat, data.dists=mydists,
                         dag.banned= ~b1|b2, 
                         dag.retained= ~g1|g2, 
                         max.parents=max.par)


# repeat but using R-INLA. The mlik's should be virtually identical.
if(requireNamespace("INLA", quietly = TRUE)){
  res.inla <- buildScoreCache(data.df=mydat, data.dists=mydists,
                              dag.banned= ~b1|b2, # ban arc from b2 to b1
                              dag.retained= ~g1|g2, # always retain arc from g2 to g1
                              max.parents=max.par,
                              control=list(max.mode.error=100)) # force the use of INLA
  
  ## comparison - very similar
  difference <- res.c$mlik - res.inla$mlik
  summary(difference)
}

Contributing

We greatly appreciate contributions from the community and are excited to welcome you to the development process of the abn package. Here are some guidelines to help you get started:

Seeking Support: If you need help with using the abn package, you can seek support by creating a new issue on our GitHub repository. Please describe your problem in detail and include a minimal reproducible example if possible.
Reporting Issues or Problems: If you encounter any issues or problems with the software, please report them by creating a new issue on our GitHub repository. When reporting an issue, try to include as much detail as possible, including steps to reproduce the issue, your operating system and R version, and any error messages you received.
Software Contributions: We encourage contributions directly via pull requests on our GitHub repository. Before starting your work, please first create an issue describing the contribution you wish to make. This allows us to discuss and agree on the best way to integrate your contribution into the package.

By participating in this project, you agree to abide by our code of conduct. We are committed to making participation in this project a respectful and harassment-free experience for everyone.

Citation

If you use abn in your research, please cite it as follows:

> citation("abn")
To cite the methodology of the R package 'abn' use:

  Kratzer G, Lewis F, Comin A, Pittavino M, Furrer R (2023). “Additive Bayesian Network Modeling with the R Package abn.” _Journal of Statistical Software_,
  *105*(8), 1-41. doi:10.18637/jss.v105.i08 <https://doi.org/10.18637/jss.v105.i08>.

To cite an example of a typical ABN analysis use:

  Kratzer, G., Lewis, F.I., Willi, B., Meli, M.L., Boretti, F.S., Hofmann-Lehmann, R., Torgerson, P., Furrer, R. and Hartnack, S. (2020). Bayesian Network
  Modeling Applied to Feline Calicivirus Infection Among Cats in Switzerland. Frontiers in Veterinary Science, 7, 73

To cite the software implementation of the R package 'abn' use:

  Furrer, R., Kratzer, G. and Lewis, F.I. (2023). abn: Modelling Multivariate Data with Additive Bayesian Networks. R package version 2.7-2.
  https://CRAN.R-project.org/package=abn

License

The abn package is licensed under the GNU General Public License v3.0.

Code of Conduct

Please note that the abn project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Applications

The abn website provides a comprehensive set of documented case studies, numerical accuracy/quality assurance exercises, and additional documentation.

Technical articles

Kratzer et al. (2023): Additive Bayesian Network Modeling with the R Package abn
Kratzer et al. (2020): Bayesian Networks modeling applied to Feline Calicivirus infection among cats in Switzerland
Kratzer et al. (2018): Comparison between Suitable Priors for Additive Bayesian Networks
Koivisto et al. (2004): Exact Bayesian structure discovery in Bayesian networks
Friedman et al. (2003): Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks
Friedman et al. (1999): Data analysis with Bayesian networks: A bootstrap approach
Heckerman et al. (1995): Learning Bayesian Networks – The Combination of Knowledge And Statistical-Data

Application articles

Delucchi et al. (2022): Bayesian network analysis reveals the interplay of intracranial aneurysm rupture risk factors
Guinat et al. (2020): Biosecurity risk factors for highly pathogenic avian influenza (H5N8) virus infection in duck farms, France
Hartnack et al. (2019): Additive Bayesian networks for antimicrobial resistance and potential risk factors in non-typhoidal Salmonella isolates from layer hens in Uganda
Ruchti et al. (2019): Progression and risk factors of pododermatitis in part-time group housed rabbit does in Switzerland
Comin et al. (2019): Revealing the structure of the associations between housing system, facilities, management and welfare of commercial laying hens using Additive Bayesian Networks
Ruchti et al. (2018): Pododermatitis in group housed rabbit does in Switzerland – prevalence, severity and risk factors
Pittavino et al. (2017): Comparison between generalised linear modelling and additive Bayesian network; identification of factors associated with the incidence of antibodies against Leptospira interrogans sv Pomona in meat workers in New Zealand
Hartnack et al. (2017): Attitudes of Austrian veterinarians towards euthanasia in small animal practice: impacts of age and gender on views on euthanasia
Lewis et al. (2012): Revealing the Complexity of Health Determinants in Resource-poor Settings
Lewis et al. (2011): Structure discovery in Bayesian networks: An analytical tool for analysing complex animal health data

Workshops

Causality:

4 December 2018, talk by Beate Sick & Gilles Kratzer at the 1st Causality workshop: Bayesian Networks meet Observational data. (UZH, Switzerland)

ABN modeling

07 July 2021, workshop at the UseR! Conference on Additive Bayesian Networks Modeling. (Online)
29 March 2019, workshop at the SVEPM conference on Multivariate analysis using Additive Bayesian Networks. (Utrecht, Netherlands)

Presentations

4 October 2018, talk at Nutricia (Danone). Multivariable analysis: variable and model selection in system epidemiology. (Utrecht, Netherlands)
30 May 2018. Brown Bag Seminar in ZHAW. Presentation: Bayesian Networks Learning in a Nutshell. (Winterthur, Switzerland)

abn's Issues

start with new development environment (basic setup of a package) for abn 4.0.0

new branch in the public abn repo where we build a new version of the package from scratch

Add urlchecks to the tests

This is related to #9

We include sanity checks on URLs/URIs in the testing procedure, also because CRAN does the same when a package is submitted.

The check can be performed with https://github.com/r-lib/urlchecker which might even update permanent redirects (301s).

Because such sanity checks are generally relevant, https://github.com/r-lib/urlchecker should already be installed in the testing container. This issue therefore depends on the resolution of furrer-lab/r-containers#21

JOSS Submission Checklist

The software must be open source as per the OSI definition.
The software must be hosted at a location where users can open issues and propose code changes without manual approval of (or payment for) accounts. furrer-lab/devel-abn#134
The software must have an obvious research application.
You must be a major contributor to the software you are submitting, and have a GitHub account to participate in the review process.
Your paper must not focus on new research results accomplished with the software.
Your paper (paper.md and BibTeX files, plus any figures) must be hosted in a Git-based repository together with your software (although they may be in a short-lived branch which is never merged with the default).

In addition, the software associated with your submission must:

Be stored in a repository that can be cloned without registration.
Be stored in a repository that is browsable online without registration.
Have an issue tracker that is readable without registration.
Permit individuals to create issues/file tickets against your repository.

In addition, JOSS requires that software should be

feature-complete (i.e., no half-baked solutions),
packaged appropriately according to common community standards for the programming language being used (e.g., Python, R), furrer-lab/devel-abn#134
and designed for maintainable extension (not one-off modifications of existing tools). “Minor utility” packages, including “thin” API clients, and single-function packages are not acceptable.

Co-publication of science, methods, and software:

We ask that authors indicate whether related publications (published, in review, or nearing submission) exist as part of submitting to JOSS.

CoI Policy:

furrer-lab/devel-abn#140
furrer-lab/devel-abn#141
Review process: Editors and reviewers must be informed of any potential conflicts of interest before reviewing the manuscript to ensure unbiased evaluation of the research.
Compliance: Authors who fail to comply with the COI policy may have their manuscript rejected or retracted if a conflict is discovered after publication.
Review and Update: This COI policy will be reviewed and updated regularly to ensure it remains relevant and effective.

What should my paper contain?
Given this format, a “full length” paper is not permitted, and software documentation such as API (Application Programming Interface) functionality should not be in the paper and instead should be outlined in the software documentation.

furrer-lab/devel-abn#142
A summary describing the high-level functionality and purpose of the software for a diverse, non-specialist audience.
A Statement of need section that clearly illustrates the research purpose of the software and places it in the context of related work.
A list of key references, including to other software addressing related needs. Note that the references should include full names of venues, e.g., journals and conferences, not abbreviations only understood in the context of a specific discipline.
Mention (if applicable) a representative set of past or ongoing research projects using the software and recent scholarly publications enabled by it.
Acknowledgement of any financial support. see furrer-lab/devel-abn#141
Citations
Bibliographic data should be collected in a file paper.bib; it should be formatted in the BibLaTeX format, although plain BibTeX is acceptable as well. see furrer-lab/devel-abn#139
- references include full names of venues, e.g., journals and conferences, not abbreviations only understood in the context of a specific discipline.

Checking that your paper compiles

Use the Open Journals GitHub Action to automatically compile your paper each time you update your repository.

This relates to #33

Session crashes / low-level error from irls_poisson_fast.cpp if no solution found by solve()

Tag based deployment pipeline including fast/slow checks

We want to implement a robust testing and deployment pipeline.

The idea is that creating a new tag on the master branch triggers a CRAN submission, provided that our fast-running checks have passed. In this case, we also want to start a slow run to monitor, among other things, memory leaks. If the slow run succeeds, the pipeline can create a new release from the tag.

Conditions:

Perform a fast check on
- every commit to master
- whenever there is a pull request to master
Perform a slow check when (fast run was successful and)
- new tag is created on master (optionally only if it is a release candidate, so x.x.x-rc)
Submit to CRAN (fast run was successful and)
- new tag is created on master
(optionally) create a release (or new tag without -rc):
- slow check terminates without errors

Add more checks for the individual control parameters of `fit.control()` and `build.control()`

Currently, not all control parameters are checked for eligibility.
Extend for build.control() here:
https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/abn-internal.R#L650-L697

and extend for fit.control() here:
https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/abn-internal.R#L766-L815
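
As a starting point for discussion, an eligibility check for a single control parameter might look like the sketch below. It uses max.mode.error, which appears in the README examples; the helper check_control() is hypothetical, and the accepted range of 0 to 100 is an assumption that would need to be confirmed against the help pages.

```r
# Hypothetical validator; the range 0-100 for max.mode.error is an assumption.
check_control <- function(ctrl) {
  stopifnot(is.list(ctrl))
  v <- ctrl[["max.mode.error"]]  # exact matching, NULL if not supplied
  if (!is.null(v)) {
    if (!is.numeric(v) || length(v) != 1 || is.na(v) || v < 0 || v > 100) {
      stop("'max.mode.error' must be a single number between 0 and 100")
    }
  }
  invisible(ctrl)
}

# A valid setting passes silently
check_control(list(max.mode.error = 100))

# An invalid setting produces an informative error message
msg <- tryCatch(
  check_control(list(max.mode.error = "a lot")),
  error = function(e) conditionMessage(e)
)
print(msg)
```

One small check per parameter keeps the error messages specific, so users can tell which control argument is invalid.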

Can we speed up mostProbable?

Decide if it is worth speeding up mostProbable().

For this example, it takes quite a while to run:

  # get data
  mydat <- ex5.dag.data[,-19] ## get the data - drop group variable

  # Restrict DAG
  banned<-matrix(c(
    # 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
    0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b1
    1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b2
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b3
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b4
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b5
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b6
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g1
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g2
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g3
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g4
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g5
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g6
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g7
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g8
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g9
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g10
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g11
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 # g12
  ),byrow=TRUE,ncol=18)

  colnames(banned)<-rownames(banned)<-names(mydat)

  retain<-matrix(c(
    # 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b1
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b2
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b3
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b4
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b5
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b6
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g1
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g2
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g3
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g4
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g5
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g6
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g7
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g8
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g9
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g10
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g11
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 # g12
  ),byrow=TRUE,ncol=18)
  ## again must set names
  colnames(retain)<-rownames(retain)<-names(mydat)

  # set distributions
  mydists<-list(b1="binomial",
                b2="binomial",
                b3="binomial",
                b4="binomial",
                b5="binomial",
                b6="binomial",
                g1="gaussian",
                g2="gaussian",
                g3="gaussian",
                g4="gaussian",
                g5="gaussian",
                g6="gaussian",
                g7="gaussian",
                g8="gaussian",
                g9="gaussian",
                g10="gaussian",
                g11="gaussian",
                g12="gaussian"
  )

  # Compute score cache
  mycache.1par <- buildScoreCache(data.df=mydat,data.dists=mydists, max.parents=1,centre=TRUE)

  # Estimate most probable DAG
  mp.dag <- mostProbable(score.cache = mycache.1par)

p-values for mixed-effects with `afex::mixed()` instead of `lme4::glmer()`

Should we use afex::mixed() instead of lme4::glmer()? It would additionally return p-values. See: https://mspeekenbrink.github.io/sdam-r-companion/generalized-linear-models.html#generalized-linear-mixed-effects-models

JOSS Review Checklist

General checks

Repository: Is the source code for this software available at the repository url? furrer-lab/devel-abn#134
License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
Contribution and authorship: Has the submitting author made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

Installation: Does installation proceed as outlined in the documentation? #70
Functionality: Have the functional claims of the software been confirmed?
Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
furrer-lab/devel-abn#138

Software paper

Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
A statement of need: Does the paper have a section titled ‘Statement of need’ that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
State of the field: Do the authors describe how this software compares to other commonly-used packages?
Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
furrer-lab/devel-abn#139

Numerical variations in IRLS for Poissons

irls_poisson_fast.cpp results in slightly different score values compared to the same model computed with glm.

Steps to Reproduce

Compare mycache.mle with modglm in test-build_score_cache_mle.R.

Current Bug Behaviour

> mycache.mle$mlik
[1] -1418.438      -Inf
> logLik(modglm)
'log Lik.' -1410.645 (df=2)

Analogous for AIC and BIC scores.

Actual expected Behaviour

I'm unsure if this variation in score values is expected.

Relevant Logs

This was temporarily fixed with an increased tolerance to pass the tests.

Possible Solutions

Double-check IRLS Poisson Fast algorithm. It has been shown that numerical overflow is not handled properly for large values of eta. Unsure if eta should ever be that large or if this was only caused by a faulty test. If the latter, consider catching such cases upstream properly and investigate why glm did not raise a warning.
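
As a reference point for such a double-check, here is a minimal sketch of a textbook IRLS for a Poisson model with log link, compared against `glm()`. It is written in plain R and is not the package's `irls_poisson_fast.cpp`; the function name `irls_poisson` and the simulated data are assumptions made for illustration only.

```r
# Textbook IRLS for Poisson regression with log link (illustrative only,
# not the implementation in irls_poisson_fast.cpp).
irls_poisson <- function(X, y, tol = 1e-10, max_iter = 50) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    eta <- drop(X %*% beta)           # linear predictor
    mu  <- exp(eta)                   # mean; overflows if eta is very large
    z   <- eta + (y - mu) / mu        # working response
    W   <- mu                         # working weights
    beta_new <- drop(solve(crossprod(X, W * X), crossprod(X, W * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

# Simulated data (assumption: small, well-behaved example)
set.seed(42)
x <- rnorm(200)
y <- rpois(200, exp(0.5 + 0.3 * x))
X <- cbind(1, x)

beta_irls <- irls_poisson(X, y)
fit_glm   <- glm(y ~ x, family = poisson)

# Log-likelihood at the IRLS estimate versus the one reported by glm()
ll_irls <- sum(dpois(y, exp(drop(X %*% beta_irls)), log = TRUE))
ll_glm  <- as.numeric(logLik(fit_glm))
```

On well-conditioned data like this the two log-likelihoods should agree closely, so a gap of several units, as in the output above, would point to a difference in the model or the data passed in rather than to rounding.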

Fix CRAN submission

CRAN has the package archived. Fix this with a new release.

capture output of tests on windows to "dev/null"

The counterpart of "/dev/null" on Windows is "nul".
https://stackoverflow.com/questions/4507312/how-to-redirect-stderr-to-null-in-cmd-exe

Currently, tests whose output is captured in "/dev/null" are omitted on Windows.
Consider something like this instead:

test_that("plot.abnDag() works.", {
  mydag <- createAbnDag(dag = ~a+b|a, data.df = data.frame("a"=1, "b"=1))

  if(.Platform$OS.type == "unix") {
    FILE <- "/dev/null"
  } else {
    FILE <- "nul"
  }
  capture.output({
    expect_no_error({
      plot(mydag)
      })
    },
    file = FILE)
})

Not sure this example works well...
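
An alternative that avoids the platform switch is base R's `nullfile()` (available since R 3.6.0), which returns the null device name for the current platform. The sketch below is a suggestion only; the wrapper name `quiet` is hypothetical and not part of abn or testthat.

```r
# Hypothetical helper: evaluate an expression while discarding printed output.
# nullfile() returns "/dev/null" on Unix-alikes and "nul:" on Windows.
quiet <- function(expr) {
  utils::capture.output(result <- expr, file = nullfile())
  invisible(result)
}

# The printed text is discarded, the value of the expression is kept.
out <- quiet({
  print("this is not shown")
  42
})
```

Inside a test, the body of `expect_no_error()` could be wrapped in the same way, so the test runs identically on all platforms.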

Staggered run of fast tests

We could potentially save a considerable amount of compute by modifying the regular runs of the fast pipeline so that a single job runs first and all the other flavors run only if it does not fail.

We might even consider designing the fast pipeline to only run on a subset of flavors and postpone the extensive checks (i.e. on all combinations) to ongoing pull requests and commits to master.

buildScoreCache: `max.parents` as list and `defn.res` doesn't work

If max.parents is given as a list, this will fail:
https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/build_score_cache.R#L523-L524
Catch it and/or resolve the max.parents list (e.g. when all items in the list are equal).
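
One way to resolve the list case could look like the sketch below. The helper `resolve_max_parents` is hypothetical (it does not exist in abn) and only covers the situation mentioned above, where all list entries are equal.

```r
# Hypothetical helper, not part of abn: collapse a per-node `max.parents`
# list to a single number when every entry is identical, otherwise stop
# with an informative error.
resolve_max_parents <- function(max.parents) {
  if (!is.list(max.parents)) {
    return(max.parents)
  }
  values <- unique(unlist(max.parents))
  if (length(values) == 1L) {
    return(values)
  }
  stop("`max.parents` differs between nodes; node-specific limits are not supported here.")
}

resolve_max_parents(list(a = 2, b = 2, c = 2))  # returns 2
```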

setup devel environment on public

Reduce tarball size <5MB

Thanks, we see:

Size of tarball: 7142960 bytes

Please reduce to less than 5 MB.

documentation of the testing procedure

the usage of r-containers
- how to change container tag
- note that latest containers are updated each month

Add section to README (subsection of contributing) Development Environment and Testing

pipeline to publish pkgdown site

Basically:

Recompile the vignettes by running vignettes/precomile.R (see #11).
Use the pkgdown actions workflow or run pkgdown::build_site().

Catch INLA availability in all examples

Go through all examples. Run those that might require INLA only if INLA is available.

Flavor: r-devel-linux-x86_64-debian-gcc
Check: package dependencies, Result: NOTE
  Package suggested but not available for checking: 'INLA'

Flavor: r-devel-linux-x86_64-debian-gcc
Check: examples, Result: ERROR
  Running examples in 'abn-Ex.R' failed
  The error most likely occurred in:
  
  > base::assign(".ptime", proc.time(), pos = "CheckExEnv")
  > ### Name: print.abnCache
  > ### Title: Print objects of class 'abnCache'
  > ### Aliases: print.abnCache
  >
  > ### ** Examples
  >
  > ## Subset of the build-in dataset, see  ?ex0.dag.data
  > mydat <- ex0.dag.data[,c("b1","b2","g1","g2","b3","g3")] ## take a subset of cols
  >
  > ## setup distribution list for each node
  > mydists <- list(b1="binomial", b2="binomial", g1="gaussian",
  +                 g2="gaussian", b3="binomial", g3="gaussian")
  >
  > # Structural constraints
  > # ban arc from b2 to b1
  > # always retain arc from g2 to g1
  >
  > ## parent limits
  > max.par <- list("b1"=2, "b2"=2, "g1"=2, "g2"=2, "b3"=2, "g3"=2)
  >
  > ## now build the cache of pre-computed scores accordingly to the structural constraints
  >
  > res.c <- buildScoreCache(data.df=mydat, data.dists=mydists,
  +                          dag.banned= ~b1|b2, dag.retained= ~g1|g2, max.parents=max.par)
  Error in library(p, character.only = TRUE) :
    there is no package called 'INLA'
  Calls: buildScoreCache ... buildScoreCache.bayes -> %do% -> <Anonymous> -> library
  Execution halted
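
A common pattern for this kind of guard is `requireNamespace()`, which checks for a suggested package without attaching it. The sketch below shows the shape of such a guard; the helper name `inla_available` is hypothetical and the guarded body is left as a placeholder.

```r
# Hypothetical helper: TRUE if the suggested package INLA can be loaded.
inla_available <- function() {
  requireNamespace("INLA", quietly = TRUE)
}

if (inla_available()) {
  # INLA-dependent example code would go here, e.g. the
  # buildScoreCache() call from the log above.
  message("INLA found: running the example.")
} else {
  message("INLA not installed: skipping the example.")
}
```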

setup test pipeline on abn

Move the test pipeline from the private devel-abn repo to this public repo.

speed up buildScoreCache

Dear Matteo,

Included is a patch from the latest version to help scale building the score cache for the mle option. I'm using a "Sparse Candidate" type algorithm, so the number
of possible parents is normally quite constrained, in the region of 10s. This algorithm also needs to be able to check the scoring on adding
a single node, so I've had to make max.parents per node (I'm not sure why it was forbidden before). I've tested this on features running to
1000s and it seems to work quite well and replicates the previous results.

I also have code that scales the hill climbing algorithm to 1000s of variables, in R, but this is missing some of the functionality of the C
code, so I won't offer it yet.

Any questions, please get in touch!

Many thanks,

Rónán

On 15 Nov 2023, at 16:57, Delucchi Matteo [xxx] wrote:

Dear Ronan,

Thank you for your interest in our abn package and for taking the time to provide feedback.
We currently host the code on our institute’s GitLab server, which can be found at this link: https://git.math.uzh.ch/mdeluc/abn
I greatly appreciate your contribution towards improving the scalability of the package. I would happily review your patch and consider incorporating it for the next release.
Please don’t hesitate to reach out if you encounter any issues or have further questions.
Thank you again for your feedback!

Best regards,
Matteo

From: Ronan
Subject: abn R package

Dear Mr Delucchi,

I've been experimenting with the abn package and found that scaling up to large numbers of nodes was causing issues
with the code setting up the cache structure: specifically, in buildScoreCache.mle, where banned possibilities are filtered
out, runtime grew perhaps quadratically. I've implemented a fix that means the code can now scale to
larger examples and I'm wondering is there a way to incorporate this into the mainline of your package? I haven't seen a
github repository, but could send a patch etc.

Many thanks,

Ronan

diff --git a/R/build_score_cache_mle.R b/R/build_score_cache_mle.R
index 6e3e650..5b03b3f 100755
--- a/R/build_score_cache_mle.R
+++ b/R/build_score_cache_mle.R
@@ -377,6 +377,9 @@ buildScoreCache.mle <-
 
     ############################## Function to create the cache
 
+    if ( length(max.parents) == 1 ) {
+        max.parents <- rep(max.parents, nvars)
+    }
 
     if (!is.null(defn.res)) {
         max.parents <- max(apply(defn.res[["node.defn"]], 1, sum))
@@ -392,83 +395,64 @@ buildScoreCache.mle <-
             return(v)
         }
 
-        node.defn <- matrix(data = as.integer(0), nrow = 1L, ncol = nvars)
-        children <- 1
+        ## Generate all possible bit patterns for n variables, with a maximum of m 1s
+        generateBitPatterns = function(n, m) {
+          z <- rep(0,n)
+          do.call(rbind, lapply(0:m, function(i) t(apply(combn(1:n,i), 2, function(k) {z[k]=1;z}))))
+        }
 
-        for (j in 1:nvars) {
-            if (j != 1) {
-                node.defn <- rbind(node.defn, matrix(data = as.integer(0),
-                                                     nrow = 1L, ncol = nvars))
-                children <- cbind(children, j)
-            }
-            # node.defn <- rbind(node.defn,matrix(data = 0,nrow = 1,ncol = n))
+        # Function to generate all possible combinations of parents
+        filteredCombinations = function(x, m, bannedParents, retainedParents) {
+          # These are the parents that cannot change
+          fixedParents = bannedParents | retainedParents | (fun.return(x, length(x) + 2) + 1) %% 2
+          # These are the parents that can change
+          parentPossibleChoices = which(fixedParents == 0)
+          numPossibleChoices = length(parentPossibleChoices)
+          numRetainedParents = sum(retainedParents)
+
+          # Generate all possible combinations of parents, taking account of banned, retained and maximum number of parents
+          parentChoices = generateBitPatterns(numPossibleChoices, min(m-numRetainedParents, numPossibleChoices)) == 1
+          output = t(apply(parentChoices, 1, function(pc) {
+            combinedRow = 1L*(retainedParents | fun.return(parentPossibleChoices[pc], length(x) + 2))
+            combinedRow
+          }))
+          output
+        }
+
+        children <- matrix(nrow=1, ncol=0)
+        node.defn.list = list()
 
+        for (j in 1:nvars) {
           if(is.list(max.parents)){
             stop("ISSUE: `max.parents` as list is not yet implemented further down here. Try with a single numeric value as max.parents instead.")
             if(!is.null(which.nodes)){
               stop("ISSUE: `max.parents` as list in combination with `which.nodes` is not yet implemented further down here. Try with single numeric as max.parents instead.")
             }
-          } else if (is.numeric(max.parents) && length(max.parents)>1){
-            if (length(unique(max.parents)) == 1){
-              max.parents <- unique(max.parents)
-            } else {
-              stop("ISSUE: `max.parents` with node specific values that are not all the same, is not yet implemented further down here.")
-            }
-          }
-
-          if(max.parents == nvars){
-            max.parents <- max.parents-1
-            warning(paste("`max.par` == no. of variables. I set it to (no. of variables - 1)=", max.parents)) #NOTE: This might cause differences to method="bayes"!
           }
 
-            for (i in 1:(max.parents)) {
-                tmp <- t(combn(x = (nvars - 1), m = i, FUN = fun.return, n = nvars, simplify = TRUE))
-                tmp <- t(apply(X = tmp, MARGIN = 1, FUN = function(x) append(x = x, values = 0, after = j - 1)))
-
-                node.defn <- rbind(node.defn, tmp)
-
-                # children position
-                children <- cbind(children, t(rep(j, length(tmp[, 1]))))
-            }
+            # The parents that are banned and retained for node j
+            bannedParents = dag.banned[j, ]
+            retainedParents = dag.retained[j, ]
+            # All possible parents for node j, which is all nodes except j
+            parentChoice = c(seq.int(from=1, length.out=j-1), seq.int(from=j+1, length.out=nvars-j))
+            # How many parents we are keeping for node j
+            numRetainedParents = sum(retainedParents)
+            # The maximum number of parents for node j
+            m = max.parents[j]
+
+            # Generate all possible combinations of parents for node j
+            tmp <- filteredCombinations(x = parentChoice, m=m, bannedParents=bannedParents, retainedParents=retainedParents)
+            # We need a sparse matrix here to deal with large numbers of variables, otherwise memory usage if very high.
+            tmp2 = Matrix(tmp, sparse = TRUE)
+            node.defn.list[[length(node.defn.list) + 1]] <- tmp2
+            children <- cbind(children, t(rep(j, length(tmp2[, 1]))))
         }
 
-        # children <- rowSums(node.defn)
+        node.defn = do.call(rbind, node.defn.list)
         colnames(node.defn) <- colnames(data.df)
-        ## Coerce numeric matrix into integer matrix !!!
-        node.defn <- apply(node.defn, c(1, 2), function(x) {
-            (as.integer(x))
-        })
-
         children <- as.integer(children)
         # node.defn_ <- node.defn
 
-        ## DAG RETAIN/BANNED
-        for (i in 1:nvars) {
-            for (j in 1:nvars) {
-
-                ## DAG RETAIN
-                if (dag.retained[i, j] != 0) {
-                  tmp.indices <- which(children == i & node.defn[, j] == 0)
-
-                  if (length(tmp.indices) != 0) {
-                    node.defn <- node.defn[-tmp.indices, ]
-                    children <- children[-tmp.indices]
-                  }
-                }
-
-                ## DAG BANNED
-                if (dag.banned[i, j] != 0) {
-                  tmp.indices <- which(children == i & node.defn[, j] == 1)
-
-                  if (length(tmp.indices) != 0) {
-                    node.defn <- node.defn[-tmp.indices, ]
-                    children <- children[-tmp.indices]
-                  }
-                }
-
-            }
-        }
-
         mycache <- list(children = as.integer(children), node.defn = (node.defn))
 
         ###------------------------------###
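
To make the core idea of the patch easier to follow, here is a standalone sketch of enumerating all 0/1 parent patterns over `n` candidate parents with at most `m` ones. It is a simplified re-implementation for illustration, not the patch's `generateBitPatterns`, and it handles the empty parent set explicitly instead of relying on `combn()` with `m = 0`.

```r
# Enumerate all 0/1 vectors of length n with at most m ones (illustrative
# re-implementation; each row is one candidate parent set).
generate_bit_patterns <- function(n, m) {
  stopifnot(n >= 1, m >= 0)
  m <- min(m, n)
  rows <- list(integer(n))                  # the empty parent set
  for (i in seq_len(m)) {
    index_sets <- combn(seq_len(n), i, simplify = FALSE)
    rows <- c(rows, lapply(index_sets, function(k) {
      z <- integer(n)
      z[k] <- 1L
      z
    }))
  }
  do.call(rbind, rows)
}

patterns <- generate_bit_patterns(4, 2)
nrow(patterns)  # choose(4, 0) + choose(4, 1) + choose(4, 2) = 11
```

Because banned and retained parents are excluded before the enumeration, the number of rows grows with the number of free candidates only, which is presumably where the reported speed-up comes from.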

fitabn_mle(): `catcov.mblogit = "single"` is not implemented

Manipulate VarCov to bring it into the correct shape:
https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/fitabn_mle.R#L455-L457

buildScoreCache doesn't work with `defn.res` and `which.nodes` provided together

This fails only because no check for the combination of these arguments is implemented.
Check that defn.res and which.nodes do not mismatch and, if they are consistent, keep both:
https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/build_score_cache.R#L517-L518

buildscorecache_mle(): catcov.mblogit = "single" is not implemented

Manipulate VarCov to bring it into the correct shape:

https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/build_score_cache_mle.R#L153-L157

and

https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/build_score_cache_mle.R#L196-L202

This is related to #72 .

check quotations in DESCRIPTION

Please always write package names, software names and API (application
programming interface) names in single quotes in title and description.
e.g: --> 'INLA'
Please note that package names are case sensitive.

Which combinations of `cor.var`, `which.nodes` and `group.var` are valid?

Extend the checking procedure of the combination of cor.var, which.nodes and group.var arguments here: https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/build_score_cache.R#L446C1-L454C4

Print meaningful warnings/errors for the specific combinations.

Many "known" overflows in node_binomial.c

From case study zero of the old abn-homepage.

MRE

The for-loop comparing INLA, internal C Laplace and glm results shows an over/underflow warning originating from the Laplace calculations in node_binomial.c.
In several places (e.g. line 940), we exponentiate large numbers, which raises the overflow warning and results in Inf values that can cause issues downstream.

load(system.file("extdata", "QA_glm_case1_data.RData", package = "abn")) # or download from here: http://r-bayesian-networks.org/source/Rcode/QA_glm_case2.tar.gz

## 1. plot of raw differences, a wide range of values since both poisson, bin and gaus distributions used.
## vast majority as almost identical, but some are rather different
#plot(mycache.inla$mlik-mycache.c$mlik);

## 2. also look at % differences - gives a crude overview
## as 1. so suggests perhaps not just floating point rounding issue e.g. in log transforms
perc<-100*(mycache.c$mlik-mycache.inla$mlik)/mycache.c$mlik;

## 3. get all mliks which are adrift by more than 1%
bad<-which(abs(perc)>1);

## go through each and check for issues
## 
mydat<-ex2.dag.data;## this data comes with abn see ?ex2.dag.data
mydat.std<-mydat;
## setup distribution list for each node
mydists<-list(b1="binomial",
              g1="gaussian",
              p1="poisson",
              b2="binomial",
              g2="gaussian",
              p2="poisson",
              b3="binomial",
              g3="gaussian",
              p3="poisson",
              b4="binomial",
              g4="gaussian",
              p4="poisson",
              b5="binomial",
              g5="gaussian",
              p5="poisson",
              b6="binomial",
              g6="gaussian",
              p6="poisson"
             );
## create standardised dataset for comparison with glm
for(i in 1:length(mydists)){if(mydists[[i]]=="gaussian"){## then std data for comparison with glm_case
                                                            mydat.std[,i]<-(mydat.std[,i]-mean(mydat.std[,i]))/sd(mydat.std[,i]);}
}
## create empty matrix which will be filled with nodes as needed
mydag<-matrix(rep(0,dim(mydat)[2]^2),ncol=dim(mydat)[2]);colnames(mydag)<-rownames(mydag)<-names(mydat);

## loop through each node which differed from INLA by at least 1% and compare with glm() modes
for(i in 1:length(bad)){

  mydag[,]<-0;## reset
  node<-mycache.c$child[bad[i]];pars<-mycache.c$node.defn[bad[i],];
  form<-as.formula(paste(colnames(mydag)[node],"~",paste(colnames(mydag)[which(pars==1)],collapse="+",sep=""),sep=""));
  family<-mydists[[node]];
  mydag[node,]<-pars;## copy "bad" node into DAG
  myres.c<-fitabn(dag.m=mydag,data.df=mydat,data.dists=mydists,max.mode.error=0,compute.fixed=TRUE);## use C
  myres.inla<-fitabn(dag.m=mydag,data.df=mydat,data.dists=mydists,max.mode.error=100,compute.fixed=TRUE,n.grid=NULL,std.area=FALSE);## use INLA
  myres.glm<-glm(form,data=mydat.std,family=family);
  cat("################ bad=",i,"#################\n");
  cat("\n# 1. glm()\n");print(coef(myres.glm));
  cat("\n# 2. C\n");print(myres.c$modes[[node]]);
  cat("\n# 3. INLA\n");print(myres.inla$modes[[node]]);
  cat("\n###########################################\n");
}

Suggested solution

The operation from line 940 appears in several locations in the code, often marked with an old note about its potential to overflow. One place has a note about a workaround. Consider investigating this workaround further and checking whether the other parts of the code could be adapted accordingly, or whether a better strategy exists (the workaround does not seem to be a universal solution).
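
For orientation, one standard strategy for this kind of overflow is to rewrite `log(1 + exp(eta))` so that only non-positive values are exponentiated. Whether this matches the expression at line 940 of node_binomial.c or the existing workaround is an assumption; the sketch is in R for readability, and both function names are made up.

```r
# Naive version: exp(eta) overflows to Inf for eta above roughly 709.
log1pexp_naive <- function(eta) log(1 + exp(eta))

# Stable version, using the identity
#   log(1 + exp(eta)) = max(eta, 0) + log1p(exp(-abs(eta)))
# so that exp() is only ever applied to values <= 0.
log1pexp_stable <- function(eta) pmax(eta, 0) + log1p(exp(-abs(eta)))

log1pexp_naive(800)   # Inf
log1pexp_stable(800)  # 800
```

For moderate values of `eta` both versions agree, so the stable form could in principle be swapped in without changing results elsewhere.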

Check and handle collinearity for all distributions.

fitAbn and buildScoreCache (both "mle"): Collinearity is only addressed for binomial variables. Extend to all distributions.
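
A distribution-independent check could work on the design matrix alone, for example by comparing its rank with its number of columns. The following is a minimal sketch under that assumption; `has_collinearity` is a hypothetical helper, not the check currently used for binomial nodes.

```r
# Hypothetical helper: TRUE if the columns of the design matrix X are
# linearly dependent (rank deficient), regardless of the node's distribution.
has_collinearity <- function(X) {
  X <- as.matrix(X)
  qr(X)$rank < ncol(X)
}

set.seed(1)
x1 <- rnorm(20)
x2 <- rnorm(20)

X_ok  <- cbind(intercept = 1, x1 = x1, x2 = x2)      # full rank
X_bad <- cbind(intercept = 1, x1 = x1, x3 = 2 * x1)  # x3 is a multiple of x1

has_collinearity(X_ok)   # FALSE
has_collinearity(X_bad)  # TRUE
```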

node specific max.parents not implemented for method = "mle"

Issue description

Only with method = "bayes" can we set the maximum number of allowed parents individually per node.

MRE

### Generate data
# Set seed for reproducibility
set.seed(123)

# Number of groups
n_groups <- 5

# Number of observations per group
n_obs_per_group <- 100

# Total number of observations
n_obs <- n_groups * n_obs_per_group

# Simulate group effects
group <- factor(rep(1:n_groups, each = n_obs_per_group))
group_effects <- rnorm(n_groups)

# Simulate variables
G1 <- rnorm(n_obs) + group_effects[group]
B1 <- rbinom(n_obs, 1, plogis(group_effects[group]))
G2 <- 1.5 * B1 + 0.7 * G1 + rnorm(n_obs) + group_effects[group]
B2 <- rbinom(n_obs, 1, plogis(2 * G2 + group_effects[group]))

# Create data frame
data <- data.frame(group = group, G1 = G1, G2 = G2, B1 = factor(B1), B2 = factor(B2))

# Look at data
str(data)
summary(data)

######
# Reproduce issue
######
### method = "mle"
# OK: Build the score cache with 2 parents for each variable
score_cache <- buildScoreCache(data.df = data,
                               data.dists = list(G1 = "gaussian", 
                                                 G2 = "gaussian", 
                                                 B1 = "binomial", 
                                                 B2 = "binomial"),
                               group.var = "group",
                               max.parents = 2,
                               method = "mle")

# BUG: Build the score cache with different number of parents for each variable
score_cache <- buildScoreCache(data.df = data,
                               data.dists = list(G1 = "gaussian", 
                                                 G2 = "gaussian", 
                                                 B1 = "binomial", 
                                                 B2 = "binomial"),
                               group.var = "group",
                               max.parents = list(G1 = 0, G2 = 2, B1 = 0, B2 = 3),
                               method = "mle")

### method = "bayes"
# OK: Build the score cache with different number of parents for each variable
score_cache <- buildScoreCache(data.df = data,
                               data.dists = list(G1 = "gaussian", 
                                                 G2 = "gaussian", 
                                                 B1 = "binomial", 
                                                 B2 = "binomial"),
                               group.var = "group",
                               max.parents = list(G1 = 0, G2 = 2, B1 = 0, B2 = 3),
                               method = "bayes")

export abn to .net file

export fitted abn to .net file to be read by e.g. HUGIN GUI.

These might help:

the data field in .net file contains the CPT of the nodes.

alternatives to HUGIN (commercial):

they use .dot files.

examples failed that require INLA

Run them only if INLA is available.


Flavor: r-devel-linux-x86_64-debian-gcc
Check: examples, Result: ERROR
  Running examples in 'abn-Ex.R' failed
  The error most likely occurred in:
  
  > base::assign(".ptime", proc.time(), pos = "CheckExEnv")
  > ### Name: buildScoreCache
  > ### Title: Build a cache of goodness of fit metrics for each node in a DAG,
  > ###   possibly subject to user-defined restrictions
  > ### Aliases: buildScoreCache buildScoreCache.bayes forLoopContentBayes
  > ###   forLoopContent buildScoreCache.mle
  > ### Keywords: buildScoreCache.bayes buildScoreCache.mle calc.node.inla.glm
  > ###   calc.node.inla.glmm fitAbn.bayes fitAbn.mle internal models
  >
  > ### ** Examples
  >
  > ## Simple example
  > # Generate data
  > N <- 1e6
  > mydists <- list(a="gaussian",
  +                 b="gaussian",
  +                 c="gaussian")
  > a <- rnorm(n = N, mean = 0, sd = 1)
  > b <- 1 + 2*rnorm(n = N, mean = 5, sd = 1)
  > c <- 2 + 1*a + 2*b + rnorm(n = N, mean = 2, sd = 1)
  > mydf <- data.frame("a" = scale(a),
  +                    "b" = scale(b),
  +                    "c" = scale(c))
  >
  > # ABN with MLE
  > mycache.mle <- buildScoreCache(data.df = mydf,
  +                                data.dists = mydists,
  +                                method = "mle",
  +                             max.parents = 2)
  Loading required package: Matrix
  > dag.mle <- mostProbable(score.cache = mycache.mle,
  +                         max.parents = 2)
  Step1. completed max alpha_i(S) for all i and S
  Total sets g(S) to be evaluated over: 8
  > myfit.mle <- fitAbn(object = dag.mle,
  +                     method = "mle",
  +                     max.parents = 2)
  > plot(myfit.mle)
  >
  > # ABN with Bayes
  > mycache.bayes <- buildScoreCache(data.df = mydf,
  +                                  data.dists = mydists,
  +                                  method = "bayes",
  +                                  max.parents = 2)
  Error in library(p, character.only = TRUE) :
    there is no package called 'INLA'
  Calls: buildScoreCache ... buildScoreCache.bayes -> %do% -> <Anonymous> -> library
  Execution halted

Tests tracking memory usage (the slow pipeline)

This approach actually includes 3 types of tests:

fast tests with testthat which run regularly

fast tests that are CRAN-like which run on changes (and change requests to) the default branch

slow tests that track memory usage

The first two are implemented (or about to be, see furrer-lab/devel-abn#100); what remains are the tests that track memory usage.

Originally posted by @j-i-l in #81

We want to run tests with valgrind enabled (what else?) if we have a release candidate.

Depending on what it is exactly that we want to track it might be enough to run R CMD check with --use-valgrind, in which case we could handle this by setting some variables in the existing github action CRAN_checks.

We should decide what sort of memory check we want to run
Implement the action accordingly

Proper documentation of C functions

Some C-level functions are mentioned here to silence R CMD check but are not properly documented.

nlminb message: function evaluation limit reached without convergence (9)

buildScoreCache() with method = "mle" and group.var set raises the warnings "nlminb message: false convergence (8)" and "nlminb message: function evaluation limit reached without convergence (9)".
See:
https://stackoverflow.com/a/40049233/6098024
https://stat.ethz.ch/pipermail/r-help/2008-June/164797.html
https://stats.stackexchange.com/a/44884/152981

example not executable in fitAbn()

Unexecutable code in man/fitAbn.Rd.
Please make sure that all your examples are executable. I think you
forgot to comment out a line there:

This is a basic plot of some posterior densities. The algorithm used
for selecting
density points is quite straightforward, but it might result in a
sparse distribution.
Therefore, we also recompute the density over an evenly spaced grid
of 50 points between the two endpoints that had a minimum PDF at f=min.pdf.
Setting max.mode.error=0 forces the use of the internal C code.

wrong URLs and URIs

Fix the following error message

Found the following (possibly) invalid URLs:
    URL: http://aje.oxfordjournals.org/content/176/11/1051.abstract (moved to https://academic.oup.com/aje/article-abstract/176/11/1051/178588)
      From: README.md
      Status: 301
      Message: Moved Permanently
    URL: http://aje.oxfordjournals.org/content/176/11/1051.full.pdf?keytype=ref&ijkey=zCJD2Zt88XaDYyY (moved to https://academic.oup.com/aje/article-pdf/176/11/1051/428801/kws183.pdf?keytype=ref&ijkey=zCJD2Zt88XaDYyY)
      From: README.md
      Status: 301
      Message: Moved Permanently
    URL: http://download.springer.com/static/pdf/949/art%253A10.1186%252Fs12917-016-0649-0.pdf?originUrl=http%3A%2F%2Fbmcvetres.biomedcentral.com%2Farticle%2F10.1186%2Fs12917-016-0649-0&token2=exp=1455044551~acl=%2Fstatic%2Fpdf%2F949%2Fart%25253A10.1186%25252Fs12917-016-0649-0.pdf*~hmac=e04039a7400eefea35dc05635bccae1688e549b8b0eb36edc0b8fd72caba73fc
      From: README.md
      Status: 404
      Message: Not Found
    URL: http://mcmc-jags.sourceforge.net/ (moved to https://mcmc-jags.sourceforge.io/)
      From: README.md
      Status: 301
      Message: Moved Permanently
    URL: http://pdn.sciencedirect.com/science?_ob=MiamiImageURL&_cid=271186&_user=4429&_pii=S0167587711000341&_check=y&_origin=browseVolIssue&_zone=rslt_list_item&_coverDate=2011-06-15&wchp=dGLbVlS-zSkWb&md5=29522e1462a0ac05fe07c787a4cd3d0a&pid=1-s2.0-S0167587711000341-main.pdf
      From: README.md
      Status: Error
      Message: Could not resolve host: pdn.sciencedirect.com
    URL: http://web.cs.iastate.edu/~jtian/cs673/cs673_spring05/references/Friedman-Koller-2003.pdf (moved to https://faculty.sites.iastate.edu/jtian/)
      From: README.md
      Status: 301
      Message: Moved Permanently
    URL: http://www.bioconductor.org/ (moved to https://www.bioconductor.org/)
      From: README.md
      Status: 301
      Message: Moved Permanently
    URL: http://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html (moved to https://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html)
      From: README.md
      Status: 301
      Message: Moved Permanently
    URL: http://www.ete-online.com/content/10/1/4 (moved to https://link.springer.com/journal/12982)
      From: README.md
      Status: 301
      Message: Moved Permanently
    URL: http://www.r-inla.org/ (moved to https://www.r-inla.org/)
      From: README.md
      Status: 301
      Message: Moved Permanently
    URL: https://r-bayesian-networks.org/quick_start_example.html
      From: inst/doc/paper.html
      Status: Error
      Message: schannel: SNI or certificate check failed: SEC_E_WRONG_PRINCIPAL (0x80090322) - The target principal name is incorrect.
    For content that is 'Moved Permanently', please change http to https,
    add trailing slashes, or replace the old by the new URL.
  
  Found the following (possibly) invalid file URI:
    URI: quick_start_example.md
      From: README.md

Prediction function

Prediction of a fitted BN can be achieved in several ways.

Allow to skip the quick tests on specific branches

We might not always need to have the quick tests running on every commit to a branch.

When working on the documentation or, as we do now, on the paper we are not interested in the tests.

Therefore it should be easy to skip the tests.

Suggestion:

If a branch name contains the string noT then the quick tests do not run at all on this branch
If a commit message starts with noT then for this commit the tests are skipped

parallelise marginal posterior density

Consider running the loops in getmarginals() with foreach to gain speed.

write output to a file provided by verbose
allow FORK and PSOCK provided by fit.control()
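
A minimal sketch of the idea follows. It uses the base `parallel` package rather than `foreach`, so that it runs without extra dependencies, and `compute_marginal` is a hypothetical stand-in for the per-parameter work inside getmarginals(). The cluster type is passed as an argument here; in the package it would come from fit.control().

```r
library(parallel)

# Hypothetical stand-in for the per-parameter work done in getmarginals().
compute_marginal <- function(i) {
  x <- seq(-3, 3, length.out = 50)
  max(dnorm(x, mean = i / 10))
}

# Run the loop on a cluster of the requested type.
# Note: type = "FORK" is not available on Windows.
run_marginals <- function(indices, n_cores = 2, type = c("PSOCK", "FORK")) {
  type <- match.arg(type)
  cl <- makeCluster(n_cores, type = type)
  on.exit(stopCluster(cl), add = TRUE)   # always release the workers
  parLapply(cl, indices, compute_marginal)
}

res <- run_marginals(1:4)
```

With `foreach`, the same cluster object could be registered through a parallel backend and the loop body kept largely unchanged.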

Typos in DESCRIPTION

Please omit the redundant " The abn R package is a powerful tool for"
from the Description field.

Please single quote software names with straight (rather than directed)
single quotes in the Description field as in 'abn'.

Please fix and resubmit.

package archive as artifact

Get the .tar.gz from the build command as an artifact. This eases the CRAN submission.
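
`R CMD build` writes the archive as `<package>_<version>.tar.gz` into the working directory. A minimal sketch of how a CI step might locate that file so it can be passed to an artifact-upload step; the function name `find_pkg_tarball` is hypothetical and the upload step itself is not shown.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of the first package tarball found.
# $1 = directory to search, $2 = package name (e.g. "abn").
find_pkg_tarball() {
  local dir="$1" pkg="$2" f
  for f in "$dir/${pkg}"_*.tar.gz; do
    # The glob stays unexpanded when nothing matches, so check existence.
    if [ -e "$f" ]; then
      echo "$f"
      return 0
    fi
  done
  return 1
}

# Possible use after building the package in the current directory:
#   R CMD build .
#   tarball="$(find_pkg_tarball . abn)"
```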

vignettes failed that require INLA

Think about:

Don't run vignettes that need INLA.
Precompute the vignettes and build them from the precomputed output.

Flavor: r-devel-linux-x86_64-debian-gcc
Check: re-building of vignette outputs, Result: ERROR
  Error(s) in re-building vignettes:
    ...
  --- re-building 'data_simulation.Rmd' using rmarkdown
  
  Quitting from lines 29-58 [fit_model] (data_simulation.Rmd)
  Error: processing vignette 'data_simulation.Rmd' failed with diagnostics:
  there is no package called 'INLA'
  --- failed re-building 'data_simulation.Rmd'
  
  --- re-building 'mixed_effect_BN_model.Rmd' using rmarkdown
  --- finished re-building 'mixed_effect_BN_model.Rmd'
  
  --- re-building 'model_specification.Rmd' using rmarkdown
  --- finished re-building 'model_specification.Rmd'
  
  --- re-building 'multiprocessing.Rmd' using rmarkdown
  
  Quitting from lines 88-130 [benchmarking] (multiprocessing.Rmd)
  Error: processing vignette 'multiprocessing.Rmd' failed with diagnostics:
  worker initialization failed: there is no package called 'INLA'
  --- failed re-building 'multiprocessing.Rmd'
  
  --- re-building 'paper.Rmd' using rmarkdown
  --- finished re-building 'paper.Rmd'
  
  --- re-building 'parameter_learning.Rmd' using rmarkdown
  
  Quitting from lines 67-72 [unnamed-chunk-3] (parameter_learning.Rmd)
  Error: processing vignette 'parameter_learning.Rmd' failed with diagnostics:
  there is no package called 'INLA'
  --- failed re-building 'parameter_learning.Rmd'
  
  --- re-building 'quick_start_example.Rmd' using rmarkdown
  --- finished re-building 'quick_start_example.Rmd'
  
  --- re-building 'structure_learning.Rmd' using rmarkdown
  --- finished re-building 'structure_learning.Rmd'
  
  SUMMARY: processing the following files failed:
    'data_simulation.Rmd' 'multiprocessing.Rmd' 'parameter_learning.Rmd'
  
  Error: Vignette re-building failed.
  Execution halted

fitabn_mle(): Return very low score when multinomial fit is NULL

Catch the fit when it is NULL and return a very low score:
https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/fitabn_mle.R#L606-L609

buildScoreCache: doesn't work with `defn.res` and `which.nodes` provided together

Because there is no check that handles this situation properly, providing defn.res and which.nodes together results in an error. https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/build_score_cache.R#L590-L603

Resolve by checking that defn.res and which.nodes do not conflict, and keep both if they are consistent.

Streamline versioning process

Currently we have multiple locations where we need to set the version manually (DESCRIPTION, News.md, configure and configure.ac, others?) in addition to the version we set via git tag.

The goal would be to streamline the process of bumping the version, ideally designating one source for the version and having all other mentions generated automatically.

As @matteodelucchi pointed out, usethis::use_version() might be a solution.
If no existing implementation suits our needs, we might also implement this via templating (e.g. with https://github.com/davidchall/jinjar/).

In addition to streamlining the version-bumping process, we might also consider adhering to the semver versioning scheme.

Steps

Evaluate whether usethis::use_version() can fetch the version number from a git tag and set it in every location where the package version is mentioned.
Implement the process of setting the version number based on the last git tag (either with usethis::use_version() or from scratch).
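
For the "from scratch" option, a minimal shell sketch of deriving the version from a tag and writing it into DESCRIPTION. The function names are hypothetical, and the tag format (`v1.2.3` or `1.2.3`) is an assumption; the other locations (News.md, configure, configure.ac) would need their own rules.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and the tag format are assumptions.

# Strip an optional leading "v" from a tag name to get the bare version.
version_from_tag() {
  local tag="$1"
  echo "${tag#v}"
}

# Rewrite the Version field of a DESCRIPTION file.
# $1 = path to DESCRIPTION, $2 = new version string.
set_description_version() {
  local file="$1" version="$2"
  # Write to a temporary file first to stay portable across sed variants.
  sed "s/^Version:.*/Version: ${version}/" "$file" > "${file}.tmp" &&
    mv "${file}.tmp" "$file"
}

# In a repository with tags, the most recent tag could be used like this:
#   tag="$(git describe --tags --abbrev=0)"
#   set_description_version DESCRIPTION "$(version_from_tag "$tag")"
```

Keeping the tag as the single source means the DESCRIPTION file is only ever written by this step, never edited by hand.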

glmm.score with Julia backend

Give glmm.score an interface to Julia in order to
i) increase performance
ii) handle rank deficiency properly: https://juliastats.org/MixedModels.jl/dev/rankdeficiency/
- see example package: https://github.com/Non-Contradiction/ipoptjlr/blob/master/R/IPOPT.R
- using the R package JuliaCall.

speed up examples in buildScoreCache()

Check: examples, Result: NOTE
Examples with CPU (user + system) or elapsed time > 10s
                user system elapsed
buildScoreCache 8.67   1.61   10.28

Consolidate CRAN tests and test-coverage actions

We have two different actions for running the CRAN-like tests and one just for getting the test coverage.

It is unclear to me why they exist separately.

Suggestion

Remove test-coverage.yml and include its last 3 steps in the CRAN tests

fix URI in README

Found the following (possibly) invalid file URI:
URI: quick_start_example.md
From: README.md