g-walley / cegpy

Cegpy (/segpaɪ/) is a Python package for working with Chain Event Graphs. It supports learning the graphical structure of a Chain Event Graph from data, encoding of parametric and structural priors, estimating its parameters, and performing inference.

License: MIT License

Python 98.62% Shell 0.33% Dockerfile 1.05%

statistical-inference statistical-models statistics

cegpy's Introduction

cegpy

It is built on top of the Python network modelling package NetworkX.

Documentation

Documentation is hosted on Read the Docs.

We have also written a paper explaining the statistical methods and algorithms included in the package: "cegpy: Modelling with Chain Event Graphs in Python" (arXiv).

Quickstart

If you'd like to get started using the package, the best place to start is the quick-start documentation.

Example Binder

Use cases have been demonstrated in a binder.

The following image is an example of a chain event graph that is produced by this package:

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Aditi Shenvi: 💻 ⚠️ 🐛 📆
Gareth Walley: 💻 📖 🎨 ⚠️ 🚧
Kasia Kobalczyk: 💻 🐛 ⚠️
Peter Strong: 💻 🐛 💡 ⚠️

This project follows the all-contributors specification. Contributions of any kind welcome!


cegpy's Issues

Priors and hyperstages resetting to default

When a user defines a StagedTree with a specified prior or hyperstage, these are overridden by the defaults when calculate_AHC_transitions is called.

Bug: Minimisation algorithm uncertain evidence

@g-walley See whiteboard, update this with a description of the problem.

Propagation: Destroying conditional independence relationship

A problem arises when the evidence given isn't "intrinsic", i.e. it destroys a conditional independence relationship encoded in the original graph. Example added below by Gareth.

This is an edge case. I think the solution should simply be to raise a warning, whenever the propagation algorithm is run, to remind the user to consider edge cases such as these.

Alternatively, we could check whether the minimised/reduced graph obtained by conditioning on the evidence contains any path on which two or more elements from an uncertain node set or uncertain edge set occur together. If it does, we would raise an error stating that the evidence destroys a conditional independence relationship in the original graph, so the reduced graph is not representative of the evidence given.

User defined colour palette

Users might like to define the colour palette themselves.

List of things left to do as of 10/10/2021

For the code

Finish coding up the propagation algorithm
Add more tests
Fix bugs (cond prob bug -- see previous issue)

For the paper

For the documentation

Get started
Update to do list

New Feature: AHC Parallelisation

To work on this, create a new branch from dev called "feature-staged-tree-ahc-parallelisation".

The AHC algorithm should be able to be implemented with some parallelisation.

When the sets in a hyperstage are mutually exclusive, there is no reason the algorithm can't be run independently in parallel over the sets of the hyperstage.

Tasks:

Determine which sets in the hyperstage are independent, and which need to be considered alongside other sets in the hyperstage
Spin up multiple threads based on the configuration setting "threads" which is added as an optional input for the staged tree class. (assume single core operation when threads is not configured)
Combine results.
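The tasks above could be sketched roughly as follows. This is only an illustration under stated assumptions: the hyperstage is taken to be a list of lists of situation names, two sets are treated as dependent when they share a situation, and `run_ahc_on_group` is a hypothetical placeholder for the real AHC routine, not part of cegpy.

```python
from concurrent.futures import ThreadPoolExecutor


def independent_groups(hyperstage):
    """Group hyperstage sets that share at least one situation."""
    groups = []  # each entry is a (situations, member_sets) pair
    for subset in hyperstage:
        situations, members = set(subset), [subset]
        remaining = []
        for other_situations, other_members in groups:
            if situations & other_situations:
                # Overlap: fold the existing group into the new one.
                situations |= other_situations
                members = other_members + members
            else:
                remaining.append((other_situations, other_members))
        groups = remaining + [(situations, members)]
    return [members for _, members in groups]


def run_ahc_on_group(group):
    """Hypothetical placeholder for running AHC on one independent group."""
    return sorted({s for subset in group for s in subset})


def parallel_ahc(hyperstage, threads=1):
    """Run the placeholder on each group; threads=1 means serial operation."""
    groups = independent_groups(hyperstage)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in the same order as `groups`.
        return list(pool.map(run_ahc_on_group, groups))


results = parallel_ahc([["s1", "s2"], ["s3"], ["s2", "s4"]], threads=2)
```

Because of CPython's global interpreter lock, threads may give little speed-up for CPU-bound clustering; a process pool would be one alternative to evaluate.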

Remove NA in categories per variable calcs

When counting the number of categories per variable, the function does not ignore missing-value markers such as NA or N/A.
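One possible way to count categories while skipping missing markers is sketched below. The list of labels in `NA_LABELS` and the function name are assumptions made for illustration; they are not taken from the cegpy code.

```python
import numpy as np
import pandas as pd

# Assumed set of strings that should count as missing rather than as a category.
NA_LABELS = ["NA", "N/A", "n/a", "NaN", ""]


def categories_per_variable(dataframe, na_labels=NA_LABELS):
    """Count the distinct non-missing categories in each column."""
    cleaned = dataframe.replace(na_labels, np.nan)
    # nunique() skips missing values by default (dropna=True).
    return cleaned.nunique().to_dict()


df = pd.DataFrame(
    {"class": ["1st", "2nd", "N/A"], "sex": ["Male", None, "Female"]}
)
counts = categories_per_variable(df)
```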

Broken CI graphviz calls

CI is broken due to graphviz calls in the tests. The calls should be commented out but left in place for when developers wish to see the graphs.

CEG testing refactor

The CEG class is cumbersome and needs its complexity reducing.

The _update_probabilities() function has too many nested blocks. It could do with a minor refactor to bring the nesting depth down from 6 to at most 5, preferably fewer. It also has too many branches (15); it would be good to simplify the branching and break the function down into unit-testable blocks where possible.

Unit tests on branch feature-ceg-testing-refactor are currently only at 45% coverage. This should be >95% and preferably 100%.

Floret with one edge Issue

When we have florets with only one outgoing edge, such as s14, s16, s18 and s20 in the attached staged tree, they should all be in the same stage (in the attached tree s14 isn't). I think it should be hard-coded that situations with only one potential outcome and the same edge label are placed in the same stage.

Titanic.pdf

DOT importing

Graphviz allows users to create graphs in DOT (https://graphviz.org/doc/info/lang.html).

One great enhancement would be to have the functionality in cegpy to allow a user to create an event tree/staged tree/ceg in dot and some way of adding the data that goes with it, and then importing that into our package in a way in which it can read the dot figure and the other data.

Event tree from staged tree class

Be able to create an image of the event tree in the staged tree class

Output Dot File

Add Dot file output from pdp package. This gives the user the flexibility to use any number of Dot compatible packages for producing graphs in whatever style they like.
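As a rough illustration of what DOT output involves, the sketch below serialises a list of labelled edges by hand. The function name and the `(source, target, label)` input format are assumptions for this example and do not reflect cegpy's internals; labels containing double quotes would need escaping.

```python
def edges_to_dot(edges, graph_name="event_tree"):
    """Serialise (source, target, label) triples as a DOT digraph string."""
    lines = [f"digraph {graph_name} {{"]
    for source, target, label in edges:
        lines.append(f'    "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


dot_text = edges_to_dot([("s0", "s1", "Male"), ("s0", "s2", "Female")])
```

The resulting string can be written to a `.dot` file and rendered with any DOT-compatible tool.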

Remove unhelpful networkx functionality

Remove networkx functionality that could be used erroneously, such as adding edges, which could be misinterpreted as the way to add sampling zeros.

Need to identify the functions.

Bug in ChainEventGraph.generate()

When running .generate on a chain event graph I am getting a division by zero error.

Example code with the following dataset
Titanic.csv

from cegpy import StagedTree, EventTree, ChainEventGraph
import pandas as pd

data = pd.read_csv("Titanic.csv")

#get missing paths

missing_paths = [('Crew', 'Male', 'Child'),
 ('Crew', 'Male', 'Child'),
 ('Crew', 'Female', 'Child'),
 ('Crew', 'Female', 'Child'),
 ('Crew', 'Male', 'Child', 'Yes'),
 ('Crew', 'Male', 'Child', 'No'),
 ('Crew', 'Female', 'Child', 'Yes'),
 ('Crew', 'Female', 'Child', 'No'),
 ('1st', 'Male', 'Child', 'No'),
 ('1st', 'Female', 'Child', 'No'),
 ('2nd', 'Male', 'Child', 'No'),
 ('2nd', 'Female', 'Child', 'No')]

st = StagedTree(data,sampling_zero_paths=missing_paths)
st.calculate_AHC_transitions()
CEG = ChainEventGraph(st)
CEG.generate() #Issue on this line

To Do 23/07/2021: Aditi

Download package from Pypi after Gareth uploads it
Run pytest locally
Write the abstract, introduction and model description sections

Add a warning for creating a ceg graph without calling `.generate()`

Add a warning / error for the user when trying to create a ceg graph without running .generate() first.
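A minimal sketch of such a guard is shown below. `GraphSketch` and its `_generated` flag are hypothetical stand-ins used only to show the pattern; they are not the actual ChainEventGraph implementation.

```python
import warnings


class GraphSketch:
    """Hypothetical stand-in for a class that has a generate() step."""

    def __init__(self):
        self._generated = False

    def generate(self):
        # The real work would happen here; the sketch only records the call.
        self._generated = True

    def create_figure(self, filename=None):
        if not self._generated:
            warnings.warn(
                "generate() has not been called; the figure may be incomplete.",
                UserWarning,
                stacklevel=2,
            )
        return filename
```

Raising an exception instead of warning would be the stricter option if a figure produced before generate() is never meaningful.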

Rewrite EventTree class to inherit networkx.MultiGraph and use it for its data storage and graphing classes

networkx provides a lot of functionality for free that we can make use of.

By extending it to add functionality specific for EventTree generations from statistical datasets, we can provide a very helpful package, whilst giving the user a lot of flexibility.

Create new branch for the networkx modifications.
Determine the functions that we can replace.
Determine which functions need modification
Ensure tests still work

Conditional probabilities bug

The conditional probabilities generated by the AHC algorithm aren't getting updated properly. For example, if situations 1, 2 and 3 are in the same stage, they do not end up with the same posterior conditional probabilities.

Calculate AHC taking much longer than expected

chestSim50000.csv

This code is running much longer than expected (hours). Can you see if it is just an issue with my computer?

from itertools import product

import pandas as pd
from cegpy import StagedTree, EventTree, ChainEventGraph

data = pd.read_csv("chestSim50000.csv")

# Get missing paths: every prefix of the Cartesian product of column
# values that never occurs in the data.
uniques = [list(set(data[k])) for k in data.columns]
all_paths = list(product(*uniques))
paths = data.drop_duplicates().values.tolist()
all_paths_list = [list(ele) for ele in all_paths]

length_of_paths = len(all_paths_list[0])
missing_paths = []
for i in range(0, length_of_paths):
    # Compare prefixes of decreasing length against the observed prefixes.
    temp_paths = [path[:length_of_paths - i] for path in paths]
    new_missing_paths_list = [
        path[:length_of_paths - i]
        for path in all_paths_list
        if path[:length_of_paths - i] not in temp_paths
    ]
    if new_missing_paths_list != []:
        missing_paths = [*new_missing_paths_list, *missing_paths]
missing_paths = list(tuple(x) for x in missing_paths)
missing_paths = sorted(missing_paths, key=len)

st = StagedTree(data, sampling_zero_paths=missing_paths)
st.calculate_AHC_transitions()

New Feature: Holding Times

Holding Times

Introduction

In dynamic real-world systems, temporal effects could play a large role in the evolution of the process and contribute to our understanding of the system. To study temporal effects within a system, it is important to consider the time it takes for events to occur. For example, reasoning about processes that evolve over time is important in many domains such as medicine, fault diagnosis, policing, and any setting involving time-sensitive decision making.

We incorporate such a temporal analysis in a CEG by modelling the holding time associated with the transitions taking place within the CEG. Recall that a transition in a CEG is a movement from one node to another via the directed edge connecting them. A holding time along an edge e from node v to node v' indicates how long an individual spends in the state represented by node v before transitioning along edge e to node v'.

Modifications Required

For inputs:

For methods:

A new, separate AHC algorithm needs to be implemented to cluster the edges based on their holding time distributions. This does not replace the original AHC algorithm in the code, which is still needed to cluster the situations/nodes based on their transition distributions.
There are two possible representations we can have for the output staged tree and CEG.
- Rep1: Need to modify when two nodes are given the same colouring. It would need to check the output from the original and new AHC. Two nodes would be coloured the same when they are in the same cluster for the original AHC and also their outgoing edges which have the same edge label are in the same cluster in the new AHC. Rest remains the same.
- Rep2: Two nodes are coloured the same when they are in the same cluster for the original AHC, and two edges are given the same colouring when they are in the same cluster for the new AHC. Need to modify when two nodes are collapsed into a single node in the CEG.
Need to modify the types of evidence that the propagation algorithm takes as an input. In addition to the types it already takes as evidence, it can also take the holding time in a specific node or a set of nodes as evidence. This would need to be point evidence (e.g. 10 -- interpreted as individual A took 10 days to transition from node v to v') rather than interval evidence (e.g. [10-14] -- interpreted as individual A took somewhere between 10 and 14 days to transition from v to v').

Unittests

Write unit-tests to show that these features are working as expected:

Structural zeros
Structural missing values
Propagation
StagedTrees.py

R Binding

It would be great to have a binding/wrapper that allows the cegpy package to be used from R. I don't know much about how to do this, so this issue description itself will need more work!

I know that you can use the reticulate package (https://rstudio.github.io/reticulate/index.html) to run Python scripts and functions from R, but I am not sure whether all the cegpy functions/classes will be callable from R without adding some more code. If not, this wrapper/binding code will need to be added. Additionally, all the functions will have to be tested in R.

Sensible hyperstage and priors

Check that the hyperstage and priors, if provided as input by the user, are sensible.

For the hyperstage, this means that, at the very least:
• Situations in the same hyperstage have the same number of outgoing edges;
• All the situations appear in at least one sublist in the hyperstage.

Note that this keeps open the possibility that the user might add two situations into the same sublist in the hyperstage such that they have the same number of edges but not the same edge labels.

This also implies that, if the user adds situations with only one emanating edge into the same sublist in the hyperstage, then they will all be merged into the same stage irrespective of their edge labels.

For the priors, we need to check that:
• The length of each sublist in the prior matches the number of outgoing edges for its corresponding situation.
• All elements in each sublist in the prior must be > 0
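The checks listed above could be sketched as follows. The function names and input shapes are assumptions made for illustration: `out_degree` is a hypothetical mapping from each situation to its number of outgoing edges, and the prior is assumed to hold one sublist per situation in the same order as `situations`.

```python
def validate_hyperstage(hyperstage, out_degree):
    """Check a hyperstage given as a list of lists of situations."""
    for subset in hyperstage:
        # Every situation in a sublist must have the same number of edges.
        if len({out_degree[s] for s in subset}) > 1:
            raise ValueError(
                f"Situations in {subset} differ in number of outgoing edges."
            )
    # Every situation must appear in at least one sublist.
    covered = {s for subset in hyperstage for s in subset}
    missing = set(out_degree) - covered
    if missing:
        raise ValueError(f"Situations missing from hyperstage: {sorted(missing)}")


def validate_prior(prior, situations, out_degree):
    """Check one sublist of prior values per situation."""
    for situation, alphas in zip(situations, prior):
        if len(alphas) != out_degree[situation]:
            raise ValueError(f"Prior for {situation} has the wrong length.")
        if any(alpha <= 0 for alpha in alphas):
            raise ValueError(f"Prior for {situation} must be strictly positive.")
```

As noted above, these checks deliberately do not compare edge labels, so situations with equal edge counts but different labels would still pass.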

More tests for edge_info argument.

Add tests to ensure that the edge_info argument for figure creation is working fine.

Tests have been added to ensure that the right warning message is shown when the argument is inadmissible (i.e. it is a non-existent attribute), but not for when the argument is admissible. For event trees, the only admissible value is 'count', so no additional tests are needed. Additional tests are needed for staged trees and CEGs, as the argument can be set to any of the following: 'prior', 'count', 'posterior', 'probability'.

Writing Paper

@ashenvi10 Please modify this to flesh it out to match any plan you might have.

Python implementation section

Diagram of the software architecture
Description of each class and what it tries to do

Experimental section

Show output for different example datasets
Run experiments to determine execution time
Report experiments in paper

Discussion

Add discussion section
Compare to existing packages

To Do 23/07/2021: Peter

Download package from Pypi after Gareth uploads it
Run pytest locally
Run the code with different datasets and time it

Missing label and complete case True

All empty cells and NaNs are considered structurally missing, irrespective of any labels.

However, when a missing label is given and complete_case = True, we set all missing_label occurrences in the dataframe to np.nan and then drop all na's.
Need to edit the event.py code to fix this.
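One reading of the intended behaviour is that only rows containing the missing label should be dropped, while NaN cells stay in place as structurally missing. Under that assumption, which the issue does not spell out, a sketch could look like this (the function name is hypothetical):

```python
import pandas as pd


def drop_sampling_missing(dataframe, missing_label):
    """Drop rows containing `missing_label`; leave NaN (structural) cells alone."""
    has_label = dataframe.eq(missing_label).any(axis=1)
    return dataframe[~has_label].reset_index(drop=True)


df = pd.DataFrame({"a": ["x", "miss", "y"], "b": [None, "p", "q"]})
complete = drop_sampling_missing(df, "miss")
```

Here the first row keeps its empty cell in column "b", and only the row containing "miss" is removed.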

CI: Fix test fails

The CI is no longer up to scratch. It needs to be rewritten to make use of tox in a docker file.

This is due to the fact that the tests are interacting with graphviz, and the standard Python3 unit testing CI tool does not have it installed.

To Do 23/07/2021: Gareth

Build the package
Upload package to Pypi
Start working on staged tree class

CEG module new features

Run minimisation algorithm functionality
Write new propagation algorithm
Output to graph

Fix import structure of package

The package structure needs cleaning up.

Clean up the imports in the __init__.py files.

Create event trees from summarised data

Add a feature to create event / staged trees from a summarised format of data,
i.e. a data frame recording the counts of different combinations of the variables occurring in the data set.

e.g.

var1	var2	n_observations
A	1	500
B	1	345
A	0	234
B	0	6567
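One way to support this without changing the tree-building code would be to expand the summary back into one row per observation first. The sketch below assumes the count column is called `n_observations`, as in the example table; the helper name is hypothetical.

```python
import pandas as pd

summary = pd.DataFrame(
    {
        "var1": ["A", "B", "A", "B"],
        "var2": [1, 1, 0, 0],
        "n_observations": [500, 345, 234, 6567],
    }
)


def expand_counts(summary, count_column="n_observations"):
    """Repeat each row according to its count and drop the count column."""
    repeated = summary.loc[summary.index.repeat(summary[count_column])]
    return repeated.drop(columns=count_column).reset_index(drop=True)


data = expand_counts(summary)
```

For very large counts this expansion is memory-hungry, so reading the counts directly into the tree would likely be the better long-term design.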

Enable displaying graph images without saving to file in python notebooks

Add sampling zero paths as csv file input

The user should not need to pass the sampling zero paths as an argument when instantiating the class; they could instead provide them in a CSV file.

Type error: generating figures

Type error when generating figures for Event/staged trees created from data frames with numeric elements in them.

Learn about and install Docker

We have added Docker to the project. This will allow us to "containerise" the code, and get it to run in a standard environment that is the same for all of us.

Docker documentation:
https://docs.docker.com/get-started/overview/

Get docker installed
Try and build and run the project's docker image on your own computer

Structural/non-structural labels in the dataset

Currently the code assumes all NaNs and empty cells to be structurally missing.

However, there might be actual sampling missing values.
Users could be given the option to specify the labels used for structural/sampling missing values so they can be treated differently.

To begin with, sampling missingness may be assumed to be missing at random, and we can do a complete-case analysis (i.e. remove the rows with sampling missing values), or we could include them within the event tree under the edge label "missing".

Error in displaying the CEG graph

cegpy/src/cegpy/graphs/ceg.py

Lines 282 to 283 in b39c9c8

for (u, v, k, p) in edge_probabilities:
    full_label = "{}\n{:.2f}".format(k, p)

Edge probabilities are stored as Fractions. This generates the following error:

TypeError: unsupported format string passed to Fraction.__format__
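The error can be avoided by converting each probability to a float before formatting. The snippet below is a self-contained sketch with made-up edge data, not the actual fix applied in ceg.py. Python 3.12 added float-style formatting to Fraction, so the original line only fails on earlier versions.

```python
from fractions import Fraction

# Made-up edge data in the same (u, v, k, p) shape as the excerpt above.
edge_probabilities = [("w0", "w1", "yes", Fraction(3, 4))]

labels = []
for (u, v, k, p) in edge_probabilities:
    # float() avoids relying on Fraction.__format__ for the ".2f" spec.
    labels.append("{}\n{:.2f}".format(k, float(p)))
```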

Dataframe with numeric values

Currently the code cannot generate figures for data frames with numeric values.

Enable displaying different information on the edges

Add a parameter to create_figure to control what information is displayed on the edges (probabilities, edge counts, priors).

Colours in Staged tree too similar

We need to make the colours that are assigned have more "guaranteed" variation from each other.
Perhaps split the colour range evenly, to maximise the difference between colours?
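Splitting the colour range evenly could be sketched as below, by spacing hues uniformly around the HSV colour wheel. The function name and the default saturation and value are assumptions chosen for illustration.

```python
import colorsys


def evenly_spaced_colours(n, saturation=0.65, value=0.95):
    """Return n hex colours whose hues are spread evenly around the wheel."""
    colours = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, saturation, value)
        colours.append(
            "#{:02X}{:02X}{:02X}".format(
                round(r * 255), round(g * 255), round(b * 255)
            )
        )
    return colours


palette = evenly_spaced_colours(6)
```

Even hue spacing does not guarantee perceptual distinctness for large numbers of stages, so a perceptually uniform colour space may be worth considering later.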

Create Docker Container

Docker images and containers allow us to all run the code with standard dependencies and ensure that there are no "environment" related issues.

Get docker working on my own PC
Install docker CI to "actions" on Github
Assist others in the installation and use of docker on their laptops.

Documentation

For auto-generated documentation, write all the docstrings to make sure that each class' page is readable and understandable.

Auto-gen

EventTree
StagedTree
ChainEventGraph
ChainEventGraphReducer

Readme

Welcome and detailed description of the package and how to use it.
Links to auto-gen documentation page
Links to the notebook and examples
Some pictures of example outputs

New Feature: Define order of columns based on optional parameter "var_order"

Description

The dataset used to generate the Event Tree may have columns out of order. Rather than manually swapping them around in a spreadsheet editor, the user could simply add a list ordering of the columns.

Branch

To work on this, please commit to the "event_tree_feature_var_order" branch.
Once complete, please let @g-walley know with a pull request, and he will merge it into the dev branch.

Modification required

In the EventTree class in event.py, add an optional parameter called "var_order".
The parameter will take the form of a list of column names, e.g. ["col1", "col3", "col2"].

Look at this post on stack overflow: https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns

The response from freddygv looks like a good solution.
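A minimal sketch of the reordering step is given below, using plain column selection with a list. The helper name and the validation rule are assumptions for illustration, not the merged implementation.

```python
import pandas as pd


def reorder_columns(dataframe, var_order=None):
    """Return the dataframe with its columns in the order given by var_order."""
    if var_order is None:
        return dataframe
    if sorted(var_order) != sorted(dataframe.columns):
        raise ValueError("var_order must contain every column name exactly once.")
    return dataframe[var_order]


df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})
ordered = reorder_columns(df, ["col1", "col3", "col2"])
```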

Add the struct, missing label, complete case arguments to staged tree class

These arguments were recently added to the event tree class, so need to be added into the staged tree class too.

Merge single floret edges.

Say there are two consecutive florets A and B with only one emanating edge from each (say yes from A, and recovery from B):

... -> [A] -- yes--> [B] -- recovery --> [C] ...

Then this might be simplified to:

.... -> [A] -- yes & recovery --> [C] -> ...

Which would possibly make the graph less cluttered.
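The merge described above can be sketched on a plain edge list. This assumes a tree stored as `(source, target, label)` triples and a hypothetical function name; it does not handle edge probabilities or cegpy's own graph classes.

```python
def merge_single_edge_florets(edges):
    """Merge A -> B -> C into A -> C when A and B each have one outgoing edge."""
    edges = list(edges)
    changed = True
    while changed:
        changed = False
        out_edges, in_count = {}, {}
        for u, v, label in edges:
            out_edges.setdefault(u, []).append((u, v, label))
            in_count[v] = in_count.get(v, 0) + 1
        for first in edges:
            a, b, label_ab = first
            if len(out_edges[a]) != 1:
                continue
            # B must also have a single outgoing edge and no other parent.
            if len(out_edges.get(b, [])) != 1 or in_count[b] != 1:
                continue
            second = out_edges[b][0]
            merged = (a, second[1], f"{label_ab} & {second[2]}")
            edges = [e for e in edges if e not in (first, second)] + [merged]
            changed = True
            break  # rebuild the lookup tables before searching again
    return edges


tree = [("R", "A", "x"), ("R", "D", "y"), ("A", "B", "yes"), ("B", "C", "recovery")]
simplified = merge_single_edge_florets(tree)
```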

As an illustrative example, I coded up the kidney liver disorder example from Section 5.1.3 of J. Q. Smith's CEG book.

The idea is to skip position w_6 and create a direct edge from w_2 to w_\infty labelled "yes & R" with a probability of 0.75.

import pandas as pd
from cegpy import StagedTree

df = pd.read_excel("data/medical_dm_modified.xlsx")
falls_st = StagedTree(dataframe=df, sampling_zero_paths=[('Blast', 'Experienced', 'Easy', 'Blast', 'extra')])
falls_st.calculate_AHC_transitions()
falls_st.create_figure("st_fig_extra_paths.pdf")

