
cegpy's Introduction

cegpy


Cegpy (/segpaɪ/) is a Python package for working with Chain Event Graphs. It supports learning the graphical structure of a Chain Event Graph from data, encoding of parametric and structural priors, estimating its parameters, and performing inference.

It is built on top of the Python network modelling package NetworkX.

Documentation

Documentation is hosted on Read the Docs.

We have also written a paper explaining the statistical methods and algorithms included in the package: "cegpy: Modelling with Chain Event Graphs in Python" (arXiv).

Quickstart

If you'd like to get started using the package, the best place to start is the quick-start documentation.

Example Binder

Use cases have been demonstrated in a binder.

The following image is an example of a chain event graph that is produced by this package:

Image

Contributors ✨

Thanks goes to these wonderful people (emoji key):


Aditi Shenvi

💻 ⚠️ 🐛 📆

Gareth Walley

💻 📖 🎨 ⚠️ 🚧

Kasia Kobalczyk

💻 🐛 ⚠️

Peter Strong

💻 🐛 💡 ⚠️

This project follows the all-contributors specification. Contributions of any kind welcome!

cegpy's People

Contributors

allcontributors[bot], ashenvi10, g-walley, gareth-walley, kasia-kobalczyk, peterrhysstrong


cegpy's Issues

Propagation: Destroying conditional independence relationship

A problem arises when the evidence given isn't "intrinsic", i.e. it destroys a conditional independence relationship encoded in the original graph. Example added below by Gareth.

This is an edge case. I think the solution should simply be to raise a warning, whenever the propagation algorithm is run, to remind the user to consider edge cases such as these.

Alternatively, we could check whether the minimised/reduced graph obtained by conditioning on the evidence contains any path on which two or more elements from an uncertain nodes set or an uncertain edges set occur together. If it does, we would issue an error saying that the evidence destroys a conditional independence relationship in the original graph, and hence that the reduced graph isn't representative of the evidence given.
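A minimal sketch of that check, assuming the reduced graph is available as a list of root-to-sink paths, each a list of (source, target, label) edges. The function names and the data layout are hypothetical and are not part of the cegpy API.

```python
import warnings


def paths_breaking_independence(paths, uncertain_edge_sets=(), uncertain_node_sets=()):
    """Return the paths on which two or more elements of one uncertain set co-occur.

    Assumed layout (hypothetical): each path is a list of
    (source, target, label) tuples; each uncertain set is a set of such
    edge tuples or a set of node names.
    """
    flagged = []
    for path in paths:
        edges_on_path = set(path)
        nodes_on_path = {node for source, target, _ in path for node in (source, target)}
        # A clash means at least two members of the same uncertain set lie on this path.
        edge_clash = any(len(edges_on_path & set(group)) >= 2 for group in uncertain_edge_sets)
        node_clash = any(len(nodes_on_path & set(group)) >= 2 for group in uncertain_node_sets)
        if edge_clash or node_clash:
            flagged.append(path)
    return flagged


def warn_if_independence_broken(paths, uncertain_edge_sets=(), uncertain_node_sets=()):
    """Warn (rather than fail) when any path is flagged, and return the flagged paths."""
    flagged = paths_breaking_independence(paths, uncertain_edge_sets, uncertain_node_sets)
    if flagged:
        warnings.warn(
            f"{len(flagged)} path(s) contain two or more uncertain elements; "
            "the evidence may destroy a conditional independence relationship.",
            UserWarning,
        )
    return flagged
```

Swapping `warnings.warn` for a raised exception would give the stricter behaviour described above.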

List of things left to do as of 10/10/2021

For the code

  • Finish coding up the propagation algorithm
  • Add more tests
  • Fix bugs (cond prob bug -- see previous issue)

For the paper

  • Add Python implementation section
  • Run experiments
  • Report experiments in paper
  • Add discussion section
  • Compare to existing packages

For the documentation

  • Get started
  • Update to do list

New Feature: AHC Parallelisation

To work on this, create a new branch from dev called "feature-staged-tree-ahc-parallelisation".

The AHC algorithm should be able to be implemented with some parallelisation.

When the sets in a hyperstage are mutually exclusive, there is no reason the algorithm can't be run independently in parallel over the sets of the hyperstage.

Tasks:

  • Determine which sets in the hyperstage are independent, and which need to be considered alongside other sets in the hyperstage
  • Spin up multiple threads based on the configuration setting "threads" which is added as an optional input for the staged tree class. (assume single core operation when threads is not configured)
  • Combine results.
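The three tasks above could be sketched as follows. This assumes a hyperstage is a list of lists of situation names, and uses a placeholder `cluster_fn` standing in for one AHC run over a group of sets. Both function names and the `threads` argument are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def independent_groups(hyperstage):
    """Group hyperstage sets that share at least one situation.

    Sets in different groups have no situation in common, so each group
    can be processed without reference to the others.
    """
    groups = []  # list of (situations in the group, hyperstage sets in the group)
    for subset in hyperstage:
        members = set(subset)
        merged_sets = [list(subset)]
        remaining = []
        for situations, sets_ in groups:
            if situations & members:
                # Overlap: absorb the existing group into the new one.
                members |= situations
                merged_sets = sets_ + merged_sets
            else:
                remaining.append((situations, sets_))
        remaining.append((members, merged_sets))
        groups = remaining
    return [sets_ for _, sets_ in groups]


def run_over_hyperstage(hyperstage, cluster_fn, threads=1):
    """Apply cluster_fn to each independent group, using `threads` workers."""
    groups = independent_groups(hyperstage)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves the order of `groups`, which makes combining results simple.
        return list(pool.map(cluster_fn, groups))
```

In CPython, threads may not speed up CPU-bound pure-Python work; `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface if processes turn out to be the better fit.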

Broken CI graphviz calls

CI is broken due to graphviz calls in the tests. The calls should be commented out, but left in place for when developers wish to see the graphs.

CEG testing refactor

The CEG class is cumbersome and needs its complexity reducing.

The _update_probabilities() function has too many nested blocks. It could do with a minor refactor to bring the nesting depth down from 6 to at most 5, preferably fewer. It also has too many branches (15). It would be good to simplify the branching and break the function down into unit-testable blocks where possible.

Unit tests on branch feature-ceg-testing-refactor are currently only at 45% coverage. This should be >95% and preferably 100%.

Floret with one edge Issue

When we have a floret with only one outgoing edge, such as s14, s16, s18, s20 in the attached staged tree, they should all be in the same stage (in the attached tree s14 isn't). I think it should be hard-coded that, when situations have only one potential outcome and the edge label is the same, those situations are placed in the same stage.
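A minimal sketch of that rule, assuming the outgoing edge labels of each situation are available as a dictionary. The function name, the input layout and the edge labels in the usage example are hypothetical.

```python
from collections import defaultdict


def single_edge_stages(outgoing_labels):
    """Group situations that have exactly one outgoing edge by that edge's label.

    `outgoing_labels` maps a situation name to the list of labels on its
    outgoing edges (an assumed layout, for illustration only).
    """
    by_label = defaultdict(list)
    for situation, labels in outgoing_labels.items():
        if len(labels) == 1:
            by_label[labels[0]].append(situation)
    # Each group is a candidate stage that could be fixed before running AHC.
    return [sorted(group) for group in by_label.values()]
```

For example, `single_edge_stages({"s14": ["Yes"], "s16": ["Yes"], "s3": ["Yes", "No"]})` groups s14 and s16 together and leaves s3 to the usual clustering.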

Titanic.pdf

DOT importing

Graphviz allows users to create graphs in DOT (https://graphviz.org/doc/info/lang.html).

One great enhancement would be to have the functionality in cegpy to allow a user to create an event tree/staged tree/ceg in dot and some way of adding the data that goes with it, and then importing that into our package in a way in which it can read the dot figure and the other data.

Output Dot File

Add Dot file output from pdp package. This gives the user the flexibility to use any number of Dot compatible packages for producing graphs in whatever style they like.
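As a rough illustration of what the DOT output could look like, the sketch below builds DOT text by hand from a list of (source, target, label) edges. It does not use any graph package; `to_dot` and its arguments are hypothetical names.

```python
def to_dot(edges, graph_name="ceg"):
    """Return a DOT digraph description for a list of (source, target, label) edges."""

    def quote(text):
        # Escape backslashes and double quotes so the DOT string stays well formed.
        escaped = str(text).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    lines = [f"digraph {graph_name} {{"]
    for source, target, label in edges:
        lines.append(f"    {quote(source)} -> {quote(target)} [label={quote(label)}];")
    lines.append("}")
    return "\n".join(lines)
```

Writing the returned string to a `.dot` file would let the user render it with any DOT-compatible tool.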

Remove unhelpful networkx functionality

Remove networkx functionality that could be used erroneously, such as adding edges directly, which a user could mistake for the way to add sampling zeros.

Need to identify the functions.

Bug in ChainEventGraph.generate()

When running .generate on a chain event graph I am getting a division by zero error.

Example code with the following dataset
Titanic.csv

from cegpy import StagedTree, EventTree, ChainEventGraph
import pandas as pd

data = pd.read_csv("Titanic.csv")

#get missing paths

missing_paths = [('Crew', 'Male', 'Child'),
 ('Crew', 'Male', 'Child'),
 ('Crew', 'Female', 'Child'),
 ('Crew', 'Female', 'Child'),
 ('Crew', 'Male', 'Child', 'Yes'),
 ('Crew', 'Male', 'Child', 'No'),
 ('Crew', 'Female', 'Child', 'Yes'),
 ('Crew', 'Female', 'Child', 'No'),
 ('1st', 'Male', 'Child', 'No'),
 ('1st', 'Female', 'Child', 'No'),
 ('2nd', 'Male', 'Child', 'No'),
 ('2nd', 'Female', 'Child', 'No')]

st = StagedTree(data,sampling_zero_paths=missing_paths)
st.calculate_AHC_transitions()
CEG = ChainEventGraph(st)
CEG.generate() #Issue on this line 

To Do 23/07/2021: Aditi

  • Download package from Pypi after Gareth uploads it
  • Run pytest locally
  • Write the abstract, introduction and model description sections

Rewrite EventTree class to inherit networkx.MultiGraph and use it for its data storage and graphing classes

networkx provides a lot of functionality for free that we can make use of.

By extending it to add functionality specific for EventTree generations from statistical datasets, we can provide a very helpful package, whilst giving the user a lot of flexibility.

  • Create new branch for the networkx modifications.
  • Determine the functions that we can replace.
  • Determine which functions need modification
  • Ensure tests still work

Conditional probabilities bug

The conditional probabilities generated by the AHC algorithm aren't getting updated properly. For example, if situations 1, 2 and 3 are in the same stage, they don't have the same posterior conditional probabilities.
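To illustrate the expected behaviour, here is a small sketch in which every situation in a stage receives the same probabilities. It assumes the shared value is the normalised sum of the priors and counts pooled over the stage (a Dirichlet-multinomial posterior mean); that formula, the function name and the input layout are assumptions, not a description of the cegpy code.

```python
from fractions import Fraction


def stage_posterior_probabilities(priors, counts):
    """Return one shared probability vector for all situations in a single stage.

    `priors` and `counts` map each situation in the stage to a list with
    one entry per outgoing edge (assumed layout, edges in the same order).
    """
    n_edges = len(next(iter(priors.values())))
    pooled = [Fraction(0)] * n_edges
    for situation in priors:
        for i in range(n_edges):
            # Pool prior and observed count for edge i across the stage.
            pooled[i] += Fraction(priors[situation][i]) + Fraction(counts[situation][i])
    total = sum(pooled)
    shared = [value / total for value in pooled]
    # Every situation in the stage gets the same vector.
    return {situation: list(shared) for situation in priors}
```

A regression test for this bug could assert that all values of the returned dictionary are equal for any stage produced by the AHC run.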

Calculate AHC taking much longer than expected

chestSim50000.csv

This code is running much longer than expected (hours). Can you see if it is just an issue with my computer?

from itertools import product

from cegpy import StagedTree, EventTree, ChainEventGraph
import pandas as pd

data = pd.read_csv("chestSim50000.csv")

# Get missing paths: every combination of observed values that never occurs in the data
uniques = [list(set(data[k])) for k in data.columns]
all_paths = list(product(*uniques))
paths = data.drop_duplicates().values.tolist()
all_paths_list = [list(ele) for ele in all_paths]

Length_of_paths = len(all_paths_list[0])
missing_paths = []
for i in range(0, Length_of_paths):
    # Compare paths truncated to their first (Length_of_paths - i) variables
    temp_paths = [path[:Length_of_paths-i] for path in paths]
    new_missing_paths_list = [path[:Length_of_paths-i] for path in all_paths_list if path[:Length_of_paths-i] not in temp_paths]
    if new_missing_paths_list != []:
        missing_paths = [*new_missing_paths_list, *missing_paths]
missing_paths = list(tuple(x) for x in missing_paths)
missing_paths = sorted(missing_paths, key=len)


st = StagedTree(data, sampling_zero_paths=missing_paths)
st.calculate_AHC_transitions()

New Feature: Holding Times

Holding Times

Introduction

In dynamic real-world systems, temporal effects can play a large role in the evolution of the process and contribute to our understanding of the system. To study temporal effects within a system, it is important to consider the time it takes for events to occur. For example, reasoning about processes that evolve over time is important in many domains such as medicine, fault diagnosis, policing, and any time-sensitive decision making.

We incorporate such a temporal analysis in a CEG by modelling the holding time associated with the transitions taking place within the CEG. Recall that a transition in a CEG is a movement from one node to another via the directed edge connecting them. A holding time along an edge e from node v to node v' indicates how long an individual spends in the state represented by node v before transitioning along edge e to node v'.

Modifications Required

  1. For inputs:
  • The user needs to indicate which columns contain holding times (these must be strictly numeric, with no non-numeric characters) and which transitions they relate to. This will need to be an argument that the user passes when first initialising the class.

    • Add argument holding_times = Mapping["{column_name}", "{time_column_name}"]
    • Add sanity checks
    • Tests: check that holding times are all numeric
    • Tests: check that holding times are all non-negative
    • Tests: check that, if variable is observed, then its corresponding holding time is also observed.
    • Add warning, if variable is not observed, corresponding holding time is ignored even if it is observed.
    • Test: check that variable order contains columns that aren't holding times
  • Functionality is only provided for conjugate Weibull distributions, where the shape parameter is known but the scale parameter is unknown.

    • Add argument, holding_time_distribution. Mapping["{column_name}", Tuple[Str[distribution], **kwargs]]
    • Test that the input shape parameters make sense
    • Test that all columns provided with a holding time dist also have a holding time column in holding_times
    • Test that all columns provided with a holding time column in holding_times have also been provided with a holding_time dist
  2. For methods:
  • A new separate AHC algorithm needs to be implemented to cluster the edges based on their holding time distributions. This does not replace the original AHC algorithm in the code, that is needed to cluster the situations/nodes based on their transition distributions.
  • There are two possible representations we can have for the output staged tree and CEG.
    • Rep1: Need to modify when two nodes are given the same colouring. It would need to check the output from the original and new AHC. Two nodes would be coloured the same when they are in the same cluster for the original AHC and also their outgoing edges which have the same edge label are in the same cluster in the new AHC. Rest remains the same.
    • Rep2: Two nodes are coloured the same when they are in the same cluster for the original AHC, and two edges are given the same colouring when they are in the same cluster for the new AHC. Need to modify when two nodes are collapsed into a single node in the CEG.
  • Need to modify the types of evidence that the propagation algorithm takes as an input. In addition to the types it already takes as evidence, it can also take the holding time in a specific node or a set of nodes as evidence. This would need to be point evidence (e.g. 10 -- interpreted as individual A took 10 days to transition from node v to v') rather than interval evidence (e.g. [10-14] -- interpreted as perhaps individual A took somewhere between 10 and 14 days to transition from v to v').
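The input sanity checks listed under "For inputs" could be sketched roughly as below. The sketch assumes the data is a list of row dictionaries and that `holding_times` maps each variable column to its holding time column, with `None` or NaN marking an unobserved cell. The function name and these conventions are hypothetical.

```python
import math
import warnings
from numbers import Real


def _is_missing(value):
    """Treat None and NaN as unobserved (an assumption made for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_holding_times(rows, holding_times):
    """Validate holding time columns; raise ValueError on the first problem found."""
    for index, row in enumerate(rows):
        for column, time_column in holding_times.items():
            value, time = row.get(column), row.get(time_column)
            if _is_missing(value):
                if not _is_missing(time):
                    # Variable unobserved: the holding time is ignored, with a warning.
                    warnings.warn(
                        f"row {index}: '{time_column}' ignored because '{column}' is unobserved"
                    )
                continue
            if _is_missing(time):
                raise ValueError(f"row {index}: '{column}' observed but '{time_column}' is missing")
            if isinstance(time, bool) or not isinstance(time, Real):
                raise ValueError(f"row {index}: '{time_column}' must be numeric, got {time!r}")
            if time < 0:
                raise ValueError(f"row {index}: '{time_column}' must be non-negative, got {time!r}")
```

Each branch corresponds to one of the tests in the list, so the same cases can be reused as unit-test fixtures.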

Unittests

Write unit-tests to show that these features are working as expected:

  • Structural zeros
  • Structural missing values
  • Propagation
  • StagedTrees.py

R Binding

It would be great to have a binding/wrapper that allows the cegpy package to be used from R. I don't know much about how to do this, so this issue description itself will need more work!

I know that you can use the reticulate package (https://rstudio.github.io/reticulate/index.html) to run python scripts and functions from R but I am not sure if all the ceg python functions/classes will be callable from R without adding some more code in. So this wrapper/binding code will need to be added in. Additionally, all the functions will have to be tested in R.

Sensible hyperstage and priors

Check that the hyperstage and priors, if provided as input by the user, are sensible.


For the hyperstage, this means that, at the very least:
• Situations in the same hyperstage have the same number of outgoing edges;
• All the situations appear in at least one sublist in the hyperstage.

Note that this keeps open the possibility that the user might add two situations into the same sublist in the hyperstage such that they have the same number of edges but not the same edge labels.

This also implies that, if the user adds situations with only one emanating edge into the same sublist in the hyperstage, then they will all be merged into the same stage irrespective of their edge labels.


For the priors, we need to check that:
• The length of each sublist in the prior matches the number of outgoing edges for its corresponding situation.
• All elements in each sublist in the prior must be > 0
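A minimal sketch of these checks, assuming the number of outgoing edges per situation is available as a dictionary and that priors are given per situation. The function names and the dictionary layout are hypothetical and may differ from how cegpy stores these inputs.

```python
def check_hyperstage(hyperstage, n_edges):
    """Check a hyperstage (list of lists of situations) against edge counts.

    `n_edges` maps each situation to its number of outgoing edges.
    """
    for sublist in hyperstage:
        sizes = {n_edges[situation] for situation in sublist}
        if len(sizes) > 1:
            raise ValueError(f"situations in {sublist} differ in number of outgoing edges")
    covered = {situation for sublist in hyperstage for situation in sublist}
    missing = set(n_edges) - covered
    if missing:
        raise ValueError(f"situations missing from the hyperstage: {sorted(missing)}")


def check_priors(priors, n_edges):
    """Check that each prior has one strictly positive entry per outgoing edge.

    `priors` maps each situation to its list of prior values.
    """
    for situation, prior in priors.items():
        if len(prior) != n_edges[situation]:
            raise ValueError(f"prior for {situation} has {len(prior)} entries, "
                             f"expected {n_edges[situation]}")
        if any(value <= 0 for value in prior):
            raise ValueError(f"prior for {situation} must be strictly positive")
```

Both functions return `None` when the input passes, so they can be called at the start of the AHC run without affecting valid input.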

More tests for edge_info argument.

Add tests to ensure that the edge_info argument for figure creation is working fine.

Tests have been added to ensure that it shows the right warning message when the argument is inadmissible (i.e. it is a non-existent attribute) but not for when the argument is admissible. For event trees, this only means 'count', so no additional tests are needed. Additional tests are needed for staged trees and CEGs, as the argument can be set to any of the following: 'prior', 'count', 'posterior', 'probability'.

Writing Paper

@ashenvi10 Please modify this to flesh it out to match any plan you might have.

Python implementation section

  • Diagram of the software architecture
  • Description of each class and what it tries to do

Experimental section

  • Show output for different example datasets
  • Run experiments to determine execution time
  • Report experiments in paper

Discussion

  • Add discussion section
  • Compare to existing packages

To Do 23/07/2021: Peter

  • Download package from Pypi after Gareth uploads it
  • Run pytest locally
  • Run the code with different datasets and time it

Missing label and complete case True

All empty cells and NaNs are considered structurally missing, irrespective of any labels.

However, when a missing label is given and complete_case = True, we set all missing_label occurrences in the dataframe to np.nan and then drop all na's.
Need to edit the event.py code to fix this.

CI: Fix test fails

The CI is no longer up to scratch. It needs to be rewritten to make use of tox in a docker file.

This is due to the fact that the tests are interacting with graphviz, and the standard Python3 unit testing CI tool does not have it installed.

Create event trees from summarised data

Add a feature to create event / staged trees from a summarised format of data.
I.e. a data frame recording the counts of different combinations of the variables occurring in the data set.

e.g.

var1  var2  n_observations
A     1     500
B     1     345
A     0     234
B     0     6567
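One simple way to support this would be to expand the summary back into one row per observation before building the tree. The sketch below does that for a list of row dictionaries; `expand_summary` and the default count column name are hypothetical.

```python
def expand_summary(summary_rows, count_column="n_observations"):
    """Turn summarised rows into one row per observation.

    Each input row is a dict of variable values plus a count column.
    """
    expanded = []
    for row in summary_rows:
        # Everything except the count column describes the path.
        path = {key: value for key, value in row.items() if key != count_column}
        expanded.extend(dict(path) for _ in range(int(row[count_column])))
    return expanded


summary = [
    {"var1": "A", "var2": 1, "n_observations": 500},
    {"var1": "B", "var2": 1, "n_observations": 345},
    {"var1": "A", "var2": 0, "n_observations": 234},
    {"var1": "B", "var2": 0, "n_observations": 6567},
]
rows = expand_summary(summary)
```

Expanding is easy but memory-hungry for large counts; reading the counts directly into the tree's edge counts would avoid that, at the cost of a larger change.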

Add sampling zero paths as csv file input

This is basically to say that the user does not need to include the sampling zero paths as an argument in the instantiation of the class but can instead provide the sampling zero paths written in a csv.
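A possible shape for this, assuming a CSV layout with one path per row and one edge label per cell (the layout and the function name are assumptions):

```python
import csv
import io


def read_sampling_zero_paths(csv_file):
    """Read sampling zero paths from an open CSV file, one path per row."""
    paths = []
    for row in csv.reader(csv_file):
        # Drop empty cells so that paths of different lengths can share one file.
        path = tuple(cell.strip() for cell in row if cell.strip())
        if path:
            paths.append(path)
    return paths


# In-memory stand-in for a file on disk.
example = io.StringIO("Crew,Male,Child\nCrew,Female,Child,Yes\n")
sampling_zero_paths = read_sampling_zero_paths(example)
```

The resulting list of tuples has the same form as the `sampling_zero_paths` argument used elsewhere in these issues.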

Structural/non-structural labels in the dataset

Currently the code assumes all NaNs and empty cells to be structurally missing.

However, there might be actual sampling missing values.
Users could be given the option to specify the labels used for structural/sampling missing values so they can be treated differently.

To begin with, sampling missingness may be assumed to be missing at random and we can do a complete-case analysis (i.e. remove the rows with sampling missing values), or we could include them within the event tree under the edge label "missing".
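The two options could look roughly like this on a list of row dictionaries, where a user-chosen label marks sampling missing values and `None` marks structurally missing ones. The function names and conventions are hypothetical.

```python
def complete_case(rows, sampling_missing_label):
    """Drop every row that contains a sampling missing value."""
    return [row for row in rows if sampling_missing_label not in row.values()]


def relabel_sampling_missing(rows, sampling_missing_label, new_label="missing"):
    """Keep all rows, replacing sampling missing values with an edge label."""
    return [
        {key: (new_label if value == sampling_missing_label else value)
         for key, value in row.items()}
        for row in rows
    ]
```

In both cases structurally missing cells (`None` here) are left untouched, so they can still be handled separately.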

Colours in Staged tree too similar

We need to make the colours that are assigned have more "guaranteed" variation from each other.
Perhaps split the colour range evenly, to maximise the difference between colours?
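Splitting the colour range evenly could be sketched with the standard-library `colorsys` module, spacing hues uniformly around the colour wheel. The function name and the default saturation and value are arbitrary choices for illustration.

```python
import colorsys


def evenly_spaced_colours(n, saturation=0.6, value=0.95):
    """Return n hex colours whose hues are spread evenly over the colour wheel."""
    colours = []
    for i in range(n):
        # Hue i/n puts the colours at equal angular distances from each other.
        red, green, blue = colorsys.hsv_to_rgb(i / n, saturation, value)
        colours.append(
            "#{:02X}{:02X}{:02X}".format(round(red * 255), round(green * 255), round(blue * 255))
        )
    return colours
```

Even hue spacing maximises the angular distance between colours, although for a large number of stages neighbouring hues will still look similar.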

Create Docker Container

Docker images and containers allow us to all run the code with standard dependencies and ensure that there are no "environment" related issues.

  • Get docker working on my own PC
  • Install docker CI to "actions" on Github
  • Assist others in the installation and use of docker on their laptops.

Documentation

For auto-generated documentation, write all the docstrings to make sure that each class' page is readable and understandable.

Auto-gen

  • EventTree
  • StagedTree
  • ChainEventGraph
  • ChainEventGraphReducer

Readme

  • Welcome and detailed description of the package and how to use it.
  • Links to auto-gen documentation page
  • Links to the notebook and examples
  • Some pictures of example outputs

New Feature: Define order of columns based on optional parameter "var_order"

Description

The dataset used to generate the Event Tree may have columns out of order. Rather than manually swapping them around in a spreadsheet editor, the user could simply add a list ordering of the columns.

Branch

To work on this, please commit to the "event_tree_feature_var_order" branch.
Once complete, please let @g-walley know with a pull request, and he will merge it into the dev branch.

Modification required

In the EventTree class in event.py, add an optional parameter called "var_order".
Parameter will take the form of a list of column names: e.g: ["col1", "col3", "col2"]

Look at this post on stack overflow: https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns

The response from freddygv looks like a good solution.
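Before reordering, the `var_order` argument probably needs validating. A minimal sketch, assuming unknown or repeated column names should be rejected and that omitting the argument keeps the existing order (`validate_var_order` is a hypothetical helper name):

```python
def validate_var_order(columns, var_order=None):
    """Return the column order to use, or raise ValueError if var_order is invalid."""
    columns = list(columns)
    if var_order is None:
        # No ordering supplied: keep the order found in the dataset.
        return columns
    var_order = list(var_order)
    if len(set(var_order)) != len(var_order):
        raise ValueError("var_order contains repeated column names")
    unknown = [name for name in var_order if name not in columns]
    if unknown:
        raise ValueError(f"var_order contains unknown columns: {unknown}")
    return var_order
```

With a pandas DataFrame `df`, selecting `df[validate_var_order(df.columns, var_order)]` then returns the columns in the requested order.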

Merge single floret edges.

Say there are two consecutive florets A and B with only one emanating edge from each (say yes from A, and recovery from B):

... -> [A] -- yes--> [B] -- recovery --> [C] ...

Then this might be simplified to:

.... -> [A] -- yes & recovery --> [C] -> ...

Which would possibly make the graph less cluttered.

As an illustrative example, I coded up the kidney liver disorder example from section 5.1.3 of J.Q. Smith's CEG book.

Screenshot 2022-04-08 at 19 07 23

The idea is to skip position w_6 and create a direct edge from w_2 to w_\infty labelled yes & R with a probability of 0.75.
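A rough sketch of the merge, assuming the graph is stored as a dictionary from each node to its outgoing (label, child, probability) edges. It generalises the idea slightly by skipping any node with exactly one outgoing edge. The function name, the data layout and the "no" edge in the usage example are hypothetical.

```python
def merge_single_edge_florets(edges, root):
    """Skip nodes with a single outgoing edge, joining labels with ' & '.

    `edges` maps node -> list of (label, child, probability). Nodes that
    are skipped no longer appear in the result. Assumes an acyclic graph.
    """
    merged = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node in merged:
            continue
        merged[node] = []
        for label, child, prob in edges.get(node, []):
            # Follow the chain while the child has exactly one way out.
            while len(edges.get(child, [])) == 1:
                next_label, next_child, next_prob = edges[child][0]
                label = f"{label} & {next_label}"
                prob = prob * next_prob
                child = next_child
            merged[node].append((label, child, prob))
            stack.append(child)
    return merged


example_edges = {
    "w_2": [("yes", "w_6", 0.75), ("no", "w_inf", 0.25)],  # "no" edge is illustrative
    "w_6": [("R", "w_inf", 1.0)],
    "w_inf": [],
}
simplified = merge_single_edge_florets(example_edges, "w_2")
```

Here the edge from w_2 through w_6 becomes a single edge labelled "yes & R" with probability 0.75, and w_6 disappears from the result.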

Include all unobserved paths

Currently there is no way to create an event tree that includes all unobserved paths, without specifying each path individually
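In the meantime, the unobserved paths can be computed outside the package. The sketch below enumerates, for every prefix length, the combinations of observed values that never occur as a prefix in the data, using set lookups to keep it fast. The function name and the choice of tuples as the row type are assumptions.

```python
from itertools import product


def unobserved_paths(rows):
    """Return all unobserved paths, shortest first.

    `rows` is a list of equal-length tuples, one per observation.
    """
    if not rows:
        return []
    n_vars = len(rows[0])
    # Unique values of each variable, in order of first appearance.
    uniques = [list(dict.fromkeys(row[i] for row in rows)) for i in range(n_vars)]
    missing = []
    for depth in range(1, n_vars + 1):
        # Prefixes of this length that do occur in the data.
        observed = {row[:depth] for row in rows}
        for candidate in product(*uniques[:depth]):
            if candidate not in observed:
                missing.append(candidate)
    return missing
```

With a pandas DataFrame `data`, the rows can be obtained with `list(data.itertuples(index=False, name=None))`. The number of candidate paths grows with the product of the number of levels per variable, so this is only practical for moderately sized trees.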

Issue with calculate_AHC_transitions and sampling zero paths in dev branch

The following code raises a type error when calculating the AHC transitions

TypeError: 'Fraction' object is not iterable

import pandas as pd
from cegpy import StagedTree

df = pd.read_excel("data/medical_dm_modified.xlsx")
falls_st = StagedTree(dataframe=df, sampling_zero_paths=[('Blast', 'Experienced', 'Easy', 'Blast', 'extra')])
falls_st.calculate_AHC_transitions()
falls_st.create_figure("st_fig_extra_paths.pdf")
