UCL / TLOmodel
Epidemiology modelling framework for the Thanzi la Onse project
Home Page: https://www.tlomodel.org/
License: MIT License
Please can we have some basic functionality for dependency handling,
i.e. the simulation fails to run unless a required module is already registered.
As part of developing a module it would be helpful to see a log of the changes that occur for one particular person. This would encompass:
The way I can think of doing this would be to have a Logging Event that occurs each day (the last event each day, determined by setting the timestamp to one microsecond to midnight) that has a 'self.person_id_to_track' property. It would store the 'sim.population.props' dataframe each time-step so that it can compare the df from "yesterday" with that of "today" for just that one person_id. It could then identify any changes and output these to the log. It could also scan the log from the HealthSystem to identify all HSI that have occurred involving that person.
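The day-to-day comparison described above could be sketched as follows. This is a minimal illustration only (the helper name `diff_person_row` and the toy dataframes are hypothetical, standing in for consecutive snapshots of `sim.population.props`):

```python
import pandas as pd

def diff_person_row(yesterday: pd.DataFrame, today: pd.DataFrame, person_id: int) -> dict:
    """Return {column: (old, new)} for every property that changed for one person."""
    old = yesterday.loc[person_id]
    new = today.loc[person_id]
    changed = old.ne(new)  # element-wise inequality, per column
    return {col: (old[col], new[col]) for col in old.index[changed]}

# Toy snapshots standing in for yesterday's and today's population dataframe:
yesterday = pd.DataFrame({'is_alive': [True, True], 'age_years': [30, 40]})
today = pd.DataFrame({'is_alive': [True, False], 'age_years': [30, 40]})
print(diff_person_row(yesterday, today, person_id=1))  # {'is_alive': (True, False)}
```

The Logging Event would call something like this once per day and write the resulting dict to the log for the tracked person.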
So that we use the GitHub versioning for the excel datafiles and all sets of files remain in one place.
Add more detailed steps to get started from nothing.
Researchers develop their unit tests with relatively small population sizes, but it would be good to stress-test rare events with large population sizes locally.
Researchers and Travis would still use small populations, but before merging into master this could be used to test very rare events.
On GitHub, it's not obvious that there is a wiki.
Let's add a link to https://github.com/UCL/TLOmodel/wiki from the README.rst file
The following line
Line 167 in 82758d0
creates a copy of the dataframe, adding columns for each external variable. The benefit is that the external variable is indistinguishable from other properties of the population, so can be operated on in the same way. The downside is... it creates a copy of the dataframe.
We need to monitor how external variables are used, and then determine what to do to avoid copying the dataframe.
On the Implementing a disease module page, code for test file, replace
from tlo.methods import demography
with
from tlo.methods import demography, contraception
Also, how does one know the test is running as expected? I get copious amounts of output (if run as a standalone Python script), but no idea if it is correct. Although when run using pytest, it says it passed.
In checking disease modules, I am noticing that it is easy for discrepancies to arise regarding parameters. This can be symptomatic of a deeper issue with the code (e.g. typos, or changes from multiple revisions of the code leading to deprecation of some features).
It would be good to have a check as follows that would:
Each item in each of the following places must map perfectly 1:1
And that each of those parameters must be used somewhere (either in the module itself or referred to from another place).
I can see how internal consistency can be established between PARAMETERS, self.parameters and the resource file. However, checking for 'use' of the parameters would require a "cold read" of the files and a recognition of actual usage (as opposed to a comment or the initial declaration).
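The internal-consistency part of the check could be sketched like this. The function name and arguments are illustrative (in reality the inputs would come from the module's PARAMETERS dict, module.parameters after read_parameters(), and the parameter names read from the resource file):

```python
def check_parameter_consistency(declared, loaded, resource_names):
    """Report where the three places that parameters live fail to map 1:1.

    Returns a list of problem descriptions; an empty list means all agree.
    """
    declared, loaded, resource_names = set(declared), set(loaded), set(resource_names)
    problems = []
    if declared != loaded:
        # symmetric difference: names present in one place but not the other
        problems.append(f"PARAMETERS vs self.parameters mismatch: {declared ^ loaded}")
    if declared != resource_names:
        problems.append(f"PARAMETERS vs resource file mismatch: {declared ^ resource_names}")
    return problems

print(check_parameter_consistency(['p_cure'], ['p_cure'], ['p_cure']))  # []
```

Checking actual 'use' of each parameter would still need the "cold read" of the source files described above.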
Add to “module design”:
List of naming conventions
Things to check prior to PR (merge from master; other tests work, plus more specific tests; adherence to naming conventions)
A common pattern is for parameter values to be defined in an Excel worksheet with two columns, "name" and "value". The value can be any valid Parameter type. Currently, each one has to be loaded by hand (for example, this block of code).
We want to write a utility function that, given a name/value dataframe (the module would still be responsible for loading the workbook etc.) and the module's PARAMETERS definitions, (i) gets the corresponding parameter value from the dataframe, (ii) checks the types match (doing any necessary conversion), and (iii) assigns it to the corresponding entry in module.parameters.
If types don't match, fail with error.
If parameter with given name doesn't exist in Excel sheet, fail with error.
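A minimal sketch of such a utility might look like this. It is an assumption-laden illustration: `expected_types` stands in for the type information declared in the module's PARAMETERS (the real Parameter class would carry this), and the only conversion allowed is int to float:

```python
import pandas as pd

def load_parameters_from_dataframe(parameters: dict, expected_types: dict,
                                   df: pd.DataFrame) -> None:
    """Fill 'parameters' from a two-column ('name', 'value') dataframe,
    failing with an error on missing names or mismatched types."""
    values = df.set_index('name')['value']
    for name, expected in expected_types.items():
        if name not in values.index:
            raise KeyError(f"parameter '{name}' missing from resource sheet")
        value = values[name]
        if expected is float and isinstance(value, int):
            value = float(value)  # the one conversion we allow: int -> float
        if not isinstance(value, expected):
            raise TypeError(f"parameter '{name}': expected {expected.__name__}, "
                            f"got {type(value).__name__}")
        parameters[name] = value

# Usage sketch (dtype=object keeps each cell's own Python type, as Excel reads often do):
df = pd.DataFrame({'name': ['p_infection', 'n_repeats'],
                   'value': pd.Series([0.05, 3], dtype=object)})
params = {}
load_parameters_from_dataframe(params, {'p_infection': float, 'n_repeats': int}, df)
print(params)  # {'p_infection': 0.05, 'n_repeats': 3}
```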
Current issues with logging:
Aims:
We have a common use case as follows:
A solution to this would be the ability to save the simulation at a certain point to a file, then load the file and resume the simulation under the same or different parametric conditions.
I thought this might be relatively straightforward using pickle (i.e. pickle the sim, which contains sim.population.props and the event_queue, and all the modules and their internal contents). Then, unpickle the sim, manipulate any parameters in the modules, and restart the sim using sim.simulate(end_date=end_of_part_two_date). (See script below.)
However, I tried this and the unpickling failed with a RecursionError. Stack Overflow suggested this is a common error when pickling complex classes and suggested increasing the recursion limit, but this led to the console crashing for me.
Do you have any thoughts on this?
Short-term:
Medium-term:
import pickle
from pathlib import Path

from tlo import Date, Simulation
from tlo.methods import contraception, demography

outputpath = Path("./outputs")
resourcefilepath = Path("./resources")

start_date = Date(2010, 1, 1)
end_date_part_one = Date(2011, 1, 2)
popsize = 1000

# Run part one of the simulation
sim = Simulation(start_date=start_date)
sim.register(demography.Demography(resourcefilepath=resourcefilepath))
sim.register(contraception.Contraception(resourcefilepath=resourcefilepath))
sim.seed_rngs(1)
sim.make_initial_population(n=popsize)
sim.simulate(end_date=end_date_part_one)

# Pickling: a basic object round-trips fine; the sim and its event queue do not
with open(outputpath / 'pickled_basic_object', 'wb') as f:
    pickle.dump({'1': 1, '2': 2}, f)
with open(outputpath / 'pickled_sim', 'wb') as f:
    pickle.dump(sim, f)
with open(outputpath / 'pickled_event_queue', 'wb') as f:
    pickle.dump(sim.event_queue, f)

with open(outputpath / 'pickled_basic_object', 'rb') as f:
    x = pickle.load(f)
with open(outputpath / 'pickled_sim', 'rb') as f:
    x = pickle.load(f)  # fails with RecursionError
with open(outputpath / 'pickled_event_queue', 'rb') as f:
    x = pickle.load(f)  # fails with RecursionError

# Increasing recursion limits -- didn't help!
# https://stackoverflow.com/questions/3323001/what-is-the-maximum-recursion-depth-in-python-and-how-to-increase-it
# import sys
# sys.getrecursionlimit()
# sys.setrecursionlimit(90000)
Only testing!
Most use cases of 'request_consumables' in the HealthSystem involve getting one particular item_code or package_code. Despite this, the current implementation requires an elaborate 'cons_req_as_footprint' dict() to be created for each request.
Therefore, create a helper function that accepts a single item_code or package_code and returns a bool (for availability), to make this easier.
i.e. the usage in that simple case changes from:
item_code = self.module.parameters['anti_depressant_medication_item_code']
result_of_cons_request = self.sim.modules['HealthSystem'].request_consumables(
    hsi_event=self,
    cons_req_as_footprint={'Intervention_Package_Code': dict(), 'Item_Code': {item_code: 1}}
)['Item_Code'][item_code]
to:
item_code = self.module.parameters['anti_depressant_medication_item_code']
result_of_cons_request = self.sim.modules['HealthSystem'].request_consumables_as_item_code(self, item_code)
What is the best way for us to capture how uncertainty in model parameters is propagated to model outputs?
We would like all (or many) input parameters to be associated with several credible values, and for it to be easy to run the model with each set and have the results bound together, so that summaries can be made that cut across the runs induced by each set of parameter values.
The system we have now would provide a work flow as follows:
..., but perhaps this can be streamlined, e.g.
It is most common for the usage of this to be:
self.sim.modules['HealthSystem'].schedule_hsi_event(
    hsi_event=hsi_event,
    priority=0,
    topen=self.sim.date,
    tclose=None
)
Therefore, add in defaults such that:
priority=1
topen=self.sim.date
tclose=None
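A sketch of how those defaults might be wired up (the class and names here are a minimal stand-in for the real HealthSystem, not its actual implementation; note that `topen` needs a `None` sentinel so "today" is evaluated at call time, not at function-definition time):

```python
from datetime import date

class HealthSystemSketch:
    """Minimal stand-in illustrating default arguments for schedule_hsi_event."""
    def __init__(self, today):
        self.date = today
        self.queue = []

    def schedule_hsi_event(self, hsi_event, priority=0, topen=None, tclose=None):
        # None is the sentinel: the default 'topen' is whatever today's date
        # is when the method is called.
        if topen is None:
            topen = self.date
        self.queue.append((priority, topen, tclose, hsi_event))

hs = HealthSystemSketch(today=date(2010, 1, 1))
hs.schedule_hsi_event('dummy_event')
print(hs.queue[0])  # (0, datetime.date(2010, 1, 1), None, 'dummy_event')
```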
i.e. assert isinstance(date, Date) in schedule_event
When you declare a categorical-type variable, you should also have to specify categories=["b", "c", "d"], ordered=False or similar. The list of categories should probably be compulsory to specify; ordered could default to True.
If a user tries to assign a value not in the list, NaN is used instead. If ordered, the order given in the list is used as the sort order for the property.
See also https://pandas.pydata.org/pandas-docs/stable/categorical.html
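The pandas behaviour described above can be seen in a couple of lines (a value outside the declared category list becomes NaN at construction, and `ordered=True` fixes the sort order):

```python
import pandas as pd

# 'z' is not in the declared categories, so it silently becomes NaN:
s = pd.Series(pd.Categorical(['b', 'z'], categories=['b', 'c', 'd'], ordered=True))
print(s.tolist())      # ['b', nan]
print(s.cat.ordered)   # True: 'b' < 'c' < 'd' is the sort order for this property
```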
I think the following can now be added to the wiki:
Management
File locations (e.g. resource file)
Running of analyses
Use of the logger in the disease module
Use of flake8
Substantive
New definitions of skeleton.py which includes the healthsystem
Use of pytests
Use of assert statements to check things on the fly
And for the cookbook:
Syntax for establishing and using pytests.
Naming/file conventions are as described on the wiki at module design
Excel and pandas can do some strange things together with boolean and date fields.
Overall it would probably work as expected, and this is fairly low priority, but if we start using these types for disease modules then we should put in some error checking.
It may be useful to include age in the population.props dataframe rather than merging population.age into props in many functions. It would need to be constantly updated in the background, I think.
I am noticing that the tests in master for several of the disease modules fail when running at small population sizes with the logger in use (due to the outputting of inf in several cases).
Going forward all disease module authors should confirm that their module works at arbitrarily small sample sizes (this is included in the checklist for PR).
But, we need to go back to the disease modules so that a test of this format works:
def test_all_modules_running_at_small_population_size():
    # Get ready for temporary log-file
    f = tempfile.NamedTemporaryFile(dir='.')
    fh = logging.FileHandler(f.name)
    fr = logging.Formatter("%(levelname)s|%(name)s|%(message)s")
    fh.setFormatter(fr)
    logging.getLogger().addHandler(fh)

    # Establish the simulation object
    sim = Simulation(start_date=start_date)
    sim.seed_rngs(0)

    # Define the service availability
    service_availability = ['*']

    # Register the appropriate modules
    sim.register(demography.Demography(resourcefilepath=resourcefilepath))
    sim.register(contraception.Contraception(resourcefilepath=resourcefilepath))
    sim.register(lifestyle.Lifestyle())
    sim.register(healthsystem.HealthSystem(resourcefilepath=resourcefilepath,
                                           service_availability=service_availability))
    sim.register(healthburden.HealthBurden(resourcefilepath=resourcefilepath))
    sim.register(oesophageal_cancer.Oesophageal_Cancer(resourcefilepath=resourcefilepath))
    sim.register(depression.Depression(resourcefilepath=resourcefilepath))
    sim.register(epilepsy.Epilepsy(resourcefilepath=resourcefilepath))
    sim.register(hiv.hiv(resourcefilepath=resourcefilepath))
    sim.register(tb.tb(resourcefilepath=resourcefilepath))
    sim.register(male_circumcision.male_circumcision(resourcefilepath=resourcefilepath))

    # Run the simulation and flush the logger
    sim.make_initial_population(n=100)
    sim.simulate(end_date=end_date)
    check_dtypes(sim)

    # Read the results
    fh.flush()
    output = parse_log_file(f.name)
    f.close()

    # Do the checks:
    # correctly configured index (outputs on 31st December in each year of the
    # simulation for each age/sex group)
    dalys = output['tlo.methods.healthburden']['DALYS']
    age_index = sim.modules['Demography'].AGE_RANGE_CATEGORIES
    sex_index = ['M', 'F']
    year_index = list(range(start_date.year, end_date.year + 1))
    correct_multi_index = pd.MultiIndex.from_product([sex_index, age_index, year_index],
                                                     names=['sex', 'age_range', 'year'])
    dalys['year'] = pd.to_datetime(dalys['date']).dt.year
    assert (pd.to_datetime(dalys['date']).dt.month == 12).all()
    assert (pd.to_datetime(dalys['date']).dt.day == 31).all()
    output_multi_index = dalys.set_index(['sex', 'age_range', 'year']).index
    assert output_multi_index.equals(correct_multi_index)

    # check that there is a YLD for each module registered
    yld_colnames = [colname for colname in dalys.columns if 'YLD' in colname]
    module_names_in_output = set()
    for yld_colname in yld_colnames:
        module_names_in_output.add(yld_colname.split('_', 2)[1])
    assert module_names_in_output == {'Epilepsy', 'Depression', 'Oesophageal cancer'}
@ihawryluk and @jwr42 have experienced two reasons for the unit tests failing when running on Windows: a default integer type of int32 instead of int64, and the documented behaviour of NamedTemporaryFile:
tempfile.NamedTemporaryFile([mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None[, delete=True]]]]]])
... (it can be so used on Unix; it cannot on Windows NT or later) ...
The LinearModel is almost always used with an intercept of 1.0 and LinearModelType.MULTIPLICATIVE. It **may** be useful to make this the default so that this common case can be written more concisely.
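One way to provide the shortcut is an alternative constructor. The sketch below uses illustrative names, not the actual tlo LinearModel API:

```python
class LinearModelSketch:
    """Toy stand-in for LinearModel, to show the proposed concise constructor."""
    def __init__(self, model_type='multiplicative', intercept=1.0, *predictors):
        self.model_type = model_type
        self.intercept = intercept
        self.predictors = predictors

    @classmethod
    def multiplicative(cls, *predictors):
        """Concise form for the common case: intercept 1.0, multiplicative type."""
        return cls('multiplicative', 1.0, *predictors)

lm = LinearModelSketch.multiplicative()
print(lm.model_type, lm.intercept)  # multiplicative 1.0
```

A classmethod keeps the verbose general constructor available while making the common case a one-liner.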
Hi Asif,
Have just started working on a new computer and went through all the startup steps again, installing and pycharm configurations etc. I think I got there but it was a bit tricky. Do you think it would be possible to do a screencast of doing it, from a completely blank machine to make it easier?
Thanks very much
Tim
In the course of implementing changes to improve performance (see #63), @stefpiatek noticed that repeated runs of the analysis_hiv1 script do not yield the same final population dataframe at the end of the simulation. Further debugging shows the events are not running in the same order (never mind the different events).
The way TLOmodel is designed, setting the seed for the simulation (e.g. sim.set_seed(0)) and then using the rng supplied by the module should always reproduce the same run. That means either:
We're checking both, but please can modellers ensure all random calls use the RandomState object supplied by the module (i.e. self.rng inside the module or self.module.rng inside events).
1: not picking on this module, just a use case!
Can the code that reads the Excel file automatically "know" which properties of the person need to be interrogated in order to determine the appropriate probability of that event occurring, so that the influence of variables on the probability can be manipulated by editing the Excel sheet, without editing the code?
Add a short tutorial/guide to using LinearModel helper to the wiki. The tests have some detached examples, but probably not enough for real world use.
Initialises the following properties of individuals in the population using demographic data:
For instance, we record date_of_birth as an actual property, but many processes depend on age, which should be computed on the fly from DOB and the current date.
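Computing age on the fly is a small calculation. A scalar sketch (the function name is illustrative; the same logic can be vectorised over the population dataframe):

```python
from datetime import date

def age_in_years(date_of_birth: date, today: date) -> int:
    """Whole years of age, computed from date_of_birth and the current date."""
    # Has this year's birthday happened yet?
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)

print(age_in_years(date(1980, 6, 15), date(2010, 1, 1)))  # 29
print(age_in_years(date(2005, 1, 1), date(2010, 1, 1)))   # 5
```

Comparing (month, day) tuples avoids the off-by-one errors that dividing a day count by 365.25 can introduce at exact birthdays.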
During development of disease modules, it would be useful to be able to dump entire population/affected individuals pre- and/or post-event. Part of a 'debug' mode that can be turned off.
When creating a tox configuration from running within PyCharm, the py36 pytest run fails.
This seems to be because the newest version of pytest (installed in the tox environment) is not compatible with the pytest_runner that PyCharm uses.
Error message given:
py36 runtests: commands[0] | /Users/stef/UCL/TLOmodel/.tox/py36/bin/python /Applications/PyCharm.app/Contents/helpers/pycharm/pytestrunner.py -p pytest_teamcity --cov --cov-report=term-missing -vv tests
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/helpers/pycharm/pytestrunner.py", line 60, in <module>
main()
File "/Applications/PyCharm.app/Contents/helpers/pycharm/pytestrunner.py", line 34, in main
pluginmanager=_pluginmanager, args=args)
File "/Users/stef/UCL/TLOmodel/.tox/py36/lib/python3.6/site-packages/pluggy/hooks.py", line 289, in __call__
return self._hookexec(self, self.get_hookimpls(), kwargs)
File "/Users/stef/UCL/TLOmodel/.tox/py36/lib/python3.6/site-packages/pluggy/manager.py", line 87, in _hookexec
return self._inner_hookexec(hook, methods, kwargs)
File "/Users/stef/UCL/TLOmodel/.tox/py36/lib/python3.6/site-packages/pluggy/manager.py", line 81, in <lambda>
firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
File "/Users/stef/UCL/TLOmodel/.tox/py36/lib/python3.6/site-packages/pluggy/callers.py", line 203, in _multicall
gen.send(outcome)
File "/Users/stef/UCL/TLOmodel/.tox/py36/lib/python3.6/site-packages/_pytest/helpconfig.py", line 89, in pytest_cmdline_parse
config = outcome.get_result()
File "/Users/stef/UCL/TLOmodel/.tox/py36/lib/python3.6/site-packages/pluggy/callers.py", line 80, in get_result
raise ex[1].with_traceback(ex[2])
File "/Users/stef/UCL/TLOmodel/.tox/py36/lib/python3.6/site-packages/pluggy/callers.py", line 187, in _multicall
res = hook_impl.function(*args)
File "/Users/stef/UCL/TLOmodel/.tox/py36/lib/python3.6/site-packages/_pytest/config/__init__.py", line 720, in pytest_cmdline_parse
self.parse(args)
File "/Users/stef/UCL/TLOmodel/.tox/py36/lib/python3.6/site-packages/_pytest/config/__init__.py", line 924, in parse
assert self.invocation_params.args == args
AssertionError
If the tox environment's pytest version is pinned to 5.0.1 it runs fine; it is broken in 5.1.0+. We might need to think about how we want to pin the version of pytest, or see if PyCharm is updated to fix this.
Probably AppVeyor
This is the operation that is done by merge with the reset_index() modifier.
It is used to assign a value (often a probability) to individuals in the model based on their properties (age, sex, marital status, for example) by looking up in a long-form data frame that has been imported to the module and is contained within Parameters{}.
As this is a tricky operation it would be good if a specialised helper function could do it. It would perhaps:
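A minimal sketch of such a helper (the name `lookup_probability` and the toy data are illustrative; the real table would come from the module's parameters):

```python
import pandas as pd

def lookup_probability(people: pd.DataFrame, table: pd.DataFrame,
                       keys: list, value_col: str = 'prob') -> pd.Series:
    """Look up a per-person value from a long-form parameter table,
    merging on 'keys' while preserving the person index."""
    merged = people.reset_index().merge(table, on=keys, how='left').set_index('index')
    return merged[value_col]

# Toy population (index = person_id) and long-form lookup table:
people = pd.DataFrame({'sex': ['M', 'F'], 'age_group': ['0-4', '5-9']}, index=[10, 11])
table = pd.DataFrame({'sex': ['M', 'F'],
                      'age_group': ['0-4', '5-9'],
                      'prob': [0.1, 0.2]})
print(lookup_probability(people, table, keys=['sex', 'age_group']))
```

Because `merge` discards the left index, the reset_index/set_index dance is exactly the tricky part the helper would hide; it could also assert that no person failed to match (no NaN in the result).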
Some errors are appearing on Windows machines due to a default setting returning int32 instead of int64.
demography lines 191 and 276 could be changed to:
df.loc[df.is_alive, 'age_days'] = age_in_days.dt.days
to resolve this temporarily
Whilst testing the Simulation saving/restoring (issue #86), I've noticed this problem.
Currently, logging is set up after the Simulation instance is created, i.e.:
sim = Simulation(start_date=...)
sim.set_seed(123)
sim.configure_logging(...)
However, this means the invocation of set_seed and configure_logging can change behaviour based on when they are called, e.g.: if disease modules are registered before sim.configure_logging(), logging from the __init__ of the disease objects is lost; if set_seed is called before methods are registered, disease modules don't have their RNG set properly.
Possible solutions:
sim = Simulation(start_date=...)
# sim.register() disease methods here
sim.configure_logging(...)
sim.set_seed(123)
This can still lead to non-reproducible behaviour because the seed is not recorded for simulation and disease method instances. If there is any randomness in their __init__, it can't be reproduced.
a) Add 'filename' and 'seed' options to the Simulation constructor, which set up the logging output file and the simulation's RNG right at the start.
b) Set the seed of each module's RNG in register. Although any randomness in __init__ will still be lost, and I can't think of a way around this. We need to enforce only having minimal setup in __init__ and any meat in read_parameters() (which is called immediately by Simulation on register).
Welcome comments!
It would be useful to have a summary of the number of deaths each year (or month) by cause of death in the demography module. Currently this is tracked separately in the HIV and TB modules, but summary outputs would be better.
There are two pages with set-up instructions: the README.rst file at the top-level directory, and the wiki page at https://github.com/UCL/TLOmodel/wiki/Installation. A user could end up following instructions which aren't necessarily the better ones for them, without realising the other instructions exist.
We need to add a link on both pages pointing to the other, with suitable text to put them in context, as they are for different audiences: the README instructions are sufficient for a software developer, but those on the wiki are for the epi modellers, who are not software developers.
The modellers are using PyCharm.
This is a parent issue to track other issues related to framework and/or model performance.
To explore:
df.at calls in a given on_birth()
Notes:
In our current workflow there are multiple places where we have to document the same thing. This invites error and discrepancies, and it would be better to agree on a standard way of documenting things and to use tools to pull together summary tables etc. for the write-up.
What I propose is a pattern that is used in #107
The declarations of PARAMETERS and PROPERTIES (both the name and the description) are considered the single source of truth. The description of each parameter should be a full description, such as would make sense outside the context of the code itself.
The proposed value of each parameter is provided in the resourcefile. This can be a .csv file. However, it will often be useful to have this in .xlsx file in order that each parameter value can be associated with:
The word document write-up only provides:
This would entail an update to the checklist [https://github.com/UCL/TLOmodel/wiki/Checklist-For-Developing-A-Disease-Module]
This then requires a helper function that creates tables that can be pasted into word documents:
Create a guidelines document that stores record of agreed best practices in the design of modules.
Tara's HIV model outputs summary statistics. The parameters of the module can be optimised by, for example, minimising the sum of squared errors of those statistics with respect to historical data.
Rather than having to implement the model in two places, explore wrapping the module in a function that can be passed to an optimiser.
(This might be related to: #98 (comment))
The order in which modules are registered with a simulation object matters;
Similarly, there is now a standard set of modules that are required for the most basic simulation, i.e. Demography, Contraception, Enhanced_Lifestyle, SymptomManager (and Labour, ... soon)
And, if the healthsystem is being used:
HealthSystem, HealthSeekingBehaviour and DxManager.
The list of essential modules is changing, and bugs are developing as analysis scripts fail to keep track. Errors that arise from a module not being registered are hard to track down because it might not be obvious what is missing (i.e. that births are not happening or that health seeking is not occurring).
It would be good if we could find a way to:
let the simulation module re-order the modules that are registered with it (before doing anything else), according to rules we set. This would allow an explicit logic to be relied upon without demanding any consistency in the 'analysis scripts' etc.
let the simulation issue a warning if one of those recognised 'fundamental' modules is not registered with it before the simulation is run.
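Both suggestions can be sketched together. The function, rule lists, and module names below are illustrative assumptions, not the framework's actual API:

```python
import warnings

def order_and_check_modules(registered, required, canonical_order):
    """Re-order registered module names per framework-owned rules and warn
    when a fundamental module is missing."""
    missing = [name for name in required if name not in registered]
    if missing:
        warnings.warn(f"fundamental modules not registered: {missing}")
    # Sort by canonical position; modules not in the canonical list keep their
    # relative order at the end (sorted() is stable).
    return sorted(registered,
                  key=lambda m: canonical_order.index(m)
                  if m in canonical_order else len(canonical_order))

ORDER = ['Demography', 'Contraception', 'Enhanced_Lifestyle',
         'SymptomManager', 'HealthSystem']
result = order_and_check_modules(['HealthSystem', 'Demography'],
                                 required=['Demography', 'Contraception'],
                                 canonical_order=ORDER)
print(result)  # ['Demography', 'HealthSystem'] (plus a warning about Contraception)
```

The simulation would apply this once, before running anything, so analysis scripts need not register modules in any particular order.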