karenina's People

Contributors

hanullee, hoikincheng, slpeoples, zaneveld

karenina's Issues

Modeling sequencing error

All sequencing intrinsically misestimates microbial community composition to a greater or lesser extent. In PCoA plots, this translates into uncertainty in the relative position of each point - a cloud of possibility around the inferred location in PCoA space that likely contains the real composition. This may complicate attempts to estimate sigma, because if sequencing error is not explicitly modeled, it may be incorporated into the estimated community volatility.

With enough data, Brownian motion and random sequencing error can be told apart, at least in principle, because random sequencing errors do not carry over from one timepoint to the next - unlike random community change under Brownian motion, which accumulates the random changes of previous timesteps. A small simulation makes this concrete (see below).
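
A minimal simulation sketch of this distinction, assuming only numpy: Brownian motion accumulates past draws, while measurement noise is independent at each timepoint, so the lag-1 autocorrelation of successive observed displacements differs between the two.

```python
import numpy as np

rng = np.random.default_rng(42)
n_timepoints = 500

# Brownian motion: position is the cumulative sum of random steps.
bm = np.cumsum(rng.normal(0, 0.1, n_timepoints))

# Pure sequencing noise: each observation is an independent draw
# around a fixed true position.
noise = rng.normal(0, 0.1, n_timepoints)

# Displacements between successive observations.
d_bm = np.diff(bm)
d_noise = np.diff(noise)

# Lag-1 autocorrelation: near 0 for Brownian steps, near -0.5 for pure
# noise, because consecutive noise-driven displacements share (and
# cancel) the middle observation's error.
print(np.corrcoef(d_bm[:-1], d_bm[1:])[0, 1])
print(np.corrcoef(d_noise[:-1], d_noise[1:])[0, 1])
```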

Currently, karenina does not account for sequencing error explicitly. I propose that we add an error parameter e to our model to account for this.

Potential use-cases:
-- Using AICc values, test whether a model with Brownian motion or OU dynamics plus noise is justified above and beyond a noise-only model (in which the previous position x doesn't matter, and only momentary apparent displacements from theta are produced at each timestep).

-- Since the Gaussian noise scaled by e and the Brownian motion process are on the same scale, the parameter estimates should tell us how large estimated biological microbiome stochasticity is relative to sequencing error.

-- In many sampling designs, users may conduct replicate sequencing of the same biological sample. In such cases, we would ideally specify these replicate samples based on a column of the mapping file (or based on the individual and timepoint columns the user already specified). If there were no sequencing/lab error, these replicates would be identical. Therefore, the variance between replicate sequence libraries from the same biological sample can be used directly to estimate our sequencing error term, and that estimate can then be used to fit the rest of the model with a small modification of existing code (a sketch follows this list).
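
A minimal sketch of estimating e from replicates, assuming a pandas mapping file; the column names and the function itself are hypothetical, not existing karenina API.

```python
import numpy as np
import pandas as pd

def estimate_sequencing_error(mapping: pd.DataFrame,
                              pc_column: str = "PC1",
                              replicate_column: str = "replicate_group") -> float:
    """Estimate the noise term e from replicate libraries (sketch).

    Replicates of the same biological sample should be identical on a
    PC axis up to sequencing/lab error, so the pooled within-group
    standard deviation is a direct estimate of e.
    """
    within_group_var = (
        mapping.groupby(replicate_column)[pc_column]
        .var(ddof=1)
        .dropna()  # groups with a single library carry no information
    )
    return float(np.sqrt(within_group_var.mean()))
```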

Considerations:
-- As we explore more varied models, we may need another layer of abstraction to avoid having messy code in multiple places for handling them. This may require some deep and careful thinking.

-- Adding noise is trivial in the Process object (we just need a parameter saying how much, and then we add a draw from a 0-centered normal distribution to the position on the PC axis; see the sketch below). I think this will be slightly trickier in fit_timeseries as currently implemented, since we now need to consider the probability of an observed displacement accounting for all combinations of two normal draws that could produce it (instead of having only one random draw from a normal distribution to account for).
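
A minimal sketch of the simulation side; names here are illustrative, not karenina's actual Process API.

```python
import numpy as np

rng = np.random.default_rng()

def observed_position(true_position: float, e: float) -> float:
    """Overlay sequencing noise on a latent position (sketch).

    The latent position evolves by OU/Brownian dynamics as before; the
    observed position at each timepoint adds an independent draw from a
    0-centered normal with standard deviation e.
    """
    return true_position + rng.normal(0.0, e)
```

On the inference side, note that a sum of independent normals is itself normal with the variances adding, which simplifies the per-step observation density; however, consecutive observed displacements then share the middle observation's noise, so they are no longer independent.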

test_experiment.py

Complete tests for experiment.py

test_check_variable_specified_per_treatment

test_check_n_timepoints_is_int

test_simulate_timesteps

test_simulate_timestep

test_write_to_movie_file

test_perturbation.py

Complete tests for perturbation.py.

test_apply_perturbation

test_remove_perturbation

test_remove_perturbation_from_axis

test_apply_perturbation_to_axis

test_check_identity

test_simulate_movement

test_get_data

test_visualization.py

Complete tests for visualization.py

test_get_timeseries_data

test_save_simulation_figure

test_save_simulation_movie

test_update_3d_plot

Refactor fit_timeseries_benchmark.py

I am doing a quick refactoring pass on fit_timeseries_benchmark.py

Intended changes:
--move to benchmark.py
--separate logfile management from core benchmarking code
--change vis to a more verbose function name
--remove unneeded code simulating/fitting normal distributions (this was from when benchmark was a test script)

Fix unique identifier

-- Use [UniqueID]_[Timepoint] as the sample identifier (sketched below).
-- Generate individual_ID when the Individual object is created.
-- The Experiment object governs IDs.
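
A minimal sketch of the proposed scheme; the class and method names are hypothetical placeholders, not karenina's actual objects.

```python
import itertools

class Experiment:
    """Governs ID assignment so individual IDs are unique per experiment."""

    def __init__(self):
        self._id_counter = itertools.count()

    def new_individual_id(self) -> str:
        # Assigned when the Individual object is generated.
        return f"individual_{next(self._id_counter)}"

def sample_id(individual_id: str, timepoint: int) -> str:
    # [UniqueID]_[Timepoint]
    return f"{individual_id}_{timepoint}"
```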

Display Help Text When Run Without Parameters

When karenina is run without parameters, it displays an obscure error rather than displaying help text.

Command:
python spatial_ornstein_uhlenbeck.py

Output:
{'output': None, 'pert_file_path': '/karenina/data/perturbations/set_xyz_lambda_zero.tsv', 'treatment_names': 'control,destabilizing_treatment', 'n_individuals': '35,35', 'n_timepoints': 10, 'perturbation_timepoint': 5, 'perturbation_duration': 100, 'interindividual_variation': 0.01, 'delta': 0.25, 'L': 0.2, 'fixed_start_pos': None}
Traceback (most recent call last):
  File "spatial_ornstein_uhlenbeck.py", line 271, in <module>
    main()
  File "spatial_ornstein_uhlenbeck.py", line 230, in main
    write_options_to_log("log.txt", opts)
  File "spatial_ornstein_uhlenbeck.py", line 116, in write_options_to_log
    logfile = open(join(opts.output, log),"w+")
  File "/miniconda3/lib/python3.6/posixpath.py", line 78, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
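
A minimal sketch of one possible fix, assuming an argparse-style parser (karenina's actual option handling may differ): validate required options up front and print the help text instead of crashing in write_options_to_log.

```python
import argparse
import sys

def main():
    parser = argparse.ArgumentParser(
        description="Simulate microbiome change with a spatial OU process.")
    parser.add_argument("-o", "--output", help="output directory")
    # ... remaining options elided ...
    opts = parser.parse_args()

    # Fail fast with help text instead of the obscure TypeError raised
    # by os.path.join(None, ...) deep inside write_options_to_log.
    if opts.output is None:
        parser.print_help(sys.stderr)
        sys.exit(1)
```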

Fix Cohort size-checks

Currently, cohort support only checks the size of the input X as a whole. X is separated by timepoint into cohorts, like so: [[1,2],[1,2,3],[1]], which allows the optimization to avoid overlapping timepoints. The check on the length of x in make_ou_objective_fn, which throws a ValueError if len(x) <= 1, is never triggered for single-point observations within the nLogLik loop. Checks for single-point observations need to be added within the loop, which should either ignore the observation or throw a ValueError (see the sketch below).
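
A minimal sketch of the in-loop guard; only the name nLogLik and the len(x) <= 1 condition come from the issue, and the loop structure and helper are assumed.

```python
def nLogLik(cohorts, params):
    """Sum negative log-likelihoods across cohorts (sketch).

    Guards each cohort individually, since the len(x) <= 1 check in
    make_ou_objective_fn only sees the input as a whole.
    """
    total = 0.0
    for x in cohorts:
        if len(x) <= 1:
            # Single-point observations carry no displacement
            # information; either skip them (as here) or raise:
            # raise ValueError("cohort has only one observation")
            continue
        total += cohort_neg_log_likelihood(x, params)  # hypothetical helper
    return total
```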

Moving Pictures is a good example of this: the antibiotic treatment is provided at timepoint zero, and all other timepoints have no reported antibiotic usage. In this instance, the values passed in for the treatment are of the form [[pc_value],[pc_value],...], with timepoints [[0],[0],...]. These single-point observations are not supported, so Moving Pictures (and similar experiments) will not run properly at this time.

Further testing is necessary on the 2485 obesity data to determine the functionality/universality of the cohort separation/optimization.

Permutation test for significance of between treatment parameter differences

Typically, we should expect that every treatment in a time-series study will show some differences in theta, lambda, and sigma. Users will want to know which differences are attributable to chance and which should be investigated further. I suggest we provide this capability with an individual-wise permutation test.

Here's the idea. After reading in user data and separating out time-series according to individuals, we infer the model once using the real associations between individuals and treatments. We record the difference in parameter values between each pair of treatments (i.e. the observed effect size of the treatment). Then, for some number of user-specified permutations (e.g. --n_permutations 100), we first shuffle which individuals are in each treatment, then repeat the inference. This gives us a null distribution of how large a difference in parameter values we should expect between treatments just due to chance. Our permutational p-value is the number of individual-wise permutations in which a difference in parameter values equal to or larger than the real difference was observed, divided by the number of permutations (see the sketch below).
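
A minimal sketch of the permutation loop, assuming exactly two treatments; fit_fn is a hypothetical stand-in for karenina's model-fitting step.

```python
import numpy as np

def permutation_p_value(individuals, treatments, fit_fn, parameter,
                        n_permutations=100, seed=0):
    """Individual-wise permutation test (sketch).

    fit_fn(individuals, treatments) -> {treatment: {parameter: value}}
    refits the model under a given individual-to-treatment assignment.
    """
    rng = np.random.default_rng(seed)
    fitted = fit_fn(individuals, treatments)
    t1, t2 = sorted(set(treatments))  # assumes two treatments
    observed = abs(fitted[t1][parameter] - fitted[t2][parameter])

    null = []
    for _ in range(n_permutations):
        # Shuffle which individuals belong to each treatment, keeping
        # each individual's time series intact.
        shuffled = rng.permutation(list(treatments))
        refit = fit_fn(individuals, shuffled)
        null.append(abs(refit[t1][parameter] - refit[t2][parameter]))

    # Fraction of permutations with an effect at least as large as the
    # real one.
    return float(np.mean(np.asarray(null) >= observed))
```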

We then report a table with columns: treatment1, treatment2, parameter, average parameter value (treatment 1), average parameter value (treatment 2), n_individuals (treatment 1), n_individuals (treatment 2), effect size (real data), average effect size (permutational null model), permutational p-value, and Bonferroni-corrected p-value (the p-value multiplied by the number of treatment-treatment comparisons).

Users could therefore run the script and reach conclusions like: lambda was significantly different between IBD and non-IBD patients (p = 0.01).

Critically, I think the permutations must be done at the level of the assignment of individuals to treatments (not at the level of timepoints within individuals, or by simply scrambling all samples). If there is inter-individual variation or there are non-treatment temporal effects, then just permuting sample labels won't make sense and will, I think, give incorrect p-values. For example, in coral data, if there is a heat wave at timepoint 10 in the real data and we permuted all samples, real differences between treatments might not show up as significant, because scrambling timepoint 10 across many timepoints inflates the differences between treatments in the null model.

test_spatial_ornstein_uhlenbeck.py

Complete tests for spatial_ornstein_uhlenbeck.py

test_check_perturbation_timepoint

test_write_options_to_log

test_parse_perturbation_file

benchmark_model_fit.py

Generate benchmarking file for model fitting.

Move test_fit_times_series_recovers_OU_params to its own benchmarking file for fit_timeseries

Modify Moving Pictures Treatment (In Metadata)

Currently treatments are labeled by timepoint, but they should be labeled by individual, so that we can test the cohorts without single-point observations. This will make the cohort testing much more applicable. If this proves a good fix, we could require input data to use individual/subject-based treatments and declare timepoint-based treatment labeling unsupported.

Read/write qza outputs natively by simplifying interface between simulate_ornstein_uhlenbeck and visualizer.py

simulate_ornstein_uhlenbeck.py should output a QIIME2 qza file with OrdinationResult and Metadata:

-- 1. Separate the distance matrix output capabilities out of the simulate_ornstein_uhlenbeck.py code. This will become a new script, distance_matrix_from_ordination.py.

-- 2. Rename simulate_ornstein_uhlenbeck.py to simulate_ordination.py.

-- 3. visualizer.py should have a user option to specify a 'color' column in the metadata file. It will just read in ordination results + metadata, and will be edited to not use any Process, Individual or Experiment objects. Any info that used to derive from these will be obtained from user options/columns in the metadata file.

-- 4. simulate_ordination will just output an ordination result and a metadata file (see the sketch below).
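
A minimal sketch of the qza output step, assuming QIIME 2 and scikit-bio are installed; the function, its arguments, and the index naming are assumptions, while the 'PCoAResults' semantic type and the Artifact/Metadata calls are standard QIIME 2 API.

```python
import pandas as pd
from skbio import OrdinationResults
from qiime2 import Artifact, Metadata

def save_simulation(ordination: OrdinationResults,
                    metadata: pd.DataFrame,
                    output_prefix: str) -> None:
    """Write an ordination .qza plus a metadata .tsv (sketch)."""
    # Wrap the skbio OrdinationResults in a QIIME 2 artifact; the
    # PCoAResults semantic type accepts an OrdinationResults view.
    Artifact.import_data("PCoAResults", ordination).save(
        output_prefix + "_ordination.qza")

    # QIIME 2 metadata needs a recognized ID header; 'id' is one of
    # the accepted names (assumes string sample IDs in the index).
    metadata.index.name = "id"
    Metadata(metadata).save(output_prefix + "_metadata.tsv")
```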

Add an interpolation option to visualizer.py

To support PCoA files with few data points, we can add two things:

  1. A fixed duration for the movie as a user option. Then the dt between frames is just movie_length/n_timepoints.

  2. A second user option specifying whether linear interpolation should be done, and the minimum spacing between timepoints (min_dt); see the sketch below.
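
A minimal sketch of the interpolation itself, using numpy; min_dt is the proposed user option, and the function name is hypothetical.

```python
import numpy as np

def interpolate_timeseries(timepoints, positions, min_dt):
    """Linearly interpolate sparse PCoA coordinates (sketch).

    timepoints: 1-D array of observed times, sorted ascending.
    positions:  1-D array of coordinates on one PC axis.
    min_dt:     minimum spacing between interpolated frames.
    """
    dense_t = np.arange(timepoints[0], timepoints[-1] + min_dt, min_dt)
    dense_x = np.interp(dense_t, timepoints, positions)
    return dense_t, dense_x
```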
