karenina's People

Contributors

hanullee, hoikincheng, slpeoples, zaneveld

karenina's Issues

Modeling sequencing error

All sequencing intrinsically misestimates microbial community composition to a greater or lesser extent. In PCoA plots, this translates into uncertainty in the relative position of each point - a cloud of possibility around the inferred location in PCoA space that likely contains the real composition. This may complicate attempts to estimate sigma, because if sequencing error is not explicitly modeled, it may be incorporated into the estimated community volatility.

With enough data, Brownian motion and random sequencing error can be told apart, at least in principle, because random sequencing errors do not carry over from one timepoint to the next - unlike random community change under Brownian motion, which accumulates the random changes of previous timesteps. A small simulation makes this concrete (see below).
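
A minimal simulation sketch of this distinction, assuming only numpy: Brownian motion accumulates past draws, while measurement noise is independent at each timepoint, so the lag-1 autocorrelation of successive observed displacements differs between the two.

```python
import numpy as np

rng = np.random.default_rng(42)
n_timepoints = 500

# Brownian motion: position is the cumulative sum of random steps.
bm = np.cumsum(rng.normal(0, 0.1, n_timepoints))

# Pure sequencing noise: each observation is an independent draw
# around a fixed true position.
noise = rng.normal(0, 0.1, n_timepoints)

# Displacements between successive observations.
d_bm = np.diff(bm)
d_noise = np.diff(noise)

# Lag-1 autocorrelation: near 0 for Brownian steps, near -0.5 for pure
# noise, because consecutive noise-driven displacements share (and
# cancel) the middle observation's error.
print(np.corrcoef(d_bm[:-1], d_bm[1:])[0, 1])
print(np.corrcoef(d_noise[:-1], d_noise[1:])[0, 1])
```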

Currently, karenina does not account for sequencing error explicitly. I propose that we add an error parameter e to our model to account for this.

Potential use-cases:
-- Using AICc values, test whether a model with Brownian motion or OU dynamics plus noise is justified above and beyond a noise-only model (in which the previous position x doesn't matter, and only momentary apparent displacements from theta are produced at each timestep).

-- Since the Gaussian noise scaled by e and the Brownian motion process are on the same scale, the parameter estimates should tell us how large estimated biological microbiome stochasticity is relative to sequencing error.

-- In many sampling designs, users may conduct replicate sequencing of the same biological sample. In such cases, we would ideally specify these replicate samples based on a column of the mapping file (or based on the individual and timepoint columns the user already specified). If there were no sequencing/lab error, these replicates would be identical. Therefore, the variance between replicate sequence libraries from the same biological sample can be used directly to estimate our sequencing error term, and that estimate can then be used to fit the rest of the model with a small modification of existing code (a sketch follows this list).
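
A minimal sketch of estimating e from replicates, assuming a pandas mapping file; the column names and the function itself are hypothetical, not existing karenina API.

```python
import numpy as np
import pandas as pd

def estimate_sequencing_error(mapping: pd.DataFrame,
                              pc_column: str = "PC1",
                              replicate_column: str = "replicate_group") -> float:
    """Estimate the noise term e from replicate libraries (sketch).

    Replicates of the same biological sample should be identical on a
    PC axis up to sequencing/lab error, so the pooled within-group
    standard deviation is a direct estimate of e.
    """
    within_group_var = (
        mapping.groupby(replicate_column)[pc_column]
        .var(ddof=1)
        .dropna()  # groups with a single library carry no information
    )
    return float(np.sqrt(within_group_var.mean()))
```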

Considerations:
-- As we explore more varied models, we may need another layer of abstraction to avoid having messy code in multiple places for handling them. This may require some deep and careful thinking.

-- Adding noise is trivial in the Process object (we just need a parameter saying how much, and then we add a draw from a 0-centered normal distribution to the position on the PC axis; see the sketch below). I think this will be slightly trickier in fit_timeseries as currently implemented, since we now need to consider the probability of an observed displacement accounting for all combinations of two normal draws that could produce it (instead of having only one random draw from a normal distribution to account for).
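
A minimal sketch of the simulation side; names here are illustrative, not karenina's actual Process API.

```python
import numpy as np

rng = np.random.default_rng()

def observed_position(true_position: float, e: float) -> float:
    """Overlay sequencing noise on a latent position (sketch).

    The latent position evolves by OU/Brownian dynamics as before; the
    observed position at each timepoint adds an independent draw from a
    0-centered normal with standard deviation e.
    """
    return true_position + rng.normal(0.0, e)
```

On the inference side, note that a sum of independent normals is itself normal with the variances adding, which simplifies the per-step observation density; however, consecutive observed displacements then share the middle observation's noise, so they are no longer independent.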

test_experiment.py

Complete tests for experiment.py

test_check_variable_specified_per_treatment

test_check_n_timepoints_is_int

test_simulate_timesteps

test_simulate_timestep

test_write_to_movie_file

test_perturbation.py

Complete tests for perturbation.py.

test_apply_perturbation

test_remove_perturbation

test_remove_perturbation_from_axis

test_apply_perturbation_to_axis

test_check_identity

test_simulate_movement

test_get_data

test_visualization.py

Complete tests for visualization.py

test_get_timeseries_data

test_save_simulation_figure

test_save_simulation_movie

test_update_3d_plot

Refactor fit_timeseries_benchmark.py

I am doing a quick refactoring pass on fit_timeseries_benchmark.py

Intended changes:
--move to benchmark.py
--separate logfile management from core benchmarking code
--change vis to a more verbose function name
--remove unneeded code simulating/fitting normal distributions (this was from when benchmark was a test script)

Fix unique identifier

-- Use [UniqueID]_[Timepoint] as the sample identifier (sketched below).
-- Generate individual_ID when the Individual object is created.
-- The Experiment object governs IDs.
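
A minimal sketch of the proposed scheme; the class and method names are hypothetical placeholders, not karenina's actual objects.

```python
import itertools

class Experiment:
    """Governs ID assignment so individual IDs are unique per experiment."""

    def __init__(self):
        self._id_counter = itertools.count()

    def new_individual_id(self) -> str:
        # Assigned when the Individual object is generated.
        return f"individual_{next(self._id_counter)}"

def sample_id(individual_id: str, timepoint: int) -> str:
    # [UniqueID]_[Timepoint]
    return f"{individual_id}_{timepoint}"
```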

Display Help Text When Run Without Parameters

When karenina is run without parameters, it displays an obscure error rather than displaying help text.

Command:
python spatial_ornstein_uhlenbeck.py

Output:
{'output': None, 'pert_file_path': '/karenina/data/perturbations/set_xyz_lambda_zero.tsv', 'treatment_names': 'control,destabilizing_treatment', 'n_individuals': '35,35', 'n_timepoints': 10, 'perturbation_timepoint': 5, 'perturbation_duration': 100, 'interindividual_variation': 0.01, 'delta': 0.25, 'L': 0.2, 'fixed_start_pos': None}
Traceback (most recent call last):
  File "spatial_ornstein_uhlenbeck.py", line 271, in <module>
    main()
  File "spatial_ornstein_uhlenbeck.py", line 230, in main
    write_options_to_log("log.txt", opts)
  File "spatial_ornstein_uhlenbeck.py", line 116, in write_options_to_log
    logfile = open(join(opts.output, log),"w+")
  File "/miniconda3/lib/python3.6/posixpath.py", line 78, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
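
A minimal sketch of one possible fix, assuming an argparse-style parser (karenina's actual option handling may differ): validate required options up front and print the help text instead of crashing in write_options_to_log.

```python
import argparse
import sys

def main():
    parser = argparse.ArgumentParser(
        description="Simulate microbiome change with a spatial OU process.")
    parser.add_argument("-o", "--output", help="output directory")
    # ... remaining options elided ...
    opts = parser.parse_args()

    # Fail fast with help text instead of the obscure TypeError raised
    # by os.path.join(None, ...) deep inside write_options_to_log.
    if opts.output is None:
        parser.print_help(sys.stderr)
        sys.exit(1)
```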

Fix Cohort size-checks

Currently, cohort support only checks the size of the input X as a whole. X is separated by timepoint into cohorts, like so: [[1,2],[1,2,3],[1]], which allows the optimization to avoid overlapping timepoints. The check on the length of x in make_ou_objective_fn, which throws a ValueError if len(x) <= 1, is never triggered for single-point observations within the nLogLik loop. Checks for single-point observations need to be added within the loop, which should either ignore the observation or throw a ValueError (see the sketch below).
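
A minimal sketch of the in-loop guard; only the name nLogLik and the len(x) <= 1 condition come from the issue, and the loop structure and helper are assumed.

```python
def nLogLik(cohorts, params):
    """Sum negative log-likelihoods across cohorts (sketch).

    Guards each cohort individually, since the len(x) <= 1 check in
    make_ou_objective_fn only sees the input as a whole.
    """
    total = 0.0
    for x in cohorts:
        if len(x) <= 1:
            # Single-point observations carry no displacement
            # information; either skip them (as here) or raise:
            # raise ValueError("cohort has only one observation")
            continue
        total += cohort_neg_log_likelihood(x, params)  # hypothetical helper
    return total
```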

Moving Pictures is a good example of this: the antibiotic treatment is provided at timepoint zero, and all other timepoints have no reported antibiotic usage. In this instance, the values passed in for the treatment are of the form [[pc_value],[pc_value],...], with timepoints [[0],[0],...]. These single-point observations are not supported, so Moving Pictures (and similar experiments) will not run properly at this time.

Further testing is necessary on the 2485 obesity data to determine the functionality/universality of the cohort separation/optimization.

Permutation test for significance of between treatment parameter differences

Typically, we should expect that every treatment in a time-series study will show some differences in theta, lambda, and sigma. Users will want to know which differences are attributable to chance and which should be investigated further. I suggest we provide this capability with an individual-wise permutation test.

Here's the idea. After reading in user data and separating out time-series according to individuals, we infer the model once using the real associations between individuals and treatments. We record the difference in parameter values between each pair of treatments (i.e. the observed effect size of the treatment). Then, for some number of user-specified permutations (e.g. --n_permutations 100), we first shuffle which individuals are in each treatment, then repeat the inference. This gives us a null distribution of how large a difference in parameter values we should expect between treatments just due to chance. Our permutational p-value is the number of individual-wise permutations in which a difference in parameter values equal to or larger than the real difference was observed, divided by the number of permutations (see the sketch below).
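
A minimal sketch of the permutation loop, assuming exactly two treatments; fit_fn is a hypothetical stand-in for karenina's model-fitting step.

```python
import numpy as np

def permutation_p_value(individuals, treatments, fit_fn, parameter,
                        n_permutations=100, seed=0):
    """Individual-wise permutation test (sketch).

    fit_fn(individuals, treatments) -> {treatment: {parameter: value}}
    refits the model under a given individual-to-treatment assignment.
    """
    rng = np.random.default_rng(seed)
    fitted = fit_fn(individuals, treatments)
    t1, t2 = sorted(set(treatments))  # assumes two treatments
    observed = abs(fitted[t1][parameter] - fitted[t2][parameter])

    null = []
    for _ in range(n_permutations):
        # Shuffle which individuals belong to each treatment, keeping
        # each individual's time series intact.
        shuffled = rng.permutation(list(treatments))
        refit = fit_fn(individuals, shuffled)
        null.append(abs(refit[t1][parameter] - refit[t2][parameter]))

    # Fraction of permutations with an effect at least as large as the
    # real one.
    return float(np.mean(np.asarray(null) >= observed))
```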

We then report a table with columns: treatment1, treatment2, parameter, average parameter value (treatment 1), average parameter value (treatment 2), n_individuals (treatment 1), n_individuals (treatment 2), effect size (real data), average effect size (permutational null model), permutational p-value, and Bonferroni-corrected p-value (the p-value multiplied by the number of treatment-treatment comparisons).

Users could therefore run the script and reach conclusions like: lambda was significantly different between IBD and non-IBD patients (p = 0.01).

Critically, I think the permutations must be done at the level of the assignment of individuals to treatments (not at the level of timepoints within individuals, or by simply scrambling all samples). If there is inter-individual variation or there are non-treatment temporal effects, then just permuting sample labels won't make sense and will, I think, give incorrect p-values. For example, in coral data, if there is a heat wave at timepoint 10 in the real data and we permuted all samples, real differences between treatments might not show up as significant, because scrambling timepoint 10 across many timepoints inflates the differences between treatments in the null model.

test_spatial_ornstein_uhlenbeck.py

Complete tests for spatial_ornstein_uhlenbeck.py

test_check_perturbation_timepoint

test_write_options_to_log

test_parse_perturbation_file

benchmark_model_fit.py

Generate benchmarking file for model fitting.

Move test_fit_times_series_recovers_OU_params to its own benchmarking file for fit_timeseries

Modify Moving Pictures Treatment (In Metadata)

Currently treatments are labeled by timepoint, but they should be labeled by individual, so that we can test the cohorts without single-point observations. This will make the cohort testing much more applicable. If this proves a good fix, we could require input data to use individual/subject-based treatments and declare timepoint-based treatment labeling unsupported.

Read/write qza outputs natively by simplifying interface between simulate_ornstein_uhlenbeck and visualizer.py

simulate_ornstein_uhlenbeck.py should output a QIIME2 qza file with OrdinationResult and Metadata:

-- 1. Separate the distance matrix output capabilities out of the simulate_ornstein_uhlenbeck.py code. This will become a new script, distance_matrix_from_ordination.py.

-- 2. Rename simulate_ornstein_uhlenbeck.py to simulate_ordination.py.

-- 3. visualizer.py should have a user option to specify a 'color' column in the metadata file. It will just read in ordination results + metadata, and will be edited to not use any Process, Individual or Experiment objects. Any info that used to derive from these will be obtained from user options/columns in the metadata file.

-- 4. simulate_ordination will just output an ordination result and a metadata file (see the sketch below).
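
A minimal sketch of the qza output step, assuming QIIME 2 and scikit-bio are installed; the function, its arguments, and the index naming are assumptions, while the 'PCoAResults' semantic type and the Artifact/Metadata calls are standard QIIME 2 API.

```python
import pandas as pd
from skbio import OrdinationResults
from qiime2 import Artifact, Metadata

def save_simulation(ordination: OrdinationResults,
                    metadata: pd.DataFrame,
                    output_prefix: str) -> None:
    """Write an ordination .qza plus a metadata .tsv (sketch)."""
    # Wrap the skbio OrdinationResults in a QIIME 2 artifact; the
    # PCoAResults semantic type accepts an OrdinationResults view.
    Artifact.import_data("PCoAResults", ordination).save(
        output_prefix + "_ordination.qza")

    # QIIME 2 metadata needs a recognized ID header; 'id' is one of
    # the accepted names (assumes string sample IDs in the index).
    metadata.index.name = "id"
    Metadata(metadata).save(output_prefix + "_metadata.tsv")
```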

Add an interpolation option to visualizer.py

To support PCoA files with few data points, we can add two things:

  1. A fixed duration for the movie as a user option. Then the dt between frames is just movie_length/n_timepoints.

  2. A second user option specifying whether linear interpolation should be done, and the minimum spacing between timepoints (min_dt); see the sketch below.
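
A minimal sketch of the interpolation itself, using numpy; min_dt is the proposed user option, and the function name is hypothetical.

```python
import numpy as np

def interpolate_timeseries(timepoints, positions, min_dt):
    """Linearly interpolate sparse PCoA coordinates (sketch).

    timepoints: 1-D array of observed times, sorted ascending.
    positions:  1-D array of coordinates on one PC axis.
    min_dt:     minimum spacing between interpolated frames.
    """
    dense_t = np.arange(timepoints[0], timepoints[-1] + min_dt, min_dt)
    dense_x = np.interp(dense_t, timepoints, positions)
    return dense_t, dense_x
```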
