Comments (19)
Hi Thomas,
There is a first implementation in the sandbox that should work with the txt trace. You certainly don't have to use this as a starting point, but it might give you some ideas.
David
from pymc.
Cool, but I don't see a sandbox in the trunk.
pymc/pymc/sandbox/parallel.py
oops, I thought the sandbox was removed.
That looks interesting, any reason you used IPython.kernel instead of multiprocessing directly? It seems you could get away without passing individual arguments as strings to the clients.
I started this years ago, before multiprocessing got into the standard lib...
There was a discussion regarding multiprocessing on the GoogleCode site that provides additional context.
OK -- it is more difficult than I initially thought with the multiprocessing package. The problem is that to send the model (i.e. the first argument to MCMC()) to the other processes, it would have to be picklable. Since a model consists of objects with methods, this is probably not possible.
The other approach, as taken by David Huard, seems limited to cases where the model description is a runnable Python file (as all the examples are). This seems to be what most people are doing, so it's probably fine, but it wouldn't work in general (e.g. not in my case).
I'm not sure if there is any way to make a model serializable (i.e. picklable).
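To make the pickling limitation concrete, here is a minimal, self-contained illustration (not PyMC code): functions defined inside other functions -- which is roughly what building a model produces, objects closing over local state -- cannot be serialized by the standard pickle module.

```python
import pickle

def make_counter():
    count = 0
    def step():
        # a local function closing over state, loosely analogous to a
        # model's stochastic objects closing over their logp functions
        nonlocal count
        count += 1
        return count
    return step

step = make_counter()
try:
    pickle.dumps(step)  # local functions are not picklable
    picklable = True
except Exception:
    picklable = False

print("step is picklable:", picklable)  # prints: step is picklable: False
```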
I think I came up with a way to use multiprocessing for multiple chains.
As described above, one can't just pass a model to an object (e.g. ParallelMCMC(model, ...)) because each process must have access to the model (or a copy thereof). For that, the model would have to be serializable, which I think is impossible.
However, if instead one defines a function that creates the model when called and passes that, each process could create its own model by calling that function.
E.g.:
    def create_model():
        # definition of distributions
        [...]
        return model

    m = ParallelMCMC(create_model, chains=10, num_procs=4)
    m.sample()  # launches 4 Python processes that sample from 10 chains
Feedback?
Hi Thomas,
I don't have any experience with multiprocessing, since I usually work with the parallel tools in IPython. That being said, I think the idea has potential. Have you thought about how you are going to handle the database backends?
Hi David,
I'm not too familiar with the IPython parallel tools. What is nice about multiprocessing is that it is in the Python standard library (although IPython would not be an unreasonable dependency).
No, the database is still an open question. From a design point of view it might seem best to use one database that all processes write to, but I'm afraid this will lead to many problems and also reduce performance. Instead, I think each chain should write to a different db output (and then, all backends could be used).
I hacked something similar before which worked OK. The problem is that after sampling, all chains have to be read in, a new model created, and the chains linked to it. This can be done but what I don't like is that (i) one couldn't easily continue to sample on those chains and (ii) it diverges a lot from a possible implementation of multiple chains without multiprocessing capabilities. Regarding (ii), I do not think that there is a way to make this work to begin with.
An added bonus of this approach is that it would not be hard to extend it to MPI and make it possible to run multiple chains on a cluster.
I understand that there is already some support in PyMC for multiple traces per distribution; can someone enlighten me as to how well this is expected to work?
Thomas,
Whatever parallel tools are used, I'm guessing the main issues are the same, so it doesn't really matter which one we pick. Since multiprocessing is in the stdlib, we might as well go with it.
I'm not sure I understand your question, however. I'm guessing you already know this, but when you call sample multiple times, it creates individual traces in the backend. You can merge those for the analysis, but they're kept separate under the hood. When I wrote the backends, it was with parallelism in mind, meaning that I tried to design them so that each trace could be written by a different process (in theory; in practice it's trickier). For example, the txt backend supports parallelism. Similarly, it should not be hard to create a parallel pickle db. For hdf5, it should be possible to write to the same database from different processes, but it requires compilation with parallel flags and I never had the time and motivation to try it out.
Another solution is to improve on your hack. Instead of waiting until the end to link the individual chains to the master db, you do it during sampling. The "master" launches independent samplers, but modifies them so that when they call "tally", it launches the master ParallelSampler "tally", which fetches the object's value (see Sampler._funs_to_tally). Using this, you have multiple processes running the computations, but only one tallying the data. If you look at the Database.tally method, it takes chain as an argument. By default it uses the last, but you could easily modify Sampler.tally to provide this information.
To get a better feeling for how the database works, look at base.py and ram.py in the database directory. Then look at the Sampler class to see how the database is initialized and how tally is called.
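A minimal sketch of this master-tallies design (illustrative only; slave and run are hypothetical names, not PyMC API): the worker processes advance their samplers in parallel, while the master fetches each current value over a pipe and does all the tallying itself.

```python
import multiprocessing

def slave(conn):
    # stand-in for a slave sampler: advances its state on "step" and
    # answers the master's value requests (cf. Sampler._funs_to_tally)
    value = 0
    while True:
        msg = conn.recv()
        if msg == "step":
            value += 1          # one iteration of the sampling loop
        elif msg == "get":
            conn.send(value)    # master fetches the current value
        elif msg == "stop":
            break

def run(chains=2, iters=3):
    conns, procs = [], []
    for _ in range(chains):
        parent, child = multiprocessing.Pipe()
        p = multiprocessing.Process(target=slave, args=(child,))
        p.start()
        conns.append(parent)
        procs.append(p)
    traces = {c: [] for c in range(chains)}
    for _ in range(iters):
        for conn in conns:
            conn.send("step")               # all chains advance in parallel
        for c, conn in enumerate(conns):
            conn.send("get")                # but only the master tallies
            traces[c].append(conn.recv())
    for conn in conns:
        conn.send("stop")
    for p in procs:
        p.join()
    return traces
```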
My prior plan was to just run multiple models in parallel, have them all write to separate DBs, and then collect them all. But there is the other approach you are referring to, namely that the master is the only one writing to the DB and the workers send the samples to it. I think I might like your approach better (I just hadn't thought about it before).
Maybe the cleanest way would then be to write a new db backend for the workers that is actually a connection to the master (via a multiprocessing pipe), which then itself saves the samples to the central DB (into multiple chains). The individual MCMC objects the workers sample from would not have to be changed, just given a new backend. What do you think?
Once we've come up with a plan of action, I'll start looking into the code.
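A minimal sketch of that idea (hypothetical names -- PipeBackend, worker, and sample_parallel are not PyMC API): each worker gets a backend whose tally() simply forwards samples over a multiprocessing pipe, and the master drains the pipes into one central store with a separate trace per chain.

```python
import multiprocessing

class PipeBackend:
    # hypothetical trace "backend": tally() forwards each sample to the
    # master over a pipe instead of writing it to a file
    def __init__(self, conn, chain_id):
        self.conn = conn
        self.chain_id = chain_id

    def tally(self, value):
        self.conn.send((self.chain_id, value))

    def close(self):
        self.conn.send((self.chain_id, None))  # sentinel: chain finished
        self.conn.close()

def worker(conn, chain_id, n):
    db = PipeBackend(conn, chain_id)
    for i in range(n):          # stand-in for the sampling loop
        db.tally(i)
    db.close()

def sample_parallel(chains=2, n=3):
    traces = {c: [] for c in range(chains)}
    conns, procs = [], []
    for c in range(chains):
        parent, child = multiprocessing.Pipe()
        p = multiprocessing.Process(target=worker, args=(child, c, n))
        p.start()
        conns.append(parent)
        procs.append(p)
    active = set(range(chains))
    while active:               # master collects samples into the central store
        for c in list(active):
            if conns[c].poll(0.01):
                chain_id, value = conns[c].recv()
                if value is None:
                    active.discard(c)
                else:
                    traces[chain_id].append(value)
    for p in procs:
        p.join()
    return traces
```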
I don't think you even need a new db backend, just a modification of a couple of Sampler methods. That is, all you need to do is connect the Trace objects in the master DB to their respective MCMC object in each slave Sampler. Normally, this is done in the _assign_database_backend method of Sampler. The critical bit is here:
self._funs_to_tally[object.__name__] = object.get_value
This associates an object name with a function returning its current value. Then, when Sampler.sample is called:
self.db._initialize(self._funs_to_tally, length)
For a parallel sampler, what we would have is rather
self._funs_to_tally[chain][object.__name__] = function that returns the object's value in slave process #chain.
and
self.db._initialize(self._funs_to_tally[chain], length)
OK, it's not that simple, but you get the idea.
I've got parallel sampling working for pymc3 in the 'parallel' branch, using the pymc.psample function. See examples/logistic.py for an example.
Unfortunately this prohibits the use of any nested functions, so it required some things to be rewritten in a less clean way.
+1 to parallel sampling. Will you be merging that any time soon?
It would be nice to have, but given how nicely pymc3 is shaping up, and that it already has parallel sampling, I'm not sure it makes sense to spend the resources on adding those capabilities to pymc2.
Ahh, OK, thanks - I didn't realize there was that level of difference. Perhaps this should be (re-)closed as "won't-fix" then?
Well this is just my opinion. I guess it's up to @fonnesbeck. I see there is a 2.3 milestone. Is that still a goal or will the next release be the new code base?
There will probably be a 2.3 release, but it will not include parallel chains, for the reasons that Thomas cited. Sorry, I have not "weeded the garden" recently. Re-closing.