Comments (19)
Hi Thomas,
There is a first implementation in the sandbox that should work with the txt trace. You certainly don't have to use this as a starting point, but it might give you some ideas.
David
from pymc.
Cool, but I don't see a sandbox in the trunk.
pymc/pymc/sandbox/parallel.py
oops, I thought the sandbox was removed.
That looks interesting, any reason you used IPython.kernel instead of multiprocessing directly? It seems you could get away without passing individual arguments as strings to the clients.
I started this years ago, before multiprocessing got into the standard lib...
There was a discussion regarding multiprocessing on the GoogleCode site that provides additional context.
OK -- it is more difficult than I initially thought with the multiprocessing package. The problem is that to send the model (i.e. the first argument to MCMC()) to the other processes, it would have to be picklable. Since a model consists of objects with methods, this is probably not possible.
The other approach, as taken by David Huard, seems limited to cases where the model description is a runnable Python file (as all the examples are). This seems to be what most people are doing, so it's probably fine, but it wouldn't work in general (e.g. not in my case).
I'm not sure if there is any way to make a model serializable (i.e. picklable).
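To make the pickling limitation concrete, here is a minimal, self-contained illustration (not PyMC code): functions defined inside other functions -- which is roughly what building a model produces, objects closing over local state -- cannot be serialized by the standard pickle module.

```python
import pickle

def make_counter():
    count = 0
    def step():
        # a local function closing over state, loosely analogous to a
        # model's stochastic objects closing over their logp functions
        nonlocal count
        count += 1
        return count
    return step

step = make_counter()
try:
    pickle.dumps(step)  # local functions are not picklable
    picklable = True
except Exception:
    picklable = False

print("step is picklable:", picklable)  # prints: step is picklable: False
```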
I think I came up with a way to use multiprocessing for multiple chains.
As described above, one can't just pass a model to an object (e.g. ParallelMCMC(model, ...)) because each process must have access to the model (or a copy thereof). For that, the model would have to be serializable, which I think is impossible.
However, if instead one defines a function that creates the model when called and passes that, each process could create its own model by calling that function.
E.g.:
    def create_model():
        # definition of distributions
        [...]
        return model

    m = ParallelMCMC(create_model, chains=10, num_procs=4)
    m.sample()  # launches 4 Python processes that sample from 10 chains
Feedback?
Hi Thomas,
I don't have any experience with multiprocessing, since I usually work with the parallel tools in IPython. That being said, I think the idea has potential. Have you thought about how you are going to handle the database backends?
Hi David,
I'm not too familiar with the IPython parallel tools. What is nice about multiprocessing is that it is in the Python standard library (although IPython would not be an unreasonable dependency).
No, the database is still an open question. From a design point of view it might seem best to use one database that all processes write to, but I'm afraid this will lead to many problems and also reduce performance. Instead, I think each chain should write to a different db output (and then, all backends could be used).
I hacked something similar before which worked OK. The problem is that after sampling, all chains have to be read in, a new model created, and the chains linked to it. This can be done but what I don't like is that (i) one couldn't easily continue to sample on those chains and (ii) it diverges a lot from a possible implementation of multiple chains without multiprocessing capabilities. Regarding (ii), I do not think that there is a way to make this work to begin with.
An added bonus of this approach is that it would not be hard to extend it to MPI and make it possible to run multiple chains on a cluster.
I understand that there is already some support in PyMC for multiple traces per distribution; can someone enlighten me as to how well this is expected to work?
Thomas,
Whatever parallel tools are used, I'm guessing the main issues are the same, so it doesn't really matter which one we pick. Since multiprocessing is in the stdlib, we might as well go with it.
I'm not sure I understand your question, however. I'm guessing you already know this, but when you call sample multiple times, it creates individual traces in the backend. You can merge those for the analysis, but they're kept separate under the hood. When I wrote the backends, it was with parallelism in mind, meaning that I tried to design them so that each trace could be written by a different process (in theory; in practice it's trickier). For example, the txt backend supports parallelism. Similarly, it should not be hard to create a parallel pickle db. For hdf5, it should be possible to write to the same database from different processes, but it requires compilation with parallel flags and I never had the time and motivation to try it out.
Another solution is to improve on your hack. Instead of waiting until the end to link the individual chains to the master db, you do it during sampling. The "master" launches independent samplers, but modifies them so that when they call "tally", it launches the master ParallelSampler "tally", which fetches the object's value (see Sampler._funs_to_tally). Using this, you have multiple processes running the computations, but only one tallying the data. If you look at the Database.tally method, it takes chain as an argument. By default it uses the last, but you could easily modify Sampler.tally to provide this information.
To get a better feeling for how the database works, look at base.py and ram.py in the database directory. Then look at the Sampler class to see how the database is initialized and how tally is called.
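A minimal sketch of this master-tallies design (illustrative only; slave and run are hypothetical names, not PyMC API): the worker processes advance their samplers in parallel, while the master fetches each current value over a pipe and does all the tallying itself.

```python
import multiprocessing

def slave(conn):
    # stand-in for a slave sampler: advances its state on "step" and
    # answers the master's value requests (cf. Sampler._funs_to_tally)
    value = 0
    while True:
        msg = conn.recv()
        if msg == "step":
            value += 1          # one iteration of the sampling loop
        elif msg == "get":
            conn.send(value)    # master fetches the current value
        elif msg == "stop":
            break

def run(chains=2, iters=3):
    conns, procs = [], []
    for _ in range(chains):
        parent, child = multiprocessing.Pipe()
        p = multiprocessing.Process(target=slave, args=(child,))
        p.start()
        conns.append(parent)
        procs.append(p)
    traces = {c: [] for c in range(chains)}
    for _ in range(iters):
        for conn in conns:
            conn.send("step")               # all chains advance in parallel
        for c, conn in enumerate(conns):
            conn.send("get")                # but only the master tallies
            traces[c].append(conn.recv())
    for conn in conns:
        conn.send("stop")
    for p in procs:
        p.join()
    return traces
```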
My prior plan was to just run multiple models in parallel, have them all write to separate DBs, and then collect them all. But there is the other approach you are referring to, namely that the master is the only one writing to the DB and the workers send the samples to it. I think I might like your approach better (I just hadn't thought about it before).
Maybe the cleanest way would then be to write a new db backend for the workers that is actually a connection to the master (via a multiprocessing pipe), which then itself saves the samples to the central DB (into multiple chains). The individual MCMC objects the workers sample from would not have to be changed, just given a new backend. What do you think?
Once we've come up with a plan of action, I'll start looking into the code.
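A minimal sketch of that idea (hypothetical names -- PipeBackend, worker, and sample_parallel are not PyMC API): each worker gets a backend whose tally() simply forwards samples over a multiprocessing pipe, and the master drains the pipes into one central store with a separate trace per chain.

```python
import multiprocessing

class PipeBackend:
    # hypothetical trace "backend": tally() forwards each sample to the
    # master over a pipe instead of writing it to a file
    def __init__(self, conn, chain_id):
        self.conn = conn
        self.chain_id = chain_id

    def tally(self, value):
        self.conn.send((self.chain_id, value))

    def close(self):
        self.conn.send((self.chain_id, None))  # sentinel: chain finished
        self.conn.close()

def worker(conn, chain_id, n):
    db = PipeBackend(conn, chain_id)
    for i in range(n):          # stand-in for the sampling loop
        db.tally(i)
    db.close()

def sample_parallel(chains=2, n=3):
    traces = {c: [] for c in range(chains)}
    conns, procs = [], []
    for c in range(chains):
        parent, child = multiprocessing.Pipe()
        p = multiprocessing.Process(target=worker, args=(child, c, n))
        p.start()
        conns.append(parent)
        procs.append(p)
    active = set(range(chains))
    while active:               # master collects samples into the central store
        for c in list(active):
            if conns[c].poll(0.01):
                chain_id, value = conns[c].recv()
                if value is None:
                    active.discard(c)
                else:
                    traces[chain_id].append(value)
    for p in procs:
        p.join()
    return traces
```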
I don't think you even need a new db backend, just a modification of a couple of Sampler methods. That is, all you need to do is connect the Trace objects in the master DB to their respective MCMC object in each slave Sampler. Normally, this is done in the _assign_database_backend method of Sampler. The critical bit is here:
self._funs_to_tally[object.__name__] = object.get_value
This associates an object name with a function returning its current value. Then, when Sampler.sample is called:
self.db._initialize(self._funs_to_tally, length)
For a parallel sampler, what we would have is rather
self._funs_to_tally[chain][object.__name__] = function that returns the object's value in slave process #chain.
and
self.db._initialize(self._funs_to_tally[chain], length)
OK, it's not that simple, but you get the idea.
I've got parallel sampling working for pymc3 in the 'parallel' branch, using the pymc.psample function. See examples/logistic.py for an example.
Unfortunately this prohibits the use of any nested functions, so it required some things to be rewritten in a less clean way.
+1 to parallel sampling. Will you be merging that any time soon?
It would be nice to have, but given how nicely pymc3 is shaping up, and that it already has parallel sampling, I'm not sure it makes sense to spend the resources on adding those capabilities to pymc2.
Ahh, OK, thanks - I didn't realize there was that level of difference. Perhaps this should be (re-)closed as "won't-fix" then?
Well this is just my opinion. I guess it's up to @fonnesbeck. I see there is a 2.3 milestone. Is that still a goal or will the next release be the new code base?
There will probably be a 2.3 release, but it will not include parallel chains, for the reasons that Thomas cited. Sorry, I have not "weeded the garden" recently. Re-closing.