Hello. While thinking about changing the rollout function mentioned in #34, I realized that the model needs a terminal state so that the POMCP planner does not accumulate any more value after a decision is taken within a trial. However, after implementing such a state through the reward and transition models, I found that the POMCP planning tree always commits to a decision after the first observation, even when the belief in the corresponding state is very low. Since this was not happening before, I am wondering whether I implemented the goal state incorrectly.
First, this is how the tree looks for a given trial:
TRIAL 0 (true state s_3-t_0)
--------------------
STEP 0 (6 steps remaining)
Current belief (based on 0s of data):
s_0-t_0 -> 0.096
s_1-t_0 -> 0.075
s_2-t_0 -> 0.065
s_3-t_0 -> 0.062
s_4-t_0 -> 0.079
s_5-t_0 -> 0.066
s_6-t_0 -> 0.089
s_7-t_0 -> 0.1
s_8-t_0 -> 0.085
s_9-t_0 -> 0.087
s_10-t_0 -> 0.099
s_11-t_0 -> 0.097
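For reference, the tree dumps below were produced with pomdp_py's TreeDebugger from a pdb session, roughly like this (a minimal sketch, assuming the planning agent is in scope as agent):

from pomdp_py.utils import TreeDebugger

dd = TreeDebugger(agent.tree)  # wrap the agent's current search tree
dd                             # printing shows the root VNode and its QNode children
dd[12]                         # descend into child 12 (here, a_wait)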
TreeDebugger@
_VNodePP(n=14603, v=-26.635)
- [0] a_0: QNode(n=1, v=-100.000)
- [1] a_1: QNode(n=7, v=-84.286)
- [2] a_10: QNode(n=7, v=-84.286)
- [3] a_11: QNode(n=7, v=-84.286)
- [4] a_2: QNode(n=1, v=-100.000)
- [5] a_3: QNode(n=7, v=-84.286)
- [6] a_4: QNode(n=1, v=-100.000)
- [7] a_5: QNode(n=1, v=-100.000)
- [8] a_6: QNode(n=1, v=-100.000)
- [9] a_7: QNode(n=1, v=-100.000)
- [10] a_8: QNode(n=1, v=-100.000)
- [11] a_9: QNode(n=1, v=-100.000)
- [12] a_wait: QNode(n=14567, v=-26.635)
(Pdb) dd[12]
a_wait⟶_QNodePP(n=14567, v=-26.635)
- [0] o_0: VNodeParticles(n=1567, v=-24.875)
- [1] o_1: VNodeParticles(n=880, v=-29.977)
- [2] o_10: VNodeParticles(n=1942, v=-24.316)
- [3] o_11: VNodeParticles(n=1496, v=-30.452)
- [4] o_2: VNodeParticles(n=1201, v=-26.077)
- [5] o_3: VNodeParticles(n=788, v=-56.256)
- [6] o_4: VNodeParticles(n=756, v=-36.302)
- [7] o_5: VNodeParticles(n=1165, v=-31.310)
- [8] o_6: VNodeParticles(n=1238, v=-24.548)
- [9] o_7: VNodeParticles(n=986, v=-28.266)
- [10] o_8: VNodeParticles(n=1150, v=-31.324)
- [11] o_9: VNodeParticles(n=1386, v=-29.629)
Now suppose we receive o_3, the observation corresponding to the correct state:
(Pdb) dd[12][5]
o_3⟶_VNodePP(n=788, v=-56.256)
- [0] a_0: QNode(n=1, v=-100.000)
- [1] a_1: QNode(n=1, v=-100.000)
- [2] a_10: QNode(n=1, v=-100.000)
- [3] a_11: QNode(n=1, v=-100.000)
- [4] a_2: QNode(n=1, v=-100.000)
- [5] a_3: QNode(n=772, v=-56.256)
- [6] a_4: QNode(n=1, v=-100.000)
- [7] a_5: QNode(n=1, v=-100.000)
- [8] a_6: QNode(n=1, v=-100.000)
- [9] a_7: QNode(n=1, v=-100.000)
- [10] a_8: QNode(n=1, v=-100.000)
- [11] a_9: QNode(n=1, v=-100.000)
- [12] a_wait: QNode(n=5, v=-62.203)
What surprises me is the high number of simulations going into a_3 here, given that the only thing this action does is lead to the terminal state:
(Pdb) dd[12][5][5]
a_3⟶_QNodePP(n=772, v=-56.256)
- [0] o_term: VNodeParticles(n=771, v=0.000)
Meanwhile, there are only 5 simulations for the 'wait' action, even though it would collect more information and has several possible observations:
(Pdb) dd[12][5][12]
a_wait⟶_QNodePP(n=5, v=-62.203)
- [0] o_3: VNodeParticles(n=0, v=0.000)
- [1] o_4: VNodeParticles(n=0, v=0.000)
- [2] o_5: VNodeParticles(n=0, v=0.000)
- [3] o_6: VNodeParticles(n=0, v=0.000)
- [4] o_8: VNodeParticles(n=0, v=0.000)
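For reference, POMCP selects actions inside simulations with UCB1, i.e. Q + c * sqrt(ln N(h) / N(ha)); plugging the final counts from the o_3 node above into that rule gives (a back-of-the-envelope sketch of the standard POMCP formula, not pomdp_py's exact internals):

import math

def ucb1(q, n_parent, n_action, c=50):
    # Standard POMCP action score: estimated value plus exploration bonus.
    return q + c * math.sqrt(math.log(n_parent) / n_action)

print(ucb1(-56.256, 788, 772))  # a_3:    ~ -51.6
print(ucb1(-62.203, 788, 5))    # a_wait: ~  -4.5, far higher, yet it got only 5 visits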
I don't think this behavior is normal, and it did not change when I varied the exploration constant between the default value and 50. The terminal state is implemented as a state with id 'term' (i.e., s_term). Here are my transition and reward models:
import itertools

class TDTransitionModel(TransitionModel):
    def __init__(self, n_targets, n_steps):
        super().__init__(n_targets)
        if not isinstance(n_steps, int):
            raise TypeError(f"Invalid number of steps: {n_steps}. It must be an integer.")
        self.n_steps = n_steps
        self.max_t = self.n_steps - 1  # To handle 0-indexing of states and time steps

    def probability(self, next_state, state, action):
        """Returns the probability p(s'|s, a)."""
        if "term" in state.name:
            # The terminal state transitions to itself with probability 1
            return 1.0 if "term" in next_state.name else 0.0
        if state.t == self.max_t or "wait" not in action.name:
            # Last time step or a decision: deterministic transition to the terminal state
            return 1.0 if "term" in next_state.name else 0.0
        # Wait action on any time step other than the last one
        if next_state.t == state.t + 1:
            if next_state.id == state.id:
                return 1.0 - 1e-9
            return 1e-9  # Other states in the next time step
        return 0.0  # Can't travel through time... yet

    def sample(self, state, action):
        """Randomly samples the next state according to the transition model."""
        if "term" in state.name or state.t == self.max_t or "wait" not in action.name:
            # Stay in (or move to) the terminal state; t=0 is used consistently
            # for the terminal state so that it matches get_all_states()
            return TDState("term", 0)
        # Wait action on any time step other than the last one
        return TDState(state.id, state.t + 1)

    def get_all_states(self, t_step=None):
        """Returns a list of all states."""
        if t_step is not None:  # States for a given time step
            return [TDState(s, t_step) for s in range(self.n_targets)]
        # All states, all time steps, plus the terminal state
        all_states = [TDState(s, d)
                      for s, d in itertools.product(range(self.n_targets), range(self.n_steps))]
        all_states.append(TDState("term", 0))
        return all_states
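To rule out an unnormalized transition function, here is a quick consistency check (a sketch only: n_targets=12 and n_steps=6 match the trial above, and SimpleNamespace objects are hypothetical stand-ins for the real action class):

from types import SimpleNamespace

model = TDTransitionModel(n_targets=12, n_steps=6)
wait = SimpleNamespace(name="a_wait")       # hypothetical stand-ins for actions
decide = SimpleNamespace(name="a_3", id=3)

for s in model.get_all_states():
    for a in (wait, decide):
        total = sum(model.probability(s2, s, a) for s2 in model.get_all_states())
        # Loose tolerance: the 1e-9 smoothing makes the wait rows sum to
        # 1 + (n_targets - 2) * 1e-9 rather than exactly 1.
        assert abs(total - 1.0) < 1e-6, (s.name, a.name, total)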
class RewardModel(pomdp_py.RewardModel):
    def __init__(self, hit_reward=10, miss_cost=-100, wait_cost=-1):
        if not all(isinstance(attr, int) for attr in [hit_reward, miss_cost, wait_cost]):
            raise TypeError("All cost/reward values must be integers.")
        self.hit_reward = hit_reward
        self.miss_cost = miss_cost
        self.wait_cost = wait_cost

    def _reward_func(self, state, action):
        """
        The correct action is assumed to be the one that shares its ID (i.e., number) with a
        given state, since we assume each flicker is embedded in a button or actuator. Any
        action taken in the terminal state gives a reward of 0.
        """
        if 'term' in state.name:
            return 0
        if 'wait' in action.name:
            return self.wait_cost
        return self.hit_reward if action.id == state.id else self.miss_cost  # HIT vs MISS

    def sample(self, state, action, next_state):
        """The reward is deterministic."""
        return self._reward_func(state, action)
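And a tiny check of the reward semantics, again with hypothetical SimpleNamespace stand-ins for the state/action classes (the key property being that nothing accrues once the terminal state is reached):

from types import SimpleNamespace

rm = RewardModel()
s3 = SimpleNamespace(name="s_3", id=3)            # hypothetical stand-in for TDState
term = SimpleNamespace(name="s_term", id="term")
a3 = SimpleNamespace(name="a_3", id=3)            # hypothetical stand-ins for actions
a5 = SimpleNamespace(name="a_5", id=5)
wait = SimpleNamespace(name="a_wait", id=None)

assert rm.sample(s3, a3, None) == 10     # HIT: action id matches state id
assert rm.sample(s3, a5, None) == -100   # MISS
assert rm.sample(s3, wait, None) == -1   # waiting costs 1
assert rm.sample(term, a3, None) == 0    # terminal state: no further reward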
Here is also my observation model, in case it helps:
import numpy as np

class TDObservationModel(ObservationModel):
    def __init__(self, features):
        self.observation_matrix = self._make_obs_matrix(features)
        self.n_steps, self.n_states, self.n_obs = self.observation_matrix.shape

    def probability(self, observation, next_state, action):
        """Returns the probability p(o|s', a)."""
        if "term" in observation.name:
            # The terminal observation is emitted iff we transition to the terminal state
            return 1.0 if "term" in next_state.name or "wait" not in action.name else 0.0
        # A non-terminal observation is impossible on a transition to the terminal state
        if "term" in next_state.name or "wait" not in action.name:
            return 0.0
        obs_idx = observation.id
        state_idx = next_state.id
        state_step = next_state.t - 1  # observation_matrix[0] corresponds to next_state.t == 1
        return self.observation_matrix[state_step][state_idx][obs_idx]

    def sample(self, next_state, action):
        if "term" in next_state.name or "wait" not in action.name:
            return BCIObservation('term')  # Transition to the terminal state
        # Other transitions
        state_idx = next_state.id
        state_step = next_state.t - 1  # observation_matrix[0] corresponds to next_state.t == 1
        obs_p = self.observation_matrix[state_step][state_idx]
        return np.random.choice(self.get_all_observations(include_terminal=False), p=obs_p)
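Finally, a one-line check that the observation matrix rows are proper distributions (assuming model is a TDObservationModel constructed from the same features used above):

import numpy as np

model = TDObservationModel(features)
# observation_matrix[t][s] is the distribution over observations for state s
# at step t, so every row should sum to 1.
assert np.allclose(model.observation_matrix.sum(axis=-1), 1.0)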