
irl-maxent's People

Contributors

niftyj, qzed


irl-maxent's Issues

Multiple terminal states

Thanks a lot for such well-documented code. It is making it really easy for me to adapt to my use case.

My MDP has multiple terminal states, and I was wondering how to change the local_action_probabilities() code for that scenario. The Ziebart paper mentions that you have to do it for all terminal states, but I am not sure how to combine them. Thanks for your help!
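
A hedged sketch of how a set of terminal states can be handled in the backward pass (this is not the repo's local_action_probabilities(); the function name and the transition-tensor layout below are assumptions): Ziebart's algorithm seeds Z_s = 1 at every terminal state and re-adds that indicator on each sweep, so the only change needed is to iterate over a collection of terminals instead of a single state.

import numpy as np

def local_action_probabilities_sketch(p_transition, terminals, reward, eps=1e-5, max_iter=1000):
    """Backward pass with a *set* of terminal states (sketch, not the repo's API).

    Assumed shapes: p_transition[s, s_next, a] = P(s_next | s, a),
    reward has shape (n_states,), terminals is an iterable of state indices.
    """
    n_states, _, n_actions = p_transition.shape
    terminals = list(terminals)
    er = np.exp(reward)

    # seed z_s with 1 at *every* terminal state, not just one
    zs = np.zeros(n_states)
    zs[terminals] = 1.0

    za = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        # z_a[s, a] = exp(reward[s]) * sum_s' P(s' | s, a) * z_s[s']
        za = er[:, None] * np.einsum('ijk,j->ik', p_transition, zs)

        # z_s[s] = sum_a z_a[s, a], plus 1 for every terminal state
        zs_new = za.sum(axis=1)
        zs_new[terminals] += 1.0

        if np.max(np.abs(zs_new - zs)) < eps:
            zs = zs_new
            break
        zs = zs_new

    # local action probabilities P(a | s) = z_a[s, a] / z_s[s]
    # (states that cannot reach any terminal keep z_s = 0 and may need masking)
    return za / np.maximum(zs[:, None], 1e-300)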

Supporting MDPs with negative reward states?

Hello, thanks for sharing your code. Is it possible to use this for MDPs with negative reward states?

I've tried setting negative rewards inside setup_mdp() in example.py, e.g. like:

def setup_mdp():
    """
    Set-up our MDP/GridWorld
    """
    # create our world
    world = W.IcyGridWorld(size=5, p_slip=0.2)

    # set up the reward function
    reward = np.zeros(world.n_states)
    reward[-1] = 1.0
    reward[17] = -0.75
    reward[18] = -0.75
    reward[19] = -0.75

    # set up terminal states
    terminal = [24]

    return world, reward, terminal

-0.75 seems to be around the lowest reward I can set; anything lower than that, and running example.py results in an error:

Traceback (most recent call last):
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/example.py", line 141, in <module>
    main()
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/example.py", line 113, in main
    trajectories, expert_policy = generate_trajectories(world, reward, terminal)
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/example.py", line 51, in generate_trajectories
    tjs = list(T.generate_trajectories(n_trajectories, world, policy_exec, initial, terminal))
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 128, in <genexpr>
    return (_generate_one() for _ in range(n))
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 126, in _generate_one
    return generate_trajectory(world, policy, s, final)
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 77, in generate_trajectory
    action = policy(state)
  File "/Users/kierad/Documents/GitHub/irl-maxent/src/irl_maxent/trajectory.py", line 169, in <lambda>
    return lambda state: np.random.choice([*range(policy.shape[1])], p=policy[state, :])
  File "mtrand.pyx", line 956, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities are not non-negative

And even with the above setup_mdp, the IRL methods don't seem to produce negative reward estimates (see the colourbar I've added):

True rewards: (figure: reward_estimate_true)

Estimated rewards with maxent: (figure: reward_estimate_maxent)

Estimated rewards with causal maxent: (figure: reward_estimate_maxent_causal)
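
A hedged note and sketch (not the repo's code): the traceback shows np.random.choice being handed a policy row with negative entries, which suggests the stochastic expert policy's action weights are derived directly from the now-negative values. A softmax (Boltzmann) policy over action values stays a valid probability distribution for any sign of the reward; the q_values array and function below are hypothetical, not part of the repo's API. Separately, IRL reward estimates are generally only meaningful up to offset and scale, which may be why the plotted estimates never dip below zero even when the true reward does.

import numpy as np

def boltzmann_policy_sketch(q_values, temperature=1.0):
    """Softmax policy over action values (sketch; q_values is a hypothetical
    (n_states, n_actions) array, not something the repo exposes under this name)."""
    # subtract the per-state maximum before exponentiating, for numerical stability
    z = (q_values - q_values.max(axis=1, keepdims=True)) / temperature
    e = np.exp(z)
    # probabilities are strictly positive regardless of the sign of the values
    return e / e.sum(axis=1, keepdims=True)

# usage sketch, mirroring the np.random.choice call in the traceback:
# policy = boltzmann_policy_sketch(q)              # q: hypothetical Q-table
# action = np.random.choice(policy.shape[1], p=policy[state, :])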

If the trajectory stays in the terminal state (for a limited number of steps)

Hi, thank you so much for this amazing repo.
I have been trying to build my own environment, but I have run into some issues.
What if we have something like this: going from s0 to s1 to s2 and then staying in s3 forever?
(I changed the value iteration so that my trajectories are all 50 steps long), so my SVF is something like (1, 1, 1, 47, 0, ..., 0).
However, I am facing some difficulties: my zs and za keep growing until they become NaN, and this ends up making my omega NaN as well.
I was wondering whether you have any idea what the problem is and how I can fix it.
I am reading Dr. Ziebart's thesis but still have no clue how to tackle this problem (since z_terminal is 1, I am wondering whether that is the cause).
If you have any ideas, I would be grateful if you shared your thoughts.
Thanks again!
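
A hedged sketch of one way to keep the backward pass from overflowing (not the repo's code; the transition-tensor layout and function name are assumptions): when a trajectory self-loops in an absorbing state for many steps, zs and za can grow past the floating-point range and turn into inf/NaN. Running the same recursion in log-space with logsumexp keeps the numbers finite over long horizons such as the fixed 50-step trajectories described above.

import numpy as np
from scipy.special import logsumexp

def log_backward_pass_sketch(p_transition, terminals, reward, n_iter=50):
    """Ziebart-style backward pass in log-space (sketch, not the repo's API).

    Assumed shapes: p_transition[s, s_next, a] = P(s_next | s, a),
    reward has shape (n_states,), terminals is an iterable of state indices,
    n_iter is the horizon (e.g. the fixed 50-step trajectory length).
    """
    n_states, _, n_actions = p_transition.shape
    terminals = list(terminals)

    # log z_s: 0 at terminal states (z_s = 1), -inf elsewhere (z_s = 0)
    log_zs = np.full(n_states, -np.inf)
    log_zs[terminals] = 0.0

    # log transition probabilities; log(0) -> -inf is handled by logsumexp
    with np.errstate(divide='ignore'):
        log_p = np.log(p_transition)

    log_za = np.full((n_states, n_actions), -np.inf)
    for _ in range(n_iter):
        # log z_a[s, a] = reward[s] + logsumexp_s'(log P(s'|s,a) + log z_s[s'])
        log_za = reward[:, None] + logsumexp(log_p + log_zs[None, :, None], axis=1)

        # log z_s[s] = log(sum_a z_a[s, a] + [s is terminal])
        log_zs = logsumexp(log_za, axis=1)
        log_zs[terminals] = np.logaddexp(log_zs[terminals], 0.0)

    # local action probabilities; states that never reach a terminal keep
    # log_zs = -inf and may need masking before use
    return np.exp(log_za - log_zs[:, None])

If the absorbing state has positive reward, the partition values still grow with the horizon, but representing them in log-space avoids the overflow that produces NaN.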
