metalearncuriosity's People

Contributors

dependabot[bot], ziksby

metalearncuriosity's Issues

Change templates

Problem Description:

Change template style

Proposed Solution:

Edit the template files.

Alternative Solutions (if any):

Additional Context:

Update RNN code

Problem Description

The flax RNNCellBase API has undergone some key changes; the code needs to be updated to match them.

Proposed Solution

Follow the tutorial here.
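
For reference, a minimal sketch of the newer-style Flax recurrent API the update would target (assuming a recent flax release; the cell and input sizes are illustrative, not the repo's): the cell size is passed to the constructor and initialize_carry is an instance method that takes the input shape.

import jax
import jax.numpy as jnp
import flax.linen as nn

# Newer-style GRUCell: `features` in the constructor, carry initialised from the input shape.
cell = nn.GRUCell(features=64)
carry = cell.initialize_carry(jax.random.PRNGKey(0), (8, 32))  # (batch, input_dim)

x = jnp.ones((8, 32))
params = cell.init(jax.random.PRNGKey(1), carry, x)
new_carry, y = cell.apply(params, carry, x)  # y has shape (8, 64)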

Alternative Solutions (if any)

Downgrade to an older version of flax.

RL loss for BYOL-explore is too high

Bug Description:

RL loss for BYOL-explore is too high.

Steps to Reproduce:

  1. Simply run the BYOL-explore file and monitor the RL loss.

Expected Behavior:

The loss should decrease as training progresses.

Actual Behavior:

The loss decreases, but its magnitude is still far too high.

Environment:

Empty-misc

Feature Request: Add pmapped reward combiner

Problem Description

With TPU access, we need to adjust our reward-combiner training code to take full advantage of the TPUs.

Proposed Solution

Code up a reward-combiner script that uses pmap.
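
A rough sketch of the pmap pattern this could follow, with a placeholder loss and plain SGD standing in for the actual reward-combiner training step (none of these names come from the repo):

import jax
import jax.numpy as jnp

def combiner_loss(params, batch):
    # Placeholder loss: a linear combiner fitted to target rewards.
    preds = batch["features"] @ params["w"]
    return jnp.mean((preds - batch["targets"]) ** 2)

def update_step(params, batch):
    loss, grads = jax.value_and_grad(combiner_loss)(params, batch)
    # Average gradients across TPU devices before the (plain SGD) update.
    grads = jax.lax.pmean(grads, axis_name="devices")
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return params, loss

# Replicate params across devices and shard the batch along the leading device axis.
p_update_step = jax.pmap(update_step, axis_name="devices")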

Alternative Solutions (if any)

Additional Context

Add other metrics/hyperparameter versions of agents.

Problem Description

Add other metrics to view besides the average reward, such as the standard error, and add hyperparameter-sweep versions of the code.

Proposed Solution

Add a file so that we may view these metrics, and a folder containing files that sweep through the hyperparameters of the algorithms.
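
As a sketch (the array name and shape are hypothetical), the standard error across seeds could be computed like this:

import numpy as np

# `returns` is a hypothetical (num_seeds, num_eval_points) array of episode returns.
returns = np.random.rand(30, 100)
mean_curve = returns.mean(axis=0)
std_error = returns.std(axis=0, ddof=1) / np.sqrt(returns.shape[0])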

Alternative Solutions (if any)

Additional Context

Add FAST algorithm.

Problem Description

Add the FAST algorithm which was meta-learned from the Meta-Learning Curiosity Algorithms paper.

Proposed Solution

Add a single file implementation of the FAST algorithm.

Additional Context

Mainly for the ICLR blog post, but it can also serve as one of the baselines for my master's.

Add delayed wrapper for Brax.

Problem Description

To test the exploration power of our RL algorithms we will implement Brax with delayed rewards. This idea is taken from here.

Proposed Solution

Add a function that provides the reward at the specified step interval.
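
A minimal sketch of the idea, assuming the wrapper carries an accumulator and a step counter through the rollout (the function and argument names are illustrative, not the repo's wrapper API):

import jax.numpy as jnp

def delay_reward(reward, done, acc, t, interval=40):
    # Accumulate the per-step reward and only release the sum every `interval`
    # steps or when the episode ends; otherwise the agent sees zero reward.
    acc = acc + reward
    release = jnp.logical_or((t + 1) % interval == 0, done)
    delayed_reward = jnp.where(release, acc, 0.0)
    acc = jnp.where(release, 0.0, acc)
    return delayed_reward, acc, t + 1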

Add saver file or checkpoint saver and plotter file.

Problem Description

Add a way to save the runner_state and hence the network params. This may also help with visualisation. Also add a file that plots the W&B CSV files.

Proposed Solution

Provide a file that saves the runner_state and the config used. Save the file locally.
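
A minimal sketch of a local saver, assuming runner_state is a pytree of arrays and config is a plain dict (orbax or flax.serialization would be reasonable alternatives):

import pickle
import jax

def save_checkpoint(runner_state, config, path="checkpoint.pkl"):
    # Pull arrays back to host memory so the checkpoint pickles cleanly.
    runner_state = jax.device_get(runner_state)
    with open(path, "wb") as f:
        pickle.dump({"runner_state": runner_state, "config": config}, f)

def load_checkpoint(path="checkpoint.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)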

Alternative Solutions (if any)

Perhaps use Weights & Biases to save the models.

BYOL-Explore-toy-example does not learn with intrinsic rewards.

Bug Description:

As soon as we add the intrinsic rewards, BYOL-Explore does not learn: the learning curve crashes.

Steps to Reproduce:

  1. Run the experiments with BYOL-Explore and set INT_LAMBDA to a positive number.

Expected Behavior:

We see BYOL-Explore perform better with curiosity in the latter stages of training.

Actual Behavior:

We see BYOL-Explore perform considerably worse with curiosity.

Environment:

This was for the Empty-misc environment.

Have two value heads for BYOL-Explore.

Problem Description

Implement BYOL-explore with two value heads. This could help determine why the value loss is so high.

Proposed Solution

Add a second value head for BYOL-Explore, which means we also have a second advantage estimate and target for the intrinsic rewards. This will help keep the intrinsic reward non-episodic and the extrinsic reward episodic.
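
A sketch of what the network head could look like (the module name and layer sizes are illustrative, not the repo's architecture):

import flax.linen as nn

class ActorCriticTwoHeads(nn.Module):
    action_dim: int

    @nn.compact
    def __call__(self, x):
        h = nn.relu(nn.Dense(64)(x))
        logits = nn.Dense(self.action_dim)(h)
        value_ext = nn.Dense(1)(h)  # value head for the episodic extrinsic return
        value_int = nn.Dense(1)(h)  # value head for the non-episodic intrinsic return
        return logits, value_ext.squeeze(-1), value_int.squeeze(-1)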

Additional Context

Since we have one value head, making the intrinsic reward non-episodic makes the extrinsic reward non-episodic as well. This could cause issues, as we would be leaking information about the task to the agent. See the original Random Network Distillation paper.

TPU Branch

Problem Description

The current code base does not take advantage of pmap. This is needed since we are using TPUs.

Proposed Solution

Make the code base suited for TPUs.

Using Episode Horizon Instead of Training Horizon

Bug Description:

The temporal reward combiner is using the normalised episode time step. It should use the normalised training time step.
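
As a concrete contrast (the variable names and numbers are illustrative only):

# Illustrative names, not the repo's variables.
episode_step, max_episode_length = 37, 100
update_step, num_updates = 1200, 4882

t_episode = episode_step / max_episode_length  # currently fed to the combiner
t_training = update_step / num_updates         # what it should receive instead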

Expected Behavior:

Reward combiner should generalise to other environments.

Actual Behavior:

It struggles to generalise beyond the MiniGrid environments.

Feature Request: Add Discrete PPO agent

Feature Title:

Add PPO agent from purejaxrl

Problem Description:

The base PPO is necessary to ensure we are able to implement benchmarks.

Proposed Solution:

This should implement the PPO agent for discrete-action gymnax environments.

Feature Request: Add toy example for BYOL-Explore

Feature Title:

Add toy example for BYOL-Explore

Problem Description:

We need a file that implements BYOL-Explore.

Proposed Solution:

This file should contain a toy example of BYOL-Explore where we simply add the world model, the encoder, and the normalisation methods.

Add Meta-learner for extrinsic and intrinsic reward combiner.

Problem Description

In the curiosity literature it is not clear how one should combine the intrinsic and extrinsic rewards. The reward combiner has been handcrafted so far, and perhaps it can be meta-learned.

Proposed Solution

Add a meta-learner that learns how to combine the intrinsic reward and the extrinsic reward.
It will be a single-file implementation.
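
A hypothetical sketch of what such a learned combiner could look like: a small MLP over the two rewards and a normalised time step. None of this is the repo's actual architecture.

import flax.linen as nn
import jax
import jax.numpy as jnp

class RewardCombiner(nn.Module):
    @nn.compact
    def __call__(self, r_ext, r_int, t_norm):
        # Combine extrinsic reward, intrinsic reward, and training progress.
        x = jnp.stack([r_ext, r_int, t_norm], axis=-1)
        x = nn.tanh(nn.Dense(16)(x))
        return nn.Dense(1)(x).squeeze(-1)

combiner = RewardCombiner()
r_ext = jnp.ones((8,))
r_int = jnp.ones((8,))
t_norm = jnp.full((8,), 0.5)
params = combiner.init(jax.random.PRNGKey(0), r_ext, r_int, t_norm)
r_combined = combiner.apply(params, r_ext, r_int, t_norm)  # shape (8,)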

Add episodic intrinsic reward for BYOL-explore

Feature Title:

Add episodic intrinsic reward for BYOL-explore

Problem Description:

Intrinsic rewards should be episodic if the user desires; currently they are only non-episodic.

Proposed Solution:

Make the intrinsic reward episodic by adding another batch of last_dones so that we can multiply the intrinsic reward by (1 - last_dones).
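
The core of the change is a one-liner over the rollout arrays (shapes and names are illustrative):

import jax.numpy as jnp

int_reward = jnp.ones((128, 4))   # hypothetical (num_steps, num_envs) intrinsic rewards
last_dones = jnp.zeros((128, 4))  # 1.0 where the previous step ended an episode
# Zero the intrinsic reward at episode boundaries so it becomes episodic.
int_reward = int_reward * (1.0 - last_dones)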

Implement BYOL-Explore Properly

Problem Description

The last baseline to implement is BYOL-Explore.

Proposed Solution

Add a single file implementation of it.

Alternative Solutions (if any)

Additional Context

Add Reward Prioritisation for BYOL-explore

Feature Title:

Add Reward Prioritisation for BYOL-explore

Problem Description:

Add reward prioritisation for BYOL-Explore, as this could help stabilise training for lower values of int_lambda.

Proposed Solution:

Add a reward prioritisation function as described in the paper.

Add Random Agent Code

Problem Description

Add code that runs the random agent on the environment. This is to normalise the scores from the agents.

Proposed Solution

Add random agent code for each macro-environment.

Cannot divide evenly, which means we cannot form batches

Bug Description:

When attempting to create a batch, we run into an issue whereby we cannot evenly divide shapes of sizes 128 and 512 when we do not multiply the intrinsic reward by (1 - done).

Steps to Reproduce:

  1. Do not multiply the intrinsic reward by (1 - done) in BYOL-Explore.
  2. Then run the BYOL-Explore file.

Expected Behavior:

We should expect the file to execute and we should be able to run the experiments.

Actual Behavior:

We obtain this error:
jax._src.core.InconclusiveDimensionOperation: Cannot divide evenly the sizes of shapes (128,) and (512,)

Environment:

Empty-misc

Additional Information:

config = { "SEED": 42, "NUM_SEEDS": 30, "LR": 2.5e-4, "NUM_ENVS": 4, "NUM_STEPS": 128, "TOTAL_TIMESTEPS": 5e5, "UPDATE_EPOCHS": 4, "NUM_MINIBATCHES": 4, "GAMMA": 0.99, "GAE_LAMBDA": 0.95, "CLIP_EPS": 0.2, "ENT_COEF": 0.01, "VF_COEF": 0.5, "MAX_GRAD_NORM": 0.5, "ACTIVATION": "tanh", "ENV_NAME": "Empty-misc", "ANNEAL_LR": True, "DEBUG": False, "EMA_PARAMETER": 0.99, "REW_NORM_PARAMETER": 0.99, "INT_LAMBDA": 0.1, }
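
For reference, with this config a full batch flattens to NUM_STEPS * NUM_ENVS = 128 * 4 = 512 transitions, so an array that kept only the step dimension has 128 elements and cannot be split alongside the 512-element fields. A plausible (unconfirmed) illustration of the mismatch:

import jax.numpy as jnp

num_steps, num_envs = 128, 4
full_batch_field = jnp.zeros((num_steps, num_envs)).reshape(-1)  # shape (512,)
missing_env_dim = jnp.zeros((num_steps,))                        # shape (128,)
# Splitting both into the same number of minibatches fails because 128 != 512.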

Update loggers

Problem Description

Currently we log after training is completed. It would be ideal to log during training, and then to be able to pause and resume training.

Proposed Solution

It is possible to log results during training using jax.experimental.io_callback(). For example, after every update step we can pass the current metrics through the callback to a print function that logs training.
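
A minimal sketch of that pattern (the metric names are made up; a wandb.log call could replace print):

import jax
import jax.numpy as jnp
from jax.experimental import io_callback

def log_metrics(metrics):
    # Runs on the host: metrics arrive as host arrays.
    print({k: float(v) for k, v in metrics.items()})

@jax.jit
def update_step(step, loss):
    # `None` because the callback returns nothing.
    io_callback(log_metrics, None, {"step": step, "loss": loss})
    return step + 1

update_step(jnp.int32(0), jnp.float32(0.5))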

Alternative Solutions (if any)

Additional Context

Please see this issue.

Add visualisation file

Problem Description

We need a way to visualise the agent's performance on environments.

Proposed Solution

Add a file that takes a sequence of states and produces a GIF.
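
A minimal sketch, assuming the states can be rendered to (H, W, 3) uint8 frames and that imageio is available:

import numpy as np
import imageio

# `frames` stands in for the rendered states of one episode.
frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(10)]
imageio.mimsave("rollout.gif", frames)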

Add continuous PPO, RNN PPO, and DPO

Feature Title:

Add other PPO types from PureJaxRL.

Problem Description:

So that we can test PPO on other types of environments.

Proposed Solution:

Add the single-file implementations.

Add loggers for intrinsic reward, and the losses (RL-loss, BYOL-Loss, Encoder Loss)

Feature Title:

Add loggers for intrinsic reward, and the losses (RL-loss, BYOL-Loss, Encoder Loss)

Problem Description:

Currently BYOL-Explore doesn't work with the intrinsic reward added. The loggers will allow us to monitor why it can't learn. This will also help in the future in case we run into other bugs.

Proposed Solution:

Add loggers that keep track of the losses and the intrinsic reward.

Add logger to log results (W&B)

Feature Title:

W&B logger

Problem Description:

There's no logger to log results and experiments.

Proposed Solution:

Add a run file and a logger file.

Add Random Network Distillation (RND).

Problem Description

Add the Random Network Distillation single file implementation.

Proposed Solution

A file that can be executed so that we may run RND.
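
A compact sketch of the RND idea (layer sizes and names are illustrative): a fixed, randomly initialised target network and a trained predictor, with the predictor's per-observation squared error used as the intrinsic reward.

import jax
import jax.numpy as jnp
import flax.linen as nn

class RNDNetwork(nn.Module):
    @nn.compact
    def __call__(self, obs):
        h = nn.relu(nn.Dense(64)(obs))
        return nn.Dense(32)(h)

net = RNDNetwork()
obs = jnp.ones((8, 16))
target_params = net.init(jax.random.PRNGKey(0), obs)     # never updated
predictor_params = net.init(jax.random.PRNGKey(1), obs)  # trained to match the target

int_reward = jnp.mean(
    (net.apply(predictor_params, obs) - net.apply(target_params, obs)) ** 2, axis=-1
)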

Additional Context

Serves as one of the baselines for my master's.

Add heat map for grid world environments.

Problem Description

Add heat maps for the grid-world environments so that we may understand the RL algorithms' behaviour better.

Proposed Solution

Load the params of the models and see how they move in the environment. The user can specify which seeds to use.
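
A sketch of the plotting side, assuming the rollouts are first reduced to a (height, width) array of visit counts:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical visit counts for a 13x13 grid world.
visit_counts = np.random.randint(0, 50, size=(13, 13))
plt.imshow(visit_counts, cmap="viridis")
plt.colorbar(label="visit count")
plt.savefig("heatmap.png")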

Alternative Solutions (if any)

Additional Context

Add Cycle-consistency intrinsic motivation

Problem Description

Add Cycle-consistency intrinsic motivation

Proposed Solution

Just add a single file implementation of it.

Alternative Solutions (if any)

Additional Context

Add logger for the individual RL losses

Feature Title:

Add logger for the individual RL losses

Problem Description:

Add a logger for the individual RL losses. This will help with issue #15 and with the debugging process for that loss.

Proposed Solution:

Add loggers for entropy, actor_loss and value loss.
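
A sketch of the usual pattern for exposing these (the loss terms below are placeholders, not the repo's PPO computation): return the individual losses as auxiliaries from the loss function and log each one.

import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Placeholder loss terms standing in for the real PPO pieces.
    actor_loss = jnp.mean((params["w"] * batch) ** 2)
    value_loss = jnp.mean(batch ** 2)
    entropy = jnp.asarray(0.01)
    total_loss = actor_loss + 0.5 * value_loss - 0.01 * entropy
    return total_loss, {"actor_loss": actor_loss, "value_loss": value_loss, "entropy": entropy}

grad_fn = jax.value_and_grad(loss_fn, has_aux=True)
(total_loss, loss_info), grads = grad_fn({"w": jnp.asarray(1.0)}, jnp.ones(4))
# `loss_info` can now be passed to the logger alongside the total loss.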
