
Assistive Bandits

Environment and solver code for The Assistive Multi-Armed Bandit (2019).

Introduction to our framework

Traditionally, techniques for preference learning assume that human behavior can be modelled as (noisily-)optimal with respect to a set of fixed preferences. However, our revealed preferences change all the time - especially in cases where we might not be fully aware of our preferences! For example, when we're organizing our desks, we might experiment with different configurations, to see what works.

Now imagine a personal robot is trying to help you organize that same desk. If the robot believes you are optimal, it will infer the wrong preferences. Instead, if the robot accounts for the fact that you are learning about your preferences, it has a better shot at understanding what you want. Even more crucially, the robot can expose you to new configurations -- ones that might improve your posture, something you hadn't considered before and were not going to explore if left to your own devices.

Figure: The setup of the assistive bandit framework. In each round, the human observes the reward and tells the robot which arm they would like it to pull. The robot observes these requests, attempts to infer the reward values, and selects an arm to pull.

The assistive bandit framework seeks to formalize the task of helping an agent that is still learning about their preferences. It extends the Multi-Armed Bandit model of reinforcement learning, and is heavily inspired by the Cooperative Inverse Reinforcement Learning framework. In each round, the human selects an action, referred to in the bandit setting as an arm. However, the robot intercepts their intended action and chooses a (potentially different) arm to pull. The human then observes the pulled arm and corresponding reward, and the process repeats.
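
For concreteness, here is a minimal Python sketch of this interaction loop, assuming Bernoulli arm rewards. The function name and the human_policy/robot_policy interfaces are hypothetical, chosen purely for illustration, and do not correspond to the classes in this repository.

import numpy as np

def assistive_bandit_episode(human_policy, robot_policy, arm_means, horizon=50, seed=0):
    # Hypothetical sketch of one episode; not the repository's actual API.
    rng = np.random.default_rng(seed)
    history = []
    for t in range(horizon):
        # The human requests the arm they would like the robot to pull.
        requested_arm = human_policy.act(history)
        # The robot observes the request and pulls a (possibly different) arm.
        pulled_arm = robot_policy.act(history, requested_arm)
        # Both agents observe the pulled arm and its Bernoulli reward.
        reward = rng.binomial(n=1, p=arm_means[pulled_arm])
        history.append((requested_arm, pulled_arm, reward))
    return history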

Despite appearing simple, we believe the framework captures interesting issues of information asymmetry and preference uncertainty that lie at the heart of assisting a learning agent. (See our paper for more details!)

"Human" Policies

The superclass for all bandit policies is HumanPolicy. Currently, the following bandit policies have been implemented:

  • The random policy, which pulls arms entirely at random.
  • The epsilon-greedy policy. With probability epsilon, pull a random arm; otherwise, pull the arm with the highest empirical mean.
  • The win-stay-lose-shift policy. If the reward obtained from an arm is greater than the mean reward (over all arms), keep pulling the same arm. Otherwise, pick another arm uniformly at random.
  • The Thompson sampling policy. Maintain a posterior over each arm's parameters, and pull each arm with probability equal to the chance that it is optimal. This is implemented by sampling a particle from the posterior over arm means, then picking the arm with the highest mean in that particle. We also implement an annealed variant, where many particles are sampled and the arm with the highest mean among the sampled particles is picked. Currently, the implementation assumes the bandit is Beta-Bernoulli - that is, the arm rewards are sampled from independent Bernoulli distributions, with parameters distributed according to a Beta distribution. (A minimal sketch of this policy appears after this list.)
  • The knowledge gradient policy. The knowledge gradient policy assumes (falsely) that the current timestep is the last opportunity to learn, and that the policy will continue to act later without any learning. It therefore picks an arm that maximizes the expected reward of the current timestep, plus the discounted sum of rewards of always pulling the best arm after this timestep. Currently, the implementation assumes the bandit is Beta-Bernoulli.
  • The Gittins index policy. This is the Bayes-optimal solution to a discounted, infinite horizon bandit. We use the approximation method of Chakravorty and Mahajan. Currently, the implementation assumes the bandit is Beta-Bernoulli.
  • The upper confidence bound (UCB) policy, which maintains an upper confidence bound on the mean of each arm and picks the arm with the highest bound.
  • The Bayesian upper confidence bound (Bayes-UCB) policy, which maintains posteriors over the mean of each arm, and picks the arm with the highest (1-1/t)-quantile. Currently, the implementation assumes the bandit is Beta-Bernoulli.
  • The upper credible limit policy, which is very similar to Bayes-UCB, but with softmax arm selection noise. Currently, the implementation assumes the bandit is Beta-Bernoulli.
  • The epsilon-optimal policy, which has full knowledge of the arm means. It pulls a random arm with probability epsilon. Otherwise, it pulls the arm with the highest mean.
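
As a concrete example (referenced in the Thompson sampling item above), here is a minimal sketch of Thompson sampling for a Beta-Bernoulli bandit. The class name and interface are illustrative only and do not match the repository's HumanPolicy API.

import numpy as np

class ThompsonSamplingSketch:
    # Illustrative Beta-Bernoulli Thompson sampling; not the repo's HumanPolicy API.
    def __init__(self, n_arms, seed=0):
        self.alpha = np.ones(n_arms)  # Beta posterior "success" counts (starts at the uniform prior)
        self.beta = np.ones(n_arms)   # Beta posterior "failure" counts
        self.rng = np.random.default_rng(seed)

    def act(self):
        # Sample one particle (a mean estimate per arm) from the posterior,
        # then pull the arm with the highest sampled mean.
        sampled_means = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(sampled_means))

    def update(self, arm, reward):
        # A Bernoulli reward in {0, 1} updates the Beta posterior in closed form.
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

# Example usage:
# policy = ThompsonSamplingSketch(n_arms=4)
# arm = policy.act()
# policy.update(arm, reward=1)

The closed-form Beta update is what makes the Beta-Bernoulli assumption convenient for several of the policies listed above.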

Alternative interaction modes

In addition to the assistive bandit setup (HumanTeleopWrapper), we also implement the following two interaction modes in HumanWrapper.py (a sketch contrasting the three modes follows the list):

  • Preemptive: the robot must choose which arm to pull before seeing the human's desired action, and only sees the human's action if the robot did not act. This is implemented by HumanCRLWrapper.
  • Turn-taking: the robot pulls arms during even timesteps, while the human pulls during odd timesteps. This is implemented by HumanIterativeWrapper.
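
To make the differences concrete, the sketch below shows who acts and what the robot observes in each mode. It is purely illustrative; the actual wrapper classes in HumanWrapper.py expose this logic through different interfaces, and the policy/pull objects here are hypothetical.

def run_round(mode, t, human_policy, robot_policy, pull):
    # Hypothetical sketch: human_policy.act() returns the human's chosen arm,
    # robot_policy.act(request) returns an arm (or None to defer, in the
    # preemptive mode), and pull(arm) returns the observed reward.
    if mode == "assistive":       # HumanTeleopWrapper
        request = human_policy.act()             # human states a preferred arm
        return pull(robot_policy.act(request))   # robot sees the request, then pulls
    if mode == "preemptive":      # HumanCRLWrapper
        arm = robot_policy.act(None)             # robot commits before seeing the request
        if arm is None:                          # robot defers, so the human acts...
            arm = human_policy.act()             # ...and the robot observes that action
        return pull(arm)
    if mode == "turn-taking":     # HumanIterativeWrapper
        arm = robot_policy.act(None) if t % 2 == 0 else human_policy.act()
        return pull(arm)
    raise ValueError(f"unknown mode: {mode}")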

Usage

Installation Requirements

gym (0.9.x)
numpy
scipy
rllab
tensorflow 1.2+
sklearn
pyprind

(Recommended) Installing using install script

We recommend that you set up an isolated environment to install and run this code in, as it depends on an older version of gym.

First, set up your virtual environment and install tensorflow or tensorflow-gpu. Then, run:

chmod u+x install.sh
./install.sh

Reproducing MAB results

To reproduce our results from The Assistive Multi-Armed Bandit paper, run:

python assistive-bandits/experiments/mab-experiments.py 

By default, our code will sequentially train an agent against each human policy included in the paper, and then test each trained agent against every human policy. This takes a significant amount of time (about 2-3 hours per policy per seed on an AWS p3.2xlarge).

Evaluating trained policies

To evaluate a trained assistive policy, run:

python assistive-bandits/experiments/evaluate_policy.py <path to policy file>


