
Preference-RL

A survey of preference-based reinforcement learning (PBRL)

Why preference?

For hard-to-specify tasks. It is difficult to quantify the utility of decisions in many everyday tasks. For example, manually designing a reward function for a general text-generation system is practically impossible, because the range of contexts it must account for is enormous. Just as traditional Inverse Reinforcement Learning (IRL) learns a reward function from expert demonstrations, Preference-based RL (PBRL) aims to learn a reward function from a limited number of preference signals over paired trajectories / sub-trajectories, provided by humans or rule-based systems.
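
As a concrete illustration of this setup, below is a minimal sketch of the reward-learning objective that most of the works in this list build on: a reward model is trained so that the Bradley-Terry probability of the preferred segment matches the given label. The `RewardModel` architecture, tensor shapes, and function names here are illustrative assumptions, not taken from any particular paper.

```python
# Minimal PBRL reward-learning sketch (illustrative; names and shapes are assumptions).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a single (state, action) pair to a scalar reward."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: (batch, T, obs_dim), act: (batch, T, act_dim) -> (batch, T)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg_a, seg_b, label):
    """Bradley-Terry cross-entropy loss on a batch of segment pairs.

    seg_a, seg_b: (obs, act) tensors for each segment of the pair.
    label: (batch,) float tensor, 1.0 if segment A is preferred, 0.0 if segment B is.
    """
    ret_a = reward_model(*seg_a).sum(dim=1)  # predicted "return" of segment A
    ret_b = reward_model(*seg_b).sum(dim=1)  # predicted "return" of segment B
    # Bradley-Terry: P(A preferred over B) = sigmoid(ret_a - ret_b)
    return nn.functional.binary_cross_entropy_with_logits(ret_a - ret_b, label)
```

The learned reward model is then used in place of an environment reward by a standard RL algorithm; most of the works below vary how the query pairs are chosen, how the labels are obtained, or how the preference model itself is formulated.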

Existing works

Works are categorized by research problem first, then by method.

Directly learning from preferences without fitting a reward model?

  • Programming by Feedback.
    Akrour R, Schoenauer M, Sebag M, et al. (ICML 2014). [paper]

  • Model-free preference-based reinforcement learning.
    Wirth C, Fürnkranz J, Neumann G. (AAAI 2016) [paper]

  • Inverse Preference Learning: Preference-based RL without a Reward Function.
    Hejna J, Sadigh D. (arXiv) [paper]

Learning with as few preferences as possible (feedback-efficient)?

  • Active Reward Learning

    • Active preference-based learning of reward functions.
      Sadigh D, Dragan A D, Sastry S, et al. (RSS 2017) [paper]

    • Asking Easy Questions: A User-Friendly Approach to Active Reward Learning.
      Bıyık E, Palan M, Landolfi N C, et al. (CoRL 2019) [paper]

    • Active Preference-Based Gaussian Process Regression for Reward Learning.
      Bıyık E, Huynh N, Kochenderfer M J, et al. (RSS 2020) [paper]

    • Active preference learning using maximum regret.
      Wilde N, Kulić D, Smith S L. (IROS 2020) [paper]

    • Information Directed Reward Learning for Reinforcement Learning.
      Lindner D, Turchetta M, Tschiatschek S, et al. (NIPS 2021) [paper]

  • Propose Hypothetical/Generated Trajectories

    • Learning Human Objectives by Evaluating Hypothetical Behavior.
      Reddy S, Dragan A, Levine S, et al. (ICML 2020) [paper]
      * Utilizing a generative/dynamics model to synthesize hypothetical behaviors.

    • Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models.
      Liu Y, Datta G, Novoseller E, et al. (NIPS 2022 Workshop) [paper]

  • Pretraining for Diverse Trajectories

    • PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training.
      Lee K, Smith L, Abbeel P. (ICML 2021) [paper]
      * Pretraining with unsupervised objectives (e.g., maximizing state coverage) to ensure informative feedback.

  • Actively collecting data the reward model is uncertain about, via guided exploration

    • Reward Uncertainty for Exploration in Preference-based Reinforcement Learning.
      Liang X, Shu K, Lee K, et al. (ICLR 2022) [paper]
      * Treats the disagreement of a reward-model ensemble as an intrinsic reward for collecting informative data (see the sketch after this list).

  • Training the reward model feedback-efficiently

    • SURF: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning.
      Park J, Seo Y, Shin J, et al. (ICLR 2022) [paper]
      * Uses a semi-supervised loss with self-generated pseudo-labels, and augments the data by exchanging sub-trajectories between paired trajectories.

  • Training the reward model data-efficiently

    • Meta-Reward-Net: Implicitly Differentiable Reward Learning for Preference-based Reinforcement Learning
      Liu R, Bai F, Du Y, et al. (NIPS 2022) [paper]
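
As referenced in the guided-exploration item above, the following is a minimal sketch of using reward-model ensemble disagreement as an exploration bonus. The reward-model interface, function names, and the coefficient `beta` are illustrative assumptions, not the original implementation.

```python
# Minimal sketch of an ensemble-disagreement exploration bonus (assumptions noted above).
import torch

def disagreement_bonus(reward_ensemble, obs, act, beta: float = 0.05):
    """Intrinsic reward = scaled standard deviation across an ensemble of learned
    reward models, driving the agent toward transitions where the models disagree
    and new preference feedback would therefore be most informative.

    reward_ensemble: list of callables, each mapping (obs, act) -> (batch,) rewards.
    """
    with torch.no_grad():
        preds = torch.stack([r(obs, act) for r in reward_ensemble], dim=0)  # (K, batch)
    return beta * preds.std(dim=0)

def shaped_reward(reward_ensemble, obs, act, beta: float = 0.05):
    """Mean ensemble prediction plus the disagreement bonus; beta would typically be
    decayed over training so exploration fades as the reward model converges."""
    with torch.no_grad():
        preds = torch.stack([r(obs, act) for r in reward_ensemble], dim=0)
    return preds.mean(dim=0) + beta * preds.std(dim=0)
```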

Can the learned reward models generalize or assist downstream tasks?

  • Few-Shot Preference Learning for Human-in-the-Loop RL.
    Hejna III D J, Sadigh D. (CoRL 2022) [paper]
    * Trains the reward model with MAML, then quickly adapts it with a few trajectories from the downstream task.

  • Learning a Universal Human Prior for Dexterous Manipulation from Human Preference.
    Ding Z, Chen Y, Ren A Z, et al. (Preprint 2023) [paper]
    * Trains a universal reward model on a multi-task dataset, and directly uses it to guide learning in downstream tasks.

Reward misidentification / Robustness of reward models

  • Causal Confusion and Reward Misidentification in Preference-Based Reward Learning.
    Tien J, He J Z Y, Erickson Z, et al. (ICLR 2023) [paper]

General models of rewards?

  • Deep reinforcement learning from human preferences.
    Christiano P F, Leike J, Brown T, et al. (NIPS 2017) [paper]
    * First applied the Bradley-Terry preference model to reward learning for deep RL.

  • Models of human preference for learning reward functions.
    Knox W B, Hatgis-Kessell S, Booth S, et al. (Preprint 2022) [paper]
    * Proposes modeling preferences with regret instead of trajectory return (see the formula sketch after this list).

  • Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning.
    Early J, Bewley T, Evers C, et al. (NIPS 2022) [paper]
    * Non-Markovian reward modelling from preference labels.

  • Preference Transformer: Modeling Human Preferences using Transformers for RL
    Changyeon Kim, Jongjin Park, Jinwoo Shin, Honglak Lee, Pieter Abbeel, Kimin Lee. (ICLR 2023) [paper]
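
For reference, a rough side-by-side of the two preference models discussed above: the return-based Bradley-Terry model used by Christiano et al., and the regret-based alternative of Knox et al. The notation is mine and only sketches the papers' formulations.

```latex
% Return-based Bradley-Terry model (Christiano et al.): a segment \sigma^i = (s^i_t, a^i_t)_t
% is preferred in proportion to the exponentiated sum of predicted rewards along it.
P\big(\sigma^1 \succ \sigma^2\big)
  = \frac{\exp\big(\sum_t \hat{r}(s^1_t, a^1_t)\big)}
         {\exp\big(\sum_t \hat{r}(s^1_t, a^1_t)\big) + \exp\big(\sum_t \hat{r}(s^2_t, a^2_t)\big)}

% Regret-based model (Knox et al., roughly): replace the summed reward with the negative
% regret of the segment, i.e. prefer segments whose decisions are closer to optimal
% rather than segments that merely pass through high-reward states.
P\big(\sigma^1 \succ \sigma^2\big)
  = \frac{\exp\big(-\mathrm{regret}(\sigma^1)\big)}
         {\exp\big(-\mathrm{regret}(\sigma^1)\big) + \exp\big(-\mathrm{regret}(\sigma^2)\big)}
```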

Different forms of (human) feedback?

  • Learning Human Objectives by Evaluating Hypothetical Behavior.
    Reddy S, Dragan A, Levine S, et al. (ICML 2020) [paper]
    * Asks humans to assign one of three labels ([good, unsafe, neutral]) to each trajectory.

  • Widening the Pipeline in Human-Guided Reinforcement Learning with Explanation and Context-Aware Data Augmentation.
    Guan L, Verma M, Guo S S, et al. (NIPS 2021) [paper]
    * Asks humans to annotate task-relevant features of the observations.

  • Boosting Offline Reinforcement Learning with Action Preference Query.
    Yang Q, Wang S, Lin M G, et al. (ICML 2023) [paper]
    * Assumes access to a limited number of action-preference queries.

  • Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences
    Guan L, Valmeekam K, Kambhampati S. (ICLR 2023) [paper]
    * Symbolic human feedback via relative behavioral attributes.

Novel problem settings related to PBRL

  • Beyond Reward: Offline Preference-guided Policy Optimization.
    Kang Y, Shi D, Liu J, et al. (ICML 2023) [paper]
    * Offline RL + preferences (the same setting as Preference Transformer).

  • Efficient Meta Reinforcement Learning for Preference-based Fast Adaptation.
    Ren Z, Liu A, Liang Y, et al. (NIPS 2022) [paper]
    * In meta-RL, only preferences are available at the meta-testing stage, while ground-truth rewards are accessible only during meta-training.

  • Deploying Offline Reinforcement Learning with Human Feedback.
    Li Z, Xu K, Liu L, et al. (Preprint 2023) [paper]
    * Selects among offline-trained models for downstream deployment using human feedback.

  • Boosting Offline Reinforcement Learning with Action Preference Query.
    Yang Q, Wang S, Lin M G, et al. (ICML 2023) [paper]
    * Assumes access to a limited number of action-preference queries.

  • Prompt-Tuning Decision Transformer with Preference Ranking.
    Hu S, Shen L, Zhang Y, et al. (arXiv) [paper]
    * Uses preferences to tune the prompt of the Decision Transformer.

(Skill Discovery) Discovering skills aligned with human intent

  • Skill Preferences: Learning to Extract and Execute Robotic Skills from Human Feedback.
    Wang X, Lee K, Hakhamaneshi K, et al. (CoRL 2021) [paper]

  • Controlled Diversity with Preference: Towards Learning a Diverse Set of Desired Skills.
    Hussonnois M, Karimpanal T G, Rana S. (AAMAS 2023) [paper]

(Learning from Demonstrations) Learning from diverse sources of feedback

  • Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences.
    Bıyık E, Losey D P, Palan M, et al. (IJRR 2022) [paper]

(Learning from Demonstrations) Extrapolating beyond the demonstrations using preferences

  • Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations.
    Brown D, Goo W, Nagarajan P, et al. (ICML 2019) [paper]

  • Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations.
    Brown D S, Goo W, Niekum S. (CoRL 2020) [paper]

(LLM) Aligning language models with human preferences / RLHF

  • Learning to summarize from human feedback.
    Stiennon N, Ouyang L, Wu J, et al. (NIPS 2020) [paper]

  • Training language models to follow instructions with human feedback.
    Ouyang L, Wu J, Jiang X, et al. (NIPS 2022) [paper]

  • Chain of Hindsight Aligns Language Models with Feedback.
    Hao Liu, Carmelo Sferrazza, Pieter Abbeel (Preprint 2023) [paper]

  • RRHF: Rank Responses to Align Language Models with Human Feedback without tears.
    Yuan Z, Yuan H, Tan C, et al. (Preprint 2023) [paper]

  • Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models.
    Xu W, Dong S, Arumugam D, et al. (Preprint 2023) [paper]

  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
    Rafailov R, Sharma A, Mitchell E, et al. (arXiv) [paper]

Theory

  • Dueling RL: Reinforcement Learning with Trajectory Preferences.
    Pacchiano A, Saha A, Lee J. (AISTATS 2023) [paper]
    * Regret analysis of a preference-based algorithm under assumptions on trajectory embeddings and the preference model.

  • Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation.
    Chen X, Zhong H, Yang Z, et al. (ICML 2022) [paper]

  • Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning.
    Kong D, Yang L. (NIPS 2022) [paper]

Benchmarks

  • B-Pref: Benchmarking Preference-Based Reinforcement Learning
    Lee K, Smith L, Dragan A, et al. (NIPS 2021 Track on Datasets and Benchmarks) [paper]
