A Survey of Preference-based Reinforcement Learning
For hard-to-specify tasks: in many everyday tasks it is difficult to quantify the utility of a decision. For example, manually designing a reward function for a general text-generation system is impractical, since the number of contexts that would have to be considered is enormous. Just as traditional Inverse Reinforcement Learning (IRL) learns a reward function from expert demonstrations, Preference-based RL (PbRL) aims to learn a reward function from a limited number of preference signals over paired trajectories / sub-trajectories, provided by humans or rule-based systems.
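As a rough illustration of how most of the entries below relate to each other, here is a minimal sketch of the standard Bradley-Terry-style objective for learning a reward model from pairwise segment preferences. The network architecture, batch format, and names (`RewardModel`, `preference_loss`, `seg_0`, `seg_1`) are assumptions for illustration, not taken from any specific paper.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a single (state, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg_0, seg_1, label):
    """Bradley-Terry loss for one batch of preference queries.

    seg_0, seg_1: dicts with 'obs' and 'act' tensors of shape
                  (batch, segment_len, dim) for the two compared segments.
    label:        tensor of shape (batch,), 1.0 if seg_1 is preferred,
                  0.0 if seg_0 is preferred (0.5 for ties).
    """
    # Sum predicted rewards over each segment to get a "return" estimate.
    ret_0 = reward_model(seg_0["obs"], seg_0["act"]).sum(dim=1)
    ret_1 = reward_model(seg_1["obs"], seg_1["act"]).sum(dim=1)
    # P(seg_1 preferred) = exp(ret_1) / (exp(ret_0) + exp(ret_1)) = sigmoid(ret_1 - ret_0)
    logits = ret_1 - ret_0
    return nn.functional.binary_cross_entropy_with_logits(logits, label)
```

Much of the list below can be read as a variation on this recipe: where the compared segments come from (active querying, guided exploration, generated trajectories), how the reward model is regularized or augmented, and what quantity the preference probability is conditioned on (return, regret, non-Markovian summaries).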
The papers below are categorized first by research problem and then by method.
- Programming by Feedback.
  Akrour R, Schoenauer M, Sebag M, et al. (ICML 2014) [paper]
- Model-free preference-based reinforcement learning.
  Wirth C, Fürnkranz J, Neumann G. (AAAI 2016) [paper]
- Inverse Preference Learning: Preference-based RL without a Reward Function.
  Hejna J, Sadigh D. (arXiv) [paper]
- Active Reward Learning
  - Active preference-based learning of reward functions.
    Sadigh D, Dragan A D, Sastry S, et al. (RSS 2017) [paper]
  - Asking Easy Questions: A User-Friendly Approach to Active Reward Learning.
    Bıyık E, Palan M, Landolfi N C, et al. (CoRL 2019) [paper]
  - Active Preference-Based Gaussian Process Regression for Reward Learning.
    Bıyık E, Huynh N, Kochenderfer M J, et al. (RSS 2020) [paper]
  - Active preference learning using maximum regret.
    Wilde N, Kulić D, Smith S L. (IROS 2020) [paper]
  - Information Directed Reward Learning for Reinforcement Learning.
    Lindner D, Turchetta M, Tschiatschek S, et al. (NIPS 2021) [paper]
- Propose Hypothetical / Generated Trajectories
  - Learning Human Objectives by Evaluating Hypothetical Behavior.
    Reddy S, Dragan A, Levine S, et al. (ICML 2020) [paper]
    * Utilizes a generative/dynamics model to synthesize hypothetical behaviors; see the rollout sketch after this group.
  - Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models.
    Liu Y, Datta G, Novoseller E, et al. (NIPS 2022 Workshop) [paper]
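A minimal sketch of the idea in that note: roll out a learned dynamics model (rather than the real environment) to produce candidate behaviors for preference or safety labelling. The function name, callable interfaces, and horizon are illustrative assumptions, not the exact procedure of either paper.

```python
import torch

def synthesize_query_segments(dynamics_model, action_sampler, init_states, horizon=30):
    """Roll out a learned dynamics model to produce hypothetical behaviors for labelling.

    dynamics_model: callable (state, action) -> next_state (a learned model, not the real env).
    action_sampler: callable (state) -> action, e.g. the current policy or a perturbed copy.
    init_states:    tensor (batch, state_dim) of (possibly generated) start states.
    Returns segments shaped like those in the reward-learning sketch above, so behaviors that
    were never executed in the real environment can still be compared and labelled.
    """
    states, actions = [init_states], []
    s = init_states
    for _ in range(horizon):
        a = action_sampler(s)
        s = dynamics_model(s, a)
        actions.append(a)
        states.append(s)
    obs = torch.stack(states[:-1], dim=1)   # (batch, horizon, state_dim)
    act = torch.stack(actions, dim=1)       # (batch, horizon, act_dim)
    return {"obs": obs, "act": act}
```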
- Pretraining for Diverse Trajectories
  - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training.
    Lee K, Smith L, Abbeel P. (ICML 2021) [paper]
    * Pretrains the policy with unsupervised objectives (e.g., maximizing state coverage) to ensure informative feedback; see the sketch after this entry.
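A minimal sketch of the kind of unsupervised pretraining objective used here: a particle-based (k-nearest-neighbor) state-entropy bonus used as an intrinsic reward before any preferences are collected. The function name, buffer format, and choice of k are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def knn_state_entropy_bonus(states, k=5):
    """Particle-based state-entropy estimate used as an intrinsic reward.

    states: tensor of shape (n, state_dim), e.g. a (sub)sample of the replay buffer
            with n > k. Returns a per-state bonus proportional to the log distance to
            each state's k-th nearest neighbor: sparsely visited regions get large
            bonuses, so maximizing this reward pushes the policy toward diverse states.
    """
    dists = torch.cdist(states, states)                         # (n, n) pairwise distances
    knn_dist = dists.topk(k + 1, largest=False).values[:, -1]   # k-th NN (skips self at distance 0)
    return torch.log(knn_dist + 1.0)                            # +1 keeps the bonus non-negative
```

During the unsupervised phase the agent maximizes this bonus instead of the not-yet-learned task reward; once preference queries begin, the learned reward model from the first sketch takes over.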
- Actively collect data the reward model is uncertain about, via guided exploration
  - Reward Uncertainty for Exploration in Preference-based Reinforcement Learning.
    Liang X, Shu K, Lee K, et al. (ICLR 2022) [paper]
    * Uses the disagreement of a reward-model ensemble as an intrinsic reward to collect informative data; see the sketch after this entry.
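A minimal sketch of the ensemble-disagreement bonus described above; the ensemble interface, the way the bonus is mixed with the learned reward, and the coefficient schedule are assumptions for illustration.

```python
import torch

def disagreement_bonus(reward_ensemble, obs, act):
    """Intrinsic reward from reward-model disagreement.

    reward_ensemble: list (len > 1) of reward models, each mapping (obs, act) -> (batch,) rewards.
    Returns the standard deviation of the ensemble's predictions per transition: transitions
    where the learned reward is still uncertain get a large bonus, so exploring them tends to
    yield informative preference queries.
    """
    preds = torch.stack([r(obs, act) for r in reward_ensemble], dim=0)  # (ensemble, batch)
    return preds.std(dim=0)                                             # (batch,)

# The policy can then be trained on r_total = mean_prediction + beta * disagreement_bonus(...),
# with beta decayed over training so that exploitation eventually dominates (an assumed schedule).
```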
- Training the reward model feedback-efficiently
  - SURF: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning.
    Park J, Seo Y, Shin J, et al. (ICLR 2022) [paper]
    * Combines a semi-supervised loss on self-generated pseudo-labels with data augmentation that crops sub-trajectories from the labelled pairs; see the sketch after this entry.
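A minimal sketch of the two ideas in that note as I read them: confidence-thresholded pseudo-labelling of unlabeled segment pairs, and temporal cropping of labelled pairs. The thresholds, crop lengths, and function names are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch

def pseudo_label(reward_model, seg_0, seg_1, threshold=0.95):
    """Label unlabeled segment pairs that the current reward model is confident about."""
    with torch.no_grad():
        ret_0 = reward_model(seg_0["obs"], seg_0["act"]).sum(dim=1)
        ret_1 = reward_model(seg_1["obs"], seg_1["act"]).sum(dim=1)
        p1 = torch.sigmoid(ret_1 - ret_0)            # P(seg_1 preferred)
    confident = (p1 > threshold) | (p1 < 1 - threshold)
    labels = (p1 > 0.5).float()
    return confident, labels                          # train only on the confident subset

def temporal_crop(seg_0, seg_1, min_len=40, max_len=55):
    """Crop a random sub-segment from each trajectory; the preference label is kept.

    Assumes both segments are longer than max_len along the time dimension (dim 1).
    """
    full_len = seg_0["obs"].shape[1]
    crop_len = int(torch.randint(min_len, max_len + 1, (1,)))
    s0, s1 = (int(x) for x in torch.randint(0, full_len - crop_len + 1, (2,)))
    crop = lambda seg, s: {k: v[:, s:s + crop_len] for k, v in seg.items()}
    return crop(seg_0, s0), crop(seg_1, s1)
```

The confidently pseudo-labelled pairs can then be fed into the same pairwise loss as the human-labelled ones, typically with a smaller weight.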
- Training the reward model data-efficiently
  - Meta-Reward-Net: Implicitly Differentiable Reward Learning for Preference-based Reinforcement Learning.
    Liu R, Bai F, Du Y, et al. (NIPS 2022) [paper]
  - Few-Shot Preference Learning for Human-in-the-Loop RL.
    Hejna III D J, Sadigh D. (CoRL 2022) [paper]
    * Trains the reward model with MAML so that it can be quickly adapted to a downstream task from a few preference-labelled trajectories; see the adaptation sketch after this group.
  - Learning a Universal Human Prior for Dexterous Manipulation from Human Preference.
    Ding Z, Chen Y, Ren A Z, et al. (Preprint 2023) [paper]
    * Trains a universal reward model on a multi-task dataset and uses it directly to guide learning in downstream tasks.
  - Causal Confusion and Reward Misidentification in Preference-Based Reward Learning.
    Tien J, He J Z Y, Erickson Z, et al. (ICLR 2023) [paper]
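A minimal sketch of the few-shot adaptation step only (the meta-training outer loop, which backpropagates through this adaptation, is omitted). The first-order update, learning rate, and step count are assumptions; `preference_loss` is the pairwise loss from the first sketch.

```python
import copy
import torch

def maml_adapt(reward_model, support_queries, preference_loss, inner_lr=1e-3, steps=5):
    """Few-shot adaptation of a meta-trained reward model to a new downstream task.

    support_queries: list of (seg_0, seg_1, label) preference tuples from the new task.
    Returns a task-specific copy of the reward model after a few gradient steps.
    """
    adapted = copy.deepcopy(reward_model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(steps):
        for seg_0, seg_1, label in support_queries:
            loss = preference_loss(adapted, seg_0, seg_1, label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return adapted
```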
- Deep reinforcement learning from human preferences.
  Christiano P F, Leike J, Brown T, et al. (NIPS 2017) [paper]
  * First scaled preference-based reward learning to deep RL, fitting the reward model with the Bradley-Terry preference model (see the sketch at the top of this list).
- Models of human preference for learning reward functions.
  Knox W B, Hatgis-Kessell S, Booth S, et al. (Preprint 2022) [paper]
  * Proposes modelling preferences by regret rather than by trajectory return; see the comparison after this group.
- Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning.
  Early J, Bewley T, Evers C, et al. (NIPS 2022) [paper]
  * Non-Markovian reward modelling from preference labels.
- Preference Transformer: Modeling Human Preferences using Transformers for RL.
  Kim C, Park J, Shin J, et al. (ICLR 2023) [paper]
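For reference, the two preference models contrasted in the Knox et al. entry above, written out side by side. The notation and the exact normalization are mine (a simplification); $\Sigma$ denotes a trajectory segment, $r_\psi$ the learned reward, and the paper should be consulted for the precise definition of segment regret.

```latex
% Return-based preference model (Bradley-Terry over summed rewards), as in Christiano et al.:
P\big(\Sigma^1 \succ \Sigma^0\big)
  = \frac{\exp\big(\sum_t r_\psi(s^1_t, a^1_t)\big)}
         {\exp\big(\sum_t r_\psi(s^0_t, a^0_t)\big) + \exp\big(\sum_t r_\psi(s^1_t, a^1_t)\big)}

% Regret-based preference model (Knox et al.): the summed reward of a segment is replaced by
% its negated regret, so the comparison reflects how close each segment is to optimal behavior
% rather than how much raw reward it happens to accumulate:
P\big(\Sigma^1 \succ \Sigma^0\big)
  = \frac{\exp\big(-\mathrm{regret}(\Sigma^1)\big)}
         {\exp\big(-\mathrm{regret}(\Sigma^0)\big) + \exp\big(-\mathrm{regret}(\Sigma^1)\big)}
```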
- Learning Human Objectives by Evaluating Hypothetical Behavior.
  Reddy S, Dragan A, Levine S, et al. (ICML 2020) [paper]
  * Asks the human to describe each trajectory with one of three labels (good, unsafe, neutral).
- Widening the Pipeline in Human-Guided Reinforcement Learning with Explanation and Context-Aware Data Augmentation.
  Guan L, Verma M, Guo S S, et al. (NIPS 2021) [paper]
  * Asks the human to annotate task-relevant features of the observations.
- Boosting Offline Reinforcement Learning with Action Preference Query.
  Yang Q, Wang S, Lin M G, et al. (ICML 2023) [paper]
  * Assumes access to a limited budget of action-preference queries.
- Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences.
  Guan L, Valmeekam K, Kambhampati S. (ICLR 2023) [paper]
  * Symbolic human feedback.
- Beyond Reward: Offline Preference-guided Policy Optimization.
  Kang Y, Shi D, Liu J, et al. (ICML 2023) [paper]
  * Offline RL from preferences (same setting as Preference Transformer).
- Efficient Meta Reinforcement Learning for Preference-based Fast Adaptation.
  Ren Z, Liu A, Liang Y, et al. (NIPS 2022) [paper]
  * Meta-RL setting in which ground-truth rewards are available only during meta-training, while only preferences are available at meta-test time.
- Deploying Offline Reinforcement Learning with Human Feedback.
  Li Z, Xu K, Liu L, et al. (Preprint 2023) [paper]
  * Selects which offline-trained model to deploy downstream using human feedback.
- Boosting Offline Reinforcement Learning with Action Preference Query.
  Yang Q, Wang S, Lin M G, et al. (ICML 2023) [paper]
  * Assumes access to a limited budget of action-preference queries.
- Prompt-Tuning Decision Transformer with Preference Ranking.
  Hu S, Shen L, Zhang Y, et al. (arXiv) [paper]
  * Uses preferences to tune the prompt of a Decision Transformer.
- Skill Preferences: Learning to Extract and Execute Robotic Skills from Human Feedback.
  Wang X, Lee K, Hakhamaneshi K, et al. (CoRL 2021) [paper]
- Controlled Diversity with Preference: Towards Learning a Diverse Set of Desired Skills.
  Hussonnois M, Karimpanal T G, Rana S. (AAMAS 2023) [paper]
- Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences.
  Bıyık E, Losey D P, Palan M, et al. (IJRR 2022) [paper]
- Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations.
  Brown D, Goo W, Nagarajan P, et al. (ICML 2019) [paper]
- Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations.
  Brown D S, Goo W, Niekum S. (CoRL 2020) [paper]
- Learning to summarize from human feedback.
  Stiennon N, Ouyang L, Wu J, et al. (NIPS 2020) [paper]
- Training language models to follow instructions with human feedback.
  Ouyang L, Wu J, Jiang X, et al. (NIPS 2022) [paper]
- Chain of Hindsight Aligns Language Models with Feedback.
  Liu H, Sferrazza C, Abbeel P. (Preprint 2023) [paper]
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears.
  Yuan Z, Yuan H, Tan C, et al. (Preprint 2023) [paper]
- Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models.
  Xu W, Dong S, Arumugam D, et al. (Preprint 2023) [paper]
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
  Rafailov R, Sharma A, Mitchell E, et al. (arXiv) [paper]
- Dueling RL: Reinforcement Learning with Trajectory Preferences.
  Pacchiano A, Saha A, Lee J. (AISTATS 2023) [paper]
  * Regret analysis of a preference-based algorithm under assumptions on the trajectory embeddings and the preference model.
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation.
  Chen X, Zhong H, Yang Z, et al. (ICML 2022) [paper]
- Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning.
  Kong D, Yang L. (NIPS 2022) [paper]
- B-Pref: Benchmarking Preference-Based Reinforcement Learning.
  Lee K, Smith L, Dragan A, et al. (NIPS 2021 Track on Datasets and Benchmarks) [paper]