A Survey of Preference-based Reinforcement Learning
For hard-to-specify tasks: in many everyday tasks it is difficult to quantify the utility of a decision. For example, manually designing a reward function for a general text-generation system is impractical, since the number of contexts that would have to be considered is enormous. Just as traditional Inverse Reinforcement Learning (IRL) learns a reward function from expert demonstrations, Preference-based RL (PbRL) aims to learn a reward function from a limited number of preference signals over paired trajectories / sub-trajectories, provided by humans or rule-based systems.
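As a rough illustration of how most of the entries below relate to each other, here is a minimal sketch of the standard Bradley-Terry-style objective for learning a reward model from pairwise segment preferences. The network architecture, batch format, and names (`RewardModel`, `preference_loss`, `seg_0`, `seg_1`) are assumptions for illustration, not taken from any specific paper.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a single (state, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg_0, seg_1, label):
    """Bradley-Terry loss for one batch of preference queries.

    seg_0, seg_1: dicts with 'obs' and 'act' tensors of shape
                  (batch, segment_len, dim) for the two compared segments.
    label:        tensor of shape (batch,), 1.0 if seg_1 is preferred,
                  0.0 if seg_0 is preferred (0.5 for ties).
    """
    # Sum predicted rewards over each segment to get a "return" estimate.
    ret_0 = reward_model(seg_0["obs"], seg_0["act"]).sum(dim=1)
    ret_1 = reward_model(seg_1["obs"], seg_1["act"]).sum(dim=1)
    # P(seg_1 preferred) = exp(ret_1) / (exp(ret_0) + exp(ret_1)) = sigmoid(ret_1 - ret_0)
    logits = ret_1 - ret_0
    return nn.functional.binary_cross_entropy_with_logits(logits, label)
```

Much of the list below can be read as a variation on this recipe: where the compared segments come from (active querying, guided exploration, generated trajectories), how the reward model is regularized or augmented, and what quantity the preference probability is conditioned on (return, regret, non-Markovian summaries).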
The papers below are categorized first by research problem and then by method.
- Programming by Feedback.
  Akrour R, Schoenauer M, Sebag M, et al. (ICML 2014) [paper]
- Model-free preference-based reinforcement learning.
  Wirth C, Fürnkranz J, Neumann G. (AAAI 2016) [paper]
- Inverse Preference Learning: Preference-based RL without a Reward Function.
  Hejna J, Sadigh D. (arXiv) [paper]
- Active Reward Learning
  - Active preference-based learning of reward functions.
    Sadigh D, Dragan A D, Sastry S, et al. (RSS 2017) [paper]
  - Asking Easy Questions: A User-Friendly Approach to Active Reward Learning.
    Bıyık E, Palan M, Landolfi N C, et al. (CoRL 2019) [paper]
  - Active Preference-Based Gaussian Process Regression for Reward Learning.
    Bıyık E, Huynh N, Kochenderfer M J, et al. (RSS 2020) [paper]
  - Active preference learning using maximum regret.
    Wilde N, Kulić D, Smith S L. (IROS 2020) [paper]
  - Information Directed Reward Learning for Reinforcement Learning.
    Lindner D, Turchetta M, Tschiatschek S, et al. (NIPS 2021) [paper]
- Propose Hypothetical / Generated Trajectories
  - Learning Human Objectives by Evaluating Hypothetical Behavior.
    Reddy S, Dragan A, Levine S, et al. (ICML 2020) [paper]
    * Utilizes a generative/dynamics model to synthesize hypothetical behaviors; see the rollout sketch after this group.
  - Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models.
    Liu Y, Datta G, Novoseller E, et al. (NIPS 2022 Workshop) [paper]
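A minimal sketch of the idea in that note: roll out a learned dynamics model (rather than the real environment) to produce candidate behaviors for preference or safety labelling. The function name, callable interfaces, and horizon are illustrative assumptions, not the exact procedure of either paper.

```python
import torch

def synthesize_query_segments(dynamics_model, action_sampler, init_states, horizon=30):
    """Roll out a learned dynamics model to produce hypothetical behaviors for labelling.

    dynamics_model: callable (state, action) -> next_state (a learned model, not the real env).
    action_sampler: callable (state) -> action, e.g. the current policy or a perturbed copy.
    init_states:    tensor (batch, state_dim) of (possibly generated) start states.
    Returns segments shaped like those in the reward-learning sketch above, so behaviors that
    were never executed in the real environment can still be compared and labelled.
    """
    states, actions = [init_states], []
    s = init_states
    for _ in range(horizon):
        a = action_sampler(s)
        s = dynamics_model(s, a)
        actions.append(a)
        states.append(s)
    obs = torch.stack(states[:-1], dim=1)   # (batch, horizon, state_dim)
    act = torch.stack(actions, dim=1)       # (batch, horizon, act_dim)
    return {"obs": obs, "act": act}
```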
- Pretraining for Diverse Trajectories
  - PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training.
    Lee K, Smith L, Abbeel P. (ICML 2021) [paper]
    * Pretrains the policy with unsupervised objectives (e.g., maximizing state coverage) to ensure informative feedback; see the sketch after this entry.
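A minimal sketch of the kind of unsupervised pretraining objective used here: a particle-based (k-nearest-neighbor) state-entropy bonus used as an intrinsic reward before any preferences are collected. The function name, buffer format, and choice of k are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def knn_state_entropy_bonus(states, k=5):
    """Particle-based state-entropy estimate used as an intrinsic reward.

    states: tensor of shape (n, state_dim), e.g. a (sub)sample of the replay buffer
            with n > k. Returns a per-state bonus proportional to the log distance to
            each state's k-th nearest neighbor: sparsely visited regions get large
            bonuses, so maximizing this reward pushes the policy toward diverse states.
    """
    dists = torch.cdist(states, states)                         # (n, n) pairwise distances
    knn_dist = dists.topk(k + 1, largest=False).values[:, -1]   # k-th NN (skips self at distance 0)
    return torch.log(knn_dist + 1.0)                            # +1 keeps the bonus non-negative
```

During the unsupervised phase the agent maximizes this bonus instead of the not-yet-learned task reward; once preference queries begin, the learned reward model from the first sketch takes over.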
- Actively collect data the reward model is uncertain about, via guided exploration
  - Reward Uncertainty for Exploration in Preference-based Reinforcement Learning.
    Liang X, Shu K, Lee K, et al. (ICLR 2022) [paper]
    * Uses the disagreement of a reward-model ensemble as an intrinsic reward to collect informative data; see the sketch after this entry.
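A minimal sketch of the ensemble-disagreement bonus described above; the ensemble interface, the way the bonus is mixed with the learned reward, and the coefficient schedule are assumptions for illustration.

```python
import torch

def disagreement_bonus(reward_ensemble, obs, act):
    """Intrinsic reward from reward-model disagreement.

    reward_ensemble: list (len > 1) of reward models, each mapping (obs, act) -> (batch,) rewards.
    Returns the standard deviation of the ensemble's predictions per transition: transitions
    where the learned reward is still uncertain get a large bonus, so exploring them tends to
    yield informative preference queries.
    """
    preds = torch.stack([r(obs, act) for r in reward_ensemble], dim=0)  # (ensemble, batch)
    return preds.std(dim=0)                                             # (batch,)

# The policy can then be trained on r_total = mean_prediction + beta * disagreement_bonus(...),
# with beta decayed over training so that exploitation eventually dominates (an assumed schedule).
```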
- Training the reward model feedback-efficiently
  - SURF: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning.
    Park J, Seo Y, Shin J, et al. (ICLR 2022) [paper]
    * Combines a semi-supervised loss on self-generated pseudo-labels with data augmentation that crops sub-trajectories from the labelled pairs; see the sketch after this entry.
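A minimal sketch of the two ideas in that note as I read them: confidence-thresholded pseudo-labelling of unlabeled segment pairs, and temporal cropping of labelled pairs. The thresholds, crop lengths, and function names are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch

def pseudo_label(reward_model, seg_0, seg_1, threshold=0.95):
    """Label unlabeled segment pairs that the current reward model is confident about."""
    with torch.no_grad():
        ret_0 = reward_model(seg_0["obs"], seg_0["act"]).sum(dim=1)
        ret_1 = reward_model(seg_1["obs"], seg_1["act"]).sum(dim=1)
        p1 = torch.sigmoid(ret_1 - ret_0)            # P(seg_1 preferred)
    confident = (p1 > threshold) | (p1 < 1 - threshold)
    labels = (p1 > 0.5).float()
    return confident, labels                          # train only on the confident subset

def temporal_crop(seg_0, seg_1, min_len=40, max_len=55):
    """Crop a random sub-segment from each trajectory; the preference label is kept.

    Assumes both segments are longer than max_len along the time dimension (dim 1).
    """
    full_len = seg_0["obs"].shape[1]
    crop_len = int(torch.randint(min_len, max_len + 1, (1,)))
    s0, s1 = (int(x) for x in torch.randint(0, full_len - crop_len + 1, (2,)))
    crop = lambda seg, s: {k: v[:, s:s + crop_len] for k, v in seg.items()}
    return crop(seg_0, s0), crop(seg_1, s1)
```

The confidently pseudo-labelled pairs can then be fed into the same pairwise loss as the human-labelled ones, typically with a smaller weight.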
- Training the reward model data-efficiently
  - Meta-Reward-Net: Implicitly Differentiable Reward Learning for Preference-based Reinforcement Learning.
    Liu R, Bai F, Du Y, et al. (NIPS 2022) [paper]
  - Few-Shot Preference Learning for Human-in-the-Loop RL.
    Hejna III D J, Sadigh D. (CoRL 2022) [paper]
    * Trains the reward model with MAML so that it can be quickly adapted to a downstream task from a few preference-labelled trajectories; see the adaptation sketch after this group.
  - Learning a Universal Human Prior for Dexterous Manipulation from Human Preference.
    Ding Z, Chen Y, Ren A Z, et al. (Preprint 2023) [paper]
    * Trains a universal reward model on a multi-task dataset and uses it directly to guide learning in downstream tasks.
  - Causal Confusion and Reward Misidentification in Preference-Based Reward Learning.
    Tien J, He J Z Y, Erickson Z, et al. (ICLR 2023) [paper]
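A minimal sketch of the few-shot adaptation step only (the meta-training outer loop, which backpropagates through this adaptation, is omitted). The first-order update, learning rate, and step count are assumptions; `preference_loss` is the pairwise loss from the first sketch.

```python
import copy
import torch

def maml_adapt(reward_model, support_queries, preference_loss, inner_lr=1e-3, steps=5):
    """Few-shot adaptation of a meta-trained reward model to a new downstream task.

    support_queries: list of (seg_0, seg_1, label) preference tuples from the new task.
    Returns a task-specific copy of the reward model after a few gradient steps.
    """
    adapted = copy.deepcopy(reward_model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(steps):
        for seg_0, seg_1, label in support_queries:
            loss = preference_loss(adapted, seg_0, seg_1, label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return adapted
```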
- Deep reinforcement learning from human preferences.
  Christiano P F, Leike J, Brown T, et al. (NIPS 2017) [paper]
  * First scaled preference-based reward learning to deep RL, fitting the reward model with the Bradley-Terry preference model (see the sketch at the top of this list).
- Models of human preference for learning reward functions.
  Knox W B, Hatgis-Kessell S, Booth S, et al. (Preprint 2022) [paper]
  * Proposes modelling preferences by regret rather than by trajectory return; see the comparison after this group.
- Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning.
  Early J, Bewley T, Evers C, et al. (NIPS 2022) [paper]
  * Non-Markovian reward modelling from preference labels.
- Preference Transformer: Modeling Human Preferences using Transformers for RL.
  Kim C, Park J, Shin J, et al. (ICLR 2023) [paper]
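For reference, the two preference models contrasted in the Knox et al. entry above, written out side by side. The notation and the exact normalization are mine (a simplification); $\Sigma$ denotes a trajectory segment, $r_\psi$ the learned reward, and the paper should be consulted for the precise definition of segment regret.

```latex
% Return-based preference model (Bradley-Terry over summed rewards), as in Christiano et al.:
P\big(\Sigma^1 \succ \Sigma^0\big)
  = \frac{\exp\big(\sum_t r_\psi(s^1_t, a^1_t)\big)}
         {\exp\big(\sum_t r_\psi(s^0_t, a^0_t)\big) + \exp\big(\sum_t r_\psi(s^1_t, a^1_t)\big)}

% Regret-based preference model (Knox et al.): the summed reward of a segment is replaced by
% its negated regret, so the comparison reflects how close each segment is to optimal behavior
% rather than how much raw reward it happens to accumulate:
P\big(\Sigma^1 \succ \Sigma^0\big)
  = \frac{\exp\big(-\mathrm{regret}(\Sigma^1)\big)}
         {\exp\big(-\mathrm{regret}(\Sigma^0)\big) + \exp\big(-\mathrm{regret}(\Sigma^1)\big)}
```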
- Learning Human Objectives by Evaluating Hypothetical Behavior.
  Reddy S, Dragan A, Levine S, et al. (ICML 2020) [paper]
  * Asks the human to describe each trajectory with one of three labels (good, unsafe, neutral).
- Widening the Pipeline in Human-Guided Reinforcement Learning with Explanation and Context-Aware Data Augmentation.
  Guan L, Verma M, Guo S S, et al. (NIPS 2021) [paper]
  * Asks the human to annotate task-relevant features of the observations.
- Boosting Offline Reinforcement Learning with Action Preference Query.
  Yang Q, Wang S, Lin M G, et al. (ICML 2023) [paper]
  * Assumes access to a limited budget of action-preference queries.
- Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences.
  Guan L, Valmeekam K, Kambhampati S. (ICLR 2023) [paper]
  * Symbolic human feedback.
- Beyond Reward: Offline Preference-guided Policy Optimization.
  Kang Y, Shi D, Liu J, et al. (ICML 2023) [paper]
  * Offline RL from preferences (same setting as Preference Transformer).
- Efficient Meta Reinforcement Learning for Preference-based Fast Adaptation.
  Ren Z, Liu A, Liang Y, et al. (NIPS 2022) [paper]
  * Meta-RL setting in which ground-truth rewards are available only during meta-training, while only preferences are available at meta-test time.
- Deploying Offline Reinforcement Learning with Human Feedback.
  Li Z, Xu K, Liu L, et al. (Preprint 2023) [paper]
  * Selects which offline-trained model to deploy downstream using human feedback.
- Boosting Offline Reinforcement Learning with Action Preference Query.
  Yang Q, Wang S, Lin M G, et al. (ICML 2023) [paper]
  * Assumes access to a limited budget of action-preference queries.
- Prompt-Tuning Decision Transformer with Preference Ranking.
  Hu S, Shen L, Zhang Y, et al. (arXiv) [paper]
  * Uses preferences to tune the prompt of a Decision Transformer.
- Skill Preferences: Learning to Extract and Execute Robotic Skills from Human Feedback.
  Wang X, Lee K, Hakhamaneshi K, et al. (CoRL 2021) [paper]
- Controlled Diversity with Preference: Towards Learning a Diverse Set of Desired Skills.
  Hussonnois M, Karimpanal T G, Rana S. (AAMAS 2023) [paper]
- Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences.
  Bıyık E, Losey D P, Palan M, et al. (IJRR 2022) [paper]
- Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations.
  Brown D, Goo W, Nagarajan P, et al. (ICML 2019) [paper]
- Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations.
  Brown D S, Goo W, Niekum S. (CoRL 2020) [paper]
- Learning to summarize from human feedback.
  Stiennon N, Ouyang L, Wu J, et al. (NIPS 2020) [paper]
- Training language models to follow instructions with human feedback.
  Ouyang L, Wu J, Jiang X, et al. (NIPS 2022) [paper]
- Chain of Hindsight Aligns Language Models with Feedback.
  Liu H, Sferrazza C, Abbeel P. (Preprint 2023) [paper]
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears.
  Yuan Z, Yuan H, Tan C, et al. (Preprint 2023) [paper]
- Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models.
  Xu W, Dong S, Arumugam D, et al. (Preprint 2023) [paper]
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
  Rafailov R, Sharma A, Mitchell E, et al. (arXiv) [paper]
- Dueling RL: Reinforcement Learning with Trajectory Preferences.
  Pacchiano A, Saha A, Lee J. (AISTATS 2023) [paper]
  * Regret analysis of a preference-based algorithm under assumptions on the trajectory embeddings and the preference model.
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation.
  Chen X, Zhong H, Yang Z, et al. (ICML 2022) [paper]
- Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning.
  Kong D, Yang L. (NIPS 2022) [paper]
- B-Pref: Benchmarking Preference-Based Reinforcement Learning.
  Lee K, Smith L, Dragan A, et al. (NIPS 2021 Track on Datasets and Benchmarks) [paper]