
Deep Reinforcement Learning for Recommender Systems

Courses on Deep Reinforcement Learning (DRL) and DRL papers for recommender systems

Courses

UCL Course on RL

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

CS 294-112 at UC Berkeley

http://rail.eecs.berkeley.edu/deeprlcourse/

Stanford CS234: Reinforcement Learning

http://web.stanford.edu/class/cs234/index.html

Book

  1. Reinforcement Learning: An Introduction (Second Edition). Richard S. Sutton and Andrew G. Barto. book

Papers

Paper to read

  1. A Brief Survey of Deep Reinforcement Learning. Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, Anil Anthony Bharath. 2017. paper
  2. Deep Reinforcement Learning: An Overview. Yuxi Li. 2017. paper
  3. Learning to Collaborate: Multi-Scenario Ranking via Multi-Agent Reinforcement Learning. Jun Feng, Heng Li, Minlie Huang, Shichen Liu, Wenwu Ou, Zhirong Wang, Xiaoyan Zhu. WWW 2018. paper
  4. Reinforcement Mechanism Design for e-commerce. Qingpeng Cai, Aris Filos-Ratsikas, Pingzhong Tang, Yiwei Zhang. WWW 2018. paper
  5. Deep Reinforcement Learning for Page-wise Recommendations. Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, Jiliang Tang. RecSys 2018. paper
  6. Stabilizing Reinforcement Learning in Dynamic Environment with Application to Online Recommendation. Shi-Yong Chen, Yang Yu, Qing Da, Jun Tan, Hai-Kuan Huang, Hai-Hong Tang. KDD 2018. paper
  7. Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application. Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, Yinghui Xu. KDD 2018. paper
  8. A Reinforcement Learning Framework for Explainable Recommendation. Xiting Wang, Yiru Chen, Jie Yang, Le Wu, Zhengtao Wu, Xing Xie. ICDM 2018. paper
  9. Aggregating E-commerce Search Results from Heterogeneous Sources via Hierarchical Reinforcement Learning. Ryuichi Takanobu, Tao Zhuang, Minlie Huang, Jun Feng, Haihong Tang, Bo Zheng. WWW 2019. paper
  10. Policy Gradients for Contextual Recommendations. Feiyang Pan, Qingpeng Cai, Pingzhong Tang, Fuzhen Zhuang, Qing He. WWW 2019. paper
  11. Reinforcement Knowledge Graph Reasoning for Explainable Recommendation. Yikun Xian, Zuohui Fu, S. Muthukrishnan, Gerard de Melo, Yongfeng Zhang. SIGIR 2019. paper
  12. A Model-Based Reinforcement Learning with Adversarial Training for Online Recommendation. Xueying Bai, Jian Guan, Hongning Wang. NeurIPS 2019. paper
  13. Text-Based Interactive Recommendation via Constraint-Augmented Reinforcement Learning. Ruiyi Zhang, Tong Yu, Yilin Shen, Hongxia Jin, Changyou Chen, Lawrence Carin. NeurIPS 2019. paper
  14. DRCGR: Deep reinforcement learning framework incorporating CNN and GAN-based for interactive recommendation. Rong Gao, Haifeng Xia, Jing Li, Donghua Liu, Shuai Chen, and Gang Chun. ICDM 2019. paper
  15. Pseudo Dyna-Q: A Reinforcement Learning Framework for Interactive Recommendation. Lixin Zou, Long Xia, Pan Du, Zhuo Zhang, Ting Bai, Weidong Liu, Jian-Yun Nie, Dawei Yin. WSDM 2020. paper
  16. End-to-End Deep Reinforcement Learning based Recommendation with Supervised Embedding. Feng Liu, Huifeng Guo, Xutao Li, Ruiming Tang, Yunming Ye, Xiuqiang He. WSDM 2020. paper
  17. Reinforced Negative Sampling over Knowledge Graph for Recommendation. Xiang Wang, Yaokun Xu, Xiangnan He, Yixin Cao, Meng Wang, Tat-Seng Chua. WWW 2020. paper
  18. An MDP-Based Recommender System. Guy Shani, David Heckerman, Ronen I. Brafman. JMLR 2005. paper
  19. Usage-Based Web Recommendations: A Reinforcement Learning Approach. Nima Taghipour, Ahmad Kardan, Saeed Shiry Ghidary. RecSys 2007. paper

Preprint Papers

  1. Reinforcement Learning based Recommender System using Biclustering Technique. Sungwoon Choi, Heonseok Ha, Uiwon Hwang, Chanju Kim, Jung-Woo Ha, Sungroh Yoon. arxiv 2018. paper
  2. Deep Reinforcement Learning based Recommendation with Explicit User-Item Interactions Modeling. Feng Liu, Ruiming Tang, Xutao Li, Weinan Zhang, Yunming Ye, Haokun Chen, Huifeng Guo, Yuzhou Zhang. arxiv 2018. paper
  3. Model-Based Reinforcement Learning for Whole-Chain Recommendations. Xiangyu Zhao, Long Xia, Yihong Zhao, Dawei Yin, Jiliang Tang. arxiv 2019. paper

Reviewed Papers

DJ-MC: A Reinforcement-Learning Agent for Music Playlist Recommendation. Elad Liebman, Maytal Saar-Tsechansky, Peter Stone. AAMAS 2015. paper

  • Formulates the selection of which sequence of songs to play as a Markov Decision Process, and demonstrates the potential effectiveness of a reinforcement-learning-based approach in a new practical domain
  • DJ-MC is able to generate personalized song sequences within a single listening session of just 25–50 songs.
  • Playlists are of length k
  • Finite set of n musical tracks
  • States - the ordered sequence of songs played so far
  • Actions - next song to play
  • Reward - the utility of the listener's pleasure when hearing song a in state s. It is split into two parts: a reward function over songs and a reward function over transitions.
  • Deterministic transition function P
  • The agent must internally represent states and actions in such a way that enables generalization of the listener’s preferences
  • Use descriptors for song representations. Each song can be factored as a vector of scalar descriptors that reflect details about the spectral fingerprint of the song, its rhythmic characteristics, its overall loudness, and their change over time.
  • To represent different listener types, the authors generate 10 playlist clusters by applying k-means clustering to the playlists
  • Used the Million Song Dataset together with Yes.com and Last.fm data for the experimental setup
  • For each experiment, a 1000-song corpus is sampled from the Million Song Dataset.
  • The architecture consists of two components: learning the listener's parameters and planning the sequence of songs
  • Uses a tree-search heuristic for planning: the agent repeatedly simulates randomly generated trajectories, as many as time allows, and keeps the one that yields the highest expected payoff (a minimal sketch follows this list)
  • Compares the agent against two baselines: random choice and a greedy algorithm that always plays the song with the highest song reward.
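
A minimal Python sketch of the planning step described above. `listener_reward` is a hypothetical stand-in for the learned listener model (song reward plus transition reward), and the horizon/budget values are illustrative: the planner simulates random playlist continuations and keeps the best-scoring first song.

```python
import random

def plan_next_song(state, candidate_songs, listener_reward,
                   horizon=10, budget=100):
    """Tree-search-style planning by random trajectory simulation.

    `listener_reward(state, song)` is a hypothetical callable standing in
    for the learned listener model. Returns the first song of the best
    random trajectory found within the simulation budget.
    """
    best_payoff, best_first_song = float("-inf"), None
    for _ in range(budget):
        sim_state = list(state)          # copy of the ordered play history
        trajectory, payoff = [], 0.0
        for _ in range(horizon):
            song = random.choice(candidate_songs)
            payoff += listener_reward(sim_state, song)
            sim_state.append(song)       # deterministic transition: append the song
            trajectory.append(song)
        if payoff > best_payoff:
            best_payoff, best_first_song = payoff, trajectory[0]
    return best_first_song
```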

DRN: A Deep Reinforcement Learning Framework for News Recommendation. Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, Zhenhui Li. WWW 2018. paper

  • To better model the dynamic nature of news characteristics and user preferences, the authors propose a Deep Q-Learning (DQN) framework
  • Considers user return as another form of user feedback by maintaining an activeness score for each user
  • Applies a Dueling Bandit Gradient Descent (DBGD) method for exploration, choosing candidate items in the neighborhood of the current recommender, which avoids totally unrelated items (the exploration baselines are epsilon-greedy and upper confidence bound). The agent generates a recommendation list L using the current network Q and another list L~ using an exploration network Q~, whose parameters W~ are obtained by adding small noise to W; the two lists are then probabilistically interleaved into the merged recommendation list L^.

  • News recommendation systems:
      1. Content-based systems that maintain news term-frequency features and user profiles (based on historical news); the recommender then selects the items most similar to the profile.
      2. Collaborative filtering models that utilize past ratings of the current user or of similar users.
      3. Combinations of the two.
  • Challenges: 1) online learning is needed due to the highly dynamic nature of news characteristics and user preferences; 2) click/no-click labels do not capture full feedback about the news; 3) traditional recommendation methods tend to recommend similar items, which narrows down the user's choices.
  • Multi-layer DQN is used to predict the reward
  • Push: at each timestamp (t1, t2, t3, t4, t5, ...), when a user sends a news request to the system, the recommendation agent G takes the feature representation of the current user and the news candidates as input and generates a top-k list L of news to recommend. L is generated by combining exploitation of the current model and exploration of novel items
  • Feedback: user u who received the recommended news L gives feedback B via clicks on this set of news.
  • Minor update: after each timestamp, using the feature representation of the previous user u, the news list L, and the feedback B, agent G updates the model by comparing the performance of the exploitation network Q and the exploration network Q~. If Q~ gives the better result, the current network is updated towards Q~.
  • Major update: after a certain time interval, agent G uses the feedback B and the user activeness stored in memory to update the network Q, sampling a batch of records for each update.
  • Feature construction:
    • News features: one-hot encoded features that describe whether a certain property appears in this piece of news
    • User features: mainly describe the features (i.e., headline, provider, ranking, entity name, category, and topic category) of the news that the user clicked in the last 1 hour, 6 hours, 24 hours, 1 week, and 1 year respectively
    • User-news features: the interaction between the user and one certain piece of news, i.e., the frequency of the entity (also category, topic category, and provider) appearing in the history of the user's readings
    • Context features: the context when a news request happens, including time, weekday, and the freshness of the news
  • State features: User features and context features
  • Action features: news features and user-news features
  • Divides the Q-function into a value function V(s) and an advantage function A(s, a), where V(s) is determined only by the state features and A(s, a) by both the state and action features (a minimal sketch of this dueling decomposition follows these notes)

  • Uses a survival model to model user return and user activeness
  • Closed dataset
  • No code
  • Metrics: Precision@k, NDCG
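
A minimal PyTorch sketch of the dueling split mentioned above: V(s) is computed from the state features only, A(s, a) from state plus action features, and the predicted reward is their sum. Layer sizes and feature dimensions are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + A(s, a), following the split described in DRN.

    state_dim:  user + context features
    action_dim: news + user-news features
    Hidden sizes are illustrative assumptions.
    """
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.value = nn.Sequential(                 # V(s): state features only
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(             # A(s, a): state + action features
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        v = self.value(state)
        a = self.advantage(torch.cat([state, action], dim=-1))
        return v + a                                # predicted reward / Q-value

# usage: q = DuelingQNetwork(state_dim=32, action_dim=16)
#        q(torch.randn(8, 32), torch.randn(8, 16))  -> shape (8, 1)
```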

Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning. Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, Dawei Yin. KDD 2018. paper

  • Collaborative filtering and content-based models may fail given the dynamic nature of users' preferences, and current RL-based systems are designed to maximize short-term rewards
  • Propose a framework DEERS to model positive and negative feedback simultaneously.
  • State space: the user's browsing history before time t
  • Action space: the items recommended to the user at time t
  • Rewards: immediate reward (click, skip, etc.)
  • Transition probability: the probability of transitioning from the current state to the next one after an action
  • Discount factor: a factor to measure the present value of future responses
  • Uses a DQN as the base for the DEERS model. Instead of just concatenating clicked/ordered items into a positive state s+, introduces an RNN with a GRU architecture to capture the user's sequential preferences. Positive and negative inputs are fed separately into the input layer, with separate first hidden layers (see the sketch after this list)
  • Adds a regularization term to maximize the difference between the target and competitor items
  • Uses experience replay and separate evaluation and target networks, which help smooth learning and avoid divergence of the parameters
  • Closed dataset based on JD.com data
  • MAP and NDCG@40 are selected as the key evaluation metrics
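
A minimal PyTorch sketch of the separate positive/negative streams described above: two GRUs encode the clicked/ordered and skipped item sequences, each stream has its own first hidden layer, and a shared head outputs Q(s, a). The layer sizes and the single shared output head are assumptions.

```python
import torch
import torch.nn as nn

class DEERSQNetwork(nn.Module):
    """GRU-based Q-network with separate positive/negative feedback streams."""
    def __init__(self, item_dim, action_dim, hidden=32):
        super().__init__()
        self.pos_gru = nn.GRU(item_dim, hidden, batch_first=True)  # clicked/ordered items
        self.neg_gru = nn.GRU(item_dim, hidden, batch_first=True)  # skipped items
        self.pos_fc = nn.Linear(hidden + action_dim, hidden)       # separate first hidden layers
        self.neg_fc = nn.Linear(hidden + action_dim, hidden)
        self.q_head = nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden, 1))

    def forward(self, pos_seq, neg_seq, action):
        # pos_seq, neg_seq: (batch, seq_len, item_dim); action: (batch, action_dim)
        _, s_plus = self.pos_gru(pos_seq)       # final hidden state as s+
        _, s_minus = self.neg_gru(neg_seq)      # final hidden state as s-
        h_plus = self.pos_fc(torch.cat([s_plus[-1], action], dim=-1))
        h_minus = self.neg_fc(torch.cat([s_minus[-1], action], dim=-1))
        return self.q_head(torch.cat([h_plus, h_minus], dim=-1))   # Q(s, a)
```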

Top-K Off-Policy Correction for a REINFORCE Recommender System. Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, Ed H. Chi. WSDM 2019. paper, video

  • Builds on a policy-based algorithm, REINFORCE.
  • Live experiment on YouTube data.
  • State - user interest, context
  • Action - items available for recommendation; the action space is constrained by sampling each item according to the softmax policy.
  • Reward - clicks, watch time. Since the recommender shows a page of k items at a time, the expected reward of the set equals the sum of the expected rewards of the items in the set.
  • Discount rate - the trade-off between an immediate reward and long-term reward
  • Relies on logged feedback of actions chosen by a historical policy (or a mixture of policies), since it is not possible to perform online updates of the policy.
  • Model state transition with a recurrent neural network (tested LSTM, GRU, Chaos Free RNN)
  • Feedback is biased by the previous recommender; the policy is not updated frequently, which also causes bias, and feedback can come from other agents in production.
  • Agent is refreshed every 5 days
  • Uses the target policy (on-policy) and a learned behavior policy (off-policy) to down-weight actions from non-target policies.
  • The epsilon-greedy policy is not acceptable for YouTube as it can cause inappropriate recommendations and a bad user experience; Boltzmann exploration is employed instead.
  • The target policy is trained with a weighted softmax to maximize the long-term reward (the corrected gradient is sketched after this list).
  • The behavior policy is trained using state/action pairs from logged behavior/feedback.
  • In the online experiment, the reward is aggregated over a time horizon of 4-10 hours.
  • Parameters and architecture are primarily shared.
  • RNN is used to represent the user state
  • Private dataset
  • Code reproduction by Bayes group: [code](https://github.com/awarebayes/RecNN)
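
A small PyTorch sketch of the corrected policy-gradient objective: logged actions are re-weighted by the capped importance weight pi/beta and by the top-K multiplier K(1 - pi)^(K-1). The cap value and K are illustrative hyperparameters, and the loss is a simplified stand-in for the paper's full training setup.

```python
import torch

def topk_off_policy_loss(log_pi, log_beta, rewards, K=16, cap=10.0):
    """Top-K off-policy corrected REINFORCE loss (negative of the gradient estimator).

    log_pi:   log pi_theta(a|s) for the logged actions (requires grad)
    log_beta: log beta(a|s) from the learned behavior policy (no grad)
    rewards:  observed long-term rewards for the logged actions
    K and the importance-weight cap are illustrative hyperparameters.
    """
    pi = log_pi.exp()
    iw = (log_pi - log_beta).exp().clamp(max=cap).detach()   # capped importance weight
    lambda_k = (K * (1.0 - pi).pow(K - 1)).detach()          # top-K correction multiplier
    # REINFORCE: maximize E[iw * lambda_k * r * log pi] -> minimize the negative
    return -(iw * lambda_k * rewards * log_pi).mean()
```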

Generative Adversarial User Model for Reinforcement Learning Based Recommendation System. Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, Le Song. ICML 2019. paper

  • Environment - a logged online user who can click on one of the k items displayed by the recommendation system in each page view (or interaction)
  • State S - the ordered sequence of the user's historical clicks
  • Actions - the subset of k items chosen by the recommender to display to the user; actions range over all size-k subsets of the total item space I at time t.
  • State transition - given by the user behavior model, which returns the transition probability given the previous state and the set of items A displayed to the user.
  • Reward - user's utility or satisfaction after making her choice in the state.
  • Policy - the probability of displaying a particular subset of items at the current user state
  • Inspired by GANs, the behavior function and the reward function are estimated in a min-max fashion: given a trajectory of T observed actions and the corresponding clicked items, the behavior and reward functions are learned jointly (a simplified sketch follows these notes).

  • Designs novel cascading Q-networks to handle the combinatorial action space, and an algorithm to estimate the parameters by interacting with the GAN user model.
  • Public datasets: MovieLens, Last.fm, Yelp, RecSys15 YooChoose
  • Private datasets: Taobao, Ant Financial
  • Since online experiments were not possible, data collected from an online news platform is used to fit a user model, which then serves as the test environment.
  • Online metrics: cumulative reward, CTR
  • Offline metrics: top-k precision
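
A heavily simplified PyTorch sketch of the min-max idea above: a user (behavior) model chooses among the k displayed items via a softmax over learned scores, while a reward model is trained so that items users actually clicked score higher than the items the user model would choose. The alternating updates, network shapes, and entropy weight are all assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

item_dim, hidden = 16, 32
user_model = nn.Sequential(nn.Linear(item_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
reward_model = nn.Sequential(nn.Linear(item_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
opt_user = torch.optim.Adam(user_model.parameters(), lr=1e-3)
opt_reward = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def train_step(displayed, clicked_idx):
    """displayed: (batch, k, item_dim) items shown; clicked_idx: (batch,) index of the click."""
    # Reward step: push the reward of real clicks above the user model's expected reward.
    scores = user_model(displayed).squeeze(-1)                 # (batch, k)
    choice_probs = F.softmax(scores, dim=-1).detach()
    rewards = reward_model(displayed).squeeze(-1)              # (batch, k)
    real_r = rewards.gather(1, clicked_idx.unsqueeze(1)).squeeze(1)
    model_r = (choice_probs * rewards).sum(dim=-1)
    loss_reward = (model_r - real_r).mean()                    # raise real, lower model-chosen
    opt_reward.zero_grad(); loss_reward.backward(); opt_reward.step()

    # User-model step: maximize expected reward plus an entropy regularizer.
    scores = user_model(displayed).squeeze(-1)
    choice_probs = F.softmax(scores, dim=-1)
    rewards = reward_model(displayed).squeeze(-1).detach()
    entropy = -(choice_probs * choice_probs.clamp_min(1e-9).log()).sum(dim=-1)
    loss_user = -((choice_probs * rewards).sum(dim=-1) + 0.1 * entropy).mean()
    opt_user.zero_grad(); loss_user.backward(); opt_user.step()
```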

Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems. Lixin Zou, Long Xia, Zhuoye Ding, Jiaxing Song, Weidong Liu, Dawei Yin. KDD 2019. paper

  • Introduces an RL framework, FeedRec, to optimize long-term user engagement. It is based on a Q-network designed with hierarchical LSTMs and an S-network that simulates the environment
  • The S-network assists the Q-network and avoids instability of convergence in policy learning. Specifically, in each round of recommendation, aligned with the user feedback, the S-network generates the user's response: the dwell time, the revisit time, and a flag indicating whether the user will leave the platform.
  • The model handles both instant metrics (clicks, likes, CTR) and delayed metrics (dwell time, revisit, etc.)
  • Trained on an internal e-commerce dataset with pre-trained embeddings, learned by modeling users' clicking streams with skip-gram.
  • State - the user's browsing history
  • Action - the finite space of items
  • Transition - the probability of reaching the next state after taking action a_t
  • Reward - a weighted sum of different metrics; the paper gives several instantiations of the reward function, covering both instant and delayed metrics.
  • The primary user behaviors, such as clicks, skips, and purchases, are tracked separately with different LSTM pipelines, so that each behavior is captured by its corresponding LSTM layer; this avoids dominance by intensive behaviors and captures behavior-specific characteristics (sketched below).
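
A minimal PyTorch sketch of the behavior-specific LSTM idea above: each feedback type (click, skip, purchase) is encoded by its own LSTM and the final hidden states are concatenated into the state representation fed to the Q-network. The flat concatenation is a simplification of the paper's hierarchical design.

```python
import torch
import torch.nn as nn

class BehaviorStateEncoder(nn.Module):
    """Separate LSTM per behavior type, concatenated into one state vector."""
    def __init__(self, item_dim, hidden=32, behaviors=("click", "skip", "purchase")):
        super().__init__()
        self.lstms = nn.ModuleDict(
            {b: nn.LSTM(item_dim, hidden, batch_first=True) for b in behaviors})

    def forward(self, sequences):
        # sequences: dict mapping behavior -> (batch, seq_len, item_dim) item sequences
        finals = []
        for behavior, lstm in self.lstms.items():
            _, (h_n, _) = lstm(sequences[behavior])
            finals.append(h_n[-1])                 # final hidden state per behavior
        return torch.cat(finals, dim=-1)           # state representation for the Q-network

# usage sketch:
# enc = BehaviorStateEncoder(item_dim=16)
# state = enc({b: torch.randn(4, 10, 16) for b in ("click", "skip", "purchase")})
```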

  1. Environment reconstruction with hidden confounders for reinforcement learning based recommendation. Wenjie Shang, Yang Yu, Qingyang Li, Zhiwei Qin, Yiping Meng, Jieping Ye. KDD 2019. paper
  2. Exact-K Recommendation via Maximal Clique Optimization. Yu Gong, Yu Zhu, Lu Duan, Qingwen Liu, Ziyu Guan, Fei Sun, Wenwu Ou, Kenny Q. Zhu. KDD 2019. paper
  3. Hierarchical Reinforcement Learning for Course Recommendation in MOOCs. Jing Zhang, Bowen Hao, Bo Chen, Cuiping Li, Hong Chen, Jimeng Sun. AAAI 2019. paper
  4. Large-scale Interactive Recommendation with Tree-structured Policy Gradient. Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, Yong Yu. AAAI 2019. paper

Virtual-Taobao: Virtualizing real-world online retail environment for reinforcement learning. Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, An-Xiang Zeng. AAAI 2019. paper

  • Instead of training reinforcement learning in Taobao directly, the authors first build Virtual Taobao, a simulator learned from historical customer behavior data through the proposed GAN-SD (GAN for Simulating Distributions) and MAIL (Multi-agent Adversarial Imitation Learning), and then train policies in Virtual Taobao with no physical cost, where an ANC (Action Norm Constraint) strategy is proposed to reduce over-fitting.
  • Compared with the traditional supervised learning approach, the strategy trained in Virtual Taobao achieves more than a 2% improvement in revenue in the real environment.
  • Agent - the search engine
  • Environment - the customers
  • The commodity search and shopping process can be seen as a sequential decision process for both the engine and the customers; the engine and the customers are each other's environments

Deep Reinforcement Learning for List-wise Recommendations. Xiangyu Zhao, Liang Zhang, Long Xia, Zhuoye Ding, Dawei Yin, Jiliang Tang. 2017. paper

  • Utilizes the Actor-Critic framework to cope with the large and growing number of items available for recommendation.

  • Proposes an online environment simulator to pre-train parameters offline and evaluate the model before applying it online.

  • The recommender agent interacts with the environment (users) by choosing items over a sequence of steps.

  • State - the browsing history of the user, i.e., the previous N items the user browsed. The browsing history is stored in a memory M; a better choice is to consider only the N previously clicked/ordered items, since positive items carry the key information about the user's preferences.

  • Action - a list of items recommended to the user at time t based on the current state.

  • Reward - clicks, order, etc

  • Transition probability - the probability of transitioning from the current state to the next one after the recommendation.

  • At each step, K items are observed in temporal order.

  • Utilizes the DDPG algorithm with experience replay to train the parameters of the framework (a compact sketch follows these notes).

  • MAP and NDCG to evaluate performance.

  • No code

  • Private dataset
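
A compact PyTorch sketch of the Actor-Critic (DDPG) training step described above, with a deterministic actor, a critic Q(s, a), and soft target updates. The network sizes, gamma, tau, and the way the continuous action vector would be turned into an actual item list are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim), nn.Tanh())
    def forward(self, s):                # deterministic action used to score/rank items
        return self.net(s)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def ddpg_update(batch, actor, critic, actor_t, critic_t, opt_a, opt_c,
                gamma=0.95, tau=0.01):
    """One DDPG step on a batch of (s, a, r, s') transitions from the replay buffer.

    Shapes: s (batch, state_dim), a (batch, action_dim), r (batch, 1), s_next like s.
    """
    s, a, r, s_next = batch
    with torch.no_grad():
        target_q = r + gamma * critic_t(s_next, actor_t(s_next))
    critic_loss = nn.functional.mse_loss(critic(s, a), target_q)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    actor_loss = -critic(s, actor(s)).mean()        # maximize Q under the actor
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    for t_net, net in ((actor_t, actor), (critic_t, critic)):   # soft target update
        for p_t, p in zip(t_net.parameters(), net.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

# targets start as copies, e.g.:
# actor, critic = Actor(20, 8), Critic(20, 8)
# actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)
```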

Deep Reinforcement Learning based Recommendation with Explicit User-Item Interactions Modeling. Feng Liu, Ruiming Tang, Xutao Li, Weinan Zhang, Yunming Ye, Haokun Chen, Huifeng Guo, Yuzhou Zhang. 2019. paper

  • The DRR framework treats recommendation as a sequential decision-making procedure and adopts an Actor-Critic reinforcement learning scheme to model the interactions between the user and the recommended items, considering both the immediate and the long-term reward.

  • State - the user's positive interaction history with the recommender as well as her demographic information.

  • Actions - a continuous parameter vector a. Each item has a ranking score, defined as the inner product of the action and the item embedding.

  • Transitions - once the user's feedback is collected, the state transition is determined

  • Reward - clicks, ratings, etc.

  • Discount rate - the trade-off between immediate reward and long-term reward

  • Developed three state-representation models:

    • DRR-p: utilizes the element-wise product operator to capture the pairwise local dependencies between items.
    • DRR-u: in addition to the local dependencies between items, the pairwise user-item interactions are also taken into account.
    • DRR-ave: average pooling over DRR-p; the resulting vector is used to model the interaction with the user, and the interaction vector is then concatenated with the user embedding. In the offline evaluation, DRR-ave outperforms DRR-p and DRR-u (a minimal sketch follows these notes).
  • Epsilon-greedy exploration in the Actor-network.

  • Evaluated on offline datasets (MovieLens, Yahoo! Music, Jester)

  • Online evaluation with an environment simulator: a PMF (probabilistic matrix factorization) model is pre-trained and used as the environment simulator.
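
A minimal PyTorch sketch of the DRR-ave state representation and the ranking score described above; the plain (unweighted) average pooling is a simplification of the paper's weighted pooling layer.

```python
import torch

def drr_ave_state(user_emb, item_embs):
    """DRR-ave style state: [user, user * avg(items), avg(items)].

    user_emb:  (batch, d) user embedding
    item_embs: (batch, n, d) embeddings of the n most recent positively rated items
    A plain mean is used here; the paper uses a weighted average pooling layer.
    """
    ave = item_embs.mean(dim=1)                     # average pooling over items
    interaction = user_emb * ave                    # element-wise user-item interaction
    return torch.cat([user_emb, interaction, ave], dim=-1)   # state of size 3*d

def ranking_scores(action, candidate_embs):
    """Ranking score of each candidate: inner product of the actor's action and the item embedding."""
    # action: (batch, d); candidate_embs: (batch, m, d)
    return torch.einsum("bd,bmd->bm", action, candidate_embs)
```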

Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology. Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Morgane Lustman, Vince Gatto, Paul Covington, Jim McFadden, Tushar Chandra, Craig Boutilier. 2019. paper video

  • Develops SlateQ, a decomposition of value-based temporal-difference and Q-learning that renders RL tractable with slates
  • The LTV of a slate can be decomposed into a tractable function of the item-wise LTVs of its component items
  • Introduces a recommender simulation environment, RecSim, that allows straightforward configuration of an item collection (or vocabulary), a user (latent) state model, and a user choice model.
  • Session optimization
  • State - static user features such as demographics and declared interests, user context, and a summarization of the previous history
  • Actions - the set of all possible slates.
  • Transition probability
  • Reward - measurement of user engagement.
  • In the MDP model, each user should be viewed as a separate environment or separate MDP; hence it is critical to allow generalization across users, since few if any users generate enough experience to allow reasonable recommendations.
  • Combinatorial optimization problem: find the slate with the maximum Q-value. SlateQ allows decomposing it into a combination of the item-wise Q-values of the constituent items (a small sketch follows these notes). Top-k, greedy, and LP-based methods are tried.
  • Items and user interests as a topic modeling problem
  • As a choice model, use an exponential cascade model that accounts for document position in the slate. This choice model assumes "attention" is given to one document at a time, with exponentially decreasing attention given to documents as a user moves down the slate.
  • The user choice model determines which document (if any) from the slate is consumed by the user. It is assumed that a user can observe any recommended document's topic before selection but cannot observe its quality before consumption
  • A user's interest in document d defines the document's relative appeal to the user and serves as the basis of the choice function.
  • Models are trained periodically and pushed to the server. The ranker uses the latest model to recommend items and logs user feedback, which is used to train new models. Using LTV labels, iterative model training, and pushing can be viewed as a form of generalized policy iteration.
  • A candidate generator retrieves a small subset (hundreds) of items from a large corpus that best match the user context. The ranker scores/ranks the candidates using a DNN with both user context and item features as input, optimizing a combination of several objectives (e.g., clicks, expected engagement, several other factors).
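
A small NumPy sketch of the decomposition referenced above: the slate value is the choice-probability-weighted sum of item-wise values, with a simple conditional-logit choice model that includes a no-click option. The scores, the null-option score, and the top-k heuristic are illustrative.

```python
import numpy as np

def slate_q_value(item_q, item_scores, null_score=1.0):
    """SlateQ-style decomposition: Q(s, A) = sum_i P(i | s, A) * Q_item(s, i).

    item_q:      item-wise long-term values Q(s, i) for the items in the slate
    item_scores: positive user-choice-model scores v(s, i) (e.g., exponentiated utilities)
    null_score:  score of the "no click" option (illustrative value)
    """
    item_q = np.asarray(item_q, dtype=float)
    scores = np.asarray(item_scores, dtype=float)
    probs = scores / (scores.sum() + null_score)    # P(i | s, A) under a conditional logit
    return float((probs * item_q).sum())

def greedy_slate(item_q, item_scores, k):
    """A simple top-k heuristic: rank candidates by score * Q and take the best k."""
    order = np.argsort(-np.asarray(item_scores) * np.asarray(item_q))
    return order[:k].tolist()

# usage: slate_q_value([0.5, 1.2, 0.9], [2.0, 1.0, 3.0]) -> weighted slate value
```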

RECSIM: A Configurable Simulation Platform for Recommender Systems. Eugene Ie, Chih-Wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, Craig Boutilier. 2019. paper

  • RECSIM is a configurable platform that allows the natural, albeit abstract, specification of an environment in which a recommender interacts with a corpus of documents (or recommendable items) and a set of users, to support the development of recommendation algorithms.
  • The user model samples users from a prior distribution over (configurable) user features: these may include latent features such as personality, satisfaction, interests; observable features such as demographics; and behavioral features such as session length, visit frequency, or time budget
  • The document model samples items from a prior over document features, which again may incorporate latent features such as document quality, and observable features such as topic, document length and global statistics (e.g., ratings, popularity)
  • User response is determined by user choice model. Once the document is consumed the user state undergoes a transition through a configurable (user) transition model
  • The user model assumes that users have varying degrees of interest in each topic (with some prior distribution), ranging from -1 to 1.

  • RECSIM offers stackable hierarchical agent layers intended to solve more abstract recommendation problems. A hierarchical agent layer does not materialize a slate of documents (i.e., an RS action), but relies on one or more base agents to do so.
  • RECSIM can be viewed as a dynamic Bayesian network that defines a probability distribution over trajectories of slates, choices, and observations.
  • Several baseline agents are implemented (a generic interaction loop is sketched below):
    • TabularQAgent - a Q-learning agent that enumerates all possible slates to maximize over Q-values
    • FullSlateQAgent - a deep Q-network agent that treats each slate as a single action
    • RandomAgent - recommends random slates with no duplicates
  • [RECSIM code](https://github.com/google-research/recsim)
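
RECSIM environments expose a Gym-style reset/step interface, so a basic agent-environment loop looks roughly like the sketch below; `make_env` and `agent` are hypothetical placeholders standing in for a configured RECSIM environment and one of the baseline agents (the real agent API differs slightly, e.g. agents may also receive the last reward).

```python
def run_episodes(make_env, agent, num_episodes=10):
    """Generic Gym-style interaction loop; make_env and agent are hypothetical placeholders."""
    env = make_env()
    for _ in range(num_episodes):
        observation = env.reset()
        done, total_reward = False, 0.0
        while not done:
            slate = agent.step(observation)                 # agent picks a slate of documents
            observation, reward, done, _ = env.step(slate)  # user choice + state transition
            total_reward += reward
        print("episode return:", total_reward)
```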
