```python
R = 0  # running discounted return, initialized to 0 at episode end
policy_losses = []  # list to save actor (policy) loss
value_losses = []   # list to save critic (value) loss
returns = []        # list to save the true values

# calculate the true value using rewards returned from the environment
for r in self.model.rewards[::-1]:
    # calculate the discounted value
    R = r + self.gamma * R
    returns.insert(0, R)
returns = torch.tensor(returns)
```
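As a standalone check of what this loop computes, here is a minimal runnable sketch of the same backward accumulation on a toy reward list (the `rewards` and `gamma` values are made up for illustration):

```python
import torch

gamma = 0.99               # illustrative discount factor
rewards = [1.0, 1.0, 1.0]  # toy per-step rewards

R = 0.0
returns = []
# walk the rewards backwards, accumulating the discounted return
for r in reversed(rewards):
    R = r + gamma * R
    returns.insert(0, R)

print(torch.tensor(returns))  # tensor([2.9701, 1.9900, 1.0000])
```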
Shouldn't R here be initialized to the critic's value estimate of the last state, rather than 0?
If I understand correctly, when R starts at 0 the returns for the later timesteps become less accurate (they accumulate the fewest reward terms and lean most on that initial value), and the update also fails to reflect temporal-difference bootstrapping.
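To make the suggestion concrete, here is a minimal sketch of the bootstrapped variant I have in mind, assuming a hypothetical `last_state` tensor, a `done` flag, and that `self.model` returns `(action_probs, state_value)` (these names are illustrative, not necessarily this repo's actual API):

```python
# Hypothetical sketch: bootstrap from the critic's estimate instead of 0
# when the episode was truncated rather than properly terminated.
with torch.no_grad():
    _, last_value = self.model(last_state)  # assumed: critic's V(s_T)
R = 0.0 if done else last_value.item()     # 0 only at a true terminal state

returns = []
for r in self.model.rewards[::-1]:
    R = r + self.gamma * R
    returns.insert(0, R)
returns = torch.tensor(returns)
```

At a true terminal state, V(s_T) = 0 by definition, so initializing R to 0 is correct there; the difference only shows up when the episode is cut off by a step limit.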