Can you explain the goal in computing value loss and action loss when you update the p


About the operation in updating policy net (arel issue, closed)

eric-xw commented on July 23, 2024

About the operation in updating policy net

from arel.

Comments (3)

eric-xw commented on July 23, 2024

Hi,

The RL part is a standard implementation of policy gradient with a baseline, and the overall implementation is aligned with the formulation in the paper. Can you be more specific about your questions?
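For readers unfamiliar with the term, here is a minimal sketch of what "policy gradient with a baseline" usually looks like as a pair of losses. It is plain Python rather than PyTorch so it runs standalone, and the function names `action_loss` and `value_loss`, as well as the sample numbers, are hypothetical illustrations, not taken from the arel code.

```python
def action_loss(log_probs, rewards, baselines):
    """Policy-gradient surrogate loss with a baseline (REINFORCE-style).

    Minimizing this value by gradient descent on the policy parameters
    behind log_probs is the same as ascending the expected reward. The
    advantage (reward - baseline) is treated as a constant here.
    """
    advantages = [r - b for r, b in zip(rewards, baselines)]
    weighted = [a * lp for a, lp in zip(advantages, log_probs)]
    return -sum(weighted) / len(weighted)


def value_loss(rewards, baselines):
    """Mean squared error that fits the baseline to the observed rewards."""
    errors = [(b - r) ** 2 for r, b in zip(rewards, baselines)]
    return sum(errors) / len(errors)


# Hypothetical numbers for two sampled sequences.
log_probs = [-1.0, -2.0]
rewards = [1.0, 0.0]
baselines = [0.5, 0.5]
print(action_loss(log_probs, rewards, baselines))
print(value_loss(rewards, baselines))
```

The baseline only reduces the variance of the gradient estimate; whether arel computes its baseline this way is an assumption of this sketch.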

Thanks,


qingyue2014 commented on July 23, 2024

Sorry, my previous statement was not clear. My questions are as follows.

1. About the loss of the reward net. In your paper, the objective of the reward function is to minimize the expectation of the reward under the empirical distribution minus the reward under the policy network's distribution. But in your code, the sign of the loss (train_AREL.py, line 138) is the opposite:
   loss = -torch.sum(gt_score) + torch.sum(gen_score)
   Why?
2. About the loss of the policy net. The variable opt.rl_weight is used in calculating the loss. What do the variables loss and tf_loss mean?
   loss = opt.rl_weight * loss + (1 - opt.rl_weight) * tf_loss

Looking forward to your reply! Thanks.


eric-xw commented on July 23, 2024

In the paper, we show the objective functions to be maximized (gradient ascent). In practice, we usually minimize the corresponding loss functions (the negated objectives) with gradient descent instead, so the two are equivalent.
tf_loss is the cross-entropy loss, which is included to help stabilize training.
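A minimal sketch of the two points in this reply, written in plain Python so it runs without PyTorch. It assumes the paper's reward objective is the ground-truth score minus the generated score, which is what the sign flip in the quoted line suggests. The function names and numbers are hypothetical; only the two expressions in the comments come from the thread.

```python
def reward_objective(gt_scores, gen_scores):
    """Quantity to maximize (gradient ascent), as assumed for the paper."""
    return sum(gt_scores) - sum(gen_scores)


def reward_loss(gt_scores, gen_scores):
    """Quantity to minimize (gradient descent).

    Mirrors: loss = -torch.sum(gt_score) + torch.sum(gen_score)
    """
    return -sum(gt_scores) + sum(gen_scores)


def mixed_policy_loss(rl_loss, tf_loss, rl_weight):
    """Weighted mix of the RL loss and the cross-entropy loss.

    Mirrors: loss = opt.rl_weight * loss + (1 - opt.rl_weight) * tf_loss
    """
    return rl_weight * rl_loss + (1 - rl_weight) * tf_loss


# Hypothetical scores for a small batch.
gt = [1.0, 2.0]
gen = [0.5, 0.5]

# The loss is exactly the negated objective, so minimizing one
# maximizes the other.
assert reward_loss(gt, gen) == -reward_objective(gt, gen)

# With rl_weight = 0.25, a quarter of the signal comes from the RL loss.
print(mixed_policy_loss(rl_loss=2.0, tf_loss=4.0, rl_weight=0.25))
```

Setting `rl_weight` to 1 leaves only the RL loss, and setting it to 0 leaves only the cross-entropy term.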

