Authors:
Pasianotto Isac
Paglierani Yuri
Date:
2023-07-21
2048 is a single-player game in which the player slides tiles on a grid to combine them and create a tile with the number 2048. The game is played on a 4x4 grid, which initially contains two tiles; every turn a new tile with the number 2 (or, more rarely, 4) appears at a random empty spot on the board.
The player can slide the tiles in four directions: up, down, left, and right. A slide moves every tile as far as possible in the chosen direction until it is stopped by another tile or by the edge of the grid. If two tiles with the same number collide while moving, they merge into a single tile with the sum of their values, and the total score is incremented by the value of the newly formed tile. The resulting tile cannot merge again in the same move. Note that, due to the game rules, every tile on the board is always a power of 2.
Fig. 1: example of a move "right" performed on a board
We say that the game is won when a tile with a value of 2048 appears on the board (in the original game the player can continue playing after reaching the 2048 tile, to reach higher scores). A game is lost when the player cannot make a valid move anymore, which happens when the board is filled with tiles and no two adjacent tiles have the same value. We define a move as valid if it produces a change in the board state; otherwise, it is invalid.
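The merge rules above can be made concrete with a short sketch (illustrative only, not the project's actual code): sliding one row to the left, merging equal neighbours at most once per move.

```python
def slide_row_left(row):
    """Slide a 4-cell row to the left, merging equal neighbours once per move.

    Returns the new row and the reward gained from the merges.
    """
    tiles = [v for v in row if v != 0]   # drop empty cells
    merged, reward = [], 0
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # merged tile cannot merge again this move
            reward += tiles[i] * 2
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (4 - len(merged)), reward
```

A "right" move is the mirrored operation on each row, and "up"/"down" act on columns; a move is valid exactly when it changes the board.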
Fig. 2: example of a game over state: no valid actions are available
In order to implement the environment, we considered two different strategies to represent the board:
- A 2D array of integers, where each integer represents the value of the tile in that position, and 0 represents an empty tile.
- A 3D array of size 12x4x4, where the first dimension encodes the value of the tile (one binary 4x4 channel per tile value) and the second and third dimensions represent the position of the tile on the board.
After some tests, we decided to use the first approach, since it is slightly more efficient in terms of execution speed. However, as we will see later, we implemented a function that converts the 2D array into the 3D representation when the latter is more convenient.
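A minimal sketch of such a conversion follows (one possible channel convention, assumed here: channel 0 marks the empty cells and channel k the tiles of value 2^k; the project's own `binary_tensor` method may use a different layout):

```python
import numpy as np

def board_to_binary(board):
    """Convert a 4x4 integer board into a (12, 4, 4) binary tensor."""
    board = np.asarray(board)
    out = np.zeros((12, 4, 4), dtype=np.float32)
    for k in range(12):
        # channel 0: empty cells; channel k > 0: tiles with value 2**k
        out[k] = (board == (0 if k == 0 else 2 ** k))
    return out
```

With tiles up to 2048 = 2^11, each cell lights up exactly one of the 12 channels.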
The environment is implemented by the `Game2048` class, which extends the `gym.Env` class.
The action space is the set of the four moves: up, down, left, and right; at each step, only the currently valid moves are made available to the agent. Another choice could have been to define the action space as constant in time, always equal to the full set of four actions regardless of their validity. All the actions are deterministic, which means that when the agent chooses an action, the environment performs it with probability 1 (the only stochastic element is the spawn of the new tile). The possible states are all the board configurations reachable under the game rules.
In a 2048 game, the score is incremented by the value of each new tile formed by merging two tiles. A similar behavior is reasonable for the reward we assign to the agent: we want the agent to maximize the score and reach tiles with values as high as possible.
In order to have a more uniform reward distribution, we implemented the possibility to use the logarithm in base 2 of the new tile value as the reward; in this way, the reward obtained by a merge is always between 1 and 11. The behavior of the reward function is controlled by the parameter `log_rewards` of the environment class `Game2048`: if `log_rewards` is `False`, the reward is the value of the newly formed tile; if `log_rewards` is `True`, it is its base-2 logarithm.
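The two options can be sketched as follows (a hypothetical helper, not the actual `Game2048` code; `new_tiles` stands for the list of tile values created by the merges of one move):

```python
import math

def move_reward(new_tiles, log_rewards=False):
    """Reward of one move given the values of the tiles created by its merges."""
    if log_rewards:
        return sum(math.log2(v) for v in new_tiles)  # base-2 log of each new tile
    return sum(new_tiles)                            # plain score increment
```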
Even if we have not considered it in our results (due to time constraints), we implemented the possibility to assign a negative reward to the agent for each action it performs. This is done by setting the parameter `penality_move` in the agent class (with a positive value, which the program subtracts from the reward). The idea is to encourage the agent to prefer moves that merge tiles over moves that simply shift them without any merge. Moreover, we pushed the model to reach tiles with higher values by adding a per-step reward once values greater than or equal to 512 appear: at each step, the agent receives an additional reward of $n_{512} + 2.5\,n_{1024} + 6\,n_{2048}$, where $n_v$ is the number of tiles with value $v$ on the board.
Note that, because at each step the agent can only perform a valid move, and after every move a new tile is added to the board, the existence of a terminal state is guaranteed; therefore considering a penalty of 0 is not a problem.
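The progressive bonus described above can be sketched as (hypothetical helper name):

```python
def progressive_bonus(board):
    """Per-step bonus n_512 + 2.5*n_1024 + 6*n_2048, where n_v counts tiles of value v."""
    flat = [v for row in board for v in row]
    return (flat.count(512)
            + 2.5 * flat.count(1024)
            + 6.0 * flat.count(2048))
```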
Even if technically we could compute the transition probabilities of the environment, this is not feasible for such a huge state space. For this reason, we chose a model-free approach, where the agent learns the optimal policy directly from interaction with the environment.
We will stay in the class of action-value methods. In particular, we will use the Temporal Difference (TD) learning algorithm to estimate the optimal action-value function $Q^*(s, a)$, where $s$ denotes a state and $a$ an action. There are several possible choices, such as TD(0) and TD($\lambda$); we use TD(0).
The TD(0) algorithm is the following:
Fig. 3: TD(0) algorithm pseudocode
where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
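For reference, assuming the standard Q-learning form of the TD target, the update applied by the pseudocode can be written as:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

where the bracketed quantity is the temporal difference error discussed below.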
To compute the temporal difference error, we need an estimate of the action-value function. In traditional Q-learning, the action-value function is stored in a lookup table with one entry per state-action pair. However, as we already said, the state space is far too large for a tabular representation, so we approximate the action-value function with a neural network; this is the Deep Q-Network (DQN) approach.
Fig. 4: Difference between traditional Q-learning approach and DQN
The action is then chosen, among the valid ones, according to an $\epsilon$-greedy policy over the estimated action values.
Fig. 5: DQN algorithm pseudocode
where the policy network is the network being trained and the target network is a slowly (softly) updated copy used to compute the TD targets.
Usually, TD-learning algorithms are defined considering the eligibility traces; in the DQN setting, the update is instead expressed as the minimization of a loss between the current estimate and the TD target, typically the mean squared error. We have chosen the Huber loss function instead, since it is more robust to outliers (and, as you will see, we have a lot of noise): it behaves like the squared error for small errors and like the absolute error for large ones.
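Assuming the standard definition (the form implemented by PyTorch's `torch.nn.SmoothL1Loss` with threshold $\delta = 1$), the Huber loss of the TD error $e$ is:

$$L_\delta(e) = \begin{cases} \frac{1}{2} e^2 & \text{if } |e| \le \delta \\ \delta \left( |e| - \frac{1}{2}\delta \right) & \text{otherwise} \end{cases}$$

so for $|e| \le \delta$ it coincides with (half) the squared error, while for large errors it grows only linearly.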
As written in the algorithm, the agent takes an action following an $\epsilon$-greedy policy; we considered three different decay strategies for $\epsilon$:
- Lai and Robbins: $\epsilon(t(a)) = \frac{\eta}{1+\nu t(a)}$, where $\eta$ and $\nu$ are 0.8 and 0.01 respectively, and $t(a)$ is the number of times the action $a$ has been taken.
- Entropy: $\epsilon$ is computed as the softmax of the scalar product between the valid values of the policy and $\beta(a) = \alpha \cdot \log(t(a)+1)$, where $\alpha$ is 1.
- Torch documentation: $\epsilon$ decays exponentially from `eps_start` to `eps_end` with `eps_decay` as rate.
The choice of the decay strategy is controlled by the parameter `kind_eps` in the agent class, passing one of `lai_robbins`, `entropy`, or `torch_version`.
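Two of the schedules can be sketched as follows (sketches under the parameter values stated above; the `eps_decay` factor is treated here as a multiplicative rate, and the entropy strategy is omitted because it depends on the policy values):

```python
import math

def eps_lai_robbins(t, eta=0.8, nu=0.01):
    """Lai and Robbins schedule: eta / (1 + nu * t(a))."""
    return eta / (1 + nu * t)

def eps_torch_version(n, eps_start=1.0, eps_end=0.01, eps_decay=0.0005):
    """Exponential decay from eps_start to eps_end after n total steps."""
    return eps_end + (eps_start - eps_end) * math.exp(-n * eps_decay)
```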
Since this is a project about Reinforcement Learning, the main effort was dedicated to two parts: the implementation of the environment and the implementation of the agent.
You can find the environment class in the `envs.py` file. The environment is implemented as a class that extends the `gym.Env` class.
The class has the following attributes:
| attribute | description |
|---|---|
| `board` | a 2D array of integers, where each integer represents the value of the tile in that position, and 0 represents an empty tile |
| `action_space` | the action space of the environment, a `Discrete` space with 4 possible actions: up, down, left, right |
| `score` | the current score of the game |
| `legit_actions` | a numpy array of 4 boolean values representing whether each action can be performed. For example, if `legit_actions[0]` is `True`, the action up can be performed; otherwise it is not a valid action |
| `log_rewards` | a boolean value that controls the reward function: if `True` the reward is the logarithm in base 2 of the new tile value, otherwise it is the new tile value |
The methods and functions of the class are the following:
| method/function | description |
|---|---|
| `__init__(self, log_rewards=False)` | the constructor of the class; it initializes the board and the action space. If `log_rewards` is `True`, the reward is the logarithm in base 2 of the new tile value, otherwise it is the new tile value |
| `reset(self)` | resets the environment: it sets the board to the initial state and returns the state of the environment |
| `step(self, action)` | performs the action `action` on the environment and returns a tuple `(state, reward, done, info)`, where `state` is the new state of the environment, `reward` is the reward obtained by performing the action, `done` is a boolean which is `True` if the game is over, and `info` is an empty dictionary |
| `get_legit_actions(self)` | returns an array of 4 boolean values representing whether each action can be performed |
| `_UP`, `_DOWN`, `_LEFT`, `_RIGHT` | functions that perform the actions up, down, left, and right respectively; they are used by the `step` method |
| `_can_UP`, `_can_DOWN`, `_can_LEFT`, `_can_RIGHT` | functions that return `True` if the corresponding action can be performed, `False` otherwise |
| `_add_random_tile(p)` | adds a new tile at a random free position on the board; `p` is the probability of spawning a 2 tile (otherwise a 4 is spawned) |
| `is_full_board(self)` | returns `True` if the board is full, `False` otherwise |
| `is_game_over(self)` | returns information about the state of the game (won and/or finished) |
As for the `agents.py` module, it contains the implementation of 3 different agents:
- `RandomAgent`: an agent that takes a random action among the valid ones at each step. It was used to test the environment and as a baseline for assessing the performance of the other agents.
- `DQN_Agent` and `ConvDQN_Agent`: the agents that implement the DQN algorithm. The first uses a fully connected neural network, the second a convolutional neural network. The two agents have the same attributes and methods; the only difference is in the neural network used.
The attributes of the `DQN_Agent` and `ConvDQN_Agent` classes are the following:
| Attribute | Description |
|---|---|
| `model` | the model used as a policy network; it must be one of the implemented architectures or, more generally, a subclass of `torch.nn.Module` |
| `device` | the device used for computation (`cuda` or `cpu`) |
| `batch_size` | the size of the batch used for training |
| `gamma` | the discount factor of the algorithm |
| `eps_start` | the starting value of epsilon |
| `eps_end` | the ending value of epsilon |
| `eps_decay` | the decay rate of epsilon |
| `kind_eps` | the kind of decay of epsilon; it can be `lai_robbins`, `entropy`, or `torch_version` |
| `tau` | the parameter used for the soft update of the target network |
| `lr` | the learning rate of the optimizer |
| `replay_memory_size` | the maximum number of transitions to store in the replay buffer |
| `epochs_checkpoint` | the number of epochs between two checkpoints (information printed during training) |
| `penality_move` | the penalty for each move, usable for reward shaping |
| `target_net` | the target network used for the soft update |
| `policy_net` | the policy network used for the training |
| `loss_history` | a list of the loss values obtained during the training |
| `cumulative_reward` | a list of the cumulative rewards obtained during the training |
| `score` | the score that the agent has achieved playing the game |
| `steps_done` | the number of steps done by the agent |
| `steps_per_action` | an array of size 4 counting the steps the agent took in each direction |
The methods of the `DQN_Agent` and `ConvDQN_Agent` classes are the following:
| method/function | description |
|---|---|
| `__init__(self, model, device, batch_size=128, gamma=0.99, eps_start=1.0, eps_end=0.01, eps_decay=0.0005, kind_eps='lai_robbins', tau=0.001, lr=0.001, replay_memory_size=10000, epochs_checkpoint=10, penality_move=0.0)` | the constructor of the class; it initializes the attributes of the class |
| `reset(self)` | resets the specs of the agent |
| `select_action(self, state, kind)` | selects the action to perform given the state, adopting the exploration strategy specified by `kind` |
| `optimize_model(self)` | performs the optimization step of the algorithm |
| `save(self, path)` | saves the model at the path `path` |
| `load(self, path)` | loads the model from the path `path` |
| `binary_tensor(self, state)` | converts the state into a binary tensor with shape (12, 4, 4) to be used as input of the neural network |
| `fit(self, env, num_episodes)` | trains the agent for `num_episodes` episodes |
| `test(self, env, num_episodes)` | plays `num_episodes` games following the learned policy |
In this section we sum up the results obtained by the different agents. In particular, we tried different decay strategies for epsilon-greedy, different architectures, and different reward shapings (you can find the plots in Report.ipynb).
The decay strategies we have tried are:
- Lai and Robbins, with $\epsilon(a) = \frac{\eta}{1+\nu t(a)}$, $\eta=0.8$ and $\nu=0.01$: this strategy shows some issues in the early stages of the training; the first moves induce a strong bias on the policy (the moves selected first have a high probability of being reselected), which makes the agent learn slowly. After some episodes, the agent reaches approximately a uniform distribution over the actions. Looking at the mean value of the tiles on the board, the agents trained in this way prefer to keep the tiles with big values in the center of the board.
- Torch version, with $\epsilon = \epsilon_{end} + (\epsilon_{start} - \epsilon_{end})\cdot e^{-\frac{n}{\epsilon_{decay}}}$: this is the strategy proposed in the PyTorch tutorial. It is the one that makes the agent learn fastest, but also the one that makes it learn the least: the agent reaches a uniform distribution over the actions after a few episodes, but it does not learn to keep the tiles with big values in the center of the board. Moreover, with this strategy we lose the second criterion of convergence, because the asymptotic value of $\epsilon$ is a constant greater than 0.
- Entropy regularization, with $\beta(a) = \alpha\log(t(a)+1)$, $\alpha=1$: this strategy was the most interesting one in several respects. It makes the agent learn better than the others, and the results obtained seem human-like. With this choice, the agent picks one random main move (depending on the initial states in the training) and then enforces a hierarchy over the other actions: for example, if the agent chooses "up" as its main move in the early stage of the training, it will prefer "left" or "right" as a second main move and avoid "down" as much as possible. This is because the agent learns to keep the tiles with big values in a corner of the board, and to merge them, or to keep them close and merge them later. Finally, the distribution of the actions is strongly unbalanced, which can be a good thing: it means the agent has learned to prefer some actions over others.
We propose different reward shapings; we fixed the baseline architecture (BigConvolutionalNetwork) and the entropy regularization:
1. No reward shaping (not tested): the agent would only be rewarded for winning the game by reaching the 2048 tile, so it would not learn to keep the tiles with big values in a corner of the board, nor to merge them.
2. Score rewards: for each triple (s, a, s'), the reward is the sum of the values of the tiles merged in that step (r=4 if we merge two 2s, r=8 if we merge two 4s, and so on).
3. Log rewards: for each triple (s, a, s'), the reward is the sum of the base-2 logarithms of the values of the tiles merged in that step (r=2 if we merge two 2s, r=3 if we merge two 4s, and so on).
4. Progressive log rewards: as in (3), but we also reward the presence of tiles greater than or equal to 512, weighted by the number of each such tile on the board (boards with big tiles are less probable in the early stages, and we may want to speed up convergence to those states). In the presentation we showed that this reward shaping lets the agent reach higher scores compared with (2).
5. Progressive log rewards with penalty: as in (4), but with a penalty of 1.5 for each move. In the presentation we showed that this reward shaping does not improve the score in the testing phase.
In the early stages of the project, we tried to use a fully connected neural network as a policy network, with the board as input. The results were not good (the distribution of the scores was compatible with a random policy): the agent was not able to learn interesting behaviors. Our hypothesis is that the fully connected network can capture neither the fact that tiles with different values are de facto different objects, nor the spatial information of the board. For this reason, we decided to convert the board into a binary tensor and to use a convolutional neural network as a policy network. The results were much better, and in this case we needed fewer weights, thanks to the weight sharing of the convolutional layers. However, to be fair, we also compared the fully connected network with the convolutional one using the same binary tensor representation.
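To illustrate the kind of architecture involved (a sketch only, not the project's actual BigConvolutionalNetwork), a small convolutional Q-network over the (12, 4, 4) binary tensor could look like:

```python
import torch
import torch.nn as nn

class SmallConvDQN(nn.Module):
    """Tiny convolutional Q-network: binary board in, one Q-value per action out."""

    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(12, 64, kernel_size=2), nn.ReLU(),   # 4x4 -> 3x3
            nn.Conv2d(64, 128, kernel_size=2), nn.ReLU(),  # 3x3 -> 2x2
            nn.Flatten(),
            nn.Linear(128 * 2 * 2, 128), nn.ReLU(),
            nn.Linear(128, n_actions),                     # Q-values for the 4 moves
        )

    def forward(self, x):
        return self.net(x)
```

The small kernels share weights across board positions, which is exactly the property the fully connected network lacks.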
As you may notice in the Report.ipynb, we do not (for now) have an agent that can win this game; however, we think this is a good starting point for further analysis and improvements. In particular, we think that the entropy regularization is a good choice, but not the optimal one (in the long-term dynamics it becomes a greedy strategy, which might be suboptimal), and we could improve the results by tuning the reward shaping and the architecture of the network. Moreover, we could try a different decay strategy for epsilon-greedy, and a different architecture for the network (for example a recurrent neural network, or a transformer); finally, we chose a TD(0) algorithm here, but we could try a TD($\lambda$) algorithm.
You can use the given notebook to play around with the environment and the agents.
The notebook contains examples of how to use the modules we implemented in order to train and test the agent.
Here is a visual example of how the random and the DQN agents play the game: