These are samples used in the University of Cambridge course Machine Learning for Programming.
Scaffolding for a simple language model is provided in `language_model/`, for
TensorFlow 1.X, TensorFlow 2.X, and PyTorch. Python 3.6 or later is required.
If you want to re-use this, pick the framework you want to use, then install it
and the requirements for this model using `pip install -r requirements.txt`.

To get started, open a console and change your current directory to
`language_model/`. Alternatively, add that directory to your `PYTHONPATH`
environment variable:

```
export PYTHONPATH=/path/to/language_model
```
The scaffold provides some generic code to simplify the task (such as a
training loop and logic for saving and restoring models), but you need to
complete the code in a number of places to obtain a working model (these are
marked by `#TODO N#` in the code):
- In `model.py`, uncomment the line corresponding to the framework you want
  to use.
- In `dataset.py`, `load_data_file` needs to be filled in to read a data file
  and return a sequence of lists of tokens; each list is considered one
  sample. This should re-use the code from the first practical to provide one
  sample for the tokens in each method.

  It is common practice to normalise the capitalisation of tokens (as the
  embeddings of `foo` and `Foo` should be similar). Make sure that
  `load_data_file` transforms all tokens to lower (or upper) case.

  You should be able to test this as follows:
  ```
  $ python test_step2.py data/jsoup/src/main/java/org/jsoup/Jsoup.java.proto | tail -n -1
  ['public', 'static', 'boolean', 'isvalid', 'lparen', 'string', 'bodyhtml', 'comma', 'whitelist', 'whitelist', 'rparen', 'lbrace', 'return', 'new', 'cleaner', 'lparen', 'whitelist', 'rparen', 'dot', 'isvalidbodyhtml', 'lparen', 'bodyhtml', 'rparen', 'semi', 'rbrace', 'rbrace']
  ```
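A minimal sketch of how `load_data_file` might look. The per-method token extraction comes from the first practical; it is represented here by a hypothetical `parse_proto_file` helper, so only the normalisation logic is concrete:

```python
from typing import Iterable, List


def normalise_tokens(method_tokens: Iterable[str]) -> List[str]:
    # Lower-case every token so that e.g. `foo` and `Foo` share an embedding.
    return [token.lower() for token in method_tokens]


def load_data_file(file_path: str) -> List[List[str]]:
    """Read one data file and return one (normalised) token list per method."""
    # `parse_proto_file` is a hypothetical stand-in for the per-method token
    # extraction you wrote in the first practical.
    tokens_per_method = parse_proto_file(file_path)  # noqa: F821 (hypothetical)
    return [normalise_tokens(tokens) for tokens in tokens_per_method]
```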
- In `dataset.py`, `build_vocab_from_data_dir` needs to be completed to
  compute a vocabulary from the data. The vocabulary will be used to
  represent all tokens by integer IDs, and we need to consider four special
  tokens: the `UNK` token used to represent infrequent tokens and those not
  seen at training time, the `PAD` token used to make all samples the same
  length, the `START_SYMBOL` token used as the first token in every sample,
  and the `END_SYMBOL` token used as the last.

  To do this, we use the class `Vocabulary` from
  `dpu_utils.mlutils.vocabulary`. Using `load_data_file` from above, compute
  the frequency of tokens in the passed `data_dir` (`collections.Counter` is
  useful here) and use that information to add the `vocab_size` most common
  of them to `vocab`.

  You can test this step as follows:
  ```
  $ python test_step3.py data/jsoup/src/main/java/org/jsoup/
  Loaded vocabulary for dataset: {'%PAD%': 0, '%UNK%': 1, '%START%': 2, '%END%': 3, 'rparen': 4, 'lparen': 5, 'semi': 6, 'dot': 7, 'rbrace': 8, ' [...]
  ```
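One possible shape for this step, split into a dependency-free counting helper plus the vocabulary construction. The `Vocabulary.create_vocabulary` call and the `*.proto` file-discovery pattern are assumptions about `dpu_utils` and the data layout; check both against your installed version and the scaffold:

```python
import collections
import glob
import os
from typing import Iterable, List


def count_tokens(token_seqs: Iterable[List[str]]) -> collections.Counter:
    """Count how often each token occurs across all samples."""
    counter = collections.Counter()
    for token_seq in token_seqs:
        counter.update(token_seq)
    return counter


def build_vocab_from_data_dir(data_dir: str, vocab_size: int, count_threshold: int):
    # Deferred import so the counting helper above stays dependency-free.
    from dpu_utils.mlutils.vocabulary import Vocabulary

    counter = collections.Counter()
    for file_path in glob.iglob(os.path.join(data_dir, "**", "*.proto"), recursive=True):
        counter += count_tokens(load_data_file(file_path))  # noqa: F821 (see step 2)

    # create_vocabulary adds %UNK%/%PAD% and keeps the most frequent tokens.
    # START/END are added explicitly here; note that the scaffold's expected
    # output shows them at IDs 2 and 3, so it may insert them earlier.
    vocab = Vocabulary.create_vocabulary(
        tokens=counter,
        max_size=vocab_size,
        count_threshold=count_threshold,
        add_unk=True,
        add_pad=True,
    )
    vocab.add_or_get_id("%START%")
    vocab.add_or_get_id("%END%")
    return vocab
```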
- In `dataset.py`, `tensorise_token_sequence` needs to be completed to
  translate a token sequence into a sequence of integer token IDs of uniform
  length. The output of the function should always be a list of `length`
  token IDs from `vocab`, where longer sequences are truncated and shorter
  sequences are padded to the correct length. We also want to use this
  method to insert the `START_SYMBOL` at the beginning of each sample. The
  special `END_SYMBOL` needs to be appended to indicate the end of a list of
  tokens, whereas a special `PAD_SYMBOL` needs to be added to serve as a
  filler so that all token sequences have the same length.

  You can test this step as follows (note that this example output uses a
  `count_threshold` of 2):
  ```
  $ python test_step4.py data/jsoup/src/main/java/org/jsoup/
  Sample 0:
   Real length: 50
   Tensor length: 50
   Raw tensor: [ 2 13 1 4 3 8 118 4 3 5 7 13 1 4 12 1 3 8 118 4 1 3 5 7 13 1 4 1 1 3 8 118 4 1 3 5 7 13 1 4 12 1 9 1 1 3 8 118 4 1] (truncated)
   Interpreted tensor: ['%START%', 'public', '%UNK%', 'lparen', 'rparen', 'lbrace', 'super', 'lparen', 'rparen', 'semi', 'rbrace', 'public', '%UNK%', 'lparen', 'string', '%UNK%', 'rparen', 'lbrace', 'super', 'lparen', '%UNK%', 'rparen', 'semi', 'rbrace', 'public', '%UNK%', 'lparen', '%UNK%', '%UNK%', 'rparen', 'lbrace', 'super', 'lparen', '%UNK%', 'rparen', 'semi', 'rbrace', 'public', '%UNK%', 'lparen', 'string', '%UNK%', 'comma', '%UNK%', '%UNK%', 'rparen', 'lbrace', 'super', 'lparen', '%UNK%'] (truncated)
  Sample 1:
   Real length: 46
   Tensor length: 50
   Raw tensor: [ 2 13 1 4 12 1 3 8 118 4 1 3 5 7 13 1 4 1 1 3 8 118 4 1 3 5 7 13 1 4 12 1 9 1 1 3 8 118 4 1 9 1 3 5 7 7 0 0] (truncated)
   Interpreted tensor: ['%START%', 'public', '%UNK%', 'lparen', 'string', '%UNK%', 'rparen', 'lbrace', 'super', 'lparen', '%UNK%', 'rparen', 'semi', 'rbrace', 'public', '%UNK%', 'lparen', '%UNK%', '%UNK%', 'rparen', 'lbrace', 'super', 'lparen', '%UNK%', 'rparen', 'semi', 'rbrace', 'public', '%UNK%', 'lparen', 'string', '%UNK%', 'comma', '%UNK%', '%UNK%', 'rparen', 'lbrace', 'super', 'lparen', '%UNK%', 'comma', '%UNK%', 'rparen', 'semi', 'rbrace', 'rbrace', '%PAD%', '%PAD%'] (truncated)
  ...
  ```
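The truncate/pad rules above can be sketched as follows. `vocab` is assumed to offer the `dpu_utils`-style `get_id_or_unk` and `get_pad` methods, and whether a truncated sequence still receives `END_SYMBOL` is a design choice the text leaves open (this sketch always appends it):

```python
from typing import List

START_SYMBOL = "%START%"
END_SYMBOL = "%END%"


def tensorise_token_sequence(vocab, length: int, token_seq: List[str]) -> List[int]:
    """Map tokens to IDs, add START/END, then truncate/pad to exactly `length`."""
    token_ids = [vocab.get_id_or_unk(START_SYMBOL)]
    # Leave room for START and END when truncating long sequences:
    token_ids.extend(vocab.get_id_or_unk(tok) for tok in token_seq[: length - 2])
    token_ids.append(vocab.get_id_or_unk(END_SYMBOL))
    # Pad short sequences so every sample in a batch has the same length:
    pad_id = vocab.get_id_or_unk(vocab.get_pad())
    token_ids.extend(pad_id for _ in range(length - len(token_ids)))
    return token_ids
```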
- The actual model needs to be built. Our goal is to learn to predict
  `tok[i]` based on the tokens `tok[:i]` seen so far. The process and
  scaffold are very similar in all frameworks. The methods `compute_logits`
  and `compute_loss_and_acc` need to be completed; the `build` method can be
  used to initialise weights and layers that will be re-used during training
  and prediction. Parameters such as `EmbeddingDim` and `RNNDim` should be
  hyperparameters, but values such as `64` work well.
  - In `compute_logits`, implement the logic to embed the `token_ids` input
    tensor into a distributed representation. In TF 1.X, you can use
    `tf.nn.embedding_lookup`; in TF 2.X, you can use
    `tf.keras.layers.Embedding`; and in PyTorch, you can use
    `torch.nn.Embedding` for this purpose. This should translate an `int32`
    tensor of shape `[Batch, Timesteps]` into a `float32` tensor of shape
    `[Batch, Timesteps, EmbeddingDim]`.
  - In `compute_logits`, implement an actual RNN consuming the results of
    the embedding layer. You can use `tf.keras.layers.GRU` or `torch.nn.GRU`
    (or their LSTM variants) for this. This should translate a `float32`
    tensor of shape `[Batch, Timesteps, EmbeddingDim]` into a `float32`
    tensor of shape `[Batch, Timesteps, RNNDim]`.
  - In `compute_logits`, implement a linear layer to translate the RNN
    output into an unnormalised probability distribution over the
    vocabulary. You can use `tf.keras.layers.Dense` or `torch.nn.Linear` for
    this. This should translate a `float32` tensor of shape
    `[Batch, Timesteps, RNNDim]` into a `float32` tensor of shape
    `[Batch, Timesteps, VocabSize]`.
  - In `compute_loss_and_acc`, implement a cross-entropy loss that compares
    the probability distribution computed at timestep `T` with the input at
    timestep `T+1` (which is the token that we want to predict). Note that
    this means that we need to discard the final RNN output, as we do not
    know the next token. You can use
    `tf.nn.sparse_softmax_cross_entropy_with_logits` or
    `torch.nn.CrossEntropyLoss` for this.
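As an illustration of the three `compute_logits` stages (embedding, RNN, linear projection), here is a hedged PyTorch sketch. The class layout and hyperparameter plumbing of the actual scaffold will differ; `64` is just the suggested default:

```python
import torch
import torch.nn as nn


class LanguageModelSketch(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int = 64, rnn_dim: int = 64):
        super().__init__()
        # Layers created once here, re-used during training and prediction:
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, rnn_dim, batch_first=True)
        self.output_proj = nn.Linear(rnn_dim, vocab_size)

    def compute_logits(self, token_ids: torch.Tensor) -> torch.Tensor:
        # [Batch, Timesteps] (int) -> [Batch, Timesteps, EmbeddingDim]
        embedded = self.embedding(token_ids)
        # -> [Batch, Timesteps, RNNDim] (second return value is the final state)
        rnn_out, _ = self.rnn(embedded)
        # -> [Batch, Timesteps, VocabSize], unnormalised logits
        return self.output_proj(rnn_out)
```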
After completing these steps, you should be able to train the model and
observe the loss going down (the accuracy value will only be filled in after
step 6):
```
$ python train.py trained_models data/jsoup/{,}
Loading data ...
Built vocabulary of 4697 entries.
Loaded 2233 training samples from data/jsoup/.
Loaded 2233 validation samples from data/jsoup/.
Running model on GPU.
Constructed model, using the following hyperparameters: {"optimizer": "Adam", "learning_rate": 0.01, "learning_rate_decay": 0.98, "momentum": 0.85, "max_epochs": 500, "patience": 5, "max_vocab_size": 10000, "max_seq_length": 50, "batch_size": 200, "token_embedding_size": 64, "rnn_type": "GRU", "rnn_num_layers": 2, "rnn_hidden_dim": 64, "rnn_dropout": 0.2, "use_gpu": true, "run_id": "RNNModel-2019-12-29-13-23-18"}
Initial valid loss: 0.042.
[...]
== Epoch 1
 Train: Loss 0.0303, Acc 0.000
 Valid: Loss 0.0224, Acc 0.000
  (Best epoch so far, loss decreased 0.0224 from 0.0423)
  (Saved model to trained_models/RNNModel-2019-12-29-13-23-18_best_model.bin)
== Epoch 2
 Train: Loss 0.0213, Acc 0.000
 Valid: Loss 0.0195, Acc 0.000
  (Best epoch so far, loss decreased 0.0195 from 0.0224)
  (Saved model to trained_models/RNNModel-2019-12-29-13-23-18_best_model.bin)
[...]
```
The saved models should already be usable as autocompletion models, using the
provided `predict.py` script:
```
$ python predict.py trained_models/RNNModel-2019-12-29-13-23-18_best_model.bin public
Prediction at step 0 (tokens ['public']):
 Prob 0.282: static
 Prob 0.099: void
 Prob 0.067: string
Continuing with token static
Prediction at step 1 (tokens ['public', 'static']):
 Prob 0.345: void
 Prob 0.173: document
 Prob 0.123: string
Continuing with token void
Prediction at step 2 (tokens ['public', 'static', 'void']):
 Prob 0.301: main
 Prob 0.104: isfalse
 Prob 0.089: nonullelements
Continuing with token main
Prediction at step 3 (tokens ['public', 'static', 'void', 'main']):
 Prob 0.999: lparen
 Prob 0.000: filterout
 Prob 0.000: iterator
Continuing with token lparen
Prediction at step 4 (tokens ['public', 'static', 'void', 'main', 'lparen']):
 Prob 0.886: string
 Prob 0.033: int
 Prob 0.030: object
Continuing with token string
```

Note that tokens such as `{` and `(` are represented as `lbrace` and `lparen`
by the feature extractor and are used the same way here.
- Finally, `compute_loss_and_acc` should be extended to also compute the
  number of (correct) predictions, so that the accuracy of the model can be
  computed. For this, you need to check if the most likely prediction
  corresponds to the ground truth. You can use `tf.argmax` or `torch.argmax`
  here. We also need to discount padding tokens, so you need to compute a
  mask indicating which predictions correspond to padding. Here, you can use
  `self.vocab.get_id_or_unk(self.vocab.get_pad())` to get the integer ID of
  the padding token.

  After completing this step, you should be able to evaluate the model:
  ```
  $ python evaluate.py trained_models/RNNModel-2019-12-29-13-23-18_best_model.bin data/jsoup/
  Loading data ...
  Loaded trained model from trained_models/RNNModel-2019-12-29-13-23-18_best_model.bin.
  Loaded 2233 test samples from data/jsoup/.
  Test: Loss 24.9771, Acc 0.876
  ```
- To improve training, we want to ignore those parts of the sequence that
  are just `%PAD%` symbols introduced to get to a uniform length. To this
  end, we need to mask out the part of the loss that corresponds to these
  irrelevant tokens. You can re-use the mask computed in step 6 here.
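Steps 6 and 7 together might look like the following PyTorch sketch: shift the targets by one timestep, then mask the `%PAD%` positions out of both the loss and the accuracy. The free-function signature is illustrative (the scaffold uses a method), and `pad_id` stands in for `self.vocab.get_id_or_unk(self.vocab.get_pad())`:

```python
import torch
import torch.nn.functional as F


def compute_loss_and_acc(logits: torch.Tensor, token_ids: torch.Tensor, pad_id: int):
    # Predict the token at T+1 from the output at T: drop the last logit
    # (we do not know the next token) and the first target (it is START).
    logits = logits[:, :-1, :]          # [Batch, Timesteps-1, VocabSize]
    targets = token_ids[:, 1:]          # [Batch, Timesteps-1]
    mask = (targets != pad_id).float()  # 1.0 at real tokens, 0.0 at padding

    # Per-token cross-entropy, masked so %PAD% positions contribute nothing:
    per_token_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    loss = (per_token_loss * mask).sum() / mask.sum()

    # Accuracy: does the most likely prediction match the ground truth?
    predictions = torch.argmax(logits, dim=-1)
    num_correct = ((predictions == targets).float() * mask).sum()
    accuracy = num_correct / mask.sum()
    return loss, accuracy
```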
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.