homebrewnlp / olmax Goto Github PK

View Code? Open in Web Editor NEW

45.0 45.0 5.0 6.17 MB

HomebrewNLP in JAX flavour for maintable TPU-Training

License: BSD 2-Clause "Simplified" License

Python 98.20% Shell 1.80%

jax tpu

olmax's Introduction

HomebrewNLP

Overview

A case study of efficient training of large language models using commodity hardware.

Example Command

python3 main.py train --config_path configs/small.yaml

| Discord | WandB

Datasets

Book Dataset
200MB Slice of ThePile

Citing

BibTeX

@misc{nestler2021homebrewnlp,
  title = {{HomebrewNLP}},
  author = {Nestler, Lucas and Gill, David},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.5553247},
  howpublished = {\url{https://github.com/HomebrewNLP/HomebrewNLP}}
}

Latest DOI

olmax's People

Contributors

Stargazers

Watchers

Forkers

shawwn stjordanis ma7dev dashstander batelicm

olmax's Issues

Finalize checkpoint/restore

At the moment, we have one checkpoint, but we've never tested whether restoring it works. The saving pipeline returns files, but it's uncertain whether loading it works equally flawlessly.
This issue is about testing the existing code and potentially fixing problems as they appear.

Stabilize MoE

Currently, our MoE implementation leads to exploding losses and the eventual NaN.
This issue is about finding the cause behind these problems and fixing it.

Square LR-Schedule

Our learning rate scheduler currently uses a linear increase and exponential dropoff, so our learning rate curve looks like the following:

where the duration of the initial ramp-up and the decay are tuneable hyperparameters.

However, others pointed out that square ramp-up and square decay can perform significantly better, so we might also want to use them. The modified curve (orange) would look like the following:

Retrieval Augmented Causal Generation

DeepMind demonstrated in their recent RETRO paper that augmenting a language model's input with text retrieved from a corpus allows it to learn to copy relevant passages instead of storing those in its weights. This text retrieval is another solution to the problem mentioned in #8 and doesn't involve modifying the model. Instead, RETRO first retrieves similar text using BERT embeddings and then feeds that text into the cross-attention of their model together with the original prompt. This way, the decoder of their T5-model is aware of similar texts without storing them in its weights.
We could implement a similar architecture without cross attention (#44) by using only autoregressive language modelling and retrieving chunks using BERT (or our own) embeddings. It would even be possible to test this approach without retraining a model by simply retrieving relevant chunks and feeding them into the context of our model (instead of using padding tokens).
This issue tracks the progress of the initial proof-of-concept, its benchmarks against the baseline and its overall progress.

Momentum Quantization

Many modern optimisers, such as Shampoo, SM3 and 8-Bit Adam quantise the large momentum buffers to a lower precision such as int8. This quantisation gives them significant memory improvements, as they now only need 6 bytes per parameter instead of 12. We could save 16% to 33% of our total memory consumption by adding momentum quantisation, allowing for more parameters and bigger batches.
This issue is about implementing quantised momentum and benchmarking its convergence impact compared to bf16 and fp32 momentum.

Learning-rate schedule as beta schedule

Currently, our model always gives past gradients the same weight as future gradients, regardless of where the model is in the current training. So, if it has already completed a few hundred thousand steps, the optimizer would keep looking at the best 100 steps to compute the approximate update direction.
Instead, we could add the learning rate scheduler directly before passing it into the optimizer. This way, our optimizer would slowly deprioritize the current gradient and focus more on the past ones. Our learning rate would almost act like an implicit beta schedule which could make tuning this hyperparameter easier.

Complex Momentum

Others have had great success with complex momentum, showing improved performance across optimizers and improving momentum stability. Additionally, others have shown momentum to be quite unstable in LLMs. Therefore complex momentum could be an attractive approach to improve stability without decreasing performance.
This issue is about implementing complex momentum and testing it against the baseline.

Hierarchical Network

Our network has a sequential structure through which it passes all messages. This structure implies that we always have a dense computation and assume that all features that are far away are less critical than closer ones. In reality, we might want to pass a "concept" or "general embedding" more quickly through the network than the local information.
We could achieve exactly that by using a hierarchical network, as proposed in Clockwork RNN. Some parts with a higher "clock rate" would propagate information more locally (1 -> 2 -> 3), while layers with a lower clock rate would work with global contexts (1 -> 3 -> 5). This hierarchy can be put to any depth and adds another hyperparameter.
This issue aims to implement such a hierarchical network and benchmark it against the baseline without hierarchy in a long-context setting.

MoE + Weight Sharing

Like WideNet proposed, we could combine a MoE-architecture with weight sharing. Incorporating a WideNet-style architecture should increase performance, decrease training time, and reduce the number of parameters needed.
This issue is about implementing such a weight-sharing protocol and benchmarking its performance.

"Resume" option for tokenizers

Currently, our tokenisers are long-running tasks that cannot be interrupted. If the process is stopped for even just a minute (for example, because GPU or CPU resources are needed elsewhere), the tokenisation has to be restarted from scratch. Instead of enforcing to run a process that can take multiple weeks in one go, we should implement an option to "resume" the state from an earlier checkpoint. This could be done, by, for example, skipping the first few documents or videos.
This issue tracks the progress of implementing such a scheme.

Automated Eval-Demo Update

At the moment, the evaluation demo has to be updated manually by SSHing into the server, pulling the latest code, and restarting the WebAPI. Suppose we'd instead have a script that runs after every long-running experiment (from #31). In that case, we could ensure that the demo will always have the latest code and checkpoint without introducing massive downtimes through potential human error.

Causality Test

Currently, we have to manually verify that a modification doesn't accidentally leak information, which can be prone to errors. Especially in situations where only some tokens can see other future tokens, this can be difficult to notice using only the loss curves. That's why we should need to introduce a test that ensures our model cannot future tokens, as that'd make it much easier to predict future tokens.

Video Generation via Tokens

If we tokenise frames of a video with a VQGAN, we can autoregressively predict the next token using our current language model. More specifically, using our current context of 2 million tokens, we could fit 2048 frames (~34 minutes at 1 FPS) with current state-of-the-art image quantisation models.
This issue is about implementing such a model end-to-end and having a working demo.

Reuse Parameter-Buffers

Currently, our model allocates one set of buffers for the input parameters and another set of buffers for the output parameters. So, for a 16GB GPU, we could fill up to half its memory with buffers as input and output buffers are separate. This separation means that we have 8GB of effective memory, which means we can allocate up to 8GB/(4 Bytes/Buffers)=2 billion Buffers. However, Jax supports buffer donation, which would allow its compiler to deallocate the inputs.

Frontend

Once #19 is done, we could implement a "Playground" as OpenAI and Aleph Alpha have it. A frontend would simplify it even more for others to experiment with our models and research them.

Staged batchsize training

Some papers such as "Don't Decay the Learning Rate, Increase the Batch Size" have shown that training with progressively larger batch sizes instead of progressively lower learning rates helps models find a better local minimum by improving stability in the final stages of training. Additionally, this increases training speed, as the model gets progressively faster (in tokens/s) with increasing batch size.
Intuitively, this allows the model to take many small updates initially, as all samples in the batch will point in a similar direction. However, during later stages of the training, the gradients might point in different directions, so larger batches (or lower learning rates) are required.

Inference CLI

We currently have a basic CLI for inference of our models, but it doesn't support newlines or large-scale embeddings:

https://github.com/HomebrewNLP/HomebrewNLP-Jax/blob/b4a7afa8da619af47c0e661baddf581d7f8a2cce/src/inference.py#L98-L103

It would be nice to have a dedicated command-line interface for inference, as this would help improve the useability of our models.

Long-Context Model

At the moment, our models can fit up to 2 million tokens. However, it seems like Jax has some internal overheads that stop us from using them in one sequence with a batch of one sample, as that'd require 200 GiB RAM instead of the 14 GiB we need for batch=512+sequence=4096.
This issue is about tracking these overheads and finding a sensible solution.

L1 LayerNorm

Others have reported increased stability with L1-BatchNorm

so it might be worth a try for us as well

Image Classification

At the moment, we have a novel architecture that's very powerful in language modelling. However, we don't know whether it will transfer as well to other domains as the transformer. That's why it'd be interesting to test its versatility by training it on ImageNet.
This issue is about implementing the input projection for image tokens (as in ViT), the necessary data pipelines and testing the model on this new modality.

Automated Integration Tests

Currently, our codebase is untested and needs manual evaluation to figure out if a PR broke something or if it's valid and ready to be merged. We could start a dedicated TPU that tries to overfit on a single batch to avoid this effort. This way, we could easily have a sanity check that the model can do one forward pass, learn, and its gradients are correct.
This issue tracks the progress of implementing such an automated testing infrastructure.

Audio Modelling

There are multiple ways we could go about modelling audio. For example, we could tokenise sounds or audio snippets and autoregressively predict the next token. Whether the audio tokens come from a VQGAN or discrete Fourier transformation doesn't matter to the model but could change the performance of our generation a lot. This issue is about finding out how to model sound and develop an end-to-end pipeline to develop a prototype and see how it works.

Initialize deep model from shallow model

We already support the usage of pretrained input embeddings. However, output embeddings and layers still have to be retrained. One way to use smaller checkpoints when training larger ones (if comparing loss curves doesn't matter) would be to initialise the larger model from the weights of the smaller model by replicating them. As our models always have a fixed width for a given number of devices, loading the checkpoint of a shallower model would be as easy as converting input_embedding-layer1-layer2-output_embedding to input_embedding-layer1-layer2-layer1-layer2-output_embedding.
This issue aims to track the progress of such a scheme and achieve faster convergence by effectively skipping the loss of the first thousand steps.

Reduce Compile-Time

Currently, our models take a while to compile.
Compiling a model of 16 layers on a v3-8 takes almost 15 minutes:

Adding the GPT-2 tokenizer adds another 200s runtime:

Using 64 layers (with a char tokeniser) increases the compile time by over 17x to ~4h:

In the future, we'd need our models to compile within a few minutes to ensure that we spend most of our runtime in steps. Especially with hyperparameter sweeps where each run lasts up to 16 hours, a 4-hour compile time is prohibitively long.
This issue discusses possible approaches to reduce compile-time and implement and benchmark them.

Language-Model Evaluation

At the moment, we only have language-modelling loss to go by when experimenting with different architectures. Unfortunately, many methods, such as extra-gradient methods, different loss functions, different tokenisers or even different datasets, will change these loss values dramatically, making comparison almost impossible. We would gain certainty by integrating a dedicated evaluation pipeline such as EleutherAI's eval-harness that one model is better than the other and allow us to compare ourselves with existing models such as GPT-J and GPT-3.

Compact Loss

Our model uses a lot of parameters for the output layer. Specifically, 2 * vocab_size * devices * features, where features=256 and devices=256 for the planned 20B model, implying that it would use 4.2B + 4.2B parameters using the GPT-2 tokenizer purely for the embedding matrices.
For example, ALBERT used factorized embeddings, reducing the number of parameters from 256*256*vocab = 8.59B to 256*256*sqrt(vocab)*2 = 33.5M .

ALiBi Convolution

Currently, we're implicitly adding a locality bias to our model by using convolutions and giving more gradient signals (via MuParemtrization) to small convolutions. However, ALiBi demonstrated that adding a locality bias to attention can help it converge and extrapolate significantly. ALiBi is the only position embedding that works at all scales, so we should take a closer look at adding it to our codebase.
We already have something akin to ALiBi with our QRNN (with #7 hopefully allowing us to run it more frequently), but that might be not enough bias. We could add a bias inspired by ALiBi into our convolution weights to enforce locality further.
One approach could be as simple as adding a scaling factor to the convolution kernel to penalise long-context interactions by giving them less weight and gradient signal during training. To implement this, we'd just have to expand our existing "parameter_variance" scalar to a tensor of shape [Kernel, 1, 1], which could contain any bias such as the linear bias (1 + jnp.arange(kernel).reshape(-1, 1, 1)) / sum(range(kernel + 1)).

This issue tracks the progress of such an implementation and compares our new ALiBi Convolution with the current baseline model.

Explicit Memory

Many modern architectures, such as Memorizing Transformers, RETRO and PKM have an explicit memory where the model can retrieve information from and optionally even store it. Some hypothesise that Mixture-of-Experts embeds fuzzy representations of books and other things it must memorise into its weights.
That's why adding explicit memory to our models could give them a considerable boost in performance. Instead of storing this information in dense layers and having the weights fight about whether they should be storing concepts or memorising sequences, our model would be able to do both.
This issue is about implementing such an explicit memory (be it PKM, MoE or even a new architecture) and improving the convergence of our language model at the same runtime.

Web API

We only have a rudimentary inference interface (and CLI once #18 is finished). Unfortunately, everyone who wants to try our models needs to install the correct Jax version and drivers and have a GPU large enough to handle our models at reasonable speeds. To allow for faster experimentation of other researchers, we should provide them with a web API.
This issue tracks the progress of such an API.

Faster QRNN

Our long-context QRNN has n*log(n) complexity but a high constant iteration time. For every doubling of the context (~35ms), we could've also added another pointwise convolution block (~50ms). One way to reduce iteration time could be "unrolling". By running seven steps "in parallel" using matrix multiplication, we might reduce the time of a doubling to 5ms.
This issue is about coming up with an idea and benchmarking it against our current language model.

Optimizer Grafting

Currently, we're grafting the Shampoo update onto SGD, which doesn't work well with other NLP models and transformers. However, anecdotal evidence suggests that grafting onto RMSProp improves convergence significantly. Unfortunately, RMSProp requires much more memory. Grafting onto SM3 could be a memory-efficient alternative.
This issue is about exploring such grafting methods, benchmarking them and ideally improving upon the performance of the baseline.

Tokenizing Phonetics

Currently, all tokenisers work on a character level. This means that transferring them to a new language is often not possible. At the same time, this means that a model trained with such a tokeniser is specific for that particular language and won't be able to transfer from Spanish to Italian without significant effort. Additionally, written language is a quantised form of speech to reduce the space you need to store it. However, this conversion is very lossy, as it doesn't contain sarcasm or other vocal information.
We hope to reduce the first issue by using phonetic information while leaving the second untouched. The second could be solved by #9, although that uses less sparsity and therefore needs a bigger context to encode the same information.
This issue tracks the progress of implementing such a tokeniser built on phonetic information and the resulting language model trained with it.

Balance update weights of depthwise vs. pointwise convolution

Currently, we're balancing the update sizes by fan-in features. Unfortunately, our bottleneck convolution has a 5x lower learning rate and 4x fewer output features, meaning that the effective update size is 20x smaller. Similarly, our bottleneck block's dilated (#52) convolution has a 10x larger kernel size and 40x smaller updates. While we intended the 5x/10x difference (MuParametrization), the 4x happens because our current MuParametrization implementation accounts only for fan_in but not fan_out.
This issue tracks the progress of implementing a "fix" for this 4x reweighting and benchmarking it against the baseline.

Long-Context Experiments

Currently, our model can train with a context of 2 million tokens (at 1B parameters) on a v3-8. However, our demo uses 4096 tokens (characters at the time of writing), a significantly shorter context. Instead of using such an unimpressive context, we could scale up and demonstrate that we can few-shot learn from an entire book.
This issue tracks the progress of creating and deploying such a model.

todo list

Self-Convolution

A couple of months ago, we tested a global convolution "self_conv": https://github.com/HomebrewNLP/HomebrewNLP-Jax/blob/97b7e8def9a676b0e5d44a3fa4aaaa826c54fc10/src/model.py#L258-L270.
However, due to unusably slow speeds, this self-convolution was never usable. While the idea of every position generating weights for every other position (like in self-attention) is appealing, it also doesn't scale well with increased sequence length and causes significant computation overheads. Compared to our usual convolution with a kernel size of 5, this convolution can be up to 50,000x slower (with a sequence length of 256ki).
This issue exists to discuss and gather ideas and potentially re-benchmark a faster variant of Self-Convolution.

Non-Autoregressive Generation

Recently there have been papers about non-autoregressive text generation, in which models generate many tokens simultaneously instead of only one. Not only does this mean faster decoding times, but it also means that all hidden states can always attend to one another and know of their existence. Using non-autoregressive text generation, a model could first come up with concepts it wants to talk about in the future and generate text that leads to the future event. With autoregressive language modelling, this isn't possible to the same extent.
This issue involves implementing such a language model and benchmarking against current, autoregressive language models.

Release pretrained weights

All of our models are currently stored on a dedicated bucket on google cloud. Moving them elsewhere would allow others to experiment with them and even fine-tune them for their purposes. To share these checkpoints, we need a cost-effective storage solution with cheap egress and the option to pump out at least one model per week.
This issue tracks the progress of the search for a good host.

Automated Long-Running Experiments

At the moment, I execute all experiments manually. This process means that every config change requires a manual effort to SSH into a machine, change the checkpoint path, change the hyperparameters, etc. Instead, a fail-safe automated system could allow us to run these things without manual intervention, without it ever making a typo or forgetting to change a variable. Such an automated system would free up time to do other things, such as research or engineering.
This issue tracks the progress of implementing such a CI pipeline.

Scaling

Most transformers increase drastically in performance when they are scaled up. ViT-G showed this for vision transformers and Chinchilla for language models. However, as we're not using a transformer, it's uncertain whether we'll see similar improvements.
This issue is about "scaling up" and tracks the progress of large-scale models. Once it's finished, we should be able to run our models on v3-32s, v3-256s and bigger. Using those 32x bigger TPUs, we aim for at least 16x faster steps (in tokens * parameters/second).

Alternative Losses

Currently, we're using only the softmax classification/cross-entropy loss to create a language-modeling loss for next-token prediction. However, other works such as T-Few showed that adding alternative losses for external benefits such as length explicit penalties during training can help downstream-task performance. Additionally, other works like DCL and InfoLOOB demonstrated that changing the fundamental structure of the loss from softmax classification to something different can help speed up convergence. That's why a similar approach could be beneficial for us.
In this issue, we'll explore whether InfoLOOB's classification loss for the language-modeling objective helps or if we should change the entire objective.

Multiple forward per backward

Currently, our model does one forward pass and uses the intermediate states to do one backward pass. However, a backward pass is over 3x as expensive as a forward pass, so we could change the ratio of forward to backward passes to speed up the model.
One such approach would be MESA, which adds KL(model(x), ema_model(x)). Another method is RHO-Loss, which prioritizes some samples over others, by running (model(x) - oracle(x)).topk(). Both of these methods claim to improve sample efficiency by up to 18x.

Encoder-Decoder Architecture

Currently, our model can be either an encoder or a decoder. Combining these two, as in T5, is not possible. The best approximation we could get at the moment would be to expand the context of our decoder, but the performance of a decoder-only model isn't as good. Ideally, we could run full "attention" for one part and sample autoregressive for the other.
This issue discusses ideas for implementing such a scheme and benchmarking it against the baseline fully-autoregressive model.

Gradient Noise

Some have suggested that adding gradient noise helps deep models converge and generalise. Other works, such as DDPG, showed that this is the case even for shallow networks of a different domain. That's why it could be interesting for us to explore gradient noise as an option to improve generalisation and with that convergence by avoiding overfitting and other local minima during training.
One option to further improve gradient noise would be to combine it with #35, by adding different noise to each optimiser. This change would allow us to create combinations like Adam#Adam, where each optimiser sees slightly different noise at each step.
This issue tracks the progress of such a scheme.

Shampoo Optimizer

Second-order optimizers such as K-Fac, LBFGS and AdaHessian promise significantly improved convergence rates at horrific memory costs. Scalable Shampoo promises a low memory footprint and vectorisable computation while retaining the convergence advantage of other second-order optimisers. Adding it to our code could reduce training time by 10% or even up to an order of magnitude.
This issue is about implementing shampoo (reference might help), running a hyperparameter sweep to find its best configuration and comparing the best possible runtime with our previous best.

MuP Normalization

Currently, we apply MuP as a per-layer learning rate scale, which leads to faster and more memory-efficient training than their recommended method of initializing to a larger value and multiplying the outputs of every layer with a constant scalar.
However, we can further improve the speed of our model by fusing MuParametrization's scales with those of normalization. This way, instead of training with mean=0+std=1, we'd train with mean=0+l2norm=1. This way, our new tensors would have a standard deviation of 1/sqrt(numel), which pushes them ever close to 0 - a region where floats become increasingly accurate. Computing this is very cheap as well, as we simply have to compute sqrt(sum(x^2)) instead of sqrt(mean(x^2)), or, in other words, remove a scalar multiplication.

Long-Range-Arena Evaluation

Currently, we only know that our model is better than the baseline because of its lower loss at less training time. However, we could run some benchmarks such as LRA to see how well our long-context model performs in a real-world scenario. While LRA doesn't leverage our capabilities ideally (unlike, for example, #5 and #9), it'd still allow us to have preliminary evaluation results on a well-known benchmark dataset.
This issue tacks the progress of integrating our model into LRA, even though it should happen in a separate codebase.